Neural Computation, Volume 13, Number 1
CONTENTS

Review

Detecting and Estimating Signals over Noisy and Unreliable Synapses: Information-Theoretic Analysis
Amit Manwani and Christof Koch  1

Letters

An Algorithm for Modifying Neurotransmitter Release Probability Based on Pre- and Postsynaptic Spike Timing
Walter Senn, Henry Markram, and Misha Tsodyks  35

Differential Filtering of Two Presynaptic Depression Mechanisms
Richard Bertram  69

A Statistical Theory of Long-Term Potentiation and Depression
John M. Beggs  87

Minimal Model for Intracellular Calcium Oscillations and Electrical Bursting in Melanotrope Cells of Xenopus Laevis
L. Niels Cornelisse, Wim J. J. M. Scheenen, Werner J. H. Koopman, Eric W. Roubos, and Stan C. A. M. Gielen  113

Neural Field Model of Receptive Field Restructuring in Primary Visual Cortex
Katrin Suder, Florentin Wörgötter, and Thomas Wennekers  139

On the Phase-Space Dynamics of Systems of Spiking Neurons. I: Model and Experiments
Arunava Banerjee  161

On the Phase-Space Dynamics of Systems of Spiking Neurons. II: Formal Analysis
Arunava Banerjee  195

Subtractive and Divisive Inhibition: Effect of Voltage-Dependent Inhibitory Conductances and Noise
Brent Doiron, André Longtin, Neil Berman, and Leonard Maler  227
REVIEW
Communicated by Anthony Zador
Detecting and Estimating Signals over Noisy and Unreliable Synapses: Information-Theoretic Analysis

Amit Manwani
Christof Koch
Computation and Neural Systems, California Institute of Technology, Pasadena, CA 91125, U.S.A.

The temporal precision with which neurons respond to synaptic inputs has a direct bearing on the nature of the neural code. A characterization of the neuronal noise sources associated with different sub-cellular components (synapse, dendrite, soma, axon, and so on) is needed to understand the relationship between noise and information transfer. Here we study the effect of the unreliable, probabilistic nature of synaptic transmission on information transfer in the absence of interaction among presynaptic inputs. We derive theoretical lower bounds on the capacity of a simple model of a cortical synapse under two different paradigms. In signal estimation, the signal is assumed to be encoded in the mean firing rate of the presynaptic neuron, and the objective is to estimate the continuous input signal from the postsynaptic voltage. In signal detection, the input is binary, and the presence or absence of a presynaptic action potential is to be detected from the postsynaptic voltage. The efficacy of information transfer in synaptic transmission is characterized by deriving optimal strategies under these two paradigms. On the basis of parameter values derived from neocortex, we find that single cortical synapses cannot transmit information reliably, but redundancy obtained using a small number of multiple synapses leads to a significant improvement in the information capacity of synaptic transmission.

1 Introduction
Understanding the strategies that brains use to represent and process information received through the senses and to make crucial behavioral decisions has been a long-standing goal in neuroscience. Researchers continue to remain divided on the nature of the neural code, though much recent effort in theoretical neuroscience has focused on deciphering it. A common notion in neuroscience is that spike trains are stochastic and that the average firing rate of a neuron constitutes the primary variable relating neuronal response to sensory experience (Adrian, 1932; Barlow, 1972). This view is supported by the demonstration of a quantitative relationship between the average firing rate of single cortical neurons and the performance of the
Neural Computation 13, 1–33 (2000)
© 2000 Massachusetts Institute of Technology
2
Amit Manwani and Christof Koch
animal in behavioral tasks (Werner & Mountcastle, 1963; Newsome, Britten, & Movshon, 1989; Zohary, Hillman, & Hochstein, 1990; Britten, Shadlen, Newsome, & Movshon, 1992). On the other hand, recent findings indicate that spike timing can be highly precise, and as a result, neurons can potentially employ the fine temporal structure of interspike intervals to convey more information than a mean rate code (Rieke, Warland, van Steveninck, & Bialek, 1997). The relative timing of spikes across a population of neurons also appears to be relevant for coding in certain cases (Perkel & Bullock, 1968; Abeles, 1990; Bialek, Rieke, van Steveninck, & Warland, 1991; Bialek & Rieke, 1992; Richmond & Optican, 1992; Rieke et al., 1997). It is unclear which of these, if any single one, is the universal coding strategy that brains use. However, a quantitative understanding of neuronal noise sources and their influence on the reliability and precision of neuronal signaling will determine the efficacy of neurons in transmitting information and the constraints under which neuronal codes must operate. Tools from statistical estimation theory (Poor, 1994) have recently been used to quantify the ability of neurons to transmit information through their spike outputs (Bialek et al., 1991; Bialek & Rieke, 1992; Gabbiani, 1996; Rieke et al., 1997). These techniques have been successfully applied to study the coding of random stimuli in peripheral sensory neurons in various neural systems. Most of these approaches treat the neuron as a black box, ignoring the details of neural information processing, and estimate neuronal information capacity by statistical analyses of the associated empirical input-output records. However, it is well known that neurons receive inputs at their synapses; synaptic inputs are integrated in the dendrites and delivered to the spike initiation zone, which generates the output spike train and transmits it via the axon to the neuron's postsynaptic targets.
The specific nature of information processing in the nervous system is determined by the details of this neural hardware. Much knowledge has accumulated about the manner in which signals are processed and transformed at various stages in single neurons through extensive experimental and theoretical efforts (Koch, 1999). However, the relationship between single-neuron biophysics and neural information processing has not been unraveled in a systematic and principled manner. We believe that a detailed understanding of the different neuronal components can provide deeper insight into the role of neurons as communication devices. Thus, a reductionist approach of applying experimental and theoretical tools to individual neuronal components will reveal aspects of the neural code that cannot be uncovered using conventional black box approaches. Since noise has a direct bearing on the nature of coding and limits the capacity of information processing systems, an essential step in determining the functional significance of the different processing stages in neuronal information transfer (synapse, dendritic tree, axon, and so on) involves the characterization of the intrinsic biophysical noise they contribute. We have
Detecting and Estimating Signals
3
used this approach to study the effect of different membrane noise sources on information processing in dendritic cable models of cortical neurons in the presence of low densities of voltage-gated ion channels (Manwani & Koch, 1999a, 1999b, 1999c; Steinmetz, Manwani, London, Segev, & Koch, 2000). Here we focus on a critical component of neural processing, the synapse. Using the known biophysical details about cortical synapses, we derive a simple mathematical model of a cortical synapse. Our model ignores the ubiquitous use-dependent plasticity of cortical synapses and thus represents a simplified picture of synaptic transmission. It can be seen as a first step in modeling and quantifying the effects of synapses on information processing. We compute the information-theoretic capacity of this synaptic model under two explicit coding paradigms: signal estimation and signal detection. In the signal estimation paradigm, the input is a random continuous signal encoded in the mean firing rate of a presynaptic neuron. Using tools from statistical estimation theory, we derive the optimal filter (in the sense of least mean-square error), which reconstructs the input from the postsynaptic voltage. In the signal detection paradigm, the input signal is encoded in an all-or-none format (the presence or absence of a presynaptic spike). We derive the optimal detector (in the sense of minimum probability of error) of presynaptic action potentials from the postsynaptic voltage. The signal detection method is similar to the yes-no decision paradigm used in psychophysics. We use the performance of the optimal estimator and optimal detector to characterize the synaptic efficacy for these two tasks and to derive bounds on the information capacity of cortical synapses.
The strength of our approach lies in the combination of techniques from different scientific fields—single-neuron biophysics (Hille, 1992; Koch, 1999), membrane noise analysis (DeFelice, 1981; Traynelis & Jaramillo, 1998; White, Rubinstein, & Kay, 2000), cable theory, estimation theory (Poor, 1994), and information theory (Cover & Thomas, 1991)—and in their application to the question of neural coding from a novel perspective. Ultimately our goal is to answer questions like: Is the length of the apical dendrite of a neocortical pyramidal cell limited by considerations of signal-to-noise? What influences the noise level in the dendritic tree of a real neuron endowed with voltage-dependent channels? How accurately can the time course of a synaptic signal be reconstructed from the voltage at the spike initiation zone? And what is the channel capacity of an unreliable synapse onto a spine? A brief account of this work has appeared previously (Manwani & Koch, 1998). 2 The Stochastic Nature of Synaptic Transmission Our understanding of synaptic transmission is based on the seminal work of Katz and his collaborators (Katz, 1969). Their central findings were that the release of neurotransmitter occurs in quanta through vesicles residing in the presynaptic terminal. Each vesicle is released independently of the others in
a probabilistic manner, and the postsynaptic responses to vesicular release add linearly. Their theory was developed using experiments at the frog neuromuscular junction, but the stochastic and quantal nature of synaptic transmission has been found to be generally valid for central synapses as well. The probability of vesicle release at a single active zone of the frog neuromuscular junction was found to be quite low. However, this junction contains a large number of release sites (on the order of 1000), and as a result, the neuromuscular synapse is highly reliable. Thus, a presynaptic action potential invariably gives rise to a muscular response. While central synapses share many principles in common with neuromuscular junctions, there are crucial differences between them as well (Redman, 1990; Stevens, 1993). Synaptic boutons in central synapses contain only one or a few active release zones as opposed to the 1,000 or more found at the neuromuscular junction. Thus, in response to an action potential in the presynaptic terminal, at most one vesicle is released (Korn & Faber, 1991, 1993; Stevens & Wang, 1995). Moreover, in vitro studies in vertebrate and invertebrate systems have revealed that the probability of vesicle release (referred to as p) is generally low (Korn, Faber, & Triller, 1986; Laurent & Sivaramakrishnan, 1992; Bekkers & Stevens, 1994, 1995; Stevens & Wang, 1994). The probabilistic release of vesicles is believed to be the dominant factor responsible for the unreliability of synaptic transmission in cortical and hippocampal neurons (Allen & Stevens, 1994). The origin of the random nature of release is not yet fully understood, though it is known that the precise three-dimensional spatial relationship between calcium channels in the presynaptic membrane and the location of the vesicles determines release dynamics. 
The probability of synaptic release is not constant; it depends on whether the last presynaptic action potential led to a vesicle release and on the exact timing of the previous presynaptic spikes. Depending on the sequence of presynaptic action potentials, p can either decrease or increase over a variety of timescales, from tens of milliseconds to seconds, minutes, and longer. This history-dependent plasticity in the state of synaptic efficacy is believed to be one of the substrates of learning and memory in biological neural networks (Abbott, Varela, Sen, & Nelson, 1997). p can also vary across synapses (Dobrunz & Stevens, 1997; Murthy, Sejnowski, & Stevens, 1997) and across synaptic types (Stratford, Tarczy-Hornoch, Martin, Bannister, & Jack, 1996) impinging onto a single neuron. The unreliability of synaptic transmission in cortical neurons is compounded by the variability in the amplitude of the postsynaptic response following the successful release of a single vesicle (Bekkers, Richerson, & Stevens, 1990; Edwards, Konnerth, & Sakmann, 1990; Mason, Nicoll, & Stratford, 1991). This response variability occurs due to factors like variation in vesicle size and therefore in the number of neurotransmitter molecules, the number of available postsynaptic receptors, and so on. The trial-to-trial variability in postsynaptic amplitude can be quite large; in some cases, the
variance in the size of the excitatory postsynaptic potential (EPSP) is as large as the mean (Bekkers et al., 1990; Larkman, Stratford, & Jack, 1991; Mason et al., 1991). A note of caution is in order here: a majority of studies of central synapses have been carried out in culture or slice preparations that lack many of the neuromodulators present in vivo. This might significantly affect synaptic release properties. Only after careful measurement of synaptic properties in in vivo experiments will the true picture of synaptic transmission in real brains emerge. In summary, central synapses appear to be highly unreliable, binary connections. This unreliability is several orders of magnitude higher than that observed in engineering systems (Koch, 1999). Theoretically, the nervous system can combat the unreliability of its components by making use of anatomical redundancy with multiple synapses (Moore & Shannon, 1956; von Neumann, 1956). However, cortical axons typically make only a few synaptic contacts onto other cortical target neurons (Sorra & Harris, 1993). Thus, the reliability of synaptic transmission must have a profound influence on the manner in which signals are encoded and transmitted (Abbott et al., 1997; Lisman, 1997; Tsodyks & Markram, 1997; Zador, 1998) and the ways in which computation is performed in the brain (Stevens, 1994). An important debate that rages on is whether synaptic unreliability is a "bug" or a "feature" (Koch, 1999). In other words, do biophysical constraints limit the potential reliability of cortical synapses (for instance, the need to pack on the order of 1 billion synapses into 1 cubic millimeter of cortex might dictate the small size of synapses), or are there computational advantages to this unreliability?
If reliable synapses as small and compact as hippocampal synapses are found experimentally, it would certainly give credence to the hypothesis that synaptic unreliability is a design feature used in the brain. One possible computational advantage of unreliable synapses concerns plasticity. It has been argued that the release probability can be modified more easily over a larger dynamic range than either the number of release sites or the size of the postsynaptic response (Smetters & Zador, 1996; Zador & Dobrunz, 1997). Thus, synaptic unreliability could lead to an increase in the bandwidth of modulation of the postsynaptic response. It has also been suggested (Burnod & Korn, 1989) that the variability at central synapses plays the same role as the "temperature" parameter T in stochastic models of neuronal networks (Ackley, Hinton, & Sejnowski, 1985). Just as the presence of noise in large artificial neural networks prevents them from getting stuck in local minima, synaptic unreliability and variability might help brains to learn and generalize better. 3 Channel Model of Synaptic Transmission Armed with the above knowledge about the physiology of central synapses, we abstract a synapse by a mathematical model that accounts for probabilis-
[Figure 1 block diagram: stimulus m(t) → Poisson encoding k(t) → spike train s(t) → stochastic vesicle release (binary channel inset: spike → release with probability p, no release with probability 1 − p; no spike → no release) → variable quantal amplitude q with density P(q) → EPSP shape h(t) → summation with membrane noise n(t) → postsynaptic voltage V(t). Decoding: the optimal estimator (Wiener filter) yields the estimate \hat{m}(t); the optimal detector (matched filter) yields the spike/no-spike decision Y.]
Figure 1: Schematic block diagram of a synapse under the signal estimation and detection paradigms. A cortical synapse is modeled as a binary channel followed by a filter h(t). The amplitude of the filter is a random variable q with probability density P(q) = \alpha (\alpha q)^{J-1} \exp(-\alpha q) / (J-1)!. The binary channel (inset: p = Prob[vesicle release], 1 − p = Prob[failure]) models probabilistic vesicle release, and the random variable q models the variability in the size of the postsynaptic response to a single quantum observed in central synapses. The EPSP is approximated by an alpha function h(t) = h_{peak} (t/t_{peak}) \exp(1 - t/t_{peak}). n(t) denotes additive postsynaptic voltage noise, which is assumed to be gaussian and white (over a bandwidth B_n). In signal estimation, the objective is to estimate optimally a continuous input m(t) from the postsynaptic membrane voltage V(t); this estimate is denoted by \hat{m}(t). The presynaptic spike train s(t) is assumed to be a Poisson process with firing rate modulated by the random input m(t). Performance of the optimal linear estimator (Wiener filter) is used to quantify performance in the estimation task. In signal detection, the objective is to detect optimally the presence of a single presynaptic action potential on the basis of V(t). Thus, the input X and the decision Y are binary variables. Performance of the optimal detector (matched filter) quantifies performance in signal detection.
tic vesicle release and the variability in response amplitude. Using the model as a starting point, we will derive its information-theoretic channel equivalents under the signal estimation and signal detection tasks. The capacities of these channels will be used to quantify the efficacy of synaptic transmission under the two paradigms. (Appendix A lists the symbols used.) We will assume that the synaptic bouton contains only one release site. Thus, the vesicle release process can be modeled as a binary channel (see the inset of Figure 1). The input to the channel is a binary variable that represents the presence or the absence of a presynaptic action potential, and its output is a binary variable representing the success or failure of release. The spontaneous release associated with cortical synapses is quite
low, and so we shall assume that it is zero here (Zador, 1998). Our analysis, however, does not depend on this assumption. Let p denote the probability of release due to a presynaptic spike. The unreliability of vesicle release is the dominant source of noise in the release process, as p can be as low as 0.1 for some hippocampal synapses (Hessler, Shirke, & Malinow, 1993). We model the postsynaptic response to the release of a single vesicle by a function h(t), which corresponds to the EPSP waveform of a fast, voltage-independent AMPA-like synapse modeled as an alpha function (Rall, 1967),

h(t) = h_{peak} \frac{t}{t_{peak}} \exp\left(1 - \frac{t}{t_{peak}}\right),

where h_{peak} is the peak EPSP magnitude and t_{peak} is the corresponding time-to-peak. We assume that the postsynaptic responses to a sequence of vesicle releases add linearly. We incorporate synaptic variability by multiplying the response h(t) by a random variable q drawn from a probability distribution P(q), which can be measured empirically. Thus, q models the trial-to-trial variability in the amplitude of the postsynaptic responses observed for central neurons. Experimentally observed amplitude distributions are generally skewed toward higher amplitudes, though the experimental difficulty of measuring very small synaptic events probably also contributes to these findings (Bekkers & Stevens, 1996). Here we model P(q) by a gamma distribution,

P(q) = \alpha \frac{(\alpha q)^{J-1}}{(J-1)!} \exp(-\alpha q),

where J is the order of the distribution. However, it can be replaced by any one-sided probability density (see note 1) for the purpose of subsequent analysis. \alpha and J together determine the spread of the distribution. \bar{q} denotes the mean and \sigma_q the standard deviation of the quantal amplitude q:

\bar{q} = \frac{J}{\alpha}, \qquad \sigma_q = \frac{\sqrt{J}}{\alpha}.

The coefficient of variation (denoted CV_q) is a measure of the amplitude variability and is given by

CV_q = \frac{\sigma_q}{\bar{q}} = \frac{1}{\sqrt{J}}.

Thus, J can be used to modify the variability of q. J = 1 corresponds to an exponential distribution with the highest variability (CV_q = 1); at the other extreme, J \to \infty corresponds to a delta function, for which there is no variability in q (CV_q = 0).

1. Here we assume that q is a nonnegative random variable; inhibitory synapses can be analyzed identically by reversing the sign of h(t).
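The gamma model of quantal variability is easy to check numerically. The sketch below samples P(q) and verifies that the mean, standard deviation, and CV_q = 1/sqrt(J) come out as derived above; the values of J and alpha are illustrative choices, not parameters from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Quantal amplitude q ~ gamma of order J and rate alpha (the P(q) above).
# J and alpha are illustrative, giving a mean quantal amplitude of 0.5.
J, alpha = 4, 8.0
q = rng.gamma(shape=J, scale=1.0 / alpha, size=200_000)

print(abs(q.mean() - J / alpha) < 0.01)                  # mean q = J/alpha
print(abs(q.std() - np.sqrt(J) / alpha) < 0.01)          # sigma_q = sqrt(J)/alpha
print(abs(q.std() / q.mean() - 1 / np.sqrt(J)) < 0.02)   # CV_q = 1/sqrt(J)
```

All three checks print `True`; increasing J tightens the distribution toward a delta function, as the text describes.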
In addition to these two sources of noise, we also assume that the postsynaptic membrane voltage is corrupted by additive gaussian noise n(t), which models the effect of other membrane noise sources like thermal noise, channel noise, background synaptic noise from other synapses, and so on (DeFelice, 1981). We denote the power spectral density of n(t) by S_{nn}(f). We assume that n(t) is band-limited (over a bandwidth B_n),

S_{nn}(f) = \begin{cases} \sigma_n^2 / (2 B_n), & -B_n \le f \le B_n \\ 0, & \text{otherwise}, \end{cases}    (3.1)

where \sigma_n^2 is the variance of n(t). We define a signal-to-noise ratio SNR for the synapse as

SNR = \frac{1}{S_{nn}(f)} \int_0^\infty dt\, h^2(t) = \frac{2 B_n}{\sigma_n^2} \int_0^\infty dt\, h^2(t).    (3.2)
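For the alpha-function EPSP, the energy integral in equation 3.2 has the closed form \int_0^\infty h^2(t)\,dt = h_{peak}^2 e^2 t_{peak}/4 (obtained by the substitution u = t/t_{peak}). The sketch below checks this numerically and evaluates the SNR; all parameter values are illustrative, not the paper's:

```python
import numpy as np

h_peak, t_peak = 0.1e-3, 1.5e-3        # 0.1 mV peak, 1.5 ms time-to-peak (assumed)
sigma_n, B_n = 0.02e-3, 1000.0         # noise std and bandwidth B_n (assumed)

t = np.linspace(0.0, 50 * t_peak, 200_001)
dt = t[1] - t[0]
h = h_peak * (t / t_peak) * np.exp(1.0 - t / t_peak)   # alpha-function EPSP

energy = np.sum(h ** 2) * dt                    # integral of h^2(t) dt
analytic = h_peak ** 2 * np.e ** 2 * t_peak / 4.0
snr = 2.0 * B_n / sigma_n ** 2 * energy         # equation 3.2

print(abs(energy / analytic - 1.0) < 1e-3)      # True: matches the closed form
```

With these illustrative numbers the synapse sits at an SNR of order 10^2, which is the regime where the detection analysis later in the paper becomes interesting.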
We have mathematically modeled a cortical synapse as a binary channel in cascade with a random-amplitude filter and an additive gaussian noise source (see the schematic diagram in Figure 1). We ignore history-dependent effects (paired-pulse facilitation, vesicle depletion, calcium buffering, and so on) on synaptic transmission that endow central synapses with characteristics of sophisticated nonlinear filters (Markram & Tsodyks, 1996; Abbott et al., 1997). This implies that the synaptic parameters in our model will be regarded as constants. We derive theoretical lower bounds on the capacity of this simple model of synaptic transmission under the two representational paradigms: signal detection and signal estimation. In signal estimation, the signal is assumed to be encoded in the mean firing rate of the presynaptic neuron, and the objective is to estimate the continuous input signal from the postsynaptic voltage. In signal detection, the input is binary, and the presence or absence of a presynaptic action potential is to be detected from the postsynaptic voltage. The efficacy of information transfer in synaptic transmission is characterized by deriving optimal strategies under these two paradigms.

4 Signal Estimation

Let m(t) be a zero-mean random stimulus presented to a presynaptic neuron and s(t) the resulting spike train. s(t) is modeled as a point process, a sequence of delta functions,

s(t) = \sum_i \delta(t - t_i),

where t_i denotes the time at which the ith spike occurs. We assume that m(t) and s(t) are (real-valued) jointly weak-sense stationary (WSS) processes with
finite variances, \langle m^2(t) \rangle = \sigma_m^2 < \infty and \langle |s(t) - \lambda|^2 \rangle < \infty, where \lambda = \langle s(t) \rangle is the mean firing rate of the presynaptic neuron. In these equations, \langle \cdot \rangle denotes an ensemble average over the joint stimulus and spike train distribution. The membrane voltage V(t) at the postsynaptic site due to the transmission of the spike train s(t) over the synapse is

V(t) = \sum_i q_i W_i\, h(t - t_i) + n(t),    (4.1)

where q_i is the EPSP amplitude in response to the ith spike and W_i is a binary variable representing vesicle release. Thus, W_i = 0 implies that the ith presynaptic action potential did not lead to a vesicle release, whereas W_i = 1 denotes a successful release. The mean voltage \langle V(t) \rangle does not contain any information about the modulations of the input m(t) and can be ignored. Let v(t) refer to the zero-mean process obtained by subtracting the mean voltage from V(t),

v(t) = V(t) - \langle V(t) \rangle.    (4.2)

The objective in signal estimation is to find the optimal estimator of m(t) from the postsynaptic voltage v(t), where optimality is in the sense of least mean-square error (MSE). In general, the optimal MSE estimator is nonlinear and complicated to treat analytically. Instead, we restrict ourselves to the analysis of the optimal linear estimator,

\hat{m}(t) = g(t) \star v(t),

where \star denotes the convolution of two functions. g(t) is the optimal linear filter that minimizes the MSE between the stimulus m(t) and the estimate \hat{m}(t),

E = \langle |m(t) - \hat{m}(t)|^2 \rangle.    (4.3)

Using the orthogonality principle (Papoulis, 1991), g(t) can be obtained by solving the equation

R_{vm}(\tau) = (g \star R_{vv})(\tau),    (4.4)

where R_{vv}(\tau) = \langle v(t) v(t+\tau) \rangle is the autocorrelation of the postsynaptic voltage and R_{vm}(\tau) = \langle v(t) m(t+\tau) \rangle is the cross-correlation between the stimulus and the membrane voltage. Taking the Fourier transform of both sides of equation 4.4,

G(f) = \frac{S_{vm}(f)}{S_{vv}(f)},    (4.5)
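Equation 4.5 can be illustrated with a discrete-time toy model: a white stimulus passed through a known causal kernel with added white noise, for which the Wiener filter takes the form G = H* S_mm / (|H|^2 S_mm + S_nn). This is a sketch with arbitrary parameters, not a simulation of the synapse model itself:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 2 ** 14
m = rng.standard_normal(n)                 # white stimulus, Smm(f) = 1
h = np.zeros(n)
h[:64] = np.exp(-np.arange(64) / 8.0)      # causal EPSP-like kernel (arbitrary)
H = np.fft.rfft(h)

sigma_n = 0.5                              # additive noise std, Snn(f) = 0.25
v = np.fft.irfft(np.fft.rfft(m) * H, n=n) + sigma_n * rng.standard_normal(n)

# Wiener filter G(f) = Svm/Svv = H* Smm / (|H|^2 Smm + Snn) for white spectra
G = np.conj(H) / (np.abs(H) ** 2 + sigma_n ** 2)
m_hat = np.fft.irfft(G * np.fft.rfft(v), n=n)

mse = np.mean((m - m_hat) ** 2)
print(mse < np.var(m))                     # True: beats chance-level error
```

The reconstruction error lands well below the stimulus variance (chance level), and per frequency it follows the familiar water-filling-like form S_nn/(|H|^2 + S_nn) used implicitly in equation 4.6 below.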
[Figure 2 diagrams. (A) Estimation channel: input m(t), additive reconstruction noise \hat{n}(t), output \hat{m}(t). (B) Detection channel: X = 1 (spike) → Y = 1 with probability 1 − P_M and Y = 0 with probability P_M; X = 0 (no spike) → Y = 0 with probability 1 − P_F and Y = 1 with probability P_F.]
Figure 2: Channel models of synaptic performance under signal estimation and detection. (A) Effective channel model of the synapse for the signal estimation paradigm. The random stimulus m(t) is the continuous input to the channel, and \hat{m}(t) is the best linear estimate of the input based on the postsynaptic voltage (see Figure 1). The effective reconstruction noise \hat{n}(t) is the difference between the input and the estimate, \hat{n}(t) = m(t) - \hat{m}(t). (B) Effective channel model of the synapse for the signal detection paradigm. X and Y are binary random variables corresponding to the input and the decision, respectively. The false alarm and miss probabilities (P_F and P_M, respectively), which minimize the detection error P_e, are the crossover error rates of the binary detection channel.
where S_{vv}(f) = \mathcal{F}\{R_{vv}(\tau)\} is the power spectral density of v(t) and S_{vm}(f) = \mathcal{F}\{R_{vm}(\tau)\} is the cross-spectral density of m(t) and v(t). \mathcal{F}\{\cdot\} denotes the Fourier transform, which for a square-integrable function g is defined as

G(f) = \mathcal{F}\{g(\tau)\} = \int_{-\infty}^{\infty} d\tau\, g(\tau)\, e^{-j 2\pi f \tau},
g(\tau) = \mathcal{F}^{-1}\{G(f)\} = \int_{-\infty}^{\infty} df\, G(f)\, e^{j 2\pi f \tau}.

We define a "reconstruction" noise for the signal estimation task as the difference between the optimal estimate and the stimulus, \hat{n}(t) = \hat{m}(t) - m(t). It is clear that E = \langle \hat{n}^2(t) \rangle. Thus, performance in the estimation paradigm is quantified by E; the lower the value of E, the better the synapse is at transmitting the signal. The overall estimation paradigm (see Figure 1) can be abstracted by the effective continuous information channel shown in Figure 2A, where m(t) is the input, \hat{m}(t) the output, and \hat{n}(t) the additive noise. The noise has zero mean, and its autocorrelation is given by R_{\hat{n}\hat{n}}(\tau) = R_{mm}(\tau) - (g \star R_{vm})(-\tau) (using equation 4.4). The power spectrum of the noise is given by

S_{\hat{n}\hat{n}}(f) = \mathcal{F}\{R_{\hat{n}\hat{n}}(\tau)\} = S_{mm}(f) - \frac{|S_{vm}(f)|^2}{S_{vv}(f)}.    (4.6)
Thus the MSE can be expressed as

E = \int_S df\, S_{\hat{n}\hat{n}}(f) = \sigma_m^2 - \int_S df\, \frac{|S_{vm}(f)|^2}{S_{vv}(f)},    (4.7)

where the set S = \{ f : S_{vv}(f) \neq 0 \} is called the support of S_{vv}(f). As in Gabbiani and Koch (1998), we define a normalized measure called the coding fraction,

\xi = 1 - \frac{E}{\sigma_m^2}.

The coding fraction lies between 0 and 1 (0 \le \xi \le 1), where \xi = 0 denotes chance performance and \xi = 1 denotes perfect reconstruction. Using equation 4.7, we can write

\xi = \frac{1}{\sigma_m^2} \int_S df\, \frac{|S_{vm}(f)|^2}{S_{vv}(f)}.    (4.8)
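As a sanity check of equation 4.8: for flat, band-limited spectra and the trivial unity-gain channel v = m + n (an assumption made purely for illustration), the integrand reduces to S_mm/(S_mm + S_nn) at every frequency, so the coding fraction is just that ratio:

```python
import numpy as np

B = 50.0                                  # signal bandwidth, Hz (illustrative)
f = np.linspace(-B, B, 10_001)
df = f[1] - f[0]
Smm = np.full_like(f, 1.0 / (2 * B))      # white stimulus with unit variance
Snn = np.full_like(f, 0.5 / (2 * B))      # additive white noise (assumed level)
Svv = Smm + Snn                           # voltage spectrum for v = m + n
Svm = Smm                                 # cross-spectrum for v = m + n

sigma_m2 = np.sum(Smm) * df               # ~ 1
xi = np.sum(np.abs(Svm) ** 2 / Svv) * df / sigma_m2
print(round(xi, 3))                       # 0.667 = Smm/(Smm + Snn)
```

Halving the noise density pushes xi toward 1; the synaptic channel analyzed below simply replaces S_nn with the effective noise spectrum of equation 4.16.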
Another measure of system performance is the mutual information rate I(m; v) between m(t) and v(t), defined as the amount of information transmitted by v(t) about m(t) in bits per second. By the data processing inequality (Cover & Thomas, 1991), the mutual information between m(t) and v(t) is greater than the mutual information between m(t) and \hat{m}(t):

I(m; v) \ge I(m; \hat{m}).

A lower bound on I(m; \hat{m}), and thus on I(m; v), has been derived in Gabbiani (1996) and Gabbiani and Koch (1996):

I_{LB} = \frac{1}{2} \int_S df\, \log_2 \frac{S_{mm}(f)}{S_{\hat{n}\hat{n}}(f)} \quad \text{in bits s}^{-1}.    (4.9)

The lower bound is reached when the reconstruction noise \hat{n}(t) is gaussian. Equation 4.9 differs from the expression for the capacity of an additive gaussian channel, commonly used in information theory (Cover & Thomas, 1991), because, unlike in the case of the standard gaussian channel, the reconstruction noise \hat{n}(t) (see Figure 2) is not independent of the input m(t). We further assume that the spike train s(t) of the presynaptic neuron is a Poisson process with a mean firing rate \lambda(t) that is a function of the stimulus m(t),

\lambda(t) = \langle s(t) \rangle_{s|m} = f(k(t) \star m(t)),    (4.10)

where \langle \cdot \rangle_{s|m} denotes an ensemble average over the spike train distribution for a fixed value of the stimulus m(t). Since the stimulus itself is a stochastic
(gaussian) process, the spike train s(t) is called a doubly stochastic Poisson process. Although the Poisson assumption is not strictly valid for neural spike trains in general, it is fairly common in theoretical neuroscience because it allows the derivation of closed-form analytical expressions (Gabbiani, 1996; Gabbiani & Koch, 1996). k(t) is a phenomenological filter that models the transformation of the stimulus modulations m(t) into the neuron's firing rate, and f(\cdot) is a static, memoryless nonlinearity that incorporates the nonlinear aspects of the neuron's input-output transformation, like half-wave rectification, saturation, and so on. The exact form of k(t) is not important, although in order to derive closed-form expressions later, we will assume that k(t) has the form of a simple low-pass filter. When the input-output transformation is linear, f(x) = \lambda + x (see note 2), the power spectrum of the spike train s(t) can be expressed as

S_{ss}(f) = \lambda + |K(f)|^2 S_{mm}(f),    (4.11)

where \lambda is the mean firing rate and K(f) is the Fourier transform of the filter k(t). The variance of the firing rate, \sigma_\lambda^2, is given by

\sigma_\lambda^2 = \int_{-\infty}^{\infty} df\, |K(f)|^2 S_{mm}(f).    (4.12)
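A doubly stochastic Poisson train of the kind assumed in equation 4.10 can be simulated by Bernoulli thinning in small time bins. The sketch below uses a sinusoid as a stand-in band-limited stimulus and a linear rate encoding; all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

dt, T = 1e-3, 200.0                        # 1 ms bins, 200 s of simulated time
lam_bar = 40.0                             # mean firing rate, Hz (illustrative)
t = np.arange(0.0, T, dt)
m = np.sin(2 * np.pi * 2.0 * t)            # stand-in band-limited stimulus
lam = lam_bar * (1.0 + 0.3 * m)            # linear encoding; contrast ~ 0.21 < 1/3
spikes = rng.random(t.size) < lam * dt     # Bernoulli thinning per bin

rate_hat = spikes.sum() / T
print(abs(rate_hat - lam_bar) < 2.0)       # True: estimate near the mean rate
```

Because lam * dt stays well below 1, the per-bin Bernoulli draw is a good approximation to an inhomogeneous Poisson process; the modulation depth 0.3 keeps the rate contrast under the 1/3 positivity constraint introduced just below.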
We define the contrast of the firing rate as c_\lambda = \sigma_\lambda / \lambda. The positivity of \lambda(t) imposes a restriction on how large the contrast can be. For linear encoding, we will require that the mean firing rate be at least three times as large as the standard deviation of the firing-rate fluctuations, ensuring that the probability that \lambda(t) is negative is less than 0.01. This implies that c_\lambda \le 1/3. The presynaptic spike train is gated by a binary process corresponding to the stochastic vesicle release mechanism. Thus, vesicle release is a Poisson process with rate p\lambda(t), and v(t) (see equation 4.1) is a filtered shot-noise process. Its power spectrum is given by

S_{vv}(f) = |H(f)|^2 \left[ (\bar{q}^2 + \sigma_q^2)\, p \lambda + \bar{q}^2 p^2 |K(f)|^2 S_{mm}(f) \right] + S_{nn}(f).    (4.13)

The cross-spectral density S_{vm}(f) is given by

S_{vm}(f) = \bar{q}\, p\, S_{mm}(f)\, H(f)\, K(f).    (4.14)

Using equations 4.8, 4.13, and 4.14, we obtain

\xi = \frac{1}{\sigma_m^2} \int_S df\, \frac{[S_{mm}(f)]^2}{S_{mm}(f) + S_{n_{\rm eff}}(f)},    (4.15)

2. We assume that \lambda is high enough that the probability of \lambda(t) < 0 is negligible.
where

S_{n_{\rm eff}}(f) = \frac{\lambda\,(1 + CV_q^2)}{p\, |K(f)|^2} + \frac{S_{nn}(f)}{p^2\, \bar{q}^2\, |H(f)|^2\, |K(f)|^2}.    (4.16)

Using equation 4.6, the power spectrum of the reconstruction noise \hat{n}(t) can be written as

S_{\hat{n}\hat{n}}(f) = \frac{S_{mm}(f)\, S_{n_{\rm eff}}(f)}{S_{mm}(f) + S_{n_{\rm eff}}(f)}.    (4.17)
Synaptic transmission is ideal when the vesicle release is perfectly reliable, there is no variability in the EPSP amplitude, and membrane noise at the postsynaptic site is negligible (p = 1, CV_q = 0, \sigma_n = 0). The coding fraction corresponding to this ideal case is denoted by \xi^*, where

\xi^* = \frac{1}{\sigma_m^2} \int_S df \, \frac{|K(f)|^2 S_{mm}^2(f)}{|K(f)|^2 S_{mm}(f) + \lambda}.    (4.18)
In general, even for perfect synaptic transmission, \xi^* < 1 due to the stochastic Poisson nature of the spike train. For the parameter values we consider here (summarized in the caption of Figure 3), the second term in equation 4.16 can be neglected. The dominant first term in equation 4.16 indicates that signal estimation is limited by shot noise; that is, the inability to estimate the stimulus reliably is primarily due to the error in accurately estimating the firing rate of the neuron. It can be shown that for a fixed firing-rate contrast, as \lambda \to \infty, \xi \to 1. In other words, perfect reconstruction takes place in the limit of infinite firing rates. This agrees well with intuition; since the stimulus modulations are linearly encoded in the firing rate \lambda(t), an accurate estimate of \lambda(t) can be deconvolved to recover the input. Thus, for the signal estimation task, synaptic unreliability and variability make it harder to estimate the firing rate from the postsynaptic voltage. As compared to the ideal case, the shot noise for a single synapse increases by a factor

k = \frac{1 + CV_q^2}{p}.
To illustrate these results, we now apply them to a specific example modified from Gabbiani (1996). Let m(t) be a white, band-limited signal of bandwidth B_m,

S_{mm}(f) = \begin{cases} \dfrac{\sigma_m^2}{2 B_m} & -B_m \le f \le B_m, \\ 0 & \text{otherwise}. \end{cases}    (4.19)
14
Amit Manwani and Christof Koch
Figure 3: Comparison of performance in the signal estimation paradigm. (A) Coding fraction \xi and (B) the lower bound on the information rate I_{LB} as a function of the mean firing rate \lambda for different stimulus bandwidths B_m for an ideal synapse (p = 1, CV_q = 0). (C) \xi and (D) I_{LB} for a single unreliable and noisy synapse. We report the estimation performance for the optimal filter given by equations 4.21 and 4.23. Parameters are loosely based on those reported for cortical synapses: \tau = 20 msec, c_\lambda = 0.3, p = 0.4, CV_q = 0.6, h_{peak} = 1 mV, t_{peak} = 0.5 msec, \sigma_n = 0.1 mV, B_n = 100 Hz.
If k(t) is modeled as a simple low-pass filter, k(t) = A \exp(-t/\tau), then

|K(f)|^2 = \frac{A^2 \tau^2}{1 + (2\pi f \tau)^2}.
Using the above quantities, a closed-form expression for the coding fraction (denoted by \xi_{lp}) can be obtained (Gabbiani, 1996),

\xi_{lp} = \frac{\gamma}{\eta \sqrt{1+\gamma}} \tan^{-1}\!\left( \frac{\eta}{\sqrt{1+\gamma}} \right),    (4.20)
where

\eta = 2\pi B_m \tau, \qquad \gamma = \frac{\pi c_\lambda^2 \lambda \tau}{k \tan^{-1} \eta}.

\eta is the ratio of the signal bandwidth to the filter's bandwidth, and \gamma is the effective number of spikes available per unit signal bandwidth. As in Gabbiani (1996), one can also derive the optimal encoding filter that maximizes the coding fraction for a given stimulus power spectrum S_{mm}(f) under the constraint that the variance of the mean firing rate \sigma_\lambda^2 is fixed,
|K(f)|^2 = \frac{\sigma_\lambda^2}{\sigma_m^2}, \qquad -B_m \le f \le B_m.
The coding fraction corresponding to the optimal filter above (denoted by \xi_{opt}) is given by

\xi_{opt} = \left( 1 + \frac{2 k B_m}{c_\lambda^2 \lambda} \right)^{-1}.    (4.21)
Similarly, the lower bound I_{LB} for the low-pass filter can be obtained as

I_{LB} = \frac{1}{2\pi\tau \ln 2} \left[ \eta \ln\!\left( 1 + \frac{\gamma}{1+\eta^2} \right) - 2\tan^{-1}\eta + 2\sqrt{1+\gamma} \, \tan^{-1}\!\frac{\eta}{\sqrt{1+\gamma}} \right].    (4.22)
In equation 4.22, ln refers to the natural logarithm. The lower bound corresponding to the optimal encoding filter is given by

I_{LB} = B_m \log_2\!\left( 1 + \frac{c_\lambda^2 \lambda}{2 k B_m} \right)    (4.23)
       = -B_m \log_2(1 - \xi_{opt}).    (4.24)
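The closed-form expressions above are easy to check numerically. The following sketch (illustrative Python; the function names are ours, and the parameter values follow the caption of Figure 3 loosely) evaluates equations 4.20, 4.21, and 4.23, verifies the identity 4.24, and confirms that the low-pass and optimal filters agree when the signal band is much narrower than the filter band.

```python
import math

def k_factor(p, cv_q):
    # Shot-noise amplification factor k = (1 + CV_q^2) / p (ideal synapse: k = 1).
    return (1.0 + cv_q**2) / p

def xi_lp(lam, bm, tau, c_lam, k):
    # Coding fraction for the low-pass encoding filter, equation 4.20,
    # using the definitions of eta and gamma given in the text.
    eta = 2.0 * math.pi * bm * tau
    gamma = math.pi * c_lam**2 * lam * tau / (k * math.atan(eta))
    s = math.sqrt(1.0 + gamma)
    return gamma / (eta * s) * math.atan(eta / s)

def xi_opt(lam, bm, c_lam, k):
    # Coding fraction for the optimal encoding filter, equation 4.21.
    return 1.0 / (1.0 + 2.0 * k * bm / (c_lam**2 * lam))

def info_rate_opt(lam, bm, c_lam, k):
    # Lower bound on the information rate, equation 4.23 (bits per second).
    return bm * math.log2(1.0 + c_lam**2 * lam / (2.0 * k * bm))

lam, tau, c_lam = 100.0, 20e-3, 0.3
k_noisy = k_factor(p=0.4, cv_q=0.6)            # k = 3.4 for the unreliable synapse

for bm in (10.0, 30.0, 50.0):
    xo = xi_opt(lam, bm, c_lam, k_noisy)
    ir = info_rate_opt(lam, bm, c_lam, k_noisy)
    # Equation 4.24: I_LB = -B_m log2(1 - xi_opt) holds identically.
    assert abs(ir + bm * math.log2(1.0 - xo)) < 1e-9

# For eta << 1 the two filters coincide.
bm_small = 0.01                                 # eta = 2*pi*Bm*tau ~ 1.3e-3
assert abs(xi_lp(lam, bm_small, tau, c_lam, k_noisy)
           - xi_opt(lam, bm_small, c_lam, k_noisy)) < 1e-3
```

With the noisy-synapse parameters, the information rates come out in the single-digit bits-per-second range, consistent with panel D of Figure 3.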
Notice that when \eta \ll 1, \xi_{lp} \to \xi_{opt}, and the lower bounds on the information rate I_{LB} for the low-pass filter and the optimal filter converge. Plots of \xi and I_{LB} as functions of the mean firing rate \lambda and the input bandwidth B_m are shown in Figure 3. It can be observed that \xi increases with \lambda and decreases with B_m. This can be understood as follows: \lambda / B_m denotes the average number of spikes available in the ideal case to estimate the mean firing rate of the spike train (and thus the input) over a time period (1/B_m) during which the input is relatively stationary. The fewer the number of spikes available,
the poorer the estimate and the lower the coding fraction. On the other hand, I_{LB} increases with \lambda and increases with B_m for low B_m but saturates at high B_m. This is because the decrease in the quality of estimation (\xi decreases with B_m) is compensated by an increase in the number of independent samples (2B_m) transmitted per second. This can be observed by taking the limit \lambda \to 0 (equivalently, B_m \to \infty) in equation 4.23,

\lim_{\lambda \to 0} I_{LB} = \frac{B_m}{\ln 2} \, \frac{c_\lambda^2 \lambda}{2 k B_m} = \frac{c_\lambda^2 \lambda}{2 k \ln 2},
which is independent of B_m. We refer to this condition as the low signal-to-noise regime. The information transmitted per spike is given by

\lim_{\lambda \to 0} \frac{I_{LB}}{\lambda} = \frac{c_\lambda^2}{2 k \ln 2}.
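Evaluating this limit numerically (a short Python check; the helper name is ours) reproduces the per-spike figures quoted in the surrounding text: roughly 0.08 bits per spike for linear encoding with c_\lambda = 1/3 and an ideal synapse, and roughly 1.13 bits per spike when half-wave rectified encoding permits c_\lambda = \sqrt{\pi/2}.

```python
import math

def max_bits_per_spike(c_lam, k=1.0):
    # Information per spike in the low signal-to-noise regime:
    # c_lambda^2 / (2 k ln 2), where k = 1 for an ideal synapse.
    return c_lam**2 / (2.0 * k * math.log(2))

# Linear encoding, ideal synapse: about 0.08 bits per spike.
assert round(max_bits_per_spike(1.0 / 3.0), 2) == 0.08

# Half-wave rectified encoding, c_lambda = sqrt(pi/2): about 1.13 bits per spike.
assert round(max_bits_per_spike(math.sqrt(math.pi / 2.0)), 2) == 1.13

# An unreliable synapse (p = 0.4, CV_q = 0.6, so k = 3.4) scales this down by k.
assert max_bits_per_spike(1.0 / 3.0, k=3.4) < max_bits_per_spike(1.0 / 3.0)
```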
Notice that the maximum information per spike depends only on the contrast c_\lambda of the firing-rate modulations and the factor k. In the case of an ideal synapse with c_\lambda = 1/3, the maximum information per spike is around 0.08 bits per spike. Notice that for a noisy synapse, the maximum information per spike decreases by a factor of k. However, if we assume that the signal is encoded by a pair of half-wave rectifying neurons, each representing one-half (positive/negative) of the input m(t), as in Gabbiani (1996) and Gabbiani and Koch (1996), the maximum amount of information that can be transmitted by an ideal synapse is approximately equal to 1.13 bits per spike. This corresponds to c_\lambda = \sqrt{\pi/2} (\approx 1.25). Thus, the low information rates we obtain here are a consequence of our assumption of linear encoding, which limits the magnitude of the firing-rate contrast. Nevertheless, the analysis can be readily extended to include other encoding schemes (integrate-and-fire models, nonlinear Poisson encoding models with sigmoidal, half-wave rectification and other nonlinearities, and so on) and yields qualitatively similar results to those obtained here. The analysis can be generalized to the case of multiple independent synaptic connections between two neurons. Let N_{syn} denote the number of parallel synaptic connections between two neurons, which can occur in the form of N_{syn} independent release sites driven by the same presynaptic process at the same postsynaptic location or as multiple synapses made at different locations. In either case, it is assumed that the vesicle release at a given synapse is statistically independent of the release at other synapses. However, this does not imply that the EPSP waveforms corresponding to the different synapses are independent. In fact, the EPSPs are correlated since they are driven by the same presynaptic spike train s(t). The postsynaptic membrane voltage can be written as

V(t) = \sum_{l=1}^{N_{syn}} \sum_i q_i^l W_i^l h^l(t - t_i) + n(t),    (4.25)
where the superscript l refers to the lth synaptic connection. As before, we define v(t) = V(t) - \langle V(t) \rangle. If the synapses are distributed at different electrotonic locations on the postsynaptic neuron, the corresponding EPSP waveforms h^l(t) are different. Similarly, if the release properties across the synaptic population are nonuniform (Rosenmund, Clements, & Westbrook, 1993; Dobrunz & Stevens, 1997), the random variables (q_i^l, W_i^l) are governed by different distributions. However, for the sake of analytical tractability, we assume that the synapses are identical and are at the same electrotonic location. For the parameter values we consider here, the effect of the postsynaptic voltage noise n(t) is negligible, and it can be shown that the shot noise increases from the ideal case by a factor

k_N = \frac{1}{N_{syn}} \frac{1 + CV_q^2}{p} + \frac{N_{syn} - 1}{N_{syn}}.    (4.26)
The details of this derivation are provided in appendix B. In the limit of a large number of synapses (N_{syn} \to \infty), k_N \to 1. Thus, the effect of synaptic unreliability and variability can be offset by redundancy in the number of synaptic connections between neurons. Plots of the coding fraction and the information rate for the signal estimation task as a function of the number of parallel synapses are shown in Figure 4. Thus, although a single synapse has very low capacity, a small amount of redundancy causes a considerable increase in performance.

5 Signal Detection

We now consider the problem of detecting the presence of a presynaptic action potential from measurements of membrane voltage at the postsynaptic site. This signal detection paradigm frequently arises in different fields of science and engineering, like radar, digital communications, pattern recognition, and psychophysics. The objective in signal detection is to decide which member from a discrete set of signals was generated by a source, on the basis of (possibly noisy) measurements of its output. In the case we consider here, the set has two elements: the absence or presence of a presynaptic spike within a fixed temporal window. In our problem, the corresponding postsynaptic voltage waveform V(t), measured over a period 0 \le t \le T, corresponds to either the noise process n(t) (denoted by hypothesis H_0) or a noisy version of the EPSP h(t) gated by stochastic vesicle release W (denoted by hypothesis H_1). Let X and Y be binary variables denoting the occurrence of a presynaptic spike and the decision, respectively. Thus, X = 1 if a spike occurred, else X = 0. Similarly, Y = 1 expresses the decision that a spike occurred. In psychophysics, the binary signal detection problem is known as a yes-no task (Green & Swets, 1966). The detection paradigm has been used in neuroscience to explore the relationship between the outputs of individual cortical neurons and the performance of the animal in behavioral
Figure 4: Effect of anatomical redundancy on estimation performance. (A) Coding fraction \xi and (B) the information rate I_{LB} in signal estimation as a function of the number of parallel, identical synaptic connections N_{syn} between a presynaptic and a postsynaptic neuron. The empty symbols correspond to performance in the presence of the low-pass filter, whereas the solid symbols correspond to the optimal encoding filter. There is little difference in estimation performance when k(t) (see Figure 1) is chosen to be a low-pass filter or an optimal encoding filter matched to the stimulus. The performance saturates at high N_{syn} as it approaches the performance of an ideal synapse (p = 1, CV_q = 0). The mean firing rate \lambda is 200 Hz. Other parameter values used are summarized in the caption of Figure 3.
tasks (Newsome et al., 1989; Britten et al., 1992; Shadlen & Newsome, 1998). The binary signal detection problem can be formally expressed as

H_0: V(t) = n(t),              (Noise)
H_1: V(t) = q W h(t) + n(t),   (Signal + Noise)
where q is the random EPSP amplitude and W is a binary variable representing the spike-conditioned vesicle release process. Thus, the goal is to design an optimal decision rule that minimizes the probability of error in detecting the presynaptic spike from the postsynaptic membrane voltage. Decision errors are of two kinds. A false alarm error (F) occurs when the decision is in favor of the signal (Y = 1) when in fact noise was present (X = 0). Conversely, a miss error (M) occurs when the decision is in favor of the noise (Y = 0) when the signal was present (X = 1). The probabilities of these errors are defined as

P_F = Prob[Y = 1 | X = 0],    P_M = Prob[Y = 0 | X = 1].

The probability of detection error P_e is given by

P_e = p_0 P_F + (1 - p_0) P_M,    (5.1)
where p_0 and 1 - p_0 are the prior probabilities of occurrence of H_0 and H_1, respectively. We define a likelihood ratio L_X(V) as

L_X(V) = \frac{Prob[V | X = 1]}{Prob[V | X = 0]},    (5.2)
where Prob[V | X = 1] and Prob[V | X = 0] denote the conditional probabilities of observing the voltage V(t) conditioned on the presence and absence of a presynaptic spike, respectively. Using Bayes' rule, L_X(V) can be expanded as

L_X(V) = \frac{Prob[X = 1 | V]}{Prob[X = 0 | V]} \frac{p_0}{p_1},    (5.3)
where Prob[X = 1 | V] and Prob[X = 0 | V] denote the posterior probabilities of the hypotheses conditioned on V(t). The ratio

L(V) = \frac{Prob[X = 1 | V]}{Prob[X = 0 | V]}
is referred to as the posterior likelihood, whereas L_0 = (1 - p_0)/p_0 is called the prior likelihood. The decision rule that minimizes P_e is given by (Poor, 1994)

L(V) \ge 1 \Rightarrow Y = 1,
L(V) < 1 \Rightarrow Y = 0,

which can be written in terms of the likelihood ratio as

L_X(V) \ge L_0^{-1} \Rightarrow Y = 1,
L_X(V) < L_0^{-1} \Rightarrow Y = 0.
\theta_M (see Figure 8b). Note that unlike the BCM theory, where w is a function of the postsynaptic rate only, here it depends on the individual synapse. Distinguishing between the f_{post}-dependency of w on a fast and a slow timescale leads to the sliding threshold property. Taking into account the transfer from the presynaptic frequencies to the postsynaptic response, this is an important means for the synapses to keep the postsynaptic activity around a point of the cell's maximal sensitivity. If, after reaching steady-state values, the postsynaptic frequency drops, say, by artificially reducing the input from the dominant eye, P_{dis}^\infty of the activated synapses is slowly downregulated (see Figure 8a). The rate of the change in P_{dis}^\infty and its final value are shown in Figure 8b as a function of the new value of f_{post}^{sp}. The modification threshold \theta_M at which up- and downregulating forces neutralize is implicitly driven by the running mean of the postsynaptic spike
Algorithm for Modifying Neurotransmitter Release Probability
49
Figure 8: BCM rule. (a) After converging to the steady state for f_{pre}^{sp} = 20 Hz and f_{post}^{sp} = 30 Hz, an instantaneous jump of the postsynaptic frequency to 10 Hz will downregulate P_{dis}^\infty. Noisy lines represent the average over 50 trials (of the algorithm in Table 1), while the smooth lines represent the mean-field solutions (obtained by integrating the differential equations A.1-A.8). (b) Starting at the same steady state as in a, the vesicle discharge probability increases if the postsynaptic spike frequency jumps above \theta_M (= 30 Hz) and decreases if it jumps below. The rate of the change is a nonmonotonic function of the new value of f_{post}^{sp} (solid line, calculated according to equation 3.2 and section A.4 with f_{pre}^{sp} = 20 Hz). The dashed curve shows the new value of P_{dis}^\infty after reaching the steady state and corresponds to the white curve in Figure 7b.
frequency. Keeping the spike frequencies fixed at the new value, the synaptic parameters will adapt such that f_{post}^{sp} becomes a zero of w and therefore \theta_M = f_{post}^{sp} (cf. section A.4 for an analytical treatment). At this point a slight change in the postsynaptic average frequency will result in a corresponding change of the vesicle discharge probability. A new presynaptic activation pattern, say, generated by stimulating the unsutured eye, sharing the same synapses will now have the chance to grow its influence onto the postsynaptic cell, and the average postsynaptic activity would start to increase again. A more detailed discussion of the learning rule's selection and stabilization properties for dominant input patterns is given in Kempter et al. (1999b) and Abbott and Song (1999). Let us finally mention that it is possible to retrieve the rule of Artola-Bröcker-Singer (Artola & Singer, 1993), which differs from the BCM rule by an additional threshold below which no modification takes place. In our scheme this is achieved by setting both thresholds S_{\theta u} and S_{\theta d} to positive values.

3.2.2 A Generalized Asymmetric Hebbian Rule. Different forms of Hebbian learning rules have been investigated that consider the product of pre- and postsynaptic activities as the quantity determining the synaptic mod-
50
W. Senn, H. Markram, and M. Tsodyks
Figure 9: (a) Sinusoidal modulation of f_{pre}^{sp} (full line in the lower panel) and f_{post}^{sp} (dashed line in the lower panel) at \nu = 1 Hz with a presynaptic phase lag of 0.2 induces upregulation of P_{dis}^\infty (upper panel). The noisy line in the upper panel represents the average time evolution of P_{dis}^\infty over 100 trials (of the algorithm in Table 1), and the dashed curve represents the mean-field solution (of equations A.1-A.8). (b) Steady-state values of P_{dis}^\infty for sinusoidal modulations of the pre- and postsynaptic frequencies versus presynaptic phase lag. The modulation frequencies were \nu = 1, 5, 20 Hz. Note that the curves represent smoothed versions of the learning curves shown in Figure 6. For the 1 Hz modulation, the synaptic depression may advance the release activity with respect to the presynaptic spike activity, and P_{dis}^\infty may be upregulated even if this lags the postsynaptic peak activity. The horizontal dashed line represents the steady state without modulation. Same parameter values as in the caption of Figure 7.
ification (see, e.g., Brown & Chatterji, 1994, for a review). To prevent the saturation of the synaptic strengths, mixtures of Hebbian and anti-Hebbian rules have been proposed, focusing on the covariance between pre- and postsynaptic activities (Sejnowski, 1977) or on normalization of the synaptic weight (Oja, 1982). The learning rule we consider belongs to the same class of Hebbian mixtures, however, with a strong asymmetry in the temporal correlations. This asymmetry makes the synapse sensitive to the phase lag between modulations of the pre- and postsynaptic frequencies. If the postsynaptic spike activity lags the presynaptic release activity, P_{dis}^\infty is upregulated, and if the postsynaptic spike activity leads the presynaptic release activity, P_{dis}^\infty is downregulated. Figure 9a shows the change of P_{dis}^\infty for a sinusoidal modulation of the pre- and postsynaptic spike frequencies at 1 Hz and a phase lag of 0.2. The steady state of P_{dis}^\infty for different rates of modulation and different lags is shown in Figure 9b. Two things are worth emphasizing. First, for slow modulation rates in the range of the vesicle recovery time constant, the point of zero change appears at a positive lag of the presynaptic activity. This is rather counterintuitive since at positive lag, downregulation would be expected. Second, the maximal changes appear at lags of \pm 1/4, independent of the frequency of modulation. This is true even if the pe-
riod of the modulation is much larger than the time window of synaptic adaptation. To gain insight into these mechanisms, we restrict ourselves to the simplifying assumption of an instantaneous recovery of the secondary messenger (\tau_S = 0). This assumption implies that the amount of activated secondary messenger S_u and S_d is always proportional to the NMDA receptors in the states N_u and N_d, respectively. On the other hand, if the pre- and postsynaptic spikes are independent, these states are roughly proportional to the running means of the previous presynaptic release and postsynaptic spike frequencies, respectively (which follows from integrating equations A.1 and A.2). If we, moreover, neglect the thresholds for up- and downregulation (S_{\theta u} = S_{\theta d} = 0), then formula 3.1 for the synaptic modification can be comprised by the generalized asymmetric Hebbian rule

\frac{dP_{dis}^\infty}{dt} \approx \tilde{r}_u (1 - P_{dis}^\infty) \langle f_{pre}^{rel} \rangle f_{post}^{sp} - \tilde{r}_d P_{dis}^\infty \langle f_{post}^{sp} \rangle f_{pre}^{rel},    (3.3)
where

\langle f \rangle = \frac{1}{\tau_N} \int_0^\infty f(t - t') \, e^{-t'/\tau_N} \, dt'

represents the running mean of f. Note that by setting \tau_S = 0, the third-order nonlinearity in equation 3.1 is reduced to the usual correlations, although with the presynaptic release instead of the presynaptic spike frequency. The transformation of the presynaptic spike frequency into the presynaptic release frequency itself is governed by the dynamics of the vesicle recovery (cf. equations A.8 and A.9). If we assume an instantaneous vesicle recovery (\tau_{rec}^v = 0), there is no activity-dependent depression, and the presynaptic release rate is proportional to the presynaptic spike rate. For "static" synapses, f_{pre}^{rel} can therefore be replaced by f_{pre}^{sp} in equation 3.3. Equation 3.3 provides an explanation of Figure 9. First, we observe that averaging f over the past delays the averaged quantity \langle f \rangle. Second, we note that the effect of the synaptic depression is to advance the release rate f_{pre}^{rel} compared to the spike rate f_{pre}^{sp} (since the peak of the release rate is depressed; see, e.g., Abbott, Varela, Sen, & Nelson, 1997). For lag zero and a modulation frequency of \nu = 1 Hz these two effects cancel, and the factors in the Hebbian part of equation 3.3 are correlated, but not the ones in the anti-Hebbian part. This explains the upregulation of P_{dis} even at a small, positive lag of the presynaptic spike activity (see Figure 9b). For higher modulation frequencies beyond the time constant of vesicle recovery, however, the phase advance vanishes, and at zero lag, the Hebbian and anti-Hebbian parts in equation 3.3 cancel each other. The maximal upregulation is reached for a presynaptic phase advance of 1/4 since, due to the delaying effect of averaging, this leads to correlated factors in the Hebbian part of equation 3.3 and to uncorrelated factors in the anti-Hebbian part. Similarly,
52
W. Senn, H. Markram, and M. Tsodyks
maximal downregulation of P_{dis} is obtained around a presynaptic phase lag of -1/4, and this barely depends on the modulation frequency.

3.2.3 A Spike Correlation Rule. In the previous paragraph, we considered correlations between pre- and postsynaptic mean firing rates. Independent of these correlations, the pre- and postsynaptic spike trains may exhibit spike-by-spike correlations that affect the synaptic efficacy as well. To quantify the influence of this microcorrelation, we introduce the notion of (first-order) spike correlation between the pre- and postsynaptic spike trains. This measure describes the temporal asymmetry with which, on average, a presynaptic spike occurs between the last previous and the next following postsynaptic spike. For a given presynaptic spike time t_{pre}, we denote the time of the previous postsynaptic spike by t_{post}^< and the time of the following postsynaptic spike by t_{post}^>. The spike correlation c_s is then defined by

c_s := \frac{1}{\Delta} \left( \langle t_{pre} - t_{post}^< \rangle_{t_{pre}} - \langle t_{post}^> - t_{pre} \rangle_{t_{pre}} \right),

with normalization \Delta = \langle t_{pre} - t_{post}^< \rangle_{t_{pre}} + \langle t_{post}^> - t_{pre} \rangle_{t_{pre}}. Due to the normalization, c_s is restricted to values between -1 and 1, and these boundary values are reached if the presynaptic spikes all occur either instantaneously after or before the postsynaptic ones, respectively. Hence, we characterize the pre- and postsynaptic spike trains by three numbers: their (instantaneous) frequencies f_{pre}^{sp}, f_{post}^{sp}, and their spike correlation c_s. Although it is no surprise that the sign of the synaptic change will correspond to the sign of the correlation c_s imposed on the spike trains, it is important on a formal level to quantify the synaptic modification as a function of these three variables. The computational power of rate-based neurons with transfer functions depending on the presynaptic correlations was recently characterized (Maass, 1998), but learning algorithms for such networks have yet to be studied. Let us now assume a scenario where the pre- and postsynaptic spike trains each exhibit Poisson characteristics when considered by themselves, but when compared to each other, they show some spike correlation c_s \in [-1, 1]. Figure 10a depicts the change of P_{dis}^\infty when, after 2 seconds of Poisson stimulation, the spike correlation c_s jumps from 0 to 1 (first pre-, then post-) and from 0 to -1 (first post-, then pre-), respectively. Figure 10b shows the steady state of P_{dis}^\infty as a function of c_s for fixed pre- and postsynaptic frequencies. Reordering the left and right branches (inset of the figure) yields an average modification similar to the one induced by the shift between the periodic pre- and postsynaptic spike trains in Figure 6. To explain the effect of the spike correlation, we again assume the simplified setting of an instantaneous recovery of the secondary messenger (\tau_S = 0) and of vanishing thresholds S_{\theta u} and S_{\theta d}. The spike correlation implicitly determines the distribution of the NMDA receptors over the states N_{rec}, N_u, and N_d.
If c_s is negative, the transition N_{rec} \to N_u induced by a
Figure 10: (a) Upper panel: After reaching the steady state for pre- and postsynaptic Poisson spike trains of 20 Hz (0-2 sec), the spike correlation c_s was switched from either 0 to 1 (inducing an increase of P_{dis}^\infty) or 0 to -1 (inducing a decrease of P_{dis}^\infty). The noisy lines represent the average over 100 repetitions of the experiment, and the smooth lines represent empirically corrected mean-field calculations (cf. section A.5). Lower panel: After switching c_s from 0 to 1, the spikes of the postsynaptic neuron (lower trace) were elicited immediately after a presynaptic spike (upper trace). (b) Change of P_{dis}^\infty according to equation 3.4 for f_{pre}^{sp} = 20 Hz and f_{post}^{sp} = 30 Hz versus the spike correlation c_s. The inset shows a reordering of the branches and essentially reflects the steady state of P_{dis}^\infty as a function of the (normalized and averaged) spike time differences t_{pre} - t_{post}. Same parameter values as in the caption of Figure 7.
presynaptic release is less effective since, with high probability, an immediately preceding transition N_{rec} \to N_d induced by a postsynaptic spike reduced the state N_{rec}. Since N_u tunes, through the postsynaptic frequency, the upregulation of P_{dis}^\infty, the net increase of P_{dis}^\infty is reduced. In turn, a positive c_s disfavors the state N_d since a previously arrived presynaptic spike triggered with probability P_{rel} the transition N_{rec} \to N_u, and less is remaining in the state N_{rec} to be moved to N_d. In this case the downregulation of P_{dis}^\infty is less effective. This reasoning can be expressed by adapting formula 3.3 to

\frac{dP_{dis}^\infty}{dt} \approx \tilde{r}_u (1 - P_{dis}^\infty) \langle f_{pre}^{rel} (1 - \bar{r}_d [-c_s]^+) \rangle f_{post}^{sp} - \tilde{r}_d P_{dis}^\infty \langle f_{post}^{sp} (1 - \bar{r}_u [c_s]^+ P_{rel}) \rangle f_{pre}^{rel}.    (3.4)
Note again that the presynaptic release frequency corresponds to the presynaptic spike frequency times the probability of vesicle release, f_{pre}^{rel} = f_{pre}^{sp} P_{rel}, and that the latter itself depends on the dynamics of the presynaptic spike frequency (see equation 2.1 and section A.2). Equations 3.3 and 3.4 represent an asymmetric version of the traditional covariance rule based on mean firing rates (see, e.g., Sejnowski, 1977), with an account of correlations on the level of spikes rather than only on the level of frequencies.
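The spike correlation c_s defined in section 3.2.3 can be computed directly from two spike trains. The sketch below is our own implementation of the definition, with our choice of simply skipping presynaptic spikes that lack a postsynaptic neighbor on either side.

```python
import bisect

def spike_correlation(pre, post):
    # First-order spike correlation c_s between two sorted spike-time lists.
    # For each presynaptic spike with postsynaptic spikes on both sides,
    # accumulate the backward delay (t_pre - t_post^<) and the forward delay
    # (t_post^> - t_pre); c_s is their normalized difference.
    back, fwd, count = 0.0, 0.0, 0
    for t in pre:
        i = bisect.bisect_left(post, t)
        if 0 < i < len(post):              # needs a postsynaptic spike on both sides
            back += t - post[i - 1]
            fwd += post[i] - t
            count += 1
    if count == 0:
        return 0.0
    delta = (back + fwd) / count           # normalization Delta
    return (back - fwd) / count / delta    # in [-1, 1]

# Presynaptic spikes placed just before each postsynaptic spike give c_s near +1;
# just after, near -1, matching the boundary cases described in the text.
post = [float(k) for k in range(1, 21)]
assert spike_correlation([t - 0.01 for t in post], post) > 0.9
assert spike_correlation([t + 0.01 for t in post], post) < -0.9
```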
3.3 Full versus Simplified Model. One may ask whether the simplified model explaining the mean-field properties in sections 3.2.2 and 3.2.3 above is sufficient to reproduce the experimental results. As long as one is interested in modeling the temporal asymmetry of the synaptic modification, this is indeed true, but it is difficult to fit the nonlinearities of the additional experiments. To recapitulate, the simplified model is obtained by neglecting the dynamics of the secondary messenger, formally by taking the limit \tau_S = 0 with r_S = 1. (In this case, one may skip the steps post2 and rel2 in the algorithm and directly use the quantities N_u, N_d for S_u, S_d in the steps post3 and rel3, respectively. Similarly, in the mean-field description, equation 3.1, one may identify S_u^+, S_d^+ with N_u, N_d, respectively; see section A.3.) In a rough simplification, the discharge probability is then upregulated at each postsynaptic spike, depending on the running mean of the presynaptic release rate, and downregulated at each presynaptic release, depending on the running mean of the postsynaptic spike rate,

\Delta P_{dis} \approx \langle f_{pre}^{rel} \rangle f_{post}^{sp} - \langle f_{post}^{sp} \rangle f_{pre}^{rel}.    (3.5)
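An event-driven sketch of this minimal rule is short enough to state in full. All constants below are our own illustrative choices; for independent Poisson pre- and postsynaptic trains, the up- and downregulating nudges balance on average, which is the point of the covariance-like structure of equation 3.5.

```python
import random

# At each postsynaptic spike, P_dis is nudged up in proportion to the running
# mean of the presynaptic release rate; at each presynaptic release it is
# nudged down in proportion to the running mean of the postsynaptic rate.
random.seed(0)
dt, tau_n, eta = 1e-3, 0.1, 2e-4   # time step (s), averaging constant (s), learning rate
f_pre, f_post = 20.0, 30.0         # Poisson spike rates (Hz)
p_rel = 0.5                        # vesicle release probability per presynaptic spike

pre_avg = post_avg = 0.0           # running-mean estimates of f_pre^rel, f_post^sp (Hz)
p_dis = 0.3

for _ in range(int(60.0 / dt)):    # 60 s of independent Poisson activity
    release = random.random() < f_pre * dt * p_rel
    spike = random.random() < f_post * dt
    # leaky (exponential) running means of the two point processes
    pre_avg += (-pre_avg * dt + release) / tau_n
    post_avg += (-post_avg * dt + spike) / tau_n
    # equation 3.5: up at postsynaptic spikes, down at presynaptic releases
    p_dis += eta * (pre_avg * spike - post_avg * release)
    p_dis = min(max(p_dis, 0.0), 1.0)
```

The running means settle near f_pre * p_rel = 10 Hz and f_post = 30 Hz, so the expected drift of p_dis vanishes; correlated trains (nonzero c_s) break this balance.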
This form of the algorithm would represent the minimal model for reproducing the temporal asymmetries in Figures 3, 6, and 10a, but also in Figure 9b, reflecting the phase-advance property of depressing synapses. If the rule is applied to the strength of nondepressing synapses, it would probably also be sufficient to reproduce the experiments in other preparations, which were confined to investigating the asymmetry of the learning window at a fixed frequency of the pairing (Bell et al., 1997; Zhang et al., 1998; Bi & Poo, 1998). In our case of depressing synapses, however, we did not succeed in matching at the same time the additional experiments 2 and 3, with their nonlinear increase as a function of the pairing frequency and the nonmonotonic change as a function of the number of spikes. These two experiments led us to introduce the S-variable in the kinetic scheme, as well as the third-order nonlinearity given by the steps post3 and rel3 (which are established by the delta function in equation A.5). First, two different time constants are needed to model (1) the width of the learning window, which is probably narrower than 50 ms, as seen in Figures 3 and 6a, and (2) the cumulating effect of the spike pairing repeated in steps of 100 ms, as shown by the steep increase at 10 Hz in Figure 4. Experiment 3 also suggests that a time constant larger than 50 ms is involved since a transient behavior is seen extending over the 500 ms of the first 10 to 15 spikes at 20 Hz (see Figure 5). The choice of two different secondary messengers was guided by the simplicity of the resulting scheme, and we do not make any statement about their nature. In fact, one could think of a single type of secondary messenger, but in this case additional processes would have to be modeled to read out from this messenger whether and by how much the discharge probability should be up- or downregulated (Grzywacz & Burgi, 1998).
Second, we introduced the third-order nonlinearity since, in the presence of synaptic depression, it was difficult to obtain a nonlinear increase of the learning curve with respect to increasing frequencies (see Figure 4). The reason is that increasing the pairing frequency is essentially balanced out on the right-hand side of equation 3.5. If we consider the full scheme with large \tau_S (compared to 1/f), however, equation 3.5 turns into

\Delta P_{dis} \approx \langle f_{pre}^{rel} \rangle_N \langle f_{post}^{sp} \rangle_S f_{post}^{sp} - \langle f_{post}^{sp} \rangle_N \langle f_{pre}^{rel} \rangle_S f_{pre}^{rel},    (3.6)
where \langle \cdot \rangle_N and \langle \cdot \rangle_S denote the running means with time constants \tau_N and \tau_S, respectively (cf. equation 3.3 and section A.3). Now f_{post}^{sp} arises squared in the first term, which gives the steep increase with the pairing frequency. Importantly, f_{pre}^{rel}, which arises squared in the second term, remains on the order of 1 Hz due to the synaptic depression (cf. section A.2). Although this reasoning deals with Poisson rates, the problem that the synaptic change would be largely indifferent to increasing frequencies if we did not introduce further nonlinearities is also seen in experiment 2. At medium frequencies, in a train of five paired spikes, only the first and often the last presynaptic spike trigger a release. In this case, the postsynaptic spikes in the middle of the train are placed symmetrically after and before a presynaptic release, and no clear dominance for upregulation is shown. For higher frequencies, the duration of the 5-spike train is markedly below the vesicle recovery time constant, and maximally one release can be triggered. Due to the stochastic nature, this release may arise in response to the second presynaptic spike, and the preceding postsynaptic event then favors downregulation. Moreover, since in this case the first postsynaptic spike depleted the common pool (N_{rec}), the same release event will be much less effective in inducing an upregulation via an immediately following postsynaptic spike.

4 Discussion

4.1 Summary. We presented an algorithm that allows the prediction of the change in the temporal dynamics of synaptic transmission by depressing synapses induced by arbitrary trains of pre- and postsynaptic action potentials. The algorithm is based on the importance of the precise relative timing of the pre- and postsynaptic action potentials and therefore constitutes a novel form of learning algorithm.
It reproduces the change in the synaptic responses induced by the pairing: the asymmetry with respect to the spike-time difference, the monotonic increase with respect to the frequency, and the nonmonotonic dependency on the number of paired spikes. When applying Poisson spike trains, an anti-Hebbian regime is followed by a Hebbian regime during an increase of the postsynaptic frequency, provided the relative degree of upregulation (r_u^P / r_d^P) is small enough. This feature, together with the slow adaptation of the borderline between the two regimes, allows the rule to preserve the stability and the selectivity to new
patterns as described by the BCM theory. When modulating the pre- and postsynaptic spike frequencies, the synaptic dynamics may dominate the temporal asymmetry of the learning rule, and, surprisingly, upregulation of the discharge probability may occur even if the presynaptic peak activity lags the postsynaptic peak activity. Finally, quantifying the spike correlation within two arbitrary spike trains permits specifying the learning rule as a function of the instantaneous pre- and postsynaptic frequencies and their instantaneous spike correlation. Such a formulation fits into the recently proposed framework of generalized neural networks, which process average firing rates together with average spike correlations (Maass, 1998).

4.2 Experimental Basis for the Model. The algorithm was constructed under several experimental constraints:

1. Excitatory glutamatergic synapses between pyramidal neurons in the neocortex are fast-depressing synapses, and the change that follows Hebbian pairing is equivalent to a change in the discharge probability P_dis (Markram & Tsodyks, 1996). This change alters the temporal dynamics of transmission (Tsodyks & Markram, 1997; Abbott et al., 1997).

2. The asymmetrical modification window with respect to the relative timing of pre- and postsynaptic action potentials, as reported by Markram et al. (1997), was considered. In this window, up- and downregulation fall within around 100 ms after and before a postsynaptic action potential, respectively.

3. Upregulation is dependent on NMDA receptors, and the backpropagating action potential can cause rapid pulses of Ca²⁺ through the NMDA receptor (Markram, Roth, & Helmchen, 1998; Spruston, Jonas, & Sakmann, 1995). The speed of these AP-evoked Ca²⁺ concentration pulses via NMDA receptors is such that only an enzyme with very high forward binding rates could capture this Ca²⁺, providing the ideal situation for a coincidence detector protein (Markram, Roth, & Helmchen, 1998).
We therefore propose that the amplitude of the upregulation follows the time course of the NMDA current, which has a rise time constant of 10 to 20 ms and a decay time constant of 100 to 300 ms (Jonas & Spruston, 1994). The time constant of approximately 60 ms for upregulation observed experimentally could be due to a threshold effect.

4. The magnitude of the upregulation increases as a function of the frequency of the pre- and postsynaptic action potentials in the train, with an onset between 5 and 10 Hz (Markram et al., 1997).

5. Surprisingly, the upregulation was not increased when more action potentials were included in a train of a given frequency (data included in this study). This is proposed to be due to balanced effects of up- and downregulation, which are reached after an initial transient for the given train of action potentials.

6. Downregulation is also dependent on NMDA receptor activation and on Ca²⁺ influx (HM, unpublished data). We therefore propose that the rise in
Ca²⁺ concentration following a backpropagating action potential influences the NMDA receptor in such a manner that it results in downregulation when activated. Indeed, the dendritic Ca²⁺ transient has a time course similar to that of downregulation (Markram, Helm, & Sakmann, 1995), and intracellular Ca²⁺ has been shown to modify the kinetics of the NMDA receptor (Mayer et al., 1987; Medina, Filippova, Bakhramov, & Bregestovski, 1996; Mayer, 1998; Antonov, Gmiro, & Johnson, 1998).

7. Ca²⁺ from different sources or modes of influx can activate different processes due to differential binding as a function of forward and backward rate constants (Markram, Roth, & Helmchen, 1998).

4.3 Comparison with Other Learning Rules. Asymmetric learning rules based on mean firing rates or individual spike times have been investigated in different contexts. They were used for storing trajectories (Abbott, 1996), modeling hippocampal place fields of rats wandering around in the Morris water maze (Blum & Abbott, 1996; Gerstner & Abbott, 1997), and modeling the temporal selectivity in the auditory pathway of barn owls (Gerstner, Kempter, van Hemmen, & Wagner, 1996). Recent work investigates the stabilization capabilities of asymmetric rules in keeping the cell's sensitivity (Kempter et al., 1999b; Abbott & Song, 1999), similar to what is achieved by the described sliding-threshold property at the level of mean firing rates. Neither of these works, however, takes into account the synaptic dynamics. Our algorithm comes closest to the version of the learning rule outlined in Gerstner, Kempter, van Hemmen, and Wagner (1998, sec. 14.2.4), which deals with the kinetics of four different components corresponding to our N_u, N_d, S_u, and S_d. In their description, no synaptic depression is taken into account, and the rule refers to a change in the absolute synaptic strength. Moreover, no common pool similar to N_rec is assumed, and only Hebbian second-order correlations are considered.
It is worth pointing out that the physiologically motivated restriction to a common pool N_rec may help to sharpen the distinction between positive and negative spike delays, even in the case of a deterministic release without depression. Considering simultaneous bursts of pre- and postsynaptic spikes of mixed temporal correlations, it would typically be the first release-spike pair that, due to the depletion of the common pool, dominates the synaptic modification. An immediately following second pair with possibly reversed temporal order will be much less effective if this pool was just reduced and therefore will not be able to annihilate the initiated modification. In comparison with rules based on mean firing rates, one may establish connections to the BCM theory (Bienenstock et al., 1982) but also reproduce the rule of Artola-Bröcher-Singer (Artola & Singer, 1993) by assuming nonvanishing thresholds for the up- and downregulation. On the other hand, the simplified form, equation 3.5, shows that our algorithm is distinct from the covariance rule (Sejnowski, 1977), which in our situation would be of the
form

$$\Delta P_{dis} \approx \left( f_{pre}^{rel} - \langle f_{pre}^{rel} \rangle \right) \left( f_{post}^{sp} - \langle f_{post}^{sp} \rangle \right),$$
and therefore would be symmetric with respect to the presynaptic release and the postsynaptic spike frequencies. The power of our algorithm is not only to cope with average firing frequencies and correlations but to offer a general framework for dealing with pre- and postsynaptic spike trains of any complexity and arbitrary correlations.

4.4 Predictions of the Model and Outlook. It is remarkable that simple first-order kinetics with a cross-gating of the NMDA receptors can qualitatively explain the different spike-induced nonlinear synaptic modifications. Several features of the model are tailored to explain the specific experimental results and raise new hypotheses that can be experimentally tested. Among these are the following:
1. Experiment 2 with a reversed relative timing would lead to a moderate monotonic decrease of P_dis with respect to the paired spike frequency (contrasting Figure 4).

2. Experiment 2 with 20 instead of 5 spikes in the paired spike trains would lead to a nonmonotonic P_dis with respect to frequency (see Figure 5).

3. The full learning curve for trains of 5 spikes versus the time difference of the train onsets would exhibit a maximal upregulation and maximal downregulation at −50 ms and 100 ms, respectively (see Figure 6b).
4. Applying pre- and postsynaptic Poisson spike trains would lead to a monotonic increase of the steady-state P_dis with respect to the postsynaptic frequency and to only a small decrease with respect to the presynaptic frequency (see Figure 7b).

5. A step change in the frequency of the postsynaptic Poisson spike train after a period of steady-state Poisson stimulation would lead to a decrease or increase in P_dis depending on the sign of the step change (see Figure 8b).

6. Modulating sinusoidally the frequencies of pre- and postsynaptic Poisson spike trains at lag 0 would lead to an increase in P_dis if the modulation frequency is in the range of 1 Hz but would not show any change if the modulation rate is faster than 5 Hz (see Figure 7).

7. A single presynaptic release followed by a burst of postsynaptic spikes would lead to a superlinear increase of P_dis with respect to the burst frequency before eventually saturating.

The last experiment appears to be of particular interest in light of recently observed Ca²⁺-triggered bursts in cortical pyramidal cells following coincident apical and somatic stimulations (Larkum, Zhu, & Sakmann,
1999). The threshold phenomenon according to which virtually no potentiation is seen for a 1 Hz pairing suggests that these bursts are functionally relevant in the existing form of synaptic modifications. It also suggests that an incidentally occurring tight pairing would not be able to change the synaptic memory. Finally, more natural stimulation protocols, including bursts and Poisson trains, would help to specify the activity dependencies of the learning window, which are only partially expressed in the data. They would also clarify to what extent the same synapse is able to encode individual spike timings while still being sensitive to pre- and postsynaptic average frequencies. To describe further long-term potentiation and long-term depression experiments, one may want to include effects of the postsynaptic membrane potential or the local Ca²⁺ concentration on the NMDA receptor dynamics or the secondary messenger activation. Although we have not examined this, it could easily be achieved by endowing the different rate constants appearing in the scheme with a voltage- or Ca²⁺-dependent dynamics instead of activating them instantaneously through delta functions of fixed weights. Besides the characteristic of strong synaptic depression, a large heterogeneity of synaptic dynamics, including facilitation, was found between neocortical pyramidal cells (Markram, Pikus, Gupta, & Tsodyks, 1998), suggesting that other parameters governing the temporal response could be subject to specific learning rules (Markram, Wang, & Tsodyks, 1998). One could speculate that additional computation, such as the integration of a third coincident signal provided by growth factors or neuromodulators (e.g., synaptic growth or unmasking of postsynaptic receptors; Liao & Malinow, 1995), is required for the induction of real synaptic strengthening or to gate changes in the recovery time constant.
Specifying these conditions remains an important challenge for future research.

Appendix

A.1 Differential Equations Determining the Kinetic Scheme. For further analysis we give a formulation of the kinetic scheme (see Figure 2) in terms of differential equations. Let us denote by t_pre^rel the time of a presynaptic release and by t_post^sp the time of a postsynaptic spike. A presynaptic release at t_pre^rel will saturate a fraction r_u^N of the recovered NMDA receptors. A postsynaptic spike at t_post^sp will block the fraction r_d^N of N_rec. The entire dynamics of the NMDA receptors is:

$$\frac{dN_u}{dt} = -\frac{N_u}{\tau_u^N} + r_u^N N_{rec}\,\delta(t - t_{pre}^{rel}), \qquad (A.1)$$
$$\frac{dN_d}{dt} = -\frac{N_d}{\tau_d^N} + r_d^N N_{rec}\,\delta(t - t_{post}^{sp}), \qquad (A.2)$$

$$N_{rec} = 1 - N_d - N_u,$$

where the delta function δ(…) expresses that the quantities N_u and N_d have to be augmented at the times of a presynaptic release and a postsynaptic spike by r_u^N N_rec and r_d^N N_rec, respectively. The differential equations for the secondary messengers are:

$$\frac{dS_u}{dt} = -\frac{S_u}{\tau_u^S} + r^S N_u (1 - S_u)\,\delta(t - t_{post}^{sp}), \qquad (A.3)$$

$$\frac{dS_d}{dt} = -\frac{S_d}{\tau_d^S} + r^S N_d (1 - S_d)\,\delta(t - t_{pre}^{rel}). \qquad (A.4)$$

The time constants τ_u^S and τ_d^S encompass a diffusion process and are in the range of 300 ms. The limit probability is upregulated at a postsynaptic spike if S_u exceeds some threshold S_u^h and downregulated at a presynaptic release if S_d exceeds some threshold S_d^h. According to the kinetic scheme, we have

$$\frac{dP_{dis}^{\infty}}{dt} = r_u^P \left(1 - P_{dis}^{\infty}\right) [S_u^+ - S_u^h]^+ \,\delta(t - t_{post}^{sp}) \;-\; r_d^P P_{dis}^{\infty} [S_d^+ - S_d^h]^+ \,\delta(t - t_{pre}^{rel}), \qquad (A.5)$$

where S_u^+ and S_d^+ are the values of S_u and S_d immediately after a post- and presynaptic spike,

$$S_u^+ = r^S N_u (1 - S_u) + S_u, \qquad S_d^+ = r^S N_d (1 - S_d) + S_d. \qquad (A.6)$$

Taking the expectation values in equation A.5 leads to equation 3.1. While the changes in P_dis^∞ occur instantaneously, the discharge probability P_dis slowly approaches the limit probability according to the equation

$$\frac{dP_{dis}}{dt} = \frac{P_{dis}^{\infty} - P_{dis}}{\tau_M^P}, \qquad \text{with } \tau_M^P \approx 10\ \text{min}. \qquad (A.7)$$
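The kinetic scheme of equations A.1 through A.6 can be integrated numerically by treating the delta functions as instantaneous jumps at spike times. The following sketch does this with a simple Euler scheme; all rate constants and time constants are illustrative assumptions, not the fitted values of the paper. It reproduces the basic sign of the rule: a pre-before-post pairing raises the limit probability, the reversed order lowers it.

```python
# Illustrative parameters (assumptions, not the paper's fitted values)
TAU_NU, TAU_ND = 0.1, 0.1        # NMDA time constants tau_u^N, tau_d^N (s)
TAU_SU, TAU_SD = 0.3, 0.3        # messenger time constants tau_u^S, tau_d^S (s)
R_NU, R_ND, R_S = 0.5, 0.5, 0.5  # jump fractions r_u^N, r_d^N, r^S
R_PU, R_PD = 0.1, 0.1            # learning rates r_u^P, r_d^P
TH_SU, TH_SD = 0.0, 0.0          # thresholds S_u^h, S_d^h

def run(pre_rel, post_sp, dt=1e-3, T=1.0, p_inf=0.2):
    """Euler integration of equations A.1-A.6; the delta functions become
    instantaneous jumps at the listed release/spike times."""
    nu = nd = su = sd = 0.0
    pre, post = sorted(pre_rel), sorted(post_sp)
    t = 0.0
    while t < T:
        # relaxation between events (A.1-A.4 without the delta terms)
        nu -= dt * nu / TAU_NU
        nd -= dt * nd / TAU_ND
        su -= dt * su / TAU_SU
        sd -= dt * sd / TAU_SD
        if pre and pre[0] <= t:      # presynaptic release
            pre.pop(0)
            sd_plus = R_S * nd * (1.0 - sd) + sd                       # A.6
            p_inf -= R_PD * p_inf * max(sd_plus - TH_SD, 0.0)          # A.5, down
            sd = sd_plus
            nu += R_NU * (1.0 - nu - nd)                               # A.1 jump
        if post and post[0] <= t:    # postsynaptic spike
            post.pop(0)
            su_plus = R_S * nu * (1.0 - su) + su                       # A.6
            p_inf += R_PU * (1.0 - p_inf) * max(su_plus - TH_SU, 0.0)  # A.5, up
            su = su_plus
            nd += R_ND * (1.0 - nu - nd)                               # A.2 jump
        t += dt
    return p_inf

up = run(pre_rel=[0.10], post_sp=[0.11])    # release 10 ms before the spike
down = run(pre_rel=[0.11], post_sp=[0.10])  # spike 10 ms before the release
print(up, down)
```

With the up pairing, the release loads N_u, so S_u jumps at the subsequent spike and upregulates; with the reversed order, the spike loads N_d, so S_d jumps at the subsequent release and downregulates.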
A.2 Equations Determining the Probability of Release. A synapse is assumed to have a single site of release from which, according to the univesicular hypothesis, a single vesicle discharges with probability P_dis at the arrival of a presynaptic spike. Immediately after discharge, the site of release is empty, and it is assumed to be reoccupied by a new vesicle in a Poisson process with time constant τ_rec^v ≈ 800 ms. The probability P_v that
a vesicle is at the site of release is therefore governed by the differential equation,

$$\frac{dP_v}{dt} = \frac{1 - P_v}{\tau_{rec}^v} - P_{dis} P_v\,\delta(t - t_{pre}^{sp}), \qquad (A.8)$$

where t_pre^sp is the time a presynaptic spike arrives. This equation states that at the time of a presynaptic spike, P_v is reset with probability P_dis from its actual state back to 0. The probability P_v^{n+1} of encountering a vesicle ready for release at the arrival of the (n+1)th spike is now calculated by

$$P_v^{n+1} = P_v^n (1 - P_{dis})\, e^{-\Delta_n / \tau} + \left(1 - e^{-\Delta_n / \tau}\right), \qquad \tau = \tau_{rec}^v,$$

where Δ_n = t_pre^{sp,n+1} − t_pre^{sp,n} is the time elapsed between the two spikes and P_v^n is the probability that a vesicle was ready at the previous spike. Since the probability of release for a presynaptic spike is P_rel = P_dis P_v, we get formula 2.1. The instantaneous release frequency is calculated from the presynaptic spike frequency according to

$$f_{pre}^{rel} = P_{rel}\, f_{pre}^{sp} = P_{dis} P_v\, f_{pre}^{sp}. \qquad (A.9)$$

For presynaptic Poisson spike trains of rate f_pre^sp, one can calculate from equation A.8 the steady-state release probability,

$$P_{rel}^{ss} = P_{dis} P_v^{ss} = \frac{P_{dis}}{1 + P_{dis}\, f_{pre}^{sp}\, \tau_{rec}^v}.$$

The release frequency at steady state is then given by f_pre^rel = P_rel^ss f_pre^sp. Note that due to the memory in the recovery process, the release train is not Poissonian anymore.
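The recursion for P_v^{n+1} and the Poisson steady state can be checked against each other numerically. In the sketch below (parameter values are assumptions in the range quoted above), the fixed point of the recursion for a regular 10 Hz train lies close to, but not exactly at, the Poisson mean-field value, since a regular train and a Poisson train sample the recovery process differently.

```python
import math

def p_v_next(p_v, p_dis, delta, tau_rec):
    """One step of the recursion: the old vesicle state survives a release
    attempt with probability (1 - P_dis) and relaxes toward 1 during the
    interspike interval delta (Poisson refilling, time constant tau_rec)."""
    e = math.exp(-delta / tau_rec)
    return p_v * (1.0 - p_dis) * e + (1.0 - e)

p_dis, tau_rec, f_pre = 0.5, 0.8, 10.0   # assumed values
delta = 1.0 / f_pre                       # regular interspike interval (s)

p_v = 1.0
for _ in range(1000):                     # iterate to the fixed point
    p_v = p_v_next(p_v, p_dis, delta, tau_rec)

# mean-field steady state for a Poisson train of the same rate (from A.8)
p_rel_poisson = p_dis / (1.0 + p_dis * f_pre * tau_rec)
print(p_dis * p_v, p_rel_poisson)
```

The per-spike release probability P_dis P_v from the regular-train fixed point comes out within about 5% of P_rel^ss here, illustrating why the Poisson formula is a serviceable approximation.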
A.3 Steady-State Equations for Poisson Spike Trains. We deduce the steady state of the limit probability of vesicle discharge, P_dis^{∞ ss}, for pre- and postsynaptic Poisson spike trains with fixed rates f_pre^sp and f_post^sp. Since the time constant τ_M^P for modification is large, we may assume that P_dis remains constant even during the steady-state stimulation. In the following, we calculate the steady-state value of P_dis^∞ toward which P_dis converges after the stimulation according to equation A.7. First, we consider the steady states of the NMDA receptors and the secondary messengers. Replacing the delta functions in equations A.1 through A.4 with the corresponding rates f_pre^rel and f_post^sp, we obtain the following expectation values in the steady state:

$$N_u^{ss} = \frac{\bar{r}_u^N f_{pre}^{rel}}{1 + \bar{r}_u^N f_{pre}^{rel} + \bar{r}_d^N f_{post}^{sp}}, \qquad \bar{r}_u^N = r_u^N \tau_u^N, \qquad (A.10)$$
$$N_d^{ss} = \frac{\bar{r}_d^N f_{post}^{sp}}{1 + \bar{r}_u^N f_{pre}^{rel} + \bar{r}_d^N f_{post}^{sp}}, \qquad \bar{r}_d^N = r_d^N \tau_d^N, \qquad (A.11)$$

$$S_u^{ss} = \frac{\bar{r}_u^S f_{post}^{sp} N_u^{ss}}{1 + \bar{r}_u^S f_{post}^{sp} N_u^{ss}}, \qquad \bar{r}_u^S = r^S \tau_u^S, \qquad (A.12)$$

$$S_d^{ss} = \frac{\bar{r}_d^S f_{pre}^{rel} N_d^{ss}}{1 + \bar{r}_d^S f_{pre}^{rel} N_d^{ss}}, \qquad \bar{r}_d^S = r^S \tau_d^S. \qquad (A.13)$$
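A minimal sketch of the steady-state formulas A.10 through A.13, combined below with A.15 and the resulting closed form A.16 (vanishing thresholds; the numerical rate products are hypothetical, chosen only to make a depressed ~1 Hz release train concrete):

```python
def nmda_steady(f_pre_rel, f_post_sp, ru_n=0.05, rd_n=0.05):
    """Equations A.10-A.11: steady-state up/down-gated NMDA fractions for
    Poisson rates; ru_n, rd_n stand for the products r^N * tau^N."""
    denom = 1.0 + ru_n * f_pre_rel + rd_n * f_post_sp
    return ru_n * f_pre_rel / denom, rd_n * f_post_sp / denom

def messenger_steady(nu, nd, f_pre_rel, f_post_sp, ru_s=0.15, rd_s=0.15):
    """Equations A.12-A.13: steady-state secondary messengers."""
    su = ru_s * f_post_sp * nu / (1.0 + ru_s * f_post_sp * nu)
    sd = rd_s * f_pre_rel * nd / (1.0 + rd_s * f_pre_rel * nd)
    return su, sd

def p_dis_steady(f_pre_rel, f_post_sp, r_s=0.5, rp_u=1.0, rp_d=1.0):
    """Equation A.16 (vanishing thresholds): steady-state limit probability."""
    nu, nd = nmda_steady(f_pre_rel, f_post_sp)
    su, sd = messenger_steady(nu, nd, f_pre_rel, f_post_sp)
    su_p = r_s * nu * (1.0 - su) + su            # A.15, x = u
    sd_p = r_s * nd * (1.0 - sd) + sd            # A.15, x = d
    return 1.0 / (1.0 + (rp_d * f_pre_rel * sd_p) /
                        (rp_u * f_post_sp * su_p))

# a depressed ~1 Hz release train paired with 20 Hz postsynaptic spiking
print(p_dis_steady(1.0, 20.0))
```

Plugging the result back into equation A.14 shows that the up- and downregulating fluxes balance exactly, a useful sanity check on the algebra.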
Note that due to the shared pool N_rec, the variables N_x and S_x (x ∈ {u, d}) are not independent, and, strictly speaking, we are not allowed to average the products in equations A.3 and A.4 factor by factor. However, as long as one of the frequencies f_post^sp and f_pre^rel is small compared to the inverse time constants (τ_x^N)⁻¹ and (τ_x^S)⁻¹, one of the variables N_x and S_x always has time to relax toward its mean between the jumps of the other. Indeed, for τ_rec^v ≈ 1 s, the release rate will be less than ≈ 1 Hz, while the time constants in consideration are all smaller than ≈ 0.5 second, and formulas A.12 and A.13 turn out to be a good approximation (cf. Figure 7a). Also, the error we make by treating f_pre^rel as a Poisson rate seems to be marginal. By a similar reasoning we may average the individual factors in equation A.5 to obtain

$$0 = r_u^P \left(1 - P_{dis}^{\infty}\right) \overline{[S_u^+ - S_u^h]^+}^{\,ss} f_{post}^{sp} \;-\; r_d^P P_{dis}^{\infty}\, \overline{[S_d^+ - S_d^h]^+}^{\,ss} f_{pre}^{rel}, \qquad (A.14)$$
where the superscript ss denotes the expectation value of the corresponding quantity in the steady state. If the variation of S_x is small, the quantity $\overline{[S_x^+ - S_x^h]^+}^{\,ss}$ can be approximated by $[S_x^{+\,ss} - S_x^h]^+$, and in case of vanishing thresholds S_x^h = 0, it is equal to

$$S_x^{+\,ss} = r^S N_x^{ss} \left(1 - S_x^{ss}\right) + S_x^{ss}, \qquad x \in \{u, d\}. \qquad (A.15)$$

In this case one obtains from equation A.14 the expectation value of P_dis^∞ in the steady state,

$$P_{dis}^{\infty\,ss} = \frac{1}{1 + \dfrac{r_d^P f_{pre}^{rel}\, S_d^{+\,ss}}{r_u^P f_{post}^{sp}\, S_u^{+\,ss}}}, \qquad (A.16)$$
which in general is a good approximation if S_x^+ is above the threshold S_x^h most of the time. It is instructive to study the transition from the full model to the reduced model, which considers only the cross-gating of the NMDA receptors without secondary messengers. From equation A.5 we get at steady state with
vanishing thresholds

$$\frac{dP_{dis}^{\infty}}{dt} = r_u^P \left(1 - P_{dis}^{\infty}\right) S_u^{+\,ss} f_{post}^{sp} \;-\; r_d^P P_{dis}^{\infty}\, S_d^{+\,ss} f_{pre}^{rel}. \qquad (A.17)$$
Linearizing N_x^ss and S_x^ss (see equations A.10–A.13) around zero frequencies (thereby neglecting saturation effects) and inserting into equation A.15 yields

$$S_u^{+\,ss} \approx \tilde{r}_u^N f_{pre}^{rel} \left(1 - \tilde{r}_u^S \tau_u^S f_{post}^{sp} f_{pre}^{rel}\right) + \tilde{r}_u^S \tau_u^S f_{post}^{sp} f_{pre}^{rel}, \qquad (A.18)$$

$$S_d^{+\,ss} \approx \tilde{r}_d^N f_{post}^{sp} \left(1 - \tilde{r}_d^S \tau_d^S f_{pre}^{rel} f_{post}^{sp}\right) + \tilde{r}_d^S \tau_d^S f_{pre}^{rel} f_{post}^{sp}, \qquad (A.19)$$

with appropriate constants r̃_x^X. If τ_u^S f_post^sp and τ_d^S f_pre^rel are small, the second and third terms in equations A.18 and A.19 vanish, and the synaptic change is dominated by the second-order correlations,
$$\frac{dP_{dis}^{\infty}}{dt} \approx \tilde{r}_u^P \left(1 - P_{dis}^{\infty}\right) f_{pre}^{rel} f_{post}^{sp} \;-\; \tilde{r}_d^P P_{dis}^{\infty}\, f_{post}^{sp} f_{pre}^{rel}.$$

This is true in particular if τ^S = 0 (and the secondary messengers therefore act only by a constant factor). If, on the other hand, τ_u^S f_post^sp and τ_d^S f_pre^rel are large, the secondary messengers are closer to saturation (S_x^ss ≈ 1), the second term in equations A.18 and A.19 is now dominating, and a third-order correlation enters in equation A.17. This reasoning also motivates the short forms (see equations 3.5 and 3.6) of the learning rule for the non-steady-state situation and the cases τ^S = 0 and τ^S > 1/f, respectively.

A.4 Deduction of the BCM Properties. To show the connection to the BCM theory (Bienenstock et al., 1982), we assume S_d^h = 0 and consider our learning rule, equation 3.1. Inserting the parameters into equation A.5 shows that the time constant of P_dis^∞ is an order slower than that of S_u and S_d (see Figures 7a and 8a). In the limit of slow adaptation of P_dis^∞, we may thus replace the quantities S_u^+ and S_d^+ by their steady-state values. Writing the learning rule in the form 3.2, we obtain

$$\varphi\left(f_{post}^{sp}\right) = \left(1 - P_{dis}^{\infty}\right) [S_u^+ - S_u^h]^{+\,ss}\, \frac{f_{post}^{sp}}{f_{pre}^{rel}} \;-\; \frac{r_d^P}{r_u^P}\, P_{dis}^{\infty}\, S_d^{+\,ss}. \qquad (A.20)$$
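Equation A.20 can be evaluated numerically with the steady-state expressions A.10 through A.13 and A.15, which makes the BCM shape of φ explicit. In the sketch below (all parameter values are hypothetical; a nonzero threshold S_u^h is kept, while S_d^h = 0 as assumed in the derivation), φ(0) = 0, φ is negative at a low postsynaptic rate, and positive at a high one:

```python
def bcm_phi(f_post, f_pre_rel=1.0, ru_n=0.05, rd_n=0.05, ru_s=0.15,
            rd_s=0.15, r_s=0.5, th_su=0.03, rp_ratio=1.0, p_dis=0.5):
    """Equation A.20 evaluated with the steady-state expressions A.10-A.13
    and A.15; S_d^h = 0, S_u^h = th_su, P_dis held fixed."""
    denom = 1.0 + ru_n * f_pre_rel + rd_n * f_post
    nu = ru_n * f_pre_rel / denom                               # A.10
    nd = rd_n * f_post / denom                                  # A.11
    su = ru_s * f_post * nu / (1.0 + ru_s * f_post * nu)        # A.12
    sd = rd_s * f_pre_rel * nd / (1.0 + rd_s * f_pre_rel * nd)  # A.13
    su_p = r_s * nu * (1.0 - su) + su                           # A.15
    sd_p = r_s * nd * (1.0 - sd) + sd
    gate = max(su_p - th_su, 0.0)                               # [S_u^+ - S_u^h]^+
    return (1.0 - p_dis) * gate * f_post / f_pre_rel - rp_ratio * p_dis * sd_p

print([round(bcm_phi(f), 4) for f in (0.0, 2.0, 40.0)])
```

The zero crossing between the negative and positive branches is the modification threshold θ_M of equation A.21.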
We have to show that this function satisfies the requirements of the BCM learning curves. The first requirement, φ(0) = 0, is readily checked since S_d^{+ ss} vanishes for f_post^sp = 0 according to equations A.11 and A.13. We have φ′(0) < 0 since for small f_post^sp, the second term in equation A.20 will dominate for r_d^P > 0. This is true since the first term is bounded from above by (f_post^sp)²,
and the second term is bounded from below by f_post^sp. Since for large f_post^sp it is the first term in equation A.20 that dominates, we get a second zero of φ at

$$\theta_M = \frac{r_d^P}{r_u^P}\, \frac{P_{dis}^{\infty}\, S_d^{+\,ss}}{\left(1 - P_{dis}^{\infty}\right) [S_u^+ - S_u^h]^{+\,ss}}\, f_{pre}^{rel}, \qquad (A.21)$$

with φ′(θ_M) ≥ 0. The modification threshold is implicitly driven by the postsynaptic spike rate f_post^sp, which changes S_u^{+ ss} and S_d^{+ ss} via equations A.10 through A.13, and this in turn changes P_dis^∞ via learning rule 3.2. The convergence of P_dis^∞ for fixed frequencies ensures that eventually the right-hand side of equation 3.2 vanishes, and this, by definition, happens when θ_M = f_post^sp. For example, if f_post^sp grows beyond θ_M, we have φ(f_post^sp) > 0, and the learning rule pushes P_dis^∞ up. Due to the factor P_dis^∞ / (1 − P_dis^∞), however, the threshold θ_M increases as well. Since the speed of θ_M is nonlinear in P_dis^∞, this factor dominates the (saturating) changes in S_u^{+ ss} and S_d^{+ ss}. The threshold θ_M will therefore eventually catch up and overtake f_post^sp, and we get φ(f_post^sp) < 0. According to the learning rule, P_dis^∞ now starts to decrease again. Hence, equation A.21 incorporates the sliding-threshold property with respect to f_post^sp, and this keeps the synapse sensitive to its environment.

A.5 Change of P_dis^∞ for Nonzero Spike Correlations. We first quantify
the effect of the spike correlation on the distribution of the up- and downregulating quantities N_u and N_d in their steady states. Recall that for a spike correlation of, say, c_s = 1, each postsynaptic spike follows instantaneously after a presynaptic spike. For a positive spike correlation c_s ∈ [0, 1], the effective postsynaptic frequency is reduced by the fraction r_u^N [c_s]^+ P_rel^ss, and we must replace f_post^sp by (1 − r_u^N [c_s]^+ P_rel^ss) f_post^sp in the formulas A.10 and A.11, giving the average values of N_u and N_d, respectively. Similarly, for a negative spike correlation c_s ∈ [−1, 0], the presynaptic release frequency f_pre^rel must be replaced by (1 − r_d^N [−c_s]^+) f_pre^rel in the same two formulas, and the steady states N_u^ss and N_d^ss become functions of c_s. Next, in formulas A.12, A.13, and A.15, the quantities N_u^ss and N_d^ss have to be replaced by their average values immediately after a pre- or postsynaptic spike, that is, by N_u^{ss+} = N_u^ss + r_u^N [c_s]^+ P_rel^ss N_rec^ss and N_d^{ss+} = N_d^ss + r_d^N [−c_s]^+ N_rec^ss, respectively. (Since due to the memory in the vesicle recovery process the presynaptic release train is not Poissonian, we would not be allowed to replace the delta function by the release rate to calculate the mean-field behavior. If in addition there is a nonvanishing correlation between the pre- and postsynaptic spike trains, this "simplification" indeed cannot be neglected anymore. We therefore corrected the mean-field plots in Figure 10 empirically by a factor of 1.3 in front of c_s.) If we now put S_u^h = S_d^h = 0, we get from equation A.14 the steady state of P_dis^∞ as a function of the spike correlation c_s (see
Figure 10b). The mean-field evolution of P_dis^∞ for a given spike correlation c_s ∈ [−1, 1] is obtained by the analogous replacements (while dropping the upper index ss) in formulas A.1 through A.4 and A.6.

Acknowledgments
This study was supported by the Priority Program Biotechnology of the Swiss National Foundation (grant 21-57076.99) (W. S.), the Office of Naval Research, the Human Frontier Science Program Organization, and the Israeli Academy of Sciences (H. M. and M. T.).

References

Abbott, L. (1996). Decoding neuronal firing and modeling neural networks. Quart. Rev. Biophys., 27, 291–331.

Abbott, L., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing and neuronal response variability. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Abbott, L., Varela, J., Sen, K., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.

Antonov, S., Gmiro, V., & Johnson, J. (1998). Binding sites for permeant ions in the channel of NMDA receptors and their effects on channel block. Nature Neuroscience, 1(6), 451–461.

Artola, A., & Singer, W. (1993). Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. Trends in Neurosci., 16(11), 480–487.

Bear, M., Cooper, L. N., & Ebner, F. F. (1987). A physiological basis for a theory of synapse modification. Science, 237, 42–48.

Bell, C., Han, V., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387(6630), 278–281.

Bi, G., & Poo, M. (1998). Precise spike timing determines the direction and extent of synaptic modifications in cultured hippocampal neurons. J. Neurosci., 18, 10464–10472.

Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2(1), 32–48.

Blum, K., & Abbott, L. (1996). A model of spatial map formation in the hippocampus of the rat. Neural Computation, 8(1), 85–93.

Brown, T. H., & Chatterji, S. (1994). Hebbian synaptic plasticity: Evolution of the contemporary concept. In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 287–314). Berlin: Springer.

Debanne, D., Shulz, D., & Fregnac, Y. (1995). Temporal constraints in associative synaptic plasticity in hippocampus and neocortex. Can. J. Physiol. and Pharmacol., 73, 1295–1311.
Fregnac, Y., & Shulz, D. (1994). Models of synaptic plasticity and cellular analogs of learning in the developing and adult visual cortex. In V. Casagrande & P. Shinkman (Eds.), Advances in neural and behavioral development (4th ed.). Norwood, NJ: Ablex.

Fregnac, Y., Shulz, D., Thorpe, S., & Bienenstock, E. (1988). A cellular analogue of visual cortical plasticity. Nature, 333, 367–370.

Fusi, S., Annunziato, M., Badoni, D., Salamon, A., & Amit, D. J. (2000). Spike-driven synaptic plasticity: Theory, simulation, VLSI implementation. Neural Computation, 12(10), 2257–2258.

Gerstner, W., & Abbott, L. (1997). Learning navigational maps through potentiation and modulation of hippocampal place cells. J. Comput. Neuroscience, 4, 79–94.

Gerstner, W., Kempter, R., van Hemmen, J., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.

Gerstner, W., Kempter, R., van Hemmen, J., & Wagner, H. (1998). Hebbian learning of pulse timing in the barn owl auditory system. In W. Maass & C. Bishop (Eds.), Pulsed neural networks. Cambridge, MA: MIT Press.

Grzywacz, N., & Burgi, P.-Y. (1998). Toward a biophysically plausible bidirectional Hebbian rule. Neural Computation, 10, 499–520.

Gustafsson, B., Wigström, H., Abraham, W., & Huang, Y.-Y. (1987). Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci., 7, 774–780.

Hebb, D. O. (1949). The organization of behaviour. New York: Wiley.

Hopfield, J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.

Jonas, P., & Spruston, N. (1994). Mechanisms shaping glutamate-mediated excitatory postsynaptic currents in the CNS. Curr. Opin. Neurobiol., 4(3), 366–372.

Kempter, R., Gerstner, W., & van Hemmen, J. (1999a). Hebbian learning and spiking neurons. Physical Review E, 59(4), 4498–4514.

Kempter, R., Gerstner, W., & van Hemmen, J. (1999b). Spike-based compared to rate-based Hebbian learning. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Larkum, M., Zhu, J., & Sakmann, B. (1999). A novel cellular mechanism for coupling inputs arriving at different cortical layers. Nature, 398, 338–341.

Liao, D., Hessler, N. A., & Malinow, R. (1995). Activation of postsynaptically silent synapses during pairing-induced LTP in CA1 region of hippocampal slice. Nature, 375, 400–404.

Maass, W. (1998). A simple model for neural computation with firing rates and firing correlations. Network: Computation in Neural Systems, 9(3), 381–397.

Markram, H., Helm, P., & Sakmann, B. (1995). Dendritic calcium transients evoked by single back-propagating action potentials in rat neocortical pyramidal neurons. J. Physiol. (London), 485, 1–20.

Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Markram, H., Pikus, D., Gupta, A., & Tsodyks, M. (1998). Potential for multiple mechanisms, phenomena and algorithms for synaptic plasticity at single synapses. Neuropharmacology, 37, 489–500.

Markram, H., Roth, A., & Helmchen, F. (1998). Competitive calcium binding: Implications for dendritic calcium signaling. J. Comput. Neuroscience, 5(3), 331–348.

Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.

Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential synaptic transmission via the same axon from neocortical layer 5 pyramidal neurons. PNAS, 95, 5323–5328.

Mayer, M. (1998). Ion-binding sites in NMDA receptors: Classical approaches provide the numbers. Nature Neuroscience, 1(6), 433–434.

Mayer, M., McDermott, A., Westbrook, G., Smith, S., & Barker, J. (1987). Agonist- and voltage-gated calcium entry in cultured mouse spinal cord neurons under voltage clamp measured using Arsenazo III. J. Neurosci., 7, 3230–3244.

Medina, I., Filippova, N., Bakhramov, A., & Bregestovski, P. (1996). Calcium-induced inactivation of NMDA receptor-channels evolves independently of run-down in cultured rat brain neurones. J. Physiol. (London), 495, 411–427.

Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15, 267–273.

Sejnowski, T. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321.

Senn, W., Tsodyks, M., & Markram, H. (1997). An algorithm for synaptic modification based on exact timing of pre- and post-synaptic action potentials. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicoud (Eds.), Artificial Neural Networks—ICANN'97 (pp. 121–126). Berlin: Springer-Verlag.

Spruston, N., Jonas, P., & Sakmann, B. (1995). Dendritic glutamate receptor channels in rat hippocampal CA3 and CA1 pyramidal neurons. J. Physiol. (London), 482, 325–352.

Stanton, P., & Sejnowski, T. (1989). Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature, 339, 215–218.

Triller, A., & Korn, H. (1982). Transmission at a central inhibitory synapse. III: Ultrastructure of physiologically identified and stained terminals. J. Neurophysiol., 48, 708–736.

Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.

Tsumoto, T. (1990). Long-term potentiation and depression in the cerebral neocortex. Jpn. J. Physiol., 40, 573–593.

Zhang, L., Tao, H., Holt, C., Harris, W., & Poo, M. (1998). A critical window in the cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received October 14, 1998; accepted March 21, 2000.
LETTER
Communicated by Laurence Abbott
Differential Filtering of Two Presynaptic Depression Mechanisms

Richard Bertram

Institute of Molecular Biophysics, Florida State University, Tallahassee, FL 32306, U.S.A.

The filtering of input signals carried out at synapses is key to the information processing performed by networks of neurons. Two forms of presynaptic depression, vesicle depletion and G-protein inhibition of Ca²⁺ channels, can play important roles in the presynaptic processing of information. Using computational models, we demonstrate that these two forms of depression filter information in very different ways. G-protein inhibition acts as a high-pass filter, preferentially transmitting high-frequency input signals to the postsynaptic cell, while vesicle depletion acts as a low-pass filter. We examine how these forms of depression separately and together affect the steady-state postsynaptic responses to trains of stimuli over a range of frequencies. Finally, we demonstrate how differential filtering permits the multiplexing of information within a single impulse train.

1 Introduction
Synapses are subject to a wide range of activity-dependent short-term and long-term changes in efficacy. These changes can potentiate or depress the synapse and have both presynaptic and postsynaptic sites of induction and expression. Forms of synaptic enhancement include facilitation, augmentation, posttetanic potentiation, long-term facilitation (Magleby, 1987; Zucker, 1996), and long-term potentiation (Gustaffson, Wigstrom, Abraham, & Huang, 1987). Forms of synaptic depression include postsynaptic desensitization (Hestrin, 1992), vesicle depletion (Dobrunz, Huang, & Stevens, 1997), G-protein-mediated presynaptic inhibition (Bean, 1989), and long-term depression (Stanton & Sejnowski, 1989). While the long-term forms of plasticity are thought to be linked to learning and memory (Bliss & Collingridge, 1993), the short-term forms likely act as filters. That is, short-term plasticity acts as a frequency-dependent filter of the presynaptic signal, where the selectivity is determined by the type of short-term plasticity exhibited by the particular synapse. The focus of this article is on the differential filtering actions of two forms of short-term depression, vesicle depletion and G-protein-mediated presynaptic inhibition, which both have presynaptic sites of induction and action but involve very different mechanisms. Neural Computation 13, 69–85 (2000)
© 2000 Massachusetts Institute of Technology
Figure 1: The steps involved in depression induced by vesicle depletion (a) and G-protein inhibition of release (b). High-frequency presynaptic stimuli increase transmitter release, which increases both depletion and G-protein activation. In contrast, low-frequency stimuli are most effective in mediating Ca2+ channel inhibition by activated G-proteins. Both depletion and channel inhibition reduce transmitter release during subsequent impulses.
Release of transmitter occurs when Ca2+ brought into the terminal through voltage-dependent channels binds to release sites in a readily releasable vesicle pool (Rosenmund & Stevens, 1996). During a train of impulses, this pool may deplete, reducing the probability of vesicle fusion later in the train. This form of depression is most severe during high-frequency stimulus trains (see Figure 1a). The mechanism of G-protein inhibition is more complicated (Dolphin, 1998). This occurs when chemical messengers such as GABA, adenosine, glutamate, dopamine, or serotonin bind to receptors in the presynaptic terminals that activate G-proteins. Major targets of these activated G-proteins are Ca2+ channels, primarily the P- and N-type found in presynaptic terminals (Boland & Bean, 1993; Chen & van den Pol, 1998; Stanley & Mirotznik, 1997), although there are other targets (Dittman & Regehr, 1996; Vaughan, Ingram, Connor, & Christie, 1997). Data indicate that Gβγ subunits of activated G-proteins bind to the α1 subunit of a Ca2+ channel, putting the channel into a "reluctant" state where the probability of opening is greatly reduced (Bean, 1989; De Waard et al., 1997). As a result, less Ca2+ enters the terminal during an action potential and the probability of transmitter release is reduced. An important characteristic of G-protein inhibition of Ca2+ channels is that it is often relieved by depolarization (Bean, 1989; Zamponi & Snutch, 1998). We will demonstrate that this is crucial to the filtering properties of this type of synaptic depression.
Receptors for G-protein agonists may respond to transmitter molecules released from neighboring synapses (Dittman & Regehr, 1997) or from the target synapse itself. This receptor binding leads to heterosynaptic or homosynaptic inhibition, respectively. The latter seems to be ubiquitous, since metabotropic transmitter autoreceptors have been found in many different presynaptic terminals (Chen & van den Pol, 1998; Langer, 1987; Starke, Gothert, & Kilbinger, 1989). The filtering properties of G-protein inhibition are not obvious. The amount of transmitter released and the subsequent G-protein activation are greatest during high-frequency trains. However, because activated G-proteins unbind from channels when the presynaptic terminal is depolarized, the binding affinity of G-proteins to Ca2+ channels is greatest at hyperpolarized potentials. Thus, G-protein activation is maximized by high-frequency stimulus trains, while the binding of activated G-proteins to Ca2+ channels is maximized by low-frequency trains (see Figure 1b). In this article, we use simple computational models of vesicle depletion and G-protein inhibition, coupled with a simple model of a postsynaptic cell with active membrane currents, to study the filtering properties of depletion and G-protein inhibition. We find that depletion acts as a low-pass filter, preferentially transmitting low-frequency presynaptic stimuli to the postsynaptic cell. In contrast, G-protein inhibition acts as a high-pass filter in this model, filtering out low-frequency stimuli while reliably transmitting high-frequency trains. We next voltage-clamp the model postsynaptic cell and compute the synaptic current during presynaptic impulse trains of different frequencies. We find that for a model synapse exhibiting both forms of depression, the steady-state frequency-response curve is flat over a large range of frequencies.
This is consistent with data from auditory glutamatergic synapses, which show strong depression due to both vesicle depletion and presynaptic inhibition mediated by GABAB receptors (Brenowitz, David, & Trussell, 1998). In contrast, cortical synapses show a steady-state excitatory postsynaptic current (EPSC) amplitude that is inversely proportional to the stimulus frequency (Abbott, Varela, Sen, & Nelson, 1997; Tsodyks & Markram, 1997), suggestive of depression due solely to vesicle depletion. With a flat frequency-response curve, the total synaptic signal transmitted (the product of stimulus frequency and EPSC amplitude) increases roughly linearly with stimulus frequency, while with a frequency-response curve that decays as the inverse of frequency, the total signal is relatively insensitive to stimulus frequency (Abbott et al., 1997; Tsodyks & Markram, 1997). Thus, it appears that G-protein inhibition can remove the gain control introduced by vesicle depletion. Finally, we show that by taking advantage of the differential filtering performed by depletion and G-protein inhibition, information can be multiplexed within a single impulse train. This is accomplished by encoding one stream of information at a low frequency and a second stream at a
high frequency. The information transmitted at a particular synapse is then determined by the dominant form of synaptic depression at that synapse. Information multiplexing is a particularly efficient means of increasing the information content in limited neural circuitry. 2 Methods
We consider a simple two-compartment representation of a population of synaptic connections between one or more presynaptic neurons (subject to identical input stimuli) and a postsynaptic neuron. A brief description of the presynaptic and postsynaptic models is provided here; all equations and parameter values are provided in the appendix. A complete description of a similar presynaptic model (which includes an additional mechanism for synaptic facilitation) is given in Bertram and Behan (1999). 2.1 Presynaptic Models. We assume that each docked vesicle is associated with a single Ca2+ channel located 10 nm away. We assume further that each vesicle has a single low-affinity Ca2+ binding site (KD = 170 μM) that is influenced by the microdomain of high Ca2+ concentration that forms at the inner mouth of the colocalized open channel. This site has a high Ca2+ unbinding rate and free intraterminal Ca2+ does not accumulate, so there is no facilitation of release. More complex models for the probability of transmitter release have been developed (Bertram, Sherman, & Stanley, 1996; Bertram, Smith, & Sherman, 1999; Fogelson & Zucker, 1985; Parnas, Dudel, & Parnas, 1986; Yamada & Zucker, 1992), but are not needed here. In the absence of G-protein binding, the state of the colocalized Ca2+ channel is determined by a five-state kinetic scheme based on an N-type Ca2+ channel (Boland & Bean, 1993). This represents a "willing" Ca2+ channel. A channel that is bound by a G-protein is in a "reluctant" state and has a significantly lower opening probability. The transition rate from a willing state to a reluctant state increases with the fraction of activated G-proteins, while the transition rate from a reluctant state to a willing state increases with depolarization.
The binding of a transmitter molecule to a presynaptic autoreceptor, activating a G-protein, is modeled by a first-order kinetic scheme, assuming that the mean transmitter concentration is proportional to the probability of release. The proportionality constant (𝒯) was chosen so that the mean transmitter concentration during an impulse peaks at ≈ 0.4 mM. Vesicle depletion is modeled as a first-order process with a forward rate proportional to the transmitter concentration. Forward and backward kinetic rates were chosen so that there would be significant depletion during a 70 Hz stimulus train, but not during a 5 Hz train. The refilling of the readily releasable pool is intended to represent transfer of filled vesicles from a reserve pool (Worden, Bykhovskaia, & Hackett, 1997). With the depletion model, the transmitter concentration is T = 𝒯(1 − D)R, where D is the fraction of vesicles depleted from the readily releasable pool and R is the release probability.

2.2 Postsynaptic Model. The postsynaptic membrane potential (Vpost), like the presynaptic potential, is described by the Hodgkin-Huxley equations (Hodgkin & Huxley, 1952), with an excitatory synaptic current added to the voltage equation. Binding of transmitter to postsynaptic receptors is assumed to be a first-order process, with forward rate proportional to the transmitter concentration. Receptor desensitization is not included in the model.

3 Results
Two presynaptic models for the probability of transmitter release are used initially: one that allows for vesicle depletion and one that allows for G-protein inhibition. The depletion model includes a variable, D, that describes the depletion of vesicles from the readily releasable pool, and the G-protein inhibition model includes a variable CG that is the fraction of Ca2+ channels in a reluctant state. Thus, both D and CG are measures of processes that depress transmitter release. 3.1 Differential Filtering Properties. We first examine the presynaptic responses to low- and high-frequency stimuli using 5 Hz and 70 Hz depolarizing current trains to induce presynaptic action potentials. During the high-frequency (70 Hz) stimulus, there is significant vesicle depletion since subsequent impulses occur before the pool is refilled by vesicles from a reserve pool (see Figure 2a). In the G-protein model, the high-frequency train releases a great deal of transmitter that binds to autoreceptors, producing a large fraction of activated G-proteins (see curve A in Figure 2b). However, there is little binding of activated G-proteins to Ca2+ channels (see curve CG in Figure 2b) due to the frequent depolarization. As a result, G-protein inhibition is minimal. The converse is true with low-frequency (5 Hz) stimuli. In the depletion model, the longer time between stimuli gives the readily releasable pool time to refill, so there is little depletion at the time of the subsequent stimulus (see Figure 2c). In the G-protein model, even though the fraction of G-proteins activated is less than during the high-frequency train, the longer periods of hyperpolarization between stimuli provide ample opportunity for activated G-proteins to bind to Ca2+ channels and put them into a reluctant state (see Figure 2d), reducing the Ca2+ influx, and thus the transmitter release, during subsequent stimuli.
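The qualitative competition behind Figure 2 can be reproduced with a toy first-order sketch of the G-protein activation (A) and depletion (D) kinetics (equations A.15 and A.17 of the appendix). This is an illustrative simplification, not the article's model: each impulse is idealized as a 1 ms square pulse of 0.4 mM transmitter, and the feedback of depletion onto the transmitter concentration is ignored.

```python
# Toy sketch of G-protein activation (A) and vesicle depletion (D)
# driven by periodic impulse trains, after equations A.15 and A.17.
# Simplification (not the article's full model): each impulse is a
# 1 ms square pulse of transmitter at 0.4 mM.

KA_ON, KA_OFF = 0.2, 0.0015   # mM^-1 ms^-1, ms^-1  (equation A.15)
KD_ON, KD_OFF = 0.5, 0.025    # mM^-1 ms^-1, ms^-1  (equation A.17)

def run_train(freq_hz, t_total=2000.0, dt=0.05, t_pulse=1.0, t_conc=0.4):
    """Euler-integrate A and D for an impulse train; return final values."""
    period = 1000.0 / freq_hz  # ms between impulses
    a = d = 0.0
    for i in range(int(t_total / dt)):
        t = i * dt
        T = t_conc if (t % period) < t_pulse else 0.0  # transmitter pulse
        a += dt * (KA_ON * T * (1.0 - a) - KA_OFF * a)
        d += dt * (KD_ON * T * (1.0 - d) - KD_OFF * d)
    return a, d

a_low, d_low = run_train(5.0)     # low-frequency train
a_high, d_high = run_train(70.0)  # high-frequency train
```

With these parameters the run reproduces the ordering in Figure 2: depletion is substantial only at 70 Hz, and activated G-proteins accumulate far more at 70 Hz than at 5 Hz; the opposite behavior of the reluctant-channel fraction CG requires the full channel scheme of the appendix.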
Figure 2: Vesicle depletion (D) occurs during a (a) high-frequency stimulus train but not during a (c) low-frequency train. The fraction of activated G-proteins (A) accumulates more during the (b) high-frequency train than during the (d) low-frequency train, but the fraction of Ca2+ channels in a reluctant state (CG) increases to a much higher level during the low-frequency train.

The postsynaptic responses to the high- and low-frequency trains for both presynaptic models are shown in Figures 3 and 4. The postsynaptic cell, at rest in the absence of synaptic input, generates a train of impulses in response to presynaptic action potentials when there is neither depletion nor G-protein inhibition (see Figures 3a and 4a). However, during the high-frequency train, the postsynaptic voltage in the depleting synapse population fails to reach the spike threshold after the second stimulus, so only two postsynaptic impulses are generated (see Figure 3b). At this frequency, the postsynaptic voltage of the synapse population subject to G-protein inhibition reaches spike threshold with each stimulus, so the full input signal is faithfully transmitted (see Figure 3c). The opposite is true during low-frequency stimulation. In this case, the depleting synapse population faithfully transmits the input signal (see Figure 4b), while the population subject to G-protein inhibition transmits only the first two impulses (see Figure 4c). These simulations suggest that G-protein inhibition acts as a high-pass filter, while vesicle depletion acts as a low-pass filter. We examine this further using steady-state responses over a range of stimulus frequencies, using models with depletion or G-protein inhibition alone and a model with both forms of depression.

Figure 3: Postsynaptic response to a 70 Hz train of presynaptic action potentials. The input signal is transmitted reliably to the postsynaptic cell when (a) there is no presynaptic depression or (c) transmitter release is subject to G-protein inhibition. However, the signal is not transmitted when (b) vesicle depletion occurs. Postsynaptic impulses have been chopped at 0 mV.

Figure 4: Postsynaptic response to a 5 Hz stimulus train. As before, the input signal is transmitted reliably when there is no depression (a). However, the signal is now also transmitted when transmitter release is subject to vesicle depletion (b), but not when it is subject to G-protein inhibition (c), in contrast to the high-frequency response (see Figure 3).

3.2 Steady-State Responses. The active postsynaptic currents (INa,post and IK,post) provide a voltage threshold that must be crossed for an action potential to be generated. To study the steady-state effects of depletion and G-protein inhibition, it is useful to exclude this extra nonlinearity, so we focus now on the synaptic current (Isyn) induced by presynaptic impulses when the postsynaptic cell is voltage clamped at −30 mV. Our goal is to determine the steady-state Isyn over a range of stimulus frequencies (5–100 Hz), so at each frequency the presynaptic cell is stimulated for a length of time sufficient to ensure that the amplitude of the postsynaptic current has reached steady state (see Figure 5a). The steady-state responses with different depression mechanisms are shown in Figure 5b. In the presence of depletion alone, Isyn decreases with higher stimulus frequencies (open circles). This is consistent with the low-pass filtering demonstrated earlier and the downward trend in steady-state excitatory postsynaptic potential (EPSP) amplitude observed in cortical neurons (Abbott et al., 1997; Tsodyks & Markram, 1997). A similar downward trend was shown in EPSC recordings from glutamatergic auditory synapses in the absence of the GABAB agonist baclofen (Brenowitz et al., 1998), a condition where depletion is thought to be the only form of depression. With G-protein inhibition alone, Isyn is an increasing function of stimulus frequency (open squares in Figure 5b), reflecting the high-pass filtering performed by this form of depression. When both depletion and G-protein inhibition are included in the model, Isyn first increases and then remains relatively constant over a large range of frequencies (the closed triangles in Figure 5b). Thus, the combination of these forms of depression results in a steady-state response that is relatively insensitive to stimulus frequency. Similar insensitivity was observed in auditory synapses when presynaptic G-proteins were activated by the GABAB agonist baclofen, so that vesicle depletion was supplemented by some form of G-protein inhibition (Brenowitz et al., 1998). Implications of this insensitivity, and its contrast to the linear decline observed in cortical neurons (Abbott et al., 1997; Tsodyks & Markram, 1997), will be discussed shortly.

Figure 5: Synaptic current produced by long trains of presynaptic impulses. (a) Time course of Isyn during a 70 Hz train of presynaptic impulses with G-protein inhibition but no depletion. The amplitude of the last peak is used as the steady-state value. (b) Steady-state Isyn over a range of stimulus frequencies. The maximum occurs at 5 Hz when the model synapse is subject to depletion alone (open circles). The maximum occurs at 100 Hz when G-protein inhibition is the only depression mechanism (open squares). When both depression mechanisms are included in the model (closed triangles), the response is relatively insensitive to stimulus frequencies between 40 and 100 Hz.

3.3 Information Multiplexing. Finally, we examine how synapses subject to depletion or G-protein inhibition would process an input signal with an initial low-frequency component followed by a high-frequency component. These components may be thought of as the concatenation of two independent streams of information. As was done earlier, we examine the voltage response of a postsynaptic cell with active membrane currents. Figure 6a shows the voltage response, in the absence of depression, to a train of presynaptic impulses at 10 Hz followed by a burst of impulses at 100 Hz (only the last four low-frequency responses are shown). If the model synapse is subject to depletion, the low-frequency component is transmitted, while the high-frequency component is filtered out (see Figure 6b). Thus, only information encoded at the low frequency is transmitted, while information encoded at the high frequency is ignored. Conversely, when the model synapse is subject to G-protein inhibition alone, only the high-frequency component of the signal is transmitted, while information at the low frequency is ignored (see Figure 6c). Thus, one way to transmit two independent streams of information on the same input line (a presynaptic cell) is to encode one at a low frequency and the other at a high frequency. The filtering performed at a particular synapse then determines which stream of information is transmitted by that synapse.

Figure 6: Postsynaptic voltage response to a presynaptic input signal consisting of a low-frequency (10 Hz) component followed by a high-frequency (100 Hz) component. (a) Both components are transmitted when the synapse is not subject to depression. The response to the last four low-frequency stimuli is shown, followed (at the arrow) by the high-frequency response. (b) The high-frequency component is filtered out by vesicle depletion, while the low-frequency component is transmitted. (c) G-protein inhibition filters out the low-frequency component, while the high-frequency component is transmitted.
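A toy event-based sketch can illustrate this multiplexing. Everything here beyond the rate constants taken from the appendix is a hypothetical simplification: the lumped per-impulse increments to A and D, the single relief factor standing in for depolarization-induced unbinding, and the (1 − D) and (1 − CG) release scores are not part of the article's model.

```python
import math

# Toy event-based sketch of multiplexing: a 10 Hz segment followed by
# a 100 Hz segment is read out through a depletion-only synapse and a
# G-protein-only synapse. The per-impulse increments, relief factor,
# and release scores are hypothetical simplifications.

KA_OFF = 0.0015   # G-protein inactivation rate, ms^-1   (equation A.15)
KD_OFF = 0.025    # vesicle refilling rate, ms^-1        (equation A.17)
LAMBDA = 0.00025  # G-protein unbinding from channels, ms^-1
RELIEF = 0.5      # hypothetical fraction of channel-bound G-proteins
                  # surviving the depolarization of one impulse

def k_bind(a):
    """Channel-binding rate of activated G-proteins (equation A.14), ms^-1."""
    return 0.3 * a / (68.0 + 32.0 * a)

def respond(intervals):
    """Per-impulse release scores for depletion-only and G-only synapses."""
    a = d = cg = 0.0
    depl_release, gprot_release = [], []
    for gap in intervals:          # gap = ms since the previous impulse
        a *= math.exp(-KA_OFF * gap)        # G-protein inactivation
        d *= math.exp(-KD_OFF * gap)        # vesicle-pool refilling
        k = k_bind(a)                       # channel binding toward...
        cg_inf = k / (k + LAMBDA)           # ...this equilibrium fraction
        cg = cg_inf + (cg - cg_inf) * math.exp(-(k + LAMBDA) * gap)
        depl_release.append(1.0 - d)        # score release at the impulse
        gprot_release.append(1.0 - cg)
        cg *= RELIEF                        # depolarization-induced relief
        a += 0.077 * (1.0 - a)              # lumped per-impulse activation
        d += 0.18 * (1.0 - d)               # lumped per-impulse depletion
    return depl_release, gprot_release

train = [100.0] * 30 + [10.0] * 30          # 10 Hz segment, then 100 Hz
depl, gprot = respond(train)
low_d, high_d = sum(depl[20:30]) / 10, sum(depl[50:60]) / 10
low_g, high_g = sum(gprot[20:30]) / 10, sum(gprot[50:60]) / 10
```

Under these assumptions the depletion-only readout scores the 10 Hz segment higher than the 100 Hz segment, while the G-protein-only readout does the reverse, which is the multiplexing argument in miniature.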
4 Discussion
The way that synapses filter input signals is key to the information processing done by networks of neurons. Using computational models, we have demonstrated that two forms of presynaptic depression, vesicle depletion and G-protein inhibition of Ca2+ channels, filter input signals in very different ways. Vesicle depletion acts as a low-pass filter, filtering out high-frequency signals while reliably transmitting low-frequency signals. G-protein inhibition acts as a high-pass filter, filtering out low-frequency signals while reliably transmitting high-frequency signals. This unusual filtering property is due to the depolarization-induced relief of Ca2+ channel inhibition that is typical of G-protein-mediated Ca2+ channel inhibition (Bean, 1989; Zamponi & Snutch, 1998). The differential filtering performed by vesicle depletion and G-protein inhibition provides a means for the multiplexing of neuronal information, whereby information encoded at low frequencies is transmitted by synapses in which vesicle depletion is the primary depression mechanism, and information encoded at high frequencies is transmitted preferentially by those in which G-protein inhibition of Ca2+ channels is the dominant depression mechanism. This is a particularly efficient means of increasing the information content in neural circuitry. Furthermore, alterations in gene expression of the various proteins involved in vesicle depletion and G-protein inhibition can alter the mix of these forms of depression, modifying the type of information that is transmitted by the synapse. This provides flexibility without changing the connectivity of the neural circuitry and is different from, but complementary to, systematic changes in synaptic strength. We have shown that the combination of the two forms of depression can produce a steady-state frequency-response curve that is flat over a large range of frequencies.
Thus, the effects of G-protein inhibition remove the gain control observed in some cortical neurons, where the steady-state EPSP amplitude decays at a rate inversely proportional to the stimulus frequency (Abbott et al., 1997; Tsodyks & Markram, 1997). A flat frequency-response curve has been observed in auditory neurons, where baclofen was used to activate presynaptic GABAB receptors (Brenowitz et al., 1998). Binding to this type of receptor typically activates G-proteins and has been shown to produce inhibition in different neurons using different targets, including Ca2+ channels (Chen & van den Pol, 1998; Dittman & Regehr, 1996; Takahashi, Kajikawa, & Tsujimoto, 1998; Wu & Saggau, 1995), K+ channels (Thompson & Gähwiler, 1992; Vaughan et al., 1997), or proteins downstream of those responsible for depolarization and Ca2+ entry (Dittman & Regehr, 1996). In the Brenowitz study, the G-protein target was not investigated, and indeed in our model with Ca2+ channels as the target, we were unable to produce a postsynaptic response that was greater in the presence of agonist than in its absence, as was shown in the auditory neuron. Whatever the target, the insensitivity of response amplitude implies that the total
synaptic response (frequency × EPSC amplitude) increases roughly linearly with stimulus frequency, unlike the frequency-independent total responses produced by synapses exhibiting vesicle depletion but no G-protein inhibition (Abbott et al., 1997; Tsodyks & Markram, 1997). As discussed by these groups, frequency-independent total responses provide gain control, so that low-frequency input to a postsynaptic cell is not dominated by high-frequency input from a different synapse. Our model predicts that this gain control is removed by G-protein inhibition, allowing high-frequency input to dominate input of lower frequency. The details of this study depend quantitatively on the choice of parameter values. Of particular importance here are the rate of refilling of readily releasable stores from a reserve pool of vesicles (kd−) and the rate at which activated G-proteins become inactive (ka−). We have investigated the effects on the frequency-response curve of varying these two parameters. In the depletion model, reducing the refilling rate causes the steady-state frequency-response curve for depletion (see Figure 5b) to decay more rapidly and reach a lower value. Increasing kd− has the opposite effect. In the G-protein inhibition model, reducing the G-protein inactivation rate results in extra accumulation of activated G-proteins during low-frequency stimulus trains, and consequently the steady-state frequency-response curve for G-protein inhibition is lower at low frequencies than that shown in Figure 5b. Increasing ka− has the opposite effect. When either kd− or ka− is increased or decreased by a factor of two, the steady-state frequency-response curve of a model with both forms of depression remains flat over most frequencies tested. Indeed, the range of frequencies where the response is flat is extended by reducing kd− or increasing ka−.
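The direction of the refilling-rate effect is easy to check on a stripped-down version of the depletion kinetics (equation A.17): halving kd− deepens steady-state depletion at a fixed stimulus frequency and so lowers the per-impulse release score 1 − D. The lumped per-impulse depletion increment is an assumption of this sketch, not a parameter of the article's model.

```python
import math

# Stripped-down steady-state depletion (after equation A.17) under a
# periodic impulse train, used to probe sensitivity to the refilling
# rate kd-. The per-impulse transmitter exposure is lumped into a
# fixed increment; this is an illustrative toy, not the full model.

def steady_release(freq_hz, kd_off, gain=0.18, n_impulses=400):
    """Per-impulse release score 1 - D once the train has equilibrated."""
    period = 1000.0 / freq_hz
    d = 0.0
    release = 1.0
    for _ in range(n_impulses):
        d *= math.exp(-kd_off * period)  # refilling between impulses
        release = 1.0 - d                # score just before the impulse
        d += gain * (1.0 - d)            # lumped depletion per impulse
    return release

r_control = steady_release(70.0, kd_off=0.025)
r_slow    = steady_release(70.0, kd_off=0.0125)  # refilling rate halved
r_fast    = steady_release(70.0, kd_off=0.050)   # refilling rate doubled
```

Halving kd− lowers the steady-state 70 Hz release score and doubling it raises the score, matching the direction of the sensitivity described in the text.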
In a previous computational study (Bertram & Behan, 1999), we investigated the effects of G-protein inhibition and the depolarization-induced relief of inhibition on synaptic facilitation. The main result was that facilitation can be greatly enhanced as Ca2+ channels move from a reluctant state to a willing state, so some of the facilitation generally attributed to accumulation of free or bound Ca2+ may actually be due to the relief of basal G-protein inhibition. Another point made in this earlier report was that at very low stimulus frequencies (lower than the frequencies used in this article), there will be no accumulation of activated G-proteins. This implies that G-protein inhibition acts as a low-pass filter at these very low frequencies. When the frequency is very low, activated G-proteins do not accumulate and the signal is transmitted, but at higher frequencies there is sufficient G-protein accumulation and Ca2+ channel binding to block the input signal. Thus, in our model, G-protein inhibition allows the transmission of signals at a very low frequency and at a high frequency, but filters out those with frequency in between. It has been demonstrated previously that in the absence of depletion, bursts of high-frequency presynaptic stimuli are more efficiently transmitted than are lower-frequency trains (Lisman, 1997). This is due to facilitation,
the potentiation of transmitter release that often occurs when one stimulus is preceded by one or more conditioning stimuli. Since facilitation is greater at high stimulus frequencies than at low frequencies, it acts like a high-frequency amplifier. In the light of this, our results suggest that G-protein inhibition and presynaptic facilitation have similar signal processing properties in that both preferentially transmit high-frequency signals; the former filters out low-frequency signals, while the latter amplifies high-frequency signals. The signal processing performed by these two forms of short-term plasticity contrasts greatly with the low-pass filtering carried out by vesicle depletion.

Appendix

A.1 Equations for Presynaptic Membrane Potential. Presynaptic action potentials were generated with the Hodgkin-Huxley (1952) equations by applying periodic 1 ms current pulses of magnitude Iap = 30 μA cm−2. The equations are:
Cm dV/dt = −[ḡNa x³h(V − VNa) + ḡK n⁴(V − VK) + ḡl(V − Vl) − Iap]   (A.1)
dx/dt = (x∞(V) − x)/τx(V)   (A.2)
dn/dt = (n∞(V) − n)/τn(V)   (A.3)
dh/dt = (h∞(V) − h)/τh(V)   (A.4)
where x∞(V) = αx/(αx + βx), τx(V) = 1/(αx + βx) (similarly for n∞, h∞, τn, and τh) and αx = 0.2(V + 40)/[1 − e^{−(V+40)/10}], βx = 8 e^{−(V+65)/18}, αn = 0.02(V + 55)/[1 − e^{−(V+55)/10}], βn = 0.25 e^{−(V+65)/80}, αh = 0.14 e^{−(V+65)/20}, and βh = 2/[1 + e^{−(V+35)/10}]. Capacitance, maximum conductance values, and reversal potentials are Cm = 1 μF cm−2, ḡNa = 120, ḡK = 36, ḡl = 0.3 (mS cm−2), and VNa = 50, VK = −77, and Vl = −54 (mV). It is assumed that Ca2+ current does not affect presynaptic voltage.

A.2 Equations for the Presynaptic Ca2+ Channel and Secretion Models. The equations for the G-protein-regulated presynaptic Ca2+ channels
are described in detail in Bertram and Behan (1999). Briefly, the model channel has three G-protein-bound or "reluctant" closed states (CGi), four unbound or "willing" closed states (Ci), and a single open state. Equations for closed-state probabilities are:

dC1/dt = βC2 + λCG1 − (4α + k)C1   (A.5)
dC2/dt = 4αC1 + 2βC3 + 64λCG2 − (β + 3α + k)C2   (A.6)
dC3/dt = 3αC2 + 3βC4 + (64)²λCG3 − (2β + 2α + k)C3   (A.7)
dC4/dt = 2αC3 + 4βm − (3β + α)C4   (A.8)
dCG1/dt = β′CG2 + kC1 − (4α′ + λ)CG1   (A.9)
dCG2/dt = 4α′CG1 + 2β′CG3 + kC2 − (β′ + 3α′ + 64λ)CG2   (A.10)
dCG3/dt = 3α′CG2 + kC3 − (2β′ + (64)²λ)CG3   (A.11)
The probability that a channel is open is given by the conservation relation m = 1 − C1 − C2 − C3 − C4 − CG1 − CG2 − CG3. The fraction of channels in a reluctant state, CG = CG1 + CG2 + CG3, is plotted in Figure 2. Voltage-dependent forward (α) and backward (β) kinetic rates are (in ms−1):

α = 0.9 e^{V/22},   β = 0.03 e^{−V/14}   (A.12)
α′ = α/8,   β′ = 8β   (A.13)
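The voltage dependence of the scheme can be checked numerically: holding the channel population at a fixed potential with a fixed fraction of activated G-proteins, hyperpolarization should drive channels into the reluctant states while depolarization should empty them. The sketch below Euler-integrates equations A.5–A.11 with the rates above; freezing the membrane potential and the binding rate k (at its value for A = 0.5, using equation A.14) is a simplification of this sketch, since in the full model both vary in time.

```python
import math

# Euler integration of the willing/reluctant Ca2+ channel scheme
# (equations A.5-A.11) at a fixed membrane potential, with the
# G-protein binding rate k frozen at its A = 0.5 value (equation
# A.14). Illustrative sketch only: in the full model V and A vary.

LAM = 0.00025                                # unbinding rate, ms^-1
A_FRAC = 0.5                                 # frozen activated fraction
K = 0.3 * A_FRAC / (68.0 + 32.0 * A_FRAC)    # binding rate, ms^-1

def reluctant_fraction(v, t_total=4000.0, dt=0.02):
    """Near-steady-state fraction of channels in states CG1+CG2+CG3."""
    al = 0.9 * math.exp(v / 22.0)            # alpha   (A.12)
    be = 0.03 * math.exp(-v / 14.0)          # beta    (A.12)
    alp, bep = al / 8.0, 8.0 * be            # alpha', beta'  (A.13)
    c1, c2, c3, c4 = 1.0, 0.0, 0.0, 0.0      # start all channels in C1
    cg1 = cg2 = cg3 = 0.0
    for _ in range(int(t_total / dt)):
        m = 1.0 - c1 - c2 - c3 - c4 - cg1 - cg2 - cg3  # open probability
        dc1 = be * c2 + LAM * cg1 - (4 * al + K) * c1
        dc2 = 4 * al * c1 + 2 * be * c3 + 64 * LAM * cg2 - (be + 3 * al + K) * c2
        dc3 = 3 * al * c2 + 3 * be * c4 + 64**2 * LAM * cg3 - (2 * be + 2 * al + K) * c3
        dc4 = 2 * al * c3 + 4 * be * m - (3 * be + al) * c4
        dcg1 = bep * cg2 + K * c1 - (4 * alp + LAM) * cg1
        dcg2 = 4 * alp * cg1 + 2 * bep * cg3 + K * c2 - (bep + 3 * alp + 64 * LAM) * cg2
        dcg3 = 3 * alp * cg2 + K * c3 - (2 * bep + 64**2 * LAM) * cg3
        c1 += dt * dc1
        c2 += dt * dc2
        c3 += dt * dc3
        c4 += dt * dc4
        cg1 += dt * dcg1
        cg2 += dt * dcg2
        cg3 += dt * dcg3
    return cg1 + cg2 + cg3

cg_hyper = reluctant_fraction(-65.0)  # resting/hyperpolarized
cg_depol = reluctant_fraction(0.0)    # depolarized
```

Under these assumptions the reluctant fraction settles near k/(k + λ) at −65 mV but collapses at 0 mV, because at depolarized potentials the willing channels leave the binding states C1–C3 while reluctant channels reach CG3, from which unbinding (rate (64)²λ) is fast.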
The G-protein unbinding and binding rates are (in ms−1) λ = 0.00025 and:

k = 0.3A/(68 + 32A)   (A.14)
where A is the fraction of activated G-proteins or the fraction of presynaptic autoreceptors bound by neurotransmitter. This fraction changes in time according to:

dA/dt = ka+ T(1 − A) − ka− A   (A.15)
where T is the transmitter concentration and ka+ = 0.2 mM−1 ms−1, ka− = 0.0015 ms−1. For simplicity, it is assumed that transmitter release sites have a single Ca2+ binding site and that presynaptic Ca2+ does not accumulate. The release probability (R) is given by:

dR/dt = kr+ Cad(1 − R) − kr− R   (A.16)
where kr+ = 0.015 μM−1 ms−1, kr− = 2.5 ms−1, and Cad is the domain Ca2+ concentration (in μM) at the mouth of an open Ca2+ channel. Depletion of releasable vesicles, D, is described by:

dD/dt = kd+ T(1 − D) − kd− D   (A.17)
where T = 𝒯(1 − D)R is the transmitter concentration released during an impulse, 𝒯 = 2 mM is a proportionality constant, and kd+ = 0.5 mM−1 ms−1, kd− = 0.025 ms−1.

A.3 Equations for Postsynaptic Membrane Potential. The equations for postsynaptic potential are similar to equations A.1–A.4, with the addition of a synaptic current in the voltage (Vpost) equation:
Cm dVpost/dt = −(INa,post + IK,post + Il,post + Isyn)   (A.18)
where INa,post, IK,post, and Il,post are functions of Vpost and are similar to their presynaptic counterparts. The synaptic current is Isyn = ḡsyn b(Vpost − Vsyn), where ḡsyn = 0.3 mS cm−2 is the maximum synaptic conductance, Vsyn = 0 mV is the reversal potential, and b is the fraction of bound receptors, given by:

db/dt = kb+ T(1 − b) − kb− b   (A.19)
The binding and unbinding rates, kb+ = 2 mM−1 ms−1 and kb− = 1 ms−1, are appropriate for "fast" receptors (Destexhe, Mainen, & Sejnowski, 1994). All differential equations were solved numerically using the software package XPPAUT (Ermentrout, 1996).

Acknowledgments
We thank Arthur Sherman and Victor Matveev for helpful discussions, and anonymous reviewers for helpful suggestions. This work was supported by NSF awards DMS-9981822 and DBI-9602233.

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224. Bean, B. P. (1989). Neurotransmitter inhibition of neuronal calcium currents by changes in channel voltage dependence. Nature, 340, 153–156. Bertram, R., & Behan, M. (1999). Implications of G-protein-mediated Ca2+ channel inhibition for neurotransmitter release and facilitation. J. Comput. Neurosci., 7, 197–211. Bertram, R., Sherman, A., & Stanley, E. F. (1996). Single-domain/bound calcium hypothesis of transmitter release and facilitation. J. Neurophysiol., 75, 1919–1931. Bertram, R., Smith, G. D., & Sherman, A. (1999). A modeling study of the effects of overlapping Ca2+ microdomains on neurotransmitter release. Biophys. J., 76, 735–750.
Bliss, T. V., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39. Boland, L. M., & Bean, B. P. (1993). Modulation of N-type calcium channels in bullfrog sympathetic neurons by luteinizing hormone-releasing hormone: Kinetics and voltage dependence. J. Neurosci., 13, 516–533. Brenowitz, S., David, J., & Trussell, L. (1998). Enhancement of synaptic efficacy by presynaptic GABAB receptors. Neuron, 20, 135–141. Chen, G., & van den Pol, A. N. (1998). Presynaptic GABAB autoreceptor modulation of P/Q-type calcium channels and GABA release in rat suprachiasmatic nucleus neurons. J. Neurosci., 18, 1913–1922. De Waard, M., Liu, H., Walker, H., Scott, V. E. S., Gurnett, C. A., & Campbell, K. P. (1997). Direct binding of G-protein βγ complex to voltage-dependent calcium channels. Nature, 385, 446–450. Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comput. Neurosci., 1, 195–230. Dittman, J. S., & Regehr, W. G. (1996). Contributions of calcium-dependent and calcium-independent mechanisms to presynaptic inhibition at a cerebellar synapse. J. Neurosci., 16, 1623–1633. Dittman, J. S., & Regehr, W. G. (1997). Mechanism and kinetics of heterosynaptic depression at a cerebellar synapse. J. Neurosci., 17, 9048–9059. Dobrunz, L. E., Huang, E. P., & Stevens, C. F. (1997). Very short-term plasticity in hippocampal synapses. Proc. Natl. Acad. Sci. USA, 94, 14843–14847. Dolphin, A. C. (1998). Mechanisms of modulation of voltage-dependent calcium channels by G proteins. J. Physiol. (Lond.), 506, 3–11. Ermentrout, G. B. (1996). XPPAUT—the differential equations tool. Available at: www.pitt.edu/bardware. Fogelson, A. L., & Zucker, R. S. (1985). Presynaptic calcium diffusion from various arrays of single channels. Biophys. J., 48, 1003–1017. Gustaffson, B., Wigstrom, H., Abraham, W. C., & Huang, Y.
Y. (1987). Longterm potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci., 7, 774–780. Hestrin, S. (1992). Activation and desensitization of glutamate-activated channels mediating fast excitatory synaptic currents in the visual cortex. Neuron, 9, 991–999. Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544. Langer, S. Z. (1987). Presynaptic regulation of monoaminergic neurons. In H. Y. Meltzer (Ed.), Psychopharmacology: The third generation of progress (pp. 151– 157). New York: Raven Press. Lisman, J. E. (1997). Bursts as a unit of neural information: Making unreliable synapses reliable. Trends Neurosci., 20, 38–43. Magleby, K. (1987). Short-term changes in synaptic efcacy. In G. M. Edelman, L. E. Gall, W. Maxwell, & W. M. Cowan (Eds.), Synaptic function (pp. 21–56). New York: Wiley.
Presynaptic Filtering Mechanisms
85
Parnas, H., Dudel, J., & Parnas, I. (1986). Neurotransmitter release and its facilitation in the craysh. VII. Another voltage dependent process besides Ca entry controls the time course of phasic release. P ugers ¨ Arch., 406, 121–130. Rosenmund, C., & Stevens, C. F. (1996). Denition of the readily releasable pool of vesicles at hippocampal synapses. Neuron, 862, 1197–1207. Stanley, E. F., & Mirotznik, R. R. (1997). Cleavage of syntaxin prevents G-protein regulation of presynaptic calcium channels. Nature, 385, 340–343. Stanton, P. K., & Sejnowski, T. J. (1989). Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature, 339, 215–218. Starke, K., Gothert, M., & Kilbinger, H. (1989). Modulation of neurotransmitter release by presynaptic autoreceptors. Physiol. Rev., 69, 864–989. Takahashi, T., Kajikawa, Y., & Tsujimoto, R. (1998). G-protein-coupled modulation of presynaptic calcium currents and transmitter release by a GABAB receptor. J. Neurosci., 18, 3138–3146. Thompson, S. H., & Ga¨ hwiler, B. H. (1992).Comparison of the actions of baclofen at pre- and postsynaptic receptors in the rat hippocampus in vitro. J. Physiol. (Lond.), 451, 329–345. Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723. Vaughan, C. W., Ingram, S. L., Connor, M. A., & Christie, M. J. (1997). How opioids inhibit GABA-mediated neurotransmission. Nature, 390, 611–614. Worden, M. K., Bykhovskaia, M., & Hackett, J. T. (1997). Facilitation at the lobster neuromuscular junction: A stimulus-dependent mobilization model. J. Neurophysiol., 78, 417–428. Wu, L.-G., & Saggau, P. (1995). GABAB receptor-mediated presynaptic inhibition in guinea-pig hippocampus is caused by reduction of presynaptic Ca2 C inux. J. Physiol. (Lond.), 485, 649–657. Yamada, W. M., & Zucker, R. S. (1992). 
Time course of transmitter release calculated from simulations of a calcium diffusion model. Biophys. J., 61, 671–682. Zamponi, G. W., & Snutch, T. P. (1998). Decay of prepulse facilitation of N type calcium channels during G protein inhibition is consistent with binding of a single Gbc subunit. Proc. Natl. Acad. Sci. USA, 95, 4035–4039. Zucker, R. S. (1996). Exocytosis: A molecular and physiological perspective. Neuron, 17, 1049–1055. Received August 3, 1999; accepted March 23, 2000.
LETTER
Communicated by Misha Tsodyks
A Statistical Theory of Long-Term Potentiation and Depression

John M. Beggs
National Institute of Mental Health, Lab of Neural Network Physiology, Bethesda, MD 20892-4075, U.S.A.

The synaptic phenomena of long-term potentiation (LTP) and long-term depression (LTD) have been intensively studied for over twenty-five years. Although many diverse aspects of these forms of plasticity have been observed, no single theory has offered a unifying explanation for them. Here, a statistical "bin" model is proposed to account for a variety of features observed in LTP and LTD experiments performed with field potentials in mammalian cortical slices. It is hypothesized that long-term synaptic changes will be induced when statistically unlikely conjunctions of pre- and postsynaptic activity occur. This hypothesis implies that finite changes in synaptic strength will be proportional to information transmitted by conjunctions and that excitatory synapses will obey a Hebbian rule (Hebb, 1949). Using only one set of constants, the bin model offers an explanation as to why synaptic strength decreases in a decelerating manner during LTD induction (Mulkey & Malenka, 1992); why the induction protocols for LTP and LTD are asymmetric (Dudek & Bear, 1992; Mulkey & Malenka, 1992); why stimulation over a range of frequencies produces a frequency-response curve similar to that proposed by the BCM theory (Bienenstock, Cooper, & Munro, 1982; Dudek & Bear, 1992); and why this curve would shift as postsynaptic activity is changed (Kirkwood, Rioult, & Bear, 1996). In addition, the bin model offers an alternative to the BCM theory by predicting that changes in postsynaptic activity will produce vertical shifts in the curve rather than merely horizontal shifts.

1 Introduction
Long-term potentiation (LTP) and long-term depression (LTD) are persistent changes in synaptic efficacy that can be induced by electrical stimulation. These phenomena have been intensively studied since 1973 (Bliss & Lomo, 1973), motivated by the hypothesis that synaptic changes affect information processing in the brain (reviewed in Bliss & Collingridge, 1993; Beggs et al., 1999). Recently, simple protocols have been developed so that either LTP or LTD can be reliably induced in field potentials in the mammalian hippocampus (Dudek & Bear, 1992; Mulkey & Malenka, 1992) and neocortex (Kirkwood, Dudek, Gold, Aizenman, & Bear, 1993).

Neural Computation 13, 87–111 (2000)
© 2000 Massachusetts Institute of Technology
Although these protocols are now widely accepted, an understanding of why they work has taken years to develop and is still incomplete. For example, a paradox surrounds the asymmetry of induction protocols. Induction of LTP requires only about 100 presynaptic stimulation pulses, while induction of LTD requires 900 pulses. This seems unbalanced, since these protocols typically produce amplitude or slope changes of nearly equal magnitude (Dudek & Bear, 1992; Mulkey & Malenka, 1992; Kirkwood, Rioult, & Bear, 1996). Existing models of synaptic plasticity would not predict this disparity (Hebb, 1949; Bienenstock, Cooper, & Munro, 1982; Bear, Cooper, & Ebner, 1987; Lisman, 1989). A seemingly unrelated phenomenon can be observed during LTD induction. As 1 Hz stimulation is delivered, the magnitude of the synaptic response declines in a decelerating manner (Mulkey & Malenka, 1992). No theory has been offered to explain this phenomenon or to relate it to other observations about LTD. Another question concerns the reduced magnitude of depression seen in slices of visual cortex taken from dark-reared rats. The well-known BCM theory (Bienenstock et al., 1982) has been used with some success to describe the results of LTP and LTD experiments in rat visual cortex (Kirkwood et al., 1996). The BCM theory does not predict that the magnitude of LTD observed in slices from dark-reared animals would be reduced, only that the frequency at which LTD is induced should be lower. Yet experimental results show that the magnitude of LTD is reduced in slices from dark-reared animals relative to controls (Kirkwood et al., 1996). Taken together, these issues suggest that there may be gaps in our knowledge about how synaptic plasticity is induced. According to the bin model, all of these phenomena may be a consequence of the following hypothesis: Long-term synaptic changes will be induced when statistically unlikely conjunctions of pre- and postsynaptic activity occur.
This implies that changes in synaptic strength will be proportional to information transmitted by conjunctions, subject to the constraint that synaptic strengths are bounded. This hypothesis is a novel fusion of two earlier lines of thought concerned with synaptic conjunctions and information theory. The idea that conjunctions of pre- and postsynaptic activity could cause changes in synaptic strengths was first proposed by Hebb (1949) and has since been investigated extensively through both physiological (e.g., Kelso, Ganong, & Brown, 1986; Kirkwood & Bear, 1994) and modeling studies (e.g., Linsker, 1988; Zador, Koch, & Brown, 1990). The hypothesis that synaptic strengths are related to transmitted information was originally proposed by Uttley (1966). Uttley argued that synaptic strength was proportional to the negative of the mutual information between the presynaptic input and the postsynaptic output. This negative relationship allowed for negative feedback and a highly stable system, with synaptic strengths often tending toward zero (Uttley, 1970, 1979). In contrast, Brindley (1969) and Marr (1970) proposed that synaptic strengths were related to the positive of the mutual information function, thus leading to
bistable synapses. Whenever the mutual information was above a threshold, the synapse would assume a single positive strength; whenever the mutual information was below this threshold, the synapse would be set to zero strength. In a slightly different vein, Linsker (1988) showed that if synapses obeyed a Hebbian rule, they would maximize information transfer. The bin model described here proposes that synaptic strengths are proportional to the positive of the information transmitted by conjunctions (not mutual information). Further, the bin model shows that if synaptic strength is proportional to information transmitted by conjunctions, then the synapse will obey a Hebbian rule (different from Linsker's assertion). Here, these ideas are combined in a new way to account for the aggregate behavior of synapses seen in LTP and LTD experiments using field potentials in mammalian cortical slices.

2 The Bin Model

2.1 Probability and Information. Consider a model synapse with an identified presynaptic input and postsynaptic output, as shown in Figure 1A. Imagine that there is a time-stamped record of all the spikes that are fired on each of these axons over a given period of time. Imagine further that this record is divided into many small time bins of equal size. Let us define three temporal arrangements that could occur between a presynaptic and a postsynaptic spike:
A hit will occur whenever a presynaptic and a postsynaptic spike occur in the same bin.

A near-miss will occur whenever a presynaptic spike occurs in a vacant bin immediately after a postsynaptic spike.

A miss will occur whenever a presynaptic spike occurs in a vacant bin and does not produce a near-miss.

We can make our definitions of these events arbitrarily precise by choosing the bin size to be suitably small (see the appendix for a more thorough treatment of binning). Now let us calculate how often the presynaptic input and the postsynaptic output will produce hits. To determine chance performance, consider how often hits would occur if Npr presynaptic and Npo postsynaptic spikes were randomly "dropped" into a row of Nb time bins, with the restriction that neither the presynaptic nor the postsynaptic cell may drop more than one of its own spikes into a bin (see Figure 1B). Given this model, the probability that n hits would occur by chance is described by the hypergeometric distribution (see Figure 1C), which is well known in the statistical literature
(e.g., Johnson & Kotz, 1969; Meyer, 1970):
P(n | Npr, Npo, Nb) = C(Npo, n) C(Nb − Npo, Npr − n) / C(Nb, Npr),   (2.1)

(here C(a, b) denotes the binomial coefficient "a choose b")
where:

P(n | Npr, Npo, Nb) = the probability that n hits will occur, given Npr, Npo, Nb
n = the number of times a pre- and postsynaptic spike will be in the same bin
Npr = the number of presynaptic spikes in the time period
Npo = the number of postsynaptic spikes in the time period
Nb = the number of bins in the time period

with the restriction that n ≤ Npr ≤ Npo ≤ Nb.¹
For large values of (Npo / Nb), the hypergeometric distribution approaches the binomial distribution (Meyer, 1970). The distribution will attain a maximum at npeak (Johnson & Kotz, 1969), given by:

npeak = floor[ (Npr + 1)(Npo + 1) / (Nb + 2) ].   (2.2)
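Equations 2.1 and 2.2 can be evaluated directly with exact integer arithmetic; a minimal sketch (the function names are mine, not the article's):

```python
from math import comb

def hit_probability(n, n_pre, n_post, n_bins):
    """Eq. 2.1: chance probability of exactly n hits (hypergeometric)."""
    return comb(n_post, n) * comb(n_bins - n_post, n_pre - n) / comb(n_bins, n_pre)

def peak_hits(n_pre, n_post, n_bins):
    """Eq. 2.2: the most probable number of hits (integer floor)."""
    return (n_pre + 1) * (n_post + 1) // (n_bins + 2)
```

For the example of Figure 1 (Npr = 5, Npo = 6, Nb = 25), the probabilities over n = 0, ..., 5 sum to 1 and the distribution peaks at npeak = 1.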
Figure 1: Facing page. Bin model synapse. (A) Npr represents the number of spikes fired by the presynaptic neuron (black ovals) in the given time period, Npo represents the number of spikes fired by the postsynaptic neuron (open ovals) in the same time period, and Nb is the number of bins in the time period. A hit occurs whenever the pre- and postsynaptic cells fire spikes in the same time bin. This is identified by an arrow. In this case, there are three hits (n = 3). (B) To calculate how many hits would occur by chance, the model assumes that spikes are dropped randomly into a row of time bins. In this example, five presynaptic and six postsynaptic spikes are dropped into 25 time bins. (C) The hypergeometric distribution describes the probability, P, of obtaining each number of hits, n, for the example chosen.

1. For the mathematics to be correct, Npr must always be less than or equal to Npo. Thus, Npo should always represent the cell that fired the most spikes. In this article, Npo will be assigned to the postsynaptic cell for the sake of clarity only. Of course, the model would work equally well if Npo were assigned to the presynaptic cell if it fired the most spikes, but that would make the nomenclature confusing.
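The three event types defined in section 2.1 can also be made concrete in code. The encoding below (spike trains as equal-length 0/1 lists, one entry per bin) and the function name are illustrative assumptions, not part of the model's formal definition:

```python
def classify_events(pre, post):
    """Count hits, near-misses, and misses for two binned spike trains.

    pre, post: equal-length sequences of 0/1, one entry per time bin
    (at most one spike per cell per bin).
    """
    hits = near_misses = misses = 0
    for i, spike in enumerate(pre):
        if not spike:
            continue
        if post[i]:
            hits += 1              # pre and post spike share a bin
        elif i > 0 and post[i - 1]:
            near_misses += 1       # pre spike just after a post spike
        else:
            misses += 1            # pre spike in a vacant bin, not a near-miss
    return hits, near_misses, misses
```

For example, `classify_events([1, 0, 1, 1], [1, 1, 0, 0])` returns one hit, one near-miss, and one miss.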
This model implies that conditional probability is related to information transmission between the presynaptic input and the postsynaptic output. If the postsynaptic cell has access to the number of hits, n, along with knowledge of Npr, Npo, and Nb, then it can determine the presynaptic spike train to some extent. If it does not have access to n, however, the postsynaptic cell cannot make any distinctions among the set of possible presynaptic spike trains given by Npr and Nb. Therefore, n can be thought of as a message that conveys information about the presynaptic spike train to the postsynaptic cell. Using the formulation developed by Shannon (1948), the amount of information about the presynaptic spike train conveyed by n to the postsynaptic cell is given by:

I(n, Npr, Npo, Nb) = −ln(P(n | Npr, Npo, Nb)).

To normalize this expression, we may add a constant so that no information will be conveyed when the system is in its most probable state (when n = npeak):

I(n, Npr, Npo, Nb) = −ln(P(n | Npr, Npo, Nb)) + C,

where C = ln(Ppeak). This suggests the substitution:

W = P(n | Npr, Npo, Nb) / Ppeak,

where W is the normalized probability of occurrence: 0 ≤ W ≤ 1. Now the information conveyed by n may be rewritten as:

I(n, Npr, Npo, Nb) = −ln(W).   (2.3)
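Equation 2.3 follows mechanically from equation 2.1; a self-contained sketch (function names are mine; the natural log gives nats, and dividing by ln 2, as in Figure 2, would give bits):

```python
from math import comb, log

def hit_probability(n, n_pre, n_post, n_bins):
    """Eq. 2.1: hypergeometric chance probability of exactly n hits."""
    return comb(n_post, n) * comb(n_bins - n_post, n_pre - n) / comb(n_bins, n_pre)

def information(n, n_pre, n_post, n_bins):
    """Eq. 2.3: I = -ln(W), where W = P(n) / P(n_peak)."""
    n_peak = (n_pre + 1) * (n_post + 1) // (n_bins + 2)   # eq. 2.2
    w = hit_probability(n, n_pre, n_post, n_bins) / hit_probability(n_peak, n_pre, n_post, n_bins)
    return -log(w)
```

With Npr = 5, Npo = 6, Nb = 25, the information is zero at n = npeak = 1 and grows toward the tails of the distribution.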
Note that W = 1 when n = npeak, while W will approach zero when n is at either tail of the distribution. Intuitively, equation 2.3 means that when n is much larger or much smaller than the peak value npeak, then W is small and n conveys much information. If n is near npeak, however, then W is relatively large and the information conveyed by n is low. This is illustrated in Figure 2, where the information content is shown for each number of hits in the case where there are 5 presynaptic spikes, 25 bins, and several different numbers of postsynaptic spikes. As can be seen from the figure, the most extreme values of n convey the most information.

2.2 Probability Related to Changes in Synaptic Strength. Now plausible values for Nb, Npr, n, and Npo will be chosen so that probability values produced by the bin model can be related to changes in synaptic strength reported in field potential experiments of LTP and LTD. The exact values of these parameters are relatively unimportant, since it will be shown later
Figure 2: Unlikely numbers of hits convey the most information. (A, left) Probability of occurrence, P, for each number of hits, n, is plotted when Npr = 5, Npo = 6, and Nb = 25. (A, right) The information (from equation 2.3, but given in bits) conveyed about the presynaptic spike train is plotted for each number of hits. Note that the information is lowest when n is near the peak value, npeak = 1, and highest when n is much greater than npeak. (B, left) When the postsynaptic firing Npo is increased to 13, but Npr and Nb remain the same, the distribution is shifted to the right. (B, right) Information is again lowest when n is near the peak value npeak = 3, but highest when n is either greater than or less than npeak. (C, left) When the postsynaptic firing Npo is further increased to 19 but Npr and Nb remain the same, the distribution is shifted even farther to the right. (C, right) Information is highest when n is much less than npeak. This also shows that changes in postsynaptic firing rate can alter the amount of information conveyed by a given number of hits.
(in Figure 6) that the bin model is quite robust and produces qualitatively similar results for a wide range of parameter values. In what follows, it will be assumed that hits will lead to potentiation, while near-misses and misses will lead to depression. It will be shown that highly unlikely events will lead to large changes in synaptic strength, while moderately unlikely events will lead to only moderate changes in synaptic strength. In the case of two relatively sparse spike trains (where bins are more often empty than occupied), this implies that if a given number of hits will lead to strong potentiation, then the same number of near-misses will lead to equally strong depression. In contrast, the same number of misses will lead to only mild depression, since misses are not as unlikely as hits or near-misses if the spike trains are sparse. In choosing a bin size for the model, the work of Bi & Poo (1998) is informative. Using paired recordings from hippocampal neurons in culture, Bi and Poo have shown that potentiation is triggered when a presynaptic spike precedes a postsynaptic spike by 20 ms or less, an event that we may interpret as a hit (see the appendix for a more thorough treatment of binning and the calculation of probabilities associated with hits). In contrast, depression is triggered when a presynaptic spike follows a postsynaptic spike by 20 ms or less, an event that we may interpret as a near-miss. Although Bi and Poo did not closely examine misses, this type of event is prevalent in LTD experiments using field potentials (and will be discussed later). Motivated by the work of Bi and Poo, a bin size of 20 ms will be chosen for the model, and no more than one hit will be counted in a given bin. How many bins (Nb) should be used? In experiments using hippocampal slices, Huang, Colino, Selig, and Malenka (1992) have shown that prior synaptic activity can influence the later induction of LTP.
When a weak tetanus (30 Hz, 0.15 s) was delivered 20 minutes before 100 Hz stimulation, the induction of LTP was inhibited. In contrast, the same treatment given 80 minutes before 100 Hz stimulation was without effect. This suggests that if there is a finite time window during which spikes are integrated for long-term plasticity, that window is longer than 20 minutes but shorter than 80 minutes. In other experiments, Xu, Anwyl, and Rowan (1997) reported that transient exposure to stress could affect the induction of LTD in awake rats for 5 minutes after the stress was removed. When the stress had been removed for 20 minutes, however, no effects were seen. These data would suggest a time window at least 5 minutes long but shorter than 20 minutes. Taking into consideration data from both studies, a time period of 20 minutes will be chosen for the purposes of this model. When this 20-minute period is divided by 20 ms, it yields 60,000 bins (Nb = 60,000). In the slice, the number of presynaptic spikes fired (Npr) during the given time period will be determined by the number of stimulation pulses delivered. For the case of LTP induction, Npr will be 120, since this is the number of presynaptic pulses delivered during "theta burst" stimulation, a widely accepted protocol that has proved effective in inducing LTP in mammalian
cortical structures (Kirkwood et al., 1993). In response to theta burst, it will be assumed that summation of postsynaptic potentials will cause the cell to fire 30 spikes, one spike for each burst of 4 pulses delivered at 100 Hz. This is motivated by the fact that postsynaptic potentials delivered at 100 Hz with near-threshold stimulation intensity frequently summate to produce spikes within 3 to 5 pulses (Beggs, 1998). Thus, n will be 30 for the case of LTP induction with 100 Hz. For the case of LTD, Npr will be 900, since this is the number of pulses commonly used for induction of depression (Dudek & Bear, 1992; Mulkey & Malenka, 1992). In response to 900 pulses delivered at 1 Hz, it will be assumed that the postsynaptic cell fires no action potentials. This is reasonable, since field potentials recorded during LTD induction do not show population spikes, indicating that action potentials are not fired (Dudek & Bear, 1992; Mulkey & Malenka, 1992). Thus, n will be zero during LTD induction. Next, the number of spikes for the postsynaptic cell (Npo) must be chosen. Although spontaneous activity in the slice is nearly zero, there is evidence to suggest that experimental conditions before slicing may nonetheless influence synaptic plasticity after slices are prepared. For example, days of light deprivation (Kirkwood et al., 1996) have been shown to influence synaptic plasticity dramatically in slices of visual cortex taken from experimental animals, relative to controls. From these data, it is plausible that some biophysical mechanism becomes tuned to the average firing rate and retains that tuning even after slices are prepared (see Quinlan, Philpot, Huganir, & Bear, 1999 for a candidate mechanism). Because light deprivation seems to have the effect of reducing the threshold for potentiation equally on all inputs to a given cell, we will assume that this effect is postsynaptically mediated and may be represented by Npo.
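Using the bin counts fixed above (Nb = 60,000; Npr = 120 with n = 30 for LTP; Npr = 900 with n = 0 for LTD), and anticipating the value Npo = 1800 chosen in the next paragraph, the model's W values can be computed exactly; a sketch (function name is mine; Python's arbitrary-precision integers keep the enormous binomial coefficients exact):

```python
from math import comb

def normalized_probability(n, n_pre, n_post, n_bins):
    """W = P(n) / P(n_peak) from the hypergeometric bin model (eqs. 2.1-2.3)."""
    def p(k):
        return comb(n_post, k) * comb(n_bins - n_post, n_pre - k) / comb(n_bins, n_pre)
    n_peak = (n_pre + 1) * (n_post + 1) // (n_bins + 2)
    return p(n) / p(n_peak)

N_BINS, N_POST = 60_000, 1_800   # 20 min of 20 ms bins; 1.5 Hz postsynaptic rate

# LTP protocol: 120 theta-burst pulses driving 30 postsynaptic spikes (30 hits)
w_ltp = normalized_probability(30, 120, N_POST, N_BINS)

# LTD protocol: zero hits after the first 100 of the 900 pulses at 1 Hz
w_ltd_100 = normalized_probability(0, 100, N_POST, N_BINS)
```

The 30 hits produced by theta burst are astronomically improbable by chance (W many orders of magnitude below 1), while 100 pulses with zero hits give W near 0.21, matching the second row of Table 1 below.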
In the unanesthetized mammal, the average firing rate of a cortical or hippocampal pyramidal neuron is typically 1 Hz or slightly higher (e.g., Thompson, Deyo, & Disterhoft, 1990; Czepita, Reid, & Daw, 1994; Collins & Pare, 1999). For the purposes of this model, a postsynaptic firing rate of 1.5 Hz will be chosen. In 20 minutes, this firing rate will produce 1800 spikes (Npo = 1800).

Now that we have candidate values for Nb, Npr, n, and Npo, we may use them to calculate probabilities and relate them to observed changes in synaptic strength. We may develop a correspondence between the probabilities generated by the bin model and the amplitude changes observed during the induction of LTD. During LTD induction, 900 presynaptic pulses are delivered, and no postsynaptic spikes are produced. This is evidenced by the fact that the field potentials in studies of LTD do not show population spikes. As these 900 pulses are delivered, the magnitude of the synaptic response shows a progressive decline (Dudek & Bear, 1992; Mulkey & Malenka, 1992). This decline is seen in Figure 3A, which is an average of seven experiments and is typical of LTD induction observed with field potentials (Mulkey & Malenka, 1992). Note that 1 Hz stimulation initially causes enhancement, but this is a short-term effect and mostly rebounds at the end of stimulation. Because this article describes a model of long-term potentiation and depression, we will not attempt to fit this short-term effect, but will focus only on effects that produce more enduring synaptic changes (see Varela et al., 1997, for a description of short-term plasticity in rat cortex). The magnitude of depression for a given point along the declining curve can be matched to the number of pulses that have been delivered in the LTD induction protocol up to that point (see Figure 3B). Consider an excitatory synapse with initial strength S that is observed to show long-term depression of an amount ΔS by the end of an experiment with 900 pulses. Before the synapse begins its decline in efficacy, the observed depression is 0.00 ΔS. After 100 pulses, the observed depression is approximately 0.22 ΔS. After 300 pulses, the observed depression is approximately 0.63 ΔS. By the end of the experiment, the depression is 1.00 ΔS. These correspondences are taken from Figure 3B and listed in Table 1.

Table 1: Parameters for the LTD Induction Curve.

Pulses   W            ΔS
0        1.0          0.00
100      2.1 × 10^−1   0.22
200      1.4 × 10^−2   0.44
300      8.0 × 10^−4   0.63
400      4.4 × 10^−5   0.74
500      2.3 × 10^−6   0.86
600      1.2 × 10^−7   0.91
700      6.2 × 10^−9   0.96
800      3.2 × 10^−10  0.98
900      1.6 × 10^−11  1.00

Figure 3: Facing page. The LTD induction curve reveals a relationship between probability and changes in synaptic strength. (A) Plot of field EPSP slope taken during the application of 1 Hz stimulation for 900 pulses (15 minutes) in hippocampus (data replotted from Mulkey & Malenka, 1992). The graph is a summary of seven experiments; some error bars are omitted for clarity. Zero marks the beginning of 1 Hz stimulation, and 900 marks the end. Note that 1 Hz stimulation initially causes enhancement, but this is a short-term effect and mostly rebounds at the end of stimulation. The bin model does not attempt to model these short-term effects. (B) The number of pulses at each point along the curve can be matched with the amount by which synaptic strength has decreased (ΔS). (C) A plot of changes in synaptic strength, ΔS, against normalized probability, W. Values for ΔS were taken from the data shown in A; values for W were calculated according to the bin model for each multiple of 100 pulses. Note that no change in strength is specified for W = 1.0, while the largest changes are associated with the lowest values of W. (D) A family of functions describing changes in synaptic strengths, ΔS(W), is plotted for different values of R (R = 0.2, lowest curve; R = 50, uppermost curve). (E) The function ΔS(W) (solid curve) provides the best fit to the data (black squares) when R = 0.205. This minimizes the squared error. (F) The function ΔS(W) (solid curve) is plotted against the data (black squares) on a log scale, showing the goodness of fit.

To relate these depressions to probabilities, the bin model can be used to calculate the probabilities of zero hits for each number of pulses, since no postsynaptic spikes are produced during the LTD induction protocol (because no population spikes are observed in the LTD field potentials). These W values are listed in Table 1. From data in this table, it is possible to graph the magnitude of depression as a function of W. The resulting relationship, when plotted as magnitude of depression against W, is a sharply bending curve (see Figure 3C). We seek a function that can fit this sharply bending curve. Let ΔS(W) be a function that would take as input the normalized probability of occurrence W, ranging from 0 to 1, and produce as output the magnitude by which the synaptic strength is changed, ΔS(W), ranging from 1 to 0. We can deduce some general qualities of the function ΔS(W) from looking at the numbers listed in Table 1. First, when W is very small, ΔS is large. This means that as W approaches 0, ΔS(W) should approach 1. Second, when W is relatively large, ΔS is very small. This means that as W approaches 1, ΔS(W) should approach 0. Third, as W increases from 0 to 1, the magnitude of ΔS(W) always decreases and never increases. Mathematically, this means that the magnitude of ΔS(W) should decrease in a strictly monotonic fashion as W ranges from 0 to 1. A function satisfying these general requirements is given by:
ΔS(W) = (1 − W^R) / (1 + W^R),   (2.4)

where W is the normalized probability of occurrence of an event, ΔS(W) is the magnitude of change in strength caused by an event with probability W, and

0 ≤ W ≤ 1
0 ≤ ΔS(W) ≤ 1
R > 0.
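The fit described in the next paragraph can be reproduced with a coarse grid search over R against the (W, ΔS) pairs of Table 1; a sketch (the grid resolution is an arbitrary choice, and since the article fits the induction curve itself, a fit to the tabulated pairs is only expected to land near the published value):

```python
# (W, observed change in strength) pairs taken from Table 1
DATA = [(1.0, 0.00), (2.1e-1, 0.22), (1.4e-2, 0.44), (8.0e-4, 0.63),
        (4.4e-5, 0.74), (2.3e-6, 0.86), (1.2e-7, 0.91), (6.2e-9, 0.96),
        (3.2e-10, 0.98), (1.6e-11, 1.00)]

def delta_s(w, r):
    """Eq. 2.4: magnitude of synaptic change for normalized probability w."""
    return (1 - w**r) / (1 + w**r)

def squared_error(r):
    return sum((delta_s(w, r) - ds) ** 2 for w, ds in DATA)

# grid search for the R minimizing the squared error
best_r = min((r / 1000 for r in range(50, 1000)), key=squared_error)
```

This lands close to R ≈ 0.205, the value used in the rest of the article.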
Equation 2.4 has been plotted for several values of R in Figure 3D. The value of R can be any positive constant and will determine the exact shape of the curve. Note that with the appropriate choice of R, ΔS(W) can reach any point in the space defined by the requirements. For R ≪ 1, the magnitude of ΔS(W) will be large only when W is near 0, while for R ≫ 1, the magnitude of ΔS(W) will be small only when W is near 1. What value of R will best describe the data obtained in field potential experiments performed in slices of mammalian cortical structures? Using numerical methods,² a value of R was chosen to minimize the squared error between the data points given in Table 1 and the function ΔS(W). It is found that R ≈ 0.205 gives the best fit to the LTD induction curve (see Figures 3E and 3F). As the graph shows, this curve would not lead to much depression until W < 0.2, when the magnitude of ΔS values would start to become greater than 0.2. This means that only unusually large numbers of misses would lead to substantial depression, while the vast majority of synaptic events would cause little change. In this model we will assume that the relationship between probability and synaptic depression is one instance of a general mechanism that will be at work in synaptic potentiation as well. As will be shown later, this assumption provides a reasonable description of the data for LTP experiments. We may combine both relationships:

If n ≥ npeak, then: ΔS(W) = + (1 − W^0.205) / (1 + W^0.205)   for hits.

If n < npeak, then: ΔS(W) = − (1 − W^0.205) / (1 + W^0.205)   for misses and near-misses.   (2.5)

2. Numerical integration was performed using Mathematica 3.0.1.
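Equation 2.5 amounts to attaching a sign to the magnitude defined by equation 2.4; a minimal sketch (function and argument names are mine):

```python
def synaptic_change(n, n_peak, w, r=0.205):
    """Eq. 2.5: signed change in synaptic strength.

    Positive for improbably many hits (n >= n_peak), negative for
    improbably many misses or near-misses (n < n_peak); zero when w == 1.
    """
    magnitude = (1 - w**r) / (1 + w**r)
    return magnitude if n >= n_peak else -magnitude
```

At n = npeak, W = 1 and the change is zero; far into either tail, W approaches 0 and the change approaches +1 or −1.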
Note the use of plus and minus signs to indicate the direction of synaptic change. These equations indicate synaptic potentiation for improbable numbers of hits, synaptic depression for improbable numbers of misses or near-misses, and no synaptic change when n = npeak.

2.3 Information Related to Changes in Synaptic Strength. Let us explore the relationship between probability and changes in synaptic strength by introducing the relation:

ds(W) = −k · ln(W).   (2.6)
If the proportionality constant k is chosen correctly, ds(W) can approximate ΔS(W) for most values of W (see Figure 4A). As W approaches zero, however, ds(W) approaches infinity and departs from ΔS(W) (see Figure 4B). We will choose the value for k that minimizes the squared error, weighted by the probability of occurrence, over the interval from 0 to 1 (this minimizes the expected value of the squared error):

E = ∫₀¹ W · (ds(W) − ΔS(W))² dW.

This may be rewritten as:

E = ∫₀¹ W · ( −k · ln(W) − (1 − W^0.205) / (1 + W^0.205) )² dW.
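Setting dE/dk = 0 and using the identity ∫₀¹ W (ln W)² dW = 1/4 gives the closed form k* = −4 ∫₀¹ W ln(W) ΔS(W) dW (this derivation step is mine; the article works numerically). A sketch that evaluates the remaining integral with a simple midpoint rule, where the step count is an arbitrary choice:

```python
from math import log

def delta_s(w, r=0.205):
    """Eq. 2.4 with the fitted exponent R = 0.205."""
    return (1 - w**r) / (1 + w**r)

# k* = -4 * integral_0^1 W * ln(W) * delta_s(W) dW, from dE/dk = 0
# and integral_0^1 W * (ln W)^2 dW = 1/4; midpoint-rule quadrature.
STEPS = 100_000
h = 1.0 / STEPS
integral = 0.0
for i in range(STEPS):
    w = (i + 0.5) * h
    integral += w * log(w) * delta_s(w) * h
k_best = -4 * integral
```

This reproduces k ≈ 0.101, in agreement with the value reported in the next paragraph.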
Using numerical methods,3 E may be plotted as a function of k (see Figure 4C). As can be seen from the graph, E is minimized when k ¼ 0.101. This value of k may now be substituted back into ds(W), and ds(W) may 3
Numerical integration was performed using Mathematica 3.0.1.
100
John M. Beggs
Figure 4: Actual changes in synaptic strength are nearly proportional to transmitted information. (A) The functions ds(W) and D S(W) are plotted together. Note that they are indistinguishable at this scale. The function D S(W) ts the experimental data, describing changes in synaptic strength as a function of normalized probability. Real changes in synaptic strength must be bounded, so the function D S(W) has a maximum value of 1 as W approaches 0. The function ds(W) is proportional to transmitted information. Information becomes innite as W approaches 0, creating a difference between ds(W) and D S(W). (B) A plot of ds(W) and D S(W) at higher scale, showing that there are differences as W approaches zero. (C) When the proportionality constant in ds(W) is chosen to minimize the error between ds(W) and D S(W), the curves are in fairly close agreement. A plot of the expected value of the error (squared difference between ds(W) and D S(W) weighted by the normalized probability) against the value of R. Note that E is minimized when k ¼ 0.101.
Statistical Theory of Long-Term Potentiation and Depression
101
be plotted together with ΔS(W). From the graphs, it is evident that ds(W) and ΔS(W) are in fairly good agreement except at small values of W (see Figures 4A and 4B). Let us now discuss the intuitive meaning of ds(W) and ΔS(W). The information conveyed by an event with a normalized probability of occurrence W is given by −ln(W). This means that ds(W) is proportional to the information conveyed by W. The function ΔS(W) was chosen to map the probabilities generated by the bin model onto changes in synaptic strength that were observed in experiments. In other words, ΔS(W) gives the magnitude of synaptic change as a function of probability. Because the magnitude of synaptic change must be finite, the output of ΔS(W) is mapped onto a bounded interval from 0 to 1. In contrast, the output of the function ds(W) has no such restrictions and approaches infinity as W approaches zero. Thus, the function ΔS(W) closely approximates ds(W) for events with high probabilities of occurrence and deviates from ds(W) only for events with low probabilities of occurrence. From this it is clear that ΔS(W) accurately models changes in synaptic strength and closely approximates ds(W) except at very small values of W. This leads to an interesting conclusion: Changes in synaptic strength are proportional to the information transmitted by conjunctions, subject to the constraint that the magnitude of synaptic change is bounded. Excitatory synapses that obey this principle will also obey Hebb's rule. If synaptic strength is proportional to information, then increases in information transfer at excitatory synapses will lead to increases in synaptic strength, up to some maximum limit. The information conveyed by the number of hits, n, can be thought of as the amount by which knowledge of n will allow one to delimit the sample space of all possible presynaptic spike trains, given Npr, Npo, and Nb.
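The fit between the two functions is easy to reproduce numerically. The following sketch (pure Python; the function names are mine, not the paper's) recovers the proportionality constant from the closed-form minimizer of the quadratic error E and then compares ds(W) with ΔS(W) at a few probabilities:

```python
import math

A = 0.205  # exponent in the empirical fit Delta_S(W) = (1 - W^A) / (1 + W^A)

def delta_S(W):
    # Bounded, experimentally fitted change in synaptic strength (unscaled).
    return (1.0 - W**A) / (1.0 + W**A)

def best_k(n=100_000):
    # E(k) = integral over (0,1) of W * (-k ln W - Delta_S(W))^2 dW is
    # quadratic in k, so setting dE/dk = 0 gives the minimizer directly:
    #   k* = -∫ W ln(W) Delta_S(W) dW  /  ∫ W (ln W)^2 dW.
    # Both integrals are evaluated with the midpoint rule.
    h = 1.0 / n
    num = den = 0.0
    for i in range(n):
        W = (i + 0.5) * h
        num -= W * math.log(W) * delta_S(W) * h
        den += W * math.log(W) ** 2 * h
    return num / den

k = best_k()  # comes out close to the paper's k ≈ 0.101
for W in (0.9, 0.5, 0.1, 0.01, 1e-6):
    ds = -k * math.log(W)
    print(f"W={W:g}  ds={ds:.3f}  delta_S={delta_S(W):.3f}")
```

The two columns agree closely for moderate W and separate only as W → 0, where −k · ln(W) grows without bound while ΔS(W) saturates at 1, mirroring Figures 4A and 4B.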
If an excitatory synapse were to change so as to increase the information given by n, then it would change so as to make the observed n larger than the peak value n_peak (recall Figure 2). This could be done if a synapse were to become strong enough to ensure that it fired the postsynaptic cell every time it received a presynaptic impulse. Under these conditions, n would be much larger than expected by chance, and the information conveyed by n would be high. Every postsynaptic period of silence could be used to predict when the presynaptic input did not fire. This is the case of a classic Hebb synapse (Hebb, 1949), where synaptic strength increases whenever the presynaptic and postsynaptic cell are active at the same time, thus driving n > n_peak and increasing information transfer.

3 Application to Synaptic Plasticity
One of the most successful descriptions of long-term synaptic plasticity in mammalian cortex is the BCM theory (Bienenstock et al., 1982). In addition to its many predictions about the development of visual cortex, two well-known studies (Dudek & Bear, 1992; Kirkwood et al., 1996) have shown a correspondence between the plasticity predicted by the BCM theory and the potentiation and depression observed in experiments with slices from mammalian cortical structures. We will examine whether the bin model can describe changes in amplitude observed by Kirkwood et al. (1996). In these experiments, field potentials were used to assess the synaptic plasticity of slices of visual cortex taken from normally reared and dark-reared rats. The normally reared rats will be considered first. The values of Npr are determined by the induction protocols used. Test pulses were delivered once every 15 seconds (0.067 Hz), low-frequency stimuli were delivered with 900 pulses, and high-frequency stimuli were delivered with 120 pulses (see Table 2).

Table 2: Parameter Values for Normally Reared Rats.

Stimulus    n    Npr    Npo     Nb       W             ΔS(W)
0.067 Hz    0     80    1800    60,000   4.0 × 10⁻¹     −1.9
1 Hz        0    900    1800    60,000   1.6 × 10⁻¹¹    −19.8
10 Hz       6    120    1800    60,000   3.8 × 10⁻¹     +2.0
20 Hz       9    120    1800    60,000   3.3 × 10⁻²     +6.7
100 Hz     30    120    1800    60,000   1.0 × 10⁻¹⁸    +20.0

The choice of n in several cases requires explanation. When 120 pulses of theta burst are applied at 100 Hz, it will be assumed that the neuron will spike 30 times, once for each burst of 4 pulses. When 120 pulses are applied at 20 Hz, it will be assumed that summation of postsynaptic potentials will be less, leading to only 9 spikes. When 120 pulses are applied at 10 Hz, it will be assumed that even less summation occurs, leading to only 6 spikes. The results of the model are fairly robust with respect to the exact choice of n (see Figure 6). It is important, however, that some gradual decline in the number of spikes occurs as stimulation frequency is reduced. It will be assumed that no postsynaptic spikes occur whenever stimulation is applied at frequencies of 2 Hz or lower, because postsynaptic summation is negligible in this range. Values for ΔS(W) were multiplied by 20 to scale to the maximum potentiation seen in the experiments (maximum LTP: 20%). As can be seen from the open circles in Figures 5A and 5B, the model predicts the general shape of the curve fairly well, with minor discrepancies for some of the points. An intuitive explanation for the shape of this curve can be derived from the equations for the bin model.
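For concreteness, here is one way the table entries can be generated. This is a sketch, not the paper's code: it assumes that W is the binomial probability of observing n hits out of Npr presynaptic spikes, each a hit with chance Npo/Nb, normalized by the probability of the most likely count n_peak, and that the sign of ΔS follows whether n lies above or below n_peak.

```python
import math

def log_binom(n, N, p):
    # log P(n hits out of N), binomial with per-spike hit probability p
    return (math.lgamma(N + 1) - math.lgamma(n + 1) - math.lgamma(N - n + 1)
            + n * math.log(p) + (N - n) * math.log(1.0 - p))

def W_and_peak(n, Npr, Npo, Nb):
    # Normalized probability W = P(n) / P(n_peak).
    p = Npo / Nb
    logs = [log_binom(k, Npr, p) for k in range(Npr + 1)]
    n_peak = max(range(Npr + 1), key=logs.__getitem__)
    return math.exp(logs[n] - logs[n_peak]), n_peak

def delta_S(n, Npr, Npo, Nb, scale=20.0, a=0.205):
    # Potentiation for improbably many hits (n > n_peak), depression for
    # improbably many misses (n < n_peak); magnitude bounded by `scale`.
    W, n_peak = W_and_peak(n, Npr, Npo, Nb)
    mag = scale * (1.0 - W**a) / (1.0 + W**a)
    return 0.0 if n == n_peak else (mag if n > n_peak else -mag)

# Normally reared (Npo = 1800) and dark-reared (Npo = 300), Nb = 60,000:
for label, Npo in (("normal", 1800), ("dark", 300)):
    for freq, n, Npr in (("0.067 Hz", 0, 80), ("1 Hz", 0, 900),
                         ("10 Hz", 6, 120), ("20 Hz", 9, 120),
                         ("100 Hz", 30, 120)):
        W, _ = W_and_peak(n, Npr, Npo, 60000)
        dS = delta_S(n, Npr, Npo, 60000)
        print(f"{label:6s} {freq:>8s}  W={W:.1e}  dS={dS:+.1f}")
```

Under these assumptions the 1 Hz, 10 Hz, 20 Hz, and 100 Hz rows come out close to the printed values in Tables 2 and 3 (e.g., W ≈ 1.6 × 10⁻¹¹ and ΔS ≈ −19.8 at 1 Hz, normally reared); the 0.067 Hz test-pulse row agrees only roughly, so the paper's exact normalization may differ in detail.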
High-frequency stimulation produces several hits (about 30) that lead to large changes in synaptic strength because each hit is unlikely to have occurred by chance. Although the misses produced by low-frequency stimulation are not as unlikely, the fact that there are so many of them (about 900) is very unlikely, and this leads to large changes. Thus, the asymmetry of induction protocols for LTP and LTD arises because hits convey more information than misses, which arises because neurons are quiescent more often than they are firing spikes. The test pulse will always produce a miss, but since it is delivered at such a low frequency, only 80 misses will occur in the 20-minute integration period. This small number of misses is not enough to convey as much information as 900 misses, so the depression induced by the test pulse is barely noticeable.

One of the predictions of the BCM theory is that the stimulation frequency at which depression crosses over to potentiation, called θ, should shift as a function of postsynaptic activity (see Figure 5C). Specifically, one prediction of the BCM theory is that as postsynaptic activity decreases, θ should slide to the left (Bienenstock et al., 1982). This means that a stimulation frequency that caused depression under normal conditions might cause slight potentiation under conditions of reduced postsynaptic activity. To test this prediction, Bear and colleagues (Kirkwood et al., 1996) prepared slices of visual cortex from rats reared in a completely dark environment. When these dark-reared slices were stimulated with a range of frequencies, the curve produced was indeed shifted with respect to the curve produced by control animals (see the filled circles in Figure 5A). Let us see if the bin model can describe the experimental data and whether the model can give some insight as to the nature of this shift.

For the dark-reared rats, it will be assumed that the average firing rate of neurons in visual cortex will be lower than the rate for neurons from rats reared in a fully lighted environment. Although the data are not available for rats, Czepita et al. (1994) found that the spontaneous firing rate for neurons in dark-reared cats could be less than half of the firing rate for neurons in normally reared cats activated by visual stimuli. For this model, a spontaneous firing rate of 0.25 Hz will be used for dark-reared rats, giving 300 spikes in the period of 20 minutes (Npo = 300).

Table 3: Parameter Values for Dark-Reared Rats.

Stimulus    n    Npr    Npo    Nb       W             ΔS(W)
0.067 Hz    0     80    300    60,000   1.0 × 10⁰      0.0
1 Hz        0    900    300    60,000   5.8 × 10⁻²     −5.7
2 Hz        0    900    300    60,000   5.8 × 10⁻²     −5.7
10 Hz       6    120    300    60,000   5.9 × 10⁻⁵     +15.2
20 Hz       9    120    300    60,000   2.1 × 10⁻⁸     +19.0
100 Hz     30    120    300    60,000   1.8 × 10⁻⁴¹    +20.0
Otherwise, the values for n, Npr, and Nb are the same as those previously given for the normally reared rats. The resulting probabilities and changes in synaptic strength are given in Table 3 and plotted as filled circles in Figure 5B. As can be seen from the figure, the model is qualitatively similar to the data. In their original paper, Bear and colleagues made note of the fact that θ for the slices from the dark-reared rats is shifted to the left with respect to θ for slices from the normally reared rats. Interpreted this way, the data would agree with the BCM theory, which predicts a leftward shift. According to
the BCM theory, the threshold frequency θ is a reflection of the average postsynaptic firing frequency. If a presynaptic input is activated at the same frequency as θ, then no synaptic change will result. If a presynaptic input is activated at a frequency below θ, then two things will happen. First, depression will result. Second, the average postsynaptic firing frequency will decline, causing θ to shift to the left, where it will reflect this lowered average firing rate. In a similar manner, a presynaptic input that is activated at a frequency above θ will cause potentiation and will cause θ to shift to the right. From this description, it should be clear that the BCM theory predicts that the frequency-response curve will shift left or right along the horizontal axis as a function of postsynaptic activity. No mention is made in the BCM theory about vertical shifts of the curve (Bienenstock et al., 1982; Clothiaux, Bear, & Cooper, 1991).
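The sliding-threshold idea can be made concrete with a minimal numerical sketch. This is not code from either paper: it uses one common formalization of BCM in which the modification function is φ(c, θ) = c(c − θ) and the threshold θ tracks the average squared postsynaptic activity (the original 1982 paper uses a related power-law form of the mean activity).

```python
def bcm_threshold(activities):
    # Sliding modification threshold: here, the mean of the squared
    # postsynaptic activity (one standard formalization; the 1982 paper
    # uses a power of the mean activity instead).
    return sum(c * c for c in activities) / len(activities)

def phi(c, theta):
    # BCM modification function: negative (depression) for 0 < c < theta,
    # positive (potentiation) for c > theta.
    return c * (c - theta)

theta_normal = bcm_threshold([1.5] * 100)     # higher average activity
theta_deprived = bcm_threshold([0.25] * 100)  # reduced activity (dark rearing)

# The same presynaptic drive c = 1.0 depresses the synapse under normal
# activity but potentiates it after deprivation: the crossover point
# (theta) has slid to the left, with no change in the curve's amplitude.
print(phi(1.0, theta_normal), phi(1.0, theta_deprived))
```

In this formalization only the crossover moves along the horizontal axis; the depth of the depression lobe is untouched, which is exactly the point of contrast with the bin model's vertical shift.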
Now let us discuss what the bin model predicts. The shape of the frequency-response curve predicted by the model was described previously. How will this curve change as postsynaptic activity is reduced? If postsynaptic activity is reduced from 1.5 Hz (normally reared) to 0.25 Hz (dark reared), then a presynaptic spike that arrives at a synapse will be much less likely to encounter a postsynaptic spike. This fact has two consequences. First, since hits are much rarer, they will convey more information per spike than under normal conditions. Thus, 9 hits produced by theta burst stimulation at 20 Hz will convey more information in slices from dark-reared rats than in slices from normally reared rats. The potentiation seen in slices from dark-reared rats at 20 Hz will therefore be greater than that seen in slices from normally reared rats. Second, since misses are much more common, they will convey less information per spike than under normal conditions. Because of this, the 900 misses produced by LTD induction will convey less information in slices from dark-reared rats than in slices from normally reared rats. Consequently, the LTD seen in slices from dark-reared rats will be less than that in slices from normally reared rats. The overall effect of a reduction in postsynaptic activity, then, will be to cause an upward shift in the frequency-response curve (see Figure 5D). This view is consistent with the data that show that LTP is greater and LTD is less in slices from dark-reared rats than in controls (for upward shifts under different conditions of reduced activity, see Wang & Wagner, 1999). Note that the BCM theory does not explicitly predict that the magnitude of LTD should be decreased in slices from dark-reared rats, only that θ should shift to the left.

Figure 5: Facing page. The bin model compared to experimental data and the BCM theory.
(A) Magnitude of synaptic change (ΔS) plotted against the log of stimulation frequency (Hz) for original data (modified from Kirkwood et al., 1996). Open circles represent slices of visual cortex prepared from rats reared in a lighted environment; filled circles represent slices of visual cortex prepared from dark-reared rats. (B) Points predicted by the bin model show fairly close agreement to the actual data. (C, D) Shifts in threshold compared for the BCM theory and the bin model. (C) The BCM theory models a decrease in postsynaptic activity by a threshold shift to the left, but does not specify that the amplitude of the curve should be changed. Thus, the peak magnitude of LTD for normal and deprived cortex is not predicted to differ. (D) In the bin model, a decrease in postsynaptic activity produces an upward shift in the curve. Note that the bin model predicts that the magnitude of LTD for deprived cortex will be less than that for normal cortex, in contrast to the BCM theory. (E, F) Experimental test to distinguish between the BCM theory and the bin model. (E) Data from Kirkwood et al. (1996) are plotted with the additional circled point that would be predicted by a leftward shift of the curve (as described by the BCM theory) if 900 pulses at 0.5 Hz were delivered to slices from dark-reared rats. (F) Values given by the bin model with the additional circled point that would be predicted if 900 pulses were delivered at 0.5 Hz to slices from dark-reared rats.
Which of these two theories better describes the data? A simple test could illuminate the issue: How much LTD will be produced in slices from dark-reared rats if they are stimulated at 0.5 Hz? The BCM theory, which describes only a leftward shift in the curve, would predict that the magnitude of LTD in slices from dark-reared rats at 0.5 Hz should be greater than that seen at 1 Hz and that this should be similar to the LTD seen in normal slices stimulated at 1 Hz (see Figure 5E). The bin model, however, describes a vertical shift in the curve and would predict that the LTD seen at 0.5 Hz would be less than that seen at 1 Hz in slices from dark-reared rats (see Figure 5F). To explore the robustness of the bin model, three sets of parameters (integration period Nb, average postsynaptic firing rate Npo, and number of action potentials n) were either increased or decreased by 50% for slices from both normally reared and dark-reared rats. The results of these manipulations, shown in Figure 6, demonstrate that the model is extremely robust. Even under larger parameter variations, the qualitative predictions of the model remain unchanged.

4 Summary and Conclusions
Using only one set of constants, the bin model offers robust explanations for several disparate phenomena observed in LTP and LTD experiments. Many of these phenomena are not adequately explained by current theories. The asymmetry of induction protocols for LTP and LTD arises because hits convey more information than misses, which is a result of the fact that neurons are quiescent more often than they are firing spikes. The curving decline in synaptic strength seen during LTD induction is a consequence of the function ΔS(W), whose shape describes synaptic changes proportional to information transfer, subject to the constraint that synaptic changes must be bounded. The shape of the frequency-response curve predicted by the BCM theory and observed in experiments can also be explained in terms of the bin model. Unlikely numbers of hits cause potentiation, unlikely numbers of misses cause depression, and the misses caused by test pulses do not occur frequently enough in the integration time period to produce substantial depression. The bin model can also explain why reduced postsynaptic activity shifts this frequency-response curve. Reduced postsynaptic activity means that hits will convey more information per spike, so potentiation will be greater at a given frequency. Reduced postsynaptic activity will also mean that misses convey less information, so the magnitude of depression will be less at a given frequency. The net effect of these changes will be an apparent upward shift in the frequency-response curve, an interpretation that is different from that offered by the BCM theory. Finally, the bin model offers a testable prediction so that these theories can be distinguished.

It should be noted that the bin model is not an attempt to explain directly the biophysical basis of LTP and LTD, nor is it an attempt to explain short-term synaptic plasticity. Rather, the bin model argues that a statistical view of synaptic interactions can offer a robust and unified description for previously unexplained phenomena seen in experiments of long-term synaptic plasticity. A major consequence of this view is that long-term synaptic changes encode transmitted information. We hope that such a description may inspire experimental tests of this idea and contribute to the formulation of more detailed biophysical models in the future.

Figure 6: Facing page. Robustness of the bin model. (A, B) Effects of a 50% increase or decrease in integration period. Nb (normally 60,000) was altered to either 90,000 or 30,000; all other parameters remained the same. (A) For slices from normally reared rats, the unaltered curve is plotted with open circles, the increase is plotted with squares, and the decrease is plotted with triangles. Note the minimal changes. (B) Slices from dark-reared rats. Note that no changes at all occurred in this case. (C, D) Effects of a 50% increase or decrease in postsynaptic firing rate. (C) For slices from normally reared rats, Npo (usually 1800) was altered to either 2700 or 900. Note the minor vertical shifts. (D) For slices from dark-reared rats, Npo (usually 300) was altered to either 450 or 150. Again, note minor vertical shifts. (E, F) Effects of a 50% increase or decrease in the number of action potentials fired in response to stimulation at 10, 20, and 100 Hz. For both types of slices, n (usually n = 6, 9, 30) was altered to either (9, 14, 45) or (3, 5, 15). (E) Minimal changes were brought at 10, 20, and 100 Hz, while no changes were brought at 1 Hz and at the test pulse, since no action potentials were fired there. (F) Again note the slight changes. Overall, the bin model retains the same qualitative predictions over large variations in parameter values.

Appendix: Binning and the Calculation of Probabilities
It may be argued that the definition of a hit presented in the bin model includes cases where LTD would be observed. For example, if a postsynaptic spike is assigned to a bin because it occurs at the beginning of a 20 ms period and a presynaptic spike occurs at any time less than 20 ms after this postsynaptic spike, it will be assigned to the same bin, resulting in a hit. However, this case should lead to LTD, as shown by Bi and Poo (1998), because the presynaptic spike occurs after the postsynaptic spike. It can be shown that although binning the spike trains may destroy some information about precise temporal order, it does not alter the calculation of probabilities of hits or misses when compared to methods that do take precise temporal order into account. For example, consider a parallel recording where one presynaptic and one postsynaptic spike occur in a 1000 ms period. To keep track of temporal order, let us attach a "trailer" that is 10 ms long after the presynaptic spike. Let us also attach a "leader" that is 10 ms long before the postsynaptic spike. We will define a hit to occur whenever the presynaptic trailer and the postsynaptic leader overlap in time. Thus, a hit will occur only when the presynaptic spike is followed by the postsynaptic spike by less than 20 ms. There can be no hit if the postsynaptic spike occurs before the presynaptic spike. This interval of 20 ms constitutes a "hit zone" over which a hit will occur. The probability that the presynaptic spike would produce a hit with the postsynaptic spike can be calculated by taking the ratio of the length of the hit zone (20 ms) to the length of the recording period (1000 ms): 20/1000 = 0.02. But this is exactly equivalent to calculating the probability with a bin method that does not account for precise temporal order. The probability that a randomly dropped spike would land in the same 20 ms bin already occupied by a previously dropped spike in a 1000 ms period is 1/50 = 0.02.
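This equivalence is easy to check by simulation. The sketch below (pure Python; the names are mine) drops one presynaptic and one postsynaptic spike uniformly at random into a 1000 ms period and scores a hit under both the temporal-order rule and the bin rule:

```python
import random

def estimate_hit_probs(trials=200_000, T=1000.0, window=20.0, seed=1):
    rng = random.Random(seed)
    order_hits = 0  # temporal-order rule: post spike within `window` ms after pre
    bin_hits = 0    # bin rule: pre and post spike land in the same `window`-ms bin
    for _ in range(trials):
        pre = rng.uniform(0.0, T)
        post = rng.uniform(0.0, T)
        if 0.0 < post - pre < window:
            order_hits += 1
        if int(pre // window) == int(post // window):
            bin_hits += 1
    return order_hits / trials, bin_hits / trials

p_order, p_bin = estimate_hit_probs()
print(p_order, p_bin)  # both estimates come out near 20/1000 = 0.02
```

Both estimates land near 0.02. (The order rule comes out marginally lower because the 20 ms hit zone is clipped when the presynaptic spike falls near the end of the recording, a boundary effect the back-of-the-envelope ratio ignores.)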
(The period is divided into 50 bins of 20 ms length.) The case of multiple spikes can be handled in a similar manner, demonstrating that binning does not alter the calculation of probabilities, although it may destroy some temporal information.

Acknowledgments
This work was supported by a training grant from NIH awarded to Yale University and by a grant from NIH awarded to Thomas H. Brown. I thank Ted Carnevale, Dan Johnston, David Jaffe, Nigel Daw, Tom Carew, Josh Brumberg, John McGann, Jim Moyer, Patrick Tao, Tom Brown, and two anonymous reviewers for useful comments.

References

Bear, M. F., Cooper, L. N., & Ebner, F. F. (1987). A physiological basis for a theory of synapse modification. Science, 237, 42–48.

Beggs, J. M. (1998). Electrophysiology of rat perirhinal cortex. Unpublished doctoral dissertation, Yale University.

Beggs, J. M., Brown, T. H., Byrne, J. H., Crow, T., LeDoux, J. E., LeBar, K., & Thompson, R. F. (1999). Learning and memory: Basic mechanisms. In M. J. Zigmond, F. E. Bloom, S. C. Landis, J. L. Roberts, & L. R. Squire (Eds.), Fundamental neuroscience. San Diego, CA: Academic Press.

Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18, 10464–10472.

Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48.

Bliss, T. V. P., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiology (Lond.), 232, 331–356.

Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.

Brindley, G. S. (1969). Nerve net models of plausible size that perform many simple learning tasks. Proceedings of the Royal Society, 174, 173–191.

Clothiaux, E. E., Bear, M. F., & Cooper, L. N. (1991). Synaptic plasticity in visual cortex: Comparison of theory with experiment. Journal of Neurophysiology, 66, 1785–1804.

Collins, D. R., & Pare, D. (1999). Reciprocal changes in firing probability of lateral and central medial amygdala neurons. Journal of Neuroscience, 19, 836–844.

Czepita, D., Reid, S. N. M., & Daw, N. W. (1994). Effect of longer periods of dark rearing on NMDA receptors in cat visual cortex. Journal of Neurophysiology, 72, 1220–1226.

Dudek, S. M., & Bear, M. F. (1992). Homosynaptic long-term depression in area CA1 of the hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proceedings of the National Academy of Sciences, USA, 89, 4363–4367.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.

Huang, Y. Y., Colino, A., Selig, D. K., & Malenka, R. C. (1992). The influence of prior synaptic activity on the induction of long-term potentiation. Science, 255, 730–733.

Johnson, N. L., & Kotz, S. (1969). Discrete distributions. Boston: Houghton Mifflin.

Kelso, S. R., Ganong, A. H., & Brown, T. H. (1986). Hebbian synapses in hippocampus. Proceedings of the National Academy of Sciences, USA, 83, 5326–5330.

Kirkwood, A., & Bear, M. F. (1994). Hebbian synapses in visual cortex. Journal of Neuroscience, 14, 1634–1645.

Kirkwood, A., Dudek, S. M., Gold, J. T., Aizenman, C. D., & Bear, M. F. (1993). Common forms of synaptic plasticity in the hippocampus and in neocortex in vitro. Science, 260, 1518–1521.

Kirkwood, A., Rioult, M. G., & Bear, M. F. (1996). Experience-dependent modification of synaptic plasticity in visual cortex. Nature, 381, 526–528.

Linsker, R. (1988). Self-organization in a perceptual network. Computer Magazine, 21, 105–117.

Lisman, J. (1989). A mechanism for the Hebb and the anti-Hebb processes underlying learning and memory. Proceedings of the National Academy of Sciences, USA, 86, 9574–9578.

Marr, D. (1970). A theory for cerebral neocortex. Proceedings of the Royal Society, 176, 161–234.

Meyer, P. L. (1970). Introductory probability and statistical applications. Reading, MA: Addison-Wesley.

Mulkey, R. M., & Malenka, R. C. (1992). Mechanisms underlying induction of homosynaptic long-term depression in area CA1 of the hippocampus. Neuron, 9, 967–975.

Quinlan, E. M., Philpot, B. D., Huganir, R. L., & Bear, M. F. (1999). Rapid, experience-dependent expression of synaptic NMDA receptors in visual cortex in vivo. Nature Neuroscience, 4, 352–357.

Shannon, C. E. (1948). A mathematical theory of communication. Bell Systems Technical Journal, 27, 623–656.

Thompson, L. T., Deyo, R. A., & Disterhoft, J. F. (1990). Nimodipine enhances spontaneous activity of hippocampal pyramidal neurons in aging rabbits at a dose that facilitates learning. Brain Research, 535, 119–130.

Uttley, A. M. (1966). The transmission of information and the effect of local feedback in theoretical and neural networks. Brain Research, 102, 23–35.

Uttley, A. M. (1970). The informon: A network for adaptive pattern recognition. Journal of Theoretical Biology, 27, 31–67.

Uttley, A. M. (1979). Information transmission in the nervous system. New York: Academic.

Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. Journal of Neuroscience, 17, 7926–7940.

Wang, H., & Wagner, J. J. (1999). Priming-induced shift in synaptic plasticity in the rat hippocampus. Journal of Neurophysiology, 82, 2024–2028.

Xu, L., Anwyl, R., & Rowan, M. J. (1997). Behavioural stress facilitates the induction of long-term depression in the hippocampus. Nature, 387, 497–500.
Zador, A., Koch, C., & Brown, T. H. (1990). Biophysical model of a Hebbian synapse. Proceedings of the National Academy of Sciences, USA, 87, 6718–6722.

Received July 2, 1999; accepted April 20, 2000.
LETTER
Communicated by Peter Rowat
Minimal Model for Intracellular Calcium Oscillations and Electrical Bursting in Melanotrope Cells of Xenopus Laevis

L. Niels Cornelisse
Department of Cellular Animal Physiology and Department of Biophysics, Nijmegen Institute for Neurosciences, University of Nijmegen, 6525 ED Nijmegen, The Netherlands

Wim J. J. M. Scheenen
Werner J. H. Koopman
Eric W. Roubos
Department of Cellular Animal Physiology, Nijmegen Institute for Neurosciences, University of Nijmegen, 6525 E Nijmegen, The Netherlands

Stan C. A. M. Gielen
Department of Biophysics, Nijmegen Institute for Neurosciences, University of Nijmegen, 6525 EZ Nijmegen, The Netherlands

A minimal model is presented to explain changes in frequency, shape, and amplitude of Ca²⁺ oscillations in the neuroendocrine melanotrope cell of Xenopus Laevis. It describes the cell as a plasma membrane oscillator with influx of extracellular Ca²⁺ via voltage-gated Ca²⁺ channels in the plasma membrane. The Ca²⁺ oscillations in the Xenopus melanotrope show specific features that cannot be explained by previous models for electrically bursting cells using one set of parameters. The model assumes a KCa channel with slow Ca²⁺-dependent gating kinetics that initiates and terminates the bursts. The slow kinetics of this channel cause an activation of the KCa channel with a phase shift relative to the intracellular Ca²⁺ concentration. The phase shift, together with the presence of a Na⁺ channel that has a lower threshold than the Ca²⁺ channel, generates the characteristic features of the Ca²⁺ oscillations in the Xenopus melanotrope cell.

1 Introduction
Cells of multicellular organisms communicate with each other to control and coordinate their activities. This communication takes place by way of various types of first messengers, such as neurotransmitters, hormones, and growth factors. First messengers specifically contact a cell via receptors that subsequently transduce the extracellular signals into specific (intra)cellular responses such as protein synthesis, secretion, or contraction. In this transduction process, a relatively small number of second-messenger molecules

Neural Computation 13, 113–137 (2000)
© 2000 Massachusetts Institute of Technology
114
L. N. Cornelisse et al.
are involved, such as cAMP, inositol trisphosphate, and Ca²⁺ ions. Among the second messengers, to date much attention is being paid to the role of the intracellular Ca²⁺ concentration ([Ca²⁺]i) (Berridge, 1998; Bito, 1998; Neher, 1998). In many secretory cells, including neurons and neuroendocrine cells, [Ca²⁺]i changes in the form of Ca²⁺ oscillations are thought to control secretory events (Stojilkovic & Catt, 1992; Shibuya & Douglas, 1993; Berridge, 1998). In some cases these oscillations depend on the influx of extracellular Ca²⁺ via voltage-gated Ca²⁺ channels in the plasma membrane and on Ca²⁺ release from intracellular stores. In others, the influx of extracellular Ca²⁺ is the only cause of the Ca²⁺ oscillations (Stojilkovic & Catt, 1992; Scheenen, Jenks, Roubos, & Willems, 1994; Scheenen, Jenks, Willems, & Roubos, 1994; Stojilkovic, Tomic, Kukulijan, & Catt, 1994). In the latter cell type, the Ca²⁺ influx depends on the electrical state of the plasma membrane. Several models have been proposed for the mechanism by which Ca²⁺ oscillations are coupled to bursting electrical membrane activity, such as for the pancreatic β-cell (Chay & Keizer, 1983) and the R15 neuron in Aplysia (Chay, 1990; Canavier, Clark, & Byrne, 1991). However, although these models describe the electrical behavior of the cells, relatively little attention has been paid to the characteristics of the Ca²⁺ oscillations. Recently, several interesting details have been reported on Ca²⁺ oscillations in the melanotrope cell of the amphibian Xenopus Laevis (Scheenen, Jenks, van Dinter, & Roubos, 1996; Koopman, Scheenen, Roubos, & Jenks, 1997; Lieste et al., 1998). This neuroendocrine cell type is located in the intermediate lobe of the pituitary gland and secretes α-melanophore-stimulating hormone (α-MSH), which stimulates the dispersion of the pigment melanin in dermal melanophores.
This process enables the animal to adjust the gray intensity of its skin to the light intensity of the environment (background adaptation). The Xenopus melanotrope cell is an established and extensively studied object for investigations of the ways in which environmental stimuli are transduced into physiologically meaningful responses (Loh & Gainer, 1977; Maruthainar, Loh, & Smyth, 1992; Artero, Fasolo, Andreone, & Franzoni, 1994; Roubos, 1997; Jenks et al., 1998). Many of the various steps in this transduction process are known in great detail and can be experimentally approached and quantitatively manipulated both in vivo and in vitro. Various neuronal messengers, such as thyrotropin-releasing hormone (TRH), corticotropin-releasing hormone (CRH), acetylcholine, and serotonin (all excitatory) and dopamine, neuropeptide Y (NPY), and γ-aminobutyric acid (GABA) (all inhibitory), control the secretion of α-MSH from the Xenopus melanotrope cells (for reviews see Roubos, 1997; Jenks et al., 1998). All of these factors exert their effect on α-MSH secretion by affecting the Ca²⁺ oscillatory dynamics (Shibuya & Douglas, 1993; Scheenen, Jenks, Roubos, & Willems, 1994; Scheenen, Jenks, Willems, & Roubos, 1994). Recently, we showed that the Ca²⁺ oscillations are coupled to membrane action potential bursting (Lieste et al., 1998). Imaging of the [Ca²⁺]i oscillations with high temporal resolution has revealed that the oscillation patterns may differ in
Minimal Model for Calcium Oscillations
115
Figure 1: Characteristic features of Ca2+ oscillations in the Xenopus melanotrope cell: (i) steps during the rise phase, (ii) plateau phase, (iii) exponential decline phase, (iv) abrupt transition from decline to rise phase. The cell has been loaded with the membrane-permeable probe fura-2/AM, and fluorescence emission was monitored at 520 nm (SD 15 nm; described in Lieste et al., 1998). Changes in [Ca2+]i are expressed as ratio signals of fura-2 fluorescence after excitation at 360 nm and 380 nm (F360/F380).
frequency, amplitude, and shape. In more detail, the Ca2+ oscillatory pattern of this cell is characterized by various other features such as (i) repetitive steps during the rise phase, (ii) a plateau phase, (iii) an exponential decline, and (iv) an abrupt transition from the decline phase to the rise phase of the next Ca2+ peak (Scheenen et al., 1996; Koopman et al., 1997; Lieste et al., 1998), as illustrated in Figure 1. The aim of this study is to develop a mathematical model that describes the detailed features of Ca2+ oscillations in the Xenopus melanotrope cell and links these features to the electrical behavior of its plasma membrane. For this purpose we have taken the Chay-Rinzel model (Chay & Rinzel, 1985) as a starting point. This model provides a qualitative description of Ca2+ oscillations with steps in the rise phase as a result of the coupling between bursting electrical membrane activity and [Ca2+]i. However, it cannot explain properties (ii)–(iv) of the Xenopus melanotrope as described above. Therefore, we have replaced in the model the original Ca2+-sensitive K+ channel by a KCa-channel with Hodgkin-Huxley kinetics that are Ca2+ dependent and are slow compared to the kinetics of the voltage-gated channels. Furthermore, we have added an Na+ channel that has a lower threshold than the Ca2+ channel. These modifications enable us to describe Ca2+ oscillations with the features that are characteristic for the melanotrope cell of Xenopus Laevis, as depicted in Figure 1. Whereas the model describes and explains the characteristics of the Ca2+ oscillations and the relationship between electrical membrane activity and Ca2+ oscillations in this neuroendocrine cell, it also provides insight into the physiological parameters of the
116
L. N. Cornelisse et al.
cell that can account for the variations in the pattern of the Ca2+ oscillations. This is highly relevant from a biological point of view, as it is assumed that variations in this pattern reflect different physiological states of this Xenopus neuroendocrine signal transducer cell (Koopman et al., 1997).

2 Model

2.1 Plasma Membrane Oscillator. In this section we describe the model to explain intracellular Ca2+ oscillations in the melanotrope cells of Xenopus Laevis. The model consists of a set of eight coupled, nonlinear differential equations and is based on an excitable membrane model proposed by Chay and Rinzel (1985). It contains a Hodgkin-Huxley-type formalism (Hodgkin & Huxley, 1952) for the generation of action potentials, a differential equation that describes the dynamics of [Ca2+]i, and a differential equation that describes the Ca2+-dependent kinetics of a KCa-channel that provides the coupling between [Ca2+]i and the electrical activity. Spontaneous Ca2+ action potentials are generated in a bursting manner. Each action potential gives rise to a rapid influx of Ca2+ through voltage-gated Ca2+ channels and causes a small increase in [Ca2+]i (a "step") (Koopman et al., 1997). During a burst of action potentials, a number of steps are generated, which results in a sequential and accumulative increase in [Ca2+]i. In the time periods between the bursts of action potentials, Ca2+ is removed from the cell by a Ca2+ removal mechanism. In the model, the fast kinetics of voltage-gated ion channels are combined with the slow kinetics of a KCa-channel that is progressively activated by an increase in [Ca2+]i. Activation (deactivation) of this slow K+ channel causes a hyperpolarization (depolarization) of the plasma membrane. The activation and deactivation result in the stop and the start, respectively, of a burst of action potentials.
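The burst-stop/burst-start feedback loop just described can be caricatured numerically without the full spiking machinery: treat each action potential as an instantaneous, fixed Ca2+ step, remove Ca2+ linearly, and let a slow gate P integrate [Ca2+]i and gate the firing. The following sketch is purely illustrative; all rates, the step size, and the burst thresholds are assumptions, not the model's parameters.

```python
# Caricature of the slow Ca2+ / K(Ca) feedback loop described above.
# Each "action potential" adds a fixed Ca2+ step, Ca2+ is removed
# linearly, and a slow gate P integrates [Ca2+]i. All numbers are
# illustrative assumptions, not the model's parameters.

def simulate(t_end=200.0, dt=0.01):
    ca, p = 0.1, 0.0            # [Ca2+]i (arbitrary units), open fraction of K(Ca)
    ca_basal = 0.1
    u_o, u_c = 0.3, 0.02        # slow activation/deactivation rates (assumed)
    k_ca = 0.5                  # linear removal rate (assumed)
    step = 0.06                 # Ca2+ added per spike (assumed)
    bursting, trace = True, []
    t, next_spike = 0.0, 0.0
    while t < t_end:
        if bursting and t >= next_spike:
            ca += step                           # one spike -> one Ca2+ step
            next_spike = t + 1.0 / (1.0 - p)     # firing slows as P opens
        # slow gate, cf. a rate equation dP/dt = a(Ca)(1 - P) - b*P
        p += dt * (u_o * (ca - ca_basal) * (1.0 - p) - u_c * p)
        ca += dt * (-k_ca * (ca - ca_basal))     # linear Ca2+ removal
        if bursting and p > 0.4:                 # enough K(Ca) open: burst ends
            bursting = False
        elif not bursting and p < 0.15:          # K(Ca) closed again: burst resumes
            bursting = True
            next_spike = t
        trace.append((t, ca, p, bursting))
        t += dt
    return trace

trace = simulate()
```

Even this caricature reproduces the qualitative cycle: stepwise Ca2+ buildup while the cell fires, slow growth of P, a silent phase with exponential Ca2+ decay toward basal level, and resumption of firing once P has decayed.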
On the basis of this concept for a plasma membrane oscillator, it is possible to describe the several characteristic features of the Ca2+ oscillations of the melanotrope cell, as given in the introduction.

2.2 Model Equations. The membrane potential (V) of an excitable cell
is given by

Cm dV/dt = −Iion,    (2.1)

where Cm represents the membrane capacitance, and Iion the sum of the ion currents through the various ion channels. In this model the total ion current consists of five components:

Iion = ICa,HH + INa,HH + IK,HH + IL + IK,Ca,    (2.2)
with the Hodgkin-Huxley-type currents: ICa,HH, a voltage-gated Ca2+ current; INa,HH, a voltage-gated Na+ current; IK,HH, a voltage-gated K+ current;
and IL, a leak current that incorporates the contributions of all the ion currents (e.g., Cl− currents) not explicitly included in the model. The current IK,Ca represents a slow K+ current that is activated by [Ca2+]i. ICa,HH, INa,HH, IK,HH, and IL are modeled similarly to the ion currents in the original Hodgkin-Huxley scheme that describes the generation of action potentials (Hodgkin & Huxley, 1952). The currents are expressed in terms of a voltage-dependent conductance multiplied by the difference between the membrane voltage and the Nernst potential for the particular ion:

ICa,HH = ḡCa,HH m³h (V − VCa)
INa,HH = ḡNa,HH p³q (V − VNa)
IK,HH = ḡK,HH n⁴ (V − VK)
IL = ḡL (V − VL),    (2.3)
where ḡCa,HH, ḡNa,HH, ḡK,HH, and ḡL are the maximal conductances per unit area for the voltage-gated Ca2+, voltage-gated Na+, voltage-gated K+, and leak channels, respectively. The symbols m, h, p, q, and n represent the probabilities to be in an open state for the m-, h-, p-, q-, and n-gates, respectively. VCa, VNa, VK, and VL are the Nernst potentials for Ca2+, Na+, K+, and leak ions, respectively, which are assumed to be constant in this study. The time evolution of the probabilities m, h, p, q, and n is described by the following first-order differential equation, introduced by Hodgkin and Huxley:

dy/dt = αy (1 − y) − βy y,    (2.4)
where y stands for m, h, p, q, or n, and αy and βy are the activation and deactivation rates, respectively. The voltage dependencies of αy and βy have qualitatively the same form as in the original Hodgkin-Huxley equations but are shifted along the V-axis in accordance with the Chay-Rinzel model. These shifts of the voltage dependencies of αy and βy result in the periodic generation of action potentials by the membrane. (The detailed expressions for αy and βy used in this study are given in Table 2.) The spontaneous activity of the membrane, as described above, can be blocked by a hyperpolarizing current. Alternating activation and deactivation of this current will generate bursts of action potentials. In this model, a slow potassium current, IK,Ca, which is activated by Ca2+, plays the role of the hyperpolarizing current. IK,Ca is given by

IK,Ca = ḡK,Ca P (V − VK),    (2.5)
where ḡK,Ca is the maximal conductance per unit area for the KCa-channel, P is the probability for the channel to be in an open state, and VK is the Nernst
potential for K+. The time evolution of P depends on [Ca2+]i and is modeled by

dP/dt = αP (1 − P) − βP P,    (2.6)
with αP and βP being the activation and deactivation rate, respectively. Equation 2.6 is analogous to equation 2.4. However, the rates in equation 2.6 are not voltage dependent but are described by

αP = uo ([Ca2+]i − [Ca2+]basal)
βP = uc,    (2.7)

with uo and uc constants, and [Ca2+]basal the basal level of [Ca2+]i. The parameters uo and uc determine how P responds to an oscillating [Ca2+]i signal. If both parameters are large relative to the rate of change of [Ca2+]i, P will closely follow changes in [Ca2+]i and therefore will be almost in phase with [Ca2+]i. If the values of uo and uc are small relative to the rate of change of [Ca2+]i, P will lag changes in [Ca2+]i and will be out of phase with respect to [Ca2+]i for periodic changes in [Ca2+]i. Note that for large values of uo and uc, the probability P for the KCa-channel to be in an open state can be replaced by its value at equilibrium,
P∞ = αP / (αP + βP) = ([Ca2+]i − [Ca2+]basal) / ([Ca2+]i − [Ca2+]basal + uc/uo),    (2.8)
which, for [Ca2+]basal = 0, is the expression that was used by Chay and Rinzel (1985) for the open probability of the KCa-channel in their model. The change in [Ca2+]i depends on the influx of Ca2+ through the Ca2+ channels during an action potential and on the removal of free Ca2+ from the cytoplasm. Various mechanisms can be responsible for this removal, such as a plasma-membrane-bound Ca2+-ATPase that pumps Ca2+ out of the cell, mitochondrial uptake, and other fixed or mobile Ca2+ binding sites. Because the precise mechanisms underlying the efflux of Ca2+ in the melanotrope cell are unknown, the removal mechanism is modeled as simply as possible, by a linear flow kCa ([Ca2+]i − [Ca2+]basal), with kCa the rate constant for the removal of Ca2+. The equation that describes the change of free Ca2+ in the cytoplasm is therefore given by

d[Ca2+]i/dt = f ( −(3/(2rF)) ICa,HH − kCa ([Ca2+]i − [Ca2+]basal) ),    (2.9)
where the term −3/(2rF) reflects the scale factor to go from current per unit area to ion concentration for an ion with a double valence. This scale factor
is calculated by taking the ratio of the cell surface area to the cell volume, with r the radius of the cell, and by dividing it by the Faraday constant F and by a factor of two because of the double valence of Ca2+. The parameter f determines how fast [Ca2+]i changes in time and is defined by

f = d[Ca2+]i / d[Ca2+]T,    (2.10)
with [Ca2+]T the total Ca2+ concentration inside the cell, that is, the bound-plus-free cytosolic Ca2+ concentration (excluding [Ca2+] in stores) (Chay, 1990). In the case of fast buffering (faster than a millisecond), the inverse of the parameter f can be expressed as

f⁻¹ = 1 + ([B]/KB) / (1 + [Ca2+]i/KB)² ≈ 1 + [B]/KB,    (2.11)

where [B] represents the buffer concentration and KB its dissociation constant (Chay, 1990). The approximation in equation 2.11 is valid only if KB ≫ [Ca2+]i. The set of eight coupled differential equations described in equations 2.1, 2.4, 2.6, and 2.9 was solved numerically, using the "lsode" method of the mathematical software Maple V release 5 (Waterloo Maple Inc.), except for Figures 3 through 5, where we used the fourth-order Runge-Kutta method in a program written in C++.

3 Results
Figure 2 shows the response characteristics of the model as a function of time for [Ca2+]i (see Figure 2A), the membrane potential V (see Figure 2B), and the fraction P of open KCa-channels (see Figure 2C). Figures 2A and 2B provide a clear illustration of the coupling between the membrane potential and the Ca2+ oscillations. The specific features of the Ca2+ oscillations in the Xenopus melanotrope cell, as given in Figure 1, are present in this simulation: steps in the rise phase of the Ca2+ peak (1), a plateau phase (2), an exponential decline phase (3), and an abrupt transition from the decline phase to the rise phase (4). Figure 2A also shows the increase in Ca2+ removal between the steps during the rise phase. The values for the parameters used in the simulations are listed in Table 1. In the following, we will discuss these features in more detail.

3.1 Coupling of [Ca2+]i and Membrane Potential. Within each burst,
the action potentials appear with a progressively increasing time interval.
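The coupled system formed by equations 2.1, 2.4, 2.6, and 2.9 can be integrated directly with a fourth-order Runge-Kutta scheme of the kind the authors used for Figures 3 through 5. The sketch below takes its parameter values from Table 1 and Hodgkin-Huxley-style rate expressions of the shifted form listed in Table 2; since the printed table is hard to read, the exact rate coefficients here are reconstructed and should be treated as assumptions rather than the authors' verified values. One convenient property of these units (conductances in μS/cm², voltages in mV, Cm in μF/cm², rates in s⁻¹, [Ca2+]i in μM, time in s) is that the factor 3/(2rF) converts the numerical value of ICa,HH directly to μM/s, so no extra conversion constants are needed.

```python
import math

# Parameter values from Table 1; rate coefficients reconstructed from
# Table 2 (treat exact numbers as assumptions).
g_ca, g_na, g_k, g_l, g_kca = 2600.0, 780.0, 2400.0, 9.98, 18.0
v_rev_ca, v_rev_na, v_rev_k, v_rev_l = 100.0, 60.0, -75.0, -50.95
v0, vn, vp, vq = 50.0, 30.0, 60.0, 55.0      # voltage shifts of the rates
cm = 1.0                                      # uF/cm^2
r = 8.9e-4                                    # cell radius (cm)
farad = 9.65e4                                # Faraday constant (C/mol)
f_buf, k_ca = 0.064, 6.2
u_o, u_c = 0.01, 0.003                        # (uM s)^-1 and s^-1
ca_basal = 0.1                                # uM
w = 3.0 ** ((17.0 - 6.3) / 10.0)              # temperature factor (T = 17 C)

def vtrap(x, s):
    """x / (exp(x/s) - 1) with the removable singularity at x = 0 handled."""
    return s - x / 2.0 if abs(x) < 1e-7 else x / (math.exp(x / s) - 1.0)

def rates(v):
    """Hodgkin-Huxley-style rates (s^-1), shifted along V (cf. Table 2)."""
    return (
        2.0 * w * vtrap(-(v + vn) + 10.0, 10.0),                  # alpha_n
        25.0 * w * math.exp(-(v + vn) / 80.0),                    # beta_n
        20.0 * w * vtrap(-(v + v0) + 25.0, 10.0),                 # alpha_m
        800.0 * w * math.exp(-(v + v0) / 18.0),                   # beta_m
        14.0 * w * math.exp(-(v + v0) / 20.0),                    # alpha_h
        200.0 * w / (math.exp((-(v + v0) + 30.0) / 10.0) + 1.0),  # beta_h
        20.0 * w * vtrap(-(v + vp) + 25.0, 10.0),                 # alpha_p
        800.0 * w * math.exp(-(v + vp) / 18.0),                   # beta_p
        14.0 * w * math.exp(-(v + vq) / 20.0),                    # alpha_q
        200.0 * w / (math.exp((-(v + vq) + 30.0) / 10.0) + 1.0),  # beta_q
    )

def deriv(state):
    v, n, m, h, p, q, pk, ca = state
    an, bn, am, bm, ah, bh, ap, bp, aq, bq = rates(v)
    i_ca = g_ca * m**3 * h * (v - v_rev_ca)
    i_na = g_na * p**3 * q * (v - v_rev_na)
    i_k = g_k * n**4 * (v - v_rev_k)
    i_l = g_l * (v - v_rev_l)
    i_kca = g_kca * pk * (v - v_rev_k)
    dv = -(i_ca + i_na + i_k + i_l + i_kca) / cm                # eq. 2.1
    dpk = u_o * (ca - ca_basal) * (1.0 - pk) - u_c * pk         # eqs. 2.6/2.7
    # eq. 2.9: in these units 3/(2rF) * I_Ca comes out directly in uM/s
    dca = f_buf * (-(3.0 / (2.0 * r * farad)) * i_ca
                   - k_ca * (ca - ca_basal))
    return (dv,
            an * (1 - n) - bn * n, am * (1 - m) - bm * m,
            ah * (1 - h) - bh * h, ap * (1 - p) - bp * p,
            aq * (1 - q) - bq * q, dpk, dca)

def rk4_step(state, dt):
    k1 = deriv(state)
    k2 = deriv(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = deriv(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = deriv(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

# Initial conditions from Table 1; gates start at rest for V = -52 mV.
rs = rates(-52.0)
n0, m0, h0, p0, q0 = (rs[i] / (rs[i] + rs[i + 1]) for i in (0, 2, 4, 6, 8))
state = (-52.0, n0, m0, h0, p0, q0, 0.251, 0.13)
dt = 2e-5                       # s; the fastest gates have ~0.1 ms kinetics
for _ in range(int(0.2 / dt)):  # 200 ms of membrane dynamics
    state = rk4_step(state, dt)
```

Because uo and uc are of order 10⁻²–10⁻³ s⁻¹, the slow variable P and hence the bursts evolve on a timescale of tens of seconds to minutes; reproducing a full Figure-2-style trace therefore requires integrating for hundreds of simulated seconds, for which a compiled implementation (as the authors used) is the practical choice.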
Figure 2: Simulation of spontaneous Ca2+ oscillations coupled to electrical bursting in the Xenopus melanotrope cell. In the calculated Ca2+ signal, the characteristic features of the Ca2+ oscillations in the Xenopus melanotrope cell are present: (i) steps in the rise phase, (ii) plateau phase, (iii) exponential decline, (iv) abrupt transition, (v) upregulation of Ca2+ removal, (vi) coupling between the plasma membrane action potentials and the Ca2+ steps (A). Corresponding burst of action potentials (B). Fraction of open KCa-channels (C).
Table 1: Parameter Values.

Cm = 1 μF/cm²
ḡCa,HH = 2600 μS/cm²
ḡNa,HH = 780 μS/cm²
ḡK,HH = 2400 μS/cm²
ḡL = 9.98 μS/cm²
ḡK,Ca = 18 μS/cm²
VCa = 100 mV
VNa = 60 mV
VK = −75 mV
VL = −50.95 mV
V0 = 50 mV
Vn = 30 mV
Vp = 60 mV
Vq = 55 mV
r = 8.9 μm
f = 0.064
T = 17 °C
F = 9.65 · 10⁴ C/mol
kCa = 6.2 s⁻¹
uo = 0.01 (μM·s)⁻¹
uc = 0.003 s⁻¹
V(0) = −52 mV
P0 = 0.251
[Ca2+]i,0 = 0.13 μM
[Ca2+]basal = 0.1 μM (in Figures 1 to 2 and 5 to 9)
[Ca2+]basal = 0 μM (in Figures 3 to 5)
The sequence of action potentials is due to the combined effect of the voltage-dependent Ca2+, Na+, K+, and leak channels. Each action potential causes a transient Ca2+ influx, as described by the first term on the right-hand side of equation 2.9. The second term on the right-hand side of equation 2.9 describes the removal of Ca2+, which is proportional to [Ca2+]i. The sequential accumulation of Ca2+ ions in the cell causes an increase in [Ca2+]i. After several step increments, the mean [Ca2+]i reaches a plateau due to the balance between the influx and the removal of Ca2+. This is caused by the fact that Ca2+ removal is proportional to [Ca2+]i and by the increase of the interval between subsequent action potentials. Once the generation of action potentials is stopped, there is no influx of Ca2+ anymore, and only
Table 2: Expressions for Activation (α) and Deactivation (β) Rates for the Probabilities n, m, h, p, and q, Characterizing the Dynamics of the Various Ion Channels.

αn = 2w (−(V + Vn) + 10) / [e^((−(V + Vn) + 10)/10) − 1]
βn = 25w e^(−(V + Vn)/80)
αm = 20w (−(V + V0) + 25) / [e^((−(V + V0) + 25)/10) − 1]
βm = 800w e^(−(V + V0)/18)
αh = 14w e^(−(V + V0)/20)
βh = 200w / [e^((−(V + V0) + 30)/10) + 1]
αp = 20w (−(V + Vp) + 25) / [e^((−(V + Vp) + 25)/10) − 1]
βp = 800w e^(−(V + Vp)/18)
αq = 14w e^(−(V + Vq)/20)
βq = 200w / [e^((−(V + Vq) + 30)/10) + 1]
w = 3^((T − 6.3)/10)

See Table 1 for parameter values.
the right-hand term in equation 2.9 is different from zero, resulting in an exponential decline of [Ca2+]i. The KCa-channels in our model are progressively activated by the buildup of [Ca2+]i (see equation 2.6 and Figure 2C). The activated KCa-channels cause a hyperpolarization of the plasma membrane. This results in an increase in the time interval between the action potentials and finally stops the spontaneous generation of action potentials. In the resulting silent period of membrane activity, [Ca2+]i decreases. As a result of this decrease, the fraction of activated KCa-channels P, which lags [Ca2+]i, decreases. This eventually leads to depolarization of the membrane, and action potential bursting is resumed. Thus, the periodic suppression of the spontaneous generation of action potentials as a result of the alternating activation and deactivation of the KCa-channels causes the bursting behavior of the membrane and the resulting [Ca2+]i oscillations.

3.2 Dynamic Analysis of Models. In order to obtain insight into the qualitative properties of our model and to make a comparison between the properties of our model and the properties of the Chay-Rinzel model, we have applied a fast-slow analysis of the various dynamic features of the model. The main difference between our model and the Chay-Rinzel model refers to the parameter for the slow process, which controls the bursting. In the Chay-Rinzel model, the slow parameter is [Ca2+]i, which is directly coupled to electrical membrane activity. The dashed line in the phase-plane plot in Figure 3A shows the steady states of the fast subsystem of the Chay-Rinzel model for fixed values of the control parameter [Ca2+]i. This curve has three branches, where the upper and lower branches represent the stable states.
Figure 3: Fast-slow analysis for the Chay-Rinzel model: trajectory (solid line), periodic branches (heavy lines), and steady states (dashed curve) in the [Ca2+]i-V phase plane (A); Ca2+ oscillations in a small range above basal level (B). For further explanation, see text.
The upper branch denotes the regime for stable oscillatory states (limit cycles). The maximum and minimum values of the membrane potential in this regime are denoted by the heavy lines. The middle branch represents the unstable states, and the lower branch represents the stable steady state. Bursting occurs when the fast subsystem jumps between the two stable states. The transitions are controlled by the slow parameter [Ca2+]i. During the active phase of the Chay-Rinzel model, there is an influx of Ca2+, and the cycles, corresponding to action potentials, continue. As a consequence, [Ca2+]i increases until the system crosses the middle branch representing the unstable states. Then the limit cycle behavior stops, and the system relaxes to the lower branch. Meanwhile [Ca2+]i decreases until the system leaves the lower branch and jumps to the upper branch, which initiates a new sequence of action potentials. This type of behavior has been classified as square bursting (Wang & Rinzel, 1995). Note that the changes in [Ca2+]i (see Figure 3B) lack an exponential decline phase that returns to basal level and a plateau phase, phases that are characteristic for the activity of the melanotrope cell. In our model, we are dealing with two slow parameters. The first is the parameter P, the slow parameter that controls the bursting; it depends on the second slow parameter, [Ca2+]i, which is in turn coupled to the electrical membrane activity. The 3D plot in Figure 4A reveals the relation between membrane potential V, [Ca2+]i, and P. Figure 4B shows the projection onto the P-V plane. As in Figure 3A, the dashed curve represents the stable and unstable steady states. Importantly, the trajectory in the P-V plane differs from the trajectory in Figure 3A. In Figure 3A, [Ca2+]i decreases between two action potentials, whereas in Figure 4B the slow parameter P increases steadily during the sequence of action potentials. Figure 4C, which gives the
Figure 4: Fast-slow subsystem analysis of our model. (A) 3D plot of the trajectory in P-[Ca2+]i-V phase space. (B) Projection of the trajectory (solid line) and steady states for different values of P (dashed curve) in the P-V plane. (C) Projection in the [Ca2+]i-V plane. (D) Projection in the P-[Ca2+]i plane. (E) Simulated Ca2+ oscillations.
relation between V and [Ca2+]i, is qualitatively similar for the Chay-Rinzel model and our model. However, in Figure 4C, the cycles tend to approximate a limit cycle in the [Ca2+]i-V plane. This limit cycle behavior in the [Ca2+]i-V plane corresponds to the plateau phase (see Figure 4E), where
the parameter P increases (see Figure 4D) until the trajectory in the P-V plane crosses the line with unstable steady states, which ends the oscillating pattern of action potentials. When the burst of action potentials stops, [Ca2+]i decreases exponentially according to equation 2.9. After the burst of action potentials, the parameter P continues for some time to increase due to the slow dynamics (see equation 2.6). After some time P starts to decrease (see Figure 4D) until the action potentials start again.

3.3 The Na+ Channel. In our model the Na+ channel is responsible for
the depolarization that leads to the abrupt transition of the [Ca2+]i signal from the decline to the rise phase. It opens at a membrane potential that is about 10 mV below that of the Ca2+ channel. Such a lower threshold is in agreement with experimental data (J. R. Lieste, unpublished data) and is implemented by a 10 mV larger voltage shift Vp for the activation variable p of the Na+ channel compared to the voltage shift V0 for the activation variable m of the Ca2+ channel. When the membrane is hyperpolarized after a burst of action potentials, the Na+ current together with the leak current starts to depolarize the membrane very slowly. As a result of the slow depolarization of the membrane, the Na+ current starts to increase slowly, while the Ca2+ current remains almost zero due to the higher threshold of the Ca2+ channels. At a certain point (near −54 mV) the Na+ current starts to increase more rapidly and thereby initiates a rapid depolarization. Once the depolarization reaches the threshold for the Ca2+ channels, the Ca2+ current is activated rapidly, resulting in a sharp increase in [Ca2+]i. Without the Na+ channel, the small depolarizing current at the onset of a new burst would be generated by the Ca2+ channel. In the case that the decline phase reaches such low levels that the efflux proportional to [Ca2+]i becomes very small, the Ca2+ influx due to this depolarizing current may exceed the efflux. This becomes evident as a smooth rising of [Ca2+]i, resulting in a smooth transition. Because the Chay-Rinzel model does not display exponential declines that come close to the basal level for [Ca2+] (see Figure 3B), the problem of abrupt transitions does not apply. Therefore, we compare our model with a minimal model for square-wave bursting (Rinzel & Ermentrout, 1998), which displays smooth transitions. This model is qualitatively the same as the Chay-Rinzel model but uses simpler kinetics for the fast subsystem.
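The 10 mV threshold difference can be made concrete by comparing the steady-state activation of the two gates. The helper below uses Hodgkin-Huxley-style rate expressions of the shifted form given in Table 2, with Vp = 60 mV and V0 = 50 mV from Table 1; the exact rate coefficients are reconstructed from the printed table and should be read as illustrative assumptions.

```python
import math

w = 3.0 ** ((17.0 - 6.3) / 10.0)   # temperature factor (T = 17 C, Table 1)

def act_inf(v, shift):
    """Steady-state activation of an m/p-type gate with voltage shift `shift`.
    Rate expressions follow the shifted Hodgkin-Huxley forms of Table 2;
    the exact coefficients are reconstructed and are assumptions here."""
    x = -(v + shift) + 25.0
    a = 20.0 * w * (x / (math.exp(x / 10.0) - 1.0) if abs(x) > 1e-7
                    else 10.0 - x / 2.0)
    b = 800.0 * w * math.exp(-(v + shift) / 18.0)
    return a / (a + b)

for v in (-60.0, -55.0, -50.0, -45.0):
    m_inf = act_inf(v, 50.0)   # Ca2+ channel activation gate (V0 = 50 mV)
    p_inf = act_inf(v, 60.0)   # Na+ channel activation gate (Vp = 60 mV)
    print(f"V = {v:6.1f} mV:  m_inf = {m_inf:.3f}  p_inf = {p_inf:.3f}")
```

Because p and m share the same rate functions up to the shift, p∞(V) = m∞(V + 10 mV): at any subthreshold potential the Na+ activation gate is the more open of the two, which is what lets INa supply the small depolarizing current at burst onset.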
Like the Chay-Rinzel model, the Rinzel-Ermentrout model has a voltage-dependent Ca2+ current as an inward current. The different behavior at the transition from decline to rise phase between the Rinzel-Ermentrout model and our model is illustrated best by looking at the [Ca2+]i-ICa phase plane. In Figure 5A we plot the trajectory of this model in the [Ca2+]i-ICa plane. In order to represent time in this graph, we added circles at equidistant time steps on top of the traces in the plane. When [Ca2+]i becomes small, the trajectory slows down, resulting in a higher density of circles. For small values of [Ca2+]i, the trajectory slowly increases its speed while [Ca2+]i is increasing again. This results in a smooth transition from the decline to the
Figure 5: Trajectory of the Morris-Lecar model in the [Ca2+]i-ICa plane (A), resulting in a smooth transition from decline to rise phase for [Ca2+]i (B). Trajectory for our model in the [Ca2+]i-I plane of ICa and INa (C), resulting in an abrupt transition from decline to rise phase for [Ca2+]i (D). For both models the circles in the trajectories are plotted with time steps Δt = 0.02 Tp, with Tp the period of one Ca2+ peak.
rise phase, as depicted in Figure 5B. For our model, both the trajectories for INa and ICa are plotted in the [Ca2+]i-I plane, with equidistant time steps, in Figure 5C. When [Ca2+]i decreases, both trajectories slow down. At some point (near [Ca2+]i = 0.05 μM) ICa remains fairly constant, while INa starts to increase slowly. Near [Ca2+]i = 0.04 μM, INa starts to increase rapidly. This is the start of a new action potential, which subsequently causes a rapid increase of ICa. This illustrates the abrupt transition from the decline to the rise phase, as shown in Figure 5D. Recent experiments with combined measurements of action currents and [Ca2+]i have revealed a blockage of electrical membrane activity of the cell and a return of [Ca2+]i to basal level in Na+-free medium (Na+ replaced by N-methyl-D-glucamine, NMDG) (Lieste et al., 1998) (see also Figure 6A). Under these conditions, electrical activity can be induced again by applying a depolarizing 20 mM K+ pulse extracellularly. This induced activity generates a large Ca2+ peak due to the continuous firing at a higher frequency
Figure 6: Effect of extracellular Na+, with or without a depolarizing K+ pulse, on the electrical activity of the plasma membrane and the [Ca2+]i (after Lieste et al., 1998) (A). Simulation of the experiment in Figure 6A; the Na+ removal is modeled by VNa = 0 mV and the depolarizing K+ pulse by VK = −68 mV (B).
than under normal conditions. After returning to Na+-free medium without 20 mM K+, the membrane activity is blocked again, and [Ca2+]i returns in an exponential way to its basal level. This experiment can be simulated with our model, as illustrated in Figure 6B. In the simulation, the removal of Na+ is modeled by setting the Nernst potential for Na+ to a value close to 0 mV. This is done because during the washout of the normal bathing solution with Na+-free medium, the concentration of Na+ is assumed to drop to very low values, similar to the concentration of Na+ inside the cell. This results in a blockage of the electrical activity and in a return of [Ca2+]i to basal level. The effect of the depolarizing K+ pulse is simulated by increasing the Nernst potential for K+ from −75 mV to −68 mV. Note that a 20 mM K+ pulse would lead to a larger shift in VK than that used in the simulation. However, recent experiments have confirmed that a weaker K+ pulse of 5 mM is sufficient to induce a Ca2+ transient in silent cells similar to that of Figure 6A (Scheenen, personal communication). The membrane starts to generate action potentials again, with a high frequency, resulting in a large Ca2+ peak. Setting VK to −75 mV again, but leaving VNa at 0 mV, stops the firing of action potentials and results in an exponential decline of [Ca2+]i to the basal level.

3.4 Parameter Dependence of the Ca2+ Oscillations. Various parameters in our model affect the shape and frequency of the Ca2+ oscillations. In this section we investigate the effect of four important parameters, uo and uc in equation 2.7, and kCa and f in equation 2.9, and to what extent these parameters can account for the variation of patterns in one cell and among different cells. In Figures 7 through 9 the [Ca2+]i is shown in the first column (A). The corresponding graphs of the fraction of open KCa-channels P are shown in the second column (B), and the membrane potential V is given in the third column (C). All parameter settings are the same as in Figure 2, unless explicitly stated otherwise. Figure 7 shows the results for different parameter values of uo. Since the rate of change of the fraction of active KCa-channels, dP/dt, is the sum of the positive term uo([Ca2+]i − [Ca2+]basal)(1 − P) and the negative term −ucP on the right-hand side of equation 2.6, increasing uo will have qualitatively the same effect as decreasing uc: an enhancement of the activation of the KCa-channel for the same levels of [Ca2+]i. Therefore, we discuss the case only for uo. For slow activation (uo = 0.005 (μM·s)⁻¹), the simulated [Ca2+]i shows a continuous sequence of Ca2+ steps, which, after some initial settling time, remains at a constant level above basal level. The mean level of [Ca2+]i reached after about 20 steps reflects a balance between influx and removal of Ca2+ in equation 2.9. The corresponding electrical activity is a continuous firing of action potentials. After the initial settling, the fraction of open channels is constant, which implies a balance between activation and deactivation in equation 2.6. The product of uo with the plateau level of [Ca2+]i results
Figure 7: Results for different values of parameter uo on the internal Ca2+ concentration (A), fraction of open KCa-channels (B), and the membrane potential (C).
in an activation of the KCa-channels that is large enough to decrease the frequency of the action potential firing but not large enough to stop membrane activity. For larger values of uo (uo = 0.008 (μM·s)⁻¹ and higher) the activation of the KCa-channel is increased, which leads to a hyperpolarization that is strong enough to end the burst of action potentials. This is necessary to obtain the bursting and oscillating behavior of the cell as described in the previous section. Increasing uo also decreases the number of steps in a Ca2+
Figure 8: Results for different values of parameter kCa on the internal Ca2+ concentration (A), fraction of open KCa-channels (B), and the membrane potential (C).
peak and increases the length of the decline phase. The latter is related to a shorter burst and a longer interburst interval of the membrane potential. The simulations for different removal rates kCa are shown in Figure 8. The upper panel of Figure 8 shows the [Ca2+]i in the case of fast pumping of Ca2+ (kCa = 9.92 s⁻¹). After an initial settling, it shows a continuous sequence of Ca2+ steps at a constant level above basal level. The pattern is very similar to the upper panel in Figure 7A, but the mean level of [Ca2+]i is lower. This is due to the stronger removal compared to that in Figure 7,
which settles the balance between influx and efflux in equation 2.9 at a lower [Ca2+]i level. The corresponding graph of the membrane potential (see Figure 8C) shows a continuous firing of action potentials. The fraction of activated KCa-channels is constant, and it is large enough to decrease the frequency of action potential firing but not large enough to stop membrane activity. A decrease of the removal rate kCa results in a smaller value of the second term, kCa([Ca2+]i − [Ca2+]basal), on the right-hand side of equation 2.9. The resulting smaller efflux allows a faster buildup of [Ca2+]i during the rise phase and a slower decrease of [Ca2+]i in the decline phase. This is reflected by a faster activation of the KCa-channel and a slower deactivation. As a consequence, the number of steps per peak and the frequency of the oscillations are reduced for smaller removal rates kCa. Moreover, the frequency of the Ca2+ oscillations is also decreased due to the higher levels of activation of the KCa-channels that are reached, which require more time to deactivate. The results for different values of parameter f are shown in Figure 9. In the case of strong buffering (small f) the change of free Ca2+, relative to the change of the total amount of Ca2+ in the cell, is small. Therefore, in this case, each action potential gives a small contribution to the buildup of the Ca2+ peak (see equation 2.9 and the upper panel of Figure 9). Due to this slow increase of [Ca2+]i the activation of the KCa-channels is slow, which causes long bursts of action potentials. The fact that f is small in equation 2.9 also accounts for the slow decrease of free Ca2+. This causes slow deactivation of the KCa-channels and therefore long interburst intervals. When the buffering becomes weaker (i.e., when f is increased; lower panels in Figure 9), the steps in the rise phase of the Ca2+ peaks increase, yielding a faster buildup of [Ca2+]i.
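The buffering interpretation of f can be checked with equation 2.11 and the Table 1 value f = 0.064; in the fast-buffering limit, the implied buffer-to-dissociation-constant ratio follows from simple arithmetic. The snippet below is a sanity check, not part of the model:

```python
# Fast-buffering limit of equation 2.11: 1/f ≈ 1 + [B]/K_B.
f = 0.064                          # value used in the simulations (Table 1)
b_over_kb = 1.0 / f - 1.0          # implied [B]/K_B

# Full form of equation 2.11, f^-1 = 1 + ([B]/K_B) / (1 + [Ca2+]i/K_B)^2,
# with concentrations expressed in units of K_B:
def f_full(b_over_kb, ca_over_kb=0.0):
    return 1.0 / (1.0 + b_over_kb / (1.0 + ca_over_kb) ** 2)

print(b_over_kb)                   # ~14.6: buffer capacity ~15x free Ca2+
print(f_full(b_over_kb))           # recovers 0.064 when [Ca2+]i << K_B
```

An effective buffer capacity of roughly 15 times the free Ca2+ is consistent with the "strong buffering means small f" reading in the text: the larger the buffer-to-KB ratio, the smaller the fraction of each Ca2+ influx that appears as a free [Ca2+]i step.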
This faster buildup leads to faster activation of the KCa-channels, resulting in shorter bursts. Furthermore, faster deactivation, due to the faster decrease of [Ca2+]i, yields shorter interburst intervals.

4 Discussion
We present a minimal model to explain the various characteristic features of Ca2+ oscillations and the electrical bursting of the neuroendocrine melanotrope cell in the pituitary gland of Xenopus Laevis. Two important aspects of the model are new compared to previous models for bursting cells. These aspects concern the relatively slow Ca2+-dependent Hodgkin-Huxley kinetics for the Ca2+-sensitive K+ channel, and the action of the Na+ channel. They are necessary to reproduce the specific features of the Ca2+ oscillations as measured in the Xenopus melanotrope cell. These features are (1) steps (typical range 1 to 10) in the rise phase of a Ca2+ peak, (2) a plateau phase at the end of the rise phase, (3) an exponential decline, (4) an abrupt transition from decline to rise phase, (5) an increase in the removal rate of Ca2+ during the rise phase, and (6) the coupling between the plasma membrane action potentials and the Ca2+ steps.
L. N. Cornelisse et al.
Figure 9: Effect of different values of parameter f on the internal Ca²⁺ concentration (A), the fraction of open KCa-channels (B), and the membrane potential (C).
Various models by Chay and Keizer (1983), Chay and Rinzel (1985), Chay (1990, 1996), and Canavier et al. (1991) have been proposed to explain the electrical bursting activity of the plasma membrane coupled to oscillations of [Ca²⁺]i. These models describe the electrical behavior of the various bursting cell types but pay little attention to specific characteristics of the Ca²⁺ oscillations. They have in common that the ion channels that hyperpolarize (depolarize) the plasma membrane in order to terminate (initiate) bursts of action potentials are directly activated (inactivated) by [Ca²⁺]i. This implies that their activity is in phase with [Ca²⁺]i. As a consequence,
these models can only describe Ca²⁺ oscillations in a relatively small range above basal level, which is the level to which [Ca²⁺]i returns after blocking the electrical activity of the membrane (Canavier et al., 1991: 100 up to 200 nM above basal level with amplitude of 100 up to 200 nM; Chay, 1996: 500 nM above basal level with amplitude of 100 nM; and Chay and Rinzel, 1985: 300 nM above basal level with amplitude of 20 nM). The Ca²⁺ oscillations in the Xenopus melanotrope cell have amplitudes on the order of 300 nM (Shibuya & Douglas, 1993), and they return to basal level after each Ca²⁺ peak (Lieste et al., 1998). In general, the buildup of a Ca²⁺ peak in the Chay and Canavier models is by a large number of steps corresponding to the number of action potentials in each burst. If the number of steps in the Ca²⁺ peaks is fewer than 10, the Ca²⁺ peaks show no plateaus. In the Xenopus melanotrope cell, plateaus are generally observed after three to four steps. In the Chay and Canavier models, it is not possible to vary the length of the decline phase without changing the slope of the exponential decline. However, in the Xenopus melanotrope, there is a strong variability in the lengths of the decline phases of the Ca²⁺ peaks, while the slopes can be described with the same exponential. The model we propose here is based on the Chay-Rinzel model (1985), modified in such a way that it describes the specific features of the Ca²⁺ oscillations in the melanotrope cell. The assumptions in our model and the parameter dependence of the Ca²⁺ oscillations will be discussed in the next two sections.

4.1 Assumptions of the Model.

The KCa-channel. We have replaced the KCa-channel in the Chay-Rinzel model by a KCa-channel that has Hodgkin-Huxley gating kinetics that are relatively slow with respect to the gating kinetics of the voltage-sensitive channels in the model.
The activation process of the KCa-channel in our model is purely Ca²⁺-dependent, with a relatively long time constant, such that the channel activity lags the changes in [Ca²⁺]i, resulting in a phase shift between [Ca²⁺]i and the slow KCa-channel activity. The proposed gating kinetics for the KCa-channel, which initiates and terminates the bursts of action potentials, lead to new results compared to the previous models for bursting cells by Chay and Rinzel (1985), Chay (1990, 1996), and Canavier et al. (1991): the model simulates Ca²⁺ peaks with a variable number of steps in the typical range of 1–10; the peaks start at basal level, approaching in some peaks, but not in all, a plateau level; and [Ca²⁺]i then returns exponentially to its basal level. Various mechanisms can account for the slow kinetics of the KCa-channel. A possible candidate for this mechanism is Ca²⁺-dependent phosphorylation of the channel, as proposed by Goldbeter (1996). In this terminology, P in equation 2.6 stands for the fraction of phosphorylated channels. The term u_o([Ca²⁺]i − [Ca²⁺]basal) is the Ca²⁺-dependent conversion rate from the dephosphorylated state to the phosphorylated state, which is in fact the
activity of the protein kinase. Parameter u_c is the conversion rate from the phosphorylated state to the dephosphorylated state, or the activity of the phosphatase. There is ample evidence that some KCa-channels can be regulated in this way. For instance, the α-subunit of the BK channel ("Big K⁺ channel") displays putative phosphorylation sites (Vergara, Latorre, Marrion, & Adelman, 1998). Furthermore, modulation of BK channels by phosphorylation was reported by Reinhart and Levitan (1995). They showed that the endogenous protein kinase activity is protein kinase C (PKC)-like. The increase in open probability due to phosphorylation occurs gradually over a period of minutes. This is of the same order of magnitude as the time constant for the channel kinetics of the KCa-channel in our model, which is given by τ_P = 1/(u_o([Ca²⁺]i − [Ca²⁺]basal) + u_c) and which is about 4 minutes. An explanation of the dynamics of the melanotrope cell required the introduction of the slow kinetics of the KCa-channel. The dynamics of this channel are roughly in agreement with the dynamics of phosphorylation. However, the simple first-order differential equations (equations 2.6 and 2.7) used to model this process may seem somewhat unrealistic. Since detailed experimental data are lacking, this simple approximation was sufficient for the purpose of this study and for this developing model of the Xenopus melanotrope cell.

The Na⁺ channel. In the models of Chay and Rinzel (1985) and Canavier et al. (1991), which lack a Na⁺ channel, the transitions from the decline to the rise phase are smooth for the Ca²⁺ oscillations that have an oscillation time of several seconds. The introduction into the model of a Na⁺ channel, with a lower threshold than the Ca²⁺ channel, explains the abrupt transition from the decline to the rise phase, as measured in the melanotrope cell. Independent evidence for a role of Na⁺ channels comes from experiments by Lieste et al.
(1998), who showed that removal of extracellular Na⁺ eliminates the electrical activity of the plasma membrane and blocks the Ca²⁺ oscillations. As a result, [Ca²⁺]i returns to basal level. Restoration of the electrical activity can be obtained by applying a depolarizing K⁺ pulse (extracellularly adding KCl), resulting in a large Ca²⁺ peak. We are able to simulate this experiment with the model, whereas models that lack a Na⁺ channel, and are therefore insensitive to Na⁺ removal, cannot reproduce these experimental results.

Ca²⁺ stores. In the cytoplasm, various stores are present that bind and release Ca²⁺, such as the endoplasmic reticulum and the mitochondria (Pozzan, Rizzuto, Volpe, & Meldolesi, 1994). Experiments in the Xenopus melanotrope cell in which these Ca²⁺ stores were manipulated show only minor modulatory effects on the Ca²⁺ oscillations, suggesting that stores are not involved in the generation of Ca²⁺ oscillations (Scheenen et al., 1994a, 1994b; Koopman et al., 1997). Therefore, our minimal model does not incorporate these Ca²⁺ stores as a source of Ca²⁺. In this sense, the cell can be considered a plasma membrane oscillator, because the Ca²⁺ oscillations are determined by the plasma membrane activity only.
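The slow KCa-channel kinetics invoked above can be sketched numerically. The sketch below assumes first-order phosphorylation kinetics of the form dP/dt = u_o([Ca²⁺]i − [Ca²⁺]basal)(1 − P) − u_c P, consistent with the verbal description of equations 2.6 and 2.7; the rate values are hypothetical placeholders, since this section does not quote them.

```python
def tau_p(ca, ca_basal, u_o, u_c):
    """Time constant of the slow K_Ca gating, as stated in the text:
    tau_P = 1 / (u_o * ([Ca]_i - [Ca]_basal) + u_c)."""
    return 1.0 / (u_o * (ca - ca_basal) + u_c)

def p_steady(ca, ca_basal, u_o, u_c):
    """Steady-state phosphorylated fraction of the assumed first-order scheme
    dP/dt = u_o*([Ca]_i - [Ca]_basal)*(1 - P) - u_c*P."""
    k_on = u_o * (ca - ca_basal)
    return k_on / (k_on + u_c)

# Hypothetical rates; raising [Ca]_i shortens tau_P and increases the
# phosphorylated fraction, i.e., the channel activates faster than it
# deactivates after [Ca]_i returns to basal level.
u_o, u_c, ca_basal = 0.5, 0.2, 0.1
```

With these placeholder numbers, tau_p falls and p_steady rises monotonically with [Ca²⁺]i, which is the qualitative behavior the discussion relies on.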
Ca²⁺ dynamics. The assumption in the model of Ca²⁺ removal proportional to [Ca²⁺]i, via the term kCa([Ca²⁺]i − [Ca²⁺]basal), is in good agreement with the experimental observations (Lieste et al., 1998). It adequately describes both the exponential decline and the increasing removal rate during the rise phase of the Ca²⁺ peak. Because the cell is modeled as one compartment, the distribution of Ca²⁺ is assumed to be homogeneous throughout the cell, also during changes in [Ca²⁺]i due to efflux and influx of Ca²⁺ across the membrane. This is a plausible assumption, considering that both the time interval between two steps and the time needed for Ca²⁺ to travel from the membrane throughout the whole cell are on the order of a second (Scheenen et al., 1996). Therefore, the distribution of Ca²⁺ in the cell will homogenize between two Ca²⁺ steps, which follow each other at intervals of about 1 second. In the model, the buffering of Ca²⁺ is incorporated in the parameter f, as described in equation 2.11. The exact buffer parameters for the Xenopus melanotrope cell are not known. However, we assume that these are not significantly different from the standard values in other cell types (Nowycky & Pinter, 1993). Therefore, we assume an equilibrium coefficient K_B of 5 μM and a buffer concentration [B] of 0.5 mM. This results in a value for f of nearly 0.01, which is close to the values used in our simulations.

4.2 Dependence of the Oscillation Pattern on u_o, u_c, kCa, and f.

In our model, four crucial parameters control the shape and frequency of the Ca²⁺ oscillations. Two of these, u_o and u_c, affect the length of the rise and the decline phase (and thus the frequency, amplitude, and number of steps in each Ca²⁺ peak) without changing the exponent of the decline phase. Decreasing (increasing) the removal rate kCa decreases (increases) the frequency, amplitude, number of steps per Ca²⁺ peak, and slope of the decline phase.
The parameter f, which expresses the change of free Ca²⁺ with respect to the change in the total intracellular Ca²⁺ concentration, affects the frequency, the number of steps per peak, and the amplitude of both the Ca²⁺ step and the Ca²⁺ peak. Summarizing, we have constructed a mathematical model that not only neatly simulates all characteristics of the Ca²⁺ oscillation pattern in the Xenopus melanotrope cell, but also provides the possibility of testing the effects of neuronal factors (first messengers, such as neurotransmitters and neuropeptides) on this pattern via the parameters u_o, u_c, and kCa. This approach is of biological relevance, as it is assumed that changes in the oscillation pattern reflect differential regulation, by various known regulatory neurotransmitters and neuropeptides (e.g., NPY, dopamine, GABA, TRH, CRF, and acetylcholine), of various cellular key processes, such as gene expression, protein biosynthesis, and secretion. We expect that application and further development of our model will stimulate neurobiological research on the role of neuronal factors in the control of cellular secretory processes.
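The buffer estimate quoted in section 4.1 (K_B = 5 μM, [B] = 0.5 mM, f ≈ 0.01) can be reproduced with the standard rapid-buffer approximation; whether the authors used exactly this formula is an assumption of this sketch.

```python
def free_ca_fraction(b_total_uM, kb_uM, ca_uM=0.0):
    """Rapid-buffer approximation for f, the fraction of a Ca2+ increment
    that remains free:
        f = 1 / (1 + [B] * K_B / (K_B + [Ca])**2).
    For [Ca] << K_B this reduces to 1 / (1 + [B]/K_B)."""
    return 1.0 / (1.0 + b_total_uM * kb_uM / (kb_uM + ca_uM) ** 2)

# [B] = 0.5 mM = 500 uM and K_B = 5 uM give f = 1/101, i.e., nearly 0.01
f = free_ca_fraction(b_total_uM=500.0, kb_uM=5.0)
```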
References

Artero, C., Fasolo, A., Andreone, C., & Franzoni, M. F. (1994). Multiple sources of the pituitary pars intermedia innervation in amphibians: A DiI retrograde tract-tracing study. Neuroscience Letters, 69, 163–166.
Berridge, M. J. (1998). Neuronal calcium signaling. Neuron, 21, 13–26.
Bito, H. (1998). The role of calcium in activity-dependent neuronal gene regulation. Cell Calcium, 23, 143–150.
Canavier, C. C., Clark, J. W., & Byrne, J. H. (1991). Simulation of the bursting activity of neuron R15 in Aplysia: Role of ionic currents, calcium balance, and modulatory transmitters. Journal of Neurophysiology, 66, 2107–2124.
Chay, T. R. (1990). Electrical bursting and intracellular Ca²⁺ oscillations in excitable cell models. Biological Cybernetics, 63, 15–23.
Chay, T. R. (1996). Modeling slowly bursting neurons via calcium store and voltage-independent calcium current. Neural Computation, 8, 951–978.
Chay, T. R., & Keizer, J. (1983). Minimal model for membrane oscillations in the pancreatic beta-cell. Biophysical Journal, 42, 181–190.
Chay, T. R., & Rinzel, J. (1985). Bursting, beating, and chaos in an excitable membrane model. Biophysical Journal, 47, 357–366.
Goldbeter, A. (1996). Function of the rhythm of intercellular communication in Dictyostelium: Link with pulsatile hormone secretion. In A. Goldbeter (Ed.), Biochemical oscillations and cellular rhythms (pp. 345–347). Cambridge: Cambridge University Press.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology (London), 117, 500–544.
Jenks, B. G., Buzzi, M., Dotman, C., Koning, K. H. De, Scheenen, W. J. J. M., Lieste, J. R., Leenders, H., Cruijsen, P., & Roubos, E. W. (1998). The significance of multiple inhibitory mechanisms converging on the melanotrope cell of Xenopus Laevis. Annals of the New York Academy of Sciences, 839, 229–234.
Koopman, W. J. H., Scheenen, W. J. J. M., Roubos, E. W., & Jenks, B.
G. (1997). Kinetics of calcium steps underlying calcium oscillations in melanotrope cells of Xenopus Laevis. Cell Calcium, 22, 167–178.
Lieste, J. R., Koopman, W. J. H., Reynen, V. J., Scheenen, W. J. J. M., Jenks, B. G., & Roubos, E. W. (1998). Action currents generate stepwise intracellular Ca²⁺ patterns in a neuroendocrine cell. Journal of Biological Chemistry, 273, 25686–25694.
Loh, Y. P., & Gainer, H. (1977). Biosynthesis, processing and control of release of melanotropic peptides in the neurointermediate lobe of Xenopus Laevis. Journal of General Physiology, 70, 37–58.
Maruthainar, K., Loh, Y. P., & Smyth, D. G. (1992). The processing of β-endorphin and α-melanotropin in the pars intermedia of Xenopus Laevis is influenced by background adaptation. Endocrinology, 135, 469–478.
Neher, E. (1998). Vesicle pools and Ca²⁺ microdomains: New tools for understanding their roles in neurotransmitter release. Neuron, 20, 389–399.
Nowycky, M. C., & Pinter, M. J. (1993). Time courses of calcium and calcium-bound buffers following calcium influx in a model cell. Biophysical Journal, 64, 77–91.
Pozzan, T., Rizzuto, R., Volpe, P., & Meldolesi, J. (1994). Molecular and cellular physiology of intracellular calcium stores. Physiological Reviews, 74, 595–636.
Reinhart, P. H., & Levitan, I. B. (1995). Kinase and phosphatase activities intimately associated with a reconstituted calcium-dependent potassium channel. Journal of Neuroscience, 15, 4572–4579.
Rinzel, J., & Ermentrout, B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 251–291). Cambridge, MA: MIT Press.
Roubos, E. W. (1997). Background adaptation by Xenopus Laevis: A model for studying neuronal information processing in the pituitary pars intermedia. Comparative Biochemistry and Physiology, 118A, 533–550.
Scheenen, W. J. J. M., Jenks, B. G., Roubos, E. W., & Willems, P. H. G. M. (1994a). Spontaneous calcium oscillations in Xenopus Laevis melanotrope cells are mediated by ω-conotoxin-sensitive calcium channels. Cell Calcium, 15, 36–44.
Scheenen, W. J. J. M., Jenks, B. G., Willems, P. H. G. M., & Roubos, E. W. (1994b). Action of stimulatory and inhibitory α-MSH secretagogues on spontaneous calcium oscillations in melanotrope cells of Xenopus Laevis. Pflügers Archiv, 427, 244–251.
Scheenen, W. J. J. M., Jenks, B. G., van Dinter, R. J. A. M., & Roubos, E. W. (1996). Spatial and temporal aspects of Ca²⁺ oscillations in Xenopus Laevis melanotrope cells. Cell Calcium, 19, 219–227.
Shibuya, I., & Douglas, W. W. (1993). Spontaneous cytosolic calcium pulsing detected in Xenopus melanotrophs: Modulation by secreto-inhibitory and stimulant ligands. Endocrinology, 132, 2166–2175.
Stojilkovic, S. S., & Catt, K. J. (1992). Calcium oscillations in anterior pituitary cells. Endocrine Reviews, 13, 256–280.
Stojilkovic, S. S., Tomic, M., Kukuljan, M., & Catt, K. J. (1994).
Control of calcium spiking frequency in pituitary gonadotrophs by a single-pool cytoplasmic oscillator. Molecular Pharmacology, 45, 1013–1021.
Vergara, C., Latorre, R., Marrion, N. V., & Adelman, J. P. (1998). Calcium-activated potassium channels. Current Opinion in Neurobiology, 8, 321–329.
Wang, X., & Rinzel, J. (1995). Oscillatory and bursting properties of neurons. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 686–691). Cambridge, MA: MIT Press.

Received May 25, 1999; accepted May 2, 2000.
LETTER
Communicated by Jack Cowan
Neural Field Model of Receptive Field Restructuring in Primary Visual Cortex

Katrin Suder
Florentin Wörgötter
Institute of Physiology, Department of Neurophysiology, Ruhr-University, D-44780 Bochum, Germany

Thomas Wennekers
Max-Planck-Institute for Mathematics in the Sciences, D-04103 Leipzig, Germany

Receptive fields (RFs) in the visual cortex can change their size depending on the state of the individual. This reflects a changing visual resolution according to different demands on information processing during drowsiness. So far, however, the possible mechanisms that underlie these size changes have not been tested rigorously. Only qualitatively has it been suggested that state-dependent lateral geniculate nucleus (LGN) firing patterns (burst versus tonic firing) are mainly responsible for the observed cortical receptive field restructuring. Here, we employ a neural field approach to describe the changes of cortical RF properties analytically. Expressions to describe the spatiotemporal receptive fields are given for pure feedforward networks. The model predicts that visual latencies increase nonlinearly with the distance of the stimulus location from the RF center. RF restructuring effects are faithfully reproduced. Despite the changing RF sizes, the model demonstrates that the width of the spatial membrane potential profile (as measured by the variance σ of a gaussian) remains constant in cortex. In contrast, it is shown for recurrent networks that both the RF width and the width of the membrane potential profile generically depend on time and can even increase if lateral cortical excitatory connections extend further than fibers from LGN to cortex. In order to differentiate between a feedforward and a recurrent mechanism causing the experimental RF changes, we fitted the data to the analytically derived point-spread functions.
Results of the fits provide estimates for model parameters consistent with the literature data and support the hypothesis that the observed RF sharpening is indeed mainly driven by input from the LGN, not by recurrent intracortical connections.

1 Introduction
Neural Computation 13, 139–159 (2000)
© 2000 Massachusetts Institute of Technology
K. Suder, F. Wörgötter, and T. Wennekers

Figure 1: (A, B) Peristimulus-time histograms of an LGN relay cell during different EEG states in an anesthetized cat. Insets show EEG traces (5 s). (C) Shrinking of the RF width of a cortical cell during a nonsynchronized EEG state. The cell starts firing roughly 50 ms after stimulus onset. The RF covers approximately 6 degrees at maximum and drops to less than 4 degrees afterward. Firing rates are gray-scale coded (0–40 I/s).

Receptive fields (RFs), defined as the retinal area where a stimulation of receptors causes a recorded neuron to respond, play a key role in identifying the functional mechanisms of visual neural circuits (Hubel & Wiesel, 1962; Levine & Shefner, 1991). For a long time, RFs have been envisaged as purely spatial phenomena with temporally fixed properties. In recent years, though, sophisticated reverse-correlation techniques have been developed to study spatial and temporal properties of RFs (DeAngelis, Ohzawa, & Freeman, 1993, 1995; Eckhorn, Krause, & Nelson, 1993). Elaborate physiological experiments now support the viewpoint that RFs are highly dynamic entities that can change due to cortex-intrinsic dynamical processes, context effects, attention, or synaptic plasticity (DeAngelis et al., 1993; Eckhorn et al., 1993; Gilbert, 1998; Shevelev, Volgulshev, & Sharaev, 1992). Recently, Wörgötter et al. (1998) demonstrated that spatial and temporal characteristics of RFs can, in addition, be modulated by the global state of the brain as, for instance, characterized by the electroencephalogram (EEG). It has been shown that cortical RFs in V1 are wider during synchronized EEG (dominated by α- and δ-waves) than during less or nonsynchronized β-EEG. Moreover, the RF width can shrink considerably over time in response to a flashed stimulus in nonsynchronized states (see Figure 1C and Wörgötter et al., 1998). It is believed that α- and δ-waves are more common during drowsiness and light sleep and that less synchronized fast rhythms are associated with alert states. Therefore, the experimental results suggest that information processing is adapted to different behavioral states by qualitative RF changes and that the spatial resolution of visual processing is particularly high in alert states. Different firing patterns of cells in the lateral geniculate nucleus (LGN) during the different EEG states have been suggested as the main mechanism for this restructuring. Figure 1A shows that during synchronized EEG, a
strong, contrast-independent burst component dominates the response to a flashed light spot. After that, only a small, persistent response remains, although the stimulus is still on. In contrast, during nonsynchronized EEG, LGN cells respond with a long-lasting tonic firing pattern at intermediate (contrast-dependent) rates, whereas the burst component is often diminished (see Figure 1B). Wörgötter et al. (1998) hypothesized that these state-dependent LGN firing patterns essentially drive the cortical field changes. Basically, it is assumed that the LGN cells excited by the flash stimulus innervate a local region in the visual cortex with decreasing (e.g., gaussian) effective strength from the projection center. Taking firing thresholds of cortical cells into account, it is then obvious that higher LGN firing rates are able to elicit spikes in a larger part of the cortical projection region than lower rates. Accordingly, the cortical point-spread function (and the cortical RF, respectively) will be broad during the initial burst component and will sharpen afterward, when LGN firing rates decline. During synchronized EEG, the late cortical response component can also be missing if LGN cells fire too seldom to raise cortical cells above threshold at all. We tested this hypothesis in a detailed biophysical model of the visual pathway (Wörgötter et al., 1998) comprising regular firing cells in area V1, two-state neurons in the LGN (burst and regular firing mode), and a modulatory influence of the perigeniculate nucleus (PGN), which is supposed to be involved in the control of the different excitability of LGN cells during drowsy and alert states (Ahlsén & Lo, 1982; Funke & Eysel, 1998; McCormick, 1992).
Although in this model particularly lateral connections in V1 and feedback from V1 to the LGN were considered, the simulations indicated that the experimentally observed cortical state dependencies are mainly due to a feedforward effect; feedback has only a weak modulatory influence on late-response components or destabilizes the network if it is too strong (see Wörgötter et al., 1998; Crick & Koch, 1998). In principle, the recurrent loop between V1 and LGN can broaden RFs in V1 further. However, this is in contrast to our experimental findings. Thus, recurrent connections do not seem to contribute significantly to the RF size effects described above. The relative influence of afferent versus cortex-intrinsic signals on the orientation tuning of cortical simple cells has been studied previously by several authors in experiments (Ferster, Chung, & Wheat, 1996; Henry, Michalski, Wimbourne, & McCart, 1994; Sato, Katsuyama, Tamura, Hata, & Tsumoto, 1996) as well as in simulations (Adorjan, Levitt, Lund, & Obermayer, 1999; Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Carandini & Ringach, 1997; Somers, Nelson, & Sur, 1995). It still remains controversial whether orientation tuning arises from already sharply tuned input to cortical cells, as originally suggested by Hubel and Wiesel (1962), or whether LGN input is only weakly tuned but amplified and sharpened by recurrent excitatory and inhibitory connections. Experimental evidence for both viewpoints exists (reviews in Sompolinsky & Shapley, 1997; Vidyasagar, Pei, & Vogulshev, 1996). In contrast, our experimental data and detailed simulations indicate evidence for some purely input-driven RF properties. To validate their feedforward nature more thoroughly, a phenomenological neural field model of the LGN-V1 projection is developed in this article (section 2). The advantage of this model as compared to biologically detailed ones is that it can be solved exactly. This way, explicit equations for cortical point-spread functions, latencies, and RF-width profiles become available (section 3). They can be compared with simulated profiles appearing in recurrent models (section 4) and, most important, can be fitted to the experimental data (section 5). The results of the fit in comparison with the two different model approaches (feedforward versus feedback) validate the feedforward ansatz.

2 Feedforward Neural Field Model
In this section, we develop a neural field approach to study the LGN-V1 projection. The simple structure of the resulting equations allows us to compute cortical response functions analytically for input-driven activity in V1. Neural fields have been used for similar purposes, although usually in simulation studies (Giese, 1999; Krone, Mallot, Palm, & Schüz, 1986; Mineiro & Zipser, 1998; Sabatini & Solari, 1999; Wilson & Cowan, 1973; Wörgötter, Niebur, & Koch, 1991). The purpose of this section is to study the input-driven dynamics of V1. Therefore, we consider just one field V(x, t) for V1 and neglect recurrent connections inside it (the latter are included in section 4). For convenience, we further idealize V1 as a one-dimensional field, x ∈ ℝ, and assume that the activity in the LGN and V1 is spatiotemporally separable, V(x, t) = X(x)T(t) (see DeAngelis et al., 1995; Mineiro & Zipser, 1998). Including a leakage term, the temporal development of the membrane potential V is given by

    τ ∂V(x, t)/∂t = −V(x, t) + ∫_{−∞}^{+∞} K(x − x′) I_syn(x′, t) dx′,    (2.1)

where τ is a phenomenological time constant. The kernel K(x) describes the synaptic feedforward projection from the LGN input I to cortex. We choose a gaussian connectivity profile,

    K(x) = (K₀ / √(2π)) e^{−x² / (2σ₀²)},    (2.2)

with effective synaptic strength K₀/√(2π) and width σ₀. This represents a single on- or off-subfield. Simple cell RFs consisting of several subfields can be represented by superposition of responses of the form of equation 2.1 with appropriate kernels.
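A direct numerical integration of equations 2.1 and 2.2 can serve as a check on the analytical results derived below. The following sketch uses simple Euler stepping and a discrete convolution; the grid sizes and the test input are arbitrary choices made here, not taken from the paper.

```python
import numpy as np

def simulate_field(I_syn, x, t, tau=10.0, K0=1.0, sigma0=1.7):
    """Euler integration of equation 2.1,
        tau * dV/dt = -V + integral K(x - x') I_syn(x', t) dx',
    with the gaussian kernel of equation 2.2 sampled on the grid x.
    I_syn has shape (len(t), len(x)); returns V of the same shape."""
    dx = x[1] - x[0]
    dt = t[1] - t[0]
    disp = x - x[len(x) // 2]          # displacements centered on the grid
    K = (K0 / np.sqrt(2.0 * np.pi)) * np.exp(-disp**2 / (2.0 * sigma0**2))
    V = np.zeros((len(t), len(x)))
    for n in range(1, len(t)):
        drive = dx * np.convolve(I_syn[n - 1], K, mode="same")
        V[n] = V[n - 1] + (dt / tau) * (-V[n - 1] + drive)
    return V
```

For a spatially uniform, sustained input, the potential at the grid center should saturate at ∫K(x)dx = K₀σ₀ (1.7 with the default values), which provides a simple sanity check.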
Receptive Field Restructuring in Primary Visual Cortex
143
Figure 2: (A) Temporal input function I_nst(t) used to simulate LGN firing during nonsynchronized EEG. Observe the two idealized components representing the burst and tonic phase (see equation 2.5). (B) Simulation of the cortical membrane potential V(x, t) in the nonsynchronized state. Simulation parameters are σ₀ = 1.7°, σ₁ = 0.5°, τ = 10.0 ms, t₀ = 0 ms, t₁ = 40 ms, t₂ = 300 ms, c₁ = 80 I/s, c₂ = 40 I/s.
The synaptic input currents I_syn in V1 are fully described by the activity of the LGN cells I(x, t), which is gaussian in space and square wave in time (see below). Detailed dynamical processes in the LGN are not explicitly modeled but considered in the form of phenomenological, spatiotemporally separable input functions I(x, t) = I_x(x) I_t(t). According to the experimental input in the form of small, light spots, the spatial input I_x has a gaussian shape in our model, which represents a localized activity profile in the LGN. The temporal input component I_t contains the EEG-state dependence and approximates the experimentally observed temporal firing patterns of LGN cells (see Figures 1A and 1B):

    I_x(x) = e^{−x² / (2σ₁²)}    (2.3)
    I_st(t) = c₁ H(t − t₀) H(t₁ − t)    (2.4)
    I_nst(t) = I_st(t) + c₂ H(t − t₁) H(t₂ − t).    (2.5)

Here, H(t) is the Heaviside function. The function I_st(t) describes the high-frequency burst of spikes in the synchronized EEG in the form of a rectangular pulse of strength c₁ lasting from t = t₀ to t₁. I_nst(t) describes the nonsynchronized case and contains, in addition, a tonic component of height c₂ (< c₁) lasting from t₁ to t₂ (see Figure 2A). Firing rates R are derived from the membrane potential V by means of a rectifying function,

    R = [βV − θ]₊,    (2.6)

where β is the neuronal gain (in spikes/mV/s) and θ the firing threshold. The function [x]₊ is zero for x ≤ 0 and equal to x above zero.
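The input functions and the rectifier (equations 2.3–2.6) are simple enough to transcribe directly; the parameter defaults below follow the Figure 2 caption. The convention H(0) = 1 is a choice made here, since the text does not fix the value of the Heaviside function at zero.

```python
import numpy as np

def H(t):
    """Heaviside step function; H(0) = 1 by the convention chosen here."""
    return np.where(np.asarray(t) >= 0, 1.0, 0.0)

def I_x(x, sigma1=0.5):
    """Spatial input profile, equation 2.3."""
    return np.exp(-np.asarray(x)**2 / (2.0 * sigma1**2))

def I_st(t, c1=80.0, t0=0.0, t1=40.0):
    """Burst component (synchronized EEG), equation 2.4."""
    return c1 * H(t - t0) * H(t1 - t)

def I_nst(t, c1=80.0, c2=40.0, t0=0.0, t1=40.0, t2=300.0):
    """Burst plus tonic component (nonsynchronized EEG), equation 2.5."""
    return I_st(t, c1, t0, t1) + c2 * H(t - t1) * H(t2 - t)

def rate(V, beta=1.0, theta=0.0):
    """Rectified firing rate, equation 2.6: R = [beta*V - theta]_+."""
    return np.maximum(beta * np.asarray(V) - theta, 0.0)
```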
Figure 2B shows a simulation of the model. The cortical membrane potential V is plotted as a function of space and time for the nonsynchronized state. In accordance with the experimental data (see Figure 1C), a phasic peak is followed by a tonic component with a reduced amplitude. Furthermore, a restructuring of the potential takes place between 50 and 70 ms.

3 Response Function and Equipotential Lines
We now derive analytical expressions for the cortical membrane response. Because we assume stimulation by small, light spots, V(x, t) can be interpreted as the cortical point-spread function or, in the light of the linearity of the membrane and the spatial homogeneity of the model, as the spatiotemporal RF of the model cells. Inserting the assumptions 2.2 and 2.3–2.5 into equation 2.1 and integrating, one obtains the spatial and temporal components X(x) and T(t). Thereby, we set V(x, t₀) = 0, which means that potentials are measured relative to a resting potential of zero. X(x) is a convolution of two gaussians: the input distribution I_x(x) and the feedforward kernel K(x). Thus,

    X(x) = (K₀ σ₀ σ₁ / √(σ₀² + σ₁²)) e^{−x² / (2(σ₀² + σ₁²))} = (K₀ σ₀ σ₁ / σ_r) e^{−x² / (2σ_r²)} ≈ K₀ σ₁ e^{−x² / (2σ₀²)},    (3.1)
where σ_r² := σ₀² + σ₁² is the resulting width of the cortical membrane potential. The approximation holds for small stimuli, σ₁ ≪ σ₀, that is, for point stimulation. Within this limit, larger stimuli, σ₁, increase potential values approximately linearly, but the width of the cortical response is determined only by the width of the projection from LGN to cortex, σ₀. The temporal amplitude factor T(t) represents the response of a low-pass filter to the input I_t(t). In the nonsynchronized state, I_t(t) = I_nst(t), one gets

    T(t) = 0 : t < t₀
    T(t) = c₁ (1 − e^{−(t−t₀)/τ}) : t₀ ≤ t < t₁
    T(t) = c₂ − c₁ e^{−(t−t₀)/τ} + (c₁ − c₂) e^{−(t−t₁)/τ} : t₁ ≤ t < t₂
    T(t) = c₂ e^{−(t−t₂)/τ} − c₁ e^{−(t−t₀)/τ} + (c₁ − c₂) e^{−(t−t₁)/τ} : t₂ ≤ t.    (3.2)

The synchronized response T(t) is obtained from equation 3.2 by setting the tonic component to zero, that is, c₂ = 0. The time course of T is determined by the height (c₁) and duration (t₀–t₁) of the burst and the tonic component (c₂, t₁–t₂), and by the time constant τ. Until now, potentials, and not firing rates, as typically observed in experimental recordings, were considered. To obtain rates from the membrane potential, a threshold function has to be introduced (cf. equation 2.6). A useful concept in this context are lines of equal potential, defined by

    V(x, t) = X(x)T(t) = κ = constant.    (3.3)
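The closed form of equation 3.2 can be cross-checked against a direct Euler integration of the low-pass dynamics τ dT/dt = −T + I_nst(t). The sketch below transcribes the four branches; parameter values follow the Figure 2 caption.

```python
import numpy as np

def T_closed(t, c1=80.0, c2=40.0, t0=0.0, t1=40.0, t2=300.0, tau=10.0):
    """Piecewise closed form of equation 3.2 (nonsynchronized input)."""
    t = np.asarray(t, dtype=float)
    e = lambda s: np.exp(-(t - s) / tau)   # decay factor since time s
    out = np.zeros_like(t)
    m1 = (t >= t0) & (t < t1)
    m2 = (t >= t1) & (t < t2)
    m3 = t >= t2
    out[m1] = (c1 * (1.0 - e(t0)))[m1]
    out[m2] = (c2 - c1 * e(t0) + (c1 - c2) * e(t1))[m2]
    out[m3] = (c2 * e(t2) - c1 * e(t0) + (c1 - c2) * e(t1))[m3]
    return out
```

At the branch boundaries t₁ and t₂ the expression is continuous, and for t → ∞ it decays to zero, as it must for a leaky integrator driven by a transient input.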
Relation 3.3 can be solved for either x = x(t; κ) or t = t(x; κ), giving the equipotential lines in parameterized form. Of particular interest is the case where κ equals the firing threshold θ of cortical cells (which is assumed to be the same for all cells; the exact form of the output function above threshold does not matter at this point). Then, w(t) := x(t; κ = θ) describes the time course of the boundary between silent (subthreshold) and firing (superthreshold) cells. This is equivalent to the width of spatiotemporal RFs as observed experimentally by extracellular recordings of firing rates (see also Figure 3A). Inserting the spatial profile X(x) from equation 3.1 into equation 3.3 and isolating x, we get

    x²(t; κ) = 2σ_r² ln( K₀ σ₀ σ₁ T(t) / (κ σ_r) ) ≈ 2σ₀² ln( K₀ σ₁ T(t) / κ )  for σ₁ ≪ σ₀.    (3.4)
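Equation 3.4 (small-stimulus form) gives the superthreshold half-width directly. The sketch below evaluates it for the burst phase (T ≈ c₁) and the tonic phase (T ≈ c₂) using the Figure 2 parameters; the threshold value θ = 5 is a hypothetical choice made here, not one given in the text.

```python
import numpy as np

def rf_half_width(T_amp, theta, K0=1.0, sigma0=1.7, sigma1=0.5):
    """Half-width of the superthreshold region from equation 3.4
    (limit sigma1 << sigma0): x^2 = 2*sigma0^2 * ln(K0*sigma1*T/theta).
    Returns 0.0 when even the peak of the profile is below threshold."""
    arg = K0 * sigma1 * T_amp / theta
    return sigma0 * np.sqrt(2.0 * np.log(arg)) if arg > 1.0 else 0.0

w_burst = rf_half_width(T_amp=80.0, theta=5.0)  # T ~ c1 during the burst
w_tonic = rf_half_width(T_amp=40.0, theta=5.0)  # T ~ c2 during tonic firing
```

The RF is wider in the burst phase and sharpens once the amplitude drops to the tonic level; at the returned half-width, X(x)T equals θ exactly.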
The time dependence enters into equation 3.4 via the amplitude function T(t). Using equation 3.2 for T(t), we obtain the RF width for flashed stimuli. Examples, shown in Figure 3B, reveal that the width of the (superthreshold) RF, w(t), indeed changes with time as in the experiments, although the width σ_r of the (subthreshold) potential profile is constant (see equation 3.1). The reason for this difference is that the spatial profile X(x) is simply modulated by the multiplicative amplitude function T(t). Then, as long as the firing threshold of cortical cells, θ, is constant, the width of the region of cells above threshold naturally varies, but the width of X(x), of course, does not (see Figure 3A). w(t) = x(t; κ = θ) from equation 3.4 predicts the time course of the RF modulation. Equation 3.4 gives a predictable dependence of the RF boundary on the stimulus size, σ₁, which could be tested experimentally. On the other hand, variation of K₀, κ, or σ₀ is certainly more complicated, if possible at all. The amplitude function, T(t), however, could be systematically modified by the temporal input I_t(t). This may yield further information about network parameters. Solving equation 3.3 for t(x), we get the times when the cortical cells cross a threshold level κ:

t(x; κ) =
    t₀ − τ ln[1 − κ/(c₁X(x))],                                        t₀ < t ≤ t₁,
    −τ ln[(κ − c₂X(x)) / (X(x)(−c₁e^{t₀/τ} + (c₁ − c₂)e^{t₁/τ}))],    t₁ < t ≤ t₂,
    −τ ln[κ / (X(x)(c₂e^{t₂/τ} − c₁e^{t₀/τ} + (c₁ − c₂)e^{t₁/τ}))],   t₂ < t.  (3.5)
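As a concrete illustration of equations 3.2 and 3.4, the following Python sketch evaluates the temporal amplitude T(t) and the resulting RF half-width w(t). All parameter values are made up for illustration, and σ_r = √(σ₀² + σ₁²) is assumed here for the width of the convolved profile:

```python
import math

# Illustrative parameters (not the fitted values from the paper)
K0, sigma0, sigma1 = 1.0, 3.0, 0.5    # input gain, kernel and stimulus widths [deg]
sigma_r = math.hypot(sigma0, sigma1)  # assumed width of the convolved profile
tau, t0, t1 = 10.0, 0.0, 50.0         # membrane time constant, burst interval [ms]
c1, c2 = 10.0, 2.5                    # LGN drive during burst and tonic phase
theta = 1.0                           # firing threshold

def T(t):
    """Piecewise temporal amplitude (equation 3.2): rise during the burst,
    relaxation toward the tonic level afterwards."""
    if t <= t0:
        return 0.0
    if t <= t1:
        return c1 * (1.0 - math.exp(-(t - t0) / tau))
    Tb = c1 * (1.0 - math.exp(-(t1 - t0) / tau))   # amplitude at end of burst
    return c2 + (Tb - c2) * math.exp(-(t - t1) / tau)

def rf_width(t, kappa=theta):
    """Half-width of the superthreshold region (equation 3.4)."""
    arg = K0 * sigma0 * sigma1 * T(t) / (kappa * sigma_r)
    return math.sqrt(2.0 * sigma_r**2 * math.log(arg)) if arg > 1.0 else 0.0

print(rf_width(40.0), rf_width(200.0))   # RF shrinks although sigma_r is fixed
```

Because T(t) drops from the burst to the tonic level while σ_r and the threshold stay fixed, the superthreshold width shrinks: the iceberg effect described in the text.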
Times t(x; κ) in the range t₀ < t ≤ t₁ describe cortical latencies (i.e., the threshold is crossed from below), times in the range t₁ < t ≤ t₂ are off-times due to decreasing potentials when the LGN firing rates switch from high (c₁, bursts) to low (c₂, tonic firing), and the last regime, t₂ < t, characterizes
Figure 3: (A) Illustration of the mechanism leading to a sharpening of the RF. The width of the cortical membrane potential, σ_r, is the same during the burst and tonic phase, but the width of the firing-rate profile w varies due to the LGN-driven amplitude modulation of the region of cells above threshold, θ. (B, C) Simulation of equipotential lines x(t; κ) and t(x; κ) for different threshold levels κ. For κ = θ such borders between subthreshold and superthreshold cells define the dynamic RF width w(t) = x(t; κ = θ) and the firing latencies t(x; θ). Note that x(t; κ) shrinks with the transition from burst to tonic phase around 50 ms. Simulation parameters are the same as for Figure 2.
off-times when the LGN activity is over. Some curves for cortical latencies are depicted in Figure 3C. Inserting equation 3.1 for the spatial profile X(x), a useful approximation for these cortical onset times can be given:

t(x) − t₀ ≈ κτ/(c₁X(x)) ≈ [κτ/(K₀c₁σ₁)] (1 + x²/(2σ₀²)).  (3.6)
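The quality of this approximation can be checked numerically against the first branch of equation 3.5. The parameter values below are illustrative, chosen so that σ₁ ≪ σ₀ and κ ≪ c₁X(x):

```python
import math

# Check of equation 3.6 against the first branch of equation 3.5.
K0, sigma0, sigma1 = 1.0, 3.0, 0.5   # illustrative values, sigma1 << sigma0
c1, tau, kappa = 10.0, 10.0, 0.2     # burst drive, time constant [ms], threshold

def X(x):
    # Spatial profile of equation 3.1, with sigma_r ~ sigma0 for sigma1 << sigma0
    return K0 * sigma1 * math.exp(-x**2 / (2.0 * sigma0**2))

rows = []
for x in (0.0, 0.5, 1.0):
    exact = -tau * math.log(1.0 - kappa / (c1 * X(x)))   # eq. 3.5, first branch
    approx = (kappa * tau / (K0 * c1 * sigma1)) * (1.0 + x**2 / (2.0 * sigma0**2))
    rows.append((x, exact, approx))
print(rows)   # exact and approximate latencies agree closely for small x
```

For x up to about σ₀/3 the two expressions agree to within a few percent; the approximation degrades as x approaches σ₀.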
This means that as long as x ≪ σ₀, the latencies increase quadratically with distance x. A correlation between firing times and distance from the RF center has indeed been observed, but the details of its dependence are still under examination (Bringuier, Chavane, Glaeser, & Frégnac, 1999; Celebrini, Thorpe, Trotter, & Imbert, 1993; Dinse & Krüger, 1994; Ikeda & Wright, 1975; Wörgötter, Opara, Funke, & Eysel, 1996). Equation 3.6 further shows that the latencies are shorter for stronger cortical inputs (K₀c₁) or larger stimuli (σ₁), and that they increase with the firing threshold κ and with the membrane time constant τ, respectively.

4 Recurrent Neural Field Model
In this section we qualitatively explore the influence of recurrent intracortical connections on the cortical potential by adding a feedback loop to the simple feedforward model:

τ ∂V(x, t)/∂t = −V(x, t) + X(x)I_t(t) + ∫_{−∞}^{∞} K_DOG(x − x′) R(V(x′, t)) dx′.  (4.1)

The cortical connection kernel K_DOG is chosen as a difference of gaussians to include excitatory and inhibitory feedback:

K_DOG(x) = (K_exc/√(2π)) e^{−x²/(2σ_exc²)} − (K_inh/√(2π)) e^{−x²/(2σ_inh²)}.  (4.2)
Parameters are such that K_DOG has a Mexican hat profile (see below). The rate function R(V) in equation 4.1 is zero for V ≤ 0 and equal to βV for V > 0 (cf. equation 1.6). β is the neuronal gain (in spikes/s/mV). Input from the LGN is the same as in the feedforward model. Note, however, that in equation 4.1 we have already inserted the total spatial input X(x) into cortex, that is, the spatial convolution of the LGN activity I_x(x), equation 1.3, and the feedforward kernel, equation 1.2, from LGN to cortex. The temporal input component I_t in equation 4.1 is again given by equation 1.5. Similar models have been investigated recently in the context of orientation tuning in V1 (Adorjan et al., 1999; Ben-Yishai et al., 1995; Carandini & Ringach, 1997). Due to the nonlinearity, analytical solutions of equation 4.1 cannot be given. Therefore, we simulate equation 4.1 and discuss the qualitative differences that appear in contrast to the simple feedforward model. As parameters, we choose τ = 10 ms, t₀ = 0 ms, t₁ = 50 ms, t₂ = 300 ms and (somewhat arbitrarily) σ_r = 3°, σ_exc = 0.7°, σ_inh = 3.0°, K_exc β = 2.0 mV/deg, and K_inh β = 0.5 mV/deg. Furthermore, we define k := K₀σ₀σ₁/σ_r, C₁ = kc₁, C₂ = kc₂ and choose the effective cortical inputs C₁ = 10 mV and C₂ = 2.5 mV for the burst and the tonic phase. With these parameters, the network operates in a regime of cortical amplification, as it has been proposed for
Figure 4: (A) Simulation of the recurrent neural field model. (B) Time course of the width of the potential V(x, t) for the data shown in A.
recurrent orientation tuning models (Adorjan et al., 1999; Ben-Yishai et al., 1995; Carandini & Ringach, 1997; Somers et al., 1995). Figure 4A shows that the simulated membrane potential looks quite similar to the response of the simple feedforward model (see Figure 2B) and also to the experimental data (see Figure 1C). Again, a strong and wide component dominates during the first 50 ms and is followed by a weaker and smaller tonic component. This behavior is typical for large parameter regimes as long as the cortical gain is not too high, recurrent net excitation and inhibition are of roughly the same order, and LGN firing rates in the burst phase are significantly larger than in the tonic phase. Although at first glance the potential V(x, t) looks similar in the feedforward and the recurrent models, there is nonetheless a subtle but important difference. This becomes visible if we look at the width of the simulated potential profile. As emphasized earlier, the width is constant in time and equal to σ_r in the feedforward model. Strictly speaking, the width is not well defined in the recurrent model, because the potential profile is no longer gaussian. We can, however, fit the central peak of the profile to a gaussian function to obtain an effective width, σ(t). Figure 4B shows σ(t) obtained this way for the data in Figure 4A. Obviously the width now depends on time, in strong contrast to the feedforward case. These temporal changes are, of course, due to the feedback connections. At any instant of time, the potential can be roughly envisaged as a superposition of components resulting from the LGN input and the excitatory as well as the inhibitory feedback. The relative contribution of these components changes with time. Initially, cortical cells are silent. Accordingly, the feedback is zero, and only the input determines the width of the potential distribution at the very early response. Thus, σ(t = 0) = σ_r.
As the firing activity rises, the width σ declines, because the excitatory recurrent connections are relatively strong and sharply tuned (σ_exc = 0.7 degree), while input and inhibition are not (σ_r = σ_inh = 3.0 degrees). This way, the intracortical amplification leads to a gradual change from wide to sharply tuned spatial profiles.
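A minimal explicit-Euler sketch of equations 4.1 and 4.2 reproduces this qualitative behavior. Two assumptions of this sketch differ from the text: the feedback gains are chosen somewhat weaker than the quoted values so that the scheme stays comfortably stable, and the gaussian fit of the central peak is replaced by a simple second-moment width estimate of the rectified profile:

```python
import math

# Euler sketch of the recurrent field model (equations 4.1 and 4.2).
# beta is folded into the kernel amplitudes; values below are assumptions.
dx, dt, T_end = 0.3, 1.0, 200.0
tau, t0, t1 = 10.0, 0.0, 50.0            # time constant, burst interval [ms]
sigma_r, sig_e, sig_i = 3.0, 0.7, 3.0    # input, excitatory, inhibitory widths
Ke_b, Ki_b = 1.5, 0.5                    # K_exc*beta, K_inh*beta (reduced here)
C1, C2 = 10.0, 2.5                       # effective input, burst vs. tonic [mV]

n = int(10.0 / dx)
xs = [i * dx for i in range(-n, n + 1)]
SQ = math.sqrt(2.0 * math.pi)

def gauss(x, s):
    return math.exp(-x**2 / (2.0 * s**2))

# Precompute the difference-of-gaussians feedback kernel (equation 4.2)
Kmat = [[(Ke_b * gauss(xi - xj, sig_e) - Ki_b * gauss(xi - xj, sig_i)) / SQ * dx
         for xj in xs] for xi in xs]

def It(t):   # LGN drive: strong during the burst, weaker tonic afterwards
    return C1 if t0 < t <= t1 else (C2 if t > t1 else 0.0)

def eff_width(V):
    """Effective width via the second moment of the rectified profile
    (a stand-in for the gaussian fit used in the text)."""
    m0 = sum(max(v, 0.0) for v in V)
    if m0 == 0.0:
        return sigma_r
    m2 = sum(max(v, 0.0) * x * x for v, x in zip(V, xs))
    return math.sqrt(m2 / m0)

V = [0.0] * len(xs)
widths = {}
t = 0.0
while t < T_end:
    R = [max(v, 0.0) for v in V]                      # rate function beta*[V]+
    fb = [sum(k * r for k, r in zip(row, R)) for row in Kmat]
    V = [v + dt * (-v + gauss(x, sigma_r) * It(t) + f) / tau
         for v, x, f in zip(V, xs, fb)]
    t += dt
    widths[round(t, 1)] = eff_width(V)

print(widths[5.0], widths[190.0])   # the width narrows once feedback engages
```

With a narrow excitatory and a broad inhibitory kernel, the effective width starts near σ_r and then declines as the recurrent activity builds up, qualitatively matching Figure 4B.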
The parameter dependence of the time course of σ is complex and cannot be given in closed form. What is important is that a time dependence occurs generically in feedback models for almost all parameter values. This is true whether the model is built of two mutually connected fields that represent excitatory and inhibitory cortical cells separately or of just a single layer like the feedback model in equation 4.1. The time dependence does not have to take the form of a shrinkage of the width; it is also possible to obtain an increasing σ(t) if σ_exc is considerably larger than σ_r and σ_inh. Anatomical results, however, indicate that the Mexican hat profile with σ_inh > σ_exc is the biologically most realistic one (Salin & Bullier, 1995). In that case a time-varying width must appear in feedback models in response to flashed stimuli. This means that a constant width of the potentials, as found in the feedforward model, can be expected only for very special, unrealistic, and nongeneric conditions in a feedback model. Therefore, a constant σ(t) in the experimental data would be a strong indicator for a pure feedforward mechanism of the observed RF restructuring.

5 Fit to Experimental Data
We now fit the feedforward model to the experimental data to test whether it can accurately describe the data and to estimate the model parameters. Firing rates ν(x, t) of on- and corresponding off-subfields of 16 V1 cells recorded during both states of the EEG were analyzed. Each field was sampled at 20 positions with 0.5 degree resolution and for 30 time slices of 10 ms bin size. For the fit, the potentials V(x, t) are supposed to transform into firing rates by means of a rectilinear function according to equation 2.6: R = f(V) = [βV − θ]₊ + b, where b accounts for spontaneous background firing. Then the experimentally derived firing-rate data can be fitted to obtain parameters of the underlying potential distribution. Note, however, that the analytical solution for the potential V(x, t) contains products of model parameters (see equations 3.1 and 3.2). This implies that we cannot determine all model parameters independently. We proceed in two steps: for every time slice t_i, we first determine the parameters of the spatial profile X(x) by nonlinear least-square fits:

ν_fit(x, t_i) = [q(t_i) e^{−(x−a)²/(2σ_r²)} − θ]₊ + b,  i = 1, 2, …, 30.  (5.1)
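The first step of this procedure can be sketched on synthetic data. The text uses the Levenberg-Marquardt algorithm; here a coarse grid search stands in for it, and all numbers are made up. The sketch recovers q and σ_r from rates generated by the rectified gaussian of equation 5.1 with a = 0:

```python
import math

# Synthetic "measured" rates from the rectified gaussian of equation 5.1
# (a = 0 for simplicity; q_true, sr_true, theta, b are made-up values).
q_true, sr_true, theta, b = 60.0, 2.0, 5.0, 3.0
xs = [-5.0 + 0.5 * i for i in range(21)]   # 0.5 deg sampling, as in the text

def model(x, q, sr):
    return max(q * math.exp(-x**2 / (2.0 * sr**2)) - theta, 0.0) + b

data = [model(x, q_true, sr_true) for x in xs]

# Step 1: recover q and sigma_r for one time slice by least squares.
# (A coarse grid search replaces Levenberg-Marquardt in this sketch.)
best = min(((sum((model(x, q, sr) - y)**2 for x, y in zip(xs, data)), q, sr)
            for q in [30 + 2 * i for i in range(31)]
            for sr in [1.0 + 0.1 * j for j in range(21)]))
sse, q_fit, sr_fit = best
print(q_fit, sr_fit)   # recovers the generating parameters on the grid
```

The fitted q(t_i), collected over all time slices, would then be passed to the second, temporal fitting step described below.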
Here, a is an (arbitrary) offset of the RF center, and the q(t_i) are treated as fit parameters. They are proportional to the amplitude factor T(t_i), that is, q(t_i) = K₀σ₀σ₁β/σ_r · T(t_i). Therefore, they can be fitted in a second step to the function T(t) (see equation 3.2) to obtain the temporal model parameters. In addition, the procedure provides a sequence of fitted widths σ_r(t_i), i = 1, 2, …, which can be checked for constancy. All fits are performed using the
Figure 5: (A) Original data and fit of a cortical simple cell (on-field) to the spatial activity q(t_i)X(x) for selected time slices t_i recorded during nonsynchronized EEG. (B) Temporal fit, q(t). (C, D) Spatial and temporal fit for another example (off-field during nonsynchronized EEG). In A and C, the fits correspond to the smooth curves.
Levenberg-Marquardt algorithm (Press, Teukolsky, Vetterling, & Flannery, 1993). On the left side of Figure 5, spatial fits are shown for different time steps of an on-response; the right part shows the temporal fit of q(t). Observe that the LGN activity does not reach V1 before t₀ ≥ 30 ms. Before that, background noise is fitted. Thus, the resulting fit parameters are meaningless for t < t₀. The latency t₀ varies between 30 and 60 ms for different subfields (cf. Figure 5B versus 5D). Off-subfields exhibit a longer delay than on-subfields (Schiller, 1992). Apparently the experimental data can be described very well by the simple feedforward model, although the cells differ significantly in their spatiotemporal receptive field profiles (see Figure 5, top versus bottom). Some of the measured cells were too noisy, though. The data of two off-fields during nonsynchronized EEG and of two on- and four off-fields during synchronized EEG had to be excluded because no receptive field was detectable. In the following, only the remaining 24 subfields were analyzed. To quantify the quality of the fit, we determined the percentage of fluctuations in the data that remained after the model fit; we calculated

P = (1/(N − 1)) Σ_{i=1}^{N} (f(x_i) − y(x_i))² / y(x_i)²  (5.2)
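For a toy data/fit pair (all numbers made up), equation 5.2 evaluates as follows; note the normalization by y(x_i)², which makes P a mean squared relative deviation:

```python
# Equation 5.2 for a toy data/fit pair: P measures the mean squared
# relative deviation of the fit f from the data y (values are made up).
y = [10.2, 30.5, 59.0, 30.9, 9.8]        # "data"
f = [10.0, 30.0, 60.0, 30.0, 10.0]       # "fit"
P = sum((fi - yi)**2 / yi**2 for fi, yi in zip(f, y)) / (len(y) - 1)
print(round(P, 5))   # → 0.00055, i.e., far below the ~10% noise level
```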
separately for the fit in space and in time. Averaged over all 720 spatial fits (each of the 24 subfields was sampled at 30 different time steps), we determined that P = 0.092 ± 0.059, which means that only about 10% of the variation in the data cannot be explained by the fit. P is of the same magnitude as the noise contained in the data. The latter can be quantified by the standard deviation of the background activity (roughly 10%) during the first 30 ms of the response, where the LGN firing does not yet influence cortical firing. The comparability of both measures (background noise and fit error) implies that the model consistently describes the deterministic variation in the data (Press et al., 1993). Comparing the fit for the two EEG states, it was slightly better for the nonsynchronized state (P = 0.083 compared to P = 0.100). The temporal fit is as good as the spatial one. Averaged over all 24 fits, we find P = 0.107 ± 0.144. Again, the fit during synchronized EEG (P = 0.151) was not as good as the one during nonsynchronized EEG (P = 0.063). In general, the data are noisier in the synchronized EEG than in the nonsynchronized EEG. A major reason for this is probably that twice as many stimulus repetitions were recorded during nonsynchronized than during synchronized EEG. Interestingly, it turns out that the parameters σ_r(t_i), a(t_i), θ(t_i), b(t_i) are almost constant over time after the LGN activity has reached the cortical layer (see Figure 6). To quantify their temporal variation, we calculated standard deviations over all time bins t_i and took the mean over all sampled subfields. The values found are a: 5%, σ_r: 20%, b: 18%, and θ: 16%. The most important result of the fit is shown in Figures 6B–6D, where σ_r, the width of the cortical membrane potential, is plotted as a function of time. After the activity has reached V1 (around t₀ = 40 ms in B), σ_r turns out to be constant over time. This was the case in almost all sampled subfields.
Only two subfields revealed slight and insignificant trends toward larger values with increasing time. As outlined in section 4, this strongly supports the hypothesis that the RF restructuring is mainly input driven. RFs in experiments are most commonly derived from firing rates, not intracellular potentials. As a measure for the width of the firing-rate distribution, we can choose its half-width at baseline, w. This can be compared to the width σ_r of the (subthreshold) membrane potential (see Figure 6B), and it is computable from the fitted model parameters using equation 3.4 with κ = θ, that is, w²(t_i) := x²(t_i; θ_i) = 2σ_r²(t_i) ln[q(t_i)/θ(t_i)]. Clearly, w depends on the EEG state, and it is also time dependent although σ_r is not (see Figure 6B). Table 1 shows mean values for σ_r (averaged over all time bins and subfields), as well as maximal field widths w_p and mean values for w_s := w(200 ms) during the late response component. Values for w_p and w_s are averaged over all sampled subfields. w_s roughly corresponds to the half-width of "classical" RF sizes. Both measures, w and σ_r, and their relation are in agreement with literature data: RFs are consistently smaller when obtained from extracellular rates (compare, e.g., Alonso & Reid, 1995). Hirsch and colleagues mapped
Figure 6: (A, B) Time course of five model parameters for the fit to an on-field of a simple cell during the nonsynchronized state. (A) Background activity b and firing threshold θ. (B) Offset a of the center of the gaussian X(x), width of the spatial membrane profile σ_r, and half-width of the extracellular firing-rate profile w. Note that the LGN activity does not reach V1 before approximately 40 ms (vertical line in A to D); up to that time background noise is fitted. (C, D) Time course of σ_r for two other examples. (C) Same on-subfield as in A and B but during synchronized EEG. σ_r shows a similar time course as for the nonsynchronized EEG but is noisier. (D) Off-subfield during nonsynchronized EEG, also showing a constant σ_r. In C and D the vertical lines again indicate the latency of the onset response, which is larger for the off-subfield in D.
Table 1: State and Time Dependence of RF Widths.

                   Intracellular        Extracellular Rates
                   Potential, σ_r       Peak Width, w_p    Width at t_i = 200 ms, w_s
Synchronized       2.2°                 2.0°               1.0°
Nonsynchronized    1.8°                 1.75°              1.25°

Notes: Intracellular widths, σ_r, measured from membrane potentials are consistently larger than classical RFs determined from firing rates (w_s) and also than the maximal extracellular peak responses w_p of w(t). During synchronized EEG, the classical RF shrinks roughly 50% on average, and somewhat less during the nonsynchronized EEG.
RFs from intracellularly recorded potentials of striate cortical simple cells (Hirsch, Alonso, Reid, & Martinez, 1998). They found fields with diameters up to 4 degrees, consistent with σ_r ≈ 2 degrees from our data. Intracellular RFs of this size are also compatible with anatomical data (Antonini, Gillespie, Crair, & Stryker, 1998; Chapman, Zahs, & Stryker, 1991; Humphrey, Sur, Uhlrich, & Sherman, 1985). In a second step, we fitted q(t_i), i = 1, 2, …, 30 (see equation 5.1) to kβ·T(t) with k := K₀σ₀σ₁/σ_r to determine the parameters C̃₁ := kβc₁, C̃₂ := kβc₂, t₀, t₁, t₂, and τ. Note that only the products kβc_i, i = 1, 2 can be obtained from the fit. t₀ accounts for latencies between stimulus onset and cortical response and t₁ − t₀ for the duration of the phasic burst component. t₂ was fixed at 300 ms because data were sampled only during stimulus presentation. For most subfields, excellent fits were obtained (see Figure 5). Some on-subfields in the nonsynchronized case exhibited a significant adaptation during the tonic phase (e.g., Figure 1B, 100–300 ms). In these cases, an adaptation term was added to the fit function; that is, c₂ was replaced by c₂ exp(−(t − t₁)/τ_a). A comparison of the fitted parameters for both EEG states and for on- and off-subfields of the same neuron reveals only a few systematic relations. The main difference between on- and off-subfields is a delayed response for the off-fields of approximately 20 ms (t₀ = 56 ms) compared to the on-fields (t₀ = 35 ms), which is in accordance with the literature (Schiller, 1992). The main EEG-state dependence turns out to be the difference between C̃₁ and C̃₂, which is proportional to the difference between LGN activity during the burst and tonic phase. The difference is about three times as high in the synchronized (29 I/s) as in the nonsynchronized state (9 I/s).
Moreover, we find that bursts are more pronounced (40 ± 8 versus 23 ± 6 I/s) and that the tonic component is slightly smaller (11 ± 5 versus 14 ± 7 I/s) during synchronized EEG. This may account for the somewhat larger average peak width (w_p) and the lower stationary width (w_s) in the synchronized state, as revealed by Table 1. Individual cells can show this trend much more clearly than the mean values given in the table. In only one out of all cells were bursts less pronounced during the synchronized EEG. The tonic component was almost completely missing in about one-third of the cells during synchronized EEG. The other model parameters do not exhibit dependences on EEG state or subfield type. In particular, the parameters of the spatial gaussian profile, a and σ_r, do not show a significant variation. Interestingly, the threshold θ also seems to be constant, not only over time but also for different states and RF types of the same cell. Somewhat surprisingly, even the empirical time constant τ shows no significant dependence on the EEG state. Since neuronal membranes tend to have faster response times in depolarized states, one might have expected differences in τ. However, all experiments in Wörgötter et al. (1998) were performed under weak anesthesia, where LGN cells are not as strongly hyperpolarized as in deep sleep. In that range, τ may only weakly
depend on the level of depolarization (Bernander, Douglas, Martin, & Koch, 1991). In addition, synchronized and nonsynchronized states were classified from the spectral content of the EEG during a recording session; that is, the transition between the two states was not induced by other, externally controlled means. Under these conditions, the range of modulation of the resting potentials of LGN cells might have been relatively small, such that τ can be expected to be constant. Quantitatively analyzing the temporal parameters, we find that they are in agreement with the literature (DeAngelis et al., 1995; Hirsch et al., 1998; Somers et al., 1995; Wörgötter & Koch, 1991): τ = 13 ± 7 ms, t₁ − t₀ = 38 ± 17 ms, τ_a = 320 ± 50 ms. Only the duration of the bursts, t₁ − t₀, exhibits a small state dependence: bursts are about 20% longer in the synchronized than in the nonsynchronized EEG. In addition to the firing rates ν(x, t) and the RF widths w(t), cortical firing latencies t_lat(x) can be analyzed. In particular, it is interesting to see whether their spatial distribution is in agreement with the distribution predicted by the model. In the model, t(x; κ) describes the equipotential lines of the cortical membrane potential, V(x, t) = κ, in parameterized form. If the membrane potential equals the firing threshold, κ = θ, these times are equal to the cortical onset times when the cells start firing, that is, t_lat(x) = t(x; κ = θ). The model predicts a quadratic dependence t_lat ∝ x² of the latencies on the distance x from the RF center as long as x is relatively small in comparison with the projection range of cortical input connections (see equation 3.6). In experiments, these times are measured as firing latencies. They describe the time from the onset of the stimulus to the first response of the recorded cell. As the latencies were not directly recorded in our experiments, cortical onset times were obtained with the following procedure.
For each location x, the time step t_i at which the firing rate first exceeds the background firing rate plus twice the standard deviation of the background firing was taken as t_lat. The binning was changed from 10 ms to 2 ms to obtain a better temporal resolution. Eleven curves t_lat(x) could be obtained from the experimental data and used for further analysis. In general, all curves exhibit a similar course. The latencies increase for locations farther away from the RF center. To check whether the spatial dependence can be described by a quadratic relation as predicted by the model, the data were fitted to the function a + bx + cx², and regression coefficients were calculated. Six out of the 11 curves had a regression coefficient higher than 0.85. The remaining curves exhibited regression coefficients between 0.62 and 0.74. Intracortical models predict a linear dependence of the firing latencies on the distance from the receptive field center (Bringuier et al., 1999). Therefore, we also fitted the data with a linear model a + bx. To this end, the data were split into two halves consisting of data points lying to the left and right sides of the shortest latency, respectively. One data set was vertically mirrored, and the linear fit was carried out on the resulting data, including all points.
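The quadratic-versus-linear comparison can be illustrated with synthetic latencies that follow the quadratic prediction of equation 3.6 plus a small deterministic perturbation. The least-squares fits are done via the normal equations; this is a sketch, not the analysis pipeline used for the 11 experimental curves:

```python
import math

# Synthetic latency curve: quadratic in x (equation 3.6) plus a small ripple.
xs = [-3.0 + 0.5 * i for i in range(13)]
t_lat = [40.0 + 2.5 * x * x + 0.8 * math.sin(5.0 * x) for x in xs]

def polyfit(xs, ys, deg):
    """Least-squares polynomial fit via the normal equations (Gauss elimination)."""
    m = deg + 1
    A = [[sum(x**(i + j) for x in xs) for j in range(m)] for i in range(m)]
    v = [sum(y * x**i for x, y in zip(xs, ys)) for i in range(m)]
    for c in range(m):                      # forward elimination with pivoting
        p = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        v[c], v[p] = v[p], v[c]
        for r in range(c + 1, m):
            fac = A[r][c] / A[c][c]
            A[r] = [a - fac * bb for a, bb in zip(A[r], A[c])]
            v[r] -= fac * v[c]
    coef = [0.0] * m
    for c in reversed(range(m)):            # back substitution
        coef[c] = (v[c] - sum(A[c][j] * coef[j] for j in range(c + 1, m))) / A[c][c]
    return coef

def r_squared(xs, ys, coef):
    pred = [sum(c * x**i for i, c in enumerate(coef)) for x in xs]
    mean = sum(ys) / len(ys)
    ss_res = sum((y - p)**2 for y, p in zip(ys, pred))
    ss_tot = sum((y - mean)**2 for y in ys)
    return 1.0 - ss_res / ss_tot

r2_quad = r_squared(xs, t_lat, polyfit(xs, t_lat, 2))
r2_lin = r_squared(xs, t_lat, polyfit(xs, t_lat, 1))
print(r2_quad, r2_lin)   # the quadratic model explains far more variance
```

For latency curves of this quadratic shape, the linear model captures almost none of the variance, mirroring the contrast in regression coefficients reported below.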
Only in 2 of the 11 cases were regression coefficients higher than 0.85. Four cells had a coefficient between 0.62 and 0.74, and the remaining five data sets had coefficients below 0.62. Thus, a quadratic fit describes the experimental latency distribution better than a linear fit. This additionally supports our hypothesis that the observed RF restructuring is caused by a thalamocortical feedforward mechanism. Due to the nonlinear, approximately quadratic relation between x and t_lat, a constant speed at which the activity spreads in the cortical layer cannot be determined. Around the location of the strongest response, the instantaneous speed is higher (≈ 0.2 deg/ms) than toward the periphery (≈ 0.06 deg/ms). Averaging over the whole response area and over all subfields, an average speed of 0.1 degree per ms is obtained; for example, on average the activity travels 1 degree in 10 ms. This is comparable to literature data (Bringuier et al., 1999). In conclusion, the results of the fit, especially the observed constant width of the potential distribution, validate the hypothesis that the restructuring from wide to small RFs can be well described by the proposed simple feedforward model and that intracortical restructuring is of only minor relevance. As explained before, a constant σ_r, a characteristic of the feedforward model, occurs only in artificial situations in feedback models.

6 Discussion
We investigated a neural field model of the LGN and V1 to describe EEG-dependent RF changes in V1 (Wörgötter et al., 1998). The analytic expressions for the cortical spatiotemporal activity, equations 3.1 and 3.2, show that the restructuring can be explained by EEG-state-dependent thalamic firing patterns (burst versus tonic mode) and a pure feedforward mechanism. The restructuring is due to an "iceberg effect," with a temporally fixed threshold and a spatial activity profile X of constant width σ_r but a time- and state-dependent temporal profile T. In this way, the region of cells above threshold, that is, the RF, varies with EEG state and over time. To test this hypothesis, the model was fitted to experimental data. Here, it is most important that the width σ_r of the spatial profiles of the estimated membrane potentials V and the firing thresholds θ indeed remain constant over time during a response period of 300 ms, although a strong modulation of the RF size, w, is present (as measured from spike rates). For that reason, the fit supports the hypothesis that the experimentally observed RF changes are mainly due to input from the LGN and not due to recurrent synaptic interactions in V1. Models with feedback interactions lead in almost all cases to a time-dependent width σ_r. Recurrent intracortical circuits have been suggested as being responsible for the sharpening of orientation tuning curves (Ben-Yishai et al., 1995; Somers et al., 1995) but do not seem to be involved in the sharpening of spatial RF tuning.
The only parameter that showed a small systematic trend in its time course was the background activity b, which was slightly larger during the initial burst-driven part of the response (see Figure 6A). One possible reason might be that some weak excitatory connections from the LGN to a neighboring subfield exist, which are able to activate cortical cells only transiently during the strong burst component of the initial LGN response. Tanaka (1983) and Alonso and Reid (1995) investigated connections from on- or off-cells in the LGN to cortical simple-cell subfields of the opposite polarity. Using cross-correlation techniques, they found that such connections are present but sparse. If the transient increase in background rates in our data is indeed due to such connections, the data indicate that those connections are not functional during the late parts of the cortical response. Consequently, their number may be underestimated in cross-correlation studies using long-lasting stimuli. Field models comparable to those in this article have been used to study orientation tuning of simple cells in V1 (Ben-Yishai et al., 1995; Carandini & Ringach, 1997; Somers et al., 1995). In these models the variable x represents the orientation preference of cells in a cortical column rather than spatial location, as in our framework. It remains a matter of discussion whether orientation tuning is mediated by recurrent connections inside V1 or by feedforward input from the LGN. The method presented here may help to decide that question by considering tuning widths determined from intracellular membrane potentials. Those can be measured in anesthetized cats, or, assuming a feedforward model as the null hypothesis, they can be estimated from fits of firing-rate data as in our work here. If the intracellular tuning width were to appear constant in time in such experiments, the orientation tuning would likely be input driven.

Acknowledgments
K. S. and F. W. acknowledge the support of the Deutsche Forschungsgemeinschaft (SFB-509) and the HFSP. T. W. was supported in part by DFG grant Pa 268/8-1. We thank K. Funke and N. Kerscher for valuable comments and discussions and K. Funke, N. Kerscher, and Y. Zhao for help with the experimental data.

References

Adorjan, P., Levitt, J., Lund, J., & Obermayer, K. (1999). A model for the intracortical origin of orientation preference and tuning in macaque striate cortex. Vis. Neurosci., 16(2), 303–318.

Ahlsén, G., & Lo, F.-S. (1981). Projection of brainstem neurons to the perigeniculate nucleus and the lateral geniculate nucleus in the cat. Brain Res., 238, 433–438.
Alonso, J.-M., & Reid, R. (1995). Specificity of monosynaptic connections from thalamus to visual cortex. Nature, 378, 281–284.

Antonini, A., Gillespie, D., Crair, M., & Stryker, M. (1998). Morphology of single geniculocortical afferents and functional recovery of the visual cortex after reverse monocular deprivation in the kitten. J. Neurosci., 18, 9896–9909.

Ben-Yishai, R., Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.

Bernander, Ö., Douglas, J., Martin, K., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573.

Bringuier, V., Chavane, F., Glaeser, L., & Frégnac, Y. (1999). Horizontal propagation of visual activity in the synaptic integration field of area 17 neurons. Science, 283, 695–699.

Carandini, M., & Ringach, D. (1997). Predictions of a recurrent model of orientation selectivity. Vis. Res., 37, 3061–3071.

Celebrini, S., Thorpe, S., Trotter, Y., & Imbert, M. (1993). Dynamics of orientation coding in area V1 of the awake primate. Vis. Neurosci., 10, 811–825.

Chapman, B., Zahs, K., & Stryker, M. (1991). Relation of cortical cell orientation selectivity to alignment of receptive fields of geniculocortical afferents that arborize within a single orientation column in ferret visual cortex. J. Neurosci., 11, 1347–1358.

Crick, F., & Koch, C. (1998). Constraints on cortical and thalamic projections: The no-strong-loop hypothesis. Nature, 391, 245–250.

DeAngelis, G., Ohzawa, I., & Freeman, R. (1993). Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. 1. General characteristics and postnatal development. J. Neurophysiol., 69, 1091–1117.

DeAngelis, G., Ohzawa, I., & Freeman, R. (1995). Receptive-field dynamics in the central visual pathway. Trends Neurosci., 18, 451–458.

Dinse, H., & Krüger, K. (1994). The timing of processing along the visual pathway in the cat. NeuroReport, 5, 893–897.

Eckhorn, R., Krause, F., & Nelson, J. (1993). The RF-cinematogram: A cross-correlation technique for mapping several visual receptive fields at once. Biol. Cybern., 69, 37–55.

Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–252.

Funke, K., & Eysel, U. (1998). Inverse correlation of firing patterns of single topographically matched perigeniculate neurons and cat dorsal lateral geniculate relay. Vis. Neurosci., 15, 711–729.

Giese, M. (1999). Dynamic neural field theory for motion perception. Norwell, MA: Kluwer.

Gilbert, C. (1998). Adult cortical dynamics. Physiol. Rev., 78, 467–485.

Henry, G., Michalski, A., Wimborne, B., & McCart, R. (1994). The nature and origin of orientation specificity in neurons of the visual pathway. Prog. Neurobiol., 43, 381–437.

Hirsch, J., Alonso, J.-M., Reid, R., & Martinez, L. (1998). Synaptic integration in striate cortical simple cells. J. Neurosci., 18, 9517–9528.
Received September 7, 1999; accepted April 3, 2000.
LETTER
Communicated by Peter Rowat
On the Phase-Space Dynamics of Systems of Spiking Neurons. I: Model and Experiments

Arunava Banerjee
Department of Computer Science, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, U.S.A.

We investigate the phase-space dynamics of a general model of local systems of biological neurons in order to deduce the salient dynamical characteristics of such systems. In this article, we present a detailed exposition of an abstract dynamical system that models systems of biological neurons. The abstract system is based on a limited set of realistic assumptions and thus accommodates a wide range of neuronal models. Simulation results are presented for several instantiations of the abstract system, each modeling a typical neocortical column to a different degree of accuracy. The results demonstrate that the dynamics of the systems are generally consistent with that observed in neurophysiological experiments. They reveal that the qualitative behavior of the class of systems can be classified into three distinct categories: quiescence, intense periodic activity resembling a state of seizure, and sustained chaos over the range of intrinsic activity typically associated with normal operational conditions in the neocortex. We discuss basic ramifications of this result with regard to the computational nature of neocortical neuronal systems.

1 Introduction
Our understanding of the computational nature of systems of interconnected neuron-like elements has grown steadily over the past decades (Hopfield, 1982; Amit, Gutfreund, & Sompolinsky, 1987; Hornik, Stinchcombe, & White, 1989; Siegelmann & Sontag, 1992; Omlin & Giles, 1996, to mention but a few). The extent to which some of these results apply to systems of neurons in the brain, however, remains uncertain, primarily because the models of the neurons, as well as those of the networks used in such studies, do not sufficiently resemble their biological counterparts. The past decade has also been witness to several theories advanced to explain the observed behavioral properties of large fields of neurons in the brain (Nunez, 1989; Wright, 1990; Freeman, 1991). In general, these models do not adopt the neuron as their basic functional unit; for example, Freeman's model utilizes the KI, KII, and KIII configurations of neurons as its basic units, and Wright's model lumps small cortical areas into single functional units. Moreover, some of these models are founded on presupposed

Neural Computation 13, 161–193 (2000)
© 2000 Massachusetts Institute of Technology
functional properties of the macroscopic phenomenon under investigation; for example, Wright's and Nunez's models assume that the global wavelike processes revealed in EEG recordings satisfy linear dynamics. These reasons account, in part, for the controversies that surround these theories. The search for coherent structures in the dynamics of neuronal systems has also seen a surge of activity (Basar, 1990; Krüger, 1991a; Aertsen & Braitenberg, 1992; Bower & Beeman, 1995). Armed with faster and more powerful computers, scientists are replicating salient patterns of activity observed in neuronal systems in phenomena as diverse as motor behavior in animals and oscillatory activity in the cortex. The models of the neurons as well as those of the networks are considerably more realistic in these cases. There is an emphasis on simulation, and it is hoped that analytic insight will be forthcoming from such experimentation. Our research shares the broad objectives of these diverse endeavors: to unravel the dynamical and computational properties of systems of neurons in the brain. Our goal, in particular, is to answer two crucial questions: (1) Are there coherent spatiotemporal structures in the dynamics of neuronal systems that can denote symbols,¹ and (2) if such structures exist, what restrictions do the dynamics of the system at the physical level impose on the dynamics of the system at the corresponding abstracted symbolic level? Our approach is characterized by our position on two significant methodological issues. First, we take a conservative bottom-up approach to the problem; that is, we adopt the biological neuron as our basic functional unit,² and second, given the generalized nature of the questions posed, we believe that answers can be obtained primarily through the analysis of the phase-space dynamics of an abstract dynamical system whose behavior sufficiently matches that of the biological system.
The first part of this article is devoted to the construction of such a dynamical system. In section 2, we identify a general set of characteristics of a biological neuron and formulate a model of a neuron based on them. In section 3, we construct the phase-space for a system of such neurons. The geometric structure immanent in the phase-space is also described in detail. Finally, in section 4, we specify the velocity field that overlays the phase-space, thereby completing the construction of the abstract dynamical system. Section 5 is devoted to a numerical investigation of the dynamics of the system. A typical neocortical column is modeled based on anatomical and
1 Our usage of the term symbol conforms with the limited notion of a symbol as used in computer science—discrete states that mark a computational process regardless of representational content, if any—and not the notion of a symbol in the greater sense of the word (the physical embodiment of a semantic unit) as used in the cognitive sciences. This work therefore does not take a position on the contentious issue of representationalism. 2 We believe that by not assuming a lumped unit with presupposed functional characteristics, we not only detract from controversy but also add to the well-foundedness of the theory.
physiological data. Several instantiations, of varying degrees of accuracy, are examined. The results of the simulations demonstrate that the dynamics of the systems are generally consistent with that observed in neurophysiological experiments. They also highlight certain dynamical properties that are robust across the instantiations. The results reveal that the qualitative behavior of the class of systems can be classified into three distinct categories: (1) when initiated at a low level of intrinsic activity, the system becomes quiescent; (2) when initiated at a high level of intrinsic activity, the system settles into a periodic orbit of intense regular activity resembling a state of seizure; and (3) when initiated over a range of intermediate levels of intrinsic activity (corresponding to normal operational conditions in the neocortex), the system displays sustained chaotic behavior. In section 6 we discuss basic ramifications of this result with regard to the computational nature of neocortical neuronal systems. The questions posed earlier in this section are partially answered, and a research agenda, pursued in the companion article in this issue, is proposed for a definite resolution of the matter.

2 Model of the Neuron
We have chosen a deterministic (as opposed to a stochastic) model for the biological neuron for the following reasons. First, there is mounting evidence that information about the stimulus is contained in the higher-resolution interspike intervals (ISIs) of spike trains and not in the lower-resolution spike frequencies (Strehler & Lestienne, 1986; Krüger, 1991b; Bair, Koch, Newsome, & Britten, 1994; Bair & Koch, 1996). Since the temporal sequence of ISIs is abstracted away in a stochastic model, there is little hope that such a model would shed light on all aspects of information processing in neuronal systems. Second, while random miniature postsynaptic potentials do exist, their mean amplitude is at least an order of magnitude smaller than the mean amplitude of the postsynaptic potential elicited by the arrival of a spike at a synapse. The impact of such noise on the dynamics of the system is best identified by contrasting the dynamics of a noise-free model to that of a model augmented with the appropriate noise. It is therefore logical to begin with the analysis of a deterministic model. Our model of the biological neuron is based on the following four observations: 1. The biological neuron is a finite precision machine in the sense that the depolarization at the soma that elicits an action potential is of a finite range ($T \pm \epsilon$) and not a precise value ($T$).
2. The effect of a synaptic impulse (resulting from the arrival of an afferent (incoming) spike) at the soma has the characteristic form
Figure 1: A schematic diagram of a neuron that depicts the soma, axon, and two synapses on as many dendrites. The axes by the synapses and the axon denote time (with respective origins set at present and the direction of the arrows indicating past). Spikes on the axon are depicted as solid lines. Those on the dendrites are depicted as broken lines, for, having been converted into graded potentials, their existence is only abstract (point objects indicating the time of arrival of the spike).
of an abrupt increase or decrease in potential followed by a longer exponential-like decay toward the resting level. 3. The postspike elevation of the threshold, which may be modeled as the inhibitory effect of an efferent (outgoing) spike, also decays after a while at an exponential rate. 4. The ISIs of any given neuron are bounded from below by its absolute refractory period, $r$; that is, no two spikes originate closer than $r$ in time. Based on these, we construct our model of the neuron as follows. (Figure 1 is provided as a visual aid to complement the formal presentation below. The various aspects of the diagram are explained in the caption.) i. Spikes. Spikes are the sole bearers of information. They are identical except for their spatiotemporal location. (Spikes are defined as point objects. Although action potentials are not instantaneous, they can always be assigned occurrence times, which may, for example, be the time at which the membrane potential at the soma reached the threshold $T$.)
ii. Membrane potential. For any given neuron we assume an implicit, $C^1$, everywhere bounded function $P^*(\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_m; \vec{x}_0)$ that yields the current membrane potential at the soma. (Subscripts $i = 1, \ldots, m$ represent the afferent synapses, and $\vec{x}_i = \langle x_i^1, x_i^2, \ldots, x_i^j, \ldots \rangle$ is a denumerable sequence of variables that represent, for spikes arriving at synapse $i$ since infinite past, the time lapsed since their arrivals. In other words, each $x_i^j$ for $i = 1, \ldots, m$ reports the time interval between the present and the time of arrival of a distinct spike at synapse $i$. Finally, $\vec{x}_0$ is a denumerable sequence of variables that represent, in like manner, the time lapsed since the departure of individual spikes generated at the soma of the neuron.) $P^*: \mathbb{R}^\infty \to \mathbb{R}$ accounts for the entire spatiotemporal aspect of the neuron's response to spikes. All information regarding the strength of afferent synapses, their location on the dendrites, and their modulation of one another's effects at the soma is implicitly contained within $P^*(\cdot)$. So is the exact nature of the postspike hyperpolarization that gives rise to both the absolute and the relative refractory periods. Consistency requires that either the domain of $P^*(\cdot)$ be restricted to $0 \le x_i^1 \le x_i^2 \le \cdots$ for all $i$, or $P^*(\cdot)$ be constrained to be symmetric with respect to $x_i^1, x_i^2, \ldots$ for all $i$.

iii. Effectiveness of a spike. After the effect of a spike at the soma of a biological neuron has decayed below $\epsilon / C(m+1)$, where $\epsilon$ is the range of error of the threshold, $C > 1$ an appropriate constant (defined below), and $m$ the number of afferent synapses on the cell, its effectiveness on the neuron expires. This inference is based on the observations that the biological neuron is a finite precision machine, that the total effect of a set of spikes on the soma is almost linear in their individual effects when such effects are small ($\approx \epsilon / C(m+1)$), and that, owing to the exponential nature of the decay of the effects of afferent as well as efferent spikes and the existence of a refractory period, a quantity $\mu < 1$ can be computed such that the residual effect of the afferent spikes that have arrived at a synapse, whose effects on the soma have decayed below $\epsilon / C(m+1)$ (and that of the efferent spikes of the neuron satisfying the same constraint), is bounded from above by the infinite series $\frac{\epsilon}{C(m+1)}(1 + \mu + \mu^2 + \cdots + \mu^k + \cdots)$. Setting $C > \frac{1}{1-\mu}$ then guarantees that each such residual is less than $\frac{\epsilon}{m+1}$.
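The geometric-series bound can be verified with a few lines of arithmetic. A minimal sketch, not from the paper; the values of $\epsilon$, $m$, and $\mu$ are arbitrary illustrative choices:

```python
# Check of the geometric-series bound: with C > 1/(1 - mu), the summed
# residual effect
#     (eps / (C * (m + 1))) * (1 + mu + mu**2 + ...)
#       = eps / (C * (m + 1) * (1 - mu))
# stays strictly below the per-synapse budget eps / (m + 1).

eps = 0.1  # range of error of the threshold (illustrative value)
m = 100    # number of afferent synapses (illustrative value)
mu = 0.5   # ratio between consecutive residual terms (illustrative value)

C = 1.0 / (1.0 - mu) + 0.01                   # any constant C > 1/(1 - mu)
residual = eps / (C * (m + 1) * (1.0 - mu))   # closed form of the series
budget = eps / (m + 1)

assert residual < budget
print(residual, budget)
```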
This argument, in essence, demonstrates that the period of effectiveness of a spike since its impact at a given afferent synapse or departure after being generated at the soma is bounded from above.³ We model this feature by imposing the following restrictions on $P^*$:

1. $\forall i = 0, \ldots, m$ and $\forall j$, $\exists t_i$ such that $\forall t \ge t_i$, $\partial P^* / \partial x_i^j \,|_{x_i^j = t} = 0$ irrespective of the values assigned to the other variables.

2. $\forall i = 0, \ldots, m$ and $\forall j$, $\forall t \le 0$, $\partial P^* / \partial x_i^j \,|_{x_i^j = t} = 0$ irrespective of the values assigned to the other variables.

3. $\forall i = 0, \ldots, m$ and $\forall j$, $P^*(\cdot) \,|_{x_i^j = 0} = P^*(\cdot) \,|_{x_i^j = t_i}$, all other variables held constant at any values.

3 This does not necessarily imply that the neuron has a bounded memory of events. Information can be transferred from input spikes to output spikes during the input spikes' period of effectiveness.
4. $P^*(0|t_1, 0|t_1, \ldots, 0|t_2, 0|t_2, \ldots, \ldots, 0|t_m, 0|t_m, \ldots; 0|t_0, 0|t_0, \ldots) = 0$. In other words, $P^*(\cdot) = 0$ if $\forall i, j$, $x_i^j = 0$ or $t_i$.

The first criterion enforces that the sensitivity of $P^*(\cdot)$ to spike $x_i^j$ expires after time $t_i$. Hence, if for every $i$ there exists an $n_i$ such that $j > n_i \Rightarrow a_i^j \ge t_i$, then $P^*(\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_m; \vec{a}_0) = P^*(\vec{a}'_1, \vec{a}'_2, \ldots, \vec{a}'_m; \vec{a}'_0)$, where each $\vec{a}'_i$ is derived from the corresponding $\vec{a}_i$ by setting all values after index $n_i$ to $t_i$. The second criterion results from $P^*(\cdot)$ being $C^1$ and $\partial P^* / \partial x_i^j \,|_{x_i^j < 0} = 0$ (a spike that has not yet arrived at an afferent synapse can have no effect on the membrane potential, and a spike that has not yet been generated at the soma cannot elicit a postspike hyperpolarization). The third criterion enforces that, all else being equal, the membrane potential at the soma is the same whether an afferent spike has just arrived or its effectiveness has just expired, and in the case of an efferent spike, whether it has just been generated or its inhibitory effect on the soma has just expired. The fourth criterion enforces that the resting state of the soma is set at 0.

iv. Finite dimensionality. The function characterizing the membrane potential can, as a result, be defined over a finite dimensional space. This is based on the observation that at any afferent synapse $i = 1, \ldots, m$, spikes arrive at intervals bounded from below by $r_i$ (the absolute refractory period of the neuron presynaptic to $i$), and for $i = 0$ spikes are generated at the soma of the neuron at intervals bounded from below by $r_0$ (the absolute refractory period of the neuron in question). Consequently, at most $n_i = \lceil t_i / r_i \rceil$ variables can have values less than $t_i$. Based on iii above, we can define

$P(x_1^1, \ldots, x_1^{n_1}, x_2^1, \ldots, x_2^{n_2}, \ldots, x_m^1, \ldots, x_m^{n_m}; x_0^1, \ldots, x_0^{n_0})$
$\quad = P^*(x_1^1, \ldots, x_1^{n_1}, t_1, t_1, \ldots, x_2^1, \ldots, x_2^{n_2}, t_2, t_2, \ldots, \ldots, x_m^1, \ldots, x_m^{n_m}, t_m, t_m, \ldots; x_0^1, \ldots, x_0^{n_0}, t_0, t_0, \ldots)$ (2.1)

and use $P(\cdot)$ instead of $P^*(\cdot)$ to represent the current membrane potential.⁴

v. Threshold and refractoriness. A simple model is assumed for the generation of a spike at a soma; that is, a spike is generated when $P(\cdot) = T$ and $dP(\cdot)/dt \ge 0$.⁵ $T$ is assumed to be a constant. However, it must be noted that a more general criterion of the form $P(\cdot) = T(\cdot)$ can be transformed into the form $P(\cdot) = T$ if the parameters of $T$ are a subset of those of $P$.

The above model of the neuron does not insist on a specific membrane potential function, only that the particular function possess certain qualitative properties. By retaining this level of generality, we can pose questions that are more basic, such as: To what extent would the dynamics of a neuronal system be affected if the potential function $P(\cdot)$ were to be replaced by any arbitrary function that is unimodal with respect to all variables and at the same time satisfies all the constraints (items ii, iii, and v above)? Besides, making a model more specific by resorting to explicit functions is much easier than making it more general.

We must mention here that in spite of the generality of the membrane potential function, this model remains some distance from an accurate depiction of the biological neuron. Changes in the strength of synapses, which would translate into evolving potential functions, are not modeled. The biophysical processes that are involved in synaptic modulation are not yet completely understood, and the phenomenon continues to inspire intense research. Furthermore, persistent currents such as those that give rise to plateau potentials cannot be modeled in this framework. We can only hope that not much is lost in terms of the salient behavior of the system at our leaving out such details.

4 A few technicalities should be mentioned here. The definition of $P^*$ assumes that spikes have been arriving at each synapse and have been leaving the soma since time $t = -\infty$. If the number of spikes on any synapse or the number of efferent spikes is finite over the time $(-\infty, 0)$, a denumerable set of dummy variables can be introduced to extend the function's domain to $\mathbb{R}^\infty$. For any such dummy variable $x$, $\partial P^* / \partial x = 0$ is consistent with the constraints imposed on $P^*$.

3 The Phase-Space
We begin by revising certain definitions for the purpose of instituting the measure of uniformity necessary for the construction of a phase-space for a system of neurons. Previously, the $x_i^j$'s for $i = 1, \ldots, m$ represented the time since the arrival of spikes at synapses, and for $i = 0$ the time since the inception of spikes at the soma. We eliminate the asymmetry in the choice of origins of the two sets of variables by redefining $x_i^j$ $\forall i, j$ to represent the time since the inception of the spike at the corresponding soma. The subscript $i$ in $x_i^j$ is now set to refer to a neuron rather than a synapse. In addition to changes in $P(\cdot)$, this prompts a shift in focus from the postsynaptic to the presynaptic neuron. Given a system of neurons, an upper bound on the number of effective, efferent spikes that a neuron can possess at any time is first computed

5 $P(\cdot)$ being $C^1$, the postspike hyperpolarization cannot be instantaneous. $P(\cdot) = T$ is crossed twice—once when the spike is generated and once when the inhibitory effect of the spike kicks in. The constraint $dP(\cdot)/dt = \sum_{i,j} \frac{\partial P(\cdot)}{\partial x_i^j} \frac{dx_i^j}{dt} \ge 0$ ensures that the neuron does not spike during the membrane potential's downward journey.
as follows. Let $i = 1, \ldots, S$ denote the set of all neurons in the system, $k_i = 1, \ldots, m_i$ the set of efferent synapses on neuron $i$ (synapses to which neuron $i$ is presynaptic), $l_{k_i}^i$ the time it takes for a spike generated at the soma of neuron $i$ to reach synapse $k_i$, $t_{k_i}^i$, as before, the time interval over which a spike impinging on synapse $k_i$ is effective on the corresponding postsynaptic neuron, and $r_i$ the absolute refractory period of neuron $i$. Since its inception, the maximum time until which any spike can be effective on any corresponding postsynaptic neuron is then given by $\chi = \max_{i=1,\ldots,S;\, k_i=1,\ldots,m_i} \{ l_{k_i}^i + t_{k_i}^i \}$. Let $\varsigma$ denote the maximum time period over which any spike is effective on the neuron that generates it, and let $U = \max\{\chi, \varsigma\}$. Then $n_i = \lceil U / r_i \rceil$ provides an upper bound on the number of effective spikes that neuron $i$ can possess at any time.

The requisite changes in $P(\cdot)$ are threefold. First, the number of variables associated with some synapses is increased so that all synapses actuated by neuron $i$ are consistently assigned $n_i$ variables. This is achieved by setting appropriately fewer variables in $P^*(\cdot)$ to $t_{k_i}^i$ in criterion iv of the previous section. Next, $P(\cdot)$ is translated along all but one set of axes such that the former origin is now at $\langle l_{k_{i_1}}^{i_1}, l_{k_{i_1}}^{i_1}, \ldots, \ldots, l_{k_{i_m}}^{i_m}, l_{k_{i_m}}^{i_m}, \ldots, 0, 0, \ldots \rangle$, where $i_1, \ldots, i_m$ are the neurons presynaptic to the neuron in question ($i_0$) at synapses $k_{i_1}, \ldots, k_{i_m}$. $\partial P / \partial x_{k_i}^j \,|_{x_{k_i}^j = t} = 0$ for $0 \le t \le l_{k_i}^i$ holds owing to the manner in which $P(\cdot)$ was previously defined. Finally, references to synapses in the variables are switched to references to appropriate neurons, and multiple variables referencing the same spike (which occurs when a neuron makes multiple synapses on a postsynaptic neuron) are consolidated.

We note immediately that the state of a system of neurons can be specified completely by enumerating, for all neurons $i = 1, \ldots, S$, the positions of the $n_i$ (or fewer) most recent spikes generated by neuron $i$ within $U$ time from the present. On the one hand, such a record specifies the exact location of all spikes that are still situated on the axons of respective neurons, and on the other, combined with the potential functions, it specifies the current state of the soma of all neurons. While it is assumed here that neurons do not receive external input, incorporating such input is merely a matter of introducing additional neurons whose state descriptions are identical to that of the external input. This gives us the following initial representation of the state information contributed by each neuron toward the specification of the state of the system: at any given moment, neuron $i$ reports the $n_i$-tuple $\langle x_i^1, x_i^2, \ldots, x_i^{n_i} \rangle$, where each $x_i^j$ denotes the time since the inception of a distinct member of the $n_i$ most recent spikes generated at the soma of neuron $i$ within $U$ time from the present. Since $n_i$ is merely an upper bound on the number of such spikes, it is conceivable that fewer than $n_i$ spikes satisfy the criterion, in which case the remaining components in the $n_i$-tuple are set at $U$. From a dynamical perspective, this amounts to the value of a component growing
until $U$ is reached, at which point the growth ceases. The phase-space for neuron $i$ is then the closed $n_i$-cube $[0, U]^{n_i} \subset \mathbb{R}^{n_i}$.

This initial formulation of the phase-space is, however, fraught with redundancy. First, the finite description of the state of a neuron requires that a variable be reused to represent a new spike when the effectiveness of the old spike it represented terminates. Since a variable set at $U$ (ineffective spike) is first set to 0 when assigned to a new spike, one of the two positions is redundant. The abstract equivalence of the positions then dictates that $U$ be identified with 0. Second, if $\langle x_i^1, x_i^2, \ldots, x_i^{n_i} \rangle$ and $\langle y_i^1, y_i^2, \ldots, y_i^{n_i} \rangle$ are two $n_i$-tuples that are a nontrivial permutation of one another ($\exists \sigma$, a permutation, such that $\forall j = 1, \ldots, n_i$, $x_i^{\sigma(j)} = y_i^j$), then while $\vec{x}_i$ and $\vec{y}_i$ are two distinct points in $[0, U]^{n_i}$, they represent the same state of the neuron (which variable represents a spike is immaterial). This stipulates that all permutations of an $n_i$-tuple be identified with the same state. In what follows we construct a differentiable manifold that satisfies both of these criteria. The construction is divided into two stages. The first stage transforms the interval $[0, U]$ into the unit circle $S^1$ (spikes now being represented as complex numbers of unit modulus), and the second computes a complex polynomial whose roots identically match this set of complex numbers. By retaining only the coefficients of this polynomial, all order information is eliminated.

Stage 1. We impose on the manifold $\mathbb{R}$ the equivalence relation $x \sim y$ if $(x - y) = aU$, where $x, y \in \mathbb{R}$ and $a \in \mathbb{Z}$, and regard $\mathbb{R}/\!\sim$ with its standard quotient topology. The equivalence class mapping $p: \mathbb{R} \to \mathbb{R}/\!\sim$ can be identified with $p(t) = e^{2\pi i t / U}$. $p$ maps $\mathbb{R}/\!\sim$ onto $S^1 = \{ z \in \mathbb{C} \mid |z| = 1 \}$. We therefore apply the transformation $\langle x_1, x_2, \ldots, x_n \rangle \to \langle e^{2\pi i x_1 / U}, e^{2\pi i x_2 / U}, \ldots, e^{2\pi i x_n / U} \rangle$ to the initial formulation of the phase-space of a neuron. The new phase-space is $T^n$, the $n$-torus. We assume hereafter that all variables $x_i^j$ are normalized, that is, scaled from $[0, U]$ to $[0, 2\pi]$. $P(\cdot)$ is also assumed to be modified to reflect the scaling of its domain.

Stage 2. That the set $G = \{ \sigma \mid \sigma \text{ is a permutation of } n \text{ elements} \}$ forms a group
motivates the approach that the orbit space $T^n / G$ of the action of $G$ on $T^n$ be considered. $G$, however, does not act freely⁶ on $T^n$, and therefore the quotient topology of $T^n / G$ does not inherit the locally Euclidean structure of $T^n$. Hence, we explicitly construct the space as a subset of Euclidean space and endow it with the standard topology. While such a subset necessarily contains boundaries, the construction does shed light on the intrinsic structure of the space.

6 A group $G$ acts freely on a set $X$ if $\forall x$ ($gx = x$ implies $g = e$).
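Stage 1 is straightforward to sketch numerically. The snippet below is an illustration only; $U$ and the spike ages are arbitrary values. It checks that 0 and $U$ land on the same point of $S^1$, and that the ordering redundancy survives the transformation onto the torus:

```python
import numpy as np

U = 10.0  # illustrative value of the bound U

def to_circle(ages):
    """Map spike ages in [0, U] to unit-modulus complex numbers,
    t -> exp(2*pi*i*t/U)."""
    return np.exp(2j * np.pi * np.asarray(ages) / U)

# The identification of U with 0: both ages map to the same point on S^1.
assert np.allclose(to_circle([0.0]), to_circle([U]))

# The permutation redundancy remains after Stage 1: permuted tuples are
# distinct points on the n-torus although they describe the same state.
a = to_circle([1.0, 4.0, 7.0])
b = to_circle([4.0, 7.0, 1.0])
assert not np.allclose(a, b)
assert np.allclose(np.sort(a), np.sort(b))  # same multiset of points
```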
As noted earlier, we apply the transformation $\langle z_1, z_2, \ldots, z_n \rangle \to \langle a_n, a_{n-1}, \ldots, a_0 \rangle$, where $a_n, a_{n-1}, \ldots, a_0 \in \mathbb{C}$ are the coefficients of $f(z) = a_n z^n + a_{n-1} z^{n-1} + \cdots + a_0$ whose roots lie at $z_1, z_2, \ldots, z_n \in \mathbb{C}$. The immediate question, then, is what are necessary and sufficient constraints on $\langle a_n, a_{n-1}, \ldots, a_0 \rangle \in \mathbb{C}^{n+1}$ for all roots $z_i$ of $f(z)$ to satisfy $|z_i| = 1$. We answer this in two steps. First, we consider the case of distinct roots, and subsequently that of multiple roots.

Theorem 1. Let $f(z) = a_n z^n + a_{n-1} z^{n-1} + \cdots + a_1 z + a_0$ be a complex polynomial of degree $n$. Let $f^*(z)$ be defined as $f^*(z) = z^n \bar{f}(1/z) = \bar{a}_0 z^n + \bar{a}_1 z^{n-1} + \cdots + \bar{a}_n$, where $\bar{a}_i$ represents the complex conjugate of $a_i$. Construct the sequence $\langle f_0(z), f_1(z), \ldots, f_{n-1}(z) \rangle$ as follows:

i. $f_0(z) = f'(z)$, where $f'(z)$ is the derivative of $f(z)$, and

ii. $f_{j+1}(z) = \bar{b}_0^{(j)} f_j(z) - b_{n-1-j}^{(j)} f_j^*(z)$ for $j = 0, 1, \ldots, n-2$, where $f_j(z)$ is represented as the polynomial $\sum_{k=0}^{n-1-j} b_k^{(j)} z^k$.

In each polynomial $f_j(z)$, the constant term $b_0^{(j)}$ is a real number, which we denote by $d_j$: $d_{j+1} = b_0^{(j+1)} = |b_0^{(j)}|^2 - |b_{n-1-j}^{(j)}|^2$ for $j = 0, 1, \ldots, n-2$. Then, necessary and sufficient conditions for all roots of $f(z)$ to be distinct and to lie on the unit circle, $|z| = 1$, assuming without loss of generality that $a_n = 1$, are:

$|a_0| = 1$, and $a_i = \bar{a}_{n-i}\, a_0$ for $i = 1, 2, \ldots, n-1$, and (3.1)

$d_1 < 0$ and $d_j > 0$ for $j = 2, 3, \ldots, n-1$. (3.2)

Proof. See appendix A.
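The Stage 2 transformation and equations 3.1 can be illustrated numerically with NumPy's `np.poly`, which returns the coefficients of the monic polynomial with prescribed roots. The sketch below uses arbitrary illustrative root phases and exercises only the necessity direction of the theorem (roots on the circle imply the conditions):

```python
import numpy as np

# Four spikes as unit-modulus complex numbers (illustrative phases).
phases = np.array([0.3, 1.1, 2.5, 4.0])
roots = np.exp(1j * phases)
n = len(roots)

# Stage 2: keep only the coefficients of f(z) = prod_k (z - z_k).
coeffs = np.poly(roots)            # [a_n = 1, a_{n-1}, ..., a_0]

# Order information is gone: permuting the roots leaves the coefficients
# unchanged (up to floating-point error).
assert np.allclose(coeffs, np.poly(roots[[2, 0, 3, 1]]))

# Conditions 3.1: |a_0| = 1 and a_i = conj(a_{n-i}) * a_0.
a = coeffs[::-1]                   # a[i] is now the coefficient of z^i
assert np.isclose(abs(a[0]), 1.0)
for i in range(1, n):
    assert np.isclose(a[i], np.conj(a[n - i]) * a[0])
```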
Stated informally, equations 3.1 ensure that the number of degrees of freedom reflected in the dimensionality of the initial space does not change after the transformation. Of the coefficients $\langle a_{n-1}, a_{n-2}, \ldots, a_0 \rangle$ ($a_n = 1$ w.l.o.g.), (i) $a_0$ has one degree of freedom ($|a_0| = 1$), and (ii) if $n$ is odd, $\langle a_{\lceil n/2 \rceil - 1}, \ldots, a_1 \rangle$ can be derived from $\langle a_{n-1}, \ldots, a_{\lceil n/2 \rceil}; a_0 \rangle$, or (iii) if $n$ is even, $a_{n/2}$ has only one degree of freedom ($a_{n/2} = \bar{a}_{n/2}\, a_0$ implies that $a_{n/2} = |a_{n/2}|\, a_0^{1/2}$), and $\langle a_{n/2 - 1}, \ldots, a_1 \rangle$ can be derived from $\langle a_{n-1}, \ldots, a_{n/2 + 1}; a_0 \rangle$. Hence, if $n$ is odd, $\langle a_{n-1}, \ldots, a_{\lceil n/2 \rceil}; a_0 \rangle \in \mathbb{C}^{\lfloor n/2 \rfloor} \times S^1$ completely specifies the polynomial, and if $n$ is even, $\langle a_{n-1}, \ldots, a_{n/2+1}; a_{n/2}, a_0 \rangle \in \mathbb{C}^{\lfloor n/2 \rfloor - 1} \times M^2$ does the same. $M^2$ denotes the two-dimensional Möbius band, and we shall demonstrate that when $n$ is even, $\langle a_{n/2}, a_0 \rangle \in M^2$. $a_0^{1/2}$ has two solutions, $e^{i\theta}$ and $e^{i(\theta + \pi)}$, where $\theta \in [0, \pi)$. $\langle a_{n/2}, a_0 \rangle$ is represented as $\langle \pm|a_{n/2}|, a_0 \rangle$, where $a_{n/2}$'s choice between $\theta$ and $(\theta + \pi)$ is transferred to the sign of $|a_{n/2}|$ (positive $|a_{n/2}|$ implies $\theta$ and negative $|a_{n/2}|$ implies $(\theta + \pi)$). Topological constraints then require that all points $\langle \pm|a_{n/2}|, a_0 \rangle \equiv \langle a, 0 \rangle$ be identified with $\langle \pm|a_{n/2}|, a_0 \rangle \equiv \langle -a, 2\pi \rangle$.

Expressions 3.2 enforce $(n-1)$ additional constraints expressed as strict inequalities on $C^1$ functions over the transformed space ($\mathbb{C}^{\lfloor n/2 \rfloor} \times S^1$ or $\mathbb{C}^{\lfloor n/2 \rfloor - 1} \times M^2$). That the functions are $C^1$ is evident from their being composed entirely of the operators $+$, $-$, $\times$, and conjugation. We note immediately, based on continuity and strict inequality, that the resulting space is open in $\mathbb{C}^{\lfloor n/2 \rfloor} \times S^1$ ($n$ odd) or $\mathbb{C}^{\lfloor n/2 \rfloor - 1} \times M^2$ ($n$ even). Moreover, since the space is the image of a connected set under a continuous mapping, it is connected as well. We denote the resultant space by $L^n$.

We now consider the case wherein both distinct and multiple roots are allowed.

Theorem 2. The transformed space corresponding to $f(z)$, constrained to have all roots on $|z| = 1$ without the requirement that they be distinct, is the closure of $L^n$ ($\bar{L}^n$) in $\mathbb{C}^{\lfloor n/2 \rfloor} \times S^1$ for odd $n$, or in $\mathbb{C}^{\lfloor n/2 \rfloor - 1} \times M^2$ for even $n$. Furthermore, the set $(\bar{L}^n \setminus L^n)$ lies on the boundary of $\bar{L}^n$.
Proof. See appendix A.
Corollary 1. The multidimensional boundary set $(\bar{L}_n \setminus L_n)$ can be partitioned, based on the multiplicity of corresponding roots, into connected sets of fixed dimensionality. Each such set is diffeomorphic to an appropriate manifold $L_{n-1}, L_{n-2}, \ldots,$ or $L_0$ by way of the mapping that disregards all but one member of each multiple root. Finally, $\bar{L}_n$ is bounded because $\forall i$, $|a_i| \le \binom{n}{i}$. The transformed space is a compact manifold with boundaries: a subset of the same dimension of $\mathbb{C}^{\lfloor n/2\rfloor} \times S^1$ if $n$ is odd, or of $\mathbb{C}^{\lfloor n/2\rfloor-1} \times M^2$ if $n$ is even.
The nature of the spaces corresponding to dimensions $n = 1, 2, 3$ is best demonstrated in figures. For $n = 1$, the phase-space is the unit circle $S^1$; for $n = 2$, the Möbius band; and for $n = 3$, a solid torus whose cross-section is a concave triangle that revolves uniformly around its centroid as one travels around the torus. Figure 2a presents the phase-spaces for $n = 1, 2$, and $3$. Finally, the phase-space for the entire neuronal system is the Cartesian product of the phase-spaces of the individual neurons. We shall henceforth denote the phase-space of neuron $i$ by ${}^i L_{n_i}$.

$\mathcal{P}_i(x_1^1, \ldots, x_{n_1}^1, x_1^2, \ldots, x_{n_2}^2, \ldots, x_1^m, \ldots, x_{n_m}^m; x_1^i, \ldots, x_{n_i}^i)$ for all neurons $i$ was assumed to be $C^1$ in the previous section. In appendix B we define a corresponding function on the transformed space, $\mathcal{P}_i : {}^1 L_{n_1} \times {}^2 L_{n_2} \times \cdots \times {}^m L_{n_m} \times {}^i L_{n_i} \to \mathbb{R}$, which is $C^1$.
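Equations 3.1 themselves fall outside this excerpt, but the constraint they encode, that all roots of the monic polynomial lie on $|z| = 1$, is equivalent to the conjugate symmetry $a_k = \bar{a}_{n-k}\, a_0$ together with $|a_0| = 1$, which is what leaves only $\langle a_{n-1}, \ldots, a_{\lceil n/2\rceil}; a_0\rangle$ free. The following numerical sketch (ours, not part of the original study) checks this with NumPy:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6  # even, so the middle coefficient exhibits the Mobius-band constraint
roots = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n))   # all roots on |z| = 1
a = np.polynomial.polynomial.polyfromroots(roots)       # a[k] = coefficient of z^k, a[n] = 1

# |a_0| = 1, and a_k = conj(a_{n-k}) * a_0 for every k, so only
# <a_{n-1}, ..., a_{ceil(n/2)}; a_0> are free, as stated in the text.
assert np.isclose(abs(a[0]), 1.0)
for k in range(n + 1):
    assert np.isclose(a[k], np.conj(a[n - k]) * a[0])
# For even n the middle coefficient satisfies a_{n/2} = conj(a_{n/2}) * a_0,
# i.e., angle(a_{n/2}) = angle(a_0)/2 up to a multiple of pi.
print("conjugate symmetry verified for n =", n)
```

In particular, the $k = n/2$ case of the loop is exactly the one-degree-of-freedom condition $a_{\lceil n/2\rceil} = \bar{a}_{\lceil n/2\rceil}\, a_0$ discussed above.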
Arunava Banerjee
Figure 2: (a) Phase-spaces for $n = 1, 2$, and $3$. Note that the one-dimensional boundary of the Möbius band is a circle, and the two- and one-dimensional boundaries of the torus are a Möbius band and a circle, respectively. (b) Subspaces $s \ge 2$ and $s \ge 1$ in the phase-space for $n = 3$. The torus is cut in half, and the circle and the Möbius band within each half are exposed separately.

3.1 Geometric Structure of the Phase-Space: A Finer Topology. In the previous section we constructed the phase-space for an individual neuron as a fixed-dimensional manifold, the fixed dimensionality derived from the existence of an upper bound on the number of effective, efferent spikes that the neuron can possess at any time. We also noted that at certain times the neuron might possess fewer effective spikes. It follows that under such circumstances, the remaining variables are set at $e^{ix_i^j} = 1$. In this section we examine the structure induced in the phase-space as a result of this feature.

Formally, we denote by $p \in \langle {}^1 L_{n_1} \times {}^2 L_{n_2} \times \cdots \times {}^S L_{n_S}\rangle$ a point in the phase-space and by $\langle p_1, p_2, \ldots, p_S\rangle$ its projection on the corresponding individual spaces, that is, $p_i \in {}^i L_{n_i}$. For each neuron $i$, we denote the number of variables set at $e^{ix_i^j} = 1$ by $s_i$. Such spikes we label as dead, since their effectiveness on all neurons has expired. All other spikes we label as live. To begin, we note that for any ${}^i L_{n_i}$, $\mathrm{Space}|_{s_i \ge k} \subset \mathrm{Space}|_{s_i \ge (k-1)}$.

In the previous section we established that the phase-space of a neuron with a sum total of $k$ spikes is a $k$-dimensional compact manifold. We now relate a phase-space $A$ corresponding to $k$ spikes to the subset satisfying $s \ge 1$ of a phase-space $B$ corresponding to $(k+1)$ spikes. Equating the appropriate polynomials, $(z-1)[z^k + a_{k-1} z^{k-1} + \cdots + a_0] = z^{k+1} + b_k z^k + \cdots + b_0$, we arrive at

$$b_i = a_{i-1} - a_i \quad \text{for } i = 0, \ldots, k \ (\text{assuming } a_{-1} = 0,\ a_k = 1); \quad \text{hence,}$$
$$a_{i-1} = \sum_{j=i}^{k} b_j + 1 \quad \text{for } i = 1, \ldots, k. \qquad (3.3)$$
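Relation 3.3 is easy to confirm numerically. The sketch below (ours, using NumPy's lowest-degree-first coefficient convention) multiplies a random monic polynomial with unit-modulus roots by $(z-1)$ and checks both directions of the correspondence:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5
# Roots on the unit circle -> a monic degree-k polynomial z^k + a_{k-1} z^{k-1} + ... + a_0
thetas = rng.uniform(0.0, 2.0 * np.pi, k)
a = np.polynomial.polynomial.polyfromroots(np.exp(1j * thetas))  # a[i] = coeff of z^i, a[k] = 1

# Multiply by (z - 1): appends the (dead) root at e^{i0} = 1
b = np.polynomial.polynomial.polymul(a, np.array([-1.0, 1.0]))   # b[i] = coeff of z^i, b[k+1] = 1

# Forward relation: b_i = a_{i-1} - a_i  (with a_{-1} = 0, a_k = 1)
a_ext = np.concatenate(([0.0], a))          # a_ext[i] = a_{i-1}
assert np.allclose(b[:k + 1], a_ext[:k + 1] - a[:k + 1])

# Inverse relation (equation 3.3): a_{i-1} = sum_{j=i}^{k} b_j + 1  for i = 1, ..., k
for i in range(1, k + 1):
    assert np.isclose(a[i - 1], b[i:k + 1].sum() + 1.0)
print("coefficient relations verified")
```

The inverse direction follows by telescoping $\sum_{j=i}^{k}(a_{j-1} - a_j) = a_{i-1} - a_k = a_{i-1} - 1$, which is why $F_k$ below is injective.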
That the mapping $F_k : A \to B$, $F_k(a_{k-1}, \ldots, a_{\lceil k/2\rceil}; a_0) = (b_k, \ldots, b_{\lceil (k+1)/2\rceil}; b_0)$, is injective follows directly from equations 3.3. That it is an immersion ($\operatorname{rank} F_k = \dim A = k$ at all points) follows from the description of $(DF_k)$, which when simplified results in the $((k+1) \times k)$ matrix

$$\begin{pmatrix} I & 0 \\ 0 & C \end{pmatrix} \text{ for } k \text{ even, and} \quad \begin{pmatrix} I \\ D \end{pmatrix} \text{ for } k \text{ odd,} \qquad (3.4)$$

where $C = \langle \cos(\angle a_0/2), \sin(\angle a_0/2)\rangle^T$, $D = \langle |a_{\lceil k/2\rceil}|, 0, 0, \ldots, 2\sin(\angle a_0/2), -2\cos(\angle a_0/2)\rangle$, and $I$ and $0$ are identity and zero matrices of appropriate dimensions. $A$ being compact, $F_k : A \to B$ is also an imbedding. Consequently, $F_k$ is a homeomorphism onto its image; the topology induced by $A$ on $F_k(A)$ is compatible with the subspace topology induced by $B$ on $F_k(A)$. Finally, repeated application of the argument establishes that $\forall k$, $\mathrm{Space}|_{s \ge k}$ is a regular submanifold of $\mathrm{Space}|_{s \ge (k-1)}$.

Alternatively, the mapping from a space $A$ corresponding to $k$ spikes to the subset satisfying $s \ge (l - k)$ of a space $B$ corresponding to $l$ spikes can be constructed directly by composing multiple $F$'s. Not only does $F_{kl} = F_{l-1} \circ F_{l-2} \circ \cdots \circ F_k$ map $A$ into $B$, but it also maps relevant subspaces⁷ of $A$ into corresponding subspaces of $B$. Since $(DF_{kl}) = (DF_{l-1}) \cdot (DF_{l-2}) \cdots (DF_k)$, it follows from the arguments in the previous paragraph that $F_{kl}$ is an imbedding.

Figure 2b displays the subspaces satisfying $s \ge 2$ (imbedding of a circle) and $s \ge 1$ (imbedding of a Möbius band) located within the phase-space for $n = 3$. We shall henceforth denote by ${}^i L_{n_i}^j$ the subspace satisfying $s_i \ge j$ for neuron $i$. Consequently, ${}^i L_{n_i} = {}^i L_{n_i}^0 \supset {}^i L_{n_i}^1 \supset \cdots \supset {}^i L_{n_i}^{n_i}$.

It must be noted that $F_{kl} : A \to B$ not only maps phase-spaces but also maps flows identically, a fact manifest in the following informal description of the dynamics of the system: If the state of the system at a given instant is such that neither any live spike is on the verge of death nor is any neuron on the verge of spiking, then all live spikes continue to age uniformly. If a spike is on the verge of death, it expires (i.e., stops aging).
This occurs when a spike reaches $e^{ix_i^j} = 1$. If a neuron is on the verge of spiking, exactly one dead spike corresponding to that neuron is turned live. Since all dead spikes remain stationary at $e^{ix_i^j} = 1$, they register as a constant factor in the dynamics of a neuron. This has the important ramification that the total number of spikes assigned to a neuron has little impact on its phase-space dynamics. $F_{kl} : A \to B$, in essence, constitutes a bridge between all flows that correspond to a given number of live spikes possessed by a neuron over a given interval of time. While the total number of spikes assigned to a neuron dictates the dimensionality of its entire phase-space, flows corresponding to a given number of live spikes lie on a fixed-dimensional submanifold and are $C^1$-conjugate⁸ to each other.

⁷ Subspaces corresponding to at most $(k-1), (k-2), \ldots, 0$ live spikes.
⁸ Since $F_{kl} : A \to B$ is $C^1$.

We note that $\forall j \ge 1$, ${}^i L_{n_i}^j$ is a $C^1$ hypersurface⁹ of dimension $(n_i - j)$ in ${}^i L_{n_i}$. Moreover, since $\forall j \ge 2$, ${}^i L_{n_i}^j$ features a multiple root at $e^{i0}$, it lies on the boundary of ${}^i L_{n_i}$.

⁹ $S \subset M$ is a $k$-dimensional $C^p$ hypersurface in the $m$-dimensional manifold $M$ if for every point $s \in S$ and coordinate neighborhood $\langle U, \varphi\rangle$ of $M$ such that $s \in U$, $\varphi(U \cap S)$ is the image of an open set in $\mathbb{R}^k$ under an injective $C^p$ mapping of rank $k$, $Q : \mathbb{R}^k \to \mathbb{R}^m$.

The submanifolds ${}^i L_{n_i}^j$ ($j = 1, \ldots, n_i$) are not revealed in the topology of ${}^i L_{n_i}$ regarded (as the case may be) as a subspace of $\mathbb{C}^{\lfloor n_i/2\rfloor} \times S^1$ or of $\mathbb{C}^{\lfloor n_i/2\rfloor-1} \times M^2$ (topologized by the respective standard differentiable structures). We therefore assign ${}^i L_{n_i}$ the topology generated by the family of all relatively open subsets of ${}^i L_{n_i}^j$, $\forall j \ge 0$. This new, strictly finer topology better matches our intuitions about the set of states that should comprise the neighborhood of any given state of the system.

In the previous section we defined a $C^1$ function $\mathcal{P}_i(\cdot)$ that represented the potential at the soma of neuron $i$. We denote by $\mathcal{P}_i^S$ the subset of the space wherein $\mathcal{P}_i(\cdot) = T$, and by $\mathcal{P}_i^I$ the subset wherein $d\mathcal{P}_i(\cdot)/dt \ge 0$ is additionally true. Trivially then, $\mathcal{P}_i^I \subseteq \mathcal{P}_i^S$. Based on the implicit function theorem and the fact that at all points satisfying $\mathcal{P}_i(\cdot) = T$ there exists a direction $x$ such that $\partial\mathcal{P}_i(\cdot)/\partial x \ne 0$,¹⁰ it follows that $\mathcal{P}_i^S$ is a $C^1$ regular submanifold of codimension 1. Moreover, $\mathcal{P}_i^I$ is a closed subset of $\mathcal{P}_i^S$ since, $\mathcal{P}_i(\cdot)$ being $C^1$, $d\mathcal{P}_i(\cdot)/dt$ along any direction $d/dt$ is $C^1$ on $\mathcal{P}_i^S$.

¹⁰ This is not the case when $\mathcal{P}_i(\cdot) = T$ is a point. In all other cases there exist effective spikes that satisfy the criterion.

4 The Velocity Field

In this section we stipulate the velocity field, $\mathcal{V} : \prod_{i=1}^{S} {}^i L_{n_i} \to \prod_{i=1}^{S} T({}^i L_{n_i})$, that arises from the natural dynamics of the system. $T(\cdot)$ denotes the tangent bundle (appropriated from the ambient space $\mathbb{C}^{\lfloor n_i/2\rfloor} \times S^1$ or $\mathbb{C}^{\lfloor n_i/2\rfloor-1} \times M^2$). We define two vector fields, $\mathcal{V}^1$ and $\mathcal{V}^2$. For the case in which no spike is on the verge of death and no neuron is on the verge of spiking, the velocity is specified by $\mathcal{V}^1$. Formally, $\mathcal{V}^1$ applies to all points $p \in \prod_{i=1}^{S}({}^i L_{n_i}^{s_i} \setminus {}^i L_{n_i}^{s_i+1})$, for all values of $s_i \le n_i$, that satisfy $\forall i = 1, \ldots, S$, $p \notin \mathcal{P}_i^I$. For the case in which a neuron is on the verge of spiking, the velocity is specified by $\mathcal{V}^2$. Formally, $\mathcal{V}^2$ applies to all points $p$ that satisfy $\exists i$ such that $p \in \mathcal{P}_i^I$. Since $\mathcal{V}^1$ is, by definition, a function of the $s_i$'s for all $i = 1, \ldots, S$, it accounts for the case in which a spike is on the verge of death.

It follows from the natural dynamics of the system that given any point $p = \langle p_1, \ldots, p_S\rangle$ in the phase-space, $\mathcal{V}^1(p) = \langle \mathcal{V}_1^1(p), \ldots, \mathcal{V}_S^1(p)\rangle$ can be defined as $\langle \mathcal{V}_1^1(p_1), \ldots, \mathcal{V}_S^1(p_S)\rangle$. That is, each component $\mathcal{V}_i^1$ of $\mathcal{V}^1$ is solely a function of $p_i \in {}^i L_{n_i}$. This is also true for $\mathcal{V}^2$. Once it is determined which $\mathcal{P}_i^I$'s $p$ lies on, each component $\mathcal{V}_i^2$ can be computed based solely on $p_i$.
We noted earlier that the differentiable structures of ${}^i L_{(n_i-s_i)}^0$ and ${}^i L_{n_i}^{s_i}$ are compatible by virtue of $F_{(n_i-s_i)\,n_i}$ being an imbedding. We also noted that flows corresponding to $\mathcal{V}^1$ for $k$ live spikes lie on a $k$-dimensional submanifold and are $C^1$-conjugate to each other, $F_{kl}$ constituting the bridge between such flows. We therefore define $\mathcal{V}_i^1$ on $({}^i L_{(n_i-s_i)}^0 \setminus {}^i L_{(n_i-s_i)}^1)$. The corresponding velocity field on $({}^i L_{n_i}^{s_i} \setminus {}^i L_{n_i}^{s_i+1})$ is then defined as $F_{(n_i-s_i)\,n_i\,*}(\mathcal{V}_i^1)$.¹¹
Let $p_i$ correspond to $z^{n_i-s_i} + a_{n_i-s_i-1} z^{n_i-s_i-1} + \cdots + a_0 = \prod_{j=1}^{n_i-s_i} (z - z_j)$. Since all roots rotate at constant speed, that is, $z_j = e^{i(\theta_j + 2\pi t/U)}$, we have $dz_j/dt = \frac{2\pi i}{U} z_j$. Simple algebra then demonstrates:

Theorem 3. $\mathcal{V}_i^1$ for the $C^1$-conjugate flows on $({}^i L_{(n_i-s_i)}^0 \setminus {}^i L_{(n_i-s_i)}^1)$ is given by

$$\frac{d a_{n_i-s_i-k}}{dt} = \frac{2\pi}{U}\, k\, \hat{\theta} \quad \text{for } k = 1, \ldots, \Bigl\lfloor \frac{n_i-s_i}{2} \Bigr\rfloor \text{ and } k = n_i - s_i, \qquad (4.1)$$

$$\frac{d\,|a_{(n_i-s_i)/2}|}{dt} = 0 \quad \text{when } (n_i - s_i) \text{ is even.} \qquad (4.2)$$

$\hat{\theta}$ denotes the basis vector $\partial/\partial\theta$: on $\mathbb{C} \ni a_{n_i-s_i-k}$ for any $k = 1, \ldots, \lfloor\frac{n_i-s_i}{2}\rfloor$, represented as $\{\langle r, \theta\rangle \mid re^{i\theta} = a_{n_i-s_i-k}\}$; on $S^1 \ni a_0$, represented as $\{\theta \mid \theta = \angle a_0\}$; and on $M^2 \ni \langle a_{(n_i-s_i)/2}, a_0\rangle$ for even $(n_i - s_i)$, represented as $\{\langle r, \theta\rangle \mid r = \pm|a_{(n_i-s_i)/2}|,\ \theta = \angle a_0\}$. Stated informally, each parameter revolves around its origin at uniform speed, the speed of successive parameters increasing in steps of $(2\pi/U)$. Turning our attention to $\mathcal{V}_i^2$, we note that $\mathcal{V}_i^2(p_i)$ can be obtained from $\mathcal{V}_i^1(p_i)$
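The informal reading of Theorem 3, that the coefficient $a_{m-k}$ revolves $k$ times as fast as the roots, follows because $a_{m-k}$ is, up to sign, the $k$th elementary symmetric polynomial of the roots. A quick numerical illustration (our sketch; the names $m$, $U$, and the helper function are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
m, U = 5, 40.0                      # m live spikes, upper bound U on a spike's lifetime
thetas = rng.uniform(0.0, 2.0 * np.pi, m)

def coeffs(t):
    """Coefficients (lowest degree first) of prod_j (z - z_j(t)),
    with z_j(t) = e^{i(theta_j + 2 pi t / U)}."""
    roots = np.exp(1j * (thetas + 2.0 * np.pi * t / U))
    return np.polynomial.polynomial.polyfromroots(roots)

t = 7.3
c0, ct = coeffs(0.0), coeffs(t)
# a_{m-k} is (up to sign) the k-th elementary symmetric polynomial of the roots,
# so rotating every root by phi = 2 pi t / U multiplies a_{m-k} by e^{i k phi}:
phi = 2.0 * np.pi * t / U
for k in range(1, m + 1):
    assert np.isclose(ct[m - k], c0[m - k] * np.exp(1j * k * phi))
print("each coefficient a_{m-k} revolves k times as fast as the roots")
```

Note that the moduli $|a_{m-k}|$ are unchanged under the rotation, which is the content of equation 4.2 for the middle coefficient on the Möbius band.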
by setting $s_i$ to $(s_i - 1)$; $\mathcal{V}_i^2(p_i)$ for $p_i \in ({}^i L_{n_i}^{s_i} \setminus {}^i L_{n_i}^{s_i+1})$ is equivalent to $\mathcal{V}_i^1(p_i)$ on ${}^i L_{n_i}^{s_i-1}$, ignoring the fact that $p_i$ lies additionally on ${}^i L_{n_i}^{s_i} \subset {}^i L_{n_i}^{s_i-1}$.

We have defined $\mathcal{V} : \prod_{i=1}^{S} {}^i L_{n_i} \to \prod_{i=1}^{S} T({}^i L_{n_i})$ based on our consummate knowledge of flows in the phase-space. The utility of this construction is to be found in the assistance it renders in the analysis of local properties of flows (a topic pursued in the companion article in this issue). $\mathcal{V}$ is, however, discontinuous. It must therefore be confirmed that $\mathcal{V}$ does not generate inconsistencies in the model. We consider this problem next.

¹¹ We use the notation $F_*(X_p)\,f = X_p(f \circ F)$, where $F$ is a $C^1$ map of manifolds, $X_p$ is a tangent vector at $p$, and $f$ is an arbitrary function that belongs to $C^1(F(p))$.

Figure 3: Schematic depiction of the velocity field $\mathcal{V}$.
Both $\mathcal{V}_i^1$ and $\mathcal{V}_i^2$ are smooth on $({}^i L_{n_i}^{s_i} \setminus {}^i L_{n_i}^{s_i+1})$ for any $s_i$. The only points where the projection of a trajectory $Y(x, t)$ on ${}^i L_{n_i}^{s_i}$ is not differentiable (where, incidentally, it is not even continuous) is therefore when it meets ${}^i L_{n_i}^{s_i+1}$ or it meets $\mathcal{P}_i^I$. This gives rise to the potential problem that $d\mathcal{P}_i(\cdot)/dt$ might not be computable on $\mathcal{P}_i^I$, for whereas $\mathcal{P}_i(\cdot)$ is $C^1$, one component of $dY(x,t)/dt$ in ${}^i L_{n_i}^{s_i}$ (the velocity of the dead spike that just turned live) is undefined. However, one need only recall from appendix B that for any spike $x$ stationed at $e^{i0}$, $\partial\mathcal{P}_i(\cdot)/\partial x = 0$ over an infinitesimal interval ($e^{i\delta} > e^{ix} > e^{i(2\pi-\delta)}$). The problem is therefore resolved trivially.

Figure 3 presents a schematic depiction of the velocity field $\mathcal{V}$ described above. Figure 3a displays $\mathcal{V}$ corresponding to a patch of ${}^i L_2^1$ within a patch of ${}^i L_2^0$, and Figure 3b displays $\mathcal{V}$ corresponding to a $\mathcal{P}_i^I$ lying on a patch of ${}^i L_3^1$ within ${}^i L_3^0$.

We are now in a position to define an abstract dynamical system without reference to neuronal systems. The phase-space of the system is $\prod_{i=1}^{S} {}^i L_{n_i}$, which contains $S$ $C^1$ hypersurfaces $\mathcal{P}_i^I$ for $i = 1, \ldots, S$. The velocity field $\mathcal{V}$ on the phase-space is as defined in the previous theorem. In order for the system to be consistent, each hypersurface $\mathcal{P}_i^I$ must at the least satisfy two additional constraints. First, the criterion that any neuron $i$ possess at most $n_i$ live spikes can be enforced by the constraint: $\forall i = 1, \ldots, S$, $\mathcal{P}_i^I \cap \bigl(\prod_{j=1}^{i-1} {}^j L_{n_j}^0 \times ({}^i L_{n_i}^0 \setminus {}^i L_{n_i}^1) \times \prod_{j=i+1}^{S} {}^j L_{n_j}^0\bigr) = \emptyset$. Second, the criterion that multiple roots (coincident spikes) occur only at $e^{i0}$ (when the spikes are dead) can be enforced by requiring that the normal to the hypersurface $\mathcal{P}_i^I \cap \prod_{j=1}^{S} ({}^j L_{n_j}^{s_j-1} \setminus {}^j L_{n_j}^{s_j+1})$ at $\mathcal{P}_i^I \cap \prod_{j=1}^{S} ({}^j L_{n_j}^{s_j} \setminus {}^j L_{n_j}^{s_j+1})$ not be orthogonal to $\mathcal{V}^2$, for all values of $i$ and $s_j \le n_j$. Finally, since neurons that spike spontaneously are not modeled, one can additionally enforce the weaker constraint: $\forall i = 1, \ldots, S$, $\mathcal{P}_i^I \cap \prod_{j=1}^{S} {}^j L_{n_j}^{n_j} = \emptyset$.
It then follows that $\prod_{i=1}^{S} {}^i L_{n_i}^{n_i}$, denoting the state of quiescence in the neuronal system, is the sole fixed point of the dynamical system.
5 Model Simulation and Results
A model is worthwhile only to the extent to which it succeeds in emulating the salient behavior of the physical system it models. In this section we therefore investigate, by means of simulation, the dynamical behavior of the model just described. We have chosen as our target unit a typical column in the neocortex. These neuronal subsystems are ideally suited for our purpose because their configurations (connectivity pattern, distribution of synapses, etc.) fit statistical distributions and they exhibit distinctive patterns of qualitative dynamical behavior.

5.1 General Anatomical and Physiological Characteristics of Neocortical Columns. Extensive information is available about the anatomical and physiological composition of the neocortex (Braitenberg & Schüz, 1991; Shepherd, 1998). While not sharply defined everywhere, neocortical columns have a diameter of approximately 0.5 mm. Intracolumn connectivity is markedly denser than intercolumn connectivity. Whereas inhibitory connections play a prominent role within a column, intercolumn connections are distinctly excitatory. A column contains approximately $10^5$ neurons. The number of synapses each neuron makes ranges between $10^3$ and $10^4$. The probability of a neuron's making multiple synapses on the dendrite of a postsynaptic neuron is fairly low (Braitenberg, 1978). The connectivity between neurons within a column (intracolumn as opposed to thalamic afferents), while having evolved to achieve a specific function, has been experimentally ascertained to fit a statistical distribution (Schüz, 1992). The location of afferent synapses on the collaterals of the neurons has similarly been shown to fit a statistical distribution (Braitenberg & Schüz, 1991).

Neurons in the neocortex can be divided into two main groups: the spiny and the smooth neurons. Spiny neurons can, in general, be subdivided into pyramidal and stellate cells, with pyramidal cells constituting by far the major morphological class. Diversity among smooth neurons is much higher (Peters & Regidor, 1981). Spiny cells receive only Type II (inhibitory) synapses on their cell bodies and both Type I (excitatory) and Type II synapses on their dendrites. They are presynaptic to only Type I synapses. Smooth cells receive both Type I and Type II synapses on their cell bodies and are, in general, presynaptic to Type II synapses (Peters & Proskauer, 1980). Approximately 80% of all neurons in the neocortex are spiny, and the rest are smooth. Both kinds contribute on average to the same number of synapses per neuron. Axonal ramifications that make local connections range in length between 100 and 400 μm.
The speed of propagation of a spike can be estimated at approximately 1 m/s, based on the fact that axons within a column are unmyelinated. Synaptic delays range between 0.3 and 0.5 msec. Consequently, the time interval between the generation of a spike at an axon hillock and its subsequent arrival across a synapse can be estimated to lie between 0.4 and 0.9 msec. Finally, anatomical investigations suggest that dendritic collaterals are seldom longer than two or three space constants.¹²

The past few decades have witnessed numerous investigations into the electrophysiological activity of the cortex. The majority of these studies fall under two categories. The first is the recording of action potentials (spikes), which reflect the output of single cortical neurons with a time resolution of milliseconds, and the second is the recording of the electroencephalogram (EEG), a slow, continuous wave that reflects the activity of hundreds of thousands of neurons. An aspect shared universally among single cortical neuron recordings is the apparent stochastic nature of the spike trains (Burns & Webb, 1976). A good measure of spike variability is given by the coefficient of variation (CV)¹³ of an ISI distribution. Spike trains of cortical neurons have CVs ranging from 0.5 to 1.1, resembling a Poisson process, for which CV = 1. In the case of EEG recordings, the outputs are significantly more abstruse. Classical signal analysis has long considered EEG records as realizations of (often stationary) stochastic processes, and spectral analysis has been the conventional method for extracting the dominant frequencies of the rhythms (Érdi, 1996). Spectral charts of EEG records generally exhibit peak densities at the frequencies of the major cerebral rhythms, superimposed on a 1/f spectral envelope. Recently, some researchers have come to regard EEG records as the chaotic output of a nonlinear system (Başar, 1990) and have attempted to measure the dimensionality of its various components (Babloyantz, Salazar, & Nicolis, 1985; Röschke & Başar, 1989). Unfortunately, the results remain varied and inconclusive.
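The CV statistic used throughout this section is straightforward to compute from a spike train; a minimal sketch (ours, not the study's code), with a Poisson train and a periodic train as sanity checks:

```python
import numpy as np

def cv(spike_times):
    """Coefficient of variation of the interspike-interval (ISI) distribution:
    std of the ISIs divided by their mean (CV = 1 for a Poisson process)."""
    isi = np.diff(np.sort(np.asarray(spike_times, dtype=float)))
    return isi.std() / isi.mean()

rng = np.random.default_rng(2)
# A homogeneous Poisson spike train gives CV close to 1;
# a perfectly periodic train gives CV = 0.
poisson_train = np.cumsum(rng.exponential(scale=10.0, size=20000))
periodic_train = np.arange(0.0, 1000.0, 17.0)
print(cv(poisson_train), cv(periodic_train))
```

The two reference values bracket the 0.5 to 1.1 range reported for cortical neurons above.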
5.2 Experimental Setup. We conducted simulation experiments to compare the dynamical characteristics of our model to the salient properties of the dynamics of neocortical columns. The parameterized function $v(x, t) = \{aQ/(x\sqrt{t})\}\, e^{-bx^2/t}\, e^{-ct}$ was used to model the response at the soma of a neuron to a spike (refer to MacGregor & Lewis, 1977, for a derivation of this equation). The total response at the soma was computed as the sum of the responses to individual spikes. The parameters $a$, $b$, and $c$ were set so as to fit response curves from NEURON v2.0 (Hines, 1993). Refractoriness was modeled by the function $ce^{-dt}$.¹⁴

¹² Distance over which the response drops by a factor of $e$.
¹³ The ratio of the standard deviation to the mean of the ISI histogram ($\sigma_{\Delta t}/\overline{\Delta t}$).
¹⁴ Whereas in the abstract model $\mathcal{P}(\cdot)$ must necessarily be $C^1$, such restrictions do not apply here because the function is, in any case, discretized in time for simulation.
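The response model above can be sketched as follows. The parameter values are illustrative placeholders only, not the values fitted to the NEURON v2.0 curves, and the helper names are ours:

```python
import numpy as np

def psp(x, t, a=1.0, b=1.0, c=0.1, Q=1.0):
    """Postsynaptic response at the soma to a single spike,
    v(x, t) = {a Q / (x sqrt(t))} e^{-b x^2 / t} e^{-c t}.
    x: electrotonic distance of the synapse from the soma; t: array of times
    since spike arrival (response is zero for t <= 0)."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    pos = t > 0.0
    out[pos] = (a * Q / (x * np.sqrt(t[pos]))) * np.exp(-b * x**2 / t[pos]) \
               * np.exp(-c * t[pos])
    return out

def soma_potential(t, spike_arrivals, x_syn, Q_syn, c_ref=5.0, d_ref=0.5,
                   last_spike=None):
    """Total somatic potential: sum of PSPs over afferent spikes,
    minus the refractory term c e^{-d t} after the neuron's own last spike."""
    v = sum(psp(x, t - s, Q=q) for s, x, q in zip(spike_arrivals, x_syn, Q_syn))
    if last_spike is not None:
        v = v - c_ref * np.exp(-d_ref * np.maximum(t - last_spike, 0.0))
    return v
```

As in the text, the kernel rises from zero, peaks later (and lower) for synapses farther from the soma, and decays exponentially.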
Four separate sets of experiments were performed. The physiological accuracy of the model was enhanced with each successive set of experiments. In each case a neuronal system comprising 1000 neurons, with each neuron connected randomly to 100 other neurons, was modeled.¹⁵

5.2.1 Model I. Eighty percent of all neurons in the system were randomly chosen to be excitatory (spiny) and the rest inhibitory (smooth). The strength of a synapse ($Q$) was chosen randomly from a uniform distribution over the range [5.7, 15.7]. Values for the parameters $a$, $b$, and $c$ were chosen so as to model a synapse with an uncharacteristically fast response time ($\tau \approx 10$ msec) and were set at fixed values for the entire system. $x$ in $v(x, t)$ was also set at a fixed value for the entire system. In other words, not only were all afferent synapses assumed to be located at the same distance from the soma, but they were also assumed to be generating identical (to a constant factor) responses. Excitatory and inhibitory neurons were constructed to be identical in all respects save the strength of inhibitory synapses, which was enhanced to six times the magnitude ($Q \times 6.0$). The lengths of the axonal collaterals and their variability were assumed to be uncharacteristically large; the time interval between the birth of a spike at a soma and its subsequent arrival across a synapse was randomly chosen to lie between 5 and 15 msec (uniformly distributed). The parameters $c$ and $d$ in the function modeling refractoriness were set at appropriate values and held constant over the entire system. The threshold was established such that at least 10 excitatory spikes (and no inhibitory spikes) with coincident peak impact¹⁶ were required to cause a neuron to fire, and it was held constant over the entire system. The system was initialized with 5 live spikes per neuron, chosen randomly over their respective lifetimes (approximately 200 spikes per second per neuron).

5.2.2 Model II. The lengths of the axonal collaterals as well as their variability were modified to reflect realistic dimensions. The time interval between the birth of a spike at a soma and its subsequent arrival across a synapse was randomly chosen to lie between 0.4 and 0.9 msec. All other aspects of the model were left unchanged.

5.2.3 Model III. The unrealistic assumption that all afferent synapses be located at the same distance from the soma and generate similar responses was eliminated. Instead, the locations of the synapses were chosen

¹⁵ Simulating larger systems of neurons was found to be computationally intractable. Instead, we ran experiments on systems with 100 and 500 neurons. The qualitative characteristics of the dynamics described here became more pronounced as the number of neurons in the system (and their connectivity) was increased.
¹⁶ The probability of 10 spikes' arriving in such a manner that their peak impacts on a soma coincide is very low. It took on average 20 live spikes to cause a neuron to fire.
based on anatomical data. On the spiny cells, inhibitory synapses were cast randomly to within a distance of 0.3 space constants from the soma, and excitatory synapses were cast randomly from a distance of 0.3 to 3.0 space constants from the soma. On smooth cells, both inhibitory and excitatory synapses were cast randomly between distances of 0.0 and 3.0 space constants from the soma. Since a synapse closer to the soma induced a more intense response, we were also able to eliminate the factitious assumption of inhibitory synapses' being six times stronger than excitatory synapses. Next, 50% of all excitatory synapses were randomly chosen to be of type non-NMDA (AMPA/kainate) and the rest of type NMDA.¹⁷ Likewise, 50% of all inhibitory synapses were randomly chosen to be of type GABA$_A$ and the rest of type GABA$_B$. The characteristic response of each kind of synapse was modeled after graphs reported in Bernander, Douglas, and Koch (1992) using the parameterized potential function. Finally, the threshold for firing of a neuron was reduced to an uncharacteristically low value, and the system was initialized at a very sparsely active state (approximately 20 spikes per second per neuron).

5.2.4 Model IV. The threshold for firing of a neuron was returned to its characteristic value, and the system was initialized at a more realistic state. (Two separate initializations, one with approximately 100 spikes per second per neuron and another with approximately 150 spikes per second per neuron, were considered.)

5.3 Data Recorded. The dynamics of several random instantiations of the models was investigated. In each case neurons were assigned a maximum number of effective spikes ($n = \lceil U/r \rceil$) based on upper ($U$) and lower ($r$) bounds computed from physiological parameters. Each system was initialized randomly, and the ensuing dynamics was observed in the absence of external input. Two classes of data were recorded.
The temporal evolution of the total number of live spikes registered by the entire system (a feature peculiar to the dynamical system under consideration) was recorded as an approximation to EEG data, and individual spike trains of 10 randomly chosen neurons from each system were recorded for comparison with real spike train recordings.

5.4 Results. The most significant outcome of the simulation experiments was the emergence of distinct patterns of qualitative dynamical behavior that were found to be robust across all models and their numerous instantiations.

¹⁷ While the voltage dependence of NMDA synapses can be modeled in this framework, for the sake of simplicity the NMDA receptors were assumed to be relieved of the Mg²⁺ block at all times. In other words, they were set at peak conductance irrespective of the postsynaptic polarization.

Figure 4: Time series (over the same interval) of the total number of live spikes from a representative system initiated at different levels of activity. Two of the three classes of behavior, intense periodic activity and sustained chaotic activity, are shown.

Three classes of behavior were encountered. Each randomly generated instantiation of the models possessed an intrinsic range of activity¹⁸ over which the system displayed sustained chaotic behavior. This range was found to conform with what is generally held as normal operational conditions in the neocortex. If the system was initiated at an activity level below this range, the total number of live spikes dropped as the dynamics evolved, until the trivial fixed point of zero activity (the state of quiescence) was reached. In contrast, if the system was initiated at an activity level above this range, the total number of live spikes rose until each neuron spiked periodically at a rate determined primarily by its absolute refractory period. In other words, the system settled into a periodic orbit of intense regular activity resembling a state of seizure. Figure 4 displays the result of one such experiment, a result that is representative of the qualitative dynamics of all the systems that were investigated.

Figures 5 and 6 report results from simulations of systems that were initialized at activity levels within the above-mentioned range. Figure 5 displays normalized time series data pertaining to the total number of live spikes from representative instantiations of each model. Also shown are the results of a power spectrum analysis corresponding to each of the time series.

¹⁸ The corresponding region in the state-space is not exactly quantifiable through numerical analysis because of the nature of chaos. Under conditions of relatively uniform activity, a range with soft bounds was, however, detected with respect to the average spike frequency.
Figure 5: Normalized time series of the total number of live spikes from instantiations of each model and the corresponding power spectra.
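Spectra of the kind shown in Figure 5 can be obtained from any recorded series with a plain FFT; a minimal sketch (ours; the article does not specify its estimation procedure):

```python
import numpy as np

def power_spectrum(x, dt=1.0):
    """One-sided power spectrum of a mean-removed time series,
    as used here to characterize the live-spike counts."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    ps = np.abs(np.fft.rfft(x))**2
    freqs = np.fft.rfftfreq(x.size, d=dt)
    return freqs, ps

# Sanity check: a periodic, "seizure-like" series shows a single dominant peak
# at its oscillation frequency.
t = np.arange(4096)
seizure_like = np.sin(2.0 * np.pi * t / 32.0)
freqs, ps = power_spectrum(seizure_like)
assert np.isclose(freqs[np.argmax(ps)], 1.0 / 32.0)
```

For the chaotic series the same routine would exhibit the broad 1/f-like envelope with superimposed peaks discussed below.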
Although there is a great deal more to an EEG recording than the dynamics of 1000 model neurons, the presence of peak densities at specific frequencies and a 1/f spectral envelope in each of the spectra supports the view that the abstract system does model the salient aspects of the dynamics of neocortical columns. Furthermore, the absence of any stochastic component in the model¹⁹ demonstrates that the standard features of the EEG spectrum do not necessarily imply that the underlying process is stochastic.

¹⁹ The neurons are deterministic, and the system does not receive external input.

Figure 6: ISI recordings of representative neurons from each model and corresponding frequency distributions.

Of the 10 neurons from each system whose spike trains were recorded during simulation, we chose one whose behavior we identified as representative of the mean. Figure 6 displays the time series of ISIs of these neurons (one per model) and their corresponding frequency distributions. The spike
trains are not only aperiodic, but their ISI distributions suggest that they could be generated from a Poisson process. We also computed the values of the CVs for the various spike trains. The results obtained were encouraging: values ranged from 0.063 to 2.921, with the majority falling around 1.21.²⁰ Neurons with higher rates of firing tended to have lower values of CV. The agreement between the qualitative aspects of the simulation data and those of real ISI recordings speaks additionally in favor of the viability of our model.

²⁰ Softky and Koch (1993) report that the CVs for spike trains of fast-firing monkey visual cortex neurons lie between 0.5 and 1.1. As the results clearly indicate, our deterministic model can account for this range of data.

We conducted a second set of experiments to ascertain how predisposed the systems were to chaotic behavior in the presence of regulating external input. Each system was augmented with two pacemaker neurons, an excitatory and an inhibitory, that spiked at regular intervals.²¹ The topology of the system was modified such that of the 100 afferent synapses on each neuron, $s$ received their input from one of the two pacemaker cells chosen randomly, and the remaining $(100 - s)$ received their input from other neurons in the system. $s$ was increased with successive experiments until the system was found to settle into a periodic orbit.

Figure 7: ISI recordings of five representative neurons each, from three systems with 70%, 80%, and 90% of the synapses on neurons driven by two pacemaker cells, are shown.

Figure 7 depicts the ISIs of five representative neurons each, from systems with $s$ set at 70, 80, and 90. The discovery that the system did not quite settle into a periodic orbit even when 90% of all synapses were driven by periodic spikes bears witness to the system's propensity for chaotic behavior.

²¹ The two pacemaker cells spiked every seventeenth time step, generating the simplest form of periodic input.
Figure 8: Two trajectories that diverge from very close initial conditions.
Finally, experiments were conducted to determine whether the systems were sensitive to initial conditions, a feature closely associated with chaotic behavior. Each system was initialized at two proximal points in state-space, and the resulting dynamics was recorded. Respective trajectories were found either to diverge strongly or to become coincident. Figure 8 depicts the temporal evolution of the total number of live spikes registered by two identical systems of 100 neurons that were initialized with approximately five live spikes per neuron, chosen randomly over respective lifetimes. The initialization of the two systems was identical except for one spike (chosen randomly) in the second system, which was perturbed by 1 msec. As the figure demonstrates, the trajectories diverged strongly.

The results presented in this section are substantially more qualitative than quantitative. The goal of the simulation experiments was not to emulate the dynamics of any specific neocortical column. The data necessary to accomplish such a feat are not available, and we do not claim that the idealized neuron models all aspects of a neocortical neuron. The principal objective of the exercise was to demonstrate that the abstract dynamical system developed in the previous section is capable of modeling adequately the qualitative dynamics of a typical neocortical column. To this end, values for the various parameters were chosen such that the resulting systems resembled generic neocortical columns, and the dynamics of these systems was assessed only to the extent to which it conformed with the generic dynamics of neocortical columns.

6 Conclusions and Future Research
There is general agreement in the scientific community that the physical states of the brain are causally implicated in cognition and behavior. What constitutes a discrete brain state is, however, a problem that remains far from resolved.
186
Arunava Banerjee
It is our belief that a principled resolution to the problem can be arrived at by analyzing, exclusively, the dynamics of the brain (or any part thereof) as constrained by its anatomy and physiology. From a dynamical systems perspective, the coherent spatiotemporal structures, if any, inherent in the dynamics of a physical system are revealed in any attractors that are present in its phase-space. Our objective has therefore been to determine whether there exist attractors in the dynamics of systems of neurons and, if so, what their salient characteristics are.

The first step in such a program is the construction of an abstract dynamical system that models systems of neurons. We have presented in this article a detailed exposition of one such dynamical system. In order to determine the viability of the model, we conducted simulation experiments on several instantiations of the system, each modeling a typical column in the neocortex. We found an agreeable fit between the qualitative aspects of the results and those of real data. A significant outcome of the simulation experiments was the emergence of three distinct categories of qualitative behavior that were found to be robust across all the investigated instantiations of the system. In addition to the trivial behavior of quiescence, the class of systems displayed sustained chaotic behavior in the region of the phase-space associated with normal operational conditions in the neocortex, and intense periodic behavior in the region of the phase-space associated with seizure-like conditions in the neocortex. Based on this evidence, we surmise that coherent structures do exist in the dynamics of neocortical neuronal systems operating under normal conditions and that they are chaotic attractors. This would then imply that at a certain level, the computational dynamics of such systems is inherently unpredictable.
Experimental evidence of the nature presented in this article, however, does not amount to a proof of the existence of chaotic attractors in this class of systems. Furthermore, as yet little is known about the structure of these attractors. In the companion article in this issue, we pursue these issues through a formal analysis of the phase-space dynamics of the abstract system. Not only is the existence of chaos in the noted region of the phase-space confirmed, but the root cause behind the existence of the distinct categories of qualitative behavior is also revealed.

The results of the experiments also shed light on why the strictly neurophysiological perspective on the study of the brain faces profound problems. Whereas the abstract system presented in this article contains a good deal of structure, that structure is not easily revealed in its dynamics. This is even more so in the case of real neurophysiological data, where the impression is one of an appalling lack of structure. The incongruence between data recorded at the individual or local neuronal level and at the organism's behavioral level is enormous. Bridging this gap without access to an intermediary model is a formidable task. Whereas approaching the problem via a model might be
deemed weaker because acceptance of the results hinges on acceptance of the adequacy of the model, it is our view that, given the circumstances, this is arguably the only viable approach.

Appendix A
The proof of theorem 1 is based on the following two theorems.

Theorem 4. All roots of f(z) are distinct and lie on |z| = 1 ⟺ (a) |a_0| = 1 and a_i = ā_{n−i} a_0 for i = 1, …, n − 1, and (b) f′(z) has all roots in |z| < 1.
Proof.
i. Distinct roots on |z| = 1 ⟹ (a) and (b). Assume w.l.o.g. that a_n = 1. z_i is a root of f(z) ⟺ z_i* = 1/z̄_i is a root of f*(z). |z_i| = 1 implies 1/z̄_i = z_i. Hence the roots of f(z) and f*(z) are identical. Consequently, (a) holds. That all roots of f′(z) lie in |z| < 1 follows from all roots of f(z) being distinct and the Lucas theorem (any convex polygon that contains all roots of f(z) also contains all roots of f′(z)).

ii. (a) and (b) ⟹ distinct roots on |z| = 1. That the roots of f(z) lie on |z| = 1 follows from the following theorem in Cohn (1922).

Lemma 1 (Cohn). If g(z) = b_m z^m + b_{m−1} z^{m−1} + ⋯ + b_0 satisfies b_{m−i} = u b̄_i for all i = 0, …, m, then g(z) has in |z| < 1 as many roots as [g′(z)]* = z^{m−1} ḡ′(1/z).

f(z) by criterion (a) satisfies the premise, and therefore it has in |z| < 1 as many roots as [f′(z)]*. By criterion (b), [f′(z)]* has no roots in |z| < 1. Hence, f(z) has no roots in |z| < 1. Finally, |a_0| = 1 constrains all roots of f(z) to lie on |z| = 1. Criterion (b) then enforces that f(z) has no multiple roots.
Theorem 5. d_1 < 0 and d_j > 0 for j = 2, …, n − 1 (from theorem 1) ⟺ all roots of f′(z) lie in |z| < 1.
Proof.
i. d_1 < 0 and d_j > 0 for j = 2, …, n − 1 ⟹ all roots of f′(z) lie in |z| < 1. Follows directly from the following theorem in Marden (1948).

Lemma 2 (Marden). For f(z) = a_m z^m + a_{m−1} z^{m−1} + ⋯ + a_0, construct the sequence ⟨f_0(z), f_1(z), …, f_m(z)⟩ as f_j(z) = Σ_{k=0}^{m−j} a_k^{(j)} z^k, where f_0(z) = f(z) and, for j = 0, …, m − 1, f_{j+1}(z) = ā_0^{(j)} f_j(z) − a_{m−j}^{(j)} f_j*(z). If the d_{j+1} = a_0^{(j+1)} satisfy d_1 < 0 and d_2, …, d_m > 0, then f(z) has all roots in |z| < 1.
ii. All roots of f′(z) lie in |z| < 1 ⟹ d_1 < 0 and d_j > 0 for j = 2, …, n − 1.

(a) d_1 < 0. Since f′(z) = f_0(z) = b_m z^m + ⋯ + b_1 z + b_0 (here m = n − 1) has all roots in |z| < 1, the modulus of the product of all roots is less than 1. Hence |b_0| < |b_m|, and d_1 = |b_0|² − |b_m|² < 0.
(b) d_2, d_3, …, d_{n−1} > 0 follows from the following theorem in Marden (1948).

Lemma 3 (Marden). If, in lemma 2, f_j(z) has p_j roots in |z| < 1 and none on |z| = 1, and if d_{j+1} ≠ 0, then f_{j+1}(z) has p_{j+1} = ½{(m − j) − [(m − j) − 2p_j] sgn d_{j+1}} roots in |z| < 1 and none on |z| = 1.
f_0(z) = f′(z) has all n − 1 roots in |z| < 1 and none on |z| = 1. We have shown that d_1 < 0; therefore sgn d_1 = −1. Based on the lemma, then, f_1(z) has p_1 = 0 roots in |z| < 1 and none on |z| = 1. In other words, all roots of f_1(z) lie in |z| > 1. Using the argument in (a), we get d_2 > 0. We now apply the lemma repeatedly to get p_2 = p_1 = 0 and so on, implying d_3, …, d_{n−1} > 0.

The proof of theorem 2 is based on the following theorem.

Theorem 6. x ∈ L̄_n and x ∉ L_n ⟺ x denotes a polynomial with all roots on |z| = 1 and at least one multiple root. x ∈ L̄_n and x ∉ L_n ⟹ x is a boundary point of L̄_n; that is, every open set containing x contains a y ∉ L̄_n.
Proof.
i. x ∈ L̄_n and x ∉ L_n ⟹ all roots on |z| = 1 and at least one multiple root. Since x is a limit point of L_n, there exists a denumerable sequence ⟨y_1, y_2, …, y_k, y_{k+1}, …⟩ such that y_i ∈ L_n for all i and lim_{i→∞} y_i = x. We now apply the following theorem in Coolidge (1908) to the sequence.

Lemma 4 (Coolidge). If two polynomials f_1(z) = z^m + a_{m−1} z^{m−1} + ⋯ + a_0 and f_2(z) = z^m + b_{m−1} z^{m−1} + ⋯ + b_0 are so related that each a_i is a constant and each b_i approaches a_i, then the roots of f_1(z) and f_2(z), where each multiple root of order k is counted k times as k roots, may be put into such a one-to-one correspondence that the absolute value of the difference between each two corresponding roots approaches 0.

Since x ∉ L_n, either the corresponding polynomial does not have all roots on |z| = 1, or not all roots are distinct. If the first case is true, the distance between at least one pair of corresponding roots in the theorem is bounded from below, thus contradicting it.

ii. All roots on |z| = 1 and at least one multiple root ⟹ x ∈ L̄_n and x ∉ L_n. Trivially, x ∉ L_n. Let ⟨z_1, z_2, …, z_n⟩ be the roots of the polynomial corresponding to x, with multiple roots of order k repeated k times. Choose a δ > 0 such that δ < |(1/2) min_{i,j : z_i ≠ z_j} arg(z_i / z_j)|. Choose ⟨δ_1, δ_2, …, δ_n⟩ such that δ_i ≠ δ_j whenever i ≠ j, and 0 < |δ_i| < δ for all i. Now consider the sequence of ordered sets ⟨z_1 e^{i(δ_1/2^p)}, z_2 e^{i(δ_2/2^p)}, …, z_n e^{i(δ_n/2^p)}⟩ for p = 0, 1, …. Each set comprises distinct roots on |z| = 1. The sequence converges absolutely to ⟨z_1, z_2, …, z_n⟩. Since coefficients are continuous functions of roots, the corresponding sequence ⟨y_1, y_2, …, y_k, y_{k+1}, …⟩ in coefficient space satisfies y_i ∈ L_n for all i and lim_{i→∞} y_i = x. x is therefore a limit point of L_n.
iii. x ∈ L̄_n and x ∉ L_n ⟹ x is a boundary point. x ∈ C^{⌊n/2⌋} × S¹ (respectively, C^{⌊n/2⌋−1} × M²) implies that the corresponding polynomial satisfies: there exists a permutation σ such that z_i = 1/z̄_{σ(i)}. Let x ∈ L̄_n and x ∉ L_n. Let ⟨e^{iθ_1}, e^{iθ_2}, …, e^{iθ_n}⟩ be the set of roots that correspond to x and, w.l.o.g., θ* = θ_1 = θ_2 = ⋯ = θ_k (k ≥ 2). For any δ < 1, consider the sequence of ordered sets ⟨((2^p − δ)/2^p) e^{iθ*}, (2^p/(2^p − δ)) e^{iθ*}, e^{iθ_3}, …, e^{iθ_n}⟩ for p = 0, 1, …. The sequence converges absolutely to ⟨e^{iθ_1}, e^{iθ_2}, …, e^{iθ_n}⟩, and each set satisfies: there exists a σ such that z_i = 1/z̄_{σ(i)}. Based on arguments identical to ii above, the corresponding sequence ⟨y_1, y_2, …, y_k, y_{k+1}, …⟩ in coefficient space satisfies y_i ∉ L̄_n for all i and lim_{i→∞} y_i = x. x is therefore a boundary point of L̄_n.

Appendix B
Assumption iii of section 2 makes P(·) C^∞ on T^n. Let F^{−1} : L̄_n → U ⊂ T^n map the transformed space into the initial space: F^{−1}⟨a_{n−1}, …, a_{⌈n/2⌉} (|a_{⌈n/2⌉}|); a_0⟩ = ⟨x_1, …, x_n⟩ for odd (even) n, such that 0 ≤ x_1 ≤ x_2 ≤ ⋯ ≤ x_n < 2π. Boundaries of U are of no concern because the range of F^{−1} could equally well be δ ≤ x_1 ≤ x_2 ≤ ⋯ ≤ x_n < 2π + δ for arbitrary δ, and P(·) is C^∞ on T^n. Let G(·) denote the Cartesian product of F^{−1}(·) with itself appropriately many times, mapping the transformed spaces of all relevant neurons into their initial spaces. Define, tentatively, 𝒫 = P ∘ G. Since both P and F are C^∞, 𝒫 is C^∞ wherever the Jacobian of F is nonsingular.

Theorem 7. ∂x_k/∂a_j is undefined when x_k is a multiple root; that is, if there exist a, b such that x_a = x_b, then Det(DF) = 0.
Proof. Each a_j is a symmetric function of ⟨x_1, …, x_n⟩. Therefore, ∂a_j/∂x_a = ∂a_j/∂x_b for all j when x_a = x_b. In other words, if x_a = x_b, columns a and b in (DF) are identical.

Theorem 8. If x_i ≠ x_j whenever i ≠ j, then Det(DF) ≠ 0.
Proof. We prove the theorem for the cases n odd and n even separately.
i. n is odd. Based on the functions defining coefficients in terms of roots and simple matrix manipulations, (DF) evaluated at ⟨x_1, x_2, …, x_n⟩ = ⟨θ_1, θ_2, …, θ_n⟩ can be reduced to the matrix [a_{i,j}]_{1≤i,j≤n} defined as: for all j = 1, …, n, (a) if i = 1, a_{i,j} = 1; (b) if i > 1 is even, a_{i,j} = sin((i/2) θ_j); and (c) if i > 1 is odd, a_{i,j} = cos(((i − 1)/2) θ_j). We prove Det(DF) ≠ 0 by demonstrating that for any row vector x, x · [a_{i,j}]_{1≤i,j≤n} = 0 implies x ≡ ⟨0, …, 0⟩. Let n = 2m + 1, and f(x) = α_0 + Σ_{i=1}^{m} (α_{2i−1} sin ix + α_{2i} cos ix). For arbitrary reals α_0, α_1, …, α_{2m}, not all zero, we demonstrate that the maximum number of roots that f(x) can have in [0, 2π) is 2m, thus proving the claim.
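The reduced matrix just described can be probed numerically. The sketch below (our own illustration; the angles are arbitrary) builds [a_{i,j}] for a small odd n and checks that it is nonsingular at distinct angles (theorem 8) and singular when two angles coincide (theorem 7):

```python
import numpy as np

def reduced_jacobian(thetas):
    """Build the n x n matrix [a_{i,j}] described above: first row all ones,
    even rows sin((i/2)*theta_j), odd rows (i > 1) cos(((i-1)/2)*theta_j)."""
    n = len(thetas)
    rows = [np.ones(n)]
    for i in range(2, n + 1):
        if i % 2 == 0:
            rows.append(np.sin((i // 2) * thetas))
        else:
            rows.append(np.cos(((i - 1) // 2) * thetas))
    return np.array(rows)

thetas = np.array([0.4, 1.3, 2.2, 3.6, 5.1])          # distinct angles, n = 5 (odd)
assert abs(np.linalg.det(reduced_jacobian(thetas))) > 1e-9    # theorem 8: nonsingular

repeated = np.array([0.4, 0.4, 2.2, 3.6, 5.1])        # x_a = x_b: identical columns
assert abs(np.linalg.det(reduced_jacobian(repeated))) < 1e-9  # theorem 7: singular
```

The singular case is immediate from the proof of theorem 7: coinciding angles produce two identical columns, so the determinant vanishes exactly.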
Let g(x) = c_0 + Σ_{i=1}^{m} c_i sin(ix + b_i) be such that g(x) ≡ f(x). Since g(x) is periodic with period 2π, if g(x) has μ roots in [0, 2π), then g_1(x) = g′(x) has at least μ roots in [0, 2π), and since g_1(x) is also periodic with period 2π, g_2(x) = g″(x) has at least μ roots in [0, 2π), and so on. g_{4k}(x) = Σ_{i=1}^{m} i^{4k} c_i sin(ix + b_i) for k = 1, 2, …. As k rises, (a) the coefficients of sin(ix + b_i) for i = 1, …, m − 1 become negligible as compared to the coefficient of sin(mx + b_m), and (b) the magnitude of the gradients of i^{4k} c_i sin(ix + b_i) for i = 1, …, m − 1 also become negligible as compared to the gradient of m^{4k} c_m sin(mx + b_m) near its roots. Since m^{4k} c_m sin(mx + b_m) has 2m roots in [0, 2π), we have μ ≤ 2m.

ii. n is even. Let n = 2m. For any θ*, the mapping F⟨x_1, …, x_{2m}⟩ = ⟨a_{n−1}, …, |a_{⌈n/2⌉}|; a_0⟩ is a C^∞-diffeomorphism in an open set about ⟨θ_1, θ_2, …, θ_{2m}⟩ if and only if the mapping F′⟨x_1, …, x_{2m}; x_{2m+1}⟩ = ⟨a_{n−1}, …, |a_{⌈n/2⌉}|; a_0; x_{2m+1}⟩ is a C^∞-diffeomorphism in an open set about ⟨θ_1, θ_2, …, θ_{2m}, θ*⟩. Consequently, all that needs to be shown is that for n = 2m, the mapping ⟨a_{2m−1}, …, |a_m|; a_0; x_{2m+1}⟩ → ⟨A_{2m}, …, A_{m+1}; A_0⟩ is C^∞ (the A_i's are the coefficients of the polynomial of degree 2m + 1 whose roots include the additional e^{iθ*}). Note that θ_i ≠ θ_j for all i ≠ j, and θ_i ≠ θ* for all i = 1, …, 2m, is enforced by i above on the mapping ⟨A_{2m}, …, A_{m+1}; A_0⟩ → ⟨x_1, …, x_{2m}; x_{2m+1}⟩. The trivial relations between the A_i's and the a_i's immediately prove that the mapping is C^∞.

The portion of L̄_n actually explored by the state dynamics of the neuron has multiple roots only at e^{ix^{j_1}} = e^{ix^{j_2}} = 1, that is, when the spikes are dead. We therefore define 𝒫(·) as follows:

1. If, for all relevant i, each ⟨x_i^1, x_i^2, …, x_i^{n_i}⟩ is composed of distinct elements, then 𝒫(·) is C^∞ as demonstrated in the preceding theorem.
2. If there exists a relevant i such that ⟨x_i^1, x_i^2, …, x_i^{n_i}⟩ is not composed of distinct elements, and x* is one such multiple element (at e^{ix*} = 1), then we define, for all m, j_1, …, j_{l−1}, k_1, …, k_l (such that k_1 + ⋯ + k_l = m), and i_p^q's (not necessarily distinct, for q = 1, …, l, p = 1, …, k_q),

(∂^m P / ∂^{k_1}x^{j_1} ⋯ ∂^{k_{l−1}}x^{j_{l−1}} ∂^{k_l}x*) × (∂^{k_1}x^{j_1} / ∂a_{i_1^1} ⋯ ∂a_{i_{k_1}^1}) × ⋯ × (∂^{k_{l−1}}x^{j_{l−1}} / ∂a_{i_1^{l−1}} ⋯ ∂a_{i_{k_{l−1}}^{l−1}}) × (∂^{k_l}x* / ∂a_{i_1^l} ⋯ ∂a_{i_{k_l}^l}) = 0.  (B.1)

Although ∂^{k_l}x* / ∂a_{i_1^l} ⋯ ∂a_{i_{k_l}^l} is undefined in such a case, by imposing the additional constraint ∂P/∂x* = 0 over an infinitesimal interval (e^{iδ} > e^{ix*} > e^{i(2π−δ)}), irrespective of the values assigned to the other variables,
we can maintain infinite differentiability of 𝒫, since we then have ∂^m P / ∂^{k_1}x^{j_1} ⋯ ∂^{k_{l−1}}x^{j_{l−1}} ∂^{k_l}x* = 0 over the interval.
3. If there exists a relevant i such that ⟨x_i^1, x_i^2, …, x_i^{n_i}⟩ is not composed of distinct elements, but the member element x* is a distinct element, then ∂x*/∂a_j cannot be directly computed from (DF)^{−1} since det(DF) = 0. However, as is shown next, a tangible value exists for ∂x*/∂a_j. It is also shown that ∂x*/∂a_j is C^∞.
Theorem 9. Let the roots of f(z) = z^n + a_{n−1} z^{n−1} + ⋯ + a_0, constrained to lie on |z| = 1, be ⟨e^{ix^1}, e^{ix^2}, …, e^{ix^n}⟩. Let e^{ix^a} = e^{ix^b}, and let all other roots be distinct. Let the sequence of sets ⟨y_p^1, y_p^2, …, y_p^n⟩ for p = 1, 2, … be constructed such that, for each p, the y_p^q are distinct and, for each q, lim_{p→∞} y_p^q = x^q. Then (DF)^{−1}|_{⟨y_p^1, y_p^2, …, y_p^n⟩} is well defined for all p, and limiting values exist for ∂x^k/∂a_j for all k ≠ a, b. Moreover, ∂x^k/∂a_j is C^∞ for all such values of k.
Proof. That (DF)^{−1}|_{⟨y_p^1, y_p^2, …, y_p^n⟩} is well defined for all p follows from the previous theorem. Just as in the previous theorem, we consider the case for odd n first. We refer to the simplified version of (DF) evaluated at ⟨θ_1, θ_2, …, θ_n⟩ given earlier. In the limit, column b can be replaced by (d/dθ of column a) ≡ ⟨0, cos(θ_a), − sin(θ_a), 2 cos(2θ_a), −2 sin(2θ_a), …⟩ᵀ. We demonstrate that this matrix is invertible. Let g(x) = c_0 + Σ_{i=1}^{m} c_i sin(ix + b_i). For arbitrary reals c_0, c_1, b_1, …, c_m, b_m, not all zero, g(x) cannot have 2m distinct roots and one multiple root in [0, 2π), because g_1(x) must then have at least 2m + 1 roots in [0, 2π), and so on. All rows except a and b in (DF)^{−1}, computed as (DF)^{−1}(DF) = I, therefore have finite limiting values, identical to the corresponding values in the inverse of the matrix described above. Since the elements of the matrix are C^∞, and its determinant (a C^∞ function of its elements) is nonzero, the elements of its inverse are C^∞. The case for even n is treated exactly as in the previous theorem.

References

Aertsen, A., & Braitenberg, V. (1992). Information processing in the cortex: Experiments and theory. Berlin: Springer-Verlag.

Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1987). Information storage in neural networks with low levels of activity. Phys. Rev. A, 35(5), 2293–2303.

Babloyantz, A., Salazar, J. M., & Nicolis, C. (1985). Evidence of chaotic dynamics of brain activity during the sleep cycle. Phys. Lett. A, 111, 152–156.

Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Comp., 8, 1185–1202.
Bair, W., Koch, C., Newsome, W., & Britten, K. (1994). Reliable temporal modulation in cortical spike trains in the awake monkey. In Proceedings of the Symposium on Dynamics of Neural Processing.

Basar, E. (1990). Chaos in brain function. Berlin: Springer-Verlag.

Bernander, O., Douglas, R. J., & Koch, C. (1992). A model of regular-firing cortical pyramidal neurons (CNS Memo 16). Pasadena: California Institute of Technology.

Bower, J. M., & Beeman, D. (1995). The book of GENESIS: Exploring realistic neural models with the GEneral NEural SImulation System. New York: TELOS/Springer-Verlag.

Braitenberg, V. (1978). Cortical architectonics: General and areal. In M. A. B. Brazier & H. Petsche (Eds.), Architectonics of the cerebral cortex (pp. 443–465). New York: Raven Press.

Braitenberg, V., & Schüz, A. (1991). Anatomy of the cortex: Statistics and geometry. Berlin: Springer-Verlag.

Burns, B., & Webb, A. C. (1976). The spontaneous activity of neurones in the cat's cerebral cortex. Proc. Royal Soc. London (Biol.), 194, 211–223.

Cohn, A. (1922). Über die Anzahl der Wurzeln einer algebraischen Gleichung in einem Kreise. Math. Zeit., 14, 110–148.

Coolidge, J. L. (1908). The continuity of the roots of an algebraic equation. Annals of Math. (series 2), 9, 116–118.

Erdi, P. (1996). Levels, models, and brain activities: Neurodynamics is pluralistic (open peer commentary). Behav. and Brain Sci., 19, 296–297.

Freeman, W. J. (1991). Predictions on neocortical dynamics derived from studies in paleocortex. In E. Basar & T. H. Bullock (Eds.), Induced rhythms of the brain (pp. 183–199). Cambridge, MA: Birkhauser Boston.

Hines, M. (1993). NEURON—A program for simulation of nerve equations. In F. Eckman (Ed.), Neural systems: Analysis and modeling (pp. 127–136). Norwell, MA: Kluwer.

Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A., 79, 2554–2558.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.

Krüger, J. (1991a). Neuronal cooperativity. Berlin: Springer-Verlag.

Krüger, J. (1991b). Spike train correlation on slow time scales in monkey visual cortex. In J. Krüger (Ed.), Neuronal cooperativity. Berlin: Springer-Verlag.

MacGregor, R. J., & Lewis, E. R. (1977). Neural modeling. New York: Plenum Press.

Marden, M. (1948). The number of zeros of a polynomial in a circle. Proc. Natl. Acad. Sci. U.S.A., 34, 15–17.

Nunez, P. L. (1989). Generation of human EEG by a combination of long and short range neocortical interactions. Brain Topography, 1, 199–215.

Omlin, C. W., & Giles, C. L. (1996). Constructing deterministic finite-state automata in recurrent neural networks. J. ACM, 43, 937–972.

Peters, A., & Proskauer, C. C. (1980). Synaptic relationships between a multipolar stellate cell and a pyramidal neuron in the rat visual cortex: A combined Golgi-electron microscope study. J. Neurocytol., 9, 163–183.
Peters, A., & Regidor, J. (1981). A reassessment of the forms of nonpyramidal neurons in area 17 of cat visual cortex. J. Comp. Neurol., 203, 685–716.

Röschke, J., & Basar, E. (1989). Correlation dimensions in various parts of cat and human brain in different states. In E. Basar & T. H. Bullock (Eds.), Brain dynamics (pp. 131–148). Berlin: Springer-Verlag.

Schüz, A. (1992). Randomness and constraints in the cortical neuropil. In A. Aertsen & V. Braitenberg (Eds.), Information processing in the cortex (pp. 3–21). Berlin: Springer-Verlag.

Shepherd, G. M. (1998). The synaptic organization of the brain. New York: Oxford University Press.

Siegelmann, H. T., & Sontag, E. D. (1992). On the computational power of neural nets. In Proceedings of the Fifth ACM Workshop on Computational Learning Theory (Pittsburgh, PA).

Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.

Strehler, B. L., & Lestienne, R. (1986). Evidence on precise time-coded symbols and memory of patterns in monkey cortical neuronal spike trains. Proc. Natl. Acad. Sci. U.S.A., 83, 9812–9816.

Wright, J. J. (1990). Reticular activation and the dynamics of neuronal networks. Biol. Cybern., 62, 289–298.

Received June 14, 1999; accepted May 1, 2000.
LETTER
Communicated by Peter Rowat
On the Phase-Space Dynamics of Systems of Spiking Neurons. II: Formal Analysis

Arunava Banerjee
Department of Computer Science, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, U.S.A.

We begin with a brief review of the abstract dynamical system that models systems of biological neurons, introduced in the original article. We then analyze the dynamics of the system. Formal analysis of local properties of flows reveals contraction, expansion, and folding in different sections of the phase-space. The criterion for the system, set up to model a typical neocortical column, to be sensitive to initial conditions is identified. Based on physiological parameters, we then deduce that periodic orbits in the region of the phase-space corresponding to normal operational conditions in the neocortex are almost surely (with probability 1) unstable, those in the region corresponding to seizure-like conditions are almost surely stable, and trajectories in the region corresponding to normal operational conditions are almost surely sensitive to initial conditions. Next, we present a procedure that isolates all basic sets, complex sets, and attractors incrementally. Based on the two sets of results, we conclude that chaotic attractors that are potentially anisotropic play a central role in the dynamics of such systems. Finally, we examine the impact of this result on the computational nature of neocortical neuronal systems.

1 Introduction
As mentioned in the original article, also appearing in this issue, the goal of our research is to determine whether there are coherent spatiotemporal structures in the dynamics of neuronal systems. Our approach to this problem has been to formulate an abstract dynamical system that models systems of biological neurons and subsequently conduct a comprehensive analysis of the dynamics of the system. In the original article we presented a detailed exposition of an abstract dynamical system that models systems of biological neurons. We also presented results from simulations of the system set up to model a typical column in the neocortex. The agreement between the qualitative aspects of the simulation results and that of real data attested to the viability of the model. In this article, we conduct a formal analysis of the dynamics of the abstract system.

Neural Computation 13, 195–225 (2000)
© 2000 Massachusetts Institute of Technology
In section 2, we briefly review the abstract dynamical system introduced in the original article. In section 3, we augment the system with an additional structure, a Riemannian metric, and perform a perturbation analysis on the dynamics of the augmented system. In section 4, we conduct a measure analysis on the system. The analysis reveals contraction, expansion, and folding in different sections of the phase-space. In section 5, we conduct a local cross-section analysis of the dynamics of the system. The criterion for the system, set up to model a typical neocortical column, to be sensitive to initial conditions is identified. The salient qualitative characteristics of the system are then deduced based on physiological parameters. The results not only explain the simulation results presented in the original article but also make predictions that are borne out by further experimentation. In section 6, we present a procedure that isolates all basic sets, complex sets, and attractors incrementally. Based on these results, we conclude that the coherent spatiotemporal structures in the dynamics of the system operating under conditions considered routine in the neocortex are almost surely chaotic attractors that are potentially anisotropic. Finally, in section 7, we examine the impact of this result on the computational nature of neocortical neuronal systems.

2 Abstract Dynamical System
We present a brief review of the abstract dynamical system that models systems of biological neurons, introduced in the original article. Readers who desire a comprehensive exposition of the system should consult the original article. The dynamical system is constructed in two stages. A deterministic model of a neuron is first formulated based on a set of basic assumptions about the biological neuron. A designated number of instances of the model are then linked together appropriately to construct the dynamical system.

2.1 Model of the Neuron. A biological neuron, at the highest level of abstraction, is a device that transforms multiple series of action potentials (also known as spikes) arriving at its various afferent (incoming) synapses into a series of action potentials on its axon. At the heart of this transformation lies the quantity P, the membrane potential at the soma of the neuron. Effects of the afferent spikes that have arrived at the various synapses of the neuron, as well as those of the efferent (outgoing) spikes that have departed since being generated at its soma, interact nonlinearly to generate this potential. The efferent spikes generated by the neuron coincide with the times at which this potential reaches the threshold of the neuron.

It was demonstrated in the original article, based on the observations that (1) the biological neuron is a finite precision machine, (2) the individual effects of afferent as well as efferent spikes on the membrane potential at
the soma of a biological neuron decay after a while at an exponential rate, and (3) the interval between successive spikes generated by a biological neuron is bounded from below by its absolute refractory period, that in a deterministic model it can reasonably be assumed that the individual effects of afferent as well as efferent spikes on the membrane potential at the soma decay to 0 within a bounded period of time. It then follows that at any given moment in time, there are at most a bounded number of most recent spikes (afferent as well as efferent) that are effective on the membrane potential at the soma of the neuron. This is modeled formally as follows.

Let i = 1, …, m denote the afferent synapses on a given neuron. Let r_0 denote its absolute refractory period, and r_i for i = 1, …, m the absolute refractory period of the neuron presynaptic to it at synapse i. Let τ_i for i = 0, …, m denote the bound on the period of effectiveness of spikes. To elaborate, any spike that arrived at synapse i = 1, …, m more than τ_i in the past, as well as any spike that was generated at the soma more than τ_0 in the past, can be disregarded in the computation of the current membrane potential at the soma of the neuron. It follows that for each i = 0, …, m, there are at most n_i = ⌈τ_i / r_i⌉ most recent spikes that need to be considered in the computation of the current membrane potential at the soma of the neuron.

The neuron is assigned a C^∞ function P(x_1^1, …, x_1^{n_1}, …, x_m^1, …, x_m^{n_m}; x_0^1, …, x_0^{n_0}) that models the current membrane potential at its soma. Subscripts i = 1, …, m represent the afferent synapses on the neuron. Each x_i^j for i = 1, …, m represents the time since the arrival of a distinct member of the n_i most recent spikes that have reached synapse i, and for i = 0 the time since the departure of a distinct member of the n_0 most recent spikes that were generated at the soma of the neuron. The domain of P(·) is restricted to 0 ≤ x_i^j ≤ τ_i for all i, j. Since n_i is merely an upper bound on the number of spikes (x_i^j) that satisfy 0 ≤ x_i^j ≤ τ_i, it is conceivable that fewer than n_i spikes satisfy the criterion, in which case the remaining variables are set at τ_i. Following is an additional set of constraints imposed on P(·) at the boundaries 0 and τ_i to maintain consistency in the model:

1. For all i = 0, …, m and j = 1, …, n_i, there exists a δ > 0 such that for all t ∈ [τ_i − δ, τ_i], ∂P/∂x_i^j |_{x_i^j = t} = 0, irrespective of the values assigned to the other variables.

2. For all i = 0, …, m and j = 1, …, n_i, there exists a δ > 0 such that for all t ∈ [0, δ], ∂P/∂x_i^j |_{x_i^j = t} = 0, irrespective of the values assigned to the other variables.

3. For all i = 0, …, m and j = 1, …, n_i, P(·)|_{x_i^j = 0} = P(·)|_{x_i^j = τ_i}, all other variables held constant at any values.

4. If x_i^j = 0 or τ_i for all i = 0, …, m and j = 1, …, n_i, then P(·) = 0.
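A concrete choice satisfying these boundary constraints is a per-spike kernel that is identically zero in small windows at both ends of [0, τ]. The sketch below is entirely our own illustration: the additive form (no interaction between spikes), the exponential decay, and the C^1 smoothstep ramp (a stand-in for the C^∞ functions the model requires) are all arbitrary choices; only the boundary behavior matters here.

```python
import numpy as np

TAU, DELTA, RAMP = 100.0, 2.0, 5.0   # effectiveness bound, dead zones, ramp width

def smoothstep(t):
    """C^1 ramp from 0 to 1 on [0, 1] (a stand-in for a C-infinity bump)."""
    t = np.clip(t, 0.0, 1.0)
    return 3.0 * t**2 - 2.0 * t**3

def kernel(x, w=1.0):
    """Effect of one spike of age x: a decaying bump that is identically 0 on
    [0, DELTA] and [TAU - DELTA, TAU], so constraints 1 through 4 hold."""
    rise = smoothstep((x - DELTA) / RAMP)
    fall = smoothstep((TAU - DELTA - x) / RAMP)
    return w * rise * fall * np.exp(-x / 30.0)

def P(ages, weights):
    """Toy membrane potential: a sum of per-spike kernels (interactions omitted)."""
    return sum(kernel(x, w) for x, w in zip(ages, weights))

# Constraint 4: if every age is 0 or TAU, the potential is exactly 0.
assert P([0.0, TAU, 0.0], [1.0, -0.5, 2.0]) == 0.0
# Constraints 1 and 2: the kernel is identically zero near both boundaries,
# so its derivative there is zero as well.
assert kernel(DELTA / 2) == 0.0 and kernel(TAU - DELTA / 2) == 0.0
# Constraint 3: the effect at age 0 equals the effect at age TAU (both zero).
assert kernel(0.0) == kernel(TAU) == 0.0
```

Because the kernel vanishes identically on the dead zones, every partial derivative of P with respect to a spike age is zero there, which is exactly what constraints 1 and 2 demand.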
Finally, a spike is generated by the neuron whenever P(·) = T (the membrane potential at the soma reaches the threshold of the neuron) and dP/dt ≥ 0 (during its rising phase).

2.2 The Dynamical System. We first consider systems of neurons that do not receive external input. Let i = 1, …, S denote the set of all neurons in any such system. Given any neuron i, an upper bound U_i on the time period until which any spike, since its inception, can be effective on the membrane potential at the soma of neuron i can be computed by extending each τ_j, for j = 1, …, m, by the time it takes a spike to reach synapse j from its inception at the soma of the corresponding presynaptic neuron, and choosing the maximum over all j = 0, …, m. It then follows that U = max_{i=1}^{S} U_i yields an upper bound on the time period until which any spike, since its inception, can be effective on any neuron in the system.

We now redefine x_i^j to denote the time since the inception of the spike at the soma of neuron i and reassign n_i to ⌈U/r_i⌉, where r_i is the absolute refractory period of neuron i. The P_i(·)'s (the subscript referring to neuron i) are modified accordingly: each function is translated along certain axes to account for the change in the origin of the x_i^j's, the functions are then appropriately redefined on higher-dimensional spaces to reflect the changes in the n_i's, their domains are modified to 0 ≤ x_i^j ≤ U for all i, j, and references to synapses in the variables are switched to references to appropriate neurons. We note that the constraints (1 through 4) set forth in the previous section hold on these modified functions at the new boundaries 0 and U.

The state of the system of neurons can now be specified completely by enumerating, for all neurons i = 1, …, S, the positions of the n_i (or fewer) most recent spikes generated by neuron i within U time from the present. Such a record specifies the exact location of all spikes that are still situated on the axons of respective neurons and, combined with the potential functions, it specifies the current state of the soma of all neurons. Note that all information regarding the topology of the network, the strength and location of the afferent synapses on the various neurons, their modulation of one another's effects at respective somas, the anatomical characteristics of the various axonal arborizations, and so forth is implicitly contained within the set of functions P_i(·) for i = 1, …, S. We can now generalize the system to receive external input by introducing additional neurons whose state descriptions are identical to that of the external input.

While the set of n_i-tuples ⟨x_i^1, x_i^2, …, x_i^{n_i}⟩ ∈ [0, U]^{n_i} for i = 1, …, S does specify the state of the system as argued above, this representation is fraught with redundancy. First, n_i being merely an upper bound on the number of spikes generated by neuron i that satisfy 0 ≤ x_i^j ≤ U, fewer spikes might satisfy the criterion, in which case the remaining components of the n_i-tuple are set at U. These components correspond to spikes whose effectiveness on
II: Formal Analysis
199
all neurons in the system has expired. The finite description of the state of a neuron requires that a variable be reused to represent a new spike (set at 0) when the effectiveness of the old spike it represented terminates (when it is set at $U$). One of the two positions, 0 and $U$, is therefore redundant. Second, if $\langle x_i^1, x_i^2, \ldots, x_i^{n_i}\rangle$ and $\langle y_i^1, y_i^2, \ldots, y_i^{n_i}\rangle$ are two $n_i$-tuples that are a nontrivial permutation of one another, they represent the same state of the neuron. Hence, all but one member of each such permutation group is redundant.

The transformation $\langle x^1, x^2, \ldots, x^n\rangle \to \langle a_{n-1}, a_{n-2}, \ldots, a_0\rangle$, where each $z_j = e^{(2\pi i/U)x^j}$ for $j = 1, \ldots, n$ is a root of the complex polynomial $f(z) = z^n + a_{n-1}z^{n-1} + \cdots + a_0$, eliminates these redundancies. It was shown in the original article that $|a_0| = 1$ and $a_i = \bar{a}_{n-i}\,a_0$ for $i = 1, \ldots, n-1$ constitute a necessary set of constraints for $f(z)$ to have all roots on $|z| = 1$. Consequently, for odd $n$, $\langle a_{n-1}, \ldots, a_{\lceil n/2\rceil}; a_0\rangle \in \mathbb{C}^{\lfloor n/2\rfloor} \times S^1$ completely specifies $f(z)$, and for even $n$, $\langle a_{n-1}, \ldots, a_{\lceil n/2\rceil+1}; a_{\lceil n/2\rceil}, a_0\rangle \in \mathbb{C}^{\lfloor n/2\rfloor-1} \times M^2$ does the same, where $\mathbb{C}$, $S^1$, and $M^2$ represent the complex plane, the unit circle, and the two-dimensional Möbius band, respectively. A sufficient set of constraints was also derived in the original article, and it was shown that when both sets of constraints are imposed, the resultant space is a compact subset of $\mathbb{C}^{\lfloor n/2\rfloor} \times S^1$ (for odd $n$) or $\mathbb{C}^{\lfloor n/2\rfloor-1} \times M^2$ (for even $n$). We denote the resultant space by $\mathfrak{L}^n$. $\mathfrak{L}^n$ is the closure of the open set $L^n$ that corresponds to all points $\langle x^1, x^2, \ldots, x^n\rangle$ that are composed of distinct components ($j_1 \ne j_2 \Rightarrow e^{(2\pi i/U)x^{j_1}} \ne e^{(2\pi i/U)x^{j_2}}$). $\mathfrak{L}^n \setminus L^n$ is a multidimensional boundary set that corresponds to all points $\langle x^1, x^2, \ldots, x^n\rangle$ that have one or more identical components ($e^{(2\pi i/U)x^{j_1}} = e^{(2\pi i/U)x^{j_2}}$ for some $j_1 \ne j_2$).
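The transformation and its necessary constraints can be checked numerically. The sketch below (NumPy; the spike ages and the value of $U$ are arbitrary illustrative choices) maps a tuple of spike ages to the coefficient vector via `numpy.poly` and confirms both the permutation invariance and the constraints $|a_0| = 1$ and $a_i = \bar{a}_{n-i}a_0$:

```python
import numpy as np

def spike_coords_to_poly(x, U=1.0):
    """Map spike ages <x^1, ..., x^n> to coefficients <a_{n-1}, ..., a_0> of the
    monic polynomial whose roots are z_j = exp(2*pi*1j*x^j / U)."""
    roots = np.exp(2j * np.pi * np.asarray(x, dtype=float) / U)
    # np.poly returns [1, a_{n-1}, ..., a_0] for the monic polynomial with the given roots
    return np.poly(roots)[1:]

a = spike_coords_to_poly([0.1, 0.4, 0.7])
b = spike_coords_to_poly([0.7, 0.1, 0.4])   # a nontrivial permutation of the same spikes
assert np.allclose(a, b)                     # the permutation redundancy is eliminated

# necessary constraints for all roots to lie on |z| = 1:
# with a = <a_2, a_1, a_0>, check |a_0| = 1 and a_1 = conj(a_2) * a_0
a2, a1, a0 = a
assert np.isclose(abs(a0), 1.0)
assert np.isclose(a1, np.conj(a2) * a0)
```

Note that the roots, and hence the coefficients, are insensitive to the ordering of the spike ages, which is exactly the redundancy the transformation is meant to remove.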
Finally, the boundary set can be partitioned, based on the multiplicity of the components, into subsets that are diffeomorphic to $\mathfrak{L}^{n-1}, \mathfrak{L}^{n-2}, \ldots,$ or $\mathfrak{L}^0$.

We assume hereafter that all variables $x_i^j$ are normalized, that is, scaled from $[0, U]$ to $[0, 2\pi]$. Each $P_i(\cdot)$ is likewise assumed to be modified to reflect the scaling of its domain. We denote by $^i\mathfrak{L}^{n_i}$ the resultant space for neuron $i$. The phase-space for the system of neurons is then given by $\prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}}$. It was demonstrated in the original article that the transformation described above, $F^{n_i}: T^{n_i} \to {^i\mathfrak{L}^{n_i}}$ (points in $T^{n_i}$, the $n_i$-torus, represented as $\langle e^{ix_i^1}, e^{ix_i^2}, \ldots, e^{ix_i^{n_i}}\rangle$), is a local diffeomorphism at all points satisfying $e^{ix_i^1} \ne e^{ix_i^2} \ne \cdots \ne e^{ix_i^{n_i}}$, and that corresponding to each $P_i(\cdot)$ there exists a $C^1$ function $\mathcal{P}_i: \prod_{j=1}^{S} {^j\mathfrak{L}^{n_j}} \to \mathbb{R}$.

We observed earlier that at certain times, neuron $i$ might possess fewer than $n_i$ effective spikes. It follows from the deliberations above that under such circumstances, the remaining variables are set at $e^{ix_i^j} = 1$. For each neuron $i$ we denote the number of such variables by $s_i$. The corresponding spikes we label dead, since their effectiveness on all neurons in the system has expired. All other spikes we label live.

We now present an informal description of the dynamics of the system. If the state of the system at a given instant is such that neither any live spike is on the verge of death nor
is any neuron on the verge of spiking, then all live spikes continue to age uniformly (the corresponding variables $x_i^j$ grow at a constant rate). If a spike is on the verge of death, it expires, that is, stops aging. This occurs when the corresponding variable reaches $e^{ix_i^j} = 1$. If a neuron is on the verge of spiking, exactly one dead spike corresponding to that neuron is turned live.

This dichotomy between dead and live spikes induces a certain structure on $^i\mathfrak{L}^{n_i}$. We denote by $^i\mathfrak{L}^{n_i}_j$ the subspace of $^i\mathfrak{L}^{n_i}$ that satisfies $s_i \ge j$. We note immediately that there exists a natural mapping from $^i\mathfrak{L}^{(n_i-s_i)}_0$, a space corresponding to $(n_i - s_i)$ live or dead spikes, to $^i\mathfrak{L}^{n_i}_{s_i}$, the subspace corresponding to $(n_i - s_i)$ live or dead spikes in a space corresponding to $n_i$ spikes. It was demonstrated in the original article that the canonical mapping $F^{n_i}_{(n_i-s_i)}: {^i\mathfrak{L}^{(n_i-s_i)}_0} \to {^i\mathfrak{L}^{n_i}_{s_i}}$ is not only an imbedding for all $s_i \le n_i$ but also maps flows identically, a fact manifest in the informal description of the dynamics of the system. To elaborate, since all dead spikes remain stationary at $e^{ix_i^j} = 1$, they register as a constant factor in the dynamics of the system. While the total number of spikes assigned to a neuron dictates the dimensionality of its entire space, flows corresponding to a given number of live spikes lie on a fixed-dimensional submanifold and are $C^1$-conjugate to one another.

The submanifolds $^i\mathfrak{L}^{n_i}_j$ ($j = 1, \ldots, n_i$) are not revealed in the topology of $^i\mathfrak{L}^{n_i}$ regarded (as the case may be) as a subspace of $\mathbb{C}^{\lfloor n_i/2\rfloor} \times S^1$ or of $\mathbb{C}^{\lfloor n_i/2\rfloor-1} \times M^2$ (topologized by the respective standard differentiable structures). We therefore assign $^i\mathfrak{L}^{n_i}$ the topology generated by the family of all relatively open subsets of $^i\mathfrak{L}^{n_i}_j$, $\forall j \ge 0$.

We denote by $PS_i$ the subspace of $\prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}}$ satisfying $\mathcal{P}_i(\cdot) = T$, and by $PI_i$ the subspace wherein $d\mathcal{P}_i(\cdot)/dt \ge 0$ is additionally true. It was demonstrated in the original article that $PS_i$ is a $C^1$ regular submanifold of codimension 1 and that $PI_i$ is a closed subset of $PS_i$.

The velocity field $V: \prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}} \to \bigcup_{i=1}^{S} T({^i\mathfrak{L}^{n_i}})$ is stipulated by way of two fields: $V^1$ for the case wherein $p \in \prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}}$ satisfies $p \notin PI_i$ for all $i = 1, \ldots, S$, and $V^2$ for the case wherein $p \in PI_i$ for some $i$.¹ We note from the informal description of the dynamics of the system that the component fields, $V^1_i$ and $V^2_i$, for each neuron $i = 1, \ldots, S$ can be specified based solely on $p_i \in {^i\mathfrak{L}^{n_i}}$. Moreover, it follows that $V^1_i$ can be defined on $({^i\mathfrak{L}^{(n_i-s_i)}_0} \setminus {^i\mathfrak{L}^{(n_i-s_i)}_1})$, the corresponding field on $({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ then defined as $F^{n_i}_{(n_i-s_i)\star}(V^1_i)$.²

¹ $T(\cdot)$ denotes the tangent bundle (appropriated from $\mathbb{C}^{\lfloor n_i/2\rfloor} \times S^1$ or $\mathbb{C}^{\lfloor n_i/2\rfloor-1} \times M^2$, as the case may be).
² We use the notation $F_\star(X_p)f = X_p(f \circ F)$, where $F$ is a $C^1$ map of manifolds, $X_p$ is a tangent vector at $p$, and $f$ is an arbitrary function that belongs to $C^1(F(p))$.
Since $V^1_i$ for $p_i \in ({^i\mathfrak{L}^{(n_i-s_i)}_0} \setminus {^i\mathfrak{L}^{(n_i-s_i)}_1})$ corresponds to all $(n_i - s_i)$ roots rotating at constant speed, it is defined as $\frac{da_{n_i-s_i-k}}{dt} = \frac{2\pi k}{U}\hat{\theta}$ for $k = 1, \ldots, \lfloor\frac{n_i-s_i}{2}\rfloor$, and when $(n_i - s_i)$ is even, $\frac{d|a_{(n_i-s_i)/2}|}{dt} = 0$. $\hat{\theta}$ is the basis vector $\partial/\partial\theta$ on $\mathbb{C}$, $M^2$, and $S^1$, with elements represented respectively as $re^{i\theta}$, $\langle\pm r, \theta\rangle$, and $\theta$. Finally, $V^2_i$ for $p_i \in ({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ is equivalent to $V^1_i$ on $^i\mathfrak{L}^{n_i}_{s_i-1}$, ignoring the fact that $p_i$ lies additionally on ${^i\mathfrak{L}^{n_i}_{s_i}} \subset {^i\mathfrak{L}^{n_i}_{s_i-1}}$.
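The uniform aging of live spikes translates, in coefficient space, into each $a_{n-k}$ rotating at angular speed $2\pi k/U$. A small NumPy sketch (arbitrary spike ages, all assumed live) can confirm this:

```python
import numpy as np

U = 1.0

def coeffs(x):
    # <a_{n-1}, ..., a_0> for roots exp(2*pi*1j*x/U); index k-1 holds a_{n-k}
    return np.poly(np.exp(2j * np.pi * np.asarray(x) / U))[1:]

x = np.array([0.15, 0.40, 0.80, 0.90])       # ages of four live spikes
dt = 0.01
a_now, a_later = coeffs(x), coeffs(x + dt)   # all live spikes age uniformly
for k in range(1, len(x) + 1):
    # under V^1, a_{n-k} rotates at constant angular speed 2*pi*k / U
    assert np.isclose(a_later[k - 1], a_now[k - 1] * np.exp(2j * np.pi * k * dt / U))
```

The moduli of the coefficients are unchanged by the flow; only their phases advance, which is why the flow is a rigid rotation in the coefficient coordinates.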
3 The Augmented System and Its Basic Features
Our first objective is to identify the local properties of the dynamics of the system. We pursue this goal through a measure analysis in section 4 and a cross-section analysis in section 5. These analyses require the phase-space to be endowed with a Riemannian metric. We begin with its specification.

3.1 Riemannian Metric. We choose the metric such that all flows corresponding to $V^1$ on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ (for all values of the $s_i$'s) are not only measure preserving but also shape preserving.

Since for every $s_i > 1$, $^i\mathfrak{L}^{n_i}_{s_i}$ corresponds to states of neuron $i$ that feature multiple components set at $e^{ix_i^j} = 1$, it lies on the boundary of $^i\mathfrak{L}^{n_i}$. It follows from the description of $^i\mathfrak{L}^{n_i}$ in the previous section that the boundary set in the neighborhood of $p_i \in ({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ is locally diffeomorphic to an open subset of $^i\mathfrak{L}^{n_i}_{s_i-1}$. Finally, the nature of $V^1_i$ and $V^2_i$ reveals that for all $p_i \in ({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$, $V_i \in T({^i\mathfrak{L}^{n_i}_{s_i-1}})$. It is therefore sufficient as well as appropriate that the Riemannian metric be defined over $T(\prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}_{s_i-1}})$, that is, as $W: T(\prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}_{s_i-1}}) \times T(\prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}_{s_i-1}}) \to \mathbb{R}$.

Let $p_i \in ({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ denote the state of neuron $i$. We consider $({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ as a subspace of $^i\mathfrak{L}^{n_i}_{s_i-1}$. The imbedding $F^{n_i}_{n_i-(s_i-1)}$ maps $^i\mathfrak{L}^{n_i-(s_i-1)}_0$ onto $^i\mathfrak{L}^{n_i}_{s_i-1}$ such that $({^i\mathfrak{L}^{n_i-(s_i-1)}_1} \setminus {^i\mathfrak{L}^{n_i-(s_i-1)}_2}) \subset {^i\mathfrak{L}^{n_i-(s_i-1)}_0}$ is mapped onto $({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}}) \subset {^i\mathfrak{L}^{n_i}_{s_i-1}}$. The mapping $F^{n_i-(s_i-1)}: T^{n_i-(s_i-1)} \to {^i\mathfrak{L}^{n_i-(s_i-1)}_0}$ is a local diffeomorphism at all points satisfying $e^{ix_i^1} \ne e^{ix_i^2} \ne \cdots \ne e^{ix_i^{n_i-(s_i-1)}}$. Hence, $(F^{n_i}_{n_i-(s_i-1)} \circ F^{n_i-(s_i-1)})$ is a local diffeomorphism at all such points. Finally, whereas the section of $^i\mathfrak{L}^{n_i}$ that is actually explored by the state dynamics of neuron $i$, that is, the feasible space, does contain states composed of identical components, such components are necessarily set at $e^{ix_i^j} = 1$. Consequently, if $W \subset T^{n_i-(s_i-1)}$ denotes the pre-image of the feasible section of $({^i\mathfrak{L}^{n_i}_{s_i-1}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$, then the constraint $e^{ix_i^1} \ne e^{ix_i^2} \ne \cdots \ne e^{ix_i^{n_i-(s_i-1)}}$ is satisfied throughout $W$.
We can now define a $C^1$-compatible basis for $T(\prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}})$ in the feasible section of $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ as the set of vectors $({^iF^{n_i}_{n_i-(s_i-1)}} \circ {^iF^{n_i-(s_i-1)}})_\star(\partial/\partial x_i^j)$ for $i = 1, \ldots, S$ and $j = 1, \ldots, n_i-(s_i-1)$. We shall, for the sake of brevity, henceforth refer to $({^iF^{n_i}_{n_i-(s_i-1)}} \circ {^iF^{n_i-(s_i-1)}})_\star(\partial/\partial x_i^j)$ as $E_i^j$. We set the Riemannian metric as $W(E_a^b, E_c^d) = 1$ if $a = c$ and $b = d$, and $W(E_a^b, E_c^d) = 0$ otherwise.

The component labels in $W$ can be chosen in a manner such that $E_i^1 \in T({^i\mathfrak{L}^{n_i}_{s_i-1}})$ lies in $T^\perp({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$.³ The feasible section of $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$, for any values of the $s_i$'s, can now be considered a Riemannian manifold in its own right. $T(\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}}))$ in the feasible section is spanned by $\langle E_i^j \mid i = 1, \ldots, S;\ j = 2, \ldots, n_i-(s_i-1)\rangle$, which forms an orthonormal basis. The feasible section of $({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ is therefore also a regular submanifold of $^i\mathfrak{L}^{n_i}_{s_i-1}$.

We now consider $\prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}}$ as the union of the sets $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ for all $i = 1, \ldots, S$, $s_i = 0, \ldots, n_i$, and assign it the topology generated by the family of all open sets in each $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ induced by the Riemannian metric. It is clear that in the feasible section of $\prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}}$, this topology is identical to that presented in section 2.

$V^1$ on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ is defined as $\sum_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}(2\pi/U)E_i^j$ in the new frame. Since the field of coordinate frames $E_i^j$ for $i = 1, \ldots, S$, $j = 2, \ldots, n_i-(s_i-1)$ satisfies $\nabla_{E_a^b}E_c^d = 0$ ($\nabla$ being the Riemannian connection) for all $a, c \in \{1, \ldots, S\}$ and $b, d \in \{2, \ldots, n_i-(s_i-1)\}$, and the coefficients $(2\pi/U)$ are constants, $V^1$ is a constant vector field on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$. $V^1$ is therefore not only measure preserving but also shape preserving on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$.

This leads to a substantial simplification in the analysis of the dynamics of the system. Any trajectory $\Psi_x(t)$ in the phase-space has associated with it a sequence of times $\langle t_1, t_2, \ldots, t_k, t_{k+1}, \ldots\rangle$ such that for all $j$ the segment $\{\Psi_x(t) \mid t_j < t < t_{j+1}\}$ lies strictly on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ for fixed values of the $s_i$'s (segments that correspond to periods during which neither any live spike expires nor does any neuron fire). Since each such segment is both volume and shape preserving, the analysis of the local properties of $\Psi_x(t)$ reduces to the analysis of a finite or countably infinite set of discrete events at times $\langle t_1, t_2, \ldots, t_k, t_{k+1}, \ldots\rangle$, each event denoting the birth and/or death of one or more spikes. Since any event involving the simultaneous birth and death of multiple spikes can be regarded as a series of mutually independent births and deaths of individual spikes,⁴ the

³ The orthogonal complement of $T({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$.
⁴ At the time of its birth or death, a spike has no impact on any membrane potential function.
analysis can be further restricted to the birth of a spike and the death of a spike.

3.2 Perturbation Analysis.

3.2.1 Birth. Let $\{\Psi_x(t) \mid t_j < t < t_{j+1}\}$ lie strictly on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ for arbitrary but fixed values of the $s_i$'s. Let the event associated with $\Psi_x(t_{j+1})$ be, without loss of generality, the birth of a spike at neuron 1; that is, $\Psi_x(t_{j+1}) \in PI_1$ and no other $PI_i$. Then $\{\Psi_x(t) \mid t_{j+1} < t < t_{j+2}\}$ lies strictly on $({^1\mathfrak{L}^{n_1}_{s_1-1}} \setminus {^1\mathfrak{L}^{n_1}_{s_1}}) \times \prod_{i=2}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$. Consider $p_1 = \Psi_x(t_{j+1}-t^*)$ on the segment $\{\Psi_x(t) \mid t_j < t < t_{j+1}\}$ and $p_2 = \Psi_x(t_{j+1}+t^*)$ on the segment $\{\Psi_x(t) \mid t_{j+1} < t < t_{j+2}\}$ such that $\Psi_x(t)$ for $t_{j+1}-t^* \le t \le t_{j+1}+t^*$ lies within a single coordinate neighborhood⁵ of $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i-1}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$. Let $p_1$ be represented as $\langle a_1^1, \ldots, a_1^{n_1-(s_1-1)}, a_2^1, \ldots, a_2^{n_2-(s_2-1)}, \ldots, a_S^1, \ldots, a_S^{n_S-(s_S-1)}\rangle$, where $a_i^1 = 0$ for $i = 1, \ldots, S$, and $p_2$ as $\langle b_1^1, \ldots, b_1^{n_1-(s_1-1)}, b_2^1, \ldots, b_2^{n_2-(s_2-1)}, \ldots, b_S^1, \ldots, b_S^{n_S-(s_S-1)}\rangle$, in local coordinates. Then $b_1^1 = (2\pi/U)t^*$, $b_i^1 = 0$ for $i = 2, \ldots, S$, and $b_i^j = a_i^j + (2\pi/U)2t^*$ for the remaining $i = 1, \ldots, S$ and $j = 2, \ldots, n_i-(s_i-1)$.

Let $\tilde{p}_1$ on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ be sufficiently close to $p_1$ such that $\Psi_{\tilde{x}}(t)$, the trajectory through $\tilde{p}_1$, has a corresponding segment on $({^1\mathfrak{L}^{n_1}_{s_1-1}} \setminus {^1\mathfrak{L}^{n_1}_{s_1}}) \times \prod_{i=2}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$. Let $\tilde{p}_1$ be represented in local coordinates as $\langle\tilde{a}_1^1, \ldots, \tilde{a}_1^{n_1-(s_1-1)}, \tilde{a}_2^1, \ldots, \tilde{a}_2^{n_2-(s_2-1)}, \ldots, \tilde{a}_S^1, \ldots, \tilde{a}_S^{n_S-(s_S-1)}\rangle$, where $\tilde{a}_i^1 = a_i^1 = 0$ for $i = 1, \ldots, S$, and $\tilde{a}_i^j = a_i^j + \Delta x_i^j$ for $i = 1, \ldots, S$, $j = 2, \ldots, n_i-(s_i-1)$. Let $\Psi_{\tilde{x}}(t)$ be parameterized such that $\tilde{p}_1 = \Psi_{\tilde{x}}(t_{j+1}-t^*)$. Let $\tilde{p}_2 = \Psi_{\tilde{x}}(t_{j+1}+t^*)$ be represented in local coordinates as $\langle\tilde{b}_1^1, \ldots, \tilde{b}_1^{n_1-(s_1-1)}, \tilde{b}_2^1, \ldots, \tilde{b}_2^{n_2-(s_2-1)}, \ldots, \tilde{b}_S^1, \ldots, \tilde{b}_S^{n_S-(s_S-1)}\rangle$, where $\tilde{b}_1^1 = b_1^1 + \Delta y_1^1$, $\tilde{b}_i^1 = b_i^1 = 0$ for $i = 2, \ldots, S$, and $\tilde{b}_i^j = b_i^j + \Delta y_i^j$ for $i = 1, \ldots, S$, $j = 2, \ldots, n_i-(s_i-1)$. $\Delta y_i^j = \Delta x_i^j$ for $i = 1, \ldots, S$ and $j = 2, \ldots, n_i-(s_i-1)$, since $\tilde{b}_i^j = \tilde{a}_i^j + (2\pi/U)2t^*$ for all such values of $i$ and $j$. Let $\Delta t$ be such that $\Psi_{\tilde{x}}(t_{j+1}+\Delta t)$ lies on $PI_1$. Then $\Delta y_1^1 = -(2\pi/U)\Delta t$. Since both $\Psi_x(t_{j+1})$ and $\Psi_{\tilde{x}}(t_{j+1}+\Delta t)$ lie on $PI_1$,

$$\mathcal{P}_1\Big(a_1^1,\; a_1^2 + \tfrac{2\pi}{U}t^*,\; \ldots,\; a_1^{n_1-(s_1-1)} + \tfrac{2\pi}{U}t^*,\; \ldots,\; a_S^1,\; a_S^2 + \tfrac{2\pi}{U}t^*,\; \ldots,\; a_S^{n_S-(s_S-1)} + \tfrac{2\pi}{U}t^*\Big) = T, \tag{3.1}$$

and

⁵ With the distinguished set of coordinate frames described in section 3.1.
$$\mathcal{P}_1\Big(a_1^1,\; a_1^2 + \Delta x_1^2 + \tfrac{2\pi}{U}(t^*+\Delta t),\; \ldots,\; a_1^{n_1-(s_1-1)} + \Delta x_1^{n_1-(s_1-1)} + \tfrac{2\pi}{U}(t^*+\Delta t),\; \ldots,\; a_S^1,\; a_S^2 + \Delta x_S^2 + \tfrac{2\pi}{U}(t^*+\Delta t),\; \ldots,\; a_S^{n_S-(s_S-1)} + \Delta x_S^{n_S-(s_S-1)} + \tfrac{2\pi}{U}(t^*+\Delta t)\Big) = T. \tag{3.2}$$
$\mathcal{P}_1(\cdot)$, being $C^1$, can be expanded as a Taylor series around $\Psi_x(t_{j+1})$. Minor algebraic manipulation, neglecting all higher-order terms, then yields

$$\Delta y_1^1 = \sum_{i=1,\,j=2}^{S,\,n_i-(s_i-1)} \frac{\partial\mathcal{P}_1}{\partial x_i^j}\,\Delta x_i^j \bigg/ \sum_{i=1,\,j=2}^{S,\,n_i-(s_i-1)} \frac{\partial\mathcal{P}_1}{\partial x_i^j}. \tag{3.3}$$

All $\frac{\partial\mathcal{P}_1}{\partial x_i^j}$'s in equation 3.3 are evaluated at $\Psi_x(t_{j+1})$. We denote $\frac{\partial\mathcal{P}_k}{\partial x_i^j} \Big/ \sum_{i,j}\frac{\partial\mathcal{P}_k}{\partial x_i^j}$ by $^k a_i^j$. Then $\sum_{i,j} {^k a_i^j} = 1$ for all $k = 1, \ldots, S$, and $\Delta y_1^1 = \sum_{i,j} {^1 a_i^j}\,\Delta x_i^j$.
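Equation 3.3 says the perturbation inherited by the newborn spike is the $^1a_i^j$-weighted combination of the perturbations of the spikes effective in its birth, with weights summing to 1. A sketch (NumPy; the smooth potential below is a hypothetical stand-in for $\mathcal{P}_1$, not the paper's model) computes the weights by central differences:

```python
import numpy as np

def P(x):
    # hypothetical smooth potential: a sum of alpha-function-like kernels
    return float(np.sum(x * np.exp(1.0 - x)))

def alpha_weights(x, eps=1e-6):
    # a_i^j = (dP/dx_i^j) / sum_{i,j} dP/dx_i^j, evaluated at x (equation 3.3)
    grad = np.array([(P(x + eps * e) - P(x - eps * e)) / (2 * eps)
                     for e in np.eye(len(x))])
    return grad / grad.sum()

x = np.array([0.3, 0.7, 1.5])                # ages of the effective spikes
a = alpha_weights(x)
assert np.isclose(a.sum(), 1.0)              # the weights sum to 1
dx = np.array([0.01, -0.02, 0.005])          # perturbations of the effective spikes
dy = float(a @ dx)                           # perturbation of the newborn spike
```

Note that the weights need not be positive; a spike on the falling phase of its kernel contributes a negative weight, which is what makes the expansion argument of section 4 nontrivial.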
3.2.2 Death. We assume that the event associated with $\Psi_x(t_{j+1})$ is, without loss of generality, the death of a spike at neuron 1. Then $\{\Psi_x(t) \mid t_{j+1} < t < t_{j+2}\}$ lies strictly on $({^1\mathfrak{L}^{n_1}_{s_1+1}} \setminus {^1\mathfrak{L}^{n_1}_{s_1+2}}) \times \prod_{i=2}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$. We now consider points $p_1 = \Psi_x(t_{j+1}-t^*)$ and $p_2 = \Psi_x(t_{j+1}+t^*)$ such that $\Psi_x(t)$ for $t_{j+1}-t^* \le t \le t_{j+1}+t^*$ lies within a single coordinate neighborhood of $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+2}})$. Let $p_1$ be represented in local coordinates as $\langle a_1^1, \ldots, a_1^{n_1-s_1}, a_2^1, \ldots, a_2^{n_2-s_2}, \ldots, a_S^1, \ldots, a_S^{n_S-s_S}\rangle$, and $p_2$ as $\langle b_1^1, \ldots, b_1^{n_1-s_1}, b_2^1, \ldots, b_2^{n_2-s_2}, \ldots, b_S^1, \ldots, b_S^{n_S-s_S}\rangle$. Then $b_1^1 = 0$, and $b_i^j = a_i^j + (2\pi/U)2t^*$ for all $i = 1, \ldots, S$, $j = 1, \ldots, (n_i-s_i)$ except $i = j = 1$.

Let $\tilde{p}_1 = \Psi_{\tilde{x}}(t_{j+1}-t^*)$ and $\tilde{p}_2 = \Psi_{\tilde{x}}(t_{j+1}+t^*)$ be points on a $\Psi_{\tilde{x}}(t)$ sufficiently close to $\Psi_x(t)$ so as to have corresponding segments on the noted submanifolds. Let $\tilde{p}_1$ be represented in local coordinates as $\langle\tilde{a}_1^1, \ldots, \tilde{a}_1^{n_1-s_1}, \tilde{a}_2^1, \ldots, \tilde{a}_2^{n_2-s_2}, \ldots, \tilde{a}_S^1, \ldots, \tilde{a}_S^{n_S-s_S}\rangle$, where $\tilde{a}_i^j = a_i^j + \Delta x_i^j$ for all $i = 1, \ldots, S$, $j = 1, \ldots, (n_i-s_i)$, and $\tilde{p}_2$ as $\langle\tilde{b}_1^1, \ldots, \tilde{b}_1^{n_1-s_1}, \tilde{b}_2^1, \ldots, \tilde{b}_2^{n_2-s_2}, \ldots, \tilde{b}_S^1, \ldots, \tilde{b}_S^{n_S-s_S}\rangle$, where $\tilde{b}_i^j = b_i^j + \Delta y_i^j$ for all $i = 1, \ldots, S$, $j = 1, \ldots, (n_i-s_i)$. Then it follows from $\tilde{b}_1^1 = b_1^1 = 0$ and $\tilde{b}_i^j = \tilde{a}_i^j + (2\pi/U)2t^*$ for all other $i, j$ that $\Delta y_1^1 = 0$ and $\Delta y_i^j = \Delta x_i^j$ for all $i = 1, \ldots, S$, $j = 1, \ldots, (n_i-s_i)$ except $i = j = 1$.
4 Measure Analysis

4.1 Expansion. We demonstrate that a trajectory is expansive at the birth of a spike. We consider the adjacent segments of $\Psi_x(t)$ described in section 3.2.1. Let $C$ denote an infinitesimal $\sum_{i=1}^{S}(n_i-s_i)$-dimensional hypercube
spanned by vectors $\epsilon E_i^j$ ($\epsilon \to 0$) for $i = 1, \ldots, S$, $j = 2, \ldots, n_i-(s_i-1)$ at any location on the segment of $\Psi_x(t)$ on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$. When $C$ passes into $({^1\mathfrak{L}^{n_1}_{s_1-1}} \setminus {^1\mathfrak{L}^{n_1}_{s_1}}) \times \prod_{i=2}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$, it is transformed into the $\sum_{i=1}^{S}(n_i-s_i)$-dimensional parallelepiped $C'$ spanned by the vectors $\epsilon(E_i^j + {^1a_i^j}E_1^1)$ for $i = 1, \ldots, S$, $j = 2, \ldots, n_i-(s_i-1)$, where the $^1a_i^j$'s are evaluated at the point $\Psi_x(t) \cap PI_1$.

Vectors $E_i^j$ for $i = 1, \ldots, S$, $j = 1, \ldots, n_i-(s_i-1)$ are by assumption orthonormal. Let $A$ and $A'$ denote the matrices associated with the vectors spanning $C$ and $C'$, represented as row coordinate vectors with respect to the above basis. $A$ and $A'$ are then the $\big(\sum_{i=1}^{S}(n_i-s_i)\big) \times \big(\sum_{i=1}^{S}(n_i-(s_i-1))\big)$ matrices

$$A = \begin{pmatrix} 0 & \epsilon & 0 & \cdots & 0\\ 0 & 0 & \epsilon & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & \epsilon \end{pmatrix} \quad\text{and}\quad A' = \begin{pmatrix} {^1a_1^2}\,\epsilon & \epsilon & 0 & \cdots & 0\\ {^1a_1^3}\,\epsilon & 0 & \epsilon & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ {^1a_S^{n_S-(s_S-1)}}\,\epsilon & 0 & 0 & \cdots & \epsilon \end{pmatrix}. \tag{4.1}$$

In $A$, columns corresponding to $E_i^1$ for $i = 1, \ldots, S$ are 0 vectors. All other columns contain an $\epsilon$ at an appropriate location. $A'$ is identical to $A$ except for the first column, which is replaced by the vector $\langle {^1a_1^2}\epsilon, \ldots, {^1a_1^{n_1-(s_1-1)}}\epsilon, \ldots, {^1a_S^2}\epsilon, \ldots, {^1a_S^{n_S-(s_S-1)}}\epsilon\rangle^T$.

The $\sum_{i=1}^{S}(n_i-s_i)$-dimensional measures of $C$ and $C'$ can then be computed as the square roots of the Gram determinants⁶ of the respective matrices, $A$ and $A'$. They are

$$/A/ = \epsilon^{\sum_{i=1}^{S}(n_i-s_i)} \quad\text{and}\quad /A'/ = \epsilon^{\sum_{i=1}^{S}(n_i-s_i)} \times \sqrt{1 + \sum_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}({^1a_i^j})^2}.$$

It follows from $\sum_{i,j} {^1a_i^j} = 1$ that the volume of $C'$ is strictly greater than that of $C$.
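The two moduli can be verified numerically. In the sketch below (NumPy; $\epsilon$, the dimension, and the $^1a_i^j$ column are arbitrary illustrative values, with the weights summing to 1), $A$ and $A'$ are built as in equation 4.1 and their Gram-determinant volumes compared:

```python
import numpy as np

def modulus(B):
    # square root of the Gram determinant det(B @ B.T): the volume of the
    # parallelepiped spanned by the rows of B
    return float(np.sqrt(np.linalg.det(B @ B.T)))

eps, m = 0.5, 4
a_col = np.array([0.4, 0.3, 0.2, 0.1])                   # hypothetical 1a_i^j, summing to 1
A  = np.hstack([np.zeros((m, 1)), eps * np.eye(m)])      # the hypercube C
Ap = np.hstack([eps * a_col[:, None], eps * np.eye(m)])  # the parallelepiped C'
assert np.isclose(modulus(A), eps ** m)
assert np.isclose(modulus(Ap), eps ** m * np.sqrt(1.0 + np.sum(a_col ** 2)))
assert modulus(Ap) > modulus(A)                          # expansion at the birth of a spike
```

The inequality holds for any nonzero weight column, since a column whose entries sum to 1 cannot vanish identically.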
⁶ The Gram determinant of a matrix $B$ is given by $\det(B B^T)$. The square root of the Gram determinant is also known as the modulus of $B$ and is represented as $/B/$.

4.2 Contraction. We demonstrate that a trajectory is contractile at the death of a spike. We consider the adjacent segments of $\Psi_x(t)$ described in section 3.2.2. Let $C$ denote an infinitesimal $\sum_{i=1}^{S}(n_i-s_i)$-dimensional
hypercube spanned by vectors $\epsilon E_i^j$ ($\epsilon \to 0$) for $i = 1, \ldots, S$, $j = 1, \ldots, (n_i-s_i)$ at any location on the segment of $\Psi_x(t)$ on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$. When $C$ passes into $({^1\mathfrak{L}^{n_1}_{s_1+1}} \setminus {^1\mathfrak{L}^{n_1}_{s_1+2}}) \times \prod_{i=2}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$, it is transformed into the $(\sum_{i=1}^{S}(n_i-s_i))-1$-dimensional hypercube $C'$ spanned by the vectors $\epsilon E_i^j$ for all $i = 1, \ldots, S$ and $j = 1, \ldots, (n_i-s_i)$ except $i = j = 1$. In other words, the $\sum_{i=1}^{S}(n_i-s_i)$-dimensional hypercube $C$ collapses along $E_1^1$ to produce a $(\sum_{i=1}^{S}(n_i-s_i))-1$-dimensional hypercube $C'$. Any lower-dimensional parallelepiped inside $C$ spanned by the vectors $\epsilon(E_i^j + \beta_i^j E_1^1)$ for $i = 1, \ldots, S$, $j = 1, \ldots, (n_i-s_i)$ except $i = j = 1$, such that $\beta_i^j \ne 0$ for some $i, j$, therefore experiences a contraction in volume as it passes from $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ into $({^1\mathfrak{L}^{n_1}_{s_1+1}} \setminus {^1\mathfrak{L}^{n_1}_{s_1+2}}) \times \prod_{i=2}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$.

4.3 Folding. We demonstrate that folding can occur across a series of births and deaths of spikes. Let $\tilde{C}$ denote a $\sum_{i=1}^{S}(n_i-s_i)$-dimensional hypercuboid of maximal measure in the feasible section of $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ that is transformed after time $\frac{Ut}{2\pi}$ into a $\sum_{i=1}^{S}(n_i-s_i)$-dimensional hypersurface $\tilde{C}'$ in $({^1\mathfrak{L}^{n_1}_{s_1-1}} \setminus {^1\mathfrak{L}^{n_1}_{s_1}}) \times \prod_{i=2}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$, past the birth of a spike at neuron 1. Let $\tilde{C}$ be represented as $\prod_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}[a_i^j, a_i^j+\Delta_i]$, and $\tilde{C}'$ as $x_1^1 = h(x_1^2, \ldots, x_1^{n_1-(s_1-1)}, x_2^2, \ldots, x_2^{n_2-(s_2-1)}, \ldots, x_S^2, \ldots, x_S^{n_S-(s_S-1)})$ in local coordinates, where $x_i^j \in [a_i^j+t, a_i^j+\Delta_i+t]$ for $i = 1, \ldots, S$, $j = 2, \ldots, n_i-(s_i-1)$. Then $PI_1 \cap \prod_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}[a_i^j+(t-T), a_i^j+\Delta_i+(t-T)]$ (the hypersurface $\tilde{C}$ after time $\frac{U}{2\pi}(t-T)$ intersected with $PI_1$),⁷ when translated by a distance $T$ along all dimensions $\partial/\partial x_i^j$ for $i = 1, \ldots, S$, $j = 2, \ldots, n_i-(s_i-1)$, and $i = j = 1$, yields identically the hypersurface $h(\cdot) = T$ in $\tilde{C}'$. The shape of $\tilde{C}'$, and the result of a dimensional collapse along any $E_i^j$ for $i = 1, \ldots, S$, $j = 2, \ldots, n_i-(s_i-1)$ on $\tilde{C}'$, is therefore completely specified by the position of $\tilde{C}$ in $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ and the nature of $PI_1 \cap \prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$.

We now assume that $\mathcal{P}_1(\cdot)$ is unimodal with respect to all variables that, at the given moment, correspond to effective spikes. Any coordinate curve of an effective variable can then have only point intersections (at most two) with $PI_1 \cap \prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$. Finally, if a spike, over its finite lifetime, is effective in the generation of one or more spikes, then there exists, trivially, a last spike that it is effective on. We assume, without loss of generality, that $x_2^2$ is one such spike, and $x_1^1$ is the final spike it is effective on.

⁷ We use $PI_1$ and $PS_1$ to represent both the hypersurfaces in the phase-space and the corresponding hypersurfaces in local coordinates. The context will determine which hypersurface is being referred to.
In order for $\tilde{C}'$ to be curved in such a manner that at the subsequent death of $x_2^2$ (irrespective of any number of intermediate births and/or deaths of spikes) it folds upon itself, there must exist a coordinate curve for $x_2^2$ that intersects $PI_1 \cap \tilde{C}$ twice for at least one position of $\tilde{C}$.⁸ Since $\mathcal{P}_1(\cdot)$ is $C^1$ and $\mathcal{P}_1(\cdot)|_{x_2^2=0} = \mathcal{P}_1(\cdot)|_{x_2^2=2\pi}$ when all other variables are held constant, such coordinate curves exist for all points on $PS_1 \cap \prod_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}(0, 2\pi)$. Furthermore, if $|\partial\mathcal{P}_1/\partial x_2^2|$ at the intersection of any such coordinate curve with $PS_1$, at the falling phase of $\mathcal{P}_1(\cdot)$, is not so large as to make $\sum_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}\partial\mathcal{P}_1/\partial x_i^j < 0$,⁹ then $d\mathcal{P}_1(\cdot)/dt \ge 0$ is satisfied by both intersections. Consequently, the coordinate curve intersects twice with the hypersurface $PI_1 \cap \prod_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}(0, 2\pi)$.

The only question that remains unresolved is whether both intersections of such a coordinate curve lie on $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$. While this question can be settled only when the specific instantiation of $\mathcal{P}_1(\cdot)$ is known, there are two aspects of the system that have a significant impact on the answer. First, the closer $T$ is to $\max\{\mathcal{P}_1(x) \mid x \in \prod_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}(0, 2\pi)\}$ and the tighter the peak of $\mathcal{P}_1(\cdot)$ is, the greater the chances are that a coordinate curve exists in $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ that intersects $PI_1$ twice. Second, the largest $\Delta_i$ for which a $\sum_{i=1}^{S}(n_i-s_i)$-dimensional hypercuboid can fit into the feasible section of $\prod_{i=1}^{S}({^i\mathfrak{L}^{n_i}_{s_i}} \setminus {^i\mathfrak{L}^{n_i}_{s_i+1}})$ is $\big(\frac{2\pi}{(n_i-s_i)} - \frac{(n_i-s_i-1)\,r_i}{(n_i-s_i)\,U}\big)$. Clearly, folding of $\tilde{C}'$ is more likely when $(n_i-s_i)$ is small, that is, when the system is sparsely active.

5 Local Cross-Section Analysis
Whereas we have just demonstrated that both expansion and contraction occur locally around any trajectory, the cumulative effect of such events remains obscure. We address this issue in this section with regard to the dynamics of the system set up to model a typical neocortical column. The solution is presented in stages. First, the problem is phrased in terms of a quantifiable property of a particular matrix. A deterministic process that constructs the matrix incrementally is described. Next, each step in the process is reformulated as a stochastic event. The event is analyzed, and conclusions are drawn regarding its effect on the matrix. Physiological parameters of a typical neocortical column are then used to identify the qualitative properties of trajectories in various sections of the phase-space.

⁸ $x_2^2$ is by assumption not effective on any intermediate births of spikes. The two intersections of the coordinate curve with $\tilde{C}'$ therefore remain on a coordinate curve of $x_2^2$ after such births. Any intermediate deaths of spikes have no impact on whether the intersections in question remain on a coordinate curve.
⁹ This is generally the case for excitatory spikes in the cortex. The impact of such spikes on the potential function is characterized by a steep rising phase and a relatively gentle falling phase.
5.1 The Deterministic Process. We consider the class of trajectories that are not drawn into the trivial fixed point $\prod_{i=1}^{S} {^i\mathfrak{L}^{n_i}_{n_i}}$ (the state of quiescence). We demonstrated that if $\langle\Delta x_1^2, \ldots, \Delta x_1^{n_1-(s_1-1)}, \ldots, \Delta x_S^2, \ldots, \Delta x_S^{n_S-(s_S-1)}\rangle^T$ and $\langle\Delta x_1^1, \ldots, \Delta x_1^{n_1-s_1}, \ldots, \Delta x_S^1, \ldots, \Delta x_S^{n_S-s_S}\rangle^T$ are, respectively, perturbations on a trajectory before the birth and the death of a spike at neuron 1, and the perturbations after the corresponding events are $\langle\Delta y_1^1, \ldots, \Delta y_1^{n_1-(s_1-1)}, \ldots, \Delta y_S^2, \ldots, \Delta y_S^{n_S-(s_S-1)}\rangle^T$ and $\langle\Delta y_1^2, \ldots, \Delta y_1^{n_1-s_1}, \ldots, \Delta y_S^1, \ldots, \Delta y_S^{n_S-s_S}\rangle^T$, then

$$\begin{pmatrix}\Delta y_1^1\\ \Delta y_1^2\\ \vdots\\ \Delta y_S^{n_S-(s_S-1)}\end{pmatrix} = \begin{pmatrix} {^1a_1^2} & \cdots & {^1a_S^{n_S-(s_S-1)}}\\ 1 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & 1\end{pmatrix} \begin{pmatrix}\Delta x_1^2\\ \vdots\\ \Delta x_S^{n_S-(s_S-1)}\end{pmatrix} \tag{5.1}$$

and

$$\begin{pmatrix}\Delta y_1^2\\ \vdots\\ \Delta y_S^{n_S-s_S}\end{pmatrix} = \begin{pmatrix}0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1\end{pmatrix} \begin{pmatrix}\Delta x_1^1\\ \Delta x_1^2\\ \vdots\\ \Delta x_S^{n_S-s_S}\end{pmatrix}. \tag{5.2}$$
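A minimal NumPy sketch of the two event maps (the weight vector and the event sequence are arbitrary illustrative choices; the analysis only requires that the weights $^l a_i^j$ sum to 1, and the location at which the new row is inserted is immaterial here):

```python
import numpy as np

def birth_matrix(a_row):
    """Equation 5.1: the newborn spike's perturbation is the a-weighted
    combination of the existing ones; all others pass through unchanged."""
    m = len(a_row)
    return np.vstack([a_row, np.eye(m)])                     # (m+1) x m

def death_matrix(m):
    """Equation 5.2: the dead spike's component is dropped."""
    return np.hstack([np.zeros((m - 1, 1)), np.eye(m - 1)])  # (m-1) x m

rng = np.random.default_rng(0)
A = np.eye(3)                                    # initial perturbation map
A = birth_matrix(rng.dirichlet(np.ones(3))) @ A  # a birth: now 4 x 3
A = death_matrix(4) @ A                          # a death: back to 3 x 3
# rows of every product A_0^k sum to 1, a fact used in the proof of lemma 1
assert np.allclose(A.sum(axis=1), 1.0)
```

Composing such matrices over a sequence of events yields the product $A_0^k$ analyzed below.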
If we denote by $A_k$ the matrix that corresponds to the event $\Psi_x(t_k)$, by the column vector $\Delta x_0$ a perturbation just prior to the event $\Psi_x(t_1)$, and by the column vector $\Delta x_k$ the corresponding perturbation just past the event $\Psi_x(t_k)$, then $\Delta x_k = A_k A_{k-1} \cdots A_1\,\Delta x_0$. Alternatively, if $A_0^k$ denotes the product $A_k A_{k-1} \cdots A_1$, then $\Delta x_k = A_0^k\,\Delta x_0$, where $A_0^k$ is defined recursively as $A_0^0 = I$ (the identity matrix) and $A_0^{k+1} = A_{k+1}A_0^k$ for all $k \ge 0$.

Given the nature of these matrices, we deduce that (1) if $\Psi_x(t_{k+1})$ corresponds to the birth of a spike at neuron $l$, then $A_0^{k+1}$ can be generated from $A_0^k$ by identifying the rows $r_i^j$ in $A_0^k$ that correspond to spikes that are effective in the birth of the given spike, and introducing the new row $\sum_{i,j} {^l a_i^j}\,r_i^j$ at an appropriate location in $A_0^k$, and (2) if $\Psi_x(t_{k+1})$ corresponds to the death of a spike at neuron $l$, then $A_0^{k+1}$ can be generated from $A_0^k$ by identifying the row in $A_0^k$ that corresponds to the given spike, and deleting it from $A_0^k$.

Finally, since we shall be analyzing periodic orbits and trajectories, both the initial and the final perturbations must lie on local cross-sections of $\Psi_x(t)$. The velocity field being $\sum_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}(2\pi/U)E_i^j$, the initial perturbation must satisfy $\sum_{i=1,\,j=2}^{S,\,n_i-(s_i-1)}\Delta x_i^j = 0$, and the final perturbation must be adjusted along the orbit (each component must be adjusted by the same quantity) to satisfy the same equation. In terms of matrix operations, this translates
into $B A_0^k C$, where $B$ and $C$ (assuming $A_0^k$ is an $(m \times n)$ matrix) are the $((m-1) \times m)$ and $(n \times (n-1))$ matrices

$$B = \begin{pmatrix}1-\frac{1}{m} & -\frac{1}{m} & \cdots & -\frac{1}{m} & -\frac{1}{m}\\ -\frac{1}{m} & 1-\frac{1}{m} & \cdots & -\frac{1}{m} & -\frac{1}{m}\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ -\frac{1}{m} & -\frac{1}{m} & \cdots & 1-\frac{1}{m} & -\frac{1}{m}\end{pmatrix}, \quad C = \begin{pmatrix}1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1\\ -1 & -1 & \cdots & -1\end{pmatrix}. \tag{5.3}$$

If $\Psi_x(t)$ is aperiodic, $\Delta x'_k = B A_0^k C\,\Delta x'_0$ for an arbitrary value of $k$ specifies the relation between an initial and a final perturbation, both of which lie on transverse sections of the trajectory. The sensitivity of the trajectory to initial conditions is then determined by $\|B A_0^k C\|_2$.¹⁰ If $\lim_{k\to\infty}\|B A_0^k C\|$ is unbounded (0), then the trajectory is sensitive (insensitive) to initial conditions. If, instead, $\Psi_x(t)$ is periodic with $\Psi_x(t_1) = \Psi_x(t_{k+1})$ (a closed orbit with period $(t_{k+1}-t_1)$), then $A_0^k$ is a square matrix and $\Delta x'_k = B A_0^k C\,\Delta x'_0$ is a Poincaré map. The stability of the periodic orbit is then determined by $\rho(B A_0^k C)$.¹¹ If $\rho(B A_0^k C) > 1$ ($< 1$), that is, $\lim_{r\to\infty}\|(B A_0^k C)^r\|$ is unbounded (0), then the periodic orbit is unstable (stable). It should be noted that $\lim_{k\to\infty}\|B A_0^k C\|$ and $\lim_{r\to\infty}\|(B A_0^k C)^r\|$, by virtue of the nature of their limiting values, yield identical results irrespective of the choice of the matrix norm. We use the Frobenius norm ($\|A\|_F = \sqrt{\mathrm{Trace}(A^T A)}$) in the upcoming analysis.

5.1.1 The Revised Process. We present a revised process that constructs a matrix $\tilde{A}_0^k$ that substantially simplifies the computation of the spectral properties of $B A_0^k C$. We begin with $\tilde{A}_0^0 = I - \frac{1}{n}(\mathbf{1})$, where $(\mathbf{1})$ represents the matrix all of whose elements are 1, and $n = \sum_{i=1}^{S}(n_i-s_i)$ is the number of components in the initial perturbation. We then proceed with the rest of the process (generating $\tilde{A}_0^{k+1}$ from $\tilde{A}_0^k$) unaltered. The following lemma relates the properties of $\tilde{A}_0^k$ to those of $B A_0^k C$.

Lemma 1. If $\tilde{A}_0^k$ is an $(n \times n)$ square matrix, then if $\lim_{r\to\infty}\|(I - \frac{1}{n}(\mathbf{1}))(\tilde{A}_0^k)^r\| = \lim_{r\to\infty}\|\tilde{A}_0^0(\tilde{A}_0^k)^r\|$ is unbounded (0), then $\lim_{r\to\infty}\|(B A_0^k C)^r\|$ is unbounded (0). If instead $\tilde{A}_0^k$ is an $(m \times n)$ matrix, then if $\lim_{k\to\infty}\|(I - \frac{1}{m}(\mathbf{1}))\tilde{A}_0^k\|$ is unbounded (0), $\lim_{k\to\infty}\|B A_0^k C\|$ is unbounded (0).

¹⁰ $\|A\|_2$ is the spectral norm of the matrix $A$ (the natural matrix norm induced by the $l_2$ norm); that is, $\|A\|_2 = \sup_{x \ne 0}\frac{\|Ax\|_2}{\|x\|_2}$.
¹¹ $\rho(A)$ is the spectral radius of the matrix $A$; that is, $\rho(A) = \max_{1 \le i \le n}|\lambda_i|$, where the $\lambda_i$'s are the eigenvalues of the matrix ($Ax_i = \lambda_i x_i$).
Proof. $\tilde{A}_0^k$ is an $(n \times n)$ square matrix: The sum of the elements of any row in $A_0^0 = I$ is 1. Since $\sum_{i,j} {^l a_i^j} = 1$ for all $l$, it follows that for all $k \ge 0$ the sum of the elements of any row in $A_0^k$ remains 1. The sum of the elements of any row in $\tilde{A}_0^0$ being 0, the same argument proves that for all $k \ge 0$ the sum of the elements of any row in $\tilde{A}_0^k$ is 0. Moreover, induction on $k$ proves that $\tilde{A}_0^k = A_0^k - \frac{1}{n}(\mathbf{1})$ for all $k$. We also note that $CB = I - \frac{1}{n}(\mathbf{1})$, $\tilde{A}_0^k\,\frac{1}{n}(\mathbf{1}) = (\mathbf{0})$, and $\frac{1}{n}(\mathbf{1})\,\frac{1}{n}(\mathbf{1}) = \frac{1}{n}(\mathbf{1})$, and conclude via induction on $r$ that $(B A_0^k C)^r = B(\tilde{A}_0^k)^r C$. $B$ being $\tilde{A}_0^0$ without the last row, $B(\tilde{A}_0^k)^r$ is $\tilde{A}_0^0(\tilde{A}_0^k)^r$ without the last row. If $\lim_{r\to\infty}\|\tilde{A}_0^0(\tilde{A}_0^k)^r\| = 0$, then $\lim_{r\to\infty}\tilde{A}_0^0(\tilde{A}_0^k)^r = (\mathbf{0})$. Hence, $\lim_{r\to\infty}B(\tilde{A}_0^k)^r = (\mathbf{0})$. If, on the other hand, $\lim_{r\to\infty}\|\tilde{A}_0^0(\tilde{A}_0^k)^r\|$ is unbounded, then so is $\lim_{r\to\infty}\|B(\tilde{A}_0^k)^r\|$, since $\tilde{A}_0^0(\tilde{A}_0^k)^r$ amounts to deleting the row mean from each row of $(\tilde{A}_0^k)^r$. Finally, the product with $C$ does not have an impact on the unboundedness of the matrix because the sum of the elements of any row in $B(\tilde{A}_0^k)^r$ is 0.

$\tilde{A}_0^k$ is not constrained to be a square matrix: $B$ is $I - \frac{1}{m}(\mathbf{1})$ with the last row eliminated. Moreover, since $\tilde{A}_0^k = A_0^k - \frac{1}{n}(\mathbf{1})$ for all $k \ge 0$, we have $B A_0^k = B\tilde{A}_0^k$. The remainder of the proof follows along the lines of the previous case.
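The identities invoked in the proof are easy to confirm numerically (NumPy sketch; the random matrix simply has rows summing to 1, as holds for any $A_0^k$):

```python
import numpy as np

def B_mat(m):
    # (m-1) x m: subtract the row mean and drop the last row
    return (np.eye(m) - np.ones((m, m)) / m)[:-1]

def C_mat(n):
    # n x (n-1)
    return np.vstack([np.eye(n - 1), -np.ones(n - 1)])

n = 5
B, C = B_mat(n), C_mat(n)
assert np.allclose(C @ B, np.eye(n) - np.ones((n, n)) / n)   # C*B = I - (1/n)(1)

rng = np.random.default_rng(1)
A = rng.random((n, n))
A /= A.sum(axis=1, keepdims=True)        # rows sum to 1, as for A_0^k
At = A - np.ones((n, n)) / n             # the revised matrix ~A_0^k
r = 3
assert np.allclose(np.linalg.matrix_power(B @ A @ C, r),
                   B @ np.linalg.matrix_power(At, r) @ C)    # (B A C)^r = B (~A)^r C
```

This is the reduction that lets the spectral properties of the Poincaré map be read off from powers of the revised matrix alone.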
5.2 The Stochastic Process. We present a stochastic counterpart of the above process. The process is begun with an $(n \times n)$ matrix $A$, each of whose $n$ rows is drawn independently from a uniform distribution on $[-0.5, 0.5]^n$. Therefore, if $v_1, v_2, \ldots, v_n$ denote the $n$ rows of the matrix, then $E(v_i) = \langle 0, 0, \ldots, 0\rangle$, where $E(\cdot)$ denotes the expected value. Next, the row mean $\bar{v} = (1/n)\sum_{i=1}^n v_i$ is deducted from each row of $A$. This yields $\tilde{A}_0^0 * A$, which we set to $A_0^0$. It is clear that $E(v_i)$ in $A_0^0$ remains $\langle 0, 0, \ldots, 0\rangle$. Let $A_0^k$ for some $k$ be an $(m \times n)$ matrix, where $m$ is a large but finite integer. For the birth of a spike, $A_0^{k+1}$ is generated from $A_0^k$ as follows:

1. Randomly sample the space of row vectors in $A_0^k$ and choose $p$ rows, $v_1, v_2, \ldots, v_p$, where $p$ is a random number chosen from a predefined range $[P_{low}, P_{high}]$. $m$ being large, it can safely be assumed that the rows are chosen with replacement even if they are not.

2. Choose $p$ i.i.d. random variables $X_1, X_2, \ldots, X_p$ from a given distribution. Let $x_1, x_2, \ldots, x_p$ be the derived random variables $x_i = X_i / \sum_{i=1}^p X_i$.

3. Construct $v_{new} = \sum_{i=1}^p x_i v_i$, and insert the row at a random location into $A_0^k$ to generate $A_0^{k+1}$.

For the death of a spike, $A_0^{k+1}$ is generated from $A_0^k$ as follows:

1. Randomly choose a row from the space of row vectors in $A_0^k$.
2. Delete the row from $A_0^k$ to generate $A_0^{k+1}$.

In the case of a system set up to model a typical neocortical column, the deterministic process described earlier can reasonably be viewed as the stochastic process described above. First, the connectivity between neurons within a column, while having evolved to achieve a specific function, has been experimentally ascertained to fit a uniform distribution (Schüz, 1992). Therefore, when neurons in the system are spiking at comparable rates, one can reasonably assume that each spike, during its finite lifetime, is equally likely to be effective in the birth of a new spike somewhere in the system. Second, whereas $v_i$ specifies the sensitivity of the corresponding spike to the initial set of spikes, $x_i$ specifies the sensitivity of the spike that just originated to the spike in question. Clearly the two are causally independent. Third, whereas spikes that are effective in the birth of a new spike are related spatially in the system, this does not translate into any necessary relation among the elements of the corresponding rows. Finally, the location of afferent synapses on the collaterals of neurons in a column has been shown to fit a statistical distribution (Braitenberg & Schüz, 1991). Since $X_1, \ldots, X_p$ (corresponding to the various $\partial P / \partial x_i^j$'s) depend on the positions of the corresponding spikes and the nature of $P$, they can reasonably be approximated by i.i.d. random variables.

If $A_0^k = \tilde{A}_0^k * A$, and $A_0^{k+1}$ is generated as above, then $A_0^{k+1} = \tilde{A}_0^{k+1} * A$. On the one hand, if a row is deleted from $A_0^k$ to generate $A_0^{k+1}$, the act corresponds to the deletion of the corresponding row from $\tilde{A}_0^k$ to generate $\tilde{A}_0^{k+1}$. On the other hand, if $v_1, v_2, \ldots, v_p$ are the rows chosen from $A_0^k$ to construct $v_{new}$, and $u_1, u_2, \ldots, u_p$ are the corresponding rows in $\tilde{A}_0^k$, since $v_i = \langle u_i \cdot c_1, u_i \cdot c_2, \ldots, u_i \cdot c_n\rangle$ (where $c_1, \ldots, c_n$ are the columns of $A$), $\sum_{i=1}^p x_i v_i = \langle (\sum x_i u_i) \cdot c_1, (\sum x_i u_i) \cdot c_2, \ldots, (\sum x_i u_i) \cdot c_n\rangle$. Therefore, $A_0^{k+1} = \tilde{A}_0^{k+1} * A$.

The following lemma describes the statistical properties of the rows of the matrix past the act of addition and/or deletion of a row. $V(\cdot)$ denotes the variance of the rows, that is, $V(v_i) = E((v_i - \mu) \cdot (v_i - \mu))$, and $C(\cdot)$ denotes the covariance between rows, that is, $C(v_i, v_j) = E((v_i - \mu) \cdot (v_j - \mu))_{i \neq j}$. Let $E_k(v_i) = \mu$, $V_k(v_i) = \sigma^2$, and $C_k(v_i, v_j) = \xi^2$ after the $k$th step (the result being the $(m \times n)$ matrix $A_0^k$).

Lemma 2.
i. If the $(k+1)$th step involves the deletion of a row ($v_{del}$), then $E_{k+1}(v_i) = \mu$, and $V_{k+1}(v_i) - C_{k+1}(v_i, v_j) = (\sigma^2 - \xi^2)$.

ii. If instead, the $(k+1)$th step involves the addition of a row ($v_{new}$), then $E_{k+1}(v_i) = \mu$. Moreover, if $pE(x_i^2) = (1 + \Delta)$ for some $\Delta \in \mathbb{R}$ ($\Delta > -1$ necessarily), then $V_{k+1}(v_i) - C_{k+1}(v_i, v_j) = \left(1 + \frac{\Delta(m-1) - 2}{m(m+1)}\right)(\sigma^2 - \xi^2)$.
Proof.

i. Since $v_{del}$ is chosen randomly from the population of rows in the matrix, we have $E(v_{del}) = E_k(v_i) = \mu$, $V(v_{del}) = V_k(v_i) = \sigma^2$, and $C(v_{del}, v_i) = C_k(v_i, v_j) = \xi^2$. Consequently, $E_{k+1}(v_i) = \mu$ and $V_{k+1}(v_i) - C_{k+1}(v_i, v_j) = (\sigma^2 - \xi^2)$.

Before considering ii, we note that for the derived random variables $x_1, x_2, \ldots, x_p$: (a) $\forall i$, $pE(x_i) = 1$, and (b) $\forall i, j$, $i \neq j$, $pE(x_i^2) + p(p-1)E(x_i x_j) = 1$. (a) follows from $\sum_{i=1}^p E(X_i / \sum X_i) = 1$, and $\forall i, j$, $E(X_i / \sum X_i) = E(X_j / \sum X_j)$ because of the i.i.d. nature of the $X_i$'s. (b) follows from $\sum_{i=1}^p E(X_i^2 / (\sum X_i)^2) + \sum_{i=1, j=1, i \neq j}^{p,p} E((X_i X_j) / (\sum X_i)^2) = 1$, and that, the $X_i$'s being i.i.d., $\forall i, j$, $E(X_i^2 / (\sum X_i)^2) = E(X_j^2 / (\sum X_j)^2)$ and $\forall i, j$, $i \neq j$, $E((X_i X_j) / (\sum X_i)^2)$ yields the same result irrespective of the values of $i$ and $j$.

ii. $E(v_{new}) = E_k(\sum_{i=1}^p x_i v_i) = \sum_{i=1}^p E_k(x_i v_i)$. Since the $x_i$'s are chosen independent of the $v_i$'s, $E(v_{new}) = \sum_{i=1}^p E(x_i) E_k(v_i) = \sum_{i=1}^p \frac{1}{p} E_k(v_i) = \mu$. Moreover, since $E(v_i) = \langle 0, 0, \ldots, 0\rangle$ for the initial population of row vectors, we can now conclude that $\forall k$, $E_k(v_i) = \langle 0, 0, \ldots, 0\rangle$.

Since $\mu = \langle 0, 0, \ldots, 0\rangle$, $V(v_{new}) = E(v_{new} \cdot v_{new}) = E_k(\sum_{i=1}^p x_i v_i \cdot \sum_{i=1}^p x_i v_i) = E_k(\sum_{i=1}^p x_i^2 (v_i \cdot v_i)) + E_k(\sum_{i=1, j=1, i \neq j}^{p,p} x_i x_j (v_i \cdot v_j))$. Since the $x_i$'s are chosen independent of the $v_i$'s, $V(v_{new}) = E(x_i^2) E_k(\sum_{i=1}^p v_i \cdot v_i) + E(x_i x_j) E_k(\sum_{i=1, j=1, i \neq j}^{p,p} v_i \cdot v_j) = (1 + \Delta)\sigma^2 - \Delta\left(\frac{1}{m}\sigma^2 + \frac{m-1}{m}\xi^2\right) = \sigma^2 + \frac{\Delta(m-1)}{m}(\sigma^2 - \xi^2)$.

Likewise, $C(v_{new}, v_i) = E(v_{new} \cdot v_i) = p\left(\frac{1}{m}\left(\frac{1}{p}\sigma^2\right) + \frac{m-1}{m}\left(\frac{1}{p}\xi^2\right)\right) = \xi^2 + \frac{1}{m}(\sigma^2 - \xi^2)$. The two results give rise to the recurrence relations $V_{k+1} = V_k + \frac{\Delta(m-1)}{m(m+1)}(V_k - C_k)$ and $C_{k+1} = C_k + \frac{2}{m(m+1)}(V_k - C_k)$. Therefore, $V_{k+1} - C_{k+1} = \left(1 + \frac{\Delta(m-1) - 2}{m(m+1)}\right)(V_k - C_k)$.

Finally, in the base case, let $E(v_i \cdot v_i) = \sigma_0^2$ for $A$. $E(v_i \cdot v_j)_{i \neq j} = 0$ since $v_i$ is chosen independent of $v_j$, and $E(v_i) = \mu = \langle 0, 0, \ldots, 0\rangle$. In the case of $A_0^0 = \tilde{A}_0^0 * A$, $E(v_i \cdot v_i) = (1 - \frac{1}{n})^2 \sigma_0^2 + (\frac{n-1}{n^2})\sigma_0^2 = (1 - \frac{1}{n})\sigma_0^2$, and $E(v_i \cdot v_j)_{i \neq j} = -\frac{2}{n}(1 - \frac{1}{n})\sigma_0^2 + (\frac{n-2}{n^2})\sigma_0^2 = -\frac{1}{n}\sigma_0^2$. Therefore, $V_0 - C_0 = \sigma_0^2$.

Since we have assumed that the trajectory is not drawn into the trivial fixed point, if $A_0^k$ is an $(m_k \times n)$ matrix, then $m_k \in [M_{low}, M_{high}]$, where $M_{low} \gg 0$. Moreover, $E(\|A_0^k\|_F) = \sqrt{m_k V_k}$. For the sake of brevity, we shall henceforth refer to $V_k(v_i)$ and $C_k(v_i, v_j)$ for rows in $A_0^k$ as $V(A_0^k)$ and $C(A_0^k)$, respectively.

The following theorem identifies the local properties of trajectories based on the above lemma. In all cases, we assume that the trajectory is not drawn into the trivial fixed point $\bigcup_{i=1}^S {}^{n_i}\!L^i_{n_i}$. It might be claimed that without external input sustaining the dynamics of the system, an aperiodic trajectory that is
not drawn into the fixed point is not realizable. We therefore consider two cases: one with and one without external input. For the case with external input, the connectivity of the neurons in the system, whether they are input neurons or interneurons, is assumed to be uniform. The number of spikes generated within the system for any given trajectory $Y_x(t)$ since time $t_1$ is denoted by $I(t)$, and the number of external spikes introduced into the system since time $t_1$ is denoted by $E(t)$.

Theorem 1 (Sensitivity). Let $Y_x(t)$ be a trajectory that is not drawn into the trivial fixed point.

i. Given a system receiving no external input, if $Y_x(t)$ is aperiodic, then if $\Delta > \frac{2}{M_{low}}$ ($\Delta < \frac{2}{M_{high}}$), $Y_x(t)$ is, with probability 1, sensitive (insensitive) to initial conditions.

ii. Given a system sustained by external input, if $Y_x(t)$ is aperiodic, and constants $\nu, c$ satisfy $\forall t$, $|E(t) - \nu I(t)| < c$, then if $\Delta > \frac{2}{M_{low}}(1 + \nu + \frac{c}{M_{low}})^2 + 2(\nu + \frac{c}{M_{low}})(1 + \nu + \frac{c}{M_{low}})$, $Y_x(t)$ is, with probability 1, sensitive to initial conditions.

iii. Given a system receiving no external input, if $Y_x(t)$ is periodic such that $Y_x(t_1) = Y_x(t_{k+1})$, then if $\Delta > \frac{2}{M_{low}}$ ($\Delta < \frac{2}{M_{high}}$), $Y_x(t)$ is, with probability 1, unstable (stable).

Proof.
i. Of the $k$ steps in the process, if $k_b$ involve the birth of spikes and $k_d$ involve the death of spikes, then $k_b + k_d = k$, and $|k_b - k_d|$ is bounded. Therefore, $k \to \infty$ implies $k_b \to \infty$ and $k_d \to \infty$. If $\Delta > 0$, then $\left(1 + \frac{\Delta M_{high} - 2}{M_{low}^2}\right)^{k_b} \sigma_0^2 > V(A_0^k) - C(A_0^k) > \left(1 + \frac{\Delta M_{low} - 2}{M_{high}^2}\right)^{k_b} \sigma_0^2$.

Simple algebra shows that $V((I - \frac{1}{m}(\mathbf{1})) * A_0^k) = (1 - \frac{1}{m})(V(A_0^k) - C(A_0^k))$. It therefore follows that if $\Delta < \frac{2}{M_{high}}$, then $\lim_{k\to\infty} E(\|(I - \frac{1}{m}(\mathbf{1})) * A_0^k\|_F) = 0$, and since $\|A\|_F \ge 0$ for any $A$, $\Pr(\lim_{k\to\infty} \|(I - \frac{1}{m}(\mathbf{1})) * A_0^k\|_F > 0) = 0$. If, on the other hand, $\Delta > \frac{2}{M_{low}}$, then $\lim_{k\to\infty} E(\|(I - \frac{1}{m}(\mathbf{1})) * A_0^k\|_F)$ is unbounded. Moreover, $\Pr(\lim_{k\to\infty} \|(I - \frac{1}{m}(\mathbf{1})) * A_0^k\|_F$ is bounded$) = 0$.¹²

¹² For it to be bounded, an infinite subsequence of events, each of which occurs with probability $< 1$, must take place.

Since $(I - \frac{1}{m}(\mathbf{1})) * A_0^k = (I - \frac{1}{m}(\mathbf{1})) * \tilde{A}_0^k * A$, and $A$ is, with probability 1, a bounded rank $n$ matrix, $\lim_{k\to\infty} \|(I - \frac{1}{m}(\mathbf{1})) * A_0^k\| = 0$ (unbounded) $\iff \lim_{k\to\infty} \|(I - \frac{1}{m}(\mathbf{1})) * \tilde{A}_0^k\| = 0$ (unbounded), almost surely. The result then follows from lemma 1.

ii. Since $\forall t$, $|E(t) - \nu I(t)| < c$, if there are $m$ live internal spikes at any given time, the number of live external spikes at the same time lies in the range
$[\nu m - c, \nu m + c]$. If $p$ live spikes are randomly chosen from the system and $q$ of those are internal spikes, then $E(q) \in \left[\frac{pm}{(1+\nu)m + c}, \frac{pm}{(1+\nu)m - c}\right]$. We now modify the stochastic process as follows. The process is begun with $A_0^0 = A$. If the $k$th step involves the introduction of an external spike into the system, $A_0^k$ is set to $A_0^{k-1}$. If the $k$th step involves the birth of an internal spike, $A_0^k$ is generated from $A_0^{k-1}$ by randomly choosing $q$ rows from $A_0^{k-1}$, choosing random variables $x_1, x_2, \ldots, x_p$ as described earlier, constructing $v_{new} = \sum_{i=1}^q x_i v_i$, and inserting it at a random location into $A_0^{k-1}$.

Clearly, as in case i, $\forall k$, $E_k(v_i) = \langle 0, 0, \ldots, 0\rangle$. Simple algebra also demonstrates that at the birth of a spike, $V_k = \left(\frac{m}{m+1} + \frac{(1+\Delta)q}{(m+1)p} - \frac{\Delta q(q-1)}{m(m+1)p(p-1)}\right)V_{k-1} - \frac{\Delta(m-1)q(q-1)}{m(m+1)p(p-1)}C_{k-1}$, and $C_k = \frac{2q}{m(m+1)p}V_{k-1} + \left(\frac{m-1}{m+1} + \frac{2(m-1)q}{m(m+1)p}\right)C_{k-1}$. Noting that $m \ge p \ge q$ and assuming that $\Delta > 0$, we arrive at $V_k - C_k \ge \left(1 + \frac{\Delta(m-1)q(q-1) - 2(mp - mq + q)(p-1)}{m(m+1)p(p-1)}\right)(V_{k-1} - C_{k-1})$.

Therefore, if $\Delta > \frac{2}{M_{low}}(1 + \nu + \frac{c}{M_{low}})^2 + 2(\nu + \frac{c}{M_{low}})(1 + \nu + \frac{c}{M_{low}})$, it follows that $V_k - C_k > V_{k-1} - C_{k-1}$. Finally, arguments identical to those in lemma 1 yield, almost surely: $\lim_{k\to\infty} \|(I - \frac{1}{m}(\mathbf{1})) * \tilde{A}_0^k * A\|$ is unbounded $\implies \lim_{k\to\infty} \|B * A_0^k * C\|$ is unbounded.

iii. We modify the stochastic process so as to construct the new matrix $(\tilde{A}_0^k)^r * A$ for arbitrary $r$. As in case i, the process is begun with $A_0^0 = \tilde{A}_0^0 * A$. Each step in the process is then carried out in the exact same manner as in case i. However, after every $k$ steps, the row mean of the matrix is deducted from each row; that is, the row mean is deducted from each row of $A_0^{kr}$, for $r = 1, 2, \ldots$, before the next step is performed. Deletion of the row mean at the end of the first passage around the periodic orbit amounts to the operation $\tilde{A}_0^0 * A_0^k = \tilde{A}_0^0 * \tilde{A}_0^k * A$. It follows from a previous discussion that every subsequent addition or deletion of rows can be considered as being performed on $\tilde{A}_0^0$. Therefore, just prior to the deletion of the row mean after the $r$th passage around the periodic orbit, we have $(\tilde{A}_0^k)^r * A$. Since $V(\tilde{A}_0^0 * \tilde{A}_0^{kr} * A) = (1 - \frac{1}{n})(V(\tilde{A}_0^{kr} * A) - C(\tilde{A}_0^{kr} * A))$ and $C(\tilde{A}_0^0 * \tilde{A}_0^{kr} * A) = -\frac{1}{n}(V(\tilde{A}_0^{kr} * A) - C(\tilde{A}_0^{kr} * A))$, we have $V(\tilde{A}_0^0 * \tilde{A}_0^{kr} * A) - C(\tilde{A}_0^0 * \tilde{A}_0^{kr} * A) = V(\tilde{A}_0^{kr} * A) - C(\tilde{A}_0^{kr} * A)$. The rest of the argument follows along the lines of case i.

Case iii is based on the assumption that the $v_i$'s remain independent of the $x_i$'s in spite of the $x_i$'s being identical each time a specific event is encountered on the periodic orbit as it is traversed repeatedly by the process. We demonstrate that the assumption is reasonable. Figure 1 depicts a single traversal of a periodic orbit in a three-neuron system. The four spikes, A, B, C, and D, in region $H_1$ constitute the initial set of spikes (therefore, $n = 4$).
Nodes in region $N$, also labeled A, B, C, and D, represent the deletion of the row mean at the end of each passage around the periodic orbit. The periodic orbit is extended to the right ($H_2$) to mark the repetition of the spikes A, B, C, and D. The dotted arrows relate which spikes are effective in the birth of any given spike. We now regard the diagram as a directed graph with the additional constraint that nodes in $N$ be identified with corresponding spikes in $H_2$, that is, they be regarded as one and the same. Edges from nodes in $N/H_2$ to corresponding spikes in $H_1$ are assigned weights $(1 - \frac{1}{n})$, and to other spikes in $H_1$ are assigned weights $-\frac{1}{n}$. Each dotted edge is assigned weight $x_i$. Let $v_i$ correspond to any given spike $i$ in the periodic orbit. The $n$ ($= 4$) elements of the row $v_i$ (sensitivity to A, B, C, and D) after the $r$th passage around the periodic orbit can then be computed by identifying all paths from A in $N/H_2$ (respectively, B, C, and D for the three remaining elements) to spike $i$ such that each path crosses $N/H_2$ $r$ times before terminating at spike $i$, and setting $v_i = \sum_{\text{paths } e_1 e_2 \ldots e_q} w(e_1)w(e_2)\cdots w(e_q)$, where $e_1, \ldots, e_q$ are the edges in a path and the $w(e_j)$'s are their weights.

Figure 1: Graphical depiction of the construction of $(\tilde{A}_0^k)^r$.

Given any spike $v_i$ in the periodic orbit, a spike $v_j$ that it is effective on, and the corresponding effectiveness $x_i$, contributions to $v_i$ from paths that do not pass through $v_j$ are independent of $x_i$.¹³ Assuming that there are approximately $n$ live spikes at any time,¹⁴ the proportion of such paths is approximately $(1 - \frac{1}{n})^r$. The remaining paths can be partitioned into sets based on the number of times $l = 1, \ldots, r$ that each path passes through $v_j$. The assumption of independence is unbiased for each such set because the corresponding terms satisfy $E(x_i)\sum_{j_1=1,\ldots,j_l=1}^{p,\ldots,p} E(x_{j_1} \cdots x_{j_l}) = \sum_{j_1=1,\ldots,j_l=1}^{p,\ldots,p} E(x_i x_{j_1} \cdots x_{j_l})$.

¹³ For example, in Figure 1, contributions to $v_1$ from paths that do not pass through $v_2$ are independent of the corresponding $x_i$.
¹⁴ Given that there are $\approx 10^5$ neurons in a typical column, that the spiking rate of a typical neuron is $\approx 40$/sec, and that the effects of a PSP last for $\approx 150$ msec, $n \approx 6 \times 10^5$.

5.3 Qualitative Dynamics of Columns in the Neocortex. The parameters of $P^k(\cdot)$, for any neuron $k$, can be partitioned into two sets: the set of
afferent spikes into neuron $k$, and the set of spikes generated by neuron $k$. Under normal conditions, the spiking rate of a typical neuron in the neocortex is approximately 40 per second. The interspike intervals being large, $|\partial P^k / \partial x^k_j|_{P^k(\cdot)=T}$ is negligible. Consequently, at the birth of a spike at neuron $k$, ${}^k\!a^j_k$ is negligible for all $j = 1, \ldots, n_k$.

Approximately 80% of all neurons in the neocortex are spiny (physiological experiments indicate that they are predominantly excitatory), and the rest are smooth (predominantly inhibitory) (Peters & Proskauer, 1980; Peters & Regidor, 1981; Braitenberg & Schüz, 1991; Shepherd, 1998). It therefore follows that the majority of the spikes in the system at any given time are excitatory. The impact of such spikes on $P(\cdot)$ of a typical neuron is given by a steep rising phase followed by a relatively gentle and prolonged falling phase (Bernander, Douglas, & Koch, 1992). Consequently, the typical distribution of the ${}^k\!a^j_i$'s at the birth of a spike at any neuron $k$ comprises a few large positive elements and numerous small negative elements. We determined through numerical calculations, based on mean spike rate, connectivity, and potential function data from the models presented in the original article, that $E(\sum_{i,j} ({}^k\!a^j_i)^2) > 1$ under such conditions, and that $E(\sum_{i,j} ({}^k\!a^j_i)^2)$ rises as the number of small negative elements increases. The resultant values were in fact quite large and met all relevant conditions (from cases i, ii, and iii) in theorem 1.

A neuronal system is considered to be operating under seizure-like conditions when most of the neurons in the system spike at intervals approaching the relative refractory period. Under such conditions, the first spike in the second set progressively assumes a significant role in the determination of the time at which the neuron spikes next. Stated formally, if the most recent spike at neuron $k$ is $x^k_1$, then at the subsequent birth of a spike at neuron $k$, ${}^k\!a^k_1 = 1 - \epsilon$, and the remainder of the ${}^k\!a^j_i$'s (which sum to $\epsilon$) satisfy an appropriately scaled version of the distribution presented in the previous paragraph. As the system progressively approaches seizure-like conditions, $\epsilon$ drops from 1 to 0. Simple algebra shows that when $\epsilon$ is small, $E(\sum_{i,j} ({}^k\!a^j_i)^2) \approx 1 - 2\epsilon$.

In summary, as one proceeds from the region of the phase-space associated with low neuronal activity ($\epsilon \approx 1$) to the region associated with high neuronal activity ($\epsilon \approx 0$), periodic orbits go from being almost surely unstable (trajectories almost surely sensitive to initial conditions), to almost surely neutral, to almost surely stable, and back to being almost surely neutral. We also found that under conditions of sparse activity, $E(\sum_{i,j} ({}^k\!a^j_i)^2) < 1$ if the distribution of the $\partial P^k / \partial x^j_i$'s is inverted, a state that can be realized by setting the impact of an excitatory (inhibitory) spike to have a gentle and prolonged rising (falling) phase followed by a steep and short falling (rising) phase. Piecewise linear functions were used to fashion the potential
functions in the models presented in the original article, and simulation experiments were conducted. The results were consistent with the assertions of theorem 1. Figure 2 presents the result of one such experiment. The columns represent two systems comprising 1000 neurons each, identical in all respects save the potential functions of the neurons. The first row depicts the effect of a typical excitatory spike on the potential function of a neuron, the second normalized time-series data pertaining to the total number of live spikes in each system, and the third results of a power spectrum analysis on each time series. While in the first case the dynamics is chaotic, in the second it is at best quasi-periodic.

Figure 2: Dynamics of two systems of neurons that differ in terms of the impact of spikes on $P_i(\cdot)$.

6 Global Analysis
In this section, we demonstrate how all basic sets, complex sets, attractors, and their realms of attraction can be isolated, given the exact instantiations of $\mathcal{P}^i_I$ for $i = 1, \ldots, S$.
We begin by introducing an equivalence relation on $\bigcup_{i=1}^S L^i_{n_i} \setminus \bigcup_{i=1}^S \mathcal{P}^i_I$. We define $p \sim q$ if $p, q \in \bigcup_{i=1}^S ({}^{s_i}\!L^i_{n_i} \setminus {}^{s_i+1}\!L^i_{n_i})$ for arbitrary but fixed values of the $s_i$'s, and $\exists Y_x(t)$ such that $p = Y_x(t_1)$, $q = Y_x(t_2)$, and $\forall t \in [\min(t_1, t_2), \max(t_1, t_2)]$, $Y_x(t) \in \bigcup_{i=1}^S ({}^{s_i}\!L^i_{n_i} \setminus {}^{s_i+1}\!L^i_{n_i})$. We define the equivalence class $[p] = \{q \mid q \sim p\}$, and a mapping $H: (\bigcup_{i=1}^S L^i_{n_i} \setminus \bigcup_{i=1}^S \mathcal{P}^i_I)/\sim\ \to (\bigcup_{i=1}^S L^i_{n_i} \setminus \bigcup_{i=1}^S \mathcal{P}^i_I)/\sim$ as $H([p]) = [q]$ if there exists a $Y_x(t)$ such that two of its successive segments (as defined in section 3) satisfy $\{Y_x(t) \mid t_j < t < t_{j+1}\} \subseteq [p]$ and $\{Y_x(t) \mid t_{j+1} < t < t_{j+2}\} \subseteq [q]$. We operate hereafter on $(\bigcup_{i=1}^S L^i_{n_i} \setminus \bigcup_{i=1}^S \mathcal{P}^i_I)/\sim$,¹⁵ and forsake $Y(x, t)$ in favor of $H$.

¹⁵ We regard $\bigcup_{i=1}^S L^i_{n_i} \setminus \bigcup_{i=1}^S \mathcal{P}^i_I$ with the topology defined in section 2, and assign $(\bigcup_{i=1}^S L^i_{n_i} \setminus \bigcup_{i=1}^S \mathcal{P}^i_I)/\sim$ the corresponding quotient topology.

We note immediately that $H$ is only a piecewise continuous function. The analysis in the previous section was restricted to $Y_x(t)$'s that are locally continuous in $x$ for the simple reason that all other $Y_x(t)$'s are, trivially, sensitive to initial conditions. The following definitions are to be regarded in the light of the fact that $H$ is discontinuous. $[p]$ is labeled a wandering point if $\exists U$, an open neighborhood of $[p]$, and $\exists n_{min} \ge 0$ such that $\forall n > n_{min}$, $H^n(U) \cap U = \emptyset$. $[p]$ is labeled a nonwandering point if $\forall U$, open neighborhood of $[p]$, and $\forall n_{min} \ge 0$, $\exists n > n_{min}$ such that $H^n(U) \cap U \neq \emptyset$. $\Lambda \subset (\bigcup_{i=1}^S L^i_{n_i} \setminus \bigcup_{i=1}^S \mathcal{P}^i_I)/\sim$ is labeled a basic set if $\forall [p], [q] \in \Lambda$, $\forall U, V$ open neighborhoods of $[p]$ and $[q]$, respectively, and $\forall n_{min} \ge 0$, $\exists n_1, n_2, n_3, n_4 > n_{min}$ such that $H^{n_1}(U) \cap U \neq \emptyset$, $H^{n_2}(U) \cap V \neq \emptyset$, $H^{n_3}(V) \cap V \neq \emptyset$, and $H^{n_4}(V) \cap U \neq \emptyset$, and furthermore, $\Lambda$ is a maximal set with regard to this property.

It follows from the definition that the set of all nonwandering points, $V$, is closed, and that any basic set $\Lambda$ is a closed subset of $V$. Membership in the same basic set, however, falls short of an equivalence relation. The relation is reflexive, since $\forall U, V$ open neighborhoods of $[p]$, $U \cap V$ is an open neighborhood of $[p]$, and symmetric by definition, but not transitive, since $H$ is not a homeomorphism. Each nonwandering point is therefore a member of one or more basic sets. This feature significantly affects the characterization of an attractor.

Two basic sets $\Lambda$ and $\Lambda'$ are considered coupled if either $\Lambda \cap \Lambda' \neq \emptyset$ or there exists a finite or countably infinite sequence of basic sets $\Lambda_1, \ldots, \Lambda_k, \ldots$ such that $\forall i$, $\Lambda_i$ and $\Lambda_{i+1}$ are coupled, and in addition, $\Lambda \cap \bigcup_{i=1}^{k/\infty} \Lambda_i \neq \emptyset$ and $\Lambda' \cap \bigcup_{i=1}^{k/\infty} \Lambda_i \neq \emptyset$. We note immediately that this is an equivalence relation, and it therefore partitions the set of all basic sets into equivalence classes. More significant, however, is the fact that the relation partitions $V$ into disjoint sets. This follows from the observation that if, on the contrary, $[p]$ belongs to two distinct classes, then there exist basic sets $\Lambda$ and $\Lambda'$, members
of the respective classes, that are coupled. Each such disjoint subset of $V$ is labeled a complex set. We demonstrate that any complex set, $J$, is a closed subset of $V$. If $[p]$ is a limit point of $J$, every open neighborhood of $[p]$ contains a nonwandering point. $[p]$ is therefore a nonwandering point. Moreover, $\exists \Lambda_1, \ldots, \Lambda_k, \ldots \subset J$ such that $[p] \in \overline{\bigcup_{i=1}^{\infty} \Lambda_i}$. Consequently, if $[p] \in \Lambda$, then $\Lambda \subset J$, and therefore $[p] \in J$. $J$ is therefore closed. Finally, a complex set $J$ is labeled an attractor if $\exists U$ such that $J \subset int(U)$, the interior of $U$; $\forall J' \neq J$, $U \cap J' = \emptyset$; and $U$ is a trapping region, that is, $H(U) \subseteq U$. The discontinuity of $H$, in essence, sanctions the existence of anisotropic¹⁶ attractors.

¹⁶ A trapped trajectory does not necessarily visit every basic set in the complex set. Moreover, the sequence of basic sets visited by any such trajectory depends on its point of entry.

A procedure to locate all nonwandering points, basic sets, complex sets, and attractors in $(\bigcup_{i=1}^S L^i_{n_i} \setminus \bigcup_{i=1}^S \mathcal{P}^i_I)/\sim$ is presented in Figure 3. It begins with the initial partition of the phase-space $\{\bigcup_{i=1}^S ({}^{s_i}\!L^i_{n_i} \setminus {}^{s_i+1}\!L^i_{n_i}) \setminus \bigcup_{i=1}^S \mathcal{P}^i_I/\sim\ \mid 0 \le s_i \le n_i;\ i = 1, \ldots, S\}$, and generates a succession of refinements. The partition operation is well defined since any $P$ partitioned satisfies $H(P) \cap P = \emptyset$.

Figure 3: A procedure to locate all nonwandering points, basic sets, complex sets, and attractors in the phase-space.

We label node $N^j_a$ from stage $j$ a descendant of node $N^k_b$ from stage $k$ if $j > k$ and the corresponding sets in the partition of the phase-space satisfy $P^j_a \subset P^k_b$. Let $N^j_1 N^j_2 \ldots N^j_n$ be a path in the graph at stage $j$. We define ${}^i\!P$ recursively as ${}^1\!P = P^j_n$ and ${}^{i+1}\!P = H^{-1}({}^i\!P) \cap P^j_{n-i}$. We label $N^j_1 N^j_2 \ldots N^j_n$ a false path of primary length $p$ if $\forall i = 1, \ldots, p$, ${}^i\!P \neq \emptyset$ and ${}^{p+1}\!P = \emptyset$. We now make certain observations about the procedure.

If at stage $j$ there is a trajectory in the phase-space that has consecutive segments in the sets $P^j_1, P^j_2, \ldots, P^j_n$, then the corresponding graph contains the path $N^j_1 N^j_2 \ldots N^j_n$. In other words, the trajectories in the phase-space are a subset of the paths in the graph at all stages.

If at stage $j$ the graph does not contain the path $N^j_1 N^j_2 \ldots N^j_n$, then at stage $(j+1)$ the graph does not contain the path $N^{j+1}_{i_1} N^{j+1}_{i_2} \ldots N^{j+1}_{i_n}$ for any descendant $N^{j+1}_{i_k}$ of $N^j_k$, $k = 1, \ldots, n$.

If at stage $j$, $N^j_1 N^j_2 \ldots N^j_n$ is a false path of primary length $p$, then at stage $(j + p - 1)$ no path $N^{j+p-1}_{i_1} N^{j+p-1}_{i_2} \ldots N^{j+p-1}_{i_n}$ exists for any descendant $N^{j+p-1}_{i_k}$ of $N^j_k$, $k = 1, \ldots, n$. This follows from the observations: (i) (base case) if at stage $j$, $N^j_1 N^j_2 \ldots N^j_n$ is a false path of primary length 2, then at stage $(j+1)$ no path $N^{j+1}_{i_1} N^{j+1}_{i_2} \ldots N^{j+1}_{i_n}$ exists for any descendant $N^{j+1}_{i_k}$ of $N^j_k$, $k = 1, \ldots, n$; and (ii) (inductive case) if at stage $j$, $N^j_1 N^j_2 \ldots N^j_n$ is a false path of primary length $p$, then at stage $(j+1)$, if a path $N^{j+1}_{i_1} N^{j+1}_{i_2} \ldots N^{j+1}_{i_n}$ exists for any descendant $N^{j+1}_{i_k}$ of $N^j_k$, $k = 1, \ldots, n$, then the path $N^{j+1}_{i_1} N^{j+1}_{i_2} \ldots N^{j+1}_{i_{n-1}}$ is a false path of primary length $(p-1)$. These follow from the fact that at stage $(j+1)$, $P^j_{n-1}$ is partitioned such that $\exists P^{j+1}_{i^1_{n-1}}, \ldots, P^{j+1}_{i^q_{n-1}}$ that satisfy $\bigcup_{k=1}^q P^{j+1}_{i^k_{n-1}} = H^{-1}(P^j_n) \cap P^j_{n-1}$.

Given any stage $j$, we refer to a set of nodes $\langle N^j_k \rangle = \{N^j_{i^k_l} \mid l = 1, \ldots, m_k\}$ as a cluster if $\bigcap_{l=1}^{m_k} P^j_{i^k_l} \neq \emptyset$. We define a relation Recurrent$(\cdot, \cdot)$ between clusters
as Recurrent$(\langle N^j_a \rangle, \langle N^j_b \rangle)$ if and only if $\exists l_a, l'_a \in \{1, \ldots, m_a\}$ and $\exists l_b, l'_b \in \{1, \ldots, m_b\}$ such that paths $N^j_{i^a_{l_a}} N^j_{e_1} \ldots N^j_{e_p} N^j_{i^b_{l_b}}$ and $N^j_{i^b_{l'_b}} N^j_{f_1} \ldots N^j_{f_q} N^j_{i^a_{l'_a}}$ exist in the graph, where $N^j_h$ lies on a cycle for some $h \in \{i^a_{l_a}, e_1, \ldots, e_p, i^b_{l_b}\}$ and some $h' \in \{i^b_{l'_b}, f_1, \ldots, f_q, i^a_{l'_a}\}$. The following theorem identifies all nonwandering points in $(\bigcup_{i=1}^S L^i_{n_i} \setminus \bigcup_{i=1}^S \mathcal{P}^i_I)/\sim$ with the exception of the trivial fixed point at $\bigcup_{i=1}^S {}^{n_i}\!L^i_{n_i}$. We assign the node corresponding to $\bigcup_{i=1}^S {}^{n_i}\!L^i_{n_i}$ nonwandering status at the start of the process.
Theorem 2. Let $\{\langle N^j_k \rangle \mid k = 1, \ldots, n_j\}$ be the set of all clusters at stage $j$ such that $\forall k = 1, \ldots, n_j$, Recurrent$(\langle N^j_k \rangle, \langle N^j_k \rangle)$. Then the set of all nonwandering points is given by $V = \bigcap_{j=0}^{\infty} \left(\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l}\right)$.
Proof. $[p]$ is a nonwandering point $\implies [p] \in \bigcap_{j=0}^{\infty} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l})$: Let $[p] \in \bigcup_{i=1}^S ({}^{s_i}\!L^i_{n_i} \setminus {}^{s_i+1}\!L^i_{n_i}) \setminus \bigcup_{i=1}^S \mathcal{P}^i_I/\sim$ for arbitrary but fixed values of the $s_i$'s. We assume, without loss of generality, that $[p] \in P^j_1$ at stage $j$. If $[p]$ lies in the interior of $P^j_1$, an open neighborhood $U$ of $[p]$ can be chosen such that $U \subset P^j_1$. There is then a trajectory through $P^j_1$ that returns to it. The unitary cluster $\{N^j_1\}$ then satisfies the criterion. If, on the other hand, $[p]$ lies on the boundary of $P^j_1$, there are only finitely many sets $P^j_1, \ldots, P^j_r$ in $\bigcup_{i=1}^S ({}^{s_i}\!L^i_{n_i} \setminus {}^{s_i+1}\!L^i_{n_i}) \setminus \bigcup_{i=1}^S \mathcal{P}^i_I/\sim$ such that $\forall l = 1, \ldots, r$, $[p] \in \overline{P^j_l}$.¹⁷ We choose $U$ such that $U \subset \bigcup_{l=1}^r P^j_l$. Then, since $H^n(U) \cap U \neq \emptyset$ for an infinite sequence of $n$'s, $\exists l_1, l_2 \in \{1, \ldots, r\}$, not necessarily distinct, such that there are arbitrarily long trajectories from $P^j_{l_1}$ to $P^j_{l_2}$. Since the graph has a finite number of nodes at any stage, there exists a path from $N^j_{l_1}$ to $N^j_{l_2}$ such that a node on the path lies on a cycle. The cluster $\{N^j_1, \ldots, N^j_r\}$ then satisfies the criterion.

$[p]$ is a wandering point $\implies [p] \notin \bigcap_{j=0}^{\infty} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l})$: As above, we assume $[p] \in P^j_1$ at stage $j$. Since at each stage the diameter¹⁸ of each set in the partition drops by a factor of two, given any open neighborhood $U$ of $[p]$, there exists a stage $j$ where $(P^j_1 \subset U) \wedge \forall l\, (P^j_1 \cap \overline{P^j_l} \neq \emptyset \Rightarrow P^j_l \subset U)$. Let $P^j_1 \cap \overline{P^j_l} \neq \emptyset$ for $l = 1, \ldots, r$. Then any cluster that contains $N^j_1$ can, at best, contain $N^j_l$ for some $l \in \{1, \ldots, r\}$. Given any such cluster, if $\exists l_1, l_2$ such that there is a path from $N^j_{l_1}$ to $N^j_{l_2}$ through a cycle, then the path is a false path, since $\bigcup_{l=1}^r P^j_l \subset U$. There is then a stage $(j + p)$ wherein all descendants of $N^j_{l_1}$ and $N^j_{l_2}$ do not lie on the path.

¹⁷ Given the nature of the topology, $[p] \notin \overline{P}$ for any $P$ in the partition that does not lie in $\bigcup_{i=1}^S ({}^{s_i}\!L^i_{n_i} \setminus {}^{s_i+1}\!L^i_{n_i}) \setminus \bigcup_{i=1}^S \mathcal{P}^i_I/\sim$.
¹⁸ $\max_{\forall [p], [q]} (d([p], [q]))$, where $d(\cdot, \cdot)$ is the distance between $[p]$ and $[q]$ based on the Riemannian metric.
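At the graph level, Theorem 2 prunes, stage by stage, every node that cannot return to itself; only nodes lying on cycles can carry nonwandering points. A minimal sketch of that test, on a hypothetical stage graph (the adjacency data is invented for illustration; a production version would use a strongly-connected-components algorithm rather than this simple reachability check):

```python
from collections import defaultdict

def on_cycle(nodes, edges):
    # A node survives a stage if it can reach itself through at least one
    # edge, i.e., it lies on a cycle of the stage graph.
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
    def reaches(src, dst):
        seen, stack = set(), list(adj[src])
        while stack:
            v = stack.pop()
            if v == dst:
                return True
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        return False
    return {v for v in nodes if reaches(v, v)}

# Hypothetical stage graph: 1->2->3->1 is a cycle; 4 feeds it; 5 self-loops.
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (2, 3), (3, 1), (4, 1), (5, 5)]
assert on_cycle(nodes, edges) == {1, 2, 3, 5}
# Node 4 reaches the cycle but never returns to itself, so it is pruned,
# mirroring how cells containing only wandering points drop out of the
# successive refinements.
```

The pruned node 4 is the graph analogue of a wandering point: its cell is visited and abandoned, so it cannot appear in the intersection over stages.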
Corollary 1. Let $\{\langle N^j_k \rangle \mid k = 1, \ldots, n_j\}$ for stages $j = 1, 2, \ldots$ be successive sets such that (i) each such set comprises a maximal set of clusters satisfying $\forall k_1, k_2 \in \{1, \ldots, n_j\}$, Recurrent$(\langle N^j_{k_1} \rangle, \langle N^j_{k_2} \rangle)$, and (ii) $\forall j$, $\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l} \subseteq \bigcup_{k=1, l=1}^{n_{j-1}, m_k} P^{j-1}_{i^k_l}$. Then $\bigcap_{j=0}^{\infty} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l})$ is a basic set. Conversely, every basic set is representable as such.

Proof. If $[p]_1, [p]_2 \in \bigcap_{j=0}^{\infty} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l})$ for any sequence of sets satisfying criteria i and ii, then based on arguments similar to those in the previous theorem, we conclude that there exists a basic set $\Lambda$ such that $[p]_1, [p]_2 \in \Lambda$. Moreover, $\Lambda$ being maximal, $\bigcap_{j=0}^{\infty} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l}) \subseteq \Lambda$. Finally, if $\exists [p] \in \Lambda$ such that $[p] \notin \bigcap_{j=0}^{\infty} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l})$, then $\{\langle N^{j_0}_k \rangle \mid k = 1, \ldots, n_{j_0}\}$ is not maximal for some stage $j_0$.

Conversely, if $\Lambda$ is a basic set, a sequence of sets satisfying criteria i and ii can be constructed such that $\forall [p] \in \Lambda$ and for every stage $j$, $[p] \in \langle N^j_k \rangle$ for some $\langle N^j_k \rangle$ in the set. This is based on: (a) if $\{\langle N^j_k \rangle \mid k = 1, \ldots, n_j\}$ for some stage $j$ satisfies criterion i, then the minimal set $\{\langle N^{j-1}_{k'} \rangle \mid \forall N^j_{i^k_l} \in \langle N^j_k \rangle\ \exists N^{j-1}_{i^{k'}_l} \in \langle N^{j-1}_{k'} \rangle,\ P^j_{i^k_l} \subset P^{j-1}_{i^{k'}_l}\}$ is Recurrent for all pairs of member clusters, and can therefore be expanded to satisfy criteria i and ii; (b) if $\{\langle N^{j_0}_k \rangle \mid k = 1, \ldots, n_{j_0}\}$ at some stage $j_0$ is a set of clusters such that $\forall [p] \in \Lambda$, $\exists \langle N^{j_0}_k \rangle$ where $[p] \in int(\langle N^{j_0}_k \rangle)$ and vice versa, then there are only finitely many ways it can be expanded to satisfy criterion i; and (c) there is then a stage $(j_0 + p)$ wherein all points in these additional clusters do not lie in $\bigcap_{j=0}^{j_0+p} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l})$. In other words, an infinite sequence of sets satisfying criteria i and ii can be constructed such that $\Lambda \subseteq \bigcap_{j=0}^{\infty} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l})$. Finally, if $\exists [p] \in \bigcap_{j=0}^{\infty} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l})$ such that $[p] \notin \Lambda$, then arguments similar to those in the previous theorem reveal a contradiction.
Corollary 2. Let $\{\langle N^j_k \rangle \mid k = 1, \ldots, n_j\}$ for stages $j = 1, 2, \ldots$ be successive sets such that (i) each such set comprises an equivalence class of clusters generated by the transitive closure of Recurrent$(\cdot, \cdot)$ on the set $\{\langle N^j_k \rangle \mid$ Recurrent$(\langle N^j_k \rangle, \langle N^j_k \rangle)\}$, and (ii) $\forall j$, $\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l} \subseteq \bigcup_{k=1, l=1}^{n_{j-1}, m_k} P^{j-1}_{i^k_l}$. Then $\bigcap_{j=0}^{\infty} (\bigcup_{k=1, l=1}^{n_j, m_k} P^j_{i^k_l})$ is a complex set. Conversely, every complex set is representable as such.

Proof. $\Lambda$ and $\Lambda'$ are coupled if and only if a sequence of basic sets $\Lambda, \Lambda_1, \Lambda_2, \ldots, \Lambda'$ exists such that each consecutive pair of basic sets shares one or more clusters at every stage. This follows from the observation that given any $\Lambda$ and its corresponding sequence $\{\langle N^j_k \rangle \mid k = 1, \ldots, n_j\}$ for $j = 1, 2, \ldots$, if $[p] \in \Lambda$, then for every stage $j$ there exists a $\langle N^j_{k_0} \rangle$ that satisfies $[p] \in int(\langle N^j_{k_0} \rangle)$, and $\langle N^j_{k_0} \rangle \in \{\langle N^j_k \rangle \mid k = 1, \ldots, n_j\}$. The remainder of the proof follows along the lines of the proof of the previous corollary.

If $G^j$ denotes the graph at stage $j$ and $H^j$ is a subgraph of $G^j$ such that there are no edges leading from nodes in $H^j$ to nodes in $(G^j \setminus H^j)$, then we note from previous observations that the union of the sets corresponding to the nodes in $H^j$ constitutes a trapping region. We now demonstrate that a complex set $J$ is an attractor if and only if there is a stage $j_0$ where no path exists from any node in the equivalence class of clusters associated with $J$ (as defined in the previous corollary) to a node in an equivalence class of clusters associated with any other complex set $J'$.

We first note that if such a stage exists, the criterion remains true for all subsequent stages. The criterion guarantees the existence of maximal subgraphs $H^j$ for every $j \ge j_0$ such that (a) every cluster $\langle N^j_k \rangle$ in the equivalence class associated with $J$ lies in $H^j$, (b) $H^j$ does not contain any node from an equivalence class of clusters associated with any other complex set $J'$, and (c) there are no edges from $H^j$ to $(G^j \setminus H^j)$. If $W^j$ denotes the union of the sets corresponding to the nodes in $H^j$, the realm of attraction of $J$ is given by $\lim_{j\to\infty} W^j$.

If, on the contrary, at every stage $j$ there exists a path from a node in the equivalence class associated with $J$ to a node in an equivalence class associated with some other complex set $J'$, then there is no set $U$ such that $J \subset int(U)$, $\forall J' \neq J$, $U \cap J' = \emptyset$, and $H(U) \subseteq U$. For if such a $U$ exists, since $\forall J' \neq J$, $U \cap J' = \emptyset$, given any other complex set $J'$, there exists an open neighborhood $V$ of $J'$ such that $U \cap V = \emptyset$. Moreover, since $H(U) \subseteq U$, $\forall n \ge 0$, $H^n(int(U)) \cap V = \emptyset$.
In conclusion, we recall that with regard to the dynamics of cortical columns, we demonstrated in section 5 that trajectories (periodic orbits) in the region of the phase-space corresponding to normal operational conditions are, with probability 1, sensitive to initial conditions (unstable). It therefore follows that attractors in this region of the phase-space are almost
surely chaotic. When combined with the observation that the inherently discontinuous nature of H sanctions the existence of anisotropic attractors, it becomes apparent that the dynamics of cortical columns, under normal operational conditions, is governed by attractors that are not only almost surely chaotic but are also potentially anisotropic.

7 Conclusions and Future Research
The results presented in this article yield a principled resolution of the issue of information coding in recurrent neuronal systems (such as columns in the neocortex): attractors in the phase-space denote the symbolic states of such systems. While the spatial arrangement of the attractors in the phase-space of certain neuronal systems might prompt the surmise of rate or population coding, such schemes are rendered epiphenomenal from this perspective. The chaotic and potentially anisotropic nature of the attractors attains principal significance in this context. We first consider the ramifications of the attractors in question being almost surely chaotic. Recall that external input into the system is modeled by introducing additional neurons whose state descriptions match the input identically. Stated formally, the expanded phase-space is given by ∏_{i=1}^{S} L_{n_i} × ∏_{i=S+1}^{R} L_{n_i}, where i = 1, …, S denote the neurons in the system and i = (S+1), …, R denote the input neurons. Since the times at which spikes are generated at the input neurons are determined solely by the external input, the expanded phase-space does not contain hypersurfaces P_i^I for i = (S+1), …, R. Each input corresponds to a unique trajectory Ȳ(t) in ∏_{i=S+1}^{R} L_{n_i}. The temporal evolution of the system upon input Ȳ(t) is given by the trajectory Ŷ(t) in the expanded phase-space whose projection on ∏_{i=S+1}^{R} L_{n_i} is Ȳ(t). It is based on this formulation that we demonstrated in section 5 that the dynamics of a cortical column under normal operational conditions, sustained by external input from the thalamus and/or other cortical columns, is sensitive to initial conditions. This implies that even under conditions where the exact input is known, it is impossible to predict, based solely on nominal knowledge of the attractor that the system currently resides in, which attractors the system will visit as its dynamics unfolds, and at what times.
In other words, the symbol-level dynamics of a cortical column cannot be modeled as a deterministic automaton. We now consider the ramifications of the fact that the attractors in question, in addition to being chaotic, are potentially anisotropic. We recall that anisotropic attractors are attracting complex sets that are composed of multiple basic sets. Such attractors are capable of maintaining complex sequence information. The direction from which the system approaches such an attractor determines not only the subset of the constituent basic sets that the system visits, but also the order in which they are visited. In other words,
structure can be discerned in the phase-space of the neuronal system at two levels. First, there is the relationship between the attractors, each of which constitutes a macrosymbolic state of the system; and within each attractor, there is the relationship between the basic sets, each of which constitutes a microsymbolic state of the system.

Our research has generated numerous questions that require investigation. While it is clear that changes in the potential function of the neurons, brought about by long-term potentiation and depression, can dramatically change the nature of the dynamics of a neuronal system (creating, destroying, and altering the shape and location of the attractors in the phase-space), the exact nature of these changes is unknown. The nature of the bounds on the number of attractors that a system can maintain, as a function of the number of neurons, their connectivity, and the nature of their potential functions, is also unknown. We plan to address these questions in our future work.

Received June 14, 1999; accepted May 1, 2000.
LETTER
Communicated by Eric de Schutter
Subtractive and Divisive Inhibition: Effect of Voltage-Dependent Inhibitory Conductances and Noise

Brent Doiron, André Longtin
Physics Department, University of Ottawa, Ottawa, Canada K1N 6N5

Neil Berman, Leonard Maler
Department of Cellular and Molecular Medicine, University of Ottawa, Ottawa, Canada K1H 8M5

The influence of voltage-dependent inhibitory conductances on firing rate versus input current (f-I) curves is studied using simulations from a new compartmental model of a pyramidal cell of the weakly electric fish Apteronotus leptorhynchus. The voltage dependence of shunting-type inhibition enhances the subtractive effect of inhibition on f-I curves previously demonstrated in Holt and Koch (1997) for the voltage-independent case. This increased effectiveness is explained using the behavior of the average subthreshold voltage with input current and, in particular, the nonlinearity of Ohm's law in the subthreshold regime. Our simulations also reveal, for both voltage-dependent and -independent inhibitory conductances, a divisive inhibition regime at low frequencies (f < 40 Hz). This regime, dependent on stochastic inhibitory synaptic input and a coupling of inhibitory strength and variance, gives way to subtractive inhibition at higher output frequencies (f > 40 Hz). A simple leaky integrate-and-fire type model that incorporates the voltage dependence supports the results from our full ionic simulations.

1 Introduction
A current problem in single neuron computation is the characterization of the relation between the synaptic input current, I, received by a neuron and the frequency, f, of action potentials that the neuron generates in response to this input. Knowledge of this f-I relationship provides a simplified, plausible description of neural input-output operations, which can be used to model networks of neurons (see, e.g., Abbott, 1991). Adaptive signal processing strategies have received much recent interest in the context of these f-I relationships (see, e.g., Nelson, 1994; Carandini & Heeger, 1994; Nelson & Paulin, 1995; Holt & Koch, 1997; Abbott, Varela, Sen, & Nelson, 1997; Tsodyks & Markram, 1997). One such strategy involves gain control via

Neural Computation 13, 227–248 (2000)
© 2000 Massachusetts Institute of Technology
feedforward and feedback inhibition. Such feedback alters the sensitivity of the neuron to excitatory input and thus modifies the f-I relationship. Inhibitory transmission in the central nervous system is mediated primarily by GABA_A- and GABA_B-gated receptors. GABA_B receptors are linked to potassium channels, and their activation results in long-lasting, large inhibitory postsynaptic potentials (IPSPs) with a relatively small conductance change. This inhibition has been referred to as "subtractive," since its primary effect has been hypothesized to reduce the effect of excitatory input linearly. The f-I curve is thus shifted to higher excitatory currents, but its shape is maintained. However, GABA_A receptors are connected to chloride channels that have a reversal potential close to the resting membrane potential of the cell. When activated, these channels produce brief, relatively small potential shifts yet cause large conductance increases; this has been described as shunting inhibition. It was commonly thought that shunting gain control is divisive in nature; that is, the slope of the f-I curve is reduced under shunting (Rose, 1977; Koch & Poggio, 1992). However, Holt and Koch (1997) have demonstrated that in the suprathreshold regime, the effect of shunting synapses is subtractive. Theoretical studies of such effects have assumed voltage-independent inhibitory conductances. However, there is substantial evidence that this is not the case. Segal and Barker (1984) and Yoon (1994) have obtained an increase in GABA_A chloride conductance with depolarization in studies on rat hippocampal cells. Similar results have been obtained for glycine- and GABA_A-activated chloride channels in fish brain (Faber & Korn, 1987; Legendre & Korn, 1995; Berman & Maler, 1998a).

The putative mechanisms are thought to involve a voltage-dependent increase in channel open time (Segal & Barker, 1984; Faber & Korn, 1987; Legendre & Korn, 1995) and a decrease in channel desensitization (see Yoon, 1994, and additional references therein). We therefore reexamined the role of voltage-dependent inhibition in the sub- and suprathreshold regimes in determining the subtractive or divisive character of the inhibition. We inserted voltage dependence of shunting inhibition (chloride channels) into two neural models: (1) a new detailed compartmental model of the basilar pyramidal cell of the electrosensory lateral line lobe (ELL) of the weakly electric fish Apteronotus leptorhynchus and (2) a simple leaky integrate-and-fire (LIF) model. We developed the first model because of the detailed electrophysiological knowledge of inhibition in the ELL (Berman & Maler, 1998a, 1998b, 1998c). The second model was investigated for comparison with the first. Apart from its analytic tractability, it ensured that our results were not model dependent. Our findings can be summarized as follows: In the subthreshold regime, shunting inhibition has a divisive effect, and in particular, voltage-dependent inhibition produces a nonlinear current-voltage relationship.
The suprathreshold subtractive nature of the f-I curve observed by Holt and Koch (1997) is enhanced by the voltage-dependent inhibitory conductance. Thus, voltage dependence allows for a more effective (as defined below) subtractive gain control. The stochastic nature of the inhibitory input leads to divisive inhibition at low firing rates, an effect enhanced by the voltage dependence of the shunting inhibition.

Section 2 presents the model of the pyramidal cell used in our study. Simulation results for the subthreshold and suprathreshold regimes follow in section 3. The simple LIF model, which clarifies the origin of the effects studied here, is examined in section 4. A conclusion and outlook are given in section 5.

2 Methods

2.1 Compartmental Model. Basilar pyramidal cells in the ELL receive segregated feedforward and feedback input (see Berman & Maler, 1999, and references therein). The former is from the electroreceptors, giving both excitatory input directly to the basal dendritic bush of the cell and disynaptic inhibitory GABAergic input to the cell soma. The feedback excitatory input is to the apical dendrites of the pyramidal cells, while the feedback inhibition is via a different set of GABAergic interneurons. Both sets of interneurons use GABA_A receptors (τ = 7 ms, E_rev = −70 mV). We developed a two-dimensional, 152-compartment model of the basilar pyramidal cell using confocal images of a Lucifer yellow-filled neuron (Berman, Plant, Turner, & Maler, 1997). The diameters and lengths of the compartments ranged between 0.5 and 7 µm and 5 and 700 µm, respectively. Each compartment was subdivided into a number of isopotential segments for computational purposes; the maximum section length was set to 25 µm. Here, we report only on results for inhibitory synapses ending on the soma; the use of the full model is nevertheless justified by the fact that it provides a realistic passive dendritic load on the soma compartment.

The dynamical equations were integrated with NEURON (Hines & Carnevale, 1997) using a central difference (Crank–Nicolson) scheme. The time step was set to 0.025 ms, a value much smaller than the width of measured synaptic responses (Berman et al., 1997; Berman & Maler, 1999) and action potentials (Turner, Maler, Deerinck, Levinson, & Ellisman, 1994). For all compartments, axial resistivity was set to 250 Ω·cm, and capacitance per unit area was taken as 0.75 µF/cm², which are realistic values for vertebrate neurons (Mainen & Sejnowski, 1998). The passive characteristics of the model cell are such that the input resistance, R_in = 30.1 MΩ, and the passive membrane time constant, τ_m = 9.1 ms, are comparable to experimental measurements (Berman & Maler, 1998a). The temperature was set at 28°C (Berman & Maler, 1998a).
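The integration scheme named above can be illustrated on a single passive compartment. Below is a minimal sketch of our own (not the authors' NEURON code), assuming the passive parameters quoted in the text (R_in = 30.1 MΩ, τ_m = 9.1 ms, dt = 0.025 ms) and a simple current step:

```python
def crank_nicolson_passive(v0, r_in_mohm, tau_m_ms, i_na, dt_ms, n_steps):
    """Integrate the passive membrane equation tau_m dV/dt = -V + R_in * I
    with the implicit, second-order Crank-Nicolson scheme.
    V is the deviation from rest in mV; I is in nA; R_in is in megaohms."""
    v = v0
    trace = [v]
    v_inf = r_in_mohm * i_na           # steady-state deviation from rest, mV
    a = dt_ms / (2.0 * tau_m_ms)
    for _ in range(n_steps):
        # Trapezoidal update: (V_new - V)/dt = (-(V + V_new)/2 + V_inf)/tau_m
        v = ((1.0 - a) * v + 2.0 * a * v_inf) / (1.0 + a)
        trace.append(v)
    return trace

# A 0.5 nA step, integrated for 100 ms, should settle near
# R_in * I = 15.05 mV above rest.
trace = crank_nicolson_passive(0.0, 30.1, 9.1, 0.5, 0.025, 4000)
```

Unlike forward Euler, this update stays stable for any dt, which is one reason compartmental simulators default to it.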
2.2 Hodgkin-Huxley Dynamics. We considered the soma as having nine conductance channels. The differential equation governing the membrane potential at the soma is

C_m ∂V_m/∂t + I_Na + I_Dr + I_NaP + I_K1 + I_K2 + I_KV3 + I_leak + I_exc + I_inh = 0.  (2.1)
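The structure of equation 2.1 (a sum of ohmic channel currents balanced against the capacitive current) can be sketched with a reduced channel set. The conductances and reversal potentials below are illustrative placeholders, not the paper's fitted values:

```python
# Hypothetical channels for illustration: name -> (conductance uS, reversal mV).
CHANNELS = {
    "leak": (0.033, -70.0),
    "exc":  (0.010,   0.0),
    "inh":  (0.005, -70.0),
}

def total_membrane_current(v_mv):
    """Sum of ohmic channel currents in nA; eq. 2.1 reads
    C_m dV/dt = -(sum of channel currents) + injected current."""
    return sum(g * (v_mv - e_rev) for g, e_rev in CHANNELS.values())

def euler_step(v_mv, i_inject_na, c_m_nf=0.3, dt_ms=0.025):
    """One forward-Euler step of the current-balance equation."""
    dv = (i_inject_na - total_membrane_current(v_mv)) * dt_ms / c_m_nf
    return v_mv + dv
```

With no injected current, repeated steps relax to the conductance-weighted average of the reversal potentials, which is the steady state implied by setting the sum in equation 2.1 to zero.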
The active channels involve fast sodium (I_Na) and delayed-rectifier potassium (I_Dr) currents; they are associated with action potential generation. I_KV3 is a high-threshold potassium channel whose kinetics are well known (Wang, Gan, Forsythe, & Kaczmarek, 1998). Kv3 is abundant in ELL pyramidal cell somata (Turner, Morales, Rashid, & Dunn, 1999; Rashid, Morales, Turner, & Dunn, 1999) and has been included for added realism. The channel parameters (see the appendix) were fitted to experimental voltage-clamp tests of cloned HEK cells (Ray Turner, unpublished observations). I_NaP is a persistent sodium current (Mathieson & Maler, 1988; Turner et al., 1994), which is balanced by I_K1, a persistent potassium current. I_K2 is a slow-activating persistent potassium current inserted to mimic the correct firing adaptation under prolonged stimulus, as observed experimentally (see the appendix). The above six channels are modeled as modified Hodgkin-Huxley channels (Hodgkin & Huxley, 1952; Koch, Bernander, & Douglas, 1995). I_leak corresponds to the standard leak channel. Finally, the synaptic input to the neuron is separated into an excitatory current I_exc and an inhibitory current I_inh. The fitting of parameters is discussed in the appendix.
2.3 Synaptic Modeling. A synaptic response is modeled as an alpha function, g_syn(t) = g_max (t/τ) e^{1−t/τ} (Jack, Noble, & Tsien, 1975; Bernander et al., 1991). The function gives the synaptic conductance change for t > t₀, where t₀ is the synaptic response onset time. Here, g_max is the maximum conductance, reached at t = τ, the synapse time constant. A sigmoidal form was used for the chloride-channel voltage-dependent conductance (for experimental justification of the sigmoidal dependence, see Figure 5c of Legendre & Korn, 1995). This synaptic conductance is still minor in comparison with the conductances underlying the action potential (g_Na, g_KV3, and g_Dr). We chose the following modification of the classical alpha function, in which g_max is made voltage dependent:

g_syn(V_m, t) = g_max [ a / (1 + e^{−(λ + V_m)/χ}) + b ] (t/τ) e^{1−t/τ}.  (2.2)

Here a and b represent the degree of voltage dependence of the synapse, and λ and χ parameterize the sigmoidal dependence assumed for g_max. We assume that each inhibitory synapse has a fixed mean rate of firing with an exponential interspike interval distribution (each is driven by a homogeneous Poisson process). Changing this mean rate varies the amount of inhibition.
Figure 1: (a) Voltage dependence of the two conductances considered in our study. S_vdep refers to the voltage-dependent coefficient (the bracketed term in equation 2.2) with parameters a = 1, b = 0.2, λ = 65 mV, χ = 1 mV, multiplied by g_max = 0.00114 µS. S_vindep is the voltage-independent case, with parameters a = 0, b = 1, multiplied by g_max = 0.001 µS. These parameters are fixed throughout our study. (b) Total time-summed conductance from 250 synapses (summed conductance contributions of all synapses for the length of the simulation, ~1 s), each firing according to a Poisson shot-noise process with a mean rate of 10 Hz, as a function of excitatory input current. The voltage-independent curve shows no dependency on input current. The S_vdep curves depend on input current and in fact are similar to the average subthreshold voltage curves in Figure 2. If we let g_max^vdep = 0.00114 µS and g_max^vindep = 0.001 µS, then the integrals of both curves are approximately equal.
2.4 Quantitative Matching of S_vindep and S_vdep Synapses. Here we compare f-I characteristics for two extreme situations: the voltage-independent synapse (labeled S_vindep: a = 0, b = 1) and a highly voltage-dependent synapse (labeled S_vdep: a = 1, b = 0.2; note that the S_vdep synapse was given some voltage independence to ensure physiological realism). These parameters were adjusted so as to mimic the two- to threefold increase in GABA_A channel conductance over ~20 mV of membrane depolarization observed in ELL pyramidal cells in response to application of exogenous GABA (Berman & Maler, 1998a). The λ and χ parameters were chosen to produce a gradual change in conductance over the whole subthreshold voltage range (V_m < −60 mV). Figure 1a shows the resulting voltage dependencies of g_syn that were used in the S_vindep and S_vdep simulations below. The different g_max values in Figure 1a were chosen based on the following considerations. In the voltage-independent case, the total synaptic conductance, summed over all synapses,
fluctuates around a constant value, regardless of the excitatory input current. However, this is not so for the voltage-dependent case, where higher input currents produce larger depolarizations, and thus larger inhibitory conductances. Figure 1b illustrates this by plotting the total time-summed conductance from 250 synapses of both types (mean rate of 10 Hz) as a function of input current. g_max^vdep in Figure 1a was chosen so that the integrals of these total conductance curves (Figure 1b) are approximately equal over the 0 to 2 nA input current range. This ensures that changes in f-I characteristics due to inhibition, over the input current range of interest, are not dominated by mismatched total conductance magnitudes. For inhibitory input frequencies greater than 10 Hz, the mean subthreshold voltage will decrease, which reduces g_vdep. The matching is nevertheless still good.

3 Simulation Results
The total feedforward and feedback excitation I_exc was modeled as a constant depolarizing current I. For each such current, an average output frequency was computed over a 1000 ms window, following a 100 ms transient period. This was done for currents of 0.1 nA to 2.0 nA in increments of 0.01 nA in order to create an f-I curve for a given rate of inhibitory synaptic input. The inhibitory discharge rate was then varied in order to alter the amount of inhibition, and a new f-I curve was obtained from new simulations.

3.1 Subthreshold Dynamics: Divisive Effects and Gain Control. Figure 2 plots the average subthreshold voltage as a function of constant input current. The rheobase current I_rh refers to the excitatory input current required to initiate spiking for a given rate of inhibition. This current is determined by the onset of spiking in Figure 3. For I < I_rh in Figure 2, the subthreshold voltage fluctuates around its average value, due to the stochastic nature of the inhibitory input. For I > I_rh, only the voltages in the interspike intervals (and, for computational purposes, V < −60 mV) were used to compute the average subthreshold voltage. The shape of the maxima of these curves will be discussed later; the decrease of the average subthreshold voltage at higher input currents is the result of the increasing influence (due to the increasing spiking rate) of the repolarizing potassium currents (I_Dr, I_KV3, and I_K1) that follow spikes. We ran simulations using each synapse type for 10 Hz and 15 Hz inhibitory rates. The average subthreshold voltage was computed in each case and plotted against input current. The general shape, shown in Figure 2, is similar for all plots. The S_vindep synapses (see Figure 2a) give a linear rise to I_rh, as expected from Ohm's law, and as the inhibitory strength increases from 10 Hz to 15 Hz, a divisive effect is clearly seen as a slope reduction (~25%). For S_vdep inhibition (see Figure 2b), the curve grows nonlinearly;
Figure 2: Average subthreshold voltage as a function of input current for (a) S_vindep and (b) S_vdep with 10 and 15 Hz inhibitory firing rates. The rheobase current I_rh and linear fits to the data to the left of I_rh are shown in a. I_rh is shifted further to the right in b, and the gain (slope) increases nonlinearly with input current.
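A power-law fit of the form v ≈ r·I^b, as used to quantify the nonlinear rise in Figure 2b, reduces to linear regression in log-log coordinates. A sketch on synthetic data (the function and the data are illustrative):

```python
import math

def fit_power_law(currents, voltages_above_rest):
    """Least-squares fit of v = r * I**b in log-log coordinates.
    Returns (r, b)."""
    xs = [math.log(i) for i in currents]
    ys = [math.log(v) for v in voltages_above_rest]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    r = math.exp(my - b * mx)
    return r, b

# Synthetic check: data generated with r = 5, b = 0.45 is recovered.
i_data = [0.2, 0.4, 0.8, 1.2]
r, b = fit_power_law(i_data, [5.0 * i ** 0.45 for i in i_data])
```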
a fit to r·I^b yielded b ≈ 0.44 for the 10 Hz case and b ≈ 0.47 for the 15 Hz case. Again, as inhibition was increased from 10 Hz to 15 Hz, the effect on the curve is divisive, with a 20% reduction in r. It is seen that the linear divisive effect in Figure 2a (S_vindep) is constant (fixed slope) over I < I_rh. However, the divisive effect in Figure 2b (S_vdep) has increasing potency
Figure 3: Mean discharge frequency versus excitatory input current for both S_vindep and S_vdep simulations. For each panel, the fixed inhibitory synaptic firing rate is indicated. In all panels the control simulation (no inhibition) is the left curve, the S_vindep simulation is the middle curve, and the S_vdep simulation is the right curve.
(slope decreases) as I approaches I_rh. Further, the shift in rheobase currents between the two rates, I_rh(15 Hz) − I_rh(10 Hz), is enhanced in the voltage-dependent case as a result of the nonuniform gain control. This is an explanation for the increased effectiveness of voltage-dependent inhibition near the onset of firing (see below).
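The f-I protocol described at the start of section 3 (constant current, 100 ms transient discarded, rate measured over a 1000 ms window) can be mimicked with a minimal leaky integrate-and-fire sketch. The threshold, reset, and inhibitory conductance below are illustrative choices; the passive parameters are those quoted in section 2.1:

```python
def lif_firing_rate(i_na, g_inh_us=0.0, r_in=30.1, tau_m=9.1,
                    v_rest=-70.0, v_th=-60.0, e_inh=-70.0,
                    dt=0.025, t_transient=100.0, t_window=1000.0):
    """Mean firing rate (Hz) of a leaky integrate-and-fire cell under a
    constant current i_na (nA), with an optional tonic shunting
    conductance g_inh_us (uS) reversing at e_inh."""
    g_leak = 1.0 / r_in                 # uS
    c_m = tau_m * g_leak                # nF
    v, t, spikes = v_rest, 0.0, 0
    while t < t_transient + t_window:
        i_ion = g_leak * (v - v_rest) + g_inh_us * (v - e_inh)
        v += dt * (i_na - i_ion) / c_m  # forward Euler step
        if v >= v_th:                   # threshold crossing: spike and reset
            v = v_rest
            if t >= t_transient:
                spikes += 1
        t += dt
    return spikes / (t_window / 1000.0)

# f-I curve: sweep the constant depolarizing current, as in the text.
fi_curve = [(i / 100.0, lif_firing_rate(i / 100.0)) for i in range(10, 201, 10)]
```

With these parameters the rheobase is i = (v_th − v_rest)/R_in ≈ 0.33 nA, and adding the shunting conductance (with E_inh at rest) shifts the curve toward higher currents.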
3.2 Suprathreshold Dynamics: Enhanced Subtractive Inhibition for S_vdep. Figure 3 plots the mean firing rate versus input current for both synapse types and for mean inhibitory rates of 0 (control), 5, 10, 15, and 20 Hz. The observed subtractive shift (with respect to control) is seen to increase with this inhibitory rate, in agreement with Holt and Koch's (1997) results obtained with a cat pyramidal cell model. Further, the subtractive shift in the S_vdep curves is greater than for the S_vindep case at equal synaptic firing rates; however, the asymptotic slope at high input current is the same in both cases. The difference in the subtractive shifts between the S_vindep and S_vdep f-I curves increases with inhibitory rate; indeed, an additional 0.25 nA of current is required for the S_vdep simulation to reach the spiking threshold at a 20 Hz inhibitory firing rate as compared to the S_vindep result. Given the matching of the total time-summed conductances over the 0.1–2 nA range, discussed in section 2.4, the increased shift for voltage-dependent inhibition is not simply the result of increased amounts of inhibitory conductance. Instead, the result implies a greater effectiveness in the subtractive gain control of voltage-dependent synapses. By effectiveness, we mean that for equivalent amounts of inhibitory conductance, voltage-dependent synapses shunt a greater amount of depolarizing current than voltage-independent synapses. This effectiveness argument is strengthened by a comparison of the f-I curves of the S_vdep case at 15 Hz and the S_vindep case at 20 Hz, shown in Figure 4a, and their associated total time-summed conductance plots given in Figure 4b (the conductance is summed over 250 synapses for the length of a simulation). Figure 4a shows f-I curves that are nearly identical for both cases, implying that the inhibitory effect is the same for both synapse types. However, Figure 4b shows that the voltage-dependent total time-summed inhibitory conductance is, at all input currents, less than in the voltage-independent case (this is to be expected since the S_vdep synapses are firing ~25% less often than the S_vindep synapses).
Thus, for less overall input conductance, the S_vdep synapses produce equivalent subtractive shifts in f-I curves. This increased effectiveness of voltage-dependent inhibition was predicted for inhibitory input to the Mauthner cell (Faber & Korn, 1987; Legendre & Korn, 1995). The explanation of the increased effectiveness can be understood from Figure 1a, the behavior of the average subthreshold voltage (see Figure 2), and Ohm's law: i_inh = g(V_m, t) · (V_m − E_rev). In section 2.4 we forced the time-summed conductances of the S_vindep and S_vdep synapses to be equivalent, yet this gives no information on how the dynamics of the two synapses differ. We thus give, in Figure 4c, the time series of both the somatic voltage (right axis) and the total instantaneous conductance (at each time step the conductance is summed over all synapses) of S_vindep and S_vdep (left axis) during a simulation in which 250 synapses of both types fired at a mean rate of 10 Hz. The figure shows that the total instantaneous inhibitory conductance for S_vdep follows the somatic voltage, whereas for S_vindep, the conductance is approximately constant over time (with fluctuations due to the stochastic nature of the inhibition). Near the spiking threshold, g_vdep is larger than g_vindep, and the opposite holds after the voltage repolarizes subsequent to a spike, as anticipated from Figure 1a. Furthermore, the battery term in Ohm's law is also larger near spiking threshold than after repolarization. Since i_inh is the
product of g_vdep and this battery term, i_inh is much larger near threshold than in the voltage-independent case, where only the battery term varies with voltage. Thus, interestingly, in the voltage-dependent case, the inhibitory synapses "spread" their conductance more effectively in time, tracking the variations of the battery term. In particular, the nonuniform gain control near rheobase for g_vdep (see Figure 2 and section 3.1) is due to the increased conductance near threshold, which accounts for the enhanced subtractive effectiveness of voltage-dependent inhibition. We have also verified that the time-summed inhibitory currents for Figure 4c are approximately equal for both synapse types, so that the above argument applies to the inhibitory current as well for this example.
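The argument can be checked arithmetically with Ohm's law and the Figure 1 parameters: compare the inhibitory current of each synapse type near threshold (V ≈ −61 mV) and just after repolarization (V ≈ −69 mV). As before, the sign convention inside the sigmoid is our reading of equation 2.2:

```python
import math

E_REV = -70.0   # mV, chloride reversal potential from section 2.1

def sigmoid_factor(v, a=1.0, b=0.2, lam=65.0, chi=1.0):
    """Voltage-dependent coefficient of eq. 2.2 (sign convention assumed)."""
    return a / (1.0 + math.exp(-(lam + v) / chi)) + b

def i_inh(v, gmax, voltage_dependent):
    """Ohm's law: conductance times battery term, in nA (g in uS, V in mV)."""
    g = gmax * (sigmoid_factor(v) if voltage_dependent else 1.0)
    return g * (v - E_REV)

# Ratio of voltage-dependent to voltage-independent inhibitory current,
# using the matched gmax values (0.00114 uS vs. 0.001 uS).
near_thresh = i_inh(-61.0, 0.00114, True) / i_inh(-61.0, 0.001, False)
after_spike = i_inh(-69.0, 0.00114, True) / i_inh(-69.0, 0.001, False)
```

The voltage-dependent synapse passes relatively more current near threshold and relatively less after repolarization, which is the "spreading in time" described above.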
3.3 Suprathreshold Dynamics: Subtractive and Divisive Regimes. The f-I curves have an increasingly sigmoidal shape as inhibition is increased, a result of combined divisive and subtractive effects. This can be shown by separating the f-I curves into a 0–40 Hz and a 40–100 Hz region and performing a linear regression on the data in each region to obtain f-I curve slope values. Figure 5a plots the fitted slope as a function of inhibitory rate for each region and synapse type. The region f > 40 Hz shows almost no change in slope for either type of inhibition: only subtractive effects are seen. Holt and Koch (1997) paid close attention to such a region where, they argued, the spike-repolarizing potassium current clamps the mean membrane potential to more negative values. The shunting inhibition then has the effect of a constant current source. Figures 2 and 5b show that the average subthreshold voltage also becomes clamped (i.e., relatively constant) at higher input currents; according to Figure 5b, this corresponds to f > 90 Hz. Thus, this clamping mechanism is also at work in our simulations. Interestingly, Figure 5a shows that a purely subtractive effect exists for f > 40 Hz, that is, before the voltage is clamped, and, further, that this effect occurs for both synapse types. A divisive effect in the suprathreshold regime (a reduction in slope with increasing inhibition) is seen for both synapse types at lower frequencies (f < 40 Hz; see Figure 5a). Figure 5b shows that this divisive regime and the maximum of the voltage curve occur over the same input current range. There, the inhibitory effect is strong since the battery and conductance terms are large. Changes in excitatory input current then produce only small changes in firing frequency. Increasing the inhibitory firing rate will broaden the width of the average subthreshold voltage maximum, causing
Figure 4: Facing page. (a) Discharge frequency versus excitatory input current. The open squares are for Svindep synapses with a 20 Hz firing rate, and the solid squares are for Svdep synapses with a 15 Hz rate. (b) Total time-summed conductance of both synapse types as a function of excitatory input current, with the inhibitory firing rates given in a. Note that for all input currents, the Svindep conductance is larger than the Svdep conductance since the Svdep synapses fire ≈ 25% less often. (c) Total instantaneous inhibitory conductance for both synapse types is plotted as a function of time (the heavy dashed line shows the Svindep conductance, and the dotted line represents the Svdep conductance, both plotted on the left axis) along with the somatic voltage caused by all ionic currents (solid line plotted on the right axis). The excitatory current was set at 1.5 nA. The simulation included 250 synapses with Svdep parameters and 250 synapses with Svindep parameters, both in the soma compartment. Each synaptic firing rate was set at 10 Hz. The Svindep conductance fluctuates around a constant, but these fluctuations are not correlated with the voltage, as in the Svdep case. Note that the data presented in a and b pertain to the same simulations, while c is a separate simulation, with particulars given above.
[Figure 5 graphic: (a) discharge slope (Hz/nA) versus inhibitory rate (Hz) for Svindep and Svdep in the 0–40 Hz and 40–100 Hz regions; (b) discharge frequency (Hz) and subthreshold voltage (mV) versus input current (nA), with the maximum of the voltage curve and the divisive section of the f-I curve marked.]

B. Doiron, A. Longtin, N. Berman, and L. Maler
Figure 5: (a) Slope of the f-I curves (as those in Figure 3) as a function of inhibitory firing rate. The slope was determined by linear regression in two regions: 0–40 Hz and 40–100 Hz. In the 0–40 Hz region, the slope decreases with inhibition for both Svindep and Svdep inhibition, that is, a divisive effect is seen. In the 40–100 Hz region, no change of slope for either inhibition type is seen; this is a subtractive effect. (b) Mean discharge frequency and average subthreshold voltage as a function of input current for an Svdep simulation with a mean inhibitory rate of 15 Hz at each synapse. The divisive region is marked. The region around the maximum voltage coincides with the divisive section of the f-I curve.
an initial divisive effect. This broadening of the voltage maximum is linked to the stochastic nature of the inhibition. In the absence of inhibitory input, the average subthreshold voltage in our model displays a sharp peak at the onset of firing (not shown). This is also the case when the mean inhibition is modeled using constant (deterministic) conductances (reduction via Campbell's theorem; see Bernander et al., 1991; Holt & Koch, 1997); consequently, f-I curves then have an abrupt onset (not shown). However, a nonzero variance of the total stochastic inhibitory input causes a broadening of the maximum of the average subthreshold voltage versus input current curve (see Figure 5b); the broadening is proportional to this variance, which depends linearly on the inhibitory firing rate (by Campbell's theorem). A broad maximum is required to give the divisive regime a sufficient current interval to underlie a firing-rate region of approximately 40 Hz. This suggests that the stochastic nature of the synaptic input causes the overall sigmoidal shape of the f-I curve. We elaborate on the nature of the dependence of the divisive regime on stochastic input in the next section. Note that the Svdep simulations produce a slightly larger divisive effect since their inhibition is more effective during the voltage maximum than the
Svindep cases. This is illustrated in Figure 5a in that for equivalent inhibitory rates, the estimated slope in the divisive regime is always smaller for the Svdep synapse as compared to the Svindep one. However, the transition from divisive to subtractive inhibition does occur for both Svindep and Svdep, implying that voltage dependency is not required to produce this transition (see Figures 3 and 5a).

4 Analytical Model
In this section, we further analyze our results using an LIF model. Let g_shunt(V_m) be the conductance associated with a shunting inhibitory channel. For analytic simplicity, its voltage dependency is chosen linear in the membrane voltage V_m rather than sigmoidal:

    g_shunt(V_m) = g (a V_m / k + b).    (4.1)
Here g is the overall strength of the inhibition, a and b are dimensionless parameters that control the degree of voltage dependency (similar to a and b in equation 2.2), and k is introduced to preserve dimensionality and force both linear and constant terms to be similar in magnitude (we choose k = 5 mV, the midpoint between reset and threshold voltage). Assuming only this shunting inhibition and a constant excitation, we have

    C_m dV_m/dt + g (a V_m² / k + b V_m) = I.    (4.2)
The subthreshold case yields the steady-state voltage V_ss:

    V_ss = −bk/(2a) + (1/2) √( b²k²/a² + 4kI/(ga) ).    (4.3)
For b ≫ a (voltage-independent inhibition), V_ss is linear in I, while for b ≪ a (voltage-dependent inhibition), V_ss rises nonlinearly, as is the case in Figure 2b for the full ionic model. The case where the steady-state voltage equals the threshold voltage, V_ss = V_thres, corresponds to I = I_rh. Inserting these relations into equation 4.3 yields

    I_rh = a g V_thres² / k + g b V_thres.    (4.4)
Equation 4.4 thus shows that as the inhibition becomes increasingly voltage dependent (i.e., as a/ b increases), the rheobase current Irh shifts to higher values. This shift is also seen in the compartmental simulations (see Figure 3).
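Equations 4.3 and 4.4 are easy to check numerically; the sketch below uses assumed values for g, k, and V_thres and verifies that the steady-state voltage reaches threshold exactly at the rheobase, and that the rheobase grows with the degree of voltage dependence:

```python
import math

# Numerical check of equations 4.3 and 4.4 (g, k, and Vth chosen for illustration).
g, k, Vth = 10e-9, 5e-3, 10e-3   # S, V, V

def Vss(I, a, b):
    """Steady-state subthreshold voltage, equation 4.3."""
    return -b * k / (2 * a) + 0.5 * math.sqrt((b * k / a) ** 2 + 4 * k * I / (g * a))

def Irh(a, b):
    """Rheobase current, equation 4.4."""
    return a * g * Vth ** 2 / k + g * b * Vth

# At the rheobase, the steady-state voltage equals the threshold voltage.
assert abs(Vss(Irh(0.9, 0.1), 0.9, 0.1) - Vth) < 1e-9
# Increasing the voltage dependence (a/b) shifts the rheobase to higher values.
assert Irh(0.9, 0.1) > Irh(0.1, 0.9)
```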
[Figure 6 graphic: discharge frequency (Hz) versus input current (A, up to 2.5 × 10⁻⁹) for g = 10 nS (a) and g = 40 nS (b), each showing the control curve and the curves for a = 0.1, b = 0.9 and for a = 0.9, b = 0.1.]
Figure 6: Mean firing frequency versus input current in the LIF model. (a) Constant g of 10 nS (weak inhibition). (b) Constant g of 40 nS (strong inhibition). In each graph, the control case for g = 0 is shown. Each graph displays results for weak voltage dependence (a < b) and strong voltage dependence (a > b). Other parameter values are Cm = 1 nF, k = 5 mV, Vth = 10 mV, t0 = 0.001 s.
Further, equation 4.2 is an analytically tractable Riccati equation (Davis, 1962), with solution for V_m(0) = 0:

    V_m(t) = (γ/(2ga)) tanh( γt/(2kC_m) + arctanh(kgb/γ) ) − bk/(2a),    (4.5)

where γ = √((kgb)² + 4agIk). We can derive the frequency of discharge f from equation 4.5. Letting V_m(T) = V_thres yields the period T between spikes:

    T = (2kC_m/γ) [ arctanh( (2gaV_thres + gbk)/γ ) − arctanh( kgb/γ ) ].    (4.6)
The frequency of discharge is then f = 1/(T + t0), where we have added an absolute refractory period t0 (≈ 1 ms). Figure 6 plots this frequency versus input current for two extreme cases: high b and low a (weakly voltage dependent) and low b and high a (strongly voltage dependent). We chose to keep a + b = 1 in all plots so as to ensure that the inhibitory strength is controlled by g only, and not by a and b. Increasing g corresponds to increasing the inhibitory firing rate in the compartmental model. According to Campbell's theorem, these two quantities are linearly related. The subtractive effect and its increase with g, found in the compartmental model, is clearly reproduced by this LIF model, as seen in Figure 6. The increased subtractive effectiveness for the voltage-dependent case (for fixed g) is also seen.
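The f-I curves of Figure 6 follow directly from equations 4.4 and 4.6; a minimal sketch using the parameter values stated in the Figure 6 caption (the comparison current of 1.5 nA is our choice):

```python
import math

# f-I curve of the LIF model from equations 4.4 and 4.6 (parameters as in Figure 6).
Cm, k, Vth, t0 = 1e-9, 5e-3, 10e-3, 1e-3   # F, V, V, s

def firing_rate(I, g, a, b):
    """Mean firing rate f = 1/(T + t0); g = 0 gives the uninhibited control.

    The closed form of equation 4.6 assumes a > 0 whenever g > 0."""
    if g == 0.0:
        return I / (Cm * Vth + t0 * I) if I > 0 else 0.0   # pure integrator
    Irh = a * g * Vth ** 2 / k + g * b * Vth               # rheobase, eq. 4.4
    if I <= Irh:
        return 0.0                                         # subthreshold: silent
    gamma = math.sqrt((k * g * b) ** 2 + 4 * a * g * I * k)
    T = (2 * k * Cm / gamma) * (math.atanh((2 * g * a * Vth + g * b * k) / gamma)
                                - math.atanh(k * g * b / gamma))   # eq. 4.6
    return 1.0 / (T + t0)

I_test = 1.5e-9   # comparison current (our choice)
f_control = firing_rate(I_test, 0.0, 0.0, 0.0)
f_weak_vd = firing_rate(I_test, 10e-9, 0.1, 0.9)    # weakly voltage dependent
f_strong_vd = firing_rate(I_test, 10e-9, 0.9, 0.1)  # strongly voltage dependent
# Inhibition shifts the curve rightward, more so when strongly voltage dependent.
assert f_control > f_weak_vd > f_strong_vd
```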
Note that in our LIF model, we have used a linear approximation to the sigmoidal dependence used in the compartmental model. Nevertheless, qualitatively similar results are obtained in both cases. This indicates that the exact dependence of g(Vm) on Vm is not important to establish increased effectiveness of voltage-dependent conductance; only a monotonic increasing function is required. As verification, we have also incorporated a sigmoidal dependence in our LIF model and obtained f-I curves using numerical integration (fourth-order Runge-Kutta, fixed time step of 10⁻⁵). The results (not shown) are qualitatively similar. This not only illustrates the robustness of the model mentioned above, but also provides further evidence that the increased subtractive effectiveness of voltage-dependent inhibition is a consequence of the voltage dependence rather than of some unsuspected dynamical effect in the full ionic model. Finally, we recall that the divisive effect (f < 40 Hz) seen in the compartmental model was a consequence of the stochastic nature of the inhibition. Our LIF model does not have a stochastic input, and its average subthreshold voltage curve has a sharp peak at the onset of periodic firing (not shown). This deterministic model produces purely subtractive f-I curve shifts through inhibition (see Figure 6). However, the inclusion of stochastic input in the LIF model is known to produce sigmoidal f-I curves as in Figure 3 (see, e.g., Lánský & Sato, 1999, and Figure 7 in this article). Stochastic forcing also broadens the peak of the average subthreshold voltage versus input current curves as in Figure 5b (not shown). In view of this, we have set out to determine (1) whether a subtractive effect is also present with stochastic synaptic input and (2) whether stochastic input produces a divisive regime at lower firing frequencies, as in the compartmental model.
For simplicity we considered only the voltage-independent case (a = 0) of equation 4.1, since divisiveness was also seen for the Svindep synapses in the compartmental simulations (see Figure 5a). There are a variety of ways in which a stochastic synaptic model with reversal potentials can be approximated by diffusion models (Lánský & Sato, 1999). Here we let the conductance g in the LIF model be a stochastic quantity by setting g = ḡ + σ(ḡ)η(t), where ḡ is the mean conductance and η(t) is a stochastic process scaled so that the conductance has standard deviation σ(ḡ). To match the smoothness of the conductance fluctuations in the compartmental model, we model η(t) as an Ornstein-Uhlenbeck process (lowpass-filtered gaussian white noise) with correlation time τ = 75 ms; our results were not qualitatively sensitive to this correlation time. Equation 4.2 thus becomes a stochastic differential equation with multiplicative noise (since the noise term multiplies the state variable V_m):

    C_m ∂V_m/∂t = −[ḡ + σ(ḡ)η(t)] V_m + I,    (4.7′)

    ∂η/∂t = (−η + ξ(t))/τ,    (4.7″)
where ξ is gaussian white noise with zero mean and unit standard deviation. Numerical simulations produced sigmoidal curves of mean firing rate versus input current I (see Figure 7), as expected. Increases in inhibitory firing rate in the compartmental model were modeled here as increases in mean conductance ḡ; these increases are linearly related, as discussed above. Further, and also in accordance with Campbell's theorem, the variance σ² of the conductance fluctuations was increased linearly with the inhibitory firing rate, and thus with ḡ. Figure 7a shows that the inclusion of this stochastic forcing produces f-I curve shifts that are qualitatively similar to those found for the deterministic LIF and compartmental models; hence, subtractive inhibition (Holt & Koch, 1997) is maintained in the presence of such forcing. Figure 7a also clearly shows a divisive effect as ḡ increases, as in the compartmental model. The origin of this effect here may lie in the coupling of the variance of the fluctuations to their mean or in the multiplicative nature of the noise, or in both. Figure 7b presents f-I curves from simulations in which the variance was made independent of ḡ. Although the noise is still multiplicative, divisiveness is no longer seen. We have also verified that a stochastic LIF with additive noise (i.e., stochastic forcing is simply added to the right-hand side of equation 4.2) shows divisiveness as well, as long as the variance increases with the mean conductance (not shown). Thus, the divisiveness seen in the LIF model, and presumably in the compartmental model as well, arises from increases in the variance of the stochastic forcing that accompanies increases in inhibitory input. The origin of the effect can be seen by comparing the right-most curves in Figures 7a and 7b. The point of firing onset is one of the two points used for estimating the slope of the f-I curve at lower discharge frequencies.
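The variance coupling discussed above can be probed by integrating equations 4.7′ and 4.7″ directly (Euler-Maruyama). The sketch below uses illustrative values (in particular a noise amplitude larger than the article's) so that the noise-induced advance of the firing onset is visible in a short run:

```python
import math, random

# Euler-Maruyama integration of equations 4.7' and 4.7'' (illustrative values;
# the noise amplitude is an assumption chosen so the effect shows quickly).
Cm, Vth, t0 = 1e-9, 10e-3, 1e-3      # F, V, s
dt, tau, T_sim = 1e-4, 75e-3, 5.0    # s

def firing_rate(I, gbar, sigma, seed=0):
    rng = random.Random(seed)
    V, eta, spikes, hold = 0.0, 0.0, 0, 0
    for _ in range(int(T_sim / dt)):
        # OU process with unit stationary variance: tau * d(eta)/dt = -eta + xi(t)
        eta += -eta * dt / tau + math.sqrt(2 * dt / tau) * rng.gauss(0.0, 1.0)
        if hold > 0:                       # absolute refractory period t0
            hold -= 1
            continue
        g = max(gbar + sigma * eta, 0.0)   # conductance cannot go negative
        V += (I - g * V) * dt / Cm
        if V >= Vth:
            spikes += 1
            V, hold = 0.0, int(t0 / dt)
    return spikes / T_sim

gbar = 30e-9      # mean conductance, as in Figure 7a
I_sub = 0.28e-9   # just below the deterministic rheobase gbar * Vth = 0.3 nA
assert firing_rate(I_sub, gbar, 0.0) == 0.0        # no noise: silent
assert firing_rate(I_sub, gbar, 0.3 * gbar) > 0.0  # fluctuations advance onset
```

Firing below the deterministic rheobase is precisely the leftward shift of the onset point that, when the variance grows with ḡ, produces the divisive regime of Figure 7a.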
The second point is determined by the intersection of the stochastic f-I curve with a line of constant firing frequency (e.g., 40 Hz was used in the compartmental model; 10 Hz could be used for the LIF model in Figure 7). The onset of firing in Figure 7a occurs at a lower input current than in Figure 7b; this is a consequence of the higher conductance variance in Figure 7a. Since the estimated slope is inversely proportional to the input current interval between these two points, it is smaller in Figure 7a than in Figure 7b. Further, in the voltage-dependent conductance case, the effectiveness of the inhibition is increased, particularly near the onset of firing (see Figure 1b). This increases the input current interval used for the slope estimation, yielding a smaller slope in comparison with the voltage-independent case. Thus, voltage dependence further enhances divisiveness.

5 Conclusion and Outlook
We have presented a morphologically and physiologically realistic compartmental model of a pyramidal cell in the electrosensory lateral line lobe of a weakly electric fish. This model was then used to investigate the effect of voltage-dependent inhibitory conductances on firing rate versus input
[Figure 7 graphic: discharge frequency (Hz, 0–25) versus input current (nA, 0–0.8) for the stochastic LIF model, panels (a) and (b), each for ḡ = 30 nS and ḡ = 90 nS.]
Figure 7: Mean firing rate f versus input current I for the stochastic LIF model, given by equations 4.7′ and 4.7″. Firing rates were calculated at 0.01 nA steps using 10 sec simulations for each input current. (a) The mean inhibitory conductance (voltage independent) is proportional to the firing rate of Poisson inputs, while the standard deviation is proportional to the square root of this firing rate (Campbell's theorem). Here we choose σ = aḡ with a = 3 × 10⁻⁴. A clear divisive effect at low firing frequencies (f < 10 Hz) is present as ḡ increases from 30 nS to 90 nS. (b) The standard deviation of the conductance is now made constant at σ = b, with b = 5 × 10⁻⁸. No divisiveness is present as ḡ increases. In all plots, the deterministic curve (σ = 0) is superimposed. The stochastic integration used a fixed-step (10⁻⁵ sec) fourth-order Runge-Kutta method combined with the Box-Muller algorithm for the gaussian deviates.
current curves. To our knowledge, this article is the first to model this dependence explicitly, and it highlights the importance of this dependence, especially when dealing with the subthreshold regime. Our results show
that this kind of shunting inhibition enhances the subtractive shifts reported in Holt and Koch (1997) and indicates an increased effectiveness of voltage-dependent inhibitory conductances compared to the commonly assumed voltage-independent ones. This is due to the tracking of the membrane potential by the inhibitory conductance, such that the inhibitory effect is maximal near the onset of spiking. One interesting implication of this increased effectiveness is that small changes in inhibitory conductance, arising, for example, from synaptic plasticity, can have strong effects on gain control in sensory systems (Nelson, 1994; Nelson & Paulin, 1995). We have also uncovered a divisive regime at low firing frequencies. It relates to the broadening of the peak of the average subthreshold voltage versus input current characteristic, a consequence of the stochastic nature of the inhibitory synaptic input. Our study of the LIF model reveals that the conductance fluctuations used to model the randomness of the Poisson inputs in the compartmental model are responsible for producing a divisive regime at lower discharge frequencies; this is true provided that the variance of these fluctuations increases with the mean conductance. This divisiveness is enhanced by the voltage dependence of the conductance. However, these fluctuations do not alter the subtractive nature of the f-I curves at higher discharge frequencies or their enhancement by voltage-dependent conductances. The spontaneous firing rate of some classes of ELL pyramidal cells is well below 40 Hz (Bastian & Courtwright, 1991). During electrolocation of prey objects (Nelson & MacIver, 1999), only small alterations of electroreceptor input occur, and it is therefore likely that firing-rate modulations of these pyramidal cells will remain in the divisive regime. Therefore, divisive inhibition of ELL pyramidal cells is functionally relevant for electroreception and likely relevant for other neural networks as well.
Future work should analyze further the possible interactions of voltage-dependent inhibition with various voltage-gated inward currents present in ELL pyramidal cells (NMDA: Berman et al., 1997; persistent sodium channels: Stuart, 1999; Turner et al., 1994; potassium conductances: Berman & Maler, 1998b). Further, our study has been performed for fixed inhibitory rates. However, the inhibitory firing rates are likely to be correlated with ELL pyramidal cell firing rates (Berman & Maler, 1999). It will be interesting to investigate the effect of voltage-dependent inhibition in closed loop (Nelson, 1994; Nelson & Paulin, 1995). Finally, it will be important to eventually incorporate more realistic excitatory synaptic input with variable degrees of correlation with inhibitory input (Berman & Maler, 1999).

Appendix
In this appendix, we provide details on the estimation of parameters for the pyramidal cell ionic model. The qualitative conclusions of our article do not depend sensitively on the exact values of our parameters. Figure 8
[Figure 8 graphic: discharge frequency (Hz) versus time (ms) for current injections of 0.3, 0.6, and 1.2 nA.]
Figure 8: Mean discharge frequency versus time from our simulations and from a single cell in vitro recording (Berman & Maler, 1998b). Parameters are as in Table 1. The bottom two traces give the experimental and simulation results for a current injection of 0.3 nA, the middle two for a current of 0.6 nA, and the top two for a current of 1.2 nA. The simulations were compared to all currents from 0.1 to 1.2 nA in steps of 0.1 nA (the other nine plots are not shown).
plots firing frequency versus time for both a real ELL basilar pyramidal cell under constant depolarization (0.3 nA, 0.6 nA, and 1.2 nA injections) and the compartmental model under equivalent conditions. We adjusted active and leak channel parameters to obtain reasonable agreement between simulated and measured firing frequency versus time plots for a range of excitatory input currents (Mathieson & Maler, 1988). The final parameters are shown in Table 1. Good agreement is found for 0.3 and 0.6 nA, with increasing discrepancies as the current increases. The inclusion of a fourth potassium conductance, which will mimic the proper firing saturation at high currents, is planned. The growth of the model is an ongoing process, yet this parameter choice is deemed reasonable, given the variability observed in such experimental data and the issues addressed here.

Acknowledgments
We thank Martin St. Hilaire and Maurice Chacron for useful discussions, as well as Michael Hines for useful advice on the use of NEURON and the Koch
Table 1: Ionic Channel Parameters.

                  INa           IDr     INaP    IK1     IK2     IKV3          Ileak
 Ion              Na+           K+      Na+     K+      K+      K+            N/A
 γ                m² h          m²      m³      m       m       m³ h          N/A
 a / b (mV)       40/−3, 45/3   40/−3   50/−7   42/−5   30/−1   0/−18, 0/40   N/A
 τ (ms)           0.2, 0.6      0.4     0.1     2.0     4000    2.0, 2.0      N/A
 Erev (mV)        +50           −95     +50     −95     −95     −95           −70
 gmax (S/cm²)     0.85          0.5     0.03    0.01    0.015   5             7E-5

Notes: Each current is described by I = gmax γ (Vm − Erev). Erev is the reversal potential of the channel. gmax is the maximum conductance of the channel. γ = m^i h^j is the product of the activation and inactivation variables raised to powers i and j, respectively. τ is the voltage-independent time constant associated with m or h. a and b are the constants appearing in the sigmoid expression for m∞ and h∞: x∞ = 1 / (1 + exp((Vm + a)/b)). The state variable x evolves as ∂x/∂t = (x∞ − x)/τ. The columns for INa and IKV3 contain double entries in some blocks; the first entry is associated with the m parameter, the second with the h.
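The gating kinetics in the notes can be implemented directly; a sketch of the gating-variable update using the INa activation parameters from the table (a = 40, b = −3, τ = 0.2 ms). The forward-Euler step size and the clamp voltages are our choices:

```python
import math

def x_inf(Vm, a, b):
    """Steady-state gating value x_inf = 1 / (1 + exp((Vm + a) / b)), Table 1 notes."""
    return 1.0 / (1.0 + math.exp((Vm + a) / b))

def gate_step(x, Vm, a, b, tau, dt):
    """Forward-Euler step of dx/dt = (x_inf - x) / tau."""
    return x + (x_inf(Vm, a, b) - x) * dt / tau

# INa activation m: a = 40, b = -3, tau = 0.2 ms (Table 1). Units: mV, ms.
a_m, b_m, tau_m = 40.0, -3.0, 0.2
assert abs(x_inf(-40.0, a_m, b_m) - 0.5) < 1e-12          # half-activated at -40 mV
assert x_inf(-20.0, a_m, b_m) > x_inf(-60.0, a_m, b_m)    # opens with depolarization

# Relax m toward its steady state under a voltage clamp at -20 mV (dt = 0.01 ms).
m = 0.0
for _ in range(1000):
    m = gate_step(m, -20.0, a_m, b_m, tau_m, 0.01)
assert abs(m - x_inf(-20.0, a_m, b_m)) < 1e-3
```

The negative b for activation variables makes x∞ rise with depolarization, while the positive b of the INa inactivation gate (a = 45, b = 3) makes h fall, as expected for a transient sodium current.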
lab for making their NEURON templates available. Valuable discussion with Ray Turner was necessary for proper model kinetics. This work was supported by NSERC (B. D., A. L.) and MRC (L. M., N. B.) Canada.

References

Abbott, L. F. (1991). Realistic synaptic inputs for model neural networks. Network, 2, 245–258.
Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
Bastian, J., & Courtwright, J. (1991). Morphological correlates of pyramidal cell adaptation rate in the electrosensory lateral line lobe of weakly electric fish. J. Comp. Physiol. A, 168, 393–407.
Berman, N., & Maler, L. (1998a). Inhibition evoked from primary afferents in the electrosensory lateral line lobe of the weakly electric fish. J. Neurophysiol., 80, 3173–3196.
Berman, N., & Maler, L. (1998b). Interaction of GABA_B-mediated direct feedback inhibition with voltage-gated currents of pyramidal cells in the electrosensory lateral line lobe. J. Neurophysiol., 80, 3197–3213.
Berman, N., & Maler, L. (1998c). Distal vs proximal inhibitory shaping of feedback excitation in the lateral line lobe. J. Neurophysiol., 80, 3214–3232.
Berman, N., & Maler, L. (1999). Neural architecture of the electrosensory lateral line lobe: Adaptations for coincidence detection, a sensory searchlight and frequency-dependent adaptive filtering. J. Exp. Biol., 202, 1243–1253.
Berman, N., Plant, J., Turner, R., & Maler, L. (1997). Excitatory amino acid receptors at a feedback pathway in the electrosensory system: Implications for the searchlight hypothesis. J. Neurophysiol., 78, 1869–1881.
Bernander, O., Douglas, R., Martin, K., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573.
Carandini, M., & Heeger, D. (1994). Summation and division by neurons in primate visual cortex. Science, 264, 1333–1336.
Davis, H. (1962). Introduction to nonlinear differential and integral equations. New York: Dover.
Faber, D., & Korn, H. (1987). Voltage-dependence of glycine-activated Cl⁻ channels: A potentiometer for inhibition? J. Neuroscience, 7, 807–811.
Hines, M. L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Comp., 9, 1179–1209.
Hodgkin, A., & Huxley, A. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Holt, G., & Koch, C. (1997). Shunting inhibition does not have a divisive effect on firing rates. Neural Comp., 9, 1001–1013.
Jack, J., Noble, D., & Tsien, R. (1975). Electric current flow in excitable cells. New York: Oxford University Press.
Koch, C., Bernander, O., & Douglas, R. (1995). Do neurons have a voltage or a current threshold for action potential initiation? J. Comp. Neuroscience, 2, 63–82.
Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation. Orlando, FL: Academic Press.
Lánský, P., & Sato, S. (1999). The stochastic diffusion models of nerve membrane depolarizations and interspike interval generation. J. Peripheral Nervous System, 4, 27–42.
Legendre, P., & Korn, H. (1995). Voltage dependence of conductance changes evoked by glycine release in the zebrafish brain. J. Neurophysiol., 73, 2404–2412.
Mainen, Z. F., & Sejnowski, T. J. (1998). Modeling active dendritic processes in pyramidal neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks. Cambridge, MA: MIT Press.
Mathieson, W. B., & Maler, M. (1988). Morphological and electrophysiological properties of a novel in vitro preparation: The electrosensory lateral line lobe brain slice. J. Comp. Physiol. A, 163, 489–506.
Nelson, M. E. (1994). A mechanism for neuronal gain control by descending pathways. Neural Comp., 6, 255–269.
Nelson, M. E., & MacIver, M. (1999). Prey capture in the weakly electric fish Apteronotus albifrons: Sensory acquisition strategies and electrosensory consequences. J. Exp. Biol., 202, 1195–1203.
Nelson, M. E., & Paulin, M. G. (1995). Neural simulations of adaptive reafference suppression in the elasmobranch electrosensory system. J. Comp. Physiol. A, 177, 723–736.
Rashid, A., Morales, E., Turner, R. W., & Dunn, R. J. (1999). Molecular characterization of two Kv3-type K⁺ channels in a vertebrate sensory neuron. XXIX Proc. Soc. Neurosci., 25, 2248.
Rose, D. (1977). On the arithmetical operation performed by inhibitory synapses onto the neuronal soma. Exp. Brain Res., 28, 221–223.
Segal, M., & Barker, J. (1984). Rat hippocampal neurons in culture: Properties of GABA-activated Cl⁻ ion conductance. J. Neurophysiol., 51, 500–515.
Stuart, G. (1999). Voltage-activated sodium channels amplify inhibition in neocortical pyramidal neurons. Nature Neuroscience, 2, 144–150.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Nat. Acad. Sci., 94, 719–723.
Turner, R. W., Maler, L., Deerinck, T., Levinson, S., & Ellisman, M. (1994). TTX-sensitive dendritic sodium channels underlie oscillatory discharge in a vertebrate sensory neuron. J. Neurosci., 14, 6453–6471.
Turner, R. W., Morales, E., Rashid, A., & Dunn, R. J. (1999). Characterization of an Apteronotid Kv3 potassium channel. XXIX Proc. Soc. Neurosci., 25, 193.
Wang, L. Y., Gan, L., Forsythe, I. D., & Kaczmarek, L. K. (1998). Contribution of the Kv3.1 potassium channel to high-frequency firing in mouse auditory neurons. J. Physiol. Lond., 509, 183–194.
Yoon, K. W. (1994). Voltage-dependent modulation of GABAA receptor channel desensitization in rat hippocampal neurons. J. Neurophysiol., 71, 2151–2160.

Received August 6, 1999; accepted March 7, 2000.
Neural Computation, Volume 13, Number 2
CONTENTS Review
Spatiotemporal Connectionist Networks: A Taxonomy and Review Stefan C. Kremer
249
Notes
Formulations of Support Vector Machines: A Note from an Optimization Point of View Chih-Jen Lin
307
Minimal Feedforward Parity Networks Using Threshold Gates Hon-Kwok Fung and Leong Kwan Li
319
Letters
Does Corticothalamic Feedback Control Cortical Velocity Tuning? Ulrich Hillenbrand and J. Leo van Hemmen
327
A Competitive-Layer Model for Feature Binding and Sensory Segmentation Heiko Wersing, Jochen J. Steil, and Helge Ritter
357
Learning Object Representations Using A Priori Constraints Within ORASSYLL Norbert Krüger
389
Binding and Normalization of Binary Sparse Distributed Representations by Context-Dependent Thinning Dmitri A. Rachkovskij and Ernst M. Kussul
411
Resolution-Based Complexity Control for Gaussian Mixture Models Peter Meinicke and Helge Ritter
453
REVIEW
Communicated by C. Lee Giles
Spatiotemporal Connectionist Networks: A Taxonomy and Review Stefan C. Kremer Guelph Natural Computation Group, Department of Computing and Information Science, University of Guelph, Guelph, Ontario, N1G 2W1 Canada
[email protected]

This article reviews connectionist network architectures and training algorithms that are capable of dealing with patterns distributed across both space and time—spatiotemporal patterns. It provides common mathematical, algorithmic, and illustrative frameworks for describing spatiotemporal networks, making it easier to compare and contrast their representational and operational characteristics. Computational power, representational issues, and learning are discussed. In addition, references to the relevant source publications are provided. This article can serve as a guide to prospective users of spatiotemporal networks by providing an overview of the operational and representational alternatives available.

1 Introduction
In this article, we propose a new taxonomy for characterizing connectionist approaches to learning input-output relationships in which data are distributed across both space and time—spatiotemporal patterns. The ability to learn these types of relationships is directly applicable to dynamical system identification and control (Alippi & Piuri, 1996), syntactic pattern recognition (Fu, 1982), and grammatical induction (Angluin & Smith, 1983). Throughout the article, we maintain a problem-driven philosophy rather than an architecture-driven approach, though we do restrict ourselves broadly to the connectionist paradigm. In this regard, the taxonomy presented is not a taxonomy of recurrent networks but rather of spatiotemporal networks. Since there are certainly recurrent networks that are also spatiotemporal, these are discussed, yet other recurrent networks, such as those that are designed to settle to a stable fixed point, are omitted. Conversely, we describe a number of nonrecurrent networks that are nonetheless used to classify, identify, or generate spatiotemporal sequences. The problem-driven philosophy differentiates the taxonomy presented here from previously described architectural approaches (Mozer, 1994; Horne & Giles, 1995; Tsoi & Back, 1994), which we examine in section 3.2. Neural Computation 13, 249–306 (2001)
© 2001 Massachusetts Institute of Technology
250
Stefan C. Kremer
The new taxonomy is based on the four fundamental design decisions that the designer of any spatiotemporal sequence processing system must make: (1) selecting a representation for state transition functions, (2) selecting possible output functions, (3) defining the initial state of the system, and (4) choosing an algorithm for parameter update. This design-decision approach ensures that the scope of the taxonomy is broad enough to accommodate any current or future spatiotemporal connectionist networks, something that could not be accomplished with an architecture-based taxonomy. We apply our taxonomy to the leading connectionist networks and describe their relation to each other. In section 2, we define the notation and terminology used. In section 3, we identify features of good taxonomies, survey several existing categorization schemes, and identify some goals for the development, in section 4, of a new taxonomy possessing the ideal features identified. This new taxonomy is structured along four design decisions: (1) how the state of the system is computed (discussed in section 5), (2) how the output of the system is computed (section 6), (3) how the system is initialized (section 7), and (4) how the system is trained (section 8). In section 10, we present some conclusions.

2 Terminology and Mathematical Preliminaries

2.1 Spatiotemporal Connectionist Networks. We define a spatiotemporal connectionist network (STCN) as a parallel distributed information processing structure that is capable of dealing with input data presented across time as well as space. One might argue that this definition is too informal and describes too diverse a set of computational systems to be of practical use; however, it is difficult to assign a more precise mathematical definition to STCNs that does not contradict the common use of the term to describe a variety of very diverse architectures.
In this article, we provide precise mathematical descriptions of the most popular classes of STCNs and their constituent components, as opposed to dwelling on the formulation of one definition for all STCNs. This will allow us to develop formal descriptions of STCNs with specific properties, without unnaturally restricting the definition of STCN in general. The following formalisms serve as a basis for the subsequent description of specific STCN components.

2.2 A Note About Notation.

The field of STCNs has attracted researchers from a wide range of specializations, including computer science, engineering, linguistics, and psychology. Perhaps this partially accounts for the inconsistent notations employed throughout the existing STCN literature. In this article, we have attempted to use the more common mathematical representations, while maintaining consistency across all architectures described. We begin by introducing these mathematical formalisms.
It is convention in algebra to interpret a vector \(\vec{x} = (x_1, x_2, x_3, \ldots)\) as being equivalent to the column matrix

\[ \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \end{pmatrix}. \]

(Note that we index our vector components beginning at 1 and not at 0.) This translation of vectors to matrices will prove practical, given the frequency with which we will be computing a vector (e.g., \(\vec{z}\)) based on a matrix product of a two-dimensional matrix (e.g., W) and another vector (e.g., \(\vec{x}\)), as in the equation

\[ z_j = \sum_i W_{ji} \cdot x_i. \tag{2.1} \]

Note the order of the indices of W: \(W_{ji}\) measures the effect of component i of vector \(\vec{x}\) on component j of vector \(\vec{z}\), and not vice versa. This allows us to write the equation for the vector \(\vec{z}\) as simply

\[ \vec{z} = W \cdot \vec{x}. \tag{2.2} \]
More frequently, we will apply a nonlinearity to each component of the product \(\vec{z}\). We denote this by a vector function, \(\vec{f}(\cdot)\), which maps each component of one vector to another according to the scalar function \(f(\cdot)\). Typically, the nonlinearity employed is the sigmoid,

\[ \sigma(z_i) = \frac{1}{1 + e^{-z_i}}, \tag{2.3} \]

in which case we write \(\vec{\sigma}(\vec{z})\) for the vector function. The result of forward-propagating one layer of activations in a network, through a layer of weighted connections and a nonlinearity to compute the next layer's activations, can now be written

\[ \vec{y} = \vec{\sigma}(W \cdot \vec{x}). \tag{2.4} \]
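Equation 2.4 can be sketched in a few lines of plain Python; the weight and input values below are arbitrary illustrative numbers, not anything from the article:

```python
import math

def sigmoid(z):
    # Componentwise logistic nonlinearity: sigma(z_i) = 1 / (1 + e^(-z_i))
    return [1.0 / (1.0 + math.exp(-zi)) for zi in z]

def forward(W, x):
    # y = sigma(W . x): W[j][i] weights the effect of input component i
    # on output component j, matching the index order W_ji of eq. 2.1.
    z = [sum(W[j][i] * x[i] for i in range(len(x))) for j in range(len(W))]
    return sigmoid(z)

W = [[0.5, -0.25],
     [1.0,  2.0]]
x = [1.0, 2.0]
y = forward(W, x)  # component 0 has z = 0, so y[0] = 0.5
```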
In addition to frequent matrix-vector multiplications, we will commonly use vectors whose elements consist of the concatenation of the elements of two smaller vectors. In order to simplify the description of these vectors, we will use \(\oplus\) to represent a function mapping two vectors, such as \(\vec{b}\) and \(\vec{c}\) in vector spaces B and C, to a new vector \(\vec{a}\) in the vector space defined by the Cartesian product of the spaces, \(B \times C\). The operation is defined such that the representation of vector \(\vec{a}\) is formed by concatenating the elements of vector \(\vec{c}\) onto the elements of vector \(\vec{b}\). Specifically, if \(\vec{a} = \vec{b} \oplus \vec{c}\), then the first elements of \(\vec{a}\) are all the elements of \(\vec{b}\) and the remaining elements of \(\vec{a}\) are all the elements of \(\vec{c}\). Mathematically,

\[ a_i = \begin{cases} b_i & \text{if } i \le \dim(\vec{b}) \\ c_{i-\dim(\vec{b})} & \text{if } i > \dim(\vec{b}) \end{cases} \tag{2.5} \]

and

\[ \dim(\vec{a}) = \dim(\vec{b}) + \dim(\vec{c}), \tag{2.6} \]

where \(\dim(\vec{a})\) represents the dimensionality of \(\vec{a}\). We will further require an inverse of this operator, capable of extracting a smaller vector from a larger vector. To do this, we define the use of square parentheses in the expression \(\vec{b} = \vec{a}[i \,..\, j]\) such that vector \(\vec{b}\) is equal to a vector whose components are equal to components i through j of vector \(\vec{a}\), with \(i, j \in \{1, 2, 3, \ldots, \dim(\vec{a})\}\) and \(i < j\). More precisely, we require that the kth component of vector \(\vec{b}\) is equal to the \((i + k - 1)\)th component of \(\vec{a}\),

\[ b_k = a_{i+k-1}, \tag{2.7} \]

and that the dimensionality of vector \(\vec{b}\) is equal to the number of terms in the inclusive sequence from i to j:

\[ \dim(\vec{b}) = j - i + 1. \tag{2.8} \]
It is now possible to invert the \(\oplus\) operator,

\[ \vec{a} = \vec{b} \oplus \vec{c}, \tag{2.9} \]

such that

\[ \vec{b} = \vec{a}[1 \,..\, \dim(\vec{a}) - \dim(\vec{c})] \tag{2.10} \]

and

\[ \vec{c} = \vec{a}[(\dim(\vec{b}) + 1) \,..\, \dim(\vec{a})]. \tag{2.11} \]

The vector notation presented here allows us to represent patterns distributed across the processing units of an STCN. Each vector component denotes the activation value of a particular processing element in an STCN.
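A minimal Python sketch of the \(\oplus\) operator and its slicing inverse, using the 1-based, inclusive indexing of equations 2.5 through 2.11 (the helper names concat and subvec are ours, not the article's):

```python
def concat(b, c):
    # b (+) c: the elements of c are appended after the elements of b
    # (equations 2.5 and 2.6).
    return list(b) + list(c)

def subvec(a, i, j):
    # a[i..j]: components i through j of a, with 1-based inclusive
    # indices, so dim(result) = j - i + 1 (equations 2.7 and 2.8).
    return a[i - 1 : j]

b, c = [1, 2], [3, 4, 5]
a = concat(b, c)                        # [1, 2, 3, 4, 5]
# Inverting the operator (equations 2.10 and 2.11):
b_back = subvec(a, 1, len(a) - len(c))  # recovers b
c_back = subvec(a, len(b) + 1, len(a))  # recovers c
```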
2.3 Input Dimensions.

In STCNs, input and output patterns can vary across time as well as space. For the purposes of simulation on digital computers, it is useful to discretize the temporal dimension by sampling at regular intervals and consider a system in which time proceeds by intervals of \(\Delta t\). We shall use the symbol t to represent a particular point in time, where \(t \in \{0, \Delta t, 2\Delta t, 3\Delta t, \ldots\}\). In this formulation, \(\Delta t\) can be considered the unit of measure for the quantity t, and thus it is simplest to omit the units and express t simply as a member of the set of whole numbers, \(t \in \{0, 1, 2, 3, \ldots\}\). It should be noted that even a continuous-time system, simulated on a digital computer, is usually converted into a set of simple first-order difference equations, making it formally identical to a discrete-time network. Furthermore, barring high-frequency components in the network's input (relative to \(\Delta t\)), a continuous-time STCN can be precisely duplicated by a discrete-time system (Pearlmutter, 1995).

The dimension of time in STCNs differs from the spatial dimensions in conventional connectionist networks. Components of an input pattern distributed across space (i.e., across various input nodes) can all be accessed at the same time. However, only the current component of patterns distributed across time is accessible at any given instant. We assume that the observable world for an STCN at the instant t is described by an input vector, \(\vec{x}(t)\). Systems that use older input vectors (e.g., time-delay networks; see below) can still be considered, but such a system must include a mechanism for storing historical input vectors (typically a queue of previous values). This vector is supplied to the STCN at time t by setting the activation values of the input units of the STCN to the components of the vector. Thus, the input vector can be considered a stimulus.
It is important to note the difference between information encoded across the input vector \(\vec{x}(t)\) and information encoded in different input vectors \(\vec{x}(0), \vec{x}(1), \ldots\), presented to the network at different times. Values of the former can be accessed in parallel, while the latter must be accessed sequentially (in a forward direction). This makes it the responsibility of the network to store any information contained in \(\vec{x}(t)\) until it is required at a later time.

2.4 Long- and Short-Term Memory in Spatiotemporal Connectionist Networks.

Both conventional and spatiotemporal connectionist networks are equipped with memory in the form of connection weights, denoted by one or more matrices, depending on the number of layers of connections in the network. These are typically updated after each training step and constitute a memory of all previous training. In the sense that this memory extends back past the current input pattern, all the way to the first training step, we shall refer to the weights as long-term memory. Once a connectionist network has been successfully trained, this long-term memory remains fixed during the operation of the network.

In addition to these weight matrices, some networks also use other trainable parameters. These may represent the connectivity scheme of the network (e.g., recurrent cascade correlation; see section 8.9), the types of transmission delays associated with the connections (e.g., gamma networks; see section 5.5), or the initial activation values of the internal processing elements (see section 8.7). These parameters are also part of the long-term memory. We defer a more detailed discussion of these types of parameters until later in the article. For now, we shall only define W to denote an n-tuple representing all the adaptable parameters of a network. W includes one or more weight matrices and may include connectivity scheme parameters, transmission delay parameters, and initial activation values, depending on the type of network.

The distinguishing characteristic of STCNs is that they also include a form of short-term memory. It is this memory that allows these networks to deal with input and output patterns that vary across time and thus defines them as STCNs. Conventional connectionist networks compute the activation values of all nodes at time t based only on the input at time t. By contrast, in STCNs the activations of some nodes at time t are computed based on activations at time \(t - 1\), or earlier. These activations serve as a short-term memory. We use the state vector, \(\vec{s}(t-1)\), to represent the activations at time \(t - 1\) of those nodes that are used to compute the activations of other nodes at time t, that is, state nodes. Unlike the long-term memory, which remains static once training is completed, the short-term memory is continually recomputed with each new input vector during both training and operation. This is because long-term memory is stored in connection weights (which are updated only during training), while short-term memory is represented by node activations (which are computed at each time step, even after training).

2.5 Output, Teaching, and Error.

Finally, it is necessary to specify a representation for the response of the network to its stimuli.
Like the traditional models, STCNs encode their response in the activations of a special set of units called output units. Thus, we represent the output of an STCN by a vector, \(\vec{y}(t)\). Most connectionist architectures learn by computing the difference between their response and a teacher-supplied desired (or ideal) response and adjusting their long-term memory accordingly. We denote the desired response by another vector, \(\vec{y}^{*}(t)\). The difference between the desired output vector and the actual output vector is the error vector \(\vec{E}(t) = \vec{y}^{*}(t) - \vec{y}(t)\), and the total network error, e, is defined as the sum over time of one-half of the squared magnitude of this vector:

\[ e = \sum_{t \ge 0} \frac{1}{2} \| \vec{E}(t) \|^2. \tag{2.12} \]

The total error, e, is a measure of the overall performance (or lack thereof). It is this quantity that is minimized by gradient descent during training (see section 8).
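Equation 2.12 simply accumulates half the squared error norm over all time steps; a short illustrative implementation:

```python
def total_error(desired, actual):
    # e = sum over t of (1/2) * || y*(t) - y(t) ||^2   (equation 2.12)
    e = 0.0
    for y_star, y in zip(desired, actual):
        err = [ys - ya for ys, ya in zip(y_star, y)]
        e += 0.5 * sum(ei * ei for ei in err)
    return e

# Two time steps, each off by a unit vector: e = 0.5 + 0.5 = 1.0
e = total_error([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]])
```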
Table 1: Notation.

  \(\vec{f}(\cdot)\)           Generic vector nonlinearity
  \(\vec{\sigma}(\cdot)\)      Sigmoid vector nonlinearity
  \(\vec{b} \oplus \vec{c}\)   Concatenation of components of vectors \(\vec{b}\) and \(\vec{c}\)
  \(\dim(\vec{a})\)            Dimensionality of vector \(\vec{a}\)
  \(\vec{a}[i \,..\, j]\)      Vector consisting of components i through j of vector \(\vec{a}\)
  \(\Delta t\)                 Units of time
  \(\vec{x}(t)\)               Input to the network at time t
  W                            Weight matrix; adaptable parameters of the network
  \(\vec{s}(t)\)               State vector at time t
  \(\vec{y}(t)\)               Output of the network at time t
  \(\vec{y}^{*}(t)\)           Desired response at time t
  \(\vec{E}(t)\)               Error vector at time t
  e                            Total error scalar
2.6 Summary.

In this section, we have introduced a number of symbols to represent various properties and components of STCNs. This notation is summarized in Table 1 for easy reference.

3 Evaluation of Taxonomies

3.1 Criteria of a Good Taxonomy.

Before presenting our taxonomy of STCNs for spatiotemporal processing, we briefly discuss three general properties of good taxonomies: (1) descriptive adequacy (it must classify all objects in its domain), (2) simplicity (it should be relatively simple to classify and name any of the objects in the domain), and, most important, (3) predictive power (objects in the taxonomy with similar classifications should possess similar properties, and objects in the taxonomy with differing classifications should have differing properties). A good taxonomy balances simplicity with predictive power while maintaining descriptive adequacy. In other words, it must have a level of discrimination that is not so low that it groups dissimilar objects together or so high that the number of groups required to encompass all objects becomes unwieldy. This can best be accomplished if the taxonomy describes features along multiple orthogonal dimensions rather than along one single dimension. If multiple dimensions are available, then different yet similar objects can be classified as having the same feature along one or more dimensions,
Table 2: Horne and Giles's Taxonomy.

  Observable states:
    Narendra and Parthasarathy (1990)
    TDNN (Lang, Waibel, & Hinton, 1990)
    Gamma network (de Vries & Principe, 1992b)
  Hidden dynamics:
    Single-layer:
      First order
      High order (Giles et al., 1990)
      Bilinear
      Quadratic (Watrous & Kuhn, 1992b)
    Multilayer:
      Robinson and Fallside (1988)
      Simple recurrent networks (Elman, 1988)
    Local feedback:
      Back and Tsoi (1991)
      Frasconi, Gori, and Soda (1992)
      Poddar and Unnikrishnan (1991b)
while having different features along other dimensions. By contrast, in a single-dimension taxonomy, the two objects must be classified as either the same or different.

3.2 Previous Work.

Horne and Giles (1995), in an article comparing the performance of different architectures, have developed a simple taxonomy for STCNs (see Table 2). Their approach focuses on partitioning the space of existing STCN architectures. The first partition separates those networks whose state representations are encoded in input and output units only from those networks whose state representations are encoded in hidden units. Horne and Giles refer to the former class as networks with observable states and the latter as networks with hidden dynamics. Networks with
observable states include Narendra and Parthasarathy's (1990) networks; Lang, Waibel, and Hinton's (1990) time-delay neural networks (TDNN); and de Vries and Principe's (1992b) gamma networks. The class of networks with hidden dynamics is further partitioned into single-layer, multilayer, and local feedback networks.

Horne and Giles's taxonomy considers only the issue of connectivity and not those concerning adaptation. Nor does it describe a number of popular classes of spatiotemporal networks, including recurrent cascade correlation (Fahlman, 1991), recursive autoassociative memory (Pollack, 1989, 1990), the autoassociative recurrent network (Maskara & Noetzel, 1992), the connectionist pushdown automaton (Giles, Sun, Chen, Lee, & Chen, 1990), the connectionist Turing machine (Williams & Zipser, 1989b), and second-order constructive learning (Giles et al., 1995). This limits the taxonomy's descriptive adequacy and predictive power.

Tsoi and Back (1994) have developed a taxonomy designed specifically for locally recurrent, globally feedforward networks, a subset of STCNs. These are networks in which all connections are feedforward with the exception of one temporal self-looping connection for each node. Tsoi and Back base their taxonomy on the processing performed by the feedback connections used and on which value is delayed. They refer to these two considerations as synapse type and feedback location, respectively. The feedback location is subdivided along three dimensions depending on which combination of the nodes' previous activation values—the nodes' previous net inputs or the previous output value—is transmitted by the local feedback synapse. Thus, Tsoi and Back's taxonomy is composed of four dimensions: synapse type (which can assume a value of simple or dynamic), synapse feedback (which can be yes or no), activation feedback (which can be yes or no), and output feedback (which can be yes or no).
A simple synapse corresponds to a feedback connection with a single scalar weight, while a dynamic synapse has a linear transfer function. These dimensions are illustrated in Table 3.

Table 3: Tsoi and Back's Taxonomy.

  Synapse type:            Simple / Dynamic
  Feedback at synapse:     Yes / No
  Feedback at activation:  Yes / No
  Feedback at output:      Yes / No

Tsoi and Back's taxonomy is capable of dealing only with locally recurrent, globally feedforward networks, a small proportion of STCNs.

Mozer (1994) has developed a taxonomy for STCNs, which is illustrated in Table 4. It is based on the assumption that every STCN consists of two mechanisms: a short-term memory and a predictor. The short-term memory computes the state of the STCN, and the second mechanism uses the state to compute the output. Mozer characterizes the operation of the short-term memory of STCNs along the dimensions of content and form (which define how the short-term memory is computed) and adaptability (which defines how changes in the computation of the short-term memory are made during the adaptation process). Mozer notes that the predictor component of an STCN is always a feedforward network. The table illustrates the dimensions Mozer used to describe STCNs.

Table 4: Mozer's Taxonomy.

  Content:       Input (I); Transformed input (TI); Transformed input and state (TIS);
                 Output (O); Transformed output (TO); Transformed output and state (TOS)
  Form:          Delay line; Exponential trace; Gamma
  Adaptability:  Static; Adaptive

Despite being the most sophisticated and comprehensive taxonomy of STCNs developed to date, it has two limitations. The first concerns sacrificing predictive power for simplicity. Mozer's taxonomy applies only to the short-term memory component of the network. He does not describe the predictor component beyond stating that it is a network. This means that networks with different predictor components (e.g., multilayer versus single-layer versus no-layer, or first-order connections versus second-order connections) are grouped in the same category. Since it is well known that the number of layers and order of connections greatly affect what a connectionist network can compute (see Lippmann, 1987), predictive power is limited by not categorizing STCNs based on their predictor components. The same argument applies to the short-term memory component. Mozer uses the term transformed input and state memories to encompass a very broad range of connectionist networks with varying layer numbers and connectivity schemes, including Fahlman's recurrent cascade correlation networks and Giles's second-order networks, two approaches that are vastly different in approach and computational power (see Giles et al., 1995; Kremer, 1996a, 1996b).

The second problem with Mozer's taxonomy is that it also sacrifices too much descriptive adequacy in favor of simplicity. Mozer himself identifies one interesting and very popular architecture type of STCN, which he calls the standard architecture (e.g., Elman, 1990a; Mozer, 1989), that does not fit within his classification scheme. Although the standard architecture is very similar to TIS, there are differences between the two approaches (see Figure 1). Note that the standard architecture is like Mozer's transformed input and state, except that the first two hidden layers have been collapsed into a single layer. The standard architecture does not fit anywhere else in the different possibilities for memory contents discussed by Mozer either. Another example of sacrificing descriptive adequacy concerns the networks of Narendra and Parthasarathy (1990). These networks compute their outputs based on a history of both input and output patterns. In this sense, these networks could be considered as having input and output, or IO, memories. Yet Mozer's taxonomy does not include this type of memory content. Nor does it adequately describe different systems of
Figure 1: TIS and the standard architecture. (a) TIS memory architecture. (b) Standard architecture. (Adapted from Mozer, 1989.)
memory adaptability. Elman's SRN algorithm, for example, is described as lying somewhere between static and adaptive. This seems strange, since Elman's system clearly satisfies Mozer's criterion for adaptiveness ("the neural net can adjust memory parameters"; Mozer, 1994, p. 252). Yet Elman's algorithm is clearly different—and perhaps less effective at adapting (Servan-Schreiber, Cleeremans, & McClelland, 1988) than the conventional recurrent adaptation algorithms—because it uses a truncated version of gradient descent, which we will describe in greater detail later. Fahlman has developed yet another adaptation algorithm that does not fit into Mozer's taxonomy. Fahlman's networks adapt not only parameters like weights but also the number of nodes in the network. This is an important distinction, since it implies that Fahlman's networks are capable of increasing their own memory capacity, which other STCNs are not. Clearly this type of adaptation will have a profound impact on how these networks perform on temporal problems.

Two other classification systems have been developed in articles on spatiotemporal pattern recognition systems: that of Ghosh and Deuser (1995) focuses on classifying different approaches in the context of recognizing sonar sequences, while that of Maren (1990) looks at a broader range of applications to dynamical systems classification. Both approaches use a fairly coarse-grained approach similar to that of Mozer and share a number of Mozer's limitations. Table 5 summarizes these previous approaches.

When Mozer and Horne and Giles first developed their classification schemes, they provided simple, descriptively adequate, and predictively powerful frameworks for describing some of the leading STCN approaches of the time. Tsoi and Back also developed a simple taxonomy addressing a popular subset of STCNs. The recent flurry of activity in the field of STCN research, however, has dated these ways of classifying networks. The problems that these three taxonomies exhibit in the light of new STCN designs can all be traced to a common cause: all are attempts to classify existing STCNs—they use an architectural approach. Since it is impossible to predict future STCN developments, new designs are difficult to fit into the categories defined earlier. One solution to this problem is to omit the new STCN designs from the taxonomy; however, this results in a lack of descriptive adequacy.
An alternative solution is to place new designs into the closest possible categories (even if they do not fit well). This tends to degrade the predictive power of the taxonomy, since it will be more difficult to make predictions based on tenuous categorizations. Another option is to add a new category for the new network design, but this tends to decrease simplicity due to a proliferation of categories. A final option is to develop a new taxonomy. Although a new taxonomy will never be able to withstand all future developments, it can be designed to anticipate future developments (especially hybrid approaches that mix components of existing systems). This article attempts to define such a taxonomy.

4 A New Taxonomy for Spatiotemporal Connectionist Networks

An alternative to attempting to classify existing STCNs based on their respective properties is to examine what all STCNs have in common. We develop our taxonomy around the computational components of spatiotemporal systems (examining the goals of computation) rather than around
Table 5: Summary of Taxonomies.

  Taxonomy                 | Dimensions Covered                    | Architectures Covered                                        | Categories
  Horne and Giles (1995)   | Connectivity only, not adaptation     | Many popular classes of networks missed                      | Good
  Tsoi and Back (1994)     | Good                                  | Covers only locally recurrent, globally feedforward networks | Good
  Mozer (1994)             | Short-term memory only, not predictor | Some networks do not fit into any categories                 | Very broad categories
  Ghosh and Deuser (1995)  | Good                                  | Focuses on networks for sonar sequences only                 | Very broad categories
  Maren (1990)             | Good                                  | Architectures for dynamical systems classification           | Very broad categories
the representations and algorithms of particular STCNs. This approach is related to Marr's (1982) trilevel hypothesis, which describes information processing systems at three levels: (1) the computational theory level, which describes the goal of the computation; (2) the representation and algorithm level, which describes how the computation can be implemented; and (3) the hardware implementation level, which describes the physical realization of the computation. While other taxonomies are structured around the representation and algorithm level, focusing on how computations are implemented, we base our taxonomy around the computational theory level.

Our taxonomy is based on the perspective of anyone wishing to develop a spatiotemporal processing system. Those in this position are forced to answer two basic questions. The first is, How does one select, limit, and represent the set of systems that can be induced? This set of systems is defined as the hypothesis space of the processing system. The second question is, What algorithm is used to identify, from the hypothesis space, the particular system that best suits the training data? This is the search algorithm. By basing our taxonomy on these two fundamental questions that need to be answered for any search algorithm, we can group STCNs into categories without placing any restrictions on the descriptive adequacy of the taxonomy.

We now break down the first question—How does one represent the hypothesis space?—into two components. For STCNs, a hypothesis represents a particular network with a specific set of trainable parameters W, or, equivalently, the spatiotemporal input-output relationship defined by these parameters. Every spatiotemporal system can be defined by two components: one that computes the new internal state of the system based on the input and previous state, and one that computes the output of the system based on the new state. Thus, it is natural to break the process of representing the hypothesis space along these lines and identify two new questions that must be answered. The first is, How does one represent the possible computations that could be performed in order to compute the system's state? The second is, How does one represent the possible computations that could be performed in order to compute the system's output? These two questions together are equivalent to the original question (What is the hypothesis space?). This is because the representation of any spatiotemporal mechanism can be broken into state and output components and because these two components completely define the spatiotemporal processing.

The second question, What is the search algorithm? can also be broken down into two components. The search algorithm effectively defines operators for moving the induction system from one point in the (parameter) hypothesis space to another. This too can be broken into two subquestions: What is the initial state of the network's parameters?
and, How are the parameters updated in response to the training data?

This leaves us with four fundamental questions that must be answered by any STCN design. These questions, illustrated in Table 6, form the basis dimensions of our taxonomy. Any spatiotemporal system can be defined by its position in this four-dimensional space. Specific coordinates along each dimension represent specific answers to the four questions. A set of four such coordinates completely defines an STCN.

Each dimension of the taxonomy can be described as a generic function, and the specific instantiation of each function will determine the behavior of the STCN. The first function defines how the state vector, \(\vec{s}(t)\), is computed, based on the previous state vector, \(\vec{s}(t-1)\), the current input vector, \(\vec{x}(t)\), and the network's trainable parameters, W. This function will be denoted \(f_{\vec{s}}(\vec{s}(t-1), \vec{x}(t), W)\). The second function computes the output vector, \(\vec{y}(t)\), based on the current state vector, \(\vec{s}(t)\), and the parameters of the STCN, W. It will be denoted \(f_{\vec{y}}(\vec{s}(t), W)\). The third function computes the initial state of the STCN's parameters based on available a priori information about
Table 6: Four Basic Questions to Be Answered by Any STCN and the Mathematical Functions Describing the Dimensions.

  Question                                                             | Function
  How does one represent the possible computations that could be       | \(f_{\vec{s}}(\vec{s}(t-1), \vec{x}(t), W)\)
  performed in order to compute the system's state?                    |
  How does one represent the possible computations that could be       | \(f_{\vec{y}}(\vec{s}(t), W)\)
  performed in order to compute the system's output?                   |
  What is the initial state of the system?                             | \(f_0(\text{a priori info})\)
  How does one adapt the trainable parameters of the system?           | \(f_{\Delta}(\vec{E}(t), W, \vec{x}(\cdot))\)
the problem to be solved. It is denoted \(f_0\). The fourth and final function computes the new value of the parameters of the STCN, in response to the error, \(\vec{E}(t)\), the current parameters, W, and the input history, \(\vec{x}(\cdot)\), of the STCN. This function is denoted \(f_{\Delta}(\vec{E}(t), W, \vec{x}(\cdot))\). These functions are illustrated opposite their corresponding dimensions in Table 6.

Note that we have as yet made absolutely no assumptions about the representation or algorithms employed by each of the four functions. We have focused only on the goal of the computation, thus working at Marr's first level rather than the second. In the following sections, we provide example instantiations of the functions that correspond to the representations and algorithms employed in a variety of popular STCNs.

Using these four functions, it is now possible to describe the operation of any STCN. This operation is described in pseudocode in Table 7. In a sense, the pseudocode can be thought of as a mathematical definition of STCN, paralleling the informal definition provided in section 2.1. Note that at this point, we have developed a general framework that is applicable to any STCN. By specifying the instantiations of the four functions described above, it is possible to define points along the four dimensions of the taxonomy and hence specific STCN induction systems. The algorithm presented assumes that weight updates are performed every time a target is available. Sometimes networks are trained in a batch mode, in which case error values are accumulated in step 11, and step 12 is executed only at the end of the presentation of a fixed number of sequences, called an epoch. The same sequence may then be presented again, or additional strings may be added before the next epoch begins.
Table 7: Generic Algorithm for STCNs.

  00  algorithm STCN
  01    W ← f_0(a priori info)                  /* initialize */
  02    do                                      /* begin training */
  03      t ← 0                                 /* reset time */
  04      do                                    /* begin sequence */
  05        t ← t + 1                           /* increment time */
  06        x(t) ← environment                  /* determine input */
  07        s(t) ← f_s(s(t-1), x(t), W)         /* update state */
  08        y(t) ← f_y(s(t), W)                 /* determine output */
  09        if target y*(t) available
  10          E ← y*(t) - y(t)                  /* compute error */
  11          e ← (1/2)‖E‖²                     /* compute scalar */
  12          W ← W + f_Δ(E, W, x(·))           /* update parameters */
  13      while not end of sequence
  14    while t < t_max and e > threshold       /* no solution found and not out of time */

Note: Particular network designs differ only in their instantiations of the four fundamental functions.
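The generic algorithm of Table 7 becomes runnable once the four functions are instantiated. The toy choices below (a memoryless state function, a single trainable scalar weight, and a fixed gradient-descent step of 0.1) are our own illustrative assumptions, not a network from the article; they learn the scalar mapping y = 2x:

```python
def f_0(a_priori_info):
    # Initialize the adaptable parameters (here, one scalar weight).
    return {"w": 0.0}

def f_s(s_prev, x, W):
    # Toy state function: the state is just the current input vector.
    return x

def f_y(s, W):
    # Toy output function: scale the state by the trainable weight.
    return [W["w"] * si for si in s]

def f_delta(E, W, s):
    # Gradient step on e = 0.5 * ||E||^2 with learning rate 0.1.
    return {"w": 0.1 * sum(Ei * si for Ei, si in zip(E, s))}

def train(sequences, targets, t_max=1000, threshold=1e-6):
    W = f_0(None)                                   # step 01
    t, e = 0, float("inf")
    while t < t_max and e > threshold:              # step 14
        for xs, ys in zip(sequences, targets):
            s = [0.0] * len(xs[0])                  # step 03: reset
            for x, y_star in zip(xs, ys):
                t += 1                              # step 05
                s = f_s(s, x, W)                    # step 07
                y = f_y(s, W)                       # step 08
                E = [ysi - yi for ysi, yi in zip(y_star, y)]
                e = 0.5 * sum(Ei * Ei for Ei in E)  # steps 10-11
                d = f_delta(E, W, s)                # step 12
                W = {"w": W["w"] + d["w"]}
    return W

W = train([[[1.0], [2.0]]], [[[2.0], [4.0]]])  # w converges toward 2.0
```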
5 Computing the State Vector

We begin by examining instantiations of the state vector function, \(f_{\vec{s}}(\vec{s}(t-1), \vec{x}(t), W)\), used in existing STCNs. In particular, we discuss thirteen popular alternatives: window in time memories, NARX memories, time-delay neural network memories, feedforward exponential decay memories, gamma memories, Jordan memories, local feedback memories, connectionist infinite impulse response filter memories, first-order context memories, second-order context memories, long short-term memories, connectionist pushdown automata memories, and connectionist Turing machine memories.

5.1 Window in Time (WIT) Memory.

The window in time (WIT) memory approach to computing the state vector is one of the first types of short-term memory in spatiotemporal networks. Its best-known implementation is NETtalk (Sejnowski & Rosenberg, 1987, 1988; Rosenberg & Sejnowski, 1986), but it has also been used by a variety of other authors, including Lang et al. (1990) and Waibel, Hanazawa, Hinton, Shikano, and Lang (1989). The short-term memory vector, \(\vec{s}(t)\), is always computed in the same manner throughout the training process and represents a finite, so-called window in time on the input vectors. This window contains a finite history of input vectors. If we consider a window with width n input vectors, then the state
Figure 2: Short-term memory of WIT networks.
of the STCN is given by

s(t) = x(t) ⊕ x(t−1) ⊕ ··· ⊕ x(t−n+1),   (5.1)

and the function to compute state recursively is defined as

f_s(s(t−1), x(t), 𝒲) ≡ x(t) ⊕ s(t−1)[1..(n−1)·dim(x(t))].   (5.2)
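A minimal Python sketch of the sliding-window update in equation 5.2; the flattened-list representation and the function name are illustrative assumptions:

```python
def wit_state(s_prev, x, n):
    """Prepend the current input x and keep only the newest n vectors' worth
    of state (each vector stored flattened, newest first)."""
    dim = len(x)
    return x + s_prev[: (n - 1) * dim]   # x(t) concatenated with truncated s(t-1)
```

Repeated calls maintain a window of the n most recent inputs, with older inputs falling off the end.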
Figure 2 illustrates a WIT memory. The illustrative convention used for this and subsequent images is described in Table 8. At time t, the input vector, x(t), which corresponds to the current input, is added to the left side of the state vector, s(t) (indicated by the dashed box). All other vectors are shifted right. It is critical to note that the equation to compute the next state of the STCN does not make use of the trainable parameters, 𝒲. This means that the function remains completely static; the state vector is computed the same way every time. The WIT architecture has been widely used in a number of applications but is best known for the NETtalk application. A WIT architecture with n = 6 is also used to predict the dynamics of a plant in Back and Tsoi (1991), although the authors present the network as an example of a more general architecture called a FIR MLP (see section 5.9 below). The computational power of these networks has been identified as being equivalent to definite machines (Kohavi, 1978, in Giles & Horne, 1994).

5.2 Nonlinear Autoregressive with Exogenous Inputs Memory. A variation on the WIT memory is to use two temporal windows: one window on the input (as in WIT memories) and a second on the output produced by the network. Because of their similarities to nonlinear autoregressive with exogenous inputs (NARX) models, STCNs with this type of memory have
Table 8: Figure Key. Nodes are represented by hollow circles. Ellipses (drawn using three solid circles) indicate the possibility of arbitrary numbers of nodes forming an activation vector or arbitrary numbers of activation vectors contributing to a state vector. Units forming input and output activation vectors are grouped in solid-lined rectangles. Units forming state activation vectors are grouped in dotted rectangles. Connections between nodes are represented by lines, with information flow always pointing in an upward direction; wherever practical and unless otherwise indicated, all connections are shown. Flow through time is illustrated by thick, hollow arrows pointing from recent activation values to older activation values. Activation values can be thought of as being copied once per time step through the thick, hollow arrows. Activation values that are set external to the nodes illustrated (i.e., from output nodes, Turing machine controller, or continuous stack) are indicated by solid arrows.
been called NARX networks (Lin, Horne, Tiňo, & Giles, 1996a, 1996b). The state in this type of network is equivalent to

s(t) = {x(t) ⊕ x(t−1) ⊕ ··· ⊕ x(t−n+1)}
     ⊕ {y(t−1) ⊕ y(t−2) ⊕ ··· ⊕ y(t−m)}.   (5.3)

Clearly, the WIT memory described above is a special case of the NARX memory where m = 0 (i.e., the size of the output window is zero). NARX memories and WIT memories are classified separately in this taxonomy because the more restrictive WIT memory has important representational limitations that are not shared by NARX memory. The state function for the NARX memory can be defined as the recurrent function

f_s(s(t−1), x(t), 𝒲) ≡ x(t) ⊕ s(t−1)[1..(n−1)·dim(x(t))]
     ⊕ y(t−1) ⊕ s(t−1)[n·dim(x(t))+1 .. n·dim(x(t))+(m−1)·dim(y(t))].   (5.4)
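The index arithmetic of equation 5.4 can be sketched in Python; the flattened layout (input window first, then output window) and all names are illustrative assumptions:

```python
def narx_state(s_prev, x, y_prev, n, m):
    """State = window of the n newest inputs followed by the m newest outputs."""
    dx, dy = len(x), len(y_prev)
    inputs = x + s_prev[: (n - 1) * dx]                        # shift input window
    outputs = y_prev + s_prev[n * dx : n * dx + (m - 1) * dy]  # shift output window
    return inputs + outputs
```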
In this equation, there is no direct dependence of state on the weights of the network, 𝒲. An indirect dependence is present, however, in that the output y depends on the weights according to the output function used (see section 6). A NARX memory with an input window of length n = 7 and an output window of length m = 7 is illustrated in Figure 3. Unlike the WIT memory, which is not recurrent, output activations in this network are used to compute activation values of internal nodes, which in turn are used to
Figure 3: NARX memory.
recompute output activations. This recurrence gives the NARX memory the ability to encode a history of activations extending back arbitrarily in time. NARX networks have been used in Lapedes and Farber (1987), Chen, Billings, and Grant (1991), Narendra and Parthasarathy (1990), Lin, Giles, Horne, and Kung (1996), Lin et al. (1996b), Lin, Horne, Tiňo, and Giles (1996a), and Siegelmann et al. (1997). The last of these articles contains a proof that NARX networks can have the computational power of Turing machines.

5.3 Time-Delay Neural Network (TDNN) Memories. Thus far, we have seen two approaches to using previous node activations to represent state. The WIT memory stores previous inputs, and the NARX memory stores previous inputs and outputs. Another alternative is to store the activations of input and hidden units. This approach has been used extensively by Waibel (1988; Waibel et al., 1989). Here, the state is defined as
s(t) = x(t) ⊕ x(t−1) ⊕ x(t−2) ⊕ ··· ⊕ x(t−n)
     ⊕ h^(I)(t) ⊕ h^(I)(t−1) ⊕ h^(I)(t−2) ⊕ ··· ⊕ h^(I)(t−m)
     ⊕ h^(II)(t) ⊕ h^(II)(t−1) ⊕ h^(II)(t−2) ⊕ ··· ⊕ h^(II)(t−l).   (5.5)
h^(I)(t) and h^(II)(t) represent the activations of the first and second hidden layers, respectively. They are computed as

h^(I)(t) = σ(W^(I) · {x(t−1) ⊕ x(t−2) ⊕ ··· ⊕ x(t−n)})

and

h^(II)(t) = σ(W^(II) · {h^(I)(t−1) ⊕ h^(I)(t−2) ⊕ ··· ⊕ h^(I)(t−m)}),

respectively, where W^(I) and W^(II) represent the weight matrices leading to the first and second hidden layers, respectively. These matrices are a component of the network's trainable parameters, 𝒲. This formulation implies a state function, which can be formulated as

f_s(s(t−1), x(t), 𝒲) ≡ x(t) ⊕ s(t−1)[1..(n−1)·dim(x(t))]
     ⊕ h^(I)(t) ⊕ s(t−1)[n·dim(x(t))+1 .. n·dim(x(t))+(m−1)·dim(h^(I)(t))]
     ⊕ h^(II)(t) ⊕ s(t−1)[n·dim(x(t))+m·dim(h^(I)(t))+1
          .. n·dim(x(t))+m·dim(h^(I)(t))+(l−1)·dim(h^(II)(t))].   (5.6)
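A sketch of the first hidden-layer computation h^(I)(t) from the input window; the helper names and the logistic squashing function are illustrative assumptions:

```python
import math

def sigmoid(v):
    """Apply the logistic function to each component of a vector."""
    return [1.0 / (1.0 + math.exp(-u)) for u in v]

def matvec(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * u for w, u in zip(row, v)) for row in W]

def tdnn_hidden(W1, x_window):
    """h^(I)(t) = sigma(W^(I) . {x(t-1) + ... + x(t-n)}), window flattened."""
    flat = [u for x in x_window for u in x]
    return sigmoid(matvec(W1, flat))
```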
Variations on this network might include more or fewer hidden layers, but this two-hidden-layer approach is the one implemented for various applications in the area of phoneme recognition (Waibel, 1988; Waibel et al., 1989).

5.4 Feedforward Exponential Decay (FED) Memories. An alternative to storing explicitly a finite number of previous activation values from nodes in the network is to store an infinite history of previous activations implicitly. This can be accomplished by basing the network's state on the entire history of various node activations. In order to realize such a system in a finite number of state nodes, it is necessary to represent the infinite history in a manner that can be computed incrementally based on the previous history. Poddar and Unnikrishnan (1991a, 1991b) have used the following state function:

s(t) = a · s(t−1) + (I − a) · {x(t−1) ⊕ h(t−1)}.   (5.7)
Here, a represents a diagonal matrix of decay constants that remains fixed during training, I a diagonal matrix of ones, and h(t−1) is a vector of hidden
unit activations computed as

h(t−1) = σ(W · {x(t−1) ⊕ s(t−1)[1..dim(x)]}).   (5.8)
Unwrapping the recursive equation 5.7 reveals that the state is indeed based on the infinite, exponentially decaying history of input and hidden unit activations of the network (assuming s(0) = 0):

s(t) = Σ_{τ=1}^{t} a^{τ−1} · (I − a) · {x(t−τ) ⊕ h(t−τ)}.   (5.9)
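The elementwise update of equation 5.7 can be sketched as follows; since a and I are diagonal, the matrix products reduce to componentwise operations (the names are illustrative):

```python
def fed_state(s_prev, v_prev, a):
    """Exponential-decay state update: s(t) = a*s(t-1) + (1-a)*v(t-1),
    where v(t-1) is the concatenation of x(t-1) and h(t-1), and a holds
    the diagonal of the decay matrix."""
    return [ai * si + (1.0 - ai) * vi
            for ai, si, vi in zip(a, s_prev, v_prev)]
```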
Thus, this architecture avoids the difficulty of having to decide on a particular window length, as in the WIT memory. It should be noted that despite the fact that equation 5.7 is defined recursively, this network is not a recurrent network. Every node's activation can be computed based on only a finite input history for the network. FED memories have been used for time-series prediction (Poddar & Unnikrishnan, 1991a) and prediction of speech signals (Poddar & Unnikrishnan, 1991b).

5.5 Gamma Memories. The FED memory uses an exponentially decaying history. Other history functions could also be used. To be of practical use, however, the selected history function must be incrementally computable. That is, the state of the system should be computable based on a finite number of parameters rather than requiring storage of the arbitrarily long activation history of the nodes in the network. A history function that satisfies this criterion is based on the (normalized) gamma density function (Γ(x) ≡ ∫₀^∞ t^{x−1} e^{−t} dt). The memory uses two parameters per node, represented by two diagonal matrices: depth (μ) and resolution (ν). The state vector of a gamma memory is computed according to the equation

f_s(s(t−1), x(t), 𝒲)[i·dim(x) .. (i+1)·dim(x)−1]
     ≡ B((1−μ) · s(t−1)[(i−1)·dim(x) .. i·dim(x)−1]
        + μ · s(t−1)[i·dim(x) .. (i+1)·dim(x)−1]),   (5.10)
where

s(t)[0..dim(x)−1] ≡ x(t),   (5.11)

and B is a matrix of binomials computed as

B_{i,j} = (t choose ν_{i,j}).   (5.12)

This network is nonrecurrent, since no part of the state is required to compute a subsequent state. In this network, the parameter μ is one of the trainable parameters contained in 𝒲. The gamma memory approach has been used by de Vries and Principe (1991, 1992a) and applied to signal prediction, including electroencephalograms.
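Equation 5.10 can be sketched as a cascade of leaky-integrator taps, which is the form the gamma memory is usually written in; this simplification (explicit list of taps instead of flattened indexing, a single shared μ, and B omitted) is an assumption for illustration:

```python
def gamma_update(g, x, mu):
    """One time step of a gamma memory line. g[0] is the current input tap;
    each deeper tap g[k] mixes its own past value with the previous tap's
    past value, giving progressively deeper, smoother memory."""
    new = [x]
    for k in range(1, len(g)):
        new.append((1.0 - mu) * g[k] + mu * g[k - 1])
    return new
```

An impulse injected at tap 0 propagates down the line one stage per step, spreading out as it goes.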
5.6 Habituation-Based Memories. Stiles and Ghosh (1996, 1997) have developed another interesting alternative to computing state. In their networks, state is based on a mathematical model of habituation suggested by Arbib. This discrete-time model describes the effect that repeated input signals have on synaptic weight strength. Stiles and Ghosh use a set of weight values computed according to Arbib's model as inputs to a feedforward network. Since the habituation model causes weights to change in response to input patterns, the weights can serve to represent a history of the input or state of the system. Specifically,

s_i(t) = s_i(t−1) + τ_i(α_i(1 − s_i(t−1)) − s_i(t−1)·x_{f(i)}),   (5.13)
where τ_i and α_i are constants defining the rate of habituation, and f(i) is a function mapping a state index to an input index. The authors found that having multiple state values associated with each input improved performance and was a biologically realistic assumption. Stiles and Ghosh proved mathematically that memories of this type are able to approximate arbitrarily well any continuous, causal, time-invariant mapping from input sequences to output sequences. These networks have been shown to outperform time-delay neural networks on a variety of problems, including classification of Banzhaf sonograms.

5.7 Jordan Memories. Michael Jordan (1986a, 1986b) has developed a variation on the FED approach that uses a history of the output units. Specifically, Jordan defines the state of his networks as

s(t) = ½ · s(t−1) + y(t−1).   (5.14)
Jordan networks can be viewed as storing an infinite history of past outputs, since the state of the network can be rewritten as

f_s(s(t−1), x(t), 𝒲) ≡ y(t−1) + ½ · y(t−2) + (½)² · y(t−3) + ··· + (½)^{t−1} · y(0).   (5.15)
Here, it can be seen that s(t) is computed based on the entire history of y(t), though with exponentially decaying influence as time increases. These networks are recurrent, since each state vector depends on the previous state vector. Jordan applied his networks to the problem of producing articulatory sequences. This network is also interesting for historical reasons, since it is one of the earliest reported recurrent networks and provided the foundation for subsequent approaches, including Elman's networks (see below).
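Equation 5.14 reduces to a one-line elementwise update; this sketch (names illustrative) shows the exponentially decaying influence of older outputs:

```python
def jordan_state(s_prev, y_prev):
    """Jordan memory: s(t) = 0.5 * s(t-1) + y(t-1), componentwise."""
    return [0.5 * si + yi for si, yi in zip(s_prev, y_prev)]
```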
Figure 4: LF memory.
5.8 Local Feedback Memories. Another simple memory is the local feedback (LF) memory. These memories use a set of hidden units to represent the network's state. Each state unit's activation is computed based not only on the inputs, but also on that same unit's previous activation:

f_s(s(t−1), x(t), 𝒲) ≡ σ(W · x(t) + X · s(t−1)).   (5.16)
Here, W (which is a trainable parameter and part of 𝒲) represents a matrix of connection weights between the input units and the hidden units that make up s(t), while X (also a trainable parameter and part of 𝒲) is a diagonal matrix. The matrix X represents a set of recurrent (self) connections among the state units. This type of memory is graphically represented in Figure 4. Rather than showing the locally recurrent connections in this type of memory, we illustrate the operation of the network using a fictitious set of processing units called context units. These units can be considered to contain a copy of the activations of the state units at the previous time step. The context units allow us to separate the temporal delay from the modulation component of a connection's operation and greatly simplify illustrations of STCNs. The use of context units was pioneered by Elman (1990a, 1991b).
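Because X is diagonal, the recurrent term of equation 5.16 acts per unit; a minimal sketch (names illustrative) is:

```python
import math

def lf_state(s_prev, x, W, X_diag):
    """LF memory: s(t) = sigma(W.x(t) + X.s(t-1)), with X diagonal so each
    state unit feeds back only to itself."""
    pre = [sum(w * u for w, u in zip(row, x)) + xd * sp
           for row, xd, sp in zip(W, X_diag, s_prev)]
    return [1.0 / (1.0 + math.exp(-p)) for p in pre]
```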
The LF approach was developed by Gori, Bengio, and De Mori (1989) and used in Frasconi et al. (1992). A variation of this technique, presented in the same articles, uses the same approach without the nonlinearity. In this case,

f_s(s(t−1), x(t), 𝒲) ≡ W · x(t) + X · s(t−1).   (5.17)

An extension of this approach has been used by Fahlman (1991) under the name recurrent cascade correlation (RCC). While the LF memories used by Gori et al. (1989) and Frasconi et al. (1992) use only one layer, Fahlman's networks use a cascaded approach in which unit i receives connections from all lower-numbered nodes. Fahlman's LF memories are computed incrementally. That is, each component of the state vector depends on all lower-numbered components for the same time step. The first components of the state vector are equivalent to the input vector. Additionally, each node receives a recurrent connection from itself with a time delay of one unit. Since no node receives delayed connections from nodes other than itself, this represents a strictly local recurrence. It is the local recurrence that introduces temporal dependency. This type of memory has been studied by Fahlman (1991), Giles et al. (1995), and Kremer (1996a, 1996b). Once again, we shall not show recurrent connections in our diagram, but rather an additional set of context units whose activations are equal to the state units at the previous time step. The computation of the state vector in an RCC memory is defined as

f_s(s(t−1), x(t), 𝒲)[i] ≡ x_i(t)                                        for i ≤ dim(x),
f_s(s(t−1), x(t), 𝒲)[i] ≡ σ(W_{i,·} · {1 ⊕ s(t)[1..i−1] ⊕ s(t−1)[i]})   otherwise,   (5.18)
where W represents a submatrix of 𝒲 containing the connections to the state nodes. The architecture that computes this state vector is depicted in Figure 5 (adapted from an illustration in Giles et al., 1995). Each component of the state vector is computed based on the values of all lower-numbered components at the same time step and the value of the component itself at the previous time step. Fahlman's RCC memory is incapable of representing certain classes of finite state automata. Details of these limitations are found in Giles et al. (1995) and Kremer (1996a, 1996b). LF memories have similar limitations, which are detailed in Frasconi and Gori (1996) and Kremer (1999).

Figure 5: RCC memory.

5.9 Infinite Impulse Response Filter Memories. The infinite impulse response (IIR) memories use connections between nodes that are modeled after IIR filters. While a standard connection performs a simple multiplication (multiplying the presented signal by the connection weight, z_i(t) = w_{ij} · x_j(t)), an IIR connection performs a more complex temporal computation:

z_i(t) = Σ_{τ=0}^{p−1} w^(x)_{ij}(τ) · x_j(t−τ) + Σ_{τ=1}^{q} w^(z)_{ij}(τ) · z_j(t−τ).   (5.19)
In both equations, x_j(t) represents the value transmitted into the connection at time t, and z_i(t) represents the value emitted from the other side of the connection at time t (before it passes through the destination node's nonlinearity). w^(x)(τ) is the weight multiplier for the τth previous input, and w^(z)(τ) is the weight multiplier for the τth previous output. It is possible to create an entire layer of such connections, connecting the STCN's input, x(t), to a layer of hidden processing units, h(t). In such a network, the transmitted values can be computed by

z(t) = Σ_{τ=0}^{p−1} W^(x)(τ) · x(t−τ) + Σ_{τ=1}^{q} W^(z)(τ) · z(t−τ),   (5.20)
Figure 6: IIR memory.
and the hidden unit activations by

h(t) = σ(z(t)).   (5.21)
This approach is illustrated in Figure 6. Of course, in practice, an STCN like this would require storage of not only the history of input patterns, x(t), x(t−1), x(t−2), . . . , but also the history of transmitted values, z(t−1), z(t−2), z(t−3), . . . . Thus, the state vector of such a network is defined as

s(t) = {x(t) ⊕ x(t−1) ⊕ ··· ⊕ x(t−n+1)}
     ⊕ {z(t−1) ⊕ z(t−2) ⊕ ··· ⊕ z(t−m)},   (5.22)

which can be computed recurrently as

f_s(s(t−1), x(t), 𝒲) ≡ x(t) ⊕ s(t−1)[1..(n−1)·dim(x(t))]
     ⊕ z(t−1) ⊕ s(t−1)[n·dim(x(t))+1 .. n·dim(x(t))+(m−1)·dim(z(t))].   (5.23)
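A scalar sketch of the single-connection computation in equation 5.19; the history-list conventions and names here are illustrative assumptions:

```python
def iir_connection(x_hist, z_hist, wx, wz):
    """One IIR connection, scalar case. x_hist[0] is x(t), x_hist[1] is
    x(t-1), ...; z_hist[0] is z(t-1), z_hist[1] is z(t-2), ...."""
    z = sum(w * xv for w, xv in zip(wx, x_hist))     # feedforward (FIR) part
    z += sum(w * zv for w, zv in zip(wz, z_hist))    # feedback (IIR) part
    return z
```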
This architecture differs from the LF memory described above in that the LF architecture sums the signals transmitted through numerous connections and optionally also passes the result through a nonlinearity before looping it back. By contrast, the IIR memory sends back the signal transmitted through each connection separately (prior to summation). The IIR memory has been used in Back and Tsoi (1990, 1991, 1993), Back, Wan, Lawrence, and Tsoi (1995), and Lawrence, Tsoi, and Back (1996). While some of these articles assume a general model in which the hidden unit activations can be computed in multiple layers, in practice only one layer of connections is typically used. The equations for the multilayer version can be derived in a fashion analogous to that used above.

5.10 First-Order Context Memory. Another approach to short-term memory that has been widely used is based on computing the STCN's state using a single-layer first-order feedforward network (Rumelhart, Hinton, & Williams, 1986). The network uses the previous state (also called context) and the current input as input and produces the new state as output. This approach, abbreviated FOC memory, is used in Elman's (1990a, 1991b) simple recurrent networks (SRN), Pollack's (1989, 1990, 1994) recurrent autoassociative memory (RAAM), Maskara and Noetzel's (1992) autoassociative recurrent network, and Williams and Zipser's (1989b) real-time recurrent learning (RTRL) networks. The FOC short-term memory is stored in a set of hidden units whose activations are computed based on the activations in a layer of input units and a layer of context units, and on the weights from these two layers to the state units. By convention (see Elman, 1990a), the activations of all the context units are initially set to 0 or 0.5 (though we will see a different approach in section 8.7) and subsequently set to the activation values of the hidden units at the previous time step (the number of context and hidden units must be equal).
Specifically, the hidden unit activation values are copied to the context units at each time step. Thus, at any given time, the hidden unit activations represent the current state, and the context unit activations represent the previous state. Thus, the function used to compute the state vector is

f_s(s(t−1), x(t), 𝒲) ≡ σ(W · {1 ⊕ x(t) ⊕ s(t−1)}),   (5.24)

where W is a matrix and a component of 𝒲 representing the adaptable weights of the connections in the memory. As in section 2.2, σ(x) represents
Figure 7: FOC memory.
the result of applying the function σ(x) = 1/(1 + e^{−x}) to each component of the vector x. The FOC short-term memory is depicted in Figure 7. Unlike WIT and NARX memories, whose state vector is equivalent to a finite number of previous input and output vectors, FOC memories use states based on the previous state vector and input vector. This implies that the state vector of this type of memory can contain information not found in recent input and output vectors. The FOC memory approach has also been employed in Elman (1988, 1989, 1990b, 1991a), Cleeremans, Servan-Schreiber, and McClelland (1989), Cleeremans and McClelland (1991), Servan-Schreiber et al. (1988), Servan-Schreiber, Cleeremans, and McClelland (1989, 1991, 1994), and Williams and Zipser (1989a). These networks have the representational power of finite state automata, as was shown in Kremer (1995) and Goudreau, Giles, Chakradhar, and Chen (1994).

5.11 Second-Order Context Memory. A variation on FOC memory involves using a single-layer second-order network to compute state based on previous state and current input. Second-order connections are connections that connect three nodes (rather than just two) and are a restricted type of the Sigma-Pi units studied by Rumelhart et al. (1986). In these connections, the activation of one of the nodes is modulated by that of another and transmitted as a signal to the third, which computes a weighted sum of all of the products and passes the value through a squashing function. The processing performed by first-order and second-order connections is illustrated in Figure 8.
Figure 8: First-order versus second-order connections. (a) Three first-order connections from nodes j, k, l to node i. (b) Two second-order connections, from j and k and from k and l, to node i. (c) Diagram of computation performed by first-order connections. (d) Diagram of computation performed by second-order connections.
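The contrast drawn in Figure 8 can be sketched numerically; these helper functions compute one unit's net input in each scheme (names and signatures are illustrative assumptions):

```python
def first_order(W_row, acts):
    """First-order net input: a_i = sum_j w_ij * a_j."""
    return sum(w * a for w, a in zip(W_row, acts))

def second_order(W_i, acts_x, acts_s):
    """Second-order net input: a_i = sum_j sum_k w_ijk * x_j * s_k, where one
    source activation multiplicatively gates the other before summation."""
    return sum(W_i[j][k] * acts_x[j] * acts_s[k]
               for j in range(len(acts_x)) for k in range(len(acts_s)))
```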
In the SOC memory, every possible pair consisting of one input unit and one context unit is connected to every state unit via a second-order connection with a temporal delay of one time step. The context units in this network operate in a manner analogous to the FOC memory. That is, their activation values are equal to the activation values of the corresponding state units at the previous time step. Thus,

f_s(s(t−1), x(t), 𝒲) ≡ σ((W · x(t)) · s(t−1)),   (5.25)
where W is a three-dimensional weight array component of 𝒲, and W · x(t) is a two-dimensional matrix, M, whose components M_{ij} are computed as

M_{ij} = Σ_k W_{ijk} · x_k(t).   (5.26)
Figure 9 depicts the SOC short-term memory. Note that only the connections leading to one of the state nodes are depicted here. The remaining 48 connections leading into the other three state nodes in the diagram are not illustrated in the interest of clarity.

Figure 9: SOC memory.

SOC memories have been used extensively by Giles et al. (1991, 1992a, 1992b), Giles and Omlin (1992), Goudreau et al. (1994), Liu, Sun, Chen, Lee, and Giles (1990), Miller and Giles (1993), Omlin and Giles (1994a, 1996b), Shaw and Mitchell (1990), Sun, Chen, Giles, Lee, and Chen (1990a), Sun, Chen, Lee, and Giles (1991), and Watrous and Kuhn (1992a, 1992b). The principle of this approach is based on using the input units to gate signals transmitted from context to state units. SOC networks are particularly adept at encoding finite state automata (the details of such an encoding are found in Goudreau et al., 1994).

5.12 Long Short-Term Memory. Another second-order approach has been pioneered by Hochreiter and Schmidhuber (1996, 1997a, 1997b). This architecture is specifically designed to avoid some of the limitations associated with the training of recurrent architectures. We will discuss these limitations in greater detail in section 8 but describe the architecture here. This approach is called long short-term memory (LSTM). The basic idea is to use both node activations and the net inputs to the nodes' transfer functions as state:

f_s(s(t−1), x(t), 𝒲) ≡ s(t−1)[1..dim(s)/2]
     + in(t) · σ(W_c · {s(t−1)[dim(s)/2+1 .. dim(s)] ⊕ x(t)})
     ⊕ out(t) · σ(s(t−1)[1..dim(s)/2]),   (5.27)
where in(t) and out(t) are two specialized gating units that control the flow of information into and out of the memory. Their activations are computed as

in(t) = σ(W_in · {s(t−1)[dim(s)/2+1 .. dim(s)] ⊕ x(t) ⊕ in(t−1) ⊕ out(t−1)}),   (5.28)

and

out(t) = σ(W_out · {s(t−1)[dim(s)/2+1 .. dim(s)] ⊕ x(t) ⊕ in(t−1) ⊕ out(t−1)}),   (5.29)
respectively. It is also possible to use multiple memory blocks of the type described by the equations above, each with its own input and output gating units. The LSTM memory approach has been shown by its developers to be very effective for learning a number of spatiotemporal input-output associations spanning long time periods.

5.13 Connectionist Pushdown Automaton Memory with Continuous Stack. Connectionist pushdown automaton (CPA) memories are radically different from the STCN memories that have previously been discussed. This is because in addition to storing state in the activations of nodes, these memories also store state on an external continuous stack in a hybrid connectionist-classical system. In this sense, they are like classical pushdown automata (Hopcroft & Ullman, 1979), whose state can be considered to be equivalent to the state of the finite controller and the contents of an unbounded stack. Hopcroft and Ullman refer to this type of extended state as an instantaneous description (ID). While a conventional finite state controller can send three discrete signals to the stack (pop, push, no action), the stack of a CPA memory must respond to continuous (numerical) signals received from the nodes in the connectionist controller. Similarly, while a conventional finite state controller retrieves discrete symbols from its stack, the connectionist controller of a CPA memory must be able to accept continuous values from its stack. These are requirements of the gradient-descent learning algorithms used in STCNs, which require a differentiable error function. The stack controller in a CPA memory is an extension of the SOC memories described above. While SOC memories are described by three classes of nodes (input, context, state), the stack controller in a CPA memory has two additional node classes (action nodes and read nodes), which are used to manipulate the external stack. There is only one action node, used to signal stack operations (pop a symbol, push a symbol, no change). The read nodes are used to encode the symbol at the top of the stack. CPA memories also use third-order connections instead of second-order connections, combining one input, one context, and one read node in a connection's transmission.
Since a pushdown automaton accepts input strings only if, after presentation of the entire string, the stack is empty, the error function for a CPA memory must take into account the length of the stack for legal strings. Since the error function for any gradient-descent algorithm must be continuous, the stack length of a CPA memory must also be represented by a continuous value. This continuity is achieved by giving each symbol in the stack a certain real-valued length and defining the stack length to be equal to the sum of the lengths of the symbols on the stack. Symbols are pushed onto or popped off the stack based on the real-valued activation of the action node. This activation value can range from −1 to +1. Activation values greater than a small constant ε are interpreted as a signal to push the current input symbol onto the stack, while activation values less than −ε are interpreted as a pop signal, and intermediate values are interpreted as no change in stack content. When a symbol (encoded as a vector) is pushed onto the stack, its length in the stack is defined to be equal to the activation value of the action node. When a pop operation occurs, the action node's activation value determines the length of stack that is deleted. Since symbol lengths in the stack can vary from ε to 1 and deletions can vary from ε to 1, there are three possible pop operations: (1) a pop that removes only part of a symbol from the stack (reducing the symbol's length), (2) a pop that removes an entire symbol from the stack, or (3) a pop that removes multiple symbols from the stack. After CPA memories have been trained, push and pop operations typically become discretized. The second additional node class consists of read nodes whose activation values are equal to the symbol encoded at the top of the stack. If the symbol at the top of the stack has a length of less than one, its activation vector is multiplied by its length and added to subsequent symbol vectors to form the read node activation values, until a total length of 1 is read. In the CPA memory, the read nodes are treated as input nodes, and the action node is implemented as a special output node. Third-order connections lead from every triple, consisting of an input, state, and read node, to every output unit as well as the action node. For our purposes, the state vector s(t) will represent only the state of the stack controller and not the stack itself, since the stack is external to the network and cannot be easily represented in the finite-length vector s(t). Thus,

f_s(s(t−1), x(t), 𝒲) ≡ σ(((W · x(t)) · s(t−1)) · r(t)),   (5.30)
where r(t) is the vector representing the symbol read from the stack at time t, W is a four-dimensional array corresponding to the connection weights to the state units, and W is a component of 𝒲. Here, the multiplications are performed such that

[((W · x(t)) · s(t−1)) · r(t)]_i = Σ_l Σ_k Σ_j W_{ijkl} · x_l(t) · s_k(t−1) · r_j(t).   (5.31)
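The continuous push/pop semantics described above can be sketched as follows; the (symbol, length) list representation, the parameter names, and the ε value are illustrative assumptions:

```python
def stack_action(stack, a, symbol, eps=0.1):
    """Continuous stack step. stack is a list of (symbol, length) pairs with
    the top at the end; a in (-1, 1) is the action node's activation:
    a > eps pushes symbol with length a, a < -eps pops total length -a
    (possibly removing part of a symbol, or several symbols), and values
    in between leave the stack unchanged."""
    if a > eps:                          # push current symbol with length a
        return stack + [(symbol, a)]
    if a < -eps:                         # pop total length -a from the top
        stack, remaining = list(stack), -a
        while remaining > 0 and stack:
            sym, length = stack.pop()
            if length > remaining:       # partial pop: shorten the top symbol
                stack.append((sym, length - remaining))
                remaining = 0.0
            else:                        # whole-symbol pop; may continue down
                remaining -= length
        return stack
    return stack                         # no change
```

Popping length 1.0 from a stack holding B (0.9) above A (0.6) removes all of B and a 0.1 slice of A, leaving A with length 0.5.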
Figure 10: The CPA memory.
The CPA memory is illustrated in Figure 10. Note that only two of the third-order connections are shown in order to simplify the illustration. CPA memories have been studied by Das, Giles, and Sun (1993), Giles et al. (1990), Sun, Chen, Giles, Lee, and Chen (1990b), Sun et al. (1990a), and Zeng, Goodman, and Smyth (1994).

5.14 Connectionist Turing Machine Memory. Williams and Zipser (1989b) developed a connectionist Turing machine (CTM) memory by replacing the finite state controller of a classical Turing machine with an FOC memory in order to design an STCN capable of universal computation. The connectionist network controls the operations performed on a standard Turing machine tape (move left, no change, write 1, write 0, move right) based on the activations of five action nodes and reads the current symbol at the tape head into its read nodes. This network was never trained to learn the tape operations. Instead, the tape operations were supplied to the network during training, so only the control mechanism needed to be learned. For this reason, a continuous version of a tape was not required. The state vector for a CTM memory (which does not represent the external tape) is defined as

f_s(s(t−1), x(t), 𝒲) ≡ σ(W · {1 ⊕ s(t−1) ⊕ r(t)}).   (5.32)
A CTM memory is illustrated in Figure 11. Only the connections to one of the state nodes are shown.

6 Computing the Output Vector

We now examine the instantiations of the output vector function, f_y(s(t), 𝒲). In particular, we discuss the three leading alternatives: zero-layer, one-layer, and two-layer output functions.
Figure 11: CTM memory.
6.1 Zero Layer. The simplest possible manner in which to compute the output of an STCN given the state vector is to use the state vector (or a portion thereof) as the output. Without loss of generality, we assume that the first few components of the state vector represent the output. This simple way of computing STCN output can then be easily represented:

\vec{f}_y(\vec{s}(t), W) = \vec{s}(t)[1 \ldots \dim(\vec{y})].  (6.1)
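The slicing in equation 6.1 amounts to nothing more than reading off the leading state components; a minimal sketch, with arbitrary illustrative values:

```python
import numpy as np

# Sketch of equation 6.1: the output is simply the first dim(y)
# components of the state vector. Values are illustrative.
s = np.array([0.9, 0.1, 0.4, 0.7, 0.2])   # state vector s(t)
dim_y = 2
y = s[:dim_y]                              # zero-layer output
assert np.array_equal(y, np.array([0.9, 0.1]))
```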
This zero-layer approach to output computation has been used by Giles et al. (1992b), Giles and Omlin (1992, 1993b), Goudreau et al. (1994), Liu et al. (1990), Miller and Giles (1993), Omlin and Giles (1994a), Shaw and Mitchell (1990), Sun et al. (1990a, 1990b, 1991), Watrous and Kuhn (1992a, 1992b), Das et al. (1993), Giles et al. (1990), Sun, Giles, Chen, and Lee (1993), Zeng et al. (1994), and Williams and Zipser (1989b). It is illustrated in Figure 12a. Zero-layer output functions are typically used with computationally powerful memories, which can represent the state of the network in an arbitrary form, including one in which the low-numbered nodes also contain the appropriate output vector. If used with a memory in which encodings are limited (as with FOC memories), a zero-layer output function will limit the types of outputs that can be computed.

6.2 One Layer. The next output function, increasing in complexity, computes the output vector using an additional layer of nodes. These nodes are fully connected to all the nodes that form the state vector,
Figure 12: Output functions of STCNs. (a) Zero layer. (b) One layer. (c) Two layer.
giving an output function defined by:

\vec{f}_y(\vec{s}(t), W) = \vec{\sigma}(W^{(out)} \cdot \{1 \oplus \vec{s}(t)\}),  (6.2)
where W^{(out)} is a matrix of connection weights between the state and output nodes (W^{(out)}_{i,j} is a component of W). This one-layer approach to output computation has been used in Elman's SRNs (Elman, 1990a, 1991b), Pollack's RAAM (Pollack, 1989, 1990), Maskara and Noetzel's AARNs (Maskara & Noetzel, 1992), and Fahlman's RCC (Fahlman, 1991), and is illustrated in Figure 12b. The one-layer approach is typically used with memories that can encode their state in such a manner that the desired output can be extracted from the state using a linear mapping (for details, see Kremer, 1995). For networks that can adopt arbitrary state representations, one-layer networks are not required, though they can provide a more parsimonious representation of the desired mappings.

6.3 Two Layer. The logical next step is to use a two-layer system, where

\vec{f}_y(\vec{s}(t), W) = \vec{\sigma}(W^{(out2)} \cdot \{1 \oplus \vec{\sigma}(W^{(out1)} \cdot \{1 \oplus \vec{s}(t)\})\}),  (6.3)

and W^{(out1)} and W^{(out2)} are components of W representing the weights to the first and second output layers of connections, respectively. This approach to computing output is employed by NETtalk (Sejnowski & Rosenberg, 1988) and by Narendra and Parthasarathy (1990) and is illustrated in Figure 12c. As noted earlier, for the latter case, the output of the NARX network could also be interpreted as the network's state, in which case a zero-layer output
function would be used. Two-layer outputs are commonly used with memories that are static and thus cannot be manipulated to form a convenient representation for the desired output. They can also be used to provide more parsimonious representations.

7 Initializing the Parameters

We now turn our attention to the initialization of the trainable parameters, W. Not surprisingly, initialization can have a significant effect on learning performance.

7.1 Randomization of Initial Parameters. The simplest and most common approach is to randomize the trainable parameters according to an arbitrary probability distribution. The greatest advantage of this approach is that no a priori knowledge about the problem to be solved is required. But more sophisticated approaches have been suggested for cases when additional knowledge is available to set initial weight values.

7.2 Split-State Encoding. For many applications, the researcher may already have some prior information about potential types of solutions that are to be learned. The most natural way to incorporate this type of a priori information about a problem to be solved by an STCN is to encode it in the weights of the network, since it is the weights that define the computation a network performs. By initializing the weights of an STCN based on the a priori information, network training can begin at a point in weight space that lies close to a final solution. Training can then enhance the initial weights to drive the network to an even better solution. This process has been called refinement (Giles & Omlin, 1993a, 1993b).

The key to initializing an STCN lies in the encoding of the a priori knowledge in the connection weights of the STCN. Kremer (1995) has described a method for translating finite state machines into FOC memories. The approach is based on a state-splitting technique (also described in Goudreau et al., 1994). Each state in the finite state machine corresponds to one or more state nodes in the FOC. If one of these state nodes has a high activation value, then the STCN is "in" the corresponding state. Once the encoding of states has been defined, a straightforward method can be used to set weights such that the desired state transitions are implemented in the system. (For more details, see Kremer, 1995.) The advantage of this approach lies in its ability to encode finite state machine rules into a network. The disadvantages are that finite state machine rules representing a priori information about real-world problems are often not readily available and that the number of hidden units required is equal to the number of input symbols in the automaton times the number of states.
7.3 Direct Encoding. Das, Giles, Omlin, and others (Giles & Omlin, 1992, 1993a, 1993b; Omlin & Giles, 1992, 1994a, 1996a, 1996b; Das et al., 1993) have developed a more straightforward method for incorporating finite state transitions into STCNs using SOC memories. In this architecture, it is possible to map each state in the finite state machine directly to one state node in the STCN. The result of this more direct mapping is that it is also possible to map each line in the transition table of the finite state machine to one second-order connection in the STCN. This affords the implementor the possibility of incorporating partial information about a finite state automaton (i.e., a subset of the state transitions) in the SOC memory. The direct encoding is advantageous in that it offers a trivial mapping from finite state machines to network weights (each state transition corresponds to exactly one weight) and that each state requires only one hidden unit. Once again, however, real-world rules are not readily available.

7.4 Chain Graph Encoding. A third approach to encoding an automaton in an STCN has been developed by Frasconi, Gori, Maggini, and Soda (1991, 1995). They encode a chain graph automaton in a network using a linear programming algorithm. This approach is particularly appealing for speech applications because phonetic and acoustic sequences can be readily translated into chains.

8 Updating the Parameters

Most of the parameter change functions described here and in the STCN literature are based on applying the gradient-descent algorithm to the weights of the network. In particular, changes are made in proportion to the negative of the error gradient in weight space according to the equation

\Delta W \equiv -\gamma \cdot \nabla e,  (8.1)

where \gamma is a small, real-valued constant, and e is the total network error defined in section 2.5. Expanding e gives

\Delta W \equiv -\gamma \cdot \nabla \sum_t \frac{1}{2}\|\vec{E}(t)\|^2.  (8.2)
In conventional connectionist learning tasks, it is computationally convenient to evaluate this function piecewise over time by making weight changes after presenting each input pattern, as illustrated in the algorithm of Table 7. For many spatiotemporal problems, this type of evaluation becomes a necessity since there is a potentially infinite number of sample patterns, and hence it would be impossible to wait until all had been presented before adjusting the weights. Thus, weight changes are made according to
the formula

\Delta W(t) \equiv -\gamma \cdot \nabla \frac{1}{2}\|\vec{E}(t)\|^2.  (8.3)
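A toy sketch contrasting the batch update of equation 8.2 with the per-pattern update of equation 8.3, using an assumed one-parameter linear model (none of these values come from the text):

```python
import numpy as np

# Batch (8.2) vs. per-pattern (8.3) updates on a toy model y(t) = w * x(t)
# with error (1/2)(y*(t) - y(t))^2; data generated by w* = 2.
x = np.array([1.0, 2.0, 3.0])
y_target = np.array([2.0, 4.0, 6.0])

def grad(w, xi, yi):
    # gradient of (1/2)(y* - w*x)^2 with respect to w
    return -(yi - w * xi) * xi

gamma, w_batch, w_online = 0.01, 0.0, 0.0
# Equation 8.2: one step using the gradient summed over all t.
w_batch -= gamma * sum(grad(w_batch, xi, yi) for xi, yi in zip(x, y_target))
# Equation 8.3: one step per pattern, gradient re-evaluated each time.
for xi, yi in zip(x, y_target):
    w_online -= gamma * grad(w_online, xi, yi)

# For a small learning rate the two trajectories stay close.
assert abs(w_batch - w_online) < 0.02
```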
Note that in the limit as \gamma approaches 0, the effect of repeated applications of equation 8.3 approaches that of equation 8.2. However, for larger values of \gamma, successive evaluations of the gradient in equation 8.3 will occur at different points in the weight space. Equation 8.3 is thus an approximation to the gradient, which improves as \gamma \to 0. We now turn our attention to evaluating the gradient,
\nabla \frac{1}{2}\|\vec{E}(t)\|^2 = \sum_{i,j} I_{i,j} \frac{\partial}{\partial W_{i,j}} \frac{1}{2} \sum_k \vec{E}_k(t)^2,  (8.4)
where I_{i,j} represents a matrix whose components are all 0 except for the (i,j)th component, whose value is 1, and where W_{i,j} and \vec{E}_k(t) represent the (i,j)th and kth components of W and \vec{E}(t), respectively. This equation can be simplified to
\nabla \frac{1}{2}\|\vec{E}(t)\|^2 = \sum_{i,j} I_{i,j} \sum_k \vec{E}_k(t) \frac{\partial \vec{E}_k(t)}{\partial W_{i,j}},  (8.5)
and further to
\nabla \frac{1}{2}\|\vec{E}(t)\|^2 = \sum_{i,j} I_{i,j} \sum_k (\vec{y}^*_k(t) - \vec{y}_k(t)) \frac{\partial \vec{y}_k(t)}{\partial W_{i,j}}.  (8.6)
The weight change functions discussed below differ in how they compute the final gradient in this equation.

8.1 Full Gradient Descent (FGD). The final gradient can be computed by expanding the numerator to its maximum extent (using the function that computes state) and then computing the gradient of the resulting expression:

\frac{\partial \vec{y}_k(t)}{\partial W_{i,j}} = \frac{\partial}{\partial W_{i,j}} \Big[\vec{f}_y\big(\vec{f}_s(\vec{f}_s(\vec{f}_s(\cdots \vec{f}_s(\vec{s}(0), \vec{x}(0), W) \cdots, \vec{x}(t-2), W), \vec{x}(t-1), W), \vec{x}(t), W), W\big)\Big]_k.  (8.7)
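The unrolled derivative in equation 8.7 can be accumulated forward through time, in the spirit of RTRL; below is a minimal sketch for an assumed one-unit network s(t) = tanh(w * s(t-1) + x(t)) with y(t) = s(t), checked against a finite-difference estimate:

```python
import numpy as np

# Forward accumulation of the unrolled derivative dy(t)/dw for a
# one-unit recurrent net; the network, inputs, and weight value are
# illustrative assumptions.
def run(w, xs):
    s, ds_dw = 0.0, 0.0
    for x in xs:
        new_s = np.tanh(w * s + x)
        # chain rule through the recurrence:
        # ds(t)/dw = (1 - s(t)^2) * (s(t-1) + w * ds(t-1)/dw)
        ds_dw = (1 - new_s**2) * (s + w * ds_dw)
        s = new_s
    return s, ds_dw

xs = [0.5, -0.3, 0.8]
w = 0.7
y, dy_dw = run(w, xs)

# Check against a central finite difference.
eps = 1e-6
y_plus, _ = run(w + eps, xs)
y_minus, _ = run(w - eps, xs)
assert abs(dy_dw - (y_plus - y_minus) / (2 * eps)) < 1e-6
```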
288
Stefan C. Kremer
If the state function does not use the value of W (e.g., a WIT memory), the computation of this derivative is quite simple, since it reduces to the computation of the weight gradient in error space in a nonrecurrent network. If the state function does use the value of W, then the computation of this derivative is much more complicated, due to its recurrent nature. A number of different algorithms have been suggested to perform the full gradient-descent computation in an efficient manner. Typically, the approaches trade space complexity for time complexity. Since all of the different FGD algorithms compute the same result, and we will not be doing a space or time complexity analysis of the learning algorithms used by STCNs, we will not discuss differences in implementations of full gradient descent; we refer interested readers to Rumelhart et al. (1986) for a description of backpropagation through time (BPTT), Williams and Zipser (1988, 1989b) for a description of real-time recurrent learning (RTRL), Schmidhuber (1992) for a BPTT/RTRL hybrid approach, and Sun et al. (1992) for a Green's function approach to gradient descent. It is worth noting, however, that BPTT is not considered a real-time algorithm because the length of time and amount of memory required to compute a weight update are linear in the length of the input sequence being considered. By contrast, RTRL and the approaches of Schmidhuber and Sun et al. require constant time and memory to compute a weight update (regardless of input pattern length). Full gradient descent has been used by Sejnowski and Rosenberg (1987, 1988), Giles et al. (1991, 1992a, 1992b), Giles and Omlin (1992), Goudreau et al. (1994), Liu et al. (1990), Miller and Giles (1993), Omlin and Giles (1994a, 1994b, in press), Shaw and Mitchell (1990), Sun et al. (1990a, 1991, 1993), Watrous and Kuhn (1992a, 1992b), Das et al. (1993), Giles et al. (1990), Zeng et al. (1994), and Williams and Zipser (1989b).
The equation for weight change for full gradient descent is:

f_\Delta(\vec{E}(t), W, \vec{x}(\cdot)) = -\gamma \cdot \sum_{i,j} I_{i,j} \sum_k (\vec{y}^*_k(t) - \vec{y}_k(t)) \frac{\partial}{\partial W_{i,j}} \Big[\vec{f}_y\big(\vec{f}_s(\vec{f}_s(\vec{f}_s(\cdots \vec{f}_s(\vec{s}(0), \vec{x}(0), W) \cdots, \vec{x}(t-2), W), \vec{x}(t-1), W), \vec{x}(t), W), W\big)\Big]_k.  (8.8)

An important limitation of the full gradient-descent approach has been independently identified by Hochreiter (1991) and Bengio, Frasconi, and Simard (1993, 1994). This limitation stems from the fact that as temporal sequences increase in length, the influence of early components of the sequence on the system's output becomes smaller and smaller. This causes the
partial gradients, which define the weight changes, to shrink to zero as sequence lengths increase (a solution to this problem can be found in the LSTM memory described in section 5.12).

A special case of full gradient descent occurs for locally recurrent networks like the LF memories described above. In locally recurrent networks, the complexity of the computation of the gradient is significantly reduced, to the point where the computation can be performed in a manner that is local in both space and time. This approach has been used in LF memories (Gori et al., 1989; Frasconi et al., 1992). Another variation on this approach updates not only connection weights by means of gradient descent, but also other parameters of the network, such as the values \vec{m} and \vec{v} in equation 5.10. This approach has been used by de Vries and Principe (1991, 1992a).

8.2 Teacher Forcing. If an STCN's state (or part thereof) is equivalent to the STCN's previous output, then a simplification of full gradient descent, called teacher forcing (TF), can be used. NARX memories use a state based entirely on input and output. Also, any STCN that uses a zero-layer output function uses part of the output vector as its state vector. In either of these cases, it is possible to substitute the value of the desired output, \vec{y}^*, for the value of the actual output, \vec{y}, whenever the latter occurs in the state vector's computation. This can eliminate a great deal of the recurrence in the equation and thus simplify computation. The equation for teacher-forced weight update is:

f_\Delta(\vec{E}(t), W, \vec{x}(\cdot)) = -\gamma \cdot \sum_{i,j} I_{i,j} \sum_k (\vec{y}^*_k(t) - \vec{y}_k(t)) \frac{\partial}{\partial W_{i,j}} \Big[\vec{f}_y\big(\vec{f}_s(\vec{f}_s(\vec{f}_s(\cdots \vec{f}_s(\vec{s}(0), \vec{x}(0), W)|_{\vec{y}(0) \equiv \vec{y}^*(0)} \cdots, \vec{x}(t-2), W)|_{\vec{y}(t-2) \equiv \vec{y}^*(t-2)}, \vec{x}(t-1), W)|_{\vec{y}(t-1) \equiv \vec{y}^*(t-1)}, \vec{x}(t), W)|_{\vec{y}(t) \equiv \vec{y}^*(t)}, W\big)\Big]_k.  (8.9)
Intuitively, this is equivalent to assuming that the network has already learned to produce the desired output for all previous time steps when computing the weight change for the current time step. This approach has been used by Jordan (1986a, 1986b), Williams and Zipser (1989a), and Narendra and Parthasarathy (1990).
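A minimal sketch of this substitution for an assumed one-unit network whose state is its previous output; the update feeds the target y*(t-1) back in place of the actual output (the network, values, and names here are illustrative assumptions):

```python
import numpy as np

# Teacher forcing for y(t) = tanh(w * y(t-1) + x(t)): during training,
# the desired output y*(t-1) replaces the actual y(t-1), so the
# gradient needs no recurrent term.
def tf_step(w, y_star_prev, x, y_star, gamma=0.1):
    y = np.tanh(w * y_star_prev + x)        # forced with y*(t-1)
    # gradient of (1/2)(y* - y)^2 w.r.t. w, with no recursion
    grad = -(y_star - y) * (1 - y**2) * y_star_prev
    return w - gamma * grad, y

w = 0.0
y_star = [0.4, 0.6, 0.5]   # desired outputs
xs = [0.2, 0.1, 0.3]       # inputs
prev = 0.0
errs = []
for x, t in zip(xs, y_star):
    w, y = tf_step(w, prev, x, t)
    errs.append(abs(t - y))
    prev = t   # teacher forcing: feed the target, not the output
```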
290
Stefan C. Kremer
Output
Hidden
Input Figure 13: An encoder network.
8.3 Truncated Gradient Descent. Another approach to simplifying the gradient computation is not to expand the output equation completely prior to computing the derivative. This is referred to as truncated gradient descent (TGD). In this method the current state is evaluated, but the previous state is not. This results in the following weight update equation:

f_\Delta(\vec{E}(t), W, \vec{x}(\cdot)) \equiv -\gamma \cdot \sum_{i,j} I_{i,j} \sum_k (\vec{y}^*_k(t) - \vec{y}_k(t)) \frac{\partial}{\partial W_{i,j}} \Big[\vec{f}_y\big(\vec{f}_s(\vec{s}(t-1), \vec{x}(t), W), W\big)\Big]_k.  (8.10)
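A minimal sketch of the truncated derivative for an assumed one-unit network: s(t-1) is treated as a constant, so no recursion through earlier time steps is needed (the network and values are illustrative assumptions):

```python
import numpy as np

# Truncated gradient descent for s(t) = tanh(w * s(t-1) + x(t)),
# y(t) = s(t): the derivative is taken only through the current step.
def tgd_update(w, s_prev, x, y_star, gamma=0.1):
    s = np.tanh(w * s_prev + x)
    # truncated derivative: dy(t)/dw = (1 - s^2) * s(t-1), no recursion
    grad = -(y_star - s) * (1 - s**2) * s_prev
    return w - gamma * grad, s

w, s = 0.5, 0.1
for x, y_star in [(0.2, 0.3), (-0.1, 0.1), (0.4, 0.5)]:
    w, s = tgd_update(w, s, x, y_star)
assert np.isfinite(w)
```

Because the recursion is dropped, each update costs the same as a feedforward step, at the price of following only an approximation to the true gradient.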
Since this equation is only an approximation to the true gradient, an STCN using this technique is not guaranteed to follow the true gradient in its search for a local minimum. Further, a local minimum in the error space defined by this function may not correspond to a local minimum in the original error space. This form of weight adjustment has been used by Elman (1990a, 1991b), Cleeremans et al. (1989), and Servan-Schreiber et al. (1988, 1989, 1991).

8.4 Autoassociative Gradient Descent. By definition, the memory vector of an STCN must contain some information about previous inputs. This implies that the weights in the network should be adapted to preserve input information in the memory vector. One way of accomplishing this task is to force the state explicitly to encode the input and the previous state by using an encoder network. An encoder network is a nonrecurrent network consisting of an input layer, a hidden layer, and an output layer. The input and output layers are of equal size, and the hidden layer is smaller (see Figure 13). The target vector for the output units is always equivalent to the current input vector. By training the output layer to reproduce the
Figure 14: Extraction units are trained to reproduce input and previous state.
input, the smaller hidden layer adopts a compressed representation of the input. Autoassociative gradient descent (AAGD) weight update works by temporarily creating an additional set of "extraction" units (see Figure 14). These units receive their inputs from the state units and are trained to reproduce the current input and previous state. It is the difference between the activations of the "extraction" units and the activations of the input units that defines the error that is minimized by gradient descent. Thus, the memory is trained not to produce a desired output pattern, but rather to preserve information about inputs and state. Of course, this information can then be used to generate a desired output pattern using any of the systems described in section 6. By training in this fashion, the state computation becomes the first step of an encoder network and is thus trained to encode the state as
a compressed representation of the input and previous state. Once training is complete, the extraction units can be removed, leaving a network that encodes state. Furthermore, training can be accomplished using the classical feedforward algorithm on the architecture shown in Figure 14. In this approach, a complete coding of the sequence must be developed in the state representations (a requirement not found in other approaches, where only relevant input information need be encoded). Mathematically, if the vector \vec{u}(t) represents the activations of the "extraction" units, then the weight update equation is:

f_\Delta(\vec{E}(t), W, \vec{x}(\cdot)) \equiv \sum_{i,j} I_{i,j} \sum_k \left[(\vec{x}(t) \oplus \vec{s}(t-1))_k - \vec{u}_k(t)\right] \frac{\partial}{\partial W_{i,j}} \Big[\vec{\sigma}\big(W^{(u)} \cdot \vec{f}_s(\vec{s}(t-1), \vec{x}(t), W)\big)\Big]_k,  (8.11)

where W^{(u)} is an additional matrix of weights from the state units to the extraction units, which is simultaneously trained according to the equation:

\Delta W^{(u)} \equiv \sum_{i,j} I_{i,j} \sum_k \left[(\vec{x}(t) \oplus \vec{s}(t-1))_k - \vec{u}_k(t)\right] \frac{\partial}{\partial W^{(u)}_{i,j}} \Big[\vec{\sigma}\big(W^{(u)} \cdot \vec{s}(t)\big)\Big]_k.  (8.12)
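A toy sketch of this scheme with assumed linear state and extraction units and a plain least-squares gradient step (a simplification of equations 8.11 and 8.12, not a faithful implementation; all sizes, rates, and names are illustrative assumptions):

```python
import numpy as np

# Autoassociative training: the state compresses {x(t) (+) s(t-1)}, and
# the extraction units are trained to reproduce that concatenation.
rng = np.random.default_rng(3)
n_in, n_state = 4, 2
W = rng.standard_normal((n_state, n_in + n_state)) * 0.1    # state weights
W_u = rng.standard_normal((n_in + n_state, n_state)) * 0.1  # extraction weights

x, s_prev = rng.standard_normal(n_in), np.zeros(n_state)
target = np.concatenate((x, s_prev))   # what the extraction units must match

gamma = 0.02
for _ in range(500):
    s = W @ target                     # state = compressed code
    u = W_u @ s                        # extraction-unit activations
    err = target - u
    # gradient steps on the loss (1/2)||target - W_u W target||^2:
    W_u += gamma * np.outer(err, s)                          # extraction weights
    W += gamma * (W_u.T @ err)[:, None] * target[None, :]    # state weights

# Reconstruction through the bottleneck should improve over the start.
assert np.linalg.norm(target - W_u @ (W @ target)) < np.linalg.norm(target)
```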
AAGD weight update is used by Maskara and Noetzel (1992) and Pollack (1989, 1990).

8.5 Stack Learning. Adjustment of weights in an STCN with a continuous stack (see section 5.13) can be accomplished by means of a full gradient-descent procedure. In order to take advantage of the stack, however, it is also necessary to retrieve information from the stack and perform push and pop operations on it. In conventional pushdown automata (PDA), such operations are discrete. Das et al. (1993) developed an ingenious way to incorporate the use of a stack by relying on an error term added to the conventional Euclidean error measure. The stack used, unlike in a conventional PDA, is a continuous device capable of pushing partial symbols onto itself as well as popping multiple symbols at a time. The continuous nature of the stack extends to the number of symbols on the stack at any given time (i.e., the stack's length). By adding an error term equal to the square of the stack length at the end of a string, l, for legal strings, the network can be trained to produce an empty stack after processing any legal string. The length of the stack is equal to the sum of the activation values of the action node throughout the string, so it is possible to minimize error by backpropagating the error from the action node. Then,

f_\Delta(\vec{E}(t), W, \vec{x}(\cdot)) = -\gamma \cdot \nabla \frac{1}{2}\left(\|\vec{E}(t)\|^2 + l^2\right)  (8.13)
for t = T and zero for other values of t. Stack learning (SL) has been studied by Das et al. (1993), Giles et al. (1990), Sun et al. (1990a, 1993), and Zeng et al. (1994).

8.6 Alopex. Unnikrishnan and Venugopal (1994) and Forcada and Carrasco (1995) have experimented with an alternative to gradient descent called the Alopex algorithm. This algorithm employs a simulated-annealing type of approach to adapting parameters. Specifically, weight changes are made probabilistically according to:

f_\Delta(\vec{E}(t), W, \vec{x}(\cdot)) = \Delta W,  (8.14)

W_{ji}(t) = \begin{cases} W_{ji}(t-1) - d & \text{with probability } p_{ji}(t) \\ W_{ji}(t-1) + d & \text{with probability } 1 - p_{ji}(t), \end{cases}  (8.15)

where d is a small, positive constant. The probability of making a negative weight change is computed as:

p_{ji}(t) = \frac{1}{1 + e^{-dW_{ji}(t) \cdot dE(t) / T(t)}},  (8.16)

where T(t) represents a temperature and

dW_{ji}(t) = W_{ji}(t-1) - W_{ji}(t-2),  (8.17)

dE(t) = E(t-1) - E(t-2).  (8.18)

Temperature is controlled by an annealing schedule, which gradually reduces the number of changes that are not correlated with decreasing error:

T(t) = \frac{d}{N} \sum_{t'=t-N}^{t-1} |dE(t')|.  (8.19)
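A minimal sketch of the Alopex update (equations 8.14 through 8.19) applied to an assumed one-weight quadratic error; the values of d, N, and the error function are illustrative choices, not from the text:

```python
import numpy as np

# Alopex on a toy error surface E(w) = (w - 2)^2, single weight.
rng = np.random.default_rng(4)
d, N = 0.01, 10
w_hist = [0.0, 0.01]                       # W(t-2), W(t-1)
E = lambda w: (w - 2.0) ** 2
E_hist = [E(w_hist[0]), E(w_hist[1])]
dE_hist = []

for t in range(2, 2000):
    dW = w_hist[-1] - w_hist[-2]           # eq. 8.17
    dE = E_hist[-1] - E_hist[-2]           # eq. 8.18
    dE_hist.append(dE)
    # eq. 8.19: temperature from the last N error changes
    T = (d / N) * sum(abs(v) for v in dE_hist[-N:])
    T = max(T, 1e-12)                      # guard against zero temperature
    p = 1.0 / (1.0 + np.exp(-dW * dE / T))     # eq. 8.16
    step = -d if rng.random() < p else +d      # eq. 8.15
    w_new = w_hist[-1] + step
    w_hist.append(w_new)
    E_hist.append(E(w_new))

# The walk drifts toward the minimum at w = 2 and oscillates around it.
assert abs(w_hist[-1] - 2.0) < abs(w_hist[0] - 2.0)
```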
Using this probabilistic approach, it is possible to avoid getting stuck in local minima.

8.7 Training the Initial State. While section 5 described how to compute new values of the STCN's state vector based on previous input, output, or state vectors, we usually assumed that the initial state vector, \vec{s}^{(init)} = \vec{s}(0), was a vector consisting entirely of components with value 0 or one-half. A better solution has been proposed by Forcada and Carrasco (1995) and Bulsari and Saxén (1995), who recommend training the initial state vector, \vec{s}^{(init)}, just like any other adaptable parameter in the network. That is, \vec{s}^{(init)} becomes part of \mathcal{W}:

\mathcal{W} = \langle W, \vec{s}^{(init)} \rangle.  (8.20)
Then the component W can be trained using any one of the weight adaptation techniques described above. \vec{s}^{(init)} is updated according to the Alopex algorithm, where each component i is updated as:

\vec{s}^{(init)}_i(t) = \begin{cases} \vec{s}^{(init)}_i(t-1) - d & \text{with probability } p_i(t) \\ \vec{s}^{(init)}_i(t-1) + d & \text{with probability } 1 - p_i(t), \end{cases}  (8.21)

f_\Delta(\vec{E}(t), \mathcal{W}, \vec{x}(\cdot)) = \langle \Delta W, \delta\vec{s}^{(init)} \rangle.  (8.22)
This approach can improve training time and success rate (see Forcada and Carrasco, 1995).

8.8 Manual Architecture Changes. Finally, we turn our attention to the process that adjusts the size of the state vector. Most networks assume a fixed-size state vector, but this is not the only option. Instead, the state vector size can be considered part of the trainable parameters:
\mathcal{W} = \langle W, \dim(\vec{s}) \rangle.  (8.23)

If the state vector size is adaptable, then the size of the weight matrix of connections leading into or out of the state vector must also be adaptable. By far the most common approach to adjusting the state vector size to a given problem is a manual trial-and-error procedure. That is, some heuristic is used to guess a suitable number of state nodes, and more are added only if those nodes are unable to learn the given problem. If overgeneralization occurs, nodes may also be removed. Note that the new architecture does not reuse any of the weight parameters of the previous architecture; all weight values are reinitialized when the state vector size is changed:

f_\Delta(\vec{E}(t), \mathcal{W}, \vec{x}(\cdot)) = -\mathcal{W} + \langle W^{(new)}, \dim(\vec{s}^{(new)}) \rangle.  (8.24)
8.9 Automatically Incrementing Nodes. The inadequacy of the trial-and-error method just described led Fahlman (1991) and Giles et al. (1995) to devise automatic incremental architecture adjustment techniques. These techniques create new processing units in the network to represent memory as required. If the network is unable to learn by weight adjustment, then new state nodes are added individually. Each state node is connected so as to compute the desired state function used by the STCN. Unlike in the trial-and-error approach, existing weights are maintained, while the weights to and from the new node are initialized to small values. In this sense, the short-term memory is increased until it is large enough to solve the given problem. Figure 15 illustrates the growth of the memory in an RCC memory; growth in other types of memory can be done similarly. The parameter update equation for
Figure 15: Adding nodes to a LRSI memory. (a) Before any nodes are added. (b) After one node is added. (c) After two nodes are added. (d) After the addition of a third node.
node incrementation is

f_\Delta(\vec{E}(t), W, \vec{x}(\cdot)) = W \oplus W^{(random)},  (8.25)

where \oplus W^{(random)} represents the addition of any necessary rows and columns with randomized values to the original matrix.

8.10 Other Work Describing Weight Change Functions. For an excellent discussion of weight update functions in the context of nonlinear adaptive filtering, see Nerrand, Roussel-Ragot, Personnaz, Dreyfus, and Marcos (1993). Pearlmutter (1995) has also written a comprehensive review of gradient-based weight change functions for STCNs.
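A minimal sketch of equation 8.25: the existing weight matrix is retained, and the rows and columns for a new node are appended with small random values (the sizes and the 0.01 scale below are assumptions):

```python
import numpy as np

# Growing a state-to-state weight matrix by one node, keeping the
# existing weights and randomizing only the new row and column.
rng = np.random.default_rng(5)
W = rng.standard_normal((3, 3))           # existing weights
W_before = W.copy()

grown = np.zeros((4, 4))
grown[:3, :3] = W                              # existing weights maintained
grown[3, :] = 0.01 * rng.standard_normal(4)    # new row: into the new node
grown[:3, 3] = 0.01 * rng.standard_normal(3)   # new column: out of the new node

assert np.allclose(grown[:3, :3], W_before)
assert grown.shape == (4, 4)
```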
Table 9: Specific STCN Designs.

Network | State Function | Output Function | Initial State | Training Algorithm
Window in time: Sejnowski and Rosenberg (1987, 1988) | WIT | Two-layer | Randomization | FGD & MAC
Neural network infinite impulse response filter: Narendra and Parthasarathy (1990) | NARX | Two-layer | Randomization | TF
Simple recurrent network: Elman (1990a, 1991b) | FOC | One-layer | Randomization | TGD
Simple recurrent network: Kremer (1995) | FOC | One-layer | Split state | —
Recursive autoassociative memory: Pollack (1989, 1990, 1994) | FOC | One-layer | Randomization | AAGD
Autoassociative recurrent network: Maskara and Noetzel (1992) | FOC | One-layer | Randomization | AAGD
Real-time recurrent learning: Williams and Zipser (1988, 1989a, 1989b) | FOC | Zero-layer | Randomization | FGD
Second-order network: Giles et al. (1991, 1992a, 1992b), Giles and Omlin (1992) | SOC | Zero-layer | Randomization | FGD
Second-order automaton encoding: Giles and Omlin (1993a, 1993b) | SOC | Zero-layer | Direct | FGD
Local feedback: Frasconi et al. (1991, 1995) | LF | One-layer | Chain graph | FGD
Recurrent cascade correlation: Fahlman (1991) | LF | One-layer | Randomization | FGD & AIN
Connectionist pushdown automaton: Giles et al. (1990) | CPA | Zero-layer | Randomization | SL
Connectionist Turing machine: Williams and Zipser (1989b) | CTM | Zero-layer | Randomization | FGD
Second-order constructive learning: Giles et al. (1995) | SOC | Zero-layer | Randomization | FGD & AIN
9 Open Questions

Finally, we mention some of the open problems in the field. Our goal is not to provide an in-depth discussion of each problem (since each could easily occupy its own article). Instead, we provide some key references for readers who feel challenged to investigate further.

The temporal credit assignment problem that must be solved in order to find weights for a network to accomplish a given purpose is very difficult. Most algorithms fail to learn long dependencies reliably even in the simplest toy problems. The LSTM architecture described here represents one approach to overcoming this challenge (some others are identified in Bengio et al., 1994, and Hochreiter & Schmidhuber, 1997a). Despite these successes, this problem remains one of the greatest challenges facing the field.

The representation capabilities of many networks have been studied in the context of formal grammars, but there remain many unanswered questions about the capabilities of specific architectures in relation to specific computational models. Most results in this area have focused on classical symbolic metaphors of computation. Much less has been done in relating STCNs to dynamical systems, fuzzy automata, or noisy sequence identification, though Kolen (1994), Omlin et al. (1998), and Maass and Orponen (1997, 1998) represent notable developments in these three areas, respectively.

The issue of knowledge encoding also warrants additional future study. Although there have been a number of results showing that randomly generated rules can be readily incorporated into recurrent networks, there have been very few examples of networks in which real-world a priori knowledge was used to improve performance. Frasconi et al. (1991, 1995) represent an effective application of knowledge from a real-world problem domain.

10 Conclusions

In this article, we described the architectures and operation of a wide variety of spatiotemporal connectionist networks. Rather than adopting an architecture-by-architecture approach, we described the networks by developing a taxonomy based on four fundamental design decisions. This permitted us to present many architectures more efficiently, since there is considerable overlap in design decisions across different networks. Table 9 provides a comparative view of the works discussed in this review to give a clear picture of the taxonomy's representational structure and predictive power.

Prior to developing the taxonomy, we identified three properties that make taxonomies effective: descriptive adequacy, simplicity, and predictive power. We examined three taxonomies developed by Mozer (1994), Horne and Giles (1995), and Tsoi and Back (1994), and found that the recent flurry of research into STCNs has made them obsolete in terms of both predictive power and descriptive adequacy.
In response, we developed a new taxonomy based on four dimensions: how the state vector is computed, how the output vector is computed, how the initial trainable parameters of the network are selected, and how these parameters are trained. These four dimensions were chosen because they represent the fundamental design decisions that any STCN system must address.

Next, we identified specific points along these four dimensions. Along the state-vector dimension, we identified window-in-time (WIT) memories, nonlinear autoregressive with exogenous inputs (NARX) memories, time-delay neural network (TDNN) memories, feedforward exponential decay (FED) memories, gamma memories, Jordan memories, local feedback (LF) memories, infinite impulse response (IIR) filter memories, first-order context (FOC) memories, second-order context (SOC) memories, long short-term memories (LSTM), connectionist pushdown automaton (CPA) memories, and connectionist Turing machine (CTM) memories. Along the output dimension, we identified zero-layer, one-layer, and two-layer functions. We described randomization of initial parameters, split-state encoding, direct encoding, and chain graph encoding as points on the initializing-parameters dimension. Finally, on the parameter change dimension, we defined full gradient descent (FGD), teacher forcing (TF), truncated gradient descent (TGD), autoassociative gradient descent (AAGD), stack learning (SL), training initial state (TIS), manual architecture changes (MAC), and automatically incrementing nodes (AIN).

Because the new taxonomy is the only one developed around the fundamental design decisions that must be addressed by any STCN system, it has superior predictive power when used to compare and analyze how different STCNs might perform on various problems.
Furthermore, the fact that the taxonomy is based around the principles of spatiotemporal processing, rather than specific existing STCN designs, implies that it will easily accommodate future STCN designs as well.

Acknowledgments

I acknowledge the helpful comments of Ramon Neco and three anonymous reviewers on an earlier draft of the manuscript for this article. I was supported by a research grant from the Natural Sciences and Engineering Research Council of Canada.

References

Alippi, C., & Piuri, V. (1996). Neural methodology for prediction and identification of non-linear dynamic system. In Proceedings: International Workshop on Neural Networks for Identification, Control, Robotics, and Signal/Image Processing, Venice, Italy, August 21–23, 1996 (pp. 305–313). Los Alamitos, CA: IEEE Computer Society Press.
Angluin, D., & Smith, C. (1983). Inductive inference: Theory and methods. ACM Computing Surveys, 15(3), 237–269.
Back, A., & Tsoi, A. (1990). A time series modelling methodology using FIR and IIR synapses. In F. Murtagh (Ed.), Proc. Workshop on Neural Networks for Statistical and Economic Data (pp. 187–194). Dublin: DOSES, Statistical Office of European Communities.
Back, A. D., & Tsoi, A. C. (1991). FIR and IIR synapses, a new neural network architecture for time series modelling. Neural Computation, 3(3), 375–385.
Back, A. D., & Tsoi, A. C. (1993). A simplified gradient algorithm for IIR synapse multilayer perceptrons. Neural Computation, 5(3), 456–462.
Back, A., Wan, E., Lawrence, S., & Tsoi, A. (1995). A unifying view of some training algorithms for multilayer perceptrons with FIR filter synapses. In J. Vlontzos, J. Hwang, & E. Wilson (Eds.), Neural networks for signal processing, 4 (pp. 146–154). New York: IEEE Press.
Bengio, Y., Frasconi, P., & Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. In Proceedings of the 1993 IEEE International Conference on Neural Networks (Vol. 3, pp. 1183–1188). San Francisco: IEEE Press.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Bulsari, A. B., & Saxén, H. (1995). A recurrent network for modeling noisy temporal sequences. Neurocomputing, 7, 29–40.
Chen, S., Billings, S., & Grant, P. (1991). Non-linear system identification using neural networks. International Journal of Control, 51(6), 1191–1214.
Cleeremans, A., & McClelland, J. L. (1991). Learning the structure of event sequences. Journal of Experimental Psychology: General, 120(3), 235–253.
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381.
Das, S., Giles, C. L., & Sun, G.-Z. (1993). Using prior knowledge in a NNPDA to learn context-free languages. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 65–72). San Mateo, CA: Morgan Kaufmann.
de Vries, B., & Principe, J. M. (1991). A theory for neural networks with time delays. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 162–168). San Mateo, CA: Morgan Kaufmann.
de Vries, B., & Principe, J. (1992a). The gamma model—A new neural network for temporal processing. Neural Networks, 5(4), 565–576.
de Vries, B., & Principe, J. C. (1992b). The gamma model—A new neural model for temporal processing. Neural Networks, 5(4), 565–576.
Elman, J. L. (1988). Finding structure in time (Tech. Rep. No. 8801). La Jolla, CA: Center for Research in Language, University of California, San Diego.
Elman, J. L. (1989). Structured representations and connectionist models. In Proceedings of the 11th Annual Conference of the Cognitive Science Society (pp. 17–25). Ann Arbor: University of Michigan.
300
Stefan C. Kremer
Elman, J. (1990a). Finding structure in time. Cognitive Science, 14, 179–211. Elman, J. (1990b). Representation and structure in connectionist models. In G. T. Altmann (Ed.), Cognitive models of speech processing: Psycholinguistica and computational perspectives. Cambridge, MA: MIT Press. Elman, J. (1991a). Incremental learning, or the importance of starting small (Tech. Rep. No. 9101). La Jolla, CA: Center for Research in Language, University of California at San Diego. Elman, J. L. (1991b).Distributed representations, simple recurrent networks and grammatical structure. Machine Learning, 7(2/3), 195–226. Fahlman, S. E. (1991). The recurrent cascade-correlation architecture (Tech. Rep. No. CMU-CS-91-100).Pittsburgh, PA: School of Computer Science, CarnegieMellon University. Forcada, M. L., & Carrasco, R. C. (1995). Learning the initial state of a secondorder recurrent neural network during regular-language inference. Neural Computation, 7(5), 923–930. Frasconi, P., Gori, M., Maggini, M., & Soda, G. (1991). A unied approach for integrating explicit knowledge and learning by example in recurrent networks. In 1991 IEEE INNS International Joint Conference on Neural Networks—Seattle (Vol. 1, pp. 811–816). Piscataway, NJ: IEEE Press. Frasconi, P., Gori, M., & Soda, G. (1992). Local feedback in multilayered networks. Neural Computation, 4, 120–130. Frasconi, P., Gori, M., Maggini, M., & Soda, G. (1995). Unied integration of explicit rules and learning by example in recurrent networks. IEEE Transactions on Knowledge and Data Engineering, 7(2), 340–346. Frasconi, P., & Gori, M. (1996). Computational capabilities of local-feedback recurrent networks acting as nite-state machines. IEEE Transactionson Neural Networks, 7(6), 1521–1525. Fu, K.-S. (1982). Syntactic pattern recognition and applications. Engelwood Cliffs, NJ: Prentice-Hall. Ghosh, J., & Deuser, L. (1995). Classication of spatiotemporal patterns with applications to recognition of sonar sequences. In E. 
Covey (Ed.), Neural representations of temporal patterns. New York: Plenum. Giles, C., Sun, G., Chen, H., Lee, Y., & Chen, D. (1990). Higher order recurrent networks and grammatical inference. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 380–387). San Mateo, CA: Morgan Kaufmann. Giles, C. L., Chen, D., Miller, C., Chen, H., Sun, G., & Lee, Y. (1991). Second-order recurrent neural networks for grammatical inference. In 1991 IEEE INNS International Joint Conference on Neural Networks—Seattle (Vol. 2, pp. 273–281). Piscataway, NJ: IEEE Press. Giles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H., & Lee, Y. C. (1992). Extracting and learning an unknown grammar with recurrent neural networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processingsystems,4 (pp. 317–324). San Mateo, CA: Morgan Kaufmann. Giles, C., Miller, C., Chen, D., Chen, H., Sun, G., & Lee, Y. (1992). Learning and extracting nite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405.
Spatiotemporal Connectionist Networks
301
Giles, C. L., & Omlin, C. (1992). Inserting rules into recurrent neural networks. In S. Kung, F. Fallside, J. A. Sorenson, & C. Kamm (Eds.), Neural Networks for Signal Processing II, Proceedings of the 1992 IEEE Workshop (pp. 13–22). Piscataway, NJ: IEEE Press. Giles, C., & Omlin, C. (1993a). Rule renement with recurrent neural networks. In Proceedingsof the IEEE InternationalConference on Neural Networks (ICNN’93) (Vol. 2, pp. 801–806). Giles, C. L., & Omlin, C. (1993b).Extraction, insertion and renement of symbolic rules in dynamically-driven recurrent neural networks. Connection Science, 5(3,4), 307–337. Giles, C. L., & Horne, B. G. (1994). Representation and learning in recurrent neural network architectures. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems (pp. 128–134). New Haven, CT: Center for Systems Science, Dunham Laboratory, Yale University. Giles, C. L., Chen, D., Sun, G.-Z., Chen, H.-H., Lee, Y.-C., & Goudreau, M. W. (1995). Constructive learning of recurrent neural networks: Limitations of recurrent casade correlation and a simple solution. IEEE Transactions on Neural Networks, 6(4), 829–836. Gori, M., Bengio, Y., & De Mori, R. (1989). BPS: A learning algorithm for capturing the dynamic nature of speech. In Proceedings of the First International Joint Conference on Neural Networks, Washington, DC (Vol. 2, pp. 417–423). San Diego: IEEE TAB Neural Network Committee. Goudreau, M., Giles, C., Chakradhar, S., & Chen, D. (1994). First-order vs. second-order single-layer recurrent neural networks. IEEE Transactions on Neural Networks, 5(3), 511–513. Hochreiter, J. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Master’s thesis, Institut fur ¨ Informatik, Lehrstuhl Prof. Brauer, Technische Universit a¨ t Munchen. ¨ Hochreiter, S., & Schmidhuber, J. (1996). Long short term memory (Tech. Rep. No. FKI-207-95). Munchen: ¨ Fakultaet fuer Informatik, Technische Universitaet Munchen. ¨ Hochreiter, S., & Schmidhuber, J. 
(1997a).Long short-term memory. Neural Computation, 9(8), 1735–1780. Hochreiter, S., & Schmidhuber, J. (1997b). LSTM can solve hard long time lag problems. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 473–479). Cambridge, MA: MIT Press. Hopcroft, J., & Ullman, J. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley. Horne, B. G., & Giles, C. L. (1995). An experimental comparison of recurrent neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 697–704). Cambridge, MA: MIT Press. Jordan, M. I. (1986a). Serial order: A parallel distributed processing approach (Tech. Rep. No. ICS Report 8604). La Jolla, CA: Institute for Cognitive Science, University of California at San Diego. Jordan, M. I. (1986b). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 531–546). Hillside, NJ: Erlbaum.
302
Stefan C. Kremer
Kohavi, Z. (1978). Switching and nite automata theory (2nd ed). New York: McGraw-Hill. Kolen, J. (1994). Recurrent networks: State machines or iterated function systems? In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, & A. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School. Hillsdale, NJ: Erlbaum. Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(4), 1000–1004. Kremer, S. C. (1996a). Comments on “constructive learning of recurrent neural networks: Limitations of recurrent cascade correlation and a simple solution.” IEEE Transactions on Neural Networks, 7(4). [References in this article were incorrectly numbered in the published version: “[2]” should read “[1],” “[3]” should read “[2],” and “[4]” should read “[3]”.] Kremer, S. C. (1996b). Finite state automata that recurrent cascade-correlation cannot represent. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 612–618). MIT Press. Kremer, S. C. (1999). Identication of a specic limitation on local-feedback recurrent networks acting as Mealy Moore machines. IEEE Transactions Neural Networks, 10, 433–438. Lang, K. J., Waibel, A. H., & Hinton, G. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1), 23–43. Lapedes, A., & Farber, R. (1987). Nonlinear signal processing using neural networks prediction and system modelling. (Tech. Rep. No. LA-UR87-2662). Los Alamos: Los Alamos National Laboratory. Lawrence, S., Tsoi, A. C., & Back, A. D. (1996). The gamma MLP for speech phoneme recognition. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo, (Eds.), Advances in neural information processing systems, 8 (pp. 785–791). Cambridge, MA: MIT Press. Lin, T., Giles, C. L., Horne, B. G., & Kung, S.-Y. (1996). A delay damage model selection algorithm for NARX neural networks (Tech. Rep. Nos. 
CS-TR-3707 and UMIACS-TR-96-77). College Park, MD: Institute for Advanced Computer Studies, University of Maryland. Lin, T., Horne, B., Tino, ˜ P., & Giles, C. (1996a). Learning long-term dependencies is not as difcult with NARX recurrent neural networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (p. 577). Cambridge, MA: MIT Press. Lin, T., Horne, B. G., Tino, ˜ P., & Giles, C. L. (1996b). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6), 1329–1338. Lin, T., Giles, C., & Horne, B. G. (1998). What to remember: How memory orders affect the performance of NARX networks. In IEEE World Conference on Computational Intelligence (p. 1051). Lippmann, R. P. (1987). An introduction to computing with neural nets. IEEE ASSP Magazine, pp. 4–22. Liu, Y., Sun, G., Chen, H., Lee, Y., & Giles, C. (1990). Grammatical inference and neural network state machines. In Proceedings of the International Joint Conference on Neural Networks 1990 (Vol. 1, pp. 285–288).
Spatiotemporal Connectionist Networks
303
Maass, W., & Orponen, P. (1997). On the effect of analog noise in discrete-time analog computations. In Proceedings of the Annual Conference on Neural Information Processing Systems 1996, in Denver, USA (Advances in Neural Information Processing Systems, vol. 9, Cambridge, MA: MIT Press. Maass, W., & Orponen, P. (1998). On the effect of analog noise in discrete-time analog computations. Neural Computation, 10(5), 1071–1095. Maren, A. (1990). Neural networks for spatiotemporal recognition. In A. Maren (Ed.), Handbook of neural computing applications. Orlando, FL: Academic Press. Marr, D. (1982). Chapter 1: The philosophy and the approach. San Francisco: W. H. Freeman. Maskara, A., & Noetzel, A. (1992). Forced simple recurrent neural networks and grammatical inference. In Proceedings of the 14th Conference of the Cognitive Science Society (pp. 420–425). Miller, C., & Giles, C. L. (1993). Experimental comparison of the effect of order in recurrent neural networks. International Journal of Pattern Recognition and Articial Intelligence, 7(4), 849–872. Mozer, M. (1989). A focused backpropagation algorithm for temporal pattern processing/recognition. Complex Systems, 3(4), 349–381. Mozer, M. C. (1994). Neural net architectures for temporal sequence processing. In A. Weigend & N. Gershenfeld (Eds.), Time series prediction (pp. 243–264). Reading, MA: Addison–Wesley. Narendra, K. S., & Parthasarathy, K. (1990).Identication and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4–27. Nerrand, O., Rousel-Ragot, P., Personnaz, L., Dreyfus, G., & Marcos, S. (1993). Neural networks and nonlinear adaptive ltering: Unifying concepts and new algorithms. Neural Computation, 5, 165–199. Omlin, C., & Giles, C. L. (1992). Training second-order recurrent neural networks using hints. In D. Sleeman & P. Edwards (Eds.), Proceedings of the Ninth International Conference on Machine Learning (pp. 363–368). San Mateo, CA: Morgan Kaufmann. 
Omlin, C., & Giles, C. (1994a). Constructing deterministic nite-state automata in sparse recurrent neural networks. In IEEE International Conference on Neural Networks (ICNN’94) (pp. 1732–1737). Piscataway, NJ: IEEE Press. Omlin, C., & Giles, C. L. (1994b). Extraction and insertion of symbolic information in recurrent neural networks. In V. Honavar & L. Uhr (Eds.), Articial intelligence and neural networks: Steps toward principled integration (pp. 271– 299). San Diego, CA: Academic Press. Omlin, C., & Giles, C. (1996a). Constructing deterministic nite-state automata in recurrent neural networks. Journal of the ACM, 43(6), 937–972. Omlin, C. W., & Giles, C. L. (1996b).Stable encoding of large nite-state automata in recurrent neural networks with sigmoid discriminants. Neural Computation, 8(7), 675–696. Omlin, C., Thornber, K., & Giles, C. (1998). Fuzzy nite-state automata can be deterministically encoded into recurrent neural networks. IEEE Transactions on Fuzzy Systems, 6(1), 76–89.
304
Stefan C. Kremer
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212–1228. Poddar, P., & Unnikrishnan, K. (1991a). Memory neuron networks: A prolegomenon (Tech. Rep. No. GMR–7493). Warren, MI: Computer Science Department, General Motors Research Laboratories. Poddar, P., & Unnikrishnan, K. (1991b). Nonlinear prediction of speech signals using memory neuron networks. In B. Juang, S. Kung, and C. Kamm (Eds.), Neural Networks for Signal Processing: Proceedings of the 1991 IEEE Workshop (pp. 395–404). New York: IEEE Press. Pollack, J. (1989). Implications of recursive distributed representations. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 527–536). San Mateo, CA: Morgan Kaufmann. Pollack, J. B. (1990). Recursive distributed representations. Articial Intelligence, 46, 77–105. Pollack, J. (1994). Chapter 4: Limits of connectionism (Tech. Rep.). Columbus, OH: Department of Computer Science, Ohio State University. Robinson, A. J., & Fallside, F. (1988). Static and dynamic error propagation networks with application to speech coding. In D. Z. Anderson (Ed.), Neural information processing systems (pp. 632–641). New York; American Institute of Physics. Rosenberg, C. R., & Sejnowski, T. J. (1986). The spacing effect on “nettalk,” a massively-parallel network. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society (pp. 72–89). Hillsdale, NJ: Erlbaum. Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representation by error propagation. In J. L. McClelland, D. Rumelhart, & the PDP Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press. Schmidhuber, J. H. (1992). A xed size storage o(n3 ) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 243–248. Sejnowski, T., & Rosenberg, C. (1987). 
Parallel networks that learn to pronounce English text. Complex Systems, 1, 145–168. Sejnowski, T. J., & Rosenberg, C. R. (1988). NETtalk: A parallel network that learns to read aloud. In J. A. Anderson & E. Rosenfeld (Eds.), Neurocomputing: Foundations of research (pp. 663–672). Cambridge, MA: MIT Press. Servan-Schreiber, D., Cleeremans, A., & McClelland, J. L. (1988). Encoding sequential structure in simple recurrent networks (Tech. Rep. CMU-CS-88-183). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University. Servan-Schreiber, D., Cleeremans, A., & McClelland, J. (1989). Learning sequential structure in simple recurrent networks. In D. Touretzky (Ed.), Advances in neural information processing systems, 1. San Mateo, CA: Morgan Kaufmann. Servan-Schreiber, D., Cleeremans, A., & McClelland, J. (1991). Graded state machines: The representation of temporal contingencies in simple recurrent networks. Machine Learning, 7, 161–193. Servan-Schreiber, D., Cleeremans, A., & McClelland, J. L. (1994). Graded state machines:The representationof temporal contingencies in simple recurrent networks (pp. 241–269). San Diego, CA: Academic Press.
Spatiotemporal Connectionist Networks
305
Shaw, A., & Mitchell, R. (1990). Phoneme recognition with a time-delay neural network. In International Joint Conference on Neural Networks, San Diego (Vol. 2, pp. 191–195). New York: IEEE. Siegelmann, H., Horne, B., & Giles, C. (1997). Computational capabilities of recurrent NARX neural networks. IEEE Trans. on Systems, Man & Cybernetics, 27, 208. Stiles, B. W., & Ghosh, J. (1996).Two stage habituation based neural networks for dynamic signal classication. In C. T. Leondes (Ed.), Computer techniques and algorithms in digital signal processing techniques and applications (pp. 301–337). San Diego, CA: Academic Press. Stiles, B. W., & Ghosh, J. (1997). A habituation based neural network for spatiotemporal classication. Neurocomputing, 15(3,4), 273–307. Sun, G., Chen, H., Giles, C., Lee, Y., & Chen, D. (1990a).Connectionist pushdown automata that learn context-free grammars. In M. Caudill (Ed.), Proceedings of the International Joint Conference on Neural Networks 1990 (Vol. 1, pp. 577–580). Hillsdale, NJ: Erlbaum. Sun, G., Chen, H., Giles, C., Lee, Y., & Chen, D. (1990b). Neural networks with external memory stack that learn context-free grammars from examples. In Proceedings of the 1990 Conference on Information Sciences and Systems (Vol. 2, pp. 649–653). Princeton, NJ: Department of Electrical Engineering, Princeton University. Sun, G., Chen, H., Lee, Y., & Giles, C. L. (1991). Turing equivalence of neural networks with second order connection weights. In 1991 IEEE INNS International Joint Conference on Neural Networks—Seattle (Vol. 2, pp. 357–362). Piscataway, NJ: IEEE Press. Sun, G.-Z., Chen, H.-H., & Lee, Y.-C. (1992). Green’s function method for fast online learning algorithm of recurrent neural networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 333–340). San Mateo, CA: Morgan Kaufmann. Sun, G., Giles, C., Chen, H., & Lee, Y. (1993). 
The neural network pushdown automaton: Model, stack and learning simulations (Tech. Rep. No. UMIACS-TR-9377/CS-TR-3118). College Park, MD: University of Maryland. Tsoi, A. C., & Back, A. D. (1994). Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks, 5(2), 229–239. Unnikrishnan, K. P., & Venugopal, K. P. (1994). Alopex: A correlation-based learning algorithm for feedforward and recurrent neural networks. Neural Computation, 6(3), 469–490. Waibel, A. (1988). Consonant recognition by modular construction of large phonemic time-delay neural networks. In D. Anderson (Ed.), Neural information processing systems (pp. 215–223). New York: American Institute of Physics. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(3), 328–339. Watrous, R. L., & Kuhn, G. M. (1992a). Induction of nite-state automata using second-order recurrent networks. In J. E. Moody, S. J. Hanson, & R. P. Lipp-
306
Stefan C. Kremer
Received December 3, 1998; accepted March 20, 2000.
NOTE
Communicated by John Platt
Formulations of Support Vector Machines: A Note from an Optimization Point of View

Chih-Jen Lin

Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan

In this article, we discuss issues about formulations of support vector machines (SVM) from an optimization point of view. First, SVMs map training data into a higher- (maybe infinite-) dimensional space. Currently primal and dual formulations of SVM are derived in the finite-dimensional space and readily extend to the infinite-dimensional space. We rigorously discuss the primal-dual relation in the infinite-dimensional spaces. Second, SVM formulations contain penalty terms, which are different from unconstrained penalty functions in optimization. Traditionally unconstrained penalty functions approximate a constrained problem as the penalty parameter increases. We are interested in similar properties for SVM formulations. For two of the most popular SVM formulations, we show that one enjoys properties of exact penalty functions, but the other is only like traditional penalty functions, which converge when the penalty parameter goes to infinity.

1 Introduction
The support vector machine (SVM) is a new and promising classification technique (for surveys of SVM, see Vapnik, 1995, 1998). Given training vectors $x_i \in \mathbb{R}^n$, $i = 1, \dots, l$, in two classes and a vector $y \in \mathbb{R}^l$ such that $y_i \in \{1, -1\}$, the SVM requires the solution of the following quadratic programming problem (Boser, Guyon, & Vapnik, 1992):

$$
(P_0): \qquad \min_{w,b}\ \frac{1}{2}\langle w, w\rangle \qquad \text{subject to} \quad y_i(\langle w, \phi(x_i)\rangle + b) \ge 1, \quad i = 1, \dots, l,
$$
where $\langle \cdot, \cdot\rangle$ is the inner product of two finite or infinite vectors. Training vectors $x_i$ are mapped into a higher-dimensional space by the function $\phi$. In this new space, we seek a separating hyperplane $\langle w, \phi(x)\rangle + b = 0$ with coefficients $w$ and $b$ such that all training vectors are separated. Note that this higher-dimensional space can be finite or infinite. In this article, to stress the difference between finite and infinite inner products, we represent them as $x^T y$ and $\langle x, y\rangle$, respectively.

Neural Computation 13, 307–317 (2001)
© 2001 Massachusetts Institute of Technology
If in a higher-dimensional space, data are still not separable, Cortes and Vapnik (1995) propose adding a penalty term $C\sum_{i=1}^{l}\xi_i$ in the objective function, where $C$ is a large, positive number:

$$
(P_C): \qquad \min_{w,b,\xi}\ \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{l}\xi_i \qquad \text{subject to} \quad y_i(\langle w, \phi(x_i)\rangle + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l.
$$
The technique of introducing variables $\xi_i$ can be traced back to, for example, Bennet and Mangasarian (1992) for linear programming formulations. As $P_C$ becomes a problem with many (or infinite) variables, currently the standard method (see, for example, Cortes & Vapnik, 1995, and Vapnik, 1995, 1998) for solving problem $P_C$ is through the following dual, which is a finite problem with $l$ variables:

$$
(D_C): \qquad \min_{\alpha}\ \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \qquad \text{subject to} \quad y^T\alpha = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \dots, l,
$$
where $Q_{ij} \equiv y_i y_j\langle \phi(x_i), \phi(x_j)\rangle$ and $e$ is the vector of all ones. If $\alpha$ is a solution of $D_C$, $w \equiv \sum_{i=1}^{l} y_i\alpha_i\phi(x_i)$ is a solution of problem $P_C$. Usually $K(x_i, x_j) \equiv \langle \phi(x_i), \phi(x_j)\rangle$ is called the kernel. Although $\phi(x)$ is in a very high-dimensional space, if a special $\phi$ is used, $K(x_i, x_j)$ is finite and can be easily calculated. Algorithms for solving $D_C$ can be seen in, for example, Schölkopf, Burges, and Smola (1998). Similarly, the dual of problem $P_0$ is as follows:

$$
(D_0): \qquad \min_{\alpha}\ \frac{1}{2}\alpha^T Q\alpha - e^T\alpha \qquad \text{subject to} \quad y^T\alpha = 0, \quad \alpha_i \ge 0, \quad i = 1, \dots, l.
$$
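To make the kernel computation concrete, the sketch below is our own illustration (the feature map and data are not from the article): it checks that the degree-2 polynomial kernel $K(x, z) = (x^T z + 1)^2$ on $\mathbb{R}^2$ equals an explicit inner product $\langle \phi(x), \phi(z)\rangle$ with a 6-dimensional $\phi$:

```python
import numpy as np

def phi(x):
    # Explicit 6-dimensional feature map whose inner product reproduces
    # the degree-2 polynomial kernel (x.z + 1)^2 on R^2.
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0])

def K(x, z):
    # Kernel value computed without ever forming phi.
    return (x @ z + 1.0) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(z), K(x, z))  # the two numbers agree
```

For an infinite-dimensional $\phi$ (e.g., a Gaussian kernel), only the $K(x_i, x_j)$ route is available, which is why the dual formulations above work with $Q$ rather than with $w$ directly.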
However, in the above references, the derivation of $D_C$ was through the standard convex analysis in finite-dimensional spaces. Note that $\phi(x)$ and $w$ can be vectors in an infinite-dimensional Hilbert space. Since properties of convex analysis in finite spaces may not always hold in infinite spaces, we must be careful with any generalization. In this article, all results are derived for infinite-dimensional formulations.

The first issue is on the relation between $P_C$ and $D_C$, $C \ge 0$. For example, we may ask whether it might happen that $P_0$ has an optimal solution but $D_0$ is not solvable. Note that it is possible that a convex programming problem has a lower bound on all objective values but has no feasible points achieving this bound. In section 2, we show that $P_0$ and $D_0$ are either both solvable
or $P_0$ is infeasible and $D_0$ is unbounded. In addition, $P_C$ and $D_C$, $C > 0$, are both solvable. Cortes and Vapnik (1995) and, more recently, Friess, Cristianini, and Campbell (1998) offer a different SVM formulation:

$$
(\bar{P}_C): \qquad \min_{w,b,\xi}\ \frac{1}{2}\langle w, w\rangle + \frac{C}{2}\sum_{i=1}^{l}\xi_i^2 \qquad \text{subject to} \quad y_i(\langle w, \phi(x_i)\rangle + b) \ge 1 - \xi_i, \quad i = 1, \dots, l,
$$

with dual

$$
(\bar{D}_C): \qquad \min_{\alpha}\ \frac{1}{2}\alpha^T\left(Q + \frac{I}{C}\right)\alpha - e^T\alpha \qquad \text{subject to} \quad y^T\alpha = 0, \quad \alpha_i \ge 0, \quad i = 1, \dots, l.
$$
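Where the $\xi$-elimination behind this dual is not spelled out, it can be sketched as follows (a reconstruction we add; it follows the same Lagrangian conventions as the dual derivation of $D_C$ above):

```latex
\begin{aligned}
L(w,b,\xi,\alpha) &= \tfrac{1}{2}\langle w,w\rangle + \tfrac{C}{2}\sum_{i=1}^{l}\xi_i^2
  - \sum_{i=1}^{l}\alpha_i\bigl(y_i(\langle w,\phi(x_i)\rangle+b) - 1 + \xi_i\bigr), \\
\frac{\partial L}{\partial \xi_i} &= C\xi_i - \alpha_i = 0
  \quad\Longrightarrow\quad \xi_i = \frac{\alpha_i}{C}, \\
\tfrac{C}{2}\sum_{i=1}^{l}\xi_i^2 - \sum_{i=1}^{l}\alpha_i\xi_i
  &= \frac{1}{2C}\sum_{i=1}^{l}\alpha_i^2 - \frac{1}{C}\sum_{i=1}^{l}\alpha_i^2
  = -\frac{1}{2C}\,\alpha^{T} I\,\alpha.
\end{aligned}
```

Minimizing over $w$ exactly as for $D_C$ and negating then yields the quadratic term $\frac{1}{2}\alpha^T(Q + I/C)\alpha$.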
Here $I$ is the identity matrix. Using similar techniques, we show that both $\bar{P}_C$ and $\bar{D}_C$, $C > 0$, are always solvable.

The second topic of this article is the penalty parameter. In $P_C$ (or $\bar{P}_C$), the penalty term $C\sum_{i=1}^{l}\xi_i$ (or $\frac{C}{2}\sum_{i=1}^{l}\xi_i^2$) is added to reduce the occurrence of incorrectly classified data. Note that such an approach is different from traditional penalty functions, which modify constrained problems to unconstrained ones. As the penalty parameter goes to infinity, the solutions of this sequence of unconstrained problems approximate a solution of the original problem. Here we are interested in similar properties for SVM formulations. In section 3, we show that for $P_C$, after $C$ is greater than a certain number, all solutions of $P_C$ are a solution of $P_0$. In other words, $P_C$ is like an exact penalty function approach. On the other hand, this result may not hold for $\bar{P}_C$, but as $C \to \infty$, solutions of $\bar{P}_C$ still converge to a solution of $P_0$. We hope that we demonstrate a framework on optimization properties of SVM formulations. Hence, when new formulations are proposed, similar tools and techniques can be applied without many modifications. Finally we make conclusions and discussions in section 4.

2 Primal-Dual Relation of SVM Formulations

For the discussion in the rest of this article, we need the following theorem, which shows the necessary and sufficient optimality conditions of a convex optimization problem. This is an extension of the KKT condition in finite-dimensional spaces. Note that the standard necessary condition usually requires some constraint qualifications (e.g., Slater condition). However, for problems with linear constraints, such an assumption is not necessary.

Theorem 1. Consider an optimization problem

$$
\min\ f(x) \qquad \text{subject to} \quad \langle v_i, x\rangle - b_i \ge 0, \quad i = 1, \dots, l,
$$

where $x \in X$, a Hilbert space, $v_i \in X$, $i = 1, \dots, l$, and $b_i \in \mathbb{R}$, $i = 1, \dots, l$. If $f$ is a real convex and differentiable function, then a feasible element $x$ is an optimal solution if and only if there exist real numbers $\lambda_1, \dots, \lambda_l$ such that

$$
\nabla f(x) - \lambda_1 v_1 - \cdots - \lambda_l v_l = 0, \qquad \lambda_i \ge 0, \qquad \lambda_i(\langle v_i, x\rangle - b_i) = 0, \quad \forall i = 1, \dots, l.
$$
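As a concrete finite-dimensional instance of theorem 1 (the specific problem below is our own illustration, not the article's): minimize $f(x) = \frac{1}{2}\|x\|^2$ subject to a single linear constraint $\langle v, x\rangle - b \ge 0$. The optimum is the projection of the origin onto the half-space, and the three conditions of the theorem can be verified directly:

```python
import numpy as np

# min ½‖x‖²  subject to  ⟨v, x⟩ − b ≥ 0   (v and b chosen for illustration)
v = np.array([1.0, 1.0])
b = 2.0

# Optimal solution: projection of 0 onto {x : ⟨v, x⟩ ≥ b}.
x = (b / (v @ v)) * v          # x = (1, 1)
lam = b / (v @ v)              # multiplier; here ∇f(x) = x = λ v

grad = x                       # ∇f for f(x) = ½‖x‖²
print(grad - lam * v)          # stationarity: [0. 0.]
print(lam >= 0)                # dual feasibility: True
print(lam * (v @ x - b))       # complementary slackness: 0.0
```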
Proof. The necessary condition when constraints are linear can be seen in, for example, Girsanov (1972, theorem 11.3). The sufficient condition does not require any constraint qualification. An example of proof can be seen in corollary 1.1 of Barbu and Precupanu (1986, chap. 3). Note that the definition of $\nabla f(x)$ here is in the sense of Frechet differentiability.

Theorem 1 can be extended to linear equality constraints, as a linear equality can be written as two linear inequalities. Then the following theorem shows the relation between $P_0$ and $D_0$:

Theorem 2. For problems $P_0$ and $D_0$, only one of the following two situations can happen:

$P_0$ solvable and $D_0$ solvable,
$P_0$ infeasible and $D_0$ unbounded.

On the other hand, $P_C$ and $D_C$, $C > 0$, are both solvable.

Proof. First we prove that if $D_0$ is solvable, $P_0$ is solvable. Suppose that an optimal solution of $D_0$ is $\alpha$; by theorem 1 (necessary condition), there exists a $b$ such that $\alpha_i(Q\alpha - e - by)_i = 0$, $i = 1, \dots, l$, and $Q\alpha - e - by \ge 0$. By defining $w \equiv \sum_{i=1}^{l}\alpha_i y_i\phi(x_i)$, we have $(w, b)$, which satisfies the optimality condition of $P_0$. From theorem 1 (sufficient condition), $(w, b)$ is an optimal solution of $P_0$. Hence $P_0$ is solvable.

Next we show that if $P_0$ is feasible, $D_0$ is solvable. Consider the following problem:

$$
\max_{\alpha \ge 0}\ \inf_{w,b}\left[\frac{1}{2}\langle w, w\rangle - \sum_{i=1}^{l}\alpha_i\bigl(y_i(\langle w, \phi(x_i)\rangle + b) - 1\bigr)\right]. \tag{2.1}
$$

When $y^T\alpha = 0$, $\inf_{w,b}[\frac{1}{2}\langle w, w\rangle - \sum_{i=1}^{l}\alpha_i(y_i(\langle w, \phi(x_i)\rangle + b) - 1)]$ has a global minimum at $w = \sum_{i=1}^{l}\alpha_i y_i\phi(x_i)$ (see, for example, Debnath & Mikusinski, 1999, example 9.3.3). Thus equation 2.1 is equivalent to $D_0$. It can be clearly seen that if $P_0$ is feasible with any feasible solution $w^*$, $\frac{1}{2}\langle w^*, w^*\rangle$ is an upper bound of equation 2.1. That is, $-\frac{1}{2}\langle w^*, w^*\rangle$ is a lower bound of $D_0$. From corollary 27.3.1 of Rockafellar (1970), any bounded, feasible, finite-dimensional quadratic convex function over a polyhedral set attains at least an optimal solution. Since $D_0$ is always feasible, we know that it is solvable. Furthermore, because a feasible $P_0$ implies a solvable $D_0$ and a solvable $D_0$ implies a solvable $P_0$, we know that if $P_0$ is feasible, its minimum
is attained. Therefore, we conclude that there are only two possible situations: $P_0$ solvable and $D_0$ solvable, or $P_0$ infeasible and $D_0$ unsolvable. We then use the same property that if $D_0$ is bounded, it is solvable. Therefore, $D_0$ unsolvable means $D_0$ is unbounded, so the proof is complete.

Since $P_C$ and $D_C$, $C > 0$, are both feasible, by the same arguments above, they are both solvable. Using a similar method, we can prove that $\bar{P}_C$ and $\bar{D}_C$ are also both solvable.

3 Penalty Parameters

In $P_C$ (or $\bar{P}_C$), the penalty term $C\sum_{i=1}^{l}\xi_i$ (or $\frac{C}{2}\sum_{i=1}^{l}\xi_i^2$) is used to reduce the occurrence of wrongly classified data. This is different from the penalty approach in mathematical programming. Originally this method approximates a constrained problem by unconstrained minimization. When the penalty parameter is large enough, minima of the unconstrained problems are close to a solution. Hence it is important to know, if $P_0$ is separable, how close a solution of $P_C$ is to a solution of $P_0$ as $C$ increases. This can be a criterion to judge the appropriateness of using the penalty term and will be the main topic of this section. First we prove the following technical lemma:

Lemma 1.
For any $C \ge 0$, the vector $w$ of optimal solutions of $P_C$ is unique.
Proof. Assume that $(w, b, \xi)$ and $(\bar{w}, \bar{b}, \bar{\xi})$ with $w \ne \bar{w}$ are both optimal solutions of $P_C$. For any $0 \le a \le 1$, $(\hat{w}, \hat{b}, \hat{\xi}) = a(w, b, \xi) + (1 - a)(\bar{w}, \bar{b}, \bar{\xi})$ is feasible to $P_C$. Therefore,

$$
\frac{1}{2}\langle \hat{w}, \hat{w}\rangle + C\sum_{i=1}^{l}\hat{\xi}_i \ge \frac{1}{2}\langle w, w\rangle + C\sum_{i=1}^{l}\xi_i = \frac{1}{2}\langle \bar{w}, \bar{w}\rangle + C\sum_{i=1}^{l}\bar{\xi}_i. \tag{3.1}
$$
Note that equation 3.1 can be treated as two inequalities. By multiplying the first part by $a$ and the second part by $(1 - a)$ and summing,

$$
\frac{1}{2}a^2\langle w, w\rangle + a(1 - a)\langle w, \bar{w}\rangle + \frac{1}{2}(1 - a)^2\langle \bar{w}, \bar{w}\rangle \ge \frac{1}{2}a\langle w, w\rangle + \frac{1}{2}(1 - a)\langle \bar{w}, \bar{w}\rangle.
$$

Thus,

$$
a(1 - a)\langle w - \bar{w}, w - \bar{w}\rangle \le 0, \qquad \forall\, 0 \le a \le 1.
$$
Since $w - \bar{w}$ is in a Hilbert space, $w - \bar{w} = 0$. This causes the contradiction, so $w$ must be unique.

For other variables, such as $b$ and $\xi$ of $P_C$ and $\alpha$ of $D_C$, there are always examples where multiple solutions occur. Next we present a relation between solutions of $P_C$, $C > 0$, and those of $P_0$.

Theorem 3. If $P_0$ is solvable, there is a $C^*$ such that for all $C \ge C^*$, any solution $(w, b)$ of $P_0$ is a solution of $P_C$.

Proof. If $(w, b)$ is a solution of $P_0$, by theorem 1, there are $\alpha$ and $b$ such that

$$
w = \sum_{i=1}^{l} y_i\alpha_i\phi(x_i), \qquad y^T\alpha = 0, \qquad \alpha \ge 0, \qquad \alpha_i\bigl(y_i(\langle \phi(x_i), w\rangle + b) - 1\bigr) = 0, \quad i = 1, \dots, l.
$$
Let $C^* > \max(\alpha_i)$. For all $C \geq C^*$ we can set $\xi = 0$ and obtain

$$w = \sum_{i=1}^l y_i \alpha_i \phi(x_i), \quad y^T\alpha = 0,$$
$$\alpha_i \big(y_i(\langle \phi(x_i), w\rangle + b) - 1 + \xi_i\big) = 0,$$
$$0 \leq \alpha_i \leq C, \quad \xi_i(C - \alpha_i) = 0, \quad i = 1, \ldots, l.$$
Therefore, from theorem 1, $(w, b, \xi)$ is also a solution of $P_C$.

Theorem 3 proves a property similar to that of exact penalty functions: after the penalty parameter is large enough, any solution of the original problem is also a solution of the penalized problem. However, according to Di Pillo (1994), such functions are only weak exact penalty functions, though most early literature on penalty functions studied this result. Since the penalty approach is an attempt to solve $P_0$, it is more important to study the converse property, which ensures that solutions of $P_C$ are solutions of $P_0$. It is shown in the following theorem and corollary:

Theorem 4. If $P_0$ is solvable, there is a $C^*$ such that for all $C \geq C^*$, any solution $(w, b, \xi)$ of $P_C$ satisfies $\xi = 0$, and $(w, b)$ is a solution of $P_0$.

Proof. Consider any $C^*$ such that theorem 3 holds and assume $(w, b)$ is a solution of $P_0$. From theorem 3, $(w, b, 0)$ is also an optimal solution of $P_C$. Now if we solve $P_C$ and obtain $(\bar{w}, \bar{b}, \bar{\xi})$, from lemma 1, $\bar{w} = w$. Therefore,

$$\frac{1}{2}\langle \bar{w}, \bar{w}\rangle + C\sum_{i=1}^l \bar{\xi}_i \leq \frac{1}{2}\langle w, w\rangle.$$

Thus, $\bar{\xi} = 0$ and $(\bar{w}, \bar{b})$ is an optimal solution of $P_0$.
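Theorems 3 and 4 can be checked numerically. The sketch below is an illustration under assumptions, not part of the paper: it solves the primal problem $P_C$ for a one-dimensional data set by brute-force grid search over $(w, b)$, using the three points 0, 1, and 100 that are discussed later in this section (for which the dual solution suggests $C^* = 2$). Above $C^*$ the minimizer coincides with the hard-margin solution $2x - 1 = 0$; below $C^*$ the solution $w$ changes.

```python
# Soft-margin SVM primal in 1-D, solved by brute-force grid search.
# Data: point 0 in one class, points 1 and 100 in the other.

def hinge_objective(w, b, C, data):
    """(1/2)w^2 + C * sum_i max(0, 1 - y_i*(w*x_i + b))."""
    return 0.5 * w * w + C * sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in data)

def solve_primal(C, data, w_range=(0.0, 3.0), b_range=(-2.0, 1.0), step=0.01):
    # Exhaustive search over a coarse (w, b) grid; returns (value, w, b).
    best = None
    w = w_range[0]
    while w <= w_range[1] + 1e-12:
        b = b_range[0]
        while b <= b_range[1] + 1e-12:
            val = hinge_objective(w, b, C, data)
            if best is None or val < best[0]:
                best = (val, w, b)
            b += step
        w += step
    return best

data = [(0.0, -1.0), (1.0, 1.0), (100.0, 1.0)]

# C above C* = 2: recover the hard-margin solution 2x - 1 = 0.
_, w_big, b_big = solve_primal(3.0, data)
# C below C*: the solution w shrinks (here w = C).
_, w_small, b_small = solve_primal(0.5, data)
print(w_big, b_big, w_small)
```

The grid ranges and step size are arbitrary choices made for this illustration; a quadratic-programming solver would be the standard tool. Note also that for small $C$ the optimal $b$ is not unique, consistent with the remark after lemma 1.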
Formulations of Support Vector Machines
Corollary 1. If $P_0$ is solvable and

$$C^* \equiv \min\Big\{ \max_{i=1,\ldots,l} \alpha_i \;\Big|\; \alpha \text{ is a solution of } D_0 \Big\},$$

each of $P_C$, $C < C^*$, has a different solution $w$ from the solution of $P_0$. In contrast, $P_C$, $C \geq C^*$, has the same solution $w$ as that of $P_0$.

Proof. When $C \geq C^*$, the result is given by theorem 4. For the case of $C < C^*$, assume there is a $P_C$ with a solution $(w_C, b_C, \xi_C)$ such that $w_C$ is also part of a solution of $P_0$. If $(w, b)$ is a solution of $P_0$, then $(w, b)$ is feasible to $P_C$, so

$$\frac{1}{2}\langle w_C, w_C\rangle + C\sum_{i=1}^l (\xi_C)_i \leq \frac{1}{2}\langle w, w\rangle.$$
This implies $\xi_C = 0$. Hence any solution of $D_C$ is also a solution of $D_0$. Since $C < C^*$, this contradicts the definition of $C^*$.

As we show only that the vectors $w$ are different, when $C < C^*$ it is still possible to obtain the same separating hyperplane. Note that $\langle w, \phi(x)\rangle + b = 0$ is the hyperplane, so any $a(w, b)$, $a \neq 0$, are appropriate coefficients. However, when $C < C^*$, the separating plane generated by $P_C$ can also be very different from that of $P_0$. For example, consider a problem with three points, 0, 1, and 100, where 0 is in the first class, 1 and 100 are in the other class, and $K(x_i, x_j) = x_i^T x_j$. When $C \geq 2$, $\alpha = [2, 2, 0]$ and the hyperplane is $2x - 1 = 0$. When $1 \leq C \leq 2$, $\alpha = [C, C, 0]$ and $w = C$. As $b$ can be any number that satisfies $-1 \leq b \leq 1 - C$, $b = -C/2$ can be a choice. Therefore, $2x - 1 = 0$ can still be a solution. However, when $C$ becomes smaller, the hyperplane starts moving toward the training point 100. In addition, the distance between the two planes $\langle w, x\rangle + b = \pm 1$ becomes very large. This shows that if small values of $C$ are used, some careful analysis should be conducted.

Next we switch the target to $\bar{P}_C$ and $\bar{D}_C$. Interestingly, they are not like exact penalty functions. In the following example, we show that no matter how large $C$ is, solutions of $\bar{P}_C$ are not solutions of $P_0$. In Figure 1a, there are three training vectors in two different classes. If $Q_{ij} = y_i y_j x_i^T x_j$, the optimal separating hyperplane for $P_0$ is $[2\;\, 2]x - 1 = 0$. The optimality condition of $\bar{P}_C$ is

$$\begin{bmatrix} \frac{1}{C} & 0 & 0 \\ 0 & 1 + \frac{1}{C} & 0 \\ 0 & 0 & 1 + \frac{1}{C} \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{bmatrix} + b \begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \end{bmatrix}.$$

At the solution, all three vectors are support vectors, so $\lambda = 0$. Solving the above equation with $y^T\alpha = 0$ obtains

$$b = \frac{C - 1}{-C - 3}, \quad \alpha_1 = (1 + b)C, \quad \alpha_2 = \alpha_3 = \frac{(1 - b)C}{C + 1} = \frac{2C}{C + 3}.$$
Figure 1: A simple example.
The separating hyperplane is

$$\Big[\frac{C(1-b)}{C+1} \;\; \frac{C(1-b)}{C+1}\Big] x + b = 0,$$

that is,

$$[2C \;\; 2C]x + (1 - C) = 0. \qquad (3.2)$$
No matter what $C$ is, equation 3.2 is different from $[2\;\, 2]x - 1 = 0$. However, it can be seen in Figure 1b that as $C \to \infty$, equation 3.2 approaches $[2\;\, 2]x - 1 = 0$. In the following theorems we show that $\bar{P}_C$ has the properties of only traditional penalty functions. That is, as $C \to \infty$, solutions of $\bar{P}_C$ approach solutions of $P_0$.

Lemma 2. Let $(w_{C_j}, b_{C_j}, \xi_{C_j})$ be a solution of $\bar{P}_{C_j}$ and $\alpha_{C_j}$ be the solution of $\bar{D}_{C_j}$, $j = 1, 2, \ldots$, and assume $\lim_{j\to\infty} C_j = \infty$. If $P_0$ is solvable and $\lim_{j\to\infty} \alpha_{C_j}$ exists, then $\lim_{j\to\infty} w_{C_j}$ exists and is the solution of $P_0$.

Proof. From the optimality condition (see theorem 1) of $\bar{D}_C$, $\alpha_{C_j} = C_j \xi_{C_j}$. Since $\lim_{j\to\infty} \alpha_{C_j}$ exists, $\lim_{j\to\infty} \xi_{C_j} = 0$. As $w_{C_j} = \sum_{i=1}^l y_i (\alpha_{C_j})_i \phi(x_i)$ and $\lim_{j\to\infty} \alpha_{C_j}$ exists, $\lim_{j\to\infty} w_{C_j}$ exists and we can denote it as $w$. Note that

$$\max_{i:\, y_i = 1} \big(1 - (\xi_{C_j})_i - \langle w_{C_j}, \phi(x_i)\rangle\big) \leq b_{C_j} \leq \min_{i:\, y_i = -1} \big({-1} + (\xi_{C_j})_i - \langle w_{C_j}, \phi(x_i)\rangle\big).$$

As $C_j \to \infty$, $\lim_{j\to\infty} \xi_{C_j} = 0$, so

$$\max_{i:\, y_i = 1} \big(1 - \langle w, \phi(x_i)\rangle\big) \leq \min_{i:\, y_i = -1} \big({-1} - \langle w, \phi(x_i)\rangle\big).$$
Hence $w$ is feasible to $P_0$. Furthermore, as $P_0$ is solvable, we can assume $\bar{w}$ is a solution. Since $\bar{w}$ is feasible to all $\bar{P}_C$, $C > 0$, and $w_{C_j}$ is the optimal solution of $\bar{P}_{C_j}$, we have

$$\frac{1}{2}\langle w_{C_j}, w_{C_j}\rangle \leq \frac{1}{2}\langle \bar{w}, \bar{w}\rangle.$$

As $j \to \infty$, $\frac{1}{2}\langle w, w\rangle \leq \frac{1}{2}\langle \bar{w}, \bar{w}\rangle$. Hence $\lim_{j\to\infty} w_{C_j} = w$ is an optimal solution of $P_0$.

Theorem 5. Let $(w_C, b_C, \xi_C)$ be a solution of $\bar{P}_C$. If $P_0$ is solvable, then $\lim_{C\to\infty} w_C$ exists and is the solution of $P_0$.

Proof. First we prove that $\{\alpha_C \mid C \geq \epsilon\}$ is bounded, where $\alpha_C$ is a solution of $\bar{D}_C$ and $\epsilon$ is any positive number. If $\bar{\alpha}$ is a solution of $D_0$, then $\bar{\alpha}$ is feasible to $\bar{D}_C$ and $\alpha_C$ is feasible to $D_0$, so

$$\frac{1}{2}\bar{\alpha}^T Q \bar{\alpha} - e^T\bar{\alpha} \leq \frac{1}{2}\alpha_C^T Q \alpha_C - e^T\alpha_C \leq \frac{1}{2}\alpha_C^T \Big(Q + \frac{I}{C}\Big)\alpha_C - e^T\alpha_C \leq \frac{1}{2}\bar{\alpha}^T \Big(Q + \frac{I}{C}\Big)\bar{\alpha} - e^T\bar{\alpha}.$$

The above formula provides lower and upper bounds on $\frac{1}{2}\alpha_C^T (Q + \frac{I}{C})\alpha_C - e^T\alpha_C$ when $C \geq \epsilon > 0$. From the optimality condition (see theorem 1) of $\bar{D}_C$,

$$-\frac{1}{2}\alpha_C^T \Big(Q + \frac{I}{C}\Big)\alpha_C + e^T\alpha_C = \frac{1}{2}\alpha_C^T \Big(Q + \frac{I}{C}\Big)\alpha_C.$$

Hence $\frac{1}{2}\alpha_C^T (Q + \frac{I}{C})\alpha_C$ is bounded, and so is $e^T\alpha_C$. Since $\alpha_C \geq 0$, this implies that $\{\alpha_C\}$ is bounded after $C$ is large enough.

Since $\{\alpha_C \mid C \geq \epsilon\}$ is bounded, there exists a convergent sequence $\{\alpha_{C_j}\}$. From lemma 2, $\lim_{j\to\infty} w_{C_j}$ exists, where $w_{C_j}$ is part of a solution of $\bar{P}_{C_j}$. If $\lim_{C\to\infty} w_C \neq w$, there is a sequence $\{w_{C_k}\}$ and $\delta > 0$ such that $\|w_{C_k} - w\| \geq \delta$ for all $k = 1, 2, \ldots$. However, $\{\alpha_{C_k}\}$ is bounded, so we can find a subsequence of $\{\alpha_{C_k}\}$ that satisfies the conditions of lemma 2. Hence this subsequence converges to a solution of $P_0$, that is, $w$. This contradicts the assumption that $\|w_{C_k} - w\| \geq \delta$, $\forall k$. Therefore, $\lim_{C\to\infty} w_C = w$.

4 Conclusions

Through the use of optimization techniques, we have gained a better understanding of the formulations of SVMs. This work can serve as an example of analyzing future SVM formulations by optimization techniques. SVM formulations are very simple convex quadratic problems in Hilbert spaces. Thus we can use the simple Wolfe dual (see equation 2.1) to derive the primal-dual relation instead of using the standard conjugate duality. For different SVM formulations, the results of theorem 2 could be quite different. For example, for the new ν-SVM proposed by Schölkopf, Smola, Williamson, and Bartlett (2000), the primal problem can be unbounded while the dual problem can be infeasible (see Crisp & Burges, in press, and Chang & Lin, 2000).

Acknowledgments

This work was supported in part by the National Science Council of Taiwan, grant NSC-88-2213-E-002-097. The author thanks Jorge Moré for pointing him to Di Pillo (1994) and some helpful discussions. He also thanks Chih-Wei Hsu and Steve Wright for their comments.

References

Barbu, V., & Precupanu, T. (1986). Convexity and optimization in Banach spaces. Dordrecht: D. Reidel.
Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.
Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory.
Chang, C.-C., & Lin, C.-J. (2000). Some analysis on ν-support vector classification (Tech. Rep.). Taipei, Taiwan: Department of Computer Science and Information Engineering, National Taiwan University.
Cortes, C., & Vapnik, V. (1995). Support-vector network. Machine Learning, 20, 273–297.
Crisp, D. J., & Burges, C. J. C. (In press). A geometric interpretation of ν-SVM classifiers. In Advances in neural information processing systems. Cambridge, MA: MIT Press.
Debnath, L., & Mikusinski, P. (1999). Introduction to Hilbert spaces with applications (2nd ed.). San Diego: Academic Press.
Di Pillo, G. (1994). Exact penalty methods. In I. Ciocco (Ed.), Algorithms for continuous optimization. Dordrecht: Kluwer.
Friess, T.-T., Cristianini, N., & Campbell, C. (1998). The kernel Adatron algorithm: A fast and simple learning procedure for support vector machines. In Proceedings of the 15th Intl. Conf. Machine Learning. San Mateo, CA: Morgan Kaufmann.
Girsanov, I. V. (1972). Lectures on mathematical theory of extremum problems. Berlin: Springer-Verlag.
Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ: Princeton University Press.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (Eds.) (1998). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1083–1121.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Received September 20, 1999; accepted April 12, 2000.
NOTE
Communicated by David Stork
Minimal Feedforward Parity Networks Using Threshold Gates

Hon-Kwok Fung
Leong Kwan Li
Department of Applied Mathematics, Hong Kong Polytechnic University, Hong Kong

This article presents preliminary research on the general problem of reducing the number of neurons needed in a neural network so that the network can perform a specific recognition task. We consider a single-hidden-layer feedforward network in which only McCulloch-Pitts units are employed in the hidden layer. We show that if only interconnections between adjacent layers are allowed, the minimum size of the hidden layer required to solve the n-bit parity problem is $n$ when $n \leq 4$.

1 Introduction
Consider a single-hidden-layer feedforward network that has $n \geq 2$ input units, one output unit, and a hidden layer comprising only McCulloch-Pitts (Mc-P) nodes. Assume that each unit can pass information only to the subsequent layer (thus, no connections within a layer or between nonadjacent layers are allowed). Using sufficiently many hidden nodes and well-selected weights and thresholds, it is a simple matter to solve the so-called n-bit parity problem, which asks how a neural network can recognize the parity of any binary input. (See Li, 1992, for a solution using $n$ hidden nodes in the network described above.) How many Mc-P hidden units are necessary for the recognition task is not entirely trivial. There is yet no general rule for determining a priori the minimum number of hidden neurons required for any specific recognition task. Although we do have some measures (such as the Vapnik-Chervonenkis dimension or the more general measures studied by Sontag, 1992) that concern the minimum number of hidden neurons required for a neural network to perform an arbitrary recognition task on a perhaps arbitrary set, these two kinds of minima can be vastly different. For example, for the network described above, $n$ hidden neurons are sufficient for recognising parities on $V_n = \{0, 1\}^n$, but according to Sontag (1992), it requires roughly $2^n$ hidden neurons to guarantee that the net can recognize any partition of any set containing $|V_n| = 2^n$ points in $R^n$.

Several results for using a minimum number of hidden nodes to solve the n-bit parity problem can be found in the literature. For single-hidden-layer feedforward parity networks, Stork and Allen (1992) showed that when some staircase-like activation functions are employed in the hidden layer, only two hidden neurons are required. Brown (1993) showed that if direct connections between inputs and the output are allowed, only one hidden neuron is actually required (using another staircase-like activation function). If input-output connections are allowed and the activation functions take a special sigmoidal form, Minor (1993) proved that $\lfloor n/2 \rfloor$ hidden neurons are minimal. Yet when only the simplest activation function (the Heaviside function) is employed, the minimum problem remains unsolved. We will partially solve the problem here. We show that for the feedforward network described at the beginning of this article with $n \leq 4$ inputs, the minimum number of hidden neurons required is $n$.

Neural Computation 13, 319–326 (2001) © 2001 Massachusetts Institute of Technology

2 Minimal Parity Networks
Let $\mathrm{sgn}(\cdot)$ be the sign function or the Heaviside function defined by $\mathrm{sgn}(u) = 1$ if $u > 0$ and $\mathrm{sgn}(u) = 0$ if $u \leq 0$. If $u$ is a vector, let $\mathrm{sgn}(u)$ be the vector computed from $u$ entrywise. Thus, if there are $n$ input units and $m$ hidden neurons, the relation between the input state vector $x \in R^n$ and the hidden state vector $y \in R^m$ of our single-hidden-layer network is given by

$$y = \mathrm{sgn}(f(x)) \equiv \mathrm{sgn}(Wx + \theta),$$

where $W \in R^{m \times n}$ is the weight matrix and $\theta$ is the threshold. We want to show that $n$ hidden units are necessary for solving the n-bit ($n \geq 2$) parity problem. In other words, we want to show that $n - 1$ hidden neurons are not enough. Since all hidden neurons are Mc-P units, if we can show that for some inputs $u, v$ of different parities, $f(u)$ and $f(v)$ lie in the same orthant (which is a quadrant in $R^2$ or an octant in $R^3$), then we are done, because $\mathrm{sgn}(f(u)) = \mathrm{sgn}(f(v))$—that is, because $u, v$ give the same hidden states. However, here we seek to prove an even stronger statement. Let us call a function $f: R^n \to R^{n-1}$ a color separator for a two-colored hypercube $H_n = [0, 1]^n$ (for short, a color separator) if any two adjacent vertices of $H_n$ are mapped to distinct orthants in $R^{n-1}$. From the above discussion, if we can show that

$P(n)$: no affine function $f: R^n \to R^{n-1}$ is color separating,

then it is evident that $n - 1$ hidden neurons are not enough for solving the n-bit parity problem. This is our goal: to prove $P(n)$ for $n = 2, 3, 4$. The proofs will be given in section 4. We do not know yet whether $P(n)$ is true for $n \geq 5$.

3 Notations and Preliminaries
We shall use the following notations. $\mathrm{sgn}(\cdot)$ is the Heaviside function defined in the previous section. $f$ always denotes the affine function in $P(n)$ for some $n$. $W$ and $\theta$ are, respectively, the weight matrix and the threshold vector such that $f(x) = Wx + \theta$. The canonical basis of $R^k$ is denoted by $\{e_1, \ldots, e_k\}$ or $\{\tilde{e}_1, \ldots, \tilde{e}_k\}$, depending on whether the domain or the range of $f$ is concerned. Basically all vectors in this article are column vectors, but for convenience we also identify row vectors with column vectors. The inner product of two vectors $u, v$ is denoted by $u \cdot v$. If $w = (x, y) \in R^2$ is a vector such that $x > 0$ and $y \leq 0$, we write $w = (+, -)$. Thus $(1, -2) = (+, -)$. Similar notations apply to other combinations of signs and also to higher dimensions. We denote an unspecified or unknown sign in a sign pattern by an asterisk ($*$). Thus, we may also write $(1, -2) = (+, *)$. If two vectors $u, v$ have the same sign pattern, we write $u \sim v$—for example, $(0, -1, 3) \sim (-4, -10, 2)$. The convex hull of a set of points $S$ is denoted by $\mathrm{conv}\, S$. We view the vertex set $V_n = \{0, 1\}^n$ of the hypercube $H_n = [0, 1]^n$ as a graph connected by the edges in $H_n$. When we talk about a face of $V_n$, we mean the vertex set of a face of $H_n$.

Apart from notations, we also need a few general properties for color separators and non-color-separators. Since the proofs of the properties stated below are straightforward, we shall omit them here.

First, note that color separation is more a geometrical property than an algebraic one. There is some partial freedom for us to make changes to the coordinate systems for the domain or the range of $f$ without compromising color separation. For example, given $\tilde{y} \in f(V_n)$, it does not matter which particular $x \in V_n$ is such that $\tilde{y} = f(x)$. If it happens that $f(e_1) = \tilde{y}$, we may relabel $e_1$ as the origin and reverse the directions of some axes in $R^n$ if necessary so that $f(0) = \tilde{y}$. We may even reverse the direction of certain axes in $R^{n-1}$ so that $\tilde{y}$ lies inside any prespecified orthant, yet preserving the color separation property of $f$. The mathematical formulation of this fact is given as follows:

Proposition 1. Let $f: R^n \to R^{n-1}$ be an affine function and $g_f: R^n \to R^{n-1}$ be a composite of one or more functions of the following forms:

a. $g_f(x) = f(Dx + (I - D)h)$, where $h = (e_1 + e_2 + \cdots + e_n)/2$ and $D$ is a diagonal matrix whose diagonal elements are 1 or $-1$.

b. $g_f(x) = P_{n-1} f(P_n x)$, where $P_{n-1}$, $P_n$ are permutation matrices of orders $n - 1$ and $n$.

Then $f$ is a color separator if and only if $g_f$ is.

Second, since $f$ is affine, we can (and will, throughout this article) always assume that each $f(v)$ in $f(V_n)$ and each $We_i$ are nonzero entrywise, whenever $P(n)$ is concerned:

Proposition 2. For every affine function $f: R^n \to R^{n-1}$, there exists $w \in R^{n-1}$ close to zero and $M \in R^{(n-1)\times(n-1)}$ close to the identity matrix $I$ such that $f$ is color separating if and only if $g$ defined by $g(x) = M(f(x) + w)$ is.
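The color-separation machinery can also be probed numerically. The sketch below is an illustration added here, not part of the article: it draws random affine maps $f: R^3 \to R^2$ and checks that every one of them sends some pair of adjacent (hence differently colored) vertices of $V_3$ into the same quadrant, as $P(3)$ (proved in theorem 1 of section 4) guarantees.

```python
import itertools
import random

def sgn(u):
    # Heaviside convention of this article: sgn(u) = 1 if u > 0, else 0.
    return 1 if u > 0 else 0

def orthant(p):
    return tuple(sgn(c) for c in p)

def has_adjacent_clash(W, theta, n=3):
    # f(x) = W x + theta maps V_n into R^(n-1); look for two adjacent
    # vertices (Hamming distance 1, hence different parity) whose images
    # share an orthant -- exactly what P(n) predicts must always exist.
    f = {}
    for v in itertools.product((0, 1), repeat=n):
        f[v] = tuple(sum(W[i][j] * v[j] for j in range(n)) + theta[i]
                     for i in range(n - 1))
    for v in f:
        for j in range(n):
            u = list(v); u[j] = 1 - u[j]; u = tuple(u)
            if orthant(f[v]) == orthant(f[u]):
                return True
    return False

random.seed(0)
trials = [([[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)],
           [random.uniform(-1, 1) for _ in range(2)]) for _ in range(1000)]
print(all(has_adjacent_clash(W, th) for W, th in trials))
```

Random sampling of course proves nothing; it merely fails to find a counterexample, which is what the theorem predicts.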
Third, if some face $f(F)$ of $f(V_n)$ lies entirely in a "canonical" half-space (whose boundary is some $\{(x_1, \ldots, x_{n-1}): x_j = 0\}$), we may identify $F$ with $V_{n-1}$ and the canonical half-space with $R^{n-2}$ and view $f$ as a mapping that maps $V_{n-1}$ into $R^{n-2}$. Thus we have:

Proposition 3. Assume $P(n-1)$ is true. If $F$ is a face of $V_n$ and the $i$th coordinates of all points in $f(F)$ are of the same sign for some $i$, then $f$ cannot be color separating.

Proposition 3 allows the use of mathematical induction to disqualify $f$ as a color separator when $f$ maps a face of $V_n$ into a canonical half-space. However, it can happen that no faces of $f(V_n)$ lie in any canonical half-space. The following corollary shows how the alignment or configuration of points in $f(V_n)$ is constrained when the "face trick" (proposition 3) fails:

Corollary 1. Assume $P(n-1)$ is true, but $f: R^n \to R^{n-1}$ is affine and color separating. Then for any $v \in V_n$ and $j \in \{1, 2, \ldots, n\}$, exactly one of the following cases is true when the $j$-coordinate is concerned:

a. There exist two adjacent vertices $v_1, v_2$ of $v$ such that $f(v)$ lies between $f(v_1)$ and $f(v_2)$, that is, $W(v_1 - v) \cdot \tilde{e}_j \leq 0 < W(v_2 - v) \cdot \tilde{e}_j$.

b. $f(v)$ has the largest or smallest $j$th coordinate among all points in $f(V_n)$. Also, for every vertex $u$ adjacent to $v$, $f(u)$ and $f(v)$ must lie on the same side of the $j$th axis. Hence $\mathrm{sgn}(W(u - v) \cdot \tilde{e}_j) \neq \mathrm{sgn}(f(u) \cdot \tilde{e}_j) = \mathrm{sgn}(f(v) \cdot \tilde{e}_j)$.

4 Proofs of the Main Results

Theorem 1. $P(2)$ and $P(3)$ are true.
Proof. Clearly $P(2)$ is true. We now show that $P(3)$ is also true.

Assume $P(3)$ is false for some $f: R^3 \to R^2$. Without loss of generality, we assume that $f(0) \cdot \tilde{e}_1 = \min\{f(v) \cdot \tilde{e}_1 : v \in V_3\}$ and $f(0) = (-, -)$. Therefore, by corollary 1(b), $f(e_1), f(e_2), f(e_3) = (-, *)$. Since $f(0) = (-, -) \not\sim f(e_i)$ for each $i$, we must have, in turn, $f(e_1), f(e_2), f(e_3) = (-, +)$. However, this implies $f(0) \cdot \tilde{e}_2 = \min\{f(v) \cdot \tilde{e}_2 : v \in V_3\}$, which is a contradiction of corollary 1(b). Thus $P(3)$ must be true.

Now we shall prove $P(4)$ step by step. First, the following proposition should be evident, and we shall omit its proof.

Proposition 4. Let $c$, $c + u_1$, $c + u_2$, $c + u_1 + u_2 \in R^2$ be the four vertices of a parallelogram such that $c$ is the one having the smallest $x$-coordinate, and no two adjacent vertices lie in the same quadrant. By reversing the direction of the $y$-axis if necessary, we may assume that $c = (-, -)$, $c + u_1 + u_2 = (+, +)$, and $c + u_1 = (+, -)$ or $(-, +)$.
Lemma 1. Suppose $P(4)$ is false for $f$. Then there exist two opposite vertices $u, w$ in $F = \mathrm{conv}\{e_1, e_2, e_3\}$ (hence $u + w = e_1 + e_2 + e_3$) and some $j \in \{1, 2, 3\}$ such that $f(u)$ and $f(w)$ have, respectively, the minimum and maximum $j$th coordinates among all points in $f(F)$ and lie inside opposite octants in $R^3$.

Proof. Let $u$ and $v$ be the pair of opposite vertices of $F$ such that $f(u)$ and $f(v)$ have the minimum and the maximum $x$-coordinates among all vertices of $f(F)$. If $f(u)$ and $f(v)$ lie inside opposite octants, we are done. Otherwise they must both lie inside a certain canonical half-space $H$. Without loss of generality, let $H = \{(x, y, z): z < 0\}$.

Not all outgoing edges of $f(u)$ in $f(F)$ point toward the same (positive or negative) side of the $z$-axis; otherwise the entire face $f(F)$ of $f(V_4)$ would lie inside $H$ (because the two extreme corners $f(u)$ and $f(v)$ with extremal $z$-coordinates do), which is a contradiction of proposition 3 or $P(3)$. Therefore, without loss of generality (relabel another vertex in $V_4$ as the origin and reverse the direction of the $x$-axis in $R^3$ if necessary), we may assume that $u = 0$, $v = e_1 + e_2 + e_3$ and

$$We_1, We_2 = (+, *, -), \quad We_3 = (+, *, +),$$
$$f(0), f(e_1), f(e_2), f(e_1 + e_2), f(e_1 + e_2 + e_3) = (*, *, -). \qquad (4.1)$$

Note that in this case, $f(e_3)$ and $f(e_1 + e_2)$ have, respectively, the maximum and the minimum $z$-coordinates among all vertices in $f(F)$. So it remains to prove that $f(e_3)$ and $f(e_1 + e_2)$ lie in opposite octants. Observe that if $f(0) = (-, -, -)$ and $f(e_1 + e_2) = (+, +, -)$, then

$$f(e_1 + e_2 + e_3) = f(e_1 + e_2) + We_3 = (+, +, -) + (+, *, +) = (+, *, *),$$

meaning that $f(e_1 + e_2 + e_3) = (+, *, -)$. Since $f(e_1 + e_2 + e_3) \not\sim f(e_1 + e_2)$, we must have $f(e_1 + e_2 + e_3) = (+, -, -)$ and $We_3 = (+, -, +)$. Therefore, by applying proposition 1 to the $x$- and $y$-coordinates in equation 4.1 (and by interchanging the roles of $e_1$ and $e_2$ if necessary), we can split the problem into two cases.

Case 1. $f(0) = (-, -, -)$, $f(e_1) = (+, -, -)$, $f(e_1 + e_2) = (+, +, -)$, $f(e_1 + e_2 + e_3) = (+, -, -)$, $We_1 = (+, *, -)$, and $We_3 = (+, -, +)$. In this case,

$$f(e_1 + e_3) = f(e_1) + We_3 = (+, -, -) + (+, -, +) = (+, -, *) = (+, -, +), \qquad (4.2)$$

where the last equality is due to the fact that $f(e_1 + e_3) \not\sim f(e_1) = (+, -, -)$. Also, we have $f(e_3) = f(0) + We_3 = (-, -, -) + (+, -, +) = (*, -, *)$. However, since $f(e_3) \not\sim f(0) = (-, -, -)$ and $f(e_3) \not\sim f(e_1 + e_3) = (+, -, +)$, we must have $f(e_3) = (-, -, +)$ or $(+, -, -)$. The latter is impossible, or else $f(e_1 + e_3) = f(e_3) + We_1 = (+, -, -) + (+, *, -) = (+, *, -)$, contradicting equation 4.2. Thus, $f(e_3) = (-, -, +)$, which is inside an octant opposite to where $f(e_1 + e_2) = (+, +, -)$ lies.
Case 2. $f(0) = (-, -, -)$; $f(e_1), f(e_2) = (-, +, -)$; $f(e_1 + e_2) = (+, +, -)$; $f(e_1 + e_2 + e_3) = (+, -, -)$; $We_1, We_2 = (+, *, -)$; and $We_3 = (+, -, +)$. In this case, as $f(e_i) = f(0) + We_i \not\sim f(0)$, it is immediate that $We_1, We_2 = (+, +, -)$. Also, since $f(e_3), f(e_1 + e_2 + e_3) \not\sim f(e_1 + e_3)$, it is easy to see that $f(e_3)$ cannot take any of the following sign patterns: $(*, +, *)$, $(+, *, -)$, or $(+, -, *)$. Thus, $f(e_3) = (-, -, -)$ or $(-, -, +)$. The former is impossible because $f(e_3) \not\sim f(0)$. Therefore, the two vertices $f(e_3) = (-, -, +)$ and $f(e_1 + e_2) = (+, +, -)$ lie inside opposite octants in $R^3$.

We are now ready to prove $P(4)$. We shall prove by contradiction.

Theorem 2.
$P(4)$ is true.
Proof. Assume $P(4)$ is false. Then by the previous lemma, we may assume without loss of generality that $f(0) = (-, -, -)$ and $f(e_1 + e_2 + e_3) = (+, +, +)$, and that they are the two points in $\mathrm{conv}\{e_1, e_2, e_3\}$ that have, respectively, the smallest $z$-coordinate and the largest $z$-coordinate. We also assume without loss of generality that $We_4 \cdot \tilde{e}_3 > 0$, that is,

$$We_4 = (*, *, +). \qquad (4.3)$$

Then $f(e_1 + e_2 + e_3 + e_4) = (*, *, +)$ and

$$f(0) \cdot \tilde{e}_3 = \min\{f(v) \cdot \tilde{e}_3 : v \in V_4\}, \qquad (4.4)$$
$$f(e_1 + e_2 + e_3 + e_4) \cdot \tilde{e}_3 = \max\{f(v) \cdot \tilde{e}_3 : v \in V_4\}. \qquad (4.5)$$

Therefore, by corollary 1(b) and the fact that $f(e_4)$ is adjacent to $f(0)$,

$$f(e_4) = (*, *, -). \qquad (4.6)$$

Note that

$$We_4 \neq (+, +, *); \qquad (4.7)$$

otherwise by equation 4.3, we have $We_4 = (+, +, +)$ and

$$f(e_1 + e_2 + e_3 + e_4) = f(e_1 + e_2 + e_3) + We_4 = (+, +, +) + (+, +, +) \sim f(e_1 + e_2 + e_3),$$

which is a contradiction to $P(4)$. Thus, in turn, $f(e_4) \neq (+, +, -)$, or else $We_4 = f(e_4) - f(0) = (+, +, -) - (-, -, -) = (+, +, *)$, contradicting equation 4.7. Furthermore, $f(e_4) \neq (-, -, -)$ because $f(e_4) \not\sim f(0)$. Hence $f(e_4) = (-, +, -)$ or $(+, -, -)$, by equation 4.6. Without loss of generality, let $f(e_4) = (+, -, -)$. Therefore, $We_4 = f(e_4) - f(0) = (+, -, -) - (-, -, -) = (+, *, *)$, and in turn $We_4 = (+, -, +)$ by equations 4.3 and 4.7. Consequently,

$$f(e_1 + e_2 + e_3 + e_4) = f(e_1 + e_2 + e_3) + We_4 = (+, +, +) + (+, -, +) = (+, *, +) = (+, -, +),$$

where the last equality holds because $f(e_1 + e_2 + e_3 + e_4) \not\sim f(e_1 + e_2 + e_3)$.

Now, since $f(0) = (-, -, -) = (-, *, *)$ and $f(e_4) = (+, -, -) = (+, *, *)$, by corollary 1(a) (with $j = 1$), there must exist some $i$ such that $f(e_i) = (-, *, *)$ and $We_i = (-, *, *)$. However, by applying corollary 1(b) with $j = 3$ on $f(0)$, we must have $f(e_i) = (*, *, -)$ and $We_i = (*, *, +)$. Therefore, in summary, $f(e_i) = (-, *, -)$ and $We_i = (-, *, +)$. As $f(e_i) \not\sim f(0) = (-, -, -)$, we must have $f(e_i) = (-, +, -)$ and in turn $We_i = (-, +, +)$. So, if we put $v = e_1 + e_2 + e_3 + e_4 - e_i$, then

$$f(v) = f(e_1 + e_2 + e_3 + e_4) - We_i = (+, -, +) - (-, +, +) = (+, -, *).$$

However, $f(v) \neq (+, -, +)$ because $f(v) \not\sim f(e_1 + e_2 + e_3 + e_4)$. On the other hand, $f(v) \neq (+, -, -)$, or else $f(v)$ and $f(e_1 + e_2 + e_3 + e_4)$, together with equation 4.5, would violate corollary 1(b) (when $j = 3$). Therefore we arrive at a contradiction. Hence $P(4)$ must be true.
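The lower bound just proved is matched by the upper bound cited in section 1 (Li, 1992): $n$ McCulloch-Pitts hidden units suffice for n-bit parity. The sketch below illustrates one standard such construction (the specific weights are a common textbook choice, not taken from this article): hidden unit $k$ fires when at least $k$ inputs are on, and the output unit combines the hidden units with alternating $\pm 1$ weights.

```python
from itertools import product

def sgn(u):
    # Heaviside convention used in the article: sgn(u) = 1 if u > 0, else 0.
    return 1 if u > 0 else 0

def parity_net(x):
    """Single-hidden-layer threshold network with n = len(x) hidden units.

    Hidden unit k (k = 1..n) computes sgn(sum(x) - k + 0.5), i.e., it fires
    iff at least k inputs are 1.  The output unit adds the hidden units with
    alternating weights +1, -1, +1, ... and threshold 0.5.
    """
    n = len(x)
    s = sum(x)
    hidden = [sgn(s - k + 0.5) for k in range(1, n + 1)]
    out = sum(((-1) ** k) * h for k, h in enumerate(hidden))  # +1, -1, +1, ...
    return sgn(out - 0.5)

# Exhaustive check for small n: the network output equals the parity bit.
for n in range(2, 7):
    for x in product((0, 1), repeat=n):
        assert parity_net(x) == sum(x) % 2
print("parity networks verified for n = 2..6")
```

If exactly $s$ inputs are on, the first $s$ hidden units fire and the alternating sum is 1 when $s$ is odd and 0 when $s$ is even, which is the parity bit.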
5 Conclusion
We have shown that for a threshold gate feedforward network with only one hidden layer, if no intralayer connections or direct connections between inputs and output are allowed, it takes at least $n$ hidden units for the network to solve the n-bit parity problem for $n = 2, 3, 4$. The interesting question of whether the statement is true for $n \geq 5$ is left open.

Acknowledgments
This work was supported by grant PolyU 5058/98E from the Research Grant Council of Hong Kong Special Administrative Region.

References

Brown, D. A. (1993). Neural network letter. Neural Networks, 6, 607–608.
Li, L. K. (1992). Mathematical aspects of neural networks. Unpublished doctoral dissertation, University of Southern California.
Minor, J. M. (1993). Parity with two layer feedforward nets. Neural Networks, 6, 705–707.
Sontag, E. D. (1992). Feedforward nets for interpolation and classification. J. Comp. Sys. Sci., 45, 20–48.
Stork, D. G., & Allen, J. D. (1992). How to solve the n-bit parity problem with two hidden units. Neural Networks, 5, 923–926.

Received January 6, 2000; accepted April 6, 2000.
LETTER
Communicated by Christof Koch
Does Corticothalamic Feedback Control Cortical Velocity Tuning?

Ulrich Hillenbrand
J. Leo van Hemmen
Physik Department der TU München, D-85747 Garching bei München, Germany

The thalamus is the major gate to the cortex, and its contribution to cortical receptive field properties is well established. Cortical feedback to the thalamus is, in turn, the anatomically dominant input to relay cells, yet its influence on thalamic processing has been difficult to interpret. For an understanding of complex sensory processing, detailed concepts of the corticothalamic interplay need to be established. To study corticogeniculate processing in a model, we draw on various physiological and anatomical data concerning the intrinsic dynamics of geniculate relay neurons, the cortical influence on relay modes, lagged and nonlagged neurons, and the structure of visual cortical receptive fields. In extensive computer simulations, we elaborate the novel hypothesis that the visual cortex controls via feedback the temporal response properties of geniculate relay cells in a way that alters the tuning of cortical cells for speed.

1 Introduction
The thalamus is the major gate to the cortex for peripheral sensory signals, input from various subcortical sources, and reentrant cortical information. Thalamic nuclei, however, do not merely relay information to the cortex but perform some operation on it while being modulated by various transmitter systems (McCormick, 1992) and in continuous interplay with their cortical target areas (Guillery, 1995; Sherman, 1996; Sherman & Guillery, 1996). Indeed, cortical feedback to the thalamus is the anatomically dominant input to relay cells even in those thalamic nuclei that are directly driven by peripheral sensory systems. While it is well established that the receptive fields of cortical neurons are strongly influenced by convergent thalamic inputs of different types (Saul & Humphrey, 1992a, 1992b; Reid & Alonso, 1995; Alonso, Usrey, & Reid, 1996; Ferster, Chung, & Wheat, 1996; Jagadeesh, Wheat, Kontsevich, Tyler, & Ferster, 1997; Murthy, Humphrey, Saul, & Feidler, 1998; Hirsch, Alonso, Reid, & Martinez, 1998), the modulation effected by cortical feedback in thalamic response has been difficult to interpret. Experiments and theoretical considerations have pointed to a variety of operations of the visual cortex on the lateral geniculate nucleus (LGN), such as attention-related gating of geniculate relay cells (GRCs) (Sherman & Koch, 1986), gain control of GRCs (Koch, 1987), synchronizing firing of neighboring GRCs (Sillito, Jones, Gerstein, & West, 1994; Singer, 1994), increasing trans-information in GRC output (McClurkin, Optican, & Richmond, 1994), and switching GRCs from a detection to an analyzing mode (Godwin, Vaughan, & Sherman, 1996; Sherman, 1996; Sherman & Guillery, 1996). Nonetheless, the evidence for any particular function to date is still sparse and rather indirect. Clearly, detailed concepts of the interdependency of thalamic and cortical operation could greatly advance our ideas about complex sensory, and ultimately cognitive, processing.

Here we present a novel view on the corticothalamic puzzle by proposing that control of velocity tuning of visual cortical neurons may be an eminent function of corticogeniculate processing. We have outlined some of the ideas previously in Hillenbrand and van Hemmen (2000). In this section we review facts, some well established and others still controversial, on the thalamocortical system in order to clear the ground for the following simulation of the primary visual pathway.

Neural Computation 13, 327–355 (2001) © 2001 Massachusetts Institute of Technology

1.1 Geniculate Response Timing and Cortical Velocity Tuning. Velocity selectivity or velocity tuning, taken here to mean a preference for a certain speed and direction of motion of visual features, requires convergence of pathways with different spatial information and different temporal characteristics, such as delays, onto single neurons (see, e.g., Hassenstein & Reichardt, 1956; Watson & Ahumada, 1985; Emerson, 1997). For higher mammals this is believed to occur in the primary visual cortex (Movshon, 1975; Orban, Kennedy, & Maes, 1981a, 1981b). In the A-laminae of cat LGN, two types of X-relay cell have been identified that dramatically differ in their temporal response properties (Mastronarde, 1987a; Humphrey & Weller, 1988a; Saul & Humphrey, 1990).
Those that are more delayed in response time and phase have been termed lagged, the others nonlagged cells (with the exception of very few so-called partially lagged neurons) (see Figure 1). In lagged neurons, the on-response to a ash of light is preceded by a dip in the ring rate lasting for 5 to 220 ms, and there is typically a transient of high ring rate just after the offset of a prolonged light stimulus (Mastronarde, 1987a; Humphrey & Weller, 1988a). For a moving light bar, the time lag of the on-response peak is about 100 ms after the bar has passed the receptive eld (RF) center (Mastronarde, 1987a). In contrast, the nonlagged cells’ responses resemble their retinal input and show no transient at stimulus offset (Mastronarde, 1987a; Humphrey & Weller, 1988a). Lagged X-cells comprise about 40% of all X-relay cells (Mastronarde, 1987a; Humphrey & Weller, 1988b). Physiological (Mastronarde, 1987b), pharmacological (Heggelund & Hartveit, 1990), and structural (Humphrey & Weller, 1988b) evidence suggest that rapid feedforward inhibition via intrageniculate interneurons plays a decisive role in shaping the lagged cells’ response. Some authors have additionally related differences in receptor
Corticothalamic Feedback Control
Figure 1: Averaged responses of nonlagged (A, XN, thick line) and lagged (B, XL, thick line) geniculate on-center X-cells and their respective main excitatory retinal input (A and B, X-RGC, thin lines) to a moving light bar. Upper and lower histograms show responses to opposite directions of motion. Double arrowheads indicate the position of the central point of the receptive fields; circles indicate the approximate size of the receptive field centers. The width of the bar was 0.5 degrees and is drawn to scale. The bar was swept at 5 degrees per second and 100 times for the XN-cell, 102 times for the XL-cell in each direction. Spikes were collected in bins of 10 ms width. Source: Adapted from Mastronarde (1987a).
types to the lagged-nonlagged dichotomy (Heggelund & Hartveit, 1990; Hartveit & Heggelund, 1990; see, however, Kwon, Esguerra, & Sur, 1991). Layer 4B in cortical area 17 of the cat is the target of both lagged and nonlagged geniculate X-cells (Saul & Humphrey, 1992a; Jagadeesh et al., 1997; Murthy et al., 1998). The spatiotemporal RFs of its direction-selective simple cells can routinely be interpreted as being composed of subregions that receive geniculate inputs alternating between lagged and nonlagged X-type (Saul & Humphrey, 1992a, 1992b; DeAngelis, Ohzawa, & Freeman, 1995; Jagadeesh et al., 1997; Murthy et al., 1998), just as convergent and segregated geniculate on- and off-inputs have been shown to outline the spatial structure of the simple cells’ RFs (Reid & Alonso, 1995; Alonso et al., 1996; Ferster et al., 1996; Hirsch et al., 1998). According to this view, directional selectivity is created by the response-phase difference of roughly
Ulrich Hillenbrand and J. Leo van Hemmen
a quarter cycle between successive off-lagged, off-nonlagged, on-lagged, and on-nonlagged responses across the RF. At least for simple cells in layer 4B, this RF structure determines the response to moving visual features (McLean & Palmer, 1989; McLean, Raab, & Palmer, 1994; Reid, Soodak, & Shapley, 1991; Albrecht & Geisler, 1991; DeAngelis, Ohzawa, & Freeman, 1993; DeAngelis et al., 1995; Jagadeesh, Wheat, & Ferster, 1993; Jagadeesh et al., 1997; Murthy et al., 1998), and hence the cell’s tuning for direction and speed.1 Lagged and nonlagged inputs that converge, either directly or via other cortical neurons, on simple cells and segregate in separate subregions of the simple cells’ RF are thus likely to contribute to the earliest level of cortical velocity selectivity (Saul & Humphrey, 1990, 1992a, 1992b; Ferster et al., 1996; Jagadeesh et al., 1997; Wimbauer, Wenisch, Miller, & van Hemmen, 1997; Wimbauer, Wenisch, van Hemmen, & Miller, 1997; Murthy et al., 1998). Certainly, intracortical input to cortical cells also contributes to velocity-selective responses, given that these inputs anatomically outnumber thalamic inputs (Ahmed, Anderson, Douglas, Martin, & Nelson, 1994). Suggested intracortical effects include sharpening of tuning properties by suppressive interactions (Hammond & Pomfrett, 1990; Reid et al., 1991; Hirsch et al., 1998; Crook, Kisvarday, & Eysel, 1998; Murthy & Humphrey, 1999), amplification of geniculate inputs by recurrent excitation (Douglas, Koch, Mahowald, Martin, & Suarez, 1995; Suarez, Koch, & Davis, 1995), and normalization of responses by local interactions (Toth, Kim, Rao, & Sur, 1997). Intracortical circuits can in principle even generate their own direction selectivity by selectively inhibiting responses to nonpreferred motion (Douglas et al., 1995; Suarez et al., 1995; Maex & Orban, 1996).
Our modeling is complementary to the latter in that we emphasize the influence of geniculate inputs on cortical RF properties that is suggested by numerous studies (Saul & Humphrey, 1992a, 1992b; Reid & Alonso, 1995; Alonso et al., 1996; Ferster et al., 1996; Toth et al., 1997; Jagadeesh et al., 1997; Murthy et al., 1998; Hirsch et al., 1998), in order to bring out effects that are specific to the geniculate contribution to spatiotemporal tuning. Great care must be taken when extrapolating from cats to primates. In particular, no lagged relay cells have been described in the primate LGN so far. On the other hand, a recent study (Valois & Cottaris, 1998) does suggest a set of geniculate inputs to directionally selective simple cells in macaque striate cortex that is essentially analogous, in terms of response properties, to the lagged-nonlagged set envisaged for cat simple cells. If the underlying physiology for primates turns out to be similar to the one for cats, the results 1 To avoid confusion, we point out that the term speed tuning is sometimes used in a more restricted sense. Simple cells exhibit tuning for spatial and temporal frequencies that results in preference for speeds of moving gratings depending on their spatial frequency. Here we will be concerned with the more natural case of stimuli having a low-pass frequency content (Field, 1994), specifically, those composed of local features such as thin bars.
presented here extend to primates. 1.2 Geniculate Relay Modes. Thalamocortical neurons possess a characteristic blend of voltage-gated ion channels (Jahnsen & Llinás, 1984a, 1984b; Huguenard & McCormick, 1992; McCormick & Huguenard, 1992) that jointly determine the timing and pattern of action potentials in response to a sensory stimulus (see the appendix for a brief introduction to models of ion currents). Depending on the initial membrane polarization, the GRC response to a visual stimulus is in a range between a tonic and a burst mode (Sherman, 1996; Sherman & Guillery, 1996). At hyperpolarization below roughly -70 mV, a Ca2+ current, called the low-threshold Ca2+ current or T-current (IT; T for “transient”), gets slowly deinactivated. As the membrane depolarizes above roughly -70 mV, the current activates, followed by a rapid transition from the deinactivated to the inactivated state, thereby producing a Ca2+ spike with an amplitude that depends on how long and how strongly the cell has been hyperpolarized previously. After sufficient hyperpolarization, the Ca2+ spike will thus reach the threshold for Na+ spiking and give rise to a burst of one to seven action potentials riding its crest (Jahnsen & Llinás, 1984a, 1984b; Huguenard & McCormick, 1992; McCormick & Huguenard, 1992). All other action potentials, those that are not promoted by a Ca2+ spike and, hence, do not group into bursts, are called tonic spikes. Although the issue is still controversial, there is some evidence that a mixture of burst and tonic spikes may be involved in the transmission of visual signals in lightly anesthetized or awake animals (Guido, Lu, & Sherman, 1992; Guido, Lu, Vaughn, Godwin, & Sherman, 1995; Guido & Weyand, 1995; Mukherjee & Kaplan, 1995; Sherman, 1996; Sherman & Guillery, 1996; Reinagel, Godwin, Sherman, & Koch, 1999).
In lagged cells, because of the strong feedforward inhibition they are assumed to receive, burst spikes have been held responsible for the high-activity transient seen at the offset of their retinal input (Mastronarde, 1987a, 1987b), thus contributing substantially to the delayed peak response to a moving bar (Mastronarde, 1987b). In nonlagged cells at resting membrane potentials below -70 mV, bursting constitutes a very early part of the visual response, producing a phase lead of up to a quarter cycle relative to their retinal input (Lu, Guido, & Sherman, 1992; Guido et al., 1992; Mukherjee & Kaplan, 1995). At more depolarized membrane potentials, nonlagged responses are dominated by tonic spikes and are in phase with retinal input (Lu et al., 1992; Guido et al., 1992; Mukherjee & Kaplan, 1995). Cortical feedback to the A-laminae of the LGN, arising mainly from layer 6 of area 17 (Sherman, 1996; Sherman & Guillery, 1996), can locally modulate the response mode, and hence the timing, of GRCs by shifting their membrane potentials on a timescale that is long as compared to retinal inputs. This may occur directly through the action of metabotropic glutamate and NMDA receptors (depolarization) (McCormick & von Krosigk, 1992;
Godwin et al., 1996; Sherman, 1996; Sherman & Guillery, 1996; von Krosigk, Monckton, Reiner, & McCormick, 1999) and indirectly via the perigeniculate nucleus (PGN) or geniculate interneurons by activation of GABAB receptors (hyperpolarization) of GRCs (Crunelli & Leresche, 1991; Sherman & Guillery, 1996; von Krosigk et al., 1999). Indeed, GRCs in vivo are dynamic and differ individually in their degree of burstiness (Lu et al., 1992; Guido et al., 1992; Mukherjee & Kaplan, 1995). Here we explicate the causal link between the variable response timing of GRCs and variable tuning of cortical simple cells for speed of moving features, thus identifying control of speed tuning as a likely mode of corticothalamic operation. 2 The Model
Before turning to our simulation results we describe in this section the underlying model of the primary visual pathway. 2.1 Geniculate Input to the Primary Visual Cortex. For the GRCs we have employed a 12-channel model of the cat relay neuron (Huguenard & McCormick, 1992; McCormick & Huguenard, 1992), adapted to 37 degrees Celsius (see the appendix for a brief introduction to biophysical neuron models). The neuron model includes a transient and a persistent Na+ current, several voltage-gated K+ currents, a voltage- and Ca2+-gated K+ Figure 2: Facing page. Model of the primary visual pathway. (A) Open/filled circles and arrow/bar heads indicate excitatory/inhibitory neurons and their respective synapses. A retinal ganglion cell (RGC) sends its axon to the lateral geniculate nucleus (LGN) and synapses excitatorily on a relay cell (open circle) and an intrageniculate interneuron (filled circle), which in turn inhibits the same relay cell (arrangement called synaptic triad). The relative strengths of feedforward excitation and feedforward inhibition shape a relay cell’s response to be of the lagged or nonlagged type (see the text and Figure 4). There is an inhibitory feedback loop via the perigeniculate nucleus (PGN). The influence of cortical feedback has been modeled as a variation of the relay cells’ resting membrane potential by control of a K+ leak current. There is no cortical input to the PGN in the model. Moreover, we neglect any fast (ionotropic) cortical feedback. Possible effects of such feedback are discussed in section 4. (B) Arrangement in visual space of the receptive field (RF) centers of the 100 lagged and 100 nonlagged relay cells comprising the model LGN. These relay cells are envisaged to project onto the same cortical simple cell and create an on- or off-region of its RF. In the simulations, the diameter of a single lagged or nonlagged RF center is 0.5 degrees.
Results for rescaled versions of this geometry can be derived straightforwardly from the simulations; see section 4. The bar and arrow on the left indicate preferred orientation and direction of motion, respectively. Source: Adapted from Hillenbrand and van Hemmen (2000).
current, a low- and a high-threshold Ca2+ current, a hyperpolarization-activated mixed cation current, and Na+ and K+ leak conductances. As shown in Figure 2A, retinal input reaches a GRC directly as excitation and indirectly via an intrageniculate interneuron as inhibition, thus establishing the typical triadic synaptic circuit found in the glomeruli of X-GRCs (Sherman & Koch, 1990; Sherman & Guillery, 1996). The temporal difference between the two afferent pathways equals the delay of the inhibitory synapse and has been taken to be 1.0 ms (Mastronarde, 1987b). As will be described in more detail, we have found typical lagged responses for strong feedforward inhibition with weak feedforward excitation, in agreement with Mastronarde (1987b), Humphrey and Weller (1988b),
and Heggelund and Hartveit (1990). On the other hand, typical nonlagged responses are produced by weak feedforward inhibition with strong feedforward excitation. We have therefore implemented lagged and nonlagged relay cells in the model by varying the relative strengths of feedforward excitation and feedforward inhibition. It is known that both NMDA and non-NMDA receptors contribute to retinogeniculate excitation to varying degrees, ranging from almost pure non-NMDA to almost pure NMDA-mediated responses in individual GRCs of both lagged and nonlagged varieties (Kwon et al., 1991). At least in lagged cells, however, early responses and, hence, responses to the transient stimuli that will be considered here, seem to depend to a lesser degree on the NMDA receptor type than late responses (Kwon et al., 1991). Since the essential characteristics of lagged and nonlagged responses apparently do not depend on the special properties of NMDA receptors, an assumption confirmed by our results, we have chosen the postsynaptic conductances in GRCs to be entirely of the non-NMDA type. The time course of postsynaptic conductance change in GRCs following reception of an input has been modeled by an alpha function,

g(t > 0) = gmax (t/τ) exp(1 - t/τ).  (2.1)
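The alpha-function conductance of equation 2.1 can be sketched in code as follows. The rise times are the values quoted in the text for excitation and inhibition; the peak conductances gmax are taken from the synaptic settings explored in Figure 4 and are otherwise illustrative:

```python
import math

def alpha_conductance(t_ms, g_max, tau_ms):
    """Alpha function of equation 2.1:
    g(t) = g_max * (t / tau) * exp(1 - t / tau) for t > 0, else 0.
    The factor exp(1) normalizes the peak so that g(tau) == g_max."""
    if t_ms <= 0.0:
        return 0.0
    return g_max * (t_ms / tau_ms) * math.exp(1.0 - t_ms / tau_ms)

# Rise times from the text: 0.4 ms for excitation, 0.8 ms for inhibition.
# Peak conductances (in μS) follow the synaptic settings of Figure 4.
g_exc = [alpha_conductance(t / 10.0, 0.05, 0.4) for t in range(50)]
g_inh = [alpha_conductance(t / 10.0, 0.25, 0.8) for t in range(50)]
```

The conductance rises to gmax at t = τ and decays smoothly afterward, which is the standard behavior of an alpha synapse.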
For excitation, the rise time τ has been chosen to be 0.4 ms (Mukherjee & Kaplan, 1995); for inhibition it is 0.8 ms. The latter value was estimated from the relative durations of S potentials recorded at excitatory and inhibitory geniculate synapses (Mastronarde, 1987b) and was found to reproduce the rise times of inhibitory postsynaptic potentials recorded in relay cells following stimulation of the optic chiasm (Bloomfield & Sherman, 1988). The reversal potentials are, for excitation, 0 mV, and for inhibition, -85.8 mV (Bal, von Krosigk, & McCormick, 1995). The model system comprises 100 lagged and 100 nonlagged relay neurons. Their RF centers are 0.5 degrees in diameter (Cleland, Harding, & Tuluney-Keesey, 1979) and are spatially arranged in a lagged and a nonlagged cluster subtending 0.7 degrees each and displaced by 0.45 degrees (see Figure 2B). More precisely, the central points of the RFs of lagged and nonlagged cells are uniformly distributed within two separate intervals of 0.2 degrees each along a certain axis, which will be the axis of bar motion during stimulation (see section 2.3). The RFs’ offsets in the direction orthogonal to this axis, that is, in the direction that defines the preferred orientation of the bulk RF, are irrelevant as long as the stimulus bar is long enough to pass through all RFs of the relay cells in one sweep. In fact, the bar used in the simulations is much longer than typical RFs of simple cells (see section 2.3). The layout of geniculate inputs (see Figure 2B) matches the basic structure of a single on- or off-region in an RF of a directional simple cell in cortical layer 4B onto which the GRCs are envisaged to project (Saul & Humphrey,
1992a, 1992b; DeAngelis et al., 1995; Jagadeesh et al., 1997; Murthy et al., 1998). To complete the geniculate input to an RF of this type, this lagged-nonlagged unit would have to be repeated with alternating on-off polarity and a spatial offset that would determine the simple cell’s preference for some spatial frequency. Since we are not concerned here with effects of spatial frequency (see footnote 1), omission of the other on-off regions does not affect our conclusions. Results for rescaled RF geometries can be derived straightforwardly from the simulations (see section 4). The number of geniculate cells contributing to a simple cell’s RF has been estimated roughly from Ahmed et al. (1994). Only its order of magnitude matters. We have also taken into account feedback inhibition via the PGN (Lo & Sherman, 1994; Sherman & Guillery, 1996; see Figure 2A). Connections between PGN neurons and GRCs are all-to-all within, and separate for, the lagged and nonlagged populations.2 Axonal plus synaptic delays are 2.0 ms in both directions. Intrageniculate interneurons and PGN cells, like GRCs, possess a complex blend of ionic currents. They are, however, thought to be active mainly in a tonic spiking mode during the awake state (Contreras, Dossi, & Steriade, 1993; Pape, Budde, Mager, & Kisvarday, 1994). For an efficient usage of computational resources and time, we have therefore modeled these neurons by the spike-response model (Gerstner & van Hemmen, 1992), which gives a reasonable approximation to tonic spiking (Kistler, Gerstner, & van Hemmen, 1997). Note that for the present model, it is irrelevant whether transmission across dendrodendritic synapses between intrageniculate interneurons and GRCs actually occurs with or without spikes (cf. Cox, Zhou, & Sherman, 1998). For a relay neuron, all that matters is the fact that an excitatory retinal input is mostly followed by an inhibitory input (Bloomfield & Sherman, 1988).
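A minimal discrete-time sketch of such a spike-response neuron with an accumulating, relaxing refractory potential is given below. The kernel shapes, threshold, step size, and the refractory increment eta0 are illustrative assumptions, not the parameter values of the paper:

```python
import math

def simulate_srm(inputs, dt=0.1, tau_m=10.0, tau_ref=10.0,
                 eta0=-5.0, threshold=1.0):
    """Minimal spike-response neuron (time step dt in ms): the membrane
    potential is a low-pass filtered input plus a negative refractory
    potential that accumulates over recent spikes and relaxes on the
    timescale tau_ref, so its effect saturates under sustained firing."""
    v_syn = 0.0       # filtered synaptic input
    v_ref = 0.0       # accumulated refractory potential (<= 0)
    spike_times = []
    for i, x in enumerate(inputs):
        v_syn += dt * (-v_syn / tau_m + x)
        v_ref *= math.exp(-dt / tau_ref)   # refractory potential decays
        if v_syn + v_ref >= threshold:
            spike_times.append(i * dt)
            v_ref += eta0                  # each spike deepens refractoriness
    return spike_times
```

Because the per-spike increments decay exponentially, the refractory potential approaches a steady level during sustained input, which is the saturation property used in the model.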
The spike-response neurons have been given an adaptive spike output, implemented as an accumulating refractory potential (Gerstner & van Hemmen, 1992); that is, there is some adaptation of transmission across the dendrodendritic synapses. The refractory potential, and hence the effect of adaptation, saturates on a timescale of 10.0 ms. 2.2 Cortical Feedback. Metabotropic glutamate receptors effect a closing of K+ leak channels and a membrane depolarization, while GABAB receptors, via PGN or geniculate interneurons, effect an opening of K+ leak channels and a membrane hyperpolarization. Accordingly, we have incorporated the influence of cortical feedback to the thalamus by varying the K+ leak conductance of GRCs (McCormick & von Krosigk, 1992; Godwin et al.,

2 This synaptic separation of the lagged and the nonlagged pathways was implemented solely to allow for independent simulation of the two. Although an inhibitory coupling of lagged and nonlagged cells could in reality cause some anticorrelation of their firing, there is no evidence for anticorrelation of GRCs. Any such effects thus seem negligible. In any case, they would not affect our conclusions.
1996; see Figure 2A). The resulting stationary membrane potential in the absence of any retinal input will be called resting membrane potential. All GRCs, lagged and nonlagged, have been assigned the same resting membrane potential; here we assume a uniform action of cortical feedback at least on the scale of single RFs in area 17. By varying the resting membrane potential, we investigate a strictly modulatory role of corticogeniculate feedback, as opposed to the retinal inputs that drive relay cells to fire (cf. Sherman & Guillery, 1996; Crick & Koch, 1998). For every single stimulus presentation (see below) we have kept the K+ leak conductance constant. This is justified by the slow action, compared to typical passage times of local stimulus features through RFs, of the metabotropic receptors, ranging from hundreds of milliseconds for GABAB to seconds for metabotropic glutamate receptors (von Krosigk et al., 1999). Nonetheless, it is clear that dynamics in the corticogeniculate pathway may produce effects for slow-moving stimuli that we cannot account for here. In modeling cortical feedback, we neglect input to the LGN that is mediated by ionotropic receptors and, hence, acts on a much shorter timescale. Furthermore, we do not explicitly model cortical input to the PGN. The effects those inputs may have on our results will be discussed in section 4. 2.3 Stimulation. The input to GRCs has been modeled as a set of Poisson spike trains with time-varying firing rates. For investigation of the temporal transfer characteristics of lagged and nonlagged neurons, these rates varied sinusoidally between 0 and 100 spikes per second (amplitude 50 spikes/s, DC component 50 spikes/s) at a range of temporal frequencies. Before any responses to sinusoidal stimuli were collected, the stimuli were presented for 1 second, that is, depending on the frequency, between 1 and 11 cycles. We have recorded the response for the following 100 seconds of stimulus presentation.
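A sinusoidally modulated Poisson spike train of this kind can be generated by thinning. The rate parameters (amplitude 50 spikes/s, DC 50 spikes/s) are those given in the text; the stimulus frequency and random seed are arbitrary choices for the sketch:

```python
import math
import random

def poisson_spikes(rate_fn, t0, t1, r_max, rng):
    """Inhomogeneous Poisson spike train on [t0, t1) by thinning: draw
    candidate events at the constant rate r_max and keep each with
    probability rate_fn(t) / r_max (requires rate_fn(t) <= r_max)."""
    spikes, t = [], t0
    while True:
        t += rng.expovariate(r_max)
        if t >= t1:
            return spikes
        if rng.random() < rate_fn(t) / r_max:
            spikes.append(t)

# Firing rate varying sinusoidally between 0 and 100 spikes/s, as in
# section 2.3; 4 Hz is an example frequency, time t in seconds.
rate = lambda t: 50.0 + 50.0 * math.sin(2.0 * math.pi * 4.0 * t)
train = poisson_spikes(rate, 0.0, 10.0, r_max=100.0, rng=random.Random(1))
```

Thinning is exact for any bounded rate function, so the same sampler can also drive the moving-bar rate profile described next.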
For studying the responses to moving bars, rates have been fitted to recordings from retinal ganglion cells in response to moving, thin (0.1 degrees), long (10 degrees) bars (Cleland & Harding, 1983). The fit for a single retinal ganglion cell is of the form
r(t) = (rp + r0) exp[-(t/Δ)²] - r0,  (2.2)
where rp is the peak rate, r0 is the background rate, and Δ is a width parameter (see Figure 3). For the different speeds of bar motion used in the simulations, rp and Δ have been chosen to fit the data of Cleland and Harding (1983) while r0 = 38 spikes per second (Mastronarde, 1987b). Different GRCs received retinal input from statistically independent sources. We have studied bar responses of single lagged and nonlagged neurons as well as of the entire population of 100 lagged and 100 nonlagged neurons in the geniculate model. Accordingly, bars were moved across single RFs of
Figure 3: Time-dependent rate response to a moving bar of a retinal ganglion cell (cf. equation 2.2). This firing rate has been used in the simulations to generate input spikes to geniculate relay cells by an inhomogeneous Poisson process. The peak rate rp and the width of the response peak have been adjusted for different bar speeds to fit the data of Cleland and Harding (1983). The background rate r0 is taken to be 38 spikes per second (Mastronarde, 1987b).
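The fitted rate of equation 2.2 can be written down directly. The background rate is the 38 spikes/s given in the text; the peak rate and the width Δ are placeholders, since the speed-dependent fitted values of Cleland and Harding (1983) are not quoted here:

```python
import math

R_BG = 38.0      # r0, spikes/s (from the text)
R_PEAK = 150.0   # rp, spikes/s (placeholder value)
DELTA = 0.05     # width parameter Δ, seconds (placeholder value)

def bar_rate(t):
    """Equation 2.2: r(t) = (rp + r0) * exp(-(t/Δ)^2) - r0.
    Equals rp at the peak (t = 0); note that it becomes negative far from
    the peak, so it must be clipped at zero when used as a Poisson rate."""
    return (R_PEAK + R_BG) * math.exp(-(t / DELTA) ** 2) - R_BG
```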
relay cells or the whole bulk RF in the preferred and antipreferred directions (see Figure 2B). Bar motion always started 3Δ (cf. equation 2.2) before it hit the first RF center and stopped 3Δ after it had passed the last RF center. There was a 1 second interval of stimulation with the background activity (38 spikes/s) between bar sweeps. 2.4 Data Analysis. We collected spike times with 0.1 ms resolution. Spikes of single relay neurons in response to moving bars were counted in bins of 5 ms, a timescale relevant to postsynaptic integration, for variable synaptic excitatory and inhibitory input strengths. The bin counts have been averaged over 100 bar sweeps at each velocity and synaptic setting. Spikes of single lagged and nonlagged neurons in response to sinusoidal stimulation with variable frequency ν were counted in a time window of 5 ms shifted by steps of 1 ms. The spike counts were averaged over all cycles of the stimulus presented within 100 seconds of stimulation. For the resulting spike-count functions, we determined the amplitude F1(ν) and the phase φ1(ν) of their first Fourier component. With the amplitude A (= 50 spikes/s; see section 2.3) and the phase ψ of the sinusoidal input rate, we have calculated the amplitude transfer function F1(ν)/A and the phase transfer function ψ - φ1(ν) (negative phase transfer means phase lead over the input). For the investigation of velocity tuning, spikes of all 100 lagged and
100 nonlagged relay cells were pooled. For each velocity v of bar motion tested, we calculated the total lagged and nonlagged response rates rl(v, t) and rnl(v, t), respectively, as spike counts in 5 ms windows shifted by steps of 1 ms (t = 1, 2, . . . ms). The velocity tuning of the pooled lagged and nonlagged peak rates per neuron is

Rℓ(v) = (1/100) max_{t ∈ [ti, tf]} rℓ(v, t),   ℓ = l, nl,  (2.3)
where the times ti and tf are chosen such that all of the response to a bar sweep lies in the interval [ti, tf]. We are primarily interested in the total geniculate input to a cortical simple cell onto which the GRCs are envisaged to project. To this end, we shifted lagged spikes by 2 ms to later times in order to account for the fact that the lagged cells’ conduction times to cortex are slightly longer than those of the nonlagged cells (Mastronarde, 1987a; Humphrey & Weller, 1988a). Furthermore, although lagged responses in the LGN tend to be weaker than nonlagged responses (Mastronarde, 1987a; Humphrey & Weller, 1988a; Saul & Humphrey, 1990), they appear to be about equally efficient in driving cortical simple cells (Saul & Humphrey, 1992a). The cortical (possibly synaptic) cause being beyond the scope of this work, we simply counted every lagged spike twice to obtain the velocity tuning of the effective geniculate input to a cortical cell,

R(v) = (1/100) max_{t ∈ [ti, tf]} [2 rl(v, t - 2 ms) + rnl(v, t)].  (2.4)
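The sliding-window counts of section 2.4 and the peak input rate of equation 2.4 can be transcribed directly; the window and step sizes and the factor of 100 neurons per population are from the text:

```python
import bisect

WINDOW = 5.0   # ms, spike-count window (section 2.4)
STEP = 1.0     # ms, window shift

def sliding_rate(spike_times_ms, t_max_ms):
    """Pooled spike counts in 5 ms windows shifted in 1 ms steps."""
    spikes = sorted(spike_times_ms)
    starts = [STEP * i for i in range(int((t_max_ms - WINDOW) / STEP) + 1)]
    counts = [bisect.bisect_left(spikes, s + WINDOW)
              - bisect.bisect_left(spikes, s) for s in starts]
    return starts, counts

def peak_input_rate(lagged_ms, nonlagged_ms, t_max_ms, n_pairs=100):
    """Equation 2.4: lagged spikes are shifted 2 ms later (longer conduction
    time) and counted twice (equal efficacy); returns the peak of
    2*r_l(t - 2 ms) + r_nl(t) per lagged-nonlagged pair, and the peak time."""
    starts, r_l = sliding_rate([t + 2.0 for t in lagged_ms], t_max_ms)
    _, r_nl = sliding_rate(nonlagged_ms, t_max_ms)
    total = [2 * a + b for a, b in zip(r_l, r_nl)]
    i_peak = max(range(len(total)), key=total.__getitem__)
    return total[i_peak] / n_pairs, starts[i_peak]
```

The same sliding-rate helper also yields the per-population peaks of equation 2.3 and the peak times of equation 2.5 by taking the maximum, or its argument, of each population's counts separately.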
The peak input rate R(v) per lagged-nonlagged pair is correlated with simple cell activity because postsynaptic potentials are summed almost linearly in simple cells (Kontsevich, 1995; Jagadeesh et al., 1993, 1997). The total geniculate input rate R(v) to a cortical neuron depends on (1) the magnitude of the pooled lagged and nonlagged response peaks, Rl(v) and Rnl(v), respectively, and (2) their relative timing. To differentiate between these two factors, we determined the times tl(v) and tnl(v) of the maxima of the lagged and nonlagged response rates, respectively,

tℓ(v) = arg max_{t ∈ [ti, tf]} rℓ(v, t),   ℓ = l, nl,  (2.5)
and calculated the peak time differences tnl(v) - tl(v) as a function of the bar velocity v. Means and standard errors have been estimated from a sample of 30 bar sweeps at each bar velocity. 2.5 Numerics. The model is described by a high-dimensional system of nonlinear, coupled, stochastic differential equations. For numerical integration of the GRC dynamics, we used an adaptive fifth-order Runge-Kutta
algorithm3 (Press, Teukolsky, Vetterling, & Flannery, 1992). The maximal time step was 0.02 ms and was scaled down to satisfy upper bounds on the estimated error per time step. Increasing or decreasing those bounds by a factor of 10 had negligible effects on the time course of the membrane potential of a GRC, and no effect on spike timing within the temporal resolution of 0.1 ms we used for recording. The dynamics of spike-response neurons was solved by exact integrals. Each simulation started with a 3 second period without any stimulus to allow the GRCs’ dynamics to converge on its stationary (resting) state. Simulations were run on an IBM SP2 parallel computer. 3 Results
We first address the response properties of single relay neurons in the model and then turn to the total geniculate input to a cortical neuron. 3.1 Lagged and Nonlagged Relay Neurons. We have checked whether both lagged- and nonlagged-type responses could be produced within our model by simply varying the synaptic strengths of feedforward excitation and feedforward inhibition of relay neurons (see Figure 2A). Varying the peak postsynaptic conductances gmax (see equation 2.1) for excitation and inhibition and stimulating with a bar moving at 4 degrees per second, we found a lagged-nonlagged transition that is analogous to a first-order phase transition in response timing (see Figure 4 for an example at a resting membrane potential of -65 mV). At strong excitation and weak inhibition, there is a response peak with zero delay relative to the input peak. As the excitation is reduced and the inhibition increased, this nonlagged peak shrinks, and a lagged peak develops. The latter invariably has a delay of roughly 100 ms relative to the input peak, a value consistent with experimental data (Mastronarde, 1987a; cf. Figure 1B). At strong inhibition and weak excitation, the lagged peak is the dominant part of the response. We have also checked the dependence of relay-cell responses on their resting membrane potential. The peak postsynaptic conductances gmax for the lagged cell have now been fixed at 0.0125 μS for excitation and 0.25 μS for inhibition; for the nonlagged cell, they have been fixed at 0.05 μS for excitation and at 0.0125 μS for inhibition (cf. Figure 4). In Figure 5 we show the bar response (4 deg/s) and the temporal transfer of amplitude and phase of a lagged and a nonlagged neuron for the resting membrane potentials -72 mV and -61 mV. Again, the response data agree well with experiments

3 An algorithm for numeric integration of ordinary differential equations is said to be of nth order, if the error per time step dt is of order dt^(n+1).
Note that because of discontinuities in the system of differential equations (Huguenard & McCormick, 1992; McCormick & Huguenard, 1992), more sophisticated and faster methods for integration than Runge-Kutta cannot be safely applied.
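Error-controlled stepping of the kind described in section 2.5 can be illustrated with step doubling. This is a generic sketch, not the embedded fifth-order scheme of Press et al. (1992) used in the paper, and the test equation is a simple passive membrane rather than the 12-channel GRC model:

```python
def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2.0, y + h / 2.0 * k1)
    k3 = f(t + h / 2.0, y + h / 2.0 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate_adaptive(f, t, y, t_end, h_max=0.02, tol=1e-8):
    """Advance y from t to t_end, estimating the local error as the
    difference between one full step and two half steps; the step is halved
    when the estimate exceeds tol and cautiously enlarged otherwise."""
    h = h_max
    while t < t_end:
        h = min(h, t_end - t)
        y_full = rk4_step(f, t, y, h)
        y_half = rk4_step(f, t + h / 2.0, rk4_step(f, t, y, h / 2.0), h / 2.0)
        if abs(y_full - y_half) > tol and h > 1e-6:
            h /= 2.0                 # error too large: retry (h floor avoids stalling)
            continue
        t, y = t + h, y_half
        h = min(2.0 * h, h_max)      # error small: grow back toward h_max
    return y

# Passive membrane relaxing toward -65 mV with a 10 ms time constant.
v_end = integrate_adaptive(lambda t, v: -(v + 65.0) / 10.0, 0.0, -80.0, 50.0)
```

The 0.02 ms cap on the step size mirrors the maximal time step quoted in section 2.5; the accept/reject logic is what keeps the estimated error per step bounded.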
Figure 4: Dependence of moving-bar response of single modeled relay neurons on the strengths of feedforward excitation and feedforward inhibition. In each plot, the horizontal axis spans 750 ms; the vertical axis indicates the time of the retinal input peak and spans 100 spikes per second. Across the whole array of plots, the peak postsynaptic conductances gmax (equation 2.1) vary for excitation horizontally from 0 to 0.05 μS, and for inhibition vertically from 0 to 0.25 μS. The regions of lagged- and nonlagged-type responses in this parameter space are at low excitation with high inhibition and at high excitation with low inhibition, respectively. The resting membrane potential is -65 mV. Responses are averaged over 100 bar sweeps.
(Mastronarde, 1987a; Saul & Humphrey, 1990; Lu et al., 1992; Guido et al., 1992; Mukherjee & Kaplan, 1995). In particular, the lagged cell’s response shows a phase lag relative to the input that increases with frequency; the nonlagged cell goes through a transition from a low-pass and in-phase relay mode to a bandpass and phase-lead (at frequencies < 8 Hz) relay mode
Figure 5: Dependence of moving-bar response and temporal transfer function of single modeled relay neurons on their resting membrane potential. Typical nonlagged responses (top row, gmax = 0.05 μS for excitation and 0.0125 μS for inhibition) and lagged responses (bottom row, gmax = 0.0125 μS for excitation and 0.25 μS for inhibition; cf. Figure 4) have been reproduced at the two resting membrane potentials -72 mV (solid lines) and -61 mV (dashed lines). For the bar responses (leftmost column), the time of the retinal input peak has been set to zero. As the membrane is hyperpolarized, the nonlagged bar response peak shifts to earlier times. Conversely, the lagged bar response shifts to later times. The changes in bar response timing are also reflected in corresponding changes in the phase transfer functions (rightmost column). Bar responses are averaged over 100 bar sweeps. Amplitude and phase transfer have been calculated from responses to sinusoidal input rates, averaged over 100 seconds. Note the different scales on the “cycles” axes for nonlagged and lagged cells.
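The amplitude and phase transfer used in Figure 5 (cf. section 2.4) can be computed from a cycle-averaged response by projecting onto the stimulus frequency. The phase convention below (response ~ F1 cos(2πνt - φ1), positive φ1 = lag) is an illustrative choice for the sketch:

```python
import math

def first_harmonic(response, dt, freq):
    """Amplitude F1 and phase phi1 of the first Fourier component of a
    cycle-averaged response sampled every dt seconds over whole cycles."""
    n = len(response)
    w = 2.0 * math.pi * freq
    c = 2.0 / n * sum(r * math.cos(w * k * dt) for k, r in enumerate(response))
    s = 2.0 / n * sum(r * math.sin(w * k * dt) for k, r in enumerate(response))
    return math.hypot(c, s), math.atan2(s, c)

def transfer(response, dt, freq, input_amp, input_phase):
    """Amplitude transfer F1/A and phase transfer psi - phi1; negative
    phase transfer means phase lead over the input (section 2.4)."""
    f1, phi1 = first_harmonic(response, dt, freq)
    return f1 / input_amp, input_phase - phi1
```

Any DC component of the response (such as the background firing rate) cancels in the projection, so only the modulation at the stimulus frequency enters the transfer functions.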
as the membrane hyperpolarizes. The former corresponds to the tonic, the latter to the burst relay mode. Remarkably, as the resting membrane potential is varied, the timing of the bar response shifts in opposite directions for lagged and nonlagged cells (cf. Figure 5, left column). Increasing hyperpolarization shifts the lagged response peak to later times, while the nonlagged response peak moves to earlier times. In view of what we have reviewed on relay modes and lagged cells (cf. section 1.2), it seems likely that the low-threshold Ca2+ current IT is in part responsible for the GRCs’ response timing. In Figure 6 we show simulated traces of IT for the moving-bar scenarios. For nonlagged neurons, the current is insignificant at -61 mV, but exhibits a pronounced peak at the start of the response to the bar at -72 mV. The peak of IT confirms the nature of the early response component seen in Figure 5 (top left column) as Ca2+-mediated burst spikes, in agreement with Lu et al. (1992), Guido et al.
Ulrich Hillenbrand and J. Leo van Hemmen
[Figure 6 panels: IT bar response (nA vs. ms) for nonlagged (top) and lagged (bottom) neurons at resting membrane potentials −72 mV and −61 mV.]
Figure 6: Transient and low-threshold Ca²⁺ current IT associated with the bar stimulus scenarios shown in Figure 5, leftmost column (averaged over 100 bar sweeps). High IT indicates burst spikes mediated by underlying Ca²⁺ spikes. For the nonlagged neuron at a resting membrane potential of −61 mV, IT is always small and does not contribute to the response. In the remaining cases, the timing of the response shown in the leftmost column of Figure 5 can be seen to be largely determined by IT.
(1992), and Mukherjee and Kaplan (1995). For lagged neurons, on the other hand, we see that the timing of the IT peak faithfully reflects the timing of the response peak at both resting membrane potentials (see Figure 5, bottom left column). In fact, the profile of the IT traces resembles the one of the spike rates, indicating that the Ca²⁺ current promotes firing throughout the transient responses simulated here. We will return to the significance of burst spikes for the results in section 4. The reason for the opposite shifts of lagged and nonlagged response timing, then, lies in the interaction of the low-threshold Ca²⁺ current IT with the different levels of inhibition received by lagged and nonlagged neurons. With only weak feedforward inhibition, nonlagged neurons respond to retinal input with immediate depolarization, eventually reaching the activation threshold for the Ca²⁺ current. If the Ca²⁺ current is in the deinactivated state, it will boost depolarization and give rise to an early burst component of the visual response. The lower the resting membrane potential, the more deinactivated and, hence, the stronger the Ca²⁺ current will be, and the stronger the early burst relative to the late tonic response component. Lagged neurons, on the other hand, receive strong feedforward inhibition and, hence, initially respond to retinal input with hyperpolarization. Repolarization occurs when inhibition gets weaker. This may result from either cessation of retinal input or adaptation, that is, fatigue, of the inhibitory input to GRCs (cf. Figure 2A). With the Ca²⁺ current IT being deinactivated by the excursion of the membrane potential to low values, lagged spiking starts with burst spikes as soon as the voltage reaches the Ca²⁺ activation threshold. This will take longer if the resting membrane potential is lower, leading to the shift in response timing with membrane polarization observed here. Adaptation of inhibition is implemented in this model by the refractoriness of the spike-response neurons that represent the inhibitory interneurons (cf. section 2.1). The refractory potential saturates, however, on a timescale (10.0 ms) much shorter than the delay of the lagged response of roughly 100 ms (cf. Figure 5). Its role in generating a response delay for lagged neurons in our model can thus be only very limited. It is important to note that the lagged on-response is different from a nonlagged off-response. A nonlagged off-response produces a phase lag of half a cycle relative to the nonlagged on-response at all frequencies. The right column of Figure 5 shows that this is not true for the simulated lagged response.
Rather, the phase transfer function has a significantly higher slope—that is, a higher phase latency—for the lagged response than for the nonlagged response (cf. Saul & Humphrey, 1990) at both resting membrane potentials. Moreover, we observed that lagged cells produce a delay of moving-bar responses that does not vanish at high speeds (not shown). This fixed delay component must be largely determined by the internal neuronal dynamics of the ion currents, notably of IT, that follows hyperpolarization. For the remaining simulations we have always set the peak postsynaptic conductances for lagged and nonlagged neurons to the values used for the data shown in Figures 5 and 6.

3.2 Total Geniculate Input to Cortex.

Lagged and nonlagged responses have to be combined so as to yield a velocity-selective input to a cortical neuron. For different values of the resting membrane potential, Figure 7 shows in the columns from left to right the velocity tuning of the lagged population (Rl), of the nonlagged population (Rnl), the peak-time differences (tnl − tl) of their responses for the preferred direction, and the tuning of the total geniculate input (R) to a cortical cell for the preferred and nonpreferred direction of motion (see section 2.4 for details). As in vivo, the
[Figure 7 panels for resting membrane potentials −61 mV, −66.5 mV, −72 mV, and −76 mV (rows); columns show Rl, Rnl, tnl − tl, and R (preferred/nonpreferred); horizontal axes log2[v/(deg s⁻¹)], rates in spikes/s, time differences in ms.]
Figure 7: Geniculate moving-bar response and geniculate input to cortex as predicted by model simulations. Velocity tuning and timing of the response peaks have been plotted for resting membrane potentials indicated on the far left. In the columns, we show from left to right as functions of the bar velocity the peak response rate of the lagged population (Rl), the nonlagged population (Rnl), their peak time difference (tnl − tl) for the preferred direction, and the total geniculate input (R) to a cortical cell for the preferred (solid lines) and nonpreferred (dashed lines) direction of motion. The horizontal axes show the logarithm (base 2) of speed in all graphs. The bars in the graphs are standard errors. As the membrane is hyperpolarized, the total geniculate input to a cortical cell peaks at progressively lower velocities. Means and standard errors are estimated from 30 bar sweeps. Source: Adapted from Hillenbrand and van Hemmen (2000).
lagged cells prefer lower velocities and have lower peak firing rates than the nonlagged cells (Mastronarde, 1987a; Humphrey & Weller, 1988a; Saul & Humphrey, 1990). The key observation, however, is that the maximum of the total geniculate input rate to a cortical neuron shifts to lower velocities as the membrane potential hyperpolarizes (see Figure 7, right column). The total geniculate input rate R assumes its maximum at a velocity of bar motion where the peak discharges of the lagged and nonlagged neurons coincide—where tnl − tl ≈ 0. The shift of the maximum with hyperpolarization to lower velocities is produced by a corresponding shift of the peak-time differences tnl − tl and of the lagged tuning Rl, while the maximum of the nonlagged tuning Rnl remains essentially unchanged. The shift of the peak time differences, in turn, is a reflection of the opposite shifts in bar response
timing of lagged and nonlagged neurons described in the previous subsection (cf. Figure 5, left column). The total geniculate input rate R is higher for the direction of bar motion where tnl − tl assumes lower values. In other words, the direction preferred is the one where the lagged cells receive their retinal input before the nonlagged cells (cf. Figures 2B and 7, right column). We found that feedback inhibition from the PGN does not affect the timing of lagged and nonlagged responses. Its only effect is to reduce variations in response amplitude by countering increases in firing rate of relay neurons with stronger inhibition. The PGN feedback loop thus moderates the differences in response activity both at different levels of the resting membrane potential and between lagged and nonlagged neurons. The latter difference may be further reduced by making the feedback inhibition stronger for nonlagged than for lagged neurons. For the data shown in Figure 7, this has been implemented by allowing stronger or, equivalently, more synapses of nonlagged neurons on PGN cells than synapses of lagged neurons. Despite this, the nonlagged responses dominate, and there is a drop of geniculate activity with increasing hyperpolarization, especially of the lagged responses. We will return to this issue in section 4. In general, however, the PGN loop increases the range of resting membrane potentials of relay cells that yield balanced lagged and nonlagged inputs to the cortex, thereby extending the dynamic range of speed tuning of the total geniculate input to a cortical neuron.

4 Discussion
Recently it has been proposed that corticogeniculate feedback modulates the spatial layout of simple-cell RFs by exploiting the thalamic burst-tonic transition of relay modes (Wörgötter et al., 1998). Along a similar line, the main point made by our modeling is that one should expect a modulatory influence of cortical feedback on the spatiotemporal RF structure of simple cells. More precisely, we observe a shift in the time to the bar response peak that is opposite for lagged and nonlagged cells (cf. Figure 5, left column). Assuming (1) an RF layout as usually found for direction-selective simple cells in area 17 and (2) an influence of convergent geniculate lagged and nonlagged inputs on this RF structure, it follows that the observed shifts in response timing affect cortical speed tuning. To the best of our knowledge, nobody has looked for such an effect yet. We have investigated the geniculate input to simple cells, which clearly cannot be compared with their output directly. Because of intracortical processing, we cannot expect to reproduce tuning widths and direction selectivity indices of cortical neurons. Rather, the tuning width of geniculate input is likely to be larger, and its directional selectivity weaker, than that of a cortical neuron’s output (cf. section 1.1). Indeed, superficial inspection of the rightmost column of Figure 7 reveals that the directional bias of R is rather weak compared to what can be found for directional cells in cat areas
17 and 18 (Orban et al., 1981a). On the other hand, the tuning width of R is relatively narrow (Orban et al., 1981b) instead of wide. The narrowness of speed tuning in our simulations may be reconciled with experimental data in the following ways. First, we have simulated the ideal case of equal resting membrane potential, and hence lagged and nonlagged response timing, for all of the GRCs. Scattered values of membrane potentials will produce less sharply tuned profiles for R. Second, if it is true that velocity tuning is not a static but a dynamic property of cortical cells, as is proposed in this article, measured—effective—tuning widths should be larger than the width of the tuning under static conditions as simulated here. Quantitative comparison of the tuning of R with cortical velocity tuning is, for the above reasons, problematic. Nonetheless, it is interesting to note that much like velocity tuning in areas 17 and 18 (Orban et al., 1981b), the dynamic range of the modeled geniculate input—that is, the difference between the highest and the lowest response values on each tuning curve R(v)—decreases and the tuning width increases with decreasing optimal velocity⁴ (see Figure 7, right column). Moreover, the range of preferred velocities lies within the range observed for velocity-tuned cells (Orban et al., 1981b). Because of scaling properties of the retinal ganglion cells’ velocity tuning (Cleland & Harding, 1983), rescaled versions of the RF geometry shown in Figure 2B produce accordingly shifted tuning curves (on a logarithmic speed scale). In particular, we retrieve the positive correlation between RF size and preferred speed found in areas 17 and 18 (Orban et al., 1981b) from the geniculate input. The effects of lagged and nonlagged response timing in the present model are dependent on the low-threshold Ca²⁺ current and ensuing burst spikes.
The significance of our results for visual processing in the awake, behaving animal, then, is subject to the occurrence of burst spikes under such conditions. As mentioned in section 1.2, this issue is still under much debate. For nonlagged cells, burst spikes will have a role in normal vision only if their resting membrane potential gets hyperpolarized enough. For lagged cells, it cannot be settled whether their (transient) responses are indeed supported by the low-threshold Ca²⁺ current, as was seen in the simulations. If this turns out to be wrong, the effect of cortical input on lagged response timing could be different from what we have observed. In this regard, it would be interesting to study the effect of additional NMDA channels at the synapses of retinal afferents on GRCs (cf. Heggelund & Hartveit, 1990; Hartveit & Heggelund, 1990). Nonetheless, the data on response timing of the modeled lagged cells suggest that some essential aspect of the true lagged mechanism has been captured in the model. Responses of X-relay cells to moving bars and textures are on average reduced after ablation of the visual cortex in cats (Gulyas, Lagae, Eysel, &
⁴ The correlation with tuning width was significant only in area 18 (Orban et al., 1981b).
Orban, 1990). This is consistent with what we observe in our simulations of relay cells, assuming a depolarizing net effect of cortical feedback on relay neurons (Funke & Eysel, 1992; Wörgötter et al., 1998). In fact, the response rates of lagged and nonlagged neurons decrease with progressive hyperpolarization (cf. Figure 7, first and second columns), despite disinhibition by PGN feedback. The question arises of how the visual cortex would deal with the resulting differences in the maximal geniculate input activity (cf. Figure 7, right column) in a way that preserves the speed tuning of the afferent signal for a wide range of geniculate membrane polarizations. In principle, this is straightforward since it is area 17 itself that modulates the membrane potential of relay cells. By a similar mechanism, it could adjust the responsiveness of layer 4B neurons to geniculate input. An appropriate modulatory signal could most easily be derived from the same layer 6 neurons that project to the LGN, or from their neighbors that share the same information on the actual corticothalamic feedback. In this context, it is very interesting that layer 6 neurons that project to the LGN indeed send axon collaterals specifically to layer 4 (Katz, 1987). The PGN, and more generally the thalamic reticular nucleus, implements both a disynaptic inhibitory feedback loop and an indirect corticothalamic feedback pathway to relay cells (Sherman, 1996; Sherman & Guillery, 1996). This double role suggests possible interactions between the two functions. Depending on whether individual PGN neurons engage in both types of circuitry and on the details of connections between different PGN neurons, the strength of the disynaptic feedback inhibition exerted by PGN neurons on GRCs could be modulated by cortical feedback. Unlike in our simulations, the efficiency of the LGN-PGN loop might thus covary with the GRCs’ resting membrane potential.
In theory, this would offer a very elegant mechanism to compensate the mentioned differences in GRC response level at different resting membrane potentials. For the time being, this is mere speculation. Recently it has been found that responses in the thalamic reticular nucleus of rat that are mediated by a specific subtype of metabotropic glutamate receptor (group II) result in long-lasting cell hyperpolarization (Cox & Sherman, 1999) instead of depolarization as usual. The effect seems to be caused by the opening of a K⁺ leak channel, similar to a GABAB response. This observation adds variants of possible corticothalamic pathways for the slow control of thalamic membrane potential. Specifically, it suggests that relay cells may be depolarized by reticular disinhibition. Moreover, if group II receptors turned out to be active on relay neurons, as they are on reticular neurons, a direct hyperpolarizing effect of corticothalamic feedback would become conceivable. In the model we have considered only one type of cortical input to the LGN: the input mediated by metabotropic receptors that slowly control a K⁺ leak conductance on GRCs (cf. section 2.2). There are other cortical inputs,
mediated by ionotropic receptors, that act on the much shorter timescale of the retinal inputs. While such cortical feedback certainly influences the detailed temporal pattern of geniculate spiking (see, e.g., Sillito et al., 1994), it seems unlikely that it affects the gross timing of a transient response peak on a timescale of several tens of milliseconds. An interesting exception is perhaps NMDA receptor-mediated feedback, with time constants in between those of metabotropic and (ionotropic) AMPA/kainate or GABAA responses. In future work, it would be interesting to include NMDA channels at corticothalamic synapses in the model. We have presented arguments for the existence of a particular dynamic gating mechanism for thalamocortical information transfer—for the transfer of information on visual motion. New experiments are required to check the implications directly. If the proposed mechanism turns out to be effective in awake, behaving animals, it will have important, as yet unrecognized, consequences for motion processing. A possible implication in motion-mediated object segmentation is discussed in Hillenbrand and van Hemmen (2000).

Appendix
We here give a brief account of essential concepts that are related to biophysical neuron models and, in particular, to the model of the thalamic relay neuron (Huguenard & McCormick, 1992; McCormick & Huguenard, 1992) studied in this work. For a detailed exposition of data and theory on ion channels and excitable membranes, see Tuckwell (1988a, 1988b) and Hille (1992). The essential electrical properties of neuronal membranes are described by the differential equation

$$\frac{dV}{dt} = \frac{1}{C} \sum_{i=1}^{n} I_i, \tag{A.1}$$
where V is the cell’s membrane potential, Ii are the currents through the different types of ion channels in the membrane, and C is the membrane capacitance. The art of building a neuron model is to find good empirical, quantitative descriptions of all the relevant ion currents. Equation A.1 describes a pointlike neuron or a single compartment of an extended neuron. Thalamic relay neurons are well described by single-compartment models (Huguenard & McCormick, 1992; McCormick & Huguenard, 1992). Within the Ohmic approximation, the ion currents are described by

$$I_i = g_i \, m_i^{p_i} h_i^{q_i} \, (V_i - V), \tag{A.2}$$

with the reversal potential Vi, the maximal conductance gi, the gates mi and hi, and some positive, usually integer, constants pi and qi. The reversal potential Vi is approximately equal to the Nernst potential for ions of type i but is usually determined empirically. The gates mi and hi are dynamic variables that assume values between zero and one according to differential equations that involve the membrane potential V (see below). It must be stressed that expressions of type A.2 are primarily empirical fits to the voltage dependence of ionic currents. Nonetheless, an oversimplified but intuitive physical interpretation of A.2 is that ion currents flow through an ensemble of channels of type i that have pi m-gates and qi h-gates each. The gates are open and closed with certain probabilities. An individual channel allows ions to pass only if all its gates are in the open state. In what follows, we will drop the index i for notational simplicity. With the given picture of ionic gates in mind, we may understand the dynamics of the gates m and h. Transitions between the open and closed states are governed by the transition rates α_{m/h} and β_{m/h},

$$\frac{dm}{dt} = \alpha_m(V)(1 - m) - \beta_m(V)\, m, \tag{A.3}$$

$$\frac{dh}{dt} = \alpha_h(V)(1 - h) - \beta_h(V)\, h. \tag{A.4}$$
The rates, in turn, are functions of the membrane potential V. Instead of transition rates, one may specify the gates’ asymptotic values m∞ and h∞ and time constants τ_{m/h}. Their relation to the transition rates is

$$m_\infty(V) = \frac{\alpha_m(V)}{\alpha_m(V) + \beta_m(V)}, \tag{A.5}$$

$$h_\infty(V) = \frac{\alpha_h(V)}{\alpha_h(V) + \beta_h(V)}, \tag{A.6}$$

$$\tau_{m/h}(V) = \frac{1}{\alpha_{m/h}(V) + \beta_{m/h}(V)}. \tag{A.7}$$
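The conversion between the two parameterizations in equations A.5–A.7 can be sketched in a few lines of code. The rate functions below are illustrative Hodgkin-Huxley-style stand-ins, not the relay-cell rates used in this work; only the conversion itself follows the equations.

```python
import math

# Hypothetical rate functions alpha(V), beta(V) in 1/ms (illustrative only).
def alpha_m(v):
    return 0.1 * (v + 40.0) / (1.0 - math.exp(-(v + 40.0) / 10.0))

def beta_m(v):
    return 4.0 * math.exp(-(v + 65.0) / 18.0)

def gate_steady_state_and_tau(alpha, beta, v):
    """Convert (alpha, beta) at voltage v into (x_inf, tau_x); equations A.5-A.7."""
    a, b = alpha(v), beta(v)
    return a / (a + b), 1.0 / (a + b)

m_inf, tau_m = gate_steady_state_and_tau(alpha_m, beta_m, -65.0)
```

Specifying m∞ and τ directly, as many published channel models do, thus loses no information: the rates can always be recovered as α = m∞/τ and β = (1 − m∞)/τ.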
By convention, the m-gate is usually the one that opens (m∞ ≈ 1) at higher and closes (m∞ ≈ 0) at lower membrane potentials; for the h-gate, the situation is the other way around. The m-gate is called the activation gate and the h-gate the inactivation gate. Accordingly, a current is said to activate when the m-gate opens and inactivate when the h-gate closes. For the firing pattern of thalamic relay neurons, the transient and low-threshold Ca²⁺ current IT is of particular importance. Analogous to the production of Na⁺ spikes by the transient Na⁺ current INa, IT produces Ca²⁺ spikes that can promote Na⁺ spikes (see section 1.2). Some types of ion channels do not inactivate; they have q = 0. An ion channel that neither inactivates nor deactivates, that is, p = q = 0, is called a leak channel. Leak channels are characterized by a constant conductance g (cf. equation A.2).
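Putting equations A.1–A.3 together, a single-compartment model reduces to coupled ODEs for V and the gates, which can be integrated with a forward-Euler step. The following is a minimal sketch with made-up parameters (a leak channel plus one non-inactivating gated current, p = 1, q = 0), not the published relay-cell model.

```python
import math

# Illustrative parameters, not the Huguenard-McCormick relay-cell values.
C = 1.0                          # membrane capacitance
g_leak, V_leak = 0.05, -70.0     # leak: constant conductance, reversal (mV)
g_gated, V_gated = 0.5, -90.0    # gated current: max conductance, reversal

def m_inf(v):
    # Hypothetical sigmoidal activation curve (steady-state form of A.5).
    return 1.0 / (1.0 + math.exp(-(v + 30.0) / 5.0))

def tau_m(v):
    return 2.0  # ms; a constant time constant keeps the sketch simple

def step(v, m, dt=0.01):
    """One forward-Euler step of equations A.1-A.3."""
    i_leak = g_leak * (V_leak - v)          # equation A.2 with p = q = 0
    i_gated = g_gated * m * (V_gated - v)   # equation A.2 with p = 1, q = 0
    v += dt * (i_leak + i_gated) / C        # equation A.1
    m += dt * (m_inf(v) - m) / tau_m(v)     # equation A.3 in m_inf/tau form
    return v, m

v, m = -55.0, 0.0
for _ in range(50000):   # 500 ms of simulated time at dt = 0.01 ms
    v, m = step(v, m)
# v relaxes to a resting value just below V_leak, between V_gated and V_leak,
# because the gated current is almost fully closed at these voltages.
```

A production model would use all the relay-cell currents, the Goldman form for Ca²⁺ currents (equation A.8 below), and a stiffer-safe integrator, but the structure is the same.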
For some ion channels the Ohmic approximation A.2 for the ion current I is not satisfactory. In those cases Goldman’s constant-field equation,

$$I = g\, m^p h^q \, \frac{V z^2 e^2}{kT} \cdot \frac{c_i - c_e \exp(-zeV/kT)}{1 - \exp(-zeV/kT)}, \tag{A.8}$$

often is a better choice (Tuckwell, 1988a). Here ci and ce are the ion’s concentrations in the intra- and extracellular space, respectively, and z is its valence. As usual, e is the elementary charge, k is the Boltzmann constant, and T is the absolute temperature. For the thalamic relay neuron, the Ca²⁺ currents are modeled according to equation A.8. Besides the voltage-gated channels introduced here, thalamic relay neurons have channels that are gated by membrane potential and the intracellular concentration of Ca²⁺ ions (Huguenard & McCormick, 1992; McCormick & Huguenard, 1992). Their transition rates α and β (cf. equations A.3 and A.4) are functions of membrane voltage and intracellular Ca²⁺ concentration. Moreover, receptor-gated channels are responsible for most of the synaptic transmission in the central nervous system (cf. equation 2.1).

Acknowledgments
We thank Esther Peterhans and Christof Koch for stimulating discussions and two anonymous referees for valuable comments on the manuscript. U. H. was supported by the Deutsche Forschungsgemeinschaft (DFG), grant GRK 267.

References

Ahmed, B., Anderson, J. A., Douglas, R. J., Martin, K. A. C., & Nelson, J. C. (1994). Polyneuronal innervation of spiny stellate neurons in cat visual cortex. Journal of Comparative Neurology, 341, 39–49.
Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast-response function of simple cells in the visual cortex. Visual Neuroscience, 7, 531–546.
Alonso, J. M., Usrey, W. M., & Reid, R. C. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383, 815–819.
Bal, T., von Krosigk, M., & McCormick, D. A. (1995). Synaptic and membrane mechanisms underlying synchronized oscillations in the ferret LGNd in vitro. Journal of Physiology (Lond), 483, 641–663.
Bloomfield, S. A., & Sherman, S. M. (1988). Postsynaptic potentials recorded in neurons of the cat’s lateral geniculate nucleus following electrical stimulation of the optic chiasm. Journal of Neurophysiology, 60, 1924–1945.
Cleland, B. G., & Harding, T. H. (1983). Response to the velocity of moving visual stimuli of the brisk classes of ganglion cells in the cat retina. Journal of Physiology (Lond), 345, 47–63.
Cleland, B. G., Harding, T. H., & Tulunay-Keesey, U. (1979). Visual resolution and receptive field size: Examination of two kinds of cat retinal ganglion cell. Science, 205, 1015–1017.
Contreras, D., Dossi, R. C., & Steriade, M. (1993). Electrophysiological properties of cat reticular thalamic neurones in vivo. Journal of Physiology (Lond), 470, 273–294.
Cox, C. L., & Sherman, S. M. (1999). Glutamate inhibits thalamic reticular neurons. Journal of Neuroscience, 19, 6694–6699.
Cox, C. L., Zhou, Q., & Sherman, S. M. (1998). Glutamate locally activates dendritic outputs of thalamic interneurons. Nature, 394, 478–482.
Crick, F., & Koch, C. (1998). Constraints on cortical and thalamic projections: The no-strong-loops hypothesis. Nature, 391, 245–250.
Crook, J. M., Kisvarday, Z. F., & Eysel, U. T. (1998). Evidence for a contribution of lateral inhibition to orientation tuning and direction selectivity in cat visual cortex: Reversible inactivation of functionally characterized sites combined with neuroanatomical tracing techniques. European Journal of Neuroscience, 10, 2056–2075.
Crunelli, V., & Leresche, N. (1991). A role for GABAB receptors in excitation and inhibition of thalamocortical cells. Trends in Neurosciences, 14, 16–21.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1993). Spatiotemporal organization of simple-cell receptive fields in the cat’s striate cortex. II. Linearity of temporal and spatial summation. Journal of Neurophysiology, 69, 1118–1135.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1995). Receptive-field dynamics in the central visual pathways. Trends in Neurosciences, 18, 451–458.
Douglas, R. J., Koch, C., Mahowald, M., Martin, K. A. C., & Suarez, H. H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Emerson, R. C. (1997). Quadrature subunits in directionally selective simple cells: Spatiotemporal interactions. Visual Neuroscience, 14, 357–371.
Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–252.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Funke, K., & Eysel, U. T. (1992). EEG-dependent modulation of response dynamics of cat DLGN relay cells and the contribution of corticogeniculate feedback. Brain Research, 573, 217–227.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of “spiking” neurons. Network, 3, 139–164.
Godwin, D. W., Vaughan, J. W., & Sherman, S. M. (1996). Metabotropic glutamate receptors switch visual response mode of lateral geniculate nucleus cells from burst to tonic. Journal of Neurophysiology, 76, 1800–1816.
Guido, W., Lu, S.-M., & Sherman, S. M. (1992). Relative contributions of burst and tonic responses to the receptive field properties of lateral geniculate neurons in the cat. Journal of Neurophysiology, 68, 2199–2211.
Guido, W., Lu, S.-M., Vaughan, J. W., Godwin, D. W., & Sherman, S. M. (1995). Receiver operating characteristic (ROC) analysis of neurons in the cat’s lateral geniculate nucleus during tonic and burst response mode. Visual Neuroscience, 12, 723–741.
Guido, W., & Weyand, T. (1995). Burst responses in thalamic relay cells of the awake, behaving cat. Journal of Neurophysiology, 74, 1782–1786.
Guillery, R. W. (1995). Anatomical evidence concerning the role of the thalamus in corticocortical communication: A brief review. Journal of Anatomy, 187, 583–592.
Gulyas, B., Lagae, L., Eysel, U., & Orban, G. A. (1990). Corticofugal feedback influences the responses of geniculate neurons to moving stimuli. Experimental Brain Research, 79, 441–446.
Hammond, P., & Pomfrett, C. J. (1990). Directionality of cat striate cortical neurones: Contribution of suppression. Experimental Brain Research, 81, 417–425.
Hartveit, E., & Heggelund, P. (1990). Neurotransmitter receptors mediating excitatory input to cells in the cat lateral geniculate nucleus. II. Nonlagged cells. Journal of Neurophysiology, 63, 1361–1372.
Hassenstein, B., & Reichardt, W. (1956). Systemtheoretische Analyse der Zeit-, Reihenfolgen- und Vorzeichenauswertung bei der Bewegungsperzeption des Rüsselkäfers Chlorophanus. Zeitschrift für Naturforschung, 11b, 513–524.
Heggelund, P., & Hartveit, E. (1990). Neurotransmitter receptors mediating excitatory input to cells in the cat lateral geniculate nucleus. I. Lagged cells. Journal of Neurophysiology, 63, 1347–1360.
Hille, B. (1992). Ionic channels of excitable membranes. Sunderland, MA: Sinauer.
Hillenbrand, U., & van Hemmen, J. L. (2000). Spatiotemporal adaptation through corticothalamic loops: A hypothesis. Visual Neuroscience, 17, 107–118.
Hirsch, J. A., Alonso, J.-M., Reid, R. C., & Martinez, L. M. (1998). Synaptic integration in striate cortical simple cells. Journal of Neuroscience, 18, 9517–9528.
Huguenard, J. R., & McCormick, D. A. (1992). Simulation of the currents involved in rhythmic oscillations in thalamic relay neurons. Journal of Neurophysiology, 68, 1373–1383.
Humphrey, A. L., & Weller, R. E. (1988a). Functionally distinct groups of X-cells in the lateral geniculate nucleus of the cat. Journal of Comparative Neurology, 268, 429–447.
Humphrey, A. L., & Weller, R. E. (1988b). Structural correlates of functionally distinct X-cells in the lateral geniculate nucleus of the cat. Journal of Comparative Neurology, 268, 448–468.
Jagadeesh, B., Wheat, H. S., & Ferster, D. (1993). Linearity of summation of synaptic potentials underlying direction selectivity in simple cells of the cat visual cortex. Science, 262, 1901–1904.
Jagadeesh, B., Wheat, H. S., Kontsevich, L. L., Tyler, C. W., & Ferster, D. (1997). Direction selectivity of synaptic potentials in simple cells of the cat visual cortex. Journal of Neurophysiology, 78, 2772–2789.
Jahnsen, H., & Llinás, R. (1984a). Electrophysiological properties of guinea-pig thalamic neurons: An in vitro study. Journal of Physiology (Lond), 349, 205–226.
Jahnsen, H., & Llinás, R. (1984b). Ionic basis for the electroresponsiveness and oscillatory properties of guinea-pig thalamic neurons in vitro. Journal of Physiology (Lond), 349, 227–247.
Katz, L. C. (1987). Local circuitry of identified projection neurons in cat visual cortex brain slices. Journal of Neuroscience, 7, 1223–1249.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Computation, 9, 1015–1045.
Koch, C. (1987). The action of the corticofugal pathway on sensory thalamic nuclei: A hypothesis. Neuroscience, 23, 399–406.
Kontsevich, L. L. (1995). The nature of the inputs to cortical motion detectors. Vision Research, 35, 2785–2793.
Kwon, Y. H., Esguerra, M., & Sur, M. (1991). NMDA and non-NMDA receptors mediate visual responses of neurons in the cat’s lateral geniculate nucleus. Journal of Neurophysiology, 66, 414–428.
Lo, F. S., & Sherman, S. M. (1994). Feedback inhibition in the cat’s lateral geniculate nucleus. Experimental Brain Research, 100, 365–368.
Lu, S.-M., Guido, W., & Sherman, S. M. (1992). Effects of membrane voltage on receptive field properties of lateral geniculate neurons in the cat: Contributions of the low-threshold Ca²⁺ conductance. Journal of Neurophysiology, 68, 2185–2198.
Maex, R., & Orban, G. A. (1996). Model circuit of spiking neurons generating directional selectivity in simple cells. Journal of Neurophysiology, 75, 1515–1545.
Mastronarde, D. N. (1987a). Two classes of single-input X-cells in cat lateral geniculate nucleus. I. Receptive-field properties and classification of cells. Journal of Neurophysiology, 57, 357–380.
Mastronarde, D. N. (1987b). Two classes of single-input X-cells in cat lateral geniculate nucleus. II. Retinal inputs and the generation of receptive-field properties. Journal of Neurophysiology, 57, 381–413.
McClurkin, J. W., Optican, L. M., & Richmond, B. J. (1994). Cortical feedback increases visual information transmitted by monkey parvocellular lateral geniculate nucleus neurons. Visual Neuroscience, 11, 601–617.
McCormick, D. A. (1992). Neurotransmitter actions in the thalamus and cerebral cortex and their role in neuromodulation of thalamocortical activity. Progress in Neurobiology, 39, 337–388.
McCormick, D. A., & Huguenard, J. R. (1992). A model of the electrophysiological properties of thalamocortical relay neurons. Journal of Neurophysiology, 68, 1384–1400.
McCormick, D. A., & von Krosigk, M. (1992). Corticothalamic activation modulates thalamic firing through glutamate “metabotropic” receptors. Proceedings of the National Academy of Sciences USA, 89, 2774–2778.
McLean, J., & Palmer, L. A. (1989). Contribution of linear spatiotemporal receptive field structure to velocity selectivity of simple cells in area 17 of the cat. Vision Research, 29, 675–679.
McLean, J., Raab, S., & Palmer, L. A. (1994). Contribution of linear mechanisms to the specification of local motion by simple cells in areas 17 and 18 of the cat. Visual Neuroscience, 11, 295–306.
Movshon, J. A. (1975). The velocity tuning of single units in cat striate cortex. Journal of Physiology (Lond), 249, 445–468.
354
Ulrich Hillenbrand and J. Leo van Hemmen
Mukherjee, P., & Kaplan, E. (1995).Dynamics of neurons in the cat lateral geniculate nucleus: In vivo electrophysiology and computational modeling. Journal of Neurophysiology, 74, 1222–1243. Murthy, A., & Humphrey, A. L. (1999). Inhibitory contributions to spatiotemporal receptive-eld structure and direction selectivity in simple cells of cat area 17. Journal of Neurophysiology, 81, 1212–1224. Murthy, A., Humphrey, A. L., Saul, A. B., & Feidler, J. C. (1998). Laminar differences in the spatiotemporal structure of simple cell receptive elds in cat area 17. Visual Neuroscience, 15, 239–256. Orban, G. A., Kennedy, H., & Maes, H. (1981a). Response to movement of neurons in areas 17 and 18 of the cat: Direction selectivity. Journal of Neurophysiology, 45, 1059–1073. Orban, G. A., Kennedy, H., & Maes, H. (1981b). Response to movement of neurons in areas 17 and 18 of the cat: Velocity sensitivity. Journal of Neurophysiology, 45, 1043–1058. Pape, H. C., Budde, T., Mager, R., & Kisvarday, Z. F. (1994). Prevention of Ca (2 C ) -mediated action potentials in GABAergic local circuit neurones of rat thalamus by a transient K+ current. Journal of Physiology (Lond), 478, 403– 422. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992).Numerical recipes in C: The art of scientic computing. Cambridge: Cambridge University Press. Reid, R. C., & Alonso, J. M. (1995). Specicity of monosynaptic connections from thalamus to visual cortex. Nature, 378, 281–284. Reid, R. C., Soodak, R. E., & Shapley, R. M. (1991). Directional selectivity and spatiotemporal structure of receptive elds of simple cells in cat striate cortex. Journal of Neurophysiology, 66, 505–529. Reinagel, P., Godwin, D., Sherman, S. M., & Koch, C. (1999). Encoding of visual information by LGN bursts. Journal of Neurophysiology, 81, 2558–2569. Saul, A. B., & Humphrey, A. L. (1990). Spatial and temporal response properties of lagged and nonlagged cells in the cat lateral genicualte nucleus. 
Journal of Neurophysiology, 64, 206–224. Saul, A. B., & Humphrey, A. L. (1992a). Evidence of input from lagged cells in the lateral geniculate nucleus to simple cells in cortical area 17 of the cat. Journal of Neurophysiology, 68, 1190–1207. Saul, A. B., & Humphrey, A. L. (1992b). Temporal-frequency tuning of direction selectivity in cat visual cortex. Visual Neuroscience, 8, 365–372. Sherman, S. M. (1996). Dual response modes in lateral geniculate neurons: Mechanisms and functions. Visual Neuroscience, 13, 205–213. Sherman, S. M., & Guillery, R. W. (1996). Functional organization of thalamocortical relays. Journal of Neurophysiology, 76, 1367–1395. Sherman, S. M., & Koch, C. (1986). The control of retinogeniculate transmission in the mammalian lateral geniculate nucleus. Experimental Brain Research, 63, 1–20. Sherman, S. M., & Koch, C. (1990). Thalamus. In G. M. Shepherd (Ed.), The synaptic organization of the brain (pp. 246–278). Oxford: Oxford University Press.
Corticothalamic Feedback Control
355
Sillito, A. M., Jones, H. E., Gerstein, G. L., & West, D. C. (1994). Feature-linked synchronization of thalamic relay cell ring induced by feedback from the visual cortex. Nature, 369, 479–482. Singer, W. (1994). A new job for the thalamus. Nature, 369, 444–445. Suarez, H., Koch, C., & Douglas, R. (1995). Modeling direction selectivity of simple cells in striate visual cortex within the framework of the canonical microcircuit. Journal of Neuroscience, 15, 6700–6719. Toth, L. J., Kim, D. S., Rao, S. C., & Sur, M. (1997). Integration of local inputs in visual cortex. Cerebral Cortex, 7, 703–710. Tuckwell, H. C. (1988a). Introduction to theoretical neurobiology, Volume 1: Linear cable theory and dendritic structure. Cambridge: Cambridge University Press. Tuckwell, H. C. (1988b).Introductionto theoreticalneurobiology,Volume 2:Nonlinear and stochastic theories. Cambridge: Cambridge University Press. Valois, R. L. D., & Cottaris, N. P. (1998). Inputs to directionally selective simple cells in macaque striate cortex. Proceedings of the National Academy of Sciences USA, 95, 14488–14493. von Krosigk, M., Monckton, J. E., Reiner, P. B., & McCormick, D. A. (1999). Dynamic properties of corticothalamic excitatory postsynaptic potentials and thalamic reticular inhibitory postsynaptic potentials in thalamocortical neurons of the guinea-pig dorsal lateral geniculate nucleus. Neuroscience, 91, 7–20. Watson, A. B., & Ahumada, A. J. (1985). Model of human visual-motion sensing. Journal of the Optical Society of America, A2, 322–342. Wimbauer, S., Wenisch, O. G., Miller, K. D., & van Hemmen, J. L. (1997).Development of spatiotemporal receptive elds of simple cells: I. Model formulation. Biological Cybernetics, 77, 453–461. Wimbauer, S., Wenisch, O. G., van Hemmen, J. L., & Miller, K. D. (1997). Development of spatiotemporal receptive elds of simple cells: II. Simulation and analysis. Biological Cybernetics, 77, 463–477. Worg ¨ otter, ¨ F., Suder, K., Zhao, Y., Kerscher, N., Eysel, U. 
T., & Funke, K. (1998). State-dependent receptive-eld restructuring in the visual cortex. Nature, 396, 165–168. Received June 7, 1999; accepted May 2, 2000.
LETTER
Communicated by Yali Amit
A Competitive-Layer Model for Feature Binding and Sensory Segmentation Heiko Wersing HONDA R&D Europe (Germany), Carl-Legien-Str. 30, 63073 Offenbach/Main, Germany Jochen J. Steil Helge Ritter University of Bielefeld, Faculty of Technology, D-33501 Bielefeld, Germany We present a recurrent neural network for feature binding and sensory segmentation: the competitive-layer model (CLM). The CLM uses topographically structured competitive and cooperative interactions in a layered network to partition a set of input features into salient groups. The dynamics is formulated within a standard additive recurrent network with linear threshold neurons. Contextual relations among features are coded by pairwise compatibilities, which define an energy function to be minimized by the neural dynamics. Due to the usage of dynamical winner-take-all circuits, the model gains more flexible response properties than spin models of segmentation by exploiting amplitude information in the grouping process. We prove analytic results on the convergence and stable attractors of the CLM, which generalize earlier results on winner-take-all networks, and incorporate deterministic annealing for robustness against local minima. The piecewise linear dynamics of the CLM allows a linear eigensubspace analysis, which we use to analyze the dynamics of binding in conjunction with annealing. For the example of contour detection, we show how the CLM can integrate figure-ground segmentation and grouping into a unified model.

1 Introduction
From the viewpoint of brain theory (von der Malsburg, 1981, 1995), feature binding may provide one of the basic sensory information processing principles. A key question then is, What are the neural correlates of binding processes? In addition to the importance of this question for our understanding of brain function, there is also great interest in using similar mechanisms for pattern recognition applications like image segmentation and object recognition. A large body of neural network research has focused on binding models based on temporally correlated neural activity (von der Malsburg, 1981), Neural Computation 13, 357–387 (2001)
© 2001 Massachusetts Institute of Technology
stimulated by neurophysiological findings (Singer & Gray, 1995; Eckhorn, 1994), which support a functional role of synchronized activity in perceptual processes. Correlation-based feature binding has been modeled by phase-coupled oscillators (Baldi & Meir, 1990; Sompolinsky, Golomb, & Kleinfeld, 1991), where, however, problems are slow convergence and the necessity of all-to-all connections for robust synchrony. Other nonlinear oscillator models have been developed (von der Malsburg & Buhmann, 1992; Schillen & König, 1994) but were tested only on small networks and highly simplified test images. Relaxation oscillators (Somers & Kopell, 1993; Terman & Wang, 1995) show long-range synchrony also with local connections and have been applied to region-based image segmentation (Wang & Terman, 1997) and auditory segregation (Brown & Wang, 1997). A problem is the limited number of groups that can be stably represented (about five), which can be overcome only by introducing algorithmic abstractions of the original model. A phase-averaging model with rule-based interactions has been applied to the extraction of contour saliency by Yen and Finkel (1998). Nevertheless, successful applications to the segmentation of real-world data are still rather exceptional (Wang & Terman, 1997; Yen & Finkel, 1998). This is mainly caused by the high dynamical complexity of these models, which makes their simulation costly and their analytic study a difficult task. A different approach to feature binding is provided by spin models, which have been developed for computer vision (Geman, Geman, Graffigne, & Dong, 1990; Hérault & Horaud, 1993; Opara & Wörgötter, 1998) and combinatorial optimization applications (Peterson & Söderberg, 1989; Blatt, Wiseman, & Domany, 1997). Each feature is represented by a spin variable that attains one of a discrete set of spin states, and a binding of two features corresponds to both sharing the same spin state.
With regard to applications, these models have the great advantage of being derived from energy or cost functions that characterize the stable output states as their minima. This energy-based approach establishes a link to pairwise clustering (Rose & Fox, 1993; Hofmann & Buhmann, 1997) and labeling problems in combinatorial optimization (Kamgar-Parsi & Kamgar-Parsi, 1990). Relaxation labeling (Rosenfeld, Hummel, & Zucker, 1976), a standard technique in the field of pattern recognition, also falls into this category of energy-based labeling iteration schemes (Hummel & Zucker, 1983; Pelillo, 1994). Although offering a conceptual approach to binding, these models share certain drawbacks regarding their biological plausibility, since they require either iterative discrete cluster update procedures (Opara & Wörgötter, 1998; Blatt et al., 1997) or complex normalizing nonlinearities (Peterson & Söderberg, 1989; Rosenfeld et al., 1976; Hummel & Zucker, 1983). Since grouping is an intensively studied subject in computer vision, there exist a variety of other algorithms, such as those based on Markov random fields (Geman & Geman, 1984), variational approaches (Mumford & Shah, 1989), and curve evolution (Kimia, Tannenbaum, & Zucker, 1995). (See also Wang & Terman, 1997, for other related references.) Recent work has stressed
the importance of grouping for dealing with the occlusion problem (August, Siddiqi, & Zucker, 1999; Elder & Zucker, 1996). Another recent approach to segmentation has used normalized cuts (Shi & Malik, 1997) to combine eigenanalysis with graph partitioning based on feature similarities. In this article we analyze the competitive-layer model (CLM; Ritter, 1990; Wersing, Steil, & Ritter, 1997), which realizes an energy-based approach to feature binding in a standard additive recurrent neural network with linear threshold neurons. Similar to spin models, a feature is assigned to one of a set of labels that are selected by a columnar local winner-take-all (WTA) circuit. These columns are coupled by lateral interactions that determine preferred bindings according to the mutual compatibility of the features. The neurons that represent the same label assignment are arranged in layers orthogonal to the columnar structure (see Figure 1). As was shown in Wersing and Ritter (1999), the attractors of the CLM provide feasible solutions to relaxation labeling problems. Ontrup and Ritter (1998) have applied the CLM to texture segmentation, based on features derived from Gabor filter banks, and have shown that the method performs well on a wide range of image data. The CLM energy function approach is similar to the quadratic energy function of Potts mean-field (PMF) approaches to combinatorial optimization. An important difference is that the CLM is formulated as a piecewise linear system and thus allows using the tools of eigensubspace analysis for an inspection of the binding process. Our results extend previous results on single-column WTA networks (Sum & Tam, 1996; Hahnloser, 1998) to the layered multicolumn case. By using dynamical WTA circuits as opposed to strict normalizations in PMF, the model gains more flexible and biologically plausible response properties. We show how this can be used advantageously in the grouping of contours and figure-ground segmentation.
In section 2 we introduce the CLM architecture and discuss the main properties of its binding dynamics. Section 3 is devoted to a general theoretical analysis, which covers convergence, attractors, and an eigensubspace analysis of the CLM in relation to its grouping properties. In section 4 an efficient simulation procedure is stated, which we used for the application to contour grouping presented in section 5. Section 6 discusses the results with respect to binding properties, noise tolerance, and biological relevance.

2 The CLM Feature Binding Model

2.1 The CLM Architecture. The CLM consists of a set of L layers of feature-selective neurons (see Figure 1). We denote the activity of a neuron at position r in layer α by x_{rα} and denote as a column r the set of neuron activities x_{rα}, α = 1, …, L, that share a common position r in each layer. We associate each column with a particular feature described by a parameter vector m_r. A typical feature example is local edge elements characterized
Figure 1: CLM architecture. For each input feature, there is in each layer a responding neuron. A vertical WTA circuit implements a topographic competition between layers. Lateral interactions characterize compatibility between features and guide the binding process.
by position and orientation, m_r = (x_r, y_r, θ_r). More complex features were used by Ontrup and Ritter (1998) for texture segmentation, which employed a vector of local Gabor filter responses at different spatial frequencies and orientations. A binding between two features, represented by columns r and r′, is expressed by simultaneous activities x_{rα̂} > 0 and x_{r′α̂} > 0 that share a common layer α̂. Therefore, binding is achieved by having each (activated) column r assign its feature to one (or several, but see below) of the layers α, interpreting the activity x_{rα} as a measure for the certainty of that assignment. All the neurons in a column r are equally driven by an external input h_r, which is to be interpreted as the significance of the detection of feature r by a preprocessing step. The afferent input h_r is fed to the activities x_{rα} with a connection weight J_r > 0. Within each layer α, the activities are coupled by the lateral interaction f^α_{rr′}, which corresponds to the degree of compatibility between features r and r′ and is a symmetric function of the feature parameters, f^α_{rr′} = f_α(m_r, m_{r′}) = f_α(m_{r′}, m_r). The interactions f^α_{rr′} determine which pattern configurations, if elicited as an activity pattern within a single layer α, will be mutually supporting among their constituent parts (f^α_{rr′} > 0) or instead suffer mutual inhibition (f^α_{rr′} < 0). Two examples of lateral interactions motivated by Gestalt laws of perceptual grouping are shown in Figure 2. The lateral interaction pattern may be identical in all the layers or different to allow greater
Figure 2: Examples of lateral interactions for perceptual grouping. An "on-center-off-surround" interaction, which is excitatory for short distances and weakly inhibitory for larger distances, is capable of clustering according to proximity of points. Different symbols denote activity in different layers. The Gestalt principle of continuity can be modeled by a local cocircular interaction of edge elements that lie along curves of constant curvature, combined with a weak long-range inhibition to separate different segments.
flexibility. A very useful case, which we discuss for the purpose of perceptual grouping, is the incorporation of a "ground" layer to perform simultaneous grouping and figure-ground segmentation (see section 5). The number of layers need not correspond to the number of groups, since for sufficiently many layers, only those are active that carry a salient segment. The purpose of the layered arrangement in the CLM is to enforce a dynamical assignment of the input features to the layers, using the contextual information stored in the lateral interactions. The assignment is realized by a columnar WTA circuit, which uses mutual symmetric inhibitory interactions with strength I^{αβ}_r = I^{βα}_r > 0 between neural activities x_{rα} and x_{rβ} that share a common column r. The afferent inputs and the lateral and vertical interactions can be combined into the standard additive activity dynamics,

ẋ_{rα} = −x_{rα} + σ( J_r h_r − Σ_β I^{αβ}_r x_{rβ} + Σ_{r′} f^α_{rr′} x_{r′α} + x_{rα} )
      = −x_{rα} + σ(E_{rα} + x_{rα}),   (2.1)

where σ(x) = max(0, x) is a nonsaturating linear threshold transfer function, and E_{rα} + x_{rα} is the total input to neuron x_{rα}. The additional self-excitatory term simplifies the following computations and can be compensated for by taking f′^α_{rr′} = f^α_{rr′} − δ_{rr′}.

2.2 Energy Formulation and Binding Dynamics. The dynamics 2.1 has
an energy function of the form

E = − Σ_{rα} J_r h_r x_{rα} + (1/2) Σ_r Σ_{αβ} I^{αβ}_r x_{rα} x_{rβ} − (1/2) Σ_α Σ_{rr′} f^α_{rr′} x_{rα} x_{r′α}.   (2.2)
The energy satisfies ∂E/∂x_{rα} = −E_{rα}, which makes E nonincreasing under the dynamics 2.1, dE/dt = −Σ_{rα} E_{rα}(−x_{rα} + σ(E_{rα} + x_{rα})) ≤ 0, since equation 2.1 confines the activities to be nonnegative (see appendix A). Thus, the attractors of the dynamics 2.1 are the local minima of equation 2.2 under the constraints x_{rα} ≥ 0. While the vertical interactions I^{αβ}_r establish the columnar WTA process, the lateral interactions f^α_{rr′} contribute to the quadratic energy as a sum over all pairwise compatibilities within groups. If we interpret the negative value of the energy as the overall quality of the grouping, the aim of the dynamics 2.1 is to reach a globally minimal or almost minimal energy state. The quadratic summation approach is similar to the corresponding Potts spin mean-field free energy (Peterson & Söderberg, 1989),

E_PMF = −(1/2) Σ_α Σ_{rr′} f^α_{rr′} x_{rα} x_{r′α} + T Σ_{rα} x_{rα} log x_{rα},   (2.3)

which is to be minimized subject to the weighting constraint Σ_α x_{rα} = 1. In contrast, in a Potts spin model there is no explicit representation of an afferent input, and the columnar activity must always sum to one due to its probabilistic interpretation. The additional convex entropy term, weighted by the temperature parameter T, biases the local minima of equation 2.3 toward soft columnar assignments, since it attains its minimum at x_{rα} = 1/L. Minimal solutions to equation 2.3 can then be obtained by the recurrent dynamics

ẋ_{rα} = −x_{rα} + e^{F_{rα}/T} / Σ_β e^{F_{rβ}/T},   F_{rα} = Σ_{r′} f^α_{rr′} x_{r′α},   (2.4)

which implements a columnar WTA circuit that is gradually sharpened by decreasing (or "annealing") T → 0. The CLM dynamics, equation 2.1, provides a similar soft competition scheme between all possible groupings in the different layers of the model.
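The additive dynamics 2.1 and the energy 2.2 can be illustrated with a small numerical sketch. Everything below is invented for illustration (the toy interaction matrix, the parameter values, the Euler step size); it is not the simulation procedure of section 4.

```python
import numpy as np

def clm_step(x, h, f, J, dt=0.05):
    """One explicit Euler step of the CLM dynamics (equation 2.1),
    in the uniform case J_r = I_r^{ab} = J for all r, a, b."""
    # E_{ra} = J h_r - J sum_b x_{rb} + sum_{r'} f_{rr'} x_{r'a}
    E = J * h[:, None] - J * x.sum(axis=1, keepdims=True) + f @ x
    return x + dt * (-x + np.maximum(0.0, E + x))

def clm_energy(x, h, f, J):
    """CLM energy (equation 2.2) for the same uniform couplings."""
    col = x.sum(axis=1)                    # summed activity per column
    return (-J * np.dot(h, col)
            + 0.5 * J * np.dot(col, col)
            - 0.5 * np.einsum('rs,ra,sa->', f, x, x))

# Toy problem: six features in two mutually compatible triples, two layers,
# uniform input (hypothetical values).
N, L, J = 6, 2, 2.0
f = np.full((N, N), -0.2)                  # weak inhibition across triples
f[:3, :3] = f[3:, 3:] = 0.25               # excitation within each triple
h = np.ones(N)

rng = np.random.default_rng(0)
x = 0.01 * rng.random((N, L))              # small positive initial state

energies = []
for _ in range(4000):
    energies.append(clm_energy(x, h, f, J))
    x = clm_step(x, h, f, J)

labels = x.argmax(axis=1)                  # layer assignment per feature
```

With this interaction, the energy decreases along the trajectory, each column ends with a single active layer, and the features of one triple share a layer, in line with the WTA picture above.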
The hard weighting constraint and divisive nonlinearity (see equation 2.4), however, are replaced by a more flexible WTA circuit, which is coupled dynamically to the afferent input h_r and permits a contextual modulation of the input according to the salience of the grouping, but also requires an additional stability analysis. Perceptual context effects of this kind have been observed in a wide range of physiological (Gilbert, 1992; Kapadia, Ito, & Westheimer, 1995) and psychophysical (Kovács & Julesz, 1993; Field, Hayes, & Hess, 1992; Polat & Sagi, 1994) studies. As our later eigensubspace analysis of the CLM dynamics reveals, the columnar WTA dynamics is driven by eigenmodes that depend on the lateral interaction pattern and whose eigenvalues characterize the rate of group formation. The analysis motivates a mechanism for slowing the lateral mode dynamics for increased grouping quality by adding a self-inhibitory coupling at each neuron of the form f′^α_{rr′} = f^α_{rr′} − T δ_{rr′}, where T ≥ 0 is the strength of this self-inhibition. This leads to a new CLM energy,

E′ = E + T Σ_{rα} x²_{rα},   (2.5)

which adds an analog convex term that biases the local minima toward graded assignments and thus makes the WTA softer. Similar to annealing in the Potts model (Peterson & Söderberg, 1989), we can then use a gradual lowering of the inhibitory self-coupling T to sweep from graded assignments to the final unique assignment as the final grouping result. A time course of the dynamics with and without self-inhibitory annealing performing contour grouping is shown in Figure 3. Annealing results in a much more regular group formation process, in which groups form hierarchically, from more to less salient ones. The slowing of the dynamics, however, results in a trade-off for convergence time. Whether annealing is necessary depends on the complexity of the input pattern with respect to the lateral interactions. Two problems that have been emphasized by Wang and Terman (1997) are long-range coherence with local interactions and the proper separation of different segments. Self-inhibitory annealing increases long-range coherence and also reduces the necessary strength of global inhibition for achieving separation by a reinforcement of the corresponding dynamical modes, which we discuss in detail in the following section.

3 Theoretical Analysis
The dynamics 2.1 is a nonlinear dynamical system consisting of a series of continuously connected linear systems. An alternative dynamical implementation is of the form

ẋ_{rα} = v_{rα} E_{rα},   (3.1)
Figure 3: Activity dynamics and group formation in the CLM: (a) without annealing and (b) with self-inhibitory annealing. The neural activity is displayed by different symbols for different layers, with symbol size representing activity value. The lateral interaction is locally cocircular with a weak global inhibition, as in Figure 2. (a) Typical time course of the dynamics. After initialization with random activity values, all activities within a column are equally active in all layers. After that, dynamical modes, which break the symmetry between layers, cause the formation of groups. The "soft competition" between the group assignments lasts until the columnar WTA circuit has caused an assignment to one of the layers. The modes are given by the eigenmodes of the lateral interaction pattern. The main problems are long-range coherence with local interactions and a proper separation of different groups. Fragmented groupings can occur, especially for very short-range interactions, due to the fast expression of modes that lead to fragmented groups. (b) The mechanism of self-inhibitory annealing can be used to suppress these fragmenting modes. During the dynamics, the strength of self-inhibition is lowered, which makes the columnar WTA circuit less strict and increases the "soft competition" between grouping assignments of a single feature. Fluctuations due to fragmented modes are smoothed out. The groups then differentiate in a sequence in which the most salient structures appear first.
which avoids the occurrence of the σ(·)-nonlinearities and instead enforces the constraint x_{rα} ≥ 0 by a set of "switching variables,"

v_{rα} = 0 for x_{rα} = 0 and E_{rα} < 0;   v_{rα} = 1 else.   (3.2)

This reformulation corresponds to constrained gradient descent in the energy (see equation 2.2), which naturally makes E nonincreasing, dE/dt = −Σ_{rα} v_{rα} E²_{rα} ≤ 0. Since both dynamics share the same energy function, they converge to the same attractors, which are local minima of equation 2.2, and may be considered dynamically equivalent. Since the formulation 3.1 has the advantage of simplifying the linear analysis by shifting the nonlinearity to the boundary where x_{rα} = 0, in the following we mainly consider the form 3.1. Nevertheless, we prove our results in appendix A in a form that also applies to the biologically more plausible and thus conventional form, equation 2.1. Our theoretical analysis is first based on a discussion of the model attractors and the conditions for convergence. We then discuss, with an eigensubspace analysis, which of the attractors are preferred by the recurrent dynamics and use a synthetic example to show how this relates to the properties of the binding process implemented by the CLM.
3.1 Convergence and Assignment. Networks composed of nonsaturating linear threshold neurons, as in equations 2.1 or 3.1, may be unbounded if the excitatory interactions are not balanced by sufficient inhibition. Hahnloser (1998) has discussed the stability of a single WTA circuit and given a criterion based on global inhibition. The following theorem, which we prove in the appendix, states that in fact local inhibition is sufficient to ensure boundedness for the CLM system of layerwise coupled WTA columns.

Theorem 1. If κ_{rα} > 0 with κ_{rα} = I^{αα}_r − f^α_{rr} − Σ_{r′≠r} max(0, f^α_{rr′}), then the CLM dynamics is bounded: if 0 ≤ x_{rα}(0) ≤ M for all r, α, where M = max_{r,α}(J_r h_r / κ_{rα}), then 0 ≤ x_{rα}(t) ≤ M for all r, α and t > 0.
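Theorem 1 is easy to probe numerically. The sketch below uses an arbitrary symmetric lateral interaction (all values are hypothetical), computes the stability factors κ and the box bound M, and checks that a trajectory started inside the box [0, M] never leaves it.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, J = 5, 3, 3.0              # illustrative sizes, with J_r = I_r^{ab} = J

# Hypothetical lateral interaction: symmetric, mixed sign, positive diagonal.
f = rng.uniform(-0.3, 0.3, (N, N))
f = 0.5 * (f + f.T)
np.fill_diagonal(f, 0.2)

# kappa_{ra} = I^{aa}_r - f^a_{rr} - sum_{r' != r} max(0, f^a_{rr'})
# (identical for all layers here, so one value per feature r)
excit = np.where(~np.eye(N, dtype=bool), np.maximum(0.0, f), 0.0).sum(axis=1)
kappa = J - np.diag(f) - excit
assert (kappa > 0).all()         # theorem 1 applies

h = rng.uniform(0.5, 1.5, N)
M = (J * h / kappa).max()        # invariant box bound [0, M]

x = rng.uniform(0.0, M, (N, L))  # start anywhere inside the box
dt = 0.02
for _ in range(5000):
    E = J * h[:, None] - J * x.sum(axis=1, keepdims=True) + f @ x
    x = x + dt * (-x + np.maximum(0.0, E + x))

# the trajectory never leaves the box [0, M]
assert x.min() >= 0.0 and x.max() <= M + 1e-9
```

For a small enough Euler step (dt < 1/(J − f_rr) here), the discrete iteration inherits the invariance of the box, so the final asserts hold exactly up to rounding.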
The factors κ_{rα} control the stability margin and the maximal amplification of the inputs h_r. They are positive if the self-interaction strength I^{αα}_r − f^α_{rr} of a neuron is larger than the sum of the excitatory connections Σ_{r′≠r} max(0, f^α_{rr′}) converging onto it. If, in the simplest case, J_r = I^{αβ}_r = J for all r, α, β, the afferent input and vertical interaction terms can be combined in the energy into the term J Σ_r (h_r − Σ_β x_{rβ})². We can interpret this as a penalty term, with J as a constraint multiplier enforcing the condition that the summed activity in a column is equal to the input h_r. Unlike in the usual penalty-function approach to combinatorial optimization problems, in the framework of sensory segmentation considered here a strict enforcement of this constraint is undesirable. If we choose J close to the stability margin of theorem 1, the penalty term derived from the vertical interactions no longer dominates over the lateral interactions, and the columnwise activity is strongly influenced by lateral effects; we return to this point after stating the next theorem. Essential to the columnwise WTA approach is that we obtain a proper assignment of the features, that is, a state where each column r contains at
most a single nonvanishing activity x_{rα̂} > 0. The conditions on the vertical interactions for this assignment property to hold are stated in the following theorem:

Theorem 2. Let F_{rα} = Σ_{r′} f^α_{rr′} x_{r′α} denote the lateral feedback of neuron r, α. If the lateral interaction is self-excitatory, f^α_{rr} > 0 for all r, α, and the vertical interactions satisfy I^{αα}_r I^{ββ}_r ≤ (I^{αβ}_r)² for all α, β, then an attractor of the CLM has in each column r either

i. at most one positive activity x_{rα̂} with

x_{rα̂} = J_r h_r / I^{α̂α̂}_r + F_{rα̂} / I^{α̂α̂}_r,   x_{rβ} = 0 for all β ≠ α̂,   (3.3)

where α̂ = α̂(r) is the index of the maximally supporting layer, characterized by F_{rα̂} > F_{rβ} for all β ≠ α̂, or

ii. all activities x_{rα}, α = 1, …, L, in column r vanish, and F_{rα} ≤ −J_r h_r for all α = 1, …, L.

This theorem states that as long as there are self-excitatory interactions within each layer and the vertical cross-inhibition is sufficiently strong, the CLM converges to a unique assignment of features to the layer with maximal lateral feedback F_{rα}. A central result is that due to the layered topology of the network, this does not require arbitrarily large vertical couplings. Also note that the lack of an upper saturation is essential for the WTA behavior because it allows the exclusion of spurious ambiguous states. Nonzero activities are stable only due to a dynamical equilibrium and not due to saturation. Douglas, Koch, Mahowald, Martin, and Suarez (1995) and Hahnloser (1998) have argued for the plausibility of this form of dynamical stability, since cortical neurons rarely operate close to saturation. The lateral and vertical feedback modulates the input intensity by an amount dependent on the ratios of I^{αα}_r, F_{rα}, and J_r. Consider again the simple example J_r = I^{αβ}_r = J for all r, α, β, where, according to theorem 2, the output of the only active neuron in a column is given by x_{rα̂} = h_r + F_{rα̂}/J. This shows that by lowering J, the lateral effects can be increased, and we obtain a context-dependent activity distribution that still remains sensitive to the input intensities. Note that the additional self-inhibitory annealing, obtained by choosing f′^α_{rr′} = f^α_{rr′} − T δ_{rr′} for T sufficiently large, violates the condition of theorem 2 and leads to stable states with multiple nonzero activities within a column, as the following eigensubspace analysis shows.

3.2 Eigensubspace Analysis. Now that we have discussed the possible attractors of the model, we turn to the question of which of them will be preferred by the dynamics. Since the CLM time development is defined by the piecewise linear differential equation 3.1, we can apply the method
of eigensubspace analysis to obtain a solution in the linear domain, where the constraints are inactive. We consider the important special case of a lateral interaction f^α_{rr′} = f_{rr′} that is identical for all layers. For simplicity, we take J_r = I^{αβ}_r = J for all r, α, β as constant. As we will show in this section, the special form of two topologically "orthogonal" interactions allows us to characterize the global eigenmodes completely in terms of the lateral eigenmodes. Apart from the constraints, the system is described by the linear dynamics

ẋ_{rα} = J h_r + Σ_{r′β} G^{αβ}_{rr′} x_{r′β},   (3.4)

where G^{αβ}_{rr′} = −J δ_{rr′} + δ_{αβ} f_{rr′}. Introducing the N·L-dimensional vectors
x = (x^1, …, x^L) with x^α = (x_{1α}, …, x_{Nα}) and h = (h^0, …, h^0) with h^0 = (h_1, …, h_N),   (3.5)–(3.6)

we can write this as

ẋ = J h + G x,   (3.7)
where G is a linear transformation in the space R^{N×N} ⊗ R^{L×L} and can be decomposed as

G = f ⊗ Id_{L×L} − J Id_{N×N} ⊗ I_{L×L},   (3.8)

with Id the identity and I_{L×L} an L × L matrix of 1s. Equation 3.8 shows that an orthonormal eigenvector basis {v^{ic}, Λ_{ic}} for G can be obtained from orthonormal eigenvector bases {b^i, λ_i}_{i=1,…,N} and {q^c, μ_c}_{c=1,…,L} for f and I_{L×L}, respectively (we assume that the λ_i are sorted in descending order, λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_N):

v^{ic} = b^i ⊗ q^c,   i = 1, …, N, c = 1, …, L,   (3.9)
Λ_{ic} = λ_i − J μ_c.   (3.10)

q^c, μ_c can be calculated analytically. Here we use only that the first eigenvector obviously is given by

q^1 = 1/√L (1, …, 1),   μ_1 = L,   (3.11)

while the remaining orthogonal eigenvectors have zero eigenvalues, μ_2 = μ_3 = ⋯ = μ_L = 0. This has a direct geometrical interpretation. We can
divide the eigenmodes of the linear system into two classes:

DC-subspace. This is the subspace spanned by the eigenmodes

v^{i1} = 1/√L (b^i, …, b^i),   Λ_{i1} = λ_i − JL,   (3.12)

which contains the coherent eigenmodes that have equal components in all layers. For sufficiently large J, all the corresponding eigenvalues Λ_{i1} are negative.

AC-subspace. This is the orthogonal subspace spanned by the remaining eigenvectors, c ≠ 1,

v^{ic} = (q^c_1 b^i, …, q^c_L b^i),   Λ_{ic} = λ_i,   (3.13)
which contain the eigenmodes causing differences in the activity patterns between layers and have a zero mean summed over columns. Their eigenvalues L ic D6 1 are given by the eigenvalues li of the lateral interaction frr0 alone and are L ¡ 1 degenerate due to the symmetry between layers. The linear dynamics can be completely characterized by the eigensubspaces and the location of the xed point, which is determined by the inhomogeneous input h . Relative to the xed point, the components cic (t) D (x (t) ¡ x F ) ¢ v ic develop according to cic (t) D cic (0)eL ic t . This corresponds to exponential growth for positive eigenvalues, while negative eigenvalues lead to attracting afne constraint surfaces that pass through the xed point. By expanding the xed-point equation xP D 0 on the basis vic , we obtain the following expansion of the xed point x F in powers of li / JL (note that h , and thus also xF , has no component in the AC space): x Fa D
1 1 h0 C L L
i
li i (b ¢ h 0 ) b i C O JL
li JL
2
.
(3.14)
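The tensor construction in equations 3.8 through 3.10 can be checked numerically with an arbitrary random symmetric f (a sketch; the sizes, seed, and J below are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, J = 5, 3, 2.0
f = rng.normal(size=(N, N)); f = (f + f.T) / 2   # arbitrary symmetric lateral interaction

# G^ab_rr' = -J d_rr' + d_ab f_rr', with layer-major ordering of the index pair (a, r)
G = np.kron(np.eye(L), f) - J * np.kron(np.ones((L, L)), np.eye(N))

lam, b = np.linalg.eigh(f)                 # lateral eigenpairs {b^i, lambda_i}
mu, q = np.linalg.eigh(np.ones((L, L)))    # {q^c, mu_c}: mu_c = L once, 0 otherwise

# every v^ic = q^c (x) b^i is an eigenvector of G with Lambda_ic = lambda_i - J mu_c
for c in range(L):
    for i in range(N):
        v = np.kron(q[:, c], b[:, i])
        assert np.allclose(G @ v, (lam[i] - J * mu[c]) * v)
```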
Therefore, if JL ≫ λ_i, we can approximate the fixed point by x^F_ra = h_r/L. We can now draw a sketch (see Figure 4) of the initial CLM dynamics referring to our eigensubspace analysis. Suppose h_r > 0 for all r and we initialize the system with small, positive random values 0 < x_ra(0) ≪ h_r. For sufficiently large JL ≫ λ_i, all eigenmodes in the DC-subspace have large negative eigenvalues. Therefore, the initial state is rapidly projected onto the affine subspace Σ_a x_ra = Σ_a x^F_ra orthogonal to the DC-subspace, which passes through the fixed point. The rapid projection dynamics in the DC space moves the activities away from the zero boundary and thus keeps the initial dynamics linear. Now the modes in the AC-subspace with positive eigenvalues λ_i develop differences between layers on a slower timescale, since their eigenvalues
Feature Binding and Sensory Segmentation
Figure 4: Sketch of the linear dynamics for two layers. Shown are the trajectories of the two activities x_r1, x_r2 of a single column r. Starting from small initial values (the gray square), the activities quickly approach the fixed point x^F and the constraint surface Σ_a=1,2 x_ra = h_r in the DC subspace. Then the dynamics in the orthogonal AC subspace drives the WTA process until only one layer is active.
have smaller absolute value. The principal AC-mode with the largest positive eigenvalue dominates the AC-subspace dynamics in this initial phase and determines the timescale of the WTA dynamics, which drives some activities in a column to zero while others are increased. After some of the activities reach zero level, they no longer contribute to the linear dynamics, and the nonlinearity takes effect by introducing a new segment of the piecewise linear dynamics with different resulting eigenmodes.² Since the symmetry between layers is then broken, the coupled WTA dynamics becomes more complex in the general case, subsequently driving the unassigned columns toward an assigned state, as can be seen by comparison with Figure 3. For regular and highly symmetric patterns, however, which we discuss in the following section, the final grouping result can already be obtained from the dominant mode in the initial linear phase of the dynamics.

² The resulting interaction matrix is of the form v_ra v_r'b G^ab_rr'.
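The two-timescale picture of the initial phase (fast projection of the DC component onto the constraint surface, slower growth of the AC differences) can be reproduced with a small Euler integration of the unconstrained linear dynamics of equation 3.4; the interaction f, the factor of 50 in J, and all sizes are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 6, 2
f = rng.normal(size=(N, N)); f = (f + f.T) / 2   # arbitrary lateral interaction
J = 50 * np.abs(np.linalg.eigvalsh(f)).max()     # so that JL >> |lambda_i|
h = np.ones(N)

x = rng.uniform(0, 0.01, size=(N, L))            # small positive initial values
dt = 1e-3
for _ in range(2000):
    # linear CLM dynamics: x'_ra = J h_r - J sum_b x_rb + sum_r' f_rr' x_r'a
    x = x + dt * (J * h[:, None] - J * x.sum(axis=1, keepdims=True) + f @ x)

# the column sums have collapsed onto the constraint surface sum_a x_ra ~ h_r
assert np.allclose(x.sum(axis=1), h, atol=0.05)
```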
The suggested mechanism of self-inhibitory annealing, choosing f'^a_rr' = f^a_rr' − T δ_rr', shifts all eigenvalues to λ'_i = λ_i − T. If T is greater than the critical value T_c = λ_1, then all eigenvalues are negative, and the unassigned fixed point x^F is asymptotically stable, which is analogous to Potts mean-field models (Peterson & Soderberg, 1989). If T is then lowered from the critical value, at first only the b^1 AC modes acquire a positive eigenvalue and grow exponentially away from the fixed point. The following section discusses how this can be employed to increase the grouping quality.
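The spectral shift behind this annealing mechanism is elementary to confirm numerically (the random symmetric f is our own stand-in):

```python
import numpy as np

rng = np.random.default_rng(4)
f = rng.normal(size=(6, 6)); f = (f + f.T) / 2   # arbitrary symmetric lateral interaction
T_c = np.linalg.eigvalsh(f).max()                # critical value T_c = lambda_1

# self-inhibition f' = f - T*Id shifts every eigenvalue by -T
lam_shifted = np.linalg.eigvalsh(f - T_c * np.eye(6))
assert lam_shifted.max() <= 1e-8                 # at T = T_c all eigenvalues are non-positive
```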
3.3 Coherence, Separation, and Self-Inhibitory Annealing. Let us consider two simple stimulus examples for the contour grouping lateral interaction to discuss the binding problems of long-range coherence and separation in terms of the previous eigensubspace analysis. The first stimulus consists of a circle, composed of N regularly placed edge elements (see Figure 5). We assume that we can neglect the effect of all other columns r', which are at subthreshold activity (i.e., h_r' < 0), and therefore consider the lateral eigenvectors of only the resulting N×N lateral interaction matrix f_rr' of the active features. The contour grouping lateral interaction has two contributions: the local cocircular interaction f^cocirc_rr' (see appendix B) and a global inhibition with strength k, giving together f^a_rr' = f^cocirc_rr' − k. Since the overall interaction pattern is rotationally invariant on the circle, the eigenvectors of the interaction are the sine and cosine waves on the circle with the corresponding spatial frequencies.

We first assume k = 0. Each edge shares positive interactions within a local neighborhood range, and the zeroth-order (constant) cosine with b^1_r = 1 is the eigenvector with largest eigenvalue λ_1 = Σ_r' f^cocirc_rr', equal to the sum over the local cocircularity values. The corresponding global AC modes coherently increase and decrease the activity on the circle in the different layers. The next two modes, with degenerate eigenvalues λ_2,3 < λ_1, are the first-order sine and cosine waves, which split the circle into two halves. The gap between λ_1 and λ_2,3 determines whether the coherent mode dominates over the fragmenting modes. It is a general property of local interactions that this gap tends to zero if the range of the interactions tends to zero.³ Since the difference between the modes grows exponentially, the time the system stays in the initial linear mode determines the degree of coherence.
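This eigenstructure can be illustrated numerically, using a hard cut-off interaction on the circle as a stand-in for f^cocirc (all sizes and the inhibition strength below are our choices); the second half of the sketch previews the two-circle separation argument made below:

```python
import numpy as np

N, Delta, k = 64, 0.5, 0.01
phi = 2 * np.pi * np.arange(N) / N
d = np.abs(phi[:, None] - phi[None, :]); d = np.minimum(d, 2 * np.pi - d)
f = ((d < Delta) & (d > 0)).astype(float)    # hard cut-off stand-in, no self-interaction

lam, b = np.linalg.eigh(f)
lam, b = lam[::-1], b[:, ::-1]               # sort descending
assert np.allclose(b[:, 0], b[0, 0])         # coherent zeroth-order mode is constant
assert np.isclose(lam[1], lam[2])            # degenerate first-order sine/cosine pair
assert lam[0] > lam[1]                       # the gap; it closes as Delta -> 0

# two well-separated circles plus global inhibition -k: only the spurious mode
# that groups both circles into one layer is shifted (by -2kN); the zero-mean
# separating mode keeps its eigenvalue
Z = np.zeros((N, N))
F2 = np.block([[f, Z], [Z, f]]) - k
u = np.ones(N) / np.sqrt(N)
b_sep = np.concatenate([u, -u]) / np.sqrt(2)
b_grp = np.concatenate([u, u]) / np.sqrt(2)
lam1 = f[0].sum()                            # coherent eigenvalue = row sum
assert np.allclose(F2 @ b_sep, lam1 * b_sep)
assert np.allclose(F2 @ b_grp, (lam1 - 2 * k * N) * b_grp)
```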
By shifting the spectrum of the lateral interaction with the self-inhibitory addition f'_rr' = f_rr' − T δ_rr', this time is increased. This can be used to suppress the spurious b^2,3 modes dynamically, although this also reduces the overall

³ Suppose the spatial exponential decay of the cocircular interaction is replaced by a hard cut-off function, such that the interaction of two perfectly cocircular features on the circle at angular positions φ_r, φ_r' ∈ [0, 2π] is f_rr' = 1 if |φ_r − φ_r'| < Δ and f_rr' = 0 otherwise. The difference between the coherent mode b^1_r = 1 and the first fragmenting modes b^2_r = sin(φ_r), b^3_r = cos(φ_r) is, in the limit of sufficiently small feature spacing, given by 2(Δ − sin(Δ)) → 0 for Δ → 0.
Figure 5: Lateral eigenvectors of the contour grouping interaction. (Top) The lateral eigenvectors of the lateral interactions on the active features of a circle input pattern. Vector components are shown as black (positive) and white (negative) edges with thickness proportional to magnitude. To achieve a coherent binding on the circle, the eigenvalue λ_1 of the coherent mode b^1 must be sufficiently large compared to the eigenvalues λ_2,3 of the circle-splitting modes b^2,3. (Middle) For two well-separated circles, the eigenvectors are symmetric and antisymmetric combinations of the single-circle eigenvectors. Without global inhibition, the modes b^1 and b^2 are degenerate. A weak global inhibition suffices to lower the eigenvalue λ_2 and thus suppress the unwanted b^2 mode. (Bottom) For more complex inputs, the succession of eigenvectors constitutes a multiscale hierarchy. The principal eigenmode b^1 separates the circle from the crossing line and the short lines. b^4 and b^5 drive the segregation of the short lines. The simple self-inhibitory annealing scheme proposed in section 3.3 can be interpreted as hierarchically suppressing subsequent eigenmodes, resulting in a hierarchical group formation process.
convergence rate. Now suppose we add a second equal and well-separated circle, such that there are no local cocircular interactions between the circles (see Figure 5). Now the task is to achieve not only coherence on the circles but also proper separation. Since the lateral interaction pattern is exchange symmetric between the circles, the eigenvectors of the composite system are symmetric
and antisymmetric combinations of the single-circle eigenvectors with the same eigenvalues. The modes b^1 and b^2, responsible for separating and grouping together the circles, respectively, are degenerate with λ_1 = λ_2 without any global inhibition. If we now add the global lateral inhibition by choosing k > 0, only the b^2 mode undergoes a change in eigenvalue, since all other modes have a zero component within this globally constant inhibition. The effect of k is then to shift the eigenvalue λ_2 of the spurious mode to the value λ_2 − 2kN. Therefore, if k is sufficiently large, the mode that coherently separates the two circles is dominant. There is, however, an obvious limitation to increasing k, since a strong global inhibition destabilizes the coherence within a circle. Similarly, for the single circle, the inhibition lowers the eigenvalue of the single coherent mode to λ'_1 = λ_1 − kN. Therefore, to achieve both long-range coherence and separation, only small values of k are possible. For our applications, we choose k = ζ/N, where 0 < ζ ≤ 1.

The picture gets considerably more complex if there are more complex interacting groups in the feature input, as the lower example in Figure 5 shows. The sign distribution of the lateral eigenvectors constitutes a multiscale hierarchy of the structures in the input. By gradually lowering the self-inhibition T, which can be understood as a type of annealing process with regard to the energy function of the dynamics, these modes are expressed in a hierarchical sequence, as can be seen by looking back at Figure 3.

4 Simulation
The CLM dynamics can in principle be simulated in parallel by any differential equation integrator, such as the Euler or Runge-Kutta method. Also, due to the piecewise linearity, an analog VLSI implementation may be even simpler than classic circuits incorporating sigmoid transfer functions. If simulated on a serial processor, there is an alternative approach that replaces the explicit trajectory integration by a rapid search for fixed-point attractors. This can be done by iteratively solving the fixed-point equations, which is largely facilitated by the piecewise linearity of the model (Ontrup & Ritter, 1998). This iterative solution procedure, also known as a Gauss-Seidel approach, has also been extensively used for Markov random field approaches to image segmentation (Besag, Green, Higdon, & Mengersen, 1995). The algorithm, in conjunction with an exponential annealing schedule for the temperature parameter T, can be implemented in the following way:

1. Initialize all x_ra with small random values around the fixed point, x_ra(t = 0) ∈ [h_r/L − ε, h_r/L + ε]. Initialize T with T = T_c.

2. Do N·L times: Choose (r, a) randomly and update x_ra = max(0, ξ), where

ξ = (J_r h_r − Σ_b≠a I^ab_r x_rb + Σ_r'≠r f^a_rr' x_r'a) / (I^aa_r − f^a_rr + T).

3. Decrease T by T := γT, with 0 < γ < 1. Go to step 2 until convergence.
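Steps 1 through 3 can be sketched compactly in NumPy; we assume, as in section 3, uniform vertical interactions I^ab_r = J and a layer-independent lateral interaction, and the toy two-group interaction matrix below is our own illustration, not from the paper:

```python
import numpy as np

def clm_gauss_seidel(h, f, J, L, gamma=0.95, eps=0.01, n_sweeps=200, seed=0):
    """Asynchronous fixed-point iteration for the CLM with exponential
    annealing of the self-inhibition T (a sketch under the assumptions above)."""
    rng = np.random.default_rng(seed)
    N = len(h)
    # step 1: initialize near h_r / L and set T to the critical temperature
    x = h[:, None] / L + rng.uniform(-eps, eps, size=(N, L))
    T = np.linalg.eigvalsh(f).max()                    # T_c = lambda_max{f}
    for _ in range(n_sweeps):
        for _ in range(N * L):                         # step 2: N*L random updates
            r, a = rng.integers(N), rng.integers(L)
            xi = (J * h[r]
                  - J * (x[r].sum() - x[r, a])             # vertical term, b != a
                  + f[r] @ x[:, a] - f[r, r] * x[r, a])    # lateral term, r' != r
            x[r, a] = max(0.0, xi / (J - f[r, r] + T))
        T *= gamma                                     # step 3: anneal T
    return x

# toy example: two groups of features with mutual inhibition between groups
f = np.array([[0.5, 1.0, -1.0, -1.0],
              [1.0, 0.5, -1.0, -1.0],
              [-1.0, -1.0, 0.5, 1.0],
              [-1.0, -1.0, 1.0, 0.5]])
h = np.ones(4)
x = clm_gauss_seidel(h, f, J=3.3, L=2)   # J above the theorem 1 stability margin
w = x.argmax(axis=1)                     # winning layer per feature
assert x.min() >= 0.0
assert w[0] == w[1] and w[2] == w[3] and w[0] != w[2]   # groups end in separate layers
```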
The single-activity update in step 2 corresponds to solving the fixed-point equation for this activity with all other activities held constant. For the figure-ground setup as described in section 2, the critical temperature is given by T_c = λ_max{f_rr'}. This asynchronous dynamics converges (Ontrup & Ritter, 1998) due to a convergence result on asynchronous iteration in neural networks by Feng (1997). We observed a speed gain of roughly a factor of 10 compared to Euler integration at comparable solution quality. For simulation without self-inhibitory annealing, simply set T = 0.

5 Application to Contour Grouping

5.1 Related Work. It has long been known that in the early stages of visual information processing in visual cortex area V1, many neurons can be found (Hubel & Wiesel, 1962) that respond to local oriented edge elements within their classical receptive field. One of the major issues in vision research is the mechanism that the visual system uses to integrate these local elements into global salient contours to facilitate robust boundary detection and object recognition. This process has been considered (Sajda & Finkel, 1994; Zucker, Dobbins, & Iverson, 1989) to be composed of two components: the process of enhancement, where local edge information is combined cooperatively for smooth and salient contours, and the process of segmentation or grouping, where separate contour elements are assigned to different groups. Recent neural models of contour integration have mainly focused on the enhancement stage of visual processing. Many models of orientation tuning are composed of competitive orientational on-center-off-surround interactions within hypercolumns (Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Mundel, Dimitrov, & Cowan, 1997), which are modulated by experimentally found horizontal or lateral long-range interactions (Gilbert, 1992; Weliky, Kandler, Fitzpatrick, & Katz, 1995) to produce contextual effects.
Enhancement through feature linking is addressed by the model of Yen and Finkel (1996, 1998), which, however, requires rule-based interactions and global activity normalizations. Li (1998) has formulated a more biologically plausible dynamical model for contour integration in area V1 and discusses properties of the oscillatory correlations in the model that may be useful to facilitate synchronization-based segmentation. As the results of Li (1998)
Figure 6: Lateral interaction for contour grouping. (a) The pairwise interaction depends on the difference vector d and, for computational convenience, on two unit vectors n^1, n^2 encoding orientation. (b) The resulting interaction pattern for a single horizontal edge at the center, sampled at surrounding positions and orientations. Black edges share excitatory and gray edges share inhibitory interaction with the central edge; length codes interaction strength.
demonstrate, the complex excitatory and inhibitory interactions lead to distorted and less predictable correlations for more complex visual scenes, leaving the question of a robust grouping process mostly unanswered. In the following, we present the application of the CLM as a new approach to the grouping stage of contour integration. To keep the model simple and allow for the application to large feature sets of real images, we do not consider the generation of local orientation within a hypercolumn and leave this to a preprocessing step. The essential WTA then operates not between local orientation alternatives within a hypercolumn but between different grouping alternatives of the edge feature with fixed local orientation.

5.2 CLM Contour Grouping Model. We suppose that in an initial integration step, the local edge information has been subsampled and preprocessed. As a result of the preprocessing, we assume that within a localized area, only a single edge detector is active at an optimally tuned orientation. We therefore consider a set of idealized edge feature detectors, indexed by r ∈ {1, …, N} and characterized by position, local orientation, and an associated edge intensity value h_r. The CLM contour grouping model is then composed of a set of L layers a ∈ {1, …, L}, where L − 1 "figure" layers are provided to respond to salient contour groups and a single "ground" layer is provided to capture the background features. The pairwise lateral interaction between edge features in the figure layers is visualized in Figure 6. The excitatory component links edges that are cocircular, that is, lie tangentially to a common circle, with a tolerance that admits small deviations. This orientational field is superimposed with a gaussian distance-dependent component. This excitatory
interaction field is similar to recent models (Parent & Zucker, 1989; Yen & Finkel, 1998; Li, 1998). A similar interaction pattern between orientation-sensitive cells has been observed experimentally (Gilbert, 1992; Weliky et al., 1995). The local inhibitory component is of comparable magnitude to the excitatory field and insensitive to orientation, as suggested by experimental findings (DeAngelis, Freeman, & Ohzawa, 1994). To support the separation of remote contour parts, an additional weak long-range inhibition is necessary, which has no direct biological counterpart in the experimentally observed lateral connection structure.⁴ The complete lateral interaction, given as f^cocirc_rr', and the preprocessing stage, which we employed for the feature generation, are defined in appendix B.

The lateral interactions f^a_rr' are then given as f^1_rr' = m δ_rr' for the ground layer and as f^(a>1)_rr' = f^cocirc_rr' − k in the figure layers. The parameter m > 0 defines a self-coupling against which lateral interactions in the figure layers must compete to "pop out" a feature from the ground layer. The weak global inhibition strength k was chosen as k = 0.3/N. The vertical interactions and input strength are chosen equally as J_r = I^ab_r = J with J > J_c = max_r Σ_r' max(0, f_rr'), where J_c is the critical value implied by the stability margins of theorem 1. We used a value of J = 1.1 J_c.

The application of the CLM contour grouping model to a real image is shown in Figure 7. The preprocessing (see appendix B) results in a set of features where object and background textures produce a noisy background. To obtain a robust grouping for the complex scene of interacting edge elements, annealing in T was necessary, where we used an exponential schedule as described in section 4 with γ = 0.99. The grouping result is shown in Figure 7c.
By an appropriate choice of the ground-layer strength m = 3.5, which must approximately match the sum over the cocircular interactions of an edge that is part of a proper contour, the salient contours are detected as single groups. Since noisy edges lack the lateral support of aligned edges, they are captured by the ground layer, and for sufficiently many available layers, only those containing salient segments will be active. The lateral interactions cause an amplification of salient low-intensity contours without enhancing noisy fragments. To demonstrate that this amplitude-dependent modulation is crucial in the model, we compared the CLM grouping result to a Potts spin mean-field model with equal lateral interaction structure and annealing. Spin models generally lack the ability to represent additional amplitude information. The result in Figure 7d illustrates that this makes them more sensitive to noise in the image, because noisy aligned elements may form spurious groups. Since we were not able to produce significantly better results by varying the parameters m and k, we conclude that the amplitude-dependent dynamics in the CLM
⁴ Wang and Terman (1997) have speculated that the thalamus may provide the global inhibition.
Figure 7: Grouping of a natural image. (a) An input image is preprocessed to generate (b) a set of ≈ 2000 edge features. The edge input intensity h_r is visualized as the thickness of the displayed elements. The noise from the background and fruit textures has low amplitude; however, it cannot be suppressed by thresholding without losing low-contrast contours of the objects. (c) The grouping result for a CLM architecture with 20 figure layers and 1 ground layer. The symbols (7 symbols × black, light, and dark gray) represent activity in different layers, with symbol size proportional to magnitude. The ground layer is omitted. The scale of the lateral cocircularity interaction is displayed in the upper left corner. The CLM grouping gives 11 segments, which achieve an identification of the most important curve elements in the presence of background noise, where some of the layers remain inactive. The low-intensity edges of the pear and apple are amplified due to the supporting lateral interaction on the salient contours. (d) The grouping by a Potts spin system with the same interactions. Due to the lack of amplitude information, the model has a strong tendency to "hallucinate" curves, which cannot be compensated by the ground layer.
leads in this setting to a better signal-noise separation for complex input patterns. Figure 8 shows the performance of the CLM with the same parameter settings and identical annealing on more complex image data, where for an aerial image, the number of layers is not sufficient to allow a single layer for each salient segment. In that case, the dynamics tries to find a compromise by combining segments into the same layer such that their mutual inhibition is minimal, and thus the overall energy level is minimal. This results in a tendency to prefer a combination of parallel segments, which are separated by a distance larger than the local inhibition field of the cocircular interaction. In principle, it would be possible to suppress multiple segments within a layer by raising the global inhibition. This, however, poses an upper limit on the maximum size of a stable segment, since then the global inhibition may exceed the local cocircular support. The time course of the dynamics during self-inhibitory annealing illustrates the resulting hierarchical group formation process, which initially expresses the most salient groups in the available layers.

6 Discussion

6.1 Binding Properties. The CLM binding model shares some components with other recent binding models. For the contour grouping example that we have considered, the lateral interactions are based on a local compatibility combined with a weak global inhibition, which is the same for LEGION (Wang & Terman, 1997) and the ECU spin update model (Opara & Wörgötter, 1998). The LEGION model also uses a mechanism to perform figure-ground segmentation by suppressing features with low lateral support. The main difference lies in the dynamical implementation, which for the CLM is given by a consistent model of neural activity dynamics in a layered system of coupled WTA columns.
For the local contour grouping model, long-range coherence and separation are enforced by a self-inhibitory annealing mechanism that causes a controlled expression of dynamical modes but also results in a trade-off with convergence time. Another CLM application, to texture segmentation (Ontrup & Ritter, 1998), has shown that for sufficiently dense and long-ranged feature interactions, the CLM performs well also without annealing. The models of Wang and Terman (1997) and Opara and Wörgötter (1998) employ mechanisms similar to region growing and have been successfully applied to the task of grayscale segmentation of image data; however, we believe that the complex interactions of overlapping and intersecting curve elements might pose problems for their application to contour grouping.

Our eigensubspace analysis reveals an interesting parallel between the grouping dynamics in the CLM and the oscillatory synchronizations in the contour integration model of Li (1998). If we restrict ourselves to lateral interactions that are only excitatory, then the dominant oscillation mode in
Figure 8: Grouping results. The figure shows the CLM grouping results for two other images with identical parameter settings and 20 or more ground layers. For the "Lena" image (≈ 2000 features), the CLM gives 18 segments, which cover the most salient contours. The ground-layer strength introduces a minimum size that a segment must have to collect enough lateral support; thus, smaller contour segments are not resolved and would require a multiscale extension of the model. Low-intensity edges are enhanced at the brim and the top of the hat. The lower aerial image (≈ 3500 features) is taken from Yen and Finkel (1996). Due to the presence of more salient contours than layers, some layers carry more than a single segment; for example, the "black circle" layer has three vertical groups in the upper left corner and the center of the image. This illustrates a tendency to combine parallels into the same layer, which is most efficient with regard to the energy level of the global grouping. The lower row shows the sequence of emerging groups when lowering the self-inhibition from left to right.
that model has the same eigenvector components as the dominant grouping mode in the CLM. Therefore, a similar mechanism of suppressing spurious eigenmodes could be used to improve the synchronization properties of that model. We consider a combination of synchronization-based and spatial mechanisms a promising approach to more realistic models of cortical feature binding, with enough computational power to be used for sensory segmentation tasks.

6.2 Selective Amplification and Noise Tolerance. A main difference of the CLM compared to spin models of segmentation is the explicit use of amplitude information in the input. The use of dynamical WTA circuits results in a context-dependent amplitude modulation that enhances salient groups due to their supporting lateral interactions. Our comparison to a Potts spin model shows that this results in a better noise tolerance for the contour grouping example. This process has been discussed by Li (1998) and Li and Dayan (1999) as selective amplification. We have considered a setup where only the contributions of a set of coarsely sampled features are active, while others are omitted from the model. This decreases the ability to hallucinate contours, since we do not have a full range of feature detectors at all orientations as in the model by Li (1998). This also excludes a direct "filling in" of contours that are not present in the input. To ensure a proper operation of the model in an extended case of features at all orientations, it would be necessary, as in Li (1998), to increase the threshold of the nonsaturating linear transfer function from zero to a finite positive value to suppress the activation of neurons with zero afferent input.

6.3 Biological Relevance. The key principle of the CLM architecture is the encoding of feature bindings by the assignment to separate populations of laterally interconnected and locally competitive neurons.
A biologically realistic interpretation of the CLM architecture may be expressed in terms of the prominent layered structure of the real visual cortex, or it may be implemented in the rich local connectivity structure of a single neuronal layer itself. Here, we restrict our discussion to the macaque monkey primary visual cortex, area V1, which is one of the most intensively studied regions of the brain. Although it is known that neurons in the granular and supragranular layers of the primary visual cortex are connected via horizontal connections of considerable lateral spread (Rockland & Lund, 1983), there are currently no experimental data available that support the idea of specific inhibitory (competitive) interactions between disjunct modules of laterally interconnected neurons. However, the CLM model illustrates that this principle could easily be implemented in a simple neural circuit, and we therefore propose looking for related anatomical structures as an interesting experimental question. The presence of sufficiently many layers to carry each salient segment in an image is hard to justify in the visual system, although
it may be useful from a technical applications point of view. Nevertheless, since our suggested self-inhibitory control mechanism together with the figure-ground separation allows for a concentration on the most salient groups, a limited number of layers may not form a severe restriction. The principle of feature binding within competitive and spatially segregated layers may help to process visual information by different routes through the cortical neuropile. For example, there is evidence of local topographic inhibition between those layers in area V1 that receive the majority of afferent inputs from the LGN (Hubel & Wiesel, 1972): the principal input layers 4Cα and 4Cβ are coupled by reciprocal inhibitory interneurons (Lund, 1987) (varieties α-3, β-3). Since the α and β sublayers are characterized by different response properties (Blasdel & Fitzpatrick, 1984; Hawken & Parker, 1984) as well as a different spread of lateral connections (Yoshioka, Levitt, & Lund, 1994), a putative interpretation of the inhibitory circuitry could be that feature binding in each sublayer is performed by such a competitive interaction. Moreover, the neurons in layers 4Cα and 4Cβ project to different target layers within area V1, which contain key sets of efferent neurons to other cortical areas (Yoshioka et al., 1994; Yabuta & Callaway, 1998). Therefore, a binding of features to different depths of layer 4C may form the basis of the subsequent processing stages.

The layered structure of the CLM involves a certain degree of neural redundancy and could be considered a "waste" of neural hardware. The fixed hardwired layers might seem less flexible than the synchronization-based approach, which carries additional information in the temporal domain. Models have shown, however, that complex interactions lead to strong limitations with regard to stability and separation of the groupings.
We suggest the principle of topological segregation, as used in the CLM, as an additional binding principle that could improve the robustness and stability of the feature binding processes. Appendix A: Proof of Theorems
We consider the CLM dynamical system

ẋ_ra = v_ra E_ra,   (A.1)

where

E_ra = J_r h_r − Σ_b I^ab_r x_rb + Σ_r' f^a_rr' x_r'a,   (A.2)

with v_ra = 0 for x_ra = 0, E_ra < 0, and v_ra = 1 else, and where J_r > 0, I^ab_r = I^ba_r > 0, and f^a_rr' = f^a_r'r.

Theorem 1. If κ_ra > 0 with κ_ra = I^aa_r − f^a_rr − Σ_r'≠r max(0, f^a_rr'), then the CLM dynamics is bounded: if 0 ≤ x_ra(0) ≤ M for all r, a, where M = max_ra(J_r h_r / κ_ra), then 0 ≤ x_ra(t) ≤ M for all r, a and t > 0.
Proof. We prove the boundedness by constructing a hypercube in the positive domain that cannot be left by the dynamics. Since ẋ_ra ≥ 0 for x_ra = 0, the dynamics cannot leave the positive domain if initialized with x_ra(0) ≥ 0, and therefore x_ra(t) ≥ 0 for all t ≥ 0. The (r, a)-face of the hypercube is defined by

x_ra = M,  x_r'b ≤ M for r' ≠ r or b ≠ a.   (A.3)

We now show how to choose M such that ẋ_ra ≤ 0 for all x on the face (r, a), causing the system to stay inside the hypercube. On face (r, a) we have x_ra = M > 0 and therefore v_ra = 1. Hence

ẋ_ra = J_r h_r − Σ_b I^ab_r x_rb + Σ_r' f^a_rr' x_r'a   (A.4)
 ≤ J_r h_r − I^aa_r x_ra + f^a_rr x_ra + Σ_r'≠r max(0, f^a_rr') x_r'a,   (A.5)

where we have omitted negative off-diagonal terms. We can now insert equation A.3 and obtain

ẋ_ra ≤ J_r h_r − (I^aa_r − f^a_rr − Σ_r'≠r max(0, f^a_rr')) M = J_r h_r − κ_ra M   (A.6)
 ≤ 0  if M ≥ J_r h_r / κ_ra.   (A.7)

Therefore, if we choose M ≥ max_ra(J_r h_r / κ_ra), then ẋ_ra ≤ 0 on all faces (r, a).

Note that the conditions on κ_ra in theorem 1 do not imply diagonal dominance of the global interaction matrix G^ab_rr', since they do not take into account the vertical cross-inhibitions I^ab_r.

Corollary 1. Under the conditions of theorem 1, the dynamical system

ẋ_ra = −x_ra + σ(E_ra + x_ra),   (A.8)

where σ(x) = max(0, x), is bounded under the same conditions.

Proof. Since v_ra E_ra ≤ 0 if and only if −x_ra + σ(E_ra + x_ra) ≤ 0, the conditions on the faces of the constructed box hold equivalently for this alternative formulation of the dynamics.

In the following, we state sufficient conditions on the lateral and vertical interactions that ensure the convergence to an unambiguous state, with only one active layer within a column. We define the lateral feedback F_ra within a layer by F_ra = Σ_r' f^a_rr' x_r'a.
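The invariant hypercube of theorem 1 can be probed numerically with a clipped Euler integration of equations A.1 and A.2; the random interaction, the uniform choice I^ab_r = J, and the step size are our assumptions for this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 5, 3
f = rng.normal(scale=0.3, size=(N, N)); f = (f + f.T) / 2
np.fill_diagonal(f, 0.2)                      # self-excitatory diagonal f_rr > 0
off = np.maximum(0.0, f).sum(axis=1) - np.maximum(0.0, np.diag(f))
J = 1.5 * (np.diag(f) + off).max()            # ensures kappa_ra > 0 for I^aa_r = J
kappa = J - np.diag(f) - off
h = rng.uniform(0.5, 1.0, size=N)
M = (J * h / kappa).max()                     # hypercube bound of theorem 1

x = rng.uniform(0.0, M, size=(N, L))          # start inside [0, M]
dt = 1e-3
for _ in range(20000):
    E = J * h[:, None] - J * x.sum(axis=1, keepdims=True) + f @ x
    v = np.where((x <= 0.0) & (E < 0.0), 0.0, 1.0)   # constraint switch v_ra
    x = np.maximum(0.0, x + dt * v * E)

assert x.min() >= 0.0 and x.max() <= M + 1e-2   # trajectory stays in the hypercube
```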
382
H. Wersing, J. J. Steil, and H. Ritter
Theorem 2. If the lateral interaction is self-excitatory, $f_{rr}^{\alpha} > 0$ for all $r, \alpha$, and the vertical interactions satisfy $I_r^{\alpha\alpha} I_r^{\beta\beta} \le (I_r^{\alpha\beta})^2$ for all $\alpha, \beta$, then an attractor of the CLM has in each column $r$ either

i. at most one positive activity $x_{r\hat\alpha}$ with

$x_{r\hat\alpha} = \frac{J_r h_r + F_{r\hat\alpha}}{I_r^{\hat\alpha\hat\alpha}}, \qquad x_{r\beta} = 0 \text{ for all } \beta \neq \hat\alpha,$  (A.9)

where $\hat\alpha = \hat\alpha(r)$ is the index of the maximally supporting layer, characterized by $F_{r\hat\alpha} > F_{r\beta}$ for all $\beta \neq \hat\alpha$, or

ii. all activities $x_{r\alpha}$, $\alpha = 1, \dots, L$, in a column $r$ vanish and $F_{r\alpha} \le -J_r h_r$ for all $\alpha = 1, \dots, L$.

Proof. Suppose an equilibrium $\bar x$ has two positive activities $\bar x_{r\alpha}, \bar x_{r\beta} > 0$ in a
column $r$ at two layers $\alpha$ and $\beta$. Then the constraint is inactive, $v_{r\alpha} = v_{r\beta} = 1$; hence $\partial E / \partial x_{r\alpha} = \partial E / \partial x_{r\beta} = 0$. Now consider a small perturbation within this column of the form $\epsilon_{r\alpha} = \eta_\alpha$, $\epsilon_{r\beta} = \eta_\beta$, $\epsilon_{r'\alpha'} = 0$ for all other $r', \alpha'$. Expanding the quadratic function $E$ about the equilibrium, we obtain for $\Delta E$ under this perturbation, with all linear components vanishing at the equilibrium,

$2\Delta E = \sum_{rr'\alpha\beta} \big( I_r^{\alpha\beta} \delta_{rr'} - \delta_{\alpha\beta} f_{rr'}^{\alpha} \big)\, \epsilon_{r\alpha} \epsilon_{r'\beta}$  (A.10)
$= I_r^{\alpha\alpha} \eta_\alpha^2 + I_r^{\beta\beta} \eta_\beta^2 + 2 I_r^{\alpha\beta} \eta_\alpha \eta_\beta - f_{rr}^{\alpha} \eta_\alpha^2 - f_{rr}^{\beta} \eta_\beta^2$  (A.11)
$\le I_r^{\alpha\alpha} \eta_\alpha^2 + I_r^{\beta\beta} \eta_\beta^2 + 2 I_r^{\alpha\beta} \eta_\alpha \eta_\beta.$  (A.12)
If we can state a perturbation $(\eta_\alpha, \eta_\beta)$ for which $\Delta E < 0$, then $\bar x$ can be no local minimum and therefore no attractor of the gradient-descent dynamics. The quadratic $(2 \times 2)$ form (see equation A.12) has at least one negative eigenvalue $\lambda_1$ or $\lambda_2$ if $\lambda_1 \lambda_2 = I_r^{\alpha\alpha} I_r^{\beta\beta} - (I_r^{\alpha\beta})^2 < 0$. Then, however, there exists a perturbation $(\eta_\alpha, \eta_\beta)$ with $\Delta E < 0$. If equality holds, $I_r^{\alpha\alpha} I_r^{\beta\beta} = (I_r^{\alpha\beta})^2$, then $\lambda_1 = 0$ or $\lambda_2 = 0$, and we can state a perturbation from the null space for which then $\Delta E < 0$ due to $f_{rr}^{\alpha}, f_{rr}^{\beta} > 0$. We have shown by contradiction that an attractor can have at most one positive activity in a column.

Now consider case i. Let $\hat\alpha$ denote the layer index of the positive activity in row $r$. The constraint is then inactive for $x_{r\hat\alpha}$ and active for $x_{r\beta}$, $\beta \neq \hat\alpha$. Hence

$0 = E_{r\hat\alpha} = J_r h_r - I_r^{\hat\alpha\hat\alpha} x_{r\hat\alpha} + F_{r\hat\alpha},$  (A.13)
$0 > E_{r\beta} = J_r h_r - I_r^{\hat\alpha\hat\alpha} x_{r\hat\alpha} + F_{r\beta} \quad \text{for } \beta \neq \hat\alpha,$  (A.14)
which proves case i: subtracting equation A.13 from equation A.14 yields $F_{r\beta} < F_{r\hat\alpha}$. Case ii gives the remaining possible stable equilibria that are not covered by case i.

Appendix B: Preprocessing and Cocircular Interaction
The edge features are generated by subdividing the image into nonoverlapping small subrectangles. On each subrectangle, $3 \times 3$-pixel Sobel x and y operators are applied, which sample the intensity-gradient information in orthogonal directions. Within each subrectangle, we choose an edge feature indexed by $r$ at the position of maximum squared Sobel response, and a unit orientation vector in the direction of the normalized Sobel x and y components. The input $h_r$ is given by the squared Sobel response plus a small positive constant. We observed that this slight raising of the zero level improves the ability to enhance low-input contours through lateral effects. The constant was chosen as 10% of the maximum edge intensity. The original pixel sizes of the images before subsampling were $298 \times 284$ (fruit still life), $224 \times 224$ (Lena image), and $380 \times 380$ (aerial image).

The cocircular interaction of two edges at positions $\mathbf{r}_1 = (r_1^x, r_1^y)$ and $\mathbf{r}_2 = (r_2^x, r_2^y)$, with difference vector $\mathbf{d} = \mathbf{r}_1 - \mathbf{r}_2$, $d = \|\mathbf{d}\|$, and $\hat{\mathbf{d}} = \mathbf{d}/d$, and unit orientation vectors $\hat{\mathbf{n}}_1 = (n_1^x, n_1^y)$, $\hat{\mathbf{n}}_2 = (n_2^x, n_2^y)$ (see Figure 6), is given by

$f^{\mathrm{cocirc}}((\mathbf{r}_1, \hat{\mathbf{n}}_1), (\mathbf{r}_2, \hat{\mathbf{n}}_2)) = h(a_1 a_2 q)\, e^{-d^2/R^2} e^{-C^2 S} - I\, e^{-2 d^2/R^2},$

where $a_1 = n_1^x \hat d_y - n_1^y \hat d_x$, $a_2 = n_2^x \hat d_y - n_2^y \hat d_x$, $q = \hat{\mathbf{n}}_1 \cdot \hat{\mathbf{n}}_2$, and the function $h$, defined by $h(x) = 1$ for $x \ge 0$ and $h(x) = 0$ otherwise, is used to exclude skewed symmetric edges. The parameter $R$ controls the spatial range, which is smaller for the inhibitory component. The degree of cocircularity is given by $C = |\hat{\mathbf{n}}_1 \cdot \hat{\mathbf{d}}| - |\hat{\mathbf{n}}_2 \cdot \hat{\mathbf{d}}|$, which is equal to zero if both edges lie tangentially on a common circle. The parameter $S > 0$ controls the sharpness of the cocircularity constraint, and $I > 0$ controls the strength of the local inhibition. For all the simulations in this article, the spatial dimensions were scaled into a unit square, and the parameters were $R = 0.1$, $S = 300$, $I = 0.5$, $m = 3.5$.
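The geometry of the cocircular interaction is easy to check with a direct transcription. The sketch below follows our reading of the formula; the function and variable names, and the convention at $d = 0$, are our own choices:

```python
import numpy as np

R_RANGE, S, I_INH = 0.1, 300.0, 0.5      # parameter values quoted in the text

def f_cocirc(r1, n1, r2, n2):
    """Cocircular interaction between two edges (r_i: position, n_i: unit orientation)."""
    d_vec = np.asarray(r1, float) - np.asarray(r2, float)
    d = np.linalg.norm(d_vec)
    if d == 0.0:
        return -I_INH                    # purely local inhibition (our convention)
    d_hat = d_vec / d
    a1 = n1[0] * d_hat[1] - n1[1] * d_hat[0]
    a2 = n2[0] * d_hat[1] - n2[1] * d_hat[0]
    q = float(np.dot(n1, n2))
    C = abs(np.dot(n1, d_hat)) - abs(np.dot(n2, d_hat))   # cocircularity measure
    step = 1.0 if a1 * a2 * q >= 0 else 0.0               # h(.) excludes skewed pairs
    return (step * np.exp(-d**2 / R_RANGE**2) * np.exp(-C**2 * S)
            - I_INH * np.exp(-2 * d**2 / R_RANGE**2))

# Two collinear horizontal edges: C = 0, so the excitatory term dominates
e1 = ((0.0, 0.0), np.array([1.0, 0.0]))
e2 = ((0.05, 0.0), np.array([1.0, 0.0]))
print(f_cocirc(*e1, *e2))                # positive (excitatory) for collinear edges
```

Rotating the second edge by 90 degrees makes $C = 1$, so the sharpness term $e^{-C^2 S}$ suppresses the excitation and only the local inhibition remains.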
Acknowledgments
We thank Ute Bauer and Wolf-Jürgen Beyn for helpful discussions. We also thank the anonymous referees for useful comments, which improved the clarity of the article. H. W. is supported by DFG grant GK-231. J. J. S. is supported by DFG grant Ri-621-2-1.
Received May 4, 1999; accepted March 21, 2000.
LETTER
Communicated by Alan Yuille
Learning Object Representations Using A Priori Constraints Within ORASSYLL

Norbert Krüger
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany*

In this article, a biologically plausible and efficient object recognition system (called ORASSYLL) is introduced, based on a set of a priori constraints motivated by findings of developmental psychology and neurophysiology. These constraints are concerned with the organization of the input in local and corresponding entities, the interpretation of the input by its transformation in a highly structured feature space, and the evaluation of features extracted from an image sequence by statistical evaluation criteria. In the context of the bias-variance dilemma, the functional role of a priori knowledge within ORASSYLL is discussed. In contrast to systems in which object representations are defined manually, the introduced constraints allow an autonomous learning from complex scenes.

1 Introduction
The necessity of the existence of a certain amount of a priori knowledge within a system that has to deal with a high-dimensional learning problem such as object recognition is well founded (Geman, Bienenstock, & Doursat, 1995; Abu-Mostafa, 1995). However, the definite selection and formalization for a specific task domain is still an open question. Many artificial object recognition systems implicitly apply a certain amount of a priori knowledge. The aim of the work presented here is to put the concrete choice of predetermined structural constraints into focus. Plausible requirements for predetermined structural constraints are discussed, and definitions of suitable constraints for object recognition are given. Furthermore, their formalization within an artificial system called ORASSYLL (object recognition with autonomously learned and sparse symbolic representations based on local line detectors) is discussed. I will claim that findings of developmental psychology and neurophysiology indicate that the human visual system possesses certain structural constraints at birth and support their definition. The study of the primate's visual system enables us to look at the result of an evolutionary learning algorithm
* Current address: Institut für Informatik, Christian-Albrechts-Universität zu Kiel, 24105 Kiel, Germany
Neural Computation 13, 389–410 (2001)
© 2001 Massachusetts Institute of Technology
390
Norbert Krüger
that has established predetermined structural constraints that may be extracted up to a certain degree from experimental data and can be realized within technical systems. As a result of this controlled application of a priori knowledge, and in contrast to approaches applying a manually designed object representation, model-based representations within ORASSYLL can be extracted autonomously or with only little manual intervention within a perception-action cycle.

In section 2 arguments for the necessity of a priori constraints are summarized, and the relation of evolutionary learning and individual learning is discussed. Motivated by genetically determined structure in the human brain and observations of the behavior of newborns and infants (described in section 3), I will define constraints for visual learning on which the object recognition system ORASSYLL is based (section 4). These constraints are concerned with the organization of the input in local and corresponding entities utilizing the interaction of action and perception (section 4.1), the interpretation of the input by its transformation in a highly structured feature space, resulting in a sparse object representation (section 4.2), and the evaluation of features extracted from an image sequence by statistical evaluation criteria (section 4.3). A technical description of ORASSYLL, focusing on the learning of object representations and the role of the introduced constraints, is given in section 5.

2 The Bias-Variance Dilemma
Learning is inherently faced with the bias-variance dilemma (Geman et al., 1995): If the starting configuration of the system has many degrees of freedom, it can learn from and specialize to a wide variety of domains, but it will in general have to pay for this advantage by having many internal degrees of freedom—the “variance” problem. On the other hand, if the initial system has few degrees of freedom, it may be able to learn efficiently, but there is great danger that the structural domain spanned by those degrees of freedom does not cover the given domain of application at all—the “bias” problem. As a conclusion, Geman et al. (1995) argue that “bias needs to be designed to each particular problem.” This conclusion is the starting point of this article, in which concrete choices of bias are suggested in terms of structural constraints realized within ORASSYLL.

If artificial neural networks are considered as being caught in the variance part of the bias-variance dilemma, another type of object recognition system suffers from too much bias. These systems can be called model-based systems (see, e.g., Yuille, 1991; Lanitis, Taylor, & Cootes, 1997). As an example, Lanitis et al. (1997) can successfully locate faces by matching a manually defined face model with a certain number of free parameters, enabling the adaptation to a specific face in a specific pose. Because the model of the face is defined manually, each time the algorithm is applied to a new object class, a new representation has to be designed manually again. In this way—for example, in Cootes, Taylor, Cooper, and Graham (1995)—resistors are localized within the framework of the object representation in Lanitis et al. (1997).

Each concrete choice of a priori knowledge is a crucial point. A wrong choice may lead to the exclusion of good solutions in the search space. An amount of predetermined knowledge that is too restricted may result in an increase of the search space, leading to unrealistic learning time and bad generalization. This trade-off requires first that the a priori constraints be powerful in the sense that they cover the essential structure of the problem they deal with. Second, they should be general, such that they can be applied in many situations and do not restrict the system to dealing with only very specific subproblems. Of course, these two necessary properties are not sufficient to give a unique definition of structural constraints. Indeed, there is a certain amount of arbitrariness, and a proof for the a priori constraints (e.g., based on the statistics of the input data of the visual system and a precise definition of the task it has been designed for) is far beyond the stage of visual science today and, of course, far beyond the scope of this work.

How can we escape the bias-variance dilemma? The existence of a pattern recognition system—the human visual system—able to deal with its surroundings efficiently and with sufficient adaptivity raises hope about this possibility. The predetermined structural constraints have evolved during evolution and appear to be well suited to organize visual experience. Therefore, they seem to cover essential structures of the physical world, giving us a valuable opportunity to look at results of biology to become inspired for suitable definitions of constraints. In this sense, the Kantian idea (Kant, 1781) of establishing a table of a priori constraints that organizes perception can be supported, guided, and justified by a good amount of neurophysiological and psychophysical data, discussed in the next section.

3 Development of the Visual System
To motivate the constraints applied within ORASSYLL, a short description of the development of human visual skills is given. I restrict myself to the aspects relevant for the predetermined structural knowledge applied within ORASSYLL. In section 3.1 the observable (regarding the behavior of infants), or outer, development of the human visual system is summarized. In section 3.2 some aspects of the neurophysiological, or inner, correlate of this development are described. Both ways of description, inner and outer, give evidence that the human brain is neither a blank slate nor a system completely determined at birth; rather, structural knowledge is already existent and guides postnatal learning.

3.1 Developmental Psychology of Visual and Gripping Abilities.
Newborns already possess a remarkable set of visual abilities. They are able to distinguish lines (Rauh, 1995) and colors (Jones-Molfese, 1992). Movement, high contrast, and faces are “interesting features” to which the newborn infant directs its attention (Goren, 1975). Furthermore, newborns are able to follow (clumsily) a moving object by rotating their head and eyes
(Rauh, 1995). Spelke (1993) demonstrated that the Gestalt principle of “common fate” is already used by 4-month-old infants and also demonstrated the appearance of the Gestalt principles of collinearity and parallelism at an age of 7 months. The process of gaining control of arms and hands and their interaction with the visual system develops within the first six months. At approximately 4 to 6 months, an infant can perform a visually controlled movement of arms and hands. Infants under the age of 4 months mainly characterize an object by its movement or position. They perceive an object as “something at a certain position” or “something moving with a certain velocity.” Objects have no “above, below, left, right, in front or behind” (Bower, 1971). In coincidence with gaining visual control over their movement, the object representation of infants starts to be based on higher features, such as form and size (Bower, 1971). An interesting analogy between ORASSYLL and the human acquisition of object representations is the use of form features by infants at an age when they are able to carry out a perception-action cycle, which is also necessary within the artificial system presented here (see section 4.1).

3.2 Development of the Visual System: Neurobiology. The relation of intrinsic properties of brain areas to extrinsic influences on the structure of the cortex is a controversial question. As an extreme viewpoint, Creutzfeld (1977) proposed that all cortical neurons are initially equipotent and that laminar and areal differences in the organization of the cortex are induced exclusively by extrinsic influence. There exist indeed impressive phenomena of extrinsic effects on the structuring of brain areas (see, e.g., Sur, Garraghty, & Roe, 1988; Blakemore & Cooper, 1970). However, a considerable number of counterexamples also show the restricted adaptivity of the visual system, as in the case of strabismus and astigmatism, for example (Atkinson & Braddick, 1989).
Furthermore, the evolution of complex structures without postnatal visual experience, such as orientation maps (Wiesel & Hubel, 1974; Gödecke & Bonhoeffer, 1996), confirms the importance of genetic predetermination. I have already given a short overview of the results of research on the corticogenesis of the visual system that is relevant for the choice of constraints within ORASSYLL (Krüger, 1998b). I argue there that the connections of brain areas and the receptive field sizes of neurons in different areas are largely predetermined and established at birth. Even the features extracted in some areas (orientation, movement, and color) (Wiesel & Hubel, 1974; Gödecke & Bonhoeffer, 1996), that is, the coarse sensitivity of neurons, are basically initiated before the first postnatal visual experience. Other features, such as the extraction of disparity information, depend on extrinsic influence and do not develop without it. Input-dependent fine tuning and local normalization processes (parallel to the development of lateral synaptic connections) develop during the first months of visual experience. The arrangement of features in computational maps (Knudsen, du Lac, & Esterly,
1987) is a major principle applied throughout the brain in the early stages of visual processing (Hubel & Wiesel, 1979) and higher stages (Tanaka, 1993) and probably, at least for area V1, is genetically determined (see Wiesel & Hubel, 1974; Gödecke & Bonhoeffer, 1996). The utilization of Gestalt principles (such as collinearity and parallelism) also evolves with visual experience (see section 3.1), possibly making use of statistical properties of natural images (see also Phillips & Singer, 1997). It is an interesting by-product of ORASSYLL that the Gestalt principles of collinearity and parallelism can be detected as significant relations of the class of natural images after interpreting the Gabor wavelets according to the constraints applied within ORASSYLL (see section 4.2 and Krüger, 1998a).

Developmental psychology and neurophysiological research give indications of the impressive adaptivity of the visual system and its capability to extract significant information from experience. Both ways of description also indicate a large amount of genetic prestructuring. Maybe the important question is not so much the relative weight of genetic predetermination and adaptation but a precise definition of the predetermined structural constraints. The experiments in striate cortex (Wiesel & Hubel, 1974; Gödecke & Bonhoeffer, 1996) already give good hints about predetermined structures, such as feature choices and their organization in computational maps.

4 The A Priori Constraints
Inspired by the constraints imposed by the human visual system (as described in section 3), in this section predetermined structural constraints for visual learning, which will be realized within ORASSYLL, are introduced. Each pattern recognition system has to apply a certain amount of a priori knowledge. What may be different in this work is that these structural constraints are the focus of attention and the starting point of the design of the object recognition system. For the constraints, I will discuss analogies to constraints imposed by the human visual system (see Table 1) as well as differences from and relations to design choices in other systems. I do not assume that the constraints introduced here are complete in the sense that they cover all necessary constraints for an object recognition system as efficient as the human visual system. First, the focus here is on shape processing; other cues, such as color or disparity information, are ignored. Second, in its current form, ORASSYLL is a two-dimensional (2D) approach. Third, for object representation at higher stages of visual processing (higher than V1), additional constraints (e.g., Gestalt principles) probably have to be taken into account. Nevertheless, the constraints used within ORASSYLL are sufficient to autonomously extract 2D representations of objects from images, and these representations can be applied to difficult vision tasks. The a priori constraints can be divided into constraints concerning the organization of the input (PL1–2, section 4.1); constraints concerning feature selection, feature organization, and feature processing (PF1–4, section 4.2);
Table 1: Constraints, Analogy in the Human Visual System, and Functional Role Within ORASSYLL.

PL1: Independence
  Analogy in the visual system: Localized receptive fields (Hubel & Wiesel, 1979)
  Functional role within ORASSYLL: Reducing complexity by splitting the input space into subspaces

PL2: Correspondence
  Analogy in the visual system: Interaction of arm movement and perception (perception-action cycle) (Rauh, 1995; Koenderink, 1992)
  Functional role within ORASSYLL: Learning with comparable entities

PF1: Feature selection
  Analogy in the visual system: Genetically determined orientation-sensitive Gabor-shaped neurons (Wiesel & Hubel, 1974; Jones & Palmer, 1987)
  Functional role within ORASSYLL: Reduction of the search space by forcing description by specific features with symbolic meaning

PF2: Feature arrangement
  Analogy in the visual system: Genetically determined order of orientation maps in V1 (Wiesel & Hubel, 1974; Gödecke & Bonhoeffer, 1996)
  Functional role within ORASSYLL: Combination of similar and separation of dissimilar entities by a metric

PF3: Hierarchical processing
  Analogy in the visual system: Applied within the whole cortex (see, e.g., Oram & Perrett, 1994)
  Functional role within ORASSYLL: Sharing of resources; speed-up of feature processing

PF4: Sparse coding
  Analogy in the visual system: Equal response probability of neurons across images and low response probability for a single image (Palm, 1980; Field, 1994); rate spread in V1 compared to ganglion cells
  Functional role within ORASSYLL: Low memory requirements and high storage capacity; reduction of the space of relations within object representations; speed-up of matching

PE1: Maximal discrimination
  Functional role within ORASSYLL: Speed-up of learning by internal evaluation of features instead of, e.g., error backpropagation

PE2: Minimal redundancy
  Analogy in the visual system: Transformation of redundancy of input data into cognitive maps (Barlow, 1961)
  Functional role within ORASSYLL: Reduction of redundancy by representing similar entities by one entity, utilizing the metric
and constraints concerning statistical feature evaluation (PE1–2, section 4.3). Their relationship to biological and psychological findings and their functional role within ORASSYLL are summarized in Table 1.

4.1 Locality and the Correspondence Problem. The system's input is spatially organized. The input is divided into nonoverlapping subparts (PL1: independence) and reflects a certain consistency of moving objects in time (PL2: correspondence). This organization of the input decreases the relevant feature space and facilitates learning.
PL (Locality). PL1 (independence): Features at distant locations are assumed to be independent. PL2 (correspondence): Only features corresponding to the same landmark for different examples of the same object class are used for learning.

4.1.1 PL1. The first part of the constraint PL already implies the locality of features: a local feature (i.e., a feature that describes a quality of a local part of an image) does not interact with features corresponding to other landmarks. Corresponding to the limited receptive field size of neurons in V1, the features will be local (see PF1). It has been demonstrated that local filters similar to the filters applied within ORASSYLL are the result of feature extraction by independent component analysis (Bell & Sejnowski, 1996). The splitting of the feature space into smaller independent subspaces is a design principle of many artificial systems (e.g., Alpaydin & Jordan, 1996; Wiskott, Fellous, Krüger, & von der Malsburg, 1997). Nevertheless, others (e.g., Turk & Pentland, 1991) apply global functions to the input image.

4.1.2 PL2. The second part of PL comprises a fundamental problem of vision: the correspondence problem. For any learning algorithm, it is indispensable to ensure that comparable entities (i.e., comparable landmarks) are used as training data. Looking at a single image of an object makes it difficult to distinguish between features corresponding to the background and features corresponding to the object. If the object to be learned is moving, a number of factors will produce high variation in the data:

- Translation of the object, which leads to the appearance of corresponding landmarks at different positions in the image and varying background and scale
- Rotation, which may lead to the occurrence of very dissimilar views of the same object
- Variation of illumination, which causes shadows on the object and amplifies or diminishes the occurrence of textures or edges
396
Norbert Krüger
Figure 1: (a) Robot arm with the camera. (b) "Retinal" images produced by following the robot arm holding a toy duck. (c.i–c.iv) Significant features per instance extracted in a rectangular region (shown in b.i). (c.v) Learned representation. (d) Training data and learned representation for a toy car.
A sensible learning of a certain view of an object seems to be impossible unless at least some of these sources of variation are eliminated. The easiest way is to solve the correspondence problem manually, but this is a serious restriction. At least for one instance of an object class, interaction of motor control and vision can help to create controlled training data. By moving an object with a robot arm and following the object by keeping fixation relative to the robot hand using its known three-dimensional (3D) position, training data are produced in which a certain view of an object is shown with varying background and illumination but with corresponding landmarks in the same pixel position within the image (see Figures 1a and 1b). The method is comparable to the arm movements and grasping controlled by vision that infants 4 to 6 months old are able to perform (see section 3.1) and reflects the strong relation between action and perception (Koenderink, 1992; Sommer, 1997). Interestingly, infants' concept of objects changes dramatically at this stage of development (see section 3.1). The ability to create a situation in
Learning Object Representations
397
which an object appears under controlled conditions may help, as in the object recognition system, in learning a suitable representation of objects.

4.2 Feature Selection and Feature Organization. The spatially organized input is transformed to a structured feature space; in other words, the input is seen through the "glasses" of this feature space. The a priori constraints introduced here are concerned with the choice of features and their transformation to a more "symbolic" meaning (PF1: feature selection), their interrelation (PF2: feature arrangement), their computation (PF3: hierarchical processing), and their role within the object representation (PF4: sparse coding). An important difference from other object recognition systems is the exploitation of a rich structure of the feature space for learning. The locality of features and the mechanism described in Figure 1 ensure that only comparable local entities are input for learning. These features can be compared by a metric of the feature space. Interestingly, the specific interpretation of Gabor wavelet responses within ORASSYLL allows for the detection of Gestalt principles in the input data.
PF (Feature Assumptions). PF1 (feature selection): Significant features of a localized area of the two-dimensional projection of the visual world are localized (curved) line segments. PF2 (feature metric): A metric defines a distance between these features indicating the differences in their properties orientation, curvature, and position. PF3 (hierarchical processing): These features are computed from simpler features in a hierarchical fashion. PF4 (sparse coding): An object is coded as a sparse and spatially ordered arrangement of these features.

4.2.1 PF1. In ORASSYLL, Gabor or banana wavelets and their symbolic analog, (curved) local line segments, are used as basic features that are given a priori (see Figure 2). The restriction to Gabor or banana wavelets gives a significant reduction of the search space. Instead of allowing, say, all linear filters as possible features (as realized, e.g., in the scalar product of a backpropagation neuron), a restriction to a small subset is imposed. Neurophysiological experiments (Wiesel & Hubel, 1974; see also section 3.2) show that the human visual system also makes certain kinds of feature choices before its first acts of postnatal visual experience. Within the framework of ORASSYLL, it has been shown (Krüger, 1998a) that the Gestalt principles of collinearity and parallelism can be detected and described as second-order statistics of normalized Gabor wavelet responses. This nonlinear normalization was initially developed to solve a certain subproblem within ORASSYLL: the comparison of a symbolic feature with a certain position within the gray-level image. The normalization transforms Gabor wavelet responses such that they express the system's confidence in the presence or absence of a local line segment—that is, it represents their interpretation in terms of constraint PF1. Surprisingly, this transformation affects the second-order statistics of Gabor wavelet responses significantly (see Figures 3a–d). Looking only at the nonnormalized Gabor wavelet responses, the Gestalt principles of collinearity and parallelism are barely detectable (see Figures 3e–f).

Figure 2: Path corresponding to a banana wavelet. (a) Arbitrary wavelet. (b) Symbolic representation by the corresponding curve. (c) Visualization of a representation of an object class.

4.2.2 PF2. In ORASSYLL, the ordered arrangement of features is achieved by a metric defining a distance between features indicating their differences in the properties orientation, curvature, and position. This metric organization is essential for learning in the object recognition system because it allows the clustering of similar features and thus the determination of representatives for such clusters.

4.2.3 PF3. In the visual cortex of primates, hierarchical processing of features of increasing complexity and increasing receptive field size occurs. In the object recognition system, the main advantage of hierarchical processing (see Figure 4) is speed-up and reduction of memory requirements. Hierarchical processing is a widely used principle, realized in most neural network systems.

4.2.4 PF4. Sparse coding is discussed as a coding scheme realized in the primate's visual system (Field, 1994). A sparse representation can be defined as a coding of an object by a small number of binary features taken from a large feature space. In ORASSYLL an expansion of the feature space is forced by extracting a number of features for each pixel. For the representation of an object, only about 10^-6 of all available features are required. In this sense, the objects (see, e.g., Figure 2c) are represented sparsely. ORASSYLL differs in this aspect from many other object recognition systems that apply compact representations (e.g., Turk & Pentland, 1991). The sparseness of the representation allows fast matching because only a few features have to be checked within an image. Furthermore, the space of possible feature relations within the object representation is reduced in a sparse representation. Taking into account that important aspects are coded by these relations (e.g., Gestalt relations), this potentially facilitates learning within the space of multiple-order relations.

4.3 Evaluation of Features. In contrast to the constraints (PF1–PF4) that define the feature space itself, two criteria are now introduced for the selection of "good" features to represent an object class. These constraints enable the system to use multiple visual experiences of instances of the same object class to extract significant information. In contrast to, say, backpropagation (Rumelhart, Hinton, & Williams, 1986), in which global criteria are optimized, the two constraints represent criteria that guide and speed up the learning algorithm by evaluating intermediate stages of processing (see also the discussion of the credit assignment problem in, e.g., Arbib, 1994).
PE (Evaluation). PE1 (maximal discrimination): Features are preferred whose values on images vary little within classes and vary much between classes. PE2 (minimal redundancy): Redundant information shall be eliminated from the system.

4.3.1 PE1 and PE2. Unlike for the constraints PF1–PF4, analogies to the constraints PE1 and PE2 cannot be detected by biological experiments. However, Barlow (1961) discusses redundancy reduction as an important principle underlying the transformation of sensory messages in the brain. Furthermore, both constraints are applied implicitly in many pattern recognition algorithms (see, e.g., Fisher, 1923; Krüger, 1997).

5 Description of the Object Recognition System
In this section I give a technical description of ORASSYLL, based on the a priori constraints introduced in section 4. My aim is not to give a detailed description of the whole system but to describe how the applied constraints enable learning of object models for realistic and difficult tasks (for details concerning the complete system, see Krüger, 1998b; Krüger & Peters, 2000).

5.1 The Feature Space. In this section the realization of the constraints PF1–PF4 is described. By utilizing these constraints, the input is transformed into a metrically organized feature space. Images are interpreted as an assembly of spatially organized local line segments that are processed hierarchically.
5.1.1 Gabor and Banana Wavelets. A banana wavelet $B^{\vec b}$ is a complex-valued function, parameterized by a vector $\vec b$ of four variables, $\vec b = (f, \alpha, c, s)$, expressing the attributes frequency ($f$), orientation ($\alpha$), curvature ($c$), and elongation ($s$).¹ It can be understood as a product of a curved and rotated complex wave function $F^{\vec b}$ and a stretched two-dimensional Gaussian $G^{\vec b}$, bent and rotated according to $F^{\vec b}$ (see Figure 2a). Banana wavelets are generalized Gabor wavelets (for Gabor wavelets, see, e.g., Daugman, 1985) that possess, in addition to frequency and orientation, the parameters curvature and elongation. The approach introduced here does not necessitate the use of banana wavelets and is also applicable with Gabor wavelets (see Figures 5d.i, ii, and iv for object representations with only straight line segments).

5.1.2 The Feature Space. The six-dimensional space of vectors $\vec c = (\vec x, \vec b)$ is called the feature space, with $\vec c$ representing the wavelet $B^{\vec b}$ with its center at pixel position $\vec x$ in an image. PF2 is realized by a metric $d(\vec c_1, \vec c_2)$. Two coordinates $\vec c_1, \vec c_2$ are expected to have a small distance $d$ when their corresponding kernels are similar—that is, when they represent similar features. For the exact definition of the metric, first a distance measure is defined on the orientation-curvature subspace $(\alpha, c)$, expressing its Moebius topology:

$$
d((\alpha_1, c_1), (\alpha_2, c_2)) = \min\left\{
\sqrt{\frac{(\alpha_1 - \alpha_2)^2}{\epsilon_\alpha^2} + \frac{(c_1 - c_2)^2}{\epsilon_c^2}},\;
\sqrt{\frac{((\alpha_1 - \pi) - \alpha_2)^2}{\epsilon_\alpha^2} + \frac{(c_1 + c_2)^2}{\epsilon_c^2}},\;
\sqrt{\frac{((\alpha_1 + \pi) - \alpha_2)^2}{\epsilon_\alpha^2} + \frac{(c_1 + c_2)^2}{\epsilon_c^2}}
\right\}.
$$

Then a distance measure on the complete coordinate space is defined by

$$
d(\vec c_1, \vec c_2) = \sqrt{\frac{(x_1 - x_2)^2}{\epsilon_x^2} + \frac{(y_1 - y_2)^2}{\epsilon_y^2} + \frac{(f_1 - f_2)^2}{\epsilon_f^2} + d((\alpha_1, c_1), (\alpha_2, c_2))^2 + \frac{(s_1 - s_2)^2}{\epsilon_s^2}}. \tag{5.1}
$$

The values $\vec\epsilon = (\epsilon_x, \epsilon_y, \epsilon_f, \epsilon_\alpha, \epsilon_c, \epsilon_s)$ define a cube of volume 1 in the feature space; that is, they define the weights for the different properties, such as orientation and curvature. In the standard settings, $\vec\epsilon = (4, 4, 0.01, 0.3, 0.4, 3.0)$ is used.

¹ For Gabor wavelets, $\vec b$ reduces to $(f, \alpha)$.
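The metric can be sketched in a few lines of Python. This is an illustrative reading of equation 5.1, not code from the paper; in particular, the period pi and the curvature sign flip in the orientation-curvature terms follow from the stated Moebius topology, and all function names are mine.

```python
import numpy as np

# Standard settings for the weights (eps_x, eps_y, eps_f, eps_alpha, eps_c, eps_s).
EPS = dict(x=4.0, y=4.0, f=0.01, alpha=0.3, c=0.4, s=3.0)

def d_orientation_curvature(a1, c1, a2, c2, ea=EPS['alpha'], ec=EPS['c']):
    """Distance on the (orientation, curvature) subspace.

    Orientation is periodic with period pi; under a half-turn the
    curvature sign flips, hence the (c1 + c2) terms.
    """
    candidates = [
        (a1 - a2) ** 2 / ea ** 2 + (c1 - c2) ** 2 / ec ** 2,
        ((a1 - np.pi) - a2) ** 2 / ea ** 2 + (c1 + c2) ** 2 / ec ** 2,
        ((a1 + np.pi) - a2) ** 2 / ea ** 2 + (c1 + c2) ** 2 / ec ** 2,
    ]
    return np.sqrt(min(candidates))

def d_feature(c1, c2):
    """Distance between two feature coordinates (x, y, f, alpha, c, s)."""
    x1, y1, f1, a1, k1, s1 = c1
    x2, y2, f2, a2, k2, s2 = c2
    return np.sqrt(
        (x1 - x2) ** 2 / EPS['x'] ** 2
        + (y1 - y2) ** 2 / EPS['y'] ** 2
        + (f1 - f2) ** 2 / EPS['f'] ** 2
        + d_orientation_curvature(a1, k1, a2, k2) ** 2
        + (s1 - s2) ** 2 / EPS['s'] ** 2
    )
```

Note that a horizontal segment of curvature $c$ and the same segment rotated by pi with curvature $-c$ are the same feature and get distance 0 under this metric.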
5.1.3 Nonlinear Transformations of the Filter Responses. The feature processing of ORASSYLL consists of a two-step nonlinear transformation of the complex filter responses. In the first step, the magnitude of the filter response is extracted after the convolution of $B^{\vec b}$ with the image $I$. In contrast to the complex filter response, which oscillates with phase, the magnitude of the response is more stable under a slight variation of position (see Pötzsch, Krüger, & von der Malsburg, 1996). A filter $B^{\vec b}$ causes a strong response at a certain pixel position when the local structure of the image at that position is similar to $B^{\vec b}$. The magnitude of the filter responses depends significantly on the strength of edges in the image. However, here I am interested only in the presence, and not the strength, of edges. Thus, in a second step, a function $N$ normalizes the real-valued filter responses $r(\vec c)$ into the interval $[0, 1]$. The value $N(r(\vec c))$ represents the likelihood of the presence or absence of a local line segment corresponding to $\vec c = (\vec x_0, \vec b)$. This normalization is based on the "above-average criterion":

AAC: A line segment corresponding to the banana wavelet $\vec c$ is present if the corresponding banana wavelet response is distinctly above the average response.

The normalization is realized by mapping $r(\vec c)$ by a sigmoid function $N$. $N(r(\vec c))$ returns a small value when $r(\vec c)$ is below an average response $E$ and a high value if it is close to the maximum response $Max$. Both parameters, the average $E$ and the maximum $Max$, are computed using local and global information. Because of this locality, they vary with pixel position. The influence of the normalization on the second-order statistics of Gabor wavelet responses is demonstrated in Figure 3.

5.1.4 Approximation of Banana Wavelets by Gabor Wavelets. To reduce the computational requirements for the extraction of the large feature space, an algorithm is defined that approximates banana wavelets by Gabor wavelets and banana wavelet responses by Gabor wavelet responses (see Figure 4). By this hierarchical processing (PF3), a speed-up by a factor of 5 can be achieved.

5.2 Learning. A sparse object representation (PF4) is extracted from single images or image sequences. Features caused by background structure or illumination can be eliminated by a learning scheme that makes use of the structure of the feature space (PF1, PF2).
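The above-average normalization of section 5.1.3 can be sketched as follows. The paper specifies only the qualitative behavior of $N$ (small below the average response, close to 1 near the maximum), so the logistic shape, the midpoint choice, and the steepness parameter below are my own assumptions.

```python
import numpy as np

def normalize_response(r, avg, max_resp, steepness=6.0):
    """Sigmoidal 'above-average' normalization of a filter magnitude.

    Maps a real-valued magnitude r into [0, 1]: values near the (local)
    average map close to 0, values near the (local) maximum map close
    to 1.  avg and max_resp would be computed from local and global
    image information and thus vary with pixel position.
    """
    midpoint = 0.5 * (avg + max_resp)          # halfway between average and max
    scale = max(max_resp - avg, 1e-12)         # avoid division by zero
    return 1.0 / (1.0 + np.exp(-steepness * (r - midpoint) / scale))
```

With the default steepness, a response equal to the average maps to about 0.05 and a response equal to the maximum maps to about 0.95.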
The learning algorithm applies the evaluation criteria PE1 and PE2 as internal criteria to determine signicant features for an object class. It presupposes the correspondence of local landmarks in different images (PL1, PL2), which can be achieved by the interaction of perception and action.
Figure 3: Cross-correlation of pairs of (a–d) normalized Gabor wavelet responses and (e–h) unmodified Gabor wavelet responses for four orientations on a large set of natural images: (a, e) horizontal-horizontal; (b, f) horizontal-diagonal; (c, g) horizontal-vertical; (d, h) horizontal-diagonal. The x- and y-axes represent the separation of the kernels (the labeling of all axes for a–h is the same as in a), and the z-axis represents the correlation. In a, parallelism and collinearity are clearly visible: collinearity is detectable as a ridge in the first diagram, and parallelism appears as a global property expressed in the flat part of the surface in the first diagram, clearly above the surfaces corresponding to nonparallel orientations.
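The correlation surfaces of Figure 3 are plain cross-correlations of two response maps as a function of kernel separation. A minimal sketch of that computation, assuming the responses for two orientations are given as 2D arrays (the function name and normalization conventions are mine):

```python
import numpy as np

def cross_correlation_map(resp_a, resp_b, max_shift):
    """Correlation of two filter-response maps as a function of offset.

    resp_a, resp_b: 2D arrays of (normalized) responses for two
    orientations on the same image.  Returns an array of shape
    (2*max_shift+1, 2*max_shift+1) whose entry for offset (dy, dx) is
    the normalized correlation between resp_a at (y, x) and resp_b at
    (y+dy, x+dx), averaged over the overlapping region.
    """
    a = resp_a - resp_a.mean()
    b = resp_b - resp_b.mean()
    h, w = a.shape
    out = np.zeros((2 * max_shift + 1, 2 * max_shift + 1))
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # overlapping slices of a at (y, x) and b at (y+dy, x+dx)
            a_part = a[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
            b_part = b[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)]
            out[dy + max_shift, dx + max_shift] = (a_part * b_part).mean()
    return out / (a.std() * b.std())
```

Averaging such maps over many natural images and plotting them as surfaces gives diagrams of the kind shown in Figure 3; collinearity then appears as a ridge along one axis.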
Figure 4: Hierarchical processing. The more complex banana wavelet on the left is approximated by the weighted sum of Gabor wavelets on the right.
5.2.1 Extracting Significant Features per Instance. Here the aim is to extract the local structure in an image in terms of (curved) local line segments. A significant feature per instance of an object is defined by two qualities: it has to cause a strong response (C1), and it has to represent a maximum within a local area of the feature space (C2). Figures 1c.i–iv, 5b, 5d, and 6b.i–iv show the significant features per instance for some objects. Each wavelet is described by a curve with the same orientation, curvature, and size. Lower frequencies are represented as thicker line segments.
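The two criteria C1 and C2 can be sketched as a scan over a map of normalized responses. The threshold and neighborhood radius below are illustrative choices, not values from the paper:

```python
import numpy as np

def significant_features(norm_resp, threshold=0.7, radius=1):
    """Select significant features per instance.

    norm_resp: array indexed by (y, x, feature index) of normalized
    responses in [0, 1].  A feature is kept if it responds strongly
    (C1) and is a maximum within a local neighborhood of the feature
    space (C2).  Returns a list of (y, x, feature index) tuples.
    """
    h, w, n = norm_resp.shape
    kept = []
    for y in range(h):
        for x in range(w):
            for k in range(n):
                r = norm_resp[y, x, k]
                if r < threshold:              # C1: strong response
                    continue
                y0, y1 = max(0, y - radius), min(h, y + radius + 1)
                x0, x1 = max(0, x - radius), min(w, x + radius + 1)
                if r >= norm_resp[y0:y1, x0:x1, :].max():
                    kept.append((y, x, k))     # C2: local maximum
    return kept
```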
Figure 5: One-shot learning. (a, c) The objects to be learned in front of a homogeneous background. (b, d) The extracted representations. For all objects, a rectangular grid was roughly positioned on the object as in the first image (a.i).
In terms of analogy to the processing in area V1 of the mammalian visual system, C1 may be interpreted as the response of a certain column that indicates the general presence of a feature, whereas C2 represents the intercolumnar competition giving a more specific coding of this feature (Oram & Perrett, 1994). Therefore, the feature space is divided into locally distinct columns (PL1) in which related features (or features with close distance (PF2)) are represented.

5.2.2 One-Shot Learning. By positioning a rectangular grid on an object (see Figure 5a.i) in front of a homogeneous background and extracting significant features per instance, suitable representations of objects can be extracted. These representations have already been successfully applied to difficult discrimination tasks. For instance, for a difficult 10-class problem in hand posture classification, a recognition rate of 80% could be achieved with representations extracted from single images (see row 2 in Table 2).

5.2.3 Clustering. In the case of a nonhomogeneous background and uncontrolled illumination, one-shot learning would create representations with line segments corresponding to the background (see Figures 1c.i–iv). In this case,
Figure 6: Clustering. (a) Distribution of the significant features per instance extracted at a certain landmark. (b) Code book initialization. (c) Code book vectors after learning. (d) Substituting sets of code book vectors with small distance by their center of gravity. (e) Counting the number of elements within a certain radius. (f) Deleting code book vectors representing insignificant features.
a more sophisticated learning scheme has to be applied. After extracting the significant features per instance in different pictures, an algorithm that extracts invariant local features for a class of objects is applied. Here the task is the selection of the relevant features for the object class from the noisy features extracted from the training examples (see Figures 6b.i–iv and 1c.i–iv). A significant feature should be independent of background, illumination, or accidental qualities of a certain example of the object class; in other words, it should be invariant under these transformations (PE1). This is realized

Table 2: Matching Results for Hand Posture Recognition.

Row  Number of Representations  Representation  Approximation  Seconds (Transformation)  Seconds (Matching)  Recognition
1)   10                         Standard        No approx.     17.0                      9.5                 93%
2)   10                         One instance    Approx.        4.9                       12.4                80%
3)   10                         Bunch graph     –              0.9                       18.0                93%
4)   10                         Standard        No approx.     17.0                      9.5                 90%
5)   10                         Standard        Approx.        4.9                       9.5                 80%
6)   10                         Bunch graph     –              0.9                       18.0                65%
Figure 7: (a) Pictures for training. (b.i–iv) Significant features per instance describing, besides relevant information, accidental features such as background, shadow, or surface textures. (b.v) The learned representation.
within the algorithm by measuring the probability of occurrence of features in a local area of the feature space for different examples. The metric (PF2) allows the grouping of similar features into one bin, but it also allows the reduction of redundant information (PE2) by avoiding multiple similar features in the learned representation. In this way, it becomes possible to learn sparse object representations (PF4) in very difficult situations (see Figure 6). The correspondence problem (PL2) is assumed to be solved; that is, it is assumed that the position of certain landmarks of an object is known in pictures of different examples of these objects. In Figure 6, corresponding landmarks are determined manually; in Figure 1, this manual intervention is substituted by motor-controlled feedback.

The learning algorithm works as follows (illustrated for two dimensions in Figure 7). Let $\mathcal{I}$ be a set of pictures of different examples of a class of objects of a certain orientation and approximately equal size (see Figure 6a). $I(j, k)$ represents a local area in the $j$th image in $\mathcal{I}$ with the $k$th landmark as its center. Let $\vec s_{ijk}$ be the $i$th significant feature per instance extracted in the area $I(j, k)$. All $\vec s_{ijk}$ for a specific $k$ are collected in one set $S^k$. Then the LBG vector quantization algorithm (Linde, Buzo, & Gray, 1980) is applied to $S^k$ (see Figure 6b). After vector quantization, a code book $C^1$ expresses the vectors $\vec s_{ijk}$ by a certain number $n_{C^1}$ of code book vectors $\vec c^1_i \in C^1$, $i = 1, \ldots, n_{C^1}$ (see Figure 7b). The LBG algorithm reduces the distortion error—the average error occurring when all elements of $S^k$ are replaced by the nearest code book vector in $C^1$. In the case of high densities of elements $\vec s_{ijk}$ in $S^k$, it may be advantageous in terms of the distortion error to have code book vectors $\vec c$ and $\vec c\,'$ with small distance $d(\vec c, \vec c\,')$ (PF2). But the significant features for a certain class of objects are expected to express independent qualities (PL1); they are expected to have large distances in the feature space. Therefore, a smaller code book $C^2$ is constructed in which the $\vec c, \vec c\,' \in C^1$ with close distances are replaced by their center of gravity. Let $r_1 \in \mathbb{R}^+$ be fixed. For all $\vec c \in C^1$, the number of
$\vec c\,' \in C^1$ with distance $d(\vec c, \vec c\,') < r_1$ (see Figure 7c) is computed. If there exists at least one such $\vec c\,' \neq \vec c$, all the code book vectors in $C^1$ with $d(\vec c, \vec c\,') < r_1$ are substituted by their center of gravity (see Figure 7d). $C^2$ now represents a code book with fewer than or an equal number of elements as $C^1$, with redundant code book vectors eliminated. Now the important features for the $k$th landmark of a certain object can be defined as those code book vectors $\vec c \in C^2$ for which a certain percentage $p$ of the $\vec s_{ijk}$ exists with $d(\vec c, \vec s_{ijk}) < r_2 \in \mathbb{R}^+$ (see Figures 7e and 7f).
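The two post-quantization steps (merging close code book vectors into their centers of gravity, then keeping only vectors supported by a fraction p of the instance features) can be sketched as follows. The greedy grouping order and the function names are my own simplifications; features are plain tuples and any metric can be passed in.

```python
def merge_close_vectors(codebook, dist, r1):
    """Replace groups of code book vectors closer than r1 by their
    centers of gravity (sketch of the C1 -> C2 step)."""
    remaining = list(codebook)
    merged = []
    while remaining:
        c = remaining.pop(0)
        group = [c] + [c2 for c2 in remaining if dist(c, c2) < r1]
        remaining = [c2 for c2 in remaining if dist(c, c2) >= r1]
        # center of gravity of the group (componentwise mean)
        center = tuple(sum(v) / len(group) for v in zip(*group))
        merged.append(center if len(group) > 1 else c)
    return merged

def select_significant(codebook, instance_features, dist, r2, p):
    """Keep code book vectors supported by at least a fraction p of the
    significant features per instance within radius r2 (criterion PE1)."""
    kept = []
    for c in codebook:
        support = sum(1 for s in instance_features if dist(c, s) < r2)
        if support / max(len(instance_features), 1) >= p:
            kept.append(c)
    return kept
```

Features that occur in only one training example (background, shadows) gather little support across images and are discarded by the second step.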
5.2.4 Autonomous Learning. To achieve correspondence (PL2) and avoid manual intervention, the mechanism described above can be applied. The flexible grid can then be substituted by a rectangular grid, and the interaction of the camera and the motor-controlled feedback ensures that landmarks are positioned at corresponding pixel positions on the object (see Figure 1). Then the same learning algorithm as described in section 5.2 for manually defined landmarks can be applied (see Figure 1c.v for autonomously learned representations).

5.3 Matching. To use the learned representation for the location and classification of objects, elastic graph matching (EGM) (Lades et al., 1993; Wiskott et al., 1997) is used. To apply EGM, a similarity function between a graph labeled with the learned local line segments and a certain position in the image is defined. It simply averages local similarities. These local similarities express the system's confidence that a pixel in the image represents a certain landmark. The graph is adapted in position and scale by optimizing the total similarity. The graph with the highest similarity determines the size and position of the object within the image. In a nutshell, the local similarity is defined as follows: for each learned feature and pixel position in the image, it is simply checked whether the corresponding normalized filter response (see section 5.1) is high or low—that is, whether the corresponding feature is present or absent. Because of the sparseness (PF4) of the representation, only a few of these checks have to be made; therefore the matching is fast. Because it makes use of only the important features, the matching is efficient.
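The averaged local similarity that drives the matching can be sketched as follows; scale adaptation is omitted, the names are mine, and norm_resp_at stands in for the normalized filter responses of section 5.1.

```python
def graph_similarity(learned_features, norm_resp_at, position):
    """Average local similarity of a learned sparse representation
    placed at a given image position (core of the EGM similarity).

    learned_features: list of (dx, dy, feature_index) offsets relative
    to the graph origin; norm_resp_at(x, y, k) returns the normalized
    filter response in [0, 1] at an image position.
    """
    x0, y0 = position
    if not learned_features:
        return 0.0
    total = sum(norm_resp_at(x0 + dx, y0 + dy, k)
                for dx, dy, k in learned_features)
    return total / len(learned_features)

def best_position(learned_features, norm_resp_at, positions):
    """Scan candidate positions and return the best-matching one."""
    return max(positions,
               key=lambda p: graph_similarity(learned_features,
                                              norm_resp_at, p))
```

Because the representation is sparse, each evaluation touches only the few learned features, which is what makes the scan over candidate positions fast.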
5.3.1 Simulations. The test sets of hand postures contain images of 10 different hand postures: a first set in front of a homogeneous background with controlled illumination (see Table 2, rows 1–3; 240 images) and a second set containing images with an inhomogeneous background and varying illumination (rows 4–6; 200 images). Matching with 10 representations (one for each hand posture) takes 9.5 seconds; the recognition rate was 93% (first row). The simulations corresponding to the second row were performed with representations extracted by one-shot learning. The performance is still remarkably high (80%). The performance of the bunch graph approach, as described in Triesch and von der Malsburg (1996), is given in the third row. Results for the test set with uncontrolled background and illumination are shown in rows 4–6. For the first test set, the performance of the bunch graph approach (Wiskott et al., 1997; Triesch & von der Malsburg, 1996) is comparable to ORASSYLL. For the second and more difficult set, the performance of ORASSYLL is significantly better. Loos, Fritzke, and von der Malsburg (1998) performed face detection with binarized banana wavelets on a very large data set (more than 700 pictures) with size variation of faces between 40 and 60 pixels, an inhomogeneous background, and uncontrolled illumination. For this set, performance was 95%. For the problem of face finding, it has been demonstrated (Krüger, 1998b) that performance could be increased from 54% to 77% compared to the bunch graph approach on an extremely difficult test set, with a significant speed-up. Matching could also be performed successfully with cans, toys, and other objects. Especially in the case of uncontrolled illumination and an inhomogeneous background, significant improvement compared to Lades et al. (1993) and Wiskott et al. (1997) could be achieved. Furthermore, in Krüger (1998b) and Krüger and Peters (2000), ORASSYLL was compared extensively to the older system (Lades et al., 1993; Wiskott et al., 1997) as well as to a number of other object recognition systems.
6 Conclusion and Outlook
ORASSYLL is founded on reflections about the necessity, structure, and amount of a priori knowledge an artificial vision system might require. Genetically determined structures of the human visual system and findings of developmental psychology supported the definition of predetermined structural constraints within ORASSYLL. In contrast to model-free methods, within ORASSYLL the input is organized within a perception-action cycle and transformed into a highly structured feature space. Learning is not only based on trial and error but is guided by internal statistical criteria. As a result of this controlled application of a priori knowledge, and in contrast to approaches applying a manually designed object representation (e.g., Yuille, 1991), model-based representations can be extracted autonomously or with only a little manual intervention. These representations were applied to difficult discrimination tasks. An important problem remains the integration of higher stages of object representation, about which much less is known than about striate cortex (for an overview see, e.g., Hoffman, 1980). It is possible that a key to formalizing these higher levels of visual processing is finding and formalizing appropriate a priori constraints. This will be a challenging task for ongoing and future research.
Acknowledgments
I thank Christoph von der Malsburg, Laurenz Wiskott, Michael Pötzsch, Niklas Lüdtke, Wolfgang Banzhaf, Ladan Shams, Peter Hancock, and Helge Ritter for fruitful discussions. This work was supported by grants from the German Ministry for Science and Technology 01IN504E9 (NEUROS), 01M3021A4 (Electronic Eye), and the EC Human Capital and Mobility Program (ERBCHRX CT930097).
References

Abu-Mostafa, Y. (1995). Hints. Neural Computation, 7, 639–671.
Alpaydin, E., & Jordan, M. (1996). Local linear perceptrons for classification. IEEE Transactions on Neural Networks, 7(3), 788–792.
Arbib, M. (1994). The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Atkinson, J., & Braddick, O. (1989). Development of basic visual functions. In A. Slater & G. Bremner (Eds.), Infant development (pp. 7–41). Hillsdale, NJ: Erlbaum.
Barlow, H. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1996). Edges are the "independent components" of natural scenes. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Blakemore, C., & Cooper, G. (1970). Development of the brain depends on the visual environment. Nature, 228, 477–478.
Bower, T. (1971). The object in the world of the infant. Scientific American, 225, 30–38.
Cootes, T., Taylor, C., Cooper, D., & Graham, J. (1995). Active shape models—their training and application. Computer Vision and Image Understanding, 61, 33–59.
Creutzfeld, O. (1977). Generality of the functional structure of the neocortex. Naturwissenschaften, 64, 507–517.
Daugman, J. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by 2D visual cortical filters. Journal of the Optical Society of America, 2(7), 1160–1169.
Field, D. (1994). What is the goal of sensory coding? Neural Computation, 6(4), 561–601.
Fisher, A. (Ed.). (1923). The mathematical theory of probabilities. New York: Macmillan.
Geman, S., Bienenstock, E., & Doursat, R. (1995). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Gödecke, I., & Bonhoeffer, T. (1996). Development of identical orientation maps for two eyes without common visual experience. Nature, 379, 251–255.
Goren, C. (1975). Form perception, innate form preferences and visually mediated head-turning in the human newborn. Paper presented at the Biennial Meeting of the Society for Research in Child Development, Denver.
Hoffman, D. (Ed.). (1980). Visual intelligence: How we create what we see. New York: Norton.
Hubel, D., & Wiesel, T. (1979). Brain mechanisms of vision. Scientific American, 241, 130–144.
Jones, J., & Palmer, L. (1987). An evaluation of the two dimensional Gabor filter model of simple receptive fields in striate cortex. Journal of Neurophysiology, 58(6), 1223–1258.
Jones-Molfese, V. (1992). Responses of neonates to colored stimuli. Child Development, 48, 1092–1095.
Kant, I. (1781). Kritik der reinen Vernunft.
Knudsen, E., Lac, S., & Esterly, S. (1987). Computational maps in the brain. Annual Review of Neuroscience, 10, 41–65.
Koenderink, J. (1992). Wechsler's vision: An essay review of computational vision by Harry Wechsler. Ecological Psychology, 4, 121–128.
Krüger, N. (1997). An algorithm for the learning of weights in discrimination functions using a priori constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 764–768.
Krüger, N. (1998a). Collinearity and parallelism are statistically significant second-order relations of complex cell responses. Neural Processing Letters, 8(2), 117–129.
Krüger, N. (1998b). Visual learning with a priori constraints. Doctoral dissertation, University of Bielefeld. Aachen: Shaker Verlag.
Krüger, N., & Peters, G. (2000). ORASSYLL: Object recognition with autonomously learned and sparse symbolic representations based on metrically organized local line detectors. Computer Vision and Image Understanding, 77, 48–77.
Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3), 300–311.
Lanitis, A., Taylor, C., & Cootes, T. (1997). Automatic interpretation and coding of face images using flexible models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 743–756.
Linde, Y., Buzo, A., & Gray, R. (1980). An algorithm for vector quantizer design. IEEE Transactions on Communications, COM-28, 84–95.
Loos, H., Fritzke, B., & von der Malsburg, C. (1998). Positionsvorhersage von bewegten Objekten in großformatigen Bildsequenzen. In S. Posch & H. Ritter (Eds.), Proceedings in Artificial Intelligence: Dynamische Perzeption (pp. 31–38). Bielefeld: Infix Verlag.
Oram, M., & Perrett, D. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7, 945–972.
Palm, G. (1980). On associative memory. Biological Cybernetics, 36, 19–31.
Phillips, W. A., & Singer, W. (1997). In search of common foundations for cortical processing. Behavioral and Brain Sciences, 20(4), 657–682.
Pötzsch, M., Krüger, N., & von der Malsburg, C. (1996). Improving object recognition by transforming Gabor filter responses. Network: Computation in Neural Systems, 7, 341–347.
Rauh, H. (1995). Frühe Kindheit. In R. Oerter & L. Montada (Eds.), Entwicklungspsychologie (3rd ed., pp. 167–248). Munich: Psychologie Verlags Union.
Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Sommer, G. (1997). Algebraic aspects of designing behaviour based systems. In G. Sommer & J. Koenderink (Eds.), Algebraic frames for the perception and action cycle (pp. 1–28). Berlin: Springer-Verlag.
Spelke, E. (1993). Principles of object perception. Cognitive Science, 14, 29–56.
Sur, M., Garraghty, P., & Roe, A. (1988). Experimentally induced visual projections into auditory thalamus and cortex. Science, 242, 1437–1441.
Tanaka, K. (1993). Neuronal mechanisms of object recognition. Science, 262, 685–688.
Triesch, J., & von der Malsburg, C. (1996). Robust classification of hand postures against complex background. In Proceedings of the Second International Workshop on Automatic Face and Gesture Recognition, Vermont (pp. 170–175).
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Wiesel, T., & Hubel, D. (1974). Ordered arrangement of orientation columns in monkeys lacking visual experience. Journal of Comparative Neurology, 158, 307–318.
Wiskott, L., Fellous, J., Krüger, N., & von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 775–780.
Yuille, A. L. (1991). Deformable templates for face recognition. Journal of Cognitive Neuroscience, 3(1), 59–70.

Received June 25, 1998; accepted March 23, 2000.
LETTER
Communicated by Pentti Kanerva
Binding and Normalization of Binary Sparse Distributed Representations by Context-Dependent Thinning

Dmitri A. Rachkovskij
V. M. Glushkov Cybernetics Center, Pr. Acad. Glushkova 40, Kiev 22, 252022, Ukraine

Ernst M. Kussul
Centro de Instrumentos, Universidad Nacional Autonoma de Mexico, 04510 Mexico D.F., Mexico

Distributed representations have often been criticized as inappropriate for encoding data with a complex structure. However, Plate's holographic reduced representations and Kanerva's binary spatter codes are recent schemes that allow on-the-fly encoding of nested compositional structures by real-valued or dense binary vectors of fixed dimensionality. In this article we consider procedures of context-dependent thinning, developed for the representation of complex hierarchical items in the architecture of associative-projective neural networks. These procedures provide binding of items represented by sparse binary codevectors (with a low probability of 1s). Such an encoding is biologically plausible and allows a high storage capacity of the distributed associative memory where the codevectors may be stored. In contrast to known binding procedures, context-dependent thinning preserves the same low density (or sparseness) of the bound codevector for a varied number of component codevectors. Moreover, a bound codevector is similar not only to another bound codevector with similar component codevectors (as in other schemes) but also to the component codevectors themselves. This allows the similarity of structures to be estimated from the overlap of their codevectors, without retrieval of the component codevectors; it also makes retrieval of the component codevectors easy. Examples of algorithmic and neural network implementations of the thinning procedures are considered.
We also present representation examples for various types of nested structured data (propositions using role-filler and predicate-arguments schemes, trees, and directed acyclic graphs) using sparse codevectors of fixed dimension. Such representations may provide a fruitful alternative to the symbolic representations of traditional artificial intelligence, as well as to localist and microfeature-based connectionist representations.
Neural Computation 13, 411–452 (2001)
© 2001 Massachusetts Institute of Technology
1 Introduction
The problem of representing nested compositional structures is important for connectionist systems, because hierarchical structures are required for an adequate description of real-world objects and situations. In fully local representations, an item (entity, object) of any complexity level is represented by a single unit (node, neuron), or by a set of units that shares no units with other items. Such representations are similar to symbolic ones and share their drawbacks, among them the limitation of the number of representable items to the number of available units in the pool, and therefore the impossibility of representing the combinatorial variety of real-world objects. Moreover, a unit corresponding to a complex item represents only its name and pointers to its components (constituents). Therefore, in order to determine the similarity of complex items, they must be unfolded into base-level (indecomposable) items.

The attractiveness of distributed representations was emphasized by the paradigm of cell assemblies (Hebb, 1949), which influenced the work of Marr (1969), Willshaw (1981), Palm (1980), Hinton, McClelland, and Rumelhart (1986), Kanerva (1988), and many others. In fully distributed representations, an item of any complexity level is represented by its configuration pattern over the whole pool of units. For binary units, this pattern is the subset of units that are in the active state. If the subsets corresponding to various items may intersect, then the number of such subsets is much greater than the number of units in the pool, offering a solution to the problem of the information capacity of representations. If similar items are represented by similar subsets of units, the degree of intersection of the corresponding subsets can serve as the measure of their similarity.
The potentially high information capacity of distributed representations provides hope for representing a combinatorially growing number of recursive compositional items in a reasonable number of bits. Representing composite items by concatenating the activity patterns of their component items would increase the dimensionality of the coding pool. If the component items are encoded in pools of equal dimensionality, one can instead try to represent composite items as a superposition of the activity patterns of their components; the resulting coding pattern then has the same dimensionality. However, another problem arises here, known as the "superposition catastrophe" (e.g., von der Malsburg, 1986), or as "ghosts" and "false" or "spurious" memories (e.g., Feldman & Ballard, 1982; Hopfield, 1982; Hopfield, Feinstein, & Palmer, 1983). A simple example is as follows. Let there be component items a, b, c and composite items ab, ac, cb. Let us represent two of the composite items, say ac and cb, by superimposing the activity patterns of the component items a and c, and of c and b. The ghost item ab also becomes represented in the result, though it was never presented. In the "superposition catastrophe" formulation, the problem
consists in there being no way of telling which two items (ab and ac, or ab and cb, or ac and cb) make up the representation of the composite pattern abc, in which the patterns of all three component items are activated.

The supposition that distributed representations (or assemblies) have no internal structure (Legendy, 1970; von der Malsburg, 1986; Feldman, 1989; see also Milner, 1996) held back their use for the representation of complex data structures. The problem is to represent in distributed fashion not only the information on the set of base-level components making up a complex hierarchical item, but also the information on the combinations in which they meet, the grouping of those combinations, and so forth. That is, mechanisms were needed for binding together the distributed representations of particular items at various hierarchical levels. One approach to binding is based on temporal synchronization of constituent activation (Milner, 1974; von der Malsburg, 1981, 1985; Shastri & Ajjanagadde, 1993; Hummel & Holyoak, 1997). Although this mechanism may be useful within a single level of composition, its capacity to represent and store complex items with multiple levels of nesting is questionable. Here we consider binding mechanisms based on the activation of specific subsets of coding units corresponding to a group (combination) of items, mechanisms that are closer to the so-called conjunctive coding approach (Smolensky, 1990; Hummel & Holyoak, 1997). The "extra units" considered by Hinton (1981) represent various combinations of active units of two or more distributed patterns. The extra units can be considered binding units that encode various combinations of distributedly encoded items by distinct distributed patterns. Such a representation of bound items is generally described by tensor products (Smolensky, 1990) and requires the number of binding units to grow exponentially with the number of bound items.
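The ghost-pattern example above can be sketched in a few lines, representing codevectors as sets of active units (the unit indices and tiny pattern sizes are illustrative, not from the article):

```python
# Sketch of the "superposition catastrophe": superimposing distributed
# patterns by set union loses the information about which items were
# paired.  Unit indices and pattern sizes are illustrative only.

a = {0, 1, 2}   # units active for item a
b = {3, 4, 5}   # units active for item b
c = {6, 7, 8}   # units active for item c

# Store the composite items ac and cb by superposition (disjunction).
memory = (a | c) | (c | b)

# The ghost composite ab is now fully contained in the superposition,
# even though it was never presented.
assert (a | b) <= memory
print("ghost ab is present:", (a | b) <= memory)
```

The union keeps only the set of participating components, so any composite built from those components looks "stored".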
For recursive structures, however, it is desirable that the dimensionality of the patterns representing composite items be the same as that of the component items' patterns. Moreover, the most important property of distributed representations, their similarity for similar structures, should be preserved. The problem was discussed by Hinton (1990), and a number of mechanisms for constructing his "reduced descriptions" have been proposed. Hinton (1990), Pollack (1990), and Sperduti (1994) obtain the reduced description of a composite pattern as the result of training a multilevel perceptron with the back-propagation algorithm; however, their patterns are low-dimensional. Plate (1991, 1995) binds high-dimensional patterns with graded (real-valued) elements on the fly, without increasing the dimensionality, using the operation of circular convolution. Kanerva (1996) uses bitwise XOR to bind binary vectors with equal probabilities of 0s and 1s. Binary distributed representations are especially attractive, because bitwise binary operations suffice to handle them, allowing significant simplification and acceleration of algorithmic implementations.
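Kanerva-style XOR binding of dense binary vectors is easy to sketch using Python integers as bit vectors; the dimensionality and seed are illustrative assumptions:

```python
import random

N = 10000          # dimensionality (illustrative)
random.seed(0)

def dense_vector(n=N):
    """Random dense binary vector (p(1) = 0.5), stored as a Python int."""
    return random.getrandbits(n)

x = dense_vector()
y = dense_vector()

# Binding by bitwise XOR keeps the dimensionality and the ~0.5 density
# of the operands, and is its own inverse.
z = x ^ y
assert z ^ y == x          # unbinding with y recovers x
assert z ^ x == y          # unbinding with x recovers y

density = bin(z).count("1") / N
print("density of bound vector:", round(density, 2))
```

Note that, in contrast to the sparse codes of this article, the bound vector here has density near 0.5 and is dissimilar to both of its components.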
Sparse binary representations (with a small fraction of 1s) are of special interest. The sparseness of codevectors allows a high storage capacity of the distributed associative memories (Willshaw, Buneman, & Longuet-Higgins, 1969; Palm, 1980; Lansner & Ekeberg, 1985; Amari, 1989) that can be used to store them, and permits further acceleration of software and hardware implementations (e.g., Palm & Bonhoeffer, 1984; Kussul, Rachkovskij, & Baidyk, 1991a; Palm, 1993). Sparse encoding also has neurophysiological correlates (Foldiak & Young, 1995). A procedure for binding sparse distributed representations (the "normalization procedure") was proposed by Kussul as one of the features of the associative-projective neural networks (Kussul, 1988, 1992; Kussul, Rachkovskij, & Baidyk, 1991a). In this article, we describe various versions of such a procedure and its possible neural network implementations, and we provide examples of its use for the encoding of complex structures.

In section 2 we discuss representational problems encountered in the associative-projective neural networks (APNN) and an approach to their solution. In section 3 the requirements on the context-dependent thinning procedure for binding and normalization of binary sparse codes are formulated. In section 4 several versions of the thinning procedure, along with their algorithmic and neural network implementations, are described. Some generalizations and notations are given in section 5. In section 6 retrieval of individual constituent codes from the composite item code is considered. In section 7 the similarity characteristics of codes obtained by context-dependent thinning procedures are examined. In section 8 we show examples of encoding various structures using context-dependent thinning. Related work and a general discussion are presented in section 9, and conclusions are given in section 10.

2 Representation of Composite Items in the APNN: The Problems and the Answer

2.1 Features of the APNN.
The APNN is the name of a neural network architecture proposed by Kussul in 1983 for AI problems that require efficient manipulation of hierarchical data structures. Fragments of the architecture, implemented in software and hardware, have also been used for solving pattern-recognition tasks (Kussul, 1992, 1993). The APNN features of interest here are as follows (Kussul, 1992; Kussul et al., 1991a; Amosov et al., 1991):
Items of any complexity (an elementary feature, a relation, a complex structure, etc.) are represented by stochastic distributed activity patterns over the neuron field (pool of units). The neurons are binary, and therefore the patterns of activity are binary vectors.
Items of any complexity are represented over neural fields of the same, high dimensionality N. The number M of active neurons in the representations of items of various complexity is approximately (statistically) the same and is small compared to the field dimensionality N. However, M is large enough to maintain statistical stability. Items of various complexity levels are stored in different distributed autoassociative neural network memories with the same number N of neurons.

Thus, items of any complexity are encoded by sparse distributed stochastic patterns of binary neurons in neural fields of the same dimensionality N. It is convenient to represent activity patterns in the neural fields as binary vectors, where 1s correspond to active neurons. We use boldface lowercase letters for codevectors to distinguish them from the items they represent, denoted in italics. The number of 1s in x is denoted |x|. We seek to make |x| ≈ M for x of various complexity. The similarity of codevectors is determined by the number of 1s in their intersection or overlap, |x ∧ y|, where ∧ is elementwise conjunction of x and y. The probability of 1s in x (the density of 1s in x, or simply the vector density) is p(x) = |x| / N.

Information encoding by stochastic binary vectors with a small number of 1s allows a high capacity of correlation-type neural network memory (known as Willshaw memory or the Hopfield network) to be reached using the Hebbian learning rule (Willshaw et al., 1969; Palm, 1980; Lansner & Ekeberg, 1987, 1988; Frolov & Muraviev, 1987, 1988; Frolov, 1989; Amari, 1989; Tsodyks, 1989). The codevectors we are talking about may be exemplified by vectors with M = 100, . . . , 1000 and N = 10,000, . . . , 100,000.
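The notation above can be illustrated with a minimal sketch, assuming codevectors stored as sets of active-unit indices (the values of M and N follow the example ranges in the text):

```python
import random

N = 10000      # field dimensionality
M = 100        # number of active neurons per codevector
random.seed(1)

def codevector(n=N, m=M):
    """Sparse stochastic binary codevector as a set of active-unit indices."""
    return set(random.sample(range(n), m))

x = codevector()
y = codevector()

density = len(x) / N         # p(x) = |x| / N
overlap = len(x & y)         # similarity |x ∧ y|

assert density == M / N      # 0.01
# Independent sparse codevectors overlap very little on average
# (expected overlap M*M/N = 1 here).
print("density:", density, "overlap:", overlap)
```

With M ≈ √N, as chosen below, independent codevectors are nearly orthogonal, so overlap is a usable similarity measure.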
Although the maximal storage capacity is reached at M = log N (Willshaw et al., 1969; Palm, 1980), we use M ≈ √N to get a network with a moderate number N of neurons and N² connections at sufficient statistical stability of M (e.g., a standard deviation of M of less than 3%). Under this choice of the codevector density, p = M / N ≈ 1 / √N, the information capacity remains high enough, and the number of stored items can exceed the number of neurons in the network (Rachkovskij, 1990a, 1990b; Baidyk, Kussul, & Rachkovskij, 1990).

Let us consider the problems arising in the APNN and other neural network architectures with distributed representations when composite items are constructed.

2.2 Density of Composite Codevectors. The number H of component items (constituents) comprising a composite item grows exponentially with the nesting level, that is, with going to higher levels of the part-whole hierarchy. If S items of level l constitute an item of the adjacent higher level (l + 1), then for level (l + L) the number H becomes

H = S^L.  (2.1)

The presence of several items comprising a composite item is encoded by the concurrent activation of their patterns, that is, by superposition of their codevectors. For binary vectors, we use superposition by bitwise disjunction. Let us denote composite items by concatenation of the symbols denoting their component items, for example, abc. The corresponding composite codevectors (a_i ∨ b_i ∨ c_i, i = 1, . . . , N) will be denoted a ∨ b ∨ c, or simply abc. Construction of composite items is accompanied by fast growth of the density p′ and the respective number M′ of 1s in their codevectors. For H different superimposed codevectors of low density p:

p′_H = 1 − (1 − p)^H ≈ 1 − e^(−pH),  (2.2)

M′_H ≈ p′_H N.  (2.3)
Equations 2.2 and 2.3 take into account the absorption of coincident 1s, which prevents exponential growth of their number with the composition level L. It is important, however, that p′ ≫ p (see Figure 1) and M′ ≫ M. Since the dimensionality N of the codevectors representing items of various complexity is the same, the size of the corresponding distributed autoassociative neural networks, where the codevectors are stored and recalled, is also the same. Therefore at M′ ≫ M ≈ √N (at the higher levels of the hierarchy), their storage capacity in terms of the number of recallable codevectors decreases dramatically. To maintain high storage capacity at each level, M′ should not substantially exceed M. On the other hand, because of the requirement of statistical stability, the number of 1s in the code cannot be reduced significantly. Moreover, the operation of distributed autoassociative memory usually assumes the same number of 1s in all codevectors. Thus, it is necessary to keep the number of 1s in the codevectors of complex items approximately equal to M. (However, some variation of M between distinct hierarchical levels may be tolerable and even desirable.) This is one reason that composite items should be represented not by all 1s of their component codevectors, but only by a fraction of them, approximately M in total (i.e., only by some M representatives of the active neurons encoding the components).

2.3 Ghost Patterns and False Memories. The well-known problem of ghost patterns, or the superposition catastrophe, was mentioned in section 1. It consists of losing the information on the membership of component codevectors in a particular composite codevector when several composite codevectors are superimposed in their turn.
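Equations 2.2 and 2.3 can be checked numerically; the base density p below is an illustrative assumption:

```python
import math

# Growth of composite-codevector density under disjunctive superposition
# (equations 2.2 and 2.3): p'_H = 1 - (1 - p)^H ≈ 1 - exp(-p*H).

def composite_density(p, H):
    """Exact density of the disjunction of H independent codevectors."""
    return 1.0 - (1.0 - p) ** H

p = 0.01   # base-level density (illustrative)
for H in (1, 10, 100, 1000):
    exact = composite_density(p, H)
    approx = 1.0 - math.exp(-p * H)
    # M'_H ≈ p'_H * N active units for a field of N neurons.
    print(f"H={H:5d}  exact={exact:.3f}  approx={approx:.3f}")
```

Already at H = 100 components of density 0.01, the composite density rises to about 0.63, far above the base density, which is exactly the M′ ≫ M problem described above.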
Figure 1: Growth of the density p′ of 1s in the composite codevectors of higher hierarchical levels (see equations 2.1 and 2.2). Here each codevector of the higher level is formed by bitwise disjunction of S codevectors of the preceding level. Items at each level are uncorrelated. The codevectors of base-level items are independent, with density of 1s equal to p. The number of hierarchical levels is L. (For any given number of base-level items, the total number of 1s in the composite codevectors is obviously limited by the number of 1s in the disjunction of the base-level codevectors.)
This problem is due to an essential property of the superposition operation: the contribution of each member to the superposition does not depend on the contributions of the other members. For superposition by elementwise disjunction, the representation of a in a ∨ b and in a ∨ c is the same. The result of superposing several base-level component codevectors contains only the information concerning the participating components, and no information about the combinations in which they meet. Therefore, if common items are constituents of several composite items, the combination of the latter generally cannot be inferred from their superposition codevector. For example, let a complex composite item consist of the base-level items a, b, c, d, e, f. How could one determine that it really consists of the composite items abd, bce, caf, if there may also be other items, such as abc and def? In the formulation with false or spurious patterns, superposition of the composite patterns abc and def generates the false patterns ("ghosts") abd, bce, caf, and so forth. The problem of introducing "false assemblies" or "spurious memories" (unforeseen attractors) into a neural network (e.g., Kussul, 1980; Hopfield et al., 1983; Vedenov, 1987, 1988) has the same origin as the problem of ghosts. Training of an associative memory of matrix type is usually performed using some version of the Hebbian learning rule, implemented by superimposing in
Figure 2: Hatched circles represent patterns of active units encoding items; formed connections are plotted as arrowed lines. (A) Formation of a false assembly. When three assemblies (abd, bce, acf) are consecutively formed in a neural network by connecting all active units of the patterns encoding their items, a fourth assembly, abc (grid hatch), is formed as well, though its pattern was never explicitly presented to the network. (B) Prevention of a false assembly. If each of the three assemblies is formed by connecting only subsets of the active units encoding the component items, then the connectivity of the false assembly is weak. x_yz denotes the subset of units encoding item x when it is combined with items y and z. The pairwise intersections of the small circles represent the false assembly.
the weight matrix the outer products of the memorized codevectors. For binary connections, for example,

W′_ij = W_ij ∨ x_i x_j,  (2.4)

where x_i and x_j are the states of the ith and jth neurons when the pattern x to be memorized is presented (i.e., the values of the corresponding bits of x), W_ij and W′_ij are the connection weights between the ith and jth neurons before and after training, respectively, and ∨ stands for disjunction. When this learning rule is used sequentially to memorize several composite codevectors with partially coinciding components, false assemblies (attractors) may appear, that is, memorized composite codevectors that were never presented to the network. For example, when representations of the items abd, bce, caf are memorized, the false assembly abc (an unforeseen attractor) is formed in the network (see Figure 2A). Moreover, various two-item assemblies (such as ab and ad) are present, which also were not explicitly presented for storage. The problem of introducing false assemblies can be avoided if nondistributed associative memory is used, where the patterns are not superimposed when stored and each composite codevector is placed into a separate memory word. However, the problem of false patterns, or the superposition catastrophe, still persists.
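A minimal sketch of this false-assembly effect, with the binary weight matrix of equation 2.4 stored as a set of connections (unit indices and two-unit item patterns are illustrative):

```python
# False-assembly formation under the binary Hebbian rule of equation 2.4
# (W'_ij = W_ij ∨ x_i x_j).  Patterns are sets of active units; the item
# labels follow the abd / bce / caf example in the text.

def store(W, pattern):
    """Superimpose the outer product of a binary pattern into W."""
    for i in pattern:
        for j in pattern:
            if i != j:
                W.add((i, j))

def fully_connected(W, pattern):
    """True if every pair of units in the pattern is connected."""
    return all((i, j) in W for i in pattern for j in pattern if i != j)

a, b, c, d, e, f = {0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}, {10, 11}

W = set()
for composite in (a | b | d, b | c | e, c | a | f):
    store(W, composite)

# The never-presented composite abc is nevertheless a fully connected
# (false) assembly: each pair of its units was linked via some stored item.
assert fully_connected(W, a | b | c)
print("false assembly abc formed:", fully_connected(W, a | b | c))
```

The ab pairs come from abd, the bc pairs from bce, and the ca pairs from caf, so abc inherits full connectivity without ever being stored, as in Figure 2A.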
2.4 The Idea of the Thinning Procedure. A systematic use of distributed representations provides the prerequisite for solving both the problem of codevector density growth and the superposition catastrophe. The idea of the solution is to include in the representation of a composite item not the full sets of 1s encoding its component items, but only subsets of them. If we choose the fraction of 1s from each component codevector so that the number of 1s in the codevector of a composite item is equal to M, then the density of 1s will be preserved in codevectors of any complexity. For example, if S = 3 items of level l comprise an item of level l + 1, then approximately M/3 of the 1s should be preserved from each codevector of the lth level; the codevector of level l + 1 will then have approximately M 1s. If two items of level l + 1 comprise an item of level l + 2, then approximately M/2 of the 1s should be preserved from each codevector of level l + 1. Thus, the low number M of 1s in the codevectors of composite items of any complexity is maintained, and therefore the high storage capacity of the distributed autoassociative memories where these low-density codevectors are stored can be maintained as well (see also section 2.1). Hence the component codevectors are represented in the codevector of the composite item in reduced form, by a fraction of their 1s.

The idea that items of higher hierarchical levels ("floors") should contain their components in reduced, compressed, coarse form is well accepted among those concerned with diverse aspects of artificial intelligence research. The reduced representation of component codevectors in the codevectors of composite items realized in the APNN may be related to the "coarsen models" of Amosov (1967), the "reduced descriptions" of Hinton (1990), and the "conceptual chunking" of Halford, Wilson, and Phillips (1998).
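A rough sketch of keeping ~M/S of the 1s from each of S components, assuming the surviving 1s are chosen by plain random sampling rather than by the context-dependent procedures introduced in section 4 (all parameters illustrative):

```python
import random

# Thinning idea: include only ~M/S of the 1s of each of the S component
# codevectors, so the composite keeps ~M active units.  A deterministic,
# context-dependent choice of *which* 1s survive is what the CDT
# procedures provide; here the subsets are picked at random, for
# illustration only.

N, M, S = 10000, 99, 3
random.seed(2)

components = [set(random.sample(range(N), M)) for _ in range(S)]

composite = set()
for comp in components:
    # Keep M/S of the 1s of each component (sorted() for determinism).
    composite |= set(random.sample(sorted(comp), M // S))

# The composite keeps roughly M ones (exactly M here, minus rare
# coincidences absorbed by the union) ...
assert len(composite) <= M
# ... and remains similar to every component (unstructured similarity).
assert all(len(composite & comp) >= M // S for comp in components)
print("composite size:", len(composite))
```

What this sketch lacks is exactly the binding property: the random subsets do not depend on the other components, so it only illustrates the density bookkeeping.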
Reduced representation of component codevectors in the codevectors of composite items also allows a solution of the superposition catastrophe. If the subset of 1s that each component codevector contributes to the codevector of a composite item depends on the composition of the component items, then different subsets of 1s from each component codevector will be found in the codevectors of different composite items. For example, nonidentical subsets of the 1s of a will be incorporated into the codevectors of the items abc and acd. The component codevectors are thereby bound together by the subsets of 1s delegated to the codevector of the composite item, and this hinders the occurrence of false patterns and assemblies. For the example from section 1, when both ac and cb are present, we get the following overall composite codevector: a_c ∨ c_a ∨ c_b ∨ b_c, where x_y stands for the subset of 1s in x that becomes incorporated into the composite codevector given y as the other component. Therefore, if a_c ≠ a_b and b_c ≠ b_a, we do not observe the ghost pattern a_b ∨ b_a in the resultant codevector. For the example of Figure 2A, where false assemblies emerge, they do not emerge under reduced representation of items (see Figure 2B). Now
interassembly connections are formed between different subsets of active neurons that have a relatively small intersection. Therefore, the connectivity of the assembly corresponding to the nonpresented item abc is low.

That the codevector of a composite item contains subsets of 1s from the component codevectors preserves the information on the presence of the component items in the composite item. That the composition of each subset of 1s depends on the presence of the other component items preserves the information on the combinations in which the component items occurred. That the codevector of a composite item has approximately the same number of 1s as its component codevectors allows combinations of such composite codevectors to be used for the construction of still more complex codevectors of higher hierarchical levels. Thus, an opportunity emerges to build up codevectors of items of any composition level that contain information not only on the presence of their components, but on the structure of their combinations as well. This makes it possible to estimate the similarity of complex structures without unfolding them, simply as the overlap of their codevectors, which many authors consider a very important property for AI systems (e.g., Kussul, 1992; Hinton, 1990; Plate, 1995, 1997).

Originally, the procedure reducing the sets of coding 1s of each item from the group that makes up a composite item was called normalization (Kussul, 1988, 1992; Kussul & Baidyk, 1990). That name emphasized the property of keeping the number of 1s in the codes of composite items equal to that of component items. In this article, however, we call it context-dependent thinning (CDT), after its action mechanism, which reduces the number of 1s taking into account the context of the other items in the group.

3 Requirements on the Context-Dependent Thinning Procedures
Let us summarize the requirements on the CDT procedures and on the characteristics of the codevectors produced by them. The procedures should process sparse binary codevectors; an important case of input is superimposed component codevectors. The procedures should output the codevector of the composite item, in which the component codevectors are bound and whose density is comparable to the density of the component codevectors. Let us call the resulting (output) codevector the thinned codevector. The requirements may be expressed as follows:

Determinism. Repeated application of the CDT procedures to the same input should produce the same output.

Variable number of inputs. The procedure should process one, two, or several codevectors. One important case of input is a vector in which several component codevectors are superimposed.
Sampling of inputs. Each component codevector of the input should be represented in the output codevector by a fraction of its 1s (or a reversible permutation of them).

Proportional sampling. The number of 1s representing each input component codevector in the output codevector should be proportional to its density. If the numbers of 1s in a and b are the same, then the numbers of 1s from a and b in the thinned ab should also be (approximately) the same.

Uniform low density. The CDT procedures should maintain an (approximately) uniform low density of output codevectors (a small number M′ of 1s) under a varied number of input codevectors and varied degrees of their correlation.

Density control. The CDT procedures should be able to control the number M′ of 1s in output codevectors within some range around M (the number of 1s in the component codevectors). In one important special case, M′ = M.

Unstructured similarity. An output codevector of the CDT procedures should be similar to each component codevector of the input (or to its reversible permutation). Fulfillment of this requirement follows from fulfillment of the sampling-of-inputs requirement. The thinned codevector for ab is similar to a and to b. If the densities of the component codevectors are the same, the magnitude of similarity is the same (as follows from the requirement of proportional sampling).

Similarity of subsets. The reduced representations of a given component codevector should be similar to each other to a degree that varies directly with the similarity of the sets of other codevectors with which it is composed. The representation of a in the thinned abc should be more similar to its representation in the thinned abd than to that in the thinned aef.

Structured similarity. If two sets (collections) of component items are similar, their thinned codevectors should be similar as well. This follows from the similarity-of-subsets requirement.
If a and a′ are similar and b and b′ are similar, then the thinned ab should be similar to the thinned a′b′, and the thinned abc should be similar to the thinned abd.

Binding. The representation of a given item in a thinned codevector should be different for different sets (collections) of component items. The representation of a in the thinned abc should be different from the representation of a in the thinned abd. Thus, the representation of a in the thinned composite codevector contains information on the other components of the composite item.
4 Versions of the Context-Dependent Thinning Procedures
Let us consider some versions of the CDT procedure, their properties, and their implementations.

4.1 Direct Conjunctive Thinning of Two or More Codevectors. Direct conjunctive thinning of binary x and y is implemented as their elementwise conjunction:

z = x ∧ y,  (4.1)
where z is the thinned and bound result. The requirement of determinism holds for the direct conjunctive thinning procedure. The requirement of a variable number of inputs is not met, since exactly two codevectors are thinned. The overlapping 1s of x and y go to z; therefore the sampling-of-inputs requirement holds. Since equal numbers of 1s from x and y enter z even if x and y are of different density, the requirement of proportional sampling is not fulfilled in general. For stochastically independent vectors x and y, the density of the resulting vector z is

p(z) = p(x) p(y) < min(p(x), p(y)) < 1.  (4.2)
Here min() selects the smallest of its arguments. Note that for correlated x and y, the density of 1s in z depends on the degree of their correlation. Thus, p(z) is maintained the same only for independent codevectors of constant density, and the requirement of uniform low density is generally not met. Since p(z) for sparse vectors is substantially lower than p(x) and p(y), the requirement of density control is not met, and recursive construction of bound codevectors is not supported (see also Kanerva, 1998; Sjödin, Kanerva, Levin, & Kristoferson, 1998). The similarity and binding requirements may be considered partially satisfied for two codevectors (see also Table 1).

Although the operation of direct conjunctive thinning of two codevectors does not meet all the requirements on the CDT procedure, we have applied it for encoding external information, in particular for binding the distributed binary codevectors of a feature item and its numerical value (Kussul & Baidyk, 1990; Rachkovskij & Fedoseyeva, 1990; Artykutsa, Baidyk, Kussul, & Rachkovskij, 1991; Kussul et al., 1991a, 1991b). The density p of the codevectors of features and numerical values was chosen so as to provide a specified density p′ of the resulting codevector (see Table 2, K = 2). To thin more than two codevectors, it is natural to generalize equation 4.1:

z = ∧_s x_s,  (4.3)

where s = 1, . . . , S, and S is the number of codevectors to be thinned. Although this operation allows binding of two or more codevectors, a single vector
Table 1: Properties of Various Versions of Thinning Procedures.

Property                    Direct Conjunctive   Permutive   Additive and
                            Thinning             Thinning    Subtractive CDT
Determinism                 Yes                  Yes         Yes
Variable number of inputs   No-Yes               Yes         Yes
Sampling of inputs          Yes                  Yes         Yes
Proportional sampling       No                   Yes         Yes
Uniform low density         No                   No          Yes
Density control             No                   No          Yes
Unstructured similarity     Yes-No               Yes         Yes
Similarity of subsets       Yes-No               Yes         Yes
Structured similarity       Yes-No               Yes         Yes
Binding                     Yes-No               Yes         Yes

Notes: Yes = the property is present. No = the property is not present. No-Yes and Yes-No = the property is partially present.
cannot be thinned. The density of the resulting codevector z depends on the densities of the x_s and on their number S. Therefore, to meet the requirement of uniform low density, the densities of the x_s should be chosen depending on the number of thinned codevectors. Also, the requirement of density control is not satisfied. We applied this version of direct conjunctive thinning to encode positions of visual features on a two-dimensional retina. Three codevectors were bound (S = 3): the codevector of a feature, the codevector of its X-coordinate, and the codevector of its Y-coordinate (unpublished work of 1991–1992 on recognition of handwritten digits, letters, and words in collaboration with WACOM Co., Japan). This technique was also used to encode words and word combinations for text processing (Rachkovskij, 1996). In so doing, the codevectors of the letters comprising words were bound (S > 10). The density of codevectors to be bound by thinning was chosen so as to provide a specified density of the resulting codevector (see Table 2, K = 3, …, 12). Neural network implementations of direct conjunctive thinning procedures are rather straightforward and will not be considered here.
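The density relation of equation 4.2 is easy to check numerically. The sketch below is our illustration (not code from the original work); it represents a codevector by the set of positions of its 1s, so conjunction becomes set intersection:

```python
import random

N = 100_000   # codevector dimensionality

def random_codevector(p, rng):
    """Random binary codevector of density p, stored as the set of its 1-positions."""
    return {i for i in range(N) if rng.random() < p}

rng = random.Random(1)
x = random_codevector(0.1, rng)
y = random_codevector(0.1, rng)
z = x & y   # direct conjunctive thinning: z = x AND y
# Equation 4.2: for independent x and y, p(z) = p(x)p(y), here about 0.1 * 0.1 = 0.01
print(len(z) / N)
```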
Table 2: The Density p of K Independent Codevectors Chosen to Provide a Specified Density p0 of Codevectors Produced by Their Conjunction.

        K:  2      3      4      5      6      7      8      9      10     11     12
p0 = 0.001  0.032  0.100  0.178  0.251  0.316  0.373  0.422  0.464  0.501  0.534  0.562
p0 = 0.010  0.100  0.215  0.316  0.398  0.464  0.518  0.562  0.599  0.631  0.658  0.681
p0 = 0.015  0.122  0.247  0.350  0.432  0.497  0.549  0.592  0.627  0.657  0.683  0.705
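The entries of Table 2 follow from generalizing equation 4.2 to K independent codevectors: p(z) = p^K, hence p = p0^(1/K). A few lines of Python (ours, for illustration) regenerate the table:

```python
# Density p such that the conjunction of K independent codevectors
# of density p has the specified density p0, i.e. p**K = p0.
for p0 in (0.001, 0.010, 0.015):
    row = [round(p0 ** (1.0 / K), 3) for K in range(2, 13)]
    print(p0, row)
```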
424
Dmitri A. Rachkovskij and Ernst M. Kussul
4.2 Permutive Conjunctive Thinning. The codevectors to be bound by direct conjunctive thinning are not superimposed. Let us consider the case where S codevectors are superimposed by disjunction:

z = ∨_s x_s.  (4.4)
Conjunction of a vector with itself produces the same vector: z ∧ z = z. So let us modify z by a permutation of all its elements and make a conjunction with the initial vector:

z' = z ∧ z~.  (4.5)

Here, z~ is the permuted vector. In vector-matrix notation, this can be rewritten as:

z' = z ∧ Pz,  (4.5a)
where P is an N × N permutation matrix (each row and each column of P has a single 1, and the rest of P is 0; multiplying a vector by a permutation matrix permutes the elements of the vector). Proper permutations are those producing a permuted vector that is independent of the initial vector (e.g., random permutations or shifts). Then the density of the result is

p(z') = p(z)p(z~) = p(z)p(Pz).  (4.6)
Let us consider the composition of the resulting vector:

z' = z ∧ z~ = (x_1 ∨ ··· ∨ x_S) ∧ z~
   = (x_1 ∨ ··· ∨ x_S) ∧ (x~_1 ∨ ··· ∨ x~_S)
   = x_1 ∧ (x~_1 ∨ ··· ∨ x~_S) ∨ ··· ∨ x_S ∧ (x~_1 ∨ ··· ∨ x~_S)
   = (x_1 ∧ x~_1) ∨ ··· ∨ (x_1 ∧ x~_S) ∨ ··· ∨ (x_S ∧ x~_1) ∨ ··· ∨ (x_S ∧ x~_S).  (4.7)
Thus the resulting codevector is the superposition of all possible pairs of bitwise codevector conjunctions. Each pair includes one component codevector and one permuted component codevector. Because of the initial disjunction of the component codevectors, this procedure meets more of the requirements on CDT procedures than direct conjunctive thinning. The requirement of a variable number of inputs is now fully satisfied. As follows from equation 4.7, each component codevector x_s is thinned by conjunction with one and the same stochastically independent vector z~. Therefore, statistically the same fraction of 1s is left from each component x_s, and the requirements of sampling of inputs and proportional sampling hold.
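A small simulation (our illustration) of equations 4.4 through 4.6: with S components of density p, the superposition has density about Sp, and the permutive conjunction about (Sp)². Codevectors are sets of 1-positions, and the permutation is a cyclic shift, one of the "proper" permutations mentioned above:

```python
import random

N, S, p = 100_000, 5, 0.01

def random_codevector(p, rng):
    return {i for i in range(N) if rng.random() < p}

def permute(v, shift):
    """A 'proper' permutation: cyclic shift of all element positions."""
    return {(i + shift) % N for i in v}

rng = random.Random(2)
xs = [random_codevector(p, rng) for _ in range(S)]
z = set().union(*xs)        # superposition by disjunction, equation 4.4
z1 = z & permute(z, 1)      # permutive conjunctive thinning, equation 4.5
# p(z) is about S*p = 0.05; p(z') = p(z)**2 is about 0.0025 (equation 4.6)
print(len(z) / N, len(z1) / N)
```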
Figure 3: Neural network implementation of permutive conjunctive thinning. The same N-dimensional binary pattern is activated in the input neural fields f_in1 and f_in2. It is a superposition of several component codevectors. f_in1 is connected to f_out by a bundle of direct projective connections. f_in2 is connected to f_out by a bundle of permutive connections. Conjunction of the superimposed component codevectors and their permutation is obtained in the output neural field f_out, where the neural threshold h = 1.5.
For S sparse codevectors of equal density p(x) ≪ 1,

p(z) ≈ Sp(x),  (4.8)

p(z') = p(z)p(z~) ≈ S²p²(x).  (4.9)
To satisfy the density requirements, p(z') = p(x) should hold for various S. This means that p(x) should be equal to 1/S². Therefore, at a fixed density p(x), the density requirements are not satisfied for a variable number S of component items. The similarity and binding requirements hold. In particular, the requirement of similarity of subsets holds because the higher the number of identical items, the more identical conjunctions are superimposed in equation 4.7.

A neural network implementation of permutive conjunctive thinning is shown in Figure 3. In neural network terms, units are called neurons, and their pools are called neural fields. There are two input fields, f_in1 and f_in2, and one output field, f_out, consisting of N binary neurons each. f_in1 is connected to f_out by a bundle of N direct projective (1-to-1) connections. Each
connection of this bundle connects neurons of the same number. f_in2 is connected to f_out by a bundle of N permutive projective connections. Each connection of this bundle connects neurons of different numbers. The synapses of the neurons of the output field have weights of +1. The same pattern of superimposed component codevectors is activated in both input fields. Each neuron of the output field sums the excitations of its two inputs (synapses). The output is determined by comparing the excitation level with the threshold h = 1.5. Therefore each output neuron performs a conjunction of its inputs. Thus, the activity pattern of the output field corresponds to the bitwise conjunction of the pattern present in the input fields and its permutation. Obviously, many different configurations of permutive connections are possible. Permutation by shift is particularly attractive because it is simple and fast to implement in computer simulations.

4.3 Additive CDT Procedure. Although the density of the resulting codevectors for permutive conjunctive thinning is closer to the density of each component codevector than for direct conjunction, it varies with the number and density of the component codevectors. Let us make a codevector z by the disjunction of S component codevectors x_s, as in equation 4.4. Since the density of the component codevectors is low and their number is small, the "absorption" of common 1s is low; according to equations 4.8 and 4.9, p(z') is approximately S²p(x) times p(x). For example, if p(x) = 0.01 and S = 5, then p(z') ≈ (1/4)p(x). Therefore, to make the density of the thinned codevector equal to the density of its component codevectors, let us superimpose an appropriate number K of independent vectors with the density p(z'):
⟨z⟩ = ∨_k (z ∧ z~_k) = z ∧ (∨_k z~_k).  (4.10)
Here ⟨z⟩ is the thinned output vector and z~_k is a unique (stochastically independent) permutation of the elements of vector z, fixed for each k. In vector-matrix notation, we can write:

⟨z⟩ = ∨_k (z ∧ P_k z) = z ∧ ∨_k (P_k z).  (4.10a)
The number K of vectors to be superimposed by disjunction can be determined as follows. If the density of the superposition of permuted versions of z is made

p(∨_k (P_k z)) = 1/S,  (4.11)

then after conjunction with z (see equations 4.10 and 4.10a) we will get the needed density of ⟨z⟩:

p(⟨z⟩) = p(z)/S ≈ Sp(x)/S = p(x).  (4.12)
Table 3: Number K of Permutations of an Input Codevector That Produces the Proper Density of the Thinned Output Codevector in the Additive Version of CDT.

Density p of           Number S of Component Codevectors in the Input Codevector
Component Codevector       2       3      4      5      6      7
0.001                   346.2   135.0   71.8   44.5   30.3   21.9
0.005                    69.0    26.8   14.2    8.8    6.0    4.3
0.010                    34.3    13.3    7.0    4.4    2.9    2.1

Note: K should be rounded to the nearest integer.
Taking into account the "absorption" of 1s in the disjunction of K permuted vectors z~, equation 4.11 can be rewritten as

1/S = 1 − (1 − pS)^K.  (4.13)

Then

K = ln(1 − 1/S) / ln(1 − pS).  (4.14)
The dependence K(S) at different p is shown in Table 3.

4.3.1 Meeting the CDT Procedure Requirements. Since the configuration of each kth permutation is fixed, the additive CDT procedure is deterministic. The input vector z to be thinned is the superposition of the component codevectors. The number of these codevectors may be variable; therefore, the requirement of a variable number of inputs holds. The output vector is obtained by conjunction of z (or its reversible permutation) with the independent vector ∨_k (P_k z). Therefore, the 1s of all codevectors superimposed in z are equally represented in ⟨z⟩, and both the sampling of inputs and the proportional sampling requirements hold. Density control of the output codevector for a variable number and density of the component codevectors is realized by varying K (see Table 3). Therefore, the density requirements hold.

Since the sampling of inputs and proportional sampling requirements hold, the codevector ⟨z⟩ is similar to all component codevectors x_s, and the requirement of unstructured similarity holds. The more similar the components of one composite item are to those of another, the more similar are their superimposed codevectors z. Therefore, the more similar are the vectors obtained as disjunctions of K fixed permutations of z, and the more similar representations of each component codevector will remain after conjunction (see equation 4.10) with z. Thus, the similarity of subsets requirement holds. Characteristics of this similarity will be considered in more detail in section 7.
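Equation 4.14 and the entries of Table 3 can be checked directly (our snippet):

```python
import math

def K_additive(S, p):
    """Number of permutation bundles for the additive CDT, equation 4.14."""
    return math.log(1 - 1 / S) / math.log(1 - p * S)

# Reproduces Table 3, e.g. S = 2 at p = 0.001 and S = 5 at p = 0.010:
print(round(K_additive(2, 0.001), 1), round(K_additive(5, 0.010), 1))
```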
Figure 4: (A), (B). Two algorithmic implementations of the additive version of the CDT procedure. The parameter seed defines a configuration of shift permutations. For small K, checking that r is unique would be useful.
Since different combinations of component codevectors produce different z and therefore different codevectors of K permutations of z, the representations of a certain component codevector in the thinned codevector will be different for different combinations of component items, and the binding requirement holds. The more similar are the representations of each component in the output vector, the more similar are the output codevectors (the requirement of structured similarity holds).

4.3.2 An Algorithmic Implementation. Shift is an easily implementable permutation. Therefore an algorithmic implementation of this CDT procedure may be as in Figure 4A. Another example of this procedure does not require preliminary calculation of K (see Figure 4B). In this version, conjunctions of the initial and permuted vectors are superimposed until the number of 1s in the output vector becomes equal to M.
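A sketch of the Figure 4B variant in Python (our reconstruction from the description, not the authors' code): permutations are cyclic shifts drawn from a seeded generator, and the loop stops once at least M ones have accumulated.

```python
import random

N, M = 100_000, 1000

def rotate(v, k):
    return {(i + k) % N for i in v}

def cdt_additive(z, M, seed=0):
    """Additive CDT: OR together conjunctions of z with successive shift
    permutations of itself until the output has at least M ones."""
    rng = random.Random(seed)
    out = set()
    while len(out) < M:
        r = rng.randrange(1, N)     # shift amount of the k-th permutation
        out |= z & rotate(z, r)
    return out

rng = random.Random(3)
xs = [{i for i in range(N) if rng.random() < 0.01} for _ in range(5)]
z = set().union(*xs)                # input: superposition of 5 components
thinned = cdt_additive(z, M)
print(len(thinned))                 # at least M, and close to it
```

A fixed seed keeps the procedure deterministic, as the determinism requirement demands.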
Figure 5: Neural network implementation of the additive version of the CDT procedure. There are four neural fields with the same number of neurons: two input fields f_in1 and f_in2, the output field f_out, and the intermediate field f_int. The neurons of f_in1 and f_out are connected by a bundle of direct projective connections (1-to-1). f_int and f_out are also connected in the same manner. The same binary pattern z (corresponding to superimposed component codevectors) is in the input fields f_in1 and f_in2. The intermediate field f_int is connected to the input field f_in2 by K bundles of permutive projective connections. The number K of required bundles is estimated in Table 3. Only two bundles are shown here: one by solid lines and one by dotted lines. The threshold of the f_int neurons is 0.5. Therefore, f_int accumulates (by bitwise disjunction) various permutations of the pattern z in f_in2. The threshold of f_out is equal to 1.5. Hence, this field performs a conjunction of the pattern z from f_in1 and the pattern of K permuted and superimposed z from f_int. z, ⟨z⟩, w correspond to the notation of Figure 4.
4.3.3 A Neural Network Implementation. A neural network implementation of the first example of the additive CDT procedure (see Figure 4A) is shown in Figure 5. To choose K depending on the density of z, the neural network implementation should incorporate some structures not shown in the figure. They should determine the density of the initial pattern z and activate (turn on) K bundles of permutive connections from their total number Kmax. Alternatively, these structures should actuate the bundles of permutive connections
one by one in a fixed order¹ until the density of the output vector in f_out becomes M/N. Let us recall that bundles of shift permutive connections are used in the algorithmic implementations.

4.4 Subtractive CDT Procedure. Let us consider another version of the CDT procedure. Rather than masking z with the straight disjunction of permuted versions of z, as in additive thinning, let us mask it with the inverse of that disjunction:
⟨z⟩ = z ∧ ¬(∨_k z~_k) = z ∧ ¬(∨_k P_k z).  (4.15)
If we choose K so as to make the number of 1s in ⟨z⟩ equal to M, then this procedure will satisfy the requirements of section 3. Therefore, the density of the superimposed permuted versions of z before inversion should be 1 − 1/S (compare to equation 4.13). Thus, the number K of permuted vectors to be superimposed in order to obtain the required density (taking into account the "absorption" of 1s) is determined from:

1 − 1/S = 1 − (1 − pS)^K.  (4.16)
Then, for pS ≪ 1,

K ≈ ln S / (pS).  (4.17)
Algorithmic implementations of this subtractive CDT procedure (Kussul, 1988, 1992; Kussul & Baidyk, 1990) are analogous to those presented for the additive CDT procedure in section 4.3.2. A neural network implementation is shown in Figure 6. Since the value of ln S / S is approximately the same at S = 2, 3, 4, 5 (see Table 4), one can choose the K for a specified density of component codevectors p(x) as:

K ≈ 0.34 / p(x).  (4.18)
At such K and S, p(⟨z⟩) ≈ p(x). Therefore, the number K of permutive connection bundles in Figure 6 can be fixed, and their sequential activation is not needed. So each neuron of f_out may be considered as connected to an average of K randomly chosen neurons of f_in2 by inhibitory connections. More precise values of K (obtained as exact solutions of equation 4.16) for different values of p are presented in Table 4.

¹ As noted by Kanerva (personal communication), all Kmax bundles could be activated in parallel if the weight of the kth bundle is set to be 2^(−k) and the common threshold of the f_int neurons is adjusted dynamically so that f_out has the desired density of 1s.
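The subtractive procedure can be sketched in the same set-based style as before (our illustration; K is chosen by equation 4.18):

```python
import random

N = 100_000

def rotate(v, k):
    return {(i + k) % N for i in v}

def cdt_subtractive(z, K, seed=0):
    """Subtractive CDT, equation 4.15: mask z with the inverse of the
    disjunction of K shift permutations of itself."""
    rng = random.Random(seed)
    mask = set()
    for _ in range(K):
        mask |= rotate(z, rng.randrange(1, N))
    return z - mask                  # z AND NOT mask

p, S = 0.01, 3
rng = random.Random(4)
xs = [{i for i in range(N) if rng.random() < p} for _ in range(S)]
z = set().union(*xs)
K = round(0.34 / p)                  # equation 4.18, here K = 34
thinned = cdt_subtractive(z, K)
print(len(thinned) / N)              # close to p = 0.01
```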
Figure 6: Neural network implementation of the subtractive CDT procedure. There are three neural fields with the same number of neurons: two input fields f_in1 and f_in2, as well as the output field f_out. A copy of the input vector z is in both input fields. The neurons of f_in1 and f_out are connected by a bundle of direct projective connections (1-to-1). The neurons of f_in2 and f_out are connected by K bundles of independent permutive connections. (Only two bundles of permutive connections are shown here: one by solid lines and one by dotted lines.) Unlike Figure 5, the synapses of the permutive connections are inhibitory (the weight is −1). The threshold of the output field neurons is 0.5. Therefore, the neurons of z remaining active in f_out are those for which none of the permutive connections coming from z is active. As follows from Table 4, K is approximately the same for the number S = 2, …, 5 of component codevectors of a certain density p.
This version of the CDT procedure was originally proposed under the name normalization procedure (Kussul, 1988; Kussul & Baidyk, 1990; Amosov et al., 1991). We have used it in the multilevel APNN for sequence processing (Rachkovskij, 1990b; Kussul & Rachkovskij, 1991). We have also used it for binding of sparse codes in perceptron-like classifiers (Kussul, Baidyk, Lukovich, & Rachkovskij, 1993) and in one-level APNN applied to recognition of vowels (Rachkovskij & Fedoseyeva, 1990, 1991), word roots (Fedoseyeva, 1992), textures (Artykutsa et al., 1991; Kussul et al., 1991b), shapes (Kussul & Baidyk, 1990), and handprinted characters (Lavrenyuk, 1995), as well as to logical inference (Kasatkin & Kasatkina, 1991).
Table 4: Function ln S / S and Number K of Permutations of an Input Codevector That Produces the Proper Density of the Thinned Output Codevector in the Subtractive Version of CDT.

              Number S of Component Codevectors in the Input Codevector
                  2       3       4       5       6       7
ln S / S       0.347   0.366   0.347   0.322   0.299   0.278
K, p = 0.001   346.2   365.7   345.9   321.1   297.7   277.0
K, p = 0.005    69.0    72.7    68.6    63.6    58.8    54.6
K, p = 0.010    34.3    36.1    34.0    31.4    29.0    26.8

Note: K should be rounded to the nearest integer.
5 Procedures of Auto-Thinning, Hetero-Thinning, Self-Exclusive Thinning, and Notation
In sections 4.2 through 4.4 we considered versions of the thinning procedures where a single vector (the superposition of component codevectors) was the input. The corresponding pattern of activity was present in both f_in1 and f_in2 (see Figures 3, 5, and 6), and therefore the input vector thinned itself. Let us call these procedures auto-thinning or auto-CDT and denote them as

label⟨u⟩.  (5.1)
Here u is the codevector to be thinned (usually a superposition of component codevectors), which is in the input fields f_in1 and f_in2 of Figures 3, 5, and 6. label⟨…⟩ denotes a particular configuration of thinning (a particular realization of the bundles of permutive connections). Let us note that Plate uses angle brackets to denote the normalization operation in HRRs (Plate, 1995; see also section 9.1.5). A lot of orthogonal configurations of permutive connections are possible. Differently labeled CDT procedures implement different thinning. In the algorithmic implementations (see Figure 4), different labels will use different seeds. No label corresponds to some fixed configuration of thinning. Unless otherwise specified, it is assumed that the number K of bundles is chosen to maintain the preset density of the thinned vector ⟨u⟩, usually |⟨u⟩| ≈ M. ∨_k (P_k u) can be expressed as Ru thresholded at one-half, where the matrix R is the disjunction (or, equivalently here, the sum) of K permutation matrices P_k. This, in turn, can be written as a function T(u), so that we get

⟨u⟩ = u ∧ T(u).  (5.2)
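In set form, with cyclic shifts as the fixed permutations, T and the thinning operations of equations 5.2, 5.4, and 5.5 can be sketched as follows (our notation; the list of shifts plays the role of the configuration label):

```python
N = 100_000

def T(u, shifts):
    """Thinning mask: disjunction of fixed permutations (shifts) of u."""
    return {(i + k) % N for k in shifts for i in u}

def auto_thin(u, shifts):             # equation 5.2:  <u> = u AND T(u)
    return u & T(u, shifts)

def additive_hetero(u, w, shifts):    # equation 5.4:  <u>_w = u AND T(w)
    return u & T(w, shifts)

def subtractive_hetero(u, w, shifts): # equation 5.5:  <u>_w = u AND NOT T(w)
    return u - T(w, shifts)

# Auto-thinning is hetero-thinning of u with itself:
u, shifts = {1, 2, 500, 99_999}, [3, 77]
assert auto_thin(u, shifts) == additive_hetero(u, u, shifts)
```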
It is possible to thin one codevector by another one if the pattern to be thinned is activated in f_in1 and the pattern that thins is activated in f_in2. Let us call such a procedure hetero-CDT, hetero-thinning, or thinning u with w. We denote hetero-thinning as

label⟨u⟩_w.  (5.3)
Here w is the pattern that does the thinning. It is activated in f_in2 of Figures 3, 5, and 6. u, the pattern that is thinned, is activated in f_in1. label⟨…⟩ is the configuration label of thinning. For auto-thinning, we may write ⟨u⟩ = ⟨u⟩_u. For additive hetero-thinning, equation 4.10 can be rewritten as

⟨u⟩_w = u ∧ (∨_k (P_k w)) = u ∧ T(w).  (5.4)
For subtractive hetero-thinning, equation 4.15 can be rewritten as

⟨u⟩_w = u ∧ ¬(∨_k (P_k w)) = u ∧ ¬T(w).  (5.5)
As before, we denote the composite codevector u to be thinned by its component codevectors, for example, u = a ∨ b ∨ c or simply u = abc.

Examples. Auto-thinning of the composite codevector u:

⟨u⟩_u = ⟨a ∨ b ∨ c⟩_(a∨b∨c) = ⟨a ∨ b ∨ c⟩ = ⟨abc⟩_abc = ⟨abc⟩.

Hetero-thinning of the composite codevector u with codevector d:

⟨u⟩_d = ⟨a ∨ b ∨ c⟩_d = ⟨abc⟩_d.

For both the additive and subtractive CDT procedures:

⟨abc⟩ = ⟨a⟩_abc ∨ ⟨b⟩_abc ∨ ⟨c⟩_abc.

We can also write ⟨abc⟩ = (a ∧ T(abc)) ∨ (b ∧ T(abc)) ∨ (c ∧ T(abc)). An analogous expression can be written for a composite pattern with other numbers of components. Let us note that K should be the same for thinning of the composite pattern as a whole or of its individual components. For the additive CDT procedure, it also holds that:

⟨abc⟩ = ⟨a⟩_abc ∨ ⟨b⟩_abc ∨ ⟨c⟩_abc = ⟨a⟩_a ∨ ⟨b⟩_a ∨ ⟨c⟩_a ∨ ⟨a⟩_b ∨ ⟨b⟩_b ∨ ⟨c⟩_b ∨ ⟨a⟩_c ∨ ⟨b⟩_c ∨ ⟨c⟩_c.
For the subtractive CDT procedure we can write:

⟨a⟩_bcd = ⟨⟨⟨a⟩_b⟩_c⟩_d

and

⟨abc⟩ = ⟨⟨⟨a⟩_a⟩_b⟩_c ∨ ⟨⟨⟨b⟩_a⟩_b⟩_c ∨ ⟨⟨⟨c⟩_a⟩_b⟩_c.
Let us also consider a modification of the auto-CDT procedures that will be used in section 7.2. If we eliminate the thinning of a component codevector with itself, we obtain "self-exclusive" auto-thinning. Let us denote it as ⟨abc⟩_\abc:

⟨abc⟩_\abc = ⟨a⟩_bc ∨ ⟨b⟩_ac ∨ ⟨c⟩_ab.

6 Retrieval of Component Codevectors
After thinning, the codevectors of component items are present in the thinned codevector of a composite item in a reduced form. We must be able to retrieve the complete component codevectors. Since the requirement of unstructured similarity holds, the thinned composite codevector is similar to its component codevectors. So if we have a full set (alphabet) of the component codevectors of the preceding (lower) level of the compositional hierarchy, we can compare them with the thinned codevector. The degree of similarity is determined by the overlap of the codevectors. The alphabet items corresponding to the codevectors with maximum overlaps are the sought-after components.

The search for the most similar component codevectors can be performed by sequentially finding the overlaps of the codevector to be decoded with all codevectors of the component alphabet. An associative memory can be used to implement this operation in parallel. After retrieving the full-sized component codevectors of the lower hierarchical level, one can then retrieve their component codevectors of a still lower hierarchical level in an analogous way. For this purpose, the alphabet of the latter should be known as well. If the order of component retrieval is important, some auxiliary procedures can be used (Kussul, 1988; Amosov et al., 1991; Rachkovskij, 1990b; Kussul & Rachkovskij, 1991).

Example. Let us consider an alphabet of six component items a, b, c, d, e, f. They are encoded by stochastic fixed vectors of N = 100,000 bits with M ≈ 1000 bits set to 1. Let us obtain the thinned codevector ⟨abc⟩. The number of 1s in ⟨abc⟩ in our numerical example is |⟨abc⟩| = 1002. Let us find the overlap of each component codevector with the thinned codevector:

|a ∧ ⟨abc⟩| = 341;  |b ∧ ⟨abc⟩| = 350;  |c ∧ ⟨abc⟩| = 334;
|d ∧ ⟨abc⟩| = 12;  |e ∧ ⟨abc⟩| = 7;  |f ∧ ⟨abc⟩| = 16.

So the representation of the component items a, b, c is substantially higher than that of the items d, e, f, which occurs due to the chance overlap of independent binary codevectors. The numbers obtained are typical for the additive and the subtractive versions of thinning, as well as for their self-exclusive versions.
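A simulation in the spirit of this example (our sketch, using the shift-based additive thinning of section 4.3) reproduces the pattern of overlaps:

```python
import random

N, M = 100_000, 1000
rng = random.Random(5)

def codevector():
    return set(rng.sample(range(N), M))

def rotate(v, k):
    return {(i + k) % N for i in v}

def thin(z, seed=0):
    """Additive CDT (section 4.3): thin z down to about M ones."""
    r, out = random.Random(seed), set()
    while len(out) < M:
        out |= z & rotate(z, r.randrange(1, N))
    return out

alphabet = {name: codevector() for name in "abcdef"}
thinned = thin(alphabet["a"] | alphabet["b"] | alphabet["c"])
overlaps = {name: len(v & thinned) for name, v in alphabet.items()}
# Components a, b, c overlap strongly (about M/3 each); d, e, f only by
# chance (about M*M/N = 10), so decoding by maximum overlap succeeds.
print(overlaps)
```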
7 Similarity Preservation by the Thinning Procedures
In this section, let us consider the similarity of thinned composite codevectors as well as the similarity of thinned representations of component
codevectors in the thinned composite codevectors. These kinds of similarity are considered under different combinations of component items and different versions of the thinning procedures. Let us use the following CDT procedures:

Permutive conjunctive thinning, section 4.2 (Paired-M)
Additive auto-CDT, section 4.3 (CDTadd)
Additive self-exclusive auto-CDT, section 5 (CDTadd-sl)
Subtractive auto-CDT, section 4.4 (CDTsub)
Subtractive self-exclusive auto-CDT, section 5 (CDTsub-sl)

For these experiments, let us first obtain composite codevectors that have 5 down to 0 component codevectors in common: abcde, abcdf, abcfg, abfgh, afghi, fghij. For the component codevectors, N = 100,000 bits with M ≈ 1000. Then for each thinning procedure, let us thin the composite codevectors down to the density of their component codevectors. (For permutive conjunctive thinning, the density of the component codevectors was chosen to get approximately M 1s in the result.)

7.1 Similarity of Thinned Codevectors. Let us find the overlap of the thinned codevector ⟨abcde⟩ with ⟨abcde⟩, ⟨abcdf⟩, ⟨abcfg⟩, ⟨abfgh⟩, ⟨afghi⟩, and ⟨fghij⟩. Here ⟨ ⟩ is used to denote any thinning procedure. A normalized measure of the overlap of x with various y is determined as |x ∧ y| / |x|. The experimental results are presented in Figure 7A, where the normalized overlap of the thinned composite codevectors is shown versus the normalized overlap of the corresponding unthinned composite codevectors. It can be seen that the overlap of the thinned codes for the various versions of the CDT procedure is approximately equal to the square of the overlap of the unthinned codes. For example, the similarity (overlap) of abcde and abfgh is approximately 0.4 (two common components of five total), and the overlap of their thinned codevectors is about 0.16.

7.2 Similarity of Component Codevector Subsets Included in Thinned Codevectors. Some experiments were conducted in order to investigate the
similarity of subsets requirement. The similarity of the subsets of a component codevector incorporated into various thinned composite vectors was obtained as follows. First, the intersections of various thinned five-component composite codevectors with their component a were determined: u = a ∧ ⟨abcde⟩, v = a ∧ ⟨abcdf⟩, w = a ∧ ⟨abcfg⟩, x = a ∧ ⟨abfgh⟩, y = a ∧ ⟨afghi⟩. Then the normalized values of the overlap of the intersections were obtained as |u ∧ v| / |u|, |u ∧ w| / |u|, |u ∧ x| / |u|, |u ∧ y| / |u|. Figure 7B shows how the similarity (overlap) of the component codevector subsets incorporated into two thinned composite codevectors varies versus the similarity of the corresponding unthinned composite codevectors. These
Figure 7: (A) Overlap of thinned composite codevectors. (B) Overlap of component subsets in thinned composite codevectors versus the overlap of the corresponding unthinned composite codevectors for various versions of the thinning procedures. CDTadd: the additive; CDTsub: the subtractive; CDTadd-sl: the self-exclusive additive; CDTsub-sl: the self-exclusive subtractive CDT procedure; Paired-M: permutive conjunctive thinning. The densities of the component codevectors are chosen to obtain M 1s in the thinned codevector. For the component codevectors, N = 100,000, M ≈ 1000. The number of component codevectors is 5. The results are averaged over 50 runs with different random codevectors.
dependencies are different for different thinning procedures. For CDTadd and CDTadd-sl, they are close to linear, but for CDTsub and CDTsub-sl, they are polynomial. Which is preferable depends on the application.

7.3 The Influence of the Depth of Thinning. By the depth of thinning, we understand the density value of a thinned composite codevector. Before, we considered it equal to the density of the component codevectors. Here, we vary the density of the thinned codevectors. The experimental results presented in Figure 8 are useful for estimating the resulting similarity of thinned codevectors in applications. As in sections 7.1 and 7.2, composite codevectors of five components were used. Therefore, approximately 5M 1s (actually, closer to 4.9M because of random overlaps) were in the input vector before thinning. We varied the number of 1s in the thinned codevectors from 4M to M/4. Only the additive and the subtractive CDT procedures were investigated.

The similarity of thinned codevectors is shown in Figure 8A. For a shallow thinning, where the resulting density is near the density of the input composite codevector, the similarity degree of the resulting vectors is close to that of the input codevectors (the curve is close to linear). For a deep thinning, where the density of the thinned codevectors is much less than the density of the input codevectors, the similarity function behaves as a power function, transforming from linear through quadratic to cubic (for subtractive thinning).

The similarity of component subsets in the thinned codevector is shown in Figure 8B. For the additive CDT procedure, the similarity function is linear, and its angle reaches approximately 45 degrees for "deep" thinning. For the subtractive CDT procedure, the function is similar to the additive one for shallow thinning and becomes near-quadratic for "deep" thinning.

8 Representation of Structured Expressions
Let us consider the representation of various kinds of structured data by binary sparse codevectors of fixed dimensionality. In the examples below, the items of the base level for a given expression may, in their turn, represent complex structured data.

8.1 Transformation of Symbolic Bracketed Expressions into Representations by Codevectors. Performing the CDT procedure can be viewed as an analog of introducing brackets into symbolic descriptions. As mentioned in section 5, the CDT procedures with different thinning configurations are denoted by different labels at the opening thinning bracket: 1⟨ ⟩, 2⟨ ⟩, 3⟨ ⟩, 4⟨ ⟩, 5⟨ ⟩, and so on.
Figure 8: (A) Overlap of thinned composite codevectors. (B) Overlap of component subsets in thinned composite codevectors for various "depths" of the additive (CDTadd) and subtractive (CDTsub) CDT procedures. There are five components in the composite item. Therefore, the input composite codevector includes approximately 5M 1s. The composite codevector is thinned to have from 4M to M/4 1s. The two curves for thinning depth M are consistent with the corresponding curves in Figure 7. For all component codevectors, N = 100,000, M ≈ 1000. The results are averaged over 50 runs with different random codevectors.
Therefore, in order to represent a complex symbolic structure by a distributed binary codevector, one should:

1. Map each symbol of the base-level items to the corresponding binary sparse codevector of fixed dimensionality.
2. Replace the conventional brackets in the symbolic bracketed representation by "thinning" ones. Each compositional level has its own label of thinning brackets, that is, its own thinning configuration.
3. Superimpose the codevectors inside the thinning brackets of the deepest nesting level by elementwise disjunction.
4. Perform the CDT procedure on the superimposed codevectors using the configuration of thinning corresponding to the particular "thinning" label.
5. Superimpose the resulting thinned vectors inside the thinning brackets of the next nesting level.
6. Perform the CDT procedure on the superimposed codevectors using the appropriate thinning configuration of that nesting level.
7. Repeat the two previous steps until the whole structure is encoded.
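The steps above can be sketched in Python (our illustration; thin() is the additive auto-CDT, with the bracket label used as the seed of the shift configuration):

```python
import random

N, M = 100_000, 1000
rng = random.Random(7)

def codevector():
    return set(rng.sample(range(N), M))

def thin(z, label):
    """Auto-CDT with the configuration fixed by the bracket label."""
    r, out = random.Random(label), set()
    while len(out) < M:
        k = r.randrange(1, N)
        out |= z & {(i + k) % N for i in z}
    return out

love, loveagt, john, loveobj, mary = (codevector() for _ in range(5))
# Two nesting levels: inner label-1 brackets, outer label-2 bracket.
L1 = thin(love | thin(loveagt | john, label=1)
               | thin(loveobj | mary, label=1), label=2)
print(len(L1))   # about M: dimensionality and density are preserved
```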
These and other techniques to represent the order of items have their pros and cons, so a specific technique should be chosen depending on the application. We will not consider the details here; it is important only that such techniques exist, and we will denote the codevector of item a at the nth place simply by a_n. Note that, in general, the modification of an item codevector that encodes its ordinal number should be different at different nesting levels, analogously to each nesting level having its own thinning configuration. Therefore, a and b should be modified in the same manner in ¹⟨… a_n …⟩ and ¹⟨… b_n …⟩, but a should generally be modified differently in ²⟨… a_n …⟩.

8.3 Examples.
8.3.1 Role-Filler Structure. Representations of structures or propositions by the role-filler scheme are widely used (Smolensky, 1990; Pollack, 1990; Plate, 1991, 1995; Kanerva, 1996; Sperduti, 1994). Let us consider the relational instance

knows(Sam, loves(John, Mary)).    (8.1)

Using Plate's HRRs, it can be represented as:

L1 = love + love_agt ⊛ john + love_obj ⊛ mary,    (8.2)
L2 = know + know_agt ⊛ sam + know_obj ⊛ L1,    (8.3)

where ⊛ stands for the binding operation and + denotes addition. In our representation:

L1 = ²⟨love ∨ ¹⟨love_agt ∨ john⟩ ∨ ¹⟨love_obj ∨ mary⟩⟩,    (8.4)
L2 = ⁴⟨know ∨ ³⟨know_agt ∨ sam⟩ ∨ ³⟨know_obj ∨ L1⟩⟩.    (8.5)
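The HRR side of this comparison (equations 8.2 and 8.3) can be sketched as follows: binding is circular convolution, unbinding is circular correlation, and the noisy result is cleaned up against the alphabet of known codevectors. The vector size and item names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1024

def rand_vec():
    # HRR-style vector: i.i.d. gaussian elements with variance 1/n
    return rng.normal(0.0, 1.0 / np.sqrt(n), n)

def bind(a, b):
    # circular convolution, computed via FFT
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(c, a):
    # circular correlation = convolution with the involution of a
    inv = np.concatenate(([a[0]], a[:0:-1]))
    return bind(c, inv)

love, love_agt, love_obj, john, mary = (rand_vec() for _ in range(5))
L1 = love + bind(love_agt, john) + bind(love_obj, mary)   # equation 8.2

noisy = unbind(L1, love_agt)   # noisy version of john
alphabet = {"love": love, "john": john, "mary": mary}
best = max(alphabet, key=lambda k: float(np.dot(noisy, alphabet[k])))
```

Note that the unbound vector is only similar to john after clean-up against the alphabet, which is exactly the role played by the clean-up memory discussed in section 9.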
8.3.2 Predicate-Arguments Structure. Let us consider representation of the relational instances loves(John, Mary) and loves(Tom, Wendy) by the predicate-arguments (or symbol-argument-argument) structure (Halford et al., 1998):

loves ⊛ John ⊛ Mary + loves ⊛ Tom ⊛ Wendy.    (8.6)

Using our representation, we obtain:

²⟨¹⟨loves_0 ∨ John_1 ∨ Mary_2⟩ ∨ ¹⟨loves_0 ∨ Tom_1 ∨ Wendy_2⟩⟩.    (8.7)

Note that this example may be represented using the role-filler scheme of HRRs as

L1 = loves + lover ⊛ Tom + loved ⊛ Wendy,    (8.8)
L2 = loves + lover ⊛ John + loved ⊛ Mary,    (8.9)
L = L1 + L2.    (8.10)
Under such a representation, the information about who loves whom is lost in L (Plate, 1995; Halford et al., 1998). In our representation, this information is preserved even using the role-filler scheme:

L1 = ²⟨loves ∨ ¹⟨lover ∨ Tom⟩ ∨ ¹⟨loved ∨ Wendy⟩⟩,    (8.11)
L2 = ²⟨loves ∨ ¹⟨lover ∨ John⟩ ∨ ¹⟨loved ∨ Mary⟩⟩,    (8.12)
L = ⟨L1 ∨ L2⟩.    (8.13)

Here is another example of a relational instance from Halford et al. (1998):

cause(shout-at(John, Tom), hit(Tom, John)).    (8.14)

Using our representation scheme, it may be represented as

²⟨cause_0 ∨ ¹⟨shout-at_0 ∨ John_1 ∨ Tom_2⟩_1 ∨ ¹⟨hit_0 ∨ Tom_1 ∨ John_2⟩_2⟩.    (8.15)
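The preservation claim above can be checked numerically: encoding each relational instance as a separately thinned chunk and superimposing the chunks keeps "who loves whom" recoverable by overlap. This is a sketch under assumed parameters; the shift-based place coding and the values of N, M, K are illustrative, not the article's.

```python
import numpy as np

rng = np.random.default_rng(8)
N, M, K = 10_000, 100, 20

def random_codevector():
    v = np.zeros(N, dtype=bool)
    v[rng.choice(N, size=M, replace=False)] = True
    return v

def cdt(z, config):
    # additive CDT: thin the superposition z with a set of permutations
    mask = np.zeros(N, dtype=bool)
    for p in config:
        mask |= z[p]
    return z & mask

lvl1 = [rng.permutation(N) for _ in range(K)]   # inner thinning configuration
lvl2 = [rng.permutation(N) for _ in range(K)]   # outer thinning configuration

loves, john, mary, tom, wendy = (random_codevector() for _ in range(5))
place = lambda v, n: np.roll(v, n)              # item at the nth place

t1 = cdt(place(loves, 0) | place(john, 1) | place(mary, 2), lvl1)
t2 = cdt(place(loves, 0) | place(tom, 1) | place(wendy, 2), lvl1)
L = cdt(t1 | t2, lvl2)                          # equation 8.7, sketched

# John appears at place 1 but never at place 2:
ov_john1 = int((L & place(john, 1)).sum())      # well above chance
ov_john2 = int((L & place(john, 2)).sum())      # chance level
```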
8.3.3 Treelike Structure. Here is an example of a bracketed binary tree adapted from Pollack (1990):

((d (a n)) (v (p (d n)))).    (8.16)

If we do not take the order into account but use only the information about the grouping of constituents, our representation may look as simple as

⁴⟨³⟨d ∨ ²⟨a ∨ n⟩⟩ ∨ ³⟨v ∨ ²⟨p ∨ ¹⟨d ∨ n⟩⟩⟩⟩.    (8.17)

8.3.4 Labeled Directed Acyclic Graph. Sperduti and Starita (1997) and Frasconi, Gori, and Sperduti (1997) provide examples of labeled directed acyclic graphs. Let us consider

F(a, f(y), f(y, F(a, b))).    (8.18)

Using our representation, it may look like

³⟨F_0 ∨ a_1 ∨ ²⟨f_0 ∨ y_1⟩_2 ∨ ²⟨f_0 ∨ y_1 ∨ ¹⟨F_0 ∨ a_1 ∨ b_2⟩_2⟩_3⟩.    (8.19)
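Steps 1–7 of section 8 can be applied to the tree of equation 8.16 by a recursive encoder that ORs the constituent codevectors at each bracket and thins the result with that level's configuration. This sketch makes simplifying assumptions: parameters are illustrative, and the level index runs from the root rather than from the innermost brackets as in the article.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, K = 10_000, 100, 20

def random_codevector():
    v = np.zeros(N, dtype=bool)
    v[rng.choice(N, size=M, replace=False)] = True
    return v

def cdt(z, config):
    mask = np.zeros(N, dtype=bool)
    for p in config:
        mask |= z[p]
    return z & mask

# a fixed thinning configuration per nesting level (8 levels assumed)
levels = [[rng.permutation(N) for _ in range(K)] for _ in range(8)]

def encode(node, alphabet, depth=0):
    if isinstance(node, str):                 # base-level item (step 1)
        return alphabet[node]
    z = np.zeros(N, dtype=bool)
    for child in node:                        # superimpose constituents
        z |= encode(child, alphabet, depth + 1)
    return cdt(z, levels[depth])              # thin at this level's config

alphabet = {s: random_codevector() for s in "danvp"}
tree = (("d", ("a", "n")), ("v", ("p", ("d", "n"))))   # equation 8.16
code = encode(tree, alphabet)
```

Because each thinning only removes 1s, every 1 of the final codevector can be traced back to some leaf codevector, which is what makes similarity-based decoding through subchunks possible.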
9 Related Work and Discussion
CDT procedures allow the construction of binary sparse representations of complex data structures, including nested compositional structures or part-whole hierarchies. The basic principles of such representations and their use for data handling were proposed in the context of the APNN paradigm (Kussul, 1988, 1992; Kussul et al., 1991a).
9.1 Comparison to Other Representation Schemes. Let us compare our scheme for representation of complex data structures using the CDT procedure (we will refer to it as APNN-CDT) with other schemes using distributed representations. The best-known schemes are (L)RAAMs (Pollack, 1990; Blair, 1997; Sperduti, 1994), tensor product representations (Smolensky, 1990; Halford et al., 1998), holographic reduced representations (HRRs) (Plate, 1991, 1995), and binary spatter codes (BSCs) (Kanerva, 1994, 1996). For this comparison, we will use the framework of Plate (1997), who proposes distinguishing these schemes by the following features: the nature of the distributed representation, the choice of superposition, the choice of binding operation, how the binding operation is used to represent predicate structure, and the use of other operations and techniques.
9.1.1 The Nature of Distributed Representation. Vectors of random real-valued elements with the gaussian distribution are used in HRRs. Dense binary random codes with the number of 1s equal to the number of 0s are used in BSCs. Vectors with real or binary elements (without specified distributions) are used in the other schemes. In the APNN-CDT scheme, binary vectors with a randomly distributed small number of 1s are used to encode base-level items.

9.1.2 The Choice of Superposition. The operation of superposition is used for unstructured representation of an aggregate of codevectors. In BSCs, superposition is realized as a bitwise thresholded addition of codevectors. Schemes with nonbinary elements, such as HRRs, use elementwise summation. For tensors, superposition is realized as adding up or ORing the corresponding elements. In the APNN-CDT scheme, elementwise OR is used.

9.1.3 The Choice of Binding Operation. Most schemes use special operations for the binding of codevectors. Binding operations producing a bound vector that has the same dimension as the initial codevectors (or one of them, in (L)RAAMs) are convenient for the representation of recursive structures. The binding operation is performed "on the fly" by circular convolution (HRRs), elementwise multiplication (Gayler, 1998), or XOR (BSCs). In (L)RAAMs, binding is realized through the multiplication of input codevectors by the weight matrix of the hidden layer, formed by training a multilayer perceptron using the codevectors to be bound. The vector obtained by binding can in its turn be bound with another codevector. In tensor models, binding of several codevectors is performed by their tensor product; the dimensionality of the resulting tensor grows with the number of bound codevectors.

In the APNN-CDT scheme, binding is performed by the CDT procedure. Unlike the other schemes, where the codevectors to be bound are not superimposed, they can be superimposed by disjunction in the basic version of
the CDT procedure. The superposition codevector z (as in equation 4.4) makes up the context codevector. The result of the CDT procedure may be considered as superimposed bindings of each component codevector with the context codevector, or as superimposed pairwise bindings of all component codevectors with each other. (Note that in the "self-exclusive" CDT version [section 5], the codevector of each component is not bound to itself. In the hetero-CDT version, one codevector is bound to another codevector through thinning with the latter.) According to Plate's framework, CDT as a binding procedure can be considered a special kind of superposition (disjunction) of certain elements of the tensor product of z with itself (i.e., the N² products z_i z_j). Actually, ⟨z⟩_i is a disjunction of certain z_i z_j = z_i ∧ z̃_j, where z̃_j is the jth element of permuted z (see equation 4.10). CDT can also be considered a hashing procedure: the subspace to which hashing is performed is defined by the 1s of z, and some 1s of z are mapped to that subspace.

Since the resulting bound codevector ⟨z⟩ is obtained in the CDT procedure by thinning the 1s of z (where the component codevectors are superimposed), ⟨z⟩ is similar to its component codevectors (unstructured similarity is preserved). Therefore, to retrieve the components bound in the thinned codevector, we only need to choose the most similar component codevectors from their alphabet. This can be done using an associative memory.

None of the mentioned binding operations, except for the CDT, preserves unstructured similarity. Therefore, to extract some component codevector from the bound codevector, they require knowledge of the other component codevector(s). Rebinding the bound codevector with the inverses of the known component codevector(s) then produces a noisy version of the sought component codevector. This operation is known as decoding or unbinding.
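The contrast drawn here can be illustrated numerically, under assumed parameters: a CDT-thinned codevector overlaps its components far above chance (so they are retrievable by nearest-neighbor search in the alphabet), while a BSC-style XOR-bound vector matches its components only at chance level.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 10_000, 100, 20

def random_codevector():
    v = np.zeros(N, dtype=bool)
    v[rng.choice(N, size=M, replace=False)] = True
    return v

def cdt(z, config):
    mask = np.zeros(N, dtype=bool)
    for p in config:
        mask |= z[p]
    return z & mask

config = [rng.permutation(N) for _ in range(K)]
a, b, c = (random_codevector() for _ in range(3))
bound = cdt(a | b, config)

overlap_a = int((bound & a).sum())   # component: well above chance
overlap_c = int((bound & c).sum())   # unrelated item: ~|bound| * M/N

# XOR binding of dense codes preserves no unstructured similarity:
x, y = (rng.integers(0, 2, N, dtype=np.int8) for _ in range(2))
match = float(((x ^ y) == x).mean())  # fraction of matching bits, ~0.5
```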
To eliminate noise from the unbound codevector, a clean-up memory with the full alphabet of component codevectors is also required in those schemes. If some or all components of the bound codevector are not known, decoding in those schemes requires exhaustive search (substitution, binding, and checking) through all combinations of codevectors from the alphabet. The obtained bound codevector most similar to the bound codevector being decoded then provides the information on its composition.

As in the other schemes, structured similarity is preserved by the CDT; that is, bindings of similar patterns are similar to each other. However, the character of similarity is different. In most of the other schemes, the similarity of the bound codevectors is equal to the product of the similarities of the component codevectors (e.g., Plate, 1995). For example, the similarity of a ⊛ b and a ⊛ b′ is equal to the similarity of b and b′. Therefore, if b and b′ are not similar at all, the bound vectors will not be similar. The codevectors to be bound by the CDT procedure are initially superimposed component codevectors, so their initial similarity is the mean of the components' similarities. Also, the thinning itself preserves approximately
the square of the similarity of the input vectors. So the similarity for dissimilar b and b′ will be > 0.25 instead of 0 for the other schemes.

9.1.4 How the Binding Operation Is Used to Represent Predicate Structure. In most of the schemes, predicate structures are represented by role-filler bindings. Halford et al. (1998) use predicate-argument bindings. The APNN-CDT scheme allows such representations of predicate structures as role-filler bindings and predicate-argument bindings and also offers a potential for other possible representations. Both ordered and unordered arguments can be represented.

9.1.5 Other Operations and Techniques.

Normalization. After superposition of codevectors, some normalizing transformation is used in various schemes to bring the individual elements or the total strength of the resulting codevector within certain limits. In BSCs, it is the threshold operation that converts a nonbinary codevector (the bitwise sum of component codevectors) to a binary one. In HRRs, it is the scaling of codevectors to unit length that facilitates their comparison. The CDT procedure performs a dual role: it not only binds the superimposed codevectors of components, but also normalizes the density of the resulting codevector. It would be interesting to check to what extent the normalization operations in other schemes provide the effect of binding as well.

Clean-up memory. Associative memory is used in various representation schemes for the storage of component codevectors and their recall (clean-up after finding their approximate noisy versions using unbinding). After the CDT procedure, the resulting codevector is similar to its component codevectors; however, the latter are represented in reduced form. Therefore, it is natural to use associative memories in the APNN-CDT scheme to store and retrieve the codevectors of component items of various complexity levels.
Since component codevectors of different complexity levels have approximately the same small number of 1s, an associative memory based on assembly neural networks with a simple Hebbian learning rule allows efficient storage and retrieval of a large number of codevectors.

Chunking. The problem of chunking remains one of the least developed issues in existing representation schemes. In HRRs and BSCs, chunks are normalized superpositions of stand-alone component codevectors and their bindings. In its turn, the codevector of a chunk can be used as one of the components for binding. Thus, chunking allows structures of arbitrary nesting or composition level to be built. Each chunk should be stored in a clean-up memory. When complex structures are decoded by unbinding, noisy versions of chunk codevectors are obtained. They are used to retrieve pure versions from the clean-up memory, which can be decoded in their turn.
In those schemes, the codevectors of chunks are not bound. Therefore, they cannot be superimposed without the risk of structure loss, as is repeatedly noted in this article. In the APNN-CDT scheme, any composite codevector after thinning represents a chunk. Since the component codevectors are bound in the chunk codevector, the latter can be operated on as a single whole (an entity) without confusion of components belonging to different items.

When a compositional structure is constructed using HRRs or BSCs, the chunk codevector is usually the filler, which becomes bound with some role codevector. In this case, in distinction to the APNN-CDT scheme, the components a, b, c of the chunk become bound with the role rather than with each other:

role ⊛ (a + b + c) = role ⊛ a + role ⊛ b + role ⊛ c.    (9.1)

Again, if the role is not unique, it cannot be determined to which chunk the binding role ⊛ a belongs. Also, the role codevector should be known for unbinding and subsequent retrieval of the chunk. Thus, in the representation schemes of HRRs and binary spatter codes, each of the component codevectors belonging to a chunk binds with (role) codevectors of other hierarchical levels not belonging to that chunk. Therefore such bindings may be considered "vertical." In the APNN-CDT scheme, a "horizontal" binding is essential: the codevectors of the chunk components are bound with each other.

In the schemes of Plate, Kanerva, and Gayler, the vertical binding chain role_upper-level ⊛ (role_lower-level ⊛ filler) is indistinguishable from role_lower-level ⊛ (role_upper-level ⊛ filler), because their binding operations are associative and commutative. For the CDT procedure, in contrast, ²⟨¹⟨a ∨ b⟩ ∨ c⟩ ≠ ²⟨a ∨ ¹⟨b ∨ c⟩⟩, and also ⟨⟨a ∨ b⟩ ∨ c⟩ ≠ ⟨a ∨ ⟨b ∨ c⟩⟩.

Gayler (1998) proposes binding a chunk codevector with its permuted version. This resembles the version of the thinning procedure from section 4.2, but for real-valued codevectors. Different codevector permutations for different nesting levels allow the components of chunks from different levels to be distinguished, in a similar fashion to using different configurations of thinning connections in the CDT. However, since the result of binding in the scheme of Gayler and in the other considered schemes (with the exception of APNN-CDT) is not similar to the component codevectors, in those schemes decoding of the chunk codevector created by binding with a permutation of itself will generally require exhaustive search through all combinations of component codevectors. This problem with the vertical binding schemes of Plate, Kanerva, and Gayler can be rectified by using a binding operation that, prior to a conventional binding operation, permutes its left and right arguments differently (as discussed in Plate, 1994).
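The vertical-chain ambiguity can be demonstrated directly with BSC-style XOR binding (vector size illustrative): because XOR is associative and commutative, the two nesting orders produce bit-for-bit identical codevectors.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1_000
role_upper, role_lower, filler = (rng.integers(0, 2, N, dtype=np.int8)
                                  for _ in range(3))

# the two vertical binding chains discussed above
chain1 = role_upper ^ (role_lower ^ filler)
chain2 = role_lower ^ (role_upper ^ filler)
same = bool(np.array_equal(chain1, chain2))   # True: XOR cannot tell them apart
```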
The obvious problem of tensor product representation is the growth of dimensionality of the resulting pattern obtained by the binding of components. If it is not solved, the dimensionality will grow exponentially with the nesting depth. Halford et al. (1998) consider chunking as the means to reduce the rank of the tensor representation. To realize chunking, they propose using the operations of convolution, concatenation, superposition, and some special function that associates the outer product with a codevector of lower dimension. However, the first three operations do not rule out confusion of grouping or ordering of arguments inside chunks (i.e., different composite items may produce identical chunks). And the special function (and its inverse) requires concrete definition. Probably it could be realized using associative memory, for example, of the sigma-pi type proposed by Plate (1998).

In (L)RAAMs, the chunks of different nesting levels are encoded in the same weight matrix of connections between the input layer and the hidden layer of a multilayer perceptron. This may be one of the reasons for poor generalization. Probably, if additional multilayer perceptrons were introduced for each nesting level (with the input for each following perceptron provided by the hidden layer of the preceding one, similarly to Sperduti & Starita, 1997), generalization in those schemes would improve. In the APNN, chunks (thinned composite codevectors) of different nesting levels are memorized in different autoassociative neural networks. This allows an easy similarity-based decoding of a chunk through its subchunks of the previous nesting level and decreases the memory load at each nesting level (see also Rachkovskij, in press).

9.2 Sparse Binary Schemes.
Indicating unexplored areas where useful representation schemes for nested compositional structures might be found, Plate (1997) notes that the known schemes handle sparse binary patterns poorly, because the known binding and superposition operations change the density of sparse patterns. Of the work known to us, only Sjödin (1998) expresses the idea of "thinning," in an effort to avoid the low associative memory capacity for dense binary patterns. He defines the thinning operation as preserving the 1s corresponding to the maxima of some function defined over the binary vector. The values of that function can be determined at cyclic shifts of the codevector by the number of steps equal to the ordering number of 1s in that codevector. However, it is not clear from such a description what the properties of the maxima are, and therefore what the character of similarity is.

The CDT procedure considered in this article allows the density of codevectors to be preserved while binding them. Coupled with techniques for encoding the pattern ordering, this procedure allows various representation schemes for complex structured data to be implemented. Approximately the same low density of binary codevectors at different nesting levels permits the use of identical procedures for construction, recognition, comparison,
and decoding of patterns at different hierarchical levels of the APNN architecture (Kussul, 1988, 1992; Kussul et al., 1991a). The CDT procedure preserves the similarity of encoded descriptions, allowing the similarity of complex structures to be determined by the overlap of their codevectors. Also, in the codevectors of complex structures formed using the CDT procedure, the representation of the component codevectors (a subset of their 1s) is reduced. Therefore, the APNN-CDT scheme can be considered another implementation of Hinton's (1990) reduced descriptions, Amosov's (1967) item coarsening, or the compression of Halford et al. (1998). Besides, the CDT scheme is biologically relevant, since it uses sparse representations and allows a simple neural network implementation.

10 Conclusion
The CDT procedures described in this article perform binding of items represented as sparse binary codevectors. They allow a variable number of superimposed patterns to be bound on the fly while preserving the density of the bound codevectors. The result of the CDT is of the same dimensionality as the component codevectors. Using the auto-CDT procedures as analogs of brackets in the bracketed symbolic representation of various complex data structures permits easy transformation of these representations into binary codevectors of fixed dimensionality with a small number of 1s.

Unlike other binding procedures, binding by the auto-CDT preserves the similarity of the bound codevector with each of the component codevectors. It thus becomes possible to determine the similarity of complex items with each other by the overlap of their codevectors and to retrieve in full size the codevectors of their components. Such operations are efficiently implementable by distributed associative memories, which provide high storage capacity for codevectors with a small number of 1s.

We have already used APNN-CDT style representations in applications (earlier work is reviewed in Kussul, 1992, 1993; more recent developments are described in Lavrenyuk, 1995; Rachkovskij, 1996; Kussul & Kasatkina, 1999; Rachkovskij, in press). We hope that the CDT procedures will find application in distributed representation and manipulation of complex compositional data structures, contributing to the progress of connectionist symbol processing (Touretzky, 1990, 1995; Touretzky & Hinton, 1988; Hinton, 1990; Plate, 2000). Fast (parallel) evaluation of similarity, or finding the most similar compositional items, allowed by such representations is extremely useful for the solution of a wide range of AI problems.

Acknowledgments
We are grateful to Pentti Kanerva and Tony Plate for their extensive and helpful comments, valuable suggestions, and continuous support. This work
was funded in part by the International Science Foundation, grants U4M000 and U4M200.
References

Amari, S. (1989). Characteristics of sparsely encoded associative memory. Neural Networks, 2, 445–457.
Amosov, N. M. (1967). Modelling of thinking and the mind. New York: Spartan Books.
Amosov, N. M., Baidyk, T. N., Goltsev, A. D., Kasatkin, A. M., Kasatkina, L. M., Kussul, E. M., & Rachkovskij, D. A. (1991). Neurocomputers and intelligent robots. Kiev: Naukova Dumka. (In Russian)
Artykutsa, S. Ya., Baidyk, T. N., Kussul, E. M., & Rachkovskij, D. A. (1991). Texture recognition using neurocomputer (Preprint 91-8). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Baidyk, T. N., Kussul, E. M., & Rachkovskij, D. A. (1990). Numerical-analytical method for neural network investigation. In Proceedings of the International Symposium on Neural Networks and Neural Computing—NEURONET'90 (pp. 217–222). Prague, Czechoslovakia.
Blair, A. D. (1997). Scaling-up RAAMs (Tech. Rep. No. CS-97-192). Waltham, MA: Brandeis University, Department of Computer Science.
Fedoseyeva, T. V. (1992). The problem of neural network to recognize word roots. In Neuron-like networks and neurocomputers (pp. 48–54). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Feldman, J. A. (1989). Neural representation of conceptual knowledge. In L. Nadel, L. A. Cooper, P. Culicover, & R. M. Harnish (Eds.), Neural connections, mental computation (pp. 68–103). Cambridge, MA: MIT Press.
Feldman, J. A., & Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6, 205–254.
Foldiak, P., & Young, M. P. (1995). Sparse coding in the primate cortex. In M. A. Arbib (Ed.), Handbook of brain theory and neural networks (pp. 895–898). Cambridge, MA: MIT Press.
Frasconi, P., Gori, M., & Sperduti, A. (1997). A general framework for adaptive processing of data structures (Tech. Rep. No. DSI-RT-15/97). Firenze, Italy: Università degli Studi di Firenze, Dipartimento di Sistemi e Informatica.
Frolov, A. A. (1989). Information properties of bilayer neuron nets with binary plastic synapses. Biophysics, 34, 868–876.
Frolov, A. A., & Muraviev, I. P. (1987). Neural models of associative memory. Moscow: Nauka. (In Russian)
Frolov, A. A., & Muraviev, I. P. (1988). Informational characteristics of neuronal and synaptic plasticity. Biophysics, 33, 708–715.
Gayler, R. W. (1998). Multiplicative binding, representation operators, and analogy. In K. Holyoak, D. Gentner, & B. Kokinov (Eds.), Advances in analogy research: Integration of theory and data from the cognitive, computational, and neural sciences (p. 405). Sofia, Bulgaria: New
Bulgarian University. (Poster abstract; full poster available online at http://cogprints.soton.ac.uk/abs/comp/199807020.)
Halford, G. S., Wilson, W. H., & Phillips, S. (1998). Processing capacity defined by relational complexity: Implications for comparative, developmental, and cognitive psychology. Behavioral and Brain Sciences, 21, 723–802.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hinton, G. E. (1981). Implementing semantic networks in parallel hardware. In G. E. Hinton & J. A. Anderson (Eds.), Parallel models of associative memory (pp. 161–187). Hillsdale, NJ: Erlbaum.
Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46, 47–76.
Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 77–109). Cambridge, MA: MIT Press.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554–2558.
Hopfield, J. J., Feinstein, D. I., & Palmer, R. G. (1983). "Unlearning" has a stabilizing effect in collective memories. Nature, 304, 158–159.
Hummel, J. E., & Holyoak, K. J. (1997). Distributed representations of structure: A theory of analogical access and mapping. Psychological Review, 104, 427–466.
Kanerva, P. (1988). Sparse distributed memory. Cambridge, MA: MIT Press.
Kanerva, P. (1994). The spatter code for encoding concepts at many levels. In M. Marinaro & P. G. Morasso (Eds.), ICANN '94: Proceedings of the International Conference on Artificial Neural Networks (Sorrento, Italy) (Vol. 1, pp. 226–229). London: Springer-Verlag.
Kanerva, P. (1996). Binary spatter-coding of ordered K-tuples. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen, & B. Sendhoff (Eds.), Proceedings of the International Conference on Artificial Neural Networks—ICANN'96, Bochum, Germany. Lecture Notes in Computer Science, 1112 (pp. 869–873). Berlin: Springer-Verlag.
Kanerva, P. (1998). Encoding structure in Boolean space. In L. Niklasson, M. Boden, & T. Ziemke (Eds.), ICANN 98: Perspectives in Neural Computing (Proceedings of the 8th International Conference on Artificial Neural Networks, Skoevde, Sweden) (Vol. 1, pp. 387–392). London: Springer-Verlag.
Kasatkin, A. M., & Kasatkina, L. M. (1991). A neural network expert system. In Neuron-like networks and neurocomputers (pp. 18–24). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Kussul, E. M. (1980). Tools and techniques for development of neuron-like networks for robot control. Unpublished Dr. Sci. dissertation, V. M. Glushkov Institute of Cybernetics. (In Russian)
Kussul, E. M. (1988). Elements of stochastic neuron-like network theory. In Internal Report "Kareta-UN" (pp. 10–95). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Kussul, E. M. (1992). Associative neuron-like structures. Kiev: Naukova Dumka. (In Russian)
Kussul, E. M. (1993). On some results and prospects of development of associative-projective neurocomputers. In Neuron-like networks and neurocomputers (pp. 4–11). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Kussul, E. M., & Baidyk, T. N. (1990). Design of a neural-like network architecture for recognition of object shapes in images. Soviet Journal of Automation and Information Sciences, 23, 53–58.
Kussul, E. M., & Baidyk, T. N. (1993). On information encoding in associative-projective neural networks (Preprint 93-3). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Kussul, E. M., Baidyk, T. N., Lukovich, V. V., & Rachkovskij, D. A. (1993). Adaptive neural network classifier with multifloat input coding. In Proceedings of NeuroNimes'93, Nimes, France, Oct. 25–29, 1993. EC2 Publishing.
Kussul, E. M., & Kasatkina, L. M. (1999). Neural network system for continuous handwritten words recognition. In Proceedings of the International Joint Conference on Neural Networks (Washington, DC).
Kussul, E. M., & Rachkovskij, D. A. (1991). Multilevel assembly neural architecture and processing of sequences. In A. V. Holden & V. I. Kryukov (Eds.), Neurocomputers and attention: Vol. II. Connectionism and neurocomputers (pp. 577–590). Manchester: Manchester University Press.
Kussul, E. M., Rachkovskij, D. A., & Baidyk, T. N. (1991a). Associative-projective neural networks: Architecture, implementation, applications. In Proceedings of the Fourth International Conference "Neural Networks & Their Applications," Nimes, France, Nov. 4–8, 1991 (pp. 463–476).
Kussul, E. M., Rachkovskij, D. A., & Baidyk, T. N. (1991b). On image texture recognition by associative-projective neurocomputer. In C. H. Dagli, S. Kumara, & Y. C. Shin (Eds.), Proceedings of the ANNIE'91 Conference "Intelligent Engineering Systems Through Artificial Neural Networks" (pp. 453–458). ASME Press.
Lansner, A., & Ekeberg, O. (1985). Reliability and speed of recall in an associative network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7, 490–498.
Lavrenyuk, A. N. (1995). Application of neural networks for recognition of handwriting in drawings. In Neurocomputing: Issues of theory and practice (pp. 24–31). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Legendy, C. R. (1970). The brain and its information trapping device. In J. Rose (Ed.), Progress in cybernetics, Vol. 1. New York: Gordon and Breach.
Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology, 202, 437–470.
Milner, P. M. (1974). A model for visual shape recognition. Psychological Review, 81, 521–535.
Milner, P. M. (1996). Neural representations: Some old problems revisited. Journal of Cognitive Neuroscience, 8, 69–77.
Palm, G. (1980). On associative memory. Biological Cybernetics, 36, 19–31.
Palm, G. (1993). The PAN system and the WINA project. In P. Spies (Ed.), EuroArch'93 (pp. 142–156). Berlin: Springer-Verlag.
Palm, G., & Bonhoeffer, T. (1984). Parallel processing for associative and neuronal networks. Biological Cybernetics, 51, 201–204.
Plate, T. A. (1991). Holographic reduced representations: Convolution algebra for compositional distributed representations. In J. Mylopoulos & R. Reiter (Eds.), Proceedings of the 12th International Joint Conference on Artificial Intelligence (pp. 30–35). San Mateo, CA: Morgan Kaufmann.
Plate, T. A. (1994). Distributed representations and nested compositional structure. Unpublished PhD dissertation, University of Toronto.
Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6, 623–641.
Plate, T. (1997). A common framework for distributed representation schemes for compositional structure. In F. Maire, R. Hayward, & J. Diederich (Eds.), Connectionist systems for knowledge representation and deduction (pp. 15–34). Brisbane: Queensland University of Technology.
Plate, T. (1998). Randomly connected sigma-pi neurons can form associative memories. Submitted.
Plate, T. (2000). Structured operations with vector representations. Expert Systems: International Journal of Knowledge Engineering and Neural Networks, 17, 29–40.
Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46, 77–105.
Rachkovskij, D. A. (1990a). On numerical-analytical investigation of neural network characteristics. In Neuron-like networks and neurocomputers (pp. 13–23). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Rachkovskij, D. A. (1990b). Development and investigation of multilevel assembly neural networks. Unpublished Ph.D. dissertation, V. M. Glushkov Institute of Cybernetics. (In Russian)
Rachkovskij, D. A. (1996). Application of stochastic assembly neural networks in the problem of interesting text selection. In Neural network systems for information processing (pp. 52–64). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Rachkovskij, D. A. (in press). Representation and processing of structures with binary sparse distributed codes. IEEE Transactions on Knowledge and Data Engineering (Special Issue). Draft available at [email protected].
Rachkovskij, D. A., & Fedoseyeva, T. V. (1990). On audio signals recognition by multilevel neural network. In Proceedings of the International Symposium on Neural Networks and Neural Computing—NEURONET'90 (pp. 281–283). Prague, Czechoslovakia.
Rachkovskij, D. A., & Fedoseyeva, T. V. (1991). Hardware and software neurocomputer system for recognition of acoustical signals. In Neuron-like networks and neurocomputers (pp. 62–68). Kiev, Ukraine: V. M. Glushkov Institute of Cybernetics. (In Russian)
Shastri, L., & Ajjanagadde, V. (1993). From simple associations to systematic reasoning: Connectionist representation of rules, variables, and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences, 16, 417–494.
Sjödin, G. (1998). The Sparchunk code: A method to build higher-level structures in a sparsely encoded SDM. In Proceedings of IJCNN'98 (pp. 1410–1415). Piscataway, NJ: IEEE.
452
Dmitri A. Rachkovskij and Ernst M. Kussul
Sjodin, ¨ G., Kanerva, P., Levin, B., & Kristoferson, J. (1998). Holistic higher-level structure-forming algorithms. In Proceedings of 1998 Real World Computing Symposium—RWC’98 (pp. 299–304). Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Articial Intelligence, 46, 159– 216. Sperduti, A. (1994). Labeling RAAM. Connection Science, 6, 429–459. Sperduti, A., & Starita, A. (1997). Supervised neural networks for the classication of structures. IEEE Transactions on Neural Networks, 8, 714–735. Touretzky, D. S. (1990). BoltzCONS: Dynamic symbol structures in a connectionist network. Articial Intelligence, 46, 5–46. Touretzky, D. S. (1995). Connectionist and symbolic representations. In M. A. Arbib (Ed.), Handbook of brain theory and neural networks (pp. 243–247). Cambridge, MA: MIT Press. Touretzky, D. S., & Hinton, G. E. (1988). A distributed connectionist production system. Cognitive Science, 12, 423–466. Tsodyks, M. V. (1989). Associative memory in neural networks with the Hebbian learning rule. Modern Physics Letters B, 3, 555–560. Vedenov, A. A. (1987). “Spurious memory” in model neural networks (Preprint IAE4395/1). Moscow: I. V. Kurchatov Institute of Atomic Energy. Vedenov, A. A. (1988). Modeling of thinking elements. Moscow: Science. (In Russian) von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2). Gottingen, Germany: Max-Planck-Institute for Biophysical Chemistry, Department of Neurobiology. von der Malsburg, C. (1985). Nervous structures with dynamical links. Ber. Bunsenges. Phys. Chem., 89, 703–710. von der Malsburg, C. (1986) Am I thinking assemblies? In G. Palm & A. Aertsen (Eds.), Proceedings of the 1984 Trieste Meeting on Brain Theory (pp. 161–176). Heidelberg: Springer-Verlag. Willshaw, D. (1981). Holography, associative memory, and inductive generalization. In G. E. Hinton & J. A. 
Anderson (Eds.), Parallel models of associative memory (pp. 83–104). Hillside, NJ: Erlbaum. Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Nonholographic associative memory. Nature, 222, 960–962. Received May 6, 1999; accepted April 6, 2000.
LETTER
Communicated by John Platt
Resolution-Based Complexity Control for Gaussian Mixture Models

Peter Meinicke
Helge Ritter
Neuroinformatics Group, University of Bielefeld, Bielefeld, Germany

In the domain of unsupervised learning, mixtures of gaussians have become a popular tool for statistical modeling. For this class of generative models, we present a complexity control scheme, which provides an effective means for avoiding the problem of overfitting usually encountered with unconstrained (mixtures of) gaussians in high dimensions. According to some prespecified level of resolution as implied by a fixed-variance noise model, the scheme provides an automatic selection of the dimensionalities of some local signal subspaces by maximum likelihood estimation. Together with a resolution-based control scheme for adjusting the number of mixture components, we arrive at an incremental model refinement procedure within a common deterministic annealing framework, which enables an efficient exploration of the model space. The advantages of the resolution-based framework are illustrated by experimental results on synthetic and high-dimensional real-world data.

1 Introduction
A crucial point in unsupervised learning is the control of model complexity, since it balances the two competing objectives of generalization and fidelity. Unfortunately the basic multivariate gaussian mixture model (e.g., Titterington, Smith, & Makov, 1985) does not provide an appropriate complexity control, which would allow finding a suitable trade-off between overfitting and oversmoothing. In particular, with unconstrained covariance matrices, the number of free parameters grows quadratically with the input dimensionality. Therefore, several approaches have been proposed that aim to restrict the complexity of the original model by some additional constraints. Besides rather coarse techniques like using diagonal or spherical covariance models or sharing some covariance parameters among the mixture components (e.g., Fraley, 1998), there are more flexible methods like Bayesian regularization (Ormoneit & Tresp, 1998; Bishop, 1999) or subspace modeling (Roweis & Ghahramani, 1999; Tipping & Bishop, 1999), which allow for a more precise complexity control. As an alternative approach within this field, in this article we present a scheme based on the control of two kinds of complexity inherent in mixtures

Neural Computation 13, 453–475 (2001)
© 2001 Massachusetts Institute of Technology
of multivariate gaussians, both of them adjusted by a single continuous parameter, which determines the degree of smoothing and thus the level of resolution imposed by the model. The first type of complexity arises if the model partitions the input space into subspaces of signal and noise. The dimensionality of the signal subspace, which is assumed to represent the relevant variability of the data, may be used to control what we will loosely call the vertical complexity of the model. The second type of complexity arises if the model partitions the input space into regions that can be associated with some prototypical reference points. In that case the number of localized prototypes may be used to adjust what we call the horizontal complexity of the model.¹ As a main result we show that a specific resolution implies a certain vertical complexity of a multivariate gaussian model, since according to a maximum likelihood criterion, it determines the dimensionality of a principal signal subspace. Thus we achieve continuous control of the vertical complexity, similar to Bayesian approaches (Ormoneit & Tresp, 1998; Bishop, 1999), while preserving the number of subspace dimensions as a meaningful parameter. In contrast to a purely discrete control scheme (Roweis & Ghahramani, 1999; Tipping & Bishop, 1999), which estimates the noise variance(s) for a given subspace dimensionality, the latter complexity parameter is no longer prespecified but becomes subject to maximum likelihood estimation at a specific resolution, as implied by the fixed-variance noise model. Since the approach extracts a principal signal subspace spanned by the leading eigenvectors of the sample covariance matrix, one may denote it as maximum likelihood principal component analysis (ML-PCA), referring to our scheme as the dimensionality-estimating (D-PCA) and to the discrete scheme as the noise-estimating (N-PCA) one.
In the following section we indicate that the dimensionality-estimating approach may be advantageous over the noise-estimating one in several respects. As an example, which we introduce in section 6, it can considerably facilitate the design of modular plug-in classifiers while avoiding the problems that result from noise estimation in high dimensions. On the other hand, it has been known for some time that a prespecified noise variance can control the horizontal complexity of a mixture of spherical gaussians with equal mixing weights (Geman & Hwang, 1982). In this

¹ The motivation for this choice of terminology lies in differential geometry. If we view the linear subspaces as "attached" to the individual prototypes to describe the variation of the data in their vicinity and consider the observed data as noisy samples of a smooth manifold, the resulting construction might be viewed as a discretized model (given by the discrete set of prototypes) of a manifold whose points have become augmented with locally attached vector spaces. In differential geometry, such a structure is known as a vector bundle (Spivak, 1979). The "fibers" of a vector bundle are the individual subspaces; geometrically, they are viewed as a "vertical" extension of the "horizontal" base manifold, and there is a canonical projection that maps each subspace into the point of the base manifold to which it is attached.
article, as a natural combination of the two complexity control schemes, we propose a sequential application within a common deterministic annealing (Durbin, Szeliski, & Yuille, 1989; Rose, Gurewitz, & Fox, 1990) framework. At a coarse resolution, a decrease of the noise variance is used to control the number of distinct prototypes (horizontal complexity) as they arise during optimization with respect to the mixture of spherical gaussians. At a higher level of resolution, a further decrease of this variance is used for an adaptation of the local subspace dimensionalities (vertical complexity) of the gaussian component densities. In that way the initialization problem (McKenzie & Alder, 1994), usually encountered with the fitting of gaussian mixture models, is remedied, since one starts with the unique maximum likelihood estimator (MLE) of the simplest model and increases the complexity by gradually deforming the objective function by deterministic annealing, which means tracking the maximum likelihood solution across decreasing noise levels. For the sake of clarity, we first consider the vertical and horizontal complexity control schemes separately. Then the combined algorithm is introduced, and we present some experimental results on synthetic and real-world data. We close with a discussion of the method in the context of related approaches.

2 Vertical Complexity
With respect to a certain class of generative models, vertical complexity can be controlled by the dimensionality of a subspace, which is assumed to cover the relevant variability of some multivariate distribution. Especially in high-dimensional spaces, it is reasonable to assume that the d-dimensional data in essence reflect the variation of a subset of q ≤ d "meaningful" random variables superposed by some d-dimensional noise. Thus, the objective of model fitting is to exploit some subspace structure within the data, which often results from their concentration near some manifold of lower dimensionality. This gives rise to the following generative model:

\[
\mathbf{x} = f(\mathbf{z}) + \mathbf{u},
\tag{2.1}
\]

where f : ℝ^q → ℝ^d is a continuous function and z, u are q- and d-dimensional random vectors, respectively. The question about an adequate q is closely related to the question about the intrinsic dimensionality of the data, and normally one starts with a hypothesis about this entity before estimating the model parameters. However, we may alternatively start with a hypothesis about the actual noise present in the data, since it implies some level of resolution, which determines whether some subspace structure becomes apparent or stays invisible. In the following we will consider a "gaussian" realization of such an approach.
2.1 Resolution-Dependent Gaussian. A parametric model, which has some analytical and computational attractiveness, is achieved if f is affine and all random variables are gaussian. In particular, we shall assume the signal z ~ N_q(0, Δ) with Δ diagonal and the isotropic noise u ~ N_d(0, σ²I) to be uncorrelated. For convenience we shall further assume all diagonal elements of Δ to be distinct² and nonzero. With an affine mapping f(z) = Az + μ, all components of x are linear combinations of gaussian random variables, and thus the observables can be characterized by a multivariate normal density,

\[
g(\mathbf{x} \mid \mathbf{w}) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right),
\tag{2.2}
\]
where the vector w contains the parameters of μ and Σ, and |·| denotes the determinant operation. With the above definitions of z and u, the covariance matrix Σ, which is the expectation E[(Az + u)(Az + u)^T], can be decomposed according to

\[
\Sigma = \Psi + \sigma^2 I, \qquad \operatorname{rank} \Psi = q,
\tag{2.3}
\]

where I is the d × d identity matrix and, for q < d, Ψ is a singular matrix. With an eigendecomposition Σ = ΓΛΓ^T, it can be parameterized by

\[
\Psi = \Gamma_q (\Lambda_q - \sigma^2 I_q)\, \Gamma_q^T,
\tag{2.4}
\]
where Λ_q = diag(λ_1, …, λ_q) is the diagonal matrix of the q largest population eigenvalues of Σ with λ_1 > ⋯ > λ_q > σ², and Γ_q, which takes the role of the previous A, contains the corresponding q eigenvectors as columns. The MLEs of the model parameters require consideration of only the first- and second-order statistics of the sample, which we represent by the data set X = {x_1, …, x_N} ⊂ ℝ^d. Then the required statistics are given by the sample mean and sample covariance matrix,

\[
\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i, \qquad
S = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T,
\tag{2.5}
\]

respectively. With the above model, the MLE of the population mean is simply μ̂ = x̄, while the MLE of Ψ requires an eigendecomposition of S = C L C^T. However, before obtaining Ψ̂ we have to determine σ².

² The assumption of distinct signal variances is not really necessary, but it enables a unique parameterization of the corresponding covariance model. See equations 2.3 and 2.4.
With a discrete complexity control, the above model is used with q fixed (Moghaddam & Pentland, 1997; Tipping & Bishop, 1999; Roweis & Ghahramani, 1999), and an optimal σ² is estimated from the data. In that case the MLE σ̂² is given by the arithmetic mean of the d − q smallest eigenvalues of S (Anderson, 1963). As an alternative, we propose an approach that reverses the classical roles of noise and dimensionality: instead of fixing q and estimating σ², a certain noise variance is hypothesized, and the subspace dimensionality now becomes the quantity that is estimated. With σ² fixed, it is shown in the appendix that the MLEs of the population eigenvalues are equal to σ² if their sample counterparts are smaller than or equal to the assumed noise variance. Therefore, the number of sample eigenvalues that exceed σ² determines the optimal subspace dimensionality,

\[
\hat{q} = |\{\, l_i : l_i > \sigma^2,\ i = 1, \ldots, d \,\}|,
\tag{2.6}
\]

where the l_i are the eigenvalues of S, with l_1 > ⋯ > l_d. Since the MLEs of Λ_q and Γ_q are given by the q̂ largest sample eigenvalues and the corresponding eigenvectors (see the appendix), the MLE of the signal covariance Ψ can now be calculated according to equation 2.4 by inserting the former MLEs:

\[
\hat{\Psi} = C_{\hat{q}} \left[\operatorname{diag}(l_1, \ldots, l_{\hat{q}}) - \sigma^2 I_{\hat{q}}\right] C_{\hat{q}}^T.
\tag{2.7}
\]
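The estimation procedure of equations 2.5 through 2.7 is simple enough to sketch directly. The following is a minimal NumPy sketch (the function name `dpca_fit` and its interface are our own illustration, not part of the original presentation):

```python
import numpy as np

def dpca_fit(X, sigma2):
    """D-PCA: given a hypothesized noise variance sigma2, estimate the
    mean, the signal covariance Psi (eq. 2.7), and the subspace
    dimensionality q (eq. 2.6) by maximum likelihood."""
    # sample mean and (biased, 1/N) sample covariance (eq. 2.5)
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False, bias=True)
    # eigendecomposition S = C L C^T, sorted into descending order
    l, C = np.linalg.eigh(S)
    l, C = l[::-1], C[:, ::-1]
    # number of sample eigenvalues exceeding the noise variance (eq. 2.6)
    q = int(np.sum(l > sigma2))
    # MLE of the signal covariance (eq. 2.7)
    Psi = C[:, :q] @ np.diag(l[:q] - sigma2) @ C[:, :q].T
    return mu, Psi, q
```

Note how the roles are reversed relative to N-PCA: σ² is an input, and the rank of Ψ̂ falls out of the eigenvalue comparison.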
In essence, the above formalism may be viewed as a statistical model for principal component analysis (PCA) under the assumption of normally distributed data. Since the principal directions of the gaussian are estimated by the corresponding eigenvectors of the sample covariance matrix, in practice the application of the formalism would be similar to conventional PCA. If we choose σ² ≥ l_1, that is, the noise variance exceeds or equals the largest sample eigenvalue, then the model is spherical with q̂ = 0. While decreasing σ², q̂ increases, and the model covariance is estimated within the principal subspace spanned by the first q̂ eigenvectors of S. From the role of the noise variance, we have a clear difference to PCA: while the latter aims at a small residual variance within the (d − q)-dimensional orthogonal subspace, the above probabilistic model fits well if the residual (co)variance is isotropic. Nevertheless, the maximum likelihood scheme can be used as an alternative to conventional PCA if there is evidence that the sample is from a multivariate normal distribution. There are two possible ways for the maximum likelihood variant on PCA to proceed. In comparison with the noise-estimating (N-PCA) approach, we argue that there are several advantages associated with the proposed dimensionality-estimating (D-PCA) scheme: In high-dimensional spaces, estimation of σ² becomes unreliable unless a huge sample is available. In that case, N-PCA will usually suffer from underestimation of the noise variance, which is obvious for (nearly) singular S.
In several applications, for example, pattern recognition, that require the training of different class models, with all data from the same sensor, a common noise variance seems plausible and is easily realized by the D-PCA model (see section 6 for an example). For a mixture model, it seems plausible to allow for different local subspace dimensionalities of the component densities. Within the N-PCA framework, it would be cumbersome to test all possible combinations of local dimensionalities. Furthermore (as indicated by the experimental results), the D-PCA model seems less sensitive to overfitting due to growing numbers of mixture components. Especially the last point is of particular importance, since it allows tackling the problem of differing local dimensionalities in a convenient way, while providing more robustness with respect to wrong choices of the number of mixture components. The corresponding extension to a mixture model, based on the constrained gaussian, is straightforward and is the subject of section 4.

3 Horizontal Complexity
With respect to a certain class of generative models, the horizontal complexity can be controlled by the number of localized sources, which form a finite mixture distribution. The assumption is that the N-point data set is a sample from a distribution with a smaller number of K local means that correspond to different "prototypes." Therefore, we expect the data points to be concentrated in certain regions of the input space, and the objective of model fitting is to discover some cluster structure within the sample. A probabilistic model can be defined by an appropriate mixture of unimodal densities,

\[
p(\mathbf{x} \mid \mathbf{W}) = \sum_{i=1}^{K} \pi_i\, f(\mathbf{x} \mid \mathbf{w}_i),
\tag{3.1}
\]
where W comprises all parameters of the component densities f(x | w_i) plus the mixing weights π_i, which sum to unity and can be interpreted as the prior probabilities that a data point is generated by the ith component source. The "classical" non-Bayesian approach to probabilistic clustering of vector data typically relies on discrete complexity control since it usually assumes a certain number K of prototypes, and then, besides the local means, it optimizes the corresponding mixing and variance parameters (e.g., Everitt, 1980). An alternative scheme arises from a consideration of the classical K-means (least-squares) optimality criterion within a maximum entropy framework (Rose et al., 1990). This leads to an optimization scheme for maximizing the likelihood of a resolution-dependent mixture model.
3.1 Resolution-Dependent Mixture of Spherical Gaussians. If we take the component densities to be spherical gaussians with equal (noise) variances and equal a priori probabilities, we have the following mixture density:

\[
p(\mathbf{x} \mid \boldsymbol{\theta}) = K^{-1} (2\pi\sigma^2)^{-d/2} \sum_{i=1}^{K} \exp\left(-\frac{1}{2\sigma^2} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2\right),
\tag{3.2}
\]
with θ = (μ_1^T, …, μ_K^T)^T. It has been observed (Geman & Hwang, 1982) that with σ² fixed at a certain (nonzero) level and starting with K = N components, maximum likelihood estimation with respect to the local means leads to a number K < N of distinct μ̂_i; that is, some of the centroids will coalesce to maximize the likelihood. This indicates that the optimal horizontal complexity corresponds to a smaller number of prototypes, since the coinciding components can be replaced by a single one (with increased mixing weight) without changing the likelihood of the model. It also illustrates the well-known fact that the gaussian kernel density estimator with K = N and μ_i = x_i in general is not an MLE and offers the possibility of controlling the number of distinct prototypes by adjusting the noise level σ² (e.g., Weiss, 1998). Although it would be possible to start with the kernel density estimator and maximize the likelihood with respect to the μ_i at a certain σ² level, such a strategy would be prohibitive for larger sample sizes. So instead of iteratively reducing a complex model, we may ask if there is a way of incrementally enlarging a model of minimum horizontal complexity. Indeed, a suitable method, which has been proposed in terms of maximum entropy clustering, comes with the technique of deterministic annealing (DA) (Durbin et al., 1989; Rose et al., 1990), which we will consider here solely from the perspective of maximum likelihood estimation. It can be shown that at an infinite variance, the estimated positions of a finite number of centroids all coincide in the sample mean, which then is the unique maximizer of a concave likelihood function. The idea is then to start the optimization with that minimal model at some large σ²_max and gradually decrease the variance while allowing the coinciding components to split up in order to increase the likelihood of the model.
In that way, the variance parameter σ² introduces a homotopy, which gradually deforms the log-likelihood function of the model,

\[
L(\boldsymbol{\theta};\, X, \sigma^2) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} \exp\left(-\frac{1}{2\sigma^2} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2\right) - C,
\tag{3.3}
\]
with C = N log K + (Nd/2) log(2πσ²). Now for each decrement of σ², equation 3.3 has to be maximized with respect to θ. Note that a decrease of σ²
corresponds to a reduction of the uncertainty in the data-to-cluster assignments, and therefore in the limit σ²_min → 0, a hard clustering of the data is achieved. The incremental model refinement is achieved through successive prototype splitting, which occurs at bifurcation points of a σ²-parameterized curve, tracking the MLE of θ. A necessary condition for a split to occur is that the Hessian of the log-likelihood function becomes singular; that is, one of its eigenvalues equals zero. It has been shown (Rose, Gurewitz, & Fox, 1992) that for an initial configuration where all centroids coincide in the sample mean, this can happen only if σ² equals one of the eigenvalues of the sample covariance matrix S. Therefore, the first split can occur if σ² attains the largest eigenvalue of S. In addition, a small portion of noise has to be added to the centroids in order to enable a split. Starting with the largest sample eigenvalue, usually a negative exponential decay of σ², often replaced by the inverse temperature β = 1/(2σ²), is chosen to realize an appropriate annealing schedule (Hofmann, Puzicha, & Buhmann, 1998). As compared with classical K-means-like approaches to clustering, we might expect that the homotopy-based technique will be less sensitive to poor local extrema of the objective function. With DA, these extrema appear only for lower values of σ², and we might hope that then the annealing scheme will already have arrived at a better optimum. Indeed, this conjecture is supported by experimental results (Hofmann et al., 1998) indicating that although no final global optimum can be guaranteed, the clustering performance of DA is superior in this domain. Recently some deeper theoretical justification has been given by Buhmann (1998).

4 Combined Complexity Control
Within the discrete control scheme as proposed by Tipping and Bishop (1999), different proportions of horizontal and vertical complexity constitute a finite set of models that have to be compared in order to find the most appropriate one for the problem at hand. With the resolution-based framework, we have to take a sample from the continuous model space in order to compare models of different complexity. For that purpose, we now propose combining the previous two complexity control schemes within a common deterministic annealing framework that allows realizing an incremental model refinement procedure for general mixtures of gaussians. This provides a rather efficient means for sampling from the σ²-parameterized space, since each successive model optimization builds on the parameter estimates of the previous one.

4.1 Two-Phase Deterministic Annealing. For the combination of the two control schemes, we first notice that with both kinds of complexity, a model refinement results from decreasing the hypothesized noise variance. However, from the two preceding sections, we know that an ambiguity must
arise if σ² attains the largest eigenvalue of the sample covariance matrix. Since the maximum likelihood criterion itself does not suggest a specific choice, we have to decide which complexity has to be increased first. Although other strategies may be possible, a rather natural combination of the two formalisms therefore arises from a simple concatenation of the two complexity control schemes. During a first phase of the homotopy-based optimization, the splitting scheme of section 3 is used to control the horizontal complexity. This means fitting a mixture of spherical gaussians, that is, maximizing equation 3.3 with respect to θ, while decreasing σ² until a certain variance σ²_hmin is reached. During the second phase, the annealing scheme is continued, but now the number of component densities is fixed, and vertical complexity is allowed to increase by subspace adaptation of the local gaussians. This means fitting a mixture of subspace-constrained gaussians (as defined in section 2) by maximization of the following log-likelihood function,

\[
L(\mathbf{W};\, X, \sigma^2) = \sum_{i=1}^{N} \log \sum_{j=1}^{K}
\frac{\pi_j \exp\left(-\tfrac{1}{2}(\mathbf{x}_i - \boldsymbol{\mu}_j)^T (\Psi_j + \sigma^2 I)^{-1} (\mathbf{x}_i - \boldsymbol{\mu}_j)\right)}
{(2\pi)^{d/2}\, |\Psi_j + \sigma^2 I|^{1/2}},
\tag{4.1}
\]
with respect to W, which now contains the parameters w_j of the K gaussian densities g(x | w_j) together with the mixing weights, which can now be treated as free parameters. The maximization is repeated for each further decrement of σ² until a final variance σ²_vmin is reached. Although the transition between the two complexity control phases cannot be guaranteed to be continuous in the log-likelihood, in practice we observed the occurring jumps to be rather small. Figure 1 summarizes this combined optimization scheme in a more algorithmic fashion. An appropriate value for σ²_max is the largest eigenvalue of the sample covariance matrix. Suitable values for the variance decrease rate α, which yield a sufficient tracking behavior, turn out to lie between 0.5 and 1.0, but different annealing schedules that are more sensitive to the problem at hand are also possible. Since a specific choice of σ²_hmin implicitly determines the number of mixture components, one may alternatively choose a certain K and stop the first annealing phase if all component means have split up. Thereby the number of components is subject to model selection, and besides cross-validation, many criteria have been suggested to find a suitable K, where some of them have recently been reviewed by Roberts, Husmeier, Rezek, & Penny (1998) within a comparative study. The value of σ²_vmin determines the vertical complexity of the model, and a specific choice again requires some model selection criterion, which in principle can be the same as in the horizontal case. Within a Bayesian framework, a suitable evidence approximation (e.g., Roberts et al., 1998) may be used for
1. Define σ²_max, σ²_hmin, σ²_vmin, α ∈ ]0, 1[    (bounds and decrease rate of σ²)
2. Set m = 0, σ²_m = σ²_max    (initialization)
3. Maximize L(θ; X, σ²_m)    (via EM optimization, for example)
4. Set m = m + 1, σ²_m = α σ²_{m−1}    (decrease σ²)
5. If σ²_m > σ²_hmin, go to 3    (check for lower bound)
6. Maximize L(W; X, σ²_m)    (via EM optimization, for example)
7. Set m = m + 1, σ²_m = α σ²_{m−1}    (decrease σ²)
8. If σ²_m > σ²_vmin, go to 6; else stop    (check for lower bound)

Figure 1: Combined algorithm for incremental model refinement by deterministic annealing.
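The loop of Figure 1 can be sketched as a generic annealing driver. This is a minimal sketch in which `maximize_h` and `maximize_v` are placeholders for the EM optimizations of equations 3.3 and 4.1; all names here are our own illustration, not the authors':

```python
import numpy as np

def two_phase_da(X, maximize_h, maximize_v, s2_hmin, s2_vmin, alpha=0.9):
    """Two-phase deterministic annealing driver (Figure 1, sketched).

    maximize_h / maximize_v are callables (params, s2) -> params that
    stand in for the EM maximizations of eqs. 3.3 and 4.1.
    """
    # steps 1-2: sigma^2_max is the largest eigenvalue of S;
    # the minimal model's unique MLE is the sample mean
    S = np.cov(X, rowvar=False, bias=True)
    s2 = np.linalg.eigvalsh(S).max()
    params = X.mean(axis=0)
    # steps 3-5: first phase, horizontal complexity (prototype splitting)
    while s2 > s2_hmin:
        params = maximize_h(params, s2)
        s2 *= alpha
    # steps 6-8: second phase, vertical complexity (subspace adaptation)
    while s2 > s2_vmin:
        params = maximize_v(params, s2)
        s2 *= alpha
    return params, s2
```

The driver tracks the maximum likelihood solution across decreasing noise levels; each optimization is warm-started from the parameters of the previous one.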
suggesting a final noise level, provided our assumption of local gaussianity holds true. In essence, specific values of σ²_hmin, σ²_vmin, and α select a certain set of candidates from a continuous model space. The model instances of this set may be characterized by the values of the complexity parameters as depicted in Figure 2. Although the first annealing phase may be viewed as an initialization step, we emphasize the fact that unlike the many initialization procedures
[Figure 2: grid of model cells (K = 1, 2, 3 across; Q = 0, 1, 2 down) along a σ² axis, with markers at σ²_hmin and σ²_vmin.]

Figure 2: Example of a discretized model space with different proportions of the number of mixture components K and the sum of (local) subspace dimensionalities Q = Σ_{i=1}^{K} q_i. The proposed algorithm always starts from the upper left corner and moves along the first row until the noise variance attains a certain σ²_hmin. Then the algorithm moves downward until a final σ²_vmin is reached.
that are based on somewhat arbitrary heuristics, both DA phases in principle maximize the same objective function, which is only gradually deformed during annealing. From this we might argue the other way around that using a hard-clustering algorithm, as many practitioners advocate for initialization when fitting a mixture of unconstrained gaussians, is indeed a special case of the above scheme, since we might extend the first phase to the limiting case of zero noise and then fit a model of maximum complexity during the second phase.

4.2 EM-Optimization. Although several optimization methods may be suitable for maximization of the above log-likelihood functions at certain σ² levels, the EM algorithm (Dempster, Laird, & Rubin, 1977) is a convenient tool for that purpose, since neither computation of the Hessian nor step-size adjustment as in ordinary descent methods is required. The EM scheme as applied to maximization of equation 4.1 comprises two steps, which are iterated until convergence:

E-step: Calculation of the expected values of the auxiliary indicator variables,

\[
h_{ij} = \pi_j\, g(\mathbf{x}_i \mid \mathbf{w}_j)\, /\, p(\mathbf{x}_i \mid \mathbf{W}),
\tag{4.2}
\]
which denote the posterior probability that x_i has been generated by the jth mixture component and which are part of the expected "complete data" log-likelihood function (Dempster et al., 1977).

M-step: Maximization of the expected complete data log-likelihood can be achieved by solving K independent convex optimization problems, each with respect to the corresponding component parameters w_j. Each component optimization requires calculation of the local sample mean and covariance matrix,

\[
\bar{\mathbf{x}}_j = \frac{1}{n_j} \sum_{i=1}^{N} h_{ij}\, \mathbf{x}_i, \qquad
S_j = \frac{1}{n_j} \sum_{i=1}^{N} h_{ij}\, (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T,
\tag{4.3}
\]

where n_j = Σ_{i=1}^{N} h_{ij}. Now μ̂_j = x̄_j, the mixing weights are updated by π̂_j = n_j / N, and each of the K optimizations is completed by an eigendecomposition of S_j followed by calculation of q̂_j and Ψ̂_j according to equations 2.6 and 2.7.
The application of the EM algorithm to maximization of the log-likelihood of equation 3.3 is analogous, with the difference that the M-step requires only the calculation of the local sample means.
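A single EM iteration combining equations 4.2 and 4.3 with the D-PCA updates of equations 2.6 and 2.7 might look as follows. This is a minimal sketch; the function name `em_step`, the data layout, and the direct inversion of Ψ_j + σ²I are our own choices for illustration:

```python
import numpy as np

def em_step(X, pi, mu, Psi, sigma2):
    """One EM iteration for the subspace-constrained mixture (sketch).

    pi is (K,), mu is (K, d), Psi is a list of K (d, d) signal
    covariance matrices; sigma2 is the fixed noise variance.
    """
    N, d = X.shape
    K = len(pi)
    # E-step: responsibilities h_ij (eq. 4.2)
    H = np.empty((N, K))
    for j in range(K):
        Sig = Psi[j] + sigma2 * np.eye(d)
        diff = X - mu[j]
        quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sig), diff)
        H[:, j] = pi[j] * np.exp(-0.5 * quad) / np.sqrt(
            (2 * np.pi) ** d * np.linalg.det(Sig))
    H /= H.sum(axis=1, keepdims=True)
    # M-step: local statistics (eq. 4.3) and D-PCA updates (eqs. 2.6, 2.7)
    n = H.sum(axis=0)
    pi_new = n / N
    mu_new = np.empty_like(mu)
    Psi_new = []
    for j in range(K):
        m = (H[:, j] @ X) / n[j]
        Xc = X - m
        Sj = (H[:, j, None] * Xc).T @ Xc / n[j]
        l, C = np.linalg.eigh(Sj)
        l, C = l[::-1], C[:, ::-1]
        q = int(np.sum(l > sigma2))
        Psi_new.append(C[:, :q] @ np.diag(l[:q] - sigma2) @ C[:, :q].T)
        mu_new[j] = m
    return pi_new, mu_new, Psi_new
```

Each of the K component updates is independent, as noted above; the spherical first-phase variant simply omits the eigendecomposition and updates the means only.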
[Figure 3: two 3D scatter plots with axes X, Y, Z, panels (a) and (b).]

Figure 3: Two kinds of simulated data together with local unit Mahalanobis-distance ellipsoids of the fitted model. (a) Sample from a mixture of three 3D gaussians and the K = 3 model. (b) Sample from a noisy spiral and the K = 8 model.

5 Simulations
We now consider the performance of the resolution-dependent mixture model on different synthetic data sets. As depicted in Figure 3, the data were sampled from two three-dimensional (3D) distributions of different type, which show some cluster and curve structure, respectively. In both cases, training sets were kept small, at a size of N = 100, to simulate sparse data. As a performance measure, we always used the log-likelihood of an independent "unseen" test set, which we sampled from the same distribution as the training set.

5.1 Mixture Data. As the first experiment should mainly illustrate how the dimensionality-estimating scheme works in practice, the data were sampled from a mixture of three trivariate gaussians with equal a priori probabilities. The three components had covariance structures with standard deviations (0.5, 0.1, 0.1), (0.5, 0.3, 0.1), and (0.5, 0.4, 0.3) along the three principal axes, respectively, to simulate different local intrinsic dimensionalities, that is, {q_i} = {1, 2, 3}. The resulting three clusters were randomly rotated and placed at the corners of an equilateral triangle with two units edge length. The test set was chosen to have the same size as the training set, which consisted of 100 points. The resolution-dependent model was fitted by the two-phase deterministic annealing scheme using the EM algorithm for local maximization of the log-likelihood. Thereby, the decay rate was fixed throughout at α = 0.9, and σ²_max was always set to the largest eigenvalue of S. The first annealing phase was run until all K component means had split up. Then the second phase
Complexity Control for Gaussian Mixture Models
Figure 4: Log-likelihood per data point on the test and training sets of the "mixture" data, plotted against the logarithm of the increasing inverse temperature β = 1/(2σ²) for a model with K = 3 components.
was run until the performance of the model had reached a maximum. As expected, the best performance was achieved for a K = 3 model; the corresponding log-likelihood on training and test data is plotted against the logarithm of the inverse temperature β = 1/(2σ²) in Figure 4. While the test set log-likelihood achieves a maximum near the smallest component variance of the generating model, corresponding to log β ≈ 3.9, the training set log-likelihood increases monotonically until the maximum model complexity is reached. In addition, we also recorded the estimated subspace dimensionalities during annealing, as depicted in Figure 5. Over a wide range of the temperature scale, the model correctly reflects the dimensional structure of the data before it ends up in saturation with full-covariance components. From the plots, one might argue that the computational effort for the DA scheme is enormous, due to the large number of annealing steps. However, this picture is a bit misleading, since most of the single optimizations at particular σ² levels do not require much computation. In nearly all cases, only one complete EM cycle was necessary to fulfill a suitable convergence criterion on the log-likelihood.

5.2 Noisy Spiral Data. To investigate the performance in the case where the source distribution clearly differs from the assumption of local gaussianity, we sampled data from a uniform distribution on a 3D spiral with two revolutions, unit radius, and two units height, and added isotropic gaussian noise with standard deviation 0.1.
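Such a sample might be generated as follows (an illustrative sketch; only the stated parameters, two revolutions, unit radius, two units height, and noise standard deviation 0.1, come from the text, and the sampling routine is our own construction):

```python
import numpy as np

def sample_noisy_spiral(n, rng=None):
    """Sample n points uniformly along a 3D spiral with two revolutions,
    unit radius, and two units height, plus isotropic gaussian noise
    with standard deviation 0.1."""
    rng = np.random.default_rng(rng)
    t = rng.uniform(0.0, 1.0, size=n)      # position along the curve (helix has
    angle = 2 * np.pi * 2 * t              # constant speed, so this is uniform)
    x = np.column_stack([np.cos(angle),    # unit radius
                         np.sin(angle),
                         2.0 * t])         # two units height
    return x + rng.normal(scale=0.1, size=(n, 3))

X_train = sample_noisy_spiral(100, rng=0)    # 100-point training set
X_test = sample_noisy_spiral(1000, rng=1)    # 1000-point test set
```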
Figure 5: Dimensionality estimation with respect to “mixture” data. For each component, the estimated dimensionality is plotted against the logarithm of increasing inverse temperature.
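The two-phase annealing schedule used in these experiments, starting from σ²_max equal to the largest eigenvalue of S and shrinking the noise variance by the decay rate a = 0.9 per step, can be sketched as follows (a minimal skeleton; `em_step` and `has_split` are hypothetical stand-ins for the model-specific update rules of sections 3 and 4):

```python
import numpy as np

def anneal_phase1(X, em_step, has_split, a=0.9, max_steps=500):
    """First deterministic-annealing phase (sketch): run EM at a fixed
    noise level sigma^2, then shrink sigma^2 by the decay rate a,
    until all component means have split up.

    em_step(X, params, sigma2) and has_split(params) are hypothetical
    placeholders for the EM update and the split test."""
    S = np.cov(X.T)
    sigma2 = float(np.linalg.eigvalsh(S).max())  # sigma^2_max: largest eigenvalue of S
    params = None
    for _ in range(max_steps):
        params = em_step(X, params, sigma2)      # local likelihood maximization
        if has_split(params):
            break
        sigma2 *= a                              # lower the temperature
    return params, sigma2
```

The second phase continues the same loop, tracking the validation log-likelihood instead of the split criterion.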
For the first experiment, we chose a setup similar to the one used by Tipping and Bishop (1999), who tested the performance of the N-PCA-based mixture model with K = 8 components on noisy spiral data. In particular, we used a 100-point training set and measured the performance of the noise- and dimensionality-estimating models on a 1000-point test set. In addition, we used a 100-point validation set to determine an optimal noise level for the dimensionality-estimating models and an optimal subspace dimensionality for the noise-estimating ones. Within the dimensionality-estimating framework, we chose the same optimization scheme as already used with the mixture data of the previous experiment and stopped the annealing when the validation log-likelihood began to decrease. Since there are three choices for a global subspace dimensionality in 3D, within the noise-estimating framework we had to compare mixtures with spherical, one-dimensional subspace, and full-size covariance models. To achieve better comparability, all models were initialized by probabilistic clustering via deterministic annealing, as described in section 3. The annealing was always run until all component means had split up. (For the N-PCA models, we used the resulting data assignment probabilities h_ij, see section 4.2, as initial values for the subsequent optimization phase.) Since the splitting requires some random perturbations, each optimization was repeated 100 times; the mean (with standard deviation), maximum, and minimum log-likelihood on the test set are given in Table 1, which shows a slight superiority of the resolution-based D-PCA scheme over the
Table 1: Comparison of Performance Based on 100 DA-Clustering Initializations.

Method   Mean    Standard Deviation   Minimum   Maximum
N-PCA    -1.89   0.0811               -1.95     -1.61
D-PCA    -1.62   0.0393               -1.63     -1.5

Note: Mean, standard deviation, minimum, and maximum of the log-likelihood per data point on test data for two estimation schemes with respect to a K = 8 mixture, where N-PCA denotes the noise-estimating and D-PCA the resolution-based scheme.
Table 2: Performance Based on 100 Hard-Clustering Initializations.

Method   Mean    Standard Deviation   Minimum   Maximum
N-PCA    -1.96   0.195                -2.54     -1.56
D-PCA    -1.68   0.109                -1.95     -1.43

Note: See Table 1 for further explanations.
discrete N-PCA scheme. We attribute this difference to the fact that with a global subspace dimensionality in a 3D data space, the discrete control scheme offers only three vertical model complexities. On the noisy spiral data with a K = 8 mixture, in most cases the 1D subspace model yielded the highest likelihood on the validation set. With the resolution-based approach, on the other hand, one can select a model from an infinite set of complexities, due to the continuous noise parameter, and thus it provides more flexible control. We then repeated the experiment without the first annealing phase to investigate its influence on the resulting model performance. For that purpose, each initialization was achieved by K-means hard clustering. From the results in Table 2, we see that the DA clustering (and thus the combined algorithm) is well suited for optimization within the resolution-based framework, since the average performance is slightly better than without DA and, more significantly, the deviation from the mean is considerably smaller. With respect to the N-PCA scheme, one can see that the DA initialization improves performance quite similarly. In a second experiment, we investigated the performance of the model for an increasing number K of mixture components and compared it with the N-PCA-based approach. For each K, we performed DA-initialized optimizations with the complexity parameters σ²_min and q determined on the validation set, as in the previous experiment. From the resulting plots in Figure 6, we see that the superiority of the D-PCA-based approach observed in the previous experiment holds for nearly all values of K. Moreover, most
Figure 6: Performance with respect to “spiral” data for different K-size models; log-likelihood per data point on test set, plotted against increasing numbers of mixture components for D-PCA and N-PCA models.
interestingly, the resolution-dependent model seems to be less sensitive to overfitting for higher numbers of components. Obviously the noise level not only restricts the vertical complexity but also limits the horizontal complexity, similar to the spherical case discussed in section 3. This fact is of great practical interest, since it makes the choice of an adequate K less critical than for the noise-estimating approach, which increasingly suffered from overfitting on the actual noisy-spiral example for larger values of K, as one can see from Figure 6. While the plotted test set log-likelihood decreases rapidly for 12 < K < 16, the log-likelihood on the training set keeps increasing within this interval. For the D-PCA-based model, on the other hand, the training set log-likelihood remained nearly constant for K > 10. For K = 16 the N-PCA scheme switched to spherical covariance models, which explains the local increase of the plotted performance at this position.

6 Real-World Example
To demonstrate the practical benefits of the resolution-based approach, we tested the scheme on a high-dimensional density estimation problem. The task was to construct a plug-in classifier (Ripley, 1996) for the recognition of handwritten digits, which have been shown to be suitably represented by local linear models (Hinton, Revow, & Dayan, 1995; Hastie & Simard, 1995). We chose a modular approach that required training 10 different density models.
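A plug-in classifier of this kind assigns an input to the class whose fitted density model, weighted by the class prior, gives it the highest score. A minimal sketch (the per-class log-density functions are assumed to be already fitted; all names here are our own illustration):

```python
import numpy as np

def plugin_classify(x, log_densities, log_priors):
    """Plug-in Bayes rule: argmax_c [ log p(x | c) + log P(c) ].

    log_densities is a list of fitted per-class log-density functions
    (here, one mixture model per digit class); log_priors holds the
    class log-priors estimated from the training set."""
    scores = [ld(x) + lp for ld, lp in zip(log_densities, log_priors)]
    return int(np.argmax(scores))
```

For example, with two one-dimensional gaussian class densities centered at 0 and 3 and equal priors, a point at 0.2 is assigned to the first class.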
Table 3: Test Set Error Rate (in percent) on Digit Classification.

         K = 1   K = 2   K = 4   K = 8   K = 16
N-PCA    6.16    5.7     5.46    4.48    3.34
D-PCA    4.69    4.3     3.43    2.91    2.5

Note: Comparison of N-PCA- and D-PCA-based mixtures for different numbers of components K.
The data consisted of images of handwritten digits that are transformed into feature vectors by reading the pixel intensity values in lexicographic order. We used the MNIST database (LeCun, 1998), which is available over the Internet and offers about 60,000 training and 10,000 test images of size 28 × 28 pixels with 8-bit resolution. For the following experiments, we sampled the images down to 8 × 8 and 16 × 16 to obtain 64- and 256-dimensional data points, respectively. For the first experiment with 64-dimensional data, we again compared the performances of N-PCA- and D-PCA-based models. For that purpose, we separated a 1000-point validation set from each training set and used the remaining points (≈ 5000) for unsupervised training of the single-digit-class models with fixed subspace dimensionalities and noise levels, respectively. For computational convenience, we chose a common number K of mixture components for the density models of a single classifier and compared the performances of different classifiers for K = 1, 2, 4, 8, 16. Since a common noise level seems plausible, we trained all D-PCA models of a single classifier with equal σ²_min. In order to achieve comparable computational loads, for an N-PCA-based classifier we chose a common q for the density models. Similar to the DA scheme, the N-PCA mixtures were successively optimized for increasing q, taking the membership probabilities h_ij with respect to an optimized q model as initial values for EM optimization of the q + 1 model. For each K, suitable values for q and σ²_min were found by optimizing the discrimination ability of the corresponding classifier on the validation data; that is, the optimization was stopped when the classification error on the validation sets began to increase. With these values of q and σ²_min, the optimization was repeated on the whole training set for each digit class. The recognition performances of the different classifiers on the test set are shown in Table 3.
From the corresponding error rates, we see that the D-PCA-based classifiers outperform the N-PCA-based ones for all tested values of K. Thus, the assumption of a common noise level for the class-specific density models seems more realistic than a common subspace dimensionality. In the N-PCA case, the common q, which we determined from the validation data, ranged from 7 to 11 dimensions. The estimated subspace dimensionalities in the D-PCA case were in general higher for the corresponding best
Figure 7: Estimated subspace dimensionalities of the different digit class models for the optimal σ²_min. The bars indicate the average dimensionalities obtained for K = 16 components within the 64-dimensional data space, together with the minimum and maximum indicated by the upper and lower markings.
K classifier. As an example, Figure 7 plots the average local subspace dimensionalities for the ten digit models of the K = 16 classifier. In a second experiment, we investigated whether the performance of the best classifier of the previous experiment, based on D-PCA mixtures with K = 16, could be further increased by taking images of higher resolution. For that purpose, we used a 16 × 16 version of the previous training, validation, and test sets and found that the smallest classification error on the validation data was reached for σ²_min ≈ 2000, which corresponds to gaussian noise with a standard deviation of about 45 gray values (√2000 ≈ 44.7). With the corresponding models, we achieved an error rate of 1.74 percent on the test set, which is comparable with the published result of the specialized LeNet-1 neural network on 16 × 16 MNIST images (LeCun, 1998), where an error rate of 1.7 percent is reported.

7 Discussion
Semiparametric models, and in particular mixtures of gaussians, are widely used for probabilistic modeling, since they offer an attractive compromise between the parametric and nonparametric approaches to density estimation (Bishop, 1995). However, with sparse data in high dimensions, gaussian mixture models with unconstrained covariance easily lead to severe overfitting, and recently several related approaches have been proposed that address the problem of a suitable complexity control. These approaches can be divided into two categories, which discriminate between discrete and continuous control schemes. With "Mixtures of Probabilistic Principal Component Analysers," Tipping and Bishop (1999) proposed a maximum likelihood variant of local PCA learning (Kambhaltla & Leen, 1994; Bregler & Omohundro, 1994). Since the model complexity is controlled by the number of principal subspace dimensions, their approach belongs to the first (discrete) category. The approach is based on the estimation of the noise variance, and therefore the above arguments (see section 2) apply. As another discrete scheme with a more general noise model, Ghahramani and Hinton (1996) proposed a mixture of factor analyzers. Again, noise estimation is a sensitive issue. Both models have recently been presented in a common framework by Roweis and Ghahramani (1999). Within the second (continuous) category, on the other hand, Bayesian approaches offer attractive alternatives that might even suggest suitable values for those complexity parameters normally treated as fixed within the maximum likelihood framework. Besides approaches that tried to find an adequate number of mixture components (Cheeseman et al., 1988; Richardson & Green, 1997), recently a Bayesian extension of the noise-estimating variant of PCA (see section 2) has been proposed (Bishop, 1999), which automatically determines an "effective" subspace dimensionality for each component. Although appealing from a computational viewpoint, the proposed method suffers from the usual problem of Bayesian approaches, which is the choice of a defensible prior. Bishop (1999) assumes the principal directions (scaled by the corresponding standard deviations) to be independent and normally distributed with spherical covariances and zero mean.
Although this approach leads to a convenient estimation scheme, other priors are conceivable and may lead to different effective subspace dimensionalities. In fact, the rather high dimensionalities that Bishop (1999) reported for a handwritten digit data set, ranging from 54 to 63 dimensions in a 64-dimensional data space, are in strong contrast with our results on the classification task of the previous section, which suggest a considerably lower subspace dimensionality. As another approach of the second category, Ormoneit and Tresp (1998) proposed using a Bayesian penalty to constrain the full-size covariance models of a gaussian mixture without resorting to a subspace model. The influence of their Wishart prior has to be prespecified by a single parameter, which plays a similar role as the noise variance within the resolution-based framework. Our continuous complexity control scheme can be viewed as a hybrid approach that combines some of the advantages of the previous discrete and continuous approaches. It avoids the singularity problems that may result from noise estimation in high dimensions, and at the same time it provides some robustness against overfitting due to suboptimal choices of
the number of mixture components. Another favorable feature arises from the natural combination with a horizontal complexity control scheme within a common deterministic annealing framework. The resulting two-phase model refinement procedure obviates any heuristic parameter initialization and allows for an efficient exploration of the model space. At the same time, the resolution-based framework preserves the intuitive meaning of the number of subspace dimensionalities, which can provide some additional insight into the structure of the data. Therefore we think that such an approach is also well suited for exploratory data analysis (data mining), where it may be of interest to monitor the model refinement at increasing levels of resolution.

Appendix: Subspace Dimensionality Estimation
Proposition. Let X = {x_1, ..., x_N} ⊂ R^d be a sample from a d-variate normal population N_d(μ, Σ) with Σ = Ψ + σ²I and known σ². Then, with an eigendecomposition of the nonsingular sample covariance matrix S = C diag(l_1, ..., l_d) Cᵀ, which is unique with probability one, so that we can state l_1 > l_2 > ... > l_d, the maximum likelihood estimator of the principal subspace dimensionality q, which is the rank of Ψ, is

q̂ = |{l_i : l_i > σ², i = 1, ..., d}|.

Proof. With respect to the multivariate normal density of the observables, we have the following log-likelihood (e.g., Mardia, Kent, & Bibby, 1979):

L(μ, Σ; X) = −(N/2) log |2πΣ| − (N/2) tr(Σ⁻¹S) − (N/2) (x̄ − μ)ᵀ Σ⁻¹ (x̄ − μ),   (A.1)

where x̄ denotes the sample mean and tr(·) the trace operator. After maximization with respect to μ, that is, μ̂ = x̄, and elimination of some constants, it remains to maximize

−log |Σ| − tr(Σ⁻¹S) = −log |Λ| − tr(Γᵀ Λ⁻¹ Γ C L Cᵀ),   (A.2)

where we replaced the covariance matrices by their eigendecompositions Σ = Γᵀ Λ Γ and S = C L Cᵀ, with Λ = diag(λ_1, ..., λ_d) containing the population eigenvalues in decreasing order, that is, λ_1 ≥ λ_2 ≥ ... ≥ λ_{q+1} = ... = λ_d = σ². Since it is necessary to minimize the trace term, we write it in the more convenient form

tr(Λ⁻¹ R L Rᵀ) = (λ_1⁻¹, ..., λ_d⁻¹) Q (l_1, ..., l_d)ᵀ = vᵀ Q w,   (A.3)
where we replaced ΓC by the orthogonal matrix R = (r_ij) and then rewrote the trace term as a matrix product. Here v and w denote the d-vectors of the diagonals of Λ⁻¹ and L, respectively, and Q = (r²_ij) has unit row and column sums. To minimize equation A.3, the smallest component value of v must multiply the maximum component of Qw. Due to the unit row sums of Q, with respect to the maximum norm we have

‖Qw‖_∞ ≤ ‖Q‖_∞ ‖w‖_∞ = l_1.   (A.4)
Thus, it is possible to achieve a stepwise minimization of equation A.3 if the first diagonal element of Q is set to one. If we now consider the remaining (d − 1) × (d − 1) lower right submatrix and the corresponding vector components of v and w, the same rule applies recursively, and we find that Q = I minimizes the above trace. From this, we have that C is a (generally nonunique) MLE of Γ, and we can rewrite equation A.2 in terms of the eigenvalues of S and Ψ. Then the problem of finding q̂ can be stated as determining the number of nonzero eigenvalues of Ψ that maximize

−Σ_{i=1}^{d} log(δ_i + σ²) − Σ_{i=1}^{d} l_i / (δ_i + σ²),   (A.5)

where the δ_i are the d eigenvalues (in decreasing order) of Ψ. Differentiation with respect to the δ_i yields d first-order necessary conditions,

δ_i + σ² − l_i = 0,   i = 1, ..., d.   (A.6)

We now have to distinguish two cases:

δ̂_i = l_i − σ²  if l_i > σ²,   δ̂_i = 0  otherwise.   (A.7)

In the first case, a regular maximum is achieved; in the second case, the maximum is at a discontinuity, since it lies at the boundary of the parameter space. Nevertheless, within the admissible parameter range, the maximum is still unique, since for σ² ≥ l_i the ith term of equation A.5 can be written (with the index omitted, and σ² = l + ε² with ε² ≥ 0) as f(δ) = −log(δ + l + ε²) − l (δ + l + ε²)⁻¹, which is strictly monotone for δ ≥ 0. Now counting the number of nonzero δ̂_i or, equivalently, the sample eigenvalues greater than σ², yields the unique maximum likelihood estimator of q. It can easily be seen that the above estimator is consistent, since for N → ∞ the sample eigenvalues converge in probability to their population counterparts, and thus l_i → σ², i > q.
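Numerically, the proposition and equation A.7 reduce to thresholding the sample eigenvalues at σ². A small sketch (our own illustration, not from the paper):

```python
import numpy as np

def subspace_mle(S, sigma2):
    """MLE of the subspace dimensionality q and of the eigenvalues of Psi,
    given the sample covariance matrix S and a known noise variance sigma^2.

    q_hat counts the sample eigenvalues above sigma^2 (the proposition);
    delta_hat shrinks them by sigma^2 and clips at zero (equation A.7)."""
    l = np.linalg.eigvalsh(S)[::-1]          # sample eigenvalues, decreasing
    delta_hat = np.maximum(l - sigma2, 0.0)
    q_hat = int(np.sum(l > sigma2))
    return q_hat, delta_hat
```

For a covariance with eigenvalues (1.5, 0.4, 0.11, 0.09, 0.08) and σ² = 0.1, this yields q̂ = 3.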
References

Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34, 122–148.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.

Bishop, C. M. (1999). Bayesian PCA. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Bregler, C., & Omohundro, S. M. (1994). Surface learning with applications to lipreading. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 43–50). San Mateo, CA: Morgan Kaufmann.

Buhmann, J. M. (1998). Empirical risk approximation: An induction principle for unsupervised learning (Tech. Rep.). Bonn: Department of Computer Science III, University of Bonn.

Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., & Freeman, D. (1988). AutoClass: A Bayesian classification system. In Proceedings of the Fifth International Workshop on Machine Learning, Ann Arbor (pp. 54–64). San Mateo, CA: Morgan Kaufmann.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39, 1–38.

Durbin, R., Szeliski, R., & Yuille, A. (1989). An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1(3), 348–358.

Everitt, B. S. (1980). Cluster analysis (2nd ed.). London: Heinemann Educational Books.

Fraley, C. (1998). Algorithms for model-based gaussian hierarchical clustering. SIAM Journal on Scientific Computing, 20(1), 270–281.

Geman, S., & Hwang, C.-R. (1982). Nonparametric maximum likelihood estimation by the method of sieves. Annals of Statistics, 10(2), 401–414.

Ghahramani, Z., & Hinton, G. (1996). The EM algorithm for mixtures of factor analyzers (Tech. Rep. No. CRG-TR-96-1). Toronto: Department of Computer Science, University of Toronto.

Hastie, T., & Simard, P. (1995). Learning prototype models for tangent distance. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 999–1006). Cambridge, MA: MIT Press.

Hinton, G. E., Revow, M., & Dayan, P. (1995). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 1015–1022). Cambridge, MA: MIT Press.

Hofmann, T., Puzicha, J., & Buhmann, J. M. (1998). Unsupervised texture segmentation in a deterministic annealing framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 803–818.
Kambhaltla, N., & Leen, T. K. (1994). Fast non-linear dimension reduction. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 152–159). San Mateo, CA: Morgan Kaufmann.

LeCun, Y. (1998). The MNIST database of handwritten digits. AT&T Labs–Research. Available online at: http://www.research.att.com/~yann/ocr/mnist/index.html.

Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. London: Academic Press.

McKenzie, P., & Alder, M. (1994). Initializing the EM algorithm for use in gaussian mixture modelling. In E. S. Gelsema & L. N. Kanal (Eds.), Pattern recognition in practice IV (pp. 91–105). Amsterdam: Elsevier.

Moghaddam, B., & Pentland, A. (1997). Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 696–710.

Ormoneit, D., & Tresp, V. (1998). Averaging, maximum penalized likelihood and Bayesian estimation for improving gaussian mixture probability density estimates. IEEE Transactions on Neural Networks, 9(4), 639–650.

Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society Series B, 59, 731–792.

Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.

Roberts, S. J., Husmeier, D., Rezek, I., & Penny, W. (1998). Bayesian approaches to gaussian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1133–1142.

Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8), 945–948.

Rose, K., Gurewitz, E., & Fox, G. C. (1992). Vector quantization by deterministic annealing. IEEE Transactions on Information Theory, 38(4), 1249–1257.

Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11(2), 305–345.

Spivak, M. (1979). A comprehensive introduction to differential geometry (2nd ed.). Boston: Publish or Perish.

Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analysers. Neural Computation, 11(2), 443–482.

Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Chichester: Wiley.

Weiss, Y. (1998). Phase transitions and the perceptual organization of video sequences. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.

Received February 4, 1999; accepted May 5, 2000.
Neural Computation, Volume 13, Number 3
CONTENTS Article
The Limits of Counting Accuracy in Distributed Neural Representations A. R. Gardner-Medwin and H. B. Barlow
477
Note
An Expectation-Maximization Approach to Nonlinear Component Analysis Roman Rosipal and Mark Girolami
505
Letters
A Population Density Approach That Facilitates Large-Scale Modeling of Neural Networks: Extension to Slow Inhibitory Synapses Duane Q. Nykamp and Daniel Tranchina
511
A Complex Cell-Like Receptive Field Obtained by Information Maximization Kenji Okajima and Hitoshi Imaoka
547
Self-Organization of Topographic Mixture Networks Using Attentional Feedback James R. Williamson
563
Auto-SOM: Recursive Parameter Estimation for Guidance of Self-Organizing Feature Maps Karin Haese and Geoffrey J. Goodhill
595
Exponential Convergence of Delayed Dynamical Systems Tianping Chen and Shun-ichi Amari
621
Improvements to Platt's SMO Algorithm for SVM Classifier Design S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy
637
Learning Hough Transform: A Neural Network Model Jayanta Basak
651
A Constrained EM Algorithm for Independent Component Analysis Max Welling and Markus Weber
677
Emergence of Memory-Driven Command Neurons in Evolved Artificial Agents Ranit Aharonov-Barki, Tuvik Beker, and Eytan Ruppin
691
ARTICLE
Communicated by David Field
The Limits of Counting Accuracy in Distributed Neural Representations

A. R. Gardner-Medwin
Department of Physiology, University College London, London WC1E 6BT, U.K.

H. B. Barlow
Physiological Laboratory, Cambridge CB2 3EG, U.K.

Learning about a causal or statistical association depends on comparing frequencies of joint occurrence with frequencies expected from separate occurrences, and to do this, events must somehow be counted. Physiological mechanisms can easily generate the necessary measures if there is a direct, one-to-one relationship between significant events and neural activity, but if the events are represented across cell populations in a distributed manner, the counting of one event will be interfered with by the occurrence of others. Although the mean interference can be allowed for, there is inevitably an increase in the variance of frequency estimates that results in the need for extra data to achieve reliable learning. This lowering of statistical efficiency (Fisher, 1925) is calculated as the ratio of the minimum to actual variance of the estimates. We define two neural models, based on presynaptic and Hebbian synaptic modification, and explore the effects of sparse coding and the relative frequencies of events on the efficiency of frequency estimates. High counting efficiency must be a desirable feature of biological representations, but the results show that the number of events that can be counted simultaneously with 50% efficiency is fewer than the number of cells or 0.1–0.25 of the number of synapses (on the two models), many fewer than can be unambiguously represented. Direct representations would lead to greater counting efficiency, but distributed representations have the versatility of detecting and counting many unforeseen or rare events. Efficient counting of rare but important events requires that they engage more active cells than common or unimportant ones.
The results suggest reasons that representations in the cerebral cortex appear to use extravagant numbers of cells and modular organization, and they emphasize the importance of neuronal trigger features and the phenomena of habituation and attention.

1 Introduction
The world we live in is highly structured, and to compete in it successfully, an animal has to be able to use the predictive power that this structure makes possible. Evolution has molded innate genetic mechanisms that help with the universal basics of finding food, avoiding predators, selecting habitats, and so forth, but much of the structure is local, transient, and stochastic rather than universal and fully deterministic. Higher animals greatly improve the accuracy of their predictions by learning about this statistical structure through experience: they learn what sensory experiences are associated with rewards and punishments, and they also learn about contingencies and relationships between sensory experiences even when these are not directly reinforced. Sensory inputs are graded in character and may provide weak or strong evidence for the identification of a discrete binary state of the environment, such as the presence or absence of a specific object. Such classifications are the data on which much simple inference is built and about which associations must be learned. Learning any association requires a quantitative step in which the frequency of a joint event is observed to be very different from the frequency predicted from the probabilities of its constituents. Without this step, associations cannot be reliably recognized, and inappropriate behavior could result from attaching too much importance to chance conjunctions or too little to genuine causal ones. Estimating a frequency depends in its turn on counting, using that word in the rather general sense of marking when a discrete event occurs and forming a measure of how many times it has occurred during some epoch. Counting is thus a crucial prerequisite for all learning, but the form in which sensory experiences are represented limits how accurately it can be done.

Neural Computation 13, 477–504 (2001)
© 2001 Massachusetts Institute of Technology
If there is at least one cell in a representation of the external world that fires in one-to-one relation to the occurrence of an event (i.e., if that event is directly represented according to our definitions; see Table 1), then there is no difficulty in seeing how physiological mechanisms within such a cell could generate an accurate measure of the event frequency. On the other hand, there is a problem when the events correspond to patterns on a set of neurons (i.e., with a distributed representation; see Table 1). In a distributed representation, a particular event causes a pattern of activity in several cells, but even when this pattern is unique, there is no unique element in the system that signals when the particular event occurs and does not signal at other times. Each cell active in any pattern is likely to be active for several different events during a counting epoch, so no part of the system is reliably active when, and only when, the particular event occurs. The interference that results from this overlap in distributed representations can be dealt with in two ways: (1) cells and connections can be devoted to identifying directly, in a one-to-one manner, when the patterns occur (i.e., a direct representation can be generated), or (2) the interference can be accepted and the frequency of occurrence of the distributed patterns estimated from the frequency of use of their individual active elements. The second procedure is likely to increase the variance of estimated counts, and distributed representation would be disadvantageous when
Limits of Counting Accuracy
479
Table 1: Definitions. The network is a set of Z binary cells. An event, relative to the network, is any stimulation that causes a specific activity vector (pattern of W active and Z − W inactive cells). This vector is the representation of the event. Repeated occurrence of a representation implies repetition of the same event, even if the external stimulus is different. A direct representation of an event contains at least one active cell that is active in no other event. The cells that directly represent an event in this way have a one-to-one relation between their activity and occurrences of the event. All other representations are distributed representations. Each active cell is active also in other events, and identification of an event requires interaction with more than one cell. Compact distributed representations employ relatively few cells to distinguish a given number (N) of events, close to the minimum Z = log₂(N).
The activity ratio of a representation is the fraction of cells (a = W/Z) that are active in it. A sparse representation has a low activity ratio. Direct representations are not necessarily sparse, though they must have extreme sparseness (W = 1) to represent the greatest possible number of distinct events (Z) on a network. Overlap between two representations is the number of shared active cells (U). Counting of an event means estimating how many times it has occurred during a counting epoch. Counting accuracy is limited by overlap and by the interference ratio (see equation 4.2), which in simple cases is the total number of occurrences of interfering events (i.e., events different from the counted event) divided by occurrences of the counted event.
this happens because the speed and reliability of learning would be impaired. On the other hand, distributed representation is often regarded as a desirable feature of the brain because it brings the capacity to distinguish large numbers of events with relatively few cells (see, for instance, Hinton & Anderson, 1981; Rumelhart & McClelland, 1986; Hinton, McClelland, & Rumelhart, 1986; Churchland, 1986; Farah, 1994). With sparse distributed representations, networks can also operate as content-addressable memories that store and retrieve amounts of information approaching the maximum permitted by their numbers of modifiable elements (Willshaw, Buneman, & Longuet-Higgins, 1969; Gardner-Medwin, 1976). Recently Page (2000) has emphasized some disadvantages of distributed representations and argued that connectionist models should include a "localist" component, but we are not aware of any detailed discussion of the potential loss of counting accuracy that results from overlap, so our goal in this article is to analyze this quantitatively. To give the analysis concrete meaning, we formulated two specific neural models of the way frequency estimates could be made. Neither is intended as a direct model of the way the brain actually counts, nor do we claim that counting is the sole function of any part of the brain, but the models help to identify issues that relate more to the nature of representations than to specific mechanisms. Counting
A. R. Gardner-Medwin and H. B. Barlow
is a necessary part of learning, and representations that could not support efficient counting could not support efficient learning. We express our results in terms of the reduction in statistical efficiency (Fisher, 1925) of these models, since this reveals the practical consequences of the loss of counting accuracy in terms of the need for more experience before an association or contingency can be learned reliably. We do not know of any experimental measures of the statistical efficiency of a learning task, but it has a long history in sensory and perceptual psychology where, for biologically important tasks, the efficiencies are often surprisingly high (Rose, 1942; Tanner & Birdsall, 1958; Jones, 1959; Barlow, 1962; Barlow & Reeves, 1979; Barlow & Tripathy, 1997). From our analysis we conclude that compact distributed representations (i.e., ones with little redundancy) enormously reduce the efficiency of counting and must therefore slow reliable learning, but this is not the case if they are redundant, having many more cells than are required simply for representation. The analysis enables us to identify principles for sustaining high efficiency in distributed representations, and we have confirmed some of the calculations through simulation. We think these results throw light on the complementary advantages of distributed and direct representation.

1.1 The Statistical Efficiency of Counting. The events we experience are often determined by chance, and it is their probabilities that matter for the determination of optimal behavior. Probability estimates must be based on finite samples of events, with inherent variability, and accurate counting is advantageous insofar as it helps to make the most efficient use of such samples. For simplicity, we analyze the common situation in which the numbers of events follow (at least approximately) a Poisson distribution about the mean, or expected, value.
The variance is then equal to the mean (μ), and the coefficient of variation (i.e., standard deviation divided by mean) is 1/√μ. A good algorithm for counting is unbiased; on average it gives the actual number within the sample, but it may nevertheless introduce a variance V. (See Table 2 for a listing of the symbols used in this article.) This variance arises within the nervous system, in a manner quite distinct from the Poisson variance, whose origin is in the environment; we assume they are independent and therefore sum to a total variance V + μ. It is convenient to consider the fractional increase of variance caused by the algorithm in a particular context, which we call the relative variance (r):

r = V/μ.    (1.1)
Adding variance to a probability estimate has a similar effect to making do with a smaller sample, with a larger coefficient of variation. Following Fisher (1925) we define efficiency e as the fraction of the sample that is effectively
Table 2: Principal Symbols Employed.

Z              number of binary cells in the network
Pi             a binary pattern (or vector) of activity on the network
Ei             an event causing the pattern Pi (its representation)
N              the number of such distinct events that may occur with finite probability during a counting epoch
Wi             number of active cells in the representation of Ei
ai = Wi/Z      activity ratio for the representation of Ei
Uij            overlap (i.e., number of shared active cells) between Pi, Pj
μi             expectation of the number of occurrences of Ei in a counting epoch
mi             actual number of occurrences of Ei
M = Σi mi      total number of event occurrences within the counting epoch
V              variance introduced in estimating a count m
r = V/μ        relative variance, that is, the variance of the estimate of m relative to the intrinsic Poisson variance of m (= μ)
e = μ/(μ + V)  efficiency in estimating μ, given the variance V in counting a sample
Ωc             interference ratio while counting Ec (see equation 4.2)
⟨y⟩            the statistical expectation of any variable y
ŷ              an estimate of y
{yi}           the set of yi for all possible values of i
made use of:

e = μ/(μ + V) = (1 + r)⁻¹.    (1.2)
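Equations 1.1 and 1.2 are simple enough to state as a one-line helper. The following Python sketch is ours, not the article's (whose simulations were done in LabVIEW), and the function name is an assumption:

```python
def efficiency(V, mu):
    """Statistical efficiency e = mu / (mu + V) = (1 + r)**-1,
    with relative variance r = V / mu (equations 1.1 and 1.2)."""
    r = V / mu
    return 1.0 / (1.0 + r)

# A counting algorithm whose added variance V equals the Poisson
# variance mu (so r = 1) wastes half of the sample: e = 0.5.
print(efficiency(V=10.0, mu=10.0))  # 0.5
```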
Efficiency is a direct function of r, and if r > 1, then e < 50%, which means that the time and resources required to gather reliable data will be more than two times greater than is in principle achievable with an ideal algorithm. If r ≪ 1, then efficiency is nearly 100% and there would be little to gain from a better algorithm in the same situation.

2 A Simple Illustration

As an illustration of the problem, consider how to count the occurrences of a particular letter (e.g., A) in a sequence of letters forming some text. If A has a direct representation in the sense that an element is active when and only when A occurs (as on a keyboard), then it is easy to count the occurrences of A with precision. But if A is represented by a distinctive pattern of active elements (as in the ASCII code), then the problem is to infer from counts of usage of individual elements whether and how often A has occurred. The ASCII code is compact, with 127 keyboard and control characters distinguished on just 7 bits. Obviously 7 usage counts cannot in
general provide enough information to infer 127 different counts precisely. The result is underdetermined except for a few simple cases. In general there is only a statistical relation between the occurrence of letters and the usage of their representational elements, and our problem is to calculate, for cases ranging from direct representations to compact codes, how much variance is added when inferring these frequencies. Note that seven specific subsets of characters have a one-to-one relation to activity on individual bits in the code. For example, the least significant bit is active for a set including the characters ACEG as well as many others. Such subsets have special significance because the summed occurrences of events in them are easily computed on the corresponding bit. In the ASCII code, they are generally not subsets of particular interest, but in the brain, it would be advantageous for them to correspond to categories of events that can be grouped together for learning. This would improve generalization, increase the sample size for learning about the categories, and reduce the consequences of overlap errors. Our analysis ignores the benefit from such organization and assumes that the representations of different events are randomly related, though we discuss this further in section 6.2. The conversion of directly represented key presses into a distributed ASCII code is certainly not advantageous for the purpose of counting characters. The events that the brain must count, however, are not often directly represented at an early stage, nor do they occur one at a time in a mutually exclusive manner, as do typed characters. Each event may arouse widespread and varied activity that requires much neural computation before it is organized in a consistent form, suitable for counting and learning.
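A small numerical companion to this illustration (our own Python sketch; the helper names and the sample text are ours, not the article's): accumulate usage counts for the 7 ASCII bit "cells" over a text, then sum the usage of the bits active in A. The naive estimate overcounts because other characters share A's active bits:

```python
def bit_usage(text):
    """Usage count of each of the 7 ASCII bit 'cells' over a text."""
    usage = [0] * 7
    for ch in text:
        for b in range(7):
            if ord(ch) >> b & 1:
                usage[b] += 1
    return usage

def naive_count(text, ch):
    """Summed usage of the bits active in ch, divided by their number
    (a projection-style estimate in the sense of section 3.1)."""
    usage = bit_usage(text)
    active = [b for b in range(7) if ord(ch) >> b & 1]
    return sum(usage[b] for b in active) / len(active)

text = "ABRACADABRA"
print(naive_count(text, "A"), text.count("A"))  # 8.5 versus the true 5
```

The bits shared with B, C, D, and R inflate the estimate; correcting for that average inflation is exactly what equation 3.3 does below.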
We assume here that perceptual mechanisms exploit the redundancy of sensory messages and generate suitable inputs for our models as outlined below and discussed later (in section 6). These simplifications enable us to focus on the limitations of counting accuracy that arise even under relatively ideal conditions.

3 Formal Definition of the Task

Consider a set of Z binary cells (see Figure 1) on which is generated, one at a time, a sequence of patterns of activity belonging to a set {P1, ..., PN} that correspond to N distinct categorizations of the environment described as events {E1, ..., EN}. The patterns (binary vectors) are said to represent the events. Each pattern Pi is an independent random selection of Wi active cells out of the Z cells, with the activity ratio ai = Wi/Z. The corresponding vector {xi1, ..., xiZ} has elements 1 or 0 where cells are active or inactive in Pi. The active cells in two different patterns Pi, Pj overlap by Uij cells (Uij ≥ 0), where

Uij = Σ_{k=1..Z} (xik xjk).
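The overlap statistics that drive the analysis below are easy to check numerically. This sketch (an illustration of the setup with parameter values chosen by us, not taken from the article) draws random W-out-of-Z patterns and confirms that the overlap with a fixed pattern has mean close to a·W, the hypergeometric expectation used in section 3.1:

```python
import random

random.seed(1)
Z, W = 100, 10            # network size and active cells per pattern
a = W / Z                 # activity ratio

Pc = set(random.sample(range(Z), W))   # the counted event's pattern
overlaps = [len(Pc & set(random.sample(range(Z), W)))
            for _ in range(20000)]
mean_U = sum(overlaps) / len(overlaps)
print(mean_U)  # close to a * W = 1.0 (hypergeometric expectation)
```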
Figure 1: Outline of the projection model. Sets of binary cells (for example, those marked by ellipses) are activated when there are repeatable sensory or neural events. The frequency of occurrence of a particular event Ec (with its active cells black) is estimated from activation of an accumulator cell X when Ec is represented at the time of testing. Synaptic weights onto X are proportional to the usage of individual cells. Additional circuitry (not shown) counts the number of active cells Wc and estimates both the average number of active cells and the total number of events (M) during the counting period.
Note that two different events may occasionally have identical representations, since these are assigned at random. Consider an epoch during which the total number of occurrences {mi} of events {Ei} can be treated as independent Poisson variables with expectations {μi}. The totals M and μT are defined as M = Σi (mi) and μT = Σi (μi). The task we define is to estimate, using only plausible neural mechanisms, the actual number of occurrences (mc) of representations of an individual event Ec when this event is identified by a test presentation after the counting epoch. We suppose that the system can employ an accurate count of the total occurrences (M) summed over all events during the epoch, and also the average activity ratio ā during the epoch:

ā = Σi (mi ai) / M.    (3.1)
We require specific and neurally plausible models of the way the task is done, and these are described in the next two sections. The first model (in section 3.1) is based on modifiable projections from the cells of a representation. They support counting by increasing effectiveness in proportion to presynaptic usage, though associative changes or habituation might alternatively support learning or novelty detection with similar constraints. The second model (in section 3.2) is based on changes of internal support for a pattern of activation. This greatly adds to the number of variables available to store relevant information by involving modifiable synapses between elements of a distributed representation, analogous to the rich interconnections of the cerebral cortex. Readers wishing to skip the mathematical derivations in sections 3.1 and 3.2 should look at their initial paragraphs with Figures 1 and 2 and proceed to section 4.

3.1 The Projection Model. This model (see Figure 1) estimates mc by obtaining a sum Sc of the usage, during the epoch, of all those cells that are active in the representation of Ec. This computation is readily carried out with a single accumulator cell X (see Figure 1) onto which project synapses from all the Z cells. The strengths of these synapses increase in proportion to their usage. When the event Ec is presented in a test situation after the end of an epoch of such counting, the summed activation onto X gives the desired total:

Sc = Σ_{cells k} xck Σ_{events j} (xjk mj) = mc Wc + Σ_{events j≠c} (mj Ujc).    (3.2)
If there were no interference from overlap between active cells in Ec and in any other events occurring during the epoch (i.e., if mj = 0 for all j for which Ujc > 0), then Sc = mc Wc. In this situation, Sc/Wc gives a precise estimate of mc and is easily computed, since Wc is the number of cells active during the test presentation of Ec. In general, Sc will be larger than mc Wc due to overlap between event representations. An adjustment for this can be made on the basis of the total number of events M and the average activity ratio ā, yielding a revised sum S′c:

S′c = Sc − M Wc ā.    (3.3)
Expansion using equations 3.1 and 3.2 yields:

S′c = mc Wc (1 − ac) + Σ_{events j≠c} (mj (Ujc − aj Wc)).    (3.4)
The expectation of each term in the sum in equation 3.4 is zero, since ⟨Ujc⟩ = aj Wc, and the covariance for variations of mj and Ujc is zero since they are determined by independent processes. An unbiased estimate m̂c of mc is therefore given by:

m̂c = S′c / (Wc (1 − ac)).    (3.5)
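A minimal Monte Carlo check of equations 3.2 to 3.5 (our sketch with assumed parameter values; the article's own simulations used LabVIEW): accumulate per-cell usage, form Sc over the cells of Ec, subtract M·Wc·ā, and rescale. The mean error of m̂c stays near zero, illustrating unbiasedness:

```python
import math
import random

random.seed(2)

def poisson(mu):
    """Poisson sample by Knuth's method (adequate for small mu)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

Z, W, N, mu = 100, 10, 101, 10.0
a = W / Z                      # all events share this activity ratio, so a_bar = a
errors = []
for _ in range(300):
    patterns = [random.sample(range(Z), W) for _ in range(N)]
    counts = [poisson(mu) for _ in range(N)]
    usage = [0] * Z                          # per-cell usage counters
    for P, m in zip(patterns, counts):
        for k in P:
            usage[k] += m
    M = sum(counts)
    Sc = sum(usage[k] for k in patterns[0])  # event E_c is event 0
    Sc_corr = Sc - M * W * a                 # equation 3.3
    m_hat = Sc_corr / (W * (1 - a))          # equation 3.5
    errors.append(m_hat - counts[0])

print(sum(errors) / len(errors))  # mean error near 0: the estimate is unbiased
```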
To calculate the reliability and statistical efficiency of this estimate m̂c, we need to know the variance of S′c due to the interference terms in equation 3.4.
This is simplified by the facts that mj and Ujc vary independently and that ⟨Ujc⟩ = aj Wc:

Var(S′c) = Σ_{j≠c} (⟨mj⟩² Var(Ujc) + Var(Ujc) Var(mj)).    (3.6)
Ujc has a hypergeometric distribution, close to a binomial distribution, with expectation aj Wc and variance aj (1 − aj)(1 − ac)(1 − 1/Z)⁻¹ Wc. Substituting these values and ⟨mj⟩ = Var(mj) = μj for the Poisson distribution of mj, we obtain:

Var(S′c) = Wc Σ_{j≠c} (aj (1 − aj)(1 − ac)(1 − 1/Z)⁻¹ (μj + μj²)).    (3.7)
Note that this analysis includes two sources of uncertainty in estimates of mc: variation of the frequencies of interfering events around their means and uncertainty of the overlap of representations of individual interfering events. The overlaps between different representations are strictly fixed quantities for a given nervous system, so this source of variation does not contribute if the nervous system can adapt appropriately. Results of calculations are therefore given both for the full variance (see equation 3.7) and for the expectation value ⟨Varm(S′c)⟩ when {Ujc} is fixed, that is, for variations of {mj} alone. This modified result is obtained as follows. Instead of equation 3.6, we have the variance of S′c (see equation 3.4) due to variations of {mj} alone:

Varm(S′c) = Σ_{j≠c} ((Ujc − aj Wc)² μj).    (3.8)
This depends on the particular (fixed) values of {Ujc}, but we can calculate its expectation for randomly selected representations:

⟨Varm(S′c)⟩ = Σ_{j≠c} ((⟨Ujc²⟩ − 2 aj Wc ⟨Ujc⟩ + aj² Wc²) μj)
            = Wc Σ_{j≠c} (aj (1 − aj)(1 − ac)(1 − 1/Z)⁻¹ μj).    (3.9)
This expression is similar to equation 3.7, omitting the terms μj². Note that if μj ≪ 1 for all events that might occur, the difference between the two expressions is negligible. This corresponds to a situation where there may be many possible interfering events, but each one has a low probability of occurrence. The variance does not then depend on whether individual overlaps are fixed and known, since the events that occur are themselves unpredictable.
The relative variance r (see equation 1.1) for an estimate m̂c of a desired count is obtained by dividing the results in equations 3.7 and 3.9 first by the square of the divisor in equation 3.5, (Wc (1 − ac))², and then by the expectation of the count μc. Square brackets are used to denote the terms in equation 3.7 due to overlap variance that are omitted in equation 3.9:

r(m̂c) = (1/(Z − 1)) × Σ_{j≠c} (aj (1 − aj)(μj [+ μj²])) / (ac (1 − ac) μc).    (3.10)
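Equation 3.10 can be evaluated directly. The helper below (names and parameter values are ours, not the article's) computes r(m̂c) with or without the bracketed overlap-variance terms; with equal activity ratios and equal event rates it reduces to the form used in section 4:

```python
def r_projection(Z, a_j, mu_j, a_c, mu_c, full=True):
    """Relative variance of the projection-model estimate, equation 3.10.
    a_j, mu_j: activity ratios and expected rates of the interfering events.
    The mu**2 terms in brackets apply to the full-variance case and are
    dropped when overlaps are fixed and adapted to (equation 3.9)."""
    s = sum(a * (1 - a) * (mu + (mu * mu if full else 0.0))
            for a, mu in zip(a_j, mu_j))
    return s / ((Z - 1) * a_c * (1 - a_c) * mu_c)

# 100 interfering events, all with a = 0.1 and mu = 10, fixed overlaps:
r = r_projection(100, [0.1] * 100, [10.0] * 100, 0.1, 10.0, full=False)
print(r)  # reduces to mu_I / ((Z - 1) mu_c) = 100/99, i.e. efficiency near 50%
```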
3.2 The Internal Support Model. In the projection model, the stored variables correspond to the usage of individual cells. The number of such variables is restricted to the number of cells Z, and this is a limiting factor in inferring event frequencies. The second model takes advantage of the greater number of synapses connecting cells, which in the cerebral cortex can exceed the number of cells by factors of 10³ or more. The numbers of pairings of pre- and postsynaptic activity can be stored by Hebbian mechanisms and yield substantially independent information at separate synapses (Gardner-Medwin, 1969). The number of stored variables for a task analogous to counting is greatly increased, though at the cost of more complex mechanisms for handling the information. The support model (see Figure 2) employs excitatory synapses that acquire strengths proportional to the number of times the pre- and postsynaptic cells have been active together during events experienced in a counting epoch. Full connectivity (Z² synapses counting all possible pairings of activity) is assumed here in order to establish optimum performance, though the same principles would apply in a degraded manner with sparse connectivity. During test presentation of the counted event Ec, the potentiated synapses between its active cells act in an autoassociative manner to give excitatory support for maintenance of the representation (Gardner-Medwin, 1976, 1989). The extent of this internal excitatory support depends substantially on whether, and how often, Ec has been active during the epoch. Interference is caused by overlapping events, just as with the projection model, though with less impact because the shared fraction of pairings of cell activity is, with sparse representations, much less than the shared fraction of active cells. Measurement of internally generated excitation requires appropriate external handling of the cell population (see Figure 2).
In principle, the whole histogram of internal excitation onto different cells can be established by imposing fluctuating levels of diffuse inhibition along with activation of the pattern representing Ec. Our analysis requires just the total (or average) excitation onto the cells of Ec (see equation 3.11). The neural dynamics may introduce practical limitations on the accuracy of such a measure in a given period of time, so our results represent an upper bound on performance employing this model.
Figure 2: Outline of the internal support model. Internal excitatory synapses within the network measure the frequency of co-occurrences of activity in pairs of cells by a Hebbian mechanism. On a test representation of the event Ec to be counted (indicated by the hatched active cells), the total of the internal activation stabilizing the active pattern is estimated by testing the effect of a diffuse inhibitory influence on the number of active cells.
We restrict the analysis, for simplicity, to situations with equal activity ratios for all events (aj ≡ a; Wj ≡ W) and full connectivity. Each of Z cells is connected to every other cell with Hebbian synapses counting pre- and postsynaptic coincidences, including autapses that effectively count individual cell usage. Analysis follows the same lines as for the projection model (see section 3.1), and only the principal results are stated here, with some steps omitted. When Ec is presented as a test stimulus, the total excitation Qc from one cell to another within Ec is given, analogous to equation 3.2, by:

Qc = mc W² + Σ_{events j≠c} (mj Ujc²).    (3.11)
A corrected value Q′c is calculated to allow for average levels of interference:

Q′c = Qc − M a W (aW + 1 − 2a)(1 − 1/Z)⁻¹.    (3.12)
This has expectation equal to mc W² (1 − a)(1 + a − 2/Z)(1 − 1/Z)⁻¹, so we can obtain an unbiased estimate of mc as follows:

m̂c = Q′c / (W² (1 − a)(1 + a − 2/Z)(1 − 1/Z)⁻¹).    (3.13)
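The same Monte Carlo check for the support model, equations 3.11 to 3.13 (again our sketch with assumed parameters, not the article's LabVIEW code): Hebbian counters accumulate coincidences for every ordered pair of co-active cells, autapses included, and Q′c is rescaled to an estimate of mc:

```python
import math
import random

random.seed(3)

def poisson(mu):
    """Poisson sample by Knuth's method (adequate for small mu)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

Z, W, N, mu = 100, 10, 101, 10.0
a = W / Z
errors = []
for _ in range(200):
    patterns = [random.sample(range(Z), W) for _ in range(N)]
    counts = [poisson(mu) for _ in range(N)]
    syn = [[0] * Z for _ in range(Z)]        # Hebbian pair counters
    for P, m in zip(patterns, counts):
        for k in P:
            for l in P:                      # ordered pairs incl. autapses
                syn[k][l] += m
    M = sum(counts)
    Pc = patterns[0]                         # event E_c is event 0
    Qc = sum(syn[k][l] for k in Pc for l in Pc)                      # eq. 3.11
    Qc_corr = Qc - M * a * W * (a * W + 1 - 2 * a) / (1 - 1 / Z)     # eq. 3.12
    m_hat = Qc_corr / (W * W * (1 - a) * (1 + a - 2 / Z) / (1 - 1 / Z))  # eq. 3.13
    errors.append(m_hat - counts[0])

print(sum(errors) / len(errors))  # near 0, with much smaller spread than the
                                  # projection model gives at the same Z
```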
The full variance of Q′c is calculated as for S′c (see equation 3.6), taking account of the independent Poisson distributions for the numbers of interfering events {mj} and hypergeometric distributions for the overlaps {Ujc}:

Var(Q′c) = Var(Ujc²) Σ_{j≠c} (μj + μj²).    (3.14)
The algebraic expansion of Var(Ujc²) is complex, but is simplified with the terminology F(r) = (W!/(W − r)!)² (Z − r)!/Z!:

Var(Ujc²) = F(4) + 6F(3) + 7F(2) + F(1) − (F(2) + F(1))².    (3.15)
The relative variance for the frequency estimate m̂c (see equation 3.13) can then be written as:

r(m̂c) = (ϕ/Z²) × (μI/μc) [× (1 + μ̄i)],    (3.16)

where μI = Σ_{j≠c} μj is the expected total number of occurrences of interfering events, μ̄i = Σ_{j≠c} μj²/μI is the average number of repeats of individual interfering events, weighted according to their frequencies, and ϕ is given by:

ϕ = (F(4) + 6F(3) + 7F(2) + F(1) − (F(2) + F(1))²) / (W⁴ Z⁻² (1 − W/Z)² (1 + (W − 2)/Z)² (1 − 1/Z)⁻²).    (3.17)
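Equation 3.17 can be evaluated numerically. The sketch below is ours and assumes the reading F(r) = (W!/(W − r)!)² (Z − r)!/Z! together with our reconstruction of the denominator; minimizing over W reproduces the ϕmin values quoted in the text:

```python
import math

def F(r, W, Z):
    """F(r) = (W!/(W-r)!)**2 * (Z-r)!/Z!  (falling-factorial moments)."""
    return math.perm(W, r) ** 2 / math.perm(Z, r)

def phi(W, Z):
    """The factor phi of equation 3.17 (our reconstruction)."""
    var_U2 = (F(4, W, Z) + 6 * F(3, W, Z) + 7 * F(2, W, Z) + F(1, W, Z)
              - (F(2, W, Z) + F(1, W, Z)) ** 2)          # equation 3.15
    denom = (W ** 4 / Z ** 2 * (1 - W / Z) ** 2
             * (1 + (W - 2) / Z) ** 2 / (1 - 1 / Z) ** 2)
    return var_U2 / denom

for Z in (10, 100, 1000):
    print(Z, min(phi(W, Z) for W in range(1, Z)))  # minima near 3.9, 6.7, 8.7
```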
ϕ depends on both W and Z, but for networks of different size (Z), it has a minimum value for an optimal choice of W (≈ √(Z/2)) that is only weakly dependent on Z: ϕmin = 3.9 for Z = 10, 6.7 for Z = 100, 8.7 for Z = 1000, 9.9 for Z = 10⁵. The corresponding optimal activity ratio, to give minimum variance and maximum counting efficiency with this model, is therefore a ≈ 1/√(2Z).

4 Results for Events Represented with Equal Activity Ratios

Analysis for the support model was restricted, for simplicity, to cases where all activity ratios are equal. In this section we also apply this restriction to the projection model to assist initial interpretation and comparisons. The relative variance for the projection model (see equation 3.10) becomes:

r(m̂c) = (1/(Z − 1)) × (μI/μc) [× (1 + μ̄i)].    (4.1)
Both this and the corresponding equation 3.16 for the support model can be broken down as products of a representation-dependent term ((Z − 1)⁻¹ or ϕZ⁻²) and a term that depends on only the expected frequencies of the counted and interfering events. The latter term is the same for both models, and we call it the interference ratio (Ωc) for a particular event Ec:

Ωc = (Expected occurrences of events other than the counted event) / (Expected occurrences of the counted event) [× (1 + μ̄i)].    (4.2)
The interference ratio expresses the extent to which a counted event is likely to be swamped by interfering events during the counting epoch. The principal determinant is the ratio of occurrences of interfering and counted events: μI/μc. The term in square brackets expresses the added uncertainty that can be introduced when interfering events occur with multiple repetitions (μ̄i ≫ 1), because overlaps that are above and below expectation do not then average out so effectively. As described after equation 3.7, stable conditions may allow the nervous system to adapt to compensate for a fixed set of overlaps, corresponding to omission of this term; it is in any case negligible if interfering events seldom repeat (μ̄i ≪ 1). We can see how Ωc governs the relative variance of the count by substituting in equations 4.1 and 3.16 for the two models:

rprojection(m̂c) = Ωc / (Z − 1),    (4.3)

rsupport(m̂c) = ϕ × Ωc / Z².    (4.4)
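Substituting the example numbers used in this section into equations 4.3, 4.4, and 1.2 reproduces the efficiencies quoted in the text for Ωc = 100, Z = 100, a = 0.1 (a sketch of ours; the ϕ helper assumes our reconstruction of equation 3.17):

```python
import math

def phi(W, Z):
    """The factor phi of equation 3.17 (our reconstruction)."""
    F = lambda r: math.perm(W, r) ** 2 / math.perm(Z, r)
    var_U2 = F(4) + 6 * F(3) + 7 * F(2) + F(1) - (F(2) + F(1)) ** 2
    return var_U2 / (W ** 4 / Z ** 2 * (1 - W / Z) ** 2
                     * (1 + (W - 2) / Z) ** 2 / (1 - 1 / Z) ** 2)

Omega_c, Z, a = 100.0, 100, 0.1
W = round(a * Z)

r_proj = Omega_c / (Z - 1)              # equation 4.3
r_supp = phi(W, Z) * Omega_c / Z ** 2   # equation 4.4

e_proj = 1 / (1 + r_proj)               # equation 1.2
e_supp = 1 / (1 + r_supp)
print(round(e_proj, 3), round(e_supp, 3))  # about 0.5 and 0.93
```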
If the number of cells (or synapses, for the support model) is much less than the interference ratio (Z ≪ Ωc or Z² ≪ ϕΩc), then r ≫ 1 and counting efficiency is necessarily very low. High efficiency (> 50%) requires r < 1, and hence networks that are correspondingly much larger and very redundant from an information-theoretic point of view (see section 6.1). For a particular number of cells Z, 50% efficiency for an event Ec requires that Ωc < Z or Z²/ϕ on the two models. For the support model, the highest levels of interference are tolerated with sparse representations giving minimum ϕ, with a approximately 1/√(2Z) and Ωc < 0.1–0.25 Z² (see section 3.2). If we set a given criterion of efficiency, the maximum interference ratio Ωc scales in proportion to Z for the projection model and approximately Z² for the support model. This is consistent with what one might expect given that the numbers of underlying variables used to accumulate counts are, respectively, Z and Z² in the two models, though the performance per cell in the projection model is up to 10 times greater than the performance per synapse in the support model. Figure 3 illustrates the dependence of efficiency on the number of cells Z, for an interference ratio Ωc = 100 and a = 0.1. When Z exceeds 100 (= Ωc),
Figure 3: Counting efficiency as a function of the number of cells (Z) used for distributed representations assumed to have the same activity ratio (a) for all events. Calculations are for Ωc = 100 (i.e., 100 times more total occurrences of interfering events than of the counted event). The full lines are for a = 0.1 with the projection model (heavy line) and support model (thin line), and for a = 1/√(2Z) for maximum efficiency using the support model (dotted line). Note that if all events are equiprobable, implying that there is a total of just 101 different event types occurring, then seven cells would suffice to represent them distinctly if there were no need for counting. High counting efficiency requires 10 times this number, and the corresponding ratio would be higher for larger numbers of events.
the efficiency exceeds about 50% on the projection model and 93% on the internal support model. The advantage of the second model (based on counts of paired cellular activity) results from the fact that the proportion of pairings shared with a typical interfering event (a²) is less than the proportion of shared active cells (a). Efficiency with the projection model is independent of a, while with the support model, it is maximized by choosing a to minimize ϕ in equation 4.4 (see above). Simulations were performed using LabVIEW (National Instruments) to confirm some of the results in the foregoing analysis. Figures 4A and 4B show simulation results for conditions for which Figure 3 gives the theoretical expectations (N = 101 equiprobable events, μ = 10, Ωc = 100, a = 0.1, Z = 100). The observed efficiencies are in agreement (see the legend to Figure 4), and the graphs illustrate the extent of correspondence between estimated and true counts. The horizontal spread on these graphs shows
Figure 4: Estimated counts plotted against actual counts, obtained by simulation with the projection model (A, C, D) and the support model (B). The network had 100 cells (Z = 100) with 10 active during events (W = 10, a = 0.1), selected at random for each of 101 different events for A, B, C and 10 for D. The numbers of occurrences of each event were independent Poisson variables with mean μ = 10. New sets of representations and counts were generated to calculate each point. Estimated counts used the algorithms either with (A, B) or without (C, D) adaptation to the actual overlapping of representations of interfering events with the counted event. Interference ratios (Ωc; equation 4.2) were therefore 100 for A, B, D and 1089 for C. Perfect estimates would lie on the line of equality (dashed). The solid line shows the linear regression of estimates on actual counts. Efficiency e is shown, calculated as μ divided by the mean squared error for the 100 points. This was not significantly different (in five repeats of such simulations) from efficiencies expected from the theoretical analysis (equations 4.3, 4.4, and 1.2), which were A: 50%, B: 93.3%, C: 8.3%, D: 50%.
the Poisson variation of the true counts about their mean (μ = 10), while the vertical deviation from the (dashed) lines of equality shows the added variance due to the algorithms. The closeness of the regression lines and the lines of equality (ideal performance) shows that the algorithms are unbiased, while the efficiency is μ divided by the mean squared error (almost equivalent to the squared correlation coefficient r²). Performance is better for the support model (see Figure 4B) than the projection model (see Figure 4A) and is worse (see Figure 4C) if adaptation to fixed stochastic conditions and overlaps is not employed. The interference ratio (Ωc; see equation 4.2) rose for Figure 4C to Ωc = 1089 with the same number of events handled by the network. With just 10 events (substantially fewer than the number of cells), Ωc was restored to 100 and the efficiency to 50%, as predicted (see Figure 4D). The theoretical dependence of efficiency on a, Z, and Ωc for representations having uniform a is shown in contour plots in Figures 5A and 5B for the projection and support models, respectively. Contours of equal efficiency (for Ωc = 100) are plotted against a and log(Z/Ωc). For the projection model (see Figure 5A), the efficiency is independent of a and depends on only the ratio Z/Ωc (or, more strictly, (Z − 1)/Ωc; see equation 4.3). This graph would not be significantly different for any larger value of Ωc. For the support model (see Figure 5B), the efficiency is higher than for the projection model (see Figure 5A) for all combinations of Z/Ωc and a, especially if a is close to the optimum level of sparseness corresponding to the contour minima (a ≈ 1/√(2Z)). Higher interference ratios (Ωc > 100) yield even greater efficiencies with a given value of Z/Ωc, while the benefits of sparse coding with the support model become more pronounced. Some of the implications of these plots are illustrated here with an example.
Suppose initially that events are represented by 12 active cells on a population of 30 (a = 0.4, Z = 30). With an interference ratio Ωc = 100, the counting efficiency would be 22% on the projection model and 45% on the support model (points marked a in Figures 5A and 5B). These are modest efficiencies, but changing the representations with the same event statistics can improve the efficiency. Representing events with more sparse activity on the same number of cells, as might be achieved by simply raising activation thresholds, can improve counting efficiency with the support model, though not with the projection model. Representation with 4 instead of 12 active cells out of 30 (a = 0.13) gives efficiencies of 22% and 63% on the two models (points b). Though efficiency is improved with the support model, information may be lost by this encoding, since there are only about 10⁴ distinct patterns with 4 active cells out of 30, compared with 10⁸ having 12 active. A better strategy is to use more cells. For example, recoding to 5 active cells on 100, we get 50% and 93% efficiency on the two models (points c), and with 4 active on 300 we get 75% and 98% (points d). Each of these encodings retains about 10⁸
Limits of Counting Accuracy
Figure 5: Plots showing counting efficiency (indicated on contours) as a function of activity ratio a on the horizontal axis (assumed equal for all events) and log10 of the factor by which the number of cells Z exceeds the interference ratio Wc (vertical axis). Plots are for the projection model (A) and the support model (B), calculated for Wc = 100, though plot A is essentially identical for all larger values of Wc, apart from the cut-off at low a corresponding to the requirement W ≥ 1. Points a–d are referred to in the text.
distinguishable patterns, so incurs no loss of information. The improvement in counting efficiency is achieved at the cost of extra redundancy, with, in this example, a tenfold increase of the number of cells. These levels of efficiency fall short of the 100% efficiency that can be achieved if cells are assigned to direct representations for each event of interest. One hundred cells would be enough (without duplication or spare capacity) to provide direct representations for 100 events, which is the largest number that can simultaneously have interference ratios Wc = 100. If these 100 cells are used instead for a distributed representation for these 100 events, then there would be a moderate loss of efficiency to 50% or 93% on the two models. The merits of distributed and direct representation are considered further in the discussion, but these results suggest that if distributed representations are to be used for efficient counting, then just as many cells may be required as for direct representations. However, their ability to represent rare and novel events unambiguously gives them greater flexibility even in relation to counting.

5 Events with Different Probabilities and Different Activity Ratios

The interference ratio Wc is higher for rare events than for common events, since the factor that varies most in proportionate terms in equation 4.2 is the denominator m_c. If, as assumed in the last section, all event representations
A. R. Gardner-Medwin and H. B. Barlow
have the same activity ratio a, this means that rare events (with probability less than about 1/Z on the projection model) are counted inefficiently because the overlap errors act like a source of intrinsic noise. Although only relatively few events (at most Z or Z²/10) can simultaneously be counted with ≥ 50% efficiency, many of the huge number with distinct representations on a network may occur, but too rarely to be countable. Since the relative frequencies of different events are not stable features of an animal's environment, however, an infrequent event in one epoch may be counted efficiently in different epochs when it occurs more frequently (see section 6.4). The poor counting efficiency of rare events can, however, be boosted if they are represented with higher activity ratios than other events. This requires that particular events be identified as worthy of such amplification, through being recognized as important, or simply novel. Conversely, lowering a for common or unimportant events will be beneficial. The mechanisms that may vary activity ratios are discussed in section 6.3. The results are illustrated in Figure 6 for a simple example using the projection model, based on equation 3.10. Calculations are for 7 events with different frequencies: first with equal activity ratios a = 0.02 (squares) and then with activity ratios inversely proportional to frequency (crosses), giving more uniform efficiency. The way in which rare events benefit from higher a is that the high probability that individual cells will be active for other interfering events is compensated by the pooling of information from more active cells. Experimentation with different power-law relationships a = km^(−n) required n in the range 1.0 to 1.3 to give approximately uniform efficiency over a range of conditions, with n ≈ 1.0 when all values of a are ≪ 1. Although these results are presented only for the projection model, it is clear that qualitatively similar conclusions must apply for the support model.
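The inverse-frequency assignment used in Figure 6 can be sketched as follows. The seven mean counts below are hypothetical stand-ins (the paper's actual values are not reproduced here); the scaling constant k is chosen so that the probability-weighted mean activity ratio is 0.02, as in the figure:

```python
import numpy as np

m = np.array([10.0, 20.0, 50.0, 100.0, 200.0, 500.0, 1000.0])  # hypothetical mean counts
p = m / m.sum()               # relative probabilities of the seven events
a_raw = 1.0 / m               # activity ratio inversely proportional to frequency (n = 1)
k = 0.02 / (p * a_raw).sum()  # scale so the weighted mean activity ratio is 0.02
a = k * a_raw                 # rare events get the highest activity ratios
print(np.round(a, 4))
```

With these illustrative counts the rarest event receives an activity ratio roughly a hundred times that of the commonest, which is the compensation mechanism the text describes.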
Figure 6: Effect of nonuniform event frequencies on counting efficiency. The number of active cells (A) and the counting efficiency (B) are shown for seven events with widely differing average counts during a counting epoch (horizontal axes). Two different assumptions are made: events are represented each with the same number of active cells (squares), or with a number inversely proportional to the event probability (×). In each case the mean value of a (weighted with probabilities of occurrence) is 0.02. Calculations are for the projection model with Z = 400 cells and full variances (see equation 3.10). Rare events have lower counting efficiency when all representations have equal activity ratios, but approximately uniform efficiency is achieved (×) when more active cells are used to represent rare events.

6 Discussion

Counting events, or determining their frequencies, is necessary for learning and for appropriate control of behavior. This article analyzes the uncertainty in estimating frequencies that arises from sharing active elements between the distributed representations of different events. To analyze this problem, we make substantial simplifying assumptions. We treat neurons as binary, with different events represented by randomly related selections of active cells. We treat the frequencies of interfering events as independent Poisson variables. And we employ just two simple models as counting algorithms, with numbers of physiological variables equal to the numbers of cells in one case and synapses in the other. Sensory information contains a great deal of associative statistical structure that is absent from the representations we model. Our results would require modification for structured data, but it has long been argued, with considerable supporting evidence (Attneave, 1954; Barlow, 1959, 1989; Watanabe, 1960; Atick, 1992), that a prime function of sensory and perceptual processing is to remove much of the associative statistical structure of the raw sensory input by using knowledge of what is predictable. The resulting less structured input would be closer to what we have assumed for our analysis. Notice, however, that inference-like processes are involved in unsupervised learning and perceptual coding, and that these will be influenced by errors of counting in ways similar to those analyzed here.
Our aim in the discussion is to identify simple insights and principles that are likely to apply to all processes of adaptation and learning that depend on counting. There are five sections: the counting problem in general, five principles for reducing the interference problem, possible applications of these principles in the brain, relative merits of direct and distributed representations, and the problem of handling the huge numbers of representable events on a distributed network.

6.1 The Counting Problem in Distributed Representations. Counting with distributed representations nearly always leads to errors, but the loss of efficiency need not be great if the representations follow principles outlined in the next section. We shall see that the capacity of a network to count different events (and therefore to learn about them) is generally enormously less than its capacity to represent different events. The interference ratio Wc, defined in equation 4.2, gives direct insight into what limits counting efficiency. At its simplest, this is the ratio of the total number of interfering events (of any type) to the number of events being counted over a defined counting epoch. In our first (projection) model, the number of cells employed in the representations (Z) must exceed Wc if the efficiency is to remain above 50%; in the second (internal support) model, fewer cells may suffice for equivalent performance, provided the number of synapses (Z²) is at least four to ten times Wc (see equation 4.4). Increasing the number of cells increases efficiency: with either model, three times as many cells or synapses are required to achieve 75% efficiency and nine times as many for 90% efficiency. Multiple repetitions of individual interfering events tend to amplify the errors due to overlap and raise these requirements further (see equation 4.2).
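The 50%, 75%, and 90% requirements just stated are consistent with a projection-model efficiency of the form (Z − 1)/((Z − 1) + Wc). This closed form is an assumption inferred from the (Z − 1)/Wc dependence noted with Figure 5, not the paper's equation 4.3 itself, but it also reproduces points a, c, and d of the worked example in section 4:

```python
def projection_efficiency(Z, Wc):
    """Assumed closed form: efficiency depends only on (Z - 1) / Wc."""
    return (Z - 1) / ((Z - 1) + Wc)

Wc = 100
# Scaling claim: Z - 1 = Wc, 3*Wc, 9*Wc should give 50%, 75%, 90%
for Z in (Wc + 1, 3 * Wc + 1, 9 * Wc + 1):
    print(Z, round(projection_efficiency(Z, Wc), 3))

# Points a, c, d of section 4: Z = 30, 100, 300 cells (22%, 50%, 75%)
for Z in (30, 100, 300):
    print(Z, round(projection_efficiency(Z, Wc), 2))
```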
In a situation where all events are equally frequent, the interference ratio is just the number of different events, and this number is limited, as above, to roughly Z or Z²/10 for 50% efficiency. In the more general case where events have different probabilities or activate different numbers of cells, their counting efficiencies will vary; roughly speaking, it is those with the largest products of activity ratio and expected frequency (a_i m_i) that have the greatest counting efficiency (see section 5), and the number that are simultaneously countable is again limited to roughly Z or Z²/10. Events that are rare and sparsely represented cannot be efficiently counted, though if their probability rises, they may become countable in a different epoch (see section 6.4). The number of simultaneously countable events is dramatically fewer than the 2^Z distinct events that can be unambiguously represented on Z cells, and it scales with only the first or second power of Z, not exponentially. This parallels long-established results on the storage and recall of patterns using synaptic weight changes, where the number of retrievable patterns scales between the first and second power of the number of cells, with the retrievable information (in bits) ranging from 8% to 69% (with extreme assumptions) of the number of synapses involved for different models (e.g., Willshaw et al., 1969; Gardner-Medwin, 1976; Hopfield, 1982). Huge levels of redundancy are required in order simultaneously to count or learn about a large number (n) of events, compared with the minimum number of cells (log2(n)) on which these events could be distinctly represented. For Z = n, as required either for direct representations or for distributed representations with 50% counting efficiency on our projection model, the redundancy defined in this way is 93% for n = 100, rising to 99.9% for n = 10^4. This problem will reemerge in various guises in the following sections.

6.2 Principles for Minimizing Counting Errors. Our results lead to the following principles for achieving effective counting in distributed representations:

1. Where many events are to be counted on a net, use highly redundant codes with many times the number of cells needed simply to distinguish these events.

2. Use sparse coding for common and unimportant events, but raise the activity ratio for events that can be identified as likely to have useful associations, especially if the events are rare.

3. Minimize overlap between representations, aiming for overlaps less than those of the random selections of active cells that we have assumed for analysis.

These principles arise directly from the analysis of the problem as we have defined it: to count particular reproducible events on a network in the context of interfering events represented on the same network. Two more principles arise from considering how representations can economically permit counting that is effective for learning:

4. Representations should be organized so that counting can occur in small modules, each being a focus of information likely to be relevant to a particular type of association. Large numbers of events would then be grouped for counting into subsets (those that have the same representation in a module).

5.
The special subsets that activate individual cells should, where possible, be ones that resemble each other in their associations. Overlap between representations should ideally mirror similarities in the implications of events in the external world. These extra principles can help to avoid the necessity of counting large numbers of individual events, making learning depend on the counting of fewer categorizations of the environment, and therefore manageable on fewer cells. They can lead to appropriate generalization of learning and
can make learning faster, since events within a subset obviously occur more often than the individual events.

6.3 Does the Brain Follow These Principles? One of the puzzles about the cortical representation of sensory information lies in the enormous number of cells that appear to be devoted to this purpose. A region of 1 deg² at the human fovea contains about 10,000 sample points at the Nyquist interval (taking the spatial frequency limit as 50 cycles per degree), and the number of retinal cones sampling the image is quite close to this figure. But the number of cells in the primary visual cortex devoted to that region is of order 10^8. Some of the 10^4-fold increase may be explained by the role of the primary visual cortex in distributing information to many destinations, but this cannot account for all of it, and one must conclude that this cortical representation is grossly redundant. The selective advantage that has driven the evolution of cortical redundancy may be the necessity for efficient counting and learning, as encapsulated in our first principle. In relation to the second principle, evidence for sparse coding was put forward by Legéndy and Salcman (1985), and it was clear from the earliest recordings from single neurons in the cerebral cortex that they are hard to excite vigorously and must spend most of their time firing at very low rates, with only brief periods of strong activity when they happen to be excited by their appropriate trigger feature. Field (1994), Baddeley (1996), Baddeley et al. (1997), Olshausen and Field (1997), van Hateren and van der Schaaf (1998), Smyth, Tolhurst, Baker, and Thompson (1998), Tolhurst, Smyth, Baker, and Thompson (1999), Tadmor and Tolhurst (in press), and others have now quantitatively substantiated this impression. Counting accuracy for a particular event benefits from the sparse coding of other events (reducing their overlap and interference), not from its own sparse coding.
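The 10,000-sample-point estimate above is a Nyquist calculation, two samples per cycle at the 50 cycles-per-degree limit; a back-of-envelope check:

```python
f_max = 50.0                    # spatial frequency limit, cycles per degree
samples_per_degree = 2 * f_max  # Nyquist rate: two samples per cycle
points_per_deg2 = samples_per_degree ** 2
print(points_per_deg2)          # sample points per square degree

cortical_cells = 1e8            # order-of-magnitude V1 cell count quoted in the text
print(cortical_cells / points_per_deg2)  # the ~10^4-fold expansion
```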
Sparseness is especially important for common events because of the extent of interference they can cause, and because it can be afforded with little impact on their own counting, which is already quick and efficient. The benefits of sparseness are greatest for counting based on Hebbian synaptic modification (as in our support model and a related model for direct detection of associations: Gardner-Medwin & Barlow, 1992). Significant events (particularly rare ones) may need higher activity ratios to boost their counting efficiency at the expense of others. We envisage that this might be achieved if selective attention to important and novel events favors the representation of more features of the environment through lowering of cell thresholds and the acquisition of more detailed sensory information. Adaptation and habituation, on the other hand, can raise thresholds, favoring the required sparse representation of common events. This flexibility of distributed representations is attractive, for in biological circumstances, the probability and significance of individual events are highly context dependent.
Overlap reduction, as advocated in our third principle, follows to some extent from any mechanism achieving sparse coding. But there are also indications that well-known phenomena like the waterfall illusion and other after-effects of pattern adaptation involve a "repulsion" mechanism (Barlow, 1990) that would directly reduce overlap between representations following persistent coactivation of their elements (Barlow & Földiák, 1989; Földiák, 1990; Carandini, Barlow, O'Keefe, Poirson, & Movshon, 1997). Similar after-effects offer possible mechanisms for improving the long-term recall of overlapping representations of events, through processing during sleep (Gardner-Medwin, 1989). The fourth principle fits rather well with the organization of the cortex in modules at several scales. The great majority of interactions between cortical cells are with other cells that are close by (Braitenberg & Schüz, 1991), so the smallest module might be a mini-column or column, such as are found in sensory areas (Edelman & Mountcastle, 1978). These would be the focus of the statistical structure comprising local correlations. Each pyramidal cell receives many (predominantly local) inputs that effectively define its module, as the connections to the accumulator cell define the network in our projection model (see Figure 1). The outputs then go to other areas of specialization, for example, area MT, as a focus for the spatiotemporal correlations of motion information. Optimal organization of a large system must involve mixing diverse forms of information to find new types of association (Barlow, 1981), as well as concentrating information that has revealed associations in the past, and this seems broadly consistent with cortical structure. The trigger features of cortical neurons often make sense in terms of behaviorally important objects in the world surrounding an animal, suggesting that the brain exploits the advantages expressed in the fifth principle.
For instance, complex cells in V1 respond to the same spatiotemporal pattern at several places in the receptive field, effectively generalizing for position. The same is true for motion selectivity in MT, and perhaps for cells that respond to faces or hands in inferotemporal cortex (Gross, Desimone, Albright, & Schwarz, 1985). Such direct representation of significant features (Barlow, 1972, 1995) assists in making distributed representations workable.

6.4 The Relative Merits of Counting with Direct and Distributed Representations. Maximum counting efficiencies for events represented within a network or counting module would be achieved with direct representations, but this requires prespecification of the patterns to be detected and counted. In contrast, a distributed network has the flexibility to represent and count, without modification, events that have not been foreseen. Our results show that provided these unforeseen representations have no more than chance levels of overlap with those of other events (as in our models), good performance can be achieved on a network of Z cells with almost any
limited selection (numbering of order Z or Z²/10) of frequent events out of the very large number (of order 2^Z) that can be represented. Infrequent but important events can be included within this selection by raising their activity ratios. The potential for unambiguous representation of a huge number of rare and unforeseen events means that distributed networks can be very versatile for counting purposes in a nonstationary environment. Learning often takes place in short epochs during which some type of experience is frequent, for example, learning on a summer walk that nettles sting, or enlarging one's vocabulary in a French lesson. A system in which there is gradual decay of interference from previous epochs, reasonably matched in its time course to the duration of a significant counting epoch, can in principle allow efficient counting of any transiently frequent event that is represented by any one of the patterns that the network can generate. This can assist in learning associations from short periods of intense and novel experience, though, of course, there may be hazards in generalizing such associations to other periods. Once direct representations are set up for significant events, many such events might in principle be simultaneously detected and counted, through parallel processing. This cannot occur where events have overlapping distributed representations: one cannot have two different patterns simultaneously on the same set of cells. Thus, there is a trade-off between the flexibility of distributed representations in handling unforeseen events and the constraint that they can handle them only one at a time. Since events of importance at the sensory periphery are not generally mutually exclusive, it seems necessary to use distributed representations in a serial rather than a parallel fashion by attending to one aspect of the environment at a time.
This is a significant cost to be paid for the versatility of distributed representations, but it resembles in some degree the way we handle novel and complex experience.

6.5 Managing the Combinatorial Problem. Huge numbers of high-level feature detectors are required by some models of sensory coding based on direct representations: the so-called grandmother cell or yellow Volkswagen problem (Harris, 1980; Lettvin, 1995). The ability of distributed representations to represent a vastly greater number of events than is possible with direct representation is often thought to permit a way around this problem, but our results show that economy is not so simply achieved. For counting and learning, distributed representations have no advantage (or, with our support model, rather limited advantage) over direct representations, because the maximum number that can be efficiently counted scales only with Z or Z² rather than 2^Z. Where distributed networks are used for representing larger numbers of significant events, counting on subsets of events (our fourth principle) may permit an economy of cells. This economy relies, however, on being able to separate off into a relatively small module the information that is needed to establish one type of association about an
event, while other modules receive information relevant to other associations. A combination of economical representation of events on distributed networks with the more extravagant use of cells required for counting may depend for success on a property of the represented events: that their associations can be learned and generalized from a separable fraction of the information they convey. The combinatorial problem arises in an acute form at a sensory surface such as the retina, since impossibly large numbers of patterns can (and do) occur on small numbers of cells. Receptors that are not grouped close together experience states that are largely independent in a detailed scene, and for just 50 receptors, many of the 10^15 distinct events may be almost equally likely (even considering just two states per cell). This means that it is out of the question to count peripheral sensory events separately, and it would probably be impossible to identify even the correlations due to image structure without the topographic organization that allows this to be done initially on small sets of cells. The considerable (10,000-fold) expansion in cell numbers from retina to cortex (see section 6.3) allows for the counting of many subsets of events, but it would not go far toward efficient counting of useful events without using a hierarchy of small modules, each analyzing different forms of structure to be found in the image.

7 Conclusion

Our analysis suggests ways in which distributed and direct representations should be related, and it has implications for understanding many features of the cortex. These include the expansion of cell numbers involved in cortical sensory representations, the extensive intracortical connectivity and its modular organization, the tendency for trigger features to correspond to behaviorally meaningful subsets of events, phenomena of habituation to common events, alerting to rare events, and attention to one thing at a time.
Distributed representation is unavoidable in the brain but may cause serious errors in counting and inefficiency in learning unless it is guided by the principles that we have identified.

References

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251.

Attneave, F. (1954). Informational aspects of visual perception. Psychological Review, 61, 183–193.

Baddeley, R. J. (1996). An efficient code in V1? Nature (London), 381, 560–561.

Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proceedings of the Royal Society, Series B, 264, 1775–1783.
Barlow, H. B. (1959). Sensory mechanisms, the reduction of redundancy, and intelligence. In The mechanisation of thought processes (pp. 535–559). London: Her Majesty's Stationery Office.

Barlow, H. B. (1962). Measurements of the quantum efficiency of discrimination in human scotopic vision. Journal of Physiology (London), 160, 169–188.

Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.

Barlow, H. B. (1981). Critical limiting factors in the design of the eye and visual cortex. The Ferrier lecture, 1980. Proceedings of the Royal Society, London, B, 212, 1–34.

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.

Barlow, H. B. (1990). A theory about the functional role and synaptic mechanism of visual after-effects. In C. B. Blakemore (Ed.), Vision: Coding and efficiency. Cambridge: Cambridge University Press.

Barlow, H. B. (1995). The neuron doctrine in perception. In M. Gazzaniga (Ed.), The cognitive neurosciences (pp. 415–435). Cambridge, MA: MIT Press.

Barlow, H. B., & Földiák, P. (1989). Adaptation and decorrelation in the cortex. In R. Durbin, C. Miall, & G. Mitchison (Eds.), The computing neuron (pp. 54–72). Reading, MA: Addison-Wesley.

Barlow, H. B., & Reeves, B. C. (1979). The versatility and absolute efficiency of detecting mirror symmetry in random dot displays. Vision Research, 19, 783–793.

Barlow, H. B., & Tripathy, S. P. (1997). Correspondence noise and signal pooling in the detection of coherent visual motion. Journal of Neuroscience, 17, 7954–7966.

Braitenberg, V., & Schüz, A. (1991). Anatomy of the cortex: Statistics and geometry. Berlin: Springer-Verlag.

Carandini, M., Barlow, H. B., O'Keefe, L. P., Poirson, A. B., & Movshon, J. A. (1997). Adaptation to contingencies in macaque primary visual cortex. Philosophical Transactions of the Royal Society, Series B, 352, 1149–1154.

Churchland, P. S. (1986). Neurophilosophy: Toward a unified science of the mind-brain.
Cambridge, MA: MIT Press.

Edelman, G. E., & Mountcastle, V. B. (1978). The mindful brain. Cambridge, MA: MIT Press.

Farah, M. J. (1994). Neuropsychological inference with an interactive brain: A critique of the "locality" assumption. Behavioural and Brain Sciences, 17, 43–104.

Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.

Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.

Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64(2), 165–170.

Gardner-Medwin, A. R. (1969). Modifiable synapses necessary for learning. Nature, 223, 916–919.

Gardner-Medwin, A. R. (1976). The recall of events through the learning of associations between their parts. Proceedings of the Royal Society B, 194, 375–402.
Gardner-Medwin, A. R. (1989). Doubly modifiable synapses: A model of short- and long-term auto-associative memory. Proceedings of the Royal Society B, 238, 137–154.

Gardner-Medwin, A. R., & Barlow, H. B. (1992). The effect of sparseness in distributed representations on the detectability of associations between sensory events. Journal of Physiology, 452, 282P.

Gross, C. G., Desimone, R., Albright, T. D., & Schwarz, E. L. (1985). Inferior temporal cortex and pattern recognition. In C. Chagas, R. Gattass, & C. Gross (Eds.), Pattern recognition mechanisms (pp. 165–178). Vatican City: Pontificia Academia Scientiarum.

Harris, C. S. (1980). Insight or out of sight? Two examples of perceptual plasticity in the human adult. In C. S. Harris (Ed.), Visual coding and adaptability (pp. 95–149). Hillsdale, NJ: Erlbaum.

Hinton, G. E., & Anderson, J. A. (1981). Parallel models of associative memory. Hillsdale, NJ: Erlbaum.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart, J. L. McClelland, & the PDP Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 77–109). Cambridge, MA: MIT Press.

Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational properties. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.

Jones, R. C. (1959). Quantum efficiency of human vision. Journal of the Optical Society of America, 49, 645–653.

Legéndy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of spontaneously active striate cortex neurons. Journal of Neurophysiology, 53, 926–939.

Lettvin, J. Y. (1995). J. Y. Lettvin on grandmother cells. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 434–435). Cambridge, MA: MIT Press.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.

Page, M. (in press).
Connectionist modeling in psychology: A localist manifesto. Behavioural and Brain Sciences, 23.

Rose, A. (1942). The relative sensitivities of television pick-up tubes, photographic film, and the human eye. Proceedings of the Institute of Radio Engineers, 30, 293–300.

Rumelhart, D. E., & McClelland, J. (1986). Parallel distributed processing. Cambridge, MA: MIT Press.

Smyth, D., Tolhurst, D. J., Baker, G. E., & Thompson, I. D. (1998). Responses of neurons in the visual cortex of anaesthetized ferrets to natural visual scenes. Journal of Physiology, London, 509, 50–51P.

Tadmor, Y., & Tolhurst, D. J. (in press). The difference-of-Gaussians receptive field model and the contrasts in natural scenes. Vision Research.

Tanner, W. P., & Birdsall, T. G. (1958). Definitions of d′ and η as psychophysical measures. Journal of the Acoustical Society of America, 30, 922–928.

Tolhurst, D. J., Smyth, D., Baker, G. E., & Thompson, I. D. (1999). Variations in the sparseness of neuronal responses to natural scenes as recorded in striate cortex of anaesthetized ferrets. Journal of Physiology, London, 515, 101P.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society, Series B, 265, 359–366.

Watanabe, S. (1960). Information-theoretical aspects of inductive and deductive inference. IBM Journal of Research and Development, 4, 208–231.

Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Non-holographic associative memory. Nature, 222, 960–962.

Received July 28, 1999; accepted June 14, 2000.
NOTE
Communicated by Zoubin Ghahramani
An Expectation-Maximization Approach to Nonlinear Component Analysis

Roman Rosipal, Mark Girolami
Computational Intelligence Research Unit, Department of Computing and Information Systems, University of Paisley, Paisley, PA1 2BE, Scotland, U.K.

The proposal of considering nonlinear principal component analysis as a kernel eigenvalue problem has provided an extremely powerful method of extracting nonlinear features for a number of classification and regression applications. Whereas the utilization of Mercer kernels makes the problem of computing principal components in, possibly, infinite-dimensional feature spaces tractable, there are still the attendant numerical problems of diagonalizing large matrices. In this contribution, we propose an expectation-maximization approach for performing kernel principal component analysis and show this to be a computationally efficient method, especially when the number of data points is large.

1 Introduction
The notion of performing linear principal component analysis (PCA) in a high- (and possibly innite-) dimensional nonlinear feature space was rst proposed in Sch¨olkopf, Smola, and Muller ¨ (1998). The key feature of the proposed method is the creation of the kernel matrix (Sch¨olkopf et al., 1998) and its subsequent diagonalization. Each element of the kernel matrix is composed of the dot product of the nonlinearly mapped data points, and the elegant use of Mercer kernels allows these dot products in highdimensional feature space to be computed using simple kernel functions in data space (Scholkopf, ¨ Burges, & Smola, 1999; Scholkopf ¨ et al., 1998). The dimension of the kernel matrix is equal to the number of data points, and so its diagonalization allows nonlinear principal components to be extracted from the data, the number of which may be up to the number of observed data points. This implies that the number of principal components extracted can exceed the dimensionality of the data (Sch¨olkopf et al., 1998). In practice, however, if there is a substantial number of observations, then the diagonalization of the associated kernel matrix can be somewhat problematic in terms of computational demands and numerical accuracy (Jolliffe, 1986; Rosipal, Girolami, & Trejo, 2000). The problem of performing an eigenvalue decomposition on large covariance matrices can be alleviated by using the expectation-maximization (EM) approach for PCA, which Neural Computation 13, 505–510 (2001)
© 2001 Massachusetts Institute of Technology
emerges from considering PCA from a probabilistic perspective (Tipping & Bishop, 1999; Roweis & Ghahramani, 1999). Taking the EM approach to PCA in data space, we now show that this can be modified to allow the efficient computation of kernel PCA (Schölkopf et al., 1998). 2 An EM Approach to Kernel PCA
The data space model for probabilistic PCA is given as y = Cx + v, where C ∈ ℝ^{p×k}, and the observation and latent variable vectors are given as y ∈ ℝ^p and x ∈ ℝ^k, respectively. The latent variables are normally distributed with zero mean and identity covariance. The zero-mean noise v is also normally distributed, with a covariance matrix defined as Ψ. It is shown in Roweis and Ghahramani (1999) and Tipping and Bishop (1999) that as the noise level in the model becomes infinitesimal, the PCA model is recovered. The posterior density then becomes a delta function, P(x | y) = δ(x − (CᵀC)⁻¹Cᵀy), and the EM algorithm is effectively a straightforward least-squares projection (Roweis & Ghahramani, 1999), which is given below. We denote the matrix of data observations as Y ∈ ℝ^{p×n} and the matrix of latent variables as X ∈ ℝ^{k×n}. Then: E-step: X = (CᵀC)⁻¹CᵀY.
M-step: C_new = YXᵀ(XXᵀ)⁻¹.
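As an illustration, this two-step iteration is only a few lines of NumPy (our own sketch; the random initialization and fixed iteration count are illustrative choices, not prescribed by the article):

```python
import numpy as np

def em_pca(Y, k, n_iter=100, seed=0):
    """Zero-noise-limit EM for PCA via the least-squares E/M steps above.

    Y : (p, n) matrix of centered observations, one observation per column.
    k : number of principal axes to extract.
    Returns an orthonormal (p, k) basis spanning the principal subspace.
    """
    p, n = Y.shape
    rng = np.random.default_rng(seed)
    C = rng.uniform(-1.0, 1.0, size=(p, k))       # arbitrary initialization
    for _ in range(n_iter):
        # E-step: X = (C^T C)^{-1} C^T Y  (least-squares projection)
        X = np.linalg.solve(C.T @ C, C.T @ Y)
        # M-step: C_new = Y X^T (X X^T)^{-1}
        C = Y @ X.T @ np.linalg.inv(X @ X.T)
    # C converges only up to a rotation and scaling; orthonormalize the span
    Q, _ = np.linalg.qr(C)
    return Q
```

At convergence, the columns of C span the same subspace as the leading k eigenvectors of the sample covariance matrix, which is the property the kernelized version exploits.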
We now wish to transform this EM procedure to feature space F. The implicit nonlinear map from input space to F is denoted Φ(·) (Schölkopf et al., 1998). For clarity of exposition, we now denote by Y the matrix whose individual columns are the vectors Φ(y_1), ..., Φ(y_n) of the mapped observed data. Centering in the feature space can be carried out in a straightforward manner by "centering" the kernel matrix K (which is defined below) using the simple procedure outlined in Schölkopf et al. (1998). Realizing that the columns of C are scaled and orthogonally rotated eigenvectors computed by diagonalization of the sample covariance matrix, we can express the rth column of C as C_r = Σ_{j=1}^n γ_{jr} Φ(y_j) (Tipping & Bishop, 1999; Schölkopf et al., 1998). Using this fact, we can write the (r, s) element of the matrix CᵀC as

(CᵀC)_{r,s} = (C_r)ᵀC_s = (Σ_{i=1}^n γ_{ir} Φᵀ(y_i)) (Σ_{j=1}^n γ_{js} Φ(y_j)) = Σ_{i,j=1}^n γ_{ir} γ_{js} (Φ(y_i) · Φ(y_j)) = Σ_{i,j=1}^n γ_{ir} γ_{js} K(y_i, y_j),

which gives in matrix form CᵀC = ΓᵀKΓ, where the columns of the matrix Γ ∈ ℝ^{n×k} consist of the vectors {γ_i}_{i=1}^k. Likewise, the second term of the
E-step expression can be written as follows:

(CᵀY)_{r,s} = Σ_{i=1}^n γ_{ir} Φᵀ(y_i) Φ(y_s) = Σ_{i=1}^n γ_{ir} (Φ(y_i) · Φ(y_s)) = Σ_{i=1}^n γ_{ir} K(y_i, y_s),

so that CᵀY = ΓᵀK. Finally, then, we have the required E-step: X = (ΓᵀKΓ)⁻¹ΓᵀK.
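Together with the M-step derived next, Γ_new = Xᵀ(XXᵀ)⁻¹, the kernel-space iteration can be written directly in NumPy. The sketch below is our own (the kernel choice, initialization, and iteration count are illustrative), and it includes the feature-space centering of K mentioned above:

```python
import numpy as np

def polynomial_kernel_matrix(Y, degree=2):
    """Gram matrix K_ij = (y_i . y_j)^degree for rows y_i of Y."""
    return (Y @ Y.T) ** degree

def center_kernel(K):
    """Feature-space centering of K (Schölkopf et al., 1998):
    K' = K - 1_n K - K 1_n + 1_n K 1_n, with 1_n the n x n matrix of 1/n."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    return K - one @ K - K @ one + one @ K @ one

def em_kernel_pca(K, k, n_iter=100, seed=0):
    """EM iteration for kernel PCA on the expansion coefficients Gamma,
    where the r-th principal direction is C_r = sum_j Gamma[j, r] Phi(y_j).
    E-step: X = (Gamma^T K Gamma)^{-1} Gamma^T K
    M-step: Gamma_new = X^T (X X^T)^{-1}
    """
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    G = rng.uniform(-1.0, 1.0, size=(n, k))   # random init, as in the article
    for _ in range(n_iter):
        X = np.linalg.solve(G.T @ K @ G, G.T @ K)   # E-step
        G = X.T @ np.linalg.inv(X @ X.T)            # M-step
    return G
```

At convergence, the columns of Γ span the leading eigenvector subspace of K, so the projections R(ΓᵀKΓ)⁻¹ΓᵀK_y agree with standard kernel PCA up to a rotation.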
Now let us consider the M-step. Denote the following term as follows: A ≡ Xᵀ(XXᵀ)⁻¹. As a consequence of the fact that the columns of C lie in the span of Φ(y_1), ..., Φ(y_n), we can consider the set of equations

Φᵀ(y_j) C_new = Φᵀ(y_j) Y A   for all j = 1, ..., n,

that is,

(Σ_{i=1}^n γ_{i1} K(y_j, y_i), ..., Σ_{i=1}^n γ_{ik} K(y_j, y_i)) = (K(y_j, y_1), ..., K(y_j, y_n)) A

for all j = 1, ..., n. We can write this in matrix form as KΓ_new = KA.¹ The M-step then follows: Γ_new = A = Xᵀ(XXᵀ)⁻¹. Tipping and Bishop (1999) have shown that in the case of infinitesimal noise in our model, that is, Ψ = lim_{σ²→0} σ²I, the matrix C at convergence will be equal to C = UL^{1/2}R, where the columns of the matrix U are the eigenvectors of the sample covariance matrix, with corresponding eigenvalues λ_1, ..., λ_k creating the diagonal matrix L, and R is an arbitrary orthogonal rotation matrix. Tipping and Bishop (1999) also pointed out that by taking the columns of Rᵀ to be equal to the eigenvectors of the CᵀC matrix, we can recover the true principal axes. Thus, in our case, the projection of a test point y onto the k nonlinear principal components is now given by
β(y) := R(CᵀC)⁻¹CᵀΦ(y) = R(ΓᵀKΓ)⁻¹ΓᵀK_y, where K_y is the "centered" vector [K(y_1, y), ..., K(y_n, y)]ᵀ (Schölkopf et al., 1998). We should note a number of points regarding this method for performing kernel PCA. First, due to the use of Mercer kernels, the method is independent of the dimensionality of the input space. Second, the computational complexity, per iteration, of the proposed EM method for kernel PCA is O(kn²), where n is the number of data points and k is the number
¹ In general, K is a positive semidefinite matrix; however, we can guarantee positive definiteness by adding arbitrarily small, positive values on the diagonal.
of extracted components. Where a small number of eigenvectors must be extracted and a large number of data points are available, this method is comparable in complexity to the iterative power method, which has complexity O(n²). Direct diagonalization of the symmetric K matrix to solve the eigenvalue problem for kernel PCA (Schölkopf et al., 1998) has complexity of the order O(n³). 3 Experiments
We tested the proposed EM algorithm for kernel PCA on two artificial two-dimensional data sets. By projecting the test input data onto the extracted principal components, we can visually compare the results, even though the nonlinear mapping to feature space may not be explicitly known. In the first example, we generated two parabolic shapes, vertically and horizontally mirrored. The data were generated by the function y = x² + 0.6, where the x values have a uniform distribution in [−1, 1]. We used 500 data points uniformly divided for each parabolic shape. Polynomial kernels of degree 2 and 3 were used. In the second example, three two-dimensional gaussian clusters with means [−0.5, −0.2; 0.0, 0.6; 0.5, 0.0] and common variance 0.1 were generated. Each cluster consisted of 500 data points. The gaussian kernel K(x, y) = exp(−‖x − y‖²/0.1) was used. In each case, the initial values of the Γ matrix components were randomly generated with uniform distribution in [−1, 1]. From Figures 1a and 1b, we can see the equivalence of the first two eigenvectors found by our EM approach (bottom) and the MATLAB eig procedure (top) in the case of the polynomial kernel of degree 2. We ran the EM algorithm until the dot product between the eigenvectors of the two-dimensional subspaces found by EM and by kernel PCA reached at least 0.999 (an angle of approximately 2.5 degrees). In this setting, we found that on average, fewer than two EM steps were sufficient to extract the two required eigenvectors. We also investigated the convergence of the proposed algorithm by running 200 simulations with different initializations of the Γ matrix. The average value of the dot product between the eigenvectors of the two-dimensional subspaces that were found was 0.995 (approximately 5.7 degrees). In six cases we observed that at convergence, the dot product was less than 0.98 (approximately 11.5 degrees), but greater than 0.96 (approximately 16.2 degrees).
Slightly better convergence results were achieved using the third-order polynomial kernel. In that case the average value of the dot product was 0.998 (approximately 3.6 degrees), and only in four cases was the value of the dot product in the 0.96 to 0.98 range. The use of the MATLAB eig function requires 646 times the number of flops required by the proposed EM approach. The MATLAB eigs function,² which is based on the iterative
² The default MATLAB setting, with convergence tolerance equal to 1 × 10⁻¹⁰, was used here.
Figure 1: (a–b) First two principal components extracted by diagonalization of the K matrix (top) and by the proposed EM algorithm (bottom) on the first example using the second-order polynomial kernel. (c) First two principal components extracted by the proposed EM algorithm on the second example. The gray scales in the figures represent the principal component values, and the contours represent lines of constant principal component value.
power method, still requires 5.8 times as many flops as two iterations of our EM method. In Figure 1c we demonstrate the results found in the second example using three EM steps. We can see that the first two principal components nicely separate the three clusters, in agreement with the results reported for kernel PCA (Schölkopf et al., 1998).
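For readers who wish to reproduce the experiments, the two test data sets can be generated roughly as follows (a sketch: the exact mirroring convention for the parabolas and the split of points between the shapes are our guesses from the description above):

```python
import numpy as np

def make_parabolas(n_per=500, seed=0):
    """Two mirrored parabolic shapes, y = x^2 + 0.6 with x uniform on [-1, 1].
    The mirroring convention used here is our guess from the description."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(-1.0, 1.0, n_per)
    x2 = rng.uniform(-1.0, 1.0, n_per)
    upper = np.column_stack([x1, x1 ** 2 + 0.6])
    lower = np.column_stack([x2, -(x2 ** 2 + 0.6)])   # mirrored copy
    return np.vstack([upper, lower])

def make_clusters(n_per=500, seed=0):
    """Three 2-D gaussian clusters with the stated means and variance 0.1."""
    rng = np.random.default_rng(seed)
    means = np.array([[-0.5, -0.2], [0.0, 0.6], [0.5, 0.0]])
    std = np.sqrt(0.1)
    return np.vstack([rng.normal(m, std, size=(n_per, 2)) for m in means])
```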
4 Conclusion
We have proposed an EM-based approach for performing a principal component decomposition in the feature space F. This provides another method for performing kernel-based principal component analysis, one that in many cases is extremely efficient in terms of the computing resources required. Further work in this area includes the study of the convergence behavior of the EM approach to nonlinear kernel-based PCA.
Acknowledgments
R. R. is funded by a research grant for the project Objective Measures of Depth of Anaesthesia, University of Paisley and Glasgow Western Infirmary NHS Trust, and is partially supported by the Slovak Grant Agency for Science (grants 2/5088/00 and 00/5305/468). References Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag. Rosipal, R., Girolami, M., & Trejo, L. (2000). Kernel PCA for feature extraction and de-noising in non-linear regression (Tech. Rep. No. 4). Paisley, Scotland: CIS Department, University of Paisley. Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11, 305–345. Schölkopf, B., Burges, C., & Smola, A. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press. Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319. Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. J. R. Statist. Soc. B, 61, 611–622. Received March 9, 2000; accepted May 31, 2000.
LETTER
Communicated by Laurence Abbott
A Population Density Approach That Facilitates Large-Scale Modeling of Neural Networks: Extension to Slow Inhibitory Synapses Duane Q. Nykamp* Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, U.S.A. Daniel Tranchina Department of Biology, Courant Institute of Mathematical Sciences, and Center for Neural Science, New York University, New York, NY 10012, U.S.A. A previously developed method for efficiently simulating complex networks of integrate-and-fire neurons was specialized to the case in which the neurons have fast unitary postsynaptic conductances. However, inhibitory synaptic conductances are often slower than excitatory ones for cortical neurons, and this difference can have a profound effect on network dynamics that cannot be captured with neurons that have only fast synapses. We thus extend the model to include slow inhibitory synapses. In this model, neurons are grouped into large populations of similar neurons. For each population, we calculate the evolution of a probability density function (PDF), which describes the distribution of neurons over state-space. The population firing rate is given by the flux of probability across the threshold voltage for firing an action potential. In the case of fast synaptic conductances, the PDF was one-dimensional, as the state of a neuron was completely determined by its transmembrane voltage. An exact extension to slow inhibitory synapses increases the dimension of the PDF to two or three, as the state of a neuron now includes the state of its inhibitory synaptic conductance. However, by assuming that the expected value of a neuron's inhibitory conductance is independent of its voltage, we derive a reduction to a one-dimensional PDF and avoid increasing the computational complexity of the problem. We demonstrate that although this assumption is not strictly valid, the results of the reduced model are surprisingly accurate. 1 Introduction
In a previous article (Nykamp & Tranchina, 2000), we explored a computationally efficient population density method. This method was introduced
* Present address: Dept. of Mathematics, UCLA, Los Angeles, CA 90095, U.S.A.
Neural Computation 13, 511–546 (2001)
as a tool for facilitating large-scale neural network simulations by Knight and colleagues (Knight, Manin, & Sirovich, 1996; Omurtag, Knight, & Sirovich, 2000; Sirovich, Knight, & Omurtag, in press; Knight, 2000). Earlier works provided a foundation for our previous and current analysis, including those of Knight (1972a, 1972b), Kuramoto (1991), Strogatz and Mirollo (1991), Abbott and van Vreeswijk (1993), Treves (1993), and Chawanya, Aoyagi, Nishikawa, Okuda, and Kuramoto (1993). The population density approach considered in this article is accurate when the network comprises large, sparsely connected subpopulations of identical neurons, each receiving a large number of synaptic inputs. We showed (Nykamp & Tranchina, 2000) that the model produces good results, at least in some test networks, even when all these conditions that guarantee accuracy are not met. We demonstrated, moreover, that the population density method can be 200 times faster than conventional direct simulations in which the activity of thousands of individual neurons and hundreds of thousands of synapses is followed in detail. In the population density approach, integrate-and-fire point neurons are grouped into large populations of similar neurons. For each population, one calculates the evolution of a probability density function (PDF), which represents the distribution of neurons over all possible states. The firing rate of each population is given by the flux of probability across a distinguished voltage, which is the threshold voltage for firing an action potential. The populations are coupled by stochastic synapses in which the times of postsynaptic conductance events for each neuron are governed by a modulated Poisson process whose rate is determined by the firing rates of its presynaptic populations.
In our original model (Nykamp & Tranchina, 2000), we assumed that the time course of the postsynaptic conductance waveform of both excitatory and inhibitory synapses was fast on the timescale of the typical 10–20 ms membrane time constant (Tamás, Buhl, & Somogyi, 1997). In this approximation, the conductance events were instantaneous (delta functions), so that the state of each neuron was completely described by its transmembrane voltage. This assumption resulted in a one-dimensional PDF for each population. The approximation of instantaneous synaptic conductances seems justified for those excitatory synaptic conductances mediated by AMPA (alpha-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid) receptors that decay with a time constant of 2–4 ms (Stern, Edwards, & Sakmann, 1992). We were motivated to embellish and extend the population density methods because the instantaneous synaptic conductance approximation is not justified for some inhibitory synapses. Most inhibition in the cortex is mediated by the neurotransmitter γ-aminobutyric acid (GABA) via GABA_A or GABA_B receptors. Some GABA_A conductances decay with a time constant that is on the same scale as the membrane time constant (Tamás et al., 1997; Galarreta & Hestrin, 1997), and GABA_B synapses are an order of magnitude
slower. Furthermore, slow inhibitory synapses may be necessary to achieve proper network dynamics. With slow inhibition, the state of each neuron is described not only by its transmembrane voltage, but also by one or more variables describing the state of its inhibitory synaptic conductance. Thus, a full population density implementation of neurons with slow inhibitory synapses would require two or more dimensions for each PDF. Doubling the dimension will square the size of the discretized state-space and greatly decrease the speed of the population density computations. Instead of developing numerical techniques to solve this more difficult problem efficiently, we chose to develop a reduction of the expanded population density model back to one dimension that gives fast and accurate results. In section 2 we outline the integrate-and-fire point neuron model with fast excitation and slow inhibition that underlies our population density formulation. In section 3 we briefly derive the full population density equations that include slow inhibition. We reduce the population densities to one dimension and derive the resulting equations in section 4. We test the validity of the reduction in section 5 and demonstrate the dramatic difference in network behavior that can be introduced by the addition of slow inhibitory synapses. We discuss the results in section 6. 2 The Integrate-and-Fire Neuron
As in our original model (Nykamp & Tranchina, 2000), the implementation of the population density approach is based on an integrate-and-fire, single-compartment neuron (point neuron). We describe here the changes made to introduce slow inhibition. The evolution of a neuron's transmembrane voltage V in time t is specified by the stochastic ordinary differential equation,¹

c dV/dt + g_r(V − E_r) + G_e(t)(V − E_e) + G_i(t)(V − E_i) = 0,   (2.1)

where c is the capacitance of the membrane; g_r, G_e(t), and G_i(t) are the resting, excitatory, and inhibitory conductances, respectively; and E_{r/e/i} are the corresponding equilibrium potentials. When V(t) reaches a fixed threshold voltage v_th, the neuron is said to fire a spike, and the voltage is reset to the voltage v_reset. The synaptic conductances G_e(t) and G_i(t) vary in response to excitatory and inhibitory input, respectively. We denote by A_{e/i}^k the integral of the change in G_{e/i}(t) due to a unitary excitatory/inhibitory input at time T_{e/i}^k.
¹ In stochastic ordinary differential equations and auxiliary equations, uppercase roman letters indicate random quantities.
Thus, if Ĝ_{e/i}^k(t) is the response of the excitatory/inhibitory synaptic conductance to a single input at time T_{e/i}^k, then A_{e/i}^k = ∫ Ĝ_{e/i}^k(t) dt. The voltage V(t) is a dynamic random variable because we view the A_{e/i}^k and T_{e/i}^k as random quantities. We denote the average value of A_{e/i} by m_{A_{e/i}}.
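To make the model concrete, here is a forward-Euler Monte Carlo sketch of a single neuron obeying equation 2.1, with the delta-function excitation of section 2.1 and the first-order slow inhibition of equation 2.4 below. Every parameter value is an illustrative placeholder of our own choosing, not a value taken from the article:

```python
import numpy as np

def simulate_if(t_end=0.5, dt=1e-5, seed=0,
                c=1e-9, g_r=50e-9, E_r=-70e-3, E_e=0.0, E_i=-80e-3,
                v_th=-55e-3, v_reset=-65e-3, t_ref=2e-3,
                rate_e=400.0, rate_i=200.0,
                A_e=7e-12, A_i=20e-12, tau_i=10e-3):
    """Monte Carlo integrate-and-fire neuron with fast excitation and
    first-order slow inhibition (illustrative parameter values only)."""
    rng = np.random.default_rng(seed)
    v, g_i, refr = E_r, 0.0, 0.0
    spikes = []
    for step in range(int(t_end / dt)):
        # Poisson counts of synaptic arrivals in this time bin
        n_e = rng.poisson(rate_e * dt)
        n_i = rng.poisson(rate_i * dt)
        # inhibitory conductance: jump A_i/tau_i per input, then decay (eq. 2.4)
        g_i += n_i * A_i / tau_i
        g_i -= dt * g_i / tau_i
        if refr > 0:                 # voltage frozen while refractory,
            refr -= dt               # but g_i keeps evolving (handled above)
            continue
        # instantaneous excitatory jumps: Delta V = (1 - e^{-A_e/c})(E_e - V)
        for _ in range(n_e):
            v += (1.0 - np.exp(-A_e / c)) * (E_e - v)
        # leak and inhibitory currents of equation 2.1
        v -= dt / c * (g_r * (v - E_r) + g_i * (v - E_i))
        if v >= v_th:
            spikes.append(step * dt)
            v, refr = v_reset, t_ref
    return np.array(spikes)
```

Direct simulations of this kind, run over thousands of neurons, are what the population density method replaces.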
2.1 Excitatory Synaptic Input. Since the change in G_e(t) due to excitatory input is assumed to be very fast, it is approximated by a sum of delta functions, G_e(t) = Σ_k A_e^k δ(t − T_e^k). Upon receiving an excitatory input, the voltage jumps by the amount

ΔV = Γ_e^{∗k} (E_e − V(T_e^{k−})),   (2.2)

where Γ_e^{∗k} = 1 − exp(−A_e^k/c).² Since the size of each conductance event A_e^k is random, the Γ_e^{∗k} are random numbers. We define the complementary cumulative distribution function for Γ_e^∗,

F̃_{Γ_e^∗}(x) = Pr(Γ_e^∗ > x),   (2.3)
which is some given function of x that depends on the chosen distribution of the A_e^k. 2.2 Inhibitory Synaptic Input. The inhibitory synaptic input is modeled by a slowly changing inhibitory conductance G_i(t). Upon receiving an inhibitory synaptic input, the inhibitory conductance increases and then decays exponentially. The rise of the inhibitory conductance can be modeled as either an instantaneous jump (which we call first-order inhibition) or a smooth increase (second-order inhibition). For first-order inhibition, the evolution of G_i(t) is governed by

τ_i dG_i/dt = −G_i(t) + Σ_k A_i^k δ(t − T_i^k),   (2.4)

where τ_i is the time constant for the decay of the inhibitory conductance. The response of G_i(t) to a single inhibitory synaptic input at T = 0 is G_i(t) = (A_i/τ_i) exp(−t/τ_i) for t > 0. For second-order inhibition, we introduce an auxiliary variable S(t) and model the evolution of G_i(t) and S(t) with the equations τ_i dG_i/dt = −G_i(t) + S(t) and τ_s dS/dt = −S(t) + Σ_k A_i^k δ(t − T_i^k), where τ_{s/i} is the time constant for the rise and decay of the inhibitory conductance (τ_s ≤ τ_i). The response of G_i(t) to a single inhibitory synaptic input is either a difference of exponentials or, if τ_i = τ_s, an alpha function.
² We use the notation Γ_e^∗ for consistency with notation used in an earlier article.
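The stated impulse responses follow from solving the two linear equations in cascade and can be written down and checked numerically (a sketch; the parameter values are arbitrary):

```python
import numpy as np

def gi_first_order(t, A_i=1.0, tau_i=10e-3):
    """Response of first-order G_i (eq. 2.4) to one input at t = 0:
    an instantaneous jump of A_i/tau_i followed by exponential decay."""
    return (A_i / tau_i) * np.exp(-t / tau_i) * (t > 0)

def gi_second_order(t, A_i=1.0, tau_i=10e-3, tau_s=2e-3):
    """Second-order impulse response: a difference of exponentials, or an
    alpha function in the limit tau_s -> tau_i. Both integrate to A_i."""
    if np.isclose(tau_s, tau_i):
        return (A_i * t / tau_i ** 2) * np.exp(-t / tau_i) * (t > 0)
    return (A_i / (tau_i - tau_s)) * (np.exp(-t / tau_i)
                                      - np.exp(-t / tau_s)) * (t > 0)
```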
Since the A_i^k are random numbers, we define the complementary cumulative distribution function for A_i,

F̃_{A_i}(x) = Pr(A_i > x),   (2.5)

which is some given function of x. 3 The Full Population Density Model
With slow inhibitory synapses, the state of a neuron is described not only by its voltage, V(t), but also by one (first-order inhibition) or two (second-order inhibition) variables for the state of the inhibitory conductance (G_i(t) and possibly S(t)). Thus, the state-space is now at least two-dimensional. Figure 1a illustrates the time courses of V(t) and G_i(t) (with first-order inhibition) driven by the arrival of random excitatory and inhibitory synaptic inputs. When an excitatory event occurs, V jumps abruptly; when an inhibitory event occurs, G_i jumps abruptly. Thus, a neuron executes a "random walk" in the (V, G_i) plane, as illustrated in Figure 1b. For simplicity, we focus on the two-dimensional case of first-order inhibition. The equations for second-order inhibition are similar. We first derive equations for a single population of uncoupled neurons with specified input rates. We denote the excitatory and inhibitory input rates at time t by ν_e(t) and ν_i(t), respectively. For the noninteracting neurons, ν_e(t) and ν_i(t) are considered to be given functions of time. In section 4.4, we outline the equations for networks of interacting neurons.
3.1 The Conservation Equation. The two-dimensional state-space for first-order inhibition leads to the following two-dimensional PDF for a neuron:

ρ(v, g, t) dv dg = Pr{V(t) ∈ (v, v + dv) and G_i(t) ∈ (g, g + dg)},   (3.1)

for v ∈ (E_i, v_th) and g ∈ (0, ∞). For a population of neurons, the PDF can be interpreted as a population density, that is,

ρ(v, g, t) dv dg = Fraction of neurons with V(t) ∈ (v, v + dv) and G_i(t) ∈ (g, g + dg).   (3.2)
Each neuron has a refractory period of length τ_ref immediately after it fires a spike. During the refractory period, its voltage does not move, though its inhibitory conductance evolves as usual. When a neuron is refractory, it is not accounted for in ρ(v, g, t). For this reason, ρ(v, g, t) does not usually integrate to 1. Instead, ∫∫ ρ(v, g, t) dv dg is the fraction of neurons that are not refractory. In appendix B, we describe the probability density function f_ref that describes neurons in the refractory state. However, since f_ref plays
Figure 1: Illustration of the two-dimensional state-space with first-order inhibition. (a) A trace of the evolution of voltage and inhibitory conductance in response to constant input rates (ν_e = 400 imp/sec, ν_i = 200 imp/sec). For illustration, spikes are shown by dotted lines, and the gray level changes after each spike. (b) The same trace shown as a trajectory in the two-dimensional state-space of voltage and inhibitory conductance.
no role in the reduced model, we mention only the properties of f_ref as we need them for the derivation. The evolution of the population density ρ(v, g, t) is described by a conservation equation, which takes into account the voltage reset after a neuron fires. Since a neuron reenters ρ(v, g, t) after it becomes nonrefractory, we view the voltage reset as occurring at the end of the refractory period. Thus,
the evolution equation is:

∂ρ/∂t (v, g, t) = −∇ · J(v, g, t) + δ(v − v_reset) J_U(τ_ref, g, t)
               = −[∂J_V/∂v (v, g, t) + ∂J_G/∂g (v, g, t)] + δ(v − v_reset) J_U(τ_ref, g, t).   (3.3)
Here J(v, g, t) = (J_V(v, g, t), J_G(v, g, t)) is the flux of probability at (V(t), G_i(t)) = (v, g). The flux J is a vector quantity in which the J_V component gives the flux across voltage, and the J_G component gives the flux across conductance. Each component is described below. The last term in equation 3.3 is the source of probability at v_reset due to the reset of each neuron's voltage after it becomes nonrefractory. The calculation of J_U(τ_ref, g, t), which is the flux of neurons becoming nonrefractory, is described in appendix B. Since a neuron fires a spike when its voltage crosses v_th, the population firing rate is the total flux across threshold:

r(t) = ∫₀^∞ J_V(v_th, g, t) dg.   (3.4)

The population firing rate is not a temporal average but an average over all neurons in the population. When a neuron fires a spike, it becomes refractory for the duration of the refractory period of length τ_ref. Thus, the total flux of neurons becoming nonrefractory at time t is equal to the firing rate at time t − τ_ref:

∫₀^∞ J_U(τ_ref, g, t) dg = r(t − τ_ref).   (3.5)
Condition 3.5 is the only property of J_U that we will need for our reduced model. 3.2 The Components of the Flux. The flux across voltage, J_V, in equation 3.3 is similar to the flux across voltage in the fast synapse model (Nykamp & Tranchina, 2000). The total flux across voltage is

J_V(v, g, t) = −(g_r/c)(v − E_r) ρ(v, g, t) + ν_e(t) ∫_{E_i}^{v} F̃_{Γ_e^∗}((v − v′)/(E_e − v′)) ρ(v′, g, t) dv′ − (g/c)(v − E_i) ρ(v, g, t),   (3.6)

where ν_e(t) is the rate of excitatory synaptic input at time t. The three terms of the flux across voltage correspond to the terms on the right-hand side of equation 2.1. The first two terms are the components of the flux due to leakage toward E_r and excitation toward E_e, respectively,
which are identical to the leakage and excitation fluxes in the fast synapse model. Because the slow inhibitory conductance causes the voltage to evolve smoothly toward E_i with velocity −(V(t) − E_i)G_i(t)/c, the inhibition flux is the third term in equation 3.6. The flux across conductance, J_G, consists of two components, corresponding to the two terms on the right-hand side of equation 2.4:

J_G(v, g, t) = −(g/τ_i) ρ(v, g, t) + ν_i(t) ∫₀^g F̃_{A_i}(τ_i(g − g′)) ρ(v, g′, t) dg′,   (3.7)

where ν_i(t) is the rate of inhibitory synaptic input at time t. The first term is the component of flux due to the decay of the inhibitory conductance toward 0 at velocity −G_i(t)/τ_i. The second term is the flux from the jumps in the inhibitory conductance when the neuron receives inhibitory input. The derivation of the second term is analogous to the derivation of the flux across voltage due to excitation. 4 The Reduced Population Density Model
For the sake of computational efficiency, we developed a reduction of the model with slow synapses back to one dimension. This reduced model can be solved as quickly as the model with fast synapses. To derive the reduced model, we first show that the jumps in voltage due to excitation, and thus the jumps across the firing threshold v_th, are independent of the inhibitory conductance. We therefore need only to calculate the distribution of neurons across voltage (and not inhibitory conductance) in order to compute the population firing rate r(t) accurately. We then show that the evolution of the distribution across voltage depends on the inhibitory conductance only via its first moment. The evolution of this moment can be computed by a simple ordinary differential equation if we assume it is independent of voltage. 4.1 EPSP Amplitude Independent of Inhibitory Conductance. The fact that the size of the excitatory voltage jump is independent of the inhibitory conductance is implicit in equation 2.2. Insight may be provided by the following argument. To calculate the response in voltage to a delta function excitatory input at time T, we integrate equation 2.1 from the time just before the input, T⁻, to the time just after the input, T⁺. The terms involving g_r and G_i(t) drop out because they are finite and thus cannot contribute to the integral over the infinitesimal interval (T⁻, T⁺). The term with G_e(t) remains because across this interval G_e(t) is a delta function. Physically, the independence from G_i(t) corresponds to excitatory current so fast that it can be balanced only by capacitative current. Although the peak of the excitatory postsynaptic potential (EPSP) does not depend on the inhibitory conductance, the time course of the EPSP does. A large inhibitory
conductance would, for example, create a shunt, which would quickly bring the voltage back down after an excitatory input. 4.2 Evolution of Distribution in Voltage. Since excitatory jump sizes are independent of inhibitory conductance, we can calculate the population firing rate at time t if we know only the distribution of neurons across voltage. We denote the marginal PDF in voltage by f_V(v, t), which is defined by

f_V(v, t) dv = Pr{V(t) ∈ (v, v + dv)}   (4.1)

and is related to ρ(v, g, t) by³

f_V(v, t) = ∫₀^∞ ρ(v, g, t) dg.   (4.2)
We derive an evolution equation for f_V by integrating equation 3.3 with respect to g, obtaining:

∂f_V/∂t (v, t) = −∫₀^∞ ∂J_V/∂v (v, g, t) dg − ∫₀^∞ ∂J_G/∂g (v, g, t) dg + δ(v − v_reset) ∫₀^∞ J_U(τ_ref, g, t) dg.   (4.3)
Because G_i(t) cannot cross 0 or infinity, the flux at those points is zero, and consequently the second term on the right-hand side is zero:

∫₀^∞ ∂J_G/∂g (v, g, t) dg = lim_{g→∞} J_G(v, g, t) − J_G(v, 0, t) = 0.   (4.4)
Using condition 3.5, we have the following conservation equation for f_V:

∂f_V/∂t (v, t) = −∂J̄_V/∂v (v, t) + δ(v − v_reset) r(t − τ_ref),   (4.5)

where J̄_V(v, t) is the total flux across voltage:

J̄_V(v, t) = ∫₀^∞ J_V(v, g, t) dg
          = −(g_r/c)(v − E_r) f_V(v, t) + ν_e(t) ∫_{E_i}^{v} F̃_{Γ_e^∗}((v − v′)/(E_e − v′)) f_V(v′, t) dv′ − (1/c)(v − E_i) ∫₀^∞ g ρ(v, g, t) dg.   (4.6)

³ Note that this enables us to rewrite expression 3.4 for the population firing rate as r(t) = ν_e(t) ∫_{E_i}^{v_th} F̃_{Γ_e^∗}((v_th − v′)/(E_e − v′)) f_V(v′, t) dv′ (since ρ(v_th, g, t) = ρ(E_i, g, t) = 0), confirming that the firing rate depends on only the distribution in voltage.
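As a numerical sanity check, the firing-rate expression in the footnote discretizes directly (a sketch; the grid resolution and the form of F̃ are our own arbitrary choices):

```python
import numpy as np

def firing_rate(f_V, v_grid, nu_e, F_tilde, E_e=0.0):
    """Population firing rate from the marginal voltage density:
    r(t) = nu_e * integral over [E_i, v_th] of
           F_tilde((v_th - v') / (E_e - v')) * f_V(v') dv'.

    f_V     : density values sampled on the uniform grid v_grid,
              which spans [E_i, v_th].
    F_tilde : complementary CDF of the excitatory jump fraction.
    """
    v_th = v_grid[-1]
    dv = v_grid[1] - v_grid[0]
    w = F_tilde((v_th - v_grid) / (E_e - v_grid))
    return nu_e * np.sum(w * f_V) * dv
```

With F̃ ≡ 1 (every excitatory event carries the neuron over threshold), the rate reduces to ν_e times the nonrefractory fraction, a useful consistency check.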
Equation 4.5 would be a closed equation for f_V(v, t) except for its dependence on ∫₀^∞ g ρ(v, g, t) dg. Thus, to calculate the population firing rates, we do not need the distribution of neurons across inhibitory conductance; we need only the average value of that conductance for neurons at each voltage. This can be seen more clearly by rewriting the last term of equation 4.6 as follows. Let f_{G|V}(g, v, t) be the PDF for G_i given V:

f_{G|V}(g, v, t) dg = Pr{G_i(t) ∈ (g, g + dg) | V(t) = v}.   (4.7)

Then, since ρ(v, g, t) = f_{G|V}(g, v, t) f_V(v, t), we have that

∫₀^∞ g ρ(v, g, t) dg = [∫₀^∞ g f_{G|V}(g, v, t) dg] f_V(v, t) = m_{G|V}(v, t) f_V(v, t),   (4.8)

where m_{G|V}(v, t) is the expected value of G_i given V:

m_{G|V}(v, t) = ∫₀^∞ g f_{G|V}(g, v, t) dg.   (4.9)
Substituting equation 4.8 into 4.6, we have the following expression for the total flux across voltage,

J̄_V(v, t) = −(g_r/c)(v − E_r) f_V(v, t) + ν_e(t) ∫_{E_i}^{v} F̃_{Γ_e^∗}((v − v′)/(E_e − v′)) f_V(v′, t) dv′ − (m_{G|V}(v, t)/c)(v − E_i) f_V(v, t),   (4.10)

which, combined with equation 4.5, gives the equation for the evolution of the distribution in voltage:

∂f_V/∂t (v, t) = ∂/∂v [ (g_r/c)(v − E_r) f_V(v, t) − ν_e(t) ∫_{E_i}^{v} F̃_{Γ_e^∗}((v − v′)/(E_e − v′)) f_V(v′, t) dv′ + (m_{G|V}(v, t)/c)(v − E_i) f_V(v, t) ] + δ(v − v_reset) r(t − τ_ref).   (4.11)
4.3 The Independent Mean Assumption. Equation 4.11 depends on the mean conductance m_{G|V}(v, t). In appendix A, we derive an equation for the evolution of the mean conductance m_{G|V}(v, t) (equation A.1, coupled with equations A.5 and A.6). The equation cannot be solved directly because it depends on m_{G²|V}, the second moment of G_i. The equation for the evolution of the second moment would depend on the third, and so on for higher moments, forming a moment expansion in G_i.
To close the moment expansion, one has to make an assumption that will make the expansion terminate. We assume that m G|V (v, t) is independent of voltage, that is, that the mean conditioned on the voltage is equal to the unconditional mean, m G|V (v, t) D m G (t ),
(4.12)
where μ_G(t) is the mean value of G_i(t) averaged over all neurons in the population. Since the equation for the evolution of G_i, 2.4, is independent of voltage, this assumption closes the moment expansion at the first moment and enables us to derive, as shown in appendix A, a simple ordinary differential equation for the evolution of the mean inhibitory conductance:

    dμ_G/dt (t) = (ν_i(t) μ_{A_i} − μ_G(t)) / τ_i.   (4.13)
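Equation 4.13 describes linear relaxation of μ_G toward ν_i(t)μ_{A_i} with time constant τ_i, so it is cheap to integrate with a forward-Euler step. Below is a minimal sketch; the time step, parameter values, and the sinusoidally modulated rate are illustrative assumptions, not values taken from the article.

```python
import numpy as np

# Forward-Euler integration of equation 4.13,
#   d mu_G / dt = (nu_i(t) * mu_Ai - mu_G(t)) / tau_i,
# with an illustrative sinusoidally modulated inhibitory input rate.
tau_i = 0.010          # inhibitory synaptic time constant (10 ms)
mu_Ai = 0.001          # mean unitary inhibitory conductance integral (arbitrary units)
dt = 1e-4              # time step (0.1 ms)
T = 0.5                # total simulated time (s)

t = np.arange(0.0, T, dt)
nu_i = 500.0 * (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t))   # input rate (events/s)

mu_G = np.zeros_like(t)
for n in range(len(t) - 1):
    mu_G[n + 1] = mu_G[n] + dt * (nu_i[n] * mu_Ai - mu_G[n]) / tau_i
```

For a constant rate ν_i, the solution relaxes to the steady state μ_G = ν_i μ_{A_i}, which is a convenient sanity check on the implementation.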
Therefore, if we make our independence assumption, equation 4.12, we can combine equation 4.13 with 4.11 to calculate the evolution of a single population of uncoupled neurons in response to excitatory and inhibitory input at the given rates ν_e(t) and ν_i(t).

4.4 Coupled Populations. All the population density derivations above were for a single population of uncoupled neurons. However, we wish to apply the population density method to simulate large networks of coupled neurons. The implementation of a network of population densities requires two additional steps. (A detailed description and analysis of the assumptions behind these steps is given in Nykamp & Tranchina, 2000.)

The first step is to group neurons into populations of similar neurons and form a population density for each group, denoted f_V^k(v, t), where k = 1, 2, …, N enumerates the N populations. Each population has a corresponding population firing rate, denoted r^k(t). Denote the set of excitatory indices by L_E and the set of inhibitory indices by L_I; the set of excitatory/inhibitory populations is {f_V^k(v, t) | k ∈ L_{E/I}}. Denote any imposed external excitatory/inhibitory input rates to population k by ν^k_{e/i,o}(t).

The second step is to determine the connectivity between the populations and form the connectivity matrix W_{jk}, j, k = 1, 2, …, N. W_{jk} is the average number of presynaptic neurons from population j that project to each postsynaptic neuron in population k. The synaptic input rates to each population are then determined by the firing rates of its presynaptic populations as well as external firing rates by
    ν^k_{e/i}(t) = ν^k_{e/i,o}(t) + Σ_{j ∈ L_{E/I}} W_{jk} ∫_0^∞ α_{jk}(t′) r^j(t − t′) dt′,   (4.14)
522
Duane Q. Nykamp and Daniel Tranchina
where α_{jk}(t′) is the distribution of latencies of synapses from population j to population k. The choice of α_{jk}(t′) used in our simulations is given in appendix D. Using the input rates determined by equation 4.14, one can calculate the evolution of each population using the equations for the single population.

4.5 Summary of Equations. Combining equations 4.5, 4.10, 4.12, and 4.13, we have a partial differential integral equation coupled with an ordinary differential equation for the evolution of a single population density with synaptic input rates ν_e(t) and ν_i(t). In a network with populations f_V^k(v, t), these input rates are given by equation 4.14. The firing rates of each population are given by equation 3.4. The following summarizes the equations of the population density approach with populations k = 1, 2, …, N:

    ∂f_V^k/∂t (v, t) = −∂J̄_V^k/∂v (v, t) + δ(v − v_reset) r^k(t − t_ref)   (4.15)

    J̄_V^k(v, t) = −(g_r/c)(v − E_r) f_V^k(v, t)
                  + ν_e^k(t) ∫_{E_i}^{v} F̃_{Ãe}((v − v′)/(E_e − v′)) f_V^k(v′, t) dv′
                  − (μ_G^k(t)/c)(v − E_i) f_V^k(v, t)   (4.16)

    r^k(t) = J̄_V^k(v_th, t)   (4.17)

    dμ_G^k/dt (t) = (ν_i^k(t) μ_{A_i} − μ_G^k(t)) / τ_i   (4.18)

    ν^k_{e/i}(t) = ν^k_{e/i,o}(t) + Σ_{j ∈ L_{E/I}} W_{jk} ∫_0^∞ α_{jk}(t′) r^j(t − t′) dt′.   (4.19)
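To make the summary concrete, the following sketch advances equations 4.15 through 4.17 by one explicit time step for a single population on a voltage grid. It is a rough illustration under stated assumptions: the complementary distribution F̃_{Ãe} of scaled unitary excitatory event sizes is taken to be exponential, the parameter values are invented for illustration, and the reinjection of refractory-delayed flux at v_reset is handled crudely; the article's actual numerical method and distribution choices are given in its appendices.

```python
import numpy as np

# One explicit step of the single-population summary equations: evaluate the
# flux (4.16) on a voltage grid, update f_V by (4.15), and read off the firing
# rate (4.17) as the flux at threshold.  All parameter values are illustrative.
E_r, E_e, E_i = -65.0, 0.0, -70.0      # reversal potentials (mV)
v_th, v_reset = -55.0, -65.0           # threshold and reset voltages (mV)
g_r_over_c = 1.0 / 0.020               # leak rate g_r / c (1 / 20 ms, in 1/s)
mu_Ae_over_c = 0.01                    # mean scaled excitatory event size (assumed)

v = np.linspace(E_i, v_th, 201)        # voltage grid from E_i to v_th
dv = v[1] - v[0]

def F_tilde(x):
    # Complementary CDF of scaled excitatory event sizes (assumed exponential).
    return np.exp(-np.maximum(x, 0.0) / mu_Ae_over_c)

def flux(f_V, nu_e, mu_G_over_c):
    """Total probability flux J_V(v) across each grid voltage, equation 4.16."""
    leak = -g_r_over_c * (v - E_r) * f_V
    inhib = -mu_G_over_c * (v - E_i) * f_V
    excit = np.empty_like(f_V)
    for m in range(len(v)):            # integrate over source voltages v' <= v
        vp = v[: m + 1]
        excit[m] = nu_e * np.sum(F_tilde((v[m] - vp) / (E_e - vp)) * f_V[: m + 1]) * dv
    return leak + inhib + excit

def step(f_V, nu_e, mu_G_over_c, dt, r_delayed):
    """Advance f_V by dt using (4.15); return updated f_V and firing rate (4.17)."""
    J = flux(f_V, nu_e, mu_G_over_c)
    r = max(J[-1], 0.0)                # firing rate = flux at threshold
    f_new = f_V - dt * np.gradient(J, dv)
    k = int(np.argmin(np.abs(v - v_reset)))
    f_new[k] += dt * r_delayed / dv    # reinject refractory-delayed flux at v_reset
    f_new = np.maximum(f_new, 0.0)
    f_new[0] = f_new[-1] = 0.0         # boundary conditions f(E_i) = f(v_th) = 0
    return f_new, r
```

The firing rate is read off as the flux at threshold, equation 4.17, and the reset term of equation 4.15 reinjects the flux that crossed threshold one refractory period earlier.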
The boundary conditions for the partial differential integral equations are that f_V^k(E_i, t) = f_V^k(v_th, t) = 0. In general, the parameters E_{e/i/r}, v_{th/reset}, g_r, c, and μ_{A_i}, as well as the function F̃_{Ãe}, could depend on k.

5 Tests of Validity
The reduced population density model would give the same results as the full model if the expected value of the inhibitory conductance were independent of the voltage. We expect that the reduced model should give good results if the independence assumption, equation 4.12, is approximately satisfied.
We performed a large battery of tests by which to gauge the accuracy of the reduced model. For each test, we compared the firing rates of the reduced population density model with the firing rates from direct simulations of groups of integrate-and-fire neurons, computed as described in appendix C. We modeled the inhibitory input accurately for the direct simulations, using the equations from section 2. Thus, the direct simulation firing rates serve as a benchmark against which we can compare the reduced model.

5.1 Testing Procedure. We focus first on simulations of a single population of uncoupled neurons. For a single population of uncoupled neurons that receive specified input that is a modulated Poisson process, the population density approach without the independent mean assumption, equation 4.12, would give exactly the same results as a large population of individual neurons. Thus, tests of single population results will demonstrate what errors the independent mean assumption alone introduces.

We also performed extensive tests using a simple network of one excitatory and one inhibitory population. In the network simulations, the comparison with individual neurons also tests other assumptions of the population density, such as the assumption that the combined synaptic input to each neuron can be approximated by a modulated Poisson process. We demonstrated earlier that these assumptions are typically satisfied for sparse networks (Nykamp & Tranchina, 2000). The network tests with the reduced population density model will demonstrate the accumulated errors from all the approximations.
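For concreteness, a direct simulation of the kind used as a benchmark can be sketched as follows. This is a simplified reconstruction, not the article's code: unitary event sizes are fixed at their means rather than drawn from the distributions of appendix D, parameter values are illustrative, excitatory events are treated as instantaneous jumps of size Ã_e(E_e − V), and inhibitory events increment a first-order conductance as in section 2.

```python
import numpy as np

# Minimal direct simulation of N uncoupled conductance integrate-and-fire
# neurons.  Simplifying assumptions for this sketch: homogeneous Poisson
# input, unitary event sizes fixed at their means, illustrative parameters.
rng = np.random.default_rng(1)

N = 1000
E_r, E_e, E_i = -65.0, 0.0, -70.0    # reversal potentials (mV)
v_th, v_reset = -55.0, -65.0         # threshold and reset (mV)
tau_m = 0.020                        # membrane time constant c / g_r (s)
tau_i = 0.010                        # inhibitory conductance time constant (s)
t_ref = 0.003                        # refractory period (s)
a_e = 0.01                           # scaled excitatory jump size A_e / c
a_i = 0.05                           # unitary inhibitory integral A_i (scaled by c)
nu_e, nu_i = 1500.0, 400.0           # input rates (events/s)
dt, T = 1e-4, 1.0

V = np.full(N, E_r)
G = np.zeros(N)                      # inhibitory conductance (scaled by c, in 1/s)
ref_until = np.zeros(N)
spike_count = 0

for step_i in range(int(T / dt)):
    t = step_i * dt
    n_e = rng.poisson(nu_e * dt, N)  # excitatory events this step, per neuron
    n_i = rng.poisson(nu_i * dt, N)
    active = t >= ref_until          # neurons not in their refractory period
    # Excitatory events: each event jumps V toward E_e by fraction a_e.
    V[active] += n_e[active] * a_e * (E_e - V[active])
    # Inhibitory events increment the conductance, which decays with tau_i.
    G += n_i * a_i / tau_i
    G -= dt * G / tau_i
    # Subthreshold membrane dynamics: leak plus inhibitory conductance.
    V[active] += dt * (-(V[active] - E_r) / tau_m - G[active] * (V[active] - E_i))
    fired = active & (V >= v_th)
    spike_count += int(fired.sum())
    V[fired] = v_reset
    ref_until[fired] = t + t_ref

rate = spike_count / (N * T)         # population-averaged firing rate (spikes/s)
```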
For both single-population and two-population tests, we performed many runs with random sets of parameters. We chose parameters in the physiological regime, but the parameters were chosen independently of one another. Thus, rather than focusing on parameter sets that may be relevant to a particular physiological system, we decided to stress the model with many combinations of parameters to probe the conditions under which the reduced model is valid. However, we analyzed the results of only those tests where the average firing rates were between 5 spikes per second and 300 spikes per second. Average spike rates greater than 300 spikes per second are unphysiological, and spike rates below 5 spikes per second are too low to permit a meaningful comparison by our deviation measure.

We randomly varied the following parameters: the time constant for inhibitory synapses τ_i, the average unitary excitatory input magnitude μ_{A_e}, the average peak unitary inhibitory synaptic conductance μ_{A_i}/τ_i, and the connectivity of the network W_{jk}. We selected τ_i randomly from the range τ_i ∈ (2, 25) ms. Scaling the excitatory input by the membrane capacitance to make it dimensionless, we used the range μ_{A_e}/c ∈ (0.001, 0.030). Similarly, scaling the peak unitary inhibitory conductance by the resting conductance, we used the range μ_{A_i}/(τ_i g_r) ∈ (0.001, 0.200). The single-population simulations represented uncoupled neurons, so we set the connectivity to zero (W_11 = 0). For the two-population simulations, we set all connections to the
same strength (W_{jk} = W̄, j, k = 1, 2) and selected that strength randomly from the range W̄ ∈ (5, 50). The other parameters were fixed for all simulations. These fixed parameters, as well as the forms of the probability distributions for A_{e/i}, are described in appendix D.

For the purpose of illustration, we also ran some single-population simulations with a larger peak unitary inhibitory conductance. We allowed μ_{A_i}/(τ_i g_r) to range as large as 1, which is not physiological since a single inhibitory event then doubles the membrane conductance from its resting value.

To provide inputs rich in temporal structure, we constructed each of the external input rates ν¹_{e,o}(t) and ν¹_{i,o}(t) from sums of sinusoids with frequencies of 0, 1, 2, 4, 8, and 16 cycles per second. The phases and relative amplitudes of each component of the sum were chosen randomly. The steady components, which were the mean rates ν̄¹_{e,o} and ν̄¹_{i,o}, were independently chosen from the range ν̄¹_{e/i,o} ∈ (0, 2000) spikes per second. The amplitudes of the other five components were scaled by the largest common factor that would keep ν¹_{e/i,o}(t) in the range (0, 2ν̄¹_{e/i,o}) for all time. For the two-population simulations, the second population received no external input, and we set ν²_{e,o}(t) = ν²_{i,o}(t) = 0.

We performed similar tests using second-order inhibition with larger inhibitory time constants (τ_i ∈ (50, 100) ms, τ_s ∈ (2, 5) ms) and a lower inhibitory reversal potential (E_i = −90 mV rather than −70 mV), in order to approximate GABA_B synapses. Since the results were similar to those with first-order inhibition, we do not focus on them here.

We also performed one test of a physiological network. We implemented a slow inhibitory synapse version of the model of one hypercolumn of visual cortex we used earlier (Nykamp & Tranchina, 2000). This model simulates the response of neurons in the primary visual cortex of a cat to an oriented flashed bar.
We kept every aspect of the model unchanged except the inhibitory synapses. The original model had instantaneous inhibitory synapses. In this implementation, we used first-order inhibition with τ_i = 8 ms and kept the integral of the unitary inhibitory conductance the same as that in the previous model.

5.2 Test Results. For each run, we calculated the firing rates in response to the input using both the population density method and a direct simulation of 1000 individual integrate-and-fire point neurons. We calculated the firing rates of the individual neurons by counting spikes in 5 ms bins, denoting the firing rate for the jth bin as r̃_j. To compare the two firing rates, we averaged the population density firing rate over each bin, denoting the average rate for the jth bin by r_j. We then calculated the following deviation measure,
    D = [Σ_j (r_j − r̃_j)²] / [Σ_j r_j²],   (5.1)
where the sum is over all bins. The deviation measure D is a quantitative measure of the difference between the firing rates, but it has shortcomings, as demonstrated below. For two populations, we report the maximum D of the populations.

5.2.1 Single Uncoupled Population. We performed 10,000 single-population runs and found that the reduced population density method was surprisingly accurate for all the runs. The average D was only 0.05 and the maximum D was 0.19. Figure 2a shows the results of a typical run. For this run with a D near average (D = 0.06), the firing rates of the population density and direct simulations are virtually indistinguishable.

Since the population density results closely matched the direct simulation results for all 10,000 runs, we simulated a set of 500 additional runs, which included unrealistically large unitary inhibitory conductance events (μ_{A_i}/(τ_i g_r) as large as 1). In some cases, the large inhibitory conductance events led to larger deviations. The results from the run with the highest deviation measure (D = 0.43) are shown in Figure 2b. In this example, the population density firing rates are consistently lower than the direct simulation firing rates.

The derivation of the reduced population density model assumed that the mean value of the inhibitory conductance μ_{G|V}(v, t) was independent of the voltage v (see equation 4.12). A necessary (though not sufficient) condition for equation 4.12 is the absence of correlation between the inhibitory conductance and the voltage. Thus, the correlation coefficient calculated from the direct simulation might be a good predictor of the reduced population density's accuracy. The correlation coefficient, which can range between −1 and 1, is plotted underneath each firing rate in Figure 2. For the average results of Figure 2a, the correlation coefficient is very small and usually negative.
For the larger deviation in Figure 2b, the inhibitory conductance is strongly negatively correlated with the voltage throughout the run. Thus, at least in this case, the correlation coefficient does explain the difference in accuracy between the two runs.

Not only do the population density simulations give accurate firing rates, they also capture the distribution of neurons across voltage (f_V) and the mean inhibitory conductance (μ_G), as shown by the snapshots in Figure 3. Figure 3a compares the distribution of neurons across voltage (f_V(v, t)) as calculated by the population density and direct simulations. Although the direct simulation histogram is still ragged with 1000 neurons, it is very close to the f_V(v, t) calculated by the population density model. Figure 3b demonstrates that the population density can calculate an accurate mean inhibitory conductance despite the wide distribution of neurons across inhibitory conductance. For illustration, Figure 3c shows the full distribution of neurons across voltage and inhibitory conductance. In order to produce a relatively smooth graph, this distribution was estimated by averaging the direct simulation
[Figure 2 appears here: input rates (imp/s), firing rates (imp/s), and correlation coefficient versus time (ms) for panels a and b; see caption below.]
Figure 2: Example of single uncoupled population results. Excitatory (black line) and inhibitory (gray line) input rates are plotted in the top panel. Population density firing rate (black line) and direct simulation firing rate (histogram) are plotted in the middle panel. The correlation coefficient between the voltage and the inhibitory conductance is shown in the bottom panel. (a) Results from a typical simulation. The firing rates from both simulations are nearly identical, and the correlation coefficient is small. Parameters: τ_i = 6.5 ms, μ_{A_e}/c = 0.015, μ_{A_i}/(τ_i g_r) = 0.029, ν̄_e = 619 imp/s, ν̄_i = 961 imp/s. (b) The results with the largest deviation when we allowed the unitary inhibitory conductance to be unrealistically large. The population density firing rate consistently undershoots that of the direct simulation, and the correlation coefficient is large and negative. Parameters: τ_i = 22.2 ms, μ_{A_e}/c = 0.030, μ_{A_i}/(τ_i g_r) = 0.870, ν̄_e = 531 imp/s, ν̄_i = 118 imp/s.
over 100 repetitions of the input. The surprising result of our reduced method is that we can collapse the full distribution in Figure 3c to the distribution in voltage of Figure 3a coupled with the mean inhibitory conductance from Figure 3b.
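The comparison pipeline (spike counts in 5 ms bins, the deviation measure of equation 5.1, and the per-bin correlation coefficient between voltage and inhibitory conductance across neurons) is simple to implement; the helper names below are ours, not the article's.

```python
import numpy as np

# Deviation measure D of equation 5.1 between bin-averaged population density
# rates r_j and direct-simulation rates from spike counts in 5 ms bins, plus
# the per-instant correlation coefficient between voltage and inhibitory
# conductance across neurons (the quantity plotted in Figure 2).

def binned_rate(spike_times, n_neurons, t_max, bin_width=0.005):
    """Firing rate per bin (spikes/s) from a flat array of spike times (s)."""
    bins = np.arange(0.0, t_max + bin_width, bin_width)
    counts, _ = np.histogram(spike_times, bins)
    return counts / (n_neurons * bin_width)

def deviation_measure(r, r_tilde):
    """Equation 5.1: sum of squared rate differences over sum of squared rates."""
    r, r_tilde = np.asarray(r, float), np.asarray(r_tilde, float)
    return np.sum((r - r_tilde) ** 2) / np.sum(r ** 2)

def v_g_correlation(V, G):
    """Correlation coefficient across neurons between voltage and conductance
    at each instant; rows of V and G index time, columns index neurons."""
    return np.array([np.corrcoef(v_row, g_row)[0, 1] for v_row, g_row in zip(V, G)])
```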
[Figure 3 appears here: probability densities versus voltage (mV) and inhibitory conductance G_i, and mean G_i versus voltage; see caption below.]
Figure 3: Snapshots from t = 900 ms of Figure 2a. (a) Distribution of neurons across voltage as computed by population density (black line) and direct simulation (histogram). (b) Distribution of neurons across inhibitory conductance from direct simulation (histogram). The vertical black line is at the average inhibitory conductance from the population density simulation. (c) Distribution of neurons across voltage and inhibitory conductance from the direct simulation averaged over 100 repetitions of the input. (d) Conditional mean inhibitory conductance μ_{G|V}(v, t) from direct simulation (black line). The horizontal gray line is plotted at the mean inhibitory conductance μ_G(t) from the population density simulation.
To analyze our critical independence assumption (see equation 4.12), we plot in Figure 3d an estimate of the conditional mean μ_{G|V} computed by the direct simulation along with the unconditional mean value μ_G from the population density simulation. For this example, the estimate of μ_{G|V} is nearly independent of voltage, which explains why the population density results so closely match those from the direct simulation.

The snapshots in Figure 4, which are from t = 900 ms in Figure 2b, further demonstrate the breakdown of the independence assumption that can occur with unrealistically large unitary inhibitory conductances. Figure 4b
[Figure 4 appears here: probability density and mean G_i versus voltage (mV); see caption below.]
Figure 4: Snapshots from t = 900 ms of Figure 2b. (a) Distribution of neurons across voltage as computed by population density (black line) and direct simulation (histogram). (b) Conditional mean inhibitory conductance μ_{G|V}(v, t) from direct simulation (black line). The horizontal gray line is plotted at the mean inhibitory conductance μ_G(t) from the population density simulation.
shows that the mean inhibitory conductance μ_{G|V}(v, t) is not independent of voltage but decreases as the voltage increases. This dependence leads to the discrepancy in the distribution of neurons across voltage seen in Figure 4a. The match between the direct simulation and the population density results might be improved if, rather than making an independence assumption (see equation 4.12) at the first moment of G_i, we retained the second moment from the moment expansion described at the beginning of section 4.3. Further analysis of the sources of disagreements between direct and population density computations is given in the discussion.

5.2.2 Two-Population Network. We simulated 300 runs of the two-population network. Most (245 of 300) of the runs had a relatively low deviation measure of D below 0.30. A substantial fraction (55 of 300) had deviations larger than 0.30. However, as demonstrated by examples below, we discovered numerous cases where D was high, but the firing rates of the population density and direct simulations were subjectively similar. This discrepancy is largely due to the sensitivity of D to timing; slight differences in timing frequently lead to a large D. Rather than invent another deviation measure, we analyzed by eye each of the 55 simulations with D > 0.30 to determine the source of the large D. We discovered that for the majority (31 of 55) of the simulations, the two firing rates appeared subjectively similar, as demonstrated by examples below. For only 24 of the 300 simulations did the firing rates of the population density differ substantially from the direct simulation firing rate. However, we demonstrate evidence that these simulations represent situations in which the network has several different stable firing patterns and that neither the direct simulation nor the population density simulation can be viewed as the standard or correct result.

[Figure 5 appears here: input rates and excitatory and inhibitory population firing rates (imp/s) versus time (ms) for panels a and b; see caption below.]

Figure 5: Examples of two population results. (a) The firing rates match well despite a large deviation measure (D = 0.64). Top panel: Excitatory (black line) and inhibitory (gray line) input rates. Middle and bottom panels: Excitatory and inhibitory population firing rates, respectively. Population density (black line) and direct simulation (histogram) results are shown. Parameters: W̄ = 48.05, τ_i = 22.25 ms, μ_{A_e}/c = 0.011, μ_{A_i}/(τ_i g_r) = 0.196, ν̄_e = 1853 imp/s, ν̄_i = 82 imp/s. (b) Slight timing differences caused the large deviation (D = 0.98). Panels as in a. Parameters: W̄ = 47.89, τ_i = 2.49 ms, μ_{A_e}/c = 0.005, μ_{A_i}/(τ_i g_r) = 0.137, ν̄_e = 1913 imp/s, ν̄_i = 1367 imp/s.

Figure 5a demonstrates a case in which there is good qualitative agreement between direct and population density computations despite a large deviation measure (D = 0.64). Note that since the population firing rate is
a population average, not a temporal average, the sharp peaks above 600 impulses per second in the inhibitory population firing rate do not indicate that a single neuron was firing above 600 impulses per second. Rather, these sharp peaks indicate that many neurons in the population fired a spike within a short time interval, a phenomenon we call population synchrony. For this example, the large D was caused by slight differences in timing of the tight population synchrony. Thus, not only does this example illustrate deficiencies in the measure D, it also demonstrates how well the population density model can capture the population synchrony of a strongly coupled network.

In other cases, a large D indicated more substantial discrepancies between the direct and population density simulations. Figure 5b shows the results of a simulation with D = 0.98. The deviation measure is so high because the timing of the two models differs enough that the peaks in activity sometimes do not overlap. Even so, the timings typically differ by only about 5 ms, and the firing rates are subjectively similar. In Figure 6a, the population density underestimates the firing rates three times (D = 0.66) but catches all qualitative features.

In 24 simulations, we observed periods of gross qualitative differences between the population density firing rates and the direct simulation firing rates. We discovered that in each of these cases, the network dynamics were extremely sensitive to parameters; rounding parameters to three significant digits led to large differences in firing pattern for both population density and direct simulations. In some cases, this rounding caused the firing patterns to converge and produced a low deviation measure.

An example of a simulation with a large deviation measure (D = 1.09) is shown in Figure 6b. Between t = 250 and 400 ms, the network appears to have a stable fixed point, with both populations firing steadily at high rates.
The direct simulation approaches this putative fixed point more rapidly than the population density simulation. At around t = 400 ms, the response of both simulations is consistent with a fixed point losing stability, as the firing rates oscillate around the fixed point with increasing amplitude. However, the oscillations of the direct simulation grow more rapidly than those of the population density simulation, and the populations in the direct simulation stop firing 150 ms before those in the population density simulation. From this point on, the population density and direct simulations behave similarly. But since the input rates for this example change slowly, the firing rates do not phase-lock to the input rates, and the two simulations oscillate out of phase.

The behavior after t = 400 ms is reminiscent of drifting through a Hopf bifurcation and tracking the ghost of an unstable fixed point (Baer, Erneux, & Rinzel, 1989). Once the fixed point loses stability, the state of the system evolves away from the unstable fixed point only slowly. Any additional noise will cause the system to leave the fixed point faster (Baer et al., 1989). Thus, the additional noise in the direct simulation (e.g., from heterogeneity in the connectivity or finite size effects) might explain why the direct simulation leaves the fixed point more quickly.

Furthermore, the stability of this fixed point is extremely sensitive to parameters. Even when we leave all parameters fixed and generate a different realization of the direct simulation network, the point at which the direct simulation drops off the fixed point can differ by 100 ms. Rounding parameters to three significant digits changes the duration of the fixed point by hundreds of milliseconds. Moreover, when we increased μ_{A_i}/(τ_i g_r) from 0.143 to 0.16, we no longer observed the high fixed point, and the firing rates of the two simulations matched with only small delays.

[Figure 6 appears here: input rates and population firing rates (imp/s) versus time (ms) for panels a and b; see caption below.]

Figure 6: Further examples of two population results. Panels as in Figure 5. (a) A simulation where the population density underestimates the firing rate. Parameters: W̄ = 25.54, τ_i = 11.66 ms, μ_{A_e}/c = 0.018, μ_{A_i}/(τ_i g_r) = 0.172, ν̄_e = 908 imp/s, ν̄_i = 1391 imp/s. (b) A simulation with large deviations between the population density and direct simulation firing rates. Parameters: W̄ = 40.942, τ_i = 19.992 ms, μ_{A_e}/c = 0.018, μ_{A_i}/(τ_i g_r) = 0.144, ν̄_e = 696 imp/s, ν̄_i = 44 imp/s.
Each of the 24 simulations with high deviation measure exhibited similarly high sensitivity to parameters. Most of them appeared to have fixed points with high firing rates, giving rise to apparent bistability in the network activity; typically, these fixed points gained or lost stability differently for the two models. These observations suggest the possibility that the large deviations above reflect differences between Monte Carlo simulations and solutions of the corresponding partial differential integral equations under bistable conditions. In these situations, neither simulation can be considered the standard; the deviation measure D is not a measure of the error in the population density simulation but a reflection of the sensitivity of the bistable conditions to noise. Thus, these discrepancies probably do not result from our dimension-reduction approximation.

5.2.3 Model of One Hypercolumn of Visual Cortex. In our model of one hypercolumn of visual cortex, based on that of Somers, Nelson, and Sur (1995), excitatory and inhibitory populations are labeled by their preferred orientations. These preferred orientations are established by the input from the lateral geniculate nucleus (LGN), as demonstrated in Figure 7a. The LGN synaptic input rates (ν^j_{e,o}(t)) to excitatory populations in response to a flashed bar oriented at 0 degree are shown in the top panel. The population with a preferred orientation at 0 degree received the strongest input, and the population at 90 degrees received the weakest input. Inhibitory populations received similar input. The resulting excitatory population firing rates (r^j(t)) are shown in the bottom panel. More details on the structure of the network are given in appendix D and in Nykamp and Tranchina (2000).

The addition of slow inhibitory synapses changes the dynamics of the network response. For Figure 7a, we used slow inhibition with τ_i = 8 ms.
In Figure 7b, we show the response of the same network to the same input, where we have substituted fast inhibitory synapses, keeping μ_{A_i} (the average integral of a unitary inhibitory conductance) fixed. After an initial transient of activity, the firing rate for the fast inhibition model is smooth, indicating asynchronous firing. In contrast, the firing rate for the slow inhibition model (see Figure 7a) exhibits nonperiodic cycles of population activity. Thus, this example demonstrates the importance of slow inhibitory synapses to network dynamics.

[Figure 7 appears here: LGN input rates and excitatory population firing rates (imp/s) versus time (ms) by preferred orientation (0°, ±10°, ±20°, ±30°, >30°); see caption below.]

Figure 7: Results of visual cortex simulation. (a) Population density results with slow inhibition. Top panel: Synaptic input rates from the LGN to excitatory populations. These input rates are produced by a bar flashed from t = 100 ms to t = 350 ms, oriented at 0 degree; thus, the input to the population at 0 degree is the largest. Bottom panel: The firing rates of the excitatory population densities in response to this input. (b) Population density results with fast inhibition. The firing rates of the excitatory population densities to the same input as in a. The only difference between b and the bottom panel in a is that τ_i = 8 ms for a, while τ_i is essentially 0 for b.

In Figure 8, we plot a comparison of the population density firing rates with the firing rates of a direct simulation of the same network. As above, the direct simulation contained 1000 integrate-and-fire point neurons per population. We averaged over two passes of the stimulus to obtain sufficient spikes for populations with low firing rates. Figure 8 compares the firing rates of excitatory populations with preferred orientations of 0 degree (top) and 90 degrees (bottom). The firing rates match well. The match between the populations at 0 degree is excellent, though there is a slight timing difference in the fourth wave of activity. The binning of the direct simulation spikes cannot show the fine temporal structure visible in the population density firing rate. The only obvious deviation is the larger rebound in the population density than in the direct simulation at 90 degrees after t = 400 ms. However, the difference is only a couple of impulses per second, as the population is barely firing.

[Figure 8 appears here: firing rates (imp/s) versus time (ms) for panels a and b; see caption below.]

Figure 8: Comparison between population density and direct simulation firing rates for the visual cortex simulation. Population density firing rates are plotted with a black line, and direct simulation firing rates are shown by the histograms. The population density firing rates are from Figure 7a. (a) Excitatory population with a preferred orientation of 0 degree. (b) Excitatory population with a preferred orientation of 90 degrees. Note the change in scale.

This example also serves to illustrate the speed of the population density approach. For both the population density and the direct simulations above, we used a time step of Δt = 0.5 ms. For the population density computations, we discretized the voltage state-space using Δv = 0.25 mV. These simulations of a real time period of 500 ms with 36 populations took 22 seconds for the population density and 2600 seconds for the direct simulations (using a Silicon Graphics Octane computer with one 195 MHz MIPS R10000 processor). Thus, the population density simulation was over 100 times faster than the direct simulation.

6 Discussion
Application of population density techniques to modeling networks of realistic neurons will require extensions of the theory beyond simple integrate-and-fire neurons with instantaneous postsynaptic conductances. We have taken a first step in this direction by adding a realistic time course for the inhibitory postsynaptic conductance.

The major difficulty of extending the population density approach to more complicated neurons arises from the additional dimension of the PDF required for each new variable describing the state of a neuron. Each additional dimension increases the computer time required to simulate the
populations. Thus, dimension-reduction procedures like the one employed in this article become important in developing efficient computational methods for simulating population densities.

6.1 Analysis of Dimension Reduction. The dimension-reduction procedure we used was inspired in part by the population density with mean-field synapses by Treves (1993), who developed a model by approximating the value of each neuron's synaptic conductance by the mean value over the population. Although we derive our equations by making a looser assumption (see equation 4.12), the resulting equations are equivalent to a mean-field approximation in the inhibitory conductance; the equations depend only on the mean inhibitory conductance μ_G.
6.1.1 Anticipated Deficiency of Reduction. Because the membrane potential of a neuron can be driven below the rest potential E_r only by a large inhibitory conductance, the expected value of a neuron's inhibitory conductance must depend on its voltage, contrary to our independence assumption, equation 4.12. Moreover, since a larger inhibitory conductance would decrease a neuron's voltage, one would anticipate that the mean inhibitory conductance would be negatively correlated with the voltage, as was sometimes seen when we set the unitary inhibitory peak conductance to unphysiologically large values (see Figure 4). The inability of our population density model to represent these correlations leads to its systematically too low firing rates when those correlations are important (e.g., see Figure 2b).

Since the population density model cannot account for the dependence of μ_{G|V}(v, t) on voltage, it overestimates the mean inhibitory conductance for high voltages and underestimates it for low voltages (see Figure 4b) in the presence of large negative correlations. These voltage-dependent discrepancies in the inhibitory conductance increase the fraction of neurons at intermediate voltages in the population density model because neurons with high (low) voltage receive artificially high (low) inhibition. The fraction of neurons at both high and low voltages is thus lower than in the direct simulation, resulting in the differences in the distribution of neurons seen in Figure 4a. Since fewer neurons are near threshold and ready to fire, the firing rate for the population density model is lower than the firing rate of the direct simulation.

6.1.2 Analysis of Results. The surprising discovery of much better accuracy than might be anticipated must be explained by decorrelating actions in the dynamics of the neurons.
By analyzing the 500 single population runs that included large inhibition, we obtained insight into both the cause of the deterioration of the independence assumption, equation 4.12, and the decorrelating actions that help preserve the independence. A summary of the 500 runs is plotted in Figure 9. We see a confirmation in Figure 9a that unphysiologically large peak unitary inhibitory conductances
536
Duane Q. Nykamp and Daniel Tranchina
Figure 9: A summary of the 500 single population runs that included large inhibition. (a) A scatter plot of the deviation measure (D) versus peak unitary inhibitory conductance. The shaded region marks the range of more realistic conductance sizes used in the first 10,000 single population simulations. (b) Error measure versus the average (over time and neurons) of the absolute value of the correlation coefficient between the voltage and inhibitory conductance. (c) Error measure versus the average firing rate. (d) Error measure versus g as defined by equation 6.1. For clarity, a point with g = 1650 and D = 0.1 was omitted.
are required for a large deviation measure D. The range of peak conductances used for the first 10,000 runs is shaded, the upper end of which is already larger than what is measured experimentally (Tamás et al., 1997; Galarreta & Hestrin, 1997). The dependence of the error on the correlation between inhibitory conductance and voltage shown in Figure 9b suggests that, at least for the single uncoupled population, the error in the population density results is indeed a result of a violation of the independence assumption, equation 4.12. Since many runs with large peak unitary inhibitory conductance have small D (see Figure 9a), peak conductance size alone is not a good predictor of the accuracy of the population density method. We searched for other quantities that were more strongly related to D and would help us better understand the source of the error. We discovered that D is strongly related
to the average firing rate (see Figure 9c) and less reliably related to the relative strength of inhibition over excitation (see Figure 9d). Both the voltage reset after firing a spike and the voltage jumps due to excitatory input are decorrelating actions that help preserve the independence assumption, equation 4.12. The voltage reset after a spike is particularly effective because a neuron that likely has a low inhibitory conductance is moved to a low voltage, thus decreasing the negative correlation between the voltage and the inhibitory conductance. As a result, a high firing rate should reduce the correlation and thus the deviation measure, creating the relationship between firing rate and the deviation measure seen in Figure 9c. The voltage jumps due to excitatory synaptic input also reduce the correlation between voltage and inhibitory conductance, as the jumps are similar to diffusion with a diffusion coefficient that contains the factor ν_e(t)μ²_{A_e} (Nykamp & Tranchina, 2000). The inhibitory effects that introduce correlation are proportional to μ_G(t), whose quasi-steady-state value is ν_i(t)μ_{A_i}. We see a moderately strong relationship (see Figure 9d) between the deviation measure and the ratio of these quantities,

g = (ν̄_i μ_{A_i} τ_m) / (ν̄_e μ²_{A_e}),   (6.1)
where we have multiplied by τ_m = c/g_r to make the ratio dimensionless. For the vast majority of simulations, D increases with g, though for g > 200, there are a few simulations with higher-than-expected deviation measures. The dynamics of these two decorrelating actions can be seen in individual simulations, as shown in Figure 10. In Figure 10a, the correlation coefficient mimics the features of the firing rate. Throughout the stimulus in Figure 10b, including a period when the population is not firing, the correlation coefficient reflects the excitatory input rate. When the excitatory input rate is nearly zero around t = 650 ms, the correlation coefficient is largest in magnitude. These two decorrelating actions are crucial for the accuracy of our population density model. Presumably the reason they work so well is that they help the model most during periods when the neurons are firing. Thus, for example, the periods of highest correlation in Figure 10 occur harmlessly when the population is virtually silent. We performed the same analysis for the two population runs. However, as expected, the results were not as clean since D is not a satisfactory deviation measure when it is large. Moreover, for the 24 simulations with large subjective deviations, we would not expect to observe a strong link between the decorrelating actions and the deviations since these deviations are likely not a result of our independence assumption, equation 4.12. Nonetheless, relationships similar to those in the single-population case were observed (not shown). We also observed that the probability of a large deviation increases as the coupling strength W̄ increases. This observation is consistent with
Figure 10: Further examples of results with a single population. Panels as in Figure 2. (a) An example where the correlation coefficient is seen to mimic the firing rate. Parameters: τ_i = 20.60 ms, μ_{A_e}/c = 0.010, μ_{A_i}/(τ_i g_r) = 0.193, ν̄_e = 1603 imp/s, ν̄_i = 449 imp/s. (b) An example where the correlation coefficient closely follows the excitatory input rate ν_e(t) (black line in top panel). This is an example simulation with second-order inhibition. Parameters: τ_i = 79.25 ms, τ_s = 4.03 ms, μ_{A_e}/c = 0.005, peak unitary inhibitory conductance/g_r = 0.003, ν̄_e = 1843 imp/s, ν̄_i = 563 imp/s.
our hypothesis that network bistability may underlie some of the observed large deviations, as strong coupling facilitates network bistability.

6.2 Noninstantaneous Synapses and Network Dynamics. Slow inhibition may be important for proper network dynamics. For example, based on results from rate models such as the Wilson and Cowan (1972) equations, one expects that the relation τ_i > τ_e, where τ_e is the time constant of excitatory synapses, may be important for the emergence of oscillatory behavior. Clearly, this relation cannot be retained when inhibition is approximated
as instantaneous, and thus models with fast inhibition may miss oscillatory behavior that is present with slow inhibition. Oscillatory behavior was exactly what emerged in our visual cortex simulations with slow inhibition. The nonperiodic oscillations of activity observed with τ_i = 8 ms (see Figure 7a) were not present with fast inhibition (see Figure 7b). Since the excitation was fast (effectively τ_e = 0 ms), the relation τ_i > τ_e would remain true even for relatively small τ_i. Indeed, we observed oscillations even with τ_i as low as 2 ms (not shown). The oscillations began to disappear with τ_i = 1 ms, and the firing rate approached the fast-inhibition results (see Figure 7b) as τ_i was made even smaller. Similar results were obtained with second-order inhibition, so these results were not peculiar to first-order inhibition. In networks where synapses are mainly excitatory, the time course of excitatory synapses may play a role in network dynamics such as synchronization (Abbott & van Vreeswijk, 1993; Gerstner, 2000). Furthermore, slower NMDA (N-methyl-D-aspartate) excitatory synapses cannot be accurately modeled as fast synapses. Thus, extending the model to include slow excitation would be a natural next step in the development of the population density approach.

6.3 Computational Speed. We have demonstrated that the population density approach can be roughly 100 times faster than direct simulations. The speed advantage of the population density approach for slow inhibition is even greater than the advantage for fast synapses reported earlier (Nykamp & Tranchina, 2000). The population density simulations can be sped up even further using a diffusion approximation (Nykamp & Tranchina, 2000). The computational speed of the population density simulations with slow inhibition depends on the reduction of the two- or three-dimensional PDF to one dimension.
Further embellishments on the individual neuron model, such as additional compartments, ionic channels, or synaptic features, would typically increase the dimension of the PDF describing the state-space of the neuron. Using dimension-reduction procedures such as the one employed here or the principal component analysis method of Knight and colleagues (Knight, 2000), one may be able to reduce the dimension of these PDFs, allowing speedy computation of the resulting population density simulations. Thus, the population density approach may be a useful tool for efficiently simulating large networks of not only simple integrate-and-fire neurons but also more complicated, realistic neurons.

Appendix A: Derivation of Mean Conductance Evolution
In this appendix, we complete the derivation of equations for the reduced model in section 4 by deriving the evolution equation for the conditional mean conductance μ_{G|V}(v, t) and the unconditional mean conductance μ_G(t).
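The end result for the unconditional mean (equation 4.13, rederived as equation A.12 below) is a simple relaxation equation, dμ_G/dt = (ν_i(t)μ_{A_i} − μ_G)/τ_i, which can be checked against a direct Monte Carlo simulation. The following sketch uses our own illustrative parameter values; as in the model, each neuron's conductance decays with time constant τ_i and jumps by A_i/τ_i at Poisson synaptic events of rate ν_i.

```python
import numpy as np

# Sketch (our notation and parameters): compare the population-average
# conductance from a direct simulation with the mean-field relaxation
# equation  d mu_G / dt = (nu_i * mu_Ai - mu_G) / tau_i  (equation 4.13).
rng = np.random.default_rng(1)
tau_i, nu_i, mu_Ai = 10e-3, 500.0, 0.02   # s, events/s, mean unitary size
dt, steps, n_neurons = 1e-4, 2000, 5000

g = np.zeros(n_neurons)
mu = 0.0                                   # mean-field prediction
for _ in range(steps):
    events = rng.poisson(nu_i * dt, n_neurons)
    g += events * mu_Ai / tau_i            # conductance jumps of size A_i/tau_i
    g -= g * dt / tau_i                    # exponential decay
    mu += (nu_i * mu_Ai - mu) / tau_i * dt # equation 4.13 (Euler step)

# After 200 ms (20 tau_i) both settle near the steady state nu_i * mu_Ai.
assert abs(g.mean() - mu) / mu < 0.05
```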
A.1 The Evolution of the Conditional Mean Conductance. We derive an equation for the evolution of the conditional mean conductance μ_{G|V}(v, t) by multiplying equation 3.3 by g and integrating with respect to g. Using equation 4.8 we obtain:

∂/∂t [μ_{G|V}(v, t) f_V(v, t)] = −∂/∂v ∫ g J_V(v, g, t) dg + ∫ J_G(v, g, t) dg + δ(v − v_reset) ∫ g J_U(t_ref, g, t) dg.   (A.1)
We used integration by parts to obtain the second term on the right-hand side. The boundary terms were zero because J_G(v, g, t) = 0 for g = 0, ∞.⁴ Using equations 3.7 and 4.9, the integral of J_G becomes:

∫ J_G(v, g, t) dg = −(1/τ_i) μ_{G|V}(v, t) f_V(v, t) + ν_i(t) ∫₀^∞ dg ∫₀^g dg′ F̃_{A_i}(τ_i(g − g′)) r(v, g′, t).   (A.2)
We change the order of integration in the integral from the last term of equation A.2:

∫₀^∞ dg ∫₀^g dg′ F̃_{A_i}(τ_i(g − g′)) r(v, g′, t)
  = ∫₀^∞ [∫_{g′}^∞ F̃_{A_i}(τ_i(g − g′)) dg] r(v, g′, t) dg′
  = (1/τ_i) ∫₀^∞ F̃_{A_i}(x) dx ∫₀^∞ r(v, g′, t) dg′,   (A.3)

where x = τ_i(g − g′). Since F̃_{A_i}(x) = ∫_x^∞ f_{A_i}(y) dy, where f_{A_i}(y) is the probability density function for A_i, we have

∫₀^∞ F̃_{A_i}(x) dx = ∫₀^∞ ∫_x^∞ f_{A_i}(y) dy dx
  = ∫₀^∞ ∫₀^y dx f_{A_i}(y) dy
  = ∫₀^∞ y f_{A_i}(y) dy
  = μ_{A_i},   (A.4)

⁴ We implicitly assume J_G approaches zero faster than 1/g for large g.
where μ_{A_i} is the average value of A_i. Combining equations A.2 through A.4, the second term from the right-hand side of equation A.1 becomes:

∫ J_G(v, g, t) dg = [ν_i(t)μ_{A_i} − μ_{G|V}(v, t)] f_V(v, t) / τ_i.   (A.5)
Using equation 3.6, the integral from the first term on the right-hand side of equation A.1 can be written as

∫ g J_V(v, g, t) dg = −(g_r/c)(v − E_r) μ_{G|V}(v, t) f_V(v, t)
  + ν_e(t) ∫_{E_i}^v F̃_{C_e^*}((v − v′)/(E_e − v′)) μ_{G|V}(v′, t) f_V(v′, t) dv′
  − (1/c)(v − E_i) μ_{G²|V}(v, t) f_V(v, t).   (A.6)

In equation A.6, μ_{G²|V}(v, t) is defined similarly to μ_{G|V}(v, t), and we used the identity

∫ g² r(v, g, t) dg = μ_{G²|V}(v, t) f_V(v, t).   (A.7)
In a similar manner, equations for μ_{G^k|V} f_V(v, t) can be derived for all k, in which μ_{G^k|V} f_V(v, t) depends on μ_{G^{k+1}|V} f_V(v, t).

A.2 The Evolution of the Unconditional Mean Conductance. With the independent mean assumption, equation 4.12, we simply need to derive an equation for the unconditional mean μ_G(t). Since the evolution of G_i(t) is independent of the voltage (see equation 2.4), we can immediately write down the probability density function of G_i,

f_G(g, t) dg = Pr{G_i(t) ∈ (g, g + dg)},   (A.8)
and its evolution equation,

∂f_G/∂t (g, t) = −∂J̄_G/∂g (g, t).   (A.9)
The total flux across conductance, J̄_G, is identical in form to J_G (see equation 3.7):

J̄_G(g, t) = −(g/τ_i) f_G(g, t) + ν_i(t) ∫₀^g F̃_{A_i}(τ_i(g − g′)) f_G(g′, t) dg′.   (A.10)
The unconditional mean μ_G(t) can then be written in terms of f_G:

μ_G(t) = ∫ g f_G(g, t) dg.   (A.11)
The evolution equation for μ_G(t) is obtained in the same way as that of μ_{G|V}(v, t) above, so the intermediate steps are omitted. We multiply equation A.9 by g and integrate with respect to g to obtain:

dμ_G/dt (t) = −∫ g ∂J̄_G/∂g (g, t) dg
  = ∫ J̄_G(g, t) dg
  = −(1/τ_i) μ_G(t) + ν_i(t) ∫₀^∞ dg ∫₀^g dg′ F̃_{A_i}(τ_i(g − g′)) f_G(g′, t)
  = [ν_i(t)μ_{A_i} − μ_G(t)] / τ_i,   (A.12)

which is equation 4.13.

Appendix B: A Refractory State Probability Density Function
In general, a neuron becomes refractory for a period after firing. During the refractory period, a neuron's voltage does not move, though its synaptic conductances evolve as usual. The presence of a refractory period adds complications to the full population density model. These complications stem from the fact that refractory neurons are not accounted for in r(v, g, t). Since refractory neurons evolve according to different rules, we track them with a separate PDF, f_ref,

f_ref(u, g, t) du dg = Pr{U(t) ∈ (u, u + du) and G_i(t) ∈ (g, g + dg)},   (B.1)
where U(t) is the time since the neuron fired. Since f_ref accounts for only refractory neurons and a neuron becomes nonrefractory when U(t) = t_ref, the function f_ref is defined only for u ∈ (0, t_ref). Since a neuron is either refractory or accounted for in r(v, g, t), we have the conservation condition,

∫∫ r(v, g, t) dv dg + ∫∫ f_ref(u, g, t) du dg = 1.   (B.2)
The evolution equation of f_ref(u, g, t) is

∂f_ref/∂t (u, g, t) = −[∂J_U/∂u (u, g, t) + ∂J_{G′}/∂g (u, g, t)],   (B.3)
where we denote the flux across U by J_U and the flux across inhibitory conductance for a refractory neuron by J_{G′}. The flux across U is simply

J_U(u, g, t) = (dU/dt) f_ref(u, g, t) = f_ref(u, g, t)   (B.4)
because dU/dt = 1 (the time since firing increases identically with time). Consequently, the boundary condition at u = 0 is f_ref(0, g, t) = J_V(v_th, g, t), since neurons that fire immediately become refractory. The flux across conductance for refractory neurons, J_{G′}(u, g, t), is identical in form to the flux across conductance for nonrefractory neurons (see equation 3.7):

J_{G′}(u, g, t) = −(g/τ_i) f_ref(u, g, t) + ν_i(t) ∫₀^g F̃_{A_i}(τ_i(g − g′)) f_ref(u, g′, t) dg′.   (B.5)
Since J_U(t_ref, g, t) is the flux of neurons becoming nonrefractory, it is the source of probability at v_reset in equation 3.3. The refractory PDF, f_ref, plays no role in the reduced model because we no longer need to track the inhibitory conductance of refractory neurons. Nonetheless, since f_V(v, t) does not integrate to 1, we still have the modified conservation condition,

∫ f_V(v, t) dv + ∫_{t−t_ref}^{t} r(t′) dt′ = 1,   (B.6)

which replaces equation B.2.

Appendix C: Individual Neuron Simulation Procedure
Numerical methods for the individual neuron simulations were similar to those of appendix C of Nykamp and Tranchina (2000). Although the evolution of the voltage and inhibitory conductance was no longer computed analytically between input events, the simulation was still based on an event-driven scheme. The event-driven procedure was used because the excitatory synaptic events were still delta functions. A time-stepping algorithm in which an excitatory input did not coincide with a time step would have produced significant error. Furthermore, since a neuron can cross threshold only at the time of an excitatory synaptic input, the event-driven simulation generalizes easily to the slow inhibition case. In this approach, a neuron's voltage is calculated only when it receives synaptic input. When a neuron receives synaptic input, its voltage and inhibitory synaptic conductance are evolved up to the time of the input. Then either the voltage or the inhibitory conductance is jumped up, depending on whether the input was excitatory or inhibitory, respectively. We used a second-order Runge-Kutta method to evolve the voltage and conductance up to the synaptic input time. For each event, the step size of the Runge-Kutta method is chosen so that the synaptic input time coincides exactly with a time step. After each excitatory input voltage jump, the voltage is compared with threshold to determine whether the neuron spiked. If it spiked, the voltage is reset to v_reset.
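A minimal sketch of this hybrid event-driven scheme for a single conductance-based integrate-and-fire neuron is given below. All parameter values, the merged-Poisson event generation, and the fixed jump sizes are our own simplifications for illustration, not the paper's actual simulation code (which, e.g., draws gamma-distributed unitary amplitudes).

```python
import numpy as np

# Sketch of the hybrid scheme: between synaptic events the voltage and
# inhibitory conductance are advanced with second-order Runge-Kutta; at an
# excitatory event the voltage jumps, at an inhibitory event the conductance
# jumps, and threshold is checked only after excitatory jumps (the drift
# alone pulls the voltage toward rest and so never crosses threshold).
rng = np.random.default_rng(2)
E_r, E_i, v_th, v_reset = -65.0, -70.0, -55.0, -65.0  # mV
tau_m, tau_i = 20e-3, 10e-3                           # s
nu_e, nu_i = 1500.0, 400.0                            # event rates (1/s)
dv_e, dg_i = 1.0, 0.5                                 # jump sizes (mV and 1/s)

def drift(v, g):
    """Subthreshold dynamics between events (g scaled by capacitance)."""
    return (-(v - E_r) / tau_m - g * (v - E_i), -g / tau_i)

def rk2_step(v, g, h):
    k1v, k1g = drift(v, g)
    k2v, k2g = drift(v + h * k1v, g + h * k1g)
    return v + h * (k1v + k2v) / 2, g + h * (k1g + k2g) / 2

v, g, t, spikes, t_end = E_r, 0.0, 0.0, 0, 1.0
while t < t_end:
    # Next event time from the merged Poisson stream of both synapse types.
    dt = rng.exponential(1.0 / (nu_e + nu_i))
    # Choose the RK2 step size so the event time lands exactly on a step.
    n = max(1, int(np.ceil(dt / 1e-4)))
    for _ in range(n):
        v, g = rk2_step(v, g, dt / n)
    t += dt
    if rng.random() < nu_e / (nu_e + nu_i):
        v += dv_e                 # excitatory event: voltage jump
        if v >= v_th:             # threshold checked only at these times
            spikes += 1
            v = v_reset
    else:
        g += dg_i                 # inhibitory event: conductance jump

print(f"{spikes} spikes in {t_end:.1f} s of simulated time")
```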
This hybrid of a time-stepping algorithm and an event-driven simulation is an efficient way to simulate the hybrid neurons with both delta function and slower synaptic conductances.

Appendix D: Parameters
We used a gamma distribution for the distribution of unitary synaptic conductance sizes A_{e/i},

f_{A_{e/i}}(x) = [exp(−x/a_{e/i}) / (a_{e/i}(n_{e/i} − 1)!)] (x/a_{e/i})^{n_{e/i}−1},   (D.1)
where f_{A_{e/i}}(x) is the probability density function of A_{e/i}. For excitatory synapses, since C_e^* = 1 − exp(−A_e/c), the complementary cumulative distribution function, equation 2.3, for C_e^* is

F̃_{C_e^*}(x) = e^{−u(x)} Σ_{l=0}^{n_e−1} u(x)^l / l!,   (D.2)
where u(x) = −(c/a_e) log(1 − x). We used a coefficient of variation of 0.5 for all A's. Thus, once μ_{A_{e/i}}, the average value of A_{e/i}, was chosen, the values of a_{e/i} and n_{e/i} were determined. For the single- and two-population simulations, μ_{A_{e/i}} was chosen randomly. For the model of a hypercolumn in visual cortex, we used the following: μ_{A_e} = 0.008/c and μ_{A_i} = 0.027/c for excitatory neurons, and μ_{A_e} = 0.020/c and μ_{A_i} = 0.066/c for inhibitory neurons. These values corresponded to postsynaptic potentials that were half those used in the model by Somers et al. (1995). For both excitatory and inhibitory neurons, we set E_i = −70 mV, E_r = v_reset = −65 mV, v_th = −55 mV, and E_e = 0 mV. For excitatory neurons, c/g_r = τ_m = 20 ms and t_ref = 3 ms, and for inhibitory neurons, c/g_r = τ_m = 10 ms and t_ref = 1 ms. We used the parameters for excitatory neurons for the single population simulations. For the distribution of synaptic latencies α(t), we used:
α(t) = ᾱ [exp(−t/τ_α) / (τ_α(n_α − 1)!)] (t/τ_α)^{n_α−1}  for 0 ≤ t ≤ 7.5 ms,
α(t) = 0  otherwise,   (D.3)
where n_α = 9, τ_α = 1/3 ms, and ᾱ is a constant so that ∫ α(t) dt = 1. We used the same α for all interactions. For the model of a hypercolumn in visual cortex, the connectivity W between two populations depended on their difference in preferred orientation. The connectivity between a presynaptic population of type j and a
postsynaptic population of type k, for j, k ∈ {E, I}, whose preferred orientation differed by lΔθ, was

W_{jkl} = W̄_{jk} C_j exp(−|lΔθ|² / (2σ_j²))  for |lΔθ| < 60°,
W_{jkl} = 0  otherwise,   (D.4)

where C_j was a constant such that Σ_l W_{jkl} = W̄_{jk}. Thus, W̄_{jk}, for j, k ∈ {E, I}, is the average total number of synapses from all populations of type j onto each neuron of type k. We used W̄_{EE} = 72, W̄_{EI} = 112, W̄_{IE} = 48, and W̄_{II} = 32, which were double the numbers from Somers et al. (1995). We let σ_I = 60° and σ_E = 7.5°.

Acknowledgments
We gratefully acknowledge numerous helpful discussions with David McLaughlin, Robert Shapley, Charles Peskin, and John Rinzel. This material is based on work supported under a National Science Foundation Graduate Fellowship.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Baer, S. M., Erneux, T., & Rinzel, J. (1989). The slow passage through a Hopf bifurcation: Delay, memory effects, and resonance. SIAM J. Appl. Math., 49, 55–71.
Chawanya, T., Aoyagi, A., Nishikawa, I., Okuda, K., & Kuramoto, Y. (1993). A model for feature linking via collective oscillations in the primary visual cortex. Biol. Cybern., 68, 483–490.
Galarreta, M., & Hestrin, S. (1997). Properties of GABAA receptors underlying inhibitory synaptic currents in neocortical pyramidal neurons. J. Neurosci., 17, 7220–7227.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.
Knight, B. W. (1972a). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Knight, B. W. (1972b). The relationship between the firing rate of a single neuron and the level of activity in a population of neurons: Experimental evidence for resonant enhancement in the population response. J. Gen. Physiol., 59, 767–778.
Knight, B. W. (2000). Dynamics of encoding in neuron populations: Some general mathematical features. Neural Computation, 12, 473–518.
Knight, B. W., Manin, D., & Sirovich, L. (1996). Dynamical models of interacting neuron populations. In E. C. Gerf (Ed.), Symposium on Robotics and Cybernetics: Computational Engineering in Systems Applications. Lille, France: Cité Scientifique.
Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30.
Nykamp, D., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comp. Neurosci., 8, 19–50.
Omurtag, A., Knight, B. W., & Sirovich, L. (2000). On the simulation of large populations of neurons. J. Comp. Neurosci., 8, 51–63.
Sirovich, L., Knight, B. W., & Omurtag, A. (in press). Dynamics of neuronal populations: The equilibrium solution. SIAM.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neuroscience, 15, 5448–5465.
Stern, P., Edwards, F. A., & Sakmann, B. (1992). Fast and slow components of unitary EPSCs on stellate cells elicited by focal stimulation in slices of rat visual cortex. J. Physiol. (Lond.), 449, 247–278.
Strogatz, S. H., & Mirollo, R. E. (1991). Stability of incoherence in a population of coupled oscillators. J. Stat. Phys., 63, 613–635.
Tamás, G., Buhl, E. H., & Somogyi, P. (1997). Fast IPSPs elicited via multiple synaptic release sites by different types of GABAergic neurone in the cat visual cortex. J. Phys., 500, 715–738.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical J., 12, 1–24.

Received August 17, 1999; accepted June 1, 2000.
LETTER
Communicated by Ning Qian
A Complex Cell–Like Receptive Field Obtained by Information Maximization

Kenji Okajima
Hitoshi Imaoka
Fundamental Research Laboratories, NEC Corporation, Tsukuba, 305 Japan

The energy model (Pollen & Ronner, 1983; Adelson & Bergen, 1985) for a complex cell in the visual cortex is investigated theoretically. The energy model describes the output of a complex cell as the squared sum of the outputs of two linear operators. An information-maximization problem to determine the two linear operators is investigated assuming the low signal-to-noise ratio limit and a localization term in the objective function. As a result, two linear operators characterized by a quadrature pair of Gabor functions are obtained as solutions. The result agrees with the energy model, which well describes the shift-invariant and orientation-selective responses of actual complex cells, and thus suggests that complex cells are optimally designed from an information-theoretic viewpoint.

1 Introduction
Suppose we extract a nonlinear feature a from an input image f(x) using two receptive field (RF) functions w₁ and w₂ as

a = (w₁, f + n)² + (w₂, f + n)²
  = [Σ_x w₁(x)(f(x) + n(x))]² + [Σ_x w₂(x)(f(x) + n(x))]².   (1.1)
Here, n represents the noise, which is assumed to be zero-mean uncorrelated gaussian and is independent of the signal f. By introducing a complex receptive field w = w₁ + iw₂, equation 1.1 can be rewritten as

a = |(w, f + n)|².   (1.2)
Now, we obtain a certain amount of information MI by observing such a feature. The problem that will be considered here is to determine the two functions w₁ and w₂ (or the complex RF w) to maximize this amount of information.

Neural Computation 13, 547–562 (2001)  © 2001 Massachusetts Institute of Technology
548
Kenji Okajima and Hitoshi Imaoka
Figure 1: An example of the localization potential.
Since we confine ourselves to localized RFs, we add a localization term to our objective function. Thus, we actually will consider the problem of determining w to maximize

λ = MI − μ Σ_x u(x)|w(x)|² / ‖w‖²,   (1.3)

where μ is a constant and u(x) is a localization potential. An example of u(x) is depicted schematically in Figure 1, but the result that will be obtained does not depend on the details of its specific form if it is well approximated by a second-order expansion around its minimum as u(x) ≈ u_min + bx².

Kohonen (1994) considered projections of signals onto a subspace spanned by two vectors w₁ and w₂, and investigated the self-organization of this subspace using his learning algorithm, ASSOM. By computer simulations, he showed that complex cell–like shift-invariant RFs are self-organized through ASSOM. We will show that a similar result is obtained through infomax (Linsker, 1988). It will be shown that a complex cell–like RF is obtained as a solution of the information-maximization problem. The result might be related to Kohonen's result, although the relationship is not yet clear. As for simple cells, Olshausen and Field (1996) demonstrated that spatially localized and orientation-tuned simple-cell-like RFs emerge by learning a sparse code for natural images. Bell and Sejnowski (1997) obtained similar results by independent component analysis (ICA) of natural images. Since their ICA algorithm is based on the infomax of a nonlinear network in the zero-noise limit, their result indicates that infomax can derive the simple cells' RFs in the zero-noise limit. Okajima (1995, 1996, 1998b), on the other hand, considered an information-maximization problem in the low signal-to-noise ratio (SNR) limit and showed that a Gabor function–like RF similar to that of a simple cell is obtained as a solution. This article uses techniques similar to those in Okajima (1998b), but will deal with a nonlinear feature as shown in equation 1.2, and the result will be compared with the energy model for complex cells in the visual cortex (Pollen & Ronner, 1983; Adelson & Bergen, 1985; Heeger, 1992; Ohzawa, DeAngelis, & Freeman, 1997).
The next section briefly summarizes (a simplified version of) the energy model, and section 3 defines our objective function in more detail. We solve the information-maximization problem in section 4, and in section 5 we summarize the results. A preliminary result has been presented in Okajima (1998c).

2 Energy Model for Complex Cells
Neurons in the visual cortex of cats and monkeys are classified into two main categories: simple and complex cells. The spatially localized and orientation-selective RFs of these cells are well described by the energy model (Pollen & Ronner, 1983; Adelson & Bergen, 1985; Heeger, 1992; Ohzawa et al., 1997).

2.1 Linear Summation and Rectification. First, a linear summation of an input image f(x) is calculated using an RF function w_θ(x), and the result is rectified:
L_θ = Σ_x f(x) w_θ(x)   (2.1)

A_θ = L_θ  when L_θ ≥ 0,
A_θ = 0  otherwise.   (2.2)
This A_θ describes the simple cell's response. It is known that the RF of a simple cell is well approximated by a Gabor function (Daugman, 1980; Marcelja, 1980), which is defined by a sinusoidal function localized by a gaussian envelope,

w_θ(x) = G_σ(x) cos(kx + θ),   (2.3)

where G_σ(x) denotes a gaussian function, and θ denotes the phase. (In this article, we use the convention of calling the function defined by equation 2.3 a Gabor function and that defined by equation 2.5 a complex Gabor function.) From Figure 2, we see that a two-dimensional Gabor function has an anisotropic profile that results in the orientation selectivity of a simple cell.

2.2 Energy Mechanism. Next, the complex cell's response is described as a squared sum of the outputs of four simple cells. All have the same frequency k, but with phases in steps of 90 degrees. Mathematically, this can be described as a squared sum of two linear operators, each characterized by a Gabor function of the same frequency but with phases 90 degrees apart from each other:

a = (L_θ² + L_{θ+90}²) / 4.   (2.4)
Thus, according to the energy model, a complex cell adopts a quadrature pair of Gabor functions for its RF functions w 1 and w 2 (compare equation 1.1
Figure 2: Examples of the Gabor function: (a) θ = 0, (b) θ = −π/2.
with equations 2.1 and 2.4). In other words, a complex cell adopts a complex Gabor function for its complex RF w. Here, a complex Gabor function is defined as a complex exponential function localized by a gaussian envelope G_σ(x):

w(x) = G_σ(x) exp(i(kx + θ)).   (2.5)
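As a concrete check of equations 2.3 through 2.5, the following sketch (our own toy parameters, with noise omitted and the factor 1/4 dropped) builds a quadrature pair of one-dimensional Gabor RFs and verifies that the energy response is nearly independent of the stimulus phase.

```python
import numpy as np

# Energy model illustration: a quadrature pair of Gabor RFs yields a
# response that barely depends on the position (phase) of a grating.
x = np.arange(-32, 32)
sigma, k = 8.0, 2 * np.pi / 16
envelope = np.exp(-x**2 / (2 * sigma**2))
w1 = envelope * np.cos(k * x)          # Gabor, phase 0
w2 = envelope * np.sin(k * x)          # Gabor, 90 degrees apart

def complex_cell(f):
    """Energy response a = (w1, f)^2 + (w2, f)^2 (equation 1.1, no noise)."""
    return np.dot(w1, f) ** 2 + np.dot(w2, f) ** 2

# A grating at the preferred frequency, presented at several phases.
responses = [complex_cell(np.cos(k * x + phase))
             for phase in np.linspace(0, 2 * np.pi, 8, endpoint=False)]
# The energy response is almost constant across phases (shift invariance).
assert (max(responses) - min(responses)) / max(responses) < 0.01
```

A single rectified linear RF, by contrast, would swing from maximal response to zero as the grating phase shifts, which is the simple-cell behavior of equation 2.2.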
The complex cell's response approximated by equation 2.4 also shows an orientation selectivity, but the response is rather insensitive to the precise position of the visual stimulus (shift-invariant feature detection), in accord with experimental observations (Hubel & Wiesel, 1960).

3 Objective Function
Now let us go back to the information-maximization problem described in section 1 and define the objective function in more detail. First, we assume shift invariance for the input image statistics throughout this article. That is, it is assumed that if a certain image f_c(x) is presented, its displaced images f_c(x − x₀) will also be presented with the same probability. Next, we classify all these displaced images f_c(x − x₀) into the same class or "category" as f_c(x); when an image f_{c′}(x) cannot be generated by displacing f_c(x), it (and its displaced images) will be classified into a different category. Then suppose that by observing the output a, we are to guess the category c to which the input image belongs. As a measure of the amount of obtained information, we adopt the following mutual information, which is relevant to this situation; that is, the mutual information between the output a and the category c (Okajima, 1998b):

MI[w] = H[a] − ⟨H[a|c]⟩_c.   (3.1)
Here, ⟨·⟩_c denotes an averaging operation over c, and H[a] and H[a|c] respectively denote the entropy and the conditional entropy, which are defined as

H[a] = −Σ_a P(a) log[P(a)]   (3.2)

H[a|c] = −Σ_a P(a|c) log[P(a|c)]   (3.3)
using the probability P. This MI represents the average amount of obtained information with regard to which category of pattern c is presented. Another possible choice for the measure might be the mutual information between a and the image itself, f. But this would require the RF to discriminate a feature (such as a bar or an edge of a certain orientation) from a displaced one even when the feature itself is the same. In order to derive the shift-invariant RF of a complex cell, it is crucial to use this MI as the measure. By adding the localization term for obtaining a localized RF, our objective function is written as

λ = MI[w] − μ Σ_x u(x)|w(x)|² / ‖w‖².   (3.4)
Here, u(x) is a localization potential that has a minimum at x = 0 (see Figure 1), and μ (> 0) is a constant that determines the relative weight of the localization term. We will see that the term MI does not depend on ‖w‖. Therefore, the localization term is normalized by 1/‖w‖² to make the relative weight unchanged irrespective of ‖w‖. Now our problem is to find the w that maximizes this objective function.

4 Information Maximization in the Low SNR Limit
Unfortunately, computing the entropy or the mutual information is in general not easy. Therefore, here we evaluate the mutual information MI assuming the low signal-to-noise ratio (SNR) limit (Bialek, 1992; Okajima, 1998b). The strategy is as follows. First, when the signal f is zero, the output depends only on the noise and is written as

a_n = |(w, n)|² = n1² + n2²;  n1 ≡ (w1, n),  n2 ≡ (w2, n).  (4.1)

When we have a nonzero signal, the output is slightly varied as

a_n → a = (n1 + εs1)² + (n2 + εs2)²;  εs1 ≡ (w1, εf),  εs2 ≡ (w2, εf),  (4.2)

where the signal is written as εf instead of f to indicate that it is small.
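The quantity being maximized, MI = H[a] − ⟨H[a|c]⟩_c applied to the energy output of equation 4.2, can be estimated numerically. The following sketch is an illustration only, with invented toy parameters (two hypothetical categories, one carrying a signal projection and one not); it forms a plug-in estimate of the MI from a binned histogram and is not the paper's actual numerical procedure.

```python
import math
import random

random.seed(1)
eps, sigma_n = 0.5, 1.0                 # low SNR regime: eps < sigma_n
cats = {0: (2.0, 0.0), 1: (0.0, 0.0)}   # hypothetical categories: (s1, s2) projections

def output(c):
    # a = (n1 + eps*s1)^2 + (n2 + eps*s2)^2, as in equation 4.2
    s1, s2 = cats[c]
    n1 = random.gauss(0.0, sigma_n)
    n2 = random.gauss(0.0, sigma_n)
    return (n1 + eps * s1) ** 2 + (n2 + eps * s2) ** 2

samples = [(output(c), c) for c in cats for _ in range(20000)]
amax = max(a for a, _ in samples)
nbins, n = 20, len(samples)

# joint histogram over (binned output, category)
joint = {}
for a, c in samples:
    b = min(int(nbins * a / amax), nbins - 1)
    joint[(b, c)] = joint.get((b, c), 0) + 1
pa, pc = {}, {}
for (b, c), m in joint.items():
    pa[b] = pa.get(b, 0) + m
    pc[c] = pc.get(c, 0) + m

# plug-in estimate of MI[a; c] = H[a] - <H[a|c]>_c (in nats)
mi = sum((m / n) * math.log((m / n) / ((pa[b] / n) * (pc[c] / n)))
         for (b, c), m in joint.items())
```

With two categories the estimate is bounded between 0 and log 2 nats, and at low SNR it stays close to zero, which is why the perturbative expansion below is needed.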
552
Kenji Okajima and Hitoshi Imaoka
Then we expand the entropy and the conditional entropy in ε:

H[a] ≈ H[a_n] + δH(ε),
H[a|c] ≈ H[a_n] + δH′_c(ε),  (4.3)

and we obtain the mutual information by

MI ≈ δH(ε) − ⟨δH′_c(ε)⟩_c.  (4.4)
The point here is that we need not take into account all the higher-order statistics of the signal in this expansion. For example, when we are to evaluate MI up to the fourth order in ε, we need to take into account only the statistics up to the fourth order. If we carry out this expansion, after some calculations (see the appendix) we obtain

MI ≈ (ε/σ_n)⁴ (A[w] + B[w]);  ‖w‖² = 1,  (4.5)

where σ_n² denotes the variance of the noise, and the term A[w] is written as

A[w] = (1/2) Σ_{k1,k2} |w̃(k1)|² |w̃(k2)|² (⟨|F(k1)|²|F(k2)|²⟩ − ⟨|F(k1)|²⟩⟨|F(k2)|²⟩).  (4.6)
Here, F and w̃ denote the Fourier transforms of the input image f and the RF w, respectively, and ⟨·⟩ denotes an averaging operation over images (since the variables inside the brackets do not depend on position, this averaging operation is equivalent to that over "categories," c). Term B[w] is mainly due to the higher-order statistics, and its explicit form is described in the appendix. A point that should be made here is that B[w] takes its maximum, which is around zero, when the real and imaginary parts of w are orthonormal. Since we can see that MI does not depend on ‖w‖², it is normalized as ‖w‖² = 1 in equation 4.5. Finally, our objective function is written as

λ ≈ (1/2)(ε/σ_n)⁴ Σ_{k1,k2} |w̃(k1)|² |w̃(k2)|² (⟨|F(k1)|²|F(k2)|²⟩ − ⟨|F(k1)|²⟩⟨|F(k2)|²⟩)
  − μ Σ_x u(x)|w(x)|² + (ε/σ_n)⁴ B[w];  ‖w‖² = 1.  (4.7)
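Note that term A[w] in equation 4.6 can be rewritten as one half the variance, over the image ensemble, of the quadratic form Σ_k |w̃(k)|²|F_c(k)|², so it is always nonnegative. A quick numerical check of that identity, using a toy one-dimensional white-noise ensemble invented for illustration (not the paper's two-dimensional filtered images; NumPy assumed available):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_img = 16, 200
images = rng.normal(size=(n_img, n_pix))     # toy 1-D image ensemble
P = np.abs(np.fft.fft(images, axis=1)) ** 2  # |F_c(k)|^2 for each image c

w = rng.normal(size=n_pix) + 1j * rng.normal(size=n_pix)
wt2 = np.abs(np.fft.fft(w)) ** 2             # |w~(k)|^2

# A[w] as the double sum of equation 4.6 over the spectral covariance ...
C = np.cov(P.T, bias=True)                   # Cov(|F(k1)|^2, |F(k2)|^2)
A_sum = 0.5 * wt2 @ C @ wt2

# ... equals half the variance of the quadratic form sum_k |w~(k)|^2 |F_c(k)|^2
A_var = 0.5 * np.var(P @ wt2)

assert A_sum >= 0.0
assert np.isclose(A_sum, A_var)
```

The variance form makes clear that A[w] rewards receptive fields whose spectral weighting tracks the fluctuating part of the image power, which is the mechanism exploited in section 4.1.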
4.1 Optimal Receptive Field. First, we show numerical calculation results. Figure 3 shows the solutions w1 and w2, which are obtained by maximizing the objective function. These are well approximated by Gabor functions. The fitting results show that w1 and w2 have almost the same frequency, but their phases are about 90 degrees apart from each other (see the caption
Figure 3: A numerical solution that maximizes the objective function 4.8: (a) w1 = Re[w], (b) w2 = Im[w]. Each of w1 and w2 is well approximated by a Gabor function, w_i = c_i exp(−(x²/2σ_xi² + y²/2σ_yi²)) cos(2πf_i x + θ_i), with f1 = 0.123, f2 = 0.127, θ1 = −0.0042, θ2 = −1.57, c1 = 0.284, c2 = 0.293, σ_x1 = 2.02, σ_x2 = 1.99, σ_y1 = 1.87, σ_y2 = 1.92 (in arbitrary units except for θ1 and θ2, which are in radians). In this calculation, we adopted for the input images an uncorrelated gaussian signal filtered through an isotropic DoG filter, DoG(x, y) = (1/2πσ_ex²) exp{−(x² + y²)/(2σ_ex²)} − (1/2πσ_inh²) exp{−(x² + y²)/(2σ_inh²)}, with σ_ex = 0.95 and σ_inh = 2.86 in the same unit as σ_x1, and an exponential function exp{(x² + y²)/(2σ_w²)} (σ_w = 2.7) for the localization potential. We assume that when an input image f_c(r) is presented, its displaced images f_c(r − r0) will also be presented with the same probability, and all these images are supposed to be classified into the same category c. Maximization of the objective function, equation 3.4, is performed here by maximizing the objective function, equation 4.8, where the power spectrum is calculated as p(k) = |D̃oG(k_x, k_y)|² = {exp(−|k|²/2σ_k,ex²) − exp(−|k|²/2σ_k,inh²)}² (σ_k,ex = 1.05, σ_k,inh = 0.35), and the relative weight is set to μ(σ_n/ε)⁴ = 0.01. Although the power spectrum is isotropic, we obtain an orientation-selective RF as a solution. We expect that the frequency of the Gabor function is determined by the frequency at which the power spectrum p(k) takes its maximum (see text). With the above power spectrum, this is calculated to be |k0| = 2π × 0.124, which agrees with the numerical solution. Even when the DoG filter is replaced by a different filter, we also obtain a complex Gabor function–like solution, but its frequency may vary according to the peak frequency of the power spectrum.
to Figure 3). This means that the optimal complex RF can be approximated by a complex Gabor function, which agrees with the energy model for a complex cell. In this numerical calculation, we adopted an uncorrelated gaussian signal filtered through an isotropic DoG (difference of gaussians) filter for the input images and an exponential function exp(c|x|²) for the localization potential. But the result is not sensitive to these specific conditions. We obtain similar results when we change these conditions.
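The claim that the Gabor frequency tracks the peak of the power spectrum can be checked directly from the numbers in the Figure 3 caption. A small sketch using the caption's frequency-domain widths σ_k,ex = 1.05 and σ_k,inh = 0.35 (any positive power of the DoG spectrum has the same argmax, so the exact exponent in p(k) does not matter here):

```python
import math

def dog_spectrum(k, s_ex=1.05, s_inh=0.35):
    # difference-of-gaussians amplitude spectrum from the Figure 3 caption
    return math.exp(-k**2 / (2 * s_ex**2)) - math.exp(-k**2 / (2 * s_inh**2))

# locate the peak |k0| on a fine grid; the caption reports |k0| = 2*pi*0.124
ks = [i * 1e-4 for i in range(1, 40001)]
k0 = max(ks, key=dog_spectrum)
print(k0 / (2 * math.pi))  # close to 0.124, matching the fitted Gabor frequencies f1, f2
```

The grid search recovers a peak near 2π × 0.124 rad per unit length, in agreement with the fitted frequencies f1 = 0.123 and f2 = 0.127 cycles per unit length.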
Figure 4: Response properties of a theoretically calculated complex RF. (a) Responses to a bright (or dark) spot. In this receptive field, the bright region is responsive to a spot of light. (b) Orientation selectivity as represented by the response to an oriented grating pattern (orientation in degrees).
Figure 4 shows the response properties of the theoretically obtained complex RF. As expected from the energy model, its response to a spot of light shows no distinct excitatory-inhibitory subregions, but it still shows orientation selectivity. It should be emphasized that the input signal statistics adopted in the numerical calculation are isotropic, yet we still obtain an orientation-selective RF as a solution. The selected orientation of the solution is determined by the initial condition of the numerical calculation. Next, let us consider intuitively why such an energy mechanism of a complex cell can increase the obtained information. For simplicity, suppose here that the RF is approximated by a simple plane wave w(x) ∝ exp(−ikx) instead of a localized complex Gabor function. Then, if the noise is small, the output a is written as a power of the Fourier transform of the input image, a ≈ |(w, f_{c,x0})|² ∝ |F_c(k)|². As is known, this power remains constant even when the pattern is displaced, provided the pattern c is unchanged. Accordingly, in this case, the conditional entropy H[a|c] (≥ 0) becomes almost zero, which results in an increase in the obtained information MI = H[a] − ⟨H[a|c]⟩_c. It is speculated that a similar mechanism makes a complex Gabor function the solution of the infomax problem. Its shift-invariant property decreases the term ⟨H[a|c]⟩_c, which in turn increases the obtained information MI. In this article, we evaluate MI in the low SNR limit for a technical reason. But the intuitive consideration suggests that a similar result holds over a wider range of SNRs.
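The shift-invariance argument in the plane-wave approximation rests on the fact that |F(k)|² is unchanged by circular displacement of the pattern. A self-contained check with a small hand-rolled DFT (the toy 8-sample pattern is invented for illustration):

```python
import cmath

def dft(x):
    # direct O(N^2) discrete Fourier transform of a real sequence
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

f = [0.0, 1.0, 3.0, 1.0, 0.0, -1.0, -2.0, 0.5]   # a toy 1-D "image"
shifted = f[3:] + f[:3]                           # circular displacement by 3 samples

power = [abs(z) ** 2 for z in dft(f)]
power_shifted = [abs(z) ** 2 for z in dft(shifted)]

# |F(k)|^2 is identical for the pattern and its displaced copy: the shift only
# multiplies each Fourier coefficient by a unit-modulus phase factor, so an
# energy output a ~ |(w, f)|^2 with a plane-wave RF is shift invariant
assert all(abs(p - q) < 1e-8 for p, q in zip(power, power_shifted))
```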
Figure 5: A power spectrum p(k) with a maximum at k0 (schematic representation).
Next, let us consider the frequency. Suppose, for simplicity, that the input image statistics are gaussian. Then the fourth-order correlations ⟨|F(k1)|²|F(k2)|²⟩ in equation 4.7 can be rewritten by using only the second-order correlations ⟨|F(k)|²⟩, and the objective function is written as

λ ≈ (ε/σ_n)⁴ Σ_k p(k)² |w̃(k)|⁴ − μ Σ_x u(x)|w(x)|² + (ε/σ_n)⁴ B[w];  ‖w‖² = 1.  (4.8)
Here p(k) = ⟨|F(k)|²⟩ denotes the power spectrum of the input images. From this equation, we see that the first term becomes large when the frequency content of the RF w̃ is concentrated around the frequency k0 at which the power spectrum takes its maximum (see Figure 5). Thus, the first term tends to localize the RF in the frequency domain. Since the second term localizes the function in the space domain, a Gabor function (of frequency k0) that is localized in both the space and frequency domains becomes the optimal RF (see the appendix). Term B has the effect of making the real and imaginary parts of w mutually orthogonal, and we obtain a complex Gabor function–like RF as a solution.

5 Discussion and Summary
In order to obtain a localized RF, we assumed a localization term in the objective function. Deriving a localized RF only from an information-theoretic consideration remains a problem for future work. Here we would like to describe the intuitive idea. If two regions in the visual field are located far from each other, local images falling into these regions must be mutually independent. We think the theory must be refined to take this into account. Let us consider here a simplified problem, as follows. Suppose we have two hypothetical screens, a and b, and observe a feature β extracted from mutually independent images f_a(x) and f_b(x) on each screen using two complex RFs
w_a = w_a1 + i w_a2 and w_b = w_b1 + i w_b2 as

β = |(w_a, f_a + n_a) + (w_b, f_b + n_b)|².  (5.1)

Here again, we classify displaced images f_{c_a or c_b}(x − x0) into the same category as f_{c_a or c_b}(x), and by observing β, we are supposed to guess the categories c_a and c_b to which the images f_a and f_b, respectively, belong. In this case, the optimal RFs w_a and w_b are determined by maximizing the mutual information between β and {c_a, c_b}, that is, MI[β; c_a, c_b] = H[β] − ⟨H[β|c_a, c_b]⟩_{c_a,c_b}. Now, if we assume the images f_a and f_b are shown (and displaced) independently, we can show that the solution maximizing this mutual information is a localized one (localized within one of the screens; that is, ‖w_a‖² = 1 and ‖w_b‖² = 0, or ‖w_a‖² = 0 and ‖w_b‖² = 1). This is because the mutual information MI[β; c_a, c_b] is calculated (see the appendix) as

MI[β; c_a, c_b] ≈ (ε/σ_n)⁴ (A[w_a] + B[w_a] + A[w_b] + B[w_b]);  ‖w_a‖² + ‖w_b‖² = 1,  (5.2)
where the terms A and B are defined respectively by equations 4.6 and A.4 and are fourth order in w_a or w_b, while the constraint is ‖w_a‖² + ‖w_b‖² = 1 (note that MI[β; c_a, c_b] for w_a = w0 and w_b = 0 is twice as much as that for w_a = w_b = w0/√2, for any w0 satisfying ‖w0‖² = 1). Here, the statistics of f_a and f_b are assumed to be the same. Thus, the above consideration suggests that in order to derive a localized RF, we must modify the image model and the measure for the obtained information. That is, we should assume that if a certain image f_c(x) is shown, its locally displaced images (i.e., images obtained by slightly displacing each part—each object—of the image independently) are also shown, and we classify all these images into the same "category" as f_c(x). Then, the mutual information between the output and the above-defined "category" must be maximized. Further discussion based on this idea to derive the Gabor function–like localized RF is described elsewhere (Okajima & Imaoka, 1999). Of course, it is possible that in the visual cortex, RF localization is caused by other reasons, for example, because widespread arborizations are costly for neurons.

We compared the result with a simple cell–like RF. Okajima (1998b) considered a linear feature a_s = (w_s, f), which is obtained by an RF w_s. In this case, the mutual information MI_s = H[a_s] − ⟨H[a_s|c]⟩_c is calculated as

MI_s = (1/4)(ε/σ_n)⁴ Σ_{k1,k2} |w̃(k1)|² |w̃(k2)|² (⟨|F(k1)|²|F(k2)|²⟩ − ⟨|F(k1)|²⟩⟨|F(k2)|²⟩)  (5.3)
in the low SNR limit. By adding the localization term to this MI_s, a simple cell–like RF (a Gabor function) is obtained as a solution of the maximization problem.
When we compare this with equations 4.5 and 4.6, we see that apart from term B, MI for a complex cell is twice as much as that for a simple cell. Since the absolute value of term B becomes small for the complex Gabor function solution, we see that the optimal complex RF can extract almost twice as much information as a simple cell RF. This factor of two arises from the fact that the term A in equation 4.6 is derived by expanding log σ² − ⟨log σ_c²⟩_c in the SNR (σ² ≡ ⟨(w, f + n)²⟩ = σ_n²‖w‖² + ε² Σ_k |w̃(k)|²⟨|F(k)|²⟩, σ_c² ≡ σ_n²‖w‖² + ε² Σ_k |w̃(k)|²|F_c(k)|²; see the appendix), while MI_s in equation 5.3 is obtained by expanding (1/2)(log σ² − ⟨log σ_c²⟩_c) in the SNR (here, the factor of 1/2 comes from that in (1/2) log(2πeσ²), which is the entropy of the gaussian distribution; see Okajima, 1998b). Complex cell RFs are generally larger than those of simple cells at a given eccentricity; accordingly, it has been suggested that a complex cell RF should be modeled by several quadrature pairs of simple cells with spatially shifted but overlapping RFs. To extend our approach taking this into account, first we must generalize the complex cell response, equation 1.1, replacing it by the squared sum of the outputs of 2n linear operators. Also, it seems necessary to refine the theory so that it can deal with RF localization without assuming the localization term. Let us finally summarize the results. The optimal complex RF is determined by infomax. Here, as a measure for the obtained information, we adopted the mutual information between the output and the category. We assumed the low SNR limit and also the localization term in the objective function. The optimal complex RF obtained from the information-maximization problem was similar to a complex Gabor function, in accord with the energy model for a complex cell. In addition, it was found that the optimal complex RF can extract almost twice as much information as a simple cell RF.
Appendix

A.1 Derivation of Equation 4.5. We assume ⟨f⟩ = 0 here. Since we are assuming the noise to be (uncorrelated) zero-mean gaussian, the entropy of a = (n1 + εs1)² + (n2 + εs2)² coincides with that of (−n1 + εs1)² + (−n2 + εs2)² = (n1 − εs1)² + (n2 − εs2)², and thus is an even function of ε. Therefore, we will expand MI in ε². Let us define m1 and m2 as m1 = n1 + εs1 and m2 = n2 + εs2. By an orthogonal transformation (m1*, m2*)^t = T(m1, m2)^t (the superscript t denotes transposition), we first diagonalize the covariance matrix as ⟨(m1*, m2*)^t (m1*, m2*)⟩ = diag{⟨m1*²⟩, ⟨m2*²⟩}. Introducing (m̃1, m̃2) ≡ (m1*, m2*)/σ, where σ² ≡ ⟨m1²⟩ + ⟨m2²⟩ (i.e., ⟨m̃1²⟩ + ⟨m̃2²⟩ = 1), the output a is written as

a = m1² + m2² = σ²(m̃1² + m̃2²) ≡ σ²a′.  (A.1)

Therefore, we obtain

H[a] = log σ² + H0[a′].  (A.2)
Here, σ² is calculated as σ² = σ_n²‖w‖² + ε² Σ_k |w̃(k)|²⟨|F(k)|²⟩, where σ_n² is the variance of the noise (we used the equality ⟨|(w, εf)|²⟩ = ε² Σ_k |w̃(k)|²⟨|F(k)|²⟩, which is valid when ⟨f f^t⟩ is shift invariant). The entropy H0[a′] (a′ ≡ m̃1² + m̃2²) is determined by the distribution P(m̃1, m̃2). Similarly, the conditional entropy is written as

H[a|c] = log σ_c² + H0[a′|c],  (A.3)

where σ_c² = σ_n²‖w‖² + ε² Σ_k |w̃(k)|²|F_c(k)|². Then term A[w] in equation 4.5 is obtained by expanding log σ² − ⟨log σ_c²⟩_c in ε² (in this expansion, terms of order o(ε²) cancel each other out, and since the result does not depend on ‖w‖, it is normalized to unity in equation 4.6). Meanwhile, term B[w] is obtained by expanding H0[a′] − ⟨H0[a′|c]⟩_c. In this expansion, we note that when the signal is zero, P(m̃1, m̃2) becomes a gaussian, which is completely specified by specifying, for example, ⟨m̃1²⟩ ≡ ζ (then ⟨m̃2²⟩ is written as 1 − ζ). Therefore, in this case, H0[a′] is a function of ζ. When we have a nonzero signal, m̃1 and m̃2 will have nonzero higher-order cumulants, and H0[a′] will also depend on them. Thus the expansion is written as

H0[a′] − ⟨H0[a′|c]⟩_c ≈ (ε⁴/σ_n⁴) B[w] = B1 − ⟨B2(c)⟩_c,  (A.4)
where

B1 = (ε⁴/σ_n⁴) [ (∂H0/∂κ̃41)⟨s1*⁴⟩_c + (∂H0/∂κ̃42)⟨s1*²s2*²⟩_c + (∂H0/∂κ̃43)⟨s2*⁴⟩_c ]
   + [ (∂²H0/∂ζ²)(∂ζ/∂(ε²))² + (∂H0/∂ζ)(∂²ζ/∂(ε²)²) ] ε⁴,  (A.5)
and B2(c) is also defined by the right-hand side of equation A.5, but with all the averaging operations replaced by those under the conditional probability P(f|c). In equation A.5, (s1*, s2*)^t ≡ T(s1, s2)^t, and κ̃41 ≡ ⟨m̃1⁴⟩_c, κ̃42 ≡ ⟨m̃1²m̃2²⟩_c, κ̃43 ≡ ⟨m̃2⁴⟩_c; ⟨s1*⁴⟩_c, ⟨s1*²s2*²⟩_c, and ⟨s2*⁴⟩_c are the fourth-order cumulants, which are written in terms of moments as, for example, ⟨s1*⁴⟩_c = ⟨s1*⁴⟩ − 3⟨s1*²⟩² and ⟨s1*²s2*²⟩_c = ⟨s1*²s2*²⟩ − ⟨s1*²⟩⟨s2*²⟩ − 2⟨s1*s2*⟩² (see, e.g., Kubo, 1962). In the expansion A.4, terms of order o(ε²) cancel each other out, and since here again the result does not depend on ‖w‖, it is normalized to unity in equation A.5. The derivative of H0 with respect to ⟨m̃1³m̃2⟩_c or ⟨m̃1m̃2³⟩_c becomes zero, so we need not take them into account. In solving the maximization problem, derivatives of H0 were numerically evaluated. For example, to evaluate ∂H0/∂κ̃41, we used the following distribution, which is almost gaussian but has a small fourth-order cumulant
Figure 6: Numerically calculated ∂H0/∂κ̃41 as a function of ζ′ ≡ (2ζ − 1)² (ζ ≥ 1/2).
κ̃41 (|κ̃41| ≪ 1, κ̃41 < 0):

P(m̃1, m̃2) = ∫∫ dξ1 dξ2 exp(−ζξ1²/2 − (1 − ζ)ξ2²/2 + κ̃41ξ1⁴/4!)
             × exp(iξ1m̃1 + iξ2m̃2).  (A.6)
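The fourth-order cumulants that drive this perturbation vanish for gaussian variables, which is why B[w] reflects only the non-gaussian part of the signal statistics. A quick check of the single-variable moment formula ⟨x⁴⟩_c = ⟨x⁴⟩ − 3⟨x²⟩² on gaussian samples (the sample size is chosen arbitrarily for illustration):

```python
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(200000)]

m2 = sum(x * x for x in xs) / len(xs)
m4 = sum(x ** 4 for x in xs) / len(xs)
kappa4 = m4 - 3.0 * m2 ** 2   # fourth-order cumulant via the moment formula

# for gaussian data the fourth cumulant is zero up to sampling error
assert abs(kappa4) < 0.1
```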
From the κ̃41 dependence of the numerically calculated entropy H0[m̃1² + m̃2²], ∂H0/∂κ̃41 was determined. Figure 6 shows ∂H0/∂κ̃41 as a function of ζ′ ≡ (2ζ − 1)². We see that ∂H0/∂κ̃41 takes its maximum at ζ′ = 0. This means that ∂H0/∂κ̃41 takes its maximum when w1 and w2 are orthonormal, because ζ′ is written as ζ′ = (‖w1‖² − ‖w2‖²)² + 4(w1, w2)² when ε = 0. Term B[w] thus causes w1 and w2 to be orthonormal.

A.2 Maximization of λ′ = (ε/σ_n)⁴ Σ_k p(k)²|w̃(k)|⁴ − μ Σ_x u(x)|w(x)|² under ‖w‖² = 1. To solve the maximization problem, we consider the maximization of

λ″ = Σ_k p′(k)|w̃(k)|² − μ Σ_x u(x)|w(x)|²,  (A.7)

where p′(k) is determined self-consistently by

p′(k) = 2(ε/σ_n)⁴ p(k)²|w̃(k)|²  (A.8)

(note that (ε/σ_n)⁴ δ(Σ_k p(k)²|w̃(k)|⁴) = (ε/σ_n)⁴ Σ_k 2p(k)²|w̃(k)|² δ(|w̃(k)|²) holds).
We expand p′(k) and u(x) as p′(k) ≈ p′0 − a(k − k0)² and u(x) ≈ u_min + bx². Then λ″ is written as

λ″ ≈ −a Σ_k (k − k0)²|w̃(k)|² − μb Σ_x x²|w(x)|² + const.
   ≡ −(aΔk² + μbΔx²) + const.,  (A.9)

where Δk² and Δx² denote the corresponding second moments of |w̃(k)|² and |w(x)|². From the obvious inequality (√a Δk − √(μb) Δx)² ≥ 0 and the uncertainty relation ΔxΔk ≥ 1/2,

aΔk² + μbΔx² ≥ 2√(aμb) ΔxΔk ≥ √(aμb)  (A.10)
holds, and the right-hand side of equation A.9 takes its maximum when w(x) is a complex Gabor function that satisfies √a Δk = √(μb) Δx, that is, when w(x) ∝ exp(−√(μb/a) x²/2 + ik0x) (Okajima, 1998a). This is because a complex Gabor function achieves the theoretical lower limit of one-half for the product ΔxΔk (Gabor, 1946; Daugman, 1985). Since a depends on w̃(k), it must be determined self-consistently, but still the solution becomes a complex Gabor function of frequency k0.

A.3 Additional Explanation of Equation 5.2. By performing a similar
calculation as in section A.1, we obtain

MI[β; c_a, c_b] ≈ (ε/σ_n)⁴ (A[w_a] + B[w_a] + A[w_b] + B[w_b]) / (‖w_a‖² + ‖w_b‖²)²,  (A.11)

where A[w] and B[w] are, respectively, defined by equations 4.6 and A.4 and are fourth order in w. Since this MI[β; c_a, c_b] remains unchanged when we change w_a and w_b to cw_a and cw_b (c ≠ 0), ‖w_a‖² + ‖w_b‖² is normalized to unity in equation 5.2. Terms A[w_a] + A[w_b], for example, are obtained by expanding log σ′² − ⟨log σ′²_{c_a,c_b}⟩_{c_a,c_b} in ε², where σ′² ≡ ⟨|(w_a, f_a + n_a) + (w_b, f_b + n_b)|²⟩, which is calculated as

σ′² = σ_n²(‖w_a‖² + ‖w_b‖²) + ε² Σ_k (|w̃_a(k)|² + |w̃_b(k)|²)⟨|F(k)|²⟩

by assuming that f_a and f_b are mutually independent, and σ′²_{c_a,c_b} is defined as

σ′²_{c_a,c_b} = σ_n²(‖w_a‖² + ‖w_b‖²) + ε² Σ_k (|w̃_a(k)|²|F_{c_a}(k)|² + |w̃_b(k)|²|F_{c_b}(k)|²).
Acknowledgments
This work was supported by Special Coordination Funds for Promoting Science and Technology.
References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America, A2(2), 284–299.

Bell, A. J., & Sejnowski, T. J. (1997). Edges are the "independent components" of natural scenes. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 831–844). Cambridge, MA: MIT Press.

Bialek, W. (1992). Optimal signal processing in the nervous system. In W. Bialek (Ed.), Princeton lectures on biophysics (pp. 321–401). Singapore: World Scientific.

Daugman, J. G. (1980). Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, 20, 847–856.

Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of the Optical Society of America, A2, 1160–1169.

Gabor, D. (1946). Theory of communication. Journal of the Institution of Electrical Engineers, 93, 429–457.

Heeger, D. J. (1992). Half-squaring in responses of cat striate cells. Visual Neuroscience, 9, 427–443.

Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.

Kohonen, T. (1994). Self-organizing feature map. New York: Springer-Verlag.

Kubo, R. (1962). Generalized cumulant expansion method. Journal of the Physical Society of Japan, 17, 1100–1120.

Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117.

Marcelja, S. (1980). Mathematical description of the responses of simple cortical cells. Journal of the Optical Society of America, 70, 1297–1300.

Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. Journal of Neurophysiology, 77, 2879–2909.

Okajima, K. (1995). The Gabor-type receptive field of a visual cortical neuron extracts the maximum information from input local images.
In Extended abstracts of the International Conference on BMED, Okinawa, Japan (pp. 290–293).

Okajima, K. (1996). The Gabor-type RF as derived by the mutual-information maximization. In Extended abstracts of the International Workshop on Brainware, Tokyo, Japan (pp. 119–121).

Okajima, K. (1998a). The Gabor function extracts the maximum information from input local signals. Neural Networks, 11, 435–439.

Okajima, K. (1998b). Two-dimensional Gabor-type RF as derived by mutual information maximization. Neural Networks, 11, 441–447.

Okajima, K. (1998c). "Complex receptive field" obtained by infomax coincides with the energy model for complex cells in the visual cortex. In Proceedings of the Fifth International Conference on Neural Information Processing, Kitakyushu, Japan (pp. 657–660).
Okajima, K., & Imaoka, H. (1999). Infomax and receptive field localization. In Proceedings of the 9th Annual Meeting of the Japanese Neural Networks Society, Sapporo, Japan (pp. 151–152).

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Pollen, D. A., & Ronner, S. F. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 907–916.

Received February 22, 1999; accepted May 15, 2000.
LETTER
Communicated by Peter Dayan
Self-Organization of Topographic Mixture Networks Using Attentional Feedback

James R. Williamson
Department of Cognitive and Neural Systems and Center for Adaptive Systems, Boston University, Boston, MA 02115, U.S.A.

This article proposes a neural network model of supervised learning that employs the biologically motivated constraints of using local, on-line, constructive learning. The model possesses two novel learning mechanisms. The first is a network for learning topographic mixtures. The network's internal category nodes are the mixture components, which learn to encode smooth distributions in the input space by taking advantage of topography in the input feature maps. The second mechanism is an attentional biasing feedback circuit. When the network makes an incorrect output prediction, this feedback circuit modulates the learning rates of the category nodes, by amounts based on the sharpness of their tuning, in order to improve the network's prediction accuracy. The network is evaluated on several standard classification benchmarks and shown to perform well in comparison to other classifiers.

1 Introduction
There have been significant advancements in the understanding of the statistical learning properties of artificial neural networks. However, neural networks are usually unconstrained from an implementational point of view. Applying constraints that are motivated by biological implementability makes statistical analysis more difficult but may offer the compensatory advantage of producing models that have more straightforward links with biological systems. In addition, the constraints may provide certain practical advantages. It is with these motivations that the following three constraints were used to guide neural network development:

Local. Activation and learning equations employ only simple computations, using only locally available information. This constraint allows for the possibility of efficient hardware implementation.

On-line. Weights are updated after the presentation of each input pattern. This allows the network to learn in real time.

Constructive. The network automatically constructs, or self-organizes, a representation of sufficient complexity to learn an input-output mapping. This allows for an efficient usage of resources since only the memory required for solving the problem at hand is allocated.

Neural Computation 13, 563–593 (2001)
© 2001 Massachusetts Institute of Technology

564
James R. Williamson

In this article a neural network is proposed that obeys these three constraints. The network learns smooth receptive fields, or basis functions, in the hidden layer that are sensitive to the input density distribution. The receptive fields are learned using simple, local computations that take advantage of a data format suggested by models of cortical development: representing inputs as topographic maps. The network's learning dynamics are also self-corrective in the presence of prediction errors. This is done not by using complicated feedback computations based on an error gradient but rather by using a simple biasing of network activities (and, hence, of learning rates), based on the sharpness of receptive field tuning. These two novel learning mechanisms—learning a mixture density model in the hidden layer based on topographic input maps and biasing the activations of the mixture components using "attentional" feedback—are described in sections 1.1 and 1.2.

1.1 Learning Topographic Mixtures. Models of cortical learning and development typically share two essential computations: center-surround processing within network layers and correlational learning of connections between layers. These computations yield characteristic activity patterns within layers in which nodes positioned near each other are more positively correlated than nodes positioned farther away from each other. Therefore, although all nodes in the same layer may respond to the same feature dimension, such as the orientation of a visual stimulus, nearby nodes tend to respond to nearby regions of the feature space (i.e., similar orientations), whereas nodes spaced farther apart do not.
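The correlational (instar) learning invoked below can be caricatured in a few lines. This is a schematic sketch with invented patterns and rates, illustrating only the core rule that weights track presynaptic feature signals while a category node is active; it is not the article's full network.

```python
import random

random.seed(0)

def instar(w, x, y, eta=0.1):
    # instar rule: while the postsynaptic category node is active (y > 0),
    # its weights move toward the presynaptic feature pattern x
    return [wi + eta * y * (xi - wi) for wi, xi in zip(w, x)]

# topographic feature patterns: overlapping activity bumps on a 5-node map
patterns = [[0.0, 0.5, 1.0, 0.5, 0.0],
            [0.0, 0.0, 1.0, 1.0, 0.0],
            [0.5, 1.0, 0.5, 0.0, 0.0]]

w = [0.0] * 5
for _ in range(500):
    w = instar(w, random.choice(patterns), y=1.0)

# the weights converge to a weighted average of the presented patterns,
# i.e. a smooth receptive field over the topographic map; total weight is
# conserved here because every toy pattern sums to 2.0
assert abs(sum(w) - 2.0) < 1e-6
assert w[4] == 0.0   # a feature node that is never active gains no weight
```

Because the presented bumps overlap, the learned weight vector is itself a smooth bump, which is the sense in which category nodes acquire smooth receptive fields from topographic inputs.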
For example, many models have been proposed of how neurons in primary visual cortex develop their orientationally tuned receptive fields within smoothly varying maps of orientation preference (Durbin & Mitchison, 1990; Obermayer, Blasdel, & Schulten, 1992; Obermayer, Ritter, & Schulten, 1990; Swindale, 1992; Sirosh & Miikkulainen, 1994; Olson & Grossberg, 1998). Figure 1 depicts a simple example of a topographic map in a feature layer, which encodes orientation preference for visual input. Activation levels of the orientationally selective nodes are shown, given a vertical edge as input. The feature layer is analogous to a set of orientation columns in primary visual cortex. Now imagine a set of nodes in a higher category layer, which have normalized activities due to mutual, divisive inhibition (i.e., the net activation across the layer is constant) and receive inputs from the feature layer via adaptive connections that are updated with a type of correlational learning called instar learning (Grossberg, 1976, 1980; Kohonen, 1989). Instar learning causes the connection weights to track the presynaptic signals from the feature nodes when the category nodes are active. The result, depicted in Figure 1, is that the weights to a category node consist of a weighted
Self-Organization of Topographic Mixture Networks
565
Figure 1: Topographic mixture model. The learned paths are shown from a topographic feature layer and an output layer to an internal category layer, as well as reciprocal paths from the category layer to the output layer. In this example, the feature nodes encode orientation preference. The category nodes learn smooth receptive fields, which partition this feature space based on its associations with the class outputs. The dashed line represents divisive, normalizing inhibition within the category layer. The shaded bars represent activity levels.
average of several similar feature patterns, encoding the expected feature pattern conditioned on the category node's activity. The shape of the resulting category receptive field is determined by both the intrinsic spread of a single feature pattern and the spread of the different feature patterns learned by the category node. The category nodes also receive signals from an output layer, which represents class labels. In Figure 1 the labels are vertical and horizontal. These connections are also updated via instar learning. During training mode, feature and output activations are determined, respectively, by feature inputs and supervised class labels. Category activations, which regulate learning rates, are determined by input from these layers, which is combined multiplicatively. Therefore, during training, categories learn conjunctions of the feature and class representations. During test mode, category activations are determined only by inputs from the feature layer. Class output predictions
are then produced via learned connections from the category layer to the output layer, which are symmetrical to the output-to-category connections. Thus, we see that a combination of center-surround processing within the feature layer and correlational learning between layers results in the learning of a mixture model of smooth distributions governing the input-output mapping. In this model, each category node represents a single component of the mixture distribution. Because this circuit relies on topography in the feature layer, the representation that it learns is called a topographic mixture model. A probabilistic interpretation of this learning model is given in section A.3. Mixture models allow supervised learning using density estimation (Ghahramani & Jordan, 1994; Williamson, 1997). The essential idea is to optimize the model's likelihood for representing the joint input-output density of the training data and then to use this mixture model to estimate the conditional distribution in the output space given test inputs. This distribution is then used to obtain output predictions. An advantage of the density estimation approach is flexibility. The density model can be used to obtain predictions in any direction. In other words, the "output" variables merely correspond to the variables that are missing and need to be estimated given the variables that are not missing. Another advantage is locality. The posterior probabilities of the mixture components (i.e., category activations) can be computed using local information. These probabilities are then used to update local model parameters. In a neural network architecture, this yields simpler, more localizable computations than alternative methods such as backpropagation, which use gradient-descent computations based on the output error.

1.2 Attentionally Biased Learning.
One possible problem with a straightforward density estimation approach is that due to inherent limitations in the ability to estimate an accurate model of the input-output density (due to imperfect assumptions underlying the parametric model), maximizing the likelihood of the mixture model may not correspond to maximizing the accuracy of predictions on a test set. In this article a new method of attentional biasing is proposed that alters the learning process in order to improve prediction accuracy. With this method, learning is allowed only when the output prediction satisfies an accuracy criterion. If this criterion is not met, then attentional feedback systematically alters, or biases, the activation levels of the category nodes by different amounts, which depend on their tuning properties. The magnitude of this feedback, which is called the vigilance level, is increased until the accuracy criterion is met. If the criterion is not met for any vigilance level, then a new category node is introduced, expanding the network's capacity. Two aspects of attentional biasing need to be considered: its general goal (that is, how its effect on category nodes relates to their receptive field tuning properties) and its specific effect (that is, how it momentarily alters the
Self-Organization of Topographic Mixture Networks
shape of a node's receptive field). The general goal of attentional biasing is to increase the activations (and hence learning rates) of broadly tuned nodes with respect to those of sharply tuned nodes. This goal is motivated by the fact that sharply tuned nodes, because their activity distribution over the training set exhibits higher kurtosis than that of broadly tuned nodes, are more likely to have a strong influence on the prediction. Therefore, when the network makes an inaccurate prediction, the sharply tuned nodes are more likely to have caused the error, and the mistake should be most easily corrected by reducing their activations with respect to those of broadly tuned nodes. Since learning rates are proportional to activations, this means that regions of the input space that yield prediction errors will tend to be learned with high vigilance levels, causing higher-than-normal learning rates for broadly tuned nodes. This causes the receptive fields of these nodes to zero in on problematic regions of the input space until the errors are eliminated. This general goal of attentional biasing is achieved using learned inhibitory bias weights that modulate the divisive effect of vigilance on a node's activation. A bias weight learns a node's expected input match, which is its receptive field height at the point in the input space represented by the current input. Sharply tuned nodes will thus tend to have large bias weights because their input matches tend to be high. Roughly speaking, the size of a node's bias weight is proportional to the sharpness of its receptive field (quantitative results on this issue are provided in section 2.3). Figure 2a illustrates the relative effect of raising vigilance on two receptive fields given the assumption that the inhibitory weights are inversely proportional to the receptive field widths.
Raising vigilance has a purely modulatory effect, making the broader receptive field higher with respect to the narrow one without changing the widths of either. In a related neural network called gaussian ARTMAP, which is a variant of predictive adaptive resonance theory networks, a vigilance threshold has been used rather than vigilance modulation (Williamson, 1996, 1997; Grossberg & Williamson, 1999). A vigilance threshold has a subtractive rather than a modulatory effect on receptive fields. Figure 2b shows that, like vigilance modulation, a vigilance threshold favors a broadly tuned receptive field over a narrowly tuned one (in both Figures 2a and 2b, the cross-over point moves to the left as vigilance is raised). Unlike vigilance modulation, however, a vigilance threshold causes the receptive fields to become narrower. This narrowing of receptive fields makes the activity pattern across categories less distributed by shutting some nodes off. Historically, the use of a vigilance threshold in gaussian ARTMAP was motivated by the adaptive resonance theory concept of resetting, or shutting off, nodes when in some sense they do not adequately match the input or output pattern (Carpenter & Grossberg, 1987). The vigilance modulation introduced in this article is a modification of that concept. Simulations (not shown here) have shown that vigilance modulation is generally more
Figure 2: Effect of raising vigilance on two receptive fields. In this example, it is assumed that the bias weights for the two category nodes are proportional to their receptive field heights. This relationship typically holds, as illustrated in section 2.3. Two alternative methods for attentional biasing of receptive fields are illustrated. In the vigilance modulation case (top), receptive fields are defined by

x_{ji} = \frac{\sum_{h=1}^{L} f_{ih} w_{jih}}{1 + \rho^2 b_{ji}}.

In the vigilance threshold case (bottom), they are defined by

x_{ji} = \left[ \sum_{h=1}^{L} f_{ih} w_{jih} - \rho^2 b_{ji} \right]^{+}.

Vigilance modulation favors the wider receptive field but does not make either receptive field narrower. A vigilance threshold, on the other hand, favors the wider receptive field while making both receptive fields narrower. See the appendix for an explanation of these variables. For clarity, receptive fields are rescaled in this plot to have the same height at different vigilance levels.
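As a concrete illustration, the two biasing rules in the caption can be sketched in Python. This is a minimal sketch; the example feature vectors, weights, and bias values are hypothetical, not taken from the article:

```python
import numpy as np

def match_modulated(f, w, b, rho):
    """Vigilance modulation (Figure 2, top): divisive inhibition.
    f, w: feature activities and weights over one input dimension;
    b: the node's inhibitory bias weight; rho: vigilance level."""
    return np.dot(f, w) / (1.0 + rho**2 * b)

def match_thresholded(f, w, b, rho):
    """Vigilance threshold (Figure 2, bottom): subtractive inhibition,
    rectified at zero, which narrows the receptive field."""
    return max(np.dot(f, w) - rho**2 * b, 0.0)

# A broadly tuned node (small bias weight) and a sharply tuned node
# (large bias weight) respond to the same input pattern. As rho rises,
# the broad node's activation grows relative to the sharp node's.
f = np.array([0.2, 0.6, 0.2])
w_broad, b_broad = np.array([0.3, 0.4, 0.3]), 0.4
w_sharp, b_sharp = np.array([0.05, 0.9, 0.05]), 0.9

for rho in (0.0, 1.0, 2.0):
    ratio = (match_modulated(f, w_broad, b_broad, rho) /
             match_modulated(f, w_sharp, b_sharp, rho))
    print(rho, ratio)
```

Note that the modulated match never changes sign or support, whereas the thresholded match can shut a sharply tuned node off entirely once rho-squared times its bias weight exceeds its raw match.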
effective than a vigilance threshold. The effectiveness of vigilance modulation suggests that while it is useful to bias nodes differentially based on the sharpness of their tuning, it is generally not useful to carry this process to such an extreme that learning is completely turned off in some nodes.

1.3 Topographic Attentive Mapping Network. The network proposed in this article is called the topographic attentive mapping (TAM) network. This is because its novel contributions are the integral use of topography in the input feature maps, along with the use of attentional feedback to bias learning rates in the category layer, in order to learn an effective input-output mapping. Figure 3 illustrates a TAM network containing two input feature maps, two basis nodes, one category node, and one output node. Note that the category layer in the simple network architecture depicted in Figure 1 has been expanded to include both unidimensional basis nodes encoding a receptive field in each input dimension and multidimensional category nodes encoding the conjunction of all the input dimensions and the output dimension. A vigilance node has also been added, which implements attentional biasing by exerting divisive inhibition on the basis nodes. The variables represent the activity of a node or the size of a synaptic weight. The symbols inside a node denote the mathematical operation computed by that node. A detailed description of the model equations is given in the appendix.

2 Simulations

2.1 Effect on Learning of Raising Vigilance. Figure 2a illustrates the momentary effect on receptive fields of raising vigilance. To investigate the long-term effect of raising vigilance on the learning of receptive fields, the following simulation experiments were performed. For simplicity, only one input dimension and one output class were used. Learning with just one output class is functionally equivalent to unsupervised learning. The network was initialized with 10 category nodes.
Each node's receptive field was centered on a different part of the one-dimensional input space by allowing it to learn, independently of the other nodes, for one learning trial. The value of the input, I_i, for the first learning trial of each category j was I_i = (j - 1)/10. Therefore, the 10 receptive fields were centered at 0.0, 0.1, . . . , 0.9 in the input space, which has a range of [0 : 1]. In addition, equation A.18 was altered to allow wraparound so that receptive fields at the boundaries of the input range would look the same as those in the middle. After this category initialization, learning took place in the normal way, using 1000 randomly selected training inputs, I_i ∈ U[0 : 1]. Figure 4a shows the receptive fields resulting from condition 1, the default condition with the vigilance level, ρ, set to ρ ≡ 0. Figure 4b shows the result of condition 2, in which ρ was elevated in one region of the input space: ρ = 50 whenever 1/3 ≤ I_i ≤ 2/3 and ρ = 0 otherwise. Where ρ was elevated, the
Figure 3: TAM network, shown with two feature dimensions, one category node, and one output node. Dashed lines indicate inhibitory connections. Symbols indicate the mathematical operation computed at each node.
representation became less distributed, with the receptive fields in that area becoming taller and narrower (receptive fields are illustrated by plotting the normalized activations using equation A.7 with ρ = 0). When a category node learns with ρ elevated, each of its bias weights b_ji tracks a smaller match value due to equation A.1, giving the node an even greater advantage the next time ρ is elevated. This leads to a snowballing effect in which the advantaged nodes crowd out the disadvantaged nodes where ρ is elevated. A possible alternative approach for "paying greater attention" to a particular region of the input space is to increase the learning rates in that region monolithically. Figure 4c shows the result of condition 3, in which all learning rates, w_j^(rate), were doubled when 1/3 ≤ I_i ≤ 2/3. As this plot shows, the opposite effect from condition 2 was obtained. In condition 3, the representation became more distributed in the region of the input space with the elevated learning rate. One measure of the discriminability of a point in the input space is a discriminability index, DI, which is the average of the magnitudes of the receptive field slopes at that point. Figure 4d plots the DI for the three repre-
Figure 4: Effects on learning of raising vigilance. See section 2.1 for details. Plotted are normalized activations after (a) training with ρ ≡ 0; (b) training with ρ = 50 if 1/3 ≤ I_i ≤ 2/3 and ρ = 0 otherwise; (c) training with ρ ≡ 0 but with learning rate w_j^(rate) doubled when 1/3 ≤ I_i ≤ 2/3. (d) A discrimination index (DI) is plotted for each of the representations in a–c. The DI is the sum of the magnitudes of the receptive field slopes.
sentations shown in Figures 4a through 4c. Condition 1 produces a roughly flat DI across the input space, reflecting the uniform distribution of the inputs. Condition 2 produces a flat DI everywhere except for the transitions from ρ = 0 to ρ = 50, where the DI is elevated. Finally, condition 3 produces a slightly lowered DI within the region where the learning rate was doubled. Therefore, attentional biasing increases discriminability in regions where there are changes in the average vigilance level. This is done by, instead of increasing the learning rate of all nodes, as in condition 3, increasing it just for those nodes with small bias weights while simultaneously decreasing their bias weights.

2.2 DELVE Classification Benchmarks. The TAM network was evaluated on several classification benchmarks in order to determine how well it performs with respect to other classifiers. All results were obtained using the same set of parameters (see section A.7) following 200 training epochs, or iterations through the training set. The DELVE benchmark collection (Rasmussen et al., 1996) provides an environment for assessing learning methods in a way that is both relevant to real-world problems and allows for statistically valid comparisons with other learning methods. TAM was evaluated on the seven classification benchmarks in the DELVE collection for which there are results from one or more alternative learning methods. From each classification data set, one or more classification tasks is defined involving training sets with different amounts of data. Therefore, it is possible to determine not only how well a learning method handles a particular classification problem, but also how its performance scales with the size of the training set. The data sets are of four different types: natural, cultivated, simulated, and artificial.
Natural data sets were originally gathered for real-world applications; cultivated data sets came from a real-world source but were never used to solve a real problem; simulated data sets were generated by a simulator but are believed to resemble real data; and artificial data sets were generated according to some mathematical formula and are not meant to resemble any real data. TAM was compared to the following alternative learning methods:
CART: a basic decision tree that creates decision boundaries parallel to the input axes (Breiman, Friedman, Olshen, & Stone, 1984)

1NN: one nearest neighbor based on Euclidean distance

KNN-Class: k-nearest neighbor algorithm, in which k is chosen on the basis of leave-one-out cross-validation on the training set

Several mixtures-of-experts (MEs) and hierarchical mixtures-of-experts (HMEs) variants, as follows (Waterhouse, Mackay, & Robinson, 1996):

ME-EL: a committee of MEs trained by ensemble learning
Table 1: Splice.

                 100                  200                  400
Classifier       Error  Significance  Error  Significance  Error  Significance
TAM              .125                 .081                 .062
TAM-C            .161   +             .090   +             .056
CART             .215   +             .140   +             .100   +
1NN              .459   +             .429   +             .393   +
KNN-CLASS        .339   +             .293   +             .263   +
ME-EL            .145   +             .100   +             .076
ME-ESE           .152   +             .097   +             .068
HME-EL           .144                 .099   +             .077
HME-ESE          .154   +             .099   +             .070
HME-GROW         .153   +             .106   +             .072
ME-ESE: a committee of MEs trained by early stopping

HME-EL: a committee of HMEs trained by ensemble learning

HME-ESE: a committee of HMEs trained by early stopping

HME-GROW: a committee of HMEs grown via early stopping

Tables 1 through 6 summarize TAM's performance and provide statistical comparisons with the results of the alternative learning methods. Each column shows results obtained in a single classification task, with the number at the top of the column indicating the number of data items in the training set. In each row is listed the error rate for a given learning method. Next to the error rate is an indication of whether the difference between that method's performance and that of TAM is statistically significant (p < 0.05), with a + indicating that the alternative method obtained a significantly higher error rate and a − indicating that it obtained a significantly lower error rate.

2.2.1 Splice. This natural problem involves, given a position in the middle of a window of 60 DNA sequence elements, predicting if the position is an "intron-exon" boundary, an "exon-intron" boundary, or neither type of boundary. Therefore, a classifier must learn to predict one of three class outputs based on a 60-dimensional input vector, in which each dimension takes on a categorical value, A, G, T, or C. Preprocessing involved assigning numerical values to these categorical data (A = 0, G = 1, T = 2, C = 3) and then mapping these values into feature map distributions as described in section A.6. Table 1 shows the results. On the small and medium training sets, TAM performed significantly better than most of the alternative methods. On the largest training set, TAM performed significantly better than CART and the NN methods and slightly better than the ME and HME variants, although these latter differences are not statistically significant.
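For reference, the leave-one-out model selection used by the KNN-Class comparison method can be sketched as follows. This is a minimal illustration under assumed conventions (function name and candidate list are hypothetical):

```python
import numpy as np
from collections import Counter

def choose_k_loo(X, y, k_candidates):
    """Pick k by leave-one-out cross-validation: classify each
    held-out point by majority vote among its k nearest remaining
    training points (Euclidean distance); keep the k with the
    highest held-out accuracy."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(d, np.inf)                                # exclude the point itself
    order = np.argsort(d, axis=1)
    best_k, best_acc = None, -1.0
    for k in k_candidates:
        votes = [Counter(y[order[i, :k]]).most_common(1)[0][0]
                 for i in range(len(y))]
        acc = np.mean(np.array(votes) == y)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k

# Example: two well-separated clusters; k = 1 already classifies
# every held-out point correctly, so it is selected.
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
print(choose_k_loo(X, y, k_candidates=[1, 3]))  # prints 1
```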
TAM's superior performance may result partly from the preprocessing, which, due to overlapping feature distributions for different categorical attributes, converts categorical data into a numerical scale. As discussed in section A.3, categorical data can be represented discretely by making the distributions nonoverlapping. Accordingly, TAM was also evaluated by replacing equation A.18 with the following equation, which results in narrow, nonoverlapping feature map distributions whose widths with respect to the input range of [0 : 1] are σ = 1/(10L) rather than σ = 1/L:

f_{ih} = \frac{\exp[-5(L I_i - h + 0.5)^2]}{\sum_{h'=1}^{L} \exp[-5(L I_i - h' + 0.5)^2]}.   (2.1)
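A sketch of this encoding in Python (the function and parameter names are illustrative, not from the article; the gain of 5 matches the exponent in equation 2.1):

```python
import numpy as np

def feature_map(I, L, gain=5.0):
    """Encode a scalar input I in [0, 1] as a normalized activity
    pattern over L feature nodes, following equation 2.1 (the narrow
    'categorical' variant; the default TAM maps use wider bumps)."""
    h = np.arange(1, L + 1)
    a = np.exp(-gain * (L * I - h + 0.5) ** 2)
    return a / a.sum()

# Activity is concentrated on the node whose center is nearest L*I.
f = feature_map(0.35, L=10)
print(f.round(3))
```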
When TAM uses "categorical" feature maps due to preprocessing with equation 2.1 rather than equation A.18, it is referred to as TAM-C. Table 1 compares TAM-C's performance to that of TAM. On the small and medium training sets, TAM-C did not perform as well. Apparently TAM's overlapping input distributions provided an additional benefit, although the input data have a categorical interpretation. On the largest data set, however, TAM-C performed slightly better than TAM, perhaps because its nonoverlapping input distributions precluded crosstalk.

2.2.2 Titanic. This natural problem involves learning to predict whether a person on board the Titanic survived based on social class (first class, second class, third class, crew member), age (adult or child), and sex. Preprocessing involved assigning numerical values to these categorical data (first class = 0, second class = 1, third class = 2, crew member = 3; adult = 0, child = 1; and male = 0, female = 1) and then converting these data into feature distributions as described in section A.6. Because the input data are categorical, TAM-C was also evaluated. As Table 2 shows, TAM and TAM-C obtained similar results except that TAM did better on the smallest training set. TAM obtained significantly better results than 1NN on all training set sizes. It obtained slightly better results than the other methods as well, although these differences are not significant.

2.2.3 Image-seg. This cultivated problem involves, given 16 continuous attributes derived from 3 × 3 pixel regions from outdoor images, predicting which region class (brickface, sky, foliage, cement, window, path, grass) it came from. Table 3 shows the results. Only a couple of the differences are statistically significant; however, TAM performed slightly worse than the ME and HME variants on the medium and large training sets.

2.2.4 Letter.
This simulated problem involves predicting which of the 26 uppercase letters an image came from, given 16 simple statistical features derived from the image, with each feature taking on one of 16 integer values. Table 4 shows the results. TAM performed significantly better than CART
Table 2: Titanic.

                 20                   40                   80                   160
Classifier       Error  Significance  Error  Significance  Error  Significance  Error  Significance
TAM              .271                 .252                 .225                 .222
TAM-C            .285   +             .244                 .231                 .218
CART             .316                 .262                 .234                 .247
1NN              .313   +             .322   +             .312   +             .305   +
KNN-CLASS        .284                 .269                 .258                 .243
Table 3: Image-seg.

                 70                   140                  280
Classifier       Error  Significance  Error  Significance  Error  Significance
TAM              .319                 .341                 .311
CART             .361                 .348                 .288
1NN              .338                 .331                 .322
KNN-CLASS        .353                 .333                 .322
ME-EL            .309                 .317                 .286
ME-ESE           .320                 .337                 .282   −
HME-EL           .311                 .306   −             .283
HME-ESE          .319                 .313                 .283
HME-GROW         .317                 .321                 .285
and significantly worse than virtually all the other methods on all three training set sizes.

Table 4: Letter.

                 390                  780                  1560
Classifier       Error  Significance  Error  Significance  Error  Significance
TAM              .396                 .308                 .226
CART             .501   +             .410   +             .327   +
1NN              .351   −             .262   −             .186   −
KNN-CLASS        .351   −             .262   −             .186   −
ME-EL            .322   −             .262   −             .195   −
ME-ESE           .302   −             .252   −             .185   −
HME-EL           .314   −             .246   −             .195   −
HME-ESE          .312   −             .240   −             .190   −
HME-GROW         .345   −             .276   −             .218

2.2.5 Mushrooms. This artificial problem involves, given 21 nominally valued attributes describing the characteristics of mushrooms, predicting
Table 5: Mushrooms.

                 64                   128                  256                  512                  1024
Classifier       Error  Significance  Error  Significance  Error  Significance  Error  Significance  Error  Significance
TAM              .052                 .025                 .015                 .006                 .002
TAM-C            .055                 .031                 .014                 .005                 .001
KNN-CLASS        .041                 .016                 .007   −             .003                 .002
Table 6: Ringnorm and Twonorm.

                 Ringnorm (300)       Twonorm (300)
Classifier       Error  Significance  Error  Significance
TAM              .043                 .033
CART             .214   +             .226   +
1NN              .361   +             .066   +
KNN-CLASS        .361   +             .027   −
whether the mushroom is edible or poisonous. Preprocessing involved assigning integer values to the attributes (which range in number from 2 to 12 per dimension) by replacing each attribute with the number of its index in the documentation lists. Then these data were converted into feature distributions as described in section A.6. Because the input data are categorical, TAM-C was also evaluated. Table 5 shows the results. As in the Splice data set, TAM performed slightly better than TAM-C on the smaller training sets and slightly worse on the larger training sets, although the differences are not significant. TAM performed slightly worse than KNN-CLASS on all the training sets except the largest one.

2.2.6 Ringnorm. This artificial problem involves, given two output classes drawn from different 20-dimensional multivariate normal distributions, predicting which class each datum belongs to. Class 1 has mean zero and covariance four times the identity. Class 2 has mean (a, a, . . . , a) and unit covariance, where a = 2/√20. Breiman et al. (1984) report a theoretical expected error rate of 0.013. Table 6 shows that TAM performed significantly better than CART and the NN methods.

2.2.7 Twonorm. This artificial problem is similar to the ringnorm problem except that both classes have unit variance, Class 1 has mean (a, a, . . . , a), and Class 2 has mean (−a, −a, . . . , −a), where a = 2/√20. Breiman et al. (1984) report a theoretical expected error rate of 0.023. Table 6 shows that TAM performed significantly better than CART and 1NN but worse than KNN-CLASS.

2.3 Statistics of Learning. In this section, several statistics relating to the benchmark simulations are described in order to provide a better understanding of the representations that TAM learned. For clarity, results are shown from only the largest training set of the first five benchmarks.
2.3.1 Speed of Convergence. The results shown in Tables 1 through 6 were obtained following 200 training epochs. However, the error rate converged much earlier than 200 epochs on most data sets. Figure 5 (top) illustrates this by plotting the average error rates obtained after each training epoch. The error rates nearly converged following only 32 training epochs.

2.3.2 Number of Category Nodes. Another important question is the amount of memory that TAM requires. The memory requirements are the number of category nodes that are created times the number of weights per category. Figure 5 (bottom) illustrates the average number of categories per output class that were created. For all except the Titanic data set, TAM equilibrated with a relatively small number of categories per output class. On these data sets, TAM also had low training error rates, ranging from 0.0 on the mushrooms data set up to 0.03 on the image-seg data set. However, on the Titanic data set the number of categories kept growing despite the fact that the training and test error rates barely changed following the first few epochs. The problem appears to be that the training error rate remained fixed at a relatively high level of 0.2. Therefore, with the current heuristics governing category instantiation (see section A.5), there may be a tendency for category proliferation when training set error remains high. This is a topic for future research.

2.3.3 Weight Pruning. The number of feature weights per category node is LM, where L is the number of weights in each dimension and M is the number of dimensions. Weight pruning can significantly reduce the number of weights per dimension, often with little or no effect on the error rate. When a category is first instantiated, it has uniformly distributed feature weights, w_jih = 1/L. With learning, these weights typically converge into a unimodal, gaussian-like distribution. When this happens, the smaller weights become insignificant and can be removed.
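The pruning-and-renormalization step described above can be sketched as follows (a minimal illustration; the function name and example values are hypothetical):

```python
import numpy as np

def prune_feature_weights(w, n_keep):
    """Keep only the n_keep largest feature weights in w, zero the
    rest, and renormalize the survivors to sum to one (the pruning
    rule evaluated in section 2.3.3)."""
    w = np.asarray(w, dtype=float)
    keep = np.argsort(w)[-n_keep:]   # indices of the n_keep largest weights
    pruned = np.zeros_like(w)
    pruned[keep] = w[keep]
    return pruned / pruned.sum()

# A gaussian-like weight vector over L = 12 feature nodes,
# pruned down to 5 surviving weights.
w = np.exp(-0.5 * (np.arange(12) - 6.0) ** 2)
w /= w.sum()
print(prune_feature_weights(w, n_keep=5).round(3))
```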
A similar dynamic occurs for the class weights, p_ji, although pruning of these weights is not explored here. Figure 6 illustrates the effect on error rates of pruning the smallest feature
Figure 5: (Top) Average error rates plotted as a function of the number of training epochs for five DELVE data sets. (Bottom) Average number of categories per output class.
Figure 6: Effect of pruning the smallest feature weights. The error rate is plotted as a function of the average number of remaining feature weights per input dimension.
weights following 200 training epochs. After the smallest weights were set to zero, the remaining weights in each vector w_ji were renormalized to sum to one. The error rates are plotted as a function of the average number of weights remaining for each one-dimensional basis node. For all except the splice data set, error rates were virtually unaffected after 7 of the 12 weights were pruned. If this pruning rule were used throughout training, then the weights would be removed continuously as they fell below threshold. This would produce a pattern of synaptic proliferation followed by refinement and clustering of projections, which is analogous to that found during cortical development (Callaway & Katz, 1990; Kandel & O'Dell, 1992; Antonini & Stryker, 1993).

2.3.4 Receptive Fields and Bias Weights. Earlier in the article, it was claimed that match tracking favors categories with wide receptive fields over those with narrow receptive fields because the latter tend to have larger bias weights. A way to quantify this relationship is to compute the correlations between the narrowness of receptive fields and the size of their corresponding bias weights. Narrowness of receptive fields is approximated by a simple statistic: the size of the maximum feature weight, max_h(w_jih). Correlations were computed across all nodes j and dimensions i, between the maximum feature weights, max_h(w_jih), and the bias weights, b_ji. The average correlations for each data set were as follows: splice, 0.92; Titanic, 0.90; image-seg, 0.97; letter, 0.93; and mushrooms, 0.99.

Figure 7: Scatterplots, obtained following 200 training epochs, relating two variables. These are max_h(w_jih) on the abscissa and b_ji on the ordinate.

Figure 7 shows the scatterplots relating these two variables from a single simulation of the first four data sets. The majority of cases, which lie on the main diagonal, correspond to bias weights that were updated with, for the most part, ρ = 0. The cases that lie below the diagonal correspond to bias weights that were often updated with ρ > 0, as was illustrated in section 2.1. Note that, in particular, a large proportion of bias weights falls below the main diagonal on the Titanic data set, on which the training error rate was especially high.
2.4 Classifying Natural Textures. One of the motivations for developing the TAM model was to obtain a classifier that was functionally similar to the gaussian ARTMAP (GAM) model, while obeying the locality constraint, unlike GAM. TAM's primary differences from GAM are that it (1) synthesizes gaussian-like receptive fields with a set of feature weights, instead of explicitly defining them with means and variances; (2) uses vigilance modulation instead of a vigilance threshold (see Figure 2); and (3) employs distributed connections, p_ji, with graded strengths between the categories and the output nodes rather than binary-valued connections. In order to evaluate the effect of these changes, TAM was compared to GAM on natural texture classification benchmarks, on which GAM obtained good results (Grossberg & Williamson, 1999). The data set was produced by a biologically motivated image processing system that extracted from each 8 × 8 pixel region of an input image 16 oriented contrast features (four orientations and four spatial scales), as well as a single brightness feature. GAM was trained to use these 17-dimensional feature vectors to classify natural textures from the Brodatz album (Brodatz, 1966). For each texture, a single partition of the data set was made into three images for training (768 items) and one image for testing (256 items). GAM was evaluated on different numbers of textures, from 6 up to 42 textures. As the number of textures increased and the classification problem became more difficult, GAM's classification rate and category allocation (per texture) remained relatively stable. Figure 8 summarizes GAM's results, which were obtained after two training epochs. The error bars depict the standard deviations of the error rates and number of categories per texture obtained from five simulations in which the order of presentations of the training data was randomized. On the whole, GAM's performance did not improve with further training.
As Grossberg and Williamson (1999) reported, these results are superior to those obtained on a similar texture classification task by an alternative image classification architecture that used rule-based, multilayer perceptron, or k-nearest neighbor classifiers. GAM also outperformed this alternative architecture when they were both evaluated on an identical task involving the classification of 10 natural textures. Figure 8 also shows TAM's results on these data sets. Due to the large size of the training sets, TAM was trained for only 30 epochs for computational tractability. TAM obtained lower error rates than GAM when the number of textures was small and essentially equivalent error rates when the number of textures was larger (see Figure 8, top). TAM required more training epochs to reach the same performance level as GAM and also required more categories per class (see Figure 8, bottom).
Figure 8: (Top) Error rates for TAM and GAM as a function of the number of textures in training-testing data set. (Bottom) Numbers of categories created by TAM and GAM.
3 Conclusion
We have described a neural network that uses a topographic representation of input data and a biasing feedback signal. The topographic input representation allows the network to learn smooth receptive fields using only local activation and synaptic updating rules. The biasing feedback signal allows the network to improve its prediction accuracy by altering its learning dynamics when it makes an inaccurate output prediction. By combining these two novel learning mechanisms, the network can use local equations to learn on-line and construct a representation of appropriate size and complexity for whatever problem it is trained on. Using a single set of parameters, the network performed effectively compared to alternative approaches on eight classification benchmarks.

Appendix

A.1 Feedforward Activations. The TAM model variables are indexed as follows. The L feature nodes encoding a single input dimension are indexed by h, the M input dimensions are indexed by i, the N category nodes are indexed by j, and the O output nodes are indexed by k. Each category node receives input from M basis nodes, each encoding the match in a single input dimension. For the jth category node, the match in the ith basis node is computed by the inner product between the activity distribution in the ith feature map, f_i, and the distribution of the node's weight field, w_ji:
x_{ji} = \frac{\sum_{h=1}^{L} f_{ih} w_{jih}}{1 + \rho^2 b_{ji}}.   (A.1)
The numerator in equation A.1 yields a smooth receptive field, as illustrated in Figure 2. Changing the input value corresponds to shifting the activity pattern in the feature map, f_i, which results in a change in the match between f_i and w_ji. The denominator in equation A.1 describes the effect of attentional biasing on the node's activity. The vigilance term ρ is normally set to zero. Raising ρ produces divisive inhibition, with the amount of inhibition modulated by the inhibitory bias weight, b_ji. The size of b_ji is positively correlated with the height (and negatively correlated with the width) of the receptive fields. This correlation is quantified on several benchmark simulations in section 2.3. Due to this correlation, tall, narrow receptive fields are usually attenuated more than short, wide receptive fields when ρ > 0. Figure 2a illustrates the differential effects that raising vigilance has on two receptive fields with different widths. Category nodes represent feature conjunctions across M perceptual dimensions. They are activated by a conjunction of bottom-up input from
584
James R. Williamson
their M basis nodes:
$$y_j = \prod_{i=1}^{M} x_{ji}. \tag{A.2}$$
The network's output nodes (indexed by k) are then activated by the category nodes via weighted connections p_{jk}, which represent the probability of output k given category j:
$$z_k = \sum_{j=1}^{N} y_j p_{jk}. \tag{A.3}$$
The class prediction, K, is the index of the maximally activated output node:
$$K = \arg\max_k (z_k). \tag{A.4}$$
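The feedforward pass of equations A.1 through A.4 can be sketched as follows. This is a hypothetical NumPy illustration; the function name and array shapes are our own, not from the article:

```python
import numpy as np

def feedforward(f, w, p, b, rho=0.0):
    """Feedforward pass of the TAM network (equations A.1-A.4).

    f   : (M, L) feature-map activities, one row per input dimension
    w   : (N, M, L) feature weights w_jih
    p   : (N, O) output weights p_jk
    b   : (N, M) inhibitory bias weights b_ji
    rho : vigilance level (normally zero)
    """
    # A.1: match in each dimension, divisively inhibited by rho^2 * b_ji
    x = np.einsum('il,jil->ji', f, w) / (1.0 + rho**2 * b)   # (N, M)
    # A.2: category activation is the product of matches over dimensions
    y = np.prod(x, axis=1)                                    # (N,)
    # A.3: output activations via conditional-probability weights
    z = y @ p                                                 # (O,)
    # A.4: class prediction is the maximally activated output node
    K = int(np.argmax(z))
    return x, y, z, K
```

With ρ = 0 the denominator in A.1 is 1 and the match reduces to the plain inner product between f_i and w_{ji}.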
A.2 Supervision and Attention. Let K* denote the index of the "correct" supervised output class. An accuracy criterion (AC) determines whether the network's output prediction is similar enough to the supervised output to allow learning. If the AC is not met, attention is invoked: the vigilance level, ρ, is incrementally raised from an initial value of ρ = 0. This causes the predictions to change due to differential modulations, via equation A.1, of activities in the basis nodes. Vigilance is raised until either the AC is satisfied or the maximal vigilance level is reached:

If $z_{K^*}/z_K < AC$, then repeat: (i) $\rho := \rho + \rho^{(step)}$; (ii) equations A.1–A.4; until either $z_{K^*}/z_K \geq AC$ or $\rho \geq \rho^{(max)}$. (A.5)
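The attentional search of equation A.5 amounts to a small loop. The sketch below is illustrative: the `_forward` helper condenses equations A.1 through A.4, and the parameter defaults are the values listed in section A.7:

```python
import numpy as np

def _forward(f, w, p, b, rho):
    # Condensed equations A.1-A.4
    x = np.einsum('il,jil->ji', f, w) / (1.0 + rho**2 * b)
    y = np.prod(x, axis=1)
    z = y @ p
    return x, y, z, int(np.argmax(z))

def vigilance_search(f, w, p, b, K_star, AC=0.8, rho_step=0.1, rho_max=100.0):
    """Attentional search of equation A.5: raise vigilance rho until
    z_{K*} / z_K >= AC or rho >= rho^(max)."""
    rho = 0.0
    x, y, z, K = _forward(f, w, p, b, rho)
    # The test z_{K*}/z_K < AC is rewritten multiplicatively to avoid
    # division by zero when z_K is itself zero.
    while z[K_star] < AC * z[K] and rho < rho_max:
        rho += rho_step                              # (i) raise vigilance
        x, y, z, K = _forward(f, w, p, b, rho)       # (ii) rerun A.1-A.4
    return rho, x, y, z, K
```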
Once equation A.5 is satisfied, top-down feedback incorporates information as to the correct output. First, supervised feedback selects the correct output node:
$$z^*_k = \delta[k - K^*], \tag{A.6}$$
where δ[x] = 1 if x = 0; δ[x] = 0 otherwise. Then top-down output category feedback factors in the conditional probabilities of the correct output (via the term $\sum_{k=1}^{O} z^*_k p_{jk}$). In addition, normalization of the category activations causes each input to have the same net impact during learning, since the activations determine learning rates. The activations represent the
Self-Organization of Topographic Mixture Networks
category posterior probabilities given both the input and the correct output:
$$y^*_j = \frac{\prod_{i=1}^{M} x_{ji} \cdot \sum_{k=1}^{O} z^*_k p_{jk}}{\sum_{j'=1}^{N} \prod_{i=1}^{M} x_{j'i} \cdot \sum_{k=1}^{O} z^*_k p_{j'k}}. \tag{A.7}$$
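Equations A.6 and A.7 together compute category posteriors gated by the correct output. A minimal sketch (the function name and shapes are illustrative, not from the article):

```python
import numpy as np

def supervised_posterior(x, p, K_star):
    """Category posteriors with top-down output feedback (eqs. A.6-A.7).

    x      : (N, M) per-dimension matches from equation A.1
    p      : (N, O) output weights
    K_star : index of the correct output class
    """
    z_star = np.zeros(p.shape[1])
    z_star[K_star] = 1.0                           # A.6: select correct output node
    evidence = np.prod(x, axis=1) * (p @ z_star)   # numerator of A.7
    return evidence / evidence.sum()               # normalized posteriors y*_j
```

Because z* is a one-hot vector, `p @ z_star` is simply the column p_{j,K*}: each category's evidence is its input match weighted by how strongly it predicts the correct class.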
The output category feedback is a feature-specific form of attention that serves a different role from the nonspecific vigilance form of attention in equation A.5. Feature-specific attention favors categories that have strong associations with the correct external expectations. Vigilance-based attention, on the other hand, performs a memory search by favoring categories that have low bias weights, and hence wide receptive fields, regardless of their output associations.

A.3 Probabilistic Interpretation of Network Activations. A probabilistic interpretation of the output activities, z*_k, and their influence on category activations in equation A.7, is as follows. Supervised class labels are discrete, categorical data that can be modeled with a multinomial distribution. First, assume that the weights, p_{jk}, encode this distribution:
$$p_{jk} = P(k = K^* \mid j). \tag{A.8}$$
Then the posterior probability of the jth mixture component (i.e., activation of the jth category) based solely on knowledge of the correct class, K*, is:
$$P(j \mid K^*) = \frac{P(k = K^* \mid j)\,P(j)}{\sum_{j'=1}^{N} P(k = K^* \mid j')\,P(j')} = \frac{\sum_{k=1}^{O} \delta[k - K^*]\,p_{jk}}{\sum_{j'=1}^{N} \sum_{k=1}^{O} \delta[k - K^*]\,p_{j'k}} = \frac{\sum_{k=1}^{O} z^*_k\,p_{jk}}{\sum_{j'=1}^{N} \sum_{k=1}^{O} z^*_k\,p_{j'k}}. \tag{A.9}$$
Equation A.9 shows that the addition of top-down feedback in equation A.7 incorporates the conditional probabilities involving the output dimension. Note, however, that equation A.9 assumes that the category prior probabilities, P(j), are uniformly distributed. It has been found empirically that using learned estimates of category prior probabilities, P̃(j), and incorporating these estimates into equation A.7 degrades the network's accuracy as a classifier. A possible reason for this finding is as follows. To assume P̃(j) = P̃(j′) for all j, j′, as in equation A.7, becomes a self-fulfilling prophecy. It causes the categories to learn a more balanced, efficient partitioning of the input-output space. If, on the other hand, learned estimates P̃(j) are factored into equation A.7, then the learning process tends to yield P̃(j) ≫ P̃(j′) for some j, j′. Simulation results (not shown here) suggest that this yields a less effective utilization of resources.

A probabilistic interpretation of the input activities, $\{f_i\}_{i=1}^{M}$, and their influence on category activations, is as follows. For simplicity, assume ρ = 0 (ρ > 0 results in a biasing of the probabilities analyzed here). The discrete output data are encoded by discrete, nonoverlapping distributions among the output nodes, as in equation A.6. The real-valued input data, on the other hand, are encoded by overlapping distributions among an ordered set of nodes, f_i. Due to their overlapping distributions, these feature nodes represent values along a scale. Similarly, just as p_j represents the conditional distribution of discrete values among O output nodes, w_{ji} represents the conditional density of smooth, overlapping distributions among L feature nodes encoding the ith input dimension. Therefore, with ρ = 0, the posterior probability of the jth mixture component, based solely on knowledge of the feature input, is
$$P(j \mid \{f_i\}_{i=1}^{M}) = \frac{p(\{f_i\}_{i=1}^{M} \mid j)\,P(j)}{\sum_{j'=1}^{N} p(\{f_i\}_{i=1}^{M} \mid j')\,P(j')} = \frac{\prod_{i=1}^{M} x_{ji}}{\sum_{j'=1}^{N} \prod_{i=1}^{M} x_{j'i}} = \frac{\prod_{i=1}^{M} \sum_{h=1}^{L} f_{ih} w_{jih}}{\sum_{j'=1}^{N} \prod_{i=1}^{M} \sum_{h=1}^{L} f_{ih} w_{j'ih}}. \tag{A.10}$$
Equation A.10 describes the posterior probability of component j based on a joint density model of the M input dimensions. The density distribution in each input dimension i is encoded by the weight vector w_{ji}, just as the distribution in the output dimension is encoded by p_j. As in equation A.9, the prior probabilities of the categories are assumed to be equal. It is easy to see from equations A.9 and A.10 that the activations y*_j in equation A.7 represent $P(j \mid \{f_i\}_{i=1}^{M}, k = K^*)$.
A.4 Learning. Since category activations represent posterior probabilities conditioned on the current input-output, correlational learning rules allow the network to learn a mixture model of the input-output density. The learning procedure is, in effect, an on-line approximation of a statistical batch-learning approach for optimizing mixture models, the expectation-maximization (EM) algorithm. (See Williamson, 1997, for a detailed explanation of this relationship.) On-line learning obtains better statistical sampling if the learning rate begins high and then reduces with experience. Experience is represented by n_j, which begins at zero when category j is first instantiated and converges toward 1 via a discrete difference equation in which the change in n_j is defined as
$$\Delta n_j = \alpha\,y^*_j\,(1 - n_j). \tag{A.11}$$
The learning rate for feature weights w_{jih} begins with a relatively large value but converges toward a small fixed value as n_j → 1:
$$w_j^{(rate)} = \frac{\alpha}{\alpha\,\beta(M) + n_j}. \tag{A.12}$$
The number of input dimensions, M, has an effect on the net change in receptive fields during learning. Matches computed in equation 2.1 are multiplied together in equation A.1 to produce an M-dimensional receptive field. Therefore, if the value of $w_j^{(rate)}$ were independent of M, increasing M would result in greater total changes in the receptive fields during learning. In order to avoid this effect, and instead obtain the same effective learning rate regardless of the dimensionality of a data set, the function β(M) was added to equation A.12 and defined to make the effective multidimensional learning rate invariant of M,
$$\beta(M) = \frac{\lambda^{1/M}}{1 - \lambda^{1/M}}, \tag{A.13}$$
with the rate of learning inversely proportional to the free parameter λ ∈ (0, 1). The derivation of equation A.13 is given in section A.8. With λ = 1/3, which was used in all the simulations in this article, β(M) ≈ 0.91M − 0.45. The feature weight w_{jih} tracks its input, f_{ih}, at a rate proportional to y*_j. Over time, weight vector w_{ji} learns the average of the spatially varying activity distributions that input to it, and thereby ends up encoding a wider distribution than exists in any single f_i activity distribution:
$$\Delta w_{jih} = w_j^{(rate)}\,y^*_j\,(f_{ih} - w_{jih}). \tag{A.14}$$
The learning rate for class weights p_{jk} also begins with a relatively large value and converges toward a small fixed value as n_j → 1:
$$p_j^{(rate)} = \frac{\alpha}{\alpha + n_j}. \tag{A.15}$$
The output weights p_{jk} track the output activities and thereby learn the conditional probability that output k is correct given category j:
$$\Delta p_{jk} = p_j^{(rate)}\,y^*_j\,(z^*_k - p_{jk}). \tag{A.16}$$
The inhibitory bias weights b_{ji} track the activations of their basis nodes at a constant rate:
$$\Delta b_{ji} = b^{(rate)}\,y^*_j\,(x_{ji} - b_{ji}). \tag{A.17}$$
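The update rules A.11 through A.17 can be collected into one on-line learning step. The looped NumPy form below is our own sketch; the variable names follow the appendix, and β(M) is taken from equation A.13:

```python
import numpy as np

def learn(f, x, y_star, z_star, n, w, p, b, alpha=1e-7, lam=1/3, b_rate=0.01):
    """One on-line learning step (equations A.11-A.17), updating arrays in place.

    f (M, L) input activities; x (N, M) matches; y_star (N,) posteriors;
    z_star (O,) supervised outputs; n (N,) experience; w, p, b as before.
    """
    M = f.shape[0]
    beta = lam ** (1 / M) / (1 - lam ** (1 / M))     # A.13
    for j in range(w.shape[0]):
        w_rate = alpha / (alpha * beta + n[j])       # A.12: feature-weight rate
        p_rate = alpha / (alpha + n[j])              # A.15: class-weight rate
        w[j] += w_rate * y_star[j] * (f - w[j])      # A.14: track feature input
        p[j] += p_rate * y_star[j] * (z_star - p[j]) # A.16: track correct output
        b[j] += b_rate * y_star[j] * (x[j] - b[j])   # A.17: track match size
        n[j] += alpha * y_star[j] * (1 - n[j])       # A.11: accumulate experience
    return w, p, b, n
```

Each rate is computed from the pre-update n_j, so a freshly instantiated category (n_j = 0) learns fastest and the rates decay toward their small fixed values as n_j → 1.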
Equation A.17 causes b_{ji} to learn the expected value of x_{ji}, category j's match in the ith input dimension. Categories with narrow receptive fields will tend to have larger matches, and thus larger bias weights. This correlation is quantified on benchmark simulations in section 2.3. Figure 2a illustrates the differential effect that these weights have on narrow versus wide receptive fields. Categories that learn when ρ is large will also tend to have small bias weights, due to the effect of ρ on x_{ji} in equation A.1. This dynamic is illustrated in section 2.1.

A.5 Category Instantiation. Various heuristics are possible for determining when to instantiate new categories. In our simulations, training always begins with zero categories (N = 0), and a new category is instantiated (N := N + 1) every time vigilance reaches its maximal level, ρ = ρ^{(max)}. The reasoning behind this rule is that by the time ρ reaches ρ^{(max)}, the AC has not been satisfied for 0 ≤ ρ ≤ ρ^{(max)}. This suggests that the existing receptive fields are poorly positioned to learn the current input-output mapping, and therefore a new receptive field is needed, centered on the current input. As a result of this rule, the number of categories created depends on the difficulty of the classification task. New categories are initialized in a tabula rasa state, with n_j = b_{ji} = 0 and with uniformly distributed weights, w_{jih} = 1/L and p_{jk} = 1/O. Following its instantiation, a new category's activity is computed via equations A.1 through A.4. Learning then takes place for all categories via equations A.6, A.7, and A.11 through A.17.

A.6 Input Preprocessing. Self-organizing feature maps (SOFMs) produce a data format with two properties: nearby cells are correlated (they have overlapping receptive fields), and the distribution of receptive fields in the input space is sensitive to the input density. Figure 9 illustrates the ideal effect of a one-dimensional (1D) SOFM on 1D data.
The uneven data density (top) is made uniform across the space of nodes in the map (bottom). Simulating SOFMs is beyond the scope of this article. Rather, an approximation of their properties is obtained as follows. First, histogram equalization is performed on each dimension of the training data set, with the number of bins equal to the number of data items. If several data items have the same value, the median bin index corresponding to these items is used for all of them. The bin indices are then mapped into the range [1/(2T) : 1 − 1/(2T)], where T is the number of items in the training set. This yields input values I_i in a new, warped feature space containing a uniform distribution over the training data, as shown in Figure 9.

Figure 9: Warping of input space, as described in section A.6, which approximates the effect of SOFMs. (Top) Uneven density distribution of training data in the original input space (dashed line) and training items selected from this distribution (x's). (Bottom) Uniform density distribution of data in the warped feature space and the evenly spaced training items. Fixed-width activity distributions in this warped space, as defined by equation A.18, correspond to variable-width distributions in the original input space.

Fixed-width activity distributions are then obtained via the equation
$$f_{ih} = \frac{\exp[-0.5\,(L I_i - h + 0.5)^2]}{\sum_{h'=1}^{L} \exp[-0.5\,(L I_i - h' + 0.5)^2]}, \tag{A.18}$$
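The warping of section A.6 and the encoding of equation A.18 can be sketched as follows. The function names are our own, and the median-bin handling of tied values described in the text is omitted here for brevity:

```python
import numpy as np

def warp(train_col):
    """Approximate histogram equalization of one input dimension
    (section A.6): map each of the T training values to its rank-based
    position in [1/(2T), 1 - 1/(2T)]. Ties are handled naively here,
    not by the median-bin rule of the text."""
    T = len(train_col)
    ranks = np.argsort(np.argsort(train_col))
    return (ranks + 0.5) / T

def encode(I_i, L):
    """Equation A.18: fixed-width gaussian activity distribution of a
    warped value I_i over the L feature nodes of one 1D map."""
    h = np.arange(1, L + 1)
    g = np.exp(-0.5 * (L * I_i - h + 0.5) ** 2)
    return g / g.sum()
```

For example, `encode(0.5, 12)` produces a normalized bump centered midway along the 12-node map; shifting I_i shifts the bump, which is what makes the inner product in equation A.1 a smooth match.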
where L is the number of nodes in the 1D topographic map. L determines the level of resolution of the map, and the width of the gaussian activity distributions with respect to the input range of [0 : 1] is σ = 1/L. These fixed-width activity distributions correspond to variable-width distributions in the original input space. Therefore, they effect a variable-bandwidth smoothing of the original input space. The width of an activity distribution corresponds to the narrowest possible width that a weight distribution can have, due to equation A.12. Therefore, the size of L determines a minimum level of regularization for the network, with smaller L resulting in greater regularization. This can have the positive effect of preventing overlearning but the negative effect of limiting capacity and hence the ability to discriminate. The I_i values for test data are obtained by linear interpolation between the two nearest training values. If a test value is smaller or larger than any training value, its I_i value is obtained by linear interpolation between either the smallest training item and zero, or between the largest training item and one.

A.7 Parameters. The following set of parameters was used on all the simulations: AC = 0.8, ρ^{(step)} = 0.1, ρ^{(max)} = 100, α = 10⁻⁷, λ = 1/3, b^{(rate)} = 0.01. The simulation experiments in section 2.1 involve a high-resolution 1D representation, so a large value of L (L = 24) was used. The multidimensional classification benchmarks in sections 2.2 and 2.4 presumably do not benefit from such high resolution in each dimension, so a smaller value of L (L = 12) was used.

A.8 Derivation of β(M). Consider the effect of M on Δw_{jih} in equation A.14 if $w_j^{(rate)}$ is constant, that is, if $w_j^{(rate)} = c$. For simplicity, assume y*_j = 1 and ρ = 0. Then the change in the weight vector w_{ji} is
$$\Delta w_{ji} = c\,(f_i - w_{ji}). \tag{A.19}$$
What is the effect of M on the net change in the multidimensional receptive field? First, we decompose w_{ji} into its previous value plus the most recent change: $w_{ji} = w_{ji}^{(old)} + \Delta w_{ji}^{(old)}$. Then, via equations A.1 and A.2, we can see that the changes in each dimension are multiplied across all M dimensions:
$$y_j = \prod_{i=1}^{M} x_{ji} = \prod_{i=1}^{M} f_i^{\top} \left( w_{ji}^{(old)} + \Delta w_{ji}^{(old)} \right) = \prod_{i=1}^{M} f_i^{\top} \left( w_{ji}^{(old)} + c\,(f_i^{(old)} - w_{ji}^{(old)}) \right) = \prod_{i=1}^{M} f_i^{\top} \left( c\,f_i^{(old)} + (1 - c)\,w_{ji}^{(old)} \right). \tag{A.20}$$
From the right-hand term in equation A.20, we can see that the fraction of the previous receptive field that is retained depends on M. In order to preserve the same amount of the previous receptive field regardless of the value of M, we would therefore like to have
$$(1 - c)^M = \lambda, \tag{A.21}$$
where λ ∈ (0 : 1) is a constant learning rate independent of M. Obviously, equation A.21 can be true for arbitrary M only if we make c a function of M. This is why equation A.12 contains β(M), which is a function of M. Our goal here is to determine what form β(M) should take. Different experience levels, n_j, result in different functions β(M). Therefore, we need to assume a fixed n_j. Based on the idea that fixing learning rates with respect to M is most important on a node's first learning trials, when the learning rate is highest, we fix n_j to the value it takes after the first learning trial of the first instantiated node. This is n_j = α, yielding
$$c = w_j^{(rate)} = \frac{1}{\beta(M) + 1}. \tag{A.22}$$
Plugging this into equation A.21 yields
$$\left( 1 - \frac{1}{\beta(M) + 1} \right)^M = \left( \frac{\beta(M)}{\beta(M) + 1} \right)^M = \lambda, \tag{A.23}$$
or
$$\beta(M) = \frac{\lambda^{1/M}}{1 - \lambda^{1/M}}. \tag{A.24}$$
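The derivation can be checked numerically: with c chosen via equations A.22 and A.24, the retained fraction (1 − c)^M equals λ for every M, as equation A.21 requires, and β(M) tracks the linear approximation 0.91M − 0.45 quoted in section A.4 for λ = 1/3:

```python
# Numerical check of section A.8: with c = 1/(beta(M) + 1) (equation A.22),
# the retained fraction (1 - c)^M equals lambda for every M (equation A.21).
lam = 1 / 3
for M in (1, 2, 5, 10):
    beta = lam ** (1 / M) / (1 - lam ** (1 / M))   # A.13 / A.24
    c = 1.0 / (beta + 1.0)                          # A.22
    assert abs((1 - c) ** M - lam) < 1e-12          # A.21 holds exactly
    # beta(M) stays close to the approximation 0.91*M - 0.45 (section A.4)
    assert abs(beta - (0.91 * M - 0.45)) < 0.1
```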
Acknowledgments
I would like to thank Michael Cohen for many useful suggestions, and three reviewers for their valuable comments. This work was supported in part by the Defense Advanced Research Projects Agency and the Office of Naval Research (ONR N00014-95-1-0409).

References

Antonini, A., & Stryker, M. P. (1993). Functional mapping of horizontal connections in developing ferret visual cortex: Experiments and modeling. Journal of Neuroscience, 14, 7291–7305.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Brodatz, P. (1966). Textures. New York: Dover.
Callaway, E. M., & Katz, L. C. (1990). Emergence and refinement of clustered horizontal connections in cat striate cortex. Journal of Neuroscience, 10, 1134–1153.
Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115.
Durbin, R., & Mitchison, G. (1990). A dimension reduction framework for understanding cortical maps. Nature, 343, 644–647.
Ghahramani, Z., & Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding. I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Grossberg, S. (1980). How does a brain build a cognitive code? Psychological Review, 87, 1–51.
Grossberg, S., & Williamson, J. R. (1999). A self-organizing neural system for learning to recognize textured scenes. Vision Research, 39, 1385–1406.
Kandel, E. R., & O'Dell, T. J. (1992). Are adult learning mechanisms also used for development? Science, 258, 243–246.
Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.
Obermayer, K., Blasdel, G. G., & Schulten, K. (1992). Statistical-mechanical analysis of self-organization and pattern formation during the development of visual maps. Physical Review A, 45, 7568–7589.
Obermayer, K., Ritter, H., & Schulten, K. (1990). A principle for the formation of the spatial structure of retinotopic maps, orientation and ocular dominance columns. Proceedings of the National Academy of Sciences USA, 87, 8345–8349.
Olson, S. J., & Grossberg, S. (1998). A neural network model for the development of simple and complex cell receptive fields within cortical maps of orientation and ocular dominance. Neural Networks, 11, 189–208.
Rasmussen, C. E., Neal, R. M., Hinton, G. E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The Delve manual, version 1.1. Toronto: University of Toronto.
Sirosh, J., & Miikkulainen, R. (1994). Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71, 66–78.
Swindale, N. (1992). A model for the coordinated development of columnar systems in primate striate cortex. Biological Cybernetics, 66, 217–230.
Waterhouse, S. R., Mackay, D. J. C., & Robinson, A. J. (1996). Bayesian methods for mixtures of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Williamson, J. R. (1996). Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps. Neural Networks, 9, 881–897.
Williamson, J. R. (1997). A constructive, incremental-learning network for mixture modeling and classification. Neural Computation, 9, 1517–1543.

Received December 2, 1998; accepted May 29, 2000.
LETTER
Communicated by Teuvo Kohonen
Auto-SOM: Recursive Parameter Estimation for Guidance of Self-Organizing Feature Maps

Karin Haese
Data Warehouse/Data Mining, Mummert & Partners Management Consulting, Braunschweig, Germany, D-38104

Geoffrey J. Goodhill
Department of Neuroscience, and Georgetown Institute for Cognitive and Computational Sciences, Georgetown University Medical Center, Washington, D.C. 20007, U.S.A.

An important technique for exploratory data analysis is to form a mapping from the high-dimensional data space to a low-dimensional representation space such that neighborhoods are preserved. A popular method for achieving this is Kohonen's self-organizing map (SOM) algorithm. However, in its original form, this requires the user to choose the values of several parameters heuristically to achieve good performance. Here we present the Auto-SOM, an algorithm that estimates the learning parameters during the training of SOMs automatically. The application of Auto-SOM provides the facility to avoid neighborhood violations up to a user-defined degree in either mapping direction. Auto-SOM consists of a Kalman filter implementation of the SOM coupled with a recursive parameter estimation method. The Kalman filter trains the neurons' weights with estimated learning coefficients so as to minimize the variance of the estimation error. The recursive parameter estimation method estimates the width of the neighborhood function by minimizing the prediction error variance of the Kalman filter. In addition, the "topographic function" is incorporated to measure neighborhood violations and prevent the map's converging to configurations with neighborhood violations. It is demonstrated that neighborhoods can be preserved in both mapping directions as desired for dimension-reducing applications. The development of neighborhood-preserving maps and their convergence behavior is demonstrated by three examples accounting for the basic applications of self-organizing feature maps.

1 Introduction
Kohonen's self-organizing map (SOM) (Kohonen, 1982, 1984, 1995) has found wide application in the fields of data exploration, data mining, data classification, data compression, and biological modeling, due to its properties of "neighborhood preservation" and local resolution of the input space proportional to the data distribution. Since it was introduced, it has been improved in several ways. For instance, the assumption of a fixed architecture has been relaxed (Martinetz & Schulten, 1994; Haese & vom Stein, 1996; Bauer & Villmann, 1997), and it has been shown how the magnification factor can be controlled (Bauer, Der, & Herrmann, 1996). However, an important limitation still remains: the values of the learning parameters that are most appropriate for fast convergence to the minima of the neurons' individual energy functions depend on the unique properties of the data in each application. Thus, the successful application of the algorithm depends on the user's knowledge of the input data and experience with the algorithm. It would be greatly preferable for these parameters to be determined automatically during training to achieve the best possible outcome, thus relieving the user of extensive parameter studies and reducing the number of learning steps required. Kohonen (1995) suggested a partial solution to this problem by using a batch learning method that dispenses with the learning rate. Unfortunately, however, the learning rate is used to control the magnification factor of the final map, which is rapidly gaining interest in all fields of application of SOMs, including speech recognition (Bauer et al., 1996), medicine (Villmann, in press a), and robotics (Villmann, in press b). Herrmann (1995) proposed a way to determine automatically the width of the neighborhood function, but at the cost of further parameters to be specified (though the values of these parameters may be less critical). A rather different approach to avoiding the necessity for choosing these parameters is generative topographic mapping (GTM) (Bishop, Svensen, & Williams, 1996).

Neural Computation 13, 595–619 (2001) © 2001 Massachusetts Institute of Technology
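For orientation, the two learning parameters at issue, the learning rate and the neighborhood width σ, enter the standard Kohonen update rule as follows. This sketch is the textbook SOM step, not the Kalman-filter variant the article develops; the function name and shapes are our own:

```python
import numpy as np

def som_step(W, grid, v, eta, sigma):
    """One standard Kohonen SOM update. W: (N, n_M) weight vectors,
    grid: (N, n_A) lattice coordinates, v: one input sample, eta: learning
    rate, sigma: neighborhood width -- the two parameters that Auto-SOM
    estimates automatically instead of requiring a hand-tuned schedule."""
    winner = np.argmin(((W - v) ** 2).sum(axis=1))   # best-matching node
    d2 = ((grid - grid[winner]) ** 2).sum(axis=1)    # squared lattice distances
    h = np.exp(-d2 / (2 * sigma ** 2))               # gaussian neighborhood
    W += eta * h[:, None] * (v - W)                  # move weights toward input
    return W, winner
```

The art lies in scheduling eta and sigma downward over training; Auto-SOM's contribution is to replace that hand-crafted schedule with recursive estimates.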
In this article we present the Auto-SOM: a method for automatic parameter estimation in the SOM based on estimation by a linear Kalman filter extended by a recursive parameter estimation method. We demonstrate its effectiveness on examples including a real application problem, and compare its performance with alternative versions of the SOM and with the GTM. Feature maps consist of nodes r arranged on an n_A-dimensional lattice. All nodes are elements of the set A. To each node a weight vector w_r is assigned, which is a point in the n_M-dimensional input space. The feature map develops by adapting the weight vectors w_r to the presented input data v of a manifold M. The transition of the neighborhood width σ is constrained such that σ is decreased only while
$$\Phi(\mathrm{int}(\sigma)) = 0, \tag{3.16}$$
with Φ denoting the "topographic function" (Villmann, Der, Herrmann, & Martinetz, 1997), a well-known measure of neighborhood preservation (e.g., Bruske & Sommer, 1995; Bauer, Herrmann, & Villmann, 1999; Herrmann, Bauer, & Villmann, 1997). It is defined on the basis of the graph $D_M(r, r')$ of connections according to the induced Delaunay triangulation (Martinetz & Schulten, 1994) of the input space:
$$f_r(k) \stackrel{\mathrm{def}}{=} \#\left\{ r' \mid \|r - r'\|_{\max} > k;\ d_{D_M}(r, r') = 1 \right\}, \tag{3.17}$$
$$f_r(-k) \stackrel{\mathrm{def}}{=} \#\left\{ r' \mid \|r - r'\| = 1;\ d_{D_M}(r, r') > k \right\}, \tag{3.18}$$
$$\Phi(k) \stackrel{\mathrm{def}}{=} \begin{cases} \dfrac{1}{N} \sum_{r' \in A} f_{r'}(k) & k > 0 \\[4pt] \Phi(1) + \Phi(-1) & k = 0 \\[4pt] \dfrac{1}{N} \sum_{r' \in A} f_{r'}(k) & k < 0. \end{cases} \tag{3.19}$$
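A direct, if inefficient, sketch of equations 3.17 through 3.19 follows. The breadth-first computation of graph distances and the function names are our own, not from the article:

```python
import collections
import numpy as np

def topographic_function(lattice_pos, delaunay_adj, k_values):
    """Phi(k) of equations 3.17-3.19.

    lattice_pos  : (N, d) lattice coordinates of the N neurons
    delaunay_adj : (N, N) boolean adjacency of the induced Delaunay
                   graph in input space (Martinetz & Schulten, 1994)
    """
    N = len(lattice_pos)
    # Graph distances d_{D_M}(r, r') by breadth-first search
    dist = np.full((N, N), np.inf)
    for r in range(N):
        dist[r, r] = 0
        queue = collections.deque([r])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(delaunay_adj[u]):
                if dist[r, v] == np.inf:
                    dist[r, v] = dist[r, u] + 1
                    queue.append(v)
    diff = lattice_pos[:, None, :] - lattice_pos[None, :, :]
    d_max = np.abs(diff).max(axis=2)          # maximum norm on the lattice
    d_euc = np.sqrt((diff ** 2).sum(axis=2))  # lattice neighbors have distance 1

    def f(r, k):
        if k > 0:  # 3.17: Delaunay neighbors that are far apart on the lattice
            return np.sum((d_max[r] > k) & (dist[r] == 1))
        return np.sum((d_euc[r] == 1) & (dist[r] > -k))  # 3.18

    def phi(k):
        if k == 0:
            return phi(1) + phi(-1)           # 3.19, middle case
        return sum(f(r, k) for r in range(N)) / N

    return [phi(k) for k in k_values]
```

A perfectly neighborhood-preserving map yields Φ(k) = 0 for all k, which is exactly the condition the width transition in equation 3.16 tests.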
This measure calculates the number and the order k of neighborhood violations and is the only measure known to account for the violation order. This information is exploited in the formulation of the transition of the width (see equation 3.16). The transition model will increase σ if violations are measured that are of higher order than the actual width σ(j). In case these orders of violation occur, σ will not be decreased further until the algorithm has restored a neighborhood-preserving mapping Ψ_{M→A}. The results are demonstrated in the third example (see Figures 8 and 9). Finally, we guarantee the convergence of the learning algorithm to a neighborhood-preserving mapping Ψ_{A→M} from the output to the input space by applying the constrained transition of σ in equation 3.16 with the topographic function Φ(−int(σ)). In the following we demonstrate this by presenting results for particular applications.

4 Applications
The results obtained with Auto-SOM are demonstrated on three examples covering the main application problems: (1) approximation of the input data distribution (two-dimensional uniformly distributed data trained by a two-dimensional feature map), (2) dimension reduction with neighborhood preservation (three-dimensional nonlinear manifold trained by a two-dimensional feature map), and (3) dimension reduction. Finally, we demonstrate the performance of Auto-SOM using a real high-dimensional application problem by comparing it to the results with generative topographic mapping (Bishop et al., 1996). For the first of these, we consider the proposed training method to reflect the uniform distribution of the input data V = {v(j) | 0 ≤ (v)₁ ≤ 1 ∧ 0 ≤ (v)₂ ≤ 1} by the distribution of the neurons. The training result using a 10 × 10 feature map is shown at the top of Figure 6. We find the nodes of the feature map uniformly distributed over the input data set, just moving within their Voronoi cells. The quantization error q_err (see equation 3.9) reaches 0.0021 after 70,000 learning steps; σ(70,000) is equal to 0.96. These results can be reached by experts on the SOM too, as can be deduced from our results with the original SOM. We took the estimated learning parameters as a guideline for our choice of the initial values and rates of decrease (see Figure 6). The final SOM trained with the original learning algorithm shows nearly the same quantization error after 82,000 learning steps, with σ(82,000) = 0.96. In contrast to the exponentially decreasing learning parameters usually used with the original SOM, Figure 6 shows that the Auto-SOM estimates of the learning parameters decrease faster. The estimate of σ(j) at first decreases much faster; later, the rate of decrease reduces
until it becomes very slow for σ(j) < 1. This is due to the individual energy function E_r, equation 3.12, which depends on the learning coefficients. The individual learning coefficients, depicted only for the winner neuron (see Figure 6), decrease from values of 0.3 to 0.06 during the training process. While σ(j) is smoothly estimated, the estimates of the learning coefficients are noisier during the whole learning process, as demonstrated by the learning coefficient K_{r₀}(j) in Figure 6. This is due to the highly inaccurate model of the learning process at the beginning of the training. The weights are far from being static, which is modeled by equation 2.5. Additionally, the system and measurement noise covariances, Q_{w_r} and Q_{o_r}, respectively, are of the same order of magnitude, which results in considerable fluctuations between the confidence in either the measurements or the process model. These fluctuations are reflected in the Kalman gain elements. High values of the Kalman gain express high confidence in the measurements ô_r; in contrast, low values express high confidence in the process model. During the organization process, the system noise covariance Q_{w_r} and the measurement noise covariance Q_{o_r} separate from each other by orders of magnitude. Q_{w_r} tends to zero, expressing high confidence in the static process model, while Q_{o_r} tends to a limit, which depends on the input data set and the number of neurons N_A (see equation 2.13). A comparison of Auto-SOM with batch learning of SOMs is of high interest, because batch learning is known to converge within an order of magnitude fewer learning steps than the original SOM and additionally does not involve the learning rate. Therefore, we also trained the SOM with the batch-learning method. We started with the training result gained by Auto-SOM after 50,000 learning steps, which is already ordered.
Within 5000 steps, batch learning was able to achieve a well-ordered map of quality (q_err = 0.0022) equal to those of Auto-SOM and the original SOM. Therefore, batch SOM here needs fewer learning steps than Auto-SOM. However, batch-learning techniques get stuck more often in local minima, because these techniques smooth the noisy optimization process. They are therefore faster, but the noise is known to be a remedy for escaping from local minima. For instance, our final example of high-dimensional learning patterns trained with batch SOM did not lead to any map equal in quality to the maps from the original SOM or the Auto-SOM. For the second example, the Auto-SOM is trained with data from nonlinear manifolds. The data are chosen from the surface of a hemisphere (see Figure 7). Appropriate learning parameters are hard to guess, because the self-organizing algorithm has to find the projection of three-dimensional inputs onto the two-dimensional lattice of neurons. However, the Auto-SOM estimates the learning parameters σ and K_{r₀} by itself, as shown in Figure 7. Here they decrease at first linearly, but become sublinear when σ is between 2 and 1. With σ = 0.9, the map plotted in Figure 7 is achieved. Here the quantization error is 0.0003, a substantial improvement over the method proposed by Haese (1999), which was trained until σ was equal
Figure 6: Feature map trained on uniformly distributed two-dimensional data using Auto-SOM, with the estimated learning parameters σ and K_{r₀}, as well as the quantization error q_err and the covariance matrices of the neurons' outputs and weights, Q_{o_r} and Q_{w_r}, respectively.
to 0.7 but remained at a higher quantization error, q_err = 0.0128. Further training of the map in Figure 7 will continue to reduce the quantization error, but because the energy minimum is very flat, this occurs only very
Figure 7: Auto-SOM learning results on training data taken from a nonlinear manifold. Auto-SOM estimates of σ and K_{r₀}, as well as the quantization error q_err, are shown during the course of training.
slowly. For the third and last example, we demonstrate that neighborhoods can be preserved for both mappings, Ψ_{A→M} and Ψ_{M→A}, if we impose the constraints in equation 3.16 on the width of the neighborhood function. The training data are selected randomly from a stripe-shaped area, here a rectangle with V = {v | 0 ≤ (v)₁ ≤ 3 ∧ 0 ≤ (v)₂ ≤ 6}. If the smaller extension of the input data set, in direction (v)₁, is considered excessively as a dimension due to noise, a one-dimensional feature map is chosen to quantize the data. Without any constraints on the width σ, the feature map is guided to the configuration in Figure 8. The point of phase transition can be observed in the courses of K_{r₀} and Q_{w_r}. After about 24,000 learning steps, the elements of the covariance matrix Q_{w_r} increase, indicating that the stable map configuration just reached is lost. The distance between the weights and the centers of their Voronoi cells first tends to increase, before converging to a new map configuration with a smaller quantization error. This configuration preserves neighborhoods only from the output into the input space. In order to preserve neighborhoods for the mapping in the other direction, we need to remain in the first stable map configuration. This is achieved by refining the transition model of the width σ as proposed in equation 3.16. Thereby, the width of the neighborhood function remains above the theoretical value σ_crit ≈ 2.02s (Ritter & Schulten, 1989), with s = 3 being the width of the stripe. The width σ, as well as the estimated learning coefficient K_{r₀} and elements of the covariance matrices Q_{w_r} and Q_{o_r}, are shown in Figure 9. They converge in the form of an attenuated oscillation. Finally, a mapping is reached that does preserve neighborhood relations for the mappings Ψ_{A→M} and Ψ_{M→A}, but it has a higher quantization error than the map in Figure 8. Finally, we show that Auto-SOM can find good mappings for real high-dimensional input data, which have to be inspected and therefore visualized. We compare the results of Auto-SOM with those of GTM, proposed by Bishop et al. (1996) and further developed in Bishop, Svensen, and Williams (1997). Both algorithms can be regarded as kernel smoothers (Bishop, Svensen, & Williams, 1998).
As the comparison of GTM and the batch SOM version in Bishop et al. (1998) has revealed once more the lack of guidelines for how to shrink the width of the neighborhood function with batch SOMs, Auto-SOM now provides a competing algorithm to GTM whose learning parameters are determined automatically. Therefore, we compare both algorithms, Auto-SOM and GTM, on the problem of determining the fraction of oil in a multiphase pipeline (Bishop & James, 1993). The pipeline carries a mixture of oil, water, and gas. Twelve measurements of the attenuation of gamma beams are taken to decide whether one of the phases (oil, water, or gas) belongs to laminar, homogeneous, or annular flow. The training and testing data sets each consist of 1000 data points, drawn with equal probability from one of the three classes. Figure 10 shows the different visualizations of the input data space using Auto-SOM (without the refinements of section 3.1) and GTM, respectively. The number of patterns to which the neurons at locations r = [r_1, r_2]^T respond is plotted, with the class of flow assigned by different gray levels. Auto-SOM shows high resolution of the nonempty parts of the input space with a small quantization error, 0.1294. Additionally, only seven neurons on the map are
614
Karin Haese and Geoffrey J. Goodhill
Figure 8: One-dimensional feature map trained on uniformly distributed two-dimensional data using Auto-SOM, with the estimated learning parameters σ, K_r^0 as well as the quantization error q_err and the covariance matrices of the neurons' outputs and the weights, Q_o^r and Q_w^r, respectively.
stimulated by patterns of different classes. The topology is preserved best in mapping direction Ψ_{M→A}, with Φ(k > 18) = 0, and in mapping direction Ψ_{A→M}, Φ(−k > 9) = 0. In comparison, the GTM with a 12 × 12 grid of nonlinear basis functions, with common width twice the distance between
Figure 9: One-dimensional feature map trained to stabilize in a bidirectional neighborhood-preserving map configuration. The estimated learning parameters σ, K_r^0, the quantization error q_err, and the elements of the covariance matrices, Q_o^r and Q_w^r, respectively, are depicted during the course of training.
two neighboring basis function centers and a constant degree of weight regularization, equal to 1000.0, for 1000 iterations, resolves the input space as accurately as Auto-SOM, with a quantization error of 0.1272, whereas the number
Figure 10: (Top) Auto-SOM and GTM learning results on training data taken from flow measurements in a three-phase pipeline. (Bottom) Auto-SOM estimates of σ and K_r^0 shown during the course of training.
of neurons always responding to patterns of different flow configurations nearly doubles to 13. The topology preservation achieved by the GTM is best in mapping direction Ψ_{A→M}, with Φ(−k > 7) = 0, and in mapping direction Ψ_{M→A}, Φ(k > 16) = 0. Thus, the Auto-SOM performs as well as the GTM does. Further results on this training set were obtained by applying the original SOM as well as the batch SOM. The original SOM converged after 60,000 training cycles to the best mapping, with quantization error 0.098 and a minimum of 9 neurons always responding to different flow configurations. The topology preservation achieved is worse, with Φ(−k > 16) = 0 in mapping direction Ψ_{A→M} and Φ(k > 8) = 0 in mapping direction Ψ_{M→A}. Batch SOM did not show better results. From this we conclude that the algorithms converge to maps with very similar qualities in topology preservation and quantization of the test patterns. However, the Auto-SOM and the GTM algorithm render parameter studies unnecessary at the cost of greater computational expense. The greatest additional cost incurred by Auto-SOM in comparison with the batch SOM version comes from the linear KF equations, 2.16, 2.17, and 2.20, and increases with the number of neurons N_A. However, these additional costs are only a fraction of those resulting from
even one additional parameter study, because the greatest part of the CPU time is needed for searching for the winner neurons. Therefore, these results show that Auto-SOM is a feasible as well as a tractable method for estimating the learning coefficients and the width of the neighborhood function, and it leads to good results in all application domains of the self-organizing feature map. Consequently, the Auto-SOM is capable of relieving the user of laborious parameter studies. Furthermore, it has also been demonstrated how the underlying model of the RPEM can be modified to yield bidirectional neighborhood-preserving mappings in dimension-reducing applications.

5 Conclusions
We have presented the Auto-SOM, an optimal learning parameter estimation method for the training of feature maps. A Kalman filter (KF) implementation of the original SOM calculates the learning coefficient for each neuron so as to estimate the weights' adaptation process with minimal variance. From its estimation error, the proper width of the neighborhood function is calculated by the RPEM, which acts on the basis of the neurons' individual energy functions. Together, these two algorithms constitute the Auto-SOM, which converges automatically to feature mappings that preserve neighborhoods from the output into the input space. This is guaranteed by the topographic function, which measures violations of neighborhood preservation. The performance and the features of Auto-SOM were demonstrated on example applications. Auto-SOM was shown to yield good results, thus relieving the user of laborious parameter studies. In addition, Auto-SOM can be modified to yield bidirectional neighborhood-preserving mappings. This feature is valuable for vector quantization of noisy data and for any application where it is desirable for the SOM to stabilize in neighborhood-preserving map configurations. Furthermore, Auto-SOM can provide insight into the data under examination through analysis of the estimated learning parameters as well as the noise covariances. This has the potential to reveal further aspects of SOM features and convergence properties.

Acknowledgments
K. H. was supported by the German Aerospace Center and the German Research Foundation. G. J. G. was supported by grants from the NIH and DoD.

References

Bauer, H.-U., Der, R., & Herrmann, M. (1996). Controlling the magnification factor of self-organizing feature maps. Neural Computation, 8, 757–765.
Bauer, H.-U., Herrmann, M., & Villmann, T. (1999). Neural maps and topographic vector quantization. Neural Networks, 12, 659–676.
Bauer, H.-U., & Villmann, T. (1997). Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks, 8, 218–226.
Bishop, C. M., & James, G. (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research, A327, 580–593.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1996). GTM: The generative topographic mapping (Tech. Rep.). Birmingham, UK: Aston University.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1997). Magnification factors of the SOM and GTM algorithm. In Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), Espoo, Finland (pp. 333–338). Helsinki: Helsinki University of Technology.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10(1), 215–234.
Bruske, J., & Sommer, G. (1995). Dynamic cell structure learns perfectly topology preserving map. Neural Computation, 7, 845–865.
Der, R., Balzuweit, G., & Herrmann, M. (1996). Constructing principal manifolds in sparse data sets by self-organizing maps using self-regulating neighbourhood width. In Proc. ICNN'95, IEEE Int. Conf. on Neural Networks.
Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence properties and energy functions. Biological Cybernetics, 67, 47–55.
Haese, K. (1990). Optimierung des Schätzverhaltens eines Kalman-Filters. Unpublished master's thesis, RWTH Aachen, Germany.
Haese, K. (1998). Self-organizing feature map with self-adjusting learning parameters. IEEE Transactions on Neural Networks, 9(6), 1270–1278.
Haese, K. (1999). Kalman filter implementation of self-organizing feature maps. Neural Computation, 11(5), 1211–1233.
Haese, K., & vom Stein, H.-D. (1996). Fast self-organizing of n-dimensional topology maps. In VIII European Signal Processing Conference, Trieste, Italy (pp. 835–838). Trieste: EURASIP Edizioni LINT.
Herrmann, M. (1995). Self-organizing feature maps with self-organizing neighborhood widths. In Proc. ICNN'95, IEEE Int. Conf. on Neural Networks, 6 (pp. 2998–3003). Piscataway, NJ: IEEE Service Center.
Herrmann, M., Bauer, H.-U., & Villmann, T. (1997). A comparison of topography measures for neural maps. In Proceedings of the Workshop on Self-Organizing Maps (WSOM '97), Espoo, Finland (pp. 274–279). Helsinki: Helsinki University of Technology.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1984). Self-organization and associative memory. Berlin: Springer-Verlag.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Ljung, L. (1999). System identification: Theory for the user. Englewood Cliffs, NJ: Prentice Hall.
Ljung, L., & Söderström, T. (1983). Theory and practice of recursive identification. Cambridge, MA: MIT Press.
Martinetz, T., & Schulten, K. (1994). Topology representing networks. Neural Networks, 7(3), 507–522.
Ritter, H., & Schulten, K. (1989). Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability, and dimension selection. Biological Cybernetics, 60, 59–71.
Tolat, V. (1990). An analysis of Kohonen's self-organizing maps using a system of energy functions. Biological Cybernetics, 64, 155–164.
Villmann, T. (in press a). Analysis of psychotherapy process data using neural maps. IEEE Transactions on Neural Networks.
Villmann, T. (in press b). Application of magnification control for the neural gas network in a sensorimotor architecture for robot navigation. In Fortschrittsberichte des VDI, Workshop SOAVE 2000, Ilmenau. VDI-Verlag.
Villmann, T., Der, R., Herrmann, M., & Martinetz, T. M. (1997). Topology preservation in self-organizing feature maps: Exact definition and measurement. IEEE Transactions on Neural Networks, 8(2), 256–266.
Yin, H., & Allinson, N. M. (1995). On the distribution and convergence of feature space in self-organizing maps. Neural Computation, 7(7), 1178–1187.

Received May 17, 1999; accepted June 14, 2000.
LETTER
Communicated by Jorge José
Exponential Convergence of Delayed Dynamical Systems

Tianping Chen
Department of Mathematics, Fudan University, Shanghai, China

Shun-ichi Amari
RIKEN Brain Science Institute, Wako-shi, Saitama 351-01, Japan

We discuss some delayed dynamical systems, investigating their stability and convergence. We prove that, under mild conditions, these delayed systems are globally exponentially convergent.

1 Introduction
Recurrently connected neural networks, sometimes called Hopfield neural networks, have been extensively studied and have found many applications in different areas. Such applications depend heavily on the dynamic behavior of the networks. Therefore, the analysis of these dynamic behaviors is a necessary step for the practical design of neural networks. The recurrently connected neural network is described by the following differential equations:

  C_i du_i/dt = Σ_j T_ij v_j − u_i/R_i + I_i,   i = 1, …, N,   u(t_0) = u^0 ∈ R^N,   (1.1)

where v_j = g_j(u_j), j = 1, …, N, and all g_j are sigmoidal functions.

The dynamic behaviors of recurrently connected neural networks have been studied since the early research on these networks. For example, multistable and oscillatory behaviors were studied in Amari (1971, 1972) and Wilson (1972). Chaotic behaviors were studied in Sompolinsky and Crisanti (1988). Hopfield and Tank (1984, 1986) studied the dynamic stability of symmetrically connected networks and showed their practical applicability to optimization problems. Cohen and Grossberg (1983) gave more rigorous results on the global stability of networks. The global stability of symmetrically connected networks has been well established. There are also a few articles that have studied local and global stability of asymmetrically connected networks. For example, Li, Michel, and Porod (1988), Michel, Farrel, and Porod (1989), and Michel and Gray (1990) applied large-scale techniques to obtain many sufficient conditions for local stability. Yang and Dillon (1994) also obtained some sufficient conditions for local exponential stability and for the existence and uniqueness of an equilibrium

Neural Computation 13, 621–635 (2001)
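The dynamics of equation 1.1 can be sketched numerically. The fragment below integrates the system by forward Euler with g_j = tanh and weak symmetric connections T (the Hopfield–Tank setting); all concrete parameter values are illustrative assumptions, and the residual measures how close the final state is to an equilibrium.

```python
import numpy as np

def simulate_hopfield(T, R, C, I, u0, dt=0.01, n_steps=5000):
    """Forward-Euler integration of  C_i du_i/dt = sum_j T_ij v_j - u_i/R_i + I_i,
    with sigmoidal outputs v_j = tanh(u_j)."""
    u = np.asarray(u0, float).copy()
    for _ in range(n_steps):
        v = np.tanh(u)                       # v_j = g_j(u_j)
        u += dt * (T @ v - u / R + I) / C
    return u

rng = np.random.default_rng(0)
n = 5
G = rng.standard_normal((n, n))
T = 0.1 * (G + G.T)                          # weak symmetric coupling (illustrative)
R = np.ones(n); C = np.ones(n); I = 0.2 * rng.standard_normal(n)
u_star = simulate_hopfield(T, R, C, I, rng.standard_normal(n))
residual = np.linalg.norm(T @ np.tanh(u_star) - u_star / R + I)
```

For symmetric T the trajectory settles onto an equilibrium, consistent with the global-stability results for symmetrically connected networks cited here.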
© 2001 Massachusetts Institute of Technology
point. However, they did not address the issue of the global stability of these networks. In practice, global stability is more important than local stability, as Hirsch (1989) pointed out, and some sufficient conditions for global stability have been given by using Gershgorin's circle theorem. Kelly (1990) applied contraction mapping theory to obtain some sufficient conditions for global stability. Matsuoka (1992) generalized some of Hirsch's and Kelly's results by using a new Lyapunov function. Kaszkurewicz and Bhaya (1994) proved that the diagonal stability of the interconnection matrix implies the existence and uniqueness of an equilibrium point and global stability at the equilibrium. Forti, Maneti, and Marini (1994) showed that the negative semidefiniteness of the interconnection matrix guarantees the global stability of a Hopfield network. Fang and Kincaid (1996) applied matrix measure theory to obtain some sufficient conditions for global and local stability. Chen and Amari (2000) give an effective approach for discussing the global stability of Hopfield networks.

Recently, several articles have discussed delayed Hopfield neural networks and cellular neural networks. A single time delay τ > 0 was introduced by Marcus and Westervelt (1989). The following system of differential equations with delay was considered:

  C_i du_i(t)/dt = Σ_j T_ij v_j(t − τ) − u_i(t)/R_i + I_i,   i = 1, …, N,   u(t_0) = u^0 ∈ R^N,   (1.2)

where v_j = g_j(u_j), j = 1, …, N, and all g_j are activation functions (e.g., sigmoidal functions). System 1.2 has more complicated dynamics than system 1.1 due to the delay. (For results on this system, see Belair, Campbell, & van den Driessche, 1996.) Gopalsamy and He (1991) considered a modification of system 1.2 by incorporating different delays τ_ij ≥ 0, that is,

  du_i(t)/dt = −b_i u_i(t) + Σ_j a_ij g_j(u_j(t − τ_ij)) + I_i,   i = 1, …, N,   (1.3)
where the g_j are sigmoidal functions. Cao and Zhou (1998) considered cellular neural networks with delay,

  du_i(t)/dt = −c_i u_i(t) + Σ_j a_ij g_j(u_j(t)) + Σ_j b_ij g_j(u_j(t − τ_ij)) + I_i,   i = 1, …, N,   (1.4)
where the g_i satisfy the following conditions:

H_1: g_j (j = 1, …, n) is bounded on R.

H_2: There are numbers m_i such that |g_i(u) − g_i(v)| ≤ m_i |u − v| for any u, v ∈ R.
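A system of the form 1.4 with a single common delay can be integrated by storing the trajectory over the last delay interval. The sketch below uses g = tanh, which satisfies H_1 and H_2 with m_i = 1, and illustrative weights chosen so that the sufficient condition 1.7 quoted later holds; the weights and inputs are assumptions, not taken from the text.

```python
import numpy as np

def simulate_delayed(c, A, B, I, phi, tau, t_end, dt=0.001):
    """Euler scheme for  du_i/dt = -c_i u_i + sum_j a_ij g(u_j(t)) + sum_j b_ij g(u_j(t - tau)) + I_i,
    with g = tanh and a single common delay tau.  phi is the initial function on [-tau, 0]."""
    d = int(round(tau / dt))                              # delay measured in steps
    n_steps = int(round(t_end / dt))
    hist = [phi(-tau + k * dt) for k in range(d + 1)]     # samples of u on [-tau, 0]
    for _ in range(n_steps):
        u, u_del = hist[-1], hist[-1 - d]
        hist.append(u + dt * (-c * u + A @ np.tanh(u) + B @ np.tanh(u_del) + I))
    return np.array(hist)

n = 3
c = np.ones(n)
A = 0.2 * np.eye(n)
B = 0.2 * (np.ones((n, n)) - np.eye(n))
# Column sums give (m_i/c_i) * sum_j (|a_ji| + |b_ji|) = 0.2 + 0.4 = 0.6 < 1,
# so the sufficient stability condition 1.7 quoted later is satisfied.
I = np.array([0.1, -0.1, 0.0])
traj = simulate_delayed(c, A, B, I, phi=lambda s: np.cos(s) * np.ones(n),
                        tau=1.0, t_end=60.0)
```

The trajectory settles onto a unique equilibrium regardless of the oscillatory initial function, as the delay-independent results quoted below predict.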
It is clear that Hopfield neural networks with delay are special cases of equation 1.4. The initial conditions associated with equation 1.4 are of the form

  u_i(s) = φ_i(s)   for s ∈ [−τ, 0],   where τ = max_{1≤i,j≤n} τ_ij,   (1.5)

and φ_i ∈ C([−τ, 0]), i = 1, …, n. It happens that the same constraints are introduced in van den Driessche and Zou (1998). However, in their paper, the following delayed dynamical system,

  dx_i(t)/dt = −c_i x_i(t) + Σ_{j=1}^n b_ij g_j(x_j(t − τ_j)),   (1.6)
was discussed. Most articles deal with delayed neural networks by means of various Lyapunov-like functions, and various sufficient conditions for the stability of delayed neural networks have been given. Among them (e.g., Cao & Zhou, 1998), it is proved that under the assumptions H_1, H_2, if

  (m_i / c_i) Σ_{j=1}^n (|a_ji| + |b_ji|) < 1,   (1.7)

then there exists an equilibrium point, which is globally asymptotically stable independent of the delays τ_ij.

Previously (Chen, 1999), we introduced a new and efficient approach for dealing with delayed dynamics. In this article, we investigate the effectiveness of this approach further and give more new results. There are two questions we address:

1. Is the stability for delayed dynamical systems exponentially asymptotic?

2. What happens if the inequality 1.7 is replaced by an equality?

In this article, we answer the first question. The second question, which is more complicated, will be answered later. Throughout the article, we assume that τ_ij = τ and investigate the following dynamical system,

  du_i(t)/dt = −c_i u_i(t) + Σ_j a_ij f_j(u_j(t)) + Σ_j b_ij g_j(u_j(t − τ)) + I_i,   i = 1, …, N,   (1.8)
which was also discussed in Guan and Chen (1999). We assume that the functions f_i and g_i satisfy either the H* conditions or the H** conditions:

H* Conditions.
H*1: f_i, g_i (i = 1, …, n) are bounded on R.
H*2: 0 ≤ (f_j(u) − f_j(v))/(u − v) ≤ m_j and 0 ≤ (g_j(u) − g_j(v))/(u − v) ≤ ν_j for all u ≠ v in R.

H** Conditions.
H**1: f_i, g_i (i = 1, …, n) are bounded on R.
H**2: There are numbers m_j, ν_j such that |f_j(u) − f_j(v)| ≤ m_j |u − v| and |g_j(u) − g_j(v)| ≤ ν_j |u − v| for any u, v in R.

2 Lemmas

Lemma 1. Let a scalar function x(t) satisfy the following delayed differential inequality,

  dx(t)/dt ≤ −c(t) x(t) + b(t) x(t − τ),   (2.1)

with initial condition

  x(s) = φ(s),   s ∈ [−τ, 0].   (2.2)

If the functions c(t) and b(t) satisfy the following conditions: (1) c(t) and b(t) are bounded, and (2) c(t) > 0, b(t) > 0, and

  −c(t) + b(t) ≤ −γ < 0   (2.3)

holds for all t > 0, then

  |x(t)| ≤ C δ^{γt/(γτ + |ln δ|)},   (2.4)

where 0 < δ = max_t {b(t)/c(t)} < 1 and C is a constant determined by the initial condition.

Proof. Denote T_k = [(k − 1)τ, kτ] and
where 0 < d D maxt fb (t) /c (t)g < 1. Proof . Denote T k D [(k ¡ 1)t, kt ] and
M k D max |x(t)|. t2Tk
We claim that M kC1 · M k. In fact, by the denition of M k, we have |u(kt )| · M k. Then by the differential inequality 2.1 and the inequality 2.3, we conclude that for all t 2 T kC 1 , we have x(t) · M k. In fact, if for some t 2 T kC 1 , x (t) ¸ M k, then xP (t) < 0. Therefore, x (t ) · M k whenever t 2 T kC 1 , which means M kC1 · M k.
We also claim that there at least exists a t 2 [kt, kt C |lnd| /g] such that |x(t)| · dM k.
In fact, if x(t) > dM k for all t 2 [kt, kt C |lnd| /g], then xP (t) < ¡gx(t)
for
t 2 [kt, kt C |lnd| /g].
Therefore, x(t) < x( kt )e ¡g(t¡kt ) and x( kt C |lnd| /g) < dx(kt ) · dM k ,
which contradicts the fact that x (t) > dM k for all t 2 [kt, kt C |lnd| /g]. Let t0 D inf ft: kx(t)| · dM kg. t > kt
Then by similar argument, we conclude that |x(t)| · dM k
when
t > t0 .
Let [a] denote the integer part of a positive number a and

  k* = [|ln δ|/(γτ)] + 1.

Then, by the previous argument, we have M_{k+k*} ≤ δ M_k. Therefore, M_{nk*} ≤ δ^n M_{k*}, and for nk* ≤ k < (n + 1)k*, we have M_k ≤ M_{nk*}. As the final conclusion, we have

  |x(t)| ≤ C δ^{γt/(γτ + |ln δ|)}.   (2.5)

Lemma 2. Let a scalar function x(t) satisfy the following delayed differential inequality,

  dx(t)/dt ≤ −c(t) x(t) + b(t) x^{1/2}(t) x^{1/2}(t − τ),   (2.6)

with initial condition

  x(s) = φ(s),   s ∈ [−τ, 0].   (2.7)

If the functions c(t) and b(t) satisfy the following conditions: (1) c(t) and b(t) are bounded, and (2) c(t) > 0, b(t) > 0, and

  −c(t) + b(t) ≤ −γ < 0   (2.8)

holds for all t > 0, then

  |x(t)| ≤ C δ^{γt/(γτ + |ln δ|)},   (2.9)

where 0 < δ = max_t {b(t)/c(t)} < 1.

Proof. Using the same notation as in the proof of lemma 1, denote T_k = [(k − 1)τ, kτ] and

  M_k = max_{t ∈ T_k} |x(t)|.
We claim that M_{k+1} ≤ M_k. In fact, by the definition of M_k, we have |x(kτ)| ≤ M_k. Then, by the differential inequality 2.6 and the inequality 2.8, we conclude that x(t) ≤ M_k for t ∈ T_{k+1}, because if x(t) = M_k, then ẋ(t) < 0. Therefore, x(t) ≤ M_k whenever t ∈ T_{k+1}, which means M_{k+1} ≤ M_k.

We also claim that there exists at least one t_0 ∈ [kτ, kτ + 2|ln δ|/γ] such that x(t) ≤ δ² M_k for all t > t_0. In fact, if, on the contrary, we assume that x(t) > δ² M_k for all t ∈ [kτ, kτ + 2|ln δ|/γ], then

  ẋ(t) < −γ x(t)   for t ∈ [kτ, kτ + 2|ln δ|/γ].

Therefore, x(t) < x(kτ) e^{−γ(t − kτ)}, and

  x(kτ + 2|ln δ|/γ) < δ² x(kτ) ≤ δ² M_k,

which contradicts the assumption that x(t) > δ² M_k for all t ∈ [kτ, kτ + 2|ln δ|/γ]. The claim is proved. Let

  k* = [2|ln δ|/(γτ)] + 1.

Then, by the previous argument, we have M_{k+k*} ≤ δ² M_k. Therefore, M_{nk*} ≤ δ^{2n} M_{k*}, and for nk* ≤ k < (n + 1)k*, we have M_k ≤ M_{nk*}. As the final conclusion, we have

  |x(t)| ≤ C δ^{γt/(γτ + |ln δ|)}.   (2.10)
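The exponential bound of lemma 1 can be checked numerically on a scalar example. The sketch below integrates dx/dt = −c x(t) + b x(t − τ) with constant c > b > 0, so that γ = c − b and δ = b/c, and verifies that |x(t)| stays below δ^{γt/(γτ + |ln δ|)}; the constants and the choice C = 1 are illustrative assumptions that happen to suffice for this monotone example.

```python
import numpy as np

def simulate_scalar_delay(c, b, tau, x0=1.0, t_end=40.0, dt=0.001):
    """Euler scheme for  dx/dt = -c x(t) + b x(t - tau),  constant history x = x0 on [-tau, 0]."""
    d = int(round(tau / dt))
    xs = [x0] * (d + 1)
    for _ in range(int(round(t_end / dt))):
        xs.append(xs[-1] + dt * (-c * xs[-1] + b * xs[-1 - d]))
    return np.array(xs)

c, b, tau = 2.0, 1.0, 1.0
gamma = c - b                          # -c(t) + b(t) <= -gamma < 0
delta = b / c                          # delta = max_t b(t)/c(t) = 0.5
xs = simulate_scalar_delay(c, b, tau)
t = np.arange(len(xs)) * 0.001 - tau   # time axis; history occupies [-tau, 0]
bound = delta ** (gamma * np.clip(t, 0.0, None) / (gamma * tau + abs(np.log(delta))))
within_bound = bool(np.all(np.abs(xs) <= 1.001 * bound))   # C = 1 for this example
```

The true decay rate of this example (about e^{−0.44t}) is faster than the lemma's guaranteed rate (about e^{−0.41t}), as expected of a bound.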
3 Main Theorems
In this section, we establish several theorems on the asymptotic stability of delayed neural networks. By the well-known Brouwer fixed-point theorem, it is easy to prove that under either one of the conditions H* or H**, together with the conditions of the following theorems, there exists at least one equilibrium point of the dynamical system 1.8. Let u* = [u_1^*, …, u_n^*]^T be such an equilibrium point and v(t) = u(t) − u*. Define

  f_j^*(v) = f_j(v + u_j^*) − f_j(u_j^*)   (3.1)

and

  g_j^*(v) = g_j(v + u_j^*) − g_j(u_j^*).   (3.2)

Then v(t) satisfies

  dv_i(t)/dt = −c_i v_i(t) + Σ_j a_ij f_j^*(v_j(t)) + Σ_j b_ij g_j^*(v_j(t − τ)),   i = 1, …, N,   (3.3)

and the uniqueness and asymptotic stability of equation 1.8 reduce to the global convergence of equation 3.3.

Theorem 1. If the activation functions f_i, g_i, i = 1, …, N, satisfy H*1 and H*2 in the H* condition, and the coefficients of the dynamical system 1.8 satisfy the following inequalities,

  −c_i + a_ii^+ m_i + Σ_{j≠i} |a_ij| m_j + Σ_j |b_ij| ν_j ≤ −γ < 0,   i = 1, …, n,   (3.4)

where a_ii^+ = max{0, a_ii}, then the dynamical system 1.8 has a unique equilibrium u*, which is globally and exponentially asymptotically stable.

Corollary 1. Assume that the activation functions f_i, g_i, i = 1, …, N, satisfy the conditions in H**, and the coefficients of equation 1.8 satisfy the following inequality,

  −c_i + Σ_j |a_ij| m_j + Σ_j |b_ij| ν_j ≤ −γ < 0,   i = 1, …, n.   (3.5)

Then there exists a unique equilibrium u*, which is globally and exponentially asymptotically stable.
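Conditions 3.4 and 3.5 are directly computable from (c, A, B, m, ν). The sketch below evaluates the row-wise margins of condition 3.4, taking a_ii^+ = max{0, a_ii} (the definition used for theorem 3); the example matrices are assumptions chosen only to illustrate the check.

```python
import numpy as np

def theorem1_margins(c, A, B, m, nu):
    """Row-wise margins of condition 3.4:
        -c_i + a_ii^+ m_i + sum_{j != i} |a_ij| m_j + sum_j |b_ij| nu_j,
    with a_ii^+ = max{0, a_ii}.  All margins <= -gamma < 0 gives global exponential
    stability by theorem 1; replacing a_ii^+ m_i by |a_ii| m_i gives condition 3.5."""
    off = np.abs(A).copy()
    np.fill_diagonal(off, 0.0)               # off-diagonal |a_ij|
    a_plus = np.maximum(np.diag(A), 0.0)     # a_ii^+
    return -np.asarray(c) + a_plus * m + off @ m + np.abs(B) @ nu

c = np.array([2.0, 2.0]); m = np.ones(2); nu = np.ones(2)
A = np.array([[-0.5, 0.3], [0.2, 0.4]])
B = np.array([[0.1, 0.2], [0.3, 0.1]])
margins = theorem1_margins(c, A, B, m, nu)
stable = bool(np.all(margins < 0))
```

Here the margins are −1.4 and −1.0, so theorem 1 applies with γ = 1.0; note that the negative diagonal entry a_11 = −0.5 contributes nothing, which is exactly the advantage of condition 3.4 over the coarser condition 3.5.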
Remark. It is important to point out that theorem 1 and corollary 1 remain valid when the dynamical system has different delays—that is, they are valid for the following dynamical system,

  du_i(t)/dt = −c_i u_i(t) + Σ_j a_ij f_j(u_j(t)) + Σ_j b_ij g_j(u_j(t − τ_j)) + I_i,   i = 1, …, N.   (3.6)

In fact, all we need to do is take τ = max{τ_j}. The arguments need not change.

Proof of Theorem 1. Let i_0 = i_0(t) be the index such that |v_{i_0}(t)| = ‖v(t)‖_∞. Under the assumption that the activation functions f_i, g_i, i = 1, …, N, satisfy H*1 and H*2 in the H* condition, we have

  d‖v(t)‖_∞/dt = d|v_{i_0}(t)|/dt = sign(v_{i_0}(t)) dv_{i_0}(t)/dt
    = sign(v_{i_0}(t)) [ −c_{i_0} v_{i_0}(t) + Σ_j a_{i_0 j} f_j^*(v_j(t)) + Σ_j b_{i_0 j} g_j^*(v_j(t − τ)) ]
    = −c_{i_0} |v_{i_0}(t)| + sign(v_{i_0}(t)) a_{i_0 i_0} f_{i_0}^*(v_{i_0}(t))
      + sign(v_{i_0}(t)) Σ_{j≠i_0} a_{i_0 j} f_j^*(v_j(t)) + sign(v_{i_0}(t)) Σ_j b_{i_0 j} g_j^*(v_j(t − τ))
    ≤ [ −c_{i_0} + a_{i_0 i_0}^+ m_{i_0} + Σ_{j≠i_0} |a_{i_0 j}| m_j ] ‖v(t)‖_∞ + [ Σ_j |b_{i_0 j}| ν_j ] ‖v(t − τ)‖_∞.   (3.7)

Let

  x(t) = ‖v(t)‖_∞,
  c(t) = c_{i_0} − a_{i_0 i_0}^+ m_{i_0} − Σ_{j≠i_0} |a_{i_0 j}| m_j,
  b(t) = Σ_j |b_{i_0 j}| ν_j.

Then we have

  dx(t)/dt ≤ −c(t) x(t) + b(t) x(t − τ),   (3.8)

where the functions c(t), b(t) are as defined in lemma 1. By lemma 1, we conclude that if the activation functions f_i, g_i, i = 1, …, N, satisfy H*1 and H*2 in the H* condition and

  −c_i + a_ii^+ m_i + Σ_{j≠i} |a_ij| m_j + Σ_j |b_ij| ν_j ≤ −γ < 0   (3.9)

holds for i = 1, …, N, where a^+ = max{a, 0}, then the functions c(t), b(t) satisfy the conditions described in lemma 1. By lemma 1, we conclude that

  |x(t)| ≤ C δ^{γt/(γτ + |ln δ|)},   (3.10)

where

  δ = max_i [ Σ_j |b_ij| ν_j ] / [ c_i − a_ii^+ m_i − Σ_{j≠i} |a_ij| m_j ] < 1.
This means that the equilibrium u* is globally and exponentially asymptotically stable.

Theorem 2. If the activation functions f_i, g_i, i = 1, …, N, satisfy H*1 and H*2 in the H* condition, and the coefficients of the dynamical system 1.8 satisfy

  −c_j + m_j ( a_jj^+ + Σ_{i≠j} |a_ij| ) + max_j { ν_j Σ_i |b_ij| } ≤ −γ < 0,   j = 1, …, N,
then the dynamical system has a unique equilibrium u*, which is globally and exponentially asymptotically stable.

Corollary 2. If the activation functions f_i, g_i, i = 1, …, N, satisfy H**1 and H**2 in the H** condition, and the coefficients of equation 1.8 satisfy

  −c_j + m_j Σ_i |a_ij| + ν_j Σ_i |b_ij| ≤ −γ < 0   (3.11)

for j = 1, …, n, then there exists a unique equilibrium u*, which is globally and exponentially asymptotically stable.
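The conclusions of these theorems can be observed numerically: with illustrative weights satisfying the row condition 3.4, two trajectories of equation 1.8 started from different initial functions collapse onto the same equilibrium at an exponential rate. All parameter values below are assumptions chosen only so that the condition holds.

```python
import numpy as np

def simulate_18(c, A, B, I, phi, tau=1.0, t_end=40.0, dt=0.001):
    """Euler scheme for  du/dt = -c u + A f(u(t)) + B g(u(t - tau)) + I,  f = g = tanh,
    with a constant initial function phi on [-tau, 0]."""
    d = int(round(tau / dt))
    hist = [np.asarray(phi, float)] * (d + 1)
    for _ in range(int(round(t_end / dt))):
        u, u_del = hist[-1], hist[-1 - d]
        hist.append(u + dt * (-c * u + A @ np.tanh(u) + B @ np.tanh(u_del) + I))
    return np.array(hist)

n = 3
c = 2.0 * np.ones(n)
A = 0.2 * np.ones((n, n))
B = 0.1 * np.ones((n, n))
# Row margins of condition 3.4 (tanh gives m_j = nu_j = 1):
#   -2 + 0.2 + 2*0.2 + 3*0.1 = -1.1 < 0, so theorem 1 applies.
I = np.array([0.5, -0.2, 0.1])
u1 = simulate_18(c, A, B, I, phi=[2.0, 2.0, 2.0])
u2 = simulate_18(c, A, B, I, phi=[-2.0, 1.0, 0.0])
gap = np.abs(u1 - u2).max(axis=1)      # ||u1(t) - u2(t)||_inf over time
```

The gap starts at 4 and shrinks to numerical noise well before t = 40, consistent with the exponential estimate δ^{γt/(γτ + |ln δ|)}.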
Proof of Theorem 2. We have

  d‖v(t)‖_1/dt = Σ_i sign(v_i(t)) dv_i(t)/dt
    = Σ_i sign(v_i(t)) [ −c_i v_i(t) + Σ_j a_ij f_j^*(v_j(t)) + Σ_j b_ij g_j^*(v_j(t − τ)) ]
    = −Σ_i c_i |v_i(t)| + Σ_j [ sign(v_j(t)) a_jj f_j^*(v_j(t)) + Σ_{i≠j} sign(v_i(t)) a_ij f_j^*(v_j(t)) ] + Σ_j Σ_i sign(v_i(t)) b_ij g_j^*(v_j(t − τ))
    ≤ Σ_j [ −c_j + m_j ( a_jj^+ + Σ_{i≠j} |a_ij| ) ] |v_j(t)| + Σ_j ν_j Σ_i |b_ij| |v_j(t − τ)|
    ≤ max_j { −c_j + m_j ( a_jj^+ + Σ_{i≠j} |a_ij| ) } Σ_j |v_j(t)| + max_j { ν_j Σ_i |b_ij| } Σ_j |v_j(t − τ)|.   (3.12)
Let

  x(t) = Σ_j |v_j(t)|,
  c(t) ≡ c = min_j { c_j − m_j ( a_jj^+ + Σ_{i≠j} |a_ij| ) },
  b(t) ≡ b = max_j { ν_j Σ_i |b_ij| }.

Then we have

  dx(t)/dt ≤ −c(t) x(t) + b(t) x(t − τ),   (3.13)

and by lemma 1, we conclude that if the activation functions f_i, g_i, i = 1, …, N, satisfy H*1 in the H* condition and

  max_j { −c_j + m_j ( a_jj^+ + Σ_{i≠j} |a_ij| ) } + max_j { ν_j Σ_i |b_ij| } ≤ −γ < 0,

then the functions c(t), b(t) satisfy the conditions described in lemma 1. By lemma 1, we conclude that

  |x(t)| ≤ C δ^{γt/(γτ + |ln δ|)},   (3.14)

where

  δ = max_j { ν_j Σ_i |b_ij| } / min_j { c_j − m_j ( a_jj^+ + Σ_{i≠j} |a_ij| ) } < 1.   (3.15)
This means that the equilibrium u* is globally and exponentially asymptotically stable.

Theorem 3. If the activation functions f_i, g_i, i = 1, …, N, satisfy H*1 and H*2 in the H* condition, and the coefficients of the dynamical system 1.8 satisfy

  λ_max( −C + (1/2)(A^+ M + (A^+ M)^T) ) + [ λ_max( (BN)^T (BN) ) ]^{1/2} ≤ −γ < 0,

where λ(Q) represents the set of eigenvalues of the matrix Q, C = diag[c_1, …, c_n], A = (a_ij)_{i,j=1}^n, A^+ is the matrix obtained from A by replacing a_ii with a_ii^+ = max{0, a_ii}, B = (b_ij)_{i,j=1}^n, M = diag[m_1, …, m_n], and N = diag[ν_1, …, ν_n].
Then the dynamical system 1.8 has a unique equilibrium u*, which is globally and exponentially asymptotically stable.

Proof of Theorem 3. We have

  (1/2) d‖v(t)‖_2²/dt = Σ_i v_i(t) dv_i(t)/dt
    = Σ_i v_i(t) [ −c_i v_i(t) + Σ_j a_ij f_j^*(v_j(t)) + Σ_j b_ij g_j^*(v_j(t − τ)) ]
    ≤ Σ_i [ −c_i + a_ii^+ m_i ] v_i²(t) + Σ_i Σ_{j≠i} |a_ij| m_j |v_i(t) v_j(t)| + Σ_i Σ_j |b_ij| ν_j |v_i(t) v_j(t − τ)|.   (3.16)

Let

  x(t) = Σ_j v_j²(t).

Then it is easy to see that

  (1/2) dx(t)/dt ≤ λ_max( −C + (1/2)(A^+ M + (A^+ M)^T) ) x(t) + [ λ_max( (BN)^T (BN) ) ]^{1/2} x^{1/2}(t) x^{1/2}(t − τ).   (3.17)
Under the assumptions of theorem 3, x(t) satisfies the conditions of lemma 2, and theorem 3 is a consequence of lemma 2.

4 Comparisons
In this section, we compare various stability theorems on delayed neural networks.
The assumptions on the activation functions and the results in Cao and Zhou (1998) and van den Driessche and Zou (1998) are similar. Both discuss the case of different delays and use similar methods. This method is also similar to the method proposed in Gopalsamy and He (1991), which introduces similar Lyapunov functions. However, this method is not applicable to the proof of our theorems. The approach used in this article is totally different and was first proposed in Chen (1999). Theorem 1 in this article is given in the l_∞ norm, which cannot be obtained using the methods in Cao and Zhou (1998) and van den Driessche and Zou (1998). As we pointed out, theorem 1 is also applicable to the case of different delays. The conditions in theorem 2 seem stronger than those in Cao and Zhou (1998) and van den Driessche and Zou (1998), and theorem 2 is applicable only to the case of a common delay. However, our theorem 2 gives exponential asymptotic stability; their method cannot. Guan and Chen (1999) also discussed exponential stability; however, their conditions are too difficult to test. The conditions in theorem 3 are not comparable with the conditions in the previous articles.

5 Conclusions
We have discussed the stability of some delayed dynamical systems and provided conditions under which these systems are exponentially stable. We have also compared the theorems given here with other results.

Acknowledgments
T. Chen is supported by the National Natural Science Foundation of China under grant 69982003.

References

Amari, S. (1971). Characteristics of randomly connected threshold element networks and neural systems. Proc. IEEE, 59, 35–47.
Amari, S. (1972). Characteristics of random nets of analog neuron-like elements. IEEE Trans. Systems, Man, and Cybernetics, SMC-2, 643–657.
Belair, J., Campbell, S. A., & van den Driessche, P. (1996). Stability and delay-induced oscillations in a neural network model. SIAM J. Appl. Math., 56, 245–255.
Cao, J. D., & Zhou, D. (1998). Stability analysis of delayed cellular neural networks. Neural Networks, 11, 1601–1605.
Chen, T. P. (1999). Convergence of delayed dynamical systems. Neural Processing Letters, 10, 267–271.
Chen, T. P., & Amari, S. (2000). Stability of asymmetric Hopfield networks. IEEE Transactions on Neural Networks.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE
Trans. Systems, Man, and Cybernetics, 13, 815–826.
Fang, Y., & Kincaid, T. G. (1996). Stability analysis of dynamical neural networks. IEEE Trans. Neural Networks, 7, 996–1006.
Forti, M., Maneti, S., & Marini, M. (1994). Necessary and sufficient conditions for absolute stability of neural networks. IEEE Trans. Circuits Syst.-I: Fundamental Theory Appl., 41, 491–494.
Gopalsamy, K., & He, X. (1991). Stability in asymmetric Hopfield nets with transmission delays. Physica D, 76, 344–358.
Guan, Z. H., & Chen, G. (1999). On delayed impulsive Hopfield neural networks. Neural Networks, 12, 273–280.
Hirsch, M. W. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2, 331–349.
Hopfield, J. J., & Tank, D. W. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Nat. Acad. Sci., 81, 3088–3092.
Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625–633.
Kaszkurewicz, E., & Bhaya, A. (1994). On a class of globally stable neural circuits. IEEE Trans. Circuits Syst.-I: Fundamental Theory Appl., 41, 171–174.
Kelly, D. G. (1990). Stability in contractive nonlinear neural networks. IEEE Trans. Biomed. Eng., 37, 231–242.
Li, J. H., Michel, A. N., & Porod, W. (1988). Qualitative analysis and synthesis of a class of neural networks. IEEE Trans. Circuits Syst., 35, 976–985.
Marcus, C. M., & Westervelt, R. M. (1989). Stability of analog neural networks with delay. Phys. Rev. A, 39, 347–359.
Matsuoka, K. (1992). Stability conditions for nonlinear continuous neural networks with asymmetric connection weights. Neural Networks, 5, 495–500.
Michel, A. N., Farrel, J. A., & Porod, W. (1989). Qualitative analysis of neural networks. IEEE Trans. Circuits Syst., 36, 229–243.
Michel, A. N., & Gray, D. L. (1990). Analysis and synthesis of neural networks with lower block triangular interconnecting structure. IEEE Trans. Circuits Syst., 37, 1267–1283.
Sompolinsky, H., & Crisanti, A. (1988).
Chaos in random neural networks. Physical Review Letters, 61, 259–262.
van den Driessche, P., & Zou, X. (1998). Global attractivity in delayed Hopfield neural network models. SIAM J. Appl. Math., 58, 1878–1890.
Wilson, H. R. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical J., 12, 1–24.
Yang, H., & Dillon, T. S. (1994). Exponential stability and oscillation of Hopfield graded response neural network. IEEE Trans. Neural Networks, 5, 719–729.

Received November 12, 1999; accepted March 30, 2000.
LETTER
Communicated by John Platt
Improvements to Platt's SMO Algorithm for SVM Classifier Design

S. S. Keerthi
Department of Mechanical and Production Engineering, National University of Singapore, Singapore-119260

S. K. Shevade, C. Bhattacharyya, K. R. K. Murthy
Department of Computer Science and Automation, Indian Institute of Science, Bangalore-560012, India

This article points out an important source of inefficiency in Platt's sequential minimal optimization (SMO) algorithm that is caused by the use of a single threshold value. Using clues from the KKT conditions for the dual problem, two threshold parameters are employed to derive modifications of SMO. These modified algorithms perform significantly faster than the original SMO on all benchmark data sets tried.

1 Introduction
In the past few years, there has been a lot of excitement and interest in support vector machines (Vapnik, 1995; Burges, 1998) because they have yielded excellent generalization performance on a wide range of problems. Recently, fast iterative algorithms that are also easy to implement have been suggested (Platt, 1998; Joachims, 1998; Mangasarian & Musicant, 1998; Friess, 1998; Keerthi, Shevade, Bhattacharyya, & Murthy, 1999a). Platt's sequential minimal optimization (SMO) algorithm (Platt, 1998, 1999) is an important example. A remarkable feature of SMO is that it is also extremely easy to implement. Platt's comparative testing against other algorithms has shown that SMO is often much faster and has better scaling properties. In this article we enhance the value of SMO even further. In particular, we point out an important source of inefficiency caused by the way SMO maintains and updates a single threshold value. Taking clues from optimality criteria associated with the KKT conditions for the dual problem, we suggest the use of two threshold parameters and devise two modified versions of SMO that are more efficient than the original SMO. Computational comparison on benchmark data sets shows that the modifications perform significantly faster than the original SMO in most situations. The ideas mentioned in this article can also be applied to the SMO regression algorithm (Smola, 1998). Neural Computation 13, 637–649 (2001)
© 2001 Massachusetts Institute of Technology
We report the results of that extension in Shevade, Keerthi, Bhattacharyya, and Murthy (1999). In section 2 we briefly discuss the SVM problem formulation, the dual problem, and the associated KKT optimality conditions. We also point out how these conditions lead to proper criteria for terminating algorithms for designing SVM classifiers. Section 3 gives a short summary of Platt's SMO algorithm. In section 4 we point out the inefficiency associated with the way SMO uses a single threshold value, and we describe the modified algorithms in section 5. Computational comparison is done in section 6.

2 SVM Problem and Optimality Conditions
The basic problem addressed in this article is the two-category classification problem. Burges (1998) gives a good overview of the solution of this problem using SVMs. Throughout this article we will use x to denote the input vector of the SVM and z to denote the feature space vector, which is related to x by a transformation, z = φ(x). As in all other SVM designs, we do not assume φ to be known; all computations will be done using only the kernel function, k(x, x̂) = φ(x) · φ(x̂), where "·" denotes the inner product in the z space. Let {(xᵢ, yᵢ)} denote the training set, where xᵢ is the ith input pattern and yᵢ is the corresponding target value; yᵢ = 1 means xᵢ is in class 1, and yᵢ = −1 means xᵢ is in class 2. Let zᵢ = φ(xᵢ). The optimization problem solved by the SVM is:

  min (1/2)‖w‖² + C Σᵢ ξᵢ  s.t.  yᵢ(w · zᵢ − b) ≥ 1 − ξᵢ ∀i;  ξᵢ ≥ 0 ∀i.  (2.1)

This problem is referred to as the primal problem. Let us define w(α) = Σᵢ αᵢyᵢzᵢ. We will refer to the αᵢ's as Lagrange multipliers. The αᵢ's are obtained by solving the following dual problem:

  max W(α) = Σᵢ αᵢ − (1/2) w(α) · w(α)  s.t.  0 ≤ αᵢ ≤ C ∀i;  Σᵢ αᵢyᵢ = 0.  (2.2)
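In kernel form, W(α) depends only on k. As a minimal illustrative sketch (the function and variable names here are ours, not from the article), the dual objective can be evaluated from a precomputed kernel matrix K with K[i, j] = k(xᵢ, xⱼ):

```python
import numpy as np

def dual_objective(alpha, y, K):
    # W(alpha) = sum_i alpha_i - 0.5 * sum_{i,j} y_i y_j alpha_i alpha_j k(x_i, x_j)
    v = alpha * y
    return alpha.sum() - 0.5 * (v @ K @ v)
```

The quadratic term is exactly w(α) · w(α) written in kernel form, as noted below equation 2.2.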
Once the αᵢ's are obtained, the other primal variables, w, b, and ξ, can be easily determined by using the KKT conditions for the primal problem. It is possible that the solution is nonunique; for instance, when all αᵢ's take the boundary values of 0 and C, it is possible that b is not unique. The numerical approach in SVM design is to solve the dual (instead of the primal) because it is a finite-dimensional optimization problem. (Note that w(α) · w(α) = Σᵢ Σⱼ yᵢyⱼαᵢαⱼ k(xᵢ, xⱼ).) To derive proper stopping conditions for algorithms that solve the dual, it is important to write down the optimality conditions for the dual. The Lagrangian for the dual is:

  L̄ = (1/2) w(α) · w(α) − Σᵢ αᵢ − Σᵢ δᵢαᵢ + Σᵢ μᵢ(αᵢ − C) − β Σᵢ αᵢyᵢ.
Define

  Fᵢ = w(α) · zᵢ − yᵢ = Σⱼ αⱼyⱼ k(xᵢ, xⱼ) − yᵢ.

The KKT conditions for the dual problem are:

  ∂L̄/∂αᵢ = (Fᵢ − β)yᵢ − δᵢ + μᵢ = 0,  δᵢ ≥ 0,  δᵢαᵢ = 0,  μᵢ ≥ 0,  μᵢ(αᵢ − C) = 0  ∀i.

These conditions¹ can be simplified by considering three cases for each i:

  Case 1. αᵢ = 0:  δᵢ ≥ 0, μᵢ = 0 ⇒ (Fᵢ − β)yᵢ ≥ 0.  (2.3a)
  Case 2. 0 < αᵢ < C:  δᵢ = 0, μᵢ = 0 ⇒ (Fᵢ − β)yᵢ = 0.  (2.3b)
  Case 3. αᵢ = C:  δᵢ = 0, μᵢ ≥ 0 ⇒ (Fᵢ − β)yᵢ ≤ 0.  (2.3c)
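For a given candidate multiplier β, these three cases can be checked directly. A small sketch (names are ours); in floating-point practice the equality in case 2 is replaced by a tolerance, which is what the approximate conditions 2.10a through 2.10c formalize:

```python
def satisfies_kkt(alpha_i, F_i, y_i, beta, C):
    # Cases 1-3 (eqs. 2.3a-2.3c) for a single index i.
    g = (F_i - beta) * y_i
    if alpha_i == 0:
        return g >= 0        # case 1
    if alpha_i == C:
        return g <= 0        # case 3
    return g == 0            # case 2: 0 < alpha_i < C
```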
Define the following index sets at a given α: I₀ = {i : 0 < αᵢ < C}; I₁ = {i : yᵢ = 1, αᵢ = 0}; I₂ = {i : yᵢ = −1, αᵢ = C}; I₃ = {i : yᵢ = 1, αᵢ = C}; and I₄ = {i : yᵢ = −1, αᵢ = 0}. Note that these index sets depend on α. The conditions in equations 2.3a through 2.3c can be rewritten as

  β ≤ Fᵢ ∀ i ∈ I₀ ∪ I₁ ∪ I₂;  β ≥ Fᵢ ∀ i ∈ I₀ ∪ I₃ ∪ I₄.  (2.4)

It is easily seen that optimality conditions will hold iff there exists a β satisfying equation 2.4. The following result is an immediate consequence.

Lemma 1. Define:

  b_up = min{Fᵢ : i ∈ I₀ ∪ I₁ ∪ I₂}  and  b_low = max{Fᵢ : i ∈ I₀ ∪ I₃ ∪ I₄}.  (2.5)

Then optimality conditions will hold at some α iff

  b_low ≤ b_up.  (2.6)
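Lemma 1 translates directly into a check on two extrema. A hedged sketch (names are ours), with NumPy boolean masks standing in for the index sets:

```python
import numpy as np

def b_bounds(alpha, y, K, C):
    # F_i = sum_j alpha_j y_j k(x_i, x_j) - y_i
    F = K @ (alpha * y) - y
    I0 = (alpha > 0) & (alpha < C)
    up = I0 | ((y == 1) & (alpha == 0)) | ((y == -1) & (alpha == C))    # I0 u I1 u I2
    low = I0 | ((y == 1) & (alpha == C)) | ((y == -1) & (alpha == 0))   # I0 u I3 u I4
    b_up = F[up].min()
    b_low = F[low].max()
    return b_low, b_up   # optimality holds iff b_low <= b_up (eq. 2.6)
```

For instance, at the all-zero starting point α = 0 on a separable toy problem, b_low > b_up, signaling correctly that optimality does not yet hold.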
¹ The KKT conditions are both necessary and sufficient for optimality. Hereafter we will simply refer to them as optimality conditions.
It is easy to see the close relationship between the threshold parameter b in the primal problem and the multiplier β. In particular, at optimality, b and β are identical. Therefore, in the rest of the article, b and β will denote the same quantity. We will say that an index pair (i, j) defines a violation at α if one of the following sets of conditions holds:

  i ∈ I₀ ∪ I₃ ∪ I₄, j ∈ I₀ ∪ I₁ ∪ I₂ and Fᵢ > Fⱼ  (2.7a)
  i ∈ I₀ ∪ I₁ ∪ I₂, j ∈ I₀ ∪ I₃ ∪ I₄ and Fᵢ < Fⱼ.  (2.7b)

Note that optimality conditions will hold at α iff there does not exist any index pair (i, j) that defines a violation. Since, in numerical solution, it is usually not possible to achieve optimality exactly, there is a need to define approximate optimality conditions. Condition 2.6 can be replaced by

  b_low ≤ b_up + 2τ,  (2.8)

where τ is a positive tolerance parameter. (In the pseudocode given in Platt, 1998, this parameter is referred to as tol.) Correspondingly, the definition of violation can be altered by replacing equations 2.7a and 2.7b by:

  i ∈ I₀ ∪ I₃ ∪ I₄, j ∈ I₀ ∪ I₁ ∪ I₂ and Fᵢ > Fⱼ + 2τ  (2.9a)
  i ∈ I₀ ∪ I₁ ∪ I₂, j ∈ I₀ ∪ I₃ ∪ I₄ and Fᵢ < Fⱼ − 2τ.  (2.9b)

Hereafter, when optimality is mentioned, it will mean approximate optimality. Since b can be placed halfway between b_low and b_up, approximate optimality conditions will hold iff there exists a b such that equations 2.3a through 2.3c are satisfied with a τ-margin, that is:

  (Fᵢ − b)yᵢ ≥ −τ  if αᵢ = 0  (2.10a)
  |Fᵢ − b| ≤ τ  if 0 < αᵢ < C  (2.10b)
  (Fᵢ − b)yᵢ ≤ τ  if αᵢ = C.  (2.10c)
These three equations, 2.10a through 2.10c, are the approximate optimality conditions employed by Platt (1998), Joachims (1998), and others. Note that if equations 2.10a through 2.10c are violated for one value of b, that does not mean that the optimality conditions are violated. It is also possible to give an alternate approximate optimality condition based on the closeness of W(α) to the optimal value, estimated using duality gap ideas. This is fine, but care is needed in choosing the tolerance used (see Keerthi et al., 1999a, for a related discussion). This criterion has the disadvantage that it is a single global condition involving all i. On the
other hand, equation 2.9 consists of an individual condition for each pair of indices, and hence it is much better suited to the methods discussed here. In particular, when (i, j) satisfies equation 2.9, a strict improvement in the dual objective function can be achieved by optimizing only αᵢ and αⱼ. (This is true even if τ = 0.)

3 Platt's SMO Algorithm
We now give a brief description of the SMO algorithm. The basic step consists of choosing a pair of indices, (i₁, i₂), and optimizing the dual objective function by adjusting α_{i1} and α_{i2} only. Because the working set is only of size 2 and the equality constraint in equation 2.2 can be used to eliminate one of the two Lagrange multipliers, the optimization problem at each step is a quadratic minimization in just one variable. It is straightforward to write down an analytic solution for it. (Complete details are given in Platt, 1998.) The procedure takeStep (which is a part of the pseudocode given there) gives a clear description of the implementation. There is no need to recall all details here. We make only one important comment on the role of the threshold parameter, b. As in Platt (1998), define the output error on the ith pattern as Eᵢ = Fᵢ − b.
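The one-variable quadratic step has a standard closed-form solution (full details are in Platt, 1998). The following is only a sketch in our own notation, and it omits the special handling SMO applies when the curvature η is nonpositive:

```python
def take_step_pair(a1, a2, y1, y2, E1, E2, K11, K22, K12, C):
    # Feasible segment [L, H] for the new alpha_2, from 0 <= alpha <= C
    # and the equality constraint sum_i alpha_i y_i = 0.
    if y1 != y2:
        L, H = max(0.0, a2 - a1), min(C, C + a2 - a1)
    else:
        L, H = max(0.0, a1 + a2 - C), min(C, a1 + a2)
    eta = K11 + K22 - 2.0 * K12   # curvature along the constraint line
    if L >= H or eta <= 0.0:
        return a1, a2             # degenerate case: no step taken in this sketch
    a2_new = a2 + y2 * (E1 - E2) / eta     # unconstrained optimum
    a2_new = min(H, max(L, a2_new))        # clip to the feasible segment
    a1_new = a1 + y1 * y2 * (a2 - a2_new)  # restore the equality constraint
    return a1_new, a2_new
```

Only the difference E₁ − E₂ = F₁ − F₂ enters the update, which is why knowledge of b is not needed to take a step.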
Consistent with the pseudocode of Platt (1998), let us call the indices of the two multipliers chosen for optimization in one step i₂ and i₁. A look at the details in Platt (1998) shows that to take a step by varying α_{i1} and α_{i2}, we only need to know E_{i1} − E_{i2} = F_{i1} − F_{i2}. Therefore, knowledge of the value of b is not needed to take a step. The method followed to choose i₁ and i₂ at each step is crucial for efficient solution of the problem. Based on a number of experiments, Platt came up with a good set of heuristics. He employs a two-loop approach: the outer loop chooses i₂, and for a chosen i₂, the inner loop chooses i₁. The outer loop iterates over all patterns violating the optimality conditions, first only over those with Lagrange multipliers on neither the upper nor lower boundary (in Platt's pseudocode, this looping is indicated by examineAll=0) and, once all of them are satisfied, over all patterns violating the optimality conditions (examineAll=1) to ensure that the problem has indeed been solved. For efficient implementation, Platt maintains and updates a cache of Eᵢ values for indices i corresponding to nonboundary multipliers. The remaining Eᵢ are computed as and when needed. Let us now see how the SMO algorithm chooses i₁. The aim is to make a large increase in the objective function. Since it is expensive to try out all possible choices of i₁ and choose the one that gives the best increase in the objective function, the index i₁ is chosen to maximize |E_{i2} − E_{i1}|. (If we define r(t) = W(α(t)), where α_{i1}(t) = α_{i1}^old + y_{i1}t, α_{i2}(t) = α_{i2}^old − y_{i2}t, and αᵢ(t) = αᵢ^old ∀ i ∉ {i₁, i₂}, then |r′(0)| = |E_{i1} − E_{i2}|.) Since Eᵢ is available in cache for
nonboundary multiplier indices, only such indices are initially used in the above choice of i₁. If such a choice of i₁ does not yield sufficient progress, then the following steps are taken. Starting from a randomly chosen index, all indices corresponding to nonbound multipliers are tried, one by one, as choices for i₁. If sufficient progress still is not possible, all indices are tried, one by one, as choices for i₁, again starting from a randomly chosen index. Thus, the choice of random seed affects the running time of SMO. Although a value of b is not needed to take a step, it is needed if equations 2.10a through 2.10c are employed for checking optimality. In the SMO algorithm, b is updated after each step. A value of b is chosen so as to satisfy equation 2.3 for i ∈ {i₁, i₂}. If, after a step involving (i₁, i₂), one of α_{i1}, α_{i2} (or both) takes a nonboundary value, then equation 2.3b is exploited to update the value of b. In the rare case that this does not happen, there exists a whole interval, say [b_low, b_up], of admissible thresholds. In this situation SMO simply chooses b = (b_low + b_up)/2.

4 Inefficiency of the SMO Algorithm
SMO is a carefully organized algorithm with excellent computational efficiency. However, because of its way of computing and using a single threshold value, it can carry some unnecessary inefficiency. At any instant, the SMO algorithm fixes b based on the current two indices that are being optimized. However, while checking whether the remaining examples violate optimality, it is quite possible that a different, shifted choice of b may do a better job. So in the SMO algorithm, it is quite possible that although α has reached a value where optimality is satisfied (that is, equation 2.8 holds), SMO has not detected this because it has not identified the correct choice of b. It is also quite possible that a particular index may appear to violate the optimality conditions because equation 2.10 is employed using an "incorrect" value of b, although this index may not be able to pair with another to define a violation. In such a situation, the SMO algorithm does an expensive and wasteful search for a second index so as to take a step. We believe that this is an important source of inefficiency in the SMO algorithm. There is one simple alternate way of choosing b that involves all indices. By duality theory, the objective function value in equation 2.1 of a primal feasible solution is greater than or equal to the objective function value in equation 2.2 of a dual feasible solution. The difference between these two values is referred to as the duality gap. The duality gap is zero only at optimality. Suppose α is given and w = w(α). The ξᵢ can be chosen optimally (as a function of b). The result is that the duality gap is expressed as a function of b only. One possible way of improving the SMO algorithm is always to choose b so as to minimize the duality gap (this idea was suggested by one of the reviewers). This corresponds to
the subproblem,

  min_b Σᵢ max{0, yᵢ(b − Fᵢ)}.

This is an easy problem to solve. Let p denote the number of indices in class 2, that is, the number of i having yᵢ = −1. In an increasing-order arrangement of {Fᵢ}, let f_p and f_{p+1} be the pth and (p+1)th values. Then any b in the interval [f_p, f_{p+1}] is a minimizer. The determination of f_p and f_{p+1} can be done efficiently using a median-finding technique. Since all Fᵢ are typically not available at a given stage of the algorithm, it is appropriate to apply the above idea to the subset of indices for which Fᵢ are available. (This set contains I₀.) We implemented this idea and tested it on some benchmark problems. It gave a mixed performance, performing well in some situations, poorly in some, and failing in a few cases. More detailed study is needed to understand the cause of this. (See section 6 for details of the performance on three examples.)

5 Modifications of the SMO Algorithm
In this section we suggest two modified versions of the SMO algorithm, each of which overcomes the problems mentioned in the last section. As we will see in the computational evaluation of section 6, these modifications are almost always better than the original SMO algorithm, and in most situations, they give quite good improvement in efficiency when tested on several benchmark problems. In short, the modifications avoid the use of a single threshold value b and the use of equations 2.10a through 2.10c for checking optimality. Instead, two threshold parameters, b_up and b_low, are maintained, and equation 2.8 or 2.9 is employed for checking optimality. Assuming that the reader is familiar with Platt (1998) and the pseudocodes given there, we give only a set of pointers that describe the changes that are made to Platt's SMO algorithm. Pseudocodes that fully describe these changes can be found in Keerthi et al. (1999b).

1. Suppose, at any instant, Fᵢ is available for all i. Let i_low and i_up be indices such that

  F_{i_low} = b_low = max{Fᵢ : i ∈ I₀ ∪ I₃ ∪ I₄}  (5.1a)
  F_{i_up} = b_up = min{Fᵢ : i ∈ I₀ ∪ I₁ ∪ I₂}.  (5.1b)

Then checking a particular i for optimality is easy. For example, suppose i ∈ I₁ ∪ I₂. We have to check only whether Fᵢ < F_{i_low} − 2τ. If this condition holds, then there is a violation, and in that case SMO's takeStep procedure can be applied to the index pair (i, i_low). Similar steps can be given for indices in the other sets. Thus, in our approach, the checking of optimality of i₂ and the choice of the second index i₁ go hand in hand, unlike the original
SMO algorithm. As we will see below, we compute and use (i_low, b_low) and (i_up, b_up) via an efficient updating process.

2. To be efficient, we would, as in the SMO algorithm, spend much of the effort altering αᵢ, i ∈ I₀; a cache of Fᵢ, i ∈ I₀, is maintained and updated to do this efficiently. Only when optimality holds for all i ∈ I₀ do we examine all indices for optimality.

3. Some extra steps are added to the takeStep procedure. After a successful step using a pair of indices, (i₂, i₁), let Ĩ = I₀ ∪ {i₁, i₂}. We compute, partially, (i_low, b_low) and (i_up, b_up) using Ĩ only (that is, using only i ∈ Ĩ in equations 5.1a and 5.1b). Note that these extra steps are inexpensive because the cache of {Fᵢ, i ∈ I₀} is available and updates of F_{i1}, F_{i2} are easily done. A careful look shows that since i₂ and i₁ have just been involved in a successful step, each of the two sets, Ĩ ∩ (I₀ ∪ I₁ ∪ I₂) and Ĩ ∩ (I₀ ∪ I₃ ∪ I₄), is nonempty; hence the partially computed (i_low, b_low) and (i_up, b_up) will not be null elements. Since i_low and i_up could take values from {i₂, i₁} and they are used as choices for i₁ in the subsequent step (see item 1 above), we keep the values of F_{i1} and F_{i2} in cache as well.

4. When working only with αᵢ, i ∈ I₀, that is, a loop with examineAll=0, one should note that if equation 2.8 holds at some point, then optimality holds as far as I₀ is concerned. (This is because, as mentioned in item 3 above, the choice of b_low and b_up is influenced by all indices in I₀.) This gives an easy way of exiting this loop.

5. There are two ways of implementing the loop involving indices in I₀ only (examineAll=0):

  Method 1. This is in line with what is done in SMO. Loop through all i₂ ∈ I₀. For each i₂, check optimality and, if violated, choose i₁ appropriately. For example, if F_{i2} < F_{i_low} − 2τ, then there is a violation, and in that case choose i₁ = i_low.

  Method 2. Always work with the worst violating pair; choose i₂ = i_low and i₁ = i_up.
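Method 2's worst-violating-pair choice is simple to state in code. A sketch (names are ours), where `up` and `low` are boolean masks over the cached Fᵢ for I₀ ∪ I₁ ∪ I₂ and I₀ ∪ I₃ ∪ I₄, respectively:

```python
def worst_violating_pair(F, up, low, tau):
    # i2 = i_low = argmax{F_i : i in I0 u I3 u I4},
    # i1 = i_up  = argmin{F_i : i in I0 u I1 u I2}
    i_low = max((i for i in range(len(F)) if low[i]), key=lambda i: F[i])
    i_up = min((i for i in range(len(F)) if up[i]), key=lambda i: F[i])
    if F[i_low] > F[i_up] + 2 * tau:   # eq. 2.8 is violated
        return i_low, i_up
    return None                        # approximate optimality on this index set
```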
Depending on which one of these methods is used, we refer to the resulting overall modification of SMO as SMO-Modification 1 and SMO-Modification 2. SMO and SMO-Modification 1 are identical except in the way optimality is tested. SMO-Modification 2 can be thought of as a further improvement of SMO-Modification 1, where the cache is effectively used to choose the violating pair when examineAll=0.

6. When optimality on I₀ holds, we come back to check optimality on all indices (examineAll=1). Here we loop through all indices, one by one. Since (b_low, i_low) and (b_up, i_up) have been partially computed using I₀ only, we update these quantities as each i is examined. For a given i, Fᵢ is computed first, and optimality is checked using the current (b_low, i_low) and (b_up, i_up); if there is no violation, Fᵢ is used to update these quantities. For example, if i ∈ I₁ ∪ I₂ and Fᵢ < b_low − 2τ, then there is a violation, in which case we
Table 1: Data Set Properties.

  Data Set                   σ²      n       m
  Wisconsin Breast Cancer    4.0     9       683
  Adult-7                   10.0   123    16,100
  Web-7                     10.0   300    24,692
take a step using (i, i_low). On the other hand, if there is no violation, then (i_up, b_up) is modified using Fᵢ; that is, if Fᵢ < b_up, then we do i_up := i and b_up := Fᵢ.

7. Suppose we do as described in the previous item. What happens if there is no violation for any i in a loop having examineAll=1? Can we conclude that optimality holds for all i? The answer is yes. This is easy to see from the following argument. Suppose, by contradiction, there does exist one (i, j) pair such that they define a violation, that is, they satisfy equation 2.9. Let us say i < j. Then j would not have satisfied the optimality check in the described implementation, because Fᵢ would have, before j is seen, affected the calculation of b_low or b_up, or both. In other words, even if i is mistakenly taken as having satisfied optimality earlier in the loop, j will be detected as violating optimality when it is analyzed. Only when equation 2.8 holds is it possible for all indices to satisfy the optimality checks. Furthermore, when equation 2.8 holds and the loop over all indices has been completed, the true values of b_up and b_low, as defined in equation 2.5, would have been computed, since all indices have been encountered.

6 Computational Comparison
In this section we compare the performance of our modifications against the original SMO algorithm and the SMO algorithm that uses duality gap ideas for optimizing b. We implemented all these methods in Fortran and ran them using f77 on a 200 MHz Pentium machine. The value τ = 0.001 was used for all experiments. Although we have tested on a number of problems, here we report the performance on only three: Wisconsin Breast Cancer data (Bennett & Mangasarian, 1992), UCI Adult data (Platt, 1998), and Web page classification data (Joachims, 1998; Platt, 1998). In the case of the Adult and Web data sets, the inputs are represented in a special binary format, as used by Platt in his testing of SMO. To study scaling properties as training data grow, Platt did staged experiments on the Adult and Web data. We have used only the data from the seventh stage. The gaussian kernel, k(xᵢ, xⱼ) = exp(−0.5‖xᵢ − xⱼ‖² / σ²), was used in all experiments. The σ² values employed, together with n, the dimension of the input, and m, the number of training points, are given in Table 1.
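The gaussian kernel matrix used in these experiments can be computed in one shot. A small sketch (ours, not from the article) using the identity ‖xᵢ − xⱼ‖² = ‖xᵢ‖² + ‖xⱼ‖² − 2xᵢ·xⱼ:

```python
import numpy as np

def gaussian_kernel(X, sigma2):
    # K[i, j] = exp(-0.5 * ||x_i - x_j||^2 / sigma2), with X of shape (m, n)
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-0.5 * d2 / sigma2)
```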
Figure 1: Wisconsin Breast Cancer data. CPU time (in seconds) is shown as a function of C, for "smo", "smo duality gap", "smo mod 1", and "smo mod 2".
When a particular method is used for SVM design, the value of C is usually unknown. It has to be chosen by trying a number of values and using a validation set. Therefore, good performance of a method over a range of C values is important. So for each problem, we tested the algorithms over an appropriate range of C values. Figures 1 through 3 give the computational costs for the three problems. For the Adult data set, the SMO algorithm based on duality gap ideas had to be terminated for extreme C values since excessive computing times were needed. In the case of Web data, this method failed for C = 0.2. It is clear that the modifications outperform the original SMO algorithm. In most situations, the improvement in efficiency is very good. Between the two modifications, the second one fares better overall. The SMO algorithm based on duality gap ideas did not fare well overall. Although in some cases it did well, in other cases it became slower than the original SMO, and it failed in some cases.

Figure 2: Adult-7 data. CPU time (in seconds) shown as a function of C.

Figure 3: Web-7 data. CPU time (in seconds) shown as a function of C.

7 Conclusion

In this article we have pointed out an important source of inefficiency in Platt's SMO algorithm that is caused by the operation with a single threshold value. We have suggested two modifications of the SMO algorithm that overcome the problem by efficiently maintaining and updating two threshold parameters. Our computational experiments show that these modifications speed up the SMO algorithm considerably in many situations. Platt has already established that the SMO algorithm is one of the fastest algorithms for SVM design. The modified versions of SMO presented here enhance the value of the SMO algorithm even further. It should also be mentioned that heuristics such as shrinking and kernel caching that are effectively used in Joachims (1998) can be employed to speed up our SMO modifications even more. The ideas mentioned in this article for SVM classification can also be extended to the SMO regression algorithm (Smola, 1998). We have reported those results in Shevade et al. (1999).

References

Bennett, R., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23–34.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 3(2).
Friess, T. T. (1998). Support vector networks: The kernel Adatron with bias and soft margin (Tech. Rep.). Sheffield, England: University of Sheffield, Department of Automatic Control and Systems Engineering.
Joachims, T. (1998). Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: Support vector machines. Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999a). A fast iterative nearest point algorithm for support vector machine classifier design (Tech. Rep. No. TR-ISL-99-03). Bangalore, India: Intelligent Systems Lab, Department of Computer Science and Automation, Indian Institute of Science. Available online at: http://guppy.mpe.nus.edu.sg/~mpessk.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999b). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14).
Singapore: Control Division, Department of Mechanical and Production Engineering, National University of Singapore. Available online at: http://guppy.mpe.nus.edu.sg/~mpessk.
Mangasarian, O. L., & Musicant, D. R. (1998). Successive overrelaxation for support vector machines (Tech. Rep.). Madison, WI: Computer Sciences Department, University of Wisconsin.
Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: Support vector machines. Cambridge, MA: MIT Press.
Platt, J. C. (1999). Using sparseness and analytic QP to speed training of support vector machines. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Shevade, S. K., Keerthi, S. S., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to the SMO algorithm for SVM regression (Tech. Rep. No. CD-99-16). Singapore: Control Division, Department of Mechanical and Production Engineering, National University of Singapore. Available online at: http://guppy.mpe.nus.edu.sg/~mpessk.
Smola, A. J. (1998). Learning with kernels. Unpublished doctoral dissertation, Technische Universität Berlin.
Vapnik, V. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag.

Received August 19, 1999; accepted May 17, 2000.
LETTER
Communicated by Dana Ballard
Learning Hough Transform: A Neural Network Model

Jayanta Basak
Machine Intelligence Unit, Indian Statistical Institute, Calcutta 700 035, India

A single-layered Hough transform network is proposed that accepts image coordinates of each object pixel as input and produces a set of outputs that indicate the belongingness of the pixel to a particular structure (e.g., a straight line). The network is able to learn adaptively the parametric forms of the linear segments present in the image. It is designed for learning and identification not only of linear segments in two-dimensional images but also of planes and hyperplanes in higher-dimensional spaces. It provides an efficient representation of visual information embedded in the connection weights. The network not only reduces the large space requirement, as in the case of the classical Hough transform, but also represents the parameters with high precision.

1 Introduction
The Hough transform is used to generate global descriptions of some particular shapes in parametric representation from the local measurements of the features. The number of solution classes (i.e., the number of shapes) may not be known a priori. The classical Hough transform is most commonly employed for the detection of lines, circles, ellipses, and other parametric shapes. Problems of computer vision (Duda & Hart, 1972; Ballard & Brown, 1982) like edge or line linking and boundary detection exploit the concept of the Hough transform. The Hough transform has also been applied in independent component analysis (ICA), where a highly peaked source signal distribution is considered (Lin, Grier, & Cowan, 1997). One of the basic advantages of the Hough transform is that it works even in the presence of small gaps (loss of connectivity) in line or edge segments that may be created due to the presence of noise. The Hough transform (and the generalized Hough transform) converts lines or edges (curves or boundaries) in an image space into clusters of points in the parameter space. The task is then to separate out (detect) these clusters (peaks) in the parameter space. The classical Hough transform for the detection of straight line segments works as follows. Each point on a straight line in the image space (corresponding to object pixels) can be parametrically expressed as

  h = x cos w + y sin w,
(1.1)
where (x, y) are the coordinates of a point on the line segment in the image, h and w being the parameters specifying the line segment.

Neural Computation 13, 651–676 (2001)  © 2001 Massachusetts Institute of Technology

As a dual representation, a point (x, y) in the image space corresponds to a curve (sinusoidal in nature) in the (h, w) space (i.e., the parameter space). In the noiseless condition, all the sinusoidal curves in the parameter space corresponding to all data points lying on the straight line segment, equation 1.1, intersect at exactly one point specifying the parameter values of the line segment. In other words, if the point of intersection in the parameter space can be identified, then the line segment in the image space can be specified in its parametric form. In the noisy condition, there may not exist any such point of intersection, but a point in the parameter space can be identified such that all the parametric curves are within a certain distance from that point. If there exists only one straight line segment (that is, exactly one point of intersection in the parameter space), then the problem is identical to linear regression, where a linear least-square fit also can provide the solution. However, the Hough transform is useful if more than one such line segment is present. In the presence of more than one line segment, more than one such point of intersection has to be identified in the parameter space. In order to identify computationally the representative points of intersection, the parameter space is quantized, and each slot represents a range (Δh, Δw). An accumulator array (A) is defined to store the parameter values. If the (h, w) value lies in the range (h₀ ± Δh/2, w₀ ± Δw/2), then the accumulator A(h₀, w₀) is incremented by unity. A variation of the Hough transform (Kimme, Ballard, & Sklansky, 1975; Wechsler & Sklansky, 1977) uses the local line directions to increment the accumulator values selectively. The accumulator will contain local peaks corresponding to the linear segments in the image. The peaks are then usually found by a thresholding scheme.
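The accumulator procedure just described can be sketched in a few lines (a simplified illustration with our own names; the slot counts and the range of h are arbitrary choices, not values from the article):

```python
import numpy as np

def hough_accumulate(points, n_h, n_w, h_max):
    # One accumulator slot per quantized (h, w) pair; w is sampled in [0, pi).
    acc = np.zeros((n_h, n_w), dtype=int)
    ws = np.linspace(0.0, np.pi, n_w, endpoint=False)
    for x, y in points:
        for j, w in enumerate(ws):
            h = x * np.cos(w) + y * np.sin(w)           # eq. 1.1
            i = int((h + h_max) / (2.0 * h_max) * n_h)  # quantize h into a slot
            if 0 <= i < n_h:
                acc[i, j] += 1
    return acc
```

Collinear points all increment the same slot, so a line shows up as a peak in `acc`; detecting such peaks reliably, and the storage cost of the quantized accumulator, are exactly the difficulties discussed next.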
This computational technique has been generalized (the generalized Hough transform) to detect arbitrary shapes (Ballard, 1981). One problem of the Hough transform is the proper identification of peaks in the quantized parameter (accumulator) space. It is often difficult to obtain the peaks by thresholding, since the selection of the threshold is largely subjective, and an improper selection may lead to several spurious peaks. Second, the heights of the desired peaks are not uniform, and a true peak is often surrounded by noisy peaks of high amplitude generated by the presence of noise in the image. Therefore, a single threshold value in the accumulator space is often not sufficient for obtaining the true peaks. Computation of the Hough transform also requires a large amount of memory. The entire parameter space is quantized into a number of slots; if the size of the slots is small, an enormous amount of storage space is required. Moreover, very small slots may not produce sharp peaks in the parameter space for real-life images, because the peak response is distributed over the neighboring slots. On the other hand, for coarse quantization, the precision in estimating the parameters is reduced. The Hough transform can be extended from two-dimensional images to higher-dimensional spaces in order to identify planes and hyperplanes. However, at higher dimensions, the computational cost and the storage requirements are very high.

The Hough transform technique can be viewed as an unsupervised multiclass regression problem where the class information is not known a priori. Here we present a single-layer neural network model that accepts the image coordinates ((x1, x2) as in equation 1.1) as input and adaptively learns the parameter values. The parameter values are represented by the connection weights between the input and output layers. The network accepts the image coordinates sequentially and learns the parameter values in an unsupervised mode. One of the basic advantages here is the reduction of the space requirement relative to the conventional Hough transform: the entire quantized parameter space is not necessary in such a representation. Moreover, the parameter values can be specified with better precision, since no quantization is performed over the parameter space. The connectionist framework of the Hough transform, along with the learning of the parameters, provides an efficient scheme for visual representation. Although the model is used here to detect linear structures, it can be extended to find curves and arbitrary shapes by using higher-order connections between the layers, which in turn will lead to the generation of a hierarchical description of visual information.

2 Learning Hough Transform for Linear Segments
In general, a linear segment (e.g., a straight line segment in a two-dimensional space, a plane in a three-dimensional space, or a hyperplane in a higher dimension) in an n-dimensional space can be represented as

$$\sum_i w_i x_i = \theta, \qquad (2.1)$$

where x = (x_1, x_2, \ldots, x_n) represents the variable on the hyperplane and θ is the distance of the hyperplane from the origin (in the Euclidean sense), with a normalization constraint

$$\sum_i w_i^2 = 1. \qquad (2.2)$$
A set of m such linear segments (hyperplanes) can be represented as

$$W x = \Theta, \qquad (2.3)$$

where W = [w_1, w_2, \ldots, w_m] is an m × n matrix and Θ = [θ_1, θ_2, \ldots, θ_m] is an m × 1 vector, each w_i together with θ_i representing the parameters of the ith linear segment. The normalizing constraint is

$$\|w_i\| = 1 \quad \text{for all } i. \qquad (2.4)$$
2.1 Problem. The problem is to identify W and Θ from a given set of observed data points. If there is only one linear segment, then the problem is identical to the linear regression problem, where a linear least-square fit or a maximum likelihood estimation can be performed in order to obtain the desired parameter vector. In the presence of more than one linear segment, the points can be considered as being generated from a mixture distribution,

$$p(x) = \sum_i \kappa_i \,\varphi(x; \mu_i, \Sigma_i), \qquad (2.5)$$
where φ can be considered a gaussian distribution if we assume gaussian noise on the data points, and κ_i, μ_i, Σ_i are the parameters of the distribution. However, it is difficult to design a maximum likelihood estimator, as in the case of a single linear segment (or hyperplane), for estimating the parameters κ_i, μ_i, Σ_i for all segments simultaneously. Attempts have been made to estimate the parameters (μ_i, Σ_i) and the mixing coefficients (κ_i) using the expectation-maximization (EM) algorithm (see McLachlan & Krishnan, 1997). The objective here is to detect more than one linear segment, or more than one set of parameter vectors, such that a given vector x lies on one of these linear segments. The problem is, in a sense, analogous to the problem of clustering, where the clusters are in the form of line segments such that each data point lies within a certain distance of at least one of the linear segments. It is to be noted here that a data point can lie on more than one line segment (at a point of intersection). In clustering algorithms (see Duda & Hart, 1978), a point can lie within one cluster only (in the case of fuzzy clustering, this restriction is usually relaxed; Bezdek, 1981; Dave & Bhaswan, 1992).

2.2 Learning Rule. Let us now consider a learning system (e.g., a neural network) in which, observing a sequence of vectors x, the set of parameters (W, Θ) can be adapted such that one of the hyperplanes satisfies equation 2.1. Here a single-layered neural network is proposed that accepts each data vector x as input and produces a set of outputs y = [y_1, y_2, \ldots, y_m]. Each output y_i represents the degree of belongingness of the data point to the corresponding ith hyperplane. The input vector u = [u_1, u_2, \ldots, u_m] to the output layer is

$$u = W x - \Theta, \qquad (2.6)$$

where W is the weight matrix and Θ = [θ_1, θ_2, \ldots, θ_m] are the thresholds of the output nodes. If x lies on any one of the m hyperplanes corresponding to the m output nodes (say, the ith hyperplane), then the corresponding entry of u, that is,
u_i, must be zero. The other entries can be nonzero. u may have more than one zero entry, corresponding to the junction of more than one hyperplane. u depends on the distance of the point x from the hyperplanes. Let us denote the output of the network by an m × 1 vector y, such that the ith entry of y is maximum when the point x falls on the ith hyperplane. Ideally, y_i should be 1 if x is on the ith hyperplane and 0 otherwise. Therefore, the output can be written as

$$y = f(u) = [f(u_1), f(u_2), \ldots, f(u_m)]', \qquad (2.7)$$

where f(·) is a bell-shaped function with its peak at zero that decreases rapidly as |u| increases. The objective function E should be such that E is zero when y_i is unity for any i, and the function should be continuous and differentiable. One form of objective function chosen here is

$$E(x) = \left[\prod_{i=1}^{m} (1 - y_i(x))\right]^{\alpha}. \qquad (2.8)$$
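As a concreteness check, the forward pass of equations 2.6 through 2.8 can be written directly; the activation used here is the bell-shaped form introduced later in equation 2.17, and the values of λ and α are arbitrary illustrative choices.

```python
import numpy as np

def forward(W, theta, x, lam=0.5):
    """u = Wx - theta (eq. 2.6) and y_i = f(u_i) with the bell-shaped
    activation f(u) = 1/(1 + (u/lam)^2) of eq. 2.17."""
    u = W @ x - theta
    y = 1.0 / (1.0 + (u / lam) ** 2)
    return u, y

def objective(y, alpha=1.0):
    """E = (prod_i (1 - y_i))^alpha (eq. 2.8)."""
    return np.prod(1.0 - y) ** alpha

# two unit-norm lines in 2-d: x1 = 1 and x2 = 0
W = np.array([[1.0, 0.0], [0.0, 1.0]])
theta = np.array([1.0, 0.0])
u, y = forward(W, theta, np.array([1.0, 0.3]))  # a point on the first line
E = objective(y)
```

Because the point lies exactly on the first line, u_1 = 0, y_1 = 1, and the objective E vanishes, as required of a fixed point.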
The parameter α > 0 determines the steepness of the function near the minima (that is, a fixed point). If α increases, the steepness of E near the local minima decreases, and vice versa. The objective function ensures that at least one output node must be fully active at any instant for a given input x. Note that it may be possible to define an entropy-like objective function reflecting the information-maximization principle (Bell & Sejnowski, 1995). In the information-maximization principle for blind source separation, the activation function is a nonlinear sigmoidal one, leading to an estimate of the mixing matrix by having a maximum spread of the signals in a bounding box. On the other hand, the nonlinear function in this formulation localizes or clusters the points belonging to the same linear structure. As Lin et al. (1997) pointed out, linear structures can be observed in mixed signals if the source signal distribution is highly peaked (unimodal or bimodal, etc.). However, the local shifts of the linear structures (the parameter set Θ) also need to be detected, which is difficult to formulate using the existing information-maximization principle. Moreover, in general, if the number of linear structures exceeds the dimensionality, then the existing framework of information maximization will fail to identify them. The weight matrix W and Θ are updated in order to decrease E(x) for a given input vector x. Following the gradient descent rule, the weight update is given as

$$\Delta w_{ij} \propto -\beta \frac{\partial E}{\partial w_{ij}} = -\beta \left(-\alpha \frac{E}{1 - y_i} f'(u_i)\, x_j\right) = \frac{cE}{1 - y_i} f'(u_i)\, x_j, \qquad (2.9)$$
where β is a constant of proportionality and c = αβ. Similarly, the change in Θ can be represented as

$$\Delta\theta_i = -\frac{kc\,E}{1 - y_i} f'(u_i). \qquad (2.10)$$

The learning rate constant for Θ is taken to be kc, where k is a constant. The change in w_{ij} should be such that at any iteration,

$$\|w_i\|^2 = 1, \qquad (2.11)$$
where w_i is the ith row of W. In other words, for an n-dimensional space, there are n − 1 free variables specifying w_i. Note that this is the same principle as in the computation of PCA (Oja, 1982; see also Amari, 1977); however, the activation functions here are different. We therefore update w_{ij}(t+1) as

$$w_{ij}(t+1) = \frac{w_{ij}(t) + \Delta w_{ij}(t)}{\left[\sum_j (w_{ij}(t) + \Delta w_{ij}(t))^2\right]^{1/2}}. \qquad (2.12)$$

Considering $\sum_j w_{ij}^2(t) = 1$ and Δw_{ij}(t) to be sufficiently small with respect to w_{ij}(t), we can write

$$w_{ij}(t+1) = \left[1 + 2\sum_j w_{ij}(t)\,\Delta w_{ij}(t) + O(c^2)\right]^{-1/2} (w_{ij}(t) + \Delta w_{ij}(t)). \qquad (2.13)$$

Neglecting the higher-order terms of Δw_{ij}(t), we get

$$w_{ij}(t+1) = w_{ij}(t) + \frac{c\,E(x, t)}{1 - y_i(t)} f'(u_i(t)) \left(x_j(t) - w_{ij}(t)(u_i(t) + \theta_i(t))\right). \qquad (2.14)$$
For the parameter Θ, the updating rule is

$$\theta_i(t+1) = \theta_i(t) - \frac{kc\,E(x, t)}{1 - y_i(t)} f'(u_i(t)). \qquad (2.15)$$
2.3 Choice of Activation Function. Different types of bell-shaped activation functions can be chosen for the output nodes. The activation function of an output node should have a high localization capability, and at the same time a data point lying far away should still be able to attract a hyperplane if the point does not lie on any of the hyperplanes. A natural choice of activation function is the gaussian form,

$$y_i = \exp\left(-\frac{u_i^2}{\lambda^2}\right). \qquad (2.16)$$
Figure 1: The nature of the activation functions in equations 2.16 and 2.17. The dashed line shows the gaussian type activation function in equation 2.16. The solid line shows the nature of activation function in equation 2.17. u represents the input to a neuron and f (u) is the corresponding output.
The gaussian form of activation function provides a highly localized response for a small value of λ. However, with the gaussian form, a data point lying at a distance greater than 3λ from a hyperplane exerts a very small attraction on the hyperplane. Therefore, for a small value of λ, the performance of the network becomes highly dependent on the initial selection of the parameters. An alternative form of activation function is

$$y_i = \frac{1}{1 + (u_i/\lambda)^2}, \qquad (2.17)$$
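The heavier tail of equation 2.17 relative to the gaussian of equation 2.16 is easy to verify numerically (λ = 1 here is an arbitrary choice):

```python
import numpy as np

lam = 1.0
u = np.array([0.0, 1.0, 5.0])                # distances from a hyperplane
y_gauss = np.exp(-(u / lam) ** 2)             # eq. 2.16
y_cauchy = 1.0 / (1.0 + (u / lam) ** 2)       # eq. 2.17
```

At u = 5λ the gaussian response has collapsed to about 1.4e-11, while equation 2.17 still returns roughly 0.038, so a distant point retains a nonnegligible pull on the hyperplane.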
which has a very sharp peak with a heavy tail. Therefore, the activation function of equation 2.17 has a high localization property, and at the same time distant data points are able to attract the hyperplanes. Note that a similar form of activation function has been used in the formation of sparse codes for natural scenes, leading to a complete family of localized, oriented, bandpass receptive fields (Olshausen & Field, 1996). Figure 1 shows the nature of the activation functions in equations 2.16 and 2.17. For the activation function chosen according to equation 2.17, the learning rule is

$$w_{ij}(t+1) = w_{ij}(t) - \frac{2c\,E(x, t)}{\lambda^2} \frac{y_i^2(t)}{1 - y_i(t)}\, u_i(t) \left[x_j(t) - (u_i(t) + \theta_i(t))\,w_{ij}(t)\right] \qquad (2.18)$$
and

$$\theta_i(t+1) = \theta_i(t) + \frac{2kc\,E(x, t)}{\lambda^2} \frac{y_i^2(t)}{1 - y_i(t)}\, u_i(t). \qquad (2.19)$$
Considering the gaussian form of activation function, equation 2.16, the learning rule is

$$w_{ij}(t+1) = w_{ij}(t) - \frac{2c\,E(x, t)}{\lambda^2} \frac{y_i(t)}{1 - y_i(t)}\, u_i(t)\left[x_j(t) - w_{ij}(t)(u_i(t) + \theta_i(t))\right],$$
$$\theta_i(t+1) = \theta_i(t) + \frac{2kc\,E(x, t)}{\lambda^2} \frac{y_i(t)}{1 - y_i(t)}\, u_i(t). \qquad (2.20)$$
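A single online update of (W, Θ) under equations 2.18 and 2.19, followed by the row renormalization of equation 2.12, can be sketched as follows; λ, c, and k are illustrative settings, and α = 1 is assumed in E.

```python
import numpy as np

def update_step(W, theta, x, lam=0.5, c=0.05, k=1.0):
    """One online step: eq. 2.18 for W, eq. 2.19 for theta (Cauchy-type
    activation of eq. 2.17, alpha = 1), then renormalize each row of W
    (eq. 2.12)."""
    u = W @ x - theta
    y = 1.0 / (1.0 + (u / lam) ** 2)
    E = np.prod(1.0 - y)                                   # eq. 2.8, alpha = 1
    g = (2.0 * c * E / lam**2) * (y**2 / (1.0 - y + 1e-12)) * u
    W_new = W - g[:, None] * (x[None, :] - (u + theta)[:, None] * W)
    theta_new = theta + k * g
    W_new /= np.linalg.norm(W_new, axis=1, keepdims=True)
    return W_new, theta_new

W = np.array([[1.0, 0.0]])
theta = np.array([1.0])
W1, t1 = update_step(W, theta, np.array([1.0, 0.5]))  # on the line: no change
W2, t2 = update_step(W, theta, np.array([2.0, 0.0]))  # off the line: moves
```

A point exactly on a hyperplane gives u_i = 0 and E = 0, so the parameters are untouched, while an off-plane point shifts them and the rows of W stay unit norm.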
Note that the learning rules in equations 2.18 and 2.20 are analogous to the competitive learning framework, where a weight decay is necessary. The effective learning rate is essentially a function of the output y_i: it is very large when the parameters are close to the true solution, and very small if the corresponding data point is not a proper representative of the hyperplane. Here, for a given data point, all the hyperplanes undergo a rigid transformation; however, the hyperplane closest to the data point is affected most, and the shift of a hyperplane decreases as the distance of the hyperplane from the data point increases. This is, in effect, similar to the principle of the SOM (Kohonen, 1988), where the neighborhood function is defined rather explicitly. The parameter k in equations 2.18 through 2.20 is the ratio of the learning rates of the parameters w_{ij} and θ_i. It depends on the distribution of the data points in the n-dimensional space with respect to the origin (in the pattern space). The choice of k is highly subjective. If the data points are far from the origin, that is, if the distance of the hyperplanes from the origin is quite high, then intuitively k should have a higher value. The performance of the learning algorithm is sensitive to the selection of k. In order to overcome this problem, the learning of the parameters can be performed in two separate stages. First, the parameter set Θ, quantifying the distances of the hyperplanes from the origin, is learned; that is, the parameter set Θ is updated optimally so that the error is minimized. After that, the parameter set W is changed optimally to minimize the error. The change of the parameters depends on the learning rate parameters; in order to change the network parameters optimally, the learning rate parameter should be selected optimally. Note that the loss function for obtaining the maximum likelihood estimate for a single line, leading to linear regression, is given as

$$E = \sum_t (w \cdot x(t) - \theta)^2, \qquad (2.21)$$
where x(t) denotes the sequence of input vectors. From equations 2.6, 2.8, 2.16, and 2.17, it can be observed that for a single line, E can be expressed as

$$E = \frac{1}{\lambda^2} \sum_t (w \cdot x(t) - \theta)^2 + \text{higher-order terms in } \frac{(w \cdot x(t) - \theta)^2}{\lambda^2} \qquad (2.22)$$

for λ ≫ |w · x(t) − θ|. In other words, in the vicinity of the true solution, E behaves in the same way as in the case of linear regression. On the other hand, if λ ≪ |w · x(t) − θ|, then E → 1 for that particular sample. Therefore, the contribution of an outlier to the objective function does not increase quadratically, as it does in the case of regression. This protects the linear structure from being affected by the presence of an outlier and also helps in localizing the structures in the presence of multiple linear segments.

3 Selection of the Learning Rate
The rate of updating of W and Θ depends on the constants c and λ. The parameter λ determines the width of the bell-shaped activation function. For small values of λ, the activation function is highly localized, and therefore the network behaves in such a way that the hyperplanes are little perturbed by the distant points, which is desired once the network has settled at the true fixed point. On the other hand, if λ is very small, the attraction of the hyperplanes toward the distant points will be very small, and the network may not come closer to the desired fixed point during the updating of the weights. The learning rate c should depend on λ in such a way that it provides sufficient movement in the weight space for the network to come closer to the fixed point, while remaining small enough that it does not bypass the fixed point. Let us say that the network is close to a fixed point. In that case, in the vicinity of the true solution, the weight matrix W and Θ are perturbed in such a way that the objective function E becomes zero for the given sample point, that is,

$$E(x, t+1) = 0. \qquad (3.1)$$

The change in E due to a change in y is

$$E(x, t) + \Delta E = \left[\prod_i (1 - (y_i + \Delta y_i))\right]^{\alpha} = \left[\prod_i (1 - y_i)\right]^{\alpha} \left[\prod_i \left(1 - \frac{\Delta y_i}{1 - y_i}\right)\right]^{\alpha} = E(x, t) \left[\prod_i \left(1 - \frac{\Delta y_i}{1 - y_i}\right)\right]^{\alpha}. \qquad (3.2)$$
Retaining only the first-order terms, from equations 3.1 and 3.2,

$$E(x, t)\left(1 - \alpha \sum_i \frac{\Delta y_i}{1 - y_i}\right) = 0, \qquad (3.3)$$

that is,

$$\sum_i \frac{\Delta y_i}{1 - y_i} = \frac{1}{\alpha}. \qquad (3.4)$$
The change in u is given as

$$\Delta u_i(t) = \sum_j \Delta w_{ij}(t)\, x_j - \Delta\theta_i(t) = -\frac{2cE}{\lambda^2} \frac{y_i^2}{1 - y_i}\, u_i \left[\sum_j x_j^2 - (u_i + \theta_i) \sum_j w_{ij} x_j + k\right] = -\frac{2cE}{\lambda^2} \frac{y_i^2}{1 - y_i}\, u_i \left[k + X - (u_i + \theta_i)^2\right], \qquad (3.5)$$

where $X = \sum_j x_j^2 = \|x\|^2$. The change in the output y for a small change in u is

$$\Delta y_i = \frac{\partial y_i}{\partial u_i}\, \Delta u_i = -\frac{2 u_i}{\lambda^2}\, y_i^2\, \Delta u_i. \qquad (3.6)$$
From equation 3.4, the expression can be written as

$$\frac{4cE}{\lambda^2} \sum_i \frac{y_i^3}{1 - y_i} \left[k + X - (u_i + \theta_i)^2\right] = \frac{1}{\alpha}. \qquad (3.7)$$
Therefore, the learning rules for W and Θ (from equations 2.18 and 2.19) are given as

$$\Delta w_{ij}(t) = -\frac{u_i(t)\,\frac{y_i^2(t)}{1 - y_i(t)}\left[x_j(t) - w_{ij}(t)(u_i(t) + \theta_i(t))\right]}{2\alpha \sum_k \frac{y_k^3(t)}{1 - y_k(t)}\left[k + X - (u_k(t) + \theta_k(t))^2\right]} \qquad (3.8)$$

and

$$\Delta\theta_i(t) = \frac{k\, u_i(t)\,\frac{y_i^2(t)}{1 - y_i(t)}}{2\alpha \sum_k \frac{y_k^3(t)}{1 - y_k(t)}\left[k + X - (u_k(t) + \theta_k(t))^2\right]}. \qquad (3.9)$$
Similarly, for the gaussian type of activation function (see equation 2.16), the learning rules (see equation 2.20) are

$$\Delta w_{ij}(t) = -\frac{u_i(t)\,\frac{y_i(t)}{1 - y_i(t)}\left[x_j(t) - w_{ij}(t)(u_i(t) + \theta_i(t))\right]}{2\alpha \sum_k y_k^2(t)\,\frac{1}{1 - y_k(t)}\,\log\frac{1}{y_k(t)}\left[k + X - (u_k(t) + \theta_k(t))^2\right]} \qquad (3.10)$$

and

$$\Delta\theta_i(t) = \frac{k\, u_i(t)\,\frac{y_i(t)}{1 - y_i(t)}}{2\alpha \sum_k y_k^2(t)\,\frac{1}{1 - y_k(t)}\,\log\frac{1}{y_k(t)}\left[k + X - (u_k(t) + \theta_k(t))^2\right]}. \qquad (3.11)$$
The updating rule reveals that the changes in the parameter values are independent of the selection of λ; the parameter λ is, however, implicitly embedded in the output y of the network. Initially, when the network is far from the fixed point, the denominator is very small. In order to take care of this situation, the learning rate is clamped to a constant when the denominator is very small. Near the optimal solution, the learning rate changes according to equation 3.7, depending on how far the network is from the fixed point. The parameter λ decides the width of the activation function, which can be selected depending on the distance between the line segments and the amount of noise present in the image. The learning rules embed the parameter k, which is essentially the ratio of the learning rates of the parameter sets W and Θ. The selection of the parameters k and λ depends on the type of images or the incoming sample points. The parameter k depends on the nature of the linear segments; in other words, k is implicitly dependent on the distance of the linear segments from the center defined by the user. In order to remove the free variable k, W and Θ can alternatively be updated in the batch mode as follows:

i. Update Θ without changing W until the error reaches a local minimum, as

$$\theta_i = \theta_i + \frac{u_i(t)\,\frac{y_i^2(t)}{1 - y_i(t)}}{2\alpha \sum_l \frac{y_l^3(t)}{1 - y_l(t)}}. \qquad (3.12)$$
Compute y.

ii. Update W without changing Θ until the error reaches a local minimum, as

$$w_{ij} = w_{ij} - \frac{u_i(t)\,\frac{y_i^2(t)}{1 - y_i(t)}\left[x_j(t) - w_{ij}(u_i(t) + \theta_i)\right]}{2\alpha \sum_l \frac{y_l^3(t)}{1 - y_l(t)}\left[X - (u_l(t) + \theta_l)^2\right]}. \qquad (3.13)$$

Compute y.

iii. Steps i and ii are repeated until the error is minimized.
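Steps i through iii can be put together as a small batch-mode routine. The sketch below substitutes a constant learning rate for the adaptive rate of equations 3.12 and 3.13 (a simplifying assumption), keeps the per-sample factor E(x) y_i²/(1 − y_i) of equations 2.18 and 2.19 with α = 1, and renormalizes the rows of W after each step; all numeric settings are illustrative.

```python
import numpy as np

def responses(X, W, theta, lam):
    U = X @ W.T - theta                        # eq. 2.6, one row per sample
    Y = 1.0 / (1.0 + (U / lam) ** 2)           # eq. 2.17
    E = np.prod(1.0 - Y, axis=1)               # eq. 2.8 with alpha = 1
    return U, Y, E

def alternate(X, W, theta, lam=0.3, c=0.05, k=1.0, inner=50, outer=3):
    """Step i: update theta with W frozen; step ii: update W with theta
    frozen; step iii: repeat.  Batch changes are averages of the
    per-sample changes, as described in the text."""
    for _ in range(outer):
        for _ in range(inner):                 # step i
            U, Y, E = responses(X, W, theta, lam)
            G = (2 * c / lam**2) * E[:, None] * Y**2 / (1 - Y + 1e-9) * U
            theta = theta + k * np.mean(G, axis=0)
        for _ in range(inner):                 # step ii
            U, Y, E = responses(X, W, theta, lam)
            G = (2 * c / lam**2) * E[:, None] * Y**2 / (1 - Y + 1e-9) * U
            corr = X[:, None, :] - (U + theta)[:, :, None] * W[None, :, :]
            W = W - np.mean(G[:, :, None] * corr, axis=0)
            W = W / np.linalg.norm(W, axis=1, keepdims=True)
    return W, theta

# points on the single line x2 = 0.5; start from the true normal, wrong offset
X = np.stack([np.linspace(-1.0, 1.0, 20), np.full(20, 0.5)], axis=1)
W, theta = alternate(X, np.array([[0.0, 1.0]]), np.array([0.2]))
```

With the normal held correct, the Θ stage pulls the offset to 0.5, after which the W stage leaves the unit-norm weight row essentially unchanged.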
The learning rule discussed here is for a given data point. In order to minimize the error with respect to Θ first and then with respect to W, it is necessary to run the algorithm in the batch mode. In the batch mode, the necessary changes Δw_{ij} and Δθ_i for each sample data point are computed for each hyperplane, and the overall change is the average of these changes over the individual data points. If the batch-mode updating is performed on the given set of data points with sequential updating of Θ and W, then the algorithm converges in the vicinity of the true solution.

4 Theoretical Analysis of Convergence
The objective function E is not convex in the parameter space, and globally it is difficult to ensure the convergence of the network. However, in the vicinity of the true solution, the network dynamics can be shown to converge. Here we consider batch-mode updating, and the network converges in the batch mode. In the vicinity of the true solution, we can consider u_i ≪ u_j such that (u_i/λ)² ≪ 1 and (u_j/λ)² ≫ 1 for a judicious choice of λ. Now

$$1 - y_i = \frac{(u_i/\lambda)^2}{1 + (u_i/\lambda)^2}. \qquad (4.1)$$
Therefore, in the vicinity of the true solution,

$$1 - y_i = \begin{cases} (u_i/\lambda)^2 & \forall x \in V_i \\ 1 & \text{otherwise}, \end{cases} \qquad (4.2)$$

where V_i is the set of data points defining the ith hyperplane. The updating of the weight w_{ij} is given as

$$\Delta w_{ij} = -\frac{\frac{y_i^2}{1 - y_i}\, u_i\,(x_j - w_{ij}(u_i + \theta_i))}{2\alpha \sum_k \frac{y_k^3}{1 - y_k}\,(X - (u_k + \theta_k)^2)}. \qquad (4.3)$$
In the vicinity of the true solution, we can validly assume that

$$\frac{y_i^3}{1 - y_i}\,(X - (u_i + \theta_i)^2) \gg \sum_{k \neq i} \frac{y_k^3}{1 - y_k}\,(X - (u_k + \theta_k)^2). \qquad (4.4)$$
As a result, in the vicinity of the true solution, the updating rule for w_{ij} for a given data point in V_i is given as

$$\Delta w_{ij} = \begin{cases} -\eta_1(x)\, u_i\,(x_j - w_{ij}(u_i + \theta_i)) & \text{for all } x \in V_i \\ 0 & \text{otherwise}, \end{cases} \qquad (4.5)$$
where η₁(x) is the learning rate for the data point x, given as

$$\eta_1(x) = \frac{1}{2\alpha\, y_i\,(\|x\|^2 - (u_i + \theta_i)^2)}. \qquad (4.6)$$
This is a rather simplified approximation of the learning rule (see equation 4.3) in the vicinity of the true solution, analogous to the updating rule in competitive learning. However, in competitive learning it is necessary that V_i ∩ V_j = ∅; only one output can be active. In the present framework of learning, it is not necessary that V_i ∩ V_j = ∅. Considering u_i to be sufficiently small and retaining only the first-order terms in u_i, we can simplify equation 4.5 in the batch mode to

$$\Delta w_{ij} = \begin{cases} -\eta_1\, u_i\,(x_j - \theta_i w_{ij}) & \text{for all } x \in V_i \\ 0 & \text{otherwise}, \end{cases} \qquad (4.7)$$
where η₁ is the average effective learning parameter in the batch-mode updating. Note that it is possible to retain the second-order terms and still show convergence in the vicinity of the true solution. The average change in weight in the batch mode can therefore be written as

$$\Delta w_{ij} = -\eta_1 \left(\langle u_i x_j \rangle_{V_i} - \theta_i w_{ij} \langle u_i \rangle_{V_i}\right), \qquad (4.8)$$

where ⟨·⟩_{V_i} is the sample average over the data points in V_i. Similarly, from equation 3.12, the average change in θ_i in the batch mode is given as

$$\theta_i(n+1) = \theta_i(n) + \eta_2 \langle u_i \rangle_{V_i}. \qquad (4.9)$$
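The effect of the batch update in equation 4.9 on the mean residual e_i = ⟨w_i′x − θ_i⟩_{V_i} can be checked numerically: one application scales e_i by exactly (1 − η₂). The data and constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))           # points assigned to one hyperplane
w = np.array([0.6, 0.0, 0.8])           # unit-norm normal vector
theta, eta2 = 0.7, 0.25                 # illustrative offset and rate

e = np.mean(X @ w - theta)              # e_i(n)
theta_next = theta + eta2 * np.mean(X @ w - theta)   # eq. 4.9
e_next = np.mean(X @ w - theta_next)    # e_i(n+1)
```

Up to rounding, e_next equals (1 − η₂)e, which is the geometric contraction used in the convergence argument that follows.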
According to the learning algorithm in equations 3.12 and 3.13, first the parameter θ_i is updated (see equation 3.12) until there is no further change in θ_i for all i. Then the parameter w_i is updated according to equation 3.13. Under this two-stage updating scheme in the batch mode, the algorithm converges to the desired solution in the vicinity of the true solution. In order to prove the convergence in the vicinity of the true solution, we have to show that

$$e_i(n) = \langle w_i'(n)\,x - \theta_i \rangle_{V_i} \to 0 \qquad (4.10)$$

and

$$V_i(n) = \langle (w_i'(n)\,x - \theta_i)^2 \rangle_{V_i} \to 0 \qquad (4.11)$$

as n → ∞.
Since θ_i is changed first, without changing w_i,

$$e_i(n+1) = (1 - \eta_2)\, e_i(n), \qquad (4.12)$$

where $e_i = \langle w_i'(n)\,x - \theta_i \rangle_{V_i}$. Therefore, e_i(n+1) < e_i(n) for some positive η₂ ≤ 1. The value of θ_i is updated without changing w_i until the change in θ_i is zero (computationally, until the change is very small). After convergence in θ_i we get e_i = 0, that is,

$$\theta_i = w_i' \mu_i, \qquad (4.13)$$

where $\mu_i = \langle x \rangle_{V_i}$; that is, $\langle u_i \rangle_{V_i} = 0$. The change in w_i can then be written as

$$w_i(n+1) = w_i(n) - \eta_1 \langle u_i\, x \rangle_{V_i}. \qquad (4.14)$$
It can be simplified as

$$w_i(n+1) = w_i(n) - \eta_1 \left(\langle x x' \rangle_{V_i} w_i - \theta_i \mu_i\right) = w_i(n) - \eta_1 \left(\langle x x' \rangle_{V_i} w_i - \mu_i \mu_i' w_i\right) = (I - \eta_1 A_i)\, w_i, \qquad (4.15)$$

where $A_i = \langle x x' \rangle_{V_i} - \mu_i \mu_i' = \langle (x - \mu_i)(x - \mu_i)' \rangle_{V_i}$. Here A_i is a nonnegative definite matrix. Considering θ_i = w_i′μ_i,

$$V_i = \langle (w_i' x - w_i' \mu_i)^2 \rangle_{V_i}. \qquad (4.16)$$

Simplifying the expression, we get

$$V_i = w_i' A_i w_i. \qquad (4.17)$$

Therefore,

$$V_i(n+1) = w_i'(n+1)\, A_i\, w_i(n+1) = w_i'(n)(I - \eta_1 A_i)\, A_i\, (I - \eta_1 A_i)\, w_i(n) = V_i(n) - \eta_1\, w_i'(n)\left(2 A_i^2 - \eta_1 A_i^3\right) w_i(n). \qquad (4.18)$$

Since A_i is a nonnegative definite matrix, A_i² and A_i³ are also nonnegative definite. If the value of η₁ is chosen as η₁
20) these confidence intervals do not differ substantially from those based on a gaussian (± 2 standard deviations), but at small ν₀ the difference can be substantial, as for these values the χ²_{ν₀} distribution has long tails.

4.7 High-Frequency Limit. The population spectrum goes to a constant value equal to the rate λ in the high-frequency limit. In practice, spectra calculated from a finite sample will go to a value close to λ, but fluctuations in the number of spikes in the interval will lead to an error in this estimate. For a given sample, the spectrum will go to the value given by

$$I(f \to \infty) = \frac{1}{N_T K} \sum_{k=0}^{K-1} \sum_{n=1}^{N_T} \sum_{j=1}^{N_n(T)} h_k(t_j^n)^2, \qquad (4.24)$$
where t_j^n is the jth spike in the nth trial and N_n(T) is the total number of spikes in the nth trial. In the case of direct and lag window estimators, the averaging over tapers need not be performed. This expression yields a value that is typically very close to the sample estimate of the mean rate.^7 Significant departures from this high-frequency limit are of interest when interpreting the spectrum, as these indicate enhancement or suppression relative to a homogeneous Poisson process.

4.8 Choice of Estimator, Taper, and Lag Window. The preceding section discussed the large-sample statistical properties of direct, lag window, and multitaper estimates of the spectrum. The choice of which estimator to use remains a contentious one (Percival & Walden, 1993). The multitaper method is the most systematic of the estimators, but the lag window estimators should perform almost as well for spike train spectra that have reasonably small dynamic ranges.^8 However, it is possible to construct spike trains with widely different timescales, which can possess a large dynamic range. In addition, the multitaper technique leads to a simple jackknife procedure by leaving out one data taper in turn. A further important property of the multitaper estimator is that it gives more weight to events at the edges of the time interval and thus ameliorates the arbitrary downweighting of the edges of the data introduced by single tapers. When using the lag window estimator, there are many choices available for both the taper and the lag window. The choice of taper is generally not critical provided that the taper goes smoothly to zero at the start and end of the interval. A rectangular taper has particularly large side lobes

^7 It is exactly the sample estimate of the mean rate for a rectangular taper.
^8 Dynamic range is a measure of the variation in the spectrum as a function of frequency and is defined as $10 \log_{10}\left(\max_f S(f) / \min_f S(f)\right)$.
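Equation 4.24 is a plain sum over tapers, trials, and spikes, and footnote 7's claim is easy to confirm: with a rectangular taper normalized so that the integral of h² over [0, T] is one, the limit reduces exactly to the sample mean rate. The synthetic spike trains below are an illustrative assumption.

```python
import numpy as np

def high_freq_limit(trials, tapers):
    """Eq. 4.24: (1/(N_T K)) * sum over tapers k, trials n, and spikes j
    of h_k(t_j^n)^2.  `trials` is a list of spike-time arrays; `tapers`
    is a list of callables h_k(t)."""
    NT, K = len(trials), len(tapers)
    total = sum(np.sum(h(spikes) ** 2) for h in tapers for spikes in trials)
    return total / (NT * K)

T = 2.0
rng = np.random.default_rng(0)
trials = [np.sort(rng.uniform(0, T, size=rng.poisson(10 * T)))
          for _ in range(5)]                         # homogeneous Poisson trials
rect = lambda t: np.full_like(t, 1.0 / np.sqrt(T))   # rectangular taper
est = high_freq_limit(trials, [rect])
rate_hat = sum(len(s) for s in trials) / (len(trials) * T)
```

For this rectangular taper, `est` coincides with the total spike count divided by N_T · T, the sample mean rate.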
Spectrum and Coherency of Sequences of Action Potentials
in the frequency domain, which can lead to significant bias. The choice of lag window is also usually not critical; typically a gaussian kernel will be satisfactory.

5 Estimating the Coherency

The sample coherency, which may be estimated using equation 5.1, may be evaluated using any of the previously mentioned spectral estimators. The principal difference is that the direct estimator, in terms of which the other estimators are expressed, is given by equations 5.2 and 5.3 rather than 4.1 and 4.2:

$$C^X(f) = I_{12}^X \big/ \sqrt{I_{11}^X I_{22}^X} \qquad (5.1)$$
$$I_{12}^D(f) = J_1^D(f)\, J_2^D(f)^* \qquad (5.2)$$
$$J_a^D(f) = \int_{-\infty}^{\infty} h(t)\, e^{-2\pi i f t}\, dN_a(t), \qquad (5.3)$$
where N_1(t) and N_2(t) are simultaneously recorded spike trains from two different cells and X denotes the type of spectral estimator. Possible choices of the estimator X include D (direct), DT (trial-averaged direct), LW (lag window), or MT (multitaper). Lag window and multitaper coherency estimates may be constructed by substituting I_{12}^D(·) in place of I^D(·) in equations 4.11 and 4.18. The estimates are biased over a frequency range equal to the width of the smoothing, although the exact form of the bias is difficult to evaluate.

5.1 Confidence Limits for the Coherence. The treatment of error bars is somewhat different between the spectrum and the coherency, since the coherency is a complex quantity. Usually one is interested in establishing whether there is significant coherence in a given frequency band. In order to do this, the sample coherence should be tested against the null hypothesis of zero population coherence. The distribution of the sample coherence under this null hypothesis is

$$P(|C|) = (\nu_0 - 2)\,|C|\,(1 - |C|^2)^{\nu_0/2 - 2}, \qquad 0 \le |C| \le 1. \qquad (5.4)$$

A derivation of this result is given in Hannan (1970). In outline, the method is to rewrite the coherence in such a way that it is equivalent to a multiple correlation coefficient (Anderson, 1984). The distribution of a multiple correlation coefficient is then a known result from multivariate statistics. In the case of coherence estimates based on lag window estimators, the appropriate ν₀ may be used, although this is only approximately valid because this method of derivation assumes integer ν₀/2.
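Integrating the null density in equation 5.4 gives the tail probability P(|C| > c) = (1 − c²)^{ν₀/2 − 1}, so the level exceeded with probability p is |C| = √(1 − p^{1/(ν₀/2 − 1)}). A quick numeric check (ν₀ = 20 and p = 0.05 are arbitrary choices):

```python
import numpy as np

def coherence_threshold(nu0, p=0.05):
    """Coherence level exceeded with probability p under the null
    hypothesis of zero population coherence (from eq. 5.4)."""
    return np.sqrt(1.0 - p ** (1.0 / (nu0 / 2.0 - 1.0)))

thr = coherence_threshold(20.0, p=0.05)
# verify against the closed-form tail of eq. 5.4
tail = (1.0 - thr**2) ** (20.0 / 2.0 - 1.0)
```

The recovered tail probability matches the requested p, confirming the inversion.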
M. R. Jarvis and P. P. Mitra
It is straightforward to calculate a confidence level based on this distribution. The coherence will exceed $\sqrt{1 - p^{1/(\nu_0/2 - 1)}}$ only in p × 100% of experiments. In addition, it is notable that the quantity $(\nu_0/2 - 1)\,|C|^2 / (1 - |C|^2)$ is distributed as $F_{2,\,\nu_0 - 2}$ under this null hypothesis. It is useful to apply a transformation to the coherence before plotting it, which aids in the assessment of significance. The variable $q = \sqrt{-(\nu_0 - 2)\log(1 - |C|^2)}$ has a Rayleigh distribution with density $p(q) = q e^{-q^2/2}$. This density function does not depend on ν₀ and furthermore has a tail that closely resembles a gaussian. For certain values of a fitting parameter β,^9 a further linear transformation r = β(q − β) leads to a distribution that closely resembles a standard normal gaussian for r > 2. This means that for r > 2, one can interpret r as the number of standard deviations by which the coherence exceeds that expected under the null hypothesis.

5.2 Confidence Limits for the Phase of the Coherency. If there is no population coherency, then the phase of the sample coherency is distributed uniformly. If there is population coherency, then the distribution of the sample phase is approximately gaussian, provided that the tails of the gaussian do not extend beyond a width 2π. An approximate 95% confidence interval for the phase (Rosenberg, Amjad, Breeze, Brillinger, & Halliday, 1989; Brillinger, 1974) is

$$\hat{\varphi}(f) \pm 2\sqrt{\frac{2}{\nu_0}\left(\frac{1}{|C(f)|^2} - 1\right)}, \qquad (5.5)$$

where $\hat{\varphi}(f)$, the sample estimate of the coherency phase, is evaluated using $\tan^{-1}\{\mathrm{Im}(C)/\mathrm{Re}(C)\}$.

6 Finite Size Effects

In the preceding sections, error bars were given for estimators of the spectrum and the coherence. However, these error bars were based on large sample sizes (they apply asymptotically as T → ∞). Neurophysiological data are not collected in this regime, and particularly in awake behaving studies, where data are often sparse, corrections arising at small T are potentially important.
In order to estimate the size of these corrections, a particular model for the point process is required. The model studied was chosen primarily for its analytical tractability while still maintaining a nontrivial spectrum. The model and the final results are presented here; the details of the analysis are reserved for the appendix. The model is a doubly stochastic inhomogeneous Poisson process with a gaussian rate function. A specific realization of a spike train is generated from the model in the following^9

^9 A reasonable choice for β is 23/20.
Figure 5: Schematic illustrating the model for which finite size corrections to the asymptotic error bars will be evaluated. (a) A spectrum S_G(f) is defined. (b) A realization λ_G(t) is drawn from a gaussian process with this spectrum. (c) The mean rate λ is added to λ_G(t) to obtain λ(t). (d) This rate function is used to generate a realization of an inhomogeneous Poisson process, yielding a set of spike times.
manner. First, a population spectrum S_G(f) is specified. From this, a realization of a zero mean gaussian process λ_G(t) is generated. A constant λ, the mean rate, is then added to this realization. This function is then considered to be the rate function for an inhomogeneous Poisson process. A realization of this inhomogeneous Poisson process is then generated. A schematic of the model is shown in Figure 5. Technically this is not a valid process because the rate function λ(t) may be negative. However, if the area underneath the spectrum is small enough, then the fluctuations about the mean rate seldom cross zero, and corrections due to this effect are negligible. In addition, large violations of this area constraint have been tested by Monte Carlo simulation, and the results still apply to a good approximation. An important feature of this model is that the population spectrum of the spike trains is simply the spectrum of the inhomogeneous Poisson process rate function plus an offset equal to the mean rate.¹⁰ The spectrum of the rate function is a positive real quantity, and therefore within this model the population spectrum cannot be less than the mean rate at any frequency. Intuitively, the reason is that the process must be more variable than a homogeneous Poisson process at all frequencies. To make the nature of the result clear, a simplified version is given in equation 6.1. This version is for the particular case of a homogeneous Poisson process (which has a flat population spectrum) and a rectangular taper:¹¹

    var{I_X(f)} = λ² ( 2/ν₀ + 1/(N_T T λ) ),                             (6.1)

where λ is the mean rate.
10. This result does not depend on the gaussian assumption.
11. The expression also holds approximately for the multitaper estimate provided all tapers up to the Shannon limit are used.
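The generative steps of the model (spectrum → gaussian rate → inhomogeneous Poisson spikes) can be sketched in a few lines. This is an illustrative reduction, not the authors' implementation: the "gaussian" rate here is a single cosine stand-in rather than a full draw from S_G(f), and the binwise thinning requires rate_fn(t)·dt ≪ 1:

```python
import math
import random

def inhomogeneous_poisson(rate_fn, T, dt=1e-3, rng=None):
    """Draw spike times on [0, T) by binwise thinning: a spike falls in
    [t, t+dt) with probability rate_fn(t)*dt. Negative rate excursions
    are clipped to zero, mirroring the text's comment that such
    crossings must be rare for the model to be valid."""
    rng = rng or random.Random()
    spikes, t = [], 0.0
    while t < T:
        lam = max(rate_fn(t), 0.0)
        if rng.random() < lam * dt:
            spikes.append(t)
        t += dt
    return spikes

def cosine_rate(mean_rate, amp, f0, phase=0.0):
    """Single-component stand-in for one realization lambda(t) =
    lambda + lambda_G(t); a full simulation would draw lambda_G(t)
    from a gaussian process with spectrum S_G(f)."""
    return lambda t: mean_rate + amp * math.cos(2.0 * math.pi * f0 * t + phase)
```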
M. R. Jarvis and P. P. Mitra
A sample-based estimate of N_T T λ is the total number of spikes over all trials. It is to be noted that finite-size effects reduce the degrees of freedom. This result implies that there is a point beyond which additional smoothing does not decrease the variance further, and this point is approximately when ν₀ is equal to twice the total number of spikes. The full result is given in equations 6.2 to 6.7:

    var{I_X(f)} = 2 E{I_X(f)}² / ν(f)                                    (6.2)

    1/ν(f) = 1/ν₀ + C_h^X W(f) / (2 T N_T E{I_X(f)}²),                   (6.3)
where

    C_h^X = ∫₀¹ f(t)⁴ dt                                  if X = LW, D, or DT
    C_h^X = (1/K²) Σ_{k,k′} ∫₀¹ f_k(t)² f_{k′}(t)² dt     if X = MT       (6.4)

    f(t/T) = √T h(t)                                                      (6.5)

and

    W(f) = λ_hf + 4[E{I_X(f)} − λ_hf] + 2[E{I_X(0)} − λ_hf] + [E{I_X(2f)} − λ_hf]   (6.6)

    λ_hf = E{I_X(f → ∞)}.                                                 (6.7)
C_h^X is a constant of order unity, which depends on the taper. When a taper is used to control bias, some of the spikes are effectively disregarded, and this has an effect on the size of the correction. The function f(t) has the same form as the taper h(t) but is defined on the interval [0, 1]. C_h^X is the integral of the fourth power of f and attains its minimum value of one for a rectangular taper. It is typically between 1 and 2 for other tapers. In the multitaper case, cross-terms between tapers are included. Equation 6.6 describes how the finite-size correction depends on the structure of the spectrum. W(f) is the sum of four terms. The first term is the only one present for a flat spectrum. The second term is a correction that depends on the spectrum at the frequency being considered. The next two terms depend on the spectrum at zero frequency and the spectrum at twice the frequency being considered. The final three terms all depend on the difference between the spectrum at some frequency and the high-frequency limit. Equation 6.6 applies provided that the spike train is well described by the model. However, this is not necessarily the case, and a suppression of the spectrum, which cannot be described by the model, often occurs at low
frequencies.¹² In the event that there is a significant suppression of the spectrum, W(f) may become small or even negative. To avoid this, a modified form for W(f) that prevents this may be used:

    W(f) = λ_hf + 4 max(E{I_X(f)} − λ_hf, 0) + 2 max(E{I_X(0)} − λ_hf, 0)
           + max(E{I_X(2f)} − λ_hf, 0).                                   (6.8)
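The finite-size reduction of the degrees of freedom in equation 6.3 is easy to evaluate numerically. A minimal sketch (function and variable names are ours):

```python
def corrected_dof(nu0, Ch, W_f, EI_f, T, NT):
    """Equation 6.3: 1/nu(f) = 1/nu0 + Ch*W(f) / (2*T*NT*E{I_X(f)}^2),
    returned as the corrected nu(f)."""
    return 1.0 / (1.0 / nu0 + Ch * W_f / (2.0 * T * NT * EI_f ** 2))

# Flat (Poisson) spectrum with a rectangular taper: W(f) = lambda_hf = lam,
# E{I_X(f)} = lam, Ch = 1, so 1/nu = 1/nu0 + 1/(2*NT*T*lam), which is
# equation 6.1 rewritten as var{I_X(f)} = 2*lam^2 / nu(f).
```

For example, with ν₀ = 50, a 10 Hz Poisson train, T = 0.5 s, and 5 trials (25 spikes in total), the corrected ν(f) drops to 25, i.e., the finite-size term matches the asymptotic term exactly when ν₀ equals twice the total spike count, as stated in the text.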
The modification to the result is somewhat ad hoc, so Monte Carlo simulations of spike trains with enforced refractory periods have been performed to test its validity. These simulations demonstrated that although the correction derived using equation 6.8 was significantly different from that obtained from the Monte Carlo simulations in the region of the suppression, equation 6.8 provided a pessimistic estimate in all cases studied. This increases confidence that applying finite-size corrections using equation 6.8 will provide reasonable error bars for small samples. Equation 6.3 gives the finite-size correction in terms of a reduction in ν₀. The new ν(f) may be used to put confidence intervals on the results, as described in section 4.6, although the accuracy of the χ² assumption will be reduced. In the case of coherence, an indication of the correction to the confidence level can be obtained by using the smaller of the two ν(f) from the spike train spectra to calculate the confidence level using equation 5.4. In all cases, if the effect being observed achieves significance only by an amount that is of the same order as the finite-size correction, then it is recommended that more data be collected.

7 Experimental Design

Often it is useful to know in advance how many trials or how long a time interval one needs in order to resolve features of a certain size in the spectrum or the coherence. To do this, one needs to estimate the asymptotic degrees of freedom ν₀. This depends on the size of the feature to be resolved, a; the significance level for which confidence intervals will be calculated, p; and the fraction of experiments that will achieve significance, P. In addition, the reduction in the degrees of freedom due to finite-size effects depends on the total number of spikes N_s and also on C_h (see section 6). An estimate of ν₀ may be obtained in two stages. First, a, p, and P are specified and used to calculate the degrees of freedom ν.
Second, the asymptotic degrees of freedom ν₀ is estimated using ν, N_s, and C_h. The feature size a = (S − λ)/λ is the minimum size of feature that the experimenter is content to resolve. For example, a value of 0.5 indicates that where the population

12. Note that any spike train spectrum displaying significant suppression below the mean firing rate immediately rules out the inhomogeneous Poisson process model.
spectrum exceeds 1.5λ, the feature will be resolved. The significance level should be set to the same value that will be used for calculating the confidence interval for the spectrum, typically 0.05. For a given p, there is some probability P that an experiment will achieve significance. To calculate ν, one begins with a guess ν_g. Then q₁ is chosen such that P[χ²_{ν_g} ≥ q₁] = p/2. On the basis of this, one then evaluates P′ = 1 − Ψ[q₁/(1 + a)], where Ψ is the cumulative χ²_{ν_g} distribution.¹³ If P′ is equal to the specified fraction P, then ν = ν_g; otherwise a different ν_g is chosen. This procedure is readily implemented as a minimization of (P − P′(ν_g))² on a computer. Having obtained ν, one can estimate ν₀ using

    1/ν₀ = 1/ν − C_h (1 + 4a) / (2 N_s (1 + a)²),                         (7.1)
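The two-stage procedure above can be implemented with nothing beyond the standard library if one restricts the search to even degrees of freedom, for which the χ² survival function has a closed form. A sketch under these assumptions (function names are ours; the a > 0 case is shown):

```python
import math

def chi2_sf(x, nu):
    """Survival function P[chi2_nu >= x], closed form for even nu."""
    assert nu % 2 == 0 and nu > 0
    return math.exp(-x / 2.0) * sum((x / 2.0) ** k / math.factorial(k)
                                    for k in range(nu // 2))

def upper_quantile(p, nu):
    """q such that P[chi2_nu >= q] = p, by bisection."""
    lo, hi = 0.0, 1e4
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_sf(mid, nu) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def detection_fraction(nu_g, a, p):
    """P' = 1 - Psi[q1/(1+a)] for a trial guess nu_g (a > 0 case)."""
    q1 = upper_quantile(p / 2.0, nu_g)
    return chi2_sf(q1 / (1.0 + a), nu_g)

def required_dof(a, p, P, max_nu=2000):
    """Smallest even nu whose detection fraction reaches P."""
    for nu_g in range(2, max_nu, 2):
        if detection_fraction(nu_g, a, p) >= P:
            return nu_g
    return None
```

A grid search over even ν_g replaces the least-squares minimization mentioned in the text, but the logic is the same: larger features require fewer degrees of freedom.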
where the 4a is omitted from the numerator if a < 0. Figure 6 illustrates example design curves generated using this method. These curves show the asymptotic degrees of freedom as a function of feature size for different total numbers of spikes. The existence of a region bounded by vertical asymptotes implies that as long as the total number of measured spikes is finite, modulations in the spectrum below a certain level cannot be detected no matter how much the spectrum is smoothed. These curves may be used to design experiments capable of resolving spectral features of a certain size. In the case of the coherence, one calculates how many degrees of freedom are required for the confidence line to lie at a certain level, as described in section 5.1.

8 Line Spectra

One of the assumptions underlying the estimation of spectra is that the population spectrum varies slowly over the smoothing width (W for multitaper estimators). While this is often the case, there are situations in which the spectrum contains very sharp features, which are better approximated by lines than by a continuous spectrum. This corresponds to periodic modulations of the underlying rate, such as when a periodic stimulus train is presented. In such situations, it is useful to be able to test for the presence of a line in a background of colored noise (i.e., in a locally smooth but otherwise arbitrary continuous population spectrum). Such a test has been previously developed, in the context of multitaper estimation, for continuous processes

13. These formulas apply for a > 0. If a < 0, then P[χ²_{ν_g} ≤ q₁] = p/2 and P′ = Ψ[q₁/(1 + a)] should be used.
Figure 6: Example design curves (ν₀ versus feature size a) for the case when p = 0.05, P = 0.5, and C_h = 1.5. The three curves correspond respectively to N_s = ∞ (solid line), N_s = 100 (dashed line), and N_s = 20 (dotted line).
(Thomson, 1982). In the following section the analogous development for point processes is presented.

8.1 F-Test for Point Processes. A line in the spectrum has an exactly defined frequency, and consequently the process N(t) has a nonzero first moment. The natural model in the case of a single line is given by

    E{dN(t)}/dt = λ₀ + λ₁ cos(2π f₁ t + φ).                               (8.1)
A zero mean process (N̄) may be constructed by subtraction of an estimate of λ₀ t. Provided that the product of the line frequency (f₁) and the sample duration (T) is much greater than one, the sample quantity N(T)/T is an approximately unbiased estimate of λ₀. The resultant zero mean process N̄ has a Fourier transform with a nonzero expectation:

    J_k(f) = ∫_{−∞}^{∞} h_k(t) e^{−2πift} dN̄(t)                          (8.2)

    E{J_k(f)} = c₁ H_k(f − f₁) + c₁* H_k(f + f₁),                         (8.3)

where

    c₁ = λ₁ e^{iφ} / 2.                                                   (8.4)
In the case where f > 0 and f₁ > W,

    E{J_k(f)} ≈ c₁ H_k(f − f₁).                                           (8.5)
The estimates of J_k(f₁) from different tapers provide a set of uncorrelated estimates of c₁ H_k(0). It is hence possible to estimate the value of c₁ by complex regression:

    ĉ₁ = Σ_k J_k(f₁) H_k(0) / Σ_k |H_k(0)|².                              (8.6)

Under the null hypothesis that there is no line in the spectrum (c₁ = 0), it may readily be shown that E{ĉ₁} = 0 and var{ĉ₁} = S(f₁) / Σ_k |H_k(0)|². The residual spectrum,¹⁴ which has the line removed, may be estimated using

    Ŝ(f) = (1/K) Σ_k |J_k(f) − ĉ₁ H_k(f − f₁)|².                          (8.7)
In the large sample limit, the distributions of both ĉ₁ and Ŝ(f₁) are known (Percival & Walden, 1993) and may be used to derive

    |ĉ₁|² Σ_k |H_k(0)|² (K − 1) / Σ_k |J_k(f₁) − ĉ₁ H_k(0)|²  ~  F_{2, 2(K−1)},   (8.8)

where ~ denotes "is distributed as." The null hypothesis may be tested using this relation; if it is rejected, the line can be removed using equation 8.7 to estimate the residual spectrum. It is worth noting that although relation 8.8 was derived for large samples, the test is remarkably robust as the sample size is decreased. Numerical tests indicate that the tail of the F distribution is well reproduced even in situations where there are as few as five spikes in total.

8.2 Periodic Stimulation. A common paradigm in neurobiology where line spectra are particularly important is that of periodic stimulation. When a neuron is driven by a periodic stimulus of frequency f₁, the spectrum may contain lines at any of the harmonics n f₁. Provided that f₁ > 2W, the analysis of section 8.1 applies, with each harmonic being separately tested for significance. The first moment of the process, which has period 1/f₁, is given by equation 8.9 and may be estimated using the ĉ_n:

    λ(t) = λ₀ + Σ_n λ_n cos(2π n f₁ t + φ_n),                             (8.9)
14. It is also possible to estimate a residual coherency. In order to do this, one uses a residual cross-spectrum Ŝ_xy(f) = (1/K) Σ_k (J_k^x(f) − ĉ₁^x H_k(f − f₁))* (J_k^y(f) − ĉ₁^y H_k(f − f₁)), together with the residual spectra, to evaluate the usual expression for coherency.
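The regression of equation 8.6 and the F statistic of equation 8.8 are short to implement. A sketch in Python (names are ours; J holds the tapered transforms J_k(f₁) and H0 the values H_k(0), assumed real for real tapers):

```python
def fit_line_amplitude(J, H0):
    """Complex regression of equation 8.6: estimate c1 from the K
    per-taper values J[k] = J_k(f1), with H0[k] = H_k(0)."""
    return (sum(Jk * Hk for Jk, Hk in zip(J, H0))
            / sum(abs(Hk) ** 2 for Hk in H0))

def line_f_statistic(J, H0):
    """Test statistic of equation 8.8, distributed as F_{2,2(K-1)}
    under the null hypothesis c1 = 0."""
    K = len(J)
    c1 = fit_line_amplitude(J, H0)
    num = abs(c1) ** 2 * sum(abs(Hk) ** 2 for Hk in H0) * (K - 1)
    den = sum(abs(Jk - c1 * Hk) ** 2 for Jk, Hk in zip(J, H0))
    return num / den
```

In use, one compares the statistic against the F_{2,2(K−1)} quantile at the chosen significance level; a large value rejects the no-line null, after which equation 8.7 gives the residual spectrum.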
where λ_n = 2|c_n|, φ_n = tan⁻¹{Im(c_n)/Re(c_n)}, and the sum is taken over all the significant coefficients. This rate function λ(t) is the average response to a single stimulus, or impulse response. The coefficients c_n are the Fourier series representation of λ(t).

8.3 Error Bars. It is possible to put confidence intervals on both the modulus and the phase of the coefficients ĉ_n. For large samples (more than 10 spikes), the real and imaginary parts of ĉ_n are distributed as independent gaussians, each with standard deviation σ_n = sqrt( S(n f₁) / (2 Σ_k |H_k(0)|²) ). For c_n/σ_n > 3, the distribution of |ĉ_n| is well approximated by a gaussian centered on |c_n| and with standard deviation σ_n. In addition, the estimated phase angle (φ̂_n) is also almost gaussian, with mean φ_n and standard deviation σ_n/|c_n|. Approximate error bars or confidence intervals may be obtained using a sample-based estimate of σ_n, σ̂_n = sqrt( Ŝ(n f₁) / (2 Σ_k |H_k(0)|²) ).
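Once the significant ĉ_n are in hand, the impulse response of equation 8.9 can be reconstructed pointwise; a minimal sketch (function name is ours):

```python
import cmath
import math

def impulse_response(coeffs, f1, t, lam0):
    """Equation 8.9 evaluated at time t: lambda(t) = lambda_0 +
    sum_n 2*|c_n| * cos(2*pi*n*f1*t + phase(c_n)), where the sum runs
    over the harmonics judged significant and coeffs[n-1] holds c_n."""
    lam = lam0
    for n, cn in enumerate(coeffs, start=1):
        lam += 2.0 * abs(cn) * math.cos(2.0 * math.pi * n * f1 * t
                                        + cmath.phase(cn))
    return lam
```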
Estimating error bars for the impulse response function is more involved because of its nonlocal nature (if one of the Fourier coefficients is varied, the impulse response function changes everywhere). It is therefore of interest to estimate a global confidence interval, defined as any interval such that the probability of the function crossing the interval anywhere is some predefined probability. A method for estimating a global confidence band is detailed in Sun and Loader (1994) and outlined here. First, a basis vector Ψ(t) is constructed:

    Ψ(t) = ( σ̂₁ cos(2π f₁ t), …, σ̂_N cos(2π f_N t),
             σ̂₁ sin(2π f₁ t), …, σ̂_N sin(2π f_N t) )ᵀ,                   (8.10)
where N is the total number of harmonics. The elements of this vector have unit variance, and a standard approximation, which relies on the σ_n being gaussian, may be applied:

    P( sup |λ̂(t) − E{λ̂(t)}| > c ||Ψ(t)|| ) ≤ 2(1 − Φ(c)) + (κ/π) e^{−c²/2},   (8.11)

where sup is the maximum value of its operand, ||Ψ(t)|| denotes the length of the vector Ψ(t), Φ(c) is the cumulative standard normal distribution, and κ is a constant. κ may be evaluated by constructing the 2 × N matrix X(t) = [Ψ(t) dΨ(t)/dt], forming its QR decomposition (Press, Teukolsky, Vetterling, & Flannery, 1992), and then evaluating κ = ∫₀^T |R₂₂(t)/R₁₁(t)| dt.
Figure 7: (a) Gaussian kernel (100 ms width) smoothed firing rate with 2σ error bars based on a stationarity assumption. The vertical lines indicate the period over which the spectrum was calculated. A light is flashed at time zero, and the spectrum is evaluated over the interval when the monkey is required to remember the target location. (b) The spectrum evaluated over this interval using a lag window estimator with a 40% cosine taper and a gaussian lag window of width 3.5 Hz. 95% confidence limits are shown with the finite size correction included (this typically resulted in a decrease in ν(f) from about 50 to 36). The horizontal line indicates the high-frequency limit. (c) The same spectrum evaluated using a multitaper estimator. A bandwidth (W) of 5 Hz was used, allowing five tapers. Both estimators have the same degrees of freedom.
Confidence intervals for the residual spectrum are calculated in the usual manner (using χ²), although at the line frequencies, the interval is slightly broadened due to the loss of two degrees of freedom incurred by estimation of c_n. Section 11 contains an example application of the methods described in this section.

9 Example Spectra

Figure 7 is a spectrum calculated from data collected from a single cell recorded from area PRR in the parietal cortex of an awake behaving monkey during a delayed memory reach task (Snyder et al., 1997). The spectrum is calculated over an interval of 0.5 second during which the firing rate is reasonably stationary (the average firing rate lies within the error bars) and is averaged over five trials. The spectrum shows two features that achieve significance. There is enhancement of the spectrum in the frequency band 20–40 Hz, indicating the presence of an underlying broadband oscillatory mode in the neuronal firing rate. In addition, there is suppression of the spectrum at low frequencies. As discussed previously, a suppression of this sort is consistent with an effective refractory period during which the neuron is less likely to fire. Care must be taken at low frequencies, since at frequencies comparable to the smoothing width, the spectrum is particularly sensitive to any nonstationarity in the data. It may be useful for the practitioner to review parameter choice and the relative merits of multitaper and lag window estimators. First, there is the
choice of time interval. This was chosen to be as short as possible, consistent with resolving the structure of interest. Choice of bandwidth is a subject on which a considerable literature exists and which is beyond the scope of this article (Loader, 1999). However, in practice, it is usually straightforward to choose a reasonable bandwidth by trial and error. A balance should be sought between undersmoothing, where the structure is not clear due to the high variance of the estimator, and oversmoothing, where the structure is blurred out because of excessive frequency averaging. One of the advantages of the multitaper method over the lag window method is that once the time interval and bandwidth have been set, the calculation of the spectrum is automatic and, in the sense defined in section 4.4, optimal. If a lag window estimate is used, then the trade-off between bias and variance is not formalized, and one has to choose the taper and kernel smoother separately (see section 4.8). Although both the lag window and the multitaper estimates reveal the same structure in the example, this may not have been the case if the spectrum had a larger dynamic range. Whichever method is used, exploratory data analysis is always recommended when choosing the time interval and bandwidth.

10 Example Coherency

To illustrate the estimation of coherency, simulated spike trains were generated from a coupled doubly stochastic Poisson process. For a given trial, a pair of rate functions was drawn from a gaussian process. The realizations share a coherent mode that is linearly mixed into the rates of both cells. These coupled rate functions are then used to draw a realization of an inhomogeneous Poisson process for each cell independently. Using this method, 15 trials of duration 0.5 second were generated. The coherent mode was set such that the population coherence was a gaussian of height 0.35 and standard deviation 5 Hz centered on 20 Hz. The phase of this mode was set to 180 degrees.
Figure 8 indicates that this coherent mode is reasonably estimated.

11 Example Periodic Stimulation

An example of an analysis of a periodic stimulus paradigm is shown in Figure 9. The datum is a single cell recording collected from the barrel cortex of an awake behaving rat during periodic whisker stimulation at 5.5 Hz (Sachdev, Jenkinson, Melzer, & Ebner, 1999). There is a single trial of duration 50 seconds. During the trial, the average firing rate, estimated using a 2 second kernel, is constant to within the error bars expected for a Poisson process. The estimated impulse response function λ̂(t) is seen to have two distinct sharp peaks, outside of which the response does not differ significantly from zero. The moduli of the Fourier coefficients are significant out to n = 25. This
Figure 8: (a) Coherence (left axis) and phase of the coherency (right axis). Fifteen trials of 0.5 second duration were simulated using a doubly stochastic Poisson process as described in the text. A multitaper estimator with a smoothing width of 7 Hz was used. Finite size corrections were used and resulted in a 25% reduction in the degrees of freedom. A horizontal line has been drawn at the 95% confidence level under the null hypothesis of no coherency. Where the null hypothesis is rejected, the phase of the coherency is estimated and shown with an approximate 95% confidence interval. (b) The standardized coherence is a transformation that maps the null distribution onto an approximately standard normal variate (as described in section 5.1). The estimated coherence at 20 Hz would therefore lie at three standard deviations if there were no population coherence.
automatically sets the smoothing of λ̂(t), as structure on a timescale of less than 1/(25 × 5.5) ≈ 7 ms does not achieve significance. Note that the coefficients are enhanced at index multiples of 6 (approximately 33 Hz), which comes from having two peaks in the time domain λ(t) separated by approximately 30 ms. The phase of the coefficients closely follows a straight line, but there is a small, periodic deviation from this line, which is again at index multiples of 6. The gradient of the straight line depends on the time delay of the response. The residual spectrum was calculated by first evaluating a multitaper estimate from which the significant harmonics were removed. This spectrum had a bandwidth of 1.5 Hz, chosen to avoid overlap of the harmonics, which led to the multitaper estimate being undersmoothed. A further smoothing was therefore performed using a lag window.¹⁵ The resultant spectrum displays a slight but significant suppression relative to a Poisson process out to almost 200 Hz. Such a spectrum is characteristic of a short timescale refractory period. The residual spectrum is particularly useful because rate nonstationarity has
15. The previous theory developed for lag window estimators applies to this hybrid estimator with |H(·)|² replaced by (1/K) Σ_{k=0}^{K−1} |H_k(·)|² in equation 4.13, and with |H(·)|² replaced by (1/K) Σ_{k,k′=0}^{K−1} |H_{kk′}(·)|² in equation 4.15.
Figure 9: Response to a periodic stimulation of frequency 5.5 Hz. (a) Impulse response function with global 95% confidence interval. (b) |ĉ_n| versus index n with 95% confidence interval. Dots indicate points that achieved significance in the F-test. (c) Residual spectrum with finite-size-corrected confidence interval. A multitaper spectrum with 100 tapers and a bandwidth of 1.5 Hz was used initially to avoid overlap of harmonics. This spectrum was then further smoothed using a gaussian lag window with standard deviation 9 Hz. (d) The coefficient phases φ̂_n (in radians) versus index n after subtraction of a fitted straight line of gradient 2π/3 ± 0.01. The black dashed lines are a 95% confidence interval about zero.
been removed. As was the case for the spectrum, the choice of bandwidth for the residual spectrum depends on the observed structure in the data. The large bandwidth of 9 Hz was chosen because the residual spectrum is very flat and can therefore tolerate a high degree of smoothing.

12 Summary

It is our belief that spectral analysis is a fruitful and underexploited analysis technique for spike trains. In this article, an attempt has been made to collect
Table 1: Basic Direct Spectral Estimator in Terms of Which the Other Estimators Can Be Written.

    J_{ak}^n(f) = ∫₀^T h_k(t) e^{−2πift} dN_a^n(t)

    I_{ab}^{kn}(f) = J_{ak}^n(f) J_{bk}^{n*}(f)

Notes: For clarity, the superscript D on the direct spectral estimate has been omitted. The index n labels trials, index k labels tapers, and indices a and b label cells.
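Because the integral in Table 1 is taken against the counting measure dN, the basic estimator reduces to a sum over spike times. A minimal sketch (names are ours; the rectangular taper h(t) = 1/√T is assumed by default, and the mean-rate subtraction needed near f = 0 is omitted):

```python
import cmath
import math

def J_direct(spike_times, f, T, taper=None):
    """Tapered Fourier transform of a point process: the integral of
    h(t)*e^{-2*pi*i*f*t} against dN(t) is a sum over the spike times."""
    if taper is None:
        taper = lambda t: 1.0 / math.sqrt(T)
    return sum(taper(t) * cmath.exp(-2j * math.pi * f * t)
               for t in spike_times)

def I_direct(spike_times, f, T, taper=None):
    """Direct spectral estimate I(f) = J(f) J*(f) = |J(f)|^2. For
    frequencies near zero the mean rate should first be subtracted
    (dN -> dN - lambda*dt); that step is omitted here for brevity."""
    return abs(J_direct(spike_times, f, T, taper)) ** 2
```

Averaging I over trials and tapers then gives the DT, LW, and MT estimators of Table 2.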
Table 2: Estimators and Large Sample Degrees of Freedom º0 of Estimates of the Spectrum (ab D 11). X D 1 NT
DT LW MT
1 NT
Equation Number
I 01 ( f )
4.1
2
4.9
2NT
4.11
2NT /j
4.18
2NT K
PabNT
PN T R 1 nD 1 1 NT K
X (f) Iab
I 0n ( f ) nD 1 ab
0n ( 0 )df 0 K( f ¡ f 0 )Iab f
P¡1 NT P K¡1 nD 1
kD 0
kn ( f ) Iab
º0
kn Note: The indices on the Iab are as follows: ab label the cells from which the estimates are constructed, k labels the taper, and n labels the trial.
the machinery necessary for performing spectral analysis on spike train data into a single document. Starting from the population definitions, the statistical properties of estimators of the spectrum and coherency have been reviewed. Estimation methods for both continuous spectra and spectra that contain lines have been included. In addition, new corrections to asymptotic error bars have been presented that increase confidence in applying spectral techniques in practical situations where data are often sparse. Tables 1 to 5 summarize the important formulas. Matlab software implementing the methods discussed in this article is available online from http://www.vis.caltech.edu/~WAND/.

Appendix: Derivation of Finite Size Correction

Following is an outline derivation of the finite-size corrections described in section 6. First, the characteristic functionals (Bartlett, 1966) for the processes N and the inhomogeneous Poisson process rate λ(t) are related:

    C_N(η(t)) = E{ exp( i ∫₀^T η(t) dN ) }
Table 3: Main Formulas Required for Estimating Spectral Error Bars.

    Variance:            var{I_aa^X(f)} = 2 E{I_aa^X(f)}² / ν(f)                 (4.20)
                         Use ν₀ for the asymptotic result, or ν(f) if using
                         the finite size correction.
    Degrees of freedom:  1/ν(f) = 1/ν₀ + C_h^X W(f) / (2 T N_T E{I_X(f)}²)       (6.3)
                         See text for definitions of C_h^X and W(f).
    Confidence
    (1 − 2p) × 100%:     [ ν I_X(f)/q₂ , ν I_X(f)/q₁ ]                           (4.23)
                         q₁ s.t. P[χ²_ν ≤ q₁] = p; q₂ s.t. P[χ²_ν ≥ q₂] = p.

Note: Refer to section 4 for additional information.
Table 4: Main Formulas Required for Coherency Estimation.

    Coherency:                   C_ab^X(f) = I_ab^X / sqrt( I_aa^X I_bb^X )      (5.1)
    Distribution for coherence:  P(|C|) = (ν − 2)|C| (1 − |C|²)^{ν/2−2}          (5.4)
                                 Under the null hypothesis c = 0.
    Confidence for phase:        φ̂(f) ± 2 sqrt( (2/ν)(1/|C(f)|² − 1) )          (5.5)
                                 Approximately 95%.

Note: Refer to section 5 for additional information.
Table 5: Main Formulas Required for the Detection and Removal of a Line from the Spectrum.

    Complex amplitude of line:   ĉ₁ = Σ_k J_k(f₁) H_k(0) / Σ_k |H_k(0)|²                        (8.6)
    F-test to assess the
    significance of a line:      |ĉ₁|² Σ_k |H_k(0)|² (K − 1) / Σ_k |J_k(f₁) − ĉ₁ H_k(0)|²
                                 ~ F_{2,2(K−1)}     Null: c₁ = 0                                 (8.8)
    Residual spectrum:           Ŝ(f) = (1/K) Σ_k |J_k(f) − ĉ₁ H_k(f − f₁)|²                    (8.7)

Note: Refer to section 8 for additional information.
        = E_λ{ exp( ∫₀^T λ(t) b(η(t)) dt ) }                              (A.1)

    b(η(t)) = exp[ iη(t) − (i/T) ∫₀^T η(t′) dt′ ] − 1.                    (A.2)
Under the gaussian process assumption for λ(t), this integral may be done:

    C_N = exp[ (1/2) ∫₀^T ∫₀^T b(t) Λ(t, t′) b(t′) dt dt′ + λ ∫₀^T b(t) dt ]   (A.3)

    Λ(t, t′) = E_λ{ (λ(t) − λ)(λ(t′) − λ) }.                              (A.4)
Note that λ denotes the mean rate. Taking the log of the characteristic functionals now yields the following relation between the resultant cumulant functionals:

    K_N = ln E{ exp( i ∫₀^T η(t) dN ) }
        = (1/2) ∫₀^T ∫₀^T b(t) Λ(t, t′) b(t′) dt dt′ + λ ∫₀^T b(t) dt.    (A.5)
Next, η(t) is chosen appropriately and substituted into K_N. The form for η(t) required to obtain the covariance of multitaper estimators is

    iη(t) = θ₁ h_k(t) e^{−2πif₁t} + θ₂ h_k(t) e^{2πif₁t}
            + θ₃ h_{k′}(t) e^{−2πif₂t} + θ₄ h_{k′}(t) e^{2πif₂t}.         (A.6)
Substituting into the cumulant functional for N yields

    K_N = ln E{ exp( θ₁ J_k^D(f₁) + θ₂ J_k^{D*}(f₁) + θ₃ J_{k′}^D(f₂) + θ₄ J_{k′}^{D*}(f₂) ) },   (A.7)
where J_k^D is the Fourier transform of the data tapered by a function indexed by k. Application of the cumulant expansion theorem (Ma, 1985) then leads to

    K_N = E{ exp( θ₁ J_k^D(f₁) + θ₂ J_k^{D*}(f₁) + θ₃ J_{k′}^D(f₂) + θ₄ J_{k′}^{D*}(f₂) ) − 1 }_c,   (A.8)

where the subscript c denotes the cumulant average.
This may then be differentiated and set to zero:

    K_{lmno} = ∂K_N / (∂θ₁^l ∂θ₂^m ∂θ₃^n ∂θ₄^o) |_{θ₁=θ₂=θ₃=θ₄=0}
             = E{ J_k^{D l}(f₁) J_k^{D m *}(f₁) J_{k′}^{D n}(f₂) J_{k′}^{D o *}(f₂) }_c.   (A.9)
Moments of the estimators may be expressed in terms of these cumulant derivatives. The expressions are simplified by the fact that all cumulant derivatives whose indices sum to an odd number are zero, because N is a zero mean process:

    E{I_k^D(f)} = K_{1100}                                                (A.10)

    var{I^{MT}(f)} = (1/K²) Σ_{k=0}^{K−1} Σ_{k′=0}^{K−1} cov{ I_k^D(f), I_{k′}^D(f) }   (A.11)

    cov{ I_k^D(f), I_{k′}^D(f) } = K_{1010} K_{0101} + K_{1111} + K_{1001} K_{0110}.   (A.12)
The problem has now been reduced to that of calculating these derivatives within the model. This is done by substituting the expression for h (t) into the right-hand side of equation A.5. Considerable algebra then leads to the following exact result: A B C Klmno Klmno D Klmno ,
(A.13)
where A Klmno D
µ ¶ C µ ¶ C 1 X ¡H1 ( f1 ) l2 l4 ¡H1 ( f1 )¤ m2 m4 l!m!n!o! ¢¢¢ 2 l ,m ,n ,o Pli !Pmi !Pni !Poi ! T T i i i i µ ¶ µ ¶ ¡H1 ( f2 ) n2 C n4 ¡H1 ( f2 )¤ o2 C o4 l1 ,m1 ,n1 ,o1 Il3 ,m3 ,n3 ,o3 (A.14) T T
P where i li D l and cases where l1 C l2 D l or l3 C l4 D l are excluded (and also for n, m, o): Z ,m1 ,n 1 ,o1 Ill31,m D 3 ,n3 ,o3
1 1
Sl ( f )Hl1 C m1 C n1 C o1 [ f1 (l1 ¡ m1 ) C f2 (n1 ¡ o1 ) ¡ f ] . . . Hl¤3 C m3 C n3 C o3 [ f1 (l3 ¡ m3 ) C f2 (n3 ¡ o3 ) ¡ f ]df, (A.15)
where Sl ( f ) is the spectrum of the gaussian process and Hl is Z
1
hl (t) exp(¡2p if t)dt
(A.16)
H0 ( f ) D T exp(¡ip f T )sinc (p f T )
(A.17)
Hl ( f ) D
¡1
748
M. R. Jarvis and P. P. Mitra
KBlmno D l
l X m X n X o X l m n o [ ][ ][ ][ ]Hp C q C r C s p q r s pD 0 q D 0 r D 0 s D 0
µ
¶( ) ¡H1 ( f1 ) l¡p T µ ¶(m¡q ) µ ¶ (n¡r) µ ¶( ) ¤ ¡H1 ( f1 ) ¡H1 ( f2 ) ¡H1 ( f2 )¤ o¡s £ . (A.18) T T T £[ f1 (p ¡ q) C f2 (r ¡ s)] ¢ ¢ ¢
The preceding result is somewhat cumbersome but readily evaluated computationally for a given spectrum. The expression simplifies greatly when only frequencies above the smoothing width are considered, and many of the terms may be neglected. Restricting attention to the second-order properties, there are only a few remaining dominant terms. Terms from K_{1001} lead to the previously discussed asymptotic results, but there are corrections that arise from the term K_{1111}. Assuming that the population spectrum varies slowly over the width of the tapers leads to the result given by equations 6.2 through 6.7. The validity of this assumption has been tested computationally and was found to be very accurate even for spectra with sharp peaks.

Acknowledgments

We thank C. Buneo, R. Sachdev, and F. F. Ebner for providing example data sets, C. Loader for help with the calculation of global error bars, and D. R. Brillinger and D. J. Thomson for comments that substantially improved the manuscript. M. J. is grateful to R. A. Andersen for both his continued support of theoretical work in his lab and his careful reading of the manuscript. M. J. acknowledges the generous support of the Sloan Foundation for Theoretical Neuroscience.

References

Abeles, M., Deribaupierre, F., & Deribaupierre, Y. (1983). Detection of single unit responses which are loosely time-locked to a stimulus. IEEE Transactions on Systems, Man and Cybernetics, 13, 683–691.
Anderson, T. W. (1984). An introduction to multivariate statistical analysis. New York: Wiley.
Bartlett, M. S. (1966). An introduction to stochastic processes. Cambridge: Cambridge University Press.
Brillinger, D. R. (1972). The spectral analysis of stationary interval functions. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (1:483–513).
Brillinger, D. R. (1974). Time series. New York: Holt, Rinehart and Winston.
Brillinger, D. R. (1978). Comparative aspects of the study of ordinary time series and of point processes. In Developments in statistics (Vol. 1, pp. 33–129). Orlando, FL: Academic Press.
Brody, C. D. (1998). Slow covariations in neuronal resting potentials can lead to artefactually fast cross-correlations in their spike trains. Journal of Neurophysiology, 80, 3345.
Cox, D. R., & Lewis, P. (1966). The statistical analysis of series of events. London: Chapman and Hall.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman and Hall.
Gerstein, G. L., Perkel, D. H., & Dayhoff, J. E. (1985). Cooperative firing activity in simultaneously recorded population of neurons—detection. Journal of Neuroscience, 4, 881–889.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit intercolumnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Hannan, E. J. (1970). Multiple time series. New York: Wiley.
Loader, C. R. (1999). Local regression and likelihood. New York: Springer-Verlag.
Ma, S. K. (1985). Statistical mechanics. Singapore: World Scientific Publishing.
Mitra, P. P., & Pesaran, B. (1999). Analysis of dynamic brain imaging data. Biophysical Journal, 76, 691–708.
Percival, D. B., & Walden, A. T. (1993). Spectral analysis for physical applications. Cambridge: Cambridge University Press.
Pesaran, B., Pezaris, J. S., Sahani, M., Mitra, P. P., & Andersen, R. A. (2000). Temporal structure in neuronal activity during working memory in macaque parietal cortex. Unpublished manuscript.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes. Cambridge: Cambridge University Press.
Rosenberg, J. R., Amjad, A. M., Breeze, P., Brillinger, D. R., & Halliday, D. M. (1989). The Fourier approach to the identification of functional coupling between neuronal spike trains. Progress in Biophysics and Molecular Biology, 53, 1–31.
Sachdev, R., Jenkinson, E., Melzer, P., & Ebner, F. F. (1999). Dual electrode recordings from the awake rat barrel cortex. Somatosensory and Motor Research, 16, 163–192.
Snyder, L. H., Batista, A. P., & Andersen, R. A. (1997). Coding of intention in the posterior parietal cortex. Nature, 386, 167–169.
Sun, J., & Loader, C. R. (1994). Simultaneous confidence bands for linear regression and smoothing. Annals of Statistics, 22, 1328–1345.
Thomson, D. J. (1982). Spectrum estimation and harmonic analysis. Proceedings of the IEEE, 70, 1055–1096.
Thomson, D. J. (2000). Nonlinear and nonstationary signal processing. Cambridge: Cambridge University Press.
Thomson, D. J., & Chave, A. D. (1991). Advances in spectrum analysis and array processing. Englewood Cliffs, NJ: Prentice Hall.

Received March 6, 2000; accepted August 26, 2000.
NOTE
Communicated by Jonathan Victor
A Novel Spike Distance

M. C. W. van Rossum
Department of Biology, Brandeis University, Waltham, MA 02454, U.S.A.

The discrimination between two spike trains is a fundamental problem for both experimentalists and the nervous system itself. We introduce a measure for the distance between two spike trains. The distance has a time constant as a parameter. Depending on this parameter, the distance interpolates between a coincidence detector and a rate difference counter. The dependence of the distance on noise is studied with an integrate-and-fire model. For an intermediate range of time constants, the distance depends linearly on the noise. This property can be used to determine the intrinsic noise of a neuron.

1 Introduction
In the analysis of experimental neural data (e.g., when studying the effect of a certain manipulation or how reproducible a neuron's response is), one often encounters the question, "How similar are these two spike trains?" A problem is that it is not really known what information to look for in the spike train. In particular cases the precise timing is known to be important, whereas in other cases, only the number of spikes in a certain interval seems of importance (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1996). The nervous system itself is possibly confronted with the same question of similarity, as when it has to respond to a certain visual cue and has to decide what the appropriate response is ("Was that really a tiger?!"). One approach to addressing this question is to introduce a distance that measures the (dis)similarity of two spike trains. If the distance between two spike trains is small enough, one can assume that the inputs were identical. Alternatively, distance measures can be used in a forced-choice experiment in which the spike train is compared to different templates. In that case, the template with the smallest distance to the trial should be chosen. Measures to compare spike trains have been introduced before, for example:

The total spike count. The total number of spikes in the spike trains is compared. This method is quite effective, but it misses all temporal structure in the spike trains.

To resolve temporal structure, the spikes can be binned and the number of spikes per bin counted. Next, one measures the number of coincident spikes (see, e.g., Kistler, Gerstner, & van Hemmen, 1997). Alternatively, the count is interpreted as a vector in N-dimensional space, where N is the number of bins. The distance to other spike trains can be calculated using an N-dimensional Euclidean distance (Geisler, Albrecht, Salvi, & Saunders, 1991; MacLeod, Backer, & Laurent, 1998). A disadvantage of the binning procedure is that it does not distinguish between a spike shifted so that it just drops out of its bin and a spike shifted by many bin durations. And a spike that is shifted but stays within its bin is treated like a spike that was not shifted at all. These are not always desired effects if spike timing matters. A more flexible distance measure is needed that takes the temporal structure into account but avoids the problems associated with binning.

In a set of comprehensive papers, Victor and Purpura (1996, 1997) introduced a measure based on a cost function. Two processes differentiate one spike train from another: (1) spikes can be deleted or inserted, and (2) they can be shifted in time. Victor and Purpura attributed to both processes a certain (arbitrary) cost and calculated how much it would cost to transform one spike train into the other. The cost of insertion or deletion was fixed to one. If no cost was associated with the shifting process, the total cost was just the difference in the total number of spikes. Otherwise, the cost function for the shift was taken to be a monotonically increasing function of the spike time difference. Although this method has been applied successfully (MacLeod et al., 1998), the calculation of the full cost function is quite involved. The reason is that it is not always clear where a displaced spike came from, and if the number of spikes in the trains is unequal, it can be difficult to determine which spike was inserted or deleted. Here we introduce a spike distance closely related to the distance introduced by Victor and Purpura, yet it is easier to calculate and has a physiological interpretation.

Neural Computation 13, 751–763 (2001) © 2001 Massachusetts Institute of Technology
The distance is used to measure the intrinsic noise of a model neuron.

2 A Novel Measure
With the goal of a simple distance measure, we propose the following. Given a spike train with spike times $t_i$,
$$ f_{\text{orig}}(t) = \sum_i^M \delta(t - t_i), \tag{2.1} $$
where we will assume that all $t_i > 0$. Replace the delta function associated with each spike with an exponential function, that is, add an exponential tail to all spikes,
$$ f(t) = \sum_i^M H(t - t_i)\, e^{-(t - t_i)/t_c}. \tag{2.2} $$
Figure 1: Definition of the distance. (Top) Two spike trains (one flipped) are convolved with an exponential with time constant $t_c$. (Bottom) The difference squared of the spike trains. The distance is given by the integral of this curve.
Here $t_c$ is the time constant of the exponential function and $H$ is the Heaviside step function ($H(x) = 0$ if $x < 0$ and $H(x) = 1$ if $x \ge 0$). In principle, one could convolve with functions other than the exponential. Our choice of the exponential was motivated by its causality, simplicity, and possible biological interpretation (see section 4). The distance between two trains $f$ and $g$ we define as (see Figure 1)
$$ D^2(f, g)_{t_c} = \frac{1}{t_c} \int_0^\infty \left[ f(t) - g(t) \right]^2 dt. \tag{2.3} $$
The distance is the Euclidean distance of the two filtered spike trains, with $t_c$ as a free parameter. To get a sense of the distance, consider the two limits of $t_c$. For $t_c$ much smaller than the interspike interval, the smeared functions $f$ and $g$ contribute to the integral only if the spikes are not more than about $t_c$ apart. This is very similar to coincidence detection. For most spike trains, coincident spikes can be neglected in the limit of zero $t_c$; thus, if $f$ contains $M$ and $g$ contains $N$ spikes, one has
$$ \lim_{t_c \to 0} D^2(f, g)_{t_c} = \frac{1}{t_c} \int_0^\infty \left[ f^2(t) + g^2(t) \right] dt = \frac{M + N}{2}. \tag{2.4} $$
The distance counts the noncoincident spikes.
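Because the filtered trains are sums of exponential tails, the integral in equation 2.3 can be evaluated in closed form: after division by $t_c$, each pair of tails starting at $t_i$ and $t_j$ contributes $\tfrac{1}{2}e^{-|t_i - t_j|/t_c}$. A minimal sketch (the function name and example trains are illustrative):

```python
import math

def vr_dist2(u, v, tc):
    """Distance squared (eq. 2.3) between spike trains u and v, given as
    lists of spike times, using the exact pairwise-overlap evaluation:
    (1/tc) * integral of tail_i * tail_j = 0.5 * exp(-|ti - tj| / tc)."""
    def corr(a, b):
        return sum(math.exp(-abs(ti - tj) / tc) for ti in a for tj in b)
    return 0.5 * corr(u, u) + 0.5 * corr(v, v) - corr(u, v)

train = [10.0, 35.0, 80.0]
print(vr_dist2(train, train, tc=5.0))            # identical trains: 0.0
print(vr_dist2(train, train + [200.0], tc=5.0))  # one inserted spike: about 1/2
```

The second call illustrates equation 2.7: inserting a single (well-separated) spike costs 1/2, independent of $t_c$.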
In the opposite limit, for large $t_c$, the main contribution to the integral comes from times when the last spike has passed but the exponential has still not decayed. Assuming that $f$ ($g$) contains $M$ ($N$) spikes, one can approximate
$$ \lim_{t_c \to \infty} D^2(f, g)_{t_c} = \frac{1}{t_c} \int_0^\infty \left( M e^{-t/t_c} - N e^{-t/t_c} \right)^2 dt = \frac{(M - N)^2}{2}. \tag{2.5} $$
In this limit, $D$ measures the difference in total spike count. Note that it is important that the upper limit of the integral is taken to be infinity (or, in practice, until the tail of the last spike has died out) instead of the time of the last spike. The distance thus interpolates between the two extremes of coincidence detection and measuring the difference in total spike count. The distance measure is very easy to implement numerically. For the best results, the integrals between spikes and after the last spike can be done analytically. After filtering the spike trains, this leaves a sum of $N + M$ terms. Changing integration variables leads to an alternative expression for the distance,
$$ D^2(f, g) = \frac{1}{2} \int_{-\infty}^{\infty} C_{f-g,\,f-g}(t)\, e^{-|t|/t_c}\, dt, \tag{2.6} $$
where $C_{f-g,\,f-g}(t)$ is the autocorrelation of the difference of the raw spike trains, $f_{\text{orig}}(t) - g_{\text{orig}}(t)$. This shows that the distance can be interpreted as a weighted integral over the autocorrelation, with the weighting depending on $t_c$. It also shows that the distance is invariant under time reversal; that is, the distance is the same if the exponential tails were attached to the other sides of the spikes.

2.1 Analytical Results. For some simple cases, we can derive analytical results. Let us consider the distance between two almost identical spike trains. First consider the insertion of a single spike at time $t_i$ into train $f$; other than that, the spike trains are identical, that is,
$$ g(t) = f(t) + H(t - t_i)\, e^{-(t - t_i)/t_c}. $$
The distance is
$$ D^2_{\text{insertion}}(f, g) = \frac{1}{t_c} \int_{t_i}^{\infty} e^{-2(t - t_i)/t_c}\, dt = \frac{1}{2}. \tag{2.7} $$
The removal of a spike yields the same answer. Note that the result is independent of $t_c$.
Now suppose a spike is shifted from $t_i$ in spike train $f$ to time $t_i + \delta t$ in $g$,
$$ g(t) = f(t) - H(t - t_i)\, e^{-(t - t_i)/t_c} + H(t - t_i - \delta t)\, e^{-(t - t_i - \delta t)/t_c}, $$
which yields a distance
$$ D^2_{\text{displace}}(f, g) = \frac{1}{t_c} \int_{t_i}^{t_i + \delta t} e^{-2(t - t_i)/t_c}\, dt + \frac{1}{t_c} \int_{t_i + \delta t}^{\infty} \left[ e^{-(t - t_i)/t_c} - e^{-(t - t_i - \delta t)/t_c} \right]^2 dt = 1 - e^{-|\delta t|/t_c}, \tag{2.8} $$
which is unity for small $t_c$ and vanishes for large $t_c$. The ratio of the distances due to spike insertion and displacement depends on $t_c$. For a shift $\delta t = t_c \ln(2)$, insertion and displacement cost the same. Note, however, that displacing a spike never costs more than the insertion and removal of two spikes; the reason is that shifting a spike can always be done by removing the spike at time $t$ and reinserting it at time $t + \delta t$.

2.2 Correlation Effects. When one considers two trains in which more spikes are changed, the distance is more complicated and certainly not always equal to the sum of the individual displacement and insertion distances. As an example, consider the case where two spikes, a time $T$ apart, are both displaced by $\delta t$, with no other spike between the two. For this case one finds
$$ D^2(f, g) = 2\left(1 - e^{-|\delta t|/t_c}\right) - 2\,e^{-|T|/t_c}\left[\cosh(\delta t/t_c) - 1\right]. \tag{2.9} $$
In Figure 2 the distance is plotted for various amounts of shift. The first term is twice the single-spike displacement distance, equation 2.8. The second term is due to the correlation and is always negative. Thus, the total distance for displacing two spikes is smaller if the spikes are close. This seems a natural phenomenon, as the spike trains will look more similar when two neighboring spikes are shifted than when two spikes far apart are shifted. However, a proper description of such effects calls for a more complex distance measure with the introduction of extra timescales (Victor & Purpura, 1997). One can also study the insertion of two spikes, a time $T$ apart. For that case, one finds
$$ D^2(f, g) = 1 + e^{-T/t_c}, \tag{2.10} $$
that is, the cost of inserting two spikes close together is larger than the cost of inserting two distant spikes. Equivalently, when more than one spike is inserted, the distance increases with increasing $t_c$ (see Figure 2).
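The two-spike insertion cost can be checked by brute force: filter a train containing only the two inserted spikes and integrate the squared difference on a fine grid (equation 2.3; here $f$ is empty, so $f - g$ is just the two exponential tails). The grid step and spike times below are arbitrary choices; the result agrees with the closed form $1 + e^{-T/t_c}$ that follows from the pairwise overlap of the two tails:

```python
import math

tc = 10.0                      # time constant of the exponential kernel
T = 5.0                        # separation of the two inserted spikes
dt = 0.01                      # integration step
spikes = [50.0, 50.0 + T]

t_end = spikes[-1] + 20 * tc   # integrate until the tails have decayed
n = int(t_end / dt)
d2 = 0.0
for k in range(n):
    t = k * dt
    # filtered difference f - g: sum of exponential tails of the inserted spikes
    h = sum(math.exp(-(t - s) / tc) for s in spikes if t >= s)
    d2 += h * h * dt
d2 /= tc

print(d2, 1 + math.exp(-T / tc))   # numerically and analytically, both about 1.6
```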
Figure 2: Effect of the deletion and shifting of spikes on the distance, versus $t_c$. (Top) All spikes in the second spike train are shifted by 1, ..., 100 ms with respect to the first train. At large $t_c$, this shift does not contribute to the distance. The spike trains were Poisson trains with a duration of 10 s and a rate of 20 Hz. (Bottom) At random positions, 3, ..., 8 spikes are deleted from the second spike train (lower to upper curve). The distance is largest at large $t_c$.
2.3 Distance Between Uncorrelated Poisson Trains. Next we calculate the distance between two uncorrelated Poisson spike trains, both with a rate $r$. For small $t_c$, the distance measures the number of spikes (see equation 2.4). Assuming a total duration $T$, on average $rT$ spikes will be produced, so for small $t_c$, $D^2 = rT$. On the other hand, for large $t_c$, according to equation 2.5, the distance approaches $(M - N)^2/2$, where $M$ and $N$ again denote the number of spikes in the trains. The expectation value of $(M - N)^2$ for two Poisson processes is $2rT$; hence, for large $t_c$, one again has $D^2 = rT$. The average distance is thus identical at small and large $t_c$. Alternatively, one
Figure 3: Distance between two independent Poisson trains. The distance is on average constant across $t_c$ but shows large fluctuations at large $t_c$.
can use equation 2.6 and $C_{f-g,\,f-g} = C_{f,f} + C_{g,g} - 2C_{f,g} = 2rT\,\delta(t)$, which yields $D^2 = rT$ independent of $t_c$. Numerical simulations are shown in Figure 3. The distance shows large fluctuations for large $t_c$; the difference in the number of spikes fluctuates more strongly than the number of nonoverlapping spikes. The reason is that the difference can be interpreted as a variance, while the number of nonoverlapping spikes behaves more like a mean value. It is well known that the variance is a more variable quantity than the mean.

3 Distance Between Two Noise-Driven Spike Trains
Next we study how the distance varies in response to changes in the input of a spiking neuron. We use a leaky integrate-and-fire neuron with a time constant of 50 ms. The stimulus is a gaussian distributed variable with zero mean and a 10 ms temporal correlation. Its variance is adjusted to give an average spike frequency of 20 Hz. To compare spike trains, this stimulus is repeated across trials, and nonrepeating gaussian noise is added. The standard deviation of the noise, $\sigma_{\text{add}}$, was up to 10% of the standard deviation of the stimulus. In Figure 4 we plot the distance as $t_c$ is varied. As expected, the distance initially drops as spike time differences due to slightly displaced spikes are smoothed out. However, for large $t_c$, the distance increases again. This contrasts with the distance between two Poisson spike
Figure 4: Distance between a spike train and the spike train with noise added to the input. The different lines correspond to different amounts of noise: standard deviation 2%, 4%, 6%, 8%, and 10% of the stimulus standard deviation (lower to upper curve). The crosses indicate the limits $t_c \to 0$ and $t_c \to \infty$. Note the minimum of the distance at intermediate $t_c$.
trains, where the average distance was constant across $t_c$. For the integrate-and-fire neuron, noise will shift spikes as well as insert and delete them. The membrane time constant introduces temporal correlations, leading to a minimal distance when $t_c$ is roughly of the order of the integration time constant. Why the precise value is larger than the membrane time constant is not clear. The dependence of the distance on the amount of added noise varies for different values of $t_c$ (see Figure 5). Unfortunately, there is no obvious analytical approach to determine the distance as a function of the noise for the integrate-and-fire model; instead we rely on simulations. We found an approximate power law relation between distance and noise, with the exponent depending on $t_c$. For small $t_c$, it holds that $D^2 \propto \sqrt{\sigma_{\text{add}}}$, whereas for large $t_c$, one has $D^2 \propto \sigma_{\text{add}}^{2,\ldots,4}$. However, for a large region of intermediate $t_c$, roughly corresponding to the valley region in Figure 4, one has $D^2 \propto \sigma_{\text{add}}$.

3.1 Estimating Intrinsic Noise. The smooth dependence of the distance on the noise can be used to measure the intrinsic noise of a neuron. To this end, we assume that there is an intrinsic noise source in the neuron that is additive to the input (Tuckwell, 1988; Gerstein & Mandelbrot, 1964). By
Figure 5: Same data as Figure 4, but here the distance is plotted versus the noise added to the stimulus ($\sigma_{\text{add}}/\sigma_{\text{stim}}$). For different $t_c$, there are different power laws between the distance and the noise. The standard deviation of the noise is given as a fraction of the stimulus standard deviation. Because of the different y-scaling, the plots were split for different $t_c$ values: (top) $t_c = 0$, 1, 10; (middle) $t_c = 100$, 1000, 10,000 (linear regime); (bottom) $t_c = 10^5$, $10^6$, $\infty$.
mixing additional nonrepeating noise with the input and extrapolating the distance, the intrinsic noise can be estimated (see Figure 6). The standard deviation at the input of the spike generator is
$$ \sigma_{\text{total}} = \sqrt{\sigma_{\text{add}}^2 + \sigma_{\text{intrinsic}}^2}; $$
therefore, the distance is best considered as a function of $\sigma_{\text{add}}^2$. In practice, one plots $D^4$ versus $\sigma_{\text{add}}^2$ and fits a straight line. The intrinsic noise is reconstructed from the intercept of the line with the x-axis (see Figure 6B). This method is borrowed from electrical engineering and psychophysical studies. There one measures, for instance, the detection threshold of a visual stimulus as noise is added to the stimulus (Pelli, 1990; Lu & Dosher, 1999). This detection threshold is commonly proportional to the signal-to-noise ratio of the stimulus, which then allows determination of the intrinsic noise.
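The fitting-and-extrapolation step can be sketched as follows. The data here are synthetic, generated directly from the assumed linear regime $D^2 \propto \sigma_{\text{total}}$ (so $D^4$ is linear in $\sigma_{\text{add}}^2$); the proportionality constant and noise levels are made up, and a known intrinsic noise is recovered from the x-intercept:

```python
import numpy as np

sigma_int = 0.05                      # "true" intrinsic noise (fraction of stimulus sd)
c = 4.0e10                            # made-up proportionality constant of the linear regime
sigma_add = np.linspace(0.0, 0.1, 6)  # added-noise levels

# Linear regime: D^2 ~ sigma_total, hence D^4 = c * (sigma_add^2 + sigma_int^2).
d4 = c * (sigma_add**2 + sigma_int**2)

# Straight-line fit of D^4 versus sigma_add^2; the x-intercept is -sigma_int^2,
# so the intrinsic noise follows from intercept / slope.
slope, intercept = np.polyfit(sigma_add**2, d4, 1)
sigma_int_est = np.sqrt(intercept / slope)
print(sigma_int_est)   # recovers the intrinsic noise of 0.05
```

With noiseless synthetic data the recovery is exact; with measured distances, the fit quality (e.g., $\chi^2$) decides which $t_c$ to use, as described below.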
Figure 6: Using the distance to estimate the intrinsic noise of a cell. (A) Model of the cell. The intrinsic noise in the cell is assumed to be additive to the input of the spike generator. (B) The distance ($D^4$) is plotted versus the variance of the noise added to the stimulus ($\sigma_{\text{add}}^2$). By adding noise, the distance to the original response increases (circles). Note that at zero added noise, the distance is nonzero, because the cell will not reproduce the exact same spike train, due to the presence of the intrinsic noise. Extrapolation yields an estimate for the intrinsic noise, $\sigma_{\text{intrinsic}}^2$.
To test the approach, we simulated an integrate-and-fire neuron with an intrinsic noise source and tried to estimate the amount of intrinsic noise by extrapolation. Because the linear regression is not a good fit for all $t_c$ (see Figure 5), the quality of the fit was measured with $\chi^2$, and the result with minimal $\chi^2$ was selected. The method works well for small amounts of intrinsic noise. An intrinsic noise with a standard deviation of 1% (as
compared to the stimulus standard deviation) was estimated as 1.25% ($t_c = 600$ ms, shown in Figure 6B); an actual value of 5% was estimated as 6%, 10% was estimated as 12.6%, and 20% was estimated as 33%. Thus, the method slightly overestimates the intrinsic noise. At higher levels of added or intrinsic noise, the fitting deteriorates, as the linear relation between $\sigma^2$ and $D^4$ acquires higher-order corrections. Nevertheless, we obtain a reasonable estimate for the intrinsic noise, which is otherwise hard to access.

4 Conclusion
We have introduced a distance measure that computes the dissimilarity between two spike trains. To calculate the distance, filter both spike trains and calculate the integrated squared difference of the two filtered trains. The simplicity of the distance allows for an analytical treatment of simple cases. Numerical implementation is straightforward and fast. The distance interpolates between counting noncoincident spikes and the squared difference in total spike count. In order to compare spike trains with different rates, the total spike count can be used (large $t_c$). However, for spike trains with similar rates, the difference in total spike number is not useful, and coincidence detection is sensitive to noise. Instead, intermediate values of $t_c$, somewhat longer than the membrane time constant, are optimal. The distance uses a convolution with the exponential function. This has an interpretation in physiological terms. Indeed, among the distances discussed in section 1, ours does not seem the most unlikely one to be implemented physiologically. For short and medium $t_c$, the convolution can be interpreted as postsynaptic potentials in a higher-order neuron. For longer $t_c$, slower second-messenger or calcium-induced currents seem more appropriate. It would be interesting to see whether such detection schemes are implemented biologically. As an alternative measure, one could convolve the spikes with a square window. In that case, the situation becomes somewhat similar to binning followed by calculating the Euclidean distance between the number of spikes in the bins. But in standard binning, the bins are fixed on the time axis; therefore, two different spike trains yield identical binning patterns as long as their spikes fall in the same bins. With the proposed distance (convolving with either a square or an exponential window), this does not happen; the distance is zero only if the two spike trains are fully identical (assuming $t_c$ is finite).
The distance squared is related to the distance measure introduced by Victor and Purpura (1996, 1997). One difference with this work is that their displacement distance was linear in the time difference, although they did suggest the use of an exponential displacement distance (see equation 2.8). Another difference is that the distance introduced here is explicitly embedded in Euclidean space, which makes it less general but easier to analyze than the Victor and Purpura distance.
Interestingly, the distance is related to stimulus reconstruction techniques, where convolving the spike train with the spike-triggered average yields a first-order reconstruction of the stimulus (Rieke et al., 1996). Here the exponential corresponds roughly to the spike-triggered average, and the filtered spike trains correspond to the stimulus (the exponentials are now attached to the other side of the spikes, but that does not change the distance). The distance thus approximately measures the difference between the reconstructed stimuli (see equation 2.3). This might well explain the linearity of the measure for intermediate $t_c$. It also suggests that the distance measure might be refined by using the actual spike-triggered average instead of an exponential. A possible application of the distance measure is the insect olfactory system, which uses a timing-based code to distinguish odors (MacLeod et al., 1998). There it was shown, using the Victor and Purpura distance, that information was encoded in the temporal structure of the spike train but not in the mean rate. Another application is the measurement of the intrinsic noise of a neuron or network, which is possible because the distance varies smoothly as noise is added to the input. For sensory systems, stimulation with "natural" noisy stimuli has become quite common. Although natural stimuli are relevant for determining the system's response, the characterization of the noise in the system is less straightforward (Reich, Victor, Knight, Ozaki, & Kaplan, 1997). Measuring the intrinsic noise could be helpful here. We stress that the intrinsic noise is a quantity that is otherwise difficult to measure experimentally. For instance, measurements of subthreshold fluctuations in the membrane potential fail to detect noise in the spike generator itself. By adding noise to the stimulus, one can determine this intrinsic noise, which gives an effective description of the neuron's variability.

Acknowledgments
It is a pleasure to thank Alex Bäcker, Bob Cudmore, Kate MacLeod, Rob Smith, and Ken Sugino for discussions and Larry Abbott and Paul Tiesinga for helpful comments on the manuscript. I thank one of the referees for guiding me toward equation 2.6, indicating its application to Poisson trains, and pointing out the relation of the distance to stimulus reconstruction. Supported by the Sloan Foundation.

References

Geisler, W. S., Albrecht, D. G., Salvi, R. J., & Saunders, S. S. (1991). Discrimination performance of single neurons: Rate and temporal information. Journal of Neurophysiology, 66, 334–362.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Computation, 9, 1015–1046.
Lu, Z.-L., & Dosher, B. A. (1999). Characterizing human perceptual inefficiencies with equivalent internal noise. Journal of the Optical Society of America A, 16, 764–778.
MacLeod, K., Bäcker, A., & Laurent, G. (1998). Who reads temporal information contained across synchronized and oscillatory spike trains? Nature, 395, 693–698.
Pelli, D. G. (1990). The quantum efficiency of vision. In C. Blakemore (Ed.), Vision: Coding and efficiency (pp. 9–24). Cambridge: Cambridge University Press.
Reich, D. S., Victor, J. D., Knight, B. W., Ozaki, T., & Kaplan, E. (1997). Response variability and timing precision of neuronal spike trains in vivo. Journal of Neurophysiology, 77, 2836–2841.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1996). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Victor, J. D., & Purpura, K. P. (1996). Nature and precision of temporal coding in visual cortex: A metric-space analysis. Journal of Neurophysiology, 76, 1310–1326.
Victor, J. D., & Purpura, K. P. (1997). Metric-space analysis of spike trains: Theory, algorithms and application. Network: Comput. Neural Syst., 8, 127–164.

Received September 24, 1999; accepted July 3, 2000.
LETTER
Communicated by Bard Ermentrout
On Synchrony of Weakly Coupled Neurons at Low Firing Rate

L. Neltner
D. Hansel
Laboratoire de Neurophysique et Physiologie du Système Moteur, Université René Descartes, 75270 Paris Cedex 06, France

The dynamics of a pair of weakly interacting conductance-based neurons, firing at low frequency, $\nu$, is investigated in the framework of the phase-reduction method. The stability of the antiphase and the in-phase locked states is studied. It is found that for a large class of conductance-based models, the antiphase state is stable (resp., unstable) for excitatory (resp., inhibitory) interactions if the synaptic time constant is above a critical value $t_s^c$, which scales as $|\log \nu|$ when $\nu$ goes to zero.

1 Introduction
The ubiquity of coherent oscillations in the central nervous system (for a review, see Gray, 1994) raises fundamental biophysical issues concerning the role of intrinsic neuronal and synaptic characteristics in the emergence of coherent neuronal activity. These issues have been addressed in various theoretical studies. One way has been to investigate the stability of the asynchronous state of large networks of neurons (Abbott & van Vreeswijk, 1993; Neltner, Hansel, Mato, & Meunier, 2000; Golomb & Hansel, 2000; Hansel & Mato, 2000). This approach sheds light on the conditions under which weakly synchronized states can emerge in large networks but cannot tell us much about the conditions under which strongly synchronized states can appear. In another approach, followed for instance by Hansel, Mato, and Meunier (1993a, 1993b, 1995), van Vreeswijk, Abbott, and Ermentrout (1994), Gerstner, van Hemmen, and Cowan (1996), White, Chow, Ritt, Soto-Treviño, and Kopell (1998), and Bressloff and Coombes (1999), one studies the properties of specific patterns of synchrony, such as fully synchronized states or other phase-locked states. This approach gives insight into the conditions under which strongly synchronized states exist and are stable. One question that has been considered is: under which conditions are two reciprocally coupled identical oscillating neurons synchronized in phase? For weakly interacting oscillating neurons, Hansel et al. (1995) have shown that to answer this question, it is useful to classify neural oscillators according to the way they respond to infinitesimally small perturbations of the membrane potential. They have shown that the shape of the response function (Kuramoto, 1984),

Neural Computation 13, 765–774 (2001)
© 2001 Massachusetts Institute of Technology
which is nothing more than the phase-resetting curve for infinitesimally small perturbations, has a profound effect on the synchronization properties of the neuronal oscillators. For instance, neurons with a type I PRC response curve (i.e., strictly positive) cannot lock in phase when coupled with excitation, whereas they can when coupled with inhibition. By contrast, type II PRC neurons, characterized by a negative regime following the firing of a spike, can lock in phase, provided that the excitation is sufficiently fast. Integrate-and-fire neurons are type I PRC. The Hodgkin-Huxley model is type II PRC. Experimental results of Reyes and Fetz (1993) provide an example of cortical neurons with nearly type I PRC properties. Another useful classification of neuronal dynamics has been proposed by Rinzel and Ermentrout (1989). It describes the way neurons go from rest to periodic firing as an external current, $I$, is injected. There are two main types of behavior. In the first type, oscillations appear with arbitrarily low frequency at the threshold current, $I_c$, whereas in the second type, the onset of repetitive firing is at nonzero frequency. Neurons of the first (resp., second) type are said to display type I (resp., II) excitability (E). The Connor-Stevens model (Connor, Walters, & McKown, 1977) and the model developed by Wang and Buzsáki (1996), which display a saddle-node bifurcation at the current threshold, are examples of type I E behavior. Ermentrout (1996) has studied the possible relationship between the type I PRC and type I E classes of neurons. He has proved that for a sufficiently small firing frequency, to lowest order in $\epsilon = I - I_c$, the response function of type I E neurons is
$$ Z(\varphi) = A\,(1 - \cos(\varphi)), \tag{1.1} $$
where $A$ is a constant. Here the phase origin is set so that the emission of a spike corresponds to the phase value $\varphi = 0$. Therefore, for these neurons, $Z(\varphi)$ is always positive, that is, type I PRC, at the bifurcation point (see also Izhikevich, 1999). In this article, we investigate the locking of a pair of weakly synaptically coupled type I E neurons near the bifurcation, taking into account the first correction term in $Z(\varphi)$. This allows us to clarify how type I E neurons synchronize at low firing rates.

2 Results
We consider a pair of reciprocally coupled identical type I E neuronal oscillators. In the limit of weak synaptic interactions, the state of each neuron at time $t$ can be characterized by a phase variable, $0 \le \varphi_i < 2\pi$ ($i = 1, 2$), which follows the dynamics (Kuramoto, 1984; Ermentrout & Kopell, 1984; Hoppensteadt & Izhikevich, 1997):
$$ \frac{d\varphi_1}{dt} = \omega + g\,\Gamma(\varphi_1 - \varphi_2) \tag{2.1} $$
$$ \frac{d\varphi_2}{dt} = \omega + g\,\Gamma(\varphi_2 - \varphi_1). \tag{2.2} $$
Here, $g$ measures the strength of the interaction, $\omega = 2\pi\nu$, where $\nu$ is the intrinsic firing frequency of the neurons when decoupled ($g = 0$), and $\Gamma(\varphi)$ is an effective interaction that can be computed if one knows the response function of the neurons, $Z(\varphi)$ (Kuramoto, 1984; Hansel et al., 1993a), and the synaptic current, $I_s(\varphi)$:
$$ \Gamma(\varphi) = \frac{1}{2\pi} \int_0^{2\pi} Z(\varphi')\, I_s(\varphi' - \varphi)\, d\varphi'. \tag{2.3} $$
The odd part of $\Gamma(\varphi)$ will be denoted $\Gamma_-(\varphi) = (\Gamma(\varphi) - \Gamma(-\varphi))/2$. At large time, the neurons become phase-locked with a phase shift $\delta = \varphi_2 - \varphi_1$ that is a solution of
$$ \Gamma_-(\delta) = 0 \tag{2.4} $$
and that satisfies the stability condition
$$ g\,\Gamma_-'(\delta) < 0. \tag{2.5} $$
We assume that near the bifurcation ($\epsilon > 0$ and small), $Z(\varphi)$ can be written in the form
$$ Z(\varphi) = A(1 - \cos(\varphi)) + \epsilon^a\, |\log \epsilon|^b\, f(\varphi) + \text{higher-order terms}, \tag{2.6} $$
where $A > 0$, $a$ and $b$ are positive exponents, and $f(\varphi)$ is a model-dependent function. Below we give numerical estimates for the exponents $a$ and $b$, and we conjecture that they are universal. For simplicity, we assume that the synapses can be described by an $\alpha$-function (Rall, 1967) with synaptic time constant $t_s$. This implies that $I_s \equiv I_s(\varphi/(\omega t_s))$. Using equations 1.1 and 2.6, a straightforward calculation shows that, at leading orders,
$$ \Gamma_-(\varphi) \propto \omega^2 t_s A \sin(\varphi) + \omega\, \epsilon^a |\log \epsilon|^b f_-(\varphi). \tag{2.7} $$
Here $f_-$ stands for the odd part of the function $f$. One easily sees that equation 2.4 always has at least two solutions: the in-phase state, $\delta = 0$, and the antiphase state, $\delta = \pi$. States with intermediate phase shifts can also exist, depending on the specific form of $f_-(\varphi)$.

2.1 Stability of the Antiphase State. The stability condition for the antiphase state is

\[ g \left[ -\omega \tau_s A + \epsilon^{a} |\log \epsilon|^{b} f_-'(\pi) \right] < 0. \tag{2.8} \]
L. Neltner and D. Hansel
Close to a saddle-node bifurcation, $0 < \epsilon \ll 1$, the firing frequency of a neuron, $\omega$, behaves like

\[ \omega = \bar{\omega}\, \epsilon^{1/2}, \tag{2.9} \]
where the prefactor $\bar{\omega}$ can be computed from the reduction of the neuronal dynamics to its normal form (Ermentrout & Kopell, 1986). Therefore, the condition of stability of the antiphase solution is

\[ g \left[ -\bar{\omega} \tau_s A + \epsilon^{a - 1/2} |\log \epsilon|^{b} f_-'(\pi) \right] < 0. \tag{2.10} \]
If $a > 1/2$, the second term vanishes as $\epsilon \to 0$. In that case, sufficiently close to the bifurcation, independent of the values of $f_-'(\pi)$ and of $\tau_s$, the antiphase state is stable (resp. unstable) for excitatory (resp. inhibitory) coupling. If $a \le 1/2$, the second term diverges as $\epsilon \to 0$. For $f_-'(\pi) \le 0$, the stability properties of the antiphase state are identical to the case $a > 1/2$. By contrast, for $f_-'(\pi) > 0$, the stability of the antiphase state depends on $\tau_s$. At large $\tau_s$, the system behaves as for $a > 1/2$. However, for sufficiently small $\tau_s$, the antiphase state is unstable (resp. stable) for excitatory (resp. inhibitory) coupling. The change of stability of the antiphase state occurs at a critical value of the synaptic time constant given by

\[ \tau_s^c = \epsilon^{a - 1/2} |\log \epsilon|^{b}\, \frac{f_-'(\pi)}{A \bar{\omega}}. \tag{2.11} \]
2.2 Stability of the In-Phase State. For $a > 1/2$, the in-phase state is always unstable (resp. stable) for excitatory (resp. inhibitory) coupling. This is also true for $a \le 1/2$ if $f_-'(0) \ge 0$. By contrast, if $f_-'(0) < 0$, the in-phase state becomes stable for $\tau_s < \epsilon^{a - 1/2} |\log \epsilon|^{b}\, |f_-'(0)| / (A \bar{\omega})$. Similar results are found for more general synaptic kinetics, with a rise time different from the decay time.

2.3 Numerical Results. We have estimated numerically the values of the exponents $a$ and $b$ for several models of type I_E excitability. Here we show the results for the Wang-Buzsáki (W-B) model (Wang & Buzsáki, 1996) and the Morris-Lecar (M-L) model (Morris & Lecar, 1981). The detailed equations of these models are given in the appendix. We have computed the response function of these models using the software XPPAUT (Ermentrout, 1999) for different values of the bias current, close to the bifurcation. The result was expressed as a Fourier expansion:

\[ Z(\varphi) = A(1 - \cos\varphi) + \sum_{n=1}^{+\infty} \left[ b_n \cos(n\varphi) + a_n \sin(n\varphi) \right]. \tag{2.12} \]
Figure 1: Fourier coefficients (log-log scale) as a function of $\epsilon = I - I_c$. Plus signs, solid line: $a_1$ for W-B; empty circles, solid line: $a_2$ for W-B; empty squares, solid line: $a_3$ for W-B; plus signs, dotted line: $a_1$ for M-L; empty circles, dotted line: $a_2$ for M-L; empty squares, dotted line: $a_3$ for M-L.
In Figure 1, the Fourier coefficients $a_1$, $a_2$, $a_3$ are plotted versus $\epsilon$ for the two models. The determination of the limit cycle, and of how it is perturbed by small depolarizations, is difficult and requires time-consuming numerical calculations. For instance, the lowest-frequency point for the W-B model, corresponding to $\epsilon = 4 \times 10^{-6}$, required more than 6 days of CPU time on a 500 MHz Alpha DEC station. The values of the exponent $a$ derived from these curves are summarized in Table 1. On average, we have found $a = 0.43$ for the W-B model and $a = 0.50$ for the M-L model. Several possible sources of numerical error could explain the large dispersion around the average value of $a$ and the substantial difference between the average values of $a$ in the two models. However, these numerical results led us to guess that for the two models, $a = 1/2$. They do not allow us to say anything about the exponent $b$. We looked for another estimate of the exponents $a$ and $b$ by studying how the stability of the antiphase locked states of a pair of inhibitory neurons depends on the synaptic time constant. For each value of the bias current, the synaptic decay time, $\tau_2$, was varied, keeping the synaptic rise time fixed
Table 1: Value of the Exponent $a$ Deduced from the Scaling of the Fourier Coefficients $a_1$, $a_2$, $a_3$, W-B and M-L Models.

Fourier Coefficient   W-B    M-L
$a_1$                 0.37   0.41
$a_2$                 0.47   0.54
$a_3$                 0.46   0.54
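Exponent estimates of this kind come from the slope of $\log a_n$ versus $\log \epsilon$. A sketch of that fitting step, run on synthetic illustrative data standing in for the XPPAUT output (the generated coefficients, prefactor, and noise level are assumptions, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a Fourier coefficient scaling like a1 ~ C * eps^0.5,
# with small multiplicative noise mimicking numerical error.
eps = np.logspace(-6, -1, 11)
a1 = 0.8 * eps**0.5 * np.exp(rng.normal(0.0, 0.02, eps.size))

# The exponent a is the slope of log(a1) against log(eps).
slope, intercept = np.polyfit(np.log(eps), np.log(a1), 1)
print(f"estimated exponent a ~ {slope:.3f}")   # close to the true value 0.5
```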
Figure 2: Critical synaptic time $\tau_2$ (ms) as a function of $\epsilon = I - I_c$, on a semilog scale. Plus signs, solid line: W-B model; plus signs, dotted line: M-L model.
($\tau_1 = 1$ ms). The value, $\tau_2^c$, of $\tau_2$ at which the antiphase locked state loses stability was estimated according to condition 2.5. The results are presented in Figure 2 on a semilog scale. They show a clear logarithmic dependence of $\tau_2^c$ over three decades of $\epsilon$ for both models. Together with equation 2.11, this suggests that $a = 1/2$ and $b = 1$. Therefore, we conjecture that near a saddle-node bifurcation, the response function is

\[ Z(\varphi) = A(1 - \cos\varphi) + \epsilon^{1/2} |\log \epsilon|\, f(\varphi) + \text{higher-order terms}, \tag{2.13} \]
and the critical synaptic time above which the antiphase state loses stability for inhibitory interactions scales as

\[ \tau_s^c \propto |\log \nu|. \tag{2.14} \]
3 Conclusion
This article further clarifies the correspondence between the type I_E and type I PRC properties of neurons, as suggested by Ermentrout (1996). The central result is the conjecture, equation 2.13, and its consequences for the synchrony properties of type I_E neurons in the vicinity of the onset of periodic firing. We did not prove this conjecture. However, we have provided numerical evidence that supports it for two models of type I_E excitable membrane. We have also checked this conjecture for other models, such as the model recently introduced by Shriki, Hansel, and Sompolinsky (1999), which incorporates a sodium current, a delayed rectifier potassium current, and an A-current (results not shown). Therefore, we believe that the scaling in $\epsilon$ of the second term in equation 2.13 is universal. It should be noted that in the limit $\epsilon \to 0$, the numerical estimates of the $n = 2$ and $n = 3$ coefficients are smaller than the $n = 1$ coefficient by one order of magnitude. Since we have no way to evaluate the numerical error bars for these estimates, one cannot exclude that, at leading order in $\epsilon$, only the mode $n = 1$ contributes.

If $f_-'(\pi) > 0$, as is the case for the W-B model, for instance, equation 2.13 implies that when $\tau_s$ varies, the antiphase locked state changes stability at $\tau_s^c \propto |\log \epsilon|$. Above this critical synaptic time, the antiphase state is stable (resp. unstable) for excitation (resp. inhibition); below it, the antiphase state is unstable (resp. stable) for excitation (resp. inhibition), and a state with an intermediate phase shift appears, stable for excitation and unstable for inhibition. The condition $f_-'(\pi) > 0$ is equivalent to $f'(\pi) > 0$ and thus means that the maximum of the response function lies in the second half of the period. We are convinced that this condition is general for any type I PRC neuron. Indeed, at higher frequency, the peak of the response function is always in the second half of the period, and we believe that it approaches the half-period from above. Consequently, we believe that the change of stability of the antiphase state as the synaptic time grows is general for any type I PRC neuron. This behavior is similar to what is found for the leaky integrate-and-fire (IF) model (van Vreeswijk et al., 1994; Hansel et al., 1995). However, for the IF model, it is easy to prove that the critical synaptic time scales like $\tau_s^c \propto 1/\nu^2$. Therefore, when the frequency $\nu \to 0$, $\tau_s^c$ diverges more rapidly for the IF model than for type I_E models. We thus expect that, for fixed synaptic times, the stable antiphase state of two inhibitory coupled neurons is more stable at larger firing rates in the IF case than it is for type I_E models. This difference is a consequence of the atypical behavior of the leaky IF model near threshold. This is in agreement with recent work that
has demonstrated the role of active neuronal properties in neural synchrony (Golomb & Hansel, 2000; Hansel & Mato, 2000).

Appendix: The Models

A.1 Morris-Lecar Model. In the Morris-Lecar model we have used, the potential evolves according to

\[ C \frac{dV}{dt} = I_\ell + I_{Ca} + I_K + I_s(t) + I_0. \tag{A.1} \]

The capacitance is $C = 20\ \mu\mathrm{F/cm^2}$. The leak current is given by $I_\ell = g_\ell (V_\ell - V)$, where $g_\ell = 2\ \mathrm{mS/cm^2}$ and $V_\ell = -60$ mV. The calcium current is $I_{Ca} = g_{Ca}\, m_\infty (V_{Ca} - V)$, where $g_{Ca} = 4\ \mathrm{mS/cm^2}$, $V_{Ca} = 120$ mV, and $m_\infty = 0.5\,(1 + \tanh((V + 1.2)/18))$. The potassium current is $I_K = g_K\, w (V_K - V)$, where $g_K = 8\ \mathrm{mS/cm^2}$ and $V_K = -84$ mV. The activation variable $w$ evolves according to $dw/dt = \lambda(V)(w_\infty(V) - w)$, with $w_\infty(V) = 0.5\,(1 + \tanh((V - 12)/17.4))$ and $\lambda(V) = 0.067 \cosh((V - 12)/34.8)$. The bias current is $I_0$. When the membrane potential reaches its upper value, a spike occurs. The synaptic current induced by a given presynaptic neuron in a postsynaptic neuron is modeled as

\[ I_{syn}(t) = g \sum_{\text{spikes}} f(t - t_{\text{spike}}), \]

where the summation runs over all the spikes emitted prior to time $t$ by the presynaptic neuron. Each synaptic activation then induces a single event, described by the function

\[ f(t) = \frac{1}{\tau_1 - \tau_2} \left( e^{-t/\tau_1} - e^{-t/\tau_2} \right). \]
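A minimal sketch of integrating these Morris-Lecar equations by forward Euler, with no synaptic input. The bias current value and step size are illustrative choices (the bias is assumed to lie above the firing onset for this parameter set), not values taken from the paper:

```python
import numpy as np

# Morris-Lecar parameters as listed above (muF/cm^2, mS/cm^2, mV, ms)
C, g_l, V_l = 20.0, 2.0, -60.0
g_ca, V_ca = 4.0, 120.0
g_k, V_k = 8.0, -84.0
I0 = 50.0   # assumed bias current, illustrative

def m_inf(V):
    return 0.5 * (1.0 + np.tanh((V + 1.2) / 18.0))

def w_inf(V):
    return 0.5 * (1.0 + np.tanh((V - 12.0) / 17.4))

def lam(V):
    return 0.067 * np.cosh((V - 12.0) / 34.8)

def simulate(T=1000.0, dt=0.01, V0=-60.0, w0=0.0):
    """Forward-Euler integration of equation A.1 (no synaptic current)."""
    n = int(T / dt)
    V, w = V0, w0
    Vs = np.empty(n)
    for i in range(n):
        I_leak = g_l * (V_l - V)
        I_Ca = g_ca * m_inf(V) * (V_ca - V)
        I_K = g_k * w * (V_k - V)
        V += dt * (I_leak + I_Ca + I_K + I0) / C
        w += dt * lam(V) * (w_inf(V) - w)
        Vs[i] = V
    return Vs

Vs = simulate()
```

The voltage trace stays bounded between the potassium and calcium reversal potentials; spike times can be read off as upward threshold crossings.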
A.2 Wang-Buzsáki Model. In the W-B model, the membrane potential evolves according to

\[ C \frac{dV}{dt} = I_{Na}(t) + I_K(t) + I_\ell(t) + I_s(t) + I. \]

The capacitance is $C = 1\ \mu\mathrm{F/cm^2}$. The leak current is given by $I_\ell = g_\ell (V_\ell - V)$, where $g_\ell = 0.1\ \mathrm{mS/cm^2}$ and $V_\ell = -65$ mV. The transient sodium current is given by $I_{Na} = g_{Na}\, m_\infty^3 h (V_{Na} - V)$, where $g_{Na} = 35\ \mathrm{mS/cm^2}$ and $V_{Na} = 55$ mV. The activation variable $m$ adjusts instantaneously to its steady-state value $m_\infty = \alpha_m / (\alpha_m + \beta_m)$, where $\alpha_m(V) = -0.1(V + 35) / (\exp(-0.1(V + 35)) - 1)$ and $\beta_m(V) = 4 \exp(-(V + 60)/18)$. The inactivation variable $h$ evolves according to $dh/dt = \alpha_h (1 - h) - \beta_h h$, where
$\alpha_h(V) = 0.35 \exp(-(V + 58)/20)$ and $\beta_h(V) = 5 / (\exp(-0.1(V + 28)) + 1)$. The delayed rectifier current is given by $I_K = g_K\, n^4 (V_K - V)$, where $g_K = 9\ \mathrm{mS/cm^2}$ and $V_K = -90$ mV. The activation variable $n$ evolves according to $dn/dt = \alpha_n (1 - n) - \beta_n n$, with $\alpha_n(V) = -0.05(V + 34) / (\exp(-0.1(V + 34)) - 1)$ and $\beta_n(V) = 0.625 \exp(-(V + 44)/80)$. The synaptic current is modeled as for the Morris-Lecar model.

Acknowledgments
We thank C. van Vreeswijk for a careful reading of the manuscript. We acknowledge the hospitality of the Marine Biological Laboratory, Woods Hole, Massachusetts, where this work was initiated as a research project by L. N. for the course "Methods in Computational Neuroscience" in August 1997. D. H. was supported in part by the PICS-CNRS 236.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev., E48, 1483–1490.
Bressloff, P. C., & Coombes, S. (1999). Dynamics of strongly coupled spiking neurons. Neural Comput., 12, 43–89.
Connor, J. A., Walter, D., & McKown, R. (1977). Neural repetitive firing: Modification of the Hodgkin-Huxley axon suggested by experimental results from crustacean axons. Biophys. J., 18, 81–102.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comp., 8, 979–1001.
Ermentrout, G. B. (1999). XPPAUT 3.99—The Dynamical Systems Tool. Available online at: ftp://ftp.math.pitt.edu/pub/bardware.
Ermentrout, G. B., & Kopell, N. (1984). Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Bio., 29, 195–217.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with slow excitation. SIAM J. Appl. Math., 46, 233–253.
Gerstner, W., van Hemmen, J. L., & Cowan, J. (1996). What matters in neuronal locking. Neural Comput., 8, 1653–1676.
Golomb, D., & Hansel, D. (2000). The number of synaptic inputs and the synchrony of large sparse neuronal networks. Neural Comp., 12, 1093–1139.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems, mechanisms and functions. J. Comp. Neuro., 1, 11–38.
Hansel, D., & Mato, G. (2000). Existence and stability of persistent states in large neuronal networks. Submitted.
Hansel, D., Mato, G., & Meunier, C. (1993a). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett., 23, 367–372.
Hansel, D., Mato, G., & Meunier, C. (1993b). Phase reduction and neural modeling. Concepts in Neuroscience, 4, 192–210.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Izhikevich, E. M. (1999). Class 1 neural excitability, conventional synapses, weakly connected networks, and mathematical foundations of pulse-coupled models. IEEE Transactions on Neural Networks, 10, 499–507.
Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: Springer-Verlag.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Neltner, L., Hansel, D., Mato, G., & Meunier, C. (2000). Synchrony in heterogeneous networks of spiking neurons. Neural Comp., 12, 1607–1641.
Rall, W. (1967). Distinguishing theoretical synaptic potentials computed for different soma-dendritic distributions of synaptic input. J. Neurophysiol., 30, 1138–1168.
Reyes, A. D., & Fetz, E. E. (1993). Two modes of interspike interval shortening by brief transient depolarizations in cat neocortical neurons. J. Neurophysiol., 69, 1661–1672.
Rinzel, J., & Ermentrout, G. B. (1989). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 135–169). Cambridge, MA: MIT Press.
Shriki, O., Hansel, D., & Sompolinsky, H. (1999). Rate models for conductance-based cortical neuronal networks. Submitted.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
White, J. A., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronous oscillations in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.

Received July 10, 2000; accepted September 12, 2000.
LETTER
Communicated by Peter Latham
Population Coding with Correlation and an Unfaithful Model

Si Wu*
Hiroyuki Nakahara
Shun-ichi Amari
RIKEN Brain Science Institute, Hirosawa 2-1, Wako-shi, Saitama, Japan

This study investigates a population decoding paradigm in which maximum likelihood inference is based on an unfaithful decoding model (UMLI). This is usually the case for neural population decoding, because the encoding process of the brain is not exactly known or because a simplified decoding model is preferred to save computational cost. We consider an unfaithful decoding model that neglects the pair-wise correlation between neuronal activities and prove that UMLI is asymptotically efficient when the neuronal correlation is uniform or of limited range. The performance of UMLI is compared with that of maximum likelihood inference based on the faithful model and with that of the center-of-mass decoding method. It turns out that UMLI has the advantages of remarkably decreasing the computational complexity while maintaining a high level of decoding accuracy. Moreover, it can be implemented by a biologically feasible recurrent network (Pouget, Zhang, Deneve, & Latham, 1998). The effect of correlation on the decoding accuracy is also discussed.

1 Introduction
Population coding, a method to encode and decode stimuli in a distributed way using the joint activities of a number of neurons, has the advantage of suppressing the fluctuation in a single neuron's signal (Georgopoulos, Schwartz, & Kettner, 1986; Paradiso, 1988; Seung & Sompolinsky, 1993). Recently there has been expanded interest in understanding population decoding methods, which include in particular the maximum likelihood inference (MLI), the center of mass (COM), the complex estimator (CE), and the optimal linear estimator (OLE) (see Pouget, Zhang, Deneve, & Latham, 1998; Salinas & Abbott, 1994). Among them, MLI has the advantage of a small decoding error (asymptotic efficiency) but may suffer from computational complexity, especially when the decoding model is complex.

Let us consider a population of $N$ neurons coding a variable $x$. The encoding process of the population code is described by a conditional probability $q(\mathbf{r}|x)$ (Anderson, 1994; Zemel, Dayan, & Pouget, 1998), where the components of the vector $\mathbf{r} = \{r_i\}$, for $i = 1, \ldots, N$, are the firing rates of the neurons. We study the MLI estimator given by the value of $x$ that maximizes the log-likelihood $\ln p(\mathbf{r}|x)$, where $p(\mathbf{r}|x)$ is the decoding model, which might be different from the encoding model, $q(\mathbf{r}|x)$. Thus far, studies of MLI in population codes have normally (or implicitly) assumed that $p(\mathbf{r}|x)$ equals the encoding model $q(\mathbf{r}|x)$. This requires that the estimator have full knowledge of the encoding process. Taking into account the complexity of information processing in the brain, it is more natural to assume $p(\mathbf{r}|x) \neq q(\mathbf{r}|x)$. Another reason for this choice is to save computational cost. Therefore, a decoding paradigm in which the assumed decoding model is different from the encoding one needs to be studied. In the context of statistical theory, this is called estimation based on an unfaithful or misspecified model. This study is an extension of Wu, Nakahara, Murata, and Amari (2000). Hereafter, we refer to the decoding paradigm of using MLI based on an unfaithful model as UMLI, to distinguish it from MLI based on the faithful model, which is called FMLI.

Cross-correlation in neuronal activities is observed in both primary sensory and motor areas (Fetz, Yoyama, & Smith, 1991; Gawne & Richmond, 1993; Zohary, Shadlen, & Newsome, 1994; Lee, Port, Kruse, & Georgopoulos, 1998), where population coding is believed to be used. The neural correlation is often complex and may have various sources. For example, it can be inherited from the stimulus noise, generated by the multilayer structure of the information transformation process, or due to a feedback control procedure. It is very hard, if possible at all, for an estimator to store and utilize all this information. This leads us to investigate an unfaithful decoding model that does not use the correlation information.

* Present address: Dept. of Computer Science, Sheffield University, U.K.

Neural Computation 13, 775–797 (2001) © 2001 Massachusetts Institute of Technology
This is different from assuming an uncorrelated encoding model. In fact, it has been shown that assuming an uncorrelated encoding model for the cortex leads to pathological conclusions (Pouget, Deneve, Ducom, & Latham, 1999). By treating neural signals as if they were uncorrelated in decoding, UMLI decreases the computational cost of FMLI remarkably. More attractively, it maintains a high level of decoding accuracy, better than that of COM and comparable to that of FMLI in many cases. Thus, UMLI can be a good compromise between decoding accuracy and computational complexity. Recently, Pouget et al. (1998) and Deneve, Latham, and Pouget (1999) proposed a biologically feasible recurrent network model (RNM), which performs decoding in a coarse code format. We note that for the correlation structures in this study, the decoding of UMLI can be achieved by RNM.

An important issue discussed in this article concerns the asymptotic efficiency of MLI. It is known that when neuronal activities are uncorrelated, MLI is asymptotically efficient in the sense that it asymptotically achieves the optimal decoding accuracy, that is, the Cramér-Rao bound (Paradiso, 1988; Seung & Sompolinsky, 1993). For correlated neuronal signals, MLI
may not be asymptotically efficient, depending on the correlation structure. We study this problem and prove the asymptotic efficiency of FMLI and the quasi-asymptotic efficiency of UMLI for the uniform and limited-range correlations. As a by-product of the calculation, we also illustrate the effect of correlation on the decoding accuracy of a population code. It shows that the correlation, depending on its form, can either improve or degrade the decoding accuracy, in agreement with the general belief (see Snippe & Koenderink, 1992; Shadlen & Newsome, 1994; Yoon & Sompolinsky, 1999; Abbott & Dayan, 1999).

In section 2.1, an unfaithful decoding model is introduced, which neglects the pair-wise correlation between neuronal activities. Two different kinds of correlation structures are considered. In section 2.2, we calculate the decoding errors of UMLI and FMLI and investigate their asymptotic efficiency for the models we consider. In section 2.3, the implementation of UMLI by RNM is discussed. In section 3, the performances of UMLI, FMLI, and COM are compared for two tuning functions. The effect of correlation on the decoding accuracy is also discussed. In section 4, some overall conclusions and discussions are given.

2 The Population Decoding Paradigm of UMLI

2.1 An Unfaithful Decoding Model Neglecting the Neuronal Correlation. Let us consider a pair-wise correlated neural response model in
which the neuron activities are assumed to be multivariate gaussian,

\[ q(\mathbf{r}|x) = \frac{1}{\sqrt{(2\pi\sigma^2)^N \det(\mathbf{A})}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i,j} A_{ij}^{-1} (r_i - f_i(x))(r_j - f_j(x)) \right], \tag{2.1} \]
where $f_i(x)$ is the mean value of the response of the $i$th neuron, representing its tuning function; that is,

\[ \langle r_i \rangle = f_i(x), \tag{2.2} \]

where $\langle \cdot \rangle$ denotes averaging over many trials and $N$ is the number of neurons. The tuning function $f_i(x)$ usually has a radially symmetric shape,

\[ f_i(x) = \phi[(x - c_i)^2], \tag{2.3} \]

where $c_i$ is the preferred stimulus of the $i$th neuron. For example, the function $\phi(x)$ can be of gaussian form.
The matrix $\mathbf{A} = \{A_{ij}\}$ is the covariance matrix, defined by

\[ \langle (r_i - f_i(x))(r_j - f_j(x)) \rangle = \sigma^2 A_{ij}, \tag{2.4} \]

and $\mathbf{A}^{-1}$ is its inverse. Two correlation structures are considered here. One is the uniform correlation model (Johnson, 1980; Abbott & Dayan, 1999), with covariance matrix

\[ A_{ij} = \delta_{ij} + c(1 - \delta_{ij}), \tag{2.5} \]

where the parameter $c$ (with $-1 < c < 1$) determines the strength of correlation. The inverse of $\mathbf{A}$ is

\[ A_{ij}^{-1} = \frac{\delta_{ij}(Nc + 1 - c) - c}{(1 - c)(Nc + 1 - c)}. \tag{2.6} \]

The other correlation structure is of limited range (Johnson, 1980; Snippe & Koenderink, 1992; Abbott & Dayan, 1999), with covariance matrix

\[ A_{ij} = b^{|i-j|}, \tag{2.7} \]

where the parameter $b$ (with $0 < b < 1$) determines the range of correlation. This model captures the fact that the correlation strength between neurons decreases with dissimilarity in their preferred stimuli, a property often observed in cortical areas. The limited-range correlation is translation invariant in the sense that $A_{ij} = A_{kl}$ if $|i - j| = |k - l|$. The inverse of $\mathbf{A}$ is

\[ A_{ij}^{-1} = \frac{1 + b^2}{1 - b^2} \left[ \delta_{ij} - \frac{b}{1 + b^2} (\delta_{i+1,j} + \delta_{i-1,j}) \right]. \tag{2.8} \]
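A quick numerical sanity check of equation 2.8 (a sketch; the tridiagonal formula matches the true inverse on the interior of the matrix, while the two corner diagonal entries differ slightly, a boundary effect that is negligible at large $N$):

```python
import numpy as np

N, b = 50, 0.6
idx = np.arange(N)
A = b ** np.abs(idx[:, None] - idx[None, :])   # limited-range covariance (equation 2.7)

# Tridiagonal inverse from equation 2.8.
A_inv = ((1 + b**2) / (1 - b**2)) * (
    np.eye(N) - (b / (1 + b**2)) * (np.eye(N, k=1) + np.eye(N, k=-1))
)

P = A @ A_inv
# Interior columns reproduce the identity; only the first and last
# columns pick up a small boundary correction.
print(np.abs(P[:, 1:-1] - np.eye(N)[:, 1:-1]).max())
```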
There are many possible ways to choose an unfaithful decoding model, depending on the trade-off between computational complexity and decoding accuracy. For the above correlation structures, a natural choice is

\[ p(\mathbf{r}|x) = \frac{1}{\sqrt{(2\pi\sigma^2)^N}} \exp\left[ -\frac{1}{2\sigma^2} \sum_i (r_i - f_i(x))^2 \right], \tag{2.9} \]
which neglects the neural correlation in the encoding process but keeps the tuning functions unchanged.

2.2 The Decoding Error of UMLI and FMLI. The decoding error of UMLI has been studied in statistical theory (Akahira & Takeuchi, 1981;
Murata, Yoshizawa, & Amari, 1994). Here we generalize it to population coding. The decoding error of FMLI can be obtained similarly.

For convenience, some notation is introduced: $\nabla f(\mathbf{r}, x)$ denotes $df(\mathbf{r}, x)/dx$; $E_q[f(\mathbf{r}, x)]$ and $V_q[f(\mathbf{r}, x)]$ denote, respectively, the mean value and the variance of $f(\mathbf{r}, x)$ with respect to the distribution $q(\mathbf{r}|x)$. Given an observation of the population activity $\mathbf{r}^*$, the UMLI estimate $\hat{x}$ is the value of $x$ that maximizes the log-likelihood $L_p(\mathbf{r}^*, x) = \ln p(\mathbf{r}^*|x)$. Denote by $x_{opt}$ the value of $x$ satisfying $E_q[\nabla L_p(\mathbf{r}, x_{opt})] = 0$. For the faithful model, where $p = q$, $x_{opt} = x$. Hence, $(x_{opt} - x)$ is the error due to the unfaithful setting, whereas $(\hat{x} - x_{opt})$ is the error due to sampling fluctuations. For the unfaithful model, equation 2.9, the condition $E_q[\nabla L_p(\mathbf{r}, x_{opt})] = 0$ gives

\[ \sum_i [f_i(x) - f_i(x_{opt})] f_i'(x_{opt}) = 0. \tag{2.10} \]

Hence, $x_{opt} = x$, and UMLI gives an unbiased estimator in this case. Let us consider the expansion of $\nabla L_p(\mathbf{r}^*, \hat{x})$ at $x$:
\[ \nabla L_p(\mathbf{r}^*, \hat{x}) \simeq \nabla L_p(\mathbf{r}^*, x) + \nabla\nabla L_p(\mathbf{r}^*, x)(\hat{x} - x). \tag{2.11} \]

Since $\nabla L_p(\mathbf{r}^*, \hat{x}) = 0$,

\[ \frac{1}{N} \nabla\nabla L_p(\mathbf{r}^*, x)(\hat{x} - x) \simeq -\frac{1}{N} \nabla L_p(\mathbf{r}^*, x), \tag{2.12} \]

where $N$ is the number of neurons. Only the large-$N$ limit is considered in this study. Let us analyze the properties of the two random variables $\frac{1}{N}\nabla\nabla L_p(\mathbf{r}^*, x)$ and $\frac{1}{N}\nabla L_p(\mathbf{r}^*, x)$. Using equation 2.9, we get

\[ \frac{1}{N} \nabla L_p(\mathbf{r}^*, x) = \frac{1}{N\sigma^2} \sum_i [r_i^* - f_i(x)] f_i'(x), \tag{2.13} \]

\[ \frac{1}{N} \nabla\nabla L_p(\mathbf{r}^*, x) = \frac{1}{N\sigma^2} \sum_i [r_i^* - f_i(x)] f_i''(x) - \frac{1}{N\sigma^2} \sum_i f_i'(x)^2, \tag{2.14} \]
where $f_i'(x) = df_i(x)/dx$ and $f_i''(x) = d^2 f_i(x)/dx^2$.

2.2.1 The Uniform Correlation Case. We consider first the uniform correlation model. In this case, we can write

\[ r_i^* = f_i(x) + \sigma(\xi_i + \eta), \tag{2.15} \]

where $\eta$ and $\{\xi_i\}$, for $i = 1, \ldots, N$, are independent random variables having zero mean and variances $c$ and $1 - c$, respectively. $\eta$ is the common noise for all neurons, representing the uniform character of the correlation. It is easy to check that $\mathbf{r}^*$ so constructed has the required correlation structure of equation 2.5. Substituting equation 2.15 into equations 2.13 and 2.14, we get

\[ \frac{1}{N} \nabla L_p(\mathbf{r}^*, x) = \frac{1}{N\sigma} \sum_i \xi_i f_i'(x) + \frac{\eta}{N\sigma} \sum_i f_i'(x), \tag{2.16} \]

\[ \frac{1}{N} \nabla\nabla L_p(\mathbf{r}^*, x) = \frac{1}{N\sigma} \sum_i \xi_i f_i''(x) - \frac{1}{N\sigma^2} \sum_i f_i'(x)^2 + \frac{\eta}{N\sigma} \sum_i f_i''(x). \tag{2.17} \]
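The construction in equation 2.15 is easy to verify by simulation. A sketch with an assumed gaussian tuning curve (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, c, sigma, trials = 30, 0.3, 0.2, 200_000

x = 0.0
ci = np.linspace(-2.0, 2.0, N)             # preferred stimuli
f = np.exp(-(x - ci) ** 2 / 2.0)           # assumed gaussian tuning, f_i(x)

# r*_i = f_i(x) + sigma * (xi_i + eta), with Var(eta) = c, Var(xi_i) = 1 - c.
eta = rng.normal(0.0, np.sqrt(c), size=(trials, 1))
xi = rng.normal(0.0, np.sqrt(1.0 - c), size=(trials, N))
r = f + sigma * (xi + eta)

# The empirical covariance approaches sigma^2 * A with A from equation 2.5.
C_emp = np.cov(r, rowvar=False)
A = np.eye(N) + c * (1.0 - np.eye(N))
print(np.abs(C_emp - sigma**2 * A).max())
```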
Without loss of generality, we assume that the distribution of the preferred stimuli is uniform. For tuning functions of the form 2.3, then, $\sum_i f_i'(x)/N$ and $\sum_i f_i''(x)/N$ approach zero when $N$ is large. Therefore, the correlation contributions (the terms in $\eta$) in equations 2.16 and 2.17 can be neglected: UMLI performs in this case as if the neuronal signals were uncorrelated. Thus, by the weak law of large numbers,

\[ \frac{1}{N} \nabla\nabla L_p(\mathbf{r}^*, x) \simeq -\frac{1}{N\sigma^2} \sum_i f_i'(x)^2 = \frac{Q_p}{N}, \tag{2.18} \]

where

\[ Q_p \equiv E_q[\nabla\nabla L_p(\mathbf{r}, x)]. \tag{2.19} \]

According to the central limit theorem, $\nabla L_p(\mathbf{r}^*, x)/N$ converges to a gaussian distribution,

\[ \frac{\nabla L_p(\mathbf{r}^*, x)}{N} \sim \mathcal{N}\!\left(0,\; \frac{1 - c}{N^2 \sigma^2} \sum_i f_i'(x)^2\right) = \mathcal{N}\!\left(0,\; \frac{G_p}{N^2}\right), \tag{2.20} \]

where $\mathcal{N}(0, t^2)$ denotes the gaussian distribution having zero mean and variance $t^2$, and

\[ G_p \equiv V_q[\nabla L_p(\mathbf{r}, x)]. \tag{2.21} \]
Combining the results of equations 2.12, 2.18, and 2.20, we obtain the decoding error of UMLI,

\[ (\hat{x} - x)_{UMLI} \sim \mathcal{N}(0,\, Q_p^{-2} G_p) = \mathcal{N}\!\left(0,\; \frac{(1 - c)\sigma^2}{N F_1(x)}\right), \tag{2.22} \]

where

\[ F_1(x) = \frac{1}{N} \sum_i f_i'(x)^2 \tag{2.23} \]

is scaled to be of order one. The decoding error of UMLI decreases as a function of $c$ and of $N$. In a similar way, the decoding error of FMLI is obtained:

\[ (\hat{x} - x)_{FMLI} \sim \mathcal{N}(0,\, Q_q^{-2} G_q) = \mathcal{N}\!\left(0,\; \frac{(1 - c)\sigma^2}{N F_1(x)}\right), \tag{2.24} \]

which has the same form as that of UMLI except that $Q_q$ and $G_q$ are now defined with respect to the faithful decoding model, that is, $p(\mathbf{r}|x) = q(\mathbf{r}|x)$. To get equation 2.24, the condition $\sum_i f_i'(x) = 0$ is used. Interestingly, UMLI and FMLI have the same decoding error, because the uniform correlation effect is actually neglected in both UMLI and FMLI.

Note that in FMLI, $G_q = -Q_q = V_q[\nabla L_q(\mathbf{r}|x)]$ is the Fisher information. $Q_q^{-2} G_q$ is therefore the Cramér-Rao bound, which is the optimal accuracy an unbiased estimator can achieve. The above derivation of equation 2.24 indicates that FMLI is asymptotically efficient in the uniform correlation model. For an unfaithful decoding model, $Q_p$ and $G_p$ are usually different from the Fisher information. We call $Q_p^{-2} G_p$ the generalized Cramér-Rao bound, and we call UMLI quasi-asymptotically efficient if its decoding error approaches $Q_p^{-2} G_p$ asymptotically. Equation 2.22 shows that UMLI is quasi-asymptotically efficient.

We calculate the Cramér-Rao bound and the generalized Cramér-Rao bound for the gaussian response models, equations 2.1 and 2.9 (valid for general correlation structures). From the definitions,

\[ Q_p = -\frac{1}{\sigma^2} \sum_i f_i'(x)^2, \tag{2.25} \]

\[ G_p = \frac{1}{\sigma^2} \sum_{i,j} A_{ij} f_i'(x) f_j'(x), \tag{2.26} \]
\[ Q_q = -\frac{1}{\sigma^2} \sum_{i,j} A_{ij}^{-1} f_i'(x) f_j'(x), \tag{2.27} \]

\[ G_q = \frac{1}{\sigma^2} \sum_{i,j} A_{ij}^{-1} f_i'(x) f_j'(x). \tag{2.28} \]

Thus,

\[ Q_q^{-2} G_q = \frac{\sigma^2}{\sum_{i,j} A_{ij}^{-1} f_i'(x) f_j'(x)}, \tag{2.29} \]

\[ Q_p^{-2} G_p = \frac{\sigma^2 \sum_{i,j} A_{ij} f_i'(x) f_j'(x)}{\left[\sum_i f_i'(x)^2\right]^2}. \tag{2.30} \]
2.2.2 The Limited-Range Correlation Case. In section 2.2.1, the asymptotic efficiency of FMLI and UMLI was proved when the neuronal correlation is uniform. The result relies on the radial symmetry of the tuning functions and the uniform character of the correlation, which make it possible to cancel the correlation contributions from different neurons. For general tuning functions and correlation structures, the asymptotic efficiency of UMLI and FMLI may not hold, because the law of large numbers (see equation 2.18) and the central limit theorem (see equation 2.20) are not in general applicable. We note that for the limited-range correlation model, since the correlation is translation invariant and its strength decreases quickly with the dissimilarity in the neurons' preferred stimuli, the correlation effect in the decoding of FMLI and UMLI becomes negligible when $N$ is large. This ensures that the law of large numbers and the central limit theorem hold in the large-$N$ limit. Therefore, FMLI and UMLI are, respectively, asymptotically and quasi-asymptotically efficient in this case. This is confirmed by the simulations in section 3.2.

When FMLI and UMLI are asymptotically and quasi-asymptotically efficient, their decoding errors in the large-$N$ limit are given by the Cramér-Rao bound (see equation 2.29) and the generalized Cramér-Rao bound (see equation 2.30), respectively. From equations 2.7 and 2.30, we get

\[ \langle (\hat{x} - x)^2 \rangle_{UMLI} \sim \frac{\sigma^2 F_2(x)}{N F_1(x)^2}, \tag{2.31} \]

where

\[ F_2(x) = \frac{1}{N} \sum_{i,j} b^{|i-j|} f_i'(x) f_j'(x). \tag{2.32} \]
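A numerical illustration, under an assumed gaussian tuning curve with densely spaced preferred stimuli, that $F_2(x)/F_1(x)$ approaches $(1+b)/(1-b)$ for large $N$ (cf. equation 2.33 below); the tuning width and stimulus range are illustrative:

```python
import numpy as np

N, b, x = 2000, 0.5, 0.0
ci = np.linspace(-10.0, 10.0, N)            # densely spaced preferred stimuli
a = 1.0                                     # assumed gaussian tuning width
fp = ((ci - x) / a**2) * np.exp(-(x - ci) ** 2 / (2 * a**2))   # f_i'(x)

F1 = np.mean(fp ** 2)                       # equation 2.23
idx = np.arange(N)
F2 = (b ** np.abs(idx[:, None] - idx[None, :]) * np.outer(fp, fp)).sum() / N  # eq. 2.32

print(F2 / F1, (1 + b) / (1 - b))           # the two values nearly agree
```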
For a fixed value of $N$, $F_2(x)$ is a nonmonotonic function of the parameter $b$, and so is the decoding error of UMLI. Note that the contribution of
$f_i'(x) f_j'(x)$ to the summation is weighted by $b^{|i-j|}$, which decreases exponentially with $|i - j|$. In the case of small $b$, the terms $f_i'(x) f_j'(x)$ with small values of $|i - j|$ contribute the major part of $F_2(x)$. They are mainly positive, excluding the region around the stimulus $x$. Therefore, $F_2(x)$ increases with $b$ when $b$ is small. When $b$ is large enough, more negative terms $f_i'(x) f_j'(x)$ (having large values of $|i - j|$, with $c_i < x < c_j$ or $c_i > x > c_j$) start to contribute, so that $F_2(x)$ decreases with $b$. This property is confirmed in section 3.

For a fixed value of $b$, in the large-$N$ limit, the difference between $f_i'(x)$ and $f_j'(x)$ is negligible when $|i - j|$ is small.¹ Thus,

\[ F_2(x) \sim \frac{1}{N} \sum_i f_i'(x)^2 \left(1 + 2b + 2b^2 + \cdots\right) \sim F_1(x)\, \frac{1 + b}{1 - b}, \tag{2.33} \]

and

\[ \langle (\hat{x} - x)^2 \rangle_{UMLI} \sim \frac{\sigma^2 (1 + b)}{N (1 - b) F_1(x)}, \tag{2.34} \]
which is a decreasing function of $N$. The decoding error of FMLI has the same asymptotic behavior, which was analyzed by Abbott and Dayan (1999). Comparing equation 2.34 with their result (equation 4.12 in Abbott & Dayan, 1999), we see that UMLI and FMLI have the same decoding error in the large-$N$ limit.

2.3 Implementation of UMLI by a Recurrent Network Model. Recently, Pouget et al. (1998) and Deneve et al. (1999) proposed a biologically feasible RNM, which performs population decoding in a coarse code format. For the correlation models we consider in this article, RNM can implement UMLI. The essential idea of RNM may be understood as constructing a dynamic system that is triggered by a stimulus $x$ and evolves into a stationary state satisfying

\[ \sum_i [r_i - f_i(\hat{x})]\, v_i(\hat{x}) = 0, \tag{2.35} \]
where the functions v_i(x) for i = 1, …, N are determined by the network parameters, and x̂ is the network estimate. It is easy to check that for the encoding model (see equation 2.1) and the unfaithful decoding model (see equation 2.9), the estimates of FMLI and
¹ For the gaussian tuning function, the difference between f_i'(x) and f_j'(x) is of the order of 1/N when |i − j| is small. For the triangular tuning function, the difference is zero in most cases, but can be ±2 in the region where the preferred stimuli are close to the stimulus. In the large N limit, the contribution of this region is negligible.
784
S. Wu, H. Nakahara, and S. Amari
UMLI are determined, respectively, by

\[
\sum_{i,j} [r_i - f_i(\hat{x})]\, A_{ij}^{-1} f_j'(\hat{x}) = 0,
\tag{2.36}
\]

\[
\sum_{i} [r_i - f_i(\hat{x})]\, f_i'(\hat{x}) = 0.
\tag{2.37}
\]
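Condition 2.37 can be realized directly: the UMLI estimate maximizes the unfaithful (correlation-free) log-likelihood, whose stationary point is exactly equation 2.37. A minimal sketch of our own, assuming gaussian tuning, independent noise, and a grid search in place of the gradient descent used in the paper's simulations:

```python
import numpy as np

# Minimal UMLI sketch (ours, not the paper's code): maximize the unfaithful
# log-likelihood, which for gaussian noise is proportional to
# -sum_i (r_i - f_i(x))^2; its stationarity condition is equation 2.37.
rng = np.random.default_rng(0)
N, D, a, sigma, x_true = 50, 3.0, 1.0, 0.1, 0.0
c = -D + 2.0 * np.arange(1, N + 1) * D / (N + 1)   # preferred stimuli
f = lambda x: np.exp(-(x - c) ** 2 / (2 * a ** 2)) # gaussian tuning curves
r = f(x_true) + sigma * rng.standard_normal(N)     # noisy responses

grid = np.linspace(-1.0, 1.0, 4001)
x_hat = grid[np.argmax([-np.sum((r - f(x)) ** 2) for x in grid])]
```

With these (assumed) parameters the estimate lands within a few hundredths of the true stimulus, consistent with the error scale σ²/(N F₁(x)).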
Hence, if RNM is designed such that v_i(x) ∝ Σ_j A_{ij}^{-1} f_j'(x) or v_i(x) ∝ f_i'(x), it implements FMLI or UMLI, respectively. An example of using RNM to implement UMLI is given in Pouget et al. (1998, 1999).

3 Performance Comparison
The performance of UMLI is compared with that of FMLI and the COM decoding method. The neural population model we consider is a regular array of N neurons (Baldi & Heiligenberg, 1988; Snippe, 1996) with the preferred stimuli uniformly distributed in the range [−D, D], that is, c_i = −D + 2iD/(N + 1), for i = 1, …, N. The comparison is done at the stimulus x = 0. Two different tuning functions are studied: the triangular function and the gaussian one. Compared with FMLI, UMLI has an equivalent or larger decoding error, since it uses less information about the encoding process. By simplifying the covariance matrix, UMLI largely reduces the computational cost. COM is a simple decoding method that uses no information about the encoding process; its estimate is the average of the neurons' preferred stimuli weighted by the responses (Georgopoulos, Kalaska, Caminiti, & Massey, 1982; Snippe, 1996), that is,

\[
\hat{x} = \frac{\sum_i r_i c_i}{\sum_i r_i}.
\tag{3.1}
\]

The shortcoming of COM is a large decoding error. The decoding errors of COM for the two correlation structures we study are calculated to be

\[
\langle (\hat{x}-x)^2 \rangle_{\mathrm{COM}} \approx \frac{\sigma^2 \sum_{ij} A_{ij} c_i c_j}{\left[\sum_i f_i(x)\right]^2},
\tag{3.2}
\]

where the condition Σ_i f_i(x) c_i = 0 is used, due to the regularity of the distribution of the preferred stimuli.

3.1 The Case for the Triangular Tuning Function. The triangular tuning
function has the form (Brunel & Nadal, 1998)

\[
f_i(x) =
\begin{cases}
1 - |x - c_i|/a, & |x - c_i| \le a \\
0, & |x - c_i| > a,
\end{cases}
\tag{3.3}
\]

where the parameter a is the tuning width.
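To make the comparison concrete, here is our own minimal illustration of the COM estimate (equation 3.1) with the triangular tuning (equation 3.3); the noise model, the zero cutoff for silent neurons, and all parameter values are assumptions in the spirit of this section:

```python
import numpy as np

# COM decoding sketch (ours): weight the preferred stimuli by the responses.
rng = np.random.default_rng(1)
N, D, a, sigma, x = 50, 3.0, 1.0, 0.1, 0.0
c = -D + 2.0 * np.arange(1, N + 1) * D / (N + 1)  # preferred stimuli c_i
f = np.clip(1.0 - np.abs(x - c) / a, 0.0, None)   # triangular tuning, eq. 3.3
r = f + sigma * rng.standard_normal(N)            # noisy responses
r[f == 0.0] = 0.0          # cutoff: inactive neurons excluded from decoding
x_com = np.sum(r * c) / np.sum(r)                 # center-of-mass, eq. 3.1
```

The estimate is unbiased at x = 0 because of the symmetric array, but its variance (equation 3.2) is larger than that of the likelihood-based decoders.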
We note that the gaussian response model does not give zero probability for negative firing rates. To make it more reliable, we set r_i = 0 when f_i(x) = 0 (i.e., |c_i − x| > a), which means that only sufficiently active neurons contribute to the decoding. This seems biologically plausible, considering the time limitation of the decoding process. The mathematical effect of the cutoff will be discussed later. For the tuning width a, there are effectively N = Int[2a/d − 1] neurons involved in the decoding process, where d is the difference in the preferred stimulus between two consecutive neurons and the function Int[·] denotes the integer part of the argument. Since the triangular tuning function is piecewise linear, the decoding errors of UMLI and FMLI can be calculated exactly (see the appendix). Figure 1 compares the decoding errors of the three methods for the uniform correlation model. It shows that UMLI has the same decoding error as that of FMLI and a lower error than that of COM. Figure 2 compares the decoding errors of the three methods for the limited-range correlation model. We see that UMLI has a lower decoding error than that of COM and a larger error than that of FMLI (a finite N effect). For a fixed value of b, the discrepancy between FMLI and UMLI decreases with the number of neurons (see Figure 2a). For a fixed value of N, the discrepancy increases with the correlation strength and decreases subsequently when the correlation is very strong (see Figure 2b). Figures 1b and 2b also illustrate the effect of correlation on the decoding accuracies. We see that the uniform correlation improves the decoding accuracies of the three methods; the limited-range correlation degrades the decoding accuracies when the correlation strength is small and improves the accuracies when the strength is large enough.

3.2 The Case for the Gaussian Tuning Function. The gaussian tuning
function has the form

\[
f_i(x) = \exp\left( -\frac{(x - c_i)^2}{2a^2} \right),
\tag{3.4}
\]
where the parameter a is the tuning width. Again, to make the gaussian response model more reliable, we set r_i = 0 when f_i(x) < 0.011 (i.e., |x − c_i| > 3a). For the tuning width a, there are N = Int[6a/d − 1] neurons involved in the decoding process, where d is the difference in the preferred stimuli between two consecutive neurons. Figure 3 compares the decoding errors of the three methods for the uniform correlation model. It shows that UMLI has the same decoding error as that of FMLI² and a lower error than that of COM.

² We prove that UMLI and FMLI have the same decoding error in the large N limit. Here they have the same error at finite N, due to the regular distribution of the preferred stimuli.
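The uniform correlation model used in Figures 1 and 3 is easy to simulate: when every pair of neurons shares the same correlation strength c, the noise can be realized with one common source plus independent ones. A sampling sketch of our own (this standard construction and the parameter values are assumptions, not the paper's code):

```python
import numpy as np

# Draw noise xi with <xi_i xi_j> = sigma^2 * [(1 - c) delta_ij + c],
# i.e., the uniform correlation model, via a shared common component.
rng = np.random.default_rng(2)
N, sigma, corr, M = 50, 0.1, 0.5, 50_000
eps = rng.standard_normal((M, N))        # independent part, one row per trial
common = rng.standard_normal((M, 1))     # component shared by all neurons
xi = sigma * (np.sqrt(1.0 - corr) * eps + np.sqrt(corr) * common)

cov = np.cov(xi, rowvar=False)
# diagonal entries -> sigma^2; off-diagonal entries -> corr * sigma^2
```

Adding such noise to the tuning curves f_i(x) gives correlated responses r_i for Monte Carlo decoding experiments.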
Figure 1: Comparison of the decoding errors of UMLI, FMLI, and COM for the uniform correlation model and the triangular tuning function. The parameters are chosen as a = 1 and σ = 0.1. (a) The decoding errors change with the number of neurons, N. The correlation strength c is 0.5. (b) The decoding errors change with the correlation strength c. The number of neurons, N, is 50.
Figure 2: Comparison of the decoding errors of UMLI, FMLI, and COM for the limited-range correlation model and the triangular tuning function. The parameters are chosen as a = 1 and σ = 0.1. (a) The decoding errors change with the number of neurons, N. The correlation strength b is 0.5. (b) The decoding errors change with the correlation strength b. The number of neurons, N, is 50.
Figure 3: Comparison of the decoding errors of UMLI, FMLI, and COM for the uniform correlation model and the gaussian tuning function. The parameters are chosen as a = 1 and σ = 0.1. (a) The decoding errors change with the number of neurons, N. The correlation strength c is 0.5. (b) The decoding errors change with the correlation strength c. The number of neurons, N, is 50.
In Figure 4, the simulation results for the decoding errors of FMLI and UMLI in the limited-range correlation model are compared with those obtained by using the Cramér-Rao bound and the generalized Cramér-Rao bound, respectively. The two results agree very well when the number of neurons, N, is large, which means that FMLI and UMLI are asymptotically and quasi-asymptotically efficient, as we analyzed. We also see that a stronger correlation leads to a slower convergence of the decoding error of FMLI to the Cramér-Rao bound. For a finite value of N, their discrepancy can be quite large in the strong correlation case. In Figure 4a, we observe an interesting behavior of the Cramér-Rao bound (the case of b = 0.8). It increases first when N is small and decreases subsequently when N is large. Asymptotically, it is inversely proportional to N or, equivalently, the Fisher information is proportional to N. The reason can be understood as follows. When the number of neurons, N, increases, on one hand, the number of signals becomes larger, which tends to increase the Fisher information. On the other hand, the total correlation in the population is also enlarged, which tends to decrease the Fisher information. These two factors compete and generate the above behavior. Moreover, since the correlation strength between two consecutive neurons, b, is constant, the former factor finally dominates. Thus, asymptotically, the Fisher information increases proportionally with N. In the simulation, the standard gradient-descent method is used to maximize the log-likelihood, and the initial guess for the stimulus is chosen as the preferred stimulus of the most active neuron. The CPU time of UMLI is around one-fifth that of FMLI when the number of neurons is N = 50. UMLI thus reduces the computational cost of FMLI significantly. Figure 5 compares the decoding errors of the three methods for the limited-range correlation model. The simulation results for FMLI and UMLI are used.
The figure shows that UMLI has a lower decoding error than that of COM and a performance comparable to that of FMLI. Again, we observe that the uniform correlation improves the decoding accuracies of the three methods (see Figure 3b). The limited-range one degrades the decoding accuracies when the correlation strength is small and improves the accuracies when the strength is large enough (see Figure 5b). To make the gaussian response model more plausible, we have forced the neuronal activity to be zero when the value of the tuning function is under a threshold (0 for the triangular tuning function and 0.011 for the gaussian one). This partly compensates for the defect of the gaussian model of having nonzero probability for negative firing rates. It is easy to see that this cutoff does not much affect the results of UMLI and FMLI, since they decode by using the derivatives of the tuning functions, whereas the decoding error of COM would be greatly enlarged without the cutoff. Without the cutoff, the range of the preferred stimuli would have to be finite to avoid divergence of the decoding error of COM. In this case, the model becomes more similar to the one used by Snippe (1996), whose calculation shows that COM is extremely
Figure 4: Comparison of the simulation results of the decoding errors of UMLI and FMLI in the limited-range correlation model with those obtained by using the Cramér-Rao bound (CRB) and the generalized Cramér-Rao bound (GCRB), respectively. The parameters are chosen as a = 1 and σ = 0.1. SMR denotes the simulation results. In the simulation, 10 sets of data are generated, each averaged over 1000 trials. (a) FMLI. (b) UMLI.
Figure 5: Comparison of the decoding errors of UMLI, FMLI, and COM for the limited-range correlation model and the gaussian tuning function. The parameters are chosen as a = 1 and σ = 0.1. The simulation results for FMLI and UMLI are very close. (a) The decoding errors change with the number of neurons, N. The correlation strength b is 0.5. (b) The decoding errors change with the correlation strength b. The number of neurons, N, is 50.
inefficient compared with the optimal decoding accuracy when the tuning width is small. Thus, the cutoff benefits only COM. Our conclusion does not change except for this point. In the above calculation, we have not considered spontaneous neuronal activity. If this factor is included, for example, if ⟨r_i⟩ = f_i(x) is replaced by ⟨r_i⟩ = f_i(x) + c, where c is a small constant representing the background noise, the decoding error of COM will be enlarged, whereas the performance of UMLI and FMLI will not be affected. This can be understood for the same reason as for the cutoff. Thus, adding a spontaneous term will only make the superiority of UMLI over COM more significant. The optimal linear estimator (OLE) is another decoding method, with computational complexity and accuracy between those of COM and FMLI (Salinas & Abbott, 1994). Since it uses information from the neural correlation, we do not compare it with UMLI.

4 Conclusions and Discussions
We have studied a population decoding paradigm in which MLI is based on an unfaithful model. This is motivated by the fact that the encoding process of the brain is not exactly known by the estimator and that a properly simplified decoding model can save computational cost without sacrificing much decoding accuracy. As an example, we consider an unfaithful decoding model that neglects the pairwise correlation between neuronal activities. Two different correlation structures are studied: the uniform and the limited-range correlations. We proved the asymptotic efficiency of FMLI and the quasi-asymptotic efficiency of UMLI for the two correlations. The performance of UMLI is compared with that of FMLI and of COM. It turns out that UMLI has a lower decoding error than that of COM. Compared with FMLI, UMLI has an equivalent or larger decoding error, though with much less computational cost. In the large N limit, UMLI and FMLI have the same decoding error for both uniform and limited-range correlations. Furthermore, UMLI can be implemented by a biologically feasible recurrent network (Pouget et al., 1998; Deneve, Latham, & Pouget, 1999). Our future work seeks to understand the biological implications of UMLI. Yoon and Sompolinsky (1999) studied a different correlation structure, referred to as the YS model. Unlike the limited-range correlation model, the correlation strength between two consecutive neurons in the YS model increases with N, and hence the correlation range is finite even when N is infinite. Interestingly, Yoon and Sompolinsky have shown that the Fisher information in this case does not increase asymptotically in proportion to N but saturates at a constant level. Since the correlation effect in the YS model is not negligible for large N, the law of large numbers (see equation 2.18) and the central limit theorem (see equation 2.20) do not hold. FMLI and UMLI are not consistent in this case. For simplicity, we study a simplified
YS model obtained by setting b = h^{1/N}, for 0 < h < 1, in the limited-range correlation (Abbott & Dayan, 1999). (It corresponds to the case of a = 0 in the general YS model. In this article, we use a = 0; however, we believe that a ≠ 0 has an important effect when b = h^{1/N}.) The simulation results are shown in Figure 6. They show that UMLI and FMLI are not consistent, and their decoding errors tend to constant levels in the large N limit (see Figure 6a). UMLI has a performance comparable to that of FMLI and a lower error than that of COM (see Figure 6b). The effect of correlation on the decoding accuracies of FMLI, UMLI, and COM is investigated for the two correlation structures. It turns out that the correlation, depending on its form, can either improve or degrade the decoding accuracy. This observation agrees with the analysis of Abbott and Dayan (1999), which is done with respect to the optimal decoding accuracy, that is, the Cramér-Rao bound, and not with respect to specific decoding methods. Our result shows further that these properties hold for the three decoding methods. By exchanging the roles of A and A⁻¹, the limited-range correlation model can easily be generalized to a new one in which neurons are negatively correlated with their nearest neighbors. It is easy to check that all results take the same form except that A and A⁻¹ are exchanged. Therefore, the effect of correlation is reversed: it improves the decoding accuracy when the strength is small and degrades it when the strength is large. This article has studied only pairwise correlation with a structure that is uniform or of limited range. The real neuronal correlation may be more complex and have orders higher than two. However, if the radial symmetry of the tuning functions and the translational invariance of the correlation are not violated too much, which seems to be true, we still expect that UMLI retains the beneficial properties discussed above.
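The contrast between the two scaling regimes can be checked numerically. In the following sketch (our own; gaussian tuning, the Fisher information J = f'(x)ᵀ(σ²A)⁻¹f'(x) for gaussian noise, and all parameter values are assumptions), a fixed b gives Fisher information growing roughly linearly with N, while the simplified YS choice b = h^{1/N} makes it saturate:

```python
import numpy as np

# Fisher information for gaussian noise with covariance sigma^2 * A, where
# A_ij = b^{|i-j|} is the limited-range correlation matrix.
def fisher(N, b, D=3.0, a=1.0, sigma=0.1, x=0.0):
    c = -D + 2.0 * np.arange(1, N + 1) * D / (N + 1)
    f = np.exp(-(x - c) ** 2 / (2 * a ** 2))
    fp = -(x - c) / a ** 2 * f                     # tuning-curve derivatives
    i = np.arange(N)
    A = b ** np.abs(i[:, None] - i[None, :])
    return fp @ np.linalg.solve(sigma ** 2 * A, fp)

h = 0.2
J_lr = [fisher(N, 0.5) for N in (50, 200)]             # fixed b: grows with N
J_ys = [fisher(N, h ** (1.0 / N)) for N in (50, 200)]  # YS-like: saturates
```

The growing J_lr corresponds to a Cramér-Rao bound that shrinks as 1/N, while the flat J_ys reproduces the constant decoding errors seen in Figure 6a.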
Appendix: The Decoding Errors of UMLI and FMLI in the Case of Triangular Tuning Functions
For the triangular tuning function,

\[
f_i'(x) =
\begin{cases}
\operatorname{sign}(x - c_i)/a, & |x - c_i| < a,\; x \neq c_i \\
0, & |x - c_i| > a,
\end{cases}
\tag{A.1}
\]
where sign(x − c_i) denotes the sign of (x − c_i). The function f_i'(x) is singular at c_i = x and c_i = x ± a. Without loss of generality, we assume no such preferred stimuli exist. Denote r_i = f_i(x) + ξ_i, for i = 1, …, N, where {ξ_i} are random numbers satisfying

\[
\langle \xi_i \rangle = 0,
\tag{A.2}
\]

\[
\langle \xi_i \xi_j \rangle = \sigma^2 A_{ij}.
\tag{A.3}
\]
Figure 6: (a) Comparison of the simulation results of the decoding errors of UMLI and FMLI in the simplified YS model with those obtained by using the Cramér-Rao bound (CRB) and the generalized Cramér-Rao bound (GCRB), respectively. (b) Comparison of the decoding errors of UMLI, FMLI, and COM. The parameters are chosen as a = 1, σ = 0.1, h = 0.2. In the simulation, 10 sets of data are generated, each averaged over 10,000 trials.
The UMLI estimate is the solution of

\[
\nabla \ln p(\mathbf{r} \mid \hat{x}) = 0.
\tag{A.4}
\]

From equation 2.9 we get

\[
\sum_i (r_i - f_i(\hat{x})) f_i'(\hat{x}) = 0.
\tag{A.5}
\]

Suppose x̂ is close enough to x; then, due to the piecewise linearity of the triangular function, we have

\[
r_i - f_i(\hat{x}) = \xi_i - (\hat{x} - x) f_i'(x),
\tag{A.6}
\]

\[
f_i'(\hat{x}) = f_i'(x).
\tag{A.7}
\]

Substituting equations A.6 and A.7 into A.5, we get

\[
\hat{x} = x + \frac{\sum_i \xi_i f_i'(x)}{\sum_i (f_i'(x))^2}.
\tag{A.8}
\]
Therefore,

\[
\langle (\hat{x}-x)^2 \rangle_{\mathrm{UMLI}} = \frac{\sigma^2 \sum_{ij} A_{ij} f_i'(x) f_j'(x)}{\left[\sum_i (f_i'(x))^2\right]^2}.
\tag{A.9}
\]

The FMLI estimate is the solution of

\[
\nabla \ln q(\mathbf{r} \mid \hat{x}) = 0.
\tag{A.10}
\]

From equation 2.1 we get

\[
\sum_{i,j} A_{ij}^{-1} (r_j - f_j(\hat{x})) f_i'(\hat{x}) = 0.
\tag{A.11}
\]
By using equations A.6 and A.7, we get

\[
\hat{x} = x + \frac{\sum_{i,j} A_{ij}^{-1} \xi_j f_i'(x)}{\sum_{i,j} A_{ij}^{-1} f_i'(x) f_j'(x)}.
\tag{A.12}
\]

Therefore,

\[
\langle (\hat{x}-x)^2 \rangle_{\mathrm{FMLI}} = \frac{\sigma^2}{\sum_{ij} A_{ij}^{-1} f_i'(x) f_j'(x)}.
\tag{A.13}
\]
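Equation A.9 is easy to verify by Monte Carlo, since the estimate A.8 is linear in the noise. A sketch of our own (the correlation structure, stimulus value, and trial count are assumptions):

```python
import numpy as np

# Monte Carlo check of equation A.9 for the triangular tuning: the empirical
# variance of the linear UMLI error (equation A.8) should match the formula.
rng = np.random.default_rng(3)
N, D, a, sigma, b, x = 50, 3.0, 1.0, 0.1, 0.5, 0.1
c = -D + 2.0 * np.arange(1, N + 1) * D / (N + 1)
active = np.abs(x - c) < a
fp = np.where(active, np.sign(x - c) / a, 0.0)      # f_i'(x), equation A.1

i = np.arange(N)
A = b ** np.abs(i[:, None] - i[None, :])            # limited-range correlation
L = np.linalg.cholesky(A)
xi = sigma * rng.standard_normal((20_000, N)) @ L.T # noise with cov sigma^2 A

err = xi @ fp / np.sum(fp ** 2)                     # x_hat - x, equation A.8
mc_var = np.mean(err ** 2)
theory = sigma ** 2 * (fp @ A @ fp) / np.sum(fp ** 2) ** 2   # equation A.9
```

With 20,000 trials the two variances agree to within a few percent.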
Acknowledgments
We are grateful to Noboru Murata for helping to solve the mathematical problems in this article, and we thank the two anonymous reviewers for their valuable comments. S. W. acknowledges helpful discussions with Peter Dayan and Danmei Chen.

References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11, 91–101.
Akahira, M., & Takeuchi, K. (1981). Asymptotic efficiency of statistical estimators: Concepts and higher order asymptotic efficiency. In Lecture Notes in Statistics 7. Berlin: Springer-Verlag.
Anderson, C. H. (1994). Basic elements of biological computational systems. International Journal of Modern Physics C, 5, 135–137.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Computation, 10, 1731–1757.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2, 740–745.
Fetz, E., Toyama, K., & Smith, W. (1991). Synaptic interactions between cortical neurons. In A. Peters & E. G. Jones (Eds.), Cerebral cortex, 9. New York: Plenum Press.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13, 2758–2771.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience, 2, 1527–1537.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 243, 1416–1419.
Johnson, K. O. (1980). Sensory discrimination: Neural processes preceding discrimination decision. Journal of Neurophysiology, 43, 1793–1815.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. Journal of Neuroscience, 18, 1161–1170.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5, 865–872.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Pouget, A., Deneve, S., Ducom, J.-C., & Latham, P. (1999). Narrow versus wide tuning curves: What's best for a population code? Neural Computation, 11, 85–90.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1, 89–107.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences USA, 90, 10749–10753.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–529.
Snippe, H. P., & Koenderink, J. J. (1992). Information in channel-coded systems: Correlated receivers. Biological Cybernetics, 67, 183–190.
Wu, S., Nakahara, H., Murata, N., & Amari, S. (2000). Population decoding based on an unfaithful model. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 192–198). Cambridge, MA: MIT Press.
Yoon, H., & Sompolinsky, H. (1999). The effect of correlations on the Fisher information of population codes. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 167–173). Cambridge, MA: MIT Press.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10, 403–430.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received August 29, 1999; accepted June 15, 2000.
LETTER
Communicated by Michael DeWeese
Metabolically Efficient Information Processing

Vijay Balasubramanian
Jefferson Laboratory of Physics, Harvard University, Cambridge, MA 02138, U.S.A.
Don Kimber
FX Palo Alto Laboratory, Palo Alto, CA 94304, U.S.A.
Michael J. Berry II
Department of Molecular Biology, Princeton University, Princeton, NJ 08544, U.S.A.

Energy-efficient information transmission may be relevant to biological sensory signal processing as well as to low-power electronic devices. We explore its consequences in two different regimes. In an "immediate" regime, we argue that the information rate should be maximized subject to a power constraint, and in an "exploratory" regime, the transmission rate per power cost should be maximized. In the absence of noise, discrete inputs are optimally encoded into Boltzmann-distributed output symbols. In the exploratory regime, the partition function of this distribution is numerically equal to 1. The structure of the optimal code is strongly affected by noise in the transmission channel. The Arimoto-Blahut algorithm, generalized for cost constraints, can be used to derive and interpret the distribution of symbols for optimal energy-efficient coding in the presence of noise. We outline the possibilities and problems in extending our results to information coding and transmission in neurobiological systems.

1 Introduction
There is increasing evidence that, far from being noisy and unreliable, spiking neurons can encode information about the outside world precisely in individual spike timings (see de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Berry, Warland, & Meister, 1997; Buracas, Zador, DeWeese, & Albright, 1998). Estimates of the information transmitted by sensory neurons have often found them to be highly informative, sending 2 to 5 bits per spike, and quite reliable, using roughly half of the total entropy available in their spike trains (Buracas et al., 1998 [monkey visual cortex]; Warland, 1991 [cricket cercus]; Rieke, Warland, & Bialek, 1993, and Rieke, Bodnar, & Bialek, 1995 [frog auditory system and frog sacculus]; Berry & Meister, 1998 [salamander retina]; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1997 [blowfly H1 interneuron]; Warland, Reinagel, & Meister, 1997 [salamander retina]; Reinagel & Reid, 1998 [cat LGN]; see

Neural Computation 13, 799–815 (2001)
© 2001 Massachusetts Institute of Technology
800
V. Balasubramanian, D. Kimber, and M. J. Berry II
Figure 1: Schematic view of an information system.
Rieke, Warland, & de Ruyter van Steveninck, 1997, for a discussion and review). So it is possible that there are behavioral regimes where information theory will be a powerful tool for predicting the structure of neural codes, provided the costs and constraints of biological computation are properly incorporated. Therefore, as a step toward a biologically relevant information theory, we examine the effect of energetic costs on the coding and transmission of information by discrete symbols, following important prior work by Levy and Baxter (1996) and Sarpeshkar (1998). We have in mind a model of a sensory system where signals from the natural world are detected and encoded, and pass through a noisy channel before arriving at a decision-making receiver. Our results are equally relevant to low-power electronic devices, such as mobile telephones, that are constrained by finite battery life. In general terms, the role of a sensory system in the process of information use by an organism is summarized in Figure 1. Information about the environment is detected by sensors and encoded for transmission through an information channel to a control system. For example, the retina detects patterns of light, which are encoded by ganglion cells for transmission through the optic nerve to the brain. We might expect evolution (or engineering) to produce systems that make an "optimal" choice for both the amount of information to transmit and, given the amount, the kind of information to transmit. The amount of information is quantified in classical information theory by the mutual information I(S; Z), and by the rate R = I(S; Z)/N during a period in which N symbols are transmitted (Cover & Thomas, 1991). As we will describe, information theory can be used to determine the minimum power necessary to transmit at a given rate R or the minimum energy needed to transmit a given amount I of information.
The rate at which the organism should operate is determined by a trade-off between the value and cost of the transmitted information. We outline two different behavioral regimes in which these trade-offs lead to different coding strategies.

1.1 Immediate Regime. In some activities, an organism is engaged in a time-critical task involving rapidly changing environmental states, and
Metabolically Efficient Information Processing
801
its performance depends strongly on its rate of sensory information acquisition, R. For example, a cheetah's effectiveness in catching a gazelle, and hence in procuring metabolic gains from food, might be expected to improve with increasing R. However, acquiring sensory information also incurs a metabolic cost at some rate E, and unless the resulting rate of metabolic gain V for the organism is great enough, the expenditure may not be worthwhile. Although numerous factors affect the value of information, we will focus on how it varies with the rate R and consider a value function V(R) with all other variables held constant. In general, we expect that the value V(R) of sensory information will increase monotonically, but not linearly, with R. At a low enough rate of acquisition, the sensory modality will be of no use to the organism. For instance, if the cheetah sees half as well, it will not capture half as many gazelles—it will starve. Conversely, at a high enough information rate, the value should saturate, as there is only so much meat in a gazelle. Balancing the marginal increase in value to the organism, ∂V(R)/∂R, against the marginal increase in energy expense, ∂E(R)/∂R, yields some optimal rate R* for such an immediate regime. Alternatively, there may be structural constraints, such as the signal-to-noise ratio of the sensory modality or the processing speed of the biological circuitry, that limit the attainable rate to some R_c. We cannot compute R* without knowing the value function V(R), and we cannot compute R_c without knowing the structural constraints. However, the smaller of these two values will set the organism's rate of sensory information acquisition, and whatever it is, an optimal code will minimize the energy cost for this rate. To study the structure of such codes, we can simply ask how to minimize the power required to transmit at a given information rate.
As we shall see, E(R) is an increasing convex function, so this is equivalent to determining the maximum rate R(E) of information transmission given a constraint of average energy E per symbol (see Figure 2).

1.2 Exploratory Regime. In many other situations, the relevant environmental state changes slowly, and the organism is not faced with any urgent tasks. Here, it is free to choose the rate at which it surveys its surroundings, as well as the time it spends before making a behavioral decision that changes its environment. The quality of exploration will depend on the total amount of sensory information acquired. Better exploration will allow the organism to achieve more appropriate behavior, but continued exploration will involve a cost in metabolic energy as well as in opportunities for other behavior. Therefore, there will be an optimal amount of information, I*, that the organism should acquire, where the marginal value of exploration matches its marginal cost. We cannot compute I* without knowing the value of exploration achievable using an amount I of information. However, whatever the value of I*, and independent of the details of the activity, an "optimal" sensory system will transmit that information at the rate that minimizes the cumulative
Figure 2: Schematic of energy optimization. The information rate (thick line) is a convex function of the energy rate until E_max. The exploratory-regime optimum (R*, E*) is given by the intersection of the tangent from the origin (thin line) with R(E).
energy cost E_c(I*). The convexity of E(R) implies that this is achieved by a sensory system that transmits at a fixed rate, as any variations in the rate will result in a higher cumulative energy cost. This optimal rate of sensory information acquisition will minimize E_c(I*) = E(R) I*/R or, equivalently, maximize R(E)/E.

1.3 Low-Power Devices. Both the immediate and the exploratory regimes apply to low-power electronic devices, such as mobile telephones and laptop computers. The finite battery lifetime of these devices puts a premium on energy efficiency. The immediate regime is equivalent to an "on-line" mode, where the information rate of the device is determined by the application, but the total amount of information is variable. The exploratory regime is equivalent to an "off-line" or "batch" mode, where the total amount of information to transmit is set, but the rate is variable.

1.4 Summary. A system operating at any given information rate R should transmit using the minimum energy E(R) required for that rate, all other constraints being held equal. In immediate activities, the optimal rate is determined by the trade-off between the gain realizable at rate R and the cost E(R). However, in the exploratory regime, the optimal rate maximizes R(E)/E, independent of the details of the activity. The next sections describe the general structure of energy-efficient codes.
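The tangent-from-the-origin construction of Figure 2 is simple to reproduce numerically. In the following toy illustration (entirely our own; the sigmoidal rate curve and its parameters are assumptions, not the paper's model), the exploratory-regime optimum is the energy that maximizes R(E)/E:

```python
import numpy as np

# Toy exploratory-regime optimization: for an assumed rate curve R(E), the
# optimum maximizes bits per unit energy, R / E, i.e., the point where the
# tangent from the origin touches R(E) (cf. Figure 2).
E = np.linspace(0.01, 10.0, 5000)        # candidate energies per symbol
R = 4.0 * E ** 2 / (E ** 2 + 4.0)        # assumed saturating rate curve R(E)
efficiency = R / E                       # information per unit energy
E_star = E[np.argmax(efficiency)]        # exploratory-regime optimum E*
R_star = R[np.argmax(efficiency)]
```

For this particular curve the maximum of 4E/(E² + 4) sits at E* = 2, an interior point, matching the tangent construction in the figure.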
2 Metabolically Constrained Capacity and Coding
In this section, we consider the consequences of metabolic efficiency in information transmission. We will not address the problem of determining what information to transmit, but abstract the mapping S → X in Figure 1 as performing this task. From this point of view, we can treat X as a sequence of symbols to be encoded into a sequence Y of channel inputs, which are transmitted to produce an output sequence Z. Denote the elements of these sequences at a specific time as x, y, and z. Channel transmission is both noisy and energetically costly. Assume a discrete memoryless channel, modeled by crossover probabilities Q_{k|j} ≡ Pr{z = z_k | y = y_j}, giving the probability that a channel input symbol y_j results in a channel output z_k. The organism as a whole incurs a variety of energetic expenditures at all times, but we will focus on the costs of operating the sensory system, these being relevant to the optimization considered here. The energetic cost of transmitting information can be referred to the input Y or the output Z, or may even be a function of both X and Y. However, we choose to associate energy costs {E_1, …, E_n} with input symbols {y_1, …, y_n}. This entails no loss of generality, since for arbitrary costs E_jk depending on both input y_j and output z_k, we may simply take E_j as the expected cost E_j ≡ Σ_k Q_{k|j} E_jk for use of symbol y_j. Our goal is to find, for any given energy E, the maximum achievable mutual information I(X; Z) between the signal X and the channel output Z, with expected energy cost Ē ≤ E. However, it can be shown that I(X; Z) ≤ I(Y; Z), with equality when X can be completely determined from Y (Cover & Thomas, 1991). Intuitively, the encoding from X to Y should exploit the channel characteristics, but without loss of information about X. Assuming the mapping from X to Y is indeed lossless, maximizing I(X; Z) reduces to maximizing I(Y; Z).
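The channel quantities defined above can be sketched in a few lines. The following is our own illustration (the cost values and the binary symmetric channel are assumptions chosen for checkability): the crossover matrix Q_{k|j}, the expected per-symbol cost Ē, and the mutual information I(Y; Z) for an input distribution q.

```python
import numpy as np

# Mutual information I(Y;Z) in bits for an input distribution q over channel
# inputs and crossover probabilities Q[k, j] = Pr{z = z_k | y = y_j}.
def mutual_information(q, Q):
    pz = Q @ q                              # output distribution over z_k
    I = 0.0
    for j, qj in enumerate(q):
        for k in range(Q.shape[0]):
            if qj > 0 and Q[k, j] > 0:
                I += qj * Q[k, j] * np.log2(Q[k, j] / pz[k])
    return I

# Binary symmetric channel with flip probability p, uniform input:
p = 0.1
Q = np.array([[1 - p, p],
              [p, 1 - p]])
q = np.array([0.5, 0.5])
E_sym = np.array([1.0, 4.0])                # assumed symbol energy costs E_j
E_bar = q @ E_sym                           # expected cost per symbol
I = mutual_information(q, Q)                # equals 1 - H2(p) for this channel
```

For the binary symmetric channel with uniform input, I(Y; Z) reduces to the familiar 1 − H₂(p) bits per symbol.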
Correlations within the sequence Y will always decrease the total amount of transmitted information, since this is bounded above by the entropy of Y. So to maximize I(Y;Z) we can assume that the symbols of Y are independently drawn from a distribution q(y) over the channel inputs. But both I(Y;Z) and Ē depend on q(y), so formally the problem is to determine the function

$$C(E) = \frac{1}{N}\,\max_{q(y):\,\bar{E}\le E} I(Y;Z); \qquad \bar{E} = \sum_j q(y_j)\,E_j, \qquad (2.1)$$
where C(E) is called the channel capacity-cost function (Blahut, 1987). It is evident from equation 2.1 and the statistical independence of symbols in Y that C(E) = R(E), where R(E) is the constrained transmission rate discussed earlier. The channel coding theorems of classical information theory assert that reliable transmission of information is possible at any rate less than R and at no rate greater than R. Our focus is not on reliable transmission
804
V. Balasubramanian, D. Kimber, and M. J. Berry II
per se but simply on the maximum per-symbol rate R(E) at which mutual information I(Y;Z) can be established given the constraint Ē ≤ E. We now address, first in the noiseless case and then for a noisy channel, the related problems of (1) characterizing C(E), (2) determining the distribution q_E(y) that achieves C(E), and (3) finding the maximum of C(E)/E. The first two problems are of interest because an energy-optimal device or organism should achieve C(E) for whatever energy E it is operating at, requiring a very particular distribution over y. The third problem is interesting because it allows us to determine both the rate C* and energy E* at which an energy-optimal organism would operate in the exploratory regime, regardless of the details of its activity.

2.1 Efficient Noiseless Transmission. In the absence of noise, the channel input and output are equal (Y = Z), and the mutual information I(Y;Z) equals the channel input entropy H(Y). So finding the capacity at fixed energy reduces to maximizing the entropy of Y at fixed energy. Correlations within the sequence Y will always decrease the entropy, so we can assume that the symbols in Y are drawn independently from some distribution q. The purpose of the encoding process X → Y is to implement a deterministic map between the signal X and the channel input Y, in such a way that the symbols of Y are statistically independent and have a distribution q. We will not discuss how this encoding is performed in practice and will focus instead on the structure of the optimal distribution q.¹ Then the per-symbol information rate (or entropy) and energy involved in the transmission are H = −Σ_{j=1}^n q_j ln q_j and Ē = Σ_{j=1}^n q_j E_j, where q_j = q(y_j). In the immediate regime, we maximize H at fixed Ē, while in the exploratory regime we maximize H/Ē.
2.1.1 Immediate Regime. Entropy maximization at fixed average cost is a classic problem, solvable using the method of Lagrange multipliers by defining the function

$$G = -\sum_{j=1}^{n} q_j \ln q_j + \beta\Bigl(\sum_{j=1}^{n} q_j E_j - E\Bigr) + \lambda\Bigl(\sum_{j=1}^{n} q_j - 1\Bigr) \qquad (2.2)$$
and setting its derivatives with respect to β, λ, and all the q_j equal to zero. Setting ∂G/∂λ = 0 ensures that q remains a probability distribution. The conditions ∂G/∂q_j = ∂G/∂β = ∂G/∂λ = 0 can be solved simultaneously to

¹ There are standard algorithms in coding theory that perform such mappings between X and Y (Cover & Thomas, 1991). Most such algorithms are not biologically plausible, and it would be very interesting to determine whether suitable encoding algorithms can be implemented by biological hardware.
yield

$$q_j = \frac{e^{-\beta E_j}}{Z}; \qquad Z = \sum_{j=1}^{n} e^{-\beta E_j}; \qquad E = \frac{\sum_{j=1}^{n} E_j\, e^{-\beta E_j}}{\sum_{j=1}^{n} e^{-\beta E_j}} = -\frac{\partial \ln Z}{\partial \beta}, \qquad (2.3)$$
where the normalization factor Z is known as the partition function and β is implicitly determined by demanding that the average energy be E. We are simply recovering the commonplace fact of statistical physics that entropy is maximized at fixed average energy by a Boltzmann distribution with an "inverse temperature" β defined by equation 2.3. Standard results about Boltzmann distributions then tell us that the maximum information rate at fixed energy, H(E), is a convex function of E, increasing from 0 at E_min = min_j(E_j) to a maximum H_max = ln n at E_max = Σ_{j=1}^n E_j / n. (In the language of statistical physics, the "heat capacity" is positive.) Larger energies (E > E_max) lower the entropy (see Figure 2).

2.1.2 Exploratory Regime. In the exploratory regime, we maximize the information transmitted per energy cost, so we should extremize

$$\tilde{G} = \frac{H}{\bar{E}} + \lambda\Bigl(\sum_{j=1}^{n} q_j - 1\Bigr) = \frac{-\sum_{j=1}^{n} q_j \ln q_j}{\sum_{j=1}^{n} q_j E_j} + \lambda\Bigl(\sum_{j=1}^{n} q_j - 1\Bigr) \qquad (2.4)$$

with respect to λ and all the q_j. If G̃ is maximized by some distribution q̃, there is a corresponding information rate H̃ and power consumed Ẽ. We have already shown that for fixed Ẽ, the information rate is maximized by the Boltzmann distribution (see equation 2.3). So q̃ must be Boltzmann for some inverse temperature β̃. This reduces the multivariable optimization problem of maximizing G̃ to a single equation: choose q to be Boltzmann as in equation 2.3 and demand that ∂G̃/∂β = 0. It is easy to solve this condition in terms of the partition function (see equation 2.3) and H = βE + ln Z. Maximizing with respect to β gives the condition ln Z · ∂²ln Z/∂β² = 0. Solutions that maximize G̃ satisfy

$$\ln Z = 0 \;\Longrightarrow\; Z = 1. \qquad (2.5)$$
Thus, information transmission is optimized in the exploratory regime by a Boltzmann distribution with unit partition function. This selects a particular energy E* and associated entropy H*. Despite the ubiquity of the partition function in statistical physics, this is the only instance, insofar as we are aware, of a clear physical meaning assigned to a particular numerical value of Z.
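In the noiseless case, both regimes reduce to choosing β for a Boltzmann distribution. The sketch below handles the immediate regime of equation 2.3 by bisecting on β, using the fact that the average energy decreases monotonically as β grows; the symbol energies passed in are illustrative, not values from the text:

```python
import numpy as np

def boltzmann_at_energy(E_sym, E_budget, lo=-60.0, hi=60.0):
    """Noiseless immediate regime: the entropy-maximizing distribution at
    fixed average energy is Boltzmann, q_j ~ exp(-beta * E_j), with beta
    set by the constraint sum_j q_j E_j = E_budget (equation 2.3)."""
    E_sym = np.asarray(E_sym, float)
    def mean_energy(beta):
        w = np.exp(-beta * (E_sym - E_sym.min()))  # shifted for stability
        return (w * E_sym).sum() / w.sum()
    for _ in range(200):          # bisection: mean energy falls as beta rises
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean_energy(mid) > E_budget else (lo, mid)
    beta = 0.5 * (lo + hi)
    w = np.exp(-beta * (E_sym - E_sym.min()))
    q = w / w.sum()
    H = -(q[q > 0] * np.log(q[q > 0])).sum()       # entropy in nats
    return beta, q, H
```

At E_budget equal to the mean symbol energy, the procedure returns β = 0 and H = ln n, matching H_max above; pushing E_budget toward min_j E_j drives H toward 0.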
2.2 Efficient Noisy Transmission. Now consider the noisy channel. Once again the capacity will be maximized when the symbols of the sequence Y are chosen independently from some q(y), because correlations reduce transmitted information. With this assumption, and the channel cross-over probabilities defined in section 2, the channel capacity (see equation 2.1) at a fixed transmission energy becomes

$$C(E) = \max_{q(y):\,\bar{E}\le E}\Bigl[-\sum_j q_j \ln q_j + \sum_{jk} q_j Q_{k|j} \log P_{j|k}\Bigr], \qquad (2.6)$$
where P_{j|k} ≡ Pr{Y = y_j | Z = z_k} is given by

$$P_{j|k} = \frac{p(y = y_j,\, z = z_k)}{p(z = z_k)} = \frac{q_j Q_{k|j}}{\sum_j q_j Q_{k|j}}. \qquad (2.7)$$
The maximization is complicated by the dependence of P_{j|k} on q_j. An insight due to Arimoto (1972) and Blahut (1972), which still applies despite the energy constraint, is that equation 2.6 can also be written as the double maximization

$$C(E) = \max_{q(y),\,\hat{P}:\;\bar{E}\le E}\Bigl[-\sum_j q_j \ln q_j - \hat{H}(Y|Z)\Bigr], \qquad (2.8)$$
where we define Ĥ(Y|Z) ≡ Σ_j q_j Ĥ_j ≡ −Σ_{jk} q_j Q_{k|j} log P̂_{j|k}. The advantage of this form is that the capacity can be computed numerically by an iterative algorithm that alternately maximizes with respect to q_j and P̂_{j|k} while holding the other variable fixed. Each of these maximizations can be carried out using Lagrange multipliers, as in the previous derivations. The resulting algorithm can be summarized:
1. Choose arbitrary nonzero q_j^(0).

2. For t = 0, 1, 2, … repeat:

   (a) P̂_{j|k}^(t) ← q_j^(t) Q_{k|j} / Σ_j q_j^(t) Q_{k|j}

   (b) q_j^(t+1) ← e^{−βE_j − Ĥ_j^(t)} / Σ_j e^{−βE_j − Ĥ_j^(t)}, with β chosen so that Σ_j q_j^(t+1) E_j = E

   (c) If q_j^(t+1) is close to q_j^(t), stop.
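As a concrete illustration (not the authors' implementation), the iteration can be sketched as follows, with the β of step 2b fixed by an inner bisection on the monotonically decreasing average energy:

```python
import numpy as np

def capacity_at_energy(Q, E_sym, E_budget, tol=1e-10, max_iter=2000):
    """Sketch of the energy-constrained Arimoto-Blahut iteration above.
    Q[k, j] = Pr{z = z_k | y = y_j}; returns (I(Y;Z) in nats, q)."""
    E_sym = np.asarray(E_sym, float)
    n = Q.shape[1]
    q = np.full(n, 1.0 / n)                      # step 1: arbitrary nonzero q
    for _ in range(max_iter):
        # step 2a: P_hat[j, k] = q_j Q[k, j] / sum_j q_j Q[k, j]
        joint = Q * q                            # joint[k, j] = Q[k, j] q_j
        pz = joint.sum(axis=1, keepdims=True)
        P_hat = (joint / pz).T
        logP = np.where(P_hat > 0, np.log(np.where(P_hat > 0, P_hat, 1.0)), 0.0)
        H_hat = -np.einsum("kj,jk->j", Q, logP)  # "noise cost" of symbol y_j
        # step 2b: Boltzmann update, with beta fixed by the energy constraint
        def update(beta):
            w = np.exp(-beta * E_sym - H_hat)
            return w / w.sum()
        lo, hi = -60.0, 60.0                     # mean energy falls as beta rises
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if update(mid) @ E_sym > E_budget else (lo, mid)
        q_new = update(0.5 * (lo + hi))
        if np.max(np.abs(q_new - q)) < tol:      # step 2c
            q = q_new
            break
        q = q_new
    joint = Q * q                                # I(Y;Z) = sum_jk q_j Q_kj ln(Q_kj / p_z)
    pz = joint.sum(axis=1, keepdims=True)
    ratio = np.where(joint > 0, joint / (pz * q + 1e-300), 1.0)
    return (joint * np.log(ratio)).sum(), q
```

In the noiseless limit (Q the identity), Ĥ_j = 0 and the update reduces to the Boltzmann distribution of equation 2.3.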
The correctness of this generalization of the classic Arimoto-Blahut algorithm is discussed in Blahut (1972). In maximizing with respect to q in step 2b, Ĥ(Y|Z) and the energy costs play identical roles. Indeed, Ĥ(Y|Z) is
essentially the average cost due to information loss in noise, leading to the Boltzmann distribution in step 2b. This algorithm yields the capacity at fixed energy, C(E), and the associated distribution q_E(y). In the exploratory regime, numerical optimization of C(E)/E gives an optimal energy E*, associated capacity C*, and distribution q_{E*}(y).

2.3 Summary. Given the channel noise and the symbol energies, the capacity function C(E) can be computed. In the noiseless case, it is achieved by a Boltzmann distribution. For a noisy channel, C(E) is computed numerically, and in all cases the distributions produced by the algorithm above achieve metabolically optimal transmission. In the exploratory regime, the rate should be chosen to maximize C(E)/E, which is achieved in the noiseless case when Z equals 1. We have not discussed the implementation of the encoding from X into Y, which may be realized by either arithmetic or block coding methods (Cover & Thomas, 1991). How well this mapping can be approximated by biological organisms is a question for investigation.

3 Characteristics of the Efficient Code
In this section, we consider some of the properties of energy-efficient codes. First, we show that the optimal code is invariant under certain changes in the symbol energies. Then we illustrate some of the effects of adding noise.

3.1 Energy Invariances. The metabolically efficient distribution on code symbols is invariant under some transformations of the energy model in both the immediate and exploratory regimes. Regardless of whether the energy costs are assigned to the channel inputs y_i or the channel outputs z_j, the optimal immediate symbol distributions are independent of a constant shift in the energies (E_k → E_k + Δ). In the exploratory regime, the optimal distribution is independent of rescalings of the energies (E_k → λE_k). This is shown as follows.
3.1.1 Immediate Regime. In the immediate regime, we fix the average transmission energy E and carry out the Arimoto-Blahut optimization algorithm of section 2.2. First, suppose that symbol energies E_j have been assigned to the channel inputs. We choose an arbitrary starting distribution q_j^(0) for the channel inputs and iteratively perform steps 2a and 2b of the algorithm to find improved distributions q_j^(t+1). Step 2a leaves q_j^(t) unchanged. Step 2b, which computes q_j^(t+1), is manifestly invariant under a constant shift of the input energies E_j → E_j + Δ, accompanied by a shift of the average transmission energy E → E + Δ. So the energy-optimal immediate distribution is invariant under a simultaneous constant shift of all the symbol energies and the average energy. Next, suppose that symbol costs
U_k have been assigned to the channel outputs z_k. The average energy expended by a channel input y_j is E_j = Σ_k U_k Q_{k|j}. Since this relation is linear, a constant shift by Δ of the output energies U_k translates to a constant shift by Δ of the input energies E_j, leaving the optimal immediate distribution invariant.

3.1.2 Exploratory Regime. Suppose the channel inputs have energy E_j and that C(E) is the channel capacity at fixed transmission energy E. We compute the exploratory-regime optimum by setting

$$\frac{\partial\,(C(E)/E)}{\partial E} = \frac{1}{E}\frac{\partial C(E)}{\partial E} - \frac{C(E)}{E^2} = 0. \qquad (3.1)$$
It follows from the Arimoto-Blahut algorithm that the optimal input distribution at fixed transmission energy is invariant under a combined rescaling of both the input symbol energies and the average transmission energy (E_k → λE_k and E → λE). To see this, observe that step 2a of the algorithm does not change the distribution, while the condition in step 2b is solved for the new energies by rescaling β → β/λ. Since the capacity is a function only of the distribution of code symbols and not directly of the symbol energies, we conclude that the capacity for the system with rescaled energies, C_λ, satisfies the relation

$$C_\lambda(\lambda E) = C(E). \qquad (3.2)$$
To find the optimal exploratory distribution with the rescaled energies, we must solve ∂(C_λ(Ẽ)/Ẽ)/∂Ẽ = 0. Changing variables to E = Ẽ/λ and using equation 3.2, we find that

$$\frac{\partial\,(C_\lambda(\tilde{E})/\tilde{E})}{\partial \tilde{E}} = \frac{1}{\lambda^2}\,\frac{\partial\,(C_\lambda(\lambda E)/E)}{\partial E} = \frac{1}{\lambda^2}\,\frac{\partial\,(C(E)/E)}{\partial E} = 0. \qquad (3.3)$$
Since this equation is proportional to equation 3.1, the optimal exploratory distribution is invariant under a rescaling of the input energies. If we assign costs U_k to the output symbols, linearity of the relation E_j = Σ_k U_k Q_{k|j} between input and output costs implies that rescaling the output energies rescales the effective input energies and again leaves the exploratory optimum invariant.

3.2 The Effects of Noise. In general, an energy-efficient code should suppress the use of expensive symbols. However, noise can have a dramatic effect, since conveying information requires the use of reliable symbols. In fact, the noisiness of a cheaper symbol can easily lead to its suppression relative to a more expensive, but reliable, symbol. This sort of effect is particularly important in applications to biological systems and is illustrated in the toy examples that follow.
Consider a noisy channel in which six symbols {y_1, …, y_6} are transmitted as symbols {z_1, …, z_6} with channel cross-over (noise) probabilities Q_{k|j} = Pr{z = z_k | y = y_j}, as in section 2. Furthermore, let the output symbol z_n have a transmission energy of U_n = n. Then the average energy of the channel input symbol is E_i = Σ_{n=1}^6 U_n Q_{n|i}. In the absence of any noise at all, Q_{k|j} = δ_{kj} and so E_i = U_i, and the channel input and channel output distributions for the exploratory regime are both given by:

$$\Pr(y_n) = \Pr(z_n) = \frac{e^{-\beta n}}{Z}; \qquad Z = \sum_{n=1}^{6} e^{-\beta n} = 1. \qquad (3.4)$$
In other words, the channel input and output distributions are both exponential, and the weight in the exponential is determined by the condition Z = 1. In this case we find β = 0.685. Next suppose that we have nearest-neighbor noise:

$$Q = \begin{pmatrix} 1-2p & 2p & 0 & 0 & 0 & 0 \\ p & 1-2p & p & 0 & 0 & 0 \\ 0 & p & 1-2p & p & 0 & 0 \\ 0 & 0 & p & 1-2p & p & 0 \\ 0 & 0 & 0 & p & 1-2p & p \\ 0 & 0 & 0 & 0 & 2p & 1-2p \end{pmatrix}. \qquad (3.5)$$
Here Q_{k|j} is the entry in the jth row and kth column of the matrix Q. Figure 3 shows the optimal exploratory-regime distribution on channel output symbols for several values of the noise parameter p. Notice the marked deviation of the optimal output distribution from a pure exponential as the noise increases. For p = 0.25, the least energetic symbol y_1, with E_1 = 1, is suppressed so strongly that it is less likely than symbol y_2, with E_2 = 2. Among the various intricate effects we have observed in the optimal distribution as a function of noise is a phase-transition-like behavior, where the probability of a symbol evolves smoothly until the noise reaches some critical value and then drops suddenly to essentially zero. Figure 4 shows such effects for the input distribution to the channel (see equation 3.5). In statistical physics, phase transitions occur due to trade-offs between energy and entropy. Physical systems at finite temperature try to minimize their energy but maximize their entropy, leading to sharp transitions, such as the melting of ice, at a critical temperature. In our case, information lost to noise decreases the mutual information between the channel input and output, and this reduction in mutual information competes against energy minimization in the optimization. The sharp transitions as a function of noise (see Figure 4) are a result of this trade-off. Since biological signal processing systems are noisy, it is important for applications of our formalism that the noise be carefully measured and included in the model.
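The noiseless value β = 0.685 quoted above can be recovered numerically: with E_n = n, the condition Z(β) = Σ_{n=1}^6 e^{−βn} = 1 has a unique root, because Z decreases monotonically from Z(0) = 6. A bisection sketch:

```python
import numpy as np

energies = np.arange(1, 7)              # E_n = n for the six symbols
Z = lambda beta: np.exp(-beta * energies).sum()

lo, hi = 0.0, 50.0                      # Z(0) = 6 > 1 > Z(50)
for _ in range(200):                    # bisect on the decreasing Z(beta)
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if Z(mid) > 1.0 else (lo, mid)
beta = 0.5 * (lo + hi)
# beta ≈ 0.685, matching the noiseless value quoted in the text
```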
Figure 3: The effects of noise. Probability distribution of channel output symbols as a function of increasing nearest-neighbor noise. The values of p and the associated optimal β displayed here are {p = 0, β = 0.685}, {p = 0.1, β = 0.420}, {p = 0.2, β = 0.340}, and {p = 0.25, β = 0.317}.
Figure 4: Sharp transitions in symbol probabilities due to noise. Shown here is the probability of channel input symbols as a function of noise. (Top, left to right) y1 , y2 , y3 . (Bottom, left to right) y4 , y5 , y6 . Notice the different vertical scales in each panel.
4 Application to Neural Systems
Our primary motivation in analyzing energy-efficient information transmission is to provide a formalism that can make quantitative predictions about the detailed structure of neural codes. To this end, we must identify circumstances in which the neural code can be thought of as a sequence of discrete symbols with distinct energies. Given such a set of symbols, as well as a characterization of their transmission noise and energy cost, we can predict the unique symbol distribution that maximizes information transmitted per unit metabolic energy and compare this against the measured symbol distribution. The vertebrate retina provides a particularly good example. Its input is a visual image projected by the optics of the eye; its output consists of easily measured action potentials. The optic nerve, which connects the eye with the brain, represents the visual world with many fewer neurons than at any other point in the visual pathway, suggesting that principles of efficient coding may be relevant. In addition, patterns of light with particular behavioral importance (for instance, the image of a tiger) are distributed over many photoreceptor cells, the primary light sensors of the retina. This makes it difficult for any single retinal neuron to evaluate the behavioral significance of an overall image. Therefore, we expect that the value of the signal transmitted by a given optic nerve fiber is closely related to its information content in bits. Previous studies (Berry et al., 1997; Berry, Warland, & Meister, 2000) have shown that ganglion cells, the output neurons of the retina whose axons form the optic nerve, often transmit visual information to the brain using a discrete set of coding symbols. In these experiments, the retina was stimulated with a wide variety of temporal and spatial patterns of light drawn from a white noise ensemble (Berry et al., 1997).
Under these stimulus conditions, ganglion cells responded with discrete bursts of several spikes separated by long intervals of silence. The reproducibility of these firing events was very high: the timing of the first spike jittered by ~3 ms from one stimulus trial to the next, and the total number of spikes varied by ~0.5 spike. This precision implies that each event is highly informative and that events with different numbers of spikes can reliably represent different stimulus patterns. In addition, correlations between successive firing events were very weak, implying that each firing event is an independent coding symbol that carries a discrete visual message. This suggests that the size of each firing event (i.e., the number of spikes it contains) may be treated as a discrete symbol N in the retinal code. A short duration of silence may also be discretized to a symbol 0. The experimentally measured sequence of retinal ganglion cell events, discretized in this manner, is represented in our model as the output sequence Z. In addition, S is the visual stimulus to the retina, X is the output of the photoreceptors, and Y is an internal retinal variable representing the ideal retinal output
prior to the addition of noise. Repeated presentations of the same stimulus produce a distribution of ganglion cell events with a sharp peak at a certain symbol and a width that we attribute to noise. Interpreting the peak of the distribution as the intended noiseless output Y, the distribution of actual ganglion cell outputs yields the channel noise matrix required by our model. Given a measurement or an estimate of the energy consumption by events of different sizes (see below), our framework then predicts a specific optimal distribution of event sizes. Comparison of this distribution against the experimentally measured event distribution is a quantitative check of the relevance of metabolically efficient coding to the retina. More generally, our methods may be applied in any system where a suitable discretization of the neural code is available, along with a description of noise and costs. The all-or-nothing character of action potentials makes such discretization possible. By choosing an appropriate time bin, a spiking neuron's activity becomes a sequence of integer spike counts. The choice of time bin and independent "code words" will depend on the neuron being studied. The noise can be measured experimentally by repetition of an identical stimulus and observation of the resulting distribution of output symbols. The symbol energy is more difficult to access experimentally. However, Siesjo (1978) and Laughlin, de Ruyter van Steveninck, and Anderson (1998) have argued that the dominant energy cost for a neuron arises in the pumps that actively transport ions across the cell membrane. If this is true, then the symbol energy can be found by simulating the known ionic currents in a neuron to find the total charge transported during different time periods, as this charge flow must be reversed by active transport in order to maintain equilibrium.
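The event discretization described above (bursts of spikes separated by silences, with the spike count as the symbol) might be implemented as follows; the silence threshold `gap` is a hypothetical parameter, not a value reported in the text:

```python
import numpy as np

def event_symbols(spike_times, gap=0.05):
    """Group spikes separated by less than `gap` seconds into one firing
    event and return the spike count of each event as a discrete symbol.
    The 50 ms default gap is illustrative, not a measured quantity."""
    t = np.sort(np.asarray(spike_times, float))
    if t.size == 0:
        return []
    breaks = np.where(np.diff(t) > gap)[0]          # event boundaries
    sizes = np.diff(np.concatenate(([0], breaks + 1, [t.size])))
    return sizes.tolist()

# two bursts: three spikes, then two spikes
# event_symbols([0.000, 0.002, 0.004, 0.200, 0.203]) -> [3, 2]
```

Applied to repeated presentations of the same stimulus, the resulting symbol histograms would supply the channel noise matrix described above.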
Because ionic currents are large during an action potential, the symbol energy is likely to be given by a baseline metabolic cost plus an additional increment per spike, E_N = 1 + bN, where b is the ratio of spiking cost to baseline cost during the time bin. The baseline cost has components due to leak currents, synaptic currents, and other cellular metabolism. Estimates of b vary and depend on the neuron in question. While a variety of measurements indicate that electrical activity accounts for roughly half of the brain's total metabolism (Siesjo, 1978), the parameter b may still be small. In any case, since cellular metabolism is difficult to estimate and because it is unclear in the present context whether presynaptic and postsynaptic costs should be bundled into the expense of producing a spike, b can be treated as a free parameter for each neuron and varied to find the energy-efficient code that best agrees with the neuron's distribution of coding symbols. Direct determination of metabolic activity is possible for an entire tissue by measurements of oxygen consumption or heat production. Furthermore, the metabolic activity of a single neuron could be obtained by measuring the uptake of a radioactively labeled metabolic precursor, such as glucose, during stimulation of the neuron at different firing rates. Such measurements could fix or place bounds on the possible values of b.
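Under the cost model E_N = 1 + bN, the noiseless exploratory-regime prediction of section 2 is a Boltzmann distribution over event sizes, with β fixed by Z = 1. A sketch, with b = 2 as an assumed illustrative value and events of up to five spikes:

```python
import numpy as np

b, N_max = 2.0, 5                     # b is an assumed, illustrative ratio
E = 1.0 + b * np.arange(N_max + 1)    # E_N = 1 + b*N for N = 0..5 spikes
Z = lambda beta: np.exp(-beta * E).sum()

lo, hi = 0.0, 50.0                    # Z(0) = 6 > 1, so a root exists
for _ in range(200):                  # bisect on the decreasing Z(beta)
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if Z(mid) > 1.0 else (lo, mid)
beta = 0.5 * (lo + hi)
q = np.exp(-beta * E)                 # with Z = 1, q_N = exp(-beta * E_N)
H_star = -(q * np.log(q)).sum()       # entropy of the predicted event sizes
```

The final identity H* = βE* is just H = βE + ln Z with ln Z = 0, and the predicted event-size distribution is exponentially decreasing in N.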
4.1 Summary. We have outlined how the formalism developed in this article can be applied to real neurons, with particular emphasis on retinal ganglion cells. Discrete output symbols may be defined by counting the number of spikes produced within a fixed time window. The noise in each symbol can be experimentally measured, and the energy cost can be estimated. Finally, the optimal distribution of spike counts in a symbol can be computed using our methods and compared to the actual distribution used by the neuron. Such a test would determine whether the metabolic cost of information transmission is an important constraint on the structure of a neural code.

5 Discussion
We have described energy-efficient codes in two different regimes: an immediate regime, where a system's rate of information transmission is set by external constraints, and an exploratory regime, where the total amount of information transmission is set by external constraints. The optimal codes in these cases are closely related, both following a Boltzmann distribution in the symbol energies, p_j ~ e^{−βE_j}, when there is no noise. In the immediate regime, the inverse temperature β is set to yield the imposed information rate, while in the exploratory regime, β is set to make the partition function Z equal to one. With the addition of noise, the optimal code must be obtained numerically, but can always be found using a straightforward iterative scheme. In delineating the immediate and exploratory regimes, we do not expect that all of an organism's behavior can be neatly assigned to one or the other category. Instead, we propose here that they apply to some behaviors. We have argued for an immediate regime in which the transmission rate is set by the need to respond rapidly to environmental pressures. However, there will certainly also be situations where the rate is determined instead by complex interactions involving the internal needs and constraints of the organism. There are also subtleties in identifying regimes of behavior that are exploratory. We have described an idealized situation where an organism acquires a certain amount of sensory information before executing a single behavior. More realistically, the organism simultaneously acquires sensory information relevant to many possible behaviors, and the interplay between sensation and behavior is ongoing. This can be analyzed within our framework by determining the different amounts of optimal information I* associated with each behavior and then requiring that the total amount of data be gathered simultaneously.
The exploratory-regime optimization continues to determine the total rate at which the information should be gathered. The essential point is that in this regime, the organism's behavior is open-ended: it has sufficient time to choose a rate of sensory information acquisition that achieves energy efficiency, while still being able to acquire enough information to make a "good" behavioral decision among the available choices.
We have described how our formalism can be applied to a biological system, like the retina. Our methods should also be useful in the analysis of low-power engineered systems, such as mobile telephones and laptop computers, which use discrete, independent coding symbols. In this case, the engineer controls the particular choice of coding symbols, as well as the design of the encoding algorithm and the transmission channel. The energy and noise characteristics of the channel can therefore be precisely determined as inputs to our theoretical analysis. Perhaps such an exercise will help in designing low-power devices that can perform for longer times before running down their batteries. Acknowledgments
M. B. was supported by a National Research Service Award from the National Eye Institute. V. B. was supported by the Harvard Society of Fellows and the Milton Fund of Harvard University. V. B. and D. K. are grateful to the Xerox Palo Alto Research Center, the Institute for Theoretical Physics at Santa Barbara, and the Aspen Center for Physics for hospitality at various stages of this work. After this work was completed, we became aware that some of the results presented here have been obtained independently by Gonzalo Garcia de Polavieja.

References

Arimoto, S. (1972). An algorithm for computing the capacity of an arbitrary discrete memoryless channel. IEEE Trans. on Info. Theory, IT-18, 14–20.

Berry II, M. J., & Meister, M. (1998). Refractoriness and neural precision. Journal of Neuroscience, 18, 2200–2211.

Berry II, M. J., Warland, D. W., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. USA, 94, 5411–5416.

Berry II, M. J., Warland, D. W., & Meister, M. (2000). Firing events: Fundamental symbols in the retinal code. Manuscript in preparation.

Blahut, R. E. (1972). Computation of channel capacity and rate distortion functions. IEEE Trans. on Info. Theory, IT-18, 460–473.

Blahut, R. E. (1987). Principles and practice of information theory. Reading, MA: Addison-Wesley.

Buracas, G. T., Zador, A. M., deWeese, M. R., & Albright, T. D. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in the primate visual cortex. Neuron, 20(5), 959–969.

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Laughlin, S. B., de Ruyter van Steveninck, R., & Anderson, J. C. (1998). The metabolic cost of neural information. Nature Neuroscience, 1(1), 36–41.

Levy, W. B., & Baxter, R. A. (1996). Energy-efficient neural codes. Neural Computation, 8, 531–543.

Reinagel, P., & Reid, R. C. (1998). Visual stimulus statistics and the reliability of spike timing in the LGN. Soc. Neurosci. Abstr., 24, 139.

Rieke, F., Bodnar, D., & Bialek, W. (1995). Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory neurons. Proceedings of the Royal Society of London Series B, 262, 259–265.

Rieke, F., Warland, D., & Bialek, W. (1993). Coding efficiency and information rates in sensory neurons. Europhys. Lett., 22, 151–156.

Rieke, F., Warland, D., & de Ruyter van Steveninck, R. R. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.

Sarpeshkar, R. (1998). Analog versus digital: Extrapolating from electronics to neurobiology. Neural Computation, 10, 1601–1638.

Siesjo, B. K. (1978). Brain energy metabolism. New York: Wiley.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.

Warland, D. (1991). Reading between the spikes: Real-time processing in neural systems. Unpublished doctoral dissertation, University of California at Berkeley.

Warland, D. K., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. J. Neurophysiol., 78(5), 2336–2350.

Received October 26, 1999; accepted July 7, 2000.
LETTER
Communicated by Gunther Palm
Effective Neuronal Learning with Ineffective Hebbian Learning Rules

Gal Chechik
Center for Neural Computation, Hebrew University, Jerusalem, Israel, and School of Mathematical Sciences, Tel-Aviv University, Tel Aviv, 69978, Israel

Isaac Meilijson
School of Mathematical Sciences, Tel-Aviv University, Tel Aviv, 69978, Israel

Eytan Ruppin
Schools of Medicine and Mathematical Sciences, Tel-Aviv University, Tel Aviv, 69978, Israel

In this article we revisit the classical neuroscience paradigm of Hebbian learning. We find that it is difficult to achieve effective associative memory storage by Hebbian synaptic learning, since it requires network-level information at the synaptic level or sparse coding level. Effective learning can yet be achieved even with nonsparse patterns by a neuronal process that maintains a zero sum of the incoming synaptic efficacies. This weight correction improves the memory capacity of associative networks from an essentially bounded one to a memory capacity that scales linearly with network size. It also enables the effective storage of patterns with multiple levels of activity within a single network. Such neuronal weight correction can be successfully carried out by activity-dependent homeostasis of the neuron's synaptic efficacies, which was recently observed in cortical tissue. Thus, our findings suggest that associative learning by Hebbian synaptic learning should be accompanied by continuous remodeling of neuronally driven regulatory processes in the brain.
1 Introduction
Synapse-specific changes in synaptic efficacies, carried out by long-term potentiation (LTP) and depression (LTD) (Bliss & Collingridge, 1993), are thought to underlie cortical self-organization and learning in the brain. In accordance with the Hebbian paradigm, LTP and LTD modify synaptic efficacies as a function of the firing of pre- and postsynaptic neurons. In this article we revisit the Hebbian paradigm, studying the role of Hebbian synaptic changes in associative memory storage and their interplay with neuronally driven processes that modify the synaptic efficacies.

Neural Computation 13, 817–840 (2001)
© 2001 Massachusetts Institute of Technology
Hebbian synaptic plasticity has been the major paradigm for studying memory and self-organization in computational neuroscience. Within the associative memory framework, numerous Hebbian learning rules were suggested, and their memory performance was analyzed (Amit, 1989). Restricting their attention to networks of binary neurons, Dayan and Willshaw (1991) and Palm and Sommer (1996) derived the learning rule that maximizes the network memory capacity and showed its relation to the covariance learning rule first described by Sejnowski (1977). In a series of papers (Palm & Sommer, 1988, 1996; Palm, 1992), Palm and Sommer have studied the space of possible learning rules. They identified constraints that must be fulfilled to achieve nonvanishing asymptotic memory capacity and showed that effective learning can be achieved if either the mean output value of the neuron or the correlation between synapses is zero. This article extends their work in two ways: by providing a computational procedure that enforces the constraints on effective learning while obeying constraints on locality of information, and by showing how this procedure may be carried out in a biologically plausible manner. Synaptic changes subserving learning have traditionally been complemented by neuronally driven normalization processes in the context of self-organization of receptive fields and cortical maps (von der Malsburg, 1973; Miller & MacKay, 1994; Goodhill & Barrow, 1994; Sirosh & Miikkulainen, 1994) and continuous unsupervised learning as in principal component analysis networks (Oja, 1982). In these scenarios, normalization is necessary to prevent the excessive growth of synaptic efficacies that occurs when learning and neuronal activity are strongly coupled. This article focuses on associative memory learning, where this excessive synaptic runaway growth is mild (Grinstein-Massica & Ruppin, 1998), and shows that normalization processes are essential even in this simpler learning paradigm.
Moreover, while other normalization procedures can prevent synaptic runaway, our analysis shows that a specific neuronally driven correction procedure that preserves the total sum of synaptic efficacies is essential for effective associative memory storage. The following section describes the associative memory model and establishes constraints that lead to effective synaptic learning rules. Section 3 describes the main result of this article: a neuronal weight correction procedure that can modify synaptic efficacies toward maximization of memory capacity. Section 4 studies the robustness of this procedure when storing memory patterns with heterogeneous coding levels. Section 5 presents a biologically plausible realization of the neuronal normalization mechanism in terms of neuronal regulation. Finally, these results are discussed in section 6.

2 Effective Synaptic Learning Rules
We study the computational aspects of associative learning in low-activity associative memory networks with binary firing {0, 1} neurons. M uncorrelated memory patterns $\{\xi^\mu\}_{\mu=1}^{M}$ with coding level p (fraction of firing neurons) are stored in an N-neuron network. The ith neuron updates its firing state $X_i^t$ at time t by

$$X_i^{t+1} = \theta(f_i^t), \qquad f_i^t = \frac{1}{N}\sum_{j=1}^{N} W_{ij}X_j^t - T, \qquad \theta(f) = \frac{1 + \mathrm{sign}(f)}{2}, \tag{2.1}$$
where $f_i$ is its input field (postsynaptic potential) and T is its firing threshold. The synaptic weight $W_{ij}$ between the jth (presynaptic) and ith (postsynaptic) neurons is determined by a general additive synaptic learning rule that depends on the neurons' activity in each of the M stored memory patterns $\xi^\gamma$,

$$W_{ij} = \sum_{\gamma=1}^{M} A(\xi_i^\gamma, \xi_j^\gamma), \tag{2.2}$$

where $A(\xi_i, \xi_j)$ is a 2×2 synaptic learning matrix that governs the incremental modifications to a synapse as a function of the firing of the presynaptic (column) and postsynaptic (row) neurons:

$$A(\xi_i, \xi_j)\colon \qquad
\begin{array}{c|cc}
\text{postsynaptic } (\xi_i) \backslash \text{presynaptic } (\xi_j) & 1 & 0 \\
\hline
1 & a & b \\
0 & c & d
\end{array}$$
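The storage rule of equations 2.1 and 2.2 can be sketched in a few lines of code. The fragment below is an illustrative sketch of ours (function names and parameter choices are not the authors'): it builds the weight matrix for an arbitrary 2×2 learning matrix and performs one parallel update step.

```python
import numpy as np

def store_memories(patterns, A):
    """Build W via equation 2.2 for a general 2x2 learning matrix.

    patterns: (M, N) array of 0/1 memory patterns xi.
    A: 2x2 array indexed as A[xi_post, xi_pre]
       (A[1,1]=a, A[1,0]=b, A[0,1]=c, A[0,0]=d in the text's notation).
    """
    M, N = patterns.shape
    W = np.zeros((N, N))
    for xi in patterns:
        xi = xi.astype(int)
        # increment every synapse by A evaluated at its (post, pre) activity pair
        W += A[np.ix_(xi, xi)]
    np.fill_diagonal(W, 0.0)  # no self-connections
    return W

def one_step(W, x, T):
    """One parallel update of equation 2.1 (the 0/1 step function theta)."""
    f = W @ x / W.shape[0] - T
    return (f > 0).astype(float)
```

Here `A[np.ix_(xi, xi)]` evaluates the learning matrix at every (postsynaptic, presynaptic) activity pair at once; choosing `A = np.outer([-p, 1 - p], [-p, 1 - p])` recovers the covariance rule $(\xi_i - p)(\xi_j - p)$.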
In conventional biological terms, a denotes an increment following an LTP event, b denotes a heterosynaptic LTD event, and c denotes a homosynaptic LTD event. The parameters a, b, c, d define a four-dimensional space in which all linear additive Hebbian learning rules reside. In order to study this four-dimensional space, we conduct a signal-to-noise analysis, as in Amit (1989). This analysis assumes that the synaptic learning rule obeys a zero-mean constraint (E(A) = 0); otherwise, the synaptic values diverge, the noise overshadows the signal, and no retrieval is possible (Dayan & Willshaw, 1991). The signal-to-noise ratio of the neuronal input field $f_i$ during retrieval is (see section A.1)

$$\frac{\text{Signal}}{\text{Noise}} \equiv \frac{E(f_i \mid \xi_i = 1) - E(f_i \mid \xi_i = 0)}{\sqrt{\mathrm{Var}(f_i)}}
= \frac{p\,\bigl\{[A(1,1) - A(0,1)](1-\epsilon) + [A(1,0) - A(0,0)]\epsilon\bigr\}}{\sqrt{\frac{p}{N}\bigl(\mathrm{Var}\bigl[W_{ij}\bigr] + Np\,\mathrm{Cov}\bigl[W_{ij}, W_{ik}\bigr]\bigr)}}$$
$$= \sqrt{\frac{N}{M}}\;\frac{p\,\bigl\{[A(1,1) - A(0,1)](1-\epsilon) + [A(1,0) - A(0,0)]\epsilon\bigr\}}{\sqrt{p\,\mathrm{Var}\bigl[A(\xi_i, \xi_j)\bigr] + Np^2\,\mathrm{Cov}\bigl[A(\xi_i, \xi_j), A(\xi_i, \xi_k)\bigr]}}, \tag{2.3}$$
where $\epsilon$ is a measure of the overlap between the initial activity pattern $X^0$ and the retrieved memory pattern. As evident from equation 2.3 and already pointed out by Palm and Sommer (1996), when the postsynaptic covariance $\mathrm{Cov}[A(\xi_i, \xi_j), A(\xi_i, \xi_k)]$ (determining the covariance between the incoming synapses of the postsynaptic neuron) is positive and p does not vanish for large N, the network's memory capacity is bounded; it does not scale with the network size. Because the postsynaptic covariance is nonnegative (see section A.3 and appendix B), effective learning rules that obtain linear scaling of memory capacity as a function of the network's size require a vanishing postsynaptic covariance. Intuitively, when the synaptic weights are correlated, adding any new synapse contributes little new information, thus limiting the number of beneficial synapses that help the neuron estimate whether it should fire. Palm and Sommer (1996) have shown that the catastrophic correlation can be eliminated in the general case where the output values of neurons are {a, 1} instead of {0, 1}, by setting $a = -p/(1-p)$, or when p asymptotically vanishes for $N \to \infty$. In the biological case, however, it is difficult to think of the output value a as anything but a hard-wired characteristic of neurons, while the coding level p may vary from network to network or even between different patterns. Figure 1A illustrates the critical role of the postsynaptic covariance by plotting the memory capacity of the network as a function of the learning rule. This is done by focusing on a two-dimensional subspace of possibly effective learning rules parameterized by the two parameters a and b, while the two other parameters are set by using a scaling-invariance constraint and the requirement that the synaptic matrix should have a zero mean (see the table in section A.3).
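The covariance condition can be checked for any candidate rule by enumerating the eight activity combinations. The sketch below is our own illustration (the function name and example rules are ours); it computes the postsynaptic covariance of a 2×2 learning matrix for i.i.d. Bernoulli(p) activities.

```python
import numpy as np
from itertools import product

def postsynaptic_cov(A, p):
    """Cov[A(xi_i, xi_j), A(xi_i, xi_k)] for i.i.d. Bernoulli(p) activities,
    computed by exhaustive enumeration (j and k are distinct presynaptic
    neurons sharing the postsynaptic neuron i)."""
    pr = (1 - p, p)                      # pr[v] = P(xi = v)
    mean = sum(pr[i] * pr[j] * A[i, j]
               for i, j in product((0, 1), repeat=2))
    second = sum(pr[i] * pr[j] * pr[k] * A[i, j] * A[i, k]
                 for i, j, k in product((0, 1), repeat=3))
    return second - mean ** 2

p = 0.05
A_hebb = np.array([[-p**2, -p**2],       # zero-mean Hebb rule xi_i*xi_j - p^2,
                   [-p**2, 1 - p**2]])   # indexed A[post, pre]
A_opt = np.outer([-p, 1 - p], [-p, 1 - p])   # covariance rule (xi_i - p)(xi_j - p)
```

For p = 0.05 the zero-mean Hebb rule yields a strictly positive postsynaptic covariance ($p^3(1-p)$), while the covariance rule yields zero, the dichotomy behind the narrow ridge of effective rules in Figure 1A.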
As evident, effective learning rules lie on a narrow ridge (characterized by zero postsynaptic covariance), while other rules provide negligible capacity. Figure 1B depicts the memory capacity of the effective synaptic learning rules that lie on the essentially one-dimensional ridge observed in Figure 1A. It shows that all these effective rules are only slightly inferior to the optimal synaptic learning rule $A(\xi_i, \xi_j) = (\xi_i - p)(\xi_j - p)$ (Dayan & Willshaw, 1991), which maximizes memory capacity. The results show that effective learning requires a vanishing postsynaptic covariance constraint, yielding a requirement for a balance between synaptic depression and facilitation, $b = \frac{-p}{1-p}\,a$ (see equations ?? and ??). Thus, effective memory storage requires a delicate balance between LTP (a) and heterosynaptic depression (b). This constraint makes effective memory storage explicitly dependent on the coding level p, which is a global property of the network. It is thus difficult to see how effective rules can be implemented at the synaptic level. Moreover, as shown in Figure 1A, Hebbian learning rules lack robustness, as small perturbations from the effective rules may result in a large decrease in memory capacity. Furthermore, these problems cannot be circumvented by introducing a nonlinear Hebbian learning
Figure 1: (A) Memory capacity of a 1000-neuron network for different values of the free parameters a and b as obtained in computer simulations. The two remaining learning-rule parameters were determined by using a scaling-invariance constraint and the zero-synaptic-mean constraint, leaving only two free parameters (see section A.3). Capacity is defined as the maximal number of memories that can be retrieved with average overlap bigger than m = 0.95 after a single step of the network's dynamics when presented with a degraded input cue with overlap $m_0 = 0.8$. The overlap $m^\gamma$ (or similarity) between the current network's activity pattern X and the memory pattern $\xi^\gamma$ serves to measure retrieval acuity and is defined as $m^\gamma = \frac{1}{p(1-p)N}\sum_{j=1}^{N}(\xi_j^\gamma - p)X_j$. The coding level is p = 0.05. (B) Memory capacity of the effective learning rules. The peak values on the ridge of A are displayed by tracing their projection on the b coordinate. The optimal learning rule (marked with an arrow) performs only slightly better than other effective learning rules.
rule of the form $W_{ij} = g\bigl(\sum_\gamma A(\xi_i^\gamma, \xi_j^\gamma)\bigr)$, as even for a nonlinear function g the covariance $\mathrm{Cov}\bigl[g\bigl(\sum_\gamma A(\xi_i^\gamma, \xi_j^\gamma)\bigr),\, g\bigl(\sum_\gamma A(\xi_i^\gamma, \xi_k^\gamma)\bigr)\bigr]$ remains positive if $\mathrm{Cov}\bigl(A(\xi_i, \xi_j), A(\xi_i, \xi_k)\bigr)$ is positive (see appendix B). In section 4 we further show that the problem cannot be avoided by using some predefined average coding level for the learning rule. These observations put forward the idea that from a biological standpoint, requiring locality of information and a nonvanishing coding level, effective associative learning is difficult to realize with Hebbian rules alone.

3 Effective Learning via Neuronal Weight Correction
The above results show that in order to obtain effective memory storage, the postsynaptic covariance must be kept negligible. How then may effective storage take place in the brain with Hebbian learning? We now show
that a neuronally driven procedure (essentially similar to that assumed by von der Malsburg, 1973, and Miller & MacKay, 1994, to take place during self-organization) can maintain a vanishing covariance and enable effective memory storage by acting on ineffective Hebbian synapses and turning them into effective ones.

3.1 The Neuronal Weight Correction Procedure. The solution emerges
when rewriting the signal-to-noise ratio (equation 2.3) as

$$\frac{\text{Signal}}{\text{Noise}} = \frac{p\,\bigl\{[A(1,1) - A(0,1)](1-\epsilon) + [A(1,0) - A(0,0)]\epsilon\bigr\}}{\sqrt{\frac{p(1-p)}{N}\,\mathrm{Var}\bigl[W_{ij}\bigr] + \frac{p^2}{N^2}\,\mathrm{Var}\bigl(\sum_{j=1}^{N} W_{ij}\bigr)}}, \tag{3.1}$$
showing that the postsynaptic covariance is small only if the sum of incoming synapses is nearly constant (see equation A.4). We thus propose that during learning, as a synapse is modified, its postsynaptic neuron additively modifies all its synapses to maintain the sum of their efficacies at a baseline zero level. Because this neuronal weight correction is additive, it can be performed either after each memory pattern is stored or later, after several memories have been stored. Interestingly, the joint operation of weight correction over a linear Hebbian learning rule is equivalent to the storage of the same set of memory patterns with another Hebbian learning rule. This new rule has both a zero synaptic mean and a zero postsynaptic covariance, as follows:
$$\begin{array}{c|cc}
 & 1 & 0 \\
\hline
1 & a & b \\
0 & c & d
\end{array}
\quad\Longrightarrow\quad
\begin{array}{c|cc}
 & 1 & 0 \\
\hline
1 & (a-b)(1-p) & (a-b)(0-p) \\
0 & (c-d)(1-p) & (c-d)(0-p)
\end{array}$$
To prove this transformation, focus on a firing neuron in the current memory pattern. When an LTP event occurs, the pertaining synaptic efficacy is strengthened by a; thus, all other synaptic efficacies must be reduced by a/N to keep their sum fixed. Because there are on average Np LTP events for each memory, all incoming synaptic efficacies will be reduced by ap. This and a similar calculation for quiescent neurons yield the synaptic learning matrix displayed on the right. It should be emphasized that the matrix on the right is not applied at the synaptic level but is the emergent result of the operation of the neuronal mechanism on the matrix on the left, and it is used here as a mathematical tool to analyze network performance. Thus, using a neuronal mechanism that maintains the sum of incoming synapses fixed enables the same level of effective performance as would have been achieved by using a zero-covariance Hebbian learning rule, but without the need to know the memories' coding level. Note also that neuronal weight correction applied to the matrix on the right will result in the same matrix. Thus, no further changes will occur with its reapplication.
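The transformation can be verified numerically. The sketch below is our own illustration; it assumes a pattern with exactly pN firing neurons (so the empirical and nominal coding levels coincide) and, for simplicity, keeps self-synapses. Row-wise additive correction of the stored weights then reproduces storage under the transformed matrix exactly.

```python
import numpy as np

def weight_correct(W):
    """Neuronal weight correction: shift each postsynaptic row additively
    so that its incoming synaptic efficacies sum to zero."""
    return W - W.mean(axis=1, keepdims=True)

def emergent_matrix(A, p):
    """The corrected learning matrix: for each postsynaptic row the entries
    become (a-b)(1-p), (a-b)(0-p) (firing row) and (c-d)(1-p), (c-d)(0-p)."""
    out = np.empty((2, 2))
    for post in (0, 1):
        diff = A[post, 1] - A[post, 0]   # a-b for the firing row, c-d otherwise
        out[post, 1] = diff * (1 - p)
        out[post, 0] = diff * (0 - p)
    return out
```

Applying `weight_correct` twice changes nothing, matching the remark that reapplication causes no further changes.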
3.2 An Example. To demonstrate the beneficial effects of neuronal weight correction, we apply it to a noneffective rule (having nonzero covariance). We investigate a common realization of the Hebb rule, $A(\xi_i, \xi_j) = \xi_i\xi_j$, with inhibition added to obtain a zero-mean input field (otherwise the capacity vanishes), yielding $A(\xi_i, \xi_j) = \xi_i\xi_j - p^2$ (Tsodyks, 1989), or in matrix form:

Zero-mean Hebb rule: $\quad
\begin{array}{c|cc}
 & 1 & 0 \\
\hline
1 & 1-p^2 & -p^2 \\
0 & -p^2 & -p^2
\end{array}$
As evident, this learning rule employs both homosynaptic and heterosynaptic LTD to maintain a zero-mean synaptic matrix, but its postsynaptic covariance is nonzero, and it is thus still an ineffective rule. Applying neuronal weight correction to the synaptic matrix formed by this rule results in a synaptic matrix identical to the one generated without neuronal correction by the following rule:

Neuronally corrected Hebb rule: $\quad
\begin{array}{c|cc}
 & 1 & 0 \\
\hline
1 & 1-p & -p \\
0 & 0 & 0
\end{array}$
which has both a zero mean and a zero postsynaptic covariance. Figure 2 plots the memory capacity obtained with the zero-mean Hebb rule, before and after neuronal weight correction, as a function of network size. The memory capacity of the original zero-mean Hebb rule is essentially bounded, while after applying neuronal weight correction, it scales linearly with network size. Figure 3 shows that the beneficial effect of the neuronal correction remains marked for a wide range of coding level values p.

4 Effective Learning with Variable Coding Levels
The identification of the postsynaptic covariance as the main factor limiting memory storage, and its dependence on the coding level of the stored memory patterns, naturally raises the common problem of storing memory patterns with heterogeneous coding levels within a single network. The following subsection explores this question by studying networks with a learning rule that was optimally adjusted for some average coding level but that actually store memory patterns whose coding levels are distributed around it. Due to the dependence of effective learning rules on the coding level of the stored pattern, there exists no single learning rule that can provide effective memory storage of such patterns. However, the application of neuronal weight correction provides effective memory storage also under heterogeneous coding levels, testifying to the robustness of the neuronal weight correction procedure.
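Before treating variable coding levels, the benefit of weight correction can be seen directly in a one-step retrieval simulation with homogeneous coding. The sketch below is ours; N, M, p, the cue-degradation level, and the thresholds are illustrative choices, not the parameter values used in the figures.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, p, eps = 1000, 500, 0.05, 0.2

X = (rng.random((M, N)) < p).astype(float)    # memory patterns
W = X.T @ X - M * p**2                         # zero-mean Hebb rule xi_i*xi_j - p^2
np.fill_diagonal(W, 0.0)
W_corr = W - W.mean(axis=1, keepdims=True)     # neuronal weight correction

# degraded cue for pattern 0: silence a fraction eps of its firing neurons
# and fire quiescent neurons at rate eps*p/(1-p), keeping activity near p
xi = X[0]
cue = np.where(xi == 1,
               (rng.random(N) > eps).astype(float),
               (rng.random(N) < eps * p / (1 - p)).astype(float))

def overlap(x, xi):
    return ((xi - p) * x).sum() / (p * (1 - p) * N)

def retrieve(W, T):
    return (W @ cue / N - T > 0).astype(float)

# thresholds set midway between the two conditional mean fields
m_plain = overlap(retrieve(W, T=p * (1 - eps) / 2), xi)
m_corr = overlap(retrieve(W_corr, T=p * (1 - p - eps) / 2), xi)
```

With these parameters the corrected network retrieves the memory almost perfectly, while the uncorrected one loses a substantial part of the overlap; the gap widens with N, reflecting the bounded capacity of the uncorrected rule.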
Figure 2: Network memory capacity as a function of network size. While the original zero-mean learning rule has bounded memory capacity, the capacity becomes linear in network size when the same learning rule is coupled with neuronal weight correction. The lines plot analytical results, and the squares designate simulation results (p = 0.05).

4.1 Analysis. Bearing in mind the goal of storing patterns with heterogeneous coding levels, we slightly modify the model presented in section 2: the memory pattern $\xi^\mu$ ($1 \le \mu \le M$) now has a coding level $p_\mu$ (that is, it has $\mathbf{1}^T\xi^\mu = p_\mu N$ firing neurons out of the N neurons). For simplicity, we focus on a single learning rule

$$W_{ij} = \sum_{\mu=1}^{M} (\xi_i^\mu - a)(\xi_j^\mu - a), \tag{4.1}$$
where a is a parameter of the learning rule to be determined. As noted earlier, this rule is the optimal rule for the case of a homogeneous coding level (when all memories share exactly the same coding level) if a is set to their coding level. The overlap $m^\mu$ between the current network's activity pattern X and the memory $\xi^\mu$ is now defined in terms of the coding level $p_\mu$ as $m^\mu = \frac{1}{p_\mu(1-p_\mu)N}\sum_{j=1}^{N}(\xi_j^\mu - p_\mu)X_j$. A signal-to-noise analysis for the case of heterogeneous coding levels (see appendix C) reveals that both the mean synaptic weight ($E(W) = \sum_{\mu=1}^{M}(p_\mu - a)^2$) and the postsynaptic covariance ($\mathrm{Cov}[W_{ij}, W_{ik}] = \sum_{\mu=1}^{M} p_\mu(1-p_\mu)(p_\mu - a)^2$) are strictly positive, yielding

$$\frac{\text{Signal}}{\text{Noise}} \approx \sqrt{\frac{N}{M}}\;\frac{(1 - a - \epsilon)\sqrt{p_1}}{\sqrt{\frac{1}{M}\sum_{\mu=1}^{M} p_\mu^2(1-p_\mu)^2 + (2 + Np_1)\,\frac{1}{M}\sum_{\mu=1}^{M} p_\mu(1-p_\mu)(p_\mu - a)^2}}. \tag{4.2}$$

Figure 3: Comparison of network memory capacity for memory patterns with different values of the coding level p, in a network of N = 5000 neurons. The effect of neuronal correction is marked for a wide range of p values, especially at the low coding levels observed in the brain. The lines plot analytical results, and the squares designate simulation results.
Moreover, this analysis assumes that the neuronal threshold is optimally set to maximize memory retrieval. Such an optimal setting requires that the threshold be set to

$$T^{\text{optimal}}(\xi^1) = \frac{E(f_i \mid \xi_i^1 = 1) + E(f_i \mid \xi_i^1 = 0)}{2} = \left(\frac{1}{2} - a\right)(1 - a - \epsilon)p_1 + p_1\sum_{\mu=1}^{M}(p_\mu - a)^2 \tag{4.3}$$

during the retrieval of the memory pattern $\xi^1$ (Chechik, Meilijson, & Ruppin, 1998). The optimal threshold thus depends on both the coding level of
Figure 4: Memory capacity (defined in Figure 1) as a function of the network's size for various coding level distributions. Coding levels are normally distributed, $p_\mu \sim N(a, \sigma^2)$, with mean a = 0.1 and standard deviations $\sigma$ = 0, 0.01, 0.02, 0.03, clipped to (0, 1). Lines designate analytical results, and icons display simulation results.
the retrieved pattern $p_1$ and the variability of the coding levels $p_\mu$. These parameters are global properties of the network that may be unavailable at the neuronal level, making an optimal setting of the threshold biologically implausible. To summarize, the signal-to-noise analysis reveals three problems that prevent effective memory storage of patterns with varying coding levels using a single synaptic learning rule. First, the mean synaptic efficacy is no longer zero and depends on the coding level variability (see equation C.4). Second, the postsynaptic covariance is nonzero (see equation C.6). Third, the optimal neuronal threshold explicitly depends on the coding level of the stored memory patterns (see equation 4.3). As shown in the previous sections, these problems are inherent in all Hebbian additive synaptic learning rules and are not limited to the learning rule of equation 4.1. To demonstrate the effects of these problems on the network's memory performance, we have stored memory patterns with coding levels that are normally distributed around a, in a network that uses the optimal learning rule and the optimal neuronal threshold for the coding level a (see equations 4.1 and 4.3). The memory capacity of such networks as a function of the network size is depicted in Figure 4, for various values of coding level variability. Clearly, even small perturbations from the mean coding level a result in considerable deterioration of memory capacity. Moreover, this
deterioration becomes more pronounced for larger networks, revealing a bounded network memory capacity.

4.2 Effective Memory Storage. The above results show that when the coding levels are heterogeneous, both the synaptic mean and the postsynaptic covariance are nonzero. However, when applying neuronal weight correction over the learning rule of equation 4.1, the combined effect is equivalent to memory storage using the following synaptic learning rule:

$$W_{ij} = \sum_{\mu=1}^{M} (\xi_i^\mu - a)(\xi_j^\mu - p_\mu). \tag{4.4}$$
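The emergence of rule 4.4 is easy to check numerically: for a pattern with exactly $p_\mu N$ firing neurons, storing one increment of rule 4.1 and then re-centering each postsynaptic row reproduces the rule-4.4 increment exactly. The values below are arbitrary illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(4)
N, a = 400, 0.1
for k in (20, 40, 60):                       # coding levels p_mu = 0.05, 0.10, 0.15
    p_mu = k / N
    xi = np.zeros(N)
    xi[rng.choice(N, k, replace=False)] = 1.0
    dW = np.outer(xi - a, xi - a)            # rule 4.1 increment for this pattern
    dW -= dW.mean(axis=1, keepdims=True)     # neuronal weight correction (row-wise)
    dW44 = np.outer(xi - a, xi - p_mu)       # rule 4.4 increment
    assert np.allclose(dW, dW44)
```

The correction is what injects the pattern-specific coding level $p_\mu$, which no fixed learning matrix could supply.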
This learning procedure is not equivalent to any predefined learning matrix, as a different learning rule emerges for each stored memory pattern, in a way that is adjusted to its coding level $p_\mu$. As in the case of homogeneous memory patterns with a nonoptimal learning rule, the application of neuronal weight correction results in a vanishing postsynaptic covariance and mean (see appendix C) and yields

$$\frac{\text{Signal}}{\text{Noise}} = \sqrt{\frac{N}{M}}\;\frac{(1 - a - \epsilon)\sqrt{p_1}}{\sqrt{\frac{1}{M}\sum_{\mu=1}^{M} p_\mu^2(1-p_\mu)^2 + \frac{1}{M}\sum_{\mu=1}^{M} p_\mu(1-p_\mu)(p_\mu - a)^2}}. \tag{4.5}$$
A comparison of equation 4.5 with equation 4.2 readily shows that the dependence of the noise term on network size (evident in equation 4.2) is now eliminated. Thus, a neuronal mechanism that maintains a fixed sum of incoming synapses effectively calibrates to zero the synaptic mean and postsynaptic covariance, providing a memory storage capacity that grows linearly with the size of the network. This is achieved without the need to explicitly monitor the actual coding level of the stored memory patterns. The neuronal weight correction mechanism solves two of the three problems of storing memory patterns with variable coding levels, setting to zero the synaptic mean and postsynaptic covariance. But even after neuronal weight correction is applied, the optimal threshold is

$$T^{\text{Optimal}}(\xi^1) = \left(\frac{1}{2} - p_1\right)(1 - a - \epsilon)p_1, \tag{4.6}$$

retaining dependence on the coding level of the retrieved pattern. This difficulty may be partially circumvented by replacing the neuronal threshold with a global inhibitory term (as in Tsodyks, 1989). To this end, equation 2.1 is substituted with

$$X_i^{t+1} = \theta(f_i), \qquad f_i = \frac{1}{N}\sum_{j=1}^{N}(W_{ij} - I)X_j^t = \frac{1}{N}\sum_{j=1}^{N} W_{ij}X_j^t - \frac{I}{N}\sum_{j=1}^{N} X_j^t, \tag{4.7}$$
Figure 5: Memory capacity as a function of network size when storing patterns with normally distributed coding levels $p_\mu \sim N(0.1, 0.02^2)$ using the learning rule of equation 4.1. The four curves correspond to no correction at all (solid line), neuronal weight correction (dashed line), neuronal weight correction with activity-dependent inhibition (long-dashed line), and homogeneous coding ($p_\mu \equiv a$) (dot-dashed line). (A) Analytical results. (B) Simulation results of a network performing one step of the dynamics.
where I is the global inhibition term, set to $I = (\frac{1}{2} - a)(1 - a - \epsilon)$. When $E[X_j^t] = p_1$,¹ the mean neuronal field corresponds to a network without global inhibition that uses a neuronal threshold $T = (\frac{1}{2} - a)(1 - a - \epsilon)p_1$. For small $p_1$ this yields a fair approximation to the optimal threshold of equation 4.6. To demonstrate the beneficial effect of neuronal weight correction and activity-dependent inhibition, we turn again to storing memory patterns whose coding levels are normally distributed as in Figure 4, using the learning rule of equation 4.1. Figure 5 compares the memory capacity of networks with and without neuronal weight correction and activity-dependent inhibition. The memory capacity is also compared to the case where all memories have the same coding level (the dot-dashed line in Figure 5), showing that the application of neuronal weight correction and activity-dependent inhibition (long-dashed line) successfully compensates for the coding level variability, obtaining almost the same capacity as that achieved with a homogeneous coding level. Figure 6 plots the network's memory capacity as a function of coding level variability.
¹ This assumption is precise when considering a single step of the dynamics with the network initialized in a state with activity $p_1$, as considered here.
Figure 6: Memory capacity as a function of the coding level variability of the stored memory patterns. Patterns were stored in a 1000-neuron network using the learning rule of equation 4.1 with a = 0.2, while actually storing patterns with a clipped normal distribution of coding levels, $p_\mu \sim N(0.2, \sigma^2)$, for several values of the standard deviation $\sigma$.
While the original learning rule provides effective memory storage only when coding levels are close to the mean, the application of a neuronal correction mechanism provides effective memory storage even when storing an ensemble of patterns with high variability of coding levels. The addition of activity-dependent inhibition is mainly needed when the coding level variability is very high.

5 Neuronal Regulation Implements Weight Correction
The proposed neuronal weight correction algorithm relies on explicit information about the total sum of synaptic efficacies at the neuronal level. Several mechanisms for conservation of the total synaptic strength have been proposed (Miller, 1996). However, because explicit information on the synaptic sum may not be available, we turn to studying the possibility of indirectly regulating the total synaptic sum by estimating the neuronal average postsynaptic potential with a neuronal regulation (NR) mechanism (Horn, Levy, & Ruppin, 1998). NR maintains the homeostasis of neuronal activity by regulating the postsynaptic activity (input field $f_i$) of the neuron around a fixed baseline. This homeostasis is achieved by multiplying
the neuron's incoming synaptic efficacies by a common factor such that changes in the postsynaptic potential are counteracted by inverse changes in the synaptic efficacies. Such activity-dependent scaling of the quantal amplitude of excitatory synapses, which acts to maintain the homeostasis of neuronal firing in a multiplicative manner, has already been observed in cortical tissues by Turrigiano, Leslie, Desai, and Nelson (1998) and Rutherford, Nelson, and Turrigiano (1998). These studies complement their earlier work showing that neuronal postsynaptic activity can be kept at fixed levels via activity-dependent regulation of synaptic conductances (LeMasson, Marder, & Abbott, 1993; Turrigiano, Abbott, & Marder, 1994). We have studied the performance of NR-driven correction in an excitatory-inhibitory memory model where excitatory neurons are segregated from inhibitory ones in the spirit of Dale's law (Horn et al., 1998; Chechik, Meilijson, & Ruppin, 1999). This model is similar to our basic model, except that Hebbian learning takes place on the excitatory synapses,

$$W_{ij}^{excit} = \sum_{\gamma=1}^{M} A(\xi_i^\gamma, \xi_j^\gamma), \tag{5.1}$$
with a learning matrix A that has a positive mean, E(A) = a. The input field is now

$$f_i^t = \frac{1}{N}\sum_{j=1}^{N} W_{ij}^{excit}X_j^t - \frac{1}{N}\sum_{j=1}^{N} W_i^{inhib}X_j^t, \tag{5.2}$$
instead of the original term in equation 2.1. When $W^{inhib} = Ma$, this model is mathematically equivalent to the model described in equations 2.1 and 2.2. NR is performed by repeatedly activating the network with random input patterns and letting each neuron estimate its input field. During this process, each neuron continuously gauges its average input field $f_i^t$ around a zero mean by slowly modifying its incoming excitatory synaptic efficacies in accordance with

$$\kappa\,\frac{dW_{ij}^{excit}(t')}{dt'} = -W_{ij}^{excit}(t')\,f_i^t. \tag{5.3}$$
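Equation 5.3 can be illustrated with a simple Euler discretization. The sketch below is ours; the network size, rates, gamma-distributed efficacies, and the mismatched inhibitory weight are illustrative choices, not the paper's. Each neuron multiplicatively scales its incoming excitatory weights against its estimated field, driving the mean field toward zero.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 200, 0.1
kappa, dt = 1.0, 0.1
W_excit = rng.gamma(2.0, 1.0, size=(N, N))   # positive efficacies, mean 2
w_inhib = 1.5                                 # fixed inhibitory weight (deliberately mismatched)

def field(x):
    return W_excit @ x / N - w_inhib * x.sum() / N

def mean_field(trials=200):
    """Average field over random activity patterns with coding level p."""
    fs = [field((rng.random(N) < p).astype(float)) for _ in range(trials)]
    return np.mean(fs, axis=0)

f_before = mean_field()
for _ in range(300):                          # NR: present random patterns
    x = (rng.random(N) < p).astype(float)
    f = field(x)
    # Euler step of eq. 5.3: kappa dW/dt' = -W f_i, row i scaled by its own field
    W_excit *= 1.0 - f[:, None] * dt / kappa
f_after = mean_field()
```

After regulation, the average field of every neuron sits near zero: the multiplicative scaling has effectively subtracted the excess mean weight, approximating the additive weight correction when the weights are large and relatively homogeneous.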
When all $W^{excit}$ are close to a large mean value, multiplying all weights by a common factor approximates an additive change.² Figure 7 plots the

² The above learning rule results in synapses that are approximately normally distributed, $N\bigl(Mp^2, (\sqrt{Mp}\,(1-p))^2\bigr)$. Therefore, all synapses reside relatively close to their mean when M is large. We may thus substitute $W_{ij}(t') = Mp^2 + \varepsilon$ in equation 5.3, yielding $W_{ij}(t'+1) = W_{ij}(t') + \frac{d}{dt'}W_{ij}(t')/\kappa = (Mp^2 + \varepsilon)(1 - f_i/\kappa)$. As $f_i/\kappa$ and $\varepsilon$ are small, this is well approximated by $W_{ij}(t') - Mp^2 f_i/\kappa$.
Figure 7: Memory capacity of networks storing patterns via the Hebb rule. Applying NR achieves a linear scaling of memory capacity, with a slightly inferior capacity compared with that obtained with neuronal weight correction. Memory capacity is measured as in Figure 2, after the network has reached a stable state. $W_i^{inhib}$ is normally distributed with mean $E(W^{excit}) = p^2M$ and standard deviation $0.1p^2M^{0.5}$, where p = 0.1.
memory capacity of networks storing memories according to the Hebb rule $W_{ij}^{excit} = \sum_{\gamma=1}^{M} A(\xi_i^\gamma, \xi_j^\gamma) = \sum_{\gamma=1}^{M} \xi_i^\gamma \xi_j^\gamma$, showing how NR, which approximates the additive neuronal weight correction, succeeds in obtaining a linear growth of memory capacity as long as the inhibitory synaptic weights are close to the mean excitatory synaptic values (i.e., the zero-synaptic-mean constraint is obeyed). Figure 8 plots the temporal evolution of the retrieval acuity (overlap) and the average postsynaptic covariance, showing that NR slowly removes the interfering covariance, improving memory retrieval.

6 Discussion
This article highlights the role of neuronally driven synaptic plasticity in remodeling synaptic efficacies during learning. We first showed that effective associative memory learning requires a delicate balance between synaptic potentiation and depression. This balance depends on network-level information, making it difficult to implement at the synaptic level in a distributed memory system. However, our findings show that the combined action of synapse-specific and neuronally guided synaptic modifications
Figure 8: Temporal evolution of retrieval acuity and average postsynaptic covariance in a 1000-neuron network. Two hundred fifty memories are first stored in the network using the Hebb rule, resulting in poor retrieval acuity (m ≈ 0.7 at t = 0 in the upper figure). However, as NR is iteratively applied to the network, the retrieval acuity gradually improves as the postsynaptic covariance vanishes. p = 0.1, κ = 0.1. Other parameters are as in Figure 7.
yields an effective learning system. This allows the use of biologically feasible but ineffective synaptic learning rules, as long as they are further modified and corrected by neuronally driven weight correction. The resulting learning system is highly robust and enables effective storage of memory patterns with heterogeneous coding levels. The space of Hebbian learning rules had already been studied by Palm and his colleagues (Palm & Sommer, 1988, 1996; Palm, 1992). They derived the vanishing-covariance constraint determining effective learning and characterized cases where this covariance vanishes (Palm & Sommer, 1996). This article complements their work by describing the procedure of neuronal weight correction, which is capable of achieving vanishing postsynaptic covariance, and a possible biological realization of this procedure by the mechanism of neuronal regulation.

The characterization of effective synaptic learning rules relates to the discussion of the computational role of heterosynaptic and homosynaptic depression. Previous studies have shown that long-term synaptic depression is necessary to prevent saturation of synaptic values (Sejnowski, 1977) and to maintain zero-mean synaptic efficacies (Willshaw & Dayan, 1990). Our study shows that proper heterosynaptic depression is needed to enforce zero postsynaptic covariance, an essential prerequisite of effective learning. The zero-covariance constraint implies that the magnitude of heterosynaptic depression should be smaller than that of homosynaptic potentiation by a factor of $(1-p)/p$. However, effective learning can be obtained regardless of the magnitude of the homosynaptic depression changes, as long as the zero-mean constraint stated above is satisfied. The terms potentiation and depression used in the above context should be cautiously interpreted. Because neuronal weight correction may modify synaptic efficacies in the brain, the apparent changes in synaptic efficacies measured in LTD/LTP experiments may involve two kinds of processes: synaptically driven processes, changing synapses according to the covariance between pre- and postsynaptic neurons, and neuronally driven processes, operating to zero the covariance between the incoming synapses of the neuron. Although our analysis pertains to the combined effect of these processes, they may be experimentally segregated, as they operate on different timescales and modify different ion channels (Bear & Abraham, 1996; Turrigiano et al., 1998). Thus, the relative weights of neuronal versus synaptic processes can be experimentally tested by studying the temporal changes in synaptic efficacy following LTP/LTD events and comparing them with the theoretically predicted potentiation and depression end values.

Several forms of synaptic constraints were previously suggested in the literature to improve the stability of Hebbian learning, such as preserving the sum of synaptic strengths or the sum of their squares (von der Malsburg, 1973; Oja, 1982). Our analysis shows that in order to obtain effective memory storage, the sum of synaptic strengths must be preserved, thus predicting that it is this specific form of normalization that occurs in the brain.
Interestingly, spike-triggered synaptic plasticity changes in both hippocampal and tectal neurons (Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998) were recently shown to result in a normalization mechanism that preserves the total synaptic sum (Kempter, Gerstner, & van Hemmen, 1999). The effects of various normalization procedures were thoroughly studied in the context of self-organization of receptive fields and cortical maps (Miller & MacKay, 1994; Goodhill & Barrow, 1994). In these domains, it was shown that subtractive and additive enforcement of normalization constraints may lead to different representations and network architectures. It is thus interesting to note that in the case of the associative memory networks studied here, the procedure for enforcing the fixed-sum constraint has only a minor effect on memory performance. As shown in Figure 1B, memory capacity depends only slightly on the underlying synaptic learning rule, as long as the learning-rule constraints are met. Moreover, studying additive and nonadditive normalization procedures (e.g., multiplicative normalization after adding some positive constant to the synaptic weights), we found that both procedures yielded almost the same memory performance.
834
G. Chechik, I. Meilijson, and E. Ruppin
Our results, obtained within the paradigm of autoassociative memory networks, apply also to heteroassociative memory networks. More generally, neuronal weight correction qualitatively improves the ability of a neuron to discriminate correctly among a large number of input patterns. It thus enhances the computational power of the single neuron and may be applied in other learning paradigms. This interplay between cooperative and competitive synaptic changes is likely to play a fundamental computational role in a variety of brain functions, such as sensory processing and associative learning.

Appendix A: Signal-to-Noise Analysis for a General Learning Matrix

A.1 Signal-to-Noise Ratio of the Neuronal Input Field. We calculate the signal-to-noise ratio of a network storing memory patterns according to a learning matrix A with zero mean,

$$E\bigl(A(\xi_i, \xi_j)\bigr) = p^2 A(1,1) + p(1-p)A(1,0) + (1-p)p\,A(0,1) + (1-p)^2 A(0,0) = 0.$$

Restricting attention, without loss of generality, to the retrieval of memory pattern $\xi^1$, the network is initialized in a state X generated independent of the other memory patterns. This state is assumed to have activity level p (thus equal to the coding level of $\xi^1$) and overlap $m_0^1 = \frac{1-p-\varepsilon}{1-p}$ with $\xi^1$, where $\varepsilon = P(X_i = 0 \mid \xi_i^1 = 1) = \frac{1-p}{p}\,P(X_i = 1 \mid \xi_i^1 = 0)$. Denoting $W_{ij}^* = \sum_{\mu=2}^{M} A(\xi_i^\mu, \xi_j^\mu) = W_{ij} - A(\xi_i^1, \xi_j^1)$, we use the fact that $W^*$ and X are independent and that $E[W^*] = 0$, and write the conditional mean of the neuronal input field:

$$
\begin{aligned}
E\bigl[f_i \mid \xi_i^1\bigr] &= E\Bigl[\frac{1}{N}\sum_{j=1}^{N} W_{ij} X_j \,\Bigm|\, \xi_i^1\Bigr] - T \\
&= E\Bigl[\frac{1}{N}\sum_{j=1}^{N} A(\xi_i^1, \xi_j^1) X_j \,\Bigm|\, \xi_i^1\Bigr] + E\Bigl[\frac{1}{N}\sum_{j=1}^{N} W_{ij}^* X_j \,\Bigm|\, \xi_i^1\Bigr] - T \\
&= E\Bigl[\frac{1}{N}\sum_{j=1}^{N} A(\xi_i^1, \xi_j^1) X_j \,\Bigm|\, \xi_i^1\Bigr] + \frac{1}{N}\sum_{j=1}^{N} E[W_{ij}^*]\,E[X_j] - T \\
&= E\Bigl[\frac{1}{N}\sum_{j=1}^{N} A(\xi_i^1, \xi_j^1) X_j \,\Bigm|\, \xi_i^1\Bigr] - T \\
&= A(\xi_i^1, 1)\,P(X_j = 1 \mid \xi_j^1 = 1)\,P(\xi_j^1 = 1) + A(\xi_i^1, 0)\,P(X_j = 1 \mid \xi_j^1 = 0)\,P(\xi_j^1 = 0) - T \\
&= A(\xi_i^1, 1)(1-\varepsilon)p + A(\xi_i^1, 0)\,\varepsilon\,\frac{p}{1-p}(1-p) - T. \qquad \text{(A.1)}
\end{aligned}
$$
Effective Learning with Ineffective Learning Rules
835
Since $W^*$ and X are independent and the noise contribution of single memories is negligible, the variance is

$$
\begin{aligned}
\mathrm{Var}\bigl[f_i \mid \xi_i^1\bigr] &= \mathrm{Var}\Bigl[\frac{1}{N}\sum_{j=1}^{N} W_{ij} X_j\Bigr] \approx \mathrm{Var}\Bigl[\frac{1}{N}\sum_{j=1}^{N} W_{ij}^* X_j\Bigr] \\
&\approx \frac{1}{N}\,\mathrm{Var}\bigl[W_{ij}^* X_j\bigr] + \mathrm{Cov}\bigl[W_{ij}^* X_j,\, W_{ik}^* X_k\bigr] \\
&= \frac{p}{N}\,\mathrm{Var}\bigl[W_{ij}^*\bigr] + p^2\,\mathrm{Cov}\bigl[W_{ij}^*, W_{ik}^*\bigr] \\
&\approx \frac{M}{N}\,p\,\mathrm{Var}\bigl[A(\xi_i, \xi_j)\bigr] + M p^2\,\mathrm{Cov}\bigl[A(\xi_i, \xi_j),\, A(\xi_i, \xi_k)\bigr]. \qquad \text{(A.2)}
\end{aligned}
$$
The neuronal field's noise is thus dominated by the covariance between its incoming synaptic weights. Combining equations A.1 and A.2, the signal-to-noise ratio of the neuron's input field is

$$
\frac{\text{Signal}}{\text{Noise}} = \frac{E(f_i \mid \xi_i = 1) - E(f_i \mid \xi_i = 0)}{\sqrt{\mathrm{Var}(f_i \mid \xi_i)}}
= \sqrt{\frac{N}{M}}\;\frac{\bigl([A(1,1)-A(0,1)](1-\varepsilon) + [A(1,0)-A(0,0)]\varepsilon\bigr)\,p}{\sqrt{p\,\mathrm{Var}\bigl[A(\xi_i,\xi_j)\bigr] + Np^2\,\mathrm{Cov}\bigl[A(\xi_i,\xi_j),\,A(\xi_i,\xi_k)\bigr]}}. \qquad \text{(A.3)}
$$
When the postsynaptic covariance is zero, the signal-to-noise ratio remains constant as M grows linearly with N, thus implying a linear memory capacity. However, when the covariance is a positive constant, the term on the right is almost independent of N, and the memory capacity is bounded.

A.2 Obtaining Vanishing Covariance. The variance of the input field can also be expressed as

$$
\begin{aligned}
\mathrm{Var}\bigl[f_i \mid \xi_i\bigr] &\approx \mathrm{Var}\Bigl[\frac{1}{N}\sum_{j=1}^{N} W_{ij}^* X_j\Bigr] \\
&= \frac{1}{N^2}\Bigl[\sum_{j=1}^{N} \sum_{k=1, k\neq j}^{N} \mathrm{Cov}\bigl(W_{ij}^*, W_{ik}^*\bigr)\Bigr] E(X_j)^2 + \frac{1}{N^2}\Bigl[\sum_{j=1}^{N} \mathrm{Var}\bigl(W_{ij}^*\bigr)\Bigr] E(X_j) \\
&= \frac{p^2}{N^2}\,\mathrm{Var}\Bigl(\sum_{j=1}^{N} W_{ij}^*\Bigr) - \frac{p^2}{N}\,\mathrm{Var}\bigl(W_{ij}^*\bigr) + \frac{p}{N}\,\mathrm{Var}\bigl(W_{ij}^*\bigr) \\
&= \frac{1}{N}\,p(1-p)\,\mathrm{Var}\bigl[W_{ij}^*\bigr] + \frac{p^2}{N^2}\,\mathrm{Var}\Bigl(\sum_{j=1}^{N} W_{ij}^*\Bigr). \qquad \text{(A.4)}
\end{aligned}
$$
Thus, keeping the sum of the incoming synapses fixed results in a beneficial effect similar to that of removing the postsynaptic covariance and further improves the signal-to-noise ratio by a factor of $\sqrt{1/(1-p)}$. The postsynaptic sum $\sum_j W_{ij}^*$ remains fixed if each memory pattern has exactly pN firing neurons out of the N neurons of the network.

A.3 Effective Learning Rules in Two-Dimensional Space. The parameters a, b, c, d of the table in section 2 define a four-dimensional space in which all linear additive Hebbian learning rules reside. To visualize this space, Figure 1 focuses on a reduced, two-dimensional space, utilizing a scaling-invariance constraint and the requirement that the synaptic matrix should have zero mean. These yield the following rule, having only two free parameters (a, b):
$$
A(\xi_i, \xi_j) =
\begin{array}{c|cc}
 & \text{presynaptic } \xi_j = 1 & \xi_j = 0 \\
\hline
\text{postsynaptic } \xi_i = 1 & a & b \\
\xi_i = 0 & c & f(a, b, c)
\end{array}
$$

where c is a scaling constant and $f(a,b,c) = -\bigl[p^2 a + p(1-p)(c+b)\bigr](1-p)^{-2}$ is set to enforce the zero-mean constraint. The covariance of this learning matrix when setting c = 1 is

$$\mathrm{Cov}\bigl[A(\xi_i, \xi_j),\, A(\xi_i, \xi_k)\bigr] = \frac{p}{1-p}\,\bigl[p a + (1-p) b\bigr]^2, \qquad \text{(A.5)}$$

which is always nonnegative and equals zero only when

$$b = \frac{-p}{1-p}\,a. \qquad \text{(A.6)}$$
Appendix B: The Postsynaptic Covariance of Nonadditive Learning Rules
In this section we show that the postsynaptic covariance cannot be zeroed by introducing a nonadditive learning rule of the form

$$W_{ij} = g\Bigl(\sum_{\mu=1}^{M} A(\xi_i^\mu, \xi_j^\mu)\Bigr) \qquad \text{(B.1)}$$

for some nonlinear function g. To show that, note that when X, Y are positively correlated random variables with marginal standard normal distribution and $E(g(X)) = 0$, we can write (using independent normally distributed random variables U, V, W)

$$E\bigl[g(X)\,g(Y)\bigr] = E\bigl[g(U+V)\,g(W+V)\bigr] = E\Bigl[E\bigl(g(U+V)\,g(W+V) \mid V\bigr)\Bigr] = E\Bigl[E\bigl(g(U+V) \mid V\bigr)^2\Bigr] = \mathrm{Var}\Bigl[E\bigl(g(U+V) \mid V\bigr)\Bigr] \ge 0. \qquad \text{(B.2)}$$

Equality holds only when $\varphi(v) = E\bigl(g(U+V) \mid V = v\bigr) = E\bigl(g(U+v)\bigr)$ is a constant function, and as such it must be zero because $E(g(X)) = 0$. To show further that the equality holds only when g is constant, we look at

$$0 = E\bigl(g(U+v)\bigr) = \frac{1}{\sqrt{2\pi}\,\sigma}\int g(v+u)\,e^{-u^2/2\sigma^2}\,du = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-v^2/2\sigma^2}\int e^{vt/\sigma^2}\,g(t)\,e^{-t^2/2\sigma^2}\,dt, \qquad \text{(B.3)}$$

or $\Psi(v) = \int e^{vt/\sigma^2}\,g(t)\,e^{-t^2/2\sigma^2}\,dt \equiv 0$. As $\Psi(v) = 0$ is the Laplace transform of $g(t)\,e^{-t^2/2\sigma^2}$, g is almost everywhere uniquely determined, and g = 0 is essentially the only solution.
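The nonnegativity in equation B.2 is easy to confirm numerically. The sketch below is our own illustration (the choice g = tanh and the correlation value are assumptions): it builds correlated standard normals exactly as in the proof and shows that $E[g(X)g(Y)]$ comes out strictly positive for a nonconstant zero-mean g.

```python
import numpy as np

# Numerical illustration of (B.2): for X = U + V, Y = W + V with U, V, W
# independent normals, E[g(X)g(Y)] = Var[E(g(U+V)|V)] >= 0.
rng = np.random.default_rng(0)
n = 1_000_000
U, V, W = (rng.standard_normal(n) * np.sqrt(0.5) for _ in range(3))
X, Y = U + V, W + V   # marginally standard normal, correlation 0.5
g = np.tanh           # odd nonlinearity, hence E[g(X)] = 0

print(np.mean(g(X)))         # ~0
print(np.mean(g(X) * g(Y)))  # strictly positive, as (B.2) requires
```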
Appendix C: Signal-to-Noise Analysis for Heterogeneous Coding Levels
The analysis follows lines similar to those of section A.1, but this time the initialization state X has activity $p_1$ and overlap $m_0^1 = \frac{1-p_1-\varepsilon}{1-p_1}$ with $\xi^1$, where $\varepsilon = P(X_i = 0 \mid \xi_i^1 = 1) = \frac{1-p_1}{p_1}\,P(X_i = 1 \mid \xi_i^1 = 0)$. The conditional mean of the neuronal input field in this case is

$$E\bigl[f_i \mid \xi_i^1\bigr] = (\xi_i^1 - a)(1 - a - \varepsilon)\,p_1 + p_1 \sum_{\mu=2}^{M} (p_\mu - a)^2 - T, \qquad \text{(C.1)}$$

and its variance is

$$V[f_i] \approx \frac{1}{N}\,p_1\,V[W_{ij}] + p_1^2\,\mathrm{Cov}[W_{ij}, W_{ik}], \qquad \text{(C.2)}$$

yielding

$$\frac{\text{Signal}}{\text{Noise}} \approx \frac{(1 - a - \varepsilon)\,p_1}{\sqrt{\frac{1}{N}\,p_1\,V[W_{ij}] + p_1^2\,\mathrm{Cov}[W_{ij}, W_{ik}]}}. \qquad \text{(C.3)}$$
For large enough networks, the noise term in the neuronal input field is dominated by the postsynaptic covariance $\mathrm{Cov}[W_{ij}, W_{ik}]$ between the efficacies of the incoming synapses. To obtain the signal-to-noise ratio as a function of the distribution of the coding levels $\{p_\mu\}_{\mu=1}^{M}$, we calculate the relevant moments of the synaptic weights' distribution, writing $W_{ij}^\mu = (\xi_i^\mu - a)(\xi_j^\mu - a)$ for the contribution of memory $\mu$:

$$E(W_{ij}) = \sum_{\mu=1}^{M} E\bigl[W_{ij}^\mu\bigr] = \sum_{\mu=1}^{M} (p_\mu - a)^2, \qquad \text{(C.4)}$$

$$
\begin{aligned}
V(W_{ij}) &= \sum_{\mu=1}^{M} V\bigl[W_{ij}^\mu\bigr] = \sum_{\mu=1}^{M} \Bigl( E\bigl[(W_{ij}^\mu)^2\bigr] - E^2\bigl[W_{ij}^\mu\bigr] \Bigr) \\
&= \sum_{\mu=1}^{M} \bigl[p_\mu - 2 p_\mu a + a^2\bigr]^2 - \sum_{\mu=1}^{M} (p_\mu - a)^4 \\
&= \sum_{\mu=1}^{M} p_\mu (1-p_\mu)\bigl[p_\mu (1-p_\mu) + 2(p_\mu - a)^2\bigr], \qquad \text{(C.5)}
\end{aligned}
$$

and

$$
\begin{aligned}
\mathrm{Cov}\bigl[W_{ij}, W_{ik}\bigr] &= \sum_{\mu=1}^{M} E\bigl[(\xi_i^\mu - a)^2 (\xi_j^\mu - a)(\xi_k^\mu - a)\bigr] - \sum_{\mu=1}^{M} E^2\bigl[(\xi_i^\mu - a)(\xi_j^\mu - a)\bigr] \\
&= \sum_{\mu=1}^{M} p_\mu (1-p_\mu)(p_\mu - a)^2. \qquad \text{(C.6)}
\end{aligned}
$$

Substituting equations C.4 through C.6 into the signal-to-noise ratio (see equation C.3), one obtains

$$\frac{\text{Signal}}{\text{Noise}} \approx \sqrt{\frac{N}{M}}\;\frac{(1 - a - \varepsilon)\sqrt{p_1}}{\sqrt{\frac{1}{M}\sum_{\mu=1}^{M} p_\mu^2 (1-p_\mu)^2 + (2 + N p_1)\,\frac{1}{M}\sum_{\mu=1}^{M} p_\mu (1-p_\mu)(p_\mu - a)^2}}. \qquad \text{(C.7)}$$

When applying neuronal weight correction, the resulting learning procedure is equivalent to using the rule

$$W_{ij} = \sum_{\mu=1}^{M} (\xi_i^\mu - a)(\xi_j^\mu - p_\mu), \qquad \text{(C.8)}$$
which has the following moments:

$$E(W_{ij}) = 0, \qquad \text{(C.9)}$$

$$V(W_{ij}) = \sum_{\mu=1}^{M} p_\mu^2 (1-p_\mu)^2 + \sum_{\mu=1}^{M} (p_\mu - a)^2\, p_\mu (1-p_\mu), \qquad \text{(C.10)}$$

$$\mathrm{Cov}\bigl[W_{ij}, W_{ik}\bigr] = 0. \qquad \text{(C.11)}$$
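Both equation C.6 and the vanishing covariance of the corrected rule C.8 can be verified numerically. In the sketch below (our own illustration; the coding levels, sample size, and names are assumptions), the plain covariance rule exhibits the postsynaptic covariance predicted by C.6, while the corrected rule drives it to zero.

```python
import numpy as np

# Monte Carlo check of (C.6) and of the corrected rule (C.8).
rng = np.random.default_rng(0)
a = 0.1
p_mu = np.array([0.05, 0.1, 0.2])  # heterogeneous coding levels, M = 3
n = 500_000                        # independently sampled synapse triples

xi_i = rng.random((3, n)) < p_mu[:, None]  # postsynaptic memory bits
xi_j = rng.random((3, n)) < p_mu[:, None]  # presynaptic bits, synapse j
xi_k = rng.random((3, n)) < p_mu[:, None]  # presynaptic bits, synapse k

# Plain rule: W_ij = sum_mu (xi_i - a)(xi_j - a)
Wij = ((xi_i - a) * (xi_j - a)).sum(axis=0)
Wik = ((xi_i - a) * (xi_k - a)).sum(axis=0)
# Corrected rule (C.8): W_ij = sum_mu (xi_i - a)(xi_j - p_mu)
Wij_c = ((xi_i - a) * (xi_j - p_mu[:, None])).sum(axis=0)
Wik_c = ((xi_i - a) * (xi_k - p_mu[:, None])).sum(axis=0)

predicted = np.sum(p_mu * (1 - p_mu) * (p_mu - a) ** 2)  # equation (C.6)
print(predicted, np.cov(Wij, Wik)[0, 1])  # empirical covariance matches (C.6)
print(np.cov(Wij_c, Wik_c)[0, 1])         # ~0 under the corrected rule (C.8)
```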
These moments yield the signal-to-noise ratio

$$\frac{\text{Signal}}{\text{Noise}} = \sqrt{\frac{N}{M}}\;\frac{(1 - a - \varepsilon)\sqrt{p_1}}{\sqrt{\frac{1}{M}\sum_{\mu=1}^{M} p_\mu^2 (1-p_\mu)^2 + \frac{1}{M}\sum_{\mu=1}^{M} p_\mu (1-p_\mu)(a - p_\mu)^2}}. \qquad \text{(C.12)}$$

References

Amit, D. J. (1989). Modeling brain function: The world of attractor neural networks. Cambridge: Cambridge University Press.
Bear, M. F., & Abraham, W. C. (1996). Long term depression in hippocampus. Annu. Rev. Neurosci., 19, 437–462.
Bi, Q., & Poo, M. M. (1998). Precise spike timing determines the direction and extent of synaptic modifications in cultured hippocampal neurons. Journal of Neuroscience, 18, 10464–10472.
Bliss, T. V. P., & Collingridge, G. L. (1993). Synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Chechik, G., Meilijson, I., & Ruppin, E. (1998). Synaptic pruning during development: A computational account. Neural Computation, 10(7), 1759–1777.
Chechik, G., Meilijson, I., & Ruppin, E. (1999). Neuronal regulation: A mechanism for synaptic pruning during brain maturation. Neural Computation, 11(8), 2061–2080.
Dayan, P., & Willshaw, D. J. (1991). Optimizing synaptic learning rules in linear associative memories. Biol. Cyber., 65, 253.
Goodhill, G. J., & Barrow, H. G. (1994). The role of weight normalization in competitive learning. Neural Computation, 6, 255–269.
Grinstein-Massica, A., & Ruppin, E. (1998). Synaptic runway in associative networks and the pathogenesis of schizophrenia. Neural Computation, 10, 451–465.
Horn, D., Levy, N., & Ruppin, E. (1998). Synaptic maintenance via neuronal regulation. Neural Computation, 10(1), 1–18.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Physical Review E, 59, 4498–4514.
LeMasson, G., Marder, E., & Abbott, L. F. (1993). Activity-dependent regulation of conductances in model neurons. Science, 259, 1915–1917.
Miller, K. D. (1996). Synaptic economics: Competition and cooperation in synaptic plasticity. Neuron, 17, 371–374.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6(1), 100–126.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273.
Palm, G. (1992). On the asymptotic information storage capacity of neural networks. Neural Computation, 4, 703–711.
Palm, G., & Sommer, F. (1988). On the information capacity of local learning rules. In R. Eckmiller & C. von der Malsburg (Eds.), Neural computers (pp. 271–280). Berlin: Springer-Verlag.
Palm, G., & Sommer, F. (1996). Associative data storage and retrieval in neural networks. In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks III: Association, generalization and representation (pp. 79–118). Berlin: Springer.
Rutherford, L. C., Nelson, S. B., & Turrigiano, G. G. (1998). BDNF has opposite effects on the quantal amplitude of pyramidal neuron and interneuron excitatory synapses. Neuron, 21, 521–530.
Sejnowski, T. J. (1977). Statistical constraints on synaptic plasticity. J. Theo. Biol., 69, 385–389.
Sirosh, J., & Miikkulainen, R. (1994). Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71, 66–78.
Tsodyks, M. V. (1989). Associative memory in neural networks with Hebbian learning rule. Modern Physics Letters, 3(7), 555–560.
Turrigiano, G. G., Abbott, L. F., & Marder, E. (1994). Activity-dependent changes in the intrinsic properties of neurons. Science, 264, 974–977.
Turrigiano, G. G., Leslie, K., Desai, N., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical pyramidal neurons. Nature, 391(6670), 892–896.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
Willshaw, D. J., & Dayan, P. (1990). Optimal plasticity from matrix memories: What goes up must come down. Neural Computation, 2(1), 85–93.
Zhang, L., Tao, H. Z., Holt, C., Harris, W., & Poo, M. M. (1998). A critical window in the cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received April 16, 1999; accepted July 11, 2000.
LETTER
Communicated by Andrew Barto
Temporal Difference Model Reproduces Anticipatory Neural Activity

Roland E. Suri
Wolfram Schultz
Institut de Physiologie, 1700 Fribourg, Switzerland

Anticipatory neural activity preceding behaviorally important events has been reported in cortex, striatum, and midbrain dopamine neurons. Whereas dopamine neurons are phasically activated by reward-predictive stimuli, anticipatory activity of cortical and striatal neurons is increased during delay periods before important events. Characteristics of dopamine neuron activity resemble those of the prediction error signal of the temporal difference (TD) model of Pavlovian learning (Sutton & Barto, 1990). This study demonstrates that the prediction signal of the TD model reproduces characteristics of cortical and striatal anticipatory neural activity. This finding suggests that tonic anticipatory activities may reflect prediction signals that are involved in the processing of dopamine neuron activity.

1 Introduction
In a famous experiment by Pavlov (1927), a dog was trained with the ringing of a bell (stimulus) followed by food delivery (reinforcer). In the first trial, the animal salivated when food was presented. After several trials, salivation started when the bell was rung. This finding suggests that the salivation response following the bell ring reflects anticipation of food delivery. A large body of experimental evidence led to the hypothesis that Pavlovian learning is dependent on the degree of unpredictability of the reinforcer (Rescorla & Wagner, 1972; Dickinson, 1980). According to this hypothesis, reinforcers become progressively less efficient for behavioral adaptation as their predictability grows during the course of learning. The difference between the actual occurrence and the prediction of the reinforcer is usually referred to as the error in the reinforcer prediction. This concept has been employed in the temporal difference model (TD model) of Pavlovian learning (Sutton & Barto, 1990). The TD model uses reinforcement prediction errors for learning a reinforcement prediction signal. This signal was compared to anticipatory responses. As animals seem to optimize the sum of reinforcement over time (Mackintosh, 1974; Dickinson, 1980), it is the goal of the TD model to compute a desired prediction signal that reflects the sum of future reinforcement. If the reinforcement is food intake,

Neural Computation 13, 841–862 (2001)
© 2001 Massachusetts Institute of Technology
this desired prediction signal reflects the sum of available food in the future. After training of the TD model with a stimulus followed by a reinforcer, the prediction error signal increases phasically when the stimulus is presented, and the prediction signal is tonically increased during the intratrial interval. Recent studies relate the TD model to neural information processing because the reward prediction error of the TD model resembles dopamine neuron activity in situations with unpredicted rewards, fully predicted rewards, reward-predicting stimuli, and unexpectedly omitted rewards. The comparison between basal ganglia anatomy and the architecture of the TD model suggests that cortico-striatonigral pathways are involved in adaptation of dopamine neuron activities (Barto, 1995; Houk, Adams, & Barto, 1995; Montague, Dayan, & Sejnowski, 1996; Schultz, Dayan, & Montague, 1997; Suri & Schultz, 1999).

Anticipatory activity is related to an upcoming event that is prerepresented as a result of a retrieval action of antedating events, in contrast to activity reflecting memorized features of a previously experienced event (Wagner, 1978). Therefore, as in Pavlov's experiment, anticipatory activity precedes a future event irrespective of the physical features of the antedating events, which make this future event predictable (see Figure 1A). Phasic activity anticipating rewards was reported in midbrain dopamine neurons (Ljungberg, Apicella, & Schultz, 1992; Schultz, Apicella, & Ljungberg, 1993; Schultz et al., 1997; Mirenowicz & Schultz, 1994).
Tonic delay period activity that anticipates stimuli, rewards, or the animal's own actions was termed anticipatory, preparatory, or predictive and has been reported in the striatum (Hikosaka, Sakamoto, & Usui, 1989; Alexander & Crutcher, 1990a, 1990b; Apicella, Scarnati, Ljungberg, & Schultz, 1992; Schultz & Romo, 1992; Kermadi & Joseph, 1995; Tremblay, Hollerman, & Schultz, 1998; Hollerman, Tremblay, & Schultz, 1998), supplementary motor area (Alexander & Crutcher, 1990a, 1990b; Romo & Schultz, 1992), prefrontal cortex (Watanabe, 1996), orbitofrontal cortex (Tremblay & Schultz, 1999, 2000; Schultz, Tremblay, & Hollerman, 2000), premotor cortex (Mauritz & Wise, 1986), and primary motor cortex (Alexander & Crutcher, 1990a, 1990b).

The TD model was usually applied to learn to predict one reinforcer. In situations with two different anticipated rewards, we use two TD models, each processing one reward. In order to investigate the relations between anticipatory neural activity and predictive signals of the TD model (Sutton & Barto, 1990), we compare simulated predictive signals with anticipatory neural activities.

2 Description of Anticipatory Neural Activity

2.1 Phasic Anticipatory Activity. Phasic anticipatory neural responses of about 100 msec duration were reported for midbrain dopamine neurons. These neurons are activated by unpredicted rewards and by a reward following a stimulus for the first time. After repeated presentations of the
stimulus followed by the reward, the activation elicited by the reward decreases and entirely disappears after learning is completed. These neurons become activated instead by the stimulus (Ljungberg et al., 1992; Schultz, Apicella, & Ljungberg, 1993; Schultz et al., 1997; Mirenowicz & Schultz, 1994; Schultz, 1998).

2.2 Tonic Anticipatory Activity. Tonic anticipatory activity was found in subsets of cortical and striatal neurons. Before learning, such neurons often respond to specific events. When this event becomes predicted in the course of learning, these responses seem to become progressively preceded by anticipatory neural activity (Hikosaka et al., 1989; Tremblay et al., 1998). After learning, anticipatory neural activity often progressively increases between the first (predicting) event and the second (predicted) event. This activity starts increasing at the onset of the predicting event or during the interevent interval. Alternatively to this progressive increase, the responses can be limited to the predictive and to the predicted event, as shown in Figure 4C (Hikosaka et al., 1989; Alexander & Crutcher, 1990a, 1990b; Apicella et al., 1992; Kermadi & Joseph, 1995; Watanabe, 1996). Tonic anticipatory activity was reported to precede stimuli, reinforcers, and movements in delayed-response tasks (Apicella et al., 1992; Schultz & Romo, 1992; Schultz, Apicella, Scarnati, & Ljungberg, 1992; Tremblay et al., 1998; Hollerman et al., 1998). Each correct trial of this task consists of an instruction stimulus, a delay period, a trigger stimulus, a behavior, and a reward. Animals have to remember the instruction stimulus to react correctly to the trigger stimulus. Apicella and collaborators (1992) reported anticipatory neural activity in the monkey striatum after training a go-nogo version of this task. Of the 1173 studied striatal neurons, 615 showed some change in activity during task performance.
The activity of 193 task-related neurons increased in advance of at least one task component: the instruction stimulus (16 neurons), the trigger stimulus (15 neurons), the animal's movement (56 neurons), or the reward delivery (87 neurons) (see Figure 1B). These neurons with anticipatory activity were found in dorsal and anterior parts of caudate and putamen and were slightly more frequent in the proximity of the internal capsule. Tremblay and collaborators (Tremblay & Schultz, 1999, 2000; Schultz et al., 2000) trained monkeys in delayed-response tasks in which each instruction stimulus preceded presentation of a specific reward (two liquids with different taste). Instruction stimulus A was followed by reward X, instruction stimulus B was followed by the same reward X, and instruction stimulus C was followed by reward Y. Neural activity was recorded in six-layered parts of orbitofrontal areas 11 and 14 and rostral area 13. These neurons showed three principal types of activation: responses to instructions (15% of 1095 tested neurons), responses following reward (8%), and sustained activations preceding reward (9%). Unrewarded control trials demonstrated that all three types of activation were influenced by the expected reward.
The prereward activations began several seconds before the reward and subsided less than 1 sec after reward delivery. Tonic anticipatory activities can be specific for anticipated stimuli (Hikosaka et al., 1989; Alexander & Crutcher, 1990b; Apicella et al., 1992; Kermadi & Joseph, 1995; Tremblay et al., 1998; Hollerman et al., 1998). Kermadi and Joseph (1995) analyzed the neural activity of 2100 neurons in the caudate nucleus. During the instruction phase, a sequence of three visual targets was presented, and the monkey was required to fixate on a central fixation point. In the subsequent behavioral phase, the monkey had to press
Figure 1: Facing page. (A) Illustration providing a criterion for the expression "activity anticipating event X." The three well-trained trial types "event A followed by event X," "event B followed by event X," and "event C followed by event Y" are assumed to be separated by sufficiently long intertrial intervals and are presented randomly intermixed. Presentations of event A (left) and event B (middle), which both precede presentation of event X, increase the activity. Event C, which precedes event Y, does not influence the activity (right side). This last control trial shows that the anticipatory activity is specific for event X. Furthermore, this control trial indicates that the activity is not related to a common physical feature of the events A, B, and C and therefore does not reflect memorization of these preceding events. The responses following event X and event Y are not shown because they are not relevant for this criterion. Events A and B were termed "predictive" events and event X the "predicted" or "anticipated" event. (B) Population activity of expectation- and preparation-related striatal neurons (figure from Apicella et al., 1992). (Top) Activation of 16 neurons preceding instruction onset. The intertrial interval was 4–7 seconds. (Middle) Activation of 44 neurons preceding the trigger stimulus in go trials. The histogram is split because the intervals between instruction and trigger varied from 2.5 to 3.5 sec. Neurons responding to the trigger stimulus, activated during movement, or activated before instruction or reward are excluded. (Bottom) Activation of 68 neurons preceding reward in no-go trials. The activation began to a modest extent before trigger onset, gained increasingly in amplitude after trigger onset, and reached its peak when the reward was delivered. In each display, histograms for each neuron normalized for trial number are added, and the resulting sum is divided by the number of neurons. (C) Activity of this neuron in the caudate anticipated presentation of stimulus L only if stimulus L occurred in the sequence ULR (figure from Kermadi & Joseph, 1995). During the shown instruction phase of the task, the monkey withheld movements and fixated on a central fixation point. The three visual stimuli—L (left target), U (upper target), and R (right target)—were presented in the six sequences LUR, RLU, URL, LRU, ULR, and RUL (top of each subfigure). Each cell discharge is indicated by a dot, and the 6 to 9 successive trials per neuron are shown on successive lines (middle of each subfigure). The histogram shows the sum of the individual discharges (bottom of each subfigure).
three levers in the order indicated by the instruction phase. Six different sequences of the three targets were presented in the instruction phase. Because these six sequences were always presented in the same order, the three target stimuli in each trial were completely predictable. Of 125 neurons responding in the instruction phase, the activity of 81 neurons preceded the presented stimuli. The activity of 46 neurons anticipated the offset of the central fixation point, the activity of 7 neurons anticipated the illumination of any target, the activity of 17 neurons anticipated the illumination of the first target, and the activity of 11 neurons anticipated the onset of specific targets. In a majority (35 neurons), the responses to specific targets were modulated by the rank of the target in the sequence or by complex relationships with other targets. Anticipatory activity started increasing about 1 second before stimulus onset. Then this activity progressively increased until it reached its maximum at the onset of the anticipated stimulus. A neuron with activity anticipating the specific target L in the specific sequence ULR is shown in Figure 1C.

2.3 Anticipatory Activity in Paradigms Without Delay Period. Although we do not intend to reproduce anticipatory neural activity in paradigms without a delay period, we briefly mention such findings here. Activity of head-direction cells in the anterior thalamus anticipates the future head direction by a neuron-specific duration between 0 and 50 msec (Blair, Lipscomb, & Sharp, 1997). Event-specific anticipatory activity can also depend on the future behavior of the animal. Activity that anticipates the retinal consequences of intended eye movements by about 100 msec was reported in frontal eye fields (Goldberg & Bruce, 1990; Umeno & Goldberg, 1997), superior colliculus (Walker, Fitzgibbon, & Goldberg, 1995), parietal cortex (Duhamel, Colby, & Goldberg, 1992), and striate cortex (Nakamura & Colby, 1999).

3 Description of the TD Model
In Pavlovian learning paradigms, animals often learn to estimate the time of reward occurrence (Gallistel, 1990). Therefore, the TD model of Pavlovian learning (Sutton & Barto, 1990) proposes a time-estimation mechanism. The same time-estimation mechanism was also used to reproduce the finding that dopamine neuron activity is decreased below baseline levels when an expected reward is omitted (Montague et al., 1996; Schultz, 1998). This time-estimation mechanism is implemented by assuming that each stimulus is represented with a series of short components following stimulus onset. This is achieved by mapping each stimulus to a fixed temporal pattern of phasic signals x1(t), x2(t), . . . that follow stimulus onset with varying delays. This temporal pattern is referred to as a complete serial compound stimulus or temporal stimulus representation (see Figure 2A). Note that the choice of these signals is rather arbitrary. The single components have
been proposed to be phasic (Sutton & Barto, 1990), sustained (Desmond & Moore, 1988, 1991), or phasic immediately after the stimulus onset and then progressively more sustained (Grossberg & Schmajuk, 1989; Brown, Bullock, & Grossberg, 1999; Suri & Schultz, 1999). In some models the representation of a stimulus depends on successive events (Dominey, Arbib, & Joseph, 1995; Suri & Schultz, 1999). Although the shapes of the components differ among these models, the learned prediction and prediction error signals are usually not affected by this choice. The temporal stimulus representation is used to compute the reward prediction signal with the adaptive weights Vm(t) (see equation A.2; Sutton & Barto, 1990; Montague et al., 1996; Schultz et al., 1997; Suri & Schultz, 1998). A representation of the TD model using a neuron-like element is shown in Figure 2B.

According to the TD model, the reward prediction develops during learning in a similar way as the animal's anticipatory behavior. The reward prediction increases gradually before an anticipated reward if this reward is completely predicted. The rate of this gradual increase is determined by the constant γ, which is referred to as the temporal discount factor. The value of the discount factor γ was estimated from the time course of measured anticipatory neural activity. We usually used the standard value γ = 0.99 per time step (1 time step = 100 msec), which led to an increase in the prediction signal of 1% each 100 msec (see section 6). Previously, γ = 0.98 per 100 msec had been estimated from dopamine neuron activity (Suri & Schultz, 1999).

The TD model learns the reward prediction signal from stimuli antedating reward occurrence using a signal that reflects "errors" in the reward prediction. The TD model uses the difference between the actual occurrence and the prediction of the reward as this reward prediction error.
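The signals described here can be reproduced with a few lines of code. The following sketch is our own minimal implementation in the spirit of Sutton and Barto (1990), not the authors' published simulation: the learning rate, trace decay, and trial layout are illustrative assumptions. It trains a complete serial compound TD model on one stimulus–reward pair and yields a prediction signal that ramps up between stimulus and reward.

```python
import numpy as np

# Minimal complete-serial-compound TD model (hedged sketch; parameters assumed).
T = 70                 # 100-msec time steps per trial
gamma = 0.99           # temporal discount factor (value used in the text)
beta = 0.3             # learning rate (assumed)
trace_decay = 0.9      # eligibility-trace decay (assumed)
stim_on, reward_on = 5, 55

# One phasic representation component per post-stimulus delay:
x = np.zeros((T, T))
for t in range(stim_on, T):
    x[t, t - stim_on] = 1.0
r = np.zeros(T)
r[reward_on] = 1.0     # reward signal

V = np.zeros(T)        # adaptive weights V_m
for trial in range(300):
    trace = np.zeros(T)
    p_prev = 0.0
    for t in range(T):
        p = float(V @ x[t])                 # reward prediction p(t)
        e = r[t] + gamma * p - p_prev       # reward prediction error e(t)
        V += beta * e * trace               # adapt weights on eligible traces
        trace = trace_decay * trace + x[t]  # slowly decaying stimulus traces
        p_prev = p

p_signal = x @ V       # learned prediction across one trial
print(p_signal[stim_on], p_signal[reward_on - 1])
```

After training, p_signal rises gradually from stimulus onset toward reward delivery, approaching 1 just before the reward; with γ = 0.99 the ramp grows by about 1% per 100-msec step, which is the tonic anticipatory profile discussed in section 2.2.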
Thus, the reward prediction error e(t) is computed from discounted temporal differences in the prediction signal p(t) and from the reward signal (see equation A.3). In order to minimize these prediction errors, the elements of the weight matrix Vm(t) are incrementally adapted according to the product of the prediction error with eligibility traces of the temporal stimulus representation (see equation A.4). These traces are defined as slowly decaying versions of the representation components (see equation A.5). Such stimulus traces were originally introduced to explain learning in situations with a delay between the stimulus and the reinforcer, as they bridge the time interval between the predictive stimulus and the reinforcer (Hull, 1943). Although TD models with a complete temporal stimulus representation learn without representation traces (Montague et al., 1996), the proposed model uses traces to accelerate learning (Sutton & Barto, 1998).

3.1 One TD Model for Each Event. Since rewards are usually accompanied by certain sensory stimuli, it was proposed to represent a reward for the TD model as a composite of a reward and a stimulus (Suri & Schultz, 1999). With this approach, the reward prediction error signal is phasically
activated at reward onset after repeated reward presentations. We use the same approach in this study and provide a copy of the reward signal as an additional stimulus to the TD model. The TD model processes two types of input: input that the model learns to predict (usually the reward) and input that serves as information for these predictions (usually the stimuli). The first event in a trial (usually a stimulus) serves as the information to learn prediction signals for the second event (usually the reward). Since behaviorally important stimuli are preceded by anticipatory neural activity (see Figure 1C), we want to compute prediction signals not only for rewards but also for stimuli. Therefore, we propose a model that does not distinguish between stimuli and rewards. We use
several TD models, each computing predictions for each event (stimulus or reward) that occurs in the simulated paradigm. The TD models receive all events as information input in order to compute optimal prediction signals (see Figure 2C). Since these TD models are independent, each of them is mathematically equal to the standard TD model, and all the simulation results (except Figure 5C) could be computed with the standard TD model.

4 Model Simulations

4.1 Pretraining with Rewards Alone. For the experimental situations, animals were typically familiar with the single rewards occurring in the experiments. Therefore, the model was always pretrained with 20 presentations of the single rewards alone. These single rewards were presented during 1 second. Reward presentations were separated by intervals that were long enough to prevent learning of associations between rewards.

4.2 Delayed-Response Task. We did not intend to reproduce anticipatory neural activity for all events of the delayed-response task but simulated a trial with a presentation of a stimulus followed after a delay by presentation of a reward. The duration of the instruction stimulus (1 sec) and the duration of the intratrial interval (5 sec) corresponded to similar durations in the monkey experiments. Trigger stimulus and the animal's movements were not modeled. Simulated intertrial intervals were long enough to avoid
Figure 2: Facing page. (A) Temporal stimulus representation. Each stimulus u_l(t) is followed by a series of phasic signals x1(t), x2(t), x3(t), . . . that cover the trial duration. The first component of this temporal representation peaks with amplitude one (line 2), the second with amplitude δ (line 3), the third with amplitude δ² (line 4), and so on. The representation computed with the standard value δ = 1 is shown (without decay of the temporal representation). (B) TD model for one stimulus and one reward (Sutton & Barto, 1990). For the stimulus u(t) the temporal stimulus representation x1(t), x2(t), x3(t), . . . is computed. Each component x_m(t) is multiplied with an adaptive weight V_m(t) (filled dots). The reward prediction p(t) is the sum of the weighted representation components of all stimuli. The difference operator Δ takes temporal differences of this prediction signal (discounted with factor γ). The reward prediction error e(t) reports deviations from the desired prediction signals. This error is minimized by incrementally adapting the elements of the weights V_m(t) proportionally to the prediction error signal e(t) and to the learning rate β. (C) Two TD models for two events u1(t) and u2(t). Each event signal u_l(t) reports about a stimulus or a reward that occurs in the experimental paradigm. All events are modeled as a composite of a stimulus and a reward. Each temporal representation component x_m(t) is multiplied with an adaptive weight V_lm (filled dots). The event prediction p_l(t) is computed from the sum of the weighted components.
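The serial-compound representation described in this caption can be sketched as follows. This is an illustrative assumption, not the authors' code: the function name and the one-time-step pulse shape are hypothetical, while the parameter values (70 components, δ as decay per 100 msec step) follow the text.

```python
import numpy as np

def temporal_representation(event_onset, n_components=70, n_steps=120, decay=1.0):
    """Temporal representation of one event (cf. Figure 2A).

    Component m (0-indexed here) is a phasic signal that peaks m time
    steps after event onset with amplitude decay**m; decay = 1 gives
    the standard representation without forgetting. One time step
    corresponds to 100 msec in the paper.
    """
    x = np.zeros((n_components, n_steps))
    for m in range(n_components):
        t_peak = event_onset + m
        if t_peak < n_steps:
            x[m, t_peak] = decay ** m  # peak amplitudes 1, decay, decay**2, ...
    return x

# With decay = 0.8 per 100 msec, each additional 100 msec of
# stimulus-peak interval shrinks the peak by 20%, as in section 4.
x = temporal_representation(event_onset=10, decay=0.8)
```

With decay = 1, the components tile the whole trial at constant amplitude, which is what allows the prediction signal to bridge long stimulus-reward intervals.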
Roland E. Suri and Wolfram Schultz
associations between trials. If only one reward type was delivered in the animal experiment, the model was trained with 20 trials in which stimulus A (instruction) was followed by reward B. If three different instruction stimuli preceded delivery of two different rewards, the model was trained with the corresponding three pairs of events (stimulus A → reward X, stimulus B → reward X, and stimulus C → reward Y). Each pair was presented 20 times. Tonic and phasic anticipatory activity was compared with prediction signals and prediction error signals, respectively. The temporal discount factor γ was chosen to approximate the time course of the anticipatory neural activity. The chosen stimulus representation covered equally the whole interval between stimuli and rewards without "forgetting" the stimulus presentation. However, not all neurons may have access to such a complete temporal stimulus representation. We therefore examined the influence of an incomplete stimulus representation that decayed rapidly after presentation of stimuli. This was achieved by setting the value of the decay rate δ to 0.8 per 100 msec (see the legend to Figure 2A). Using this parameter value, the peaks of the stimulus representation components decreased 20% for each additional 100 msec of stimulus-peak interval. This model with incomplete stimulus representation was trained according to the schedule with two rewards.

5 Results
During the pretraining with repeated presentations of the reward (see section 4), the time of reward onset was unpredictable, but the reward duration remained constant. After pretraining, the prediction signals correctly decreased during reward presentation, reflecting the remaining reward duration, and the prediction error signals increased phasically at the reward onset (see Figure 3A). The model was trained with a stimulus A followed by a reward B (see Figure 3B). When the discount factor γ was set to the standard value of 0.99 per 100 msec (left), the prediction signals increased at onset of stimulus A and then progressively increased at a rate similar to the desired rate of 1% per 100 msec (see left, line 3). When the model was trained with the discount factor γ = 0.85, the prediction signals increased at a rate similar to the desired rate of 15% per 100 msec (see right, line 3). For γ = 0.99, the interstimulus interval was too short to learn the desired prediction signals for times before presentation of stimulus A. Therefore, the prediction error was phasically increased at the onset of stimulus A (see bottom, left side). For γ = 0.85, the interstimulus interval was long enough to learn the desired prediction signals, which led to a very small prediction error signal (see bottom, right side). Simulated reward prediction signals of the proposed model were comparable with anticipatory neural activity measured in the putamen (part of
Figure 3: (A) Pretraining with a single reward. For the first presentation of a novel reward u1(t) of 1 sec duration (left side, line 1), the prediction signal p1(t) (left, line 2) was zero, as the weight matrix V_lm was initialized with zeros (see equation A.2). The prediction error signal e1(t) (left, line 3) was equal to the reward signal u1(t) (see equation A.3). After 20 presentations of this reward (right side), learning was completed and the duration of the reward u1(t) was correctly predicted (right side, line 2). The prediction signal decreased during the reward presentation, as it correctly reflected the remaining future reward duration. The prediction error signal (right side, line 3) increased phasically at the reward onset, as the reward onset was unpredictable. The discount factor was set to the value of γ = 0.99 per time step (1 time step = 100 msec). (B) After 20 presentations of stimulus A (line 1) followed by reward B (line 2). The model was trained with the standard value of 0.99 for the discount factor γ (left side) and the value of 0.85 (right side). Both prediction signals were learned correctly. The increases in the prediction signals were equal to the desired rates of 1% per 100 msec (left side) and 15% per 100 msec (right side). For γ = 0.99 (left side), the prediction error signals increased phasically at onset of stimulus A, as onset of stimulus A was unpredictable. For γ = 0.85 (right side), the prediction error signal was almost zero for all time steps, because the simulated prediction signals were close to the desired prediction signals. (This figure was computed without representation decay (δ = 1).)
striatum), and reward prediction error signals were comparable with activity of midbrain dopamine neurons (see Figure 4). When the stimulus and the reward were presented in temporal succession for the first time (see Figure 4A), the reward prediction signal was not affected by presentation of the stimulus, as the stimulus was not associated with reward. At the time of the unpredicted reward, the reward prediction signal was increased (see Figure 4A, line 3). Similar to this simulated signal, activity of a subset of putamen neurons was not affected by presentation of the stimulus and was increased by the reward (see line 4). The simulated reward prediction error was phasically increased at the reward onset as the reward was unpredicted (see line 5). This signal was comparable to the activity of midbrain dopamine neurons (bottom). Simulated signals were then compared with neural activities after learning (see Figure 4B). The simulated reward prediction signal was already increased at stimulus onset and progressively increased until reward onset (see line 3) as the stimulus predicted the reward. This signal correctly increased between the stimulus and the reward about 1% for each 100 msec, because the discount factor γ was set to 0.99 per 100 msec (standard value). The simulated reward prediction signal was comparable to reward anticipatory activity of a subset of striatal neurons (see line 4). The signal representing the reward prediction error was phasically increased by the unpredicted stimulus but not affected by the predicted reward (see line 5). This response to the stimulus was smaller than the response to the unpredicted reward before learning (compare Figure 4A, line 5) as the prediction signal (see Figure 4B, line 3) progressively increased according to the discount factor. The reward prediction error was comparable to dopamine neuron activity (see Figure 4B, bottom).
Simulated prediction signals were also comparable with reward-specific anticipatory activity recorded in orbitofrontal cortex (see Figure 5). Monkeys had been trained in a delayed-response task with three instruction stimuli, A, B, and C, followed by two different rewards, X and Y (see section 2). The model had been trained with the corresponding pairs of events (see section 4). In trials without occurrence of reward Y, prediction of reward Y was not affected (see Figure 5A, top, left, and middle). In trials with occurrence of reward Y, this prediction signal was activated when stimulus C was presented and then progressively increased until reward Y (see Figure 5A, top, right), because reward Y was completely predicted by stimulus C. Prediction of reward Y was comparable to reward-specific activity of a subset of orbitofrontal neurons anticipating reward Y but not reward X (see Figure 5A, bottom). The model was trained with the same pairs of events, but the value of 0.95 per 100 msec was used for the temporal discount factor γ. Therefore, prediction signals increased more rapidly, at a correct rate of about 5% per 100 msec (see Figure 5B, top). Prediction of reward X was only slightly increased at the onset of stimuli A and B and then increased rapidly un-
Figure 4: Comparable time courses of predictive signals and activity histograms for a stimulus (line 1) preceding a reward (line 2). The same timescale applies to simulated trajectories and histograms of neural activity. Activity histograms were selected from delayed-response tasks with the simulated stimulus corresponding to the instruction stimulus. Simulated signals were selected from the above simulations (see Figure 3A, right, and Figure 3B, left). (A) Before learning. A typical reward prediction signal was increased when the reward was presented (line 3). This signal was comparable to the activity histogram of a set of putamen neurons aligned to an unpredictable reward (line 4) (from Schultz, Apicella, Ljungberg, Romo, & Scarnati, 1993). A typical signal reflecting reward prediction errors was phasically increased at onset of the reward (line 5). This signal was comparable to dopamine neuron activity (bottom). These neurons do not respond to a small instruction light (from Ljungberg et al., 1992) but rather to an unpredictable drop of liquid reward delivered to the mouth of the animal (from Mirenowicz & Schultz, 1994). (B) After learning (20 stimulus-reward pairings). A typical reward prediction signal had already increased when the stimulus was presented and then progressively increased until occurrence of the reward (line 3). This reward prediction signal was comparable to anticipatory activity of a putamen neuron (line 4; from Apicella et al., 1992). This neural activity seems to anticipate the future reward, because it was also increased before the reward regardless of the instruction stimulus that indicated reward delivery. Of 1173 studied neurons, 6 striatal neurons showed similar sustained reward anticipatory activity lasting over the task duration (compare Figure 1B). A typical signal reflecting reward prediction errors was already phasically increased at the stimulus onset and at baseline level when the reward was presented (line 5).
This signal was comparable to the activity of dopamine neurons, which respond after learning to a small instruction light but not to a predictable drop of liquid reward (bottom) (from Ljungberg et al., 1992).
til reward X (see top, left, and middle), because reward X was completely predicted by the stimuli A and B. Prediction of reward X was not affected in trials without reward X (see top, right side). Prediction of reward X was comparable to the activity of a subset of orbitofrontal neurons with activity anticipating reward X (see Figure 5B, bottom). Figures 5A and 5B demonstrate that simulated prediction signals and anticipatory neural activities discriminate between specific predicted rewards. We studied the influence of a rapidly decaying stimulus representation on the prediction signal. After 20 presentations of the stimulus-reward pairs, prediction of reward X was activated by the stimuli A and B and by the reward X (see Figure 5C, top). Since learning was not completed, prediction of reward X was associated with the correct predictive stimuli A and B but did not bridge the time between the stimuli and the reward. The responses to the predictive stimuli A and B were learned with the representation traces, because the traces still bridged the time gap between the stimuli and the rewards. Prediction of reward X was comparable to the activity of a subset of orbitofrontal neurons anticipating reward X but not reward Y (see Figure 5C, bottom).

6 Discussion
This study demonstrates for the first time that an adaptive model can reproduce characteristics of anticipatory delay period activity. Each of the TD models learned a tonic reward-specific prediction signal using a phasic reward-specific prediction error signal. Since the reward prediction error signal was computed as in previous TD models (Montague et al., 1996; Schultz et al., 1997; Suri & Schultz, 1999), characteristics of the reward prediction error signal were comparable to phasic activities of midbrain
dopamine neurons (see Figure 4). In addition, simulated reward-specific prediction signals were comparable to reward-specific anticipatory activity of subsets of cortical and striatal neurons before and after learning the contingency between a stimulus and a reward (see Figures 4 and 5). Variation of two model parameters, the discount factor and the decay rate of the temporal stimulus representation, reproduced variations in time courses of anticipatory neural activity (see Figure 5). Although the shown model simulations do not include learning between more than two events, the model can be
Figure 5: Facing page. Comparable time courses of prediction signals and orbitofrontal activities after learning pairings between different stimuli and rewards. Simulated stimuli are compared with the instruction stimuli in a delayed-response task. (Histograms reconstructed from Tremblay & Schultz, 1999, 2000; Schultz et al., 2000; see section 2. Durations of simulated stimuli, rewards, and intratrial interval as in Figure 4.) (A) Prediction of reward Y. In trials without reward Y, all signals reflecting prediction of reward Y were zero (top, left, and middle). When stimulus C preceded reward Y, the signal reflecting prediction of reward Y was activated when stimulus C was presented and then progressively increased until reward Y (top, right side). Prediction of reward Y was comparable to the activity of an orbitofrontal neuron anticipating reward Y but not reward X (bottom). (In the histogram at bottom, right, neural activity before the task was larger than after the task, because the previous task already predicted reward Y.) (B) Prediction of reward X was learned with a discount factor γ = 0.95 per 100 msec. This signal slightly increased when stimuli A or B were presented and then increased rapidly until reward X (top, left, and middle). This signal was zero in trials without reward X (top, right). The prediction of reward X was comparable to the activity of an orbitofrontal neuron anticipating reward X but not reward Y (bottom). Nine percent of orbitofrontal neurons are active during delay periods before specific rewards, as shown in A or B (Tremblay & Schultz, 1999, 2000; Schultz et al., 2000). (C) Prediction of reward X with representation decay (δ = 0.8 per 100 msec). In trials with presentations of reward X, a typical signal reflecting the prediction of reward X increased when the stimuli A and B or the reward X were presented (top, left and middle). Prediction of reward X did not bridge the intratrial interval, as the temporal event representation decayed rapidly.
In the trial without presentation of reward X, prediction of reward X was zero (right). Prediction of reward X was comparable to the activity of an orbitofrontal neuron. This neuron responded to the stimuli A and B and to the reward X (bottom, left and middle). The activity did not bridge the intratrial interval. Activity was at baseline levels in trials without reward X (bottom, right). This neuron belongs to the subset with reward-specific responses to the instruction (15% of tested neurons) and to the subset with reward-specific responses following the reward (8% of tested neurons) (unpublished data from Tremblay & Schultz, 1999, 2000, and Schultz et al., 2000). (Activities were aligned to rewards. Horizontal broken lines below histograms indicate stimulus onsets.)
applied to experiments with more events. Anticipatory neural activity is usually influenced by only two events: a stimulus, which elicits the activity, and a reward, which terminates the activity. Therefore, striatal anticipatory neural activity in the delayed-response task could be reproduced with separate models that separately learn the pairs reward-instruction (see Figure 1B, top), instruction-trigger (see Figure 1B, middle), instruction-reward (see Figure 1B, bottom), and trigger-reward (see Figure 1B, bottom). This assumption of separate networks is plausible, as networks of biological neurons are usually partially connected. For the sequence reproduction task (Kermadi & Joseph, 1995), an elaborated and biologically plausible stimulus representation has been proposed that reproduces order- and stimulus-dependent neural activities (Dominey et al., 1995). As the temporal characteristics of anticipatory activities resemble those of the simulated prediction signals (compare Figure 1C with Figure 3B, right), these anticipatory neural activities could probably be reproduced by training a TD model with an elaborated internal representation. Associative weights involved in the computation of tonic prediction signals were adapted according to phasic signals reporting prediction errors. Therefore, the model suggests that phasic anticipatory activities induce long-term adaptations of neurons with tonic anticipatory activities. Consistent with evidence for dopamine-dependent long-term adaptation of corticostriatal transmission (Calabresi, Pisani, Mercuri, & Bernardi, 1992; Calabresi et al., 1997; Wickens, Begg, & Arbuthnott, 1996), the model suggests that activity of midbrain dopamine neurons leads to long-term adaptations of cortical or striatal neurons with tonic reward-anticipating activity.
Furthermore, the model postulates the existence of a category of neurons that are phasically active, report errors in predictions of specific rewards, and induce long-term adaptations of neurons with tonic reward-specific anticipatory activity. Although phasic context-dependent neural activities in striatum and prefrontal cortex have been reported (see Schultz & Romo, 1992), it has not been investigated whether these activities anticipate rewards. When the model was trained using an incomplete temporal stimulus representation, the prediction signal did not progressively increase before the predicted reward but instead decreased to zero in the intratrial interval. This time course was similar to some time courses of anticipatory activity (see Figure 5C). This finding suggests that neurons with anticipatory neural activity that is at baseline levels during the intratrial interval do not have access to the complete temporal stimulus representation. If a series of distinguishable stimuli were presented during the trial, these stimuli would serve as a complete temporal stimulus representation, and the model would learn the progressively increasing prediction signals. This suggests that anticipatory neural activity would reveal its optimal time course when measured in an experiment with a series of stimuli that precede the anticipated reward. Such an experiment would test our basic model assumptions.
The reward prediction error signal of the TD model reproduces dopamine neuron activity in many situations (see section 1). Unfortunately, the TD model fails to reproduce dopamine neuron activity when a reward is delivered earlier than expected (Hollerman & Schultz, 1998). This inconsistency has been corrected with a temporal event representation that is influenced by subsequent events (Suri & Schultz, 1999). For situations with delayed or omitted rewards, prediction signals of the proposed TD model decrease to zero when the reward is expected. These prediction signals resemble some, but not all, tonic anticipatory activity for delayed reward presentation (Hikosaka et al., 1989). These subtle inconsistencies between simulated signals and measured anticipatory activity suggest that some components of the temporal event representation are influenced by subsequent events (Suri & Schultz, 1999). The proposed model can be partially related to networks of biological neurons. The model learns to predict stimuli and rewards. It has been suggested that reward predictions are learned in limbic parts of pathways from cortex via striatum to midbrain dopamine neurons (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997; Brown et al., 1999; Suri & Schultz, 1999). In contrast, stimulus predictions may be learned predominantly in the cortex. For visual stimuli, it has been proposed that pyramidal neurons in higher cortical areas learn to predict neural activity of pyramidal neurons in lower cortical areas (Rao & Ballard, 1997, 1999). These studies suggest that the feedforward connections to higher cortical areas carry the prediction errors, whereas the feedback connections carry the prediction signals. Anticipatory neural activity may influence the behavior of the animal by several mechanisms. Neural activity preceding specific reinforcers may lead to anticipatory responses in Pavlovian paradigms.
Reward anticipatory activity has been suggested to be used for learning (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997; Suri & Schultz, 1999) or planning (Suri, Bargas, & Arbib, submitted), whereas stimulus anticipatory activity may be used as a predictive representation to learn prediction-reward associations (Dayan, 1993) or to shorten the reaction time to anticipated situations (Goldberg & Bruce, 1990).

Appendix A: TD Model
Since the model does not distinguish between stimuli and rewards, both are here referred to as events. We give the model equations for a paradigm with L different events and therefore L independent TD models (L = 2 for Figure 2C). The event signal u_l(t) reports the presence (u_l(t) = 1) and absence (u_l(t) = 0) of the event with number l (l = 1, . . . , L). The desired prediction signal for this event, p_desired,l(t), is defined as the discounted sum of this future event u_l(t):

p_desired,l(t) = u_l(t) + γ u_l(t+1) + γ² u_l(t+2) + · · ·        (A.1)
(standard value of the discount factor γ = 0.99 per time step, 1 time step = 100 msec). The proposed model consists of the equations below (equations A.2–A.5), which allow estimating the desired prediction signals p_desired,l(t). Events u_l(t) are represented with the temporal representation x_m(t) as shown in Figure 2A. The first event u1(t) is represented in the representation components x1(t), x2(t), . . . , xN(t), the second event in the representation components x_{N+1}(t), x_{N+2}(t), . . . , x_{2N}(t), and so on. In order to cover trial durations of 7 seconds with the event representation, each event is represented with 70 phasic representation components (N = 70; N = 3 in Figure 2 is only for illustration). We estimate a weight matrix V_lm (for L events l = 1, . . . , L, and m = 1, . . . , N × L) that computes the prediction signal p_l(t+1) from the temporal representation with

p_l(t+1) = Σ_{m=1}^{N×L} V_lm(t) x_m(t+1).        (A.2)
As it follows from equation A.1 that p_desired,l(t) = u_l(t) + γ p_desired,l(t+1), the error e_l(t) between the estimated prediction p_l(t) and the desired prediction p_desired,l(t) is computed from discounted temporal differences between successive predictions (difference operator Δ in Figures 2B and 2C) and from the event u_l(t) with

e_l(t) = u_l(t) + γ p_l(t+1) − p_l(t).        (A.3)
The weight matrix V_lm is initialized with zeros and then adapted with the two-factor learning rule

V_lm(t+1) = V_lm(t) + β e_l(t) x̄_m(t),        (A.4)
where the learning rate β was set to the value of 50. This value is larger than usual learning rates, since the desired prediction signals are much larger than one. The trace x̄_m(t) is a slowly decaying version of the temporal representation component x_m(t):

x̄_m(t) = λ x̄_m(t−1) + (1−λ) x_m(t), with x̄_m(0) = 0.        (A.5)
The traces decrease 0.3% each time step (1 time step = 100 msec, λ = 0.997), as this produces fast learning. The intertrial interval was long enough for all eligibility traces to decrease to zero, which was simulated by setting the traces to zero. In contrast to the proposed algorithm, the standard TD model computes only one prediction signal (L = 1) but still uses all stimuli in the trial to compute this signal (the sum in equation A.2 is over m = 1, 2, . . . , N × number of stimuli).
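Equations A.2–A.5 can be collected into a short update loop. The sketch below is an illustrative assumption (the paper specifies only the equations, not code): the class name and step interface are hypothetical, while gamma, beta, and lam correspond to γ, β, and λ with the values given in this appendix.

```python
import numpy as np

class EventTDModel:
    """One of the L independent TD models (equations A.2-A.5).

    Predicts one event from the temporal representation x(t) of all
    events. gamma is the discount factor, beta the learning rate, and
    lam the eligibility-trace decay; values follow the appendix.
    The class structure is an illustrative assumption.
    """

    def __init__(self, n_inputs, gamma=0.99, beta=50.0, lam=0.997):
        self.V = np.zeros(n_inputs)      # weights V_lm, initialized with zeros
        self.trace = np.zeros(n_inputs)  # slowly decaying traces of x_m(t)
        self.gamma, self.beta, self.lam = gamma, beta, lam

    def step(self, u_t, x_t, x_next):
        """One 100 msec time step: return prediction and prediction error."""
        p_t = self.V @ x_t               # equation A.2 applied at time t
        p_next = self.V @ x_next         # p_l(t+1) from the current weights
        e = u_t + self.gamma * p_next - p_t                        # equation A.3
        self.trace = self.lam * self.trace + (1 - self.lam) * x_t  # equation A.5
        self.V = self.V + self.beta * e * self.trace               # equation A.4
        return p_t, e

# First presentation of a novel event: the prediction is zero, so the
# prediction error equals the event signal itself (cf. Figure 3A, left).
model = EventTDModel(n_inputs=3)
p, e = model.step(u_t=1.0, x_t=np.array([1.0, 0.0, 0.0]),
                  x_next=np.array([0.0, 1.0, 0.0]))
```

Running one such model per event, each fed the temporal representation of all events, reproduces the independent-models arrangement of Figure 2C.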
Acknowledgments
We thank Leon Tremblay for permission to use neural activities he recorded in orbitofrontal cortex (Tremblay & Schultz, 1999, 2000; Schultz et al., 2000). This study was supported by the James S. McDonnell Foundation grant 9439.
References

Alexander, G. E., & Crutcher, M. D. (1990a). Preparation for movement: Neural representations of intended direction in three motor areas of the monkey. J. Neurophysiol., 64, 133–163.
Alexander, G. E., & Crutcher, M. D. (1990b). Neural representation of the target (goal) of visually guided arm movements in three motor areas of the monkey. J. Neurophysiol., 64, 164–178.
Apicella, P., Scarnati, E., Ljungberg, T., & Schultz, W. (1992). Neuronal activity in monkey striatum related to the expectation of predictable environmental events. J. Neurophysiol., 68, 945–960.
Barto, A. G. (1995). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 215–232). Cambridge, MA: MIT Press.
Blair, H. T., Lipscomb, B. W., & Sharp, P. E. (1997). Anticipatory time intervals of head-direction cells in the anterior thalamus of the rat: Implications for path integration in the head-direction circuit. J. Neurophysiol., 78(1), 145–159.
Brown, J., Bullock, D., & Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. J. Neurosci., 19(23), 10502–10511.
Calabresi, P., Pisani, A., Mercuri, N. B., & Bernardi, G. (1992). Long-term potentiation in the striatum is unmasked by removing the voltage-dependent magnesium block of NMDA receptor channels. Europ. J. Neurosci., 4, 929–935.
Calabresi, P., Saiardi, A., Pisani, A., Baik, J. H., Centonze, D., Mercuri, N. B., Bernardi, G., & Borrelli, E. (1997). Abnormal synaptic plasticity in the striatum of mice lacking dopamine D2 receptors. J. Neurosci., 17(12), 4536–4544.
Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5, 613–624.
Desmond, J. E., & Moore, J. W. (1988). Adaptive timing in neural networks: The conditioned response. Biol. Cybern., 58(6), 405–415.
Desmond, J. E., & Moore, J. W. (1991). Altering the synchrony of stimulus trace processes: Tests of a neural-network model. Biol. Cybern., 65(3), 161–169.
Dickinson, A. (1980). Contemporary animal learning theory. Cambridge: Cambridge University Press.
Dominey, P., Arbib, M., & Joseph, J.-P. (1995). A model of corticostriatal plasticity for learning oculomotor associations and sequences. J. Cognitive Neurosci., 7(3), 311–336.
Duhamel, J. R., Colby, C. L., & Goldberg, M. E. (1992). The updating of the representation of visual space in parietal cortex by intended eye movements. Science, 255(5040), 90–92.
Gallistel, C. R. (1990). The organization of learning. Cambridge, MA: MIT Press.
Goldberg, M. E., & Bruce, C. J. (1990). Primate frontal eye fields. III. Maintenance of a spatially accurate saccade signal. J. Neurophysiol., 64(2), 489–508.
Grossberg, S., & Schmajuk, N. A. (1989). Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks, 2, 79–102.
Hikosaka, O., Sakamoto, M., & Usui, S. (1989). Functional properties of monkey caudate neurons. III. Activities related to expectation of target and reward. J. Neurophysiol., 61(4), 814–832.
Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 1(4), 304–309.
Hollerman, J. R., Tremblay, L., & Schultz, W. (1998). Influence of reward expectation on behavior-related neuronal activity in primate striatum. J. Neurophysiol., 80(2), 947–963.
Houk, J. C., Adams, J. L., & Barto, A. G. (1995). A model of how the basal ganglia generate and use neural signals that predict reinforcement. In J. C. Houk, J. L. Davis, & D. G. Beiser (Eds.), Models of information processing in the basal ganglia (pp. 249–270). Cambridge, MA: MIT Press.
Hull, C. L. (1943). Principles of behavior. New York: Appleton-Century-Crofts.
Kermadi, I., & Joseph, J. P. (1995). Activity in the caudate nucleus of monkey during spatial sequencing. J. Neurophysiol., 74, 911–933.
Ljungberg, T., Apicella, P., & Schultz, W. (1992). Responses of monkey dopamine neurons during learning of behavioral reactions. J. Neurophysiol., 67, 145–163.
Mackintosh, N. M. (1974). The psychology of animal learning. London: Academic Press.
Mauritz, K. H., & Wise, S. P. (1986). Premotor cortex of the rhesus monkey: Neuronal activity in anticipation of predictable environmental events. Exp. Brain Res., 61(2), 229–244.
Mirenowicz, J., & Schultz, W. (1994). Importance of unpredictability for reward responses in primate dopamine neurons. J. Neurophysiol., 72, 1024–1027.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci., 16(5), 1936–1947.
Nakamura, K., & Colby, C. L. (1999). Updating of the visual representation in monkey striate and extrastriate cortex during saccades. Soc. Neurosci. Abstr., 25(1), 1163.
Pavlov, I. P. (1927). Conditioned reflexes. Oxford: Oxford University Press.
Rao, R. P., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Comput., 9(4), 721–763.
Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci., 2(1), 79–87.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory. New York: Appleton-Century-Crofts.
Romo, R., & Schultz, W. (1992). Role of primate basal ganglia and frontal cortex in the internal generation of movements. III. Neuronal activity in the supplementary motor area. Exp. Brain Res., 91, 396–407.
Schultz, W. (1998). Predictive reward signal of dopamine neurons. J. Neurophysiol., 80, 1–27.
Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. J. Neurosci., 13, 900–913.
Schultz, W., Apicella, P., Ljungberg, T., Romo, R., & Scarnati, E. (1993). Reward-related activity in the monkey striatum and substantia nigra. Progress in Brain Research, 99, 227.
Schultz, W., Apicella, P., Scarnati, E., & Ljungberg, T. (1992). Neuronal activity in monkey ventral striatum related to the expectation of reward. Journal of Neuroscience, 12(12), 4595–4610.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593–1599.
Schultz, W., & Romo, R. (1992). Role of primate basal ganglia and frontal cortex in the internal generation of movements: I. Preparatory activity in the anterior striatum. Exp. Brain Res., 91, 363–384.
Schultz, W., Tremblay, L., & Hollerman, J. R. (2000). Reward processing in primate orbitofrontal cortex and basal ganglia. Cereb. Cortex, 10(3), 272–284.
Suri, R. E., Bargas, J., & Arbib, M. A. (submitted). Modeling functions of striatal dopamine modulation in learning and planning.
Suri, R. E., & Schultz, W. (1998). Dopamine-like reinforcement signal improves learning of sequential movements by neural network. Exp. Brain Res., 121, 350–354.
Suri, R. E., & Schultz, W. (1999). A neural network learns a spatial delayed response task with a dopamine-like reinforcement signal. Neuroscience, 91(3), 871–890.
Sutton, R. S., & Barto, A. G. (1990). Time derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks (pp. 539–602). Cambridge, MA: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press. Available online at: http://www-anw.cs.umass.edu/~rich/book/the-book.html.
Tremblay, L., Hollerman, J. R., & Schultz, W. (1998). Modifications of reward expectation-related neuronal activity during learning in primate striatum. J. Neurophysiol., 80(2), 964–977.
Tremblay, L., & Schultz, W. (1999). Relative reward preference in primate orbitofrontal cortex. Nature, 398(6729), 704–708.
862
Roland E. Suri and Wolfram Schultz
Tremblay, L., & Schultz, W. (2000). Reward-related neuronal activity during gonogo task performance in primate orbitofrontal cortex. J. Neurophysiol., 83(4), 1864–1876. Umeno, M. M., & Goldberg, M. E. (1997). Spatial processing in the monkey frontal eye eld. I. Predictive visual responses. J. Neurophysiol., 78(3), 1373– 1383. Wagner, A. R. (1978). Expectancies and the priming of STM. In S. H. Hulse, W. Fowler, & W. K. Honig (Eds.), Cognitive processes in animal behavior (pp. 177– 210). Hillsdale, NJ: Erlbaum. Walker, M. F., Fitzgibbon, E. J., & Goldberg, M. E. (1995). Neurons in the monkey superior colliculus predict the visual result of impending saccadic eye movements. J. Neurophysiol., 73(5), 1988–2003. Watanabe, M. (1996). Reward expectancy in primate prefrontal neurons. Nature, 382(6592), 629–632. Wickens, J. R., Begg, A. J., & Arbuthnott, G. W. (1996). Dopamine reverses the depression of rat corticostriatal synapses which normally follows highfrequency stimulation of cortex in vitro. Neuroscience, 70(3), 1–5. Received September 9, 1998; accepted July 15, 2000.
LETTER
Communicated by Michael Lewicki
Blind Source Separation by Sparse Decomposition in a Signal Dictionary

Michael Zibulevsky
Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, U.S.A.

Barak A. Pearlmutter
Department of Computer Science and Department of Neurosciences, University of New Mexico, Albuquerque, NM 87131, U.S.A.

The blind source separation problem is to extract the underlying source signals from a set of linear mixtures, where the mixing matrix is unknown. This situation is common in acoustics, radio, medical signal and image processing, hyperspectral imaging, and other areas. We suggest a two-stage separation process: a priori selection of a possibly overcomplete signal dictionary (for instance, a wavelet frame or a learned dictionary) in which the sources are assumed to be sparsely representable, followed by unmixing the sources by exploiting their sparse representability. We consider the general case of more sources than mixtures, but also derive a more efficient algorithm for the case of a nonovercomplete dictionary and an equal number of sources and mixtures. Experiments with artificial signals and musical sounds demonstrate significantly better separation than other known techniques.

1 Introduction
In blind source separation, an N-channel sensor signal x(t) arises from M unknown scalar source signals s_i(t), linearly mixed together by an unknown N × M matrix A, and possibly corrupted by additive noise ξ(t):

x(t) = A s(t) + ξ(t).  (1.1)

We wish to estimate the mixing matrix A and the M-dimensional source signal s(t). Many natural signals can be sparsely represented in a proper signal dictionary:

s_i(t) = Σ_{k=1}^{K} C_ik φ_k(t).  (1.2)
Neural Computation 13, 863–882 (2001) © 2001 Massachusetts Institute of Technology

The scalar functions φ_k(t) are called atoms or elements of the dictionary. These elements do not have to be linearly independent and instead may form an overcomplete dictionary. Important examples are wavelet-related dictionaries (e.g., wavelet packets, stationary wavelets; see Chen, Donoho, & Saunders, 1996; Mallat, 1998) and learned dictionaries (Lewicki & Sejnowski, in press; Lewicki & Olshausen, 1998; Olshausen & Field, 1996, 1997). Sparsity means that only a small number of the coefficients C_ik differ significantly from zero.

We suggest a two-stage separation process: a priori selection of a possibly overcomplete signal dictionary in which the sources are assumed to be sparsely representable, followed by unmixing the sources by exploiting their sparse representability.

In the discrete-time case t = 1, 2, ..., T, we use matrix notation: X is an N × T matrix with the ith component x_i(t) of the sensor signal in row i, S is an M × T matrix with the signal s_j(t) in row j, and Φ is a K × T matrix with the basis function φ_k(t) in row k. Equations 1.1 and 1.2 then take the simple form

X = AS + ξ,  (1.3)
S = CΦ.  (1.4)

Combining them, we get X ≈ ACΦ when the noise is small. Our goal can therefore be formulated as follows: given the sensor signal matrix X and the dictionary Φ, find a mixing matrix A and a matrix of coefficients C such that X ≈ ACΦ and C is as sparse as possible.

We should mention other problems of sparse representation studied in the literature. The basic problem is to sparsely represent a scalar signal in a given dictionary (see Chen et al., 1996). Another problem is to adapt the dictionary to a given class of signals¹ (Lewicki & Sejnowski, 1998; Lewicki & Olshausen, 1998; Olshausen & Field, 1997). This problem has been shown to be equivalent to the problem of blind source separation when the sources are sparse in time (Lee, Lewicki, Girolami, & Sejnowski, 1999; Lewicki & Sejnowski, in press). Our problem is different, but we will use and generalize some techniques presented in those works. Independent factor analysis (Attias, 1999) and Bayesian blind source separation (Rowe, 1999) also consider the case of more sources than mixtures. Our approach takes advantage of the sources being sparsely representable; in the extreme case, when the decomposition coefficients are very sparse, the separation becomes practically ideal (see section 3.2 and the six flutes example in Zibulevsky, Pearlmutter, Boll, & Kisilev, in press). Nevertheless, a detailed comparison of the methods on real-world signals remains open for future research.

¹ Our dictionary Φ may be obtained in this way.
In section 2 we give some motivating examples that demonstrate how sparsity helps to separate sources. Section 3 formulates the problem in a probabilistic framework and presents the maximum a posteriori approach, which is applicable to the case of more sources than mixtures. In section 4 we derive another objective function, which provides more robust computations when there is an equal number of sources and mixtures. Section 5 presents sequential source extraction using quadratic programming with nonconvex quadratic constraints. Finally, in section 6 we derive a faster method for nonovercomplete dictionaries and demonstrate high-quality separation of synthetically mixed musical sounds.

2 Separation of Sparse Signals
In this section we present two examples that demonstrate how sparsity of source signals in the time domain helps to separate them. Many real-world signals have sparse representations in a proper signal dictionary but not in the time domain. The intuition here carries over to that situation, as shown in section 3.1.

2.1 Example: Two Sources and Two Mixtures. Two synthetic sources are shown in Figures 1a and 1b. The first source has two nonzero samples, and the second has three. The mixtures, shown in Figures 1c and 1d, are less sparse: they have five nonzero samples each. One can use this observation to recover the sources. For example, we can express one of the sources as

s̃_i(t) = x_1(t) + μ x_2(t)

and choose μ so as to minimize the number of nonzero samples ||s̃_i||_0, that is, the l_0 norm of s̃_i. This objective function yields perfect separation. As shown in Figure 2a, when μ is not optimal, the second source interferes, and the total number of nonzero samples remains five. Only when the first source is recovered perfectly, as in Figure 2b, does the number of nonzero samples drop to two and the objective function achieve its minimum.

Note that the function ||s̃_i||_0 is discontinuous and may be difficult to optimize. It is also very sensitive to noise: even a tiny bit of noise would make all the samples nonzero. Fortunately, in many cases the l_1 norm ||s̃_i||_1 is a good substitute for this objective function. In this example, it too yields perfect separation.

2.2 Example: Three Sources and Two Mixtures. The signals are presented in Figure 3. These sources have about 10% nonzero samples. The nonzero samples have random positions and are zero-mean unit-variance gaussian distributed in amplitude. Figure 4 shows a scatter plot of the mixtures. The directions of the columns of the mixing matrix are clearly visible.
Figure 1: Coefficients of signals, with coefficient identity on the x-axis (10 coefficients, arbitrarily ordered) and magnitude on the y-axis (arbitrarily scaled). Sources (a and b) are sparse. Mixtures (c and d) are less sparse.

Figure 2: Coefficients of signals, with coefficient identity on the x-axis (10 coefficients, arbitrarily ordered) and magnitude on the y-axis (arbitrarily scaled). (a) Imperfect separation. Since the second source is not completely removed, the total number of nonzero samples remains five. (b) Perfect separation. When the source is recovered perfectly, the number of nonzero samples drops to two, and the objective function achieves its minimum.
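The l_1 recovery trick of section 2.1 is easy to reproduce numerically. The sketch below uses illustrative sources and a made-up mixing matrix (not the paper's data) and scans the coefficient μ for the minimum of the l_1 norm:

```python
import numpy as np

# Toy version of the two-sources / two-mixtures example of section 2.1.
s1 = np.zeros(10); s1[[2, 7]] = [1.0, -0.8]          # two nonzero samples
s2 = np.zeros(10); s2[[1, 4, 8]] = [0.9, 1.2, -0.5]  # three nonzero samples
A = np.array([[1.0, 0.6], [0.4, 1.0]])               # illustrative mixing matrix
x1, x2 = A @ np.vstack([s1, s2])                     # the two mixtures

# Recover one source as s~(t) = x1(t) + mu*x2(t), choosing mu to minimize
# the l1 norm (the convex surrogate for the l0 count of nonzero samples).
mus = np.linspace(-2, 2, 4001)
l1 = [np.abs(x1 + mu * x2).sum() for mu in mus]
mu_best = mus[int(np.argmin(l1))]
print(mu_best)
```

Since x1 + μ·x2 = (A11 + μA21)·s1 + (A12 + μA22)·s2, the second source is cancelled exactly at μ = −A12/A22 = −0.6, and the piecewise-linear l_1 objective attains its minimum at that breakpoint.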
This phenomenon can be used in clustering approaches to source separation (Pajunen, Hyvärinen, & Karhunen, 1996; Zibulevsky et al., in press). In this work we will explore a maximum a posteriori approach.
Figure 3: Coefficients of signals, with coefficient identity on the x-axis (300 coefficients, arbitrarily ordered) and magnitude on the y-axis (arbitrarily scaled). (Top three panels) Sparse sources (sparsity is 10%). (Bottom two panels) Mixtures.

3 Probabilistic Framework

In order to derive a maximum a posteriori solution, we consider the blind source separation problem in a probabilistic framework (Belouchrani & Cardoso, 1995; Pearlmutter & Parra, 1996). Suppose that the coefficients C_ik in the source decomposition (see equation 1.4) are independent random variables with a probability density function (pdf) of exponential type,

p_i(C_ik) ∝ exp(−β_i h(C_ik)).  (3.1)

This kind of distribution is widely used for modeling sparsity (Lewicki & Sejnowski, in press; Olshausen & Field, 1997). A reasonable choice of h(c) may be

h(c) = |c|^{1/γ},  γ ≥ 1,  (3.2)
Figure 4: Scatter plot of two sensors. Three distinguished directions, which correspond to the columns of the mixing matrix A, are visible.
or a smooth approximation thereof. Here we will use a family of convex smooth approximations to the absolute value,

h_1(c) = |c| − log(1 + |c|),  (3.3)
h_λ(c) = λ h_1(c/λ),  (3.4)

with λ a proximity parameter: h_λ(c) → |c| as λ → 0⁺. We also suppose a priori that the mixing matrix A is uniformly distributed over the range of interest and that the noise ξ(t) in equation 1.3 is a spatially and temporally uncorrelated gaussian process² with zero mean and variance σ².

3.1 Maximum A Posteriori Approach. We wish to maximize the posterior probability,

max_{A,C} P(A, C | X) ∝ max_{A,C} P(X | A, C) P(A) P(C),  (3.5)

where P(X | A, C) is the conditional probability of observing X given A and C. Taking into account equations 1.3 and 1.4, and the white gaussian noise,

² The assumption that the noise is white is for simplicity of exposition and can easily be removed.
we have

P(X | A, C) ∝ ∏_{i,t} exp( −(X_it − (ACΦ)_it)² / 2σ² ).  (3.6)

By the independence of the coefficients C_jk and equation 3.1, the prior pdf of C is

P(C) ∝ ∏_{j,k} exp(−β_j h(C_jk)).  (3.7)

If the prior pdf P(A) is uniform, it can be dropped³ from equation 3.5. In this way we are left with the problem

max_{A,C} P(X | A, C) P(C).  (3.8)
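Before substituting these densities, it is worth checking the smoothing device of equations 3.3 and 3.4 numerically. The small sketch below (with an arbitrary grid of test points) confirms that h_λ approaches |c| from below as λ → 0⁺:

```python
import numpy as np

# Smooth absolute-value family of equations 3.3 and 3.4:
#   h1(c) = |c| - log(1 + |c|),  h_lambda(c) = lambda * h1(c / lambda).
def h1(c):
    return np.abs(c) - np.log1p(np.abs(c))

def h_lam(c, lam):
    return lam * h1(c / lam)

c = np.linspace(-3, 3, 601)   # arbitrary test grid
for lam in (1.0, 0.1, 0.01):
    gap = np.max(np.abs(h_lam(c, lam) - np.abs(c)))
    print(lam, gap)           # the gap shrinks as lambda -> 0+
```

The gap |c| − h_λ(c) = λ log(1 + |c|/λ) is largest at the edges of the grid and vanishes with λ, which is why h_λ can safely stand in for the nondifferentiable |c| in the objectives that follow.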
By substituting 3.6 and 3.7 into 3.8, taking the logarithm, and inverting the sign, we obtain the optimization problem

min_{A,C}  (1/2σ²) ||ACΦ − X||²_F + Σ_{j,k} β_j h(C_jk),  (3.9)

where ||A||_F = sqrt( Σ_{i,j} A²_ij ) is the Frobenius matrix norm.

One can consider this objective as a generalization of Olshausen and Field (1996, 1997) by incorporating the matrix Φ, or as a generalization of Chen et al. (1996) by including the matrix A. One problem with such a formulation is that it can lead to the degenerate solution C = 0 and A = ∞. We can overcome this difficulty in various ways. The first approach is to force each row A_i of the mixing matrix A to be bounded in norm,

||A_i|| ≤ 1,  i = 1, ..., N.  (3.10)

The second way is to restrict the norm of the rows C_j from below:

||C_j|| ≥ 1,  j = 1, ..., M.  (3.11)

A third way is to reestimate the parameters β_j based on the current values of C_j. For example, this can be done using the sample variance as follows: for a given function h(·) in the distribution 3.1, express the variance of C_jk as

³ Otherwise, if P(A) is some other known function, we should use equation 3.5 directly.
a function f_h(β). An estimate of β can be obtained by applying the corresponding inverse function to the sample variance,

β̂_j = f_h^{−1}( K^{−1} Σ_k C²_jk ).  (3.12)

In particular, when h(c) = |c|, var(c) = 2β^{−2} and

β̂_j = √2 / sqrt( K^{−1} Σ_k C²_jk ).  (3.13)

Substituting h(·) and β̂ into equation 3.9, we obtain

min_{A,C}  (1/2σ²) ||ACΦ − X||²_F + Σ_j [ √2 Σ_k |C_jk| / sqrt( K^{−1} Σ_k C²_jk ) ].  (3.14)
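The variance-based reestimate of β in equation 3.13 can be sanity-checked on synthetic coefficients. In the sketch below, the true β, the sample size, and the use of NumPy's Laplacian sampler are illustrative choices; for h(c) = |c| the prior is Laplacian, with var(c) = 2β⁻²:

```python
import numpy as np

# For h(c) = |c| the prior ~ exp(-beta*|c|) is Laplacian with scale 1/beta
# and variance 2/beta^2, so equation 3.13 gives
#   beta_hat = sqrt(2) / sqrt(K^{-1} * sum_k C_k^2).
rng = np.random.default_rng(1)
beta_true = 4.0
K = 200_000                                   # illustrative sample size
C = rng.laplace(loc=0.0, scale=1.0 / beta_true, size=K)

beta_hat = np.sqrt(2.0) / np.sqrt(np.mean(C ** 2))
print(beta_hat)  # close to beta_true = 4.0
```

This inverse-variance mapping is what lets the algorithm adapt the sparsity weights β_j to the current coefficient rows without any extra tuning parameters.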
This objective function is invariant to a rescaling of the rows of C combined with a corresponding inverse rescaling of the columns of A.

3.2 Experiment: More Sources Than Mixtures. This experiment demonstrates that sources with very sparse representations can be separated almost perfectly, even when they are correlated and the number of samples is small. We used the standard wavelet packet dictionary with the basic wavelet symmlet-8. When the signal length is 64 samples, this dictionary consists of 448 atoms; it is overcomplete by a factor of seven. Examples of atoms and their images in the time-frequency phase plane (Coifman & Wickerhauser, 1992; Mallat, 1998) are shown in Figure 5. We used the ATOMIZER (Chen, Donoho, Saunders, Johnstone, & Scargle, 1995) and WAVELAB (Buckheit, Chen, Donoho, Johnstone, & Scargle, 1995) MATLAB packages for fast multiplication by Φ and Φᵀ.

We created three very sparse sources (see Figure 6a), each composed of only two or three atoms. The first two sources have a significant cross-correlation of 0.34, which makes separation difficult for conventional methods. Two synthetic sensor signals (see Figure 6b) were obtained as linear mixtures of the sources. In order to measure the accuracy of separation, we normalized the original sources with ||S_j||₂ = 1 and the estimated sources with ||S̃_j||₂ = 1. The error was computed as

Error = ( ||S̃_j − S_j||₂ / ||S_j||₂ ) · 100%.  (3.15)
We tested two methods with these data. The first method used the objective function (see equation 3.9) and the constraints (see equation 3.11),
Figure 5: Examples of atoms. (Left) Time-frequency phase plane. (Right) Time plot.
and the second method used the objective function (see equation 3.14). We used PBM (Ben-Tal & Zibulevsky, 1997) for the constrained optimization. The unconstrained optimization was done using the method of conjugate gradients, with the TOMLAB package (Holmstrom & Bjorkman, 1999). The same tool was used by PBM for its internal unconstrained optimization. We used h_λ(·) defined by equations 3.3 and 3.4, with λ = 0.01 and σ² = 0.0001 in the objective function. The resulting errors of the recovered sources were 0.09% and 0.02% for the first and second methods, respectively. The estimated sources are shown in Figure 6c. They are visually indistinguishable from the original sources in Figure 6a.

It is important to recognize the computational difficulties of this approach. First, the objective functions seem to have multiple local minima. For this reason, reliable convergence was achieved only when the search started randomly within 10% to 20% distance of the actual solution (to obtain such an initial guess, one can use a clustering algorithm, as in Pajunen et al., 1996, or Zibulevsky et al., in press). Second, the method of conjugate gradients requires a few thousand iterations to converge, which takes about 5 minutes on a 300 MHz AMD K6-II even for this very small problem. (On the other hand, preliminary experiments with a truncated Newton method have been encouraging, and we anticipate that this will reduce the computational burden by an order of magnitude or more. Paul Tseng's (2000) block coordinate descent method may also be appropriate.) Below we present a few other approaches that help to stabilize and accelerate the optimization.

4 Equal Number of Sources and Sensors: More Robust Formulations
Figure 6: Sources, mixtures, and reconstructed sources in both the time-frequency phase plane (left) and the time domain (right).

The main difficulty in an optimization problem like equation 3.9 is the bilinear term ACΦ, which destroys the convexity of the objective function and makes convergence unstable when optimization starts far from the solution. In this section, we consider more robust formulations for the case when the number of sensors is equal to the number of sources, N = M, and the mixing matrix is invertible, W = A⁻¹. When the noise is small and the matrix A is far from singular, WX gives a reasonable estimate of the source signals S. Taking into account equation 1.4, we obtain a least-squares term ||CΦ − WX||²_F, so the separation objective may be written as

min_{W,C}  (1/2) ||CΦ − WX||²_F + μ Σ_{j,k} β_j h(C_jk).  (4.1)
We also need to add a constraint that enforces the nonsingularity of W. For example, we can restrict its minimal singular value r_min(W) from below:

r_min(W) ≥ 1.  (4.2)

It can be shown that in the noiseless case, σ ≈ 0, the problem 4.1–4.2 is equivalent to the maximum a posteriori formulation, equation 3.9, with the constraint ||A||₂ ≤ 1. Another possibility for ensuring the nonsingularity of W is to subtract K log |det W| from the objective,

min_{W,C}  −K log |det W| + (1/2) ||CΦ − WX||²_F + μ Σ_{j,k} β_j h(C_jk),  (4.3)

which can be viewed as a maximum likelihood term (Bell & Sejnowski, 1995; Pearlmutter & Parra, 1996). When the noise is zero and Φ is the identity matrix, we can substitute C = WX and obtain the BS Infomax objective (Bell & Sejnowski, 1995):

min_W  −K log |det W| + Σ_{j,k} β_j h((WX)_jk).  (4.4)
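To make the objective of equation 4.4 concrete, the sketch below separates two synthetic sparse-in-time (Laplacian) sources by directly minimizing −K log|det W| + Σ h_λ((WX)_jk). The mixing matrix, source model, β_j = 1, and the generic derivative-free optimizer are all illustrative choices, not the paper's setup (the paper uses conjugate gradients):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
K = 2000
S = rng.laplace(size=(2, K))               # two sparse-in-time sources
A = np.array([[1.0, 0.7], [0.3, 1.0]])     # illustrative mixing matrix
X = A @ S

lam = 0.01
def h(c):                                  # smooth |.|, equations 3.3-3.4
    a = np.abs(c) / lam
    return lam * (a - np.log1p(a))

def objective(w):                          # equation 4.4 with beta_j = 1
    W = w.reshape(2, 2)
    sign, logdet = np.linalg.slogdet(W)
    if sign <= 0:
        return np.inf                      # keep W on the det > 0 branch
    return -K * logdet + np.sum(h(W @ X))

res = minimize(objective, np.eye(2).ravel(), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-10, "fatol": 1e-10})
S_hat = res.x.reshape(2, 2) @ X            # recovered sources
```

Up to the usual scaling and permutation ambiguity of blind separation, each row of `S_hat` should correlate strongly with one of the true sources.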
4.1 Experiment: Equal Numbers of Sources and Sensors. We created two sparse sources (see Figure 7, top) with a strong cross-correlation of 0.52. Separation by minimization of the objective function, equation 4.3, gave an error of 0.23%. Robust convergence was achieved when we started from random uniformly distributed points in C and W. For comparison we tested the JADE (Cardoso, 1999a), FastICA (Hyvärinen, 1999), and BS Infomax (Bell & Sejnowski, 1995; Amari, Cichocki, & Yang, 1996) algorithms on the same signals. All three codes were obtained from public web sites (Cardoso, 1999b; Hyvärinen, 1998; Makeig, 1999) and were used with the default settings of all parameters. The resulting relative errors (see Figure 8) confirm the significant superiority of the sparse decomposition approach.

This still takes a few thousand conjugate gradient steps to converge (about 5 minutes on a 300 MHz AMD K6). For comparison, the tuned public implementations of JADE, FastICA, and BS Infomax take only a few seconds. Below we consider some options for acceleration.

5 Sequential Extraction of Sources via Quadratic Programming
Let us consider finding the sparsest signal that can be obtained as a linear combination of the sensor signals, s = wᵀX. By sparsity, we mean the ability of the signal to be approximated by a linear combination of a small number of dictionary elements φ_k, as s ≈ cᵀΦ. This leads to the objective

min_{w,c}  (1/2) ||cᵀΦ − wᵀX||²₂ + μ Σ_k h(c_k),  (5.1)
Figure 7: Sources, mixtures, and reconstructed sources in both the time-frequency phase plane (left) and the time domain (right).

Figure 8: Percentage relative error of separation of the artificial sparse sources recovered by JADE (57%), FastICA (29%), BS Infomax (27%), and equation 4.3 (0.2%).
where the term Σ_k h(c_k) may be considered a penalty for nonsparsity. In order to avoid the trivial solution w = 0 and c = 0, we need to add a constraint that separates w from zero. It could be, for example,

||w||²₂ ≥ 1.  (5.2)

A similar constraint can be used as a tool to extract all the sources sequentially. The new separation vector w_j should have a component of unit norm in the subspace orthogonal to the previously extracted vectors w_1, ..., w_{j−1},

||(I − P_{j−1}) w_j||²₂ ≥ 1,  (5.3)

where P_{j−1} is an orthogonal projector onto span{w_1, ..., w_{j−1}}. When h(c_k) = |c_k|, we can use the standard substitution

c = c⁺ − c⁻,  c⁺ ≥ 0,  c⁻ ≥ 0,  ĉ = (c⁺; c⁻),  Φ̂ = (Φ; −Φ),

which transforms equations 5.1 and 5.3 into the quadratic program

min_{w,ĉ}  (1/2) ||ĉᵀΦ̂ − wᵀX||²₂ + μ eᵀĉ
subject to:  ||w||²₂ ≥ 1,  ĉ ≥ 0,

where e is a vector of ones.

6 Fast Solution in Nonovercomplete Dictionaries
In important applications (Tang, Pearlmutter, & Zibulevsky, 2000; Tang, Pearlmutter, Zibulevsky, Hely, & Weisend, 2000; Tang, Phung, Pearlmutter, & Christner, 2000), the sensor signals may have hundreds of channels and hundreds of thousands of samples. This may make separation computationally difficult. Here we present an approach that compromises between statistical and computational efficiency. In our experience, this approach provides high-quality separation in a reasonable amount of time.

Suppose that the dictionary is "complete": it forms a basis in the space of discrete signals. This means that the matrix Φ is square and nonsingular. Examples of such dictionaries include the Fourier basis, the Gabor basis, and various wavelet-related bases. We can also obtain an "optimal" dictionary by learning it from a given family of signals (Lewicki & Sejnowski, in press; Lewicki & Olshausen, 1998; Olshausen & Field, 1996, 1997).
Let us denote the dual basis

Ψ = Φ⁻¹,  (6.1)

and suppose that the coefficients of the decomposition of the sources,

C = SΨ,  (6.2)

are sparse and independent. This assumption is reasonable for properly chosen dictionaries, although we lose the advantages of overcompleteness. Let Y be the decomposition of the sensor signals,

Y = XΨ.  (6.3)

Multiplying both sides of equation 1.3 by Ψ from the right and taking into account equations 6.2 and 6.3, we obtain

Y = AC + ζ,  (6.4)

where ζ = ξΨ is the decomposition of the noise. Here we consider an "easy" situation, where ζ is white, which assumes that Ψ is orthogonal. We can see that all the objective functions from sections 3.1 to 5 remain valid if we substitute the identity matrix for Φ and replace the sensor signal X by its decomposition Y. For example, the maximum a posteriori objectives 3.9 and 3.14 are transformed into

min_{A,C}  (1/2σ²) ||AC − Y||²_F + Σ_{j,k} β_j h(C_jk)  (6.5)

and

min_{A,C}  (1/2σ²) ||AC − Y||²_F + Σ_j [ √2 Σ_k |C_jk| / sqrt( K⁻¹ Σ_k C²_jk ) ].  (6.6)

The objective, equation 4.3, becomes

min_{W,C}  −K log |det W| + (1/2) ||C − WY||²_F + μ Σ_{j,k} β_j h(C_jk).  (6.7)

In this case we can further assume that the noise is zero, substitute C = WY, and obtain the BS Infomax objective (Bell & Sejnowski, 1995):

min_W  −K log |det W| + Σ_{j,k} β_j h((WY)_jk).  (6.8)
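The reduction in this section rests on the residual being preserved under an orthogonal change of basis: for orthogonal Ψ, ||X − ACΦ||_F = ||Y − AC||_F with Y = XΨ. The sketch below verifies this with an orthonormal DCT standing in for the dictionary Φ (an illustrative choice; the sizes and sparsity level are arbitrary):

```python
import numpy as np
from scipy.fft import dct

# Check of the section-6 reduction: with an orthogonal dual basis Psi,
# ||X - A C Phi||_F equals ||Y - A C||_F, where Y = X Psi = X Phi^T.
rng = np.random.default_rng(2)
N, M, T = 2, 2, 64
C = rng.normal(size=(M, T)) * (rng.random((M, T)) < 0.1)  # sparse coefficients
A = rng.normal(size=(N, M))

Phi = dct(np.eye(T), norm="ortho")     # T x T orthogonal DCT matrix
S = C @ Phi                             # sources synthesized from the dictionary
X = A @ S + 0.01 * rng.normal(size=(N, T))

Y = X @ Phi.T                           # decomposition of the sensor signals
r1 = np.linalg.norm(X - A @ C @ Phi)    # residual in the signal domain
r2 = np.linalg.norm(Y - A @ C)          # residual in the coefficient domain
print(r1, r2)
```

Because the two residuals agree, every data-fit term in the earlier objectives can be evaluated on the (much sparser) coefficients Y instead of the raw signals X.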
Figure 9: Separation of musical recordings taken from commercial digital audio CDs (5-second fragments).
Also, other known methods (e.g., Lee et al., 1999; Lewicki & Sejnowski, in press), which normally assume sparsity of the source signals, may be applied directly to the decomposition Y of the sensor signals. This may be more efficient than the traditional approach, and the reason is obvious: a properly chosen decomposition typically gives significantly higher sparsity for the transformed coefficients than for the raw signals. Furthermore, independence of the coefficients is a more realistic assumption than independence of the raw signal samples.

6.1 Experiment: Musical Sounds. In our experiments we artificially mixed seven 5-second fragments of musical sound recordings taken from commercial digital audio CDs. Each of them included 40,000 samples after downsampling by a factor of 5 (see Figure 9). The easiest way to perform a sparse decomposition of such sources is to compute a spectrogram, the coefficients of a time-windowed discrete Fourier transform. (We used the function SPECGRAM from the MATLAB signal processing toolbox with a time window of 1024 samples.) The sparsity of the spectrogram coefficients (the histogram in Figure 10, right) is much higher than the sparsity of the original signal (see Figure 10, left). In this case Y (see equation 6.3) is a real matrix, with separate entries for the real and imaginary components of each spectrogram coefficient of the sensor signals X. We used the objective function (see equation 6.8) with β_j = 1 and h_λ(·) defined by equations 3.3 and 3.4 with the parameter λ = 10⁻⁴. Unconstrained minimization was performed by a BFGS quasi-Newton algorithm (MATLAB function FMINU).
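The observation that spectrogram coefficients are far sparser than raw audio samples can be reproduced with SciPy's STFT in place of MATLAB's SPECGRAM. The synthetic "musical" signal, the window length, and the use of excess kurtosis as a rough proxy for sparsity (heavy tails) are all illustrative choices:

```python
import numpy as np
from scipy.signal import stft
from scipy.stats import kurtosis

# A crude stand-in for a musical signal: a few sinusoidal "notes"
# that switch on and off over time.
fs = 8000
t = np.arange(0, 4.0, 1.0 / fs)
x = np.zeros_like(t)
for f0, start, stop in [(440, 0.0, 1.5), (660, 1.0, 2.5), (880, 2.0, 4.0)]:
    seg = (t >= start) & (t < stop)
    x[seg] += np.sin(2 * np.pi * f0 * t[seg])

# Decomposition Y: real and imaginary parts of the windowed DFT
# coefficients, stacked into one real vector (as in the experiment).
_, _, Z = stft(x, fs=fs, nperseg=1024)
Y = np.concatenate([Z.real.ravel(), Z.imag.ravel()])

# Sparsity shows up as heavy tails: the excess kurtosis of the
# spectrogram coefficients is far larger than that of the raw samples.
print(kurtosis(x), kurtosis(Y))
```

The raw sinusoidal samples are platykurtic (kurtosis below that of a gaussian), while the STFT concentrates the energy into a handful of time-frequency cells, producing strongly leptokurtic coefficients.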
Figure 10: Histogram of sound source values (left) and spectrogram coefficients (right), shown with linear y-scale (top), square root y-scale (center), and logarithmic y-scale (bottom).
This algorithm separated the sources with a relative error of 0.67% for the least well-separated source (error computed according to equation 3.15). We also applied the BS Infomax algorithm (Bell & Sejnowski, 1995) implemented in Makeig (1999) to the spectrogram coefficients Y of the sensor signals. Separation errors were slightly larger, at 0.9%, but the computing time improved (from 30 minutes for BFGS to 5 minutes for BS Infomax). For comparison we tested the JADE (Cardoso, 1999a, 1999b), FastICA (Hyvärinen, 1998, 1999), and BS Infomax algorithms on the raw sensor signals. The resulting relative errors (see Figure 11) confirm the significant (by a factor of more than 10) superiority of the sparse decomposition approach. The method described in this section, which combines a spectrogram transform with the BS Infomax algorithm, is included in the ICA/EEG toolbox (Makeig, 1999).

7 Future Research
We should mention an alternative to the maximum a posteriori approach (see equation 3.8). Considering the mixing matrix A as a parameter, we can estimate it by maximizing the probability of the observed signal X:

max_A P(X | A) = max_A ∫ P(X | A, C) P(C) dC.

The integral over all possible coefficients C may be approximated, for example, by Monte Carlo sampling, by a matching gaussian, in the spirit of Lewicki and Sejnowski (in press) and Lewicki and Olshausen (1998), or by variational methods (Jordan, Ghahramani, Jaakkola, & Saul, 1999). It would be interesting to compare these possibilities with the other methods presented in this article.
Figure 11: Percentage relative error of separation of seven musical sources recovered by JADE (8.8%), FastICA (8.6%), BS Infomax (7.1%), Infomax applied to the spectrogram coefficients (0.9%), and BFGS minimization of the objective (see equation 6.8) with the spectrogram coefficients (0.67%).
Another important direction is toward the problem of simultaneous blind deconvolution and separation, as in Lambert (1996). In this case, the matrices A and W have linear filters as elements, and multiplication by an element corresponds to convolution. Even in this matrix-of-filters context, most of the formulas in this paper remain valid.
8 Conclusions
We showed that the use of sparse decomposition in a proper signal dictionary provides high-quality blind source separation. The maximum a posteriori framework gives the most general approach, which includes the situation of more sources than sensors. Computationally more robust solutions can be found in the case of an equal number of sources and sensors. We can also extract the sources sequentially using quadratic programming with nonconvex quadratic constraints. Finally, much faster solutions may be obtained by using nonovercomplete dictionaries. Our experiments with artificial signals and digitally mixed musical sounds demonstrate a high quality of source separation compared with other known techniques.
Acknowledgments
This research was partially supported by NSF CAREER award 97-02-311, the National Foundation for Functional Brain Imaging, an equipment grant from Intel Corporation, the Albuquerque High Performance Computing Center, a gift from George Cowan, and a gift from the NEC Research Institute.

References

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Belouchrani, A., & Cardoso, J.-F. (1995). Maximum likelihood source separation by the expectation-maximization technique: Deterministic and stochastic implementation. In Proceedings of 1995 International Symposium on Non-Linear Theory and Applications (pp. 49–53). Las Vegas, NV.
Ben-Tal, A., & Zibulevsky, M. (1997). Penalty/barrier multiplier methods for convex programming problems. SIAM Journal on Optimization, 7(2), 347–366.
Buckheit, J., Chen, S. S., Donoho, D. L., Johnstone, I., & Scargle, J. (1995). About WaveLab (Tech. Rep.). Stanford, CA: Department of Statistics, Stanford University. Available online at: http://www-stat.stanford.edu/~donoho/Reports/.
Cardoso, J.-F. (1999a). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Cardoso, J.-F. (1999b). JADE for real-valued data. Available online at: http://sig.enst.fr/~cardoso/guidesepsou.html.
Chen, S. S., Donoho, D. L., & Saunders, M. A. (1996). Atomic decomposition by basis pursuit. Available online at: http://www-stat.stanford.edu/~donoho/Reports/.
Chen, S. S., Donoho, D. L., Saunders, M. A., Johnstone, I., & Scargle, J. (1995). About Atomizer (Tech. Rep.). Stanford, CA: Department of Statistics, Stanford University. Available online at: http://www-stat.stanford.edu/~donoho/Reports/.
Coifman, R. R., & Wickerhauser, M. V. (1992). Entropy-based algorithms for best-basis selection. IEEE Transactions on Information Theory, 38, 713–718.
Holmstrom, K., & Bjorkman, M. (1999). The TOMLAB NLPLIB. Advanced Modeling and Optimization, 1, 70–86. Available online at: http://www.ima.mdh.se/tom/.
Hyvärinen, A. (1998). The FastICA MATLAB package. Available online at: http://www.cis.hut.fi/~aapo/.
Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. In M. I. Jordan (Ed.), Learning in graphical models (pp. 105–161). Norwell, MA: Kluwer.
Lambert, R. H. (1996). Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures. Unpublished doctoral dissertation, University of Southern California.
Lee, T. W., Lewicki, M. S., Girolami, M., & Sejnowski, T. J. (1999). Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Processing Letters, 6, 87–90.
Lewicki, M. S., & Olshausen, B. A. (1998). A probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America A: Optics, Image Science, and Vision, 16, 1587–1601.
Lewicki, M. S., & Sejnowski, T. J. (in press). Learning overcomplete representations. Neural Computation.
Makeig, S. (1999). ICA/EEG toolbox. San Diego, CA: Computational Neurobiology Laboratory, Salk Institute. Available online at: http://www.cnl.salk.edu/~tewon/ica cnl.html.
Mallat, S. (1998). A wavelet tour of signal processing. Orlando, FL: Academic Press.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Pajunen, P., Hyvärinen, A., & Karhunen, J. (1996). Non-linear blind source separation by self-organizing maps. In Proceedings of the International Conference on Neural Information Processing. Berlin: Springer-Verlag.
Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In Proceedings of the International Conference on Neural Information Processing (pp. 151–157). Berlin: Springer-Verlag.
Rowe, D. B. (1999). Bayesian blind source separation. Submitted.
Tang, A. C., Pearlmutter, B. A., & Zibulevsky, M. (2000). Blind separation of neuromagnetic responses. Neurocomputing, 32–33 (Special Issue), 1115–1120.
Tang, A. C., Pearlmutter, B. A., Zibulevsky, M., Hely, T. A., & Weisend, M. P. (2000). An MEG study of response latency and variability in the human visual system during a visual-motor integration task. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 185–191). Cambridge, MA: MIT Press.
Tang, A. C., Phung, D., Pearlmutter, B. A., & Christner, R. (2000). Localization of independent components from magnetoencephalography. In Proceedings of the International Workshop on Independent Component Analysis and Blind Source Separation (Helsinki, Finland).
Tseng, P. (2000). Convergence of block coordinate descent method for nondifferentiable minimization. Submitted.
Michael Zibulevsky and Barak A. Pearlmutter
Zibulevsky, M., Pearlmutter, B. A., Bofill, P., & Kisilev, P. (in press). Blind source separation by sparse decomposition in a signal dictionary. In S. J. Roberts & R. M. Everson (Eds.), Independent component analysis: Principles and practice. Cambridge: Cambridge University Press.

Received July 28, 1999; accepted July 11, 2000.
LETTER
Communicated by Jean-Marc Cardoso
Complexity Pursuit: Separating Interesting Components from Time Series

Aapo Hyvärinen
Neural Networks Research Centre, Helsinki University of Technology, P.O. Box 5400, FIN-02015 HUT, Finland

A generalization of projection pursuit for time series, that is, signals with time structure, is introduced. The goal is to find projections of time series that have interesting structure, defined using criteria related to Kolmogoroff complexity or coding length. Interesting signals are those that can be coded with a short code length. We derive a simple approximation of coding length that takes into account both the nongaussianity and the autocorrelations of the time series. Also, we derive a simple algorithm for its approximative optimization. The resulting method is closely related to blind separation of nongaussian, time-dependent source signals.

1 Introduction
In multivariate statistics, a central problem is to find one-dimensional (1D) projections of the data vector that reveal interesting aspects of the data. Denoting by x = (x_1, x_2, \ldots, x_n)^T the n-dimensional random vector corresponding to the observed data, such projections are defined by a constant vector w = (w_1, w_2, \ldots, w_n)^T as the linear combination w^T x = \sum_i w_i x_i. A classical method for finding such projections is to compute the principal components (Oja, 1982; Jolliffe, 1986) of the data vectors. The first principal component gives a projection that optimally approximates the data vector in the sense of mean square error. The second principal component gives the optimal approximation of the residual of the approximation given by the first principal component, and so forth. Recently, however, it has been argued that the projections given by the principal components do not describe the data in a meaningful way in many cases. This is because principal component analysis neglects important aspects of the data such as clustering and higher-order independence. This has led to the development of methods that are based not on mean square error but rather on higher-order statistics. This means using other information than that contained in the covariance matrix. An important method using higher-order information is projection pursuit (Friedman & Tukey, 1974). In basic (1D) projection pursuit, we try to find directions w such that the projection of the data vector in that direction, w^T x, has an "interesting" distribution in the sense of displaying some

Neural Computation 13, 883–898 (2001)
© 2001 Massachusetts Institute of Technology
structure. Huber (1985) and Jones and Sibson (1987) have argued that the gaussian distribution is the least interesting one and that the most interesting directions are those that show the least gaussian distribution. An information-theoretic justification for this is that the gaussian distribution has maximum entropy among all distributions of unit variance. Entropy can be considered a measure of disorder, that is, a lack of (interesting) structure. Projection pursuit is closely related to independent component analysis (ICA) (Jutten & Hérault, 1991; Comon, 1994). ICA is a statistical model where the observed data are expressed as a linear transformation of latent variables that are nongaussian and mutually independent. We may express the model as

x = A s,   (1.1)
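As an illustration of the generative model in equation 1.1, the following sketch (in Python with NumPy, which we add for illustration; all names and parameter values are our own, not the author's) draws independent supergaussian sources and mixes them with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 3, 10000

# Independent, nongaussian sources s (Laplacian, i.e. supergaussian).
S = rng.laplace(size=(n, T))

# Unknown constant mixing matrix A; the observed data are x = A s.
A = rng.normal(size=(n, n))
X = A @ S
# Each row of X is now a correlated linear mixture of the sources.
```

ICA estimation amounts to recovering A (or its inverse) from X alone.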
where x = (x_1, x_2, \ldots, x_n)^T is the vector of observed random variables, s = (s_1, s_2, \ldots, s_n)^T is the vector of the latent variables called the independent components or source signals, and A is an unknown constant matrix, called the mixing matrix. Exact conditions for the identifiability of the model were given by Comon (1994). Note that we assumed, for simplicity, that the dimension of x equals the dimension of s, but this need not necessarily be the case. If nongaussianity is measured suitably, finding maximally nongaussian directions gives one method of estimating ICA (Delfosse & Loubaton, 1995; Hyvärinen & Oja, 1997; Hyvärinen, 1999a). Thus, we obtain independent components as the projection pursuit directions s_i = w_i^T x, and the w_i correspond to estimates of the rows of A^{-1}. In many cases, projection pursuit and ICA are applied on data sets that are not simply random vectors but multivariate time series, that is, signals with time dependencies. However, these two methods in their basic forms completely ignore any time structure and use only the marginal distributions of the projections. Interestingly, it has been shown that under some restrictions, the time-dependency information alone is sufficient to estimate independent components (Tong, Liu, Soon, & Huang, 1991; Belouchrani, Meraim, Cardoso, & Moulines, 1997; Matsuoka, Ohya, & Kawamoto, 1995; Molgedey & Schuster, 1994). Results obtained by these methods are likely to improve if one exploits all the available information on the interestingness of the projections. It would be most useful, therefore, to define a more general method that finds interesting projections of time series using both the nongaussianity and the time structure of the projections. In this article, we propose a generalization of projection pursuit that takes into account the time structure of the projections as well. This is based on using the coding complexity of the projection.
In our "complexity pursuit," we search for projections that can be easily coded. This is a general-purpose measure of structure, closely related to using Kolmogoroff complexity as a criterion for finding a representation (Pajunen, 1998a), and is probably connected to information-processing principles used in the brain (Atick,
1992; Hochreiter & Schmidhuber, 1999; Pajunen, 1998a). We develop simple approximations of coding complexity and derive an algorithm for (approximative) minimization of the complexity approximations. This algorithm can be seen as a principled combination of algorithms using the criteria of nongaussianity and autocorrelation as in projection pursuit, ICA, and source separation. Moreover, it can be interpreted as maximum likelihood estimation of the ICA model, assuming time-correlated models for the independent components. This article is organized as follows. The basic principle of complexity pursuit is introduced in section 2. Suitable approximations of complexity are developed in section 3. An algorithm for minimizing the complexity approximation is given in section 4. Relation to other methods is discussed in section 5, simulation results are given in section 6, and conclusions are drawn in section 7.

2 Complexity and Time Series
In their classic forms, projection pursuit and ICA consider the observed data x to be a multivariate random vector. In many cases, however, the observed data are a multivariate time series x(t), that is, a vector of time signals. A time signal has much more structure than a random variable. This structure is completely neglected in projection pursuit. In some ICA methods, it is taken into account, but such ICA methods typically neglect the marginal distribution of the data. A unifying theoretical framework for using both kinds of information is given by Kolmogoroff complexity (Pajunen, 1998a). Kolmogoroff complexity is based on the interpretation of coding length as structure. Suppose that we want to code a signal s(t), t = 1, \ldots, T. For simplicity, let us assume for the moment that the signal is binary, so that every value s(t) is 0 or 1. In general, it is not possible to code this signal with fewer than T bits, so that every bit in the code gives the value of s(t) for one t. However, most natural signals have redundancy; parts of the signal can be efficiently predicted from other parts. Such a signal can be coded, or compressed, so that the code length is shorter than the original code length. It is well known that audio or image signals, for example, can be coded so that the code length is decreased considerably. This is because such natural signals are highly structured. For example, image signals consist not of random pixels but of such higher-order regularities as edges and contours. We could thus measure the amount of structure of the signal s(t) by the amount of compression that is possible in coding the signal. For signals of fixed length T, the structure could be measured by the length of the shortest possible code for the signal.
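The idea that compressibility measures structure can be illustrated with an off-the-shelf compressor, a crude practical stand-in for the shortest possible code discussed in the text (the signals and the use of zlib here are our own illustration, not the author's):

```python
import zlib

import numpy as np

rng = np.random.default_rng(2)
T = 4096

# A structured binary signal (long constant runs) and an unstructured one.
structured = np.repeat(rng.integers(0, 2, size=T // 64), 64).astype(np.uint8)
unstructured = rng.integers(0, 2, size=T).astype(np.uint8)

# Compressed code length, a rough proxy for coding complexity.
len_structured = len(zlib.compress(structured.tobytes(), 9))
len_unstructured = len(zlib.compress(unstructured.tobytes(), 9))
# The structured signal admits a much shorter code.
```

Any general-purpose compressor only upper-bounds the optimal code length, but the ordering of the two signals already reflects their difference in structure.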
Note that the signal could be compressed by many different kinds of coding schemes (the coding theory literature is full of them), but we are here considering the shortest code possible, thus maximizing the compression over all possible coding schemes. This is a nonrigorous definition of Kolmogoroff complexity (for a more rigorous definition
of the concept, see Pajunen, 1998a, 1998b). Kolmogoroff complexity is usually defined for binary signals, but it could be applied on continuous-valued signals as well after a suitable quantization. Thus, we arrive at a theoretical definition of complexity pursuit. The goal is to find projections w^T x(t) such that the Kolmogoroff complexity of the projection is minimized. We must also fix the scale of the projection, since otherwise taking w = 0 would give a projection that is trivially structured. Thus, we constrain the variance of the projection to be unity, as usual in projection pursuit. Kolmogoroff complexity is a theoretical measure, since its computation involves finding the best coding scheme for the signal. The number of possible coding schemes is infinite, so this optimization is intractable. Therefore, in practice, approximations must be used. If the signals have no time structure, their Kolmogoroff complexities are given (approximately) by their entropies (Pajunen, 1998a). In this case, we rediscover ordinary projection pursuit. Furthermore, Pajunen (1998a, 1999) showed how to approximate Kolmogoroff complexity by criteria using the autocorrelations of the signals, in which case we rediscover a method closely related to the source separation methods in Tong et al. (1991), Belouchrani et al. (1997), and Molgedey and Schuster (1994). In the next section, we introduce a more general framework for approximating Kolmogoroff complexity.

3 Approximation of Complexity
In this section, we derive an approximation of the Kolmogoroff complexity of a scalar signal y(t), t = 1, \ldots, T. For simplicity, the signal is assumed to have zero mean and unit variance.

3.1 General Formulation by Predictive Coding. Consider predictive coding of the signal. The value y(t) is predicted from the preceding values by some function f to be specified:
\hat{y}(t) = f(y(t-1), y(t-2), \ldots, y(1)).   (3.1)
To code the actual value y(t), the residual,

\delta y(t) = y(t) - \hat{y}(t),   (3.2)
is coded by a scalar quantization method. The point is that in many cases, it is easier to code the residual than the original value y(t), if the predictor f is suitably chosen so that it gives a reasonable prediction of y(t). This coding strategy is used for y(t) at all time points t. Thus, the whole signal is coded by coding the residuals (and the initial value(s) of y(t) for t near 1). The residuals are coded independently from each other, neglecting any dependencies.
According to the basic principles of information theory (Cover & Thomas, 1991), the length of this code is asymptotically approximated by the sum of the entropies H of the residuals. We use this as an approximation of the coding complexity:

\hat{K}(y) = \sum_t H(\delta y(t)).   (3.3)
Assuming that the residual is stationary and ergodic and that the predictor uses a history of bounded length, and ignoring border effects, we have the simpler version,

\hat{K}(y) = T\, H(\delta y),   (3.4)
where \delta y denotes a random variable with the marginal distribution of the residual. Note that we made the assumption here that the signal is stationary. In the case of a nonstationary signal, more sophisticated models for the signal need to be developed, in which the function f is changing in time (Matsuoka et al., 1995).

3.2 Using Linear Models for Prediction. To use the approximation in equation 3.4 in practice, we need to fix the structure of the predictor f and find an approximation of the entropy of \delta y. We use a computationally simple predictor structure, given by a linear autoregressive model:
\hat{y}(t) = \sum_{\tau > 0} a_\tau\, y(t - \tau).   (3.5)
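For the simplest case of a single lag, the least-squares fit of the predictor in equation 3.5 can be sketched as follows (the helper name and the test signal are our own, added for illustration):

```python
import numpy as np

def ar1_fit(y):
    """Least-squares estimate of a1 in y(t) ~ a1 * y(t-1), plus the residual."""
    y_t, y_lag = y[1:], y[:-1]
    a1 = np.dot(y_t, y_lag) / np.dot(y_lag, y_lag)
    return a1, y_t - a1 * y_lag

# An AR(1) signal with true coefficient 0.5.
rng = np.random.default_rng(3)
T = 20000
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + rng.normal()

a1_hat, residual = ar1_fit(y)  # a1_hat is close to 0.5
```

Minimizing the residual variance in this way is the computationally cheap alternative mentioned below; the entropy-minimizing (maximum likelihood) fit would weight the residuals by their true log density instead.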
To optimize the performance of the predictor, the parameters a_\tau should ideally be estimated so that the entropy of the residual \delta y is minimized. This is equivalent to estimating the autoregressive model by the method of maximum likelihood, taking into account the true distribution of the residual. Alternatively, to simplify the computations, the a_\tau could be estimated by a least-squares method, minimizing the variance of the residual. Note that in practice, the sum in equation 3.5 is taken over a finite, possibly very small, set of lag indices \tau.

3.3 Approximation of the Entropy of the Residual. To approximate the entropy of \delta y, many different methods could be devised. In particular, it is a good idea to standardize the variable to unit variance to decouple the effects of scale and nongaussianity. Denoting by \sigma_\delta^2 the variance of the residual, we have by the well-known scaling property of entropy,
H(\delta y) = H\left( \frac{\delta y}{\sigma_\delta} \right) + \log \sigma_\delta.   (3.6)
The estimation of the entropy of the standardized version H(\delta y / \sigma_\delta) could be done, for example, using the general entropy approximation introduced in Hyvärinen (1998b). We adopt here a simpler method, however, which is possible by assuming that we have prior knowledge on the distribution of the residual. We assume that we know a good approximation of the (negative) logarithm of the probability density of the residual, denoted by G. Then we can plug this into the definition of entropy and obtain the approximation

H(\delta y) \approx E\left\{ G\left( \frac{\delta y}{\sigma_\delta} \right) \right\} + \log \sigma_\delta.   (3.7)

In ICA, it is well known that the exact form of the nonquadratic function used to probe higher-order statistics is not very important (Cardoso & Laheld, 1996; Hyvärinen & Oja, 1998; Cardoso, 2000). Likewise, Hyvärinen (1998b) showed that entropy can be well approximated using a fixed nonquadratic function. We may therefore optimistically assume that the exact form of the function G is not very important here either, as long as it is qualitatively similar enough. The simulations in section 6 give some support for this assumption. In particular, in many cases we can assume that the residuals are supergaussian, which seems to be the preponderant case in natural data (Hyvärinen, 1999b; Vigário, Särelä, Jousmäki, Hämäläinen, & Oja, 2000). Then we can use the negative log density of a generic supergaussian random variable, say,

G(\delta y) = \frac{1}{2} \log 2 + \sqrt{2}\, |\delta y|.   (3.8)
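One can check numerically that exp(-G) with G as in equation 3.8 is indeed a properly normalized density of unit variance (a small verification we add; a plain Riemann sum stands in for exact integration):

```python
import numpy as np

# G(u) = (1/2) log 2 + sqrt(2)|u| is the negative log density of a
# Laplacian variable with unit variance, so p(u) = exp(-G(u)) should
# integrate to 1 and have variance 1.
u = np.linspace(-20.0, 20.0, 400001)
du = u[1] - u[0]
p = np.exp(-(0.5 * np.log(2.0) + np.sqrt(2.0) * np.abs(u)))

total = np.sum(p) * du            # close to 1
variance = np.sum(u**2 * p) * du  # close to 1
```

The truncation at |u| = 20 is harmless because the density decays exponentially.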
The additive constant in equation 3.8 is immaterial and can be omitted.

4 Finding Minimum Complexity Directions
Using the approximation given in the preceding section, we can formulate a practical method for complexity pursuit. To find the "most interesting" directions w^T x(t), use the approximation of complexity introduced in the preceding section and find its minima.

4.1 Formulating the Criterion. Let us consider what kind of practical criterion we obtain using the above approximation of complexity for y(t) = w^T x(t). First, note that the values of a_\tau and \sigma_\delta are functions of w only. To emphasize this, we write a_\tau(w) and \sigma_\delta(w). Thus, we can express the approximation of complexity as a function of w only:

\hat{K}(w^T x(t)) = E\left\{ G\left( \frac{1}{\sigma_\delta(w)} \left( w^T x(t) - \sum_{\tau > 0} a_\tau(w)\, w^T x(t - \tau) \right) \right) \right\} + \log \sigma_\delta(w).   (4.1)
We must fix the scale of w^T x(t) to gain direct access to the higher-order structure. Thus, we constrain

E\{ (w^T x(t))^2 \} = 1.   (4.2)
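The criterion of equation 4.1 can be evaluated directly for a given unit-variance projection; the following sketch (our own code, using an AR(1) predictor and the G of equation 3.8) computes the two terms and shows that a supergaussian signal is scored as less complex than a gaussian one:

```python
import numpy as np

def complexity(w, X, a1):
    """Approximate complexity of w^T x(t): E{G(residual/sigma)} + log sigma,
    with an AR(1) predictor and G(u) = 0.5*log(2) + sqrt(2)*|u|."""
    y = w @ X                       # the projection time series
    residual = y[1:] - a1 * y[:-1]  # residual of the linear predictor
    sigma = residual.std()
    G = 0.5 * np.log(2.0) + np.sqrt(2.0) * np.abs(residual / sigma)
    return G.mean() + np.log(sigma)

# A supergaussian white signal scores lower ("more interesting")
# than a gaussian one of the same unit variance.
rng = np.random.default_rng(4)
X = np.vstack([rng.laplace(scale=1 / np.sqrt(2), size=50000),
               rng.normal(size=50000)])
k_super = complexity(np.array([1.0, 0.0]), X, 0.0)
k_gauss = complexity(np.array([0.0, 1.0]), X, 0.0)
```

With white signals the autocorrelation term contributes equally to both, so the ordering here is driven entirely by the nongaussianity term.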
The terms in the approximation in equation 4.1 have intuitive interpretations. The first one measures the contribution of the nongaussianity to the entropy of the residual of the linear predictor. The argument of G is normalized to unit variance, so this nonquadratic function measures the nongaussianity in the same way as the one-unit contrast function in Hyvärinen and Oja (1998) and Hyvärinen (1999a). Minimizing this term alone amounts to finding the direction in which the residual is as nongaussian as possible. The second term measures the contribution of the variance of the residual to its entropy. Minimization of this term alone amounts to finding a projection that has maximum autocorrelations, that is, maximum time dependencies. This principle is closely related to those used in the blind source separation methods of Tong et al. (1991), Molgedey and Schuster (1994), and Belouchrani et al. (1997), as will be seen in section 5. Thus, our criterion uses simultaneously the two most widely used criteria for ICA and related methods: nongaussianity and autocorrelations. Moreover, Kolmogoroff complexity leads to another modification of ordinary projection pursuit: it is the nongaussianity of the residuals of predictive coding that is measured instead of the nongaussianity of the original signals.

4.2 Deriving the Algorithm.

4.2.1 The Gradient. To find the minima of the approximation of complexity, we can use a simple gradient descent. The gradient of \hat{K} in equation 4.1 with respect to w can be obtained straightforwardly as

\nabla_w \hat{K}(w^T x(t)) = E\left\{ \frac{1}{\sigma_\delta(w)} \left( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) \right) g\left( \frac{1}{\sigma_\delta(w)}\, w^T \left( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) \right) \right) \right\} + \beta\, \nabla_w \sigma_\delta(w)
  - \frac{1}{\sigma_\delta(w)} E\left\{ \left[ \sum_{\tau > 0} (\nabla_w a_\tau(w))\, w^T x(t - \tau) \right] g\left( \frac{1}{\sigma_\delta(w)}\, w^T \left( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) \right) \right) \right\},   (4.3)
where g is the derivative of G, and \beta is defined as

\beta = 1 - E\left\{ \frac{1}{\sigma_\delta(w)}\, w^T \left( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) \right) g\left( \frac{1}{\sigma_\delta(w)}\, w^T \left( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) \right) \right) \right\}.   (4.4)
To simplify the resulting algorithm, we use the following approximation of the gradient, which is quite accurate. Assume that w^T x(t) is really generated by the autoregressive model that is used to predict it, and that G is the negative log density of the residual. Consider the following lemma:

Lemma 1. For any random variable x with a smooth density p_x and satisfying E\{x\} = 0, we have

E\left\{ x\, \frac{p'(x)}{p(x)} \right\} = -1.   (4.5)
Proof. By partial integration, we obtain

E\left\{ x\, \frac{p'(x)}{p(x)} \right\} = \int x\, \frac{p'(x)}{p(x)}\, p(x)\, dx = \int x\, p'(x)\, dx = 0 - \int 1 \cdot p(x)\, dx = -1.   (4.6)
Applying this lemma to the residual (normalized to unit variance), we have

E\left\{ \frac{1}{\sigma_\delta(w)}\, w^T \left( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) \right) g\left( \frac{1}{\sigma_\delta(w)}\, w^T \left( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) \right) \right) \right\} = 1,   (4.7)

which implies that \beta = 0. Moreover, the quantity \sum_{\tau > 0} (\nabla_w a_\tau(w))\, w^T x(t - \tau) depends only on the past values of w^T x(t). Therefore, it is independent of the residual w^T ( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) ), which has the role of the innovation process here. Thus, the third term in the gradient vanishes as well. Thus, we have the following approximation of the gradient:

\nabla_w \hat{K}(w^T x(t)) \approx \frac{1}{\sigma_\delta(w)} E\left\{ \left( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) \right) g\left( \frac{1}{\sigma_\delta(w)}\, w^T \left( x(t) - \sum_{\tau > 0} a_\tau(w)\, x(t - \tau) \right) \right) \right\}.   (4.8)
To simplify the resulting algorithm further, we can use the fact that the factor 1/\sigma_\delta(w) multiplying the gradient is less important, since it does not change the direction of the gradient. Thus, it can be omitted. Furthermore, the same factor multiplying the argument of g could be omitted in many cases without changing the point where the gradient vanishes. This is because for homogeneous g, for example, g(u) \propto \mathrm{sign}(u) or g(u) \propto u^3, this constant does not change the direction of the gradient either and could be omitted as well. In practice, we often use a g that is a good approximation of a homogeneous function (e.g., the tanh function as given below).

4.2.2 The Algorithm. Thus, we can use a simple (approximative) gradient descent for updating w. To begin, we can simplify the algorithm by first whitening the zero-mean data x(t), for example, by

z(t) = V x(t) = (E\{x(t)\, x(t)^T\})^{-1/2}\, x(t).   (4.9)
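Whitening as in equation 4.9 can be sketched as follows (the eigendecomposition route is one standard way to form the inverse matrix square root; the function name and test data are our own):

```python
import numpy as np

def whiten(X):
    """Return z(t) = (E{x x^T})^(-1/2) x(t) for data X (rows = channels)."""
    X = X - X.mean(axis=1, keepdims=True)   # make the data zero mean
    d, E = np.linalg.eigh(np.cov(X))
    V = E @ np.diag(d ** -0.5) @ E.T        # inverse matrix square root
    return V @ X, V

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 3)) @ rng.laplace(size=(3, 5000))
Z, V = whiten(X)
# np.cov(Z) is now (numerically) the identity matrix.
```

After this step, the unit-variance constraint of equation 4.2 reduces to the unit-norm constraint on w used below.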
Now, the constraint of unit variance of w^T x(t) can be replaced by the constraint of unit norm of w. This is a standard procedure in ICA and projection pursuit (Comon, 1994; Friedman, 1987). Denote by z(t) the whitened data and by \mu a learning rate. The algorithm is then as follows. At every step, first estimate the autoregressive constants a_\tau(w) in equation 3.5 for the time series given by w^T z(t), t = 1, \ldots, T. Then do the gradient descent and normalization:

w \leftarrow w - \mu\, E\left\{ \left( z(t) - \sum_{\tau > 0} a_\tau(w)\, z(t - \tau) \right) g\left( w^T \left( z(t) - \sum_{\tau > 0} a_\tau(w)\, z(t - \tau) \right) \right) \right\}   (4.10)

w \leftarrow w / \|w\|.   (4.11)
The function g should be chosen as in ordinary ICA, but according to the probability distribution of the residual instead of the actual component w^T z(t). If the residual is supergaussian, g(u) = \mathrm{sign}(u) is suitable. This could also be approximated by a smoother function g(u) = \tanh(a u), where a \geq 1 is a suitable constant (Bell & Sejnowski, 1995; Hyvärinen, 1999a). For subgaussian residuals, one could use g(u) = u - \tanh(u) (Girolami, 1998) or g(u) = u^3, for example. For almost gaussian residuals, a linear g could be used. Obviously, any scaling constants multiplying g can be dropped, although they are needed, in principle, to ensure that -G is actually a log density of unit variance. To estimate several projections, one can simply use a deflation scheme (Delfosse & Loubaton, 1995; Hyvärinen & Oja, 1997). This means that after estimating m components, the new one is found by projecting w, after every
step of the algorithm, on the subspace orthogonal to the one spanned by the already estimated components. A simple Gram-Schmidt orthogonalization scheme accomplishes this (Hyvärinen & Oja, 1997; Hyvärinen, 1999a). Symmetric orthogonalization may be used as well, in which case the algorithm is more akin to ICA than projection pursuit (see the next section).

4.2.3 Simple Special Case. A simple special case of the method is obtained when the autoregressive model has just one predicting term:

\hat{y}(t) = a_1\, y(t - 1).   (4.12)
The lag need not be equal to 1, but this is the basic case. This method may be very useful in practice since it takes into account the most basic form of autocovariance in the same way as the algorithms in Tong et al. (1991) and Molgedey and Schuster (1994); additional terms may not provide much extra information in many applications. The parameter a_1 in the algorithm can then be estimated very simply by a least-squares method as

\hat{a}_1 = w^T E\{ z(t)\, z(t-1)^T \}\, w.   (4.13)
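Putting equations 4.10 through 4.13 together, a one-unit version of the algorithm with an AR(1) predictor and g = tanh might look as follows. This is a sketch under our own naming and default parameters, not the author's reference implementation; the input Z must already be whitened:

```python
import numpy as np

def complexity_pursuit(Z, n_iter=100, mu=1.0, seed=0):
    """One-unit complexity pursuit on whitened data Z (rows = channels)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Z.shape[0])
    w /= np.linalg.norm(w)
    C1 = Z[:, 1:] @ Z[:, :-1].T / (Z.shape[1] - 1)  # lagged covariance
    for _ in range(n_iter):
        a1 = w @ C1 @ w                    # eq. 4.13: AR(1) coefficient
        r = Z[:, 1:] - a1 * Z[:, :-1]      # residual of the linear predictor
        grad = (r * np.tanh(w @ r)).mean(axis=1)   # eq. 4.10 with g = tanh
        w = w - mu * grad
        w /= np.linalg.norm(w)             # eq. 4.11
    return w
```

To extract several components, this update would be combined with the deflationary or symmetric orthogonalization described above.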
5 Discussion

5.1 Connection to Independent Component Analysis. We have proposed an algorithm for finding interesting 1D projections of multivariate time series, as measured by an approximation of Kolmogoroff complexity. Finding interesting projections of random vectors, as measured by nongaussianity, is closely connected to ICA estimation, and therefore one might expect that our method is closely connected to ICA as well. In fact, well-known ICA methods can be found as special cases of our method. First, assume that the signal has no (linear) time dependencies, which implies that the residual equals the signal itself. Then our algorithm reduces to

w \leftarrow w - \mu\, E\{ z(t)\, g(w^T z(t)) \}   (5.1)

w \leftarrow w / \|w\|.   (5.2)
This is in fact the algorithm proposed in Hyvärinen and Oja (1998) for finding one independent component. In particular, if G is chosen as the negative log density of the residual (here: the component), the results in Hyvärinen and Oja (1998) show that the algorithm converges to one of the independent components. Most well-known algorithms for estimating ICA for nongaussian components are closely related to this deflationary method (Amari, Cichocki, & Yang, 1996; Bell & Sejnowski, 1995; Cichocki & Unbehauen, 1996; Cardoso & Laheld, 1996; Hyvärinen & Oja, 1997; Hyvärinen, 1999a; Karhunen, Oja, Wang, Vigário, & Joutsensalo, 1997; Oja, 1997).
As another special case, assume that the data are gaussian. Then the function g can be taken to be linear, and the method reduces to something that is closely related to minimization of lagged cross-correlations, as in Belouchrani et al. (1997). To see this more clearly, assume that we use the AR(1) predictor. Denoting by

\bar{C} = \frac{1}{2} \left[ E\{ z(t)\, z(t-1)^T \} + E\{ z(t-1)\, z(t)^T \} \right]   (5.3)

a symmetric version of the lagged covariance matrix, we have

E\{ (z(t) - a_1 z(t-1))\, g(w^T (z(t) - a_1 z(t-1))) \} = [ (1 + a_1^2) I - 2 a_1 \bar{C} ]\, w,   (5.4)

and the algorithm takes the form

w \leftarrow w - \mu\, [ (1 + a_1^2) I - 2 a_1 \bar{C} ]\, w   (5.5)

w \leftarrow w / \|w\|.   (5.6)
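The gaussian special case is essentially separation by lagged decorrelation. The following sketch (our own code, in the style of Tong et al., 1991) recovers sources with distinct autocorrelations directly from the eigenvectors of the symmetrized lagged covariance of whitened data:

```python
import numpy as np

rng = np.random.default_rng(6)
T = 20000

def ar1(a, T, rng):
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = a * y[t - 1] + rng.normal()
    return y / y.std()

S = np.vstack([ar1(0.9, T, rng), ar1(0.2, T, rng)])  # distinct autocorrelations
A = rng.normal(size=(2, 2))
X = A @ S

# Whiten, then eigendecompose the symmetric lag-1 covariance.
d, E = np.linalg.eigh(np.cov(X))
Z = (E @ np.diag(d ** -0.5) @ E.T) @ (X - X.mean(axis=1, keepdims=True))
C1 = Z[:, 1:] @ Z[:, :-1].T / (T - 1)
C_bar = 0.5 * (C1 + C1.T)
_, W = np.linalg.eigh(C_bar)
S_hat = W.T @ Z    # recovered sources, up to sign and order
```

This works only when the lag-1 autocorrelations of the sources are distinct, which is exactly the restriction the complexity-pursuit criterion lifts by also using nongaussianity.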
The algorithm can be considered as a power method for computing the dominant eigenvector of \bar{C}. Tong et al. (1991) proposed that the independent components could be estimated by finding the eigenvalue decomposition of \bar{C} = E D E^T. The eigenvectors e_i thus found give the independent components as e_i^T z(t). This was motivated by the fact that such a decomposition gives signals that are uncorrelated in the usual way, as well as uncorrelated with the lag, that is, E\{ (e_i^T z(t))(e_j^T z(t-1)) \} = 0 for i \neq j. In a deflationary scheme, where we estimate one component using our algorithm, then remove it from the data, and iterate such extraction steps, our algorithm estimates the same eigenvectors of \bar{C} and gives the same decomposition. It is interesting to note that minimizing delayed cross-correlations is thus seen to correspond to finding signals with maximum autocorrelations, not unlike in PCA, where directions of maximum variance are found by a decorrelating eigenvalue decomposition. The connection to ICA estimation can be made even more explicit by considering an ICA model as in equation 1.1, where the independent components are modeled using an autoregressive model,

s_i(t) = \sum_{\tau > 0} a_{\tau i}\, s_i(t - \tau) + d,   (5.7)
where d is a nongaussian random variable. Denote by W = (w_1, \ldots, w_n)^T the inverse of A. The likelihood of such a model can be formulated as

\log L(w_i, a_{\tau i};\ i = 1, \ldots, n,\ \tau > 0) = \sum_{t=1}^{T} \sum_{i=1}^{n} \log p_i\left( w_i^T x(t) - \sum_{\tau > 0} a_{\tau i}\, w_i^T x(t - \tau) \right) + T \log |\det W|.   (5.8)
Now assume that the estimation of the autoregressive coefficients is decoupled from the estimation of the w_i. In other words, the a_{\tau i} are estimated for fixed w_i, and then the w_i are estimated for fixed a_{\tau i}, and so on. Assume further that the data are whitened and that W is constrained to be orthogonal, as is usual in ICA (Cardoso & Laheld, 1996; Hyvärinen & Oja, 1997; Hyvärinen, 1999a). Then the term \log |\det W| is constant, and what is left is essentially an approximation of the sum of the negative entropies of the residuals (the p_i should be adapted here so that they correspond to the log densities of the residuals). Maximization of the likelihood is thus essentially equivalent to minimizing the sum of the approximations of complexity of the n projections w_i^T x(t). Thus, our algorithm can be seen as a one-unit version of this ICA estimation problem.

5.2 Related Algorithms. Another algorithm that combines nongaussianity and time correlations was proposed for ICA estimation by Müller, Philips, and Ziehe (1999). This was constructed heuristically by combining two estimation criteria: one measuring nongaussianity and another one measuring time correlations. Using Kolmogoroff complexity, we find here a principled way of combining these criteria. In particular, we find the optimal weight that we should give to the part measuring nongaussianity with respect to the part measuring time correlations. Moreover, our method uses the nongaussianity of the residuals instead of the nongaussianity of the components. This is in line with the arguments advanced in Hyvärinen (1998a), where it was proposed that ICA should be applied on the innovation process instead of the original signals. The innovation process coincides in the present framework with the residual of an optimal (possibly nonlinear) predictor.
Hyvärinen (1998a) argued that the innovation process is likely to be more nongaussian and to have components that are more independent than the original data, and thus ICA algorithms should give better estimation results when applied on the innovation process. Thus, measuring the nongaussianity of the residuals should improve the separation results. A final point of difference between our method and that proposed in Müller et al. (1999) is that our algorithm allows for deflationary estimation of components in the spirit of projection pursuit. Finally, it is worth mentioning another ICA algorithm that combines time structure with nongaussianity (Attias, 2000). This algorithm models the time structure in a very different way, using a hidden Markov model with discrete states. A potential drawback of this approach is that it leads to quite complicated computations. In contrast, using autocorrelations allows computationally much simpler methods.

6 Simulation Results
We created four signals using an AR(1) model, with 5000 time points. Signals 1 and 2 were created with supergaussian innovations and signals 3 and
Figure 1: Convergence of complexity pursuit for artificially generated data. The error index shown is the squared distance of the separating matrix times the whitened mixing matrix (W V A) from the nearest (signed) permutation matrix. The median was taken over 10 runs with different random matrices and initial conditions.
4 with gaussian innovations; all innovations had unit variance. Signals 1 and 3 had identical autoregressive coefficients (0.25) and therefore identical autocovariances; signals 2 and 4 had identical coefficients (0.5) as well. To be able to validate the results, we performed source separation experiments. The signals were mixed as in ICA, using random mixing matrices. The step size \mu was taken equal to 1. The nonlinearity was chosen as g(u) = \tanh(u). Ordinary ICA or blind source separation methods based on nongaussianity would not be able to separate the two gaussian signals from each other. On the other hand, methods based on autocovariances would not be able to separate signals with identical autocovariances. Thus, practically all ICA or source separation algorithms would fail with these data. The only exception, to our knowledge, is the algorithm in Müller et al. (1999), as discussed in section 5.2. Figure 1 shows the convergence of our algorithm, using the same nonlinearity (tanh) for all components. The graph shows results for symmetric orthogonalization. For deflationary orthogonalization, we obtained similar results. In both cases, the algorithm correctly estimated the independent components in around 10 to 15 iterations. Note that a single generic nonlinearity that corresponds to supergaussian residuals was able to separate both gaussian and supergaussian signals, which indicates that the method is robust with respect to the choice of nonlinearity in much the same way as ICA.
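The data set described above is easy to regenerate; the following sketch (with our own seeding and scaling choices) builds the four AR(1) signals with unit-variance innovations and mixes them as in ICA:

```python
import numpy as np

rng = np.random.default_rng(7)
T = 5000

def ar1(a, innovations):
    y = np.zeros(len(innovations))
    for t in range(1, len(innovations)):
        y[t] = a * y[t - 1] + innovations[t]
    return y

# Laplace(scale=1/sqrt(2)) has unit variance, matching the text.
signals = np.vstack([
    ar1(0.25, rng.laplace(scale=1 / np.sqrt(2), size=T)),  # supergaussian
    ar1(0.50, rng.laplace(scale=1 / np.sqrt(2), size=T)),  # supergaussian
    ar1(0.25, rng.normal(size=T)),                         # gaussian
    ar1(0.50, rng.normal(size=T)),                         # gaussian
])
X = rng.normal(size=(4, 4)) @ signals   # random ICA-style mixing
```

Signals 1 and 3 (and likewise 2 and 4) then share their autocovariance structure while differing in the distribution of their innovations, which is what defeats methods using only one of the two criteria.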
896
Aapo Hyvärinen
7 Conclusion
We introduced an extension of projection pursuit in which the time structure of a time series (time-dependent signal) is taken into account instead of using the marginal distribution only. This is based on finding directions in which the coding complexity of the signal is minimized. This also provides a principled method for (possibly deflationary) estimation of independent components that are time dependent. The coding complexity can be approximated by a combination of the nongaussianity and the variance of the residual of a linear autoregressive model, leading to a computationally simple algorithm.

References

Amari, S.-I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind source separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Atick, J. (1992). Entropy minimization: A design principle for sensory perception? International Journal of Neural Systems, 3, 81–90.
Attias, H. (2000). Independent factor analysis with temporally structured sources. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., Meraim, K. A., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2), 434–444.
Cardoso, J.-F. (2000). Entropic contrasts for source separation: Geometry and stability. In S. Haykin (Ed.), Adaptive unsupervised learning. New York: Wiley.
Cardoso, J.-F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12), 3017–3030.
Cichocki, A., & Unbehauen, R. (1996). Robust neural networks with on-line learning for blind identification and blind separation of sources.
IEEE Transactions on Circuits and Systems, 43(11), 894–906.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45, 59–83.
Friedman, J. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82(397), 249–266.
Friedman, J. H., & Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9), 881–890.
Girolami, M. (1998). An alternative perspective on adaptive independent component analysis algorithms. Neural Computation, 10(8), 2103–2114.
Hochreiter, S., & Schmidhuber, J. (1999). Feature extraction through LOCOCODE. Neural Computation, 11(3), 679–714.
Huber, P. (1985). Projection pursuit. Annals of Statistics, 13(2), 435–475.
Hyvärinen, A. (1998a). Independent component analysis for time-dependent stochastic processes. In Proceedings of the International Conference on Artificial Neural Networks (ICANN'98) (pp. 135–140). Skövde, Sweden.
Hyvärinen, A. (1998b). New approximations of differential entropy for independent component analysis and projection pursuit. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 273–279). Cambridge, MA: MIT Press.
Hyvärinen, A. (1999a). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
Hyvärinen, A. (1999b). Sparse code shrinkage: Denoising of nongaussian data by maximum likelihood estimation. Neural Computation, 11(7), 1739–1768.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
Hyvärinen, A., & Oja, E. (1998). Independent component analysis by general nonlinear Hebbian-like learning rules. Signal Processing, 64(3), 301–313.
Jolliffe, I. (1986). Principal component analysis. Berlin: Springer-Verlag.
Jones, M., & Sibson, R. (1987). What is projection pursuit? Journal of the Royal Statistical Society, ser. A, 150, 1–36.
Jutten, C., & Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.
Karhunen, J., Oja, E., Wang, L., Vigário, R., & Joutsensalo, J. (1997). A class of neural networks for independent component analysis. IEEE Transactions on Neural Networks, 8(3), 486–504.
Matsuoka, K., Ohya, M., & Kawamoto, M. (1995).
A neural net for blind separation of nonstationary signals. Neural Networks, 8(3), 411–419.
Molgedey, L., & Schuster, H. G. (1994). Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett., 72, 3634–3636.
Müller, K.-R., Philips, P., & Ziehe, A. (1999). JADE_TD: Combining higher-order statistics and temporal information for blind source separation (with noise). In Proceedings of the International Workshop on Independent Component Analysis and Signal Separation (ICA'99) (pp. 87–92). Aussois, France.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273.
Oja, E. (1997). The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17(1), 25–46.
Pajunen, P. (1998a). Blind source separation using algorithmic information theory. Neurocomputing, 22, 35–48.
Pajunen, P. (1998b). Extensions of linear independent component analysis: Neural and information-theoretic methods. Unpublished doctoral dissertation, Helsinki University of Technology.
Pajunen, P. (1999). Blind source separation of natural signals based on approximate complexity minimization. In Proceedings of the International Workshop on
Independent Component Analysis and Signal Separation (ICA'99) (pp. 267–270). Aussois, France.
Tong, L., Liu, R.-W., Soon, V., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38, 499–509.
Vigário, R., Särelä, J., Jousmäki, V., Hämäläinen, M., & Oja, E. (2000). Independent component approach to the analysis of EEG and MEG recordings. IEEE Transactions on Biomedical Engineering, 47(5), 589–593.

Received February 23, 2000; accepted September 8, 2000.
LETTER
Communicated by David Mumford
Algebraic Analysis for Nonidentifiable Learning Machines

Sumio Watanabe
P&I Laboratory, Tokyo Institute of Technology, Yokohama, 226-8503 Japan

This article clarifies the relation between the learning curve and the algebraic geometrical structure of a nonidentifiable learning machine such as a multilayer neural network whose true parameter set is an analytic set with singular points. By using a concept in algebraic analysis, we rigorously prove that the Bayesian stochastic complexity or the free energy is asymptotically equal to λ₁ log n − (m₁ − 1) log log n + constant, where n is the number of training samples and λ₁ and m₁ are the rational number and the natural number that are determined as the birational invariant values of the singularities in the parameter space. We also show an algorithm to calculate λ₁ and m₁ based on the resolution of singularities in algebraic geometry. In regular statistical models, 2λ₁ is equal to the number of parameters and m₁ = 1, whereas in nonregular models, such as multilayer networks, 2λ₁ is not larger than the number of parameters and m₁ ≥ 1. Since the increase of the stochastic complexity is equal to the learning curve or the generalization error, nonidentifiable learning machines are better models than the regular ones if Bayesian ensemble learning is applied.

1 Introduction
Learning machines made of superpositions of homogeneous functions are often useful for constructing practical information systems. For example, layered neural networks, radial basis functions, and mixtures of normal distributions have been applied to many recognition and prediction systems (Watanabe & Fukumizu, 1995). They are written in the form
ψ(x, {a_h, b_h}) = Σ_{h=1}^{H} a_h g(b_h, x),
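A minimal sketch of this superposition form (the choice g(b, x) = tanh(b·x) and all parameter values are illustrative assumptions, not from the article):

```python
# psi(x, {a_h, b_h}) = sum_{h=1}^{H} a_h * g(b_h, x), with g a tanh ridge unit.
import numpy as np

def g(b, x):
    return np.tanh(b @ x)

def psi(x, a, B):
    # a: (H,) output weights; B: (H, M) inner parameters
    return sum(a_h * g(b_h, x) for a_h, b_h in zip(a, B))

x = np.array([0.5, -1.0])
a = np.array([1.0, -2.0, 0.5])                      # H = 3 hidden units
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(psi(x, a, B))

# Nonidentifiability in miniature: if a_h = 0, the corresponding b_h is
# arbitrary, so distinct parameter values realize the same function.
print(psi(x, np.array([0.0]), np.array([[3.0, 4.0]])))
print(psi(x, np.array([0.0]), np.array([[-1.0, 2.0]])))
```

The last two calls return the same value for different parameters, which is exactly the failure of one-to-one parameterization the article analyzes.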
where g is a function, x is an input vector, and {a_h, b_h} is a set of parameters to be optimized. One of their important properties is that they do not satisfy the regularity condition for the asymptotic normality of the maximum likelihood estimator in general (Hagiwara, Toda, & Usui, 1993; Fukumizu, 1996). For regular statistical models (Cramér, 1949; Murata, Yoshizawa, & Amari, 1994), the set of true parameters consists of only one point and the Fisher information matrix is positive definite, even if the learning model is larger

Neural Computation 13, 899–933 (2001)
© 2001 Massachusetts Institute of Technology
than necessary to attain the true distribution, which is called overrealizable. On the other hand, if a layered learning machine is in the overrealizable case, the set of true parameters is not one point but an analytic set with singularities, so estimated parameters do not converge to one point. In other words, the correspondence between the set of parameters and the set of functions is not a one-to-one mapping. In this article, such learning models are called nonidentifiable. Research on nonidentifiable learning machines is important for layered neural networks for three reasons. First, it is necessary for selecting the optimal model, which balances the function approximation error with the statistical estimation error. Although the true distribution is not contained in the parametric learning machine in a practical application, no information criterion can be obtained without studies of the overrealizable case (Hagiwara et al., 1993; Fukumizu, 1999; Watanabe, 1998a, 1999, 2000a). The distributions of estimated parameters in nonidentifiable cases should also be analyzed for testing hypotheses (Dacunha-Castelle & Gassiat, 1997; Knowles & Siegmund, 1989). In these studies, the asymptotic case is studied mainly based on the assumption that the number of training samples is sufficiently large. However, there is a similarity between the error behavior in a nonidentifiable machine with extensive data and that in a complex machine with fewer data, since the Fisher information matrices are ill defined in both cases. Therefore, we can expect that the asymptotic research for the nonidentifiable case is useful for complex learning machines in practical applications. Second, studies of nonidentifiable machines will clarify the essential difference between regular statistical models and artificial neural networks. The difference in the function approximation field has already been studied (Barron, 1994; Mhaskar, 1996), whereas that in the statistical estimation field has not.
This article shows that the generalization errors of nonidentifiable learning machines with the Bayesian method are not larger than those of identifiable and regular models, which can be proven under the condition that the function approximation error is negligible compared to the statistical estimation error. Finally, the theory for learning machines whose loss functions cannot be approximated by any quadratic form will be the foundation for improving neural network training algorithms. A lot of training algorithms have been devised on the assumption that the set of optimal parameters consists of only one point and that the Fisher information matrix is positive definite. However, this assumption is not satisfied in general. For example, we often find that a layered learning machine works very well when the functions of hidden units are almost linearly dependent. Training algorithms should be improved so that they make learning machines attain maximum performance even when they are in near-redundant states. The maximum likelihood method is not an appropriate training algorithm for complex and layered learning machines in such states.
In this article, in order to clarify the statistical properties of nonidentifiable learning machines, we prove that the Bayesian stochastic complexity or the free energy F(n) has the asymptotic form

F(n) = λ₁ log n − (m₁ − 1) log log n + O(1),

where n is the number of training samples and λ₁ and m₁ are the rational number and the natural number, respectively, which are determined by the singularities of the set of true parameters. The learning curves are determined by the algebraic geometrical structure of the parameter set. We also show an algorithm to calculate λ₁ and m₁ by using the technique of blowing up singularities, and we show that 2λ₁ is not larger than the number of parameters. Since the increase of the stochastic complexity F(n + 1) − F(n) is equal to the generalization error defined by the average Kullback distance of the estimated probability density from the true one (Levin, Tishby, & Solla, 1990; Amari, Fujita, & Shinomoto, 1992; Amari & Murata, 1993), our result claims that layered neural networks are better learning machines than regular statistical models if Bayesian estimation (Akaike, 1980; Mackay, 1992) or ensemble learning is applied in training. The free energy F(n), an important observable in Bayesian statistics, information theory, and mathematical physics, has many other names and applications. For example, it is called the Bayesian criterion in Bayesian model selection (Schwarz, 1974), stochastic complexity in universal coding (Rissanen, 1986; Yamanishi, 1998), Akaike's Bayesian criterion in optimization of hyperparameters (Akaike, 1980), and evidence in neural network learning (Mackay, 1992). In almost all of these studies, F(n) was calculated by using the gaussian approximation or the saddle-point approximation, based on the assumption that the loss function can be approximated by a quadratic form around the one true parameter. In neural network learning, we cannot use such an approximation.
This article constructs the general formula, which enables us to analyze both identifiable and nonidentifiable models in the same way. To study a loss function whose zero points contain singularities, we employ the Sato-Bernstein polynomial, or the so-called b-function, in algebraic analysis, which can extract algebraic information from the set of true parameters. Also, we construct an algorithm to calculate the constants λ₁ and m₁ by using the resolution of singularities in algebraic geometry. Resolution of singularities transforms an integral of several variables into a direct product of integrals of one variable and enables us to calculate algorithmically the learning efficiency of an arbitrary learning machine in Bayesian estimation.

2 Main Results
Let p(y|x, w) be a conditional probability density of an output y ∈ R^N for a given input x ∈ R^M and a given parameter w ∈ R^d, which represents
a probabilistic inference of a learning machine. Let Q(w) be a probability density function on the parameter space R^d, whose support is denoted by W = supp Q ⊂ R^d. We assume that training or empirical sample pairs {(x_i, y_i); i = 1, 2, ..., n} are independently taken from q(y|x)q(x), where q(x) and q(y|x) represent the true input probability and the true inference probability, respectively. In Bayesian estimation, the estimated inference p_n(y|x) is the average over the a posteriori ensemble,

p_n(y|x) = ∫ p(y|x, w) ρ_n(w) dw,   ρ_n(w) = (1/Z_n) Q(w) Π_{i=1}^{n} p(y_i|x_i, w),

where Z_n is the constant that ensures ∫ ρ_n(w) dw = 1. The generalization error is defined as the average Kullback distance of the estimated probability density p_n(y|x) from the true one q(y|x),

K(n) = E_n { ∫ q(x, y) log [ q(y|x) / p_n(y|x) ] dx dy },

where E_n{·} denotes the expectation value over all sets of training sample pairs. In this article, we mainly consider the statistical estimation error and assume that the model can attain the true inference; in other words, there exists a parameter w₀ ∈ W such that p(y|x, w₀) = q(y|x). Let us define the average and empirical loss functions:

f(w) = ∫ q(y|x) q(x) log [ q(y|x) / p(y|x, w) ] dx dy,

f_n(w) = (1/n) Σ_{i=1}^{n} log [ q(y_i|x_i) / p(y_i|x_i, w) ].
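For intuition, the posterior ρ_n(w) and the predictive density p_n(y|x) can be computed on a grid for a toy model (a sketch under assumed choices: p(y|w) = N(y; w, 1) with no input x, uniform prior, true parameter w₀ = 0):

```python
# Grid-based Bayesian predictive density p_n(y) = \int p(y|w) rho_n(w) dw.
import numpy as np

rng = np.random.default_rng(1)
w = np.linspace(-2.0, 2.0, 4001)
dw = w[1] - w[0]
prior = np.full_like(w, 0.25)                    # uniform Q(w) on [-2, 2]

def lik(y, w):
    return np.exp(-0.5 * (y - w) ** 2) / np.sqrt(2 * np.pi)

y_data = rng.normal(0.0, 1.0, size=50)           # samples from q(y) = p(y|w0)
log_post = np.log(prior) + sum(np.log(lik(y, w)) for y in y_data)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dw                          # rho_n(w), normalized

def predictive(y):
    return np.sum(lik(y, w) * post) * dw         # p_n(y)

print(predictive(0.0))
```

With n = 50 the posterior concentrates near w₀ and p_n(y) approaches q(y); the generalization error K(n) above is exactly the average Kullback distance between these two densities.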
Note that f(w) ≥ 0 is the Kullback information. By the assumption, the set of true parameters

W₀ = {w ∈ W; f(w) = 0}

is not an empty set. If f(w) is an analytic function, then W₀ is called an analytic set of f(w). If f(w) is a polynomial, then W₀ is called an algebraic variety. W₀ is not a manifold in general, since no coordinate can be introduced in the neighborhood of singular points. For example, layered neural networks such as three-layered perceptrons and radial basis functions have many singular points.

Remark. Even if W₀ consists of one point or if the true distribution is not contained in the parametric models (W₀ is an empty set), the finiteness of the number of training samples often makes W₀ seem to be nonidentifiable in neural network models. For the purpose of studying such a case rigorously, we assume that W₀ is not empty.
From these definitions, it is proven in Levin et al. (1990) and Amari (1993) that the average Kullback distance K(n) is equal to the increase of the Bayesian stochastic complexity or the free energy F(n),

K(n) = F(n + 1) − F(n),

where F(n) is defined by

F(n) = −E_n { log ∫ exp(−n f_n(w)) Q(w) dw }.

In this article, we show the rigorous asymptotic form of F(n) and clarify the algebraic geometrical structure of the stochastic complexity. Theorems 1 and 2 are the main results of this article. Let C₀^∞ be the set of all compact-support C^∞-class functions on R^d.

Theorem 1. Assume that f(w) is an analytic function and Q(w) is a probability density function on R^d. Then there exists a real constant C₁ such that for any natural number n,

F(n) ≤ λ₁ log n − (m₁ − 1) log log n + C₁,   (2.1)

where the rational number −λ₁ (λ₁ > 0) and the natural number m₁ are the largest pole and its multiplicity of the meromorphic function that is analytically continued from

J(λ) = ∫_{f(w) < ε} f(w)^λ Q̄(w) dw   (Re(λ) > 0),

where ε > 0 is a sufficiently small constant and Q̄(w) is an arbitrary nonzero C₀^∞-class function that satisfies 0 ≤ Q̄(w) ≤ Q(w).

The proof of theorem 1 is shown in section 3. It is also proven in lemma 2 in section 3 that all poles of the function J(λ) lie on the negative part of the real axis. In theorem 1, "largest pole" means the largest as a real value. For theorem 2, we define a condition. Let ψ(x, w) be a vector-valued function of (x, w) ∈ R^M × R^d. We define two conditions on ψ(x, w):

Condition A.
1. ψ(x, w) is an analytic function of w ∈ W = supp Q ⊂ R^d, which can be extended as a holomorphic function on some complex open set W*, where W ⊂ W* ⊂ C^d and W* is independent of x ∈ supp q ⊂ R^M.

2. ψ(x, w) is a measurable function of x ∈ R^M that satisfies the condition

∫ sup_{w ∈ W*} ||ψ(x, w)||² q(x) dx < ∞,   (2.2)

where ||·|| is the norm of the vector ψ(x, w).
Theorem 2. Let σ > 0 be a constant value. Assume that Q(w) is a C₀^∞-class probability density function. Let us consider a model

p(y|x, w) = (1 / (2πσ²)^{N/2}) exp( −||y − ψ(x, w)||² / (2σ²) ),

where both ψ(x, w) and ||ψ(x, w)||² satisfy condition A. Then there exists a constant C₂ > 0 such that for any natural number n,

|F(n) − λ₁ log n + (m₁ − 1) log log n| ≤ C₂,

where the rational number −λ₁ (λ₁ > 0) and a natural number m₁ are the largest pole and its multiplicity of the meromorphic function that is analytically continued from

J(λ) = ∫_{f(w) < ε} f(w)^λ Q(w) dw   (Re(λ) > 0),

where ε > 0 is a sufficiently small constant. This theorem determines the behavior of the asymptotic stochastic complexity. The proof of theorem 2 is shown in section 4.

Remark. A real-valued analytic function ψ(x, w) of x ∈ R^M and w ∈ R^d is said to have the associated convergence radii (r₁, r₂, ..., r_d) at (x, ŵ) if the Taylor expansion
ψ(x, w) = Σ_{j₁, j₂, ..., j_d} a_{j₁ j₂ ··· j_d}(x) (w₁ − ŵ₁)^{j₁} (w₂ − ŵ₂)^{j₂} ··· (w_d − ŵ_d)^{j_d}

absolutely converges in {w ∈ R^d; |w_j − ŵ_j| < r_j (j = 1, 2, ..., d)} and diverges in {w ∈ R^d; |w_j − ŵ_j| > r_j (j = 1, 2, ..., d)}. For an N-dimensional vector-valued analytic function, the convergence radii are defined as (min r₁, min r₂, ..., min r_d), where min denotes the minimum among the corresponding N values. The associated convergence radii at (x, ŵ) are denoted by (r₁(x, ŵ), ..., r_d(x, ŵ)). A real analytic function ψ(x, w) can be extended as a holomorphic function on some open set W* ⊂ C^d independent of x if and only if

min_{1 ≤ j ≤ d} inf_{x ∈ K} inf_{w ∈ W} r_j(x, w) > 0
holds. In theorems 1 and 2, we introduced the meromorphic function J(λ), whose largest pole and its multiplicity determine the learning efficiency of
the model. It is an important fact that J(λ) is invariant under the transform

f(w) ↦ f(g(u)),   Q(w) dw ↦ Q(g(u)) |g′(u)| du,

where g: U → W is an arbitrary analytic function from some parameter space U to the given parameter space W and |g′(u)| is its Jacobian. Therefore, the constants λ₁ and m₁ are also invariant under the transform. The analytic function g, which need not have an inverse function g⁻¹, is sometimes called a birational mapping, and algebraic geometry clarifies the geometrical structure of the parameter space that is invariant under birational mappings.

To consider the learning curve or the generalization error, let us introduce the definition of the asymptotic expansion. Let {s_i(n), i = 1, 2, 3, ...} be a sequence of real-valued functions of a natural number n that satisfies

lim_{n→∞} s_{i+1}(n) / s_i(n) = 0   (i = 1, 2, ...).

This condition is referred to as s_{i+1}(n) = o(s_i(n)) (i = 1, 2, 3, ...). If a real-valued function s(n) satisfies the condition

lim_{n→∞} (1 / s_k(n)) { s(n) − Σ_{i=1}^{k} a_i s_i(n) } = 0   (k = 1, 2, 3, ..., K),

where {a_i} are real values, then we say that s(n) has an asymptotic expansion

s(n) ≅ Σ_{i=1}^{K} a_i s_i(n).   (2.3)

Note that the coefficients {a_i} are determined uniquely. This definition contains the case K → ∞. For a real-valued function s(x) of the real variable x, the asymptotic expansion for x → a (a is some value) is defined in the same way. Based on theorem 2, the function c(n), defined by

c(n) = F(n) − λ₁ log n + (m₁ − 1) log log n,

is a bounded function, |c(n)| < C₂. If c(n + 1) − c(n) = o(1/(n log n)), then it follows that

K(n) = λ₁/n − (m₁ − 1)/(n log n) + o(1/(n log n)),

which gives the learning curve of a nonidentifiable learning machine.
Corollary 1. Assume the same condition as theorem 2. If c(n + 1) − c(n) = o(1/(n log n)), then the learning curve is given by

K(n) ≅ λ₁/n − (m₁ − 1)/(n log n).

Regular statistical models have λ₁ = d/2 and m₁ = 1, which can be shown as a special case of theorem 2 (see example 1). Nonidentifiable models such as neural networks have different values, λ₁ ≤ d/2 and m₁ ≥ 1, in general.

Corollary 2. Assume the same condition as theorem 1. If Q̄(w) can be taken such that Q̄(w) > 0 for arbitrary w ∈ W₀, then λ₁ ≤ d/2, where d is the dimension of the parameter space.
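For concreteness (an illustration, not taken from the article): when J(λ) is a rational function, λ₁ and m₁ can be read off symbolically. Here the two J(λ) are computed by hand for toy losses with uniform priors on [0, 1]^d:

```python
# Reading off (lambda1, m1) from the largest pole of J(lambda).
import sympy as sp

lam = sp.symbols('lam')

def largest_pole(J):
    """Return (lambda1, m1) for a rational J(lambda): minus its largest
    real pole and that pole's multiplicity."""
    J = sp.together(sp.simplify(J))
    poles = sp.roots(sp.denom(J), lam)      # {pole: multiplicity}
    p = max(poles, key=sp.re)
    return -p, poles[p]

# Regular 1-d model, f(w) = w^2, Q uniform on [0, 1]:
# J(lam) = int_0^1 w^(2 lam) dw = 1/(2 lam + 1)  (computed by hand)
lam1, m1 = largest_pole(1 / (2 * lam + 1))           # lambda1 = 1/2 = d/2, m1 = 1

# Singular 2-d model, f(w) = w1^2 * w2^2: the integral factorizes, so
# J(lam) = 1/(2 lam + 1)^2, a double pole at lam = -1/2.
lam1_s, m1_s = largest_pole(1 / (2 * lam + 1) ** 2)  # lambda1 = 1/2 < d/2, m1 = 2

print(lam1, m1, lam1_s, m1_s)
```

In the singular case λ₁ = 1/2 < d/2 = 1 and m₁ = 2, so by corollary 1 its learning curve λ₁/n − (m₁ − 1)/(n log n) lies below that of a regular two-parameter model, consistent with corollary 2.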
Corollary 2 is proved in section 5. Even for a nonidentifiable learning machine, we can calculate λ₁ and m₁ based on the resolution technique of singularities in algebraic geometry. An algorithm to calculate λ₁ and m₁ for a given learning machine is also shown in section 5.

3 Proof of Theorem 1
In this and the following section, we show the proofs of theorems 1 and 2. These proofs not only ensure the theorems but also clarify the mathematical structure of Bayesian learning in artificial neural network models.

Lemma 1. Assume that f(w) and f_n(w) are continuous functions and that Q(w) is a probability density function. Then the inequality

F(n) ≤ −log ∫ exp(−n f(w)) Q(w) dw

holds for an arbitrary natural number n (Opper & Haussler, 1995).

Proof of Lemma 1. From Jensen's inequality,

−log ∫ exp(a(w)) b(w) dw ≤ −∫ a(w) b(w) dw

holds for an arbitrary continuous function a(w) and an arbitrary compact-support probability density b(w). First, assume that Q(w) is a compact support function. By applying the above inequality to the special case

a(w) = −n {f_n(w) − f(w)},   b(w) = (1/Y) exp(−n f(w)) Q(w),

where Y = ∫ exp(−n f(w)) Q(w) dw, we obtain

−log (1/Y) ∫ exp(−n f_n(w)) Q(w) dw ≤ (1/Y) ∫ n {f_n(w) − f(w)} exp(−n f(w)) Q(w) dw.
By using E_n{f_n(w) − f(w)} = 0 and Fubini's theorem, it follows that

−E_n { log ∫ exp(−n f_n(w)) Q(w) dw } ≤ −log ∫ exp(−n f(w)) Q(w) dw

for an arbitrary compact support function Q(w). Since the set of all compact support functions is dense in the set of all probability density functions in the L¹ norm, lemma 1 is obtained.

For a given ε > 0, a set of parameters W_ε is defined by

W_ε = {w ∈ W ≡ supp Q; f(w) < ε}.

We introduce the Sato-Bernstein b-function.

Theorem 3. Assume that there exists ε₀ > 0 such that f(w) is an analytic function in W_{ε₀}. Then there exists a triple (ε, P, b), where

1. ε < ε₀ is a positive constant,
2. P = P(λ, w, ∂_w) is a differential operator that is a polynomial in λ, and
3. b(λ) is a polynomial,

such that

P(λ, w, ∂_w) f(w)^{λ+1} = b(λ) f(w)^λ   (∀ w ∈ W_ε, ∀ λ ∈ C).

The zeros of the algebraic equation b(λ) = 0 are real, rational, and negative numbers (Sato & Shintani, 1974; Bernstein, 1972; Björk, 1979; Kashiwara, 1976).

In theorem 3, P = P(λ, w, ∂_w) is a finite-order differential operator whose coefficients are analytic functions of w. P can be understood as a mapping from C^∞ to C^∞. The adjoint operator P*, defined formally in the same way as for the usual partial differential operators, satisfies the relation

∫ φ(w) P(λ, w, ∂_w) Q(w) dw = ∫ P(λ, w, ∂_w)* φ(w) Q(w) dw

for any φ ∈ C^∞, Q ∈ C₀^∞. Theorem 3 was proven based on the algebraic property of the ring of partial differential operators (see Bernstein, 1972;
Sato & Shintani, 1974; Björk, 1979). The rationality of the zeros of b(λ) = 0 is shown based on the resolution of singularities (Atiyah, 1970; Kashiwara, 1976). The smallest-order polynomial b(λ) that satisfies the above relation is called a Sato-Bernstein polynomial or a b-function. Recently an algorithm to calculate the b-function has been developed (Oaku, 1997).

Remark. By setting ε > 0 small enough, we can assume

||∇f(w)|| > 0   (∀ w ∈ W_ε \ W₀),   (3.1)

because f(w) is analytic. Hereafter ε > 0 is taken small enough that both theorem 3 and equation 3.1 hold. We can also assume ε < 1 without loss of generality. For a given analytic function f(w), let us define a complex function J(λ) of λ ∈ C by

J(λ) = ∫_{W_ε} f(w)^λ Q(w) dw.
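The relation in theorem 3 can be checked symbolically for the simplest example (a textbook case, not taken from the article): for f(w) = w², the operator P = ∂²/∂w² works, and its b-function b(λ) = (2λ + 2)(2λ + 1) has zeros −1 and −1/2, which are indeed real, rational, and negative:

```python
# Verify P f^(lam+1) = b(lam) f^lam for f(w) = w^2 and P = d^2/dw^2.
import sympy as sp

w = sp.symbols('w', positive=True)
lam = sp.symbols('lam')

f = w ** 2
lhs = sp.diff(f ** (lam + 1), w, 2)          # P f^(lam+1)
b = sp.simplify(lhs / f ** lam)              # a polynomial in lam, free of w
print(sp.factor(b))                          # factors as 2*(lam + 1)*(2*lam + 1)
print(sp.solve(b, lam))                      # zeros: -1 and -1/2
```

Note how the largest zero of b, λ = −1/2, matches the largest pole of J(λ) for this same f found via the continuation argument below.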
Lemma 2. Assume that f(w) is an analytic function in W_ε and that Q(w) is a C₀^∞-class function. Then J(λ) can be analytically continued to a meromorphic function on the entire complex plane; in other words, J(λ) has poles only in |λ| < ∞. Moreover, J(λ) satisfies the following conditions:

1. The poles of J(λ) are rational, real, and negative numbers.
2. For an arbitrary a ∈ R, J(∞ + a√−1) = 0 and J(a ± ∞·√−1) = 0.

Proof. J(λ) is an analytic function in the region Re(λ) > 0. First, J(∞ + a√−1) = 0 is shown by Lebesgue's convergence theorem. Second, let us show J(a ± ∞·√−1) = 0 for fixed a > 0. For t > 0 we define a function

Ĵ_a(t) = exp(at) ∫_{W_ε} δ(t − log f(w)) Q(w) dw,

where Ĵ_a(t) is well defined because δ(g(w)) is well defined if ||∇g(w)|| > 0 for any w that satisfies g(w) = 0 (Gel'fand & Shilov, 1964). From the definition, Ĵ_a(t) is an L¹ function, and

J(a ± b√−1) = ∫_{−∞}^{log ε} Ĵ_a(t) exp(±bt√−1) dt,

which shows that we can apply the Riemann-Lebesgue convergence theorem, resulting in J(a ± ∞·√−1) = 0 for fixed a > 0. Finally, by using the
formal adjoint operator P*,

J(λ) = (1/b(λ)) ∫_{W_ε} P f(w)^{λ+1} Q(w) dw = (1/b(λ)) ∫_{W_ε} f(w)^{λ+1} P* Q(w) dw,

where we used the property of the b-function. Because P*Q ∈ C₀^∞, this relation analytically continues J(λ), wherever b(λ) ≠ 0, one unit strip at a time to the entire complex plane. By using analytic continuation, J(a ± ∞·√−1) = 0 even for a < 0. If b(λ) = 0, then such λ is at most a pole on the negative part of the real axis.

Definition. Poles of the function J(λ) are on the negative part of the real axis and are contained in the set {m + ν; m = 0, −1, −2, ..., b(ν) = 0}. They are ordered from larger to smaller and referred to as −λ₁, −λ₂, −λ₃, ... (λ_k > 0 is a rational number), and the multiplicity of −λ_k is denoted by m_k.
We define a real-valued function I(t) by

I(t) = ∫_{W_ε} δ(t − f(w)) Q(w) dw   (0 < t < ε).   (3.2)

Here, since ||∇f(w)|| > 0 for w ∈ W_ε \ W₀, δ(t − f(w)) is well defined (Gel'fand & Shilov, 1964). For t ≥ ε or t < 0, we define I(t) = 0.

Lemma 3. Assume that f(w) is an analytic function in W_ε and that Q(w) is a C₀^∞-class function. Then I(t) has an asymptotic expansion for t → 0,

I(t) ≅ Σ_{k=1}^{∞} Σ_{m=0}^{m_k−1} c_{k,m+1} t^{λ_k−1} (−log t)^m,   (3.3)

where (m! · c_{k,m+1}) is the coefficient of the (m+1)th order in the Laurent expansion of J(λ) at the pole λ = −λ_k.

Proof. The special case of this lemma is shown in the theory of distributions (Gel'fand & Shilov, 1964). Let us define

I_K(t) ≡ I(t) − Σ_{k=1}^{K} Σ_{m=0}^{m_k−1} c_{k,m+1} t^{λ_k−1} (−log t)^m.

For equation 3.3, it is sufficient to show that for an arbitrary fixed K,

lim_{t→0} I_K(t) t^λ = 0   (∀ λ > −λ_{K+1} + 1).   (3.4)
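The state density I(t) of lemma 3 has a closed form for the toy singular loss f(w) = (w₁w₂)² with a uniform prior on [0, 1]², which can be checked by Monte Carlo (an illustration, not from the article):

```python
# For f(w) = (w1*w2)^2 under the uniform prior, P(f <= t) = sqrt(t)*(1 - log(sqrt(t))),
# so I(t) = dP/dt = (-log t)/(4*sqrt(t))  ~  c * t^(lambda1 - 1) * (-log t)^(m1 - 1)
# with lambda1 = 1/2 and m1 = 2, matching the pole structure of J(lambda).
import numpy as np

rng = np.random.default_rng(2)
f = (rng.random(1_000_000) * rng.random(1_000_000)) ** 2

for t in [1e-4, 1e-3, 1e-2]:
    empirical = np.mean(f <= t)                    # Monte Carlo CDF
    exact = np.sqrt(t) * (1 - 0.5 * np.log(t))     # closed-form CDF
    print(t, empirical, exact)
```

The t^{λ₁−1} factor comes from the square-root vanishing of f and the extra (−log t) factor, giving m₁ = 2, from the crossing of the two coordinate axes in W₀.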
From the definition of J(λ),

J(λ) = ∫₀¹ I(t) t^λ dt.

A simple calculation shows

∫₀¹ t^{λ+λ_k−1} (−log t)^m dt = m! / (λ + λ_k)^{m+1}.

Therefore,

∫₀¹ I_K(t) t^λ dt = J_K(λ),

where J_K(λ) is defined by

J_K(λ) = J(λ) − Σ_{k=1}^{K} Σ_{m=0}^{m_k−1} m! · c_{k,m+1} / (λ + λ_k)^{m+1},
which is holomorphic in the region Re(λ) > −λ_{K+1}. By putting t = e^{−x} and by using the inverse Laplace transform,

I_K(e^{−x}) e^{−x} = (1/(2πi)) ∫_{t−i∞}^{t+i∞} J_K(u) e^{ux} du   (3.5)

holds for any real t > 0. By lemma 2, J(a ± i∞) = 0; thus, J_K(a ± i∞) = 0 for an arbitrary real a. Since J_K(λ) is holomorphic in the region Re(λ) > −λ_{K+1}, the complex integration path in equation 3.5 can be moved from t > 0 to t > −λ_{K+1}. Therefore,

I_K(e^{−x}) e^{−x−tx} = (1/(2π)) ∫_{−∞}^{+∞} J_K(t + iu) e^{iux} du   (3.6)

holds for any real x. Hence, by putting x = 0, J_K(t + iu) ∈ L¹. The term on the right-hand side of equation 3.6 goes to zero when x → ∞ because of the Riemann-Lebesgue convergence theorem. Here, by putting x = −log t, we obtain equation 3.4.

Proof of Theorem 1. By combining the above results, we have

F(n) ≤ −log ∫_W exp(−n f(w)) Q(w) dw ≤ −log ∫_{W_ε} exp(−n f(w)) Q̄(w) dw
= −log ∫₀^ε e^{−nt} I(t) dt = −log ∫₀^{nε} e^{−t} I(t/n) (dt/n)
≅ −log { Σ_k Σ_{m=0}^{m_k−1} Σ_{j=0}^{m} c_{k,m+1} (m choose j) ((log n)^j / n^{λ_k}) ∫₀^{nε} e^{−t} t^{λ_k−1} (−log t)^{m−j} dt }
= λ₁ log n − (m₁ − 1) log log n + O(1),

where I(t) is defined by equation 3.2 with Q̄(w) instead of Q(w).
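The asymptotic form just derived can be illustrated numerically (a toy check under assumed uniform priors, not part of the proof): for the regular loss f(w) = w² on [0, 1], F̄(n) = −log ∫ e^{−nf(w)} dw grows with slope λ₁ = 1/2 in log n, while for the singular loss f(w) = w₁²w₂² (λ₁ = 1/2, m₁ = 2) the −log log n correction is visible:

```python
# Midpoint-rule check of  -log \int exp(-n f(w)) dw
#   ~  lambda1*log(n) - (m1 - 1)*log(log(n)) + O(1).
import numpy as np

def Fbar(f_vals, n):
    # f_vals: values of f on a midpoint grid of the unit interval/square
    return -np.log(np.mean(np.exp(-n * f_vals)))

m = 2000
u = (np.arange(m) + 0.5) / m

# Regular case, d = 1, f(w) = w^2 (lambda1 = 1/2, m1 = 1):
f_reg = u ** 2
slope = (Fbar(f_reg, 10**6) - Fbar(f_reg, 10**3)) / np.log(10**3)

# Singular case, d = 2, f(w) = w1^2 * w2^2 (lambda1 = 1/2, m1 = 2):
f_sing = np.outer(u ** 2, u ** 2)
delta = Fbar(f_sing, 10**6) - Fbar(f_sing, 10**3)
with_loglog = 0.5 * np.log(10**3) - (np.log(np.log(10**6)) - np.log(np.log(10**3)))
without = 0.5 * np.log(10**3)

print(slope)                     # close to lambda1 = 0.5
print(delta, with_loglog, without)
```

The singular model's increment between n = 10³ and n = 10⁶ is much closer to the prediction that includes the −log log n term, reflecting m₁ = 2.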
4 Proof of Theorem 2
Hereafter, we assume that the model is given by

p(y|x, w) = (1/√(2π)) exp( −(1/2)(y − ψ(x, w))² )

for simple description. It is easy to extend the proof to a general standard deviation (σ > 0) and a general output dimension (N > 1). For this model,

f(w) = (1/2) ∫ (ψ(x, w) − ψ(x, w₀))² q(x) dx,

f_n(w) = (1/2n) Σ_{i=1}^{n} (ψ(x_i, w) − ψ(x_i, w₀))² − (1/n) Σ_{i=1}^{n} η_i (ψ(x_i, w) − ψ(x_i, w₀)),

where {η_i ≡ y_i − ψ(x_i, w₀)} are independent samples from the standard normal distribution and w₀ is an arbitrary element in W₀.

Lemma 4. Let {x_i, η_i}_{i=1}^{n} be a set of independent samples taken from q(x)q₀(η), where q(x) is a probability density and q₀(η) is the standard normal distribution. Assume that a function ξ(x, w) satisfies condition A and that the Taylor expansion of ξ(x, w) at ŵ absolutely converges in a region T = {w; |w_j − ŵ_j| < r_j}. For a given constant 0 < a < 1, we define a region T_a ≡ {w; |w_j − ŵ_j| < a r_j}. Then the following hold:

1. If ∫ ξ(x, w) q(x) dx = 0, there exists a constant c₀ such that for an arbitrary n,

A_n ≡ E_n { sup_{w ∈ T_a} | (1/√n) Σ_{i=1}^{n} ξ(x_i, w) |² } < c₀ < ∞.

2. There exists a constant c₀′ such that for an arbitrary n,

B_n ≡ E_n { sup_{w ∈ T_a} | (1/√n) Σ_{i=1}^{n} η_i ξ(x_i, w) |² } < c₀′ < ∞.

Proof. We show statement 1; statement 2 can be proven in the same
way. We denote k D ( k1 , k2 , . . . , kd ) and j (x, w) D D
1 X
a k (x )(w ¡ wO )k
kD 0 1 X
k1 ,...,kd D 0
a k1 ¢¢¢kd (x)(w1 ¡ wO1 ) k1 ¢ ¢ ¢ (wd ¡ wOd )kd .
912
Sumio Watanabe
This power series converges absolutely in T; therefore, φ(x, w) can be extended to a holomorphic function in T. By using Cauchy's integral formula for several complex variables,

a_k(x) = (1/(2πi)^d) ∫_{C₁} · · · ∫_{C_d} φ(x, w) / Π_j (w_j − ŵ_j)^{k_j+1} dw₁ · · · dw_d,   (4.1)

where C_j is a circle with radius r_j − δ. We define a function M(x) by

M(x) = sup_{w ∈ T_a} |φ(x, w)|.

Then by equation 2.2 in condition A,

M ≡ { ∫ M(x)² q(x) dx }^{1/2} < ∞,

and there exists δ > 0 such that

|a_k(x)| ≤ M(x) / Π_{j=1}^d |r_j − δ|^{k_j}.   (4.2)

By the assumption, ∫ a_k(x) q(x) dx = 0. Thus,
a^{(1)}(n) = E_n { sup_{w ∈ W\W₀} [ Σ_{j=1}^J (g_j(w)/√f(w)) (1/√n) Σ_{i=1}^n g_i h_j(x_i, w) ]² }
           ≤ C E_n { Σ_{j=1}^J sup_{w ∈ W\W₀} [ (1/√n) Σ_{i=1}^n g_i h_j(x_i, w) ]² },

where C bounds Σ_j g_j(w)²/f(w) on W\W₀, and the right-hand side is bounded by some constant by lemma 4. Second, we show finiteness of a^{(2)}(n). By the assumption that both y(x, w) and y(x, w)² satisfy condition A, the inequality

∫ {1 + M(x)²} (y(x, w) − y(x, w₀))² q(x) dx < ∞

holds, where M(x) = sup_{w∈W} |y(x, w) − y(x, w₀)|. By using lemma 6, there exists a finite set of functions {g_j, h_j}_{j=1}^J, where g_j(w) is a real-valued analytic function and h_j(x, w) satisfies condition A with {1 + M(x)²} q(x) instead of q(x), such that

y(x, w) − y(x, w₀) = Σ_{j=1}^J g_j(w) h_j(x, w),
where the two matrices,

L_{jk}(w) ≡ ∫ h_j(x, w) h_k(x, w) q(x) dx,
N_{jk}(w) ≡ ∫ h_j(x, w) h_k(x, w) {1 + M(x)²} q(x) dx,

are positive definite. Hence,

∫ (y(x, w) − y(x, w₀))⁴ q(x) dx / ∫ (y(x, w) − y(x, w₀))² q(x) dx ≤ Σ_{j,k} N_{jk}(w) g_j(w) g_k(w) / Σ_{j,k} L_{jk}(w) g_j(w) g_k(w) ≤ C,

where C > 0 is a constant. Thus,

f(w) ≥ (1/C) ∫ {y(x, w) − y(x, w₀)}⁴ q(x) dx,

which ensures that

a^{(2)}(n) ≤ C E_n { sup_{w ∈ W\W₀} [ · · · / ∫ {y(x, w) − y(x, w₀)}⁴ q(x) dx ] }.

Here finiteness of the last term can be shown in the same way as for a^{(1)}(n), where {y(x, w) − y(x, w₀)}² is decomposed by lemma 6 instead of y(x, w) − y(x, w₀).

Proof of Theorem 2.
We define

a_n = sup_{w ∈ W\W₀} |f_n(w)|.

Then, by theorem 4, E_n{a_n²} < ∞. The free energy, or the Bayesian stochastic complexity, satisfies

F(n) = −E_n { log ∫_W exp(−n f_n(w)) Q(w) dw }
     = −E_n { log ∫_W exp(−n f(w) − √(n f(w)) f_n(w)) Q(w) dw }
     ≥ −E_n { log ∫_W exp(−n f(w) + a_n √(n f(w))) Q(w) dw }
     ≥ −log ∫_W exp(−(n/2) f(w)) Q(w) dw − (1/2) E_n{a_n²},

where we used the inequality a_n √(n f(w)) ≤ (a_n² + n f(w))/2. Let us define Z_i(n) (i = 1, 2) by

Z_i(n) = ∫_{W^{(i)}} exp(−(n/2) f(w)) Q(w) dw,

where W^{(1)} = W₂ and W^{(2)} = W \ W₂. Then by the same method as in the proof of theorem 1,

Z₁(n) ≅ c_{1,m₁} (log n)^{m₁−1} / n^{λ₁}.

On the other hand,

Z₂(n) ≤ ∫_{W\W₂} exp(−n ε²/2) Q(w) dw ≤ exp(−n ε²/2).

Therefore,

F(n) ≥ −log { c_{1,m₁} (log n)^{m₁−1} / n^{λ₁} + exp(−n ε²/2) } + const.,

which completes theorem 2.

5 Algorithm to Calculate the Learning Efficiency
Theorem 2 shows that the important values λ₁ and m₁ are determined by the meromorphic function J(λ). However, J(λ) is defined by an integral over several variables, and it is not so easy to determine its largest pole and its multiplicity. If J(λ) is given by an integral of a single variable, for example, if

J(λ) = ∫₀² x^{2λ} x^r dx,

then the pole of J(λ) is −λ₁ = −(r + 1)/2, and its multiplicity is m₁ = 1. The resolution of singularities in algebraic geometry transforms an arbitrary integral of several variables into an integral whose essential term is a direct product of integrals of single variables. In fact, Atiyah (1970) showed that theorem 5 is directly proven from Hironaka's theorem (Hironaka, 1964).

Theorem 5. Let f(w) be a real analytic function defined in a neighborhood of 0 ∈ R^d. Then there exist an open set W ⊃ {0}, a real analytic manifold U, and a proper analytic map g : U → W such that:
1. g : U \ U₀ → W \ W₀ is a biregular map, where W₀ = f⁻¹(0) and U₀ = g⁻¹(W₀).
2. For each P ∈ U there are local analytic coordinates (u₁, . . . , u_d) centered at P so that, locally near P, we have

f(g(u₁, . . . , u_d)) = a(u₁, . . . , u_d) u₁^{k₁} u₂^{k₂} · · · u_d^{k_d},   (5.1)

where a(u) is an invertible analytic function and k_i ≥ 0.

This theorem is a special version of Hironaka's well-known resolution of singularities in algebraic geometry. Since f(w) is not a polynomial but an analytic function, the singularities of f can be resolved only locally. Hironaka's proof is completely constructive, and the above map g can be constructed by finite recursive procedures, blowing up nonsingular manifolds contained in the singular set. In the theorem, both U and W are locally compact Hausdorff spaces, and g is continuous; in this case, g is a "proper map" if and only if g⁻¹(K) is compact for an arbitrary compact set K. The following algorithm is used to calculate λ₁ and m₁:

1. Cover the analytic set (the set of the true parameters) W₀ = {w ∈ supp Q : f(w) = 0} by a finite union of open neighborhoods W_α.
2. For each neighborhood W_α, find a resolution map g_α and a manifold U_α by blowing up singularities. Since g_α is a proper map and the closure of W_α is compact, U_α is covered by a finite union of open sets {U_αβ} whose closures are homeomorphic to compact sets in some Euclidean spaces.

3. For each neighborhood W_αβ = g_α(U_αβ), the function J_αβ(λ) is calculated by equation 5.1:

J_αβ(λ) ≡ ∫_{W_αβ} f(w)^λ Q(w) dw
        = ∫_{U_αβ} f(g_α(u))^λ Q(g_α(u)) |g_α′(u)| du
        = ∫_{U_αβ} a(u)^λ u₁^{λk₁} u₂^{λk₂} · · · u_d^{λk_d} Q(g_α(u)) |g_α′(u)| du,

where |g_α′(u)| is the Jacobian. Note that {u : |g_α′(u)| = 0} ⊂ g_α⁻¹(W₀). The last integration can be done variable by variable, and the largest pole −λ₁^{(αβ)} of J_αβ(λ) and its multiplicity m₁^{(αβ)} are obtained, where the Taylor expansion of |g_α′(u)| is used.

4. The largest pole −λ₁ of J(λ) is −λ₁ = max_{αβ} (−λ₁^{(αβ)}), and its multiplicity m₁ is m₁^{(α*β*)}, where (α*, β*) is the argument that maximizes −λ₁^{(αβ)}. If the (α, β) that maximizes −λ₁^{(αβ)} is not unique, then the (α, β) that maximizes m₁^{(αβ)} among such pairs is chosen.
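In normal-crossing coordinates, step 3 reduces to bookkeeping on exponents: the variable u_j contributes a factor ∫ u_j^{λ k_j + h_j} du_j, whose pole is at λ = −(h_j + 1)/k_j, where h_j is the exponent of u_j in the Jacobian. The following is a minimal sketch of this bookkeeping (not from the article; the function name and the worked exponent lists are illustrative):

```python
from fractions import Fraction

def largest_pole(ks, hs):
    # In the local normal-crossing form of equation 5.1 the integrand of
    # J(lambda) is a(u)^lambda * prod_j u_j^(lambda*k_j + h_j), where h_j
    # is the exponent of u_j in the Jacobian |g'(u)|.  Each variable with
    # k_j > 0 yields a pole at lambda = -(h_j + 1)/k_j; the largest pole
    # -lambda_1 and its multiplicity m_1 (the number of variables that
    # attain it) follow by comparison.
    cands = [Fraction(h + 1, k) for k, h in zip(ks, hs) if k > 0]
    lam1 = min(cands)
    return lam1, cands.count(lam1)

# Regular model in d = 3 dimensions: after blowing up, the integrand is
# u1^(2*lambda) times the Jacobian |u1|^(d-1), so lambda_1 = d/2:
print(largest_pole([2], [2]))               # (Fraction(3, 2), 1)

# A monomial x^(2*lambda) y^(6*lambda) w^(2*lambda) with Jacobian |x y^3 w|:
print(largest_pole([2, 6, 2], [1, 3, 1]))   # (Fraction(2, 3), 1)
```

The invertible factor a(u)^λ and the smooth density Q(g(u)) do not shift the pole positions, which is why only the monomial exponents enter.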
In order to calculate λ₁ and m₁, only the neighborhood W_α that gives the largest pole is important. The singularity contained in such a neighborhood W_α is called the deepest singularity in this article. Note that by theorem 5, 1 ≤ m₁ ≤ d, where d is the number of parameters.

Example 1 (Regular Models). For regular statistical models, by using an appropriate coordinate system (w₁, . . . , w_d) the average loss function f(w) can be written locally as

f(w) = Σ_{i=1}^d w_i².

Then, around the origin, with W₁ = {w : ‖w‖ < 1},

J(λ) = ∫_{W₁} f(w)^λ dw,

where we replaced Q(w) by Q(w) = const., based on the natural assumption that Q(w) > 0 on W. We define W₁₁ = {w ∈ W₁ : |w_i| < |w₁|} and U₁₁ = {(u₁, . . . , u_d) : |u_i| < 1}. By using the blowing-up technique, we find a map g₁ : (u₁, . . . , u_d) ↦ (w₁, . . . , w_d),

w₁ = u₁,   w_i = u₁ u_i   (2 ≤ i ≤ d).

Then the function J₁₁(λ) is

J₁₁(λ) = ∫_{W₁₁} f(w)^λ dw = ∫_{−1}^{1} · · · ∫_{−1}^{1} u₁^{2λ} ( 1 + Σ_{i=2}^d u_i² )^λ |u₁|^{d−1} du₁ du₂ · · · du_d.

This function has a pole at λ = −d/2 with multiplicity m₁ = 1. We define W₁ᵢ and J₁ᵢ(λ) in the same way as W₁₁ and J₁₁(λ); then W₁ is the union of the W₁ᵢ and a set of measure zero. Therefore, the free energy is

F(n) ≅ (d/2) log n + O(1),

and if K(n) satisfies the same condition as in corollary 1, then K(n) ≅ d/2n.

Proof of Corollary 2. Let w₀ be an arbitrary fixed point in W₀. We can assume w₀ = 0 without loss of generality. Since f(w) is analytic, there exists a constant a > 0 such that for any w in a neighborhood of 0,

f(w) ≤ a‖w‖².
From lemma 2, we analyze the case when λ is real and negative. For such λ, J(λ) has a real value if it is finite. Let Q₀ be a positive function of class C₀^∞ smaller than Q(w). Then for a sufficiently small constant δ > 0,

J(λ) ≥ ∫_{‖w‖<δ} f(w)^λ Q₀(w) dw ≥ ∫_{‖w‖<δ} (a‖w‖²)^λ Q₀(w) dw,

since λ < 0 and f(w) ≤ a‖w‖². The right-hand side diverges as λ → −d/2, so the largest pole of J(λ) is not smaller than −d/2; that is, λ₁ ≤ d/2, which completes corollary 2.

Next, let us consider the case f(w) = (ab + cd)² + (ab³ + cd³)² for w = (a, b, c, d). We take

W₁₁ = {w : . . . , |ad³| < |ab + cd| < |cd³|},
U₁₁ = {(x, y, z, w) : |x| < 1, |y| < 1, |w| < 1, w − 1 < z < w + 1}.

By using the blowing-up technique, we find a map g₁ : U₁₁ → W₁₁, which is defined by

a = x,   b = y³w − yzw,   c = xzw,   d = y.

From the algorithmic point of view, W₁₁ and U₁₁ are determined systematically in the blowing-up process. By using this transform, we obtain

f(g₁(x, y, z, w)) = x² y⁶ w² [ 1 + {w²(y² − z)³ + z}² ],
|g₁′(x, y, z, w)| = |x y³ w|.

We naturally assume Q(w) > 0 on {w = (a, b, c, d) : |a| < 1, |b| < 1, |c| < 1, |d| < 1}. In the calculation of the poles of J(λ), we can simply put Q(w) = 1 on this set:

J₁(λ) = ∫_{W₁₁} f(w)^λ dw = ∫_{−1}^{1} dx ∫_{−1}^{1} dy ∫_{−1}^{1} dw ∫_{w−1}^{w+1} dz  f(g₁(x, y, z, w))^λ |x y³ w|.

The result is that λ₁^{(1)} = 2/3 and m₁^{(1)} = 1. From the other regions W₁₂, W₁₃, . . . we do not obtain a λ₁ larger than 2/3. Hence,

F(n) ≅ (2/3) log n + O(1),

whereas the dimension divided by two is d/2 = 4/2.

Finally, we introduce an elemental inequality for the asymptotic expansion of the free energy. Let us consider K learning machines (1 ≤ k ≤ K),

p_k(y|x, w_k) = (1/(2πσ²)^{N/2}) exp( −(1/2σ²) ‖y − y_k(x, w_k)‖² ),
where N is the number of output units. Assume that the asymptotic expansions of the free energies are, respectively, given by

F_k(n) ≅ λ^{(k)} log n − (m^{(k)} − 1) log log n,
which are determined under the condition that the true distributions are {p(y|x, w_k⁰)} and the a priori distributions are Q_k(w_k). Let us study the case that a learning machine made of a sum of the K machines,

p(y|x, {w_k}) = (1/(2πσ²)^{N/2}) exp( −(1/2σ²) ‖ y − Σ_{k=1}^K y_k(x, w_k) ‖² ),

is trained by using samples from the distribution p(y|x, {w_k⁰}). The asymptotic expansion of the free energy with the a priori distribution Π_k Q_k(w_k) is written as

F(n) ≅ λ log n − (m − 1) log log n.

Then we have the following inequality:

Corollary 3. The constants λ and {λ^{(k)}} satisfy the inequality λ ≤ Σ_{k=1}^K λ^{(k)}.
This corollary has an important application. Let λ(H₀, H) log n be the essential term of the asymptotic form of the free energy of a three-layered perceptron with H hidden units, trained using samples from the true distribution represented with H₀ hidden units. We assume that H − H₀ = KL, where K and L are some integers. By using corollary 3,

λ(H₀, H) ≤ λ(H₀, H₀) + λ(0, H − H₀) ≤ λ(H₀, H₀) + K λ(0, L).
Note that λ(H₀, H₀) = d/2, where d is the number of parameters in the model with H₀ hidden units (Watanabe, 1998b). Even if the resolution of singularities is too complex and λ(H₀, H) too difficult to calculate, we can obtain such inequalities based on the free energies of the simpler cases.

Proof. The Kullback distance of the machine p(y|x, {w_k}) is given by

f(w) ≡ (1/2σ²) ∫ ‖ Σ_{k=1}^K {y_k(x, w_k) − y_k(x, w_k⁰)} ‖² q(x) dx
     ≤ (K/2σ²) Σ_{k=1}^K ∫ ‖ y_k(x, w_k) − y_k(x, w_k⁰) ‖² q(x) dx
     = K Σ_{k=1}^K f_k(w_k),

where f_k(w_k) is the Kullback distance of p_k(y|x, w_k) from p_k(y|x, w_k⁰). Hence,

F(n) ≤ −log ∫ exp( −Kn Σ_{k=1}^K f_k(w_k) ) Π_{k=1}^K Q_k(w_k) Π_{k=1}^K dw_k
Figure 1: Experimental distribution of MLE. True = (0,0).
≤ Σ_{k=1}^K { λ^{(k)} log n − (m^{(k)} − 1) log log n + const. }.
Example 4 (Experiments). Let us consider the case where

p(y|x, a, b) = (1/√(2π(0.1)²)) exp( −(y − a s(bx))² / (2(0.1)²) )

is trained using samples from p(y|x, a₀, b₀), where s(x) = x + x². In this case, we can calculate the maximum likelihood estimator, because the likelihood function is a quadratic form in ab and ab². Figures 1, 3, and 5 show the maximum likelihood estimators (MLE) of (a, b), estimated using samples from the distributions with true parameters (a₀, b₀) = (0, 0), (0.2, 0.2), and (0.4, 0.4), respectively. Here q(x) is the standard normal distribution, n = 20, and the number of trials is 200. Note that MLEs outside [−2, 2] × [−2, 2] are not drawn in the figures. Figures 2, 4, and 6 show parameters drawn from the a posteriori distribution, estimated using samples from the distributions with true parameters (a₀, b₀) = (0, 0), (0.2, 0.2), and (0.4, 0.4). Here we defined the a priori distribution of (a, b) as the uniform distribution on [−2, 2] × [−2, 2]. It should be emphasized that the distribution of MLEs is very different from
Figure 2: Experimental distribution of Bayes. True = (0,0).
Figure 3: Experimental distribution of MLE. True = (0.2, 0.2).
Figure 4: Experimental distribution of Bayes. True = (0.2, 0.2).
Figure 5: Experimental distribution of MLE. True = (0.4,0.4).
Figure 6: Experimental distribution of Bayes. True = (0.4,0.4).
the Bayesian a posteriori distribution, and that neither distribution follows any gaussian distribution, even though the true parameter is only one point. In this example, by putting p = ab, q = ab², the model can be understood as a regular statistical model. The result is that the generalization error of the maximum likelihood method is asymptotically equal to 2/2n in all three cases. However, for a general activation function s(x), there exists no transform that makes the model regular, and the generalization error in such a case is far larger than in regular model cases. The generalization errors of Bayesian estimation in these cases are asymptotically 1/(2n) − 1/(n log n), 2/(2n), and 2/(2n), respectively. For the case n = 20 (n is the number of training samples), the experimental average generalization errors of the maximum likelihood method were 0.09509, 0.09058, and 0.09426 where the true parameters were (0, 0), (0.2, 0.2), and (0.4, 0.4), respectively, whereas those of the Bayesian method were 0.01920, 0.02697, and 0.6236, respectively. For the case n = 50, the former errors were 0.02519, 0.02528, and 0.02491, respectively, and the latter errors were 0.00856, 0.01228, and 0.01931. These results show that if the model is almost redundant, Bayesian estimation is more appropriate than the maximum likelihood method.
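The quadratic-form trick of example 4 can be sketched numerically. The following is a minimal illustration (not the authors' code), assuming a standard normal q(x), σ = 0.1, and the reparameterization p = ab, q = ab²:

```python
import random

random.seed(0)

def mle_ab(xs, ys):
    # The model mean is a*s(b*x) with s(x) = x + x^2, that is,
    # a*b*x + a*b^2*x^2 = p*x + q*x^2 with p = ab, q = ab^2.
    # For gaussian noise the MLE of (p, q) is ordinary least squares;
    # (a, b) is then recovered via b = q/p, a = p^2/q (for p, q != 0).
    s11 = sum(x * x for x in xs)
    s12 = sum(x ** 3 for x in xs)
    s22 = sum(x ** 4 for x in xs)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    det = s11 * s22 - s12 * s12
    p = (s22 * t1 - s12 * t2) / det
    q = (s11 * t2 - s12 * t1) / det
    return p * p / q, q / p           # (a_hat, b_hat)

# One trial with n = 20 samples from the true distribution (a0, b0) = (0, 0):
xs = [random.gauss(0, 1) for _ in range(20)]
ys = [random.gauss(0, 0.1) for _ in range(20)]
print(mle_ab(xs, ys))   # typically lands far from (0, 0), cf. figure 1
```

Because (p, q) stay close to (0, 0) while (a, b) are ratios of these small quantities, repeating the trial scatters the MLE widely, as in the figures.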
926
Sumio Watanabe
6 Discussion

6.1 Fundamental Aspects. In this section, we explain the fundamental structure of the theorems in this article. Let us define a volume function of the set of parameters {w ∈ W : f(w) ≤ t},

V(t) = ∫_{f(w)≤t} Q(w) dw.

Then V′(t) = I(t), where I(t) is defined by equation 3.2. This article has shown that the asymptotic stochastic complexity F(n) is given by the Laplace transform of the volume function:

F(n) ≅ −log ∫ exp(−nt) dV(t).

If the model is identifiable and the Fisher information matrix is positive definite, then the asymptotic shape of the set {w ∈ W : f(w) ≤ t} is the inside of an ellipsoid. However, if the model is nonidentifiable and has singularities, then it is the union of the inner points of several complex "hyperboloids." If V(t) satisfies the relation

V(t) ≅ c_{1,m₁} t^{λ₁} (−log t)^{m₁−1}   (t → 0),

then

F(n) ≅ λ₁ log n − (m₁ − 1) log log n   (n → ∞),   (6.1)
J(λ) = ∫ t^λ V′(t) dt = Const./(λ + λ₁)^{m₁}.   (6.2)
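The Laplace-transform relation can be checked numerically. The following is a minimal sketch (not from the article), taking the regular-model volume V(t) = t^{λ₁} with λ₁ = d/2 = 1.5 and m₁ = 1 as an assumed example, so that the slope of F(n) against log n should approach λ₁:

```python
import math

def free_energy(n, lam=1.5, steps=400000, T=1.0):
    # F(n) = -log ∫_0^T exp(-n t) dV(t)  with  V(t) = t^lam,
    # i.e. V'(t) = lam * t^(lam - 1), evaluated by the midpoint rule.
    # The truncation at T is harmless: exp(-n t) is negligible there.
    h = T / steps
    s = sum(math.exp(-n * (i + 0.5) * h) * lam * ((i + 0.5) * h) ** (lam - 1)
            for i in range(steps))
    return -math.log(s * h)

# The slope with respect to log n approximates lambda_1 = 1.5; since
# m_1 = 1 there is no log log n correction:
slope = (free_energy(4000) - free_energy(1000)) / math.log(4.0)
print(slope)   # close to 1.5
```

Replacing V(t) by c t^{λ₁} (−log t)^{m₁−1} with m₁ > 1 adds the −(m₁ − 1) log log n term of equation 6.1 to the fit.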
These relations (equations 6.1 and 6.2) show that the learning efficiency of Bayesian estimation is determined by the volume of the almost-true parameters. Mathematically, it is difficult to prove V(t) ∝ t^{λ₁} (−log t)^{m₁−1} directly, because of the singularities in the true parameter set. The relation that is proven first is equation 6.2, where we need algebraic properties of the loss function, the Sato–Bernstein b-function, and Hironaka's resolution of singularities.

6.2 Theoretical Aspects. The case when the parameter is not identifiable might be understood as a special one that seldom occurs. However, superpositions of homogeneous functions essentially have the problem of redundancy. Let us illustrate the mathematical structure of redundancy. A function f(x) in L²(R) can be decomposed as

f(x) = ∫∫ g(a, b) Q((x − b)/a) da db,
where Q(x) is an analyzing wavelet and g(a, b) is a coefficient function (Chui, 1992). Here the set of functions {Q((x − b)/a)} is an overcomplete basis of L²(R), with the result that g(a, b) is not determined uniquely. This decomposition shows that the set of coefficients {g_h} in the neural approximation,

f(x) ≈ Σ_{h=1}^H g_h Q((x − b_h)/a_h),

becomes more nonidentifiable as H approaches infinity. In order to analyze the case n → ∞ and H → ∞, we have to devise a theory of redundant function approximation and redundant statistical estimation. Watanabe (2000b) shows that singularities in the parameter space make the generalization error smaller even when the true distribution is not contained in the parametric model. On the other hand, the set of coefficients in an orthonormal basis decomposition is given by the orthonormal projection of the true function onto the corresponding function space; hence, it is identifiable even if H goes to infinity. This is one of the essential differences between artificial neural networks and orthonormal basis functions. We expect that artificial neural networks are better learning machines than orthonormal ones if Bayesian estimation is applied. It is well known that if a statistical model contains unused or nuisance parameters, then those parameters should not be estimated; if the parameters are estimated and determined, then the statistical estimation error becomes large. They are estimated in the maximum likelihood method but not in Bayesian estimation. This is one of the reasons that Bayesian estimation is useful for neural networks in almost redundant states.

6.3 Practical Aspects. In regular learning machines, the generalization error of the maximum likelihood method is asymptotically equal to d/2n (Akaike, 1974), which coincides with that of Bayesian estimation (Amari, 1993), where d is the number of parameters and n is the number of training samples. In this article we have shown that in nonidentifiable learning machines, the generalization error of Bayesian estimation is not larger than d/2n. When the singularity becomes deeper, the generalization error is reduced. On the other hand, it is conjectured that the generalization error of a layered learning machine under the maximum likelihood method is far larger than d/2n.
For example, it is proven that if the distribution of the inputs consists of a finite sum of delta functions, then the generalization error is proportional to log n/n (Hagiwara, Kuno, & Usui, 2000). It is also proven that if the activation function of the hidden units is a linear function, then the generalization error is C/2n, where C is larger than the number of parameters (Fukumizu, 1999). In practical applications, the true distribution is in general not strictly contained in the finite parametric model (Shibata, 1981). In such a case, the
generalization error is the sum of the function approximation error and the statistical estimation error, where the former is called bias and the latter variance. In this article, we have shown that the increase in the variance of an artificial neural network is not larger than the increase in the number of parameters if Bayesian estimation is applied. Therefore, we can use a larger neural network with a not-so-large variance and a smaller bias. However, if we apply the maximum likelihood method to an artificial neural network, then we should use a smaller model to ensure that the variance is small, resulting in a larger bias. We conjecture that this difference is the main reason that, in practical applications, almost redundant learning machines with ensemble learning are more useful than selected small models with one-point estimation.

7 Conclusion
The mathematical foundation for nonidentifiable learning machines such as neural networks is established based on algebraic analysis and algebraic geometry. The free energy, or the stochastic complexity, is asymptotically equal to λ₁ log n − (m₁ − 1) log log n + const., where λ₁ is a rational number and m₁ is a natural number. Moreover, we obtained the following properties:

1. The learning curve is determined by the singularities in the parameter space.

2. The learning efficiency can be algorithmically calculated by the resolution of singularities.

3. Bayesian estimation is more appropriate than the maximum likelihood method for neural networks in redundant states.

Information geometry gives the foundation of statistical learning theory for regular learning machines (Amari, 1985), which have a positive definite Fisher metric. We expect that algebraic geometry and algebraic analysis will play an important role in the study of complex and hierarchical learning machines, which have degenerate metrics. Analysis of both the maximum likelihood method and the maximum a posteriori method is an important problem for the future.

Appendix: Elemental Lemmas and Proofs

In this appendix, we show two elemental lemmas and their proofs. These are technical properties, but they are sometimes useful in neural network theory.
Lemma 5. Let {X_k(α) : k = 1, 2, . . .} be a set of real-valued random variables with parameter α, and let A be the set of all parameters. Then [. . .]

[. . .] such that

|g_j(w)| ≤ 2^{−j}.   (A.1)
On the other hand, we can understand g_j(w) as an element of the ring of convergent power series, R. Let us consider a sequence of increasing ideals in R,

I_n = { Σ_{j=1}^n u_j(w) g_j(w) : u_j(w) ∈ R }   (n = 1, 2, 3, . . .).

It is well known that R is a Noetherian ring; in other words, R satisfies the maximal condition. Therefore, there exists a finite integer J such that I_n ⊂ I_J for any integer n. Consequently, for i > J, there exist functions {c_ij(w) ∈ R} such that

g_i(w) = Σ_{j=1}^J c_ij(w) g_j(w).

In this decomposition, it is proven in Oka (1962) that we can choose functions {c_ij(w)} that satisfy

|c_ij(w)| ≤ C sup_w |g_i(w)|,

where C does not depend on j. By combining this with inequality A.1, the inequality |c_ij(w)| ≤ C 2^{−i} is obtained. Then we can decompose y(x, w) as

y(x, w) = Σ_{j=1}^J e_j(x) g_j(w) + Σ_{i=J+1}^∞ e_i(x) g_i(w)
        = Σ_{j=1}^J g_j(w) { e_j(x) + Σ_{i=J+1}^∞ c_ij(w) e_i(x) }.

Here the sequence {h_j^{(p)}(·, w)} defined by

h_j^{(p)}(x, w) = e_j(x) + Σ_{i=J+1}^{p} c_ij(w) e_i(x)

converges to a function h_j(·, w) uniformly in w in the topology of L²(q) as p → ∞, with the result that h_j(·, w) is a holomorphic function as an element of L²(q). Since the set {e_j(x)} is an orthonormal system, the {h_j(x, w)} are also linearly independent.

Acknowledgments
This research was partially supported by the Ministry of Education, Science, Sports and Culture in Japan, Grant-in-Aid for Scientific Research 09680362 and 12680370.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. on Automatic Control, 19, 716–723.
Akaike, H. (1980). Likelihood and Bayes procedure. In J. M. Bernardo (Ed.), Bayesian statistics (pp. 143–166). Valencia, Spain: University Press.
Amari, S. (1985). Differential-geometrical methods in statistics. New York: Springer.
Amari, S. (1993). A universal theorem on learning curves. Neural Networks, 6, 161–166.
Amari, S., Fujita, N., & Shinomoto, S. (1992). Four types of learning curves. Neural Computation, 4(4), 608–618.
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss. Neural Computation, 5, 140–153.
Atiyah, M. F. (1970). Resolution of singularities and division of distributions. Comm. Pure and Appl. Math., 13, 145–150.
Barron, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1), 115–133.
Bernstein, I. N. (1972). The analytic continuation of generalized functions with respect to a parameter. Functional Anal. Appl., 6, 26–40.
Björk, J. E. (1979). Rings of differential operators. Amsterdam: North-Holland.
Chui, C. K. (1992). An introduction to wavelets. Orlando, FL: Academic Press.
Cramér, H. (1949). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Dacunha-Castelle, D., & Gassiat, E. (1997). Testing in locally conic models, and application to mixture models. Probability and Statistics, 1, 285–317.
Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9, 871–879.
Fukumizu, K. (1999). Generalization error of linear neural networks in unidentifiable cases. Lecture Notes in Computer Science, 1720, 51–62.
Gel'fand, I. M., & Shilov, G. E. (1964). Generalized functions. Orlando, FL: Academic Press.
Hagiwara, K., Kuno, K., & Usui, S. (2000). On the problem in model selection of neural network regression in overrealizable scenario. In Proc. of Int. Joint Conf. on Neural Networks, Como, Italy.
Hagiwara, K., Toda, N., & Usui, S. (1993). On the problem of applying AIC to determine the structure of a layered feed-forward neural network. Proc. of IJCNN Nagoya, Japan, 3, 2263–2266.
Hironaka, H. (1964). Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 79, 109–326.
Kashiwara, M. (1976). B-functions and holonomic systems. Inventiones Math., 38, 33–53.
Knowles, M., & Siegmund, D. (1989). On Hotelling's approach to testing for a nonlinear parameter in regression. International Statistical Review, 57, 205–220.
Levin, E., Tishby, N., & Solla, S. A. (1990). A statistical approach to learning and generalization in layered neural networks. Proc. of IEEE, 78(10), 1568–1674.
Mackay, D. J. (1992). Bayesian interpolation. Neural Computation, 4(2), 415–447.
Mhaskar, H. N. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8, 164–177.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion: determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 797–807.
Oaku, T. (1997). Algorithms for the b-function and D-modules associated with a polynomial. Journal of Pure and Applied Algebra, 117, 495–518.
Oka, K. (1962). Sur les fonctions analytiques de plusieurs variables. Tokyo: Iwanami Shoten.
Opper, M., & Haussler, D. (1995). Bounds for predictive errors in the statistical mechanics of supervised learning. Physical Review Letters, 75(20), 3772–3775.
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14, 1080–1100.
Sato, M., & Shintani, T. (1974). On zeta functions associated with prehomogeneous vector spaces. Annals of Mathematics, 100, 131–170.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Shibata, R. (1981). An optimal model selection of regression variables. Biometrika, 68, 45–54.
Watanabe, S. (1998a). Inequalities of generalization errors for layered neural networks in Bayesian learning. Proc. of Int. Conf. on Neural Inform. Process., 1, 59–62.
Watanabe, S. (1998b). On the generalization error by a layered statistical model with Bayesian estimation. IEICE Trans., J81-A, 1442–1452. (English version in Elect. and Comm. in Japan (2000), 95–104.)
Watanabe, S. (1999). Algebraic analysis for singular statistical estimation. Lecture Notes in Computer Science, 1720, 39–50.
Watanabe, S. (2000a). Algebraic analysis for non-regular learning machines. Advances in Neural Information Processing Systems, 12, 356–362.
Watanabe, S. (2000b). Mathematical foundation for redundant statistical estimation. Proc. of 31st ISCIE Int. Symp. on Stochastic Systems Theory and Its Applications (pp. 119–124).
Watanabe, S., & Fukumizu, K. (1995). Probabilistic design of layered neural networks based on their unified framework. IEEE Trans. on Neural Networks, 6(3), 691–702.
Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Trans. on Information Theory, 44(4), 1424–1439.

Received August 9, 1999; accepted May 8, 2000.
LETTER
Communicated by Maxwell Stinchcombe
A New On-Line Learning Model

Shahar Mendelson
Department of Mathematics, Technion, Haifa 32000, Israel

We introduce a new supervised learning model that is a nonhomogeneous Markov process and investigate its properties. We are interested in conditions that ensure that the process converges to a "correct state," which means that the system agrees with the teacher on every "question." We prove a sufficient condition for almost sure convergence to a correct state and give several applications of the convergence theorem. In particular, we prove several convergence results for well-known learning rules in neural networks.

1 Introduction
In this article, we examine a model of abstract supervised learning. In a supervised learning model, the student is exposed to inputs and responds to them. If the response is incorrect (it does not agree with that of the teacher), the student "learns" and moves closer to the teacher in some sense. One of the first supervised learning models suggested was the perceptron (see Minsky & Papert, 1972; Ritter, Martinetz, & Schulten, 1992; Haykin, 1994). In this model, both the student S and the teacher T are halfspaces in R^n. Given an input v ∈ R^n, the student's answer is incorrect if v is in S but not in T, or vice versa. In this case, the student changes its position and moves "closer" to the teacher by some learning rule. This learning model is on-line; at every "learning step," the system's response depends only on its current state and the input, and not, for example, on other parameters of the state-space, which is a space containing all the possible students. In a general supervised on-line learning model, the learning process is determined by an on-line error function denoted by E(x, v), where x is a student and v is an input. In the general case the teacher is not known, but makes its mark on the learning process via the on-line error function. The student x is asked a question (represented by the input v), and E(x, v) measures the magnitude of the mistake x made with regard to the input (question) v. In other words, E(x, v) is a nonnegative bounded function such that for every state x and every input v, E(x, v) = 0 if and only if x gives a correct response to the input v. Thus, E(x, v) = 0 if and only if x agrees with the teacher on the input v. The larger E(x, v) is, the bigger the disagreement between x and the teacher on the input v.

Neural Computation 13, 935–957 (2001)
© 2001 Massachusetts Institute of Technology
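The 0-1 on-line error of the perceptron model just described, and its average over inputs, can be sketched as follows. This is a minimal illustration, not from the article; the gaussian input measure is an assumption made for concreteness:

```python
import random

random.seed(1)

def perceptron_error(student, teacher, v):
    # 0-1 on-line error for the halfspace (perceptron) model: the response
    # to input v is wrong iff v lies in exactly one of the two halfspaces
    # {u : <w, u> >= 0}.  E(x, v) = 0 iff student and teacher agree on v.
    s = sum(wi * vi for wi, vi in zip(student, v)) >= 0
    t = sum(wi * vi for wi, vi in zip(teacher, v)) >= 0
    return 0.0 if s == t else 1.0

def global_error(student, teacher, n=20000, dim=3):
    # Monte Carlo estimate of the global error, the average of E(x, v)
    # over inputs v drawn from the input measure (here: standard gaussian).
    total = 0.0
    for _ in range(n):
        v = [random.gauss(0, 1) for _ in range(dim)]
        total += perceptron_error(student, teacher, v)
    return total / n

teacher = [1.0, 0.0, 0.0]
print(global_error(teacher, teacher))          # 0.0: student equals teacher
print(global_error([0.0, 1.0, 0.0], teacher))  # about 0.5 (orthogonal halfspaces)
```

For an isotropic input distribution, the disagreement probability of two halfspaces equals the angle between their normals divided by π, which is why the orthogonal case gives roughly 0.5.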
An example of such a model is on-line Gibbs learning, first suggested in Kim and Sompolinsky (1996, 1998). It is a Markov process in which the transition density from state to state, given the input at the nth stage, is determined by an "on-line" energy function. The nth stage transition operator is defined so that on average, the transition from the nth stage to the (n+1)th stage reduces the energy. The energy function depends on the on-line error function E(x, v), so that a reduction in the energy is equivalent to a smaller global error, which for every state x is the average of the error function over all possible inputs. Formally, the global error at x is E_g(x) = ∫_V E(x, v) dν(v), where V is the input set and ν is the probability measure on V by which the inputs are selected. In most supervised learning models suggested so far, the transition density from x to x′ depends on both E_g(x) and E_g(x′), and not on the responses of x and x′ to each input separately. Hence, those models do not fall into the category of on-line learning. The strength of the non-on-line learning models is that at each learning step, the global error decreases; therefore, it is easy to guarantee that the student converges to a minimum of E_g(x). Kim and Sompolinsky (1998) claimed that the on-line Gibbs algorithm (OLGA) has some of the features of non-on-line models: it converges in some sense to a minimum of E_g. The on-line model we suggest is a variation of on-line Gibbs learning. We examine it in two cases: when the learning step (the maximal distance between the nth state and the (n+1)th state) is a constant and when the learning step decreases to 0. The main focus of this article is on the first case, but we give an example of a convergence theorem for the second case under some additional assumptions on the on-line error function. In section 2, we define our model and state most of the convergence results concerning the process with a constant learning step.
We also present several examples in which the process may be used. Two of those examples are the well-known perceptron and multilayer perceptron models (Ritter et al., 1992; Haykin, 1994). In section 3, we give an example of a convergence theorem in the case where the learning steps decrease to 0. In this example, we take into account the possibility that at every stage the teacher has a probability p < 1/2 of making a mistake. In section 4 we prove the results stated in section 2 and compare our model with the OLGA. In addition, we present a counterexample to one of the claims in Kim and Sompolinsky (1998). We end with some concluding remarks. Let us turn to some definitions and notations. Throughout the article, (X, d, μ) is a compact metric probability space with a metric d and a probability measure μ. We assume that the measure μ is compatible with the metric in the sense that every open set in X has positive μ-measure. If $A \subset X$, set $d(x, A) = \inf_{a \in A} d(x, a)$. We say that a sequence $(x_n)$ converges to a set A if $\lim_{n\to\infty} d(x_n, A) = 0$. Let $B_\lambda(x)$ be the closed ball of radius λ centered at x. $X^n$ is the product space of n copies of X with the induced topology and measure. (V, ν) is the compact metric space of all possible inputs, where
A New On-Line Learning Model
937
ν is a probability measure on V. Denote by τ the product measure μ × ν on X × V. For a random variable Y, EY is the expectation of Y and $\|Y\|_1 = E|Y|$. Given a set A, $\bar A$ denotes its closure and $\chi_A$ is its characteristic function—the function that is 1 on A and vanishes elsewhere. Finally, we say that $(a_n) \in \ell_p$ if $\sum |a_n|^p < \infty$.

2 The Model and Some Examples
In this section we define our model and list some results concerning it. Then we give examples of ways in which this learning process may be used. We present two cases: when the error function is a 0-1 function and when E(x, v) is a nonnegative smooth function. For every x ∈ X, put $C_x = \{v \mid E(x, v) = 0\}$. Thus, $C_x$ is the set of inputs on which the student x agrees with the teacher. Our process is a Markov process $(X_n, V_n)$, where $(V_n)$ are independent and identically distributed (i.i.d.) according to ν and are independent of $(X_n)$, and $X_n$ represents the state of the student at the nth stage. We assume that $X_1$ is distributed according to μ and that $X_n$ is adapted as follows: let λ be the size of the maximal learning step, and let $T_n$ be a sequence decreasing to 0. If $d(x, x') > \lambda$, then x cannot move to x′. Moreover, if x gives the correct response to v, then x remains stationary. If x does not give the correct response to v and $d(x, x') \le \lambda$, then the transition density at the nth stage from x to x′ given v is
$$P_n(X_{n+1} = x' \mid X_n = x, V_n = v) = \frac{1}{c_n(x)}\, E(x, v)\, e^{-E(x', v)/T_n}, \tag{2.1}$$

where $c_n(x)$ is the normalizing constant, determined by the fact that the probability that x remains stationary and the probabilities that x moves to some x′ should add up to 1. Thus,

$$c_n(x) = \nu(C_x) + \int_{B_\lambda(x)} \int_V E(x, v)\, e^{-E(x', v)/T_n}\, d\nu(v)\, d\mu(x').$$
The reason for this selection of a transition operator is simple. First, if x gives the correct response, there is no reason to change its location. When x is wrong, the linear term E(x, v) guarantees that if E(x, v) is small, that is, if the response of x is almost correct, then x does not move by much. The exponent $e^{-E(x', v)/T_n}$ ensures that for small values of $T_n$, x moves only to states x′ for which E(x′, v) is small too. From here on we denote the conditional transition density $P_n(X_{n+1} = x' \mid X_n = x, V_n = v)$ by $P_n(x' \mid x, v)$ and let P be the measure induced on the orbits $(X_n)$ of the process 2.1.
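To make the dynamics concrete, here is a minimal numerical sketch of one transition of process 2.1, assuming a finite grid of candidate states in place of the ball $B_\lambda(x)$ and a toy 0-1 error function whose teacher threshold 0.5 is purely illustrative (none of these choices come from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, v, states, E, lam, T):
    """One transition of process 2.1 on a finite grid of candidate states.

    A correct response (E(x, v) = 0) leaves the student in place. Otherwise
    the student jumps to some x' with d(x, x') <= lam, with density
    proportional to E(x, v) * exp(-E(x', v) / T): for small T, essentially
    only destinations that answer v correctly receive mass.
    """
    if E(x, v) == 0:
        return x
    nearby = [s for s in states if 1e-12 < abs(s - x) <= lam]
    errs = np.array([E(s, v) for s in nearby])
    # Shift by the minimum before exponentiating for numerical stability;
    # the constant factor E(x, v) cancels in the normalization c_n(x).
    w = np.exp(-(errs - errs.min()) / T)
    return nearby[rng.choice(len(nearby), p=w / w.sum())]

# Toy 0-1 error: the student is a threshold x on V = [0, 1]; the teacher's
# threshold is 0.5, and E(x, v) = 1 exactly when the two disagree on v.
E = lambda x, v: float((v > x) != (v > 0.5))
states = np.linspace(0.0, 1.0, 201)
x = 0.1
for n in range(1, 1001):
    x = step(x, rng.random(), states, E, lam=0.2, T=1.0 / n)  # T_n -> 0
```

With the decreasing temperature schedule, the student typically drifts toward the teacher's threshold, illustrating why low temperatures concentrate mass on nearly correct destinations.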
The 0-temperature process is the process whose transition operators are given by

$$P(x' \mid x, v) = \lim_{n\to\infty} P_n(x' \mid x, v),$$
where the limit is taken pointwise. Note that from the computational point of view, it is easier to run the process while decreasing T to 0 than to run the 0-temperature process. On the other hand, the analysis of the model is much easier at T = 0. We show that the "nice" properties of the process 2.1 at T = 0 are retained when $T_n$ is decreased to 0, and that the schedule by which $T_n$ is taken to 0 plays no essential role. A key part in the analysis of process 2.1 is the fact that for every $A \subset X$, the convergence of $P_n(A \mid x) = \int_A \int_V P_n(x' \mid x, v)\,d\nu(v)\,d\mu(x')$ to $P(A \mid x)$ is uniform. Therefore, we can approximate the behavior of this process by the 0-temperature process.

Assumption 1. Assume that $\nu(C_x \,\Delta\, C_{x'})$ is continuous with respect to each variable separately, where $A \,\Delta\, B = (A \cap B^c) \cup (B \cap A^c)$. Assume further that there is a δ > 0 such that $\nu(C_x) \ge \delta$ for every x.
Although this last assumption seems innocent, it is rather strong and limiting. It implies that in some sense, none of the states in X is too far from the teacher. In all the examples we present, the only way to ensure that this assumption holds is to assume that the state space is a small enough neighborhood of the target concept. On the other hand, it is easy to construct many on-line error functions for which this assumption holds and for which the state space is not restricted to a neighborhood of the target concept.

Definition 1. Let $\mathcal{O} \subset X$ denote the set of local minima of the error function $E_g(x)$. We say that $A \subset X$ is a neighborhood of $\mathcal{O}$ if it is an open subset of X that contains $\mathcal{O}$. We denote by Q the set of states that agree with the teacher almost surely, that is, $Q = \{x \mid E_g(x) = 0\}$.
Let us state our main results:

Theorem 1. Assume that the error function is a 0-1 function. Then:

a. If A is any neighborhood of $\mathcal{O}$, then for every λ > 0 and every positive sequence $(T_n) \to 0$, the orbits $(X_n)$ defined by the process 2.1 enter A infinitely often P-almost surely.

b. Assume that μ(Q) > 0. Then for λ = diam(X) and for every sequence $(T_n) \to 0$, $(X_n)$ converges P-almost surely to Q.
Note that Q is an absorbing set, which means that the probability of leaving Q is 0. When the error function is not a 0-1 function, we have to make an additional assumption:

Assumption 2. Assume that $E_g$ is monotone in the sense that $E_g(y) \le E_g(x)$ when $C_y \supseteq C_x$. Assume further that for every $x \notin \mathcal{O}$ and every ε > 0, there is some $y \ne x$, $y \in B_\varepsilon(x)$, such that $C_y \supseteq C_x$.

Theorem 2. If the on-line error function satisfies assumption 2, then the assertion of theorem 1a holds. The assertion of theorem 1b is true in the general case, even without assumption 2.
Note that the 0-1 case is much easier to handle, for two reasons. First, there is an easy connection between $E_g(x)$ and $\nu(C_x)$, since $E_g(x) = 1 - \nu(C_x)$; therefore, as $\nu(C_x)$ increases, one moves closer to a minimal point of $E_g(x)$. Second, it is easy to see that if $d(x, y) \le \lambda$, then there is a positive transition density from x to y if and only if $\nu(C_y \setminus C_x) > 0$. Because the error function is 0-1, and assuming that $d(x, y) \le \lambda$, it suffices that $E_g(y) < E_g(x)$ to ensure a positive transition density from x to y. On the other hand, for a general error function the situation is rather complicated: one does not know whether there is any connection between $C_y \setminus C_x$ and $E_g(y) - E_g(x)$. This fact is the reason for assumption 2. In the final section we present an example (see example 2) in which the error function is a smooth, nonnegative, bounded function, but the orbits cannot leave the global maximum of $E_g(x)$. In that example, even though $E_g(x) \le E_g(x')$ for every x ∈ X, it so happens that $C_x \setminus C_{x'}$ is empty. Next, we present three examples in which model 2.1 may be used. In all three cases, the error function is a 0-1 function.

2.1 Example. The simplest case in which one can apply process 2.1 is the perceptron learning rule (see Haykin, 1994). In this process, we adapt a halfspace in $\mathbb{R}^d$ using a teacher that is also a halfspace. Denote the teacher by $T = \{y \in \mathbb{R}^d \mid x_0^*(y) > 0\}$, where $x_0^*$ is a linear functional on $\mathbb{R}^d$, and set the student $S = \{y \in \mathbb{R}^d \mid x_1^*(y) > 0\}$. Assume that $\|x_0^*\| = \|x_1^*\| = 1$. In this case, the error function is 0 for inputs that are in $T \cap S$ or in $T^c \cap S^c$ and 1 otherwise. Assume further that the input set V is a finite union of balls, disjoint from the boundary of T, and that the probability measure ν is given by a continuous density function supported on V. Recall that the Haar measure on the d-dimensional unit sphere $S^{d-1}$ is simply the uniform distribution on that set.
The space X will be a closed neighborhood of $x_0^*$ in $S^{d-1}$ such that for every $x^* \in X$, $\nu(C_{x^*}) > 0$, endowed with the normalized Haar measure. Since the function $x \mapsto \nu(C_x)$ is continuous, it attains a positive minimum on X. Thus, there is some δ > 0 such that for every $x^* \in X$, $\nu(C_{x^*}) \ge \delta$.
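As a concrete illustration of this 0-1 error function, the sketch below evaluates it for two unit functionals (represented as vectors) and estimates the global error $E_g$ by Monte Carlo; the particular vectors, the angle 0.4, and the rotation-invariant Gaussian input distribution are illustrative assumptions, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(1)

def perceptron_error(x_star, v, teacher):
    """0-1 on-line error of example 2.1: 1 iff the student halfspace
    {v : <x_star, v> > 0} and the teacher halfspace disagree on v."""
    return float((x_star @ v > 0) != (teacher @ v > 0))

teacher = np.array([1.0, 0.0])
student = np.array([np.cos(0.4), np.sin(0.4)])  # unit vector, angle 0.4 from teacher

# Monte Carlo estimate of the global error E_g(student) = 1 - nu(C_x):
# for a rotation-invariant input law it equals (angle between normals) / pi.
V = rng.standard_normal((20000, 2))
est = float(np.mean([perceptron_error(student, v, teacher) for v in V]))
```

Here `est` should be close to 0.4/π ≈ 0.127, the measure of the two disagreement wedges.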
Clearly, $\nu(C_x \,\Delta\, C_y)$ is continuous, and the conditions of theorem 1b hold. Therefore, the orbits $(X_n)$ defined by process 2.1 converge almost surely to the set $\{x \mid E_g(x) = 0\}$. Again, this example reveals the limitations of process 2.1: it may be applied only in the "local" setup; states for which $\nu(C_x) = 0$ are excluded from the state space.

2.2 Example. Here we present two examples that deal with the uniform approximation of a continuous function $f: V \to \mathbb{R}$ by a function selected from a family of continuous functions $\{g_x : V \to \mathbb{R} \mid x \in X\}$. Let X and V be compact subsets of $\mathbb{R}^m$ and $\mathbb{R}^n$, and let μ and ν be the normalized Lebesgue measures on X and V. Fix η > 0, and let

$$E(x, v) = \begin{cases} 0 & \text{if } |f(v) - g_x(v)| < \eta, \\ 1 & \text{otherwise.} \end{cases}$$
Note that $E_g(x) = 0$ if and only if $\|f - g_x\|_\infty < \eta$. In the examples that follow, there is a trade-off between the accuracy parameter η and the selection of the state space X. In the first example, X is a set of parameters that represent multilayer perceptrons (MLPs) with a single hidden layer, while in the second example, each x ∈ X represents a polynomial. Both of these classes are dense in the space of continuous functions C(V) with respect to the supremum norm. The fact that the set of polynomials is dense in C(V) is due to Weierstrass (Cheney, 1966). As for MLPs with a single hidden layer, Leshno, Lin, Pinkus, and Schocken (1993) proved the following:

Theorem 3. Let σ be a continuous function that is not an algebraic polynomial. Then $\operatorname{span}\{\sigma(\langle v, w \rangle + \theta) \mid w \in \mathbb{R}^d,\ \theta \in \mathbb{R}\}$ is dense in C(V) for every compact set $V \subset \mathbb{R}^d$.
Thus, given a function f ∈ C(V) that one wishes to approximate, the accuracy parameter determines the set of parameters in which one should seek the approximating function. As η decreases, the set of parameters needed to ensure that there is some state x for which $E_g(x) = 0$ increases in size. Conversely, if the parameter space X is given, there is some η that is the best possible accuracy with which one can approximate f using the functions $\{g_x \mid x \in X\}$; for smaller values of η, there will be no x for which $E_g(x) = 0$. In the two examples presented here, we assume that there is some $g_x$ that approximates f up to an accuracy of η; thus, there is some x ∈ X for which $E_g(x) = 0$. Clearly, if one wishes to use theorem 1, one has to find some kind of continuity condition on the family $\{g_x\}$ to ensure that assumption 1 holds. This is the object of the following lemma:

Lemma 1. Assume that for every fixed s ∈ X, $\lim_{r \to s} \|g_r - g_s\|_1 = 0$. Then $\nu(C_x \,\Delta\, C_y)$ is continuous with respect to each variable separately.
Proof. First we show that for every s ∈ X, $\nu(C_s \setminus C_r)$ is continuous with respect to r. Since ν is regular, there exists a compact set $K \subset C_s$ such that $\nu(C_s \setminus K) < \varepsilon/2$. Therefore, there is some δ > 0 such that $\sup_{v \in K} |g_s(v) - f(v)| \le \eta - \delta$. By Chebyshev's inequality,

$$\nu\left\{ |g_s(v) - g_r(v)| \ge \frac{\delta}{2} \right\} \le \frac{2\,\|g_s - g_r\|_1}{\delta}.$$
Hence, if s and r are close enough, then $\nu(K \cap C_r^c) \le \varepsilon/2$. Therefore,

$$\nu(C_s \setminus C_r) = \nu(K \cap C_r^c) + \nu\big((C_s \setminus K) \cap C_r^c\big) \le \varepsilon.$$

A similar argument shows that $\lim_{r \to s} \nu(C_r \setminus C_s) = 0$.

We are ready to present the two examples.

2.2.1 Multilayer Perceptron (MLP). The MLP is composed of several layers of perceptrons and an output function σ. It is formed by using the output of each layer of perceptrons as the input to the next one. Usually the output function σ of each perceptron in the MLP is assumed to be smooth, but we assume only that σ is a Lipschitz function and that the network has a single hidden layer. Let $X_0$, $Y_0$, V be compact sets in $\mathbb{R}^{k \times d}$, $\mathbb{R}^k$, $\mathbb{R}^d$, respectively, and set $Z_0 \subset \mathbb{R}^k$. Assume that both $X = X_0 \times Y_0 \times Z_0$ and V are equipped with the normalized Lebesgue measures μ and ν. The MLP may be viewed as a function $M: X_0 \times Y_0 \times Z_0 \times V \to \mathbb{R}$: for every $v \in \mathbb{R}^d$ and $(x, y, z) \in X$, the response of the MLP (x, y, z) to the input v is $M_{x,y,z}(v) = \sum_{j=1}^{k} y_j\, \sigma\big(\sum_{i=1}^{d} x_{ij} v_i + z_j\big)$. If
we view the MLP as a neural network, d is the number of cells in the input layer, k is the number of units in the second (hidden) layer, $(x_{ij})_{i=1}^{d}$ represents the synaptic weights between the first layer and the jth cell in the second layer, and $z_j$ is the threshold of that cell. Hence, $\sigma\big(\sum_{i=1}^{d} x_{ij} v_i + z_j\big)$ is the response of the jth cell in the second layer, and $y_j$ is the synaptic weight between the jth cell in the second layer and the output cell. Therefore, the response of the output cell is $M_{x,y,z}(v)$. We assume that for the given accuracy parameter η, there is some $(x_0, y_0, z_0)$ in the interior of X such that $\|M_{x_0,y_0,z_0} - f\|_\infty < \eta$; thus, the set of correct states Q has μ-positive measure. Indeed, note that the set of states (x, y, z) for which $M_{x,y,z}$ η-approximates f is open: if $M_{x,y,z} \in Q$, then for some ε > 0 the function $M_{x,y,z}$ (η − ε)-approximates f. Since σ is Lipschitz, it is easy to see that

$$\sup_{v \in V} \big| M_{x_1,y_1,z_1}(v) - M_{x,y,z}(v) \big| \le C\big( \|x - x_1\| + \|y - y_1\| + \|z - z_1\| \big),$$
where C is a constant that depends only on the Lipschitz constant of σ. Thus, a small perturbation of (x, y, z) also gives an η-approximation of
f. By the same argument, if $(x_n, y_n, z_n)$ converges to (x, y, z), then $M_{x_n,y_n,z_n}$ converges to $M_{x,y,z}$ uniformly, which implies convergence in the $L_1$ norm. Hence, the conditions of lemma 1 hold, and $\nu(C_x \,\Delta\, C_y)$ is continuous. In this example too, the process has a local feature, since we require that X contain no elements for which $\nu(C_x) = 0$. Here, one way to ensure that this is the case is to impose that the set $\{g_x\}$ be contained in a small $L_1$ neighborhood of the target concept $M_{x_0,y_0,z_0}$. This does not make the approximation procedure trivial, since a small neighborhood with respect to the $L_1$ norm may be a large set with respect to the supremum norm; thus, for most of the elements in such a neighborhood, $E_g$ is large. To summarize, the assumptions of theorem 1b hold, and the orbits $(X_n)$ of the process 2.1 converge P-almost surely to a function that η-approximates f.

2.2.2 Polynomial Approximation of a Lipschitz Function on [−1, 1]. This example is similar to the one presented above, so most of the details are omitted. There is a one-to-one correspondence between the set of polynomials of degree ≤ d and $\mathbb{R}^{d+1}$. Note that it is possible to η-approximate every continuous function by a polynomial. However, the process requires predetermining the degree of the polynomials we use, as well as the compact set from which the coefficients are selected. Assume that we have some additional information on the function f we wish to approximate—for example, its Lipschitz constant l. In this case, the trade-off between the accuracy parameter η and the set of parameters X may be estimated. For a bound on the degree of the approximating polynomial, we estimate $E_d(f) = \inf_{a_0,\ldots,a_{d-1}} \sup_{v \in [-1,1]} \big| \sum_{i=0}^{d-1} a_i v^i - f(v) \big|$. By Jackson's theorem (Cheney, 1966), if $E_d(f) = \eta$, then $d \le C\,\frac{l}{\eta}$, where C is some absolute constant.
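The trade-off between the accuracy parameter η and the degree of the approximating polynomial can be sketched numerically. The example below interpolates the Lipschitz function f(v) = |v| at Chebyshev nodes and records the smallest degree that achieves each accuracy, mirroring the Jackson-type bound d ≤ C·l/η; the function choice and the search procedure are illustrative assumptions:

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

# f(v) = |v| is Lipschitz on [-1, 1] with constant l = 1.
f = np.abs
grid = np.linspace(-1.0, 1.0, 2001)

def sup_error(deg):
    """Sup-norm error of the degree-`deg` Chebyshev interpolant of f."""
    p = Chebyshev.interpolate(f, deg, domain=[-1, 1])
    return float(np.max(np.abs(p(grid) - f(grid))))

def degree_needed(eta, max_deg=200):
    """Smallest degree whose interpolant eta-approximates f: as eta
    shrinks, the required degree grows roughly like 1/eta."""
    for d in range(1, max_deg + 1):
        if sup_error(d) < eta:
            return d
    return None

degrees = {eta: degree_needed(eta) for eta in (0.2, 0.1, 0.05, 0.025)}
```

By construction `degree_needed` is monotone: any degree that achieves accuracy 0.025 also achieves 0.2, so halving η can only increase the required degree.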
Next, to estimate the size of the set from which the coefficients $(a_i)$ are selected, we use Bernstein's inequality, which states that $\|p_d'\|_\infty \le d\,\|p_d\|_\infty$, where $p_d$ is a polynomial of degree d. Note that $|a_0| \le \|p_d\|_\infty$, $|a_1| = |p_d'(0)| \le \|p_d'\|_\infty$, and so on. Therefore, for every η, there are an integer d and a vector $(a_1, \ldots, a_d)$ such that $\sup_{v \in [-1,1]} \big| \sum_{1}^{d} a_i v^i - f(v) \big| < \eta$, where both d and $\|(a_1, \ldots, a_d)\|_\infty$ depend only on l and η. Adding the "locality" assumption, an argument similar to the one used in example 2.1 shows that the conditions of theorem 1b hold. Thus, the orbits of the process 2.1 converge almost surely to a correct state, which is a polynomial that η-approximates the target concept.

3 An Example of a 0-Temperature Process
We investigate one example of a 0-temperature process in which the maximal learning step $\lambda_n$ decreases to 0. We estimate the schedule by which $(\lambda_n)$ should be taken to 0 to ensure that the orbits of this process converge to a correct state, even if at every step there is a probability p < 1/2 that the teacher makes a mistake.
Much like the results in Kim and Sompolinsky (1998), we deal with a situation in which the state space is a neighborhood of a local minimum of $E_g$. However, we are able to prove convergence results that are stronger than those demonstrated for the OLGA: Kim and Sompolinsky (1998) showed that the average error $E(E_g(X_n))$ tends to $E_g(x_{\min})$, while in this section we show that $X_n$ converges in probability to $x_{\min}$. We begin by formulating the general assumptions regarding our model. The basic assumption is that if $d(x, y) > \lambda_n$, then the probability that x moves to y during the nth step is 0. Since we assume that the teacher can make a mistake with probability p, x remains stationary in two cases: if E(x, v) = 0 and the teacher is correct, or if E(x, v) = 1 and the teacher is wrong. Thus, up to a normalizing constant, the probability that x does not move during the nth step is $(1-p)\,\nu(C_x) + p\,(1 - \nu(C_x))$. We assume that if the teacher makes an error, then it provides the wrong answer for both E(x, v) and E(y, v). Therefore, for every $y \ne x$, $y \in B_{\lambda_n}(x)$, the conditional transition density from x to y is (up to a normalizing constant)

$$p\,(1 - E(x, v))\, e^{-\frac{1 - E(y, v)}{T_m}}\, \chi_{\{E(x,v)=0\}} + (1-p)\, E(x, v)\, e^{-\frac{E(y, v)}{T_m}}\, \chi_{\{E(x,v)=1\}} = (\ast).$$

Note that the first term arises from the case in which the teacher makes simultaneous mistakes, while the second term is the case in which the teacher is right. The normalizing constant is given by

$$d_{m,n} = (1-p)\,\nu(C_x) + p\,(1 - \nu(C_x)) + \int_{B_{\lambda_n}(x)} \int_V (\ast)\, d\nu(v)\, d\mu(y).$$
The 0-temperature process is defined by taking $T_m \to 0$. Therefore, the probability that x remains stationary is (up to the normalizing constant) $(1-p)\,\nu(C_x) + p\,(1 - \nu(C_x))$, while the conditional transition density during the nth step from x to $y \ne x$, $y \in B_{\lambda_n}(x)$, is (up to the normalizing constant)

$$p\,\chi_{\{E(x,v)=0\} \cap \{E(y,v)=1\}} + (1-p)\,\chi_{\{E(x,v)=1\} \cap \{E(y,v)=0\}}.$$
E (x, v ) D
1 0
if v > f (x ) if v · f (x ).
(3.1)
This assumption on the structure of the error function will allow us to estimate expressions of the form º(Cx nCy ) easily.
944
Shahar Mendelson
6 x, y 2 B l ( x ) is Here, the conditional transition density from x to y D n
pÂf f (y )·v · f (x )g C (1 ¡ p )Âf f (x) ·v · f (y )g . Therefore, by integrating over V, the transition operators of the 0 temperature process are: 1 Qn (y|x) cn ( x ) 8 (1 ¡ p )( f (y ) ¡ f (x )) 1 < p( f (x) ¡ f (y )) D cn ( x ) : 0
Gn (y|x ) D
f f (y ) > f (x)g \ Bln (x ) f f (y ) < f (x)g \ Bln (x ) otherwise.
(3.2)
The normalizing constant cn (x) is determined by the fact that the probabilities that x remains stationary and that x moves to y should add up to 1. Hence, Z cn (x ) D (1 ¡ p ) f (x ) C p (1 ¡ f (x )) C
B ln ( x )
Qn (y|x )d (y ).
Denote by (Xn ) the orbits of the process 3.2 at the nth stage. Theorem 4.
Let ln be the sequence of the learning steps, and assume that (ln ) 2
`d C 3 n`d C 2 . If f has a unique critical point in X and if that point is a global maximum, then the orbits (Xn ) converge in probability to that point. If p D 0, the convergence
is P -almost surely.
Put rn D ldnC 2 , Yn (x) D (Xn C 1 ¡ Xn ) / rn given that Xn D x, denote by H (x) the Hessian of f at x and set Un (x ) D E (hr f (x ), Yn (x )i | Xn D x ). Let o be the unique point for which r f (o ) D 0, and assume that o is a global maximum of f . The following Lemma describes the properties of the random variables dened above. Lemma 2.
1. Assume that K is a compact set such that d(K, o) > 0. Then there are an integer N and some L > 0 such that for every n > N, infx2K Un (x) ¸ L. 2. There is an integer N such that for every n > N, EUn ¸ 0. 3. If an D supx rn2 E kYn (x )k2 , then (an ) 2 `1 .
Let S be the unit sphere in Rd and denote by | | the Haar measure on C ¡ S. Put Sx D fs 2 S | hr f (x ), si > 0g and Sx,r D fs 2 S | f (x C rs) > f (x)g. Sx,r is dened in a similar way with the reversed inequality. Since cn (x ) tends to Proof.
A New On-Line Learning Model
945
(1 ¡ p) f (x) C p (1 ¡ f (x)) uniformly and since Un (x) D r1 hr f (x ), E (Xn C 1 ¡ n Xn |Xn D x )i, it is enough to prove the rst claim for the function ½ Z 1 (1 ¡ p ) (y ¡ x )( f (y ) ¡ f (x ))dy rn An ( x ) ¾ Z (y ¡ x)( f (x ) ¡ f (y))dy, r f (x ) D (¤), C p
Un (x )cn (x) D
Bn (x )
where An (x) D f f (y) > f (x )g \ Bln (x) and Bn (x) D f f (x ) > f (y )g \ Bln (x ). ® ®2 By Taylor’s formula, f (y) ¡ f (x) D hr f (x ), y ¡ xi C O (® y ¡ x® ); hence, (¤) D
1 rn
³
Z
(1 ¡ p ) Z C
Bln (x)
Z
An ( x )
hr f (x), y ¡ xi2 dy ¡ p
! ® ®3 ® ® O ( y ¡ x )dy .
Bn (x)
hr f (x), y ¡ xi2 dy
A simple calculation shows that the third term converges uniformly on K to 0. Therefore, it is enough to estimate the rst and second terms. Note that Z
Z An ( x )
hr f (x), y ¡ xi2 dy ¡ p Z D
ln 0
Á
r
dC1
Z
(1 ¡ p )
Bn (x)
hr f (x), y ¡ xi2 dy
!
Z 2
hr f (x ), si h (s)ds ¡ p
Sx,C r
2
hr f (x ), si h (s)ds dr,
S¡ x, r
where r d¡1 h (s) is theRJacobian of the transformation to spherical coordinates. R Set g (x, r) D (1 ¡ p ) S C hr f (x ), si2 h(s )ds ¡ p S¡ hr f (x ), si2 h(s)ds. Then, x, r
x, r
Z g (x, r) D
Z C
S x, r
hr f (x ), si2 h (s )ds ¡ p
S
hr f (x), si2 h (s )ds
D g1 (x, r ) ¡ pg2 (x ).
To nish the proof of the rst claim, it is enough to nd R and l such that for every r < R and every x 2 K, g(x, r) ¸ l. Indeed, once we establish the existence of such R and l and if ln < R, then 1 rn
Z
ln 0
r
dC1
g(x, r)dr ¸
1 rn
Z
ln 0
r d C1 l ¸
l . dC2
Let Dx,t D fs 2 S | hr f (x ), si > tg. It is easy to see that for every x, the sets Dx,t increase to Sx as t tends to 0. Since Dx,t and | Sx | are both continuous functions of x, then by Dini’s theorem, converges uniformly on Dx,t to | Sx | K. For every e > 0, there is some t > 0 such that for every x, Sx nDx,t D
e | Sx | ¡ < 2M , where M D supx kr f (x )k. Since r f is continuous, there is Dx,t e somedx > 0 such that for every y 2 Bdx (x ), Dx,t ½ Dy,t / 2 and º(Cx D Cy ) < 2M . Note that if ky ¡ xk < dx , s 2 Dx,t and z D y C rs, there is some absolute constant C such that f (z) ¡ f (y ) D hr f (y), sir C o (r2 ) ¸
tr ¡ Cr2 . 2
Hence, there is some R > 0 such that if r < R and y 2 Bdx (x ), then Dx,t ½ C \ Dy,t / 2 ½ Sy . Sy,r For such r and y we see that Z g1 (y, r) D D
Z Z
Z C Sy,r
Sy
> ZSy
> Sy
hr f (y ), si2 h (s )ds ¸
Dx, t
Z
hr f (y), si2 h (s )ds ¡
hr f (y ), si2 h (s )ds
Sy nDx, t
hr f (y ), si2 h (s)ds >
hr f (y), si2 h (s )ds ¡ M (| Sy nSx | C |Sx nDx,t | ) > hr f (y), si2 h (s )ds ¡ e.
On the other hand, for every y,

$$p\,g_2(y) = p \int_S \langle \nabla f(y), s\rangle^2 h(s)\,ds = 2p \int_{S_y} \langle \nabla f(y), s\rangle^2 h(s)\,ds.$$
Therefore, $g(y, r) \ge (1-2p) \int_{S_y} \langle \nabla f(y), s\rangle^2 h(s)\,ds - \varepsilon$. Since $f \in C^2(X)$ and $\nabla f(x) \ne 0$ on K, $\int_{S_x} \langle \nabla f(x), s\rangle^2 h(s)\,ds \ge C$ on K, implying that $g(y, r) > (1-2p)C - \varepsilon$. From here, the claim follows using a standard compactness argument. Let us turn to the second assertion. By the same argument used above,

$$\liminf_{r \to 0} g(x, r) \ge (1-2p) \int_{S_x} \langle \nabla f(x), s\rangle^2 h(s)\,ds = g(x).$$
Therefore, by Fatou's lemma,

$$\liminf_{n\to\infty} U_n(x)\,c_n(x) = \liminf_{n\to\infty} \frac{1}{r_n} \int_0^{\lambda_n} r^{d+1} g(x, r)\,dr \ge \frac{1}{r_n} \int_0^{\lambda_n} r^{d+1} g(x)\,dr = \frac{g(x)}{d+2}.$$
Hence,

$$\liminf_{n\to\infty} EU_n \ge E\,\frac{g(x)}{(d+2)\,c(x)} > 0.$$
Turning to the final claim, note that for every integer n and every x,

$$r_n^2\, E\|Y_n(x)\|^2 \le \frac{1}{(1-p)f(x) + p(1 - f(x))} \int_{A_n(x) \cup B_n(x)} \|y - x\|^2\, |f(y) - f(x)|\,dy \le C \int_{A_n(x) \cup B_n(x)} \|y - x\|^2\, \big|\langle \nabla f(x), y - x\rangle + O(\|y - x\|^2)\big|\,dy \le C' \int_{B_{\lambda_n}(x)} \|y - x\|^3\,dy \le C'' \lambda_n^{d+3},$$

where C, C′, and C″ are absolute constants. Thus,

$$\sum_{n=1}^{\infty} a_n \le C'' \sum_{n=1}^{\infty} \lambda_n^{d+3} < \infty.$$

Corollary 1. If $EU_n$ converges to 0, then $E|U_n|$ converges to 0 too.
Proof. For every ε > 0, fix a compact set $K \subset X$ such that $\mu(X \setminus K) < \varepsilon$ and d(K, o) > 0. By the lemma, for n large enough the functions $U_n$ are nonnegative on K. Also, the $U_n$ are uniformly bounded, and we may assume that they are bounded by 1. Therefore,

$$E|U_n| = \int_K U_n\,d\mu + \int_{X \setminus K} |U_n|\,d\mu \le EU_n + 2 \int_{X \setminus K} |U_n|\,d\mu < EU_n + 2\varepsilon.$$
Proof of Theorem 4. The idea behind the proof is to use Taylor's formula to approximate the differences $f(X_{n+1}) - f(X_n)$. Indeed, given $X_n$, we see that

$$f(X_{n+1}) = f(X_n + r_n Y_n) = f(X_n) + r_n \langle \nabla f(X_n), Y_n\rangle + \frac{1}{2} r_n^2\, \langle Y_n, H(X_n + \theta r_n Y_n)\, Y_n\rangle.$$

If $V_n(x) = E(\langle Y_n, H(X_n + \theta r_n Y_n)\, Y_n\rangle \mid X_n)$, then the conditional expectation of the expression for $f(X_{n+1})$ is

$$E(f(X_{n+1}) \mid X_n) = f(X_n) + r_n U_n(x) + \frac{1}{2} r_n^2 V_n(x).$$
Taking expectations on both sides and iterating, it follows that

$$E(f(X_{n+1})) = E f(X_1) + \sum_{i=1}^{n} r_i\, EU_i + \frac{1}{2} \sum_{i=1}^{n} r_i^2\, EV_i. \tag{3.3}$$
Since f is bounded by 1, $E(f(X_{n+1})) \le 1$. By the Cauchy-Schwarz inequality, $|V_i| \le C \sup_x E\|Y_i(x)\|^2$, which implies that $r_i^2\, E|V_i| \le C r_i^2 \sup_x E\|Y_i(x)\|^2 = C a_i$.
Recall that by lemma 2, $(a_i) \in \ell_1$; hence, $\sum_{1}^{\infty} r_i^2\, EV_i$ converges. Again by the lemma, $\sum_{i=1}^{n} r_i\, EU_i$ is a series with nonnegative terms for sufficiently large i. By equation 3.3 and the above, this series is bounded; thus, it converges, which implies that $E(f(X_n))$ converges too. Since $(r_n) \notin \ell_1$, there is a subsequence $EU_{n_j}$ that tends to 0, and by corollary 1, $E|U_{n_j}|$ tends to 0 too. Using Chebyshev's inequality, $U_{n_j}$ converges in probability to 0; therefore, there is a subsequence of $U_{n_j}$, also denoted by $U_{n_j}$, that converges almost surely to 0. According to lemma 2, the $U_n$ are uniformly bounded away from 0 on every compact set not containing o; therefore, $(X_{n_j})$ must converge almost surely to o. On the other hand, $E(f(X_n))$ converges, and f is continuous; hence, $E(f(X_n))$ must converge to $E f(o) = f(o)$. Since o is the unique maximum of f, for every ε > 0 there is some δ > 0 such that $\{\|X_n - o\| \ge \varepsilon\} \subset \{f(o) - f(X_n) \ge \delta\}$, and by Chebyshev's inequality, the measure of the latter set tends to 0. To prove the second claim, note that if p = 0, then $f(X_n)$ is increasing almost surely; therefore, $(f(X_n))$ converges almost surely. Since $E(f(X_n))$ converges to f(o), $f(X_n)$ must converge to f(o) almost surely, and since o is the unique maximum, $X_n$ converges to o almost surely.

4 Proofs of the Results from Section 2
In this section our aim is to prove the results stated in section 2. Recall that for every set A of positive measure, $P_n(A \mid x)$ is the transition probability from x to A in the nth stage. Clearly, for every x and any such set A,

$$P_n(A \mid x) = \frac{1}{c_n(x)} \int_{A \cap B_\lambda(x)} \int_V E(x, v)\, e^{-E(x', v)/T_n}\, d\nu(v)\, d\mu(x') = \frac{f_n^A(x)}{c_n(x)} \tag{4.1}$$

converges pointwise to

$$P(A \mid x) = \frac{\int_{A \cap B_\lambda(x)} \int_V E(x, v)\, \chi_{C_{x'}}(v)\, d\nu(v)\, d\mu(x')}{\nu(C_x) + \int_{B_\lambda(x)} \int_V E(x, v)\, \chi_{C_{x'}}(v)\, d\nu(v)\, d\mu(x')} = \frac{f^A(x)}{c(x)}. \tag{4.2}$$
Let $P^0$ be the probability measure induced by the orbits $(X_n)$ of the 0-temperature process 4.2, and let P be the measure induced by the orbits of process 4.1. We will show that the convergence of $P_n$ to P is uniform in both x and A. First, we prove that $f_n^A(x)$ converges uniformly to $f^A(x)$; a similar argument shows that $c_n(x)$ converges uniformly to c(x). The desired convergence follows, since $c_n(x)$ and c(x) are bounded away from 0.

Lemma 3. For every $A \subset X$ of positive measure, let $f^A$ and $f_n^A$ be as in equations 4.1 and 4.2. Then $(f_n^A)$ converges to $f^A$ uniformly in both x and A.
Proof. We may assume that E(x, v) is bounded by 1. Set $\mathcal{E} = \{(x', v) \mid E(x', v) > 0\}$, and note that if $(x', v) \notin \mathcal{E}$, then $e^{-E(x', v)/T_n} = \chi_{C_{x'}}(v)$. Also, for every ε > 0, there are a set $K \subset \mathcal{E}$ and some β > 0 such that $\tau(\mathcal{E} \setminus K) < \varepsilon$ and $E(x', v) > \beta$ on K. Therefore,

$$\sup_{x \in X} \big| f_n^A - f^A \big| \le \sup_{x \in X} \int_{X \times V} E(x, v)\, \big| e^{-E(x', v)/T_n} - \chi_{C_{x'}}(v) \big|\, d\tau(x', v) = \sup_{x \in X} \int_{\mathcal{E}} E(x, v)\, \big| e^{-E(x', v)/T_n} - \chi_{C_{x'}}(v) \big|\, d\tau(x', v) = (\ast).$$

Since the integrands are uniformly bounded by 2, $\chi_{C_{x'}}(v)$ vanishes on $\mathcal{E}$, and $\tau(\mathcal{E} \setminus K) < \varepsilon$, it follows that

$$(\ast) \le 2\varepsilon + \int_K E(x, v)\, e^{-\beta/T_n}\, d\tau \le 2\varepsilon + e^{-\beta/T_n}.$$
Clearly, the estimates above are uniform in the set A, which proves our claim.

The observation that follows uses a similar computation:

Lemma 4. For every set $A \subset X$, the function $P(A \mid x)$ is continuous. Moreover, $P(X_n \in A_n, \ldots, X_2 \in A_2 \mid X_1 = x)$ is a continuous function of x for all measurable sets $A_2, A_3, \ldots, A_n$.
To prove this fact, one has to use the assumption that for every x, $\nu(C_x \,\Delta\, C_{x'})$ is a continuous function of x′. The proof of the lemma is straightforward and is omitted. Our aim is to use information concerning the 0-temperature process 4.2 to derive similar results about process 4.1. If $(X_n)$ denotes an orbit, then for every set $O \subset X$, put $O_i = \{(X_n) \mid X_i \in O\}$, $L_n^0(x, O) = P^0\{X_i \in O \text{ for some } i \ge n \mid X_n = x\}$, $L_n(x, O) = P\{X_i \in O \text{ for some } i \ge n \mid X_n = x\}$, and let $L^0(x, O)$ be the $P^0$-probability of entering O infinitely often given that $X_1 = x$.
Lemma 5. Let $O \subset X$ be a set for which there are α > 0 and an integer N such that for every x ∈ X, $P^0\{\bigcup_{1}^{N} O_i \mid X_1 = x\} > \alpha$. Then there is some integer $M_0$ such that for every $m > M_0$ and every x ∈ X, $L_m(x, O) \ge \frac{\alpha}{2}$.
Proof. Recall that the 0-temperature process is a homogeneous Markov process. Also, since for every m and every x, $P^0\{\bigcup_{m}^{m+N} O_i \mid X_m = x\} > \alpha$, and since $P_n$ converges uniformly to P, it follows that for m large enough and every x,

$$P\{\textstyle\bigcup_{m}^{m+N} O_i \mid X_m = x\} > \frac{\alpha}{2}.$$

Hence, for m large enough and for every x, $L_m(x, O) > \frac{\alpha}{2}$.
Lemma 6. Assume that there are α > 0 and an integer N such that for every n > N and every x, $L_n(x, O) > \alpha$. Then the orbits of the process 4.1 enter O infinitely often P-almost surely.
Proof. This result appears in Orey (1971) in a slightly weaker form. The proof uses the same idea as the one presented in Orey and is set out here for the sake of completeness. The first part of the proof is a version of a 0-1 law due to P. Lévy (see Loève, 1963). Let $Y_1, Y_2, \ldots$ be a sequence of random variables, and let Y be a random variable defined on $Y_1, Y_2, \ldots$ such that $E|Y| < \infty$. Clearly, $Z_n = E(Y \mid Y_1, \ldots, Y_n)$ forms a martingale; hence, by the martingale convergence theorem (Loève, 1963, p. 393), $Z_n$ converges almost surely to Y. By setting $B_i = \{X_i \in O\}$, $B = \{X_n \in O \text{ i.o.}\}$, $Y_n = X_n$, and $Y = \chi_B$, it follows that $P(B \mid Y_1, \ldots, Y_n) = E(Y \mid Y_1, \ldots, Y_n)$ converges almost surely to $\chi_B$, and $P(\bigcup_{k}^{\infty} B_i \mid Y_1, \ldots, Y_n)$ tends to $\chi_{\bigcup_{k}^{\infty} B_i}$ for every fixed k. On the other hand, for every k ≤ n,
$$P(\textstyle\bigcup_{k}^{\infty} B_i \mid Y_1, \ldots, Y_n) \ge P(\textstyle\bigcup_{n}^{\infty} B_i \mid Y_1, \ldots, Y_n) \ge P(B \mid Y_1, \ldots, Y_n);$$

thus, by taking $n \to \infty$,

$$\chi_{\bigcup_{k}^{\infty} B_i} \ge \limsup_{n\to\infty} P(\textstyle\bigcup_{n}^{\infty} B_i \mid Y_1, \ldots, Y_n) \ge \liminf_{n\to\infty} P(\textstyle\bigcup_{n}^{\infty} B_i \mid Y_1, \ldots, Y_n) \ge \chi_B.$$

Taking $k \to \infty$, the left side converges almost surely to $\chi_B$. Hence,

$$P(\textstyle\bigcup_{n}^{\infty} B_i \mid Y_1, \ldots, Y_n) \to \chi_B.$$
Denote by $X_\infty$ the set of all the orbits of the process. Since $L_n(X_n, O) = P(\bigcup_n^\infty B_i \mid X_1, \ldots, X_n)$, then by the 0-1 law presented above, $L_n(X_n, O)$ tends to the characteristic function of the set $\{X_n \in O \text{ i.o.}\}$. By our assumption, for $n$ large enough and for every $x$, $L_n(x, O) > a$. For such $n$, $L_n(X_n, O) > a$ almost surely. Therefore, excluding a set of 0-probability,

$$X_\infty \subset \Big\{\limsup_{n\to\infty} L_n(X_n, O) > 0\Big\} = \Big\{\lim_{n\to\infty} L_n(X_n, O) = 1\Big\} = \{X_n \in O \text{ i.o.}\}.$$
Hence, $X_n \in O$ infinitely often $P$-almost surely. By a similar method, one shows the following:

Lemma 7. Assume that there are $a > 0$ and an integer $N$, such that for every $n > N$ and every $x \in A$, $L_n(x, B) > a$. Then, $P$-almost surely, orbits which visit $A$ infinitely often also visit $B$ infinitely often.
Corollary 2. Combining lemma 5 with lemma 6, it follows that in order to prove theorem 1a, it is enough to show that for every neighborhood $A$ of $O$, there are $a > 0$ and an integer $N$ such that for every $x$, $P^0(\bigcup_1^N A_i \mid x) > a$.
Proof of Theorem 1. We begin with the proof of theorem 1b. Let $A$ be an open set containing $Q = \{x \mid E_g(x) = 0\}$. Since $\lambda = \operatorname{diam} X$ and since $Q$ has a $\mu$-positive measure, every $x$ has a positive $P^0$-probability to enter $Q$. Since $A^c$ is compact, a simple continuity argument shows that there is some $a > 0$ such that for every $x \notin A$, $P(Q \mid x) \ge a$. But since $P_n \to P$ uniformly, then for $n$ large enough, $\inf_{x \notin A} P_n(Q, x) \ge \frac{a}{2}$. Hence, by lemma 7, orbits that visit $A^c$ i.o. must enter $Q$ $P$-almost surely. This is impossible since $Q$ is an absorbing set. Thus, $P$-almost every orbit enters $A^c$ a finite number of times, implying that the orbits of the process 2.1 converge $P$-almost surely to $Q$.

To prove theorem 1a we use corollary 2. Assume that we limit the size of the learning step to $\lambda$ and let $A$ be a neighborhood of $O$. Set $A_i = \{(X_n) \mid X_n \in A \text{ for } n = i\}$ and recall that

$$P^0\Big(A_i \setminus \bigcup_1^{i-1} A_j \,\Big|\, X_1 = x\Big) = P^0\big(X_i \in A,\ X_{i-1} \in A^c, \ldots, X_2 \in A^c \,\big|\, X_1 = x\big)$$

is a continuous function of $x$. Clearly, for every integer $n$,

$$P^0\Big(\bigcup_1^n A_i \,\Big|\, X_1 = x\Big) = \sum_{i=1}^n P^0\Big(A_i \setminus \bigcup_1^{i-1} A_j \,\Big|\, X_1 = x\Big).$$
Shahar Mendelson
Therefore, for every $n$, $h_n(x) = P^0(\bigcup_1^n A_i \mid X_1 = x)$ is a continuous function too. Note that if for every $x$ there is some integer $n = n(x)$ such that $h_n(x) = e_x > 0$, then by the continuity of $h_n$, there is a neighborhood $U_x$ of $x$ on which $h_n(x) > \frac{e_x}{2}$. Since $X$ is compact and $(h_n(x))$ is a monotone increasing sequence, there exists a finite subcover $(U_{x_i})$ of $X$, such that for every $x \in U_{x_i}$ and every $n > n(x_i)$, $h_n(x) \ge \frac{e_{x_i}}{2}$. Therefore, there are $a > 0$ and an integer $N$ such that for every $x \in X$, $h_N(x) > a$. By corollary 2, this suffices to prove our claim. Moreover, note that it suffices to show that $L^0(x, A) > 0$ for every $x$, since this implies that for every $x$, there is some integer $N(x)$ such that for every $n > N(x)$, $h_n(x) > 0$.

Finally, to show that indeed for every $x \in X$, $L^0(x, A) > 0$, recall that since the error function is 0-1, then if $E_g(x) < E_g(y)$ and $d(x, y) \le \lambda$, there is a positive transition density from $x$ to $y$. For every $x$, let $y_x$ be a point of $B_{\lambda/2}(x)$ at which $\nu(C_y)$ attains its maximum over $B_{\lambda/2}(x)$. Define a sequence $x_1 = x$, $x_2 = y_x$, $x_3 = y_{y_x}$, and so on. A simple compactness argument shows that $x_i = x_0$ for $i$ larger than some $n$. Thus, $x_0$ must be a local maximum of $\nu(C_x) = 1 - E_g(x)$. Note that by the above, there is a positive transition density from $x$ to $y_x$. Moreover, since $\nu(C_z)$ is a continuous function of $z$, there is some $0 < e < \frac{\lambda}{2}$ such that for every $z \in B_e(y_x) \subset B_\lambda(x)$, $x$ has a positive transition density to $z$. Since $B_e(y_x)$ has a nonempty interior, it has a $\mu$-positive measure, implying that the transition from $x$ into that set is positive. Using a similar argument, one shows that there is a positive probability that $B_e(y_x)$ is mapped into an arbitrarily small ball centered in $y_{y_x}$. It takes a finite number of steps to reach a neighborhood of $x_0$, which is a subset of $A$; thus, $L^0(x, A) > 0$.

Sketch of Proof of Theorem 2. The proof of theorem 2 goes along the same lines as the proof of theorem 1. The only difference is in the definition of $y_x$. The idea is to equip $B_\lambda(x)$ with the partial order $\le$ defined by $x \le y$ if and only if $C_x \subset C_y$. Then use Zorn's lemma to find a maximal element in $B_\lambda(x)$, and let $y_x$ be that maximal element. Again, define the sequence $(x_n)$, and use a compactness argument to show that for $i > n$, $x_i = x_0$ and that $x_0$ is the maximal element in $B_\lambda(x_0)$. Note that by assumption 2, the only elements that are maximal in their neighborhood are elements of $O$. The proof of theorem 2b is identical to that of theorem 1b and does not require any additional assumptions.
5 Discussion
This investigation was motivated by the work of Kim and Sompolinsky (1996, 1998), who introduced the on-line Gibbs algorithm (OLGA).
On-line Gibbs learning is slightly different from the model we defined. In the OLGA, the conditional transition density is given by

$$P_n(X_{n+1} = x' \mid X_n = x, V_n = v) = \frac{1}{c}\, e^{-E_n(x', x, v)/2T_n}, \qquad (5.1)$$

where $E_n(x', x, v) = E(x', v) + \frac{1}{2}\lambda_n \|x - x'\|^2$. In the 0-temperature process, $x$ moves during the $n$th stage to the nearest point $x'$ that gives a correct response to the input $v$ (assuming that $\|x - x'\| \le \sqrt{2/\lambda_n}$), and remains stationary otherwise. Note that the part that $\lambda$ plays in process 2.1 is similar to the role of $\lambda^{-1}$ in the OLGA.

There are several differences between process 2.1 and the OLGA. For $T \ne 0$, the processes differ in two main respects. The first is the way the bounded change enters the game. In process 2.1, the bounded change is simply a cutoff function that prevents the orbits from moving "too far." In the OLGA rule, this is done in a smoother fashion, namely the addition of the term $\frac{1}{2}\lambda_n \|x - x'\|^2$ to the energy function. The second and more important difference is the additional linear term $E(x, v)$ appearing in the transition densities of process 2.1. This feature prevents large changes if the student gives an "almost correct" answer and prevents change if the answer is correct. Note that in the OLGA rule for $T \ne 0$, changes may occur even if the student's answer is correct.

In the 0-temperature process, the OLGA changes the current state to the nearest state that yields a correct answer to the given input, provided that the change is not "too large." For example, if the answer is correct, no change is made. As in the OLGA, in our version of the 0-temperature process, the change is made only if the student gave the wrong answer to the given input. However, unlike the OLGA, the transition is made to any state that gives the correct answer and is close enough to the current state. Hence, given the current state $x_n$ and such an input $v$, $x_{n+1}$ will be equally distributed on the set $\{x \mid d(x, x_n) \le \lambda,\ E(x, v) = 0\}$. This definition has two advantages compared with the 0-temperature OLGA rule. One is that this is the "natural" limit of the process where $T \to 0$, and the other is that one does not have to solve a difficult optimization problem needed to run the OLGA at $T = 0$.

Our selection of the 0-temperature process has its analytical benefits too. The fact that the transition operators as $T \to 0$ converge uniformly to the 0-temperature operators allowed us to use the simpler 0-temperature process to estimate the orbits of the process as $T \to 0$.

Much like the OLGA, process 2.1 has a local nature. In process 2.1 the local nature is expressed in the assumption that each state has "many" inputs on which it agrees with the teacher. In the case of the OLGA, the local nature is more explicit. In most of the examples investigated in Kim and Sompolinsky (1998) for $T = 0$, the results are obtained by expansions near a local minimum while assuming that the learning steps decrease to 0.
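The 0-temperature rule just described, move only when the student errs and then uniformly to a nearby correct state, can be written as a short procedure. The sketch below is our own illustration (the interval state-space, the threshold-style error function used in the test, and the rejection sampler are assumptions, not constructs from the paper); it draws $x_{n+1}$ approximately uniformly from $\{x \mid d(x, x_n) \le \lambda,\ E(x, v) = 0\}$:

```python
import random

def zero_temperature_step(x, v, error, lam, n_trials=1000, rng=random):
    """One update of the 0-temperature version of process 2.1 (sketch).

    If the student x answers the input v correctly, the state is unchanged.
    Otherwise the new state is sampled (approximately) uniformly from the
    set of correct states within distance lam of x, here by rejection.
    """
    if error(x, v) == 0:
        return x                       # correct answer: no change
    for _ in range(n_trials):
        y = x + rng.uniform(-lam, lam)
        if error(y, v) == 0:
            return y                   # nearby correct state found
    return x                           # no admissible state in reach; stay put
```

Unlike the 0-temperature OLGA, the step does not seek the nearest correct state; any correct state within reach of the current one is equally likely.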
The philosophy behind their idea is that the orbits are eventually trapped in a neighborhood of a local minimum, and the event of transition from the neighborhood of one local minimum to the other is rare. The authors focused on estimating the fluctuations of the average error $E(E_g(X_n))$ from $E_g(x_{\min})$, where $x_{\min}$ is the local minimum. For $T \ne 0$ and $\lambda \to \infty$, it was claimed that the OLGA converges to the Gibbs distribution, where $E_g$ plays the role of the energy function. Thus, like other annealing processes, to ensure convergence in distribution to a global minimum of $E_g$, one has to decrease the temperature slowly while decreasing the learning steps.

We were able to prove convergence results for process 2.1. We showed that the orbits come arbitrarily close to a local minimum of $E_g$ almost surely, without assuming that the state-space is a small neighborhood of a local minimum. Moreover, if there are "many" correct states, then by theorem 2b, the orbits of process 2.1 converge almost surely to a correct state.

Note that both process 2.1 and the OLGA may not converge to a global minimum of $E_g$ when $T = 0$. To show that, we present two examples with a common feature: both of them are 0-temperature processes in which the process does not converge to a global minimum of $E_g$. In the first example, the orbits are trapped in a local minimum, and in the second example, the situation is even worse: the global maximum of $E_g$ is absorbing.

Example 1. Set $X = [-1/2, 2]$, $V = [-1, 1]$, and assume that the error function is 0-1. We define $E(x, v)$, which induces the 0-temperature process, using Figure 1. Here, $E(x, v) = 0$ on the shaded area and 1 otherwise. Note that the global minimum of $E_g$ in $X$ is attained at $x = 2$, and in this state, the student is in complete agreement with the teacher. However, if $x < -\sqrt{2/\lambda_n}$, it cannot move toward 2. Indeed, if $x < v < 0$, $x$ gives a correct response to $v$, so it remains stationary. For $v > 0$, $x$ does not move because the nearest point that yields a correct response to $v$ is too far away. Hence, if, for example, the initial distribution $\mu$ is supported in $[-1/2, -1/4]$ and $\lambda_n > 32$, the orbits of the 0-temperature process converge almost surely to $x = -1/2$.
In a similar fashion, for every learning step sequence $\sqrt{2/\lambda_n}$ tending to 0, there are initial distributions such that the orbits of the 0-temperature process do not converge in distribution to the global minimum of $E_g(x)$. Moreover, for every $m > 0$ there are some initial distributions such that for every sequence $\lambda_n$ with $\lambda_i \ge m$ for every $i$, the process 5.1 converges almost surely to a local minimum of $E_g$ which is not the global minimum.

The following example serves as a counterexample to one of the claims appearing in Kim and Sompolinsky (1998). It is a smooth on-line error function for which the OLGA at $T = 0$ and $\lambda \to \infty$ does not converge to a local minimum of $E_g$ (even in the weak sense of convergence of the average error), since the global maximum of $E_g$ is an absorbing state. This emphasizes the local nature needed for the OLGA: if one does not limit the state-space to
Figure 1: In this 0-1 on-line error function, the shaded area represents pairs $(x, v)$ on which the student $x$ gives the correct response to the input $v$. Here, the OLGA with bounded change may converge to a local minimum of the global error function which is not the global minimum (see example 1).
a neighborhood of the local minimum, the orbits may be trapped in another domain of attraction, which in this case is a global maximum of $E_g$.

Example 2. Here we construct a continuous function $E(x, v)$ on the set $D = X \times V = [-1/2, 1/2] \times [0, 2]$, with respect to the normalized Lebesgue measure, such that the global maximum of $E_g$ is an absorbing state. Define the error function on $D$ by

$$E(x, v) = \begin{cases} \dfrac{2(1 - x^2)(v + x^2 - 1)}{x^2 + 1} & \text{if } v > 1 - x^2 \\[4pt] 0 & \text{if } v \le 1 - x^2. \end{cases}$$

Clearly, for every $x$, $C_x = \{v \mid 0 \le v \le 1 - x^2\}$. Thus, $C_0 \supset C_x$, $E(x, v) \ge 0$ on $D$, and $x = 0$ is an absorbing state. Indeed, for every input $v > 1$, there is no state for which $E(x, v) = 0$, and for $0 \le v \le 1$, $E(0, v) = 0$. Moreover, it is clear that the 0-temperature OLGA converges to $x = 0$. Indeed, note that if $|x| > |y|$, then the transition probability from $y$ to $x$ is 0, because $C_y \supset C_x$: if $y$ gives the wrong answer on the input $v$, so will $x$, and no transition will be made. Therefore, the dynamics of the process "pushes" any initial state toward $x = 0$.
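The properties just listed are easy to confirm numerically. In this sketch (ours, for illustration; the midpoint-rule helper is not from the paper), we check that $E \ge 0$ on $D$, that $x = 0$ is correct on every input $v \le 1$, and that $\int_0^2 E(x, v)\, dv = 1 - x^4$, which is largest at the absorbing state $x = 0$:

```python
def E(x, v):
    # Error function of example 2 on D = [-1/2, 1/2] x [0, 2].
    if v > 1.0 - x * x:
        return 2.0 * (1.0 - x * x) * (v + x * x - 1.0) / (x * x + 1.0)
    return 0.0

def integral_over_V(x, n=20000):
    # Midpoint-rule approximation of the integral of E(x, .) over V = [0, 2].
    h = 2.0 / n
    return sum(E(x, (k + 0.5) * h) for k in range(n)) * h
```

Because $C_0 \supset C_x$ for every $x$, a transition away from 0 would require an input on which 0 errs but some nearby state does not, and no such input exists; hence $x = 0$ stays absorbing even though it maximizes the average error.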
On the other hand,

$$E_g(x) = (1 - x^2) \int_{1 - x^2}^{2} \frac{2(v + x^2 - 1)}{x^2 + 1}\, dv.$$

Changing the integration variable to $z = \frac{2(v + x^2 - 1)}{x^2 + 1}$, we see that $E_g(x) = \frac{1 - x^4}{2} \int_0^2 z\, dz = 1 - x^4$. Thus, $E_g(x)$ attains a global maximum at $x = 0$. In a similar fashion, it is possible to construct such a function with any degree of smoothness.

This counterexample seems to indicate that deriving global results based on the behavior near local minima has its problems. Indeed, in the analysis of the OLGA (see Kim & Sompolinsky, 1998, p. 2339), the process was analyzed using the Fokker-Planck equation. Neglecting terms that are $O(\frac{1}{\lambda^2})$, the Fokker-Planck equation for the transition operators is

$$\frac{\partial}{\partial t} P(x, t) = \nabla \cdot \big[\nabla E_g(x)\, P(x, t)\big] + \frac{1}{\lambda} \sum_{i,j} \partial_i \partial_j \big[T_{ij}\, P(x, t)\big],$$

where $t$ is the time variable and $T_{ij}$ is the diffusion matrix. Next, the diffusion term (which is $O(\frac{1}{\lambda})$) is neglected, under the assumption that $\nabla E_g$ is not small. However, this does not guarantee a descent toward a local minimum. This would be the case if one assumed that there is a unique critical point in the state-space, which is a local minimum, that is, a strong locality assumption. As the example shows, if there are other critical points, the orbits of the OLGA may converge to those points and not to the local minimum.

Roughly speaking, the reason for the behavior demonstrated in example 2 is that the OLGA does not take into account the "size" of the error made by $x$ on the input $v$. The transition is made based on the fact that the new state is correct on that input. Hence, the orbits of the process are driven toward the state $x_0$ which has the largest set of correct answers. If the error function is 0-1, this implies that $E_g$ attains a minimum at $x_0$ and that the OLGA will eventually converge to $x_0$. On the other hand, if the on-line error function is not 0-1, it is possible that $x_0$ makes "very large" mistakes on inputs on which it gives the wrong answer. Thus, although $x_0$ makes few errors, the average error $E_g(x_0)$ may be made arbitrarily large. It is important to note that the on-line error function in example 2 does not satisfy assumption 2; thus, theorem 2 may not be applied to this case.

6 Conclusion
The main stumbling block regarding process 2.1 is its local nature, which is due to the assumption that $\nu(C_x)$ is bounded away from 0. We were not able to remove this restrictive assumption, but it may be possible to do so using a more sophisticated probabilistic argument. The reason for the assumption $\lambda_n \to \infty$ in Kim and Sompolinsky (1998) was to overcome the possibility that the teacher makes a mistake. We did not treat this problem outside the one example presented in section 3, and it deserves additional consideration.

Note that an easy way to generalize part a of theorem 1 is to formulate a stopping procedure, which freezes the process once a state is close enough
to a correct answer. One possibility is to count the number of consecutive correct responses at each state and stop the process once the number passes a given threshold. This gives an estimate of the measure $\nu(C_x)$. However, if the error function is not 0-1, the fact that $\nu(C_x)$ is close to its global maximum does not imply that $E_g(x)$ is close to the global minimum (this is the idea behind example 2). For that, one needs additional assumptions on the structure of the error function.

Our final remark concerns process 3.2. There are several other probabilistic schemes that may solve the relevant optimization problem, at least when the function $f$ has a unique maximum. Particularly well known is the so-called stochastic approximation scheme (see Kushner & Yin, 1997). We do not claim that process 3.2 yields the best solution to the optimization problem. Our goal was to present one complete example in which the 0-temperature process with a bounded change converges to the minimum of $E_g$, even in the presence of noise.

We were not able to formulate a more general theorem than the one presented here. Even when the error function is 0-1, the function $\nu(C_x)$ does not determine the transition density from state to state. All we know is that when $\nu(C_x) > \nu(C_y)$, there is a positive transition density from $y$ to $x$. Unfortunately, it is possible to construct natural examples for which $\nu(C_y) > \nu(C_x)$ but still there is a positive transition density from $y$ to $x$, and it is possible that the analogous convergence theorem is not true.

References

Cheney, E. W. (1966). Introduction to approximation theory. New York: McGraw-Hill.

Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.

Kim, J. W., & Sompolinsky, H. (1996). On-line Gibbs learning. Physical Review Letters, 76, 3021–3024.

Kim, J. W., & Sompolinsky, H. (1998). On-line Gibbs learning I. Physical Review E, 58, 2335–2347.

Kushner, H. J., & Yin, G. G. (1997). Stochastic approximation algorithms and applications. Berlin: Springer-Verlag.

Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks. Neural Networks, 6, 861–867.

Loève, M. (1963). Probability theory (3rd ed.). New York: D. Van Nostrand.

Minsky, M., & Papert, S. (1972). Perceptrons. Cambridge, MA: MIT Press.

Orey, S. (1971). Limit theorems for Markov chain transition probabilities. New York: Van Nostrand Reinhold.

Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps. Reading, MA: Addison-Wesley.

Received January 11, 1999; accepted July 10, 2000.
Neural Computation, Volume 13, Number 5
CONTENTS Article
Patterns of Synchrony in Neural Networks with Spike Adaptation C. van Vreeswijk and D. Hansel
959
Note
Bayesian Analysis of Mixtures of Factor Analyzers Akio Utsugi and Toru Kumagai
993
Letters
Synchronization in Relaxation Oscillator Networks with Conduction Delays Jeffrey J. Fox, Ciriyam Jayaprakash, DeLiang Wang, and Shannon R. Campbell
1003
Predictions of the Spontaneous Symmetry-Breaking Theory for Visual Code Completeness and Spatial Scaling in Single-Cell Learning Rules Chris J. S. Webber
1023
Localist Attractor Networks Richard S. Zemel and Michael C. Mozer
1045
Stochastic Organization of Output Codes in Multiclass Learning Problems Wolfgang Utschick and Werner Weichselberger
1065
Predictive Approaches for Choosing Hyperparameters in Gaussian Processes S. Sundararajan and S. S. Keerthi
1103
Architecture-Independent Approximation of Functions Vicente Ruiz de Angulo and Carme Torras
1119
Analyzing Holistic Parsers: Implications for Robust Parsing and Systematicity Edward Kei Shiu Ho and Lai Wan Chan
1137
The Computational Exploration of Visual Word Recognition in a Split Model Richard Shillcock and Padraic Monaghan
1171
ARTICLE
Communicated by Bard Ermentrout
Patterns of Synchrony in Neural Networks with Spike Adaptation

C. van Vreeswijk*
Racah Institute of Physics and Center for Neural Computation, Hebrew University, Jerusalem, 91904 Israel

D. Hansel*
Centre de Physique Théorique, Ecole Polytechnique, 91128 Palaiseau Cedex, France

We study the emergence of synchronized burst activity in networks of neurons with spike adaptation. We show that networks of tonically firing adapting excitatory neurons can evolve to a state where the neurons burst in a synchronized manner. The mechanism leading to this burst activity is analyzed in a network of integrate-and-fire neurons with spike adaptation. The dependence of this state on the different network parameters is investigated, and it is shown that this mechanism is robust against inhomogeneities, sparseness of the connectivity, and noise. In networks of two populations, one excitatory and one inhibitory, we show that decreasing the inhibitory feedback can cause the network to switch from a tonically active, asynchronous state to the synchronized bursting state. Finally, we show that the same mechanism also causes synchronized burst activity in networks of more realistic conductance-based model neurons.

1 Introduction
The central nervous system (CNS) displays a wide spectrum of macroscopic, spatially synchronized, rhythmic activity patterns, with frequencies ranging from 0.5 Hz ($\delta$ rhythm) to 40–80 Hz ($\gamma$ rhythm) and even up to 200 Hz (for a review, see Gray, 1994). Rhythmic activities in the brain can also differ by their strength, that is, by the amplitude of the cross-correlation (CC) peaks (properly normalized), and by the precision of the synchrony, which can be characterized by the width of the CC peaks. In many cases, for instance, in the visual cortex during visual stimulation, the CC peaks are narrow; action potentials of different neurons are correlated on the timescale of the spikes. In other cases, in epileptic seizures in the hippocampus (Silva, Amitai, & Connors, 1991; Flint & Connors, 1996) or in slow cortical rhythms (Steriade,
* Present address: Neurophysique et Physiologie du Système Moteur, EP 1848 CNRS, 45 Rue des Saints-Pères, 75270 Paris, France.
Neural Computation 13, 959–992 (2001)
© 2001 Massachusetts Institute of Technology
McCormick, & Sejnowski, 1993), for example, the synchrony that is observed does not occur on a spike-to-spike basis, but rather on a longer timescale. In these cases, spikes are not synchronized, but the firing rates of the neurons display synchronous modulation, and the firing rate of the neurons can be substantially higher than the frequency of the synchronized rhythmic activity pattern.

The mechanisms involved in the emergence of these rhythms remain a matter of debate. An important issue is the respective contribution, in their emergence, of the cellular, synaptic, and architectonic properties. Recent theoretical work has shown that synaptic excitation alone can hardly explain the occurrence of synchrony of neural activity (Hansel, Mato, & Meunier, 1993; Abbott & van Vreeswijk, 1993; van Vreeswijk, Abbott, & Ermentrout, 1994; Hansel, Mato, & Meunier, 1995; Gerstner, van Hemmen, & Cowan, 1996). This has led to the suggestion that inhibition plays an important role in neural synchrony (Hansel et al., 1993; Abbott & van Vreeswijk, 1993; Hansel et al., 1995; van Vreeswijk, 1996). This scenario and its robustness to heterogeneities in intrinsic properties of neurons, noise, and sparseness of the connectivity pattern have been investigated in detail (White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Chow, 1998; Neltner, Hansel, Mato, & Meunier, 2000a; Golomb & Hansel, 2000). Neltner et al. (2000a) have shown that neural activity in heterogeneous networks of inhibitory neurons can be synchronized at firing rates as large as 150 to 200 Hz, provided that the connectivity of the network is large and the synaptic interactions and the external inputs are sufficiently strong. This scenario, which relies on a balancing between excitation and inhibition, can be generalized to networks consisting of two populations of neurons: one excitatory and the other inhibitory (Neltner, Hansel, Mato, & Meunier, 2000b).
Recent experimental studies in hippocampal slices (Whittington, Traub, & Jefferys, 1995), in which excitation has been blocked pharmacologically, also support the role of inhibition in neural synchrony. However, other experiments in hippocampal slices, with carbachol, show that when inhibition is blocked, synchronized burst activity emerges. Modeling studies (Traub, Miles, & Buzsaki, 1992) show that the afterhyperpolarization (AHP) currents may play an important role in the emergence of this burst activity.

Bursting neuronal activity is also observed in the CNS under physiological conditions. For a long time, it has been known that the neurons in primary visual cortex often fire bursts of action potentials (Hubel, 1959). Evidence has been found (Cattaneo, Maffei, & Morrone, 1981) that the tuning curves to orientation of complex cells in V1 are more selective when computed only from the spikes fired in bursts than when all of the spikes are included. More recent studies (DeBusk, DeBruyn, Snider, Kabara, & Bonds, 1997; Snider, Kabara, Roig, & Bonds, 1998) have also shown that a substantial fraction of the spikes fired by cortical neurons during information processing occurs during bursts, and they have suggested that these spikes play a crucial role in information processing in cortex.

Since slow potassium currents are widespread in neurons, it is important to understand how they contribute to the shaping of the spatiotemporal
patterns of activity. Here we examine how rhythmic activity emerges in a network where the excitatory neurons possess an AHP current responsible for spike adaptation. We study, analytically and numerically, networks of neurons whose excitatory populations exhibit spike adaptation. We show that the positive feedback, due to the excitatory synaptic interactions, and the negative and local feedback, due to the spike adaptation, cooperate in the emergence of a collective state in which neurons fire in bursts. These bursts are synchronized across the network, but the individual spikes are not. They result from a network effect, since in our model, neurons can fire only tonically when isolated from the network. This state is robust against noise, heterogeneity in the intrinsic properties of the neurons, and sparseness in the connectivity pattern of the network.

In the next section we introduce the integrate-and-fire (IF) model we study. The properties of the single excitatory neuron dynamics of this model are described in section 3. In section 4, we study networks that consist of only excitatory neurons. We study the phase diagram in section 4.1. The mechanism and properties of the synchronized bursting state are analyzed in the limit of slow adaptation in sections 4.2 and 4.3, respectively. Section 4.4 shows that the mechanism is robust to the introduction of inhomogeneities, sparseness of the coupling, and noise, and in section 4.5, networks with realistic adaptation time constants are considered. In section 5, we discuss the effect of adding an inhibitory population. It has been shown recently that IF networks can differ substantially in their collective behavior from networks of neurons modeled in a more realistic way, in terms of ionic conductances. Given this, in section 6, we show by numerical simulations how the results we have obtained for IF networks extend to networks with conductance-based dynamics. Finally, we briefly discuss the main results of this work.
2 Integrate-and-Fire Model Network

The network model consists of two populations of IF neurons; one is excitatory (E), and the other is inhibitory (I). The excitatory population shows spike adaptation; the inhibitory one does not. The pattern of connectivity is characterized by a matrix, $J_{ij}^{ab}$, $i, j = 1, \ldots, N$, $a, b = E, I$: $J_{ij}^{ab} = 1$ if neuron $j$ in population $b$ makes a synapse on the postsynaptic neuron $i$ in population $a$, and $J_{ij}^{ab} = 0$ otherwise. For simplicity, we assume that all of the synapses have the same strength and the same synaptic time constants. We neglect propagation delays.

Each excitatory neuron $i$ is characterized at time $t$ by its membrane potential, $V_i^E(t)$, its adaptation current, $A_i^E(t)$, and the total synaptic current into it, $E_i^E(t)$. These quantities follow the dynamical equations:

$$\tau_E \frac{dV_i^E}{dt} = I_i^E - V_i^E + E_i^E - A_i^E \qquad (2.1)$$
and

$$\tau_A \frac{dA_i^E}{dt} = -A_i^E. \qquad (2.2)$$
If at time $t$, $V_i^E$ reaches 1, a spike occurs, $V_i^E$ is instantaneously reset to 0, and the adaptation current, $A_i^E$, is increased by $g_A/\tau_A$; that is, $V_i^E(t^+) = 0$ and $A_i^E(t^+) = A_i^E(t^-) + g_A/\tau_A$. Similarly, the activity of inhibitory neuron $i$ is characterized by its membrane potential, $V_i^I$, and by the total synaptic current into it, $E_i^I(t)$. The membrane potential satisfies

$$\tau_I \frac{dV_i^I}{dt} = I_i^I - V_i^I + E_i^I, \qquad (2.3)$$
supplemented with the reset condition $V_i^I(t^+) = 0$ if the membrane potential of inhibitory neuron $i$ reached threshold at time $t$.

The external inputs into the excitatory and inhibitory neurons are $I_i^E$ and $I_i^I$, respectively. The membrane time constant for the inhibitory cells is $\tau_I$. (For simplicity, we assume that all neurons in the same population have the same membrane time constant.) Note that we are not including the adaptation current in the dynamics of the inhibitory neurons in order to account for the fact that in cortex, most of the inhibitory neurons do not display spike adaptation (Connors, Gutnick, & Prince, 1982; Gutnick, Connors, & Prince, 1982; Connors & Gutnick, 1990; Ahmed, Anderson, Douglas, Martin, & Whitteridge, 1998). For the excitatory neurons, we will assume a typical value for the membrane time constant of $\tau_E = 10$ msec. In the following, time will be measured normalized to $\tau_E$, so $\tau_E$ will be omitted from the equations.

The synaptic currents, $E_i^E$ and $E_i^I$, into excitatory and inhibitory cell $i$ are given respectively by:

$$E_i^a = \sum_{b=E,I} \sum_{j,k} J_{ij}^{ab}\, g_{ab}\, \frac{1}{\tau_{1b} - \tau_{2b}} \left( e^{-(t - t_{j,k}^b)/\tau_{1b}} - e^{-(t - t_{j,k}^b)/\tau_{2b}} \right), \qquad (2.4)$$
where $J_{ij}^{ab}$ is the connectivity matrix of the network, $a = E, I$; $\tau_{1b}$ and $\tau_{2b}$ are the decay and rise times of the synapses projecting from population $b$; $t_{j,k}^b$ denotes the $k$th time neuron $j$ of population $b$ fired an action potential; and $g_{ab}$ is the strength of the synapses projecting from population $b$ to population $a$. Note that $g_{EE}$ and $g_{IE}$ are positive, while $g_{EI}$ and $g_{II}$ are negative. In most of the work, we will use synaptic decay and rise times of $\tau_{1,E} = 3$ msec and $\tau_{2,E} = 1$ msec, and $\tau_{1,I} = 6$ msec and $\tau_{2,I} = 1$ msec, respectively.

Most of the article deals with networks with all-to-all connectivity. In that case, the connectivity matrix is $J_{ij}^{ab} = 1 - \delta_{i,j}\delta_{a,b}$ and $g_{ab} = G_{ab}/N$, where $\delta_{i,j} = 1$ for
Figure 1: (A) Response of a single cell to current injection. At $t = 0$, a constant current, $I_0 = 1.1$, is applied. Upper panel: the neuron's voltage. Spikes were added at the times when the voltage crosses the threshold, $\theta = 1$. Lower panel: the adaptation current. (B) The firing frequency of the single-model neuron with $g_A = 0.675$ versus the constant applied current. The rate is expressed in units of $1/\tau_m$. With $\tau_m = 10$ msec, $R = 1$ corresponds to a rate of 100 Hz. Solid line, $\tau_A = 1$; long-dashed line, $\tau_A = 2$; short-dashed line, $\tau_A = 5$; dotted line, $\tau_A = \infty$.
$i = j$ and $\delta_{i,j} = 0$ otherwise, and similarly for $\delta_{a,b}$. We assume that $G_{ab}$ is independent of $N$. For simplicity, we have assumed that the synapses and the adaptation are described by currents rather than conductance changes. Describing these variables by conductance changes does not qualitatively affect the results of our analysis.

In numerical simulations of the network, equations 2.1–2.4 were integrated using a second-order Runge-Kutta method supplemented by a first-order interpolation of the firing time at threshold, as explained in Hansel, Mato, Meunier, and Neltner (1998) and Shelley (1999). This algorithm is optimal for IF neuronal networks and allows us to get reliable results for substantially large time steps, $\Delta t$. The integration time step is $\Delta t = 0.25$ ms. The stability of the results was checked against varying the number of time steps in the simulations and their size. When burst duration and interburst interval were measured, we discarded the first few bursts to eliminate the effects of the initial values and averaged these quantities over 10 to 20 bursts.

3 Single-Neuron Properties of the Excitatory Population

Before we study the activity of the model network, we first describe the single-cell characteristics of the excitatory neurons in the presence of an external stimulus, $I_0$. Figure 1A shows the response of a single cell to constant input. For $t < 0$, $I_0 = 0$ and $V(t) = A(t) = 0$. At $t = 0$, a constant current, $I_0 > 1$, is turned on, and the neuron starts firing.
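The single-cell response of Figure 1A can be reproduced by integrating equations 2.1 and 2.2 with the reset rule. The following forward-Euler script is our own minimal sketch (the paper's simulations use a second-order Runge-Kutta scheme); the neuron is isolated, so the synaptic current $E_i^E$ is set to 0, and time is in units of the membrane time constant:

```python
def simulate_adapting_if(I0=1.1, gA=0.675, tauA=5.0, dt=1e-3, t_max=100.0):
    """Forward-Euler integration of one adapting IF neuron (sketch)."""
    V, A = 0.0, 0.0
    spike_times = []
    for k in range(int(t_max / dt)):
        V += dt * (I0 - V - A)    # equation 2.1 with E_i^E = 0
        A += dt * (-A / tauA)     # equation 2.2
        if V >= 1.0:              # threshold crossing
            spike_times.append(k * dt)
            V = 0.0               # reset
            A += gA / tauA        # adaptation increment
    return spike_times

spikes = simulate_adapting_if()
isis = [b - a for a, b in zip(spikes, spikes[1:])]
```

With these parameters the interspike intervals lengthen monotonically and settle to a constant period, which is the relaxation toward periodic firing described next.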
As the adaptation current builds up, the firing rate diminishes. Eventually the cell stabilizes in a periodic firing state. It can be shown that the decrease of the instantaneous firing rate toward its limit at large time is well approximated by an exponential relaxation. Thus, the single-model neurons are tonically firing adaptive neurons.

After a transient period, the cell fires periodically with period $T$. Assume that the cell fires at time $t_0$. Immediately after the spike, the adaptation current will be at its maximum value, $A_0$, and then will decrease until the next spike is fired. Thus, $A(t_0^+) \equiv A_0$. This implies that $A(t_0 + T^+) = A_0$. Therefore, just before the spike at $t_0 + T$, $A$ is given by

$$A(t_0 + T^-) = A_0 - g_A/\tau_A. \qquad (3.1)$$
On the other hand, since the cell does not spike between times t = t_0 and t = t_0 + T, A(t_0 + T^−) also satisfies

A(t_0 + T^−) = A_0 e^{−T/τ_A}.   (3.2)
Therefore, A_0 is given by

A_0 = g_A / [τ_A (1 − e^{−T/τ_A})].   (3.3)
Just after the spike at time t = t_0, the membrane potential of the cell is reset to 0, V(t_0^+) = 0. Between spikes, V satisfies

dV/dt = I_0 − V − A(t),   (3.4)
with A(t) = A_0 e^{−(t−t_0)/τ_A}. Consequently, for 0 < t < T, V(t_0 + t) is given by

V(t_0 + t) = I_0 (1 − e^{−t}) − [A_0 τ_A/(τ_A − 1)] (e^{−t/τ_A} − e^{−t}).   (3.5)
Since the cell fires again at time t_0 + T, V(t_0 + T^−) = 1, the period T satisfies

I_0 (1 − e^{−T}) − [g_A/(τ_A − 1)] (e^{−T/τ_A} − e^{−T}) / (1 − e^{−T/τ_A}) = 1.   (3.6)
Figure 1B shows the firing rate at large time, R = 1/T, for a single neuron as a function of the input current, I_0, for different values of the adaptation time constant, τ_A. If the input is below threshold, I_0 < 1, the cell does not fire. For large input currents, such that T ≪ 1, we can use equation 3.6 to approximate the rate as

R = (I_0 − 1/2)/(1 + g_A),   (3.7)
independent of τ_A. Therefore, the rate varies linearly with the external input, as is the case for the IF model without adaptation. The effect of the adaptation is to divide the gain of the frequency-current curve by a factor that does not depend on the adaptation time constant. Just above the threshold, the rate increases nonlinearly with I_0. Using 0 < δI ≡ I_0 − 1 ≪ 1, one finds from equation 3.6 that, assuming τ_A > 1, the rate, R, satisfies asymptotically in the limit T ≫ τ_A:

R = (1/τ_A) [−log(δI) + log(g_A/(τ_A + 1))]^{−1}.   (3.8)
So for any τ_A > 1, the rate has a −1/log(δI) dependence on the external input, as is the case for IF neurons without adaptation. However, the prefactor here is 1/τ_A, whereas without adaptation the prefactor is 1 (when g_A → 0, a crossover occurs between the two behaviors). If τ_A is very large, a third regime exists for 1 ≪ T ≪ τ_A. In this limit, equation 3.6 can be written as

(I_0 − g_A R)(1 − e^{−1/R}) = 1.   (3.9)
Thus, for low rates, R ≪ 1, the rate is approximately given by

R = δI/g_A.   (3.10)
In the limit τ_A → ∞, the logarithmic dependence of R on δI becomes negligible due to the small prefactor 1/τ_A, and equation 3.10 extends up to the bifurcation point. Therefore, in this limit, the transfer function starts linearly, with a slope 1/g_A. The slope stays nearly constant as long as e^{−1/R} is negligible compared to 1. For larger I_0, the slope gradually decreases, and for I_0 ≫ 1 + g_A, the rate depends linearly on I_0, but with a smaller slope 1/(1 + g_A). This is shown in Figure 1B. In our simple model, the transfer function of the neurons can be approximated by one semilinear function over a large range of frequencies only for g_A ≫ 1. While we find dR/dI_0 = 1/g_A near threshold, for an input that gives a rate R = 1, corresponding to 100 Hz, the slope changes substantially, to dR/dI_0 ≈ 1/(g_A + 0.9207).

4 One Excitatory Population with Spike Adaptation

4.1 The Phase Diagram of the Model. We now consider the case in which all of the inhibition has been blocked in the model network, focusing on the behavior of the excitatory population. We first consider the case where all of the neurons are identical and all-to-all coupled, with G_{EE} = g_s, and all receive the same external input, I_0. (The case of heterogeneous and sparsely connected networks is considered in section 4.4.)
Figure 2: Voltage traces of two neurons in a network that is in the bursting state. Simulation was started at t = 0 with random initial conditions. Traces are shown after the transient activity has died out. Parameters used: I_0 = 1.1, g_A = 0.4, τ_A = 150 msec, and g_s = 0.675. Time is in units of τ_m.
As we saw in the previous section, the single cells fire periodically if they receive constant input. In an asynchronous state (AS), the firing times of the neurons are evenly distributed, and the input to the cells is constant (up to fluctuations of order O(1/N)). This implies that in a large, asynchronous network, the neurons cannot fire in bursts. The stability of the asynchronous state can be determined analytically (van Vreeswijk, 2000). Here we mention only the results of this analysis. There are two ways in which the AS can become unstable:

For weak external input: When the input, I_0, is just above the threshold, the AS is stable if the synaptic coupling, g_s, is sufficiently strong. If g_s is decreased, the AS becomes unstable through a normal Hopf bifurcation to a synchronized state in which the cells continue to fire tonically. In this state, neural activity is synchronized on the spike-to-spike level. This instability is similar to that in networks of IF neurons without adaptation (van Vreeswijk, 1996).

For strong external input: When I_0 is sufficiently large and the adaptation, g_A, sufficiently strong, the asynchronous spiking state is stable for small g_s. When g_s is increased, the AS is destabilized through an inverted Hopf bifurcation. Past this bifurcation, the neurons fire in bursts, as shown in Figure 2.

In the following, we focus on this second scenario. As we will see, this synchronized state is characterized by coherent neural activity on the burst-to-burst level, while spikes are either not synchronized or only weakly synchronized. A phase diagram of the network is shown in Figure 3 as a function of τ_A and g_A. Here we have fixed the parameters I_0 and g_s at values for which the
Figure 3: Phase diagram of the IF network with adaptation. Dependence of the final state of the network on τ_A (in units of τ_m) and g_A is shown. Other parameters: I_0 = 1.1, g_s = 0.675. AS: the neurons are spiking, and the network activity is not synchronized. BS: the neurons are bursting, and the network activity is synchronized. AS+BS: in this region the network displays bistability; depending on the initial condition, the network settles into the asynchronous spiking state or a synchronous bursting state.
network bursts for sufficiently large g_A (the second scenario above). There is a region of bistability in which the network state can evolve to a synchronized bursting state or an asynchronous spiking state depending on the initial conditions. For very slow adaptation, the region of bistability is small, but it is considerable for faster adaptation.

4.2 Mechanism of Synchronized Bursting in the Limit τ_A → ∞. The determination of the stability of the asynchronous state is quite involved. For very slow adaptation, however, it is relatively simple to understand the bifurcation to the synchronized bursting state. To see this, let us first recall some properties of connected networks of IF neurons without adaptation. Assuming that the neurons fire asynchronously, it is straightforward to show (see Abbott & van Vreeswijk, 1993) that if the external input, I_0, is sufficiently large, the neurons fire at a constant rate, R(I_0), given by

(I_0 + g_s R)(1 − e^{−1/R}) = 1.   (4.1)
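Equation 4.1 determines R(I_0) implicitly. A minimal numerical sketch (our own, with illustrative parameter values) solves it by bisection for suprathreshold input, and also probes whether a self-sustained asynchronous solution exists below threshold, which is the content of the I_min curve discussed next.

```python
import math

def g(r, i0, gs):
    """Residual of equation 4.1: (I0 + gs*R)(1 - exp(-1/R)) - 1."""
    return (i0 + gs * r) * (1.0 - math.exp(-1.0 / r)) - 1.0

def as_rate(i0, gs):
    """Asynchronous-state rate R for suprathreshold input (assumes i0 > 1
    and gs < 1, so the residual changes sign once on the bracket)."""
    r_lo, r_hi = 1e-3, 50.0      # g > 0 at r_lo, g < 0 at r_hi
    for _ in range(100):
        r_mid = 0.5 * (r_lo + r_hi)
        if g(r_mid, i0, gs) > 0.0:
            r_lo = r_mid
        else:
            r_hi = r_mid
    return 0.5 * (r_lo + r_hi)

def as_exists(i0, gs):
    """True if equation 4.1 has a solution, i.e. I0 > Imin(gs)."""
    return any(g(0.05 * k, i0, gs) > 0.0 for k in range(1, 401))

r = as_rate(1.1, 0.675)
print(r)                      # self-consistent network rate
print(as_exists(0.8, 0.9))    # strong coupling: AS exists below threshold
print(as_exists(0.8, 0.1))    # weak coupling: no AS below threshold
```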
According to this equation, the minimal current, I_min, required for the neurons to be active in the AS depends on the synaptic strength g_s. As shown in Figure 4, I_min ranges from I_min = 0.5, for g_s = 1, to I_min = 1, as g_s approaches 0. Even if the AS of a neuronal network exists, it is possible that the network settles not in this state but in another state, in which the neuronal
Figure 4: The network without adaptation (g_A = 0). Solid line: the critical value, I_c, of the external current, I_0, above which the asynchronous state is stable, plotted versus the synaptic strength, g_s. Dashed line: the minimum value, I_min, of the external current, I_0, below which the asynchronous state does not exist, plotted versus the synaptic strength, g_s.
activity is partially synchronized. For IF networks, following Abbott and van Vreeswijk (1993), one finds that there is a critical current I_c > I_min above which the AS is stable. This critical current depends on g_s and the synaptic time constants, τ_1 and τ_2. Figure 4 shows the dependence of I_c on g_s. For small g_s, the critical input exceeds the firing threshold current, I_c > 1. In that case, the bifurcation at I_c is a normal Hopf bifurcation leading to a partially synchronized state, in which the average rate of the neurons is lower than in the (unstable) AS. This rate goes to 0 as I_0 approaches 1. This is shown graphically in Figure 5A. When g_s is sufficiently large, the critical input is below the threshold current, I_c < 1 (see Figure 5B). The network goes through an inverted Hopf bifurcation as I_0 is reduced past I_c. For I_0 < I_c, the only stable state is the quiescent state (QS), in which the neurons are not firing. The network settles in this state. For I_c < I_0 < 1, there are two stable states, the AS and the QS. Presumably, a third state also exists, in which neuronal activity is partially synchronized, with a firing rate that approaches the rate of the AS as I_0 → I_c and 0 as I_0 → 1. However, one can conjecture that this state is unstable. In the following, we assume that g_s is sufficiently large so that I_c < 1. If I_0 is time dependent and decreases slowly (compared to the characteristic timescales of the network dynamics) starting from above threshold, the network settles in the AS. It remains there until I_0 reaches I_c, where it abruptly switches to the QS. On the other hand, if I_0 increases slowly from below I_c, the network remains in the QS until I_0 reaches 1, at which point the neurons start to fire again.
Figure 5: The firing rate of the neurons in the network without adaptation (g_A = 0) in the asynchronous state, plotted versus the external current. Solid lines: stable solutions. Dashed lines: unstable solutions. The critical current I_c is indicated on the x-axes. (A) g_s = 0.3: the asynchronous state is unstable below the threshold current (I_0 = 1). (B) g_s = 0.7: the asynchronous state is stable in a range of the external current below threshold.
Let us now assume that the neurons display very slow spike adaptation (τ_A → ∞) and that at time t = 0, the adaptation and the membrane potential are randomly distributed. For t > 0, a constant suprathreshold stimulation is applied, that is, I_0 > 1. Since τ_A is large, individual spikes increase the adaptation current A_i only by a tiny amount, g_A/τ_A. Therefore, after a transient period following the stimulation, A_i(t) reflects the cumulative effect of all spikes of neuron i that occurred roughly between times t − τ_A and t. The timing of these spikes on a timescale much smaller than τ_A can be neglected in the calculation of A_i. Thus, we can write

τ_A dA_i/dt = −A_i + g_A R_i(t),   (4.2)
where R_i(t) is the rate of neuron i smoothed over some window of width much less than τ_A. In this case, R_i(t) is the rate of an IF neuron without adaptation that receives an input I_0 + g_s E(t) − A_i. Thus, the neurons that start out with the largest A_i(0) will be the ones with the lowest rates. As a result, their adaptation currents will decrease relative to those of neurons for which A_i(0) is smaller, so that after a time that is large compared to τ_A, the adaptation currents, and therefore the rates, of all cells converge. Thus, we can set A_i = A and R_i = R. If A < I_0 − 1, the network is in the AS, with the cells firing at a rate R(I_0 − A), and A obeys

τ_A dA/dt = −A + g_A R(I_0 − A).   (4.3)
Figure 6: Limit cycle behavior of A in the large-τ_A limit. If the network fires asynchronously, A is driven to g_A R(I_0 − A) (solid line). If this quantity is larger than A (dashed line), A increases. When A exceeds I_0 − I_c, the network becomes quiescent, and A is driven to 0. When A is less than I_0 − 1, the network jumps back to the asynchronous state. A and the rate R periodically traverse a closed trajectory (thick solid line).
For sufficiently large g_A, A increases until it reaches A_c = I_0 − I_c, where the activity of the network drops suddenly to zero, since the only stable state is the QS. In the QS, A decreases, satisfying

τ_A dA/dt = −A.   (4.4)
When A has decreased sufficiently (to A = I_0 − 1), the neurons again start to fire, rapidly evolving to the AS, and A increases again. Thus, the network switches periodically from the AS to the QS. This is illustrated in Figure 6. This bursting mechanism requires that the adaptation current, A, continue to grow as long as the network is in the AS. Thus, g_A has to satisfy

g_A > g_{A,c} ≡ (I_0 − I_c)/R(I_c).   (4.5)
If g_A does not satisfy this constraint, the network reaches an equilibrium with A = A_e = g_A R(I_0 − A_e) and settles in the AS. There is no simple way to determine analytically the behavior of the bursting state for finite τ_A. However, numerical simulations confirm that this scenario remains valid for adaptation currents with parameters in a physiological range (see section 4.5). This analysis shows how synchronized bursts can emerge cooperatively from the combination of strong excitatory feedback with slow and sufficiently strong spike adaptation. In the next section, we study the properties of these bursts.
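The switching cycle of equations 4.3 and 4.4 can be sketched numerically. In the sketch below (ours, for illustration), R(I) is obtained from equation 4.1 with g_s = 0.675, while the critical current I_c, whose computation requires the stability analysis cited in the text, is simply assumed to be I_c = 0.9; time is measured in units of τ_A.

```python
import math

def g41(r, i, gs):
    # residual of equation 4.1
    return (i + gs * r) * (1.0 - math.exp(-1.0 / r)) - 1.0

def as_rate(i, gs):
    """Upper-branch solution of equation 4.1 (downward scan, then bisect)."""
    rr = 5.0
    while rr > 0.05 and g41(rr, i, gs) <= 0.0:
        rr -= 0.05
    lo, hi = rr, rr + 0.05
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g41(mid, i, gs) > 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

i0, gs, g_a, i_c = 1.1, 0.675, 0.6, 0.9   # Ic = 0.9 is an assumed value
ds = 2e-3                                 # Euler step, in units of tau_A
a, state, s, switches = 0.0, "AS", 0.0, []
while s < 4.0:
    if state == "AS":
        a += ds * (-a + g_a * as_rate(i0 - a, gs))   # equation 4.3
        if a >= i0 - i_c:                            # AS lost -> quiescent
            state = "QS"; switches.append(s)
    else:
        a += ds * (-a)                               # equation 4.4
        if a <= i0 - 1.0:                            # firing resumes
            state = "AS"; switches.append(s)
    s += ds

# interburst interval from the last full quiescent episode; it should
# approach tau_A * log((I0 - Ic)/(I0 - 1)) = tau_A * log 2 here
t_i = switches[-1] - switches[-2] if len(switches) % 2 == 0 \
    else switches[-2] - switches[-3]
print(t_i, math.log(2.0))
```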
4.3 Properties of the Bursting Network in the Large τ_A Limit. Equations 4.3 and 4.4 suggest that we can calculate the burst duration and the interburst interval in the large-τ_A limit. Indeed, according to equations 4.3 and 4.4, the burst duration, T_B, and the interburst interval, T_I, are given by

T_B = τ_A ∫_{I_0−1}^{I_0−I_c} dA / [g_A R(I_0 − A) − A]   (4.6)

and

T_I = τ_A ∫_{I_0−1}^{I_0−I_c} dA/A = τ_A log[(I_0 − I_c)/(I_0 − 1)].   (4.7)
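Both expressions are easy to evaluate numerically. The sketch below (ours; the value I_c = 0.9 is assumed for illustration, and R(I) is the self-consistent rate of equation 4.1) computes T_B by the midpoint rule and T_I in closed form, in units of τ_A.

```python
import math

def rate(i, gs=0.675):
    """Upper-branch solution R of (I + gs*R)(1 - exp(-1/R)) = 1."""
    rr = 5.0
    while rr > 0.05 and (i + gs * rr) * (1.0 - math.exp(-1.0 / rr)) <= 1.0:
        rr -= 0.05
    lo, hi = rr, rr + 0.05
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if (i + gs * mid) * (1.0 - math.exp(-1.0 / mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

i0, g_a, i_c = 1.1, 0.6, 0.9          # Ic = 0.9 assumed for illustration
a_lo, a_hi, n = i0 - 1.0, i0 - i_c, 400

t_b = 0.0                             # midpoint rule for equation 4.6
h = (a_hi - a_lo) / n
for k in range(n):
    a = a_lo + (k + 0.5) * h
    t_b += h / (g_a * rate(i0 - a) - a)

t_i = math.log((i0 - i_c) / (i0 - 1.0))   # closed form, equation 4.7
print(t_b, t_i)
```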
Thus, both are of order τ_A. However, there is a subtlety that has to be taken into account. Between bursts, the synaptic feedback is negligible, so that for all cells the membrane potential V_i satisfies

dV_i/dt = I_0 − V_i − A.   (4.8)
This means that between bursts, the differences between the membrane potentials decrease. In fact, the standard deviation of the potentials, σ_V, given by

(σ_V)² = N⁻¹ Σ_i V_i² − (N⁻¹ Σ_i V_i)²,   (4.9)
satisfies, between bursts,

dσ_V/dt = −σ_V,   (4.10)
as can be seen by considering dσ_V²/dt and using equation 4.8. After the burst terminates, σ_V is of order 1. Therefore, at the end of the interburst period, σ_V is of order e^{−T_I}. For large τ_A, this is extremely small. This means that just before the neurons start to fire again, they are very nearly synchronized. However, once the burst has started, the cells are driven away from synchrony (since we have assumed that the synaptic time constants and the synaptic strength are such that excitatory interactions destabilize the synchrony of the network at the spike-to-spike level). For a nearly synchronized network, it can be shown that when the network is again active, σ_V satisfies

σ_V(T + t) = σ_V(T) e^{λt},   (4.11)

where T is the time at which the network becomes active again. Here, λ depends nontrivially on the network parameters but is of order 1. Equation 4.11 is valid as long as σ_V(T + t) is small; for larger σ_V, the standard
deviation of the membrane potentials grows more slowly, approaching asymptotically σ_V^AS, its value in the AS, which is of order 1. Since σ_V(T) is of order e^{−T_I}, it will take at least a time of order T_I/λ (which is of order τ_A) before σ_V is of order 1 and the network activity is appreciably desynchronized. Since the burst duration is also of order τ_A, this implies that for a finite fraction of the burst, the network is appreciably away from the AS, even in the large-τ_A limit. Because, for a given level of external input, a network in a partially synchronized state fires at a lower mean rate than in the asynchronous state (see Figure 5B), one expects that equation 4.6 underestimates the duration of the burst, even in the large-τ_A limit. Unfortunately, there is no theoretical expression for the average rate of a partially synchronized network, and the growth of desynchronization in this state is also not well understood, so it is not possible to take into account the extremely high level of synchrony at the start of the burst or to obtain an exact theoretical expression for the burst duration.

The interburst interval, by contrast, should be correctly predicted by equation 4.7. While the cells are nearly synchronized at the beginning of a burst, they fire completely asynchronously at its end, so the theory of the previous section correctly describes the state of the neurons when the burst terminates. Since the interburst interval depends only on the value of A at the termination of the burst, equation 4.7 gives the correct interburst interval, even though the burst duration is not predicted correctly by equation 4.6. Figure 7 shows the dependence of T_B/τ_A and T_I/τ_A on the external input, I_0, for τ_A = 100. The quantities T_B and T_I were determined from numerical simulations of a network of N = 100 neurons. The values predicted by equations 4.6 and 4.7 are also shown.
The burst duration is underestimated by equation 4.6. On the other hand, the values of T_I predicted by equation 4.7 are in good quantitative agreement with the simulations. It is interesting to note that the alternation that our model exhibits, between desynchronizing episodes when the network is active and increasing synchrony during inactive episodes, also shows up in other model networks, for example, in Li and Dayan's (1999) EI network. In that model, the network is driven to inactivity by slowly activating inhibitory units, which are subsequently slowly turned off. Although the mechanism by which their network "bursts" is very different from ours, their model has, in some parameter regimes, a tendency for the state variables to converge during the quiescent period and to diverge when the network is active. This mimics the alternating convergent and divergent episodes of the voltages in our network.

4.4 Robustness of the Bursting Mechanism. So far we have considered a very idealized network, with all-to-all coupling, identical neurons, and each neuron receiving the same noiseless external input. Here we show
Figure 7: The interburst interval, T_I (solid line), and the burst duration, T_B (dashed line), plotted versus the external input, I_0, in the large-τ_A limit. Parameters are g_s = 0.675 and g_A = 0.6. The results from numerical simulations for τ_A = 100 are also plotted: pluses for T_I, diamonds for T_B. Numerical integration of the dynamics was performed with a time step Δt = 0.025. The network consists of N = 100 excitatory neurons. The synaptic time constants were 0.1 for the rise time and 0.3 for the decay time. For each value of the current, a simulation was performed and stopped after 100 bursts had occurred, and the burst duration and interburst interval were determined by averaging over all but the first five bursts.
that the mechanism for network bursting described in the previous section is robust to heterogeneities, sparseness of the connectivity, and noise.

4.4.1 Heterogeneities. We consider a network in which the neurons receive heterogeneous external inputs: I_i = I_0 + δI_i, i = 1, …, N (with Σ_i δI_i = 0). Other types of heterogeneities (e.g., in the membrane time constant or in the threshold) can be treated similarly. The limit of a weak level of heterogeneity, σ_I² ≡ N⁻¹ Σ_i (δI_i)² ≪ 1, can be studied analytically (see appendix A). We find that in the presence of heterogeneous input, it takes a time of the order of −log[(1 − e^{−T_I/τ_A})² σ_I²]/λ before the asynchronous state is reached. So as long as −log(σ_I²) is large compared to τ_A, it takes a negligible fraction of the burst duration before the cells fire asynchronously. For large τ_A, the inhomogeneity, σ_I, can be extremely small, so that its effect on R(I_0) and I_c can be neglected. Consequently, equations 4.6 and 4.7 will predict T_B and T_I correctly in the limit where τ_A is large and σ_I is small, while −log(σ_I²) ≫ τ_A. However, an extremely large τ_A is required to check numerically the crossover between the two limits. For a finite level of heterogeneity, the approximation of appendix A does not hold. However, one can also generalize the large-τ_A approximation of section 5 to this case. Indeed, in the large-τ_A limit, the adaptation, A_i, of
Figure 8: The firing rate in a heterogeneous network in the large-τ_A limit, after the transient has died out, in a network with 100 neurons. Solid line: mean rate R. Dashed line: rate for the neuron with input I_i = I_0 − d. Dotted line: rate for the neuron with input I_i = I_0 + d. Parameters: I_0 = 1.1, g_s = 0.6, g_A = 0.675, d = 0.3.
neuron i satisfies the self-consistent equations

τ_A dA_i/dt = −A_i + g_A r(I_0 + δI_i + g_s R − A_i),   (4.12)
where r(I) = 1/log[I/(I − 1)] for I > 1, and r(I) = 0 otherwise, while R has to be determined self-consistently from

R = (1/N) Σ_{i=1}^{N} r(I_0 + δI_i + g_s R − A_i).   (4.13)
This set of equations can be studied numerically. We have evaluated this system of equations with N = 100, for g_s = 0.6, g_A = 0.675, and I_0 = 1.1. We have found that the network continues to show synchronous bursting activity even for values of d as large as d = 0.3, that is, with input currents I_i ranging from I_i = 0.8 to I_i = 1.4. Therefore, the bursting state is extremely robust to heterogeneities. This is shown in Figure 8, which displays the time course of the firing rate for the neuron with the largest external input as well as for the neuron with the smallest input. While the onsets of the bursts and the burst durations of these cells are very similar, the firing rates during the bursts differ substantially. Cells with intermediate inputs show firing rates between these two extremes. If d is increased beyond 0.4, the network settles in the AS. In the intermediate region, 0.3 < d < 0.4, solving the equations is difficult and extremely
time-consuming due to numerical problems. We were not able to characterize fully where and how the transition to the AS occurs.

4.4.2 Sparse Connectivity. We consider here the case of a network of neurons that are partially and randomly connected. We assume that the synaptic weights, J_ij, are chosen as independent random variables, namely, J_ij = g_s/M (i ≠ j) with probability M/N and J_ij = 0 otherwise (J_ii = 0 for all i). On average, a neuron receives input from M other cells, so that the mean input into a cell is g_s R. Due to the randomness of the connectivity, the cells do not all receive exactly M inputs. When 1 ≪ M ≪ N, cells receive input from approximately M ± √M other cells. The main effect of this is that in the asynchronous state, different cells receive a constant feedback input that fluctuates spatially. Therefore, the neurons will have different firing rates and levels of adaptation. This can be captured in the large-τ_A limit by describing the system in a coarse-grained time approximation in which the adaptation A_i of neuron i satisfies

τ_A dA_i/dt = −A_i + g_A r(I_0 + g_s R x_i − A_i).   (4.14)
The function r(I) is defined as above, and x_i = M_i/M, where M_i is the number of inputs that cell i receives. These differential equations have to be solved self-consistently together with

R = ∫ dx ρ(x) r(I_0 + g_s R x − A(x)),   (4.15)
where ρ is the distribution of x_i = M_i/M. In the large-M limit, this distribution approaches a narrowly peaked gaussian with mean 1 and standard deviation 1/√M, and thus the firing rates and adaptation closely resemble those of a network with x_i = 1 for all i. Therefore, as long as M is sufficiently large, the network will show bursting behavior whenever the corresponding all-to-all coupled network does, irrespective of how small M/N is. However, for sufficiently small M, M < M_c, bursting will be destroyed by the spatial fluctuations in the connectivity. For instance, solving equations 4.14 and 4.15 numerically with g_A = 0.8 and g_s = 0.675, using the gaussian approximation for ρ(x), we find that M_c < 10.

4.4.3 Noise.
We consider a network in which the input I_i(t) is given by

I_i(t) = I_0 + η_i(t),   (4.16)

where the gaussian white noise satisfies ⟨η_i(t)⟩ = 0 and ⟨η_i(t) η_j(t′)⟩ = s² δ_ij δ(t − t′).
When g_A = 0, in the AS, the neurons receive a total input with mean I_0 + g_s R(I_0) and uncorrelated fluctuations with standard deviation s. For such a network, it can be shown (Tuckwell, 1988) that the rate R satisfies

R = (s/√(2π)) [∫_{x_−}^{x_+} dx e^{x²/2} H(x)]^{−1}.   (4.17)
Here x_± = (I_0 ± g_s R)/s, and H(x) is the complementary error function, H(x) = ∫_x^∞ dx′ e^{−x′²/2}/√(2π). Equation 4.17 can be solved numerically for given s. If s is not very large, the rate R as a function of I_0 has two stable branches. On one branch, which extends from I_0 = −∞ to I_0 = I_1 ≈ 1, the rate R is very low. We denote this solution by R_−(I_0); it corresponds to the quiescent state in the absence of noise. The second branch, on which the rate is substantial, extends from I_0 = I_2 < I_1 to I_0 = ∞. We denote this solution by R_+(I_0). Below threshold, I_0 < 1, it corresponds to the self-sustained asynchronous state of the noiseless case. Stability analysis of the asynchronous state is nontrivial, but it can be shown that the first branch is always stable, while the second is also stable, provided that the noise is not too weak and the synaptic feedback is sufficiently strong. Now consider the effect of adaptation. We assume that τ_A ≫ 1 and that g_A > 0 satisfies

(I_0 − I_2)/R_+(I_2) < g_A < (I_0 − I_1)/R_−(I_1).   (4.18)
As the noisy external input with mean I_0 > 1 is turned on, the network starts to fire with rate R_+(I_0), and the adaptation grows from A = 0, satisfying

τ_A dA/dt = −A + g_A R_+(I_0 − A).   (4.19)
The adaptation grows until it reaches A = I_0 − I_2. At this point, the rate drops to a much lower rate, R_−(I_2), and the adaptation decreases, obeying

τ_A dA/dt = −A + g_A R_−(I_0 − A),   (4.20)
until the adaptation has reached I_0 − I_1, at which point the rate rapidly jumps to R_+(I_1). At this point, A starts to grow again, satisfying equation 4.19, and the process repeats. It should be clear that the mechanisms for bursting in the noisy and the noiseless cases are similar. The only difference between the two cases is that in the noisy case, g_A cannot be too large (see equation 4.18). In practice, however, this upper limit for g_A is unrealistically large unless I_0 is very close to the threshold current.
4.5 Realistic Adaptation Time Constants. So far we have studied the dynamics of the network in the large-τ_A limit. In fact, only for very large τ_A, on the order of 1000, do the ratios T_B/τ_A and T_I/τ_A approach the large-τ_A limit (result not shown). If we assume a membrane time constant, τ_m, of 10 msec, this means that only for adaptation time constants on the order of 10 seconds or larger does the theory predict T_B and T_I correctly. Clearly, realistic time constants for the adaptation are much smaller. Thus, the theory does not give the right values of T_B and T_I when a realistic τ_A is used. Nevertheless, we can ask whether equations 4.6 and 4.7 at least give a qualitatively correct dependence of these two quantities on the different network parameters. In other words, we will check whether the theory allows us to understand how changing the network parameters changes the burst duration and the interburst interval.

4.5.1 Adaptation Properties. In Figure 9A, the time-averaged burst duration and interburst interval are displayed as functions of τ_A for g_A = 0.6, g_s = 0.85, and I_0 = 1.1. Both quantities increase roughly linearly with τ_A. However, as should be expected, the simulation results differ from the analytical estimates of the large-τ_A limit theory (see equations 4.6 and 4.7). Equation 4.7 implies that the interburst interval is independent of g_A, while equation 4.6 shows that the burst duration diverges as g_A approaches some critical value, g_{A,c}. For g_A much larger than g_{A,c}, T_B decreases as 1/g_A. In Figure 9B, we show the dependence of T_B and T_I on g_A, in the large-τ_A limit and from numerical simulations for τ_A = 10. Notice that although these results have been obtained for moderately large values of g_A and g_s, they are in excellent qualitative agreement with the analytical predictions. This is to be expected, since our theoretical approach does not assume weak coupling or weak adaptation.
It should be noted, however, that there seems to be some extra structure, particularly in the interburst interval, that the large-τ_A limit does not capture.

4.5.2 Synaptic Coupling Strength. According to our analysis of the bursting mechanism, there is a minimal value of g_s, g_{s,c}, for which I_c = 1. This means that if g_s < g_{s,c}, the neurons do not fire bursts. For g_s just above g_{s,c}, I_c is slightly larger than 1, so, according to equations 4.6 and 4.7, T_B and T_I should be small. Increasing g_s decreases I_c, and thus, according to equation 4.7, T_I will increase. This also tends to increase T_B. However, increasing g_s also increases R(I), which tends to decrease T_B. If g_s is close to g_{s,c}, the decrease in I_c has the strongest effect, and one expects that T_B increases with g_s. By contrast, when g_s approaches its maximum value, g_s = 1 (in the large-τ_A limit), I_c slowly approaches its minimum value, I_c = 0.5, but the rate, R(I), increases very rapidly with g_s (R(I) → ∞ as g_s → 1). Thus, for sufficiently large g_s, the increase in the firing rate, R, is the dominant factor, and T_B decreases with g_s. As g_s approaches 1, T_B approaches 0.
Figure 9: Results from numerical simulations of a network with N = 200 excitatory neurons. Synaptic rise time: τ_1 = 0.1. Synaptic decay time: τ_2 = 0.3. (A) T_B and T_I (in units of τ_m) versus τ_A. Solid line: theoretical value of T_B from the large-τ_A limit. Dashed line: theoretical value of T_I from the large-τ_A limit. Diamonds: T_B in simulation. Pluses: T_I in simulation. Parameters: I_0 = 1.1, g_s = 0.85, g_A = 0.6. (B) T_B/τ_A and T_I/τ_A versus g_A. Solid line: theoretical value of T_B/τ_A. Dashed line: theoretical value of T_I/τ_A. Diamonds: T_B/τ_A in simulation with τ_A = 10. Pluses: T_I/τ_A in simulation with τ_A = 10. Parameters: I_0 = 1.1, g_s = 0.85. (C) T_B/τ_A and T_I/τ_A versus g_s. Solid line: theoretical value of T_B/τ_A. Dashed line: theoretical value of T_I/τ_A. Diamonds: T_B/τ_A in simulation with τ_A = 10. Pluses: T_I/τ_A in simulation with τ_A = 10. Squares: T_B/τ_A in simulation with τ_A = 100. Crosses: T_I/τ_A in simulation with τ_A = 100. Parameters: I_0 = 1.1, g_A = 0.6.
As shown in Figure 9C, for τ_A = 100 the prediction of the theory is confirmed. However, if τ_A is too small (e.g., τ_A = 10), our numerical simulations show a very different behavior of T_B as a function of g_s. Indeed, for this value of τ_A, the burst duration, T_B, does not decrease as g_s approaches 1, unlike in the large-τ_A limit. The firing rate during the burst also stays finite in this limit, contrary to what one expects from the large-τ_A limit. This difference is due to the fact that in the theoretical treatment of the large-τ_A limit, we assumed that the time needed to settle in the AS was negligible, compared not only to τ_A but also to T_B and T_I. When the excitatory coupling
approaches g_s = 1, this is no longer valid, since in this limit T_B becomes small. Thus, the discrepancy between theory and simulation for too small a τ_A is to be expected.

4.5.3 Robustness. We have also checked, using numerical simulations, that for realistic adaptation time constants the bursting state is robust to heterogeneities and sparseness. Our results were in agreement with the conclusions of the large τ_A theory. For instance, for the parameters of section 4.4 and τ_A = 10, the network settles in a bursting state for δ as large as 0.3.

5 Two Populations

Up to this point, we have analyzed the collective bursting state, which emerges in a network of excitatory neurons with spike adaptation when the synaptic feedback is sufficiently strong. Based on numerical simulations, we have also shown that these states are robust to noise, heterogeneities of intrinsic neuronal properties, and sparseness of the connectivity. In this section, we study what happens to this bursting state when a second population, consisting of inhibitory neurons without spike adaptation, interacts with the excitatory population. Several experimental and numerical studies have shown (Silva et al., 1991; Flint & Connors, 1996) that if, in cortical slices, the feedback from inhibitory neurons is blocked, the activity of the circuit can change from tonically active to synchronized burst activity. An interesting question is whether our model can account for this change in activity. To investigate this, we consider a network that consists of two populations of neurons, one excitatory and one inhibitory.

5.1 The Large τ_A Limit. As in the one-population network, the system becomes much simpler to analyze in the limit τ_A → ∞. In this limit, the system will show activity with spike-to-spike synchrony.
However, if the properties of the neurons are sufficiently inhomogeneous, the connectivity is sparse, and g_II is sufficiently small, the network will evolve to an asynchronous state. In this case one can, as in the one-population model, reduce the system to one described by

τ_A dA/dt = -A + g_A R_E(A),  (5.1)

where the rate, R_E(A), has to be determined self-consistently from

R_E = f(I_E0 + G_EE R_E + G_EI R_I - A)  (5.2)
R_I = f(I_I0 + G_IE R_E + G_II R_I) / τ_I,  (5.3)
where f(I) = [log(I/(I - 1))]^(-1). Since f(I) is a nontrivial function of I, there is no explicit solution to this set of equations. Nevertheless, if I_I0 ≫ 1, the firing rate of the inhibitory population, R_I, can be approximated by R_I = (I_I0 + G_IE R_E - 1/2)/(τ_I - G_II). In this case, R_E satisfies

R_E = f( I_E0 + G_EI (I_I0 - 1/2)/(τ_I - G_II) - A + [G_EE + G_EI G_IE/(τ_I - G_II)] R_E ).  (5.4)
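The self-consistency conditions above can be iterated numerically. The sketch below solves equations 5.2 and 5.3 by damped fixed-point iteration at a fixed adaptation level A, with f(I) = [log(I/(I - 1))]^(-1). All parameter values (I_E0, I_I0, the G_ab, τ_I, A) are illustrative choices for the demonstration and are not taken from the paper; following the sign convention of the text, G_EI and G_II are negative.

```python
import math

def f(I):
    """IF transfer function f(I) = [log(I / (I - 1))]^(-1) for I > 1, else 0."""
    return 1.0 / math.log(I / (I - 1.0)) if I > 1.0 else 0.0

def solve_rates(A, IE0=3.0, II0=2.0, GEE=0.5, GEI=-0.2, GIE=0.8, GII=-0.3,
                tauI=2.0, n_iter=2000, eta=0.1):
    """Damped fixed-point iteration on equations 5.2 and 5.3 at fixed A.
    All parameter values here are illustrative, not taken from the paper."""
    RE, RI = 1.0, 1.0
    for _ in range(n_iter):
        RE_new = f(IE0 + GEE * RE + GEI * RI - A)       # equation 5.2
        RI_new = f(II0 + GIE * RE + GII * RI) / tauI    # equation 5.3
        RE += eta * (RE_new - RE)                       # damped update
        RI += eta * (RI_new - RI)
    return RE, RI

RE, RI = solve_rates(A=0.5)
print(f"RE = {RE:.3f}, RI = {RI:.3f}")
```

With these illustrative values the effective excitatory gain is below 1, so the damped iteration converges to a unique pair of rates (roughly R_E ≈ 3.2, R_I ≈ 1.8); with stronger excitatory feedback the fixed point can disappear or destabilize, which is the bistability the bursting mechanism exploits.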
This shows that the net effect of adding the inhibitory population is to reduce the external input into the excitatory neurons from I_0 = I_E0 to I_0^eff = I_E0 + G_EI (I_I0 - 1/2)/(τ_I - G_II), and to decrease the synaptic strength of the excitatory-to-excitatory connections from g_s = G_EE to g_s^eff = G_EE + G_EI G_IE/(τ_I - G_II). (Recall that G_EI is negative.)

Now consider the situation where, in the single population, g_s and I_0 are sufficiently large that this single population would evolve to the bursting state. The inhibitory network feedback can then reduce the effective external input, I_0^eff, and coupling, g_s^eff, sufficiently to stabilize the tonically firing state, provided that G_EI and G_IE are sufficiently large. Thus, in such a network, the neurons fire tonically. However, chemically blocking the GABAergic synapses (i.e., setting G_EI to zero) will result in synchronized burst activity of the excitatory population.

6 Conductance-Based Networks with Adaptation

In this section, we show that the results we have established for IF network models can be generalized in the framework of conductance-based neuronal networks. These models take into account in a more plausible way the biophysics underlying action potential firing, but they do not yield to analytical study, even in the limit of slow adaptation. Therefore, our study of these models relies on numerical simulations only. The neurons obey the following dynamical equations:

C dV_i(t)/dt = g_L (E_L - V_i(t)) - I_i^active(t) + I_i^app(t) + I_i^syn(t),  (i = 1, ..., N),  (6.1)
where V_i(t) is the membrane potential of the ith cell at time t, C is its capacitance, g_L is the leak conductance, and E_L is the reversal potential of the leak current. The leak conductance, g_L, represents the purely passive contribution to the cell's input conductance and is independent of V_i and t. In addition to the leak current, the cell has active ionic currents with Hodgkin-Huxley type kinetics (Tuckwell, 1988), the total sum of which is denoted as I_i^active(t) in equation 6.1. They lead to repetitive firing of action potentials
Figure 10: (A) Activity of two cells in a population of weakly coupled excitatory conductance-based neurons. The neurons fire tonically in an asynchronous state. Coupling strength: G_EE = 0.2 mS/cm². The network consists of N = 200 all-to-all coupled neurons. The external input has an average firing rate, f_0 = 1300 Hz, and a conductance, G_0 = 0.03 mS/cm². The parameters of the single-neuron dynamics and the synapses are given in appendix B. (B) Activity of two neurons in a strongly coupled population of excitatory conductance-based neurons. Coupling strength: G_EE = 0.6 mS/cm². Same parameters as in A. The neurons fire in synchronized bursts without spike-to-spike synchrony. (Both panels: membrane potential V in mV versus time t in msec.)
if the cell is sufficiently depolarized. An externally injected current is denoted as I^app. The synaptic coupling between the cells gives rise to a synaptic current, I^syn, which is modeled as

I_i^syn(t) = Σ_{j=1}^{N} g_ij(t) (E_j - V_i(t)),  (6.2)

where g_ij(t) is the synaptic conductance triggered by the action potentials of the presynaptic jth cell, and E_j is the reversal potential of the synapses made by neuron j. The synaptic conductance is assumed to consist of a linear sum of contributions from each of the presynaptic action potentials. The detailed dynamical equations are given in appendix B. The neurons in the network receive an external input from a set of neurons that fire spikes at random with a Poisson distribution. The effect of this external input is represented in equation 6.1 by the current I_i^stim(t). This current depends on the average rate of the input neurons, the peak conductance, and the time constants of their synapses, as explained in appendix B.

6.1 One Excitatory Population. We first consider the limit where all of the inhibitory interactions are blocked. If the excitation is weak, the network settles into an asynchronous state. This is shown in Figure 10A. Increasing the synaptic strength leads to synchronized burst states in a very similar
Figure 11: (A) T_B and T_I (in msec) versus τ_A (in msec). Pluses: T_B. Crosses: T_I. G_EE = 0.6 mS/cm²; f_0 = 2600 Hz; G_0 = 0.015 mS/cm²; g_A = 5 mS/cm². Other parameters as in appendix B. (B) T_B and T_I versus g_A. Pluses: T_B. Crosses: T_I. τ_A = 60 msec. Other parameters as in A.
way to what we have found for the IF network. The spikes within the bursts are not synchronized (see Figure 10B). We have also studied the dependence of the interburst duration and the intraburst duration on τ_A and g_A. The results, shown in Figure 11, are qualitatively similar to what we have found for the IF network (compare with Figures 9A and 9B). Finally, we have studied the average number of synaptic contacts per neuron required for the collective bursting state to emerge. This number is on the order of 5 to 10, as for the IF network.

6.2 Two-Population Network. We now consider the two-population conductance-based neuronal network. For simplicity, we assume that the dynamics of the inhibitory population involve the same currents as the excitatory one, except that the adaptation current is suppressed. The maximal conductances of the synapses that neurons in population b make on neurons in population a (a, b = E, I) are denoted by G_ab. The connectivity of the neurons inside each population and between them is sparse; the number of synaptic inputs fluctuates from neuron to neuron, with an average of M = 100 synaptic inputs. In the following, we fix all the parameters of the network and discuss the dependence of the network state on the inhibitory conductances G_II and G_EI. We have checked the behavior of the network for other parameter sets and have found that these results are qualitatively generic for a broad range of parameters, provided that the adaptation strength is sufficient. Figure 12 displays the coherence of the network, as defined in appendix C, as a function of G_EI and G_II. When the inhibition is sufficiently weak, with both G_EI and G_II small, the system settles into a highly synchronous bursting state. By increasing G_EI, keeping G_II fixed and not too large, the network
Figure 12: Coherence χ against G_II and G_EI. The network consists of N_E = 800 excitatory and N_I = 800 inhibitory neurons, sparsely connected. Each neuron receives on average 89 excitatory inputs and 89 inhibitory inputs. G_EE = 0.8 mS/cm², G_IE = 0.8 mS/cm², g_A = 8 mS/cm², τ_A = 60 msec. The rate of the external input on the excitatory (resp. inhibitory) population is 336 Hz (resp. 140 Hz). The conductance of the external input is 0.105 mS/cm² on both populations. All other parameters are as indicated in appendix B.
remains in a bursting state, but the level of coherence of the network decreases; the length of the bursts decreases, as does the average firing rate of the excitatory population. At the same time, the firing rate of the inhibitory population increases. Eventually, if G_EI is above a critical value, the network becomes asynchronous, and the neurons no longer fire bursts of spikes. This critical value increases with G_II. Note also that for fixed G_EI, the degree of synchrony decreases with G_II. This is due to the fact that decreasing the inhibition between the inhibitory neurons decreases the effective gain of the network (see equation 5.3).

7 Conclusion

Our study deals with stimulated and suprathreshold strongly interacting neurons. The case of subthreshold excitatory neurons with adaptation has recently been addressed by Golomb and Amitai (1997). Synchrony of weakly interacting suprathreshold excitatory neurons with spike adaptation has been investigated by Crook, Ermentrout, and Bower (1998). The mechanisms for synchrony they have studied differ essentially from the one we
have investigated here, which involves strong excitatory feedback. A mechanism involving the cooperation of strong excitation with a slow ionic current has been investigated recently by Butera, Rinzel, and Smith (1999) for the generation of the respiratory rhythm. Latham, Richmond, Nelson, and Nirenberg (2000) consider a network of intrinsically active neurons with spike adaptation that exhibits a bursting mechanism similar to ours.

We have examined a simple scenario that can lead to rhythmic and stable collective patterns of neural activity. We have shown that states characterized by synchronization and slow bursting emerge naturally from the cooperation between excitatory feedback and firing adaptation. In these states, neurons fire bursts that are synchronized, but the spikes within the bursts are not. The bursting this system exhibits is a network effect, since the neurons cannot fire periodic bursts on their own. It relies on the fact that an excitatory network can display a region of bistability and slowly drag the neurons back and forth in and out of this bistable region. This mechanism is very similar to those that have been proposed for single-neuron bursting (Rinzel & Ermentrout, 1998). This synchronized bursting is a general phenomenon. As we have shown, it occurs for the IF network as well as for more biophysically plausible conductance-based neuronal models. In both classes of models, the synchronized bursting states have similar properties. The phenomenon is also extremely robust to heterogeneity, sparseness of the connectivity, and noise. This is in contrast to mechanisms of spike synchrony in excitatory or inhibitory networks (White et al., 1998; Neltner et al., 2000a; Golomb & Hansel, 2000). We have dissected analytically the mechanism of this phenomenon in a network of one population of excitatory IF neurons, assuming slow adaptation (τ_A → ∞).
In this limit, a simple equation relates the time evolution of the adaptation current of the neurons to their firing rate. Another assumption of our analysis is that without adaptation, the network settles into either an asynchronous state, when this state is stable, or a quiescent state, when the former is unstable. Under this assumption, the dynamics of the excitatory network can be reduced to a “rate model” in which the dynamics of the neurons are characterized by a rate variable corresponding to the adaptation current. This is in contrast to other types of rate models in which the dynamical quantity corresponds to a population activity (Wilson & Cowan, 1972; Ginzburg & Sompolinsky, 1994) or to synaptic conductances (Ermentrout, 1994; Shriki, Hansel, & Sompolinsky, 1999). Another difference from previously introduced rate models is that in our case, the rate equations have to be supplemented with the stability condition of the asynchronous state of the full spiking model. This approach should be contrasted with the one taken in Latham et al. (2000), where the population activities are described by Wilson and Cowan equations; that is equivalent to our approach under the assumption that the AS is always stable whenever it exists. Also note that in order to derive equations 4.3 and 4.4, we have assumed that the synaptic time constant is fast. If one assumes
that the synaptic time constants are large, of the same order of magnitude as the adaptation time constant, one finds more complicated rate dynamics. They involve two coupled “rate variables”: one corresponds to the adaptation current and the other to the synaptic current.

Relying on numerical simulations, we have investigated the network dynamics for finite values of τ_A. Our results show that quantitative agreement between simulations and our large τ_A theory requires very large and unrealistic values of τ_A. However, for reasonable values of τ_A, we have found good qualitative agreement. One of the discrepancies between the theory and the simulations is due to the fact that even in the large τ_A limit, there is always a small but finite fraction of the burst in which the spikes are synchronized across the network. Adding a small amount of heterogeneity changes the picture. Indeed, in this case, the fraction of the burst during which the spikes remain synchronized goes to 0 when τ_A diverges. Therefore, there is a crossover between the limit of no heterogeneities and the limit τ_A → ∞. However, an extremely large τ_A is required to see this crossover in numerical simulations.

We have also shown how the effect of adding an inhibitory population can be understood in the large τ_A limit. We have seen that recurrent inhibition reduces the effective excitatory feedback and the external current, and that this reduction depends on the strength of the inhibition between the inhibitory neurons. This can be sufficient to settle the network in an asynchronously firing state, even if the network without inhibition would evolve to the synchronized bursting state. This explains how chemically blocking the inhibitory feedback can change the activity pattern of the network from asynchronous tonic firing to synchronized bursts (Silva et al., 1991; Flint & Connors, 1996). The same effects can be observed in networks with realistic adaptation time constants.
The fact that inhibition destroys the bursting state should be contrasted with the synchronizing effect of inhibition in spike-to-spike synchrony (Hansel et al., 1993; van Vreeswijk et al., 1994; Hansel et al., 1995). The pattern of firing of the neurons in the synchronous bursting state is reminiscent of epileptic population bursting. A mechanism proposed to explain the emergence of population bursting in the hippocampus involves excitatory coupling between neurons that are able to fire bursts on their own (Traub et al., 1992; Pinsky & Rinzel, 1994). Our study shows that very similar collective states can be achieved with nonbursting neurons. Since adaptation currents display a wide spectrum of time constants, ranging from tens to hundreds of milliseconds, this leads very naturally to rhythmic activity over a wide range of frequencies.

We have focused on networks with random connectivity, as observed, for example, in the hippocampus. Although it is still a matter of debate, this type of architecture is probably also relevant for modeling the dynamics of motor cortical areas. The somatosensory and the visual cortices are organized differently. The connectivity pattern in these areas is highly correlated
with the functional properties of the neurons. In the primary visual cortex, for instance, the probability of connection between two neurons decreases with the difference of their preferred orientations. In networks with this type of architecture, collective bursting can also occur. However, if the inhibition in the network is sufficiently strong, the collective and synchronous bursting states can be destabilized and replaced by a traveling pulse state, as has recently been shown (Hansel & Sompolinsky, 1998).

It is worth noting that the mechanism described here is not the only one that can induce burst activity in a network of neurons that fire tonically when isolated and injected with a constant current. O'Donovan, Wenner, Chub, Tabak, and Rinzel (1998) showed that excitatory networks of neurons with synaptic adaptation can also bifurcate to a bursting state. These bursts can have a timescale similar to that of the bursts in our network. It would be interesting to see how an interplay of these two mechanisms affects the network activity.

Appendix A: Burst Properties in a Weakly Heterogeneous Network

In this appendix, we study how a weak level of heterogeneity in the external input affects the synchronized bursting state of the excitatory IF network. The equation for the membrane potential of neuron i is

dV_i/dt = -V_i - A_i + I_i + g_s E,  (A.1)

where I_i = I_0 + δI_i, with N^(-1) Σ_i δI_i = 0. If the inhomogeneities δI_i are small, σ(I) ≡ [N^(-1) Σ_i (δI_i)²]^(1/2) ≪ 1, we can expand around the solution with homogeneous input. It can be shown that I_c will shift by a small amount, δI_c, that is of the order of σ(I). We will assume that σ(I) is small enough that this shift in I_c can be ignored. We will now show that under these conditions, the burst duration and interburst interval satisfy equations 4.6 and 4.7 to leading order.
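As a concrete illustration of the single-neuron dynamics underlying equation A.1, the sketch below integrates one IF neuron with adaptation by the Euler method (time in units of the membrane time constant; threshold 1, reset 0). Two simplifications are ours, not the paper's: the recurrent drive g_s E is folded into a constant input I_0, and the adaptation current decays as τ_A dA/dt = -A between spikes while jumping by g_A/τ_A at each spike, which is our reading of how the rate reduction τ_A dA/dt = -A + g_A R arises. All numbers are illustrative.

```python
import numpy as np

def simulate_if_adapt(I0=1.5, gA=1.0, tauA=50.0, T=500.0, dt=0.01):
    """Euler integration of a single IF neuron with spike adaptation:
    dV/dt = -V - A + I0, spike and reset to 0 when V crosses 1.
    Between spikes A decays with time constant tauA; at each spike it
    jumps by gA / tauA, so its mean tracks gA * (firing rate)."""
    V, A = 0.0, 0.0
    spikes = []
    t = 0.0
    while t < T:
        V += dt * (-V - A + I0)
        A += dt * (-A / tauA)
        if V >= 1.0:          # threshold crossing
            V = 0.0           # reset
            A += gA / tauA    # spike-triggered adaptation increment
            spikes.append(t)
        t += dt
    return np.array(spikes)

spikes = simulate_if_adapt()
print(f"{len(spikes)} spikes over 500 membrane time constants")
```

The interspike interval lengthens as A builds up toward its steady value g_A R, which is the slow negative feedback that the burst analysis in this appendix rests on.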
We assume that the network has reached a state in which the average rate, R, varies periodically, with R > 0 for 0 < t < T_B and R = 0 for T_B < t < T_B + T_I. We thus have to show that in this case the level of synchronization is small enough at the beginning of the burst that the time it takes to reach the asynchronous state is negligible compared to T_B in the large τ_A limit. Writing R_i(t) = R(t) + δR_i(t) and A_i(t) = A(t) + δA_i(t), we find that when the network fires asynchronously,

1 = (R + δR_i) ∫_0^1 dV / (I_0 + g_s R - A - V + δI_i - δA_i),  (A.2)

or, to leading order,

δR_i(t) = (δI_i - δA_i(t)) G(t),  (A.3)
with G(t) = R²(t) ∫_0^1 dV / (I_0 + g_s R(t) - A(t) - V)². Here, R(t) and A(t) are the solutions for the rate and adaptation currents derived in section 4. Thus, δA_i satisfies

τ_A dδA_i/dt = -δA_i + g_A (δI_i - δA_i) G(t)   for 0 < t < T_B,
τ_A dδA_i/dt = -δA_i                            for T_B < t < T_B + T_I,  (A.4)

and

δA_i(0) = δA_i(T_B + T_I).  (A.5)

These equations imply that δA_i depends linearly on δI_i, δA_i(t) = a(t) δI_i, where a is a periodic function with period T_B + T_I. The precise shape of a(t) depends on the network parameters. However, because of the inhomogeneous term on the right-hand side of equation A.4, 0 < a(t) < 1 for all t and all parameter choices. Since a(T_B) < 1, a(T_B + T_I) = a(0) can be at most a_max = e^(-T_I/τ_A).

At t = T_B^+, just after the termination of the burst, the standard deviation of the membrane potentials, σ_V, is of order 1. In the interburst interval, σ_V satisfies

d σ_V²/dt = -2 σ_V² + (1 - a(t)) ⟨V δI⟩,  (A.6)

where ⟨V δI⟩ = N^(-1) Σ_i V_i δI_i obeys

d ⟨V δI⟩/dt = -⟨V δI⟩ + (1 - a(t)) σ_I².  (A.7)

Thus, if we neglect the term with da/dt, which is of order 1/τ_A, we find for σ_V²,

( d²/dt² + 3 d/dt + 2 ) σ_V² = (1 - a(t))² σ_I².  (A.8)

Therefore, just before the next burst begins, the standard deviation in the membrane potential will be at least

σ_V²(T_B + T_I) ≥ (1 - e^(-T_I/τ_A))² σ_I².  (A.9)
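The variance dynamics can be checked numerically. The sketch below integrates equations A.6 and A.7 over one interburst interval, taking a(t) = a(T_B) e^(-t/τ_A) for its decay during the silent phase (our reading of the second line of equation A.4); the values of a(T_B), σ_I², T_I, and τ_A are illustrative.

```python
import numpy as np

# Integrate equations A.6-A.7 across an interburst interval of length TI,
# with a(t) decaying exponentially from its value at burst offset.
# All parameter values below are illustrative.
tauA, TI = 50.0, 30.0       # adaptation time constant and interburst length
aTB, sigmaI2 = 0.5, 1e-4    # a(T_B) and the input variance sigma_I^2
dt = 1e-3

sV2 = 1.0                   # sigma_V^2 is of order 1 just after the burst
X = 0.0                     # X = <V dI>, starts uncorrelated
t = 0.0
while t < TI:
    a = aTB * np.exp(-t / tauA)
    sV2 += dt * (-2.0 * sV2 + (1.0 - a) * X)      # equation A.6
    X += dt * (-X + (1.0 - a) * sigmaI2)          # equation A.7
    t += dt

bound = (1.0 - np.exp(-TI / tauA)) ** 2 * sigmaI2  # right side of A.9
print(sV2, bound)
```

In this run the integrated σ_V² decays from order 1 down to order σ_I² and ends above the lower bound of equation A.9, as the derivation requires.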
It will take a time of order -log((1 - e^(-T_I/τ_A))² σ_I²)/λ before the asynchronous state is reached. As long as -log(σ_I²) is small compared to τ_A, this will take a negligible fraction of the burst duration, so the cells fire asynchronously for most of the burst. Consequently, equations 4.6 and 4.7 will predict T_B and T_I correctly in the large τ_A limit.
Thus, independent of the length of the burst, the standard deviation of the membrane potential is at least of order σ_I at the start of the burst. This means that it will take at most a time of order -log(σ_I)/λ after the commencement of the burst until σ_V is of order 1. In the large τ_A limit, it therefore takes an arbitrarily small fraction of the burst duration until the network is again in the asynchronous state.

It should be noted that in the presence of inhomogeneity in the input, the cells do not all fire at the same rate, and therefore the adaptation currents, A_i, will not approach the same value. This inhomogeneity will also affect I_c. However, if σ_I is small, this will have only a small effect on T_B and T_I. If a small inhomogeneity is added to the input, equations 4.6 and 4.7 will thus give a good approximation of T_B and T_I in the large τ_A limit.

Appendix B: The Conductance-Based Model

In this appendix, we give the details of the equations of the conductance-based model used in section 6. The membrane potentials of the neurons follow the equation

C dV/dt = -I_L - I_Na - I_K - I_KA - I_adapt + I_syn(t) + I_ext(t),  (B.1)
where I_syn denotes the synaptic current generated within the network and I_ext stands for synaptic currents from sources outside the network. The leak current is I_L = g_L (V - E_L). The sodium and delayed-rectifier currents are described in a standard way: I_Na = g_Na m_∞³ h (V - E_Na) for the sodium current and I_K = g_K n⁴ (V - E_K) for the delayed-rectifier current. The gating variables, x = h, n, satisfy the relaxation equations dx/dt = (x_∞ - x)/τ_x. The functions x_∞ (x = h, n, m) and τ_x are x_∞ = a_x/(a_x + b_x) and τ_x = w/(a_x + b_x), where

a_m = -0.1 (V + 30) / (exp(-0.1(V + 30)) - 1),  b_m = 4 exp(-(V + 55)/18),
a_h = 0.07 exp(-(V + 44)/20),  b_h = 1 / (exp(-0.1(V + 14)) + 1),
a_n = -0.01 (V + 34) / (exp(-0.1(V + 34)) - 1),  b_n = 0.125 exp(-(V + 44)/80).

We have taken w = 10. The model incorporates a potassium A-current, I_KA = g_KA a_∞³ b (V - E_K), with a_∞ = 1/(exp(-(V + 50)/20) + 1). The function b(t) is given by db/dt = (b_∞ - b)/τ_KA, with b_∞ = 1/(exp((V + 80)/6) + 1). For simplicity, the time constant τ_KA is voltage independent. The adaptation current, denoted by I_adapt, is modeled as I_adapt = g_A A(t) (V - E_K), where the function A(t) satisfies

dA/dt = (A_∞(V) - A)/τ_A,  (B.2)

where A_∞(V) = 1/(exp(-0.7(V + 30)) + 1). For simplicity, the adaptation time constant τ_A is independent of the voltage.

The other parameters of the model are C = 1 μF/cm², g_Na = 100 mS/cm², and g_K = 40 mS/cm². Unless otherwise specified, g_L = 0.05 mS/cm²,
g_KA = 20 mS/cm², and τ_KA = 20 msec. The reversal potentials of the ionic and synaptic currents are E_Na = 55 mV, E_K = -80 mV, E_L = -65 mV, E_E = 0 mV, and E_I = -80 mV.

The synaptic inputs from inside the network are modeled as conductance changes. The synaptic current flowing into a postsynaptic cell at time t, induced by a single presynaptic spike at time t0, is

I_syn(t) = G_syn f(t - t0) (E_syn - V(t)),  (B.3)

where V(t) is the membrane potential of the postsynaptic neuron at time t, E_syn is the reversal potential of the synapse, and G_syn is its strength. The function f(t) is given by

f(t) = [1/(τ_1 - τ_2)] (e^(-t/τ_1) - e^(-t/τ_2)) H(t)  (B.4)
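A quick numerical check of this double-exponential kernel, using the excitatory synaptic time constants τ_1 = 1 msec and τ_2 = 3 msec of this appendix: its time integral is unity, and its peak sits at t_peak = (τ_1 τ_2/(τ_1 - τ_2)) ln(τ_1/τ_2).

```python
import numpy as np

tau1, tau2 = 1.0, 3.0        # excitatory synaptic time constants (msec)
dt = 0.001
t = np.arange(0.0, 100.0, dt)               # t > 0, so H(t) = 1 throughout
f = (np.exp(-t / tau1) - np.exp(-t / tau2)) / (tau1 - tau2)

area = f.sum() * dt                         # time integral, should be ~1
t_peak_num = t[np.argmax(f)]                # numerical time to peak
t_peak_formula = tau1 * tau2 / (tau1 - tau2) * np.log(tau1 / tau2)
print(area, t_peak_num, t_peak_formula)
```

The unit normalization is what lets G_syn alone set the synaptic strength, independent of the choice of τ_1 and τ_2.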
(H(t) is the Heaviside function). The larger of the time constants, τ_1 and τ_2, characterizes the rate of exponential decay of the synaptic potentials, while the time to peak is t_peak = (τ_1 τ_2/(τ_1 - τ_2)) ln(τ_1/τ_2). The normalization adopted here ensures that the time integral of f(t) is always unity. For multiple events, I_syn(t) becomes the sum of the contributions at time t of all spikes generated by the presynaptic cells in the past.

The external synaptic inputs, which are excitatory, are described in a way similar to the internal synaptic inputs. The spike times, t0, are random and drawn from a Poisson process with a fixed uniform rate, f_0. The external synaptic inputs to different cells are uncorrelated. The synaptic time constants are τ_1 = 1 msec and τ_2 = 3 msec for the excitation in the network, and τ_1 = 1 msec and τ_2 = 6 msec for the inhibition. The synapses from outside the network have an instantaneous rise and decay with a time constant τ_2 = 3 msec.

Appendix C: Measure of Synchrony in Large Neuronal Networks

To measure the degree of synchrony in the network, we follow the method proposed and developed in Hansel and Sompolinsky (1992), Golomb and Rinzel (1993), and Ginzburg and Sompolinsky (1994), which is grounded in the analysis of the temporal fluctuations of the activity. One evaluates, at a given time t, the average membrane potential of the neurons,

A_N(t) = (1/N) Σ_{i=1}^{N} V_i(t).
(C.1)
Its temporal fluctuations can be characterized by the variance

Δ_N = ⟨A_N(t)²⟩_t - ⟨A_N(t)⟩_t².  (C.2)
This variance is normalized to the population-averaged variance of the single-cell activity,

Δ = (1/N) Σ_{i=1}^{N} ( ⟨V_i(t)²⟩_t - ⟨V_i(t)⟩_t² ).  (C.3)
The resulting coherence parameter,

Σ_N = Δ_N / Δ,  (C.4)
behaves generally for large N as

Σ_N = χ + a/N + O(1/N²),  (C.5)
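The whole construction, from the population average through the normalized variance ratio, is a few lines of code. The sketch below applies it to two synthetic limiting cases built from made-up gaussian "membrane potentials": identical traces across the population, and fully independent ones.

```python
import numpy as np

def coherence(V):
    """Coherence measure of appendix C for a (time x neurons) array of
    membrane potentials: the variance of the population-averaged
    potential, normalized by the mean single-cell variance."""
    A = V.mean(axis=1)               # population average A_N(t)
    Delta_N = A.var()                # its temporal variance (C.2)
    Delta = V.var(axis=0).mean()     # mean single-cell variance (C.3)
    return Delta_N / Delta           # coherence parameter (C.4)

rng = np.random.default_rng(1)
T, N = 2000, 100
common = rng.normal(size=(T, 1))     # one shared trace -> full synchrony
private = rng.normal(size=(T, N))    # independent traces -> asynchrony

print(coherence(np.tile(common, (1, N))))   # synchronized population
print(coherence(private))                   # asynchronous population
```

The synchronized population gives a coherence of 1, while the asynchronous one gives a value of order 1/N, consistent with the large-N behavior in equation C.5.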
where a is some constant and χ, between 0 and 1, measures the degree of coherence in the system. In particular, χ = 1 if the system is totally synchronized (i.e., V_i(t) = V(t) for all i), and χ = 0 if the state of the system is asynchronous.

Acknowledgments

We thank D. Golomb, G. Mato, and P. Dayan for a careful reading of the manuscript. D. H. acknowledges the hospitality of the Center for Neural Computation and the Racah Institute of Physics of the Hebrew University. This research was partially supported by PICS 236 (PIR Cognisciences and M.A.E.): Dynamique neuronale et codage de l'information: Aspects physiques et biologiques, to D. H.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Ahmed, B., Anderson, J. C., Douglas, R. J., Martin, K. A., & Whitteridge, D. (1998). Cereb. Cortex, 8, 462–476.
Butera, R. J., Jr., Rinzel, J., & Smith, J. C. (1999). Models of respiratory rhythm generation in the pre-Botzinger complex. II. Populations of coupled pacemaker neurons. J. Neurophysiol., 82, 398–415.
Cattaneo, B., Maffei, L., & Morrone, C. (1981). Patterns in the discharge of simple and complex visual cortical cells. Proc. R. Soc. Lond. B Biol. Sci., 212, 279–297.
Chow, C. (1998). Phaselocking in weakly heterogeneous neural networks. Physica D, 118, 343–370.
Connors, B. W., & Gutnick, M. J. (1990). Intrinsic firing patterns of diverse neocortical neurons. Trends Neurosci., 13, 99–104.
Connors, B. W., Gutnick, M. J., & Prince, D. A. (1982). Electrophysiological properties of neocortical neurons in vitro. J. Neurophysiol., 48, 1302–1320.
Crook, S. M., Ermentrout, G. B., & Bower, J. M. (1998). Spike frequency adaptation affects the synchronization properties of networks of cortical oscillators. Neural Computation, 10, 837–854.
DeBusk, B., DeBruyn, E., Snider, R., Kabara, J., & Bonds, A. (1997). Stimulus-dependent modulation of spike burst length in cat striate cortical cells. J. Neurophysiol., 78, 199–213.
Ermentrout, E. (1994). Reduction of conductance-based models with slow synapses to neural nets. Neural Computation, 6, 679–695.
Flint, A. C., & Connors, B. W. (1996). Two types of network oscillations in neocortex mediated by distinct glutamate receptor subtypes and neuronal populations. J. Neurophysiol., 75, 951–957.
Gerstner, W., van Hemmen, J. L., & Cowan, J. (1996). What matters in neuronal locking. Neural Computation, 8, 1653–1676.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neuronal networks. Phys. Rev. E, 50, 3171–3191.
Golomb, D., & Amitai, Y. (1997). Propagating neuronal discharges in neocortical slices: Computational and experimental study. J. Neurophysiol., 78, 1199–1211.
Golomb, D., & Hansel, D. (2000). The number of synaptic inputs and the synchrony of large, sparse neuronal networks. Neural Computation, 12, 1095–1141.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comp. Neurosci., 1, 11–38.
Gutnick, M. J., Connors, B. W., & Prince, D. A. (1982). Mechanisms of neocortical epileptogenesis in vitro. J. Neurophysiol., 48, 1321–1335.
Hansel, D., Mato, G., & Meunier, C. (1993). Phase reduction and neural modeling. Concepts in Neuroscience, 4, 192–210.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–337.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Computation, 10, 467–485.
Hansel, D., & Sompolinsky, H. (1992). Chaos and synchrony in a model of a hypercolumn in visual cortex. Phys. Rev. Lett., 68, 718–721.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (2nd ed.). Cambridge, MA: MIT Press.
Hubel, D. (1959). Single unit activity in striate cortex of unrestrained cats. J. Physiol. (London), 147, 226–238.
Latham, P. E., Richmond, B. J., Nelson, P. G., & Nirenberg, S. (2000). Intrinsic dynamics in neural networks. I. Theory. J. Neurophysiol., 83, 808–827.
Li, Z., & Dayan, P. (1999). Computational differences between asymmetrical and symmetrical networks. Network, 10, 59–78.
Neltner, L., Hansel, D., Mato, G., & Meunier, C. (2000a). Synchrony in heterogeneous networks of spiking neurons. Neural Computation, 12, 1607–1642.
Neltner, L., Hansel, D., Mato, G., & Meunier, C. (2000b). Unpublished results.
O'Donovan, M. J., Wenner, P., Chub, N., Tabak, J., & Rinzel, J. (1998). Mechanisms of spontaneous activity in the developing spinal cord and their relevance to locomotion. Ann. N.Y. Acad. Sci., 860, 130–141.
Pinsky, P. F., & Rinzel, J. (1994). Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. J. Comput. Neurosci., 1, 39–60.
Rinzel, J., & Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (2nd ed.). Cambridge, MA: MIT Press.
Shelley, M. (1999). Efficient and accurate integration schemes for systems of integrate-and-fire neurons. Unpublished manuscript, Courant Institute of Mathematical Sciences and Center for Neural Science, New York University.
Shriki, O., Hansel, D., & Sompolinsky, H. (1999). Rate models for conductance-based cortical neuronal networks. Submitted.
Silva, L. R., Amitai, Y., & Connors, B. W. (1991). Intrinsic oscillations of neocortex generated by layer 5 pyramidal neurons. Science, 251, 432–435.
Snider, R., Kabara, J., Roig, B., & Bonds, A. (1998). Burst firing and modulation of functional connectivity in cat striate cortex. J. Neurophysiol., 80, 730–744.
Steriade, M., McCormick, D. A., & Sejnowski, T. J. (1993). Thalamocortical oscillations in the sleeping and aroused brain. Science, 262, 679–685.
Traub, R. D., Miles, R., & Buzsaki, G. (1992). Computer simulation of carbachol-driven rhythmic population oscillations in the CA3 region of the in vitro rat hippocampus. J. Physiol. (Lond.), 51, 653–672.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. New York: Cambridge University Press.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C. (2000). Stability of the asynchronous state in networks of nonlinear oscillators. Phys. Rev. Lett., 84, 5110–5113.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
White, J. A., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronous oscillations in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.

Received August 9, 1999; accepted February 22, 2000.
NOTE
Communicated by Zoubin Ghahramani
Bayesian Analysis of Mixtures of Factor Analyzers

Akio Utsugi, Toru Kumagai
National Institute of Bioscience and Human-Technology, Tsukuba 305-8566, Japan

For Bayesian inference on the mixture of factor analyzers, natural conjugate priors on the parameters are introduced, and then a Gibbs sampler that generates parameter samples following the posterior is constructed. In addition, a deterministic estimation algorithm is derived by taking modes instead of samples from the conditional posteriors used in the Gibbs sampler. This is regarded as a maximum a posteriori estimation algorithm with hyperparameter search. The behaviors of the Gibbs sampler and the deterministic algorithm are compared in a simulation experiment.

1 Introduction
The mixture of factor analyzers (MFA) (Hinton, Dayan, & Revow, 1997; Ghahramani & Hinton, 1997) is a straightforward fusion of a gaussian mixture model and a factor analysis (FA) model, and it can extract a set of local linear subspaces hidden in data. It also provides an efficient approximation to smooth nonlinear manifolds embedded in the data space. Since such structure is often found in natural images, the model is applicable to natural image processing tasks such as pattern recognition and data compression (Frey, Colmenarez, & Huang, 1998; Tipping & Bishop, 1999). Most estimation algorithms for the MFA model are based on maximum likelihood (ML) estimation. However, this is problematic because the likelihood is not bounded. Bishop (1999) recently proposed a partial Bayesian estimation algorithm for an isotropic version of MFA, the mixture of principal component analyzers (PCA). He introduced a simple gaussian prior on the factor loadings and then derived an expectation-maximization (EM) algorithm for the maximum a posteriori (MAP) estimates of these parameters and the ML estimates of the other parameters. He also derived an estimation algorithm for the hyperparameters in the prior using approximate Bayesian inference. We introduce here a full prior on all parameters of the general MFA model using natural conjugate priors. We then derive a more accurate Bayesian algorithm using a Gibbs sampler, which generates parameter samples following the posterior. Using these samples, we can estimate the posterior distribution, which can provide not only the estimates of the parameters
Neural Computation 13, 993–1002 (2001)
© 2001 Massachusetts Institute of Technology
but their confidence. We also obtain a fast deterministic algorithm emulating the trend of the Gibbs sampler. The behaviors of these algorithms are compared in a simulation experiment.

2 Generative Model of MFA

2.1 Likelihood of MFA. We represent a data set of $n$ points $x_i = (x_{i1}, \ldots, x_{ip})' \in \mathbb{R}^p$, $i = 1, \ldots, n$, as a matrix $X\,(p \times n) = [x_1, \ldots, x_n]$. The MFA assumes that each data point is generated by one of $m$ FA units with inner dimensions $q_k$, $k = 1, \ldots, m$. Thus, the probability density on $X$ is given by
$$ f(X \mid Y, Z, \Lambda, M, \Psi) = \prod_{i=1}^{n} \prod_{k=1}^{m} \big\{ \mathcal{N}(x_i \mid \Lambda_k y_{ki} + \mu_k, \Psi_k) \big\}^{z_{ki}}, \tag{2.1} $$
where $\mathcal{N}(\cdot \mid \cdot)$ denotes a gaussian density function. The factor scores $Y = \{Y_k\,(q_k \times n) = [y_{k1}, \ldots, y_{kn}];\ k = 1, \ldots, m\}$ and the binary memberships $Z\,(m \times n) = [z_{ki}]$ are regarded as missing data and have the priors
$$ f(Y) = \prod_{i=1}^{n} \prod_{k=1}^{m} \mathcal{N}(y_{ki} \mid 0, I_{q_k}), \tag{2.2} $$
$$ f(Z \mid \tau) = \prod_{i=1}^{n} \prod_{k=1}^{m} \tau_k^{z_{ki}}, \tag{2.3} $$
where $I_{q_k}$ denotes an identity matrix of size $q_k$. The prior selection probabilities of the FA units $\tau = (\tau_1, \ldots, \tau_m)$, their centroids $M = \{\mu_k\,(p \times 1);\ k = 1, \ldots, m\}$, the factor loadings $\Lambda = \{\Lambda_k\,(p \times q_k) = [\lambda_{k1}, \ldots, \lambda_{kq_k}];\ k = 1, \ldots, m\}$, and the uniquenesses $\Psi = \{\Psi_k\,(p \times p) = \mathrm{diag}(\psi_{k1}, \ldots, \psi_{kp});\ k = 1, \ldots, m\}$ constitute the parameter set $\Theta$ of the model. By integrating out the missing data $Y$ and $Z$ from the complete-data likelihood,
$$ f(X, Y, Z \mid \Theta) = f(X \mid Y, Z, M, \Lambda, \Psi)\, f(Y)\, f(Z \mid \tau), \tag{2.4} $$
we obtain the genuine likelihood
$$ f(X \mid \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{m} \tau_k\, \mathcal{N}(x_i \mid \mu_k, \Psi_k + \Lambda_k \Lambda_k'). \tag{2.5} $$
Note that this likelihood diverges as $\mu_k \to x_i$, $\psi_{kj} \to 0$ if $q_k < p$, because $\Lambda_k \Lambda_k'$ are singular.
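The mixture likelihood of equation 2.5 can be evaluated directly. The sketch below (an illustration we add here, not code from the paper) computes $\log f(X \mid \Theta)$ for diagonal $\Psi_k$, using a log-sum-exp to keep the mixture sum numerically stable:

```python
import numpy as np

def mfa_log_likelihood(X, tau, mu, Lam, Psi):
    """Log-likelihood of data under an MFA model, equation 2.5:
    f(X | theta) = prod_i sum_k tau_k N(x_i | mu_k, Psi_k + Lam_k Lam_k').
    X: (n, p) data; tau: (m,) mixing weights; mu: list of (p,) centroids;
    Lam: list of (p, q_k) loadings; Psi: list of (p,) uniqueness diagonals."""
    n, p = X.shape
    m = len(tau)
    log_comp = np.empty((n, m))
    for k in range(m):
        # Marginal covariance of unit k after integrating out the factors.
        C = np.diag(Psi[k]) + Lam[k] @ Lam[k].T
        diff = X - mu[k]
        _, logdet = np.linalg.slogdet(C)
        maha = np.sum(diff @ np.linalg.inv(C) * diff, axis=1)
        log_comp[:, k] = (np.log(tau[k])
                          - 0.5 * (p * np.log(2 * np.pi) + logdet + maha))
    # Stable log sum_k exp(log_comp) for each point, summed over i.
    mx = log_comp.max(axis=1, keepdims=True)
    return float(np.sum(mx[:, 0] + np.log(np.exp(log_comp - mx).sum(axis=1))))
```

With zero loadings the covariance reduces to the diagonal $\Psi_k$, so the same routine covers a plain gaussian mixture as a special case.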
The general MFA model explained above is often constrained in order to reduce the number of parameters. For example, Ghahramani and Hinton (1997) used a common uniqueness matrix $\Psi$ for all FA units. Also, Tipping and Bishop (1999) proposed an MFA model with isotropic uniqueness matrices $\Psi_k = \psi_k I_p$, which is called the mixture of probabilistic PCA. Under these constraints, they derived the ML estimation algorithms.

2.2 Natural Conjugate Priors on Parameters. Many studies on mixture models (Diebolt & Robert, 1994) and FA models (Press & Shigemasu, 1989) employ natural conjugate priors on their parameters, because such priors usually lead to simple Bayesian estimation algorithms. The natural conjugate prior on $\tau$ is a Dirichlet distribution,
$$ f(\tau \mid c) = \mathcal{D}(\tau \mid c) \propto \prod_{k=1}^{m} \tau_k^{c-1}. \tag{2.6} $$
The natural conjugate priors on $M$ and $\Lambda$ are gaussian distributions:
$$ f(M \mid \Psi, \bar{\mu}, \alpha_1) = \prod_{k=1}^{m} \mathcal{N}(\mu_k \mid \bar{\mu}, \alpha_1^{-1} \Psi_k), \tag{2.7} $$
$$ f(\Lambda \mid \Psi, \alpha_2) = \prod_{k=1}^{m} \prod_{j=1}^{q_k} \mathcal{N}(\lambda_{kj} \mid 0, \alpha_2^{-1} \Psi_k). \tag{2.8} $$
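As an illustration of the generative side of these priors, one parameter set can be drawn from them directly. The sketch below is our own illustration (the paper uses such prior draws later, in the simulation of section 4.1); it assumes diagonal $\Psi_k$ stored as vectors, a common inner dimension $q$, and $\bar{\mu} = 0$:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_parameters(m, p, q, c, a1, a2, beta, delta):
    """Draw one parameter set from the conjugate priors: tau from the
    Dirichlet of equation 2.6, mu_k and lambda_kj from the gaussians of
    equations 2.7 and 2.8 (covariance Psi_k scaled by 1/a1 and 1/a2),
    and psi_kj^{-1} from the gamma prior of equation 2.9
    (shape delta, rate beta)."""
    tau = rng.dirichlet(np.full(m, c))
    Psi = [1.0 / rng.gamma(delta, 1.0 / beta, size=p) for _ in range(m)]
    mu = [rng.normal(0.0, np.sqrt(Psi[k] / a1)) for k in range(m)]
    Lam = [np.column_stack([rng.normal(0.0, np.sqrt(Psi[k] / a2))
                            for _ in range(q)]) for k in range(m)]
    return tau, mu, Lam, Psi
```

Because the priors on $\mu_k$ and $\lambda_{kj}$ share the diagonal covariance $\Psi_k$, each coordinate can be drawn independently with standard deviation $\sqrt{\psi_{kj}/\alpha}$.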
For simplicity, the hyperparameter $\bar{\mu}$ is estimated by the data mean $\sum_{i=1}^{n} x_i / n$. We can set $\bar{\mu} = 0$ by centralizing the data. The natural conjugate prior on $\Psi$ is an inverse gamma distribution,
$$ f(\Psi^{-1} \mid \delta, \beta) = \prod_{k=1}^{m} \prod_{j=1}^{p} \mathcal{G}(\psi_{kj}^{-1} \mid \delta, \beta), \tag{2.9} $$
where $\Psi^{-1} = \{\Psi_k^{-1};\ k = 1, \ldots, m\}$ and $\mathcal{G}(\cdot \mid \delta, \beta)$ is a gamma density with mean $\delta / \beta$ and variance $\delta / \beta^2$. The set $H = \{\alpha_1, \alpha_2, \beta, \delta, c\}$ is called the hyperparameters. However, we also refer to all of $Y$, $Z$, $\Theta$, and $H$ as parameters.

3 Gibbs Sampler
The Bayesian estimate of a parameter is given by its posterior expectation. However, this is often difficult to calculate directly. In such a case, random samples generated from the posterior are used to approximate it. Markov chain Monte Carlo (MCMC) is an effective method to generate such random
samples when the direct generation of samples from the posterior is difficult (Tanner, 1996). For the MFA model, an MCMC algorithm can be constructed using a Gibbs sampler. First, the set of all parameters is divided into blocks, and the conditional posterior on each block given the other parameters is obtained. The Gibbs sampler is started by setting appropriate initial values of the parameters in the conditions of the conditional posteriors. Then a block is chosen, and a random sample of the block is generated from its conditional posterior. This sample is plugged into the conditions of the conditional posteriors on the other blocks, and the sampling moves to the next block. By iterating such sequential sampling, we obtain random series for the parameters. It is known that the distribution of the samples converges to the posterior under some regularity conditions. Thus, we can approximate the posterior expectation by the average over the random series. In the MFA model, the parameter blocks are $Y$, $Z$, $\tau$, $M$, $\Lambda$, $\Psi$, and $H$. In the following subsections, we obtain the conditional posteriors on them.

3.1 Conditional Posteriors on $Y$ and $Z$. The conditional posteriors on $Y$ and $Z$ given the other parameters were obtained by Ghahramani and Hinton (1997). The conditional posterior on $Y$ is obtained from equations 2.1 and 2.2:
$$ f(Y \mid X, Z, M, \Lambda, \Psi) = \prod_{i=1}^{n} \prod_{k=1}^{m} \big\{ \mathcal{N}(y_{ki} \mid \Delta_k (x_i - \mu_k), \Sigma_k) \big\}^{z_{ki}} \big\{ \mathcal{N}(y_{ki} \mid 0, I_{q_k}) \big\}^{1 - z_{ki}}, \tag{3.1} $$
where
$$ \Delta_k = \Lambda_k' (\Psi_k + \Lambda_k \Lambda_k')^{-1}, \tag{3.2} $$
$$ \Sigma_k = I_{q_k} - \Delta_k \Lambda_k. \tag{3.3} $$
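The factor-score step of equations 3.1 through 3.3 might be sketched as follows (our own illustration; it assumes diagonal $\Psi_k$ stored as vectors and $Z$ stored as a 0/1 matrix):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_factor_scores(X, Z, mu, Lam, Psi):
    """Draw factor scores from the conditional posterior of equation 3.1.
    For z_ki = 1:  y_ki ~ N(Delta_k (x_i - mu_k), Sigma_k), with
    Delta_k = Lam_k' (Psi_k + Lam_k Lam_k')^{-1} and
    Sigma_k = I - Delta_k Lam_k (equations 3.2 and 3.3);
    otherwise y_ki is drawn from the N(0, I) prior."""
    n = X.shape[0]
    m = len(Lam)
    Y = []
    for k in range(m):
        q_k = Lam[k].shape[1]
        Delta = Lam[k].T @ np.linalg.inv(np.diag(Psi[k]) + Lam[k] @ Lam[k].T)
        Sigma = np.eye(q_k) - Delta @ Lam[k]
        L = np.linalg.cholesky(Sigma)      # for correlated gaussian draws
        Yk = np.empty((q_k, n))
        for i in range(n):
            if Z[k, i] == 1:
                Yk[:, i] = Delta @ (X[i] - mu[k]) + L @ rng.normal(size=q_k)
            else:
                Yk[:, i] = rng.normal(size=q_k)   # prior draw
        Y.append(Yk)
    return Y
```

Note that $\Delta_k$ and $\Sigma_k$ depend only on the unit, not on the data point, so they are computed once per unit and reused across all $i$.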
The conditional posterior on $Z$ is
$$ f(Z \mid X, \Theta) = \prod_{i=1}^{n} \prod_{k=1}^{m} p_{ki}^{z_{ki}}, \tag{3.4} $$
where the $p_{ki}$ are the posterior selection probabilities of the FA units:
$$ p_{ki} = p(z_{ki} = 1 \mid x_i, \Theta) = \frac{\tau_k\, \mathcal{N}(x_i \mid \mu_k, \Psi_k + \Lambda_k \Lambda_k')}{\sum_{l=1}^{m} \tau_l\, \mathcal{N}(x_i \mid \mu_l, \Psi_l + \Lambda_l \Lambda_l')}. \tag{3.5} $$
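The membership step of equations 3.4 and 3.5 amounts to computing the responsibilities $p_{ki}$ in the log domain and drawing one unit per data point. A possible sketch (ours, assuming diagonal $\Psi_k$ stored as vectors):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_memberships(X, tau, mu, Lam, Psi):
    """Sample the binary memberships Z of equation 3.4 from the posterior
    selection probabilities p_ki of equation 3.5."""
    n, p = X.shape
    m = len(tau)
    logp = np.empty((n, m))
    for k in range(m):
        C = np.diag(Psi[k]) + Lam[k] @ Lam[k].T
        diff = X - mu[k]
        _, logdet = np.linalg.slogdet(C)
        maha = np.sum(diff @ np.linalg.inv(C) * diff, axis=1)
        logp[:, k] = np.log(tau[k]) - 0.5 * (p * np.log(2 * np.pi) + logdet + maha)
    logp -= logp.max(axis=1, keepdims=True)   # stabilize before exponentiating
    P = np.exp(logp)
    P /= P.sum(axis=1, keepdims=True)         # p_ki, rows sum to one
    Z = np.zeros((m, n), dtype=int)
    for i in range(n):
        Z[rng.choice(m, p=P[i]), i] = 1       # exactly one unit per data point
    return Z, P
```

Working in the log domain avoids underflow when the units are well separated and one responsibility dominates.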
3.2 Conditional Posteriors on $\Theta$. The joint density on $X$, $Y$, $Z$, and $\Theta$ is given by the product of equations 2.4 and 2.6 through 2.9. The logarithm of this density is expressed as
$$ \log f(X, Y, Z, \Theta \mid H) = \sum_{k=1}^{m} \Big\{ (n_k + c - 1) \log \tau_k + \tfrac{1}{2} (n_k + q_k + 2\delta - 1) \log |\Psi_k^{-1}| - \tfrac{1}{2} \mathrm{tr}\big[ (C_{XXk} - 2 C_{XYk} \tilde{\Lambda}_k' + 2\beta I)\, \Psi_k^{-1} \big] - \tfrac{1}{2} \mathrm{tr}\big[ (C_{YYk} + A_k)\, \tilde{\Lambda}_k' \Psi_k^{-1} \tilde{\Lambda}_k \big] \Big\} + \frac{mp}{2} \log \alpha_1 + \frac{pq}{2} \log \alpha_2 + mp\delta \log \beta + \mathrm{const}, \tag{3.6} $$
where $\tilde{\Lambda}_k = [\mu_k, \Lambda_k]$, $A_k = \mathrm{diag}(\alpha_1, \alpha_2 \mathbf{1}_{q_k}')$, $\mathbf{1}_{q_k}$ is a $q_k$-dimensional column vector of ones, $q = \sum_{k=1}^{m} q_k$,
$$ n_k = \sum_{i=1}^{n} z_{ki}, \tag{3.7} $$
$$ C_{XXk} = \sum_{i=1}^{n} z_{ki}\, x_i x_i', \tag{3.8} $$
$$ C_{XYk} = \sum_{i=1}^{n} z_{ki}\, x_i \tilde{y}_{ki}', \tag{3.9} $$
$$ C_{YYk} = \sum_{i=1}^{n} z_{ki}\, \tilde{y}_{ki} \tilde{y}_{ki}', \tag{3.10} $$
and $\tilde{y}_{ki} = [1, y_{ki}']'$. From equation 3.6, we can obtain the conditional posteriors on $\tau$, $\tilde{\Lambda}_k$, and $\Psi_k$:
$$ f(\tau \mid X, Z, c) = \mathcal{D}(\tau \mid n_1 + c, \ldots, n_m + c), \tag{3.11} $$
$$ f(\tilde{\Lambda}_k \mid X, Y, Z, \Psi_k, \alpha_1, \alpha_2) = \mathcal{N}\big( \tilde{\Lambda}_k \mid C_{XYk} (C_{YYk} + A_k)^{-1},\ (C_{YYk} + A_k)^{-1} \otimes \Psi_k \big), \tag{3.12} $$
$$ f(\Psi_k^{-1} \mid X, Y, Z, \tilde{\Lambda}_k, \alpha_1, \alpha_2, \beta, \delta) = \mathcal{G}\Big( \Psi_k^{-1} \,\Big|\, \tfrac{1}{2}(n_k + q_k + 2\delta + 1),\ \tfrac{1}{2} \mathrm{diag}\big[ C_{XXk} - 2 C_{XYk} \tilde{\Lambda}_k' + \tilde{\Lambda}_k (C_{YYk} + A_k) \tilde{\Lambda}_k' + 2\beta I_p \big] \Big), \tag{3.13} $$
where $\mathcal{N}$ and $\mathcal{G}$ denote gaussian and gamma density functions on matrix variables, respectively (see the appendix).

3.3 Conditional Posterior on $H$. To obtain the conditional posterior on $H$, we need a hyperprior on $H$. In this article, we use hyperpriors only on $\alpha_1$, $\alpha_2$, and $\beta$,
$$ f(h) = \mathcal{G}(h \mid \delta_h, s_h), \qquad h = \alpha_1, \alpha_2, \beta, \tag{3.14} $$
and fix the other hyperparameters. From these hyperpriors and equation 3.6, the conditional posteriors on $\alpha_1$, $\alpha_2$, and $\beta$ are obtained as
$$ f(\alpha_1 \mid X, \Theta, \delta_{\alpha_1}, s_{\alpha_1}) = \mathcal{G}\Big( \alpha_1 \,\Big|\, \frac{mp}{2} + \delta_{\alpha_1},\ \frac{1}{2} \sum_{k=1}^{m} \mu_k' \Psi_k^{-1} \mu_k + s_{\alpha_1} \Big), \tag{3.15} $$
$$ f(\alpha_2 \mid X, \Theta, \delta_{\alpha_2}, s_{\alpha_2}) = \mathcal{G}\Big( \alpha_2 \,\Big|\, \frac{pq}{2} + \delta_{\alpha_2},\ \frac{1}{2} \sum_{k=1}^{m} \mathrm{tr}(\Lambda_k' \Psi_k^{-1} \Lambda_k) + s_{\alpha_2} \Big), \tag{3.16} $$
$$ f(\beta \mid X, \Theta, \delta, \delta_\beta, s_\beta) = \mathcal{G}\Big( \beta \,\Big|\, mp\delta + \delta_\beta,\ \sum_{k=1}^{m} \mathrm{tr}(\Psi_k^{-1}) + s_\beta \Big). \tag{3.17} $$
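The hyperparameter step of equations 3.15 through 3.17 reduces to three gamma draws. A sketch (ours), using the shape/rate convention of the text; note that NumPy's gamma sampler takes a scale, the reciprocal of the rate, and we again assume diagonal $\Psi_k^{-1}$ stored as vectors:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_hyperparameters(mu, Lam, Psi_inv, delta,
                           d_a1, d_a2, d_b, s_a1, s_a2, s_b):
    """Draw alpha1, alpha2, and beta from the gamma conditionals of
    equations 3.15 through 3.17."""
    m = len(mu)
    p = mu[0].shape[0]
    q = sum(L.shape[1] for L in Lam)
    shape_a1 = m * p / 2 + d_a1
    rate_a1 = 0.5 * sum(mu[k] @ (Psi_inv[k] * mu[k]) for k in range(m)) + s_a1
    shape_a2 = p * q / 2 + d_a2
    rate_a2 = 0.5 * sum(np.trace(Lam[k].T @ (Psi_inv[k][:, None] * Lam[k]))
                        for k in range(m)) + s_a2
    shape_b = m * p * delta + d_b
    rate_b = sum(Psi_inv[k].sum() for k in range(m)) + s_b
    a1 = rng.gamma(shape_a1, 1.0 / rate_a1)   # scale = 1 / rate
    a2 = rng.gamma(shape_a2, 1.0 / rate_a2)
    b = rng.gamma(shape_b, 1.0 / rate_b)
    return a1, a2, b
```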
We now have all conditional posteriors for the Gibbs sampler. In the simulations, we use $c = 2$, $\delta = 2$, $\delta_\beta = 0.2$, and $\delta_{\alpha_1} = \delta_{\alpha_2} = 2$ for statistical inference. The scale hyperparameters $s_h$ are set to $\delta_h / \bar{h}$, where $\bar{\beta}/\delta = 0.01 s_x^2$, $\bar{\beta}/(\bar{\alpha}_1 \delta) = s_x^2$, $\bar{\beta}/(\bar{\alpha}_2 \delta) = 0.1 s_x^2$, and $s_x^2 = \sum_{i=1}^{n} \sum_{j=1}^{p} x_{ij}^2 / (np)$. This setting is similar to that of Richardson and Green (1997).

4 Deterministic Estimation Algorithm
To obtain the estimate using the Gibbs sampler, we need to average the samples after the random process has reached a stationary state. However, assessing this arrival is a difficult task. Furthermore, if the posterior has a complicated configuration, that is, many peaks with similar heights, the sample point wanders among the peaks for a long time. In such a case, the average of the samples over a short time range is doubtful as the estimate. Thus, a deterministic algorithm arriving at a local optimal point is desirable for fast estimation. Such an algorithm is obtained by taking modes instead of samples from the conditional posteriors.
The modes of the conditional posteriors, equations 3.11 through 3.13, are given by
$$ \hat{\tau}_k = \frac{n_k + c - 1}{n + m(c - 1)}, \tag{4.1} $$
$$ \hat{\tilde{\Lambda}}_k = C_{XYk} (C_{YYk} + A_k)^{-1}, \tag{4.2} $$
$$ \hat{\Psi}_k = \frac{1}{n_k + q_k + 2\delta - 1}\, \mathrm{diag}\big( C_{XXk} - C_{XYk} \hat{\tilde{\Lambda}}_k' + 2\beta I_p \big). \tag{4.3} $$
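The mode-following updates of equations 4.1 through 4.3 for a single FA unit might be written as follows (a sketch; the sufficient statistics of equations 3.7 through 3.10 are assumed precomputed, and $A$ is the matrix $A_k$ of section 3.2):

```python
import numpy as np

def map_updates(n_k, C_XX, C_XY, C_YY, A, n, m, c, delta, beta, q_k):
    """Mode updates of equations 4.1 through 4.3 for one FA unit.
    C_XX (p x p), C_XY (p x (q_k+1)), and C_YY ((q_k+1) x (q_k+1)) are the
    sufficient statistics; A = diag(alpha1, alpha2 * ones(q_k))."""
    p = C_XX.shape[0]
    tau_hat = (n_k + c - 1) / (n + m * (c - 1))                  # eq 4.1
    Lam_hat = C_XY @ np.linalg.inv(C_YY + A)                     # eq 4.2
    Psi_hat = np.diag(C_XX - C_XY @ Lam_hat.T
                      + 2 * beta * np.eye(p)) / (n_k + q_k + 2 * delta - 1)  # eq 4.3
    return tau_hat, Lam_hat, Psi_hat
```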
By replacing $n_k$, $C_{XXk}$, $C_{XYk}$, and $C_{YYk}$ with their conditional posterior expectations (Ghahramani & Hinton, 1997), the above expressions become the update rules of $\Theta$ in the EM algorithm for the MAP estimate of $\Theta$. The modes of the conditional posteriors, equations 3.15 through 3.17, are given by
$$ \hat{\alpha}_1^{-1} = \frac{1}{mp + 2\delta_{\alpha_1} - 2} \Big[ \sum_{k=1}^{m} \mu_k' \Psi_k^{-1} \mu_k + 2 s_{\alpha_1} \Big], \tag{4.4} $$
$$ \hat{\alpha}_2^{-1} = \frac{1}{pq + 2\delta_{\alpha_2} - 2} \Big[ \sum_{k=1}^{m} \mathrm{tr}(\Lambda_k' \Psi_k^{-1} \Lambda_k) + 2 s_{\alpha_2} \Big], \tag{4.5} $$
$$ \hat{\beta}^{-1} = \frac{1}{mp\delta + \delta_\beta - 1} \Big[ \sum_{k=1}^{m} \mathrm{tr}(\Psi_k^{-1}) + s_\beta \Big]. \tag{4.6} $$
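A sketch of the corresponding hyperparameter mode updates, equations 4.4 through 4.6 (ours, again assuming diagonal $\Psi_k^{-1}$ stored as vectors):

```python
import numpy as np

def hyper_mode_updates(mu, Lam, Psi_inv, delta,
                       d_a1, d_a2, d_b, s_a1, s_a2, s_b):
    """Mode updates of equations 4.4 through 4.6; the modes of the gamma
    conditionals are computed via the inverses of alpha1, alpha2, beta."""
    m = len(mu)
    p = mu[0].shape[0]
    q = sum(L.shape[1] for L in Lam)
    inv_a1 = (sum(mu[k] @ (Psi_inv[k] * mu[k]) for k in range(m))
              + 2 * s_a1) / (m * p + 2 * d_a1 - 2)               # eq 4.4
    inv_a2 = (sum(np.trace(Lam[k].T @ (Psi_inv[k][:, None] * Lam[k]))
                  for k in range(m)) + 2 * s_a2) / (p * q + 2 * d_a2 - 2)  # eq 4.5
    inv_b = (sum(Psi_inv[k].sum() for k in range(m))
             + s_b) / (m * p * delta + d_b - 1)                  # eq 4.6
    return 1 / inv_a1, 1 / inv_a2, 1 / inv_b
```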
These are used as the update rules of the hyperparameters.

4.1 Comparison of Algorithms on Simulations. An artificial data set of size $n = 400$ is generated by an MFA model with $m = 5$ and $q_k = 2$ ($k = 1, \ldots, m$). The value of $\Theta$ is generated from the priors, equations 2.6 through 2.9, with $\alpha_1 = 0.01$, $\alpha_2 = 0.1$, $\beta = 1$, $\delta = 5$, and $c = 5$. An MFA model with the same structure as the data generator is applied to the data. The initial values of $\lambda_{kj}$ and $\mu_k$ are generated by independent gaussian random generators with the same scale as the data distribution. The initial values of $\tau_k$ and $\psi_k$ are set to $1/m$ and the mean of the data variances, respectively. The Gibbs sampler is always continued for 300 iterations. The deterministic algorithm is continued until the iteration count reaches 300 or the relative variations of the log hyperparameters and the log posterior on $\Theta$ fall below 0.0001. The posterior on $\Theta$ is calculated as the product of equations 2.5 through 2.9. The Gibbs sampler and the deterministic algorithm take 23.2 and 9.7 seconds per session, respectively, on a 333 MHz processor.
Figure 1: The variations of $\alpha_2$ (left) and the log posterior on $\Theta$ (right) in 20 sessions of the Gibbs sampler (top) and the deterministic algorithm (bottom) on a data set ($n = 400$, $m = 5$, $q_k = 2$). The curves of the Gibbs sampler are smoothed using a Hamming window of size 21.
Figure 1 shows the variations of $\alpha_2$ and the log posterior on $\Theta$ in 20 sessions of the deterministic algorithm from different initial conditions, together with the smoothed curves for the Gibbs sampler. We can see that these values reach their stationary states rapidly. The values of the Gibbs sampler then fluctuate with their posterior variances. Both algorithms have similar drifts in most sessions, though they have several stationary states depending on the initial values. The other hyperparameters, $\alpha_1$ and $\beta$, also acquire similar estimates from both algorithms.

5 Conclusion
We derived Bayesian estimation algorithms for the MFA model using a Gibbs sampler and its deterministic approximation. A simulation experiment showed that both algorithms yield similar estimates. We also compared our Bayesian method with the ML estimation method on some pattern recognition tasks with real data. Although
the ML estimation algorithm sometimes fails to obtain a finite estimate, we have not obtained a result showing a clear superiority of the Bayesian method in prediction performance. We think the advantage of the Bayesian method lies in providing more information on the parameters and the model fitness than ML estimation does. Moreover, we think the Bayesian formalization is necessary for the further development of the MFA model.

Appendix
A gaussian density on a matrix variable $X\,(p \times q) = [x_1, \ldots, x_q]$ is defined as
$$ \mathcal{N}(X \mid M, \Sigma) = \mathcal{N}(\mathrm{vec}(X) \mid \mathrm{vec}(M), \Sigma), \tag{A.1} $$
where $\mathrm{vec}(X) = (x_1', \ldots, x_q')'$. A gamma density on a positive diagonal matrix $X = \mathrm{diag}(x_1, \ldots, x_p)$ is defined as
$$ \mathcal{G}(X \mid \delta, B) = \prod_{j=1}^{p} \mathcal{G}(x_j \mid \delta, b_j) = \frac{|B|^{\delta}\, |X|^{\delta - 1}}{[\Gamma(\delta)]^{p}} \exp\{-\mathrm{tr}(BX)\}, \tag{A.2} $$
where $B = \mathrm{diag}(b_1, \ldots, b_p)$.

References

Bishop, C. M. (1999). Bayesian PCA. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 382–388). Cambridge, MA: MIT Press.
Diebolt, J., & Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society, Series B, 56, 363–375.
Frey, B. J., Colmenarez, A., & Huang, T. S. (1998). Mixtures of local linear subspaces for face recognition. In Computer Vision and Pattern Recognition 98. Santa Barbara, CA.
Ghahramani, Z., & Hinton, G. E. (1997). The EM algorithm for mixtures of factor analyzers (Tech. Rep. No. CRG-TR-96-1). Toronto: University of Toronto, Department of Computer Science.
Hinton, G. E., Dayan, P., & Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8, 65–74.
Press, S. J., & Shigemasu, K. (1989). Bayesian inference in factor analysis. In L. Gleser, M. Perleman, S. J. Press, & A. Sampson (Eds.), Contributions to probability and statistics (pp. 271–287). New York: Springer-Verlag.
Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B, 59, 731–792.
Tanner, M. A. (1996). Tools for statistical inference (3rd ed.). New York: Springer-Verlag.
Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11, 443–482.
Received August 27, 1999; accepted July 25, 2000.
LETTER
Communicated by Nancy Kopell
Synchronization in Relaxation Oscillator Networks with Conduction Delays

Jeffrey J. Fox
Department of Physics, Cornell University, Ithaca, NY 14853, U.S.A.
Ciriyam Jayaprakash
Department of Physics, The Ohio State University, Columbus, Ohio 43210, U.S.A.
DeLiang Wang
Department of Computer and Information Science and Center for Cognitive Science, The Ohio State University, Columbus, Ohio 43210, U.S.A.
Shannon R. Campbell
Physical Optics Corporation, R & D Division, Torrance, California 90505, U.S.A.

We study locally coupled networks of relaxation oscillators with excitatory connections and conduction delays and propose a mechanism for achieving zero phase-lag synchrony. Our mechanism is based on the observation that different rates of motion along different nullclines of the system can lead to synchrony in the presence of conduction delays. We analyze the system of two coupled oscillators and derive phase compression rates. This analysis indicates how to choose nullclines for individual relaxation oscillators in order to induce rapid synchrony. Numerical simulations demonstrate that our analytical results extend to locally coupled networks with conduction delays and that these networks can attain rapid synchrony with appropriately chosen nullclines and initial conditions. The robustness of the proposed mechanism is verified with respect to different nullclines, variations in parameter values, and initial conditions.

1 Introduction
Synchronous neural activity has been observed in many parts of the brain and across the two hemispheres (Singer & Gray, 1995; Phillips & Singer, 1997; Keil, Mueller, Ray, Gruber, & Elbert, 1999). Neural synchrony arises quickly, within one to two periods of stimulus onset. It is often precise, with zero or at most a few milliseconds difference in firing times, and is observed over a considerable distance of the cortex (e.g., 14 mm in the primate motor cortex). This is quite remarkable given that significant delays occur during nerve conduction; for instance, the conduction speed in local cortical circuits
Neural Computation 13, 1003–1021 (2001) © 2001 Massachusetts Institute of Technology
is less than 1 m/sec (Kandel, Schwartz, & Jessell, 1991; Murakoshi, Guo, & Ichinose, 1993). With the commonly reported 40 Hz oscillation frequency, such a conduction speed implies that connected neurons 1 mm apart have a time delay of more than 4% of the period of oscillation. Extensive experimental findings of neural synchrony have stimulated much research into understanding how populations of oscillators can achieve synchrony. Such studies reveal that locally coupled oscillator networks with excitatory coupling and no conduction delay can reach global synchrony across the network. In particular, relaxation oscillators, which link directly to neuronal models of spike generation (FitzHugh, 1961; Nagumo, Arimoto, & Yoshizawa, 1962; Morris & Lecar, 1981), have been shown to synchronize rapidly with local coupling (Somers & Kopell, 1993; Terman & Wang, 1995). A number of studies have examined the effects of time delays in oscillator networks and reveal a diverse and interesting range of behaviors. For example, in a network of identical phase oscillators with local excitatory coupling, the inclusion of a time delay in the interactions decreases the frequency (Niebur, Schuster, & Kammen, 1991). Analysis of pulse-coupled oscillators indicates that time delays together with excitatory coupling do not lead to synchronization in a pair of oscillators; however, if the coupling is inhibitory, a pair of oscillators can be synchronous (van Vreeswijk, Abbott, & Ermentrout, 1994; Ernst, Pawelzik, & Geisel, 1995). Ernst et al. (1995) also examined, through computer simulations, networks of pulse-coupled oscillators with all-to-all connections and time delays. They found that if the coupling is excitatory, groups of synchronous oscillators spontaneously arise and decay; if the coupling is inhibitory, several clusters of stable synchronous oscillators can form.
Campbell and Wang (1998) found that a network of locally coupled relaxation oscillators with delayed excitatory coupling exhibits loose synchrony, whereby oscillators that are directly coupled converge to a phase separation less than or equal to the time delay. More biologically realistic models of neural oscillators also exhibit synchrony when coupled with time delays. König and Schillen (1991) reported that a system of Wilson-Cowan oscillators becomes synchronous and that time delays are important for the synchronous behavior. Traub, Whittington, Stanford, and Jefferys (1996) described a model of cortical circuitry in which synchrony in a network of neurons is observed with time-delay coupling, and the synchronous behavior is linked to a specific firing pattern of interneurons, the so-called spike doublets. Their numerical findings inspired Ermentrout and Kopell (1998) to analyze a similar but simplified system with a pair of local circuits, each consisting of one excitatory unit (E) and one inhibitory unit (I). They showed, using E → I and I → E connections, that the model, due to the presence of doublets in the inhibitory units, can synchronize despite significant time delays. Despite the studies on time delays in oscillator networks, it remains an interesting problem how synchrony with zero phase lag can be achieved efficiently in locally coupled networks with time delays. In this article we
propose a different mechanism for achieving rapid synchrony in locally coupled relaxation oscillators. The key idea is that this can be accomplished by choosing the nullclines of individual oscillators appropriately. We study the case of two coupled oscillators approximately analytically in the next section; this enables us to derive a formula that shows how an appropriate choice of the nullclines can be made and allows us to estimate the speed of convergence to synchrony. We have verified the analytical approximations by numerical simulations of the two-oscillator system. To establish the viability of our mechanism, we have studied one-dimensional (1D) and two-dimensional (2D) arrays of oscillators numerically. For time delays on the order of 10% of the period of the single uncoupled oscillator and a spread of initial phases, synchrony is achieved in a few periods in most cases. To demonstrate the wider applicability of this mechanism, different choices for the nullclines (linear, cubic, exponential, etc.) have been tested; provided the parameters are chosen according to the condition derived in the two-oscillator case, our mechanism is operative. In the case of the step function, there is no synchrony, as was shown earlier (Campbell & Wang, 1998), and the sigmoid function is not effective. We also discuss the effect of different initial conditions and offer some remarks on the case of inhomogeneous time delays with a spread. Thus, we provide a different mechanism for achieving synchrony in locally coupled oscillators with time delays; the models inspired by Traub et al. (1996) exploit an adjustment of timing structure via doublets using additional circuitry, whereas we use the choice of the intrinsic nullclines to accomplish the same result. We describe our oscillator system in the next section and then analyze a pair of oscillators. Subsequently, we compare our analytical results with numerical simulations and examine synchrony in 1D and 2D oscillator networks.
Further discussion is given in the last section.

2 Model Definition
For our study, we use relaxation oscillators similar to those defined by Terman and Wang (1995). Each oscillator $i$ is described by the equations
$$ \dot{x}_i = 3x_i - x_i^3 - y_i + S_i, \tag{2.1a} $$
$$ \dot{y}_i = \epsilon\,(f(x_i) - y_i), \tag{2.1b} $$
where $x_i$ is the excitatory variable, $y_i$ is the inhibitory variable, and $S_i$ is an excitatory coupling term. When $\epsilon$ is chosen small and there is no coupling, the equation corresponds to a standard relaxation oscillator (Wang, 1999). The $x$-nullcline of the oscillator is a cubic (see Figure 1). In the singular limit ($\epsilon \to 0$), the orbit of the oscillator is determined by this nullcline. The oscillator rapidly alternates between the active phase of high $x$ activity and the silent phase of low $x$ activity (see Figure 1).
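Equation 2.1 can be integrated with a simple forward-Euler scheme. The sketch below is our illustration, not code from the paper; the linear $y$-nullcline $f(x) = 4x$ is an assumption we make here, chosen so that the only fixed point lies on the middle branch of the cubic and a relaxation oscillation results:

```python
import numpy as np

def simulate_oscillator(eps=0.02, dt=0.005, T=50.0, S=0.0):
    """Forward-Euler integration of a single uncoupled oscillator,
    equation 2.1, with the illustrative y-nullcline f(x) = 4x."""
    steps = int(T / dt)
    x, y = -1.5, 0.0
    xs = np.empty(steps)
    for t in range(steps):
        dx = 3 * x - x**3 - y + S      # eq 2.1a: cubic x-nullcline
        dy = eps * (4 * x - y)         # eq 2.1b: slow inhibitory variable
        x += dt * dx
        y += dt * dy
        xs[t] = x
    return xs

trace = simulate_oscillator()
```

For small $\epsilon$ the trajectory tracks the left branch of the cubic while $y$ falls to the left knee at $(-1, -2)$, jumps to the right branch near $x = 2$, and so on, producing the alternation between silent and active phases described above.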
Figure 1: Loose synchrony with a step function y-nullcline. The two oscillators are indicated by a circle dot and a square dot, respectively, at four time instants. The oscillators move on the lower cubic when coupling is not activated. When an excitation takes effect, the oscillators move along the upper cubic. LLB, lower left branch; LRB, lower right branch; ULB, upper left branch; URB, upper right branch. Other notations are described in the text.
When the coupling is activated, the cubic is moved up vertically by an amount equal to the coupling strength (Somers & Kopell, 1993). An oscillator moves along the lower cubic unless it receives an excitation from a neighboring oscillator; in this case, it moves along a higher cubic (see Figure 1). The speed with which the oscillator moves along its cubic is determined by its distance to the $y$-nullcline, given by $f(x)$. By choosing the form of $f(x)$, one can control not only the dynamics of the oscillator but also the synchronizing behavior of an oscillator system. This observation leads to our mechanism for achieving synchrony in a locally coupled network with conduction delays. We study 1D or 2D locally coupled networks, where each oscillator is coupled only with its nearest neighbors. The coupling term $S_i$ in equation 2.1a is given by
$$ S_i = \sum_k W_{ik}\, S_1(x_k(t - \tau)), \tag{2.2} $$
$$ S_1(x) = 1 / (1 + \exp[-K(x - \theta)]), \tag{2.3} $$
where $\tau$ is the conduction delay for a signal to travel from one oscillator to its neighbors, $K$ is a parameter of the sigmoid function $S_1$, and $\theta$ is a threshold an oscillator needs to reach in order to influence the activity of its neighbors. $W_{ik}$ is the strength of the synaptic coupling from oscillator $k$ to oscillator $i$. The coupling is local; thus, $W_{ik}$ is zero unless $i$ and $k$ are adjacent. To eliminate boundary effects, we normalize the connection weights to each oscillator so that $W_{ik} = \alpha / N_i$, where $N_i$ is the number of neighbors coupled to $i$ and $\alpha$ is a constant (see Wang, 1995). In this article, $N_i$ can be one, two, three, or four depending on the location of $i$ and the dimensionality (1D or 2D) of the network.

3 Analysis
To illustrate better how the $y$-nullcline controls the dynamics of the network, we analyze the two-oscillator case in the singular limit. Let us first consider a special case in which the $y$-nullcline is a step function; this helps to explain several terms and the consequences of introducing time delays. This case has been analyzed extensively by Campbell and Wang (1998), who conclude that such a system will reach only loose synchrony (see section 1). Figure 1 illustrates such a scenario with only two cubics to consider, one above the other by a constant $a$. LLB and LRB stand for the lower left branch and the lower right branch, respectively, and ULB and URB for the upper left branch and the upper right branch. In Figure 1, the two oscillators are denoted by a circular dot (the leading one) and a square dot (the trailing one); the trajectory of the leading one is denoted by a thin, solid curve and that of the trailing one by a thick, dashed curve. Let the two oscillators start on LLB at $t_0$, when the leading oscillator is just ready to jump to LRB. Let $C$ be the initial phase separation, or the time the trailing oscillator would take to catch up to the leading one if the latter were held stationary. Note that $C$ does not change while both oscillators are on the same branch (Somers & Kopell, 1993), because when the oscillators remain on the same cubic, both traverse the same path. Assume that $C < \tau$. In this case, the trailing oscillator jumps to LRB at $t_1$, before the delayed excitation from the leading one takes effect. As a result, at $t_1$, the phase separation between the two oscillators is still $C$. Both oscillators travel on LRB until $t_2$, when the delayed excitation from the leading oscillator takes effect, which makes the trailing one hop to URB (see Figure 1). Note that because of the step $y$-nullcline, two oscillators with the same $y$ value, one on URB and the other on LRB, move with the same speed in the $y$ direction.
At $t_3$, the delayed excitation from the trailing oscillator to the leading one takes effect and makes the leading oscillator hop to URB. Now both oscillators are on URB, and their phase separation is still $C$ because of the step $y$-nullcline. Indeed, their phase separation remains $C$ throughout the entire limit cycle (Campbell & Wang, 1998). Thus, phase compression (the reduction of the initial phase separation) does not occur when $C < \tau$.
This example clearly shows that time delays introduce a major change in the behavior of the system, since it is well known that two coupled relaxation oscillators with a step $y$-nullcline rapidly synchronize to zero phase separation (Somers & Kopell, 1993; Terman & Wang, 1995). Without a time delay, the leading oscillator, when it jumps to the right branch, induces the trailing one to jump immediately, provided that $C$ is not too large (see Figure 1). Phase compression occurs during the simultaneous jumping process, because the phase separation after both jump to the right branch is smaller than before the jump, when both are on the left branch. Since a step function has zero slope, the analysis also illustrates that the $y$-nullcline must have some slope in order to achieve exact synchrony: then oscillators with the same $y$ value have different speeds on different cubics. The following analysis provides an idea of precisely what slope is necessary for fast synchronization. For this analysis, we make two assumptions. First, the $y$-nullcline is lower than both cubics along their left branches and higher along their right branches, so that any intersections between the two nullclines occur only along the middle branches of the cubics. Second, the two oscillators are initially on LLB. We follow the technique used by Somers and Kopell (1993) and Mirollo and Strogatz (1990) to find a return map for $C$, that is, the phase separation of the two oscillators after one complete period as a function of the initial phase separation. To find this return map, we first determine $C'$, the phase separation after both oscillators have just jumped from LLB to URB. Let $t = 0$ be the time when the leading oscillator has just reached the left knee of LLB (see Figure 2). At time $t = C$, the trailing one jumps from LLB to LRB. Next, at $t = \tau$, the delayed excitation from the leading oscillator reaches the trailing one, causing it to hop from $b$ on LRB to $b'$ on URB (see Figure 2).
Note that the trailing oscillator has traveled for a time τ − C on LRB. Finally, at time t = τ + C, the delayed excitation from the trailing oscillator reaches the leading one, causing it to hop from a on LRB to a′ on URB. At time τ + C, the phase separation between the two oscillators is denoted by C′. To find C′, we first note that the trailing oscillator has spent a time C on URB before the cubic of the leading oscillator is raised at point a. During this time, the trailing oscillator has moved from b′ to a point denoted by a square dot in Figure 2. Clearly, C′ is the time it takes for the oscillator to move from this point to a′. Thus, we can write

C' + C = t_{b'-a'},   (3.1)
where t_{b'-a'} is the time it takes an oscillator to move from b′ to a′ along URB. If we go through similar arguments for the jumps from the active phase to the silent phase that bring the oscillators back to LLB, we find

C'' + C' = t_{B-A}.   (3.2)
Synchronization with Conduction Delays
1009
Figure 2: The two-oscillator case. Each pair of circle and square dots represents the locations of the two oscillators at a specific time. The lower-left pair represents the initial phase separation (C), the right pair represents the phase separation (C′) after one jump, and the upper-left pair represents the phase separation (C″) after two jumps (one up and one down). Other labels are described in the text.
Here t_{B-A} is the time it takes an oscillator to move from point B to A (see Figure 2), where the trailing and the leading oscillators hop down from ULB to LLB, respectively. Finally, the return map can be rewritten as

C'' = C - (t_{b'-a'} - t_{B-A}),   (3.3)
and writing the times t_{b'-a'} and t_{B-A} as integrals, we have

C'' = C - \left( \int_{b'}^{a'} \frac{dy}{\dot{y}} - \int_{B}^{A} \frac{dy}{\underline{\dot{y}}} \right),   (3.4)

where the lower bar in \dot{y} indicates the lower cubic.
Now, let h_R(y) and h_L(y) denote the x coordinates of LRB and LLB. Then, h_R(y − α) and h_L(y − α) correspond to those of URB and ULB. Thus, we have

C'' = C - \left( \int_{b'}^{a'} \frac{dy}{\varepsilon \left( f(h_R(y-\alpha)) - y \right)} - \int_{B}^{A} \frac{dy}{\varepsilon \left( f(h_L(y)) - y \right)} \right).   (3.5)

Note that in the singular limit, the y coordinates of points a′, b′, A′, and B′ are the same as those of points a, b, A, and B, respectively. In the limit cycle, let t = 0 denote the origin in time when the leading oscillator is at the lowest point on LRB; its y coordinate is thus y(t = 0). As discussed earlier, at b′ the trailing oscillator has traveled for a time τ − C on LRB, and hence its y coordinate is y(τ − C). At a, when the leading oscillator receives an excitatory signal from the trailing one, it has traveled for time τ + C starting from the lowest point, and thus the y coordinate of a′ is simply y(τ + C). This allows us to rewrite the limits of the first integral in equation 3.5. We can similarly start the clock at the highest point of ULB, let the corresponding y coordinate be denoted by \underline{y}(t = 0), and rewrite the limits of the second integral. Hence, we have

C'' = C - \left( \int_{y(\tau-C)}^{y(\tau+C)} \frac{dy}{\varepsilon \left( f(h_R(y-\alpha)) - y \right)} - \int_{\underline{y}(\tau-C')}^{\underline{y}(\tau+C')} \frac{dy}{\varepsilon \left( f(h_L(y)) - y \right)} \right).   (3.6)
Finally, we expand both integrals for the case C < τ and C′ < τ, using the formula for differentiating an integral with respect to a parameter (Courant, 1988): if

G(x) \equiv \int_{f_1(x)}^{f_2(x)} F(y, x) \, dy,

then

\frac{dG(x)}{dx} = \int_{f_1(x)}^{f_2(x)} \frac{\partial F(y, x)}{\partial x} \, dy + \frac{df_2(x)}{dx} F(f_2(x), x) - \frac{df_1(x)}{dx} F(f_1(x), x).

Using this for C′, given by

C' = -C + \int_{y(\tau-C)}^{y(\tau+C)} \frac{dy}{\varepsilon \left( f(h_R(y-\alpha)) - y \right)},

and expanding, we obtain to leading order in C,

C' \approx C(-1 + 2R/R_\alpha),

where we define

R = \left[ f(h_R(y)) - y \right]_{t=\tau},
R_\alpha = \left[ f(h_R(y-\alpha)) - y \right]_{t=\tau}.
Similarly, we obtain

C'' \approx C'(-1 + 2L_\alpha/L),

with

L = \left[ f(h_L(y)) - y \right]_{t=\tau},
L_\alpha = \left[ f(h_L(y-\alpha)) - y \right]_{t=\tau}.

Together they lead to the following key equation:

C'' \approx C(1 - 2R/R_\alpha)(1 - 2L_\alpha/L).   (3.7)
Thus, to eliminate the first-order term in the expansion of C″ in powers of C, we need R, the vertical distance from the y-nullcline to the point y(τ) on LRB, to be one-half of R_α, the distance from the y-nullcline to the point on URB corresponding to the same y value; or a similar condition must hold on the left branches. If f(x) is such that this 1/2 ratio holds for a larger and larger region around y(τ), then more and more terms in the expansion of C″ drop out. If the ratio holds for the entire right (or left) branch, C″ goes immediately to zero after only one jump to the active phase. A symmetric y-nullcline produces equal compression for both left-to-right and right-to-left jumps. These observations provide a guide to the choice of nullclines for efficient synchronization. We note that this first-order result agrees with the findings of Campbell and Wang (1998) in the case that f(x) is a step function; the ratio of vertical distances to the y-nullcline is one across the entire branch, yielding C″ = C, or no compression once C is less than τ. Equation 3.7 is clearly useful for establishing the stability of a synchronous solution, as well as for estimating the speed of convergence to synchrony once a particular y-nullcline is chosen. We can also employ this analysis to determine the effects of asymmetric time delays, that is, when the conduction delay from one oscillator to the other differs by an amount δ from that in the opposite direction. In this case, the system will not reach zero-phase synchrony. One can show that for a symmetric y-nullcline chosen such that the first-order term in equation 3.7 is negligible, the system will reach a stable phase difference equal to δ/2, a result that we have confirmed numerically. In order to draw a comparison, we have extended the analysis of Ermentrout and Kopell (1998).
If we let the delay from circuit 1 to circuit 2 be δ, and the delay from circuit 2 to circuit 1 be δ + δ′, then one can choose parameters so that the system reaches a stable fixed point with a phase difference δ′/2. Thus, our model produces similar results with fewer parameters. One can find an expression similar to equation 3.6 for the return map when C and C′ are larger than τ. However, this map appears more difficult to analyze, and we content ourselves with a numerical study in the next section.
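The first-order factor in equation 3.7 can be explored numerically. The sketch below is illustrative only: the x-nullcline y = 3x − x³ + 2 is an assumed standard relaxation-oscillator cubic (the paper's own cubic is defined earlier in the article, outside this excerpt), and the distances are evaluated at arbitrary sample y values rather than at the trajectory point y(τ) used in the text. The y-nullcline f(x) = 8x³ + 5 and coupling strength α = 6 are the values quoted in section 4.

```python
def g(x):
    # Assumed x-nullcline (a standard relaxation-oscillator cubic, NOT taken
    # from this excerpt): the lower cubic is y = g(x); the excited cubic is
    # raised by the coupling strength alpha.
    return 3*x - x**3 + 2

def f(x):
    # Cubic y-nullcline quoted in section 4.
    return 8*x**3 + 5

def branch_x(y, lo, hi, shift=0.0):
    """Solve g(x) + shift = y for x on one branch by bisection on [lo, hi]."""
    fn = lambda x: g(x) + shift - y
    assert fn(lo) * fn(hi) <= 0, "bracket must straddle the branch"
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if fn(lo) * fn(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

alpha = 6.0  # coupling strength quoted in section 4

# Vertical distances from the y-nullcline at a sample y on the right branches:
# R on LRB (lower cubic), R_alpha on URB (cubic raised by alpha).
y_r = 2.0
R  = f(branch_x(y_r, 1.0, 5.0)) - y_r
Ra = f(branch_x(y_r, 1.0, 5.0, shift=alpha)) - y_r

# The analogous distances on the left branches (L on LLB, L_alpha on ULB).
y_l = 8.0
L  = f(branch_x(y_l, -5.0, -1.0)) - y_l
La = f(branch_x(y_l, -5.0, -1.0, shift=alpha)) - y_l

# First-order compression factor of equation 3.7.
factor = (1 - 2*R/Ra) * (1 - 2*La/L)
print(R/Ra, La/L, factor)
```

For this choice of nullclines the ratios come out near (though not exactly) 1/2, so the first-order factor is small and each cycle strongly compresses the phase separation.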
Finally, we note that in some cases we can provide an explicit expression for a y-nullcline that produces synchronization after only one jump from the silent phase to the active phase. Let us denote by d the mean difference in x values (h_R(y − α) − h_R(y) in our earlier notation) between LRB and URB for the relevant range of y coordinates. We note that for the cubic x-nullcline considered in this article, this difference does not vary much. In such cases the y-nullcline given by an exponential function f(x) = C·2^{x/d}, for a constant C, produces a 1/2 ratio between the y-coordinate of the y-nullcline for a point on LRB and that for the corresponding point on URB with the same y value. Because for a given x along the right branches, the y-coordinate of the y-nullcline is much greater than that of the x-nullcline, the latter can be ignored in the calculations of R and R_α. Thus, this choice of f(x) approximately satisfies the 1/2 ratio between R and R_α and produces essentially immediate synchronization after one jump.

4 Numerical Simulation
The second part of our study consists of numerical simulations. We relate our numerical results for the two-oscillator case to our analytical results. The bulk of the section describes our results for 1D and 2D networks. We discuss various types of y-nullclines that can produce fast synchrony and the robustness of this mechanism. We use the fourth-order Runge-Kutta method (Press, Teukolsky, Vetterling, & Flannery, 1992), adapted to time delays, to solve our equations, typically with a time step of 0.01; smaller time steps have been used to verify our numerical results. A fifth-order Lagrange interpolation formula (Abramowitz & Stegun, 1964) is used to determine the required intermediate values of x. Unless otherwise noted, the results discussed in this section are obtained using the following set of parameters: ε = 0.02, τ = 2.0, α = 6.0, K = 50.0, and h = −0.5. This value of τ is about 10% of the oscillation period of an uncoupled oscillator. For the y-nullcline, we used f(x) = 8x³ + 5 (see Figure 6).

4.1 Two-Coupled Oscillators. We first describe the two-oscillator case. With the above parameter values, using numerically obtained values for x(t) and y(t), we find that the coefficient in the first-order expansion for C″ in equation 3.7 is approximately 0.0116. In Figure 3A, we show a graph of C″(C) obtained both numerically and analytically. Note that for C < 1, the analytical approximation is reasonably accurate. For C larger than 1, the higher-order terms begin to dominate, and the first-order approximation is no longer good. The asymptotic slope of the numerically obtained data is about 0.013, which agrees with the predicted result to about 12%. We have obtained comparable or better agreement for other values of ε. Figure 3B shows the numerically obtained return map for C, ranging from zero to a value equal to the time needed to traverse the entire LLB. In
Figure 3: Return maps for the two-oscillator case, with τ = 2.0. (A) Numerical and analytical return maps for C < τ. The dots are numerically obtained data points, and the line is the analytical approximation. Note that for C small, the numerical return map is approximately linear and agrees well with the analytical map. (B) Numerically obtained return map for a large range of C.
this and all the following cases, when we refer to positions on a particular branch, they are limited to the portion of the branch within the limit cycle of the oscillation (see Figure 2). Note the dramatic change in the return map for C > τ. Although we do not have an analytical calculation for the general case, a qualitative understanding follows from the analysis of Campbell and Wang (1998) done in a similar situation (see also Somers & Kopell, 1993). If C is larger than τ, then after the leading oscillator jumps to the active phase, the
Figure 4: Synchronization of two coupled oscillators. The oscillators start at the two ends of LLB with x₁ = −2 and x₂ = −1.
trailing oscillator still spends some time in the silent phase after it receives excitation from the leading one and hops to ULB. Thus, the two oscillators travel in opposite directions on the different nullclines during this time, resulting in dramatic compression of their phase separation. The return map indicates that the two-oscillator case always converges to synchrony. Figure 4 illustrates the synchronization process in the two-oscillator case. The two oscillators start at the two extreme ends of LLB. Also, for purposes of illustration, the coupling strength α is reduced to 1.0 so that synchrony does not occur as fast.

4.2 One-Dimensional Arrays. We now describe our numerical results for 1D arrays. Table 1 shows the average time to synchrony, in terms of the number of periods, for several array sizes. The results are obtained by averaging over 25 trials. To determine initial conditions, we randomly select y values between the highest point on LLB (y = 2) and a point on LLB chosen so that the spread of initial conditions is 43% of the total time to traverse LLB (we explain this choice below). We define a network as synchronized if the maximum phase separation between any two oscillators is less than 3%
Table 1: Average Rates of Synchrony in 1D Arrays.

  Array size                      2      4      8      16     32
  Average periods to synchrony    1.0    1.08   3.24   6.16   8.36
of the total period of an oscillator. For smaller arrays, synchrony is always quite fast. For larger arrays (larger than 10), we find that certain initial conditions lead to very slow synchrony. After studying these cases more carefully, we find that slow synchrony is due to the presence of two or more domains, or blocks of connected oscillators within each of which all have nearly the same phase. These domains result from the topology of 1D nearest-neighbor coupling. If the relative phase between two such domains is large, slow synchrony results because of the limited connectivity of the network. As we discuss below, the domain effects in 2D arrays are much less severe due to increased network connectivity; longer-range couplings are also expected to suppress domain effects and accelerate synchrony. We have explored in greater detail cases with a limited spread of randomly chosen initial conditions. We start all the oscillators in a simulation in a region of LLB. This region represents a spread of initial phases equal to 43% of the total time to traverse LLB. The specific percentage (43%) is empirically obtained so that the narrowing of the initial phase spread eliminates most of the slow-synchrony cases. However, even for these initial conditions, about 1 in 25 trials has a time to synchrony greater than 20 periods for arrays of size 8 or larger. As one further limits the range of initial conditions, the initial degree of synchrony becomes higher and higher; thus, the time to synchrony can become arbitrarily small. Figure 5 shows a typical simulation with 20 oscillators, whose initial phases spread over the time to traverse LLB minus τ. In this figure a dot in the ith row represents the time at which the ith oscillator jumps to the active phase. To test the robustness of our model, we conducted simulations with different choices of y-nullclines.
As shown in Figure 6, we have tested sigmoid, linear, cubic, exponential, and hyperbolic sine functions (some curves in Figure 6 are shifted vertically when used in simulations). We find that all of these functions except the sigmoid can yield rapid synchrony for appropriately chosen parameters. In fact, this is predicted by equation 3.7. According to the equation, to produce fast synchrony, the distance to a y-nullcline from the right-most point on LRB should be roughly half that from the corresponding point on URB with the same y-coordinate (see Figure 2). Thus, one expects that functions with smaller slopes, such as sigmoids, synchronize at a slower rate than functions with greater slopes, such as cubics. Also, equation 3.7 shows that nullclines that are symmetric about the
Figure 5: Synchronization in a 1D array of 20 oscillators. The horizontal axis represents time, and the vertical axis oscillator positions in the array.
origin produce convergence to synchrony for both jumping up and jumping down. Thus, we predict that symmetric functions, such as hyperbolic sines, synchronize at a faster rate than asymmetric ones, such as exponential functions. This is indeed what we have observed in simulations. With the y-nullcline fixed to a cubic function, we have also checked for robustness with respect to parameter changes. We find that varying the time delay τ by 20% has little effect on the behavior of the system. Similarly, changing the parameters of a cubic y-nullcline does not affect the network behavior. Finally, we test how the network responds to inhomogeneous time delays. To do this, we vary the delays between oscillators by a random amount of up to 5% of the time delay; large inhomogeneities in conduction delays prevent synchronization, as expected. Note that in this case, the time delay from oscillator i to j is generally different from that from j to i. We find that the system does not converge to zero phase-lag synchrony with such an inhomogeneity. However, the inhomogeneous system does converge to satisfy the 3% synchrony condition discussed earlier, at a rate similar to that of the corresponding homogeneous system. Other networks, such as those that make use of spike doublets in inhibitory units (Ermentrout & Kopell, 1998), are also unable to reach zero phase-lag synchrony in the presence of such inhomogeneity.
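The delay-coupled integration used in these simulations can be sketched with forward Euler and a circular history buffer (the paper itself uses fourth-order Runge-Kutta with fifth-order Lagrange interpolation). The oscillator equations below are an assumption made for illustration: a standard relaxation-oscillator cubic ẋ = 3x − x³ + 2 − y + αS(x_delayed) with a sigmoidal gate S of gain K and threshold h, and ẏ = ε(f(x) − y). Only the parameter values ε, τ, α, K, h, the y-nullcline f(x) = 8x³ + 5, and the initial x values of Figure 4 are taken from the text; the initial y values are made up.

```python
import math

# Parameter values quoted in section 4; the model equations themselves are
# an assumption for this sketch (the paper's equations are defined earlier
# in the article, outside this excerpt).
eps, tau, alpha, K, h = 0.02, 2.0, 6.0, 50.0, -0.5
dt = 0.01
L = int(round(tau / dt)) + 1          # history buffer length (delay = L-1 steps)

def f(x):                             # cubic y-nullcline from the text
    return 8*x**3 + 5

def gate(x):                          # assumed sigmoidal coupling gate
    return 1.0 / (1.0 + math.exp(-K * (x - h)))

x = [-2.0, -1.0]                      # initial x at the two ends of LLB (Figure 4)
y = [0.5, 1.5]                        # initial y values: an assumption
hist = [[x[0]] * L, [x[1]] * L]       # constant history for t <= 0
crossings = [0, 0]

for n in range(int(80.0 / dt)):
    # x(t - tau) for each oscillator, read before this step overwrites the slot.
    xd = [hist[0][n % L], hist[1][n % L]]
    new = []
    for i in (0, 1):
        j = 1 - i
        dx = 3*x[i] - x[i]**3 + 2 - y[i] + alpha * gate(xd[j])  # delayed excitation
        dy = eps * (f(x[i]) - y[i])
        new.append((x[i] + dt*dx, y[i] + dt*dy))
    for i in (0, 1):
        if x[i] < 0 <= new[i][0]:     # count upward zero crossings (jumps up)
            crossings[i] += 1
        x[i], y[i] = new[i]
        hist[i][n % L] = x[i]

print(x, y, crossings)
```

Counting upward zero crossings of x gives a crude check that both units keep cycling under the delayed excitation; a faithful reproduction of the paper's results would require its exact equations and integrator.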
Figure 6: Plot of several y-nullclines that have been studied, where a is an exponential curve (f(x) = 4·2^{2x} − 4), b is a sigmoid curve (f(x) = 13 tanh(0.7x)), c is a cubic curve (f(x) = 8x³), d is a linear curve (f(x) = 20x), and e is a hyperbolic sine curve (f(x) = 8 sinh(x ln 4)).
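The formulas printed in the caption can be compared directly. The short sketch below (assuming nothing beyond those formulas) checks which nullclines are symmetric about the origin (odd functions) and compares slopes at a sample point in the outer-branch region, the two properties the surrounding discussion links to synchronization speed:

```python
import math

# y-nullclines as printed in the Figure 6 caption.
nullclines = {
    "a (exponential)": lambda x: 4 * 2**(2*x) - 4,
    "b (sigmoid)":     lambda x: 13 * math.tanh(0.7*x),
    "c (cubic)":       lambda x: 8 * x**3,
    "d (linear)":      lambda x: 20 * x,
    "e (sinh)":        lambda x: 8 * math.sinh(x * math.log(4)),
}

def is_odd(fn, xs=(0.5, 1.0, 2.0), tol=1e-9):
    """Symmetric about the origin: f(-x) = -f(x) at a few sample points."""
    return all(abs(fn(-x) + fn(x)) < tol for x in xs)

def slope(fn, x, dx=1e-6):
    """Central-difference slope at x."""
    return (fn(x + dx) - fn(x - dx)) / (2*dx)

for name, fn in nullclines.items():
    print(name, is_odd(fn), round(slope(fn, 2.0), 2))
```

Consistent with the text's prediction, the sigmoid's slope in this region is far smaller than the cubic's, and the exponential is the only curve that fails the symmetry check.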
4.3 Two-Dimensional Networks. In the 2D case, each oscillator is connected with its four nearest neighbors, with open boundaries. As mentioned earlier, synchronization in 2D networks with random initial conditions is less susceptible to domain effects. We have studied 10 × 10 networks because the maximum distance between any two oscillators is the same as in a 1D 20-oscillator array. With the same initial conditions used to produce Table 1, the average time to synchrony over 25 trials in the 20-oscillator array is 6.52 periods. For a 10 × 10 network, using the same spread of initial conditions, the average time to synchrony over the same number of trials is 3.36 periods. In these trials, the longest time to synchrony is 9 periods. We have also studied other initial conditions. We first split the network into two 5 × 10 blocks of oscillators: one block with all the oscillators beginning at the top of LLB and the other with all the oscillators near the bottom of LLB. This condition causes very slow synchrony (greater than 40 periods). However, the condition represents a worst case due to its similarity to 1D arrays. In another condition, we start all the oscillators in a 4 × 4 square at the center of the network at one extreme of the left branch, and all other oscillators at the other extreme. This initial condition yields synchrony in 12
Figure 7: Synchronization in a 20 × 20 2D network. The horizontal axis represents time, and the vertical axis x activity. The activity of all 400 oscillators is superimposed.
periods. We expect that networks with greater connectivity, such as those in which connectivity is gaussian rather than nearest neighbor, show even faster synchrony. Finally, we have tested a 20 × 20 network where each oscillator randomly starts in a rectangle that just bounds the limit cycles of all the oscillators (see Figure 2); in other words, each oscillator has an equal chance of falling into the silent phase or the active phase. In this case, the network still converges to synchrony, although the time to synchrony is generally greater than 20 periods. Because of the increased connectivity in 2D networks, we can use a spread of initial conditions over 71% of the time to traverse LLB. Note that the 71% initial spread is the maximum spread on LLB after the time delay is considered, which is about 29% of the time to traverse LLB. For a 20 × 20 array with these initial conditions, we find that the average time to synchrony is 12.36 periods, with a maximum time (out of 25 trials) of 23 periods. Figure 7 shows the synchronization process of a 20 × 20 network. The network in this case synchronizes after about 5 periods; the maximum phase separation between any two oscillators is then less than the 3% period criterion.

5 Discussion
Our analysis shows that the synchronous solution can be stable for a pair of relaxation oscillators in the presence of conduction delays. We have derived an approximation for the rate of synchrony. This analysis indicates how to choose y-nullclines for relaxation oscillators so as to induce rapid synchrony. We have also examined asymmetric conduction delays between two oscillators. We find that small differences lead to a solution in which the phase separation between the oscillators is approximately half the difference of the two time delays; in other words, the system is nearly synchronous. Further, we have conducted numerical simulations in 1D and 2D oscillator networks.
Our simulations indicate that synchrony also occurs in these systems for a range of y-nullclines, initial conditions, parameters, and time delays. We have studied large phase separations in the initial conditions of the oscillators (40% of the time to traverse the left branch) and time delays up to 12% of the period of an uncoupled oscillator and find that our mechanism works for the given set of parameters. These systems also become nearly synchronous with inhomogeneous delays, analogous to the two-oscillator case. We have not studied much larger values of τ, particularly cases where the sum of the time delay and the largest phase separation exceeds the time taken to traverse one of the branches. Our study provides insights into the dynamics of relaxation oscillator systems with time delays. Fast synchronization does not depend on the exact form of a y-nullcline; it results as long as the 1/2 ratio approximately holds between the rates of motion on LRB and URB. A reasonably symmetric y-nullcline speeds up the process of synchronization further. We have described which initial conditions are appropriate for achieving rapid synchrony, as some special initial conditions can result in slower synchronization than others. The set of y-nullclines displayed in Figure 6 includes the linear function used in the FitzHugh-Nagumo oscillator (FitzHugh, 1961; Nagumo et al., 1962) and the sigmoid function used in the Morris-Lecar oscillator (Morris & Lecar, 1981); both are relaxation oscillators and have been widely used as models of neuronal activity. How well networks of these oscillators synchronize with time-delay coupling depends on the specific linear and sigmoid functions used. Qualitatively speaking, our analysis suggests that in the presence of conduction delays, linear y-nullclines should result in faster synchronization than sigmoid y-nullclines.
Previous studies examined coupled integrate-and-fire oscillators and found that synchrony can arise in the presence of time delays (van Vreeswijk et al., 1994; Ernst et al., 1995). These studies found that inhibitory connections are needed for synchronous solutions in the presence of time delays. Our mechanism for achieving synchronization in relaxation oscillators is quite different. Synchrony in our model relies on the fact that there are different rates of motion along the different nullclines of the system. Integrate-and-fire oscillators do not possess separate nullclines and thus cannot accommodate a mechanism like ours. In our system, the interaction is still excitatory, and the synchronous solution remains stable as time delays are introduced into the system, which does not occur in integrate-and-fire oscillator networks. As analyzed by Ermentrout and Kopell (1998), oscillator models with delicate timing structure (i.e., spike doublets; see Traub et al., 1996) can also exhibit synchrony with time delays. Our model achieves a similar objective in a very different way. We do not use additional circuitry to deal with time delays; the y-nullcline is an intrinsic part of a relaxation oscillator. As a result, our mechanism is simpler. On the other hand, our
mechanism is not as directly motivated biologically as that of Ermentrout and Kopell. We conclude that synchrony exists in systems of locally coupled relaxation oscillators in the presence of conduction delays and with excitatory coupling. Our analysis reveals that y-nullclines can play an important role in synchronizing a locally coupled network of relaxation oscillators with conduction delays. A variety of y-nullclines have been tested and shown to yield rapid synchrony. Moreover, our result allows one to predict whether and how quickly a given y-nullcline produces synchrony. Given that relaxation oscillators have a direct linkage to neuronal models, our analysis may help in understanding how synchrony arises in neural systems, where conduction delays are inevitable. Our insights into how to create appropriate conditions for achieving zero phase-lag synchrony may also be useful in constructing and analyzing physical systems (Cosp, Madrenas, & Cabestany, 1999). Some opto-electronic systems and devices that operate at extremely high frequencies (Wirkus & Rand, 1997; Young et al., 1988) are modeled as coupled relaxation oscillators, and in these systems conduction delays can become a significant percentage of the oscillation period.

Acknowledgments
This work was supported in part by an NSF grant (IRI-9423312) and an ONR Young Investigator Award (N00014-96-1-0676) to D. L. W. C. J. acknowledges computer support from the Ohio Supercomputer Center.

References

Abramowitz, M., & Stegun, I. (Eds.). (1964). Handbook of mathematical functions. Washington, DC: U.S. Government Printing Office.
Campbell, S. R., & Wang, D. L. (1998). Relaxation oscillators with time delay coupling. Physica D, 111, 151–178.
Cosp, J., Madrenas, J., & Cabestany, J. (1999). A VLSI implementation of a neuromorphic network for scene segmentation. In Proceedings of MicroNeuro (pp. 403–408).
Courant, R. (1988). Differential and integral calculus. New York: Wiley.
Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci. USA, 95, 1259–1264.
Ernst, U., Pawelzik, K., & Geisel, T. (1995). Synchronization induced by temporal delays in pulse-coupled oscillators. Phys. Rev. Lett., 74, 1570–1573.
FitzHugh, R. (1961). Impulses and physiological states in models of nerve membrane. Biophys. J., 1, 445–466.
Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (1991). Principles of neural science (3rd ed.). New York: Elsevier.
Keil, A., Mueller, M. M., Ray, W. J., Gruber, T., & Elbert, T. (1999). Human gamma band activity and perception of a Gestalt. J. Neurosci., 19, 7152–7161.
König, P., & Schillen, T. B. (1991). Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comp., 3, 155–166.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Murakoshi, T., Guo, J., & Ichinose, T. (1993). Electrophysiological identification of horizontal synaptic connections in rat visual cortex in vitro. Neurosci. Lett., 163, 211–214.
Nagumo, J., Arimoto, S., & Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proc. IRE, 50, 2061–2070.
Niebur, E., Schuster, H. G., & Kammen, D. M. (1991). Collective frequencies and metastability in networks of limit-cycle oscillators with time delay. Phys. Rev. Lett., 67, 2753–2756.
Phillips, W. A., & Singer, W. (1997). In search of common foundations for cortical computation. Behav. Brain Sci., 20, 657–722.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). New York: Cambridge University Press.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci., 18, 555–586.
Somers, D., & Kopell, N. (1993). Rapid synchrony through fast threshold modulation. Biol. Cybern., 68, 393–407.
Terman, D., & Wang, D. L. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
Traub, R., Whittington, M., Stanford, M., & Jefferys, J. (1996). A mechanism for generation of long-range synchronous fast oscillations in the cortex. Nature, 383, 621–624.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, B. (1994). When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., 1, 313–321.
Wang, D. L. (1995). Emergent synchrony in locally coupled neural oscillators. IEEE Trans. Neural Net., 6(4), 941–948.
Wang, D. L. (1999). Relaxation oscillators and networks. In J. Webster (Ed.), Encyclopedia of electrical and electronics engineering (Vol. 18, pp. 396–405). New York: Wiley.
Wirkus, S., & Rand, R. (1997). Dynamics of two coupled van der Pol oscillators with delay coupling. In Proceedings of ASME Design Engineering Technical Conference.
Young, J. F., Wood, B. M., Liu, H. C., Buchanan, M., Landheer, D., Spring, A. J., Thorpe, A., & Mandeville, P. (1988). Effect of circuit oscillations on the DC current-voltage characteristics of double barrier resonant tunneling devices. Appl. Phys. Lett., 52, 1398–1401.

Received November 8, 1999; accepted July 20, 2000.
LETTER
Communicated by Bruno Olshausen
Predictions of the Spontaneous Symmetry-Breaking Theory for Visual Code Completeness and Spatial Scaling in Single-Cell Learning Rules

Chris J. S. Webber
DERA, Malvern WR14 3PS, U.K.

This article shows analytically that single-cell learning rules that give rise to oriented and localized receptive fields, when their synaptic weights are randomly and independently initialized according to a plausible assumption of zero prior information, will generate visual codes that are invariant under two-dimensional translations, rotations, and scale magnifications, provided that the statistics of their training images are sufficiently invariant under these transformations. Such codes span different image locations, orientations, and size scales with equal economy. Thus, single-cell rules could account for the spatial scaling property of the cortical simple-cell code. This prediction is tested computationally by training with natural scenes; it is demonstrated that a single-cell learning rule can give rise to simple-cell receptive fields spanning the full range of orientations, image locations, and spatial frequencies (except at the extreme high and low frequencies at which the scale invariance of the statistics of digitally sampled images must ultimately break down, because of the image boundary and the finite pixel resolution). Thus, no constraint on completeness, or any other coupling between cells, is necessary to induce the visual code to span wide ranges of locations, orientations, and size scales. This prediction is made using the theory of spontaneous symmetry breaking, which we have previously shown can also explain the data-driven self-organization of a wide variety of transformation invariances in neurons' responses, such as the translation invariance of complex-cell response.

1 Introduction
In “single-cell” learning rules, cells adapt independently of one another; the adaptation of a whole network of such neurons may be described by a set of uncoupled dynamical equations, each of which describes the independent adaptation of an individual cell. “Multiple-cell” learning rules cannot be factorized into uncoupled dynamical equations in this way. Local as well as nonlocal learning rules can fall into this second category. In local multiple-cell learning rules, the coupling between adapting neurons is mediated by interneural connections. Olshausen and Field (1996), Bell and Sejnowski
Neural Computation 13, 1023–1043 (2001)
© British Crown copyright 2001/DERA
(1997), van Hateren and van der Schaaf (1998), and others have shown that multiple-cell learning rules can adapt to encode the small (16-pixel-square) natural scenes on which they are trained in terms of localized, oriented, bandpass receptive fields like those of cortical simple cells. Such multiple-cell rules employ coupling between cells to discourage redundancy of their coding functions, so that the code will not overrepresent some image locations, orientations, and size scales at the expense of others. The ability of the visual simple-cell code to detect features over the full range of locations, orientations, and size scales is called completeness, a related but different interpretation of the word from that used in linear algebra, where completeness is the ability to reconstruct any arbitrary vector by linear superposition, and where a set of n-dimensional basis vectors is complete if any n of them are mutually linearly independent. Blais, Intrator, Shouval, and Cooper (1998) studied several single-cell rules that also generate oriented receptive fields, but did not demonstrate localization of the receptive fields to an extent smaller than the 13-pixel-square input window, and consequently did not demonstrate the bandpass property. Webber (2000) demonstrated the localization and bandpass properties of simple-cell-like receptive fields using a single-cell Hebb (1949) type rule with threshold neural output. Using a simple analytical model, Webber (1994) had shown this rule to have the property of finding sparse codes. Webber (1997) explains that this learning rule finds maximally kurtotic directions in the probability distribution Pr(x) of data vectors x: those unit directions ŵ in the data's vector space for which the tail of the marginal distribution Pr(x · ŵ) of projections x · ŵ is longest, as measured by the particular nonlinear moment that the learning rule maximizes by gradient ascent.
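The idea of a maximally kurtotic direction can be made concrete with synthetic data. The sketch below is an illustration only (it is not Webber's learning rule): it builds a two-dimensional distribution whose first axis is sparse and long-tailed (Laplacian) and whose second axis is gaussian, and measures the excess kurtosis of the marginal distribution of projections x · ŵ along each axis direction.

```python
import math
import random

random.seed(0)
n = 100_000

def laplace():
    # Standard Laplace(0, 1) sample by inverse-CDF transform.
    u = random.random() - 0.5
    return -math.copysign(1.0, u) * math.log(1 - 2*abs(u))

# Component 0 is sparse/long-tailed (Laplacian); component 1 is gaussian.
data = [(laplace(), random.gauss(0.0, 1.0)) for _ in range(n)]

def proj_kurtosis(w):
    """Excess kurtosis of the marginal distribution of projections x . w."""
    norm = math.hypot(*w)
    z = [(a*w[0] + b*w[1]) / norm for a, b in data]
    m = sum(z) / n
    m2 = sum((v - m)**2 for v in z) / n
    m4 = sum((v - m)**4 for v in z) / n
    return m4 / m2**2 - 3.0

k_sparse = proj_kurtosis((1.0, 0.0))   # along the Laplacian axis: large
k_gauss  = proj_kurtosis((0.0, 1.0))   # along the gaussian axis: near zero
print(k_sparse, k_gauss)
```

A rule that ascends a kurtosis-like nonlinear moment by gradient ascent would be drawn toward the first direction, whose projection histogram has the longest tail.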
This explains why the rule finds localized, oriented, bandpass receptive fields; Field (1994) had proposed that such receptive fields correspond to kurtotic directions of the distribution of natural images. The main result of Webber (2000) was to show analytically that invariances of the statistics Pr(x) of training data x under orthogonal transformations can drive the development of corresponding transformation invariances in neural recognition responses. (Orthogonal transformations are linear transformations on the data vector x that preserve its vector length |x| and are the n-dimensional generalizations of rotations and reflections.) This result provided a novel and potentially generic mechanism of unsupervised invariance discovery, associated with the mathematical phenomenon of spontaneous symmetry breaking. In this economical theory, response invariances develop without the need for cells to pick up temporal correlations induced by the transformation, so avoiding several fundamental problems associated with memory-trace models of invariance learning. That article proposed that a function of sparse coding is to facilitate this mechanism of invariance discovery for much more general transformations that are not orthogonal in the raw input space and, as an example,
Spontaneous Symmetry Breaking and Code Completeness
using natural image training data, showed that single-cell learning can thus account for the translation invariance of Hubel and Wiesel (1962) complex cells. In this article, the same theory of spontaneous symmetry breaking is applied to the orthogonal transformations that interrelate the various possible convergent states of single-cell learning rules, when the training data possess the basic statistical invariances of natural scenes. This theory allows one to make predictions about the relative numbers of neurons that will converge to each of the various possible convergent states, given the simplest of assumptions about the random distribution of initial weight-vector states before learning: the assumption of zero prior information. These quantitative predictions show that single-cell learning rules can generate visual codes that span different image locations, orientations, and size scales with equal economy. The property of spanning different size scales with equal economy is called spatial scaling and is observed in the distribution of measured peak spatial frequency of cortical simple cells (DeValois, Albrecht, & Thorell, 1982). However, van Hateren and van der Schaaf (1998) observe that the multiple-cell independent component analysis rule of Hyvärinen and Oja (1997) and the multiple-cell rule of Olshausen and Field (1996) are unable to generate distributions of peak spatial frequency that are consistent with spatial scaling. It has even been suggested that multiple-cell rules should employ an explicit constraint to enforce spatial scaling (Koenderink, 1984; Li & Atick, 1994).

2 The Spontaneous Symmetry-Breaking Mechanism
Consider the large class of unsupervised single-cell learning rules for which the weight vector update is given by

dw(t)/dt = λ ∫ [v x − w w(t)] Pr(x) dx,  (2.1)
where the scalars v and w are both continuous but not necessarily linear functions of the scalar product x · w, the distance |x − w|, and/or the length |w|. The learning rate λ may change over time t. At each time step t, the ensemble average ∫ ⋯ Pr(x) dx is considered to be taken over a sufficiently large, representative, random sample of training patterns, so that the learning rule prescribes effectively deterministic (i.e., nonstochastic) behavior. It is well known that this is the limiting case of stochastic rules,
Δw = λ0 [v x − w w],  (2.2)
when the learning rate λ0 is very small. For the special case of learning rules that incorporate a normalization constraint of constant length |w|, |w| is
kept constant by the particular (Lagrange multiplier)¹ choice:

w(x, w) = v(x, w) (x · w)/|w|².  (2.3)
The class of deterministic rules described by equation 2.1 includes the Hebb (1949) rule, for which v(x, w) is the neuron’s recognition output response. The class includes projection pursuit rules, for which v(x, w) is defined to be a function of x · w/|w| (e.g., Friedman & Tukey, 1974; Friedman, 1987). The class also includes single-cell rules based on a radial basis function, for which w(x, w) = v(x, w) are functions of |x − w| (and for which a |w|-constancy condition, equation 2.3, is not appropriate). In general, given a particular data set Pr(x), only certain special states of w, called possible convergent states w∞, can be reached by the action of a learning rule (equation 2.1) on convergence:

w∞ ≡ lim_{t→∞} w(t).  (2.4)
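The dynamics of equations 2.1 through 2.3 can be illustrated with a minimal numerical sketch (illustrative code, not the article’s simulation; the linear output v(x, w) = x · w and the gaussian toy data are assumptions made here for concreteness). With the Lagrange-multiplier choice 2.3, the averaged update leaves |w| essentially unchanged while the direction of w climbs toward a dominant direction of Pr(x):

```python
import numpy as np

# Toy data whose largest-variance direction is the first coordinate axis.
rng = np.random.default_rng(1)
n = 8
C = np.diag([5.0] + [1.0] * (n - 1))
X = rng.multivariate_normal(np.zeros(n), C, size=5000)

w = rng.standard_normal(n)              # random initial weight vector
w0_norm = np.linalg.norm(w)

for _ in range(300):                    # deterministic dynamics, equation 2.1
    v = X @ w                           # assumed linear output v(x, w) = x . w
    omega = v * (X @ w) / (w @ w)       # equation 2.3 (Lagrange multiplier)
    dw = (v[:, None] * X - omega[:, None] * w).mean(axis=0)  # ensemble average
    w += 0.02 * dw                      # Euler step of dw(t)/dt

w_hat = w / np.linalg.norm(w)
assert abs(w_hat[0]) > 0.9              # aligned with the dominant direction
assert abs(np.linalg.norm(w) - w0_norm) / w0_norm < 0.2  # |w| ~ conserved
```

With this particular choice of v, the rule reduces to an Oja-style Hebbian update; the article’s class is broader, since v and w may be nonlinear.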
In general there are many possible convergent states for a given learning rule and data set; different possible convergent states can be reached by starting the learning from different initial states w0 ≡ w(t = 0). Each possible convergent state w∞ has a basin of attraction, which is defined as that volume of w-space containing all the initial states w0 that will eventually adapt into that particular w∞ as opposed to any of the others. Webber (2000) proves, for this class of rules, that if Pr(x) is symmetric,

Pr(S(x)) = Pr(x),  (2.5)

under an orthogonal transformation S to a degree of approximation, then S(w∞) will be a convergent state if w∞ is. This statement covers two possibilities for any given symmetry S. Any possible convergent state will either (1) itself be invariant under S,

S(w∞) = w∞  (symmetry preservation),  (2.6)

or (2) will be mapped by S onto another, distinct, state w′∞,

w∞ → w′∞ = S(w∞)  (spontaneous symmetry breaking),  (2.7)

which is also one of the possible convergent states that can be reached by the dynamics 2.1. For example, if the statistics Pr(x) of a set of images x
¹ Webber (1994) proves that this Lagrange constraint is equivalent to the more conventional formulation of normalization constraints, in which equation 2.1 has no explicit w term but in which w is rescaled to a fixed length after implementing each weight vector update.
are translationally invariant, any convergent state for the weight vector will also be a convergent state if translated across the image. Convergent states related by a symmetry group that is continuous rather than discrete will not be isolated stable states, but will merge into a continuum of metastable states with adjacent basins of attraction. (As an example, image translation is only a discrete transformation because of the finite pixel sampling frequency, but is effectively a continuous transformation for features of lower spatial frequencies. The same can be said of two-dimensional rotation and scale magnification.) In case 1, the weight vector will self-organize from a random and asymmetric initial state into a convergent state that itself possesses a symmetry of the data’s probability distribution. This phenomenon of symmetry preservation provides the novel mechanism of unsupervised invariance discovery cited in section 1. In that context, the orthogonal transformations of interest were permutations among the n code units of an n-dimensional sparse-coded data vector x, and symmetries of Pr(x) under these transformations are the result of statistical indistinguishabilities among the sparse-coded feature set. Webber (1991a, 2000) proposed that transformations as generic as 2D and 3D object translation, rotation, and deformation, or even discontinuous transformations such as left-right ocularity reversal or change of a speaker’s identity, could in principle be mapped onto permutations by the action of appropriate sparse codes in a hierarchical architecture, giving rise to statistical indistinguishabilities in the sparse code, and hence driving the self-organization of transformation-invariant neurons via the symmetry-preservation mechanism. Case 2 is called spontaneous symmetry breaking because the symmetry S of Pr(x) is not manifest in the convergent state w∞. The symmetry S is said to be hidden, or spontaneously broken.
This article focuses on symmetry breaking as opposed to symmetry preservation, and the orthogonal transformations of interest here correspond to the basic approximate invariances of (spectrally whitened) natural scenes: 2D translation, rotation, and scale magnification. In general, a distribution Pr(x) may possess a whole group of different symmetries, and some of these will be broken in the convergent state, while others are preserved. Whether a given symmetry is broken or preserved is determined by the specific form of the functions v and w in the learning rule (equation 2.1) and by any symmetry-breaking bifurcation parameters implicit in those functions.

3 Convergent States Related by Symmetry Will Be Populated Equi-Probably
If one applies the same orthogonal transformation to the vectors that span an angle or solid angle in w-space, the angle or solid angle will be preserved unchanged by the transformation. It follows that if a spontaneously broken
symmetry of Pr(x) relates possible convergent states w∞ and w′∞, then the basins of attraction of those distinct convergent states will have equal volumes (and the same shapes) in w-space. In projection pursuit rules, the unit direction ŵ ≡ w/|w| is the most important part of w; its length |w| is unimportant and constrained constant (to unit length, without loss of generality). Since all permissible w therefore lie on the unit n-sphere |w| = 1, the basins of attraction of symmetry-related convergent states will occupy equal areas on the unit n-sphere and subtend equal solid angles at the origin. For projection pursuit rules, therefore, the prior distribution of initial states w0 that contains no prior information (maximal Shannon entropy and zero Fisher information) is the uniform distribution over the unit n-sphere, because no unit direction drawn from this distribution will be any more likely a priori than any other. The simplest and least contrived way of randomly initializing a weight vector is to initialize all its weights as independent random variables. If these are all drawn independently from gaussian distributions of equal variance and zero mean, this will lead to a distribution of weight vectors that is a spherically symmetric n-variate gaussian ball in w-space, centered on the origin. If the neuron has a normalization mechanism that scales its weight vector to unit length, this normalization will project the gaussian ball radially onto the unit n-sphere, so that the prior distribution of initial states w0 will be uniform over the surface of that sphere.
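This zero-prior-information initialization is easy to demonstrate numerically (a sketch under the stated assumptions; sample sizes and the isotropy checks are choices made here): independent gaussian weights followed by length normalization yield unit vectors distributed uniformly over the n-sphere.

```python
import numpy as np

# Zero-prior-information initialization: independent gaussian weights,
# then normalization to unit length, give directions uniform on the
# unit n-sphere (by the spherical symmetry of the gaussian ball).
rng = np.random.default_rng(2)
n, n_cells = 3, 200_000
W = rng.standard_normal((n_cells, n))
W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)

# Isotropy checks: the mean direction vanishes, and each coordinate of a
# uniformly distributed unit vector has second moment 1/n.
assert np.allclose(W_hat.mean(axis=0), 0.0, atol=0.01)
assert np.allclose((W_hat**2).mean(axis=0), 1.0 / n, atol=0.01)
```

No unit direction is favored, so a large population initialized this way samples the basins of attraction in proportion to their areas, as the text goes on to argue.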
For projection pursuit rules, therefore, initialization according to the assumption of zero prior information should be biologically plausible, because the central limit theorem tells us that the gaussian distribution of zero mean is inevitable if each synaptic weight is the cumulative effect of a large number of small, (initially) random quantities such as receptors or vesicles (inhibitory and excitatory with equal initial probabilities), and because no neural mechanism is needed to contrive any statistical dependences between synapses before adaptation, other than that the cell’s total squared synaptic weight is constrained constant. The homeostasis of each cell’s total quantity of some synaptic material could in principle be sufficient to achieve such a simple constraint. Randomly initialized with this assumption of zero prior information, a large population of neurons, adapting independently according to the same single-cell learning rule and data set Pr(x), will converge to populate each of the possible convergent states in proportion to the area of its basin of attraction. Thus, on average, equal numbers of neurons will converge to convergent states related by a spontaneously broken orthogonal symmetry of Pr(x).

4 The Basic Invariances of Natural Scenes
Randomly chosen ensembles of natural images have statistics that are at least approximately invariant under the basic transformations of 2D translation, rotation, and scale magnification. The probability distribution Pr(x) of natural scenes x will be invariant under all 2D translations, provided one assumes that the effects of foveal attention and visual periphery can be neglected. Certainly the statistics of images from a camera pointed randomly in a natural environment will be translation invariant (neglecting edge effects), although one might expect animals to pay more attention to areas of interest than to the blank sky, so from the point of view of a real retina, the statistics may be less accurately translation invariant. The statistics of images composed by a landscape photographer will have a particularly marked excess of horizontal and vertical edges over other orientations, although other animals probably spend less time with their heads vertical, and so their retinas probably see more rotation-invariant statistics. Random ensembles of natural scenes will have the property of statistical invariance under 2D rotations to some degree of approximation, however. Measurements of the mean power spectrum of ensembles of unwhitened natural scenes show it not to depart significantly from the power law² (1/2π|f|) dE/d|f| ∝ 1/|f|², where f = (f1, f2) is the 2D spatial frequency and |f|² = f1² + f2² (Deriugin, 1956; Cohen, Gorog, & Carlson, 1975; Burton & Moorhead, 1987; Field, 1987). It follows from the convolution theorem that if natural scenes are spectrally whitened, for example, by convolution with an on-center off-surround kernel filter whose phaseless (real) Fourier transform is proportional to |f|, then the mean power spectrum of whitened natural scenes will be independent of |f|, that is, scale invariant.
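The whitening operation just described can be sketched in a few lines (an illustrative FFT implementation with a unit frequency grid; the article specifies only that the kernel’s Fourier transform is phaseless and proportional to |f|):

```python
import numpy as np

def whiten(image):
    """Multiply the 2D spectrum by |f|: by the convolution theorem this is
    convolution with a phaseless (real-spectrum) on-center off-surround
    kernel, and it flattens a ~1/|f|^2 mean power spectrum."""
    n1, n2 = image.shape
    f1 = np.fft.fftfreq(n1)[:, None]
    f2 = np.fft.fftfreq(n2)[None, :]
    f_mag = np.sqrt(f1**2 + f2**2)      # |f|; zero at the DC component
    return np.real(np.fft.ifft2(np.fft.fft2(image) * f_mag))

# The filter vanishes at f = 0, so the DC mean is removed and whitened
# pixel values can be negative even for a nonnegative input image.
img = np.random.default_rng(3).random((128, 128))
w_img = whiten(img)
assert abs(w_img.mean()) < 1e-10
assert w_img.min() < 0 < w_img.max()
```

The removal of the DC mean and the appearance of negative pixel values anticipate the points made about whitened training vectors in section 6.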
The hypothesis that the probability distribution Pr(x) of whitened natural scenes is scale invariant in all its moments arises from the observation that the objects within the natural 3D environment can be viewed from a range of distances, and consequently with a range of angles subtended at the retina. Ruderman (1994) confirms the scale invariance of some other moments of natural scene statistics. Of course, scale invariance must break down at the extremes of high and low frequency, in the same way as translation invariance must break down at the image boundary, because no real imaging device can have infinite spatial coverage and infinite resolution. Similarly, the spatial scaling of the visual code can be expected to hold only across a finite range of scales. Field (1987) proposed that cortical simple cells form a complete, quasi-orthogonal code for images (i.e., the basis vectors are nearly orthogonal but not exactly so), in which various Gabor-like filters are similar to one another under 2D translations, rotations, and scale magnifications. In Field’s code, filters of each size scale would be spaced regularly and uniformly in position and orientation angle to ensure quasi-orthogonality and spaced regularly in
² (1/2π|f|) dE/d|f| is the 1D average over orientations of the 2D power spectrum d²E/df1 df2. Measurements give (1/2π|f|) dE/d|f| ∝ 1/|f|^(2−γ), where |γ| is not significantly different from zero.
log|fm|, where fm is a filter’s mean³ 2D spatial frequency. (Similarly shaped filters that encode different scale sizes have spectra that span equal intervals of log|f|, because magnification corresponds to translation of the frequency spectrum in log|f|.) Field (1987) argued that these properties of the code follow from the requirements of code completeness and economy, assuming the basic invariances of the statistics of natural scenes. Given a number of basis vectors equal to the number of image pixels, quasi-orthogonality would ensure both of these requirements. The number of Gabor-like filters required to span the interval d log|f| completely would be proportional to |f|², because each individual filter spans only a region of the image whose area is inversely proportional to |fm|², and so, to span the whole image area at the |fm| frequency scale, one would need a number of Gabor-like filters proportional to |fm|². Another way to see this is to note that the complete subset of filters of mean frequency a|fm| spans a² times as many of the 2D Fourier coefficients as the complete subset of filters of mean frequency |fm|. The same is true of truly orthogonal and complete wavelet transforms, such as the Daubechies (1988) transform. As is explained rigorously in section 5, this |fm|² dependence of the number of receptive fields is consistent with spatial scaling. If unwhitened natural scenes have 1/|f|² contrast spectra on average, the integrated contrast dE/d log|f| within each equal d log|f| interval will be independent of |f|. However, the number of Gabor-like filters required to span the interval d log|f| completely is proportional to |f|².
The variance⁴ ⟨(x · ŵ)²⟩ of projections of unwhitened images x, along unit-|ŵ| Gabor-like filters ŵ that are similar to one another under scale magnifications, must therefore be proportional to 1/|fm|², in order that the sum of such variances over all |fm|² filters having the same fm will be constant (see Webber, 1991b, for further discussion). It follows from the convolution theorem that natural images must be prewhitened by convolution with a phaseless filter having a real Fourier transform proportional to |f| (up to the cutoff at the Nyquist frequency), if the variances of projections x · ŵ along all Gabor-like filters are to be similar regardless of their fm. A simpler way to see this is to note that whitening the contrast spectrum makes the covariance matrix isotropic, so that the variance will be the same along any direction in x-space. Of course, equalizing the variances ⟨(x · ŵ)²⟩ does not guarantee that two distributions Pr(x · ŵ) will be the same for two Gabor-like filters ŵ that can be mapped onto one another by magnification, although this does follow if the statistics Pr(x) of the whitened images are scale invariant. Spectral whitening is often introduced ad hoc by authors to demonstrate the extraction of oriented bandpass receptive fields, but this article explains why this is necessary for single-cell rules. This may be the purpose of on-center off-surround preprocessing in the retina and lateral geniculate nucleus (in which the positive and negative components of whitened images are respectively represented on separate on-center and off-center cells with nonnegative outputs).⁵

³ The mean 2D frequency fm is ∫ df w̃(f)* w̃(f) f, where w̃(f) is the 2D Fourier transform of the normalized receptive field w(i) and the integral is taken over just one of the (symmetric) pair of lobes of the 2D transform. Defining the cell’s orientation axis such that the coordinates of the 2D vector fm parallel and perpendicular to it satisfy (fm,∥, fm,⊥) = (0, |fm|), the mean is fm,⊥ = ∫_{−∞}^{∞} df∥ ∫_0^{∞} df⊥ w̃(f∥, f⊥)* w̃(f∥, f⊥) f⊥.

⁴ Assuming the mean ⟨x⟩ has been subtracted away.

5 Implications of the Basic Approximate Invariances of Natural Scenes
Two-dimensional translations, rotations, and scale magnifications are orthogonal⁶ transformations in the n-dimensional space of n-pixel images x, and so they belong to the class of transformations to which the spontaneous symmetry-breaking theory can be applied. Consequently, these basic invariances of the statistics Pr(x) of spectrally whitened natural scenes have implications for the convergent receptive fields of single-cell learning rules of the class defined in section 2. Using different reasoning and assumptions, the spontaneous symmetry-breaking theory predicts completeness and spatial scaling properties similar to those of Field’s code; the predicted distributions of the filters’ orientations, image locations, and log|fm| have the same functional forms as in Field’s code. However, the theory does not predict Field’s quasi-orthogonality property; the filters will not be spaced at regular intervals in orientation, location, and log|fm|, because single-cell rules have no neural coupling to enforce regular spacings.

⁵ In this article’s experimental demonstrations, prewhitened images are represented with pixel values x that can be negative as well as positive (case 1), not with respective nonnegative values x⁺ and x⁻ from separate on-center and off-center lateral geniculate nucleus (LGN) cells (case 2). Other authors who assume prewhitening in deriving simple-cell receptive fields (Olshausen & Field, 1996; Bell & Sejnowski, 1997, for example) have not addressed the implications for the subsequent neural network of case 2 prewhitening. The distinction between the two cases will actually be irrelevant if respective synaptic weights w⁺ and w⁻ from co-located on-center and off-center inputs are constrained such that w⁻ = −w⁺. This is because x⁺w⁺ + x⁻w⁻ in case 2 yields the same contribution to x · w or |x − w| as the corresponding xw in case 1, if −w⁻ = w⁺ = w (because x⁻ = 0 when x⁺ = x, and x⁺ = 0 when x⁻ = −x). This constraint of opposite-valued synapses from co-located on-center and off-center LGN cells might arise through simultaneous adaptation of LGN and simple-cell receptive fields, if LGN adaptation is influenced by a “top-down” self-supervision mechanism mediated via these synapses, such as that described by Webber (1997) for the Hebbian threshold projection-pursuit neuron.

⁶ A transformation i → i′ of the 2D image coordinate i = (i1, i2), of the form i′(i) = aOi + t, where O is a 2 × 2 rotation matrix, t is a 2D translation vector, and a is the magnification scale factor, corresponds to a linear redistribution x′(i′) = ∫_{−∞}^{∞} di M(i, i′) x(i) of image amplitude x(i) with matrix element M(i, i′) = (1/a) δ²(i − O⁻¹(i′ − t)/a). It can be shown that this linear transformation preserves total contrast, ∫_{−∞}^{∞} di′ [x′(i′)]² = ∫_{−∞}^{∞} di [x(i)]², and so is orthogonal in the n-dimensional space of image vectors x.
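The case 1/case 2 equivalence claimed in footnote 5 is easy to verify numerically (an illustrative check with random vectors; the variable names are invented here):

```python
import numpy as np

# Splitting a signed (whitened) input x into nonnegative on-center and
# off-center channels x+ = max(x, 0), x- = max(-x, 0) leaves the neuron's
# input x . w unchanged, provided co-located synapses satisfy w- = -w+.
rng = np.random.default_rng(4)
x = rng.standard_normal(100)                 # case 1: signed pixel values
w = rng.standard_normal(100)                 # case 1 synaptic weights

x_on, x_off = np.maximum(x, 0), np.maximum(-x, 0)   # case 2 channels
w_on, w_off = w, -w                                  # constraint w- = -w+

assert np.allclose(x_on @ w_on + x_off @ w_off, x @ w)
```

The identity is exact, since max(x, 0) − max(−x, 0) = x for every component.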
The first kind of prediction is that if such a learning rule gives rise to an oriented, localized, bandpass receptive field, then, starting from different random initial states, the same learning rule will give rise to other, similarly shaped receptive fields, which can be mapped onto the first one and onto each other by those 2D translations, rotations, and scale magnifications under which Pr(x) is invariant. This prediction has the consequence that a large enough number of cells, adapting independently from initial states distributed randomly with zero prior information, will converge to span every combination of image location, orientation, and size scale, because the uniform prior distribution spans the basin of attraction of every possible convergent state. This is the property of code completeness. The second kind of prediction follows from the fact that the areas and shapes of basins of attraction, and the w-space geometrical relationships between them, are unchanged if the same orthogonal transformation is applied to any set of convergent states w∞. If one applies a magnification of scale factor a to a set of convergent states having mean spatial frequencies in the 2D interval dfm = dfm,1 dfm,2 around fm = (fm,1, fm,2), one generates an equal number of convergent states having mean frequencies in the 2D interval d(fm/a) = (dfm)/a² around fm/a. The code is scale invariant if the image statistics are; that is, it has the spatial scaling property. The ŵ-space density of these convergent states is unchanged by this orthogonal transformation because their w-space geometrical relationships are unchanged, but in the 2D image-coordinate space i = (i1, i2), these states are dispersed over an image area a² times larger. Therefore, the i-space density of convergent states (i.e., the number per unit image area) in the interval (dfm)/a² around fm/a is smaller than the i-space density in the interval dfm around fm, by a factor 1/a².
The i-space density in the interval (dfm) around fm/a is equal to the i-space density in the same-sized interval dfm around fm. In other words, if N is the total number of convergent states of all size scales per unit image area, d²N/dfm,1 dfm,2 will be independent of |fm|. If N is rotation invariant, d²N/dfm,1 dfm,2 must therefore be independent of fm, that is, uniform over the fm-plane. Integrating, the number dN/d log|fm| per unit image area within each equal magnification-ratio interval d log|fm| (the number per unit image area at each size scale) will rise in proportion to |fm|², as with Field’s quasi-orthogonal code. Both the ŵ-space density and the i-space density of convergent states remain unchanged under 2D translations and rotations, if the statistics Pr(x) are invariant under these orthogonal transformations. Consequently, the total number N of convergent states per unit image area will be uniformly distributed with image position i and with orientation angle, to an extent determined by the accuracy of the translation and rotation invariance of the image statistics. The quantitative predictions of the second kind show that no image location, orientation, or size scale will be encoded any more or less completely or economically than any other. The code itself will be translation, rotation, and scale invariant and will have the spatial scaling property. Field (1987) argued
Figure 1: An example of the natural scenes used for training in the demonstrations: raw (left) and spectrally whitened (right).
for quasi-orthogonality on grounds of code completeness and economy, implicitly assuming that cortical cells are in short supply and that coupling between them would be necessary to enforce completeness. Here we see that (over-)completeness and invariance of the visual code under translation, rotation, and spatial scaling follow from the spontaneous symmetry-breaking theory, assuming zero prior information, enough neurons, and only a simple single-cell learning rule without coupling, if the statistics of whitened natural scenes are sufficiently accurately invariant under 2D translations, rotations, and scale magnifications. (Quasi-orthogonality, however, is not a consequence of the spontaneous symmetry-breaking theory for single-cell rules, because coupling between cells would be necessary to enforce the orthogonality.)

6 Computational Demonstrations
To test the prediction of scale invariance over a significant range of scales, one needs a much larger input window than has been used by other authors in deriving oriented receptive fields. In these experiments, a training set of 10,000 images was obtained by excising randomly located 128 × 128 pixel squares from larger digitized photographs of natural scenes (see the example in Figure 1). The sharp discontinuity that exists at the boundary of aperiodic images mars the assumed scale invariance of the image statistics, because the spectral whitening proportionally amplifies the higher-frequency components of the discontinuity, producing an artifact that extends far in from the boundary. For the small images for which these experiments are currently feasible, care must therefore be taken to minimize the biases introduced by this artifact, by softening the edge discontinuities before whitening. Even so, these biases cannot be rendered entirely negligible for such small images. To this end, the image amplitude of each 128 × 128 pixel square is rolled off softly to zero over a range of 10 pixels before whitening, by modulation with a circular window boundary of diameter roughly 118 pixels. As a function of the 2D pixel coordinate i, the modulation function is erf[(2/10)(118/2 − |i|)]. A circular modulation window is used to avoid introducing artificial orientation-dependent bias into the image statistics on whitening. Even with a softened boundary, the scale invariance of the image statistics must still break down at the lowest spatial frequencies, because the window is not infinite in extent. The scale and rotation invariance must also break down at high frequencies, because the images are sampled on an anisotropic (i.e., square) grid with only finite resolution. Each windowed image is then independently normalized to constant total squared amplitude. The images are then all spectrally whitened, by convolution with the same on-center off-surround kernel function, whose Fourier transform is phaseless (real); the modulus of the kernel’s Fourier transform is defined to be the ensemble-mean 2D power spectrum of the unwhitened images raised to the power −1/2, because the convolution theorem ensures that convolution with such a kernel will render the mean 2D power spectrum of the whitened images flat (f-independent). The whitening introduces negative coordinates into the resulting 128 × 128-dimensional training vectors x, because it removes the DC mean of each image almost entirely.
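The windowing step can be sketched as follows (an interpretation, not the article’s code: the quoted erf expression is taken as the roll-off profile and rescaled onto (0, 1), since erf itself ranges over (−1, 1), and the grid conventions are choices made here):

```python
import numpy as np
from math import erf

# Soft circular window: amplitude rolled off over ~10 pixels at a circular
# boundary of diameter ~118 pixels inside a 128 x 128 square, using the
# erf[(2/10)(118/2 - |i|)] transition profile mapped onto (0, 1).
n = 128
coords = np.arange(n) - (n - 1) / 2.0        # pixel coordinates about center
i1, i2 = np.meshgrid(coords, coords, indexing="ij")
radius = np.sqrt(i1**2 + i2**2)              # |i|
window = 0.5 * (1.0 + np.vectorize(erf)((2.0 / 10.0) * (118.0 / 2.0 - radius)))

assert window[n // 2, n // 2] > 0.999        # ~1 well inside the circle
assert window[0, 0] < 1e-6                   # ~0 at the square's corners
```

Multiplying each excised image by `window` before whitening suppresses the boundary discontinuity without introducing any orientation-dependent bias, since the window is circularly symmetric.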
Five thousand sigmoid, Hebbian, projection-pursuit neurons, initialized randomly and independently according to the zero-prior-information assumption described in section 3, were trained independently to convergence on the same distribution of whitened training images, according to the same single-cell learning rule, with the same values of the (nonadaptive) neural parameters (see the appendix for details). This learning rule is identical to one proven analytically by Webber (2000) to be capable of invariance discovery via the spontaneous symmetry-breaking mechanism, except that the more conventional and biologically plausible sigmoid response function is used here instead of the more analytically tractable piecewise linear function. The 128 × 128-pixel training vectors are presented to the 128 × 128-synapse weight vectors with relative position offsets that are uniformly distributed in 2D, subject to 2D “wrap-around” periodic boundary conditions (see the appendix). This ensures that the weight vectors experience translationally invariant image statistics. Figures 3 through 6 show scatter plots and histograms of measurements made on this sample⁷ of 5000 independently trained receptive fields.

⁷ Of course, 5000 neurons are fewer than the 16,384 that would be required if they were actually used together to form a complete vector basis for reconstructing 128 × 128-pixel images. However, that is not the purpose of this experiment, which is to generate histograms that are sufficiently representative to infer statistical properties of the overcomplete code that would result if an indefinitely large number of receptive fields were generated by the same statistically independent sampling process.

Figure 2: Examples of convergent weight vector states, all obtained by numerical simulation of the same single-cell learning rule with the same parameter values and the same training set of whitened natural scenes, but starting from different initial states with synapses initialized randomly and independently according to the zero-prior-information assumption. White corresponds to the most positive synaptic weights of each receptive field and black to the most negative. Each receptive field is displayed with a different linear mapping from weight value to gray level to bring out its contrast, so the fact that the four backgrounds are represented by different shades of gray does not indicate that the background weight values are different; they are in fact close to zero in all four cases.

Figure 2 illustrates the range of orientations, image locations, and scale sizes found in the sample. Figure 3 tests the prediction (for images large enough for boundary artifacts to be negligible) that a single-cell learning rule, initialized randomly and independently with zero prior information, will generate receptive fields that populate the mean spatial frequency plane with uniform density d²N/dfm,1 dfm,2. Although biased by explainable artifacts at the extreme spatial frequencies (see the caption to Figure 3), the underlying uniformity predicted by the spontaneous symmetry-breaking theory is still visible. The imperfect statistical symmetry under 2D rotations, which gives rise to greater numbers of cells tuned to horizontal and vertical orientations in Figure 3, is a real property of natural scenes, although the image statistics still have sufficient symmetry for the full range of orientations to be encoded by the single-cell rule. Figure 4 tests the theory’s spatial scaling prediction that dN/d log|fm|, the number of receptive fields per unit image area at each size scale, will rise in proportion to |fm|². Despite explainable artifacts at the extreme spatial frequencies (see the caption to Figure 4), the trend predicted by the theory is still visible. The spontaneous symmetry-breaking theory does not actually predict that all convergent states have similarly shaped receptive fields, but it does predict that for each shape of convergent state, there should be others of all the different size scales similar to it in shape; thus, a scatter plot of receptive field bandwidth against mean spatial frequency should be distributed about a line of proportionality, though not necessarily concentrated on a line of proportionality. Figure 5 shows that this is as observed, except that the spatial scaling breaks down at the extreme spatial frequencies due to explainable artifacts (see the caption to Figure 5). The theory predicts that the various convergent states of a single-cell learning rule will encode the complete range of image locations uniformly, provided that the statistics of the training images are sufficiently invariant under 2D translations.
In these simulations, the image statistics are accurately translation invariant, and Figure 6 shows that the observed distribution of receptive field locations is accurately uniform. Figure 6 also shows that this random distribution of receptive field locations is isotropic with respect to receptive field orientation axis. (This differs from the quasiorthogonal code of Field, 1987, in which cells of similar orientation had to be spaced regularly on an anisotropic lattice in order to ensure quasiorthogonality.)

7 Conclusion
Although the dynamical equation, 2.1, that governs the learning is in general invariant under all the orthogonal symmetries of the data's distribution, the receptive fields generated by the learning are not in general invariant under these transformations. The spontaneous symmetry-breaking theory explains how this happens and that the symmetries of the data instead map alternative receptive fields onto one another: receptive fields that will develop with equal probability, provided that their synapses are randomly initialized with zero prior information. The article has shown that this theory implies that single-cell learning rules can generate (over-)complete visual codes that are collectively invariant under scale magnifications, 2D translations, and rotations, without the need for coupling between cells to discourage redundancy or statistical dependence or to enforce orthogonality on the code. Experimental verification of the predictions of this theory also provides a novel test of the scale invariance of the statistics of natural scenes to high orders, because the theory assumes that the entire probability distribution Pr(x), not just a few of its low-order moments, is invariant. The prediction of spatial scaling for single-cell rules, given the plausible

Figure 3: Scatter plot of the mean 2D spatial frequency of receptive fields generated from a single-cell learning rule initialized randomly and independently according to the zero prior information assumption. In this experiment, boundary artifacts at high and low frequencies mar the underlying uniformity of the distribution, as does a slight preference for horizontal and vertical orientations, although the plot is still quite uniform over a factor of 5 in |f_m| (0.1 to 0.5 f_Nyquist). The finite pixel sampling resolution will inevitably cause the density of states to roll off to zero over the range 0.5 < |f_m| < 1, because there can be no support from the data for frequencies higher than the Nyquist frequency. The uniformity also breaks down at the low frequencies |f_m| ≈ 0.1 f_Nyquist, which correspond to the 10-pixel rolloff width of the soft circular boundary of the image window: 0.1 f_Nyquist = 1/(10 pixels). There is a hole in the scatter plot for |f_m| less than about 0.06, because receptive fields any larger than this would have to extend across the whole input window and beyond, and there can be no support from the data for correlation lengths larger than the input window (Figure 2, lower right, shows a convergent state near this lower limit). Imperfect statistical symmetry under 2D rotations, which here has given rise to greater numbers of cells tuned to horizontal and vertical orientations, is a real property of natural scenes.

Figure 4: Histogram of the number dN/d log|f_m| of randomly initialized, independent, single cells per unit image area that converge to encode each equal magnification-ratio interval d log|f_m|, that is, the number of receptive fields per unit image area at each size scale. The same range of |f_m| is depicted as in Figure 3: |f_m|² < 0.25. The vertical bars indicate random Poisson sampling errors, and the horizontal bars the histogram bin width corresponding to each sample (each bin covers a range of log_e |f_m| equal to 1/30). The spatial scaling proportionality with |f_m|² is marred by the low-frequency (|f_m|² ≈ 0.01) window-boundary artifact explained in the caption to Figure 3. Proportionality is also marred at higher frequencies by the lack of support from the data beyond the Nyquist frequency; Nyquist aliasing causes measurements of |f_m| to be underestimated for the higher-|f_m| cells, so that the graph rises faster than proportionally for higher-|f_m| cells, before inevitably falling off to zero over the interval 0.25 ≤ |f_m|² < 1, as is explained in the caption to Figure 3. (On grounds of spatial scaling, some of the highest-frequency components of higher-|f_m| receptive fields really belong at frequencies f beyond f_Nyquist; the artifact of Nyquist aliasing maps these frequencies f onto lower frequencies f_Nyquist − (f − f_Nyquist). Thus some cells that belong in the interval 0.25 ≤ |f_m|² < 1 will be falsely counted with the |f_m|² < 0.25 cells as a result.)
Figure 5: Irrespective of receptive field size, the bandwidth of each receptive field in the direction perpendicular to its orientation axis, defined as

$$\left[ \int_{-\infty}^{\infty} df_{\parallel} \int_{0}^{\infty} df_{\perp}\; \tilde w(f_{\parallel}, f_{\perp})^{*}\, \tilde w(f_{\parallel}, f_{\perp})\, \big( f_{\perp} - |f_m| \big)^2 \right]^{1/2},$$

is roughly equal to the receptive field's mean spatial frequency |f_m|, where w̃(f_∥, f_⊥) is the 2D Fourier transform of the (normalized) receptive field w(i_∥, i_⊥), and where the symbols ∥ and ⊥ indicate coordinate axes parallel and perpendicular to the receptive field's orientation axis, respectively. (The orientation axis is defined such that the coordinates of f_m parallel and perpendicular to it satisfy (f_{m,∥}, f_{m,⊥}) = (0, |f_m|).) This scale invariance breaks down below |f_m| ≈ 0.1 f_Nyquist, as was seen in Figure 3. It also breaks down at the highest frequencies as the rise of bandwidth begins to shallow out due to the artifact of Nyquist aliasing (see the caption to Figure 4); in fact, Nyquist aliasing will bring the bandwidth back down to zero at very high |f_m| approaching f_Nyquist, because such receptive fields' high-frequency tails cannot extend beyond the Nyquist cutoff, and because their low-frequency tails cannot extend very far below |f_m| (otherwise the mean |f_m| would be lower).
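The bandwidth definition in the caption to Figure 5 can be made concrete numerically. The sketch below is illustrative only (function names are mine, and it assumes the field's orientation axis is horizontal, so f_⊥ is simply the vertical frequency); the continuous Fourier integral is replaced by a discrete FFT over the normalized power spectrum:

```python
import numpy as np

def mean_freq_and_bandwidth(w):
    """Mean perpendicular spatial frequency |f_m| and perpendicular
    bandwidth sqrt(sum P(f) (f_perp - |f_m|)^2) of a receptive field w,
    for a field whose orientation axis is horizontal."""
    power = np.abs(np.fft.fftshift(np.fft.fft2(w))) ** 2
    f_perp = np.fft.fftshift(np.fft.fftfreq(w.shape[0]))  # cycles/pixel
    keep = f_perp >= 0                      # positive-frequency half-plane
    p = power[keep, :].sum(axis=1)          # integrate over f_parallel
    p /= p.sum()                            # normalized 1D power in f_perp
    fm = np.sum(p * f_perp[keep])           # mean |f_m|
    bw = np.sqrt(np.sum(p * (f_perp[keep] - fm) ** 2))
    return fm, bw

# A horizontally oriented Gabor-like field, carrier at 0.2 cycles/pixel.
n = 64
i, j = np.mgrid[0:n, 0:n] - n // 2
gabor = np.cos(2 * np.pi * 0.2 * i) * np.exp(-(i**2 + j**2) / (2 * 8.0**2))
fm, bw = mean_freq_and_bandwidth(gabor)
```

For such a field, fm recovers the carrier frequency and bw reflects the envelope's spectral spread, which is the quantity plotted against |f_m| in Figure 5.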
assumption of synaptic initialization with zero prior information, provides a possible resolution of the problems other authors have encountered in trying to explain the spatial scaling properties of the visual cortex using multiple-cell learning rules. The theory of spontaneous symmetry breaking is an economical one, which we have previously shown also provides a novel mechanism for unsupervised invariance discovery, potentially relevant to quite general transformations.
Figure 6: Scatter plot of the locations in the 2D image plane of the receptive fields' centroids. Note that the position vectors i_m plotted here are given not in the coordinate system (i_{m,horiz}, i_{m,vert}) defined by the horizontal and vertical image directions but in a rotated, co-centric system of coordinates (i_{m,∥}, i_{m,⊥}) defined parallel and perpendicular to each receptive field's orientation axis. In other words, each point plotted is the 2D position vector of a receptive field's centroid, but each position vector i_m has been rotated about the fixed origin i = 0, through an angle equal to that receptive field's orientation angle. Plotting each receptive field's location in a coordinate system rotated into alignment with its individual orientation axis reveals that the random distribution of centroids is isotropic with respect to orientation. (Note that the scatter plot will be uniform throughout the circle |i_m| < 64 wherever the arbitrary fixed origin i = 0 has been defined, because of the translation invariance of the image statistics and the wrap-around boundary conditions at i_horiz = ±64 and i_vert = ±64.)
Appendix: Implementation of the Learning Rule Used in the Demonstrations
The learning rule used for the demonstrations has the conventional sigmoid response function,

$$v(\mathbf{x} \cdot \mathbf{w}) = \left[ 1 + \exp\left( -\,\frac{\mathbf{x} \cdot \mathbf{w} - \theta}{s} \right) \right]^{-1}. \tag{A.1}$$
The parameter θ determines the height of the neuron's response threshold, and the parameter s determines the softness of its sigmoid response function. The neuron's weight vector rotates under the effect of Hebbian adaptation to a training ensemble {x} drawn from the distribution Pr(x) of training patterns,

$$\frac{d\mathbf{w}(t)}{dt} = \lambda(t) \int_{\{\mathbf{x}\}} d\mathbf{x}\, \Pr(\mathbf{x})\, \mathbf{x}\, v(\mathbf{w} \cdot \mathbf{x}) - (\text{length-constraint term parallel to } \mathbf{w}), \tag{A.2}$$
subject to a Lagrange length constraint |w| = 1. Clearly this rule is of the general form 2.1. It can be shown (Webber, 2000) that this projection-pursuit rule converges on those unit directions w that are maxima (subject to the constraint |w| = 1) of an objective function,

$$U(\mathbf{w}) = \int_{\{\mathbf{x}\}} d\mathbf{x}\, \Pr(\mathbf{x})\, u(\mathbf{w} \cdot \mathbf{x}), \quad \text{where} \quad u(a) = \int^{a} da'\, v(a'), \tag{A.3}$$

and that U(w) is invariant under any orthogonal symmetry of Pr(x). In practice, the length constraint is implemented simply by continuously rescaling the weight vector after each unit time step:

$$\mathbf{w}(t+1) = \frac{\mathbf{w}(t) + \lambda(t) \int d\mathbf{x}\, \Pr(\mathbf{x})\, \mathbf{x}\, v(\mathbf{w} \cdot \mathbf{x})}{\left| \mathbf{w}(t) + \lambda(t) \int d\mathbf{x}\, \Pr(\mathbf{x})\, \mathbf{x}\, v(\mathbf{w} \cdot \mathbf{x}) \right|}. \tag{A.4}$$
The variable rate parameter λ(t) is determined from a fixed rate constant λ_0:

$$\lambda(t) = \lambda_0 \Big/ \left| \int_{\{\mathbf{x}\}} d\mathbf{x}\, \Pr(\mathbf{x})\, \mathbf{x}\, v(\mathbf{w} \cdot \mathbf{x}) \right|. \tag{A.5}$$

Webber (1994) shows that continuous rescaling (see equation A.4) is equivalent to the form of equation A.2 provided λ_0 ≪ 1. In these demonstrations, λ_0 = 0.5. In x-space length units chosen such that each pixel's amplitude variance over Pr(x) is 1 squared unit, the values of all 5000 neurons' (nonadaptive) parameters were set to θ = 2.5 units and s = 0.05 unit. (The variance ⟨(x·ŵ)²⟩ of projections along any unit direction ŵ will be 1 squared unit, because the covariance matrix is isotropic.)
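As a concrete illustration, one unit time step of equations A.1, A.4, and A.5 can be sketched as below. This is a minimal sketch for intuition, not the author's code: the ensemble integral over Pr(x) is replaced by an average over a finite batch of training vectors, and the form of the λ(t) normalization follows the reconstruction of equation A.5 above.

```python
import numpy as np

def sigmoid_response(a, theta=2.5, s=0.05):
    """Sigmoid response v of equation A.1, threshold theta, softness s."""
    # Clip the exponent for numerical safety with extreme projections.
    return 1.0 / (1.0 + np.exp(np.clip(-(a - theta) / s, -500, 500)))

def train_step(w, X, lam0=0.5, theta=2.5, s=0.05):
    """One unit time step of the rescaled Hebbian rule (equation A.4),
    with the integral over Pr(x) approximated by an average over the
    rows of the batch X."""
    v = sigmoid_response(X @ w, theta, s)   # responses to the batch
    hebb = X.T @ v / len(X)                 # batch estimate of <x v(w.x)>
    lam = lam0 / np.linalg.norm(hebb)       # variable rate, equation A.5
    w_new = w + lam * hebb                  # Hebbian step ...
    return w_new / np.linalg.norm(w_new)    # ... rescaled to |w| = 1
```

In the demonstrations of the text, X would hold whitened natural-image patches presented at all wrap-around offsets; here any whitened data of matching dimension illustrates the mechanics.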
At each unit time step in equation A.4, 100 windowed and whitened 128×128-dimensional training vectors x = x(i_horiz, i_vert) (where the image amplitude x is a function of 2D position coordinates i_horiz, i_vert that both run from −64 to +64), drawn randomly from a total training set of 10,000 such images excised from larger natural scenes, were presented to each neuron, at every one of the 128×128 possible position offsets (Δi_horiz, Δi_vert) with respect to its 128×128-synapse weight vector w = w(i_horiz, i_vert) (subject to "wrap-around" periodic boundary conditions in both image coordinates at i_horiz = ±64 and i_vert = ±64). Thus, the image statistics experienced by the learning rule are accurately translation invariant by construction. In these demonstrations, all randomly initialized neurons reached convergence within 125 unit time steps of equation A.4.
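The wrap-around presentation scheme just described can be sketched as follows. This is an illustrative fragment, not the author's code; the function name is mine, and a single offset is applied per call (the text presents each image at all 128×128 offsets):

```python
import numpy as np

def present_at_offset(image, d_horiz, d_vert):
    """Shift a training image by the 2D offset (d_horiz, d_vert) with
    periodic 'wrap-around' boundary conditions. Sweeping all offsets
    makes the image statistics seen by the weight vector exactly
    translation invariant, since every pixel value visits every
    position equally often."""
    return np.roll(image, shift=(d_vert, d_horiz), axis=(0, 1))
```

Because np.roll permutes pixels without discarding any, each shifted presentation contains exactly the same pixel values as the original image.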
References

Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Blais, B. S., Intrator, N., Shouval, H., & Cooper, L. N. (1998). Receptive field formation in natural scene environments: Comparison of single-cell learning rules. Neural Computation, 10, 1797–1813.
Burton, G. J., & Moorhead, I. R. (1987). Color and spatial structure in natural scenes. Applied Optics, 26, 157–170.
Cohen, R. W., Gorog, I., & Carlson, C. R. (1975). Image descriptors for displays (Tech. Rep. No. N00014-74-C0184). Arlington, VA: Office of Naval Research.
Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics, 41, 909–996.
Deriugin, N. G. (1956). The power spectrum and the correlation function of the television signal. Telecommunications, 1, 1–12.
DeValois, R. L., Albrecht, D. G., & Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Research, 22, 545–559.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America A, 4, 2379–2394.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Friedman, J. H., & Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23, 881–889.
Hebb, D. O. (1949). The organisation of behaviour. New York: Wiley.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Koenderink, J. J. (1984). The structure of images. Biological Cybernetics, 50, 363–370.
Li, Z., & Atick, J. J. (1994). Towards a theory of striate cortex. Neural Computation, 6, 127–146.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Ruderman, D. L. (1994). The statistics of natural images. Network: Computation in Neural Systems, 5, 517–548.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265, 359–366.
Webber, C. J. S. (1991a). Self-organisation of position- and deformation-tolerant neural representations. Network: Computation in Neural Systems, 2, 43–61.
Webber, C. J. S. (1991b). Competitive learning, natural images and cortical cells. Network: Computation in Neural Systems, 2, 169–187.
Webber, C. J. S. (1994). Self-organisation of transformation-invariant detectors for constituents of perceptual patterns. Network: Computation in Neural Systems, 5, 471–496.
Webber, C. J. S. (1997). Generalisation and discrimination emerge from a self-organising componential network: A speech example. Network: Computation in Neural Systems, 8, 425–440.
Webber, C. J. S. (2000). Self-organisation of symmetry networks: Transformation invariance from the spontaneous symmetry-breaking mechanism. Neural Computation, 12, 565–596.

Received December 1, 1999; accepted August 1, 2000.
LETTER
Communicated by David Touretzky
Localist Attractor Networks

Richard S. Zemel
Department of Computer Science, University of Toronto, Toronto, ON M5S 1A4, Canada

Michael C. Mozer
Department of Computer Science, University of Colorado, Boulder, CO 80309-0430, U.S.A.

Attractor networks, which map an input space to a discrete output space, are useful for pattern completion: cleaning up noisy or missing input features. However, designing a net to have a given set of attractors is notoriously tricky; training procedures are CPU intensive and often produce spurious attractors and ill-conditioned attractor basins. These difficulties occur because each connection in the network participates in the encoding of multiple attractors. We describe an alternative formulation of attractor networks in which the encoding of knowledge is local, not distributed. Although localist attractor networks have similar dynamics to their distributed counterparts, they are much easier to work with and interpret. We propose a statistical formulation of localist attractor net dynamics, which yields a convergence proof and a mathematical interpretation of model parameters. We present simulation experiments that explore the behavior of localist attractor networks, showing that they yield few spurious attractors, and they readily exhibit two desirable properties of psychological and neurobiological models: priming (faster convergence to an attractor if the attractor has been recently visited) and gang effects (in which the presence of an attractor enhances the attractor basins of neighboring attractors).
1 Introduction
Attractor networks map an input space, usually continuous, to a sparse output space. For an interesting and important class of attractor nets, the output space is composed of a discrete set of alternatives. Attractor networks have a long history in neural network research, from relaxation models in vision (Marr & Poggio, 1976; Hummel & Zucker, 1983), to the early work of Hopfield (1982, 1984), to the stochastic Boltzmann machine (Ackley, Hinton, & Sejnowski, 1985; Movellan & McClelland, 1993), to symmetric recurrent backpropagation networks (Almeida, 1987; Pineda, 1987).

Neural Computation 13, 1045–1064 (2001) © 2001 Massachusetts Institute of Technology
Figure 1: (a) A two-dimensional space can be carved into three regions (dashed lines) by an attractor net. The dynamics of the net cause an input pattern (the X) to be mapped to one of the attractors (the O's). The solid line shows the temporal trajectory of the network state, ŷ. (b) The actual energy landscape during updates (equations defined below) of a localist attractor net as a function of ŷ, when the input is fixed at the origin and there are three attractors, with a uniform prior. The shapes of attractor basins are influenced by the proximity of attractors to one another (the gang effect). The origin of the space (depicted by a point) is equidistant from the attractor on the left and the attractor on the upper right, yet the origin clearly lies in the basin of the right attractors.
Attractor networks are often used for pattern completion, which involves filling in missing, noisy, or incorrect features in an input pattern. The initial state of the attractor net is determined by the input pattern. Over time, the state is drawn to one of a predefined set of states, the attractors. An attractor net is generally implemented by a set of visible units whose activity represents the instantaneous state and, optionally, a set of hidden units that assist in the computation. Attractor dynamics arise from interactions among the units and can be described by a state trajectory (see Figure 1a). Pattern completion can be accomplished using algorithms other than attractor nets. For example, a nearest-neighbor classifier will compare the input to each of a set of predefined prototypes and will select as the completion the prototype with the smallest Mahalanobis distance to the input (see, e.g., Duda & Hart, 1973). Attractor networks have several benefits over this scheme. First, if the attractors can be characterized by compositional structure, it is inefficient to enumerate them as required by a nearest-neighbor classifier; the compositional structure can be encoded implicitly in the attractor network weights. Second, attractor networks have some degree of biological plausibility (Amit & Brunel, 1997; Kay, Lancaster, & Freeman, 1996). Third, in most formulations of attractor nets (e.g., Golden, 1988; Hopfield, 1982), the dynamics can be characterized by gradient descent in an energy landscape, allowing one to partition the output space into attractor basins (see Figure 1a). In many domains, the energy landscape and the corresponding structure of the attractor basins are key to obtaining desirable behavior from an attractor net. Consider a domain where attractor nets have proven to be extremely valuable: psychological models of human cognition. Instead of homogeneous attractor basins, modelers often sculpt basins that depend on the recent history of the network (priming) and the arrangement of attractors in the space (the gang effect). Priming is the situation where a network is faster to land at an attractor if it has recently visited the same attractor. Priming is achieved by broadening and deepening attractor basins as they are visited (Becker, Moscovitch, Behrmann, & Joordens, 1997; McClelland & Rumelhart, 1985; Mozer, Sitton, & Farah, 1998; Plaut & Shallice, 1994). This mechanism allows modelers to account for a ubiquitous property of human behavior: people are faster to perceive a stimulus if they have recently experienced the same or a closely related stimulus or made the same or a closely related response. In the gang effect, the strength or pull of an attractor is influenced by other attractors in its neighborhood. Figure 1b illustrates the gang effect: the proximity of the two rightmost attractors creates a deeper attractor basin, so that if the input starts at the origin, it will get pulled to the right. This effect is an emergent property of the distribution of attractors. The gang effect results in the mutually reinforcing or inhibitory influence of similar items and allows for the explanation of behavioral data in a range of domains including reading (McClelland & Rumelhart, 1981; Plaut, McClelland, Seidenberg, & Patterson, 1996), syntax (Tabor, Juliano, & Tanenhaus, 1997), semantics (McRae, de Sa, & Seidenberg, 1997), memory (Redish & Touretzky, 1998; Samsonovich & McNaughton, 1997), olfaction (Kay et al., 1996), and schizophrenia (Horn & Ruppin, 1995). Training an attractor net is notoriously tricky.
Training procedures are CPU intensive and often produce spurious attractors and ill-conditioned attractor basins (e.g., Mathis, 1997; Rodrigues & Fontanari, 1997). Indeed, we are aware of no existing procedure that can robustly translate an arbitrary specification of an attractor landscape into a set of weights. These difficulties are due to the fact that each connection participates in the specification of multiple attractors; thus, knowledge in the net is distributed over connections. We describe an alternative attractor network model in which knowledge is localized, hence the name localist attractor network. The model has many virtues, including the following:

- It provides a trivial procedure for wiring up the architecture given an attractor landscape.
- Spurious attractors are eliminated.
- An attractor can be primed by adjusting a single parameter of the model.
- The model achieves gang effects.
- The model parameters have a clear mathematical interpretation, which clarifies how the parameters control the qualitative behavior of the model (e.g., the magnitude of gang effects).
- We offer proofs of convergence and stability.

A localist attractor net consists of a set of n state units and m attractor units. Parameters associated with an attractor unit i encode the center in state-space of its attractor basin, denoted w_i, and its pull or strength, denoted π_i. For simplicity, we assume that the attractor basins have a common width σ. The activity of an attractor at time t, q_i(t), reflects the normalized distance from its center to the current state, y(t), weighted by its strength:

$$q_i(t) = \frac{\pi_i\, g(\mathbf{y}(t), \mathbf{w}_i, \sigma(t))}{\sum_j \pi_j\, g(\mathbf{y}(t), \mathbf{w}_j, \sigma(t))} \tag{1.1}$$

$$g(\mathbf{y}, \mathbf{w}, \sigma) = \exp\!\big( -|\mathbf{y} - \mathbf{w}|^2 / 2\sigma^2 \big). \tag{1.2}$$
Thus, the attractors form a layer of normalized radial-basis-function units. In most attractor networks, the input to the net, E, serves as the initial value of the state, and thereafter the state is pulled toward attractors in proportion to their activity. A straightforward expression of this behavior is

$$\mathbf{y}(t+1) = \alpha(t)\,\mathbf{E} + (1 - \alpha(t)) \sum_i q_i(t)\, \mathbf{w}_i, \tag{1.3}$$
where α trades off the external input and attractor influences. We will refer to the external input, E, as the observation. A simple method of moving the state from the observation toward the attractors involves setting α = 1 on the first update and α = 0 thereafter. More generally, however, one might want to reduce α gradually over time, allowing for a persistent effect of the observation on the asymptotic state. In our formulation of a localist attractor network, the attractor width σ is not a fixed, free parameter, but instead is dynamic and model determined:

$$\sigma_y^2(t) = \frac{1}{n} \sum_i q_i(t)\, |\mathbf{y}(t) - \mathbf{w}_i|^2. \tag{1.4}$$
In addition, we include a fixed parameter σ_z that describes the unreliability of the observation; if σ_z = 0, then the observation is wholly reliable, so y = E. Finally, α(t) is then defined as a function of these parameters:

$$\alpha(t) = \sigma_y^2(t) \,/\, \big( \sigma_y^2(t) + \sigma_z^2 \big). \tag{1.5}$$
Below we present a formalism from which these particular update equations can be derived.
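Update equations 1.1 through 1.5 are simple enough to iterate directly. The sketch below is illustrative rather than the authors' code: the update ordering is one of the admissible sequential orderings, ŷ is initialized to the observation E, and a small floor on σ_y² is added for numerical safety.

```python
import numpy as np

def run_localist_attractor(E, W, pi=None, sigma_z=0.1, n_iters=50):
    """Iterate the localist attractor updates (equations 1.1-1.5).

    E: observation (length-n vector); W: m x n attractor centers;
    pi: attractor strengths (uniform if None); sigma_z: observation noise.
    """
    m, n = W.shape
    pi = np.full(m, 1.0 / m) if pi is None else np.asarray(pi) / np.sum(pi)
    y = np.array(E, dtype=float)          # state initialized to observation
    q = np.full(m, 1.0 / m)               # responsibilities start uniform
    for _ in range(n_iters):
        d2 = np.sum((y - W) ** 2, axis=1)            # |y - w_i|^2
        sig_y2 = max(q @ d2 / n, 1e-12)              # eq. 1.4 (floored)
        log_g = -d2 / (2.0 * sig_y2)                 # log of eq. 1.2
        q = pi * np.exp(log_g - log_g.max())         # eq. 1.1 ...
        q /= q.sum()                                 # ... normalized
        alpha = sig_y2 / (sig_y2 + sigma_z ** 2)     # eq. 1.5
        y = alpha * E + (1.0 - alpha) * (q @ W)      # eq. 1.3
    return y, q
```

With a few attractors and an observation nearest one of them, σ_y² shrinks as the responsibilities concentrate, α falls toward zero, and the state is drawn onto that attractor.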
The localist attractor net is motivated by a generative model of the observed input based on the underlying attractor distribution, and the network dynamics corresponds to a search for a maximum likelihood interpretation of the observation. In the following section, we derive this result and then present simulation studies of the architecture.

2 A Maximum Likelihood Formulation
The starting point for the statistical formulation of a localist attractor network is a mixture-of-gaussians model. A standard mixture of gaussians consists of m gaussian density functions in n dimensions. Each gaussian is parameterized by a mean, a covariance matrix, and a mixture coefficient. The mixture model is generative; it is considered to have produced a set of observations. Each observation is generated by selecting a gaussian with probability proportional to the mixture coefficients and then stochastically selecting a point from the corresponding density function. The model parameters (means, covariances, and mixture coefficients) are adjusted to maximize the likelihood of a set of observations. The expectation-maximization (EM) algorithm provides an efficient procedure for estimating the parameters (Dempster, Laird, & Rubin, 1977). The expectation step calculates the posterior probability q_i of each gaussian for each observation, and the maximization step calculates the new parameters based on the previous values and the set of q_i. The operation of this standard mixture-of-gaussians model is cartooned in Figure 2a. The mixture-of-gaussians model can provide an interpretation for a localist attractor network in an unorthodox way. Each gaussian corresponds to an attractor, and an observation corresponds to the state. Now, however, instead of fixing the observation and adjusting the gaussians, we fix the gaussians and adjust the observation. If there is a single observation, α = 0, and all gaussians have uniform spread σ, then equation 1.1 corresponds to the expectation step and equation 1.3 to the maximization step in this unusual mixture model. The operation of this unorthodox mixture-of-gaussians model is cartooned in Figure 2b. Unfortunately, this simple characterization of the localist attractor network does not produce the desired behavior. Many situations produce partial solutions, or blend states, in which the observation does not end up at an attractor. For example, if two gaussians overlap significantly, the most likely value for the observation is midway between them rather than at the center of either gaussian. We therefore extend this mixture-of-gaussians formulation to characterize the attractor network better. We introduce an additional level at the bottom containing the observation. This observation is considered fixed, as in a standard generative model. The state, contained in the middle level, is an internal model of the observation that is generated from the attractors. The attractor dynamics involve an iterative search for the internal model that fits the observation and attractor structure. These features make the formulation more like a standard generative model, in that the internal aspects of the model are manipulated to account for a stable observation. The operation of the localist attractor model is cartooned in Figure 2c.

Figure 2: (a) A standard mixture-of-gaussians model adjusts the parameters of the generative gaussians (elliptical contours in upper rectangle, which represents an n-dimensional space) to fit a set of observations (dots in lower rectangle, which also represents an n-dimensional space). (b) An unorthodox mixture-of-gaussians model initializes the state to correspond to a single observation and then adjusts the state to fit the fixed gaussians. (c) A localist attractor network adds an extra level to this hierarchy: a bottom level containing the observation, which is fixed. The middle level contains the state. The attractor dynamics can then be viewed as an iterative search to find a single state that could have been generated from one of the attractors while still matching the observation.

Formulating the localist attractor network in a generative framework permits the derivation of update equations from a statistical basis. Note that the generative framework is not actually used to generate observations or update the state. Instead, it provides the setting in which to optimize and analyze computation. In this new formulation, each observation is considered to have been generated by moving down this hierarchy:

1. Select an attractor x = i from the set of attractors.
2. Select a state (i.e., a pattern of activity across the state units) based on the preferred location of that attractor: y = w_i + N_y.
3. Select an observation z = yG + N_z.
The observation z produced by a particular state y depends on the generative weight matrix G. In the networks we consider here, the observation and state-spaces are identical, so G is the identity matrix, but the formulation allows for the observation to lie in some other space. N_y and N_z describe the zero-mean, spherical gaussian noise introduced at the two levels, with deviations σ_y and σ_z, respectively. Under this model, one can fit an observation E by finding the posterior distribution over the hidden states (X and Y) given the observation:

$$p(X = i, Y = \mathbf{y} \mid Z = \mathbf{E}) = \frac{p(\mathbf{E} \mid \mathbf{y}, i)\, p(\mathbf{y}, i)}{p(\mathbf{E})} = \frac{p(\mathbf{E} \mid \mathbf{y})\, \pi_i\, p(\mathbf{y} \mid i)}{\int_{\mathbf{y}} p(\mathbf{E} \mid \mathbf{y}) \sum_i \pi_i\, p(\mathbf{y} \mid i)\, d\mathbf{y}}, \tag{2.1}$$

where the conditional distributions are gaussian: p(Y = y | X = i) = G(y | w_i, σ_y) and p(E | Y = y) = G(E | y, σ_z). Evaluating the distribution in equation 2.1 is tractable, because the partition function (the denominator) is a sum of a set of gaussian integrals. This posterior distribution for the attractors corresponds to a distribution in state-space that is a weighted sum of gaussians. During inference using this model, this intermediate level can be considered as encoding the distribution over states implied by the attractor posterior probabilities. However, we impose a key restriction: at any one time, the intermediate level can represent only a single position in state-space rather than the entire distribution over states. This restriction is motivated by the fact that the goal of the computation in the network is to settle on the single best state, and furthermore, at every step of the computation, the network should represent the single most likely state that fits the attractors and observation. Hence, the attractor dynamics corresponds to an iterative search through state-space to find the most likely single state that was generated by the mixture-of-gaussian attractors and in turn generated the observation. To accomplish these objectives, we adopt a variational approach, approximating the posterior by another distribution Q(X, Y | E). Based on this approximation, the network dynamics can be seen as minimizing an objective function that describes an upper bound on the negative log probability of the observation given the model and its parameters. In a variational approach, one is free to choose any form of Q to estimate the posterior distribution, but a better estimate will allow the network to approach a maximum likelihood solution (Saul, Jaakkola, & Jordan, 1996).
Here we select a very simple posterior: Q(X, Y) = q_i δ(Y = ŷ), where q_i = Q(X = i) is the responsibility assigned to attractor i, and ŷ is the estimate of the state that accounts for the observation. The delta function over Y implements the restriction that the explanation of an input consists of a single state.
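Because both noise sources are spherical gaussians, the integral over Y in the denominator of equation 2.1 can be evaluated in closed form: marginalizing out Y gives p(X = i | E) ∝ π_i G(E; w_i, √(σ_y² + σ_z²)). A minimal sketch of this computation in Python follows; the attractor locations, priors, and noise levels are invented for illustration and do not come from the paper's simulations.

```python
import math

# Marginal posterior over attractors, p(X=i | E): integrating out Y in
# equation 2.1 for spherical gaussian noise gives
#   p(X=i | E) proportional to pi_i * G(E; w_i, sqrt(sigma_y^2 + sigma_z^2)).

def attractor_posterior(E, W, priors, sigma_y, sigma_z):
    """Return the responsibilities p(X=i | E) for each attractor w_i."""
    var = sigma_y ** 2 + sigma_z ** 2
    logs = []
    for w, p in zip(W, priors):
        d2 = sum((e - wi) ** 2 for e, wi in zip(E, w))
        logs.append(math.log(p) - d2 / (2.0 * var))
    m = max(logs)                        # log-sum-exp for numerical stability
    unnorm = [math.exp(l - m) for l in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]

W = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]   # three illustrative attractors
priors = [1 / 3, 1 / 3, 1 / 3]
q = attractor_posterior((0.8, 0.1), W, priors, sigma_y=0.5, sigma_z=1.0)
```

With uniform priors the responsibilities simply rank the attractors by distance from the observation, which is the starting point for the iterative search described in the text.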
1052
Richard S. Zemel and Michael C. Mozer
Given this posterior distribution, the objective for the network is to minimize the free energy F, described here for a particular observation E:

$$
F(q, \hat{y} \mid E) \;=\; \sum_i \int Q(X = i, Y = y)\, \ln \frac{Q(X = i, Y = y)}{P(E, X = i, Y = y)}\, dy
\;=\; \sum_i q_i \ln \frac{q_i}{\pi_i} \;-\; \ln p(E \mid \hat{y}) \;-\; \sum_i q_i \ln p(\hat{y} \mid i),
$$

where π_i is the prior probability (mixture coefficient) associated with attractor i. These priors are parameters of the generative model, as are σ_y, σ_z, and w. Thus, the free energy is:

$$
F(q, \hat{y} \mid E) \;=\; \sum_i q_i \ln \frac{q_i}{\pi_i} \;+\; \frac{1}{2\sigma_z^2}\, |E - \hat{y}|^2 \;+\; \frac{1}{2\sigma_y^2} \sum_i q_i\, |\hat{y} - w_i|^2 \;+\; n \ln(\sigma_y \sigma_z). \tag{2.2}
$$
Given an observation, a good set of parameters for the estimated posterior Q can be determined by alternating between updating the generative parameters and these posterior parameters. The update procedure is guaranteed to converge to a minimum of F, as long as the updates are done asynchronously and each update step minimizes F with respect to that parameter (Neal & Hinton, 1998). The update equations for the various parameters can be derived by minimizing F with respect to each of them. In our simulations, we hold most of the parameters of the generative model constant, such as the priors π, the weights w, and the generative noise in the observation, σ_z. Minimizing F with respect to the other parameters (q_i, ŷ, σ_y, a) produces the update equations (equations 1.1, 1.3, 1.4, and 1.5, respectively) given above.

3 Running the Model
The updates of the parameters using equations 1.1 through 1.5 can be performed sequentially in any order. In our simulations, we initialize the state ŷ to E at time 0 and then cycle through updates of {q_i}, σ_y, and then ŷ. Note that for a given set of attractor locations and strengths and a fixed σ_z, the operation of the model is solely a function of the observation E. Although the model of the generative process is stochastic, the process of inferring its parameters from an observation is deterministic; there is no randomness in the operation of the model. The energy landscape changes as a function of the attractor dynamics. The initial shapes of the attractor basins are determined by the set of w_i, π_i, and σ_y. Figure 3 illustrates how the free energy landscape changes over time.
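The update cycle described above can be sketched in pure Python. This excerpt does not reproduce equations 1.1 through 1.5 themselves, so the closed-form updates below are our reconstruction, obtained by minimizing equation 2.2 with respect to each parameter in turn; the attractors, observation, and noise floor are invented for illustration.

```python
import math

# Coordinate-descent minimization of the free energy F (equation 2.2).
# Each update exactly minimizes F with respect to one parameter
# (responsibilities q, state noise sigma_y, state y_hat), so F never
# increases across iterations. The update forms are our reconstruction,
# derived by differentiating equation 2.2.

def free_energy(q, y, E, W, priors, sy, sz):
    n = len(y)
    kl = sum(qi * math.log(qi / p) for qi, p in zip(q, priors) if qi > 0)
    dE = sum((a - b) ** 2 for a, b in zip(E, y))
    dW = sum(qi * sum((a - b) ** 2 for a, b in zip(y, w))
             for qi, w in zip(q, W))
    return kl + dE / (2 * sz ** 2) + dW / (2 * sy ** 2) + n * math.log(sy * sz)

def settle(E, W, priors, sz=1.0, sy=1.0, iters=50):
    y, n, history = list(E), len(E), []
    for _ in range(iters):
        # q_i proportional to pi_i exp(-|y - w_i|^2 / (2 sy^2))
        logs = [math.log(p) - sum((a - b) ** 2 for a, b in zip(y, w)) / (2 * sy ** 2)
                for p, w in zip(priors, W)]
        m = max(logs)
        u = [math.exp(l - m) for l in logs]
        q = [x / sum(u) for x in u]
        # sy^2 = (1/n) sum_i q_i |y - w_i|^2, floored for numerical stability
        sy = max(math.sqrt(sum(qi * sum((a - b) ** 2 for a, b in zip(y, w))
                               for qi, w in zip(q, W)) / n), 1e-4)
        # y = (E / sz^2 + sum_i q_i w_i / sy^2) / (1/sz^2 + 1/sy^2)
        denom = 1 / sz ** 2 + 1 / sy ** 2
        y = [(E[d] / sz ** 2 + sum(qi * w[d] for qi, w in zip(q, W)) / sy ** 2) / denom
             for d in range(n)]
        history.append(free_energy(q, y, E, W, priors, sy, sz))
    return y, q, history

W = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]
y, q, F = settle((0.8, 0.1), W, [1 / 3, 1 / 3, 1 / 3])
```

Because each step exactly minimizes F with respect to one parameter, the recorded free energy is non-increasing, consistent with the Lyapunov-function property the authors claim for the network.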
Localist Attractor Networks
1053
Figure 3: Each figure in this sequence shows the free energy landscape for the attractor net, where the three attractors (each marked by O) are in the same locations as in Figure 1, and the observation is in the location marked by an E. The state (marked by ŷ) begins at E. Each contour plot shows the free energy for points throughout state-space, given the current values of σ_y and {q_i}, and the constant values of E, σ_z, {w_i}, and {π_i}. In this simulation, σ_z = 1.0 and the priors on the attractors are uniform. Note how the gang effect pulls the state ŷ away from the nearest attractor (the left-most one).
This generative model avoids the problem of spurious attractors that arises in the standard gaussian mixture model. Intuition into how the model avoids spurious attractors can be gained by inspecting the update equations. These equations have the effect of tying together two processes: moving ŷ closer to some w_i than the others, and increasing the corresponding responsibility q_i. As these two processes evolve together, they act to decrease the noise σ_y, which accentuates the pull of the attractor. Thus, stable points that do not correspond to the attractors are rare. This characterization of the attractor dynamics indicates an interesting relationship to standard techniques in other forms of attractor networks. Many attractor networks (e.g., Geva & Sitte, 1991) avoid settling into spurious attractors by using simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983), in which a temperature parameter that controls the gain of the unit activation functions is slowly decreased. Another common attractor-network technique is soft clamping, where some units are initially influenced predominantly by external input, but the degree of influence from other units increases during the settling procedure. In the localist attractor network, σ_y can be seen as a temperature parameter, controlling the gain of the state activation function. Also, as this parameter decreases, the influence of the observation on the state decreases relative to that of the other units. The key point is that, as opposed to the standard techniques, these effects fall naturally out of the localist attractor network formulation, which obviates the need for additional parameters and update schedules in a simulation.

Note that we introduced the localist attractor net by extending the generative framework of a gaussian mixture model. The gaussian mixture model can also serve as the basis for an alternative formulation of a localist attractor net. The idea is straightforward. Consider the gaussian mixture density model as a negative energy landscape, and consider the search for an attractor to take place via gradient ascent in this landscape. As long as the gaussians are not too close together, the local maxima of the landscape will be the centers of the gaussians, which we designated as the attractors. Gang effects will be obtained because the energy landscape at a point will depend on all the neighboring gaussians. However, this formulation has a serious drawback: if attractors are near one another, spurious attractors will be introduced (because the point of highest probability density in a gaussian mixture will lie between two gaussian centers if they are close to one another), and if attractors are moved far apart, gang effects will dissipate. Thus, the likelihood of spurious attractors increases with the strength of gang effects.
The second major drawback of this formulation is that it is slower to converge than the model we have proposed, as is typical of gradient ascent compared with EM-style search. A third aspect of this formulation, which could also be seen as a drawback, is that it involves additional parameters in specifying the landscape: the depth and width of an attractor can be controlled independently of one another (via the mixture and variance coefficients, respectively), whereas in the model we presented, the variance coefficient is not a free parameter. Given the drawbacks of the gradient-ascent approach, we see the maximum-likelihood formulation as more promising and hence have focused on it in our presentation.
4 Simulation Studies
To create an attractor net, learning procedures could be used to train the parameters associated with the attractors (π_i, w_i) based on a specification of the desired behavior of the net, or the parameters could be hand-wired based on the desired structure of the energy landscape. The only remaining free parameter, σ_z, plays an important role in determining how responsive the system is to the observation.
Figure 4: (a) The number of spurious responses increases as σ_z shrinks. (b) The final state ends up at a neighbor of the generating attractor more frequently when the system is less responsive to the external state (i.e., high σ_z). Note that at σ_z = 0.5, these responses fall off as spurious responses increase.
We conducted three sets of simulation studies to explore properties of localist attractor networks.

4.1 Random State-Space. We have simulated the behavior of a network with a 200-dimensional state-space and 200 attractors, which were randomly placed at corners of the 200-D hypercube. For each condition we studied, we ran 100 trials, each time selecting one source attractor at random, corrupting it to produce an input pattern, and running the attractor net to determine its asymptotic state. The conditions involve systematically varying the value of σ_z and the percentage of missing features in the input, ν. Because the features have (+1, −1) binary values, a missing feature is indicated with the value zero. In most trials, the final value of ŷ corresponds to the source attractor. Two other types of responses are observed: spurious responses are those in which the final state corresponds to no attractor, and adulterous responses are those in which the state ends up not at the source attractor but instead at a neighboring one. Adulterous responses are due to gang effects: pressure from gangs has shrunk the source attractor basin, causing the network to settle to an attractor other than the one with the smallest Euclidean distance from the input. Figure 4a shows that the input must be severely corrupted (more than 85% of an input's features distorted) before the net makes spurious responses. Further, the likelihood of spurious responses increases as σ_z decreases, because the pull exerted by the input, E, is then too strong to allow the state to move into an attractor basin. Figure 4b also shows that the input must be severely corrupted before adulterous responses occur; as σ_z increases, the rate of adulterous responses increases. Gang effects are mild in these simulations because the attractors populated the space uniformly. If instead there is structure among the attractors, gang effects become more apparent. We simulated a structured space by creating a gang of attractors in the 200-D space: 20 attractors that are near to one another and another 20 attractors that are far from the gang and from each other. Across a range of values of σ_z and ν, the gang dominates, in that typically over 80% of the adulterous responses result in selection of a gang member.

Figure 5: Experiment with the face localist attractor network. Right-tilted images of the 16 different faces are the attractors. Here the initial state is a left-tilted image of one of these faces, and the final state corresponds to the appropriate attractor. The state is shown after every three iterations.

4.2 Faces. We conducted studies of localist attractor networks in the domain of faces. In these simulations, different views of the faces of 16 different subjects comprise the attractors, and the associated gray-level images are points in the 120 × 128 state-space. The database contains 27 versions of each face, obtained by varying the head orientation (left tilt, upright, or right tilt), the lighting, and the scale. For our studies, we chose to vary tilt and picked a single value for the other variables. One aim of the studies was to validate basic properties of the attractor net. One simulation used as attractors the 16 right-tilted heads and tested the ability of the network to associate face images at other orientations with the appropriate attractor. The asymptotic state was correct on all 32 input cases (left-tilted and upright) (see Figure 5). Note that this behavior can largely be accounted for by the greater consistency of the backgrounds in the images of the same face. Nonetheless, this simple simulation validates the basic ability of the attractor network to associate similar inputs. We also investigated gang effects in face space. The 16 upright faces were used as attractors. In addition, the other 2 faces for one of the 16 subjects were also part of the attractor set, forming the gang. The initial observation was a morphed face, obtained by a linear combination of the upright face of the gang member and one of the other 15 subjects.
In each case, the asymptotic state corresponded to the gang member even when the initial weighting assigned to this face was less than 40% (see Figure 6). This study shows that the dynamics of the localist attractor network make its behavior different from a nearest-neighbor lookup scheme.
Figure 6: Another simulation study with the face localist attractor network. Upright images of each of the 16 faces and both tilted versions of one of the faces (the "gang member" face shown above) are the attractors. The initial state is a morphed image constructed as a linear combination of the upright image of the gang member face and the upright image of a second face, the "loner" shown above, with the loner face weighted by .65 and the gang member by .35. Although the initial state is much closer to the loner, the asymptotic state corresponds to the gang member.
4.3 Words. To test the architecture on a larger, structured problem, we modeled the domain of three-letter English words. The idea is to use the attractor network as a content-addressable memory, which might, for example, be queried to retrieve a word with P in the third position and any letter but A in the second position, a word such as HIP. The attractors consist of the 423 three-letter English words, from ACE to ZOO. The state-space of the attractor network has one dimension for each of the 26 letters of the English alphabet in each of the three positions, for a total of 78 dimensions. We can refer to a given dimension by the letter and position it encodes; for example, P3 denotes the dimension corresponding to the letter P in the third position of the word. The attractors are at the corners of a [−1, +1]^78 hypercube. The attractor for a word such as HIP is located at the state having value −1 on all dimensions except for H1, I2, and P3, which have value +1. The observation E specifies a state that constrains the solution. For example, one might specify "P in the third position" by setting E to +1 on dimension P3 and to −1 on dimensions a3, for all letters a other than P. One might specify "any letter but A in the second position" by setting E to −1 on dimension A2 and to 0 on all other dimensions a2, for all letters a other than A. Finally, one might specify the absence of a constraint in
Figure 7: Simulation of the three-letter word attractor network, queried with E2. Each frame shows the relative activity of attractor units at a given processing iteration. Activity in each frame is normalized such that the most active unit is printed in black ink; the lighter the ink color, the less active the unit is. All 54 attractor units for words containing E2 are shown here; the other 369 attractor units do not become significantly active because they are inconsistent with the input constraint. Initially the state of the attractor net is equidistant from all E2 words, but by iteration 3, the subset containing N3 or T3 begins to dominate. The gang effect is at work here: 7 words contain N3 and 10 contain T3, the two most common endings. The selected word, PET, contains the most common first and last letters in the set of three-letter E2 words.
a particular letter position, r, by setting E to 0 on dimensions a_r, for all letters a. The task of the attractor net is to settle on a state corresponding to one of the three-letter words, given soft constraints on the letters of the word. McClelland and Rumelhart's (1981) interactive-activation model of word perception performs a similar computation, and our implementation exhibits the key qualitative properties of their model. If the observation specifies a word, of course the attractor net will select that word. The interesting queries are those in which the observation underconstrains or overconstrains the solution. We illustrate with two examples of the network's behavior. Suppose we provide an observation that specifies E2 but no constraint on the first or third letter of the word. Fifty-four three-letter words contain E2, but the attractor net lands in the state corresponding to one specific word. Figure 7 shows the state of the attractor net as PET is selected. Each frame of the figure shows the relative activity at a given processing iteration of all 54 attractor units for words containing E2. The other 369 attractor units do not become significantly active, because they are inconsistent with the input constraint. Activity in each frame is normalized such that the most active unit is printed in black ink; the lighter the ink color, the less active the unit is. As the figure shows, initially the state of the attractor net is equally
Figure 8: Simulation of the three-letter word attractor network, queried with DEG. Only attractor units sharing at least one letter with DEG are shown. The selection, BEG, is a product of a gang effect. The gangs in this example are formed by words sharing two letters. The most common word beginnings are PE– (7 instances) and DI– (6); the most common word endings are –AG (10) and –ET (10); the most common first–last pairings are B–G (5) and D–G (3). One of these gangs supports B1, two support E2, and three support G3; hence BEG is selected.
distant from all E2 words, but by iteration 3, the subset containing N3 or T3 begins to dominate. The gang effect is at work here: 7 words contain N3 and 10 contain T3, the two most common endings. The selected word, PET, contains the most common first and last letters in the set of three-letter E2 words. The gang effect is sensitive to the set of candidate alternatives, not the entire set of attractors: A, B, and S are the most common first letters in the corpus, yet words beginning with these letters do not dominate over others. Consider a second example in which we provide an observation that specifies D1, E2, and G3. Because DEG is a nonword, no attractor exists for that state. The closest attractors share two letters with DEG, and many such attractors exist (e.g., PEG, BEG, DEN, and DOG). Figure 8 shows the relative activity of attractor units for the DEG query; only attractor units sharing at least one letter with DEG are shown. The selection, BEG, is again a product of a gang effect. The gangs in this example are formed by words sharing two letters. The most common word beginnings are PE– (7 instances) and DI– (6); the most common word endings are –AG (10) and –ET (10); the most common first–last pairings are B–G (5) and D–G (3). One of these gangs supports B1, two support E2, and three support G3, the letters of the selected word. For an ambiguous input like DEG, the competition is fairly balanced; BEG happens to win out because of its neighborhood support. However, the balance
Figure 9: Simulation of the three-letter word attractor network, queried with DEG, when the attractor LEG is primed by setting its prior to 1.2 times that of the other attractors. Priming causes the network to select LEG instead of BEG.
of the competition can be altered by priming another word that shares two letters with the input DEG. In the simulations reported above, the priors, π_i, were uniform. We can prime a particular attractor by slightly raising its prior relative to the priors of the other attractors. For example, in Figure 9, DEG is presented to the network after setting the relative prior of LEG to 1.2. LEG has enough of an advantage to overcome gang effects and suppress BEG. However, with a relative prior of 1.1, LEG is not strong enough to win out. The amount of priming required to ensure that a word wins the competition depends on the size of its gang. For example, DOG, which belongs to fewer and smaller gangs, requires a relative prior of 2.3 to beat out BEG. This simple demonstration shows that incorporating mechanisms of priming into the localist attractor network model is simple. In contrast, implementing priming in distributed attractor networks is tricky, because strengthening one attractor may have broad consequences throughout the attractor space. As a final test in the three-letter word domain, we presented random values of E to determine how often the net reached an attractor. The random input vectors had 80% of their elements set to zero, 10% to +1, and 10% to −1. In 1000 presentations of such inputs, only 1 did not reach an attractor (the criterion for reaching an attractor being that all state vector elements were within 0.1 unit of the chosen attractor), a clear demonstration of robust convergence.
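The query-encoding scheme used throughout this section is easy to state concretely. The sketch below builds the 78-dimensional observation for the example query ("P in the third position, any letter but A in the second") and the attractor corner for HIP; the helper names are our own, and only the encoding scheme itself comes from the text.

```python
# 26 letters x 3 positions = 78 dimensions, indexed by (letter, position).
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def dim(letter, position):
    """Index of the dimension for `letter` at word position 1-3."""
    return (position - 1) * 26 + LETTERS.index(letter)

def word_attractor(word):
    """Corner of the [-1, +1]^78 hypercube encoding a three-letter word."""
    v = [-1.0] * 78
    for pos, letter in enumerate(word, start=1):
        v[dim(letter, pos)] = +1.0
    return v

def query(require=(), forbid=()):
    """Observation E: a required (letter, pos) pair gets +1 and pushes the
    other letters of that position to -1; a forbidden pair gets -1 with the
    rest of its position left at 0; untouched positions stay at 0."""
    E = [0.0] * 78
    for letter, pos in require:
        for a in LETTERS:
            E[dim(a, pos)] = +1.0 if a == letter else -1.0
    for letter, pos in forbid:
        E[dim(letter, pos)] = -1.0
    return E

# "P in the third position, any letter but A in the second": matches HIP.
E = query(require=[("P", 3)], forbid=[("A", 2)])
hip = word_attractor("HIP")
```

Running the attractor dynamics from such an observation is what selects a single word; the sketch only shows how the soft constraints become a point in state-space.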
5 Conclusion
Localist attractor networks offer an attractive alternative to standard attractor networks, in that their dynamics are easy to specify and adapt. Localist attractor networks use local representations for the m attractors, each of which is a distributed, n-dimensional state representation. We described a statistical formulation of a type of localist attractor net and showed that it provides a Lyapunov function for the system as well as a mathematical interpretation for the network parameters. A related formulation of localist attractor networks was introduced by Mathis and Mozer (1996) and Mozer et al. (1998). These papers described the application of this type of network to model psychological and neural data. The network and parameters in these earlier systems were motivated primarily on intuitive grounds. In this article, we propose a version of the localist attractor net whose dynamics are derived not from intuitive arguments but from a formal mathematical model. Our principled derivation provides a number of benefits, including more robust dynamics and more reliable convergence than the model based on intuition, and a clear interpretation of the model parameters. In simulation studies, we found that the architecture achieves gang effects, accounts for priming effects, and seldom ends up at a spurious attractor.

The primary drawback of the localist attractor approach is inefficiency when the attractors have componential structure. This can be traced to the lack of spurious attractors, which is generally a desirable property. However, in domains that have componential structure, spurious attractors may be a feature, not a drawback. That is, one may want to train an attractor net on a subset of the potential patterns in a componential domain and expect it to generalize to the remainder; for example, train on attractors AX, BX, CX, AY, and BY, and generalize to the attractor CY. In this situation, "generalization" comes about through the creation of spurious attractors.
Yet it turns out that traditional training paradigms for distributed attractor networks fare poorly at discovering componential structure (Noelle & Zimdars, 1999). In conclusion, a localist attractor network shares the qualitative dynamics of a distributed attractor network, yet the model is easy to build, the parameters are explicit, and the dynamics are easy to analyze. The model is one step further from neural plausibility but is useful for cognitive modeling or engineering. It is applicable when the number of items being stored is relatively small, which is characteristic of pattern recognition or associative memory applications. Finally, the approach is especially useful in cases where attractor locations are known and the key focus of the network is the mutual influence of the attractors, as in many cognitive modeling studies.
Acknowledgments
We thank Joe Holmgren for help with running simulation studies. We also thank the reviewers for helpful comments and the Vision and Modeling Group at the MIT Media Lab for making the face image database publicly available. This work was supported by ONR award N00014-98-1-0509 and McDonnell-Pew award 95-1 to R.Z., and NSF award IBN-9873492 and McDonnell-Pew award 97-18 to M.M.
References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the First International Conference on Neural Networks (pp. 609–618).
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7(3), 237–252.
Becker, S., Moscovitch, M., Behrmann, M., & Joordens, S. (1997). Long-term semantic priming: A computational account and empirical evidence. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23(5), 1059–1082.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39(1), 1–38.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Geva, S., & Sitte, J. (1991). An exponential response neural net. Neural Computation, 3(4), 623–632.
Golden, R. (1988). Probabilistic characterization of neural model computations. In D. Z. Anderson (Ed.), Neural information processing systems (pp. 310–316). New York: American Institute of Physics.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Horn, D., & Ruppin, E. (1995). Compensatory mechanisms in an attractor neural network model of schizophrenia. Neural Computation, 7(1), 182–205.
Hummel, R. A., & Zucker, S. W. (1983). On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 267–287.
Kay, L. M., Lancaster, L. R., & Freeman, W. J. (1996).
Reafference and attractors in the olfactory system during odor recognition. International Journal of Neural Systems, 7(4), 489–495.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194, 283–287.
Mathis, D. (1997). A computational theory of consciousness in cognition. Unpublished doctoral dissertation, University of Colorado.
Mathis, D., & Mozer, M. C. (1996). Conscious and unconscious perception: A computational theory. In G. Cottrell (Ed.), Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society (pp. 324–328). Hillsdale, NJ: Erlbaum.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part I. An account of basic findings. Psychological Review, 88, 375–407.
McClelland, J. L., & Rumelhart, D. E. (1985). Distributed memory and the representation of general and specific information. Journal of Experimental Psychology: General, 114(2), 159–188.
McRae, K., de Sa, V. R., & Seidenberg, M. S. (1997). On the nature and scope of featural representations of word meaning. Journal of Experimental Psychology: General, 126(2), 99–130.
Movellan, J., & McClelland, J. L. (1993). Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, 17, 403–496.
Mozer, M. C., Sitton, M., & Farah, M. (1998). A superadditive-impairment theory of optic aphasia. In M. I. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models. Norwell, MA: Kluwer.
Noelle, D. C., & Zimdars, A. L. (1999). Methods for learning articulated attractors over internal representations. In M. Hahn & S. C. Stoness (Eds.), Proceedings of the 21st Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Pineda, F. J. (1987).
Generalization of back propagation to recurrent and higher order neural networks. In Proceedings of the IEEE Conference on Neural Information Processing Systems.
Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103(1), 56–115.
Plaut, D. C., & Shallice, T. (1994). Connectionist modelling in cognitive neuropsychology: A case study. Hillsdale, NJ: Erlbaum.
Redish, A. D., & Touretzky, D. S. (1998). The role of the hippocampus in solving the Morris water maze. Neural Computation, 10(1), 73–111.
Rodrigues, N. C., & Fontanari, J. F. (1997). Multivalley structure of attractor neural networks. Journal of Physics A (Mathematical and General), 30, 7945–7951.
Samsonovich, A., & McNaughton, B. L. (1997). Path integration and cognitive mapping in a continuous attractor neural network model. Journal of Neuroscience, 17(15), 5900–5920.
Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Tabor, W., Juliano, C., & Tanenhaus, M. K. (1997). Parsing in a dynamical system: An attractor-based account of the interaction of lexical and structural constraints in sentence processing. Language and Cognitive Processes, 12, 211–271.

Received July 2, 1999; accepted July 12, 2000.
LETTER
Communicated by Thomas Dietterich
Stochastic Organization of Output Codes in Multiclass Learning Problems

Wolfgang Utschick
Werner Weichselberger
Institute for Network Theory and Signal Processing, Munich University of Technology, D-80290 Munich, Germany

The best-known decomposition schemes of multiclass learning problems are one-per-class coding (OPC) and error-correcting output coding (ECOC). Both methods perform a prior decomposition, that is, before training of the classifier takes place. The impact of output codes on the inferred decision rules can be experienced only after learning. Therefore, we present a novel algorithm for the code design of multiclass learning problems. This algorithm applies a maximum-likelihood objective function in conjunction with the expectation-maximization (EM) algorithm. Minimizing the augmented objective function yields the optimal decomposition of the multiclass learning problem into two-class problems. Experimental results show the potential gain of the optimized output codes over OPC or ECOC methods.

1 Introduction
Designing a classification system is considered to be the construction of a decision rule,

$$
f : \mathbb{R}^N \to \Omega, \qquad x \mapsto \Omega_k, \tag{1.1}
$$

which uniquely assigns each pattern x ∈ ℝ^N to one element of a finite set of K classes Ω = {Ω_1, Ω_2, ..., Ω_K}. The case K = 2 defines a two-class problem (dichotomous classification); K > 2 addresses a multiclass problem (cf. polychotomous classification; Friedman, 1996). Due to the uncertainty of statistical parameters in practical applications, the decision rule is constructed solely with the knowledge of the true class indices, k_m ∈ {1, 2, ..., K}, of M given training examples S = {(x_1, k_1), (x_2, k_2), ..., (x_M, k_M)}. The training set S can be expressed as S = {S_1, S_2, ..., S_K}, where S_k = {(x_{k,1}, k), (x_{k,2}, k), ..., (x_{k,M_k}, k)} denotes the set of training examples of class Ω_k, and M = M_1 + M_2 + ⋯ + M_K with |S_k| = M_k.

Neural Computation 13, 1065–1102 (2001) © 2001 Massachusetts Institute of Technology

1.1 Decomposition into Dichotomies. Numerous works have shown that breaking down a multiclass problem into a series of two-class problems
can be advantageous (Anand, Mehrotra, Mohan, & Ranka, 1995; Dietterich & Bakiri, 1995; Mayoraz & Moreira, 1996; Friedman, 1996; Gareth & Hastie, 1997; Schapire, 1997). Decomposing a polychotomy Ω into D dichotomies means defining D partitions (Ω_d^+, Ω_d^−) of Ω. Each of these partitions defines a two-class problem to be learned by the corresponding component classifier or plug-in classifier (PiC) (cf. Gareth & Hastie, 1997). The set of training examples for the dth two-class problem, S^d, can easily be obtained from the sets S_k by S^d = {S_1^d, S_2^d, ..., S_K^d}, where

$$
S_k^d = \left\{ (x_{k,1}, t_k^d), (x_{k,2}, t_k^d), \ldots, (x_{k,M_k}, t_k^d) \right\}, \qquad
t_k^d = \begin{cases} +1, & \Omega_k \in \Omega_d^{+} \\ -1, & \Omega_k \in \Omega_d^{-}. \end{cases} \tag{1.2}
$$
The value tdk is the desired output value (target value) of the dth component ) is called target vector for patterns of class V k. The vector tk D (t1k , t2k , . . . , tD k or code vector of class Vk. The matrix 0 1
t1
ftkgKkD 1
B t2 C B C :D B . C D (#1 #2 ¢ ¢ ¢ #D ) @ .. A
(1.3)
tK denes the target matrix (code matrix). The vector #d D (#d1 , #d2 , . . . , #dK )T denotes the dth dichotomy vector, which directly corresponds to the dth partition ( dC , d¡ ). Figure 1 shows how the rows of ftkgKkD 1 (target vectors tk) are associated with classes. The columns (dichotomy vectors #d ) represent the two-class problems into which the polychotomous classication problem is decomposed. The problems of overtting and poor exibility of the decision boundary are introduced. 1.2 Error-Correcting Output Coding. The maximum number of possible binary vectors of dimension D is given by 2D . Considering the necessary uniqueness of target vectors,
¡ ¡
¢ 2D !
¢ , 2D ¡ K !
(1.4)
code matrices are still feasible. Therefore, the question of how to nd the optimal code matrix is raised. For instance, Dietterich and Bakiri (1995) demonstrated the potential of error-correcting output coding (ECOC), which is based on large Hamming distances between code vectors. Furthermore, a constructive approach was presented by Mayoraz and Moreira (1996). They
Figure 1: (Top) An example of the target matrix {t_k}_{k=1}^K. (Middle, bottom) The figures show exemplary decision boundaries of the two-class problems associated with the dichotomy vectors ϑ_3 and ϑ_5, respectively. They illustrate the problems of overfitting and poor flexibility.
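As a concrete illustration of the target matrix, the following Python sketch (our own helper names, not from the paper) builds the one-per-class code matrix and decodes an intermediate vector y with the minimal-distance (winner-take-all) rule that the classification system of section 2 uses:

```python
# Hypothetical sketch: rows of the returned matrix are target vectors t_k,
# columns are dichotomy vectors. For OPC, class k gets +1 in its own column
# and -1 everywhere else.

def opc_code_matrix(K):
    """K x K one-per-class target matrix {t_k}."""
    return [[1 if d == k else -1 for d in range(K)] for k in range(K)]

def decode(y, code_matrix):
    """Assign y to the class whose target vector is closest (squared Euclidean)."""
    dist2 = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(code_matrix)), key=lambda k: dist2(y, code_matrix[k]))

T = opc_code_matrix(4)          # 4 classes, 4 dichotomies
y = [0.8, -0.9, -0.2, -1.0]     # noisy PiC outputs inside the hypercube
k = decode(y, T)                # closest to t_0 = (+1, -1, -1, -1)
```

Replacing `opc_code_matrix` by any other ±1 matrix with unique rows yields an alternative code, which is exactly the design freedom the rest of the paper exploits.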
proposed decomposing the classification problem into "adequate" two-class tasks that are "easy to learn." The degree of being "adequate" or "learnable" depends on the distribution of the data, the architecture of the PiCs, and the learning rule. Although the benefit of the alternative approaches has been demonstrated, the use of the conventional one-per-class (OPC) output coding is still considered one of the state-of-the-art methods (Schürmann, 1996; Kressel & Schürmann, 1997). The reasons for this are the dramatically increased computational costs due to the very large number of PiC components required in ECOC approaches (Dietterich & Bakiri, 1995; Gareth &
Hastie, 1997) and the correlation of errors between the PiCs. In practical applications, the correlation properties of PiCs make large distances between target vectors worthless (Utschick, 1998). Moreover, the OPC method is theoretically well established and is known (under certain assumptions) to approximate the Bayes optimal decision rule (Ruck, Rogers, Kabrinsky, Oxley, & Suter, 1990; Richard & Lippmann, 1991; Rojas, 1996). In the following, we present a novel algorithm that generates solutions where the parameters of the PiCs and the code matrix are simultaneously optimized with respect to (w.r.t.) a global objective function. In section 2, we introduce the principle of the plug-in classification system to which we refer in this work. In section 3, the maximum-likelihood approach is presented, which avoids the combinatorial complexity given by the above count of possible code matrices (see equation 1.4). The standard maximum-likelihood function is augmented by a regularization term to provide a trade-off between overfitting and poor flexibility during the optimization process, which is based on the EM algorithm. Section 4 presents the learning algorithm of the PiC components, which can be transformed into a series of independent regression problems, and in section 5 the asynchronous algorithm for optimizing the code matrix is proposed. The resulting LAOCOON algorithm is summarized in section 6. In section 7, the learning curves of the algorithm are presented, and we show the results of the optimized output codes in some practical applications of polychotomous classification. Our conclusions are presented in section 8.

2 Classification System
The polychotomous classification system, to which we refer in the sequel, defines a decision rule of the form

f : ℝ^N → Ω, x ↦ Ω_{k_f},   (2.1)

k_f = arg min_k { dist(y(x, Θ), t_k) },   (2.2)

where Θ = {θ_1, θ_2, …, θ_D}. The assignment function f is composed of D PiC components, which are characterized by their sets of parameters θ_d. The components map the input vector x from the input space onto y in the decision space. The decision space is equal to the hypercube

𝒟 = {z ∈ ℝ : −1 ≤ z ≤ 1}^D.   (2.3)

The target vectors t_k represent the different classes in the decision space. The intermediate vector y is assigned to the class index k by means of the minimal-distance decision rule, or the winner-take-all rule, and a distance measure dist: 𝒟 × 𝒟 → ℝ⁺, (y, t_k) ↦ ‖y − t_k‖². Note that this model is
identical to the standard representation of classifiers via maximization over a set of discriminant functions (cf. Duda & Hart, 1973; Schürmann, 1996; Kressel & Schürmann, 1997).

3 The Maximum-Likelihood Approach

3.1 Association Probabilities. A proposed method to find optimal output codes for multiclass learning problems is to use combinatorial optimization techniques that have been developed for maximum cut and satisfiability problems (Goemans & Williamson, 1995; cf. Schapire, 1997). A different approach to nonconvex optimization is the application of statistical mechanics. A known technique for nonconvex optimization is simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983). The deterministic counterpart of the stochastic relaxation technique is deterministic annealing, which is based on embedding a combinatorial optimization problem into a family of stochastic optimization problems (Rose, 1992; Puzicha, Hofmann, & Buhmann, 1997). An elegant heuristic technique, it has been successfully applied to vector quantization and related optimization problems (Rose, 1992, 1998; Hofmann, Puzicha, & Buhmann, 1998). Inspired by this stochastic relaxation technique, we avoid the discrete combinatorial optimization problem for the optimal output coding by introducing association probabilities (Rose, 1992),

P_{T_k} : 𝒯_k → [0, 1], t ↦ P_{T_k}[t],   (3.1)

which represent the probabilities of realizations of random target vectors T_k. In other words, the P_{T_k}[t] represent the probabilities of target vectors being the optimal choice for class k. In the sequel, the indication of random variables in the notation of probability functions is omitted; for instance, we denote P_{T_k}[t_{k,i}] as P[t_{k,i}]. 𝒯_k ⊆ {1, −1}^D, where I = |𝒯_k|, denotes the set of target vectors of class Ω_k,

𝒯_k = {t_{k,1}, t_{k,2}, …, t_{k,I}},   (3.2)

and 𝒫_k ∈ [0, 1]^I denotes the set of the respective association probabilities,

𝒫_k = {P[t_{k,1}], P[t_{k,2}], …, P[t_{k,I}]}.   (3.3)

The most probable target vector of the set 𝒯_k is called the center vector t_{k,c} of class Ω_k. In the following, the notations 𝒯_k and 𝒯(t_{k,c}) will be used interchangeably.

3.2 Maximum Likelihood. Given the sets of associated vectors {𝒯_k}_{k=1}^K,
we introduce a maximum-likelihood approach¹ for searching for the optimal

¹ For more details, especially the derivation of the objective function and the Lagrangian term, refer to appendix B.
probabilities 𝒫 = {𝒫_k}_{k=1}^K which minimize the constrained objective function,²

L(𝒫, Θ) = −⟨ Σ_{m=1}^M log p(y^m(Θ), t^m) ⟩_{P*_{t|y}},   (3.4)

subject to the structural constraint (regularization), H. The constrained minimization of equation 3.4 is equivalent to the unconstrained minimization of the Lagrangian term, that is,

min_{𝒫,Θ} { L + λᵀH },   (3.5)

where L = L_H + L_P, λ = (λ_t, λ_ϑ)ᵀ, H = (H_t, H_ϑ)ᵀ, and

L_H = −⟨ Σ_{m=1}^M log p(y^m | t^m) ⟩_{P*_{t|y}},   (3.6)

L_P = −⟨ Σ_{m=1}^M log P[t^m] ⟩_{P*_{t|y}},   (3.7)

H_t = −⟨ log ( Π_{k=1}^K Π_{ℓ=1, ℓ≠k}^K dist(t_k, t_ℓ) ) ⟩_{P_t},   (3.8)

H_ϑ = −⟨ log ( Π_{d=1}^D Π_{e=1, e≠d}^D dist(ϑ_d, ϑ_e) dist(ϑ_d, −ϑ_e) ) ⟩_{P_t}.   (3.9)

(Notice that we have dropped the explicit representation of the dependence on 𝒫 and Θ for convenience.) We identify L_H as the objective function for the learning of the parameter vectors θ_d of the parallel PiC components. On the other hand, the term λᵀH is supposed to prevent overfitting and favors the generation of large Hamming distances between target vectors t_k and between dichotomy vectors ϑ_d, respectively. The parameters λ_t and λ_ϑ control the trade-off between the learnability of the chosen subsets (via the code matrix) of training examples S^d and the generation of large distances in the row and column space of the code matrix.
² Here, we have used ⟨·⟩_{P*_{t|y}} to denote the expectation of a random variable over the distribution P*(t | y). The asterisk denotes the parameters of the previous iteration step. For more details, refer to appendix A.
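For a hard code matrix (all association probability concentrated on one target vector per class), the expectations in the structural constraints drop out and H_t and H_ϑ reduce to negative log-products of pairwise distances. A minimal Python sketch under this simplification, with our own function names and Euclidean distance standing in for dist:

```python
import math

# Illustrative simplification of eqs. 3.8 and 3.9 for a *hard* ±1 code matrix
# (the paper's versions are expectations over the association probabilities).

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def H_t(code):
    """Row separation: -log of the product of distances over ordered row pairs."""
    K = len(code)
    prod = 1.0
    for k in range(K):
        for l in range(K):
            if l != k:
                prod *= euclidean(code[k], code[l])
    return -math.log(prod)

def H_theta(code):
    """Column separation, including distances to the negated columns."""
    cols = list(zip(*code))
    D = len(cols)
    prod = 1.0
    for d in range(D):
        for e in range(D):
            if e != d:
                neg = [-x for x in cols[e]]
                prod *= euclidean(cols[d], list(cols[e])) * euclidean(cols[d], neg)
    return -math.log(prod)
```

A duplicated row (or a column equal to another column or its negation) drives one factor to zero and the penalty to infinity, which is exactly the behavior described in section 3.3.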
3.3 Structural Constraints. The core of the regularization term H_t for the target vectors is the logarithm of the product of the distances of all possible pairs of target vectors. Due to the multiplication of the distances, the regularization term approaches infinity if one single distance becomes zero. Furthermore, the logarithm flattens the term for high values: the influence of the regularization is reduced if the value of the regularizer is already high. The regularization term H_ϑ for dichotomies is constructed analogously to H_t. Both structural constraints are based on row separation and column separation (separation of rows and columns of the code matrix), two heuristics proposed by Dietterich and Bakiri (1995). Guruswami and Sahai (1999) present a more theoretical introduction to these principles of error-correcting codes.

3.4 Expectation Maximization. After reorganization of equations 3.4 and 3.5, we obtain an expression that depends only on 𝒫_k and p(y^{k,m} | t_{k,i}) (cf. equations B.4, B.5, B.7, and B.10). In the sequel, the parametric model of the conditional probability will be assumed to be gaussian. In this case, the minimization of the objective function L_H is performed as follows:

min_{P*_{t|y}} Σ_{k=1}^K Σ_{m=1}^{M_k} Σ_{i=1}^I dist(y^{k,m}, t_{k,i}) P*[t_{k,i} | y^{k,m}],   (3.10)

where dist: 𝒟 × 𝒟 → ℝ⁺ again denotes the Euclidean distance measure. Under the assumption that the association probabilities are hard and each input vector x^{k,m} is assigned to a unique target vector t_{k,c}, the minimization of the objective function is equal to

min_{t_{k,c}} Σ_{k=1}^K Σ_{m=1}^{M_k} dist(y^{k,m}, t_{k,c}),  t_{k,c} ∈ 𝒯_k.   (3.11)

From this point of view, equation 3.10 represents the probabilistic generalization of the objective function in equation 3.11 (Miller, Rao, Rose, & Gersho, 1996). Obviously, minimization of equation 3.10 solely w.r.t. the conditional probabilities P[t_{k,i} | y^{k,m}] would produce a hard output coding solution, since a full assignment of input vectors to target vectors is always less expensive in terms of the costs of equation 3.10 than a fuzzy assignment. However, the minimization of the objective function w.r.t. the internal parameters of the PiCs, Θ, and the structural constraint of the regularization term, H, requires a more complicated optimization technique. Toward this end, we apply the expectation-maximization (EM) algorithm (see appendix A). The optimization process is split into a sequence of training epochs. Each epoch consists of an expectation step (E-step) and a maximization step (M-step), which are applied to two concurrent parts: (1) the
training of the plug-in components based on L_H and (2) the optimization of the association probabilities of target vectors 𝒫_k based on L_P, H_t, and H_ϑ (see sections 4, 5, and 6, respectively). The E-step of the EM algorithm is calculated via

E : P*[t_{k,i} | y^{k,m}(Θ*)] = p(y^{k,m}(Θ*) | t_{k,i}) P*[t_{k,i}] / Σ_{j=1}^I p(y^{k,m}(Θ*) | t_{k,j}) P*[t_{k,j}],   (3.12)

using the association probabilities 𝒫_k*. (The asterisk denotes the parameters of the previous iteration step.) Readers familiar with the application of the EM algorithm for density estimation (Redner & Walker, 1984; Bishop, 1995) will see the analogy to the solutions obtained thereby. Analogously to density estimation, the distributions of the intermediate vectors y in the decision space are interpreted as mixture distributions with the elements of the sets 𝒯_k as centers. The vector probabilities P[t_{k,i}] are the weights of these mixture distributions.

3.5 Moving Target Centers. As a further important step toward the simplification of the entire optimization problem, we limit the size of the sets 𝒯_k. To this end, the 𝒯_k = 𝒯(t_{k,c}) are reduced to the set of target vectors within the direct neighborhood of the center vectors t_{k,c} (cf. equation 3.14). Because the optimal choice of the sets 𝒯_k is unknown (analogous to the original problem of optimal output coding), we propose a sequence of deterministic moves of the 𝒯_k in {−1, +1}^D. We further define the transition probabilities to be equal to the association probabilities 𝒫_k, which are a result of the previous training epoch. Then, after each training epoch, the new 𝒯_k includes the most probable target vector of the previous set together with its local neighborhood vectors; that is, the target vector center t_{k,c} is updated as follows (t_{k,c} ← t*_{k,c}):

U_T : t_{k,c} = arg max_{t_{k,i} ∈ 𝒯_k*, i=1,…,I} { P[t_{k,i}] }.   (3.13)

Consequently, the new 𝒯_k ← 𝒯_k*, resp. 𝒯(t_{k,c}) ← 𝒯(t*_{k,c}), reads as

𝒯_k = { t ∈ {−1, +1}^D : hamm(t, t_{k,c}) ≤ 1 },   (3.14)

where hamm: ℝ^D × ℝ^D → {0, 1, …, D} denotes the Hamming distance measure. The following update rule, equation 3.17, describes the deterministic "moving" of target vectors and its association probabilities. Here, we have used the notation of conditional probabilities for the association probabilities:

P[t | t_{k,c}] = { P[t_{k,i}] if t ∈ 𝒯_k ∧ t = t_{k,i} ; 0 otherwise }.   (3.15)
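Assuming an isotropic gaussian conditional density p(y | t) ∝ exp(−‖y − t‖²/(2σ²)), the E-step of equation 3.12 becomes a softmax over the candidate target vectors, and the neighborhood set of equation 3.14 contains the center vector plus its D single-bit flips. A sketch under these assumptions (σ and the helper names are ours, not the paper's):

```python
import math

def neighborhood(center):
    """T_k = {t : hamm(t, t_kc) <= 1} for a ±1 center vector (D + 1 vectors)."""
    sets = [list(center)]
    for d in range(len(center)):
        t = list(center)
        t[d] = -t[d]           # flip one coordinate
        sets.append(t)
    return sets

def e_step(y, targets, priors, sigma=1.0):
    """P*[t_i | y] ∝ p(y | t_i) P*[t_i], normalized over i  (eq. 3.12)."""
    lik = [math.exp(-sum((yj - tj) ** 2 for yj, tj in zip(y, t))
                    / (2 * sigma ** 2)) * p
           for t, p in zip(targets, priors)]
    z = sum(lik)
    return [v / z for v in lik]
```

When y lies close to one candidate target vector, that vector collects most of the posterior mass, which is the hardening behavior the text describes.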
After defining

δt := U_T(𝒯_k, 𝒫_k) − t*_{k,c},   (3.16)

we introduce the double-step update rule,

U_P :  (i)  P[t | t_{k,c}] = { P[t | t*_{k,c}] if t ∈ 𝒯_k ∩ 𝒯_k* ; P[t − δt | t*_{k,c}] if t ∈ 𝒯_k \ 𝒯_k* },
       (ii) Ṗ[t | t_{k,c}] = −P[t | t_{k,c}] + const.,   (3.17)

where³

𝒯_k ∩ 𝒯_k* = { 𝒯_k* if δt = 0 ; {t_{k,c}, t*_{k,c}} otherwise },
𝒯_k \ 𝒯_k* = { ∅ if δt = 0 ; 𝒯_k \ {t_{k,c}, t*_{k,c}} otherwise },

and where P[t | t*_{k,c}] is the solution of the corresponding training epoch of the Lagrangian term, equation 3.5. Applying the implicit Euler method for the numerical solution of the second-step differential equation of the update rule, equation 3.17, results in

P[t | t_{k,c}] ← νP[t | t_{k,c}] + (1 − ν)const.,   (3.18)
where ν ∈ [0, 1]. Considering that the probabilities of 𝒫_k sum up to unity, const. is equivalent to I⁻¹. Here, I again denotes the size of the sets 𝒯_k. The update rule in equation 3.18 obviously introduces a flattening of the association probabilities after each optimization step. Experiments have shown that this flattening helps to prevent the solution of the optimization process from getting stuck in a local minimum. Figure 2 shows the idea of the "moving" set 𝒯_k. The set 𝒯_k "moves" along a path of growing probabilities until a local optimum of the maximum-likelihood objective function is reached.

4 Training of the Plug-In Components
In order to escape from local minima, we introduce a multiple-initial-point strategy within the optimization process via the training of the PiCs.

³ The appearance of both subsets 𝒯_k ∩ 𝒯_k* and 𝒯_k \ 𝒯_k* depends on the definition of the 𝒯_k in equation 3.14.
Figure 2: “Moving” set T k during the optimization process. The size of a black ball indicates the value of the probability associated with the vector.
In order to describe the training algorithm, we denote the minimization of the partial objective function L_H during a training epoch as⁴

M_H : θ*_d(ω) ↦ θ_d(ω) = μB_d(ω) + ∫₀^τ ∂L_H(θ_d(ω, t))/∂θ_d dt,   (4.1)

where d = 1, …, D, θ_d(ω, 0) = 0, ω is an event within a given probability space, θ_d are the parameters of the dth plug-in component, τ ∈ ℝ⁺ is the integration time of a training epoch, and B_d is a Brownian term. The θ*_d(ω) and θ_d(ω) denote the PiC parameters before and after the training epoch. For the training of component parameters (M-step), it turns out that min_Θ {L_H(Θ)} can be achieved one PiC after the other, via M_H (see equation 4.1), by solving a nonlinear regression problem (see section C.1). If we assume that p(y^{k,m}(Θ) | t_{k,i}) is gaussian, the regression function equals the nonlinear mapping of the first-layer components of the classifier, f_d : ℝ^N → [−1, +1], x ↦ y_d(θ_d). Furthermore, the estimation of the θ_d is based on training examples

( x^{k,m}, Σ_{i=1}^I t_{k,i} P*[t_{k,i} | y^{k,m}(Θ*)] ) ∈ ℝ^N × 𝒟,   (4.2)

where the asterisk again denotes the parameters or variables of the previous iteration step. It is noticeable that after the algorithm has converged, the regression problem equals the conventional least-squares learning of backprop-like algorithms based on training examples {(x^{k,m}, t_{k,c}^d)} (see section C.1).

⁴ Here, we have used the notation of the integral equation (cf. Schäfer, 1995).
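The regression targets of equation 4.2 are expectations of the candidate target vectors under the current association probabilities; once the probabilities harden, they collapse to the center code vector t_{k,c}. A small sketch of this soft-target construction (our own naming, not the paper's):

```python
# Hypothetical helper: expected target vector sum_i P*[t_i | y] * t_i,
# a point in the hypercube rather than a ±1 corner while the association
# probabilities are still fuzzy.

def soft_target(targets, resp):
    """Expected target vector over candidates `targets` with E-step weights `resp`."""
    D = len(targets[0])
    return [sum(r * t[d] for t, r in zip(targets, resp)) for d in range(D)]

targets = [[1, -1], [-1, -1], [1, 1]]
resp = [0.8, 0.1, 0.1]                 # association probabilities from the E-step
st = soft_target(targets, resp)        # approximately [0.8, -0.8]
```

With `resp = [1.0, 0.0, 0.0]` the soft target equals the center vector itself, matching the remark that converged training reduces to ordinary least-squares learning against t_{k,c}.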
5 Optimization of Association Probabilities
There are two methods for updating the 𝒫_k. During each training epoch, we either update each set of association probabilities 𝒫_k individually in a fixed, predetermined or random order, or we update all 𝒫_k simultaneously. The first method involves calculating the local properties of the cost function w.r.t. variations in only one set of probabilities and updating that set accordingly. In the second method, we need to compute local properties w.r.t. the variation of the whole 𝒫. In the following section, we describe the first method (asynchronous update) only, due to its lower computational requirements and its favorable properties.

5.1 Asynchronous Optimization. The optimization of target vectors (code words),

min_𝒫 { L_P(𝒫) + λᵀH(𝒫) },   (5.1)

is iteratively and asynchronously solved by dealing with one subset 𝒯_k of target vectors after the other. When computing the probabilities of the 𝒯_k, the sets 𝒯_ℓ for ℓ ≠ k are assumed to be empty except for the current center vectors t_{ℓ,c}. Due to the proposed asynchronous style of the optimization algorithm, the objective function of each iteration step becomes a convex function⁵ (see appendix D),

M_P : 𝒫_k* ↦ arg min_{𝒫_k} { L′_{P,k} + λᵀH′_k },   (5.2)

subject to the constraint that all probabilities assigned to the vectors of 𝒯_k sum up to one, that is, Σ_{i=1}^I P[t_{k,i}] = 1. Thus, the optimum values 𝒫_k are equal to the solution of

∂L′_{P,k}/∂P[t_{k,i}] + λᵀ ∂H′_k/∂P[t_{k,i}] = 0,  ∀ i = 1, …, I,   (5.3)

dP[t_{k,1}] + ⋯ + dP[t_{k,i}] + ⋯ + dP[t_{k,I}] = 0.   (5.4)

Figure 3 shows a two-dimensional example (t_{k,i} ∈ ℝ²) of the objective function for the set 𝒯_k. Due to the limited dimensionality of the graph, the number of admissible target vectors is set to 3. The probabilities of two vectors form the x-axis and y-axis, and the probability of the third vector is implicitly provided by 1 = P[t_{k,1}] + P[t_{k,2}] + P[t_{k,3}]. Whereas in the unregularized case the vector t_{k,1} would reach the highest probability, the regularization causes a significant shift in favor of vector t_{k,2}.

⁵ Here, we have used the prime symbol to denote the simplified version of the optimization, due to the asynchronous approach.
Figure 3: The two-dimensional example of the objective function during asynchronous optimization.
Thus, we can employ a standard numerical optimization method to find the optimal association probabilities of this iteration step (Fletcher, 1980; Gill, Murray, & Wright, 1981).
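The paper defers to such standard optimizers; as one possible illustration (not the authors' solver), the simplex constraint Σ_i P[t_{k,i}] = 1 can be kept implicitly by parametrizing the probabilities through a softmax and descending on the unconstrained logits with a finite-difference gradient:

```python
import math

# Sketch under stated assumptions: any convex objective over one set of
# association probabilities is minimized on the probability simplex by
# gradient descent on softmax logits. Function names and the toy objective
# below are ours, not the paper's.

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def minimize_on_simplex(objective, I, steps=2000, lr=0.1, eps=1e-5):
    z = [0.0] * I                       # logits; softmax(z) stays on the simplex
    for _ in range(steps):
        f0 = objective(softmax(z))
        grad = []
        for i in range(I):              # finite-difference gradient w.r.t. logits
            z[i] += eps
            grad.append((objective(softmax(z)) - f0) / eps)
            z[i] -= eps
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return softmax(z)

# toy convex objective: squared distance to an infeasible point, so the
# minimizer is (approximately) its projection onto the simplex
obj = lambda p: (p[0] - 0.7) ** 2 + (p[1] - 0.5) ** 2 + (p[2] + 0.1) ** 2
p_opt = minimize_on_simplex(obj, 3)
```

Because the softmax never reaches the boundary exactly, probabilities near zero are only approached asymptotically, which is compatible with the flattening update of equation 3.18 keeping all probabilities strictly positive.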
6 LAOCOON Algorithm
Finally, the complete learning adaptive output codes of neural networks algorithm (LAOCOON) reads as:

1. Initialization. Code matrix {t*_k}_{k=1}^K and association probabilities 𝒫_k*, for k = 1, …, K; component parameters θ*_d, for d = 1, …, D; regularization parameters λ_t, λ_ϑ; and threshold parameter ε.

2. E-step. Compute expectation terms⁶

P*_{t|y} = E(𝒯(t*_{k,c}), 𝒫_k*, Θ*),

for k = 1, …, K, w.r.t. parameters Θ* and variables 𝒫_k*.

3. M-step. Solve the nonlinear regression problem for all d = 1, …, D,

θ_d = M_H({B_d(ω), 𝒯(t*_{k,c})}, P*_{t|y}),

and the convex optimization problem for all k = 1, …, K (asynchronously),

𝒫_k = M_P(𝒯(t*_{k,c}), P*_{t|y}) ⇒ Step 4.

4. Update. Center target vector and association probabilities:

t_{k,c} ← U_T(𝒯(t*_{k,c}), 𝒫_k),  𝒫_k ← U_P(𝒫_k).

5. If P[t_{k,c}] > 1 − ε for all k = 1, …, K, then stop. Otherwise t*_{k,c} ← t_{k,c}, 𝒫_k* ← 𝒫_k ⇒ Step 2.

Whereas in each pass of the M-step the PiC parameters θ_d are initialized due to the Brownian term in equation 4.1, the initial conditions for optimizing the probabilities of the target vectors are provided by the analytical solution of the unregularized case (λ_t = λ_ϑ = 0),

P[t_{k,i}] = (1/M_k) Σ_{m=1}^{M_k} P*[t_{k,i} | y^{k,m}(Θ*)].   (6.1)

The algorithm terminates when all association probabilities have converged. Experiments showed that the center vector probabilities always converge to a value near 1, and, as a consequence, all noncenter vectors

⁶ Here, we again have used P*_{t|y} to denote the conditional probability of equation 3.12.
end up with a probability near 0. For the numerical results in section 7, the parameter ε is set to 0.1.⁷ After the probabilities of target vectors have converged, the parallel PiC components of the classifier are finally trained by means of least-squares learning via conventional backprop-like algorithms (see section 4).

7 Experimental Results
The numerical experiments were carried out on the vowel database,⁸ the segmentation database,⁹ the glass database,¹⁰ and two digit databases¹¹ from the UCI repository of machine learning databases (Blake, Keogh, & Merz, 1998). The PiC components for all numerical experiments regarding the vowel database and the segmentation database are first-order polynomial classifiers (Eigenmann & Nossek, 1998) of three hidden neurons, which form a soft boolean function (AND) of linear classifiers (see section C.2). All other experiments presented in this work have been performed using a multilayer perceptron (MLP) (Bishop, 1995) architecture with one hidden layer. The number of hidden neurons is varied (one and three, respectively). The numbers #1 through #5 on the horizontal axis in Figures 8 through 14 indicate independent realizations of the named output codes (ECOC, OPC, LAOCOON). For each realization of an output code, the results of 10 independent final training runs of the PiC components are presented (including the average error rates indicated by short horizontal lines). The horizontal line in Figures 8 and 9, which is labeled "nearest neighbor," represents the error rate of the nearest-neighbor classifier applied to the test examples. The nearest-neighbor classifier is a nonparametric candidate among classifiers and serves as a kind of reference approach for the discussed classification system. All numerical results are based on the final training of the PiC components, according to the plain optimization method of equation 4.1, where the integration of the training epoch is extended until a stopping rule (monitoring of the training error) terminates the process. During the training of the PiC components, we apply no method to avoid overfitting. This way, the regularization effect of the optimized code can be seen undistorted.

⁷ Experiments showed that this threshold suffices to guarantee that the center vectors would not change during additional iterations.
⁸ ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/connectionist-bench/vowel/.
⁹ ftp://ftp.ics.uci.edu/pub/machine-learning-databases/image/.
¹⁰ ftp://ftp.ics.uci.edu/pub/machine-learning-databases/glass/.
¹¹ ftp://ftp.ics.uci.edu/pub/machine-learning-databases/optdigits; ftp://ftp.ics.uci.edu/pub/machine-learning-databases/pendigits.
Figure 4: Learning curves of the PiC components and the objective function of the target vectors. The asterisks in the bottom plot mark every training epoch where at least one center vector changes.
7.1 Learning Curves. The PiC components and the code matrix were simultaneously learned by the LAOCOON algorithm of section 6.¹² Figure 4 shows the typical learning curves of the PiC classifiers and of the objective function of the code matrix, as a function of the number of applied training
12 All numerical experiments in section 7.1 have been performed on the vowel database.
Figure 5: Learning curves of the averaged Hamming distances. The asterisks in the plots mark every training epoch where at least one center vector changes.
epochs. Figure 5 shows the curves of the averaged Hamming distances between target vectors and vectors of dichotomies. The averaged Hamming distances decrease because the initial code matrix is generally chosen randomly, and random matrices have large Hamming distances. Hence, during optimization of the code matrix, the distances between the rows and the columns are more likely to become smaller rather than larger.
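The claim that randomly initialized code matrices start with large Hamming distances can be checked numerically: two independent uniform ±1 vectors of dimension D disagree in D/2 positions on average. A quick simulation (our own illustration, with D = 11 as in the vowel experiments):

```python
import random

# Empirical check, not from the paper: average Hamming distance between
# independent uniform ±1 vectors of dimension D is D / 2.

def hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b))

random.seed(0)
D, trials = 11, 20000
avg = sum(hamming([random.choice((-1, 1)) for _ in range(D)],
                  [random.choice((-1, 1)) for _ in range(D)])
          for _ in range(trials)) / trials
# avg is close to D / 2 = 5.5
```

Since optimized codes in the experiments end up with average distances below this random baseline, the observed shrinking of the curves in Figure 5 is the expected behavior.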
Figure 6: Learning curves of the probabilities of the random variables of class V9 .
Finally, Figure 6 presents the expected binary target values¹³ ⟨t_9^d⟩ of all PiC components for class Ω_9 (top) and the probabilities P[t_{9,i}] associated with the target vectors t_{9,i} ∈ 𝒯_9 of class Ω_9 (bottom). The asterisks in Figures 4 and 5 mark every training epoch where at least one center vector t_{k,c} changes by means of the update operations U_P (see equation 3.17) and U_T (see

¹³ ⟨t_k^d⟩ = Σ_{i=1}^I t_{k,i}^d P[t_{k,i}].
equation 3.13). The bottom curves in Figure 6 have to be interpreted with care. If the set 𝒯_k is altered after a training epoch, the ith element t_{k,i} changes as well (cf. Figure 2). Thus, a single curve in the right-hand side of Figure 6 describes the random variable of different target vectors at different training epochs. On the other hand, it represents the random variable of the same relative target vector in the neighborhood of the changing center vector. After approximately 80 training epochs, a unique solution of the target vector for class Ω_9 was obtained.

7.2 Vowel Database. The vowel database consists of instances of 11 British English vowels (K = 11). Fifteen individual speakers uttered each vowel six times; eight of the speakers are used for the training data (M_k = 8 · 6 = 48, and M = 11 · 48 = 528), the others for testing the generalization performance. By means of digital speech processing, 10 real-valued features (N = 10) were extracted from the recorded speech signals. In two different scenarios, the number of parallel PiC components was D = 11 and D = 22, respectively.

First, the regularization parameters were set to λ_{t,0} = 0.005 and λ_{ϑ,0} = 0.05. After approximately 50 training epochs, the 11-PiCs LAOCOON codes have lower generalization error rates than comparable OPC and ECOC methods (see Figure 7). It is remarkable that the 11-PiCs ECOC codes yield no better performance than the OPC codes. To produce better ECOC codes, far more component classifiers than classes have been proposed in Dietterich and Bakiri (1995). In contrast to ECOC codes, the LAOCOON algorithm is capable of improving the generalization performance with the same number of PiCs as for OPC codes. The averaged Hamming distances of the LAOCOON codes are between 4.25 and 5.24 for target vectors and between 3.33 and 3.71 for dichotomies. However, the minimal distances are 1 and 0, respectively.¹⁴ On the other hand, the average Hamming distances of the ECOC codes are about 5.7 (target vectors) and 4.4 (dichotomies). Although the LAOCOON codes are less error correcting due to the smaller Hamming distances, the dichotomies from LAOCOON codes are better suited than those from ECOC codes. On the other hand, the two-class problems are more complex than the OPC ones (cf. training results) but do increase the generalization performance. Thus, the LAOCOON algorithm reduces overfitting in this application (vowel database). In further experiments using 22-bit code vectors (D = 22), the resulting generalization error rates are even closer to the nearest-neighbor border

¹⁴ Although a minimal Hamming distance of 0 is forbidden due to the logarithm in H, the finite numerical realization of log(0) nevertheless allows the generation of zero distances. Then the Hamming distance of 0 between two dichotomies may be interpreted as the combination of two PiC components based on the same two-class problem (cf. Breiman, 1996).
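The distance figures quoted here can be reproduced for any code matrix with a few lines; the sketch below (our helper names, not the paper's) computes the pairwise Hamming distances between rows (target vectors) and between columns (dichotomies) of a ±1 code:

```python
from itertools import combinations

# Illustrative code-quality check: average/minimal Hamming distances of a
# code matrix. A column pair at distance 0 (or, symmetrically, a pair that
# is exactly complementary) encodes the same two-class problem.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def row_distances(code):
    """Hamming distances between all unordered pairs of target vectors."""
    return [hamming(r, s) for r, s in combinations(code, 2)]

def col_distances(code):
    """Hamming distances between all unordered pairs of dichotomy vectors."""
    cols = list(zip(*code))
    return [hamming(c, e) for c, e in combinations(cols, 2)]

opc = [[1 if d == k else -1 for d in range(4)] for k in range(4)]
rows = row_distances(opc)   # every OPC row pair differs in exactly 2 bits
cols = col_distances(opc)   # so does every OPC column pair
```

For OPC every pairwise distance equals 2 regardless of K, which is why OPC codes have no error-correcting margin to spare and why ECOC needs long codes to build one.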
Figure 7: Comparison of ECOC, OPC, and LAOCOON on the Vowel Database. All codes have 11-bit target vectors. Each code—1 OPC Code, 2 ECOC (#1,#2) codes, and 5 LAOCOON (#1–#5) codes—was trained 10 times. For each training result, the percentage of errors is depicted. For each code, a horizontal line represents the arithmetic average of the 10 training runs.
(see Figure 8). The regularization parameters were set to λ_{t,0} = 0.01 and λ_{ϑ,0} = 0.1. Because there are only K = 11 classes, the standard OPC is an 11-PiCs code. To compare equal code sizes, we also used a 22-PiCs OPC code, which uses each OPC dichotomy twice. The 22-PiCs LAOCOON codes significantly outperform the other approaches. The improvement of the 22-PiCs OPC code shows the benefit of averaging. The results of the 22-PiCs ECOC codes prove the need for a certain code length for ECOC codes to be effective.

7.3 Segmentation Database. The segmentation database includes instances of 3 × 3 pixel regions from images of seven different textures (K = 7): brickface, sky, foliage, cement, window, path, and grass. The database is

Figure 8: Comparison of ECOC, OPC, and LAOCOON on the Vowel Database. All codes have 22-bit target vectors (except standard OPC). Each code—2 OPC codes, 2 ECOC (#1, #2) codes, and 5 LAOCOON (#1–#5) codes—was trained 10 times. For each training result, the percentage of errors is depicted. For each code, a horizontal line represents the arithmetic average of the 10 training runs.

divided into M = 210 training examples and 2100 test examples. Based on the position and color information of the pixel regions, 19 features (N = 19) were extracted from the samples. The number of parallel PiC components was D = 7. The regularization parameters were set to λ_{t,0} = 0.005 and λ_{ϑ,0} = 0.01. The LAOCOON codes clearly outperform the other coding approaches (see Figure 9). One ECOC solution is nearly equivalent to the OPC code. The averaged Hamming distances of the LAOCOON codes are between 2.57 and 3.43 for target vectors and between 2.00 and 2.43 for dichotomies. The minimal distances are again 1 and 0, respectively. The average Hamming distances of the ECOC codes are between 3.71 and 3.91 (target vectors) and 2.46 and 2.64 (dichotomies). In contrast to the experiments in section 7.2, the LAOCOON codes perform best for both the test and the training data. In this application, the LAOCOON algorithm finds two-class problems that are more adequate than the OPC ones. Figure 10 demonstrates the better performance of the component classifiers on the LAOCOON two-class problems.
Figure 9: Comparison of ECOC, OPC, and LAOCOON on the Segmentation Database. All codes have 7-bit target vectors. Each code—1 OPC Code, 2 ECOC (#1,#2) codes, and 3 LAOCOON (#1–#3) codes—was trained 10 times. For each training result, the percentage of errors is depicted. For each code, a horizontal line represents the arithmetic average of the 10 training runs.
7.4 Glass Database. The glass database consists of instances of six different types of glass (K = 6). Each instance is described by its refractive index and the concentrations of certain chemical elements, yielding a total of nine features (N = 9). The number of instances is not equally distributed over the glass types (see Table 1). For each class, 70% of the instances are used for training. Thus, the database is divided into M = 148 training and 66 test patterns. All experiments have been performed on square code matrices (D = 6). The regularization parameters were set to λ_{t,0} = λ_{ϑ,0} = 0.001. The averaged Hamming distances of the ECOC codes are about 3.1 for target vectors and about 2.1 for dichotomies. The minimal distances are between 1 and 2. The distances of the LAOCOON codes are only slightly
Figure 10: Performance of the PiC components on the ECOC, OPC, and LAOCOON two-class problems of the Segmentation Database. Each code—1 OPC Code, 2 ECOC (#1,#2) codes, and 3 LAOCOON (#1–#3) codes—was trained 10 times. For each training result, the percentage of the averaged error of the components is depicted. For each code, a horizontal line represents the arithmetic average of the 10 training runs.
smaller: approximately 2.9 for target vectors and 2.0 for dichotomies. The minimal distances are again 1 and 2. Again, the LAOCOON codes perform better than the two other approaches (see Figure 11). In contrast to the preceding databases, in this case the LAOCOON algorithm does not reduce the Hamming distances significantly. Furthermore, even the equally sized ECOC codes outperform the OPC approach. Thus, optimal solutions seem to have large distances, but an arbitrary assignment of target vectors to classes (ECOC codes) degrades the performance. It is worth noticing that the performance of the LAOCOON codes does not change with the number of hidden neurons, whereas ECOC and OPC coding works better with more neurons. This means that for ECOC and OPC, the performance suffers from the linear approximation of the decision boundaries. On the other hand, the LAOCOON algorithm yields codes whose decision boundaries do not suffer from linear approximation. Therefore, an increased flexibility of the PiC components is not needed.

Table 1: Distribution of Samples for the Glass Database.

  Class Denotation                        Number of Instances
  Float-processed building window                  70
  Float-processed vehicle window                   17
  Non-float-processed building window              76
  Container                                        13
  Tableware                                         9
  Headlamp                                         29

Figure 11: Comparison of ECOC, OPC, and LAOCOON on the Glass Database. All codes have 6-bit target vectors. Each code—1 OPC code, 2 ECOC (#1, #2) codes, and 4 LAOCOON (#1–#4) codes—was trained 10 times. For each training result, the percentage of errors is depicted. For each code, a horizontal line represents the arithmetic average of the 10 training runs. The upper half shows the test error for PiC components with one hidden neuron, the lower half for components with three hidden neurons.
7.5 Pen-Based Recognition Digit Database. The pendigit database comprises instances of hand-written digits (K = 10). Forty-four people wrote 250 instances each; 30 writers are used for training and 14 for testing. Due to some rejected instances, the total number of training patterns is M = 7494, and the number of test patterns is 3498. These are nearly evenly distributed over the 10 classes. The writing was performed on a pressure-sensitive tablet, which recorded the information used for generating 16 attributes (N = 16). The number of dichotomies is D = 10, and the regularization parameters were set to λ_{t,0} = 0.002 and λ_{ϑ,0} = 0.001. The averaged Hamming distances of the LAOCOON codes for the case of three hidden neurons are between 4.89 and 5.38 for target vectors and between 3.62 and 4.05 for dichotomies. The minimal distances are 2 to 4 and 1 to 3, respectively. For the case of only one hidden neuron, the distances decrease to 4.07 to 5.16 and 3.62 to 3.93 (averaged), and 2 to 3 and 1 to 2 (minimum). The Hamming distances of the ECOC codes are 5.244 to 5.311 and 4.04 to 4.11 (averaged), and 3 (minimum). Although the difference of the distances is not very large, the LAOCOON codes for the one-neuron case have smaller distances than for the three-neuron case in all categories. This suggests that the decreased flexibility of the PiC components forces the algorithm to produce codes that are easier to learn and less error correcting. For this database, the OPC approach seems to yield the optimal code, as it clearly outperforms the two other coding techniques (see Figure 12). But still the LAOCOON results are clearly better than arbitrary ECOC codes. Our interpretation is that the LAOCOON algorithm starts from a random code and optimizes it, but gets stuck in a local optimum. Various attempts to start the algorithm from a code near the OPC code showed no effect. The code matrix did not change at all during the optimization process.
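The averaged and minimal Hamming distances reported throughout these experiments can be computed directly from a ±1 code matrix. The sketch below is a minimal illustration under the conventions of this article (rows are class target vectors, columns are dichotomies, and a dichotomy and its complement define the same split); the function and variable names are hypothetical, not taken from the authors' implementation:

```python
import numpy as np

def hamming_stats(code):
    """Averaged and minimal Hamming distances of a +/-1 code matrix.

    Rows are class target vectors; columns are dichotomies. For
    dichotomies, the distance to the complement of the other column
    is also taken into account.
    """
    code = np.asarray(code)
    K, D = code.shape
    # Pairwise distances between target vectors (rows).
    row_d = [np.sum(code[i] != code[j])
             for i in range(K) for j in range(i + 1, K)]
    # Pairwise distances between dichotomies (columns), considering
    # the complement of the second column as well.
    col_d = [min(np.sum(code[:, d] != code[:, e]),
                 np.sum(code[:, d] != -code[:, e]))
             for d in range(D) for e in range(d + 1, D)]
    return (np.mean(row_d), min(row_d), np.mean(col_d), min(col_d))

# One-per-class (OPC) code for K = 4 classes: +1 on the diagonal.
opc = 2 * np.eye(4) - 1
print(hamming_stats(opc))
```

For a standard OPC code every pair of target vectors differs in exactly two bits, so both the averaged and the minimal target-vector distance equal 2.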
In contrast to the glass database, the performance of the LAOCOON codes increased significantly with the number of hidden neurons. Furthermore, the "error distance" between OPC and LAOCOON clearly decreases for the case of three neurons. This suggests that the problem of getting stuck in a local optimum is not that severe for more flexible PiC components.

7.6 Optical Recognition Digit Database. The optdigit database consists of instances of hand-written digits (K = 10) as well. Thirty people produced M = 3823 training samples, and another 13 persons wrote 1797 test instances. The feature extraction was performed according to National Institute of Standards and Technology preprocessing methods and yielded 64 attributes (N = 64). As for the former database, the instances are nearly evenly distributed over the 10 classes. The number of dichotomies is D = 10, and the regularization parameters were set to λ_{t,0} = 0.0001 and λ_{ϑ,0} = 0.001. For both cases (three and one hidden neurons), the averaged Hamming distances of the LAOCOON codes are about 5.2 for target vectors and 4 for dichotomies. The minimal distances are 3 and 2, respectively. The Hamming
Figure 12: Comparison of ECOC, OPC, and LAOCOON on the Pendigit Database. All codes have 10-bit target vectors. Each code—1 OPC Code, 2 ECOC (#1,#2) codes, and 4 LAOCOON (#1–#4) codes—was trained 10 times. For each training result, the percentage of errors is depicted. For each code, a horizontal line represents the arithmetic average of the 10 training runs. The upper half shows the test error for PiC components with one hidden neuron, the lower half for components with three hidden neurons.
distances of the ECOC codes are the same as for the pendigit database, since the size of the code matrix is the same. Again, the OPC approach is the best in most cases (see Figure 13). But in contrast to the pendigit data set, training on the OPC code results in very high error rates in a few cases. However, these training runs yield very high training error rates as well. Retraining could be triggered without using a validation or test set. Hence, these far-off values should not be considered. Nevertheless, it is worth noticing that these failures in training do not occur with the ECOC and LAOCOON approaches.
Figure 13: Comparison of ECOC, OPC, and LAOCOON on the Optdigit Database. All codes have 10-bit target vectors. Each code—1 OPC code, 2 ECOC (#1, #2) codes, and 4 LAOCOON (#1–#4) codes—was trained 10 times. For each training result, the percentage of errors is depicted. For each code, a horizontal line represents the arithmetic average of the 10 training runs. The upper half shows the test error for PiC components with one hidden neuron, the lower half for components with three hidden neurons.
As for the pendigit dataset, the LAOCOON algorithm seems to get stuck in local minima. This is an effect that decreases with the increasing flexibility of the PiC components (see Figure 14).

8 Conclusion
We have developed a theoretical framework for transforming the combinatorial problem of finding an optimal output code into a continuous optimization task. The main feature of this work is the introduction of association
Figure 14: Averaged performance of ECOC and LAOCOON codes relative to OPC coding. As performance measure, the negative test error percentage is used. The horizontal line indicates the OPC error level. This figure summarizes the results depicted and described in sections 7.2 through 7.6.
probabilities that describe the fuzziness of target vectors during the optimization. By means of the EM algorithm, the fuzzy representation of target vectors relaxes into a hard output-coding solution. This algorithm resembles the ML approach for density estimation problems in a reverse manner. Instead of searching for the mixture components of unknown distributions, we search for PiCs that map the input vectors into the decision space in such a way that the decision boundaries of the designed distributions of classes are optimal w.r.t. the generalization error. The LAOCOON algorithm uses this framework to establish a balance between large Hamming distances, overfitting, and adequate two-class problems of the optimized output code. To this end, the complete optimization is based on a sequence of nonlinear regression problems and convex optimization tasks. By modifying the overall objective function, future research could incorporate further aspects of the code design. The experimental results show significant gains in generalization performance from the application of the LAOCOON algorithm to output code design. However, the results of some experiments lead us to believe that for many polychotomous classification problems, the OPC method is still the optimal choice for the output coding of multiple classes (cf. OCR applications; Utschick, 1998).
Regularization on the manifold of the possible decision functions of the classifier is a standard method to improve the generalization properties of classification systems. Following the heuristic approach of Dietterich and Bakiri (1995), the optimization of output codes in the presented framework is an implementation of this objective. The algorithm presented in this article can be effective only if either the PiC components are not flexible enough to deal with the dichotomous classification tasks or the applied regularization method does not prevent overfitting. The results of LAOCOON and ECOC codes show a fairly wide range of performance. In order to predict which code to use, a validation data set might help. However, we performed all numerical experiments without using a validation set. Although the experimental results in this work were obtained solely using neural network–like components, the only fundamental restriction on the PiCs is their capability to solve regression problems. Thus, arbitrary plug-in classification systems (e.g., support vector machines; Cortes & Vapnik, 1995; Schölkopf, 1997) could be used, and the combination of heterogeneous classification systems should also be considered in future research. Since the LAOCOON algorithm produces a solution that depends on the properties of the PiCs, it is generally not reasonable to transfer optimized LAOCOON codes from one classification system to another.

Appendix A: Expectation Maximization
The EM algorithm is a general method for solving ML estimation problems given incomplete data (Dempster, Laird, & Rubin, 1977; Redner & Walker, 1984; Moon, 1996; McLachlan & Krishnan, 1997). Let

$$ L_{\mathrm{ML}}(\theta) = \sum_{m=1}^{M} \log p(x^m;\theta) \tag{A.1} $$

denote the log-likelihood function of the training set S, where p(x; θ), which depends on a set of parameters θ, denotes the probability density function of input patterns x. Taking the conditional expectation at parameters θ*, w.r.t. suitably introduced data (y^m | x^m), one obtains

$$ L_{\mathrm{ML}}(\theta) = L(\theta,\theta^*) - Q(\theta,\theta^*), \tag{A.2} $$

where

$$ L(\theta,\theta^*) = \sum_{m=1}^{M} \Big\langle \log p(x^m, y^m;\theta) \Big\rangle_{p(y\mid x;\theta^*)}, \tag{A.3} $$

$$ Q(\theta,\theta^*) = \sum_{m=1}^{M} \Big\langle \log p(y^m \mid x^m;\theta) \Big\rangle_{p(y\mid x;\theta^*)}. \tag{A.4} $$
It can be shown that to maximize the log-likelihood function L_ML(θ), it suffices to maximize L(θ, θ*). The term Q(θ, θ*) can be ignored because Q(θ, θ*) ≤ Q(θ*, θ*) for all θ ≠ θ* (Jensen's inequality). The idea of the EM method is to produce a simpler ML estimation problem by an appropriate choice of "complete data." The most important prerequisite for the practicability of the EM algorithm is that the original likelihood function of the "incomplete data" x^m can be obtained from the likelihood function of the unknown "complete data" (y^m | x^m) via "marginalization," that is, by integrating out the so-called complete data. For mixture density estimation tasks, it can be shown that

$$ \prod_{m=1}^{M} p(x^m;\theta) = \int_{\mathcal{Y}^1}\!\cdots\!\int_{\mathcal{Y}^M} \prod_{m=1}^{M} p(x^m, y^m;\theta)\, dy^1 \cdots dy^M \tag{A.5} $$

must hold. Here 𝒴^m denotes the domain of the y^m data. The outline of the EM algorithm is as follows. The algorithm starts with an arbitrary initial guess, and θ* denotes the current estimate of the parameters θ after a number of training epochs. The next iteration consists of two steps: the E-step (expectation),

$$ L(\theta,\theta^*) = \int_{\mathcal{Y}^1}\!\cdots\!\int_{\mathcal{Y}^M} \left[\sum_{m=1}^{M} \log p(x^m, y^m;\theta)\right] \prod_{m=1}^{M} p(y^m \mid x^m;\theta^*)\, dy^1 \cdots dy^M, \tag{A.6} $$

and the M-step (maximization),

$$ \max_{\theta}\ L(\theta,\theta^*). \tag{A.7} $$
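The two steps can be made concrete with a toy instance: EM for a two-component one-dimensional gaussian mixture, where the complete data are the unknown component labels. This is an illustrative sketch (unit variances and hypothetical names are assumptions for brevity), not part of the output-code algorithm itself:

```python
import numpy as np

def em_gmm(x, n_iter=50):
    """EM for a two-component 1-D gaussian mixture with unit variances.

    E-step: posterior responsibilities p(y | x; theta*).
    M-step: closed-form maximization of L(theta, theta*).
    """
    mu = np.array([x.min(), x.max()])  # arbitrary initial guess
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        lik = pi * np.exp(-0.5 * (x[:, None] - mu[None, :]) ** 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and component means.
        pi = resp.mean(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return pi, mu

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
pi, mu = em_gmm(x)
print(pi, mu)  # mixing weights near 0.5, means near -3 and 3
```

Each iteration increases the log-likelihood, which is the convergence property stated next.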
If L(θ, θ*) is continuous in both θ and θ*, it has been proven that the algorithm converges to a local maximum of the log-likelihood function L_ML(θ).

Appendix B: Objective Function

B.1 Maximum-Likelihood Part. Each y^m is formally assigned to a single t^m. In the EM context, y^m is termed the incomplete (known) data, and t^m is called the complete (unknown) data. As a consequence, the expected value of the likelihood function is estimated as follows:

$$ L = -\sum_{t^1}\sum_{t^2}\cdots\sum_{t^M}\left[\sum_{m=1}^{M} \log\!\left(p(y^m \mid t^m)\,P[t^m]\right)\right]\prod_{m=1}^{M} P^*[t^m \mid y^m]. \tag{B.1} $$
(The sum Σ_{t^m} runs over all target vectors t^m associated with the mth pattern.) It is worth noticing that for the calculation of the expectation, P*[t^m | y^m(Θ*)] (≡ P*[t^m | y^m]) is used rather than P[t^m]. Transforming equation B.1 yields

$$ L = -\sum_{m=1}^{M}\sum_{t^m} \log\!\left(p(y^m \mid t^m)\,P[t^m]\right) P^*[t^m \mid y^m] \times \left[\sum_{t^1}\cdots\sum_{t^{m-1}}\sum_{t^{m+1}}\cdots\sum_{t^M}\ \prod_{\substack{n=1\\ n\neq m}}^{M} P^*[t^n \mid y^n]\right]. \tag{B.2} $$

Because of the identity $\sum_{t^n} P[t^n \mid y^n] = 1$, the equation reduces to

$$ L = -\sum_{m=1}^{M}\sum_{t^m} \log\!\left(p(y^m \mid t^m)\,P[t^m]\right) P^*[t^m \mid y^m]. \tag{B.3} $$
For further transformations of L, we direct the reader's attention to the target vectors. The patterns from the same class are obviously represented by the same target vector: t^{k,m} = t^k for all m = 1, …, M_k. Hence, the sum over all training patterns is separated into K partial sums, each belonging to a single class. For the final formula of the resulting objective function, L = L_H + L_P is split into two parts (cf. section 3.2):

$$ L_H = -\sum_{k=1}^{K}\sum_{m=1}^{M_k}\sum_{i=1}^{I} \log\!\left(p(y^{k,m}(\Theta) \mid t^{k,i})\right) P^*[t^{k,i} \mid y^{k,m}(\Theta^*)], \tag{B.4} $$

$$ L_P = -\sum_{k=1}^{K}\sum_{m=1}^{M_k}\sum_{i=1}^{I} \log\!\left(P[t^{k,i}]\right) P^*[t^{k,i} \mid y^{k,m}(\Theta^*)]. \tag{B.5} $$
According to Bayes' rule, the P*[t^{k,i} | y^{k,m}(Θ*)] are calculated by

$$ P^*[t^{k,i} \mid y^{k,m}(\Theta^*)] = \frac{p(y^{k,m}(\Theta^*) \mid t^{k,i})\,P^*[t^{k,i}]}{\sum_{j=1}^{I} p(y^{k,m}(\Theta^*) \mid t^{k,j})\,P^*[t^{k,j}]}. \tag{B.6} $$
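With the gaussian likelihood of appendix C (equation C.1), equation B.6 amounts to a softmax over negative squared distances. A minimal numeric sketch (the array names are hypothetical; `y` is a component-output vector for one pattern, `targets` the candidate target vectors of its class):

```python
import numpy as np

def association_probs(y, targets, prior, sigma=1.0):
    """P*[t^{k,i} | y^{k,m}] via Bayes' rule with a gaussian likelihood.

    y       : (D,) output of the PiC components for one pattern
    targets : (I, D) candidate target vectors t^{k,i} of class k
    prior   : (I,) current association probabilities P*[t^{k,i}]
    """
    sq_dist = np.sum((targets - y) ** 2, axis=1)
    w = prior * np.exp(-sq_dist / (2.0 * sigma ** 2))
    return w / w.sum()  # normalization is the denominator of B.6

y = np.array([0.9, -0.8, 0.7])
targets = np.array([[1, -1, 1], [1, 1, -1]])
p = association_probs(y, targets, prior=np.array([0.5, 0.5]))
print(p)  # the first target is much closer, so its probability dominates
```

The probabilities always sum to one, and the target vector nearest to the component output receives most of the mass.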
B.2 Regularization Part. Since the final target vectors t^k are not known, the expectation of the product of distances is taken as

$$ H_t = -\sum_{k=1}^{K}\sum_{\substack{\ell=1\\ \ell\neq k}}^{K} P[\Omega^k]\,P[\Omega^\ell] \sum_{i=1}^{I}\sum_{j=1}^{I} P[t^{k,i}]\,P[t^{\ell,j}]\,\log\!\left(\mathrm{dist}\!\left(t^{k,i}, t^{\ell,j}\right)\right). \tag{B.7} $$

The prior probabilities of the classes are thereby taken into account.
Figure 15: The joint probability P[ϑ_d^k ∧ ϑ_e^k] of the kth bits of two dichotomies equals the sum of the probabilities of all possible target vectors that support these two bits.
The regularization term H_ϑ for dichotomies is constructed analogously to H_t. The basic element is the logarithm of the product of the distances of all possible pairs of dichotomies, where the complements of dichotomies have to be taken into account. The final target vectors and dichotomies are not known. However, whereas the target vectors of different classes may vary independently according to the sets T^k, the vectors of dichotomies cannot vary independently. The expectation is denoted by

$$ H_\vartheta = -\sum_{d=1}^{D}\sum_{\substack{e=1\\ e\neq d}}^{D}\sum_{\vartheta^d}\sum_{\vartheta^e} P[\vartheta^d \wedge \vartheta^e]\,\log\!\left(\mathrm{dist}\!\left(\vartheta^d,\vartheta^e\right)\cdot\mathrm{dist}\!\left(\vartheta^d,-\vartheta^e\right)\right). \tag{B.8} $$

Here Σ_ϑ denotes the sum over all dichotomy vectors ϑ ∈ {+1, −1}^K. The joint probabilities can be derived from the association probabilities by (see Figure 15)

$$ P[\vartheta^d \wedge \vartheta^e] = \prod_{k=1}^{K} P[\vartheta_d^k \wedge \vartheta_e^k] = \prod_{k=1}^{K}\left(\sum_{i=1}^{I}\delta_{d\wedge e}^{k,i}\,P[t^{k,i}]\right), \tag{B.9} $$

where

$$ \delta_{d\wedge e}^{k,i} = \begin{cases} 1 & t_d^{k,i} = \vartheta_d^k \ \wedge\ t_e^{k,i} = \vartheta_e^k \\ 0 & \text{otherwise.} \end{cases} \tag{B.10} $$
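Per class, equations B.9 and B.10 just sum the association probabilities of all candidate target vectors whose dth and eth bits take the requested values; the full joint probability is then the product of these per-class factors. A minimal sketch for one class (variable names are hypothetical):

```python
import numpy as np

def joint_bit_prob(assoc, targets, d, e, bit_d, bit_e):
    """Per-class factor P[theta_d^k ^ theta_e^k] (equations B.9/B.10).

    assoc   : (I,) association probabilities P[t^{k,i}] of class k
    targets : (I, D) candidate target vectors of class k, entries +/-1
    """
    # delta^{k,i}_{d^e} = 1 iff the target vector supports both bits.
    support = (targets[:, d] == bit_d) & (targets[:, e] == bit_e)
    return assoc[support].sum()

targets = np.array([[1, 1, -1], [1, -1, -1], [-1, 1, 1]])
assoc = np.array([0.6, 0.3, 0.1])
# Probability that, for this class, bit 0 is +1 and bit 1 is +1:
print(joint_bit_prob(assoc, targets, 0, 1, 1, 1))
```

Multiplying such factors over all K classes yields P[ϑ^d ∧ ϑ^e] of equation B.9.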
(The prior probabilities of the classes are considered in appendix D.) Furthermore, the distance to the dichotomy vector (1, 1, …, 1)^T is also incorporated into the final regularization term, in order to avoid the trivial dichotomy:

$$ -\sum_{d=1}^{D}\sum_{\vartheta^d} P[\vartheta^d]\,\log\!\left(\mathrm{dist}\!\left(\vartheta^d,(1,1,\ldots,1)^T\right)\cdot\mathrm{dist}\!\left(\vartheta^d,-(1,1,\ldots,1)^T\right)\right), \tag{B.11} $$

with

$$ P[\vartheta^d] = \prod_{k=1}^{K} P[\vartheta_d^k] = \prod_{k=1}^{K}\left(\sum_{i=1}^{I}\delta_d^{k,i}\,P[t^{k,i}]\right), \tag{B.12} $$

$$ \delta_d^{k,i} = \begin{cases} 1 & t_d^{k,i} = \vartheta_d^k \\ 0 & \text{otherwise.} \end{cases} \tag{B.13} $$
B.3 Regularization Parameters. The λ = (λ_t, λ_ϑ)^T are called regularization parameters. In order to make the balance of L, H_t, and H_ϑ independent of M, the number of training examples, and D, the number of PiC components, the parameters are modified as follows:

$$ \lambda_t \leftarrow M \cdot \lambda_{t,0}, \tag{B.14} $$

$$ \lambda_\vartheta \leftarrow \frac{M}{D^2} \cdot \lambda_{\vartheta,0}. \tag{B.15} $$
The parameter λ_t needs no modification w.r.t. K, the number of classes Ω^k, because the prior probabilities P[Ω^k] and P[Ω^ℓ] are already considered within the definition of H_t.

Appendix C: PiC Learning

C.1 Nonlinear Regression. Assuming the p(y^{k,m}(Θ) | t^{k,i}) to be gaussian leads to

$$ p(y^{k,m}(\Theta) \mid t^{k,i}) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{\!D} \exp\!\left(-\frac{1}{2\sigma^2}\,\mathrm{dist}\!\left(y^{k,m}(\Theta), t^{k,i}\right)^2\right). \tag{C.1} $$
Thus, L_H becomes

$$ L_H = -\sum_{k=1}^{K}\sum_{m=1}^{M_k}\sum_{i=1}^{I} \left[-\frac{\mathrm{dist}\!\left(y^{k,m}(\Theta), t^{k,i}\right)^2}{2\sigma^2} - \log\!\left(\left(\sqrt{2\pi}\,\sigma\right)^{\!D}\right)\right] P^*[t^{k,i} \mid y^{k,m}(\Theta^*)]. \tag{C.2} $$
The asterisk denotes the parameters of the previous iteration step. Hence, omitting the constant term $\log\!\left((\sqrt{2\pi}\,\sigma)^D\right) P^*[t^{k,i} \mid y^{k,m}(\Theta^*)]$, L_H becomes

$$ L_H = \frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\sum_{i=1}^{I}\sum_{d=1}^{D}\left(y^{k,m,d}(\theta_d) - t_d^{k,i}\right)^2 P^*[t^{k,i} \mid y^{k,m}(\Theta^*)]. \tag{C.3} $$

Obviously the optimization of L_H can be divided into separate optimization problems for each dichotomy:

$$ L_{H,d} = \frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\sum_{i=1}^{I}\left(y^{k,m,d}(\theta_d) - t_d^{k,i}\right)^2 P^*[t^{k,i} \mid y^{k,m}(\Theta^*)]. \tag{C.4} $$
This expression can be interpreted as learning each vector x^{k,m} w.r.t. multiple (I) target values t_d^{k,i}. As the target values are binary, t_d^{k,i} ∈ {1, −1}, this means that the vector x^{k,m} is trained on −1 and 1 simultaneously, weighted by the terms P*[t^{k,i} | y^{k,m}(Θ*)]. After a few transformations, where terms constant w.r.t. Θ are subsumed as "const.," we obtain

$$
\begin{aligned}
L_{H,d} ={}& \frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\sum_{i=1}^{I}\left(y^{k,m,d}(\theta_d)\right)^2 P^*[t^{k,i}\mid y^{k,m}(\Theta^*)] \\
&-\frac{2}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\sum_{i=1}^{I} y^{k,m,d}(\theta_d)\,t_d^{k,i}\,P^*[t^{k,i}\mid y^{k,m}(\Theta^*)] \\
&+\underbrace{\frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\sum_{i=1}^{I}\left(t_d^{k,i}\right)^2 P^*[t^{k,i}\mid y^{k,m}(\Theta^*)]}_{\text{constant}} \\
={}& \frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\left(y^{k,m,d}(\theta_d)\right)^2\underbrace{\sum_{i=1}^{I} P^*[t^{k,i}\mid y^{k,m}(\Theta^*)]}_{=1} \\
&-\frac{2}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k} y^{k,m,d}(\theta_d)\sum_{i=1}^{I} t_d^{k,i}\,P^*[t^{k,i}\mid y^{k,m}(\Theta^*)] \\
&+\underbrace{\frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\left(\sum_{i=1}^{I} t_d^{k,i}\,P^*[t^{k,i}\mid y^{k,m}(\Theta^*)]\right)^2 -\frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\left(\sum_{i=1}^{I} t_d^{k,i}\,P^*[t^{k,i}\mid y^{k,m}(\Theta^*)]\right)^2}_{=\,0} + \text{const.} \\
={}& \frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\left(y^{k,m,d}(\theta_d)-\sum_{i=1}^{I} t_d^{k,i}\,P^*[t^{k,i}\mid y^{k,m}(\Theta^*)]\right)^2 + \text{const.}
\end{aligned}
\tag{C.5}
$$

Now the interpretation of L_{H,d} is obvious. The terms

$$ \sum_{i=1}^{I} t_d^{k,i}\,P^*[t^{k,i}\mid y^{k,m}(\Theta^*)] \in [-1, +1] \tag{C.6} $$

are real valued; hence, L_{H,d} is the sum of the squared errors over all training patterns with respect to real-valued target values. The discrete classification task has been transformed into a regression problem. The learning task of the dth component equals f_d: ℝ^N → ℝ, which is defined by the M tuples

$$ \left(x^{k,m},\ \sum_{i=1}^{I} t_d^{k,i}\,P^*[t^{k,i}\mid y^{k,m}(\Theta^*)]\right) \in \mathbb{R}^N\times\mathbb{R}. \tag{C.7} $$

This raises the question of how the approximation of the real-valued function is compatible with the original classification task. The answer is straightforward: the algorithm stops when the probabilities P[t^k] of all center vectors converge near 1:

$$ P[t^{k,c}] = 1 \ \wedge\ P[t^{k,i}] = 0 \qquad \forall (k,i)\in\{1,\ldots,K\}\times\{1,\ldots,I\}\setminus\{c\}. \tag{C.8} $$

Thus, at the end of the algorithm, L_H equals

$$ L_H = \frac{1}{2\sigma^2}\sum_{k=1}^{K}\sum_{m=1}^{M_k}\sum_{d=1}^{D}\left(y^{k,m,d}(\theta_d)-t_d^{k,c}\right)^2. \tag{C.9} $$
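The real-valued regression targets of equations C.6 and C.7 are simply expectations of the binary target bits under the association probabilities, and they collapse back to ±1 once the probabilities converge as in equation C.8. A minimal numeric sketch (variable names are hypothetical):

```python
import numpy as np

def soft_targets(assoc, targets):
    """Expected target bits sum_i t_d^{k,i} P*[t^{k,i} | y^{k,m}] (eq. C.6).

    assoc   : (I,) association probabilities of the candidate vectors
    targets : (I, D) candidate target vectors, entries +/-1
    Returns a (D,) vector of regression targets in [-1, +1].
    """
    return assoc @ targets

targets = np.array([[1.0, 1.0, -1.0], [1.0, -1.0, -1.0]])
soft = soft_targets(np.array([0.75, 0.25]), targets)
print(soft)  # bit 0 is certain (+1), bit 1 is hedged, bit 2 is certain (-1)

# At convergence (equation C.8) the association mass sits on one center
# vector and the regression targets are binary again, as in equation C.9.
hard = soft_targets(np.array([1.0, 0.0]), targets)
```

Training each PiC component on these soft targets is exactly the nonlinear regression problem of equation C.5.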
The sum of squared errors in equation C.9 is the well-known objective function of the standard backpropagation algorithm (Werbos, 1974; Rumelhart & McClelland, 1986; Bishop, 1995; Ripley, 1996). When the probabilities P[t^{k,i}] converge to their final values, other well-established training algorithms can be used for the final training as well.

C.2 First-Order Polynomial Classifier. As PiC component for the vowel database and the segmentation database, we used a first-order polynomial classifier (Eigenmann & Nossek, 1998), which forms a soft boolean function (AND) of linear classifiers. The mapping of each component is equivalent to

$$ f_d: \mathbb{R}^N \to [0,1], \qquad x \mapsto \prod_{h=1}^{H} \Psi\!\left(\sum_{i=0}^{N} \theta_{d,h,i}\, x_i\right), \tag{C.10} $$

where the {θ_{d,h,i}}_{i=1}^N represent the weight parameters of the hth hidden neuron of the dth component. By the definition x_0 ≡ 1 and the parameters θ_{d,h,0}, a threshold for each neuron is introduced. As nonlinear activation function, we deploy the standard sigmoid function Ψ: ℝ → [0,1], ξ ↦ (1 + exp(−ξ))^{−1}. In order to comply with the domain of the nonlinear regression problem, equation 4.2, a shift of coordinates of the component output from [0, 1] to [−1, +1] is required.

Appendix D: Asynchronous Optimization
In each asynchronous optimization step, the probabilities of the T^k are optimized separately. The sets T^ℓ for ℓ ≠ k are assumed to be empty except for the current center vectors t^ℓ. Hence, the modified parts of the objective function read as

$$ L'_{P,k} = -\sum_{m=1}^{M_k}\sum_{i=1}^{I} \log\!\left(P[t^{k,i}]\right) P^*[t^{k,i} \mid y^{k,m}(\Theta^*)], \tag{D.1} $$

$$ H'_{t,k} = -P[\Omega^k] \sum_{\substack{\ell=1\\ \ell\neq k}}^{K} P[\Omega^\ell] \sum_{i=1}^{I} P[t^{k,i}]\,\log\!\left(\mathrm{dist}\!\left(t^{k,i}, t^{\ell}\right)\right), \tag{D.2} $$

$$
\begin{aligned}
H'_{\vartheta,k} = -P[\Omega^k]\Bigg[ &\sum_{d=1}^{D}\sum_{\substack{e=1\\ e\neq d}}^{D}\sum_{\vartheta_d^k}\sum_{\vartheta_e^k} P[\vartheta_d^k \wedge \vartheta_e^k]\,\log\!\left(\mathrm{dist}\!\left(\vartheta^d,\vartheta^e\right)\cdot\mathrm{dist}\!\left(\vartheta^d,-\vartheta^e\right)\right) \\
&+ \sum_{d=1}^{D}\sum_{\vartheta_d^k} P[\vartheta_d^k]\,\log\!\left(\mathrm{dist}\!\left(\vartheta^d,(1,1,\ldots,1)^T\right)\cdot\mathrm{dist}\!\left(\vartheta^d,-(1,1,\ldots,1)^T\right)\right)\Bigg].
\end{aligned}
\tag{D.3}
$$
Here Σ_{ϑ_d^k} denotes the sum over the two cases ϑ_d^k = ±1. Taking the derivatives of the objective function parts results in

$$ \frac{\partial L'_{P,k}}{\partial P[t^{k,i}]} = -\frac{a_i}{P[t^{k,i}]}, \tag{D.4} $$

$$ \frac{\partial H'_{t,k}}{\partial P[t^{k,i}]} = -b_i, \tag{D.5} $$

$$ \frac{\partial H'_{\vartheta,k}}{\partial P[t^{k,i}]} = -c_i, \tag{D.6} $$

where

$$ a_i = \sum_{m=1}^{M_k} P^*[t^{k,i} \mid y^{k,m}(\Theta^*)], \tag{D.7} $$

$$ b_i = P[\Omega^k] \sum_{\substack{\ell=1\\ \ell\neq k}}^{K} P[\Omega^\ell]\,\log\!\left(\mathrm{dist}\!\left(t^{k,i}, t^{\ell}\right)\right), \tag{D.8} $$

$$
\begin{aligned}
c_i = P[\Omega^k]\Bigg[ &\sum_{d=1}^{D}\sum_{\substack{e=1\\ e\neq d}}^{D}\sum_{\vartheta_d^k}\sum_{\vartheta_e^k} \delta_{d\wedge e}^{k,i}\,\log\!\left(\mathrm{dist}\!\left(\vartheta^d,\vartheta^e\right)\cdot\mathrm{dist}\!\left(\vartheta^d,-\vartheta^e\right)\right) \\
&+ \sum_{d=1}^{D}\sum_{\vartheta_d^k} \delta_d^{k,i}\,\log\!\left(\mathrm{dist}\!\left(\vartheta^d,(1,1,\ldots,1)^T\right)\cdot\mathrm{dist}\!\left(\vartheta^d,-(1,1,\ldots,1)^T\right)\right)\Bigg].
\end{aligned}
\tag{D.9}
$$

The Hessian matrix is a diagonal matrix with entries

$$ \frac{\partial^2 L'_{P,k}}{\partial P[t^{k,i}]\,\partial P[t^{k,j}]} = \begin{cases} \dfrac{a_i}{P^2[t^{k,i}]} & i = j \\[4pt] 0 & \text{otherwise,} \end{cases} \tag{D.10} $$
where a_i > 0. Hence, the Hessian matrix is positive definite, and the optimization problem of each training epoch has a unique global minimum.

References

Anand, R., Mehrotra, K., Mohan, C. K., & Ranka, S. (1995). Efficient classification for multiclass problems using modular neural networks. IEEE Transactions on Neural Networks, 6, 117–124.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blake, C., Keogh, E., & Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, Department of Information and Computer Sciences. Available online at: http://www.ics.uci.edu/~mlearn/mlrepository.html.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123–140.
Cortes, C., & Vapnik, V. N. (1995). Support-vector networks. Machine Learning, 20, 273–297.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39, 1–38.
Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Eigenmann, R., & Nossek, J. A. (1998). Classification based on first-order multivariable polynomials. In Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling: Theory and Applications, Leuven (pp. 7–12).
Fletcher, R. (1980). Practical methods of optimization. New York: Wiley.
Friedman, J. H. (1996). Another approach to polychotomous classification. Available online at: http://www-stat.stanford.edu/~jhf.
Gareth, J., & Hastie, T. (1997). Error coding and PaCT's. Winning paper in the ASA student paper competition for the Statistical Computing Section. Available online at: http://playfair.Stanford.EDU/~trevor/ASA/winners.html.
Gill, P. E., Murray, W., & Wright, M. H. (1981). Practical optimization. Orlando, FL: Academic Press.
Goemans, M. X., & Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the Association for Computing Machinery, 42, 1115–1145.
Guruswami, V., & Sahai, A. (1999). Multiclass learning, boosting, and error-correcting codes. In Proceedings of the International Conference on Computational Learning Theory.
Hofmann, T., Puzicha, J., & Buhmann, M. (1998). Unsupervised texture segmentation in a deterministic annealing framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 803–818.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. V. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Kressel, U., & Schürmann, J. (1997). Pattern classification techniques based on function approximation. In H. Bunke & P. S. P. Wang (Eds.), Handbook on optical character recognition and document analysis (Vol. 2, pp. 49–78). Singapore: World Scientific.
Mayoraz, E., & Moreira, M. (1996). On the decomposition of polychotomies into dichotomies (Tech. Rep. No. IDIAP-RR 96-08). Dalle Molle Institute for Perceptive Artificial Intelligence (IDIAP).
McLachlan, G., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley.
Miller, D., Rao, A. V., Rose, K., & Gersho, A. (1996). A global optimization technique for statistical classifier design. IEEE Transactions on Signal Processing, 44, 3108–3122.
Moon, T. K. (1996). The expectation–maximization algorithm. IEEE Signal Processing Magazine, 47–60.
Puzicha, J., Hofmann, T., & Buhmann, J. M. (1997). Deterministic annealing: Fast physical heuristics for real-time optimization of large systems. In Proceedings of the Fifteenth IMACS World Conference on Scientific Computation, Modelling and Applied Mathematics, Berlin.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195–239.
Richard, M. D., & Lippmann, R. P. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3, 461–483.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rojas, R. (1996). A short proof of the posterior probability property of classifier neural networks. Neural Computation, 8, 41–43.
Rose, K. (1992). Vector quantization by deterministic annealing. IEEE Transactions on Information Theory, 38, 1249–1257.
Rose, K. (1998). Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86, 2210–2239.
Ruck, D. W., Rogers, S. K., Kabrinsky, M., Oxley, M. E., & Suter, B. W. (1990). The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Transactions on Neural Networks, 1, 296–298.
Rumelhart, D. E., & McClelland, J. L. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Group (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Schäfer, S. (1995). Unconstrained global optimization using stochastic integration. Optimization, 35.
Schapire, R. E. (1997). Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning (pp. 313–321).
Schölkopf, B. (1997). Support vector learning. Munich: Oldenbourg-Verlag.
Schürmann, J. (1996). Pattern classification: A unified view of statistical and neural approaches. New York: Wiley.
Utschick, W. (1998). Error-correcting classification based on neural networks. Aachen: Shaker.
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.
Received March 22, 1999; accepted July 13, 2000.
LETTER
Communicated by David MacKay
Predictive Approaches for Choosing Hyperparameters in Gaussian Processes

S. Sundararajan
Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India

S. S. Keerthi
Department of Mechanical and Production Engineering, National University of Singapore, Singapore 119260

Gaussian processes are powerful regression models specified by parameterized mean and covariance functions. Standard approaches to choose these parameters (known by the name hyperparameters) are maximum likelihood and maximum a posteriori. In this article, we propose and investigate predictive approaches based on Geisser's predictive sample reuse (PSR) methodology and the related Stone's cross-validation (CV) methodology. More specifically, we derive results for Geisser's surrogate predictive probability (GPP), Geisser's predictive mean square error (GPE), and the standard CV error and make a comparative study. Within an approximation we arrive at the generalized cross-validation (GCV) and establish its relationship with the GPP and GPE approaches. These approaches are tested on a number of problems. Experimental results show that these approaches are strongly competitive with the existing approaches.

1 Introduction
Gaussian processes (GPs) are powerful regression models that have gained popularity recently, though they have appeared in different forms in the literature for years. They can also be used for classification (see MacKay, 1997; Neal, 1997; Rasmussen, 1996; Williams & Rasmussen, 1996). Here, we restrict ourselves to regression problems. GPs are specified using parametric forms for the mean and covariance functions (Williams & Rasmussen, 1996). We assume that the process is zero mean. Let Z_N = {X_N, t_N}, where X_N = {x(i): i = 1, …, N} and t_N = {t(i): i = 1, …, N}. Here, t(i) represents the observed output corresponding to the input vector x(i). Then the likelihood (MacKay, 1997) is given by

$$ p(t_N \mid X_N, \theta) = \frac{\exp\!\left(-\tfrac{1}{2}\, t_N^T C_N^{-1} t_N\right)}{(2\pi)^{N/2}\, |C_N|^{1/2}}, \tag{1.1} $$
Neural Computation 13, 1103–1118 (2001) © 2001 Massachusetts Institute of Technology
where C_N = C̃_N + σ² I_N; C̃_N is the covariance matrix with (i, j)th element [C̃_N]_{i,j} = C̃(x(i), x(j); θ̃), and C̃(·; θ̃) denotes the covariance function parameterized by θ̃. Note that θ = (θ̃, σ²) is the set of hyperparameters. Then the predictive distribution of the output y(N+1) for a test case x(N+1) is also gaussian, with mean and variance

$$ \hat{y}(N+1) = k_{N+1}^T\, C_N^{-1}\, t_N \tag{1.2} $$

and

$$ \sigma_y^2(N+1) = b_{N+1} - k_{N+1}^T\, C_N^{-1}\, k_{N+1}, \tag{1.3} $$

where b_{N+1} = C(x(N+1), x(N+1); θ) and k_{N+1} is an N × 1 vector with ith element C(x(N+1), x(i); θ). We need to specify the covariance function C(·; θ). Williams and Rasmussen (1996) found the following covariance function to work well in practice:

$$ \tilde{C}(x(i), x(j); \tilde{\theta}) = a_0 + a_1 \sum_{p=1}^{M} x_p(i)\, x_p(j) + v_0 \exp\!\left(-\frac{1}{2}\sum_{p=1}^{M} w_p\left(x_p(i) - x_p(j)\right)^2\right), \tag{1.4} $$
where xp (i) is the pth component of ith input vector x (i). The parameters w1 , w2 ,. . . , wM are referred to as automatic relevance determination (ARD) hyperparameters (Rasmussen, 1996) since their values inuence the relevance of the inputs in determining the output and are determined automatically using the training data. Note that all the parameters are positive, and it is convenient to use the logarithmic scale. Hence, µ is given by log(a0 , a1 , v 0, w1 , . . . , wM , s 2 ). Then the question is how we handle µ . Sophisticated techniques like hybrid Monte Carlo (HMC) methods (Rasmussen, 1996; Neal, 1997) are available that can numerically integrate over the hyperparameters to make predictions. Alternatively, we can estimate µ from the training data. We restrict ourselves to the latter approach here. In the classical approach, µ is assumed to be deterministic but unknown, and the maximum likelihood (ML) estimate µML is found by maximizing the likelihood 1.1. In the Bayesian approach, µ is assumed to be random, and a prior p(µ ) is specied. Then the maximum a posteriori (MAP) estimate µ MP is found by maximizing p(tN | X N , µ )p (µ ) with the motivation that the predictive distribution p(y (N C 1)| x (N C 1), Z N ) can be approximated as p (y (N C 1)| x (N C 1), Z N , µ MP ). We proposed and investigated different predictive approaches based on Geisser ’s predictive sample reuse (PSR) methodology (Geisser, 1975) and
the related Stone's cross-validation (CV) methodology (Stone, 1974) to estimate the hyperparameters (from the training data) in linear models (Sundararajan, 1999) and artificial neural networks (Sundararajan & Keerthi, 1999a; Sundararajan, 1999). In this article we investigate these approaches to choose the hyperparameters in GPs. (For an earlier version of this article, see Sundararajan & Keerthi, 1999b.) Simulation results indicate that the proposed approaches are strongly competitive with the existing approaches.

2 Predictive Approaches for Choosing Hyperparameters
Geisser (1975) proposed the PSR methodology, which can be applied to both model selection and parameter estimation problems. This methodology is closely related to the CV methodology independently presented by Stone (1974). Based on the PSR methodology, Geisser and Eddy (1979) proposed maximizing Π_{i=1}^{N} p(t(i) | Z_N^{(i)}, x(i), M_j) (known as Geisser's surrogate predictive probability, GPP) by synthesizing the Bayesian and PSR methodologies in the context of (parameterized) model selection. Here, we propose to maximize GPP = Π_{i=1}^{N} p(t(i) | Z_N^{(i)}, x(i), θ) to estimate θ, where Z_N^{(i)} is obtained from Z_N by removing the ith sample. Note that p(t(i) | Z_N^{(i)}, x(i), θ) is nothing but the predictive distribution p(y(i) | Z_N^{(i)}, x(i), θ) evaluated at y(i) = t(i). Since maximizing GPP is equivalent to minimizing its negative logarithm, we define the GPP function as

GPP(θ) = −(1/N) Σ_{i=1}^{N} log p(t(i) | Z_N^{(i)}, x(i), θ).   (2.1)
Also, we define Geisser's predictive mean square error (GPE) as

GPE(θ) = (1/N) Σ_{i=1}^{N} E[(y(i) − t(i))²],   (2.2)
where the expectation is defined with respect to p(y(i) | Z_N^{(i)}, x(i), θ), and we propose to estimate θ by minimizing GPE. One can show (see Burman, 1989; Efron & Gong, 1983) that our proposal to choose the hyperparameters by maximizing GPP is equivalent to choosing the hyperparameters such that the implied conditional density function (model) has maximum similarity with the true conditional density function. On the other hand, finding the hyperparameters that minimize GPE is equivalent to choosing the hyperparameters that minimize the predictive mean square error. Next, let us define the CV error as

CV(θ) = (1/N) Σ_{i=1}^{N} (t(i) − ŷ(i))²,   (2.3)
where ŷ(i) is the mean of the predictive distribution p(y(i) | Z_N^{(i)}, x(i), θ). Then equation 2.2 can be written as

GPE(θ) = CV(θ) + (1/N) Σ_{i=1}^{N} σ²_y(i),   (2.4)

where σ²_y(i) is the variance of p(y(i) | Z_N^{(i)}, x(i), θ). Thus, while the CV error minimizes only the deviation from the predictive mean, GPE takes the predictive variance into account as well.
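To make the definitions concrete, the three measures can be computed directly from equations 2.1 through 2.4 by refitting the GP predictor (equations 1.2 and 1.3) on each leave-one-out training set. The sketch below does exactly that; the covariance function and synthetic data are illustrative (a special case of equation 1.4 with a_0 = a_1 = 0 and a single width w), not the article's experimental setup, and the efficient formulas of section 2.1 avoid the N matrix factorizations used here.

```python
import numpy as np

def cov(X, theta):
    # Illustrative covariance: v0 * exp(-0.5 * w * ||x - x'||^2) + sigma2 * I
    # (a special case of equation 1.4 with a0 = a1 = 0 and one shared width w).
    v0, w, sigma2 = theta
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return v0 * np.exp(-0.5 * w * d2) + sigma2 * np.eye(len(X))

def loo_measures(X, t, theta):
    # GPP, CV, GPE from their definitions (equations 2.1-2.4): for each i,
    # predict t(i) from the other N-1 samples via equations 1.2 and 1.3.
    N = len(t)
    C = cov(X, theta)
    nlp, sqerr, var = [], [], []
    for i in range(N):
        keep = [j for j in range(N) if j != i]
        Ci = C[np.ix_(keep, keep)]            # covariance of the N-1 kept points
        ki = C[np.ix_(keep, [i])][:, 0]       # cross-covariances to held-out point
        mean = ki @ np.linalg.solve(Ci, t[keep])          # equation 1.2
        s2 = C[i, i] - ki @ np.linalg.solve(Ci, ki)       # equation 1.3
        nlp.append(0.5 * ((t[i] - mean) ** 2 / s2 + np.log(2 * np.pi * s2)))
        sqerr.append((t[i] - mean) ** 2)
        var.append(s2)
    GPP = float(np.mean(nlp))                 # equation 2.1
    CV = float(np.mean(sqerr))                # equation 2.3
    GPE = CV + float(np.mean(var))            # equations 2.2 and 2.4
    return GPP, CV, GPE
```

The point of section 2.1, developed next, is that all three quantities are recoverable from a single inverse of C_N instead of the N solves above.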
2.1 Expressions for the Predictive Measures and Their Gradients. From equations 1.2 and 1.3, we see that p(y(i) | Z_N^{(i)}, x(i), θ) is gaussian with mean ŷ(i) and variance σ²_y(i). Further, the summand in equation 2.1 is evaluated at y(i) = t(i). Therefore, to evaluate the various predictive measures, we need only find expressions for the error t(i) − ŷ(i) and the variance σ²_y(i); these can be evaluated efficiently. The expressions are derived in the appendix. GPP(θ) and its gradient can then be computed efficiently using the following result.

Theorem 1. The objective function GPP(θ) under the GP model is given by

GPP(θ) = (1/2N) Σ_{i=1}^{N} q_N²(i)/c̄_ii − (1/2N) Σ_{i=1}^{N} log c̄_ii + (1/2) log 2π,   (2.5)

where c̄_ii denotes the ith diagonal entry of C_N^{-1} and q_N(i) denotes the ith element of q_N = C_N^{-1} t_N. Its gradient is given by

∂GPP(θ)/∂θ_j = (1/2N) Σ_{i=1}^{N} (1 + q_N²(i)/c̄_ii)(s_{j,i}/c̄_ii) + (1/N) Σ_{i=1}^{N} q_N(i) (r_j(i)/c̄_ii),   (2.6)

where s_{j,i} = c̄_i^T (∂C_N/∂θ_j) c̄_i and r_j = −C_N^{-1} (∂C_N/∂θ_j) C_N^{-1} t_N. Here, c̄_i denotes the ith column of the matrix C_N^{-1}.
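In code, equation 2.5 replaces the N leave-one-out refits with a single O(N³) inverse. A minimal sketch (numpy; the covariance matrix C and targets t are assumed given):

```python
import numpy as np

def gpp(C, t):
    # GPP(theta) via equation 2.5: one inverse of C_N instead of N
    # leave-one-out refits. cbar[i] is the ith diagonal entry of C^{-1},
    # q[i] the ith element of q_N = C^{-1} t.
    N = len(t)
    Cinv = np.linalg.inv(C)
    cbar = np.diag(Cinv)
    q = Cinv @ t
    return (np.sum(q ** 2 / cbar) / (2 * N)
            - np.sum(np.log(cbar)) / (2 * N)
            + 0.5 * np.log(2 * np.pi))
```

By lemma 1 (appendix), this equals the average leave-one-out negative log predictive density computed directly from equations 1.2 and 1.3.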
We will give meaningful interpretations to the different terms shortly. Similarly, using the following result, we can compute CV(θ) and its gradient efficiently.

Theorem 2. The CV function CV(θ) under the GP model is given by

CV(θ) = (1/N) Σ_{i=1}^{N} (q_N(i)/c̄_ii)²   (2.7)

and its gradient is given by

∂CV(θ)/∂θ_j = (2/N) Σ_{i=1}^{N} (q_N(i)/c̄_ii) [ r_j(i)/c̄_ii + q_N(i) s_{j,i}/c̄_ii² ],   (2.8)
where s_{j,i}, r_j, q_N(i), and c̄_ii are as defined in theorem 1. Next, from lemma 1 (see the appendix) we can obtain the expression for GPE(θ) from equations 2.4 and 2.7, and its gradient can be written as

∂GPE(θ)/∂θ_j = ∂CV(θ)/∂θ_j + (1/N) Σ_{i=1}^{N} (1/c̄_ii)² c̄_i^T (∂C_N/∂θ_j) c̄_i.   (2.9)
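As a numerical check on equations 2.7 and 2.8, the sketch below computes CV(θ) and its analytic gradient for a deliberately simple two-hyperparameter covariance C(θ) = e^{θ_1} K + e^{θ_2} I (a stand-in for the full form 1.4, chosen so that ∂C_N/∂θ_j is trivial) and compares the gradient with central finite differences. The quantities s_{j,i} and r_j are exactly those of theorem 1.

```python
import numpy as np

def cv_and_grad(theta, K, t):
    # CV(theta) by equation 2.7 and its gradient by equation 2.8 for the
    # illustrative covariance C(theta) = exp(theta[0]) * K + exp(theta[1]) * I.
    N = len(t)
    C = np.exp(theta[0]) * K + np.exp(theta[1]) * np.eye(N)
    dC = [np.exp(theta[0]) * K, np.exp(theta[1]) * np.eye(N)]  # dC/dtheta_j
    Cinv = np.linalg.inv(C)
    cbar = np.diag(Cinv)
    q = Cinv @ t
    cv = np.mean((q / cbar) ** 2)                 # equation 2.7
    grad = np.empty(2)
    for j in range(2):
        M = Cinv @ dC[j] @ Cinv
        s = np.diag(M)                            # s_{j,i} = cbar_i^T dC cbar_i
        r = -M @ t                                # r_j = -C^{-1} dC C^{-1} t
        grad[j] = (2.0 / N) * np.sum((q / cbar) * (r / cbar + q * s / cbar ** 2))
    return cv, grad
```

For the full ARD covariance the only change is the list of derivative matrices ∂C_N/∂θ_j, one per hyperparameter.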
The expressions derived above can be used together with powerful gradient-based optimization methods to determine the hyperparameters.

2.2 Interpretations. First, from equation 1.2, we see that under an arbitrary scaling of the covariance function C (and therefore of k), the predictive mean ŷ(i) remains the same; that is, ŷ(i) is scale invariant. Therefore, it is evident from equation 2.3 that the CV error can determine the covariance function only up to a scaling factor. On the other hand, from equation 1.3, we see that unlike the predictive mean, the predictive variance depends on the scale of the covariance function. Then, from equation 2.4, it follows that the GPE criterion prefers the scale of C to go to zero. However, if one of the scale parameters (for example, the noise level) is set, then the other parameters can be determined using the CV or GPE criteria. Note that such problems do not arise with the GPP criterion. These problems can be addressed by considering the following reparameterization of the covariance function:

C(x(i), x(j); θ) = σ² ( ā_0 + ā_1 Σ_{p=1}^{M} x_p(i) x_p(j) + v̄_0 exp(−(1/2) Σ_{p=1}^{M} w_p (x_p(i) − x_p(j))²) + δ_ij ),   (2.10)
where a_0 = σ² ā_0, a_1 = σ² ā_1, and v_0 = σ² v̄_0. Let us define P(x(i), x(j); θ) = (1/σ²) C(x(i), x(j); θ). Then P_N = (1/σ²) C_N. Therefore, c̄_ij = p̄_ij/σ², where c̄_ij and p̄_ij denote the (i, j)th elements of the matrices C_N^{-1} and P_N^{-1}, respectively. First let us look at the implications of this reparameterization for the GPP criterion, because the noise level derived from this criterion and reparameterization is useful in determining the scale in the CV and GPE criteria. With q̄_N = P_N^{-1} t_N we can rewrite equation 2.5 as

GPP(θ) = (1/(2Nσ²)) Σ_{i=1}^{N} q̄_N²(i)/p̄_ii − (1/2N) Σ_{i=1}^{N} log p̄_ii + (1/2) log 2πσ².   (2.11)
Now, by setting the derivative of equation 2.11 with respect to σ² to zero, we can infer the noise level as

σ̂² = (1/N) Σ_{i=1}^{N} q̄_N²(i)/p̄_ii.   (2.12)
Similarly, the CV error, equation 2.3, can be rewritten as

CV(θ) = (1/N) Σ_{i=1}^{N} q̄_N²(i)/p̄_ii².   (2.13)
Note that CV(θ) depends only on the ratios of the hyperparameters (i.e., on ā_0, ā_1, v̄_0, due to the scale invariance of ŷ(i)) apart from the ARD parameters. This implies that we cannot infer the noise level uniquely. However, we can estimate the ARD parameters and the ratios ā_0, ā_1, v̄_0. Once we have estimated these parameters, we can use equation 2.12 to estimate the noise level. This is possible because, like CV(θ), equation 2.12 depends only on the ratios and the ARD parameters. Next, since the GPE criterion prefers the scale to go to zero, it follows that under the reparameterization, the noise level preferred by the GPE criterion is zero. To see this, first let us rewrite equation 2.4 under the reparameterization as

GPE(θ) = (1/N) Σ_{i=1}^{N} q̄_N²(i)/p̄_ii² + σ² (1/N) Σ_{i=1}^{N} 1/p̄_ii.   (2.14)
Since q̄_N(i) and p̄_ii are independent of σ², it follows that GPE prefers zero as the noise level. Therefore, this approach can be applied when either the noise level is known or a good estimate of it is available. Alternatively, we can substitute the estimator, equation 2.12, in place of σ² in equation 2.14. Then, like equations 2.12 and 2.13, equation 2.14 will depend only on the ratios and the ARD hyperparameters, which can be estimated by minimizing this function. Subsequently, using these estimates, the noise level can be estimated from equation 2.12. Thus, with the help of equation 2.12, both the CV and GPE criteria can be used to fix the scale of the covariance function as well. Also, like GPP, they do not need knowledge of the noise level.

2.3 GCV, Approximate GPP, GPE, and Their Relationships. We can use an approximation similar to the one used with the standard CV error function (Breiman, 1995) of a linear model, which results in the generalized cross-validation (GCV) (Craven & Wahba, 1979), for the GP model as well. More specifically, let p̃ = (1/N) Σ_{i=1}^{N} p̄_ii, and use the approximation p̄_ii ≈ p̃ for all i. With this approximation, let us denote GPP(θ), GPE(θ), CV(θ), and σ̂² as
AGPP(θ), AGPE(θ), GCV(θ), and σ̃², respectively, and refer to AGPP(θ) and AGPE(θ) as the approximate GPP and GPE functions. Then GCV is given by

GCV(θ) = (1/(N p̃²)) Σ_{i=1}^{N} q̄_N²(i)   (2.15)

and σ̃² = GCV(θ) p̃. Unlike in the case of the linear model, the computational complexity of the GCV measure is not reduced here and remains the same as that of the CV error in its present form. But this approximation sheds light on the relationships between the measures. First, the derivatives of AGPP(θ) and GCV(θ) in the reparameterized form are related through

∂AGPP(θ)/∂θ_j = (∂GCV(θ)/∂θ_j) (1/(2 GCV(θ))).   (2.16)
Noting that GCV(θ) > 0, we see that if ∂AGPP(θ)/∂θ_j = 0 for all j, then ∂GCV(θ)/∂θ_j = 0 for all j. This implies that the first-order necessary condition at an extremum (i.e., minimum or maximum) point is satisfied at the same point for both the approximate GPP and GCV functions. To prove that both functions attain their minima or maxima at the same points, we would need to study the properties of the Hessian matrices, which is difficult. However, based on our results on linear models (Sundararajan, 1999), we expect the minimizer/maximizer of these two objective functions to be the same. Next, consider the approximate GPE and GCV. As discussed earlier, we can substitute σ̃² in place of σ² in AGPE(θ) and then estimate the hyperparameters in the reparameterized form. This yields AGPE(θ) = 2 GCV(θ). As a consequence, the solutions obtained by minimizing GCV and the approximate GPE are exactly the same.
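The scale issues of section 2.2 are easy to exhibit numerically. Under the reparameterization C_N = σ² P_N, the CV error depends on P_N alone, while σ² re-enters GPP only through the terms of equation 2.11, whose minimizer over σ² is the closed form 2.12. A sketch (numpy; P_N here is an arbitrary positive-definite matrix standing in for the reparameterized covariance):

```python
import numpy as np

def cv_from_C(C, t):
    # CV error via equation 2.7; invariant to an overall scaling of C.
    Cinv = np.linalg.inv(C)
    q = Cinv @ t
    return float(np.mean((q / np.diag(Cinv)) ** 2))

def gpp_reparam(P, t, sigma2):
    # GPP under C = sigma^2 * P, written as in equation 2.11.
    N = len(t)
    Pinv = np.linalg.inv(P)
    pbar = np.diag(Pinv)
    qbar = Pinv @ t
    return (np.sum(qbar ** 2 / pbar) / (2 * N * sigma2)
            - np.sum(np.log(pbar)) / (2 * N)
            + 0.5 * np.log(2 * np.pi * sigma2))

def noise_estimate(P, t):
    # Closed-form minimizer of equation 2.11 over sigma^2 (equation 2.12).
    Pinv = np.linalg.inv(P)
    return float(np.mean((Pinv @ t) ** 2 / np.diag(Pinv)))
```

Scaling P by any constant leaves cv_from_C unchanged, while noise_estimate fixes the one remaining scale, as argued in section 2.2.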
3 Simulation Results
We have carried out simulations on four different data sets. We considered MacKay's robot arm problem and its modified version introduced by Neal (1996). We used the same data set as MacKay (two inputs and two outputs), with 200 examples in the training set and 200 in the test set. (This data set is referred to as data set 1 in Table 1.) Next, to evaluate the ability of the predictive approaches to estimate the ARD parameters, we carried out simulations on the robot arm data with six inputs (Neal's version), denoted as data set 2 in Table 1. This data set was generated by adding four more inputs: two of them copies of the two original inputs corrupted by additive zero-mean gaussian noise of standard deviation 0.02, and two further irrelevant gaussian noise inputs with zero mean and unit variance (Williams & Rasmussen, 1996). The performance measures chosen were the average of test
set error (TSE) (normalized by the true noise level of 0.0025) and the average of the negative logarithm of predictive probability (NLPP) (computed from the gaussian density function with equations 1.2 and 1.3). Friedman's (1991) data sets 3 and 4 were based on the problem of predicting impedance and phase, respectively, from four parameters of an electrical circuit. Training sets of four different sizes (50, 100, 150, 200) and with a signal-to-noise ratio of about 3:1 were replicated 100 times, and for each training set (at each sample size N), the scaled integral squared error,

ISE = ∫_D (y(x) − ŷ(x))² dx / var_D y(x),

and NLPP were computed using 5000 data points randomly generated from a uniform distribution over D (Friedman, 1991). The approaches considered here are the existing ML and MAP approaches and the GPP, CV, GCV, and GPE approaches developed here. In the cases of CV and GPE, we estimated the hyperparameters in the reparameterized form. These approaches are indicated as CV_UK and GPE_UK, respectively. We also evaluated these approaches with the noise-level estimate generated randomly with 10% error (that is, gaussian distributed with the true noise level NL_T as the mean and a standard deviation of 0.03 NL_T). These approaches are denoted as CV_K and GPE_K. Note that a comparison of the CV and GPE approaches helps in understanding the important role played by the predictive variance. Also, as noted earlier, in the CV approach the ratio of the hyperparameters (other than ARD) to the noise level is what matters; therefore, CV_K and CV_UK are mathematically equivalent. However, in a practical implementation, the solutions obtained from CV_K and CV_UK may be slightly different due to the failure of the optimization algorithm to reach the optimal values. In the case of GCV, we estimated the hyperparameters again in the reparameterized form. In the case of MAP, we used the same prior given in Rasmussen (1996). For all methods, the conjugate gradient (CG) algorithm (Rasmussen, 1996) was used to optimize the hyperparameters on the logarithmic scale. The termination criterion (relative function error) was a tolerance of 10^{-7}, with the maximum number of CG iterations set to 100. In the case of the robot arm data sets, the algorithm was run with 10 different initial conditions, and the best solution (chosen by the respective best objective function value) is reported. The optimization was carried out separately for the two outputs, and the results reported are the average TSE and NLPP.
In the case of Friedman's data sets, the optimization algorithm was run with three different initial conditions, and the best solution was picked. When N = 200, the optimization algorithm was run with only one initial condition. For all the data sets, both the inputs and outputs were normalized to zero mean and unit variance. Table 1 gives the results on the robot arm data sets. We see that the performances (both TSE and NLPP) of the predictive approaches are better than those of the ML and MAP approaches for both data sets. In the case of data set 2, we observe that, like the ML and MAP methods, all the predictive approaches
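The experimental pipeline described above (normalize the data, pick an objective, optimize the log hyperparameters, restart from several initial conditions) can be sketched as follows. This is a simplified stand-in, not the article's code: it minimizes the CV objective (equation 2.7) with backtracking gradient descent and finite-difference gradients on a toy one-dimensional problem, rather than the conjugate gradient routine of Rasmussen (1996) with the analytic gradients of theorem 2.

```python
import numpy as np

def cv_objective(log_theta, X, t):
    # CV error (equation 2.7) for C = v0 * exp(-0.5 * w * d^2) + sigma2 * I,
    # with hyperparameters handled on the logarithmic scale, as in the article.
    v0, w, sigma2 = np.exp(log_theta)
    d2 = (X[:, None] - X[None, :]) ** 2
    C = v0 * np.exp(-0.5 * w * d2) + sigma2 * np.eye(len(t))
    Cinv = np.linalg.inv(C)
    q = Cinv @ t
    return float(np.mean((q / np.diag(Cinv)) ** 2))

def fit(X, t, log_theta0, steps=100, lr=0.1, eps=1e-5):
    # Backtracking gradient descent with central-difference gradients; a
    # deliberately simple substitute for the paper's conjugate gradient setup.
    lt = np.array(log_theta0, dtype=float)
    best = cv_objective(lt, X, t)
    for _ in range(steps):
        g = np.zeros_like(lt)
        for j in range(len(lt)):
            e = np.zeros_like(lt); e[j] = eps
            g[j] = (cv_objective(lt + e, X, t) - cv_objective(lt - e, X, t)) / (2 * eps)
        step = lr
        while step > 1e-6:              # accept only improving steps
            cand = lt - step * g
            val = cv_objective(cand, X, t)
            if val < best:
                lt, best = cand, val
                break
            step /= 2.0
    return lt
```

Multiple restarts, input/output normalization, and a proper termination criterion are omitted for brevity but are part of the procedure the article actually ran.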
Table 1: Results on Robot Arm Data Sets.

          Data Set 1          Data Set 2
Method    TSE      NLPP       TSE      NLPP
ML        1.126    -1.512     1.131    -1.512
MP        1.131    -1.511     1.181    -1.489
GPP       1.115    -1.524     1.116    -1.516
GPE_UK    1.111    -1.518     1.116    -1.516
CV_UK     1.112    -1.518     1.146    -1.514
CV_K      1.115    -1.520     1.147    -1.514
GPE_K     1.111    -1.524     1.112    -1.524
GCV       1.113    -1.516     1.119    -1.509

Note: Average of normalized test set error and negative logarithm of predictive probability for various methods.
correctly identified the irrelevant inputs. Notice the degradation in the GPE performance in the absence of a good noise-level estimate. However, its performance is closer to GPP and better than both ML and MAP. In the case of Friedman's data set 3 (see Figures 1 and 2), the important observation is that the performances (both TSE and NLPP) of the predictive approaches (GPP, GPE_UK, CV_K, and CV_UK) are poor at low sample size (N = 50), and they improve markedly as the sample size increases. Note that the performance of the predictive approaches is better than that of the ML and MAP methods from N = 100 onward. Also, note that although the TSE performance of the ML approach is better than or the same as that of GPP, the NLPP performance of the GPP approach is better from N = 100 onward. This is because in NLPP, the logarithm of the predictive variance is more important. This phenomenon can be observed in several other cases as well. Next, the performances (TSE) of the CV_UK and CV_K methods are almost the same, which indicates that there is no need to know the noise level to use the CV approach. This is due to the mathematical equivalence of the approaches, and the slight deviation in the results is due to differences in numerical computation. Note that, as one would expect, there is a degradation in the GPE performance in the absence of the noise-level information, and more so at low sample size. Also, the performance of GPE_UK is quite close to that of GPP as the sample size increases. Further, note that the GPE performance is much better than that of the CV methods. This behavior, we expect, is because of the important role played by the predictive variance. Next, it is clear that the MAP method gives the best performance at low sample size because, we believe, the prior plays an important role there and hence is very useful. In the case of GCV, we have not reported the results for N = 50 because the performances were poor for several training sets.
We believe that this is due to the invalidity of the approximation at low sample size. This belief is
Figure 1: Friedman’s data set 3. Integral squared error (ISE) for different training sample sizes (50, 100, 150, 200) in that order for each method. Median (indicated by circles) and quartiles (indicated by error bars) of ISE obtained from 100 different training data sets.
based on the following observations. After estimating the hyperparameters in the reparameterized form, we computed the noise level, and it was close to zero for several training sets (indicating the problem of overfitting); the performances were poor for these data sets. Further, the number of training sets for which this behavior was observed became smaller as the training set size increased. More specifically, when N = 100, 10 training data sets (in the case of data set 4) and 3 training data sets (in the case of data set 3) showed this behavior. The results are reported after excluding these cases. When N = 150, we observed this behavior for 3 training data sets in the case of data set 4 and none in the case of data set 3. When N = 200, we did not observe any such behavior. So, as one would expect, the approximation becomes better and better as the number of samples increases. This can be seen by comparing the performance of GCV with that of CV_UK. In the case of Friedman's data set 4 (see Figures 3 and 4), the ML and MAP approaches perform better than the predictive approaches, except for GPE_K. The performances of the predictive methods improve as the number of samples increases and are very close to those of the ML and MAP methods
Figure 2: Friedman’s data set 3. Negative logarithm of predictive probability (NLPP) for different training sample sizes (50, 100, 150, 200) in that order for each method. Median (indicated by circles) and quartiles (indicated by error bars) of NLPP obtained from 100 different training data sets.
when N = 150 and N = 200. On comparing the performances of CV and GPE, we once again see the role and importance of the predictive variance. Also, note that the performance of GPE_K is inferior to the ML and MAP approaches at low sample sizes and improves over these approaches (see NLPP) as N increases. As earlier, there is again a degradation in the GPE performance in the absence of noise-level information, and its performance becomes quite close to that of the GPP approach as the sample size increases. In the future, we would like to compare our methods against Monte Carlo methods and possibly evaluate all of the methods thoroughly within the DELVE experimental setup.

4 Conclusion
The predictive measures estimate the predictive performance of a given model from the training samples alone. Since the quality of the estimate
Figure 3: Friedman’s data set 4. Integral squared error (ISE) for different training sample sizes (50, 100, 150, 200) in that order for each method. Median (indicated by circles) and quartiles (indicated by error bars) of ISE obtained from 100 different training data sets.
will be better as the number of examples increases, the performance improves markedly with an increase in the number of samples. Further, knowledge of the noise level helps in improving the quality of this estimate. Typically, when the number of samples is sufficiently large, we find that the predictive approaches perform better than the ML and MAP approaches. In particular, we find that the MAP approach is best when the number of samples is very low. Simulation results indicate that the number of samples required for good estimates (i.e., for the predictive approaches to perform better than the existing approaches) depends on the problem. The sufficient number of samples can be as low as 100, as is evident from our results on Friedman's data set 3. It is not clear how one can determine the number of samples required to take advantage of these methods before actually trying them out on a given data set. The comparison with the existing approaches indicates that the predictive approaches developed here are strongly competitive. Among the predictive approaches, we recommend the GPP and/or GPE (reparameterized) approaches for choosing the hyperparameters. Finally, the overall cost of computing the function and the gradient (for all the
Figure 4: Friedman’s data set 4. Negative logarithm of predictive probability (NLPP) for different training sample sizes (50, 100, 150, 200) in that order for each method. Median (indicated by circles) and quartiles (indicated by error bars) of NLPP obtained from 100 different training data sets.
predictive approaches) is O(MN³); on the other hand, the cost is O(N³) for the ML and MAP approaches. However, the cost of making predictions is the same for all the approaches.

Appendix
Let E_{i,N} = (e_1, e_2, ..., e_{i−1}, e_N, e_{i+1}, ..., e_{N−1}, e_i), where e_j is the jth column vector of the identity matrix I_N. Then E_{i,N} is symmetric and orthogonal (see Golub & Van Loan, 1985). Now, let C_N = (c_1, c_2, ..., c_N) and C_{i,N} = E_{i,N} C_N E_{i,N}. Then, in block form,

C_{i,N} = [ C_N^{(i)}, c_i^{(i)} ; (c_i^{(i)})^T, c_ii ],   (A.1)

where c_i^{(i)} = c_i − {c_ii}; that is, c_i^{(i)} is the (N−1) × 1 vector obtained by removing the ith element c_ii from c_i, and the matrix C_N^{(i)} is obtained from C_N by removing the ith row and column.
Lemma 1. The error term t(i) − ŷ(i) is given by q_N(i)/c̄_ii, and the variance σ²_y(i) is given by 1/c̄_ii. Here, q_N(i) is the ith element of the vector q_N = C_N^{-1} t_N, and c̄_ii is the ith diagonal element of C_N^{-1}. Further,

∂c̄_ii/∂θ_j = −c̄_i^T (∂C_N/∂θ_j) c̄_i   and   ∂q_N/∂θ_j = −C_N^{-1} (∂C_N/∂θ_j) C_N^{-1} t_N.
Proof. From equations 1.2 and 1.3, we have

ŷ(i) = (c_i^{(i)})^T (C_N^{(i)})^{-1} t_N^{(i)}   (A.2)

σ²_y(i) = c_ii − (c_i^{(i)})^T (C_N^{(i)})^{-1} c_i^{(i)}.   (A.3)
The results follow from simplifying these expressions. First, note that

C_{i,N}^{-1} = E_{i,N} C_N^{-1} E_{i,N},   (A.4)

since E_{i,N} is both orthogonal and symmetric. Therefore, C_{i,N}^{-1} is nothing but the matrix obtained by interchanging the ith row and column of C_N^{-1} with the Nth row and column, respectively. Now, if we write C_N^{-1} = (c̄_1, c̄_2, ..., c̄_N), then, in block form,

C_{i,N}^{-1} = [ C̄_N^{(i)}, c̄_i^{(i)} ; (c̄_i^{(i)})^T, c̄_ii ],   (A.5)

where C̄_N^{(i)} is obtained from C_N^{-1} by removing the ith row and column, and ((c̄_i^{(i)})^T, c̄_ii)^T = E_{i,N} c̄_i. Note that c̄_ii is the ith diagonal entry of C_N^{-1}. Next,
using a result on the inverse of a partitioned matrix (see Rao, 1991, p. 33) on C_{i,N} (see equation A.1), we find

C_{i,N}^{-1} = [ (C_N^{(i)})^{-1} + d_i d_i^T / (c_ii − d_i^T c_i^{(i)}),  −d_i / (c_ii − d_i^T c_i^{(i)}) ; −d_i^T / (c_ii − d_i^T c_i^{(i)}),  1 / (c_ii − d_i^T c_i^{(i)}) ],   (A.6)

where d_i = (C_N^{(i)})^{-1} c_i^{(i)}. On comparing the (N, N)th elements of equations A.5 and A.6, we see that 1/(c_ii − d_i^T c_i^{(i)}) = c̄_ii. Then, from equation A.3,
it follows that σ²_y(i) = c_ii − d_i^T c_i^{(i)} = 1/c̄_ii. Next, consider C_{i,N}^{-1} ((t_N^{(i)})^T, t(i))^T. From equation A.6, the Nth element of this vector is nothing but (t(i) − (c_i^{(i)})^T (C_N^{(i)})^{-1} t_N^{(i)}) c̄_ii. But from equation A.4,

C_{i,N}^{-1} ((t_N^{(i)})^T, t(i))^T = E_{i,N} C_N^{-1} E_{i,N} E_{i,N} t_N = E_{i,N} C_N^{-1} t_N = E_{i,N} q_N,

whose Nth element is q_N(i). Therefore, from equation A.2, it follows that t(i) − ŷ(i) = q_N(i)/c̄_ii. To find the derivatives, we use the result ∂C_N^{-1}/∂θ_j = −C_N^{-1} (∂C_N/∂θ_j) C_N^{-1}. This completes the proof.
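Lemma 1 is easy to verify numerically: delete point i, predict it from the rest via equations A.2 and A.3 (i.e., equations 1.2 and 1.3), and compare with q_N(i)/c̄_ii and 1/c̄_ii taken from the full inverse. A sketch with an arbitrary positive-definite matrix standing in for C_N:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 8
A = rng.normal(size=(N, N))
C = A @ A.T / N + np.eye(N)        # stand-in for C_N (positive definite)
t = rng.normal(size=N)

Cinv = np.linalg.inv(C)
q = Cinv @ t                       # q_N = C_N^{-1} t_N
cbar = np.diag(Cinv)               # diagonal of C_N^{-1}

for i in range(N):
    keep = [j for j in range(N) if j != i]
    Ci = C[np.ix_(keep, keep)]       # C_N^{(i)}
    ki = C[np.ix_(keep, [i])][:, 0]  # c_i^{(i)}
    yhat = ki @ np.linalg.solve(Ci, t[keep])         # equation A.2
    s2 = C[i, i] - ki @ np.linalg.solve(Ci, ki)      # equation A.3
    assert abs((t[i] - yhat) - q[i] / cbar[i]) < 1e-8  # error identity
    assert abs(s2 - 1.0 / cbar[i]) < 1e-8              # variance identity
print("lemma 1 identities verified")
```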
References

Breiman, L. (1995). Better subset regression using the non-negative garrote. Technometrics, 37(4), 373–384.

Burman, P. (1989). A comparative study of ordinary cross-validation and the repeated learning testing methods. Biometrika, 76(3), 503–514.

Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions. Numer. Math., 31, 377–403.

Efron, B., & Gong, G. (1983). A leisurely look at the bootstrap, the jackknife and cross-validation. American Statistician, 37(1), 36–48.

Friedman, J. H. (1991). Multivariate adaptive regression splines. Ann. Stat., 19, 1–141.

Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70, 320–328.

Geisser, S., & Eddy, W. F. (1979). A predictive approach to model selection. Journal of the American Statistical Association, 74, 153–160.

Golub, G. H., & Van Loan, C. F. (1985). Matrix computations. Baltimore: Johns Hopkins University Press.

MacKay, D. J. C. (1997). Gaussian processes—A replacement for neural networks. Available online at: http://www.wol.ra.phy.cam.ac.uk/mackay/.

Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.

Neal, R. M. (1997). Monte Carlo implementation of gaussian process models for Bayesian regression and classification (Tech. Rep. No. 9702). Toronto: Department of Statistics, University of Toronto.

Rao, C. R. (1991). Linear statistical inference and its applications. New Delhi: Wiley Eastern Limited.

Rasmussen, C. (1996). Evaluation of gaussian processes and other methods for non-linear regression. Unpublished doctoral dissertation, University of Toronto.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. R. Statist. Soc. B, 36, 111–147.
Sundararajan, S. (1999). Predictive approaches for choosing hyperparameters in linear and nonlinear regression problems. Unpublished doctoral dissertation, Indian Institute of Science, Bangalore, India.

Sundararajan, S., & Keerthi, S. S. (1999a). Predictive approaches for choosing hyperparameters in artificial neural networks. Submitted.

Sundararajan, S., & Keerthi, S. S. (1999b). Predictive approaches for choosing hyperparameters in gaussian processes. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.

Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.

Received November 5, 1999; accepted July 20, 2000.
LETTER
Communicated by Sara Solla
Architecture-Independent Approximation of Functions

Vicente Ruiz de Angulo
Carme Torras
Institut de Robòtica i Informàtica Industrial (CSIC-UPC), 08034 Barcelona, Spain

We show that minimizing the expected error of a feedforward network over a distribution of weights results in an approximation that tends to be independent of network size as the number of hidden units grows. This minimization can be easily performed, and the complexity of the resulting function implemented by the network is regulated by the variance of the weight distribution. For a fixed variance, there is a number of hidden units above which either the implemented function does not change or the change is slight and tends to zero as the size of the network grows. In sum, the control of the complexity depends only on the variance, not on the architecture, provided the latter is large enough.

1 Introduction
© 2001 Massachusetts Institute of Technology. Neural Computation 13, 1119–1135 (2001)

Neural networks are function approximators with a unique feature, their large number of adaptable parameters, which makes them general and powerful estimators. But when the number of samples is not high compared to the number of weights, this power should be restricted in some way to obtain useful approximations. The good results obtained almost from the very beginning of the revival of connectionism in the past decade, particularly with backpropagation networks, can be explained by the adoption of methods against overfitting, some of them first used in the context of neural networks, like early stopping. These methods can be divided into two groups. The first group refers to architecture selection and manipulation and consists basically of choosing networks with few parameters. For certain applications with known invariances, this can be accomplished by forcing some groups of weights to take the same value, that is, by unifying parameters (Lang, Waibel, & Hinton, 1990). But in general this is done by limiting the number of weights or hidden units. A popular method is to begin with a large network and then eliminate the least significant parameters for the fitting, during or after training (Le Cun, Denker, & Solla, 1990). In the latter case, it is usual to repeat the learning-pruning cycle several times. The second group of methods adds a regularization term to the cost function in order to constrain the weights to minimize the regularizer as well. The relative weight given to the regularizer is controlled by means of a regularization coefficient. The typical regularizer for neural networks is
a quadratic function of the weights, limiting their magnitude. The early stopping technique, which avoids the growth of the weights in the later stages of learning, has a similar working principle. There is a third option: the use of random weights. The idea is that the information contained in a random weight depends on its variance and, thus, the fitting power of a network can be controlled through the amplitude of its weight distribution. The use of random weights was first suggested by Hanson (1990) to avoid poor minima. Later, An (1996) and Murray and Edwards (1993, 1994) showed the utility of the approach to improve generalization, especially for classification problems. Hinton and van Camp (1993a, 1993b) used individual variances for each weight that were adapted during learning to minimize a minimum description length cost. Here we show that simple noise addition, in the sense indicated below, has very interesting properties. Being a single-variance version of the Hinton and van Camp framework, it is tantamount to a regularization approach, but at the same time, in practice, it also appertains to the architecture selection methods. Indeed, we show that the variance determines the surviving architecture and the optimal weight vector, provided the network has enough hidden units. Thus, there exists a one-to-one correspondence between the regularization coefficient and the function implemented by the trained network, regardless of the architecture. 2 Learning with Noisy Weights
This article studies the weight configurations obtained by minimizing a cost function E_P(W) that is the expectation of E(W):
$$E_P(W) = \int E(W + R)\, P(R)\, dR, \qquad (2.1)$$
where W is the weight vector and R is a vector of random variables, each obeying the same symmetric distribution P(·). In regression networks, the error function is taken as follows:
$$E(W) = \sum_p E^{(p)}(W) = \frac{1}{2} \sum_p \left\| F(W; X^{(p)}) - D^{(p)} \right\|^2, \qquad (2.2)$$
F(W; X^(p)) being the output of the network when the pth input pattern X^(p) is presented, and D^(p) the pth target pattern. A deterministic algorithm was developed by Ruiz de Angulo and Torras (1994) to minimize equation 2.1 without sampling the weight distribution, for the purpose of mitigating weight perturbation effects. It is rather simple and has been shown to provide very accurate minima for all variances of P. For this reason, we will use it in all the experiments presented in this article.
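Equation 2.1 can be illustrated with a brute-force Monte Carlo estimate (the deterministic algorithm just mentioned avoids exactly this sampling). The tiny network, data set, and noise level below are invented for illustration, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer regression network with a single linear output:
# F(W; x) = v . tanh(U x). U, v, X, D are made-up illustrative values.
U = rng.normal(scale=0.5, size=(4, 3))   # input-to-hidden weights
v = rng.normal(scale=0.5, size=4)        # hidden-to-output weights
X = rng.normal(size=(10, 3))             # input patterns X^(p)
D = np.sin(X.sum(axis=1))                # target patterns D^(p)

def E(U, v):
    """Sum-of-squares error of equation 2.2."""
    return 0.5 * np.sum((np.tanh(X @ U.T) @ v - D) ** 2)

def E_P(U, v, sigma, n_samples=5000):
    """Monte Carlo estimate of equation 2.1, E_P(W) = E[E(W + R)],
    with R i.i.d. gaussian of standard deviation sigma."""
    total = 0.0
    for _ in range(n_samples):
        total += E(U + rng.normal(scale=sigma, size=U.shape),
                   v + rng.normal(scale=sigma, size=v.shape))
    return total / n_samples

e, ep = E(U, v), E_P(U, v, sigma=0.05)
# For small sigma, E_P differs from E only by the second-order
# correction of equation 2.3, so the two values stay close.
```

For sigma = 0 the estimate collapses to E(W) exactly, which is a convenient sanity check of the sketch.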
This algorithm is based on the following equivalence:
$$E_P(W) = E(W) + \frac{\sigma^2}{2} \sum_{j,i} \frac{\partial^2 E}{\partial w_{ji}^2} + O(m_4) \approx E(W) + \frac{\sigma^2}{2} \sum_{j,i} \left( \frac{\partial F}{\partial w_{ji}} \right)^2, \qquad (2.3)$$
σ² being the variance of R, m₄ its fourth-order moment, and w_{ji} the weight departing from unit i and impinging on unit j. σ²/2 can be considered a regularization coefficient, which we call α. The equality in equation 2.3 holds everywhere as a consequence of the symmetry of P, which cancels out the terms associated with the first and third derivatives, as well as the nondiagonal terms of the Hessian. When the variance is small, fourth- and higher-order moments are very small, and the terms associated with them can be neglected. Instead, when the variance is large, the minimization of E_P(W) produces networks whose fourth and higher derivatives are small, so that those same terms also can be neglected. These phenomena complement one another in such a way that the O(m₄) terms are not important for any variance in the minimum of E_P(W). Thus, any minimization algorithm can work asymptotically without these terms. In the approximation step of equation 2.3, a term dependent on both the pattern misfits F(W; X^(p)) − D^(p) and the second derivatives of F has been dropped. This term can also be neglected in the minimum of E_P(W) for similar reasons: when the variance is small, the misfits are small, and when the variance is large, the minimum tends to have small second derivatives of F. Observe that this argument requires that for σ = 0, the network is able to approximate all the training points fairly well. When the training set contains points with the same input and different outputs, we assume that the network is able to interpolate the average of these outputs (which is the result of the minimization of the plain sum-of-squares error).¹ If this condition is satisfied, the sum of the neglected terms dependent on the misfits is null.
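The right-hand side of equation 2.3 is cheap to evaluate for a two-layer net with a single linear output, since ∂F/∂v_j = h_j and ∂F/∂U_ji = v_j(1 − h_j²)x_i for tanh hidden units. A minimal sketch with toy weights and data of our own choosing, checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.normal(scale=0.3, size=(4, 3))   # input-to-hidden weights (toy)
v = rng.normal(scale=0.3, size=4)        # hidden-to-output weights (toy)
X = rng.normal(size=(8, 3))              # input patterns (toy)

def F(U, v, x):
    return v @ np.tanh(U @ x)            # single linear output unit

def regularizer(U, v, X):
    """Sum over patterns and over all weights of (dF/dw)^2,
    i.e., the regularization term on the rhs of equation 2.3."""
    total = 0.0
    for x in X:
        h = np.tanh(U @ x)
        total += np.sum(h ** 2)                 # dF/dv_j = h_j
        g = v * (1.0 - h ** 2)                  # dF/dU_ji = v_j (1 - h_j^2) x_i
        total += np.sum(np.outer(g, x) ** 2)
    return total

# Finite-difference check of the input-weight derivative used above
eps, x0 = 1e-6, X[0]
dU = np.zeros_like(U); dU[0, 0] = eps
numeric = (F(U + dU, v, x0) - F(U - dU, v, x0)) / (2 * eps)
h0 = np.tanh(U @ x0)
analytic = v[0] * (1.0 - h0[0] ** 2) * x0[0]
```

Multiplying `regularizer` by α = σ²/2 and adding it to the plain error gives the approximate E_P(W) that the article minimizes.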
The accuracy of an algorithm based on this approximation was demonstrated for all variances in Ruiz de Angulo (1996) and Ruiz de Angulo and Torras (1998) by means of comparisons with the exact minimization for a very simple form of P, namely, one giving equal probability to the extreme points of a cross centered at the origin, and zero probability elsewhere. With this approximation, the derivatives of the expected error for pattern p, E_P^(p)(W), for a two-layer regression network with the above-defined E(W)²
¹ Thus, we are assuming that the network is large enough to do this, and that the control of the complexity of the approximation is carried out by regulating σ and not the number of hidden units.
² The derivatives for a classification network using the relative entropy error function and whose output units' activation function is tanh are the same as in equation 2.4.
and linear output units, are given by equation 2.4. A consequence of using α > 0 is that, above a certain number of iterations, the generalization error does not change with more training. Thus, overfitting is limited. Although the selection of the best variance (or regularizer coefficient) for generalization is not the purpose of this article, we show in Figure 4 the generalization error of the intensively trained networks used in the preceding figure. It can be seen that the networks with 6 HUs and 12 HUs generalize equally well for α > 0, except in the case α = .053, where the former falls in a local minimum. In subsequent experiments with a fixed α, it has been observed that the number of surviving hidden units increases with the number of data items, to cope with the increase in the amount of information available (Ruiz de Angulo, 1996). 3.2 Local Minima. A question that arises naturally after seeing the results of the preceding experiments is whether two networks with different
Figure 3: Representation of the absolute values of the weights after training. Those of the 6 HU network are marked with a cross and those of the 12 HU network with a circle. Note that the weights lying on the axis of abscissas are null and thus correspond to nonsurviving units.
number of neurons, trained with the same training set and α, will always produce equivalent approximations. First, let us state that all the claims we make in this article refer to well-minimized networks. No strong claims can be made about weights resulting from incomplete optimizations. Thus, no ways to regulate complexity extrinsic to the cost function, such as early stopping, are applied. Often two networks must be trained for a long time before their similarity becomes clearly noticeable. It is not unusual (as in the case of minimizing the plain
Figure 4: Generalization error of the networks displayed in Figure 3.
E(W)) that learning slows down in nearly flat surfaces of E_P(W), sometimes leading to replicated units, which disappear after enough training. From time to time, even after long training periods, one gets different configurations, including cases with a different number of remaining hidden units. Two situations need to be distinguished here, depending on whether the number of surviving units in the larger network is greater than the number of available hidden units in the smaller network or not. In the first situation, there is no opportunity for confluence. In the second, if the configuration reached by the larger network is a global minimum, it must also be a global minimum for the smaller network. Consequently, in the second situation, we have always verified that at least one of the networks had been trapped in a local minimum. Thus, our claim of architecture-independent approximation still holds, but must be stated in the context of global minimization of the regularized function. The frequency of local minima is not too high. Results such as those in Figure 1 are common in a first trial. However, we must acknowledge that results such as those displayed in Figure 3, in which with fixed initial weights equal configurations are obtained for the entire range of α's, are uncommon. Experimental results indicate that falling in local minima is more likely when the number of surviving hidden units required by the global minimum is close to that of the network, and also when this number is much larger.
A strategy to avoid local minima is to add a random perturbation to the weights several times in the middle and late stages of learning. 3.3 Quasi-Similar Approximations. So far we have seen only one of the two types of hidden units produced by the algorithm, which hereafter will be called nonreplicated units: those whose weight pattern appears only once in the network. But the algorithm often produces weight configurations in which several units are replicated. In these cases the number of surviving units is not indicative of the complexity of the network, and, indeed, it is not unusual that an increase of α produces a minimum with more surviving units, some of them being replicated. The replication of a hidden unit depends on the degree of linearity that its output exhibits in the range of the input training data. The nonreplicated units are always nonlinear, whereas the replicated ones may be of either type. As will be shown next, when the replicated units are nonlinear, the global minimum contains a fixed number of them, irrespective of the size of the network. On the contrary, the number of linear replicated units grows with network size, tending to fill the whole network. Suppose we have a network with a number n > 0 of nonsurviving hidden units, and assume that one of the surviving units is almost linear. Since the second derivatives associated with a hidden unit depend only on the squares of the hidden-to-output weights and on the square of the hidden unit activation, splitting a linear hidden unit keeps the same E(W) while reducing the second-derivatives part of E_P(W). Thus, returning to our imaginary network, by splitting the linear hidden unit into n + 1 equal units and replicating its weight pattern for all the nonsurviving units, the minimum value of E_P(W) is obtained with the same E(W). For this reason, we call these units replicated units of the fill-everything type.
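The splitting argument can be checked numerically in the idealized case of exactly linear hidden units (a stand-in for tanh units operating in their linear range). The data, the unit being split, and the rescaling factor `a` below are our own illustrative choices, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))          # input patterns (toy)
w = np.array([0.1, -0.2, 0.05])      # input weights of the unit to split
v0 = 2.0                             # its hidden-to-output weight

def output(W, v):
    # network output with identity (exactly linear) hidden activations
    return np.array([v @ (W @ x) for x in X])

def penalty(W, v):
    # sum over patterns and all weights of (dF/dw)^2, linear hidden units
    total = 0.0
    for x in X:
        total += np.sum((W @ x) ** 2)            # dF/dv_j = h_j
        total += np.sum(np.outer(v, x) ** 2)     # dF/dW_ji = v_j x_i
    return total

def split(m):
    # m copies with input weights a*w and output weights v0/(m*a);
    # F is preserved exactly, and a is chosen to minimize the penalty
    S_x = sum(x @ x for x in X)
    S_h = sum((w @ x) ** 2 for x in X)
    a = np.sqrt(abs(v0) * np.sqrt(S_x / S_h) / m)
    return np.tile(a * w, (m, 1)), np.full(m, v0 / (m * a))

W1, v1 = w[None, :], np.array([v0])
W3, v3 = split(3)
assert np.allclose(output(W3, v3), output(W1, v1))   # E(W) unchanged
assert penalty(W3, v3) < penalty(W1, v1)             # regularizer reduced
W6, v6 = split(6)
assert np.isclose(penalty(W6, v6), penalty(W3, v3))  # size-independent
```

Under this idealization the rescaled split reaches the same reduced penalty for 3 and 6 copies, which mirrors the rescaling behavior reported for fill-everything units.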
As an example, see Figure 5, where the weights resulting from training a 6 HU and a 12 HU network with 25 samples of the function sin(x1 + x2) up to AVG(W) = .00005, using α = .02, are shown. One can see that in Figure 5a, there are five functionally identical units, the first and second of which have inverted sign patterns in relation to the others. In Figure 5b, that same replicated unit appears 11 times, but the magnitudes of its weights are smaller. Thus, when the resulting weight configuration contains linear replicated units, enlarging the number of hidden units causes a rescaling of these replicated hidden units. Because the hidden units cannot be completely linear, this rescaling imposes a slight adjustment on the remaining, nonlinear units. Figure 6 shows the weights of the same 6 HU and 12 HU networks used in Figure 5 for a complete range of α's. Here, for 0 < α < .04, the configuration in both networks consists of a nonreplicated unit and fill-everything linear units. When the number of hidden units tends to infinity, the rescaling tends to be completely linear, and the adjustment of the nonreplicated units tends to zero. Instead, when there are too few units, the rescaling makes the hidden units too nonlinear, and thus a completely different solution is found.
Figure 5: Results of training a 6 HU and a 12 HU network with 25 samples of the function sin(x1 + x2), using α = .02. The resulting configurations contain replicated units of the fill-everything type.
Figure 7 shows an example with the same settings as in Figure 5, but supplying more information to the network (30 patterns are used) and requiring more fidelity to the data (α = .0066). Here, the absolute values of the weights of the replicated units are larger, and thus they do not admit a redistribution into just five units. Therefore, the 6 HU network converges to an entirely different configuration. On the other hand, when the replicated units are significantly nonlinear, the scenario is much more restricted, because a fixed number of replicated units is involved in the global minimum. We call these units replicated units of the fixed-number type. For instance, in Figure 6, α = .04 produces a solution with four replicated units, in both the 12 HU and the 6 HU networks. Different initializations always lead to the same type of solution with the 12 HU network, but this solution is more difficult to find for the 6 HU network, because it requires almost all its hidden units, and approximately half of the time it produces a solution with three replicated units. The behavior of the 6 HU network is also favored by another fact: if the replicated units are not highly nonlinear, one more or one less replication can
Figure 6: Representation of the weights resulting from training 6 HU and 12 HU networks under the same conditions as in Figure 5, but for a complete range of α's. In the top left plot, the number of weights with the same value is indicated beside the corresponding symbol.
be a solution with an E_P(W) value almost indistinguishable from the global minimum. A consequence of all this is that quasi-similar functions can be implemented by different networks, even if α is not in their confluence range. To measure the degree of similarity between two functions, we define the functional distance as the mean square distance between the outputs of the two networks on a grid of 10,000 points regularly distributed in the input domain. Figure 8 shows the functional distances between architectures with
Figure 7: Results of training a 6 HU and a 12 HU network with 30 samples of the function sin(x1 + x2), using α = .0066. Here, the 6 HU network is too small to accommodate the rescaling of the fill-everything units, and different solutions are found by the two networks.
different numbers of hidden units, minimized for several α's. Since for the two highest α's some of the networks fell in local minima, in these cases we used the best minimum selected from three different random trials. It is evident that as α grows, all the architectures tend to produce the same results. But the most interesting observation from this graph is that for any α > 0, the distances between architectures decrease very quickly as the number of hidden units grows, and they are indistinguishable from zero above 50 units. Notice that the larger the architectures, the more the compared networks differ in their number of hidden units, so the above observation agrees with the expectation of closer similarity for larger nets. Of course, for each given α, all the architectures exhibit almost the same generalization error, especially those above 50 HUs. An interesting fact derived from Figure 8 is that as the sizes of the networks grow, their confluence occurs at lower α values. Thus, the chances that the value of α leading to the best generalization falls within the confluence region increase with the size of the networks. In other words, it seems that good generalization and confluence can be simultaneously achieved for large networks.
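The functional distance used in these comparisons is straightforward to compute. A sketch with made-up two-input networks and an assumed input domain of [0, 2]² (the article does not specify the grid bounds):

```python
import numpy as np

def net(U, v):
    # two-layer net with linear output: F(P) = tanh(P U^T) v
    return lambda P: np.tanh(P @ U.T) @ v

def functional_distance(f, g, lo=0.0, hi=2.0, n=100):
    """Mean square distance between the outputs of two networks on an
    n-by-n regular grid (n=100 gives the 10,000 points of the article)."""
    s = np.linspace(lo, hi, n)
    P = np.array([(x1, x2) for x1 in s for x2 in s])
    return np.mean((f(P) - g(P)) ** 2)

rng = np.random.default_rng(3)
U1, v1 = rng.normal(size=(6, 2)), rng.normal(size=6)
f = net(U1, v1)
g = net(U1 + 0.01, v1)            # slightly perturbed copy of f
d_same, d_near = functional_distance(f, f), functional_distance(f, g)
```

Identical networks give a distance of exactly zero, and a small weight perturbation gives a small positive distance, which is the behavior Figure 8 quantifies across architectures.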
Figure 8: Evaluation of the similarity between the functions implemented by architectures with different numbers of hidden units. A complete range of α's is tested.
In conclusion, no matter what the value of α is, large enough networks make similar approximations. For that reason, we can disregard the network-size selection problem and concentrate on the ideal regularizer coefficient to reduce the complexity of the network. 4 Discussion
We have presented a very simple neural network algorithm capable of producing quasi-similar approximations with different network architectures. This algorithm can be viewed as a way of minimizing the expectation of the standard error over a distribution of the weights with a common and fixed variance. Alternatively, it can be seen as the addition of a regularization or penalty term to E(W) to control the degree of adaptivity of the network. For a given α, if the minimum in the limit of infinite HUs does not contain replicated units of the fill-everything type, there exists a threshold number of units above which the function implemented by any architecture is exactly the same. Otherwise, if the minimum admits replicated units, then there exists also a threshold number of HUs above which linear replicated units begin to appear. Above this point, the networks show a close functional similarity, which quickly approaches identity as the number of HUs grows. If enough hidden units are provided, in either case the function implemented by a neural network depends only on α, being independent of the architecture. To make sure that this architecture-independence property is not shared by all weight-elimination procedures, we have run Optimal Brain Damage (Le Cun et al., 1990) on the same data and the same architectures employed in Figure 8. We pruned the weights five at a time and retrained all the networks until only the number of weights indicated on the abscissa was left.⁴ The results displayed in Figure 9 are clear, in the sense that the function implemented by the network varies markedly all along the weight-removal process; in other words, it does not stabilize beyond a given number of weights. The results in this article can be put under a Bayesian light. Neal (1996) has advocated convincingly for the use of neural networks with as many hidden units as can be afforded computationally, regardless of the size of the training set. In a proper Bayesian approach, it makes more sense to allow the maximum information to be extracted from the data and let the prior conveniently reduce the complexity (if the prior is correct).
From this point of view, the first problem is to look for priors over weights such that the corresponding prior over the function reaches a reasonable limit as the number of hidden units goes to infinity. In fact, not a prior but a family of priors, suitably scaled according to the number of HUs, is defined by Neal (1996) to obtain such convergence. The usual type of prior, the gaussian one, converges to gaussian processes, a family of distributions over functions that can be handled more efficiently by a direct implementation. In addition, they produce networks where the contribution of individual HUs is negligible and which consequently are unable to represent "hidden features" of the data. Another problem is that implementing Bayesian inference with methods based on the posterior may break down as the number of HUs becomes large. Instead, Monte Carlo methods are able to do the job, but being slow, they may fail to reach the true posterior in a reasonable time for a number of HUs close enough to the limit.
⁴ Note that an initial pruning was necessary to make the number of weights of all networks a multiple of five before beginning to prune the weights five at a time.
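The Optimal Brain Damage procedure used in this comparison can be sketched as follows: the saliency of weight k is (1/2)H_kk w_k², and the five smallest-saliency weights are removed per cycle. The toy quadratic error below (whose Hessian diagonal is known exactly) stands in for a trained network, and the retraining between cycles is omitted:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
# Toy error E(w) = (1/2) ||A w||^2, whose Hessian is A^T A, so the
# diagonal needed for OBD saliencies is the column-wise sum of A^2.
A = rng.normal(size=(n, n))
H_diag = np.sum(A ** 2, axis=0)
w = rng.normal(size=n)
alive = np.ones(n, dtype=bool)

def prune_step(w, alive, k=5):
    """Remove the k weights with the smallest OBD saliencies."""
    saliency = 0.5 * H_diag * w ** 2
    saliency[~alive] = np.inf          # already-removed weights are skipped
    for idx in np.argsort(saliency)[:k]:
        alive[idx] = False
        w[idx] = 0.0
    return w, alive

for _ in range(2):                     # two pruning cycles (no retraining)
    w, alive = prune_step(w, alive)
```

In the article's experiment, each such cycle is followed by retraining the surviving weights, which this sketch leaves out.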
Figure 9: Evaluation of the similarity between the functions implemented by pruned networks (following Optimal Brain Damage) derived from initial architectures containing different numbers of hidden units. The different symbols correspond to the same pairs of networks as in Figure 8, and the scale is also the same as in that figure.
The algorithm presented in this article is based on a regularization term, α Σ_{j,i} (∂F/∂w_{ji})², that can be considered a prior on the weights for which, as we have seen, the posterior maximum reaches a limit as the size of the network goes to infinity. This limit is reached abruptly if there are only nonreplicated and fixed-number replicated units, and more smoothly if there are fill-everything replicated units. In the former case, the limit is easily reached and identified; in the latter case, a good approximation can also be identified when adding a new neuron produces a rescaling of the replicated units. In addition, in the infinite limit, the contribution of some individual
units (nonreplicated and fixed-number replicated) is relevant, and that of the remaining units can be grouped into a linear unit. This means that the set of surviving units can represent hidden features, a fact that can be useful to link several outputs of the function or for knowledge transfer between tasks. It would be interesting to analyze the features that confer these properties on our prior, so that other priors with these or similar properties can be devised. One such feature is surely the fact that our prior treats the weights in groups associated with the hidden units, not independently.
References
An, G. (1996). The effects of adding noise during back-propagation training on generalization performance. Neural Computation, 8, 643–674.
Hanson, S. J. (1990). A stochastic version of the delta rule. Physica D, 42, 265–272.
Hinton, G. E., & van Camp, D. (1993a). Keeping neural networks simple. In Proceedings of the International Conference on Artificial Neural Networks (pp. 11–18). Amsterdam.
Hinton, G. E., & van Camp, D. (1993b). Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth ACM Conference on Computational Learning Theory (pp. 5–13). Santa Cruz.
Lang, K. J., Waibel, A. H., & Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3, 23–43.
Le Cun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. Touretzky (Ed.), Advances in neural information processing systems, 2. San Mateo, CA: Morgan Kaufmann.
Murray, A. F., & Edwards, P. J. (1993). Synaptic weight noise during multilayer perceptron training: Fault tolerance and training improvements. IEEE Transactions on Neural Networks, 4, 722–725.
Murray, A. F., & Edwards, P. J. (1994). Enhanced multilayer perceptron performance and fault tolerance resulting from synaptic weight noise during training. IEEE Transactions on Neural Networks, 5, 792–802.
Neal, R. M. (1996).
Bayesian learning for neural networks. New York: Springer-Verlag.
Ruiz de Angulo, V. (1996). Interferencia catastrófica en redes neuronales: Soluciones y relación con otros problemas del conexionismo. Unpublished doctoral dissertation, Basque Country University.
Ruiz de Angulo, V., & Torras, C. (1994). Random weights and regularization. In Proceedings of the International Conference on Artificial Neural Networks (pp. 1456–1459). Sorrento.
Ruiz de Angulo, V., & Torras, C. (1998). Averaging over networks: Properties, evaluation and minimization (Tech. Rep. No. IRI-DT 9811). Barcelona: Universitat Politècnica de Catalunya.
Received May 11, 1999; accepted July 27, 2000.
LETTER
Communicated by Risto Miikkulainen
Analyzing Holistic Parsers: Implications for Robust Parsing and Systematicity Edward Kei Shiu Ho Department of Computing, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong Lai Wan Chan Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong Holistic parsers offer a viable alternative to traditional algorithmic parsers. They have good generalization performance and are inherently robust. In a holistic parser, parsing is achieved by mapping the connectionist representation of the input sentence directly to the connectionist representation of the target parse tree. Little prior knowledge of the underlying parsing mechanism thus needs to be assumed. However, this also makes holistic parsing difficult to understand. In this article, an analysis is presented for studying the operations of the confluent preorder parser (CPP). In the analysis, the CPP is viewed as a dynamical system, and holistic parsing is perceived as a sequence of state transitions through its state-space. The seemingly one-shot parsing mechanism can thus be elucidated as a step-by-step inference process, with the intermediate parsing decisions being reflected by the states visited during parsing. The study serves two purposes. First, it improves our understanding of how grammatical errors are corrected by the CPP. The occurrence of an error in a sentence will cause the CPP to deviate from the normal track that is followed when the original sentence is parsed. But as the remaining terminals are read, the two trajectories will gradually converge until finally the correct parse tree is produced. Second, it reveals that having systematic parse tree representations alone cannot guarantee good generalization performance in holistic parsing. More important, they need to be distributed in certain useful locations of the representational space.
Sentences with similar trailing terminals should have their corresponding parse tree representations mapped to nearby locations in the representational space. The study provides concrete evidence that encoding the linearized parse trees as obtained via preorder traversal can satisfy such a requirement.
© 2001 Massachusetts Institute of Technology. Neural Computation 13, 1137–1170 (2001)
1 Introduction
Traditionally, symbolic approaches have dominated the study of language parsing. Various models have been devised, such as the chart parser and the augmented transition network (Allen, 1995; Gazdar & Mellish, 1989). Generally these models were characterized by their algorithmic and rule-based nature. The parsing process was governed by a well-defined algorithm in which the input sentence was analyzed step by step, with the target parse tree being built incrementally at the same time. Besides, symbolic representations were adopted for sentences and parse trees. Also, the construction of the parser required a detailed knowledge of the underlying grammar. Although remarkable success has been achieved in computer language processing, their practicality in natural language processing has been relatively limited. The major reason is that they have poor error recovery capability when parsing ungrammatical sentences. Their rule-based nature is too rigid to provide enough robustness as demanded by natural languages (Kwasny & Faisal, 1992). Besides, the extreme flexibility of natural languages effectively prohibits the complete enumeration of the grammar rules, which are essential for constructing an algorithmic parser. In view of this, researchers have turned their attention to more data-oriented paradigms. Generally a model would be employed that acquired the target language by exposure to samples generated by it (commonly called positive data).¹ Among various paradigms, the statistical approach (Charniak, 1993) has already aroused the interest of many researchers. On the other hand, as inspired by the study of cognitive science, neural networks have been applied in language parsing also. One of the earliest models was due to Fanty (1985), who proposed a procedure that could construct a connectionist parser from any given context-free grammar (so no learning was involved). The parser employed localist representation techniques and was purely syntactic in nature.
A similar "nonlearning" model was also devised by Selman (1985), who made use of a Boltzmann machine to implement the parser. Later, Hanson and Kegl (1987) put forward a connectionist model called PARSNIP, which was actually an autoassociative feedforward network (Rumelhart, Hinton, & Williams, 1986). To process a sentence, the syntactic tags corresponding to the words in the sentence were presented to the input layer in parallel. In contrast to Fanty's and Selman's models, distributed representations rather than localist ones were adopted. Processing succeeded if the network could "reproduce" at its output layer the syntactic tag corresponding to each position of the input sentence (so it was more a language recognizer than a parser).
¹ In general, negative data (counterexamples not generated by the language) would also be needed in order to avoid overgeneralization, although Angluin (1980) proved that, theoretically, certain classes of languages are learnable using positive data alone.
[Figure 1: symbolic representation of the input sentence → (recursive encoding) → connectionist representation of the input sentence → (holistic transformation) → connectionist representation of the target parse tree → (recursive decoding) → symbolic representation of the target parse tree.]
Figure 1: Holistic parsing paradigm (adapted from Ho & Chan, 1999).
A common drawback of these three models was that they all imposed a limit on the sentence's length. To circumvent this, Santos (1989) proposed the parsing and learning system (PALS), in which a matrix of units was used to represent the structure of the target parse tree. In each step, only a partial parse tree, called a snapshot, was stored in the matrix, which was combined with the next word of the sentence to grow into a larger snapshot. The complete parse tree was thus built up incrementally. This allowed the parser to handle sentences of unbounded length. The construction of the parse tree was guided by a set of language rules generated from the grammar rules. The applicability of each rule was determined by a rule weight that was adapted through learning. Besides these early approaches, effort has been made to integrate connectionist techniques into the symbolic parsing framework, leading to various hybrid parsers. For example, the connectionist deterministic parser (Kwasny & Faisal, 1992) consisted of a symbolic component, which is a deterministic parser, and a subsymbolic component, which is a feedforward network. During parsing, based on the contents of the parser's symbolic data structures, the feedforward network output the action to perform, and the parser manipulated its data structures accordingly. Other examples include the sentence gestalt model (St. John & McClelland, 1990), the PARSEC model (Jain, 1991), the neural network push-down automaton (NNPDA) (Sun, Giles, Chen, & Lee, 1993), and the subsymbolic parser for embedded clauses (SPEC) (Miikkulainen, 1996). However, these attempts still required a detailed specification of the parsing algorithm and the grammar. They were still algorithmic and rule-based, although the rules might have more flexible decision boundaries. The inductive learning power of neural networks had not been fully utilized, and error recovery performance was unsatisfactory. 2 Holistic Parsing
Recently an alternative approach, holistic parsing, has been studied (see Figure 1). Several models have been proposed, including Reilly’s parser (Reilly, 1992), the XERIC parser (Berg, 1992), the modular connectionist parser (Sharkey & Sharkey, 1992), the backpropagation parsing network
(BPN) (Ho & Chan, 1994), and the confluent preorder parser (CPP) (Ho & Chan, 1997). In a holistic parser, connectionist representations are developed for sentences and parse trees. In contrast to the algorithmic approaches, parsing is achieved by holistically mapping the representation of the input sentence to the representation of the target parse tree. Ho and Chan (1999) proposed a general framework for holistic parser design. Several design dimensions have been identified:
Encoding sentences. Conventionally, sentences are encoded as sequences by training a recurrent network. Two common choices are the simple recurrent network (SRN) (Elman, 1990) and the sequential recursive autoassociative memory (SRAAM) (Pollack, 1990).
Encoding parse trees. Traditionally, parse trees have been represented as hierarchical data structures, and the recursive autoassociative memory (RAAM) (Pollack, 1990) has commonly been applied for this task. Recently, Kwasny and Kalman (1995) proposed linearizing parse trees by preorder traversal. The sequences obtained can then be encoded by a recurrent network (such as the SRN or the SRAAM).
Implementing the holistic transformation. A feedforward network can be trained to effect an explicit transformation from the representation of the input sentence to the representation of the target parse tree. Alternatively, one may choose to encode the sentences or the parse trees first. The representations obtained can then be used as the training targets for encoding the others. Taking one step further, the representation of a sentence and that of its parse tree can co-evolve simultaneously instead of being developed independently. This technique is known as confluent inference (Chrisman, 1991).
Learning to parse phrases. A holistic parser can be trained to parse phrases in addition to complete sentences. For example, parsing the verb phrase ⟨V D N⟩ gives the subtree (V (D N)).
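The preorder linearization of parse trees is simple to state in code. The tuple and token representations below are our own illustrative choices, not the encoding used in the cited work:

```python
# Preorder linearization of a parse tree: the tree is flattened by
# preorder traversal so a recurrent network can encode it as a sequence.
def preorder(tree):
    if isinstance(tree, tuple):          # internal node with children
        tokens = ['(']
        for child in tree:
            tokens += preorder(child)
        return tokens + [')']
    return [tree]                        # leaf: a terminal symbol

# The verb phrase <V D N> parses to the subtree (V (D N)):
vp = ('V', ('D', 'N'))
print(preorder(vp))   # ['(', 'V', '(', 'D', 'N', ')', ')']
```

The resulting token sequence, rather than the tree itself, is what a sequence encoder such as an SRN or SRAAM would be trained on.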
As depicted in Figure 1, a holistic parser effectively encapsulates its underlying parsing mechanism in a "black box." Instead of defining how parsing is to be achieved, we simply specify what is to be achieved. This has the advantage that the necessary grammatical knowledge can be inferred by inductive learning, but it also makes it difficult to understand how parsing is actually accomplished. In this article, a model is presented for analyzing holistic parsers. The CPP is used as an example to illustrate the method.² In the model, the CPP is viewed as a dynamical system, whose state-space is defined as the set of all distributed representations that can possibly be developed over

2 Experimental results (Ho & Chan, 1999) reveal that among all the holistic parsers proposed, the CPP has the best generalization performance and is also the most robust one.
Analyzing Holistic Parsers
the hidden layer of the network. Through training, a number of states are defined in the state-space, where each state corresponds to a set of hidden-layer representations that produce the same result upon decoding. Some states are designated as final states, which, when decoded, give a syntactically well-formed parse tree.³ Effectively, a finite-state automaton is extracted. When parsing a sentence, the CPP moves from one state to another as each terminal of the sentence is read. A trajectory is thus formed in the state-space, and parsing succeeds if the state at which the trajectory terminates is a final state.

The motivation of this study is threefold. First, the analysis allows the seemingly one-shot parsing mechanism of holistic parsers to be analyzed as a step-by-step inference process, with the intermediate parsing decisions revealed by the states visited during parsing. This can enhance our understanding of how parsing is actually accomplished by a holistic parser. Second, we find that in parsing an erroneous sentence, the trajectory of the CPP in the state-space deviates from the normal track that is followed when the original intact sentence is parsed. However, each state of the CPP actually behaves like an attractor and encloses a set of points in the state-space that give the same result upon decoding. These points constitute the basin of attraction of the state. So if the parser's trajectory is deflected only slightly by the error in the sentence, it may still reach the basin of attraction of the target final state. In that case, the erroneous sentence can be recovered. This characteristic equips the CPP with a certain tolerance to noise. Finally, the analysis reveals that the specific encoding method adopted by a holistic parser for representing parse trees has a profound effect on the distribution of the states in the state-space, which is a determining factor of the generalization capability of the parser. With respect to this issue, the simulation result provides concrete evidence that linearizing parse trees (as in the CPP) can improve the generalization performance of a holistic parser.

3 Confluent Preorder Parser (CPP)
Because we will use the CPP to illustrate the method for analyzing holistic parsers, this section first gives a brief review of it.

3.1 Training Methodology. The CPP combines two SRAAMs to form a dual-ported SRAAM (Chrisman, 1991), with a single hidden layer shared between them (see Figure 2). Two techniques are applied. First, each parse tree is linearized by preorder traversal to give a sequence (see also Kwasny & Kalman, 1995). For example, linearizing the parse tree in Figure 3 leads to the preorder traversal sequence ⟨s np D N vp V np np D N pp P np D N⟩.

3 A parse tree or subtree is syntactically well formed if it can be derived step by step from a nonterminal by using the production rules of the context-free grammar given.
[Figure 2 depicts the dual-ported SRAAM: an SRAAM sentence encoder and an SRAAM preorder traversal encoder share one hidden layer. Each encoder's input layer holds a prefix field plus an ith-terminal field; the input-to-hidden weights form the encoder sub-net, the hidden-to-output weights form the decoder sub-net, and error is backpropagated between the two SRAAMs to effect confluent inference.]

Figure 2: Confluent preorder parser (CPP).
The training sentences and their corresponding preorder traversals are then encoded by the sentence encoder and the preorder traversal encoder, respectively. To encode a sentence, we input its terminals to the sentence encoder one at a time in the ith-terminal field. In each time step i, the hidden-layer activation at time i − 1 serves as an additional input to the network in the prefix ⟨1 . . . i − 1⟩ field. The SRAAM then learns to reproduce the inputs at its output layer, and the resulting hidden-layer activation is used as one of the inputs in the next time step. This process is repeated recursively until the whole sentence is read, and the final hidden-layer activation is used as the sentence coding. Preorder traversals are encoded by the preorder traversal encoder similarly.

Instead of training the two SRAAMs independently, confluent inference is applied. For each sentence, two extra training patterns are introduced, which involve adapting the weights of both SRAAMs. For example, given the sentence ⟨D N V D N⟩, which corresponds to the preorder traversal

Figure 3: Parse tree of the sentence ⟨D N V D N P D N⟩ (adapted from Ho & Chan, 1999).
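The recursive folding performed by each encoder can be sketched as follows; this is a minimal illustration with random placeholder weights (in the CPP, `W` and `b` result from training the autoassociator, and the one-hot terminal codes are merely an assumed input representation):

```python
import numpy as np

def encode_sequence(terminals, codes, W, b):
    """SRAAM-style encoding: fold a symbol sequence into one hidden vector.
    In each step the previous hidden activation (the prefix field) is
    concatenated with the current terminal's code and passed through the
    encoder sub-net; the final activation is the coding of the sequence."""
    n_hidden = len(b)
    prefix = np.zeros(n_hidden)                  # empty-prefix start state
    for t in terminals:
        x = np.concatenate([prefix, codes[t]])   # [prefix(i-1); ith terminal]
        prefix = 1 / (1 + np.exp(-(W @ x + b)))  # new hidden-layer activation
    return prefix

# Toy usage: 3 terminal symbols, one-hot coded, 5 hidden units.
codes = {t: np.eye(3)[i] for i, t in enumerate("DNV")}
rng = np.random.default_rng(0)
W, b = rng.normal(size=(5, 8)), np.zeros(5)      # 8 = 5 hidden + 3 code units
coding = encode_sequence("DNVDN", codes, W, b)   # coding of <D N V D N>
```

Training additionally forces each SRAAM to reproduce its inputs at the output layer (the decoder sub-net), so the folding is invertible; the same loop runs in both the sentence encoder and the preorder traversal encoder.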
Table 1: Training Patterns for Achieving Confluent Inference When Teaching the CPP to Parse ⟨D N V D N⟩.

Pattern 1:
  Input layer:            Sentence encoder: ⟨D N V D⟩ + N;  Preorder traversal encoder: don't care
  Hidden layer:           ⟨D N V D N⟩
  Output layer (target):  Sentence encoder: don't care;  Preorder traversal encoder: ⟨s np D N vp V np D⟩ + N

Pattern 2:
  Input layer:            Sentence encoder: don't care;  Preorder traversal encoder: ⟨s np D N vp V np D⟩ + N
  Hidden layer:           ⟨s np D N vp V np D N⟩
  Output layer (target):  Sentence encoder: ⟨D N V D⟩ + N;  Preorder traversal encoder: don't care

Note: Fields marked "don't care" signify that the weights connecting the units in these fields are fixed when learning the pattern concerned.
⟨s np D N vp V np D N⟩, the two patterns in Table 1 have to be included. In the first pattern, error is backpropagated from the output layer of the preorder traversal encoder to the input layer of the sentence encoder; in the second one, error is backpropagated from the output layer of the sentence encoder to the input layer of the preorder traversal encoder. These patterns place extra constraints on the training of the two SRAAMs such that when training succeeds, the sentence and its corresponding preorder traversal are represented by the same coding. Besides complete sentences, the CPP learns to parse phrases to produce the corresponding subtrees. Experimental results (Ho & Chan, 1999) suggest that this can lead to better performance for a holistic parser in general.

3.2 Experiments. We use the context-free grammar in Table 2 (adopted from Pollack, 1990) to evaluate the performance of the CPP. In all, 112 sentences and their corresponding parse trees are generated. Among them, 80 sentences are randomly selected for training, and the remaining 32 sentences are reserved for testing. Three runs are performed (each uses different initial weights), and the average performance is reported. Generalization performance is measured by the percentage of testing sentences that can be correctly parsed by the trained network. In addition to generalization performance, we also evaluate the parser's robustness. Four types of errors are considered: erroneous sentences with one terminal substituted by a wrong terminal (SUB), with an extra terminal inserted (INS), with one terminal omitted (OMI), and with two neighboring terminals exchanged (EX). Erroneous sentences are generated systematically by injecting an error into every possible position of each training sentence. There are 853 SUB sentences, 853 OMI sentences, 933 INS sentences, and 773 EX sentences. Robustness is defined as the percentage of erroneous sentences of

Table 2: Context-Free Grammar Used in the Experiment (Pollack, 1990).
Sentence:              s → np vp | s → np V
Noun Phrase:           np → D ap | np → D N | np → np pp
Verb Phrase:           vp → V np | vp → V pp
Prepositional Phrase:  pp → P np
Adjectival Phrase:     ap → A ap | ap → A N
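The four error types of Section 3.2 can be generated mechanically. A sketch over the terminal alphabet of Table 2 (the reported counts of 853/853/933/773 also depend on how duplicate sentences are handled, which this sketch does not settle):

```python
TERMINALS = ["D", "N", "V", "P", "A"]

def sub_errors(s):
    """SUB: one terminal substituted by a wrong terminal, at every position."""
    return [s[:i] + [w] + s[i+1:]
            for i in range(len(s)) for w in TERMINALS if w != s[i]]

def omi_errors(s):
    """OMI: one terminal omitted."""
    return [s[:i] + s[i+1:] for i in range(len(s))]

def ins_errors(s):
    """INS: an extra terminal inserted, at every position."""
    return [s[:i] + [w] + s[i:]
            for i in range(len(s) + 1) for w in TERMINALS]

def ex_errors(s):
    """EX: two neighboring terminals exchanged."""
    return [s[:i] + [s[i+1], s[i]] + s[i+2:] for i in range(len(s) - 1)]

# Erroneous versions of the training sentence <D N V D N>:
sentence = list("DNVDN")
```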
Table 3: Results of the Experiment.

                           Error Recovery
Generalization    SUB       OMI       INS       EX        Average
91.67%            94.72%    70.93%    50.80%    66.88%    70.46%
each type that can be successfully recovered. The results are summarized in Table 3.

4 Analyzing Holistic Parsers
In a holistic parser, sentences are encoded as sequences by using a recurrent network. During parsing, the representation of the input sentence is mapped directly to another representation, which, upon decoding, gives the target parse tree. Apparently no intermediate decisions are involved. However, during the recursive encoding of the input sentence, the hidden-layer activation of the recurrent network evolves as each terminal of the sentence is read. In some sense, it manifests the parser's temporary interpretation of the partial input sentence that it has seen so far. In each time step, the previous hidden-layer activation (much like a preliminary decision) is fed back to the input layer. This feedback is then composed with the current input terminal of the sentence (which provides some new information) to form the next hidden-layer activation (which represents an updated interpretation of the input sentence). Originally, only after the whole sentence is read will the final hidden-layer activation be decoded. But it is also worth examining the intermediate activation patterns so as to investigate the rational inference or knowledge resolution mechanism underlying the holistic parsing process.

4.1 A Dynamical Systems Model of Holistic Parsing. It is well known that neural networks with feedback links can be perceived as dynamical systems (Kolen, 1994; Rodriguez, Wiles, & Elman, 1999). From the network dynamics' point of view, the CPP is no different from an ordinary recurrent network. Hence, it can also be characterized as a dynamical system, and we define its state-space as the set of all distributed representations that can possibly be developed over the hidden layer of the network. We then apply the trained CPP to parse the 80 training sentences again. Besides the representations of the complete sentences, we collect the intermediate hidden-layer activation of the CPP in response to each terminal read.
Each possible hidden-layer activation pattern corresponds to a point in the state-space of the dynamical system. Intuitively, it represents the parser's configuration during the parsing process. In parsing a sentence, the hidden-layer activation changes as each terminal of the sentence is read.
Figure 4: Trajectory of the CPP in the PC1-PC2 subspace when it parses the sentence ⟨D A N V D A N P D N⟩.
Correspondingly, the CPP can be thought of as moving from one point to another in the state-space. A trajectory is thus formed that traces the parsing process (see also Elman, 1990). Unfortunately, the high dimension of the hidden layer makes it impossible to visualize the state-space directly. Principal component analysis is thus applied, and the state-space is projected onto the two-dimensional subspace spanned by the first two principal components, PC1 and PC2. For example, Figure 4 shows the trajectory of the CPP in the PC1-PC2 subspace when it parses the training sentence ⟨D A N V D A N P D N⟩.

4.2 Extracting a Finite-State Automaton from the CPP. The trajectories can in fact provide useful hints about the underlying parsing mechanism of the CPP. To reveal this information, we decode the points that the CPP visits when it parses the 80 sentences (these points correspond to hidden-layer activations). For each point, decoding yields a sequence of terminals that may represent the preorder traversal of a syntactically well-formed parse tree
Figure 5: Distribution of the 80 final states identified for the CPP (that correspond to the 80 training sentences) in the PC1-PC2 subspace.
or subtree. In some sense, it is an interpretation of the configuration of the parser when it is situated at that point. For every unique sequence found, a new state arises, whose identity is defined by the sequence. If the identity of a state is a legitimate preorder traversal, the state is said to be a final state. In all, 186 distinct states are identified: 102 final states, 83 intermediate states, and 1 starting state. Between them, there are 234 possible transitions.⁴ Figure 5 shows the distribution of the 80 final states that correspond to the 80 training sentences in the PC1-PC2 subspace (note that the states are not identified by clustering the PCA or the representational space). Together, the states and the transitions define a finite-state automaton (Hopcroft & Ullman, 1979), which encodes the grammatical knowledge for the parsing function.

4 The total number of states identified does not vary a lot in different runs of the experiments (within a range of ±10). The same applies to the number of final states.
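The extraction procedure just described can be sketched directly: decode every activation visited during parsing and merge points whose decodings coincide into one state. All names here are hypothetical; `decode` stands in for the trained decoder sub-net, and `is_well_formed` for the legitimacy test of footnote 3:

```python
def extract_automaton(trajectories, decode, is_well_formed):
    """Build a finite-state automaton from parsing trajectories.
    trajectories: one list per sentence of (activation, terminal_read) pairs.
    Returns (states, transitions, final_states); a state's identity is the
    symbol sequence its activations decode to."""
    states = {"START": 0}          # identity -> state number
    transitions = set()
    final_states = set()
    for traj in trajectories:
        current = 0                # every sentence starts in the start state
        for activation, terminal in traj:
            identity = tuple(decode(activation))
            nxt = states.setdefault(identity, len(states))
            transitions.add((current, terminal, nxt))
            if is_well_formed(identity):
                final_states.add(nxt)
            current = nxt
    return states, transitions, final_states
```

Because distinct activations that decode identically are merged into the same state, two sentences sharing a prefix naturally share the corresponding states, which is the state-sharing effect discussed below.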
Figure 6: State transition sequence that results when the CPP parses the sentence ⟨D A N V D A N P D N⟩ (see Figure 4 and Table 4).
Holistic parsing can thus be redefined. Before parsing begins, the CPP is in the starting state. It then changes from one state to another as each terminal of the input sentence is read (we call this a state transition). If the last state that the CPP resides in is a final state, the sentence is successfully parsed, and the identity of the final state gives the required parse tree. Intuitively, holistic parsing of a sentence can thus be analyzed as a step-by-step rational inference procedure, with the intermediate parsing decisions reflected by the identities of the states visited by the parser during the processing of the sentence. For example, Figure 6 depicts the state transition sequence involved when the CPP parses the sentence ⟨D A N V D A N P D N⟩. Here, nodes with a double circle denote final states. For convenience, each state is labeled by a unique number. The identities of the states concerned are listed in Table 4.

Table 4: Identities of the States Appearing in Figure 6.

State   Identity
0       Starting state
1       ···
2       (D N)
3       (D (A N))
4       ((D (A N)) V)
5       ···
6       ((D (A N)) (V (D N)))
7       ((D (A N)) (V (D (A N))))
9       ((D (A N)) (V ···))
10      ((D (A N)) (V ((D ···) ···)))
11      ((D (A N)) (V ((D (A N)) (P (D N)))))

Note: ··· represents a sequence that cannot be interpreted to give a well-formed tree or subtree.

Two observations merit further discussion. First, some of the state transition sequences that result from parsing similar sentences (sentences that share subsequences) overlap to a certain extent. Thus, some states are shared or reused. For example, when ⟨D A N V D A N⟩ and ⟨D A N V D N P D N⟩ are parsed, their respective prefixes ⟨D A N V D A⟩ and ⟨D A N V D N⟩ result in the same state transition sequence 0 → 1 → 2 → 3 → 4 → 5 → 6. As a result of sharing states, the number of states of the finite-state automaton is effectively reduced. Recall that when the 80 training sentences are parsed, 186 states have been identified. But if they were represented as independent sequences, the number of states should have been 235.⁵ Hence, it is a sign that the CPP is not simply memorizing the training examples. Intuitively, some kind of knowledge abstraction has occurred, in which the CPP picks out the similarities between the sentences and forms shared states accordingly to capture the commonalities.⁶ For example, state 6 abstracts the common characteristics of the two similar prefixes ⟨D A N V D A⟩ and ⟨D A N V D N⟩. In parsing the 80 training sentences, 42 shared states are found, and the maximum number of sequences encoded by a shared state is three (for example, state 49 represents the three subsequences ⟨D N V D A⟩, ⟨D N V D N⟩, and ⟨D N V P D⟩). Note that in addition to the similarities, the differences between the sentences are captured by the shared states as well. Consider ⟨D A N V D A N⟩ again. Upon decoding, state 6 gives the parse tree ((D (A N)) (V (D N))). After reading the last terminal, state 7 is reached, whose identity is the parse tree ((D (A N)) (V (D (A N)))). Thus, the fact that ⟨D A N V D A⟩ has just been read is "carried" implicitly by state 6, although it is not manifested explicitly by its identity. Note that the sharing of states may facilitate generalization also. When an unseen sentence similar to certain training sentences is parsed, some of the states that were discovered while parsing the training sentences will probably be revisited. Intuitively, these states identify the familiar fragments in the unseen sentence, based on which the CPP can work out the total parse tree. For example, state 50 originally represents the subsequence ⟨D N V D A N⟩ only. But after reading the prefix ⟨D N V D A A⟩ of the testing sentence ⟨D N V D A A N⟩, state 50 is visited again. In other words, state 50 becomes a shared state and encodes ⟨D N V D A A⟩ as well.
When the last terminal N of the testing sentence is read, the CPP travels from state 50 to state 189 (a new state), which outputs the target parse tree. However, if the sentences were merely "hard-coded" as independent sequences, generalization would need to be done on the fly. The second observation is that when we interpret the identities of the intermediate states, we find that some of them reveal traces of partial subtrees (such as states 9 and 10 in Table 4). To a certain extent, they help to explain the underlying parsing mechanism of the CPP by illustrating how the target parse tree evolves during the parsing process. Besides, this observation serves as evidence that the CPP has induced the internal structure of the sentences. Thus, grammatical knowledge has been acquired.

5 In parsing the 80 training sentences, the average number of times each state is visited is 4.6 (excluding the starting state).
6 In general, sentences sharing subsequences give rise to similar representations when encoded by the CPP. If the overlaps between two sentences are substantial, their representations will be sufficiently close to each other in the representational space. Consequently, they are decoded to give the same state.
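The redefinition of holistic parsing as a sequence of state transitions can be sketched as a run over the extracted automaton (a deliberately simplified deterministic view; Section 5.1 observes that the extracted machine is in fact nondeterministic):

```python
def parse(sentence, delta, final_identities):
    """Run the extracted automaton: start in the starting state and take one
    transition per terminal; parsing succeeds iff the state reached is final.
    delta maps (state, terminal) -> next state; final_identities maps each
    final state to the preorder traversal of its parse tree."""
    state = "START"
    for terminal in sentence:
        if (state, terminal) not in delta:
            return None                         # stuck: parsing fails
        state = delta[(state, terminal)]
    return final_identities.get(state)          # None if not a final state
```

For example, with a toy transition table for the noun phrase ⟨D N⟩, `parse(["D", "N"], ...)` returns the identity of the final state reached, while any trajectory ending in a nonfinal state returns None.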
4.3 Explaining Generalization. Besides illustrating the holistic parsing mechanism, the dynamical systems model can be used to explain the generalization capability of the CPP. We claim that in addition to the final states corresponding to the training sentences, the training process organizes the state-space in such a way that extra final states are formed. Each extra final state, when decoded, gives the parse tree of an unseen sentence. This enables the CPP to generalize to novel sentences. Recall that in our experiment, 80 training sentences have been used. Each corresponds to a unique final state, giving a total of 80 final states. But it turns out that 22 extra final states have been identified. Among them, 9 correspond to some of the constituent phrases that have been included in training (e.g., the noun phrase ⟨D N⟩),⁷ whereas the remaining 13 extra final states represent complete sentences that are not among the training sentences. In other words, they are novel sentences generalizable by the CPP (see Table 5). Each of these 13 sentences is similar to the training sentences in two ways. First, it is either an exact prefix of a training sentence (e.g., state 28) or is sufficiently similar to the prefix of some training sentence (e.g., state 95). Second, for each of these generalizable sentences, there exists some sentence in the training set that has the same ending (e.g., the sentence ⟨D N P D N V D N⟩ as encoded by state 127 shares the same trailing terminals as the training sentence ⟨D N P D A N V D N⟩). Intuitively, each involves a novel composition of familiar constituents. That means that the constituent phrases used to construct them have also appeared somewhere in the training set (or, as for state 95, phrases that are sufficiently similar have been used before). But they have never been composed together in the same way as they are in the novel sentences. We believe that both types of similarities facilitate the generalization of novel sentences.

To illustrate this, consider state 99 in Table 5. Its corresponding sentence S′ = ⟨D A A N P D N V⟩ is a prefix of the training sentence ⟨D A A N P D N V D A N⟩. Hence, the preorder traversals of their respective parse trees—⟨s np np D ap A ap A N pp P np D N V⟩ and ⟨s np np D ap A ap A N pp P np D N vp V np D ap A N⟩—share the same prefix. However, their trailing parts are still quite different. This suggests that learning to parse a prefix of a complete sentence S (where the prefix itself is a sentence), given only the mapping between S and its corresponding parse tree, is by no means trivial. This inadequacy is supplemented by the existence of certain training sentences whose endings are the same as or sufficiently similar to that of S′ (one example is S″ = ⟨D N P D N V⟩). The preorder traversals of these training sentences will have the same (or similar) trailing terminals as that
7 These constituent phrases have been used to compose the training sentences (note that the CPP has learned to parse both complete sentences and phrases).
Table 5: The 13 Extra Final States Identified When Parsing the 80 Sentences in the Training Set.

State  Identity
4      ((D (A N)) V) (a prefix of the training sentence ⟨D A N V D A N⟩)
99     (((D (A (A N))) (P (D N))) V) (a prefix of the training sentence ⟨D A A N P D N V D A N⟩)
110    (((D N) (P (D (A N)))) V) (a prefix of the training sentence ⟨D N P D A N V D A A N⟩)
28     ((D (A (A N))) (V (D N))) (a prefix of the training sentence ⟨D A A N V D N P D N⟩)
72     (((D (A N)) (P (D N))) (V (D N))) (a prefix of the training sentence ⟨D A N P D N V D N P D N⟩)
89     (((D (A (A N))) (P (D (A N)))) (V (D N))) (a prefix of the training sentence ⟨D A A N P D A N V D N P D N⟩)
113    (((D N) (P (D (A N)))) (V (D (A N)))) (a prefix of the training sentence ⟨D N P D A N V D A N P D N⟩)
142    ((((D (A N)) (P (D N))) (P (D (A N)))) (V (D (A N)))) (a prefix of the training sentence ⟨D A N P D N P D A N V D A N P D N⟩)
148    ((((D (A N)) (P (D N))) (P (D N))) (V (D N))) (a prefix of the training sentence ⟨D A N P D N P D N V D N P D N⟩)
55     ((D N) (V (P (D N)))) (similar to a prefix of ⟨D N V P D A N⟩)
95     (((D (A (A N))) (P (D (A (A N))))) V) (similar to a prefix of ⟨D A A N P D A N V P D A N⟩)
122    (((D N) (P (D (A N)))) (V (P (D N)))) (similar to a prefix of ⟨D N P D A N V P D A N⟩)
127    (((D N) (P (D N))) (V (D N))) (similar to a prefix of ⟨D N P D N V D A N⟩)

Notes: For state 4, the corresponding sentence ⟨D A N V⟩ is a prefix of several training sentences besides the one shown (a similar observation can be made for states 99 and 110), whereas for each of the states 28, 72, 89, 113, 142, and 148, the respective sentence is a prefix of one training sentence only. For the remaining states, the sentences encoded are not the exact prefix of any training sentence.
of S′ (e.g., the preorder traversal of S″ is ⟨s np np D N pp P np D N V⟩). Intuitively, they provide hints on how to parse a novel sentence that ends with similar terminals (such as S′).

In fact, more final states have been formed. To discover them, we parse the 32 testing sentences again. Forty new states—17 final states and 23 intermediate states—are identified as a result. These 17 new final states correspond to 17 of the 29 testing sentences that can be successfully generalized by the CPP. The 12 final states that correspond to the remaining 12 testing sentences
Figure 7: State transition sequence involved in parsing the testing sentence ⟨D N P D N V D N P D N⟩.
are among the 22 extra final states that have already been identified when parsing the 80 training sentences. For example, Figure 7 shows the state transition sequence in parsing the testing sentence ⟨D N P D N V D N P D N⟩, where darkened nodes represent states that are newly discovered. State 209 is one of the 17 new final states identified while parsing the 29 testing sentences. Its identity is the parse tree (((D N) (P (D N))) (V ((D N) (P (D N))))). In Figure 8, the final states corresponding to the 29 testing sentences that can be successfully generalized by the CPP are mapped onto the PC1-PC2 subspace (the red symbols). Also shown are the final states that correspond to the 80 training sentences (the blue symbols). As depicted, the final states corresponding to the testing sentences are clustered around the final states corresponding to the training sentences.

As revealed by the discussion, the generalization performance of the CPP is intimately related to the number of final states that can be realized by training. Since the hidden units are real valued, an infinite number of states can be defined in principle. However, due to discretization in the encoding and decoding processes, only some of them can become final states in practice. In general, the number of final states that can be realized is proportional to the number of training examples used. However, a recursive context-free grammar can potentially generate an infinite number of sentences. In that case, the CPP would need an infinite number of final states, each representing a unique parse tree. Obviously, this is impossible to achieve in practice given only a limited number of training examples. Apparently, this places a theoretical limit on the CPP's computational power. What training should strive for is to organize the state-space in such a way that the maximum number of extra final states can be realized in addition to those defined by the training sentences.
Doubtless, this at least requires careful selection of a training set that is representative of the possible types of structures and the possible modes of syntactic composition.

5 Explaining Error Recovery
The CPP is capable of parsing mildly ungrammatical sentences. In this section, we apply the dynamical systems model to investigate how the CPP corrects errors. Depending on the outcome of the recovery, there are three possible cases to consider.
Figure 8: Distribution of the final states of the CPP in the PC1-PC2 subspace. The red symbols denote the 29 testing sentences that can be successfully generalized, and the blue symbols correspond to the 80 training sentences.
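The two-dimensional views in Figures 4, 5, and 8 are obtained by projecting hidden-layer activations onto the first two principal components. A minimal sketch of such a projection, assuming the collected activations are stacked row-wise in a matrix (not the authors' code):

```python
import numpy as np

def project_pc12(activations):
    """Project points onto the PC1-PC2 subspace via SVD of centered data.
    activations: (n_points, n_hidden) array of hidden-layer states."""
    X = activations - activations.mean(axis=0)   # center each dimension
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                          # (n_points, 2) coordinates
```

Each row of the result gives the (PC1, PC2) coordinates of one visited point, so a parsing trajectory can be drawn by connecting consecutive rows.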
5.1 The Original Sentence is Recovered. Consider the erroneous sentence E1 = ⟨D A N V P A N P D N⟩, which is derived from the training sentence S1 = ⟨D A N V D A N P D N⟩ by substituting its fifth terminal D by a P. By examining the trajectory of the CPP in the PC1-PC2 subspace (see Figure 9), we see that when E1 is parsed, the parser initially follows the same path as when S1 is parsed. But as soon as the wrong terminal P is read (when the CPP is at point t), it deviates from the normal track and travels to u instead of v. Yet the discrepancy between the two paths decreases gradually thereafter as the remaining terminals in E1 are processed. Finally, the CPP resides at x, whereas when S1 is parsed, the trajectory terminates at y. But when x is decoded, the parse tree T1 = ((D (A N)) (V ((D (A N)) (P (D N))))) is obtained, which is exactly the parse tree of the original sentence S1. E1 is thus successfully recovered. On the other hand, when two other erroneous sentences E2 = ⟨D A N V D D N P D N⟩ and E3 = ⟨D A N V D A N P A N⟩ are parsed (which are also
Figure 9: Trajectory of the CPP during the recovery of the erroneous sentence ⟨D A N V P A N P D N⟩ (solid line). The dashed line traces its trajectory in parsing the original correct sentence ⟨D A N V D A N P D N⟩. Points w and z mark, respectively, the final position of the parser when the erroneous sentences ⟨D A N V D D N P D N⟩ and ⟨D A N V D A N P A N⟩ are parsed. Upon decoding, all of w, x, y, and z give the same parse tree ((D (A N)) (V ((D (A N)) (P (D N))))).
derived from S1 by substitution), the CPP terminates at w and z, respectively (see Figure 9). Upon decoding, they give T1 also. These results suggest that the final state corresponding to T1 is not a single point in the state-space. Instead, it occupies a region that at least encloses the points w, x, y, and z. Intuitively, the states are not discrete. Each of them behaves like an "attractor" and encloses a set of points in the state-space. The SRAAM representations (or hidden-layer activations) corresponding to these points give the same parse tree upon decoding. In some sense, these points constitute the basin of attraction of the state (note that the definitions of attractors and basins of attraction as adopted here differ from the conventional ones assumed in dynamical systems study). As w, x, y, and z all lie within the basin of attraction of the same final
Figure 10: State transition sequence involved in parsing the erroneous sentence ⟨D A N V P A N P D N⟩. The identity of state 11 is the parse tree ((D (A N)) (V ((D (A N)) (P (D N))))).
state (state 11 in Figure 10), they are thus decoded to give the same parse tree T1. This helps to explain how E1, E2, and E3 are recovered. It is this attractorlike characteristic of states that equips the CPP with a certain tolerance to noise. Taking one step further, we think that the basin of attraction of a final state, owing to training, is larger than that of a nonfinal one. This is revealed by the state transition sequence for parsing E1 (see Figure 10). Upon reading the wrong terminal P (when the CPP is in state 4), the state transition sequence starts to deviate from its normal track (which is followed when S1 is parsed), and the CPP moves to state 15. Surprisingly, the two sequences then converge again and overlap through the next two terminals. After that, they diverge for the second time. Finally, they terminate at the same state (state 11). An explanation can be offered here. As shown in Table 6, the two states where the two sequences overlap (states 6 and 7) are both final states (they correspond to the training sentences ⟨D A N V D N⟩ and ⟨D A N V D A N⟩, respectively). It is thus reasonable to believe that their basins of attraction are comparatively larger, such that although the state transition sequence deviates from the original track, it can still pass through the basins of attraction of states 6 and 7, whereas states 9 and 10, being nonfinal, have smaller basins of attraction, and the CPP just overshoots them although it passes through their vicinity (see Figure 9). Intuitively, states 6 and 7 exert an attractive force on the CPP that is strong enough to pull it toward them. To a certain extent, this also helps to drift the trajectory of the CPP toward the basin of attraction of the final state 11 (thus, it has contributed to the recovery of E1). Another interesting observation, as revealed by Figure 10, is that the finite-state automaton extracted is nondeterministic. For example, when
Analyzing Holistic Parsers
1155
Table 6: Identities of the States Appearing in Figure 10.

State   Identity                                              Final State
0       Starting state
1       ⟨ s ⟩
2       ⟨ np D N ⟩
3       ⟨ np D ap A N ⟩
4       ⟨ s np D ap A N V ⟩
5       ⟨ P np D N vp ap A D N ⟩
15      ⟨ vp np D N vp ap A D N ⟩
6       ⟨ s np D ap A N vp V np D N ⟩                         √
7       ⟨ s np D ap A N vp V np D ap A N ⟩                    √
9       ⟨ s np D ap A N vp V np D D ap A N V ⟩
226     ⟨ s np D ap A N vp V pp P D ap A N V ⟩
10      ⟨ s np D ap A N vp V np np D ap A V np D N ⟩
227     ⟨ s np D ap A N vp V np np D ap A ap A D N ⟩
11      ⟨ s np D ap A N vp V np np D ap A N pp P np D N ⟩     √
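The identities in Table 6 are preorder traversals of (partial) parse trees. Since every nonterminal in the experimental grammar (s, np, vp, ap, pp) expands to exactly two children, such an identity can be decoded back into a bracketed tree. The sketch below is ours, not the paper's actual decoder:

```python
# Decode a state identity (a preorder traversal) back into a parse tree.
# Assumption: every nonterminal is binary; the terminal categories
# (D, A, N, V, P) are leaves. Trees come out as nested 2-tuples.

NONTERMINALS = {"s", "np", "vp", "ap", "pp"}

def decode(symbols):
    """Recursively rebuild the parse tree from a preorder symbol list."""
    def build(i):
        sym = symbols[i]
        if sym in NONTERMINALS:
            left, i = build(i + 1)
            right, i = build(i)
            return (left, right), i
        return sym, i + 1          # terminal category: a leaf
    tree, _ = build(0)
    return tree

# Identity of state 11 from Table 6:
identity = "s np D ap A N vp V np np D ap A N pp P np D N".split()
print(decode(identity))
# (('D', ('A', 'N')), ('V', (('D', ('A', 'N')), ('P', ('D', 'N')))))
```

Applied to the identity of state 11, this reproduces the parse tree given in the caption of Figure 10.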
the parser is in state 7, there are two different transitions for the same input symbol P. This reflects the fact that the finite-state automaton is only an imperfect approximation of the CPP, as the operation of the latter is deterministic.8 But at the same time, it further supports our claim that the states are not discrete, and the parser is capable of making "context-sensitive" decisions during parsing: the exact edge taken during a state transition depends on the preceding input symbols read. With reference to Figure 10, if ⟨ D A N V D A N ⟩ has just been read, the parser travels to state 9, whereas if ⟨ D A N V P A N ⟩ is read, it moves to state 226 instead. The parser thus behaves like a graded-state machine (Servan-Schreiber, Cleeremans, & McClelland, 1991), and apparently its states have some type of "memory."

5.2 A Grammatical Sentence Other Than the Original One Is Recovered. Consider the erroneous sentence E4 = ⟨ A A N P D N V ⟩, which is
obtained by deleting the first terminal from the training sentence S2 = ⟨ D A A N P D N V ⟩. As shown in Figure 11, the parser resides at x after reading the last terminal of E4, whereas when S2 is parsed, the trajectory terminates at y. Upon decoding, x gives the parse tree T2 = (((D (A N)) (P (D N))) V), while y gives the parse tree T3 = (((D (A (A N))) (P (D N))) V), which corresponds to the original sentence S2. In Figure 11, another point z is also shown, which is the final position of the parser when the sentence S3 = ⟨ D A N P D N V ⟩ is parsed. When decoded, z gives the same parse tree T2. By definition, x and z are in the same final state.

8 Yet a nondeterministic finite-state automaton is equivalent to a deterministic one as far as expressive power is concerned. In fact, the transformation of a nondeterministic finite-state automaton into a deterministic one is well defined (Hopcroft & Ullman, 1979).
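As footnote 8 notes, a nondeterministic finite-state automaton can always be converted into an equivalent deterministic one by the standard subset construction (Hopcroft & Ullman, 1979). A minimal sketch, applied to a toy fragment of the transitions in Figure 10 (illustrative only, not the automaton extracted from the CPP):

```python
# Subset construction: the states of the deterministic automaton are
# sets of nondeterministic states reachable under the same input.

from itertools import chain

def nfa_to_dfa(delta, start, finals):
    """delta: dict mapping (state, symbol) -> set of successor states."""
    alphabet = {sym for (_, sym) in delta}
    start_set = frozenset([start])
    dfa, seen, worklist = {}, {start_set}, [start_set]
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            T = frozenset(chain.from_iterable(delta.get((q, a), ()) for q in S))
            if not T:
                continue
            dfa[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    dfa_finals = {S for S in seen if S & finals}
    return dfa, start_set, dfa_finals

# State 7 of Figure 10 has two outgoing edges for the same symbol P:
delta = {(7, "P"): {9, 226}, (9, "D"): {10}, (226, "D"): {227}}
dfa, s0, fin = nfa_to_dfa(delta, 7, {11})
print(dfa[(frozenset({7}), "P")])  # the two targets merge into one DFA state
```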
1156
Edward Kei Shiu Ho and Lai Wan Chan
Figure 11: Trajectory of the CPP when parsing the erroneous sentence ⟨ A A N P D N V ⟩ (the solid line). The dashed line traces the trajectory when the original correct sentence ⟨ D A A N P D N V ⟩ is parsed. Point z is the final position of the parser when the sentence ⟨ D A N P D N V ⟩ is parsed. Upon decoding, both x and z give the parse tree (((D (A N)) (P (D N))) V), whereas y gives the parse tree (((D (A (A N))) (P (D N))) V).
Intuitively, the error in E4 has deflected the trajectory of the CPP to such an extent that the point x where the trajectory terminates deviates sufficiently far from y. Effectively, x has "escaped" from the basin of attraction of the final state enclosing y (i.e., state 99 in Figure 12). Instead, it is "attracted" by another nearby final state enclosing z (state 70 in Figure 12). This explains why T2 is produced instead of T3.

5.3 A New Final State Is Discovered. Consider the erroneous sentence E5 = ⟨ D A A A N V D P A A N ⟩, which is produced from the training sentence S4 = ⟨ D A A A N V D A A N ⟩ by inserting a superfluous terminal P after its seventh terminal. When E5 is parsed, the trajectory terminates at x, whereas for S4, it terminates at y (see Figure 13). As shown, x is sufficiently
Figure 12: State transition sequence involved in parsing the erroneous sentence ⟨ A A N P D N V ⟩. The identities of states 70 and 99 are (((D (A N)) (P (D N))) V) and (((D (A (A N))) (P (D N))) V), respectively.
Figure 13: Trajectory of the CPP when parsing the erroneous sentence ⟨ D A A A N V D P A A N ⟩ (solid line). The dashed line shows the trajectory when the original correct sentence ⟨ D A A A N V D A A N ⟩ is parsed. At the beginning, the two trajectories overlap. But from point t, they diverge. Point x is situated in a new final state that corresponds to the parse tree ((D (A (A (A N)))) (V (D (A (A (A N)))))).
Figure 14: State transition sequence involved in parsing the erroneous sentence ⟨ D A A A N V D P A A N ⟩.
far away from the final state enclosing y. Upon decoding, x produces the parse tree T4 = ((D (A (A (A N)))) (V (D (A (A (A N)))))), which is not among the parse trees used for training (nor is it one of the parse trees used for testing). Thus, x has entered the basin of attraction of a new final state: state 238 in Figure 14 (whose identity is T4). Moreover, this final state is "reachable" in the sense that when the unseen grammatical sentence ⟨ D A A A N V D A A A N ⟩ is parsed, T4 is produced.

6 Implications for Systematicity in Representations
The performance of a holistic parser is affected by the number of final states that can be realized by training. Yet whether these final states can actually be reached is an equally important issue. Recall that the CPP resides at a point in the state-space after reading the last terminal of the input sentence. If this point lies within the final state corresponding to the target parse tree, parsing succeeds. Otherwise it fails, even if the final state does exist somewhere in the state-space. Hence, the performance of a holistic parser also depends on the distribution of its final states in the state-space. Since the final states are actually sentence or parse tree representations, the parser's performance depends on the distribution of these representations in the representational space (i.e., the state-space). Regarding this issue, one would expect that sentences or parse trees having a similar structure should be represented by similar connectionist coding. Consequently, they should lie close together in the representational space. Fodor and Pylyshyn (1988) referred to this characteristic as the systematicity in representations and claimed that it is the key to good performance in connectionist artificial intelligence systems. However, we will demonstrate that having systematic representations alone is not a sufficient condition for good generalization performance in holistic parsing. Instead, the exact way that the parse tree representations are organized in the representational space is a determining factor of its generalization capability, which we will show depends on the encoding method used.

6.1 Systematicity Revisited. Fodor and Pylyshyn (1988) have posed a difficult challenge to connectionists. The systematicity property, they have
claimed, is a major element underlying human cognitive capacities. In their work, they argued that connectionist representations, lacking a constituent syntactic structure, cannot explain or even exhibit systematicity without creating connectionist implementations of classical symbolic systems. Their work has stimulated much discussion as well as controversy, and a number of approaches and systems have been proposed that claim to be able to meet the systematicity requirement (e.g., Chalmers, 1992; Niklasson & Van Gelder, 1994; Smolensky, 1990). To a certain extent, they have successfully demonstrated that connectionist systems are capable of performing structure-sensitive operations using distributed representations. Along another line of research, much work has also been done on formalizing the systematicity concept. Among them, Hadley (1994) defined four degrees of systematicity—weak, quasi-, strong, and semantic systematicity—for characterizing the generalization performance of a language learning system. Each degree reflects how novel the testing items are relative to the training data used. His model was elaborated by Christiansen and Chater (1994), who proposed that the capability of a connectionist system to process novel inputs can be classified into three levels: weak, quasi-, and strong generalization. The authors also provided some hints of how strong generalization could possibly be achieved by a connectionist system. On the other hand, a more comprehensive framework was put forward by Niklasson and Van Gelder (1994), which classified different degrees of systematicity into six levels. Strictly speaking, the CPP does not fit in any of these classification models, since lexical categories are used instead of words. Besides, each sentence contains a single clause only, and there are no embedded sentences.
But the testing sentences used in our experiments are novel in the sense that each involves a novel syntactical composition of familiar constituents. And regarding the systematicity requirement, the representations developed by the CPP do exhibit a certain syntactical structure. With reference to Figure 5, training sentences that end with similar terminals have their corresponding representations lying close to one another in the PC1-PC2 subspace (this is because sentences are encoded from left to right by an SRAAM, which emphasizes the most recent inputs to the network, that is, the trailing terminals). Roughly eight clusters can be identified, each consisting of sentences having the same verb phrase. Moreover, as depicted in Figure 8, each testing sentence is mapped to the cluster of sentences that have the same trailing terminals. These observations illustrate that the representational space of the CPP is syntactically organized, and systematicity in representations is thus exhibited (equivalently, we also have systematicity in final states).

6.2 The PRAAM Parser. Having systematic representations alone is not a sufficient condition for good generalization performance of a holistic parser. To support our argument, we construct another holistic parser, the
(Legend of Figure 15, clusters labeled by trailing verb phrase: ⟨ ... V ⟩, ⟨ ... V D N ⟩, ⟨ ... V D N P D N ⟩, ⟨ ... V D A N P D N ⟩, ⟨ ... V D A N ⟩, ⟨ ... V P D N ⟩, ⟨ ... V D A A N ⟩, ⟨ ... V P D A N ⟩.)
Figure 15: Distribution of the 80 final states (corresponding to the 80 training sentences) as identified by the PRAAM.
PRAAM, which is a derivative of the CPP (the symbol "P" in PRAAM stands for "parser"). In the PRAAM, parse trees are encoded as hierarchical data structures by training a RAAM. This is in contrast to the CPP, where parse trees are linearized into sequences, which are then encoded by an SRAAM. But in other respects, the two are the same (sentences are encoded by an SRAAM, and confluent inference is applied). The same sets of 80 training cases and 32 testing cases as adopted by the CPP are used to train and evaluate the PRAAM, respectively. The generalization performance is found to be 53.13%, which is considerably worse than that of the CPP. Principal component analysis is then carried out on the representations produced. If we label each representation by the trailing part of the sentence encoded (the verb phrase), the representations do not form very clear-cut clusters in the PC1-PC2 subspace (see Figure 15). Apparently the representations produced by the PRAAM are not systematic. But actually this is not the case. Recall that in the PRAAM, parse trees are encoded as hierarchical data structures by training a RAAM. As a result,
Figure 16: Distribution of the representations produced by the PRAAM for sentences that end with the verb phrases ⟨ V D A N ⟩ or ⟨ V P D N ⟩ (symbols labeled by 1 correspond to sentences that end with ⟨ V P D N ⟩).
a sentence representation, being identical to the representation of its corresponding parse tree due to confluent inference, will be much affected by the order and manner in which the subtrees are composed to form the total parse tree. Effectively, each representation takes the entire parse tree structure into account, with more weight on the higher levels, rather than being dictated solely by the last few terminals of the sentence encoded (as when parse trees are linearized). This explains why clear-cut clusters are not observed in Figure 15. As an illustration, let us examine the representations of the sentences that end with the verb phrases ⟨ V D A N ⟩ or ⟨ V P D N ⟩. As shown in Figure 15, the two clusters of representations are tangled up with each other in the PC1-PC2 subspace. But if we label each representation by the subject noun phrase of the sentence instead, more clear-cut clusters can be observed. As depicted in Figure 16, sentences with the same subject noun phrase have their coding mapped to nearby locations in the PC1-PC2 subspace. Moreover, a certain degree of neighboring relationship is observed among the clusters. Precisely, two clusters that correspond to sentences having similar subject noun phrases are located close to each other in the representational
space (e.g., the cluster ⟨ D A A N P D N ... ⟩ is situated between the two clusters ⟨ D A N P D N ... ⟩ and ⟨ D A A N P D A N ... ⟩). The reason is that each parse tree representation is produced by composing the coding of the subject noun phrase and the verb phrase. Hence, both will affect the distribution of the representations. To summarize, both parsers produce systematic representations, but their representations are organized along different dimensions of syntactical characteristics (the CPP emphasizes the trailing terminals of the sentence, while the PRAAM takes the entire parse tree into account), which is caused by the use of different encoding methods for parse trees (parse trees are linearized in the CPP but not in the PRAAM). This results in the discrepancy in their performance.

6.3 Linearizing Parse Trees Improves Generalization Performance.
Consider a novel sentence S0. To parse S0, we first encode it by an SRAAM (recall that in both the CPP and the PRAAM, input sentences are encoded by an SRAAM). Based on our prior discussion, the coding produced will be similar to the representations RS1, RS2, . . . , RSn of those training sentences S1, S2, . . . , Sn that end with the same terminals as S0. Consequently, the codings and, correspondingly, their final states will lie close together in the representational space.9 Note that due to confluent inference, RS1, RS2, . . . , RSn are equal to the corresponding parse tree representations RT1, RT2, . . . , RTn, respectively, where Ti is the parse tree of Si (i = 1, . . . , n). RS0 is then decoded to give a parse tree T. If T is to be equal to the correct parse tree T0 of S0, the representation RT should be sufficiently similar to the correct parse tree representation RT0. In other words, RT (= RS0) has to be situated near RT0 in the representational space; that is, they must be in the same final state. As RT lies close to RT1, RT2, . . . , RTn, so does RT0. This implies that the parse tree representations should be organized according to the trailing terminals of the corresponding sentences encoded. By encoding the linearized forms of parse trees as in the CPP, the representations obtained will be organized according to the trailing terminals of the corresponding sentences. But a similar effect is not achievable if parse trees are encoded hierarchically as in the PRAAM. This explains why the generalization capability of the CPP is stronger than that of the PRAAM. It is thus evident that linearizing parse trees can improve the generalization performance of a holistic parser. As an illustration, let us examine the subset of testing sentences that end with the verb phrase ⟨ V D N ⟩ (see Figure 17). There are five such cases. Only one can be parsed by the PRAAM (the failed cases are labeled 1, 2, 3,
9 As shown in Figures 5 and 8, in the CPP, the training sentences form different clusters in the PC1-PC2 subspace according to their trailing terminals, and each testing sentence is mapped to the cluster of sentences that end with the same terminals.
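The clustering by trailing terminals described above (and in footnote 9) follows from the left-to-right encoding itself. A toy sketch with a contractive fold over one-hot category vectors (a stand-in encoder of our own, not the SRAAM) shows that sentences sharing a suffix receive nearby codes:

```python
# Toy illustration of why a left-to-right recurrent encoder groups
# sentences by their trailing terminals. The encoder is a stand-in
# (a contractive fold with decay 0.5), not the actual SRAAM.

import math

CATS = "DANVP"  # the five terminal categories

def encode(sentence, decay=0.5):
    h = [0.0] * len(CATS)
    for terminal in sentence.split():
        h = [decay * v for v in h]      # older inputs fade geometrically
        h[CATS.index(terminal)] += 1.0  # the most recent input dominates
    return h

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

same_suffix = dist(encode("D N V D N"), encode("D A N V D N"))
diff_suffix = dist(encode("D N V D N"), encode("D N V P D N"))
print(same_suffix < diff_suffix)  # True: shared endings -> nearby codes
```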
(Legend of Figure 17: 1: ⟨ D A A N V D N ⟩; 2: ⟨ D A N P D N V D N ⟩; 3: ⟨ D N P D N V D N ⟩; 4: ⟨ D A N P D N P D N V D N ⟩.)
Figure 17: Distribution of the final states for those sentences that end with the verb phrase ⟨ V D N ⟩. There are five testing cases. The ± symbol corresponds to the one that can be successfully parsed by the PRAAM; the labels 1, 2, 3, and 4 denote the four failed cases. In each failed case, the sentence representation produced by the SRAAM is marked by one symbol, and the corresponding correct parse tree coding as produced by the RAAM is marked by another.
and 4, respectively). However, the parse trees in all five cases can actually be represented by the RAAM of the PRAAM (in fact, the RAAM generalizes perfectly with respect to all 32 testing cases). So the problem is simply that the SRAAM representation produced by encoding the input sentence deviates seriously from the range of correct parse tree representations (i.e., the desired final state) that can be decoded by the RAAM to give the target parse tree. As shown, the sentence representations produced by the SRAAM of the PRAAM in the failed cases are located near the cluster of representations of training sentences that end with the verb phrase ⟨ V D N ⟩. However, the corresponding parse tree coding produced by the RAAM of the PRAAM is remote from that cluster. Parsing thus fails.
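The PC1-PC2 subspaces plotted in Figures 15 through 17 come from ordinary principal component analysis of the representation vectors. A minimal sketch via SVD (the random matrix is only a stand-in for the 80 hidden-layer codes; NumPy assumed):

```python
# Minimal PCA: center the representation matrix and project its rows
# onto the first two right singular vectors (PC1 and PC2).

import numpy as np

def first_two_pcs(reps):
    """Project row vectors onto their first two principal components."""
    centered = reps - reps.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # columns: PC1 scores, PC2 scores

reps = np.random.default_rng(0).normal(size=(80, 30))  # 80 codes, 30 units
pc12 = first_two_pcs(reps)
print(pc12.shape)  # (80, 2)
```

By construction PC1 captures at least as much variance as PC2, which is why the two leading components are the ones plotted.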
As a final remark, although encoding parse trees hierarchically leads to poor generalization in holistic parsing, the coding formed, being systematic, may still be useful for subsequent structure-sensitive operations. In fact, Chalmers (1992) achieved a structure transformation task by using a feedforward network to map an active sentence to its passivized counterpart, where both sentences were encoded by a RAAM. What we want to reiterate is that having representations systematically placed in the representational space does not guarantee good generalization. Whether they are organized in a way that facilitates the task at hand is even more important. To conclude, there exists no single scheme that works for all tasks. The best choice is simply problem dependent.

6.4 Effect of Long-Term Dependencies on Generalization Performance.
Readers may be concerned that as a result of linearizing parse trees, two different sentences that share the same ending will have similar representations. Consequently, their difference can easily be lost in a linearized structure (as in the CPP). This issue is commonly known as the long-term dependency problem (see, e.g., Lin, Horne, Tino, & Giles, 1996). In general, the problem is unavoidable for the CPP because it is operationally a recurrent network. But empirically, our experimental results show that its effect on the performance of the CPP is not too apparent. Consider the following testing sentences:

1. Test1 = ⟨ D N P D N P D N V D N P D N ⟩.
2. Test2 = ⟨ D N P D N P D N V D A N P D N ⟩.
3. Test3 = ⟨ D A N P D N P D A N V D N P D N ⟩.
4. Test4 = ⟨ D A N P D N P D A N V D A N ⟩.
5. Test5 = ⟨ D A N P D N P D A N V P D A N ⟩.

All of these long sentences can be successfully generalized by the CPP. If we look at the training set, each has a similar counterpart (where Testi corresponds to Traini, i being from 1 to 5):

1. Train1 = ⟨ D A N P D N P D N V D N P D N ⟩.
2. Train2 = ⟨ D A N P D N P D N V D A N P D N ⟩.
3. Train3 = ⟨ D N P D N P D A N V D N P D N ⟩.
4. Train4 = ⟨ D N P D N P D A N V D A N ⟩.
5. Train5 = ⟨ D N P D N P D A N V P D A N ⟩.
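The trace left by the single early mismatch that distinguishes each Testi from Traini can be quantified under a toy assumption: a linear contractive encoder h ← decay·h + x (a stand-in of ours, not the CPP's actual network), where the code difference shrinks geometrically as the shared suffix is read.

```python
# Sketch of the long-term dependency effect. Two sentences differ in one
# terminal and then share a suffix of length k: under h <- decay*h + x,
# the remaining code distance is decay**k times the initial one-hot
# mismatch, whose norm is sqrt(2).

import math

def residual(decay, suffix_len):
    """Code distance left by an early one-terminal mismatch."""
    return decay ** suffix_len * math.sqrt(2.0)

print(round(residual(0.5, 4), 4))   # 0.0884: still separable
print(round(residual(0.5, 16), 7))  # effectively lost after 16 reads
```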
Each Testi differs from the respective Traini by one terminal only, and they share the same ending. In the experiments, the CPP is capable of distinguishing the two sentences in each pair and produces the correct parse tree for each (in fact, in none of the testing cases has the CPP misinterpreted a testing sentence as another training or testing sentence). That means that
the distinguishing feature in each case, albeit small and appearing early in the sentence, can still be preserved as the remaining terminals are read. This reflects that, given the maximum length of the sentences used in the experiments (which is 17), the influence of long-term dependencies on the CPP's performance is not noticeably significant. Its effect may become more prominent when longer sentences are involved. Representing parse trees hierarchically suffers from a problem similar to the long-term dependency effect. If the parse trees of two sentences differ only slightly at the leaf level, the difference may be lost during the recursive bottom-up encoding process, so the parser may have difficulty distinguishing them, and the same parse tree is produced for both. For example, when the testing sentence ⟨ D A N P D N P D A N V D A N ⟩ is parsed, the PRAAM gives ((((D N) (P (D N))) (P (D (A N)))) (V (D (A N)))), which actually corresponds to the training sentence ⟨ D N P D N P D A N V D A N ⟩. Such errors have in fact occurred several times for the PRAAM and have caused its poor generalization performance. This suggests that the long-term dependency effect is more apparent when parse trees are encoded hierarchically than when they are linearized.

7 Discussion and Conclusion
We have proposed a method for analyzing holistic parsers. Based on the formalism, holistic parsing is achieved as if a rational inference process is involved, during which the parse tree is built in a step-by-step manner. This illustrates that abstraction of grammatical knowledge has actually occurred during training. The analysis also provides insight into how the performance of a holistic parser is affected by the encoding methods for sentences and parse trees. Although we have demonstrated the approach using the CPP only, we think that it is equally applicable to other holistic parsers, provided that a recurrent network of some type is employed for encoding sentences. To a certain extent, our approach can also be used as a general method for inducing finite-state approximations of context-free grammars. Although finite-state automata are less powerful than their context-free counterparts, they are computationally more efficient language models. In fact, Roche (1997) has shown that finite-state formalisms are capable of modeling a wide range of syntactic structures of the English language efficiently. But conventional approaches (e.g., Pereira & Wright, 1997) usually require the full specification of the grammar in order to build the finite-state automaton for parsing, which may not be available in practice (especially where natural languages are concerned). Other researchers have also attempted to extract a finite-state automaton from a recurrent network. Our model differs from theirs in two major aspects. First, these approaches dealt primarily with regular languages, and the network was trained on either a sequence prediction task (e.g., Servan-
Schreiber et al., 1991) or a sequence classification task (e.g., Giles et al., 1992; Lawrence, Giles, & Fong, 1998; Watrous & Kuhn, 1992). But for holistic parsers, we aim at parsing context-free languages exclusively. Thus, we can at best obtain a finite-state approximation of the target context-free grammar. Second, states are identified differently. In previous approaches, clustering or quantization techniques were often applied to partition the hidden-layer activation space of the network into states. For example, in Giles et al. (1992), each hidden neuron's range was divided into q partitions of equal width; thus, for N hidden neurons, there are q^N possible states. Unavoidably, certain parameters are involved in these algorithms (such as q) that may affect the number of states and the finite-state automaton extracted. More important, the performance of the finite-state automaton may also be influenced. But often their optimal values can only be determined empirically (Giles et al., 1992). Unlike these approaches, our extraction method is data oriented and assumes no prior knowledge of the grammar. The states are defined deterministically by the results of symbolic decoding, so no parameters are involved, and the states are directly interpretable. Moreover, final states have larger basins of attraction than nonfinal states. This implies that it is not always possible to tell whether two points in the state-space belong to the same state by measuring the distance between them. In some sense, this rules out methods where states are defined by clustering the distributed representations using some kind of numerical metric (such as the Euclidean metric). Despite its advantages, there is a concern that the number of states identified by our model may be unbounded. New states may potentially be created when novel or ungrammatical sentences are parsed (although states are sometimes reused).
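The contrast between quantization-based extraction and decoding-based extraction can be sketched as follows (toy points and a toy decoder; all names are ours, not the paper's code):

```python
# Two ways to carve a network's hidden activation space into discrete
# states. The paper's method groups points purely by what they decode
# to; quantization (Giles et al., 1992) needs a parameter q and yields
# up to q**N cells for N hidden units.

from collections import defaultdict

def states_by_decoding(points, decode_fn):
    """Group hidden-state points: same decoding result => same state."""
    states = defaultdict(list)
    for p in points:
        states[decode_fn(p)].append(p)
    return states

def quantize(point, q):
    """Giles-style cell index: each unit's [0, 1) range cut into q bins."""
    return tuple(min(int(x * q), q - 1) for x in point)

points = [(0.12, 0.81), (0.14, 0.79), (0.90, 0.10)]
decode_fn = lambda p: "T1" if p[0] < 0.5 else "T2"  # toy stand-in decoder
print(len(states_by_decoding(points, decode_fn)))   # 2 parameter-free states
print(quantize(points[0], q=4))                     # (0, 3): one of up to 16 cells
```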
On the one hand, this result is natural; finite-state automata can recognize only regular grammars, which are less powerful language models than context-free grammars. So to parse more sentences, more states are needed. Paradoxically, as a context-free grammar can in general generate an infinite number of sentences, the finite-state automaton extracted will, in the extreme case, need an infinite number of states. On the other hand, the number of states can actually be reduced. We may allow two states whose identities are only slightly different to be treated as the same state. Consider Figure 10 again. When ⟨ D A N V P A N P D N ⟩ is parsed, two new states arise: states 226 and 227. Upon decoding, state 226 gives ⟨ s np D ap A N vp V pp P D ap A N V ⟩, which differs from the identity of state 9 (⟨ s np D ap A N vp V np D D ap A N V ⟩) by two terminals only. Hence, state 226 can be treated as the same state as state 9. Similarly, states 227 and 10 can be merged. Note that a parameter has to be defined here: the maximum number of mismatched terminals allowed. It serves as a similarity measure that decides when two states can be merged.
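The merging rule just described, treating two states as one when their identities differ in at most a fixed number of positions, can be checked directly on the identities of states 9 and 226 (a sketch of ours; the helper name is hypothetical):

```python
# Position-wise mismatch count between two state identities, used as
# the similarity measure for merging near-identical states.

def mismatches(id_a, id_b):
    a, b = id_a.split(), id_b.split()
    if len(a) != len(b):
        return float("inf")  # different lengths: never merge
    return sum(x != y for x, y in zip(a, b))

state_9   = "s np D ap A N vp V np D D ap A N V"
state_226 = "s np D ap A N vp V pp P D ap A N V"
print(mismatches(state_9, state_226))        # 2
print(mismatches(state_9, state_226) <= 2)   # True: merge 226 into 9
```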
Alternatively, we may restrict the maximum length of the states' identities. Recall that when a distributed representation is decoded, a sequence is produced from right to left recursively, which may possibly represent the preorder traversal of a syntactically well-formed parse tree. Suppose the maximum length is set to n. Then all but the right-most n symbols of the sequence are discarded. Intuitively, only the right-most n symbols are considered significant. The remaining subsequence is then used as the identity of the state. As the number of symbols is finite, the number of states should also be bounded.10 Besides being an explanatory device, our model helps to identify those factors that are pertinent to the performance of a holistic parser. The generalization capability is related to the number of final states that can be realized through training as well as their distribution in the state-space, and the sizes of the basins of attraction of the final states will affect the robustness. These characteristics are mainly dependent on the network's training process. It is thus useful to explore their relationship with factors such as the number of hidden units (which determines the maximum representational capacity of the network), the training environment (e.g., the learning rate), the coding for the terminals, and the training patterns used (their variety and combination). In fact, we believe that there exists a trade-off between the number of final states formed and the average size of their basins of attraction. Although increasing the number of final states (say, by using more training examples) can improve generalization capability, it may also decrease the sizes of their basins of attraction and increase the "density" of the representational space. Robustness is thus degraded.
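The truncation scheme described earlier, keeping only the right-most n symbols of each identity, bounds the number of states by m^n for a symbol set of size m. A minimal sketch (the identity string is illustrative):

```python
# Bound the state count by truncating each decoded identity to its
# trailing n symbols; distinct truncated tuples define distinct states.

def truncated_identity(identity, n):
    """Keep only the trailing n symbols of a decoded identity."""
    return tuple(identity.split()[-n:])

full = "s np D ap A N vp V np D ap A N"
print(truncated_identity(full, 4))  # ('D', 'ap', 'A', 'N')
```

With the 10 symbols of the experimental grammar (5 nonterminals plus 5 terminal categories) and n = 4, at most 10^4 truncated identities can ever arise, however many sentences are parsed.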
So it is useful to have a formulation of this trade-off in terms of the type of performance expected (e.g., whether generalization or robust parsing is of more concern, and the level of robustness desired). More important, we hope that through regulating the training process, we can exercise control over the attractors and states formed (their number, their distribution in the state-space, and their sizes) (see, e.g., Tsung & Cottrell, 1993). In this way, we can position the holistic parser at a "suitable point" in the trade-off where the specific performance demand is met.11 This may shed light on further improving the performance of a holistic parser.
10 These two techniques can be applied only to nonfinal states, as each final state corresponds to a unique parse tree. If their number were reduced, the number of sentences that can be parsed would decrease as well. So in the extreme case, the number of final states is still unlimited.

11 For example, by choosing the initial weights of the network carefully, we might be able to affect the representations formed, which will influence the distribution of the states.
Acknowledgments
The work described in this article was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (PolyU 4133/97E).

References

Allen, J. (1995). Natural language understanding. Redwood City, CA: Benjamin/Cummings.
Angluin, D. (1980). Inductive inference of formal languages from positive data. Information and Control, 45, 117–135.
Berg, G. (1992). A connectionist parser with recursive sentence structure and lexical disambiguation. In Proceedings of the Tenth Conference on Artificial Intelligence (AAAI-92), San Jose (pp. 32–37).
Chalmers, D. J. (1992). Syntactic transformation of distributed representations. In N. Sharkey (Ed.), Connectionist natural language processing (pp. 46–55). Norwell, MA: Kluwer.
Charniak, E. (1993). Statistical language learning. Cambridge, MA: MIT Press.
Chrisman, L. (1991). Learning recursive distributed representations for holistic computation. Connection Science, 3, 345–366.
Christiansen, M. H., & Chater, N. (1994). Generalization and connectionist language learning. Mind and Language, 9, 273–287.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Fanty, M. (1985). Context-free parsing in connectionist networks (Tech. Rep. No. TR174). Rochester, NY: University of Rochester.
Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3–71.
Gazdar, G., & Mellish, C. (1989). Natural language processing in Prolog: An introduction to computational linguistics. Reading, MA: Addison-Wesley.
Giles, C., Miller, C., Chen, D., Chen, H., Sun, G., & Lee, Y. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4, 393–405.
Hadley, R. F. (1994). Systematicity in connectionist language learning. Mind and Language, 9, 247–272.
Hanson, S. J., & Kegl, J. (1987).
PARSNIP: A connectionist network that learns natural language grammar from exposure to natural language sentences. In Proceedings of the Ninth Annual Cognitive Science Society Conference (pp. 106–119). Ho, K. S. E., & Chan, L. W. (1994). Representing sentence structures in neural networks. In Proceedings of the International Conference in Neural Information Processing Systems (ICONIP’94), Seoul (pp. 1462–1467). Ho, K. S. E., & Chan, L. W. (1997). Confluent preorder parsing of deterministic grammars. Connection Science, 9, 269–293. Ho, K. S. E., & Chan, L. W. (1999). How to design a connectionist holistic parser. Neural Computation, 11, 2085–2106.
Analyzing Holistic Parsers
Hopcroft, J. E., & Ullman, J. D. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley. Jain, A. N. (1991). Parsing complex sentences with structured connectionist networks. Neural Computation, 3, 110–120. Kolen, J. F. (1994). Exploring the computational capabilities of recurrent neural networks. Unpublished doctoral dissertation, Ohio State University. Kwasny, S. C., & Faisal, K. A. (1992). Symbolic parsing via subsymbolic rules. In J. Dinsmore (Ed.), The symbolic and connectionist paradigm: Closing the gap (pp. 209–236). Hillsdale, NJ: Erlbaum. Kwasny, S. C., & Kalman, B. L. (1995). Tail-recursive distributed representations and simple recurrent networks. Connection Science, 7, 61–80. Lawrence, S., Giles, C., & Fong, S. (1998). Natural language grammatical inference with recurrent neural networks. IEEE Transactions on Knowledge and Data Engineering, 12, 126–140. Lin, T., Horne, B. G., Tino, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7, 1329–1338. Miikkulainen, R. (1996). Subsymbolic case-role analysis of sentences with embedded clauses. Cognitive Science, 20, 47–73. Niklasson, L. F., & Van Gelder, T. (1994). On being systematically connectionist. Mind and Language, 9, 288–302. Pereira, F. C. N., & Wright, R. N. (1997). Finite-state approximation of phrase-structure grammars. In E. Roche & Y. Schabes (Eds.), Finite-state language processing (pp. 149–173). Cambridge, MA: MIT Press. Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46, 77–105. Reilly, R. (1992). Connectionist techniques for on-line parsing. Network, 3, 37–45. Roche, E. (1997). Parsing with finite-state transducers. In E. Roche & Y. Schabes (Eds.), Finite-state language processing (pp. 241–281). Cambridge, MA: MIT Press. Rodriguez, P., Wiles, J., & Elman, J. L. (1999). A recurrent neural network that learns to count.
Connection Science, 11, 5–40. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 318–362). Cambridge, MA: MIT Press. Santos, E. (1989). A massively parallel self-tuning context-free parser. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 537–544). San Mateo, CA: Morgan Kaufmann. Selman, B. A. (1985). Rule-based processing in a connectionist system for natural language understanding (Tech. Rep. No. CSRI-168). Toronto: Computer Systems Research Institute, University of Toronto. Servan-Schreiber, D., Cleeremans, A., & McClelland, J. L. (1991). Graded state machines: The representation of temporal contingencies in simple recurrent networks. Machine Learning, 7, 161–193. Sharkey, N. E., & Sharkey, A. J. C. (1992). A modular design for connectionist parsing. In A. N. Marc & F. J. Drossaers (Eds.), Proceedings of the Twente Work-
shop on Language Technology 3: Connectionism and Natural Language Processing (pp. 87–96). Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46, 159–216. St. John, M. F., & McClelland, J. L. (1990). Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46, 217–257. Sun, G. Z., Giles, C. L., Chen, H. H., & Lee, Y. C. (1993). The neural network pushdown automata: Model, stack and learning simulations (Tech. Rep. No. UMIACS-TR-93-77). College Park: University of Maryland. Tsung, F.-S., & Cottrell, G. W. (1993). Phase-space learning for recurrent networks (Tech. Rep. No. CS93-285). San Diego: Department of Computer Science and Engineering, University of California, San Diego. Watrous, R., & Kuhn, G. (1992). Induction of finite state languages using second-order recurrent networks. In J. Moody, S. Hanson, & R. Lippman (Eds.), Advances in neural information processing systems, 4 (pp. 309–316). San Mateo, CA: Morgan Kaufmann. Received March 25, 1999; accepted June 15, 2000.
LETTER
Communicated by Gary Cottrell
The Computational Exploration of Visual Word Recognition in a Split Model Richard Shillcock Department of Psychology and Institute for Adaptive and Neural Computation, Division of Informatics, University of Edinburgh, Scotland, U.K. Padraic Monaghan Institute for Adaptive and Neural Computation, Division of Informatics, University of Edinburgh, Scotland, U.K. We argue that the projection of the visual field to the cortex constrains and informs the modeling of visual word recognition. On the basis of anatomical and psychological evidence, we claim that the higher-level cognition involved in word recognition does not completely transcend initial foveal splitting. We present a schematic connectionist model of word recognition that instantiates the precise splitting of the visual field and the contralateral projection of the two hemifields. We explore the special nature of the exterior (i.e., first and last) letters of words in reading. The model produces the correct behavior spontaneously and robustly. We analyze this behavior of the model with respect to words and random patterns and conclude that the systematic division of the visual input has predictable, general informational consequences and is chiefly responsible for the exterior letters effect. 1 Introduction
The human fovea is precisely split about a vertical midline: the left and right hemifields are projected contralaterally to the right and left hemispheres of the brain, respectively (Fendrich & Gazzaniga, 1989; Fendrich, Wessinger, & Gazzaniga, 1996; Sperry, 1968). This foveal splitting is sufficiently exact to mean that when a printed word is fixated, under typical reading distances, the two parts of the word are initially projected to different hemispheres (Sugishita, Hamilton, Sakuma, & Hemmi, 1994). Foveal splitting has been recognized in cognitive neuropsychological research, particularly research involving split-brain subjects, but it has received little attention in psycholinguistic approaches to visual word recognition (Brysbaert, 1994; Shillcock & Monaghan, 1998; Shillcock, Ellison, & Monaghan, in press; Shillcock, Monaghan, & Ellison, 1999). We describe a series of studies using neural network architectures to explore an abstract characterization of one aspect of visual word recognition: that of coordinating the information in the two

Neural Computation 13, 1171–1198 (2001) © 2001 Massachusetts Institute of Technology
Figure 1: Split model architecture. Each column of eight units in the input and output layers represents one letter. Gray units denote nonactivated nodes; white units denote activated nodes. Words were presented in every position across the input units.
hemifields, containing, respectively, the two parts of a fixated word. We will show that one particular aspect of visual word recognition, the special status accorded to the exterior (i.e., first and last) letters of a word, may be explained by the split nature of the processor. More generally, we will show that such modeling can yield theoretical insights into the nature of hemispheric interaction. Figure 1 illustrates a neural network model of word recognition that instantiates foveal splitting in the simplest, most extreme form. The model carries out an identity mapping from an orthographic representation presented across the input nodes to the same representation across the output nodes. The division between the two halves of the input is precise, with no sharing of information between the two hemifields. In reality, in the human visual system, it is likely that some degree of ipsilaterally projected information is available. For instance, low spatial-frequency information about
Figure 2: Example of the five different versions of the stimulus for any one word.
a fixated word may be accessible from subcortical routes, which are not dependent on callosal transfer; thus, information about the length of the fixated word may be available to both hemispheres. Elsewhere we argue in more detail that this initial splitting of the input conditions the higher cognitive processing involved in visual word recognition (Shillcock et al., in press; Shillcock & Monaghan, 1998; Shillcock et al., 1999). Our strategy has been first to investigate the robust behavior of the simplest model of foveal splitting. The critical feature of the model is that it is required to coordinate the information in the two hemifields regardless of where the fixation point falls. Thus, as shown in Figure 2, for a four-letter word there are five possible fixation points, ranging from just before the first letter to just after the last letter and ignoring the possibility of a fixation directly on a letter.

2 The First and Last Letters of Words
In psycholinguistic experiments, the first and last letters of a visually presented word have been shown to be more salient than the rest of the letters and to receive priority in processing. Forster (1976) argues that these exterior letters constitute an access code that activates a subset of the lexicon.1 Rumelhart and McClelland (1981) suggest that “subjects use some sort of ‘outside in’ processing strategy that leads to variations in performance across serial position” (p. 76). Jordan (1990) concludes that “psychological representations for exterior letter combinations from words do appear to exist, and . . . can be activated even though no other letters are perceived” (p. 903). The priority of exterior letters is evident from a number of experimental paradigms: Identity priming. Recognition of the whole target word (e.g., trap) is facilitated by the prior presentation of its exterior letters (t p) but not its interior letters ( ra ) (Forster & Gartlan, 1975; McCusker, Gough, & Bias, 1981). 1 Throughout this article we use Jordan’s (1990) terminology of exterior letters to refer to the first and last letter of a visually presented word and interior letters to refer to all the letters that lie between the first and last position.
Probed report. Exterior letters of pattern postmasked strings are reported more accurately than interior letters (Butler & Merikle, 1973; Merikle, 1974; Merikle & Coltheart, 1972; Merikle, Coltheart, & Lowe, 1971). Importantly, the two exterior letters produce closely comparable levels of report. Rumelhart and McClelland (1981) present data from the Reicher-Wheeler task, showing serial position effects in the form of a bow-shaped curve (experiment 7, figure 18, p. 77); it is these data that prompt Rumelhart and McClelland to make the “outside in” processing claim quoted above. Identification. Legal pairs of exterior letters such as (d k) (from dark, disk) are reported more accurately than single letters (d) or illegal pairs of exterior letters (d x) or (z k) (Jordan, 1990, 1995). This “pair-letter effect” was observed only when the pattern postmask matched the horizontal boundaries of the legal exterior letter pair. Reports of whole-word displays were always superior to reports of letter pairs. Off-line identification. Exterior letters tend to be recognized first in noisy conditions, as when successive presentations of blurred content words become clearer and clearer (see, e.g., Shillcock, Kelly, Buntin, & Patterson, 1997). In summary, the exterior letters of visually presented words are afforded priority in processing in a variety of recognition tasks. One potential explanation is that although the exterior letters are in fact farthest away from the central foveal fixation point, the relevant visual information is actually clearer than that in the middle because exterior letters are bounded on one side by white space and are thus not susceptible to the same level of visual ambiguity and lateral interference as interior letters (Eriksen & Rohrbaugh, 1970; Estes, Allmeyer, & Reder, 1976). This explanation is perhaps most compelling for the off-line recognition of degraded lexical stimuli.
Jordan (1990) argues against this lateral interference account as the sole explanation of all the data, on the grounds that not all types of string elicit the effect: pattern-postmasked strings of letter-like nonsense characters elicit the reverse pattern of results, with exterior characters being perceived less well than interior characters (Hammond & Green, 1982; Mason, 1982; Mason & Katz, 1976). These particular data are seen as resulting from a word recognition device specifically attuned to exterior letters, which is partially set in motion by the nonsense characters and cannot be completely inhibited. These data involving letter-like nonsense characters are simultaneously evidence for the special status of exterior letters and the claim that this status does not derive solely from more peripheral, psychophysical considerations. The credit for the original demonstration that the effect can emerge from an implemented computational model goes to Rumelhart and McClelland, who suggest a number of mechanisms that might account for the serial
position effects they observe in the human data; they implement two of them to simulate the data. First, they differentially weight the inputs to each of the four letter positions in their interactive-activation model (IAM). Second, they assume that the readout occurs at different times for the different positions: the two exterior letters are read out first, followed by the second and then the third. Rumelhart and McClelland also mention possible perceptibility differences of particular letters in particular positions, statistical properties of the words, and variations in locus of fixation and attention, thus summarizing the most relevant parameters in a monolithic (nonsplit) model of word recognition. The statistical properties of the English lexicon will be relevant to our own explorations below. The beginnings of English words are typically the most informative parts; redundancy rises across serial position (see, for instance, Yannakoudakis & Hutton, 1992, for an analysis of the phonological statistics of English words). In monosyllabic English words, the dominant consonant-vowel-consonant (CVC) structure allows greater variety in onsets and codas than in the middles of words. Thus, giving processing priority to the exterior letters of English words is adaptive. The exterior letters effect (ELE) is outside the range of data that has been captured by models of visual word recognition in the connectionist tradition other than the IAM.
In Seidenberg and McClelland’s (1989) developmental model, the Wickelgraph input representation leads to the reverse prediction by underrepresenting the exterior letters compared with the interior letters, in its triples of consecutive letters: *da, dar, ark, rk* (this criticism also applies to Mozer’s, 1987, BLIRNET model).2 In some of the subsequent models of visual word processing (chiefly concerned with pronunciation) in this tradition (e.g., Plaut & McClelland, 1993), the orthographic input is explicitly structured in terms of onset, nucleus, and coda, and there is no indication that any differential processing can emerge for exterior letters; indeed, Plaut and McClelland remark on the typically autonomous and componential processing occurring in onset, nucleus, and coda, the three positions interacting in the processing of only words with irregular pronunciations, like pint. In summary, existing models of visual word recognition do not spontaneously allow the ELE to emerge as a principled result of their basic architecture. Below, we develop an account of the effect, based on the claim that foveal splitting fundamentally conditions word recognition. We present simulations using a lexicon of the 60 most frequent English words to demonstrate that the ELE occurs with psychologically realistic stimuli, and we present simulations using more controlled lexica to establish that the effect is genuinely due to the split nature of the processor. 2
The computational measure of coding ends of words in two ways, as * and **, allows interior and exterior letters to participate in an equal number of representations: **d, *da, dar, ark, rk*, k**. However, this amendment still does not make the correct prediction of better representation of exterior letters.
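The counting argument behind this criticism can be made concrete. Below is a minimal sketch; the helper name and the "*" boundary symbol are illustrative assumptions, not Seidenberg and McClelland's implementation:

```python
# Minimal sketch of the Wickelgraph counting argument (illustrative only).
def wickelgraphs(word, boundary="*"):
    """Return all triples of consecutive symbols, with single boundary markers."""
    s = boundary + word + boundary
    return [s[i:i + 3] for i in range(len(s) - 2)]

triples = wickelgraphs("dark")
assert triples == ["*da", "dar", "ark", "rk*"]
# Each exterior letter appears in only two triples; each interior letter in three.
assert sum("d" in t for t in triples) == 2   # exterior
assert sum("a" in t for t in triples) == 3   # interior
```

With double boundary markers, the exterior letters gain one triple each and the counts equalize, which is the amendment discussed in the footnote.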
Figure 3: Nonsplit model. The dashed line indicates incomplete connectivity between the input and the hidden layers. See the text for details.
We compare the behavior of the split model with an otherwise comparable nonsplit model, shown in Figure 3. In the simulations with a nonsplit architecture, we used the same training and testing regime as with the split architecture. In the nonsplit model, the input layer consisted of eight letter positions, effectively pooling the separate input layers of the split model. The hidden layer was also a pooled-resource version of the split model, and contained 40 units. The hidden layer and the output layer were fully connected. However, in order to make the processing power of the two networks comparable, each unit in the input layer in the nonsplit model was connected randomly to only half the hidden units. This measure allows there to be the same number of weighted connections between input and hidden layers in the split and nonsplit models. The processing power of neural networks is a function of the number of units and the number of weighted connections. If the two models are alike in these respects, then their performances may be compared.
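The weight-count equivalence between the two architectures can be sketched as follows. The sizes come from the text (eight-bit features, four letter slots per hemifield, 20 hidden units per half versus 40 pooled), but the construction itself is a simplified assumption, not the original PDP++ implementation:

```python
import numpy as np

# Sizes from the text; the construction is a simplified sketch, not PDP++.
FEATURES, SLOTS = 8, 4                 # 8-bit feature code, 4 letters per hemifield
IN_HALF = FEATURES * SLOTS             # 32 input units per hemifield
rng = np.random.default_rng(0)

# Split model: each hemifield projects only to its own 20 hidden units.
w_left = rng.normal(0.0, 0.1, (IN_HALF, 20))
w_right = rng.normal(0.0, 0.1, (IN_HALF, 20))

# Nonsplit model: 64 pooled inputs, 40 pooled hidden units, but each input
# unit is randomly connected to only half of the hidden units.
mask = np.zeros((2 * IN_HALF, 40), dtype=bool)
for i in range(2 * IN_HALF):
    mask[i, rng.choice(40, size=20, replace=False)] = True
w_nonsplit = rng.normal(0.0, 0.1, (2 * IN_HALF, 40)) * mask

# Both models end up with the same number of input-to-hidden weights.
assert w_left.size + w_right.size == int(mask.sum()) == 1280
```

The random half-connectivity is what equates the two networks: 2 × 32 × 20 weights in the split model against 64 × 20 surviving weights in the masked nonsplit model.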
3 Simulating the Exterior Letters Effect
Figure 1 shows the basic model that we explored. The input layer is composed of two sets of four input units (four letter slots) on either side of a midline, or fixation point. There is complete connectivity between these sets of input units and their respective groups of 20 hidden units. The number of hidden units was determined by pilot studies and represents the minimum number capable of reliably solving this mapping problem. There is complete connectivity between the two sets of hidden units and the set of four output units. Our network encodes letters in terms of the eight-bit visual feature system employed by Plaut and Shallice (1994). The features correspond to attributes such as “contains a vertical stroke” or “contains a closed part.” Each input layer and the output layer contain four letter positions. The model was required to recreate and integrate its orthographic inputs at the output units, for all five possible inputs for each word (see Figure 2). The network was trained using the backpropagation learning algorithm (Rumelhart, Hinton, & Williams, 1986) and was implemented in PDP++ (O’Reilly, Dawson, & McClelland, 1995). The principal psychological reality that this model is intended to capture is that of the splitting of the visual field. The set of output units is not necessarily intended to be located in any one hemisphere. We use the model to address the question of what happens when information divided between two processing domains is required to be coordinated. What processing strategies spontaneously emerge from a simple instantiation of the problem in a split architecture?

3.1 Simulating the ELE with a Real-Word Lexicon. The exterior letters of four-letter words, such as d**k as in disk, are recognized better than the interior letters of words in studies with human subjects (Forster & Gartlan, 1975; McCusker et al., 1981).
In this simulation we aim to show that this effect naturally emerges from a split model trained on a small but realistic set of English words. The 60 words used in the training set, and listed in the appendix, were all the four-letter words with a frequency greater than 1 in 10,000 from the CELEX lexical database (Baayen, Piepenbrock, & Gulikers, 1995). Each word was presented to the model an equal number of times in each of the five possible input locations. For each presentation, the model was required to recreate the word at the output. The words were presented in random permuted order, with particular presentation positions also randomized. After approximately 100 epochs of training, the network had learned the task well, with mean squared error (MSE) for each word being less than 0.5. The nonsplit network was similarly trained on the same lexicon, with the words presented in all five possible positions in the input and was trained to the same level of MSE as the split model. Ten simulations were carried out, and all produced very similar results.
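The presentation regime can be sketched as follows. The blank-padding convention and helper names are assumptions for illustration; the original model was trained in PDP++:

```python
import random

# Sketch of the presentation regime: every word appears in each of its five
# input positions, in randomly permuted order, until each word's MSE falls
# below the criterion. Helper names and padding are illustrative assumptions.
def placements(word, slots=8):
    """All positions of a word across the eight input slots."""
    return [" " * i + word + " " * (slots - len(word) - i)
            for i in range(slots - len(word) + 1)]

def train(step, words, max_epochs=100, criterion=0.5):
    """step(stimulus, word) does one update and returns that trial's squared error."""
    for epoch in range(1, max_epochs + 1):
        trials = [(s, w) for w in words for s in placements(w)]
        random.shuffle(trials)                 # random permuted order
        mse = {w: 0.0 for w in words}
        for stimulus, w in trials:
            mse[w] += step(stimulus, w) / len(placements(w))
        if max(mse.values()) < criterion:      # every word below criterion
            return epoch
    return max_epochs

assert len(placements("mean")) == 5   # five fixation points for a four-letter word
```

A four-letter word in an eight-slot window yields exactly the five placements of Figure 2, from flush left to flush right.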
We tested the hypothesis that the ELE would emerge within these simple constraints. The trained models were presented with stimuli consisting of either the exterior or the interior letters of each of the words in the training set. These letters were positioned so that the notional words of which they were a part occupied all possible positions across the input units (below, we refer to the individual input nodes by number, counting from left to right from 1 to 8). In the exterior letters condition, the exterior letters of each word in the training set were presented intact, and each of the interior letters was replaced by an ambiguous pattern in which all the units were activated to half their maximum value.3 We used these ambiguous patterns to represent stimulus conditions in the original experiments by Jordan and others in which information was provided about word length but with no indication about what letters were present in those particular positions.4 In the interior letters condition, the interior letters were presented intact, and the exterior letters were replaced by ambiguous patterns. Figure 4 shows the MSE at output letter positions 3 and 6 (the exterior letters, given a central fixation of the word) for the split and nonsplit networks respectively, for the exterior letters condition. Figure 5 shows the MSE for the output letter positions 4 and 5 (the interior letters, given a central fixation of the word) for the split and nonsplit networks, respectively, for the interior letters condition. Taken together the two graphs show a striking interaction between model type (split versus nonsplit) and letters presentation condition (exterior versus interior): the difference between the MSE for exterior letters and interior letters is greater in the split model than in the nonsplit model. The graphs also show a somewhat smaller MSE for the exterior letters compared with the interior letters in the nonsplit model alone.
An analysis of variance was carried out treating the individual trained models as subjects in an experiment; the MSE was summed for the two interior letters and the two exterior letters in each model. The analysis of variance confirmed a significant interaction between model type and letters presentation condition (F(1,18) = 31.23, p < .001; F(1,59) = 149.47, p < .001). Individual t-tests were performed on the MSE summed across the two exterior letters and across the two interior letters for each model. For the split model, there was a significant difference between MSE for the exterior letters and MSE for the interior letters (t(9) = −9.89, p < .001, two-tailed). For the exterior letters alone, there was a significant difference between the split model and the nonsplit model (t(9) = −6.35, p < .001, two-tailed). For the interior letters alone, there was no significant difference between the split model and the nonsplit model (t(9) = .39, n.s.). These results all represent an ELE in which the split nature of one of the models plays a crucial part. However, there is 3 This pattern of activation is exactly equidistant from each letter in the eight-dimensional input space. 4 In the experimental paradigm used by Jordan (1990, 1995), information about word length was conveyed by the stimulus mask.
Figure 4: Mean squared error (MSE) at individual letter positions 3 and 6 for the 60 words of English when the exterior letters of the words were presented as input to the split and nonsplit model, respectively. Error bars in all graphs represent standard error of the mean.
also a (smaller) significant difference between the exterior and interior letters in the nonsplit model alone (t(9) = −2.80, p = .021, two-tailed). This last result suggests an ELE with its origin in the basic task—the shift-invariant identity mapping—that the models are both required to perform. These results refer only to the central fixation, across input nodes 3, 4, 5, and 6. Separate analyses were carried out for each of the five possible input positions for a word. The results are presented in Tables 1, 2, and 3; the middle rows of the tables contain the results discussed above, and illustrated in Figures 4 and 5, for the central fixation. Table 1 shows the MSE data for the exterior letters condition and Table 2 for the interior letters condition. Table 3 summarizes the results of analyses of variance summing MSE for the two interior and the two exterior letters in each model, with subjects (F(1,18)) and items (F(1,59)), respectively, as the random variable; only the outcome of the (model × presentation condition) interaction is shown, in the second column. The four right-most columns of Table 3 show the results of the t-tests (df = 9) between summed MSE for exterior and for interior letters and split and nonsplit models. The pattern of results described above for the central fixation is broadly followed in
Figure 5: MSE at individual letter positions 4 and 5 for the 60 words of English when the interior letters of the words were presented as input to the split and nonsplit models, respectively.

Table 1: MSE for Presentation of the Exterior Letters to the Split and Nonsplit Models Trained on the 60 Most Frequent English Four-Letter Words.

Input Nodes  Model     Pos. 1  Pos. 2  Pos. 3  Pos. 4  Pos. 5  Pos. 6  Pos. 7  Pos. 8
1234         Split     0.68    -       -       0.93    -       -       -       -
1234         Nonsplit  0.85    -       -       0.89    -       -       -       -
2345         Split     -       0.50    -       -       0.19    -       -       -
2345         Nonsplit  -       1.04    -       -       0.95    -       -       -
3456         Split     -       -       0.45    -       -       0.37    -       -
3456         Nonsplit  -       -       1.03    -       -       1.04    -       -
4567         Split     -       -       -       0.20    -       -       0.49    -
4567         Nonsplit  -       -       -       1.07    -       -       1.01    -
5678         Split     -       -       -       -       0.97    -       -       0.65
5678         Nonsplit  -       -       -       -       0.85    -       -       0.78

Note: Rows correspond to the five input positions (the nodes occupied by the word); columns give the MSE at each output letter position.
the two other more central presentations, in which the word is “fixated” between two of its constituent letters and input falls into each half of the model; the interaction was significant in each. In single hemifield presen-
Table 2: MSE for Presentation of the Interior Letters to the Split and Nonsplit Models Trained on the 60 Most Frequent English Four-Letter Words.

Input Nodes  Model     Pos. 1  Pos. 2  Pos. 3  Pos. 4  Pos. 5  Pos. 6  Pos. 7  Pos. 8
1234         Split     -       0.81    1.21    -       -       -       -       -
1234         Nonsplit  -       0.80    1.13    -       -       -       -       -
2345         Split     -       -       1.23    1.64    -       -       -       -
2345         Nonsplit  -       -       0.92    1.27    -       -       -       -
3456         Split     -       -       -       1.19    1.40    -       -       -
3456         Nonsplit  -       -       -       1.15    1.36    -       -       -
4567         Split     -       -       -       -       1.19    1.42    -       -
4567         Nonsplit  -       -       -       -       1.13    1.37    -       -
5678         Split     -       -       -       -       -       0.93    1.36    -
5678         Nonsplit  -       -       -       -       -       0.92    1.22    -

Note: Rows correspond to the five input positions (the nodes occupied by the word); columns give the MSE at each output letter position.
Table 3: Results of the Analyses of Variance and t-Tests (Two-Tailed) for the Split and Nonsplit Networks Presented with the 60 Most Frequent Words of English.

Input   Interaction Between Model and Presentation
Nodes   Condition (by Subjects, by Items)
1234    n.s., ***
2345    ***, ***
3456    ***, ***
4567    ***, ***
5678    n.s., ***

For the central presentation (input nodes 3456), the four right-most columns read: t-test for exterior letters, comparing split and nonsplit models: ***; t-test for split model, comparing exterior and interior letters: ***; t-test for interior letters, comparing split and nonsplit models: n.s.; t-test for nonsplit model, comparing exterior and interior letters: *. [The t-test entries for the remaining rows are not recoverable from this reproduction.]

Note: * p < .05. ** p < .01. *** p < .001.
tations, in which all of the input falls in half of the model, the interaction becomes nonsignificant in the analyses by subjects. In some of the individual models trained from different random starting weights, the interactions for the single hemifield presentations achieved significance in the analyses by subjects, but the interaction was always weaker than those for the split (more central) presentations. Table 3 shows the precise role of the exterior letters in the split model in causing the interaction. The ELE seems to receive contributions from at least two sources: the split architecture of one of the models and the nature of the task itself. Each model is required to perform the same task, so the significant interaction between model type and presentation condition must reflect the architectural difference between the models—the fact that one of them is split. The data in the right-most
Figure 6: PDP++’s representation of the split network’s performance on the centrally presented word mean when only the exterior letters are presented. Each node’s activation level is indicated by the size of the light area; when the node is highly activated, it is wholly white, and when it is inactive, it is wholly gray.
column of Table 3 are evidence for a smaller potential contribution to the ELE from the task itself. Figure 6 shows the PDP++ representation of the network when it is presented with only the exterior letters: m**n of the centrally presented word mean. The exterior letters are reproduced with very little error in the output. Figure 7 shows the network when it is presented with only the corresponding interior letters: *ea*. In this condition, the features of the interior letters are not reproduced accurately. For the letter e, one of the three features is not activated very strongly, and two features are erroneously activated. For the letter a, three of the four features are not activated strongly, and two features are activated erroneously. It is also informative to compare the activation of letters that are not presented in the input in Figures 6 and 7. When only the exterior letters are presented, the network correctly produces five of the seven features present in the interior letters and incorrectly activates four features. When only the interior letters are presented the network correctly produces four of the seven features of the exterior letters but inappropriately activates seven other features. This detailed study of one stimulus
Figure 7: The network’s performance on the word mean when only the interior letters are presented. Performance is worse than when only exterior letters are presented. See the text for details.
word, mean, presented in the two conditions provides a qualitative picture of the ELE as it appears in the whole-word priming task, which we describe in more detail below. McCusker et al. (1981) show, in a series of four experiments, that the exterior letters of four-letter words are better than the corresponding interior letters at priming the subsequently presented whole word. We simulated this effect using the 10 different versions of each of the split and nonsplit models trained on the 60-word lexicon, as described above. The exterior letters or the interior letters of each word were presented, and the total MSE associated with the relevant whole word at the output was recorded. This measure indicates how close the network comes to reproducing the whole word from just the presented letters. We tested the hypothesis that the exterior letters would lead to a significantly smaller MSE for the whole word, compared with the MSE generated by the interior letters. Two-way ANOVAs with model type (Split versus Nonsplit) and letters presentation condition (Exterior versus Interior) and MSE over the whole word as dependent variable were performed for each of the 10 split and nonsplit simulations. The effects were qualitatively identical to when error only for presented letters was analyzed, as above: presenting exterior
1184
Richard Shillcock and Padraic Monaghan
letters to the split model primes the whole word better than presenting interior letters, whereas for the nonsplit model, this difference in priming is significantly smaller. For the centrally presented words (input nodes 3, 4, 5, and 6), for example, MSE with the split model was 5.70 when the exterior letters were presented and 8.22 when the interior letters were presented. For the nonsplit model, MSE was 7.27 following presentation of the exterior letters and 7.60 following presentation of the interior letters. For the single-hemifield presentations (input nodes 1, 2, 3, 4 and 5, 6, 7, 8) the interactions between model type (Split versus Nonsplit) and presentation condition (Exterior versus Interior) were not significant (F(1, 18) = 0.16, p = 0.69; and F(1, 18) = 1.81, p = 0.20, respectively). However, for the other presentation positions (input nodes 2, 3, 4, 5; 3, 4, 5, 6; and 4, 5, 6, 7), there were significant interactions (F(1, 18) = 47.91, p < 0.001; F(1, 18) = 45.08, p < 0.001; F(1, 18) = 78.43, p < 0.001, respectively). For the presentations that cross the midline of the split network, the ELE obtains in the simulation of the whole-word priming task. We report this simulation as a direct demonstration of the modeling of psychological data related to the ELE. We hypothesize that the superior whole-word priming observed for the exterior letters is due to the split architecture of the processor, but support for this hypothesis must await the control studies with completely artificial lexica, which we report below. However, the simulations reported above show that the ELE is a robust effect with a real-word lexicon. In summary, construing part of visual word recognition as a shift-invariant identity mapping and training with the 60 most frequent four-letter words of English produce the ELE, and produce it significantly more robustly in the split model.
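The whole-word priming measure can be made concrete with a short sketch. Here we assume a 4 × 8 feature pattern per word as in Figure 1; the helper names, and the simplification of scoring against the word's own four letter slots rather than the full eight-slot input field, are ours:

```python
import numpy as np

N_LETTERS, N_FEATURES = 4, 8

def cue_pattern(word_pattern, keep):
    """Zero out all letter slots except those in `keep` (0-based indices),
    mimicking presentation of only the exterior letters (keep=(0, 3)) or
    only the interior letters (keep=(1, 2))."""
    cue = np.zeros_like(word_pattern)
    for i in keep:
        cue[i] = word_pattern[i]
    return cue

def whole_word_mse(output, target):
    """Total squared error between an output feature pattern and the target
    whole-word pattern; a smaller value means the presented letters were a
    better prime for the whole word."""
    return float(np.sum((np.asarray(output) - np.asarray(target)) ** 2))
```

In the simulations this score is computed from the trained network's actual output; the sketch only fixes the measure itself.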
Presenting just the exterior letters leads to good representations of those letters and hence to effective primes for words, whereas presenting just the interior letters produces less secure representations and hence less effective priming of the appropriate word. One factor potentially contributing toward such behavior is the particular statistics of the training set. As an inspection of the appendix reveals, interior letters are more ambiguous in the words to which they refer: there are 9 ambiguous exterior letter-pairs (e.g., h¤¤e matches both have and here), including two three-way ambiguities (e.g., life, like, late), whereas there are 15 ambiguous interior letter-pairs, including 4 three-way ambiguities (e.g., keep, seem, feed) and one four-way ambiguity (them, then, when, they). Exterior letters cue a smaller number of words than do interior letters, and it is therefore useful to prioritize their representation compared with the interior letters. The statistics of the training set broadly reflect the fact that in English CVC words, more variety is possible in the consonants of the onset and coda than in the vowels of the nucleus. A second potentially contributing explanation is that the observed ELE emerges from the split nature of one of the processors. A third potential explanation is that it is due to the nature of the shift-invariant identity mapping that the task requires. A fourth potential account of the ELE in
the split model involves the definition of letters in terms of their immediate contexts: the orthotactic constraints of English monosyllabic words dictate, for instance, that f may be followed by t, but that the reverse is not allowed, and that only certain vowels in certain orders may appear adjacently. We may expect the nonsplit model as well as the split model to encode letter identity at least in part in terms of immediate context, so that an unspecified letter might be partially restored. In the split model, this means of encoding letter identity might be expected to militate against interior letters compared with exterior letters in that the contexts of the former are more adversely affected by the split and the shift-invariant identity mapping. In the simulations described below, we address the four potential explanations given above by using lexica composed of items in which the degree of structure mediating between features and words in the original real-words lexicon is progressively decreased. First, below, we randomize the order of letters within the words, thereby removing orthotactic structure from the lexicon. Further, we remove the letter level completely in a lexicon in which each word is represented by a unique random array of features. We hypothesize that the split nature of the processor will produce the most robust contribution to the ELE. 3.2 Simulating the ELE in a Random-Letter Lexicon. The random-letter lexicon contained 60 items. Items were generated by randomly assigning a letter to each position within the item. The randomization was constrained so that each letter occurred at least twice in each letter position across the lexicon. This random-letter lexicon allows us to assess the extent to which the ELE observed with real words was due to the orthotactic constraints present in real words of English.
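One way to generate a lexicon satisfying this constraint is to build it column by column. The paper does not give its exact permutation scheme, so the pool size and sampling strategy here are our choices; the sketch only guarantees the stated property that every letter used in a position occurs there at least twice:

```python
import random
import string

def random_letter_lexicon(n_words=60, word_len=4, n_letters=20, seed=1):
    """Build a lexicon of random letter strings, one letter position at a
    time, so that every letter used in a given position occurs at least
    twice in that position across the lexicon."""
    rng = random.Random(seed)
    columns = []
    for _ in range(word_len):
        pool = rng.sample(string.ascii_lowercase, n_letters)
        column = pool * 2                               # each letter at least twice
        column += rng.choices(pool, k=n_words - len(column))
        rng.shuffle(column)
        columns.append(column)
    return ["".join(col[i] for col in columns) for i in range(n_words)]
```

With `n_letters=20` each 60-item column holds two guaranteed copies of every pool letter plus 20 further draws from the same pool.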
The split and nonsplit models were trained and tested, as with the real-word lexicon above, by presenting first the exterior letters and then the interior letters and recording the respective MSE. Tables 4 and 5 present the data for all presentation positions, as in the analysis of the real-words lexicon above. Table 6 summarizes the model × presentation condition interaction from analyses of variance conducted as in the real-words simulation above. The interaction between model type and presentation condition was again the most striking aspect of the data. The ELE emerged robustly, as with the real-words lexicon. However, in the nonsplit model, the MSE for the exterior letters was also noticeably smaller than that for the interior letters. As Tables 4, 5, and 6 show, with the random-letter words, the interaction between model type and presentation condition reaches significance in both of the single-hemifield presentations. The ELE obtains even when the word is being presented to a single hemifield. Table 6 shows that in the simulations with the random-letter words, the overall pattern of results found for the real English words obtains more robustly. In the current simulation the model × presentation condition interaction is significant in the analyses by subjects even in the single-hemifield
Table 4: MSE for Presentation of the Exterior Letters to the Split and Nonsplit Models Trained on the Random-Letter Words.

Input Nodes   Model      Mean Squared Error in Letter Position
                         1      2      3      4      5      6      7      8
1234          Split      .84    -      -      1.14   -      -      -      -
              Nonsplit   1.05   -      -      1.19   -      -      -      -
2345          Split      -      .69    -      -      .27    -      -      -
              Nonsplit   -      1.34   -      -      1.33   -      -      -
3456          Split      -      -      .61    -      -      .49    -      -
              Nonsplit   -      -      1.51   -      -      1.39   -      -
4567          Split      -      -      -      .28    -      -      .62    -
              Nonsplit   -      -      -      1.54   -      -      1.46   -
5678          Split      -      -      -      -      1.18   -      -      .92
              Nonsplit   -      -      -      -      1.37   -      -      1.23
Table 5: MSE for Presentation of the Interior Letters to the Split and Nonsplit Models Trained on the Random-Letter Words.

Input Nodes   Model      Mean Squared Error in Letter Position
                         1      2      3      4      5      6      7      8
1234          Split      -      1.64   1.55   -      -      -      -      -
              Nonsplit   -      1.36   1.46   -      -      -      -      -
2345          Split      -      -      1.69   1.71   -      -      -      -
              Nonsplit   -      -      1.65   1.57   -      -      -      -
3456          Split      -      -      -      1.93   1.73   -      -      -
              Nonsplit   -      -      -      2.04   1.72   -      -      -
4567          Split      -      -      -      -      2.05   1.80   -      -
              Nonsplit   -      -      -      -      1.73   1.64   -      -
5678          Split      -      -      -      -      -      1.56   1.38   -
              Nonsplit   -      -      -      -      -      1.48   1.34   -
presentations, and four of five of the MSE differences between the exterior letters in the split and nonsplit model are highly significant, the t-values being greater than the corresponding values in Table 3 for the real-words simulation. Similarly, in the right-most column of Table 6, more of the t-tests of the difference between MSE for the interior and exterior letters in the nonsplit model are significant, compared with the same comparisons in Table 3. In summary, the simulations with the random-letter lexicon rule out two of the four potential accounts we suggested for the ELE observed in our simulations with the real-words lexicon. The orthotactic constraints present in the 60-word English lexicon could not have been wholly responsible for the effect. Similarly, the split architecture's greater disruption of the letter context of the interior letters compared with the exterior letters could not have
Table 6: Results of the Analyses of Variance and t-Tests (Two-Tailed) for the Split and Nonsplit Networks Presented with the 60 Random-Letter Words.

Input Nodes   Interaction Between Model      t-Test for Exterior     t-Test for Split Model,   t-Test for Interior     t-Test for Nonsplit Model,
              and Presentation Condition,    Letters, Comparing      Comparing Exterior        Letters, Comparing      Comparing Exterior
              by Subjects and by Items       Split and Nonsplit      and Interior Letters      Split and Nonsplit      and Interior Letters
                                             Models                                            Models
1234          **, ***                        n.s.                    ***                       **                      **
2345          ***, ***                       ***                     ***                       n.s.                    n.s.
3456          ***, ***                       ***                     ***                       ***                     ***
4567          ***, ***                       ***                     ***                       n.s.                    ***
5678          **, ***                        **                      **                        n.s.                    *

Note: * p < .05. ** p < .01. *** p < .001.
played a major role. The ELE was more pronounced in the random-letter lexicon, despite the absence of reliable orthotactic contexts. The ELE in both the real-words simulations and the random-letters simulations is caused by the nature of the processing task; it emerges spontaneously and robustly from the shift-invariant identity mapping and is significantly magnified by the presence of a split in the processor. We repeated the simulations corresponding to whole-word priming with the random-letters lexicon. We tested the prediction that compared with the interior letters, the exterior letters would produce a significantly smaller MSE for the whole word of which they were part. Two-way ANOVAs with model type (Split versus Nonsplit) and letters-presentation condition (Exterior versus Interior) and MSE over the whole word as dependent variable were calculated as above, and the effects were qualitatively similar. For the centrally presented words (input nodes 3, 4, 5, 6), for example, MSE for the split model was 5.90 when the exterior letters were presented and 9.87 when the interior letters were presented. For the nonsplit model, MSE was 8.31 following presentation of the exterior letters and 9.10 following presentation of the interior letters. For the single-hemifield presentations (input nodes 1, 2, 3, 4 and 5, 6, 7, 8), the interaction between model type (Split versus Nonsplit) and presentation condition (Exterior versus Interior) was significant in the right visual field presentation (F(1, 18) = 6.34, p = 0.021) and was marginally significant in the left visual field presentation (F(1, 18) = 3.36, p = 0.083). For the other presentation positions (input nodes 2, 3, 4, 5; 3, 4, 5, 6; and 4, 5, 6, 7) there were significant interactions (F(1, 18) = 443.66, p < 0.001; F(1, 18) = 94.53, p < 0.001; F(1, 18) = 195.97, p < 0.001, respectively). With lexical entries containing no orthotactic structure, the ELE obtains more strongly than in the same simulations with the real-word lexicon.
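For a 2 (between: model type) × 2 (within: presentation condition) design of this kind, the interaction F with (1, 18) degrees of freedom reduces to the squared two-sample t computed on each network version's exterior-minus-interior difference score. A minimal sketch of that equivalence, not the authors' analysis code (our function name, numpy only):

```python
import numpy as np

def interaction_F(split_ext, split_int, nonsplit_ext, nonsplit_int):
    """F-test for the model x presentation-condition interaction in a
    2 (between) x 2 (within) design, one MSE score per network version per
    condition. The interaction equals a pooled-variance two-sample t-test
    on the (exterior - interior) differences, with F = t^2 on
    (1, n1 + n2 - 2) degrees of freedom: (1, 18) for 10 versions of each model."""
    d1 = np.asarray(split_ext, float) - np.asarray(split_int, float)
    d2 = np.asarray(nonsplit_ext, float) - np.asarray(nonsplit_int, float)
    n1, n2 = len(d1), len(d2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * d1.var(ddof=1) + (n2 - 1) * d2.var(ddof=1)) / df
    t = (d1.mean() - d2.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t ** 2, df
```

A larger F indicates that the exterior-versus-interior advantage differs more sharply between the split and nonsplit models.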
The results show that exterior letters are better primes than interior letters for the words from which they are drawn, even when the words are random strings of letters. These results support our interpretation of the same effect for the real-word lexicon: that the effect was due to the nature of the processor and the task, as opposed to being solely due to the orthotactic structure contained in the real words or the unequal disruption of the immediate contexts of exterior and interior letters. The simulations show that the ELE in whole-word priming is a sufficiently strong effect not to be disrupted by the presence of realistic orthotactic structure in the training data. 3.3 Simulations with Random Feature Patterns as "Words". In the two sets of simulations reported so far, we have used real English words and artificial words composed of randomly scrambled letters, respectively. In both cases, there is a letter level of representation between the lexical and featural levels in the input lexicon, and the hidden units may be encoding generalizations at this letter level. The real-word input allowed intermediate generalizations about letters; for instance, 23 letters occurred in the training set, the letter y may occur only in particular locations, the sequence rd may occur word-finally but not word-initially, and the sequence aa is not permitted. The random-letters lexicon contained a letter level of representation although no orthotactic structure, and generalizations were possible concerning the within-letter contexts in which particular features could occur. We carried out a final set of simulations with an input lexicon with no such letter level. We created a lexicon of 60 random distributions of features, which may be thought of as single-character pictograms: each word is unique, and it does not share any predictable structure with any other words above the level of individual features.
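A generator for such a pictogram lexicon might look as follows. Each letter slot receives at least one active feature, matching the construction described; the `mean_active` value approximating the real-word lexicon's average features per letter is our guess, since the paper states only that the averages matched:

```python
import numpy as np

def pictogram_lexicon(n_words=60, n_letters=4, n_features=8,
                      mean_active=3, seed=2):
    """Generate unique random 4 x 8 'pictogram' word patterns with no
    letter-level structure; every letter slot has >= 1 active feature."""
    rng = np.random.default_rng(seed)
    words, seen = [], set()
    while len(words) < n_words:
        pattern = np.zeros((n_letters, n_features), dtype=int)
        for slot in pattern:                 # rows are views, edited in place
            k = max(1, rng.binomial(n_features, mean_active / n_features))
            slot[rng.choice(n_features, size=k, replace=False)] = 1
        key = pattern.tobytes()
        if key not in seen:                  # every "word" must be unique
            seen.add(key)
            words.append(pattern)
    return words
```

Because patterns share structure only at the feature and whole-pattern levels, no intermediate letter-level generalizations are available to the network.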
Thus, in Figure 1, the 4 × 8 feature map representing word was effectively scrambled, although each "letter" so produced was composed on average of the same number of features used in the real-word lexicon, and each "letter" consisted of at least one activated feature. This randomized input allowed generalizations only at the feature level (i.e., whether a particular node is activated) and at the "word," or pictogram, level (the whole 4 × 8 pattern). This final input lexicon therefore represents a more difficult mapping problem than the two previous lexica. We tested the hypothesis that the ELE would emerge robustly for this input with random patterns as words. Simulations with the two models were carried out as in the two previous sets of simulations. The models learned to the same criterion as for the lexical input after approximately 150 epochs of training. The results for the central presentation are presented in Tables 7, 8, and 9, showing a yet clearer manifestation of the behavior seen in the two sets of simulations above. There is a strong interaction of model type and presentation condition and an ELE in the nonsplit model. Compared with the corresponding results for the real-word lexicon and the random-letters lexicon, the pattern of results in Table 9 shows increased uniformity in the model × presentation condition
Table 7: MSE for Presentation of the Exterior Letters to the Split and Nonsplit Models Trained on the 60 "Pictogram" Words.

Input Nodes   Model      Mean Squared Error in Letter Position
                         1      2      3      4      5      6      7      8
1234          Split      1.13   -      -      1.44   -      -      -      -
              Nonsplit   1.13   -      -      1.52   -      -      -      -
2345          Split      -      .90    -      -      .56    -      -      -
              Nonsplit   -      1.29   -      -      1.52   -      -      -
3456          Split      -      -      .85    -      -      .98    -      -
              Nonsplit   -      -      1.36   -      -      1.43   -      -
4567          Split      -      -      -      .48    -      -      1.01   -
              Nonsplit   -      -      -      1.44   -      -      1.42   -
5678          Split      -      -      -      -      1.40   -      -      1.20
              Nonsplit   -      -      -      -      1.23   -      -      1.11
Table 8: MSE for Presentation of the Interior Letters to the Split and Nonsplit Models Trained on the 60 "Pictogram" Words.

Input Nodes   Model      Mean Squared Error in Letter Position
                         1      2      3      4      5      6      7      8
1234          Split      -      1.67   1.95   -      -      -      -      -
              Nonsplit   -      1.69   1.81   -      -      -      -      -
2345          Split      -      -      1.96   2.17   -      -      -      -
              Nonsplit   -      -      1.93   1.94   -      -      -      -
3456          Split      -      -      -      2.24   2.11   -      -      -
              Nonsplit   -      -      -      2.19   2.18   -      -      -
4567          Split      -      -      -      -      2.02   2.02   -      -
              Nonsplit   -      -      -      -      2.04   2.01   -      -
5678          Split      -      -      -      -      -      1.83   1.66   -
              Nonsplit   -      -      -      -      -      1.84   1.71   -
interaction and in the t-tests in the rest of the table. Values of F and of t are typically larger than in the previous analyses. In summary, the ELE emerged in both models presented with the pictogram input lexicon, but with a significant interaction between model type and presentation condition, confirming our claim that two sources are responsible for the effect.

Table 9: Results of the Analyses of Variance and t-Tests for the Split and Nonsplit Networks Presented with the 60 "Pictogram" Input Lexicon.

Input Nodes   Interaction Between Model      t-Test for Exterior     t-Test for Split Model,   t-Test for Interior     t-Test for Nonsplit Model,
              and Presentation Condition     Letters, Comparing      Comparing Exterior        Letters, Comparing      Comparing Exterior
                                             Split and Nonsplit      and Interior Letters      Split and Nonsplit      and Interior Letters
                                             Models                                            Models
1234          **, ***                        n.s.                    ***                       n.s.                    ***
2345          ***, ***                       ***                     ***                       n.s.                    ***
3456          ***, ***                       ***                     ***                       n.s.                    ***
4567          ***, ***                       ***                     ***                       n.s.                    ***
5678          ***, ***                       p = .053†               ***                       n.s.                    ***

Notes: * p < .05. ** p < .01. *** p < .001. The p value marked with † refers to an effect that is in the unpredicted direction: the exterior letters accrue higher MSE in the split than in the nonsplit model.

4 Discussion and Conclusions

We began with the phenomenon, demonstrated from a range of experimental paradigms, that the representation of the exterior letters of words seems to be prioritized in reading—the ELE. We described the precise vertical splitting of the human fovea, an underrecognized fact within the field of visual word recognition, and we hypothesized that part of the explanation for the ELE might lie in the divided nature of the processor. Our simulations used a shift-invariant identity mapping in a split and a nonsplit model. In this approach, letter location is not physically encoded within the initial architecture (there is complete connectivity between input and hidden layers) but is expressed in terms of co-occurring information. In the split model, the most important coding of location concerned the half of the model in which particular pieces of information occur. Our simulations have shown two distinct contributions toward the ELE: one arising within the nonsplit model and one reflecting the vertical division in the split model. Below, we explore related explanations of these two contributions and assess some of the psychological implications. First, we consider the ELE as it appears in the split model. The data in Tables 1, 4, and 7 suggest a resource-based explanation for the ELE in the split model. Each table shows the MSE for both of the exterior letters for each presentation position of the word. Error is smallest for the left-most letter position when it appears at input node 4 and for the right-most letter position when it appears at input node 5 (in Table 1, the MSE is 0.20 and 0.19, respectively). In these presentations, a single exterior letter is stranded alone in one half of the model, allowing that letter access to all of the processing resources of that half of the model. In contrast, the remaining three letters must share the resources of the other half of the model. Consider the left-most letter in the word as it appears in input positions 1, 2, 3, and 4. The MSE progressively falls (from 0.68 to 0.20 in Table 1, for instance).
This fall reflects the fact that the letter is being processed in the same half of the model and, for each presentation, has a progressively larger share of the processing resources. The MSE increases sharply when the midpoint is crossed (it is 0.97 for input position 5 in Table 1), reflecting the fact that the letter is now being processed in a different half of the model, using resources that must also represent the rest of the word. The three sets of simulations all demonstrate this precise pattern, for both ends of the presented word in the split model. In contrast to the exterior letters, the interior letters are divided more evenly between the two halves of the split model, and the resulting MSE data reflect the fact that the responsibility of each half of the model for each interior letter is not so skewed as it is for the exterior letters. We will refer to this explanation as the hemispheric division of labor account. Further evidence for this resource-based account is found in the way that the split model learns words in the five different presentation positions shown in Figure 2. For 10 different versions of the split model, we plotted the fall in total MSE for all of the words for each of the presentation positions every 10 epochs of training. We carried out this procedure for the three different lexica. For each lexicon, the central presentation position (input nodes 3, 4, 5, and 6) typically began to show a smaller MSE than the other positions very early in training and maintained this advantage throughout. The MSE for the two single-hemifield presentations typically fell the slowest of all, with the remaining two positions usually accruing intermediate levels of MSE.
This collaboration of the two halves of the model resembles a simple example of superadditivity—a feature of human performance on a range of tasks using the two visual hemifields, in which the combined activity of the two hemispheres is apparently greater than the "sum of their parts" (see, e.g., Banich & Belger, 1990). The second contribution toward the ELE appears to come from the shift-invariant identity mapping itself, given that the nonsplit model produces the ELE in its own right. There is complete connectivity between the input and hidden layers, meaning that exterior letters are defined only by appearances of the stimuli in other locations in the input field. As Table 5 illustrates, the MSE associated with interior letter positions is typically greater for more central input locations; the shift-invariant mapping requires these nodes to support a greater variety of representations, thereby increasing the problems of superpositional storage for these positions. We will refer to this explanation as the superpositional storage account. Support for this account comes from recording the fall in MSE in the nonsplit model for the five presentation positions and for the three lexica, as described above. From very early in training, the central presentation position typically produced the highest MSE of all five positions. The U-shaped curve for MSE across the five positions that was observed for the split model, as described above, was inverted for the nonsplit model; in the nonsplit model, the single-hemifield presentation positions typically generated the smallest levels of
MSE throughout training. The mechanisms behind the hemispheric division of labor account and the superpositional storage account may operate additively in the split model. One of the potential explanations of the ELE we considered involved the split model's disproportionately disrupting the immediate contexts of interior letters more than those of exterior letters. This potential account of the ELE in the simulation with real English words received no support from the subsequent simulations with the random-letters lexicon and the pictogram lexicon. In these two sets of simulations, there is less and less potential to define the constituents of words in terms of their contexts, compared with the real English words, yet the ELE becomes progressively stronger across the three sets of simulations. We therefore reject this potential account based on disrupted contexts; the hemispheric division of labor account and the superpositional storage account remain the principal explanations of the observed behaviors in the three sets of simulations we have reported. The increase in the size of the ELE from the English words, to the random-letter words, to the pictograms is a marked feature of the data we have presented. This change reflects the increasing difficulty of the task as less and less intermediate structure (i.e., orthotactic constraints and a letter level of representation) is available to the model. This increased task difficulty was reflected in the number of epochs required by the models to train to criterion for each input lexicon. The longer and more difficult the learning task, the greater is the opportunity for the two sources of the ELE to take effect. The stronger ELE may be seen in increasing significance levels across F and t in Tables 3, 6, and 9 and in the appearance of significant effects in some of the single-hemifield presentations to the split model.
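The architectural difference that separates the two accounts can be captured as a single connectivity constraint between input and hidden layers. A minimal sketch; the layer sizes and the exact halving scheme are our assumptions, not the paper's PDP++ implementation:

```python
import numpy as np

N_SLOTS, N_FEATURES = 8, 8            # input field: 8 letter slots x 8 features
N_IN = N_SLOTS * N_FEATURES           # 64 input nodes
N_HID = 40                            # hidden-layer size: our assumption

def input_hidden_mask(split):
    """Connectivity between input and hidden layers. The nonsplit model is
    fully connected; in the split model each half of the input field
    projects only to its own half of the hidden layer, with the two hidden
    pools merging only at the output (our reading of the described
    architecture)."""
    if not split:
        return np.ones((N_IN, N_HID))
    mask = np.zeros((N_IN, N_HID))
    mask[: N_IN // 2, : N_HID // 2] = 1.0   # left hemifield -> left hidden pool
    mask[N_IN // 2 :, N_HID // 2 :] = 1.0   # right hemifield -> right hidden pool
    return mask
```

Multiplying the input-to-hidden weight matrix by this mask gives each hemifield an exclusive pool of weighted connections, which is the resource exclusivity both accounts trade on.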
The learning phase is essential to the network's performance, and so we would predict that presenting novel stimuli to the network would not show the ELE so clearly. The reverse effect to the ELE is observed in human data with the presentation of letter-like nonsense characters (Hammond & Green, 1982; Mason, 1982; Mason & Katz, 1976), and whether this would emerge from our model is a topic for future research. We turn now to the psychological implications of the modeling results. The models and the task that we have presented are simple and small; only the vertical splitting of one of the models is directly psychologically realistic. What conclusions can be drawn for psychological models of visual lexical processing? The hemispheric division of labor account of the ELE has a direct psychological interpretation, as follows. Any word can be fixated at any point along its length, causing it to be cleanly split in its projection to the two hemispheres. Processing that is directly relevant to word recognition occurs intrahemispherically before the two domains of information are completely merged. Over a history of a variety of fixation points for any one word, the right hemisphere develops an increasingly exclusive relationship with the information that is closer to the left end of the word, and the left hemisphere behaves similarly with respect to the information closer to the right
end of the word. This pattern of processing produces the ELE. This account of the ELE has been derived here solely from computational modeling. In retrospect, it might equally well have been produced from a consideration of the facts of foveal splitting and fixation data during reading, in which case the modeling above would be a computational implementation of the argument. It is less easy to claim that the superpositional storage account has a comparable psychological reality. In this account, we claimed that the weighted connections between the middle region of the input and the hidden layer (in the nonsplit model) came under pressure to represent a greater variety of letters, and so better, less errorful representations emerged toward the edges of the input field, where the exterior letters tend to fall. The two accounts—superpositional storage and hemispheric division of labor—are both cashed out in the same terms in the simple connectionist models we have employed: a better mapping is possible if a greater number of weighted connections is exclusively available to mediate it. In the superpositional storage account, the exclusivity arises from the relatively smaller density of mappings required from the outer regions of the input field. In the hemispheric division of labor account, the exclusivity arises from different parts of the input falling in different halves of the processor. The hemispheric division of labor account may be plausibly applied to real single-word reading; the earlier stages of single-word reading are demonstrably divided between the two hemispheres. The superpositional storage account requires more specific, and less plausible, assumptions about word recognition in the brain; it requires us to believe that the representational substrate for word recognition is coded in terms of absolute location, with specific letter positions possessing dedicated resources.
In contrast, the hemispheric division of labor account need only specify location in terms of being in one or other hemisphere. We therefore claim that of the two sources we have identified for the ELE in our simulations, the hemispheric division of labor account has more psychological reality. Our goal has been to model the psychological data, and we therefore conclude that these data are in part explained by foveal splitting and the initial projection of the two parts of a word to different hemispheres. We have directly modeled the ELE in the whole-word priming experiments reported by McCusker et al. (1981); our simulations with different lexica suggest that the effect, stemming from the split nature of the architecture and the nature of the task, is sufficiently strong not to be eclipsed by the orthotactic structure present in the real-word materials. Our model is a very abstract characterization of word recognition, intended only to represent the psychological reality of foveal splitting. Its input consisted of only four-letter words, reflecting the fact that the relevant human experimentation predominantly uses words of this length. It is likely that each hemisphere has available (from subcortical routes) relatively crude, low-spatial-frequency visual information from the ipsilateral hemifield, thereby providing information about word length (see Sergent, 1987; Corballis & Sergent, 1988). Restricting the model to words of one length therefore has some psychological justification. Jordan (1990) lists some of the letter groups that have been proposed as processing units in the reading of isolated words: digrams, trigrams, prefixes, suffixes, morphemes, phonological units, and basic orthographic syllabic structures. He proposes exterior letters as just such a perceptual unit. We have developed Jordan's proposal in the context of a psychologically more realistic architecture: the splitting of the visual projection is fundamental to the psychological architecture underlying word recognition. The putative perceptual units listed by Jordan are generated on the strength of emergent formal and statistical structure. Prefixes, for example, are morphological units and also occur sufficiently frequently to have perhaps assumed the status of single entities. The perceptual units we have described are qualitatively different from the other units listed, in that they have their origin in the anatomical structure of the processor. In the modifications that Smith, Jordan, and Sharma (1991) make to the IAM, the length of the word is independently represented, thereby encoding the status of the exterior letters as a qualitatively different type of information. In our own model, the length of the word is not separately represented. The strongest claim we can make is that the ELE found in reading is solely due to foveal splitting, as we have modeled the phenomenon. A more realistic conclusion may be that a variety of factors conspire to make the prioritizing of the exterior letters of words an adequate and natural strategy for accomplishing word recognition. We consider them in turn. First, there may be a (small) contribution to the ELE from psychophysical factors.
It may also be that lateral inhibition between adjacent letters exists at the abstract, cognitive levels of lexical processing, favoring outside letters. Second, exterior letters possess a unique informational status, which we have explored elsewhere (Shillcock et al., in press): given a fixation in the middle of a word or just left of the middle, letter location information need only specify which hemifield a letter appears in to uniquely define most words in the English lexicon. If the left hemifield contains a, c, and r and the right hemifield contains e, p, and t, then the word must be carpet. This hemifield information about letter location is less adequate for shorter words,5 and it is the identity of the exterior letters that resolves all of the ambiguities in the three- and four-letter words and virtually all of the remaining ambiguity in the five-letter and longer words. In summary, if letter identities are known, then letter location information can be surprisingly crude, but with exterior letters playing a unique informational role. (Both of these
5 Although note that less than 5% of four-letter words are ambiguous, like time and item, which share the same letters in the right and left hemifields following a central split.
factors are the direct result of foveal splitting, which automatically assigns the two parts of the fixated word to different hemispheres and instantiates a "two-slot" processor.) The third factor, which chiefly concerns monosyllabic words, is that the exterior letters usually represent the onset and coda, and therefore tend to be more informative than the interior letters, better restricting the identity of the word involved. This third factor is relatively fortuitous. Thus, although our hemispheric division of labor account may be central to the explanation of the ELE, different factors converge to make exterior letters important and to encourage the processor to prioritize them. Indeed, within these constraints, the range of possible solutions to the problem of visual word recognition is much more limited than would first appear. At a more general level we have shown, along with Reggia, Goodall, and Shkuro (1998), for instance, that small-scale neural network models of hemispheric coordination can produce behavior that illuminates processing problems that are solved with the full resources of the brain. The analysis of such models can provide a richer conceptual vocabulary for exploring hemispheric interaction than has emerged from classical models of cognition. Appendix: Training Set Words
also call even from have keep life many must same take them time well with
away come fact give here know like mean only seem tell then turn what work
back down feel good into last look more over some than they very when year
both each find hand just late make much part such that this want will your
Acknowledgments
This research was carried out partly under MRC (U.K.) grant G9421014N, and ESRC (U.K.) research studentship R00429634206. We acknowledge the very helpful comments of Bruce Graham, Gary Cottrell, and an anonymous reviewer. All remaining errors are our own.
1196
Richard Shillcock and Padraic Monaghan
Correspondence concerning this article should be addressed to Richard Shillcock, Department of Psychology, 7 George Square, Edinburgh, EH8 9JZ, or to the authors at
[email protected] or
[email protected], respectively.
References Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database [CD-ROM]. Philadelphia: Linguistic Data Consortium, University of Pennsylvania. Banich, M. T., & Belger, A. (1990). Interhemispheric interaction: How do the hemispheres divide and conquer a task? Cortex, 26, 77–94. Brysbaert, M. (1994). Interhemispheric transfer and the processing of foveally presented stimuli. Behavioural Brain Research, 64, 151–161. Butler, B. E., & Merikle, P. M. (1973). Selective masking and processing strategy. Quarterly Journal of Experimental Psychology, 25, 542–548. Corballis, M. C., & Sergent, J. (1988). Imagery in a commissurotomized patient. Neuropsychologia, 33, 937–959. Eriksen, C. W., & Rohrbaugh, J. (1970). Visual masking in multielement displays. Journal of Experimental Psychology, 83, 147–154. Estes, W. K., Allmeyer, D. H., & Reder, S. M. (1976). Serial position functions of letter identification at brief and extended exposure durations. Perception and Psychophysics, 19, 1–15. Fendrich, R., & Gazzaniga, M. S. (1989). Evidence of foveal splitting in a commissurotomy patient. Neuropsychologia, 27, 273–281. Fendrich, R., Wessinger, C. M., & Gazzaniga, M. S. (1996). Nasotemporal overlap at the retinal vertical meridian—investigations with a callosotomy patient. Neuropsychologia, 34, 637–646. Forster, K. I. (1976). Accessing the mental lexicon. In E. C. J. Walker & R. J. Wales (Eds.), New approaches to language mechanisms. Amsterdam: North Holland. Forster, K. I., & Gartlan, G. (1975). Hash coding and search processes in lexical access. Paper presented at the Second Experimental Psychology Conference, University of Sydney. Hammond, E. J., & Green, D. W. (1982). Detecting targets in letter and nonletter arrays. Canadian Journal of Psychology, 36, 67–82. Jordan, T. R. (1990). Presenting words without interior letters: Superiority over single letters and influence of postmask boundaries.
Journal of Experimental Psychology: Human Perception and Performance, 16, 893–909. Jordan, T. R. (1995). Perceiving exterior letters of words: Differential influences of letter-fragment and non-letter-fragment masks. Journal of Experimental Psychology: Human Perception and Performance, 21, 512–530. Mason, M. (1982). Recognition time for letters and non-letters: Effects of serial position, array size, and processing order. Journal of Experimental Psychology: Human Perception and Performance, 8, 724–738.
Mason, M., & Katz, L. (1976). Visual processing of nonlinguistic strings: Redundancy effects and reading ability. Journal of Experimental Psychology: General, 105, 338–348. McCusker, L. X., Gough, P. B., & Bias, R. G. (1981). Word recognition inside out and outside in. Journal of Experimental Psychology: Human Perception and Performance, 7, 538–551. Merikle, P. M. (1974). Selective backward masking with an unpredictable mask. Journal of Experimental Psychology, 103, 589–591. Merikle, P. M., & Coltheart, M. (1972). Selective forward masking. Canadian Journal of Psychology, 26, 296–302. Merikle, P. M., Coltheart, M., & Lowe, D. G. (1971). On the selective effects of a pattern masking stimulus. Canadian Journal of Psychology, 25, 264–279. Mozer, M. C. (1987). Early parallel processing in reading: A connectionist approach. In M. Coltheart (Ed.), Attention and performance XII: The psychology of reading (pp. 83–104). Hillsdale, NJ: Erlbaum. O’Reilly, R. C., Dawson, C. K., & McClelland, J. L. (1995). PDP++ neural network simulation software. Pittsburgh, PA: Carnegie Mellon University. Plaut, D. C., & McClelland, J. L. (1993). Generalization with componential attractors: Word and nonword reading in an attractor network. In Proceedings of the 15th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum. Plaut, D. C., & Shallice, T. (1994). Connectionist modelling in cognitive neuropsychology: A case study. Hillsdale, NJ: Erlbaum. Reggia, J., Goodall, S., & Shkuro, Y. (1998). Computational studies of lateralization of phoneme sequence generation. Neural Computation, 10, 1277–1297. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press. Rumelhart, D. E., & McClelland, J. L. (1982).
An interactive activation model of context effects in letter perception: Part 2. The contextual enhancement effect and some tests and extensions of the model. Psychological Review, 89, 60–94. Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568. Sergent, J. (1987). A new look at the human split brain. Brain, 110, 1375–1392. Shillcock, R., Ellison, T. M., & Monaghan, P. (in press). Eye-fixation behaviour, lexical storage and visual word recognition in a split processing model. Psychological Review. Shillcock, R. C., Kelly, M. L., Buntin, L., & Patterson, D. (1997). Similarities and differences between content and function words: The role of lexical neighbourhoods in the clarity gating task (Rep. No. EOPL-97-4). Edinburgh: Department of Linguistics, University of Edinburgh. Shillcock, R. C., & Monaghan, P. (1998). Using anatomical information to enrich the connectionist modelling of normal and impaired visual word recognition. In S. J. Derry & M. A. Gernsbacher (Eds.), Proceedings of the Twentieth Annual Meeting of the Cognitive Science Society (pp. 945–950). Hillsdale, NJ: Erlbaum.
Shillcock, R., Monaghan, P., & Ellison, T. M. (1999). The SPLIT model of visual word recognition: Complementary connectionist and statistical cognitive modelling. In D. Heinke, G. W. Humphreys, & A. Olson (Eds.), Connectionist models in cognitive neuroscience: The 5th Neural Computation and Psychology Workshop (pp. 3–12). London: Springer-Verlag. Smith, P. T., Jordan, T. R., & Sharma, D. (1991). A connectionist model of visual word recognition that accounts for interactions between mask size and word length. Psychological Research, 53, 80–87. Sperry, R. W. (1968). Apposition of visual half-fields after section of neocortical commissures. Anatomical Record, 160, 498–499. Sugishita, M., Hamilton, C. R., Sakuma, I., & Hemmi, I. (1994). Hemispheric representation of the central retina of commissurotomized subjects. Neuropsychologia, 32, 399–415. Yannakoudakis, E. J., & Hutton, P. J. (1992). An assessment of N-phoneme statistics in phoneme guessing algorithms which aim to incorporate phonotactic constraints. Speech Communication, 11, 581–602. Received June 14, 1999; accepted July 24, 2000.
Neural Computation, Volume 13, Number 6
CONTENTS View
Generalization in Interactive Networks: The Benefits of Inhibitory Competition and Hebbian Learning Randall C. O’Reilly
1199
Note
Optimal Smoothing in Visual Motion Perception Rajesh P. N. Rao, David M. Eagleman, and Terrence J. Sejnowski
1243
Letters
Rate Coding Versus Temporal Order Coding: What the Retinal Ganglion Cells Tell the Visual Cortex Rufin Van Rullen and Simon J. Thorpe
1255
The Effects of Spike Frequency Adaptation and Negative Feedback on the Synchronization of Neural Oscillators Bard Ermentrout, Matthew Pascal, and Boris Gutkin
1285
A Unified Approach to the Study of Temporal, Correlational, and Rate Coding Stefano Panzeri and Simon R. Schultz
1311
Determination of Response Latency and Its Application to Normalization of Cross-Correlation Measures Stuart N. Baker and George L. Gerstein
1351
Attractive Periodic Sets in Discrete-Time Recurrent Networks (with Emphasis on Fixed-Point Stability and Bifurcations in Two-Neuron Networks) Peter Tiňo, Bill G. Horne, and C. Lee Giles
1379
Attractor Networks for Shape Recognition Yali Amit and Massimo Mascaro
1415
VIEW
Communicated by Gary Cottrell
Generalization in Interactive Networks: The Benefits of Inhibitory Competition and Hebbian Learning Randall C. O’Reilly Department of Psychology, University of Colorado at Boulder, Boulder, CO 80309, U.S.A. Computational models in cognitive neuroscience should ideally use biological properties and powerful computational principles to produce behavior consistent with psychological findings. Error-driven backpropagation is computationally powerful and has proven useful for modeling a range of psychological data but is not biologically plausible. Several approaches to implementing backpropagation in a biologically plausible fashion converge on the idea of using bidirectional activation propagation in interactive networks to convey error signals. This article demonstrates two main points about these error-driven interactive networks: (1) they generalize poorly due to attractor dynamics that interfere with the network’s ability to produce novel combinatorial representations systematically in response to novel inputs, and (2) this generalization problem can be remedied by adding two widely used mechanistic principles, inhibitory competition and Hebbian learning, that can be independently motivated for a variety of biological, psychological, and computational reasons. Simulations using the Leabra algorithm, which combines the generalized recirculation (GeneRec) biologically plausible, error-driven learning algorithm with inhibitory competition and Hebbian learning, show that these mechanisms can result in good generalization in interactive networks. These results support the general conclusion that cognitive neuroscience models that incorporate the core mechanistic principles of interactivity, inhibitory competition, and error-driven and Hebbian learning satisfy a wider range of biological, psychological, and computational constraints than models employing a subset of these principles. 1 Introduction
Neural Computation 13, 1199–1241 (2001) © 2001 Massachusetts Institute of Technology
A long-standing goal of neural network modeling is to develop a formalism that is consistent with the known biological properties of the neural networks of the brain, uses powerful computational principles, and produces behavior consistent with findings from psychology. This goal has often been thwarted by incompatibilities among these different levels of constraints. This article examines the ability of networks to process novel information, termed generalization, and how this ability is affected by network
properties motivated by attempts to produce a formalism that achieves this long-standing goal. We will see that interactive (bidirectional, recurrent) networks, which are appealing for a variety of biological, psychological, and computational reasons, nevertheless tend to generalize much worse than their feedforward counterparts. However, the inclusion of two other principled mechanisms, inhibitory competition and Hebbian learning (which are similarly appealing for a variety of biological, psychological, and computational reasons), results in good generalization in an interactive network. These results support the use of formalisms that include all of these principles. 1.1 The Importance of Interactivity. Interactive networks provide a number of advantages from the biological, computational, and psychological perspectives. Biologically, bidirectional connectivity is ubiquitous in the cortex (Felleman & Van Essen, 1991; Levitt, Lewis, Yoshioka, & Lund, 1993; White, 1989; Douglas & Martin, 1990). Computationally, interactivity enables iterative, constraint-satisfaction-style processing (Smolensky, 1986; Ackley, Hinton, & Sejnowski, 1985; Hopfield, 1982, 1984), and generally allows information processing to flow in many different directions within one network, providing a richer and more flexible system. Psychologically, interactivity is important for understanding phenomena like the word superiority effect, where higher-level word representations both depend on and yet exert top-down influence over lower-level letter representations (McClelland & Rumelhart, 1981) and other kinds of top-down processing effects (Vecera & O’Reilly, 1998). Another major motivation for interactive networks is that they enable a more biologically plausible form of error-driven backpropagation learning.
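The iterative, constraint-satisfaction style of processing mentioned above can be sketched as settling in a small symmetric network. This is only an illustrative toy (the function, weight matrix, and clamping scheme are ours, not from the article): unclamped units are repeatedly updated from their net input until the network reaches a mutually consistent state.

```python
import math

def settle(W, clamped, steps=50):
    """Iterative settling in a symmetric (interactive) network.
    W is a symmetric weight matrix; `clamped` maps unit index to a
    fixed activation. Unclamped units take a sigmoid of their net
    input on each sweep."""
    n = len(W)
    act = [clamped.get(i, 0.5) for i in range(n)]
    for _ in range(steps):
        for i in range(n):
            if i in clamped:
                continue
            net = sum(W[i][j] * act[j] for j in range(n))
            act[i] = 1.0 / (1.0 + math.exp(-net))
    return act

# Two mutually excitatory units: clamping one pulls the other up.
W = [[0.0, 3.0], [3.0, 0.0]]
print(settle(W, {0: 1.0})[1] > 0.5)   # True
```

Because the weights are symmetric, activation flows in both directions, which is exactly the property GeneRec exploits to carry error information, as discussed next.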
Backpropagation learning (Rumelhart, Hinton, & Williams, 1986) provides the starting point for the algorithms discussed in this article and has been used in a wide range of psychological models, but is unfortunately inconsistent with known biological properties (Crick, 1989; Zipser & Andersen, 1988). Specifically, a literal biological interpretation of backpropagation would require that an error derivative term be (1) propagated backward across the synapse and down the axon of the sending neuron, (2) summed and multiplied by a derivative term, and (3) propagated down the dendrites of this neuron and across its synapses, and so on. No evidence for these mechanisms exists. It was recently shown that backpropagation can be implemented in a more biologically plausible fashion using bidirectional activation propagation in an interactive network using the GeneRec algorithm (O’Reilly, 1996a), which is a generalization of the recirculation algorithm (Hinton & McClelland, 1988). In GeneRec, error information is propagated as two separate terms by standard activation propagation mechanisms in interactive networks, and the difference between these terms (which is the error signal) can be plausibly computed using the synaptic modification mechanisms underlying long-term potentiation and depression (LTP/LTD). Versions of the GeneRec algorithm are equivalent to the other known ways of implementing powerful error-driven learning using interactive activation propagation instead of direct error propagation (e.g., the deterministic Boltzmann machine, Hinton, 1989b; and contrastive Hebbian learning, Movellan, 1990). Thus, several different approaches converge on the idea that the way to perform error-driven learning in a more biologically plausible manner is to use interactive networks, where error signals are communicated by top-down activation propagation.
1.2 Interactivity Impairs Generalization. Despite the numerous advantages of interactivity, it can also impart some significant limitations: it tends to impair the generalization performance of networks. Generalization is important from both computational and psychological perspectives. Computationally, generalization has been a major focus in the study of machine learning, where it enables limited training data to be applied more broadly (Weigend, Rumelhart, & Huberman, 1991; Wolpert, 1992; Vapnik & Chervonenkis, 1971). Psychologically, generalization is important for modeling human nonword reading performance (e.g., the fact that people generally pronounce nonwords like nust according to the regularities of the English language; Seidenberg & McClelland, 1989; Plaut, McClelland, Seidenberg, & Patterson, 1996), and more generally for understanding how human and animal cognition can exhibit flexibility in ever-changing environments. Thus, the fact that interactivity impairs generalization, often to a significant degree, is problematic for any attempt to develop a biologically, computationally, and psychologically satisfying neural network algorithm. To understand how an interactive network can interfere with generalization, one must understand how networks typically generalize.
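The GeneRec error computation just described reduces, in its symmetric contrastive Hebbian (CHL) form, to a per-synapse update computed from two activation phases (the article gives the formal rule as equation 2.1). A minimal sketch, with variable names of our choosing:

```python
def chl_update(x_plus, y_plus, x_minus, y_minus, lr=0.01):
    """Contrastive Hebbian learning: the weight change is the Hebbian
    product in the plus (target-clamped) phase minus the product in
    the minus (output-free) phase; the difference between the two
    phase terms is the error signal."""
    return lr * (x_plus * y_plus - x_minus * y_minus)

# When the two phases agree, the error, and hence the change, is zero:
print(chl_update(1.0, 0.8, 1.0, 0.8))   # 0.0
```

The biological appeal is that each term is a local Hebbian product of pre- and postsynaptic activity, plausibly computable by LTP/LTD-like mechanisms, rather than an explicitly propagated error derivative.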
Networks can form distributed internal representations that encode the compositional features of the environment in a combinatorial fashion, which enables novel stimuli to be processed successfully by activating the appropriate novel combination of representational (hidden) units. Although the combination is novel, the constituent features are familiar and have been trained to produce or influence appropriate outputs, such that the novel combination of features should also produce a reasonable result. In the domain of reading, a network can pronounce nonwords correctly when it represents the pronunciation consequences of each letter using different units that can easily be recombined in novel ways for nonwords. In this combinatorial view of generalization, interactivity impairs generalization because an interactive network is a dynamic system, whereas a feedforward network is not. Under certain conditions, the interactive activation dynamics (e.g., attractors) can interfere with the ability to form novel combinatorial representations; the units interact with each other too much to retain the kind of independence necessary for novel recombination. Consider a very simple example (see Figure 1): train that a red and
Figure 1: Illustration of why independent (elemental) internal representations are important for good combinatorial generalization. Inputs are different colored lights (red, green, and blue), and outputs are different buttons that could be pressed. Training consists of R,G → 1,2 and R,B → 1,3, and testing is on G,B, which, combinatorially, should produce 2,3. (a) The internal representations encode the input-output mapping in an independent or elemental fashion, enabling full combinatorial generalization. (b) The representations are conjunctive, meaning that inputs interact or conjoin in determining the internal representations (via attractor dynamics), producing different attractors for each pair of inputs. This can produce bad combinatorial generalization because the weights for the novel G,B attractor have not been associated with anything meaningful during training.
green light means push buttons 1 and 2, and a red and blue light means push buttons 1 and 3. Then test with green and blue lights—buttons 2 and 3 should be pushed. If the training is encoded using independent internal representations for each elemental input-output mapping, then the system will naturally generalize to novel combinations (see Figure 1a). However, if there are interactive attractor dynamics that cause different pairs of inputs to be encoded by very different sets of internal units, then combinatorial generalization will not work, because the green-blue test input will activate a novel internal representation that has not been systematically associated with the correct outputs (see Figure 1b). In other words, combinatorial generalization suffers to the extent that inputs interact conjunctively in determining the internal representation. The trade-off between independent, elemental representations versus conjunctive or configural representations has been analyzed in the context of hippocampal representations, which are thought to be conjunctive, while cortical representations are thought to be more elemental or independent (Marr, 1971; O’Reilly & McClelland, 1994; McClelland, McNaughton, & O’Reilly, 1995; O’Reilly & Rudy, 2000, in press). According to this analysis,
conjunctive representations are important for capturing specifics, while elemental and independent representations are important for capturing generalities of the environment and using these representations for generalization to novel situations. Thus, as will be demonstrated, generalization suffers when there are conjunctive attractor dynamics, where multiple input features interact and give rise to different internal representations, instead of each feature contributing independently (combinatorially) to the internal representations. Specifically, these conjunctive attractor dynamics can arise when learning and activation dynamics are underconstrained, such that individual units tend to be activated by a large and unsystematic collection of input features, and the network is highly sensitive to small differences between different inputs. In addition to demonstrating a substantial impairment of combinatorial generalization in underconstrained interactive networks, the simulations in this article provide more direct evidence that conjunctive attractor dynamics are the source of the problem. However, some other reports in the literature have demonstrated successful generalization in interactive networks (e.g., Plaut & McClelland, 1993; Plaut et al., 1996). Indeed, Plaut and colleagues attributed the generalization success of their networks to their ability to form componential attractors, where attractor dynamics operated within, but not between, the representations of the compositional features of letters and phonemes of words (just like the independent representations of Figure 1a). Thus, the components could be effectively recombined to subserve good generalization, instead of suffering from the problems of conjunctive representations.
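The combinatorial logic of Figure 1 can be reduced to a deliberately trivial sketch (ours, not the article's network) in which each input feature independently maps to its own output. Because the features are learned elementally, the never-trained combination G,B still produces the right answer:

```python
# Elemental (independent) feature learning: train R,G -> 1,2 and
# R,B -> 1,3, then test the novel combination G,B, expecting 2,3.
feature_to_output = {}

def train(features, outputs):
    # Each feature is associated with its own output,
    # independent of what it co-occurs with.
    for f, o in zip(features, outputs):
        feature_to_output[f] = o

train(["R", "G"], [1, 2])
train(["R", "B"], [1, 3])
print([feature_to_output[f] for f in ["G", "B"]])   # [2, 3]
```

A conjunctive scheme would instead key on the whole pair ("G", "B"), which was never trained and so retrieves nothing meaningful; that is the failure mode of Figure 1b.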
However, the algorithm that Plaut and colleagues used did not significantly develop the recurrent feedback weights from the output layer to the hidden layer over the course of learning (Plaut, personal communication, 1996), and the networks were specifically constrained to settle very rapidly. These facts minimize the extent to which those networks can be considered interactive; instead, they are effectively feedforward networks that have only token feedback connections, and their generalization performance can thus be attributed to the lack of interactivity rather than to the existence of componential attractors. Another case showed that recurrent connections actually improved generalization (Harm & Seidenberg, 1999), but here the attractor dynamics were only in the output layer and served to clean up representations to the nearest valid output. The arguments in this article apply instead to fully interactive networks where the input-output mapping itself is mediated by an interactive network via one or more hidden layers. The importance of the distinction between output-only attractors and those developed over hidden layers was clearly demonstrated by Noelle and Cottrell (1996), who found that networks with only local recurrent connectivity at the directly trained output layer generalized well, but networks with recurrent connectivity at the hidden layer that received a distal error signal from an output
layer generalized poorly. More recent work by Noelle and Zimdars (1999) showed improved generalization with distal error signals by incorporating additional biasing mechanisms. Thus, the idea that generalization performance is impaired in fully interactive networks is consistent with the findings in the literature, as well as the simulations reported below. Therefore, the improvement in biological plausibility imparted by the GeneRec algorithm and variations of it is offset by a decrement in psychological plausibility and computational efficacy, leaving us perhaps not that much closer to the eventual goal of a biologically, psychologically, and computationally satisfying algorithm. 1.3 Biases for Improving Generalization in Interactive Networks. A very different conclusion can be reached if one retains the advantages of interactive networks and augments them with additional biases that favor the development of systematic, truly componential attractor structures. The use of such biases in neural networks has been discussed in the context of the fundamental bias-variance trade-off (Geman, Bienenstock, & Doursat, 1992). This trade-off emphasizes the fact that biases that are appropriate for the task can greatly facilitate learning and generalization by reducing the level of variance, where variance reflects the extent to which parameters are underconstrained by learning, and thus free to vary, causing random errors in generalization. These biases are also known as regularizers (e.g., Poggio & Girosi, 1990). However, inappropriate biases can obviously hurt performance by introducing systematic errors, such that there is no such thing as a single universally beneficial set of biases (Wolpert, 1996a, 1996b).
It is nevertheless possible that mammalian cortical learning mechanisms have a set of biases that facilitate learning and generalization in the small subset of all possible environments that characterize the natural world (and what an animal needs to learn about the natural world to survive). An important goal of the work described here is to make progress in identifying such biases. It is well known that purely error-driven learning mechanisms are weakly biased and are therefore typically underconstrained by the learning task, causing them to suffer from too much variance (Vapnik & Chervonenkis, 1971; Weigend et al., 1991; Geman et al., 1992; Wolpert, 1992). For example, this means that the weights in a trained network tend to reflect a large contribution from their random initial values, which prevents the units from systematically carving up the input-output mapping into separable subsets that can be independently combined for the novel testing items. Instead, each unit participates haphazardly in many different aspects of the mapping. This haphazardness is often not that detrimental to generalization in a purely feedforward network (e.g., as demonstrated in a componential generalization task; Brousse, 1993), but it proves far more damaging in the context of the attractor dynamics in an interactive network. Therefore, it seems likely that appropriate additional biases could significantly increase the network's tendency to produce componential attractors. The specific choices of biases explored in this article are biologically and psychologically motivated, with the hope of producing an algorithm that is at once biologically plausible and capable of effective generalization. Specifically, two important principled mechanisms, inhibitory competition and Hebbian learning, are incorporated as biases in the network. Both of these mechanisms have been widely used for biological and psychological models, but they have not generally been combined with error-driven learning in an interactive network. Biologically, there is good evidence for inhibitory competition in the cortex in the form of the effects of pervasive inhibitory interneurons (Gabbot & Somogyi, 1986), and Hebbian learning characterizes the properties of LTP/LTD in the cortex (Bear, 1996). Psychologically, inhibition is important for modeling attentional and other phenomena, and Hebbian learning has been used in modeling a number of different learning phenomena (Miller, Keller, & Stryker, 1989). In the context of generalization in interactive networks, the specific biases of inhibitory competition and Hebbian learning constrain both the interactive activation dynamics (in the case of inhibitory competition) and the learned weight patterns such that the network actually produces componential attractors. The simulations presented in this article, which are implemented using the Leabra algorithm that combines GeneRec with inhibitory competition and Hebbian learning mechanisms (O’Reilly, 1996b, 1998; O’Reilly & Munakata, 2000; O’Reilly & Rudy, in press), demonstrate that these biases outperform other standard biases such as weight decay on a combinatorial generalization task.
Furthermore, these biases have proven useful for a wide range of naturalistic learning environments, as demonstrated by modeling a range of cognitive phenomena in perception, memory, language, and higher-level cognition using the same algorithm, often with the same parameters (O’Reilly & Munakata, 2000). Thus, by incorporating the additional mechanistic principles of inhibitory competition and Hebbian learning together with interactivity in a biologically plausible form of backpropagation via the GeneRec algorithm, it is possible to have a neural network model that satisfies a wider range of biological, psychological, and computational constraints than models using only subsets of these principles. The article is organized as follows. First, a combinatorial generalization task is introduced, and feedforward backpropagation and GeneRec are compared, demonstrating the generalization problem with interactive networks. Corroborating evidence from recurrent backpropagation is then presented. Then inhibitory competition and Hebbian learning are discussed and the Leabra implementation summarized. Then generalization results for Leabra are presented, with a variety of analyses of its performance, followed by an exploration of a version of the combinatorial generalization task with exceptions and an entirely different generalization task involving handwritten digit recognition.
2 Combinatorial Generalization and Interactivity The task used to explore combinatorial generalization was constructed with several different desiderata in mind. First, it has a simple combinatorial structure, such that novel inputs can be composed from combinations of a basic vocabulary of features. This combinatorial structure is implemented by having four different input-output slots, where the output mapping for a given slot depends only on the corresponding input pattern for that slot (similar to the tasks studied by Brousse, 1993, and Noelle & Cottrell, 1996, and the example shown in Figure 1). One could think of this as the letters in a word mapping to corresponding phonemes, except that unlike English, this mapping is completely regular (a combination of regular and irregular cases is explored later). Each slot has a vocabulary of input-output mappings. The second desideratum is that there is some interesting substructure to the vocabulary mapping within each slot, which establishes that a slot is a coherent, interdependent collection of units. Specifically, the input vocabulary consists of all 45 combinations of 5 horizontal and 5 vertical bars in a 5 × 5 grid, and the output mapping is a localist identification of the two input bars (this is similar to the bars tasks used by Földiák, 1990; Saund, 1995; Zemel, 1993; Dayan & Zemel, 1995). The third desideratum is that the structure of the task should be visibly evident in the weight patterns, which is accomplished by the use of the bars, so that it should be easy to examine the trained weight patterns for evidence of having represented the basic elements of the task. The resulting network with an example input-output pattern is shown in Figure 2. The total possible number of distinct input patterns is approximately 4.1 million (45⁴), but the networks are trained on only 100 randomly constructed examples.
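The combinatorics just described are easy to verify. The sketch below follows the stated task definition (two bars chosen from 5 horizontal plus 5 vertical in a 5 × 5 grid, four slots); the cell-set representation is our own choice, not the article's:

```python
from itertools import combinations

def bar_patterns(size=5):
    """All combinations of two distinct bars (horizontal or vertical)
    in a size x size grid; each pattern is its set of active cells."""
    bars = []
    for i in range(size):
        bars.append(frozenset((i, c) for c in range(size)))  # horizontal bar
        bars.append(frozenset((r, i) for r in range(size)))  # vertical bar
    # C(10, 2) = 45 two-bar unions
    return [a | b for a, b in combinations(bars, 2)]

patterns = bar_patterns()
print(len(patterns))       # 45 patterns per slot
print(len(patterns) ** 4)  # 4100625 distinct four-slot inputs (~4.1 million)
```

Training on only 100 of roughly 4.1 million possible inputs is what makes any observed generalization on the 500 held-out test patterns meaningful.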
Thus, to the extent that any significant level of generalization is observed, this task serves as a demonstration of how neural networks can leverage a relatively small amount of experience to produce a huge range of generalized behavior. Five hundred randomly constructed testing patterns (all different from the 100 training patterns) were used to assess generalization performance, with generalization reported as a proportion of testing items with errors. Error was scored using the criterion that each output unit had to be on the right side of .5 according to the correct target pattern (i.e., using a unit-wise tolerance of .5). Ten different randomly initialized networks were run for 500 epochs each, and the results are averages over the best generalization performance (assessed every 25 epochs during training) of these 10 networks. 2.1 Basic Results. To assess the effect of interactivity, two different networks were compared on the combinatorial generalization task: a standard feedforward backpropagation network and an interactive GeneRec network using the symmetric, midpoint variation of the learning rule, which is equivalent to contrastive Hebbian learning (CHL) or a deterministic Boltzmann
Generalization in Interactive Networks

[Figure 2 appears here: the network, with Input, Hidden, and Output layers and four H/V slots in the input and output; see the caption below.]
Figure 2: Architecture of the combinatorial generalization network. The input and output are composed of four slots, with the input pattern for each slot being one of 45 different possible combinations of two horizontal and/or vertical bars in the 5 × 5 slot grid, and the output pattern being a localist identification of each of the two lines (the first row of five output units for each slot representing the vertical lines and the second row representing the horizontal lines). Darkened units show an example input-output pattern.
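The task's combinatorial structure can be made concrete with a short generator (a reconstruction from the description in the text, not the original simulation code; names such as `make_example` are invented):

```python
import itertools
import random

GRID = 5      # each slot is a 5 x 5 grid
N_SLOTS = 4

# The 10 possible bars per slot: 5 horizontal rows and 5 vertical columns.
BARS = [('h', i) for i in range(GRID)] + [('v', i) for i in range(GRID)]
VOCAB = list(itertools.combinations(range(len(BARS)), 2))  # 45 bar pairs

def bar_pixels(bar):
    kind, idx = bar
    if kind == 'h':
        return {(idx, c) for c in range(GRID)}
    return {(r, idx) for r in range(GRID)}

def slot_pattern(pair):
    """Input: union of the two bars' pixels; output: localist
    identification of the two bars (2 of 10 identity units active)."""
    pixels = bar_pixels(BARS[pair[0]]) | bar_pixels(BARS[pair[1]])
    inp = [1.0 if (r, c) in pixels else 0.0
           for r in range(GRID) for c in range(GRID)]
    out = [1.0 if i in pair else 0.0 for i in range(len(BARS))]
    return inp, out

def make_example(rng):
    """Choose a vocabulary item independently for each of the 4 slots."""
    inp, out = [], []
    for _ in range(N_SLOTS):
        i, o = slot_pattern(rng.choice(VOCAB))
        inp += i
        out += o
    return inp, out

rng = random.Random(0)
train = [make_example(rng) for _ in range(100)]
# 45 ** 4 == 4,100,625 distinct input patterns (~4.1 million)
```

Because each slot's output depends only on that slot's input, any network that learns the 45-item slot vocabulary componentially can in principle generalize to all of the roughly 4.1 million combinations.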
machine (DBM):

Δw_ij = ε [(x_i⁺ y_j⁺) − (x_i⁻ y_j⁻)]    (2.1)
for sending and receiving unit activations x_i and y_j, respectively, in the minus (target unclamped) and plus (target clamped) phases (see O'Reilly, 1996a, for details). A learning rate (ε) of .01 was used for both networks, without any momentum or other factors. The basic generalization results for networks with 100 hidden units are shown in Figure 3, which clearly shows that the interactivity of GeneRec impairs generalization considerably. The interactive GeneRec network makes roughly 90% errors, while the backpropagation network makes only 40% errors.

2.2 Hidden Layer Size. It is generally thought that fewer degrees of freedom should produce more constrained learning and thus better generalization. The results for different numbers of hidden units, shown in Figure 4, indicate that in this task, more hidden units produce better generalization, with the 100 hidden-unit case (shown in Figure 3) producing the best performance (larger numbers of hidden units produce somewhat better performance still, but with decreasing gains). This result goes against the standard idea that more hidden units result in overfitting, but overfitting
Randall C. O’Reilly
[Figure 3 appears here: bar chart "Bp vs GeneRec, 100 Hidden"; y-axis: Generalization Error (0.0–1.0); bars for Bp and GeneRec.]
Figure 3: Generalization results (proportion error on the test set) comparing standard feedforward backpropagation (Bp) with interactive GeneRec. GeneRec generalizes very poorly.
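Equation 2.1's update can be sketched in a few lines (a minimal illustration assuming rate-coded activation vectors; the function name `chl_update` is invented):

```python
import numpy as np

def chl_update(W, x_minus, y_minus, x_plus, y_plus, lrate=0.01):
    """Midpoint CHL / GeneRec update (equation 2.1):
    dW[i, j] = lrate * (x_plus[i] * y_plus[j] - x_minus[i] * y_minus[j]),
    for sending activations x and receiving activations y in the minus
    (target unclamped) and plus (target clamped) phases."""
    return W + lrate * (np.outer(x_plus, y_plus)
                        - np.outer(x_minus, y_minus))
```

The update is purely local to each connection: it needs only the two phases' activation coproducts, which is what makes the rule biologically plausible relative to backpropagation.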
[Figure 4 appears here: two bar charts, (a) "Bp, Hidden Units" and (b) "GeneRec, Hidden Units"; y-axis: Generalization Error (0.0–1.0); x-axis: # of Hidden Units (36, 49, 64, 81, 100).]
Figure 4: Effect of number of hidden units on generalization performance in (a) standard feedforward backpropagation (Bp) and (b) interactive GeneRec. Larger hidden layers perform better.
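The error criterion used in these figures (each output unit on the right side of .5, and an item counted as an error if any unit fails) might be scored as follows (a sketch; the function names are invented, not from the original simulations):

```python
def has_error(output, target):
    """True if any output unit is on the wrong side of .5 relative to
    its binary target (i.e., outside the unit-wise tolerance of .5)."""
    return any((o >= 0.5) != (t >= 0.5) for o, t in zip(output, target))

def generalization_error(outputs, targets):
    """Proportion of testing items with at least one erroneous unit."""
    n_err = sum(has_error(o, t) for o, t in zip(outputs, targets))
    return n_err / len(targets)
```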
may not be as much of a problem here because the task is not noisy; the input-output mapping is completely deterministic. A likely explanation is that more hidden units allow for greater averaging over multiple partially redundant but idiosyncratic hidden units, producing more systematic overall behavior. Similar results were reported by Weigend (1994) and in the version of this task with exceptions discussed later. However, the handwritten digit recognition task has noisy input patterns, and evidence of overfitting with larger hidden layers is observed there.

2.3 Weight Decay. A commonly used bias or regularizing function is weight decay (Hinton, 1989a; Weigend et al., 1991). We implemented two
[Figure 5 appears here: two bar charts, (a) "Bp, Weight Decay" and (b) "GeneRec, Weight Decay"; y-axis: Generalization Error (0.0–1.0); bars for 0 weight decay and for E.001, E.002, E.005, S.001, S.002, and S.005.]
Figure 5: Effect of weight decay on generalization performance in (a) standard feedforward backpropagation (Bp) and (b) interactive GeneRec. First bar is 0 weight decay. Bars labeled S are for the given level of simple weight decay, while E bars are for the given level of weight-elimination weight decay. Simple weight decay at the .002 level produces the best results.
commonly used forms of weight decay in the Bp and GeneRec networks: simple weight decay and weight-elimination weight decay (Weigend et al., 1991). In simple weight decay, a constant fraction of the weight value is subtracted at each weight update; weight elimination is similar except that the rate of decay is a more complex function of the weight, such that larger weights suffer relatively less decay than smaller ones (supporting a prior assumption of a bimodal distribution of weight values: one population of larger weights that are actually useful and another of near-zero weights that are not useful; see Weigend et al., 1991, for details). The results with these forms of weight decay for the 100-hidden-unit network are shown in Figure 5. Although a small amount (.002; smaller amounts had progressively smaller effects) of simple weight decay appears to reliably improve generalization performance in both Bp and GeneRec, the difference is not substantial. The weight-elimination version of weight decay always appears to impair, rather than improve, performance. Although the specific forms of weight decay explored here were not especially successful on this task, it is possible that other forms might perform better. Nevertheless, most forms of weight decay are problematic from a psychological perspective because, at the decay levels that produce computational benefits, they predict a level of forgetting that is likely much stronger than what is observed in humans and animals.

2.4 Weight Patterns. An examination of the weights in the trained networks clearly shows why generalization is impaired in the interactive network (see Figure 6). The units have not carved the input-output mapping into separable subsets that can be independently combined for the novel
[Figure 6 appears here; see the caption below.]
Figure 6: Weights for a typical hidden unit in the GeneRec network after training, with the size of the square indicating magnitude and color indicating sign (white = positive, black = negative). Both feedforward weights from the input layer and feedback weights from the output layer are shown (the GeneRec weights are symmetric, so these feedback weights also indicate the nature of the hidden-to-output weights; also note that there are no within-layer connections, as indicated by the zero-value gray squares). It is clear that the units are not dividing up the task into separable mappings for each slot; instead, each unit participates in multiple slot mappings.
testing items; instead, each unit participates in the input-output mapping for multiple slots. Although this is true for both the backpropagation and GeneRec networks, it is particularly damaging for the interactive GeneRec network because of its attractor dynamics. In contrast, the feedforward backpropagation network is still capable of producing a roughly linear combination of hidden unit activations that yields reasonable (though far from perfect) levels of generalization.

2.5 Combinatorial Response Analysis: Direct Evidence for Conjunctive Attractor Dynamics. One way of more directly assessing the extent of conjunctive attractor dynamics in the interactive network is to measure its internal hidden representations in response to input patterns that differ systematically from each other by changes in one or more of the input slots. Because there is no relationship between the slots, the hidden layer should exhibit a combinatorial response as a function of the number of slots that are different. This is what would happen if the representations for each slot were independent of each other (as in Figure 1a). If, however, we see that the hidden layer representations are more different than the input patterns (i.e.,
exhibiting pattern separation; O'Reilly & McClelland, 1994), this suggests that conjunctive activation dynamics are causing the network to overreact to differences in the input. In other words, the hidden units are encoding specific conjunctions of input features across different slots, and therefore very different hidden representations are activated when the input changes (as in Figure 1b). In short, if we find that the hidden representations in the interactive network are more different than the corresponding differences in input patterns, then this supports the idea that conjunctive attractor dynamics are behind the poor generalization observed in these networks. To address this important question, two sets of 250 input pattern pairs were generated: one set whose pairs differed in only one slot (pairs overlapped by 75%) and one whose pairs differed in two slots (pairs overlapped by 50%). Thus, to the extent that the average pairwise hidden unit overlap falls below 75% and 50% for the first and second sets, respectively, the hidden patterns are exhibiting conjunctivity. We presented these patterns to the Bp and GeneRec networks and computed the normalized dot product (cosine) of the corresponding hidden representations (after settling, in GeneRec) for each set of pairs. To control for the baseline activation levels of the units, which would otherwise distort the similarity measure, the average activation values over all the test input patterns were subtracted from the pattern-wise activation values before computing the cosine. Figure 7 shows the results of this analysis, where it is clear that the interactive GeneRec network is representing the input patterns using consistently more different hidden representations than the feedforward network. It is particularly instructive that the random initial networks exhibit a much more proportional response, as would be expected when the weights are small and the units are in the linear range of the sigmoidal activation function.
However, when the GeneRec network's weights get larger, it exhibits a much more nonlinear response, indicative of the complex, conjunctive attractors that have developed over training. To substantiate further that the conjunctive attractors formed in the GeneRec network are really spurious conjunctions across different slots, and not the result of simply activating a trained attractor that was close to the novel input pattern, the output patterns need to be analyzed. If the GeneRec network is indeed producing spurious conjunctive attractors, one would expect the outputs to be largely uninterpretable noise (e.g., based on the random-looking weight patterns in Figure 6), and not well-formed outputs that correspond to trained patterns. To test this, the output activation patterns for each of the four output slots were compared to the entire vocabulary of possible output patterns within each slot. The output was counted as well formed if each of the slots matched (to within .5 unit-wise tolerance) one of these valid output patterns, and was considered malformed otherwise. Note that this analysis ignores whether the actual combination of patterns across slots was trained; it just treats each slot individually (i.e.,
[Figure 7 appears here: two bar charts, (a) "Hidden Response to 75% Ovlp Inputs" and (b) "Hidden Response to 50% Ovlp Inputs"; y-axis: Average Hidden Ovlp; Initial and Trained bars for Bp and GeneRec, with a "Proportional to Input" reference line in each panel.]
Figure 7: Average pairwise overlap (normalized dot product or cosine) between hidden patterns corresponding to inputs that differ by (a) 75% (one out of four slots different) and (b) 50% (two out of four slots different). Feedforward backpropagation (Bp) remains much closer to a linear response level (75% and 50% hidden similarity, respectively) after training compared to interactive GeneRec, which shows evidence of attractor dynamics pulling the network away from a linear, combinatorial response to the inputs.
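The overlap measure plotted in Figure 7 might be computed along the following lines (a sketch assuming NumPy arrays of hidden activations; the function names are invented):

```python
import numpy as np

def centered_cosine(h1, h2, mean_h):
    """Normalized dot product (cosine) between two hidden patterns,
    after subtracting the mean activation over all test patterns to
    control for baseline activation levels."""
    a, b = h1 - mean_h, h2 - mean_h
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def average_overlap(pairs, all_patterns):
    """Average pairwise overlap for a set of (h1, h2) pattern pairs."""
    mean_h = np.mean(all_patterns, axis=0)
    return float(np.mean([centered_cosine(h1, h2, mean_h)
                          for h1, h2 in pairs]))
```

Conjunctivity then shows up as this average falling below the corresponding input overlap (75% or 50%).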
allowing any of the approximately 4.1 million possible patterns to match). The number (out of 500 total patterns) of malformed outputs for the 75% and 50% overlap testing patterns is shown in Figure 8. These results, showing that 80% of the GeneRec outputs are malformed compared to only 40% for backprop, clearly support the idea that the interactive attractor dynamics and underconstrained weights in GeneRec lead to spurious, malformed conjunctive attractors.

2.6 Summary. An interactive network that is otherwise identical to a feedforward one generalizes much worse on this combinatorial generalization task. Manipulations of hidden units and weight decay had qualitatively similar effects on both types of networks, such that the interactive network appears to behave just like the feedforward one, but with a greatly elevated offset in generalization errors. This result, together with the analysis of the hidden layer responses and well-formedness of the outputs, clearly supports the idea that the spurious, conjunctive attractor dynamics of the interactive network interfere with its ability to produce systematic combinatorial representations of novel stimuli.

3 Comparison with Recurrent Backpropagation

Although GeneRec has been shown to compute essentially the same error gradients as backpropagation, even in networks with several hidden layers (O'Reilly, 1996a), one might argue that the poor generalization results from GeneRec are somehow an artifact of this particular algorithm. To ad-
[Figure 8 appears here: bar chart "Spurious (Malformed) Attractors"; y-axis: Total Malformed Outputs (0–500); 75% Ovlp and 50% Ovlp bars for Bp and GeneRec.]
Figure 8: Number of malformed outputs out of 500 total for the 75% and 50% overlap inputs. An output is well-formed if each of the slot outputs corresponds to one of the vocabulary of possible slot outputs, and malformed otherwise. The results support the idea that conjunctive attractor dynamics and underconstrained weights cause GeneRec to produce spurious (malformed) outputs.
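The well-formedness criterion of Figure 8 can be sketched as follows (a reconstruction: the 10 identity units per slot and the .5 tolerance follow the task description, while the function names are invented):

```python
import itertools

N_BARS = 10  # 5 horizontal + 5 vertical identity units per slot
N_SLOTS = 4

# The 45 valid slot outputs: exactly two identity units active.
SLOT_VOCAB = [[1.0 if i in pair else 0.0 for i in range(N_BARS)]
              for pair in itertools.combinations(range(N_BARS), 2)]

def slot_well_formed(slot_out, tol=0.5):
    """A slot output is valid if it matches some vocabulary pattern
    to within a unit-wise tolerance of .5."""
    return any(all(abs(o - v) < tol for o, v in zip(slot_out, valid))
               for valid in SLOT_VOCAB)

def is_malformed(output):
    """Malformed if any of the four slots fails to match a valid slot
    pattern; slots are scored independently, so any of the ~4.1 million
    cross-slot combinations can count as well formed."""
    slots = [output[i * N_BARS:(i + 1) * N_BARS] for i in range(N_SLOTS)]
    return not all(slot_well_formed(s) for s in slots)
```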
dress this concern and provide additional evidence for the idea that interactivity impairs generalization, networks using the Almeida-Pineda (AP) fixed-point recurrent backpropagation algorithm (Almeida, 1987; Pineda, 1987a, 1987b, 1988) were also run. This algorithm is a mathematically direct extension of feedforward error backpropagation to a recurrent attractor network having the same basic architecture as the GeneRec network. To the extent that this AP network also suffers from poor generalization, it substantiates the GeneRec results and shows that they are not artifactual. One important difference between AP and the CHL version of GeneRec is that GeneRec maintains the weight symmetry between feedforward and feedback weights, whereas AP does not. The net result is that GeneRec develops the feedback weights over the course of learning, whereas AP leaves them virtually untouched. Because it is the magnitude of these feedback weights that largely determines the extent of the interactive activation dynamics in the network, this is an important difference. By adjusting the random weight initialization in the AP network, we can control the average magnitude of the feedback weights to achieve three objectives: (1) show that these weights influence the level of generalization, such that the more interactive networks with stronger feedback weights generalize worse; (2) show that GeneRec's generalization is roughly what would be expected given the magnitude of its feedback weights; and (3) show that the Plaut et al. (1996) generalization results with a recurrent backpropagation network are as expected from a recurrent network with small, undeveloped feedback weights.
[Figure 9 appears here: bar chart "AP Generalization, 100 Hidden"; y-axis: Generalization Error (0.0–1.0); bars for Bp, AP.25, AP.5, AP1, and GeneRec.]
Figure 9: Generalization results for Almeida-Pineda (AP) with various levels of feedback weight variance (.25, .5, and 1.0), as compared with the Bp and GeneRec results shown in Figure 3. As the strength of these feedback weights increases, causing increased attractor dynamics, generalization performance decreases. Figure 10 shows that GeneRec's generalization is commensurate with AP's given the strength of its feedback weights.
Figure 9 shows the generalization performance of three different AP networks having .25, .5, and 1.0 uniform random initial weight variance (centered on 0), resulting in average feedback weight magnitudes (mean of the absolute values of all output-to-hidden weights) of roughly .125, .25, and .5, respectively. Clearly, generalization performance is a function of the strength of these weights, with the 1.0 variance case resulting in generalization performance roughly equivalent to that of GeneRec, while the .25 variance case produces generalization performance roughly comparable to that of the feedforward network, as was the case with the Plaut et al. (1996) model. Figure 10a compares the average feedback weight magnitudes from before and after training with those values from the GeneRec network. Commensurate with their generalization performance, the GeneRec network has slightly smaller feedback weights compared to the 1.0 variance AP network. Thus, GeneRec's generalization performance is as expected based on its level of interactivity as determined by these feedback weights. The correlation between the generalization scores and the trained feedback weight magnitude measure was very strong (r = .9705). Also evident from Figure 10 is the fact that GeneRec develops the feedback weights over training much more than AP does. Another way of measuring the level of interactivity in a network is in terms of settling time: the number of activation update cycles required to achieve a stable activation state (the stopping criterion was when the maximum
[Figure 10 appears here: two bar charts, (a) "Feedback Weight Magnitudes" (y-axis: Avg Abs Feedback Wt, 0.0–0.5; Initial and Trained bars) and (b) "Settling Times" (y-axis: Cycles To Settle, 0–200), each for AP.25, AP.5, AP1, and GeneRec.]
Figure 10: Measures of the level of interactivity in the AP and GeneRec networks: (a) average feedback weight magnitudes before and after training and (b) settling times (average number of cycles to achieve a stable activation state). These measures are strongly correlated (r = .9705 and r = .9998, respectively) with the generalization performance shown in Figure 9, as well as with each other (r = .9733). Also, GeneRec develops the feedback weights much more than AP does.
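Settling time as measured here is simply the number of update cycles until no unit's activation changes by more than .01 in one cycle. A generic sketch (the update function stands in for the network's actual activation dynamics and is an assumption; `settle` is an invented name):

```python
import numpy as np

def settle(act, update_fn, threshold=0.01, max_cycles=1000):
    """Iterate activation updates until the maximum per-unit change in
    one cycle falls below `threshold`; return the final state and the
    cycle count (the settling time used as an interactivity measure)."""
    for cycle in range(1, max_cycles + 1):
        new_act = update_fn(act)
        if np.max(np.abs(new_act - act)) < threshold:
            return new_act, cycle
        act = new_act
    return act, max_cycles
```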
activation change of any unit in the network in one cycle went below .01). A longer settling time is indicative of a more complex activation trajectory and thus of more extensive attractor dynamics. As shown in Figure 10b, settling time is well correlated with both the strength of the feedback weights (r = .9733) and the generalization performance (r = .9998). To summarize, the AP results strongly corroborate the conclusion reached previously that the conjunctive attractor dynamics produced by interactive networks directly impair generalization performance.

4 Inhibitory Competition and Hebbian Learning

Although it is possible that any of a number of different biases could be added to the interactive GeneRec network to improve generalization performance, this article focuses on two such biases that have been widely used and can be independently motivated for a variety of biological, psychological, and computational reasons. These biases and their implementation in the Leabra algorithm are discussed in the subsequent sections.

4.1 Inhibitory Competition. Roughly 15 to 20% of the neurons in the cortex are GABAergic inhibitory interneurons (White, 1989; Gabbot & Somogyi, 1986). The importance of these neurons for controlling the positive feedback loops between excitatory cortical pyramidal neurons is revealed, for example, in the epileptic-like effects of GABA antagonists (Grinvald, Frostig, & Lieke, 1988). Thus, any realistic model of the cortex should include a role for inhibitory competition. Many neural network formalisms
[Figure 11 appears here: two panels, (a) "No Inhib" and (b) "Inhib (kWTA)", each plotting Unit 2 against Unit 1.]
Figure 11: Illustration in two dimensions (units) of how kWTA inhibition can restrict the activation dynamics of an interactive network. (a) A network without inhibition, which can explore all possible combinations of activation states. (b) The effect of inhibition in restricting the activation states that can be explored to those having a roughly constant level of activity (e.g., under the kWTA inhibition function described in the text), leading to faster and more constrained attractor dynamics.
show the computational advantages of inhibitory competition (Grossberg, 1976; Kohonen, 1984; McClelland & Rumelhart, 1981; Rumelhart & Zipser, 1986). One such advantage is that inhibitory competition allows only the most strongly excited representations to prevail, with this selection process identifying the most appropriate representations for subsequent processing. Furthermore, learning mechanisms are affected by this selection process such that only the selected representations are refined over time through learning, resulting in an effective differentiation and distribution of representations. In the context of the generalization problem, the selection and distribution effects of inhibitory competition can result in individual units specializing on the input-output mapping for separable components (e.g., the slots in the combinatorial problem explored in this article). When the units are specialized in this way, combinatorial representations are facilitated. Furthermore, inhibition has the effect of greatly constraining the activation dynamics of an interactive network, as illustrated in Figure 11. Thus, a network with inhibitory competition should settle more rapidly and with less influence from the kinds of attractor dynamics that can impair generalization. Another way of thinking about the benefits of inhibitory competition comes from the idea that given the general structure of the natural environment, sparse distributed representations (i.e., with relatively few units active at a time) are particularly useful (Barlow, 1989; Field, 1994). For example, in visual processing, a given object can be defined along a set of feature dimensions (e.g., shape, size, color, texture), with a large number of different values along each dimension (e.g., many different possible shapes, sizes, colors, textures).
Assuming that these feature values are encoded by individual units (or small groups thereof) in a distributed representation, the overall representation of a given object will activate only a relatively small subset
of the entire set of feature units (i.e., the representations will be sparse). More generally, it seems as though the world can be usefully represented in terms of a large number of categories with a large number of exemplars per category (animals, furniture, trees, etc.). If we again assume that only a relatively few such exemplars are processed at a given time, a bias favoring sparse representations is appropriate. The combinatorial generalization problem has this structure, with only a few of the possible different bar stimuli present at any given time. To further substantiate the argument in favor of using sparse distributed representations, Olshausen and Field (1996) developed a simulation showing that imposing a bias toward this type of representation can result in the development of realistic early visual representations (oriented edge detectors) from natural visual scenes. The close fit between the simulated representations and those in the early visual system suggests that the visual system also uses a sparse distributed representational scheme. Although seemingly straightforward, achieving a sparse distributed representation is technically challenging, primarily because this case is difficult to analyze mathematically within a probabilistic learning framework. The problem is one of combinatorial explosion: one needs to take into account all the different possible combinations of active and inactive units to analyze a sparse distributed representation based on true inhibitory competition.
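The kWTA inhibition illustrated in Figure 11 can be sketched generically as follows (an illustration only, with an invented function name; Leabra's actual kWTA function, described in the appendix, computes an inhibitory current placed between the k-th and (k+1)-th strongest excitatory drives rather than applying a hard mask):

```python
import numpy as np

def kwta(net, k):
    """Generic k-winners-take-all: only the k units with the strongest
    excitatory drive become active; all others are inhibited to zero,
    yielding a sparse distributed pattern over the k winners."""
    act = np.zeros_like(net)
    winners = np.argsort(net)[-k:]          # indices of the top-k drives
    act[winners] = 1.0 / (1.0 + np.exp(-net[winners]))  # sigmoid of drive
    return act
```

This keeps roughly constant total activity regardless of the input, which is what restricts the attractor dynamics in Figure 11b.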
Thus, sparse distributed representations fall in a complex intermediate zone between two easily analyzed frameworks (Hinton & Ghahramani, 1997): (1) the winner-take-all (WTA) framework (Rumelhart & Zipser, 1986; Grossberg, 1976; Nowlan, 1990), where only one unit is allowed to be active at a time; having a single active unit eliminates the combinatorial problems but also rules out distributed representations; and (2) the independent units framework, where the units are considered to be (conditionally) independent of each other (e.g., a standard backpropagation network); this allows the combined probability of an activation pattern to be represented as a simple product of the individual unit probabilities (and allows distributed representations), but there is no inhibitory competition. There have been a number of attempts to remedy the limitations of these two analytical frameworks, by introducing distributed representations within a basically WTA framework, or by introducing sparseness constraints within the independent units framework. However, the limitations of these frameworks are difficult to overcome. Basically, any use of WTA prevents the cooperativity and combinatoriality of true distributed representations, and the need to preserve independence among the units in the independent units framework prevents the introduction of any true activation-based competition (see the discussion in O'Reilly, 1998). Leabra achieves sparse distributed representations with true competition by using a k-winners-take-all (kWTA) mechanism, which generalizes the WTA approach to k winners (Majani, Erlanson, & Abu-Mostafa, 1989). A kWTA mechanism can enforce true competition among the units, while al-
lowing for a (sparse) distributed representation across the subset of k units. kWTA mechanisms have been analyzed for factors such as stability and convergence onto k units and can be implemented with biologically plausible lateral inhibition mechanisms (Majani et al., 1989; Fukai & Tanaka, 1997). However, they have not been analytically treated within a probabilistic learning framework, due to the combinatorial explosion problems. Nevertheless, the simple form of kWTA used in Leabra (described in detail in the appendix) works well in bidirectionally connected networks and has been shown to be useful for modeling a wide range of cognitive phenomena (O'Reilly & Munakata, 2000).

4.2 Hebbian Learning. Although Hebbian learning has typically been used as the primary form of learning in a range of models, it is used here as a supplemental form of learning to complement a network learning primarily from error-driven GeneRec. Hebbian learning by itself is simply not powerful enough to enable the learning of even moderately complex input-output mappings and so cannot be considered a self-sufficient learning mechanism (as we will see below, Hebbian learning alone cannot learn the combinatorial generalization task explored in this article). However, there are a number of ways of understanding why the use of Hebbian learning, computed on the plus-phase activations to reinforce the correct patterns, is beneficial in the context of error-driven learning. First, error-driven and Hebbian learning have two complementary learning objectives, which can be referred to as task learning and model learning, respectively. Error-driven learning accomplishes task learning because it is specifically (and exclusively) driven by the demands of the task (i.e., the discrepancies between actual and desired outputs). In contrast, model learning has the objective of learning an internal model of the environment irrespective of specific tasks.
From a machine learning perspective, task learning amounts to learning the conditional probability distribution of the output given the input, while model learning amounts to learning the entire (joint) probability distribution (e.g., Rubinstein & Hastie, 1997).¹ Given that Hebbian and error-driven learning have complementary objectives, combining them makes sense and can generally lead to a more constrained, strongly biased learning algorithm. At a more detailed level, the benefits and limitations of Hebbian and error-driven learning can be understood in terms of how locally they operate. Hebbian learning is limited in what it can learn because each unit is driven only by the local correlational structure present in its inputs; there is no overall objective function guiding these local weight changes to achieve
¹ As implemented in Hebbian learning, model learning really amounts to representing only the correlational aspects of the environmental structure, not the entire joint probability distribution.
a more distal output pattern somewhere else in the network. In contrast, error-driven learning is ultimately driven by just such a distal error function, which is why it can learn complex problems using intermediate hidden layers. One of the benefits of combining Hebbian with error-driven learning can thus be understood as enabling learning to take place even when the distal error signals are weak or inconsistent across trials. Furthermore, units in a purely error-driven network are typically underconstrained by the task at hand and are also highly dependent on the responses of other units for determining what they end up representing. This extreme interdependence and lack of local constraint can lead to slow and ineffective learning in large, many-layered networks. To the extent that Hebbian learning provides a useful bias that is applied locally to each unit, it can shape learning locally and apply individual constraints on unit representations, limiting the codependence problems. From a biological perspective, the combination of error-driven GeneRec learning and Hebbian learning actually provides a better fit to the known properties of synaptic modification (LTP/LTD) than does GeneRec alone (O'Reilly, 1996a). The qualitative signs of GeneRec and Hebbian weight changes are consistent except in two cases: (1) when the sending and receiving units are persistently active in both phases, Hebbian learning predicts a weight increase (LTP) while GeneRec predicts no weight change; and (2) GeneRec predicts a weight decrease (LTD) when the units are erroneously active (i.e., when the minus-phase activation coproduct is greater than that of the plus phase in the GeneRec learning rule, equation 2.1), while Hebbian learning predicts no weight change (assuming the plus-phase coproduct is zero).
Therefore, the combination of Hebbian and GeneRec is essentially just like Hebbian learning, except that LTD should occur when the units are erroneously active, meaning that the known Hebbian mechanisms are largely sufficient. Hebbian synaptic modification appears to operate on the concentration of postsynaptic calcium ions: lower levels of calcium produce LTD, and higher levels produce LTP (Artola, Brocher, & Singer, 1990; Lisman, 1989, 1994; Bear & Malenka, 1994). Under this mechanism, Hebbian LTP results from the calcium influx produced by persistent activation of sending and receiving neurons. This mechanism can also produce the LTD required by GeneRec for erroneously active units. Because erroneous activation means that there is initial activation (in the minus or expectation phase, where the actual network output is produced) followed by inactivation (in the plus or outcome phase, where the target network output is experienced), the initial activation will allow calcium to enter the cell, but the subsequent inactivation prevents the concentration from reaching LTP levels, resulting in LTD. In the context of the generalization problem, Hebbian learning can facilitate good generalization by encouraging units to represent the strongest aspects of the correlational structure in the input, which are the correla-
tions present in the activations of individual lines within a given slot and in the input-output mapping between lines and line identity units. Thus, Hebbian learning should encourage units to specialize on representing intraslot information, producing a more componential representation that can be recombined for novel inputs to produce good generalization. The implementation of Hebbian learning in Leabra is essentially the same as in competitive learning (Rumelhart & Zipser, 1986; Nowlan, 1990):
\Delta w_{ij} = \epsilon \left( x_i^+ y_j^+ - y_j^+ w_{ij} \right) = \epsilon\, y_j^+ \left( x_i^+ - w_{ij} \right)   (4.1)
for sending unit activation x_i^+ and receiving unit activation y_j^+ (both in the plus phase) with learning rate \epsilon. Rumelhart and Zipser (1986) showed that this equation can be interpreted as producing weight values that converge on the conditional probability that the sending unit is active given that the receiving unit is active, P(x_i = 1 | y_j = 1), assuming that the activation states reflect the probability that the unit is active (a detailed derivation is provided in the appendix). In the context of a competitive activation function, a given receiving unit is active only when a relatively large number of inputs align with its weights, meaning that this conditional probability learning comes to reflect the strong correlations in the inputs. As explained in detail in the appendix, this Hebbian learning term is combined additively with the GeneRec error-driven learning term at every connection in the network. One limitation of the Hebbian learning algorithm is that the weights linearly reflect the strength of the conditional probability. This linearity can limit the network's ability to focus on only the strongest correlations, while ignoring weaker ones. In the combinatorial generalization task, for example, any "spurious" correlations between input patterns across different slots will be faithfully encoded by linear Hebbian units and prevent good generalization. Thus, it is useful to include a further bias on the learning system to encode only the strongest correlations. One way of achieving this goal is to use subtractive normalization instead of the multiplicative normalization used in the competitive learning rule, because subtractive normalization moves the weights toward extrema, while multiplicative normalization retains a linear proportionality to the correlation strength (Miller & MacKay, 1994; Goodhill & Barrow, 1994).
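As a quick numerical check of this convergence claim, one can simulate the rule of equation 4.1 on synthetic binary patterns (the pattern statistics, learning rate, and iteration count below are invented for illustration):

```python
import random

# The CPCA Hebbian rule dw = eps * y * (x - w) drives w toward P(x=1 | y=1).
random.seed(0)
eps = 0.01
w = 0.5
p_x_given_y = 0.8  # true conditional probability of the sender given the receiver

for _ in range(20000):
    y = 1 if random.random() < 0.5 else 0           # receiver active on half the trials
    x = 1 if (y == 1 and random.random() < p_x_given_y) else 0
    w += eps * y * (x - w)                          # equation 4.1 (plus-phase values)

print(round(w, 2))  # close to p_x_given_y = 0.8
```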
However, the completely binary weights produced by subtractive normalization are not likely to be generally useful, especially in more complex tasks and in the context of error-driven learning, where graded weight values are required to balance the contribution of individual units to multiple different aspects of problems. The approach taken here is to introduce a contrast enhancement function that magnifies the stronger weights and shrinks the smaller ones in a parametric, continuous fashion. This contrast enhancement is achieved by passing the linear weight values computed by the learning rule through a
Figure 12: Effective weight value as a function of underlying linear weight value, showing contrast enhancement of correlations around the middle values of the conditional probability as represented by the linear weight value. The gain parameter, \gamma = 6, controls the sharpness of the sigmoid, and the midpoint of the function is shifted upward by the offset parameter \theta = 1.25.
sigmoidal nonlinearity of the following form (see Figure 12):

\hat{w}_{ij} = \frac{1}{1 + \theta \left( \frac{w_{ij}}{1 - w_{ij}} \right)^{-\gamma}},   (4.2)
where \hat{w}_{ij} is the contrast-enhanced weight value, and the sigmoidal function is parameterized by an offset \theta and a gain \gamma. The offset (which is multiplicative but moves the middle of the sigmoid up or down in an offset-like fashion) determines where the threshold for the nonlinearity is located, and the gain determines how sharp the nonlinearity is. A gain value of 6 and an offset of 1.25 work well for a wide range of problems (O'Reilly & Munakata, 2000). In the combinatorial generalization problem, a higher threshold (i.e., a higher \theta) for correlations to be reflected in the weights should produce even better isolation of the slot representations in the hidden layer; indeed, a threshold of 1.5 worked better than 1.25 and was used for the results reported here.

5 Generalization with Inhibitory Competition and Hebbian Learning

To test the importance of inhibitory competition and Hebbian learning for improving the generalization performance of an interactive error-driven network (GeneRec), a Leabra network was run on the combinatorial generalization task. The network had 100 hidden units with the kWTA k parameter set to 25 units active (25% activity is the default), the k was set to 8 for
Figure 13: Generalization results comparing Leabra (GeneRec plus inhibitory competition and Hebbian learning) with standard feedforward backpropagation (Bp) and interactive GeneRec. The additional biases in Leabra result in much better generalization than even feedforward backpropagation.
the output layer, and the learning rate was .01 (see the appendix for other standard parameters and equations). The results clearly indicate that these biases greatly facilitate generalization in an interactive network (see Figure 13, specifically the comparison between the two interactive networks using exactly the same error-driven learning rule, GeneRec and Leabra). In the best case of the 10 Leabra networks, generalization error was only .008. If we extrapolate this error on the 500 test cases to the entire space of test items, the network has achieved nearly perfect generalization to over 4 million different patterns based on learning only 100. The average generalization performance corresponds in the extrapolation to getting over 3.5 million problems correct based on the 100 training patterns. One indication that Hebbian learning is specifically important for the generalization performance of Leabra comes from the time course of generalization over training (see Figure 14). Although the network learns the task after only 10 epochs or fewer of training, generalization performance does not start to improve significantly until roughly 200 or more epochs. In contrast, the backpropagation network starts to generalize much earlier, and this generalization generally parallels learning performance on the task. Bp takes roughly 50 epochs to learn the task, during which time most of the generalization is achieved. An obvious way to determine the importance of Hebbian learning is to run a Leabra network without it; this is then just a GeneRec network with inhibitory competition. Such a network performs quite badly on this task, having an average generalization error of .992 (±.00109 standard error of the mean [SEM]). Thus, it is clear that Hebbian learning is essential.
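The extrapolation arithmetic behind these figures is straightforward (the 4.1 million total for the full pattern space is taken from the discussion section; everything else follows from the reported error rates):

```python
# Extrapolating test-set error to the full combinatorial space.
total_patterns = 4_100_000       # approximate size of the full pattern space
best_err = 0.008                 # best of 10 Leabra networks, on 500 test items

best_correct = total_patterns * (1 - best_err)
print(f"{best_correct:,.0f}")    # roughly 4.07 million correct, i.e., nearly perfect
```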
Generalization in Interactive Networks
1223
Figure 14: Time course of generalization in Leabra (best generalizing network) and Bp, showing that generalization performance in Leabra starts to improve well only after the task has been learned by error-driven learning (which occurs within the first 10 epochs). This identifies Hebbian learning as specifically important. In contrast, generalization in Bp begins much more rapidly and more closely parallels the time course of learning (Bp takes roughly 50 epochs to learn the task).
However, Hebbian learning cannot even begin to learn this task on its own, as a run of Leabra with only Hebbian learning (together with the kWTA inhibitory competition) demonstrates (see Figure 15). Note that Hebbian learning without inhibitory competition does not typically work very well, because the positive feedback nature of Hebbian learning requires the selection and differentiation pressure of inhibitory competition. Thus, although it is insufficient on its own in this task, inhibitory competition is nevertheless playing an important role. Other tasks have shown that inhibitory learning can make an important contribution on its own (O'Reilly & Munakata, 2000). An examination of the weights of a trained Leabra network (see Figure 16) shows that individual hidden units are parceling the task up into the logically separable components of individual lines within a given slot. Having represented the task in such a separable fashion, the units in the network can easily be recombined in novel ways to process the novel testing items, thereby achieving good generalization. As a useful simplification, one can understand the interaction between error-driven and Hebbian learning in forming these representations as follows: error-driven learning (together with inhibitory competition) ensures that different units take account of different aspects of the problem and that all aspects are covered such that the problem is actually solved. Hebbian learning operates slowly and refines the hidden unit representations so that they encode only the strongest correlations; because these are the individual line elements, the weights
Figure 15: Time course of learning for a version of Leabra with only Hebbian learning (Hebb only), compared to the standard case with both error-driven and Hebbian learning (Leabra). The kWTA inhibitory competition is present in both cases.
come to emphasize these and exclude inputs from other slots, which are more weakly correlated. Given the kind of combinatorial representations learned by the Leabra network, it is not too surprising that it exhibits an almost exactly proportional response to the 75% and 50% overlapping test items used to analyze the Bp and GeneRec networks previously (see Figure 17). Furthermore, Leabra produces only a small proportion of spurious (malformed) outputs compared to Bp or GeneRec (see Figure 18). Interestingly, the random initial weights in the untrained Leabra network produce a significant amount of pattern separation, which then disappears after training. This is consistent with the idea that sparser levels of activation, with random weights, produce more pattern separation (O'Reilly & McClelland, 1994). Thus, the benefits of inhibitory competition for damping attractor dynamics come at the cost of increasing pattern separation with random weight patterns. Over learning, the inhibitory competition provides a beneficial pressure for shaping the weights (as discussed previously) and overcomes the initial conjunctivity bias.

6 Generalization with Exceptions

One potential objection to the above results showing an advantage for inhibitory competition and Hebbian learning is that these biases might be so specific to the purely regular combinatorial case that the network would break down in a task that also had irregularities or exceptions to the general rule. The ability to handle both regularities and exceptions is important for dealing with many language phenomena, including the well-studied cases
Figure 16: Weights for a typical hidden unit in the Leabra network after training (see Figure 6 for details). Both feedforward weights from the input layer and feedback weights from the output layer are shown. (The Leabra weights are symmetric, so these feedback weights also indicate the nature of the hidden-to-output weights. Also note that there are no within-layer connections, as indicated by the zero-value gray squares; the kWTA inhibition in Leabra is computed directly, not via inhibitory weights.) It is clear that the units are dividing up the task into separable mappings for each slot — each unit participates in mapping a single line within a slot (in this case, the middle horizontal line in the lower left-hand slot) to its corresponding identification output. This division enables combinatorial generalization, explaining the good generalization results. Note that this unit, like many others, also had a weak representation of inputs in other slots, presumably due to relatively strong spurious correlations across slots.
of the English past tense and spelling-to-sound mappings (Rumelhart & McClelland, 1986; Seidenberg & McClelland, 1989; Plaut et al., 1996; Plunkett & Marchman, 1996). We can explore this issue by introducing exceptions into the combinatorial generalization task explored here. Exceptions were generated for any pattern having a vertical bar in the last (right-most) position in the final (fourth) slot (like the past-tense inflectional suffix that appears at the end of a word), with the consequence that the output pattern for the third slot was repeated in the fourth slot, instead of this fourth output slot reflecting the actual inputs from the fourth slot. Thus, these exceptions required there to be interactions between slots, instead of their being completely independent, as in the regular mapping (which was exactly as in the previous task). To equalize the opportunity for generalization compared to the fully regular task, the networks were trained on 100 randomly generated regular cases (as before) plus 27 randomly generated
Figure 17: Average pairwise overlap (normalized dot product or cosine) between hidden patterns corresponding to inputs that differ by (a) 75% (one out of four slots different) and (b) 50% (two out of four slots different). Although with random weights the inhibitory competition in Leabra results in significant pattern separation (very different hidden representations for similar inputs), training produces a network with proportional responding to input differences.
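The overlap measure used in Figure 17 is just the normalized dot product between two hidden activation vectors; a minimal sketch (the activation patterns here are made up):

```python
import math

def overlap(a, b):
    """Normalized dot product (cosine) between two activation patterns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

h1 = [1, 0, 1, 0, 1, 0]
h2 = [1, 0, 1, 0, 0, 1]  # shares two of its three active units with h1
print(round(overlap(h1, h2), 3))  # 0.667
```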
Figure 18: Number of malformed outputs out of 500 total for the 75% and 50% overlap inputs. An output is well-formed if each of the slot outputs corresponds to one of the vocabulary of possible slot outputs, and malformed otherwise. The Leabra network produces largely well-formed outputs, consistent with the generalization results.
irregulars (patterns were generated at random until 100 regulars had accumulated; the proportion of irregulars is close to the expected 20% having a vertical line in one out of five positions). The generalization results for the three main algorithms (Leabra, Bp, GeneRec) are shown in Figure 19, where it is clear that the biases in Leabra still enable better generalization even in a task with exceptions. The overall level of generalization was worse than in the fully regular case, as one would
Figure 19: Generalization results (proportion error on test set of 500 items, and broken down into proportion error for regulars and exceptions separately) for the three algorithms (Leabra, Bp, and GeneRec) on the problem with exceptions. Although generalization performance is naturally worse than in the purely regular case, Leabra still performs better than the other algorithms (substantially better on regulars and marginally better on exceptions), indicating the advantages of the inhibitory competition and Hebbian learning biases.
expect given the more complex task being trained. However, the Leabra network is generalizing on the exceptional mapping better than either of the other networks (though not by much in the case of the Bp network), meaning that it is capable of systematically encoding more complex interactions between slots and exhibiting substantial generalization on that basis. Because the regular performance is better than the exception performance, it is clear that the biases in Leabra are specifically advantageous for the regular mapping, but this does not come at a cost to the exceptions. The training error curves (not shown) were basically the same as in the fully regular case, with just one epoch more required on average to reach a 0 error criterion (7.7 epochs in the fully regular case, 8.7 in the exception case). In examining the weights of the Leabra network (not shown), one could find units that encoded the exceptional cases by having input weights from the last position of the final slot, plus weights from the third slot, and a mapping to the fourth slot output appropriate for the third slot weights. To follow up on the issue of the effect of number of hidden units on generalization, Figure 20 shows that even in the task with exceptions, larger hidden layers produce better generalization. Again, these results can be understood in terms of the benefits of a larger sample of idiosyncratic units that appear to outweigh the overfitting costs associated with this task, which is still deterministic, even if somewhat more complex.

Figure 20: Generalization results for Bp with different hidden layer sizes on the problem with exceptions. Even with exceptions, a larger hidden layer improves generalization.

In other work, the Leabra algorithm has also been shown to exhibit systematic encodings of complexly interacting stimuli, resulting in good generalization. In particular, the spelling-to-sound model presented in O'Reilly and Munakata (2000) replicated the generalization performance of the backprop networks reported in Plaut et al. (1996), but did so using simpler and more surface-valid input-output representations than the carefully crafted representations used by Plaut et al. (1996). Table 1 shows the nonword generalization data from this model compared with Plaut et al. (1996) and human data on a range of different nonword lists. To pronounce the nonwords correctly, the model has to encode complex combinations of both independent and conjunctive (i.e., taking into account multiple letters) letter-phoneme mappings. Thus, it should be clear that the benefits of the biases in Leabra are not restricted to simple, purely regular mappings.

7 Generalization in Handwritten Digit Recognition

To substantiate further the generality of the results on the tasks studied above, an additional task with very different characteristics was explored. This task is handwritten digit recognition, which has been widely used in the literature. The specific digit set employed, from Nowlan (1990), has 48 different samples of each of the 10 digits (0-9) encoded as thresholded binary intensities in a 16 × 16 input grid. Because the digits are handwritten, they are highly variable (noisy) within a category and overlap considerably with those in other categories. The challenge is to form category boundaries that are sufficiently general to provide reasonable generalization performance to novel instances, based on these noisy and overlapping training exemplars. Because of the noisy inputs, there is a strong possibility for overfitting in
Table 1: Summary of Nonword Reading Performance in the Leabra Model Compared to the PMSP (Plaut et al., 1996) and Human Data.

Nonword Set                           Leabra   PMSP    People
Glushko (1979) regulars                95.3     97.7    93.8
Glushko (1979) exceptions              97.6    100.0    95.9
McCann & Besner (1987) controls        85.9     85.0    88.6
McCann & Besner (1987) homophones      92.3     N.A.    94.3
Taraban & McClelland (1987)            97.9     N.A.   100.0^a

Notes: The Glushko data show the results when alternative outputs that are consistent with the training corpus are allowed. ^a Human accuracy for the Taraban & McClelland set was not reported but was presumably near 100 percent (the focus of the study was on reaction times). Source: Data from O'Reilly & Munakata (2000).
this domain. Training was performed on 32 instances per digit, with generalization testing on the remaining 16. The generalization results for the three basic algorithms with 64 hidden units each are shown in Table 2a. This replicates the same pattern seen previously, where GeneRec generalizes considerably worse than feedforward backpropagation (Bp), and Almeida-Pineda (AP) performs intermediately (weight initialization variance was .5, producing moderate amounts of feedback weights that are not increased with training). Critically, the extra biases in Leabra produce much better generalization in a fully recurrent network, with generalization error slightly better than the Bp network with an equivalent number of hidden units. The effect of number of hidden units, shown in Table 2b, shows an opposite pattern from previous results, with more hidden units leading to worse generalization in Bp and to a lesser extent in GeneRec, but not in Leabra (which actually shows the opposite pattern). This is to be expected because this task has noisy inputs, which allow the units to overfit the specific, idiosyncratic features of the training inputs and thus not generalize as well to the novel testing inputs. However, the extra biases of the Leabra algorithm prevent this overfitting (e.g., the Hebbian learning forces the units to extract the central tendency of the input patterns corresponding to a given digit category), and it appears to benefit from a richer, more redundant distributed representation that comes from having a larger hidden layer. Finally, the fact that the best Bp network (with 15 hidden units) performs slightly better than Leabra (with 64 hiddens) is not of central relevance to the point of this article. The critical thing is that Leabra performs much better than GeneRec.
Table 2: Generalization in Digit Recognition Networks.

a.

Algorithm   Generalization Error   Standard Error of the Mean
Leabra      .131                   .00456
Bp          .151                   .00482
AP          .202                   .0126
GeneRec     .279                   .0323

b.

            Leabra                Bp                    GeneRec
Hiddens     Gen. Error   SEM      Gen. Error   SEM      Gen. Error   SEM
15          .221         .0127    .12          .00867   .236         .00809
30          .164         .00878   .131         .0106    .231         .0106
64          .131         .00456   .151         .00482   .279         .0323

Notes: (a) Generalization results for digit recognition for the different algorithms (64 hidden units), shown as proportion generalization error on the 160 testing items. (b) Effects of number of hidden units, with an indication of overfitting with larger hidden layers in Bp and to a lesser extent in GeneRec, but not in Leabra (which actually shows the opposite pattern).
8 Discussion

This article has shown that an attempt to make error-driven backpropagation learning more biologically plausible by using bidirectional activation propagation to communicate error signals has the unfortunate consequence of producing bad generalization. This impaired generalization can be attributed to the conjunctive attractor dynamics of an otherwise unconstrained interactive network, which interfere with the ability to systematically form novel combinatorial representations. However, by adding inhibitory competition and Hebbian learning to the interactive network, good generalization is restored, and the resulting algorithm satisfies a variety of constraints from the biological, psychological, and computational perspectives. The generalization results presented here demonstrate that neural networks are capable of substantial levels of generalization based on a relatively small sample of the environment (e.g., 100 training patterns out of 4.1 million possible, resulting in an average of 3.5 million correct responses). Such results should help to counter the persistent claims that neural networks are
incapable of producing systematic behavior based on statistical learning of the environment (Marcus, 1998; Pinker & Prince, 1988). There is reason to believe that the particular combination of mechanistic principles embodied by the Leabra algorithm has a wide range of applicability. One can identify six core such principles: interactivity, inhibitory competition, distributed representations, error-driven task learning, Hebbian model learning, and an overarching concern for biological plausibility. These principles reflect important themes that have played a central role in a large number of neural network models and theories over the years (see O'Reilly, 1998, for a review). This article adds to this existing body of research by showing that there can be important computational advantages to combining these mechanisms together into one coherent framework. In addition to the specific issue of generalization that was the focus of this article, this framework can be used to understand a wide range of cognitive phenomena and provides a unified way of implementing a number of existing cognitive models (O'Reilly & Munakata, 2000).

Appendix: Implementational Details

A.1 Pseudocode. The pseudocode for Leabra is given here, showing exactly how the pieces of the algorithm described in more detail in the subsequent sections fit together.

Outer loop: Iterate over events (trials) within an epoch. For each event:

1. Iterate over minus and plus phases of settling for each event.
   (a) At start of settling, for all units:
       i. Initialize all state variables (activation, V_m, etc.).
       ii. Apply external patterns (clamp input in minus, input and output in plus).
   (b) During each cycle of settling, for all nonclamped units:
       i. Compute excitatory net input (g_e(t) or \eta_j, equation A.3).
       ii. Compute kWTA inhibition for each layer, based on g_i^\Theta (equation A.7):
           A. Sort units into two groups based on g_i^\Theta: top k and remaining k+1 to n.
           B. If basic, find the kth and (k+1)th highest; if average based, compute the average of the 1 → k and the k+1 → n groups.
           C. Set inhibitory conductance g_i from g_k^\Theta and g_{k+1}^\Theta (equation A.6).
       iii. Compute point-neuron activation combining excitatory input and inhibition (equation A.1).
   (c) After settling, for all units:
       i. Record final settling activations as either minus or plus phase (y_j^- or y_j^+).

2. After both phases, update the weights (based on linear current weight values) for all connections:
   (a) Compute error-driven weight changes (equation A.8) with soft weight bounding (equation A.13).
   (b) Compute Hebbian weight changes from plus-phase activations (equation 4.1).
   (c) Compute net weight change as weighted sum of error-driven and Hebbian (equation A.14).
   (d) Increment the weights according to net weight change, and apply contrast enhancement (equation 4.2).

A.2 Point Neuron Activation Function. Leabra uses a point neuron activation function that models the electrophysiological properties of real neurons, while simplifying their geometry to a single point. This function is nearly as simple computationally as the standard sigmoidal activation function, but the more biologically based implementation makes it considerably easier to model inhibitory competition, as described below. Further, using this function enables cognitive models to be related to more physiologically detailed simulations more easily, thereby facilitating bridge building between biology and cognition. The membrane potential V_m is updated as a function of ionic conductances g with reversal (driving) potentials E as follows:

\frac{dV_m(t)}{dt} = \tau \sum_c g_c(t) \bar{g}_c (E_c - V_m(t))   (A.1)
with three channels (c) corresponding to e excitatory input, l leak current, and i inhibitory input. Following electrophysiological convention, the overall conductance is decomposed into a time-varying component g_c(t) computed as a function of the dynamic state of the network, and a constant \bar{g}_c that controls the relative influence of the different conductances. The equilibrium potential can be written in a simplified form by setting the excitatory driving potential (E_e) to 1 and the leak and inhibitory driving potentials (E_l and E_i) to 0:

V_m^\infty = \frac{g_e \bar{g}_e}{g_e \bar{g}_e + g_l \bar{g}_l + g_i \bar{g}_i},   (A.2)
Generalization in Interactive Networks
1233
which shows that the neuron is computing a balance between excitation and the opposing forces of leak and inhibition. This equilibrium form of the equation can be understood in terms of a Bayesian decision-making framework (O'Reilly & Munakata, 2000). The excitatory net input conductance g_e(t), or \eta_j, is computed as the proportion of open excitatory channels as a function of sending activations times the weight values:

\eta_j = g_e(t) = \langle x_i w_{ij} \rangle = \frac{1}{n} \sum_i x_i w_{ij}.   (A.3)
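A minimal numerical sketch of equations A.2 and A.3 (the activation, weight, and conductance values are illustrative, not the published defaults):

```python
def net_input(x, w):
    """Equation A.3: average of sending activations times weights."""
    return sum(xi * wi for xi, wi in zip(x, w)) / len(x)

def vm_equilibrium(ge, gl=0.1, gi=0.3, ge_bar=1.0, gl_bar=1.0, gi_bar=1.0):
    """Equation A.2 with E_e = 1 and E_l = E_i = 0."""
    e = ge * ge_bar
    return e / (e + gl * gl_bar + gi * gi_bar)

x = [1, 1, 0, 1]
w = [0.9, 0.8, 0.5, 0.7]
ge = net_input(x, w)                    # (0.9 + 0.8 + 0 + 0.7) / 4 = 0.6
print(round(vm_equilibrium(ge), 3))     # 0.6 / (0.6 + 0.1 + 0.3) = 0.6
```

Excitation (0.6) exactly balances leak plus inhibition (0.4) against the total here, so the equilibrium potential sits at the excitatory fraction of the total conductance.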
The inhibitory conductance is computed via the kWTA function described in the next section, and leak is a constant. Activation communicated to other cells (y_j) is a thresholded (\Theta) sigmoidal function of the membrane potential with gain parameter \gamma:

y_j(t) = \frac{1}{1 + \frac{1}{\gamma [V_m(t) - \Theta]_+}},   (A.4)
where [x]_+ is a threshold function that returns 0 if x < 0 and x if x > 0. Note that if it returns 0, we assume y_j(t) = 0, to avoid dividing by 0. As it is, this function has a very sharp threshold, which interferes with graded learning mechanisms (e.g., gradient descent). To produce a less discontinuous deterministic function with a softer threshold, the function is convolved with a gaussian noise kernel, which reflects the intrinsic processing noise of biological neurons:

y_j^*(x) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-z^2/(2\sigma^2)} y_j(z - x)\, dz,   (A.5)
where x represents the [V_m(t) - \Theta]_+ value, and y_j^*(x) is the noise-convolved activation for that value. In the simulation, this function is implemented using a numerical lookup table, as an analytical solution is not possible.

A.3 k-Winners-Take-All Inhibition. Leabra uses a kWTA function to achieve sparse distributed representations, with two different versions having different levels of flexibility around the k out of n active units constraint. Both versions compute a uniform level of inhibitory current for all units in the layer as follows:

g_i = g_{k+1}^\Theta + q (g_k^\Theta - g_{k+1}^\Theta),   (A.6)

where 0 < q < 1 is a parameter for setting the inhibition between the upper bound of g_k^\Theta and the lower bound of g_{k+1}^\Theta. These boundary inhibition values
are computed as a function of the level of inhibition necessary to keep a unit right at threshold:

g_i^\Theta = \frac{g_e^* \bar{g}_e (E_e - \Theta) + g_l \bar{g}_l (E_l - \Theta)}{\Theta - E_i},   (A.7)
where g_e^* is the excitatory net input without the bias weight contribution. This allows the bias weights to override the kWTA constraint. In the basic version of the kWTA function, which is relatively rigid about the kWTA constraint, g_k^\Theta and g_{k+1}^\Theta are set to the threshold inhibition values for the kth and (k+1)th most excited units, respectively. Thus, the inhibition is placed exactly to allow k units to be above threshold and the remainder below threshold. For this version, the q parameter is almost always .25, allowing the kth unit to be sufficiently above the inhibitory threshold. In the average-based kWTA version, g_k^\Theta is the average g_i^\Theta value for the top k most excited units, and g_{k+1}^\Theta is the average of g_i^\Theta for the remaining n - k units. This version allows for more flexibility in the actual number of units active depending on the nature of the activation distribution in the layer and the value of the q parameter (typically between .5 and .7 depending on the level of sparseness in the layer, with a default value of .6). The simulations used the average-based version for the hidden layer (which can take advantage of the flexibility) with q = .6 (default), and the basic version for the output layer (which does not need the flexibility), with q = .25 (default). Performance using just the basic version for all layers was significantly worse (generalization error of 0.239) than the more appropriately biased case with average-based kWTA in the hidden layer (0.138) but was still better than backpropagation (0.401). Activation dynamics similar to those produced by the kWTA function have been shown to result from simulated inhibitory interneurons that project both feedforward and feedback inhibition (O'Reilly & Munakata, 2000).
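The basic kWTA computation (equations A.6 and A.7) can be sketched as follows; the conductance and reversal-potential values are illustrative, and bias weights are ignored (so g_e^* = g_e here):

```python
def g_theta(ge, theta=0.25, ge_bar=1.0, gl=0.1, gl_bar=1.0,
            e_e=1.0, e_l=0.0, e_i=0.0):
    """Equation A.7: inhibition that holds a unit exactly at threshold."""
    return (ge * ge_bar * (e_e - theta) + gl * gl_bar * (e_l - theta)) / (theta - e_i)

def kwta_inhibition(ge_list, k, q=0.25):
    """Equation A.6: place inhibition between the kth and (k+1)th thresholds."""
    gts = sorted((g_theta(ge) for ge in ge_list), reverse=True)
    return gts[k] + q * (gts[k - 1] - gts[k])  # g_{k+1} + q * (g_k - g_{k+1})

ge_list = [0.6, 0.5, 0.4, 0.3, 0.2]
gi = kwta_inhibition(ge_list, k=2)
above = sum(1 for ge in ge_list if g_theta(ge) > gi)
print(above)  # exactly k = 2 units remain above their inhibitory threshold
```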
Thus, although the kWTA function is somewhat biologically implausible in its implementation (e.g., requiring global information about activation states and using sorting mechanisms), it provides a computationally effective approximation to biologically plausible inhibitory dynamics.

A.4 Error-Driven Learning. As indicated in the text, Leabra uses the symmetric midpoint version of the GeneRec algorithm (O'Reilly, 1996a), which is functionally equivalent to the deterministic Boltzmann machine and contrastive Hebbian learning (CHL) (Hinton, 1989b; Movellan, 1990). The network settles in two phases — an expectation (minus) phase, where the network's actual output is produced, and an outcome (plus) phase, where the target output is experienced — and then computes a simple difference of a pre- and postsynaptic activation product across these two phases (as
Generalization in Interactive Networks
1235
shown in equation 2.1 and reproduced here),
Δ_err w_ij = (x_i^+ y_j^+) − (x_i^− y_j^−),   (A.8)
for sending unit x_i and receiving unit y_j in the two phases.

A.5 Hebbian Learning. The simplest form of Hebbian learning adjusts the weights in proportion to the product of the sending (x_i) and receiving (y_j) unit activations: Δw_ij = x_i y_j. The weight vector comes to be dominated by the principal eigenvector of the pairwise correlation matrix of the input, but it also grows without bound. Leabra uses essentially the same learning rule used in competitive learning or mixtures-of-gaussians (Rumelhart & Zipser, 1986; Nowlan, 1990), which can be seen as a variant of the Oja normalization (Oja, 1982). This learning rule was shown in equation 4.1 and is reproduced here:

Δ_hebb w_ij = x_i^+ y_j^+ − y_j^+ w_ij = y_j^+ (x_i^+ − w_ij).   (A.9)
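Both update terms, and the conditional-probability fixed point to which the Hebbian rule converges (derived in equations A.10–A.12 below), can be sketched numerically. The function names and the toy three-pattern setup are mine, for illustration only.

```python
import numpy as np

def chl_update(x_plus, y_plus, x_minus, y_minus):
    """Error-driven CHL term, eq. A.8: (x_i+ y_j+) - (x_i- y_j-), indexed [j, i]."""
    return np.outer(y_plus, x_plus) - np.outer(y_minus, x_minus)

def hebb_update(x_plus, y_plus, w):
    """Hebbian term, eq. A.9: y_j+ (x_i+ - w_ij), same [j, i] indexing."""
    return y_plus[:, None] * (x_plus[None, :] - w)

# Fixed-point check for a single weight: iterate the *expected* Hebbian update
# over a toy set of patterns t; the weight should converge to P(x_i | y_j).
P_x = np.array([0.9, 0.1, 0.5])    # P(x_i = 1 | t) for three patterns t
P_y = np.array([1.0, 1.0, 0.0])    # P(y_j = 1 | t)
P_t = np.full(3, 1.0 / 3.0)        # uniform pattern probabilities
w = 0.0
for _ in range(500):
    w += 0.1 * np.sum((P_y * P_x - P_y * w) * P_t)   # expected update, eq. A.10
fixed_point = np.sum(P_y * P_x * P_t) / np.sum(P_y * P_t)   # eq. A.11
```

Here the receiver is active only for the first two patterns, over which the sender is active with probability (0.9 + 0.1)/2 = 0.5, so both the iterated weight and the closed-form fixed point come out to P(x_i | y_j) = 0.5.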
After Rumelhart and Zipser (1986) and O'Reilly and Munakata (2000), we can see that this equation converges on the conditional probability that the sender is active given that the receiver is active. First, we write the activations as encoding the probability that the units are active for the current input pattern, indexed by t, and sum the weight changes over all patterns:

Δw_ij = ε Σ_t [P(y_j|t) P(x_i|t) − P(y_j|t) w_ij] P(t)
      = ε ( Σ_t P(y_j|t) P(x_i|t) P(t) − Σ_t P(y_j|t) P(t) w_ij ).   (A.10)
Then we set Δw_ij to zero and solve for the equilibrium weight value, resulting in:

0 = ε ( Σ_t P(y_j|t) P(x_i|t) P(t) − Σ_t P(y_j|t) P(t) w_ij )

w_ij = Σ_t P(y_j|t) P(x_i|t) P(t) / Σ_t P(y_j|t) P(t).   (A.11)
The interesting thing to note here is that the numerator Σ_t P(y_j|t) P(x_i|t) P(t) is actually the definition of the joint probability of the sending and receiving units being active together across all the patterns t, which is just
1236
Randall C. O’Reilly
P(y_j, x_i). Similarly, the denominator Σ_t P(y_j|t) P(t) gives the probability of the receiving unit being active over all the patterns, or P(y_j). Thus, we can rewrite the above equation as

w_ij = P(y_j, x_i) / P(y_j) = P(x_i|y_j),   (A.12)
at which point it becomes clear that this fraction of the joint probability over the probability of the receiver is just the definition of the conditional probability of the sender given the receiver.

A.6 Combining Error-Driven and Hebbian Learning. Error-driven and Hebbian learning are combined additively at each connection to produce a net weight change. Two equations are needed: a soft weight bounding equation to keep the error-driven component within the same 0–1 range as the Hebbian term, and the combination equation itself. Soft weight bounding with exponential approach to the 0–1 extremes is implemented using
Δ_sberr w_ij = [Δ_err]^+ (1 − w_ij) + [Δ_err]^− w_ij,   (A.13)
where Δ_err is the error-driven weight change, and the [x]^+ operator returns x if x > 0 and 0 otherwise, while [x]^− does the opposite, returning x if x < 0 and 0 otherwise. The net weight change equation combining error-driven and Hebbian learning (which also includes the learning rate parameter ε) uses a normalized mixing constant k_hebb:

Δw_ij = ε [k_hebb (Δ_hebb) + (1 − k_hebb)(Δ_sberr)].   (A.14)
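A minimal sketch of equations A.13 and A.14 (the helper names are mine, and ε is left as a parameter since its value is not given in this excerpt):

```python
import numpy as np

def soft_bound(d_err, w):
    """Soft weight bounding, eq. A.13: positive changes are scaled by (1 - w)
    and negative changes by w, giving exponential approach to 0 and 1."""
    return np.maximum(d_err, 0.0) * (1.0 - w) + np.minimum(d_err, 0.0) * w

def net_weight_change(d_hebb, d_err, w, k_hebb=0.02, eps=1.0):
    """Combined Leabra update, eq. A.14: normalized mixture of the Hebbian
    term and the soft-bounded error-driven term, times the learning rate."""
    return eps * (k_hebb * d_hebb + (1.0 - k_hebb) * soft_bound(d_err, w))
```

Note how the bounding works: as w approaches 1, positive error-driven changes shrink toward zero, and as w approaches 0, negative changes do, keeping the error-driven component in the same 0–1 range as the Hebbian term.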
Three different values of k_hebb were used to explore the relative contributions of error-driven and Hebbian learning: k_hebb = 0 gives purely error-driven learning (plus the kWTA inhibition constraint); k_hebb = .02 is the value used for combining Hebbian and error-driven learning, producing the main results; and k_hebb = 1 gives purely Hebbian learning (plus the kWTA inhibition constraint). Note that only a relatively small amount of Hebbian learning is required to achieve the benefits, in part because the Hebbian factor is present on every trial of learning, reliably shaping the weights according to the local correlations in the input.

Acknowledgments

I thank Yuko Munakata and Gary Cottrell for helpful comments. This work was supported in part by NIH program project grant MH47566 and NSF grant IBN-9873492.
References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In M. Caudill & C. Butler (Eds.), Proceedings of the IEEE First International Conference on Neural Networks (pp. 609–618). San Diego, CA.
Artola, A., Brocher, S., & Singer, W. (1990). Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature, 347, 69–72.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Bear, M. F. (1996). A synaptic basis for memory storage in the cerebral cortex. Proceedings of the National Academy of Sciences, 93, 3453.
Bear, M. F., & Malenka, R. C. (1994). Synaptic plasticity: LTP and LTD. Current Opinion in Neurobiology, 4, 389–399.
Brousse, O. (1993). Generativity and systematicity in neural network combinatorial learning (Tech. Rep. No. CU-CS-676-93). Boulder, CO: University of Colorado at Boulder, Department of Computer Science.
Crick, F. H. C. (1989). The recent excitement about neural networks. Nature, 337, 129–132.
Dayan, P., & Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, 7, 565–579.
Douglas, R. J., & Martin, K. A. C. (1990). Neocortex. In G. M. Shepherd (Ed.), The synaptic organization of the brain (pp. 389–438). Oxford: Oxford University Press.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6(4), 559–601.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64(2), 165–170.
Fukai, T., & Tanaka, S. (1997). A simple neural network exhibiting selective activation of neuronal ensembles: From winner-take-all to winners-share-all. Neural Computation, 9, 77–97.
Gabbot, P. L. A., & Somogyi, P. (1986). Quantitative distribution of GABA-immunoreactive neurons in the visual cortex (area 17) of the cat. Experimental Brain Research, 61, 323–331.
Geman, S., Bienenstock, E. L., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Glushko, R. J. (1979). The organization and activation of orthographic knowledge in reading aloud. Journal of Experimental Psychology, 5, 674–691.
Goodhill, G. J., & Barrow, H. G. (1994). The role of weight normalization in competitive learning. Neural Computation, 6, 255–269.
Grinvald, A., Frostig, R. D., & Lieke, E. (1988). Optical imaging of neuronal activity. Physiological Review, 68, 1285–1366.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Harm, M. W., & Seidenberg, M. S. (1999). Phonology, reading acquisition, and dyslexia: Insights from connectionist models. Psychological Review, 106, 491–528.
Hinton, G. E. (1989a). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.
Hinton, G. E. (1989b). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1, 143–150.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society (London) B, 352, 1177–1190.
Hinton, G. E., & McClelland, J. L. (1988). Learning representations by recirculation. In D. Z. Anderson (Ed.), Neural information processing systems, 1987 (pp. 358–366). New York: American Institute of Physics.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Kohonen, T. (1984). Self-organization and associative memory. New York: Springer-Verlag.
Levitt, J. B., Lewis, D. A., Yoshioka, T., & Lund, J. S. (1993). Topography of pyramidal neuron intrinsic connections in macaque monkey prefrontal cortex (areas 9 and 46). Journal of Comparative Neurology, 338, 360–376.
Lisman, J. E. (1989). A mechanism for the Hebb and the anti-Hebb processes underlying learning and memory. Proceedings of the National Academy of Sciences, 86, 9574–9578.
Lisman, J. (1994). The CaM Kinase II hypothesis for the storage of synaptic memory. Trends in Neurosciences, 17, 406.
Majani, E., Erlanson, R., & Abu-Mostafa, Y. (1989). On the K-winners-take-all network. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 634–642). San Mateo, CA: Morgan Kaufmann.
Marcus, G. F. (1998). Rethinking eliminative connectionism. Cognitive Psychology, 37, 243.
Marr, D. (1971). Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society (London) B, 262, 23–81.
McCann, R. S., & Besner, D. (1987). Reading pseudohomophones: Implications for models of pronunciation and the locus of word-frequency effects in word naming. Journal of Experimental Psychology, 13, 14–24.
McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102, 419–457.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88(5), 375–407.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126.
Movellan, J. R. (1990). Contrastive Hebbian learning in the continuous Hopfield model. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1989 Connectionist Models Summer School (pp. 10–17). San Mateo, CA: Morgan Kaufmann.
Noelle, D. C., & Cottrell, G. W. (1996). In search of articulated attractors. In G. W. Cottrell (Ed.), Proceedings of the 18th Annual Conference of the Cognitive Science Society (pp. 329–334). Hillsdale, NJ: Erlbaum.
Noelle, D. C., & Zimdars, A. L. (1999). Methods for learning articulated attractors over internal representations. In M. Hahn & S. C. Stoness (Eds.), Proceedings of the 21st Annual Conference of the Cognitive Science Society (pp. 480–485). Hillsdale, NJ: Erlbaum.
Nowlan, S. J. (1990). Maximum likelihood competitive learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 574–582). San Mateo, CA: Morgan Kaufmann.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607.
O'Reilly, R. C. (1996a). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8(5), 895–938.
O'Reilly, R. C. (1996b). The Leabra model of neural interactions and learning in the neocortex. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA.
O'Reilly, R. C. (1998). Six principles for biologically-based computational models of cortical cognition. Trends in Cognitive Sciences, 2(11), 455–462.
O'Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a tradeoff. Hippocampus, 4(6), 661–682.
O'Reilly, R. C., & Munakata, Y. (2000). Computational explorations in cognitive neuroscience: Understanding the mind by simulating the brain. Cambridge, MA: MIT Press.
O'Reilly, R. C., & Rudy, J. W. (2000). Computational principles of learning in the neocortex and hippocampus. Hippocampus, 10, 389–397.
O'Reilly, R. C., & Rudy, J. W. (in press). Conjunctive representations in learning and memory: Principles of cortical and hippocampal function. Psychological Review.
Pineda, F. J. (1987a). Generalization of backpropagation to recurrent and higher order neural networks. In D. Z. Anderson (Ed.), Proceedings of the IEEE Conference on Neural Information Processing Systems (pp. 602–611). New York: IEEE.
Pineda, F. J. (1987b). Generalization of backpropagation to recurrent neural networks. Physical Review Letters, 18, 2229–2232.
Pineda, F. J. (1988). Dynamics and architecture for neural computation. Journal of Complexity, 4, 216–245.
Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73–193.
Plaut, D. C., & McClelland, J. L. (1993). Generalization with componential attractors: Word and nonword reading in an attractor network. In Proceedings of the 15th Annual Conference of the Cognitive Science Society (pp. 824–829). Hillsdale, NJ: Erlbaum.
Plaut, D. C., McClelland, J. L., Seidenberg, M. S., & Patterson, K. E. (1996). Understanding normal and impaired word reading: Computational principles in quasi-regular domains. Psychological Review, 103, 56–115.
Plunkett, K., & Marchman, V. A. (1996). Learning from a connectionist model of the acquisition of the English past tense. Cognition, 61, 299.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982.
Rubinstein, Y. D., & Hastie, T. (1997). Discriminative vs. informative learning. In D. Heckerman, H. Mannila, & D. Pregibon (Eds.), Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (pp. 49–53). Menlo Park, CA: AAAI Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing. Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.
Rumelhart, D. E., & McClelland, J. L. (1986). On learning the past tenses of English verbs. In J. L. McClelland, D. E. Rumelhart, & PDP Research Group (Eds.), Parallel distributed processing. Vol. 2: Psychological and biological models (pp. 216–271). Cambridge, MA: MIT Press.
Rumelhart, D. E., & Zipser, D. (1986). Feature discovery by competitive learning. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing. Vol. 1: Foundations (pp. 151–193). Cambridge, MA: MIT Press.
Saund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation, 7, 51–71.
Seidenberg, M. S., & McClelland, J. L. (1989). A distributed, developmental model of word recognition and naming. Psychological Review, 96, 523–568.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart, J. L. McClelland, & PDP Research Group (Eds.), Parallel distributed processing. Vol. 1: Foundations (pp. 282–317). Cambridge, MA: MIT Press.
Taraban, R., & McClelland, J. L. (1987). Conspiracy effects in word pronunciation. Journal of Memory and Language, 26, 608–631.
Vapnik, V. N., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280.
Vecera, S. P., & O'Reilly, R. C. (1998). Figure-ground organization and object recognition processes: An interactive account. Journal of Experimental Psychology: Human Perception and Performance, 24, 441–462.
Weigend, A. (1994). On overfitting and the effective number of hidden units. In M. C. Mozer, P. Smolensky, & A. S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 335–342). Hillsdale, NJ: Erlbaum.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight-elimination with application to forecasting. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 875–882). San Mateo, CA: Morgan Kaufmann.
White, E. L. (1989). Cortical circuits: Synaptic organization of the cerebral cortex, structure, function, and theory. Boston: Birkhäuser.
Wolpert, D. H. (1992). On the connection between in-sample testing and generalization error. Complex Systems, 6, 47–94.
Wolpert, D. H. (1996a). The existence of a priori distinctions between learning algorithms. Neural Computation, 8, 1391–1420.
Wolpert, D. H. (1996b). The lack of a priori distinctions between learning algorithms. Neural Computation, 8, 1341–1390.
Zemel, R. S. (1993). A minimum description length framework for unsupervised learning. Unpublished doctoral dissertation, University of Toronto, Canada.
Zipser, D., & Andersen, R. A. (1988). A backpropagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331, 679–684.

Received March 11, 1999; accepted September 22, 2000.
NOTE
Communicated by Geoffrey Hinton
Optimal Smoothing in Visual Motion Perception

Rajesh P. N. Rao
Department of Computer Science & Engineering, University of Washington, Seattle, WA 98195, U.S.A.

David M. Eagleman
Sloan Center for Theoretical Neurobiology, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A.

Terrence J. Sejnowski
Howard Hughes Medical Institute, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California at San Diego, La Jolla, CA 92037, U.S.A.

When a flash is aligned with a moving object, subjects perceive the flash to lag behind the moving object. Two different models have been proposed to explain this "flash-lag" effect. In the motion extrapolation model, the visual system extrapolates the location of the moving object to counteract neural propagation delays, whereas in the latency difference model, it is hypothesized that moving objects are processed and perceived more quickly than flashed objects. However, recent psychophysical experiments suggest that neither of these interpretations is feasible (Eagleman & Sejnowski, 2000a, 2000b, 2000c), hypothesizing instead that the visual system uses data from the future of an event before committing to an interpretation. We formalize this idea in terms of the statistical framework of optimal smoothing and show that a model based on smoothing accounts for the shape of psychometric curves from a flash-lag experiment involving random reversals of motion direction. The smoothing model demonstrates how the visual system may enhance perceptual accuracy by relying not only on data from the past but also on data collected from the immediate future of an event.

1 Introduction
When subjects are presented with a flash that is aligned with a moving object, they perceive the flash to lag behind the moving object (MacKay, 1958; Nijhawan, 1994) (see Figure 1a). In order for subjects to perceive the flash as aligned with the moving object, the flash must be presented at a spatial location ahead of the moving object. A recent flurry of experiments has sparked interest in models that can explain this phenomenon (Ni-

© 2001 Massachusetts Institute of Technology. Neural Computation 13, 1243–1253 (2001).
1244
R. P. N. Rao, D. M. Eagleman, and T. J. Sejnowski
Figure 1: Flash-lag effect and predictions of two previous models. (a) Flash-lag effect. The moving bar (black) is perceived to be ahead of the flashed bar (shaded) even though the retinal images are physically aligned. (b) Actual (solid line) and perceived (dashed line) positions of the moving bar, as predicted by the motion extrapolation model for the motion reversal experiment. (c) Actual (solid line) and perceived (dashed line) positions, as predicted by the latency difference model.
jhawan, 1994; Baldo & Klein, 1995; Khurana & Nijhawan, 1995; Nijhawan, 1997; Lappe & Krekelberg, 1998; Purushothaman, Patel, Bedell, & Ogmen, 1998; Whitney & Murakami, 1998; Whitney, Murakami, & Cavanagh, 2000; Krekelberg & Lappe, 2000; Brenner & Smeets, 2000; Eagleman & Sejnowski, 2000a, 2000b, 2000c). The motion extrapolation model (see Figure 1b) (Nijhawan, 1994) assumes that the visual system extrapolates the location of moving objects to compensate for propagation delays as signals are transmitted from the retina to higher cortical areas. If left uncompensated, such delays would cause the perceived location of the moving object to lag significantly behind its actual location. Nijhawan suggested that extrapolation allows objects to be perceived at their actual location. In the latency difference model (Baldo & Klein, 1995; Purushothaman et al., 1998; Whitney & Murakami, 1998; Whitney et al., 2000), it is hypothesized that the moving object is perceived to
Optimal Smoothing in Visual Motion Perception
1245
be ahead of the flash due to shorter neural propagation delays for moving objects as compared to flashed objects (see Figure 1c). Recent experiments have revealed the shortcomings of both these models (Eagleman & Sejnowski, 2000a, 2000b, 2000c). In particular, both the extrapolation model and the latency difference model fail to provide a complete explanation for the psychometric curves obtained when the direction of motion of the moving object is abruptly reversed. We show that a model based on the engineering technique of optimal smoothing (Bryson & Ho, 1975) overcomes the limitations of both these models. In the smoothing model, perception of an event is not online but rather is delayed, so that the visual system can take into account information from the immediate future before committing to an interpretation of the event.

2 Motion Reversal Experiments
In an experiment designed to test the motion extrapolation model, Whitney and Murakami (1998) reversed the motion of a horizontally translating bar at a random time and location along its trajectory. A flash could appear at various times before or after motion reversal. The study tested where the flash needed to be placed in order to be perceived as aligned with the moving bar. According to the extrapolation model, at the point of motion reversal, the moving bar should be perceived at its extrapolated location, as depicted in Figure 1b (recall that the time of reversal is random and unknown to the subject). Contrary to this prediction, Whitney and Murakami reported that the perceived position of the moving bar never overshot the reversal point. Rather, the perceived location of the moving bar began deviating significantly from that predicted by the motion extrapolation model at approximately 60 to 75 milliseconds before the time of reversal. If extrapolation were indeed occurring, the bar's reversal must have been known before the actual reversal took place, an impossibility. Whitney and Murakami therefore concluded that their results supported the latency difference model. However, the latency difference model cannot by itself explain the rounding of the curve observed near the time of reversal, predicting instead a sharp reversal in the perceived location, as shown in Figure 1c. Whitney and Murakami (1998) suggested that the rounding may be due to neural delay variability or a spatiotemporal averaging filter, but other experiments have revealed more serious flaws in the latency difference model (Eagleman & Sejnowski, 2000a, 2000b, 2000c). For example, the flash-lag effect is preserved in the case where the bar starts moving at the same time t_0 as the flash that is aligned with it. In this "flash-initiated" paradigm (Khurana & Nijhawan, 1995; Eagleman & Sejnowski, 2000a), there is no past history of bar motion at time t_0 for a spatiotemporal filter to operate over.
The moving bar should suffer the same initial processing delay as the flashed stimulus: how could it still be perceived ahead of the flash? This suggests that the visual system is using motion information occurring after time t_0 to make a
judgment about the perceived location at time t_0, in effect using information from the immediate future to estimate a quantity in the recent past, a form of "postdiction." This interpretation has recently been proposed by Eagleman and Sejnowski (2000a), who show that their psychophysical results are best explained by the postdiction hypothesis. We show here that this hypothesis can be framed succinctly within the statistical framework of optimal smoothing (Bryson & Ho, 1975).

3 The Optimal Smoothing Model
The strategy of estimating a value in a time series based on future values (in addition to past values) is known as smoothing in the engineering literature (Bryson & Ho, 1975). Estimating a current value based only on past values, on the other hand, is called filtering, e.g., Kalman filtering (see Kalman, 1960). We have simulated the experiments of Whitney and Murakami using a simple dynamical model describing the linear motion and reversal of the bar in the presence of gaussian noise:

x(t + 1) = x(t) + c(t) y(t) + n(t),   (3.1)
where x(t) denotes the position of the bar at time t, c(t) y(t) is the increment or decrement in position for the next time step (c(t) = +1 initially, switching to −1 at a random time of reversal), and n(t) is zero-mean gaussian noise with variance σ². The increment amount y(t) is assumed to be constant except for additive zero-mean gaussian noise: y(t + 1) = y(t) + w(t). Finally, the position x is assumed to be corrupted by measurement noise m(t) before being observed by the subject: z(t) = x(t) + m(t), where m is again a gaussian noise process with zero mean and variance σ_m². An optimal linear filter (the Kalman filter; Kalman, 1960) was derived from the motion model above to estimate the most likely position x̂ of the bar at time t, given information about the current and past positions of the moving bar (see Bryson & Ho, 1975, for a derivation):

x̂(t) = x̄(t) + g(t)(z(t) − x̄(t))   (3.2)
x̄(t) = x̂(t − 1) + c(t − 1) ŷ(t − 1),   (3.3)
where g(t) is a gain term (see Bryson & Ho, 1975) and ŷ(t − 1) = y(0) = a (a determines the velocity of the bar, assumed to be constant in this case). Equations 3.2 and 3.3 can be explained as follows. At any given time t, the filter maintains an estimate x̄(t) of bar position x before a new measurement z(t) is obtained. This estimate is the best estimate of position using all previous measurements z(t − 1), . . . , z(0) and the motion model in equation 3.1. Note that x̄(t) is computed from x̂(t − 1), which in turn is computed from x̄(t − 1). Once the measurement z(t) is obtained, the filter computes a new estimate x̂(t) by correcting the old estimate x̄(t) using the mismatch
error (z(t) − x̄(t)). Thus, x̂(t) represents the best estimate of bar position after measuring z(t). The filter estimate for time t was then smoothed recursively using the estimates from the following time steps. This increases accuracy by allowing data from time steps in the future of t to influence, and possibly correct, the filter estimate at time t (Bryson & Ho, 1975):

x_sm(t) = x̂(t) + h(t)(x_sm(t + 1) − x̄(t + 1)),   (3.4)
where x_sm(t) is the smoothed estimate for time t given position information from time steps 1, . . . , N (N > t), and h(t) is a gain term (see Bryson & Ho, 1975). Note that since x_sm(t) depends not only on x̂(t) but also on x_sm(t + 1), which in turn depends on x_sm(t + 2), and so on, the smoothed estimate at time t relies not only on measurements from the past but also on measurements from future time steps relative to t. Smoothing corrects each position estimate x̂(t) by adding the error term h(t)(x_sm(t + 1) − x̄(t + 1)), which represents the mismatch between the smoothed and the predicted estimates at time t + 1.

The smoothing model described above can be used to account quantitatively for the flash-lag results involving motion reversal. Before doing so, it is useful to distinguish three times associated with an event: (1) event time, the time at which an event occurs in the real world; (2) neural activity time, the time at which a representation of the event is formed at a particular neural level; and (3) represented time (or subjective time) of the occurrence of the event. To see how these three times may differ, consider the case of recalling visual memories, say, of an event that occurred during college and an event that occurred in childhood. Clearly the event times are different for these two events, as are the represented times, both of which have a temporal order (childhood events before college events). However, the neural activity time, which is the time of recall of these memories, need not follow this temporal order. For the flash-lag effect, the latency difference model assumes that the neural activity time is the same as the represented time and that the neural activity time for the flash is later than that for the moving bar.
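Equations 3.1 through 3.4 can be simulated with the constant gains used later in the paper. This is a sketch, not the authors' Matlab code; in particular, the text does not specify how the filter infers the direction c(t) online, so here it is estimated from the sign of the change in the filtered position, which is my assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
g, h = 0.7, 0.5                 # constant filter and smoother gains
sigma, sigma_m = 0.01, 0.01     # process and measurement noise s.d.
a, N, rev = 1.0, 50, 25         # velocity, horizon, reversal step (fixed here)

# True motion and noisy measurements, eq. 3.1
c = np.where(np.arange(N) < rev, 1.0, -1.0)
x = np.zeros(N)
for t in range(N - 1):
    x[t + 1] = x[t] + c[t] * a + sigma * rng.standard_normal()
z = x + sigma_m * rng.standard_normal(N)

# Forward (Kalman-style) filter, eqs. 3.2-3.3
x_hat, x_bar = np.zeros(N), np.zeros(N)
c_hat = 1.0                     # assumed online direction estimate (see lead-in)
for t in range(1, N):
    x_bar[t] = x_hat[t - 1] + c_hat * a
    x_hat[t] = x_bar[t] + g * (z[t] - x_bar[t])
    c_hat = 1.0 if x_hat[t] >= x_hat[t - 1] else -1.0

# Backward smoothing pass, eq. 3.4
x_sm = x_hat.copy()
for t in range(N - 2, -1, -1):
    x_sm[t] = x_hat[t] + h * (x_sm[t + 1] - x_bar[t + 1])
```

Plotting x̄ and x_sm around the reversal step reproduces the qualitative behavior described in Section 4: the forward prediction overshoots the reversal point for a step or two, while the smoothed trajectory rounds off before it.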
The extrapolation model, on the other hand, assumes that for the moving bar, neural delays can be counteracted such that the represented time of the moving bar is equal to its event time. The smoothing model, in contrast, is illustrated in Figure 2. Suppose that the event time of the flash is t_0. We assume that the neural activity time of the flash is t_0 + Δ, where Δ is the neural propagation delay. Then, according to the smoothing model, the subject's perceived location of the bar at the time of the flash is given by the smoothed estimate x_sm(t_0 + Δ). Note that this estimate includes information from time steps up to t_0 + Δ + f, where f is the amount of time in the future used for smoothing. Thus, the estimate of an event that happened at time t_0 is retrospectively assigned after a minimum duration of Δ + f. The flash-lag effect occurs because the subject reports the
Figure 2: Event time, neural activity time, and represented time. F represents the flash; the strip of numbers represents successive positions of the moving object. In this example, the flash occurs at time t_0 (event time) when the moving object occupies position 4. After a neural propagation delay of Δ, neural activity pertaining to the flash begins at a particular level of the visual system ("neural activity" on the ordinate). After processing of further information, the results of the smoothing filter become available to consciousness at time t_0 + Δ + f. The represented time cannot be displayed on the same axis (real-world time); instead, it must be displayed on its own axis (subjective-world time). All studies of the flash-lag effect measure only the relative timing between flashed and moving objects, informing us in no way about real-world time. In the figure, t̃_0 is the perceived moment of the flash that occurred at time t_0 (the graphs are offset because the represented time t̃_0 cannot exist until smoothing is complete). In subjective time, the flash is aligned with position 6 (the smoothed position estimate for time t_0 + Δ). This misalignment is the flash-lag illusion. The absence of positions 1 and 2 in the percept reflects the Fröhlich effect, in which the initial positions of a moving object are not perceived.
location of the moving bar to be the smoothed estimate at time t_0 + Δ (see Figure 2). For the simulations, the following parameter values were used: g(t) = 0.7, h(t) = 0.5, σ = 0.01, σ_m = 0.01, a = 1, x(0) = 0, N = 50, x_sm(N) = x̂(N), and Δ = 45 milliseconds (two time steps in the simulations). Similar results were obtained when parameter values, such as the noise variances, were varied in the neighborhood of the values given above. The gain terms g(t)
Optimal Smoothing in Visual Motion Perception
1249
and h(t) can be made time-varying functions of the variances σ and σ_m (Bryson & Ho, 1975), but in the simulations, constant values such as the ones specified above were found to be sufficient for modeling the psychophysical results. (Matlab code for running these simulations is available online at http://www.cnl.salk.edu/~rao/smoothing.m.)

4 Results
Figures 3a and 3b show the perceived location of the moving bar for a human subject (data points) and for the optimal smoothing model (data points labeled x_sm), respectively. The perceived location estimated by the optimal filtering model (x) is given by the dotted line. The data shown were averaged over 100 trials with a single random reversal of motion in each trial. As seen in the figure, the smoothing model reproduces the rounding of the curve observed in human subjects (see Figure 3a), while the filtering model, which uses only past positions of the bar, overshoots at the point of reversal before correcting its estimate at subsequent time steps. This overshoot is avoided by the smoothing model because data from the immediate future are taken into account, producing a more accurate estimate of bar position. How many data points from the future are taken into account in the model? To answer this question, we computed the impulse response functions of the filter and the smoother that were used in the simulations. As seen in Figures 3c and 3d, both the filter and the smoother use input data from the current and approximately four previous time steps. However, the smoother also takes into account data from about four to five time steps into the future. In the model, this corresponds to a time interval of approximately 90 to 112 milliseconds (one time step ≈ 22.5 milliseconds), which is in the range of the time window of approximately 80 milliseconds reported by Eagleman and Sejnowski (2000a).

5 Discussion
Why should the visual system delay its perception of an event to integrate information from the future? The smoothing model suggests that this is done in order to enhance perceptual accuracy in the presence of uncertainty and noise. It has long been known in the engineering community (see, e.g., Bryson & Ho, 1975) that the limitations of filtering (signal estimation based on the past) can be overcome by smoothing techniques that take some or all future data in a time series into account for optimal estimation of signal properties. Our results suggest that the visual system may be employing this strategy for accurate estimation of visual motion. An additional advantage of smoothing is that smoothed estimates make learning more reliable. For example, in the case of hidden Markov models (HMMs),
[Figure 3 appears here. Panels: (a) Human Data and (b) Optimal Smoothing Model, plotting Space (pixels) against Time Steps, with the traces x_sm and x; (c) Impulse Response of Filter and (d) Impulse Response of Smoother, plotting Response against Time steps.]
Figure 3: Flash-lag effect interpreted as optimal smoothing of visual motion estimates. (a) Data from a human subject showing the perceived location of a moving bar, as revealed by aligning the flash (reproduced from Whitney & Murakami, 1998). Lines through data points are 95% confidence intervals. Dotted line = prediction of the extrapolation model. (b) Data from the optimal smoothing model, where the perceived location is taken to be the smoothed position estimate x_sm. Lines through data points are one standard deviation above and below average values computed over 100 trials. The filtered estimate x is shown as a dotted line for comparison. Note the overshoot at the time of reversal for the filtered but not the smoothed estimate. (c) Impulse response of the optimal linear filter used in the simulations. (d) Impulse response of the optimal smoother used in the simulations. Note that the impulse response function for the smoother includes weights for past, current, and future data, whereas the impulse response for the filter considers only current and past data.
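The filter-versus-smoother contrast in Figure 3 can be reproduced in miniature with a generic Kalman filter followed by a Rauch-Tung-Striebel smoother. This is a hedged sketch: the constant-velocity state model, noise levels, and reversal point below are illustrative assumptions, not the parameters of the paper's simulations.

```python
# Miniature filter-vs-smoother comparison (illustrative sketch only: a
# generic constant-velocity Kalman model with made-up noise levels, not
# the parameters used in the paper's simulations).
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state: [position, velocity]
H = np.array([[1.0, 0.0]])               # observe position only

def kalman_filter(z, q=0.01, r=0.05):
    """Forward pass: estimate the state from past and current data only."""
    Q = q * np.eye(2); R = np.array([[r]])
    x, P = np.zeros(2), np.eye(2)
    xs, Ps, xp, Pp = [], [], [], []
    for zt in z:
        x_pred, P_pred = F @ x, F @ P @ F.T + Q        # predict
        K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
        x = x_pred + (K @ (zt - H @ x_pred)).ravel()   # update
        P = (np.eye(2) - K @ H) @ P_pred
        xs.append(x); Ps.append(P); xp.append(x_pred); Pp.append(P_pred)
    return np.array(xs), np.array(Ps), np.array(xp), np.array(Pp)

def rts_smoother(xs, Ps, xp, Pp):
    """Backward pass: each estimate is corrected using future data."""
    xsm, Psm = xs.copy(), Ps.copy()
    for t in range(len(xs) - 2, -1, -1):
        C = Ps[t] @ F.T @ np.linalg.inv(Pp[t + 1])
        xsm[t] = xs[t] + C @ (xsm[t + 1] - xp[t + 1])
        Psm[t] = Ps[t] + C @ (Psm[t + 1] - Pp[t + 1]) @ C.T
    return xsm, Psm

# A bar moves right, then reverses (cf. the motion reversal in Figure 3).
true = np.concatenate([np.arange(25.0), np.arange(25.0, 0.0, -1.0)])
z = true + 0.1 * np.random.default_rng(1).standard_normal(len(true))
xs, Ps, xp, Pp = kalman_filter(z)
xsm, _ = rts_smoother(xs, Ps, xp, Pp)
```

Comparing the mean squared errors of `xs[:, 0]` and `xsm[:, 0]` against `true` should show the smoothed estimate tracking the reversal more closely than the filtered one, mirroring panels a and b of Figure 3.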
both the forward (filtering) and the backward (smoothing) procedures are used for computing the likelihood in the Baum-Welch algorithm for learning model parameters (Rabiner & Juang, 1993; see also Dayan & Hinton, 1996). Filtering and smoothing are similarly used in algorithms for learning the parameters of continuous-state linear dynamical systems (Shumway & Stoffer, 1982; Ghahramani & Hinton, 1996). Given the natural trade-off between the amount of perceptual delay required for smoothing and the need for real-time computation, an interesting open question is whether the delay of 80 to 100 milliseconds inferred from psychophysical experiments (Eagleman & Sejnowski, 2000a) represents an optimal balance between perceptual accuracy and real-time inference. A related question is whether this delay can be adapted according to the task at hand. These questions remain the subject of ongoing investigations. The model presented here also assumes that the subject possesses a model of the moving stimulus as given by equation 3.1. Such a model could have been acquired as a result of prior experience with moving stimuli and fine-tuned during training before collection of data or, alternately, could have been learned directly during training. The latter possibility is supported by several algorithms that have recently been suggested for learning the parameters of linear dynamical systems directly from input data (Shumway & Stoffer, 1982; Ghahramani & Hinton, 1996; Rao & Ballard, 1997). An interesting question is whether the smoothing model can predict the effect of varying the luminance of the flash and the moving bar. We expect such an experimental manipulation to change the signal-to-noise ratio in the input channels and, hence, the gain terms g(t) and h(t) in the filter and smoother, respectively, thereby changing the shape of their impulse response functions (see Figures 3c and 3d).
This could result in a flash-lead effect under some circumstances, as observed experimentally (Purushothaman et al., 1998). It is known that the flash-lag effect is reduced when the flash becomes more predictable (Eagleman & Sejnowski, 2000b). For the simulations reported here, we used a minimal internal model for the flash: the flash is detected by the subject after some amount of processing delay. A more general approach is to use a dynamical model for the flash in addition to the dynamical model for the moving object. Such an extended model would allow smoothed estimates of both the moving and flashed bars to be computed; the smoothed position of the flashed bar would then be compared to the smoothed position of the moving bar. In the case of multiple predictable flashes (e.g., stroboscopically moving flashes), such a model would be expected to produce a reduction in flash lag (due to smoothing of the flashed bars), in accordance with previous experimental findings (Lappe & Krekelberg, 1998; Eagleman & Sejnowski, 2000b). Testing this hypothesis remains an interesting direction for future research. The idea that the visual system performs statistical or Bayesian inference based on its inputs has recently been proposed by several research groups
(e.g., Freeman, 1994; Hinton, Ghahramani, & Teh, 2000; Knill & Richards, 1996; Rao, 1999). Our results support this emerging model of visual perception and show how the visual system may base its inference about a particular event not only on past observations but also on observations from the immediate future. Such a model extends previous models of the visual cortex based on optimal filtering theory (Mumford, 1994; Rao & Ballard, 1997; Rao, 1999). It may additionally allow novel interpretations of other well-known visual phenomena, such as backward masking (Bachmann, 1994) and the color phi effect (Kolers & von Grünau, 1976), involving the effect of future stimuli on the perception of a preceding stimulus.

Acknowledgments
This research was supported by the Sloan Center for Theoretical Neurobiology at the Salk Institute and the Howard Hughes Medical Institute. We thank Geoffrey Hinton for providing the impetus to this work by pointing out the connections between smoothing and the flash-lag effect, and the reviewers for their helpful comments and suggestions that led to Figures 2, 3c, and 3d.

References

Bachmann, T. (1994). Psychophysiology of visual masking. Commack, NY: Nova Science Publishers.
Baldo, M. V., & Klein, S. A. (1995). Extrapolation or attention shift? Nature, 378, 565–566.
Brenner, E., & Smeets, J. B. (2000). Motion extrapolation is not responsible for the flash-lag effect. Vision Research, 40(13), 1645–1648.
Bryson, A. E., & Ho, Y.-C. (1975). Applied optimal control. New York: Wiley.
Dayan, P., & Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403.
Eagleman, D. M., & Sejnowski, T. J. (2000a). Motion integration and postdiction in visual awareness. Science, 287, 2036–2038.
Eagleman, D. M., & Sejnowski, T. J. (2000b). Response: The position of moving objects. Science, 289, 1107a. Available online at: http://www.sciencemag.org/cgi/content/full/289/5482/1107a.
Eagleman, D. M., & Sejnowski, T. J. (2000c). Response: Latency difference, not postdiction. Science, 290, 1051a.
Freeman, W. T. (1994). The generic viewpoint assumption in a framework for visual perception. Nature, 368, 542–545.
Ghahramani, Z., & Hinton, G. E. (1996). Parameter estimation for linear dynamical systems (Tech. Rep. No. CRG-TR-96-2). Toronto: Department of Computer Science, University of Toronto.
Hinton, G. E., Ghahramani, Z., & Teh, Y. W. (2000). Learning to parse images. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 463–469). Cambridge, MA: MIT Press.
Kalman, R. E. (1960). A new approach to linear filtering and prediction theory. Trans. ASME J. Basic Eng., 82, 35–45.
Khurana, B., & Nijhawan, R. (1995). Extrapolation or attention shift? Nature, 378, 565–566.
Knill, D. C., & Richards, W. (1996). Perception as Bayesian inference. Cambridge, UK: Cambridge University Press.
Kolers, P., & von Grünau, M. (1976). Shape and color in apparent motion. Vision Research, 16, 329–335.
Krekelberg, B., & Lappe, M. (2000). A model of the perceived relative positions of moving objects based upon a slow averaging process. Vision Research, 40(2), 201–215.
Lappe, M., & Krekelberg, B. (1998). The position of moving objects. Perception, 27(12), 1437–1449.
MacKay, D. M. (1958). Perceptual stability of a stroboscopically lit visual field containing self-luminous objects. Nature, 181, 507–508.
Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch & J. L. Davis (Eds.), Large-scale neuronal theories of the brain (pp. 125–152). Cambridge, MA: MIT Press.
Nijhawan, R. (1994). Motion extrapolation in catching. Nature, 370, 256–257.
Nijhawan, R. (1997). Visual decomposition of colour through motion extrapolation. Nature, 386, 66–69.
Purushothaman, G., Patel, S. S., Bedell, H. E., & Ogmen, H. (1998). Moving ahead through differential visual latency. Nature, 396, 424.
Rabiner, L., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs, NJ: Prentice Hall.
Rao, R. P. N. (1999). An optimal estimation approach to visual perception and learning. Vision Research, 39, 1963–1989.
Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9, 721–763.
Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3, 253–264.
Whitney, D., & Murakami, I. (1998). Latency difference, not spatial extrapolation. Nature Neuroscience, 1, 656–657.
Whitney, D., Murakami, I., & Cavanagh, P. (2000). Illusory spatial offset of a flash relative to a moving stimulus is caused by differential latencies for moving and flashed stimuli. Vision Research, 40, 137–149.

Received February 25, 2000; accepted September 25, 2000.
LETTER
Communicated by Michael Berry
Rate Coding Versus Temporal Order Coding: What the Retinal Ganglion Cells Tell the Visual Cortex

Rufin Van Rullen
Simon J. Thorpe
Centre de Recherche Cerveau et Cognition, Faculté de Médecine Rangueil, 31062 Toulouse Cedex, France

It is often supposed that the messages sent to the visual cortex by the retinal ganglion cells are encoded by the mean firing rates observed on spike trains generated with a Poisson process. Using an information transmission approach, we evaluate the performance of two such codes, one based on the spike count and the other on the mean interspike interval, and compare the results with a rank order code, in which the first ganglion cells to emit a spike are given a maximal weight. Our results show that the rate codes are far from optimal for fast information transmission and that the temporal structure of the spike train can be efficiently used to maximize the information transfer rate under conditions where each cell needs to fire only one spike.

1 Introduction
How do neurons transmit information? This question is a central problem in the field of neuroscience (Perkel & Bullock, 1968). Signals can be conveyed locally by analog and electrical mechanisms, but over distances information has to be encoded in the spatiotemporal pattern of trains of action potentials generated by a population of neurons. The exact features of these spike trains that carry information between neurons need to be defined. The most commonly used code is one based on the firing rates of individual cells, but this is by no means the only option. In recent years a strong debate has opposed partisans of codes embedded in the neurons' mean firing rates and researchers in favor of temporal codes, where the precise temporal structure of the spike train is taken into account (Softky, 1995; Shadlen & Newsome, 1995, 1998; Gautrais & Thorpe, 1998). Here we address this question of neural coding in the context of information transmission between the retina and the visual cortex. The retina is a particularly interesting place to study neural information processing (Meister & Berry, 1999). First, it is relatively easy to stimulate and record retinal cells. Furthermore, the general architecture and functional organization of the retina are remarkably well known (Rodieck, 1998). There is probably no other place in the visual system where one can define more

Neural Computation 13, 1255–1283 (2001) © 2001 Massachusetts Institute of Technology
1256
Rufin Van Rullen and Simon J. Thorpe
rigorously what information needs to be represented, how many neurons are available to do it, and how long the transmission should last. A widely used simplification states that the information transmitted from the retina to the brain codes the intensity of the visual stimulus at every location in the visual field. Although this strong statement can certainly be discussed, it is clear that the aim of retinal coding is to transmit enough information about the image on the retina to allow objects and events to be identified. One can also consider that the different types of ganglion cells "tile" the entire retina, so that there is little or no redundancy in the number of neurons, which should therefore be kept to an absolute minimum. In monkeys, in particular, the number of ganglion cells is roughly 1 million. However, the limited redundancy in the number of neurons encoding a given stimulus feature at a particular location does not mean that there is no overlap between ganglion cells' receptive fields. In fact, in the cat retina, between 7 and 20 ganglion cells have receptive field centers that share a given common position in the visual field (Peichl & Wässle, 1979; Fischer, 1973). Correlations among the responses of neighboring ganglion cells also demonstrate that they do not operate as independent channels (Arnett & Spraker, 1981; Mastronarde, 1989; Meister, Lagnado, & Baylor, 1995). Nevertheless, it is not clear in the literature how much of this correlation in the output firing pattern can be explained by shared common inputs (from photoreceptors, bipolar, horizontal, or amacrine cells) to the ganglion cells (Brivanlou, Warland, & Meister, 1998; DeVries, 1999; Vardi & Smith, 1996). Finally, data on the speed of visual processing place severe limitations on the time available for information transmission through the visual system.
First, recorded neuronal latencies can be extremely short: responses start at around 20 ms in the retina (Sestokas, Lehmkuhle, & Kratz, 1987; Buser & Imbert, 1992) and approximately 10 ms later in the lateral geniculate nucleus (LGN) (Sestokas et al., 1987); the earliest responses in V1 already exhibit selectivity to stimulus orientation around 40 ms poststimulus (Celebrini, Thorpe, Trotter, & Imbert, 1993; Nowak, Munk, Girard, & Bullier, 1995). In inferotemporal cortex (IT), face-selective responses begin between 80 and 100 ms after stimulus presentation (Perrett, Rolls, & Caan, 1982) and show selectivity to face orientation even at the very start of the response (Oram & Perrett, 1992). Taken together, these data indicate that visual processing should rely on very short transmission times, on the order of 10 to 20 ms between two consecutive processing stages and less than 50 ms between the retina and the cortex. The same conclusions can be derived from psychophysical observations on monkeys and humans, in a task where subjects must decide whether a briefly flashed photograph of a natural scene contains a target category such as an animal, food, or a means of transport. Monkeys can respond as early as 160 ms after stimulus presentation, and human subjects around 220 ms (Thorpe, Fize, & Marlot, 1996; Fabre-Thorpe, Richard, & Thorpe, 1998; VanRullen & Thorpe, 1999).
Temporal Coding in the Retina
1257
Given the large number of synaptic stages involved, it appears here again that information processing and transfer should not last more than about 10 ms at each processing stage, and probably less than 50 ms between the retina and the brain (the delay of transduction in the photoreceptors has to be taken into account). Therefore, further computation should rely on very few spikes per ganglion cell. As Meister and Berry (1999) argue, computations that use a very restricted number of spikes are difficult to reconcile with the common view that retinal encoding uses the firing frequencies of individual ganglion cells. Classically, ganglion cells are thought to encode their inputs in their output firing frequency (Warland, Reinagel, & Meister, 1997), and the process of retinal spike train generation is supposed to be stochastic, that is, subject to a Poisson or pseudo-Poisson noise. As an alternative, Meister and Berry (1999) review a number of arguments for taking into account the temporal information that can be derived from the very first spikes in the retinal spike trains. For example, this information could be represented by synchronous firing among neurons. Here we introduce another temporal coding scheme, based on the order of firing over a population of ganglion cells. One can consider the ganglion cells as analog-to-delay converters; the most strongly activated ones will tend to fire first, whereas more weakly activated cells will fire later or not at all. Under such conditions, the relative timing with which the ganglion cells fire the first spike of their spike train can be used as a code (Thorpe, 1990). A more specific version of this hypothesis uses only the order of firing across a population of cells (Thorpe & Gautrais, 1997). This coding scheme has already been proposed to account for the speed of processing in the visual system (Thorpe & Gautrais, 1997, 1998).
We have also shown (VanRullen, Gautrais, Delorme, & Thorpe, 1998) that it can be successfully applied to a computationally difficult task, such as detecting faces in natural images. We will compare the performance of this code with two classical implementations of rate coding: one that relies on the spike count and the other on the mean interspike interval to estimate the ganglion cells' firing frequencies over a Poisson spike train. The approach that we use tests the efficiency of the different coding schemes for reconstructing the input image on the basis of the spike trains generated by the ganglion cells (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). More precisely, we suppose (for simplicity) that a natural input image is briefly presented to the retina, preceded and followed by a dark uniform field. Previous experiments (e.g., Thorpe et al., 1996) suggest that under these conditions, information should be available to the visual cortex as early as 50 ms poststimulus. We take the position of an imaginary observer "listening" to the pattern of spikes coming up the optic nerve and trying to derive information about the input image. Of course, we do not consider the role of the visual system in general as being to reconstruct the
Figure 1: Difference of gaussians: ON-center cells' receptive fields. The scale 128 filter is not represented, although it was used in our simulations.
image in the brain. Rather, this reconstruction should be seen as a form of benchmark, a test of the potential of a particular code. First, we introduce a simple model of the architecture of the retina and its functional organization, independent of the way information will be represented. Then we describe our rank order coding scheme and show how it can be applied to retinal coding. Finally, we compare both a "noise-free" and a "noisy" version of this code with two rate-based coding models (one in which the information is embedded in the spike count, the other in the mean interspike interval), by estimating the quality of the input image reconstruction that they provide as a function of time.

2 Retinal Model

2.1 Wavelet-like Transform. We designed a model retina. Our model ganglion cells compute the local contrast intensities at different spatial scales and for two different polarities: ON- and OFF-center cells. We can consider this decomposition as a wavelet-like transform, using differences of gaussians (DoG) as the basic filters (Rodieck, 1965). The spatiotemporal properties of our model ganglion cells match those of X-type cells: they use linear spatial summation between the center and surround regions of the receptive field, and there is no temporal component in the input-output function (Buser & Imbert, 1992). The ganglion cells' receptive fields are shown in Figure 1. We used the simple DoG described by Field (1994), where the surround has three times the width of the center. An OFF-center filter was simply an inverted version of the ON-center receptive field. The narrowest filters (at scale 1) were 5 × 5 pixels in size, and the widest (at scale 128) were 767 × 767 pixels. Furthermore, these filters were normalized so that when the input pattern is identical to the filter itself, the result of the convolution at this given scale
should be 1. The result of the application of these filters at any position and scale is the output of the wavelet-like transform, which produces a set of analog values corresponding to the activation levels of our model ganglion cells. According to wavelet theory (Mallat, 1989), the wavelet-like reconstruction is simply obtained by applying to the reconstructed image, for each scale and position, the corresponding ganglion cell's receptive field, multiplied by the corresponding activation value. More precisely, the contrast at a particular position (x, y) and scale s is defined as

Contrast_Im(x, y, s) = Σ_i Σ_j Im(i + x, j + y) · DoG_s(i, j),

where DoG_s denotes the DoG filter at scale s and (i, j) spans the width and height of the DoG_s filter. Using these contrast values, the reconstruction Im_Rec of the image Im is obtained by

Im_Rec(i, j) = Σ_x Σ_y Σ_s Contrast_Im(x, y, s) · DoG_s(x − i, y − j),

where s spans the range of spatial scales and (x, y) spans the image width and height.

2.2 Subsampling. For computational reasons, as well as for biological plausibility, the spatial resolution of the transform varies together with the scale of the filters, so that when the scale is doubled, the resolution is divided by 2. More precisely, the narrowest convolutions (at scale 1) are computed for every pixel in the original image, whereas the filters at scale 2 are applied once every 2 pixels horizontally and vertically. Therefore, the number of neurons per image is no more than 8/3 times the number of pixels in the original image. Let n be the number of pixels in the input image; the number of ganglion cells is then
2 · (n + n/4 + n/16 + n/64 + · · · + n/16,384).

This organization scheme is detailed in Figure 2. All natural images that will be used in the following simulations are 364 × 244 pixels in size, and the number of ganglion cells will then be approximately 236,000. Of course, we do not claim that this precise architecture is biologically realistic. The real organization of ganglion cells in the mammalian retina has numerous differences from our model. First, the actual recorded receptive field sizes do not span as many octaves as our model receptive fields do. The ratio between the biggest and the smallest receptive field sizes across
Figure 2: Retinal organization. The image is encoded through a bank of filter maps with two different polarities (ON- and OFF-center cells) and eight different scales (only four are shown here). The sampling resolution is inversely proportional to the scale.
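The DoG filtering and the subsampling arithmetic described in sections 2.1 and 2.2 can be sketched as follows. This is an illustrative sketch, not the authors' code: the filter construction assumes a balanced difference of gaussians with the surround three times wider than the center, and the normalization is chosen so that a filter-shaped input yields an activation of 1, as the text specifies.

```python
# Illustrative sketch of the model's DoG transform (not the authors' code).
# Assumptions from the text: surround 3x wider than center (Field, 1994),
# filters normalized so a filter-shaped input gives an activation of 1,
# resolution halved each time the scale doubles (8 scales, 2 polarities).
import numpy as np

def dog_filter(size, sigma_center):
    """ON-center difference-of-gaussians receptive field."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2 * sigma_center ** 2))
    surround = np.exp(-r2 / (2 * (3 * sigma_center) ** 2)) / 9.0  # balanced DoG
    dog = center - surround
    return dog / np.linalg.norm(dog)  # unit norm: the filter itself maps to 1

def contrast(im, dog, step):
    """Correlate the image with a DoG, sampling once every `step` pixels."""
    h, w = dog.shape
    return np.array([[np.sum(im[y:y + h, x:x + w] * dog)
                      for x in range(0, im.shape[1] - w + 1, step)]
                     for y in range(0, im.shape[0] - h + 1, step)])

def ganglion_count(n_pixels, n_scales=8):
    """2 polarities x sum over scales of roughly n / 4^k sampling positions."""
    return 2 * sum(n_pixels // 4 ** k for k in range(n_scales))
```

For a 364 × 244 image, `ganglion_count(364 * 244)` comes out near the 236,000 model ganglion cells quoted in the text.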
the entire retina is less than 100, and only around 2 at a given eccentricity (Croner & Kaplan, 1995). However, one could argue that even if there were cells with very large receptive fields, they would be very rare and therefore difficult to record. Another notable difference is that in biological visual systems, the ganglion cells at each spatial scale are not equally distributed over the retina. On the other hand, the model we used allows the information in an image to be fully encoded with a wavelet-like transform, and therefore we are able to address the real issue that concerns us here: What would be the most efficient way of transmitting this information to the brain using spiking neurons? Given the architecture, how do these neurons convert an analog intensity value, representing the local contrast in their receptive field, into a succession of firing events, the spike train? What is the optimal way of doing so in terms of the maximization of information transmission? In the following sections, we consider a variety of coding schemes that could be used to transmit information about the image to the brain. First, we describe a method that uses an analog-to-delay mechanism coupled with a coding scheme in which the order of firing in the retinal ganglion cells is used to code the information in the image (Thorpe, 1990; Thorpe & Gautrais, 1998). Later we compare the efficiency of this code with more conventional
rate-based codes using either counts of the total number of spikes produced by each cell or measures of the interspike interval.

3 Rank Order Coding
The result of the convolution computed by our model ganglion cells on the original image is an analog value, which can be thought of as the neuron's membrane potential. It has to be converted into a spike train that will be transmitted to the cortex via the optic nerve.

3.1 Analog-to-Rank Conversion. Since we want to show that the relative order in which the ganglion cells generate their first spike can be used as a code, we can simply assign a rank to each neuron as a function of its activation. The most activated will fire first, and so on. This is supported by the characteristics of integrate-and-fire neurons: the higher the membrane potential is, the sooner the threshold will be reached, and the sooner a spike will be emitted. We do not need to model the absolute or relative timing of the spikes precisely, since the only relevant variable that will be used for decoding is the neuron's rank. However, it is possible to assign a latency to each neuron, and we will implement such a function in section 4.1 in order to compare our model with rate coding schemes.

3.2 Rank Order Decoding Using Image Statistics: Contrast = f(Rank). Now that the mechanism responsible for order encoding has been described, we need to evaluate the quality of that code. A good and simple way of estimating the information about the visual stimulus that is carried by the spikes along the optic nerve is to use these spikes to reconstruct the input stimulus. With the kind of wavelet-like transform that is computed by our model ganglion cells, the reconstruction of the image is simply obtained by once again applying the DoG filters to the result of the previous convolution. Each ganglion cell's receptive field needs to be added to the reconstruction image, at the position corresponding to its center, with a multiplicative value equal to the result of the previous convolution, this neuron's activation level.
The problem with our rank order code is that this result has been "forgotten" through the analog-to-order conversion. If the output of the ganglion cells were simple analog values, there would not be any such problem, but the only information available in our case is the relative order in which the ganglion cells fired. If neuron i fired first, what does this mean in terms of the contrast intensities in the original image? If neuron j fired right after neuron k, how relevant is the information transmitted by neuron j compared to that conveyed by neuron k? A simple way of answering these questions is to associate each possible order with the average contrast value that drove the corresponding neuron above threshold. This is roughly equivalent to a sort of reverse correlation analysis.
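A minimal sketch of this encode-by-rank, decode-by-lookup scheme is given below. It is a toy illustration, not the authors' implementation: the contrast = f(rank) lookup table is estimated here from synthetic random activations standing in for natural-image statistics.

```python
# Toy sketch of rank order coding (not the authors' implementation): ranks
# are decoded via a contrast = f(rank) lookup table, estimated here from
# synthetic training activations standing in for natural-image statistics.
import numpy as np

rng = np.random.default_rng(0)

def encode_ranks(activations):
    """Firing order: indices of cells sorted by decreasing activation."""
    return np.argsort(-activations)

def rank_lookup(training_activations):
    """Mean activation observed at each firing rank over a training set."""
    descending = np.sort(training_activations, axis=1)[:, ::-1]
    return descending.mean(axis=0)

def decode(firing_order, lookup):
    """Give each cell the average contrast associated with its rank."""
    decoded = np.empty(len(firing_order))
    decoded[firing_order] = lookup[: len(firing_order)]
    return decoded

train = rng.exponential(size=(1000, 50))   # 1000 "images", 50 "cells"
lookup = rank_lookup(train)
x = rng.exponential(size=50)               # one new activation pattern
x_hat = decode(encode_ranks(x), lookup)    # rank-order reconstruction
```

The decoded vector preserves the firing order exactly; what is lost is the analog spacing between activations, which the lookup table replaces by the training-set average for each rank.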
We presented our model ganglion cells with more than 3000 natural images (364 × 244 pixels), sorted the obtained contrast values for all scales and polarities in decreasing order for each image, and then computed the average contrast value obtained at the location of the first neuron to generate a spike. This location, of course, varied with the different images. The same was done for all second spikes in the images, and so on, until the last neuron that fired. This procedure is described by the following equation. The average maximal contrast MaxC is defined by: 8 0.

By simply looking at the odd part of the interaction function, we can pick out all the stable and unstable phase-locked solutions for two weakly coupled identical oscillators. The computation of the interaction functions H_j can be done by successive changes of variables and then averaging. However, this is inefficient and is
1294
Bard Ermentrout, Matthew Pascal, and Boris Gutkin
not readily automated. We have shown (Ermentrout & Kopell, 1991) that this rigorous procedure is equivalent to the formal method of computing H in the manner of Kuramoto (1982):

H_j(φ) = (1/T) ∫₀ᵀ X*(t) · G_j(X_0(t + φ), X_0(t)) dt.   (3.3)

The vector function X*(t) is the unique solution to the linearized adjoint equation

dX*(t)/dt = −[D_X F(X_0(t))]ᵀ X*(t),   X*(t) · dX_0(t)/dt = 1.

(By [D_X F]ᵀ we mean the transpose of the matrix of partial derivatives of F with respect to the variables X.) The functions X*(t) are easily computed for any stable limit cycle (see Williams & Bowtell, 1997). For equation 3.3, the effects of adaptation on the interactions between neural oscillations arise from two sources: changes in the coupling function and changes in the adjoint X*(t).

3.2 Specific Forms for Conductance-Based Models. In synaptically coupled one-compartment model neurons, coupling is only through the potential. That is, the coupling G(·, ·) is zero for all components except for the voltage. In particular,
$$G(X_2, X_1) = g_{\rm syn}\, s_2(t)\,(V_{\rm syn} - V_1(t)).$$

Here, s_2(t) is the synaptic gating variable (often satisfying a differential equation or specified as a fixed function of the time since the presynaptic cell has fired). g_syn and V_syn are the maximal conductance and the reversal potential of the synapse. This means that for a pair of coupled neurons, the interaction function is simply

$$H(\varphi) = g_{\rm syn} \frac{1}{T} \int_0^T s_{\rm pre}(t + \varphi)\, V^*(t)\,(V_{\rm syn} - V(t))\, dt, \tag{3.4}$$
where V*(t) is the voltage component of the adjoint. If we adjust input currents to the oscillators so that the period of oscillations is fixed, then the role of adaptation can be analyzed separately from the frequency effects. For a fixed-period oscillation, we do not expect adaptation to have much of an effect on the shape of the synaptic gating variable (since this just depends on the firing of an action potential); thus, all the changes in the function H that arise will be due to changes in the product V*(t)(V_syn − V(t)).
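On a periodic grid, equation 3.4 reduces to a circular cross-correlation, so H(φ) and its odd part can be computed with array shifts. A minimal sketch with synthetic stand-in waveforms (the Gaussian spike V(t), cosine adjoint V*(t), and alpha-function s_pre(t) below are illustrative assumptions, not the model's actual traces):

```python
import numpy as np

T = 25.0                     # period (msec)
N = 1000                     # grid points per period
t = np.arange(N) * T / N

# Synthetic stand-ins (illustrative assumptions):
V = -65.0 + 90.0 * np.exp(-((t - 1.0) / 0.5) ** 2)   # brief spike near t = 0
Vstar = 1.0 - np.cos(2 * np.pi * t / T)              # Type I-like adjoint
tau = 2.0
s_pre = (t / tau) * np.exp(1 - t / tau)              # alpha-function synapse
g_syn, V_syn = 0.1, 0.0                              # excitatory synapse

# H(phi_k) = g_syn * (1/T) * int_0^T s_pre(t + phi_k) V*(t) (V_syn - V(t)) dt.
# On a periodic grid, s_pre(t + phi_k) is a circular shift by k samples.
def H(k):
    return g_syn * np.mean(np.roll(s_pre, -k) * Vstar * (V_syn - V))

Hvals = np.array([H(k) for k in range(N)])
# Odd part: H_odd(phi) = (H(phi) - H(-phi)) / 2; index of -phi_k is N - k.
H_odd = 0.5 * (Hvals - Hvals[(-np.arange(N)) % N])
```

The zeros of H_odd then give the phase-locked states, with stability read off from the slope as in the text.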
Effects of Adaptation and Feedback
1295
For excitatory synapses, V_syn = 0 mV, so that the main effect of the term (V_syn − V(t)) is to multiply the adjoint V*(t) by an essentially positive function. (V(t) > 0 for only a very short fraction of the period.) Qualitatively, we see that if adaptation is to have an effect on the interaction, it will have to be through the adjoint V*(t). The PRC is an experimentally measurable quantity (Reyes & Fetz, 1993a, 1993b) in which a brief pulse is applied to an oscillating neuron and the advance or delay of the next spike is computed. That is, suppose that T is the normal period of the cell. Let T̂(t) be the time of the next spike given that the stimulus has occurred at a time t after the last spike. Then the PRC, P(t), is defined as the fractional change of the timing:

$$P(t) \equiv \frac{T - \hat{T}(t)}{T}.$$
Often only the timing advance or delay, Δ(t) = T P(t), is reported. Hansel et al. (1995) have given a nice intuitive definition of the voltage component of the adjoint, V*(t): it is just the infinitesimal PRC. That is, let P(t, a) be the PRC for a stimulus of amplitude a. Then V*(t) = lim_{a→0} Δ(t, a)/a. Since the period of the oscillation does not change with adaptation (as we compensate for this with additional injected current) and the synaptic time course does not change, the main effect of adaptation on the interaction between neural oscillators must be mediated through the PRC (approximated by the adjoint). Thus, we study the effects of adaptation on the PRCs and adjoints, V*(t), of conductance-based neural models.

3.3 Adjoints for Neural Models. Hansel et al. (1995) described adjoints for several different model neurons. In particular, they describe what they call Type I and Type II adjoints. Type I adjoints are strictly nonnegative; the effect of a depolarizing stimulus is either to do nothing or to advance the phase. Model neurons like the Traub model above (without adaptation), the Connor model (Connor, Walter, & McKown, 1977), the Morris-Lecar model (Morris & Lecar, 1981), and the integrate-and-fire model are all examples of models with nonnegative PRCs. Ermentrout (1996) and Izhikevich (1999) showed that all neural models that have Class I membrane properties (this includes all the aforementioned models except the integrate-and-fire) are, near the onset of oscillations, equivalent to a simple canonical model that is described below. This model has a strictly positive PRC,

$$\Delta(t) = 2\arctan(\tan(t/2) + w) - t,$$

where w is the magnitude of the stimulus. Type II (Hansel et al., 1995) adjoints are characterized by both positive and negative regions, so that for stimuli shortly after the spike, the phase is delayed. The Hodgkin-Huxley model is of this type.
Near a supercritical Hopf bifurcation, the adjoint has a form similar to −sin(2πt/T); thus, it is both positive and negative. The adjoint for certain classes of relaxation oscillators also has both positive and negative regimes (Izhikevich, 2000).
In addition to classifying the differences in adjoint types, Hansel et al. (1995) also numerically demonstrated that neurons with Type II adjoints can synchronize with excitatory coupling, in contrast to neurons with Type I adjoints, which have difficulty synchronizing. This particular result suggests that one mechanism by which adaptation enables neurons to synchronize is to switch the adjoint from Type I to Type II. As we have already mentioned, the M-current adaptation serves to change the saddle-node bifurcation (which leads to Type I adjoints) to a Hopf bifurcation (which results in Type II adjoints). Thus, the adaptation generated by the M current may work by changing the adjoint from a strictly positive one to one with a substantial negative regime after spiking.

3.4 The Variation of the Adjoint and the Interaction Function. In this section, we consider the model depicted in Figures 1 and 2 in the limit of weak coupling with different degrees of adaptation. In Figure 5 we show the effects of the M-current adaptation on the adjoint, V*(t), in the left set of panels and the odd part of the interaction function in the right panel. In each panel, the degree of adaptation is shown, as well as the applied current. The current is chosen so that the period of the limit cycle is 25 msec. If we multiply V*(t) by (V_syn − V(t)), the shapes of the resulting curves are virtually unchanged except for the amplitude; thus, we show only the adjoints V*(t). Figures 5 and 6 illustrate the effect of adaptation on the adjoint and the consequent effect on the interaction function for the Traub model (cf. Figures 1 and 2). Figure 7 shows a similar picture for the network with feedback inhibition. We first point out that with the M-current adaptation, synchrony becomes stable in the weak coupling limit, since the slope of the odd part of the interaction function is positive.
This is not the case for the high-threshold adaptation or for the network with recurrent inhibition; the best we can say is that there is a small phase difference between the two oscillators. We will see why this happens in the next section. The key feature responsible for making the phase difference between the pair of oscillators close to 0 is the flattening or negative region of the adjoint right after spiking. We can intuitively see why there is a tendency toward synchrony by looking at the different adjoints. Consider a pair of mutually coupled identical cells that are nearly synchronous. Cell 1 fires first, which advances cell 2. Cell 2 fires, and this advances cell 1. Thus, in one cycle, both cells are advanced. The phase difference between them will shrink if cell 2 is advanced more than cell 1. In the case of no adaptation, the advances are both positive and close in magnitude, so a more detailed analysis is required, and the width of the synapse plays a critical role. However, in the presence of adaptation, cell 1 is either not advanced at all (with the AHP current) or actually delayed (with the M current); thus, adaptation allows cell 2 to "catch up" to cell 1 in the sense that the phase lag between them shortens. This effect is quite strong and thus will not be as sensitive to the time course of the synapses.
[Figure 5 panel residue: left column, the adjoint (PRC); right column, H_odd; horizontal axes run from 0 to T with a tick at T/2.]
Figure 5: The adjoint and the odd part of the interaction function for the Traub model with M current (low threshold)–based adaptation. In each frame, sufficient current is injected into the model cell so that the frequency of oscillation is 40 Hz. Synapses are as in Figure 2. From top to bottom, (g_m, I) = (0, 0.922), (0.2628, 1.99), (0.99, 4.9), and (2.477, 10.3).
More formally, the following argument shows how changes in the adjoint can alter the stability of the synchronous state. Recall that the synchronous solution is stable if H'_odd(0) > 0, which implies H'(0) > 0. Thus, the condition for stable synchrony for two identical, synaptically coupled neurons is

$$H'(0) = \frac{1}{T}\int_0^T s'(t)\,(V_{\rm syn} - V(t))\,V^*(t)\, dt > 0.$$

Consider the case with no adaptation. The adjoint is zero and then rises to some peak. Since s'(t) is positive shortly after the spike and is negative thereafter, the integral is negative, so that synchrony will be unstable. In the case of the low-threshold adaptation (M current), the adjoint is negative and grows in magnitude. Thus, the integral will be positive, so that synchrony
[Figure 6 panel residue: left column, the adjoint (PRC); right column, H_odd; horizontal axes run from 0 to T with a tick at T/2.]
Figure 6: The adjoint and the odd part of the interaction function for the Traub model with AHP current (high threshold)–based adaptation. In each frame, sufficient current is injected into the model cell so that the frequency of oscillation is 40 Hz. Synapses are as in Figure 2. From top to bottom, (g_AHP, I) = (0, 0.93), (0.262, 3.06), (0.915, 8.58), (1.368, 12.455), and (1.48, 13.43).
will be stable. For the case of high-threshold adaptation and recurrent inhibition, the situation is more subtle, since the adjoint is nearly flat during the majority of the time when the synapse is on. Numerical calculations are necessary. Indeed, as seen in Figures 6 and 7, synchrony remains unstable, but there is a stable, nearly synchronous solution. We have discussed only the adjoints of oscillators with adaptation in this section. It is easy to compute the actual PRCs for these models by choosing an appropriate perturbing stimulus. Numerical calculations (not shown here) reveal that with a current stimulus of width 0.2 msec and magnitude 10, the PRC and the adjoint are indistinguishable up to a scale factor. The measured PRCs of cortical neurons resemble those of the Traub model with an AHP-type (or high-threshold) adaptation. In Figure 8 we show the
[Figure 7 panel residue: (A, C) V*(t) and (B, D) H_odd(t), plotted against t (msec).]
Figure 7: Effects of a local inhibitory interneuron on the adjoint and the interaction function. (A, B) Adjoint and H_odd, respectively, for the isolated excitatory neuron. Synchrony is unstable. (C, D) The same functions are shown, respectively, for an excitatory and an inhibitory circuit. The inhibition flattens the response right after the spike. The resulting H_odd shows that synchrony is nearly stable.
experimentally determined PRC (Reyes & Fetz, 1993a) and an adjoint from the Traub model. It is difficult to tell whether there is a statistically significant negative region, so one could possibly fit this with the M-current adaptation model. Adaptation appears to encourage synchrony either by causing the neuron to ignore inputs shortly after it has spiked or by delaying the onset of the next spike through a negative region in the PRC.

4 The Behavior Near the Bifurcation: A "Canonical Model"
In previous work (Ermentrout, 1996) we have shown that many models for spiking are Class I. That is, as the current increases, the model goes from rest to repetitive firing via a saddle node on a circle. Locally, a saddle-node
[Figure 8 panel residue: Δ(t) vs. t (msec), 0–20; annotation g_K(Ca) = 1.00, I = 9.25.]
Figure 8: The PRC from a cortical neuron (see Reyes & Fetz, 1993b) and the PRC from the model in Figure 1 with a choice of g_AHP and I that closely matches the data. (Data supplied by A. Reyes.)
bifurcation has the form

$$x' = I + x^2, \tag{4.1}$$
where I is the input into the system. If I is strictly positive, then equation 4.1 "blows up" in a finite amount of time. This suggests a nonlinear integrate-and-fire model in which x is reset to −∞ when it grows to +∞. Blowing up to infinity is equivalent to eliciting a spike. Suppose the system is reset to −∞ at t = 0. Then, for I > 0, it will blow up to +∞ at t = π/√I. Thus, one can think of this as a repetitively firing neuron with a period π/√I. This argument can be made formal (see Ermentrout & Kopell, 1986, or Izhikevich, 1999). Equation 4.1 can be changed into a differential equation on the circle by making the transformation x = tan(θ/2), whence it becomes the "theta model":

$$\theta' = 1 - \cos\theta + (1 + \cos\theta)\, I. \tag{4.2}$$
Firing occurs when θ(t) = π. Since we are interested in the oscillatory case, the parameter I can be set to 1 without loss of generality, so that the uncoupled theta model satisfies θ' = 2. Thus, θ(t) = 2t + θ(0). The model fires when x in equation 4.1 goes to infinity or when θ = π in equation 4.2. The PRC for this simple model is obtained by instantly incrementing x in equation 4.1 by an amount g. In the appendix, we derive the PRC for this model and obtain

$$P(t; g) = \frac{1}{2} + \frac{1}{\pi}\arctan(\tan(t - \pi/2) + g) - \frac{t}{\pi}. \tag{4.3}$$

[Figure 9 panel residue: PRC vs. t for three cases: no adaptation, g_A = 3, and "instant" adaptation.]

Figure 9: PRC for the theta model with and without adaptation. "Instant" adaptation means the adaptation is modeled as a fixed delay of about a third of the period with amplitude 3.
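Equation 4.3 can be checked numerically: P(t; g) is strictly positive on (0, π), and for small g it reduces to g(1 − cos 2t)/(2π). A quick sketch (grid and tolerances are arbitrary choices):

```python
import numpy as np

def P(t, g):
    # Equation 4.3: PRC of the theta model for an instantaneous pulse of size g.
    return 0.5 + np.arctan(np.tan(t - np.pi / 2) + g) / np.pi - t / np.pi

t = np.linspace(0.01, np.pi - 0.01, 500)
prc = P(t, 1.0)                 # strictly positive on (0, pi): Type I

# Small-g limit: dP/dg at g = 0 is cos^2(t - pi/2) / pi = (1 - cos 2t) / (2 pi),
# so P(t; g) ~ g (1 - cos 2t) / (2 pi) for small g.
g_small = 1e-6
ratio = P(t, g_small) / (g_small * (1 - np.cos(2 * t)) / (2 * np.pi))
```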
Thus, for instantaneous perturbations, the PRC of the theta model is explicitly computable and is shown in Figure 9. It is strictly positive, and for small g it is approximately proportional to 1 − cos 2t. We next ask how adaptation affects the PRC of the theta model. Izhikevich (2000) shows that if you add adaptation with slow decay, then the theta model becomes:

$$\theta' = (1 - \cos\theta) + (1 + \cos\theta)(I - g_a z), \tag{4.4}$$
$$\tau_a z' = \delta(\theta - \pi) - z. \tag{4.5}$$
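The pulse-perturbation procedure for this model can be sketched directly: Euler-integrate equations 4.4 and 4.5 (the delta function makes z jump by 1/τ_a at each spike), let the oscillator settle, apply the instantaneous reset of equation A.5 at a phase t after a spike, and measure the change in the next spike time. The parameter values (I = 1, g_a = 1, τ_a = 5, pulse size g = 0.5) and step size below are illustrative assumptions:

```python
import math

I, g_a, tau_a, g, dt = 1.0, 1.0, 5.0, 0.5, 1e-4

def next_spike(t_pulse=None, settle=10):
    """Interval from the (settle)-th spike to the next one; optionally an
    instantaneous kick (eq. A.5) is applied t_pulse after that spike."""
    theta, z, t = -math.pi, 0.0, 0.0
    spikes = []
    kicked = t_pulse is None
    while True:
        if (not kicked and len(spikes) == settle
                and t - spikes[-1] >= t_pulse):
            theta = 2 * math.atan(math.tan(theta / 2) + g)   # eq. A.5 reset
            kicked = True
        theta += dt * ((1 - math.cos(theta))
                       + (1 + math.cos(theta)) * (I - g_a * z))
        z -= dt * z / tau_a
        t += dt
        if theta >= math.pi:               # spike: reset theta, bump adaptation
            spikes.append(t)
            theta -= 2 * math.pi
            z += 1.0 / tau_a               # integral of delta(theta - pi)/tau_a
            if len(spikes) > settle:
                return spikes[-1] - spikes[-2]

T = next_spike()                           # steady-state period (> pi: z slows firing)
P = lambda t_pulse: (T - next_spike(t_pulse)) / T   # PRC as defined in section 3

early, late = P(0.05 * T), P(0.6 * T)      # response suppressed just after the spike
```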
The PRC cannot be analytically found for this model. However, we can do exactly the same as one would do in an experiment and compute the PRC by applying a brief pulse at different times. The result of this computation is shown in Figure 9. As with the full biophysical models, the effect of adaptation is to suppress the excitability after the pulse. To gain some insight into how the adaptation has this suppressive effect at the early part of the PRC, we introduce a simplified version of adaptation that is analytically tractable. We apply a negative delta function perturbation with magnitude g_a to a cell at a fixed time t_a after it has fired. That is, we approximate the exponential adaptation, g_a exp(−t/τ_a), by a single pulse concentrated t_a units later. In this case, we can explicitly compute the PRC (see the appendix). Figure 9 shows the result of the calculation. Several effects are clear: the peak response is delayed considerably, the response shortly after the spike is strongly suppressed, and the slope is increased for inputs that occur right before the spike. These effects conspire to push the two neurons toward synchrony. To close this section, we consider a pair of theta neurons that have both instantaneous delta function coupling (resetting using the PRC) and delayed negative feedback. Thus, the model consists of a pair of theta neurons with a fixed current bias, I = 1, subject to the following rule: each time θ_j = π, it is reset to θ_j = −π; oscillator k ≠ j is reset to θ_k = 2 arctan(tan(θ_k/2) + g); and t_a time units later, θ_j receives a negative impulse of strength g_a: θ_j → 2 arctan(tan(θ_j/2) − g_a). (The derivation of these resetting functions is in the appendix.) The usefulness of this simple model is that it can be directly reduced to a map for the timing differences.
That is, we let T_j(n) denote the time of the nth firing of unit j, we assume identical units, and, finally, we assume that t_a is larger than the initial time difference between the two units. This latter assumption simplifies the construction of the map and eliminates multiple cases. Let φ_n = T_2(n) − T_1(n). The map, φ → M(φ), is the composition of several functions:

$$x_1(\varphi) = \tan(\varphi - \pi/2) + g$$
$$x_2(\varphi) = \tan[-\varphi + t_a + \arctan x_1(\varphi)] - g_a$$
$$T_p(\varphi) = \pi/2 - \arctan(x_2(\varphi)) + t_a$$
$$y_2(\varphi) = \tan[T_p(\varphi) - \varphi - t_a - \arctan(\cot t_a + g_a)] + g$$
$$\varphi_{n+1} = \pi/2 - \arctan(y_2(\varphi)) \equiv M(\varphi).$$
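The composition above is a one-liner to implement; iterating it exhibits the degeneracy M(φ) = φ when g_a = 0 and the (slow, cubic) attraction to the synchronous state φ = 0 when g_a > 0. A sketch with illustrative parameter values g = 0.5, g_a = 0.5, t_a = 1 (φ must stay below t_a):

```python
from math import atan, tan, pi

def M(phi, g, g_a, t_a):
    # Composition from the text; valid for 0 < phi < t_a.
    x1 = tan(phi - pi / 2) + g
    x2 = tan(-phi + t_a + atan(x1)) - g_a
    Tp = pi / 2 - atan(x2) + t_a
    y2 = tan(Tp - phi - t_a - atan(1.0 / tan(t_a) + g_a)) + g
    return pi / 2 - atan(y2)

# Without adaptation, the map is the identity (degenerate):
phi_identity = M(0.2, g=0.5, g_a=0.0, t_a=1.0)

# With adaptation, phi = 0 attracts (cubically, hence slowly):
phi = 0.3
for _ in range(2000):
    phi = M(phi, g=0.5, g_a=0.5, t_a=1.0)
```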
The derivation is in the appendix. First, note that by direct substitution, one finds that φ = 0 is a fixed point. Next, suppose that there is no adaptation,
so that g_a = 0. Then the map simplifies to φ_{n+1} = φ_n, no matter what the coupling strength. That is, in the absence of adaptation, the map is degenerate and has no effect on the timing differences. We can expand the map with adaptation (g_a > 0) around the synchronous fixed point to study its local dynamics. Using Maple V, we expand the map in a power series in φ and find that to cubic order:

$$M(\varphi) = \varphi - g_a g^2\, \frac{g_a + 2\cot t_a}{1 + (g_a + \cot t_a)^2}\, \varphi^3 \equiv \varphi - C\varphi^3.$$
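Iterating the cubic normal form shows the slow, power-law-like approach to synchrony; adding a small linear factor (1 + μ), standing in for the noninstantaneous synapses and adaptation discussed next, produces the nearly synchronous fixed points at ±√(μ/C). A sketch (the values of μ and C are arbitrary illustrations):

```python
import math

def iterate(phi, mu, C, n=5000):
    # phi_{n+1} = (1 + mu) phi_n - C phi_n^3
    for _ in range(n):
        phi = (1 + mu) * phi - C * phi ** 3
    return phi

mu, C = 0.01, 0.1

# mu > 0: synchrony unstable; iterates settle at +/- sqrt(mu / C).
phi_plus = iterate(0.05, mu, C)
phi_minus = iterate(-0.05, mu, C)

# mu < 0 (or mu = 0, more slowly): the origin (synchrony) attracts.
phi_sync = iterate(0.05, -0.01, C)
```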
Because the adaptation and synapses are instantaneous, the linearized map is degenerate. However, the adaptation adds a nonlinear effect at the origin. Since the coefficient of the cubic term is negative, this implies that the origin is an attracting fixed point. It is crucial that the adaptation delay, t_a, be positive, since otherwise the limiting coefficient of φ³ will be zero and the origin will again be neutrally stable. This map provides some additional insight. We expect that if the synapses and the adaptation are not instantaneous, then the linear coefficient will not be identically 1. Suppose that we introduce a small parameter, μ, that represents the effects of finite (but nonzero) synaptic and adaptation persistence. Then the map is approximately given by φ_{n+1} = (1 + μ)φ_n − Cφ_n³. If μ < 0, then the origin is stable, and we obtain stable synchrony for the perturbed map. However, if μ > 0, then the origin is unstable, and synchrony is not a stable solution. However, since C > 0, for μ > 0 there is a pair of stable fixed points: φ̄ = ±√(μ/C). That is, we find a nearly synchronous timing difference between the two units, identical in local behavior to the last panels in Figures 6B and 7D. Numerical simulations of a pair of theta models using instant adaptation and synapses whose dynamics are not instantaneous show that synchrony is unstable. Instead, the pair tends to a nearly synchronous solution with a phase difference that tends to zero as the synapses speed up. Figure 10 shows the phase difference between the two oscillators for synapses that have the form s(t) = β exp(−βt) as a function of the parameter β. As β → ∞, s(t) approaches a delta function, and φ appears to approach the synchronous solution. The approach seems to follow a power law, which is consistent with the fact that for infinitely fast synapses, the synchronous solution is stable but not exponentially stable.

5 Conclusions
We have shown that the addition of adaptation to a spiking Class I neural model has effects on the ability of the neuron to synchronize with other
[Figure 10 panel residue: phase difference φ (0–3) vs. β (0–20).]
Figure 10: Phase difference between two theta models with delayed instant adaptation and time-dependent exponential synapses, s(t) = β exp(−βt). As β → ∞, the phase difference tends to zero.
cells. This was shown to be a consequence of the effects of the adaptation on the PRC. For M current–based adaptation, the current has a global effect on the bifurcation diagram through a destabilization of the rest state at high enough currents. This leads to a PRC that is negative for a short time after the spike. Intuitively, what is going on is that right after the spike, the adaptation-related potassium gates are not fully saturated, and the additional depolarizing currents from an excitatory synapse open them further. This delays the onset of the next spike. The calcium-based adaptation acts more like a shunt, since it is already close to saturation due to its sharpness of onset. The recurrent inhibition acts in a way similar to the calcium-based adaptation. Thus, the synaptic depolarization occurring after a spike is largely ignored. The effects are more subtle, since the phase cannot be delayed by an excitatory stimulus. Thus, intuition fails (or is, at least, not obvious), and detailed calculations are necessary.
van Vreeswijk and Hansel (in press) have recently explored the effects of adaptation in integrate-and-fire models. They have sufficiently strong recurrent excitatory coupling such that, with adaptation, the result of connecting two neurons together is bursting. This is a different regime from that studied here. The adaptation in their models builds up slowly and is strong enough to shut the neurons off. After the neurons recover from the adaptation, they fire again. The recurrent excitation leads to rapid firing. Thus, the adaptation leads to a pattern of bursts that are synchronized. However, the spikes within the burst are not synchronized. In this article, the synapses are not sufficiently strong to induce bursting. Adaptation and negative feedback have a dramatic effect on the firing rate versus current curve of cortical neurons (Ermentrout, 1998). The computations and analysis in this article show that they also have a big effect on the synchronization properties of excitatorily coupled neurons. More generally, the presence of a delayed negative feedback makes it possible to synchronize or nearly synchronize two excitatorily coupled cells that would not synchronize without the feedback.

Appendix
Here, we present the equations used in the article. The general equations for all neurons have the form:

$$C v' = I - g_{Na} h m^3 (v - e_{Na}) - \left(g_K n^4 + g_m w + g_{AHP}\,\frac{ca}{ca + 1}\right)(v - e_K) - g_L (v - e_L) - i_{Ca}$$
$$m' = a_m(v)(1 - m) - b_m(v)\, m$$
$$n' = a_n(v)(1 - n) - b_n(v)\, n$$
$$h' = a_h(v)(1 - h) - b_h(v)\, h$$
$$w' = (w_\infty(v) - w)/\tau_w(v)$$
$$ca' = -0.002\, i_{Ca} - ca/80$$
$$i_{Ca} = g_{Ca}\, m_{L,\infty}(v)\,(v - e_{Ca})$$
$$m_{L,\infty}(v) = 1/(1 + \exp(-(v + 25)/2.5))$$
$$a_m(v) = 0.32(54 + v)/(1 - \exp(-(v + 54)/4))$$
$$b_m(v) = 0.28(v + 27)/(\exp((v + 27)/5) - 1)$$
$$a_h(v) = 0.128\exp(-(50 + v)/18)$$
$$b_h(v) = 4/(1 + \exp(-(v + 27)/5))$$
$$a_n(v) = 0.032(v + 52)/(1 - \exp(-(v + 52)/5))$$
$$b_n(v) = 0.5\exp(-(57 + v)/40)$$
$$\tau_w(v) = 100/(3.3\exp((v + 35)/20) + \exp(-(v + 35)/20))$$
$$w_\infty(v) = 1/(1 + \exp(-(v + 35)/10)).$$
Synaptic gates satisfy

$$s' = a(1 - s)/(1 + \exp(-(v + 10)/10)) - b s. \tag{A.1}$$
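The single-cell system above, with the parameter values quoted in the next paragraph, integrates in a few dozen lines. An explicit-Euler sketch (adaptation off, g_m = g_AHP = 0; the bias current I = 2, the step size, and the `vtrap` guard for the removable singularities in a_m, b_m, a_n are implementation choices of this sketch, not part of the original):

```python
import math

def vtrap(x, s):
    # x / (1 - exp(-x / s)), with the x -> 0 limit (= s) handled explicitly
    return s if abs(x) < 1e-7 else x / (1.0 - math.exp(-x / s))

def rates(v):
    am = 0.32 * vtrap(v + 54.0, 4.0)
    bm = 0.28 * vtrap(-(v + 27.0), 5.0)   # equals 0.28 (v+27)/(exp((v+27)/5)-1)
    ah = 0.128 * math.exp(-(50.0 + v) / 18.0)
    bh = 4.0 / (1.0 + math.exp(-(v + 27.0) / 5.0))
    an = 0.032 * vtrap(v + 52.0, 5.0)
    bn = 0.5 * math.exp(-(57.0 + v) / 40.0)
    return am, bm, ah, bh, an, bn

# Parameters from the text (adaptation off); I = 2 is an illustrative bias.
ek, ena, el, eca = -100.0, 50.0, -67.0, 120.0
gl, gk, gna, gca, Cm = 0.2, 80.0, 100.0, 1.0, 1.0
gm, gahp, I = 0.0, 0.0, 2.0

v, m, h, n, w, ca = -70.0, 0.0, 1.0, 0.0, 0.0, 0.0
dt, t_end = 0.01, 200.0       # msec
spikes = 0
for _ in range(int(t_end / dt)):
    mlinf = 1.0 / (1.0 + math.exp(-(v + 25.0) / 2.5))
    ica = gca * mlinf * (v - eca)
    am, bm, ah, bh, an, bn = rates(v)
    tw = 100.0 / (3.3 * math.exp((v + 35.0) / 20.0)
                  + math.exp(-(v + 35.0) / 20.0))
    winf = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    dv = (I - gna * h * m ** 3 * (v - ena)
          - (gk * n ** 4 + gm * w + gahp * ca / (ca + 1.0)) * (v - ek)
          - gl * (v - el) - ica) / Cm
    m += dt * (am * (1 - m) - bm * m)
    h += dt * (ah * (1 - h) - bh * h)
    n += dt * (an * (1 - n) - bn * n)
    w += dt * (winf - w) / tw
    ca += dt * (-0.002 * ica - ca / 80.0)
    prev_v = v
    v += dt * dv
    if prev_v < 0.0 <= v:     # upward crossing of 0 mV counts as a spike
        spikes += 1
```

With this bias current the cell fires repetitively, consistent with the rheobase implied by the (g_m, I) pairs in Figure 5.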
Units are mS/cm² for the conductances, milliseconds for time, millivolts for voltage, μF/cm² for capacitance, and μA/cm² for currents. The parameters are e_K = −100, e_Na = 50, e_L = −67, e_Ca = 120, g_L = 0.2, g_K = 80, g_Na = 100, C = 1, g_Ca = 1, a = 2, b = 0.1. The parameters g_m, g_AHP, and I varied from simulation to simulation depending on whether the adaptation was on. Typically, g_m and g_AHP were on the order of 1–2. Simulations with inhibitory neurons consisted of a single excitatory (E) and inhibitory (I) pair (cf. Figures 3 and 7). I cells have the same intrinsic dynamics as the E cells but lack any adaptation currents. The E cells receive a bias current I = 3.5, and the I cells receive no bias current. Connection strengths are g_{e→i} = 0.1 and g_{i→e} = 1. E synapses obey equation A.1 with a = 10 and b = 5, while for the inhibitory synapses, a = 2 and b = 0.1. Synaptic reversal potentials were 0 for the excitatory synapses and −80 for the inhibitory synapses. These small subnetworks (of one E and one I cell) are coupled together through excitatory-excitatory interactions with g_{e→e} = 0.01 for Figure 3.

A.1 The Canonical Maps. Here we derive all of the maps and PRCs for the theta model with instantaneous adaptation and synapses. It is more convenient to work with the untransformed equations,
$$\frac{dx}{dt} = x^2 + 1, \tag{A.2}$$
where we have assumed, without loss of generality, that the applied current I = 1. This is an integrate-and-fire model, but firing occurs when x goes to infinity, and the reset is x = −∞. The quadratic nonlinearity enables the system to blow up in a finite amount of time. The solution to equation A.2 for x(0) = a is

$$x(t) = \tan(t + \arctan(a)).$$

This calculation shows that the next "spike" occurs (i.e., x goes to infinity) at

$$T_f(a) = \pi/2 - \arctan(a). \tag{A.3}$$
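The blow-up time A.3 is easy to confirm by brute force: Euler-integrating x' = 1 + x² from x(0) = a until x exceeds a large threshold R should take approximately arctan(R) − arctan(a) time units (the step size and threshold below are arbitrary choices):

```python
import math

def time_to_threshold(a, R, dt=1e-5):
    # Euler integration of x' = 1 + x^2 until x reaches the threshold R.
    x, t = a, 0.0
    while x < R:
        x += dt * (1.0 + x * x)
        t += dt
    return t

a, R = -1.0, 50.0
t_num = time_to_threshold(a, R)
t_exact = math.atan(R) - math.atan(a)   # from x(t) = tan(t + arctan a)
```

As R → ∞, t_exact tends to T_f(a) = π/2 − arctan(a); with a = −1 the "period" is 3π/4, and with reset to −∞ it is π, as in the text.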
If x(0) = −∞, then x(t) = ∞ when t = π. Thus, the period of the unperturbed system is π. Now suppose that at t = 0, x is reset to −∞, and at t = s < π a delta function pulse of strength g is applied. Then right after the pulse,

$$x(s^+) = x(s^-) + g = \tan(s - \pi/2) + g.$$
Thus, from equation A.3, the time of firing is now

$$\hat{T}(s) = s + \pi/2 - \arctan(\tan(s - \pi/2) + g),$$

from which we deduce the PRC:

$$\Delta(t) = \frac{\pi - \hat{T}(t)}{\pi} = \frac{1}{2} + \frac{1}{\pi}\arctan(\tan(t - \pi/2) + g) - \frac{t}{\pi}. \tag{A.4}$$
As a by-product of this calculation, we can obtain the phase-resetting curve for the theta model. That is, given that θ = θ_0 when the impulse comes in, what is the new value of θ? Since θ(t) = −π + 2t, we see that t = π/2 + θ/2. Thus, in terms of θ, the value of x after an impulse is x = tan(θ/2) + g, and thus

$$\theta_{\rm new} = 2\arctan(\tan(\theta_{\rm old}/2) + g). \tag{A.5}$$
Next, suppose that we add the delayed adaptation with strength g_a and delay t_a. First, we compute the unperturbed period. At t = t_a^+, x(t_a) = tan(t_a − π/2) − g_a, so that the period of the oscillation is

$$T_a = t_a + \pi/2 - \arctan(\tan(t_a - \pi/2) - g_a).$$

Note that if g_a = 0 or t_a = 0, then T_a = π, the unperturbed period. We compute the PRC for this model assuming a pulse arrives at a time t after firing. Suppose that t < t_a. Then

$$x(t) = \tan(t - \pi/2) + g \equiv x_1.$$

We next compute the value of x when the delayed adaptation kicks in, x_2:

$$x_2 = \tan(t_a - t + \arctan(x_1)) - g_a.$$

The time of firing is then T̂ = T_f(x_2) + t_a. This gives us the PRC for perturbations that arise before t = t_a. Similar calculations are done for the case in which the perturbation occurs after the
adaptation occurs. These two cases together allow us to derive the function for the PRC with the adaptation that is shown in Figure 9. Finally, we derive the map for two theta models with instant adaptation and instant synapses. We will suppose for simplicity that the timing difference between them is less than the adaptation time. The following events occur: cell 1 fires, cell 2 fires, cell 1 receives adaptation, cell 2 receives adaptation, cell 1 fires, and cell 2 fires, completing the cycle. Cell 1 fires at t = 0 and cell 2 at t = φ. Thus, cell 1 receives synaptic input and has the value

$$x_1(\varphi) = \tan(\varphi - \pi/2) + g.$$

The next thing to occur is the adaptation of cell 1, which occurs at a time t_a, leaving cell 1 with the value

$$x_2(\varphi) = \tan(t_a - \varphi + \arctan(x_1)) - g_a.$$

At t = φ + t_a, cell 2 receives adaptation, so that its value is

$$y_1 = \tan(t_a - \pi/2) - g_a = -(g_a + \cot t_a).$$

Next, cell 1 fires at

$$T_1(\varphi) = \pi/2 - \arctan(x_2(\varphi)) + t_a,$$

which induces a phase shift on cell 2:

$$y_2(\varphi) = \tan[T_1(\varphi) - \varphi - t_a + \arctan y_1] + g.$$

Finally, cell 2 fires at T_2 = T_1 + π/2 − arctan y_2(φ). Thus, the new timing difference is

$$\varphi' = T_2 - T_1 = \pi/2 - \arctan y_2(\varphi) \equiv M(\varphi).$$

Thus, the map M is the composition of several readily computable maps.

Acknowledgments
The research reported in this article was supported by the National Institute of Mental Health and the National Science Foundation. We thank Alex Reyes for kindly providing the data for Figure 8.
References

Connor, J. A., Walter, D., & McKown, R. (1977). Neural repetitive firing: Modification of the Hodgkin-Huxley axon suggested by experimental results from crustacean axons. Biophys. J., 18, 81–102.
Crook, S. M., Ermentrout, G. B., & Bower, J. M. (1998). Spike frequency adaptation affects the synchronization properties of networks of cortical oscillators. Neural Comput., 10, 837–854.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comput., 8, 979–1001.
Ermentrout, B. (1998). Linearization of F-I curves by adaptation. Neural Comput., 10, 1721–1729.
Ermentrout, G. B., & Kopell, N. K. (1984). Frequency plateaus in a chain of weakly coupled oscillators I. SIAM J. Math. Analysis, 15, 215–237.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Ermentrout, G. B., & Kopell, N. K. (1991). Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biology, 29, 195–217.
Ermentrout, G. B., & Kopell, N. K. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci., 95, 1259–1264.
Gray, C. M., Engel, A. K., Konig, P., & Singer, W. (1992). Synchronization of oscillatory neuronal responses in cat striate cortex: Temporal properties. Vis. Neurosci., 8, 337–347.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337.
Hoppensteadt, F., & Izhikevich, E. (1997). Weakly connected neural networks. Berlin: Springer-Verlag.
Izhikevich, E. M. (1999). Class 1 neural excitability, conventional synapses, weakly connected networks, and mathematical foundations of pulse-coupled models. IEEE Trans. Neural Networks, 10, 499–507.
Izhikevich, E. M. (2000). Neural excitability, spiking, and bursting. International Journal of Bifurcation and Chaos, 10, 1171–1267.
Kopell, N., Ermentrout, G. B., Whittington, M. A., & Traub, R. D. (2000). Gamma rhythms and beta rhythms have different synchronization properties. Proc. Natl. Acad. Sci. USA, 97, 1867–1872.
Kuramoto, Y. (1982). Chemical oscillations, waves, and turbulence. New York: Springer-Verlag.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons in the neocortex. J. Neurophysiol., 54, 782–806.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Reyes, A. D., & Fetz, E. E. (1993a). Effects of transient depolarizing potentials on the firing rate of cat neocortical neurons. J. Neurophysiol., 69, 1673–1683.
Reyes, A. D., & Fetz, E. E. (1993b). Two modes of interspike interval shortening by brief transient depolarizations in cat neocortical neurons. J. Neurophysiol., 69, 1661–1672.
Rinzel, J. M., & Ermentrout, G. B. (1998). Analysis of neuronal excitability. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed.). Cambridge, MA: MIT Press.
Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (1999). Fast oscillations in cortical circuits. Cambridge, MA: MIT Press.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. New York: Cambridge University Press.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
van Vreeswijk, C., & Hansel, D. (in press). Patterns of synchrony in neural networks with spike adaptation. Neural Computation.
Wang, X. J. (1998). Calcium coding and adaptive temporal computation in cortical pyramidal neurons. J. Neurophysiol., 79, 1549–1566.
Williams, T. L., & Bowtell, G. (1997). The calculation of frequency-shift functions for chains of coupled oscillators, with application to a network model of the lamprey locomotor pattern generator. J. Comput. Neurosci., 4, 47–55.

Received January 25, 2000; accepted August 1, 2000.
LETTER
Communicated by Jonathan Victor
A Unified Approach to the Study of Temporal, Correlational, and Rate Coding

Stefano Panzeri
Neural Systems Group, Department of Psychology, University of Newcastle upon Tyne, Newcastle upon Tyne, U.K.

Simon R. Schultz
Howard Hughes Medical Institute and Center for Neural Science, New York University, New York, NY 10003, U.S.A.

We demonstrate that the information contained in the spike occurrence times of a population of neurons can be broken up into a series of terms, each reflecting something about potential coding mechanisms. This is possible in the coding regime in which few spikes are emitted in the relevant time window. This approach allows us to study the additional information contributed by spike timing beyond that present in the spike counts and to examine the contributions to the whole information of different statistical properties of spike trains, such as firing rates and correlation functions. It thus forms the basis for a new quantitative procedure for analyzing simultaneous multiple neuron recordings and provides theoretical constraints on neural coding strategies. We find a transition between two coding regimes, depending on the size of the relevant observation timescale. For time windows shorter than the timescale of the stimulus-induced response fluctuations, there exists a spike count coding phase, in which the purely temporal information is of third order in time. For time windows much longer than the characteristic timescale, there can be additional timing information of first order, leading to a temporal coding phase in which timing information may affect the instantaneous information rate. In this new framework, we study the relative contributions of the dynamic firing rate and correlation variables to the full temporal information, the interaction of signal and noise correlations in temporal coding, synergy between spikes and between cells, and the effect of refractoriness.
We illustrate the utility of the technique by analyzing a few cells from the rat barrel cortex.

1 Introduction
Neural Computation 13, 1311–1349 (2001) © 2001 Massachusetts Institute of Technology

At the most fundamental level, information about sensory stimulation is represented in the central nervous system by the spike emission times of
populations of neurons. In principle, the temporal pattern of spikes across the neuronal population provides a large capacity for fast information transmission (MacKay & McCulloch, 1952). It is still unclear how much of this theoretical capacity is actually exploited by the brain. It has long been known that a substantial amount of sensory information is carried by the discharge rate of individual neurons (Adrian, 1926). In some circumstances, however, if the stimulus is modulated on a very short timescale, precisely replicable sequences of spikes can be obtained (Bair & Koch, 1996; Buracas, Zador, DeWeese, & Albright, 1998). Does this represent temporal coding, or is the relevant time window for counting spikes merely very short? Recent analyses have also suggested that temporally coded information is present in the spike trains of individual neurons in the monkey visual cortex under more general stimulation conditions (Victor & Purpura, 1996, 1998; Mechler, Victor, Purpura, & Shapley, 1998). Evidence has also accrued that some information appears to be encoded by stimulus- (or behavior-) related changes in the coordination of timing of firing between small populations of cortical cells (Vaadia et al., 1995; deCharms & Merzenich, 1996; Riehle, Grun, Diesmann, & Aertsen, 1997). Given that our understanding of how spike trains are decoded biophysically is far from complete, the rigorous study of the information properties of nerve cells requires that we can quantify spike train information in full generality. Understanding neural coding then means understanding how the different features of the population of spike trains (such as spike counts, correlations, or patterns) contribute to the full temporal information. This provides a powerful constraint on the biophysical decoding that occurs.
For instance, if no significant information about a stimulus is present in feature X of the spike trains, then none can be decoded by recipient neuronal pools by using "X detection." To study these questions, it is of interest to quantify information using a rigorous measure such as mutual information (Shannon, 1948; Cover & Thomas, 1991). In this context, Shannon mutual information measures the extent to which observing a spike train (or a number of spike trains from several cells) reduces the uncertainty as to which external stimulus was present; it provides a bound on how well the stimuli can be discriminated. Equivalently, it measures the fidelity of coding on a trial-by-trial basis: how reproducibly the responses on individual trials represent the stimulus. If the spike times are observed with finite temporal precision $\Delta t$, and if $\Delta t$ is small enough such that all timing fluctuations below that timescale are not influenced by stimulation, then a binary string can be formed (with bin width $\Delta t$) that contains all of the information present in the spike train. Direct estimation of the full temporal information is in principle possible by measuring the frequency of occurrence of all possible patterns of ones and zeros ("words") that the string can take from the experimental data (Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998; de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997). A direct approach is particularly useful
because it does not make any assumptions about what are the important aspects of the spike train and is thus completely general. However, direct estimation by "brute force" estimation of frequencies is problematic because of the large data sizes required (growing exponentially with word length). Despite this limitation, a recent study has succeeded in directly quantifying full temporal information from an alert animal, in part because of the low spiking rates obtained under the stimulation conditions used (Buracas et al., 1998). Another approach to direct calculation of the information is to expand the mutual information as a Taylor series in the experimental time window (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Skaggs, McNaughton, Gothard, & Markus, 1993; Panzeri, Biella, Rolls, Skaggs, & Treves, 1996; Panzeri, Schultz, Treves, & Rolls, 1999). This is the approach taken here. In this article we extend previous work based on the spike count response for a small population of cells (Panzeri et al., 1999) to present for the first time an analytical expression for the full temporal information in a population of spike trains to second order in the time window. This enables a natural separation of the contributions of both instantaneous firing rates and temporal correlations between spikes to the complete information. It also allows comparison with the information carried by the number of spikes fired by each cell, which can be expressed in an equivalent form. This provides theoretical insight into the individual factors that determine the coding capabilities of neural spike trains. It also provides a procedure for neurophysiological data analysis that is superior to brute force frequency estimation of the information with respect to data size requirements.
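The direct, word-frequency estimate described above can be sketched in a few lines. This is an illustrative implementation of the general "brute force" method, not the authors' code; the function name and data layout are our own assumptions.

```python
# Illustrative sketch of the "brute force" direct method: bin each trial's
# spike train at resolution dt into a binary word, estimate word frequencies
# per stimulus, and apply Shannon's mutual information formula.
import math
from collections import Counter

def direct_information(words_by_stimulus, p_stimulus):
    """words_by_stimulus: dict mapping stimulus -> list of binary-word tuples,
    one tuple of 0s and 1s per trial. p_stimulus: dict stimulus -> P(s).
    Returns the estimated information in bits."""
    # Empirical conditional distributions P(word | s)
    p_w_given_s = {s: {w: c / len(words) for w, c in Counter(words).items()}
                   for s, words in words_by_stimulus.items()}
    # Marginal P(word) = sum_s P(s) P(word | s)
    p_w = Counter()
    for s, dist in p_w_given_s.items():
        for w, p in dist.items():
            p_w[w] += p_stimulus[s] * p
    # I = sum_s P(s) sum_w P(w|s) log2[ P(w|s) / P(w) ]
    return sum(p_stimulus[s] * p * math.log2(p / p_w[w])
               for s, dist in p_w_given_s.items() for w, p in dist.items())
```

With a limited number of trials this estimator suffers exactly the data-size problem noted in the text: the number of possible words grows exponentially with word length, so the empirical frequencies (and hence the information) are poorly estimated.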
The use of such a Taylor series approach requires that the experimental time window (which should not be confused with the smaller bin width used to discretize the spike train) be short enough that only a few spikes are emitted in its duration. This short time window limit, although a restriction, is relevant to the transmission of sensory information in the mammalian cortex. Single-unit recording studies in primates have demonstrated that the majority of information about static visual stimuli is often transmitted in windows as short as 20 to 50 ms (Tovée, Rolls, Treves, & Bellis, 1993; Heller, Hertz, Kjaer, & Richmond, 1995; Rolls, Tovée, & Panzeri, 1999). Information about time-dependent signals is often conveyed by single sensory cells by producing about one spike per characteristic time of stimulus variation (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1996). Event-related potential studies of the human visual system (Thorpe, Fize, & Marlot, 1996) provide further evidence that the processing of information in a multiple-stage neural system can be extremely rapid. The periods during which transcranial magnetic stimulation disrupts processing in early visual cortex have been found to be as short as 40 ms (Corthout, Uttl, Walsh, Hallett, & Cowey, 1999). Finally, the assessment of the information content of interacting assemblies, which may last for only a few tens of milliseconds (Singer et al., 1997), requires the use of such short windows.
The article is organized as follows. In section 2, the problem is defined, and the series expansion of the information is explained. In section 3, the necessary rate and correlation parameters are introduced, and the probabilities of each possible response are expressed in terms of these parameters. Section 4 gives the main result of the article: the analytical expression obtained by substituting these probabilities back into the equation for mutual information. In section 5, we examine the conditions under which this approach is valid. Section 6 uses these results to study the role of response timescales in neural coding; two distinct coding regimes are found, depending on the relative stimulus and response characteristic timescales. Temporal encoding may contribute significantly when the mean instantaneous firing rates of the cells fluctuate on a timescale shorter than the time window in which most of the information about the stimulus is transmitted. Section 7 examines the precision with which spike times may code information. In section 8, we study the conditions under which one finds synergistic relationships between spikes and between cells in an assembly. This includes an analysis of the effect of a refractory period on the spike train information. In section 9, we illustrate the method by applying it to spike trains recorded in the rat barrel cortex. Finally, in section 10, we discuss the consequences of the results and their relationship to other work.
2 The Information Carried by Neural Spike Trains
Consider a time period of duration $T$, associated with a dynamic or static sensory stimulus, during which we observe the activity of $C$ cells. This period of physiological observation has associated with it a period (which might be earlier by some "lag" time, but which we will consider to have the same length) in which we also observe the characteristics of an external correlate, such as a sensory stimulus, which might be influencing the cells' behavior. Let us denote each different stimulus history (time course of characteristics within $T$) as $s(t)$, a function of time from stimulus onset chosen from the set $\mathcal{S}$ of experimentally presented stimulus histories. We shall describe the neuronal population response to the stimulus by the collection of spike arrival times $\{t_i^a\}$, each $t_i^a$ being the time of occurrence of the $i$th spike emitted by the $a$th neuron. Although the spike arrival time is a continuous variable, it can be experimentally measured only with finite precision $\Delta t$. Let us divide the time window $T$ into small bins of width $\Delta t$ (in which at most one spike per cell is observed). With complete generality, we can represent the spike sequence $\{t_i^a\}$ as a sequence of binary digits, one for each time bin and cell, with the response 1 in the bins corresponding to the spike times and 0 in all other bins. We denote the probability of observing a spike sequence $\{t_i^a\}$ when a particular stimulus history $s(t)$ was present as $P(\{t_i^a\}|s(t))$.$^1$ $P(\{t_i^a\}) = \langle P(\{t_i^a\}|s(t))\rangle_s$ is its average across all stimulus histories. We determine $P(\{t_i^a\}|s(t))$ by repeating each stimulus history in exactly the same way on many trials and observing the responses. Following Shannon (1948), we can write down the mutual information provided by the spike trains about the whole set of stimuli as

$$
I(\{t_i^a\}; \mathcal{S}) = \int \mathcal{D}s \, p[s(t)] \int \mathcal{D}t_i^a \, p\left[\{t_i^a\}|s(t)\right] \log_2 \frac{p\left[\{t_i^a\}|s(t)\right]}{p(\{t_i^a\})}
$$
$$
\equiv \sum_{s(t) \in \mathcal{S}} P[s(t)] \sum_{\{t_i^a\}} P\left[\{t_i^a\}|s(t)\right] \log_2 \frac{P\left[\{t_i^a\}|s(t)\right]}{P(\{t_i^a\})}. \tag{2.1}
$$
This notation can be read as follows. In the first, more general form of the equation, we use the functional integral notation $\int \mathcal{D}s$ to indicate that in principle, a continuous set of stimulus histories can be created. In practice, a discrete and often very limited set of stimuli is usually used to test the cell; hence, we replace the functional integral with a summation over a discrete set of stimuli. For brevity, the $t$ will henceforth be dropped unless there is a particular need to stress the dynamic nature of the stimulus function. The notation $\int \mathcal{D}t_i^a$ indicates integration over all spike times $t_i^a$, $i = 1 \ldots n_a$, $a = 1 \ldots C$, and summation over all total spike counts from the population. $n_a$ is the number of spikes emitted by cell $a$. This notation is very similar to that used in Rieke et al. (1996). In the limit of infinite precision, $\int \mathcal{D}t_i^a$ may be interpreted as a simple Riemannian integral; the results we present in this article are valid in this continuous limit. More usually we will invoke finite precision $\Delta t$ and interpret the integral as a summation over all time bins. In addition to studying the information contained in the sequence of spike times, we can also quantify the information contained in the response space defined by only the number of spikes emitted by each cell in the time window. We can form a $C$-dimensional vector $\mathbf{n}$, each component of which is $n_a$, the number of spikes fired by the respective neuron in the experimental time window. The spike count information can be written

$$
I(\mathbf{n}; \mathcal{S}) = \sum_{s \in \mathcal{S}} P(s) \sum_{\mathbf{n}} P(\mathbf{n}|s) \log_2 \frac{P(\mathbf{n}|s)}{P(\mathbf{n})}. \tag{2.2}
$$
This information quantity is exactly that calculated in Panzeri et al. (1999). Invoking the data processing inequality, it is relatively obvious that $I(\mathbf{n}; \mathcal{S}) \le I(\{t_i^a\}; \mathcal{S})$.
$^1$ We use $P(\cdot)$ to indicate a probability and $p(\cdot)$ a probability density.
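Equation 2.2 is straightforward to evaluate once the conditional count distributions have been tabulated. A minimal sketch (the function name and table layout are illustrative assumptions, not from the paper):

```python
# Minimal evaluation of equation 2.2, the spike count information I(n; S).
# In practice the probability tables would be estimated from repeated trials.
import math

def spike_count_information(p_s, p_n_given_s):
    """p_s: dict stimulus -> P(s).
    p_n_given_s: dict stimulus -> dict mapping count vector n -> P(n|s).
    Returns I(n; S) in bits."""
    # Marginal P(n) = sum_s P(s) P(n|s)
    p_n = {}
    for s, dist in p_n_given_s.items():
        for n, p in dist.items():
            p_n[n] = p_n.get(n, 0.0) + p_s[s] * p
    # I(n; S) = sum_s P(s) sum_n P(n|s) log2[ P(n|s) / P(n) ]
    return sum(p_s[s] * p * math.log2(p / p_n[n])
               for s, dist in p_n_given_s.items()
               for n, p in dist.items() if p > 0.0)
```

By the data processing inequality noted above, the value returned here can never exceed the full temporal information of equation 2.1 computed on the same responses.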
Now, the spike train information can be approximated by a power series,

$$
I(\{t_i^a\}; \mathcal{S}) = I_t(\{t_i^a\}; \mathcal{S})\, T + I_{tt}(\{t_i^a\}; \mathcal{S})\, \frac{T^2}{2} + \cdots, \tag{2.3}
$$
where $I_t(\cdot)$ and $I_{tt}(\cdot)$ refer to the first and second time derivatives of the information, respectively. As we shall see, this becomes effectively an expansion in a dimensionless quantity: the number of spikes fired in the time window. For this approximation to be valid, the information function must be analytic in $T$, and the series must converge within a few terms. Both of these issues will be addressed in section 5. The time derivatives of the information can be calculated by taking advantage of the narrowness of the time windows, as will be explained.

3 Correlation Functions and Response Probability
Before specifying the expressions for the response probabilities, it is necessary to define some response parameters. Consider first the discrete time resolution case (finite $\Delta t$). For each individual trial with stimulus $s$, the spike train density of each cell can be represented as a sum of pulses,

$$
r_a(t; s) = \sum_i \frac{\delta_{t,\, t_i^a}}{\Delta t}, \tag{3.1}
$$
where $\delta_{t_1, t_2}$ is the Kronecker delta function (1 if $t_1$ and $t_2$ label the same time bin, and zero otherwise). $i$ in the above indexes the spike number. The time-dependent firing rate is measured as the average of this quantity over all experimental trials with the same stimulus $s$. We denote this trial average by a bar, and the firing rate is thus $\bar{r}_a(t; s)$. Note that this mean rate function is sometimes called the poststimulus time histogram. It is also necessary to introduce parameters describing correlations among spikes. (Note that we use the word correlation for both autocorrelation and cross-correlation.) Now, it is apparent that for physically plausible processes, $\bar{r}$ will remain constant as $\Delta t \to 0$. If the mean firing rate is a differentiable function of time, then for sufficiently short bin widths, reducing $\Delta t$ will just reduce the probability of observing a spike accordingly. We would like to retain this property for our correlation measure as well, so that when we write out the series expansion of the information equation, there will not be any dependence on time hidden inside the response parameters. Such hidden dependence would prevent a simple intuition of the dependence of the relative magnitudes of the information terms on the timescale and would, we believe, be inelegant. The Pearson product moment, for instance, can be shown to approach zero for short time windows (Panzeri et al., 1999). The measure we choose is the scaled correlation density. We can write the scaled noise correlation density, which measures correlations in
the response variability upon repeated trials of the same stimulus,$^2$ as

$$
\gamma_{ab}(t_i^a, t_j^b; s) = \frac{\overline{r_a(t_i^a; s)\, r_b(t_j^b; s)}}{\bar{r}_a(t_i^a; s)\, \bar{r}_b(t_j^b; s)} - 1 \quad \text{if } a \ne b \text{ or } t_i^a \ne t_j^b,
$$
$$
\gamma_{aa}(t_i^a, t_i^a; s) = -1. \tag{3.2}
$$
The numerator of the first term indicates the average, over trials in which the same stimulus $s$ is presented, of the product of the spike densities of cell $a$ at time $t_i^a$ and cell $b$ at time $t_j^b$. Note that the scaled correlation density is simply a joint poststimulus time histogram from which the number of coincidences purely due to rate modulation has been subtracted (implemented by the $-1$ in equation 3.2). Now we can introduce the above correlation parameters in terms of the conditional firing probabilities of observing one spike from cell $a$ in the time bin centered at $t_i^a$, given that cell $b$ emitted a spike in the time bin centered at $t_j^b$, when stimulus $s$ was presented:

$$
P(t_i^a \,|\, t_j^b; s) \equiv \bar{r}_a(t_i^a; s)\, \Delta t \left[ 1 + \gamma_{ab}(t_i^a, t_j^b; s) \right] + O(\Delta t^2). \tag{3.3}
$$
This assumes that the conditional probabilities (see equation 3.3) scale proportionally to $\Delta t$. It is a natural assumption because it merely implies that the probability of observing a spike in a time bin is proportional to the resolution $\Delta t$ of the measurement. It is violated only in the implausible and nonphysical case of spikes locked to one another with infinite time precision. It can, of course, be checked for any given data set; this will be carried out in section 5. Note that quadratic- (and higher-) order terms in $\Delta t$ were neglected from the above relationship, as they affect only third- and higher-order terms of the information. We will find it convenient to measure another type of correlation also: signal correlation, that is, correlation in the mean responses of the neurons across the set of stimuli. These correlations can also be thought of as correlations in the tuning curves of the neurons. For homogeneity, we will also quantify these by a scaled correlation density:

$$
\nu_{ab}(t_i^a, t_j^b) = \frac{\left\langle \bar{r}_a(t_i^a; s)\, \bar{r}_b(t_j^b; s) \right\rangle_s}{\left\langle \bar{r}_a(t_i^a; s) \right\rangle_s \left\langle \bar{r}_b(t_j^b; s) \right\rangle_s} - 1. \tag{3.4}
$$
In equation 3.4, the brackets $\langle \cdot \rangle_s$ can be taken to indicate the average across stimuli, $\int \mathcal{D}s\, p(s)\, \cdot$ or $\sum_s P(s)\, \cdot$. Both $\gamma$ and $\nu$ may range from $-1$ to $\infty$, with zero indicating lack of correlation.

$^2$ We adopt the convention, used by a number of authors, of using the term noise correlation to mean correlation at fixed signal, thus distinguishing it from correlation across different signals, a concept we also require.
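Equations 3.2 and 3.4 can be computed numerically from binned spike-density arrays. The sketch below is our own illustration (array layouts and function names are assumptions, not the authors' code):

```python
# Sketch of the scaled noise (equation 3.2) and signal (equation 3.4)
# correlation densities from binned spike-density arrays.
import numpy as np

def scaled_noise_correlation(ra, rb):
    """gamma_ab(t_i, t_j; s) for one stimulus s.
    ra, rb: (n_trials, n_bins) single-trial spike densities of cells a and b.
    Returns an (n_bins, n_bins) matrix."""
    joint = np.einsum('ki,kj->ij', ra, rb) / ra.shape[0]  # trial average of r_a(t_i) r_b(t_j)
    denom = np.outer(ra.mean(0), rb.mean(0))              # product of the mean rates
    with np.errstate(divide='ignore', invalid='ignore'):
        return joint / denom - 1.0

def scaled_signal_correlation(mean_a, mean_b, p_s):
    """nu_ab(t_i, t_j) across stimuli.
    mean_a, mean_b: (n_stimuli, n_bins) trial-averaged rates r_a(t_i; s).
    p_s: (n_stimuli,) stimulus probabilities P(s)."""
    num = np.einsum('s,si,sj->ij', p_s, mean_a, mean_b)   # <r_a r_b>_s
    den = np.outer(p_s @ mean_a, p_s @ mean_b)            # <r_a>_s <r_b>_s
    return num / den - 1.0
```

For cells whose mean rate profiles are identical across stimuli, $\nu$ vanishes everywhere, and with trial-independent firing the estimated $\gamma$ fluctuates around zero.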
When passing to the high-resolution limit ($\Delta t \to 0$), the same definitions outlined above apply, provided that we replace the Kronecker delta function with that of Dirac in the definition of spike density,

$$
r_a(t; s) = \sum_i \delta(t - t_i^a), \tag{3.5}
$$

and use probability densities instead of probabilities in equation 3.3. Let us now consider the case of the spike count response parameters. The correlational parameters that influence the spike count information are the scaled correlation coefficients of the spike counts in each trial (Panzeri et al., 1999). These are obtained by summation over time bins (or integration in the high-resolution limit) of the above expressions. The spike count scaled noise correlation coefficient can be written as

$$
\gamma_{ab}(s) = \frac{\int dt_i^a \int dt_j^b\, \bar{r}_a(t_i^a; s)\, \bar{r}_b(t_j^b; s) \left[ 1 + \gamma_{ab}(t_i^a, t_j^b; s) \right]}{\left[ \int dt_i^a\, \bar{r}_a(t_i^a; s) \right] \left[ \int dt_j^b\, \bar{r}_b(t_j^b; s) \right]} - 1. \tag{3.6}
$$

Similarly, signal correlation is quantified as

$$
\nu_{ab} = \frac{\left\langle \int dt_i^a\, \bar{r}_a(t_i^a; s) \int dt_j^b\, \bar{r}_b(t_j^b; s) \right\rangle_s}{\left\langle \int dt_i^a\, \bar{r}_a(t_i^a; s) \right\rangle_s \left\langle \int dt_j^b\, \bar{r}_b(t_j^b; s) \right\rangle_s} - 1. \tag{3.7}
$$

Equations 3.6 and 3.7 are a simple renotation of the definitions in Panzeri et al. (1999) to fit with the requirements of the full temporal notation in this article. The finite resolution expressions for the above are of course obtained by replacing the integrals with summations:

$$
\lim_{\Delta t \to 0} \sum_n f(t_n)\, \Delta t = \int dt\, f(t). \tag{3.8}
$$
If the conditional instantaneous firing rates are nondivergent, as assumed in equation 3.3, then the short timescale expansion of response probabilities becomes essentially an expansion in the total number of spikes emitted by the population in response to a stimulus. The only responses that contribute to the transmitted information up to order $k$ are those with up to $k$ spikes from the population. This is proved in appendix A. The only relevant events for the second-order analysis are therefore those with no more than two spikes emitted in total, and they can be truncated at second order without affecting the first two information derivatives. This is valid for any time resolution. For brevity, we report the results only for the infinite time resolution case. The resulting expression (see appendix A) is

$$
P(0|s) = 1 - \sum_{a=1}^{C} \int dt_1^a\, \bar{r}_a(t_1^a; s) + \frac{1}{2} \sum_{a=1}^{C} \sum_{b=1}^{C} \int dt_1^a \int dt_2^b\, \bar{r}_a(t_1^a; s)\, \bar{r}_b(t_2^b; s) \left[ 1 + \gamma_{ab}(t_1^a, t_2^b; s) \right]
$$

$$
p(t_1^a|s)\, dt_1^a = \bar{r}_a(t_1^a; s)\, dt_1^a \left( 1 - \sum_{b=1}^{C} \int dt_2^b\, \bar{r}_b(t_2^b; s) \left[ 1 + \gamma_{ab}(t_1^a, t_2^b; s) \right] \right) \quad a = 1, \cdots, C
$$

$$
p(t_1^a t_2^b|s)\, dt_1^a\, dt_2^b = \frac{1}{2}\, \bar{r}_a(t_1^a; s)\, \bar{r}_b(t_2^b; s) \left[ 1 + \gamma_{ab}(t_1^a, t_2^b; s) \right] dt_1^a\, dt_2^b \quad a, b = 1, \cdots, C, \tag{3.9}
$$
where $P(0|s)$ is the probability of zero response (no cells fire), $p(t_1^a|s)$ is the probability density of observing just one spike from cell $a$ at the specified time location, and $p(t_1^a t_2^b|s)$ is the probability density of observing just a pair of spikes at the given times.

4 Analytical Results
We now insert the second-order response probabilities (see equation 3.9) into the expressions for the information contained in the sequence of spike times (see equation 2.1) and the spike counts (see equation 2.2). Then, for each term in the sum over responses, we use the power expansion of the logarithm as a function of $T$,

$$
\log_2(1 - Tx) = -\frac{1}{\ln 2} \sum_{j=1}^{\infty} \frac{(Tx)^j}{j}, \tag{4.1}
$$
and, after integrating over spike times, group together all terms in the sum that have the same power of $T$. Equating these with equation 2.3 yields expressions for the information derivatives. These expressions depend only on the time-dependent firing rates $\bar{r}$, the noise correlations $\gamma$, and the signal correlations $\nu$. As shown in appendix A, the power expansion in the short time window length $T$ is related to the expansion in the total number of spikes emitted by the population. Therefore, although the formalism is for simplicity developed as an expansion in $T$, the expansion parameter is in fact the adimensional quantity representing the average number of spikes emitted by the population in the window $T$. The first-order contribution ($T$ times the instantaneous information rate) to the full temporal information from a population of spike trains is (Bialek et al., 1991)

$$
I_t(\{t_i^a\}; \mathcal{S})\, T = \sum_{a=1}^{C} \int dt^a \left\langle \bar{r}_a(t^a; s) \log_2 \frac{\bar{r}_a(t^a; s)}{\left\langle \bar{r}_a(t^a; s') \right\rangle_{s'}} \right\rangle_s. \tag{4.2}
$$
This is simply a sum of single-cell contributions. It is insensitive to both signal and noise correlation.
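For a single cell, equation 4.2 reduces to a rate-weighted log ratio that is easy to evaluate from a table of stimulus-conditional rates; a sketch under our own naming assumptions (not the authors' code):

```python
# Sketch of the first-order term (equation 4.2) for a single cell, computed
# from a table of stimulus-conditional mean firing rates.
import numpy as np

def first_order_information(rates, p_s, dt):
    """rates: (n_stimuli, n_bins) mean rates r(t; s) in spikes/s.
    p_s: (n_stimuli,) stimulus probabilities. dt: bin width in seconds.
    Returns I_t * T in bits."""
    mean_rate = p_s @ rates                       # <r(t; s')>_{s'}
    with np.errstate(divide='ignore', invalid='ignore'):
        integrand = rates * np.log2(rates / mean_rate)
    integrand = np.nan_to_num(integrand)          # treat 0 * log 0 as 0
    return dt * np.sum(p_s @ integrand)           # time integral and <.>_s
```

Because this first-order term is a sum of single-cell contributions, the population value is obtained simply by adding the values returned for each cell; no correlation parameters enter.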
The expression for the second-order contribution (the second temporal derivative multiplied by $T^2/2$) breaks up into three terms:

$$
I_{tt}(\{t_i^a\}; \mathcal{S})\, \frac{T^2}{2} = \frac{1}{2 \ln 2} \sum_{a=1}^{C} \sum_{b=1}^{C} \int dt_1^a \int dt_2^b \left\langle \bar{r}_a(t_1^a; s) \right\rangle_s \left\langle \bar{r}_b(t_2^b; s) \right\rangle_s \left\{ \nu_{ab}(t_1^a, t_2^b) + \left[ 1 + \nu_{ab}(t_1^a, t_2^b) \right] \ln \frac{1}{1 + \nu_{ab}(t_1^a, t_2^b)} \right\}
$$

$$
+ \frac{1}{2} \sum_{a=1}^{C} \sum_{b=1}^{C} \int dt_1^a \int dt_2^b \left\langle \bar{r}_a(t_1^a; s)\, \bar{r}_b(t_2^b; s)\, \gamma_{ab}(t_1^a, t_2^b; s) \right\rangle_s \log_2 \frac{1}{1 + \nu_{ab}(t_1^a, t_2^b)}
$$

$$
+ \frac{1}{2} \sum_{a=1}^{C} \sum_{b=1}^{C} \left\langle \int dt_1^a \int dt_2^b\, \bar{r}_a(t_1^a; s)\, \bar{r}_b(t_2^b; s) \left[ 1 + \gamma_{ab}(t_1^a, t_2^b; s) \right] \log_2 \frac{\left\langle \bar{r}_a(t_1^a; s')\, \bar{r}_b(t_2^b; s') \right\rangle_{s'} \left[ 1 + \gamma_{ab}(t_1^a, t_2^b; s) \right]}{\left\langle \bar{r}_a(t_1^a; s')\, \bar{r}_b(t_2^b; s') \left[ 1 + \gamma_{ab}(t_1^a, t_2^b; s') \right] \right\rangle_{s'}} \right\rangle_s. \tag{4.3}
$$
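A one-line derivative check (ours, not spelled out in the original) confirms the sign of the first term of equation 4.3: its braced factor is $f(\nu_{ab})$ with

```latex
f(x) = x - (1 + x)\ln(1 + x), \qquad
f'(x) = 1 - \ln(1 + x) - 1 = -\ln(1 + x),
```

so $f'$ is positive on $(-1, 0)$ and negative on $(0, \infty)$: $f$ attains its global maximum $f(0) = 0$, and since the rates and the prefactor $1/(2 \ln 2)$ are nonnegative, the whole term is nonpositive, vanishing only when the signal correlation is zero.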
We will refer to these terms as components 2a, 2b, and 2c of the information, respectively. When each spike is completely independent, as in a Poisson process, it is apparent that only the first term of equation 4.3 survives. It is easy to see that this term is always less than or equal to zero, since $f(x) = x - (1 + x)\ln(1 + x)$ has a global maximum at $f(0) = 0$: the first term is equal to zero only if the signal correlation is precisely zero. This means that the information accumulation from a Poisson process with $|\nu| > 0$ always slows down after the first spike. If there is any deviation from independence in the timing of successive spikes, then the other terms can contribute to the information through nonzero autocorrelation density $\gamma_{aa}(t_1^a, t_2^a)$. If there is any relationship between the times of spike emission of different cells, then they can contribute through nonzero cross-correlation $\gamma_{ab}(t_1^a, t_2^b)$. As with the spike count information terms detailed in Panzeri et al. (1999), the second of these terms (the stimulus-independent correlational component) reflects contributions from a level of correlation that is not stimulus dependent. The third term is nonnegative and is nonzero only in the presence of stimulus dependence of the correlation between spikes; it is called the stimulus-dependent correlational component. The natural separation of the second-order information into three components is important because each component reflects
the contribution of a different relevant encoding mechanism. When applying this analysis to real neuronal data, a significant amount of information found in the third component of $I_{tt}$, relative to the total information, would clearly signal that cells are transmitting information mainly by participating in a stimulus- (or context-) dependent correlational assembly (Singer et al., 1997). A specific example will be given in Figure 5. In the same way, the short timescale expansion can be carried out for the spike count information. This was performed in detail in Panzeri et al. (1999), and here we report briefly the main results to cast them into identical notation to that above for comparison. The spike count information derivatives are similar to equation 4.3, but they depend only on the mean rate across time and on the spike count correlation coefficients (see equations 3.6 and 3.7). The first-order contribution is:

$$
I_t(\mathbf{n}; \mathcal{S})\, T = \sum_{a=1}^{C} \left\langle \int dt^a\, \bar{r}_a(t^a; s) \log_2 \frac{\int dt^a\, \bar{r}_a(t^a; s)}{\left\langle \int dt^a\, \bar{r}_a(t^a; s') \right\rangle_{s'}} \right\rangle_s. \tag{4.4}
$$
As before, the second-order contribution is broken into three components:

$$
I_{tt}(\mathbf{n}; \mathcal{S})\, \frac{T^2}{2} = \frac{1}{2 \ln 2} \sum_{a=1}^{C} \sum_{b=1}^{C} \left\langle \int dt_1^a\, \bar{r}_a(t_1^a; s) \right\rangle_s \left\langle \int dt_2^b\, \bar{r}_b(t_2^b; s) \right\rangle_s \left[ \nu_{ab} + (1 + \nu_{ab}) \ln \frac{1}{1 + \nu_{ab}} \right]
$$

$$
+ \frac{1}{2} \sum_{a=1}^{C} \sum_{b=1}^{C} \left\langle \left( \int dt_1^a\, \bar{r}_a(t_1^a; s) \right) \left( \int dt_2^b\, \bar{r}_b(t_2^b; s) \right) \gamma_{ab}(s) \right\rangle_s \log_2 \frac{1}{1 + \nu_{ab}}
$$

$$
+ \frac{1}{2} \sum_{a=1}^{C} \sum_{b=1}^{C} \left\langle \left( \int dt_1^a\, \bar{r}_a(t_1^a; s) \right) \left( \int dt_2^b\, \bar{r}_b(t_2^b; s) \right) \left[ 1 + \gamma_{ab}(s) \right] \log_2 \frac{\left\langle \int dt_1^a\, \bar{r}_a(t_1^a; s') \int dt_2^b\, \bar{r}_b(t_2^b; s') \right\rangle_{s'} \left[ 1 + \gamma_{ab}(s) \right]}{\left\langle \int dt_1^a\, \bar{r}_a(t_1^a; s') \int dt_2^b\, \bar{r}_b(t_2^b; s') \left[ 1 + \gamma_{ab}(s') \right] \right\rangle_{s'}} \right\rangle_s. \tag{4.5}
$$
5 Limitations
In this section we address the potential limitations of the technique presented here as a procedure for the analysis of real data. There are four assumptions that must be satisfied for the series approximation to be guaranteed to be a good estimate of the true information: (1) that the experimental time window is small, (2) that the conditional firing probability scales with $\Delta t$, (3) that the information is an analytic function of time, and (4) that experimental trials are statistically indistinguishable.
5.1 Assumption 1. The short timescale limit used here formally requires that the mean number of spikes in the time window be small. The actual range of validity of the order $T^2$ approximation will depend on how well the time dependence of the information from the neuronal population fits a quadratic approximation. The range of validity for the spike count series information was checked by simulation in Panzeri et al. (1999). The same range of validity must hold for the temporal information as well when the firing rates and correlations of the neurons vary slowly with respect to the time window considered, since in this limit the additional temporal information is small (see the next section). The additional numerical test required here is therefore on the range of applicability of the second-order approximation for the full spike timing information when the parameters describing the neuronal firing probabilities fluctuate quite rapidly within the time window $T$. For this purpose we simulated neurons driven by two stimuli. The response of each neuron to each stimulus was modeled as a 1 ms resolution Poisson process with a time-dependent firing rate (the response in each 1 ms time bin is generated with a time-dependent firing rate independent of the response in the other time bins). The first stimulus gave a constant (flat) mean response $r_0 = 30$ spikes per second across time, whereas the second stimulus was chosen to produce a response of mean $r_0$, sinusoidally modulated with amplitude $r_0$ at a frequency $1/t_c$. After a full period, the two stimuli are indistinguishable on the basis of spike count alone, and therefore this is a good model for examining temporal coding. The analytical approximations to the information were tested against the true information computed directly from the model response probabilities.
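The simulated model just described can be reproduced in a few lines. This is our own sketch (names and the 200 ms default window are assumptions consistent with the text), not the authors' simulation code:

```python
# Sketch of the simulation model described in the text: two stimuli driving
# a 1 ms resolution Poisson process, one with a flat rate r0 and one with a
# rate oscillating sinusoidally around r0 with amplitude r0.
import numpy as np

def simulate_responses(n_trials, window_ms=200, r0=30.0, freq_hz=500.0, seed=None):
    """Returns two (n_trials, window_ms) binary arrays, one per stimulus."""
    rng = np.random.default_rng(seed)
    dt = 0.001                                              # 1 ms bins, in seconds
    t = (np.arange(window_ms) + 0.5) * dt                   # bin centers
    p1 = np.full(window_ms, r0 * dt)                        # stimulus 1: flat rate
    p2 = r0 * (1.0 + np.sin(2 * np.pi * freq_hz * t)) * dt  # stimulus 2: oscillating
    resp1 = (rng.random((n_trials, window_ms)) < p1).astype(int)
    resp2 = (rng.random((n_trials, window_ms)) < p2).astype(int)
    return resp1, resp2
```

At 500 Hz the oscillation period is 2 ms, so both stimuli yield the same expected spike count over any full period and differ only in spike timing, as the text requires.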
Figure 1 shows the accuracy of the first- and second-order approximations to the information in the spike times for a single cell with responses to the second stimulus oscillating very fast (at a frequency of 500 Hz). The second-order approximation is nearly exact up to 200 ms, even in the extreme case of such fast fluctuations. The range of validity seems not to be significantly affected by changing the frequency of oscillations. We have also carried out simulations with a set of two cells with either Poisson or cross-correlated firing. When simulating two cells, we could not systematically test time windows as long as 200 ms because the number of spike trains over which we have to sum to compute the full information increases exponentially with both the length of the window and the population size. However, we found that the scaling expectation that the range of validity shrinks as $1/C$ with population size is confirmed up to the time window lengths that we could test. The simulations presented here therefore show that the series expansion can be useful for timescales relevant to neuronal coding, for the small ensembles of cells typically recorded during in vivo experimental sessions. In particular, when correlations between spikes are weak, the series expansion is very precise up to hundreds of ms. However, the presence of very strong
[Figure 1 appears here: information in spike times (bits), 0 to 0.5, versus time after stimulus onset (ms), 0 to 200; curves show the true information and its first- and second-order approximations.]
Figure 1: Accuracy of the first- and second-order approximations to the full information in the spike times of a single cell with Poisson responses to two stimuli. The cell responded to stimulus 1 with a constant (in time) rate of 30 spikes per second and to stimulus 2 with a spike rate oscillating sinusoidally around 30 spikes per second with period 2 ms and amplitude 30 spikes per second.
correlations between the spikes can potentially reduce this range of validity. Therefore, the actual range of validity for the data under analysis should be further assessed, for example, by looking at the agreement between the series expansion result for spike counts and the brute force evaluation of the spike count information from response probabilities (see equation 2.2). (The latter, unlike the brute force full temporal information, can usually be computed with an experimentally feasible number of trials.) An example of this check is given in section 9.

5.2 Assumption 2. The second assumption required is that the probability of observing a spike in the time bin centered on $t_i^a$ from cell $a$, given that one has been observed in the time bin centered on $t_j^b$ from cell $b$, scales with $\Delta t$. This is captured by equation 3.3. This assumption would be expected to break down only if there were a significant number of spikes synchronized with near-infinite precision. To illustrate how this scaling assumption can be validated experimentally, we examined the scaling relation between the average value of $P(t_i^a | t_j^b; s)$ and $\Delta t$ for two neurons recorded from the rat barrel cortex. The stimuli
[Figure 2 appears here: conditional firing rate, 0 to 0.018, versus $\Delta t$ (ms), 5 to 25, for the two cells r3e3ch2u3 and r38e4ch5u2.]
Figure 2: Scaling relationship between the average conditional firing rate (computed as the average probability of observing a spike at one time, given that a spike has been observed at another time, divided by the bin width $\Delta t$) and the bin width of observation. The relationship is shown for two cells from the rat barrel cortex.
were upward deflections of one of nine whiskers. (These data were provided by M. E. Diamond and colleagues. For further reference to the data, see Lebedev, Mirabella, Erchova, & Diamond, 2000, and section 9, where the information properties of the same two cells are analyzed with our method.) The result can be seen in Figure 2. We found that the averaged conditional firing rate, computed as the average across all bin pairs and stimuli of $P(t_i^a | t_j^b; s) / \Delta t$, remained approximately constant as $\Delta t$ was decreased, and no divergence of the conditional firing rate was observed for all the resolutions considered. This provides evidence that assumption 2 holds for this data set.
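The scaling check behind Figure 2 can be sketched as follows; this is our own illustrative implementation of the check for a single stimulus condition, assuming binary spike arrays at 1 ms resolution (the averaging over stimuli described in the text would simply repeat this per stimulus):

```python
# Sketch of the assumption 2 check: rebin binary spike trains at several bin
# widths and verify that the mean conditional firing rate P(t_i | t_j; s) / dt
# stays roughly constant as dt shrinks.
import numpy as np

def mean_conditional_rate(resp, dt_ms):
    """resp: (n_trials, n_ms) binary array at 1 ms resolution for one stimulus.
    Returns the mean conditional firing rate (spikes/s) at bin width dt_ms."""
    n_trials, n_ms = resp.shape
    n_bins = n_ms // dt_ms
    # Rebin: a wide bin counts as "spike" if any 1 ms slot within it fired
    binned = resp[:, :n_bins * dt_ms].reshape(n_trials, n_bins, dt_ms).max(axis=2)
    p_j = binned.mean(axis=0)                      # P(spike in bin j)
    joint = (binned.T @ binned) / n_trials         # P(spike in bins i and j)
    np.fill_diagonal(joint, 0.0)                   # exclude the i == j pairs
    valid = p_j > 0
    cond = joint[:, valid] / p_j[valid]            # P(spike in i | spike in j)
    n_valid = int(valid.sum())
    # Average over i != j pairs, then divide by the bin width in seconds
    return cond.sum() / (cond.size - n_valid) / (dt_ms / 1000.0)
```

If the conditional probability scales with $\Delta t$ as equation 3.3 assumes, this quantity is roughly flat across bin widths; a divergence at small $\Delta t$ would flag spikes synchronized with near-infinite precision.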
Temporal, Correlational, and Rate Coding
1325
the mutual information can be computed by brute force from the Shannon formula, equation 2.1, although it can usually be computed only for a more limited time range than when using the series expansion (see section 9). Hence, the information computed by brute force from the Shannon formula can be compared, in some time windows, to the series expansion estimation. This provides a strong check of the validity of this particular assumption and of the convergence of the series expansion method, as fully illustrated in section 9. It is apparent that the assumption is valid for the data we examined here (see section 9). This assumption is more likely to cause problems with artificially constructed models implementing instantaneous correlations between neurons. One could argue that such models are nonphysical in any case. 5.4 Assumption 4. The final assumption required is that experimental trials are statistically indistinguishable. This assumption is, of course, inherent in any neurophysiological data analysis and is taken to be the definition of an experimental trial. Satisfaction of this assumption is the domain of experimental design. 6 Response Timescales and Pure Temporal Coding
Having quantified the information I({t_i^a}; S) in the spike times and I(n; S) in the spike counts as a function of the rates and correlations, we are in a position to study the conditions under which information is not dominated by the spike counts. In other words, we ask, How much extra information, not contained in the spike counts, is conveyed by the temporal relations between spikes? This extra information is precisely I({t_i^a}; S) − I(n; S). We find that the crucial parameter is the typical timescale of variation of the firing rate and correlation functions, relative to the time window T of interest. We will label this typical or characteristic stimulus-induced response timescale τ_c. 6.1 Case 1, large τ_c. If the scale of time variation of any time-dependent function f(t) is large compared to the time window considered, then f(t) can be approximated by its power expansion, and the following quasi-static approximation holds:
\[
\int_{t_0 - T/2}^{t_0 + T/2} f(t)\, dt = f(t_0)\, T
+ \frac{\partial^2}{\partial t^2} f(t)\Big|_{t=t_0} \frac{T^3}{24} + O(T^5). \tag{6.1}
\]
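Equation 6.1 is easy to verify numerically. The sketch below uses illustrative values (a 2 Hz modulation observed through a 10 ms window, so the variation timescale is long compared with T) and compares the exact window integral with the first- and third-order approximations:

```python
import numpy as np

def window_integral(f, t0, T, n=100001):
    """Trapezoidal integral of f over [t0 - T/2, t0 + T/2]."""
    t = np.linspace(t0 - T / 2, t0 + T / 2, n)
    y = f(t)
    return (y.sum() - 0.5 * (y[0] + y[-1])) * (t[1] - t[0])

# Slowly varying "rate" function: 2 Hz modulation, T = 10 ms window
omega = 2 * np.pi * 2.0
f = lambda t: 30.0 + 10.0 * np.sin(omega * t)
t0, T = 0.1, 0.010

exact = window_integral(f, t0, T)
order1 = f(t0) * T                                  # f(t0) T
d2f_t0 = -10.0 * omega ** 2 * np.sin(omega * t0)    # analytic second derivative
order3 = order1 + d2f_t0 * T ** 3 / 24              # + f''(t0) T^3 / 24

err1, err3 = abs(exact - order1), abs(exact - order3)
```

The third-order error is smaller than the first-order error by several orders of magnitude, consistent with the O(T^5) remainder.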
As a consequence, if both rate and correlation functions vary slowly in the time domain of interest, only the rate and correlation values near the center of the interval (t_0) are important, and the extra amount of information in the spike times is subleading (only of order T^3). The expression for the T^3 term in the extra information in spike timing can be computed by applying
equation 6.1 to the integrals over time involved in computing information rates (see equations 4.2 and 4.4). The result depends only on the firing-rate variations near t_0, and not on correlations between the spikes:

\[
I(\{t_i^a\}; S) - I(n; S) =
\frac{T^3}{24 \ln 2} \sum_s P(s)\,
\frac{\partial}{\partial t} r(t; s)\Big|_{t=t_0}
\left(
\frac{(\partial/\partial t)\, r(t; s)|_{t=t_0}}{r(t_0; s)}
- \frac{\big\langle (\partial/\partial t)\, r(t; s')|_{t=t_0} \big\rangle_{s'}}{\big\langle r(t_0; s') \big\rangle_{s'}}
\right). \tag{6.2}
\]
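The nonnegativity claim can be spot-checked numerically. In the sketch below (arbitrary illustrative stimulus probabilities, rates, and rate derivatives), the stimulus sum in equation 6.2 is evaluated for random inputs; it never goes negative, as a Cauchy-Schwarz argument guarantees:

```python
import numpy as np

rng = np.random.default_rng(1)

def extra_info_coefficient(p, r, dr):
    """Sum over s of P(s) r'(t0;s) [ r'(t0;s)/r(t0;s) - <r'>_{s'}/<r>_{s'} ],
    i.e. the coefficient multiplying T^3/(24 ln 2) in equation 6.2."""
    return float(np.dot(p, dr * (dr / r - np.dot(p, dr) / np.dot(p, r))))

results = []
for _ in range(1000):
    p = rng.uniform(0.1, 1.0, 5)
    p /= p.sum()                          # stimulus probabilities P(s)
    r = rng.uniform(1.0, 50.0, 5)         # rates r(t0; s) > 0
    dr = rng.uniform(-100.0, 100.0, 5)    # derivatives of r at t0
    results.append(extra_info_coefficient(p, r, dr))
```

Every value in `results` is nonnegative up to floating-point error.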
Equation 6.2 can be shown to be nonnegative for any rate function, as required by information-theoretical consistency. Being of order T^3, in this regime, pure temporal coding affects neither the rate nor the acceleration of information transmission. Pure temporal information would accumulate very slowly with time. Therefore, under these circumstances, information is dominated by the spike counts. 6.2 Case 2, small τ_c. If either firing rates or correlation functions fluctuate with a characteristic timescale smaller than T, then the quasi-static approximation, equation 6.1, breaks down, and there is a transition to a different coding regime. A possible case may be the presence of a sinusoidal component of period shorter than T in the rate fluctuations. If in particular τ_c is much shorter than T, the integral of the function over the time window does not depend on the value of the function at the center of the interval as in equation 6.1, or on the frequency of response variations, but only on the average value of the function over the period τ_c:
\[
\int_{t_0 - T/2}^{t_0 + T/2} f(t)\, dt =
\left[ \frac{1}{\tau_c} \int_{\tau_c} f(t)\, dt \right] T + O(T^2). \tag{6.3}
\]
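Equation 6.3 can also be checked numerically: for τ_c ≪ T the window integral is set by the period average alone, largely independent of the oscillation frequency and phase. The values below are illustrative (frequencies chosen so the window does not contain a whole number of cycles, so the check is nontrivial):

```python
import numpy as np

def window_integral(f, t0, T, n=200001):
    """Trapezoidal integral of f over [t0 - T/2, t0 + T/2]."""
    t = np.linspace(t0 - T / 2, t0 + T / 2, n)
    y = f(t)
    return (y.sum() - 0.5 * (y[0] + y[-1])) * (t[1] - t[0])

t0, T = 0.05, 0.1
period_average = 30.0                     # mean of f over one period tau_c
errors = []
for freq in (237.0, 613.0):               # tau_c = 1/freq << T = 100 ms
    f = lambda t: 30.0 + 10.0 * np.sin(2 * np.pi * freq * t)
    exact = window_integral(f, t0, T)
    errors.append(abs(exact - period_average * T) / (period_average * T))
```

Both relative errors stay below one percent even though the two frequencies differ by a factor of about 2.6, as equation 6.3 predicts.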
As a consequence, in this limit, the additional purely temporal information is proportional to T (and not to T^3, as in the quasi-static phase):

\[
I(\{t_i^a\}; S) - I(n; S) =
\sum_{a=1}^{C} \left\langle \int dt^a\, r_a(t^a; s)\,
\log_2 \frac{r_a(t^a; s)}{\langle r_a(t^a; s')\rangle_{s'}} \right\rangle_s
- \sum_{a=1}^{C} \left\langle \int dt^a\, r_a(t^a; s)\,
\log_2 \frac{\int dt^a\, r_a(t^a; s)}{\left\langle \int dt^a\, r_a(t^a; s')\right\rangle_{s'}} \right\rangle_s
+ O(T^2). \tag{6.4}
\]
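A minimal numerical illustration of this fast regime (an illustrative sketch with made-up parameter values, not the simulation code behind the figures): two stimuli drive an inhomogeneous Poisson neuron, one with a flat 50 Hz rate and one with a sinusoidal rate of the same mean. Over one full cycle the two spike counts are identically distributed, so I(n; S) = 0 exactly, yet a Monte Carlo estimate of the full timing information is clearly positive:

```python
import numpy as np

rng = np.random.default_rng(2)
dt = 0.001
t = np.arange(0.0, 0.060, dt)                          # one 60 ms cycle
rates = np.vstack([np.full(t.size, 50.0),              # stimulus 1: flat 50 Hz
                   50.0 * (1.0 + np.sin(2 * np.pi * t / 0.060))])  # stimulus 2
p_s = np.array([0.5, 0.5])
p_spike = np.clip(rates * dt, 1e-12, 1 - 1e-12)        # P(spike) per 1 ms bin

# Both rate profiles integrate to the same mean over the full cycle, so the
# two (Poisson) count distributions coincide and the count information is zero.
assert np.allclose(rates[0].sum(), rates[1].sum())

def temporal_info_mc(p_spike, p_s, n_samples=20000):
    """Monte Carlo I({t_i}; S) in bits: average log-likelihood ratio
    log2 P(r|s)/P(r) over binned spike trains drawn from each stimulus."""
    log_p, log_q = np.log(p_spike), np.log(1 - p_spike)
    total = 0.0
    for s, prob in enumerate(p_s):
        draws = rng.random((int(n_samples * prob), p_spike.shape[1])) < p_spike[s]
        r = draws.astype(float)
        loglik = r @ log_p.T + (1.0 - r) @ log_q.T     # log P(r|s') for all s'
        logmix = np.logaddexp.reduce(loglik + np.log(p_s), axis=1)
        total += prob * np.mean(loglik[:, s] - logmix) / np.log(2)
    return total

timing_bits = temporal_info_mc(p_spike, p_s)
```

After one full cycle, all of the information is carried by spike timing, none by the count, mirroring the situation of Figure 3.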
This means that when rates fluctuate rapidly, the actual rate of information transmission (measured in bits per second) can be considerably faster than that with spike counts only. Thus, in this phase, there can be substantial
Figure 3: Comparison of the full temporal and spike count information. (a) Modulation of the firing rates of a single nonhomogeneous Poisson neuron by two stimuli. (b) Evolution of the full temporal and the spike count information as the time window is increased in width from the first instance of stimulus-related responses.
pure temporal information, and it may even be dominant with respect to the spike count information. We studied particular cases by simulating mean firing-rate and correlation parameters for each time-step as described in section 5. Two stimuli were again used, one inducing a flat-rate response function and the second a sinusoidally modulated function of time. The following results were obtained with 1 ms time resolution, but have also been confirmed by numerical integration in the infinite time resolution limit. The effect of timing precision is considered separately. Figure 3 illustrates a situation in which the spike count of a single cell responding according to a Poisson process is unable to discriminate between the stimuli after a full period, but such information is available from spike timing. The time window is increased, beginning at the onset of stimulus-related responses. In this case, after 60 ms (a full cycle of oscillation), the spike count information drops to zero, whereas the full temporal information accumulates cycle after cycle. After one cycle, the information is purely temporal. The effect of stimulus modulation frequency on the full temporal information is shown in Figure 4a. Figure 4b shows the purely temporal information for the same situation (the full information minus the spike count information). The difference between the fast and slow regimes can
Figure 5: A situation in which stimulus-dependent correlation in spike timing between cells can make a sizable (but nevertheless second-order) contribution to the total information, via component 2c of the temporal information. (See the text for the correlation function used in this case.) In this example, the component 2c dominates the full temporal information. Note that the corresponding spike count component was more than 10 times smaller over the whole time range.
be seen by comparing these two figures. Slow fluctuations lead to negligible (of order T^3) pure temporal information. In the fast regime, pure temporal information increases roughly proportionally to the window length. Note also that if fluctuations are very fast (τ_c ≪ T) and the spike incidence is measured with high temporal precision, the amount of information is roughly independent of the frequency of oscillation, as predicted by equation 6.3. If the rates vary slowly with respect to T, but the correlations vary on a faster timescale, then the pure timing information can be only of second order in T. This means that in this situation, the instantaneous rate of information transmission is unaffected by the temporal structure of the spike train. However, the temporal information can still be appreciable. An example is shown in Figure 5. This figure plots the information in the responses

Figure 4: Facing page. Effect of the frequency of stimulus modulation of the rates. (a) Full temporal information. (b) Purely temporal information only. The main requirement for a temporal contribution to the information can be seen easily in the comparison of these two figures: fast fluctuation of the firing rates in comparison to the timescale of observation.
of a pair of cells, each with the same characteristics as those in Figures 3 and 4. Both of the cells' firing rates are slowly (1 Hz) modulated by one stimulus only. However, the correlation between them is modulated more rapidly, with a frequency of 100 Hz, and it is stimulus dependent. The correlation for one stimulus is equal in magnitude but opposite in sign to that of the other stimulus. The correlation between the cells is modulated at f = 100 Hz according to c_ab(t^a, t^b; s) = ± exp(−|t^a − t^b|/λ) sin(2π f t^a) sin(2π f t^b). The decay constant λ was chosen to be 25 ms. This particular cross-correlation function was chosen to match the one used in Figure 7, but the result is indicative of the behavior of any fast-oscillating cross-correlation. There is, as can be seen, a substantial contribution to information component 2c, the third, stimulus-dependent, component of the second-order information. This contribution has no counterpart in the spike count information, which remains much smaller. 7 Precision of Spike Timing
The total information contained in the spike times, equation 2.1, is also a function of the precision Δt with which the spikes are measured (or, equivalently, of the precision in the spike timing itself). Experimental measures of the information with different values of Δt can address the question of to what temporal precision information is transmitted in the cerebral cortex. The whole question of whether spike timing is important is really a question of whether the use of time resolution as short as a few milliseconds significantly increases the information extracted (Strong et al., 1998). Measuring information at high resolutions with a brute force direct evaluation from equation 2.1 is made difficult by the exponential increase of the number of trials needed as Δt is decreased (see Strong et al., 1998, and appendix B). Our formalism gives a simple expression for the information in spike timing as a function of the timing precision Δt (the discrete version of equations 4.2 and 4.3), and requires only a quadratically increasing number of trials as Δt → 0. Thus, it can be used to measure the information with resolutions that would be impractical with brute force methods using data sets of the size of typical cortical recording sessions (Panzeri & Treves, 1996). As an example, Figure 6 illustrates the impact of sampling precision on the full temporal information contained in 32 ms from a single model cell responding to two stimuli according to a Poisson process. The first stimulus elicits a constant response firing rate, and the second stimulus elicits a response firing rate oscillating sinusoidally in time, exactly as in Figure 3. In Figure 6, the oscillation frequencies for the responses to the second stimulus were varied in the range 5 to 250 Hz. The result is that the timing precision has no effect on the information only if it is much smaller than the typical timescale of response parameter variations τ_c. If Δt ≃ τ_c, then information is strongly underestimated with respect to the infinite resolution limit.
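The difference in sampling requirements can be made concrete with a little arithmetic (illustrative numbers: a 32 ms window with at most one spike per bin). A brute force evaluation must populate a response space that doubles with every extra bin, whereas the series expansion only has to estimate one rate per bin plus one correlation per bin pair:

```python
T_ms = 32
for dt_ms in (4, 2, 1, 0.5):
    n_bins = int(T_ms / dt_ms)
    brute_force_responses = 2 ** n_bins        # distinct binary spike patterns
    series_parameters = n_bins + n_bins ** 2   # rates + pairwise correlations
    print(dt_ms, brute_force_responses, series_parameters)
```

At Δt = 0.5 ms the brute force approach faces 2^64 possible patterns, while the series expansion needs only 64 + 64^2 = 4160 estimated quantities, which is the exponential-versus-quadratic contrast described above.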
Figure 6: Effect of precision of spike timing on the information available from 32 ms of the response of a single simulated neuron. We used an absolute refractory period of 4 ms in this simulation (in order to have at most one spike per time bin for all the precisions used). The data obtained with the continuous-time limit (obtained by numerical integration) are included as the y-intercept information values.
The issue of the efficiency of information transmission for different timing precisions (how much of the total entropy is actually exploited for information transmission) is addressed in Schultz and Panzeri (2000). 8 Synergy and Redundancy in Temporal Information
How is information combined from a group of coding elements? Is the amount of information obtained from the whole pool of elements greater than the sum of that from each individual element (synergistic) or less than the sum (redundant)? Between these two cases, there can exist a situation where the information from each element is independent, and the total information thus increases linearly as the number of elements is increased. Two notions of "synergy" are of immediate relevance. The first is synergy/redundancy between assemblies of cells; the second is synergy between spikes (whether they are from the same cell or another).
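The first notion can be written down directly. A sketch with toy numbers (not data): plug-in mutual informations are computed from a joint conditional distribution P(r1, r2 | s), and synergy is the joint information minus the summed single-cell informations. The example uses a stimulus-dependent correlation with identical single-cell marginals, which is the extreme synergistic case:

```python
import numpy as np

def info_bits(p_r_given_s, p_s):
    """I(S;R) in bits from conditional response distributions (rows: stimuli)."""
    p_r_given_s = np.asarray(p_r_given_s, float)
    p_r = p_s @ p_r_given_s
    out = 0.0
    for ps, prs in zip(p_s, p_r_given_s):
        nz = prs > 0
        out += ps * np.sum(prs[nz] * np.log2(prs[nz] / p_r[nz]))
    return out

p_s = np.array([0.5, 0.5])
# Joint responses (r1, r2) in the order 00, 01, 10, 11.
# Stimulus 1: cells perfectly correlated; stimulus 2: perfectly anticorrelated.
p_joint = np.array([[0.5, 0.0, 0.0, 0.5],
                    [0.0, 0.5, 0.5, 0.0]])
# Single-cell marginals, obtained by summing the joint table
p_r1 = np.array([[p[0] + p[1], p[2] + p[3]] for p in p_joint])  # r1 = 0 or 1
p_r2 = np.array([[p[0] + p[2], p[1] + p[3]] for p in p_joint])  # r2 = 0 or 1

synergy = info_bits(p_joint, p_s) - info_bits(p_r1, p_s) - info_bits(p_r2, p_s)
```

Each cell alone carries 0 bits (its marginal is 50/50 for both stimuli), while the pair carries 1 bit, so the synergy is a full bit; only the stimulus modulation of the correlation distinguishes the stimuli.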
8.1 Synergy Between Cells. We define the amount of synergy between cells as the total information from the ensemble of spike trains minus the sum of that from the individual cells (Rieke et al., 1996). The redundancy is simply the negative of this quantity. This means that to second order, the synergy is simply the sum of the off-diagonal (a ≠ b) elements of equation 4.3. This is because the first-order terms (see equation 4.2) of the information and the a = b terms in the summation are identical for both the ensemble information and the sum of the individual cell informations, thus canceling each other in the subtraction. The amount of synergy between cells depends critically on correlation. For synergistic coding, nonzero cross-correlation is needed. This can be seen from equation 4.3. In the absence of correlation, only the first term of I_tt would survive, and we have shown this term to be less than or equal to zero, meaning that synergy cannot occur. It is quite obvious that synergy may occur when the degree of correlation between cells is modulated by the stimulus. However, even when noise correlation is not stimulus dependent, it is possible to achieve synergistic coding. This is done by using the second, stimulus-independent, correlational component of I_tt to increase the total information. This happens, for each pair of cells, when the noise correlation c_ab(t_i^a, t_j^b; s) is opposite in sign
to the signal correlation ν_ab(t_i^a, t_j^b) for times t_i^a, t_j^b. This basic mechanism of synergy for the temporal information is identical in principle to that considered for spike counts in (Oram, Földiák, Perrett, & Sengpiel, 1998; Abbott & Dayan, 1999; Panzeri et al., 1999); it extends naturally to noise and signal correlations between time pairs t_i^a and t_j^b. Correlations between cortical neurons often oscillate and vary rapidly in sign with time (König, Engel, & Singer, 1995). Our analysis shows that in this case, to obtain a maximally synergistic effect, the signal correlation should oscillate in counterphase with respect to the cross-correlogram and with similar frequency. This produces opposite signs of signal and noise correlation. If the signal was of lower frequency, the effect of cross-correlation would be washed away by its rapid change of sign with respect to the signal. The effect of the relative frequencies and phase of signal and noise variations with time is illustrated in Figure 7. We simulated two Poisson cells, responding to two stimuli as in Figure 3a, but with the firing rate in response to the second stimulus being modulated at f_1 = 10 Hz and f_2 = 20 Hz,

Figure 7: Facing page. Effect of the relative frequencies and phase of signal and noise temporal variations in achieving synergistic coding. (a) Signal and noise correlation oscillating with identical timescales. The line marked with the star symbol indicates synergistic coding in this case, as it falls above the sum of single cells' information. (b) Noise correlation oscillating 10 times faster than signal correlation.
[Figure 7 panels: (a) "signal as fast as noise"; (b) "signal much slower than noise". Each panel plots information (bits) against time after onset (ms) for the sum of single cells and for the two cells observed jointly with c = 0 (no correlation), c = +1, and c = −1.]
respectively. The signal correlation between the cells is in this simple case

\[
\nu(t^a, t^b) \propto \sin(2\pi f_1 t^a)\, \sin(2\pi f_2 t^b). \tag{8.1}
\]

To generate a cross-correlation model that could be tuned to match the signal correlation frequency, the cross-correlation was chosen for both stimuli to be of the form

\[
c_{ab}(t^a, t^b; s) = c\, \exp(-|t^a - t^b| / \lambda)\, \sin(2\pi k f_1 t^a)\, \sin(2\pi k f_2 t^b). \tag{8.2}
\]
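The sign condition for synergy can be checked directly from equations 8.1 and 8.2. With k = 1 and c = −1 the product of signal and noise correlation is nonpositive at every time pair (the counterphase, synergistic configuration), while c = +1 makes it nonnegative everywhere. An illustrative sketch:

```python
import numpy as np

f1, f2, lam = 10.0, 20.0, 0.025            # Hz, Hz, seconds

def signal_corr(ta, tb):
    # Equation 8.1 (up to the proportionality constant)
    return np.sin(2 * np.pi * f1 * ta) * np.sin(2 * np.pi * f2 * tb)

def noise_corr(ta, tb, c, k=1.0):
    # Equation 8.2
    return (c * np.exp(-np.abs(ta - tb) / lam)
            * np.sin(2 * np.pi * k * f1 * ta) * np.sin(2 * np.pi * k * f2 * tb))

ta, tb = np.meshgrid(np.linspace(0, 0.1, 101), np.linspace(0, 0.1, 101))
counterphase = signal_corr(ta, tb) * noise_corr(ta, tb, c=-1.0)
in_phase = signal_corr(ta, tb) * noise_corr(ta, tb, c=+1.0)
```

With k = 1 the product reduces to ±exp(−|t^a − t^b|/λ) sin² sin², so its sign is fixed everywhere by the sign of c, which is why the two choices give synergy and redundancy, respectively.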
The cross-correlation function decayed exponentially, with time constant 25 ms, as a function of the time between spikes (König et al., 1995). The constant c in front of the correlation function was set to 0 (no cross-correlation), −1 (signal and noise in counterphase), or 1 (signal in phase with the noise). k determined the relative timescales of signal and noise correlation oscillation and was set to 1 for Figure 7a and 10 for Figure 7b. The information from the pair of cells observed simultaneously is compared with the summed single-cell information in the figure. When the noise correlation between the cells is chosen to oscillate with the same frequency as the signal correlation does, synergy is obtained when signal and noise correlation are in counterphase. Conversely, high redundancy is obtained when they are in phase. When instead signal and noise oscillate on very different timescales, correlations play a much less significant role. When response profiles of different neurons are independent and little signal correlation is present, weakly stimulus-modulated correlations do not affect information transmission at all (Oram et al., 1998; Panzeri et al., 1999). This is important as signal correlation between cortical neurons was reported to be very small for stimulus sets of increasing complexity (Gawne, Kjaer, Hertz, & Richmond, 1996; Rolls, Treves, & Tovée, 1997; DeAngelis, Ghose, Ohzawa, & Freeman, 1999), and therefore correlations might be less important under natural conditions than when using artificial and limited laboratory stimuli. The neuronal model used in this simulation is certainly far from realistic; however, the result discussed here is valid beyond the simple models used for the figures. We note that when cross-correlations are stimulus dependent, then the third term of equation 4.3 also becomes nonzero, and additional interactions between response parameters can arise.
They can be studied by specifying the stimulus dependence of correlations and also taking into account this information component. In conclusion, this section shows that the contribution of cross-correlations to the representation of the external world can be more complex than what might be expected from naive visual inspection of cross-correlograms. This reinforces the need for a rigorous information-theoretic analysis of the role of trial-by-trial correlations in solving complex encoding problems like feature binding (Singer et al., 1997).
Temporal, Correlational, and Rate Coding
1335
8.2 Synergy Between Spikes. Another notion of synergy, defined for single cells (as well as populations) and for the full temporal information, is synergy between spikes. Motivated by the notion that particular sequences of spikes may have a special role in encoding stimuli, Brenner, Strong, Koberle, Bialek, and de Ruyter van Steveninck (2000) recently introduced a synergy measure of this type. In that work, "events" (e.g., single spikes or pairs of spikes occurring with a given time delay, irrespective of whether there are other spikes in between) are singled out from the whole spike train; the information carried by the occurrence of single events was computed.3 Brenner et al. (2000) considered the synergy to be the information from a complex event (e.g., a pair of spikes occurring at specified times) minus the information from each of those individual spikes constituting that pair. Another way to measure the synergy between spikes that is suggested by our analysis here is to take into account their contribution to the information conveyed by the whole spike train. The contribution of single spikes to the whole spike train information is given by equation 4.2, and, correspondingly, the contribution of pairs of spikes is given by the sum of equations 4.2 and 4.3. Therefore, the extent of synergy between pairs of spikes is simply the quantity given in equation 4.3 alone. The concept generalizes to higher-order interactions; "spike" synergy is something that falls naturally out of the series expansion approach. It is interesting to note that the introduction of a refractory period has an accompanying effect on the synergy between spikes, as measured by the fractional second-order contribution to the overall temporal information. Figure 8a shows the effect of the refractory period on the synergy between spikes. This is shown for the example of a cell whose firing rate is modulated by a 40 Hz sinusoid for one stimulus, as in the earlier figures.
The figure compares a Poisson process with processes augmented with an absolute refractory period of between 1 and 16 ms, and also with another process that has a 2 ms absolute refractory period and an exponentially decaying autocorrelation with time constant 20 ms, reflecting a relative refractory period. It is apparent that for the Poisson process (in which spikes are not entirely independent because of the rate modulation), the second-order contribution is small or negative. For longer refractory periods, there is an increasing peak in the early period of the response. This effect appears to involve an interaction between the absolute refractory period and the time constant of stimulus modulation of the rates; the stimulus modulation timescale determines the width of the spike synergy peak (see Figure 8b), whereas the refractory period length determines its height. 3 The definition of information carried by the occurrence of a single event is a generalization of equation 4.2. However, the Brenner et al. (2000) definition of information carried by, for example, single spikes in isolation is different from the contribution of the occurrence of a single spike in a given trial to the information from the full spike train. The latter quantity also has terms of higher order.
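Processes of this kind can be generated by thinning an inhomogeneous Poisson process and discarding candidate spikes that fall inside the dead time. The sketch below uses illustrative parameters and one standard construction (Lewis-Shedler thinning plus a dead-time rule); it is not necessarily the generator used for the figures:

```python
import numpy as np

def refractory_poisson(rate_fn, duration, refractory, rate_max, rng):
    """Spike times from an inhomogeneous Poisson process (thinning algorithm)
    with an absolute refractory period imposed as a dead time."""
    spikes, t, last = [], 0.0, -np.inf
    while True:
        t += rng.exponential(1.0 / rate_max)   # candidate from rate_max process
        if t >= duration:
            return np.array(spikes)
        # accept with probability rate(t)/rate_max, unless within the dead time
        if t - last >= refractory and rng.random() < rate_fn(t) / rate_max:
            spikes.append(t)
            last = t

rng = np.random.default_rng(3)
rate = lambda t: 50.0 * (1.0 + np.sin(2 * np.pi * 40.0 * t))   # 40 Hz modulation
train = refractory_poisson(rate, duration=1.0, refractory=0.004,
                           rate_max=100.0, rng=rng)
```

Every interspike interval in `train` is at least the 4 ms refractory period, which is what introduces the negative short-lag correlations discussed in the text.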
[Figure 8: fraction of the temporal information contained in the second-order terms ("2nd order fraction") versus time after onset (ms). (a) Refractory periods t_refr = 0, 1 ms, 2 ms, 4 ms, 16 ms, and 2 ms + decay. (b) Stimulus modulation frequencies 20, 40, and 80 Hz.]
It has been noted previously that the temporal correlations introduced into the spike train by refractoriness can have a beneficial effect on coding. This occurs by regularization of the higher firing-rate parts of the response, producing a more deterministic relation between stimulus and response (de Ruyter van Steveninck et al., 1997; Berry & Meister, 1998). Our analysis confirms this effect and adds several new facts: that the effect is second order in time and that it involves an interaction between the refractory and dominant stimulus-induced modulation timescales, such that the width of the period of increased information is determined by the stimulus frequency characteristics, whereas the amount of extra information is determined by the effective duration of refraction. 9 Analysis of Neurophysiological Data
In order to illustrate the specific advantages and the type of neurophysiological results that can be obtained with our series expansion method, we present here an example analysis of two cells recorded from the barrel cortex of adult normal anesthetized Wistar rats. The two neurons presented were located in the D2 and C2 barrel columns, respectively. Each stimulation consisted of a 100 ms upward deflection of one of the whiskers. For each neuron, the stimulus set was composed of its principal whisker and its eight surrounding whiskers, so that each of the whiskers that was likely to activate the cells was included in the stimulus set. Fifty to 56 trials per whisker were available. The time onset of stimulation, as well as spike times, were recorded with 0.1 ms precision. (See Lebedev et al., 2000, for a complete description of the experimental methods.) The information about which of the nine whiskers was stimulated was computed with the series expansion approach and using brute force estimation from response frequencies. Information from spike counts and spike times was computed. For the latter, a resolution Δt = 10 ms was used. Corrections for the upward bias originating from limited sampling were applied (see appendix B). The range over which unbiased information measures could be obtained is discussed below and in appendix B. Note that the probability scaling assumption was shown to be satisfied by both neurons in Figure 2. Figure 9a shows the spike count information analysis for the first neuron. Since the overall mean rate of the neuron across the 100 ms stimulation is 10 Hz, the series expansion is expected to be very good.

Figure 8: Facing page. Synergy between spikes, measured by the fraction of the temporal information contained in the second-order terms. (a) Comparison of a Poisson process with processes augmented with absolute refractory periods. In one case, a relative refractory period with exponential autocorrelation is also added (circles). (b) Interaction of a 4 ms refractory period with stimulus-induced modulation frequency, for three different modulation frequencies.

[Figure 9: cell r38e4ch5u2. (a) Spike counts information; (b) full temporal information. Each panel plots information (bits) against time (ms) for the brute force estimate, the total series estimate, and the rate, stimulus-independent correlational, and stimulus-dependent correlational components.]

Indeed, a comparison of the brute force and series spike count information shows that the two are essentially identical for the whole stimulation time range. Because the two quantities are (after finite sampling corrections) unbiased in this
time range, this is a very compelling verification of the validity of the series expansion method. In Figure 9b we report the spike times information analysis for the first neuron. It is evident that the brute force estimation of the full temporal information diverges rapidly after the first four to five time bins (i.e., after 40–50 ms). This is due to failure of corrections for finite sampling, as expected by the rule of thumb for sampling corrections (see appendix B). The spike times series expansion is a close match to the brute force estimator up to 40 to 50 ms, and no appreciable divergence due to finite sampling is visible, again as expected. This illustrates the clear superiority, in terms of sampling requirements, of the series expansion with respect to brute force evaluations. The separation of information into components, obtained using the series expansion, shows that most of the information is carried by the rate component. However, for this cell, spike correlations contribute up to approximately 15% of the information in the spike times case. Note that all of this information in spike correlations is contributed by the stimulus-independent component. This means that this little correlational information arises not from modulation of the autocorrelogram with the stimulus, but from the interplay of signal and noise correlations. The peak across time for the spike count information was 0.31 bit; the full temporal information obtained with 10 ms resolution was 0.41 bit at time T = 50 ms. This shows that a neuron can transmit appreciable temporal information within a fraction of one mean interspike interval. This possibility was predicted by our mathematical analysis of coding regimes. Finally, it is interesting to note that this neuron in the first 10 ms of response carried on average more than 3 bits per spike emitted. Figure 10 reports the information time course for the second neuron. Its mean rate across 100 ms was 16 Hz.
Also in this second case, the information for the spike counts obtained with the series expansion coincided with that obtained by the brute force method, thus confirming the reliability of the series expansion. When considering spike times, series expansion and brute force evaluation are very close up to 40 to 50 ms. After 50 ms, the brute force evaluation cannot be corrected effectively for finite sampling, and thus it starts to diverge. The peak spike count information was 0.13 bit;
Figure 9: Facing page. (a) Time course of the information in the spike counts for a neuron located in the D2 barrel column of the rat cerebral cortex. The information obtained with the brute force method (equation 2.2) is compared to that obtained with the series expansion at second order. The three components of the series expansion information are also presented (rate; stimulus-independent correlational; stimulus-dependent correlational). (b) Time course of information carried by spike times of the same neuron, computed with 10 ms precision. Notation as in a. The "rate" component in this case refers to the dynamic firing rate across the time window rather than anything to do with a spike count code.
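The finite sampling corrections referred to in this section are derived in appendix B; a common first-order form (in the spirit of Panzeri & Treves, 1996, and the Miller-Madow correction; the sketch below is illustrative, not the exact estimator of appendix B) subtracts the leading bias of the plug-in information, which depends only on the number of occupied response bins and the number of trials:

```python
import numpy as np

def plugin_info_bits(counts):
    """Plug-in (brute force) I(S;R) in bits from a stimulus x response count table."""
    counts = np.asarray(counts, float)
    p_sr = counts / counts.sum()
    p_s = p_sr.sum(axis=1, keepdims=True)
    p_r = p_sr.sum(axis=0, keepdims=True)
    nz = p_sr > 0
    return float(np.sum(p_sr[nz] * np.log2(p_sr[nz] / (p_s * p_r)[nz])))

def corrected_info_bits(counts):
    """Plug-in information minus a first-order estimate of the upward bias:
    [sum_s (m_s - 1) - (m - 1)] / (2 N ln 2), with m the number of response
    bins occupied overall and m_s those occupied given stimulus s."""
    counts = np.asarray(counts, float)
    n_trials = counts.sum()
    m = np.count_nonzero(counts.sum(axis=0))
    m_s = np.count_nonzero(counts, axis=1)
    bias = (np.sum(m_s - 1) - (m - 1)) / (2.0 * n_trials * np.log(2))
    return plugin_info_bits(counts) - bias

# Two stimuli with nearly identical response statistics: the plug-in
# estimate is spuriously positive; the correction removes most of it.
counts = np.array([[25, 10, 8, 7],
                   [24, 11, 9, 6]])
raw, corrected = plugin_info_bits(counts), corrected_info_bits(counts)
```

The bias grows with the number of occupied response bins, which is why the brute force spike-times estimate (with its exponentially large response space) fails first, while the spike-count estimate remains correctable over the full stimulation window.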
[Figure 10: cell r3e3ch2u3. (a) Spike counts information; (b) full temporal information, showing the same quantities as Figure 9.]
Figure 10: (a) Spike count and (b) spike times information time course for a second neuron located in the C2 barrel column. Notations as in Figure 9.
the peak spike timing information was 0.20 bit. As before, significant temporal information is transmitted within one typical interspike interval. The component analysis indicates that unlike the first cell, this second cell transmits information by mean firing-rate modulations only, and correlations between spikes do not transmit any information. The above two cells were shown only for illustrative purposes rather than to make any general point about the information properties of barrel cortex neurons.4 This illustration shows that (1) the series expansion method can give reliable and testable results; (2) when used to compute information in spike times, it has considerable advantages in terms of data size requirements; and (3) it can effectively quantify the contributions of different encoding mechanisms to neuronal information transmission. 10 Discussion
We have demonstrated that the MacKay and McCulloch (1952) information contained in the specification of the times of spike emission of a population of cells can be broken up into a series of terms, each quantifying the contribution of a different encoding mechanism, just as the information contained in the spike counts from a population can (Panzeri et al., 1999). Although it can be applied only to a limited range of time windows, the use of this series approach has a number of advantages over brute force computation for both the understanding of the theory of coding with spike trains and the analysis of neurophysiological data. A first advantage of this approach is that it necessarily compares the contributions of different information-bearing parameters on correct and equal terms and shows how they combine to yield the full information available from the spike train. A second advantage of the series expansion method is that it requires much less data sampling than a brute force approach in order to obtain unbiased and reliable information measures. This is because the information is computed by using only the subset of the possible variables characterizing the spike train that carry the most information in the short time limit. Therefore, the complexity of the response space is reduced in the way that preserves the most information. A third interesting feature of the method presented here is that it does not assume that the neuronal response space is a vector space, as, for example, principal component analysis (Optican & Richmond, 1987) does. Since sensory systems are nonlinear, this is a property that has to be satisfied by any method for studying neuronal information encoding (Victor & Purpura, 1996, 1997). Fourth, the series expansion approach is a method to quantify the full temporal information
4 A complete study of the informational properties of a large set of barrel cortical neurons will be the subject of a separate publication (Panzeri, Petersen, Schultz, Lebedev, & Diamond, 2000).
that can be extracted from the spike train by an ideal observer; it does not depend on the validity of a stimulus-response model, and it is not specific to a particular stimulus decoder. Therefore, it provides information values against which the performance of encoding and decoding models can be assessed (Rieke et al., 1996; Borst & Theunissen, 1999). The observation timescales with which spike trains are studied depend on the questions being asked. Many authors, particularly those with a perspective from the field of psychophysics, have examined windows of data up to several seconds long. The justification is that in certain tasks (Britten, Shadlen, Newsome, & Movshon, 1992), psychophysical performance increases up to these timescales, as does the ability to discriminate stimuli based on counting the spikes of a single cell (as counting for longer allows noise to be averaged out). The techniques discussed in this article are inapplicable to such long time windows (although other techniques, such as direct calculation of the spike count information for a single cell, do apply). We note that the study of neural coding is often considered as a precursor to, or partner of, the study of computation or information processing in the nervous system. To understand the computational architecture of the brain, it is necessary to understand the nature of the symbols that are processed. Accepting that single neurons are the primary information processing units of the brain, these symbols must exist on timescales that are "seen" by single cells, which may be as short as a membrane time constant, or longer if additional integrative processes are enacted. It is neural coding on single-cell timescales that determines computation and motivates the work reported here. Since perceptual processes are unlikely to be uniform in time, such a strategy might also elucidate some of the computations underlying perception at longer timescales.
The work presented here is relevant not only as a tool for neurophysiological data analysis. It also allows analytical studies of the statistical properties of population spike trains, such as refractoriness, autocorrelation, and timing relationships between cells, to be conducted with regard to understanding which parts of the full temporal information they contribute to and which they do not. Here we have used this formalism to relate precisely the timescale of response parameter variations to transitions between spike count and temporal encoding regimes; to show that the impact on information representations of stimulus-independent trial-by-trial correlations depends crucially on their interplay with the signal correlation; and to show that the neuronal refractory period can lead to synergy between spikes. A further understanding of how other parameters of biophysical relevance can be adjusted to maximize encoding capacity can be reached by the use of the formalism presented here. One of the most interesting findings often reported in experimental studies of neuronal information transmission is that sensory neurons transmit most of the information within one mean interspike interval, and single spikes carry a lot of information (Rieke et al., 1996). Thus, high spike timing precision
and rate coding may coexist in sensory neurons, as the rate code has to be evaluated in a short window. This has led some authors to conclude that under these conditions the distinction between rate and temporal encoding may become blurred (see Rieke et al., 1996). However, differences between temporal and rate encoding can be defined and be present even under conditions of such rapid processing (Theunissen & Miller, 1995; Borst & Theunissen, 1999). Our study advances the understanding of the distinction between temporal and rate encoding when the relevant window for information transmission is short. In fact, our work explicitly relates the timescales of response variations to transitions between coding regimes and shows that under some conditions, a substantial amount of information can be encoded purely temporally within a fraction of one interspike interval (see section 6). Indeed, the example analysis of two neurons in the rat barrel cortex shows that this is not only a theoretical possibility, but that it can be realized by sensory neurons. Shadlen and Newsome (1998) argue that because the interspike interval in the responses of cortical neurons is highly variable, the rapid information transmission achieved by the cerebral cortex (i.e., substantial information being transmitted in one interspike interval or less) must imply redundancy of signal. Their argument is based on the idea that to obtain reliability in short timescales, it is necessary to average away the large observed variability of individual interspike intervals by replicating the signal through many similar neurons. In other words, the need for rapid information transmission should strongly constrain the cortical architecture. Our study demonstrates precisely the opposite.
The fact that the first-order temporal information transmitted by a population is simply the sum of all single-cell contributions (see equation 4.2) demonstrates that it is not necessary to transmit many copies of the same signal to ensure rapid and reliable transmission. If each cell contributes some nonidentical information about the stimuli, this will sum up in less than one interspike interval, and high population information rates can be achieved. We finally note that DeWeese (1995, 1996) previously reported an elegant analytical study of the impact of interaction between spikes on information transmission in single cells. This study made use of a cluster expansion formalism derived from statistical mechanics. The results DeWeese obtained in this way are close to what we obtained in the single-cell case. The only difference is that in DeWeese's equations, the summation over time in the quadratic term of the second derivative of the information is restricted to pairs of time bins different from each other. However, this work reports several advances with respect to the earlier work of DeWeese. In fact, we computed the information carried by an ensemble of cells instead of by single cells only. We separated the information into different components, each reflecting different encoding mechanisms. We also studied the transition between spike count and temporal encoding regimes. The power series formalism makes the conditions under which the series converges transparent; this
issue is much less clear with a cluster expansion (see, e.g., DeWeese, 1995). Unlike in DeWeese (1996), corrections for finite sampling were introduced here. This last development is essential when using the method for studying information transmission in real neurons. In conclusion, a full understanding of the coding properties of a neural system cannot be achieved simply by computing the total information about the stimuli contained in its spike trains; rather, it is necessary at the very least to discover how this total is arrived at from the individual information-bearing parameters. The results that we have presented here encourage us to think that this is possible.

Appendix A
In this appendix we show that under the assumptions summarized in section 5, the only responses contributing to the transmitted information up to order k are the spike patterns with up to k spikes in total. Thus, the power expansion in the short time window length T is related to the expansion in the total number of spikes emitted by the population. We also compute the only nonzero response probabilities up to second order, which are the only ones contributing to the first two information derivatives. We give the proof for finite temporal precision. Consider the response to stimulus s. The probability of observing one spike in the bin centered at $t_i^a$, averaged over all possible patterns of spikes occurring in other bins, is $r_a(t_i^a; s)\Delta t$. Now note that the assumption specified by equation 3.3 implies that the probability of a single spike, conditional on the observation of any pattern of other spikes, also scales proportionally to $\Delta t$:

$$P(t_i^a \mid t_j^b \cdots t_k^f; s) \propto \Delta t. \tag{A.1}$$

Third, we use the well-known theorem on compound probabilities to write the following chain rule for the probability of observing a pattern $t_i^a, t_j^b, \ldots, t_k^f$ of k spikes when stimulus s is presented:

$$P(t_i^a t_j^b \cdots t_k^f; s) = P(t_i^a \mid t_j^b, \ldots, t_k^f; s)\, P(t_j^b \mid \ldots, t_k^f; s) \cdots P(t_k^f; s). \tag{A.2}$$

This means that a probability of k spikes is a product of k conditional probabilities of emission of each of the spikes given the presence of the other spikes in the pattern. Since, as proven above, each of the k terms of this product is proportional to $\Delta t$, probabilities of k-plets of spikes scale as $(\Delta t)^k$. From this scaling property, from the definition of information, equation 2.1, and from the logarithm series expansion, equation 4.1, it follows that for the computation of the first k information derivatives, only response probabilities with up to k spikes are needed, and they can be truncated at the kth order in $\Delta t$. Responses with more than k spikes do not contribute to the first k information derivatives.
Finally, we compute the response probabilities of up to two spikes, truncated at the second order in $\Delta t$. The probability of observing just two spikes at $t_1^a$, $t_2^b$ is, up to $O(\Delta t^2)$, given by the product of $P(t_1^a \mid t_2^b; s)$ (from equation 3.3) and $r_b(t_2^b; s)\Delta t$ (the latter factor quantifying at first order the probability of one spike in $t_2^b$):

$$P(t_1^a t_2^b; s) = \frac{1}{2}\, r_a(t_1^a; s)\, r_b(t_2^b; s)\left[1 + \gamma_{ab}(t_1^a, t_2^b; s)\right](\Delta t)^2 + O(\Delta t^3). \tag{A.3}$$

The probability of two spikes is divided by two to prevent overcounting due to equivalent permutations, rather than restricting the sum over events to nonequivalent permutations, as was done in Panzeri et al. (1999). The probability $P(t_1^a; s)$ of observing just one spike at $t_1^a$ is given, up to order $\Delta t^2$, by the product of $r_a(t_1^a; s)\Delta t$ and the probability of not observing any spike other than that at $t_1^a$:

$$\begin{aligned}
P(t_1^a; s) &= r_a(t_1^a; s)\,\Delta t\, P(\text{no spikes but } t_1^a; s) + O(\Delta t^3) \\
&= r_a(t_1^a; s)\,\Delta t \left(1 - \sum_{b=1}^{C} \sum_{t_2^b} P(t_2^b \mid t_1^a; s) - \sum_{b,c} \sum_{t_2^b t_3^c} P(t_2^b t_3^c \mid t_1^a) - \cdots\right) \\
&= r_a(t_1^a; s)\,\Delta t \left(1 - \Delta t \sum_{b=1}^{C} \sum_{t_2^b} r_b(t_2^b; s)\left[1 + \gamma_{ab}(t_1^a, t_2^b; s)\right]\right) + O(\Delta t^3). \tag{A.4}
\end{aligned}$$
The probability of no cells at all firing, up to $\Delta t^2$, can be obtained by subtracting from 1 all the response probabilities that are nonzero at second order. The probability expansion in the infinite temporal precision limit can be obtained along the same lines by replacing probabilities with probability densities and using equation 3.8.

Appendix B
When considering neurophysiological data, the information quantities have to be evaluated from response probabilities obtained from a limited number of trials per stimulus, Ns. This induces an upward systematic error, or bias, in the information measures. The systematic error has to be evaluated and subtracted in order to obtain unbiased information measures. In this appendix we briefly report how the bias is computed and corrected for in the different information estimation methods. The bias correction is a simple formula, which is slightly different when considering the information calculated directly from the Shannon formula (equations 2.1 and 2.2) by brute force estimation of response frequencies, or
when considering the information computed by the series expansion5 (see Panzeri & Treves, 1996; Panzeri et al., 1999):

$$\delta I_{\mathrm{brute\ force}} = \frac{\sum_{s \in S}(R_s - 1) - (R - 1)}{2N \ln 2}; \qquad \delta I_{\mathrm{series}} = \frac{\sum_{s \in S} R_s - R}{2N \ln 2}, \tag{B.1}$$
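The corrections of equation B.1 are simple enough to sketch directly. The following Python fragment (function names ours) assumes the relevant response-class counts Rs and R have already been estimated from the data:

```python
import math

def bias_brute_force(Rs, R, N):
    """Leading bias term for the brute force information estimate
    (equation B.1). Rs: list of relevant response classes per stimulus;
    R: relevant classes overall; N: total trials across all stimuli."""
    return (sum(r - 1 for r in Rs) - (R - 1)) / (2 * N * math.log(2))

def bias_series(Rs, R, N):
    """Leading bias term for the series-expansion estimate (equation B.1)."""
    return (sum(Rs) - R) / (2 * N * math.log(2))

# Hypothetical example: two stimuli with 4 relevant classes each,
# 4 relevant classes overall, 100 trials in total.
db = bias_brute_force([4, 4], 4, 100)
ds = bias_series([4, 4], 4, 100)
```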
where N is the total number of trials (across all stimuli) and R is the number of relevant response classes across all stimuli (i.e., the number of different responses with nonzero probability of being observed). $R_s$ is the number of relevant response classes to stimulus s (Panzeri & Treves, 1996). It is intuitive that the number of trials per stimulus $N_s$ should be large compared to the number of response classes R in order to sample the conditional responses well enough, and therefore to obtain a reliable bias correction. When a Bayes procedure (which takes into account that some response classes may not be observed because of local undersampling) is used to estimate the number of relevant bins, reliable bias corrections can be obtained when $N_s$ is equal to or higher than the number of response classes R. This gives a useful rule of thumb for evaluating the range in which the bias corrections still work. For the brute force full temporal information (see equation 2.1), the response space is that of all the possible temporal spike patterns, and the number of possible responses is $2^{Cd}$, where $d = T/\Delta t$ is the number of time bins into which the window is digitized. The situation is very different when the information is computed by the series expansion approach. For the first-order spike times information, R is the number of nonzero bins of the dynamic rate function, which is at most $Cd$. For the second-order spike times information, R is the number of relevant bins in the space of pairs of spike firing times, which is at most $Cd(d-1)/2 + C(C-1)d(d+1)/4$. Therefore the number of trials needed to correct effectively for the bias grows only linearly when C or d is increased if the first-order term is considered, and only quadratically when the second-order term is also included.
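The counting rules just given can be checked with a short sketch (function names ours), which makes the exponential-versus-polynomial contrast concrete:

```python
def n_brute_force(C, d):
    # all binary spike patterns across C cells and d time bins
    return 2 ** (C * d)

def n_first_order(C, d):
    # nonzero bins of the dynamic rate function (at most C*d)
    return C * d

def n_second_order(C, d):
    # relevant bins in the space of pairs of spike firing times
    return C * d * (d - 1) // 2 + C * (C - 1) * d * (d + 1) // 4

# One cell, 10 bins: 1024 brute-force responses versus 10 (first order)
# and 45 (second order) response classes for the series expansion.
```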
Considering that the growth of the data size required for the brute force evaluation from the Shannon formula is exponential, there is a clear advantage in terms of data size when using the series approach developed here. Some simulation examples of the effectiveness of bias removal procedures for the series expansion method were reported in Schultz and Panzeri (2000). As an example, when 50 to 55 trials per stimulus are available for information analysis of single cells (as in the example reported here), the brute force computation is unbiased only up to 4 to 5 time bins, and the series expansion is unbiased up to 9 to 10 time bins. The spike count information

5 Note that the series expansion correction reported in equation B.1 is the leading bias term in the short time limit only. Note also that the series expansion bias correction slightly differs from the one for the brute force computation because in the series expansion formalism, the response class corresponding to zero response always has nonzero probability. This response class contribution cancels out the −1 in the bias for the brute force information.
quantities, of course, require many fewer data than the corresponding spike times quantities. They are thus well corrected for bias over the whole range in which the full temporal information measures are reliable.

Acknowledgments
We thank W. Bair, P. Dayan, J. Gigg, J. A. Movshon, R. Petersen, E. T. Rolls, A. Treves, and M. P. Young for useful discussions. We also thank M. Lebedev and M. Diamond for making the barrel cortex data available to us. This research was supported by the Wellcome Trust Programme Grant 055075/Z/98 (S. P.) and by the Howard Hughes Medical Institute (S. R. S.).

References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comp., 11, 91–101.
Adrian, E. D. (1926). The impulses produced by sensory nerve endings: Part I. J. Physiol. (Lond.), 61, 49–72.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Computation, 8(6), 779–786.
Berry II, M. J., & Meister, M. (1998). Refractoriness and neural precision. J. Neurosci., 18(6), 2200–2221.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Borst, A., & Theunissen, F. E. (1999). Information theory and neural coding. Nature Neuroscience, 2, 947–957.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552.
Britten, K., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neurosci., 12, 4745–4765.
Buracas, G. T., Zador, A. M., DeWeese, M. R., & Albright, T. D. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 20, 959–969.
Corthout, E., Uttl, B., Walsh, V., Hallett, M., & Cowey, A. (1999). Timing of activity in early visual cortex as revealed by transcranial magnetic stimulation. Neuroreport, 10, 2631–2634.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
DeAngelis, G. C., Ghose, G. M., Ohzawa, I., & Freeman, R. D. (1999). Functional micro-organization of primary visual cortex: Receptive field analysis of nearby neurons. J. Neurosci., 19, 4046–4064.
deCharms, R. C., & Merzenich, M. M. (1996). Primary cortical representation of sounds by the coordination of action potentials. Nature, 381, 610–613.
DeWeese, M. (1995). Optimization principles for the neural code. Unpublished doctoral dissertation, Princeton University.
DeWeese, M. (1996). Optimization principles for the neural code. Network, 7, 325–331.
Gawne, T. J., Kjaer, T. W., Hertz, J. A., & Richmond, B. J. (1996). Adjacent visual cortical complex cells share about 20% of their stimulus-related information. Cerebral Cortex, 6, 482–489.
Heller, J., Hertz, J. A., Kjaer, T. W., & Richmond, B. J. (1995). Information flow and temporal coding in primate pattern vision. J. Comp. Neurosci., 2, 175–193.
König, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range synchronization in cat visual cortex. Proc. Natl. Acad. Sci. USA, 92, 290–294.
Lebedev, M. A., Mirabella, G., Erchova, I., & Diamond, M. E. (2000). Experience-dependent plasticity of rat barrel cortex: Redistribution of activity across barrel-columns. Cerebral Cortex, 10, 23–31.
MacKay, D., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bull. Math. Biophys., 14, 127–135.
Mechler, F., Victor, J. D., Purpura, K. P., & Shapley, R. (1998). Robust temporal coding of contrast by V1 neurons for transient but not steady-state stimuli. J. Neurosci., 18(16), 6583–6598.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex: III. Information theoretic analysis. J. Neurophysiol., 57, 162–178.
Oram, M. W., Földiák, P., Perrett, D. I., & Sengpiel, F. (1998). The "ideal homunculus": Decoding neural population signals. Trends in Neurosciences, 21(6), 259–265.
Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996). Speed, noise, information and the graded nature of neuronal responses. Network, 7, 365–370.
Panzeri, S., Petersen, R. S., Schultz, S. R., Lebedev, M., & Diamond, M. E. (2000). Coding by timing of single spikes in the D2 barrel cortex. Submitted.
Panzeri, S., Schultz, S. R., Treves, A., & Rolls, E. T. (1999). Correlations and the encoding of information in the nervous system. Proc. R. Soc. Lond. B, 266, 1001–1012.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. M. H. J. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1996). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rolls, E. T., Tovée, M., & Panzeri, S. (1999). The neurophysiology of backward visual masking: Information analysis. J. Cognitive Neurosci., 11, 300–311.
Rolls, E. T., Treves, A., & Tovée, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in primate temporal visual cortex. Exp. Brain Res., 114, 149–162.
Schultz, S., & Panzeri, S. (2000). Temporal correlations and neuronal spike train entropy. Submitted.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and coding. J. Neurosci., 18(10), 3870–3896.
Shannon, C. E. (1948). A mathematical theory of communication. AT&T Bell Labs. Tech. J., 27, 379–423.
Singer, W., Engel, A. K., Kreiter, A. K., Munk, M. H. J., Neuenschwander, S., & Roelfsema, P. (1997). Neuronal assemblies: Necessity, signature and detectability. Trends in Cognitive Sciences, 1, 252–261.
Skaggs, W. E., McNaughton, B. L., Gothard, K., & Markus, E. (1993). An information theoretic approach to deciphering the hippocampal code. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 1030–1037). San Mateo, CA: Morgan Kaufmann.
Strong, S., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80, 197–200.
Theunissen, F. E., & Miller, J. P. (1995). Temporal encoding in nervous systems: A rigorous definition. J. Comput. Neurosci., 2, 149–162.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the response of single neurons in the primate temporal visual cortex. J. Neurophysiol., 70, 640–654.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
Victor, J. D., & Purpura, K. P. (1996). Nature and precision of temporal coding in visual cortex: A metric space analysis. J. Neurophysiol., 76, 1310–1326.
Victor, J. D., & Purpura, K. P. (1997). Metric space analysis of spike trains: Theory, algorithms and applications. Network, 8, 127–164.
Victor, J. D., & Purpura, K. P. (1998). Spatial phase and the temporal structure of the response to gratings in V1. J. Neurophys., 80, 554–571.

Received August 18, 1999; accepted July 27, 2000.
LETTER
Communicated by Ad Aertsen
Determination of Response Latency and Its Application to Normalization of Cross-Correlation Measures Stuart N. Baker Department of Anatomy, University of Cambridge, Cambridge CB2 3DY, U.K. George L. Gerstein Department of Neuroscience, University of Pennsylvania, Philadelphia, PA 19104, U.S.A. It is often of interest experimentally to assess how synchronization between two neurons changes following a stimulus or other behaviorally relevant marker. The joint peristimulus time histogram (JPSTH) achieves this, but assumes that changes in the cells' firing rate following the stimulus are stereotyped from one sweep to the next. Erroneous results can be generated if this is not the case. We here present a method to assess whether there are variations in response latency or amplitude from sweep to sweep. We then describe how the effects of response latency variation can be mitigated by realigning sweeps to their individual latencies. Three methods of detecting response latency are presented and their performance compared on simulated data. Finally, the effect on the JPSTH of sweep realignment using detected latencies is illustrated. 1 Introduction
There is much current interest in neuroscience in the idea that short-term synchronization between two neurons can encode information in excess of that carried by the firing rate of the cells (Gerstein, Bedenbaugh, & Aertsen, 1989; Abeles, Bergman, Margalit, & Vaadia, 1993; Softky & Koch, 1993; Singer, 1993; von der Malsburg, 1995; Baker, Olivier, & Lemon, 1997; Riehle, Grün, Diesmann, & Aertsen, 1997; Hatsopoulos, Ojakangas, Paninski, & Donoghue, 1998). In order to estimate the extent of above-chance synchrony in experimental data, it is usual to calculate the cross-correlation histogram (CCH). The raw CCH often contains features, unrelated to neuronal synchrony or interactions, that are caused by modulation in firing rate following the experimental stimulus. These are usually removed by computing a shuffle predictor of the CCH and subtracting it (Perkel, Gerstein, & Moore, 1967b). The joint peristimulus time histogram (JPSTH) of Aertsen, Gerstein, Habib, and Palm (1989) is a technique for performing this correction in a time-dependent fashion, so that cell synchronization can be studied as a function of peristimulus time. Neural Computation 13, 1351–1377 (2001). © 2001 Massachusetts Institute of Technology
1352
Stuart N. Baker and George L. Gerstein
However, as Brody (1999a) and Pauluis and Baker (2000) pointed out, peaks remaining in CCHs corrected in this way do not necessarily indicate the presence of short-term synchrony between spikes. If the cells' responses show correlated variation in either their amplitude or their latency after successive stimuli, this can also lead to CCH peaks. All methods that average across trials are adversely affected by such trial-to-trial variability, including, for example, the peristimulus time histogram (PSTH). This is not a problem limited to measures of correlation, although they form the focus of this article. Responses could vary in amplitude if the animal's attention is changing. Latency variation could occur if, for example, the animal does not maintain an exactly constant fixation point from one trial to the next, or if the available behavioral alignment marker is not precisely timed, as is often the case with studies of movement. Latency variability has been reported experimentally in a number of different systems (Richmond, Optican, & Spitzer, 1990; Munoz & Wurtz, 1993; Haplea, Covey, & Casseday, 1994; Lu, Guido, Vaughan, & Sherman, 1995; Seidemann, Meilijson, Abeles, Bergman, & Vaadia, 1996; Kitzes & Hollrigel, 1996). Pauluis and Baker (2000) described a method for overcoming these problems based on an estimate of the instantaneous firing rate of the cells; although this is in general successful, it provides no information on the nature of the correlation that is removed (amplitude, latency, or both). A procedure to correct the JPSTH for variable response amplitude has been described by several authors (A. M. H. J. Aertsen, personal communication, 1991; Brody, 1999b), and is successful. However, the issue of response latency variability is more problematic. Brody (1999b) suggested an approach to be used when latency variation is suspected.
The spikes recorded for both neurons on each single trial are shifted relative to their original values, and the resultant shuffle-corrected CCH is calculated. The set of shifts applied to each trial is adjusted to minimize the peak in the CCH. Because the same shift is applied to both spike trains on a given trial, correlations caused by short-term synchrony in the timing of spikes will not be affected. If the procedure succeeds in finding a set of shifts that can abolish the CCH peak, this indicates that trial-by-trial latency variation, rather than short-term synchrony, could have generated the peak. However, as Brody (1999b) noted, it is not possible to draw definitive conclusions in this case. Since the algorithm attempts to minimize the shuffle-corrected CCH peak, it is inevitable that some reduction in its magnitude will be achieved by a fortuitous choice of shifts, even if the response latency was in fact constant. There is no obvious way to estimate how great a CCH peak reduction must occur following the shifting paradigm before we should accept that response latency covariation was indeed a problem. In this article, we present an alternative approach to the problem of assessing whether a CCH peak could have been caused by latency covariation. We first describe a useful diagnostic measure that can indicate, in an initial exploratory analysis, whether latency or amplitude variation is present.
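The raw CCH and its shuffle predictor, on which the above discussion rests, can be sketched in a few lines of Python. This is a hedged illustration only (bin edges, normalization, and the all-pairs trial shuffling are simplifying assumptions, not the authors' exact procedure):

```python
def cch(spikes_a, spikes_b, max_lag, bin_w):
    """Cross-correlation histogram: counts of spike-time differences
    (b - a), binned at width bin_w, for lags in [-max_lag, max_lag)."""
    n_bins = int(2 * max_lag / bin_w)
    counts = [0] * n_bins
    for ta in spikes_a:
        for tb in spikes_b:
            lag = tb - ta
            if -max_lag <= lag < max_lag:
                counts[int((lag + max_lag) / bin_w)] += 1
    return counts

def shuffle_corrected_cch(trials_a, trials_b, max_lag, bin_w):
    """Raw CCH from simultaneous trials minus a shuffle predictor built
    from non-simultaneous trial pairings (after Perkel et al., 1967b)."""
    n = len(trials_a)
    raw = [0.0] * int(2 * max_lag / bin_w)
    pred = [0.0] * len(raw)
    n_shuffle = 0
    for i in range(n):
        for k, c in enumerate(cch(trials_a[i], trials_b[i], max_lag, bin_w)):
            raw[k] += c
        for j in range(n):
            if i != j:
                n_shuffle += 1
                for k, c in enumerate(cch(trials_a[i], trials_b[j], max_lag, bin_w)):
                    pred[k] += c
    # scale the predictor to the same number of trial pairings as the raw CCH
    return [r - p * n / n_shuffle for r, p in zip(raw, pred)]

# Two trials (times in ms) in which the cells fire synchronously but at
# different latencies: the zero-lag bin survives the correction.
corrected = shuffle_corrected_cch([[10.0], [20.0]], [[10.0], [20.0]], 50.0, 10.0)
```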
Normalization for Response Latency Variability in Cross-Correlations
1353
This measure can also indicate whether there is a complex combination of the two types of variation that would preclude further analysis. We then describe three methods for assessing the latency of a response on a single trial and compare their efficacy. Finally, we show that shifting spike trains according to the response latency detected for each trial can adequately correct for apparently significant CCH peaks due to latency covariation.

2 Use of the Firing-Rate Variance to Detect Latency Jitter
This article is primarily concerned with methods by which the effects of response latency covariation on cross-correlation measures may be reduced. However, when faced with an experimental data set, it is important first to determine whether latency variability exists. The variability of the stimulus-related firing-rate profile can be a useful tool to investigate this. The neuron firing-rate time course should first be determined for each trial. A kernel estimator (Silverman, 1986) is convenient for this, in which the spike times $T_i^j$ during the ith trial are convolved with a kernel function K(t):

$$F_i(t) = \sum_{j=1}^{n} K\left(t - T_i^j\right). \tag{2.1}$$
Here $F_i(t)$ is the kernel estimate of the firing-rate time course for the ith trial. In order for the rate $F_i(t)$ to be expressed in Hertz, the kernel function should have unit area. A common choice, which we will use throughout this article, is the gaussian kernel,

$$K(t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-t^2/2\sigma^2}, \tag{2.2}$$
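Equations 2.1 and 2.2 amount to only a few lines of code. The sketch below (spike times and σ in seconds, function names ours) uses an arbitrary 10 ms kernel width for the example:

```python
import math

def gaussian_kernel(t, sigma):
    """Unit-area gaussian kernel, equation 2.2."""
    return math.exp(-t ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def firing_rate(spike_times, t, sigma):
    """Kernel estimate of the firing rate at time t, equation 2.1 (in Hz
    when times are in seconds, since the kernel has unit area)."""
    return sum(gaussian_kernel(t - tj, sigma) for tj in spike_times)

# A single spike at t = 0 s with a 10 ms kernel: the estimate peaks at
# 1 / (sigma * sqrt(2*pi)) Hz at the spike time.
peak = firing_rate([0.0], 0.0, 0.010)
```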
where σ determines the kernel width and controls the degree of smoothing. A gaussian kernel has been used to estimate neuronal firing rate in several previous studies (Sanderson, 1980; Richmond, Optican, Podell, & Spitzer, 1987; Richmond et al., 1990; Paulin, 1992; Nawrot, Aertsen, & Rotter, 1999). The mean firing-rate time course across N trials can then be calculated as

$$\bar{F}(t) = \frac{1}{N} \sum_{i=1}^{N} F_i(t). \tag{2.3}$$
This approximates a smoothed version of the standard PSTH. Likewise, the variance time course \sigma^2(t) can be estimated as

\sigma^2(t) = \frac{1}{N-1} \sum_{i=1}^{N} \left[ F_i(t) - \bar{F}(t) \right]^2.        (2.4)
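As an illustration, equations 2.1 to 2.4 can be sketched in a few lines of NumPy. This is our own illustrative code, not the authors'; the function names are ours:

```python
import numpy as np

def kernel_rate(spike_times, t, sigma):
    """Eqs. 2.1-2.2: sum of unit-area gaussian kernels, one per spike.

    Returns the estimated firing rate F_i(t) in Hz at each time in `t`
    (all times in seconds)."""
    diffs = np.asarray(t)[:, None] - np.asarray(spike_times)[None, :]
    return np.exp(-diffs**2 / (2 * sigma**2)).sum(axis=1) / (sigma * np.sqrt(2 * np.pi))

def rate_mean_var(trials, t, sigma):
    """Eqs. 2.3-2.4: mean rate across trials and its across-trial variance."""
    F = np.array([kernel_rate(spikes, t, sigma) for spikes in trials])
    return F.mean(axis=0), F.var(axis=0, ddof=1)   # ddof=1 gives the 1/(N-1) estimator
```

Because each kernel has unit area, the rate estimate for a trial integrates to the number of spikes on that trial, which is what makes the Hz units come out correctly.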
Stuart N. Baker and George L. Gerstein
Cox and Isham (1980) show that the variance of a kernel estimate of the firing rate is given by:

\sigma^2 = F \int_{-\infty}^{\infty} K^2(u)\,du + 2F \int_{-\infty}^{\infty} \int_{u}^{\infty} K(u) K(v) h(v-u)\,dv\,du - F^2 \left\{ \int_{-\infty}^{\infty} K(u)\,du \right\}^2,        (2.5)

where h(t) is the autocorrelation function of the underlying point process. For the gaussian kernel of equation 2.2,

\sigma^2 = \frac{F}{2\sigma\sqrt{\pi}} - F^2 + 2F \int_{-\infty}^{\infty} \int_{u}^{\infty} K(u) K(v) h(v-u)\,dv\,du.        (2.6)

If the spike train is assumed to be a gamma process of order a, mean rate F (Baker & Gerstein, 2000), the autocorrelation is given by

h(t) = F \sum_{j=1}^{a} \exp\left[ \frac{2\pi j i}{a} + t a F \left( e^{2\pi j i / a} - 1 \right) \right],        (2.7)

where i = \sqrt{-1} (the proof of this result is given in the appendix). A gamma distribution is defined as the waiting time to the ath event in a Poisson process; its use allows for a realistic relative refractory period and is a more appropriate representation of experimental data than a Poisson process, which is a gamma distribution with a = 1 (Kuffler, FitzHugh, & Barlow, 1957; Stein, 1965; Baker & Lemon, 2000). The integral of equation 2.6 is not analytically solvable and must therefore be determined numerically. Figure 1A shows the calculated variance as a function of the mean rate F, for a Poisson process (order a = 1), using different values of the kernel width \sigma. Figure 1B shows a similar plot for a gamma process, order a = 8. In most cases, the variance depends linearly on the mean firing rate, as shown by the good fit to the regression lines. Linearity breaks down only in the case of the non-Poisson process (see Figure 1B) when the kernel width is small in comparison to the modal interspike interval (10 ms width up to about 50 Hz mean rate); this presumably reflects an undersmoothing of the contribution of individual spikes, accentuated by the presence of a refractory period. We therefore conclude that to a good approximation, the variance of the kernel-estimated firing rate should be proportional to the mean rate. The constant of proportionality a clearly depends on the kernel width \sigma and the variability in the discharge of the neuron (i.e., the order of the gamma process). In experimental data, it is simplest to estimate a from a period of background firing (0 < t < t_B) prior to the stimulus by

a = \frac{\sum_{t=0}^{t_B} \sigma^2(t)}{\sum_{t=0}^{t_B} \bar{F}(t)}.        (2.8)
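In code, the estimate of the proportionality constant and the resulting variance prediction might look as follows. This is a sketch under our own naming, with `var_t` and `mean_t` the time courses of equations 2.4 and 2.3 sampled on a regular grid:

```python
import numpy as np

def proportionality_constant(var_t, mean_t, baseline):
    """Eq. 2.8: ratio of summed rate variance to summed mean rate over the
    background period. `baseline` is a boolean mask selecting the samples
    with 0 < t < t_B."""
    return var_t[baseline].sum() / mean_t[baseline].sum()

def excess_variance(var_t, mean_t, baseline):
    """Observed variance minus the predicted variance a * mean rate.
    Peaks around the response onset and offset suggest latency jitter."""
    a = proportionality_constant(var_t, mean_t, baseline)
    return var_t - a * mean_t
```

If the gamma-process assumption holds and the response is stereotyped, `excess_variance` should hover around zero throughout the trial.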
We then assume that this value of a will be preserved throughout the response period; this condition will be met if the neuron discharge does not change its variability as a function of time and is equivalent to assuming that the spike train can be modeled by a gamma process of fixed order a. This is a key assumption, which should be verified for individual data sets to be analyzed. If it does hold, there are several ways in which the firing-rate variance can nevertheless exceed its expected value. First, the amplitude of the response may vary from trial to trial, which will lead to an increase in the variance throughout the response period. Second, the response may vary in its latency, which will cause the greatest variance increase around the onset and offset of the response. Finally, a response may alter its shape from trial to trial, which will cause a complex pattern of rate variance. Figures 1C to 1F illustrate mean and variance calculations for a variety of simulated responses, to show how these can be used to gain insight into the response type. All of the simulated responses were generated with a stepwise change in firing rate, as shown in the schematic profiles at the top. Beneath these schematics are plots of the firing-rate variance time course (thick line). Superimposed with a thin line is the time course of the mean rate, scaled by a suitable factor so that it matches the variance over the first 100 ms. In Figure 1C, the response was stereotyped for each of the 500 simulated trials. The variance is well fitted by a scaled version of the mean curve. In Figure 1D, the firing rate during the response varied uniformly by ±10 Hz, but its timing remained stereotyped. The variance exceeded the expected value throughout the response period. In Figure 1E, the response latency varied by ±50 ms, but the response duration was constant. Two peaks in the variance above its expected value occurred at the onset and offset of the response, a clear sign of response variability.
However, an identical variance profile was seen when the time of both the onset and the offset of the response varied independently with ±50 ms jitter (see Figure 1F), such that the response duration was also variable. The latter two cases can be disambiguated by using the covariance matrix of the firing-rate profile across trials. This is displayed in the lower graphs of Figure 1 as contour plots and is computed as

C(t, s) = \frac{1}{N-1} \sum_{i=1}^{N} \left( F_i(t) - \bar{F}(t) \right) \left( F_i(s) - \bar{F}(s) \right).        (2.9)
In the situation of Figure 1E, there is a negative correlation between the firing rate around the time of the onset and offset latencies from trial to trial, as seen by the troughs in the covariance matrix placed symmetrically about the leading diagonal. This results from the constant duration of the response. By contrast, when response onset and offset vary independently, no off-diagonal features are seen in the covariance plot (see Figure 1F). It should be emphasized that the firing-rate variance and covariance will not necessarily yield an unambiguous result for every data set. In situations
where the amplitude, latency, and shape of the response all vary from one trial to the next, it will not be possible to provide clear-cut interpretations as in Figures 1C to 1F. However, an investigation of the variance and covariance of the single trial responses is to be recommended as a sensible first step in determining how variable an experimental data set is and what form this variability takes before applying a cross-correlation or JPSTH analysis.
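The covariance diagnostic of equation 2.9 is essentially a one-liner given the matrix of single-trial rate profiles; a sketch in our own notation, with `F` holding F_i(t), one row per trial:

```python
import numpy as np

def rate_covariance(F):
    """Eq. 2.9: across-trial covariance C(t, s) of the rate profiles.
    F is an (N trials x T time points) array; the result is T x T."""
    D = F - F.mean(axis=0)          # deviations from the mean time course
    return D.T @ D / (F.shape[0] - 1)
```

Troughs placed symmetrically off the leading diagonal of `rate_covariance(F)` would correspond to the negatively correlated onset and offset latencies of Figure 1E.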
3 Correction for Latency Jitter
In data where the response latency varies from trial to trial and furthermore covaries between the two neurons under study, an apparently significant cross-correlation peak can be produced in the JPSTH. One straightforward means by which this can be overcome is if the spike trains are realigned, so that the latency of the response in the new data set is constant (Brody, 1999b). The new spike times T_i^{k,j'} are determined from

T_i^{k,j'} = T_i^{k,j} - d_i,        (3.1)

where the k index identifies from which of the two neurons the spike comes, the j index identifies the spike within a trial, and d_i is the shift applied to the ith trial. Note that since both spike trains are shifted by the same amount, correlation due to short-term synchronization is not affected by the operation. If the shifts d_i are equal to L_i, the latency of the response on the ith trial, this correction will remove the artifactual correlation in the JPSTH completely. However, it is usually necessary to estimate the response latency from the data themselves. Such an estimate will probably be subject to some error e_i.
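The realignment of equation 3.1 amounts to subtracting the same shift from every spike of both neurons on a trial; a minimal sketch, using our own data layout of one `(spikes_neuron1, spikes_neuron2)` pair per trial:

```python
def realign_trials(trials, shifts):
    """Eq. 3.1: subtract the per-trial shift d_i from every spike time of
    both neurons on trial i. Because the two trains move together, the
    short-term synchrony between them is unaffected."""
    return [
        tuple([t - d for t in train] for train in pair)
        for pair, d in zip(trials, shifts)
    ]
```

Note that only the alignment to the stimulus changes; the relative timing of spikes within a trial, and hence any genuine synchrony, is preserved.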
Figure 1: Facing page. Use of variance and covariance to investigate amplitude and latency variation across trials. (A) Calculated variance of the kernel estimate of the firing rate as a function of mean rate, for a Poisson process (gamma order a = 1). The four symbols show the results for different widths of a gaussian kernel, \sigma = 10, 25, 50, or 100 ms. Lines show linear regression fits y = bx. (B) Similar plot to A, for a gamma process of order a = 8. Linearity is lost for the narrowest kernel used at low rates; the regression line for this case was fitted only to the points with mean rate above 40 Hz. Note the different ordinate scales in A and B; the estimate is considerably more variable for a Poisson process than a higher-order gamma process. All measurements were calculated using one sample point from each of 500 independent trials. (C–F) Top traces are schematic illustrations of the responses simulated. Middle traces show the rate variance (thick line) as a function of time and the mean rate scaled to match the first 100 ms of the variance (thin line). Bottom plots show the covariance matrix in contour form. Contours are spaced every 25 Hz², and are all positive except for E, when negative contours are marked by dashed lines. Zero contours are omitted for clarity. (C) Invariant response, with 20 Hz baseline and 60 Hz response. (D) As C, but response amplitude varies from 50 to 70 Hz on each trial. (E) Constant response amplitude (60 Hz), but single trial response latency varies over a 100 ms window. (F) Response onset and offset latency varies independently over a 100 ms window. All measures were calculated from 500 sweeps of simulated data.
The shifts will thus effectively be

d_i = L_i + e_i.        (3.2)

The response in the shifted spike trains T_i^{k,j'} will not have a constant latency; instead, the latency will have a distribution equal to that of the errors in the latency estimate e_i. Furthermore, since the error in the size of the shift will be the same for each spike train on a given trial, the transformed data will still contain latency correlations between the spikes. The operation of realigning data trial by trial to the estimated response latency therefore effectively replaces the original latency distribution with one equal to the distribution of errors in the latency estimates. Trial shifting in this way is thus worthwhile only if the width of the error distribution is smaller than that of the variation in latency in the original data. In computing the JPSTH, it is usually of interest to investigate correlation between two neurons as a function of the time after the alignment event. If the single trial spikes are shifted by different amounts, this will have the effect of blurring any periods of synchrony that are truly locked to the stimulus. For this reason, in addition to the usual JPSTH display, it is important to present a histogram of the distribution of the shifts d_i applied, so that the JPSTH can be interpreted with this borne in mind (see Figure 6B).

4 Methods for Response Latency Estimation
The following sections describe three methods by which the response latency may be estimated from a single experimental spike train; the error distribution is quantified for each technique, and the parameter region within which each performs best is determined. Two other methods were also assessed in the course of this study. One used the cumulative sum (CUSUM; Ellaway, 1978) calculated on a single trial and determined when it exceeded the confidence limits given by Davey, Ellaway, and Stein (1986). The other used successive comparison of interspike intervals with a gamma distribution, in a similar fashion to the work of Pauluis and Baker (2000). Neither method performed as well as the best of the three techniques described below for any region of the parameter range assessed, and thus they will not be considered further. Some other methods are addressed in Sanderson (1980) and Churchward et al. (1997).

4.1 Firing-Rate Variance Method. For spike trains described by a gamma process, if responses are stereotyped from trial to trial, the time course of the variance of the single trial firing rate should be proportional to the time course of the mean rate. A prediction of the variance time course on this basis is thus
\hat{\sigma}^2(t) = a \bar{F}(t),        (4.1)
where the constant of proportionality a can be found from a background region as described in equation 2.8. The extent to which an observed data set departs from this ideal can be quantified as the summed-squared error:

E = \sum_{t=0}^{t_{\mathrm{end}}} \left[ \sigma^2(t) - \hat{\sigma}^2(t) \right]^2.        (4.2)
The error E can be computed from the data set adjusted by shifts d_i as described in section 3, and it can be considered a function of those shifts. This suggests that the single trial response latencies could be calculated by finding the shifts d_i that minimized E. Such a minimization is performed without taking account of the effect that the shifts will have on the JPSTH and is thus to be preferred over the minimization approach of Brody (1999b). In order to test this approach, a simulated annealing minimization algorithm (Press, Flannery, Teukolsky, & Vetterling, 1989) was used on simulated data. Five hundred trials were simulated, in which spikes were produced with a gamma distribution of interspike intervals (see equation 4.6), according to the method described in Baker and Gerstein (2000). The order of the gamma distribution was 8, allowing for a clear relative refractory period; with such a distribution, only 5.1% of intervals will be less than half the mean interval. Following the onset of a trial, the simulated neuron discharged with a baseline rate of \lambda_B = 20 Hz. At a latency that varied uniformly between 400 and 1000 ms after the trial onset, the rate increased to 60 Hz and remained at this level for the remainder of the trial. Figure 2A shows the variance (thick line) and its prediction from scaling the mean (thin line); the constant of proportionality a was determined over the first 50 ms of the trace. As in Figure 1, the presence of latency variability increased the rate variance above that expected, leading to a considerable mismatch between these two curves. Initially, d_i was set equal to 700 ms for all i, and the error E calculated. A value of i was then selected at random, and d_i was perturbed according to

d_i' = d_i + \Delta,        (4.3)
where \Delta was a gaussian-distributed random number, with mean zero and standard deviation S. Any values of d_i' that fell outside the interval [400, 1000] ms following this perturbation were reflected back into this range. The error was then recalculated as E' using this new shift. The new value of the ith shift d_i' was accepted with a probability

P = e^{(E - E') / t},        (4.4)

with values of P > 1 assumed to indicate certainty; t is the "temperature" of the simulated annealing algorithm. This procedure was then repeated
for many iterations. We found that good convergence was obtained when the perturbation standard deviation S and the temperature t were adjusted with the iteration number I according to

S = 100 \, e^{-I/50{,}000}
t = t_0 \, e^{-I/10{,}000}.        (4.5)

t_0 was set equal to 10^{-3} times the initial error E.
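The annealing loop of equations 4.2 to 4.5 can be sketched as follows. This is our own condensed implementation, not the published code; `error_fn` stands in for the summed-squared variance mismatch E of equation 4.2, evaluated on the data shifted by `d`:

```python
import numpy as np

def anneal_shifts(error_fn, n_trials, lo=400.0, hi=1000.0, n_iter=20000, seed=0):
    """Simulated annealing over per-trial shifts d_i (eqs. 4.3-4.5)."""
    rng = np.random.default_rng(seed)
    d = np.full(n_trials, 0.5 * (lo + hi))           # initial shifts: mid-range
    E = error_fn(d)
    t0 = 1e-3 * E                                    # initial temperature
    best_d, best_E = d.copy(), E
    for I in range(n_iter):
        S = 100.0 * np.exp(-I / 50_000)              # perturbation s.d., eq. 4.5
        temp = max(t0 * np.exp(-I / 10_000), 1e-12)  # temperature, eq. 4.5
        i = rng.integers(n_trials)                   # perturb one trial at random
        new = d.copy()
        new[i] += rng.normal(0.0, S)                 # eq. 4.3
        if new[i] < lo:                              # reflect back into the window
            new[i] = 2 * lo - new[i]
        if new[i] > hi:
            new[i] = 2 * hi - new[i]
        E_new = error_fn(new)
        # accept with P = exp((E - E') / t); P > 1 means certain acceptance (eq. 4.4)
        if E_new <= E or rng.random() < np.exp((E - E_new) / temp):
            d, E = new, E_new
            if E < best_E:
                best_d, best_E = d.copy(), E
    return best_d, best_E
```

Tracking the best configuration visited is a small addition of ours; the schedule constants are those of equation 4.5.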
After the simulated annealing algorithm had converged, a small number of trials had latency estimates that were outliers and wildly incorrect. This could be improved by calculating the decrease in the error produced by removing each trial from the analysis. For the 4% of trials that produced the greatest decrease in error when omitted, an exhaustive search was carried out for the latency that minimized the overall error. Figure 2B shows how the error declined with successive iterations of the simulated annealing algorithm, until convergence was obtained. Figure 2C shows the actual and predicted variance of the shifted data set; these agree closely. Figure 2D plots the detected response latency d_i versus the actual latency used to simulate that trial. There is a high degree of correlation; this method is thus clearly a feasible way of finding response latency in the case that latency variation is the only contributing factor to the variance. The analysis shown in Figures 2A to 2D was repeated with different simulated data to determine how sensitive the algorithm was to changes in parameters. The firing rate of the baseline period \lambda_B was altered from 10 to 50 Hz in 10 Hz steps; responses were tested that were two to five times the rate of the baseline, in steps of 1. For every combination analyzed, the error in each single trial latency estimate compared with the known latency used to simulate the data was determined. The bias of the estimates was found, equal to the mean of these errors; this is an indication of a tendency consistently to over- or underestimate the latency. The standard deviation of the single trial errors measured the random error introduced by the latency estimation procedure. Figures 2E and 2F are plots of these values as a function of different baseline rates; the separate curves show the results for different magnitudes of response increase above baseline.
The bias is generally small and shows no consistent modulation with the rate of the baseline or response. The standard deviation showed a weak dependence on the baseline rate, declining for higher rates, and a much stronger dependence on the size of the response. As expected, more accurate latency estimates were produced for the larger response sizes.

Figure 2: Facing page. Variance method for latency estimation. (A) Firing-rate variance (thick line) and the mean rate scaled to fit the variance for the first 50 ms (thin line). Computed from 500 sweeps, simulated with a 20 Hz baseline, 60 Hz response, with onset latency uniformly distributed from 400 to 1000 ms. (B) Decline in error E (see text) with successive iterations of the simulated annealing algorithm. (C) Rate variance plot as A for the data shifted according to the optimal shifts found by the annealing algorithm. (D) Scatter plot of the estimated latency versus the actual latency used to simulate the data. (E) Bias and (F) standard deviation of the latency estimate versus firing rate of the baseline. (G) As F, but using only 50 sweeps in the analysis. (H) The dependence of the standard deviation of the latency estimate on the number of sweeps, with a 10 Hz baseline rate. Throughout, different lines show different amplitudes of response, from two to five times the baseline.
This technique for estimation of response latency requires an accurate measurement to be made of the firing-rate variance. As such, it might be expected that the method would be sensitive to changes in the number of sweeps available for analysis. The analysis of Figures 2A to 2F used 500 sweeps, a number larger than that recorded in many experimental studies. Accordingly, Figure 2G investigates the performance of the algorithm when only 50 sweeps are available. Performance was not appreciably different from that shown in Figure 2F. Finally, Figure 2H plots the standard deviation of the latency estimate as a function of the number of sweeps available for analysis, using data with a background firing rate of 10 Hz throughout. There are no consistent trends, confirming that the method is also appropriate when lower numbers of sweeps are available. The method described assumes that any excess in firing-rate variance over that expected from the mean is due to response latency variability. However, as noted in Figure 1D, excess rate variability can also occur due to changes in the amplitude of the response from trial to trial. Some of the simulations run tested the effect of combined amplitude and latency variability on the accuracy of the latency estimates. The method appeared relatively robust to changes in response amplitude from trial to trial. This is probably because shifting by the appropriate latency always reduces the squared error E, leading to accurate latency estimates even if the eventual value of E at convergence is higher than usual due to the presence of amplitude variation. If the excess variance is due only to response amplitude variability, simulations showed that the algorithm nevertheless found a set of shifts that fortuitously reduced the mismatch between actual and expected rate variance.
A significant mismatch nevertheless remained, unlike in the situation where latency variability was present, when the variance accorded well with the prediction after trial shifting (see Figure 2C). If only amplitude variation is present, the estimated latencies will of course be entirely spurious. It is thus important to pay attention to the form of the rate variance, as described in section 2, to decide in advance whether latency variability is in fact present in a given data set.

4.2 Bayesian Method. The second method for response latency estimation uses a Bayesian procedure. This is similar in form to that proposed by Commenges and Seal (1985) but has the additional feature that the response is permitted to be of arbitrary shape. We assume that the neuron in question fires at the start of the trial with a constant rate \lambda_B, and that this can be estimated from a baseline (prestimulus) period. The unit's discharge following the stimulus is represented by a series of successive interspike intervals I_j. If the underlying firing rate is \lambda, the interspike intervals are assumed to follow a gamma distribution (Shadlen & Newsome, 1998; Pauluis & Baker, 2000), with order a, such that the probability distribution function is

g(I, \lambda, a)\,dI = \left( \frac{a}{\lambda} \right)^a \frac{I^{a-1}}{(a-1)!} \, e^{-aI/\lambda}\,dI.        (4.6)
Suppose now that the response began at the start of the kth interspike interval. If the time course of the subsequent poststimulus modulation in firing rate is known, the likelihood of the data can be simply calculated as

P(\text{data} \mid k, \lambda_i) = \prod_{i=1}^{k-1} g(I_i, \lambda_B, a) \prod_{i=k}^{N} g(I_i, \lambda_i, a),        (4.7)
where \lambda_i represents the underlying firing rate pertaining during interval I_i. In a Bayesian framework, the \lambda_i are "nuisance" parameters, whose values are of no interest in determining the response latency. Equation 4.7 may therefore be marginalized, by integrating over all possible values of \lambda_i, with the integral weighted according to the prior probability for \lambda_i (Ó Ruanaidh & Fitzgerald, 1996; Sivia, 1996; Carlin & Louis, 1996). In the absence of specific information, a flat prior is to be preferred, which has a uniform probability density for \lambda_i. If we assume that the response produces an increase in firing rate throughout the response period compared to the background rate \lambda_B, and that the maximum possible rate that we expect is \lambda_H, the prior for \lambda_i is

P(\lambda_i) = \begin{cases} \dfrac{1}{\lambda_H - \lambda_B}, & \lambda_B < \lambda_i < \lambda_H \\ 0, & \text{otherwise.} \end{cases}        (4.8)
The likelihood of the data is then:

P(\text{data} \mid k) = \prod_{i=1}^{k-1} g(I_i, \lambda_B, a) \prod_{i=k}^{N} \int_{\lambda_B}^{\lambda_H} g(I_i, \lambda, a) \, \frac{1}{\lambda_H - \lambda_B} \, d\lambda

= \prod_{i=1}^{k-1} g(I_i, \lambda_B, a) \prod_{i=k}^{N} \frac{a^a I_i^{a-1}}{(\lambda_H - \lambda_B)(a-1)} \left[ e^{-aI_i/\lambda} \sum_{j=0}^{a-2} \frac{1}{j! \, (aI_i)^{a-j-1} \, \lambda^{j}} \right]_{\lambda=\lambda_B}^{\lambda=\lambda_H}.        (4.9)
From Bayes' theorem,

P(k \mid \text{data}) \propto P(\text{data} \mid k) \, P(k).        (4.10)

In the absence of prior information about the expected latency, we assume a flat prior for k, and hence P(k) is a constant. Equation 4.9 thus gives
directly the relative probability of different response latencies, to within a scale factor. The values may be normalized to true probabilities simply by rescaling them to have unit sum. The time of the first spike that contributed to interval I_i could be used as the estimated response latency. However, as noted in Pauluis and Baker (2000), this will lead to a systematic error. The actual point at which the synaptic input to the cell increased must lie somewhere in the middle of I_i. If we take I_{i+1} as the best estimate of the mean interval during the initial part of the response, the unbiased estimate of the response latency is (Pauluis & Baker, 2000)

d = T_{i+1} - \frac{1}{2} I_{i+1},        (4.11)
where T_i represents the time of the spike at the start of interval I_i. Figures 3A and 3B show the results of applying this method to a single trial of simulated data. Figure 3A illustrates a schematic profile of the instantaneous firing rate for this trial, which rose abruptly from 20 to 60 Hz at time 700 ms. The times of simulated spikes are shown by the raster. Figure 3B plots the calculated probability that the latency equals each of the discrete times derived from equation 4.9. The most probable latency lies close to the actual value of 700 ms. The arrowhead beneath the abscissa marks the mean of the probability distribution, found by averaging all of the latency estimates weighted by their probabilities (Sivia, 1996); this also is close to 700 ms. In tests on a range of simulated data sets, we found the mean latency to be a slightly more accurate estimate than the peak of the probability distribution; all subsequent values quoted for the Bayesian method will therefore use the distribution mean. Use of the Bayesian method requires an estimate of the order of the gamma process, a, used to model the interspike interval distribution. Figure 3D shows how the standard deviation of the errors in the latency estimates of Figure 3C varied when different orders were assumed other than the value of 8 used to generate the data. Surprisingly, the most reliable estimate is generated by underestimating the order, equivalent to assuming that the spike train is more variable than it actually is. Figure 3D makes it clear that the Bayesian method will not perform well unless a reasonable estimate of the order of the gamma process used to model the experimental spike train is available. However, this should be relatively easily determined by examining the interspike interval distribution of the unit under study during the background period, so long as sufficient sweeps have been recorded to estimate the distribution accurately.
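For concreteness, here is a simplified numerical sketch of the Bayesian estimator. It is our own code, not the authors': it replaces the closed-form marginalization of equation 4.9 with a grid average over the flat prior, which is easier to verify, parameterizes the gamma family of equation 4.6 directly by the firing rate (so the mean interval is 1/λ), and applies the half-interval correction of equation 4.11:

```python
import numpy as np
from math import lgamma

def log_g(I, lam, a):
    """Log gamma interval density (eq. 4.6 family), with lam a firing rate:
    shape a, rate a*lam, so the mean interval is 1/lam."""
    return a * np.log(a * lam) + (a - 1) * np.log(I) - a * lam * I - lgamma(a)

def bayes_latency(spike_times, lam_B, lam_H, a, n_grid=400):
    """Posterior over the first response interval k, and the posterior-mean
    latency. Baseline intervals are scored at rate lam_B; response intervals
    are marginalized over a flat prior on [lam_B, lam_H] (eq. 4.8)."""
    T = np.asarray(spike_times, float)
    I = np.diff(T)
    lam = np.linspace(lam_B, lam_H, n_grid)
    log_base = np.array([log_g(Ii, lam_B, a) for Ii in I])
    # grid average over the flat prior: a numerical stand-in for eq. 4.9
    log_resp = np.array([np.log(np.mean(np.exp(log_g(Ii, lam, a)))) for Ii in I])
    n = len(I)
    cb = np.concatenate([[0.0], np.cumsum(log_base)])   # cb[k]: first k baseline terms
    cr = np.concatenate([[0.0], np.cumsum(log_resp)])
    logp = np.array([cb[k] + (cr[n] - cr[k]) for k in range(n)])
    post = np.exp(logp - logp.max())
    post /= post.sum()                                  # normalize to unit sum
    d = T[:-1] - 0.5 * I      # candidate latencies with the eq. 4.11 correction
    return post, float(np.sum(post * d))
```

The Occam factor does the work here: a baseline-like interval is better explained by the fixed rate λ_B than by the diluted marginal over [λ_B, λ_H], so the posterior localizes the change point.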
A more complicated method for determining the best gamma order to describe an experimental spike train has been described by Baker and Lemon (2000). Since this uses both the baseline and response periods, it may be of use in situations where there are limited data. A second requirement of the Bayesian method is an estimate of the maximum possible response rate \lambda_H. Figure 3E shows how the error standard
Figure 3: Bayesian method for latency estimation. (A) Raster of a single trial of simulated discharge, with an underlying firing rate as shown by the schematic profile above. The baseline firing rate was 20 Hz and the response rate 60 Hz. (B) Probability of the response latency at different values for the trial in A. The arrow marks the mean of the distribution. (C) Plot of estimated latency versus actual latency used to simulate the data for 500 trials. (D, E) Dependence of the standard deviation of the latency estimate on the parameters used in the Bayesian method, for the data of C. (D) Order a of the gamma process. (E) Maximum response rate \lambda_H. Arrowheads show the actual parameters used to simulate the data (a = 8, \lambda_H = 60 Hz). (F) Bias and (G) standard deviation of the latency estimate over a range of background and response rates, same plotting conventions as Figures 2E and 2F.
deviation depends on this parameter for the data of Figure 3C. The dependence is negligible, even if \lambda_H is estimated to be below the actual response magnitude (the arrowhead on the abscissa in Figure 3E). A reasonable estimate for \lambda_H should be used based on the observed PSTH, but the algorithm is robust to errors in the determination of this parameter. Figures 3F and 3G present results on the bias and standard deviation of the latency estimate calculated over several simulated data sets with the same properties as those used in Figures 2E and 2F. For the Bayesian method, the bias is relatively constant across different baseline firing rates. By contrast, the variability in the latency estimate depends quite strongly on the baseline rate, but is relatively insensitive to the magnitude of the response compared to baseline.

4.3 Rate Change Method. The final method to be described here is conceptually the simplest; it is illustrated in Figures 4A and 4B applied to a single trial of simulated data. The firing rate is estimated for the trial in question using the kernel method described in equations 2.1 and 2.2 (see Figure 4B). The mean and standard deviation of this rate estimate are determined over a baseline region (indicated in Figure 4A); the values of the mean, and mean plus three standard deviations, are shown by dashed lines in Figure 4B. The response latency is determined as the time when the rate estimate exceeds a threshold of the mean plus three standard deviations (the arrowhead in Figure 4B). When applying this method to noisy data, it is sometimes possible that no value of the estimated rate during the response period will rise above the arbitrary threshold chosen. We found that when this occurred, the actual latency was usually toward the maximum of the possible range, and by chance insufficient response spikes had occurred to produce a detectable response before the end of the time range searched. There are several possible ways of treating such trials; for example, they could be excluded entirely from further analysis, on the basis that there is no evidence that a response did in fact occur on these occasions.
However, for simplicity, in the quantitative analysis of this method that follows, the latency on such trials has been estimated as simply the time of the maximum of the rate function during the response period. Figure 4C shows a plot of the estimated latency versus the actual latency used to simulate the data for 500 trials with the same parameters as used in Figures 2A to 2D and 3A to 3C. A high level of correlation is apparent. In only 3.2% of trials in this case did the rate estimate fail to cross the detection threshold as described above. An important parameter to be adjusted when using this method is the width of the gaussian kernel used to estimate the single trial firing rate. Since neuron firing rates can vary widely, no single width is likely to be universally applicable. We have used a choice of width \sigma given by

\sigma = \frac{k}{\lambda_B},        (4.12)
which permits the width to adapt to the baseline firing rate of the data to be analyzed. In this way, an accurate representation of the baseline is
Figure 4: Firing-rate change method for latency estimation. (A) Raster of a single trial of simulated data, with the instantaneous firing-rate profile used to generate the data shown above. (B) Estimated firing rate of the trial in A, using a gaussian kernel estimator with width parameter k = 0.25. Dotted lines show the mean and mean plus three standard deviations calculated over the baseline period. (C) Scatter plot of estimated latency versus actual latency used to simulate the data for 500 trials. (D) Bias and (E) standard deviation of the latency estimate as a function of the parameter k, which determines the kernel width. Different lines show the results with responses two, three, or four times the firing rate of the baseline. (F) Bias and (G) standard deviation of the latency estimates as a function of baseline rate, using a kernel with width parameter k = 0.25. Plotting conventions as in Figures 2E and 2F and 3D and 3E.
achieved, which provides a steady background against which to assess the onset of the response. In order to determine an optimal choice for the parameter k, the bias and standard deviation of the estimated latency were
calculated over data sets consisting of 500 trials of simulated data. In all cases, the baseline was 20 Hz; responses of 40 to 80 Hz were tested. The latency bias and standard deviations are shown in Figures 4D and 4E as a function of k. A very small value of k = 0.05 produced a highly variable rate estimate, which exceeded the mean plus three standard deviations detection threshold at the time of every spike. The latency was thus estimated as close to the time of the first spike in the response period; this normally underestimated the true latency and produced a large negative bias (see Figure 4D). The highly variable rate estimate also led to a large latency standard deviation at low values of k; however, at values of k above approximately 0.2, the standard deviation changed only slowly. We accordingly used k = 0.25 in the remainder of this work; an example of the rate estimate that this produces can be seen in Figure 4B, which also used k = 0.25. A more complex adaptive kernel scheme, following Silverman (1986), was also tested in this context, in which the kernel width \sigma was adjusted continuously throughout the measurement period. This was found to produce latency estimates that were, if anything, more variable than those using the scheme described above; it will accordingly not be detailed further. Figures 4F and 4G show the bias and standard deviation of the latency estimates as a function of baseline firing rate and the response magnitude. The dependence on firing rate is only slight, confirming that the technique used to adapt the kernel width to the baseline performs satisfactorily. For responses greater than three times the baseline, the bias and standard deviation are relatively constant.

4.4 Comparison of Latency Estimation Methods. In other applications of latency estimation, it may be important to avoid a biased estimate; for this reason, curves showing the bias have been included in Figures 2 through 4.
However, for the purpose of shifting trials to remove latency-related correlation peaks in the JPSTH, it is important that the estimate has a low variability. Figure 5 shows maps of the parameter space that has been explored for all three methods and indicates the method with the smallest standard deviation of the estimate in each region. Figure 5A shows such a map for a response that is a rapid stepwise increase in the cell firing rate, as investigated in Figures 2 through 4. Boxes with split shading indicate that two methods were nearly equivalent (the standard deviations were within 10% of each other). A full description of the dependence of the algorithms' performance on the different parameters has already been given in Figures 2 through 4. We conclude that the Bayesian method is most suitable for small, sustained responses which are up to three times the baseline, or for larger responses when the baseline rate is above 40 Hz. For large responses on a small baseline, the rate variance method is more suitable. The rate change (kernel) method underperformed the two other algorithms for all of the parameter space investigated.
Normalization for Response Latency Variability in Cross-Correlations
1369
Figure 5B shows a similar map, calculated for the situation where the response consists of a transient increase in firing rate. The transient used had a gaussian profile, with width parameter 50 ms; its peak latency varied over a 400 ms range. In estimating the latency for this response, a worthwhile modification can be made to the rate change method. Rather than estimating the onset latency of the response by the time at which it crosses a threshold, the time of the maximum estimated rate can be used to determine the peak latency. This is subject to less variability than an onset measurement. The rate variance method works with the whole response and so needs little modification to deal with a transient response. The Bayesian method is not well suited to such responses, unless an explicit model of the response shape is included in the calculation. Because it was possible to adjust the rate change method to take account of the response shape, it performed best over most of the parameter range investigated.

The analysis of Figures 2 through 5 has concentrated on situations where the baseline firing rate is relatively high: the lowest rate tested was 10 Hz. In many neurons of interest experimentally, baseline rates are considerably lower. Of the methods described here, the Bayesian approach is not well suited to such situations, since it relies on there being several interspike intervals in the baseline and response regions for comparison. At low rates, the baseline region may often not contain any spikes. The rate change method can be applied to low baseline data. However, the near lack of background firing prevents the optimization of the kernel width as described by equation 4.12, and of the detection threshold as background mean plus three standard deviations. So long as sensible values are predetermined for these parameters, the rate change method will work well.
The rate variance method can be applied without modification to low-rate data and also yields accurate latency estimates (verified by additional simulations). The approach described here can thus equally be used for data with low background rate.

The computation time required to implement the different latency estimation methods varies considerably. The fastest is the rate change method, which requires only the estimation of single-trial rates and application of a threshold. For example, the analysis illustrated in Figure 4C took 6.9 seconds on a 600 MHz Pentium II computer. The Bayesian method is somewhat slower (67 seconds for Figure 3C), but not sufficiently so to deter its use when it is indicated to be optimal by Figure 5A. The rate variance method is highly computationally intensive (3000 seconds for Figure 4C), since it relies on a lengthy simulated annealing run for each data set; this may prevent its practical application in some circumstances.

In applying latency estimates to the correction of the JPSTH, in general two estimates will be found for each trial: one from each neuron's spike train. The simplest approach to combining these is to take their mean and use the resultant values as the shifts d_i. However, for the rate variance method, a
more satisfactory procedure is to minimize the summed error of the predicted rate variances for the two neurons in one annealing run, thereby generating the required shifts directly.
5 Application of Latency Estimation to JPSTH
Figure 6 shows an example of the application of the techniques described above to the calculation of the JPSTH between two simulated spike trains. These were simulated with similar parameters as used in the examples of Figures 2 through 4. Each cell had a baseline firing rate of 20 Hz and responded at 60 Hz with a latency uniformly distributed from 500 to 700 ms after the alignment event. The latencies of the two cells were the same on a given trial; otherwise their firing was independent.

Figure 6A shows the JPSTH calculation for this data set, displayed as described by Aertsen et al. (1989) and normalized as described in that work to yield a matrix of correlation coefficients. The normalized cross-correlation histogram shows a large central peak, and the plot of the size of the near-synchronous bins versus time after the alignment event shows a large peak around the time of the response onset. Such a location of a correlation peak should be treated as a clear warning sign for the presence of latency covariation.

Figure 6B shows a histogram of the estimated latency, found using the rate variance method applied to both neurons' discharge simultaneously. In interpreting the JPSTH that follows, it must be remembered that stimulus-locked features will be blurred by convolution with this distribution, thus impairing the accurate depiction of any stimulus-locked modulation of the correlation.

Figure 6C shows the JPSTH calculated from the single-trial shifted data. The peaks in the normalized cross-correlation histogram and in the diagonal profile have been much reduced by this correction. A small residual peak, best visible in the JPSTH matrix itself, is caused by the small latency variation that remains due to the error in estimating the onset latencies (see section 3). If these were experimental data, we would be forced to conclude, as is correct, that there is no short-term synchronization between these neurons but that their response latency covaried.
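The mechanism behind this correction can be illustrated with a toy simulation: when two otherwise independent neurons share a per-trial latency, their spike counts in a bin near the response onset covary across trials, and shifting each trial to a common latency removes that covariation. This is a sketch under simplifying assumptions (Bernoulli spiking on a 1 ms grid, and the true per-trial latency used in place of an estimate; all names are illustrative, not from the paper).

```python
import random

random.seed(0)

DT = 0.001                  # 1 ms simulation grid
BASE, RESP = 20.0, 60.0     # baseline and response firing rates (Hz)

def simulate_trial(latency):
    """Bernoulli approximation of a Poisson train stepping from BASE to RESP."""
    return [i * DT for i in range(1000)
            if random.random() < (RESP if i * DT >= latency else BASE) * DT]

def bin_count(spike_times, lo, hi):
    return sum(lo <= s < hi for s in spike_times)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / var

# Shared per-trial latency (uniform 0.5-0.7 s); spiking otherwise independent.
lats = [random.uniform(0.5, 0.7) for _ in range(300)]
trials1 = [simulate_trial(L) for L in lats]
trials2 = [simulate_trial(L) for L in lats]

# Spike counts in a bin straddling the mean onset covary across trials ...
r_raw = pearson([bin_count(t, 0.55, 0.65) for t in trials1],
                [bin_count(t, 0.55, 0.65) for t in trials2])

# ... but not once each trial is shifted to a common 0.6 s latency.
shifted1 = [[s - (L - 0.6) for s in t] for t, L in zip(trials1, lats)]
shifted2 = [[s - (L - 0.6) for s in t] for t, L in zip(trials2, lats)]
r_shifted = pearson([bin_count(t, 0.55, 0.65) for t in shifted1],
                    [bin_count(t, 0.55, 0.65) for t in shifted2])
```

The raw across-trial count correlation is substantial despite the absence of any spike-level interaction, which is exactly the spurious JPSTH structure the shifting procedure removes.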
Figure 5: Facing page. Maps of the method for latency determination with the lowest standard deviation of the estimate for different rates of baseline firing and response amplitude (expressed as a fractional increase over the baseline rate). Shading indicates which method performs best in that region, according to the key at the bottom. Split squares indicate that two methods performed similarly (standard deviation within 10%); the shading to the left in such cases indicates the better method. (A) For a response consisting of a step change in firing rate that is sustained for the remainder of the trial. (B) For a response that is transient and was modeled as a gaussian profile of rate change, width parameter 50 ms. The response amplitude refers to the peak of the profile in this case.
6 Conclusion
We have shown that the estimation of the response latency in single trials is a difficult problem and that errors are inevitably introduced by the estimation procedure. However, by paying attention to the properties of the data to be analyzed (baseline and response amplitude, and response shape), it is possible to select the method that will produce the best estimate. Alignment of trials to the estimated response latency is a successful procedure to remove the effects of latency covariation from the JPSTH calculation. Since this method for determining the single-trial shifts d_i is blind to the effects it will have on the JPSTH, any reduction in the size of the CCH peak can be genuinely attributed to latency covariation. We believe this to be more satisfactory than the technique proposed by Brody (1999b), where the experimenter is never sure whether such a reduction is a statistical artifact of the method. Given that response latency covariation is a frequent observation, we anticipate that trial shifting based on latency estimates will be an important technique in the analysis of experimental data.

7 Appendix
The autocorrelation function h(t) for a gamma process may be determined as follows. For a gamma process of order a and mean rate 1/a, the first-order interspike interval distribution is given by

I(t) = t^{a-1} e^{-t} / (a-1)!.   (A.1)
The intervals of a gamma process of order a can be defined as the waiting time to the ath event of a Poisson process. From this definition, it is obvious that the distribution of the nth intervals for a process of order a will simply be the gamma distribution of first intervals for a process of order na. The autocorrelation h(t) of a point process is given by the sum of all orders of the interspike interval histogram (Perkel, Gerstein, & Moore, 1967a); hence
Figure 6: Facing page. Application of latency estimation to correction of the JPSTH. (A) JPSTH display as in Aertsen et al. (1989), for a data set simulated to have a step change in firing rate from 20 to 60 Hz in both neurons. The latency of the change varied from 500 to 700 ms (uniform distribution). (B) Histogram of the latencies detected using the rate variance method applied to both neurons' discharge simultaneously. (C) JPSTH calculated from data in which each trial was shifted according to the estimated latency. The peak seen in the cross-correlation in A is greatly reduced. Bin width 10 ms, 500 trials.
for the ath-order gamma process,

h(t) = Σ_{n=1}^{∞} t^{na-1} e^{-t} / (na-1)! = S(t) e^{-t},   (A.2)
where

S(t) = Σ_{n=1}^{∞} t^{na-1} / (na-1)!.   (A.3)
Integrating,

∫ S dt = Σ_{n=1}^{∞} t^{na} / (na)!.   (A.4)
Observe that if the R_j are the ath roots of 1, such that

R_j = e^{2πij/a},   (A.5)
then

Σ_{j=1}^{a} R_j^n = a if n = ma, and 0 otherwise,   (A.6)
where m is an integer. Hence equation A.4 can be rewritten as

∫ S dt = (1/a) Σ_{n=1}^{∞} Σ_{j=1}^{a} (t R_j)^n / n!.   (A.7)
Recognizing the series expansion for e^x, we have

∫ S dt = (1/a) Σ_{j=1}^{a} e^{t R_j}.   (A.8)
Differentiating,

S(t) = (1/a) Σ_{j=1}^{a} R_j e^{t R_j}.   (A.9)
Hence,

h(t) = (1/a) Σ_{j=1}^{a} R_j e^{t(R_j - 1)} = (1/a) Σ_{j=1}^{a} e^{i 2πj/a + t(e^{i 2πj/a} - 1)}.   (A.10)
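The closed form of equation A.10 can be checked numerically against the defining series of equation A.2. A short verification sketch; the order a and the evaluation points are arbitrary choices:

```python
import cmath
import math

def h_series(t, a, n_terms=40):
    """h(t) from equation A.2: sum over all interval orders."""
    return sum(t ** (n * a - 1) * math.exp(-t) / math.factorial(n * a - 1)
               for n in range(1, n_terms + 1))

def h_closed(t, a):
    """h(t) from equation A.10: (1/a) * sum_j R_j * exp(t * (R_j - 1))."""
    roots = [cmath.exp(2j * math.pi * j / a) for j in range(1, a + 1)]
    return sum(R * cmath.exp(t * (R - 1.0)) for R in roots) / a

a = 3  # gamma process of order 3, mean rate 1/3 (before rescaling to rate F)
checks = [(t, h_series(t, a), h_closed(t, a)) for t in (0.5, 1.0, 2.5, 5.0)]
```

The imaginary parts of the closed form cancel across the conjugate pairs of roots, leaving the real-valued autocorrelation.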
Rescaling for a mean rate of F instead of 1/a, we obtain equation 2.7.

Acknowledgments
The work reported here was funded by the Wellcome Trust and Christ's College, Cambridge (S. N. B.), and NIH MH46428, DC01249, and NSF (G. L. G.).

References

Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal-cortex of behaving monkeys. Journal of Neurophysiology, 70, 1629–1638.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." Journal of Neurophysiology, 61, 900–917.
Baker, S. N., Olivier, E., & Lemon, R. N. (1997). Task dependent coherent oscillations recorded in monkey motor cortex and hand muscle EMG. Journal of Physiology, 501, 225–241.
Baker, S. N., & Gerstein, G. L. (2000). Improvements to the sensitivity of gravitational clustering for multiple neuron recordings. Neural Computation, 12.
Baker, S. N., & Lemon, R. N. (2000). Precise spatiotemporal repeating patterns in monkey primary and supplementary motor areas occur at chance levels. Journal of Neurophysiology, 84, 1770–1780.
Brody, C. D. (1999a). Correlations without synchrony. Neural Computation, 11, 1537–1551.
Brody, C. D. (1999b). Disambiguating different covariation types. Neural Computation, 11, 1527–1535.
Carlin, B. P., & Louis, T. A. (1996). Bayes and empirical Bayes methods for data analysis. London: Chapman & Hall.
Churchward, P. R., Butler, E. G., Finkelstein, D. I., Aumann, T. D., Sudbury, A., & Horne, M. K. (1997). A comparison of methods used to detect changes in neuronal discharge patterns. Journal of Neuroscience Methods, 76, 203–210.
Commenges, D., & Seal, J. (1985). The analysis of neuronal discharge sequences: Change-point estimation and comparison of variances. Statistics in Medicine, 4, 91–104.
Cox, D. R., & Isham, V. (1980). Point processes. London: Chapman and Hall.
Davey, N. J., Ellaway, P. H., & Stein, R. B. (1986). Statistical limits for detecting change in the cumulative sum derivative of the peristimulus time histogram. Journal of Neuroscience Methods, 17, 153–166.
Ellaway, P. H. (1978). Cumulative sum technique and its application to the analysis of peristimulus time histograms. Electroencephalography and Clinical Neurophysiology, 45, 302–304.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. H. J. (1989). Neuronal assemblies. IEEE Transactions on Biomedical Engineering, 36, 4–14.
Haplea, S., Covey, E., & Casseday, J. H. (1994). Frequency tuning and response latencies at three levels in the brainstem of the echolocating bat, Eptesicus fuscus. Journal of Comparative Physiology A, 174, 671–683.
Hatsopoulos, N. G., Ojakangas, C. L., Paninski, L., & Donoghue, J. P. (1998). Information about movement direction obtained from synchronous activity of motor cortical neurons. Proceedings of the National Academy of Sciences of the United States of America, 95, 15706–15711.
Kitzes, L. M., & Hollrigel, G. S. (1996). Response properties of units in the posterior auditory field deprived of input from the ipsilateral primary auditory cortex. Hearing Research, 100, 120–130.
Kuffler, S. W., FitzHugh, R., & Barlow, H. B. (1957). Maintained activity in the cat's retina in light and darkness. Journal of General Physiology, 40, 683–702.
Lu, S. M., Guido, W., Vaughan, J. W., & Sherman, S. M. (1995). Latency variability of responses to visual stimuli in cells of the cat's lateral geniculate nucleus. Experimental Brain Research, 105, 7–17.
Munoz, D. P., & Wurtz, R. H. (1993). Fixation cells in monkey superior colliculus. I. Characteristics of cell discharge. Journal of Neurophysiology, 70, 559–575.
Nawrot, M., Aertsen, A., & Rotter, S. (1999). Single-trial estimation of neuronal firing rates: From single-neuron spike trains to population activity. Journal of Neuroscience Methods, 94, 81–92.
Ó Ruanaidh, J. J. K., & Fitzgerald, W. J. (1996). Numerical Bayesian methods applied to signal processing. New York: Springer-Verlag.
Paulin, M. G. (1992). Digital filters for firing rate estimation. Biological Cybernetics, 66, 525–531.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of the instantaneous discharge probability, with application to unitary joint-event analysis. Neural Computation, 12, 647–669.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967a). Neuronal spike trains and stochastic point processes. I. The single spike train. Biophysical Journal, 7, 391–418.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967b). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophysical Journal, 7, 419–440.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1989). Numerical recipes in Pascal: The art of scientific computing. Cambridge: Cambridge University Press.
Richmond, B. J., Optican, L. M., Podell, M., & Spitzer, H. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. I. Response characteristics. Journal of Neurophysiology, 57, 132–146.
Richmond, B. J., Optican, L. M., & Spitzer, H. (1990). Temporal encoding of two-dimensional patterns by single units in primate primary visual cortex. I. Stimulus-response relations. Journal of Neurophysiology, 64, 351–369.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Sanderson, A. C. (1980). Adaptive filtering of neuronal spike train data. IEEE Transactions on Biomedical Engineering, 27, 271–274.
Seidemann, E., Meilijson, I., Abeles, M., Bergman, H., & Vaadia, E. (1996). Simultaneously recorded single units in the frontal cortex go through sequences of discrete and stable states in monkeys performing a delayed localization task. Journal of Neuroscience, 16, 752–768.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18, 3870–3896.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman & Hall.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annual Review of Physiology, 55, 349–374.
Sivia, D. S. (1996). Data analysis: A Bayesian tutorial. Oxford: Oxford University Press.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophysical Journal, 5, 173–194.
von der Malsburg, C. (1995). Binding in models of perception and brain function. Current Opinion in Neurobiology, 5, 520–526.

Received January 13, 2000; accepted September 18, 2000.
LETTER
Communicated by Morris Hirsch
Attractive Periodic Sets in Discrete-Time Recurrent Networks (with Emphasis on Fixed-Point Stability and Bifurcations in Two-Neuron Networks)

Peter Tiňo
Aston University, Birmingham B4 7ET, U.K., and Department of Computer Science and Engineering, Slovak University of Technology, 812 19 Bratislava, Slovakia

Bill G. Horne
NEC Research Institute, Princeton, NJ 08540, U.S.A.

C. Lee Giles
NEC Research Institute, Princeton, NJ 08540, U.S.A., and School of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16801, U.S.A.

We perform a detailed fixed-point analysis of two-unit recurrent neural networks with sigmoid-shaped transfer functions. Using geometrical arguments in the space of transfer function derivatives, we partition the network state-space into distinct regions corresponding to stability types of the fixed points. Unlike in previous studies, we do not assume any special form of connectivity pattern between the neurons, and all free parameters are allowed to vary. We also prove that when both neurons have excitatory self-connections and the mutual interaction pattern is the same (i.e., the neurons mutually inhibit or excite themselves), new attractive fixed points are created through the saddle-node bifurcation. Finally, for an N-neuron recurrent network, we give lower bounds on the rate of convergence of attractive periodic points toward the saturation values of neuron activations, as the absolute values of connection weights grow.

1 Introduction
Discrete-time recurrent neural networks offer a wide range of dynamical behavior. They can generate steady-state, periodic, quasi-periodic, and chaotic orbits. Attractive fixed points have been proposed as robust representations of prototype vectors in associative memories (Amit, 1989; Hopfield, 1984; Hui & Zak, 1992). Indeed, a great deal of work has focused on the question of how to constrain the weights in the recurrent network so that it exhibits only attractive steady states (Casey, 1995; Jin, Nikiforuk, & Gupta, 1994; Sevrani

Neural Computation 13, 1379–1414 (2001) © 2001 Massachusetts Institute of Technology
1380
Peter Tiňo, Bill G. Horne, and C. Lee Giles
& Abe, 2000). Jin et al. (1994) give the conditions on the weight matrix under which all fixed points of the network are attractive. Saddle fixed points were suggested to play an important role in working memories operating in nonstationary environments where both long-term maintenance and quick transitions are desirable (McAuley & Stampfli, 1994; Nakahara & Doya, 1998). Moreover, saddle points often mimic stack-like behavior in recurrent networks trained on context-free languages (Rodriguez, Wiles, & Elman, 1999). Also, they were found to be part of a mechanism by which recurrent networks induce nonstable representations of large cycles in finite state machines (Tiňo, Horne, Giles, & Collingwood, 1998).

There are many applications where oscillatory dynamics of recurrent networks is desirable. For example, when trained to act as a finite state machine (Cleeremans, Servan-Schreiber, & McClelland, 1989; Giles et al., 1992; Tiňo & Sajda, 1995; Watrous & Kuhn, 1992), the network has to induce a stable representation of state transitions associated with each input symbol s of the machine. Of special importance are period-n cycles driven by s: given the current state S, when repeatedly presenting the input s, we return after n steps to S. Cycles in the machine often induce in the recurrent network state-space stable periodic orbits corresponding to the input s (Casey, 1996; Manolios & Fanelli, 1994; Tiňo et al., 1998). In particular, loops, that is, period-1 cycles, often induce attractive fixed points (see, e.g., Casey, 1996; Tiňo et al., 1998).

Finally, chaotic behavior of recurrent networks has been a focus of vivid research activity (Botelho, 1999; Klotz & Brauer, 1999; Pasemann, 1995a), especially after Wang (1991) rigorously showed that a simple two-neuron recurrent network is capable of producing chaos.
There is a considerable amount of literature on two-unit recurrent networks (Beer, 1995; Borisyuk & Kirillov, 1992; Botelho, 1999; Klotz & Brauer, 1999; Pakdaman, Grotta-Ragazzo, Malta, Arino, & Vibert, 1998; Pasemann, 1993; Wang, 1991; Zhou, 1996). This is partly due to the lack of mathematical tools for a detailed analysis of higher-dimensional dynamical systems and partly due to the high expressive power of such simple networks, in which many generic properties of larger networks are already present (Botelho, 1999). Moreover, two-neuron networks are sometimes thought of as systems of two modules, where each module represents the mean activity of a spatially localized neural population (Borisyuk & Kirillov, 1992; Pasemann, 1995a; Tonnelier, Meignen, Bosh, & Demongeot, 1999).

Typically, studies of the asymptotic behavior of two-unit or larger networks assume some form of structure in the weight matrix describing the connectivity pattern among recurrent neurons. We mention a few examples: Symmetric connectivity and absence of self-interactions enabled Hopfield (1984) to interpret the network as a physical system having energy minima in the attractive fixed points. These rather strict conditions
Attractive Periodic Sets
1381
were weakened in Casey (1995). Blum and Wang (1992) globally analyzed networks with nonsymmetrical connectivity patterns of special types. In particular, they formulated results for two-neuron networks with the logistic sigmoid transfer function g(ℓ) = 1/(1 + e^{-ℓ}), in the absence of neuron self-connections. Deep mathematical studies were presented for severely restricted topologies, such as the "ring network" (Pasemann, 1995b) or the chain oscillators (Wang, 1996). Often rather detailed bifurcation studies are performed with respect to only one or two network parameters; the remaining parameters are kept fixed to "appropriate" values (Borisyuk & Kirillov, 1992; Nakahara & Doya, 1998). Also, the assumption of saturated sigmoidal or linear transfer functions (as in the Brain-State-in-a-Box model; Anderson, 1993) makes the study of asymptotically stable equilibrium points more feasible (Botelho, 1999; Hui & Zak, 1992; Sevrani & Abe, 2000).

In this article we impose no conditions on connection weights or external inputs, apart from the fact that the weights are assumed to be nonzero. To our knowledge, a thorough fixed-point analysis of two-unit neural networks with sigmoid transfer functions is still missing.¹ We offer such an analysis in sections 4 and 5. In section 6 we rigorously explain the mechanism by which new attractive fixed points are created: the saddle-node bifurcation. This confirms the empirical findings of Tonnelier et al. (1999). When studying such networks, they empirically observed that a new fixed point appears through the saddle-node bifurcation.

Hirsch (1994) proved that when all the weights in an N-neuron recurrent network with exclusively self-exciting (or exclusively self-inhibiting) neurons are multiplied by an increasing neural gain, attractive fixed points tend toward the saturated activation values. In section 7, we give a lower bound on the rate of convergence of attractive periodic points² toward the saturation values.
The results hold for "sigmoid-shaped" neuron transfer functions. This investigation was motivated by the observation of many in the recurrent network grammar/automata induction community that the activations of recurrent neurons often cluster away from the center of the network state-space and close to the saturated activation values (e.g., Manolios & Fanelli, 1994; Tiňo et al., 1998). In fact, this was exploited by Zeng, Goodman, and Smyth (1993) in a heuristic to stabilize the induced automata representations in the recurrent network.

¹ The most complete fixed-point analysis of continuous-time two-neuron networks can be found in Beer (1995). We determine the fixed-point positions using a similar "null-cline" technique.
² Fixed points can be considered periodic points of period 1.
In the next two sections we recall some basic concepts from the theory of dynamical systems and introduce the recurrent network model.

2 Basic Definitions
A discrete-time dynamical system can be represented as the iteration of a (differentiable) map f : X → X, X ⊆

Since a, d > 0 and bc > 0, we have D(G1, G2) > 0 and α(G1, G2) > 0, for all (G1, G2) ∈ (0, A/4)². In our terminology (see equations 4.10–4.12), D⁺, α⁺ = (0, A/4)² ⊆ (0, ∞)². To identify possible values of G1 and G2 so that |λ1,2| < 1, it is sufficient to solve the inequality aG1 + dG2 + √D(G1, G2) < 2, or equivalently,

2 − aG1 − dG2 > √D(G1, G2).   (4.16)
Consider only (G1, G2) such that

r1(G1, G2) = aG1 + dG2 − 2 < 0,   (4.17)

that is, (G1, G2) lying under the line r1⁰: aG1 + dG2 = 2. All (G1, G2) such that aG1 + dG2 − 2 > 0, that is, (G1, G2) ∈ r1⁺ (above r1⁰), lead to at least one eigenvalue of J greater in absolute value than 1. Squaring both sides of equation 4.16, we arrive at

k1(G1, G2) = (ad − bc)G1G2 − aG1 − dG2 + 1 > 0.   (4.18)
³ To simplify the notation, the identification (x, y) of the fixed point in which equation 3.3 is linearized is omitted.
Figure 2: Curves k1⁰ and r1⁰ in the plane of transfer function derivatives (G1, G2). The network parameters satisfy a, d > 0 and bc > 0. All (G1, G2) ∈ (0, A/4]² below the left branch of k1⁰ (if ad ≥ bc) or between the branches of k1⁰ (if ad < bc) correspond to the attractive fixed points. Different line styles for k1⁰ are associated with the cases ad > bc, ad = bc, and ad < bc: the solid, dashed-dotted, and dashed lines, respectively.

If ad ≠ bc, the set k1⁰ is a hyperbola,

G2 = 1/d̃ + C/(G1 − 1/ã),   (4.19)

with

ã = a − bc/d,   (4.20)
d̃ = d − bc/a,   (4.21)
C = bc/(ad − bc)².   (4.22)
It is easy to check that the points (1/a, 0) and (0, 1/d) lie on the curve k1⁰. If ad = bc, the set k1⁰ is a line passing through the points (0, 1/d) and (1/a, 0) (see Figure 2).

To summarize, when a, d > 0, by equations 4.9, 4.17, and 4.18, the fixed point (x, y) of equation 3.3 is attractive only if (G1, G2) = w(x, y) ∈ k1⁺ ∩ r1⁻, where the map w is defined by equation 4.4.
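This summary is easy to spot-check numerically: the Jacobian of equation 3.3 at a fixed point has trace aG1 + dG2 and determinant (ad − bc)G1G2 (its eigenvalues are the λ1,2 above), so the eigenvalue condition can be compared directly with the sign conditions k1 > 0 and r1 < 0. A sketch with arbitrary random test weights satisfying a, d > 0 and bc > 0:

```python
import random

random.seed(1)

def max_abs_eig(tr, det):
    """Largest |eigenvalue| of a 2x2 matrix, given its trace and determinant."""
    disc = tr * tr - 4.0 * det
    if disc >= 0.0:
        r = disc ** 0.5
        return max(abs((tr + r) / 2.0), abs((tr - r) / 2.0))
    return det ** 0.5  # complex pair: |lambda|^2 = det

mismatches = 0
stable_seen = 0
for _ in range(10000):
    # arbitrary test weights with a, d > 0 and bc > 0
    a, d = random.uniform(0.1, 3.0), random.uniform(0.1, 3.0)
    b, c = random.uniform(0.1, 3.0), random.uniform(0.1, 3.0)
    G1, G2 = random.uniform(0.01, 2.0), random.uniform(0.01, 2.0)
    tr = a * G1 + d * G2                 # trace of the Jacobian
    det = (a * d - b * c) * G1 * G2      # determinant of the Jacobian
    k1 = det - a * G1 - d * G2 + 1.0     # equation 4.18
    r1 = a * G1 + d * G2 - 2.0           # equation 4.17
    stable = max_abs_eig(tr, det) < 1.0
    stable_seen += stable
    mismatches += (stable != (k1 > 0.0 and r1 < 0.0))
```

Over the sampled parameter range, the eigenvalue test and the (k1, r1) sign test agree exactly.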
A necessary (not sufficient) condition for (x, y) to be attractive reads⁴

w(x, y) ∈ (0, 1/a) × (0, 1/d).

Consider now the case of negative weights on neuron self-connections: a, d < 0. Since in this case α(G1, G2) (see equation 4.15) is negative for all values of the transfer function derivatives G1, G2, we have α⁻ = (0, A/4)² ⊆ (0, ∞)². In order to identify possible values of (G1, G2) such that |λ1,2| < 1, it is sufficient to solve the inequality aG1 + dG2 − √D(G1, G2) > −2, or equivalently,

2 + aG1 + dG2 > √D(G1, G2).   (4.23)
As in the previous case, we shall consider only (G1, G2) such that

r2(G1, G2) = aG1 + dG2 + 2 > 0,   (4.24)

since all (G1, G2) ∈ r2⁻ lead to at least one eigenvalue of J greater in absolute value than 1. Squaring both sides of equation 4.23, we arrive at

k2(G1, G2) = (ad − bc)G1G2 + aG1 + dG2 + 1 > 0,   (4.25)
which is equivalent to

((−a)(−d) − bc)G1G2 − (−a)G1 − (−d)G2 + 1 > 0.   (4.26)
Further analysis is exactly the same as the analysis from the previous case (a, d > 0), with a, ã, d, and d̃ replaced by |a|, |a| − bc/|d|, |d|, and |d| − bc/|a|, respectively. If ad ≠ bc, the set k2⁰ is a hyperbola,

G2 = −1/d̃ + C/(G1 + 1/ã),   (4.27)

passing through the points (−1/a, 0) and (0, −1/d).

⁴ If ad > bc, then 0 < ã < a and 0 < d̃ < d (see equations 4.20 and 4.21). The derivatives (G1, G2) ∈ k1⁺ lie under the "left branch" and above the "right branch" of k1⁰ (see Figure 2). It is easy to see that since we are confined to the half-plane r1⁻ (below the line r1⁰), only (G1, G2) under the "left branch" of k1⁰ will be considered. Indeed, r1⁰ is a decreasing line going through the point (1/a, 1/d), and so it never intersects the right branch of k1⁰. If ad < bc, then ã, d̃ < 0 and (G1, G2) ∈ k1⁺ lie between the two branches of k1⁰.
If ad = bc, k2⁰ is the line defined by the points (0, −1/d) and (−1/a, 0). By equations 4.9, 4.24, and 4.25, the fixed point (x, y) of equation 3.3 is attractive only if (G1, G2) = w(x, y) ∈ k2⁺ ∩ r2⁺. In this case, the derivatives (G1, G2) = w(x, y) must satisfy

w(x, y) ∈ (0, 1/|a|) × (0, 1/|d|).
Finally, consider the case when the weights on neuron self-connections differ in sign. Without loss of generality, assume a > 0 and d < 0. Assume further that aG1 + dG2 ≥ 0, that is, that (G1, G2) ∈ α⁺ ∪ α⁰ (see equation 4.15) lie under or on the line

α⁰: G2 = (a/|d|) G1.

It is sufficient to solve inequality 4.16. Using equations 4.17 and 4.18 and arguments developed earlier in this proof, we conclude that the derivatives (G1, G2) lying in

k1⁺ ∩ r1⁻ ∩ (α⁺ ∪ α⁰)

correspond to attractive fixed points of equation 3.3 (see Figure 3). Since a > 0 and d < 0, the term ad − bc is negative, and so 0 < a < ã and d̃ < d < 0. The "transformed" parameters ã, d̃ are defined in equations 4.20 and 4.21.

For (G1, G2) ∈ α⁻, by equations 4.24 and 4.25 and earlier arguments in this proof, the derivatives (G1, G2) lying in

k2⁺ ∩ r2⁺ ∩ α⁻

correspond to attractive fixed points of equation 3.3. It can be easily shown that the sets k1⁰ and k2⁰ intersect on the line α⁰ in the open interval (see Figure 3)

(1/ã, 1/a) × (1/|d̃|, 1/|d|).

We conclude that the fixed point (x, y) of equation 3.3 is attractive only if

w(x, y) ∈ [k1⁺ ∩ r1⁻ ∩ (α⁺ ∪ α⁰)] ∪ [k2⁺ ∩ r2⁺ ∩ α⁻].

In particular, if (x, y) is attractive, then the derivatives (G1, G2) = w(x, y) must lie in (0, 1/a) × (0, 1/|d|).
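This necessary condition for the mixed-sign case can also be spot-checked by sampling: whenever the Jacobian eigenvalues lie inside the unit circle, the sampled derivatives should fall inside the box (0, 1/a) × (0, 1/|d|). A sketch with arbitrary random test weights satisfying a > 0, d < 0, and bc > 0:

```python
import random

random.seed(2)

def max_abs_eig(tr, det):
    """Largest |eigenvalue| of a 2x2 matrix, given its trace and determinant."""
    disc = tr * tr - 4.0 * det
    if disc >= 0.0:
        r = disc ** 0.5
        return max(abs((tr + r) / 2.0), abs((tr - r) / 2.0))
    return det ** 0.5  # complex pair: |lambda|^2 = det

violations = 0
attractive_seen = 0
for _ in range(20000):
    # arbitrary test weights with a > 0, d < 0, and bc > 0
    a, d = random.uniform(0.1, 3.0), -random.uniform(0.1, 3.0)
    b, c = random.uniform(0.1, 2.0), random.uniform(0.1, 2.0)
    G1, G2 = random.uniform(0.01, 3.0), random.uniform(0.01, 3.0)
    tr = a * G1 + d * G2
    det = (a * d - b * c) * G1 * G2
    if max_abs_eig(tr, det) < 1.0:       # attractive fixed point
        attractive_seen += 1
        if not (G1 < 1.0 / a and G2 < 1.0 / abs(d)):
            violations += 1
```

Every sampled attractive case falls inside the box, as the lemma requires; the converse does not hold, since the condition is necessary but not sufficient.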
Figure 3: Illustration for the proof of lemma 1. The network parameters are constrained as follows: a > 0, d < 0, and bc > 0. All (G1, G2) ∈ (0, A/4]² below and on the line α⁰ and between the two branches of k1⁰ (solid line) correspond to the attractive fixed points. So do all (G1, G2) ∈ (0, A/4]² above α⁰ and between the two branches of k2⁰ (dashed line).
Examination of the case a < 0, d > 0 in the same way leads to the conclusion that all attractive fixed points of equation 3.3 have their corresponding derivatives (G1, G2) in (0, 1/|a|) × (0, 1/d). We remind the reader that the "transformed" parameters ã and d̃ are defined in equations 4.20 and 4.21.

Lemma 2. Assume bc < 0. Suppose ad > 0, or ad < 0 with |ad| ≤ |bc|/2. Then each fixed point (x, y) of equation 3.3 such that
³ w (x, y ) 2
1 0, |a|
´
³
1 £ 0, Q | d|
´ ³
1 [ 0, | aQ |
´
1 £ 0, |d|
is attractive. In particular, all xed points (x, y) for which ³ w (x, y) 2 are attractive.
1 0, | aQ |
´
³
1 £ 0, Q | d|
´
³
´
1390 Peter Tiňo, Bill G. Horne, and C. Lee Giles
Proof. The discriminant D(G1, G2) in equation 4.14 is no longer exclusively positive. It follows from analytic geometry (see, e.g., Anton, 1980) that D(G1, G2) = 0 defines either a single point or two increasing lines (that can collide into one or disappear). Furthermore, D(0, 0) = 0. Hence, the set D⁰ is either a single point—the origin—or a pair of increasing lines (that may be the same) passing through the origin. As in the proof of the first lemma, we proceed in three steps.

First, assume a, d > 0. Since

D(1/a, 1/d) = 4bc/(ad) < 0

and

D(1/a, 0) = D(0, 1/d) = 1 > 0,

the point (1/a, 1/d) is always in the set D⁻, while (1/a, 0), (0, 1/d) ∈ D⁺.

We shall examine the case when D(G1, G2) is negative. From

|λ1,2|² = (aG1 + dG2)²/4 + |D|/4 = G1G2(ad − bc),

it follows that the (G1, G2) ∈ D⁻ for which |λ1,2| < 1 lie in (see Figure 4)

D⁻ ∩ g⁻, where g(G1, G2) = (ad − bc)G1G2 − 1. (4.28)
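For complex eigenvalues (D(G1, G2) < 0), the modulus identity |λ1,2|² = G1G2(ad − bc) makes the role of the curve g⁰ transparent: |λ1,2| < 1 exactly when g(G1, G2) < 0. A small numerical check (parameter values are illustrative, not from the paper):

```python
import numpy as np

# Parameters with a, d > 0 and bc < 0 (the setting of lemma 2), chosen so that
# the discriminant D(G1, G2) = (a*G1 - d*G2)**2 + 4*b*c*G1*G2 (eq. 4.14) is negative.
a, b, c, d = 2.0, 1.0, -5.0, 2.0
G1, G2 = 0.2, 0.2

D = (a*G1 - d*G2)**2 + 4*b*c*G1*G2
assert D < 0          # eigenvalues are complex here

# Jacobian of the two-neuron map at a fixed point with these derivatives.
J = np.array([[a*G1, b*G1], [c*G2, d*G2]])
lam = np.linalg.eigvals(J)
mod_sq = abs(lam[0])**2

# |lambda|^2 = G1*G2*(a*d - b*c), so |lambda| < 1 iff g(G1, G2) < 0 (eq. 4.28).
g_val = (a*d - b*c)*G1*G2 - 1.0
```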
It is easy to show that

(1/a, 1/d̃), (1/ã, 1/d) ∈ g⁰, 1/ã < 1/a, 1/d̃ < 1/d,

and

(1/a, 1/d̃), (1/ã, 1/d) ∈ D⁻.

Turn now to the case D(G1, G2) > 0. Using the technique from the previous proof, we conclude that the derivatives (G1, G2) corresponding to attractive fixed points of equation 3.3 lie in

D⁺ ∩ r1⁻ ∩ k1⁺,

that is, under the line r1⁰ and between the two branches of k1⁰ (see equations 4.14, 4.17, and 4.18). The sets D⁰, r1⁰, k1⁰, and g⁰ intersect in two points, as suggested in Figure 4. To see this, note that for points on g⁰, it holds that

G1G2 = 1/(ad − bc),
Figure 4: Illustration for the proof of lemma 2. The parameters satisfy a, d > 0 and bc < 0. All (G1, G2) ∈ D⁻ below the right branch of g⁰ (dashed line) correspond to the attractive fixed points. So do (G1, G2) ∈ (0, A/4]² in D⁺ between the two branches of k1⁰.
and for all (G1, G2) ∈ D⁰ ∩ g⁰ we have

(aG1 + dG2)² = 4. (4.29)

For G1, G2 > 0, equation 4.29 defines the line r1⁰. Similarly, for (G1, G2) in the sets k1⁰ and g⁰, it holds that aG1 + dG2 = 2, which is the definition of the line r1⁰. The curves k1⁰ and g⁰ are monotonically increasing and decreasing, respectively, and there is exactly one intersection point of the right branch of g⁰ with each of the two branches of k1⁰. For (G1, G2) ∈ D⁰,

|λ1,2| = (aG1 + dG2)/2,

and the (G1, G2) corresponding to attractive fixed points of equation 3.3 are from

D⁰ ∩ r1⁻.
In summary, when a, d > 0, each fixed point (x, y) of equation 3.3 such that

(G1, G2) = w(x, y) ∈ (0, 1/a) × (0, 1/d̃) ∪ (0, 1/ã) × (0, 1/d)

is attractive.

Second, assume a, d < 0. This case is identical to the case a, d > 0 examined above, with a, ã, d, d̃, r1⁻, and k1⁺ replaced by |a|, |ã|, |d|, |d̃|, r2⁺, and k2⁺, respectively. First, note that the set D⁰ is the same as before, since (aG1 − dG2)² = (|a|G1 − |d|G2)². Furthermore, ad − bc = |a||d| − bc, and so the (G1, G2) ∈ D⁻ for which |λ1,2| < 1 lie in

D⁻ ∩ g⁻.

Again, it directly follows that

(1/|a|, 1/|d̃|), (1/|ã|, 1/|d|) ∈ g⁰, 1/|ã| < 1/|a|, 1/|d̃| < 1/|d|,

and

(1/|a|, 1/|d̃|), (1/|ã|, 1/|d|) ∈ D⁻.

For D⁺, the derivatives (G1, G2) corresponding to attractive fixed points of equation 3.3 lie in

D⁺ ∩ r2⁺ ∩ k2⁺.

All (G1, G2) ∈ D⁰ ∩ r2⁺ lead to |λ1,2| < 1. Hence, when a, d < 0, every fixed point (x, y) of equation 3.3 such that

w(x, y) ∈ (0, 1/|a|) × (0, 1/|d̃|) ∪ (0, 1/|ã|) × (0, 1/|d|)

is attractive.

Finally, consider the case a > 0, d < 0. (The case a < 0, d > 0 would be treated in exactly the same way.) Assume that D⁻ is a nonempty region. Then ad > bc must hold, and

(1/a, 1/|d|) ∈ D⁻.

This can be easily seen, since for ad < bc we would have, for all (G1, G2),

D(G1, G2) = (aG1 − dG2)² + 4G1G2 bc = (aG1 + dG2)² + 4G1G2(bc − ad) ≥ 0.
The sign of

D(1/a, 1/|d|) = 4(1 + bc/(a|d|))

is equal to the sign of a|d| + bc = bc − ad < 0. The derivatives (G1, G2) ∈ D⁻ for which |λ1,2| < 1 lie in

D⁻ ∩ g⁻.

Also,

(1/|ã|, 1/|d|), (1/a, 1/d̃) ∈ g⁰.

Note that d̃ ≥ |d| and |ã| ≥ a only if 2a|d| ≤ |bc|. Only those (G1, G2) ∈ D⁰ are taken into account for which |aG1 + dG2| < 2. This is true for all (G1, G2) in

D⁰ ∩ r1⁻ ∩ r2⁺.

If D(G1, G2) > 0, the inequalities to be solved depend on the sign of aG1 + dG2. Following the same reasoning as in the proof of lemma 1, we conclude that the derivatives (G1, G2) corresponding to attractive fixed points of equation 3.3 lie in

D⁺ ∩ ( [k1⁺ ∩ r1⁻ ∩ (a⁺ ∪ a⁰)] ∪ [k2⁺ ∩ r2⁺ ∩ a⁻] ).

We saw in the proof of lemma 1 that if a > 0, d < 0, and bc > 0, then all (G1, G2) ∈ (0, 1/ã) × (0, 1/|d̃|) potentially correspond to attractive fixed points of equation 3.3 (see Figure 3). In the proof of lemma 2, it was shown that when a > 0, d < 0, bc < 0, if 2a|d| ≥ |bc|, then (1/a, 1/|d|) is on or under the right branch of g⁰, and each (G1, G2) ∈ (0, 1/a) × (0, 1/|d|) potentially corresponds to an attractive fixed point of equation 3.3. Hence, the following lemma can be formulated:

Lemma 3. If ad < 0 and bc > 0, then every fixed point (x, y) of equation 3.3 such that

w(x, y) ∈ (0, 1/|ã|) × (0, 1/|d̃|)

is attractive. If ad < 0 and bc < 0 with |ad| ≥ |bc|/2, then each fixed point (x, y) of equation 3.3 satisfying

w(x, y) ∈ (0, 1/|a|) × (0, 1/|d|)

is attractive.
5 Transforming the Results to the Network State-Space
Lemmas 1, 2, and 3 introduce a structure reflecting the stability types of the fixed points of equation 3.3 into the space of transfer function derivatives (G1, G2). In this section, we transform our results from the (G1, G2)-space into the space of neural activations (x, y). For u > 4/A, define

D(u) = (A/2)·sqrt(1 − 4/(Au)). (5.1)

The interval (0, A/4]² of all transfer function derivatives in the (G1, G2)-plane corresponds to four intervals partitioning the (x, y)-space, namely,

(B, B + A/2]², (5.2)
(B, B + A/2] × (B + A/2, B + A), (5.3)
(B + A/2, B + A) × (B, B + A/2], (5.4)
(B + A/2, B + A)². (5.5)

Recall that, according to equation 4.9, the transfer function derivatives (G1(x, y), G2(x, y)) ∈ (0, A/4]² at a fixed point (x, y) ∈ (B, B + A)² of equation 3.3 are equal to w(x, y), where the map w is defined in equations 4.3 and 4.4. Now, for each couple (G1, G2), there are four preimages (x, y) under the map w:

w⁻¹(G1, G2) = { (B + A/2 ± D(1/G1), B + A/2 ± D(1/G2)) }, (5.6)

where D(·) is defined in equation 5.1.

Before stating the main results, we introduce three types of regions in the neuron activation space that are of special importance. The regions are parameterized by a, d > 4/A and correspond to stability types of fixed points⁵ of equation 3.3 (see Figure 5):

R^A_00(a, d) = (B, B + A/2 − D(a)) × (B, B + A/2 − D(d)), (5.7)

R^S_00(a, d) = [B + A/2 − D(a), B + A/2] × (B, B + A/2 − D(d)) ∪ (B, B + A/2 − D(a)) × [B + A/2 − D(d), B + A/2], (5.8)

⁵ Superscripts A, S, and R indicate attractive, saddle, and repulsive points, respectively.
Figure 5: Partitioning of the network state-space according to stability types of the fixed points.
R^R_00(a, d) = (B + A/2 − D(a), B + A/2] × (B + A/2 − D(d), B + A/2]. (5.9)

Having partitioned the interval (B, B + A/2]² (see equation 5.2) of the (x, y)-space into the sets R^A_00(a, d), R^S_00(a, d), and R^R_00(a, d), we partition the remaining intervals (see equations 5.3, 5.4, and 5.5) in the same manner. The resulting partition of the network state-space (B, B + A)² can be seen in Figure 5. Regions symmetrical to R^A_00(a, d), R^S_00(a, d), and R^R_00(a, d) with respect to the line x = B + A/2 are denoted by R^A_10(a, d), R^S_10(a, d), and R^R_10(a, d), respectively. Similarly, R^A_01(a, d), R^S_01(a, d), and R^R_01(a, d) denote the regions symmetrical to R^A_00(a, d), R^S_00(a, d), and R^R_00(a, d), respectively, with respect to the line y = B + A/2. Finally, we denote by R^A_11(a, d), R^S_11(a, d), and R^R_11(a, d) the regions symmetrical to R^A_01(a, d), R^S_01(a, d), and R^R_01(a, d) with respect to the line x = B + A/2.

We are now ready to translate the results formulated in lemmas 1, 2, and 3 into the (x, y)-space of neural activations.

Theorem 1. If bc > 0, |a| > 4/A, and |d| > 4/A, then all attractive fixed points of equation 3.3 lie in

∪_{i∈I} R^A_i(|a|, |d|),

where I is the index set I = {00, 10, 01, 11}.
Theorem 2. If bc < 0, ad < 0, |a| > 4/A, |d| > 4/A, and |ad| ≥ |bc|/2, then all fixed points of equation 3.3 lying in

∪_{i∈I} R^A_i(|a|, |d|), I = {00, 10, 01, 11},

are attractive.

Theorem 3. If |ã|, |d̃| > 4/A (see equations 4.20 and 4.21) and one of the following conditions is satisfied:

bc > 0 and ad < 0,
bc < 0 and ad > 0,
bc < 0, ad < 0, and |ad| ≤ |bc|/2,

then all fixed points of equation 3.3 lying in

∪_{i∈I} R^A_i(|ã|, |d̃|), I = {00, 10, 01, 11},

are attractive.

For an insight into the bifurcation mechanism (explored in the next section) by which attractive fixed points of equation 3.3 are created (or dismissed), it is useful to have an idea where other types of fixed points can lie. When the signs of the weights on the neuron self-connections are equal (ad > 0), as are the signs of the weights on the interneuron connections (bc > 0), we have the following theorem:

Theorem 4. Suppose ad > 0, bc > 0, |a| > 4/A, and |d| > 4/A. Then the following can be said about the fixed points of equation 3.3. Attractive points can lie only in

∪_{i∈I} R^A_i(|a|, |d|), I = {00, 10, 01, 11}.

If ad ≥ bc/2, then all fixed points in

∪_{i∈I} R^S_i(|a|, |d|)

are saddle points; repulsive points can lie only in

∪_{i∈I} R^R_i(|a|, |d|).
The system (see equation 3.3) has repulsive points only when

|ad − bc| ≥ (4/A)·min{|a|, |d|}.

Proof. The regions for attractive fixed points follow from theorem 1.

Consider first the case a, d > 0. A fixed point (x, y) of equation 3.3 is a saddle if |λ2| < 1 and |λ1| = λ1 > 1 (see equations 4.13 and 4.14). Assume ad > bc. Then

0 < sqrt( (aG1 + dG2)² − 4G1G2(ad − bc) ) = sqrt(D(G1, G2)) < aG1 + dG2.

It follows that if aG1 + dG2 < 2 (i.e., if (G1, G2) ∈ r1⁻; see equation 4.17), then

0 < aG1 + dG2 − sqrt(D(G1, G2)) < 2,

and so 0 < λ2 < 1. For (G1, G2) ∈ r1⁰ ∪ r1⁺, we solve the inequality

aG1 + dG2 − sqrt(D(G1, G2)) < 2,

which is satisfied by (G1, G2) from k1⁻ ∩ (r1⁰ ∪ r1⁺). For a definition of k1, see equation 4.18. It can be seen (see Figure 2) that at all fixed points (x, y) of equation 3.3 with

w(x, y) ∈ (0, A/4] × (0, min{1/d̃, A/4}] ∪ (0, min{1/ã, A/4}] × (0, A/4],

the eigenvalue λ2 > 0 is less than 1. This is certainly true for all (x, y) such that

w(x, y) ∈ (0, A/4] × (0, 1/d) ∪ (0, 1/a) × (0, A/4].

In particular, the preimages under the map w of

(G1, G2) ∈ (1/a, A/4] × (0, 1/d) ∪ (0, 1/a) × (1/d, A/4]

define the region

∪_{i∈I} R^S_i(a, d), I = {00, 10, 01, 11},

where only saddle fixed points of equation 3.3 can lie.

Fixed points (x, y) whose images under w lie in k1⁺ ∩ r1⁺ are repellers. No (G1, G2) can lie in that region if ã, d̃ ≤ 4/A, that is, if d(a − 4/A) ≤ bc and a(d − 4/A) ≤ bc, which is equivalent to

max{a(d − 4/A), d(a − 4/A)} ≤ bc.
When ad = bc, we have (see equation 4.14)

sqrt(D(G1, G2)) = aG1 + dG2,

and so λ2 = 0. Hence, there are no repelling points if ad = bc.

Assume ad < bc. Then

sqrt(D(G1, G2)) > aG1 + dG2,

which implies that λ2 is negative. It follows that the inequality to be solved is

aG1 + dG2 − sqrt(D(G1, G2)) > −2.

It is satisfied by (G1, G2) from k2⁺ (see equation 4.25). If 2ad ≥ bc, then |ã| ≤ a and |d̃| ≤ d. Fixed points (x, y) with

w(x, y) ∈ (0, A/4] × (0, min{1/|d̃|, A/4}] ∪ (0, min{1/|ã|, A/4}] × (0, A/4]

have |λ2| less than 1. If 2ad ≥ bc, this is true for all (x, y) such that

w(x, y) ∈ (0, A/4] × (0, 1/d) ∪ (0, 1/a) × (0, A/4],

and the preimages under w of

(G1, G2) ∈ (1/a, A/4] × (0, 1/d) ∪ (0, 1/a) × (1/d, A/4]

define the region ∪_{i∈I} R^S_i(a, d) where only saddle fixed points of equation 3.3 can lie. There are no repelling points if |ã|, |d̃| ≤ 4/A, that is, if

min{a(d + 4/A), d(a + 4/A)} ≥ bc.

The case a, d < 0 is analogous to the case a, d > 0. We conclude that if ad > bc, then at all fixed points (x, y) of equation 3.3 with

w(x, y) ∈ (0, A/4] × (0, min{1/|d̃|, A/4}] ∪ (0, min{1/|ã|, A/4}] × (0, A/4],

|λ1| < 1. Surely this is true for all (x, y) such that

w(x, y) ∈ (0, A/4] × (0, 1/|d|) ∪ (0, 1/|a|) × (0, A/4].
The preimages under w of

(G1, G2) ∈ (1/|a|, A/4] × (0, 1/|d|) ∪ (0, 1/|a|) × (1/|d|, A/4]

define the region ∪_{i∈I} R^S_i(|a|, |d|) where only saddle fixed points of equation 3.3 can lie. There are no repelling points if |ã|, |d̃| ≤ 4/A, that is, if |d|(|a| − 4/A) ≤ bc and |a|(|d| − 4/A) ≤ bc, which is equivalent to

max{|a|(|d| − 4/A), |d|(|a| − 4/A)} ≤ bc.

In the case ad = bc, we have

sqrt(D(G1, G2)) = |aG1 + dG2|,

and so λ1 = 0. Hence, there are no repelling points.

If ad < bc, then at all fixed points (x, y) with

w(x, y) ∈ (0, A/4] × (0, min{1/d̃, A/4}] ∪ (0, min{1/ã, A/4}] × (0, A/4],

λ1 > 0 is less than 1. If 2ad ≥ bc, this is true for all (x, y) such that

w(x, y) ∈ (0, A/4] × (0, 1/|d|) ∪ (0, 1/|a|) × (0, A/4],

and the preimages under w of

(G1, G2) ∈ (1/|a|, A/4] × (0, 1/|d|) ∪ (0, 1/|a|) × (1/|d|, A/4]

define the region ∪_{i∈I} R^S_i(|a|, |d|) where only saddle fixed points of equation 3.3 can lie. There are no repelling points if ã, d̃ ≤ 4/A, that is, if

min{|a|(|d| + 4/A), |d|(|a| + 4/A)} ≥ bc.

In general, we have shown that if ad < bc and ad + (4/A)·min{|a|, |d|} ≥ bc, or ad = bc, or ad > bc and ad − (4/A)·min{|a|, |d|} ≤ bc, then there are no repelling points.

6 Creation of a New Attractive Fixed Point Through Saddle-Node Bifurcation
In this section, we are concerned with the actual position of the fixed points of equation 3.3. We study how the coefficients a, b, t1, c, d, and t2 affect the number and position of the fixed points. It is illustrative first to concentrate
on only a single neuron N from the pair of neurons forming the neural network. Denote the values of the weights associated with the self-loop on neuron N and with the interconnection link from the other neuron to neuron N by s and r, respectively. The constant input to neuron N is denoted by t. If the activations of neuron N and the other neuron are u and v, respectively, then the activation of neuron N at the next time step is g(su + rv + t). If the activation of neuron N is not to change, (u, v) should lie on the curve f_{s,r,t}:

v = f_{s,r,t}(u) = (1/r)·( −t − su + ln((u − B)/(B + A − u)) ). (6.1)

The function ln((u − B)/(B + A − u)): (B, B + A) → R is monotonically increasing, with

lim_{u→B⁺} ln((u − B)/(B + A − u)) = −∞

and

lim_{u→(B+A)⁻} ln((u − B)/(B + A − u)) = ∞.

Although the linear function −su − t cannot influence the asymptotic properties of f_{s,r,t}, it can locally influence its "shape." In particular, while the effect of the constant term −t is just a vertical shift of the whole function, −su (if decreasing, that is, if s > 0 and "sufficiently large") has the power to overcome for a while the increasing tendency of ln((u − B)/(B + A − u)). More precisely, if s > 4/A, then the term −su causes the function −su − t + ln((u − B)/(B + A − u)) to "bend" so that on

[B + A/2 − D(s), B + A/2 + D(s)]

it is decreasing, while it still increases on

(B, B + A/2 − D(s)) ∪ (B + A/2 + D(s), B + A).

The function

−su − t + ln((u − B)/(B + A − u))

is always concave on (B, B + A/2) and convex on (B + A/2, B + A). Finally, the coefficient r scales the whole function, and flips it around the u-axis if r < 0. A graph of f_{s,r,t}(u) is presented in Figure 6.
Figure 6: Graph of f_{s,r,t}(u). The solid line represents the case t = s = 0, r > 0. The dashed line shows the graph when t < 0, s > 4/A, and r > 0. The negative external input t shifts the bent part into v > 0.
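The "bending" of f_{s,r,t} for s > 4/A can be checked directly. The following sketch (with A = 1, B = 0 and illustrative values of s, r, t) confirms that the curve decreases exactly on the interval of half-width D(s) around B + A/2 and increases outside it:

```python
import math

# f_{s,r,t}(u) = (1/r) * (-t - s*u + ln((u-B)/(B+A-u)))   (eq. 6.1), with A=1, B=0.
A, B = 1.0, 0.0

def f(u, s, r, t):
    return (-t - s*u + math.log((u - B) / (B + A - u))) / r

def df_du(u, s, r):
    # Derivative of the bracket: -s + A / ((u-B)*(B+A-u)), scaled by 1/r.
    return (-s + A / ((u - B) * (B + A - u))) / r

def Delta(s):
    # D(s) from eq. 5.1: half-width of the interval on which the curve "bends".
    return (A / 2.0) * math.sqrt(1.0 - 4.0 / (A * s))

s, r, t = 8.0, 1.0, 0.0          # s > 4/A, so the curve is non-monotonic
mid = B + A / 2.0

# f decreases strictly inside (mid - Delta(s), mid + Delta(s)) ...
dec_inside = df_du(mid, s, r) < 0
# ... and increases outside that interval.
inc_outside = (df_du(mid + Delta(s) + 0.05, s, r) > 0
               and df_du(mid - Delta(s) - 0.05, s, r) > 0)
```

The boundary points mid ± Delta(s) are exactly where the derivative vanishes, matching the monotonicity intervals stated above.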
Each fixed point of equation 3.3 lies on the intersection of two curves, y = f_{a,b,t1}(x) and x = f_{d,c,t2}(y) (see equation 6.1). There are C(4,2) + 4 = 10 possible cases of coexistence of the two neurons in the network. Based on the results from the previous section, in some cases we are able to predict the stability types of the fixed points of equation 3.3 according to their position in the neuron activation space. More interestingly, we have structured the network state-space (B, B + A)² into areas corresponding to stability types of the fixed points, and in some cases these areas directly correspond to monotonicity intervals of the functions f_{a,b,t1} and f_{d,c,t2} defining the position of the fixed points. The results of the last section will be particularly useful when the weights a, b, c, d and external inputs t1, t2 are such that the functions f_{a,b,t1} and f_{d,c,t2} "bend," thus possibly creating a complex intersection pattern in (B, B + A)².
#0 fa,b,t 1
³ ´¼ A ( ( ))| ( ) ¡D a D x, fa,b,t1 x x 2 B, B C 2
(6.2)
contains the points lying on the “rst outer branch” of fa,b,t1 (x ). Analogously, the set » ³ ´¼ A #1 ( ( ))| ( ) C C C 2 D (6.3) fa,b,t D x, f x x B a , B A a,b,t 1 1 2
1402
Peter Ti Ïno, Bill G. Horne, and C. Lee Giles
contains the points in the “second outer branch” of fa,b,t1 (x ). Finally, »
¤ fa,b,t 1
³ ´¼ A A ( ( ))| ( ) ( ) C C C ¡D a ,B D a D x, fa,b,t1 x x 2 B 2 2
(6.4)
is the set of points on the “middle branch” of fa,b,t1 (x ). Similarly, for d > 4 / A, we dene the sets » ³ ´¼ A #0 ( fd,c,t2 (y ), y )| y 2 B, B C (d ) , ¡ D fd,c,t D 2 2 »
#1 fd,c,t 2
³ ´¼ A ( ( ) )| ( ) C C C 2 D D fd,c,t2 y , y y B d ,B A , 2
(6.5)
(6.6)
and » ³ ´¼ A A ¤ ( ( ) )| ( ) ( ) C C C 2 ¡ D D . fd,c,t D f y , y y B d , B d d,c,t 2 2 2 2
(6.7)
containing the points on the “rst outer branch,” the “second outer branch,” and the “middle branch,” respectively, of fd,c,t2 (y ). Using theorem 4 we state the following corollary: Assume a > 4 / A, d > 4 / A, bc > 0, and ad ¸ bc / 2. Then attractive xed points of equation 3.3 can lie only on the intersection of the outer branches of fa,b,t1 and fd,c,t2 . Whenever the middle branch of fa,b,t1 intersects with an outer branch of fd,c,t2 (or vice versa), it corresponds to a saddle point of equation 3.3. In other words, all attractive xed points of equation 3.3 are from [ #j #i \ fd,c,t2 . fa,b,t 1 Corollary 1.
i, j D 0,1
Every point from ¤ \ fa,b,t 1
or ¤ \ fd,c,t 2
is a saddle point of equation 3.3.
[ i D 0,1
[ i D 0,1
#i fd,c,t 2
#i fa,b,t 1
Corollary 1 suggests the saddle-node bifurcation as the usual scenario of creation of a new attractive fixed point. In the saddle-node bifurcation, a pair of new fixed points is created, of which one is attractive and the other is a saddle. Attractive fixed points disappear in a reverse manner:
Figure 7: Geometrical illustration of the saddle-node bifurcation in a neural network with two recurrent neurons. f_{d,c,t2}(y), shown as the dashed curve, intersects with f_{a,b,t1}(x) in three points. By increasing d, f_{d,c,t2} bends further (solid curve) and intersects with f_{a,b,t1} in five points. Saddle and attractive points are marked with squares and stars, respectively. The parameter settings are a, d > 0 and b, c < 0.
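The creation of new fixed points as d grows past 4/A can be reproduced numerically. The sketch below uses illustrative parameters (a symmetric choice of external inputs, so more new fixed points appear at once than in the three-to-five transition of Figure 7) and counts fixed points of the two-neuron map by Newton iteration from a grid of starting points:

```python
import numpy as np

# Fixed points of the two-neuron map x' = g(a x + b y + t1), y' = g(c x + d y + t2)
# (eq. 3.3), located by Newton's method from a grid of starting points.
# All parameter values below are illustrative (a, d > 0; b, c < 0, as in Figure 7).
A, B = 1.0, 0.0
g = lambda l: B + A / (1.0 + np.exp(-l))
dg = lambda l: (g(l) - B) * (B + A - g(l)) / A    # g'(l) via eq. 4.3

def fixed_points(a, b, c, d, t1, t2):
    found = set()
    for x0 in np.linspace(0.05, 0.95, 7):
        for y0 in np.linspace(0.05, 0.95, 7):
            z = np.array([x0, y0])
            for _ in range(100):
                u1 = a*z[0] + b*z[1] + t1
                u2 = c*z[0] + d*z[1] + t2
                F = np.array([g(u1) - z[0], g(u2) - z[1]])
                J = np.array([[a*dg(u1) - 1, b*dg(u1)],
                              [c*dg(u2), d*dg(u2) - 1]])
                try:
                    z = z - np.linalg.solve(J, F)
                except np.linalg.LinAlgError:
                    break
            u1 = a*z[0] + b*z[1] + t1
            u2 = c*z[0] + d*z[1] + t2
            if abs(g(u1) - z[0]) < 1e-10 and abs(g(u2) - z[1]) < 1e-10:
                found.add((round(float(z[0]), 6), round(float(z[1]), 6)))
    return sorted(found)

# With d < 4/A the curve f_{d,c,t2} is monotonic: few intersections.
n_before = len(fixed_points(8.0, -0.05, -0.05, 2.0, -3.975, -3.975))
# Increasing d past 4/A bends f_{d,c,t2} and creates new fixed points.
n_after = len(fixed_points(8.0, -0.05, -0.05, 8.0, -3.975, -3.975))
```

With the weak coupling chosen here, both curves bend for the larger d, so several attractive-saddle pairs are created at once; the count of fixed points strictly increases.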
an attractive point coalesces with a saddle and they are annihilated. This is illustrated in Figure 7. The curve f_{d,c,t2}(y) (the dashed line in Figure 7) intersects with f_{a,b,t1}(x) in three points. By increasing d, f_{d,c,t2} continues to bend (the solid curve in Figure 7) and intersects with f_{a,b,t1} in five points.⁶ Saddle and attractive points are marked with squares and stars, respectively, in Figure 7. Note that as d increases, the attractive fixed points move closer to the vertices {B, B + A}² of the network state-space. The next section rigorously analyzes this tendency in the context of recurrent networks with an arbitrary, finite number of neurons.

⁶ At the same time, |c| has to be appropriately increased to compensate for the increase in d, so that the "bent" part of f_{d,c,t2} does not move radically to higher values of x.

7 Attractive Periodic Sets in Recurrent Neural Networks with N Neurons

From now on, we study fully connected recurrent neural networks with N neurons. As usual, the weight on a connection from neuron j to neuron i is
denoted by w_{ij}. The output of the ith neuron at time step m is denoted by x_i^(m). The external input to neuron i is T_i. Each neuron has a "sigmoid-shaped" transfer function (see equation 3.1). The network evolves in the N-dimensional state-space O = (B, B + A)^N according to

x_i^(m+1) = g_{A,B,μ_i}( Σ_{n=1}^N w_{in} x_n^(m) + T_i ), i = 1, 2, ..., N. (7.1)

Note that we allow different neural gains across the neuron population. In vector notation, we represent the state x^(m) at time step m as⁷

x^(m) = (x_1^(m), x_2^(m), ..., x_N^(m))^T,

and equation 7.1 becomes

x^(m+1) = F(x^(m)). (7.2)

It is convenient to represent the weights w_{ij} as a weight matrix W = (w_{ij}). We assume that W is nonsingular, that is, det(W) ≠ 0. Assuming an external input T = (T_1, T_2, ..., T_N)^T, the Jacobian matrix J(x) of the map F at x ∈ O can be written as

J(x) = M G(x) W, (7.3)

where M and G(x) are N-dimensional diagonal matrices,

M = diag(μ_1, μ_2, ..., μ_N) (7.4)

and

G(x) = diag(G_1(x), G_2(x), ..., G_N(x)), (7.5)

respectively, with⁸

G_i(x) = g'( μ_i Σ_{n=1}^N w_{in} x_n + μ_i T_i ), i = 1, 2, ..., N. (7.6)

We denote the absolute value of the determinant of the weight matrix W by W, that is,

W = |det(W)|. (7.7)

⁷ Superscript T denotes the transpose operator.
⁸ Remember that, for the sake of simplicity, we denote g_{A,B,1} by g.
Then

|det(J(x))| = W ∏_{n=1}^N μ_n G_n(x). (7.8)
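Equation 7.8 is a direct consequence of the diagonal structure in equation 7.3. A quick numerical check with randomly chosen (illustrative) weights, gains, and state:

```python
import numpy as np

# Jacobian of the N-neuron map (eq. 7.3): J(x) = M G(x) W, and the determinant
# identity |det J(x)| = |det W| * prod_n mu_n G_n(x)   (eq. 7.8).
rng = np.random.default_rng(0)
N = 4
A, B = 1.0, 0.0
W = rng.normal(size=(N, N))          # weight matrix, nonsingular almost surely
T = rng.normal(size=N)               # external inputs
mu = np.array([1.0, 2.0, 0.5, 3.0])  # per-neuron gains

x = rng.uniform(0.2, 0.8, size=N)    # an arbitrary state in (B, B+A)^N
u = mu * (W @ x + T)                 # gain times net input (argument of g in eq. 7.6)
gx = B + A / (1.0 + np.exp(-u))      # g evaluated at the net inputs
G = gx * (B + A - gx) / A            # G_n(x) = g'(...) recovered via eq. 4.3

J = np.diag(mu) @ np.diag(G) @ W
lhs = abs(np.linalg.det(J))
rhs = abs(np.linalg.det(W)) * np.prod(mu * G)
```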
Lemma 4 formulates a useful relation between the value of the transfer function and its derivative. It is formulated in a more general setting than that used in sections 4 and 5.

Lemma 4. Let u ∈ (0, A/4]. Then the set

{g(ℓ) | g'(ℓ) ≥ u}

is a closed interval of length 2D(1/u), centered at B + A/2, where D(·) is defined in equation 5.1.

Proof. For a particular value u of g'(ℓ) = ψ(g(ℓ)), the corresponding values⁹ ψ⁻¹(u) of g(ℓ) are

B + A/2 ± D(1/u).

The transfer function g(ℓ) is monotonically increasing, with increasing derivative on (−∞, 0) and decreasing derivative on (0, ∞). The maximal derivative g'(0) = A/4 occurs at ℓ = 0, with g(0) = B + A/2.

Denote the geometric mean of the neural gains in the neuron population by μ̃:

μ̃ = ( ∏_{n=1}^N μ_n )^{1/N}. (7.9)
For W^{1/N} ≥ 4/(Aμ̃), define

c⁻ = B + A/2 − D(μ̃ W^{1/N}), (7.10)
c⁺ = B + A/2 + D(μ̃ W^{1/N}). (7.11)

Denote the hypercube [c⁻, c⁺]^N by H. The next theorem states that increasing the neural gains μ_i pushes the attractive sets away from the center {B + A/2}^N of the network state-space, toward the faces of the activation hypercube O = (B, B + A)^N.

⁹ ψ⁻¹ is the set-theoretic inverse of ψ.
Theorem 5. Assume

W ≥ (4/(Aμ̃))^N.

Then attractive periodic sets of equation 7.2 cannot lie in the hypercube H with sides of length

2D(μ̃ W^{1/N}),

centered at {B + A/2}^N, the center of the state-space.

Proof. Suppose there is an attractive periodic orbit X = {x^(1), ..., x^(Q)} ⊂ H = [c⁻, c⁺]^N, with F(x^(1)) = x^(2), F(x^(2)) = x^(3), ..., F(x^(Q)) = x^(Q+1) = x^(1). The determinant of the Jacobian matrix of the map F^Q at x^(j), j = 1, ..., Q, is

det(J_{F^Q}(x^(j))) = ∏_{k=1}^Q det(J(x^(k))),

and, by equation 7.8,

|det(J_{F^Q}(x^(j)))| = W^Q ∏_{k=1}^Q ∏_{n=1}^N μ_n G_n(x^(k)).

From

G_n(x^(k)) = ψ(x_n^(k+1)), n = 1, ..., N, k = 1, ..., Q,

where ψ is defined in equation 4.3, it follows that if all x^(k) are from the hypercube H = [c⁻, c⁺]^N (see equations 7.10 and 7.11), then, by lemma 4, we have

W^Q ∏_{k=1}^Q ∏_{n=1}^N μ_n G_n(x^(k)) ≥ W^Q ∏_{k=1}^Q ∏_{n=1}^N μ_n / (μ̃ W^{1/N}) = 1.

But

det(J_{F^Q}(x^(1))) = det(J_{F^Q}(x^(2))) = ... = det(J_{F^Q}(x^(Q)))

is the product of the eigenvalues of F^Q at x^(j), j = 1, ..., Q, and so the absolute value of at least one of them has to be equal to or greater than 1. This contradicts the assumption that X is attractive.

The plots in Figure 8 illustrate the growth of the hypercube H with growing v = μ̃ W^{1/N}, in the case B = 0 and A = 1. The lower and upper curves correspond to the functions 1/2 − D(v) and 1/2 + D(v), respectively. For a particular value of v, H is the hypercube with each side centered at 1/2 and spanned between the two curves.
Figure 8: Illustration of the growth of the hypercube H with increasing v = μ̃ W^{1/N}. Parameters of the transfer function are set to B = 0, A = 1. The lower and upper curves correspond to the functions 1/2 − D(v) and 1/2 + D(v), respectively. For a particular value of v, the hypercube H has each side centered at 1/2 and spanned between the two curves.
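The bounds c⁻, c⁺ of H and the core estimate in the proof of theorem 5 (inside H, |det(J(x))| ≥ 1) can be checked numerically. The weight matrix and gains below are illustrative:

```python
import numpy as np

# Bounds of the "forbidden" hypercube H = [c-, c+]^N of theorem 5.
rng = np.random.default_rng(1)
N = 3
A, B = 1.0, 0.0
W = np.array([[ 3.0, -1.0,  0.5],
              [ 0.5,  4.0, -1.0],
              [-1.0,  0.5,  5.0]])       # illustrative weight matrix
mu = np.array([6.0, 8.0, 10.0])          # neural gains
mu_tilde = np.prod(mu) ** (1.0 / N)      # geometric mean of the gains (eq. 7.9)
Wbar = abs(np.linalg.det(W))             # |det(W)|

v = mu_tilde * Wbar ** (1.0 / N)
assert v >= 4.0 / A                      # condition of theorem 5
Dv = (A / 2.0) * np.sqrt(1.0 - 4.0 / (A * v))
c_lo, c_hi = B + A/2 - Dv, B + A/2 + Dv  # eqs. 7.10 and 7.11

# Inside H every derivative psi(x_n) >= 1/v (lemma 4), hence |det J(x)| >= 1.
psi = lambda y: (y - B) * (B + A - y) / A
ok = True
for _ in range(1000):
    x = rng.uniform(c_lo, c_hi, size=N)
    detJ = Wbar * np.prod(mu * psi(x))
    ok = ok and detJ >= 1.0 - 1e-12
```

At the corners of H the bound is tight: the determinant equals 1 there, and it exceeds 1 everywhere in the interior.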
7.1 Effect of Growing Neural Gain. Hirsch (1994) studies recurrent networks with nondecreasing transfer functions f from a broader class than the "sigmoid-shaped" class (see equation 3.1) considered here. Transfer functions can differ from neuron to neuron. He shows that the attractive fixed-point coordinates corresponding to neurons with a steadily increasing neural gain tend to saturation values¹⁰ as the gain grows without bound. It is assumed that all the neurons with increasing gain are either self-exciting or self-inhibiting.¹¹

To investigate the effect of growing neural gain in our setting, we assume that the neurons are split into two groups. Initially, all the neurons have the same transfer function g_{A,B,μ*}. The neural gain on neurons from the first group does not change, while the gain on neurons from the second group

¹⁰ In fact, it is shown that each such coordinate tends to either a saturation value f(±∞), which is assumed to be finite, or a critical value f(v), with f'(v) = 0. The critical values are assumed to be finite in number. There are no critical values in the class of transfer functions considered in this article.
¹¹ Self-exciting and self-inhibiting neurons have positive and negative weights, respectively, on the feedback self-loops.
is allowed to grow. We investigate how the growth of the neural gain in the second subset of neurons affects the regions for attractive periodic sets of the network. In particular, we answer the following question: Given the weight matrix W and a "small neighborhood factor" ε > 0, how big a neural gain μ on the neurons from the second group is needed so that all attractive fixed points and at least some attractive periodic points must lie within the ε-neighborhood of the faces of the activation hypercube O = (B, B + A)^N?

Theorem 6. Let ε > 0 be a "small" positive constant. Assume M (1 ≤ M ≤ N) and N − M neurons have neural gain equal to μ and μ*, respectively. Then, for

μ > μ(ε) = μ*^{1−N/M} / ( W^{1/M} [ε(1 − ε/A)]^{N/M} ),

all attractive fixed points and at least some points from each attractive periodic orbit of the network lie in the ε-neighborhood of the faces of the hypercube O = (B, B + A)^N.

Proof. By theorem 5,

ε = A/2 − D( μ*^{(N−M)/N} μ(ε)^{M/N} W^{1/N} ) = (A/2)·[ 1 − sqrt( 1 − 4/(A μ*^{(N−M)/N} μ(ε)^{M/N} W^{1/N}) ) ]. (7.12)

Solving equation 7.12 for μ(ε), we arrive at the expression presented in the theorem.

Note that for very close ε-neighborhoods of the faces of O, the growth of μ obeys

μ ∝ 1/ε^{N/M}, (7.13)

and, if the gain is allowed to grow on each neuron,

μ ∝ 1/ε. (7.14)

Equation 7.12 constitutes a lower bound on the rate of convergence of the attractive fixed points toward the faces of O as the neural gain μ on the M neurons grows.
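The threshold μ(ε) and its consistency with equation 7.12 can be checked numerically; N, M, μ*, and the value taken for |det(W)| below are illustrative:

```python
import math

# Gain threshold mu(eps) of theorem 6: above it, attractive fixed points are
# confined to the eps-neighborhood of the faces of O.  Illustrative values only.
A, B = 1.0, 0.0
N, M = 4, 2            # M of the N neurons carry the growing gain
mu_star = 2.0          # fixed gain on the remaining N - M neurons
Wbar = 16.0            # |det(W)|, illustrative

def mu_eps(eps):
    # mu(eps) = mu_star^(1 - N/M) / (Wbar^(1/M) * [eps*(1 - eps/A)]^(N/M))
    return mu_star**(1.0 - N/M) / (Wbar**(1.0/M) * (eps * (1.0 - eps/A))**(N/M))

def D(u):
    # D(u) from eq. 5.1.
    return (A / 2.0) * math.sqrt(1.0 - 4.0 / (A * u))

eps = 0.05
mu = mu_eps(eps)

# Consistency with eq. 7.12: at mu = mu(eps), the half-width A/2 - D(...) equals eps.
mu_tilde = (mu_star**(N - M) * mu**M)**(1.0 / N)
recovered_eps = A / 2.0 - D(mu_tilde * Wbar**(1.0 / N))
```

Plugging μ(ε) back into equation 7.12 recovers ε, which is the defining property of the threshold.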
7.2 Using Alternative Bounds on Spectral Radius of Jacobian Matrix.
Results in the previous subsection are based on the bound: if |det(J(x))| ≥ 1, then ρ(J(x)) ≥ 1, where ρ(J(x)) is the spectral radius¹² of the Jacobian J(x). Although one can think of other bounds on the spectral radius of J(x), usually the expression describing the conditions on the terms G_n(x), n = 1, ..., N, such that ρ(J(x)) ≥ 1 is very complex (if it exists in a closed form at all). This prevents us from obtaining a relatively simple analytical approximation of the set

U = {x ∈ O | ρ(J(x)) ≥ 1}, (7.15)

where no attractive fixed points of equation 7.2 can lie. Furthermore, in general, if two matrices have their spectral radii equal to or greater than one, the same does not necessarily hold for their product. As a consequence, one cannot directly reason about regions with no attractive periodic sets. In this subsection, we shall use a simple bound on the spectral radius of a square matrix, stated in the following lemma.

Lemma 5. For an M-dimensional square matrix A,

ρ(A) ≥ |trace(A)|/M.

Proof. Let λ_i, i = 1, 2, ..., M, be the eigenvalues of A. From

trace(A) = Σ_i λ_i,

it follows that

ρ(A) ≥ (1/M) Σ_i |λ_i| ≥ (1/M) |Σ_i λ_i| = |trace(A)|/M.

Theorems 7 and 8 are analogous to theorems 5 and 6. We assume the same transfer function on all neurons. Generalization to transfer functions differing in neural gains is straightforward.

Theorem 7. Assume all the neurons have the same transfer function (see equation 3.1), and the weights on the neural self-loops are exclusively positive or exclusively
¹² The maximal absolute value of the eigenvalues of J(x).
negative, that is, either w_nn > 0, n = 1, 2, ..., N, or w_nn < 0, n = 1, 2, ..., N. If attractive fixed points of equation 7.2 lie in the hypercube with sides of length

2D( μ|trace(W)|/N ),

centered at {B + A/2}^N, the center of the state-space, then |trace(W)| < 4N/(μA).

Proof. Since the weights w_nn share the same sign, |trace(W)| = Σ_n |w_nn| > 0, and, by lemma 5 and equation 7.3, ρ(J(x)) ≥ (μ/N) Σ_n |w_nn| G_n(x). Hence, if |trace(W)| ≥ 4N/(μA), then [G*, A/4]^N ⊂ U, where G* = N/(μ|trace(W)|). The hypercube [G*, A/4]^N in the space of transfer function derivatives corresponds to the hypercube

[B + A/2 − D(1/G*), B + A/2 + D(1/G*)]^N

in the network state-space O.
Theorem 8. Let ε > 0 be a "small" positive constant. Assume all the neurons have the same transfer function (see equation 3.1). If w_nn > 0, n = 1, 2, ..., N, or w_nn < 0, n = 1, 2, ..., N, then for

μ > μ(ε) = N / ( |trace(W)| · ε(1 − ε/A) ),

all attractive fixed points of equation 7.2 lie in the ε-neighborhood of the faces of the hypercube O.

Proof. By theorem 7,

ε = A/2 − D( μ(ε)|trace(W)|/N ). (7.16)

Solving equation 7.16 for μ(ε), we arrive at the expression in the theorem.

8 Relation to Continuous-Time Networks
There is a large amount of literature devoted to dynamical analysis of continuous-time networks. While discrete-time networks are more appropriate for dealing with discrete and symbolic data, continuous-time networks seem more natural in many “physical” models and applications. Relations between the dynamics of discrete-time and continuous-time networks are considered, for example, in Blum and Wang (1992) and Tonnelier et al. (1999). Hirsch (1994) gives a simple example illustrating that there is no xed step size for discretizing continuous-time dynamics that would yield a discrete-time dynamics accurately reecting its continuoustime origin. Hirsch (1994) has pointed out that while there is a saturation result for stable limit cycles of continuous-time networks—for sufciently high gain, the output along a stable limit cycle is saturated almost all the time—there is no known analog of this for stable periodic orbits of discrete-time networks. Theorems 5 and 6 offer such an analog for discrete-time networks with sigmoid-shaped transfer functions studied in this article. Vidyasagar (1993) studied the number, location, and stability of high-gain equilibria in continuous-time networks with sigmoidal transfer functions. He concludes that all equilibria in the corners of the activation hypercube are stable. Theorems 6 and 7 formulate analogous results for discrete-time networks. 9 Conclusion
By performing a detailed fixed-point analysis of two-neuron recurrent networks, we partitioned the network state-space into regions corresponding
1412

Peter Tiňo, Bill G. Horne, and C. Lee Giles
to the fixed-point stability types. The results are intuitive and hold for a large class of sigmoid-shaped neuron transfer functions. Attractive fixed points cluster around the vertices of the activation square [L, H]², where L and H are the low and high transfer-function saturation levels, respectively. Repelling fixed points are concentrated close to the center ((L + H)/2, (L + H)/2) of the activation square. Saddle fixed points appear in the neighborhood of the four sides of the activation square. Unlike previous studies (e.g., Blum & Wang, 1992; Borisyuk & Kirillov, 1992; Pasemann, 1993; Tonnelier et al., 1999), we allowed all free parameters of the network to vary. We have rigorously shown that when the neurons self-excite and have the same mutual-interaction pattern, a new attractive fixed point is created through a saddle-node bifurcation. This is in accordance with recent empirical findings (Tonnelier et al., 1999). Next, we studied recurrent networks of an arbitrary finite number of neurons with sigmoid-shaped transfer functions. Inspired by the result of Hirsch (1994) concerning equilibrium points in high-neural-gain networks, we analyzed the tendency of attractive periodic points to approach the saturation faces of the activation hypercube as the neural gain increases. Their distance from the saturation faces is approximately inversely proportional to the neural gain.

Acknowledgments
This work was supported by grants VEGA 2/6018/99 and VEGA 1/7611/20 from the Slovak Grant Agency for Scientific Research (VEGA) and the Austrian Science Fund (FWF) SFB010.

References

Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Anderson, J. (1993). The BSB model: A simple nonlinear autoassociative neural network. In M. H. Hassoun (Ed.), Associative neural memories: Theory and implementation (pp. 77–103). Oxford: Oxford University Press.
Anton, H. (1980). Calculus with analytic geometry. New York: Wiley.
Beer, R. D. (1995). On the dynamics of small continuous-time recurrent networks. Adaptive Behavior, 3(4), 471–511.
Blum, K. L., & Wang, X. (1992). Stability of fixed points and periodic orbits and bifurcations in analog neural networks. Neural Networks, 5, 577–587.
Borisyuk, R. M., & Kirillov, A. (1992). Bifurcation analysis of a neural network model. Biological Cybernetics, 66, 319–325.
Botelho, F. (1999). Dynamical features simulated by recurrent neural networks. Neural Networks, 12, 609–615.
Casey, M. P. (1995). Relaxing the symmetric weight condition for convergent dynamics in discrete-time recurrent networks (Tech. Rep. No. INC-9504). La Jolla, CA: Institute for Neural Computation, University of California, San Diego.
Casey, M. P. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135–1178.
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405.
Hirsch, M. W. (1994). Saturation at high gain in discrete time recurrent networks. Neural Networks, 7(3), 449–453.
Hopfield, J. J. (1984). Neurons with a graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences USA, 81, 3088–3092.
Hui, S., & Zak, S. H. (1992). Dynamical analysis of the brain-state-in-a-box neural models. IEEE Transactions on Neural Networks, 1, 86–94.
Jin, L., Nikiforuk, P. N., & Gupta, M. M. (1994). Absolute stability conditions for discrete-time recurrent neural networks. IEEE Transactions on Neural Networks, 6, 954–963.
Klotz, A., & Brauer, K. (1999). A small-size neural network for computing with strange attractors. Neural Networks, 12, 601–607.
Manolios, P., & Fanelli, R. (1994). First order recurrent neural networks and deterministic finite state automata. Neural Computation, 6(6), 1155–1173.
McAuley, J. D., & Stampfli, J. (1994). Analysis of the effects of noise on a model for the neural mechanism of short-term active memory. Neural Computation, 6(4), 668–678.
Nakahara, H., & Doya, K. (1998). Near-saddle-node bifurcation behavior as dynamics in working memory for goal-directed behavior. Neural Computation, 10(1), 113–132.
Pakdaman, K., Grotta-Ragazzo, C., Malta, C. P., Arino, O., & Vibert, J. F. (1998). Effect of delay on the boundary of the basin of attraction in a system of two neurons. Neural Networks, 11, 509–519.
Pasemann, F.
(1993). Discrete dynamics of two neuron networks. Open Systems and Information Dynamics, 2, 49–66.
Pasemann, F. (1995a). Neuromodules: A dynamical systems approach to brain modelling. In H. J. Hermann, D. E. Wolf, & E. Poppel (Eds.), Supercomputing in brain research (pp. 331–348). Singapore: World Scientific.
Pasemann, F. (1995b). Characterization of periodic attractors in neural ring networks. Neural Networks, 8, 421–429.
Rodriguez, P., Wiles, J., & Elman, J. L. (1999). A recurrent neural network that learns to count. Connection Science, 11, 5–40.
Sevrani, F., & Abe, K. (2000). On the synthesis of brain-state-in-a-box neural models with application to associative memory. Neural Computation, 12(2), 451–472.
Tiňo, P., Horne, B. G., Giles, C. L., & Collingwood, P. C. (1998). Finite state machines and recurrent neural networks—automata and dynamical systems approaches. In J. E. Dayhoff & O. Omidvar (Eds.), Neural networks and pattern recognition (pp. 171–220). Orlando, FL: Academic Press.
Tiňo, P., & Sajda, J. (1995). Learning and extracting initial mealy machines with a modular neural network model. Neural Computation, 7(4), 822–844.
Tonnelier, A., Meignen, S., Bosch, H., & Demongeot, J. (1999). Synchronization and desynchronization of neural oscillators. Neural Networks, 12, 1213–1228.
Vidyasagar, M. (1993). Location and stability of the high-gain equilibria of nonlinear neural networks. Neural Networks, 4, 660–672.
Wang, D. L. (1996). Synchronous oscillations based on lateral connections. In J. Sirosh, R. Miikkulainen, & Y. Choe (Eds.), Lateral connections in the cortex: Structure and function. Austin, TX: UTCS Neural Networks Research Group.
Wang, X. (1991). Period-doublings to chaos in a simple neural network: An analytical proof. Complex Systems, 5, 425–441.
Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3), 406–414.
Zeng, Z., Goodman, R. M., & Smyth, P. (1993). Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6), 976–990.
Zhou, M. (1996). Fault-tolerance for two neuron networks. Master's thesis, University of Memphis.

Received April 12, 2000; accepted August 2, 2000.
LETTER
Communicated by Nicolas Brunel
Attractor Networks for Shape Recognition

Yali Amit
Massimo Mascaro
Department of Statistics, University of Chicago, Chicago, IL 60637, U.S.A.

We describe a system of thousands of binary perceptrons with coarse-oriented edges as input that is able to recognize shapes, even in a context with hundreds of classes. The perceptrons have randomized feedforward connections from the input layer and form a recurrent network among themselves. Each class is represented by a prelearned attractor (serving as an associative hook) in the recurrent net corresponding to a randomly selected subpopulation of the perceptrons. In training, first the attractor of the correct class is activated among the perceptrons; then the visual stimulus is presented at the input layer. The feedforward connections are modified using field-dependent Hebbian learning with positive synapses, which we show to be stable with respect to large variations in feature statistics and coding levels and which allows the use of the same threshold on all perceptrons. Recognition is based on the visual stimuli alone. These activate the recurrent network, which is then driven by the dynamics to a sustained attractor state, concentrated in the correct class subset and providing a form of working memory. We believe this architecture is more transparent than standard feedforward two-layer networks and has stronger biological analogies.

1 Introduction
Neural Computation 13, 1415–1442 (2001) © 2001 Massachusetts Institute of Technology

Learning and recognition in the context of neural networks is an intensively studied field. The leading paradigm from the engineering point of view is that of feedforward networks (to cite only a few examples in the context of character recognition, see LeCun et al., 1990; Knerr, Personnaz, & Dreyfus, 1992; Bottou et al., 1994; Hussain & Kabuka, 1994). Two-layer feedforward networks have proved to be very useful as classifiers in numerous other applications. However, they lack a certain transparency of interpretation relative to decision trees, for example, and their biological relevance is quite limited. Learning is not Hebbian and local; rather, it employs gradient descent on a global cost function. Furthermore, these nets fail to address the issue of sustained activity needed to deal with working memory. On the other hand, much work has gone into networks in which learning is Hebbian, and recognition is represented through the sustained activity of attractor states of the network dynamics (see Hopfield, 1982; Amit, 1989;
Amit & Brunel, 1995; Brunel, Carusi, & Fusi, 1998). In this field, however, research has primarily focused on the theoretical and biological modeling aspects. The input patterns used in these models are rather simplistic from the statistical point of view. The prototypes for each class are assumed uncorrelated, with approximately the same coding level. The patterns for each class are assumed to be simple noisy versions of the prototypes, with independent flips of low probability at each feature. These models have so far avoided the issue of how the attractors can be used for classification of real data, for example, image data, where the statistics of the input features are more complex. Indeed, if the input features of visual data are assumed to be some form of local edge filters, or even local functions of these edges, there is significant variability in the conditional distributions of these features given each class. This is described in detail in section 4.1. In this article we seek to bridge the gap between these two rather disjoint fields of research while trying to minimize the assumptions we make regarding the information encoded in individual neurons or in the connecting synapses, as well as staying faithful to the idea that learning is local. Thus, the neurons are all binary, and all have the same threshold. Synapses are positive, with a limited range of values, and can be effectively assumed binary. The network can easily be modified to have multivalued neurons; the performance would only improve. However, since it is still unclear to what extent the biological system can exploit the fine details of neuronal firing rates or spike timing, we prefer to work with binary neurons, essentially representing high and low firing rates. Similar considerations hold for the synaptic weights; moreover, the work of Petersen, Malenka, Nicoll, and Hopfield (1998) suggests that potentiation in synapses is indeed discrete.
The network consists of a recurrent layer A with random feedforward connections from a visual input layer I. Attractors are trained in the A layer, in an unsupervised way, using K classes of easy stimuli: noisy versions of K uncorrelated prototypes represented by subsets Ac, c = 1, ..., K, in which the neurons are on. The activities of the attractors are concentrated around these subsets. Each visual class is then arbitrarily associated with one of the easy stimuli classes and is hence represented by the subset, or population, Ac. Prior to the presentation of visual input from class c to layer I, a noisy version of the easy stimulus associated with class c is used to drive the recurrent layer to an attractor state so that activity concentrates in the population Ac. Following this, the feedforward connections between layers I and A are updated. This stage of learning is therefore supervised. The easy stimuli are not assumed to reach the A layer via the visual input layer I. They can be viewed, for example, as reward-type stimuli arriving from some other pathway (auditory, olfactory, tactile), as described in Rolls (2000), potentiating cues as described in Levenex and Schenk (1997), or mnemonic hooks as introduced in Atkinson (1975). In testing, only visual input is presented at layer I. This leads to the activation of certain units in A,
and the recurrent dynamics leads this layer to a stable state, which, one hopes, concentrates on the population corresponding to the correct class. All learning proceeds along the Hebbian paradigm. A continuous internal synaptic state increases when both pre- and postsynaptic neurons are on (potentiation) and decreases if the presynaptic neuron is on but the postsynaptic neuron is off (depression). The internal state translates through a binary or sigmoidal transfer function into a synaptic efficacy. We add one important modification. If, at a given presentation, the local input to a postsynaptic neuron is above a threshold (higher than the firing threshold), the internal state of the afferent synapses does not increase. This is called field-dependent Hebbian learning. This modification allows us to avoid global normalization of synaptic weights or the modification of neuronal thresholds. Moreover, field-dependent Hebbian learning turns out to be very robust to parameter settings, avoids problems due to variability in the feature statistics of the different classes, and does not require a priori knowledge of the size of the training set or the number of classes. The details of the learning protocol and its analysis are presented in section 4. Note that we are avoiding real-time dynamic architectures involving noisy integrate-and-fire neurons and real-time synaptic dynamics, as in Amit and Brunel (1995), Mattia and Del Giudice (2000), and Fusi, Annunziato, Badoni, Salamon, and Amit (1999), and are not introducing inhibition. In future work we intend to study the extension of these ideas to more realistic continuous-time dynamics. The visual input has the form of binary oriented-edge filters with eight orientations. The orientation tuning is very coarse, so that these neurons can be viewed as having a flat tuning curve spanning a range of angles, as opposed to the classical bell-shaped tuning curves.
Partial invariance is incorporated by using complex type neurons, which respond to the presence of an edge within a range of orientations anywhere in a moderate-size neighborhood. This type of partially invariant unit, which responds to a disjunction (union) or a local MAX operation, has been used in Fukushima (1986), Fukushima and Wake (1991), Amit and Geman (1997, 1999), Amit (2000), and Riesenhuber and Poggio (1999). From a purely algorithmic point of view, the attractors are unnecessary. Activity in the A layer can be clamped to the set Ac during training, and testing can simply be based on a vote in the A layer. In fact, the attractor dynamics applied to the visual input leads to a decrease in performance, as explained in section 3.2. However, from the point of view of neural modeling, the attractor dynamics is appealing. In the training stage, it is a robust way to assign a label to a class without pinpointing a particular individual neuron. In testing, the attractor dynamics cleans the activity in layer A; it is concentrated within a particular population, which again corresponds to the label. This could prove useful for other modules that need to read out the activity of this system for subsequent processing stages. Finally, the attractor is a stable state of the dynamics and is hence a mechanism for maintaining
a sustained activity representing the recognized class (working memory), as in Miyashita and Chang (1988) and Sakai and Miyashita (1991). We have chosen to experiment with this architecture in the context of shape recognition. We work with the NIST handwritten digit data set and a data set of 293 randomly deformed Latex symbols, which are presented in a fixed-size grid, although they exhibit significant variation in size as well as other forms of linear and nonlinear deformation. (In Figure 10 we display a sample from this data set.) With one network with 2000 neurons in the A layer, we achieve a 94% classification rate on NIST (see section 5), a very encouraging result given the constrained learning and encoding framework we are working with. Moreover, a simple boosting procedure, whereby several nets are trained in sequence, brings the classification rate to 97.6%. Although far from the best reported in the literature, which is over 99% (see Wilkinson et al., 1992; Bottou et al., 1994; Amit, Geman, & Wilder, 1997), this is definitely comparable to many experimental papers on this subject (e.g., Hastie, Buja, & Tibshirani, 1995; Avnimelech & Intrator, 1999) where boosting is also employed. The same holds for the Latex database, where the only comparison is with respect to the decision trees studied in Amit and Geman (1997). Boosting consists of training a new network on training examples that are misclassified by the combined vote of the existing networks. The concept of boosting is appealing in that hard examples are put aside for further training and processing. In this article, however, we do not describe explicit ways to model the actual neural implementation of boosting. Some ideas are offered in the discussion section. In section 2 we discuss related work, similar architectures, learning rules, and the relation to certain ideas in the machine learning literature. In section 3 the precise architecture is described, as well as the learning rule and the training procedure.
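The boosting scheme described above (each new net is trained on the examples the current ensemble's combined vote misclassifies) can be sketched as follows. This is only an illustrative sketch: `train_net` and `combined_vote` are hypothetical callables standing in for the network machinery, not an interface from the article.

```python
def boost_train(train_net, combined_vote, data, labels, n_nets=3):
    """Sequential boosting sketch: train up to n_nets networks, each new
    one on the examples the existing ensemble's vote gets wrong.

    train_net(xs, ys) returns a trained net; combined_vote(nets, x)
    returns the ensemble's predicted label for x (hypothetical interface).
    """
    nets = []
    subset = list(range(len(data)))          # the first net sees everything
    for _ in range(n_nets):
        nets.append(train_net([data[i] for i in subset],
                              [labels[i] for i in subset]))
        # collect the examples the current ensemble misclassifies
        subset = [i for i in range(len(data))
                  if combined_vote(nets, data[i]) != labels[i]]
        if not subset:                       # nothing left to fix
            break
    return nets
```

The appeal of the scheme, as noted above, is that hard examples are set aside and given to fresh networks rather than repeatedly averaged into one.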
In section 4 we analyze this learning rule and compare it to standard Hebbian learning. In section 5 we describe experimental results on both data sets and study the sensitivity of the learning rule to various parameter settings.

2 Related Work

2.1 Architecture. The architecture we propose consists of a visual input layer feeding into an attractor layer. The main activity during learning involves updating the synaptic weights of the connections between the input layer and the attractor layer. Such simple architectures have recently been proposed by Riesenhuber and Poggio (1999) and by Bartlett and Sejnowski (1998). In Riesenhuber and Poggio (1999), the second layer is a classical output layer with individual neurons coding for different classes. No recurrent activity is modeled during training or testing, but training is supervised. The input layer codes large-range disjunctions, or MAX operations, of complex features, which in turn are conjunctions of pairs of complex edge-type features. In an attempt to achieve large-range translation and scale invariance, the range of disjunction is the entire visual field. This means that objects are recognized based solely on the presence or absence of the features, entirely ignoring their location. With sufficiently complex features, this may well be possible, but then the combinatorics of the number of necessary features appears overwhelming. It should be noted that in the context of the character recognition studied here, we find a significant drop in classification rates when the range of disjunction or maximization is on the order of the image size. This trade-off between feature complexity, combinatorics, and invariance is a crucial issue, which has yet to be systematically investigated. In the architecture proposed in Bartlett and Sejnowski (1998), the second layer is a recurrent layer. The input features are again edge filters. Training is semisupervised, not directly through the association of certain neurons with a certain class but through the sequential presentation of slowly varying stimuli of the same class. Hebbian learning of temporal correlations is used to increase the weight of synapses connecting units in the recurrent layer responding to these consecutive stimuli. At the start of each class sequence, the temporal interaction is suspended. This is a very appealing alternative to supervision, which exploits the fact that objects are often observed by slowly rotating them in space. This network employs continuous-valued neurons and synapses, which in principle can achieve negative values. A bias input is introduced for every output neuron, which corresponds to allowing for adaptable thresholds and depends on the size of the training set. Finally, the feedforward synaptic weights are globally normalized. It would be interesting to study whether field-dependent Hebbian learning can lead to similar results without the use of such global operations on the synaptic matrix.

2.2 Field-Dependent Hebbian Learning.
The learning rule employed here is motivated, on one hand, by work on binary synapses in Amit and Fusi (1994), Amit and Brunel (1995), and Mattia and Del Giudice (2000), where the synapses are modified stochastically, only as a function of the activity of the pre- and postsynaptic neurons. Here, however, we have replaced the stochasticity of the learning process with a continuous internal variable for the synapse. This is more effective for neurons with a small number of synapses on the dendritic tree. On the other hand, we draw from Diederich and Opper (1987), where Hopfield-type nets with positive- and negative-valued multistate synapses are modified using the classical Hebbian update rule, but modification stops when the local field is above or below certain thresholds. There exists ample evidence for local synaptic modification as a function of the activities of pre- and postsynaptic neurons (Markram, Lübke, Frotscher, & Sakmann, 1997; Bliss & Collingridge, 1993). However, at this point we can only speculate whether this process can be controlled by the depolarization or the firing rate of the postsynaptic neuron. One possible model in which
such controls may be accommodated can be found in Fusi, Badoni, Salamon, and Amit (2000).

2.3 Multiple Classifiers. From the learning theory perspective, the approach we propose is closely related to current work on multiple classifiers. The idea of multiple randomized classifiers is gaining increasing attention in machine learning theory (Amit & Geman, 1997; Breiman, 1998, 1999; Dietterich, 2000; Amit, Blanchard, & Wilder, 1999). In the present context, each unit in layer A associated with a class c is a randomized perceptron classifying class c against the rest of the world, and the entire network is an aggregation of multiple randomized perceptrons. Population coding among a large number of such linear classifiers can produce very complex classification boundaries. It is important to note, however, that the family of possible perceptrons available in this context is very limited, due to the constraints imposed on the synaptic values and the use of uniform thresholds. A complementary idea is that of boosting, where multiple classifiers are created sequentially by up-weighting the misclassified data points in the training set. Indeed, in our context, to stabilize the results produced using the attractors in layer A and to improve performance, we train several nets sequentially, using a simplified version of the boosting proposed in Schapire, Freund, Bartlett, and Lee (1998) and Breiman (1998).

3 The Network
The network is composed of two layers: layer I for the visual input and a recurrent layer A where attractors are formed and population coding occurs. Since it is still unclear to what extent biological systems can exploit the fine details of neuronal firing rates or spike timing, we prefer to work with binary 0/1 neurons, essentially representing high and low firing rates. The I layer has binary units i = (x, e) for every location x in the image and every edge type e = 1, ..., 8, corresponding to eight coarse orientations. Orientation tuning is very coarse, so that these neurons can be viewed as having a flat tuning curve spanning a range of angles, as opposed to the classical bell-shaped tuning curves. The unit (x, e) is on if an edge of type e has been detected anywhere in a moderate-size neighborhood (between 3 × 3 and 10 × 10) of x. In other words, a detected edge is spread to the entire neighborhood. Thus, these units are the analogs of complex orientation-sensitive cells (Hubel, 1988). In Figure 1 we show an image of a 0 with detected edges and the units that are activated after spreading. We emphasize that recognition does not require careful normalization of the image to a fixed size. Observe, for example, the significant variations in size and other shape parameters of the samples in the bottom panel of Figure 10. Therefore, the spreading of the edge features is crucial for maintaining some invariance to deformations, and it significantly improves the generalization properties of any classifier based on oriented-edge inputs.
Figure 1: (Top, from left to right) Example of a deformed Latex 0; detected edges of angle 0, edges of angle 90, edges of angle 180, edges of angle 270 (note that the edges have a polarity). (Bottom, left to right) The same edges spread in a 5 × 5 neighborhood.
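The spreading step described above amounts to a binary dilation (a local MAX) of each orientation's edge map. A minimal sketch, with function name and border handling of our own choosing:

```python
import numpy as np

def spread_edges(edge_map, radius=2):
    """Spread detected edges to a (2*radius+1)^2 neighborhood (local MAX).

    edge_map: 2-D binary array for one of the eight edge orientations.
    Returns a binary array in which a unit is on if an edge of this
    type was detected anywhere in its neighborhood.
    """
    out = np.zeros_like(edge_map)
    ys, xs = np.nonzero(edge_map)
    for y, x in zip(ys, xs):
        # numpy slicing clips the upper bounds at the image border
        out[max(0, y - radius):y + radius + 1,
            max(0, x - radius):x + radius + 1] = 1
    return out
```

A 5 × 5 neighborhood, as in the figure, corresponds to radius=2.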
The attractor layer has a large number of neurons, NA (on the order of thousands), with randomly assigned recurrent connections allocated to each pair of neurons with probability pAA. The feedforward connections from I to A are also independently and randomly assigned, with probability pIA for each pair of neurons in I and A, respectively. This system of connections is described in Figure 2. Each neuron a ∈ A receives input of the form

ha = hIa + hAa = Σ_{i=0}^{NI} Eia ui + Σ_{a'=0}^{NA} Ea'a ua',    (3.1)
where ui, ua represent the states of the corresponding neurons, and Eia, Ea'a are the efficacies of the connecting synapses. The efficacy is assumed to be 0 if no synapse is present. The input ha is also called the local field of the neuron; it decomposes into the local field due to the feedforward input, hIa, and the local field due to the recurrent input, hAa. The neurons in A respond to their input field through a step function:

ua = H(ha − θ),    (3.2)

where

H(x) = 1 for x > 0,  H(x) = 0 for x ≤ 0.    (3.3)
All neurons are assumed to have the same firing threshold θ; they are all excitatory and are updated asynchronously. Inhibitory neurons could be
Figure 2: Overview of the network. There are two layers, denoted I and A. Units in layer A receive inputs from a random subset of layer I and from other units in layer A itself. Dashed arrows represent input from neurons in I, solid ones from neurons in A.
introduced in the attractor layer, perhaps to stabilize or to create competition among the attractors.

Each synapse is characterized by an internal state variable S. This state has two reflecting barriers, forcing it to stay in the range [0, 255]. The efficacy E is a deterministic function of the state. This function is the same for all synapses of the net:

E(S) = 0 for S ≤ L,
E(S) = ((S − L)/(H − L)) Jm for L < S < H,    (3.4)
E(S) = Jm for S ≥ H.

The parameter L defines the minimal state at which the synapse is activated, and H defines the state above which the synapse is saturated. Varying L and H produces a broad range of synapse types, from binary synapses when L = H to completely continuous ones in the allowed range (L = 0, H = 255). The parameter Jm represents the saturation value of the efficacy. A graphical explanation of these parameters is provided in Figure 3.

3.1 Learning. Given a presynaptic unit s and a postsynaptic unit t, the synaptic modification rule is of the form
ΔSst = C+(ht) us ut − C−(ht) us (1 − ut),    (3.5)
Figure 3: The transfer function converting the internal state of the synapse into its efficacy. The function is 0 below L and saturates at Jm above H; in between it is linear. 0 and 255 are reflecting barriers for the state. In this example, L = 50, H = 150, Jm = 10.
where

C+(ht) = Cp H(θ(1 + kp) − ht),    (3.6)
C−(ht) = Cd H(ht − θ(1 − kd)).    (3.7)
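As a concrete illustration, the transfer function of equation 3.4 and the field-dependent update of equations 3.5 through 3.7 can be sketched as follows (a minimal sketch; the function names and default parameter values are ours, chosen from the ranges quoted in the text):

```python
def efficacy(S, L=50, H=150, Jm=10.0):
    """Transfer from internal state S to efficacy E(S) (equation 3.4).

    Zero below L, linear between L and H, saturated at Jm above H.
    L == H gives a binary synapse; L = 0, H = 255 a continuous one.
    """
    if S <= L:
        return 0.0
    if S >= H:
        return Jm
    return (S - L) / (H - L) * Jm

def update_synapse(S, u_pre, u_post, h_post, theta,
                   Cp=4.0, Cd=1.0, kp=0.3, kd=1.0):
    """Field-dependent Hebbian update of the internal state S (eq. 3.5).

    Potentiation (pre and post both on) is blocked once the postsynaptic
    field exceeds theta*(1 + kp) (eq. 3.6); depression (pre on, post off)
    is blocked once the field is below theta*(1 - kd) (eq. 3.7).
    The barriers 0 and 255 keep S in its allowed range.
    """
    dS = 0.0
    if u_pre and u_post and h_post < theta * (1 + kp):
        dS += Cp
    if u_pre and not u_post and h_post > theta * (1 - kd):
        dS -= Cd
    return min(255.0, max(0.0, S + dS))
```

With Cp = 4, a synapse needs on the order of (H − L)/Cp = 25 coincident presentations to traverse the linear range, which illustrates how the internal state slows down synaptic modification.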
Typically, kp ranges between .2 and .5, and kd ≤ 1. In our experiments, Cd is fixed at 1, while Cp ranges from 4 to 10. In regular Hebbian learning, whenever both the pre- and postsynaptic neurons are on, the internal state increases, and whenever the presynaptic neuron is on and the postsynaptic neuron is off, the internal state decreases. In other words, the functions C+ and C− do not depend on the local field and are held constant at Cp and Cd, respectively. The effect of the proposed learning rule is to stop synaptic potentiation when the input to the postsynaptic neuron is too high (higher than θ(1 + kp)) and to stop synaptic depression when the input is too low (less than θ(1 − kd)). Of course, once a different input is presented, the field changes, and potentiation or depression may resume. The parameters kp (potentiation) and kd (depression) regulate to what degree the field is allowed to differ from the firing threshold before synaptic modification is stopped. These parameters may vary for different types of synapses, for example, the recurrent synapses versus the feedforward
ones. The parameters Cp and Cd regulate the amount of potentiation and depression applied to the synapse when the constraints on the field are satisfied. The transfer from internal synaptic state to synaptic efficacy needs to be computed at every step of learning in order to calculate the local field ha of the neuron in terms of the actual efficacies. This learning rule bears some similarities to the perceptron rule (Rosenblatt, 1962), the field-dependent rules proposed by Diederich and Opper (1987), and the penalized weighting in Oja (1989). However, it is implemented in the context of binary 0/1 neurons and nonnegative synapses. In some experiments, synapses have been constrained to be binary, with a small loss in performance. The main purpose of the field constraints on synaptic potentiation is to keep the average local field for any unit a ∈ Ac approximately the same during the presentation of an example from class c, irrespective of the specific distribution of feature probabilities for that class. This enables the use of a fixed threshold for all neurons in layer A. In contrast to regular Hebbian learning, this form of learning is not entirely local at the synaptic level, owing to the dependence on the field of the neuron. However, if continuous spiking units were employed, the field constraints could be reinterpreted as a firing-rate constraint. For example, when the postsynaptic neuron fires at a rate above some level that is significantly higher than the rate determined by the recurrent input alone, potentiation of any incoming synapses ceases. It should also be noted that the role of the internal synaptic state is to slow down synaptic modifications, similar to the idea of stochastic learning in Brunel et al. (1998) and Amit and Brunel (1995). A more detailed discussion of the properties of this learning rule is provided in section 4.

3.1.1 Learning the Recurrent Efficacies. We assume that the A layer receives noisy versions of K uncorrelated prototypes from some external source.
These are well-separated, "easy" stimuli, which are associated with the different visual classes. Although they are well separated, we still expect some noise in their activation. Specifically, we include each unit in A independently with probability pA in each of K subsets Ac. Each of these random subsets defines a prototype: all neurons in the subset Ac are on, and all in its complement are off. The subsets belonging to each class will be of different sizes, fluctuating around an average of pA·NA units with a standard deviation of √(pA(1 − pA)NA). A typical case for a small NIST classifier, as in section 5, would be NA = 200 and pA = .1, so that the number of units in each class will be 20 ± 4. Random overlaps (namely, units belonging to more than one subset) are significant. For example, in the case above, about 60% of the units of a subset will also belong to other subsets. Given a noise level f (say, 5–10%), a random stimulus from class c will have each neuron in Ac on with probability 1 − f and each neuron in the complement of Ac on with probability f. Thus, the sets Ac correspond to prototype patterns, and the actual stimuli are noisy versions of these prototypes. The recurrent synapses
Attractor Networks for Shape Recognition
Figure 4: (Left) Global view of layers I and A, emphasizing that the subsets in A are random and unordered. Each class is represented by a different symbol. (Right) Close-up showing the input features extracted in I, random connections between the layers, recurrent connections in A, and possible overlaps between subsets in A.
in A are modified by presenting random stimuli from the different classes in random order. Modification proceeds as prescribed in equation 3.5. After a sufficient number of presentations, this network develops attractors corresponding to the subsets A_c, with certain basins of attraction. Moreover, whenever the net is in one of these attractor states, the average field on each neuron in a subset A_c will be approximately h(1 + k_p). Note that the fluctuations in the size of the sets A_c typically reduce the maximum memory capacity of the network (see Brunel, 1994). We use subsets of several tens of neurons, so that p_A is very small. Figure 4 provides a graphic description of the way the different populations are organized. We reemphasize that direct local learning of attractors on the visual inputs of highly variable shapes is impossible due to the complex structure of the statistics of the input features: large overlaps exist between classes, and coding levels vary significantly.

3.1.2 Learning the Feedforward Efficacies. Once the attractors in layer A have been formed, learning of the feedforward connections can begin. Preceding the presentation of the visual input of a data point from class c to layer I, the corresponding external input (again with noise) is presented to A, and the associated attractor state emerges, with activity concentrated around the subset A_c. Learning then proceeds by updating the synaptic state variables S_ia between layers I and A according to the field learning rule.
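The construction of the random class subsets and the f-level noise model can be sketched as follows (a toy illustration under our own naming; `make_subsets` and `noisy_stimulus` are not from the article):

```python
import random

def make_subsets(n_units, n_classes, p_a, rng):
    """Each unit joins each class subset A_c independently with prob p_a."""
    return [{u for u in range(n_units) if rng.random() < p_a}
            for _ in range(n_classes)]

def noisy_stimulus(subset, n_units, f, rng):
    """Prototype units are on with prob 1 - f; all other units with prob f."""
    return [int((u in subset) != (rng.random() < f)) for u in range(n_units)]

rng = random.Random(0)
subsets = make_subsets(200, 10, 0.1, rng)              # N_A = 200, p_A = .1
stimulus = noisy_stimulus(subsets[0], 200, 0.05, rng)  # f = 5% noise
```

With these parameters each subset holds roughly 20 ± 4 units, matching the numbers quoted above.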
Yali Amit and Massimo Mascaro
Note that the field at a unit a ∈ A is composed of h^I_a and h^A_a. However, since we assume that at this stage of learning the A layer is in an attractor state, the field h^A_a is approximately h(1 + k_p), and we rewrite the learning rule for the feedforward synapses in terms of the feedforward local field h^I_a alone,
ΔS_ia = C^+(h^I_a) u_a u_i − C^−(h^I_a)(1 − u_a) u_i,  (3.8)

where a ∈ A has a connection from i ∈ I, and

C^+(h^I_a) = C_p H(h(1 + k_p) − h^I_a),  (3.9)
C^−(h^I_a) = C_d H(h^I_a − h(1 − k_d)),  (3.10)
using the same parameters k_p, k_d as for the recurrent synapses. The same rule can be written in terms of the entire field h_a if the value of k_p on the feedforward connections is changed to k_p′ = 2k_p + 1. Whereas for recurrent synapses potentiation stops when the local field is above h(1 + k_p), for feedforward synapses potentiation stops when the local field is above h(1 + k_p′) = h(2 + 2k_p).

3.2 Recognition Testing and the Role of the Attractors. After training, the feedforward and recurrent synapses are held fixed, and testing can be performed. A pattern is presented only at the input level I. The activity of the neurons in this layer creates direct activation in some of the neurons in layer A. First, it is possible to carry out a simple majority vote among the classes: how many neurons were activated in each of the subsets A_c. This yields a classification rule, which corresponds to a population code for the class. The architecture is then nothing but a large collection of randomized perceptrons, each trained to discriminate one class, or several classes (in the case where a unit in layer A happens to be selective to several classes), against the "rest of the world": all other classes merged together as one. Recall that synaptic connections from layer I to A are chosen randomly. For each pair i ∈ I, a ∈ A, the connection is present independently with probability p_IA (which may vary between 5% and 50%). This randomization ensures that units assigned to the same class in layer A produce different classifiers based on different subsets of features. The performance of this scheme is surprisingly good given the simplicity of the architecture and the training procedure. For example, on feedforward nets with 50,000 training and 50,000 test samples from the NIST database, we have observed a classification rate of 94%.
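The population vote just described can be sketched minimally as follows (our own illustration; the function name is ours):

```python
def majority_vote(activity, subsets):
    """Classify by counting active units in each class population A_c
    and returning the class whose subset is most active."""
    counts = [sum(activity[u] for u in s) for s in subsets]
    return max(range(len(subsets)), key=counts.__getitem__)

# Toy example: two classes over five units.
subsets = [{0, 1, 2}, {3, 4}]
activity = [1, 1, 0, 1, 0]               # direct activation from layer I
winner = majority_vote(activity, subsets)  # class 0: 2 active units vs. 1
```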
When recurrent dynamics are applied to layer A, initialized by the direct activation produced by the feedforward connections, we obtain convergence to an attractor state that serves as a mechanism for working memory. Moreover, the activity is concentrated on the recognized class alone. This is a nontrivial task, since the noise around the actual subset A_c produced
by the bottom-up input can be substantial and is not at all the same as the independent f-level noise used to train the attractors. The classification rates with the attractors on an individual net are therefore often lower than voting based on the direct activation from the feedforward connections. In some cases, the response to a given stimulus is lower than usual, and not enough neurons are activated in any of the classes to reach an attractor. Alternatively, a stimulus may produce a strong enough response in more than one class, producing a stable state composed of a union of attractors. In both cases, the outcome after the net converges to a stable state could be wrong even if the initial vote was correct. One more source of noise at the attractor level lies in the overlap between the populations A_c in the A layer.

3.3 Boosting. A popular method to improve the performance of any particular form of classification is boosting (see, e.g., Schapire et al., 1998). The idea is to produce a sequence of classifiers using the same architecture and the same data. At each stage, the classifiers produced so far are aggregated and produce an outcome based on voting among them. Furthermore, at each stage, those training data points that remain misclassified are up-weighted, and a new classifier is trained using the new weighting of the training set. In the extreme case, each new classifier is constructed only from the misclassified training points of the current aggregate. We employ this extreme version for simplicity. More precisely, assume k nets A1, …, Ak have been trained, each one with a separate attractor layer and synaptic connections to the same input layer I. The aggregate classifier of the k nets is the vote among all neurons in the k attractor layers. Only the misclassified training points of this aggregate net are used to train the (k + 1)st net.
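This extreme boosting loop can be sketched generically (a hedged sketch; `train_net` and `classify` stand in for the field-learning training procedure and the aggregate vote, which are not reproduced here):

```python
def boost(train_net, classify, data, labels, n_nets):
    """Extreme boosting: each new net is trained only on the points the
    current aggregate (unweighted vote over all nets) misclassifies."""
    nets = []
    pool = list(zip(data, labels))
    for _ in range(n_nets):
        if not pool:          # aggregate already classifies everything
            break
        nets.append(train_net(pool))
        # keep only the points the aggregate still gets wrong
        pool = [(x, y) for x, y in zip(data, labels)
                if classify(nets, x) != y]
    return nets
```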
After net A_{k+1} is trained, in terms of both the recurrent and the feedforward connections, it joins the pool of the k existing networks for voting. Note that voting is not weighted; all neurons in all attractor layers have an equal vote. Examples of classification with boosting are given in section 5.

4 Field Learning
In this section we study the properties of field learning on the feedforward connections in greater detail. Henceforth, the field refers only to the contribution of the input layer, h^I_a. Recall that each feature i = (x, e) is on if an edge of type e is present somewhere in a neighborhood of x. For each i and class c, let p_i,c denote the probability that u_i = 1 for images in class c, and q_i,c the probability that u_i = 1 for images not in class c. Let P(c), c = 1, …, K, denote the prior on class c, namely the proportions of the different classes, and let p̃_i,c = p_i,c P(c) and q̃_i,c = q_i,c(1 − P(c)). Finally, let

r_i,c = p̃_i,c / q̃_i,c.  (4.1)
Figure 5: pq scatter plots for classes 0, 1, and 7 in the NIST data set. Each point corresponds to one of the N_I features. The horizontal axis represents the on-class p-probability of the feature, and the vertical axis the off-class q-probability.
For the type of features we are employing in this context, and in many other real-world examples, the distribution of p_i,c, q_i,c, i = 1, …, N_I, is highly variable among different classes. In Figure 5 we show the two-dimensional scatter plots of the p versus q probabilities for classes 0, 1, and 7 from the NIST data set. Each point corresponds to a feature i = 1, …, N_I. Henceforth, we denote these scatter plots the pq distributions. Note, for example, that there are many features common to 0s and other classes such as 3s, 6s, or 8s, especially due to the spreading we implement on the precise edge locations. Thus, many features have high p_i,0 and q_i,0 probabilities. On the other hand, only a small number of features have high probability on 1s, but these typically also have a low probability on other classes. For simplicity, consider the units in layer A that belong to only one subset, A_c ⊂ A. Each is a perceptron trained to classify class c against the rest of the classes. We assume all neurons in layer A have the same firing threshold h. Imagine now performing Hebbian learning of the form suggested in Amit and Brunel (1995) or Brunel et al. (1998), where C_p, C_d do not depend on the field. Calculating the mean synaptic change, one easily gets, for a ∈ A_c,

⟨ΔS_ia⟩ = C_p p̃_i,c − C_d q̃_i,c,  (4.2)
so that, assuming a large number of training examples for each class, S_ia will move toward one of the two reflecting barriers, 0 or 255, according to whether p̃_i,c / q̃_i,c is less than or greater than the fixed threshold t = C_d / C_p. The resulting synaptic efficacies will then be close to either 0 or J_m, respectively. In any case, the final efficacy of each synapse will not depend at all on the pq distribution of the particular class. However, since these pq distributions vary significantly across different classes, using a fixed threshold t will be problematic. If t is high, some classes will not have a sufficient number of synapses potentiated, causing the neuron to receive a small total field. The corresponding neurons will rarely fire, leading to a large number of false negatives for that class. Lowering t will cause other classes to have too many synapses potentiated, and the
Figure 6: (Left) Hebbian learning. (Right) Field learning. In each panel, the top two histograms show the fields at neurons belonging to the correct class of that column (0 on the left, 7 on the right) when data points from the correct class are presented. The bottom row in each panel shows histograms of the fields at neurons belonging to classes 0 and 7 when data points from other classes are presented. The firing threshold was always 100.
corresponding neurons will fire too often, because their field will often be very high, leading to a large number of false positives. One solution could be to maintain Hebbian learning but adapt the threshold of the neurons of each class (and possibly of each neuron). This would require a very wide range of thresholds and a sophisticated threshold adaptation scheme (see the left panel of Figure 6). The field learning rule is an alternative, which uses field-dependent thresholds on synaptic potentiation, as in equation 3.8. The field is calculated in terms of the activated features in the current example and the current state of the synapses. Potentiation occurs only if the postsynaptic neuron is on, that is, the example is from the correct class, and the field is below h(1 + k_p). Depression occurs only if the postsynaptic neuron is off, that is, the example is from the wrong class, and the field is above h(1 − k_d). Hence, asymptotically in the number of training examples, the mean field at a neuron of class c, over the data points of that class, will be close to h(1 + k_p), irrespective of the particular type of pq distribution associated with that class. Furthermore, the mean field at that neuron, over data points outside class c, will be around h(1 − k_d).
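A minimal sketch of one such field-gated update for the feedforward synapses of a single neuron, following equations 3.8–3.10 (function and variable names are ours):

```python
def field_update(S, u_in, u_out, field, h, kp, kd, Cp, Cd, lo=0, hi=255):
    """One update of the internal synaptic states S: potentiate active
    synapses only when the neuron is on and the field is below h(1 + kp);
    depress them only when it is off and the field is above h(1 - kd)."""
    potentiate = u_out == 1 and field < h * (1 + kp)
    depress = u_out == 0 and field > h * (1 - kd)
    return [min(hi, s + Cp) if potentiate and x else
            max(lo, s - Cd) if depress and x else s
            for s, x in zip(S, u_in)]

# With h = 100 and kp = kd = 0.2, potentiation stops once the field
# reaches 120, and depression stops once it falls below 80.
S = field_update([0, 0], [1, 0], u_out=1, field=50,
                 h=100, kp=0.2, kd=0.2, Cp=4, Cd=1)
```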
This is illustrated in the two panels on the right of Figure 6, where we show the distribution of the field for a collection of neurons of classes 0 and 7 when the correct and incorrect classes are presented. The stability of the mean field across the different classes allows us to use a fixed threshold for all neurons. For comparison, on the left of Figure 6, we show the histograms of the fields when Hebbian learning is employed, all else remaining the same. Note the large variation in the location of these histograms, which excludes the use of a fixed threshold for classification. Of course, the distributions of the histograms around their means are very different and depend on the class, meaning that for some classes the individual perceptrons will be more powerful classifiers than for others. Note also that these histograms in no way represent the classification power of voting among the neurons in layer A. The field for a data point of class c is not constant over all neurons of that class; it may be below threshold in some and above threshold in others.

4.1 Field Learning and Feature Statistics. The question is, Which synapses does the learning tend to potentiate? For Hebbian learning, the answer is simple: all synapses connecting features for which r_i,c = p̃_i,c / q̃_i,c is greater than C_d / C_p. If there are many such synapses, due to the constraint on the field, not all of them can be potentiated when field-dependent learning is implemented; there is some form of competition over synaptic resources. On the other hand, if there are only a small number of such synapses, the learning rule will try to employ more features with a lower ratio. In the simplified case where no constraints are imposed on depression, that is, k_d = 1, the synapses connected to features with a higher r_i,c ratio have a much larger chance of being potentiated asymptotically in the number of presentations of training data. Those with higher p_i,c may get potentiated sooner, but asymptotically, synapses with low p_i,c but large r_i,c gradually get potentiated as well. To provide a better understanding of this phenomenon, we restrict ourselves to binary synaptic efficacies: the transfer function of equation 3.4 has 0 < L = H < 255. We generate a synthetic feature distribution consisting of three groups of features with the following on-class and off-class conditional probabilities: (1) 0.7 ≤ p ≤ 0.8, 0.1 ≤ q ≤ 0.2; (2) 0.1 ≤ p ≤ 0.5, q ≈ 0.05; and (3) 0 ≤ p ≤ 0.6, 0.4 ≤ q ≤ 0.8. The features are assumed conditionally independent given the class. The idea is that groups 1 and 2 contain informative features with a relatively high p/q ratio. However, in group 1 both p and q are high, and it is interesting to see how the learning chooses between these two sets. Group 3 is introduced as a control group to verify that uninformative features are not chosen. In Figure 7 we show which synapses the field learning has chosen to potentiate in the pq plane when k_d = 1 at two stages in the training process.
Figure 7: (Top, left to right) Features corresponding to potentiated synapses at two stages (epochs 1 and 10, left to right) of the training process for the synthetic data set (k_d = 1), overlaid on the pq scatter plot. (Bottom) Features selected by the synthetic learning rule. Horizontal axis: on-class p-probabilities. Vertical axis: off-class q-probabilities.
Note that initially features with large p_i and smaller r_i are potentiated, but gradually the features with higher r_i and lower p_i take over. The outcome of the potentiation rule can be synthetically mimicked as follows. Sort all features according to their r_i ratio. Let p_(i) denote the sorted on-class probabilities, in decreasing order of r. Pick features from the top of the list so that

Σ_{(i)=1}^{Q} J_m p_(i) ≈ h(1 + k_p).  (4.3)
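This greedy stand-in for the field learning rule can be sketched as follows (our own minimal rendering of equation 4.3):

```python
def synthetic_potentiation(p, r, Jm, h, kp):
    """Potentiate features in decreasing order of r until the expected
    on-class field, the sum of Jm * p over chosen features, reaches
    the potentiation ceiling h(1 + kp)."""
    chosen, total = [], 0.0
    for i in sorted(range(len(p)), key=lambda i: -r[i]):
        if total >= h * (1 + kp):
            break
        chosen.append(i)
        total += Jm * p[i]
    return chosen
```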
See Figure 7. The value of Q will depend on the pq statistics of the neuron one is considering. In fact, using this synthetic potentiation rule, we obtain very similar classification results on a small network applied to the NIST data set (see Table 1). From the purely statistical point of view, the conditional independence of the features allows us to provide an explicit formula for the Bayes classifier, which is linear in the inputs (see, e.g., Duda & Hart, 1973) and in fact employs all features. This linear classifier can be implemented with one
Table 1: Performance of a System of 500 Neurons with Field Learning and Binary Synapses.

Rule Type         Rate Observed
Field learning    88%
Synthetic rule    86%
perceptron with the appropriate weights and threshold. However, due to the assumption of a constant threshold and positive, bounded synaptic efficacies, it is not possible to obtain this optimal performance with any one of the individual perceptrons defined here. Another point to be emphasized is that the heuristic explanation provided here regarding the outcome of field learning is only in terms of the marginal distributions of the features in each class. In reality, the features are not conditionally independent, and certain correlations are very informative. It is not clear at this point whether field learning picks up any of this information. The situation when k_d < 1 is more complex. It appears that in this case there is some advantage for features with large p_i and lower r_i. Due to the limitation on depression, these synapses tend to persist in the potentiated state despite the relatively larger q_i parameters. Recall that depression depends on the magnitude of the field when the wrong class is presented. If the field is relatively low, no depression occurs, even if features corresponding to potentiated synapses have been activated by the current data point. Moreover, with less depression taking place, during the presentation of an example of the correct class the field will be high enough even without the low-p_i, high-r_i features, which disables potentiation and thus prevents their synapses from strengthening. In Figure 8 we show the potentiated features in the
Figure 8: (Left to right) Features corresponding to potentiated synapses at two stages in the training process for the synthetic data set (k_d = .2), overlaid on the pq scatter plot. Horizontal axis: on-class p-probabilities. Vertical axis: off-class q-probabilities.
Table 2: Summary of Parameters Used in the System.

J_m     Maximum value of the synaptic strength
L       Lower bound on the linear part of the efficacy
H       Upper bound on the linear part of the efficacy
k_p     (1 + k_p)h is the upper threshold for potentiation
k_d     (1 − k_d)h is the lower threshold for depression
C_p     Synaptic state increase following potentiation
C_d     Synaptic state decrease following depression
p_IA    Probability of existence of a synapse between a unit in layer I and a unit in layer A
p_AA    Probability of existence of a synapse between two units in layer A
p_A     Probability that a unit in A is selected for a subset A_c
N_A     Total number of neurons in each net in layer A
Spread  Spread of each feature in layer I
Iter    Learning iterations per data point
pq plot for k_d = .2 at two stages of the training process. Observe that even at the end, features with higher p_i and lower r_i remain potentiated.

5 Experimental Results
In Table 2 we provide a brief summary of the parameters introduced, to help in reading the experimental results in the following sections. In order to make the experiments more time efficient, we have reduced the size of the A layer to more or less accommodate the number of classes involved in the problem. Thus, for the NIST data set with 10 classes we used N_A = 200 neurons with p_A = .1, and for the Latex data set with 293 classes we used N_A = 6000 with p_A = 0.01. As mentioned in section 3, ultimately layer A should have tens of thousands of neurons to accommodate thousands of recognizable classes. Keeping the average number of neurons per class fixed, a larger A for a fixed number of classes can only improve classification results, because the subsets remain the same size and will therefore be virtually disjoint, further stabilizing the outcome of the attractors.

5.1 Training the Recurrent Layer A. The original description of the training procedure for the feedforward synapses was as follows. Some noisy version of the prototype A_c is presented to layer A. This layer converges to the attractor state concentrated around A_c, and finally a visual input from
Table 3: Base Parameters for the Parameter Scan.

L = 100    H = 100    h = 100    N_A = 100   Spread 3 × 3
Iter = 2   p_A = .1   C_d = 1    p_AA = 1
--------------------------------------------------------
k_p = .2   k_d = .2   C_p = 4    p_IA = .1   J_m = 10

Note: Only the parameters below the line are modified during the scan.
class c is presented to layer I, after which the feedforward synapses are updated. In the experiments below, we skip the first step, which involves running the dynamics in the A layer and is very time-consuming on large networks. If the noise level f is sufficiently low, we can assume that only units in A_c will be activated after the attractor dynamics converges. Thus, we directly turn on all units in A_c and keep the others off prior to presenting a visual stimulus from class c. This shortcut has no effect on the final results. If, however, the attractor dynamics is not implemented in layer A, so that the label activation is noisy during the training of the feedforward connections, performance decreases. It is clear that the attractors ensure a coherent internal representation of the labels during training. For the stable formation of attractors in the A layer, there is a critical noise level f_c above which the system destabilizes. This level depends on the size of the network, the size of the sets A_c, and other parameters. For the system used for NIST, f_c ≈ 0.2. For the 293-class Latex database with N_A = 6000, we found f_c to be around .05.

5.2 Parameter Stability. In order to test the stability of field learning on the feedforward connections with respect to the various parameters, we performed a fast survey with a single net. Here the goal was not to optimize the recognition rate or the performance of our system, but merely to check the robustness of our algorithm. The simulations were based on a simple net composed of only 100 neurons (on average, 10 per class), trained on 10,000 examples of handwritten digits taken from the NIST data set. The images in the NIST data set of handwritten digits are normalized to 16 × 16 gray-level images with a primitive form of slant correction, after which edges are extracted. We start with the system described by the parameters in Table 3. The recognition rate (89%) is quite high considering the size of the system.
We then vary one parameter at a time, trying two additional values. This does not provide a full sample of the parameter space, but since we are not interested in optimization, this is not crucial. In Table 4 we report the results. Note that despite the large variability, the final rates remain well above the random-choice rate of 10%. Moreover, even for parameters yielding low classification rates with a small number of neurons (100), it is possible to achieve a large improvement by increasing the number of neurons in layer A. In one experiment, the parameters were
Table 4: Recognition Rates for Parameter Scan on the NIST Data Set.

Parameter Name   First Change   Second Change
J_m              20.0 → 86%     5.0 → 74%
k_d              0.50 → 84%     1.0 → 51%
k_p              0.50 → 90%     1.0 → 82%
C_p              10.0 → 87%     2.0 → 89%
p_IA             .05 → 74%      0.2 → 91%

Note: Each cell contains the new parameter value and the resulting rate.
fixed at the worst rate in Table 4 (k_d = 1, all else the same), and the number of neurons was gradually increased. The results are reported in Table 5, where they are compared to those obtained with the best parameter set (p_IA = 0.2, all else the same). The point here is that in a system of thousands of neurons, the values of these quantities are not critical. The use of boosting further increases the achievable asymptotic rate and usually requires many fewer neurons (see section 5.4).

5.3 Attractors. After presentation, the visual stimulus is immediately removed, an initial state is produced in layer A by the feedforward connections, and the net in layer A evolves to a stable state. A random asynchronous updating scheme is used. The A layer has one attractor for each class, concentrated on the original subset selected for the class. However, since no inhibition is present in the system, the self-sustained state can involve a union of such subsets: multiple attractors can emerge simultaneously. Final classification is again done with a majority vote. However, now, after convergence, the state of the net will be "clean," as can be seen from Figure 9, where the residual activity for class 0 on NIST is shown before and after the net reaches a stable state. The residuals are the difference between the number of active neurons in the correct population and the number of active neurons in the most active of all other populations. The histogram is over all data points of class 0.
Table 5: Recognition Rate as a Function of the Number of Neurons for Two Parameter Sets.

Neurons   100   200   500   1000   2000   5000
Worst      51    63    69     75     81     85
Best       91    92    93     93     94     94
Note: The parameter sets are based on those of Table 3 but with k_d = 1 (Worst) and p_IA = 0.2 (Best).
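The residual statistic plotted in Figure 9 can be expressed compactly (a sketch with our own naming):

```python
def residual(activity, subsets, c):
    """Active neurons in the correct population A_c minus the activity of
    the most active competing population; positive means class c wins."""
    correct = sum(activity[u] for u in subsets[c])
    best_other = max(sum(activity[u] for u in s)
                     for j, s in enumerate(subsets) if j != c)
    return correct - best_other
```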
Figure 9: (Left panels) Histogram of the residuals of activity of a network composed of 200 neurons (here, for simplicity, the subsets are forced to be disjoint) for class 0 on NIST. The upper graph shows the preattractor situation and the lower graph the postattractor situation. The total number of examples presented for this class was 5000, and the overall performance of the net was 90.1% without attractors and 85.1% with attractors. (Right panels) Same data for five nets produced with boosting. Classification rate: 93.7% without attractors and 90.3% with attractors.
Note that most of the examples that generated a positive residual from the feedforward connections converged to the attractor state. This is evident in the high concentration at the peak residual of 20 neurons, precisely the size of the population in this example. Those examples with a low initial residual end up with a 0 residual at the stable state, causing a decrease in the classification rate with respect to the preattractor situation, for example, from 90.1% to 85.1% on one net for the NIST data set. Note that a 0 residual is reached either when none of the neurons is active at the stable state or when two attractors are activated simultaneously. The situation with more than one net is somewhat more complicated. In the final state, one might have only the attractors for the right class in all the nets, or all the attractors of some other class, or anything in between. This is in part captured in the right panel of Figure 9, where five nets have been trained with boosting. The histogram is of the aggregated residuals. The final rate of this system is 93.7% without attractors and 90.3% with attractors, showing that the gap between the two is greatly reduced. The reason is that it is improbable that all of the nets have a small residual after
Table 6: NIST Experiments.

a. Parameters

L = 0        H = 120      h = 100     N_A = 200    Spread 3 × 3
Iter = 2.0   J_m = 10.0   k_p = 0.2   k_d = 0.2    C_p = 4.0
C_d = 1.0    p_IA = 0.1   p_A = 0.1   p_AA = 1.0

b. Results

Number of Nets                1     5     10    20    50
Train rate, no attractors    92.7  95.4  97.2  98.9  99.2
Test rate, no attractors     89.2  92.9  95.5  97.4  97.6
Test rate, with attractors   84.1  89.4  94.2  96.7  97.4

Notes: (a) Parameter set for each net of the experiment on NIST described in section 5.4. The notation used is explained in section 2. (b) Recognition rates as a function of the number of nets on the training and test sets without attractors, and on the test set with attractors.
the presentation of a certain stimulus. As the number of nets increases, the gap decreases further; see Table 6.

5.4 NIST. The maximum performance on the NIST data set has been obtained using an ensemble of nets, each with the parameters given in Table 6. Training was done using the boosting procedure described in section 3.3 to improve the performance of a single net. The nets were trained on 50,000 examples and tested on another 50,000. In Table 6 we show the recognition rate as a function of the number of nets, on both the training set and the test set. Although performance with attractors is lower than without, the gap between the two tends to decrease with the number of nets.

5.5 Latex. Ultimately, one would want to be able to learn a new object sequentially by simply selecting a new random subset for that object and updating the synaptic weights in some fashion, using presentations of examples from the new object and perhaps some refreshing of examples from previously learned objects. Here we test only the scalability of the system: Can hundreds of classes be learned within the same framework of an attractor layer with several thousand neurons and the same range of parameters? We work with a data set of 293 synthetically deformed Latex symbols, as in Amit and Geman (1997); there are 50 samples per class in the training set and 32 samples per class in the test set. In Figure 10 we show the original 293 prototypes alongside a random sample of deformed ones. Below that we show 32 test samples of the 2 class,
Figure 10: (Left) Prototypes of Latex figures. (Right) Random sample of deformed symbols. (Bottom) Sample of deformed symbols of class 2.
where significant variations are evident. Since the data are not normalized or registered, the spreading of the edge inputs is essential. The images are 32 × 32 and more or less centered. The parameters employed in this experiment and the classification rates as a function of the number of nets
Table 7: LATEX Experiments.

a. Parameters

L = 0         H = 120.0     h = 100.0    N_A = 6000   Spread 7 × 7
Iter = 10.0   J_m = 10.0    k_p = 0.5    k_d = 0.2    C_p = 10.0
C_d = 1.0     p_IA = 0.05   p_A = 0.01   p_AA = 1.0

b. Results (293 classes, N_A = 6000)

Number of Nets               1     5     10    15
Train rate, no attractors   60.2  91.0  94.2  94.6
Test rate, no attractors    42.4  74.9  78.6  79.1
Train rate, attractors      52.3  76.6  84.3  88.6
Test rate, attractors       37.7  57.9  63.2  68.5

Notes: (a) Parameters used in the LATEX experiment. (b) Results with N_A = 6000 and p_A = 0.01.
are presented in Table 7. Note that the spread parameter is larger due to the larger images. The main modification with respect to the NIST experiment is in C_p. This has to do with the fact that the ratio p̃_i / q̃_i is now much smaller, because P(c) = 0.003 as opposed to 0.1 in the case of NIST. This is a rather minor adjustment; indeed, the experiments with the original C_p produce comparable classification rates. We show the outcome for the 293-class problem with up to 15 nets. With one net of 6000 neurons, we achieve 52% before attractors and 37% with attractors. This is definitely a respectable classification rate for one classifier on a 293-class problem. With 15 nets, we achieve 79.1% without attractors and 68.5% with attractors. Note that using multiple decision trees on the same input, we can achieve classification rates of 91%. Although we have not achieved nearly the same rates, it is clear that the system does not collapse in the presence of hundreds of classes. One clear advantage of decision trees over these networks is the use of "negative" information: features that are not expected to be present in a certain region of a character of a specific class. In the terminology of section 4, these are features with very low p and moderate or high q. There is extensive information in such features, which is entirely ignored in the current setting.

6 Discussion
We have shown that a simple system of thousands of perceptrons with coarse-oriented edge features as input is able to recognize images of characters, even in a context with hundreds of classes. The perceptrons have randomized connections to the input layer and among themselves. They are highly constrained as classifiers in that the synaptic efficacies are positive and essentially binary. Moreover, all perceptrons have the same threshold. There is no need for "grandmother cells"; each new class is represented by a small, random subset of the population of perceptrons. Classification manifests itself through a stable state of the dynamics of the network of perceptrons concentrated on the population representing a specific class. The attractors in the A layer serve three distinct purposes. The first is during learning, where they maintain a stable activity of a specific subset of the A layer corresponding to the class of the presented data point. The second is cleaning up the activity in the A layer after recognition, and the third is maintaining a sustained activity at the correct attractor, thus providing a mechanism for working memory. Learning is performed using a modified Hebbian rule, which is shown to be very robust to parameter changes and has a simple interpretation in terms of the statistics of the individual input features on and off class. The learning rule is not entirely local on the synapse, although it is local on each neuron, since potentiation and depression are suspended as a function of the field. We believe that in a system with continuous time dynamics and integrate-and-fire neurons, this suspension can depend on the firing rate of
1440
Yali Amit and Massimo Mascaro
the neuron. Indeed, in future work, we intend to study the implementation of this system in terms of real-time dynamics. In the context of this article, boosting has allowed us to improve classification rates. It is very sensible to have a procedure in which more resources are used to train on more difficult examples; however, here this is implemented in an artificial manner. One can imagine boosting evolving naturally in the framework of a predetermined number of several interconnected recurrent A networks, all receiving feedforward input from the I layer. The basic idea would be that if in one network a visual input of class c produces the "correct" activity in the A layer, the field on the units in the Ac subsets in other nets is strong enough so that no potentiation occurs on synapses connecting to these units. On the other hand, if the visual input produces incorrect activity, the local field on the other Ac subsets is smaller, and potentiation of feedforward connections can occur. It is of interest to develop methods for sequential learning. A new random subset is selected in the attractor layer, and synapses are modified using examples of the new object, while maintaining the existing recognition capabilities of the system through some refreshing mechanism. It is also important to investigate the possibility that this learning mechanism, or a variation thereof, can learn correlations among features. Up to this point, features have been selected essentially according to their marginal probabilities. There is abundant information in feature conjunctions, which does not seem to be exploited in the current set-up. Finally, it will be of interest to investigate various feedback mechanisms from the A layer to the input I layer that would produce template-type visual inputs to replace the original noisy input.

Acknowledgments
We thank Daniel Amit, Stefano Fusi, and Paolo Del Giudice for numerous insightful suggestions. Y. A. was supported in part by the Army Research Office under grant DAAH04-96-1-0061 and MURI grant DAAH04-96-1-0445.
Received November 9, 1999; accepted September 8, 2000.
ARTICLE
Communicated by Vladimir Vapnik
Estimating the Support of a High-Dimensional Distribution

Bernhard Schölkopf
Microsoft Research Ltd, Cambridge CB2 3NH, U.K.

John C. Platt
Microsoft Research, Redmond, WA 98052, U.S.A.

John Shawe-Taylor
Royal Holloway, University of London, Egham, Surrey TW20 0EX, U.K.

Alex J. Smola
Robert C. Williamson
Department of Engineering, Australian National University, Canberra 0200, Australia
Neural Computation 13, 1443–1471 (2001) © 2001 Massachusetts Institute of Technology

Suppose you are given some data set drawn from an underlying probability distribution P and you want to estimate a "simple" subset S of input space such that the probability that a test point drawn from P lies outside of S equals some a priori specified value between 0 and 1. We propose a method to approach this problem by trying to estimate a function f that is positive on S and negative on the complement. The functional form of f is given by a kernel expansion in terms of a potentially small subset of the training data; it is regularized by controlling the length of the weight vector in an associated feature space. The expansion coefficients are found by solving a quadratic programming problem, which we do by carrying out sequential optimization over pairs of input patterns. We also provide a theoretical analysis of the statistical performance of our algorithm. The algorithm is a natural extension of the support vector algorithm to the case of unlabeled data.

1 Introduction

During recent years, a new set of kernel techniques for supervised learning has been developed (Vapnik, 1995; Schölkopf, Burges, & Smola, 1999). Specifically, support vector (SV) algorithms for pattern recognition, regression estimation, and solution of inverse problems have received considerable attention. There have been a few attempts to transfer the idea of using kernels to compute inner products in feature spaces to the domain of unsupervised learning. The problems in that domain are, however, less precisely specified. Generally, they can be characterized as estimating functions of the data, which reveal something interesting about the underlying distributions. For instance, kernel principal component analysis (PCA) can be characterized as computing functions that on the training data produce unit variance outputs while having minimum norm in feature space (Schölkopf, Smola, & Müller, 1999). Another kernel-based unsupervised learning technique, regularized principal manifolds (Smola, Mika, Schölkopf, & Williamson, in press), computes functions that give a mapping onto a lower-dimensional manifold minimizing a regularized quantization error. Clustering algorithms are further examples of unsupervised learning techniques that can be kernelized (Schölkopf, Smola, & Müller, 1999). An extreme point of view is that unsupervised learning is about estimating densities. Clearly, knowledge of the density of P would then allow us to solve whatever problem can be solved on the basis of the data. The work presented here addresses an easier problem: it proposes an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero (Schölkopf, Williamson, Smola, & Shawe-Taylor, 1999). In doing so, it is in line with Vapnik's principle never to solve a problem that is more general than the one we actually need to solve. Moreover, it is also applicable in cases where the density of the data's distribution is not even well defined, for example, if there are singular components. After a review of some previous work in section 2, we propose SV algorithms for the considered problem in section 3. Section 4 gives details on the implementation of the optimization procedure, followed by theoretical results characterizing the present approach. In section 6, we apply the algorithm to artificial as well as real-world data.
We conclude with a discussion.

2 Previous Work

In order to describe some previous work, it is convenient to introduce the following definition of a (multidimensional) quantile function, introduced by Einmahl and Mason (1992). Let x1, . . . , xℓ be independently and identically distributed (i.i.d.) random variables in a set X with distribution P. Let C be a class of measurable subsets of X, and let λ be a real-valued function defined on C. The quantile function with respect to (P, λ, C) is

U(α) = inf{λ(C): P(C) ≥ α, C ∈ C},   0 < α ≤ 1.

In the special case where P is the empirical distribution (Pℓ(C) := (1/ℓ) Σ_{i=1}^{ℓ} 1_C(xi)), we obtain the empirical quantile function. We denote by C(α) and Cℓ(α) the (not necessarily unique) C ∈ C that attains the infimum (when it is achievable). The most common choice of λ is Lebesgue measure, in which case C(α) is the minimum volume C ∈ C that contains at least a fraction α
of the probability mass. Estimators of the form Cℓ(α) are called minimum volume estimators. Observe that for C being all Borel measurable sets, C(1) is the support of the density p corresponding to P, assuming it exists. (Note that C(1) is well defined even when p does not exist.) For smaller classes C, C(1) is the minimum volume C ∈ C containing the support of p.
Turning to the case where α < 1, it seems the first work was reported by Sager (1979) and then Hartigan (1987), who considered X = R² with C being the class of closed convex sets in X. (They actually considered density contour clusters; cf. appendix A for a definition.) Nolan (1991) considered higher dimensions with C being the class of ellipsoids. Tsybakov (1997) has studied an estimator based on piecewise polynomial approximation of C(α) and has shown it attains the asymptotically minimax rate for certain classes of densities p. Polonik (1997) has studied the estimation of C(α) by Cℓ(α). He derived asymptotic rates of convergence in terms of various measures of richness of C. He considered both VC classes and classes with a log ε-covering number with bracketing of order O(ε^(−r)) for r > 0. More information on minimum volume estimators can be found in that work and in appendix A.
A number of applications have been suggested for these techniques. They include problems in medical diagnosis (Tarassenko, Hayton, Cerneaz, & Brady, 1995), marketing (Ben-David & Lindenbaum, 1997), condition monitoring of machines (Devroye & Wise, 1980), estimating manufacturing yields (Stoneking, 1999), econometrics and generalized nonlinear principal curves (Tsybakov, 1997; Korostelev & Tsybakov, 1993), regression and spectral analysis (Polonik, 1997), tests for multimodality and clustering (Polonik, 1995b), and others (Müller, 1992). Most of this work, in particular that with a theoretical slant, does not go all the way in devising practical algorithms that work on high-dimensional real-world problems.
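To make the notion of a minimum volume estimator concrete, here is a small illustration (not from the paper) of the simplest setting: X = R, C the class of closed intervals, and λ = Lebesgue measure (interval length). The empirical estimator Cℓ(α) is then the shortest interval containing at least a fraction α of the sample; the function name and data below are hypothetical.

```python
import math

def min_volume_interval(xs, alpha):
    """Empirical minimum volume estimator C_l(alpha) for C = 1-D intervals:
    the shortest interval covering at least a fraction alpha of the sample."""
    xs = sorted(xs)
    l = len(xs)
    k = math.ceil(alpha * l)  # number of points the interval must contain
    # every candidate optimum spans k consecutive order statistics
    best = min(range(l - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

# a tight cluster plus one far-away point: the alpha = 0.8 interval
# ignores the outlier
a, b = min_volume_interval([0.0, 1.0, 2.0, 3.0, 100.0], 0.8)
```

The empirical quantile function U(α) is then simply the length b − a of the returned interval.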
A notable exception to this is the work of Tarassenko et al. (1995). Polonik (1995a) has shown how one can use estimators of C(α) to construct density estimators. The point of doing this is that it allows one to encode a range of prior assumptions about the true density p that would be impossible to do within the traditional density estimation framework. He has shown asymptotic consistency and rates of convergence for densities belonging to VC-classes or with a known rate of growth of metric entropy with bracketing.
Let us conclude this section with a short discussion of how the work presented here relates to the above. This article describes an algorithm that finds regions close to C(α). Our class C is defined implicitly via a kernel k as the set of half-spaces in an SV feature space. We do not try to minimize the volume of C in input space. Instead, we minimize an SV-style regularizer that, using a kernel, controls the smoothness of the estimated function describing C. In terms of multidimensional quantiles, our approach can be thought of as employing λ(Cw) = ‖w‖², where Cw = {x: fw(x) ≥ ρ}. Here,
(w, ρ) are a weight vector and an offset parameterizing a hyperplane in the feature space associated with the kernel.
The main contribution of this work is that we propose an algorithm that has tractable computational complexity, even in high-dimensional cases. Our theory, which uses tools very similar to those used by Polonik, gives results that we expect will be of more use in a finite sample size setting.

3 Algorithms

We first introduce terminology and notation conventions. We consider training data

x1, . . . , xℓ ∈ X,  (3.1)

where ℓ ∈ N is the number of observations and X is some set. For simplicity, we think of it as a compact subset of R^N. Let Φ be a feature map X → F, that is, a map into an inner product space F such that the inner product in the image of Φ can be computed by evaluating some simple kernel (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995; Schölkopf, Burges, et al., 1999),

k(x, y) = (Φ(x) · Φ(y)),  (3.2)

such as the gaussian kernel

k(x, y) = exp(−‖x − y‖²/c).  (3.3)

Indices i and j are understood to range over 1, . . . , ℓ (in compact notation: i, j ∈ [ℓ]). Boldface Greek letters denote ℓ-dimensional vectors whose components are labeled using a normal typeface.
In the remainder of this section, we develop an algorithm that returns a function f that takes the value +1 in a "small" region capturing most of the data points and −1 elsewhere. Our strategy is to map the data into the feature space corresponding to the kernel and to separate them from the origin with maximum margin. For a new point x, the value f(x) is determined by evaluating which side of the hyperplane it falls on in feature space. Via the freedom to use different types of kernel functions, this simple geometric picture corresponds to a variety of nonlinear estimators in input space.
To separate the data set from the origin, we solve the following quadratic program:

min_{w ∈ F, ξ ∈ R^ℓ, ρ ∈ R}   (1/2)‖w‖² + (1/(νℓ)) Σi ξi − ρ  (3.4)

subject to   (w · Φ(xi)) ≥ ρ − ξi,   ξi ≥ 0.  (3.5)

Here, ν ∈ (0, 1] is a parameter whose meaning will become clear later.
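For concreteness, the gaussian kernel of equation 3.3 can be sketched in a few lines (an illustration, not code from the paper; the sample points are made up). Note two properties used later in section 5: all Gram matrix entries are positive, and k(x, x) = 1 for every x.

```python
import math

def gauss_kernel(x, y, c=2.0):
    """Gaussian kernel of equation 3.3: k(x, y) = exp(-||x - y||^2 / c)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / c)

# hypothetical sample points in R^2
points = [(0.0, 0.0), (1.0, 0.5), (4.0, 4.0)]
# Gram matrix K_ij = k(x_i, x_j)
K = [[gauss_kernel(p, q) for q in points] for p in points]
```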
Since nonzero slack variables ξi are penalized in the objective function, we can expect that if w and ρ solve this problem, then the decision function

f(x) = sgn((w · Φ(x)) − ρ)  (3.6)

will be positive for most examples xi contained in the training set,¹ while the SV type regularization term ‖w‖ will still be small. The actual trade-off between these two goals is controlled by ν. Using multipliers αi, βi ≥ 0, we introduce a Lagrangian

L(w, ξ, ρ, α, β) = (1/2)‖w‖² + (1/(νℓ)) Σi ξi − ρ − Σi αi((w · Φ(xi)) − ρ + ξi) − Σi βi ξi,  (3.7)

and set the derivatives with respect to the primal variables w, ξ, ρ equal to zero, yielding

w = Σi αi Φ(xi),  (3.8)

αi = 1/(νℓ) − βi ≤ 1/(νℓ),   Σi αi = 1.  (3.9)

In equation 3.8, all patterns {xi: i ∈ [ℓ], αi > 0} are called support vectors. Together with equation 3.2, the SV expansion transforms the decision function, equation 3.6, into a kernel expansion:

f(x) = sgn(Σi αi k(xi, x) − ρ).  (3.10)

Substituting equation 3.8 and equation 3.9 into L (see equation 3.7) and using equation 3.2, we obtain the dual problem:

min_α  (1/2) Σij αi αj k(xi, xj)   subject to   0 ≤ αi ≤ 1/(νℓ),   Σi αi = 1.  (3.11)

One can show that at the optimum, the two inequality constraints, equation 3.5, become equalities if αi and βi are nonzero, that is, if 0 < αi < 1/(νℓ). Therefore, we can recover ρ by exploiting that for any such αi, the corresponding pattern xi satisfies

ρ = (w · Φ(xi)) = Σj αj k(xj, xi).  (3.12)

¹ We use the convention that sgn(z) equals 1 for z ≥ 0 and −1 otherwise.
Note that if ν approaches 0, the upper boundaries on the Lagrange multipliers tend to infinity, that is, the second inequality constraint in equation 3.11 becomes void. The problem then resembles the corresponding hard margin algorithm, since the penalization of errors becomes infinite, as can be seen from the primal objective function (see equation 3.4). It is still a feasible problem, since we have placed no restriction on the offset ρ, so it can become a large negative number in order to satisfy equation 3.5. If we had required ρ ≥ 0 from the start, we would have ended up with the constraint Σi αi ≥ 1 instead of the corresponding equality constraint in equation 3.11, and the multipliers αi could have diverged.
It is instructive to compare equation 3.11 to a Parzen windows estimator. To this end, suppose we use a kernel that can be normalized as a density in input space, such as the gaussian (see equation 3.3). If we use ν = 1, then the two constraints only allow the solution α1 = · · · = αℓ = 1/ℓ. Thus the kernel expansion in equation 3.10 reduces to a Parzen windows estimate of the underlying density. For ν < 1, the equality constraint in equation 3.11 still ensures that the decision function is a thresholded density; however, in that case, the density will be represented only by a subset of training examples (the SVs), those that are important for the decision (see equation 3.10) to be taken. Section 5 will explain the precise meaning of ν.
To conclude this section, we note that one can also use balls to describe the data in feature space, close in spirit to the algorithms of Schölkopf, Burges, and Vapnik (1995), with hard boundaries, and Tax and Duin (1999), with "soft margins." Again, we try to put most of the data into a small ball by solving, for ν ∈ (0, 1),

min_{R ∈ R, ξ ∈ R^ℓ, c ∈ F}   R² + (1/(νℓ)) Σi ξi  (3.13)

subject to   ‖Φ(xi) − c‖² ≤ R² + ξi,   ξi ≥ 0 for i ∈ [ℓ].

This leads to the dual

min_α   Σij αi αj k(xi, xj) − Σi αi k(xi, xi)  (3.14)

subject to   0 ≤ αi ≤ 1/(νℓ),   Σi αi = 1,  (3.15)

and the solution

c = Σi αi Φ(xi),  (3.16)

corresponding to a decision function of the form

f(x) = sgn(R² − Σij αi αj k(xi, xj) + 2 Σi αi k(xi, x) − k(x, x)).  (3.17)
Similar to the above, R² is computed such that for any xi with 0 < αi < 1/(νℓ), the argument of the sgn is zero. For kernels k(x, y) that depend only on x − y, k(x, x) is constant. In this case, the equality constraint implies that the linear term in the dual target function is constant, and the problem, equations 3.14 and 3.15, turns out to be equivalent to equation 3.11. It can be shown that the same holds true for the decision function; hence, the two algorithms coincide in that case. This is geometrically plausible. For constant k(x, x), all mapped patterns lie on a sphere in feature space. Therefore, finding the smallest sphere (containing the points) really amounts to finding the smallest segment of the sphere that the data live on. The segment, however, can be found in a straightforward way by simply intersecting the data sphere with a hyperplane; the hyperplane with maximum margin of separation to the origin will cut off the smallest segment.
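The Parzen windows connection noted in this section is easy to check numerically. The sketch below (hypothetical 1-D data, not from the paper) sets ν = 1, so the constraints of equation 3.11 force αi = 1/ℓ, and the kernel expansion of equation 3.10 becomes an unnormalized Parzen windows density estimate.

```python
import math

def gauss_kernel(x, y, c=1.0):
    return math.exp(-((x - y) ** 2) / c)

data = [0.0, 0.2, 0.4, 3.0]
l = len(data)
alpha = [1.0 / l] * l  # the only feasible dual point when nu = 1

def expansion(x):
    """Kernel expansion sum_i alpha_i k(x_i, x) of equation 3.10 (before
    thresholding by rho); for nu = 1 this is a Parzen windows estimate."""
    return sum(a * gauss_kernel(xi, x) for a, xi in zip(alpha, data))
```

As expected of a density estimate, `expansion` takes larger values near the cluster {0.0, 0.2, 0.4} than near the isolated point 3.0, so thresholding at a suitable ρ carves out the high-density region.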
4 Optimization

Section 3 formulated quadratic programs (QPs) for computing regions that capture a certain fraction of the data. These constrained optimization problems can be solved using an off-the-shelf QP package to compute the solution. They do, however, possess features that set them apart from generic QPs, most notably the simplicity of the constraints. In this section, we describe an algorithm that takes advantage of these features and empirically scales better to large data set sizes than a standard QP solver with time complexity of order O(ℓ³) (cf. Platt, 1999). The algorithm is a modified version of SMO (sequential minimal optimization), an SV training algorithm originally proposed for classification (Platt, 1999), and subsequently adapted to regression estimation (Smola & Schölkopf, in press). The strategy of SMO is to break up the constrained minimization of equation 3.11 into the smallest optimization steps possible. Due to the constraint on the sum of the dual variables, it is impossible to modify individual variables separately without possibly violating the constraint. We therefore resort to optimizing over pairs of variables.
4.1 Elementary Optimization Step. For instance, consider optimizing over α1 and α2 with all other variables fixed. Using the shorthand Kij := k(xi, xj), equation 3.11 then reduces to

min_{α1, α2}   (1/2) Σ_{i,j=1}^{2} αi αj Kij + Σ_{i=1}^{2} αi Ci + C,  (4.1)
with Ci := Σ_{j=3}^{ℓ} αj Kij and C := Σ_{i,j=3}^{ℓ} αi αj Kij, subject to

0 ≤ α1, α2 ≤ 1/(νℓ),   Σ_{i=1}^{2} αi = Δ,  (4.2)

where Δ := 1 − Σ_{i=3}^{ℓ} αi. We discard C, which is independent of α1 and α2, and eliminate α1 to obtain

min_{α2}   (1/2)(Δ − α2)² K11 + (Δ − α2) α2 K12 + (1/2) α2² K22 + (Δ − α2) C1 + α2 C2,  (4.3)

with the derivative

−(Δ − α2) K11 + (Δ − 2α2) K12 + α2 K22 − C1 + C2.  (4.4)

Setting this to zero and solving for α2, we get

α2 = (Δ(K11 − K12) + C1 − C2) / (K11 + K22 − 2K12).  (4.5)

Once α2 is found, α1 can be recovered from α1 = Δ − α2. If the new point (α1, α2) is outside of [0, 1/(νℓ)], the constrained optimum is found by projecting α2 from equation 4.5 into the region allowed by the constraints, and then recomputing α1. The offset ρ is recomputed after every such step. Additional insight can be obtained by rewriting equation 4.5 in terms of the outputs of the kernel expansion on the examples x1 and x2 before the optimization step. Let α1*, α2* denote the values of their Lagrange parameters before the step. Then the corresponding outputs (cf. equation 3.10) read

Oi := K1i α1* + K2i α2* + Ci.  (4.6)

Using the latter to eliminate the Ci, we end up with an update equation for α2, which does not explicitly depend on α1*:

α2 = α2* + (O1 − O2) / (K11 + K22 − 2K12),  (4.7)

which shows that the update is essentially the fraction of first and second derivative of the objective function along the direction of ν-constraint satisfaction.
Clearly, the same elementary optimization step can be applied to any pair of two variables, not just α1 and α2. We next briefly describe how to do the overall optimization.
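The elementary step of equations 4.2 to 4.5 can be sketched as follows (an illustrative implementation, not the authors' code; K is assumed to be a precomputed Gram matrix and ub stands for the upper bound 1/(νℓ)). The unconstrained optimum from equation 4.5 is clipped so that both variables stay in [0, 1/(νℓ)].

```python
def pairwise_step(K, alpha, i, j, ub):
    """One elementary optimization step on (alpha_i, alpha_j) of the dual
    in equation 3.11, all other variables fixed; ub = 1/(nu*l).
    Illustrative sketch of equations 4.2 to 4.5, not the authors' code."""
    l = len(alpha)
    delta = alpha[i] + alpha[j]  # Delta of equation 4.2
    Ci = sum(alpha[n] * K[i][n] for n in range(l) if n not in (i, j))
    Cj = sum(alpha[n] * K[j][n] for n in range(l) if n not in (i, j))
    denom = K[i][i] + K[j][j] - 2.0 * K[i][j]
    if denom <= 0.0:
        return  # degenerate direction; leave the pair unchanged
    # unconstrained optimum for alpha_j (equation 4.5, with 1 -> i, 2 -> j)
    new_j = (delta * (K[i][i] - K[i][j]) + Ci - Cj) / denom
    # project into the box so that alpha_i = delta - alpha_j is feasible too
    new_j = min(max(new_j, max(0.0, delta - ub)), min(ub, delta))
    alpha[j] = new_j
    alpha[i] = delta - new_j
```

Each call leaves Σi αi unchanged and cannot increase the dual objective (1/2) Σij αi αj Kij, since the chosen pair is moved to its constrained minimizer.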
4.2 Initialization of the Algorithm. We start by setting a fraction ν of all αi, randomly chosen, to 1/(νℓ). If νℓ is not an integer, then one of the examples is set to a value in (0, 1/(νℓ)) to ensure that Σi αi = 1. Moreover, we set the initial ρ to max{Oi: i ∈ [ℓ], αi > 0}.

4.3 Optimization Algorithm. We then select a first variable for the elementary optimization step in one of the two following ways. Here, we use the shorthand SVnb for the indices of variables that are not at bound, that is, SVnb := {i: i ∈ [ℓ], 0 < αi < 1/(νℓ)}. At the end, these correspond to points that will sit exactly on the hyperplane and that will therefore have a strong influence on its precise position. We scan over the entire data set² until we find a variable violating a Karush-Kuhn-Tucker (KKT) condition (Bertsekas, 1995), that is, a point such that (Oi − ρ) · αi > 0 or (ρ − Oi) · (1/(νℓ) − αi) > 0. Once we have found one, say αi, we pick αj according to

j = arg max_{n ∈ SVnb} |Oi − On|.  (4.8)
We repeat that step, but the scan is performed only over SVnb. In practice, one scan of the first type is followed by multiple scans of the second type, until there are no KKT violators in SVnb, whereupon the optimization goes back to a single scan of the first type. If it finds no KKT violators, the optimization terminates. In unusual circumstances, the choice heuristic, equation 4.8, cannot make positive progress. Therefore, a hierarchy of other choice heuristics is applied to ensure positive progress. These other heuristics are the same as in the case of pattern recognition (cf. Platt, 1999), and have been found to work well in the experiments we report below. In our experiments with SMO applied to distribution support estimation, we have always found it to converge. However, to ensure convergence even in rare pathological conditions, the algorithm can be modified slightly (Keerthi, Shevade, Bhattacharyya, & Murthy, 1999).
We end this section by stating a trick that is of importance in practical implementations. In practice, one has to use a nonzero accuracy tolerance when checking whether two quantities are equal. In particular, comparisons of this type are used in determining whether a point lies on the margin. Since we want the final decision function to evaluate to 1 for points that lie on the margin, we need to subtract this constant from the offset ρ at the end. In the next section, it will be argued that subtracting something from ρ is actually advisable also from a statistical point of view.

² This scan can be accelerated by not checking patterns that are on the correct side of the hyperplane by a large margin, using the method of Joachims (1999).
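As a rough illustration of the overall procedure, the toy trainer below strings elementary pairwise steps together, but with random pair selection in place of the KKT-based scans and heuristics described above (so it is a simplified stand-in, not the authors' algorithm; the data and parameters are made up). It recovers ρ from a non-bound support vector via equation 3.12.

```python
import math
import random

def gauss_kernel(x, y, c=1.0):
    return math.exp(-((x - y) ** 2) / c)

def train_one_class(data, nu, sweeps=300, seed=0):
    """Toy one-class SVM trainer: repeated exact minimizations of the dual
    (equation 3.11) over randomly chosen pairs of variables.  A simplified
    stand-in for the KKT-based pair selection of section 4.3."""
    rng = random.Random(seed)
    l = len(data)
    ub = 1.0 / (nu * l)
    K = [[gauss_kernel(a, b) for b in data] for a in data]
    alpha = [1.0 / l] * l  # feasible start: sum(alpha) = 1, alpha_i <= ub
    for _ in range(sweeps):
        i, j = rng.sample(range(l), 2)
        delta = alpha[i] + alpha[j]
        Ci = sum(alpha[n] * K[i][n] for n in range(l) if n not in (i, j))
        Cj = sum(alpha[n] * K[j][n] for n in range(l) if n not in (i, j))
        denom = K[i][i] + K[j][j] - 2.0 * K[i][j]
        if denom <= 1e-12:
            continue
        new_j = (delta * (K[i][i] - K[i][j]) + Ci - Cj) / denom  # eq. 4.5
        new_j = min(max(new_j, max(0.0, delta - ub)), min(ub, delta))
        alpha[i], alpha[j] = delta - new_j, new_j
    # outputs O_i of the kernel expansion on the training points
    out = [sum(alpha[n] * K[n][m] for n in range(l)) for m in range(l)]
    # rho from a non-bound SV (equation 3.12); fall back to max over SVs
    nb = [m for m in range(l) if 1e-8 < alpha[m] < ub - 1e-8]
    rho = out[nb[0]] if nb else max(out[m] for m in range(l) if alpha[m] > 1e-8)
    return alpha, rho, out

# hypothetical data: a 1-D cluster plus one distant point, nu = 0.3
data = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 5.0]
alpha, rho, out = train_one_class(data, nu=0.3)
```

In line with the ν-property discussed in section 5, at the solution roughly a fraction ν of the points may end up outside the estimated region; here the distant point receives the smallest output Oi and is the natural outlier.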
5 Theory

We now analyze the algorithm theoretically, starting with the uniqueness of the hyperplane (proposition 1). We then describe the connection to pattern recognition (proposition 2), and show that the parameter ν characterizes the fractions of SVs and outliers (proposition 3). Following that, we give a robustness result for the soft margin (proposition 4) and finally we present a theoretical result on the generalization error (theorem 1). Some of the proofs are given in appendix B. In this section, we use italic letters to denote the feature space images of the corresponding patterns in input space, that is,

x_i := Φ(x_i).  (5.1)

Definition 1. A data set

x_1, . . . , x_ℓ  (5.2)

is called separable if there exists some w ∈ F such that (w · x_i) > 0 for i ∈ [ℓ].

If we use a gaussian kernel (see equation 3.3), then any data set x_1, . . . , x_ℓ is separable after it is mapped into feature space. To see this, note that k(x_i, x_j) > 0 for all i, j; thus all inner products between mapped patterns are positive, implying that all patterns lie inside the same orthant. Moreover, since k(x_i, x_i) = 1 for all i, they all have unit length. Hence, they are separable from the origin.

Proposition 1 (supporting hyperplane). If the data set, equation 5.2, is separable, then there exists a unique supporting hyperplane with the properties that (1) it separates all data from the origin, and (2) its distance to the origin is maximal among all such hyperplanes. For any ρ > 0, it is given by

min_{w ∈ F} (1/2)||w||²  subject to  (w · x_i) ≥ ρ,  i ∈ [ℓ].  (5.3)
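The separability argument for gaussian kernels is easy to verify numerically: every entry of the Gram matrix (the matrix of feature-space inner products) is positive and the diagonal is 1, so the mapped patterns lie in a single orthant on the unit sphere. A minimal numpy check; the data and the kernel width c = 2 are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))           # arbitrary sample

c = 2.0                                 # width in k(x, y) = exp(-||x - y||^2 / c)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / c)               # Gram matrix = feature-space inner products

assert (K > 0).all()                    # positive inner products: one orthant
assert np.allclose(np.diag(K), 1.0)     # k(x_i, x_i) = 1: unit length
```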
The following result elucidates the relationship between single-class classification and binary classification.

Proposition 2 (connection to pattern recognition). (i) Suppose (w, ρ) parameterizes the supporting hyperplane for the data in equation 5.2. Then (w, 0) parameterizes the optimal separating hyperplane for the labeled data set,

{(x_1, 1), . . . , (x_ℓ, 1), (−x_1, −1), . . . , (−x_ℓ, −1)}.  (5.4)
Estimating the Support of a High-Dimensional Distribution
1453
(ii) Suppose (w, 0) parameterizes the optimal separating hyperplane passing through the origin for a labeled data set,

{(x_1, y_1), . . . , (x_ℓ, y_ℓ)},  y_i ∈ {±1} for i ∈ [ℓ],  (5.5)

aligned such that (w · x_i) is positive for y_i = 1. Suppose, moreover, that ρ/||w|| is the margin of the optimal hyperplane. Then (w, ρ) parameterizes the supporting hyperplane for the unlabeled data set,

{y_1 x_1, . . . , y_ℓ x_ℓ}.  (5.6)
Note that the relationship is similar for nonseparable problems. In that case, margin errors in binary classification (points that are either on the wrong side of the separating hyperplane or fall inside the margin) translate into outliers in single-class classification (points that fall on the wrong side of the hyperplane). Proposition 2 then holds, cum grano salis, for the training sets with margin errors and outliers, respectively, removed.

The utility of proposition 2 lies in the fact that it allows us to recycle certain results proven for binary classification (Schölkopf, Smola, Williamson, & Bartlett, 2000) for use in the single-class scenario. The following, explaining the significance of the parameter ν, is such a case.

Proposition 3 (ν-property). Assume the solution of equations 3.4 and 3.5 satisfies ρ ≠ 0. The following statements hold:

i. ν is an upper bound on the fraction of outliers, that is, training points outside the estimated region.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 5.2) were generated independently from a distribution P(x), which does not contain discrete components. Suppose, moreover, that the kernel is analytic and nonconstant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of outliers.

Note that this result also applies to the soft margin ball algorithm of Tax and Duin (1999), provided that it is stated in the ν-parameterization given in section 3.

Proposition 4 (resistance). Local movements of outliers parallel to w do not change the hyperplane.
Note that although the hyperplane does not change, its parameterization in w and ρ does. We now move on to the subject of generalization. Our goal is to bound the probability that a novel point drawn from the same underlying distribution
lies outside the estimated region. We present a "marginalized" analysis that in fact provides a bound on the probability that a novel point lies outside the region slightly larger than the estimated one.

Definition 2. Let f be a real-valued function on a space X. Fix θ ∈ R. For x ∈ X let d(x, f, θ) = max{0, θ − f(x)}. Similarly, for a training sequence X := (x_1, . . . , x_ℓ), we define

D(X, f, θ) = Σ_{x ∈ X} d(x, f, θ).
In the following, log denotes logarithms to base 2 and ln denotes natural logarithms.

Theorem 1 (generalization error bound). Suppose we are given a set of ℓ examples X ∈ X^ℓ generated i.i.d. from an unknown distribution P, which does not contain discrete components. Suppose, moreover, that we solve the optimization problem, equations 3.4 and 3.5 (or equivalently equation 3.11), and obtain a solution f_w given explicitly by equation 3.10. Let R_{w,ρ} := {x: f_w(x) ≥ ρ} denote the induced decision region. With probability 1 − δ over the draw of the random sample X ∈ X^ℓ, for any γ > 0,

P{x′: x′ ∉ R_{w,ρ−γ}} ≤ (2/ℓ) (k + log(ℓ²/(2δ))),  (5.7)

where

k = c_1 log(c_2 γ̂² ℓ)/γ̂² + (2D/γ̂) log(e((2ℓ − 1)γ̂/(2D) + 1)) + 2,  (5.8)

c_1 = 16c², c_2 = ln(2)/(4c²), c = 103, γ̂ = γ/||w||, D = D(X, f_{w,0}, ρ) = D(X, f_{w,ρ}, 0), and ρ is given by equation 3.12.

The training sample X defines (via the algorithm) the decision region R_{w,ρ}. We expect that new points generated according to P will lie in R_{w,ρ}. The theorem gives a probabilistic guarantee that new points lie in the larger region R_{w,ρ−γ}. The parameter ν can be adjusted when running the algorithm to trade off incorporating outliers versus minimizing the "size" of R_{w,ρ}. Adjusting ν will change the value of D. Note that since D is measured with respect to ρ while the bound applies to ρ − γ, any point that is outside the region that the bound applies to will make a contribution to D that is bounded away from 0. Therefore, equation 5.7 does not imply that asymptotically, we will always estimate the complete support.
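To get a feel for the magnitudes involved, the right-hand side of equation 5.7 (with k from equation 5.8) can be evaluated numerically. The sample values for ℓ, γ̂, D, and δ below are arbitrary illustrations; as discussed below, the constants make the bound loose, so the value typically exceeds 1 except for very large samples.

```python
import math

def theorem1_bound(ell, gamma_hat, D, delta):
    """Right-hand side of equation 5.7, with k from equation 5.8.
    log is base 2, as stated in the text; c = 103."""
    c = 103.0
    c1 = 16.0 * c ** 2
    c2 = math.log(2.0) / (4.0 * c ** 2)
    k = (c1 * math.log2(c2 * gamma_hat ** 2 * ell) / gamma_hat ** 2
         + (2.0 * D / gamma_hat)
           * math.log2(math.e * ((2 * ell - 1) * gamma_hat / (2.0 * D) + 1.0))
         + 2.0)
    return (2.0 / ell) * (k + math.log2(ell ** 2 / (2.0 * delta)))

# illustrative values only: one million samples, gamma_hat = 1, D = 1, delta = 0.05
bound = theorem1_bound(10 ** 6, 1.0, 1.0, 0.05)
```

Note that the first term of k requires c_2 γ̂² ℓ > 1 for its logarithm to be positive, which already hints at the large sample sizes needed before the bound becomes informative.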
The parameter γ allows one to trade off the confidence with which one wishes the assertion of the theorem to hold against the size of the predictive region R_{w,ρ−γ}: one can see from equation 5.8 that k, and hence the right-hand side of equation 5.7, scales inversely with γ. In fact, it scales inversely with γ̂, that is, it increases with w. This justifies measuring the complexity of the estimated region by the size of w, and minimizing ||w||² in order to find a region that will generalize well. In addition, the theorem suggests not to use the offset ρ returned by the algorithm, which would correspond to γ = 0, but a smaller value ρ − γ (with γ > 0).

We do not claim that using theorem 1 directly is a practical means to determine the parameters ν and γ explicitly. It is loose in several ways. We suspect c is too large by a factor of more than 50. Furthermore, no account is taken of the smoothness of the kernel used. If that were done (by using refined bounds on the covering numbers of the induced class of functions as in Williamson, Smola, and Schölkopf (1998)), then the first term in equation 5.8 would increase much more slowly when decreasing γ. The fact that the second term would not change indicates a different trade-off point. Nevertheless, the theorem gives one some confidence that ν and γ are suitable parameters to adjust.

6 Experiments

We apply the method to artificial and real-world data. Figure 1 displays two-dimensional (2D) toy examples and shows how the parameter settings influence the solution. Figure 2 shows a comparison to a Parzen windows estimator on a 2D problem, along with a family of estimators that lie "in between" the present one and the Parzen one. Figure 3 shows a plot of the outputs (w · Φ(x)) on training and test sets of the U.S. Postal Service (USPS) database of handwritten digits. The database contains 9298 digit images of size 16 × 16 (dimension 256); the last 2007 constitute the test set.
We used a gaussian kernel (see equation 3.3), which has the advantage that the data are always separable from the origin in feature space (cf. the comment following definition 1). For the kernel parameter c, we used 0.5 · 256. This value was chosen a priori; it is a common value for SVM classifiers on that data set (cf. Schölkopf et al., 1995).³ We fed our algorithm with the training instances of digit 0 only. Testing was done on both digit 0 and all other digits. We present results for two values of ν, one large, one small; for values in between, the results are qualitatively similar. In the first experiment, we used ν = 50%, thus aiming for a description of "0-ness," which

³ Hayton, Schölkopf, Tarassenko, and Anuzis (in press) use the following procedure to determine a value of c. For small c, all training points will become SVs; the algorithm just memorizes the data and will not generalize well. As c increases, the number of SVs initially drops. As a simple heuristic, one can thus start with a small value of c and increase it until the number of SVs does not decrease any further.
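The width-selection procedure from the footnote can be sketched generically. Here `count_svs` is a placeholder (our assumption, not part of the original text) for a routine that trains the machine at width c and returns the number of support vectors; the loop doubles c until the SV count stops decreasing.

```python
def choose_kernel_width(count_svs, c0=0.01, factor=2.0, max_iter=30):
    """Heuristic from Hayton et al. (in press): start with a small kernel
    width c, where all points become SVs (the machine memorizes the data),
    and increase c until the number of SVs no longer decreases."""
    c, prev = c0, count_svs(c0)
    for _ in range(max_iter):
        n = count_svs(c * factor)
        if n >= prev:          # no further decrease: stop
            return c
        c, prev = c * factor, n
    return c
```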
Figure 1 parameter settings (columns correspond to the four pictures, left to right):

    ν, width c        0.5, 0.5     0.5, 0.5     0.1, 0.5     0.5, 0.1
    frac. SVs/OLs     0.54, 0.43   0.59, 0.47   0.24, 0.03   0.65, 0.38
    margin ρ/||w||    0.84         0.70         0.62         0.48
Figure 1: (First two pictures) A single-class SVM applied to two toy problems; ν = c = 0.5, domain: [−1, 1]2 . In both cases, at least a fraction of ν of all examples is in the estimated region (cf. Table 1). The large value of ν causes the additional data points in the upper left corner to have almost no influence on the decision function. For smaller values of ν, such as 0.1 (third picture) the points cannot be ignored anymore. Alternatively, one can force the algorithm to take these outliers (OLs) into account by changing the kernel width (see equation 3.3). In the fourth picture, using c = 0.1, ν = 0.5, the data are effectively analyzed on a different length scale, which leads the algorithm to consider the outliers as meaningful points.
Figure 2: A single-class SVM applied to a toy problem; c = 0.5, domain: [−1, 1]2 , for various settings of the offset ρ. As discussed in section 3, ν = 1 yields a Parzen windows expansion. However, to get a Parzen windows estimator of the distribution’s support, we must in that case not use the offset returned by the algorithm (which would allow all points to lie outside the estimated region). Therefore, in this experiment, we adjusted the offset such that a fraction ν 0 = 0.1 of patterns would lie outside. From left to right, we show the results for ν ∈ {0.1, 0.2, 0.4, 1}. The right-most picture corresponds to the Parzen estimator that uses all kernels; the other estimators use roughly a fraction of ν kernels. Note that as a result of the averaging over all kernels, the Parzen windows estimate does not model the shape of the distribution very well for the chosen parameters.
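The ν = 1 limit described in the Figure 2 caption can be sketched directly: all α_i equal 1/ℓ, so the decision function is a Parzen windows estimate, and the offset is chosen as a quantile of the training outputs so that a fraction ν′ of the training points lies outside. A minimal sketch (the function name and the choice of quantile-based offset are ours):

```python
import numpy as np

def parzen_decision(z, X, c, nu_prime=0.1):
    """nu = 1 limit: f(x) = (1/l) * sum_i exp(-||x - x_i||^2 / c); the
    offset rho is set so that a fraction nu_prime of training points
    lies outside the estimated region."""
    f = lambda x: np.mean(np.exp(-((X - x) ** 2).sum(axis=1) / c))
    rho = np.quantile([f(xi) for xi in X], nu_prime)
    return f(z) - rho          # positive: inside estimated support

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))   # toy cluster around the origin
```

A point near the cluster center yields a positive value, while a point far from all training data yields a negative one.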
Figure 3: Experiments on the U.S. Postal Service OCR data set. Recognizer for digit 0; output histogram for the exemplars of 0 in the training/test set, and on test exemplars of other digits. The x-axis gives the output values, that is, the argument of the sgn function in equation 3.10. For ν = 50% (top), we get 50% SVs and 49% outliers (consistent with proposition 3), 44% true positive test examples, and zero false positives from the "other" class. For ν = 5% (bottom), we get 6% and 4% for SVs and outliers, respectively. In that case, the true positive rate is improved to 91%, while the false-positive rate increases to 7%. The offset ρ is marked in the graphs. Note, finally, that the plots show a Parzen windows density estimate of the output histograms. In reality, many examples sit exactly at the threshold value (the nonbound SVs). Since this peak is smoothed out by the estimator, the fraction of outliers in the training set appears slightly larger than it should be.
Figure 4: Subset of 20 examples randomly drawn from the USPS test set, with class labels (6, 9, 2, 8, 1, 8, 8, 6, 5, 3, 2, 3, 8, 7, 0, 3, 0, 8, 2, 7).
captures only half of all zeros in the training set. As shown in Figure 3, this leads to zero false positives (although the learning machine has not seen any non-0's during training, it correctly identifies all non-0's as such), while still recognizing 44% of the digits 0 in the test set. Higher recognition rates can be achieved using smaller values of ν: for ν = 5%, we get 91% correct recognition of digits 0 in the test set, with a fairly moderate false-positive rate of 7%.

Although this experiment leads to encouraging results, it did not really address the actual task the algorithm was designed for. Therefore, we next focused on a problem of novelty detection. Again, we used the USPS set; however, this time we trained the algorithm on the test set and used it to identify outliers. It is folklore in the community that the USPS test set (see Figure 4) contains a number of patterns that are hard or impossible to classify, due to segmentation errors or mislabeling (Vapnik, 1995). In this experiment, we augmented the input patterns by 10 extra dimensions corresponding to the class labels of the digits. The rationale is that if we disregarded the labels, there would be no hope of identifying mislabeled patterns as outliers. With the labels, the algorithm has the chance of identifying both unusual patterns and usual patterns with unusual labels. Figure 5 shows the 20 worst outliers for the USPS test set. Note that the algorithm indeed extracts patterns that are very hard to assign to their respective classes. In the experiment, we used the same kernel width as above and a ν value of 5%. The latter was chosen roughly to reflect our expectations as to how many "bad" patterns there are in the test set. Most good learning algorithms achieve error rates of 3 to 5% on the USPS benchmark (for a list of results, cf. Vapnik, 1995). Table 1 shows that ν lets us control the fraction of outliers.
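The label augmentation used in this novelty-detection experiment amounts to appending a one-hot encoding of the class label to each pattern. A minimal sketch (the function name and the encoding scale are ours; the text does not specify how the extra dimensions were scaled):

```python
import numpy as np

def augment_with_labels(X, y, n_classes=10, scale=1.0):
    """Append n_classes extra dimensions encoding the class label, so that
    'usual patterns with unusual labels' become outliers in the joint space."""
    onehot = scale * np.eye(n_classes)[np.asarray(y)]
    return np.concatenate([np.asarray(X, dtype=float), onehot], axis=1)
```

Two identical patterns with different labels then differ in the augmented space, so a mislabeled copy of a common digit can be flagged.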
In the last experiment, we tested the run-time scaling behavior of the proposed SMO solver, which is used for training the learning machine (see Figure 6). It was found to depend on the value of ν used. For the small values of ν that are typically used in outlier detection tasks, the algorithm scales very well to larger data sets, with a dependency of training times on the sample size, which is at most quadratic. In addition to the experiments reported above, the algorithm has since
Figure 5: Outliers identified by the proposed algorithm, ranked by the negative output of the SVM (the argument of equation 3.10). The outputs (for convenience in units of 10⁻⁵) are written underneath each image in italics; the (alleged) class labels are given in boldface. Note that most of the examples are "difficult" in that they are either atypical or even mislabeled. The (output, label) pairs, from worst: (−513, 9), (−507, 1), (−458, 0), (−377, 1), (−282, 7), (−216, 2), (−200, 3), (−186, 9), (−179, 5), (−162, 0), (−153, 3), (−143, 6), (−128, 6), (−123, 0), (−117, 7), (−93, 5), (−78, 0), (−58, 7), (−52, 6), (−48, 3).
Table 1: Experimental Results for Various Values of the Outlier Control Constant ν, USPS Test Set, Size ℓ = 2007.

    ν      Fraction of OLs (%)   Fraction of SVs (%)   Training Time (CPU sec)
    1%      0.0                  10.0                    36
    2%      0.0                  10.0                    39
    3%      0.1                  10.0                    31
    4%      0.6                  10.1                    40
    5%      1.4                  10.6                    36
    6%      1.8                  11.2                    33
    7%      2.6                  11.5                    42
    8%      4.1                  12.0                    53
    9%      5.4                  12.9                    76
    10%     6.2                  13.7                    65
    20%    16.9                  22.6                   193
    30%    27.5                  31.8                   269
    40%    37.1                  41.7                   685
    50%    47.4                  51.2                  1284
    60%    58.2                  61.0                  1150
    70%    68.3                  70.7                  1512
    80%    78.5                  80.5                  2206
    90%    89.4                  90.1                  2349

Notes: ν bounds the fractions of outliers and support vectors from above and below, respectively (cf. proposition 3). As we are not in the asymptotic regime, there is some slack in the bounds; nevertheless, ν can be used to control the above fractions. Note, moreover, that training times (CPU time in seconds on a Pentium II running at 450 MHz) increase as ν approaches 1. This is related to the fact that almost all Lagrange multipliers will be at the upper bound in that case (cf. section 4). The ν = 5% row is the system used in the outlier detection experiments.
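The ν-property can be read off Table 1 directly: in every row, the fraction of outliers is at most ν and the fraction of SVs is at least ν. A quick check over the tabulated values:

```python
# (nu, fraction of OLs, fraction of SVs), all in percent, from Table 1
rows = [(1, 0.0, 10.0), (2, 0.0, 10.0), (3, 0.1, 10.0), (4, 0.6, 10.1),
        (5, 1.4, 10.6), (6, 1.8, 11.2), (7, 2.6, 11.5), (8, 4.1, 12.0),
        (9, 5.4, 12.9), (10, 6.2, 13.7), (20, 16.9, 22.6), (30, 27.5, 31.8),
        (40, 37.1, 41.7), (50, 47.4, 51.2), (60, 58.2, 61.0),
        (70, 68.3, 70.7), (80, 78.5, 80.5), (90, 89.4, 90.1)]

for nu, ols, svs in rows:
    # proposition 3: fraction of outliers <= nu <= fraction of SVs
    assert ols <= nu <= svs
```

The slack between the two fractions shrinks as ν grows, consistent with the asymptotic statement of proposition 3(iii).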
Figure 6: Training times versus data set sizes ℓ (both axes depict log base 2; CPU time in seconds on a Pentium II running at 450 MHz, training on subsets of the USPS test set); c = 0.5 · 256. As in Table 1, it can be seen that larger values of ν generally lead to longer training times (note that the plots use different y-axis ranges). However, they also differ in their scaling with the sample size. The exponents can be read off directly from the slopes of the graphs, as they are plotted in log scale with equal axis spacing. For small values of ν (≤ 5%), the training times were approximately linear in the training set size. The scaling gets worse as ν increases. For large values of ν, training times are roughly proportional to the sample size raised to the power of 2.5 (right plot). The results should be taken only as an indication of what is going on; they were obtained using fairly small training sets, the largest being 2007, the size of the USPS test set. As a consequence, they are fairly noisy and refer only to the examined regime. Encouragingly, the scaling is better than the cubic one that one would expect when solving the optimization problem using all patterns at once (cf. section 4). Moreover, for small values of ν, that is, those typically used in outlier detection (in Figure 5, we used ν = 5%), our algorithm was particularly efficient.
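The scaling exponents in Figure 6 are read off as slopes in log-log coordinates; the same estimate can be made by least squares. The timings below are synthetic stand-ins generated to follow t ∝ ℓ^2.5, not the measured values:

```python
import numpy as np

sizes = np.array([250.0, 500.0, 1000.0, 2000.0])
times = 1e-6 * sizes ** 2.5            # synthetic: exact power law

# slope of the log-log regression line recovers the scaling exponent
slope, intercept = np.polyfit(np.log2(sizes), np.log2(times), 1)
```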
been applied in several other domains, such as the modeling of parameter regimes for the control of walking robots (Still & Schölkopf, in press) and condition monitoring of jet engines (Hayton et al., in press).

7 Discussion

One could view this work as an attempt to provide a new algorithm in line with Vapnik's principle never to solve a problem that is more general
than the one that one is actually interested in. For example, in situations where one is interested only in detecting novelty, it is not always necessary to estimate a full density model of the data. Indeed, density estimation is more difficult than what we are doing, in several respects.

Mathematically, a density will exist only if the underlying probability measure possesses an absolutely continuous distribution function. However, the general problem of estimating the measure for a large class of sets, say, the sets measurable in Borel's sense, is not solvable (for a discussion, see Vapnik, 1998). Therefore we need to restrict ourselves to making a statement about the measure of some sets. Given a small class of sets, the simplest estimator that accomplishes this task is the empirical measure, which simply looks at how many training points fall into the region of interest. Our algorithm does the opposite. It starts with the number of training points that are supposed to fall into the region and then estimates a region with the desired property. Often there will be many such regions. The solution becomes unique only by applying a regularizer, which in our case enforces that the region be small in a feature space associated with the kernel. Therefore, we must keep in mind that the measure of smallness in this sense depends on the kernel used, in a way that is no different from any other method that regularizes in a feature space. A similar problem, however, already appears in density estimation when done in input space. Let p denote a density on X. If we perform a (nonlinear) coordinate transformation in the input domain X, then the density values will change; loosely speaking, what remains constant is p(x) · dx, while dx is transformed too. When directly estimating the probability measure of regions, we are not faced with this problem, as the regions automatically change accordingly.
An attractive property of the measure of smallness that we chose to use is that it can also be placed in the context of regularization theory, leading to an interpretation of the solution as maximally smooth in a sense that depends on the specific kernel used. More specifically, let us assume that k is Green's function of P*P for an operator P mapping into some inner product space (Smola, Schölkopf, & Müller, 1998; Girosi, 1998), and take a look at the dual objective function that we minimize,

Σ_{i,j} α_i α_j k(x_i, x_j) = Σ_{i,j} α_i α_j (k(x_i, ·) · δ_{x_j}(·))
                           = Σ_{i,j} α_i α_j (k(x_i, ·) · (P*P k)(x_j, ·))
                           = Σ_{i,j} α_i α_j ((P k)(x_i, ·) · (P k)(x_j, ·))
                           = ((P Σ_i α_i k(x_i, ·)) · (P Σ_j α_j k(x_j, ·)))
                           = ||Pf||²,
using f(x) = Σ_i α_i k(x_i, x). The regularization operators of common kernels can be shown to correspond to derivative operators (Poggio & Girosi, 1990); therefore, minimizing the dual objective function corresponds to maximizing the smoothness of the function f (which is, up to a thresholding operation, the function we estimate). This, in turn, is related to a prior p(f) ∝ e^{−||Pf||²} on the function space.

Interestingly, as the minimization of the dual objective function also corresponds to a maximization of the margin in feature space, an equivalent interpretation is in terms of a prior on the distribution of the unknown other class (the "novel" class in a novelty detection problem); trying to separate the data from the origin amounts to assuming that the novel examples lie around the origin.

The main inspiration for our approach stems from the earliest work of Vapnik and collaborators. In 1962, they proposed an algorithm for characterizing a set of unlabeled data points by separating it from the origin using a hyperplane (Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1974). However, they quickly moved on to two-class classification problems, in terms of both algorithms and the theoretical development of statistical learning theory, which originated in those days.

From an algorithmic point of view, we can identify two shortcomings of the original approach, which may have caused research in this direction to stop for more than three decades. First, the original algorithm in Vapnik and Chervonenkis (1974) was limited to linear decision rules in input space; second, there was no way of dealing with outliers. In conjunction, these restrictions are indeed severe; a generic data set need not be separable from the origin by a hyperplane in input space. The two modifications that we have incorporated dispose of these shortcomings.
First, the kernel trick allows for a much larger class of functions by nonlinearly mapping into a high-dimensional feature space and thereby increases the chances of a separation from the origin being possible. In particular, using a gaussian kernel (see equation 3.3), such a separation is always possible, as shown in section 5. The second modification directly allows for the possibility of outliers. We have incorporated this softness of the decision rule using the ν-trick (Schölkopf, Platt, & Smola, 2000) and thus obtained a direct handle on the fraction of outliers.

We believe that our approach, proposing a concrete algorithm with well-behaved computational complexity (convex quadratic programming) for a problem that so far has mainly been studied from a theoretical point of view, has abundant practical applications. To turn the algorithm into an easy-to-use black-box method for practitioners, questions like the selection of kernel parameters (such as the width of a gaussian kernel) have to be tackled. It is our hope that theoretical results such as the one we have briefly outlined, and the one of Vapnik and Chapelle (2000), will provide a solid foundation for this formidable task. This, along with algorithmic extensions such as the
possibility of taking into account information about the "abnormal" class (Schölkopf, Platt, & Smola, 2000), is the subject of current research.

Appendix A: Supplementary Material for Section 2

A.1 Estimating the Support of a Density. The problem of estimating C(1) appears to have first been studied by Geffroy (1964), who considered X = R² with piecewise constant estimators. There have been a number of works studying a natural nonparametric estimator of C(1) (e.g., Chevalier, 1976; Devroye & Wise, 1980; see Gayraud, 1997 for further references). The nonparametric estimator is simply

Ĉ_ℓ = ∪_{i=1}^{ℓ} B(x_i, ε_ℓ),  (A.1)
where B(x, ε) is the l₂(X) ball of radius ε centered at x and (ε_ℓ)_ℓ is an appropriately chosen decreasing sequence. Devroye and Wise (1980) showed the asymptotic consistency of equation A.1 with respect to the symmetric difference between C(1) and Ĉ_ℓ. Cuevas and Fraiman (1997) studied the asymptotic consistency of a plug-in estimator of C(1): Ĉ_plug-in = {x: p̂_ℓ(x) > 0}, where p̂_ℓ is a kernel density estimator. In order to avoid problems with Ĉ_plug-in, they analyzed Ĉ_β^plug-in := {x: p̂_ℓ(x) > β_ℓ}, where (β_ℓ)_ℓ is an appropriately chosen sequence. Clearly, for a given distribution, α is related to β, but this connection cannot be readily exploited by this type of estimator.

The most recent work relating to the estimation of C(1) is by Gayraud (1997), who has made an asymptotic minimax study of estimators of functionals of C(1). Two examples are vol C(1) and the center of C(1). (See also Korostelev & Tsybakov, 1993, chap. 8.)

A.2 Estimating High Probability Regions (α ≠ 1). Polonik (1995b) has studied the use of the "excess mass approach" (Müller, 1992) to construct an estimator of "generalized α-clusters" related to C(α). Define the excess mass over C at level α as

E_C(α) = sup {H_α(C): C ∈ C},

where H_α(·) = (P − αλ)(·) and again λ denotes Lebesgue measure. Any set Γ_C(α) ∈ C such that E_C(α) = H_α(Γ_C(α)) is called a generalized α-cluster in C. Replace P by P_ℓ in these definitions to obtain their empirical counterparts E_{ℓ,C}(α) and Γ_{ℓ,C}(α). In other words, his estimator is

Γ_{ℓ,C}(α) = arg max {(P_ℓ − αλ)(C): C ∈ C},
where the max is not necessarily unique. Now while Γ_{ℓ,C}(α) is clearly different from C_ℓ(α), it is related to it in that it attempts to find small regions with as much excess mass (which is similar to finding small regions with a given amount of probability mass). Actually, Γ_{ℓ,C}(α) is more closely related to the determination of density contour clusters at level α: c_p(α) := {x: p(x) ≥ α}.

Simultaneously, and independently, Ben-David and Lindenbaum (1997) studied the problem of estimating c_p(α). They too made use of VC classes but stated their results in a stronger form, which is meaningful for finite sample sizes.

Finally, we point out a curious connection between minimum volume sets of a distribution and its differential entropy in the case that X is one-dimensional. Suppose X is a one-dimensional random variable with density p. Let S = C(1) be the support of p and define the differential entropy of X by h(X) = −∫_S p(x) log p(x) dx. For ε > 0 and ℓ ∈ N, define the typical set A_ε^{(ℓ)} with respect to p by

A_ε^{(ℓ)} = {(x_1, . . . , x_ℓ) ∈ S^ℓ: |−(1/ℓ) log p(x_1, . . . , x_ℓ) − h(X)| ≤ ε},

where p(x_1, . . . , x_ℓ) = Π_{i=1}^{ℓ} p(x_i). If (a_ℓ)_ℓ and (b_ℓ)_ℓ are sequences, the notation a_ℓ ≐ b_ℓ means lim_{ℓ→∞} (1/ℓ) log(a_ℓ/b_ℓ) = 0. Cover and Thomas (1991) show that for all ε, δ < 1/2,

vol A_ε^{(ℓ)} ≐ vol C_ℓ(1 − δ) ≐ 2^{ℓh}.
They point out that this result "indicates that the volume of the smallest set that contains most of the probability is approximately 2^{ℓh}. This is an ℓ-dimensional volume, so the corresponding side length is (2^{ℓh})^{1/ℓ} = 2^h. This provides an interpretation of differential entropy" (p. 227).

Appendix B: Proofs of Section 5

Proof (Proposition 1). Due to the separability, the convex hull of the data does not contain the origin. The existence and uniqueness of the hyperplane then follow from the supporting hyperplane theorem (Bertsekas, 1995). Moreover, separability implies that there actually exists some ρ > 0 and w ∈ F such that (w · x_i) ≥ ρ for i ∈ [ℓ] (by rescaling w, this can be seen to work for arbitrarily large ρ). The distance of the hyperplane {z ∈ F: (w · z) = ρ} to the origin is ρ/||w||. Therefore the optimal hyperplane is obtained by minimizing ||w|| subject to these constraints, that is, by the solution of equation 5.3.

Proof (Proposition 2). Ad (i). By construction, the separation of the set in equation 5.4 is a point-symmetric problem. Hence, the optimal separating hyperplane passes through the origin, for, if it did not, we could obtain another
optimal separating hyperplane by reflecting the first one with respect to the origin; this would contradict the uniqueness of the optimal separating hyperplane (Vapnik, 1995). Next, observe that (−w, ρ) parameterizes the supporting hyperplane for the data set reflected through the origin and that it is parallel to the one given by (w, ρ). This provides an optimal separation of the two sets, with distance 2ρ, and a separating hyperplane (w, 0).

Ad (ii). By assumption, w is the shortest vector satisfying y_i(w · x_i) ≥ ρ (note that the offset is 0). Hence, equivalently, it is also the shortest vector satisfying (w · y_i x_i) ≥ ρ for i ∈ [ℓ].

Proof Sketch (Proposition 3). When changing ρ, the term Σ_i ξ_i in equation 3.4 will change proportionally to the number of points that have a nonzero ξ_i (the outliers), plus, when changing ρ in the positive direction, the number of points that are just about to get a nonzero ξ_i, that is, which sit on the hyperplane (the SVs). At the optimum of equation 3.4, we therefore have parts i and ii. Part iii can be proven by a uniform convergence argument showing that since the covering numbers of kernel expansions regularized by a norm in some feature space are well behaved, the fraction of points that lie exactly on the hyperplane is asymptotically negligible (Schölkopf, Smola, et al., 2000).

Proof (Proposition 4). Suppose x_o is an outlier, that is, ξ_o > 0; hence, by the KKT conditions (Bertsekas, 1995), α_o = 1/(νℓ). Transforming it into x′_o := x_o + δ · w, where |δ| < ξ_o/||w||, leads to a slack that is still nonzero, that is, ξ′_o > 0; hence, we still have α_o = 1/(νℓ). Therefore, α′ = α is still feasible, as is the primal solution (w′, ξ′, ρ′). Here, we use ξ′_i = (1 + δ · α_o)ξ_i for i ≠ o, w′ = (1 + δ · α_o)w, and ρ′ as computed from equation 3.12. Finally, the KKT conditions are still satisfied, as still α′_o = 1/(νℓ). Thus (Bertsekas, 1995), α is still the optimal solution.

We now move on to the proof of theorem 1.
We start by introducing a common tool for measuring the capacity of a class F of functions that map X to R.

Definition 3. Let (X, d) be a pseudometric space, and let A be a subset of X and ε > 0. A set U ⊆ X is an ε-cover for A if, for every a ∈ A, there exists u ∈ U such that d(a, u) ≤ ε. The ε-covering number of A, N(ε, A, d), is the minimal cardinality of an ε-cover for A (if there is no such finite cover, then it is defined to be ∞).

Note that we have used "less than or equal to" in the definition of a cover. This is somewhat unconventional, but will not change the bounds we use. It is, however, technically useful in the proofs.
The idea is that U should be finite but approximate all of A with respect to the pseudometric d. We will use the l∞ norm over a finite sample X = (x_1, . . . , x_ℓ) for the pseudonorm on F,

||f||_∞^{l_X} := max_{x ∈ X} |f(x)|.  (B.2)
(The (pseudo)-norm induces a (pseudo)-metric in the usual way.) Suppose X is a compact subset of X. Let N(ε, F, ℓ) = max_{X ∈ X^ℓ} N(ε, F, l_X^∞) (the maximum exists by definition of the covering number and compactness of X).

We require a technical restriction on the function class, referred to as sturdiness, which is satisfied by standard classes such as kernel-based linear function classes and neural networks with bounded weights (see Shawe-Taylor & Williamson, 1999, and Schölkopf, Platt, Shawe-Taylor, Smola, & Williamson, 1999, for further details). Below, the notation ⌈t⌉ denotes the smallest integer ≥ t. Several of the results stated claim that some event occurs with probability 1 − δ. In all cases, 0 < δ < 1 is understood, and δ is presumed in fact to be small.

Theorem 2. Consider any distribution P on X and a sturdy real-valued function class F on X. Suppose x_1, . . . , x_ℓ are generated i.i.d. from P. Then with probability 1 − δ over such an ℓ-sample, for any f ∈ F and for any γ > 0,

P{x: f(x) < min_{i ∈ [ℓ]} f(x_i) − 2γ} ≤ ε(ℓ, k, δ) := (2/ℓ)(k + log(ℓ/δ)),
where k = ⌈log N(γ, F, 2ℓ)⌉. The proof, which is given in Schölkopf, Platt, et al. (1999), uses standard techniques of VC theory: symmetrization via a ghost sample, application of the union bound, and a permutation argument.

Although theorem 2 provides the basis of our analysis, it suffers from a weakness: a single, perhaps unusual point may significantly reduce the minimum min_{i ∈ [ℓ]} f(x_i). We therefore now consider a technique that will enable us to ignore some of the smallest values in the set {f(x_i): x_i ∈ X} at a corresponding cost to the complexity of the function class. The technique was originally considered for analyzing soft margin classification in Shawe-Taylor and Cristianini (1999), but we will adapt it for the unsupervised problem we are considering here. We will remove minimal points by increasing their output values. This corresponds to having a nonzero slack variable ξ_i in the algorithm, where we use the class of linear functions in feature space in the application of the theorem. There are well-known bounds for the log covering numbers of this class. We measure the increase in terms of D (definition 2).
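For a finite sample of function evaluations, an upper bound on the l∞ covering number of definition 3 can be computed with a simple greedy procedure. This is our illustration, not part of the theory; the greedy cover is not minimal in general, so it only upper-bounds N(ε, ·, ·).

```python
import numpy as np

def greedy_cover_size(F_vals, eps):
    """Greedy upper bound on the eps-covering number of a finite set of
    function evaluations F_vals (one row per function, one column per
    sample point), in the l_inf pseudometric of equation B.2."""
    remaining = list(range(len(F_vals)))
    cover = []
    while remaining:
        u = remaining[0]               # pick an uncovered function as a center
        cover.append(u)
        remaining = [i for i in remaining
                     if np.max(np.abs(F_vals[i] - F_vals[u])) > eps]
    return len(cover)
```

As ε shrinks, the cover size (and hence k in the theorems) grows, which is the trade-off the bounds quantify.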
Estimating the Support of a High-Dimensional Distribution
1467
Let X be an inner product space. The following definition introduces a new inner product space based on X.

Definition 4. Let L(X) be the set of real-valued nonnegative functions f on X with countable support supp(f); that is, functions in L(X) are nonzero for at most countably many points. We define the inner product of two functions f, g ∈ L(X) by

f · g = Σ_{x∈supp(f)} f(x) g(x).

The 1-norm on L(X) is defined by ||f||_1 = Σ_{x∈supp(f)} f(x), and let L_B(X) := {f ∈ L(X) : ||f||_1 ≤ B}. Now we define an embedding of X into the inner product space X × L(X) via τ : X → X × L(X), τ : x ↦ (x, δ_x), where δ_x ∈ L(X) is defined by

δ_x(y) = 1 if y = x; 0 otherwise.

For a function f ∈ F, a set of training examples X, and θ ∈ R, we define the function g_f ∈ L(X) by

g_f(y) = g_f^{X,θ}(y) = Σ_{x∈X} d(x, f, θ) δ_x(y).
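A minimal sketch of definition 4, representing functions in L(X) with countable support as dictionaries from points to values (the dictionary representation is our own choice; any sparse map would do):

```python
def inner(f, g):
    # f . g = sum over supp(f) of f(x)*g(x)  (definition 4); functions in
    # L(X) are represented as dicts whose keys form the (finite) support
    return sum(v * g.get(x, 0.0) for x, v in f.items())

def one_norm(f):
    # ||f||_1 = sum over supp(f) of f(x)
    return sum(f.values())

def delta_fn(x):
    # delta_x, the L(X) component of the embedding tau(x) = (x, delta_x)
    return {x: 1.0}

f = {"x1": 2.0, "x2": 3.0}
print(inner(f, delta_fn("x1")))               # evaluates f at x1: 2.0
print(inner(delta_fn("x1"), delta_fn("x2")))  # distinct points are orthogonal: 0.0
print(one_norm(f))                            # 5.0
```

The orthogonality of the δ_x for distinct points is what makes the slack construction g_f = Σ d(x, f, θ) δ_x behave like independent slack variables.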
Theorem 3. Fix B > 0. Consider a fixed but unknown probability distribution P which has no atomic components on the input space X and a sturdy class of real-valued functions F. Then with probability 1 − δ over randomly drawn training sequences X of size ℓ, for all γ > 0 and any f ∈ F and any θ such that g_f = g_f^{X,θ} ∈ L_B(X) (i.e., Σ_{x∈X} d(x, f, θ) ≤ B),

P{x : f(x) < θ − 2γ} ≤ (2/ℓ)(k + log(ℓ/δ)),   (B.3)

where k = ⌈log N(γ/2, F, 2ℓ) + log N(γ/2, L_B(X), 2ℓ)⌉.

The assumption on P can in fact be weakened to require only that there is no point x ∈ X satisfying f(x) < θ − 2γ that has discrete probability. The proof is given in Schölkopf, Platt, et al. (1999). The theorem bounds the probability of a new example falling in the region for which f(x) has value less than θ − 2γ, this being the complement of the estimate for the support of the distribution. In the algorithm described in this article, one would use the hyperplane shifted by 2γ toward the origin to define the region. Note that there is no restriction placed on the class of functions, though these functions could be probability density functions. The result shows that we can bound the probability of points falling outside the region of estimated support by a quantity involving the ratio of the
1468
Bernhard Schölkopf et al.
log covering numbers (which can be bounded by the fat-shattering dimension at scale proportional to γ) and the number of training examples, plus a factor involving the 1-norm of the slack variables. The result is stronger than related results given by Ben-David and Lindenbaum (1997): their bound involves the square root of the ratio of the Pollard dimension (the fat-shattering dimension when γ tends to 0) and the number of training examples. In the specific situation considered in this article, where the choice of function class is the class of linear functions in a kernel-defined feature space, one can give a more practical expression for the quantity k in theorem 3. To this end, we use a result of Williamson, Smola, and Schölkopf (2000) to bound log N(γ/2, F, 2ℓ) and a result of Shawe-Taylor and Cristianini (2000) to bound log N(γ/2, L_B(X), 2ℓ). We apply theorem 3, stratifying over possible values of B. The proof is given in Schölkopf, Platt, et al. (1999) and leads to theorem 1 stated in the main part of this article.

Acknowledgments

This work was supported by the ARC, the DFG (# Ja 379/9-1 and Sm 62/11), and by the European Commission under the Working Group Nr. 27150 (NeuroCOLT2). Parts of it were done while B. S. and A. S. were with GMD FIRST, Berlin. Thanks to S. Ben-David, C. Bishop, P. Hayton, N. Oliver, C. Schnörr, J. Spanjaard, L. Tarassenko, and M. Tipping for helpful discussions.

References

Ben-David, S., & Lindenbaum, M. (1997). Learning distributions by their density levels: A paradigm for learning without a teacher. Journal of Computer and System Sciences, 55, 171–182.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Chevalier, J. (1976). Estimation du support et du contour du support d'une loi de probabilité. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, Nouvelle Série, 12(4), 339–364.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cuevas, A., & Fraiman, R. (1997). A plug-in approach to support estimation. Annals of Statistics, 25(6), 2300–2312.
Devroye, L., & Wise, G. L. (1980). Detection of abnormal behaviour via nonparametric estimation of the support. SIAM Journal on Applied Mathematics, 38(3), 480–488.
Einmal, J. H. J., & Mason, D. M. (1992). Generalized quantile processes. Annals of Statistics, 20(2), 1062–1078.
Gayraud, G. (1997). Estimation of functional of density support. Mathematical Methods of Statistics, 6(1), 26–46.
Geffroy, J. (1964). Sur un problème d'estimation géométrique. Publications de l'Institut de Statistique de l'Université de Paris, 13, 191–210.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. Journal of the American Statistical Association, 82, 267–270.
Hayton, P., Schölkopf, B., Tarassenko, L., & Anuzis, P. (in press). Support vector novelty detection applied to jet engine vibration spectra. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. No. CD-99-14). Singapore: Department of Mechanical and Production Engineering, National University of Singapore.
Korostelev, A. P., & Tsybakov, A. B. (1993). Minimax theory of image reconstruction. New York: Springer-Verlag.
Müller, D. W. (1992). The excess mass approach in statistics. Heidelberg: Beiträge zur Statistik, Universität Heidelberg.
Nolan, D. (1991). The excess mass ellipsoid. Journal of Multivariate Analysis, 39, 348–371.
Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9).
Polonik, W. (1995a). Density estimation under qualitative assumptions in higher dimensions. Journal of Multivariate Analysis, 55(1), 61–81.
Polonik, W. (1995b). Measuring mass concentrations and estimating density contour clusters—an excess mass approach. Annals of Statistics, 23(3), 855–881.
Polonik, W. (1997). Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69, 1–24.
Sager, T. W. (1979). An iterative method for estimating a multivariate mode and isopleth. Journal of the American Statistical Association, 74(366), 329–339.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Estimating the support of a high-dimensional distribution (Tech. Rep. No. 87). Redmond, WA: Microsoft Research. Available online at: http://www.research.microsoft.com/scripts/pubs/view.asp?TR ID=MSRTR-99-87.
Schölkopf, B., Platt, J., & Smola, A. J. (2000). Kernel method for percentile feature extraction (Tech. Rep. No. 22). Redmond, WA: Microsoft Research.
Schölkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 327–352). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Schölkopf, B., Williamson, R., Smola, A., & Shawe-Taylor, J. (1999). Single-class support vector machines. In J. Buhmann, W. Maass, H. Ritter, & N. Tishby (Eds.), Unsupervised learning (Rep. No. 235) (pp. 19–20). Dagstuhl, Germany.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer-Verlag.
Shawe-Taylor, J., & Cristianini, N. (2000). On the generalisation of soft margin algorithms. Submitted to IEEE Transactions on Information Theory.
Shawe-Taylor, J., & Williamson, R. C. (1999). Generalization performance of classifiers in terms of observed covering numbers. In Computational Learning Theory: 4th European Conference (pp. 274–284). New York: Springer-Verlag.
Smola, A., & Schölkopf, B. (in press). A tutorial on support vector regression. Statistics and Computing.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Mika, S., Schölkopf, B., & Williamson, R. C. (in press). Regularized principal manifolds. Machine Learning.
Still, S., & Schölkopf, B. (in press). Four-legged walking gait control using a neuromorphic chip interfaced to a support vector learning algorithm. In T. Leen, T. Diettrich, & P. Anuzis (Eds.), Advances in neural information processing systems, 13. Cambridge, MA: MIT Press.
Stoneking, D. (1999). Improving the manufacturability of electronic designs. IEEE Spectrum, 36(6), 70–76.
Tarassenko, L., Hayton, P., Cerneaz, N., & Brady, M. (1995). Novelty detection for the identification of masses in mammograms. In Proceedings Fourth IEE International Conference on Artificial Neural Networks (pp. 442–447). Cambridge.
Tax, D. M. J., & Duin, R. P. W. (1999). Data domain description by support vectors. In M. Verleysen (Ed.), Proceedings ESANN (pp. 251–256). Brussels: D Facto.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Annals of Statistics, 25(3), 948–969.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for SVM. In A. J. Smola, P. L. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 261–280). Cambridge, MA: MIT Press.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Moscow: Nauka. [In Russian] (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Vapnik, V., & Lerner, A. (1963). Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. No. 19). NeuroCOLT. Available online at: http://www.neurocolt.com. Also: IEEE Transactions on Information Theory (in press).
Williamson, R. C., Smola, A. J., & Schölkopf, B. (2000). Entropy numbers of linear function classes. In N. Cesa-Bianchi & S. Goldman (Eds.), Proceedings of the 13th Annual Conference on Computational Learning Theory (pp. 309–319). San Mateo, CA: Morgan Kaufmann.

Received November 19, 1999; accepted September 22, 2000.
LETTER
Communicated by Misha Tsodyks
Stationary Bumps in Networks of Spiking Neurons

Carlo R. Laing
Carson C. Chow
Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15260, U.S.A.
We examine the existence and stability of spatially localized "bumps" of neuronal activity in a network of spiking neurons. Bumps have been proposed as mechanisms of visual orientation tuning, the rat head direction system, and working memory. We show that a bump solution can exist in a spiking network provided the neurons fire asynchronously within the bump. We consider a parameter regime where the bump solution is bistable with an all-off state and can be initiated with a transient excitatory stimulus. We show that the activity profile matches that of a corresponding population rate model. The bump in a spiking network can lose stability through partial synchronization to either a traveling wave or the all-off state. This can occur if the synaptic timescale is too fast through a dynamical effect or if a transient excitatory pulse is applied to the network. A bump can thus be activated and deactivated with excitatory inputs that may have physiological relevance.

1 Introduction

Neuronal activity due to recurrent excitation in the form of a spatially localized pulse or bump has been proposed as a mechanism for feature selectivity in models of the visual system (Somers, Nelson, & Sur, 1995; Hansel & Sompolinsky, 1998), the head direction system (Skaggs, Knierim, Kudrimoti, & McNaughton, 1995; Zhang, 1996; Redish, Elga, & Touretzky, 1996), and working memory (Wilson & Cowan, 1973; Amit & Brunel, 1997; Camperi & Wang, 1998). Many of the previous mathematical formulations of such structures have employed population rate models (Wilson & Cowan, 1972, 1973; Amari, 1977; Kishimoto & Amari, 1979; Hansel & Sompolinsky, 1998). (See Ermentrout, 1998, for a recent review.) Here, we consider a network of spiking neurons that shows such structures and investigate their properties.
In our network we find localized time-stationary states (bumps), which may be analogous to the structures measured in the experiments of Colby, Duhamel, and Goldberg (1995) and Funahashi, Bruce, and Goldman-Rakic (1989). The network is bistable, with the bump and the "all-off" state both being stable. Note that the neurons are not intrinsically bistable, as in Camperi and Wang (1998), and the bump solutions do not arise from a Turing-Hopf instability, like that studied by

Neural Computation 13, 1473–1494 (2001) © 2001 Massachusetts Institute of Technology
Bressloff and Coombes (1998) and Bressloff, Bressloff, and Cowan (1999); there is no continuous path in parameter space connecting a bump and the all-off state. A time-stationary solution is one that corresponds to asynchronous firing of neurons where the firing rate is constant at each spatial point but the rate depends on spatial location. We show that the activity profile of the bumps of our model is the same as that of a corresponding population rate model. However, bumps predicted by the rate model to be stable may in fact be unstable in a model that includes the spiking dynamics of the neurons. The rate model implicitly assumes asynchronous firing and considers only the dynamics of the firing rate. As the synaptic decay time is increased in the spiking network, the bump can lose stability as a result of temporal correlation or “partial synchronization” of neurons involved in the bump. If the initial conditions are symmetric, then this synchronization causes the input to the neurons to drop below the threshold required to keep it firing, leading to cessation of oscillation of the neurons and consequently the rest of the bump. However, for generic initial conditions or with the inclusion of noise, the bump destabilizes to a traveling wave. For fast enough synapses, the wave cannot exist. If some heterogeneity in the intrinsic properties of the neuron is included, then the bump can be “pinned” to a fixed location; the traveling wave does not form, and the bump loses stability to the all-off state. This instability provides a mechanism for the termination of a bump, as would be required at the end of a memory task (e.g., the delayed saccade task discussed by Colby et al., 1995): if many of the neurons involved in the bump can be caused to fire approximately simultaneously and the synaptic timescale is short, there will not be enough input after this coincident firing to sustain activity, and the network will switch to the all-off state. 
2 Neuron Model

We consider a network of N integrate-and-fire neurons whose voltages, v_i, obey the differential equations

dv_i/dt = I_i − v_i + Σ_{j,m} (J_ij/N) α(t − t_j^m) − Σ_l δ(t − t_i^l),   (2.1)

where the subscript i indexes the neurons, t_j^m is the mth firing of neuron j, defined by the times that v_j(t) crosses the threshold, which we have set to 1, I_i is the input current applied to neuron i, and δ(·) is the Dirac delta function, which resets the voltage to zero. The function α(t) is a postsynaptic current and is nonzero only for t > 0. The connection weight between neuron i and neuron j is J_ij. The sum over m and l extends over the entire firing history of the neurons in the network, and the sum over j extends over the network. Each time the voltage crosses the threshold from below the neuron is said to
"fire." The voltage then immediately resets to v_i = 0, and a synaptic pulse α(t) is sent to all connected neurons. In our examination of bump solutions, we will consider subthreshold input (I_i < 1) and a weight matrix that is translationally invariant (i.e., J_ij depends on only |i − j|). It is of the lateral inhibition form (locally excitatory but distally inhibitory); this type of connectivity matrix can be shown to arise from a multilayer network with both inhibitory and excitatory populations if the inhibition is fast, as shown by Ermentrout (1998). We can formally integrate equation 2.1 to obtain the spike response form (Gerstner, 1995; Gerstner, van Hemmen, & Cowan, 1996; Chow, 1998). This form will allow us to relate the bump profile for the integrate-and-fire network to the profile of a rate model similar to that studied by Amari (1977). Suppose that neuron i has fired in the past at times t_i^l, where l = 0, −1, −2, . . . , −∞. The neuron most recently fired at t_i^0. We consider the dynamics for t > t_i^0. Integrating equation 2.1 yields

v_i(t) = I_i (1 − e^{−(t − t_i^0)}) + Σ_{j,m} (J_ij/N) ∫_{t_i^0}^{t} e^{s−t} α(s − t_j^m) ds.   (2.2)

By breaking up the integral in equation 2.2 into two pieces, we obtain

v_i(t) = I_i (1 − e^{−(t − t_i^0)}) + Σ_{j,m} (J_ij/N) ∫_{−∞}^{t} e^{s−t} α(s − t_j^m) ds − e^{−(t − t_i^0)} Σ_{j,m} (J_ij/N) ∫_{−∞}^{t_i^0} e^{s−t_i^0} α(s − t_j^m) ds,   (2.3)

from which we obtain the spike response form

v_i(t, s) = I_i − [I_i + u_i(s)] e^{−(t−s)} + u_i(t),   t > s,   v_i(s, s) = 0,   (2.4)

where

u_i(t) = Σ_{j,m} (J_ij/N) ε(t − t_j^m),   (2.5)

and

ε(t) = ∫_0^t e^{s−t} α(s) ds.   (2.6)

We normalize ε(t) so that ∫_0^∞ ε(t) dt = 1. Note that for low rates of firing, [I + u_i(s)] e^{−(t−s)} ≃ e^{−(t−s)}.
Figure 1: The weight function J_ij = N J(|i − j|/N) for i = 50 and N = 100, where J(z) = 5[1.1 w(1/28, z) − w(1/20, z)], and w(a, z) = (aπ)^{−1/2} exp(−z²/a).
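The lateral-inhibition weight profile of Figure 1 is straightforward to tabulate. A short sketch, with parameter values taken directly from the caption:

```python
import math

def w(a, z):
    # w(a, z) = (a*pi)^(-1/2) * exp(-z^2/a), from the Figure 1 caption
    return math.exp(-z * z / a) / math.sqrt(a * math.pi)

def J(z):
    # difference-of-gaussians profile J(z) = 5[1.1*w(1/28, z) - w(1/20, z)]
    return 5.0 * (1.1 * w(1.0 / 28, z) - w(1.0 / 20, z))

N, i = 100, 50
row = [N * J(abs(i - j) / N) for j in range(N)]  # J_ij = N*J(|i-j|/N)
print(row[i] > 0.0)                # local coupling is excitatory
print(min(row) < 0.0)              # distal coupling is inhibitory
print(row[i - 10] == row[i + 10])  # symmetric about neuron i
```

The narrower gaussian (a = 1/28) dominates near z = 0, giving local excitation; the wider one (a = 1/20) wins at larger distances, giving the inhibitory surround.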
2.1 Numerical Methods. We performed numerical simulations of the integrate-and-fire network using the spike response form, equation 2.2. We step equation 2.2 forward using a fixed time step, Δt, until one or more of the voltages is above threshold (set to be 1). Assume that v_i([n + 1]Δt) > 1 while v_i(nΔt) < 1, for some n. At this point, a backward linear interpolation in voltage is made to determine the approximate firing time of neuron i. To determine the approximate correct value of v_i([n + 1]Δt), we make an Euler step from the approximate last firing time of neuron i to time [n + 1]Δt, using equation 2.1 for the slope. The equations are then stepped forward again. The domain is the unit interval with periodic boundary conditions, and the weight function involves a difference of gaussians (see Figure 1 for an example). For the synaptic pulse, α(t), we take

α(t) = β exp(−βt),   (2.7)

so that

ε(t) = β[e^{−t} − e^{−βt}]/(β − 1).   (2.8)
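The closed form in equation 2.8 can be checked directly against the defining integral, equation 2.6. A minimal sketch (the quadrature rule and its resolution are our own choices):

```python
import math

beta = 1.5  # synaptic decay parameter

def alpha(t):
    # synaptic pulse, eq. 2.7
    return beta * math.exp(-beta * t) if t >= 0.0 else 0.0

def eps_closed(t):
    # closed form, eq. 2.8
    return beta * (math.exp(-t) - math.exp(-beta * t)) / (beta - 1.0)

def eps_quadrature(t, n=20000):
    # direct evaluation of eq. 2.6, eps(t) = int_0^t e^(s-t) alpha(s) ds,
    # by the trapezoid rule
    h = t / n
    f = [math.exp(k * h - t) * alpha(k * h) for k in range(n + 1)]
    return h * (0.5 * f[0] + sum(f[1:-1]) + 0.5 * f[-1])

print(abs(eps_closed(2.0) - eps_quadrature(2.0)) < 1e-6)  # True
# Normalization: int_0^inf eps(t) dt = (beta/(beta-1)) * (1 - 1/beta) = 1,
# as required by the text after eq. 2.6.
```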
The parameter β affects the rate at which the postsynaptic current decays. Noise is added to the network as current pulses to each neuron of the form

I_rand(t) = 6(e^{−10t} − e^{−15t}),

where t ≥ 0. The arrival times of these pulses have a Poisson distribution with mean frequency 0.05, and there is no correlation between pulse arrival times for different neurons.

3 Existence of the Bump State

We examine the existence of bump solutions of the spike response system described by equation 2.4. A bump solution is spatially localized with spatially dependent average firing rate of the participating neurons. The firing rate is zero outside the bump and rises from zero at the edges to a maximum in the center. The firing times of the neurons are uncorrelated, so the bump is a localized patch of incoherent or asynchronous firing. The state coexists with the homogeneous nonfiring (all-off) state. It is convenient to define the activity of neuron i as

A_i(t) = Σ_l δ(t − t_i^l),   (3.1)
where the sum over l is over all past firing times. Our activity differs from the population activity of Gerstner (1995), which considers the activity of an infinite pool of neurons at a given spatial location. We can then rewrite the synaptic input 2.5 in terms of the activity as

u_i(t) = Σ_j (J_ij/N) ∫_0^∞ ε(s) A_j(t − s) ds.   (3.2)

Consider stationary asynchronous solutions to the spike response equations. Many authors have studied the spatially homogeneous asynchronous state with various coupling schemes (Abbott & Van Vreeswijk, 1993; Treves, 1993; Gerstner, 1995, 1998, 2000). Our approach is similar to that of Gerstner (1995, 1998, 2000). We first rewrite the activity as

A_i(t) = A_i^0 + ΔA_i(t),   (3.3)

where

A_i^0 = lim_{τ→∞} (1/τ) ∫_0^τ A_i(r) dr.   (3.4)

Substituting equation 3.1 into 3.4 then yields A_i^0 = lim_{τ→∞} n(τ)/τ, where n(τ) is the number of times neuron i fired in the time interval τ. Thus, A_i^0 is the mean firing rate of neuron i. We now insert equation 3.3 into 3.2 to obtain u_i(t) = u_i^0 + Δu_i(t), where

u_i^0 = Σ_j (J_ij/N) A_j^0,   (3.5)
and

Δu_i(t) = Σ_j (J_ij/N) ∫_0^∞ ε(s) ΔA_j(t − s) ds   (3.6)
vi ((A0i )−1 + s, s) = 1 = Ii − [Ii + u0i ]e−(Ai )
+ u0i .
(3.7)
Solving equation 3.7 yields A0i = G[u0i ],
(3.8)
where ½ G[z] =
0, ¤ £ , −1/ ln I+z−1 I+z
z≤1−I z > 1 − I.
(3.9)
(A plot of G[z] can be seen in Figure 2.) This form is similar to the usual neural network rate equation (Amari, 1977; Kishimoto & Amari, 1979; Hansel & Sompolinsky, 1998; Ermentrout, 1998) except that the gain function we have derived is a result of the intrinsic neuronal dynamics of our model (Gerstner, 1995). Combining equations 3.5 and 3.8, we obtain the condition for a stationary asynchronous solution: u0i =
X Jij G[uj0 ]. N j
(3.10)
For a finite sized system, the time-averaged firing rate of the neurons follows a profile given by Aj0 = G[uj0 ]. We first considerPmean-field solutions to equation 3.10. We assume that u0i = u0 , Ii = I and Jij /N = J, yielding u0 = JG[u0 ].
(3.11)
If I > 1 (oscillatory neurons), then there are no solutions if J is too large
Stationary Bumps in Networks of Spiking Neurons
1479
3.5 3
JG[z]
2.5 2
1.5 1 0.5 0 0
0.5
1 z
1.5
2
Figure 2: A multiple (J) of the gain function G[z] (see equation 3.9) (solid line) for I = 0.2 and J = 2, and the diagonal (dashed). The threshold occurs at z = 1 − I, when 1 is the voltage threshold for the integrate-and-fire neuron. The intersection of the dashed line and the gain function gives the mean-field solution to equation 3.11.
and one solution if J is small enough. For I < 1 (excitatory neurons), equation 3.11 has one solution at u0 = 0 if J is too small and two solutions if J is large enough (See Figure 2, which shows the case J = 2.) These two states correspond to an “all-off” state and an “all-on” state, respectively. 3.1 Bump State. In order for a bump to exist, a solution to equation 3.10 for u0i must be found such that u0i + Ii is above threshold (u0i + Ii > 1) in a localized region of space. We show example figures of such solutions in Figures 3 and 4. Amari (1977) and Kishimoto & Amari (1979) proved that such a solution can exist for a class of gain functions G[z]. Similar to the mean-field solution, we find that for subthreshold input (Ii < 1), the all-off state always exists, and the bump state can exist if the weight function has enough excitation. We will discuss stability of the bump in section 4. Stability will be affected by the synaptic timescale, the weight function, the amount of applied current, and the size of the network. For a finite-sized system, we show in the appendix that the individual neurons in a bump do not fire with a fixed spatially dependent period. These finite-sized fluctuations act as a source of noise. As noted by Gerstner (1995, 1998), the spike response model can be connected to classical neural network or population rate models (Wilson & Cowan, 1972; Amari, 1977; Hopfield, 1984). If we choose ²(s) = e−s ,
which is true for α(t) = δ(t), and assume asynchronous firing so that A_i(t) ≃ G[u_i(t)], then by differentiating equation 3.2 with respect to time we obtain

du_i(t)/dt = −u_i(t) + Σ_j J_ij G[u_j(t)].   (3.12)
This is the classical neural network or population rate model. Amit and Tsodyks (1991), Gerstner (1995), and Shriki, Sompolinsky, and Hansel (1999) have previously shown that networks of spiking neurons can be represented with rate models provided there does not exist a large degree of synchrony. The condition for a stationary solution of equation 3.12 is identical to equation 3.10. If the synapse has a more complicated time course, such as a difference of exponentials, then a higher-order rate equation could be derived (Ermentrout, 1998). The activity is a functional of the input as well as a function of time. The assumption made in deriving equation 3.12 is that the explicit time dependence of the activity is weak compared to the dependence on the input u_i(t). This is valid only for weakly synchronous or correlated firing. For strongly synchronous firing, the explicit time dependence of the activity would dominate the functional dependence on the inputs.

3.2 Numerical Simulations. We compare stationary bump profiles obtained from both equation 3.12 and a network of integrate-and-fire neurons. A space-time raster plot of the firing times of a stationary bump from a simulation of a network of 100 integrate-and-fire neurons is shown in Figure 3. The network was switched into the bump state by applying a transient spatially localized current for sufficient time to excite the bump. This state has a large basin of attraction; as long as neurons are excited in a localized region, the network relaxes into a bump of the form shown in Figure 3. The raster plot shows that the bump is localized in space and persists for many firing times. The neurons in the center of the bump fire much faster than those near the edges. Figure 4 shows the profile of the average firing rate (activity) for the integrate-and-fire system, equation 2.2, with three different values of β and no noise. The solid line corresponds to the theoretically predicted profile from equation 3.10.
This corresponds to the stable stationary solution of the corresponding rate model, equation 3.12, with the gain function, equation 3.9. The comparison for small β (i.e., slow synapses) is excellent; as β is increased, the agreement lessens but is still very good. We have also found bumps in networks of conductance-based neurons (results not shown; see Gutkin, Laing, Chow, Colby, & Ermentrout, 2000, for an example). We find the existence of a bump solution for a network of spiking neurons to be robust.
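The rate-model half of this comparison can be sketched in a few lines: forward-Euler integration of a discretized equation 3.12 on a ring, with the Figure 1 weight profile and I = 0.9, and a transient localized stimulus to excite the bump. Note the 1/N normalization of the weight sum (so that it approximates a continuum convolution), the stimulus amplitude, and the integration parameters are all our own choices, not the authors':

```python
import math

def w(a, z):
    # normalized gaussian, w(a, z) = (a*pi)^(-1/2) * exp(-z^2/a)
    return math.exp(-z * z / a) / math.sqrt(a * math.pi)

def G(z, I):
    # gain function, eq. 3.9
    if z <= 1.0 - I:
        return 0.0
    return -1.0 / math.log((I + z - 1.0) / (I + z))

def rate_model_bump(N=60, T=30.0, dt=0.05, I=0.9, t_stim=10.0):
    """Forward-Euler integration of du_i/dt = -u_i + (1/N) sum_j J_ij G[u_j]
    (cf. eq. 3.12) on a ring; returns the final activity profile G[u_i]."""
    Jfun = lambda z: 5.0 * (1.1 * w(1.0 / 28, z) - w(1.0 / 20, z))
    dist = lambda i, j: min(abs(i - j), N - abs(i - j)) / N  # periodic domain
    J = [[Jfun(dist(i, j)) for j in range(N)] for i in range(N)]
    u = [0.0] * N
    for n in range(int(T / dt)):
        t = n * dt
        conv = [sum(J[i][j] * G(u[j], I) for j in range(N)) / N
                for i in range(N)]
        for i in range(N):
            # transient localized input, switched off at t_stim
            stim = 0.5 if (t < t_stim and abs(i - N // 2) < N // 10) else 0.0
            u[i] += dt * (-u[i] + conv[i] + stim)
    return [G(ui, I) for ui in u]

A = rate_model_bump()
active = [i for i, a in enumerate(A) if a > 0.0]
print(len(active), max(A))  # a localized group stays active after the stimulus
```

Because the weight function is locally excitatory and distally inhibitory, the activated patch sustains itself after the stimulus is removed but cannot spread across the ring, reproducing the bistability between the bump and the all-off state.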
Figure 3: An example of a stationary bump. The dots represent the firing times of a network of 100 neurons. Noise in the form of random current pulses with a mean frequency of 0.05 as discussed in section 2.1 was included in the simulation. Parameter values used were β = 1.5, I = 0.9, and Jij , as in Figure 1.
4 Stability and the Synaptic Timescale

In order for the bump to be observable, it must be stable to perturbations. A stability analysis of the spike response system can be performed by considering the linear behavior of small perturbations around the stationary bump state. However, for the spatially inhomogeneous bump, this computation is quite involved. Instead, we infer the conditions for stability of the bump from a stability analysis of the homogeneous asynchronous state of the spike response model and confirm our conjectures with numerical simulations. Stability of the bump state has previously been examined in a first-order
Figure 4: Activity profile (mean firing rate) for β = 2.5 (◦), β = 1.5 (+), and β = 0.5 (×) for the integrate-and-fire network and for the rate model, equation 3.12, (solid line). I = 0.9, and the weight function is as in Figure 1. Note that for smaller β, the agreement between the firing-rate model and the integrate-and-fire model is better. The errors in the activity values are all less than 0.01.
rate model. Amari (1977) and Kishimoto and Amari (1979) found for saturating gain functions that the stability of the bump in the rate model depended on a relationship between the weight function and the applied current. Hansel and Sompolinsky (1998) found the stability constraints for a model with a simplified weight function on a periodic domain and a piecewise linear gain function. However, as discussed in section 3.1, the rate model is valid only for infinitely fast synapses and asynchronously firing neurons. If correlations develop between firing times of neurons, the rate model is no longer valid. It has been shown previously for homogeneous networks (Abbott & Van Vreeswijk, 1993; Treves, 1993; Gerstner, 1995, 1998, 2000) that the asynchronous state of a network of integrate-and-fire neurons is unstable to oscillations with fast excitatory coupling. Gerstner (1998, 2000) has shown that oscillations will develop at harmonics of the average firing rate (activity). The addition of noise helps to stabilize the asynchronous state. However, for low to moderate levels of noise, a fast enough synapse can still cause an instability. These calculations are based on perturbing the stationary asynchronous state for an infinite number of neurons using a Fokker-Planck or related integral formalism. We note that there are two sources of noise in the simulations: the randomly arriving currents, I_rand, and the fluctuations due to the finite size of the network. The latter is a manifestation of unpredictability in a high-dimensional, deterministic, dynamical system.

The integrate-and-fire network has a parameter, β, that controls the timescale of the synaptic input. We find that varying β has a profound effect on the stability of the bump solution. If β is slowly increased (i.e., the synaptic timescale is slowly shortened) while all other parameters are held constant, the bump will eventually lose stability. The initial location of a bump is determined by the spatial structure of the initial current stimulation. If the initial condition is symmetric and no noise is added, the bump will remain in place and then lose stability directly to the all-off state at a critical value of β. However, as a result of the invariance of the network under spatial translations, a bump is marginally stable with respect to spatial translation; there is a continuum of attractors parameterized by their spatial location rather than a finite set of isolated attractors. A consequence of this marginal stability is that a bump may "wander" under the influence of noise or finite-size fluctuations. With asymmetric initial conditions or noise, the bump will begin to wander as β is increased (see Figure 5). (Figure 10 shows the bump wandering for a fixed value of β.) At a larger value of β, the wandering bump loses stability to a traveling wave. Note that the wave speed increases as β increases. The same type of behavior was observed for larger networks (results not shown). If the wave hits an obstruction, it will switch to the all-off state. The bump can be pinned in place if a small amount of disorder or heterogeneity is included in the input current. In Figure 6 we have added disorder by randomly choosing the fixed currents, I_i, keeping the average of these values across the network equal to the value used in Figure 5. The bump migrates to a local maximum in the current and remains there. For sufficiently large β, it destabilizes into the all-off state, as in Figure 6.
The pinning due to disorder is sufficiently strong that traveling waves cannot persist. Note that small patches of neurons can be activated by noise, but they cannot persist if β is too large. We conjecture that the bump loses stability at large β because the asynchronous bump state is destabilized by the synchronizing tendency of neurons with fast excitatory coupling, as is seen in the homogeneous network.

Integrate-and-fire neurons belong to what is known as type I or class I neurons (Hansel, Mato, & Meunier, 1995; Ermentrout, 1996). It is known that for type I neurons, fast excitation has a synchronizing tendency, whereas slow excitation has a desynchronizing tendency (Van Vreeswijk, Abbott, & Ermentrout, 1994; Gerstner, 1995; Hansel et al., 1995). Chow (1998) showed that a network with heterogeneous input is still able to synchronize. The stationary bump corresponds to an asynchronous network state where the neurons receive heterogeneous input. For fast enough excitatory coupling, the asynchronous state might lose stability, and oscillations will develop. The synchronous oscillations need not be exceptionally strong.
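As a rough illustration of the protocol described above (not the paper's exact model), the following sketch ramps the synaptic rate β quasi-statically in a ring of leaky integrate-and-fire neurons. The Mexican-hat kernel, the synaptic model (decay at rate β with jumps of size β, so the integrated input per spike stays constant), and all parameter values are illustrative assumptions; a sustained bump is not guaranteed at these values.

```python
import numpy as np

def ring_weights(n, j_e=8.0, j_i=5.0, sig_e=3.0, sig_i=9.0):
    # Illustrative Mexican-hat coupling (local excitation, broader inhibition)
    # standing in for J(z); the paper's exact kernel is given in Figure 5.
    d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    d = np.minimum(d, n - d)  # distance on the ring
    return (j_e * np.exp(-(d / sig_e) ** 2) - j_i * np.exp(-(d / sig_i) ** 2)) / n

def simulate(beta0=3.1, beta1=8.3, I=0.8, n=100, T=100.0, dt=0.01):
    # Quasi-static ramp of the synaptic rate beta, as in Figure 5.
    # Synapses decay at rate beta and jump by beta on each presynaptic spike,
    # keeping the integrated synaptic input per spike constant (an assumption
    # consistent with the compensation described in the text).
    J = ring_weights(n)
    x = np.arange(n)
    v = 1.02 * np.exp(-((x - n // 2) / 8.0) ** 2)  # bump-shaped initial state
    s = np.zeros(n)
    spike_times, spike_ids = [], []
    steps = int(T / dt)
    for k in range(steps):
        beta = beta0 + (beta1 - beta0) * k / (steps - 1)
        v += dt * (I - v + J @ s)        # leaky integrate-and-fire dynamics
        s -= dt * beta * s               # synaptic decay at rate beta
        fired = v >= 1.0
        if fired.any():
            v[fired] = 0.0               # reset after a spike
            s[fired] += beta             # area-preserving synaptic kick
            spike_times.extend([k * dt] * int(fired.sum()))
            spike_ids.extend(np.flatnonzero(fired).tolist())
    return np.array(spike_times), np.array(spike_ids)
```

Plotting `spike_ids` against `spike_times` gives a raster in the style of Figures 5 and 6.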
Carlo R. Laing and Carson C. Chow
Figure 5: Bump destabilization due to quasi-static increase of β. Here, β is linearly increased in time from β = 3.1 at the top of the figure to β = 8.3 at the end. The weight function is J(z) = 5[1.16w(1/28, z) − w(1/20, z)] (see the caption of Figure 1 for the definition of w), I = 0.8, and the noise is as in Figure 3. The bump loses stability first to wandering and then to a traveling wave.
Stationary Bumps in Networks of Spiking Neurons

Figure 6: Bump destabilization due to quasi-static increase of β with disorder added to the static background current. The range of β, the weight function, the noise, and the average value of I are as in Figure 5. Disorder pins the bump and prevents the formation of a traveling wave.

All that is required is that large enough oscillations develop and induce a traveling wave. For symmetric initial conditions, the symmetry prevents the formation of a traveling wave. The neurons oscillate in place, and with enough synchronization the synaptic input is no longer high enough to push some of the neurons in the bump above threshold when it is time for them to fire. The bump solution then collapses, switching the network to the all-off state. This also occurs in the case with heterogeneity: the pinned bump develops stationary oscillations and switches to the all-off state.

For a fixed weight function, we find that when I is at a value for which the rate model predicts the bump to be stable, the bump in the integrate-and-fire network is stable for small β but becomes unstable as β increases. This is shown in Figure 7, where we show the region of the I–β plane in which there is a stable bump solution in the integrate-and-fire network with no noise for N = 100. The stability region, as well as the size and shape of the bump, depends on the applied current I, the weight function J(x), and the size of the network, but is unchanged by adding noise of the intensity used in the other simulations in this article.

The finite size fluctuations are important for the dynamics. Figure 8 shows a plot of ui(t) at the center of a bump containing 100 neurons for two different values of β. The traces show noisy oscillations at a given frequency around an average value. The dominant frequency of the oscillations is at the neuron's average firing rate. As β increases, the amplitude of the
Figure 7: Numerically obtained stability boundary (circles) for a bump in the noise-free integrate-and-fire network for Jij as in Figure 1 and N = 100, as a function of I and β. The bump is stable below the curve marked by circles. The solid line shows the existence region for the bump in the rate model: it exists to the right of the line but not to the left. Above the curve marked by circles, the bump exists and is stable in the rate model but is unstable in the integrate-and-fire model. As discussed in the text, the curve marked by circles will move to larger β as N is increased. This curve is not significantly changed when noise of the level used in the other simulations shown is added.
noisy oscillations increases. We conjecture that the increase in the size of the oscillations is partially due to the dynamical synchronizing effect described above. For small N, the noisy oscillations are dominated by finite size fluctuations, but for large N, they will be dominated by the synchronizing effect of the fast excitation. We believe that the destabilizing effect of the oscillations will be present even for infinite N. While increasing N does decrease the size of the fluctuations, this is more than compensated by the increase in oscillation size as β increases.

In Figure 9 we plot the maximum value of β for which a bump is stable as a function of N for various noise values. The plot shows that the stability boundary asymptotes to a fixed value of β for large values of N. This plot suggests that for any N, there is a finite β above which the bump is no longer stable, even in the presence of noise. As expected, the bump can tolerate higher levels of β as the noise level increases, although the maximum β seems to saturate as noise increases.

From Figures 4 and 8, we see that as β is increased, the activity and mean value of u at each neuron decrease slightly, despite the fact that we have compensated the synaptic strength so the integrated synaptic input is
Figure 8: ui(t) at the center of a bump for β = 0.9 (upper line) and β = 1.9 (lower line) for I = 0.9, Jij as in Figure 1, and noise as in Figure 3. As β increases, the size of the noisy oscillations increases.
Figure 9: Values of β above which the bump is not stable, as a function of N, for I = 0.91, Jij as in Figure 1; the mean frequencies of the random current pulses have values: no noise (◦), 0.05 (×), 0.1 (∗), and 0.2 (⋄).
constant. We believe that this small decrease is a result of the fluctuations in the input due to the finite number of neurons and the synchronizing dynamical effect. The termination of the bump as β is increased is not due to this overall decrease in synaptic input: if the coupling weight is increased to keep the mean value of u constant (by multiplying J(z) by a factor slightly greater than 1), the bump is still seen to terminate as β is increased. Indeed, numerical results suggest that the mean value of u decreases linearly (albeit weakly) with β, while its standard deviation increases at a faster than linear rate (results not shown).

5 Initiating and Terminating Bumps

Since bumps are thought to be involved in such functions as working memory (Colby et al., 1995; Funahashi et al., 1989) and head direction systems (Redish et al., 1996; Zhang, 1996), understanding their dynamics on a timescale much longer than the period of oscillation of the individual neurons is important. In particular, being able to “turn on” and “turn off” a bump is of interest. Since the input current I is subthreshold, the network is bistable, the two stable states being the bump and the all-off state, in which none of the neurons fires. Turning the bump on is simply a matter of shifting the system from the all-off state into the basin of attraction of the bump. This is done by adding a spatially localized input current to the already existing uniform current for a short period of time—on the order of five oscillation periods of the neuron in the center of the bump (whose activity is highest), as is shown in Figure 10. Note that the bump persists after the stimulation has been removed. As we have seen, partially synchronizing the neurons can cause the bump to terminate. One mechanism for causing some of the neurons to synchronize is to apply a brief, strong, excitatory current to most or all of the neurons involved in the bump.
This will cause a number of the neurons to fire together (i.e., be temporarily synchronized), leading to termination of the bump if the synapse is fast enough. An example is shown in Figure 10, where a short stimulus is applied to all of the neurons involved in the bump, although not to all of the neurons in the network. The stimuli to turn on and turn off the bump are both excitatory; the stimulus that turns off a bump does not have to be inhibitory, although an inhibitory stimulus is also successful. The reason the bump turns off is a dynamical effect. An alternate means of turning off the bump with excitation is to use the lateral inhibition in the network: if the parameters are tuned correctly, then a strong excitatory input to all of the neurons in the network will induce enough inhibition to reduce the synaptic input below threshold and extinguish the bump. As discussed in section 4, the bump is seen to wander in Figure 10; a small amount of disorder could prevent this by pinning the bump to a given location. We explore the initiation and termination of the bumps more carefully in a companion publication (Gutkin et al., 2000).
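The on/off protocol can be written down concretely. The sketch below builds only the external stimulus current for a Figure 10 style experiment; the 0.4 peak amplitude above background is taken from the Figure 10 caption, while the pulse timings, spatial width, and the 0.05 cutoff defining the bump region are hypothetical choices for illustration.

```python
import numpy as np

def stimulus_current(n=100, T=100.0, dt=0.01, center=50, width=8.0,
                     on=(5.0, 10.0), off=(60.0, 61.0), amp=0.4):
    # Hypothetical reconstruction of the Figure 10 stimulation protocol.
    # Turn-on: a spatially localized excitatory pulse lasting several
    # oscillation periods. Turn-off: a brief excitatory pulse applied
    # uniformly to the neurons in the bump region (not the whole network).
    t = np.arange(int(T / dt)) * dt
    idx = np.arange(n)
    shape = np.exp(-((idx - center) / width) ** 2)  # spatial profile of turn-on pulse
    I_ext = np.zeros((t.size, n))
    I_ext[(t >= on[0]) & (t < on[1])] = amp * shape            # localized turn-on
    I_ext[(t >= off[0]) & (t < off[1])] = amp * (shape > 0.05)  # bump-wide turn-off
    return I_ext
```

`I_ext[k]` is then added to the uniform background current I at time step k.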
Figure 10: Turning on and off a bump with excitation. The plot on the left shows the firing times and that on the right shows the input current (maximum amplitude above background 0.4). The coupling weight is as in Figure 1, β = 2.4, I = 0.9, and noise is as in Figure 3.
6 Discussion and Conclusion

We have shown that a one-dimensional network of spiking neurons can sustain spatially localized bumps of activity and that the profiles of the bumps agree with those predicted by a corresponding population rate model. However, when the synapses occur on a fast timescale, bumps can no longer be sustained in the network; they either lose stability to traveling waves or completely switch off. We also find that heterogeneity or disorder can pin the bumps to a single location and keep them from wandering. We conjecture that the loss of stability of the bump is due to partial synchronization between the neurons. It is known for homogeneous networks of type I neurons that fast excitatory synapses have a synchronizing tendency. We use this instability to turn off bumps with a brief excitatory stimulus that partially synchronizes the neurons.

For the network sizes that we have probed, we have found that bumps can be sustained by synapses with decay rates as fast as three to four times the firing rate of the fastest neurons in the bump. If we consider neurons in the cortex to be firing at approximately 40 Hz, this would correspond to synaptic decay times of the order of 5 to 10 ms, which is not unreasonable. Results with conductance-based neurons have found that the synaptic timescale can be sped up to well within the AMPA range and still sustain a bump state (Gutkin et al., 2000). We also find that as the network size increases, the bump may tolerate faster synapses.

While the stability of the bump depends crucially on the synaptic timescale, the activity profile of the bump depends only on the connection weights and the gain function. Thus, it may be possible to make predictions about the connectivity patterns of experimental cortical systems from the firing rates of the neurons within the bump and the firing rate (F-I) curve of individual neurons.

If these recurrent bumps are involved in working memory tasks, then our results lead to some experimental predictions. For example, if it is possible to speed up the excitatory synapses in the cortex pharmacologically, bump formation and hence working memory may be perturbed. A brief stimulus applied to the cortical area where the working memory is thought to be held may also disrupt a working memory task. Among other authors who have produced similar work are Hansel and Sompolinsky (1998), Bressloff et al. (1999), and Compte, Brunel, and Wang (1999).
Hansel and Sompolinsky (1998) consider a rate model similar to that studied by Amari (1977) and Kishimoto and Amari (1979), using a piecewise linear gain function (our G[z]) and retaining only the first two Fourier components of the weight function J, which allows them to make analytic predictions about the transitions between different types of behavior. They also show the existence of a bump in a network of conductance-based model neurons and show that bumps can follow moving spatially localized current stimulations, a feature that may be relevant for head-direction systems such as those studied by Redish et al. (1996) and Zhang (1996).

Bressloff and Coombes (1998) and Bressloff et al. (1999) study pattern formation in a network of coupled integrate-and-fire neurons, but their systems consider suprathreshold input (Ii > 1), so that the all-off state is not a solution. They find that by increasing the coupling weight between neurons, the spatially uniform synchronized state (all neurons behave identically) becomes unstable through a Turing-Hopf bifurcation, leading to spatial patterns similar to those shown in Figure 4. They find bistability between a bump and a spatially uniform synchronized state, whereas we find bistability between a bump and the all-off state. This difference is crucial if the
system is to be thought of as modeling working memory as investigated by, among others, Colby et al. (1995) and Funahashi et al. (1989). Compte et al. (1999) have demonstrated the existence of a bump attractor in a two-layer network of excitatory and inhibitory integrate-and-fire neurons. Their network involves strong excitation and inhibition in a balanced state. It is possible that a corresponding rate model could be found for this network to obtain the shape of the profile. They were also able to switch the bump off and on with an excitatory stimulus. However, it is believed that their switching-off mechanism is due to the inhibitory input induced from the excitation provided.

We have also observed bumps in two-dimensional networks. A number of other authors, including Fohlmeister, Gerstner, Ritz, and van Hemmen (1995) and Horn and Opher (1997), have investigated two-dimensional networks of integrate-and-fire neurons, but not in the context of bumps. Preliminary investigations (results not shown) indicate that bumps can be initiated and terminated in two-dimensional networks in exactly the same fashion as for one-dimensional networks. We emphasize that while the bumps are spatially localized, they need not be contiguous. It may be that a certain amount of disorder in the synaptic connectivity would lead to a bump that is distributed over a given region. This disorder may even confer some beneficial effects by breaking the translational symmetry and keeping the bump from wandering under the influence of stochastic firing. We hope to elucidate these effects in the future.

Appendix: Nonperiodicity of Bump Solutions

Here we show that locally periodic solutions are not possible for a bump in a finite-sized integrate-and-fire network. We consider the ansatz of periodic firing given by t_i^m = (φ_i − m_i)T_i, where φ_i is a phase, m_i is an integer, and T_i is the local firing period. We suppose that the neuron previously fired at t_i^{−1} and will next fire at t_i^{0}.
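The core of this argument can be checked numerically: the synaptic input contributed by one neuron firing with period T_j is T_j-periodic, but a sum of contributions with unequal periods is not periodic in any single T_i. In the sketch below the kernel ε is taken to be an alpha function, an assumption made purely for illustration.

```python
import numpy as np

def eps(t, beta=2.0):
    # Illustrative causal kernel (alpha function); the paper's epsilon has a
    # timescale set by beta, but any causal kernel works for this check.
    tt = np.maximum(t, 0.0)
    return np.where(t > 0, beta**2 * tt * np.exp(-beta * tt), 0.0)

def train(t, phi, T, m_range=200):
    # f(t) = sum over m of eps(t - (phi - m) T): the input from one neuron
    # firing periodically with phase phi and period T (the periodic ansatz).
    m = np.arange(-m_range, m_range)
    return eps(t[:, None] - (phi - m[None, :]) * T).sum(axis=1)

t = np.linspace(50.0, 60.0, 1001)
f1 = train(t, phi=0.3, T=1.0)
f2 = train(t, phi=0.1, T=np.sqrt(2.0))  # incommensurate period

# A single train is periodic with its own period T_j ...
assert np.allclose(train(t + 1.0, 0.3, 1.0), f1, atol=1e-8)
# ... but a train with a different period is not periodic in T_1, so the
# summed input cannot be T_i-periodic, ruling out locally periodic bumps.
assert not np.allclose(train(t + 1.0, 0.1, np.sqrt(2.0)), f2, atol=1e-4)
```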
We now substitute this ansatz into the firing equation, equation 2.4, to obtain

1 = I_i − [I_i − u_i(t_i^{−1})] e^{−T_i} + u_i(t_i^{0}),        (A.1)

where

u_i(t) = Σ_j J_ij Σ_{m_j} ε(t − (φ_j − m_j)T_j).        (A.2)

In order to find a solution for T_i that is constant in time, we require u_i(t) to be constant or T_i-periodic in t. However, Σ_{m_j} ε(t − (φ_j − m_j)T_j) ≡ f(t) is T_j-periodic, and T_j is not constant in j for a bump solution. This implies that u_i(t) cannot be T_i-periodic for all i (except for infinite N or for highly singular cases where all the periods are rationally related). Hence, a locally
periodic bump solution is not possible for the finite-sized spike response model and hence for the integrate-and-fire network.

Acknowledgments

We thank D. Pinto and G. B. Ermentrout for stimulating discussions that spurred this research. We also thank A. Compte and N. Brunel for some clarifying conversations. We especially thank Boris Gutkin for many fruitful discussions and for his critical insight. This work was supported in part by the National Institutes of Health (C. C.) and the A. P. Sloan Foundation (C. C.).

References

Abbott, L. F., & Van Vreeswijk, C. (1993). Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cybern., 27, 77–87.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7, 237–252.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates: I. Substrate—spikes, rates and neuronal gain. Network, 2, 259–273.
Bressloff, P. C., Bressloff, N. W., & Cowan, J. D. (1999). Dynamical mechanism for sharp orientation tuning in an integrate-and-fire model of a cortical hypercolumn. Preprint 99/13, Department of Mathematical Sciences, Loughborough University.
Bressloff, P. C., & Coombes, S. (1998). Spike train dynamics underlying pattern formation in integrate-and-fire oscillator networks. Phys. Rev. Lett., 81(11), 2384–2387.
Camperi, M., & Wang, X.-J. (1998). A model of visuospatial working memory in prefrontal cortex: Recurrent network and cellular bistability. J. Comput. Neurosci., 5, 383–405.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Colby, C. L., Duhamel, J.-R., & Goldberg, M. E. (1995). Oculocentric spatial representation in parietal cortex. Cerebral Cortex, 5, 470–481.
Compte, A., Brunel, N., & Wang, X.-J. (1999). Spontaneous and spatially tuned persistent activity in a cortical working memory model. Abstract, Computational Neuroscience Meeting.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comp., 8, 979–1001.
Ermentrout, G. B. (1998). Neural networks as spatio-temporal pattern-forming systems. Reports on Progress in Physics, 61(4), 353–430.
Fohlmeister, C., Gerstner, W., Ritz, R., & van Hemmen, J. L. (1995). Spontaneous excitations in the visual cortex: Stripes, spirals, rings and collective bursts. Neural Comp., 7(5), 905–914.
Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey’s dorsolateral prefrontal cortex. J. Neurophysiology, 61(2), 331–349.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gerstner, W. (1998). Populations of spiking neurons. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 261–295). Cambridge, MA: MIT Press.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comput., 12, 43–89.
Gerstner, W., van Hemmen, J. L., & Cowan, J. (1996). What matters in neuronal locking? Neural Comput., 8, 1653–1676.
Gutkin, B. S., Laing, C. R., Chow, C. C., Colby, C., & Ermentrout, G. B. (2000). Turning on and off with excitation: The role of spike-timing asynchrony and synchrony in sustained neural activity. Preprint.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–335.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed.). Cambridge, MA: MIT Press.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Horn, D., & Opher, I. (1997). Solitary waves of integrate-and-fire neural fields. Neural Comp., 9(8), 1677–1690.
Kishimoto, K., & Amari, S. (1979). Existence and stability of local excitations in homogeneous neural fields. J. Math. Biology, 7, 303–318.
Redish, A. D., Elga, A. N., & Touretzky, D. S. (1996). A coupled attractor model of the rodent head direction system. Network, 7, 671–685.
Shriki, O., Sompolinsky, H., & Hansel, D. (1999). Rate models for conductance-based cortical neuronal networks. Preprint.
Skaggs, W. E., Knierim, J. J., Kudrimoti, H. S., & McNaughton, B. L. (1995). A model of the neural basis of the rat’s sense of direction. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 173–180). Cambridge, MA: MIT Press.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neuroscience, 15, 5448–5465.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., 1, 313–321.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical J., 12, 1–24.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetic, 13, 55–80.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensembles: A theory. J. Neuroscience, 16, 2112–2126.

Received August 4, 1999; accepted September 25, 2000.
LETTER
Communicated by Richard Krauzlis
Vergence Dynamics Predict Fixation Disparity Saumil S. Patel College of Optometry and Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77204, U.S.A.
Bai-Chuan Jiang College of Optometry, Nova Southeastern University, Ft. Lauderdale, FL 33328, U.S.A.
Haluk Ogmen Department of Electrical and Computer Engineering, University of Houston, Houston, TX 77204, U.S.A.
The neural origin of the steady-state vergence eye movement error, called binocular fixation disparity, is not well understood. Further, there has been no study that quantitatively relates the dynamics of the vergence system to its steady-state behavior, a critical test for the understanding of any oculomotor system. We investigate whether fixation disparity can be related to the dynamics of opponent convergence and divergence neural pathways. Using binocular eye movement recordings, we first show that opponent vergence pathways exhibit asymmetric angle-dependent gains. We then present a neural model that combines physiological properties of disparity-tuned cells and vergence premotor cells with the asymmetric gain properties of the opponent pathways. Quantitative comparison of the model predictions with our experimental data suggests that fixation disparity can arise when asymmetric opponent vergence pathways are driven by a distributed disparity code.

Neural Computation 13, 1495–1525 (2001) © 2001 Massachusetts Institute of Technology

S. S. Patel, B.-C. Jiang, and H. Ogmen

Figure 1: An illustration of fixation disparity. The eyes are fixating on a point (unfilled circle) that is less convergent than the target position (filled circle). The difference between the target position and the actual fixation point of the eyes represents the fixation disparity.

1 Introduction

Central to the understanding of human behavior is the understanding of neural sensorimotor systems that produce the behavior. The eye movement system is one of the best understood sensorimotor systems in the brain. Binocular gaze shifts in space are achieved by simultaneous operation of two classes of eye movement subsystems: conjugate and disjunctive. The disjunctive system, which is also called the vergence system, maintains optical alignment of both eyes when viewing a target binocularly. For a gaze shift between targets at different depths, it responds in a manner that reduces the resulting binocular disparity, thus maintaining single vision. The sensorimotor architecture of the vergence system is poorly understood. Experience from conjugate eye movement system studies suggests that significant understanding of mechanisms of eye movements is possible when physiological and behavioral aspects of the system are incorporated into an analytical model that provides accurate predictions of the static and dynamic oculomotor responses under a variety of stimulus conditions (Robinson, 1981). However, for the vergence system, relatively few attempts (see Keller, 1981; Mays, 1984; Mays, Porter, Gamlin, & Tello, 1986; Gamlin, Gnadt, & Mays, 1989; Gamlin & Mays, 1992) have been made to combine physiological and behavioral evidence analytically. Even fewer attempts have been made to relate steady-state properties to the dynamic properties of the system. In this article, we seek to relate fixation disparity, a static vergence error, to the dynamics of the vergence system using a model that combines behavioral properties of the vergence system with the neurophysiological characteristics of disparity-tuned sensory and premotor vergence cells.

The difference between the position of a fixation target and the point actually fixated by the eyes is called fixation disparity (see Figure 1). The basis of this ocular misalignment, which has been observed for a century (see Ogle, Martens, & Dyer, 1967), is largely unknown. As a consequence of fixation disparity, the fixation target is imaged away from the central parts of the fovea in either one or both eyes, therefore degrading the quality of the images available for fine stereopsis. Excessive fixation disparity has been associated with some types of binocular dysfunction, though its use as a general predictor of binocular stress and dysfunction has been difficult and at times unsuccessful (see Ogle et al., 1967). One of the main reasons for the failure of fixation disparity to predict binocular dysfunction lies in the fact that the neural origin of this error has not been well understood. Existing
control-type models of the vergence system (Ogle et al., 1967; Schor, 1979a, 1979b; Hung & Semmlow, 1980; Schor, Robertson, & Wesson, 1986; Saladin, 1986) suggest that fixation disparity results from the leakiness of position integrators (Schor, 1979a, 1979b; Hung & Semmlow, 1980; Schor et al., 1986; Saladin, 1986). However, most reports in the literature favor a nonleaky integration in the vergence system (Rashbass & Westheimer, 1961; Zuber & Stark, 1968; Semmlow, Hung, Horng, & Ciuffreda, 1994), although there has been an isolated report to the contrary (Pobuda & Erkelens, 1993). Despite the evidence against leaky integration, these models required the leakiness to account for the decay of vergence angle in the absence of a binocular stimulus (Krishnan & Stark, 1977). However, as suggested by a recent neural network model (Patel, Ogmen, White, & Jiang, 1997), a nonlinear active turn-off circuit in conjunction with a nonleaky integrator (nonleaky with respect to the timescale of the dynamics) would achieve the same behavioral function. Moreover, this neural network model can explain a much wider range of vergence dynamic data than alternative control-type models (Patel et al., 1997). Unlike control-type models, this model can also correlate behavioral responses to underlying neurophysiological components of the vergence system. On the other hand, nonleaky integration in general implies zero position error, that is, zero fixation disparity. As suggested by a previous analysis (Patel et al., 1997), a model with nonleaky integration coupled with an asymmetry between the dynamics of convergence and divergence pathways can offer an alternative explanation for fixation disparity. In this article, we present experimental evidence supporting this hypothesis.

2 Model Description

2.1 Dynamic Model. A neural network model of horizontal disparity vergence dynamics (Patel et al., 1997) is shown in Figure 2a.
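The leaky-versus-nonleaky distinction drawn in section 1 can be made concrete with a toy position integrator inside a disparity feedback loop (an illustrative sketch with made-up parameters, not the model of Figure 2): a leaky integrator settles with a residual error, the analog of a nonzero fixation disparity, while a nonleaky integrator drives the error to zero.

```python
def steady_state_error(target=2.0, gain=1.0, leak=0.0, dt=0.001, T=200.0):
    # Toy vergence position integrator driven by disparity feedback:
    #   dv/dt = -leak * v + gain * (target - v)
    # leak > 0 corresponds to the leaky-integrator account of fixation
    # disparity; leak = 0 is nonleaky integration. Parameters are illustrative.
    v = 0.0
    for _ in range(int(T / dt)):
        v += dt * (-leak * v + gain * (target - v))
    return target - v  # residual vergence error ("fixation disparity")

leaky_error = steady_state_error(leak=0.5)     # settles near target*leak/(gain+leak)
nonleaky_error = steady_state_error(leak=0.0)  # decays toward zero
assert leaky_error > 0.1 and abs(nonleaky_error) < 1e-6
```

Note that this toy loop cannot by itself decide between the two accounts; the asymmetric-opponent-pathway explanation developed in this article offers a third route to a nonzero error with nonleaky integration.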
Only those portions of the model that play an active role during binocular fixation are shown. Prior to a vergence movement, the binocular target activates a subset of disparity encoders that signal the presence of a nonzero initial binocular disparity. Two types of disparity encoders are employed in the model: convergence encoders and divergence encoders. In the model, we have used the word encoder instead of disparity-tuned cell because, as mentioned in our earlier work (Patel, Ogmen, & Jiang, 1996; Patel et al., 1997), it is not clear whether the same cells that may be responsible for the perception of depth (Barlow, Blakemore, & Pettigrew, 1967; Poggio, Gonzalez and Krause, 1988; Roy, Komatsu, & Wurtz, 1992) are also responsible for the vergence movements. Recent evidence suggests that the disparity cells responsible for vergence movements might not be directly responsible for perception of depth (Cumming & Parker, 1997; Masson, Busettini, & Miles, 1997). Functionally, convergence and divergence encoders are equivalent to near-tuned excitatory and far-tuned excitatory disparity cells, respectively.
(Figure 2 appears here. Panel a: the model circuit, with a map of disparity encoders, indexed roughly −8 to 8 and grouped into convergence (CEO) and divergence (DEO) outputs, feeding convergence (C) and divergence (D) velocity cells, which drive position cells projecting to the lateral rectus (LR) and medial rectus (MR) outputs. Panel b: steady-state fixation disparity in degrees (eso/exo) plotted against vergence angle in degrees for convergence and divergence.)
The disparity encoders are topographically ordered in a horizontal direction such that the central encoder signals zero disparity and encoders positioned on either side of the zero disparity encoder signal disparity proportional to their Euclidean distance from zero disparity. The encoders to the right (left) of the zero disparity encoder signal crossed (uncrossed) disparity. To understand the operation of the model, consider the case where the eyes are presented with a target whose images on the two retinas induce a disparity.
This retinal disparity will generate an activation pattern across the population of disparity encoders. For simplicity, we will first consider the case where the activation pattern consists of the activity of a single encoder, that is, a spatially localized disparity code. A vergence velocity signal will be formed based on the sign of the disparity represented by this encoder; a crossed disparity activates the convergence velocity cell, and an uncrossed disparity activates the divergence velocity cell. The strength of the innervation of the velocity cell is proportional to the magnitude of the disparity. The velocity signal drives a pair of vergence position cells in a push-pull manner. The position cells exhibit nonlinear shunting type membrane dynamics (Grossberg, 1988). The convergence position cell projects to medial rectus motoneurons, and the divergence position cell projects to lateral rectus motoneurons of both eyes. The eye plant is a first-order linear system (Robinson, 1981; Krishnan & Stark, 1983). The initial disparity therefore results in a movement of both eyes in opposite directions—that is, a vergence
Figure 2: Facing page. (a) Sensorimotor transformation and push-pull integration in the neural network model of disparity vergence system. Only the portions of the model that play a role during binocular fixation are shown. All solid (dotted) lines represent excitatory (inhibitory) connections. The cells in the convergence pathways are labeled C, and those in divergence pathways are labeled D. The graded connectivity between the disparity encoder cells and the velocity cells represents the sensorimotor transformation and is shown by variable-thickness lines on both sides of the figure. Thicker lines depict larger synaptic weights. The convergence (divergence) encoders send excitatory projections to convergence (divergence) velocity cells. CEO (DEO) is a vector of the outputs of the convergence (divergence) encoders. The numbers under the disparity encoders represent their position in the map of disparity encoders. The velocity cells project in a push-pull manner (e.g., convergence velocity cell sends excitatory projections to convergence position cell, and divergence velocity cell sends inhibitory projections to convergence position cell) to position cells, which in turn project to medial and lateral rectus sections (LR, MR) of the motoneurons/plant system for each eye. The firing profiles of the vergence velocity cells and the vergence position cells in the model are shown in rectangular boxes besides the corresponding cells. The top trace within each box is the simulated vergence step movement of 2 degrees for which the corresponding firing profiles are obtained. The bottom trace is the firing rate of the corresponding cell. Upward deflection in the vergence traces indicate increased convergence and downward deflection indicates increased divergence. The firing profiles of model neurons are very similar to those of actual cells found in primate midbrain (Mays, 1984, Mays et al., 1986). 
(b) Steady-state vergence error exhibited by the neural network model shown in (a). This error characteristic is obtained with a parameter set that is equal for cells in both the convergence and divergence pathways. A rectangular binary disparity code is used (see main text). (Adapted from Patel et al., 1997.)
1500
S. S. Patel, B.-C. Jiang, and H. Ogmen
movement that reduces the disparity of the target. The reduced disparity will activate the disparity encoder representing this new value of the target disparity. This cyclic process will continue until the disparity is reduced to zero. In other words, at steady state, only the encoder signaling zero disparity will be active. In summary, if disparity is represented by the activity of a single disparity encoder, the model will reach a steady state where fixation disparity is zero. To facilitate the understanding of the derivation of the static model from the dynamic model, let us first describe the dynamic model shown in Figure 2a in simple mathematical terms. The vergence output (VO), defined as the difference between the angle of horizontal rotation of the two eyes, can be simply written as: VO(t) = F(CEO, DEO, LR, MR, t)
(2.1)
where F is some nonlinear function that represents the entire vergence feedback system, CEO (DEO) is a vector of the outputs of the convergence (divergence) encoders, LR (MR) is the output to the motoneurons innervating lateral (medial) rectus muscles of both eyes, and t is time. Let us define the two subtypes of mutually exclusive vergence outputs: VOC(t) = FC(CEO, LR, MR, t)
(2.2)
VOD(t) = FD(DEO, LR, MR, t)
(2.3)
where VOC (or convergence) is the vergence position when DEO is zero, VOD (or divergence) is the vergence position when CEO is zero, and FC and FD are some nonlinear functions. In other words, convergence movement is induced by activation of the sensorimotor pathways innervated by the convergence encoders, and divergence movement is induced by activation of the sensorimotor pathways innervated by the divergence encoders. Notice that due to the push-pull integration of the vergence velocity signals, these two functionally separate movements share common physical pathways. Because the sensory and motor components are arranged in a feedforward cascade, we assume that for each type of movement the sensory and motor components are functionally separable; then, VOC(t) = FCS(CEO, t) ∗ FCM(CV, LR, MR, t)
(2.4)
VOD(t) = FDS(DEO, t) ∗ FDM(DV, LR, MR, t)
(2.5)
where FCS (FDS) and FCM (FDM) are sensory and motor functions for convergence (divergence) and CV (DV) is the output of the convergence (divergence) sensory component. When convergence and divergence movements are induced simultaneously by a distributed disparity code, an equilibrium can only be reached
Vergence Dynamics Predict Fixation Disparity
1501
if both movements are equal in magnitude. In this case of simultaneous activation, if we define the vergence imbalance as VI(t) = VOC(t) − VOD(t),
(2.6)
then at equilibrium (steady state) we have VI(∞) = 0.
(2.7)
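The closed-loop behavior summarized in equations 2.1 through 2.7 can be illustrated with a minimal simulation sketch (ours, not the authors' implementation): vergence velocity is taken to be proportional to the remaining disparity, as in the localized single-encoder case described above, and a nonleaky position cell integrates it. All names and parameter values (gain, dt, target) are illustrative assumptions.

```python
# Minimal sketch of a disparity-driven vergence feedback loop (assumed
# parameter values, not the paper's simulation code).

def simulate_vergence(target_deg, gain=4.0, dt=0.001, t_max=2.0):
    """Euler-integrate a first-order vergence loop; returns final vergence."""
    vergence = 0.0
    for _ in range(int(t_max / dt)):
        disparity = target_deg - vergence   # retinal disparity (the error signal)
        velocity = gain * disparity         # velocity cells: sign and magnitude
        vergence += velocity * dt           # nonleaky position integration
    return vergence

final = simulate_vergence(2.0)
# With a localized (single-encoder) disparity code, the loop settles at the
# target, i.e., steady-state fixation disparity is essentially zero.
print(abs(2.0 - final) < 1e-3)
```

With a spatially distributed, asymmetrically weighted code, the same loop would settle at a nonzero residual disparity, which is the situation analyzed in the following sections.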
2.2 Fixation Disparity Simulated from the Dynamic Model. Previous studies on disparity-sensitive neurons have generally found cells that have broad as opposed to binary (on-off) tuning functions (Barlow et al., 1967; Roy, Komatsu, & Wurtz, 1992; Gonzalez, Relova, Perez, Acuna, & Alonso, 1993; Wagner & Frost, 1993). This implies that a given target disparity activates more than one disparity encoder, i.e., a spatially distributed code. Further, in humans, it has been proposed that vergence control is based on a population code formed by a pool of disparity-tuned cells (Mallot, Roll, & Arndt, 1996; Patel et al., 1997). The optics of the eye cause the image of a visual point to spread on the retina (see Charman, 1991). This, coupled with cross-correlation of images from the two eyes (Stevenson, Cormack, & Schor, 1994), suggests that even small targets (∼2–4 arc-min) would generate a spatially distributed disparity code. As a consequence of the broad tuning property of disparity cells, binocular disparity due to a target would be coded in a distributed manner. In our model, as a result of the disparity spread, both convergence and divergence disparity encoders will be excited at steady state. This simultaneous activation of convergence and divergence disparity encoders will result in simultaneous activation of convergence and divergence velocity cells. An equilibrium will be achieved when the convergence and divergence motor activities due to simultaneous activation of convergence and divergence velocity cells balance each other. The displacement of the centroid of the disparity spread relative to the zero disparity encoder is then the fixation disparity. If the disparity code is perfectly symmetric and the convergence and divergence pathways are also perfectly symmetric for all vergence angles, then the fixation disparity is zero for all those vergence angles. 
On the other hand, for a symmetric disparity code, any asymmetry in the convergence and divergence pathways (sensory or motor or both components) would result in a nonzero fixation disparity. As shown in Figure 2b, the neural network model, when stimulated with a rectangular disparity code, exhibits a nonlinear steady-state error function with respect to vergence angle. A rectangular disparity code can be formed when a set of cells within a topographic map of cells fires equally in response to a given input. Normally, in a distributed sensory code, the firing pattern as a function of cell position in the topographic map would be symmetric and decreasing (e.g., gaussian) around the cell firing maximally. However, the assumption of a rectangular code greatly simplifies the
steady-state analysis without losing generality. The steady-state vergence error function shown in Figure 2b arises as a result of the asymmetry in dynamics generated by the nonlinear push-pull integration architecture. We also tested a gaussian and a triangular disparity code and found that the qualitative nature of the steady-state vergence error function was similar to that shown in Figure 2b.

2.3 Static Model. In order to write a simple analytical formula for fixation disparity, a simplified form of the neural network model is derived (see Figure 3). The original neural network model and its simplified version will hereafter be called the dynamic vergence model and the static vergence model, respectively. We now consider the static model shown in Figure 3. In the nervous system and in the dynamic model, disparity is encoded by the activities of a discrete set of neurons. To simplify the analysis, in the static model we use a continuous variable (z) to represent disparity. This simplification is justified if one considers that the set of disparity encoders contains
a large number of neurons. A consequence of this simplification is that the discrete synaptic weights between the disparity encoders and velocity cells are replaced by continuous functions (Fc(z) and Fd(z)) in the static model. Because we are interested in the origin of fixation disparity, our analysis is restricted to the vergence system and assumes that the lens accommodation system is functionally in an open-loop condition. We use a binary rectangular disparity code to illustrate the principles of our model. The aggregate sensory output from the crossed and uncrossed sides is determined by the spatial extent n and location of the disparity spread (p and q) generated by the target (see Figure 3). Note that p and q represent the disparity spread relative to an absolute disparity and should not be confused with the absolute disparity represented by z. In the ideal steady-state condition, the absolute disparity around which they are defined is 0. The disparity encoder for disparity z projects to a velocity cell with a synaptic weight represented by F(z). If this cell is active, its postsynaptic effect will be given by F(z), and if it is inactive, its postsynaptic effect will be zero. Because velocity cells summate their inputs, the outputs of the divergence (Sd) and convergence (Sc) velocity cells can be written as

Sd = ∫_{−q}^{0} Fd(z) dz and Sc = ∫_{0}^{p} Fc(z) dz, (2.8)
where z is the disparity, Fd and Fc are sensorimotor transformation functions for the divergence and convergence pathways, respectively, and integration (as Figure 3: Facing page. Static vergence model. The lower part of the figure shows an example of a disparity code formed by cross-correlation of monocular activity that is induced by a white line target (consider one dimension as indicated by a horizontal dotted line). The horizontal axis represents a mapping of visual space onto a line of neurons. The fulcrum represents the zero disparity point. The horizontal line extending in the z direction from the fulcrum represents a continuous array of horizontal disparity encoders. The negative (positive) z-axis represents divergence (convergence) encoders. The disparity encoders activated by the disparity spread of the target are shaded in gray. The points p and q represent the end points of the disparity spread, and n is the total extent of the spread. The sensory contribution of each active divergence and convergence disparity encoder is represented by Fd(z) and Fc(z), respectively, where z denotes disparity (the position in the array of horizontal disparity encoders). Sc and Sd are the outputs of the vergence velocity cells (C and D). The gains of the divergence and convergence motor controllers are represented by Gd(V) and Gc(V). The steady-state vergence angle is denoted by V, and the vergence imbalance for that angle, R, is defined as the difference between the divergence and convergence motor outputs Md and Mc, respectively, and is zero at steady state. The eye plant is not shown but is assumed to be a first-order linear system, and therefore at steady state, its contribution is just a constant gain factor.
opposed to summation) is used to take into account the continuous nature of z. Sc and Sd correspond to CV and DV in equations 2.4 and 2.5. The functions that relate Sc and Sd to z correspond to FCS and FDS in equations 2.4 and 2.5. A sensorimotor transformation function transforms a given sensory variable to a motor variable. In our case, the sensory variable is target disparity and motor variable is a correlate of vergence velocity. Although the vergence position cells in the dynamic model integrate their inputs nonlinearly over the entire range of vergence movements, for a small vergence movement (small-signal linearity), the temporal integration of the output of the velocity cells by the vergence position cells can be considered linear. Thus, at steady state only a gain that depends on the initial vergence angle (the large signal) relates the output of the velocity cells to the output of the position cells. Further, the eye plant is also linear; thus, at steady state, it is the final gain stage in the vergence pathway. Therefore, the experimentally measured linear relationship between target disparity and the velocity of the vergence movement for at least the ±2 degree disparity range (Rashbass & Westheimer, 1961) implies a linear relationship between target disparity and the output of the velocity cells in our model (see the appendix for a formal derivation). We therefore choose the functions Fd and Fc to be linear functions of z, Fd (z) = Kd z and Fc (z) = Kc z,
(2.9)
where Kd and Kc are defined as divergence and convergence sensory gains, respectively. In the dynamic model, these gains are represented by the weights of the crossed and uncrossed disparity signals as they feed into the corresponding velocity cells. Therefore, in the dynamic vergence model, the sensorimotor transformation is performed by the vergence velocity cells. Substituting equation 2.9 in equation 2.8, we obtain

Sd = Kd q²/2 and Sc = Kc p²/2. (2.10)
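The substitution of equation 2.9 into equation 2.8 can be checked numerically with a short sketch (ours, not from the paper; gains and spread limits are assumed example values). On the divergence side the weight is taken to depend on the disparity magnitude |z|, so that both outputs are positive as in equation 2.10.

```python
# Numerical check that the integrals of equation 2.8 with the linear
# sensorimotor functions of equation 2.9 reproduce the closed forms
# Sd = Kd*q^2/2 and Sc = Kc*p^2/2 of equation 2.10.

def integrate(f, a, b, num=100_000):
    """Midpoint-rule quadrature of f over [a, b]."""
    h = (b - a) / num
    return sum(f(a + (i + 0.5) * h) for i in range(num)) * h

Kd, Kc, q, p = 1.5, 2.0, 0.3, 0.4                 # illustrative values
Sd = integrate(lambda z: Kd * abs(z), -q, 0.0)    # divergence encoders, z in [-q, 0]
Sc = integrate(lambda z: Kc * z, 0.0, p)          # convergence encoders, z in [0, p]
print(Sd, Kd * q * q / 2)
print(Sc, Kc * p * p / 2)
```

The midpoint rule is exact for linear integrands, so the numerical and closed-form values agree to floating-point precision.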
The steady-state outputs Md and Mc of the divergence and convergence motor controllers can then be calculated as

Md = Sd Gd(V) = Kd q² Gd(V)/2 and Mc = Sc Gc(V) = Kc p² Gc(V)/2, (2.11)

where V is the steady-state vergence angle. Gc(V) and Gd(V) are defined as angle-dependent convergence and divergence motor gains, respectively, and correspond to FCM and FDM in equations 2.4 and 2.5. Mc and Md correspond to VOC and VOD in equations 2.2 and 2.3 when the pathway downstream of the vergence position cells is assumed to be linear. Although the relationship of the steady-state output of the motor controllers is quadratic
in the parameters of the disparity spread (p and q), the relationship between disparity z and the velocity of small vergence movement (under most dynamic conditions) remains linear. As shown previously (Patel et al., 1997), the dynamic vergence model exhibits angle-dependent variable gains during convergence and divergence movements. The exact nature of the variable gains as a function of angle primarily depends on the parameters defining the membrane characteristics of the vergence position cells in the dynamic model. As indicated in Figure 3, the vergence imbalance R(V) (same as VI in equation 2.6) is calculated as

R(V) = Mc − Md = (1/2)[Kc p² Gc(V) − Kd q² Gd(V)]. (2.12)
The fixation disparity E(V) at a given angle V, specifically for a binary disparity code, is E(V) = p − q.
(2.13)
At steady state, we have R(V) = 0 (same as equation 2.7) and for a rectangular binary disparity code p + q = n; hence, we can rewrite equations 2.12 and 2.13 in the variables p and n. By eliminating p from the rewritten equations 2.12 and 2.13,

E(V) = n [1 − γ(V)] / [1 + γ(V)], (2.14)

where

γ(V) = √[Kc Gc(V) / (Kd Gd(V))]. (2.15)
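The closed form of equations 2.14 and 2.15 can be cross-checked against a direct numerical solution of the equilibrium condition R(V) = 0 under the constraint p + q = n. The sketch below is illustrative (the gain products and spread are assumed values, not measured data).

```python
import math

# Check that the closed-form fixation disparity (equations 2.14-2.15)
# matches a brute-force solution of R = Kc*Gc*p^2 - Kd*Gd*q^2 = 0
# (equation 2.12, dropping the common factor 1/2) with q = n - p.

def fixation_disparity(n, KcGc, KdGd):
    gamma = math.sqrt(KcGc / KdGd)              # equation 2.15
    return n * (1.0 - gamma) / (1.0 + gamma)    # equation 2.14

def fixation_disparity_numeric(n, KcGc, KdGd, steps=100_000):
    best_p, best_r = 0.0, float("inf")
    for i in range(steps + 1):
        p = n * i / steps
        q = n - p
        r = abs(KcGc * p * p - KdGd * q * q)    # |R(V)| up to a factor 1/2
        if r < best_r:
            best_p, best_r = p, r
    return best_p - (n - best_p)                # E = p - q (equation 2.13)

n, KcGc, KdGd = 0.5, 1.0, 1.6                   # asymmetric gains -> E != 0
print(abs(fixation_disparity(n, KcGc, KdGd) -
          fixation_disparity_numeric(n, KcGc, KdGd)) < 1e-4)
```

Setting the two gain products equal makes γ = 1 and E = 0, the perfectly symmetric case described in the text.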
If p and q are equal at equilibrium, then the fixation disparity is zero. If q is larger (smaller) than p, the eyes are overconverged (underconverged). To illustrate how Kc Gc(V) and Kd Gd(V) can be determined experimentally, consider a time instant t during an open-loop or disparity-clamped convergence movement. Let the average target disparity be denoted by ds. Let the disparity spread be rectangular with a symmetrical spread of n/2 on both sides of ds. Because the disparity input is fixed or clamped in time, the corresponding fixed or steady-state output of the convergence velocity cell would be nKc ds (by using Sc = ∫_{ds−n/2}^{ds+n/2} Kc z dz). Because the position cells are nonleaky integrators and can be considered linear for a small range of vergence angles, the vergence angle, which is the output of the convergence motor controller, would be nKc ds Gc(V)t (by using Mc = ∫_{0}^{t} nKc ds Gc(V) dt; assume initial convergence is zero). Let us define the product Kc Gc(V), which linearly relates the output convergence angle to the
input disparity, as the sensory motor gain for the convergence pathway. The velocity of the convergence movement would therefore be nKc Gc(V)ds. Because the relationship between the magnitude of ds and the experimentally measured velocity of the corresponding open-loop convergence movement is linear for small movements, the slope of this relationship can be used as an approximation for ΨKc Gc(V), where Ψ is a constant that includes the gain of the pathway downstream of the vergence position cells. If we assume that Ψ is similar for convergence and divergence movements, then the ratio of the slopes of the relationship between ds and the velocity of the open-loop vergence movement for convergence and divergence movements can be used to approximate the ratio of Kc Gc(V) to Kd Gd(V), respectively. For the purpose of forming the ratio of open-loop convergence to divergence velocity, the open-loop vergence velocity measurements can further be replaced by peak vergence velocity measurements (Semmlow et al., 1994; Hung, Zhu, & Ciuffreda, 1997). In summary, quantitative analysis of the static vergence model suggests a linear relationship between fixation disparity E(V) and a function λ(V) of the ratio of the convergence and divergence sensory motor gains, where

λ(V) = [1 − γ(V)] / [1 + γ(V)]. (2.16)
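The slope-ratio estimate described above can be sketched as follows. The data and gain values here are synthetic, invented only to illustrate the calculation; the constant Ψ cancels in the ratio, exactly as assumed in the text.

```python
import math

# Illustrative sketch: estimate the ratio Kc*Gc(V)/Kd*Gd(V) from the
# slopes of open-loop vergence velocity versus clamped disparity ds,
# then map it to lambda(V) via equations 2.15 and 2.16.

def slope_through_origin(xs, ys):
    """Least-squares slope of y = m*x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

ds = [0.5, 1.0, 1.5, 2.0]           # clamped disparities (deg); synthetic
conv_vel = [1.0, 2.0, 3.0, 4.0]     # synthetic open-loop convergence velocities
div_vel = [1.6, 3.2, 4.8, 6.4]      # synthetic open-loop divergence velocities

ratio = slope_through_origin(ds, conv_vel) / slope_through_origin(ds, div_vel)
gamma = math.sqrt(ratio)            # equation 2.15 (Psi cancels in the ratio)
lam = (1.0 - gamma) / (1.0 + gamma) # equation 2.16
print(lam)
```

A positive λ(V) (convergence gain smaller than divergence gain) corresponds to underconvergence, consistent with the sign convention of equation 2.13.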
Hence, if fixation disparity is a result of asymmetry between convergence and divergence dynamics, then a strong correlation should exist between E(V) and λ(V). This prediction was tested experimentally using a simple staircase-pulse paradigm (see Figure 4a).

3 Methods

Five subjects (LFH, JCK, VTA, NYN, and HNG) participated in this study voluntarily. All the subjects were emmetropic. All had at least 20/20 visual

Figure 4: Facing page. (a) The stimulus paradigm used in our experiments. Upward deflection in the stimulus trace represents increased convergence demand. The dotted horizontal lines represent the three pedestal demands used in the experiments. (b) A typical vergence recording in our experiments. The right and left eye position signals are shown in the top two traces (RE, LE; sampling rate 60 Hz). The unfiltered vergence signal (RV) is obtained by subtracting RE from LE. The filtered vergence signal (FV) is obtained by digitally low-pass filtering RV with a cut-off frequency of 5 Hz. The vergence velocity signal (VV) is computed by differentiating FV. The bottom trace shows the stimulus, with each step in the paradigm corresponding to a 2 degree symmetric vergence demand (except a 1 degree first step during the staircase). Upward deflection indicates increased convergence demand.
acuity with normal binocular vision. Horizontal eye movements of both eyes were measured with a pair of dual Purkinje-image eye trackers (Crane & Steele, 1978). In a dark room, using the Badal optical systems of the stabilizing attachments to the eye trackers, the subject viewed a pair of bright vertical lines (9 degrees in length and 0.35 degree in width; 0.56 candelas per square meter), one presented to each eye on separate monitors with dark background via mirrors positioned in front of the eye trackers. Each
monitor was placed at a distance of 1 m from the exit pupil of the corresponding eye tracker. During target viewing, the subject rested his or her chin on a chin rest, and head movements were restricted using a forehead support. A Macintosh II computer controlled the stimulus display and collected data at a sampling rate of 60 Hz per channel, using a 12-bit A/D converter. Two D/A channels were used to map the current stimulus position on each screen. These mapped signals were also recorded along with the eye movement signals. All data were analyzed using the data analysis package AcqKnowledge (BIOPAC Systems Inc.). Instead of pinhole viewing, we used a low-luminance, low-spatial-frequency target that greatly reduces the accommodative gain (Campbell, 1954; Johnson, 1976). It can be shown that as far as the relationship between fixation disparity and vergence demand is concerned, a reduction in accommodative gain of a log unit is effectively the same as opening the accommodative control loop (Schor, 1992). One at a time, the target on each monitor was initially aligned to force the corresponding eye to look straight ahead. This alignment was achieved by placing a rectangular grid close to the subject's viewing eye. The target was adjusted (vertically and horizontally, using the keyboard) until the subject aligned the center of the target with the center of the grid. After this alignment procedure was completed for both eyes, the grid was removed from the optical path, and the subject was asked to fuse the targets. In case the subject had vertical phoria, an additional vertical adjustment was performed on one of the two targets until fusion was established. The initial monocular alignment that resulted in binocular viewing at 0 degree (parallel eyes) was fixed during all subsequent sessions. The viewing distance of the display monitors was increased from 1 m (physical) to 4 m (optical) by placing a convex lens of 0.75 D in the optical path of each eye. 
Due to the haploscopic viewing in our experiments, the vergence demand can be manipulated independently of the constant accommodation demand. Calibration of the eye trackers, conducted in each session of the experiments, was accomplished by presenting a monocular target at different fixation directions for each eye and recording the related eye movement. Based on the eye tracker calibration, the signal recorded from each eye tracker was converted to eye position. The vergence response was then computed by subtracting the two eye position signals (see Figure 4b). Since vergence responses are attenuated by 40 dB around 4 Hz (Zuber & Stark, 1968), the vergence signal was digitally low-pass filtered at 5 Hz (see Figure 4b). In the experiment, vergence demand was toggled between convergence and divergence around a pedestal (or average) demand using three 2 degree peak-to-peak pulses of 5 seconds (see Figure 4a). Three pulses were used for purposes of averaging to improve the signal-to-noise ratio of the vergence signal. The duration of 5 seconds was used to avoid adaptive effects (Sethi, 1986). Three pedestal convergence demands were used in random order: 2, 5, and 8 degrees from a 0 degree demand, which represents a parallel angle for the two eyes. A staircase paradigm comprising multiple 2 degree steps
was used to achieve each pedestal demand. For the 2 and 8 degree pedestal demands, the first step was 1 degree instead of 2 degrees, as shown in the example in Figure 4a. The duration of each step in the staircase was 5 seconds. The vergence velocity was computed by differentiating the filtered vergence signal (see Figure 4b). Positive (negative) velocity was assigned to convergence (divergence) movement. The fixation disparities at various demands were obtained from the staircase response generated during stimulation at the 8 degree pedestal demand. The fixation disparity was computed as the difference between the demand and the average vergence angle during the last 2 seconds of the step. Positive (negative) fixation disparity was assigned to underconvergence (overconvergence). The fixation disparity at a pedestal demand of 2 degrees was computed by averaging the disparities at demands of 1 and 3 degrees. Similarly, averaging of the fixation disparities at the 7 and 9 degree pedestal demands was used to obtain the fixation disparity at a pedestal demand of 8 degrees. About 10 minutes of rest was introduced between tests at different demands. Because many researchers have shown that saccades facilitate vergence movements (Enright, 1986; Zee, Fitzgibbon, & Optican, 1992; Wick & Bedell, 1992), care was taken to separate and analyze only pure-vergence movements. Unfiltered responses in which a saccade (velocity more than 10 degrees per second or amplitude more than 0.3 degree) occurred in either eye between the onset of stimulation and 33 msec (two samples) after peak vergence velocity were rejected from further analysis. Each subject was tested twice, thus providing a maximum of six pulses per pedestal demand for subsequent data analysis.

4 Results and Discussion

4.1 Dynamic Asymmetry in the Vergence System. We show the existence of dynamic asymmetry, which is characterized by a pedestal demand-dependent difference in convergence and divergence velocities. 
Figure 5 shows typical vergence eye movement recordings for one subject (LFH) at pedestal demands of 2, 5, and 8 degrees. The asymmetry in peak convergence and divergence velocities increases as pedestal demand increases. All subjects showed this increase in asymmetry with an increase in pedestal demand; however, the magnitude of the difference between convergence and divergence peak velocities varied across subjects. The differences between the occurrence times of peak convergence and divergence velocities were nonsystematic with respect to the pedestal demand. To analyze the dynamic asymmetry quantitatively, repeated measures analysis of variance (ANOVA) was performed on the peak convergence and divergence velocities and their occurrence times. The average convergence and divergence peak velocities for all subjects are shown in Figure 6a. A three-way repeated measures ANOVA of the data in Figure 6a indicates that an increase in pedestal demand results in a significant increase in the difference between peak convergence and
Figure 5: Vergence eye movement data from subject LFH. The left column shows vergence position traces, and the right column shows the corresponding computed vergence velocity traces. The stimulus is a 2 degree vergence step on a pedestal demand. The corresponding pedestal demand is indicated in each box containing position traces. Upward (downward) deviation in position traces indicates increased convergence (divergence). Upward (downward) deviation in velocity traces indicates positive (negative) vergence velocity. The arrows indicate the onset of the convergence or divergence step stimulus. The position, velocity, and timescales are the same for all panels. The position and velocity traces are aligned to obtain the overlap shown in the figure.
divergence velocities (F[1, 8] = 1.1, p = .28 at 2 degrees; F[1, 8] = 38.8, p = .004 at 5 degrees; F[1, 8] = 106.9, p = .0005 at 8 degrees). Furthermore, the ANOVA also indicates a strong dependence of peak divergence velocity on pedestal vergence demand (F[1, 8] = 22, p = .01 between 2 and 5 degrees; F[1, 8] = 47.1, p = .001 between 2 and 8 degrees; F[1, 8] = 17.6, p = .02 between 5 and 8 degrees) but little effect of pedestal demand on peak convergence velocity (F[1, 8] = .23, p = .51 between 2 and 5 degrees; F[1, 8] = .15, p = .56 between 2 and 8 degrees; F[1, 8] = .008, p = .8 between 5 and 8 degrees). These results suggest some form of motor nonlinearity in the vergence system. A similar dependence of short-latency vergence movements induced by radial optic flow on the prior vergence angle has recently been reported (Yang, Fitzgibbon, & Miles, 1998). Their data also indicate that the effect of prior vergence angle on convergence is different from that on divergence. These results suggest that vergence velocity depends on the prior vergence angle and may partly explain why comparisons between convergence and divergence peak velocities have varied idiosyncratically between past studies (Zuber & Stark, 1968; Krishnan & Stark, 1977; Schor et al., 1986; Erkelens, Steinman, & Collewijn, 1989; Hung et al., 1997; Zee et al., 1992). 4.2 Is Asymmetry in Vergence Dynamics Due to Fixation Disparity? One might expect an effect of pedestal demand on vergence velocity due to the presence of fixation disparity. The average fixation disparity at various angles for all subjects is shown in Figure 6b. At the pedestal demands tested in our experiment, all the subjects exhibited an increase in underconvergence with an increase in pedestal demand. Because of fixation disparity, a symmetrical convergence-divergence pulse demand would stimulate the vergence system asymmetrically (see Figure 7a inset). However, a fixation disparity should affect both the convergence and divergence peak velocities. 
In contrast to this expectation, the data in Figure 6a clearly show that only divergence velocity is affected by the pedestal demand. To investigate this possibility further, the stimulus pulses, which were ±1 degree around all pedestal demands (i.e., a 2 degree step), were corrected for each subject based on the corresponding fixation disparities. As an example of this correction, for a 2 degree pedestal demand, assume that the fixation disparity represents 0.5 degree of underconvergence. An additional 1 degree convergence demand would then create an instantaneous convergence disparity step of 1.5 degrees (corrected convergence step), and a 1 degree divergence demand would create an instantaneous divergence disparity of 0.5 degree (corrected divergence step). In our experiments, the corrected convergence (divergence) step size was computed by adding (subtracting) the fixation disparity at the lower (upper) amplitude of the pulse demand. As shown in Figure 7a, the magnitude of the corrected stimulating step for convergence (divergence) increases (decreases) with an increase in pedestal demand. The corresponding changes in vergence velocity (see Figure 6a)
are opposite to the predictions of the open-loop relationship between disparity and vergence velocity (Rashbass & Westheimer, 1961), which suggests that the magnitude of vergence velocity should increase with increases in stimulating step size. This indicates that the effect of pedestal demand on vergence velocity cannot be explained by the corresponding changes in fixation disparity. Our analysis also indicates that pedestal vergence demand does not affect the time at which peak convergence velocity and peak divergence velocity occur, thus ruling out the possibility of pedestal demand-dependent changes in reaction time within the vergence system.
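The step-size correction described in the preceding section reduces to a small helper, sketched here with the worked example from the text (the function name and signature are ours, not the authors'):

```python
# Hedged sketch of the step-size correction: the corrected convergence
# step adds the fixation disparity at the lower pulse amplitude, and the
# corrected divergence step subtracts the fixation disparity at the
# upper amplitude (positive disparity = underconvergence).

def corrected_steps(step, fd_lower, fd_upper):
    """Return (corrected convergence step, corrected divergence step) in degrees."""
    return step + fd_lower, step - fd_upper

# Worked example from the text: 0.5 degree of underconvergence turns a
# 1 degree convergence demand into a 1.5 degree disparity step and a
# 1 degree divergence demand into a 0.5 degree step.
print(corrected_steps(1.0, 0.5, 0.5))  # -> (1.5, 0.5)
```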
4.3 Is Fixation Disparity Due to Asymmetry in Vergence Dynamics? We tested the quantitative relationship between E(V) and λ(V) as predicted by the static vergence model. Due to the small-signal linearity exhibited by the vergence system around a given pedestal demand (Rashbass & Westheimer, 1961), γ(V) during a small vergence step response (∼ ±2 degrees) is approximated by

γ(V) ≈ √[(Peak convergence velocity around V / Corrected convergence step input) / (Peak divergence velocity around V / Corrected divergence step input)]. (4.1)
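The approximation of equation 4.1, combined with equations 2.14 and 2.16, can be sketched as a short calculation. The peak velocities, corrected steps, and disparity spread below are illustrative assumptions, not measured data.

```python
import math

# Sketch: gamma(V) from peak velocities normalized by the corrected step
# inputs (equation 4.1), lambda(V) from equation 2.16, and the predicted
# fixation disparity E(V) = n * lambda(V) from equation 2.14.

def gamma_from_peaks(conv_peak, conv_step, div_peak, div_step):
    return math.sqrt((conv_peak / conv_step) / (div_peak / div_step))

g = gamma_from_peaks(conv_peak=8.0, conv_step=2.2,   # deg/s, deg (assumed)
                     div_peak=12.0, div_step=1.8)    # deg/s, deg (assumed)
lam = (1.0 - g) / (1.0 + g)
n = 0.5                                              # assumed disparity spread (deg)
print(n * lam)   # positive -> predicted underconvergence
```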
Figure 7b shows the computed sensory motor gains for the convergence and divergence pathways. A three-way repeated measures ANOVA indicates a significant difference between convergence and divergence gains (F[1, 4] = 11.9, p = .03). The magnitude of the average divergence gain increases with an increase in pedestal demand, while the convergence gain decreases slightly, as expected for subjects who exhibit underconvergence. To determine the relationship between fixation disparity and vergence asymmetry, γ(V) was computed from the data shown in Figures 6a and 7a using equation 4.1. λ(V) was computed from γ(V) using equation 2.16. Standard error propagation techniques (Bevington, 1969) were used to propagate standard error estimates through all calculations. Figure 8a shows the linear relationship between λ(V) and fixation disparity E(V) for all subjects. Figure 6: Facing page. Peak convergence and divergence velocities and fixation disparities. The symbols for subjects are the same in all subsequent figures and are shown in b. (a) Average peak convergence and divergence velocities and their relationship with the pedestal convergence demand for five subjects. The maximum standard error (the bar shown in the upper left corner of each panel) in velocity for any subject was 0.96 degree per second for convergence velocity and 1.37 degrees per second for divergence velocity. The number of observations ranged from two to six. (b) Average fixation disparities for various step demands within a staircase paradigm. Positive values on the y-axis indicate underconvergence. The largest staircase, which consisted of a 1 degree first step followed by four consecutive 2 degree steps, is generated when using an 8 degree pedestal demand. Note that the 8 degree pedestal demand is obtained by toggling the three pulses between 7 and 9 degree demands. The fixation disparity values for 2 and 8 degree demands are interpolated by averaging the fixation disparities from the corresponding neighboring demands. 
For example, the fixation disparity for the 2 degree demand is the average of the fixation disparities at the 1 and 3 degree demands. The maximum standard error (the bar shown in the upper left corner) in fixation disparity for any subject was 0.024 degree. The number of observations ranged from two to six.
1514
S. S. Patel, B.-C. Jiang, and H. Ogmen
Notice also that the gains for convergence and divergence are very similar for the lowest pedestal demand (see Figure 7b). As shown in Figure 8a, the fixation disparity for four out of five subjects is very close to zero at the lowest demand, supporting the prediction of the static vergence model that relates zero error to perfect symmetry (λ(V) = 0). The fifth subject (JCK) shows a small bias that may be due to a constant physiological or anatomical error with respect to the optical axis. Figure 8a also shows the nature of intersubject variability. Hence, our results strongly suggest that fixation disparity arises from an asymmetry between the dynamics of the convergence and divergence pathways when stimulated
Vergence Dynamics Predict Fixation Disparity
1515
Figure 8: Relationship between fixation disparity E(V) and λ(V) for five subjects. Positive fixation disparity indicates underconvergence. Least-squares linear regression lines are fit for each subject using data from three pedestal demands.
by a disparity spread. An exceptional feature of our explanation is that it does not require a leaky position integrator in the vergence system, illustrating that steady-state errors can arise in nonleaky negative-feedback neural control systems.

4.4 Fixation Disparity and Closed-Loop Accommodation. Most clinical fixation disparity measurements are obtained under a closed-loop accommodation condition (Ogle et al., 1967); therefore, it is necessary to comment on the influence of accommodation upon fixation disparity and how it relates to the model presented here. It has been previously shown that

Figure 7: Facing page. Asymmetric stimulation and sensory motor gains. (a) The stimulating step sizes at various pedestal demands after correcting for the corresponding fixation disparities. Recall that the uncorrected step size is 2 degrees for convergence and divergence movements. The inset illustrates the asymmetric stimulation induced by a symmetric external stimulus (demand) in the presence of fixation disparity. The dotted line is the vergence angle for the corresponding demand shown in solid lines (pulse). The difference between the solid and dotted lines is the fixation disparity. The up (down) arrow indicates the actual convergence (divergence) step input to the vergence system. (b) Average sensory motor gains of the convergence and divergence pathways as a function of pedestal demand for five subjects. The convergence (divergence) gains are computed using the numerator (denominator) under the square-root sign of equation 4.1. The error bars in all figures represent ±1 standard error.
the slope of the relationship between fixation disparity and vergence demand is modified when accommodation operates in open- versus closed-loop condition. Semmlow and Hung (1979) showed that the slope of the fixation disparity curve becomes shallower when accommodation operates in open-loop compared to closed-loop condition, and their data are consistent with current models of accommodation and vergence (Hung & Semmlow, 1980; Schor, 1992). Contrary to Semmlow and Hung's data, Hessler, Pickwell, and Gilchrist (1989) showed that the slope of the fixation disparity curve becomes steeper when accommodation operates in open-loop compared to closed-loop condition. Regardless of the actual sign of the change in slope, one can determine the fixation disparity curve under a closed-loop accommodation condition from that under the open-loop accommodation condition by adding a term:

FD(V, A) = E(V) + \alpha V + \beta A + \delta,  \qquad (4.2)

where α represents the change in slope of the fixation disparity curve from the closed-loop to the open-loop accommodation condition, β represents the shift in the fixation disparity curve due to accommodation, δ is a constant bias, and A is the accommodative demand. Note that the above analysis is approximate; it assumes linear interactions between the accommodation and vergence systems, as suggested by current models of accommodation and vergence (Hung & Semmlow, 1980; Schor, 1992). Ogle et al. (1967) showed that for many subjects, the fixation disparity curves did not merely shift but also changed shape when accommodation changed from far to near and vice versa. This phenomenon cannot be explained by linear interactions of vergence and accommodation and points to a severe limitation of existing models of accommodation and vergence. Further, Wick and Joubert (1988) have shown that blur affects fixation disparity in a highly nonlinear way. A more formal and accurate analysis based on our model requires a complete neural network model of accommodation and vergence.

5 Other Factors Affecting Fixation Disparity

Several vergence parameters other than those considered in our model have also been shown to influence fixation disparity. Vergence adaptation, resulting from prolonged binocular viewing, is known to reduce fixation disparity (Schor, 1979a, 1979b; Carter, 1980; Schor, 1980). After vergence adaptation, the fixation disparity at certain demands has been shown to be correlated with the decay of vergence in darkness (Schor et al., 1986). Proximal cues also influence the magnitude of fixation disparity (North, Henson, & Smith, 1993). Viewing distance in some cases alters the form of fixation disparity in the same subjects (Ogle et al., 1967; Kruza, 1993). Contrary to the results of Palmer and von Noorden (1978), heterophorias measured at near distances have been shown to be correlated with fixation disparity (Schor, 1983). Dark
vergence is also known to be correlated with fixation disparity (Kruza, 1994). However, since fixation disparity is observed under conditions that eliminate (or keep fixed) the aforementioned parameters (i.e., in the absence of adaptation, for stimuli without proximal cues, and when accommodation input and viewing distance are kept constant), we suggest that these are modulatory effects rather than the basic neural origin of fixation disparity. These factors may affect fixation disparity indirectly via changes in vergence dynamics. Future work can assess the relative contribution of these modulatory factors in determining vergence dynamics and fixation disparity.

6 Summary

As opposed to an explanation of fixation disparity based on control-type models of the vergence system, our explanation is based on a model that has neurophysiological correlates in primates and has successfully explained most vergence dynamics data. Our experiments and analysis show that fixation disparity can arise from the asymmetry in the dynamics of the convergence and divergence pathways when they are stimulated by a distributed disparity code. Further, our explanation remains consistent with the nonleaky position-integration property of the vergence system; we show that neural leakage is not necessary to explain steady-state errors in the vergence system and, presumably, in other sensorimotor systems driven by distributed sensory coding.

Appendix

We use the same equations (and terminology) presented in earlier work describing the neural network model of the vergence system (Patel et al., 1997) to show the relationship between the experimental findings of Rashbass and Westheimer (1961) and the functions describing the transformation from the output of the disparity encoders to the output of the velocity cells in the model. The convention in this appendix is that equations labeled with capital letters belong to the previous article (Patel et al., 1997) and those labeled with lowercase letters are introduced here. We consider the situation in which the input (or disparity) is clamped (fixed) in time while the output (vergence angle) changes in time. Since the input signal is fixed in time, we can consider the disparity encoders to be in a steady state. The activity of the disparity encoders (equation A.5 in the previous article) is then a temporally fixed, spatially distributed activity profile. For simplicity, we chose a binary rectangular spread:

f_{DC}(x_{DC}, d) = \begin{cases} 1 & \text{if } d_s - n/2 \le d \le d_s + n/2 \\ 0 & \text{otherwise} \end{cases}  \qquad (a.1)
where d is the index of the disparity encoder cell representing disparity d, d_s is the average stimulus disparity, and n is the total spread of activity across disparity encoder cells. The output of the disparity encoder cells feeds into the convergence and divergence velocity cells via the positive synaptic weights w_{d,v\Uparrow} and w_{d,v\Downarrow}. The dynamics of the velocity cells are governed by equations A.6 and A.7, rewritten below:

K_s \frac{dx_{v\Uparrow}}{dt} = -A_{v\Uparrow} x_{v\Uparrow} + \sum_{d=0}^{2N} w_{d,v\Uparrow} f_{DC}(x_{DC}, d)  \qquad (a.2)

K_s \frac{dx_{v\Downarrow}}{dt} = -A_{v\Downarrow} x_{v\Downarrow} + \sum_{d=-2N}^{0} w_{d,v\Downarrow} f_{DC}(x_{DC}, d),  \qquad (a.3)
where x_{v\Uparrow} (x_{v\Downarrow}) is the activity of the convergence (divergence) velocity cell, A_{v\Uparrow} (A_{v\Downarrow}) is the decay constant of the convergence (divergence) velocity cell, -2N \le d \le 2N, f_{DC} is the firing-rate function of the disparity encoder cells (same form as equation A.1; also see equation a.13), and K_s is a scaling constant. The position cells integrate the outputs of the velocity cells according to the following shunting push-pull equations (same as equation A.8):

K_s \frac{dx_{p\Uparrow}}{dt} = (B_{p\Uparrow} - x_{p\Uparrow}) w_{v\Uparrow,p\Uparrow} f_{v\Uparrow}(x_{v\Uparrow}) - (D_{p\Uparrow} + x_{p\Uparrow}) w_{v\Downarrow,p\Uparrow} f_{v\Downarrow}(x_{v\Downarrow})  \qquad (a.4)

K_s \frac{dx_{p\Downarrow}}{dt} = (B_{p\Downarrow} - x_{p\Downarrow}) w_{v\Downarrow,p\Downarrow} f_{v\Downarrow}(x_{v\Downarrow}) - (D_{p\Downarrow} + x_{p\Downarrow}) w_{v\Uparrow,p\Downarrow} f_{v\Uparrow}(x_{v\Uparrow}),  \qquad (a.5)
where x_{p\Uparrow} (x_{p\Downarrow}) is the activity of the convergence (divergence) position cell, B_{p\Uparrow} (B_{p\Downarrow}) is the upper bound for the activity of the convergence (divergence) position cell, w_{v\Uparrow,p\Uparrow} (w_{v\Uparrow,p\Downarrow}) and w_{v\Downarrow,p\Uparrow} (w_{v\Downarrow,p\Downarrow}) are the synaptic weights between the outputs of the convergence and divergence velocity cells and the input to the convergence (divergence) position cells, respectively, f_{v\Uparrow} (f_{v\Downarrow}) is the firing-rate function of the convergence (divergence) velocity cell, and -D_{p\Uparrow} (-D_{p\Downarrow}) is the lower bound for the activity of the convergence (divergence) position cell. Notice the absence of a passive decay term, which makes the integration by the position cells nonleaky. Finally, the output of the position cells drives the motoneurons and the plant of the left and right eyes, whose dynamics are described by the following equations (same as equation A.17 but without the last two terms):

K_p \frac{d\theta^L}{dt} = -\frac{1}{\tau^L}\theta^L + K^L_{p\Uparrow} f_{p\Uparrow}(x_{p\Uparrow}) - K^L_{p\Downarrow} f_{p\Downarrow}(x_{p\Downarrow})  \qquad (a.6)

K_p \frac{d\theta^R}{dt} = -\frac{1}{\tau^R}\theta^R + K^R_{p\Downarrow} f_{p\Downarrow}(x_{p\Downarrow}) - K^R_{p\Uparrow} f_{p\Uparrow}(x_{p\Uparrow}),  \qquad (a.7)
where \theta^L (\theta^R) is the position of the left (right) eye in the head, \tau^L (\tau^R) is the time constant of the motoneuron and plant system of the left (right) eye, K^L_{p\Uparrow} (K^R_{p\Uparrow}) and K^L_{p\Downarrow} (K^R_{p\Downarrow}) are the gains of the convergence and divergence position cells for the left (right) eye, f_{p\Uparrow} (f_{p\Downarrow}) is the firing-rate function of the convergence (divergence) position cell, and K_p is a scaling constant. For simplicity, we have ignored the signals from the velocity overdrive circuit.

Now let us consider Rashbass and Westheimer's (1961) experiment. In this experiment, the input (or disparity) was clamped (held constant) and the vergence velocity was measured. The finding is that within a small range of disparities (at least ±2 degrees; see Figure 22 in Rashbass & Westheimer, 1961), the eye velocity (which remained almost constant in time) is a linear function of the clamped disparity. Let us apply this finding to our model. The dependent variable in the experiment is the vergence velocity, dV_{rw}/dt, defined by

\frac{dV_{rw}}{dt} = \frac{d}{dt}(\theta^L - \theta^R).  \qquad (a.8)

The finding is that

\frac{dV_{rw}}{dt} = K_{rw} d_s,  \qquad (a.9)

where K_{rw} is a constant estimated to be about 10 degrees per second per degree of disparity and d_s is the average target disparity. By substituting equations a.6 and a.7 in equation a.8, we have

K_{rw} d_s = \frac{1}{K_p}\left\{ -\frac{1}{\tau^L}\theta^L + \frac{1}{\tau^R}\theta^R + \left(K^L_{p\Uparrow} + K^R_{p\Uparrow}\right) f_{p\Uparrow}(x_{p\Uparrow}) - \left(K^L_{p\Downarrow} + K^R_{p\Downarrow}\right) f_{p\Downarrow}(x_{p\Downarrow}) \right\}.

By using K^L_{p\Uparrow} + K^R_{p\Uparrow} = \Delta_{p\Uparrow}, K^L_{p\Downarrow} + K^R_{p\Downarrow} = \Delta_{p\Downarrow}, and \tau^L = \tau^R = \tau in the previous equation, we have

K_{rw} d_s = \frac{1}{K_p}\left\{ -\frac{1}{\tau}(\theta^L - \theta^R) + \Delta_{p\Uparrow} f_{p\Uparrow}(x_{p\Uparrow}) - \Delta_{p\Downarrow} f_{p\Downarrow}(x_{p\Downarrow}) \right\}.  \qquad (a.10)

Differentiating both sides of equation a.10 and using Rashbass and Westheimer's result from equation a.9 gives

0 = -\frac{1}{\tau} K_{rw} d_s + \Delta_{p\Uparrow} f'_{p\Uparrow}(x_{p\Uparrow})\,\dot{x}_{p\Uparrow} - \Delta_{p\Downarrow} f'_{p\Downarrow}(x_{p\Downarrow})\,\dot{x}_{p\Downarrow}.  \qquad (a.11)
Because the target used in Rashbass and Westheimer's experiments was very small (0.08 degree wide), the results obtained in their experiments
are likely due to the case where d_s \ge n/2. In this case, the disparity spread remains on either the convergence or the divergence side, and only one of the two sensory pathways is activated. For definiteness, assume that only the convergence pathway is activated. Starting from equation a.11, we have

\frac{1}{\tau} K_{rw} d_s = \Delta_{p\Uparrow} f'_{p\Uparrow}(x_{p\Uparrow})\,\dot{x}_{p\Uparrow} - \Delta_{p\Downarrow} f'_{p\Downarrow}(x_{p\Downarrow})\,\dot{x}_{p\Downarrow}.  \qquad (a.12)
The function f_{p\Uparrow} was chosen to have the form (same as equation A.1)

f(x) = \begin{cases} 0 & \text{if } x < \Gamma \\ \alpha x & \text{if } \Gamma \le x < \Omega \\ \alpha\Omega & \text{if } \Omega \le x \end{cases}  \qquad (a.13)

where \Gamma, \Omega, and \alpha are constants determining the threshold, saturation, and gain of the firing function, respectively. Because the position cells are active, x_{p\Uparrow} > \Gamma and x_{p\Downarrow} > \Gamma, and because the experiment is conducted for small disparities and measurements are made before the eye plant system saturates, we have \Gamma \le x_{p\Uparrow} \le \Omega and \Gamma \le x_{p\Downarrow} \le \Omega, so that f_{p\Uparrow}(x_{p\Uparrow}) = \alpha_{p\Uparrow} x_{p\Uparrow} and f_{p\Downarrow}(x_{p\Downarrow}) = \alpha_{p\Downarrow} x_{p\Downarrow}, and f'_{p\Uparrow}(x_{p\Uparrow}) = \alpha_{p\Uparrow} and f'_{p\Downarrow}(x_{p\Downarrow}) = \alpha_{p\Downarrow}. Further, if we assume that the convergence and divergence position cells have identical parameters (see Table B1 in Patel et al., 1997) and that the plant characteristics downstream from the position cells are identical for convergence and divergence (see Table 3 in Patel et al., 1997), then \alpha_{p\Uparrow} = \alpha_{p\Downarrow} = \alpha_p and \Delta_{p\Uparrow} = \Delta_{p\Downarrow} = \Delta. Therefore equation a.12 becomes

\frac{1}{\tau} K_{rw} d_s = \Delta\,\alpha_p \left\{ \dot{x}_{p\Uparrow} - \dot{x}_{p\Downarrow} \right\}.  \qquad (a.14)
By substituting equations a.4 and a.5 in equation a.14, we have

\frac{K_s K_{rw} d_s}{\tau\,\Delta\,\alpha_p} = (B_{p\Uparrow} - x_{p\Uparrow}) w_{v\Uparrow,p\Uparrow} f_{v\Uparrow}(x_{v\Uparrow}) - (D_{p\Uparrow} + x_{p\Uparrow}) w_{v\Downarrow,p\Uparrow} f_{v\Downarrow}(x_{v\Downarrow}) - (B_{p\Downarrow} - x_{p\Downarrow}) w_{v\Downarrow,p\Downarrow} f_{v\Downarrow}(x_{v\Downarrow}) + (D_{p\Downarrow} + x_{p\Downarrow}) w_{v\Uparrow,p\Downarrow} f_{v\Uparrow}(x_{v\Uparrow}).  \qquad (a.15)
Because only the convergence sensory pathway is active, f_{v\Downarrow}(x_{v\Downarrow}) = 0. Also, as mentioned earlier, because the data in Rashbass and Westheimer's study were collected for a small range of eye positions (from an initial convergence of 1.7 degrees), or only the initial portion (within the delay period) of the vergence eye movement, we have x_{p\Uparrow} K.

If the number M of mixtures is greater than the number K of source signals, then a standard procedure
Blind Source Separation Using Temporal Predictability
1565
for reducing M consists of using principal component analysis (PCA). PCA reduces the dimensionality of the signal mixtures by discarding eigenvectors of x that have eigenvalues close to zero. However, in this article, only mixtures for which K = M are analyzed.

2.3 A Physical Interpretation. The solutions found by the method are the eigenvectors (W_1, W_2, \ldots, W_M) of the matrix \tilde{C}^{-1}C. These eigenvectors are orthogonal in the metrics C and \tilde{C}:

W_i \tilde{C} W_j^T = 0, \qquad W_i C W_j^T = 0,  \qquad (2.5)

where

W_i \tilde{C} W_j^T = \sum_\tau (y_{i\tau} - \tilde{y}_{i\tau})(y_{j\tau} - \tilde{y}_{j\tau}), \qquad W_i C W_j^T = \sum_\tau (y_{i\tau} - \bar{y}_{i\tau})(y_{j\tau} - \bar{y}_{j\tau}).  \qquad (2.6)
Given equations 2.5, a simple and physically realistic interpretation of the method can be demonstrated. Consider the short-term and long-term half-life parameters h_S and h_L in the limits h_S \to 0 and h_L \to \infty. First, in the limit h_S \to 0, the short-term mean is \tilde{y}_\tau \approx y_{\tau-1}, and therefore (y_\tau - \tilde{y}_\tau) \approx dy_\tau/d\tau = y'_\tau. Second, if y has zero mean, then in the limit h_L \to \infty, the long-term mean \bar{y} \approx 0, and therefore (y_\tau - \bar{y}_\tau) \approx y_\tau. In these limiting cases, equations 2.5 and 2.6 imply that

E[y'_i y'_j] = 0, \qquad E[y_i y_j] = 0,  \qquad (2.7)
where E[·] denotes an expectation value. Thus, one interpretation of the method is that each extracted signal y_i = W_i x is uncorrelated with every other extracted signal y_j = W_j x, and the temporal derivative y'_i = W_i x' of each extracted signal is uncorrelated with the temporal derivative y'_j = W_j x' of every other extracted signal, where x' is the temporal derivative of the mixtures x. Critically, if two signals y_i and y_j are statistically independent, then the conditions specified in equation 2.7 are met. Therefore, mixtures of independent signals are guaranteed to be separable by the method, at least in the limiting cases specified above.

3 Results

Three experiments using the method described above were implemented in Matlab. In each experiment, K source signals were used to generate M =
1566
James V Stone
Table 1: Correlation Magnitudes Between Source Signals and Signals Recovered from Mixtures of Source Signals with Different pdfs.

              s1       s2       s3
    y1     0.000    0.001    1.000
    y2     1.000    0.000    0.000
    y3     0.042    0.999    0.002
K signal mixtures, using a K × K mixing matrix, and these M mixtures were used as input to the method. Each mixture signal was normalized to have zero mean and unit variance. Each mixing matrix was obtained using the Matlab randn function. The short-term and long-term half-lives defined in equation 1.2 were set to h_S = 1 and h_L = 9000, respectively. Correlations between source signals and recovered signals are reported as absolute values. Results were obtained in under 60 seconds on a Macintosh G3 (233 MHz) for all experiments reported here, using nonoptimized Matlab code. In each case, an un-mixing matrix was obtained as the solution to a generalized eigenvalue problem using the Matlab eigenvalue function W = eig(C, C̃).

3.1 Separating Mixtures of Signals with Different Pdfs. Three source signals s = (s_1, s_2, s_3)^T are displayed in Figure 2: (1) a supergaussian signal (the sound of a gong), (2) a subgaussian signal (a sine wave), and (3) a gaussian signal. Signal 3 was generated using the randn procedure in Matlab, and temporal structure was imposed on the signal by sorting its values in ascending order. These three signals were mixed using a random matrix A to yield a set of three signal mixtures: x = As. Each signal consisted of 3000 samples; the first 1000 samples of each mixture are shown in Figure 1. The correlations between source signals and recovered signals are given in Table 1. Each of the three recovered signals had a correlation of r > 0.99 with only one of the source signals, and all other correlations were close to zero. Note that the mixtures used here do not include any temporal delays or echoes.

3.2 Separating Mixtures of Sounds. For each experiment reported here, 50,000 data points were sampled at a rate of 44,100 Hz, using a microphone to record different voices from a VHF radio onto a Macintosh computer. Two sets of eight sounds were recorded: male and female voices, and classical music with and without singing.

3.2.1 Separating Voices.
The method was tested on mixtures of normal speech. Correlations between each source signal and every recovered signal for four and eight voices are given in Tables 2 and 3, respectively. Graphs
Figure 1: Three signal mixtures used as input to the method. See Figure 2 for a description of the three source signals used to synthesize these mixtures. Only the first 1000 of the 9000 samples used in experiments are shown here. The ordinal axis displays signal amplitude.
of original (source) voice amplitudes and the signals recovered from a set of four mixtures (not shown) are shown in Figure 3. Note that correlations are approximately r = 0.99. With correlations this high, it is not possible to hear the difference between the original and recovered speech signals. A correlation of 0.956 was found between source 8 and recovered signal 3; this represents the worst performance of the method on all data sets described in this article.

Table 2: Correlation Magnitudes Between Each of Four Source Signals and Every Signal Recovered (y1, . . . , y4) by the Method.

    Source signals (voices):
              s1       s2       s3       s4
    y1     0.097    0.994    0.028    0.049
    y2     0.996    0.081    0.012    0.019
    y3     0.002    0.042    0.995    0.095
    y4     0.030    0.076    0.101    0.992

Note: Each source signal has a high correlation with only one recovered signal.
Figure 2: Three signals with different probability density functions. A supergaussian gong sound, a subgaussian sine wave, and sorted gaussian noise (see text) are displayed from top to bottom respectively. In each graph, the source signals used to synthesize the mixtures displayed in Figure 1 are shown as a solid line, and corresponding signals recovered from those mixtures are shown as dotted lines. Each source signal and its corresponding recovered signal have been shifted vertically for display purposes. The correlations between source and recovered signals are greater than r = 0.999 (see Table 1). Only the first 1000 of the 9000 samples used are shown here. The ordinal axis displays signal amplitude.
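The recovery procedure used in these experiments (normalize the mixtures, form the short-term and long-term covariance matrices C̃ and C, and solve the generalized eigenvalue problem W = eig(C, C̃)) can be sketched in a few lines. The sketch below is our Python reconstruction, not the author's code: exponentially weighted moving averages stand in for the half-life predictors of equation 1.2, scipy's eig replaces the Matlab call, and the function names are ours.

```python
import numpy as np
from scipy.linalg import eig

def ema(x, half_life):
    """Exponentially weighted moving average of each row of x (an M x T array)."""
    lam = 2.0 ** (-1.0 / half_life)
    out = np.empty_like(x)
    acc = x[:, 0].copy()
    for t in range(x.shape[1]):
        acc = lam * acc + (1.0 - lam) * x[:, t]
        out[:, t] = acc
    return out

def separate(x, h_short=1.0, h_long=9000.0):
    """Recover sources from mixtures x (M x T) by maximizing temporal predictability.

    Rows of the returned W are the unmixing vectors W_i; the recovered signals are W @ x."""
    x = x - x.mean(axis=1, keepdims=True)
    d_short = x - ema(x, h_short)          # short-term prediction errors
    d_long = x - ema(x, h_long)            # long-term prediction errors
    C_tilde = d_short @ d_short.T          # short-term covariance, C-tilde
    C = d_long @ d_long.T                  # long-term covariance, C
    _, V = eig(C, C_tilde)                 # eigenvectors of inv(C_tilde) @ C
    W = np.real(V).T                       # rows are unmixing vectors
    return W @ x, W
```

On mixtures of signals with sufficiently different predictability (e.g., two sinusoids of different frequencies), each row of the recovered array correlates with exactly one source, up to sign and scale.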
3.2.2 Separating Music. The method was tested on mixtures of eight segments of music. Correlations between each source signal and each recovered signal for the eight music segments are given in Table 4. Again, correlations are approximately r = 0.99, and it is not possible to hear the difference between the original and recovered music signals. The method seems to be largely insensitive to the values used for the short-term and long-term half-lives defined in equation 1.2, provided the latter is much larger than the former.

3.3 How Maximizing Predictability Can Fail. The method is based on the assumption that different source signals are associated (via W_i) with distinct critical points in F. However, if any two source signals have the
Figure 3: Four voices (two male and two female). In each graph, the source signals used to synthesize four mixtures (not shown) are shown as solid lines, and corresponding signals recovered from these mixtures are shown as dotted lines. Each source signal and its corresponding recovered signal have been shifted vertically for display purposes. The correlations between source and recovered signals are greater than r = 0.99 (see Table 2). Only the first 1000 of the 50,000 samples used are shown here. The ordinal axis displays signal amplitude.

Table 3: Correlation Magnitudes Between Each of Eight Source Signals and Every Signal Recovered by the Method.

    Source signals (voices):
              s1       s2       s3       s4       s5       s6       s7       s8
    y1     0.001    0.008    0.004    0.003    0.028    0.046    0.988    0.143
    y2     0.994    0.013    0.003    0.001    0.001    0.017    0.016    0.109
    y3     0.179    0.001    0.162    0.011    0.037    0.102    0.134    0.956
    y4     0.015    0.012    0.007    0.999    0.024    0.032    0.004    0.004
    y5     0.004    0.020    0.993    0.000    0.021    0.008    0.007    0.109
    y6     0.010    0.003    0.026    0.018    0.021    0.992    0.044    0.111
    y7     0.027    0.999    0.012    0.002    0.000    0.010    0.009    0.002
    y8     0.015    0.003    0.027    0.022    0.998    0.020    0.021    0.043

Note: Each source signal has a high correlation with only one recovered signal.
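The correlation magnitudes reported in Tables 1 through 4 are straightforward to compute; a short numpy helper (the function name is ours, not from the article):

```python
import numpy as np

def correlation_magnitudes(recovered, sources):
    """|correlation| between each recovered signal (row of `recovered`)
    and each source signal (row of `sources`); both arrays are K x T."""
    k = recovered.shape[0]
    c = np.corrcoef(np.vstack([recovered, sources]))
    return np.abs(c[:k, k:])   # entry (i, j) is |corr(y_i, s_j)|
```

A recovered set is judged successful when each row of this matrix has a single entry near 1 and the rest near 0, as in the tables.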
Table 4: Correlation Magnitudes Between Each of Eight Source Signals and Every Signal Recovered (y1, . . . , y8) by the Method.

    Source signals (classical music):
              s1       s2       s3       s4       s5       s6       s7       s8
    y1     0.096    0.993    0.001    0.001    0.019    0.030    0.035    0.043
    y2     0.088    0.032    0.000    0.002    0.000    0.991    0.038    0.086
    y3     0.006    0.070    0.005    0.001    0.016    0.094    0.085    0.990
    y4     0.028    0.044    0.018    0.002    0.068    0.034    0.991    0.093
    y5     0.005    0.007    0.998    0.066    0.006    0.009    0.021    0.010
    y6     0.995    0.083    0.001    0.008    0.007    0.055    0.018    0.007
    y7     0.000    0.001    0.075    0.997    0.002    0.002    0.000    0.000
    y8     0.008    0.027    0.022    0.012    0.995    0.006    0.098    0.019

Note: Each source signal has a high correlation with only one recovered signal.
same degree of predictability F, then the two eigenvectors W_i and W_j have equal eigenvalues (and are associated with the same critical points in F). Therefore, any vector W_k that lies in the plane defined by W_i and W_j also maximizes F, but W_k cannot (in general) be used to extract a source signal. This has been demonstrated (not shown here) by creating two mixtures x = As from two signals s_1 and s_2, where s_1 is a time-reversed version of s_2. Although s_1 and s_2 have different time courses, they share exactly the same degree of predictability F and cannot be extracted from the mixtures x using this method. In practice, signals from different sources (e.g., voices) typically can be separated because each source signal has a unique degree of predictability (i.e., value of F). Indeed, every set of signals in which each signal is from a physically distinct source (e.g., voices) has been successfully separated in the many experiments used in the preparation of this article.

4 Discussion

Methods for recovering statistically independent source signals from signal mixtures work by taking advantage of generic differences between the properties of signals and their mixtures. Three such differences were summarized in the introduction: (1) temporal predictability (conjecture: the temporal predictability of any signal mixture is less than or equal to that of any of its component source signals); (2) gaussian probability density function (the central limit theorem ensures that the extent to which the pdf of any mixture approximates a gaussian distribution is greater than or equal to that of any of its component source signals); and (3) statistical independence (the degree of statistical independence between any two signal mixtures is less than or equal to the degree of independence between any two source signals). The last two have previously been used as a basis for signal separation, and the first has been used to augment source separation (e.g., Pearlmutter & Parra, 1996; Attias, 2000; Porrill, Stone, Berwick, Mayhew, & Coffey, 2000). In this article, only the first property (temporal predictability) has been used to separate signal mixtures. While the method has a low-order polynomial time complexity of O(N^3), this does not necessarily imply that it finds solutions more quickly than other source separation methods (see Comon & Chevalier, 2000, for an analysis of the time complexity of ICA methods). However, the fact that each simulation reported here ran in under 60 seconds (on a Macintosh G3) may indicate that the method is reasonably fast. This issue can be resolved only by a direct comparison of different methods on the same data sets. One desirable property of the method described in this article is that local extrema are not an issue. This contrasts with other methods, for which the existence of local extrema may be difficult to detect (Ding, Johnson, & Kennedy, 1994). Of the methods reviewed in section 1, the method that Molgedey and Schuster (1994) described is mathematically most similar to the one described in this article, inasmuch as both methods involve a generalized eigenvalue problem. As stated in section 1, Molgedey and Schuster implicitly assume that source signals are uncorrelated at two different time lags. In contrast, one interpretation of the assumptions underlying the method presented here is that source signals are uncorrelated and their corresponding temporal derivatives are also uncorrelated (in the limiting cases specified in section 2.3). Thus, although both methods can be formulated as generalized eigenvalue problems, the assumptions each method makes about the nature of the source signals are qualitatively very different.
As defined above, the predictions (\bar{y}_\tau and \tilde{y}_\tau) of each signal value y_\tau are based on a linear weighted sum of previous values \{y\}. Natural extensions to this method could involve defining predicted values of y_\tau as general functions of previous signal values \{y\}, yielding functionals of the general form

G = \log \frac{\sum_{\tau=1}^{n} \left( f(\{y\}) - y_\tau \right)^2}{\sum_{\tau=1}^{n} \left( g(\{y\}) - y_\tau \right)^2}.  \qquad (4.1)

Here, the functions f and g provide predictions of y_\tau based on previous values \{y\} of y_i, such that G is invariant with respect to the norm of W_i (recall that y = W_i x). For example, the functions f and g could be defined in terms of linear predictive coefficients, as in Pearlmutter and Parra (1996), or in terms of the exponents p and q in f(y) = y^p, g(y) = y^q. The ability to incorporate specific types of models into the method could also be used to deal with signal sources that contain echoes and delays. More generally, information-theoretic measures of temporal structure, such as approximate entropy (Pincus & Singer, 1996), may prove useful as measures of temporal predictability for source separation (approximate entropy was investigated
in the development of the method presented here). In particular, the issue of sensor (e.g., microphone) noise has not been addressed in this article, and these more elaborate measures of predictability may be robust with respect to such noise. It is noteworthy that the principles underlying the method have been used for unsupervised learning in artificial neural networks (Stone, 1996a, 1996b, 1999; Becker & Hinton, 1992; Becker, 1993). This principle is useful for both unsupervised learning and source signal separation precisely because it is based on a fundamental property of the physical world: temporal predictability. However, predictability is not only a property of the temporal domain; an obvious extension (explored in Becker & Hinton, 1992; Eglen et al., 1997; Stone & Bray, 1995) is to apply the principle to the spatial domain. Additionally, the assumptions of temporal or spatial predictability used in the methods just described can be combined in a method that assumes a degree of temporal and spatial predictability. Specifically, methods that maximize predictability in space or predictability in time can be replaced by a method that maximizes predictability over space and time. An analogous spatiotemporal ICA method has been described in Stone, Porrill, Buchel, and Friston (1999). If all three properties listed above apply to any statistically independent source signals and their mixtures, a method that relies on constraints from all of these properties might be expected to deal with a wide range of signal types. It is widely acknowledged that ICA forces statistical independence on recovered signals, even if the underlying source signals are not independent. Similarly, the current method may impose temporal predictability on recovered signals even where none exists in the underlying source signals. 
Therefore, a method that incorporates constraints from all three properties should be relatively insensitive to violations of the assumptions on which the method is based. A framework for incorporating experimentally relevant constraints based on physically realistic properties has been formulated in the form of weak models (Porrill et al., 2000) and has been used to constrain ICA’s solutions. In particular, the function F has the correct form for a weak model and has been shown to improve solutions found by ICA (Stone & Porrill, 1999). Finally, the method described here may be useful in the analysis of medical images and electroencephalogram data. In a seminal article, Bell and Sejnowski (1995) compared their ICA method to Becker and Hinton’s IMAX method (1992) and speculated that “some way may be found to view the two in the same light.” In fact, if the hidden layers of nonlinear units in a temporal IMAX network (Becker, 1992) are removed, then the resultant system of equations is given by equation 1.1 in the limits (hS → 0) and (hL → ∞). The two methods can then be viewed in the same light: whereas ICA recovers source signals by maximizing the mutual information between input and output, (temporal) IMAX and the method described here recover source signals by maximizing the mutual information I(yτ ; yτ −1 ) between a recovered signal y at time τ and time (τ − 1).
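The predictability measure F of equation 1.1, the log ratio of long-term to short-term prediction-error variance, can also be evaluated directly for a single signal. The sketch below is our reconstruction under the EMA-predictor reading of the half-life parameters; the function names are ours:

```python
import numpy as np

def ema(y, half_life):
    """Causal exponentially weighted moving average of a 1-D signal."""
    lam = 2.0 ** (-1.0 / half_life)
    out = np.empty_like(y)
    acc = y[0]
    for t in range(y.size):
        acc = lam * acc + (1.0 - lam) * y[t]
        out[t] = acc
    return out

def predictability(y, h_short=1.0, h_long=9000.0):
    """F = log(long-term prediction-error variance / short-term prediction-error variance)."""
    v_long = np.mean((y - ema(y, h_long)) ** 2)    # overall variability V
    u_short = np.mean((y - ema(y, h_short)) ** 2)  # short-term "roughness" U
    return np.log(v_long / u_short)

t = np.arange(5000) / 1000.0
f_slow = predictability(np.sin(2.0 * np.pi * 2.0 * t))        # smooth, highly predictable
rng = np.random.default_rng(0)
f_noise = predictability(rng.standard_normal(5000))           # unpredictable
```

A smooth, slowly varying signal scores much higher than white noise, which is what gives each physically distinct source its own value of F.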
Acknowledgments Thanks to J. Porrill for discussions of the generalized eigenvalue method. Thanks to R. Lister, D. Johnston, N. Hunkin, S. Isard, K. Friston, D. Buckley, and two anonymous referees for comments on this article. This research was supported by a Mathematical Biology Wellcome Fellowship (Grant Number 044823). References Attias, H. (2000). Independent factor analysis with temporally structured factors. In S. A. Solla, T. K. Leen, & K.-R. M. (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press. Becker, S. (1993). Learning to categorize objects using temporal coherence. In S. J. Hanson, J. Cowan, & C. L. Giles (Eds.), Neural information processing systems, 5 (pp. 361–368). San Mateo, CA: Morgan Kaufmann. Becker, S., & Hinton, G. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 335, 161–163. Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159. Bellini, S. (1994). Bussgang techniques for blind deconvolution and equalization. In S. Haykin (Ed.), Blind deconvolution (pp. 8–59). Englewood Cliffs, NJ: Prentice Hall. Borga, M. (1998). Learning multidimensional signal processing. Unpublished doctoral dissertation Linkoping University, Linkoping, Sweden. Cardoso, J. (1998). On the stability of some source separation algorithms. In Proceddings of Neural Networks for Signal Processing ’98 (pp. 13–22). Cambridge, England. Comon, P., & Chevalier, P. (2000). Blind source separation: Models, concepts, algorithms and performance. In S. Haykin (Ed.), Unsupervised adaptive filtering, Vol. 1: Blind source separation (pp. 191–235). New York: Wiley. Ding, Z., Hohnson, C., & Kennedy, R. (1994). Global convergence issues with linear blind adaptive equalizers. In S. Haykin (Ed.), Blind deconvolution (pp. 60– 115). Englewood Cliffs, NJ: Prentice Hall. Eglen, S., Bray, A., & Stone, J. (1997). 
Unsupervised discovery of invariances. Network, 8, 441–452.
Friedman, J. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82(397), 249–266.
Hyvärinen, A. (2000). Complexity pursuit: Combining nongaussianity and autocorrelations for signal separation. In Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2000) (pp. 175–180). Helsinki, Finland.
Jutten, C., & Hérault, J. (1988). Independent component analysis versus PCA. In Proc. EUSIPCO (pp. 643–646).
Molgedey, L., & Schuster, H. (1994). Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, 72(23), 3634–3637.
James V Stone
Pearlmutter, B., & Parra, L. (1996). A context-sensitive generalization of ICA. In International Conference on Neural Information Processing. Hong Kong. Available online at: http://www.cs.unm.edu/~bap/publications.html#journal.
Pincus, S., & Singer, B. (1996). Randomness and degrees of irregularity. Proceedings of the National Academy of Sciences USA, 93, 2083–2088.
Porrill, J., Stone, J. V., Berwick, J., Mayhew, J., & Coffey, P. (2000). Analysis of optical imaging data using weak models and ICA. In M. Girolami (Ed.), Advances in independent component analysis. Berlin: Springer-Verlag.
Stone, J. V. (1996a). A canonical microfunction for learning perceptual invariances. Perception, 25(2), 207–220.
Stone, J. V. (1996b). Learning perceptually salient visual parameters through spatiotemporal smoothness constraints. Neural Computation, 8(7), 1463–1492.
Stone, J. V. (1999). Learning perceptually salient visual parameters using spatiotemporal smoothness constraints. In G. Hinton & T. Sejnowski (Eds.), Unsupervised learning: Foundations of neural computation (pp. 71–100). Cambridge, MA: MIT Press.
Stone, J. V., & Bray, A. (1995). A learning rule for extracting spatio-temporal invariances. Network, 6(3), 1–8.
Stone, J. V., & Porrill, J. (1999). Regularisation using spatiotemporal independence and predictability (Tech. Rep. No. 201). Sheffield: Sheffield University.
Stone, J. V., Porrill, J., Buchel, C., & Friston, K. (1999). Spatial, temporal, and spatiotemporal independent component analysis of FMRI data. In K. V. Mardia, R. G. Aykroyd, & I. L. Dryden (Eds.), Proceedings of the 18th Leeds Statistical Research Workshop on Spatial-Temporal Modelling and Its Applications (pp. 23–28). Leeds: Leeds University Press.
Wu, H., & Principe, J. (1999). A gaussianity measure for blind source separation insensitive to the sign of kurtosis. In Neural Networks for Signal Processing IX: Proceedings of the 1999 Signal Processing Society Workshop (pp. 58–66).
Received February 24, 2000; accepted September 8, 2000.
LETTER
Communicated by Erkki Oja
A Biologically Motivated Solution to the Cocktail Party Problem Brian Sagi∗ Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093-0407, U.S.A.
Syrus C. Nemat-Nasser Department of Physics, University of California, San Diego, La Jolla, CA 92093-0319, U.S.A.
Rex Kerr Department of Biology, University of California, San Diego, La Jolla, CA 92093-0349, U.S.A.
Raja Hayek Christopher Downing Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093-0407, U.S.A.
Robert Hecht-Nielsen Department of Electrical and Computer Engineering, Program in Computational Neurobiology, Institute for Neural Computation, University of California, San Diego, La Jolla, CA 92093-0407, U.S.A., and HNC Software, San Diego, CA 92121, U.S.A.
We present a new approach to the cocktail party problem that uses a cortronic artificial neural network architecture (Hecht-Nielsen, 1998) as the front end of a speech processing system. Our approach is novel in three important respects. First, our method assumes and exploits detailed knowledge of the signals we wish to attend to in the cocktail party environment. Second, our goal is to provide preprocessing in advance of a pattern recognition system rather than to separate one or more of the mixed sources explicitly. Third, the neural network model we employ is more biologically feasible than are most other approaches to the cocktail party problem. Although the focus here is on the cocktail party problem, the method presented in this study can be applied to other areas of information processing.
∗ To whom all correspondence should be directed at [email protected].
© 2001 Massachusetts Institute of Technology. Neural Computation 13, 1575–1602 (2001).
1 Introduction

We consider the cocktail party problem as follows. A human listener is in a room in which many people are speaking. The listener's goal is to follow a single speaker (the attended speaker) and understand what that speaker is saying. To distinguish the above problem from other formulations, we refer to it as the human cocktail party problem. It is important to note that our goal in the human cocktail party problem defined here is not to separate one or more of the sources from the mixture. We may think of our human cocktail party problem as a problem of attended source identification. This difference is significant because even though it is clear that people can perform the attending function quite well, technical (machine) solutions for the problem prove hard to find. Any biologically motivated solutions to the human cocktail party problem are of additional interest since they may provide insight into information processing that takes place when humans solve the human cocktail party problem. Finally, while we make the analogy to human speech recognition in a real cocktail party environment, the same technical approach could be applied to other important areas such as visual information processing, radar signal processing, and other types of sound processing. We model our cocktail party problem as an aspect of the human speech recognition problem in a cocktail party environment. The model environment consists of a collection of speakers in a room, all talking simultaneously. A listener samples this auditory scene through a single microphone. This listener is familiar with the language and would like to focus on a particular speaker as if the other speakers did not exist. In this model, all speakers speak the same language and have the same voice qualities, such as pitch variation, vocal cord frequency, and volume.
Such a set of identical speakers makes the problem more difficult in some respects, since the system cannot rely on voice quality variations to separate the sources. A dilemma we faced when designing the model problem was how to make the model problem as realistic as possible, without getting bogged down by problem-specific preprocessing and immense computational requirements. To solve this dilemma, we adopted a middle path using real-world speech data with a simple synthetic model of a language, allowing us to solve the model cocktail party problem with modest computational resources. The cocktail party problem is traditionally treated as a blind source separation problem, with many techniques offered to handle the separation. Some of the most prominent solutions include the information maximization approach of Bell and Sejnowski (1995) and independent component analysis by the minimization of mutual information as examined by Comon (1994), Amari, Cichocki, and Yang (1996), and Oja and Karhunen (1995) among others. Lee, Girolami, Bell, and Sejnowski (2000) and Amari and Cichocki (1998) provide recent reviews of these techniques. Most approaches to blind source separation focus on reproducing all the original sources
as the measure of performance, and some require multiple microphones (n microphones to separate up to n sources).1 These approaches to the cocktail party problem are blind in the sense that they ignore the properties of the source (other than assuming some general statistical characteristics of the source components). In addition, most methods used in blind source separation require a form of gradient-descent learning or the inversion of large matrices and are therefore computationally intensive and suffer from start-up delays. Another area of research related to the human cocktail party problem is computational auditory scene analysis. These techniques take advantage of properties of real speech, but currently do not exploit language knowledge. For example, voice quality variations such as pitch and frequency characteristics may be used to separate different speakers. Although some research in this area is modeled after biological systems, the computational techniques employed include nonbiological methods of classification. Wang and Brown (1999) give an overview of computational auditory scene analysis and a specific approach to the separation of speakers with added noise. The literature on auditory scene analysis reports modest results at best for cocktail party environments consisting of more than a few speakers. The distinction between the cocktail party problem and the single-speaker speech recognition problem is important. Modern speech recognition systems achieve a high level of accuracy in transcribing a single speaker's speech into text. Most of these systems are based on hidden Markov models (HMMs) and modeled after the SPHINX system (Lee, Hon, Hwang, & Huang, 1996). Nevertheless, these systems can operate only in environments exhibiting a high signal-to-noise ratio. In this work, we do not attempt to duplicate the capabilities of such systems. The cocktail party problem is set in a low signal-to-noise ratio environment.
A system solving the cocktail party problem is analogous to a front end to a speech recognition system that enables the system to operate in a high signal-to-noise ratio environment. For example, Yen and Zhao (1997) and Koutras, Dermatas, and Kokkinakis (1999) report independent efforts to use blind source separation techniques in front of an HMM automatic speech recognition system. Both efforts employ two microphones to separate two speakers in a room to produce a reasonable signal-to-noise ratio suitable for speech recognition. Our approach to the human cocktail party problem uses an artificial neural network called a cortronic network. This network functions as part of the front end of a listener in the model cocktail party environment. The cortronic artificial neural network architecture was proposed by Hecht-Nielsen (1998) and is similar to a Steinbuch Learnmatrix (Steinbuch, 1963) or a Willshaw
1 In fact, this requirement of n microphones has often been included in the definition of blind separation in the past. For example, see Bell and Sejnowski (1995).
associative network (Willshaw, Buneman, & Longuet-Higgins, 1969) with the addition of some new biologically motivated ideas. We use a hierarchical combination of cortronic regions to process temporal information in a biologically motivated way. We rely on training the network on the language spoken by the individual sources in the model cocktail party environment. Our model listener uses the structural knowledge of the language,2 gained through training, together with the incoming speech to form expectations about the content of the attended speaker's speech. For example, familiarity with the language can allow the listener to complete partial phrases. This technical approach differs from other approaches to the cocktail party problem in several important respects. First, we do not depend on an inherent distinction between the sources other than the specific content of their speech. Second, we require only a single microphone regardless of the number of sources in the cocktail. Third, we use specific knowledge of the source. This part of our approach is contrary to the blind aspect of blind source separation but is consistent with the human cocktail party problem. For example, attempts by a human listener to focus on one source in a room of similar-sounding speakers, speaking an unfamiliar foreign language, are unlikely to succeed.

2 Technical Problem Definition

We define our cocktail party environment as follows: All possible sounds belong to a finite set of size d, {s1, s2, . . . , sd}, composed of sound characters. These sounds can be thought of as equal-length temporal snapshots of a speech stream, as with the vector quantization tokens used in commercial speech recognition systems. We define a word wi as a sequence of q sound characters wi = (si(1), si(2), . . . , si(q)), 1 ≤ i(j) ≤ d. The notation si(j) signifies the jth sound in word wi. For simplicity, we assume that all words consist of the same fixed number q of sound characters.
We define a language as a sequence of l words (w1 , w2 , . . . , wl ). The language can be thought of as a very long string of sound characters, containing all valid verbal utterances. To simplify implementation, we require that all words be distinct; that is, no two words are made up of exactly the same set of q sounds in the same order and each word appears exactly once. A word in this model is not directly analogous to an English word but rather is analogous to an English word or phrase in a specific context. We believe that it will not prove difficult to remove the restriction that words must be distinct. For example, we already allow nearly arbitrary sequences of sounds due to the coordination provided by the words, so higher-order structures (phrases composed of words) would allow nearly arbitrary sequences of words.
2 The listener is trained to be familiar with the structure of the language: how sounds are formed into words, words into sentences, and so forth.
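The language construction just defined can be made concrete with a short sketch. This is an illustrative reconstruction only: the parameter values d, q, and l below are assumptions, not the values used in the authors' experiments.

```python
import random

def make_language(d=20, q=3, l=50, seed=0):
    """Toy language: a sequence of l distinct words, each word a sequence of
    q sound-character indices drawn from an alphabet of size d."""
    rng = random.Random(seed)
    words = []
    seen = set()
    while len(words) < l:
        w = tuple(rng.randrange(d) for _ in range(q))
        if w not in seen:            # all words must be distinct
            seen.add(w)
            words.append(w)
    return words

language = make_language()
# Viewed as one long string of sound characters containing all valid utterances:
sound_string = [s for w in language for s in w]
```

The set-based distinctness check enforces the requirement that no two words share exactly the same q sounds in the same order.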
We can now regard the utterances of a speaker as a contiguous subsequence of the language. A speaker picks a particular position j in the language sequence and utters the words in a subsequence beginning with the chosen position: (wj, wj+1, . . .). In our human cocktail party problem, the system is listening to a superposition of p speakers. All speakers are identical except for the specific content. In addition, we disallow multiple speakers from following the same series of words at the same time (an improbable event). We indicate by si(j,k) the kth sound uttered by the jth speaker, where the i function is responsible for mapping the kth sound uttered by the jth speaker to the appropriate sound index. In time slot k, the system receives the superposition Σ_{j=1}^{p} si(j,k) of sound characters (one from each speaker in the mixture). The task of the system is to focus on a particular speaker and isolate the word series that speaker is uttering. The system determines which stream to follow based on an expectation, provided externally or formed recursively (by another part of the system), for the content of the desired speech stream. With this general expectation, the system isolates that speaker's speech stream (wi(g,l)), l = 1, 2, . . . , where g is the attended speaker's index, and the i function is defined (in a similar way to the sound mapping function) to map the lth word speaker g utters to its appropriate word index. It is worthwhile to note that the l (i.e., word) timescale's rate is a factor of q (i.e., the number of sounds per word) slower than the k (i.e., sound) timescale's rate.

3 Model Architecture

Sparse binary associative memories form a central building block in our system. Their origin is in the early Learnmatrix work of Steinbuch in the 1950s (Steinbuch, 1963). In the late 1960s Willshaw and his colleagues (1969) advanced the study of such structures and presented a theory of their optimum storage capacity.
Since the 1970s sparse binary associative memory architectures have received only modest attention, the work of Palm (1980) and Amari (1989) being of particular note. Recently, Hecht-Nielsen (1998) proposed a new theory of the cerebral cortex in which sparse binary associative memories again play a central role. According to Hecht-Nielsen, a basic cerebral operation is association, which is performed in two stages (see Figure 1). A sparse set of neurons (a token) denoted by a vector u is assumed active on one cortronic region—region X. Through synaptic connections between region X and a second region— region Y—the token on X affects the activation level (weighted sum of inputs) of each of the neurons in Y. Region Y is then restarted; a competition takes place in which the m most active neurons in the region “win.” These m winners’ outputs are set to 1, while all other neurons’ outputs are set to 0, forming a sparse binary vector v in region Y. This competitive process is termed a restart competition or restart. We can say that the network associates
[Figure 1 schematic: a sparse binary token u = [0 1 0 0 … 1 0 1]ᵀ on region X projects through the MYX connections to region Y, which holds token v = [1 0 0 1 … 0 0 1]ᵀ.]
Figure 1: Cortronic regions. Each cortronic region contains neurons, which may become active in a sparse binary pattern. A matrix MYX stores the connection weights between neurons in regions X and Y. At a given time, an active token in region X may broadcast its input to region Y through the weight matrix MYX . If region Y subsequently undergoes a restart competition, a sparse pattern of neurons wins, forming a sparse binary token on region Y. This figure illustrates a token u on X associatively impinging on region Y, which then restarts, yielding a token v. In the general case, connections can go from any region to any other region, as needed, including from a region to itself.
token u in region X with token v in region Y. This recall of associations is dependent on a period of training in which pairs of tokens are introduced to the regions, and boolean Hebbian learning is used to form a binary connection matrix between the regions. For each token u on region X that is to be associated with a token v in region Y, the weight of the connections from the active units in X to the active units in Y is set to 1. Several variations of the training process have been considered in the literature (Palm, 1980)—for instance, binary versus nonbinary connection weights and fixed number of winners versus threshold—and many of these have similar performance characteristics. For simplicity, we use a fixed number of winners and binary connection weights. Relying on previous work in sparse binary associative memories and Hecht-Nielsen’s cortronic artificial neural network architecture, we designed a hierarchical construct of cortronic regions to address the cocktail party problem. During training, the cortronic network learns associations between sound and word tokens. As a consequence of its hierarchical structure, the network learns association in three levels, based on the principle of co-occurrence: that adjacent language constructs must be
related. First, the network learns the association between a particular sound and the sounds that follow it in the language. Next, the network learns the association between a particular sequence of sounds and the token representation of the word they convey. Finally, the network learns the association between a word and the word that follows it in the language. These three levels of association represent the network’s knowledge of the language. When employed in the model cocktail party problem, the system receives a superposition of sounds during each sound sampling interval. The system uses an expectation in the form of a word it anticipates hearing from the attended speaker. In the analogous human cocktail party problem, such an expectation cue may be formed by observing the specific speaker’s lip movements, hearing a conversation fragment during a period of relative silence, or in an iterative trial-and-error process. In our model, at the beginning of operation, we assume that an initial expectation cue has been formed by some such mechanism and that it takes the form of a fragment of a word token—a sparse binary vector with very few bits set. This token fragment functions as a tentative initial hypothesis, and the system serves to test this hypothesis. If it is confirmed, then the expectation process continues and the token representations of the incoming sounds being attended to are separated, making them available for the higher cerebral mechanisms of comprehension. Our cortronic neural network is composed of three different parts (see Figure 2): a sound input region I, q sound processing regions X1 , X2 , . . . , Xq , and a word region Y. Region I preprocesses the incoming sounds and maps them into token representations. 
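The core operation underlying these association levels, boolean Hebbian training followed by a restart competition (section 3, Figure 1), can be sketched as follows. This is a minimal sketch: the region size n, token sparsity m, and number of stored pairs are illustrative assumptions.

```python
import numpy as np

def random_token(n, m, rng):
    """Sparse binary token: an n-bit vector with exactly m ones."""
    v = np.zeros(n, dtype=np.uint8)
    v[rng.choice(n, size=m, replace=False)] = 1
    return v

def train(pairs, n):
    """Boolean Hebbian learning: for each pair (u, v), set to 1 every
    connection from an active unit of u (region X) to an active unit of v
    (region Y)."""
    M = np.zeros((n, n), dtype=np.uint8)
    for u, v in pairs:
        M |= np.outer(v, u)                  # M[j, i] = 1 wherever v_j = u_i = 1
    return M

def restart(M, u, m):
    """Restart competition on region Y: the m most active neurons win."""
    activation = M.astype(np.int32) @ u
    v = np.zeros(M.shape[0], dtype=np.uint8)
    v[np.argsort(activation)[-m:]] = 1       # top-m activation levels win
    return v

rng = np.random.default_rng(0)
n, m = 512, 8
pairs = [(random_token(n, m, rng), random_token(n, m, rng)) for _ in range(40)]
M = train(pairs, n)
recalled = restart(M, pairs[0][0], m)        # recall the first stored association
```

With the memory loaded well below capacity, as here, the stored units reach the maximal activation level m and the recalled token matches the trained one; loading many more pairs eventually lets spurious units tie the true winners.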
It is well known that in the primary auditory cortex, local-in-time features employing multiple timescales, similar to Gabor logons, are employed (Kohonen, 1996; Rauschecker, 1998).3 As an approximation, we use a set of wavelet-like transforms to convert the sound input into activation levels for the neurons in region I. A competition is then held in which the most active neurons win. This results in a distinct representation of each sound, forming a sparse binary code that can be used in the next stage of processing. The exact number of winners is modified in a way appropriate for the task at hand. During learning, a sparser set of winners is allowed to define an internal representation for a sound, while in the cocktail party environment, a larger number of winners is allowed to represent a superposition of the many possible sounds that could be attended to. The second portion of our architecture, the sound processing area, is composed of q (the number of sounds in each
The acoustical dynamics and complexity of the human ear introduce a number of significant effects. For the purposes of this study, however, we made a simplifying assumption that the auditory input system functions approximately as a microphone connected to a rich set of wavelet feature detectors. We expect the current experiments to form a basis on which others can build and include realistic modeling of the acoustic properties of biological ears.
[Figure 2 schematic: regions X1, X2, X3, X4, and Y, linked by the connection matrices MX2X1, MX3X2, MX4X3, MX1X4, MYX1 through MYX4, MX1Y through MX4Y, and MYY, with region I feeding the sound processing regions.]
Figure 2: Cortronic network architecture for the model cocktail party problem. Region I represents the early processing apparatus of the raw sounds wherein each isolated pure sound is represented by a unique sparse binary pattern (as would emerge from a competition between feature detectors). Our focus is on information processing in the internal cortronic network within the sound processing regions X1, X2, X3, X4, and the word processing region Y.
word) cortronic regions that process consecutive sound tokens, extracting the sounds spoken by the attended speaker from the mixture present on I. Processing is cyclic: region X1 processes the first sound token in a word, region X2 processes the second sound token in a word, and so forth. In a biological system, this would correspond to feeding all the auditory data to all portions of the primary auditory cortex all the time, but restarting only the region that is now to respond (Rauschecker, 1998). An issue of prime importance is the synchronization between incoming sounds and the appropriate region that processes the sound, as well as the detection of word boundaries. We assume that a cognitive process (i.e., a determinate sequence of cortical restarts) is used to synchronize the sounds and the sound processing regions. Such a process would be driven and sustained by successful recognition of word and sound boundaries. For simplification, we did not model those controlling cognitive processes. It is interesting to note that this need for synchronization between source and processing is evident in our own
experiences with the human cocktail party problem as well (Hecht-Nielsen, 1998). The last layer of our architecture is a single word cortronic region Y, which processes a higher-order representation of words in the language. Knowledge of the language is conveyed in the connections between the different cortronic regions in our architecture, as follows. First, the sound input region I is connected to each of the q sound processing regions. Second, each sound processing region Xi is connected to the sound processing region Xi+1 that follows it in the temporal processing sequence, with Xq connected back to X1 . During the language training process, these connections learn the associations between consecutive sounds in the language. In addition, each sound processing region Xi , i = 1, 2, . . . , q is connected to the word region Y, which in turn is connected back to the sound processing regions. These connections represent the associations between sounds and words in the language. Finally, the word region Y is self-connected. These connections represent the temporal links between sequential words in the language. The appendix describes this architecture and the experimental protocol in detail, using a notation we defined specifically for cortronic network architectures. An important characteristic of cortronic regions is their storage capacity, defined as the number of vector associations the network can recall with acceptable accuracy. Referring back to the simple cortronic network in Figure 1 and using a derivation similar to Amari (1989) and Jagota, Narasimhan, and Regan (1998), assume regions X and Y contain n neurons each. Further, assume that the regions’ forward connection matrix MYX contains elements trained using boolean Hebbian learning by presenting l pairs of uniformly random sparse binary vectors, where each n-bit vector contains m 1’s and n − m 0’s. 
We would like to calculate how many pattern pairs l we can store in the connection matrix and still get acceptably accurate recall. In the recall process we activate a trained pattern on region X and restart region Y, retrieving the associated pattern on region Y. Since Hebbian learning connects all m active neurons in X to the m desired output neurons in Y, the recall process would be perfect unless all m active neurons in X are also connected to one or more extraneous neurons in Y. In this case, an erroneous bit may win the competition in region Y. We will proceed to calculate the probability of such an event's happening. Assuming that the stored vectors are sparse (m ≪ n), the probability that a particular neuron in X is connected to a particular neuron in Y is approximately lm^2/n^2 after Hebbian training of the network. Since each pattern has m active neurons, the probability that a particular vector on X is connected to a particular neuron in Y is (lm^2/n^2)^m. Thus, the probability p that there is no spurious neuron in Y with activation level m is

p = [1 − (lm^2/n^2)^m]^{n−m}.    (3.1)
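Equation 3.1 can be checked against a direct Monte Carlo simulation of the memory. The sketch below uses illustrative, smaller-than-cortical parameters (an assumption for runtime) and counts how often recall of a stored pair yields no spurious unit at full activation level m:

```python
import numpy as np

def analytic_p(n, m, l):
    """Equation 3.1: probability of no spurious winner after storing l pairs."""
    return (1.0 - (l * m**2 / n**2) ** m) ** (n - m)

def simulated_p(n, m, l, trials=100, seed=0):
    """Train a boolean Hebbian matrix on l random sparse pairs and measure
    how often recall of a stored pair has no spurious full-activation unit."""
    rng = np.random.default_rng(seed)
    ok = 0
    for _ in range(trials):
        U = np.zeros((l, n), np.uint8)
        V = np.zeros((l, n), np.uint8)
        for row in (U, V):
            for i in range(l):
                row[i, rng.choice(n, m, replace=False)] = 1
        M = (V.T.astype(np.int32) @ U) > 0      # boolean Hebbian matrix M_YX
        act = M.astype(np.int32) @ U[0]         # recall the first stored pair
        ok += int((act == m).sum() == m)        # exact iff only m units reach level m
    return ok / trials

analytic = analytic_p(256, 5, 553)
simulated = simulated_p(256, 5, 553)
```

The analytic formula treats connections as independent, so the simulation typically lands a few percent away from it, which is close enough to guide design.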
One criterion we can use to optimize the storage capacity of our network is information maximization. The information I in a network with n neurons and represented by the l vector associations is given by the familiar information-theoretic definition I = l log_2 C(n, m), where C(n, m) is the binomial coefficient. Applying Stirling's formula for logarithms, solving equation 3.1 for l, and substituting gives

I(n, m) = (n^2/m^2) e^{(1/m) ln(1 − e^{ln(p)/(n−m)})} · log_2[n^{n+1/2} / (m^{m+1/2} (2π)^{1/2} (n − m)^{n−m+1/2})].    (3.2)
An approximate analytical solution for the optimal values of m and l given n is reached by dropping the lower-order terms above and differentiating, which yields m ≈ log_2(n/4) and l ≈ (n^2/m^2) ln 2. Note that the optimal value of m depends logarithmically on n: m ∼ ln(n), justifying our assumption of vector sparseness. A second-order approximation can be derived by taking two terms in the Stirling approximation above, yielding

l ≈ (n^2 / (2e m^2)) e^{(1/m) ln(1 − e^{ln(p)/(n−m)})}.    (3.3)
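The optimum can also be found numerically by solving equation 3.1 for l and maximizing I = l log_2 C(n, m) by direct search (a sketch; the search range for m is an assumption):

```python
import math

def pairs_storable(n, m, p):
    """Equation 3.1 solved for l: pattern pairs storable at spurious-free
    recall probability p with m active bits out of n."""
    return (n**2 / m**2) * (1.0 - p ** (1.0 / (n - m))) ** (1.0 / m)

# Design point discussed in the text: n = 2048 neurons, p = 0.99.
n, p = 2048, 0.99
best_m = max(range(2, 40),
             key=lambda m: pairs_storable(n, m, p) * math.log2(math.comb(n, m)))
```

Running the search reproduces the design point quoted below: about 11 active bits and on the order of 11,000 storable vector pairs.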
The optimal number of active bits and resulting network capacity are important checks that help in the design of cortronic networks. On modern computers, regions containing on the order of 2048 neurons are computationally tractable; for a 1% probability of spurious activation (p = 0.99), approximately 11 bits should be active in each vector and up to 11,000 vector pairs can be stored. However, the above analysis, while instructive as a starting point, assumes a simple one-to-one connectivity between two regions and an absence of noise in the starting pattern. Our architecture contains many regions of one-to-one connectivity, but as shown in Figure 2, it also contains a convergence of the Xi regions onto Y, divergence from Y onto each Xi , use of partial patterns as expectation, noisy patterns, and input that is not completely random. While a complete accounting of these effects is difficult, each can be examined qualitatively. The case of q regions of size n converging to one is equivalent to the case of a single region of size qn providing input to a region of size n. Since m depends logarithmically on n, the optimal number of active bits will be raised only slightly from the original estimate. The active bits are now shared between q separate regions, so approximately m/q active bits should be used in each region. However, since we also rely on the divergent connection from Y back to each Xi , which is equivalent to the standard one given above, we cannot afford to use fewer active bits on the Xi regions. Note that there is some benefit to this arrangement also: the chance that there will be no spurious
activations on Y is now approximately

p = [1 − (lm^2/n^2)^{qm}]^{n−m}.    (3.4)
Comparison of equations 3.1 and 3.4 shows that the chance of spurious activation is reduced exponentially with q. If the network is not over capacity, recovery of words on region Y will therefore be exponentially more accurate than recovery of individual sounds. Since the goal of our system is to recover speech at the word level, this is exactly the desired behavior. In a noisy environment, we may lose a fraction of our active bits to noise. The neurons in the correct pattern will then be guaranteed an activation level of only mz, where z is the fraction of bits unaffected by noise. To compensate, we may scale the number of active bits by a factor of r. Assuming that noise scales linearly with the number of active bits, our approximate formula for the probability of correct recall becomes

p = [1 − C(mr, mzr) (lm^2 r^2 / n^2)^{mzr}]^{n−mr}.    (3.5)

If we assume n ≫ mr and apply Stirling's formula, we find that

p ≈ [1 − (1 / (2π m z(1 − z) r)^{1/2}) (lm^2 r^2 / (z(1 − z)^{(1−z)/z} n^2))^{mzr}]^n    (3.6)

dp/dr ≈ p^{(n−1)/n} n ((Ar)^{2mzr} / (Br)^{1/2}) [1/(2r) + 2mz(ln(1/(Ar)) − 1)]    (3.7)

where

A^2 = lm^2 / (z(1 − z)^{(1−z)/z} n^2),   B = 2π m z(1 − z).    (3.8)
Note that the derivative is positive when Ar < 1/e. From the definition of A, we can see that this requirement is fulfilled when the system is running sufficiently below capacity, that is, with l sufficiently small. Even without knowing exact values for noise, this suggests a strategy for noisy conditions: run the system below capacity and increase the number of active bits in the sparse representation. If the level of noise is known, equation 3.7 can be used to maximize p as a function of r; if not, experimentation can yield an optimal value for mr, the number of active bits. Our system not only contains noise of various types but also uses vectors, generated from real-world data as described below, that are not completely uncorrelated. So we made qualitatively informed adjustments to the optimal parameters derived for the simple case of two connected regions. The specific parameters used are described below.
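This strategy can be explored numerically. The sketch below sweeps the bit-scaling factor r in equation 3.6 for a hypothetical operating point below capacity; all parameter values here are illustrative assumptions, not the authors' settings.

```python
import math

def recall_prob(r, n, m, l, z):
    """Equation 3.6: approximate probability of spurious-free recall when a
    fraction 1 - z of active bits is lost to noise and the active-bit count
    is scaled by r."""
    A2 = l * m**2 / (z * (1 - z) ** ((1 - z) / z) * n**2)   # A^2 from eq. 3.8
    B = 2 * math.pi * m * z * (1 - z)                       # B from eq. 3.8
    f = (A2 * r**2) ** (m * z * r) / math.sqrt(B * r)
    return (1.0 - f) ** n if f < 1.0 else 0.0

# Hypothetical operating point, well below capacity: sweep r for best recall.
n, m, l, z = 2048, 11, 2000, 0.8
rs = [1 + 0.05 * i for i in range(61)]          # r in [1, 4]
best_r = max(rs, key=lambda r: recall_prob(r, n, m, l, z))
```

For a system below capacity (Ar < 1/e at r = 1), the sweep finds an optimum at r slightly above 1, confirming that modestly increasing the number of active bits improves noisy recall.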
[Figure 3 schematic: (a) a time-domain speech snapshot (amplitude versus t, in ms) feeds a wavelet bank of 2048 values, organized as FFT windows of doubling length; (b) sparse binary patterns for two source signals converge on region I.]
Figure 3: Preprocessing the sound input using a wavelet filter bank approach. (a) The speech signal for a single source, sampled at 8 kHz, is analyzed with a wavelet filter bank. (b) The real amplitude values from the wavelet bank become the activation levels of cells in the I region. One or more source signals are combined arithmetically at the filter bank level to form the input to region I. A restart competition on region I generates a sparse binary pattern representing the sound input.
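The pipeline in the caption can be sketched as follows (a sketch, not the authors' code: the preemphasis coefficient 0.97, the non-overlapping FFT segment lengths 128/256/512/1024, and per-FFT normalization before squaring are assumptions consistent with the figure and the protocol described below):

```python
import numpy as np

def preprocess(snapshot, m=16):
    """Map a 1024-sample, 8 kHz speech snapshot to a sparse binary token on region I."""
    # ~6 dB/octave preemphasis (coefficient 0.97 is a conventional choice, assumed here)
    x = np.append(snapshot[0], snapshot[1:] - 0.97 * snapshot[:-1])
    sets = []
    for seg_len in (128, 256, 512, 1024):   # doubling FFT lengths, inferred from Figure 3
        mags = []
        for seg in x.reshape(-1, seg_len):  # non-overlapping segments (assumption)
            spec = np.abs(np.fft.rfft(seg))[:seg_len // 2]
            mags.append((spec / max(spec.max(), 1e-12)) ** 2)  # normalize, magnitude squared
        sets.append(np.concatenate(mags))   # each set contributes 512 points
    v = np.concatenate(sets)                # 2048 activation levels on region I
    token = np.zeros(v.size, dtype=np.uint8)
    token[np.argsort(v)[-m:]] = 1           # restart competition: m winners
    return token
```

The four segment lengths yield 8·64 + 4·128 + 2·256 + 1·512 = 2048 points, matching the block sizes shown in the figure.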
4 Experimental Protocol

To generate the sounds for our model environment, we use a set of snapshots {sR1 , sR2 , . . . , sRd } of human male speech, sampled at 8 kHz. Each snapshot contains 1024 samples and is extracted from 60 seconds of continuous spoken reading in a conversational tone. Region I uses a wavelet spectral power filter bank approach to preprocess the data, converting each real speech input sR into a sparse binary token representation sB . The speech signal is first preemphasized at 6 dB per octave and then analyzed using four sets of fast Fourier transforms (FFTs), where the length of a basic FFT is doubled between sets (see Figure 3) and each FFT is normalized to have the same maximum amplitude. We take the magnitude squared of each FFT point in the analysis bank as the activation level of a particular cell in the I region. Each of the four sets of FFTs yields 512 data points, giving a total of 2048 points per speech snapshot. A restart competition on region I generates the
A Biologically Motivated Solution to the Cocktail Party Problem
1587
sparse binary pattern representing the sound input.4 We control the number of winners in the competition differently depending on whether the network is learning (with a single speaker) or performing (in the cocktail party environment). Before tackling the cocktail party problem, the network must first be trained on the language. The connections between the regions in the cortronic network, described in the previous section, are trained by boolean Hebbian learning in response to sequential input of language constructs under noise-free conditions. In the biologically analogous situation, the connections would be made gradually over time as a human learns a language by being exposed to it. Specifically, we train each of the connection matrices as follows. The connections between the sound input region and each of the sound processing regions X1 , X2 , . . . , Xq , that is, MX1 I , . . . , MXq I , are set as random row permutations of the identity matrix, to emulate a lookup table. Exactly m winners on I are allowed during training. This converts each real sound sR into its sparse binary representation sB and maps that representation to a different random token xi on each sound region Xi , i = 1, 2, . . . , q. For each word in the language, we impose the sounds in the word through the input region and onto the appropriate sound processing region Xi . In other words, the sparse binary token that corresponds to the first sound in the word, as it passes through the input region, is imposed on region X1 , the sparse binary token that corresponds to the second sound in the word, as it passes through the input region, is imposed on region X2 , and so on. The architecture learns the connections between each sound processing region Xi and its (temporally) consecutive sound processing region Xi+1 (as well as, cyclically, between Xq and X1 ), as captured in the connection matrices MX2 X1 , MX3 X2 , . . . , MXq Xq−1 , MX1 Xq . 
In addition, a randomly picked token corresponding to the word represented by the above q sound tokens is imposed on region Y, and the connections MYX1 , . . . , MYXq , as well as the reciprocal connections MX1 Y , . . . , MXq Y , between each of the X regions and region Y are trained. Finally, as the input sound streams proceed from one word to the next, we train the Y to Y connections MYY . To simulate the cocktail party environment, we pick p speakers and feed the network an additive superposition of their speech streams. In this superposition, we take the amplitude of all speech characters in the speech streams to be approximately equal. Next, we follow a series of restarts of the different regions. Our architecture uses an expectation to process the incoming mixture. Initially, we set this expectation e to be a fragment of the first word token we expect to hear from the attended speaker. At t = 1 we feed the arithmetic superposition of preprocessed sounds from all speakers through the input region. For small numbers of speakers p, we allow 4 Other preprocessing approaches may provide equal or superior performance. The main goal is for the preprocessing to provide a robust yet sufficiently segmented representation of the input.
mp winners on the input region; in this case, the complete token for every sound could be present, depending on the nature of the input sounds. For larger numbers of speakers, however, this would lead to sound tokens apparently being present when they were not, due to overlapping patterns. We therefore restrict the number of winners such that accidental patterns are unlikely. This restriction is performed empirically by requiring a “hallucination” of less than 6%, as described below. Although we perform this operation manually in our system, a test for excessive accidental patterns could be included as part of the processing operation. Even a fixed maximum number of winners, 5m for instance, provides similar performance for fewer than 10 speakers. Once winners—usually many more than m—have been selected on I, they are mapped onto region X1 . Added to this input signal on X1 is the expectation e from the word region Y. This token fragment e projects onto X1 through the binary weight matrix MX1 Y at an intensity of α (a predetermined weighting parameter). We now conduct a restart competition on region X1 . The result of this restart is a sparse vector with only m active bits. This vector is equivalent to a hypothesis—it may or may not contain active units that correctly match the sound associated with the expected word. We will call this active token x′1(1), as it approximates the sound token x1(1) that corresponds to the sound that the attended speaker uttered at t = 1. At t = 2 we feed the arithmetic superposition of sounds from all speakers through I and onto region X2 . Note that we have moved one time step forward along each of the superposed speech streams. In addition, we add the expectation from the word region as well as signal from the active token x′1(1) on X1 . That is, x′1(1) is projected associatively onto X2 through the binary weight matrix MX2 X1 at an intensity of β (another predetermined weighting parameter).
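The expectation-biased competition on X1 described above can be sketched as follows (a sketch; the function and argument names are illustrative, deterministic top-m selection stands in for the restart competition, and m = 16, α = 0.25 follow the text):

```python
import numpy as np

def restart_with_expectation(M_XI, s_I, M_XY, e, alpha=0.25, m=16):
    """Hypothesis token on X1: sound input from I plus alpha-weighted expectation from Y."""
    total_input = M_XI @ s_I + alpha * (M_XY @ e)  # input through M_X1I plus alpha * M_X1Y @ e
    x = np.zeros(total_input.size, dtype=np.uint8)
    x[np.argsort(total_input)[-m:]] = 1            # restart competition: m largest inputs win
    return x
```

The same pattern, with β in place of α and MX2 X1 in place of MX1 Y, gives the t = 2 step that combines input, expectation, and the previous sound token.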
As before, we hold a restart competition in X2 resulting in x′2(2), an approximation for the second sound the attended speaker uttered. In the following time step we repeat the above process on regions X3 , . . . , Xq to form all q sounds that compose the first word: x′1(1), x′2(2), . . . , x′q(q). Now we would like to recover the first word the attended speaker uttered. We project the recovered sound tokens x′1(1) through x′q(q) on regions X1 through Xq , respectively, to the word region Y. We hold a restart competition on the word region to form y′(1), an approximation to the token for the first word uttered by the attended speaker.5 This vector can be compared
5 The goal in the human cocktail party problem is to attend to a single speaker in a mix rather than to separate the sources themselves. In this problem, the actual sound is never reconstructed from the token. The wavelet-based speech preprocessor we employed does not preserve phase and amplitude relationships in the original audio signal. In the context of this model problem, it is unnecessary (and, in fact, is impossible for our system) to reconstitute the original signal from this representation. Our experiment demonstrates expectation-assisted source attendance at a higher level of abstraction than source separation.
to y(1), the token representation of the attended speaker’s first word, to obtain a performance measure. To form an expectation that will assist in recovering the second word’s sounds, we self-feed the word region through the MYY associative connection matrix. Restarting the Y region generates the expectation token for the next word’s sounds. We will denote this automatically generated expectation token as y∗(2). It represents the next stage of the process of synchronization with the desired sound stream.6 Note again that e—a single fragment of the starting word, representing a vague expectation—initiated the entire sequence of processing. From then on, processing is self-sustaining. For the next q time steps that compose the second word’s sounds, we proceed as with the first word’s sounds with the following changes. The full token y∗ is used, whereas before we used a fragment e of the word token. This implies a greater strength of expectation. The restart of region X1 that generates the first sound in the second word benefits from a cyclical input from region Xq of the last sound in the previous word. This procedure generates x′(q+1) through x′(2q). As before, we associatively feed the recovered sounds x′(q+1) through x′(2q) to the word region Y. Restarting the word region produces the second recovered word y′(2). Again, we can derive a performance measure by comparing the recovered word y′(2) with the token that represents the second original word y(2) that our attended speaker uttered. We can continue by applying equivalent steps to recover additional sounds and words. It is easy to detect if the initial sound fragment hypothesis was incorrect, and the system can be restarted when a new hypothesis is supplied. The appendix presents, in succinct code, the operations that the cortronic network performs throughout its operation.

5 Results

Most of our experiments were run with the parameters presented in Table 1.
We constructed the speakers’ language subsequences by picking the beginning of each subsequence uniquely at random from the language string. A new set of speakers was picked for each of the 1000 trials used to generate each data point. As a performance metric, we used the fraction of correctly retrieved bits in the sparse binary vectors. The information capacity of the network at the given size, as derived in equation 3.6, is l = 7600 patterns for a mean recall error of 1%. Our network is scaled so the capacity exceeds the number of words, in order to compensate for correlated input vectors and to provide improved performance under the noisy conditions of multiple interfering speakers. Note also that
6 Processing of incoming sensory data is a highly active operation. The vector y∗ (2) represents the feedback expectation that will be used in the processing of the next word. This is one of the ways in which knowledge of the language is explicitly used.
Table 1: Typical Experimental Parameters.

Description                                        Symbol   Value
Number of sounds per word                          q        4
Number of unique sounds                            d        254
Number of words in the language                    l        1024
Number of neurons in each cortronic region         n        2048
Number of active units in a sparse binary vector   m        16
Expectation weighting                              α        0.25
Weighting of previous sound                        β        0.05
the number of active units has been scaled upward from the theoretical optimum of 11 to 16, again to compensate for noise. This number was chosen empirically to give reasonable performance. Figure 4 shows the performance of our network in retrieving the sounds and words that the attended speaker uttered. Word recovery is superior to sound recovery. This is analogous to the human cocktail party experience in that a listener may isolate the content of a speaker from the cocktail but often cannot identically reproduce the speaker’s sound stream. In addition, both sound recovery and word recovery improve over time. This improvement is most noticeable between word 1 and word 2, while improvement after word 3 is minute. Put differently, it is more difficult to follow a particular speaker’s conversation at its beginning, when the listener’s expectations are still vague. As the conversation progresses, it is easier for the listener to isolate and follow the attended speaker’s speech. To test the system more thoroughly, we conducted a simple experiment in which the attended speaker’s sounds were deliberately omitted from the input so that the word expectations mismatched the sound input. As Figure 5 depicts, except for the trivial case of no input signal, the network did not get confused; it did not erroneously generate the expected speech stream, which was absent from the auditory environment. If expectation is weighted too heavily, the “recovery” of a nonexistent speech stream can occur regardless of the input, a process we term hallucination. We kept hallucination to under one bit (6%) by restricting the number of winners on I. Finally, we tested the scalability of our solution by measuring word recovery performance with various network and language sizes (see Figure 6). As the size of the vocabulary grows beyond the region’s capacity, performance begins to degrade gradually and later drops more precipitously. 
As expected, using larger region sizes allows the system to handle larger vocabularies.

6 Discussion

Previous approaches to the cocktail party problem have typically been blind. For instance, methods based on independent component analysis
[Figure 4 appears here: average % of correctly recovered bits for words (Word 1, Word 2) and sounds (Sound 1–Sound 8) versus number of unattended speakers (0–20).]
Figure 4: Word and sound recovery as functions of the number of simultaneous speakers including the attended speaker. The cortronic network successfully extracts words from the mix. Both sound recovery and word recovery improve temporally. Word recovery performance is superior to sound recovery. This is analogous to the human cocktail party: a listener may isolate the content of a speaker from the cocktail but cannot identically reproduce the speaker’s sound stream. The following parameters apply to these data: number of words in language l = 1024; number of unique sounds d = 254; number of sounds per word q = 4; number of neurons per region n = 2048; number of active neurons in a sparse binary vector m = 16; number of expectation bits = 4 for first word; α = 0.25, β = 0.05; 1000 trials per data point.
can be highly effective, reconstructing 20 or more sources with virtually no error (Lee, Girolami, Bell, & Sejnowski, 1999), given at least as many microphones as sources. However, methods that allow fewer microphones report results that are more modest; systems attempting to separate three sources with two microphones generate signal-to-noise ratios (SNRs) of a
[Figure 5 appears here: average % of hallucinated bits for Word 1 and Word 2 versus number of speakers (0–20).]
Figure 5: Word hallucination. Hallucination refers to the performance of the system when the attended speaker signal is missing from the cocktail. The expectation is strong enough to complete the desired words when there is no sound input to the cortronic network. However, with any number of real input sources present, the network does not reproduce the desired words. Thus, the network does not suffer from word hallucination and will not erroneously pick out a source that is not present in the input stream. The parameters used in this experiment are identical to those used in the word and sound recovery experiment of Figure 4.
few decibels at most (Van Hulle, Yu-Hen, Larsen, Wilson, & Douglas, 1999). This is not surprising given the difficulty of reconstructing all sources, of arbitrary type, from a small number of mixtures; the problem can be mathematically underdetermined. In contrast, relying on intimate knowledge of the source allows our system to identify the desired content for large numbers of superposed sources using a single microphone. In our model problem, we were able to isolate the content of an attended speaker from up to approximately 10 others with minimal noise. Indeed, our system may seem to provide speaker separation performance that surpasses human capability, probably because of the relatively small vocabulary used in our model. However, the scalability of our model indicates the possibility of application to problems with larger vocabularies. In addition, traditional blind source separation could be used to complement our approach. Multiple inputs such as two ears could be used to generate partially separated output prior to an expectation-based system. The performance of our system is insensitive to moderate parameter changes; there is a broad range of parameter space in which it performs adequately. Such parameter space robustness is what we would expect from
[Figure 6 appears here: a surface plot of average % of correctly recovered bits over region size n (512–2048) and language size l (512–4096).]
Figure 6: Scalability of the cocktail party cortronic network as a function of number of neurons in each region (n) and number of words in the language (l). As the vocabulary size increases, performance degrades gradually at first, before falling off rapidly. As expected, using a larger region size allows the system to handle larger vocabularies. The following parameters apply to these data: number of unique sounds d = 254; number of sounds per word q = 4; number of active neurons in a sparse binary vector m = 16; number of expectation bits = 4 for first word; α = 0.25, β = 0.05; number of speakers p = 4; 1000 trials per data point.
a real biological system. Better results would be possible through the use of token completion, a feature of the associative memory model employed in the cortronic network (see Hecht-Nielsen, 1998, and Schwenker, Sommer, & Palm, 1996, for a description of this mechanism). We believe that the restriction we placed by setting a fixed number of sounds per word and relying on an external mechanism for word and sound synchronization can be overcome, as cortronic networks are suited for internally controlling their operation. For example, internal management of expectation weighting and number of winners could be implemented with a “sanity check” mechanism that would test for the apparent presence of inputs that were known to be highly improbable. A key advantage of our approach is that while we require knowledge of the sounds we wish to attend to, we require no knowledge of background
sounds such as static, traffic, and music. This is because of the mapping of raw sounds onto the largely random tokens at the front end of the system. The probability of the background sounds’ interfering significantly with our known sounds is negligibly small. During training, the network also requires no knowledge of the specific superposition of sources. The network, trained on an isolated source, does not need retraining to isolate that source from a variety of different mixtures, avoiding the need for additional, time-consuming computation when addressing novel environments. Additionally, the cortronic architecture is based on a simplification of observed biological structures and capabilities. We use a simple form of Hebbian learning; real neurons are capable of more complex interactions. The restart competition provides a simplified way to implement sparse coding of data, with each neuron being either on or off. Biological systems that use sparse coding, for instance, hippocampal place fields (Wilson & McNaughton, 1993), typically have graded activity for each neuron. Although the exact operation of our cortronic network is unlikely to be duplicated in a biological system, the simplified network suggests an approach to solving the cocktail party problem that may be analogous to strategies used by the human brain. On a final note, the processing of information in the cortronic network is massively parallel, as is the biologically analogous human cerebral cortex. The use of dedicated hardware would enable cortronic networks of very large capacity with reasonable processing speed and no start-up delay.

Appendix

This appendix presents a concise definition of the architecture of the cortronic network in our simulation and the processing operations that it uses. For an overview of the basic concepts behind cortronic network operation, see the main text.
We use a specialized code, defined below, to describe the architecture and processing functions of our cortronic network. Use of this code allows the cortronic network to be described with a series of succinct operations that mimic the actual processing operations necessary to implement such a network on a computer. We use the following terms throughout the appendix: • A sparse binary vector is an n-ary vector with binary elements such that the vector contains many more 0’s than 1’s. • A cortronic region (or simply a region) consists of n elements (neurons), each of which can be either active or inactive. The pattern of activity on a region is the binary representation of those elements that are active (1 = active, 0 = inactive). Typically this pattern is a sparse binary vector.
• An associative connection between two regions consists of a binary n×n matrix that specifies the elements in the second region that receive input from any particular element in the first region. • A restart competition on a region modifies its pattern of activity based on which elements are receiving the greatest input from other active elements. The pattern of activity does not change based on input without a restart competition. • A cortronic network consists of a number of cortronic regions that are associatively connected. This network represents objects as sparse binary codes, learns associations between objects by training the associative connection matrices, and then uses these trained associations to process information using restart competitions. We use the following notation for components of a cortronic network: X, Y, Z, . . . Uppercase italic letters denote regions. When used in a mathematical expression, they denote the column vector representing the pattern of activity of the region (typically an n-ary sparse binary vector with m bits active). x, y, z, . . .
Lowercase bold letters denote sparse binary vectors. Unless stated otherwise, all vectors are assumed to be of length n and have exactly m bits active.
MYX
Uppercase bold letters denote n×n binary weight matrices, in this example corresponding to connections from region X to region Y. Note that MXX would be a matrix that connects region X to itself.
To define the associative connections between regions, we use the following notation: X →α Y
X connects to Y with strength α via a matrix MYX . When such a connection is present, X sends as input to Y the vector αYX MYX X, where X is viewed as a column vector. (Note that Y → X is distinct from X → Y.)
X→Y
Shorthand for X →1 Y, that is, α = 1.
αYX = α′

Change the strength of the connection to Y from X from αYX to α′. One biologically analogous situation occurs where intermediate neurons that propagate a signal from X to Y are activated or inhibited, thus modulating the strength of the signal.
Operations used primarily during training are given below: G(x)
Generate a sparse binary vector x uniformly at random.
G(x1 , x2 , . . . , xk )
Generate k distinct random sparse binary vectors x1 , x2 , . . . , xk . (Distinct means i ≠ j ⇒ xi ≠ xj .)
x↓X
Activate x on X: the pattern of activity on region X becomes the sparse binary vector x.
T(Y, X)
Associate the pattern on X to the pattern on Y: boolean Hebbian learning is used to add weights to the connection matrix MYX corresponding to the pattern of activity on X and Y. If X and Y are viewed as column vectors, this is equivalent to MYX,(new) = MYX,(old) ∨ YXᵀ, where ∨ denotes the boolean OR operation. In essence, this connects every currently active element in X to every currently active element in Y.
T(Y)
Associate the current pattern of activity on Y with the pattern of activity on Y that existed before the last association or restart.
We use the information processing operations defined below: R(X)
Restart X: Perform a restart competition on X. The total input to X is given by Σ_Z αXZ MXZ Z, where Z ranges over all regions in the network that are connected to X. After the restart, the pattern of activity is changed to a sparse binary vector with m bits active by picking the active elements that had maximal total input before the restart. In the case where k > m elements could become active (because of tied input scores), elements that are tied for least input among the k are discarded uniformly at random until exactly m elements remain active. Note that this newly active pattern now provides input to all regions that X connects to.
Rk (X)
Restart X and allow k winners.
R(X)¬x
Restart X and use x to denote the resulting pattern of activity.
We introduce the following notation to help remind readers of the implicit broadcast of active patterns that takes place in a cortronic network. The notation does not affect the actual processing operations. x: X
x, the pattern of activity on X. This is used to remind the reader explicitly which pattern is active on X.
X ↦α Y

X provides input to Y of strength α. This reminds readers that when X is active, it sends input to Y through MYX , and that this input to Y is relevant for the current processing step. (We abbreviate X ↦1 Y as X ↦ Y.)

x: X ↦α Y

Shorthand for x: X, X ↦α Y (and similarly for x: X ↦ Y).

+

Used to indicate that multiple inputs provide the total input to one region. For instance, we would write X ↦ Z + Y ↦ Z.
To illustrate the usage of the above notation, we reproduce the architecture and simple associative recall shown in Figure 1: X→Y
Define the architecture. Region X is associatively connected to region Y.
G(u), G(v)
Generate the sparse binary vectors u and v.
u ↓ X, v ↓ Y, T(Y, X) Activate u on X and v on Y and train an association between them. u ↓ X, R(Y)¬v′
Perform associative recall by activating u on X and restarting Y to generate v′, the recalled pattern corresponding to v. (If recall is perfect, v = v′.)
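The G/T/R recall example above can be simulated end to end (a sketch with illustrative region size n = 256 and sparsity m = 8; lexsort with a random secondary key implements the random tie-breaking of the restart competition):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 256, 8  # illustrative region size and sparsity

def G():
    """Generate a sparse binary vector with m of n bits active, uniformly at random."""
    v = np.zeros(n, dtype=np.uint8)
    v[rng.choice(n, size=m, replace=False)] = 1
    return v

def T(M, y, x):
    """Boolean Hebbian learning: M_new = M_old OR y x^T."""
    return M | np.outer(y, x)

def R(total_input):
    """Restart competition: the m elements with maximal input win; ties broken at random."""
    order = np.lexsort((rng.random(n), total_input))  # primary key: input; secondary: random
    v = np.zeros(n, dtype=np.uint8)
    v[order[-m:]] = 1
    return v

u, v = G(), G()
M_YX = T(np.zeros((n, n), dtype=np.uint8), v, u)  # u on X, v on Y, T(Y, X)
v_rec = R(M_YX @ u)                               # u on X, restart Y: recalled pattern v'
```

With a single stored pair, every unit active in v receives input m and every other unit receives 0, so the restart recovers v exactly.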
In order to implement the cocktail party problem cortronic network of Figure 2, recall that we start with I, the sound input region, Xj , j = 1..q, the sound processing regions, and Y, the word processing region. I → X1 , I → X2 , . . . , I → Xq
The sound input region connects to all sound processing regions.
X1 →β X2 , . . . , Xq−1 →β Xq , Xq →β X1
The sound processing regions each connect to the following region (cyclically) with strength β.
X1 → Y, X2 → Y, . . . , Xq → Y
All sound processing regions connect to the word processing region.
Y →α X1 , . . . , Y →α Xq
The word processing region connects to each sound processing region with weight α.
Y→Y
The word processing region connects forward in time to itself.
Recall that the complete synthetic language of our speakers consists of a series of words (w1 , w2 , . . . , wl ), each of which consists of a sequence of sounds from the set of sounds {s1 , s2 , . . . , sd }. We use sound segments taken
from human speech {sR1 , sR2 , . . . , sRd } to generate internal binary representations {sB1 , sB2 , . . . , sBd }, but we need tokens to represent each word: G(w1 , w2 , . . . , wl ) Generate a unique sparse binary representation for each word. In addition, we pick a unique sequence (si(j,1) , si(j,2) , . . . , si(j,q) ) of q sounds corresponding to each word wj . Here we define i(j, k) to be the index of the kth sound of word wj . Before training, we first pick weight matrices connecting I to each of X1 , X2 , . . . , Xq , which are row permutations of the identity matrix. This simulates a random learned connection between the representation of a sound on I and the corresponding one on each Xi without the processing overhead of actually training such connections. In an actual implementation, such permutations of the identity can be simulated by a lookup table. Training then proceeds as follows, where we use xi(j,k) to refer to the representation of sound si(j,k) on the sound processing region Xk . (Note that in normal operation, Xk always represents the kth sound of each word, so this is a sufficiently general notation.) αX2 X1 = 0, . . . , αXq Xq−1 = 0, αX1 Xq = 0, αX2 Y = 0, . . . , αXq Y = 0
Decrease the strength of connections to avoid interference with training.
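The lookup-table weight matrices described above (random row permutations of the identity) can be built directly (a sketch; the seed is illustrative, n = 2048 is the region size from Table 1):

```python
import numpy as np

n = 2048                        # neurons per region (Table 1)
rng = np.random.default_rng(2)  # illustrative seed
# M_X1I: a random row permutation of the identity matrix, acting as a lookup table
# that maps each active unit on I to a unique unit on X1
M_X1I = np.eye(n, dtype=np.uint8)[rng.permutation(n)]
```

Because the matrix is a permutation, an m-active sparse token on I maps to a different m-active token on X1, exactly as a lookup table would.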
sRi(1,1) ↓ I, R(I)¬sBi(1,1) , R(X1 )¬xi(1,1)
Note that I ↦ X1 . Listen to the first sound of the first word, find its binary representation on I, and translate it to X1 . This also defines the representation xi(1,1) on X1 of the first sound of the first word.
sRi(1,2) ↓ I, R(I), R(X2 )¬xi(1,2) , T(X2 , X1 )
Listen to the second sound, activate it on X2 , and associate the representation of the second sound with that of the first. (Note that I ↦ X2 .)
···
Continue for all the sounds of the first word.
sRi(1,q) ↓ I, R(I), R(Xq )¬xi(1,q) , T(Xq , Xq−1 )
Associate the last two sound representations of the first word.
w1 ↓ Y, T(X1 , Y), T(Y, X1 ), . . . , T(Xq , Y), T(Y, Xq )
Associate all the sound representations with the word and vice versa.
sRi(2,1) ↓ I, R(I), R(X1 )¬xi(2,1) , T(X1 , Xq )
Associate the last sound representation of the first word with the first of the second word.
···
Continue for the second word, and associate all the sound representations with that word.
T(Y)
Associate the current word (the second) with the previous (the first).
···
Continue associating sound representations with following ones, words with following words, and sound representations with their word and vice versa.
wl ↓ Y, T(X1 , Y), T(Y, X1 ), . . . , T(Xq , Y), T(Y, Xq )
Associate the last sequence of sound representations with the last word.
T(Y)
Associate the last two words.
sRi(1,1) ↓ I, R(I), R(X1 ), T(X1 , Xq ), w1 ↓ Y, T(Y)
Wrap around by associating the last word with the first word, and the last sound representation of the last word with the first sound representation of the first word.
αX2 X1 = β, . . . , αXq Xq−1 = β, αX1 Xq = β, αX1 Y = α, . . . , αXq Y = α
Set connections to the appropriate strength for processing.
Now we apply the network to the cocktail party problem. Let there be p speakers, each uttering words from a distinct point in the language. Let i(j, k) be the index of the kth sound uttered by the jth speaker where i is an index to a sound vector or the representation of a sound vector on a sound processing region, and let it be the index of the kth word uttered by the jth speaker when applied as an index to a word vector. Let g ∈ {1, 2, . . . , p} be the index of the speaker to whom we are attending. Let a pattern e be active on Y that contains a few bits of the word we are listening for (expectation). Let h be the number of bits we allow active on I to serve as a binary representation of the possible sounds present. Empirically we found that with our parameters, for p < 5, h = mp is appropriate (to approximate the superposition of the binary representations of the sounds). For p > 5, h = 5m provides more discriminating performance; we picked h as large as
possible such that hallucination remains under 6% on the second word for a given number of speakers. (Note that h is a function of p, but need only be calculated once for a given number of speakers, not every time the system processes input.)

(Σ_{j=1}^{p} sRi(j,1)) ↓ I, Rh(I), R(X1)¬x′i(g,1)

Note that I ↦ X1 + e: Y ↦α X1. Process the first sound, generating first a rich internal representation of the input mixture on I, and then generating the recovered sound representation x′i(g,1) from that mixture plus expectation. We can score the accuracy of this recovery by counting the percentage of correctly placed active bits in x′i(g,1) relative to the representation xi(g,1) generated during noise-free training. (This percentage is equivalent to the fraction xi(g,1) · x′i(g,1)/m.)

(Σ_{j=1}^{p} sRi(j,2)) ↓ I, Rh(I), R(X2)¬x′i(g,2)

Process the second sound, now using feedforward from X1. (We now have I ↦ X2 + e: Y ↦α X2 + x′i(g,1): X1 ↦β X2.)

···
Repeat for all q sounds.
αYY = 0
We have to confirm our expectation before we can generate the next word, so turn off the feedforward connection.
R(Y)¬w′i(g,1)
(Note that x′i(g,1): X1 ↦ Y, . . . , x′i(g,q): Xq ↦ Y.) Restart Y generating w′i(g,1), an approximation of wi(g,1), the first word spoken by the attended speaker. Accuracy can be determined as with sounds.
αYX1 = 0, . . . , αYXq = 0; αYY = 1
Ignore current sound input to word region. Turn on feedforward connections in the word region.
R(Y)¬w∗i(g,2)
Restart Y to compute the expectation for the next word to be spoken. (Input is from the previous word, as w′i(g,1): Y ↦ Y.)
A Biologically Motivated Solution to the Cocktail Party Problem
αYX1 = 1, . . . , αYXq = 1

(Σj=1..p si(j,q+1)) ↓ I, Rh(I), R(X1)¬x′i(g,q+1)
···
Effect of sound input resumes. Repeat the above process again, now using w∗i(g,2) as the self-generated expectation, which replaces e. (Now I ↦ X1 + w∗i(g,2): Y ↦α X1 + x′i(g,q): Xq ↦β X1.) Continue word recovery for as many iterations as desired.
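The recovery-accuracy score used in the steps above, the fraction xi(g,1) · x′i(g,1)/m of correctly placed active bits, can be sketched as follows; the vector length and sparsity below are made-up values, not the paper's parameters:

```python
# Sketch of the recovery-accuracy score: sounds are sparse binary vectors
# with m active bits out of n (sizes here are illustrative), and accuracy
# is the fraction of correctly placed active bits, x_true . x_rec / m.
import numpy as np

def sparse_code(rng, n, m):
    """A random binary vector of length n with exactly m active bits."""
    v = np.zeros(n, dtype=int)
    v[rng.choice(n, size=m, replace=False)] = 1
    return v

def accuracy(x_true, x_rec, m):
    """Fraction of correctly placed active bits, x_true . x_rec / m."""
    return int(x_true @ x_rec) / m

rng = np.random.default_rng(0)
n, m = 512, 20
x = sparse_code(rng, n, m)
```

A perfect recovery scores 1; each missing or misplaced active bit lowers the score by 1/m.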
To test whether the network will hallucinate a speech stream based on expectation, when the actual corresponding sound input is not available, we generate a (p + 1)st imaginary speaker who does not actually contribute to the superposition of sounds. We then set g = p + 1 and run the cortronic network as above. If a trial run were to generate significant hallucination, an appropriate correction would be to lower the value of h used for that p, or to decrease the strengths of expectation, α and β, as appropriate.

Acknowledgments

R. H-N. gratefully acknowledges support of his research by DARPA (N0001498-C-0356) and ONR (N0014-96-C-0109).
Received November 16, 1998; accepted September 22, 2000.
LETTER
Communicated by Richard Lippmann
A Comparative Study of Feature-Salience Ranking Techniques

W. Wang
Department of Computer Science, University of Bradford, Bradford, U.K.
P. Jones
D. Partridge
Department of Computer Science, University of Exeter, Exeter, U.K.
We assess the relative merits of a number of techniques designed to determine the relative salience of the elements of a feature set with respect to their ability to predict a category outcome—for example, which features of a character contribute most to accurate character recognition. A number of different neural-net-based techniques have been proposed (by us and others) in addition to a standard statistical technique, and we add a technique based on inductively generated decision trees. The salience of the features that compose a proposed set is an important problem to solve efficiently and effectively, not only for neural computing technology but also in order to provide a sound basis for any attempt to design an optimal computational system. The focus of this study is the efficiency and the effectiveness with which high-salience subsets of features can be identified in the context of ill-understood and potentially noisy real-world data. Our two simple approaches, weight clamping using a neural network and feature ranking using a decision tree, generally provide a good, consistent ordering of features. In addition, linear correlation often works well.

1 Introduction

A pivotal issue at the outset of a data mining exercise is what specific data features to extract from the wealth of possibilities. The extensive literature on feature selection, as surveyed in, for example, Fukunaga (1990) and Liu and Motoda (1998), addresses the problem of finding a "good" subset of features by employing various search strategies. In this study, we focus on associating a measure of salience with each individual feature of a prespecified set of possibilities in order to obtain a ranking of features. Furthermore, our goal is to determine these measures in the early stages of data mining for noisy, nonlinear real-world decision problems. Removal of irrelevant (redundant) or less important features can improve performance and reduce the cost of data collection.
Neural Computation 13, 1603–1623 (2001) © 2001 Massachusetts Institute of Technology

Simple and reliable methods are more practically useful for these purposes because (1) an optimal model may not
be absolutely necessary at the outset but a robust one is essential; (2) complex techniques are not only computationally costly but fickle enough to push their cost up even more through wasted runs; and (3) overly complex techniques that require human intervention or interpretation are costly in terms of human labor and are also exposed to human error. A number of feature-ranking techniques have emerged from the field of neural computing, such as the sensitivity-analysis-based methods (Utans & Moody, 1991), functional analysis (Gedeon, 1997), the Bayesian automatic relevance determination (ARD) model (MacKay, 1992), neural net clamping (Wang, Jones, & Partridge, 1998), and the weight product (Tchaban, Taylor, & Griffin, 1998). Tchaban et al. (1998) compared the sensitivity, functional analysis, and weight-product techniques. In this article, we perform a comparative evaluation of some neural-net-based techniques (clamping, ARD, weight product) together with a standard statistical technique (to provide a baseline) and one we propose based on decision tree algorithms (another inductive data mining technology). To evaluate the performance of the techniques, we train and test neural networks with each of the reduced feature sets identified by the different salience measures. The consistency between rankings is quantitatively measured by the "distance" and a weighted dissimilarity we define, together with a standard statistical similarity measure: Spearman's correlation. This article introduces each of the feature-ranking techniques compared in the study and then describes the measures of rankings. These are followed by the empirical results for three different problems and a discussion of the relative strengths and weaknesses of the techniques compared.
2 The Clamping Technique

This technique, proposed and studied in Wang et al. (1998), relies on the observation that the degradation in the generalization performance of a trained neural network, when the information content of a particular feature is removed from all test patterns, will reflect the salience of that feature with respect to the network's computational task. A feature-ranking exercise then consists of testing a trained neural network to obtain the baseline generalization performance; retesting with one feature in every test pattern clamped to a fixed value (we use the mean value for that feature in the test set) and recording the new generalization performance; and repeating this clamp-and-retest cycle for each feature of the test patterns. Finally, the set of generalization performances, relative to the baseline performance and ordered by extent of degradation, gives the salience ranking of the features: the largest performance decrease indicates the most salient feature. We use the term impact ratio to represent the salience of inputs when
applying the clamping technique, which has been defined by Wang, Jones, and Partridge (2000). For convenience, it is restated below.

Definition. Given a data set for a classification problem with n input features, {xi, i = 1, . . . , n}, and a finite number of target categories, the salience of a specific input feature xi is defined as the impact ratio, ξ(xi):

ξ(xi) = 1 − g(x | xi = x̄i)/g(x),  (2.1)
where g(x) is the generalization performance of the neural net and g(x | xi = x̄i) is the clamped generalization performance of the net when input xi is clamped to its mean x̄i. By this definition, a higher value of ξ(xi) indicates a higher salience for that input xi with respect to the output.

3 Automatic Relevance Determination (ARD)

ARD is a technique developed by MacKay (1992) when he applied Bayesian theory to neural networks in order to improve applications of this technology systematically. It can be used to determine the salience of inputs and is currently available within the Netlab package,¹ which runs under a commercial package, Matlab. MacKay's original purpose for applying Bayesian theory to neural networks was to estimate a set of hyperparameters associated with a weight decay term, αEW, that was introduced into the objective function to avoid the "overfitting" problem. The cost function then became

M(w) = βED + αEW,  (3.1)

where ED is a commonly used objective error function and EW = (1/2) Σi wi². The weight decay term, or regularizer, αEW, will penalize the more complex functions and favor general smooth ones. Minimizing M(w) involves finding a set of weights, wMP, the "most probable" weight set, for a specific neural model, H, and optimizing the hyperparameters, α and β, given a data set D. MacKay (1998) applied Bayesian theory to infer the hyperparameters and showed that the (locally) most probable values αMP and βMP satisfy the following equations:

1/αMP = Σi (wiMP)²/γ  (3.2)

¹ Available at no charge from Aston University, UK.
1/βMP = 2ED/(N − γ)  (3.3)

γ = k − α Trace(A⁻¹),  (3.4)
where γ is called the number of well-determined parameters, k is the number of parameters, and A is a Hessian matrix dependent on both the data and the net. The "hyperparameters" are assigned one-to-one with the set of input features to a neural network, and the optimal hyperparameter values found then reflect the relevance of their associated features to the solution. It should be pointed out that reestimation of the hyperparameters involves evaluating determinants and inverting the Hessian matrix, A. When the size of a real problem is large (in terms of the dimensions of the input space and the number of data patterns, which will usually require a proportionally sized neural net) and the data contain a high level of noise, the associated Hessian matrix may be ill conditioned or may not meet assumptions (positive definiteness). It may have negative eigenvalues or, in the worst case, no real eigenvalues at all. In these situations, the ARD procedure cannot be advanced without heuristic intervention such as resetting negative eigenvalues to zero.

4 The Weight-Product Technique

This technique, originally designed as a first step toward explaining neural net behavior by assessing the impact of input features on network output, is based on the idea that in a trained network, large weights on the path from input to output reflect high impact for that particular input. This is complicated by a number of factors: weights may be large positive or negative, so input impacts may be correspondingly positive or negative, and actual input values may be large or small, so an overall impact assessment must take both into account. In introducing this technique, Tchaban et al. (1998) devised a formula for multilayer perceptrons (MLPs) with a single hidden layer. They defined the impact of an input xi on the output ok as

Si = (xi/ok) Σj wij wjk,  (4.1)

where wij is the weight between xi and hidden unit j, and wjk is the weight between hidden unit j and output unit k. We call it the weight-product technique. This technique was systematically explored in an earlier study (Wang et al., 1998), so the details will not be repeated, although the crucial results and conclusions will be reproduced here in the interest of completeness in this comparative study.
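Equation 4.1 is computed directly from a trained network's weights; a minimal sketch follows, with made-up weights, input, and output value rather than a trained MLP:

```python
# Sketch of the weight-product impact measure (equation 4.1) for a
# single-hidden-layer MLP. The weights, input pattern, and output value
# below are illustrative numbers, not a trained network.
import numpy as np

W_in = np.array([[0.8, -1.2],
                 [0.1, 0.05],
                 [-0.9, 0.7]])      # w_ij: 3 inputs x 2 hidden units
W_out = np.array([1.5, -0.5])       # w_jk: 2 hidden units -> output k
x = np.array([0.6, 0.2, 0.9])       # input pattern x_i
o_k = 0.7                           # network output o_k for this pattern

# S_i = (x_i / o_k) * sum_j w_ij * w_jk
S = (x / o_k) * (W_in @ W_out)
ranking = np.argsort(-np.abs(S))    # most impactful input first
```

Since impacts may be positive or negative, as noted above, the magnitude |Si| is used to order the features.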
5 Decision Tree Heuristics

A further approach to automatic feature ranking that we are developing is based on the output of so-called automatic induction algorithms, such as C5.0, the latest development of ID3 (Quinlan, 1986). From a training set in which each pattern is associated with a correct target, these algorithms generate a decision tree that attempts to classify each training pattern correctly as a result of a sequence of decisions based on the feature values that compose a pattern. Typically, the trees are constructed by heuristically determining which features within the training set are maximally informative (in an information-theoretic sense) with respect to the given target classes. The decision tree is then built, from root to leaves (the target categorizations), by recursively applying the information-maximizing heuristic. The resultant decision trees produce a ranking of feature salience in the sequence of decision functions from root to leaf. This picture is, however, complicated in several ways: a decision tree contains many root-to-leaf paths, and they will not have identical feature orderings (although the nature of a tree is such that maximum agreement will occur for the highest-salience features); in addition, the throughput (i.e., how many of the training patterns are classified, correctly or incorrectly, by each path from root to leaf) will be highly variable. We have devised the following heuristic, which takes tree structure and throughput into account, for generating a feature-salience ranking from a decision tree. A number of obvious enhancements (such as accuracy, i.e., the relative numbers of correct and incorrect classifications that reach each leaf) were explored, but as they resulted in no significant difference in the general level of performance, the basic heuristic was adopted for all the empirical results reported later.
The heuristic used was as follows:

For the data set available:
    Execute C5.0 to train a tree;
    For each pattern feature:
        For each tree node at which the feature occurs:
            Add to the running score for the feature:
                (number of patterns processed by node) / (level of node in tree);
    Scale the set of feature scores by dividing each by the maximum score;
    Order features by their scores.
We name these scores significance scores. The higher a significance score is, the more salient is the feature. In general, our decision tree heuristic can be used in conjunction with any decision tree induction tool; it is not tied to the specifics of C5.0.
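The heuristic is straightforward to implement over any tree representation; a minimal sketch, using a hypothetical four-node tree rather than real C5.0 output:

```python
# Sketch of the significance-score heuristic for any induction tool's tree;
# the four-node tree below is a made-up example, not real C5.0 output.
from collections import defaultdict

# Each node: (feature tested, number of patterns processed, level in tree).
nodes = [("f0", 100, 1), ("f1", 60, 2), ("f2", 40, 2), ("f1", 25, 3)]

scores = defaultdict(float)
for feature, n_patterns, level in nodes:
    # running score += (patterns processed by node) / (level of node in tree)
    scores[feature] += n_patterns / level

top = max(scores.values())
significance = {f: s / top for f, s in scores.items()}  # scale by maximum
ranking = sorted(significance, key=significance.get, reverse=True)
```

Here the root feature f0 accumulates the largest score (100/1) and so receives significance 1.0 after scaling.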
6 Assessment Measures of Rankings

Different techniques, and different applications of the same technique, may produce different rankings for the inputs. Hence, we need quantitative measures of the difference or correlation between rankings and of the reliability and stability of ranking techniques. Three measures are used in this study: we define two of them, D and β, and the third is a standard nonparametric statistic, Spearman's correlation. Suppose that two rankings, P = {. . . pi . . .} and Q = {. . . qi . . .}, are produced for a given problem with n features. Let pi denote the position of feature xi in ranking P, and qi the position of feature xi in ranking Q. Define di = pi − qi, the difference of the positions of feature xi in rankings P and Q. For the disagreement between two rankings, a simple measure is defined below by modifying the Manhattan distance:

D = (1/n) Σi=1..n |pi − qi| = (1/n) Σi=1..n |di|.  (6.1)
Interpretation of this measure is straightforward: the value of D represents the mean distance between two rankings, and a smaller D means less difference between them. When D = 0, the two rankings are exactly the same; D = n/2 (for n even) or D = (n² − 1)/(2n) (for n odd) indicates that the two rankings have the maximum distance, that is, are in completely inverse order. Spearman's correlation is a commonly used nonparametric statistical measure of similarity (or correlation) between paired rankings. The null hypothesis of the test is that no correlation exists between the two rankings. Spearman's correlation coefficient, rs, is calculated by

rs = 1 − (6 Σi=1..n di²) / (n(n² − 1)).  (6.2)
rs = 1 means that the two rankings are exactly the same, rs = −1 that they are in reverse order, and rs = 0 indicates no correlation between them. Tables exist for obtaining the significance levels of other values of rs. However, Spearman's coefficient measures only linear correlation between two rankings. It treats all features equally, without considering their relative ranking positions. What is really useful in practice is to know whether the factors ranked highly by one technique are also ranked highly by another, or by different applications of the same technique. In other words, the position changes of high-ranked features matter more than the position changes of low-ranked features. Therefore, we propose a weighted measure of the difference between rankings that gives
a larger weight to a feature ranked high in both rankings than to one in a lower position. It is defined as

β = (1/n) Σi=1..n |pi − qi| / ((pi + qi)/2).  (6.3)
By the definition, β = 0 indicates that the two rankings are identical, and the larger β is, the bigger the dissimilarity between the two rankings. When β = n/(n + 1) (for n even) or β = (n − 1)/n (for n odd), the two rankings have the maximum dissimilarity: a complete inverse order. In this study we measure the reliability and stability of the different ranking methods. To do this, a set of N rankings is produced by each method. Then the above measures are calculated for each possible pair within the ranking sets, and the mean and standard deviation are quoted for the N(N − 1)/2 pairs.

7 Empirical Comparison

The empirical study is limited to three problems: a well-defined abstract problem for which feature salience is known and can be systematically manipulated, and two rather different real-world problems, which are noisy and ill understood and typify the sorts of problems on which a feature-ranking technique must succeed if it is to be useful in practice. Within these three problems, the output categories in the data are generally well balanced (i.e., close to 50:50); in the worst case, the level-off problem, the skew is approximately 40:60. With networks generalizing at about 75% and approximate relative performance being all that is germane to this study, we have avoided the extra clutter of specificity and sensitivity analyses. The results are analyzed from two perspectives. The first is stability. To test stability, we apply the measures D, Spearman's test, and β, described in the earlier section, to the sets of N rankings, each set produced by one technique, and then compare the values of the measures. If the rankings produced by a given ranking technique have a short distance on average, or a smaller mean β, together with a small standard deviation, the technique is considered stable and consistent. Effectiveness (success) is the other perspective.
To test the effectiveness of the different ranking methods on the real-world problems, neural nets are trained and tested using selected subsets of the most salient features as ranked by each specific ranking technique in turn. A ranking technique is considered effective to the extent that a selected feature subset produces neural nets that generalize well compared with those trained on the full feature set and on randomly selected subsets. In our experiments, the number of repetitions, N, is 9 unless otherwise stated.
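The three agreement measures (6.1), (6.2), and (6.3) can be computed directly from paired position lists; a minimal sketch, with illustrative feature positions:

```python
# Sketch of the agreement measures (6.1)-(6.3). p[i] and q[i] are the
# positions (1-based ranks) of feature i in rankings P and Q; the example
# rankings at the bottom are illustrative.
def distance_D(p, q):
    """Mean Manhattan distance between paired rank positions, eq. 6.1."""
    n = len(p)
    return sum(abs(pi - qi) for pi, qi in zip(p, q)) / n

def spearman_rs(p, q):
    """Spearman's rank correlation coefficient, eq. 6.2."""
    n = len(p)
    return 1 - 6 * sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / (n * (n * n - 1))

def dissimilarity_beta(p, q):
    """Weighted dissimilarity, eq. 6.3: position changes near the top count more."""
    n = len(p)
    return sum(abs(pi - qi) / ((pi + qi) / 2) for pi, qi in zip(p, q)) / n

p = [1, 2, 3, 4, 5, 6]
q = [1, 3, 2, 4, 5, 6]   # features in positions 2 and 3 swapped
```

For identical rankings, D = 0, rs = 1, and β = 0; the single swap above leaves rs at 1 − 12/210 ≈ 0.94.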
7.1 LIC1: A Well-Defined Problem. The original LIC1 problem is defined as

LIC1 = true if d((x1, y1), (x2, y2)) > length, and false otherwise,  (7.1)

where d((x1, y1), (x2, y2)) is the Euclidean distance between the two points specified by the Cartesian coordinates (x1, y1) and (x2, y2). The problem is whether the distance between the two points is greater than a given random value, length. This problem (prediction of the outcome) had previously been solved successfully using a multiversion system of neural networks (Partridge, 1996; Partridge & Yates, 1996). It is used in this article for a different purpose: identifying the salience of the inputs. Intuitively, we understand that the parameter length is critical for making a decision and therefore should be ranked as the most significant input. The other four inputs, x1, y1, x2, y2, are equally important, but less so than length, and should be ranked in positions 2 to 5 with equal probability. The problem was modified slightly (Wang et al., 1998) by introducing an extra input, in6, into the input vector. First, we set in6 as a dummy variable (it is randomly generated and has no relation to the problem at all) in order to examine whether the salience-ranking techniques are able to identify it clearly as a redundant input (i.e., rank it in sixth position). Subsequently, we set in6 to (x2 − x1) (essentially the information contained in x1 and x2), which made feature in6 more salient than either y1 or y2, and relegated x1 and x2 to being redundant. A number of data sets composed of 1000 patterns were generated at random, with each input feature uniformly distributed on [0, 1]. As this problem has been thoroughly explored using the techniques of neural network clamping and weight product (Wang et al., 2000), detailed results will be presented only for the other techniques. Table 1 lists the values of the hyperparameters, α, and their corresponding rankings for three different data sets.
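The LIC1 data generation described above can be sketched as follows; the function name and seed are our own, and in6 is either a dummy input or x2 − x1:

```python
# Sketch of LIC1 data generation: 1000 random patterns, each input uniform
# on [0, 1]; the target is whether the Euclidean distance between (x1, y1)
# and (x2, y2) exceeds `length`. Function name and seed are illustrative.
import numpy as np

def make_lic1(n_patterns=1000, in6_dummy=True, seed=0):
    rng = np.random.default_rng(seed)
    x1, y1, x2, y2, length = rng.random((5, n_patterns))
    in6 = rng.random(n_patterns) if in6_dummy else x2 - x1
    target = np.hypot(x2 - x1, y2 - y1) > length       # LIC1, eq. 7.1
    X = np.stack([x1, y1, x2, y2, length, in6], axis=1)
    return X, target

X, target = make_lic1()
```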
The ARD technique worked very well for the first case, in which the ranking results match the expected ones. However, in the second case, it ranked the features y1 and y2 higher than in6. Table 2 lists the scaled scores of the inputs and their corresponding rankings from the decision tree heuristic technique, using the same three data sets. The results echo those of the ARD technique, except that in the second variation of LIC1, in6 is always ranked second to length as anticipated. Figures 1a and 1b show the input impacts achieved by the neural network clamping technique for the two variations (in6 = dummy and in6 = x1 − x2 ) of the LIC1 problem, respectively. Similarly, Figures 2, 3, and 4 illustrate the results of the other three ranking techniques—weight product, decision tree, and ARD—for the same problems. (In the ARD results, the smaller the value of an α, the more salient is the input.) These results indicate that all four techniques performed equally well on the first case (in which in6 is a redundant feature) and picked out in6
Table 1: Salience Rankings by ARD Using Three Training Sets on Two Variations of LIC1.

                 Set 1            Set 2            Set 3
Feature          α      :Rank     α      :Rank     α      :Rank

With in6 set to a dummy feature
x1               0.096  :4        0.028  :2        0.046  :5
y1               0.105  :5        0.036  :5        0.035  :3
x2               0.083  :3        0.032  :4        0.038  :4
y2               0.082  :2        0.031  :3        0.032  :2
length           0.059  :1        0.020  :1        0.029  :1
in6              0.275  :6        0.190  :6        0.131  :6

With in6 set to a salient feature x1 − x2
x1               0.057  :5        0.197  :6        0.142  :6
y1               0.017  :3        0.073  :3        0.060  :2
x2               0.059  :6        0.115  :5        0.088  :5
y2               0.014  :2        0.053  :2        0.063  :3
length           0.012  :1        0.036  :1        0.031  :1
in6              0.029  :4        0.091  :4        0.064  :4

Note: A smaller α value reflects higher relevance (salience).

Table 2: Salience Rankings by the Decision Tree Heuristic Using Three Training Sets on Two Variations of LIC1.

                 Set 1            Set 2            Set 3
Feature          Score  :Rank     Score  :Rank     Score  :Rank

With in6 set to a dummy feature
x1               0.034  :4        0.006  :5        0.003  :5
y1               0.016  :5        0.063  :3        0.074  :3
x2               0.144  :2        0.033  :4        0.004  :4
y2               0.064  :3        0.096  :2        0.101  :2
length           1.000  :1        1.000  :1        1.000  :1
in6              0.000  :6        0.000  :6        0.000  :6

With in6 set to a salient feature x1 − x2
x1               0.000  :5        0.000  :6        0.001  :6
y1               0.099  :3        0.051  :4        0.096  :3
x2               0.000  :5        0.002  :5        0.007  :5
y2               0.004  :4        0.062  :3        0.094  :4
length           1.000  :1        1.000  :1        1.000  :1
in6              0.392  :2        0.425  :2        0.692  :2

Note: The higher a score is, the more important is the input.
as the least important. But for the second variation—in6 = x1 − x2 —the results from weight product and the ARD techniques are not as obvious as those from clamping and the decision tree heuristic. We also note that interrelated feature subsets (such as x1 and x2 ) emerge with similar and appropriate salience values under the clamping technique (see Figure 1b when in6 = x1 − x2 ).
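As a concrete illustration of the clamp-and-retest cycle of section 2, the impact ratio of equation 2.1 can be sketched as follows; the "trained network" here is a stand-in scoring rule and the data are synthetic, both our own inventions:

```python
# Sketch of the clamp-and-retest cycle (section 2) and the impact ratio
# of equation 2.1. The predictor is a stand-in rule, not a trained net.
import numpy as np

def impact_ratios(predict, X_test, y_test):
    """xi -> 1 - g(x | xi = mean_i) / g(x), one ratio per feature."""
    baseline = np.mean(predict(X_test) == y_test)          # g(x)
    ratios = []
    for i in range(X_test.shape[1]):
        clamped = X_test.copy()
        clamped[:, i] = X_test[:, i].mean()                # clamp feature i
        g_clamped = np.mean(predict(clamped) == y_test)    # g(x | xi = mean)
        ratios.append(1 - g_clamped / baseline)
    return np.array(ratios)

# Toy check: a rule that uses only feature 0 shows impact only there.
rng = np.random.default_rng(1)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)
r = impact_ratios(lambda Z: (Z[:, 0] > 0.5).astype(int), X, y)
```

Clamping an unused feature leaves the predictions unchanged, so its impact ratio is exactly zero, while clamping the decisive feature produces a large positive ratio.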
Figure 1: Comparison of the input impact ratios obtained by neural network clamping technique for the LIC1 problem. (a) Input impacts for in6 = dummy. (b) in6 = x1 − x2 . The bar at the top of each column is the standard deviation.
Table 3 shows the results of evaluating the stability of the set of rankings produced by each technique separately. The measures D and β clearly indicate that the rankings by the clamping technique have the smallest dissimilarity, with the decision tree heuristic recording the next smallest values. The Spearman’s test shows that the rankings generated by the clamping have the highest correlation coefficient (0.71), with the decision tree heuristic again coming in second (0.69), followed by the weight product technique, and then ARD on a par with linear correlation. The clamping technique also tended to produce the lowest standard deviations within its rankings, which implies consistency over the different applications of the technique. Clamping also produces the lowest significance level of all the nonlinear techniques explored. All of these results indicate that the clamping technique is the most stable.
Figure 2: Comparison of the input impacts (scaled) obtained by the weight-product ranking technique for the two LIC1 problems.

Table 3: Stability Measures of the Feature Rankings Within Each of the Five Ranking Techniques for the LIC1 Problem.

                    Distance D        Spearman's rs     Spearman's Significance   Dissimilarity β
Technique           Mean    S.D.      Mean    S.D.      Mean    S.D.              Mean    S.D.
Linear correlation  1.3519  0.6618    0.4271  0.4466    0.1195  0.1441            0.2541  0.1598
Net clamping        0.7986  0.3873    0.7061  0.1423    0.1596  0.1040            0.1846  0.1236
Weight product      1.3148  0.4511    0.5321  0.2462    0.3529  0.2533            0.3583  0.1233
Decision tree       0.8021  0.4732    0.6868  0.1611    0.1674  0.1297            0.2246  0.1220
ARD                 1.3482  0.6583    0.4635  0.2561    0.2612  0.1388            0.4836  0.1926
7.2 Osteoporosis Prediction. Osteoporosis is a systemic skeletal disease characterized by low bone mass and microarchitectural deterioration of
Figure 3: Comparison of the input significance scores obtained by the decision tree heuristic technique for the two LIC1 problems.
bone tissue, with a consequent increase in bone fragility and susceptibility to fractures. People are often unaware of developing osteoporosis because the early symptoms of the disease are not obvious and may not cause pain until the disease has progressed to an advanced stage, when bones start to fracture. There is currently no cure for osteoporosis, and the best treatment is prevention, which requires early identification of the likelihood of the disease. Early diagnosis is therefore very important and will make preventative treatments more effective; identification of the most significant risk factors associated with development of the disease will help earlier diagnosis. More than 700 data cases were collected from regional hospitals using a questionnaire designed by medical experts, together with tests to supply a diagnosis. Each case comprised 40 different risk factors, which included basic facts (such as age and height), lifestyle information (such as alcohol consumption and exercise), medical history (such as hormone
Figure 4: Comparison of the α values achieved by the ARD ranking technique for the two LIC1 problems.
replacement therapy), and family history (such as incidence of osteoporosis in ancestors). After further study (Rae, Wang, & Partridge, 1999), 31 of these risk factors were identified as particularly relevant to the occurrence of the disease and were therefore taken as input variables for prediction. These constituted the 31 input features for our study; the output was thresholded to a boolean diagnosis of osteoporotic or nonosteoporotic. The 719 cases were partitioned into three subsets at random: one set of 319 patterns was used for net training, and the other two subsets, of 200 patterns each, were used for validation and testing. Faced with this limited data set and the real need to identify the important cost-effective features before more data could be collected, multiple experiments were conducted by shuffling the data before each repetition. Each technique produces a ranking of the features in accordance with their salience. The features ranked at the top are more significant or relevant
1616
W. Wang, P. Jones, and D. Partridge
Figure 5: Test performance of nets trained on variously selected subsets of osteoporosis features compared with performance on the full feature set.
to prediction of the disease. The features ranked at the bottom are less important or redundant. The features ranked in the top 10, 15, and 20 positions (averaged over nine repetitions) by the different techniques were identified, as well as randomly selected subsets. The corresponding data subsets were then constructed for training and testing purposes. Table 4 summarizes the results of the test performance of the nets trained with the feature-reduced data sets. Figure 5 depicts more clearly the comparison of the test performance of the nets trained with a different number of features selected separately by each ranking technique, plus a random ranking. It clearly shows that all the ranking techniques performed better than random selection. The nets trained using the top 10 and top 15 features chosen by the clamping technique tested better than the nets trained using the features selected by the other methods. The decision tree heuristic also did quite well, only slightly below clamping. ARD achieved similar results in the 20-feature group, and its results were always better than those of the weight-product method. 7.3 Level-Off Prediction. The level-off problem is derived from an overall problem of alerting air traffic controllers to the possibility of midair collision. In particular, if it could be reliably predicted when an aircraft is about to level off for some subset of aircraft, then the overall aircraft alert system could be improved. The data sets used comprise information for aircraft descending in stacks at London’s Heathrow airport. In a stack, there exist a number of flight
levels at which an aircraft may or may not level off. Each data pattern used in this experiment has 28 input features, some real valued and some boolean, representing the vertical velocity and position of the aircraft just before reaching a flight level, the location of the flight level, and the aircraft category. The single output, or target feature, is boolean and represents whether the aircraft actually levels off. The data were obtained from real aircraft tracks recorded at Heathrow over three weeks. The first two weeks were used for training and ranking. The third week was used for final testing. Each week contains approximately 15,000 patterns. From earlier work unrelated to this experiment, it was already known that the data contain noise and inconsistencies. Because of the large quantity of noise and inconsistencies in the data, obtaining a ranking using ARD proved difficult since singularity in the Hessian matrix occurred when some patterns were used. However, by taking random selections of 1000 and 2000 patterns, a number of successful ARD runs were possible, and a ranking was obtained by averaging the successful runs. Because of the simplicity of the calculation of ranking by the clamping, weight product, and tree algorithms, the noise and inconsistencies in the data did not prevent the calculations from taking place whatever subset of the data was used. Clamping and weight-product values were obtained from nets trained for a fixed 20 epochs with large data sets. The tree ranking was obtained from trees created by C5.0 with all the commercial default settings. In each case, the ranking calculation was made nine times each using 8/9 of the data from the first two weeks, and averages were taken. In order to compare the rankings obtained by the different methods, an investigation was carried out to see how well a neural net could be trained using only the first 5, 10, or 15 features in the rank. 
Since an independent test set, the third week’s data, was available (unlike the osteoporosis problem, where the quantity of the data patterns is relatively small), a comparison of the performance of nets trained with different subsets of the features could be made. For each ranking method and each number of features, nine nets were trained. Nine nets were also trained using all 28 features. Diversity was introduced by using different subsets of the first two weeks’ data for training and validation, different numbers of hidden units, and different initial weights. Each net was then tested with the final week’s data and the generalization recorded. Fifty nets were also trained using different random selections of 5, 10, or 15 features for comparison with the selected features. The results are shown in Table 5 and Figure 6. From the figure it can be seen that all the methods of ranking were better than random. The rankings determined by clamping and trees seem to be most effective. The ARD ranking was poor (direct comparison is not straightforward because ARD failed to handle as many data patterns as the other methods).
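The comparison protocol above (rank features, then train nets on the top-k subset versus a random subset and compare test accuracy) can be sketched in a few lines. The data here are synthetic stand-ins for the real aircraft data, and a logistic unit trained by gradient descent stands in for the paper's multilayer perceptrons; only the protocol itself is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 600, 28, 3                     # patterns, features, subset size

# Synthetic stand-in data: only the first 3 of 28 features carry signal.
X = rng.standard_normal((n, d))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(float)
X_tr, y_tr, X_te, y_te = X[:400], y[:400], X[400:], y[400:]

def train_acc(cols):
    """Logistic regression by gradient descent on the chosen feature subset;
    returns test-set accuracy."""
    Xs, Xt = X_tr[:, cols], X_te[:, cols]
    w, b = np.zeros(len(cols)), 0.0
    for _ in range(300):
        p_hat = 1 / (1 + np.exp(-(Xs @ w + b)))
        g = p_hat - y_tr                 # gradient of cross-entropy w.r.t. logit
        w -= 0.1 * Xs.T @ g / len(y_tr)
        b -= 0.1 * g.mean()
    p_te = 1 / (1 + np.exp(-(Xt @ w + b)))
    return ((p_te > 0.5) == y_te).mean()

# Rank features by |linear correlation| with the target, then train nets
# on the top-k subset versus a random subset, as in the comparison above.
corr = np.abs([np.corrcoef(X_tr[:, j], y_tr)[0, 1] for j in range(d)])
top_k = np.argsort(corr)[::-1][:k]
rand_k = rng.choice(d, size=k, replace=False)
print(sorted(top_k.tolist()), train_acc(top_k), train_acc(rand_k))
```

With a clear signal, the correlation ranking recovers the informative features and the top-k net generalizes much better than a random-k net, mirroring the paper's Tables 4 and 5.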
Table 4: Generalization Rates of the Nets Trained with Reduced Numbers of Osteoporosis Features Selected by the Five Ranking Techniques and a Random Picking Method.

Selected Features     Top 10          Top 15          Top 20          All 31
                      Mean    S.D.    Mean    S.D.    Mean    S.D.    Mean    S.D.
Linear Correlation    0.712   0.018   0.732   0.013   0.741   0.014   0.723   0.014
Clamping              0.746   0.011   0.765   0.008   0.721   0.017   0.723   0.014
Weight Product        0.725   0.014   0.749   0.019   0.707   0.019   0.723   0.014
Decision Tree         0.739   0.014   0.758   0.013   0.723   0.015   0.723   0.014
ARD                   0.726   0.013   0.757   0.014   0.739   0.015   0.723   0.014
Random                0.642   0.046   0.684   0.031   0.701   0.023   0.723   0.014

Table 5: Test Performance of the Nets Trained with Reduced Numbers of Level-Off Features Selected by the Ranking Techniques and a Random Picking Method.

Selected Features     Top 5           Top 10          Top 15          All 28
                      Mean    S.D.    Mean    S.D.    Mean    S.D.    Mean    S.D.
Linear Correlation    0.732   0.003   0.756   0.003   0.762   0.003   0.768   0.004
Clamping              0.754   0.003   0.757   0.004   0.768   0.004   0.768   0.004
Weight Product        0.731   0.004   0.736   0.003   0.770   0.002   0.768   0.004
Decision Tree         0.749   0.003   0.765   0.003   0.772   0.003   0.768   0.004
ARD                   0.725   0.001   0.733   0.003   0.759   0.002   0.768   0.004
Random                0.683   0.034   0.721   0.025   0.745   0.018   0.768   0.004
Figure 6: Test performance of nets trained on variously selected subsets of level-off features compared with performance on the full feature set.
8 Discussion

All the techniques do quite a good job of feature-salience ranking: they get it correct for LIC1, and they indicate subsets that are sometimes superior (for neural net training) to all features and are always superior to randomly chosen subsets. Using the decision tree heuristic to rank the features of the LIC1 problem produced variable values of the scaled score, although the ranking produced was always exactly as expected. Table 2 gives only three results, which vary from each other, but all agree on the rankings of the input length (always ranked in first place) and the input dummy (ranked in sixth position). Taking the techniques in order, linear correlation analysis is generally the poorest, although it is at times surprisingly effective (e.g., with the level-off problem, where it is superior to several other techniques). Originally it was included merely to ensure that something better could be done; the results from the other techniques provide this assurance. However, the performance of linear correlation analysis is often better than we anticipated. Clamping proves to be consistently good. Although it relies on prior training of a neural network to a reasonable (but definitely not optimal) level, the subsequent clamping and salience assessment is simple and straightforward, is not subject to distortion, much less obstruction, by noisy data, and increases only linearly with the number of features to be ranked. We used the standard backpropagation algorithm on multilayer perceptrons with
one hidden layer for network training, but any appropriate training algorithm and neural net combination is expected to do as well. The weight-product technique proves to be accurate for the most salient feature but erratic and uninformative for the remainder of the features. High-salience subsets selected by this method tended to perform worse than those selected by most of the other methods. The problems with this technique have been fully examined and documented elsewhere (Wang et al., 2000). The decision tree heuristic performed surprisingly well, and its performance appears to be a general feature of decision trees; C5.0 was used in default mode to compute results. Decision tree induction is not suitable for all problems, and a reasonable decision tree is a necessary prerequisite for worthwhile use of the heuristic. However, decision tree training tends to be considerably faster than neural net training, and ranking time (i.e., traversing a decision tree with the ranking heuristic) is probably better than linearly dependent on the feature set size. In practice, where very large numbers of input features are unlikely to be an issue, the decision tree heuristic is virtually instantaneous, the clamping and weight-product techniques involve a time delay in order to (roughly) train neural nets, and ARD tends to take orders of magnitude longer (frequently without terminating satisfactorily; see below). The ARD technique is the one founded on solid analytical grounds. It is also by far the most complex and most fragile of the techniques. It constantly stalled on the real-world data problems and needed considerable attention to obtain a useful series of results. When it completed, the results were good but no better than (and usually slightly inferior to) those from both the clamping technique and the decision tree heuristic.
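The clamping procedure discussed above can be sketched as follows: clamp one input at a time (here, to its training mean) and score the resulting drop in accuracy of a trained one-hidden-layer net. The data, network size, clamping value, and training settings below are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 6
X = rng.standard_normal((n, d))
y = (2 * X[:, 0] - X[:, 3] > 0).astype(float)   # only features 0 and 3 matter

# One-hidden-layer perceptron trained by plain backpropagation.
h = 8
W1 = 0.5 * rng.standard_normal((d, h)); b1 = np.zeros(h)
W2 = 0.5 * rng.standard_normal(h);      b2 = 0.0

def forward(Xin):
    Z = np.tanh(Xin @ W1 + b1)
    return Z, 1 / (1 + np.exp(-(Z @ W2 + b2)))

for _ in range(2000):
    Z, p = forward(X)
    g = (p - y) / n                      # cross-entropy gradient w.r.t. logit
    W2 -= 0.5 * Z.T @ g; b2 -= 0.5 * g.sum()
    Gz = np.outer(g, W2) * (1 - Z**2)    # backpropagate through tanh layer
    W1 -= 0.5 * X.T @ Gz; b1 -= 0.5 * Gz.sum(axis=0)

base = ((forward(X)[1] > 0.5) == y).mean()

# Clamping: freeze one input at its mean and record the accuracy drop.
salience = []
for j in range(d):
    Xc = X.copy(); Xc[:, j] = X[:, j].mean()
    salience.append(base - ((forward(Xc)[1] > 0.5) == y).mean())

ranking = np.argsort(salience)[::-1]     # most salient feature first
print(ranking[:2])                       # the two informative features lead
```

Note the cost structure the text describes: one trained net, then one forward pass per feature, so the ranking cost grows only linearly with the number of features.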
But in fairness to this technique, it should be said that we use it for feature ranking, which might be termed no more than a by-product of its overall goal: the production of an optimally trained neural network for the problem at hand. We might say that the ARD technique, through its adjustment of the hyperparameters associated with each input feature, in effect selects the most salient features as an integral part of the (modified) training procedure. This does not, of course, help with our goal of preliminary data analysis in order to design a cost-effective data collection procedure, but it does have relevance to the question of optimal network performance as a result of input feature filtering. It might also be noted that within the ARD procedure, the inverse of $\alpha_{MP}$, essentially the salience indicator, is computed from a summation of weight products (divided by the expensively computed values $\gamma$). This appears to echo the form of the weight-product method, which suggests that a fast and easy approximation to the first $n$ hyperparameters,

$$\alpha_i' = 1 \Big/ \sum_j w_{ij}^2,$$
might yield another quick and simple method for salience ranking from a roughly trained network. Initial investigations of this idea look promising.
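The proposed approximation can be computed directly from a roughly trained network's input-to-hidden weights. The weight values below are made up for illustration; the formula is the one above.

```python
import numpy as np

# Hypothetical first-layer weight matrix of a roughly trained net:
# rows = input features, columns = hidden units. Feature 0 carries
# large weights (salient); feature 2 carries tiny ones.
W = np.array([[2.0, -1.5,  1.0],
              [0.5,  0.4, -0.3],
              [0.1, -0.05, 0.02]])

# Fast approximation to the ARD hyperparameters: alpha'_i = 1 / sum_j w_ij^2.
# In ARD a LARGE alpha suppresses a feature, so a SMALL alpha' means
# high salience.
alpha_approx = 1.0 / np.sum(W**2, axis=1)

# Rank features by salience (ascending alpha' = descending relevance).
ranking = np.argsort(alpha_approx)
print(ranking)
```

Here feature 0 ranks first and feature 2 last, as expected from the weight magnitudes; unlike full ARD, no Hessian evaluation or reestimation cycle is needed.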
We did not attempt at this stage to carry out a strict theoretical analysis of the computational complexity of the ranking algorithms, because some (e.g., C5.0) we used as a black box. Nevertheless, the experiments we have conducted suggest that the decision tree ranking is the quickest one among these four techniques. The other three are neural-net-based methods, within which (apart from the training of the nets) the clamping and weight-product techniques are of linear complexity with respect to feature set size. But ARD takes much longer to optimize the hyperparameters in the reestimation cycles because of the repeated matrix inversions $A^{-1}$. This is to be expected when matrix inversion is of order $n^3$ complexity and the matrix size $n$ is itself some exponential function larger than the number of input features. With respect to the particular real-world problems, this preliminary study of the osteoporosis data clearly paid off, for it indicates not only that we can reduce the expense of data collection by collecting some 10 to 20 features (as opposed to the original collection of over 40), but also that by so doing we can do better. As we can see from Figure 5, the nets trained using selected reduced-feature subsets outperformed the nets trained with all 31 features. The test results suggest that 10 to 15 features chosen by the clamping and decision tree techniques are more likely to produce higher accuracy in disease prediction than when all available features are used. The features ranked at the bottom by these two methods appear to be detrimental to prediction. This fact emphasizes the necessity of identifying the most relevant features, for effectiveness in terms of both the cost of data collection and the level of prediction performance achievable. There is no similar gain from feature set reduction apparent in the level-off problem.
But 15 features (almost half) selected by almost any technique, except ARD and random picking, prove to be about as good as all 28 features. So again, restricted feature collection is desirable if only to reduce costs and complexity.

9 Conclusion

We have compared our two techniques (neural net clamping and the decision tree heuristic) with four other methods (two neural-net-based, the weight product and ARD; a conventional linear correlation analysis; and random picking) for identifying and ranking feature salience in order to optimize full data collection from a wide-ranging study. The study was undertaken on three problems: a well-studied simple problem (LIC1) and two different real-world problems, osteoporosis prediction and aircraft level-off prediction. Our two techniques are simple and practical and generally produce a good, consistent ordering of input features. The ARD method suffers from fragility and complexity as well as an inability to outperform the other techniques. The weight-product technique was excellent at finding the most salient feature but relatively poor at ranking other features. Linear correlation, a well-established method for assessing features, compares reasonably well with the other methods of ranking, particularly when the problems are linear or near linear. Nevertheless, the other methods usually did better on real problems. Our results have demonstrated that feature-salience identification and ranking techniques are particularly important for dealing with real-world problems in terms of reduction of dimensionality of input data. In this sense, simple and reliable methods for feature ranking are of most practical use in the early stages of attempting to understand a real data set. And at this initial stage, the data are likely to contain unknown and erratic noise as well as inconsistencies, so a robust technique is essential. For both real-world problems, the benefit of such a preliminary salience analysis in reducing the cost of data acquisition was evident. With the osteoporosis problem, in particular, the cost saving was coupled with a clear accuracy gain. For the level-off problem, the clamping and the decision tree rankings, although determined by very different methods, were extremely similar, while the other rankings were all different. Substantial agreement between these two simple and quick, though diverse, ranking techniques might be interpreted as an indication of inherent feature salience rather than salience with respect to application in neural computation. Rankings for the LIC1 features, which matched expectation with respect to an abstract algorithm, would lend further support to this possibility. The larger problem of feature salience in some abstract sense, rather than with respect to a computational technology to be applied, is beyond the scope of this study, although our results hint at salience in the wider sense.

Acknowledgments

This research was funded by the EPSRC, UK, under grant number GR/K78607.
We thank Julia Sonander of National Air Traffic Services for the level-off problem data, and Sarah Rae, formerly with the Royal Devon and Exeter Hospital, for the osteoporosis data.

References

Fukunaga, K. (1990). Introduction to statistical pattern recognition. San Diego, CA: Academic Press.
Gedeon, T. (1997). Data mining of inputs: Analysing magnitude and functional measures. Int. J. of Neural Systems, 8, 209–218.
Liu, H., & Motoda, H. (1998). Feature selection for knowledge discovery and data mining. Norwell, MA: Kluwer.
MacKay, D. (1998). Bayesian methods for supervised neural networks. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 144–149). Cambridge, MA: MIT Press.
MacKay, D. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
Partridge, D. (1996). Network generalization differences quantified. Neural Networks, 9, 263–271.
Partridge, D., & Yates, W. (1996). Engineering multiversion neural-net systems. Neural Computation, 8, 869–893.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Rae, S., Wang, W., & Partridge, D. (1999). Artificial neural networks: A potential role in osteoporosis. Journal of the Royal Society of Medicine, 92, 119–122.
Tchaban, T., Taylor, M., & Griffin, J. (1998). Establishing impacts of the inputs in a feedforward network. Neural Computing and Applications, 7, 309–317.
Utans, J., & Moody, J. (1991). Selecting neural network architecture via the prediction risk: Application to corporate bond rating prediction. In D. S. Touretzky (Ed.), Proc. 1st Int. Conference on Artificial Intelligence Applications on Wall Street. Los Alamitos, CA: IEEE Computer Society Press.
Wang, W., Jones, P., & Partridge, D. (1998). Ranking pattern recognition features for neural networks. In S. Singh (Ed.), Advances in pattern recognition (pp. 232–241). New York: Springer-Verlag.
Wang, W., Jones, P., & Partridge, D. (2000). Assessing the impact of input features in a feedforward network. Neural Computing and Applications, 9, 101–112.

Received August 5, 1999; accepted September 25, 2000.
LETTER
Communicated by Shun-ichi Amari
A Theory for Learning by Weight Flow on Stiefel-Grassman Manifold Simone Fiori Neural Networks and Adaptive System Research Group, Department of Industrial Engineering, University of Perugia, Italy
Recently we introduced the concept of neural network learning on the Stiefel-Grassman manifold for multilayer perceptron–like networks. Contributions of other authors on this topic have also appeared in the scientific literature. This article presents a general theory for such learning and illustrates how existing theories may be explained within the general framework proposed here.

1 Introduction

In a multilayer perceptron–like network formed by the interconnection of basic neurons, whose only adjustable parts are the weight vectors, learning the optimal set of connection patterns may be interpreted as selecting the best directions among all possible ones in the space that the weight vectors belong to (Fyfe, 1995). This interpretation is very useful in that, if a learning error criterion is defined over the weight space, it provides a measure of the relevance of the different directions, so that ultimately the rule with which the network learns may be conceived of as a search procedure that allows finding the optimal (most interesting) directions. If the criterion is designed to be unbounded, its optimization must be performed under some constraint, and the weight space reduces to the subset of directions satisfying mutual restrictions. Also, some problems have a particular structure such that it is known a priori that the associated optimal solutions must fulfill mutual constraints, as in limited-resource-based learning theories. In this article the constraint of orthonormality is considered: each weight vector is expected to have a prescribed norm and to be orthogonal to the others.
Following is a partial list of problems and applications where orthonormal solutions are involved:

• Optimal linear compression, noise reduction, and signal representation by principal or minor component analysis and principal or minor subspace decomposition (Affes & Grenier, 1995; Ephraim & Van Trees, 1995; Karhunen & Joutsensalo, 1993; Oja, 1989; Palmieri, Zhu, & Chang, 1993; Xu, Oja, & Suen, 1992; Yang, 1995)

Neural Computation 13, 1625–1647 (2001). © 2001 Massachusetts Institute of Technology
• Blind source separation by signal prewhitening (Cardoso & Laheld, 1996; Cichocki, Karhunen, Kasprzak, & Vigario, 1999; Comon, 1994; Hyvärinen & Oja, 1998; Karhunen, 1996; Paraschiv-Ionescu, Jutten, & Bouvier, 1997)
• Denoising by sparse coding (Oja, Hyvärinen, & Hoyer, 1999)
• Direction of arrival estimation (Affes & Grenier, 1995; Comon, 1994; MacInnes & Vaccaro, 1996)
• Best basis search or selection (Amari, 1999; Kreutz-Delgado & Rao, 1999; Moreau & Pesquet, 1997)
• Linear programming problem solving (Brockett, 1991)
• Optical character recognition by transformation-invariant neurons (Sona, Sperduti, & Starita, 2000)
• Electronic structures computation within the local density approximation, for example, for understanding the thermodynamics of bulk materials, the structure and dynamics of surfaces, and the nature of point defects in crystals (Edelman, Arias, & Smith, 1998)

Moreover, the eigenvalue decomposition theorem or, more generally, the singular value decomposition theorem are widely known mathematical tools that allow recasting any problem into a pair of orthonormal problems. In order to enforce the constraint of orthonormality, standard methods are known that rely on the definition of a penalty function, which measures how much the weight vectors deviate from being mutually orthogonal and normal, and the definition of a Lagrange function built up on the basis of the learning criterion (for a detailed discussion, see Cichocki & Unbehauen, 1993; Bertsekas, 1982). Also, the Gram-Schmidt orthogonalization procedure or the projection over the orthogonal group may be employed as well (Helmke & Moore, 1993; Nishimori, 1999; Yang, 1995). Nevertheless, some authors have developed their own techniques for enforcing the orthonormality constraints as a way to better exploit the geometrical properties of the weight space and the properties that the learning criteria induce on it (Amari, 1998).
We illustrate a general theory for orthonormal learning, whose bases we have already given (Fiori, 1996; Fiori & Piazza, 1998; Fiori, Uncini, & Piazza, 1998), and present in a homogeneous way the related contributions that have recently appeared in the scientific literature. A summary of the contributions considered in this article is given in Table 1.

Notation. Symbol $\mathrm{tr}[\cdot]$ denotes the trace of the matrix contained within, $(\cdot)^T$ denotes transposition, and $\det(\cdot)$ indicates the determinant of the argument; if the matrix $M \in \mathbb{R}^{p \times p}$ is such that $M^T = -M$, then it is termed skew-symmetric. Symbols $\nabla f$ and $\frac{\partial f}{\partial M}$ denote the Jacobian of $f(M)$ with respect to the matrix (or the vector) $M$; the symbol $\dot{x}(t)$ stands for $dx/dt$. Symbol $E_x[f(x)]$
Table 1: Summary of the Contributions Considered.

Authors                                                        Contribution
Grassman (1862); Stiefel (1935–1936)                           Grassman-Stiefel manifold
Aluffi-Pentini, Parisi, and Zirilli (1985)                     Optimization via second-order dynamical systems
Brockett (1991)                                                First-order dynamical systems on Stiefel manifold
Comon (1994); Comon and Moreau (1997)                          Independent component analysis by prewhitening
Xu (1994)                                                      PCA/PSA by gradient learning on Stiefel manifold
Deco and Brauer (1995); Parra, Deco, and Miesbach (1996)       Volume-conserving networks
Cardoso and Laheld (1996)                                      Learning or adapting by relative gradient
Fiori (1996, 1999); Fiori, Uncini, and Piazza (1997)           Mechanical learning
Moreau and Pesquet (1997)                                      Contrast and semicontrast function
Amari (1998)                                                   Natural gradient
Douglas, Kung, and Amari (1998); Chen, Amari, and Lin (1998)   Insights on invariant manifold stability
Fiori, Uncini, and Piazza (1998); Fiori and Piazza (1998)      Learning on Stiefel-Grassman manifold
Douglas and Kung (1999); Nishimori (1999)                      Learning via geodesic flow

Note: The cited work refers to the contributions where the theory appeared with the most details.
denotes mathematical expectation with respect to the statistics of $x$; symbol $I_p$ denotes the $p \times p$ identity matrix.

2 Neural Learning on Stiefel-Grassman Manifold

In this section the concept of learning by weight flow on the Stiefel-Grassman manifold is defined, and some general properties that characterize the class of associated learning rules are illustrated. Two families of learning rules belonging to that class are studied, and relationships with some existing contributions drawn from the scientific literature are established.

2.1 A Class of Stiefel Learning Equations. Consider a transformation of a signal $x(t) \in \mathbb{R}^p$ into a new signal $y(t) \in \mathbb{R}^q$ performed by a one-layer neural network described by

$$y = S(W^T x + w_0), \tag{2.1}$$

where $S(\cdot)$ indicates an activation function, the weight matrix is $W(t) \in \mathbb{R}^{p \times q}$ with $q \leq p$, and $w_0(t) \in \mathbb{R}^q$ is a biasing vector arbitrarily adapted. Suppose that a connection pattern is looked for such that a global criterion $C(W)$
is optimized (i.e., maximized or minimized) under the constraint of orthonormality for $W$, which means that the optimal weight matrix has to satisfy the condition

$$W^T W = w^2 I_q, \tag{2.2}$$

for some value $w \in \mathbb{R} \setminus \{0\}$. The subset of $\mathbb{R}^{p \times q}$ of the matrices satisfying equation 2.2 has been studied by Stiefel (1935–1936) and Grassman (1862). We informally recall the definitions of the Grassman manifold and the Stiefel manifold from Douglas and Kung (1999), Gallot, Hulin, and Lafontaine (1991), and Helgason (1978). The space of matrices $N_w^{p \times q} \subset \mathbb{R}^{p \times q}$ such that $W^T W = w^2 I_q$, together with a homogeneous function $C: N_w^{p \times q} \to \mathbb{R}$ such that $C(W) = C(WQ)$ for any $q \times q$ orthogonal matrix $Q$, is called the Grassman manifold. $N_w^{p \times q}$ together with an inhomogeneous function $C(W) \neq C(WQ)$ for an orthogonal matrix $Q \neq I_q$ is called the Stiefel manifold. In other terms, let us consider a set of $q$ mutually orthogonal equilength vectors, in a $p$-dimensional Euclidean space, termed a $q$-tuple. The set of all $q$-tuples defines a Stiefel manifold; a $q$-tuple also defines a $q$-dimensional subspace in the original Euclidean space. Any orthogonal transformation inside a $q$-subspace changes $q$-tuples but does not change the related $q$-subspaces. These $q$-tuples are said to be equivalent. The set of all equivalent $q$-tuples gives a Grassman manifold. Usually the cost functions considered in neural learning are expected to be inhomogeneous; thus, in the following, the treatment will be restricted almost exclusively to learning on the Stiefel manifold. In this case, the learning problem for the network, equation 2.1, with the constraint 2.2 can be reformulated as

$$\text{Find } W^\star \text{ such that } C(W^\star) = \min_{W \in N_w^{p \times q}} C(W) \tag{2.3}$$

for some value $w$. (If $C$ has to be maximized, the problem is recast by replacing $C$ with $-C$.) A standard way to obtain the desired solutions is to define the Lagrangian function,

$$C_\ell(W) \stackrel{\mathrm{def}}{=} C(W) + \frac{1}{2}\,\mathrm{tr}[(W^T W - w^2 I_q)E], \tag{2.4}$$

where $E$ contains the so-called Lagrange multipliers, and to look for the free extremes of the function $C_\ell(W)$, for instance, by means of the gradient descent technique, that is, $\dot{W} = -\eta \nabla C_\ell$, with $\eta > 0$. In order to ensure orthonormality, the iterative orthogonalization of the columns of $W$ could be employed as well, for instance, by the Gram-Schmidt orthogonalization of the matrix updated by gradient optimization of $C(W)$ (Orfanidis, 1990; Yang, 1995) or the projection onto the orthogonal group (Nishimori, 1999). However, imposing the constraint 2.2 may be problematic in practice (see, for instance, the discussion in Douglas & Kung, 1999). Some researchers have thus started
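The iterative re-orthogonalization mentioned above can be sketched with a QR factorization (a standard numerical stand-in for classical Gram-Schmidt on the columns), rescaled to the prescribed column norm $w$; the sizes and values are illustrative.

```python
import numpy as np

def orthonormalize(W, w=1.0):
    """Project the columns of W back onto the constraint W^T W = w^2 I_q
    via QR factorization (numerically stabler than classical Gram-Schmidt)."""
    Q, R = np.linalg.qr(W)
    # Fix the column signs so the map acts as a projection
    # (Q columns aligned with W's columns).
    Q = Q * np.sign(np.diag(R))
    return w * Q

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))     # a weight matrix drifted off the manifold
W = orthonormalize(W, w=2.0)
print(np.allclose(W.T @ W, 4.0 * np.eye(3)))
```

Such a projection can be applied after each unconstrained gradient step, at the cost the text notes: the constraint is restored only at discrete instants rather than maintained by the flow itself.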
to study learning paradigms that keep the weight matrix orthogonal at any time, belonging to the following class of learning rules:

$$S \stackrel{\mathrm{def}}{=} \left\{ A_C(W) \,\middle|\, \forall t \in T \subset \mathbb{R}: W^T(t)W(t) = w^2(t)I \right\}, \tag{2.5}$$

where $A_C(W)$ is a generic learning rule for $W(t)$, $T$ is a time interval, $w(t)$ is a scalar function continuously differentiable almost everywhere at least once, and $C = C(W)$ is the objective function that drives the network's learning. We say that each algorithm contained in $S$ is SOC (orthonormal strongly constrained) or Stiefel. Any rule in $S$ is characterized by a fundamental property. Differentiating both members of $W^T(t)W(t) = w^2(t)I_q$ with respect to time yields

$$\dot{W}^T W + W^T \dot{W} = \phi I_q, \tag{2.6}$$

$$\text{with } \phi \stackrel{\mathrm{def}}{=} 2w\dot{w}. \tag{2.7}$$

Noticeably, when $W \in N_w^{p \times p}$, it follows that $|\det[W(t)]| = |w(t)|^p$. The general form of $\dot{W}$ for an SOC learning rule, which satisfies equation 2.6, can be written as

$$\dot{W} = PW + \frac{\phi}{2w^2} W, \tag{2.8}$$

where $P \in \mathbb{R}^{p \times p}$ is a skew-symmetric matrix describing the rotational part of the weight matrix evolution, and $\phi/(2w^2)$ describes the stretching of the weight matrix columns during learning. To start, we need to prove that the above system truly generates a flow on the Stiefel-Grassman manifold and that equation 2.8 is general.

Theorem 1. Consider the dynamical system $\dot{W} = PW + sW$, where $P(t)$ is skew-symmetric and $s \in \mathbb{R}$; if $W(0) \in N_{w_0}^{p \times q}$, then the flow $W(t)$ is Stiefel; also, the above dynamical system describes any Stiefel flow.

Proof. The matrix $P$ enjoys the property $P^T = -P$. With some algebra, it can be found that for $t \geq 0$ it holds that $d(W^T W)/dt = W^T(P^T + P)W + 2sw^2 I_q$; thus, condition 2.6 is fulfilled. The fact that the provided representation is general may be proven by observing that equation 2.6, thought of as a linear equation in the unknown $\dot{W}$, gives $q(q+1)/2$ equations against $pq$ unknowns; thus, a general solution must count at least $pq - q(q+1)/2$ free parameters. The representation provided by the dynamical system above includes $p(p-1)/2$ parameters; thus we conclude that it is a general form if $p(p-1) \geq 2pq - q(q+1)$. This is true as long as $p \geq q$.
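Theorem 1 can be checked numerically in the special case of constant $P$ and $s$, where the flow has the closed form $W(t) = \exp((P + sI)t)\,W(0)$: since $\exp(Pt)$ is orthogonal for skew-symmetric $P$, one expects $W^T(t)W(t) = e^{2st} w_0^2 I_q$. The matrix sizes and the hand-rolled matrix exponential below are illustrative choices.

```python
import numpy as np

def expm(M, terms=20, squarings=8):
    # Simple scaling-and-squaring Taylor approximation of the matrix
    # exponential (adequate here; a library routine would also do).
    M = M / (2.0 ** squarings)
    E = np.eye(M.shape[0])
    T = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        T = T @ M / k
        E = E + T
    for _ in range(squarings):
        E = E @ E
    return E

rng = np.random.default_rng(0)
p, q, w0 = 5, 3, 1.0

A = rng.standard_normal((p, p))
P = A - A.T                          # constant skew-symmetric rotation part
s = 0.2                              # constant stretching part

# Start on the Stiefel manifold: W(0)^T W(0) = w0^2 I_q.
W0, _ = np.linalg.qr(rng.standard_normal((p, q)))
W0 *= w0

t = 0.7
Wt = expm((P + s * np.eye(p)) * t) @ W0

# W(t)^T W(t) should equal e^{2 s t} w0^2 I_q, i.e. w(t)^2 I_q.
G = Wt.T @ Wt
print(np.allclose(G, np.exp(2 * s * t) * w0**2 * np.eye(q)))
```

The rotational part leaves the Gram matrix untouched while the scalar part rescales it, exactly as the decomposition in equation 2.8 predicts.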
In the Grassman case, $P$ and $\phi$ are determined uniquely from the structure of the rule $A_C(W)$; this is not true in the Stiefel case. On the other hand, $P$ and $\phi$ completely determine $A_C(W)$.

2.2 A First-Order SOC Learning Equation. A special family of SOC algorithms is that of the gradient-based ones, namely, those algorithms $A_C(W)$ of the form

$$\dot{W} = \pm\eta\nabla C, \quad t \in T, \ \eta > 0. \tag{2.9}$$

In order to ensure that the gradient-based learning algorithm, 2.9, belongs to $S$, criterion $C$ has to satisfy a condition derived from equations 2.7 and 2.6, that is:

$$(\nabla C)^T W + W^T (\nabla C) = \phi_C I_q, \tag{2.10}$$

$$\text{with } 2w\dot{w} = \phi_C. \tag{2.11}$$

The meaning of equations 2.10 and 2.11 is that a learning algorithm based on the gradient optimization of a criterion $C$ may be SOC only if $C$ satisfies condition 2.10. In this case $w(t)$ is determined by equations 2.10 and 2.11. It is important to remark that if function $C$ is such that equation 2.10 holds true, then rule 2.9 is effective either to maximize or to minimize $C$, provided that it is bounded at least on the Stiefel manifold. The first-order learning equation, 2.9, requires the gradient of the objective function $C$ to be equal to the right-hand side of equation 2.8, as $\dot{W}$ is given by the gradient. Obviously that is not true for any function, because the ordinary gradient does not work on a Riemannian manifold; furthermore, it is intuitively gathered that there exist SOC flows that are not integrable (i.e., are not given by gradients). Thus, more general algorithms than first-order ones can be considered.

2.3 A Second-Order SOC Learning Equation. In the first study we presented about SOC learning theory (Fiori & Piazza, 1998; Fiori et al., 1998), the following second-order nonlinear differential equation was considered for the neural layer, equation 2.1,

$$W^T \ddot{W} - \ddot{W}^T W = 2\,\frac{\sigma\rho}{w_0^2}\, W^T (H - H^T) W, \quad W(0) \in N_{w_0}^{p \times q}, \ t \in T, \tag{2.12}$$

where $\sigma$ and $\rho$ are positive constants and $H(t)$ is an arbitrarily variable $p \times p$ matrix that controls network learning. System 2.12 has been introduced because it encompasses many learning equations found in the related scientific literature. However, prior to continuing the analysis of the second-order equation, let us briefly investigate to what extent it may be considered a learning rule. The key issue is that it
Learning on Stiefel-Grassman Manifold
1631
shows uncontrollable dynamics: at a given network state W, its solution is not unique. To understand this behavior, let us rewrite equation 2.12 as

W^T X − X^T W = 2S,
(2.13)
where S ∈ R^{q×q} is a skew-symmetric matrix, dependent on W and H. If we look at system 2.13 as a linear matrix equation in the unknown X (= Ẅ) ∈ R^{p×q}, we see that it counts q(q − 1)/2 scalar equations, because of skew-symmetry, against a total of pq unknowns; thus, at a given network state, the network weight acceleration would not be unique. Furthermore, the constraint of orthonormality, if added, does not suffice to solve the puzzle, because it adds only q(q + 1)/2 scalar constraints, so the total number of equations would become q^2 ≤ pq. Ultimately, equation 2.12 (or 2.13) plus the constraint of orthonormality generates an unambiguous SOC learning flow only in the full-rank case (q = p). As a special case of system 2.12 that proves to be well defined, let us introduce the following coupled system (Fiori, 1996):

Ẇ = (σ/w_0^2) PW with W^T(0)W(0) = w_0^2 I_q,
(2.14)
Ṗ = ρ(H − H^T) with P^T(0) = −P(0).
(2.15)
Owing to the structure of the subsystem 2.15, for all t ≥ 0, the property P^T = −P holds true; in fact, integrating equation 2.15 yields:

P(t) = P(0) + ρ ∫_0^t [H(θ) − H^T(θ)] dθ,

P^T(t) = P^T(0) − ρ ∫_0^t [H(θ) − H^T(θ)] dθ;
therefore, P(t) + P^T(t) = P(0) + P^T(0) = 0. Subsystem 2.15 makes P skew-symmetric; thus theorem 1 applies to subsystem 2.14 and ensures that W(t) always belongs to the Stiefel manifold N_{w_0}^{p×q}. It is worth underscoring that equation 2.14 is less general than expression 2.8, yet it is preferred because of its stability.

It remains to show that the system 2.14 plus 2.15 is a restriction of 2.12. Differentiating equation 2.14 with respect to time yields Ẅ = (σ/w_0^2)Ṗ W + (σ/w_0^2)P Ẇ. Replacing equations 2.14 and 2.15 in the preceding formula gives

W^T Ẅ = (σρ/w_0^2) W^T (H − H^T) W + (σ^2/w_0^4) W^T P^2 W.
Then, subtracting (W^T Ẅ)^T from W^T Ẅ (note that P^2 is symmetric) leads to rule 2.12.
1632
Simone Fiori
To summarize, the following points are important:

• The second-order differential matrix equation, 2.12, may or may not describe a learning rule, as its solution is, in general, not unique.

• Instead, the system 2.14 plus 2.15 is well defined and describes an SOC flow; it is a special case of system 2.12.

• Formula 2.14 is less general than formula 2.8, but the former exhibits desirable stability properties; in fact, it may be readily shown that the norm of the column vectors of the weight matrix remains constant, that is, if W(0) ∈ N_{w_0}, then W(t) ∈ N_{w_0} for any t ∈ T.

• Equation 2.15 gives a way to specify an SOC learning direction other than that given by the gradient of the first-order equation; it may give the same direction as the gradient but can be more general.

The system 2.14 plus 2.15 is in equilibrium at t⋆ ∈ T if

H(t) is symmetric at t = t⋆, and P(t) = 0 at t = t⋆.
(2.16)
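As a numerical sanity check (an illustration of ours, not from the paper), the coupled system 2.14 plus 2.15 can be Euler-integrated with an arbitrary p × p control matrix H(t): P stays skew-symmetric by construction, and W remains on the Stiefel manifold exactly in continuous time, and up to discretization error here:

```python
# Sketch: Euler integration of the SOC system 2.14 (W' = (sigma/w0^2) P W)
# plus 2.15 (P' = rho (H - H^T)); H(t) is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
p, q, w0 = 5, 3, 1.0
sigma, rho, dt, steps = 1.0, 1.0, 1e-3, 2000

W = w0 * np.linalg.qr(rng.standard_normal((p, q)))[0]  # W(0)^T W(0) = w0^2 I_q
P = np.zeros((p, p))                                   # P(0) skew-symmetric

for _ in range(steps):
    H = rng.standard_normal((p, p))                    # arbitrary control H(t)
    P = P + dt * rho * (H - H.T)                       # eq. 2.15
    W = W + dt * (sigma / w0**2) * P @ W               # eq. 2.14

assert np.allclose(P, -P.T)                            # P remained skew
assert np.allclose(W.T @ W, w0**2 * np.eye(q), atol=1e-2)
```

The small residual in the last check is precisely the discrete-time drift discussed in the stability notes of section 2.4; shrinking dt reduces it.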
Unless the control matrix H(t) is specified, system 2.12 is completely general, and conditions 2.16 give a general rule for studying its stationary points.

2.4 Notes on Stiefel Manifold Stability. A key point in the design of a learning rule in S is its stability, that is, its ability to achieve the desired solution by preserving the prescribed constraints and keeping solutions bounded over time. Unfortunately, the stability of an SOC flow is not sufficient for an algorithm belonging to S to converge; thus, the stability of the algorithms must be studied case by case. As an exemplary discussion on this topic, let us recall the puzzle about broken duality between principal component analysis (PCA) and its counterpart, minor component analysis (MCA), as well as between principal and minor subspace analysis (PSA/MSA). The parameter spaces of PCA and MCA are Stiefel manifolds; PCA maximizes an objective function, while MCA minimizes the same function in the same space, and SOC rules exist for both PCA and MCA. However, there is no perfect dual correspondence in the algorithms for PCA and MCA, in the sense that it is not ensured that the minimization of an objective function, whose maximization leads to PCA, will work, as noted by many authors (see Chen et al., 1998; Douglas et al., 1998; Fiori & Piazza, 2000). The two most important causes of instability are the following:

1. While the continuous-time differential equation, 2.8, maintains the weight matrix in the space of the scaled orthonormal matrices, the corresponding discrete-time updating algorithms are subject to numerical perturbations and rounding-off errors; thus, they may suffer from marginal instability, which can be recovered by infrequent
column renormalization, but which may also be quite dangerous in practical implementations, for example, on digital signal processing (DSP) processors. This drawback is analyzed in detail by Chen et al. (1998), who illustrate the effects of a small deviation of the right-hand side in equation 2.6 from its expected value.

2. The stretching effect in equation 2.8 may lead an algorithm to find the right directions with uncontrollable norms. This phenomenon depends on the uncontrolled dynamics of w(t) in the manifold N_w^{p×q} described by the general differential equation, 2.7. This problem was described with examples in Fiori et al. (1998), where a modification of a PCA algorithm was proposed in order to take control of the stretching dynamics.

It is also worth noting that in the first-order SOC learning algorithm A_C previously discussed, the dynamics of w(t) is completely determined by the form of the objective function C; in the second-order SOC learning algorithms discussed, the problem is directly overcome by structuring the rules so that φ = 0 (hence ẇ = 0).

3 Relationships with Previous Contributions

In this section, relationships between the SOC theory and some related learning theories known from the scientific literature are discussed.

3.1 An SOC Principal Subspace Analyzer. Here we consider one of the learning rules by Xu (1994) and show that it is SOC. It allows the extraction of the principal subspace of assigned dimension from a stationary zero-mean multivariate random signal x(t) ∈ R^p with covariance matrix C_x ≝ E_x[xx^T] having distinct and positive eigenvalues. To this aim, Xu considered a linear neural network with p inputs and q outputs and weight matrix W ∈ R^{p×q}, described by y = W^T x. He proposed as an objective function the following criterion:

X(W) ≝ ϕ[det(E_x[yy^T])]/ψ[det(E_x[ỹỹ^T])] = ϕ[det(W^T C_x W)]/ψ[det(W^T W)],
(3.1)
where ϕ(v) and ψ(v) are positive, monotonically increasing functions defined for v > 0, and ỹ is the output of a companion network, sharing the same weight matrix, stimulated by a reference signal drawn from a zero-mean gaussian distribution with covariance matrix I_q. Xu (1994) proved that, under some conditions, the above function is maximized when the columns of W span the principal subspace of x of dimension q, that is, the subspace of R^p spanned by the q eigenvectors of C_x corresponding to its q largest eigenvalues; in other words, W performs a PSA of x. The gradient steepest ascent rule A_C(W) based on the above criterion is SOC.
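Condition 2.10 for Xu's criterion can be verified numerically by finite differences; in this sketch of ours, ϕ(v) = v and ψ(v) = v^2 are illustrative choices of positive, monotonically increasing functions (the paper does not prescribe them):

```python
# Check that W^T (grad X) + (grad X)^T W is a multiple of I_q for Xu's
# criterion 3.1, using a central-finite-difference gradient.
import numpy as np

rng = np.random.default_rng(1)
p, q = 4, 2
A = rng.standard_normal((p, p))
Cx = A @ A.T + p * np.eye(p)                 # a positive-definite covariance

def X(W):                                    # eq. 3.1 with phi(v)=v, psi(v)=v^2
    return np.linalg.det(W.T @ Cx @ W) / np.linalg.det(W.T @ W) ** 2

def num_grad(f, W, h=1e-6):                  # central finite differences
    G = np.zeros_like(W)
    for i in range(p):
        for j in range(q):
            E = np.zeros_like(W)
            E[i, j] = h
            G[i, j] = (f(W + E) - f(W - E)) / (2 * h)
    return G

W = np.linalg.qr(rng.standard_normal((p, q)))[0]   # any full-rank point works
G = num_grad(X, W)
M = W.T @ G + G.T @ W                        # left-hand side of condition 2.10
off = M - (np.trace(M) / q) * np.eye(q)      # deviation from a multiple of I_q
assert np.linalg.norm(off) < 1e-4 * (1.0 + np.linalg.norm(M))
```

Note that the check succeeds at a generic (not necessarily orthonormal) full-rank W, in agreement with the derivation below, which does not use W ∈ N.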
Theorem 2. Consider the system Ẇ = ∇X with initial condition W(0) ∈ N^{p×q} and X as in equation 3.1. Then for all t > 0, the flow W(t) ∈ N^{p×q}.

Proof.
Define the functions:
β(W) ≝ ψ[det(W^T W)] / (2 ϕ′[det(W^T C_x W)] · det(W^T C_x W)),

α(W) ≝ (ψ′ ϕ · det(W^T W)) / (ψ ϕ′ · det(W^T C_x W)),

with the shorthand ϕ = ϕ[det(W^T C_x W)], ψ = ψ[det(W^T W)], ϕ′ = ϕ′[det(W^T C_x W)], ψ′ = ψ′[det(W^T W)]. Then the gradient ∇X can be written as:

β(∇X) = C_x W(W^T C_x W)^{−1} − αW(W^T W)^{−1}.
(3.2)
Premultiplying the preceding equation by W^T yields βW^T(∇X) = (W^T C_x W)(W^T C_x W)^{−1} − α(W^T W)(W^T W)^{−1} = (1 − α)I_q. Hence, it turns out that the left-hand side of equation 2.10 satisfies

W^T(∇X) + (∇X)^T W = (2(1 − α)/β) I_q;
(3.3)
therefore, condition 2.10 holds, whereby the theorem follows.

The quantity w(t) is a real function of time, and its initial value obeys w^2(0) = (1/q) tr[W^T(0)W(0)]. Its evolution is governed by the function φ_X given by βφ_X = 2(1 − α) through equation 2.11. Xu's criterion X(W) is a marginally inhomogeneous cost function whose unconstrained gradient-based maximization leads to a weight flow on the Stiefel manifold (unless ϕ(v) and ψ(v) are linear: in this case X(W) is homogeneous).

3.2 Brockett's Isospectral Flow System Theory. Brockett (1991) discusses a (neural) dynamical system that provides a very useful solution to some optimization problems, such as linear programming problems where equality as well as inequality constraints are involved. Here we try to interpret his system within the SOC class. Brockett's system may be represented by means of the very elegant double Lie bracket form,

Ḃ = [B, [B, N]],
(3.4)
where [X, Y] ≝ XY − YX is the Lie bracket, N is a symmetric matrix in R^{p×p}, and B ∈ R^{p×p}. Hereafter, Brockett's system is referred to as the isospectral flow system, since as B evolves, its eigenvalues remain unchanged (Brockett, 1991).

The isospectral flow system has very interesting neural features (Brockett, 1991). For instance, under some mild conditions, it allows diagonalizing a symmetric matrix, and this behavior readily recalls PCA-type neural learning rules. Furthermore, it allows solving linear programming problems, for example, maximizing c^T x over x ∈ C, where C is a convex polytope of the form C = {x | Ax ≤ b}, which is an important problem studied extensively in the neural network context (see Cichocki and Unbehauen, 1993). In Brockett (1991) the problem of sorting (or rearranging) lists is also considered, and in this case the usefulness of the isospectral flow system has been proven too.

Now the aim is to show that the isospectral flow system is Stiefel, because it is closely related to equations 2.14 and 2.15. To start with, it should be noted that when W is square in equation 2.14, then equations 2.14 and 2.15 may be rewritten equivalently as

Ẇ = (σ/w_0^2) WP, Ṗ = ρ(H − H^T),
(3.5)
since a square orthonormal matrix W ∈ N^{p×p} satisfies both W^T W = w_0^2 I_p and WW^T = w_0^2 I_p, and Ẇ = (σ/w_0^2)WP (with P skew-symmetric) ensures that d(WW^T)/dt = 0. Defining a constant symmetric matrix Q ∈ R^{p×p} and performing the change B = W^T QW, the isospectral flow system recasts into

Ẇ = −W(NW^T QW − W^T QWN).
(3.6)
Clearly, system 3.6 is of the form 3.5, where a control matrix H is assumed such that

ρ ∫_0^t H(θ) dθ = −NW^T QW + S,

if ρσ = 1, w_0^2 = 1, P(0) = 0, and S denotes any symmetric matrix in R^{p×p}. Hence the isospectral flow is Stiefel.

3.3 Volume-Conserving Networks. Deco and Brauer (1995) and Parra et al. (1996) introduced the concept of volume-conserving neural networks. A volume-conserving "square" architecture performs a mapping y = N(x) between a set of input patterns and a set of outputs characterized by the property det(∂N/∂x) = 1. For a linear network with weight matrix W ∈ R^{n×n}, the condition above becomes det(W) = 1. The same condition ensures the bijectivity of the mapping. Volume conservation allows, for example, the
preservation of the entropy H_x when the input signal is mapped into the output signal, whose entropy is H_y; that is, H_x = H_y. In fact:

H_y = H_x + E_x[log |det(∂N/∂x)|].

An SOC learning rule for a linear neural network with weight matrix W ∈ N_1^{n×n} is always volume conserving; in fact, det(W) = ±1.

3.4 Relative Gradient and Equivariancy. The relative gradient rule by Cardoso and Laheld (1996) closely relates to the system 2.14 plus 2.15. Indeed, equation 26 in Cardoso and Laheld (1996) coincides with the discrete-time version of the system 2.14 plus 2.15, providing that

ρ ∫_0^t H(θ) dθ = Q ≝ E_x[g(y)y^T] + S
and P(0) = 0, w_0^2 = 1. Here S denotes any symmetric matrix in R^{p×p}, and g(·) is a suitable scalar function that acts component-wise. The same rule will be discussed in subsection 4.2.

A concept recalled in Cardoso and Laheld (1996) is equivariancy in feature estimation. Strictly speaking, an equivariant estimator R[·] from an observation X is such that R[KX] = KR[X] for all invertible K. That means the estimate does not depend on the quality of the observation, a very desirable property. Now let x(t) = As(t) be an observation of the real-valued signal s(t) described by the full-rank square measurement matrix A. Let us suppose we employ the system 2.14 plus 2.15 in order to estimate features of s(t) by means of the neural network y(t) = W(t)x(t). The following result holds:

Theorem 3. If the control matrix H depends only on the output y, then the system 2.14 plus 2.15 provides equivariant estimation.

Proof.
Postmultiplying equation 2.14 by A, the system recasts into
Ṫ = (σ/w_0^2) PT, Ṗ = ρ[H(Ts) − H^T(Ts)],
(3.7)
where T ≝ WA. Thus, the system's behavior does not depend explicitly on A. Changing A is equivalent to changing the starting point T(0) (and P(0)) of the algorithm. Note that unless A ∈ N_a, in general T ∉ N.
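Theorem 3 can be illustrated with a small simulation of ours: two runs of a discretized system 2.14 plus 2.15 with different (hypothetical) mixing matrices A, but the same starting point T(0) = W(0)A, produce the same trajectory in the T = WA coordinates, because the control H = tanh(y)y^T (an illustrative choice) depends only on y = Ts:

```python
# Equivariance sketch: the T = W A trajectory does not depend on the mixing A.
import numpy as np

rng = np.random.default_rng(2)
n, dt, steps = 3, 1e-2, 200
sigma, rho, w0 = 1.0, 1.0, 1.0

T0 = np.linalg.qr(rng.standard_normal((n, n)))[0]   # common starting point
A1 = rng.standard_normal((n, n))                    # two different full-rank
A2 = rng.standard_normal((n, n))                    # measurement matrices

def run(A, seed=3):
    rng_s = np.random.default_rng(seed)             # same source stream s(t)
    W = T0 @ np.linalg.inv(A)                       # so that W(0) A = T0
    P = np.zeros((n, n))
    for _ in range(steps):
        s = rng_s.standard_normal(n)
        y = W @ (A @ s)                             # network output y = W x
        H = np.outer(np.tanh(y), y)                 # control depends on y only
        P = P + dt * rho * (H - H.T)
        W = W + dt * (sigma / w0**2) * P @ W
    return W @ A                                    # final T = W A

assert np.allclose(run(A1), run(A2), atol=1e-6)     # equivariant estimation
```

Changing A changes only W(0), exactly as the proof states; the residual difference between the two runs is pure floating-point noise.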
3.5 Natural Gradient and Riemannian Metric. The ordinary gradient of an objective function gives the steepest direction only in the case of a Euclidean space. When the parameter space has its own geometrical structure, the ordinary gradient does not necessarily provide a tangent direction, that is, a direction constrained within the parameter space N. This may be explained by recognizing that the ordinary gradient is a covariant tensor, while an SOC direction is a contravariant tensor. Thus, we need a Riemannian metric to convert a covariant tensor into a contravariant tensor. This Riemannian gradient is called the natural gradient (Amari, 1998), which automatically satisfies equation 2.10. The problem is how to define an adequate Riemannian metric in the parameter space. In general, the Fisher information helps when a statistical structure is available (Amari, 1998). The Lie group structure also gives a Riemannian metric, as is done with the natural gradient used in independent component analysis, where it is easily defined, as shown by the Laheld-Cardoso algorithm (see Amari, 1998). (For more details about these concepts, refer to Edelman et al., 1998, where the geometry of the Stiefel-Grassman manifold is analyzed with special reference to optimization algorithms such as Newton and conjugate gradient iterations, and where important results are surveyed, such as the tangent spaces, the Riemannian metric and gradient, and the Riemannian geometry induced by the Lie group structure.)

4 Physical Description of Stiefel Learning

4.1 Mechanics-Type Control. In Fiori (1996), Fiori et al. (1997), Fiori et al.
(1998), and Fiori (1999), the present author showed how to interpret 2.14 and 2.15 as equations describing the dynamics of an abstract rigid system of masses moving in an abstract space under a potential energy field and a viscous braking effect, providing that

H = −(κ ∂U/∂W + μPW) W^T,  (4.1)

where U is the potential energy function of the mechanical system, that is, a function bounded below to be minimized under the restriction W ∈ N_{w_0}^{p×q}, and the constants are such that μ ≥ 0 and κ > 0. From a neural viewpoint, U represents a cost function, which drives the network learning.

It would be of interest to study the existence of special equilibrium points for the second-order system 2.14 plus 2.15 when the control matrix, equation 4.1, is used, in the cases where U = U(W) only. Note that system 2.12 is represented by the pair of coupled systems 2.14 and 2.15, whose state matrices P and W form a unique extended state Z ≝ (P, W).

Theorem 4. Consider the dynamical system 2.14 plus 2.15 where the control matrix is assumed as in equation 4.1, and define the matrix function F ≝ −κ ∂U/∂W.
A state Z⋆ = (0, W⋆) is a stationary point of the system if F⋆^T W⋆ is diagonal (F⋆ denotes F at W⋆).

Proof.
Let the following Lagrangian function be defined
C_ℓ(W) ≝ κU(W) + (1/2) tr[(W^T W − w_0^2 I)E],

where E is a symmetric matrix containing Lagrange multipliers. The Kuhn-Tucker condition for the existence of an (orthonormal) extreme of U reads −F + WE = 0. Also, w_0^2 E⋆ = W⋆^T F⋆; therefore, w_0^2 F⋆ = W⋆(W⋆^T F⋆). This equation implies w_0^2 (F⋆ W⋆^T − W⋆ F⋆^T) = W⋆(W⋆^T F⋆ − F⋆^T W⋆) W⋆^T. By hypothesis P⋆ = 0 and E⋆ = E⋆^T; thus, from equation 4.1, it follows that matrix H⋆ = F⋆ W⋆^T is symmetric. This proves that conditions 2.16 are fulfilled in Z⋆.

A feature of the system 2.14 plus 2.15 plus 4.1 is its asymptotic (Lyapunov) stability.

Theorem 5. Let U be a real-valued, bounded-below function of W ∈ N^{p×q}, with a global minimum in W⋆. Then the equilibrium state Z⋆ = (0, W⋆) of the system 2.14 plus 2.15 plus 4.1 is asymptotically stable.

Proof. Let us prove that there exists a Lyapunov function for the system. The system is equivalent to:

d^2 w_i/dt^2 = σρ(−κ ∂U/∂w_i − (μ/σ) v_i),  (4.2)

v_i ≝ σ dw_i/dt = σ P w_i, i = 1, . . . , q.  (4.3)
Premultiplying equation 4.2 by v_i^T, integrating with respect to time, and summing over the index i yields:

(1/σ) Σ_{i=1}^q ∫_{v_i(0)}^{v_i(t)} v_i^T dv_i = −ρμ Σ_{i=1}^q ∫_0^t v_i^T v_i dθ − σ^2 ρκ Σ_{i=1}^q ∫_{w_i(0)}^{w_i(t)} (∂U/∂w_i)^T dw_i.
Let us define the auxiliary functions V(t) ≝ σP(t)W(t) and K_σ(t) ≝ (1/2) tr[V^T(t)V(t)]. Then the relationship above rewrites as

K_σ(t) − K_σ(0) = −σ^3 ρκ[U(t) − U(0)] − σμρ ∫_0^t tr[V^T(θ)V(θ)] dθ.

(4.4)
By hypothesis, function U is bounded below and presents a global minimum U⋆ = U(W⋆). Hence, a function Y(t) defined as

Y(t) ≝ K_σ(t) + σ^3 ρκ[U(t) − U⋆]
(4.5)
is such that Y(t) ≥ 0, and the equality holds when P = 0 and U = U⋆, that is, when W = W⋆. Furthermore, from equations 4.5 and 4.4, function Y(t) satisfies

Y(t) = Y(0) − 2σμρ ∫_0^t K_σ(θ) dθ.
Since K_σ(t) ≥ 0, the time derivative of Y(t) enjoys the following property,

dY(t)/dt = −2σμρ K_σ(t) ≤ 0,
(4.6)
where the equality holds when P = 0. Thus, Y(t) is positive definite in Z⋆, and Ẏ(t) is negative definite; hence Z⋆ is an asymptotically stable equilibrium point.

This theorem proves that the system seeks the minimum of U(W), which is asymptotically stable for it. A similar result holds for −U. Also, each local extreme of U in N_{w_0}^{p×q} (if any) might be asymptotically stable. The dynamics of the system depends not only on W(0) but also on P(0); we have two degrees of freedom in the initial conditions, that is, the initial position (as in the first-order algorithms) and the initial speed. Aluffi-Pentini et al. (1985) found that this may be an advantage of the second-order algorithms in optimization.

4.2 Learning by Contrast-Semicontrast Optimization. An excellent theoretical instrument for developing a principal component analysis and an independent component analysis theory in connection with the SOC one is the definition of semicontrast and contrast functions specialized for orthonormal mixtures. Let these concepts be recalled from Moreau and Pesquet (1997). Say S_i is the set of independent random source vectors with n entries, and M_i is the set of random vectors x = Rs, where s ∈ S_i and
R ∈ N ≝ N_1^{n×n}. Denote by P the subset of N containing matrices R satisfying the independence property: R = DB, where D = diag(±1, . . . , ±1) and B is a permutation matrix. Then the concept of contrast function for orthonormal mixtures is defined as:

Definition 1 (Moreau & Pesquet, 1997). A contrast on M_i is a multivariate mapping Ψ(·) from the set M_i to R, which satisfies the following three requirements:

M1. ∀s ∈ M_i, ∀R ∈ P, Ψ(Rs) = Ψ(s)

M2. ∀s ∈ S_i, ∀R ∈ N, Ψ(Rs) ≤ Ψ(s)

M3. ∀s ∈ S_i, ∀R ∈ N, Ψ(Rs) = Ψ(s) ⇔ R ∈ P
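Requirement M2 can be checked empirically; in this sketch of ours, the kurtosis-based contrast Ψ(y) = −Σ_i kur(y_i) (appropriate for subgaussian sources, as recalled below) is evaluated on independent uniform sources, with a 45-degree rotation as an R ∉ P:

```python
# Empirical check of M2: an orthonormal mixing R not in P does not increase
# the kurtosis-based contrast of independent subgaussian sources.
import numpy as np

rng = np.random.default_rng(4)
n, N = 2, 50000
s = rng.uniform(-1.0, 1.0, size=(n, N))      # independent subgaussian sources

def psi(y):                                  # Psi(y) = -sum_i kur(y_i)
    m2 = np.mean(y**2, axis=1)
    m4 = np.mean(y**4, axis=1)
    return -np.sum(m4 / m2**2 - 3.0)

theta = np.pi / 4                            # a rotation that truly mixes
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.isclose(psi(s), 2.4, atol=0.1)     # uniform sources: kur = -1.2 each
assert psi(R @ s) < psi(s)                   # requirement M2 (strictly, here)
```

With the 45-degree rotation the contrast drops by roughly one half, and it is recovered only for R ∈ P, in line with M3.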
Among others, deep research on developing contrast functions with different features has been carried out by Comon (1994), Moreau and Macchi (1996), Moreau and Pesquet (1997), Comon and Moreau (1997), Bell and Sejnowski (1995), and Hyvärinen and Oja (1998).

Example (Bell & Sejnowski, 1995; Burel, 1992; Fiori et al., 1997; Yang & Amari, 1997). Blind signal separation by orthonormal independent component analysis (ICA⊥), or ICA by prewhitening (Comon, 1994), is a good source of examples. In ICA⊥, the neural network is described by y(t) = W^T(t)x(t), where x(t) contains the prewhitened observable mixture of n statistically independent source signals, and W is the n × n separating matrix, which should belong to N^{n×n}. It is worth recalling the contrast function, designed by Burel (1992) in a nonlinear blind source separation context, defined as

Ψ_y^γ ≝ ∫_{R^n} { [p_{1...n}(y_1, . . . , y_n) − p_1(y_1) · · · p_n(y_n)] ⋆ exp[−Σ_{i=1}^n γ_i^2 y_i^2] }^2 dy,

(4.7)
where p_{1...n}(y_1, . . . , y_n) is the joint probability density function (pdf) of the network output vector y, the p_i(y_i) are the marginal pdf's of the entries of y, the symbol ⋆ denotes convolution, and the γ_i's are arbitrary real positive coefficients. Another popular contrast function, very useful when the sources are all subgaussian or supergaussian (Comon & Moreau, 1997), is based on the unnormalized kurtosis: U ≝ ±(1/8) Σ_i E_x[y_i^4]. If we define g(y) = y^3 and suppose σ = ρ = 1, w_0^2 = 1, then the learning rule 2.14 plus 2.15 reads

Ẇ = PW, Ṗ = κW E_x[yg^T(y) − g(y)y^T] W^T − 2μP.
It may be conceived as a second-order version of the Laheld-Cardoso learning rule for orthonormal mixtures. Note that the role of the term −μP clearly emerges, as it can be thought of as a stabilizer. A well-known criterion for blind separation is the mutual entropy of the output of network 2.1,

H_y(W) = H_x + ln|det(W)| + F(W),
(4.8)
where F(·) is a properly defined matrix-to-scalar function (Bell & Sejnowski, 1995; Yang & Amari, 1997). It was originally defined for regular independent component analysis but can be used in ICA⊥ as well. This consideration is supported by experimental results (see Cichocki et al., 1999, and references therein), where convergence difficulties have been reported for non-prewhitening-based rules in the presence of several real-world signals. In the context of SOC mechanics-like learning, H_y(W) noticeably simplifies into H_y^{SOC}(W) = H_x + ln|w_0| + F(W). As H_y is to be maximized, the potential energy function can simply be defined as U(W) = −F(W). Note that in this framework, in order to achieve separation, the network 2.1 must be non-volume-conserving.

Regarding semicontrasts, say S_d is the set of uncorrelated random source vectors with n entries, and M_d is the set of random vectors x = Rs, where s ∈ S_d and R ∈ N. Then the concept of semicontrast is defined as follows:

Definition 2 (Moreau & Pesquet, 1997). A semicontrast on M_d is a multivariate mapping Φ(·) from the set M_d to R, which satisfies the following three requirements:

M1'. ∀s ∈ M_d, ∀R ∈ P, Φ(Rs) = Φ(s);

M2'. ∀s ∈ S_d, ∀R ∈ N, Φ(Rs) ≤ Φ(s);

M3'. ∀s ∈ S_d, ∀R ∈ N, Φ(Rs) = Φ(s) ⇔ Rs ∈ S_d.
Examples of the employment of semicontrasts in the context of SOC learning are given in the following.

Example (Fiori et al., 1997; Moreau & Pesquet, 1997). Suppose that a neural network described by y = W^T x is used for making random signals uncorrelated. An example of a semicontrast function evaluated on y is given in Moreau and Pesquet (1997): Φ_y^f ≝ Σ_{i=1}^n f(σ_{y_i}^2), where f(·) is a strictly convex function from R^+ to R, and the σ_{y_i}^2 are the variances of the network's outputs y_i. This function was already used by Watanabe (1965) with f(v) = v log(v).
Also, consider U(W) = −Σ_i E_x[y_i^2] = −tr(W^T C_x W). In this case we have
∇U = −2C_x W, F = 2κC_x W. Thus, F^T W = 2κW^T C_x W, which is diagonal if and only if W contains q eigenvectors of the covariance C_x. The fact that the rule seeks the weight matrix that minimizes U ensures that at convergence, W contains those eigenvectors corresponding to the largest eigenvalues; hence, the network learns the principal components of the signal x. If we approximate F by the sample mean, we obtain F ∼ 2κxy^T; thus, we say F is a Hebbian forcing term (Fiori et al., 1997).

It is worth noticing that the functions Φ and Ψ above depend uniquely on the matrix variable W, and requirements M1-M2-M3 (respectively, M1'-M2'-M3') ensure that they have a unique maximum (up to possible sign and permutation of the argument's columns). This example allows showing that in order to perform decorrelation, a function U(W) ∝ −Φ(W) should be used in equation 4.1, where Φ(·) is a semicontrast, while to perform orthonormal separation, we may assume U(W) ∝ +Ψ(W), where Ψ(·) is a contrast.

4.3 Relationships with Barlow's Sparse Coding Theory. A recent application of orthonormal representations that is closely related to independent component analysis is sparse coding for data denoising. Neural sparse coding is a technique that allows finding a neural network representation of multivariate data in which only a small number of neurons is active at the same time, causing a given neuron to be active only rarely (Barlow, 1994). Given a multivariate signal x corrupted by additive spherical gaussian noise, it can be denoised by thresholding the sparse components through a shrinkage function that sets to zero the response of a neuron with low activation, since this activation level is supposed to originate from noise only. Under the hypothesis of spherical additive gaussian noise, the coding linear neural network is represented by y = W^T x, where the weight matrix W is constrained to be orthonormal (Oja et al., 1999).
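The shrinkage-based denoising just described can be sketched as follows (an illustration of ours: a hard threshold stands in for the actual shrinkage function of Oja et al., 1999, and the sparse sources and orthonormal basis are synthetic):

```python
# Sparse-code shrinkage sketch: code with an orthonormal W, zero out low
# activations (presumed noise), and reconstruct.
import numpy as np

rng = np.random.default_rng(5)
p, N, noise = 8, 4000, 0.3

W = np.linalg.qr(rng.standard_normal((p, p)))[0]       # orthonormal code basis
y_true = rng.laplace(size=(p, N)) * (rng.random((p, N)) < 0.2)  # sparse codes
x_true = W @ y_true
x = x_true + noise * rng.standard_normal((p, N))       # spherical gaussian noise

y = W.T @ x                                            # code the noisy signal
y_shrunk = np.where(np.abs(y) > 3 * noise, y, 0.0)     # kill low activations
x_hat = W @ y_shrunk                                   # reconstruct

err_noisy = np.mean((x - x_true) ** 2)
err_denoised = np.mean((x_hat - x_true) ** 2)
assert err_denoised < err_noisy                        # thresholding helped
```

The scheme works because the orthonormal W leaves the noise spherical in the code domain, so a fixed threshold separates noise-only coefficients from true activations.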
As an objective function C(w_i) of any weight vector w_i, a measure of the sparseness of the network output signals is used. A widely known measure is based on the kurtosis:

U_1(w_i) = −kur(y_i) = −E_x[y_i^4]/E_x^2[y_i^2] + 3.
Another useful measure of sparseness, which is more robust with respect to outliers, has been reported in Oja et al. (1999). It leads to

U_2(w_i) = −E_x[log cosh(2y_i)]/E_x^2[y_i^2].
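As a quick numerical illustration of ours, the kurtosis-based measure U_1 is indeed smaller (hence preferred under minimization) for a sparse, supergaussian output than for a gaussian one of the same variance:

```python
# U1 = -kur(y): lower for sparse (supergaussian) outputs than for gaussian ones.
import numpy as np

rng = np.random.default_rng(6)
N = 100000

def U1(y):
    return -(np.mean(y**4) / np.mean(y**2) ** 2 - 3.0)

y_gauss = rng.standard_normal(N)          # kur ~ 0, so U1 ~ 0
y_sparse = rng.laplace(size=N)            # kur = 3, so U1 ~ -3

assert U1(y_sparse) < U1(y_gauss) - 1.0   # minimizing U1 favors sparseness
```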
Both functions have to be minimized with respect to the weight vectors under the constraint of orthonormality.

4.4 Learning via Geodesic Flow. Two excellent contributions have recently appeared in the literature concerning neural learning on the Stiefel manifold: the KUICNET rule by Douglas and Kung (1999) and the rule proposed by Nishimori (1999). They rely on the concept of a geodesic, that is, the shortest path on the manifold connecting two points. The KUICNET rule for the linear version of network 2.1 recasts from Douglas and Kung (1999) into

Ẇ = −(F_1 W^T − W F_1^T) W,
(4.9)
with F_1 = ∇J, where J is an inhomogeneous objective function to be maximized. This rule belongs to the SOC class.

A geodesic of a smooth manifold is intuitively looked on as the pathway followed by a point particle sliding on it with constant speed; the mechanical-like learning system of section 4.1 is somewhat more general than KUICNET, as it takes into account at any time the forces that act on the point of coordinates W. This may be illustrated by means of the following informal reasoning. Let us say we "turn off" all forces, except at equally spaced instants, in the mechanical system,

Ẇ = (σ/w_0^2) PW, Ṗ = ρ(H − H^T), H = (F − (μ/σ)V) W^T, t ∈ T,

where the notation reflects the definitions of theorem 5. Let us divide the time interval T into a set of consecutive time intervals T_i and consider, for simplicity, T_0 = [t_0^−, t_0^+] centered around t = 0. Turning off the forces is explained here by writing F(t) = F̄δ(t) and V(t) = V̄δ(t) in T_0, where δ(t) is the Dirac delta. Then in T_0 we have H(t) = (F̄ − (μ/σ)V̄) W^T(t) δ(t); thus:

P(t) = ρ[(F̄ − (μ/σ)V̄) W̄^T − W̄(F̄^T − (μ/σ)V̄^T)] + P̄,

with W̄ ≝ W(0) and P̄ ≝ P(0). Considering that for any T_i the system has P̄ = 0, and that the braking effect of the viscous fluid can be neglected, which simply means taking μ = 0, with these hypotheses we ultimately have

Ẇ = (ρσ/w_0^2)(F̄ W̄^T − W̄ F̄^T) W, t ∈ T_0.
(4.10)
Clearly, the more |T_0| = t_0^+ − t_0^− approaches zero, the more system 4.10 approaches system 4.9. Note the difference in sign, due to the fact that J is to be maximized while U, which generates F, is to be minimized.
Nishimori's learning theory is interesting for its very elegant derivation, based on concepts from differential geometry. In fact, the rule given by Nishimori (1999) for a discrete-time linear network is

W(t + 1) = e^{−ηX(t)} W(t), t ∈ Z,
(4.11)
where X is a skew-symmetric matrix and η is a small positive constant. This expression, which had already been introduced in Fiori (1996) and Fiori et al. (1997), represents a geodesic flow on the Stiefel manifold; this also means that it ensures W(t + 1) ∈ N, providing that W(t) ∈ N. This property can be proven by noting that (e^{−ηX})^T = e^{ηX}. The matrix e^{−ηX} can be computed either by the truncated series e^{−ηX} = Σ_{k=0}^r (−ηX)^k/k! with r ∈ Z^+ (Fiori et al., 1997), or by the eigenvalue decomposition X = U^T ΛU, with e^{−ηX} = U^T e^{−ηΛ} U, where e^{−ηΛ} has a special (simpler) form (Nishimori, 1999). Matrix X has the expression

X = [(∇U)W^T − W(∇U)^T]/2.
(4.12)
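The exactness of the geodesic update can be checked numerically; in this sketch of ours, the gradient ∇U is replaced by a random placeholder matrix, since only the skew-symmetry of X in equation 4.12 matters for remaining on N:

```python
# Geodesic update 4.11 with X from 4.12: W stays orthonormal to machine
# precision, using the truncated-series computation of exp(-eta X).
import numpy as np

rng = np.random.default_rng(7)
p, q, eta, r = 5, 3, 0.05, 20

def expm_series(M, r):                    # exp(M) ~= sum_{k=0}^r M^k / k!
    E = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, r + 1):
        term = term @ M / k
        E = E + term
    return E

W = np.linalg.qr(rng.standard_normal((p, q)))[0]
for _ in range(100):
    G = rng.standard_normal((p, q))       # placeholder for the gradient of U
    X = (G @ W.T - W @ G.T) / 2           # eq. 4.12: skew-symmetric
    W = expm_series(-eta * X, r) @ W      # eq. 4.11: move along a geodesic

assert np.allclose(W.T @ W, np.eye(q), atol=1e-10)
```

Contrast this with the Euler-discretized flows of section 2.4, where orthonormality drifts at a rate proportional to the step size; here the multiplicative orthogonal update confines the error to round-off.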
It is worth noticing that the geodesic flow 4.11 is the solution of the continuous-time equation Ẇ(τ) = −X̄W(τ) within the interval τ ∈ [ηt, η(t + 1)[, with X̄ ≝ X(ηt).

5 Conclusion

The present author (Fiori, 1996) proposed the earliest version of the concept of neural learning on the Stiefel-Grassman manifold, which was further elaborated in Fiori and Piazza (1998), Fiori et al. (1998), and Fiori (1999). Several excellent contributions about this topic have recently appeared in the scientific literature, even if a global study of its general properties is still lacking. This article aimed to give a homogeneous presentation of these arguments, including general theorems about equilibrium and stability. The results of this research have been given in terms of original contributions to the general theory and by illustrating the close relationships among the issues of different research that has appeared in the literature in related fields.

Acknowledgments

This work was partially supported by the Italian MURST. The author wishes to gratefully thank one of the anonymous reviewers, whose careful and detailed suggestions helped improve the clarity and thoroughness of the article.
References

Affes, S., & Grenier, Y. (1995). A signal subspace tracking algorithm for speech acquisition and noise reduction with a microphone array. In Proceedings of the IEEE/IEE Workshop on Signal Processing Methods in Multipath Environments (pp. 64–73).

Aluffi-Pentini, C., Parisi, V., & Zirilli, F. (1985). Global optimization and stochastic differential equations. Journal of Optimization Theory and Applications, 47, 1–16.

Amari, S.-i. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.

Amari, S.-i. (1999). Natural gradient learning for over- and under-complete bases in ICA. Neural Computation, 11, 1875–1883.

Barlow, H. B. (1994). What is the computational goal of the neocortex? In C. Koch & J. L. Davis (Eds.), Large-scale neuronal theories of the brain. Cambridge, MA: MIT Press.

Bell, A. J., & Sejnowski, T. J. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.

Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. Orlando, FL: Academic Press.

Brockett, R. W. (1991). Dynamical systems that sort lists, diagonalize matrices and solve linear programming problems. Linear Algebra and Its Applications, 146, 79–91.

Burel, G. (1992). Blind separation of sources: A nonlinear neural algorithm. Neural Networks, 5, 937–947.

Cardoso, J. F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12), 3017–3030.

Chen, T. P., Amari, S.-i., & Lin, Q. (1998). A unified algorithm for principal and minor components extraction. Neural Networks, 11(3), 385–390.

Cichocki, A., Karhunen, J., Kasprzak, W., & Vigario, R. (1999). Neural networks for blind separation with unknown number of sources. Neurocomputing, 24, 55–93.

Cichocki, A., & Unbehauen, R. (1993). Neural networks for optimization and signal processing. New York: Wiley.

Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.

Comon, P., & Moreau, E. (1997). Improved contrast dedicated to blind separation in communications. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (pp. 3453–3456).

Deco, G., & Brauer, W. (1995). Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Networks, 8(4), 525–535.

Douglas, S. C., & Kung, S.-Y. (1999). An ordered-rotation KUICNET algorithm for separating arbitrarily-distributed sources. In Proceedings of the International Conference on Independent Component Analysis (pp. 81–86).

Douglas, S. C., Kung, S.-Y., & Amari, S.-i. (1998). A self-stabilized minor subspace rule. IEEE Signal Processing Letters, 5(12), 328–330.
1646
Simone Fiori
Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis Applications, 20(2), 303–353. Ephraim, Y., & Van Trees, L. (1995). A signal subspace approach for speech enhancement. IEEE Transactions on Speech and Audio Processing, 3(4), 251–266. Fiori, S. (1996). Unsupervised neural artificial learning models for blind signal processing. Unpublished M.Eng. dissertation, University of Ancona, Italy. Fiori, S. (1999). “Mechanical” neural learning applied to blind source separation. Electronics Letters, 35(22), 1963–1964. Fiori, S., & Piazza, F. (1998). Orthonormal strongly-constrained neural learning. In Proceedings of the International Joint Conference on Neural Networks (pp. 2332– 2337). Fiori, S., & Piazza, F. (2000). Neural MCA for robust beamforming. In Proceedings of the International Symposium on Circuits and Systems (Vol. 3, pp. 614–617). Geneva, Switzerland. Fiori, S., Uncini, A., & Piazza, F. (1997). Mechanical learning applied to principal component analysis and source separation. In Proceedings of the International Conference on Artificial Neural Networks (pp. 571–577). Berlin: Springer-Verlag. Fiori, S., Uncini, A., & Piazza, F. (1998). Neural learning and weight flow on Stiefel manifold. In Proceedings of the Tenth Italian Workshop on Neural Networks (pp. 325–333). Berlin: Springer-Verlag. Fyfe, C. (1995). A general exploratory projection pursuit network. Neural Processing Letters, 2(3), 17–19. Gallot, S., Hulin, D., & Lafontaine, J. (1991). Differential geometry. Berlin: SpringerVerlag. Grassman, H. G. (1862). Die Ausdehnungslehre. Berlin: Enslin. Helgason, S. (1978). Differential geometry, Lie groups, and symmetric spaces. Orlando, FL: Academic Press. Helmke, U., & Moore, J. B. (1993). Optimization and dynamical systems. Berlin: Springer-Verlag. Hyv¨arinen, A., & Oja, E. (1998). Independent component analysis by general non-linear Hebbian-like rules. 
Signal Processing, 64(3), 301–313. Karhunen, J. (1996). Neural approaches to independent component analysis and source separation. In Fourth European Symposium on Artificial Neural Networks (pp. 249–266). Karhunen, J., & Joutsensalo, J. (1993). Learning of robust principal component subspace. In Proceedings of the International Joint Conference on Neural Networks (pp. 2409–2412). Kreutz-Delgado, K., & Rao, B. D. (1999). Sparse basis selection, ICA, and majorization: Towards a unified perspective. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2 (pp. 1344–1347). MacInnes, C. S., & Vaccaro, R. J. (1996). Tracking direction-of-arrival with invariant subspace updating. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (pp. 2896–2899). Moreau, E., & Macchi, O. (1996). Higher order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10(1), 19–46.
Learning on Stiefel-Grassman Manifold
1647
Moreau, E., & Pesquet, J. C. (1997). Independence/decorrelation measures with application to optimized orthonormal representations. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (pp. 3425– 3428). Nishimori, Y. (1999). Learning algorithm for ICA by geodesic flows on orthogonal group. In Proceedings of the International Joint Conference on Neural Networks. Washington, D.C. Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61–68. Oja, E., Hyv¨arinen, A., & Hoyer, P. (1999). Image feature extraction and denoising by sparse coding. Pattern Analysis and Applications Journal, 2(2), 104–110. Orfanidis, S. J. (1990). Gram-Schmidt neural nets. Neural Computation, 2, 116–126. Palmieri, F., Zhu, J., & Chang, C. (1993). Anti-Hebbian learning in topologically constrained linear networks: A tutorial. IEEE Transactions on Neural Networks, 4(5), 748–761. Paraschiv-Ionescu, A., Jutten, C., & Bouvier, G. (1997). Neural network based processing for smart sensor arrays. In Proceedings of the International Conference on Artificial Neural Networks (pp. 565–570). Parra, L., Deco, G., & Miesbach, S. (1996). Statistical independence and novelty detection with information preserving non-linear maps. Neural Computation, 8, 260–269. Sona, D., Sperduti, A., & Starita, A. (2000). Discriminant pattern recognition using transformation invariant neurons. Neural Computation, 12(6). Stiefel, E. (1935–1936). Richtungsfelder und fernparallelismus in n-dimensionalem manning faltigkeiten. Commentarii Math. Helvetici, 8, 305–353. Watanabe, S. (1965). Karhunen-Lo`eve expansion and factor analysis: Theoretical remarks and applications. In Transactions of the 4th Prague Conference on Information Theory, Statistical Decision Functions Random Processes (pp. 635–660). Publishing House of the Czechoslovak Academy of Sciences. Xu, L. (1994). 
Theories for unsupervised learning: PCA and its nonlinear extension. In Proceedings of the International Joint Conference on Neural Networks (pp. 1252–1257). Xu, L., Oja, E., & Suen, C. Y. (1992). Modified Hebbian learning for curve and surface fitting. Neural Networks, 5, 441–457. Yang, B. (1995). Projection approximation subspace tracking. IEEE Transactions on Signal Processing, 43, 1247–1252. Yang, H. H., & Amari, S.-i. (1997). Adaptive online learning algorithms for blind separation: Maximum entropy and minimal mutual information. Neural Computation, 9, 1457–1482.
Received September 7, 1999; accepted August 24, 2000.
LETTER

Communicated by Hagai Attias

Online Model Selection Based on the Variational Bayes

Masa-aki Sato
Information Sciences Division, ATR International, and CREST, Japan Science and Technology Corporation, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan

The Bayesian framework provides a principled way of model selection. This framework estimates a probability distribution over an ensemble of models, and the prediction is done by averaging over the ensemble of models. Accordingly, the uncertainty of the models is taken into account, and complex models with more degrees of freedom are penalized. However, the integration over model parameters is often intractable, and some approximation scheme is needed. Recently, a powerful approximation scheme, called the variational Bayes (VB) method, has been proposed. This approach defines the free energy for a trial probability distribution, which approximates a joint posterior probability distribution over model parameters and hidden variables. The exact maximization of the free energy gives the true posterior distribution. The VB method uses factorized trial distributions. The integration over model parameters can then be done analytically, and an iterative, expectation-maximization-like algorithm, whose convergence is guaranteed, is derived. In this article, we derive an online version of the VB algorithm and prove its convergence by showing that it is a stochastic approximation for finding the maximum of the free energy. By combining sequential model selection procedures, the online VB method provides a fully online learning method with a model selection mechanism. In preliminary experiments using synthetic data, the online VB method was able to adapt the model structure to dynamic environments.

1 Introduction

The learning of model parameters from observed data can be accomplished by using the maximum likelihood (ML) method for probabilistic models (Bishop, 1995). The expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) provides a general framework for calculating the ML estimator for models with hidden variables.
Neural Computation 13, 1649–1681 (2001) © 2001 Massachusetts Institute of Technology

The fundamental problems of the ML method are overfitting and the inability to account for model complexity; as a result, it cannot determine the model structure. The Bayesian framework overcomes these problems in principle (Bishop, 1995; Cooper & Herskovitz, 1992; Gelman, Carlin, Stern, & Rubin, 1995;
Heckerman, Geiger, & Chickering, 1995; Mackay, 1992a, 1992b). The Bayesian method estimates a probability distribution over an ensemble of models, and the prediction is done by averaging over the ensemble of models. Accordingly, the uncertainty of the models is taken into account, and complex models with more degrees of freedom are penalized. The evidence, which is the marginal posterior probability given the data, gives a criterion for the model selection (Mackay, 1992a, 1992b). However, an integration over model parameters is often intractable, and some approximation scheme is needed (Chickering & Heckerman, 1997; Mackay, 1999; Neal, 1996; Richardson & Green, 1997; Roberts, Husmeier, Rezek, & Penny, 1998). Markov chain Monte Carlo (MCMC) methods and the Laplace approximation method have been developed to date. MCMC methods can, in principle, find exact results, but they require a huge amount of time for computation. In addition, it is difficult to determine when these algorithms converge. The Laplace approximation method makes a local gaussian approximation around a maximum a posteriori parameter estimate. This approximation is valid only for a large sample limit. Unfortunately, it is not suited to parameters with constraints such as mixing proportions of mixture models. Recently, an alternative approach, variational Bayes (VB), has been proposed (Waterhouse, Mackay, & Robinson, 1996; Attias, 1999, 2000; Ghahramani & Beal, 2000; Bishop, 1999). This approach defines the free energy for a trial probability distribution, which approximates a joint posterior probability distribution over model parameters and hidden variables. The maximum of the free energy gives the log evidence for an observed data set. Therefore, the exact maximization of the free energy gives the true posterior distribution over the parameters and the hidden variables. 
The VB method uses trial distributions in a restricted space where the parameters are assumed to be conditionally independent of the hidden variables. Once this approximation is made, the remaining calculations are all done exactly. As a result, an iterative EM-like algorithm, whose convergence is guaranteed, is derived. The predictive distribution is also calculated analytically. The VB method has several attractive features. It requires only a modest amount of computational time, comparable to that of the EM algorithm. The Bayesian information criterion (BIC) (Schwarz, 1978) and the minimum description length (MDL) criterion (Rissanen, 1987) for model selection are obtained from the VB method in the large sample limit (Attias, 1999); in this limit, the VB algorithm becomes equivalent to the ordinary EM algorithm. The VB method can also be easily extended to the hierarchical Bayes method. Sequential model selection procedures (Ghahramani & Beal, 2000; Ueda, 1999) have been proposed by combining the VB method and the split-and-merge algorithm (Ueda, Nakano, Ghahramani, & Hinton, 1999).

In this article, we derive an online version of the VB algorithm and prove its convergence by showing that it is a stochastic approximation for finding the maximum of the free energy. We also prove that the VB algorithm is a gradient method with the inverse of the Fisher information matrix for
the posterior parameter distribution as a coefficient matrix. Namely, the VB method is a type of natural gradient method (Amari, 1998). By combining sequential model selection procedures, the online VB algorithm provides a fully online learning method with a model selection mechanism, and it can be applied to real-time applications. In preliminary experiments using synthetic data, the online VB method was able to adapt the model structure to dynamic environments. We also found that the introduction of a discount factor was crucial for fast convergence of the online VB method.

We study the VB method for general exponential family models with hidden variables (EFH models) (Amari, 1985), although the VB method can be applied to more general graphical models (Attias, 1999). The use of EFH models makes the calculations transparent. Moreover, EFH models include many models of interest, such as normalized gaussian networks (Sato & Ishii, 2000), hidden Markov models (Rabiner, 1989), mixtures of gaussians (Roberts, Husmeier, Rezek, & Penny, 1998), mixtures of factor analyzers (Ghahramani & Beal, 2000), mixtures of probabilistic principal component analyzers (Tipping & Bishop, 1999), and others (Roweis & Ghahramani, 1999; Titterington, Smith, & Makov, 1985).

2 Variational Bayes Method

In this section, we review the VB method (Waterhouse et al., 1996; Attias, 1999, 2000; Ghahramani & Beal, 2000; Bishop, 1999) for EFH models (Amari, 1985).

2.1 Bayesian Method. In the maximum likelihood (ML) approach, the objective is to find the ML estimator that maximizes the likelihood for a given data set. The ML approach, however, suffers from overfitting and cannot determine the best model structure. In the Bayesian method (Bishop, 1995; Mackay, 1992a, 1992b), a set of models {M_m | m = 1, ..., N}, where M_m denotes a model with a fixed structure, is considered.
A probability distribution for a model M_m with model parameter θ_m is denoted by P(x | θ_m, M_m), where x represents an observed stochastic variable. For a set of observed data, X{T} = {x(t) | t = 1, ..., T}, and a prior probability distribution P(θ_m | M_m), the Bayesian method calculates the posterior probability over the parameter,

P(θ_m | X{T}, M_m) = P(X{T} | θ_m, M_m) P(θ_m | M_m) / P(X{T} | M_m).

Here, the data likelihood is defined by

P(X{T} | θ_m, M_m) = ∏_{t=1}^{T} P(x(t) | θ_m, M_m),
and the evidence for a model M_m is defined by

P(X{T} | M_m) = ∫ dµ(θ_m) P(X{T} | θ_m, M_m) P(θ_m | M_m).

The posterior probability over the model structure is given by

P(M_m | X{T}) = P(X{T} | M_m) P(M_m) / Σ_{m′} P(X{T} | M_{m′}) P(M_{m′}).
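Concretely, with a uniform prior over models the posterior P(M_m | X{T}) is a softmax of the per-model log evidences. A minimal sketch (the function name and the log-evidence values are illustrative, not from the article):

```python
import math

def model_posterior(log_evidences, log_priors=None):
    """Posterior P(M_m | X) from per-model log evidences log P(X | M_m).

    With a uniform prior, the posterior reduces to a softmax of the
    log evidences; the log-sum-exp trick keeps the computation stable.
    """
    n = len(log_evidences)
    if log_priors is None:
        log_priors = [math.log(1.0 / n)] * n  # noninformative prior P(M_m) = 1/N
    joint = [le + lp for le, lp in zip(log_evidences, log_priors)]
    m = max(joint)
    z = m + math.log(sum(math.exp(j - m) for j in joint))
    return [math.exp(j - z) for j in joint]

# Hypothetical log evidences for three candidate model structures.
post = model_posterior([-1040.2, -1035.7, -1036.9])
print(post)  # the structure with the largest evidence dominates
```

Because the evidences enter only through their differences, adding a constant to all log evidences leaves the model posterior unchanged.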
If we use a noninformative prior for the model structure, that is, P(M_m) = 1/N, the posterior for a given model, P(M_m | X{T}), is proportional to the evidence of the model, P(X{T} | M_m). Therefore, model selection can be done according to the evidence. In the following, we calculate the evidence for a given model by using the VB method, and the explicit model-structure dependence is omitted.

2.2 Exponential Family Model with Hidden Variables. An EFH model for an N-dimensional vector variable x = (x_1, ..., x_N)^T is defined by a probability distribution

P(x | θ) = ∫ dµ(z) P(x, z | θ),
P(x, z | θ) = exp[r(x, z) · θ + r_0(x, z) − ψ(θ)],   (2.1)

where z = (z_1, ..., z_M)^T denotes an M-dimensional vector hidden variable and θ = (θ_1, ..., θ_K)^T denotes a set of model parameters called the natural parameter.¹ A set of sufficient statistics is denoted by r(x, z) = (r_1(x, z), ..., r_K(x, z))^T. The inner product of two vectors r and θ is denoted by r · θ = Σ_{k=1}^{K} r_k θ_k. Measures on the observed and hidden variable spaces are denoted by dµ(x) and dµ(z), respectively. The normalization factor ψ(θ) is determined by

exp[ψ(θ)] = ∫ dµ(x) dµ(z) exp[r(x, z) · θ + r_0(x, z)],   (2.2)

which is derived from the probability condition ∫ dµ(x) P(x | θ) = 1. P(x, z | θ) represents the probability distribution for a complete event (x, z). If the hidden variable is discrete, the integration ∫ dµ(z) is replaced by the summation Σ_z. The expectation parameter for the EFH model, φ = (φ_1, ..., φ_K)^T, is defined by

φ = ∂ψ(θ)/∂θ = (∂ψ/∂θ_1, ..., ∂ψ/∂θ_K)^T
  = E[r(x, z) | θ] = ∫ dµ(x) dµ(z) r(x, z) P(x, z | θ).   (2.3)
¹ In general, the natural parameter is a function of another model parameter ϕ: θ = θ(ϕ). The following discussion also applies in this case.
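As a concrete instance (not from the article), a Bernoulli variable fits the EFH form with K = 1 and a trivial hidden part: r(x) = x, r_0 = 0, and ψ(θ) = log(1 + e^θ). One can then check numerically that the expectation parameter φ = ∂ψ/∂θ of equation 2.3 equals E[r(x) | θ]:

```python
import math

def psi(theta):
    # log-normalizer of the Bernoulli model: sum over x in {0, 1} of exp(x*theta)
    return math.log(1.0 + math.exp(theta))

theta = 0.7
# phi = dpsi/dtheta, computed here by a central finite difference
h = 1e-6
phi = (psi(theta + h) - psi(theta - h)) / (2.0 * h)
# E[r(x) | theta] = P(x = 1 | theta) = sigmoid(theta)
expected_r = math.exp(theta) / (1.0 + math.exp(theta))
print(phi, expected_r)  # the two quantities agree
```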
2.3 Evidence and Free Energy. The evidence for a data set X{T} is defined by

P(X{T}) = ∫ dµ(θ) P(X{T} | θ) P_0(θ),   (2.4)

where dµ(θ) denotes a measure on the model parameter space and P_0(θ) denotes a prior distribution for the model parameters. The integration over model parameters in equation 2.4 penalizes complex models with more degrees of freedom (Bishop, 1995). This integration, however, is often difficult to perform. In order to evaluate it, the VB method introduces a trial probability distribution Q(θ, Z{T}), which approximates the posterior probability distribution over the model parameter θ and the hidden variables Z{T} = {z(t) | t = 1, ..., T}:

P(θ, Z{T} | X{T}) = P(X{T}, Z{T} | θ) P_0(θ) / P(X{T}),   (2.5)

where the probability distribution for a complete data set (X{T}, Z{T}) is given by

P(X{T}, Z{T} | θ) = ∏_{t=1}^{T} P(x(t), z(t) | θ).   (2.6)

The Kullback-Leibler (KL) divergence between the trial distribution Q(θ, Z{T}) and the true posterior distribution P(θ, Z{T} | X{T}) is given by

KL(Q ∥ P) = ∫ dµ(θ) dµ(Z{T}) Q(θ, Z{T}) log( Q(θ, Z{T}) / P(θ, Z{T} | X{T}) )
          = log P(X{T}) − F(X{T}, Q),   (2.7)

where the free energy F(X{T}, Q) is defined by

F(X{T}, Q) = ∫ dµ(θ) dµ(Z{T}) Q(θ, Z{T}) log( P(X{T}, Z{T} | θ) P_0(θ) / Q(θ, Z{T}) ).   (2.8)

Therefore, the true posterior distribution is obtained by maximizing the free energy with respect to the trial distribution Q(θ, Z{T}). The maximum of the free energy is equal to the log evidence:

log P(X{T}) = max_Q F(X{T}, Q) ≥ F(X{T}, Q).   (2.9)

Equation 2.9 implies that a lower bound on the log evidence can be evaluated by using some trial posterior distribution Q(θ, Z{T}).
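The identity 2.7 and the bound 2.9 can be checked numerically on a toy model with a discrete parameter and no hidden variables (an illustration, not the article's setting): for any trial distribution Q, log P(X) = F(X, Q) + KL(Q ∥ posterior), so F is always a lower bound on the log evidence.

```python
import math

# Toy model: theta takes two values; x ~ Bernoulli(theta); flat prior.
thetas = [0.3, 0.8]
prior = [0.5, 0.5]
data = [1, 1, 0, 1]

def loglik(th):
    return sum(math.log(th if x == 1 else 1.0 - th) for x in data)

# Exact evidence and exact posterior over theta.
joint = [math.exp(loglik(th)) * p for th, p in zip(thetas, prior)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]

# Free energy (eq. 2.8, no hidden z) for an arbitrary trial distribution Q.
Q = [0.6, 0.4]
F = sum(q * (math.log(j) - math.log(q)) for q, j in zip(Q, joint))
KL = sum(q * math.log(q / p) for q, p in zip(Q, posterior))

print(F + KL, math.log(evidence))  # identical: log evidence = F + KL (eq. 2.7)
assert F <= math.log(evidence)     # F is a lower bound (eq. 2.9)
```

Maximizing F over Q drives KL to zero, which recovers the exact posterior and makes the bound tight.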
2.4 Variational Bayes Algorithm. In the VB method, trial posterior distributions are assumed to factorize as

Q(θ, Z{T}) = Q_θ(θ) Q_z(Z{T}).   (2.10)

The hidden variables are assumed to be conditionally independent of the model parameters given the data. We also assume that the prior distribution P_0(θ) is given by the conjugate prior distribution² for the EFH model, equation 2.1,

P_0(θ) = exp[γ_0 (α_0 · θ − ψ(θ)) − Φ(α_0, γ_0)],   (2.11)

where (α_0, γ_0) are prior hyperparameters. The normalization factor Φ(α_0, γ_0) is determined by

exp[Φ(α_0, γ_0)] = ∫ dµ(θ) exp[γ_0 (α_0 · θ − ψ(θ))].   (2.12)

² It is also possible to use noninformative priors (Attias, 1999).

Equations 2.10 and 2.11 are the only assumptions in this method. Under these assumptions, we try to maximize the free energy F(X{T}, Q). The maximum free energy with respect to the factorized Q, equation 2.10, gives an estimate (lower bound) of the log evidence log P(X{T}). The free energy can be maximized by alternately maximizing it with respect to Q_θ and Q_z. This process closely resembles the free energy formulation of the EM algorithm (Neal & Hinton, 1998) for finding the ML estimator.

In the VB expectation step (E-step), the free energy is maximized with respect to Q_z(Z{T}) under the condition ∫ dµ(Z{T}) Q_z(Z{T}) = 1, while Q_θ(θ) is fixed. The maximum solution is given by the posterior distribution for the hidden variables with the ensemble average of the model parameters (see appendix A):

Q_z(Z{T}) = ∏_{t=1}^{T} Q_z(z(t)),   (2.13)
Q_z(z(t)) = P(z(t) | x(t), θ̄) = P(x(t), z(t) | θ̄) / P(x(t) | θ̄),   (2.14)
θ̄ = ∫ dµ(θ) Q_θ(θ) θ.   (2.15)

In the VB maximization step (M-step), the free energy is maximized with respect to Q_θ(θ) under the condition ∫ dµ(θ) Q_θ(θ) = 1, while Q_z(Z{T}) obtained in the VB E-step is fixed. The maximum solution is given by the
conjugate distribution for the EFH model with posterior hyperparameters (α, γ) (see appendix A):

Q_θ(θ) = P_α(θ | α, γ) = exp[γ (α · θ − ψ(θ)) − Φ(α, γ)],   (2.16)
γ = T + γ_0,   (2.17)
α = (1/γ) [ T ⟨r(x, z)⟩_θ̄ + γ_0 α_0 ],   (2.18)
⟨r(x, z)⟩_θ̄ = (1/T) Σ_{t=1}^{T} ∫ dµ(z(t)) P(z(t) | x(t), θ̄) r(x(t), z(t)).   (2.19)

The effective amount of data, γ = T + γ_0, represents the reliability (or uncertainty) of the estimation. As the amount of data T increases, the reliability of the estimation increases. The prior hyperparameter γ_0 represents the strength of the prior belief in the prior hyperparameter α_0. The posterior hyperparameter α is determined by the expectation value of the sufficient statistics; the prior hyperparameter α_0 gives its initial value. Since the posterior parameter distribution Q_θ(θ) is given by the conjugate distribution P_α(θ | α, γ), which is itself an exponential family model, the integration over the parameter θ in equation 2.15 can be calculated explicitly as

θ̄ = ⟨θ⟩_α,   (2.20)
⟨θ⟩_α = ∫ dµ(θ) P_α(θ | α, γ) θ = (1/γ) ∂Φ/∂α (α, γ).   (2.21)

The natural parameter of the conjugate distribution is given by (γ α, γ). The corresponding expectation parameters are given by the ensemble averages of the model parameters: ⟨θ⟩_α, defined in equation 2.21, and

⟨ψ(θ)⟩_α = ∫ dµ(θ) P_α(θ | α, γ) ψ(θ) = (1/γ) ∂Φ/∂α (α, γ) · α − ∂Φ/∂γ (α, γ).   (2.22)
2.5 Parameterized Free Energy Function. Since the optimal solution simultaneously satisfies equations 2.14 and 2.16, the trial posterior distributions Q_θ(θ) and Q_z(Z{T}) can be parameterized as

Q_θ(θ) = P_α(θ | α, γ) = exp[γ (α · θ − ψ(θ)) − Φ(α, γ)],   (2.23)
Q_z(Z{T}) = ∏_{t=1}^{T} Q_z(z(t)),   (2.24)
Q_z(z(t)) = P(z(t) | x(t), θ̄),   (2.25)
where γ, α, and θ̄ are arbitrary variational parameters. By substituting this parameterized form into the definition of the free energy, equation 2.8, one obtains the parameterized free energy function

F(X{T}, θ̄, α, γ) = Σ_{t=1}^{T} log P(x(t) | θ̄) + (γ_0 α_0 − γ α) · ⟨θ⟩_α + T ⟨r(x, z)⟩_θ̄ · (⟨θ⟩_α − θ̄)
  − (T + γ_0 − γ) ⟨ψ(θ)⟩_α + T ψ(θ̄) + Φ(α, γ) − Φ(α_0, γ_0),   (2.26)

where the ensemble averages ⟨θ⟩_α and ⟨ψ(θ)⟩_α are given by equations 2.21 and 2.22, respectively.

The VB E-step equation, 2.20, can be derived from the free energy maximization condition with respect to θ̄:

∂F(X{T}, θ̄, α, γ)/∂θ̄ = 0.   (2.27)

The derivative of the free energy with respect to θ̄ is given by (see appendix B)

∂F/∂θ̄ = U(θ̄) · (⟨θ⟩_α − θ̄),
U(θ̄) = Σ_{t=1}^{T} ⟨ (r(x(t), z(t)) − ⟨r(x(t), z(t))⟩_θ̄) (r(x(t), z(t)) − ⟨r(x(t), z(t))⟩_θ̄)^T ⟩_θ̄,
⟨r(x(t), z(t))⟩_θ̄ = ∫ dµ(z(t)) P(z(t) | x(t), θ̄) r(x(t), z(t)).   (2.28)

Since the coefficient matrix U is positive definite, the maximization condition, equation 2.27, leads to the VB E-step equation, 2.20. The Hessian of the free energy with respect to θ̄ at the VB E-step solution is (−U), which shows that the VB E-step solution is indeed a maximum of the free energy with respect to θ̄.

The VB M-step equations, 2.17 and 2.18, can be derived from the free energy maximization conditions with respect to (α, γ):

∂F(X{T}, θ̄, α, γ)/∂γ = 0,   ∂F(X{T}, θ̄, α, γ)/∂α = 0.   (2.29)
The derivative of the free energy with respect to (α, γ) is given by (see appendix B)

( (1/γ) ∂F/∂α )   ( V_{α,α}    V_{α,γ} ) ( T ⟨r(x, z)⟩_θ̄ + γ_0 α_0 − (T + γ_0) α )
(       ∂F/∂γ ) = ( V_{α,γ}^T  V_{γ,γ} ) ( T + γ_0 − γ                           ),   (2.30)

where the Fisher information matrix V for the posterior parameter distribution P_α(θ | α, γ) is given by

V_{α,α} = (1/γ²) ⟨ (∂ log P_α/∂α)(∂ log P_α/∂α^T) ⟩_α = ⟨ (θ − ⟨θ⟩_α)(θ − ⟨θ⟩_α)^T ⟩_α,
V_{α,γ} = (1/γ) ⟨ (∂ log P_α/∂α)(∂ log P_α/∂γ) ⟩_α = ⟨ (θ − ⟨θ⟩_α)(g(θ) − ⟨g(θ)⟩_α) ⟩_α,
V_{γ,γ} = ⟨ (∂ log P_α/∂γ)(∂ log P_α/∂γ) ⟩_α = ⟨ (g(θ) − ⟨g(θ)⟩_α)² ⟩_α,
g(θ) = α · θ − ψ(θ).   (2.31)
Since the Fisher information matrix V is positive definite, the free energy maximization condition, 2.29, leads to the VB M-step equations, 2.17 and 2.18. From equation 2.30, it is shown that the VB M-step solution is a maximum of the free energy with respect to (α, γ), as in the VB E-step.

The VB algorithm is summarized as follows. First, γ is set to (T + γ_0). In the VB E-step, the ensemble average of the parameter θ̄ is calculated by using equation 2.20. Subsequently, the expectation value of the sufficient statistics ⟨r(x, z)⟩_θ̄, equation 2.19, is calculated by using the posterior distribution for the hidden variable, P(z(t) | x(t), θ̄), equation 2.14. In the VB M-step, the posterior hyperparameter α is updated by using equation 2.18. As this process is repeated, the free energy function, equation 2.26, increases monotonically; the process continues until the free energy converges.

Using equations 2.28 and 2.30, the VB equations 2.20 and 2.18 can be expressed as a gradient method:

Δθ̄ = θ̄_new − θ̄ = ⟨θ⟩_α − θ̄ = U^{−1}(θ̄) ∂F/∂θ̄ (X{T}, θ̄, α, γ),   (2.32)
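The alternating E/M scheme can be illustrated on a toy model outside the general EFH notation: a two-component, unit-variance gaussian mixture with fixed equal weights and an independent conjugate normal prior N(α_0, 1/γ_0) on each mean. The E-step computes responsibilities at the ensemble-average parameters, and the M-step updates each posterior mean from the expected sufficient statistics, mirroring equations 2.14 and 2.18 (a sketch; all names, priors, and data are illustrative):

```python
import math
import random

random.seed(0)
# Synthetic data from two unit-variance gaussians with true means -2 and 3.
data = ([random.gauss(-2.0, 1.0) for _ in range(100)]
        + [random.gauss(3.0, 1.0) for _ in range(100)])

gamma0, alpha0 = 1.0, 0.0   # prior hyperparameters shared by both means
alpha = [-1.0, 1.0]         # current posterior-mean estimates of the two means

def normal_pdf(x, m):
    # unit-variance gaussian density
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

for _ in range(50):
    # VB E-step: responsibilities evaluated at the ensemble-average parameters
    # (equal mixing weights cancel in the normalization).
    resp = []
    for x in data:
        p = [normal_pdf(x, m) for m in alpha]
        s = p[0] + p[1]
        resp.append([pk / s for pk in p])
    # VB M-step: conjugate update of each mean from the expected sufficient
    # statistics, mirroring alpha = (T<r> + gamma0*alpha0) / gamma.
    for k in range(2):
        n_k = sum(r[k] for r in resp)
        s_k = sum(r[k] * x for r, x in zip(resp, data))
        alpha[k] = (s_k + gamma0 * alpha0) / (n_k + gamma0)

print(sorted(alpha))  # close to the true means (-2, 3)
```

With well-separated components the alternation settles quickly, and the prior shrinks each estimate only slightly because the effective counts dominate γ_0.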
Δα = α_new − α = (1/γ) ( T ⟨r(x, z)⟩_θ̄ + γ_0 α_0 − γ α )
   = (1/γ²) V_{α,α}^{−1}(α, γ) ∂F/∂α (X{T}, θ̄, α, γ),   (2.33)

together with γ = T + γ_0. Substituting the VB E-step equation, 2.20, into the free energy, equation 2.26, the VB algorithm can be further rewritten as

Δα = (1/γ²) V_{α,α}^{−1}(α, γ) ∂F/∂α (X{T}, θ̄ = ⟨θ⟩_α, α, γ).   (2.34)

This shows that the VB algorithm is a gradient method with the inverse of the Fisher information matrix as a coefficient matrix. Namely, it is a type of natural gradient method (Amari, 1998), which gives the optimal asymptotic convergence. This fact is proved for the first time in this article. The natural gradient gives the steepest direction in the hyperparameter space, which has a Riemannian structure according to information geometry. It should be noted that the learning rate in the VB algorithm, 2.34, is automatically determined by the inverse of the Fisher information matrix.

When the VB algorithm converges, the free energy, equation 2.26, can be written in a simple form:

F(X{T}) = log( P(X{T} | θ̄) P_0(θ̄) ) − ∫ dµ(θ) Q_θ(θ) log Q_θ(θ) + γ [ψ(θ̄) − ⟨ψ(θ)⟩_α].   (2.35)
The first term on the right-hand side is the log likelihood together with the prior, evaluated at the ensemble average of the parameters. The second term is the entropy of the posterior parameter distribution; it penalizes complex models and overfitting. The third term represents the deviation from the ensemble average of the parameters and becomes negligible in the large sample limit.

2.6 Predictive Distribution. Once the posterior parameter distribution has been obtained by the VB algorithm, one can calculate the predictive distribution for the observed variable x. It is given by

P(x | X{T}) = ∫ dµ(θ) Q_θ(θ) P(x | θ)
            = ∫ dµ(θ) ∫ dµ(z) exp[ (r(x, z) + γ α) · θ + r_0(x, z) − (1 + γ) ψ(θ) − Φ(α, γ) ].   (2.36)
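For conjugate models, the θ-integral in equation 2.36 has a closed form as a ratio of normalizers. A checkable instance (not from the article): for a Bernoulli model with r(x) = x, r_0 = 0, ψ(θ) = log(1 + e^θ), the conjugate distribution P_α(θ | α, γ) corresponds to a Beta(γα, γ(1 − α)) density on p = σ(θ), so Φ(α, γ) = log B(γα, γ(1 − α)), and the integral evaluates to exp[Φ((γα + x)/(1 + γ), γ + 1) − Φ(α, γ)]; the resulting predictive mean is exactly α:

```python
import math

def log_beta(a, b):
    # log B(a, b) via log-gamma
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def Phi(alpha, gamma):
    # log-normalizer of the conjugate distribution for the Bernoulli model
    return log_beta(gamma * alpha, gamma * (1.0 - alpha))

alpha, gamma = 0.62, 11.0   # posterior hyperparameters (illustrative values)

def predictive(x):
    # ratio-of-normalizers form with alpha_hat = (gamma*alpha + x) / (1 + gamma)
    alpha_hat = (gamma * alpha + x) / (1.0 + gamma)
    return math.exp(Phi(alpha_hat, gamma + 1.0) - Phi(alpha, gamma))

p1, p0 = predictive(1), predictive(0)
print(p1, p0, p1 + p0)  # a proper distribution; P(x = 1 | X{T}) equals alpha
```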
By interchanging the integrations with respect to θ and z, one can get

P(x | X{T}) = ∫ dµ(z) exp[ r_0(x, z) + Φ(α̂(x, z), γ + 1) − Φ(α, γ) ],   (2.37)
α̂(x, z) = (γ α + r(x, z)) / (1 + γ).

For a finite T, this predictive distribution has a functional form different from that of the model distribution P(x | θ), equation 2.1.

2.7 Large Sample Limit. When the amount of observed data becomes large (T ≫ 1, γ ≫ 1), the solution of the VB algorithm approaches the ML estimator (Attias, 1999). In this limit, the integration over the parameters with respect to the posterior parameter distribution can be approximated by a stationary-point approximation:

exp[Φ(α, γ)] = ∫ dµ(θ) exp[γ (α · θ − ψ(θ))]
             ∼ exp[ γ (α · θ̂ − ψ(θ̂)) − (1/2) log |γ ∂²ψ/∂θ∂θ (θ̂)| + O(1/γ) ],   (2.38)
where θ̂ is the maximum of the exponent (α · θ − ψ(θ)), that is,

∂ψ/∂θ (θ̂) = α.   (2.39)

Therefore, Φ can be approximated as

Φ(α, γ) ∼ γ (α · θ̂ − ψ(θ̂)) − (1/2) log |γ ∂²ψ/∂θ∂θ (θ̂)| + O(1/γ).   (2.40)

Consequently, the ensemble average of the parameter θ̄ can be approximated as

θ̄ = (1/γ) ∂Φ/∂α (α, γ) ∼ (1/γ) ∂/∂α [ γ (α · θ̂ − ψ(θ̂)) ] = θ̂.   (2.41)

The relations 2.39 and 2.41 imply that the posterior hyperparameter α is equal to the expectation parameter of the EFH model, φ (see equation 2.3), in this limit. Furthermore, equations 2.18, 2.39, and 2.41 are equivalent to the
ordinary EM algorithm for the EFH model. In the large sample limit, the data term is dominant over the model complexity term. Consequently, the free energy maximization becomes equivalent to the likelihood maximization. Using equations 2.40 and 2.41, the free energy becomes

F ∼ Σ_{t=1}^{T} log P(x(t) | θ̄) − (K/2) log γ − (1/2) log |∂²ψ/∂θ∂θ (θ̄)|
  + γ_0 (α_0 · θ̄ − ψ(θ̄)) − Φ(α_0, γ_0) + O(1/γ).   (2.42)

This expression coincides with the BIC/MDL criteria (Rissanen, 1987; Schwarz, 1978). The predictive distribution P(x | X{T}) in this limit coincides with the model distribution at the ML estimator, P(x | θ̄). This can be shown by using the following relations:

α̂(x, z) ∼ α + (1/γ)(r(x, z) − α) + O(1/γ²),
Φ(α̂(x, z), γ + 1) ∼ Φ(α, γ) + (1/γ)(r(x, z) − α) · ∂Φ/∂α (α, γ) + ∂Φ/∂γ (α, γ)
                  ∼ Φ(α, γ) + r(x, z) · θ̄ − ψ(θ̄).

3 Online Variational Bayes Method

3.1 Expectation Value of the Free Energy. In this section, we derive an online version of the VB algorithm. In online learning, the amount of data increases over time. Therefore, it is desirable to calculate the free energy corresponding to a fixed amount of data. For this purpose, let us define the expectation value of the log evidence for a finite amount of data:

E[log P(X{T})]_ρ = ∫ dµ(X{T}) ρ(X{T}) log( ∫ dµ(θ) P(X{T} | θ) P_0(θ) ),   (3.1)
where ρ represents an unknown probability distribution for observed data. The corresponding VB free energy is given by

E[F(X{T}, Q_θ, Q_z)]_ρ = T ∫ dµ(θ) Q_θ(θ) E[ ∫ dµ(z) Q_z(z) log( P(x, z | θ) / Q_z(z) ) ]_ρ
  + ∫ dµ(θ) Q_θ(θ) log( P_0(θ) / Q_θ(θ) ).   (3.2)
The ratio γ_0/T determines the relative reliability between the observed data and the prior belief for the parameter distribution. The expected free energy, equation 3.2, can be estimated by

F(X{τ}, Q_z{τ}, Q_θ, T) = (T/τ) Σ_{t=1}^{τ} ∫ dµ(θ) Q_θ(θ) ∫ dµ(z(t)) Q_z(z(t)) log( P(x(t), z(t) | θ) / Q_z(z(t)) )
  + ∫ dµ(θ) Q_θ(θ) log( P_0(θ) / Q_θ(θ) ),   (3.3)

where Q_z{τ} = {Q_z(z(t)) | t = 1, ..., τ}. Note that τ represents the actual amount of observed data, and it increases over time while T is fixed. The estimation of the posterior distribution Q_z(z(t)) is inaccurate in the early stage of online learning and gradually becomes accurate as learning proceeds. However, the early inaccurate estimations and the later accurate estimations contribute to the free energy (see equation 3.3) with equal weight. This can cause slow convergence of the learning process. Therefore, we introduce a time-dependent discount factor λ(t) (0 ≤ λ(t) ≤ 1, t = 2, 3, ...) that forgets the effects of the early inaccurate estimations. Accordingly, a discounted free energy is defined by

F^λ(X{τ}, Q_z{τ}, Q_θ, T) = T η(τ) Σ_{t=1}^{τ} ( ∏_{s=t+1}^{τ} λ(s) ) ∫ dµ(θ) Q_θ(θ) ∫ dµ(z(t)) Q_z(z(t)) log( P(x(t), z(t) | θ) / Q_z(z(t)) )
  + ∫ dµ(θ) Q_θ(θ) log( P_0(θ) / Q_θ(θ) ),   (3.4)

where η(τ) is a normalization constant:

η(τ) = [ Σ_{t=1}^{τ} ( ∏_{s=t+1}^{τ} λ(s) ) ]^{−1}.   (3.5)
3.2 Online Variational Bayes Algorithm. The online VB algorithm can be derived from successive maximization of the discounted free energy (see equation 3.4). Assume that Q_z{τ−1} = {Q_z(z(t)) | t = 1, …, τ−1} and Q_θ^{(τ−1)}(θ) have been determined for an observed data set X{τ−1} = {x(t) | t = 1, …, τ−1}. When a new datum x(τ) is observed, the discounted free energy F^λ(X{τ}, Q_z{τ}, Q_θ, T) is first maximized with respect to Q_z(z(τ)), while Q_θ is set to Q_θ^{(τ−1)}(θ). The solution is given by

Q_z(z(τ)) = P(z(τ) | x(τ), θ̄(τ)),
θ̄(τ) = ∫ dµ(θ) Q_θ^{(τ−1)}(θ) θ.   (3.6)
In the next step, the discounted free energy F^λ(X{τ}, Q_z{τ}, Q_θ, T) is maximized with respect to Q_θ(θ), while Q_z{τ} is fixed. The solution is given by

Q_θ^{(τ)}(θ) = exp[ γ ( α(τ) · θ − ψ(θ) ) − Φ(α(τ), γ) ],
γ = T + γ_0,
γ α(τ) = T ⟨⟨r(x, z)⟩⟩(τ) + γ_0 α_0,   (3.7)
where the discounted average ⟨⟨·⟩⟩(τ) is defined by

⟨⟨r(x, z)⟩⟩(τ) = η(τ) Σ_{t=1}^{τ} ( Π_{s=t+1}^{τ} λ(s) ) ∫ dµ(z(t)) Q_z(z(t)) r(x(t), z(t)).   (3.8)
The discounted average can be calculated by a step-wise equation:

⟨⟨r(x, z)⟩⟩(τ) = (1 − η(τ)) ⟨⟨r(x, z)⟩⟩(τ−1) + η(τ) E_z[ r(x(τ), z(τ)) | θ̄(τ) ],
η(τ) = ( 1 + λ(τ)/η(τ−1) )^{-1}.   (3.9)
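As a check, the step-wise recursion in equation 3.9 can be verified numerically against the closed forms 3.5 and 3.8. This is an illustrative sketch of ours: the function names are arbitrary, and a random scalar sequence stands in for the expected sufficient statistics.

```python
import random

def closed_form(r, lam):
    """Eqs. 3.5 and 3.8 computed directly.
    r[t-1] stands for E_z[r(x(t), z(t)) | theta_bar(t)]; lam[t-1] is lambda(t)."""
    tau = len(r)
    weights = []
    for t in range(tau):
        w = 1.0
        for s in range(t + 1, tau):   # prod_{s=t+1}^{tau} lambda(s)
            w *= lam[s]
        weights.append(w)
    eta = 1.0 / sum(weights)          # eq. 3.5
    avg = eta * sum(w * x for w, x in zip(weights, r))   # eq. 3.8
    return eta, avg

def step_wise(r, lam):
    """Eq. 3.9: the same quantities via the O(1)-per-datum recursion."""
    inv_eta, avg = 0.0, 0.0           # 1/eta(0) = 0 gives eta(1) = 1
    for x, l in zip(r, lam):
        inv_eta = 1.0 + l * inv_eta   # eta(tau) = (1 + lambda(tau)/eta(tau-1))^{-1}
        eta = 1.0 / inv_eta
        avg = (1.0 - eta) * avg + eta * x
    return eta, avg

random.seed(0)
r = [random.gauss(0.0, 1.0) for _ in range(50)]
lam = [random.uniform(0.5, 1.0) for _ in range(50)]
eta_c, avg_c = closed_form(r, lam)
eta_s, avg_s = step_wise(r, lam)
```

Both routines return the same pair (η(τ), ⟨⟨r⟩⟩(τ)); the recursion merely avoids recomputing the weights at every step.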
By using equation 3.6, the expectation value of the sufficient statistics for the current datum, E_z[r(x(τ), z(τ)) | θ̄(τ)], is given by

E_z[ r(x(τ), z(τ)) | θ̄(τ) ] = ∫ dµ(z(τ)) P(z(τ) | x(τ), θ̄(τ)) r(x(τ), z(τ)).   (3.10)

The recursive formula for α(τ) follows from the above equations:

Δα(τ) = α(τ) − α(τ−1)
      = (1/γ) η(τ) ( T E_z[ r(x(τ), z(τ)) | θ̄(τ) ] + γ_0 α_0 − γ α(τ−1) ).   (3.11)
By using equation 3.7, the ensemble average of the parameter, θ̄(τ), defined in equation 3.6 can be calculated as

θ̄(τ) = ⟨θ⟩_{α(τ−1)} = (1/γ) ∂Φ/∂α (α(τ−1), γ).   (3.12)
The online VB algorithm is summarized as follows. In the VB E-step, the ensemble average of the parameter, θ̄(τ), is determined by equation 3.12. Using this value, one calculates the expectation value of the sufficient statistics for the current datum (see equation 3.10). The posterior hyperparameter α(τ) is then updated by equation 3.11 in the VB M-step. This process is repeated as new data are observed. By combining the VB E-step (equation 3.12) and VB M-step (equation 3.11), one obtains the recursive update equation for α(τ):

Δα(τ) = η(τ) ( (1/γ) ( T E_z[ r(x(τ), z(τ)) | ⟨θ⟩_{α(τ−1)} ] + γ_0 α_0 ) − α(τ−1) ).   (3.13)
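To make the recursion 3.13 concrete, consider a toy conjugate model with no hidden variables, so that E_z[r(x, z) | θ̄] reduces to r(x): a unit-variance gaussian likelihood with unknown mean θ, for which r(x) = x, ψ(θ) = θ²/2, and ⟨θ⟩_α = α. This example, including all parameter values, is ours rather than from the text; with λ(τ) = 1, that is, η(τ) = 1/τ, the recursion converges to the batch posterior mean.

```python
import random

T, gamma0, alpha0 = 100.0, 1.0, 0.0   # fixed data-size weight T, prior (gamma0, alpha0)
gamma = T + gamma0                    # eq. 3.7

random.seed(1)
data = [random.gauss(2.0, 1.0) for _ in range(2000)]

alpha = alpha0
for tau, x in enumerate(data, start=1):
    eta = 1.0 / tau                   # eq. 3.19: no discount factor
    # eq. 3.13 with E_z[r | theta_bar] = r(x) = x and <theta>_alpha = alpha
    alpha += eta * ((T * x + gamma0 * alpha0) / gamma - alpha)

# batch posterior mean for the same data, for comparison
target = (T * sum(data) / len(data) + gamma0 * alpha0) / gamma
```

With η(τ) = 1/τ the update computes exactly the running average of (T x(τ) + γ_0 α_0)/γ, so α approaches the batch posterior mean as data accumulate.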
3.3 Stochastic Approximation. Unlike in the batch VB algorithm, the discounted free energy in the online VB algorithm does not always increase, because a new contribution is added to the discounted free energy at each time instance. In the following, we prove that the online VB algorithm can be considered a stochastic approximation (Kushner & Yin, 1997) for finding the maximum of the expected free energy defined in equation 3.2, which gives a lower bound for the expected log evidence defined in equation 3.1. The expected free energy (see equation 3.2), in which the maximization with respect to Q_z has been performed, can be written as

max_{Q_z} E[ F(X{T}, Q_θ, Q_z) ]_ρ = E[ F_M(x, α, T) ]_ρ,

where

F_M(x, α, T) = T ∫ dµ(z) P(z | x, ⟨θ⟩_α) ∫ dµ(θ) Q_θ(θ) log( P(x, z|θ) / P(z | x, ⟨θ⟩_α) )
             + ∫ dµ(θ) Q_θ(θ) log( P_0(θ) / Q_θ(θ) )   (3.14)
             = T log P(x | ⟨θ⟩_α) + Φ(α, γ) − Φ(α_0, γ_0) − [ (γ α − γ_0 α_0) · ⟨θ⟩_α − T ψ(⟨θ⟩_α) ].   (3.15)
The gradient of F_M is calculated as

∂F_M/∂α (x, α, T) = γ V_{α,α}(α, γ) ( T E_z[ r(x, z) | ⟨θ⟩_α ] + γ_0 α_0 − γ α ),   (3.16)
where V_{α,α}, the Fisher information matrix for the posterior parameter distribution Q_θ(θ) = P_α(θ | α, γ), is defined in equation 2.31. Consequently, the online VB algorithm, equation 3.13, can be written as

Δα(τ) = (1/γ²) η(τ) V_{α,α}^{-1}(α(τ−1), γ) · ∂F_M/∂α (x(τ), α(τ−1), T).   (3.17)
If the effective learning rate η(τ) (≥ 0) satisfies the conditions (Kushner & Yin, 1997)

Σ_{τ=1}^{∞} η(τ) = ∞   and   Σ_{τ=1}^{∞} η²(τ) < ∞,   (3.18)
the online VB algorithm, equation 3.13, defines a stochastic approximation for finding the maximum of the expected free energy (see equation 3.2). When there is no discount factor, that is, λ(τ) = 1, the effective learning rate is given by

η(τ) = 1/τ.   (3.19)
This satisfies the stochastic approximation condition, equation 3.18. However, learning becomes very slow if this schedule is adopted (see section 5). The reason for the slow convergence is that the early inaccurate hyperparameter estimates affect the hyperparameter estimates in later learning stages, because there is no discount factor in the sufficient statistics average (see equation 3.8). The introduction of the discount factor is crucial for fast convergence. As in the online EM algorithm we proposed previously (Sato, 2000), we employ the discount schedule

1 − λ(τ) = 1 / ( (τ − 2)κ + τ_0 ),   (3.20)
which can be calculated recursively:

λ(τ) = 1 − (1 − λ(τ−1)) / ( 1 + κ (1 − λ(τ−1)) ).   (3.21)
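The recursion can be checked numerically against the closed form 3.20, together with the asymptotic behavior of the effective learning rate η(τ). The particular values κ = 1, τ_0 = 5 below are illustrative choices of ours.

```python
kappa, tau0 = 1.0, 5.0

def lam(tau):
    """Eq. 3.20 in closed form."""
    return 1.0 - 1.0 / ((tau - 2) * kappa + tau0)

# eq. 3.21: the recursion reproduces the closed form, starting from lambda(2)
l = lam(2)
for tau in range(3, 2001):
    l = 1.0 - (1.0 - l) / (1.0 + kappa * (1.0 - l))
    assert abs(l - lam(tau)) < 1e-9

# effective learning rate via the eta recursion of eq. 3.9;
# asymptotically eta(tau) ~ ((kappa + 1)/kappa) * (1/tau)
inv_eta = 0.0
for tau in range(1, 100001):
    inv_eta = 1.0 + lam(tau) * inv_eta
eta = 1.0 / inv_eta
```

The recursion works because 1/(1 − λ(τ)) increases by exactly κ per step, which is what the closed form 3.20 states.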
The corresponding effective learning rate η(τ) satisfies

η(τ) → ( (κ + 1)/κ ) (1/τ)   as τ → ∞,   (3.22)
so that the stochastic approximation condition (see equation 3.18) is satisfied. The constants in equation 3.20 have clear physical meanings: τ_0 represents how many samples contribute to the discounted average of the sufficient statistics (see equation 3.8) in the early stage of learning, and κ controls the asymptotic decrease of the effective learning rate η(τ), as in equation 3.22. The values of τ_0 and κ thus control the learning speed in the early and later stages of learning, respectively.

3.4 Discussion. A few comments on the online VB method are in order. The objective function of the online VB method is the expected log evidence (see equation 3.1) for a fixed amount of data T. Unlike the usual Bayesian method, the online VB method estimates the same quantity even as the amount of observed data increases; the observed data are used to improve the estimate of the expected free energy (see equation 3.2), which approximates the expected log evidence (see equation 3.1). Therefore, the online VB method does not converge to the EM algorithm in the large-sample limit (τ → ∞ with T fixed). Nevertheless, the method can perform learning and prediction concurrently in an online fashion.

The batch VB method can be derived from the online VB method if the same data set is presented repeatedly. Suppose that the posterior hyperparameters are updated by equation 3.7 only at the end of each epoch, in which all of the data are presented once, while the discounted averages are calculated at every step by equation 3.9. If the discount factor is λ(τ) = 0 at the beginning of each epoch and λ(τ) = 1 otherwise, this online VB method is equivalent to the batch VB method.

It is also possible to use the log evidence for the observed data, as in the usual Bayesian method. The corresponding online VB algorithm can be derived by equating the amount of data τ to T in equation 3.3.
In this case, the objective function changes over time as the amount of data increases, and the target posterior parameter distribution also changes over time. The variance of the posterior parameter distribution is proportional to 1/τ. This gives too much confidence to the posterior parameter distribution, because the free energy is not fully maximized for each datum; that is, the VB E-step and VB M-step are performed only once for each datum. Consequently, this type of online VB method soon converges to the EM algorithm: the posterior parameter
distribution becomes sharply peaked near the maximum likelihood estimator. To avoid this difficulty, the VB E-step and VB M-step would have to be iterated until convergence for each datum; however, such an algorithm is not an online algorithm in the usual sense.
4 Online Model Selection

In the usual Bayesian procedure for model selection, one prepares a set of models with different structures and calculates the evidence for each model. Then the model with the highest evidence is selected, or an average over models with different structures is taken. An alternative approach is to change the model structure sequentially. Sequential model selection procedures have been proposed based on either the batch VB method (Ghahramani & Beal, 2000; Ueda, 1999) or the MCMC method (Richardson & Green, 1997). In this article, we adopt sequential model selection procedures based on the online VB method.

Several sequential model selection strategies can be considered. In the first method, we start from an initial model with a given structure. The VB learning process for this model is continued while the free energy value is monitored. When the free energy converges, the model structure is changed according to some criterion, and the initial model is saved as the base model. The VB learning process for the current model is then continued until its free energy converges. If the free energy of the current model is greater than that of the base model, the current model is saved as the base model; otherwise, the base model is not changed. One always keeps the base model as the best model to date. A new trial model is then selected based on the base model. This process continues until further attempts do not improve the base model.

The above procedure is deterministic. A stochastic model selection process can be constructed based on the Metropolis algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953). In this case, the base model is different from the best model to date. If the free energy of the current model is greater than that of the base model, the current model is saved as the base model; otherwise, the base model is replaced by the current model with probability exp(β(F_current − F_base)).
This stochastic process can be applied to model selection in dynamic environments. We also propose an online model selection procedure using a hierarchical mixture of models with different structures. The models are trained concurrently by applying the online VB method to each model, and they are trained independently of each other. (It is also possible to train them competitively by applying the online VB method to the hierarchical mixture of models.) When the free energies converge, the models are compared according to their free energy values, and the current best model structure is selected. The structures of the other models are then changed based on the current best model structure. This sequential model selection procedure is suitable for dynamic environments.
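The deterministic and Metropolis search strategies described above can be sketched generically. Everything in this sketch is our own scaffolding: `free_energy(k)` stands in for the converged discounted free energy of a model with k units, and the structural changes are abstracted to k ± 1.

```python
import math
import random

def select_model(free_energy, k_init, k_min=1, k_max=20, beta=None,
                 n_trials=200, seed=0):
    """Sequential model-structure search. beta=None: deterministic hill
    climbing (the base model is always the best model to date); finite beta:
    Metropolis acceptance with probability exp(beta * (F_trial - F_base))."""
    rng = random.Random(seed)
    k_base, f_base = k_init, free_energy(k_init)
    best = (k_base, f_base)
    for _ in range(n_trials):
        # propose a structural change: split (k+1) or merge (k-1)
        k_trial = min(k_max, max(k_min, k_base + rng.choice([-1, 1])))
        f_trial = free_energy(k_trial)
        if f_trial > f_base or (beta is not None and
                                rng.random() < math.exp(beta * (f_trial - f_base))):
            k_base, f_base = k_trial, f_trial
        if f_base > best[1]:
            best = (k_base, f_base)   # always track the best model to date
    return best[0]

# toy free-energy surface peaked at the "true" number of units, 4
toy_f = lambda k: -(k - 4) ** 2
k_greedy = select_model(toy_f, k_init=10)
k_mh = select_model(toy_f, k_init=2, beta=1.0)
```

In the Metropolis variant the base model may wander downhill, but the best model to date is retained separately, as in the text.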
This method is used in a model selection task for dynamic environments in section 5.

In the next section, we study the model selection problem for mixture of gaussian models. As a mechanism for structural change, we adopt the split-and-merge method proposed by Ueda et al. (1999) (see also Richardson & Green, 1997; Ghahramani & Beal, 2000; Ueda, 1999). For mixture models, the split-and-merge method provides a simple procedure for structural change: in the sequential model selection process, we choose either to split one unit into two or to merge two units into one. In the current implementation, the same operation is applied again if the previous attempt was successful; otherwise, the other operation is applied. The criterion for splitting is the unit free energy, which is assigned to each unit (see appendix C); the split is applied to the unit with the lowest free energy among the unattempted units. The criterion for merging is the correlation between two units' activities, where a unit's activity is the posterior probability that the unit is selected for given data; the unit pair with the highest correlation among the unattempted pairs is selected for merging. Units with very small activities, indicating that they have not been selected at all, are deleted. We adopted this model selection procedure because of its simplicity; other model selection procedures using the split-and-merge algorithm have also been proposed (Ghahramani & Beal, 2000; Ueda, 1999). By combining the sequential model selection procedure with the online VB learning method, a fully online learning method with a model selection mechanism is obtained, which can be applied to real-time applications.

5 Experiments

As a preliminary study of the performance of the online VB method, we considered model selection problems for two-dimensional mixture of gaussian (MG) models (see appendix C). We borrowed two tasks from Roberts et al.
(1998). Data set A, consisting of 200 points, was generated from a mixture of four gaussians with centers (0, 0), (2, √12), (4, 0), and (−2, −√12) (see Figure 1A). The gaussians had the same isotropic variance σ² = (1.2)². Data set B, consisting of 1000 points, was generated from a mixture of four gaussians (see Figure 1B). In this case, the gaussians were paired such that each pair had a common center, m_1 = m_2 = (2, √12) and m_3 = m_4 = (−2, −√12), but different variances, σ_1² = σ_3² = (1.0)² and σ_2² = σ_4² = (5.0)². Although these models are simple, the model selection tasks for them are rather difficult because of the overlap between the gaussians (Roberts et al., 1998). In the first experiment, we examined the usual Bayes model selection procedure. A set of models consisting of different numbers of units was
Figure 1: (A) 200 points in data set A, generated from a mixture of four gaussians with different centers. (B) 1000 points in data set B, generated from a mixture of four gaussians; pairs of gaussians have the same centers but different variances.
prepared. The VB method was applied to each model, and the maximum free energy was calculated. Three VB methods were compared: batch VB, online VB without the discount factor, and online VB with the discount factor, whose schedule is given by equation 3.20 with τ_0 = 100 and κ = 0.01. We used a nearly noninformative prior in all cases: γ_0 = 0.01. The learning speed was measured in epochs; in one epoch, all training data were supplied to each VB method once. The online VB method updated the ensemble average of the parameters for each datum, while the batch VB method updated it once per epoch according to the average of the sufficient statistics over all of the training data.

The results are summarized in Figure 2. Both the batch VB method and the online VB method gave the highest free energy for the true model consisting of four units. The online VB method with the discount factor showed faster and better performance than the batch VB method, especially for large amounts of data (see Figure 2). The reason for this performance difference can be understood as follows. In the online VB method, the posterior probability for the hidden variables is calculated using the ensemble average of the parameters, which is improved at each observation. The batch VB method, in contrast, uses for all data the ensemble average of the parameters calculated in the previous epoch. Therefore, the estimate of the posterior probability for the hidden variables improves rather slowly, and this becomes more prominent for larger amounts of data. In such cases, the online VB method can find the optimal solution within one epoch, as shown in Figure 2B.
Figure 2: Maximum free energies (FE) obtained by the three learning methods and their convergence times, measured in epochs, plotted for various models. The three methods are batch VB (dash-dotted line with squares), online VB with the discount factor (solid line with circles), and online VB without the discount factor (dashed line with triangles). The abscissa denotes the number of gaussian units in the trained models. (A) Results for data set A. (B) Results for data set B.
The online VB method without the discount factor showed poor performance and slow convergence in all cases. This result implies that the introduction of the discount factor is crucial for good performance of the online VB method, as pointed out in section 3: without a discount factor, the early inaccurate estimates contribute to the sufficient statistics average even in the later stages of the learning process and degrade the quality of the estimates.

In the second experiment, the sequential model selection procedure using a trial model (see section 4) was tested. When the free energy converged, the structure of the trial model was changed based on the base model, which was the best model to date. We tested two initial model configurations, consisting of 2 units and of 10 units. When the model structure was changed, the discount factor and the effective learning rate in the online VB method were reset to 1 − λ(τ) = 0.01 and η(τ) = 0.01. The online VB method was able to find the best model in all cases (see Figures 3 and 4). It should be noted that the VB method sometimes increased the free energy while decreasing the data likelihood (see Figures 3–6). This was achieved as a result of the decrease
Figure 3: Online model selection processes using the online VB method for data set A. Free energies (FE) and number of units for the base model (solid line), which is the best model to date, and the current trial model (dash-dotted line) are shown. Log likelihood (LKD) for the current trial model (dashed line) is also shown. (A) The initial model consists of 2 units. (B) The initial model consists of 10 units.
in model complexity. The batch VB method also found the best model (see Figures 5 and 6), except in one case, in which it got stuck in a local maximum (see Figure 6A).

We also examined the online model selection procedure for a dynamic environment using a hierarchical mixture of MG models, described in section 4. The hierarchical mixture in this experiment consisted of two MG models, both starting from the same model structure with 10 units. When the free energies converged, the structure of the MG model with the lower free energy was changed. At that moment, the discount factor and the effective learning rate were reset to 1 − λ(τ) = 0.01 and η(τ) = 0.01. In this experiment, the data generation model was changed at the fifty-first epoch: in the first 50 epochs, the number of gaussians was four. After the
Figure 4: Online model selection processes using the online VB method for data set B. Free energies (FE) and number of units for the base model (solid line), which is the best model to date, and the current trial model (dash-dotted line) are shown. Log likelihood (LKD) for the current trial model (dashed line) is also shown. (A) The initial model consists of 2 units. (B) The initial model consists of 10 units.
fifty-first epoch, the number of gaussians was six. In the learning process, each model observed 1000 data points per epoch. Figure 7 shows the learning process. After the twentieth epoch, the optimal model structure with four units was found. When the environment changed at the fifty-first epoch, the free energies of both models dropped dramatically, and the trained models searched for the new optimal structure. After the seventy-fifth epoch, the optimal model structure with six units was found. This result shows that the online VB method was able to adjust the model structure to a dynamic environment.
Figure 5: Sequential model selection processes using the batch VB method for data set A. Free energies (FE) and number of units for the base model (solid line), which is the best model to date, and the current trial model (dash-dotted line) are shown. Log likelihood (LKD) for the current trial model (dashed line) is also shown. (A) The initial model consists of 2 units. (B) The initial model consists of 10 units.
6 Conclusion

We derived an online version of the VB algorithm and proved its convergence by showing that it is a stochastic approximation for finding the maximum of the free energy. A fully online learning method with a model selection mechanism was also proposed, based on the online VB method together with a sequential model selection procedure. This method can be applied to real-time applications. We considered the Bayes model without hierarchy; the current method can easily be extended to the hierarchical Bayes model (see appendix D).
Figure 6: Sequential model selection processes using the batch VB method for data set B. Free energies (FE) and number of units for the base model (solid line), which is the best model to date, and the current trial model (dash-dotted line) are shown. Log likelihood (LKD) for the current trial model (dashed line) is also shown. (A) The initial model consists of 2 units. (B) The initial model consists of 10 units.
In preliminary experiments using synthetic data, the online VB method was able to adapt the model structure to dynamic environments. A detailed study of the performance of the online VB method will be published in a forthcoming article. Finding better sequential model selection procedures also remains for future study.
Appendix A

Free energy maximization can be done by using the following theorem:
Figure 7: Online model selection process for a dynamic environment using a hierarchical mixture of MG models. The number of units in the data generation model is four in the first 50 epochs and six after the fifty-first epoch. Free energies (FE) and numbers of units for the best model (solid line), which is selected at the previous competition, and the trial model (dashed line) are shown.
Theorem 1. The maximum of ∫ dµ(y) Q(y) ( f(y) − log Q(y) ) under the condition ∫ dµ(y) Q(y) = 1 is given by

Q(y) = exp[ f(y) ] / ∫ dµ(y′) exp[ f(y′) ].
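Theorem 1 can be checked numerically on a discrete domain, where the maximizer is a softmax of f and the maximum value is log ∫ e^f. The particular values of f below are our own.

```python
import math
import random

f = [0.3, -1.2, 2.0, 0.7]                 # arbitrary f(y) on a 4-point domain
Z = sum(math.exp(v) for v in f)
q_star = [math.exp(v) / Z for v in f]     # Q(y) = exp f(y) / Z  (Theorem 1)

def objective(q):
    """Discrete version of the integral Q(y) (f(y) - log Q(y))."""
    return sum(qi * (fi - math.log(qi)) for qi, fi in zip(q, f))

best = objective(q_star)                  # equals log Z
random.seed(0)
worse = []
for _ in range(100):
    # random normalized distributions should never beat the softmax
    w = [random.random() + 1e-6 for _ in f]
    s = sum(w)
    worse.append(objective([wi / s for wi in w]))
```

The identity objective(q) = log Z − KL(q ‖ q*) makes the result transparent: the KL term vanishes only at q = q*.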
The theorem can be proved with the help of the Lagrange multiplier method. The VB equations, 2.13 through 2.19, can be proved by using this theorem and the following relations:

F = ∫ dµ(θ) Q_θ(θ) [ T ( ⟨r(x, z)⟩_{Q_z} · θ − ψ(θ) ) + log P_0(θ) − log Q_θ(θ) ] + Q_θ(θ)-independent terms
  = ∫ dµ(Z{T}) Q_z(Z{T}) [ Σ_{t=1}^{T} ( r(x(t), Z(t)) · ⟨θ⟩_{Q_θ} + r_0(x(t), Z(t)) ) − log Q_z(Z{T}) ] + Q_z(Z{T})-independent terms,

where ⟨·⟩_{Q_θ} and ⟨·⟩_{Q_z} denote expectation values with respect to Q_θ(θ) and Q_z(Z{T}), respectively.

Appendix B

The calculation of the derivative of the parameterized free energy (see equation 2.26) is lengthy but straightforward; its outline is given below. The derivative with respect to θ̄ can be calculated as

∂F/∂θ̄ = T ( ∂⟨r(x, z)⟩_{θ̄} / ∂θ̄ ) ( ⟨θ⟩_α − θ̄ ),

by using the relation

(∂/∂θ̄) Σ_{t=1}^{T} log P(x(t) | θ̄) = T ( ⟨r(x, z)⟩_{θ̄} − ∂ψ/∂θ̄ ).

The coefficient matrix T ( ∂⟨r⟩_{θ̄} / ∂θ̄ ) turns out to be U(θ̄) defined in equation 2.28. The derivatives with respect to (α, γ) are given by

(1/γ) ∂F/∂α = (1/γ²) ( ∂²Φ/∂α∂α^T ) · ( T ⟨r(x, z)⟩_{θ̄} + γ_0 α_0 − (T + γ_0) α )
            + (T + γ_0 − γ) ( (1/γ) ∂²Φ/∂α∂γ − (1/γ²) ∂Φ/∂α ),

∂F/∂γ = ( (1/γ) ∂²Φ/∂α∂γ − (1/γ²) ∂Φ/∂α ) · ( T ⟨r(x, z)⟩_{θ̄} + γ_0 α_0 − (T + γ_0) α )
      + (T + γ_0 − γ) ∂²Φ/∂γ∂γ.

Equations 2.30 and 2.31 can be derived by using the above and the following equations:

(1/γ²) ∂²Φ/∂α∂α^T = (1/γ²) ⟨ (∂ log P_α/∂α)(∂ log P_α/∂α^T) ⟩_α,
(1/γ) ∂²Φ/∂α∂γ − (1/γ²) ∂Φ/∂α = (1/γ) ⟨ (∂ log P_α/∂α)(∂ log P_α/∂γ) ⟩_α,
∂²Φ/∂γ∂γ = ⟨ (∂ log P_α/∂γ)(∂ log P_α/∂γ) ⟩_α.

Appendix C

The VB algorithm for the mixture of gaussian model is briefly explained in this appendix. (The notation in this appendix differs slightly from that in the text.) We first explain the more general mixture of exponential-family (MEF) models. The probability distribution for the ith unit in the MEF model is defined by

P(x | θ_i, i) = exp[ r_i(x) · θ_i + r_{i,0}(x) − ψ_i(θ_i) ].   (C.1)
The conjugate distribution for equation C.1 is given by

P_α(θ_i | α_i, γ_i, i) = exp[ γ_i ( α_i · θ_i − ψ_i(θ_i) ) − Φ_i(α_i, γ_i) ].   (C.2)
The probability distribution for the MEF model is then defined by

P(x | g, θ) = Σ_{i=1}^{M} g_i P(x | θ_i, i)
            = Σ_{z} exp[ Σ_{i=1}^{M} ( z_i ( log g_i − ψ_i(θ_i) ) + z_i r_i(x) · θ_i + z_i r_{i,0}(x) ) ],   (C.3)
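The equivalence of the two forms in equation C.3 is easy to check numerically: for a one-hot indicator z, the exponential collapses to g_i P(x | θ_i, i), since log g_i + log P(x | θ_i, i) = log g_i + r_i(x) · θ_i + r_{i,0}(x) − ψ_i(θ_i). Below is a one-dimensional sketch of ours, with gaussian component densities and arbitrary test values.

```python
import math

def p_unit(x, m, s):
    """Component density of unit i (1-D normal with mean m, precision s)."""
    return math.sqrt(s / (2.0 * math.pi)) * math.exp(-0.5 * s * (x - m) ** 2)

g = [0.2, 0.5, 0.3]                               # mixing proportions, sum to 1
params = [(-1.0, 1.0), (0.0, 2.0), (2.5, 0.5)]    # (m_i, s_i) per unit
x = 0.4

# first form of eq. C.3: sum_i g_i P(x | theta_i, i)
direct = sum(gi * p_unit(x, m, s) for gi, (m, s) in zip(g, params))

# second form: sum over the M one-hot configurations of the indicator z
total = 0.0
for z in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
    total += math.exp(sum(zi * (math.log(gi) + math.log(p_unit(x, m, s)))
                          for zi, gi, (m, s) in zip(z, g, params)))
```

The hidden-variable form is what makes the VB E-step tractable: the posterior over z factorizes over the M one-hot configurations.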
where the hidden variable z = {z_i | i = 1, …, M} is an indicator variable: z_i = 0 or 1, and Σ_{i=1}^{M} z_i = 1. Σ_{z} denotes the summation over the M possible configurations of z. The mixing proportions g = {g_i | i = 1, …, M} satisfy the constraint Σ_{i=1}^{M} g_i = 1, which is automatically satisfied by the expression g_i = e^{φ_i} / ( Σ_{j=1}^{M} e^{φ_j} ).

The set of model parameters {g, θ} = {g_i, θ_i | i = 1, …, M} is not the natural parameter of the MEF model (see equation C.3). The natural parameter is given by {ω, θ} = {ω_i = φ_i − ψ_i(θ_i), θ_i | i = 1, …, M}, and the corresponding sufficient statistics are {z_i, z_i r_i(x) | i = 1, …, M}. Accordingly, the MEF model can be written as an EFH model:

P(x | ω, θ) = Σ_{z} exp[ Σ_{i=1}^{M} ( z_i ω_i + z_i r_i(x) · θ_i ) − ψ_θ(ω, θ) ],   (C.4)

ψ_θ(ω, θ) = log [ Σ_{i=1}^{M} exp( ω_i + ψ_i(θ_i) ) ].   (C.5)
The conjugate distribution for the MEF model, equation C.3, is given by the product of a Dirichlet distribution and the conjugate distribution for each unit:

P_α(g, θ | α, ν, γ) = exp[ γ Σ_{i=1}^{M} ν_i ( log g_i − ψ_i(θ_i) + α_i · θ_i ) − Φ_α(α, ν, γ) ]
                    = exp[ γ Σ_{i=1}^{M} ( ν_i ω_i + ν_i α_i · θ_i ) − γ ψ_θ(ω, θ) − Φ_α(α, ν, γ) ],   (C.6)

Φ_α(α, ν, γ) = Σ_{i=1}^{M} log Γ(γ ν_i + 1) − log Γ(γ + M) + Σ_{i=1}^{M} Φ_i(α_i, γ ν_i),   (C.7)
where ν_i satisfies Σ_{i=1}^{M} ν_i = 1, and Γ(γ) is the gamma function, Γ(γ) = ∫_0^∞ ds e^{−s} s^{γ−1}. The VB algorithm for the MEF model can be derived by using Φ_α as described in section 2. The VB E-step equations are

θ̄_i = ⟨θ_i⟩_α = (1/(γ ν_i)) ∂Φ_α/∂α_i,   (C.8)
ω̄_i = ⟨ω_i⟩_α = (1/γ) ∂Φ_α/∂ν_i − α_i · ⟨θ_i⟩_α.   (C.9)

The VB M-step equations are

γ = T + γ(0),   (C.10)
ν_i = (1/γ) ( T ⟨z_i⟩_{θ̄} + γ(0) ν_i(0) ),   (C.11)
α_i = (1/(γ ν_i)) ( T ⟨z_i r_i(x)⟩_{θ̄} + γ(0) ν_i(0) α_i(0) ),   (C.12)

where γ(0) and {ν_i(0), α_i(0) | i = 1, …, M} are the hyperparameters of the prior parameter distribution, and ⟨·⟩_{θ̄} denotes the expectation value (see equation 2.19) with respect to P(z | x, ω̄, θ̄).
The free energy of the MEF model after the VB M-step is expressed as

F = Σ_{i=1}^{M} [ T ⟨z_i log P(x | ω̄, θ̄)⟩_{θ̄} − T ⟨z_i log P(x, z | ω̄, θ̄)⟩_{θ̄}
    + log Γ(γ ν_i + 1) − ν_i log Γ(γ + M) − log Γ(γ(0) ν_i(0) + 1) + ν_i(0) log Γ(γ(0) + M)
    + Φ_i(α_i, γ ν_i) − Φ_i(α_i(0), γ(0) ν_i(0)) ].   (C.13)
The mixture of gaussian (MG) model is obtained when the component distribution P(x | θ_i, i) is the normal distribution:

P(x | m_i, Σ_i, i) = (2π)^{−N/2} |Σ_i|^{1/2} exp[ −(1/2) (x − m_i)^T Σ_i (x − m_i) ]
                   = exp[ −(1/2) x^T Σ_i x + x^T Σ_i m_i − ψ_i(m_i, Σ_i) ],   (C.14)

ψ_i(m_i, Σ_i) = (1/2) m_i^T Σ_i m_i − (1/2) log |Σ_i| + (N/2) log(2π),   (C.15)
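The exponential-family form C.14/C.15 can be checked in one dimension (N = 1), where Σ_i is a scalar precision. The check below, with our own test values, confirms that the natural-parameter form reproduces the ordinary normal density.

```python
import math

def psi(m, s):
    """Log-partition of eq. C.15 for N = 1 (s is the precision)."""
    return 0.5 * m * s * m - 0.5 * math.log(s) + 0.5 * math.log(2.0 * math.pi)

def p_expfam(x, m, s):
    """Eq. C.14 with natural parameters (s, s*m)."""
    return math.exp(-0.5 * x * s * x + x * s * m - psi(m, s))

def p_normal(x, m, s):
    """Ordinary normal density with mean m and precision s."""
    return math.sqrt(s / (2.0 * math.pi)) * math.exp(-0.5 * s * (x - m) ** 2)

vals = [(x, p_expfam(x, 1.5, 0.8), p_normal(x, 1.5, 0.8))
        for x in (-1.0, 0.0, 0.7, 3.2)]
```

Expanding the exponent of `p_expfam` gives −(s/2)(x − m)² + (1/2) log s − (1/2) log 2π, which is exactly the log of the normal density.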
where m_i and Σ_i denote the center and the inverse covariance matrix of the ith gaussian. The natural parameter of the normal distribution is θ_i = (Σ_i, Σ_i m_i). The conjugate distribution for the normal distribution (see equation C.14) is the normal-Wishart distribution (Gelman, Carlin, Stern, & Rubin, 1995),

P_α(m_i, Σ_i | c_i, Δ_i, γ_i) = exp[ −(1/2) γ_i (m_i − c_i)^T Σ_i (m_i − c_i) − (1/2) γ_i Tr(Σ_i Δ_i^{−1})
                                     + (1/2) (γ_i − N) log |Σ_i| − Φ_i(Δ_i, γ_i) ],   (C.16)

Φ_i(Δ_i, γ_i) = (γ_i/2) log |Δ_i^{−1}| + Σ_{n=1}^{N} log Γ( (γ_i + 1 − n)/2 )
              − (γ_i N/2) log(γ_i/2) − (N/2) log( γ_i/(2π) ) + (1/4) N(N − 1) log π.   (C.17)
The natural parameter of the conjugate distribution, equation C.16, is given by

(γ_i α_i, γ_i) = ( γ_i ( Δ_i^{−1} + c_i c_i^T ), γ_i c_i, γ_i ).   (C.18)

The VB algorithm for the MG model can be derived by using the above equations.
Appendix D

The VB method can easily be extended to the hierarchical Bayes model. Consider the EFH model (see equation 2.1) with the prior distribution P_α(θ | α_0, γ_0). The evidence for the hierarchical Bayes model is given by the marginal likelihood with respect to the model parameter θ and the prior hyperparameter α_0,

P(X{T}) = ∫ dµ(θ) dµ(α_0) P(X{T} | θ) P_α(θ | α_0, γ_0) P_0(α_0),   (D.1)

where P_0(α_0) is the prior distribution for the prior hyperparameter α_0. The free energy is defined by

F(X{T}, Q) = ∫ dµ(θ) dµ(α_0) dµ(Z{T}) Q(θ, α_0, Z{T}) log( P(X{T}, Z{T} | θ) P_α(θ | α_0, γ_0) P_0(α_0) / Q(θ, α_0, Z{T}) ).   (D.2)
The hierarchical VB method can be obtained by assuming the conjugate prior for P_α(θ | α_0, γ_0),

P_0(α_0) = exp[ b_0 ( γ_0 a_0 · α_0 − Φ_α(α_0, γ_0) ) − Φ_a(a_0, b_0) ],   (D.3)

and the factorization of the trial posterior distribution,

Q(θ, α_0, Z{T}) = Q_θ(θ) Q_α(α_0) Q_z(Z{T}).   (D.4)
The remaining calculations can be done in the same way as in the VB method. The VB algorithm in this case consists of three steps. In the VB E-step, the posterior probability for the hidden variable, P(z(t) | x(t), θ̄), is calculated by using the ensemble average of the parameter,

θ̄ = ⟨θ⟩_α.   (D.5)

In the VB M-step, the posterior hyperparameter α is calculated:

γ α = T ⟨r(x, z)⟩_{θ̄} + γ_0 ⟨α_0⟩_a,   (D.6)
γ_0 ⟨α_0⟩_a = γ_0 ∫ dµ(α_0) Q_α(α_0) α_0 = (1/b) ∂Φ_a/∂a (a, b),   (D.7)
together with γ = T + γ_0. The posterior hyper-hyperparameters (a, b) are then calculated:

a = ⟨θ⟩_α + a_0,   (D.8)
b = b_0 + 1,   (D.9)
⟨θ⟩_α = (1/γ) ∂Φ_α/∂α (α, γ).   (D.10)
By repeating the above three steps, the free energy monotonically increases. The online VB algorithm can be derived similarly.

References

Amari, S. (1985). Differential geometrical method in statistics. New York: Springer-Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proc. 15th Conference on Uncertainty in Artificial Intelligence (pp. 21–30).
Attias, H. (2000). A variational Bayesian framework for graphical models. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 206–212). Cambridge, MA: MIT Press.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Bishop, C. M. (1999). Variational principal components. In IEE Conference Publication on Artificial Neural Networks ICANN99 (pp. 509–514).
Chickering, D. M., & Heckerman, D. (1997). Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29, 181–212.
Cooper, G., & Herskovitz, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–22.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. London: Chapman & Hall.
Ghahramani, Z., & Beal, M. J. (2000). Variational inference for Bayesian mixture of factor analysers. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 449–455). Cambridge, MA: MIT Press.
Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
Kushner, H. J., & Yin, G. G. (1997). Stochastic approximation algorithms and applications. New York: Springer-Verlag.
Mackay, D. J. C. (1992a). Bayesian interpolation. Neural Computation, 4, 405–447.
Mackay, D. J. C. (1992b). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
Mackay, D. J. C. (1999). Comparison of approximate methods for handling hyperparameters. Neural Computation, 11, 1035–1068.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 257–286.
Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society B, 59, 731–792.
Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Society B, 49, 223–239, 253–265.
Roberts, S. J., Husmeier, D., Rezek, I., & Penny, W. (1998). Bayesian approaches to gaussian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1133–1142.
Roweis, S. T., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11, 305–345.
Sato, M. (2000). Convergence of on-line EM algorithm. In Proc. 7th International Conference on Neural Information Processing ICONIP-2000, Vol. 1 (pp. 476–481).
Sato, M., & Ishii, S. (2000). On-line EM algorithm for the normalized gaussian network. Neural Computation, 12, 407–432.
Schwartz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11, 443–482.
Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.
Ueda, N. (1999). Variational Bayesian learning with split and merge operations (Tech. Rep. No. PRMU99-174, pp. 67–74). Institute of Electronics, Information, and Communications Engineers. (In Japanese)
Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. E. (1999). SMEM algorithm for mixture models. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 599–605). Cambridge, MA: MIT Press.
Waterhouse, S., Mackay, D., & Robinson, T. (1996). Bayesian methods for mixture of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 351–357). Cambridge, MA: MIT Press.

Received December 28, 1999; accepted September 27, 2000.
Neural Computation, Volume 13, Number 8
CONTENTS Article
Orientation, Scale, and Discontinuity as Emergent Properties of Illusory Contour Shape Lance R. Williams and Karvel K. Thornber
1683
Note
A Spike-Train Probability Model Robert E. Kass and Valérie Ventura
1713
Letters
Orientation Tuning Properties of Simple Cells in Area V1 Derived from an Approximate Analysis of Nonlinear Neural Field Models Thomas Wennekers
1721
Computational Design and Nonlinear Dynamics of a Recurrent Network Model of the Primary Visual Cortex Zhaoping Li
1749
A Hierarchical Dynamical Map as a Basic Frame for Cortical Mapping and Its Application to Priming Osamu Hoshino, Satoru Inoue, Yoshiki Kashimori, and Takeshi Kambara
1781
Dynamical Stability Conditions for Recurrent Neural Networks with Unsaturating Piecewise Linear Transfer Functions Heiko Wersing, Wolf-Jürgen Beyn, and Helge Ritter
1811
An Information-Based Neural Approach to Constraint Satisfaction Henrik Jönsson and Bo Söderberg
1827
Recurrence Methods in the Analysis of Learning Processes S. Mendelson and I. Nelken
1839
Subspace Information Criterion for Model Selection Masashi Sugiyama and Hidemitsu Ogawa
1863
Convergent Decomposition Techniques for Training RBF Neural Networks C. Buzzi, L. Grippo, and M. Sciandrone
1891
Errata
1921
ARTICLE
Communicated by David Mumford
Orientation, Scale, and Discontinuity as Emergent Properties of Illusory Contour Shape

Lance R. Williams
Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, U.S.A.

Karvel K. Thornber
NEC Research Institute, Princeton, NJ 08540, U.S.A.

A recent neural model of illusory contour formation is based on a distribution of natural shapes traced by particles moving with constant speed in directions given by Brownian motions. The input to that model consists of pairs of position and direction constraints, and the output consists of the distribution of contours joining all such pairs. In general, these contours will not be closed, and their distribution will not be scale-invariant. In this article, we show how to compute a scale-invariant distribution of closed contours given position constraints alone and use this result to explain a well-known illusory contour effect.

1 Introduction
Mumford (1994) has proposed that the distribution of illusory contour shapes can be modeled by particles traveling with constant speed in directions given by Brownian motions. More recently, Williams and Jacobs (1997a, 1997b) introduced the notion of a stochastic completion field, the distribution of particle trajectories joining pairs of position and direction constraints, and showed how it could be computed in a local parallel network. They argued that the mode, magnitude, and variance of the completion field are related to the observed shape, salience, and sharpness of illusory contours.

Unfortunately, the Williams and Jacobs model, as described, has some shortcomings. Recent psychophysics suggests that contour salience is greatly enhanced by closure (Kovacs & Julesz, 1993). Yet, in general, the distribution computed by the Williams and Jacobs model does not consist of closed contours, nor is it scale-invariant; doubling the distances between the constraints does not produce a comparable completion field of double the size without a corresponding doubling of particle speeds. However, the Williams and Jacobs model contains no intrinsic mechanism for speed selection. The speeds (like the directions) must be specified a priori.

In this article, we show how to compute a scale-invariant distribution of closed contours given position constraints alone. The significance of this is

Neural Computation 13, 1683–1711 (2001) © 2001 Massachusetts Institute of Technology
twofold. First, it provides insight into how the output of neurons with purely isotropic receptive fields like those of the lateral geniculate nucleus (LGN) might be combined to produce orientation-dependent responses like those exhibited by neurons in V1 and V2. Second, it suggests that the responses of these neurons are not just input for long-range contour grouping processes, but instead are emergent properties of such computations.

2 A Discrete-State, Continuous-Time Random Process

2.1 Shape Distribution. Mumford (1994) observed that the probability distribution of boundary completion shapes could be modeled by a Fokker-Planck equation of the following form:

$$\frac{\partial P}{\partial t} = -\gamma \cos\theta\,\frac{\partial P}{\partial x} - \gamma \sin\theta\,\frac{\partial P}{\partial y} + \frac{\sigma^2}{2}\frac{\partial^2 P}{\partial \theta^2} - \frac{P}{\tau}.$$

This partial differential equation can be viewed as a set of independent advection equations in $\vec{x} = [x, y]^T$ (the first and second terms) coupled in the $\theta$ dimension by the diffusion equation (the third term). The advection equations translate probability mass in direction $\theta$, with constant speed $\gamma$, while the diffusion term models the Brownian motion in direction, with diffusion parameter $\sigma$. The combined effect of these three terms is that particles tend to travel in straight lines, but over time they drift to the left or right by an amount proportional to $\sigma^2$. Finally, the effect of the fourth term is that particles decay over time, with a half-life given by the decay constant $\tau$. This represents our prior expectation on the lengths of gaps: most are quite short.

The Green's function, $G_\gamma(\vec{x}, \theta; t_1 \mid \vec{u}, \varphi; t_0)$, gives the probability that a particle (traveling with speed $\gamma$) observed at position $\vec{u}$ and direction $\varphi$ at time $t_0$ will later be observed at position $\vec{x}$ and direction $\theta$ at time $t_1$. It is the solution, $P(\vec{x}, \theta; t_1)$, of the Fokker-Planck initial value problem with initial value $P(\vec{x}, \theta; t_0) = \delta(\vec{x} - \vec{u})\,\delta(\theta - \varphi)$. The symmetries of the Green's function are summarized by the following equation:

$$G_\gamma(\vec{x}, \theta; t_1 \mid \vec{u}, \varphi; t_0) = G_1\!\left(R_\varphi(\vec{x} - \vec{u})/\gamma,\ \theta - \varphi;\ t_1 - t_0 \,\middle|\, \vec{0}, 0; 0\right),$$
where the function $R_\varphi(\cdot)$ rotates its argument by $\varphi$ about the origin, $[0, 0]^T$. Two of these symmetries are especially relevant to this article. The first of these is scale invariance:

$$G_\gamma(\vec{x}, \theta; t_1 \mid \vec{u}, \varphi; t_0) = G_1(\vec{x}/\gamma, \theta; t_1 \mid \vec{u}/\gamma, \varphi; t_0).$$
Scale invariance requires that the probability of a particle following a path between two edges be invariant under a transformation that scales both the speed of the particle and the distance between the two edges by the same factor. The second symmetry that concerns us here is time-reversal symmetry:

$$G_\gamma(\vec{x}, \theta; t_1 \mid \vec{u}, \varphi; t_0) = G_\gamma(\vec{u}, \varphi + \pi; t_1 \mid \vec{x}, \theta + \pi; t_0).$$
Figure 1: (Left) An input pattern consisting of N edges. Each edge i has a position $\vec{x}_i$ and an orientation $\theta_i$. (Right) A set of 2N states, S. For each edge i in the input pattern, there are two states, $i$ and $\bar{i}$, in S. The two states associated with an edge have the same position but are opposite in direction. That is, $\vec{x}_i = \vec{x}_{\bar{i}}$, but $\theta_i = \theta_{\bar{i}} + \pi$. These two states represent the two possible directions in which a particle undergoing a random motion can visit an edge of the input pattern.
Time-reversal symmetry requires that the probability of a particle following a path between two edges be invariant under a transformation that reverses the order and directions of the two edges.

2.2 Particles Visiting Edges. Consider an input pattern consisting of N edges. Each edge, i, has a position, $\vec{x}_i$, and an orientation, $\theta_i$. Now construct a set of 2N states, S. For each edge i in the input pattern, there are two states, $i$ and $\bar{i}$, in S. The two states associated with an edge have the same position but are opposite in direction; that is, $\vec{x}_i = \vec{x}_{\bar{i}}$ but $\theta_i = \theta_{\bar{i}} + \pi$. These two states represent the two possible directions in which a particle undergoing a random motion can visit an edge of the input pattern. Henceforward, we will refer to states simply as edges. (See Figure 1.)

Assuming that the shape of contours in the world can be modeled as the trajectories of particles undergoing random motions, we seek answers to questions like these: What is the probability that a contour begins at edge i and ends at edge j? What is the probability that a contour begins at edge i, reaches edge j, and contains $n - 1$ other edges from the set S? To begin, we use the Green's function to define an expression for the conditional probability,

$$P(j \mid i; t_1 - t_0) = G_1(\vec{x}_j, \theta_j; t_1 \mid \vec{x}_i, \theta_i; t_0),$$
that a particle (traveling with unit speed) observed at edge i at some unspecified time will subsequently be observed at edge j after time $t_1$ has elapsed. Integrating over all possible visitation times for edge j yields

$$P(j \mid i) = \int_0^\infty dt_1\; P(j \mid i; t_1).$$

This expression gives the probability that a particle observed at edge i will later be observed at edge j. Because there are no intermediate edges, we term this a path of length $n = 1$ joining i and j. In contrast, a path of length $n = 2$ joining i and j is a path that begins at edge i, next visits some unspecified intermediate edge k, and then ends at edge j. In addition to integrating over all possible visitation times, $t_2$, for edge j, the expression for $P^{(2)}(j \mid i)$ requires summing the probabilities of paths through all possible intermediate edges, $k_1$, and integrating over all possible visitation times, $t_1$, for edge $k_1$:

$$P^{(2)}(j \mid i) = \int_0^\infty dt_2 \int_0^{t_2} dt_1 \sum_{k_1} P(j \mid k_1; t_2 - t_1)\, P(k_1 \mid i; t_1),$$
where the limits of integration reflect the fact that $t_2$ is constrained to be greater than $t_1$. By extending the pattern, we can define a path of length n joining i and j to be a path that begins at edge i, next visits $n - 1$ unspecified edges, and then ends at edge j. The expression for $P^{(n)}$ requires $n - 1$ sums over the set of possible intermediate edges and $n$ integrals over visitation times:

$$P^{(n)}(j \mid i) = \int_0^\infty dt_n \cdots \int_0^{t_2} dt_1 \sum_{k_{n-1}} \cdots \sum_{k_1} P(j \mid k_{n-1}; t_n - t_{n-1}) \cdots P(k_1 \mid i; t_1),$$
where the limits of integration reflect the fact that $t_m$ is constrained to be greater than $t_{m-1}$ for $1 \le m \le n$. By reversing the order in which the integrals are evaluated, so that the first integral evaluated is over $t_n$ and the last evaluated is over $t_1$, and making the necessary changes in the limits of integration, we get the following expression:

$$P^{(n)}(j \mid i) = \int_0^\infty dt_1 \cdots \int_{t_{n-1}}^\infty dt_n \sum_{k_{n-1}} \cdots \sum_{k_1} P(j \mid k_{n-1}; t_n - t_{n-1}) \cdots P(k_1 \mid i; t_1).$$

We now substitute $t_m$ for $t_m - t_{m-1}$, for $1 \le m \le n$, and move the integrals inside the expression, so that each integral is immediately to the left of the conditional probability involving its variable of integration:

$$P^{(n)}(j \mid i) = \int_0^\infty dt_n \cdots \int_0^\infty dt_1 \sum_{k_{n-1}} \cdots \sum_{k_1} P(j \mid k_{n-1}; t_n) \cdots P(k_1 \mid i; t_1)$$
Figure 2: The matrix P represents the transition probabilities between pairs of edges. In general, $P(i \mid j) \neq P(j \mid i)$; that is, P is not symmetric. However, P possesses another form of symmetry: $P(i \mid j) = P(\bar{j} \mid \bar{i})$ and $P(j \mid i) = P(\bar{i} \mid \bar{j})$. This is termed time-reversal symmetry.
$$P^{(n)}(j \mid i) = \sum_{k_{n-1}} \cdots \sum_{k_1} \int_0^\infty dt_n\; P(j \mid k_{n-1}; t_n) \cdots \int_0^\infty dt_1\; P(k_1 \mid i; t_1)$$

$$= \sum_{k_{n-1}} \cdots \sum_{k_1} P(j \mid k_{n-1}) \cdots P(k_1 \mid i).$$
The result is an expression for $P^{(n)}(j \mid i)$, the probability of a length n path joining i and j, purely in terms of probabilities of length one paths. Let us now define a $2N \times 2N$ matrix, P, where $P_{ji} = P(j \mid i)$; that is, $P_{ji}$ is the probability of a length one path between edges i and j. Because $P(j \mid i) \neq P(i \mid j)$, the matrix, P, is not symmetric. However, by time-reversal symmetry, $P(j \mid i) = P(\bar{i} \mid \bar{j})$, where $\theta_{\bar{i}} = \theta_i + \pi$ and $\theta_{\bar{j}} = \theta_j + \pi$. (See Figure 2.) From the above expression, it follows that

$$P^{(n)}(j \mid i) = (P^n)_{ji}.$$

This result is significant because it shows that the probability of a particle with motion governed by a continuous-time random process visiting n edges at n real-valued times can be computed simply by taking the nth power of a matrix. The implication is that our analysis from this point on need not involve the continuous-time random process; we need not write expressions involving integrals over continuous times. Rather, our analysis can be based entirely on the discrete-time random process with transition probabilities
specified by the matrix, P. Furthermore, any expressions we derive based on the discrete-time process will also apply to the underlying continuous-time process.

3 A Discrete-State, Discrete-Time Random Process

3.1 Edge Saliency. In section 2, we derived an expression for the probability that a particle moving with constant speed in a direction given by a Brownian motion will travel from edge i to edge j and visit $n - 1$ intermediate edges. Significantly, this expression involved only the probabilities of length one paths:

$$P(j \mid i) = \int_0^\infty P(j \mid i; t)\,dt.$$

In effect, we reduced the problem of computing visitation probabilities for a continuous-time random process to the more tractable problem of computing visitation probabilities for a discrete-time random process. In the discrete-time process, the probability of a path of length n between edges i and j can be computed using the following recurrence equation:

$$P^{(n)}(j \mid i) = \sum_k P(j \mid k)\,P^{(n-1)}(k \mid i).$$
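The recurrence, and the reduction of $P^{(n)}(j \mid i)$ to the matrix power $(P^n)_{ji}$, can be checked numerically. The pure-Python sketch below uses a small hypothetical transition matrix (the values are illustrative, not derived from the Green's function) and compares the nested-sum definition of a length-two path with the corresponding entry of $P^2$:

```python
# Toy 2N x 2N transition matrix for N = 2 edges (values are illustrative,
# not derived from the Fokker-Planck Green's function).
P = [[0.02, 0.10, 0.05, 0.01],
     [0.10, 0.02, 0.01, 0.05],
     [0.05, 0.01, 0.02, 0.10],
     [0.01, 0.05, 0.10, 0.02]]

def matmul(A, B):
    """Ordinary matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(A, n):
    """n-th power of a square matrix (n >= 1)."""
    R = A
    for _ in range(n - 1):
        R = matmul(R, A)
    return R

# P^(2)(j | i): sum over one intermediate edge k1 of P(j|k1) P(k1|i),
# exactly the recurrence with n = 2.
def P2(j, i):
    return sum(P[j][k1] * P[k1][i] for k1 in range(len(P)))

Psq = matpow(P, 2)
assert abs(P2(3, 0) - Psq[3][0]) < 1e-12  # nested sum equals matrix power
```

The same identity holds for every entry and every path length, which is what licenses replacing the continuous-time integrals with powers of a single matrix.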
This equation allows us to define an expression for the relative number of contours that visit edge i and eventually return to edge i:

$$c_i = \lim_{n \to \infty} \frac{P^{(n)}(i \mid i)}{\sum_j P^{(n)}(j \mid j)}.$$

This quantity, which we term the edge saliency, is the relative number of closed contours through edge i. To evaluate the above expression, we first divide P by its largest real positive eigenvalue, $\lambda$, allowing us to distribute the limit over the numerator and denominator. This yields

$$c_i = \frac{\lim_{n \to \infty} \left(P/\lambda\right)^n_{ii}}{\lim_{m \to \infty} \sum_j \left(P/\lambda\right)^m_{jj}}.$$

To evaluate the numerator and denominator, we observe that P is a positive matrix, and therefore, by Perron's theorem (see Golub & Van Loan, 1996), there is a largest positive real eigenvalue, $\lambda$; that is, $\lambda > |\mu|$ for all eigenvalues, $\mu$, except $\mu = \lambda$. It follows that

$$\lim_{n \to \infty} \left(\frac{P}{\lambda}\right)^n = \frac{s\bar{s}^T}{\bar{s}^T s},$$

where s and $\bar{s}$ are the right and left eigenvectors of P with largest positive real eigenvalue, that is, $\lambda s = Ps$ and $\lambda \bar{s} = P^T \bar{s}$. Because of the time-reversal
symmetry of P, the right and left eigenvectors are related by a permutation that exchanges values associated with the same edge but opposite directions: $\bar{s}_i = s_{\bar{i}}$. Applying this result to the denominator of the expression for $c_i$ yields

$$\lim_{m \to \infty} \sum_j \left(\frac{P}{\lambda}\right)^m_{jj} = \sum_j \frac{(s\bar{s}^T)_{jj}}{\bar{s}^T s} = 1.$$

Consequently, the saliency of edge i is given by the numerator:

$$c_i = \frac{(s\bar{s}^T)_{ii}}{\bar{s}^T s} = \frac{s_i s_{\bar{i}}}{\sum_j s_j s_{\bar{j}}}.$$
3.2 Link Saliency and Markov Chains. Generalizing the notion of edge saliency, it is possible to compute the relative number of contours that begin at edge i, immediately visit edge j, and then eventually return to i:

$$C_{ji} = \lim_{n \to \infty} \frac{P^{(n-1)}(i \mid j)\,P(j \mid i)}{\sum_k P^{(n)}(k \mid k)}.$$

This quantity, which we term the link saliency, is the relative number of closed contours that visit edges i and j in succession. To solve this expression, we first divide the numerator and denominator by $\lambda^n$ and then take separate limits in the numerator and the denominator, yielding

$$C_{ji} = \frac{\lim_{n \to \infty} \left(P/\lambda\right)^{n-1}_{ij}\left(P/\lambda\right)_{ji}}{\lim_{m \to \infty} \sum_k \left(P/\lambda\right)^m_{kk}}.$$

After evaluating the limits as before, we get the following expression for link saliency:

$$C_{ji} = \frac{\bar{s}_j\,P(j \mid i)\,s_i}{\lambda\,\bar{s}^T s}.$$

It is easy to verify that the number of closed contours entering edge i equals the number of closed contours leaving edge i:

$$\sum_j C_{ji} = c_i = \sum_k C_{ik}.$$

This establishes that closed contours are conserved at edges. Since closed contours are conserved, it is possible to treat them as Markov chains. By dividing the joint probability of a closed contour visiting edges i and j in succession by the probability of a closed contour visiting i, we get a conditional probability,

$$M(j \mid i) = \frac{C_{ji}}{c_i} = \frac{\bar{s}_j\,P(j \mid i)}{\lambda\,\bar{s}_i},$$

equal to the probability that a closed contour will visit edge j given that it has just visited i. Unlike the matrix P, the matrix M is stochastic, that is, $\sum_j M_{ji} = 1$, and c is the eigenvector of M with eigenvalue equal to one:

$$c = Mc.$$
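As a numerical aside, both claims about M (columns summing to one, and c being its fixed point) can be verified directly. The construction below uses an illustrative positive matrix and power iteration for the eigenvectors; it is a sketch, not the authors' implementation:

```python
# Illustrative positive transition matrix (not derived from the model).
P = [[0.02, 0.10, 0.05, 0.20],
     [0.06, 0.02, 0.20, 0.05],
     [0.05, 0.20, 0.02, 0.06],
     [0.20, 0.05, 0.10, 0.02]]
n = len(P)

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

def power_iter(A, iters=1000):
    """Dominant eigenpair of a positive matrix by power iteration."""
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        w = matvec(A, v)
        lam = sum(w)
        v = [x / lam for x in w]
    return lam, v

lam, s = power_iter(P)                                # right eigenvector
_, s_bar = power_iter([list(r) for r in zip(*P)])     # left eigenvector

# M(j|i) = s_bar_j P(j|i) / (lam s_bar_i): the closed-contour Markov chain.
M = [[s_bar[j] * P[j][i] / (lam * s_bar[i]) for i in range(n)]
     for j in range(n)]

# Columns of M sum to one, because lam * s_bar = P^T s_bar.
for i in range(n):
    assert abs(sum(M[j][i] for j in range(n)) - 1.0) < 1e-6

# c is the eigenvector of M with eigenvalue one: c = Mc.
Z = sum(s[k] * s_bar[k] for k in range(n))
c = [s[k] * s_bar[k] / Z for k in range(n)]
Mc = matvec(M, c)
assert all(abs(Mc[k] - c[k]) < 1e-6 for k in range(n))
```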
This is consistent with our claim that $c_i$ is the probability of a closed contour through edge i.

3.3 Saliency of Closed Contours. In our approach, the magnitude of the largest positive real eigenvalue of P is related to the saliency of the most salient closed contour. To develop some intuition for the meaning of the eigenvalue and its relationship to contour saliency, it will be useful to consider an idealized situation. We know from linear algebra that the eigenvalues of P are solutions to the equation $\det(P - \lambda I) = 0$. Now consider a closed contour, C, threading n edges. The probability that a contour will visit edge $C_{(i+1) \bmod n}$, given that it has just visited edge $C_i$, equals $P(C_{(i+1) \bmod n} \mid C_i)$. Assuming that the probability of a contour joining edges $C_i$ and $C_j$ is negligible for nonadjacent i and j (i.e., $P_{ji} = P(C_j \mid C_i)$ when $j = (i+1) \bmod n$ and $P_{ji} = 0$ otherwise), then

$$\lambda(C) = \left(\prod_{i=1}^{n} P(C_{(i+1) \bmod n} \mid C_i)\right)^{1/n}$$

satisfies $\det(P - \lambda I) = 0$. This is the geometric mean of the transition probabilities in the closed path. Equivalently, minus one times the logarithm of the eigenvalue equals the average transition energy:

$$-\ln \lambda(C) = -\sum_{i=1}^{n} \ln P(C_{(i+1) \bmod n} \mid C_i)\,/\,n.$$
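For the idealized cyclic case, it is easy to verify numerically that the geometric mean satisfies the characteristic equation. The transition probabilities below are hypothetical values for a three-edge closed contour:

```python
import math

# Hypothetical transition probabilities around a 3-edge closed contour.
p = [0.5, 0.2, 0.3]          # p[i] = P(C_{(i+1) mod 3} | C_i)
n = len(p)

# Cyclic transition matrix: probability mass only between consecutive edges.
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    P[(i + 1) % n][i] = p[i]

# Geometric mean of the transition probabilities in the closed path.
lam = math.prod(p) ** (1.0 / n)

def det3(A):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
          - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
          + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

# lam satisfies det(P - lam I) = 0: for a cycle, the characteristic
# polynomial is -lam^3 + p0 p1 p2, which vanishes at the geometric mean.
A = [[P[i][j] - (lam if i == j else 0.0) for j in range(n)]
     for i in range(n)]
assert abs(det3(A)) < 1e-12
```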
Because maximizing $\lambda$ minimizes the average transition energy, there is a close relationship between the most salient closed contour and the minimum mean weight cycle of the directed graph with weight matrix $-\ln P$.

Using psychophysical methods, Elder and Zucker (1994) quantified the effect that the distribution and size of boundary gaps have on contour closure. They measured reaction time in a preattentive search task and found that it was well modeled by the square root of the sum of the squares of the gap lengths. They assumed that reaction time is inversely related to the degree of contour closure. We note that for stimuli consisting of relatively few edges of negligible length separated by large gaps, $\lambda$ is maximized when the edges are equidistant. Also, the decrease in saliency due to one large gap is much greater than the decrease due to many small gaps. Both of these properties are consistent with the observations of Elder and Zucker (1994).

3.4 Stochastic Completion Fields. Finally, given s and $\bar{s}$, it is possible to compute the relative number of closed contours at an arbitrary position and
direction in the plane, that is, to compute the stochastic completion field. Let $g = (\vec{u}, \varphi)$ be an arbitrary position and direction in the plane; then

$$c_g = \lim_{n \to \infty} \frac{\sum_i \sum_j P^{(n-1)}(i \mid j)\,P(j \mid g)\,P(g \mid i)}{\sum_k P^{(n)}(k \mid k)}$$

represents the probability that a contour first visits edge i, then passes through position $\vec{u}$ in direction $\varphi$, next visits edge j, then visits an additional $n - 1$ edges, and finally returns to i. Dividing numerator and denominator by $\lambda^n$ yields

$$c_g = \lim_{n \to \infty} \sum_i \sum_j \left[\frac{(P/\lambda)^{n-1}_{ij}}{\sum_k (P/\lambda)^n_{kk}}\right] \cdot \left[\frac{P(j \mid g)\,P(g \mid i)}{\lambda}\right].$$

Taking separate limits in the numerator and the denominator results in

$$c_g = \sum_i \sum_j \left[\left(\frac{s_i \bar{s}_j}{\sum_k s_k \bar{s}_k}\right) \cdot \left(\frac{P(j \mid g)\,P(g \mid i)}{\lambda}\right)\right],$$

which can be rearranged to yield

$$c_g = \frac{1}{\lambda\,\bar{s}^T s}\; \underbrace{\sum_i P(g \mid i)\,s_i}_{\text{source field}} \cdot \underbrace{\sum_j P(j \mid g)\,\bar{s}_j}_{\text{sink field}}.$$

This expression gives the relative probability that a closed contour will pass through g, an arbitrary position and direction in the plane. Note that this is a natural generalization of the factorization of the stochastic completion field into the product of source and sink fields described by Williams and Jacobs (1997a). For this reason, we call the components of s and $\bar{s}$ the eigensources and eigensinks of the stochastic completion field. The crucial difference is that we now know how to weight the contribution of each edge to the stochastic completion field.

4 A Continuous-State, Discrete-Time Random Process

The approach just outlined suffers from several limitations. First, it assumes that we have perfect knowledge of the positions and directions of the edges that serve as the input to the problem. This is unnecessarily restrictive. Second, because the vectors and matrices (e.g., s and P) are specific to the configuration of input edges, it is not obvious how the computations we have described can be translated into brainlike representations and algorithms, that is, parallel operations in a finite basis.

In order to address both of these limitations, we can generalize our approach by considering the distribution of closed contours, c, to be a function
of position and direction in the plane, $c(\vec{u}, \varphi)$. While previously the input took the form of a set of edges, with exact knowledge of the edges' positions and orientations, now the input takes the form of a probability density function (pdf), $b(\vec{x}, \theta)$, which represents the probability that an edge exists at position $\vec{x}$ and orientation $\theta$. We refer to this pdf as the input bias function. Instead of a discrete-state and discrete-time random process with transition probabilities represented by a matrix, P, we have a continuous-state and discrete-time random process with transition probabilities represented by a linear operator, $P(\vec{u}, \varphi \mid \vec{x}, \theta)$. The expression for the eigensources of the stochastic completion field then becomes

$$\lambda s(\vec{x}, \theta) = \iiint_{\mathbb{R}^2 \times S^1} d\vec{u}\,d\varphi\; Q(\vec{x}, \theta \mid \vec{u}, \varphi)\,s(\vec{u}, \varphi),$$

where

$$Q(\vec{x}, \theta \mid \vec{u}, \varphi) = b(\vec{x}, \theta)^{\frac{1}{2}}\,P(\vec{x}, \theta \mid \vec{u}, \varphi)\,b(\vec{u}, \varphi)^{\frac{1}{2}},$$

and $s(\vec{x}, \theta)$ is the eigenfunction of Q with largest positive real eigenvalue $\lambda$. The input bias function, $b(\cdot)$, is distributed equally between the left and right sides of $P(\cdot)$ to preserve the time-reversal symmetry of $Q(\cdot)$. Consequently, left and right eigenfunctions of $Q(\cdot)$ with equal eigenvalue are related through a reversal symmetry, $\bar{s}(\vec{x}, \theta) = s(\vec{x}, \theta + \pi)$. Finally, the expression for the stochastic completion field itself can be generalized in the same way:

$$c(\vec{u}, \varphi) = \frac{1}{\lambda \langle s, \bar{s} \rangle} \iiint_{\mathbb{R}^2 \times S^1} d\vec{x}\,d\theta\; P(\vec{u}, \varphi \mid \vec{x}, \theta)\,b(\vec{x}, \theta)^{\frac{1}{2}}\,s(\vec{x}, \theta)\; \cdot \iiint_{\mathbb{R}^2 \times S^1} d\vec{x}'\,d\theta'\; P(\vec{x}', \theta' \mid \vec{u}, \varphi)\,b(\vec{x}', \theta')^{\frac{1}{2}}\,\bar{s}(\vec{x}', \theta'),$$

where $\bar{s}(\vec{x}, \theta) = s(\vec{x}, \theta + \pi)$, $\mathbb{R}^2 \times S^1$ is the space of positions in the plane and directions on the circle, and $\langle s, \bar{s} \rangle = \iiint_{\mathbb{R}^2 \times S^1} d\vec{x}\,d\theta\; s(\vec{x}, \theta)\,\bar{s}(\vec{x}, \theta)$.

Given the above expression for the stochastic completion field, it is clear that the key problem is computing the eigenfunction with the largest positive real eigenvalue. To accomplish this, we can use the well-known power method (see Golub & Van Loan, 1996). In this case, the power method involves repeated application of the linear operator $Q(\cdot)$ to the function $s(\cdot)$, followed by normalization:

$$s^{(n+1)}(\vec{x}, \theta) = \frac{\iiint_{\mathbb{R}^2 \times S^1} d\vec{u}\,d\varphi\; Q(\vec{x}, \theta \mid \vec{u}, \varphi)\,s^{(n)}(\vec{u}, \varphi)}{\iiint_{\mathbb{R}^2 \times S^1} d\vec{u}\,d\varphi\; s^{(n)}(\vec{u}, \varphi)}.$$

In the limit, as n gets very large, $s^{(n+1)}(\vec{x}, \theta)$ converges to the eigenfunction of $Q(\cdot)$ with largest positive real eigenvalue. We observe that the above computation can be considered a continuous-state, discrete-time, recurrent neural network.
5 Scale Invariance

Ideally, we would like our computation to be scale invariant. A computation is scale invariant if scaling the input by a constant factor $\gamma$ produces a corresponding scaling of the output. This property is best summarized by a commutative diagram:

$$\begin{array}{ccc}
b(\vec{x}, \theta) & \xrightarrow{\;C\;} & c(\vec{x}, \theta) \\
\big\downarrow S & & \big\downarrow S \\
b(\vec{x}/\gamma, \theta) & \xrightarrow{\;C\;} & c(\vec{x}/\gamma, \theta),
\end{array}$$

where $b(\cdot)$ is the input, $c(\cdot)$ is the output, $\xrightarrow{C}$ is the computation, and $\xrightarrow{S}$ is the scaling operator. The diagram shows that the output is independent of the order in which the operators are applied.

The only scale-dependent parameter in the computation is the speed of the particles. The lack of scale invariance is due to the fact that this parameter has been arbitrarily set to one. In order to achieve a scale-invariant computation, we need to eliminate this bias. To accomplish this, all speeds must be treated uniformly. Previously, $Q(\cdot)$ was indexed by four arguments: the initial and final particle positions and directions. In the scale-invariant computation, $Q(\cdot)$ will be indexed by six arguments: the initial and final particle positions, directions, and speeds. Because particles have constant speed, $Q(\cdot)$ is block diagonal. This property of $Q(\cdot)$, together with the scale invariance of the Green's function, $G(\cdot)$, allows $Q(\cdot)$ to be defined as follows:

$$Q(\vec{x}, \theta, \gamma_1 \mid \vec{u}, \varphi, \gamma_0) = \begin{cases} b^{\frac{1}{2}}(\vec{x}, \theta)\,P(\vec{x}/\gamma_1, \theta \mid \vec{u}/\gamma_0, \varphi)\,b^{\frac{1}{2}}(\vec{u}, \varphi) & \text{if } \gamma_0 = \gamma_1 \\ 0 & \text{otherwise.} \end{cases}$$
R2 £S1
The eigenfunction of Q( .) with the largest positive real eigenvalue represents the limiting distribution for particles of all speeds. Because Q (.) is block diagonal, its eigenfunctions are zero everywhere outside a single region of constant speed. Consequently, the eigenfunction with the largest positive real eigenvalue of Q( .) can be identied in two steps. First, we nd the largest positive real eigenvalue for each constant speed submatrix: l(c ) D max
RRR
Edh R2 £S1 dx
sN( xE , h ) RRR
RRR
E dw R2 £S1 du
E w , c ) s( u E, w ) Q ( xE , h, c | u,
E dh sN( x E , h ) s( x E, h ) R2 £S1 dx
,
where the eigenvalue is written as a Rayleigh quotient, and the maximum is taken over all eigenfunctions $s(\cdot)$ of the submatrix of $Q(\cdot)$ with constant speed, $\gamma$. Next, we find the speed, $\gamma_{\max}$, which maximizes $\lambda(\gamma)$:

$$\gamma_{\max} = \mathop{\mathrm{argmax}}_{\gamma}\; \lambda(\gamma).$$

The eigenfunction $s(\vec{x}, \theta)$ with eigenvalue $\lambda(\gamma_{\max})$ gives the limiting distribution for particles of all speeds:

$$\lambda(\gamma_{\max})\,s(\vec{x}, \theta) = \iiint_{\mathbb{R}^2 \times S^1} d\vec{u}\,d\varphi\; Q(\vec{x}, \theta, \gamma_{\max} \mid \vec{u}, \varphi, \gamma_{\max})\,s(\vec{u}, \varphi).$$
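The two-step procedure (a dominant eigenvalue per constant-speed block, then an argmax over speeds) can be sketched as follows. The per-speed blocks below are stand-in matrices parameterized by $\gamma$, chosen purely to illustrate the search; a real implementation would rescale the Green's-function propagator by the speed:

```python
def power_iter(A, iters=1000):
    """Dominant eigenvalue of a positive matrix by power iteration."""
    v = [1.0] * len(A)
    lam = 1.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(len(A)))
             for i in range(len(A))]
        lam = sum(w)
        v = [x / lam for x in w]
    return lam, v

def Q_at_speed(gamma):
    """Stand-in constant-speed block of Q; a real implementation would
    build it from the propagator rescaled by gamma."""
    return [[0.1 * gamma * (2.0 - gamma) + 0.01 * (i == j)
             for j in range(4)] for i in range(4)]

# Step 1: largest positive real eigenvalue of each constant-speed block.
speeds = [0.25 * k for k in range(1, 8)]              # candidate speeds
lams = {g: power_iter(Q_at_speed(g))[0] for g in speeds}

# Step 2: the speed maximizing lambda(gamma) selects the scale.
gamma_max = max(lams, key=lams.get)
assert abs(gamma_max - 1.0) < 1e-12   # 0.1*g*(2-g) peaks at g = 1
```

Because the blocks never mix speeds, the search decomposes exactly as the block-diagonal structure of $Q(\cdot)$ suggests: each block is an independent eigenproblem, and only the winning block's eigenfunction survives.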
6 Orientation Selectivity in Primary Visual Cortex

A long-standing problem in visual neuroscience is the emergence of orientation-selective responses in simple cells of primary visual cortex given input from cells in the lateral geniculate that exhibit little or no orientation selectivity. In contrast with the classical pure feedforward model for the emergence of orientation selectivity proposed by Hubel and Wiesel (1962), the authors of a recent review article (Sompolinsky & Shapley, 1997) and two computer simulation studies (Somers, Nelson, & Sur, 1995; Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995) argue for a model with three defining features: (1) a weak orientation bias provided by excitatory input from the lateral geniculate, (2) intracortical excitatory connections between simple cells with similar orientation preferences, and (3) intracortical inhibitory connections between simple cells without regard to orientation preference. These authors suggest that the weak orientation bias provided by excitatory input from the lateral geniculate is amplified by intracortical excitatory connections between simple cells with similar orientation preference. The role of the intracortical inhibition is to prevent the level of activity due to the intracortical excitation from growing unbounded. The principal contribution of these recent studies is a unified explanation of the many different (and sometimes contradictory) experimental findings related to the emergence of orientation selectivity in primary visual cortex. However, none of these authors considers the functional significance of orientation selectivity, that is, what purpose it serves in the larger context of human visual information processing. Stated differently, is orientation selectivity an end in itself? Or can it be understood only as an emergent property of a higher-level visual computation, such as contour completion?
To our knowledge, the first model of orientation selectivity in visual cortex that differed significantly from the original Hubel and Wiesel feedforward model was described in Parent and Zucker (1989). Like the more recent and detailed integrate-and-fire models described in Somers et al. (1995) and Ben-Yishai et al. (1995), Parent and Zucker considered orientation selectivity to be an end in itself. Unlike these recent models, Parent
and Zucker's primary motivations were computational. In the relaxation labeling network they describe, crude local estimates of tangent and curvature (i.e., the initial states of simple cells with and without endstopping) are sharpened by the activity of recurrent excitatory connections representing geometric constraints between position, tangent, and curvature. The magnitude of the network state vector is normalized at every time step by dividing by the sum of its components.¹ It can be seen that this divisive normalization plays the same role as the nonspecific inhibition in the model of Somers et al. (1995). Consequently, we see that there is a strong relationship between Parent and Zucker's model and more recent models of orientation selectivity in primary visual cortex.

In this article, we do not consider orientation selectivity to be an end in itself; rather, we consider it to be an emergent property of a higher-level visual computation devoted to contour completion. Unlike Parent and Zucker (1989), we did not specifically intend to model the emergence of orientation selectivity in primary visual cortex. Instead, our intention was to formulate a computational theory level (Marr, 1982) account of contour completion; orientation selectivity is simply a side effect. Our specific hypothesis is that one of the major goals of early visual processing is to compute a scale-invariant distribution of closed contours, $c(\cdot)$, consistent with weak constraints on position and direction derived by linear filtering, $b(\cdot)$. We termed this distribution the stochastic completion field and in the previous section described a continuous-state, discrete-time neural network for computing it. We have demonstrated experimentally (see the next section) that the distribution of $s(\cdot)$, the eigensources of $c(\cdot)$, can be highly nonisotropic, even for isotropic $b(\cdot)$. We now show that the neural network that computes $s(\cdot)$
is consistent with recent hypotheses concerning the emergence of orientation selectivity in primary visual cortex.

The state of the neural network at time t is given by $s^{(t)}(\cdot)$, which is a function of $\mathbb{R}^2 \times S^1$, the continuous space of positions and directions. In this article, we do not address the problem of how $s^{(t)}(\cdot)$ can be represented as a weighted sum of a fixed set of basis functions, that is, receptive fields. Obviously this needs to be done before we can claim to have a complete account of the computation at the algorithm and representation level (Marr, 1982).² Nevertheless, we believe that it is often best first to describe the
¹ For an introduction to relaxation labeling, see Rosenfeld, Hummel, and Zucker (1976).
² Although beyond the scope of this article, the problem of representing continuous (but band-limited) functions of position and direction using a finite set of basis functions is one we are actively working on. For example, in Zweck and Williams (2000), we consider the problem of computing the stochastic completion field using parallel operations in a finite basis subject to the constraint that the result be invariant under rotations and translations of the input pattern. This is accomplished, in part, by generalizing the notions of steerability and shiftability of basis functions introduced in Simoncelli, Freeman, Adelson, and Heeger (1992).
Lance R. Williams and Karvel K. Thornber
Figure 3: Thin solid lines indicate feedforward connections from the LGN, which provide a weak orientation bias, that is, b(i), to simple cells in V1, that is, s(i). Solid lines with arrows indicate orientation-specific intracortical excitatory connections, P(i | j). Dashed lines with arrows indicate orientation-nonspecific intracortical inhibitory connections. Thick solid lines indicate feedforward connections between V1 and V2, that is, c(i).
computation in the continuum (e.g., Williams & Jacobs, 1997a), and by doing so temporarily avoid the issue of sampling altogether.³ Recall that the fixed point of the neural network we described is the eigenfunction with the largest positive real eigenvalue of the linear operator Q(·). The linear operator Q(·) is the composition of the input-independent linear operator P(·) and the input-dependent linear operator B(·). The dynamics of the neural network are derived from the update equation of the standard power method for computing eigenvectors. It is useful to draw an analogy between our neural network for contour completion and the models for the emergence of orientation selectivity in primary visual cortex described by Somers et al. (1995) and Ben-Yishai et al. (1995). (See Figure 3.) First, we can identify s(·) with simple cells in V1 and the input bias function b(·), which modulates s(·) in the numerator of the update equation, with the feedforward excitatory connections from the lateral geniculate. Second, we can identify P(·) with the intracortical excitatory connections that Somers et al. hypothesize are primarily responsible for the emergence of orientation selectivity in V1. As in the model of Somers et al., these connections are highly specific and mainly target cells of similar orientation

³ In this respect, the model of Parent and Zucker (1989) is instructive. Although their model was a major source of inspiration for our own, we believe that it (unnecessarily) confounds the computational-theory-level goal of estimating tangent and curvature everywhere with algorithm- and representation-level details related to discrete sampling.
Orientation, Scale, and Discontinuity
Figure 4: Visualization of ∫_{S¹} dθ P(x, y, θ | 0, 0, 0) computed using the analytic expression from Thornber and Williams (1996). This is a rendering of the kernel of the hypothesized intracortical excitatory connections, integrated over the θ dimension. Displayed values are scaled by a factor of 10⁵.
preference (see Figures 4 and 5). Third, we identify the denominator of the update equation with the nonspecific intracortical inhibitory connections that Somers et al. hypothesize keep the level of activity within bounds. Because the purpose of the denominator is to normalize s(·), it plays the same role as the denominator in the relaxation labeling update equation (Rosenfeld et al., 1976) and might be implemented using a mechanism similar to the divisive inhibition mechanism proposed by Heeger (1992). Finally, we identify c(·), the stochastic completion field, with the population of cells in V2 described by von der Heydt, Peterhans, and Baumgartner (1984). There is obviously a huge gap in level of detail between the continuous-state, discrete-time neural network we describe and the integrate-and-fire simulations of Somers et al. (1995) and Ben-Yishai et al. (1995). For this reason, the above discussion must be regarded as highly speculative. However, we would like to stress that, unlike these recent theoretical studies, our neural network implements a well-defined (and nontrivial) computation in the sense of Marr (1982). For this reason, we believe our top-down approach complements the bottom-up approach pursued by others.

Figure 5: Visualization of ∫_ℝ dx P(x, y, θ | 0, 0, 0) computed using the analytic expression from Thornber and Williams (1996). This is a rendering of the kernel of the hypothesized intracortical excitatory connections, integrated over the x dimension. Displayed values are scaled by a factor of 10⁵.

7 Experiments

7.1 Analytic Solution of Conditional Probabilities. The conditional probability, P(i | j), is the probability that a particle, moving with constant speed in a direction given by a Brownian motion, will travel from edge i to edge j. In order to test the computational theory described in the previous sections, we need a fast and efficient method for computing these probabilities. In prior work, these probabilities were computed using Monte Carlo simulation (Williams & Jacobs, 1997a) and by numerical solution of the Fokker-Planck equation (Williams & Jacobs, 1997b). Although there (currently) is no analytic solution for the Fokker-Planck equation described by Mumford (1994), there is a similar equation for which an exact analytic solution exists (Thornber & Williams, 1996). This equation governs the motion of particles with position and velocity (instead of position and direction). Straight lines are the base trajectories of these particles. Their velocities are modified by random impulses with a zero-mean distribution and variance σ_g², acting at Poisson-distributed times with rate R_g. If the initial and final velocities are conditioned to be equal, this random process can be used to compute stochastic completion fields that are virtually indistinguishable from those computed using the Mumford random process. The analytic expression for P(i | j), based on the process described in Thornber and Williams (1996), is given in the appendix.
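For concreteness, the steepest-descent evaluation of the affinity described in the appendix can be sketched as follows. This is a sketch, not a reference implementation: the function and variable names are ours, the test edges are ours, and the weighting-factor bracket is the one a direct Laplace approximation of the stated integrand yields.

```python
import numpy as np

def affinity(xi, yi, thi, xj, yj, thj, T=0.0005, tau=9.5, gamma=1.0):
    """Steepest-descent approximation of the affinity P(j|i) between two
    directed edges, following the appendix: T is the diffusion
    coefficient, tau the particle half-life, gamma the speed."""
    xji, yji = xj - xi, yj - yi
    a = (2.0 + np.cos(thj - thi)) / 3.0
    b = (xji * (np.cos(thj) + np.cos(thi))
         + yji * (np.sin(thj) + np.sin(thi))) / gamma
    c = (xji ** 2 + yji ** 2) / gamma ** 2

    def P(t):
        # P(j|i;t) = 3 exp[-(6/(T t^3))(a t^2 - b t + c)] exp(-t/tau)
        #            / (pi sqrt(3 T^3 t^7))
        return (3.0 * np.exp(-(6.0 / (T * t ** 3)) * (a * t ** 2 - b * t + c))
                * np.exp(-t / tau) / (np.pi * np.sqrt(3.0 * T ** 3 * t ** 7)))

    # Real positive roots of -7 t^3/4 + 3(a t^2 - 2 b t + 3 c)/T = 0;
    # if several exist, keep the one maximizing P.
    roots = np.roots([-7.0 / 4.0, 3.0 * a / T, -6.0 * b / T, 9.0 * c / T])
    ts = [r.real for r in roots
          if abs(r.imag) < 1e-9 * max(1.0, abs(r)) and r.real > 0]
    topt = max(ts, key=P)
    # Laplace-approximation weighting factor.
    F = np.sqrt(2.0 * np.pi * topt ** 5
                / (12.0 * (3.0 * c - b * topt) / T + 7.0 * topt ** 3 / 2.0))
    return F * P(topt)

# Affinity decays with distance and with direction reversal:
straight = affinity(0, 0, 0.0, 1, 0, 0.0)
farther = affinity(0, 0, 0.0, 2, 0, 0.0)
reverse = affinity(0, 0, 0.0, 1, 0, np.pi)
```

The cubic is solved with a companion-matrix root finder rather than the closed-form method cited in the text; either works, since only the real positive roots are needed.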
Figure 6: (Top left) The eight position constraints (the dots) that define the test configuration. Neither the order of traversal, the directions, nor the speed is specified a priori. (Top right) The right eigenvector, s(γ_max), where γ_max = 0.149, represents the limiting distribution of the random process over all spatial scales. (Bottom left) The left eigenvector, s̄(γ_max), represents the time-reversed distribution. (Bottom right) The vector s(γ_max) s̄(γ_max) represents the magnitude of the stochastic completion field at the locations of the dots.
7.2 Eight Dot Circle. Given eight dots spaced uniformly around the perimeter of a circle of diameter d = 16, we would like to find the relative number of closed contours that visit each dot and in each direction. We would also like to compute the corresponding completion field (see Figure 6, top left). Neither the order of traversal, the directions θ, nor the speed γ is specified a priori. Accordingly, the position and direction bias, b_dot, is purely isotropic:

    b_dot(x, θ) = Σ_i δ(x − x_i),

where x denotes a position in the plane.
The isotropy of b_dot(x, θ) can be verified by noting the lack of θ dependence on the right side of the equation. In our neural model, this represents the assumption that the LGN provides no information about orientation to simple cells in V1. So that all computations can be performed using ordinary vectors and matrices, the functions s(·), P(·), and b(·) are sampled at the locations of the eight dots, x_i, and at N discrete directions in the θ dimension, to form a vector s and matrices P and B:

    s_k = s(x_i, mΔθ)
    P_kl = P(x_i, mΔθ | x_j, nΔθ)
    B_kl = δ_kl

where k = iN + m and l = jN + n, for dots i and j and sampling directions m and n. Since b(·) is isotropic and unweighted, after sampling, B = I. In all of our experiments, we sample the θ dimension at 5 degree intervals. Consequently, there are N = 72 discrete directions and 576 position-direction pairs; that is, P is of size 576 × 576.⁴ To achieve scale invariance, we make s and P functions of γ and solve

    λ(γ) s(γ) = B^{1/2} P(γ) B^{1/2} s(γ) = P s(γ).

In the first experiment, we evaluated λ(γ) over the speed interval [1.1^−1, 1.1^−30] using standard numerical routines and plotted the magnitude of the largest real positive eigenvalue, λ, versus log_{1.1}(1/γ) (see Figure 7). The function reaches its maximum value at γ_max ≈ 1.1^−20. Consequently, the eigenvector s(1.1^−20) represents the limiting distribution over all spatial scales (see Figure 6, top right). The direction-reversed permutation of this eigenvector, s̄(1.1^−20), is shown in Figure 6 (bottom left). This is the eigenvector of Pᵀ with eigenvalue λ(γ_max). The component-wise product of s and s̄ is shown in Figure 6 (bottom right). This vector represents the magnitude of the stochastic completion field at the locations of the dots. As one would expect, orientations tangential to the circle have the greatest magnitude. The magnitude of the stochastic completion field at all other positions in the plane (summed over all directions) is shown in Figure 8. Next, we scaled the test figure by a factor of two (d′ = 32.0) and plotted λ′(log_{1.1}(1/γ)) over the same interval (see Figure 7). We observe that λ′(1.1^(−x+7)) ≈ λ(1.1^(−x)); that is, when plotted using a logarithmic x-axis, the functions are identical except for a translation. It follows that γ′_max ≈ 1.1⁷ × γ_max ≈ 2.0 × γ_max. This confirms the scale invariance of the system: doubling the size of the figure results in a doubling of the selected speed.

⁴ The parameters defining the distribution of completion shapes are T = 0.0005 and τ = 9.5. See Thornber and Williams (1996).
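The scale-selection step (scan the largest positive real eigenvalue over a grid of speeds and keep the maximizer) can be sketched as follows. The transition matrix here is a toy stand-in whose overall gain peaks at a known speed, which is an assumption made purely so the selected scale can be checked; the real P(γ) comes from the analytic expression in the appendix.

```python
import numpy as np

def max_real_eigenvalue(M):
    """Largest positive real eigenvalue of M (None if there is none)."""
    w = np.linalg.eigvals(M)
    real = w.real[np.abs(w.imag) < 1e-9]
    pos = real[real > 0]
    return pos.max() if pos.size else None

def select_scale(transition, speeds):
    """Scan lambda(gamma) over candidate speeds; return the maximizing
    speed and the corresponding dominant eigenvector, normalized."""
    lams = [max_real_eigenvalue(transition(g)) for g in speeds]
    k = int(np.argmax(lams))
    w, V = np.linalg.eig(transition(speeds[k]))
    s = np.real(V[:, np.argmax(w.real)])
    return speeds[k], s / s.sum()

# Toy stand-in for P(gamma): a fixed positive matrix whose gain peaks
# at gamma* = 1.1**-20 (hypothetical, chosen to mimic the text's scan).
rng = np.random.default_rng(1)
P0 = rng.random((6, 6))
gain = lambda g: np.exp(-np.log(g / 1.1 ** -20) ** 2)
speeds = 1.1 ** -np.arange(1, 31)
gamma_max, s = select_scale(lambda g: gain(g) * P0, speeds)
```

Since multiplying a matrix by a scalar scales every eigenvalue by that scalar, the scan here reduces to locating the peak of the gain, which makes the selected scale easy to verify.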
Figure 7: Plot of the magnitude of the maximum positive real eigenvalue, λ, versus x = log_{1.1}(1/γ) for the eight-point test configuration with d = 16.0 (thin line) and d = 32.0 (thick line). Note that speed increases from right to left.
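The bottom panels of Figure 6 can be reproduced schematically: the completion-field magnitude at the sampled position-direction pairs is the component-wise product of the right eigenvector of the operator and its left eigenvector (the eigenvector of the transpose). For the true completion operator, the left eigenvector is the direction-reversed permutation of the right one; in this sketch a random positive matrix stands in for the operator, so we simply compute the left eigenvector from the transpose (an assumption for illustration).

```python
import numpy as np

def dominant(M):
    """Eigenvalue of largest real part and its eigenvector, sum-normalized."""
    w, V = np.linalg.eig(M)
    k = np.argmax(w.real)
    v = np.real(V[:, k])
    return w[k].real, v / v.sum()

rng = np.random.default_rng(2)
Q = rng.random((16, 16))            # stand-in for B^(1/2) P B^(1/2)

lam, s = dominant(Q)                # right eigenvector
lam_bar, s_bar = dominant(Q.T)      # left eigenvector of Q
field = s * s_bar                   # completion-field magnitude
```

Because Q and Qᵀ share a spectrum, both computations return the same dominant eigenvalue, and the product s ⊙ s̄ weights each state by how strongly closed contours both enter and leave it.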
Figure 8: Stochastic completion field due to s(γ_max) for the eight-point circle.

Figure 9: Amodal completion of a partially occluded circle and square (redrawn from Kanizsa, 1979).

7.3 Contours with Corners. The distribution of shapes considered by Mumford (1994) and Thornber and Williams (1996) consists of smooth, short contours. Yet there are many examples in human vision where completion shapes perceived by humans contain discontinuities in orientation (corners). Figure 9 shows a display by Kanizsa (1979). This display illustrates the completion of a circle and a square under a square occluder. The completion of the square is significant because it includes a discontinuity in orientation. Figure 10 shows a continuum of Koffka crosses. When the width of the arms of the cross is increased, observers report that the percept changes from an illusory circle to an illusory square (Sambin, 1974).

Figure 10: An array of Koffka crosses with arms of varying width. Observers report that as the width of the arms increases, the shape of the illusory contour changes from a circle to a square.

In the experiments described in the next section, we did not generalize the distribution described in Mumford (1994) to include contours with corners (e.g., by randomizing the direction of the particle's motion at Poisson-distributed times). Instead, consistent with the experiments in the previous section, we assumed a distribution of completion shapes consisting of straight-line base trajectories modified by random impulses drawn from a mixture of two limiting distributions. The first distribution consists of weak but frequently acting impulses (we call this the gaussian limit). The distribution of these weak impulses has zero mean and variance equal to σ_g². The weak impulses act at Poisson times with rate R_g. The second distribution consists of strong but infrequently acting impulses (we call this the Poisson limit). Here, the magnitude of the random impulses is gaussian distributed with zero mean, but the variance is equal to σ_p² (where σ_p² ≫ σ_g²). The strong impulses act at Poisson times with rate R_p ≪ R_g. Particles decay with half-life equal to a parameter τ. The effect is that particles tend
to travel in smooth, short paths punctuated by occasional orientation discontinuities. (The interested reader is encouraged to consult Thornber and Williams, 2000.)

Figure 11: (a) Koffka cross. (b) Orientation and position constraints in terms of d and w; the line end points are located at (±w/2, ±d/2) and (±d/2, ±w/2). The normal orientation at each end point is indicated by the thick line; the thin lines represent plus or minus one standard deviation (i.e., 12.8 degrees) of the gaussian weighting function.

7.4 Koffka Cross. The Koffka cross stimulus (see Figure 10) has two basic degrees of freedom that we call diameter (d) and arm width (w) (see Figure 11a). We are interested in how the stochastic completion field changes as these parameters are varied. (Recall that observers report that as the width of the arms increases, the shape of the illusory contour changes from a circle to a square; Sambin, 1974.) The end points of the lines comprising the Koffka cross can be used to define a set of position and orientation constraints (see Figure 11b). The position constraints are specified in terms of the parameters d and w. The orientation constraints take the form of a gaussian weighting function that assigns higher probabilities to contours passing through the end points with orientations normal to the lines. Observe that Figure 12a is perceived as a square while Figure 12b is perceived as a circle. Yet the positions of the line end points are the same. It follows that the orientations of the lines affect the percept. We have chosen to model this dependence through the use of a gaussian weighting function that favors contours passing through the end points of the lines in the normal direction. The corresponding input bias function is
    b_end(x, θ) = (1/√(2πσ²)) Σ_i δ(x − x_i) e^{−(θ − θ_i ± π/2)² / 2σ²},

where σ = 12.8 degrees is the standard deviation of the gaussian weighting function. As before, so that all computations can be performed using ordinary vectors and matrices, the functions s(·), P(·), and b(·) are sampled at the locations of the eight line end points, x_i, and at N discrete directions in the θ dimension. The vector s and matrix P are defined as before. However, since b(·) is not isotropic, B is now a diagonal matrix, B_kk = b_end(x_i, nΔθ), where k = iN + n, for line end point i and sampling direction n. To achieve scale invariance, we make s and P functions of γ and solve

    λ(γ) s(γ) = B^{1/2} P(γ) B^{1/2} s(γ) = Q(γ) s(γ),

where P(γ) is the edge-to-edge transition probability matrix for speed γ, λ(γ) is an eigenvalue of Q(γ), and s(γ) is the corresponding eigenvector. Let λ(γ) be the largest positive real eigenvalue of Q(γ), and let γ_max be the scale where λ(γ) is maximized. Then s(γ_max), the eigenvector of Q(γ_max) associated with λ(γ_max), is the limiting distribution over all spatial scales. First, we used a Koffka cross where d = 2.0 and w = 0.5 and evaluated λ(γ) over the speed interval [8.0 × 1.1^−1, 8.0 × 1.1^−80] using standard numerical routines.⁵ The function reaches its maximum value at γ_max ≈ 8.0 × 1.1^−62 (see Figure 13). Observe that the completion field due to the eigenvector s(8.0 × 1.1^−62) is dominated by contours of a predominantly circular shape (see Figure 14, right). We then uniformly scaled the Koffka cross figure by a factor of two (d′ = 4.0 and w′ = 1.0) and plotted λ′(log_{1.1}(1/γ)) over the same interval (see Figure 13). Observe that λ′(8.0 × 1.1^(−x+7)) ≈ λ(8.0 × 1.1^(−x)). As before, this confirms the scale invariance of the system.

⁵ The parameters defining the distribution of completion shapes are T = 0.0005, τ = 9.5, σ_p = 100.0, and R_p = 1.0 × 10^−8. See Thornber and Williams (2000). As an antialiasing measure, the transition probabilities, P(j | i), were averaged over initial conditions modeled as gaussians with variances σ_x² = σ_y² = 0.00024 and σ_θ² = 0.0019.

Figure 12: (a) Typically perceived as a circle. (b) Typically perceived as a square. The positions of the ends of the line segments are the same in both cases.

Figure 13: Plot of the magnitude of the maximum positive real eigenvalue, λ, versus x = log_{1.1}(1/γ) for Koffka crosses with d = 2.0 and w = 0.5 (thin line) and d = 4.0 and w = 1.0 (thick line).

Figure 14: Stochastic completion field due to the eigenvector s(8.0 × 1.1^−62). This is the eigenvector with the maximum positive real eigenvalue for a Koffka cross with d = 2.0 and w = 0.5.

Next, we studied how the relative magnitudes of the local maxima of λ(γ) change as the parameter w is varied. We begin with a Koffka cross where d = 2.0 and w = 0.5 and observe that λ(γ) has two local maxima (see Figure 15). We refer to the larger of these maxima as γ_circle. As previously noted, this maximum is located at approximately 8.0 × 1.1^−62. The second maximum is located at approximately 8.0 × 1.1^−32. When the completion field due to the eigenvector s(8.0 × 1.1^−32) is rendered, we observe that the distribution is dominated by contours of predominantly square shape (see Figure 16a). For this reason, we refer to this local maximum as γ_square. Now consider a Koffka cross where the widths of the arms are doubled but the diameter remains the same: d′ = 2.0 and w′ = 1.0. We observe that λ′(γ) still has two local maxima: one at approximately 8.0 × 1.1^−63 and a second at approximately 8.0 × 1.1^−29 (see Figure 15). When we render the completion fields due to the eigenvectors s′(8.0 × 1.1^−63) and s′(8.0 × 1.1^−29), we find that the completion fields have the same general character as before: the contours associated with the smaller spatial scale (i.e., lower speed) are approximately circular, and those associated with the larger spatial scale (i.e., higher speed) are approximately square (see the lower right and lower left of Figure 16). Accordingly, we refer to the locations of the respective local maxima as γ′_circle and γ′_square. However, what is most interesting is that the relative magnitudes of the local maxima have reversed. Whereas we previously observed that λ(γ_circle) > λ(γ_square), we now observe that λ′(γ′_square) > λ′(γ′_circle). Therefore, the completion field due to the eigenvector s′(γ′_square) [not s′(γ′_circle)!] represents the limiting distribution over all spatial scales. This is consistent with the transition from circle to square reported by human observers when the widths of the arms of the Koffka cross are increased.

Figure 15: Plot of the magnitude of the maximum positive real eigenvalue, λ, versus x = log_{1.1}(1/γ) for Koffka crosses with d = 2.0 and w = 0.5 (thin line) and d = 2.0 and w = 1.0 (thick line).

8 Conclusion

We have improved on a previous model of illusory contour formation by showing how to compute a scale-invariant distribution of closed contours
given position constraints alone. We also used our model to explain a previously unexplained perceptual effect.

Appendix

In this appendix, we give the analytic expression for the conditional probabilities in the pure gaussian case.⁶ We define the affinity, P_ji, between two directed edges i and j to be

    P_ji ≡ P(j | i) = ∫₀^∞ dt P(j | i; t) ≈ F · P(j | i; t_opt),

⁶ For a derivation of a related affinity function, see Sharon, Brandt, and Basri (1997).
Figure 16: Stochastic completion fields for the Koffka cross due to (upper left) s(γ_square), a local optimum for w = 0.5; (upper right) s(γ_circle), the global optimum for w = 0.5; (lower left) s′(γ′_square), the global optimum for w = 1.0; and (lower right) s′(γ′_circle), a local optimum for w = 1.0. These results are consistent with the circle-to-square transition perceived by human subjects when the width of the arms of the Koffka cross is increased.
where P(j | i; t) is the probability that a particle that begins its stochastic motion at (x_i, θ_i) at time 0 will be at (x_j, θ_j) at time t. The affinity between two edges is the value of this expression integrated over stochastic motions of all durations, P(j | i). This integral is approximated analytically using the method of steepest descent. The approximation is the product of P evaluated at the time at which the integrand is maximized (i.e., t_opt) and a weighting factor, F. The expression for P at time t is

    P(j | i; t) = 3 exp[−(6/(T t³))(a t² − b t + c)] · exp(−t/τ) / (π √(3 T³ t⁷))

where

    a = [2 + cos(θ_j − θ_i)] / 3
    b = [x_ji (cos θ_j + cos θ_i) + y_ji (sin θ_j + sin θ_i)] / γ
    c = (x_ji² + y_ji²) / γ²

for x_ji = x_j − x_i and y_ji = y_j − y_i. The parameters T, τ, and γ determine the distribution of shapes (where T = σ_g² is the diffusion coefficient, τ is the particle half-life, and γ is the speed). The expression for P should be evaluated at t = t_opt, where t_opt is real and positive and satisfies the cubic equation

    −7t³/4 + 3(a t² − 2b t + 3c)/T = 0.

If more than one real, positive root exists, then the root maximizing P(j | i; t) is chosen.⁷ Finally, the weighting factor F is

    F = √( 2π t_opt⁵ / [12(3c − b t_opt)/T + 7 t_opt³/2] ).

For our purposes here, we ignore the exp(−t/τ) factor in the steepest-descent approximation for t_opt. We note that by increasing γ, the distribution of contours can be uniformly scaled.

⁷ For a discussion of solving cubic equations, see Press, Flannery, Teukolsky, and Vetterling (1988).

References

Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. (USA), 92, 3844–3848.
Elder, J. H., & Zucker, S. W. (1994). A measure of closure. Vision Research, 34(24), 3361–3370.
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore, MD: Johns Hopkins University Press.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture of the cat's visual cortex. Journal of Physiology (London), 160, 106–154.
Kanizsa, G. (1979). Organization in vision. New York: Praeger.
Kovacs, I., & Julesz, B. (1993). A closed curve is much more than an incomplete one: Effect of closure in figure-ground segmentation. Proc. Natl. Acad. Sci. (USA), 90, 7495–7497.
Marr, D. (1982). Vision. San Francisco: W. H. Freeman.
Mumford, D. (1994). Elastica and computer vision. In C. Bajaj (Ed.), Algebraic geometry and its applications. New York: Springer-Verlag.
Parent, P., & Zucker, S. W. (1989). Trace inference, curvature consistency and curve detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 823–839.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1988). Numerical recipes in C. Cambridge: Cambridge University Press.
Rosenfeld, A., Hummel, R., & Zucker, S. W. (1976). Scene labeling by relaxation operations. IEEE Transactions on Systems, Man, and Cybernetics, 6, 420–433.
Sambin, M. (1974). Angular margins without gradients. Italian Journal of Psychology, 1, 355–361.
Sharon, E., Brandt, A., & Basri, R. (1997). Completion energies and scale. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR '97) (pp. 884–890). San Juan, PR.
Simoncelli, E., Freeman, W., Adelson, E., & Heeger, D. (1992). Shiftable multiscale transforms. IEEE Trans. Information Theory, 38(2), 587–607.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical cells. Journal of Neuroscience, 15, 5448–5465.
Sompolinsky, H., & Shapley, R. (1997). New perspectives on the mechanisms for orientation selectivity. Current Opinion in Neurobiology, 7, 514–522.
Thornber, K. K., & Williams, L. R. (1996). Analytic solution of stochastic completion fields. Biological Cybernetics, 75, 141–151.
Thornber, K. K., & Williams, L. R. (2000). Characterizing the distribution of completion shapes with corners using a mixture of random processes. Pattern Recognition, 33, 543–553.
von der Heydt, R., Peterhans, E., & Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224, 1260–1262.
Williams, L. R., & Jacobs, D. W. (1997a). Stochastic completion fields: A neural model of illusory contour shape and salience. Neural Computation, 9(4), 837–858.
Williams, L. R., & Jacobs, D. W. (1997b). Local parallel computation of stochastic completion fields. Neural Computation, 9(4), 859–881.
Zweck, J. W., & Williams, L. R. (2000). Euclidean group invariant computation of stochastic completion fields using shiftable-twistable functions. In Proc. of the 6th European Conf. on Computer Vision (ECCV '00). Dublin, Ireland.
Received January 19, 1999; accepted October 3, 2000.
NOTE
Communicated by Terence Sanger
A Spike-Train Probability Model

Robert E. Kass and Valérie Ventura
Department of Statistics and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.

Poisson processes usually provide adequate descriptions of the irregularity in neuron spike times after pooling the data across large numbers of trials, as is done in constructing the peristimulus time histogram. When probabilities are needed to describe the behavior of neurons within individual trials, however, Poisson process models are often inadequate. In principle, an explicit formula gives the probability density of a single spike train in great generality, but without additional assumptions, the firing-rate intensity function appearing in that formula cannot be estimated. We propose a simple solution to this problem, which is to assume that the time at which a neuron fires is determined probabilistically by, and only by, two quantities: the experimental clock time and the elapsed time since the previous spike. We show that this model can be fitted with standard methods and software and that it may be used successfully to fit neuronal data.

1 Introduction
Probability distributions of spike trains are used for a variety of theoretical and data-analytical purposes, including determination of information content, Bayesian population coding, testing for spike patterns, and detecting synchrony (Barbieri, Quirk, Frank, Wilson, & Brown, in press; Brown, Frank, Tang, Quirk, & Wilson, 1998; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Sanger, 1996; Oram, Wiener, Lestienne, & Richmond, 1999; Optican & Richmond, 1987; Riehle, Grün, Diesmann, & Aertsen, 1997). The peristimulus time histogram (PSTH) provides an estimate of the time-varying firing rate (or intensity, in the jargon of probability theory), and for Poisson processes the firing rate, in turn, determines the spike-train distribution. When large numbers of trials are combined, the resulting set of pooled spike times essentially follows a Poisson process, due to general limit theory (Daley & Vere-Jones, 1988, theorem 9.2V). Thus, when reasoning only from data pooled across trials, or when the spike trains on individual trials themselves follow a Poisson process, the PSTH (or smoothed versions of it) provides a complete summary of the data from which other probability-based calculations may be made. However, when neuron firing on individual trials is of

Neural Computation 13, 1713–1720 (2001) © 2001 Massachusetts Institute of Technology
interest and the process is non-Poisson, the joint distribution of the spikes cannot be obtained from the PSTH, nor can it be obtained from the PSTH in combination with the distribution of interspike intervals (ISIs). The purpose of this note is to suggest a class of probability models for spike trains that allow recovery of the joint spike-train distributions within individual trials and to report results from fitting these models to neuronal data.

2 Inhomogeneous Markov Interval Processes

Letting N(t) represent the number of spikes that have already occurred at time t on a particular trial, a spike occurs at time t if N(t + Δt) − N(t) = 1 for all sufficiently small increments Δt. In the language of probability theory, the spike times follow a point process, and N(t) is the corresponding counting process (Daley & Vere-Jones, 1988, chap. 13). If s(t) = (s_1, s_2, …, s_{N(t)}) denotes the spike times up to time t, then the probability density of the spike train during an interval [0, T] is

    p(s_1, …, s_n) = exp( −∫₀ᵀ λ(u | s(u)) du ) ∏_{k=1}^{n} λ(s_k | s_1, …, s_{k−1})        (2.1)
where

    λ(t | s(t)) = lim_{Δt→0} P( N(t + Δt) − N(t) = 1 | s(t) ) / Δt

is the conditional intensity of the process and N(T) = n. In principle, this solves the problem of representing the spike trains in terms of a probability distribution. However, when n spikes are observed, the conditional intensity is a function of n + 1 variables (t, s_1, …, s_n), which poses an extremely difficult nonparametric estimation problem. A tractable simplification is to take the conditional intensities to have the form

    λ(t | s(t)) = λ(t, t − s_*(t)),        (2.2)
where s_*(t) is the last spike time preceding t. We call the resulting processes inhomogeneous Markov interval (IMI) processes. When λ(t, t − s_*(t)) is a function only of its first argument (that is, the firing rate does not depend on the occurrence times of previous spikes), we obtain an inhomogeneous (time-varying) Poisson process. The beauty of a Poisson process is that the spike probabilities are determined solely by the time-varying firing rate, or intensity, which may be written as λ(t) and is estimated by the smoothed PSTH. The special case of a homogeneous Poisson process occurs when there is no dependence on experimental clock time, meaning that the firing rate is constant across time. In this case, the ISIs are independent, and all follow the same exponential distribution. A generalization, not requiring the Poisson
assumption, is the class of renewal processes, in which the ISIs all follow the same probability distribution, but it is no longer required to be an exponential distribution. These processes fit into equation 2.2 when λ(t, t − s_*(t)) is a function only of its second argument; that is, the firing rate does not depend on the experimental clock time. An interesting subclass of IMI processes is the class of time-rescaled inhomogeneous versions of renewal processes recently used by Barbieri et al. (in press). These models begin with the fact that an inhomogeneous Poisson spike train is obtained from a sequence of exponential ISIs together with a rescaling of time.¹ A different ISI distribution, such as a gamma distribution, may be used in place of the exponential, and time may again be transformed in the same way. An alternative subclass that may be treated nonparametrically (i.e., that does not require the assumption of a specific ISI distribution) is the class of multiplicative IMI processes,

    λ(t, t − s_*(t)) = λ_1(t) · λ_2(t − s_*(t)),        (2.3)

which have been applied previously by Berry and Meister (1998) and Miller and Mark (1992). Here, the factor λ_1(t) modulates the firing rate only as a function of experimental clock time, while λ_2(t − s_*(t)) represents non-Poisson spiking behavior. Note, however, that the observed ISI distribution is based on both factors. The units of λ(t, t − s_*(t)) are those of firing rate: spikes per unit time. By convention, in our analysis of neuronal data reported below, we will take λ_1(t) also to have units of firing rate, leaving λ_2(t − s_*(t)) dimensionless.

¹ Specifically, the probability density for the inhomogeneous Poisson is given by equation 2.1 when λ(t | s(t)) = λ(t). The inhomogeneous Poisson reduces to a homogeneous Poisson after the change of variables in time from t to τ according to τ(t) = ∫₀ᵗ λ(u) du.

3 Fitting IMI Processes to Data

In general, a spike train in the form of a sequence of "exact" spike times (s_1, s_2, …, s_n) may be approximated by discretizing time into small intervals of length Δt (e.g., Δt = 1 msec) and converting it to a binary sequence of 0s and 1s, with each 1 indicating that a spike occurred within the corresponding time interval. As Brillinger (1988) observed, this is useful statistically because the binary sequence may be fitted using generalized linear models (McCullagh & Nelder, 1989), with maximum likelihood in this generalized form of regression playing a role analogous to least squares in ordinary linear regression. (Formally, it is not hard to show that the likelihood function from equation 2.1 may be approximated by the corresponding binary regression likelihood function.) This implies that IMI processes may be fitted
Robert E. Kass and Valérie Ventura
with standard methods and software. In particular, to fit the multiplicative IMI processes in equation 2.3, we use the additive form

log λ(t, t − s*(t)) = log λ1(t) + log λ2(t − s*(t))    (3.1)
and define suitable functional representations for each of the components log λ1(t) and log λ2(t − s*(t)), which will play a role analogous to explanatory variables in ordinary linear regression. An especially flexible and tractable way to represent the components is with cubic splines. That is, we represent each component log λ1(t) and log λ2(t − s*(t)) as a piecewise cubic in its argument (respectively, t and t − s*(t)), with the cubic pieces joined at "knots" in such a way that the resulting functions are twice continuously differentiable. One may choose the knots by preliminary examination of the data. Then the coefficients of the spline basis elements are determined via maximum likelihood; we have used the function gam in the software S-PLUS (MathSoft Inc., Seattle), as described in Venables and Ripley (2000). The inputs to gam are the discretized binary sequence of spike times, the time arguments t and t − s*(t), the symbolic specification of the additive model in equation 3.1, and the knots. Further details on the use of spline-based regression methods (or regression splines in statistical parlance) for analyzing neuronal data may be found in Olson, Gettner, Ventura, Carta, and Kass (2000) and Ventura, Carta, Kass, Gettner, and Olson (in press), where they are applied in the case of inhomogeneous Poisson processes. To fit the more general IMI processes defined in equation 2.2, terms formed by taking products ("interactions") of the spline representations for log λ1(t) and log λ2(t − s*(t)) may be included in the regression function. In addition, terms may be added to the statistical model that reflect dependence on timing of spikes prior to the most recent one. We have not yet been explicit about the way the intensity in equation 2.2 might vary from trial to trial. The simplest way to allow for excess trial-to-trial variability is to include multiplicative constants for each trial in equation 2.2, which become additive constants on the log scale.
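The recipe in this section (discretize to a binary sequence, then fit the additive log-intensity of equation 3.1 by maximum likelihood) can be sketched in Python. This is a stand-in for illustration, not the authors' S-PLUS gam code: the ground-truth intensities, the truncated-power spline basis, the knot placements, and the Poisson log-link approximation to the binary regression likelihood are all assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.001                              # 1 ms bins
t_grid = np.arange(0.0, 0.4, dt)        # one 400 ms trial

# Hypothetical ground truth: lam(t, e) = lam1(t) * lam2(e), with e = t - s*(t)
lam1 = lambda u: 40.0 + 30.0 * np.sin(2 * np.pi * u / 0.4)   # spikes/s
lam2 = lambda e: 1.0 - np.exp(-e / 0.015)                    # refractoriness

def simulate(n_trials=60):
    """Bernoulli approximation: P(spike in bin) ~ lam(t, e) * dt."""
    T, E, Y = [], [], []
    for _ in range(n_trials):
        last = -1.0                      # last spike "long ago"
        for u in t_grid:
            e = u - last
            y = rng.random() < min(lam1(u) * lam2(e) * dt, 1.0)
            T.append(u); E.append(e); Y.append(y)
            if y:
                last = u
    return np.array(T), np.array(E), np.array(Y, dtype=float)

T, E, Y = simulate()

def spline_basis(z, knots):
    """Truncated-power cubic spline basis (a simple stand-in for the
    natural-spline basis used with gam in S-PLUS)."""
    cols = [z, z**2, z**3] + [np.maximum(z - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

tt = T / 0.4                             # clock time, rescaled to [0, 1]
ee = np.minimum(E, 0.05) / 0.05          # elapsed time, capped and rescaled
X = np.column_stack([np.ones_like(tt),
                     spline_basis(tt, [0.25, 0.5, 0.75]),
                     spline_basis(ee, [0.2, 0.6])])

# Newton/IRLS for a Poisson log-link fit on the binary sequence; for
# small dt this approximates the binary-regression likelihood.
beta = np.zeros(X.shape[1])
beta[0] = np.log(max(Y.mean(), 1e-8))
for _ in range(40):
    mu = np.exp(np.clip(X @ beta, -20.0, 5.0))   # expected count per bin
    H = X.T @ (X * mu[:, None]) + 1e-8 * np.eye(X.shape[1])
    beta += np.linalg.solve(H, X.T @ (Y - mu))
```

At the optimum the intercept forces the fitted spike count to match the observed count, and the elapsed-time component should reproduce the refractory dip in lam2; the interaction terms of the more general equation 2.2 would enter as products of the two bases.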
4 Results

We examined data recorded from a neuron in the supplementary eye field of a macaque monkey while he was carrying out 15 trials of a delayed eye movement task (neuron PK66a.1 from Olson et al., 2000; we used the pattern condition). We fitted the IMI process intensity function, equation 2.2, as described above. We included:

1. Spline basis elements for log λ1(t)

2. Spline basis elements for log λ2(t − s*(t))
3. Cross-products between the spline basis elements for log λ1(t) and those for log λ2(t − s*(t)) (interaction terms)
A Spike-Train Probability Model
4. Additional basis elements to represent functions of time since the spike s**(t) prior to s*(t) and of time since the spike s***(t) prior to s**(t)

5. Constants for each trial

Using component 1 alone would result in fitting an inhomogeneous Poisson process; component 2 represents departures from a Poisson process; component 3 represents departures from equation 2.3 within the general framework of equation 2.2; component 4 represents departures from equation 2.2; and component 5 represents excess trial-to-trial variability. In addition to the overall constant, there were 4 degrees of freedom for component 1 resulting from the use of 3 knots,² 3 for component 2 resulting from the use of 2 knots, 12 for component 3 (the coefficients for all products between terms in components 1 and 2), 3 each for the two components in component 4 (each based on 2 knots), and 14 for component 5 (because there were 15 trials). For this neuron we found that the additive IMI process (see equation 2.3) without excess trial-to-trial variability clearly provided the best fit of the models considered: none of the components 3 through 5 was statistically significant, but components 1 and 2 were highly significant (p < .0005). Figure 1 displays the firing behavior of this neuron. As is typically the case, the fit of a Poisson model (effectively applying equation 2.2 with only the first component λ1(t)) to the data pooled across trials is good. In Figure 1A, the fitted spline simply smooths the PSTH. Furthermore, when the 15 within-trial IMI fits are averaged across trials, the result agrees very closely with the Poisson model. However, the effect of the highly significant non-Poisson component 2, shown in Figure 1B, may be understood when we consider the way the neuron behaves following a spike at t = 50 milliseconds, as shown in Figure 1C. There it may be seen that the neuron has a refractory period of roughly 10 milliseconds during which it is less likely to fire again than predicted by the Poisson model.
After the firing rate reaches its peak at roughly 100 milliseconds, the effect of the non-Poisson component diminishes over time. The effect over the course of a single trial is illustrated in Figure 1D. Immediately after each occurrence of a spike, the neuron partially "resets," lowering its firing intensity below the average firing across trials predicted by the smoothed PSTH; then it increases to a rate higher than the PSTH-based prediction until the neuron fires again.³
² Essentially, we have linear, quadratic, and k + 1 cubic coefficients for k knots, and then lose degrees of freedom because of the differentiability constraints; technically, in S-PLUS, we use the natural spline basis.

³ An absolute refractory period, during which the neuron would not fire, would force λ2 to zero as its argument t − s*(t) goes to zero. However, the data from Olson et al. (2000) were acquired only to 1 millisecond accuracy, and there is no evidence here of an absolute refractory period longer than 1 millisecond. Thus, our fitted function λ2 in Figure 1B does not vanish at the origin.
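The within-trial "reset" behavior can be mimicked in a few lines of Python. The two factors below are hypothetical stand-ins for the fitted λ1 and λ2 (the λ2 chosen, as in Figure 1B, to dip during a refractory period, overshoot 1, and decay back), not the estimates from neuron PK66a.1.

```python
import numpy as np

dt = 0.001
t = np.arange(0.0, 0.4, dt)                        # one 400 ms trial

lam1 = 40.0 + 30.0 * np.sin(2 * np.pi * t / 0.4)   # smoothed-PSTH-like factor

def lam2(e):
    # refractory dip near e = 0, overshoot above 1, slow decay back to 1
    return (1.0 - np.exp(-e / 0.008)) * (1.0 + 0.5 * np.exp(-e / 0.1))

spikes = [0.05, 0.11, 0.14, 0.22, 0.30]            # spike times (s) on one trial

# elapsed time e(t) = t - s*(t) since the most recent spike
last = np.full_like(t, -np.inf)
for s in spikes:
    last[t >= s] = s
e = t - last

imi = lam1 * lam2(e)    # within-trial intensity, equation 2.3
```

Right after each spike the intensity drops below lam1 (the reset), then transiently exceeds it before the next spike, qualitatively reproducing Figure 1D.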
[Figure 1 appears here: four panels. (A) Spikes/sec versus time (ms), showing the Poisson and IMI fits. (B) Intensity ratio versus inter-spike time (ms). (C) Spikes/sec versus time (ms) given a previous spike at t = 50 ms, Poisson versus IMI. (D) Fitted intensities over a single trial, with spike times marked along the time axis.]
Figure 1: Firing behavior of the supplementary eye field neuron. (A) The PSTH is shown together with the fitted intensity assuming a Poisson process and the average across the 15 trials of the fitted intensities from the multiplicative IMI model (see equation 2.3). (B) The fitted factor λ2(t − s*(t)), which modulates the time-varying firing rate to account for non-Poisson spiking behavior. If the spike trains followed a Poisson process, this intensity would be constant and equal to 1. Instead, the cell begins (during its refractory period) with a diminished firing intensity relative to what would be predicted by the Poisson assumption. This is followed by a fairly rapid increase in firing intensity until about 50 milliseconds, at which time the intensity slowly declines. (To ease comparison with C, we have scaled λ2(t − s*(t)) in B so that the value λ2(t − s*(t)) = 1 corresponds to the fitted Poisson intensity at t = 50 milliseconds; equivalently, λ1(t) is equal to the fitted Poisson intensity at t = 50 milliseconds.) (C, D) Effect of λ2(t − s*(t)) on this neuron's firing rate. According to the fitted model (see equation 2.3), the IMI firing intensity is the product of the fitted factor λ2(t − s*(t)) shown in B and the fitted factor λ1(t) (which is not shown); because of the former, it depends on the time at which the previous spike occurred. C displays the ratio of IMI to Poisson fitted firing intensities when the previous spike occurred at t = 50 milliseconds. As implied by B, during the first 10 milliseconds following the spike at t = 50 milliseconds, the cell is less likely to fire again than predicted by the Poisson model. D shows the fitted IMI intensity for the first trial together with the fitted Poisson intensity based on pooling all 15 trials (as in A). The bars along the horizontal axis indicate the times at which spikes occurred on this trial. After each spike, the firing rate immediately drops, then climbs to a rate substantially above the predicted Poisson rate.
5 Discussion

The IMI processes defined in equation 2.2 generalize inhomogeneous Poisson processes, allowing computation of quantities defined by the spike-train probability distribution once the IMI intensity function λ(t, t − s*(t)) is estimated from the data. For example, it is possible to generalize the test for synchrony used by Riehle et al. (1997), avoiding the possibly objectionable Poisson assumption. We hope to report on this in a future communication. The "Markov interval" terminology comes from Cox and Lewis (1972), who discussed the time-homogeneous case for coupled processes. We are currently investigating extensions of IMI processes for multiple neurons. Estimation of λ(t, t − s*(t)) could be carried out by various methods. We have used cubic splines on the log scale estimated by maximum likelihood within the statistical framework of generalized linear models, a standard statistical methodology for which software is widely available. Previously, Miller and Mark (1992) applied model 2.3 using step functions; Berry and Meister (1998) assumed λ1(t) was constant in fitting λ2(t − s*(t)). Our approach allowed assessment of the fit of model 2.3 within the broader class 2.2. We found it fit well for a particular neuron in the supplementary eye field, but we would like to emphasize that this multiplicative form may or may not be an appropriate simplification of equation 2.2 in other settings. This is an empirical issue that must be examined in each case. Substantively, as illustrated in Figure 1D, we found that this neuron has much more rapid fluctuations in its firing rate within trials than predicted by the Poisson model, which simply smooths the PSTH. The fundamental utility of IMI processes is that they avoid the assumption that spike trains are Poisson processes, which fails to account for important effects such as the refractory period.
Instead, they make a weaker and seemingly reasonable assumption: that spike timing is determined by, and only by, experimental clock time and the elapsed time since the previous spike; there is no further restriction. Statistically, the main point is that the IMI assumption reduces the fitting problem posed in equation 2.1 to a very manageable two-variable nonparametric binary regression. We solved this problem using splines with knots fixed at suitable locations. We are currently developing a more fully automated procedure based on the method of DiMatteo, Genovese, and Kass (2000).
Acknowledgments

We are grateful to Emery Brown and John Lehoczky for many helpful conversations. Comments from the referees substantially improved the exposition. This work was supported in part by grants CA54652-08 from the National Institutes of Health and DMS-9803433 from the National Science Foundation.
References

Barbieri, R., Quirk, M. C., Frank, L. M., Wilson, M. A., & Brown, E. N. (in press). Non-Poisson stimulus-response models of neural spike train activity. Journal of Neuroscience Methods.
Berry, M. J., & Meister, M. (1998). Refractoriness and neural precision. J. Neurosci., 18, 2200–2211.
Brillinger, D. R. (1988). Maximum likelihood analysis of spike trains of interacting nerve cells. Biol. Cybern., 59, 189–200.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. J. Neurosci., 18, 7411–7425.
Cox, D. R., & Lewis, P. A. W. (1972). Multivariate point processes. In Proc. 6th Berkeley Symp. Math. Statist. Prob., 3 (pp. 401–448). Berkeley: University of California Press.
Daley, D. J., & Vere-Jones, D. (1988). Introduction to the theory of point processes. New York: Springer-Verlag.
DiMatteo, I., Genovese, C. R., & Kass, R. E. (2000). Bayesian curve fitting with free-knot splines. In press.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). New York: Chapman and Hall.
Miller, M. I., & Mark, K. E. (1992). A statistical study of cochlear nerve discharge patterns in response to complex speech stimuli. J. Acoust. Soc. Am., 92, 202–209.
Olson, C. R., Gettner, S. N., Ventura, V., Carta, R., & Kass, R. E. (2000). Neuronal activity in macaque supplementary eye field during planning of saccades in response to pattern and spatial cues. J. Neurophysiol., 84, 1369–1384.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. J. Neurophysiol., 57, 162–178.
Oram, M. W., Wiener, M. C., Lestienne, R., & Richmond, B. J. (1999). Stochastic nature of precisely timed spike patterns in visual system neuronal responses. J. Neurophysiol., 81, 3021–3033.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Sanger, T. D. (1996). Probability density estimation for the interpretation of neural population codes. J. Neurophysiol., 76, 2790–2793.
Venables, W. N., & Ripley, B. D. (2000). Modern applied statistics with S-PLUS (3rd ed.). New York: Springer-Verlag.
Ventura, V., Carta, R., Kass, R. E., Gettner, S. N., & Olson, C. R. (in press). Statistical analysis of temporal evolution in single-neuron firing rates. Biostatistics.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol., 79, 1017–1044.

Received May 4, 2000; accepted October 2, 2000.
LETTER
Communicated by Misha Tsodyks
Orientation Tuning Properties of Simple Cells in Area V1 Derived from an Approximate Analysis of Nonlinear Neural Field Models

Thomas Wennekers
Max-Planck-Institute for Mathematics in the Sciences, D-04103 Leipzig, Germany

We present a general approximation method for the mathematical analysis of spatially localized steady-state solutions in nonlinear neural field models. These models comprise several layers of excitatory and inhibitory cells. Coupling kernels between and inside layers are assumed to be gaussian shaped. In response to spatially localized (i.e., tuned) inputs, such networks typically reveal stationary localized activity profiles in the different layers. Qualitative properties of these solutions, like response amplitudes and tuning widths, are approximated for a whole class of nonlinear rate functions that obey a power law above some threshold and that are zero below. A special case of these functions is the semilinear function, which is commonly used in neural field models. The method is then applied to models for orientation tuning in cortical simple cells: first, to the one-layer model with "difference of gaussians" connectivity kernel developed by Carandini and Ringach (1997) as an abstraction of the biologically detailed simulations of Somers, Nelson, and Sur (1995); second, to a two-field model comprising excitatory and inhibitory cells in two separate layers. Under certain conditions, both models have the same steady states. Comparing simulations of the field models and results derived from the approximation method, we find that the approximation predicts the tuning behavior of the full model well. Moreover, explicit formulas for approximate amplitudes and tuning widths in response to changing input strength are given and checked numerically.
Comparing the network behavior for different nonlinearities, we find that the only rate function (from the class of functions under study) that leads to constant tuning widths and a linear increase of firing rates in response to increasing input is the semilinear function. For other nonlinearities, the qualitative network response depends on whether the model neurons operate in a convex (e.g., x²) or concave (e.g., √x) regime of their rate function. In the first case, tuning gradually changes from input driven at low input strength (broad tuning strongly depending on the input and roughly linear amplitudes in response to input strength) to recurrently driven at moderate input strength (sharp tuning, supralinear increase of amplitudes in response to input strength). For concave rate functions, the network reveals stable hysteresis between a state

Neural Computation 13, 1721–1747 (2001)  © 2001 Massachusetts Institute of Technology
Thomas Wennekers
at low firing rates and a tuned state at high rates. This means that the network can "memorize" tuning properties of a previously shown stimulus. Sigmoid rate functions can combine both effects. In contrast to the Carandini-Ringach model, the two-field model further reveals oscillations with typical frequencies in the beta and gamma range when the excitatory and inhibitory connections are relatively strong. This suggests a rhythmic modulation of tuning properties during cortical oscillations.

1 Introduction
Almost 40 years after Hubel and Wiesel (1962) first characterized orientation selectivity in the mammalian visual cortex, it is still a matter of controversy what precise mechanism gives rise to this tuning property (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Ferster, Chung, & Wheat, 1996; Sabatini, 1997; Sato, Katsuyama, Tamura, Hata, & Tsumoto, 1996; Somers, Nelson, & Sur, 1995; Troyer, Krukowski, Priebe, & Miller, 1998; Wörgötter & Koch, 1991; reviews in Ferster & Koch, 1987; Heeger, Simoncelli, & Movshon, 1996; Reid & Alonso, 1996; Sompolinsky & Shapley, 1997; Vidyasagar, Pei, & Vogulshev, 1996). Hubel and Wiesel (1962) originally proposed that orientation tuning is caused in cortical simple cells by feedforward projections from circularly tuned cells in the lateral geniculate nucleus (LGN) that are aligned along rows of the respective preferred orientation (cf. also Ferster et al., 1996). In contrast, it has more recently been suggested that the input into cortical simple cells is only weakly tuned, but that this initial tuning is considerably sharpened by cortex-intrinsic amplification processes based on excitatory and inhibitory lateral connections (for alternative models of the origin of orientation tuning, see also Adorján, Levitt, Lund, & Obermayer, 1999; Troyer et al., 1998). This sharpening process was first modeled by Somers et al. (1995) in a detailed biological neural network simulation comprising some 20,000 spiking neurons. At the same time, Ben-Yishai et al. (1995) provided a beautiful mathematical analysis of the principal mechanism in a simple, spatially continuous neural field model. The spatial dimension of the model was intended to represent the angle of orientation preference of local cortical cell groups. For simplicity, only a single continuous chain of neurons was considered, without distinguishing excitatory and inhibitory cells.
Analysis of Orientation Tuning

Therefore, all cells had identical piecewise linear sigmoid rate functions with saturation, and synaptic connectivities were of a "Mexican hat" type. This combined excitatory and inhibitory synaptic interactions in a single connectivity kernel, such that nearby cells of similar orientation preference were excitatorily coupled and cells with orientation preferences differing by more than a certain angle inhibited each other. Gaussian-shaped input profiles, representing bar stimuli at certain orientations, evoke localized activation peaks in this network. Ben-Yishai et al. (1995) demonstrated mathematically as well as numerically that weakly tuned LGN input can be locally amplified and sharpened, such that a tuning width of cortical cells comparable to the width of the excitatory coupling kernel is reached. This means that the firing-rate profile inside the field is considerably more localized than the input current distribution. Accordingly, the receptive fields of the simulated neurons are sharper than the tuning width of the input from the LGN. In addition, the tuning width obtained this way appears to be largely independent of the strength of the input currents: although firing rates increase with increasing input, the width of the stationary activity profile remains approximately constant over a large range of input currents. These effects appeared very similarly in Somers et al.'s (1995) biologically detailed simulations, and they are in accordance with experiments showing that orientation tuning does not change significantly in response to bars presented at different contrast (Sclar & Freeman, 1982; Skottun, Bradley, Sclar, Ohzawa, & Freeman, 1987). Field models were used earlier to investigate static as well as dynamic properties of cortical cells and networks (Amari, 1977; Engels & Schöner, 1995; Krone, Mallot, Palm, & Schüz, 1986; Sabatini, 1997; von Seelen, Mallot, & Giannakopoulos, 1987; Suder, Wörgötter, & Wennekers, 2001; Wennekers & Palm, 1996; Wörgötter & Koch, 1991; Zhang, 1996). Much of this work, however, was based solely on computer simulations. Where analytical calculations have been given, they were usually restricted to purely linear fields or to field equations linearized around some stationary activation profile. In that case, Fourier-Laplace methods can be used to characterize receptive field properties by means of spatiotemporal linear response functions or power spectra (Krone et al., 1986; Sabatini, 1997; Mineiro & Zipser, 1998). In this way, valuable insight into linear properties of cortical cells can be obtained, but nonlinear effects, such as the sharpening process described above, are completely neglected.
In this respect, Ben-Yishai et al.'s (1995) mathematical model represents an important step toward an analysis of nonlinear receptive field properties. Only a few earlier articles considered nonlinear effects analytically (Amari, 1977; Ben-Yishai et al., 1995; Wennekers & Palm, 1996). In this article, we consider field models for orientation tuning in more general settings and approximate solutions of truly nonlinear mutually coupled excitatory and inhibitory fields. We derive analytical expressions for qualitative properties of steady-state solutions (fixed points) of these field equations, taking into account a whole class of nonlinear rate functions that obey a power law above some threshold and that are zero below. These functions can approximate well the rate functions of cortical cells in the region of low and intermediate firing rates. The algebraic expressions derived for stationary firing rates and tuning widths can be solved numerically and also analytically in an approximate way. In this way, tuning properties of cortical simple cells can be linked to anatomical and physiological data, and the qualitative effects found by Ben-Yishai et al. (1995) and Somers et al. (1995) can be characterized more quantitatively. In addition, further properties of
cortical receptive fields can be predicted, especially when the rate functions of cortical cells are assumed to be nonlinear. In the next section, we define the general neural field equations under study, comprising several neural layers, gaussian connectivity kernels, semi-power rate functions, and spatially localized gaussian-shaped input. As special cases, we consider a two-field model consisting of an excitatory and an inhibitory field that has the same steady-state solutions as the one-field model investigated by Carandini and Ringach (1997). Section 3 presents the main approximation scheme, which consists of approximating the suprathreshold potential profiles in the different layers by equivalent gaussian profiles. This leads to a closed set of coupled nonlinear equations for the response amplitudes and standard deviations (tuning widths) of the localized activation peaks in the network layers. The method is applied in section 4 to a two-field model with vanishing firing thresholds, which is equivalent to the Carandini-Ringach model with respect to its steady-state behavior. Approximate analytical formulas for amplitudes and tuning widths in response to changing input contrast are given and compared with the simulations performed by Somers et al. (1995) and Carandini and Ringach (1997). Furthermore, it is shown that the developed approximation scheme captures the tuning properties of the full set of the original nonlinear neural field equations well. In section 5, we discuss the influence of nonlinearities on the tuning behavior. Here, the approximate equations provide a means to understand the occurring nonlinear bifurcation phenomena in informal terms; these phenomena appear to be qualitatively very different for concave, convex, and semilinear rate functions.
It is demonstrated that semilinear rate functions are the only ones that lead to constant tuning in response to changes in input strength, and that concave rate functions support short-term memory of previously seen stimuli in the form of tuned persistent activity. Furthermore, it will be argued that the two-field model can display autonomous oscillations in the beta- and gamma-frequency range in certain parameter regimes, in contrast to the one-field models with lateral-inhibition-type kernels studied by Ben-Yishai et al. (1995) and Carandini and Ringach (1997). Finally, the concluding section restates and discusses the most important findings.

2 Field Models for Orientation Tuning
In general, neural field equations for, say, n cortical layers can be written as

τ_i ẇ_i(x, t) = −w_i(x, t) + I_i(x, t) + Σ_{j=1}^n k_ij(x) ∗ f_j(w_j(x, t)),    (2.1)
where w_i(x, t) describes the membrane potential of cells at location x and time t in layer i (Wilson & Cowan, 1973; Amari, 1977; Krone et al., 1986; Wörgötter & Koch, 1991). In the context of orientation tuning, it is convenient
to choose one-dimensional fields. The spatial variable x then corresponds with the best-matching orientations of local cortical cell groups. I_i(x, t) in equation 2.1 denotes external input into cells in layer i, and the kernels k_ij(x) describe effective synaptic densities from layer j to layer i that are assumed to be translation invariant for simplicity. The symbol ∗ denotes spatial convolution, τ_i is the membrane time constant of cells in layer i, and f_i(w) is their rate or output function. For completeness, we have to specify initial conditions at, say, time t = 0: w_i(x, 0) = W_i(x). Observe that we assume an infinitely extended model, x ∈ R, although in a model for orientation tuning it would be more appropriate to constrain the domain of the dynamic equation 2.1 to the interval [−90, 90] (degrees) and assume cyclic boundary conditions. The reason that we choose an infinite domain is purely technical and will become clear at the end of section 3, where we briefly discuss cyclic models. The inputs I_i(x, t) in equation 2.1 have to represent an optically stimulating bar at a certain orientation. This is modeled by gaussian-shaped input profiles, where the center of the gaussian represents the orientation of the stimulating bar and its width the tuning width of the input projection from LGN cells to cortical cells. Without loss of generality, we may assume that the bar's orientation is zero:

I_i(x, t) = I_i(x) := I_i0 exp[−(1/2)(x/σ_i0)²].    (2.2)

In equation 2.2, the input is independent of time, because in the sequel we study only stationary properties of receptive fields. The standard deviations σ_i0 need not be the same for all i; that is, upcoming connections from the LGN can be tuned differently for different layers of target cells.
Because excitatory and inhibitory cells are separated into different fields, we assume that the coupling kernels are not given by Mexican-hat kernels, as in one-field models for orientation tuning, but by gaussians:

k_ij(x) = (K_ij / √(2π)) exp[−(1/2)(x/σ_ij)²].    (2.3)
Furthermore, the rate functions in equation 2.1 are chosen of the form

f(w) = β [w − ϑ]₊^p,    (2.4)
where ϑ is the firing threshold and [x]₊ := x if x > 0 and 0 otherwise. This means that the rates are zero below the firing threshold ϑ and obey a power law above it. A semilinear f requires p = 1, but equation 2.4 contains more general cases, for example, p = n with n a positive integer, or p = 1/2 (square root). This way, the influence of nonlinearities on tuning properties can be compared.
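For concreteness, the rate-function family of equation 2.4 can be written out in a few lines of Python (the parameter values below are arbitrary examples, not values used in the article):

```python
import numpy as np

def rate(w, beta=1.0, theta=0.0, p=1.0):
    """Semi-power rate function f(w) = beta * [w - theta]_+^p (equation 2.4)."""
    return beta * np.maximum(w - theta, 0.0) ** p

w = np.array([-1.0, 0.0, 1.0, 4.0])
r_lin = rate(w, p=1.0)     # semilinear: 0, 0, 1, 4
r_sq = rate(w, p=2.0)      # convex:     0, 0, 1, 16
r_sqrt = rate(w, p=0.5)    # concave:    0, 0, 1, 2
```

Subthreshold potentials always map to zero rate; the exponent p alone decides whether the suprathreshold branch is linear, convex, or concave.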
Typically, the model as described above responds to input of the form 2.2 with localized firing-rate profiles, f_i(w_i*), in the different layers, where w_i* = w_i*(x) are the stationary potential profiles. The widths of these rate profiles are equivalent to the tuning widths observed for the respective cell classes in experiments, as those are usually derived from extracellular firing rates. The existence of steady-state solutions of the model equations 2.1 through 2.4 at least requires time-independent inputs and ẇ_i(x, t) = 0. Therefore, the fixed points are determined by

w_i*(x) = I_i(x) + Σ_{j=1}^n k_ij(x) ∗ f_j(w_j*(x)).    (2.5)
We should mention that Ben-Yishai et al.'s model (1995) is formulated directly for firing rates and not for potentials. Therefore, it does not perfectly correspond with equation 2.1. However, a special case of equation 2.1 is the model for orientation tuning described by Carandini and Ringach (1997), which was intended as a straight approximation of the biologically detailed model studied by Somers et al. (1995). The Carandini-Ringach model reads

τ ẇ(x, t) = −w(x, t) + I(x, t) + k(x) ∗ f(w(x, t)),    (2.6)
which is just equation 2.1 for a single layer of cells (n = 1). Although formally somewhat different, all three models described by Ben-Yishai et al. (1995), Somers et al. (1995), and Carandini and Ringach (1997) have comparable properties, which will also be reproduced in a two-field model consisting of an excitatory and an inhibitory layer considered below. Both the Carandini-Ringach model and that used by Ben-Yishai et al. lump excitatory and inhibitory cells into a single neural field. Therefore, coupling kernels in these models have a Mexican-hat profile, for instance, in the form of a difference of gaussians (DOG kernel):

k(x) = (K₊ / √(2π)) exp[−(1/2)(x/σ₊)²] − (K₋ / √(2π)) exp[−(1/2)(x/σ₋)²].    (2.7)
The rate function f is chosen by Ben-Yishai et al. (1995) and Carandini and Ringach (1997) essentially as the semilinear function, which results from equation 2.4 by setting p to one. In addition, in both models the rate function saturates at large firing rates, but usually firing rates stay below saturation. The input I(x, t) in Ben-Yishai et al.'s model (1995) represents a single stimulating bar, which can be modeled as in equation 2.2. In Carandini and Ringach's paper (1997), the network response to superimposed bars (crosses, plaids, and so on) is studied. This situation requires two (or even more) gaussians at different locations in the input field. We will not consider multiple-peak solutions here but restrict ourselves to single bumps in the
activity profile. Moreover, we also neglect saturation at high rates. Then the stationary activity profiles are determined by the fixed-point equation,

w*(x) = I(x) + k(x) ∗ f(w*(x)),    (2.8)
which is a nonlinear integral equation for w*(x). Exactly the same equation also results from a field model that treats excitatory and inhibitory neurons in two separate fields. This model reads

τ1 ẇ1 = −w1 + I + k11 ∗ f1(w1) + k12 ∗ f2(w2),    (2.9)
τ2 ẇ2 = −w2 + k21(x) ∗ f1(w1),    (2.10)
where w1 and w2 describe excitatory and inhibitory neurons, respectively. Note that in equation 2.10 we neglected external input into, as well as feedback between, inhibitory cells. The fixed-point equations corresponding with equations 2.9 and 2.10 are

w1*(x) = I(x) + k11 ∗ f1(w1*(x)) + k12 ∗ f2(w2*(x)),    (2.11)
w2*(x) = k21 ∗ f1(w1*(x)).    (2.12)
If we further require that the rate function of the inhibitory cells is semilinear with threshold θ₂ = 0, that is, f₂(w₂) = β₂[w₂]₊, insertion of equation 2.12 into 2.11 immediately yields equation 2.8 for the excitatory potential distribution w₁* with an effective DOG kernel (see equation 2.7). The stationary potential profile of the inhibitory cells is still given by equation 2.12, and the parameters to be used in equations 2.7 and 2.8 to calculate the excitatory profile w₁* are f = f₁, K₊ = K₁₁, σ₊ = σ₁₁, K₋ = K₂₁K₁₂β₂σ₁₂σ₂₁/σ₋, and σ₋² = σ₁₂² + σ₂₁². Thus, for the stated conditions, the Carandini-Ringach model and the two-field model, equations 2.9 and 2.10, have the same fixed-point solutions. Observe that the height K₋ and width σ₋ of the inhibitory component of the effective DOG kernel are not those of the physiological inhibitory connections, but those of the convolution of the physiological projections from excitatory to inhibitory cells and back.

3 Approximation Method
The fixed-point equations 2.5, 2.8, 2.11, and 2.12 derived in the previous section are nonlinear integral equations, such that it will in general be impossible to find exact solutions. Therefore, in this section we develop an approximation scheme that allows approximating qualitative properties of steady-state solutions. In particular, this will yield relations for stationary tuning widths and the dependence of response amplitudes on the strength of external inputs. Figure 1 depicts the main approximation method. Shown is the stationary potential profile of an arbitrary one of the n fields, w(x). Typically the
1728
Thomas Wennekers
Figure 1: Approximation scheme. The thick line [w − θ]₊ indicates the part of the membrane potential profile w above firing threshold θ. Only this part results in positive firing rates when w is squashed through the output function (see equation 2.4). The approximation consists in replacing [w − θ]₊ by a gaussian function g(x) with the same height h and standard deviation σ such that both curves have the same second-order derivative at x = 0. The quadratic approximation q(x), in addition, predicts an approximate value w̃ for the true width w of the region above threshold. w is the half-width of the firing-rate profile (not shown) at baseline and thus tightly related to tuning widths derived from extracellular recordings of firing rates.
profile is above threshold (θ) in a certain region [−w, w] around x = 0 and below threshold outside. The rate function f transforms the potentials into a localized firing-rate profile that is positive in the interval [−w, w] and zero outside (the rate profile is not shown in the figure). The main trick now is to approximate the potential profile above threshold, that is, [w − θ]₊, by a gaussian profile, g(x), with the same height h = w(0) − θ and a certain "effective" standard deviation, σ. Obviously, in the most relevant central region, where signals are strongest, the approximation can be reasonably good. In the tails, however, the true profile [w − θ]₊ (the thick line in Figure 1) will deviate somewhat from the approximating curve g. But there the signals are already relatively small and can be expected to have only a minor influence on the network behavior. This implies that the method always results in some approximation errors, whatever the absolute amplitudes of the fields. But as demonstrated in detail below, it can nonetheless be used very well to derive qualitative and even quantitative properties of the firing-rate distributions, for instance, response amplitudes h and tuning widths σ.
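The matching of peak height and peak curvature that underlies this scheme can be sketched numerically. In the toy example below (all parameter values are illustrative, not from the article), the suprathreshold profile is itself gaussian, so the recovered amplitude and width should reproduce the true parameters, and raising the profile to a power p should shrink the width by a factor 1/√p:

```python
import numpy as np

# Toy suprathreshold profile: exactly gaussian above threshold, so the
# height/curvature matching should recover its parameters (illustrative values).
theta, h_true, s_true, p, beta = 1.0, 2.5, 10.0, 2.0, 1.5

x = np.linspace(-50.0, 50.0, 10001)
dx = x[1] - x[0]
w = theta + h_true * np.exp(-0.5 * (x / s_true) ** 2)

i0 = len(x) // 2                                       # index of x = 0
h = w[i0] - theta                                      # amplitude above threshold
wxx = (w[i0 + 1] - 2 * w[i0] + w[i0 - 1]) / dx ** 2    # curvature at x = 0
sigma = np.sqrt(h / (-wxx))                            # effective width

# The power of a gaussian is again a gaussian, with width sigma / sqrt(p).
f = beta * np.maximum(w - theta, 0.0) ** p
fxx = (f[i0 + 1] - 2 * f[i0] + f[i0 - 1]) / dx ** 2
sigma_p = np.sqrt(-f[i0] / fxx)

print(h, sigma, sigma_p)
```

For this synthetic profile the recovered values equal h_true, s_true, and s_true/√p up to grid error, which illustrates why the scheme is exact whenever the suprathreshold part is truly gaussian.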
The choice of the effective standard deviation σ taken in the approximation is not unique. In the simplest case, we may require that both the original firing-rate distribution and the gaussian approximation have the same second-order derivative at x = 0. This means we employ a quadratic approximation near x = 0, which is depicted as q(x) in Figure 1. Accordingly, we assume that the potential profile near x = 0 above firing threshold θ can be approximated by a gaussian function with amplitude h and standard deviation σ:

[w*(x) − θ]₊ ≈ h exp[−(1/2)(x/σ)²].  (3.1)

Expanding to second order in x yields

[w*(0) − θ + (1/2) w*_xx(0) x² + ···]₊ ≈ h [1 − (1/2)(x/σ)² + ···] =: q(x).  (3.2)

Thus,

h ≈ [w*(0) − θ]₊  (3.3)

and

σ² ≈ h / (−w*_xx(0)).

In addition, the quadratic approximation, q(w̃) = 0, predicts for the value of w in Figure 1: w ≈ w̃ = √2 σ. The advantage of this gaussian approximation is twofold. First, observe that the rate functions f are power functions above threshold. But the power of a gaussian is again a gaussian with rescaled standard deviation. Thus, inserting equation 3.1 into 2.4, we get

f(w(x)) = β [w(x) − θ]₊ᵖ ≈ β hᵖ exp[−(p/2)(x/σ)²],  (3.4)
where σ is the width of the gaussian that approximates the potential profile. We further define σ_p := σ/√p as the width of the firing-rate profile that results from applying f to w. σ_p corresponds with tuning widths observed in experiments, because those are typically measured from extracellular firing rates. Equation 3.4 then enables us to deal with the nonlinearity in the fixed-point equations. Given the same effective width σ of the suprathreshold potential profile, the firing-rate profile, of width σ_p = σ/√p, becomes sharper when p increases. This is intuitive, because (for p > 1) the output function amplifies high potentials near x = 0 more strongly than the relatively lower potentials in the tails (i.e., for p > 1, f is convex from below). The second advantage of the gaussian approximation is that inserting equation 3.4 into the fixed-point equation 2.5 leads to convolution integrals of two gaussians in the synaptic interaction terms. The convolution of two
gaussians, however, can be solved in closed form and is again a gaussian. Therefore,

(k_ij ∗ f(w_j))(x) ≈ (K_ij β_j σ_ij σ_pj h_j^pj / S_ij) exp[−(1/2)(x/S_ij)²],  (3.5)

where S_ij² := σ_ij² + σ_pj². Terms of the form k_ij ∗ f(w_j) appear in the sum over the input layers in equation 2.5. This means the stationary potential distribution in each target layer i can be envisaged as consisting of an additive superposition of gaussian-shaped contributions of the form 3.5. The different contributions can be positive or negative, depending on whether a source layer j consists of excitatory or inhibitory cells, respectively. If we insert equation 3.5 into 2.5, we get an approximate expression for w_i*(x). Restricting attention again only to the region near x = 0 where the w_i*(x) are above threshold, we can expand these expressions to second order in x. Then the first order cancels out (because the problem is mirror symmetric with respect to x = 0), and in zeroth and second order we obtain

h_i = κ_i + Σ_{j=1..n} c_ij h_j^pj,  (3.6)

h_i/σ_i² = I_i0/σ_i0² + Σ_{j=1..n} c_ij h_j^pj / S_ij²,  (3.7)

where we used equation 3.3 and the definitions c_ij := K_ij β_j σ_ij σ_pj / S_ij and κ_i := I_i0 − θ_i. Equations 3.6 and 3.7 provide a closed system of 2n nonlinear equations for the 2n variables h_i and σ_i, i = 1, ..., n. Solutions may or may not exist, depending on the model parameters. In general, they can only be found numerically using, for example, root-finding methods (see, e.g., Press, Teukolsky, Vetterling, & Flannery, 1993). In addition, the allowed solutions are further constrained by the requirement h_i ≥ 0, because by definition the h_i measure amplitudes above threshold. Although equations 3.6 and 3.7 are still nonlinear, their advantage is that they reduce the integral equations for the true stationary fields to algebraic equations for the amplitudes and standard deviations of the localized activation peaks. This greatly reduces computational complexity when searching for stationary solutions. In particular, using branch-tracing methods (Seydel, 1988), large parameter spaces can be searched efficiently at minimal cost. Moreover, and even more important, in concrete applications it is often possible to derive further approximations from equations 3.6 and 3.7 that relate properties of stationary activity profiles directly to model parameters. This will be demonstrated in the sequel for some special cases. Closing this section, we should emphasize that the developed approximation method makes explicit use of properties of the gaussian function defined on the whole real line. These properties are especially that the
power of a gaussian as well as the convolution of two gaussians are again gaussians. These properties led to a closed expression for the functions Ψ(x) := (k ∗ f(w*))(x) given in equation 3.5. The assumption of gaussian kernels and tuning functions is clearly a restriction of the method. For instance, the method cannot easily be extended to the two-dimensional case, since then the convolution of two gaussians can no longer be solved explicitly. Similarly, it is not easily possible to use more complicated coupling kernels than gaussian ones. However, kernels that are sums of gaussians localized at zero do not pose problems (i.e., the Mexican hat function). Furthermore, Swindale (1998) has pointed out that experimentally observed orientation tuning curves are often best described in the cyclic domain by wrapped gaussian functions or von Mises functions and not by simple gaussian functions. These functions have a similar shape as gaussians restricted to the cyclic domain. They are of the form Aψ(x, σ_p), where A is some amplitude factor and ψ(x, σ_p) is a π-periodic function in x, localized at, say, x = 0 in the interval [−π/2, π/2], and σ_p is some measure of the effective width of the tuning curve. Clearly, if we assume that our model is restricted to the cyclic domain and that the firing-rate profiles are of the above type, that is, f(w*)(x) = Aψ(x), we can no longer calculate an explicit formula for the functions Ψ(x) defined above. Nonetheless, the respective convolution integrals certainly exist. So, still assuming that f is a semipower function, we can again in principle derive the system of coupled algebraic equations 3.6 and 3.7 by Taylor expansion up to second order. The general form of the resulting equations will be the same (with hᵖ ~ A), but the coupling functions c_ij(σ_pj), which depend only on the tuning widths, will be different from those in equation 3.7; typically, it will not even be possible to give explicit formulas for them.
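Even when no explicit formula exists, the convolution integrals defining Ψ can be evaluated numerically. The following minimal numpy sketch (amplitudes and widths are illustrative, not from the article) uses gaussian profiles, so the closed form implied by equation 3.5 is available for comparison; it exhibits the generic behavior discussed here: Ψ(0, σ_p) vanishes for σ_p → 0 and grows monotonically toward the total synaptic efficacy of the kernel:

```python
import numpy as np

# Impact Psi(0, sp) of a firing-rate profile of width sp (fixed amplitude A)
# on a target cell at x = 0, through a gaussian coupling kernel.
# All parameter values are illustrative.
A, K, s_k = 1.0, 2.0, 15.0
x = np.linspace(-500.0, 500.0, 20001)
dx = x[1] - x[0]
kern = K / np.sqrt(2.0 * np.pi) * np.exp(-0.5 * (x / s_k) ** 2)

def psi0(sp):
    """Convolution of kernel and rate profile, evaluated at x = 0."""
    rate = A * np.exp(-0.5 * (x / sp) ** 2)
    return np.sum(kern * rate) * dx

widths = [1.0, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0]
vals = [psi0(sp) for sp in widths]
total = A * np.sum(kern) * dx   # upper bound: total synaptic efficacy times A

# Closed form for the gaussian case (cf. equation 3.5): K*A*s_k*sp/sqrt(s_k^2+sp^2)
pred = [K * A * s_k * sp / np.sqrt(s_k ** 2 + sp ** 2) for sp in widths]
print(vals, pred, total)
```

The numerically computed values increase monotonically with σ_p, match the gaussian closed form, and saturate just below the bound, in line with the genericity argument of this section.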
Nonetheless, the generic properties of these functions will be the same as before. This becomes clear from the following reasoning: the functions Ψ(x) describe the impact of a source layer on the membrane potential in a target layer at location x. For simplicity, we may assume x = 0. Ψ further is a convolution of the coupling kernel and the tuning function of width σ_p in the source layer. So if σ_p tends to zero, Ψ(0) also tends to zero. If σ_p increases (keeping amplitudes fixed), Ψ(0, σ_p) monotonically increases until it reaches an upper bound when the tuning becomes very broad, that is, flat. In some sense, the function Ψ(x, σ_p) as a function of σ_p defines the fraction of synapses of the target cell at location x that is efficiently addressed by a firing-rate profile of width σ_p. As a function of σ_p it starts at zero for σ_p = 0 and increases monotonically as σ_p grows until it reaches an upper bound determined by the total synaptic efficacy in the respective coupling kernel k(x). This property is generic, whether the domain of integration is cyclic or infinite and/or the tuning functions are assumed to be gaussians, wrapped gaussians, or von Mises functions. Thus, equations 3.6 and 3.7 are generic, such that the quantitative behavior studied in the following sections will qualitatively also appear in more general cases. Furthermore, we should mention that the tuning functions considered later are typically so sharp that a restriction of our model to the interval [−π/2, π/2] would lead to only very small differences as compared with the infinitely extended model.

4 The Semilinear Carandini-Ringach Model
In the previous section the principal approximation method was presented, and equations 3.6 and 3.7 for qualitative stationary properties of the general fixed-point equation 2.5 were derived. These equations determine amplitudes and widths of localized activation peaks in the network layers. To obtain insight into cortical tuning properties, we now apply the method to the Carandini-Ringach model (see equation 2.8) and the two-layer model (see equations 2.9 and 2.10) given in the introduction. In this section, we discuss the network behavior when the rate functions are semilinear. Nonlinear output functions of the class given by equation 2.4 will be taken into account in section 5. Furthermore, we evaluate the quality of the approximation by comparing numerically determined solutions of the original integral equations for the fixed points with approximate results derived from the developed method. To start, observe that the stationary potential profile w₂*(x) of the inhibitory field in the two-layer model is directly given by the excitatory profile w₁*(x) via equation 2.12. This means that all properties of the potential profile in the inhibitory field can be directly computed from those of the excitatory field. Therefore, it suffices to consider only the excitatory field in the sequel. Moreover, because both the Carandini-Ringach model and the two-layer model have equivalent fixed points as long as f₂(w₂) = β₂[w₂]₊, we first restrict attention to that particular case and investigate equation 2.8 using the Mexican hat kernel, equation 2.7. We solved equation 2.8 by simulating the dynamic model, equation 2.6, and determined tuning properties from stationary activity profiles obtained this way. We also compared the results with simulations of the two-field model, equations 2.9 and 2.10. Both models give virtually the same values for h and σ. However, stability properties of the two-field model are somewhat different from those of equation 2.6.
We will come back to that in section 6. The approximation method then leads to the following set of coupled algebraic equations for the response amplitude h and tuning width σ_p = σ/√p of the excitatory cells (cf. also equations 3.6 and 3.7):

h = κ + a(σ_p) hᵖ,  (4.1)

h/(p σ_p²) = I₀/σ₀² + c(σ_p) hᵖ,  (4.2)

where

a(σ_p) = c₊ − c₋,   c(σ_p) = c₊/S₊² − c₋/S₋²,   c± = K± σ± σ_p β / S±,  (4.3)

κ = I₀ − θ,   h = [w*(0) − θ]₊,   S±² = σ±² + σ_p².  (4.4)
Here, K± and σ± are the strengths and widths of the effective excitatory and inhibitory coupling kernels appearing in the DOG kernel (see equation 2.7). Their relation to the parameters of the gaussian kernels in the two-layer model is given at the end of section 2. S± are the standard deviations of the excitatorily (resp. inhibitorily) mediated, approximately gaussian contributions to the spatial membrane potential distributions of the cells. Clearly, S± ≥ σ±, because the standard deviations of the excitatory and inhibitory components of the potentials result from convolutions of the respective synaptic kernels (widths σ±) and the firing-rate profile (width σ_p). Therefore, S±² = σ±² + σ_p² ≥ σ±², which implies that the excitatory and inhibitory contributions to the membrane potentials are always less tuned than the respective synaptic kernels. Note that if p = 1 and θ = 0, then κ = I₀, and equation 4.1 implies h = (1 − a)⁻¹ I₀. Inserting this into equation 4.2, the input I₀ drops out, and we are left with a single equation determining σ_p = σ₁ = σ independent of the input strength (i.e., contrast or some function of contrast; cf. Cheng, Chino, Smith, Hamamoto, & Yoshida, 1995). Inserting the σ found this way into a = a(σ), we furthermore find that (1 − a)⁻¹ is also independent of I₀. Therefore, σ does not depend on I₀, but h = (1 − a)⁻¹ I₀ is proportional to I₀. This proves analytically the result found by Ben-Yishai et al. (1995). In informal terms, in response to varying input, the response amplitude h changes proportionally. Accordingly, the excitatory and inhibitory contributions to the membrane potentials are also proportional to I₀. This means that, besides a multiplicative scaling by the input strength I₀, the potential profile will always be of the same form. Thus, because θ is zero, the tuning width must be constant in response to changes of I₀. Similar explanations have been given by Somers et al. (1995) and Carandini and Ringach (1997).
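This contrast invariance for p = 1 and θ = 0 can be checked by integrating the field equation directly. The sketch below uses the inhibition-only parameters quoted later in this section, but the discretization, domain, and step size are our own choices, so the numbers are only indicative. Since f(λw) = λ f(w) for λ > 0 in this regime, doubling I₀ should double the amplitude and leave the width unchanged:

```python
import numpy as np

def steady_state(I0, steps=4000):
    """Relax tau*dw/dt = -w + I + k*f(w) to its fixed point (p = 1, theta = 0)."""
    x = np.linspace(-180.0, 180.0, 721)          # orientation-like axis [deg]
    dx = x[1] - x[0]
    # Inhibition-only DOG kernel (K_plus = 0). The gain beta is folded into
    # K_minus*beta, so 'rate' below stands for f(w)/beta.
    Km_beta, s_minus = 0.092, 60.0
    kern = -Km_beta / np.sqrt(2.0 * np.pi) * np.exp(-0.5 * (x / s_minus) ** 2)
    I = I0 * np.exp(-0.5 * (x / 23.0) ** 2)      # gaussian input, sigma_0 = 23 deg
    w = np.zeros_like(x)
    for _ in range(steps):
        rec = np.convolve(kern, np.maximum(w, 0.0), mode="same") * dx
        w += 0.05 * (-w + I + rec)               # explicit Euler, dt = 0.05*tau
    i0 = len(x) // 2
    h = w[i0]                                    # peak amplitude [mV]
    wxx = (w[i0 + 1] - 2.0 * w[i0] + w[i0 - 1]) / dx ** 2
    return h, np.sqrt(h / -wxx)                  # width from peak curvature (eq. 3.3)

h1, s1 = steady_state(1.6)
h2, s2 = steady_state(3.2)
print(h1, s1, h2 / h1, s2 / s1)
```

With these choices the amplitude ratio comes out at 2 and the width ratio at 1 to within numerical precision, illustrating the multiplicative scaling of the whole profile.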
More generally, for p = 1 and θ ≠ 0, we get from equations 4.1 and 4.2,

h = μ(σ) κ := κ / (1 − a(σ)),  (4.5)

h/σ² = I₀ · 1/σ₀² + c₊h · 1/S₊² − c₋h · 1/S₋².  (4.6)
Although h, c±, and S± recursively depend on σ in equation 4.6, the inverse variance 1/σ² can be formally envisaged as a weighted sum of the inverse variances of the external input and the recurrent excitatory and inhibitory synaptic inputs. The weighting factors I₀, c₊h, and c₋h represent the strengths of these inputs; that is, the larger one of these prefactors in equation 4.6, the more strongly will the respective component influence σ. If all but one of the factors vanish, that particular component solely determines σ. This happens, for instance, in pure feedforward models, where c₊ = c₋ = 0.
This implies μ(σ) = 1, h = κ, and σ² = σ₀²κ/I₀. In particular, for θ = 0, one gets σ = σ₀ and h = I₀, as should be expected, because under these restrictions the potential profile above threshold is simply a copy of the input profile, [w*(x) − θ]₊ = w*(x) = I(x). Another special case is that of weak external input (in comparison with synaptic inputs). Then the first term containing I₀ on the right-hand side in equation 4.6 can be neglected, and the equation becomes independent of h. Thus, also in this case of strong recurrent processing, σ will be largely independent of the input strength (i.e., I₀), and the response amplitude h will be proportional to I₀. It is somewhat controversial whether external or recurrent inputs dominate in real cortical cells. Recent results suggest that at least in spiny stellate cells in layer IV of cat visual cortex, currents driven by incoming synapses from the LGN may be of the same order as currents mediated by recurrent cortex-intrinsic connections (Stratford, Tarczy-Hornoch, Martin, Bannister, & Jack, 1996). Thus, we may assume that I₀, c₊h, and c₋h are all roughly of the same order. Interestingly, this allows us to calculate some explicit relations for tuning properties simpler than the set of nonlinear algebraic equations 4.5 and 4.6. For that purpose, we consider an example that has also been analyzed by Somers et al. (1995) and Carandini and Ringach (1997). Both articles compare a pure feedforward model for orientation tuning with models that contain feedback inhibition (but no feedback excitation) and a model comprising both excitatory and inhibitory feedback. Results are presented in Figure 5 of Somers et al. (1995) and Figure 3 of Carandini and Ringach's work (1997). Somers et al. (1995) studied the problem by means of biologically detailed computer simulations, whereas Carandini and Ringach used a neural field equation (i.e., equation 2.6).
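Equations 4.5 and 4.6 can also be solved self-consistently in a few lines. The sketch below uses plain fixed-point iteration rather than the branch tracing employed in the article, with the full-model (variant D) parameters quoted in this section; the resulting values should therefore be read as approximations to the dashed curves of Figure 2, not as the exact simulation results:

```python
import numpy as np

# Self-consistent solution of equations 4.5 and 4.6, model variant D,
# by plain fixed-point iteration (a simplification of the article's method).
beta = 15.0                      # gain [ips/mV]
s0, I0 = 23.0, 1.6               # input tuning width [deg] and strength [mV]
s_plus, s_minus = 7.5, 60.0      # widths of the effective DOG components [deg]
Kp_beta, Km_beta = 0.23, 0.092   # K_plus*beta and K_minus*beta [1/deg]

sigma = s_plus                   # start from the width of the excitatory kernel
for _ in range(200):
    Sp = np.sqrt(s_plus ** 2 + sigma ** 2)      # S_plus of equation 4.4
    Sm = np.sqrt(s_minus ** 2 + sigma ** 2)     # S_minus
    cp = Kp_beta * s_plus * sigma / Sp          # c_plus  (beta absorbed in Kp_beta)
    cm = Km_beta * s_minus * sigma / Sm         # c_minus
    h = I0 / (1.0 - (cp - cm))                  # equation 4.5, kappa = I0 (theta = 0)
    rhs = I0 / s0 ** 2 + cp * h / Sp ** 2 - cm * h / Sm ** 2   # equation 4.6
    sigma = np.sqrt(h / rhs)

print(sigma, beta * h)   # approximate tuning width [deg] and peak rate [ips]
```

For these parameters the iteration settles near σ of about 10 deg and a rate of a few tens of ips, in the right ballpark relative to the exact values quoted below for variant D.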
In the following we calculate the response amplitudes and tuning widths analytically. We use equations 2.6 and 2.7 with the same parameters as given by Carandini and Ringach (1997). These authors adapted their network parameters to those given by Somers et al. (1995) to obtain comparable tuning properties. The parameters are τ = 15 ms, β = 15 ips/mV, p = 1, θ = 0, σ₊ = 7.5 deg, σ₋ = 60 deg, σ₀ = 23 deg, I₀ = 1.6 mV. Coupling constants K₊ and K₋ vary according to the different network variants under study: (A) pure feedforward model: K₊ = K₋ = 0; (B) inhibition only: K₊ = 0, K₋β = 0.092/deg; (C) strong inhibition only: K₊ = 0, K₋β = 0.183/deg; and (D) full model: K₊β = 0.23/deg, K₋β = 0.092/deg. Since the normalization factors of our coupling kernels (see equation 2.3) differ from those in Carandini and Ringach (1997), all values are adapted to our notation. The relation between Carandini and Ringach's coupling constants J± given in Table 1 of their work and ours is K± = J±/σ±. There is a slight complication concerning the inhibitory coupling kernel: Carandini and Ringach used a spatial gaussian function with standard deviation 60 deg, which was in addition clipped at 60 deg and then scaled to a total integral of 1. In contrast, we use an infinitely extended gaussian with σ₋ = 60 deg, also normalized
to an integral of 1. Because the nonnormalized clipped gaussian contains only 68% of the mass of the full gaussian, the coupling strengths of the clipped gaussian after normalization are larger by a factor of 1/0.68 = 1.47 than for the nonclipped gaussian. We account for that by appropriately upscaled inhibitory connections. The activation peaks in the sequel are typically considerably sharper than the width of the inhibitory kernel. Because the kernels and spatial activation functions further decay exponentially in their tails, we can expect that effects due to the nonclipping in our model are very small as compared with the model with clipped inhibition. With the stated parameters, Carandini and Ringach (1997) obtained the following firing rates, f(h) = βh, and tuning half-widths, HW (all values taken from simulations using Carandini and Ringach's original Matlab program; cf. also Figure 3 in their article): A: (f(h), HW) = (24.0 ips, 54.1 deg), B: (10.9 ips, 34.5 deg), C: (7.93 ips, 28.8 deg), and D: (57.1 ips, 19.7 deg). We repeated the simulations using our own C programs with almost identical results: A: (24.0 ips, 54.0 deg), B: (10.9 ips, 35.0 deg), C: (7.9 ips, 29.6 deg), D: (56.7 ips, 20.3 deg). Tuning half-width HW here and in Carandini and Ringach's article is the full width at half maximum (FWHM). This can be checked from their results for the pure feedforward model. As already stated, for p = 1, θ = 0, and no recurrent connections, the stationary potential profile is identical to the input profile, which again has to be scaled by the gain factor β of the rate function to obtain the firing-rate profile (i.e., the tuning function). Therefore, in the pure feedforward case, one expects f(h) = βI₀ = 24 ips and σ = σ₀ = 23 deg. The FWHM for a gaussian equals 2√(2 ln 2) σ ≈ 2.35 σ; thus, FWHM = 54 deg, in agreement with Carandini and Ringach's and also our results.
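Two of the numerical constants used above can be checked in a few lines (standard gaussian identities, not specific to the article):

```python
import math

# FWHM of a gaussian: FWHM = 2*sqrt(2*ln 2)*sigma, approximately 2.35*sigma.
fwhm_factor = 2.0 * math.sqrt(2.0 * math.log(2.0))
fwhm_feedforward = fwhm_factor * 23.0    # sigma_0 = 23 deg, feedforward case A

# Mass of a gaussian clipped at one standard deviation: erf(1/sqrt(2)),
# about 0.68; renormalizing the clipped kernel thus upscales couplings by ~1.47.
clipped_mass = math.erf(1.0 / math.sqrt(2.0))
rescale = 1.0 / clipped_mass

print(fwhm_factor, fwhm_feedforward, clipped_mass, rescale)
```

This reproduces the FWHM of about 54 deg for the feedforward case and the rescaling factor of roughly 1.47 for the clipped inhibitory kernel.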
Considering the model variants with inhibition only (cases B and C), observe that c₊ = 0 in equation 4.6 and S₋² > σ₋² ≫ σ² as long as the inhibition is not too small. Therefore we may neglect the third term on the right-hand side of equation 4.6. Inserting h = μ(σ)I₀ from equation 4.5 (with κ = I₀, since θ = 0) and after some rearrangements, one finds a closed equation for σ:

σ = ((σ₀² − σ²) / (K₋β))^(1/3).  (4.7)

Solutions of this equation can be found by fixed-point iteration, that is, by calculating the right-hand side of equation 4.7 recursively, starting from some initial guess (σ = 0). This turns out to converge rather quickly. If we consider just the first iteration starting from σ = 0, we get a rough first estimate of the expected tuning width:

σ = (σ₀² / (K₋β))^(1/3).  (4.8)
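The fixed-point iteration of equation 4.7 is a one-liner; below it is run for model variant B (parameters from this section), with the first iterate corresponding to the rough estimate, equation 4.8:

```python
# Fixed-point iteration of equation 4.7 for model variant B
# (inhibition only: K_minus*beta = 0.092/deg, sigma_0 = 23 deg).
s0, Km_beta = 23.0, 0.092

sigma = 0.0                      # starting guess; the first iterate is equation 4.8
first = None
for i in range(100):
    sigma = ((s0 ** 2 - sigma ** 2) / Km_beta) ** (1.0 / 3.0)
    if i == 0:
        first = sigma

print(first, sigma)
```

The iteration converges to roughly 15 deg, consistent with the half-width of about 35 deg reported for case B (35 / 2.35 is approximately 14.9 deg).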
Figure 2: Tuning properties of the Carandini-Ringach model with semilinear rate functions for varying excitatory and inhibitory coupling strength. Shown in A and B are response amplitudes h and tuning widths σ when the recurrent excitation is zero and the inhibitory coupling strength increases. C and D display h and σ when inhibition is fixed at K₋β = 0.092 but excitation grows. Solid lines denote "exact" results from direct simulations of equations 2.6 and 2.7, dashed lines are solutions of equations 4.5 and 4.6, and the dotted lines result from the approximations 4.8 and 4.5 in A and B, and 4.9 and 4.5 in C and D. Plus signs indicate values found by Carandini and Ringach (1997).
Inserting σ calculated this way into equation 4.5, we can further estimate h. Although somewhat ad hoc, this approximation nonetheless gives reasonable results as long as K₋β is not too small (see Figure 2). Moreover, σ will increase when the input tuning width increases and decrease if the inhibitory strength K₋ grows. Asymptotically, σ tends to zero for K₋ → ∞, but it always stays finite for finite coupling K₋. Similarly, for the full model (model variant D), we can again neglect the third term in equation 4.6. Moreover, if the recurrent amplification is moderately strong, such that the excitatory feedback c₊h is comparable to or dominates I₀, the tuning width σ will become sharp and approach roughly
the width of the excitatory coupling kernel σ₊. Because the input tuning is broad (σ₀ ≫ σ ≈ σ₊), we may also skip the first term on the right-hand side in equation 4.6 and, in addition, assume σ ≈ σ₊ as an initial guess. Then we find from equation 4.6,

σ = √2 (σ₊² / (K₊β))^(1/3),  (4.9)
which can again be used to calculate h via equation 4.5. Now σ appears to be independent of the properties of the input and of the inhibitory feedback kernel. Thus, if the recurrent amplification is relatively strong and recurrent inhibition is broadly tuned in comparison with recurrent excitation, essentially the excitatory coupling kernel determines the tuning width. We checked the analytical results given above numerically. Figure 2 presents response amplitudes and tuning widths for the model with inhibition only in plots A and B, and for the full model with inhibition fixed at K₋β = 0.092 in plots C and D. Model variants A, B, and C as described above correspond with K₋β = 0, 0.092, and 0.183 in Figures 2A and 2B, and variant D corresponds with K₊β = 0.23 in Figures 2C and 2D. Solid lines in Figure 2 are exact values derived from numerical simulations of equations 2.6 and 2.7, dashed lines represent solutions of equations 4.5 and 4.6 found by branch-tracing methods (Adams-Bashforth predictor and Newton corrector; cf. Seydel, 1988), and the dotted lines result from the rough approximations 4.8 and 4.9 for σ in the respective regimes, together with equation 4.5 to calculate μ(σ) and h. In addition to our results, plus signs in Figure 2 indicate values found by Carandini and Ringach (1997). Their reported amplitudes and half-widths agree well with our exact simulations. Nonetheless, observe that only the response amplitudes h in Figure 2 agree with the exact solutions that we found (solid lines). The effective tuning widths σ deviate because they are determined differently. In our simulations, we compute them from the peak curvature and amplitude at zero: σ² = −w*(0)/w*_xx(0) (cf. equation 3.3). Carandini and Ringach, on the other hand, gave FWHM values in their article. From those we computed the sigmas shown as plus signs in Figures 2B and 2D, assuming that the peaks are approximately gaussian shaped, that is, σ ≈ FWHM/2.35.
Of course, the activity peaks in the simulations are typically not perfect gaussians, so both approaches give only estimates of the tuning width, which explains the observable differences. Comparing the exact and approximate curves, observe that the exact values and those derived from equations 4.5 and 4.6 agree well with each other. Thus, the gaussian approximation described in section 3 conserves the qualitative properties of the stationary potential and firing-rate profiles of the full field model with high accuracy. This validates the principal approach and enables a calculation of tuning properties from a few model parameters without performing numerically expensive simulations.
Furthermore, even the approximations 4.8 and 4.9 provide reasonable first estimates for σ and surprisingly good results for the response amplitudes h and the cortical amplification factors μ = h/I₀ (cf. Figures 2A and 2C). As has to be expected from the assumptions underlying equations 4.8 and 4.9, the σ's derived this way diverge when K₋ (resp. K₊) tends to zero in Figures 2B and 2D. However, for moderate and large coupling strengths, the formulas govern the qualitative behavior of the exact curves and may be used to calculate approximate tuning widths. Carandini and Ringach (1997), as well as Somers et al. (1995), report that for strong excitatory couplings in the full model, tuning curves become essentially flat (i.e., untuned), with all cells firing at high rates. We did not observe this in our model. Instead, the tuning width decreases continuously for increasing excitatory coupling strength (see Figure 2D). However, as Figure 2C shows, when K₊β approaches approximately the value 0.25, the cortical amplification factor becomes infinite. Above that value, stationary solutions turn out to be unstable, and firing rates grow unboundedly with time. The reason for this difference from the work of Carandini and Ringach (1997) and Somers et al. (1995) is that in their models the rate functions saturate at high rates and are not truly semilinear, as in our case.

5 The Nonlinear Carandini-Ringach Model
In this section we consider the impact of nonlinear output functions on tuning properties. First we investigate equation 2.8 with p ≠ 1 and θ = 0. Figure 3 shows results for parameters K₁β = 0.5, K₂β = 0.25, σ₀ = 30.0, σ₊ = 7.0, σ₋ = 30.0, θ = 0, and different values of the nonlinearity exponent p (p = 0.5, 1, 1.5, 2, 2.5). Plotted against the input intensity I₀ are the tuning width σ_p in Figure 3A and the response amplitudes h in Figure 3B. Solid lines are computed from equations 3.6 and 3.7 using branch-tracing methods; dashed lines represent "exact" solutions derived from numerical simulations of the nonlinear Carandini-Ringach equations. Furthermore, the case of semilinear output functions, p = 1, is highlighted by thick solid lines. These curves are again characterized by a constant tuning width and a linear growth of the response amplitude with input strength. As can be seen in Figure 3A, the curves for the exact and approximated tuning widths almost coincide for all p, so the approximation for σ_p is very good. Furthermore, observe that not all of the fixed-point branches predicted by the approximate equations 3.6 and 3.7 are stable and can therefore be observed in simulations of the full Carandini-Ringach equations (dashed lines). Unstable branches are denoted by small u's in Figure 3. These solutions also exist in the full field model, but they are not stable against small perturbations. Comparing response amplitudes h in dependence on the input strength as displayed in Figure 3B, we find that results predicted by equations 3.6 and 3.7 agree well with those derived from the full model at
Figure 3: Tuning properties for different values of the nonlinearity exponent p. Shown are (A) tuning widths σ_p and (B) response amplitudes h in dependence on the input strength I₀ (i.e., contrast). Thick lines indicate the semilinear case p = 1; solid lines are derived from approximations 3.6 and 3.7, and the dashed lines present simulation results. Note that not all fixed-point branches are stable (i.e., observable in simulations). Unstable branches are denoted by small u's.
least as long as p ≥ 1 (i.e., p = 1.5, 2, 2.5 in Figure 3B). For the semilinear rate function, p = 1, and the square root, p = 1/2, the figure reveals differences between exact and approximated solutions. In the case of the square root function, these differences result solely from the gaussian approximation. Simulations reveal that for p → 0, keeping other parameters fixed, the firing-rate profile approaches a rectangular form without significantly changing its amplitude and its width at baseline (simulations not shown). Then the gaussian approximation can no longer be expected to give good quantitative results. Nonetheless, although the approximated amplitudes for p = 0.5 are roughly 50% below the exact stable branch, they still govern
the qualitative behavior of the true solutions. In particular, they predict the bifurcation around I_0 = −0.17 well. For p = 1, the exact and approximated curves are both linear, but their slopes differ somewhat. This is mainly because, for the chosen parameters, the system operates not far from the bifurcation from stable fixed points to exponentially growing solutions described in the previous section. In that region, the slope of the h(I_0) relation is rather sensitive to perturbations (compare Figure 2C), so approximation errors become more pronounced. The approximate formulas, 4.9 and 4.5, result in a width of σ ≈ 6.5 and a recurrent amplification factor of m ≈ 5.0. In comparison, Figures 3C and 3D (resp. the numerical simulations) give exact values of σ = 6.3 and m = 7.5, in reasonable agreement with the obtained approximations. We now derive some further approximate formulas for properties of the tuning functions shown in Figure 3. First we consider the stable branches for p > 1 in the limit I_0 → 0. In that case, the feedback input a·h^p becomes vanishingly small compared with the external input I_0, and from equation 4.1 one gets h ≈ I_0, which can also be checked in Figure 3B. Moreover, since the membrane potentials are basically equal to the input distribution I(x) for negligible feedback, the tuning widths must be σ_p = σ_0/√p in the limit I_0 → 0. In fact, the simulations reveal σ_p = 24.8, 21.2, 18.9 for p = 1.5, 2, 2.5, well in accord with that prediction. Observe that although the response amplitudes are approximately linear for small I_0, the tuning widths depend strongly on I_0; they are constant only for p = 1. When I_0 grows, the recurrent signals increase more strongly than linearly with I_0 (because p > 1), so the excitatory and inhibitory contributions to the membrane potential begin to dominate the input. This implies that the tuning width σ is increasingly determined by the widths of the recurrent coupling kernels (cf. equation 4.2).
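The zero-input prediction σ_p = σ_0/√p can be checked directly against the simulated widths quoted above. A quick sketch (σ_0 = 30.0 is taken from the Figure 3 parameters):

```python
from math import sqrt

# Check the small-input limit sigma_p = sigma_0 / sqrt(p) against the
# simulated tuning widths quoted in the text (sigma_0 = 30.0, Figure 3).
sigma0 = 30.0
simulated = {1.5: 24.8, 2.0: 21.2, 2.5: 18.9}
for p, sim in simulated.items():
    pred = sigma0 / sqrt(p)
    print(f"p = {p}: predicted {pred:.2f}, simulated {sim}")
```

The predictions (24.49, 21.21, 18.97) sit within a few tenths of the simulated values.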
Since σ_− ≫ σ_+, the main impact will be due to the excitatory kernel. Thus, for small I_0 the width σ is determined by the input width σ_0, but for increasing input, σ will progressively decline and approach the width of the excitatory kernel the more strongly a·h^p dominates over the input I_0. For that reason, σ_p = σ/√p declines in Figure 3D for p = 1.5, 2, 2.5. Finally, at the bifurcation point, I_0 = I_0^c, recurrent excitation becomes so strong that the network activity explodes. (The tuning then still remains sharp, because the excitatory feedback dominates in the equation for σ, equation 4.2, but the network activity grows unboundedly.) We can treat the unstable branches for p > 1 in the limit I_0 → 0 similarly. In that case h ≫ I_0, and setting I_0 = 0 in equation 4.1 results in h ≈ a^{1/(1−p)}. In addition, in equation 4.2 we can again neglect the inhibitory contribution because σ_− is large in comparison with σ_+. Setting I_0 = 0 and assuming σ_p² = σ_+²/p as an initial guess, we find from equation 4.2 after some manipulations,
    σ_p ≈ ((3/2)(1 + 1/p))^{1/2} σ_+ / (K_b^+ √p)^{1/3},    h ≈ a(σ_p)^{1/(1−p)},    (5.1)
Table 1: Response Amplitudes and Tuning Widths in the Limit I_0 → 0 for the Unstable Branches with p ≥ 1 in Figure 3.

        Branch Tracing      Approximation (equation 5.1)
p       σ_p     h           σ_p     h
1.0     6.3     0           6.5     0
1.5     4.43    1.65        5.2     1.54
2.0     3.75    1.4         4.5     1.28
2.5     3.36    1.3         4.0     1.21
where a(σ_p) is defined in equation 4.3. As a special case, for p = 1 we recover equation 4.9 from the equation for σ_p in equation 5.1. Table 1 shows that the values obtained from equation 5.1 provide good estimates for h and σ_p in the limit I_0 → 0. Note that we cannot compare the results with exact simulations of the full Carandini-Ringach model, because the fixed-point branches discussed here are unstable. If I_0 becomes positive, observe that the recurrent signals will still be considerably larger (see Figure 3B), a·h^p ≫ I_0. Therefore, the tuning along the unstable branch will be mainly determined by the recurrent signals and, in particular, by the excitatory kernel. Because both the excitatory and inhibitory contributions to the membrane potentials are proportional to a·h^p, the tuning will be roughly independent of the input strength, for the same reason as discussed in the case of p = 1. The shape of the potential profile is scaled by a·h^p but otherwise (almost) independent of h and thus of I_0. Therefore, the width at a threshold level of θ = 0 must be approximately constant. This explains why the tuning of the unstable branches for p = 1.5, 2, 2.5 in Figure 3A is almost flat. Finally, we consider the locations of the bifurcations in Figure 3 that appear for p > 1 as I_0 increases. For instance, for p = 2, this bifurcation occurs at I_0 ≈ 0.3 with σ_p ≈ 5.4 and h ≈ 0.62. For inputs larger than I_0 ≈ 0.3, stable localized solutions do not exist; instead, the network activity grows without bound. At the critical points, the slope of h(I_0) becomes infinite. Taking derivatives with respect to I_0 in equation 4.1 leads, after some simple calculations, to a critical response amplitude of h_c = (ap)^{1/(1−p)}. The critical input is then given by I_0^c = h_c − a(σ_p) h_c^p (cf. equation 4.1). To estimate a(σ_p), observe that σ_p is almost independent of I_0 on the unstable branches for p > 1 (cf. Figure 3). Thus, we may simply take the σ_p's calculated from equation 5.1.
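These critical-point formulas can be checked numerically. The sketch below (not from the original paper) recovers a from the zero-input unstable amplitude h ≈ a^{1/(1−p)} listed in the Approximation column of Table 1, then evaluates h_c = (ap)^{1/(1−p)} and I_0^c = h_c − a·h_c^p; the results closely reproduce the Approximation column of Table 2:

```python
# Numerical check (not from the original paper) of the critical-point
# formulas h_c = (a*p)**(1/(1-p)) and I0c = h_c - a*h_c**p, with a
# recovered from the zero-input unstable amplitude h0 = a**(1/(1-p)).

def critical_point(p, h0):
    a = h0 ** (1.0 - p)                  # invert h0 = a**(1/(1-p))
    hc = (a * p) ** (1.0 / (1.0 - p))    # slope of h(I0) diverges here
    i0c = hc - a * hc ** p               # critical input from equation 4.1
    return hc, i0c

for p, h0 in [(1.5, 1.54), (2.0, 1.28), (2.5, 1.21)]:
    hc, i0c = critical_point(p, h0)
    print(f"p = {p}: h_c = {hc:.2f}, I0c = {i0c:.3f}")
```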
Critical parameters obtained this way are compared in Table 2 with simulations of the full model and with approximations found by tracing solutions of equations 4.1 and 4.2. Again, these equations, as well as the further approximations derived above, predict the critical-point properties well. There is a striking difference in the bifurcation behavior of concave (p < 1) versus convex (p > 1) rate functions that appears to be important for cortical
Table 2: Critical Point Parameters for the Branches in Figure 3 with p > 1.

        Simulation              Branch Tracing          Approximation
p       σ_p    h_c    I_0^c    σ_p    h_c    I_0^c     σ_p    h_c    I_0^c
1.5     5.5    .65    .21      5.6    .66    .225      5.2    .68    .227
2.0     5.6    .57    .31      5.4    .62    .308      4.5    .64    .321
2.5     6.2    .58    .37      5.4    .63    .375      4.0    .66    .395
information processing. Figure 3B shows that in a neighborhood of I_0 = 0, both concave and convex rate functions lead to two fixed-point branches. However, for p > 1 the upper fixed point is unstable and the lower one stable, whereas for p < 1 it is just the other way around. In particular, for p < 1 we find a stable fixed point at relatively high rates, which is well tuned (see Figure 3A). This means that concave rate functions support the existence of stable hysteresis, but convex functions do not. For concave functions, after a localized activation profile has been set up by some input stimulus, the activity peak can remain even if the input disappears. The firing rates will be somewhat lower than in the case with input, but the recurrent activity will be strong enough to keep the network in a stable state of persistent activity, and the tuning of the activity profile will even become somewhat sharper. In this way, the network can "memorize" properties of a recent stimulus, insofar as they are coded in the location of the activity peak, which in this context represents the angle of orientation. The strength of the stimulus, corresponding to stimulus contrast, is not memorized, because the response amplitude for vanishing input is a fixed quantity determined by the coupling kernels. To date, there is no evidence that "memory" of this kind exists in primary visual cortex, which may suggest that cells in V1 do not operate in a concave regime of their rate functions (cf. Carandini & Ringach, 1997; Douglas, Koch, Mahowald, Martin, & Suarez, 1995). Nonetheless, the described phenomenon might be related to persistent firing states that have been observed during delay periods in delayed-matching-to-sample tasks in higher visual areas and other cortical regions (Fuster, 1994). Closing this section, we emphasize that for more general parameter sets than those discussed, or for more general (e.g., sigmoid) output functions, the network behavior can become increasingly complex.
The same holds if more than just one or two layers of cells are considered. Then one typically has to solve the approximations 3.6 and 3.7 numerically, although in some situations it may still be possible to infer qualitative tuning properties using arguments as above. The behavior of response amplitudes may be derived from the nature of the fixed-point equation 3.6, and considerations regarding the relative contribution of input versus excitatory and inhibitory feedback (see equation 3.7) may be useful to predict the dependence of tuning widths on input strength.
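The numerical step can be sketched on the scalar caricature h = I_0 + a·h^p of the amplitude equation (with σ_p frozen; the values of a and p below are illustrative, not model output): step the input I_0 and polish the root by Newton's method, reusing the previous solution as the warm start, as in branch tracing.

```python
# Naive branch tracing on the scalar amplitude equation h = I0 + a*h**p
# (a, p illustrative; sigma_p treated as frozen): step I0 and use Newton's
# method, warm-started from the previous root, to follow one branch.

def trace_branch(a, p, i0_values, h_start=0.0):
    h, branch = h_start, []
    for i0 in i0_values:
        for _ in range(60):                        # Newton iterations
            f = h - i0 - a * h ** p
            h -= f / (1.0 - a * p * h ** (p - 1.0))
        branch.append(h)
    return branch

i0s = [0.05, 0.10, 0.15, 0.20]
lower = trace_branch(a=0.78, p=2.0, i0_values=i0s)  # stable branch for p > 1
print([round(h, 3) for h in lower])                 # → [0.052, 0.109, 0.173, 0.248]
```

For p = 2 the traced roots agree with the closed-form lower solution of the quadratic, and the branch terminates at the fold I_0^c = 1/(4a), where the two branches meet.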
6 Discussion
In summary, we derived an approximation scheme that allows the analysis of localized steady-state activation peaks in coupled nonlinear neural field models. The method essentially approximates the suprathreshold activity profiles by equivalent gaussian-shaped ones. Since the coupling kernels are assumed to be gaussian too, this enables the calculation of the membrane potential profiles in the different layers, which appear as additive superpositions of approximately gaussian-shaped components due to the input and the recurrent excitatory and inhibitory signals, respectively. Furthermore, nonlinear rate functions in the form of semipower functions can be considered. This extends earlier approaches to studying neural fields, which often were based on linear (or linearized) models and Fourier-Laplace methods or relied solely on computer simulations. In contrast, our method takes full account of the nonlinearity in the model. Moreover, explicit approximate equations that relate tuning properties to the network parameters can often be determined. Steady states of the neural field models are given by a set of nonlinear integral equations (e.g., equation 2.5). The approximation method reduces these equations to a set of coupled nonlinear algebraic equations, 3.6 and 3.7. Equations 3.6 for i = 1, …, n can be envisaged as determining the response amplitudes, whereas the second set, equation 3.7, determines the tuning widths. Fixed-point equations of the form 3.6 also appear, in very similar form, in mean-field models of identically coupled neurons without spatial structure (e.g., Amari, 1972; Schuster & Wagner, 1990a, 1990b). The prefactors c_ij are then constant, because tuning widths σ_i are, of course, undefined in those models. Qualitatively, the behavior of steady states in completely coupled models is comparable to that of the response amplitudes in the present field models, although the input-dependent tuning widths somewhat complicate the response behavior.
The special structure of equations 3.7, which determine the tuning widths σ_i, shows that the inverse variance of any of the suprathreshold membrane profiles appears as a mixture of the respective variances of the input profile and of the excitatory and inhibitory contributions to the potential profile. These variances are weighted by the amplitudes of the respective components. Therefore, in general, the tuning widths will be mixtures of the widths of input and recurrent components. Only if one of these contributions becomes particularly strong does it dominate the tuning width. For vanishing firing thresholds, the equations for the response amplitudes decouple from those for the tuning widths. This implies that the tuning widths are independent of the input strength and that the response amplitudes depend linearly on the input strength. This proves the results in Somers et al. (1995) and Ben-Yishai et al. (1995) analytically and holds for arbitrary input strength as well as for any relative weighting of input strength and recurrent signals. The relative weighting only determines which components dominate the tuning behavior. As limiting cases, for weak recurrent connections the cortical tuning is given by the input tuning from LGN to cortex, and for strong recurrent connections the width of the excitatory kernel determines the width of the tuning. In general, the tuning width is a mixture of both components. Experimental results suggest that this might be the case in visual cortex (Stratford et al., 1996). Comparing the influence of nonlinear rate functions on tuning properties in the Carandini-Ringach model, one observes that only for the semilinear rate function with vanishing threshold will σ be independent of the input strength and h(I_0) be linear. Moreover, the tuning scenarios differ for convex and concave rate functions: if f is convex (p > 1), one observes two fixed-point branches in response to varying input strength, a stable one at low rates and an unstable one at higher rates. For weak input, the stable branch is input dominated; that is, it is broadly tuned, and h is roughly proportional to I_0. For larger inputs, h(I_0) grows supralinearly, and recurrent signals gradually dominate. The tuning then becomes sharp until a critical input is reached, above which the network activity is unstable. In contrast, for concave rate functions (p < 1), the network can exhibit hysteresis between two states: one at low (actually zero) firing rate and one at high rates. The upper state is sharply tuned and can exist even for vanishing input. Therefore, the network can store information about a previously presented stimulus even if that input disappears. However, only the orientation of the stimulus is stored, not the absolute input strength (i.e., contrast). Persistent activity of this type might be related to delayed activity found in different cortical areas in delayed-matching-to-sample tasks (Fuster, 1994).
In primary visual cortex, delayed activity has not been reported so far, suggesting that cells in that area operate in the convex regime of their rate functions or are linear (cf. Douglas et al., 1995). Finally, truly sigmoid rate functions consist of concave and convex regions. Therefore, they can combine all of the above behavior: multiple fixed points at low rates and hysteresis at high rates. Clearly, for true sigmoids, the steady-state behavior may be even more complex than described. This, for instance, is suggested by studies considering the stationary behavior of completely connected networks (Amari, 1972; Schuster & Wagner, 1990a, 1990b). An important difference between the Carandini-Ringach (1997) model, or that of Ben-Yishai et al. (1995), and the two-field model defined in this work is related to stability issues. Simulations indicate that the one-field models have only instabilities that lead to exponentially growing solutions in time (these can, of course, be bounded by saturating rate functions). The two-field model, in contrast, can reveal oscillations with frequencies typically in the beta and gamma range (simulations not shown). Similar oscillations are well known in completely connected networks (Amari, 1972; Schuster & Wagner, 1990a, 1990b; Skarda & Freeman, 1987), where they occur as mass oscillations if both excitation and inhibition are relatively strong. In such oscillatory states, not only the response amplitudes are time
dependent, but also the tuning widths are rhythmically modulated. In this way, beta and/or gamma oscillations may support a high-resolution mode of visual information processing. Somers et al. (1995) simulated excitatory and inhibitory cells in their network, but they did not report the occurrence of mass oscillations. In simulations of the two-field model, we also did not observe oscillations when parameters were used as given by Carandini and Ringach (1997), which are directly derived from Somers et al.'s (1995) physiological estimates. This may explain why oscillations are not reported in Somers et al. (1995). However, doubling excitation and inhibition in the two-field model did lead to oscillations (simulations not shown). Therefore, one may conclude that the cortex operates relatively near a bifurcation point from stationary behavior to gamma oscillations (cf. also Wennekers, Sommer, & Palm, 1995). Although applied here to the one-field Carandini-Ringach model and the two-field excitatory-inhibitory model, the method derived is considerably more general and should be useful wherever field models present a suitable level of description. This may apply, for instance, to head-direction cells in rats (Zhang, 1996), place cells and spatial navigation (Samsonovich & McNaughton, 1997), and other contexts (Engels & Schöner, 1995; Wennekers & Palm, 1996; Wilson & Cowan, 1973). In addition, it seems possible to generalize the method from the study of steady-state solutions of the full field model (see equation 2.1) to time-dependent solutions like the oscillations or other spatiotemporal receptive field phenomena. Preliminary results in that direction can be found in Suder et al. (2001).

References

Adorján, P., Levitt, J. B., Lund, J. S., & Obermayer, K. (1999). A model for the intracortical origin of orientation preference and tuning in macaque striate cortex. Vis. Neurosci., 16, 303–318.
Amari, S.-I. (1972). Characteristics of random nets of analog neuron-like elements. IEEE T. Syst. Man. Cyb., 2, 643–657.
Amari, S.-I. (1977). Dynamics of pattern formation in lateral-inhibition type networks. Biol. Cybern., 27, 77–87.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Nat. Acad. Sci. USA, 92, 3844–3848.
Carandini, M., & Ringach, D. L. (1997). Predictions of a recurrent model of orientation selectivity. Vision Res., 37, 3061–3071.
Cheng, H., Chino, Y., Smith, E., Hamamoto, J., & Yoshida, K. (1995). Transfer characteristics of lateral geniculate nucleus X neurons in the cat: Effects of spatial frequency and contrast. J. Neurophysiol., 74, 2548–2557.
Douglas, R., Koch, C., Mahowald, M., Martin, K. A. C., & Suarez, H. H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–998.
Engels, C., & Schöner, G. (1995). Dynamic fields endow behavior-based robots with representations. Robotics and Autonomous Systems, 14, 55–77.
Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–252.
Ferster, D., & Koch, C. (1987). Neuronal connections underlying orientation selectivity in cat visual cortex. Trends Neurosci., 10, 487–492.
Fuster, J. (1994). Memory in the cerebral cortex. Cambridge, MA: MIT Press.
Heeger, D. J., Simoncelli, E. P., & Movshon, J. A. (1996). Computational models of cortical visual processing. Proc. Natl. Acad. Sci. USA, 93, 623–627.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. Lond., 160, 106–154.
Krone, G., Mallot, H., Palm, G., & Schüz, A. (1986). Spatiotemporal receptive fields: A dynamical model derived from cortical architectonics. Proc. Roy. Soc. Lond. B, 226, 421–444.
Mineiro, P., & Zipser, D. (1998). Analysis of direction selectivity arising from recurrent cortical interactions. Neural Computation, 10, 353–371.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1993). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Reid, R. C., & Alonso, J.-M. (1996). The processing and encoding of information in the visual cortex. Curr. Opin. Neurobio., 6, 465–480.
Sabatini, S. P. (1997). Recurrent inhibition and clustered connectivity as a basis for Gabor-like receptive fields in the visual cortex. Biol. Cybern., 74, 189–202.
Samsonovich, A., & McNaughton, B. L. (1997). Path integration and cognitive mapping in a continuous attractor neural network model. J. Neurosci., 17, 5900–5920.
Sato, H., Katsuyama, N., Tamura, H., Hata, Y., & Tsumoto, T. (1996). Mechanisms underlying orientation selectivity in the primary visual cortex of the macaque. J. Physiol., 496, 757–771.
Schuster, H., & Wagner, P. (1990a). A model for neuronal oscillations in the visual cortex. I. Mean-field theory and the derivation of the phase equations. Biol. Cybern., 64, 77–82.
Schuster, H., & Wagner, P. (1990b). A model for neuronal oscillations in the visual cortex. II. Phase description and feature dependent synchronization. Biol. Cybern., 64, 83–85.
Sclar, G., & Freeman, R. (1982). Orientation selectivity in the cat's striate cortex is invariant with stimulus contrast. Exp. Brain Res., 46, 457–461.
Seydel, R. (1988). From bifurcation to chaos. New York: Elsevier Science Publishing.
Skarda, C. A., & Freeman, W. (1987). How brains make chaos in order to make sense of the world. Beh. Brain. Sci., 10, 161–195.
Skottun, B., Bradley, A., Sclar, G., Ohzawa, I., & Freeman, R. (1987). The effects of contrast on visual orientation and spatial frequency discrimination. J. Neurophysiol., 57, 773–786.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortex simple cells. J. Neurosci., 15, 5448–5465.
Sompolinsky, H., & Shapley, R. (1997). New perspectives on the mechanism for orientation selectivity. Curr. Opin. Neurobio., 7, 514–522.
Stratford, K. J., Tarczy-Hornoch, K., Martin, K. A. C., Bannister, N. J., & Jack, J. J. B. (1996). Excitatory synaptic inputs to spiny stellate cells in cat visual cortex. Nature, 382, 258–261.
Suder, K., Wörgötter, F., & Wennekers, T. (2001). Neural field model of receptive field restructuring in primary visual cortex. Neural Computation, 13, 139–159.
Swindale, N. V. (1998). Orientation tuning curves: Empirical description and estimation of parameters. Biol. Cybern., 78, 45–56.
Troyer, T. W., Krukowski, A. E., Priebe, N. J., & Miller, K. D. (1998). Contrast-invariant orientation tuning in cat visual cortex: Thalamocortical input tuning and correlation-based intracortical connectivity. J. Neurosci., 18, 5908–5927.
Vidyasagar, T. R., Pei, X., & Volgushev, M. (1996). Multiple mechanisms underlying the orientation selectivity of visual cortical neurones. Trends Neurosci., 19, 272–277.
von Seelen, W., Mallot, H. A., & Giannakopoulos, F. (1987). Characteristics of neuronal systems in the visual cortex. Biol. Cybern., 56, 37–49.
Wennekers, T., & Palm, G. (1996). Controlling the speed of synfire chains. In C. von der Malsburg et al. (Eds.), Proceedings of the International Conference on Artificial Neural Networks ICANN96 (pp. 451–456). Berlin: Springer-Verlag.
Wennekers, T., Sommer, F. T., & Palm, G. (1995). Iterative retrieval in associative memories by threshold control of different neural models. In H. J. Herrmann, D. E. Wolf, & E. Pöppel (Eds.), Supercomputing in brain research: From tomography to neural networks (pp. 301–319). Singapore: World Scientific Publishing.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Wörgötter, F., & Koch, C. (1991). A detailed model of the primary visual pathway in the cat: Comparison of afferent excitatory and intracortical inhibitory connecting schemes for orientation selectivity. J. Neurosci., 11, 1959–1979.
Zhang, K.-C. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neurosci., 16, 2112–2126.

Received April 24, 2000; accepted October 25, 2000.
LETTER
Communicated by Jack Cowan
Computational Design and Nonlinear Dynamics of a Recurrent Network Model of the Primary Visual Cortex*

Zhaoping Li
GCNU, University College London, London WC1N 3AR, U.K.

Recurrent interactions in the primary visual cortex make its output a complex nonlinear transform of its input. This transform serves preattentive visual segmentation; that is, it autonomously processes visual inputs to give outputs that selectively emphasize certain features for segmentation. An analytical understanding of the nonlinear dynamics of the recurrent neural circuit is essential to harness its computational power. We derive requirements on the neural architecture, components, and connection weights of a biologically plausible model of the cortex such that region segmentation, figure-ground segregation, and contour enhancement can be achieved simultaneously. In addition, we analyze the conditions governing neural oscillations, illusory contours, and the absence of visual hallucinations. Many of our analytical techniques can be applied to other recurrent networks with translation-invariant neural and connection structures.

1 Introduction
Recurrent neural dynamics is a basic computational substrate for cortical processing. In the primary visual cortex, this recurrent dynamics is instantiated by finite-range, lateral, intracortical neural connections. The input to the cortex is the retinal image filtered through cortical receptive fields (RFs), shaped like small edges or bars and retinotopically distributed in visual space. The outputs of the cortex are the cell activities, which can be viewed as a complex nonlinear transform of the input under the recurrent interactions. Two characteristics of this transform follow immediately. First, if we focus on cases when top-down feedback from higher visual areas does not change during the course of the transform, the primary cortical computation is autonomous, suggesting that the computation concerned is preattentive in nature. In other words, we consider cases when feedback from higher visual areas is purely passive and its role is merely to set a background or operating point for V1 computation. This enables us to isolate
* A preliminary version of this article was published as "Neural dynamics in a recurrent network model of primary visual cortex" in Proceedings of ICANN99.
Neural Computation 13, 1749–1780 (2001) © 2001 Massachusetts Institute of Technology
the recurrent dynamics in V1 for thorough study. Of course, more extensive computations can doubtless be performed when V1 interacts dynamically with other visual areas; however, this is beyond the scope of the article. The second characteristic is that the recurrent dynamics enables computations to occur at a global scale, despite the local connectivity. The output of a V1 cell depends nonlocally on its inputs in a way that is hard to achieve in feedforward networks with only retinotopically organized connections. Physiological and psychophysical data suggest that V1 implements preattentive computations such as contour enhancement, texture segmentation, and figure-ground segregation (Kapadia, Ito, Gilbert, & Westheimer, 1995; Gallant, van Essen, & Nothdurft, 1995; Knierim & van Essen, 1992). To perform these tasks, V1 functions as a saliency circuit that gives higher responses to locations of higher saliency in inputs, such as the borders between texture regions, pop-out figures against backgrounds, and smooth contours (Li, 1997, 1998, 1999a, 1999b). Such preattentive segmentation is known to be quite difficult, especially considering that the same cortical circuit needs to achieve both contour enhancement and region or figure-ground segmentation and that there is still no known general solution to segmentation after decades of research in machine and natural visual algorithms. Various models of the cortex have addressed particular components of the cortical computation, such as contour enhancement, that is, the relatively higher activities of cells receiving inputs arising from bars belonging to smooth contours (Grossberg & Mingolla, 1985; Zucker, Dobbins, & Iverson, 1989; Yen & Finkel, 1998). It is already very hard to model contour enhancement successfully in the cortex (Li, 1998). Previous efforts (Grossberg & Mingolla, 1985) have been made to capture both contour enhancement and texture segmentation in a single model.
However, a fully functional and dynamically well-behaved model has only recently been proposed (Li, 1997, 1999a). Understanding the complex, recurrent, and nonlinear neural dynamics underlying the computation is essential to marshal its power. Li (1997, 1998, 1999a) introduced and described the structure and behavior of a recurrent model of V1 that simultaneously achieves the desired components of the computation. In this article, we describe the mathematical analysis of the nonlinear dynamics that enabled the computational design of our model. We study issues such as network architecture, computational constraints, and dynamic stability that are directly relevant to the global-scale computation. By contrast, single-unit properties, such as orientation tuning, that are less relevant to the global-scale computation will not be a focus. Some of our analytical techniques, such as the analysis of the cortical microcircuit and the stability study of translation-invariant networks, can be applied to study other cortical areas that share the common properties of neural elements, connections, and the canonical microcircuit (Shepherd, 1990).
Design and Analysis of a Recurrent V1 Model
2 A Minimal Model of the Primary Visual Cortex
A minimal model of the cortex has just enough components to execute the necessary computations without excess details. It is essentially a subjective matter what a minimal model is, since there is no recipe for a minimalist design. However, we present as a candidate a model that instantiates all the desired computation, but for which simplified versions fail. Throughout the article, we try to keep our analysis general in discussing characteristics of the recurrent dynamics. However, to illustrate or demonstrate particular analytical results and approximation and simplification techniques, we often use a model of V1 whose specifics and numerical parameters are available (Li, 1998, 1999a),¹ so that readers can try out our simulations. We model only layer 2–3 cells in the cortex, which are mainly responsible for the recurrent dynamics. A model neuron has membrane potential x and output or firing rate g_x(x), which is a sigmoid-like function of x. Model cells have orientation-selective RFs arranged on a regular two-dimensional grid in image coordinates. At each grid point i = (m_i, n_i), where m_i and n_i are the horizontal and vertical coordinates, there are K units, one for each of the preferred orientations θ = kπ/K for k = 0, 1, …, K − 1, spanning 180 degrees. Unit iθ has its RF located at i and prefers orientation θ. It receives external visual input I_iθ, which is the result of preprocessing the visual image through the RF. Its response g_x(x_iθ) is the result of both I_iθ and the recurrent interactions. The image grid and the interactions are treated as translation invariant, allowing us to use many powerful analytical techniques. However, we should keep in mind that translation symmetry holds only approximately, and only over a sufficiently small portion of the visual field, since our visual system has different resolutions at different eccentricities. The desired computation {I_iθ} → {g_x(x_iθ)} gives higher responses g_x(x_iθ) to input bars iθ of higher perceptual saliency. For instance, even if two input bars iθ and jθ′ have the same input contrast I_iθ = I_jθ′, the response g_x(x_iθ) to iθ may be higher if iθ (but not jθ′) is part of an isolated smooth contour, is at the boundary of a texture region, or is a pop-out target against a background. Conversely, if the input bars are of the same saliency, as when the input consists merely of bars of the same contrast from a homogeneous texture without any boundary, the output level to every bar should be the same.
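The indexing scheme (K orientation units θ = kπ/K per grid point) and connections that are translation invariant and reflection symmetric can be made concrete with a toy kernel. The functional form below is hypothetical, chosen only to illustrate the two symmetries; it is not the published V1 connection pattern:

```python
import numpy as np

# Toy connection kernel on a grid with K orientation units per point,
# theta = k*pi/K: the weight depends only on the displacement i - j and on
# the angles of that displacement relative to theta and theta'. The kernel
# itself is hypothetical, not the published V1 weights.

K = 12
thetas = np.pi * np.arange(K) / K            # preferred orientations

def T(i, theta, j, theta_p, sigma=2.0):
    d = np.subtract(i, j)                    # translation invariance: i - j only
    dist = np.hypot(d[0], d[1])
    if dist == 0.0:
        return 0.0
    phi = np.arctan2(d[1], d[0])             # direction of the displacement
    # colinearity-favoring factor; cos(2*...) makes it orientation-periodic
    align = np.cos(2 * (theta - phi)) + np.cos(2 * (theta_p - phi))
    return np.exp(-dist**2 / (2 * sigma**2)) * align

i, j = (1, 2), (4, 6)
forward = T(i, thetas[1], j, thetas[7])
backward = T(j, thetas[7], i, thetas[1])     # reflection symmetry check
print(np.isclose(forward, backward))
```

Because swapping (i, θ) with (j, θ′) flips the displacement direction by π and the cos(2·) terms are π-periodic, the kernel satisfies T_iθ,jθ′ = T_jθ′,iθ by construction.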
¹ In October 2000, a typo was discovered in the appendix of the published version of Li (1998, 1999a) for the model parameter W_iθ,jθ′. There it was mistakenly written that "W_iθ,jθ′ = 0" when "d ≥ 10" or the other conditions listed in those articles are satisfied. The correct model parameter, which has been used to produce all the published model results so far (including the ones in Li, 1998, 1999a), should be such that the condition "d ≥ 10" printed in Li (1998, 1999a) is changed to the condition "d / cos(β/4) ≥ 10." Here d and β are as defined in Li (1998, 1999a). The typo, if implemented, should lead to quantitative changes in the model behavior from those published so far or those presented in this article.
1752
Zhaoping Li
2.1 A Less-Than-Minimal Recurrent Model of V1. A very simple recurrent model of the cortex can be described by

  ẋ_iθ = −x_iθ + Σ_{jθ′} T_iθ,jθ′ g_x(x_jθ′) + I_iθ + I_o,   (2.1)

where −x_iθ models the decay in membrane potential and I_o is the background input. The recurrent connections T_iθ,jθ′ link cells iθ and jθ′. Visual input I_iθ persists after onset and initializes the activity levels g_x(x_iθ). The activities are then modified by the network interaction, making g_x(x_iθ) dependent on the inputs I_jθ′ for (jθ′) ≠ (iθ). Translation invariance in the connections means that T_iθ,jθ′ depends only on the vector i − j and the relative angles of this vector to the orientations θ and θ′. Reflection symmetry means that T_iθ,jθ′ = T_jθ′,iθ. Many previous models of the primary visual cortex (e.g., Grossberg & Mingolla, 1985; Zucker et al., 1989; Braun, Niebur, Schuster, & Koch, 1994) can be seen as more complex versions of the one described above. The added complexities include stronger nonlinearities, global normalization (e.g., by adding a global normalizing input to the background I_o), and shunting inhibition. However, they are all characterized by reciprocal or symmetric interactions between model units. It is well known (Hopfield, 1982; Cohen & Grossberg, 1983) that in a symmetric recurrent network as in equation 2.1, given any stationary input I_iθ, the dynamic trajectory x_iθ(t) will converge in time t to a fixed point that is a local minimum (attractor) of an energy landscape,

  E({x_iθ}) = −(1/2) Σ_{iθ,jθ′} T_iθ,jθ′ g_x(x_iθ) g_x(x_jθ′) − Σ_{iθ} I_iθ g_x(x_iθ) + Σ_{iθ} ∫_0^{g_x(x_iθ)} g_x^{−1}(x) dx.   (2.2)

Empirically, this convergence to attractors still holds when the complexities above are added to the network. The fixed point x̄_iθ of the motion trajectory, or the minimum energy state where ∂E/∂g_x(x_iθ) = 0 for all iθ, is (when I_o = 0)

  x̄_iθ = I_iθ + Σ_{jθ′} T_iθ,jθ′ g_x(x̄_jθ′).   (2.3)
Without recurrent interactions (T = 0), this minimum x̄_iθ = I_iθ is a faithful copy of the input I_iθ. Sufficiently strong interactions T shape x̄_iθ and make it unfaithful to the input. This happens when T is so strong that one of the eigenvalues λ_T of the matrix with elements T_iθ,jθ′ g_x′(x̄_jθ′) satisfies λ_T > 1 (here g_x′ is the slope of g_x(·)). For instance, when the input I_iθ is translation invariant, such that I_iθ = I_jθ for all i ≠ j, there is a translation-invariant fixed point x̄_iθ = x̄_jθ for all i ≠ j. Strong interactions T could make this fixed point unstable and no longer a local minimum of the energy, and pull the state into an attractor in the direction of an eigenvector of T that is not translation invariant, that is, x_iθ ≠ x_jθ for i ≠ j. Computationally, input unfaithfulness (i.e., g_x(x_iθ) is not a function of I_iθ alone) is desirable to a limited degree, since this is how a saliency circuit produces differential outputs g_x(x_iθ) to input bars of the same contrast I_iθ but different saliencies. However, this unfaithfulness should be driven by the nature of the input pattern {I_iθ} or its deviation from homogeneity (e.g., the smooth contours or figures against a background). Otherwise, visual hallucinations (Ermentrout & Cowan, 1979) result when spontaneous, non-input-driven network behaviors occur, that is, spontaneous pattern formation or symmetry breaking. Note that if {x_iθ} is an attractor under homogeneous input I_iθ = I_jθ, so is any translated state {x′_iθ} with x′_iθ = x_{i+a,θ} for any translation a, since {x_iθ} and {x′_iθ} have the same energy value E. Hence, the absolute positions of the hallucinated patterns are random and shiftable. When the translation a is one-dimensional, such a continuum of attractors has been called a line attractor (Zhang, 1996). For two (or more) dimensional patterns, the continuum is a surface attractor. To illustrate, consider an example when T_iθ,jθ′ ∝ δ_θ,θ′ links only cells that prefer the same orientation, an idealization of the observations (Gilbert & Wiesel, 1983; Rockland & Lund, 1983) that the lateral interactions tend to link cells preferring similar orientations. The network then contains multiple, independent subnetworks, one for each θ. Take the θ = 0 degree subnet, and for convenience drop the subindex θ. Consider a simple center-surround interaction, such that in a Manhattan grid,
  T_ij ∝  1,  if i = j;
      −1, if (m_j, n_j) = (m_i ± 1, n_i) or (m_i, n_i ± 1);   (2.4)
      0,  otherwise.
With sufficiently strong T, the network under homogeneous input I_i = I_j for all i, j can settle into an "antiferromagnetic" state in which neighboring units exhibit one of two different activities: x_{m_i,n_i} = x_{m_i+1,n_i+1} ≠ x_{m_i+1,n_i} = x_{m_i,n_i+1}. This pattern {x_i} is a spatial array of replicas of the center-surround interaction pattern T. Figure 1 shows the behavior of a subnet for θ = 0, the vertical bars, when the interaction T_ij depends on the orientation of i − j and is not rotationally invariant. Here, T_ij > 0 between local and roughly vertically displaced i and j, and T_ij < 0 between local and more horizontally displaced i and j. Hence, two nearby bars i and j excite each other when they are coaligned and inhibit each other otherwise, as suggested by experimental observations and theoretical studies (e.g., Kapadia et al., 1995; Polat & Sagi, 1993; Field, Hayes, & Hess, 1993). Although the network enhances an input (vertical)
Figure 1: A reduced model consisting of symmetrically coupled cells tuned to the vertical orientation (θ = 0). Shown here are five gray-scale images, each with a scale bar on the right. The network has 100 × 100 cells arranged in a two-dimensional array with wraparound boundary conditions. Each cell models a cortical cell tuned to the vertical orientation, in a retinotopic manner. The sigmoid function g_x(x) of the cells is g_x(x) = 0 when x < 1, g_x(x) = x − 1 when 1 ≤ x < 2, and g_x(x) = 1 when x ≥ 2. The top image shows the connection pattern between the center cell and the other cells. This pattern is local and translation invariant; it gives local colinear excitation between vertically displaced cells and local inhibition between horizontally displaced cells. (Middle left) Two-dimensional input pattern I: an input line and a noise spot. (Middle right) Two-dimensional output pattern g_x(x) for the input at middle left. The line induces a response approximately 100% higher than that of the noise spot. (Bottom left) Two-dimensional input pattern I for noise input. (Bottom right) Two-dimensional output pattern g_x(x) for the noisy input: a hallucination of vertical streaks.
line relative to the isolated (short) bar, it also hallucinates other vertical lines under noisy inputs. The competition between the internal interactions T and the external inputs I in shaping {x_i} is uncompromising in such recurrent networks. For analysis, take for simplicity T_ij such that T_ij ≠ 0 only when i and j are in the same or nearest-neighbor (vertical) columns. Then T_0 ≡ Σ_{j: m_j = m_i} T_ij > 0 is the total recurrent excitation within a vertical contour (column), serving contour enhancement, and T_1 ≡ −Σ_{j: m_j = m_i+1} T_ij = −Σ_{j: m_j = m_i−1} T_ij > 0 is the total recurrent suppression between two nearby vertical columns, serving to suppress noise and background. Both T_0 and T_1 should be large for a large difference in saliency between a vertical contour and a homogeneous background input I_i = I_j for all i, j. However, when T_0 + 2T_1 > 1 (for g_x′ = 1), this network under homogeneous input spontaneously breaks symmetry and hallucinates saliency waves, with two alternating saliency values in neighboring columns. (Mathematically, T_0 + 2T_1 is the eigenvalue of the matrix T with this saliency wave as the eigenvector.) Hence, contour enhancement makes the network prone to "see" contours even when there are none. The orientations and widths of the "ghost contours" match the interaction structure T. Avoiding the hallucination forces T_0 or T_1 to be smaller, and so harms the capability of the network to enhance contours relative to backgrounds (Li & Dayan, 1999). In other words, although symmetric recurrent networks are useful for associative memory computations (Hopfield, 1982), for which correcting significant input errors or filling in extensively missing inputs is exactly what is needed, such input distortion is too strong for early visual tasks, which require greater faithfulness to the visual input.

2.2 A Minimal Recurrent Model with Hidden Units. The major weakness of the symmetrically connected model is its attractor dynamics, which strongly pull the network state {x_iθ} away from the states specified by the visual input {I_iθ}.
Since this attractor dynamics is largely dictated by the symmetry of the neural connections, it cannot be removed by introducing ion channels or spiking neurons (rather than firing-rate neurons), for instance, or by mechanisms like shunting inhibition, global activity normalization, and input gating (Grossberg & Mingolla, 1985; Zucker et al., 1989; Braun et al., 1994), which are used by many models despite their questionable biological foundations. Attractor dynamics of this kind is untenable, however, in the face of the well-established fact that a real neuron is either exclusively excitatory or exclusively inhibitory: it is obviously impossible to have symmetric connections between excitatory and inhibitory neurons. Mathematical analysis by Li and Dayan (1999) showed that asymmetric recurrent E-I networks with separate excitatory (E) and inhibitory (I) cells can indeed perform computations that symmetric ones cannot. Thus, we model the neurons x_iθ as exclusively excitatory pyramidal cells and introduce one inhibitory interneuron (a hidden unit) y_iθ for each x_iθ to mediate indirect, or disynaptic, inhibition between the x_iθ's, as in the real cortex (White, 1989; Gilbert, 1992; Rockland & Lund, 1983; see also Grossberg & Raizada, 2000, for a much more fully elaborated model). The units x_iθ and y_iθ in such an E-I pair are reciprocally connected. The dynamical equations hence become:
  ẋ_iθ = −x_iθ − g_y(y_iθ) + J_o g_x(x_iθ) − Σ_{Δθ≠0} ψ(Δθ) g_y(y_{i,θ+Δθ}) + Σ_{j≠i,θ′} J_iθ,jθ′ g_x(x_jθ′) + I_iθ + I_o   (2.5)

  ẏ_iθ = −α_y y_iθ + g_x(x_iθ) + Σ_{j≠i,θ′} W_iθ,jθ′ g_x(x_jθ′) + I_c,   (2.6)
where α_y and g_y(y) model the interneuron y_iθ, which inhibits its partner x_iθ. The longer-range connections T_iθ,jθ′ (between cells in different hypercolumns, i ≠ j) are now separated into two terms: monosynaptic excitation J_iθ,jθ′ ≥ 0 between x_iθ and x_jθ′, and disynaptic inhibition W_iθ,jθ′ ≥ 0 between x_iθ and x_jθ′ via the interneuron y_iθ. Including both the monosynaptic and disynaptic pathways, the net effective connection between x_iθ and x_jθ′ in stationary (but not in dynamic) states is, for example, J_iθ,jθ′ − W_iθ,jθ′/α_y if g_y(y) = y, and it can be either facilitatory or inhibitory. Both ψ(Δθ) and J_o are explicit representations of the original interaction T_iθ,iθ′ between units within a hypercolumn: ψ(Δθ) ≤ 1 models local inhibition, and J_o g_x(x_iθ) models self-excitation. Figure 2C schematically shows an example of the network. I_c and I_o are background inputs, including neural noise, feedback from higher areas, and inputs modeling the general and local normalization of activities (Li, 1998) (omitted in the analysis in this article but present in the simulations). An edge of input strength Î_iβ at i with orientation β in the input image contributes to I_iθ (for θ ≈ β) an amount Î_iβ φ(θ − β), where φ(θ − β) is the orientation tuning curve. Lateral connections link cells preferring similar orientations. To implement net colinear facilitation and noncolinear flank inhibition (between similarly oriented bars), the excitatory J connections are dominant between units preferring coaligned bars (θ ≈ θ′ ≈ the orientation of i − j), while the inhibitory W connections are dominant between units preferring nonaligned (but still similarly oriented) bars (Zucker et al., 1989; Field et al., 1993; Li, 1998, 1999a). Such an interaction pattern has been called an association field (Field et al., 1993).
A simple model of this interaction is the biphasic pattern in Figure 2B: J > 0 and W = 0 for mutual excitation, and J = 0 and W > 0 for mutual inhibition (Li, 1998, 1999a). Physiological evidence (Hirsch & Gilbert, 1991) suggests that both J_iθ,jθ′ > 0 and W_iθ,jθ′ > 0 contribute to the links between a given pair of pyramidal cells x_iθ and x_jθ′. This gives extra computational flexibility (e.g., contrast dependence of contextual influences; see section 3) by letting the ratio J_iθ,jθ′ : W_iθ,jθ′ determine the overall sign of the interaction. For illustrative convenience, however, the simpler biphasic connection is sometimes used in this article to demonstrate the analysis, and is used for all the examples in the figures. In principle, an E-I recurrent model can perform computations that symmetric models cannot. In practice, this is not guaranteed and has to be ensured by designing the right model parameters, in particular J and W, guided by an analytical understanding of the nonlinear dynamics.
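As a concrete illustration of equations 2.5 and 2.6, the sketch below simulates two laterally coupled E-I pairs with K = 1 orientation (so the ψ term is absent); all weights and activation functions are our own illustrative choices, not the published parameters. The state spirals into its fixed point through damped oscillations, and the temporal average of g_x(x) serves as the model output:

```python
import numpy as np

def g_x(x):
    # Excitatory activation: piecewise linear with threshold 1, as in Figure 1.
    return np.clip(x - 1.0, 0.0, 1.0)

def g_y(y):
    # Inhibitory activation: linear above threshold 0 (our choice).
    return np.maximum(y, 0.0)

J0, J, W = 0.8, 0.4, 0.4      # self-excitation and lateral J, W (illustrative)
alpha_y, Io, Ic = 1.0, 0.0, 0.0
I = np.array([2.0, 2.0])      # equal external inputs to the two excitatory cells

Jmat = np.array([[0.0, J], [J, 0.0]])
Wmat = np.array([[0.0, W], [W, 0.0]])
x = np.zeros(2); y = np.zeros(2)
dt, record = 0.01, []
for _ in range(50000):
    dx = -x - g_y(y) + J0 * g_x(x) + Jmat @ g_x(x) + I + Io   # equation 2.5 (K = 1)
    dy = -alpha_y * y + g_x(x) + Wmat @ g_x(x) + Ic           # equation 2.6
    x += dt * dx; y += dt * dy
    record.append(g_x(x))

out_mean = np.mean(record[-20000:], axis=0)   # temporal average = model output
print(out_mean)   # both units settle near g_x(x_bar) = 5/6 for these parameters
```

For these parameter values, the unique fixed point lies in the linear range of g_x and is a stable spiral, so the temporal average coincides with the fixed point, as assumed in section 3.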
Design and Analysis of a Recurrent V1 Model
1757
[Figure 2 appears here. Panel A: visual space, edge detectors, and their interactions, with a sampling location and one of the edge detectors marked. Panel B: the neural connection pattern (solid: J; dashed: W). Panel C: the model neural elements: an interconnected pair of an excitatory neuron and an inhibitory interneuron for each edge segment iθ, with edge outputs to higher visual areas, inputs I_c to the inhibitory cells, self-excitation J_o, and visual inputs, filtered through the receptive fields, to the excitatory cells.]
Figure 2: A schematic of the minimal model of the primary visual cortex. (A) Visual inputs are sampled on a discrete grid by edge/bar detectors modeling the RFs of V1 layer 2-3 cells. Each grid point has K neuron pairs (see C), one per bar segment. All cells at a grid point share the same RF center but are tuned to different orientations spanning 180 degrees, thus modeling a hypercolumn. A bar segment in one hypercolumn can interact with another in a different hypercolumn via monosynaptic excitation J (the solid arrow from one thick bar to another) and/or disynaptic inhibition W (the dashed arrow to a thick dashed bar). (See also C.) (B) A schematic of the neural connection pattern from the center (thick solid) bar to neighboring bars within a finite distance (a few RF sizes). J's contacts are shown by thin solid bars, W's by thin dashed bars. All bars have the same connection pattern, suitably translated and rotated from this one. (C) An input bar segment is associated with an interconnected pair of excitatory and inhibitory cells; each model cell abstractly models a local group of cells of the same type. The excitatory cell receives visual input and sends output g_x(x_iθ) to higher centers. The inhibitory cell is an interneuron. The visual space has toroidal (wraparound) boundary conditions.
3 Dynamic Analysis
The model state is characterized by {x_iθ, y_iθ}, or simply {x_iθ}, omitting the hidden units {y_iθ}. The interaction between excitatory and inhibitory cells makes {x_iθ(t)} intrinsically oscillatory in time. Given an input {I_iθ}, the model does not guarantee convergence to a fixed point where ẋ_iθ = ẏ_iθ = 0. However, if {x_iθ(t)} converges to, or oscillates periodically around, a fixed point after the transient following the onset of {I_iθ}, the temporal average {x̄_iθ} of {x_iθ(t)} can characterize the model output and approximate the fixed point. We henceforth use the notation {x̄_iθ} to denote either the fixed point or the temporal average, and denote the computation as I → g_x(x̄_iθ). Sections 3.1 through 3.6 analyze I → g_x(x̄_iθ) and derive constraints on J and W that make I → g_x(x̄_iθ) achieve the desired computations. Other investigators have also analyzed the fixed-point behavior I → g_x(x̄_iθ) in such E-I networks or the corresponding symmetric ones (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Stemmler, Usher, & Niebur, 1995; Somers et al., 1998; Mundel, Dimitrov, & Cowan, 1997; Tsodyks, Skaggs, Sejnowski, & McNaughton, 1997), mainly to model a local circuit of a hypercolumn (or a CA1 region) with simplified or no spatial organization beyond the hypercolumn. Our analysis emphasizes the spatial or geometrical organization of visual inputs in order to study global visual computations. Section 3.7 studies the stability and dynamics around {x̄_iθ} and derives constraints on the model parameters from the need to avoid visual hallucination (Ermentrout & Cowan, 1979), the curse of symmetric networks.

3.1 A Single Pair of Neurons. In isolation, a single pair iθ follows the equations

  ẋ = −x − g_y(y) + J_o g_x(x) + I   (3.1)
  ẏ = −y + g_x(x) + I_c,   (3.2)

where α_y = 1 for simplicity (as in the rest of the article), the index iθ is omitted, and I = I_iθ + I_o. The input-output (I, I_c → g_x(x̄)) gain at a fixed point (x̄, ȳ) is

  dg_x(x̄)/dI = g_x′(x̄) / (1 + g_x′(x̄) g_y′(ȳ) − J_o g_x′(x̄)),   dg_x(x̄)/dI_c = −g_y′(ȳ) dg_x(x̄)/dI.   (3.3)
When both g_x(x) and g_y(y) are piecewise-linear functions (see Figures 3A and 3B), so is the input-output relation I → g_x(x̄) (see Figure 3C). The threshold, input gain control, and saturation in I → g_x(x̄) are apparent. The slope dg_x(x̄)/dI is nonnegative; otherwise, I = 0 would give a nonzero output x ≠ 0. It increases with g_x′(x̄), decreases with g_y′(ȳ), and depends on I_c. Shifting (I, I_c) to
[Figure 3 appears here. Panels A and B: the activation functions g_x(x) and g_y(y) (cell firing rate versus cell membrane potential) for excitatory and inhibitory cells. Panel C: the edge response g_x(x) to visual input I, at lower and higher I_c. Panel D: the (ΔI, ΔI_c) plane, divided by the curve on which ΔI/ΔI_c = g_y′(ȳ) into a region of input facilitation and a region of input suppression.]
Figure 3: (A, B) Examples of the g_x(x) and g_y(y) functions. (C) The input-output function I → g_x(x̄) for an isolated neural pair without interpair neural interactions, under different levels of I_c. (D) The overall effect of external or contextual inputs (ΔI, ΔI_c) on a neural pair is excitatory or inhibitory according to whether ΔI/ΔI_c is larger or smaller than g_y′(ȳ), which depends on I.
(I + ΔI, I_c + ΔI_c) changes g_x(x̄) by Δg_x(x̄) ≈ (dg_x(x̄)/dI)(ΔI − g_y′(ȳ)ΔI_c), which is positive or negative depending on whether ΔI/ΔI_c > g_y′(ȳ). Hence, a more elaborate model could allow a fraction of the external visual input to go onto the interneurons, as suggested by physiology (White, 1989) and modeled by Grossberg and Raizada (2000), provided that ΔI/ΔI_c > g_y′(ȳ). Contextual inputs from other neuron pairs (via J and W) effectively give (ΔI, ΔI_c). In our example, when g_y′(ȳ) increases with I (or I_c), the contextual inputs can switch from being facilitatory to being suppressive as I increases (see Figure 3D). This input-contrast dependence of contextual influences has been observed physiologically (Sengpiel, Baddeley, Freeman, Harrad, & Blakemore, 1998) and modeled by others (Stemmler et al., 1995; Somers et al., 1998).
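The gain formula in equation 3.3 can be verified numerically. The sketch below uses activation functions of our own choosing: a piecewise-linear g_x (so g_x′ = 1 in the operating range) and g_y(y) = y²/2 for y > 0, so that g_y′(ȳ) = ȳ grows with activity, the ingredient behind the contrast dependence in Figure 3D:

```python
import numpy as np

def g_x(x): return float(np.clip(x - 1.0, 0.0, 1.0))
def g_y(y): return 0.5 * max(y, 0.0) ** 2      # g_y'(y) = y for y > 0

J0 = 0.8

def fixed_point(I, Ic=0.0, dt=0.01, steps=40000):
    # Euler integration of equations 3.1 and 3.2 until (x, y) settles.
    x = y = 0.0
    for _ in range(steps):
        x, y = (x + dt * (-x - g_y(y) + J0 * g_x(x) + I),
                y + dt * (-y + g_x(x) + Ic))
    return x, y

# Empirical gain d g_x(x_bar) / dI by finite differences:
eps = 1e-4
x0, y0 = fixed_point(1.6)
x1, _ = fixed_point(1.6 + eps)
gain_numeric = (g_x(x1) - g_x(x0)) / eps

# Equation 3.3 with g_x'(x_bar) = 1 (linear range) and g_y'(y_bar) = y_bar:
gain_formula = 1.0 / (1.0 + y0 - J0)
print(gain_numeric, gain_formula)   # the two values agree
```

Repeating the finite-difference step at larger I shows the gain shrinking as ȳ (and hence g_y′(ȳ)) grows, the gain-control behavior sketched in Figure 3C.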
3.2 Two Interacting Pairs of Neurons with Nonoverlapping Receptive Fields. Using indices a = 1, 2 to denote the two pairs and their associated quantities (with J_12 = J_21 = J and W_12 = W_21 = W),

  ẋ_a = −x_a − g_y(y_a) + J_o g_x(x_a) + J g_x(x_b) + I_a + I_o
  ẏ_a = −y_a + g_x(x_a) + W g_x(x_b) + I_c,

where a, b = 1, 2 and a ≠ b. Including the monosynaptic and disynaptic pathways, the net effective connection from x_2 to x_1, according to the gain functions dg_x(x̄)/dI and dg_x(x̄)/dI_c, is J − g_y′(ȳ_1)W. When I ≡ I_1 = I_2 in the simplest case, x̄ ≡ x̄_1 = x̄_2 and ȳ ≡ ȳ_1 = ȳ_2. The two bars can then excite or inhibit each other depending on whether J − g_y′(ȳ)W > 0, which in turn depends on the input I through g_y′(ȳ). When I_1 > I_2, we have (x̄_1, ȳ_1) > (x̄_2, ȳ_2). Usually g_y′(ȳ) increases with ȳ; hence J_12 − g_y′(ȳ_1)W_12 < J_21 − g_y′(ȳ_2)W_21. In particular, it can happen that J_12 − g_y′(ȳ_1)W_12 < 0 < J_21 − g_y′(ȳ_2)W_21, that is, x_1 excites x_2 while x_2 in turn inhibits x_1. This implies that two interacting pairs tend to have closer activity values x_1 and x_2 than two noninteracting pairs. Even this very simple contextual influence can already account for some perceptual phenomena involving sparse visual inputs consisting of only single test and contextual bars. Examples include the altered detection threshold (Polat & Sagi, 1993; Kapadia et al., 1995) or perceived orientation (the tilt illusion; Mundel et al., 1997; Kapadia, pers. comm., 1998) of a test bar when a contextual bar is present.
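A direct simulation of the two-pair system shows this asymmetry of the net effective connections (all parameter values and activation functions below are our own illustrative choices): with I_1 > I_2, the more active pair 1 has the larger g_y′(ȳ_1), so the net connection J − g_y′(ȳ_1)W that it receives from pair 2 is more inhibitory than the one it sends.

```python
import numpy as np

def g_x(x): return np.clip(x - 1.0, 0.0, 1.0)
def g_y(y): return 0.5 * np.maximum(y, 0.0) ** 2   # g_y'(y) = y for y > 0

J0, J, W = 0.8, 0.3, 0.5
I = np.array([1.7, 1.3])        # unequal inputs, I_1 > I_2

flip = np.array([1, 0])         # index of the "other" pair
x = np.zeros(2); y = np.zeros(2)
dt = 0.01
for _ in range(60000):          # integrate to the steady state
    dx = -x - g_y(y) + J0 * g_x(x) + J * g_x(x[flip]) + I
    dy = -y + g_x(x) + W * g_x(x[flip])
    x += dt * dx; y += dt * dy

# Since g_y(y) = y^2/2, the slope g_y'(y_bar) is just y_bar itself:
net_onto_1 = J - y[0] * W       # effective connection from x_2 onto x_1
net_onto_2 = J - y[1] * W       # effective connection from x_1 onto x_2
print(net_onto_1, net_onto_2)   # net_onto_1 < net_onto_2
```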
3.3 A One-Dimensional Array of Identical Bars. An infinitely long, horizontal array of evenly spaced, identical bars gives an input pattern approximated as

  I_iθ = I_array, for i = (m_i, n_i = 0) on the horizontal axis and θ = θ_1;   (3.4)
  I_iθ = 0, otherwise.

The approximation I_iθ = 0 for θ ≠ θ_1 is good for a small orientation tuning width and low input contrast. When the bars iθ outside the array are silent, g_x(x_iθ) = 0, due to insufficient excitation, we omit them and treat the remaining system as one-dimensional. Omitting the index θ and using i to denote bars by their one-dimensional locations, we get

  ẋ_i = −x_i − g_y(y_i) + J_o g_x(x_i) + Σ_{j≠i} J_ij g_x(x_j) + I_array + I_o   (3.5)
  ẏ_i = −y_i + g_x(x_i) + Σ_{j≠i} W_ij g_x(x_j) + I_c.   (3.6)
Translation symmetry implies that all units have the same equilibrium point (x̄_i, ȳ_i) = (x̄, ȳ), with

  0 = −x̄ − g_y(ȳ) + (J_o + Σ_{j≠i} J_ij) g_x(x̄) + I_array + I_o   (3.7)
  0 = −ȳ + (1 + Σ_{j≠i} W_ij) g_x(x̄) + I_c.   (3.8)
This array is thus equivalent to a single neural pair (cf. equations 3.1 and 3.2) under the substitutions J_o → J_o + Σ_j J_ij and g_y′(ȳ) → g_y′(ȳ)(1 + Σ_j W_ij). The response to bars in the array is therefore higher than that to an isolated bar if the net extra excitatory connection,

  E ≡ Σ_j J_ij,   (3.9)

is stronger than the net extra inhibitory (effective) connection,

  I ≡ g_y′(ȳ) Σ_j W_ij.   (3.10)

The input-output relationship I → g_x(x̄) is qualitatively the same as that for a single bar, with a quantitative change in the gain:

  dg_x(x̄)/dI = g_x′(x̄) / (1 + g_x′(x̄)(g_y′(ȳ) − (E − I)) − J_o g_x′(x̄)).   (3.11)

E and I depend on θ_1. Consider the case of association field connections. When the bars are parallel to the array, making a straight line (see Figure 4B), E > I. The condition for contour enhancement is that the contour facilitation

  F_contour ≡ (E − I) g_x(x̄) > 0   (3.12)

and is sufficiently strong. When the bars are orthogonal to the array (see Figure 4C), E < I, and the responses are suppressed. This analysis extends to other translation-invariant one-dimensional arrays like those in Figures 4D and 4E, for which the index i simply denotes a bar at a location along the array (Li, 1998). The straight line in Figure 4B is in fact the limit of the circle in Figure 4D as the radius goes to infinity. Similarly, the pattern in Figure 4C is a special case of the one in Figure 4E. The quality of the approximations in equations 3.4 through 3.8 depends on the input, as shown in Figure 5. Contextual facilitation in Figures 5A, 5B,
Figure 4: Examples of the one-dimensional input stimuli mentioned in the text. (A) A horizontal array of identical bars oriented at angle θ_1. (B) A special case of A when θ_1 = π/2, and, in C, when θ_1 = 0. (D) An array of bars aligned with, or tangential to, a circle. The pattern in B is a special case of this circle when the radius is infinitely large. (E) The same as D except that the bars are perpendicular to the circle's circumference. The pattern in C is a special case when the radius is infinitely large.
and 5E, and contextual suppression in Figures 5C, 5D, and 5F, are visualized by bars that are respectively thicker and thinner than the isolated bar in Figure 5G. In Figure 5A, cells whose RFs are centered on the line but are not oriented exactly horizontally are also excited above threshold, unlike in our approximation g_x(x_iθ) = 0 for nonhorizontal θ. (This should not cause perceptual problems, though, given population coding.) This is caused by the direct visual input I_iθ for θ ≠ θ_1 (θ ≈ θ_1) and by the colinear facilitation from the other bars in the line. The approximation of translation invariance, x̄_i = x̄_j for all bars in the array, is compromised when the array has an end (e.g., see Figure 5B) or when the bars are unevenly spaced (see Figures 5E and 5F). In Figure 5B, the bars at or near the left end of the line are less enhanced, since they receive less or no contextual facilitation from their left. In Figure 5F, the more densely spaced bars receive more contextual suppression than the others.

3.4 Two-Dimensional Textures and Texture Boundaries. The analysis of the one-dimensional array also applies to an infinitely large two-dimensional texture of uniform input I_iθ_1 = I_texture when the i = (m_i, n_i) sit on a regularly spaced grid (see Figure 6A). The sums E = Σ_j J_ij and I = g_y′(ȳ) Σ_j W_ij are then taken over all j in that grid.
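To make the sums E and I concrete, the toy kernel below (entirely our own construction, much simpler than the published association field: excitation between roughly colinear bars, disynaptic inhibition between parallel flanking bars) reproduces the qualitative pattern: E > I for a line of colinear bars, E < I for an array of parallel, nonaligned bars.

```python
import numpy as np

def kernel(d, theta):
    """Toy (J_ij, W_ij) between two bars of orientation theta whose RF centers
    are separated horizontally by d grid units.  Not the published parameters."""
    if d == 0 or abs(d) > 4:
        return 0.0, 0.0
    # theta is measured from the (horizontal) displacement axis:
    colinear = np.cos(theta) ** 2     # large when the bars align into a contour
    flank = np.sin(theta) ** 2        # large for parallel, side-by-side bars
    fall = np.exp(-abs(d) / 2.0)      # connections weaken with distance
    return 0.5 * colinear * fall, 0.5 * flank * fall

def E_and_I(theta, gy_slope=1.0):
    # Equations 3.9 and 3.10: E = sum_j J_ij, I = g_y'(y_bar) sum_j W_ij.
    J_sum = sum(kernel(d, theta)[0] for d in range(-4, 5))
    W_sum = sum(kernel(d, theta)[1] for d in range(-4, 5))
    return J_sum, gy_slope * W_sum

E_line, I_line = E_and_I(0.0)           # bars along the array: a straight line
E_orth, I_orth = E_and_I(np.pi / 2)     # bars orthogonal to the array
print(E_line > I_line, E_orth < I_orth) # True True: enhancement vs. suppression
```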
[Figure 5 appears here. Panels: (A) an infinitely long line; (B) a half-infinitely long line, ending on the left; (C) an infinitely long array of oblique bars; (D) an infinitely long horizontal array of vertical bars; (E) an uneven circular array; (F) an uneven radiant array; (G) an isolated bar.]
Figure 5: Simulated outputs from the cortical model for the corresponding visual input patterns of one-dimensional arrays of bars. The model transforms inputs I_iθ to cell outputs g_x(x_iθ). The thickness of each bar iθ is proportional to the temporally averaged model output g_x(x_iθ). The corresponding (suprathreshold) input Î_iθ = 1.5 is of low to intermediate contrast and is the same for all seven examples and all visible bars. The different outputs g_x(x_iθ) for different examples, or for different bars within an example, are caused by contextual interactions. Overall contextual facilitation causes higher outputs in A, B, and E than that of the isolated bar in G, while overall contextual suppression causes lower outputs in C, D, and F (compare the thicknesses of the bars). Note the deviations from the idealized approximations in the text. Uneven spacing between the bars (E, F) or the end of a line (at the left end of B) causes deviations from the translation invariance of the responses. The responses taper off near the line end in B, and the responses are noticeably weaker for the more densely packed bars in F. In A and B, cells preferring neighboring (near-horizontal) orientations along the line are also excited above threshold, unlike in the approximate treatment in the text.
[Figure 6 appears here. Panels: (A) a texture of bars oriented at θ_1; (B) a texture of vertical bars; (C) two neighboring textures of bars; (D-G) four example pairs of vertical arrays of bars.]
Figure 6: Examples of two-dimensional textures and their interactions. (A) A texture made of bars oriented at θ_1 and sitting on a Manhattan grid. This can be seen as a horizontal array of vertical arrays of bars. (B) A special case of A when θ_1 = 0: a horizontal array of vertical lines. Each texture can also be seen as a vertical array of horizontal arrays of bars, or an oblique array of oblique arrays of bars. Each vertical, horizontal, or oblique array can be viewed as a single entity, as shown by the examples in the dotted boxes. (C) Two nearby textures and the boundary between them. (D-F) Examples of nearby and identical vertical arrays. (G) Two nearby but different vertical arrays. When each vertical array is seen as an entity, one can calculate the effective connections J′ and W′ (defined in the text) between these vertical arrays.
Physiologically, the response to a bar is reduced when the bar is part of a texture (Knierim & van Essen, 1992). This can be achieved when E < I. Consider, for example, the case when the i = (m_i, n_i) form a Manhattan grid with integer values of m_i and n_i (see Figure 6). The texture can be seen as a horizontal array of vertical arrays of bars, like the horizontal array of vertical contours in Figure 6B. The effective connections between vertical arrays (see Figures 6D, 6E, and 6F) a distance a apart are

  J′_a ≡ Σ_{j: m_j = m_i + a} J_ij,   W′_a ≡ Σ_{j: m_j = m_i + a} W_ij.   (3.13)
Then E = Σ_a J′_a and I = g_y′(ȳ) Σ_a W′_a. The effective connections within a single vertical array are J′_0 and W′_0. One has to design J and W such that contour enhancement and texture suppression can occur in the same neural circuit (V1). That is, when a vertical array is a long, straight line (θ_1 = 0), contour enhancement (i.e., J′_0 > g_y′(ȳ)W′_0) occurs when the line is isolated, but overall suppression (i.e., E = Σ_a J′_a < I = g_y′(ȳ) Σ_a W′_a) occurs when that line is embedded within a texture of lines (see Figure 6B), as long as there is sufficient excitation within a line and sufficient inhibition between the lines. Computationally, contextual suppression within a texture means that the boundaries of a texture region induce relatively higher responses, thereby marking the boundaries for segmentation. The contextual suppression of a bar within a texture is

  C^{θ_1}_whole-texture ≡ Σ_a (g_y′(ȳ_{θ_1}) W′^{θ_1}_a − J′^{θ_1}_a) g_x(x̄_{θ_1}) = (I − E) g_x(x̄_{θ_1}) > 0,   (3.14)

where x̄_{θ_1} denotes the (translation-invariant) fixed point for all texture bars. Consider the bars on the vertical axis, i = (m_i = 0, n_i). Removing the texture bars on the left, i = (m_i < 0, n_i), removes the contextual suppression from them and so gives the remaining bars higher responses. This highlights the texture boundary m_i = 0. Now the activity x̄_iθ_1 depends on m_i, the distance of a bar from the texture boundary. As m_i → ∞, x̄_iθ_1 → x̄_{θ_1}. The contextual suppression of the bars on the boundary, m_i = 0, is

  C^{θ_1}_half-texture ≡ Σ_{m_j ≥ 0} (g_y′(ȳ_iθ_1) W′^{θ_1}_{m_j} − J′^{θ_1}_{m_j}) g_x(x̄_jθ_1)   (3.15)
    ≈ Σ_{a ≥ 0} (g_y′(ȳ_{θ_1}) W′^{θ_1}_a − J′^{θ_1}_a) g_x(x̄_{θ_1}) < C^{θ_1}_whole-texture,   (3.16)
where we approximate (x̄_jθ_1, ȳ_jθ_1) ≈ (x̄_{θ_1}, ȳ_{θ_1}) for all m_j ≥ 0. The boundary highlight persists when a neighboring, different texture of bars oriented at θ_2, occupying i = (m_i < 0, n_i), is present (see Figure 6C). To analyze this, define the connections between arrays in different textures (see Figure 6G) as

  J′^{θ_1θ_2}_a ≡ Σ_{j: m_j = m_i + a} J_{iθ_1, jθ_2},   W′^{θ_1θ_2}_a ≡ Σ_{j: m_j = m_i + a} W_{iθ_1, jθ_2};   (3.17)

when θ_1 = θ_2, J′^{θ_1θ_2}_a = J′^{θ_1}_a and W′^{θ_1θ_2}_a = W′^{θ_1}_a. The contextual suppression from the neighboring texture (θ_2) on the texture boundary (m_i = 0) is

  C^{θ_1,θ_2}_neighbor-half-texture ≡ Σ_{m_j < 0} (g_y′(ȳ_iθ_1) W′^{θ_1θ_2}_{m_j} − J′^{θ_1θ_2}_{m_j}) g_x(x̄_jθ_2).

For the association field connections, J_{iθ_1, jθ_2} and W_{iθ_1, jθ_2} tend to link similarly oriented bars θ_1 ≈ θ_2; hence C^{θ_1,θ_2}_neighbor-half-texture is minimal or zero when θ_1 ⊥ θ_2 and increases with decreasing |θ_1 − θ_2|. The boundary highlight is therefore expected to increase with the orientation contrast |θ_1 − θ_2|. The net contextual suppression on the border, contributed by both textures, is C^{θ_1,θ_2}_2-half-textures ≡ C^{θ_1}_half-texture + C^{θ_1,θ_2}_neighbor-half-texture. Hence, the border enhancement, or the reduction of contextual suppression at the border relative to regions farther inside the texture, is

  δC ≡ C^{θ_1}_whole-texture − C^{θ_1,θ_2}_2-half-textures   (3.18)
    ≈ C^{θ_1,θ_1}_neighbor-half-texture − C^{θ_1,θ_2}_neighbor-half-texture   (3.19)
    ≈ 0, for θ_1 ≈ θ_2;   (3.20)
    ≈ Σ_{a<0} (g_y′(ȳ_{θ_1}) W′^{θ_1}_a − J′^{θ_1}_a) g_x(x̄_{θ_1}) > 0, for θ_1 ⊥ θ_2;   (3.21)

and δC roughly increases as |θ_1 − θ_2| increases.
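Equations 3.18 and 3.19 can be turned into a small calculation. The sketch below assumes a toy cos² orientation tuning for the inter-column connections and treats g_x(x̄) and g_y′(ȳ) as constants (all of these are our own simplifications); the resulting border highlight δC is zero at zero orientation contrast and grows toward θ_1 ⊥ θ_2:

```python
import numpy as np

g_out, gy_slope = 0.8, 1.0      # g_x(x_bar) and g_y'(y_bar), held fixed for simplicity

def C_neighbor(theta1, theta2):
    """Suppression exerted on the boundary columns (m_i = 0) by a texture of
    orientation theta2 occupying m_i < 0 (a toy version of the C term)."""
    tune = np.cos(theta1 - theta2) ** 2    # connections link similar orientations
    total = 0.0
    for a in range(-4, 0):                 # columns of the neighboring texture
        W_a = 0.4 * tune * np.exp(-abs(a) / 2.0)
        J_a = 0.1 * tune * np.exp(-abs(a) / 2.0)
        total += (gy_slope * W_a - J_a) * g_out
    return total

def delta_C(theta1, theta2):
    # Equation 3.19: the highlight is the suppression lost when the removed
    # half-texture is replaced by one of a different orientation.
    return C_neighbor(theta1, theta1) - C_neighbor(theta1, theta2)

highlights = [delta_C(0.0, np.deg2rad(t)) for t in (0, 15, 45, 90)]
print(highlights)    # strictly increasing with orientation contrast
```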
Thus, the border highlight diminishes as the orientation contrast approaches zero (see Figure 7). Furthermore, even at a given contrast |θ_1 − θ_2|, the border enhancement δC depends on θ_1. For instance, with |θ_1 − θ_2| = π/2 and the association field connections, the enhancement δC for border bars parallel to the border (θ_1 = 0, so that they form a contour) is higher than that for border bars perpendicular to the border (θ_1 = π/2). This is because both the suppression
Design and Analysis of a Recurrent V1 Model
Figure 7: Simulated examples of texture boundary highlights between different pairs of textures, defined by bar orientations. In each example, we show the input image $I_{i\theta}$ above the output image $g_x(x_{i\theta})$ averaged in time. Each image shows a small region out of an extended input area. (A) $\theta_1 = 45$ degrees, $\theta_2 = -45$ degrees. (B) $\theta_1 = 45$ degrees, $\theta_2 = 90$ degrees. (C) $\theta_1 = 0$ degrees, $\theta_2 = 90$ degrees. (D) $\theta_1 = 45$ degrees, $\theta_2 = 60$ degrees. The texture border is vertical in the middle of each stimulus pattern. Note how border highlights increase with increasing orientation contrast $|\theta_1 - \theta_2|$. The orientation contrast of 15 degrees in D is difficult to detect by the model or humans. The orientation contrast is $|\theta_1 - \theta_2| = 90$ degrees for both A and C. Note how the responses to the boundary bars decrease with increasing orientation differences between the bars and the boundary.
$g'_y(\bar{y}_{\theta_1}) W^{0\theta_1}_a - J^{0\theta_1}_a$ between parallel contours ($\theta_1 = 0$ and $a \ne 0$) and the facilitation $J^{0\theta_1}_0 - g'_y(\bar{y}_{\theta_1}) W^{0\theta_1}_0$ within a contour (see Figure 6D) are much stronger than their counterparts for the vertical arrays of horizontal bars (see Figure 6E). Thus, the strength of the border highlight is predicted to be tuned to the relative orientation $\theta_1$ between the border and the bars
(Li, 2000a). This explains the asymmetry in the outputs of Figure 7C; the highlight of the vertical border is much stronger for the vertical than the horizontal texture bars. Clearly, the approximations $\bar{x}_{i\theta_1} \approx \bar{x}_{\theta_1}$ for $m_i \ge 0$ and $\bar{x}_{i\theta_2} \approx \bar{x}_{\theta_2}$ for $m_i < 0$, which are used to arrive at equation 3.21, break down at the border, especially at more salient borders like that in Figure 7C. This accentuates the tuning of the border highlight to $\theta_1$. Iso-orientation suppression underlies the border highlight, and by equation 3.14, its strength $I - E$ depends on contrast through $g'_y(\bar{y})$. Since $g'_y(\bar{y})$ usually increases with increasing $\bar{y}$, the highlight is stronger at higher contrast. Psychophysically, texture segmentation does require an input contrast well above the texture detection threshold (Nothdurft, 1994). It is easy to tune the connection weights in the model quantitatively such that iso-orientation suppression holds at all input contrasts, or holds only at sufficient input contrast and becomes iso-orientation facilitation at very low contrast, as in Li (1998, 1999a). Computationally, facilitation certainly helps texture detection, which at low input contrast could be more important than segmentation. On this note, contour facilitation ($F_{\rm contour} > 0$) holds at all contrasts (Li, 1998) using the biphasic connection, since no $W$ connections link the contour segments. Nonbiphasic connections should be employed to model diminished contour enhancement at high contrast (Sceniak, Ringach, Hawken, & Shapley, 1999).

3.5 Translation Invariance and Pop-out. In the examples above, orientation contrasts are highlighted because they mark boundaries between textures composed of bars of single orientations. However, if orientation contrasts are homogeneous within the texture itself, they will not pop out. Figure 8A shows an example for which the texture is made of alternating columns of bars at $\theta_1 = 45$ degrees (even $a$) and $\theta_2 = 135$ degrees (odd $a$).
The contextual suppression of a bar oriented at $\theta_1$ is
$$C_{\rm complex\text{-}texture} = \sum_{\text{even } a} \big(g'_y(\bar{y}_{\theta_1}) W^{0\theta_1}_a - J^{0\theta_1}_a\big) g_x(\bar{x}_{\theta_1}) + \sum_{\text{odd } a} \big(g'_y(\bar{y}_{\theta_1}) W^{0\theta_1\theta_2}_a - J^{0\theta_1\theta_2}_a\big) g_x(\bar{x}_{\theta_2}). \qquad (3.22)$$
Thus, no bar oriented at $\theta_1$ is less suppressed, or more salient, than the other bars oriented at $\theta_1$. Note that since $C_{\rm complex\text{-}texture} \ne C^{\theta_1}_{\rm whole\text{-}texture}$, the value of $\bar{x}_{\theta_1}$ is not the same as it would be in a simple texture of bars of a single orientation $\theta_1$. This applies similarly to $\bar{x}_{\theta_2}$. For general $\theta_1$ and $\theta_2$, $\bar{x}_{\theta_1} \ne \bar{x}_{\theta_2}$. In Figure 8A, reflection symmetry leads to $\bar{x}_{\theta_1} = \bar{x}_{\theta_2}$, or uniform saliency within the whole texture. In Figure 8B, bars oriented at $\theta_1 = 0$ degrees induce higher responses than those oriented at $\theta_2 = 90$ degrees. Nevertheless, looking at this texture, which is defined by both the vertical and horizontal bars and their spatial arrangement, no local patch of the texture is more
Figure 8: Model responses to homogeneous (A, B) and inhomogeneous (C) input images, each composed of bars of equal input contrasts. (A) A homogeneous (despite the orientation contrast) texture of bars of different orientations. A uniform output saliency results. (B) Another homogeneous texture. Vertical bars are more salient; however, the whole texture has a translation-invariant saliency distribution. (C) The small figure pops out from the background because it is where translation invariance is broken in the inputs, and the whole figure is its own boundary.
salient than any other patch. This translation invariance in saliency is simply the result of the network's preserving the translation invariance in the input (texture), as long as the translation symmetry is not spontaneously broken (see section 3.7 for analysis). A boundary between textures is one place where the input is not translation invariant and is highlighted by the cortical interactions. A special case is when one small texture patch is embedded in a large and different texture. The small texture is small enough that the whole texture is its own boundary and thus pops out from the background (see Figure 8C). In general, orientation contrasts do not correspond to texture boundaries and thus do not necessarily pop out. Through contextual influences, the highlight at a texture border can alter responses to nearby locations up to a distance comparable to the lateral connection lengths. Hence, the response to a texture region is not homogeneous unless this region is far enough away from the border. This is evident at the right side of the border in Figure 7C. These effects are not to be confused with spontaneous symmetry breaking since they are generated by the input border and are local. (See Li, 2000b, for more details about these effects and their physiological counterparts.)

3.6 Filling In and Leaking Out. Small fragments of a contour or homogeneous texture can be missing in inputs due to input noise or the visual scene itself. Filling in is the phenomenon that the missing input fragments are not noticed (see Pessoa, Thompson, & Noe, 1998, for an extensive discussion). It could be caused by one of the following two possible mechanisms. The first is that although the cells for the missing fragment do not receive direct visual inputs, they are excited enough by the contextual influences to fire as if there were direct visual inputs. (This is how Grossberg & Mingolla, 1985, model illusory contours.) The second possibility is that although the
Figure 9: Examples of filling in: Model outputs from inputs composed of bars of equal contrasts in each example. (A) A line with a gap. The response to the gap is nonzero. (B) A texture with missing bars. The responses to bars near the missing bars are not significantly higher than the responses to other texture bars.
cells for the missing fragment do not fire, the regions near, but not at, the missing fragments are not salient or conspicuous enough to attract visual attention strongly. In the latter case, the missing fragments are noticeable only by attentive visual scrutiny or search. It is not clear from physiology (Kapadia et al., 1995) which mechanism is involved. Consider a single bar segment $i = (m_i = 0, n_i = 0)$ missing in a smooth contour, say, a horizontal line like Figure 9A; filling in could be achieved by either of the two possible mechanisms. To excite the cell $i$ to firing threshold $T_x$ (such that $g_x(x_i > T_x) > 0$), the contextual facilitation $\sum_j \big(J_{ij} - W_{ij}\, g'_y(\bar{y}_i)\big) g_x(\bar{x}_j)$ should be strong enough, or approximately
$$F_{\rm contour} + I_o = (E - I)\, g_x(\bar{x}) + I_o > T_x, \qquad (3.23)$$
where $I_o$ is the background input, $F_{\rm contour}$ and the effective net connections $E$ and $I$ are as defined in equations 3.9 through 3.12, and we approximate for
Design and Analysis of a Recurrent V1 Model
1771
all contour bars $(\bar{x}_j, \bar{y}_j)$ by $(\bar{x}, \bar{y})$, the translation-invariant activity in a complete contour. If segments within a smooth contour facilitate each other's firing, then a missing fragment $i$ reduces the saliencies of the neighboring contour segments $j \approx i$. The missing segment and its vicinity are thus not easily noticed, even if the cell $i$ for the missing segment does not fire. The cell $i = (m_i = 0; n_i = 0)$ should not be excited enough to fire if the left half $j = (m_j < 0, n_j = 0)$ of the horizontal contour is removed. Otherwise the contour extends beyond its end or grows in length, that is, it leaks out. To prevent leaking out,
$$F_{\rm contour}/2 + I_o < T_x, \qquad (3.24)$$
since the contour facilitation to $i$ is approximately $F_{\rm contour}/2$, half of the $F_{\rm contour}$ in an infinitely long contour. The inequality 3.24 is satisfied for the line end in Figure 5B and should hold at any contour saliency $g_x(\bar{x})$. Not leaking out also means that large gaps in lines cannot be filled in. To prevent leaking out at $i = (m_i = 0, n_i = 1)$, the side of an infinitely long (e.g.) horizontal contour on the horizontal axis in Figure 4B (thus, to prevent the contour from getting thicker), we require $\sum_{j \in \text{contour}} \big(J_{ij} - g'_y(\bar{y}_i) W_{ij}\big) g_x(\bar{x}) < T_x - I_o$ for $i \notin$ contour. This condition is satisfied in Figure 5A. Filling in in a texture with missing fragments $i$ (texture filling in) is feasible only by the second mechanism (to avoid conspicuousness near $i$), since $i$ cannot be excited to fire if the contextual input within a texture is suppressive. If $i$ is not missing, its neighbor $k \approx i$ receives contextual suppression $(I - E)\, g_x(\bar{x}) \equiv \sum_{j \in \text{texture}} \big(g'_y(\bar{y}) W_{kj} - J_{kj}\big) g_x(\bar{x})$. A missing $i$ makes its neighbor $k$ more salient by the removal of its contribution $\big(W_{ki}\, g'_y(\bar{y}) - J_{ki}\big) g_x(\bar{x})$ to the suppression. This contribution should be a negligible fraction of the total suppression to ensure that the neighbors are not too conspicuous, that is,
$$g'_y(\bar{y}) W_{ki} - J_{ki} \ll (I - E) \equiv \sum_{j \in \text{texture}} \big(g'_y(\bar{y}) W_{kj} - J_{kj}\big). \qquad (3.25)$$
This is expected when the lateral connections are extensive enough to reach sufficiently large contextual areas, that is, when $W_{ki} \ll \sum_j W_{kj}$ and $J_{ki} \ll \sum_j J_{kj}$. Leaking out is not expected outside a texture border when the contextual input from the texture is suppressive. Note that filling in by exciting the cells for a gap in a contour (see equation 3.23) works against preventing leaking out (see equation 3.24) from contour ends. It is not difficult to build a model that achieves active filling in. However, preventing the model from leaking out and from producing unwarranted illusory contours implies a small range of choices for the connection strengths $J$ and $W$.

3.7 Hallucination Prevention and Neural Oscillations. To ensure that the model performs the desired computation analyzed in sections 3.1
1772
Zhaoping Li
through 3.6, the mean or the fixed points $(\bar{X}, \bar{Y})$ in these analyses should correspond to actual model behavior. (We use boldface characters to represent vectors or matrices.) Section 2 showed that this is difficult to achieve in the symmetric networks, as the fixed points $\bar{X}$ for the desired computation are likely to be unstable; that is, they are saddle points or local maxima in the energy function. In that case, the actual model output deviates drastically from the desired fixed point $\bar{X}$ or visual input, and, in particular, visual hallucinations occur. In the corresponding E-I network, the asymmetric connections between E and I give the network a tendency to oscillate around the fixed point. This oscillation enables our model to avoid the motion of $(X, Y)$ toward hallucination (Li & Dayan, 1999), making it possible for the desired fixed points $(\bar{X}, \bar{Y})$ to correspond to the (temporally averaged) model behavior. However, this correspondence is not guaranteed and is in fact difficult to achieve without guided model design. It requires stability conditions on $(\bar{X}, \bar{Y})$ to be satisfied, which constrains $J$ and $W$, in addition to the conditions placed on $J$ and $W$ in sections 3.1 through 3.6 for the desired contour integration and texture segmentation (the inequalities 3.12, 3.14, and 3.23 through 3.25). This section derives these stability constraints and their implications. Specifically, we derive the condition to prevent visual hallucinations, or spontaneous formations of spatially inhomogeneous outputs given translation-invariant visual inputs. Ermentrout and Cowan (1979) analyzed the stability conditions to obtain hallucinations in a simplified model of V1 (see Bressloff, Cowan, Golubitsky, Thomas, & Wiener, in press, for a more recent analysis) and studied the forms and dynamics of the hallucinations. Li (1998) analyzed the stability constraints to prevent hallucination under contour inputs.
Here we generalize the analysis in Li (1998) to other homogeneous inputs and, in addition, analyze the nonlinear dynamics of (nonhallucinating) homogeneous oscillations around $(\bar{X}, \bar{Y})$. We omit the analysis of the emergent hallucinations since the hallucinations are prevented by the model. To analyze stability, we study how small deviations $(X - \bar{X}, Y - \bar{Y})$ from the fixed point evolve. Change variables $(X - \bar{X}, Y - \bar{Y}) \to (X, Y)$. For small deviations $X$, $Y$, a Taylor expansion of equations 2.5 and 2.6 gives the linear approximation:
$$\begin{pmatrix} \dot{X} \\ \dot{Y} \end{pmatrix} = \begin{pmatrix} -1 + \mathbf{J} & -\mathbf{G}'_y \\ \mathbf{G}'_x + \mathbf{W} & -1 \end{pmatrix} \begin{pmatrix} X \\ Y \end{pmatrix}, \qquad (3.26)$$
where $\mathbf{J}$, $\mathbf{W}$, $\mathbf{G}'_x$, and $\mathbf{G}'_y$ are matrices with $\mathbf{J}_{i\theta j\theta'} = J_{i\theta, j\theta'}\, g'_x(\bar{x}_{j\theta'})$ for $i \ne j$, $\mathbf{J}_{i\theta, i\theta} = J_o\, g'_x(\bar{x}_{i\theta})$, $\mathbf{W}_{i\theta j\theta'} = W_{i\theta, j\theta'}\, g'_x(\bar{x}_{j\theta'})$ for $i \ne j$, $\mathbf{W}_{i\theta, i\theta'} = 0$, $(\mathbf{G}'_x)_{i\theta j\theta'} = \delta_{i\theta j\theta'}\, g'_x(\bar{x}_{j\theta'})$, and $(\mathbf{G}'_y)_{i\theta j\theta'} = \delta_{ij}\, \psi(\theta - \theta')\, g'_y(\bar{y}_{j\theta'})$, where $\psi(0) = 1$. To focus on the output $X$, eliminate the (hidden) variable $Y$:
$$\ddot{X} + (2 - \mathbf{J})\dot{X} + \big(\mathbf{G}'_y(\mathbf{G}'_x + \mathbf{W}) + 1 - \mathbf{J}\big) X = 0. \qquad (3.27)$$
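As a quick consistency check of the elimination step, for a single E-I pair (scalar case) the eigenvalues of the matrix in equation 3.26 must coincide with the roots of the characteristic polynomial of equation 3.27. A minimal numerical sketch (the parameter values below are arbitrary assumptions for illustration):

```python
import numpy as np

# Scalar stand-ins for J, W, G'_x, G'_y (assumed values for illustration).
J, W, gx, gy = 0.8, 0.3, 1.0, 0.9

# Linearized E-I system, equation 3.26.
A = np.array([[-1.0 + J, -gy],
              [gx + W,   -1.0]])
eigs = np.sort_complex(np.linalg.eigvals(A))

# Characteristic polynomial of equation 3.27:
# lambda^2 + (2 - J) lambda + (gy (gx + W) + 1 - J) = 0.
roots = np.sort_complex(np.roots([1.0, 2.0 - J, gy * (gx + W) + 1.0 - J]))

assert np.allclose(eigs, roots)
print(eigs)
```

The same check recovers equation 3.29 below when $J$ and $W$ are replaced by their Fourier components $J_k$ and $W_k$.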
Design and Analysis of a Recurrent V1 Model
1773
Consider inputs of our interest, bars arranged in a translation-invariant fashion along a one- or two-dimensional array. For simplicity and approximation, we again omit bars outside the array and their associated quantities in equation 3.27, and we omit the index $\theta$, as we did in sections 3.3 and 3.4. Translation symmetry implies $(\bar{x}_i, \bar{y}_i) = (\bar{x}, \bar{y})$, $(\mathbf{G}'_y)_{ij} = \delta_{ij}\, g'_y(\bar{y})$, $(\mathbf{G}'_x)_{ij} = \delta_{ij}\, g'_x(\bar{x})$, $(\mathbf{G}'_y \mathbf{G}'_x)_{ij} = g'_x(\bar{x})\, g'_y(\bar{y})\, \delta_{ij}$, and $(\mathbf{G}'_y \mathbf{W})_{ij} = g'_y(\bar{y})\, \mathbf{W}_{ij}$. Furthermore, $\mathbf{J}_{ij} = \mathbf{J}_{i+a, j+a} \equiv \mathbf{J}_{i-j}$ and $\mathbf{W}_{ij} = \mathbf{W}_{i+a, j+a} \equiv \mathbf{W}_{i-j}$ for any $a$. One can now go to the Fourier domain of the spatial variables $\{X_i\}$ and their associated quantities $\mathbf{J}$, $\mathbf{W}$ to obtain
$$\ddot{X}_k + (2 - J_k)\dot{X}_k + \big(g'_y(\bar{y})(g'_x(\bar{x}) + W_k) + 1 - J_k\big) X_k = 0, \qquad (3.28)$$
where $X_k$, $J_k$, and $W_k$ are the spatial Fourier transforms of $X$, $\mathbf{J}$, and $\mathbf{W}$ for frequency $f_k$ such that $e^{i f_k N} = 1$, where $N$ is the size of the system. $X_k$ is the amplitude of the spatial wave of frequency $f_k$ in the deviation $X$ from the fixed point, $J_k = \sum_a \mathbf{J}_a e^{i f_k a}$, and $W_k = \sum_a \mathbf{W}_a e^{i f_k a}$. $X_k$ evolves as $X_k(t) \propto e^{\gamma_k t}$, where
$$\gamma_k \equiv -1 + J_k/2 \pm i \sqrt{g'_y (g'_x + W_k) - J_k^2/4}. \qquad (3.29)$$
Denote by $\mathrm{Re}(\gamma_k)$ the real part of $\gamma_k$. $\mathrm{Re}(\gamma_k) < 0$ for all $k$ makes any deviation $X$ decay to zero, and hence no hallucination can occur. Otherwise, the mode with the largest $\mathrm{Re}(\gamma_k)$ (let it be $k = 1$) will dominate the deviation $X(t)$. If this mode has zero spatial frequency, $f_1 = 0$, then the dominant deviation is translation invariant and synchronized across space, and hence no spatially varying patterns can be hallucinated. Thus, the conditions to prevent hallucinations are
$$\mathrm{Re}(\gamma_k) < 0 \quad \text{for all } k, \qquad \text{or} \qquad \mathrm{Re}(\gamma_1)\big|_{f_1 = 0} > \mathrm{Re}(\gamma_k)\big|_{f_k \ne 0}. \qquad (3.30)$$
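The conditions 3.30 can be checked numerically mode by mode. A minimal sketch (the Gaussian fall-off of $\mathbf{J}_a$, the contour-like choice $\mathbf{W} = 0$, and the parameter values are all assumptions for illustration):

```python
import numpy as np

N = 64
a = np.arange(-N // 2, N // 2)
J_a = 0.05 * np.exp(-a**2 / 2.0)     # assumed Gaussian fall-off of J_{i-j}
gx_p, gy_p = 1.0, 0.9                # assumed slopes g'_x(xbar), g'_y(ybar)

f = 2 * np.pi * np.arange(N) / N     # spatial frequencies f_k
J_k = np.array([np.sum(J_a * np.cos(fk * a)) for fk in f])  # real, since J_a = J_{-a}
W_k = np.zeros(N)                    # contour input: W_ij = 0

# Larger real part of the two roots in equation 3.29.
disc = gy_p * (gx_p + W_k) - J_k**2 / 4
re_gamma = -1 + J_k / 2 + np.where(disc >= 0, 0.0, np.sqrt(np.abs(disc)))

k_dom = int(np.argmax(re_gamma))
print("dominant mode frequency:", f[k_dom])
print("all modes decay:", bool(re_gamma.max() < 0))
```

For this choice, all modes decay and the dominant mode is the translation-invariant one ($f = 0$), so both alternatives in equation 3.30 hold and no hallucination is expected.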
When $\mathrm{Re}(\gamma_1)\big|_{f_1 = 0} > 0$, the fixed point is not stable; the divergent homogeneous deviation $X$ is eventually confined by the threshold and saturation nonlinearity. It oscillates (synchronously) in time when $g'_y(g'_x + W_1) - J_1^2/4 > 0$ or when there is no other fixed point that the system trajectory can approach. Denote this translation-invariant oscillatory trajectory by $(x, y) = (x_i, y_i)$, which is the same for all $i$. Then,
$$\dot{x} = -x - \big(g_y(y + \bar{y}) - g_y(\bar{y})\big) + \tilde{J}_1 \big(g_x(x + \bar{x}) - g_x(\bar{x})\big)$$
$$\dot{y} = -y + (1 + \tilde{W}_1)\big(g_x(x + \bar{x}) - g_x(\bar{x})\big),$$
where $\tilde{J}_1 = J_1 / g'_x(\bar{x})$ and $\tilde{W}_1 = W_1 / g'_x(\bar{x})$. After converging to a finite oscillation amplitude, the trajectory $(x(t), y(t))$ is a closed curve in the $(x, y)$
1774
Zhaoping Li
space. It oscillates with period $T$ such that $(x(t + T), y(t + T)) = (x(t), y(t))$ and satisfies
$$\int_0^T dt \left[ (1 + \tilde{W}_1)\, x\, \big(g_x(x + \bar{x}) - g_x(\bar{x})\big) + y\, \big(g_y(y + \bar{y}) - g_y(\bar{y})\big) \right] = \int_0^T dt\, \tilde{J}_1 (1 + \tilde{W}_1) \big(g_x(x + \bar{x}) - g_x(\bar{x})\big)^2, \qquad (3.31)$$
since over a cycle of the oscillation, the oscillation energy
$$\int_{\bar{x}}^{x + \bar{x}} (1 + \tilde{W}_1)\big(g_x(s) - g_x(\bar{x})\big)\, ds + \int_{\bar{y}}^{y + \bar{y}} \big(g_y(s) - g_y(\bar{y})\big)\, ds \qquad (3.32)$$
(the potential and kinetic energies, the two terms in the above expression) is dissipated and restored in balance, as readers can verify. This is because the dissipation, on the left side of equation 3.31, is balanced by the self-excitation, on the right side of equation 3.31. At smaller oscillation amplitudes, the self-excitation dominates, as exemplified by the unstable fixed point; at larger amplitudes, the dissipation dominates because of the saturation or threshold behavior in the self-excitation. Thus, the oscillation trajectory converges to the balance of a periodic nonlinear oscillation. Since $\mathbf{J}_a = \mathbf{J}_{-a} \ge 0$ and $\mathbf{W}_a = \mathbf{W}_{-a} \ge 0$, $J_k$ and $W_k$ are real and have maxima $\mathrm{Max}(J_k) = \sum_a \mathbf{J}_a$ and $\mathrm{Max}(W_k) = \sum_a \mathbf{W}_a$ at the zero-frequency mode $f_k = 0$. Many simple forms of $J$ and $W$ decay with $f_k$; for example, $\mathbf{J}_a \propto e^{-a^2/2}$ gives $J_k \propto e^{-f_k^2/2}$. However, the dominant mode is determined by the value of $\mathrm{Re}(\gamma_k)$ and may have $f_1 \ne 0$. In principle, given model interactions $J$ and $W$ and a translation-invariant input, whether it is arranged on a Manhattan grid or some other grid, $\mathrm{Re}(\gamma_k)$ should be evaluated for all $k$ to ensure appropriate behavior of the model, or the inequalities 3.30. In practice, the finite range of $(J, W)$ and the discreteness and (rotational) symmetry of the image grid imply that often only a finite, discrete set of $k$ needs to be examined. Let us look at some examples using the biphasic connections. For a one-dimensional contour input like that in Figure 4B, $\mathbf{W}_{ij} = 0$. Then $\mathrm{Re}(\gamma_k) = \mathrm{Re}\big(-1 + J_k/2 \pm i\sqrt{g'_y g'_x - J_k^2/4}\big)$ increases with $J_k$, whose maximum occurs at the translation-invariant mode $f_1 = 0$, with $J_1 = \sum_j \mathbf{J}_{ij}$. Then no hallucination can happen, although synchronous oscillations can occur when enough excitatory connections $J$ link the units involved. For one-dimensional noncontour inputs like those in Figures 4C and 4E, $\mathbf{J}_{ij} = 0$ for $i \ne j$; thus $J_k = \mathbf{J}_{ii}$ and $\gamma_k = -1 + \mathbf{J}_{ii}/2 \pm i\sqrt{g'_y(g'_x + W_k) - \mathbf{J}_{ii}^2/4}$.
Hence, $\mathrm{Re}(\gamma_k) < -1 + \mathbf{J}_{ii} = -1 + J_o\, g'_x(\bar{x}) < 0$ for all $k$, since $-1 + J_o\, g'_x(\bar{x}) < 0$ is always satisfied (otherwise an isolated principal unit $x$, which follows the equation $\dot{x} = -x + J_o g_x(x) + I$, is not well behaved). Hence there should be no hallucination or oscillation.
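The two regimes (a sustained synchronous oscillation versus a decaying transient) can be reproduced with a toy simulation of the translation-invariant mode, that is, the $x$, $y$ system above equation 3.31. The piecewise-linear gain functions and all parameter values below are assumptions chosen so that the fixed point is an unstable spiral for strong contour-like self-excitation $\tilde{J}_1$ and stable for weak self-excitation, mirroring the contrast in Figure 10:

```python
import numpy as np

def g_x(u):   # assumed excitatory gain: threshold at 0, saturation at 1
    return np.clip(u, 0.0, 1.0)

def g_y(u):   # assumed inhibitory gain: slope 2, saturation at 2
    return np.clip(2.0 * u, 0.0, 2.0)

def simulate(J1, W1=0.6, xbar=0.5, ybar=0.5, dt=0.01, T=40.0):
    """Euler-integrate the deviation (x, y) from the fixed point."""
    x, y = 0.1, 0.0                  # small initial deviation
    xs = []
    for _ in range(int(T / dt)):
        dgx = g_x(x + xbar) - g_x(xbar)
        dgy = g_y(y + ybar) - g_y(ybar)
        dx = -x - dgy + J1 * dgx
        dy = -y + (1.0 + W1) * dgx
        x, y = x + dt * dx, y + dt * dy
        xs.append(x)
    return np.array(xs)

tail = int(10.0 / 0.01)              # the last 10 time units
osc = simulate(J1=2.4)               # Re(gamma_1) > 0: sustained oscillation
damp = simulate(J1=1.0)              # Re(gamma_1) < 0: decaying transient
print("sustained swing:", float(np.ptp(osc[-tail:])))
print("damped swing:   ", float(np.ptp(damp[-tail:])))
```

With the strong coupling, the deviation settles onto a finite-amplitude limit cycle (a large late-time swing); with the weak coupling, it decays back to the fixed point (a near-zero swing).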
Under two-dimensional texture inputs, the frequency $f_k = (f_x(k), f_y(k))$ is a wave vector perpendicular to the peaks and troughs of the waves. When $f_k = (f_x(k), 0)$ is in the horizontal direction, $J_k = g'_x(\bar{x}) \sum_a J^0_a e^{i f_x(k) a}$ and $W_k = g'_x(\bar{x}) \sum_a W^0_a e^{i f_x(k) a}$, where $J^0_a$ and $W^0_a$ are the effective connections between two texture columns as defined in equation 3.13. Hence, the texture can be analyzed as a one-dimensional array as above, substituting bar-to-bar connections $(J, W)$ by column-to-column connections $(J^0, W^0)$. However, $J^0$ and $W^0$ are stronger, have a more complex Fourier spectrum $(J_k, W_k)$, and depend on the orientation $\theta_1$ of the texture bars. Again use the biphasic connection for examples. When $\theta_1 = 90$ degrees (horizontal bars), $W^0_b$ is weak between columns, that is, $W^0_b \approx \delta_{b0} W^0_0$ and $W_k \approx W^0_0$. Then $\mathrm{Re}(\gamma_k)$ is largest when $J_k$ is, at $f_x(k) = 0$, a translation-invariant mode. Hence, illusory saliency waves (peaks and troughs) perpendicular to the texture bars are unlikely. Consider, however, vertical texture bars for the horizontal wave vector $f_k = (f_x(k), 0)$. The biphasic connection gives nontrivial $J^0_b$ and $W^0_b$ between vertical columns, or nontrivial dependences of $J_k$ and $W_k$ on $f_k$. The dominant mode with the largest $\mathrm{Re}(\gamma_k)$ is not guaranteed to be homogeneous, and $J$ and $W$ must be tuned to prevent hallucination. Given a nonhallucinating system and under simple or translation-invariant inputs, neural oscillations, if they occur, can only be synchronous and homogeneous (that is, identical) among the units involved, that is, $f_1 = 0$. Since $\gamma_1 = -1 + J_1/2 \pm i\sqrt{g'_y(g'_x + W_1) - J_1^2/4}$ and $J_1 = \sum_j \mathbf{J}_{ij}$ for $f_1 = 0$, the tendency for oscillation increases with increasing excitatory-to-excitatory links $J_{ij}$ between the units involved (Koenig & Schillen, 1991). Hence, this tendency is likely to be higher for two-dimensional texture inputs than for one-dimensional array inputs, and lowest for a single, small bar input.
This may explain why neural oscillations are observed in some physiological experiments and not in others (Gray & Singer, 1989; Eckhorn et al., 1988). Under the biphasic connections, a long contour input is more likely to induce oscillation than an input of a noncontour one-dimensional array (see Figure 10). These predictions can be physiologically tested. Indeed, physiologically, grating stimuli are more likely to induce oscillations than bar stimuli (Molotchnikoff, Shumikhina, & Moisan, 1996).

4 Summary and Discussion
In this article, we have argued that a recurrent model composed of interacting E-I pairs is a suitable minimal model of the primary visual cortex performing preattentive computation of contour integration and texture segmentation. We analyzed the nonlinear input-output transform $I \to g_x(x)$ and the stability and temporal dynamics of the model. We derived design conditions on the intracortical connections such that $I \to g_x(x)$ performs the desired computations and no hallucinations occur. Such an understanding has been essential to reveal the computational potential and limitations of
1776
Zhaoping Li
[Figure 10: model input patterns (left column) and cell temporal activities (right column). (A) Outputs for an isolated bar. (B) Outputs for 2 line segments. (C) Outputs for 2 array bars. (D) Outputs for 3 texture bars.]
Figure 10: Different stimuli have different tendencies to cause oscillatory responses. The pictures on the left show visual stimuli (all appear at time zero and stay on), and the graphs on the right show the time courses of the neural activities. (A) An isolated bar, the neural response to which stabilizes after an initial oscillatory transient. (B) An input contour and the synchronized and sustained oscillatory neural responses of two nonneighboring neurons. All neurons corresponding to the contour segments respond similarly. (C) A horizontal array of vertical bars and the responses (decaying oscillations toward static values) of two nonneighboring neurons. (D) An input texture (with some holes in it) and the sustained oscillatory responses of three neurons, whose spatial (horizontal, vertical) coordinates are (2, 2) (solid curve), (15, 2) (dotted curve), and (5, 9) (solid-dotted curve). The coordinate of the bottom left texture bar is (0, 0). Note that the responses to bars next to the holes in the texture are a little higher.
the models and led to a successful design (Li, 1998, 1999a). The analysis techniques in this article can be applied to other recurrent networks whose neural connections are translationally symmetric. Note that the design conditions for a functional model can be satisfied by many quantitatively different models with qualitatively the same architecture. The model by Li (1998, 1999a) is one of them, and interested readers can find quantitative comparisons between the behavior of that model and experimental data. Although the behavior of Li's model agrees reasonably well with experimental data, there must be better and quantitatively different models. In particular, as discussed in this article, nonbiphasic connections (unlike those in Li's model) could be more computationally flexible and thus account for additional experimental data. Emphasizing analytical tractability and a minimal design, this article does not survey other visual cortical models with more elaborate structures and components. (See Li, 1998, 1999a, for such a survey.) One example is Grossberg and Raizada's (2000) recent model of V1 and V2, which evolved from earlier versions by Grossberg and Mingolla (1985) and in which the neuron units are now differentiated into excitatory and inhibitory ones. In this article, we have shown an example of how nonlinear neural dynamics link computations with the model architecture and neural connections. Additional or different computational goals, including ones that may be performed by the primary visual cortex and are not yet modeled by our model example, might call for a more complex or different design. For example, our model lacks an end-stopping mechanism for V1 neurons. Such a mechanism could highlight the ends of, or gaps in, a contour, which in our model induce decreased responses (relative to the rest of the contour) due to reduced contour facilitation (Li, 1998).
Highlighting the line ends can be desirable, especially under high input contrasts when the gaps are clearly not due to input noise, and both the gaps and ends of contours can be behaviorally very meaningful. Without end-stopping, our model is fundamentally limited in performing these computations. Our model also does not generate subjective contours like the ones that form the Kanizsa triangle or the Ehrenstein illusion (which could enable a perception of a circle whose contour connects the interior line ends of the bars in Figure 4E). Evidence (von der Heydt, Peterhans, & Baumgartner, 1984) suggests that area V2, rather than V1, is more likely to be responsible for these subjective contours, and this is addressed by models by Grossberg and colleagues (Grossberg & Mingolla, 1985; Grossberg & Raizada, 2000). Another desired computation is to generalize the notion of translation invariance to prevent spontaneous saliency differentiation even when the input is not homogeneous in the image plane but is generated from a homogeneous flat texture surface slanted in depth. This will require multiscale image representations and recurrent interactions between cells tuned to different scales. By studying the recurrent nonlinear dynamics and analyzing the link between the structure and computation of a model, we hope to be able to understand better the computations in the primary visual cortex and in other visual or nonvisual cortical areas where recurrent network dynamics play important roles.

Acknowledgments
I am very grateful to Peter Dayan for conversations and discussions and his very helpful comments on the drafts of this article. I also much appreciated comments and suggestions from the anonymous reviewers of this article. This work is supported in part by the Gatsby Foundation.

References

Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. U.S.A., 92(9), 3844–3848.
Braun, J., Niebur, E., Schuster, H. G., & Koch, C. (1994). Perceptual contour completion: A model based on local, anisotropic, fast-adapting interactions between oriented filters. Society for Neuroscience Abstracts, 20, 1665.
Bressloff, P. C., Cowan, J. D., Golubitsky, M., Thomas, P. J., & Wiener, M. C. (In press). Geometric visual hallucinations, Euclidean symmetry, and the functional architecture of striate cortex. Phil. Trans. Royal Soc. B.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Systems, Man, Cybern., 13, 815–826.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analysis in the cat. Biol. Cybern., 60, 121–130.
Ermentrout, G. B., & Cowan, J. D. (1979). A mathematical theory of visual hallucination patterns. Biol. Cybern., 34, 137–150.
Field, D. J., Hayes, A., & Hess, R. F. (1993). Contour integration by the human visual system: Evidence for a local "association field." Vision Res., 33(2), 173–193.
Gallant, J. L., van Essen, D. C., & Nothdurft, H. C. (1995). Two-dimensional and three-dimensional texture processing in visual cortex of the macaque monkey. In T. Papathomas, C. Chubb, A. Gorea, & E. Kowler (Eds.), Early vision and beyond (pp. 89–98). Cambridge, MA: MIT Press.
Gilbert, C. D. (1992). Horizontal integration and cortical dynamics. Neuron, 9(1), 1–13.
Gilbert, C. D., & Wiesel, T. N. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3(5), 1116–1133.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Percept. Psychophys., 38(2), 141–171.
Grossberg, S., & Raizada, R. D. (2000). Contrast-sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex. Vision Res., 40(10–12), 1413–1432.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci., 11(6), 1800–1809.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Kapadia, M. K., Ito, M., Gilbert, C. D., & Westheimer, G. (1995). Improvement in visual sensitivity by changes in local context: Parallel studies in human observers and in V1 of alert monkeys. Neuron, 15(4), 843–856.
Knierim, J. J., & van Essen, D. C. (1992). Neuronal responses to static texture patterns in area V1 of the alert macaque monkey. J. Neurophysiol., 67, 961–980.
Koenig, P., & Schillen, T. B. (1991). Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Computation, 3, 155–166.
Li, Z. (1997). Primary cortical dynamics for visual grouping. In K. Y. M. Wong, I. King, & D-Y. Yeung (Eds.), Theoretical aspects of neural computation. Springer-Verlag.
Li, Z. (1998). A neural model of contour integration in the primary visual cortex. Neural Computation, 10, 903–940.
Li, Z. (1999a). Visual segmentation by contextual influences via intracortical interactions in primary visual cortex. Network: Computation in Neural Systems, 10, 187–212.
Li, Z. (1999b). Contextual influences in V1 as a basis for pop out and asymmetry in visual search. Proc. Natl. Acad. Sci. USA, 96(18), 10530–10535.
Li, Z. (2000a). Pre-attentive segmentation in the primary visual cortex. Spatial Vision, 13, 25–50.
Li, Z. (2000b). Can V1 mechanisms account for figure-ground and medial axis effects? In S. A. Solla, T. K. Leen, & K-R. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 136–142). Cambridge, MA: MIT Press.
Li, Z., & Dayan, P. (1999).
Computational differences between asymmetrical and symmetrical networks. Network: Computation in Neural Systems, 10, 59–77.
Molotchnikoff, S., Shumikhina, S., & Moisan, L. E. (1996). Stimulus-dependent oscillations in the cat visual cortex: Differences between bar and grating stimuli. Brain Res., 731(1–2), 91–100.
Mundel, T., Dimitrov, A., & Cowan, J. D. (1996). Visual cortex circuitry and orientation tuning. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 887–893). Cambridge, MA: MIT Press.
Nothdurft, H. C. (1994). Common properties of visual segmentation. In G. R. Bock & J. A. Goode (Eds.), Higher-order processing in the visual system (pp. 245–268). New York: Wiley.
Pessoa, L., Thompson, E., & Noe, A. (1998). Finding out about filling-in: A guide to perceptual completion for visual science and the philosophy of perception. Behav. Brain Sci., 21(6), 723–748.
Polat, U., & Sagi, D. (1993). Lateral interactions between spatial channels: Suppression and facilitation revealed by lateral masking experiments. Vis. Res., 33, 993–999.
Rockland, K. S., & Lund, J. S. (1983). Intrinsic laminar lattice connections in primate visual cortex. J. Comp. Neurol., 216, 303–318. Sceniak, M. P., Ringach, D. L., Hawken, M. J., & Shapley, R. (1999). Contrast's effect on spatial summation by macaque V1 neurons. Nature Neurosci., 2(8), 733–739. Sengpiel, R., Baddeley, R., Freeman, T., Harrad, R., & Blakemore, C. (1998). Different mechanisms underlie three inhibitory phenomena in cat area 17. Vision Research, 38, 2067–2080. Shepherd, G. M. (1990). The synaptic organization of the brain (3rd ed.). New York: Oxford University Press. Somers, D. C., Todorov, E. V., Siapas, A. G., Toth, L. J., Kim, D. S., & Sur, M. (1998). A local circuit approach to understanding integration of long-range inputs in primary visual cortex. Cereb. Cortex, 8(3), 204–217. Stemmler, M., Usher, M., & Niebur, E. (1995). Lateral interactions in primary visual cortex: A model bridging physiology and psychophysics. Science, 269, 1877–1880. Tsodyks, M. V., Skaggs, W. E., Sejnowski, T. J., & McNaughton, B. L. (1997). Paradoxical effects of external modulation of inhibitory interneurons. J. Neurosci., 17(11), 4382–4388. von der Heydt, R., Peterhans, E., & Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224(4654), 1260–1262. White, E. L. (1989). Cortical circuits. Boston: Birkhauser. Yen, S.-C., & Finkel, L. H. (1998). Extraction of perceptually salient contours by striate cortical networks. Vision Res., 38(5), 719–741. Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. Journal of Neuroscience, 16, 2112–2126. Zucker, S. W., Dobbins, A., & Iverson, L. (1989). Two stages of curve detection suggest two styles of visual computation. Neural Computation, 1, 68–81. Received February 7, 2000; accepted October 9, 2000.
LETTER
Communicated by Alessandro Treves
A Hierarchical Dynamical Map as a Basic Frame for Cortical Mapping and Its Application to Priming Osamu Hoshino Department of Human Welfare Engineering, Oita University, Oita 870-1192, Japan Satoru Inoue Department of Information Network Science, The University of Electro-Communications, Chofu, Tokyo 182-8585, Japan Yoshiki Kashimori Takeshi Kambara Department of Applied Physics and Chemistry, The University of Electro-Communications, Chofu, Tokyo 182-8585, Japan A hierarchical dynamical map is proposed as the basic framework for sensory cortical mapping. To show how the hierarchical dynamical map works in cognitive processes, we applied it to a typical cognitive task known as priming, in which cognitive performance is facilitated as a consequence of prior experience. Prior to the priming task, the network memorizes a sensory scene containing multiple objects presented simultaneously using a hierarchical dynamical map. Each object is composed of different sensory features. The hierarchical dynamical map presented here is formed by random itinerancy among limit-cycle attractors into which these objects are encoded. Each limit-cycle attractor contains multiple point attractors into which elemental features belonging to the same object are encoded. When a feature stimulus is presented as a priming cue, the network state is changed from the itinerant state to a limit-cycle attractor relevant to the priming cue. After a short priming period, the network state reverts to the itinerant state. Under application of the test cue, consisting of some feature belonging to the object relevant to the priming cue and fragments of features belonging to others, the network state is changed to a limit-cycle attractor and finally to a point attractor relevant to the target feature. This process is considered as the identification of the target. 
The model consistently reproduces various observed results for priming processes such as the difference in identification time between cross-modality and within-modality priming tasks, the effect of the interval between priming cue and test cue on identification time, the effect of priming duration on that time, and the effect of repetition of the same priming task on neural activity.
Neural Computation 13, 1781–1810 (2001) © 2001 Massachusetts Institute of Technology
1782
O. Hoshino, S. Inoue, Y. Kashimori, and T. Kambara
1 Introduction
At any moment, we can easily and rapidly perceive various sensory scenes (e.g., visual, olfactory, auditory, or gustatory) of the external world. Although this can be effortlessly processed by cognitive systems, the neural process of perception is much more complicated. The cognitive system separately detects elemental sensory features of objects in a scene and interprets relations among specific features, forming individual objects so that those objects are perceived correctly. Individual objects present in the same scene are discriminated while they are grouped together, so that the different scenes are also discriminated from each other. Cortical mapping of the sensory information is a fundamental task of the cognitive system and functions effectively in such cognitive processes. Cortical mapping of sensory features has been reported in various sensory systems: cortical maps known as orientation columns (Hubel & Wiesel, 1977) and color-specific blobs (Livingston & Hubel, 1984) in vision, frequency columns (Knudsen, DuLac, & Esterly, 1987) and auditory space maps (Knudsen & Konishi, 1978) in auditory systems, and barrel fields for whisker sensing (van der Loos & Dor, 1978) and segregates for skin sensing (Favorov & Diamond, 1990) in somatosensory systems. Mapping components (features) are represented by stable firing of local groups of neurons that respond specifically to similar sensory features. Exact locations of the groups are the basis for the mapping. In contrast to such a feature detection map, most of the other cortical maps are not as simple. Those maps have to process features, objects (relations among features), and sensory scenes (relations among objects) systematically so that the complete images of features, objects, and scenes are correctly perceived. This type of cortical map has been found in various sensory cognitive systems. 
Cortical maps on which visual information of simple relations of visual features (or simple visual objects) is represented have been found in temporal cortical areas (Tanaka, 1996, 1997). A cortical map that represents relations of visual objects has also been found in prefrontal cortical areas (Hasegawa, Fukushima, Ihara, & Miyashita, 1998). In the first stage of olfaction, the olfactory bulb, individual odor components and their mixing ratios could be mapped on the bulb network (Freeman & Baird, 1987; Lam, Cohen, Wachowiak, & Zochowski, 2000). Parahippocampal cortical areas of rats serve as navigation maps on which information about the layout of local space (visual scenes) is represented (Russell & Kanwisher, 1998). Cortical maps that are considered to serve for binding features belonging to different sensory modalities have been found for gustation, olfaction, and vision in orbitofrontal cortex (Rolls & Baylis, 1994) and for audition and vision in hippocampus (Sakurai, 1996). Many observations on various sensory cortical maps, initiated by Hubel and Wiesel's (1977) studies of the visual cortex, revealed the topographical mapping of features and a hierarchy of filtering stages in which the features
of objects detected at early stages are bound together for complex sensory mapping at later stages. This implies that sensory cortical maps are created at various stages of neural information processing and organized into a hierarchy along the anatomical neural pathways. The components (features, objects, and scenes) represented neurally at those stages are arranged in sensory maps as stable firing of groups of distributed neurons, which implies that the exact locations seldom contribute to representing such complex sensory information. For a clear understanding of cognitive functions, it is important to know the neural basis for the mapping function and in what manner the cognitive systems use those maps in neural information processing. As a plausible framework for cortical mapping function, we have proposed a "dynamical map" scheme (Hoshino, Usuba, Kashimori, & Kambara, 1997) that was constructed based on attractor dynamics developed by pioneering workers (Amari, 1977; Hopfield, 1982, 1984; Tsuda, 1991, 1992; Amit, 1994; Amit & Brunel, 1995). The dynamical map is formed in the dynamic system of a neural network and represented by random itinerancy among stable dynamical attractors such as point attractors, limit-cycle attractors, and chaotic attractors. Those attractors are created when stimulated by a sensory scene, and single attractors represent individual features and objects in the scene. The recognition of an input stimulus is induced by a dynamic phase transition from the randomly itinerant state to the basin of the attractor corresponding to the input. The randomly itinerant state is regarded as a critical state at which the network is able to respond effectively to input stimuli. The driving force for the transition is a fast and infinitesimal synaptic change during the input application period (Hoshino, Kashimori, & Kambara, 1996). 
We applied the dynamical map scheme to the olfactory bulb (Hoshino, Kashimori, & Kambara, 1998) and investigated the functional structure and roles of the olfactory map in odor information processing. We have shown that the olfactory map has a dynamically hierarchical structure. In bulb network dynamics, the chemical components of a single odor and their mixing ratio are represented as point attractors and a limit-cycle attractor, respectively. The olfactory map is characterized by random itinerancy among limit-cycle attractors into which individual odors are encoded. Odor recognition is a dynamical phase transition from the randomly itinerant state to a specific limit-cycle attractor corresponding to an input odor stimulus. The multiple dynamic states, that is, the randomly itinerant state, limit-cycle attractors, and point attractors, characterize the olfactory map. One of the notable findings in that study is that a hierarchy in dynamics exists even within a single neural network domain and plays a key role in odor recognition processing in the bulb. We have hypothesized that such a dynamic hierarchy may also be one of the fundamental architectures employed actively by the brain for cortical neural information processing. Our basic view is that cortical information may also be processed hierarchically
based not only on the anatomically hierarchical structure but also on the dynamically hierarchical structure even at a single stage of neural information processing. We propose in this article that such a dynamically hierarchical map, on which individual sensory information and their relations are mapped as dynamical attractors, is a possible form of cortical mapping and is one of the frameworks that underlie the fundamentals of cortical information processing. In order to investigate how the hierarchical dynamical map works in actual cognitive processes, we consider a cognitive process known as priming, in which cognitive performance is facilitated as a consequence of prior experience. There are numerous tasks that can be used to assess priming, and most of those tasks have been reviewed by Roediger and McDermott (1993). In a word-stem completion task (Tulving, Schacter, & Stark, 1982), subjects are shown a series of words followed by a delay period of minutes to hours and directed to complete the words from the word stems. For example, the word "theorem" is presented first, and after a delay period, the subject tries to complete the word stem " h o em". When the word "theorem" is presented previously, the subjects perform the task more rapidly and correctly than for words that are not presented previously. In a word identification task (Haist, Musen, & Squire, 1991), the identification of a word is enhanced when subjects are shown the word prior to identification. Priming also occurs across sensory modalities (Musen & Squire, 1992; Clarke & Morton, 1983; Roediger & Blaxton, 1987). Clarke and Morton (1983) confirmed an effect of auditory priming on visual perception of words. The identification of visual words was enhanced when subjects were presented the words auditorily prior to the identification task. 
In order to clarify the neuronal processes underlying priming based on a hierarchical dynamical map, we consider a typical priming task and study how the dynamical state of the map is changed during each process constituting the task. We propose that the neuronal processes inducing the change in the dynamical state are essential for the priming task.

2 Outline of Responses of Hierarchical Dynamical Map to Priming Task
In order to make the following sections clearly understandable, we describe here briey the priming task considered and how the hierarchical dynamical map responds to stimuli relevant to the task. 2.1 Formation of Hierarchical Dynamical Map. Prior to the priming task, the neural network memorizes a visual scene that simultaneously contains three objects: A, B, and C. Each object is composed of three different kinds of sensory features such as shape, color, and size, which are denoted as Xs, Xc, and Xz (X = A, B, C), respectively. After memorization, the state
of the network begins to itinerate randomly among the three limit-cycle attractors corresponding to the three objects. Each of the limit-cycle attractors contains three point attractors corresponding to the three features Xs, Xc, and Xz of object X. The model is constructed based on the so-called associative neural network, in which specific firing patterns of neurons are embedded as stable point attractors established by a certain synaptic construction. The firing patterns of the network used to describe the features Xs, Xc, and Xz of the object X (X = A, B, C) are shown in Figure 1a. As shown in Figure 1b, each feature is represented by a point attractor, and the relation among the three features belonging to the same object is represented by a limit cycle. Within the attractor region of each limit cycle, the state of the network shows cyclic itinerancy among the three point attractors [Xs → Xc → Xz → Xs → · · ·]. The relation among the three objects (A, B, C) is represented by the random itinerancy among the three limit-cycle attractors. The ongoing state of the network is the randomly itinerant state, which is considered a ready state for an incoming signal to the network. 2.2 Responses of the Map to the Priming Task. Figure 2 is a schematic drawing of how the priming task is processed based on the hierarchical dynamical map scheme. The randomly itinerant state among the limit-cycle attractors, into which the visual scene is encoded, ranks at the top within the hierarchy in the network dynamics; the limit-cycle attractors, into which the objects are encoded, rank in the middle; and the point attractors, into which the sensory features are encoded, rank at the bottom. Without any input stimulus, the network stays at the top. During the priming period, the network is stimulated by the feature Bc, which is partial information about object B. If the period is not too short, the network state is changed to the limit-cycle attractor relevant to object B. 
After an instantaneous presentation of the priming cue, the network state moves back to the top rank, as shown in Figure 2. This instantaneous stay of the network at the limit-cycle attractor B corresponds to an unconscious recall of object B. If the priming period is too long, the network state is further changed from the middle rank to the point attractor relevant to the feature Bc at the bottom. This process corresponds to a recognition of the stimulus Bc. During the identification period, another stimulus B̂z is applied as a test cue to the network, where the test cue B̂z is composed of Bz and fragments of Az and Cz. The network state is changed to the limit-cycle attractor relevant to the object B in spite of the interference of the fragments Az and Cz. This arises from the fact that the limit-cycle attractor relevant to the object B has preferentially been stabilized by the priming stimulus Bc. When the test cue B̂z is presented to the network during a period in which the stability of the limit-cycle attractor (B) is greater than that of the other two limit-cycle attractors (A and C), the limit-cycle attractor (B) is easily induced by
the stimulus B̂z. The network state is finally changed to the point attractor relevant to the feature Bz. This corresponds to the identification of the feature Bz in the test cue B̂z. In the model, we describe our neural model of the priming effect based on the consideration that part of a single object, such as one feature of the object,
is recognized hierarchically; that is, first the object is recalled based on the partial information about the object given by the part. Second, the part is recognized based on the relation between the part and the object. The priming task shown in Figure 2 belongs to indirect priming, or priming across different feature modalities. The feature Bc of object B included in the priming cue is different from the feature Bz in the test cue. Such indirect priming becomes possible through the hierarchical linkage among the point attractors representing the features, as described in detail in the following sections.

3 Neural Network Model
To describe a hierarchical dynamical map, we used a simple cortical neural network model consisting of projection neurons and interneurons, such as pyramidal cells and basket cells, respectively. The model is shown in Figure 3. The projection neurons are connected to each other with both excitatory and inhibitory synapses. The excitatory connections are made directly via axon collaterals of the projection neurons. Although the inhibitory connections between neurons are made through the mediation of some kind of interneuron, the connections are made directly between projection neurons for simplicity. Each projection neuron also connects with a single inhibitory interneuron by which the activity of the projection neuron is suppressed locally. The dynamic evolutions of the cell potentials of the projection neuron and interneuron are defined by

$$\tau_{p,i}\,\frac{du_{p,i}(t)}{dt} = -u_{p,i}(t) + \sum_{j=1}^{N}\,\sum_{d_{ij}=0}^{d_{\max}} w_{pp,ij}(t, d_{ij})\cdot V_{p,j}(t - d_{ij}) + w_{pr,ii}\cdot U_{r,i}(t) + I_{p,i}(t) + I_{g,i}(t), \qquad (3.1)$$

$$\tau_{r,i}\,\frac{du_{r,i}(t)}{dt} = -u_{r,i}(t) + w_{rp}\cdot V_{p,i}(t), \qquad (3.2)$$
Figure 1: Facing page. (a) Firing patterns of the network used to describe the features Xs, Xc, and Xz of the object X (X = A, B, C). Filled and open circles denote firing and resting states of neurons, respectively. (b) Schematic drawing of the hierarchical dynamical map. Each feature is represented by a point attractor, and the relation among the three features belonging to the same object is represented by a limit-cycle attractor. Within the attractor region of each limit cycle, the state of the network shows cyclic itinerancy (arrows) among the three point attractors [Xs → Xc → Xz → Xs → · · ·]. The relation among the three objects (A, B, C) is represented by the random itinerancy among the three limit-cycle attractors (dashed lines).

Figure 2: Schematic drawing of the priming identification task based on the hierarchical dynamical map scheme. The randomly itinerant state among the limit-cycle attractors ranks at the top, the limit-cycle attractors in the middle, and the point attractors at the bottom. Without any input stimulus, the network stays at the top. During the priming period, the network is stimulated by the feature Bc. The network state is changed to the limit-cycle attractor relevant to object B. After switching off the priming cue, the network state moves back to the top rank. During the identification period, another stimulus B̂z is applied as a test cue to the network, where the test cue B̂z is composed of Bz and fragments of Az and Cz. The network state is changed to the limit-cycle attractor relevant to object B and then to the point attractor relevant to feature Bz.

where u_{p,i}(t) and u_{r,i}(t) are the cell potentials of the ith projection neuron and the ith interneuron at time t, respectively. τ_{p,i} and τ_{r,i} are the decay times of these cell potentials. N is the number of projection neurons. d_{ij} is the delay time of signal propagation from neuron j to i, and d_max is the maximum delay time. These signal delays are omnipresent in the brain (Miller, 1987). w_{pp,ij}(t, d_{ij}) is the strength of the synaptic connection from projection neuron j to projection neuron i whose propagation delay time is d_{ij}. w_{pr} (w_{rp}) is the strength of synaptic connections from the interneuron (projection neuron) to the projection neuron (interneuron). V_{p,i}(t) is the axonal output of the ith
Figure 3: Cortical neural network model. The network model consists of projection (P) neurons and an inhibitory type of interneuron (r). The projection neurons are connected to each other with both excitatory and inhibitory synapses. Each projection neuron connects with a single inhibitory interneuron by which the activity of the projection neuron is suppressed locally.

projection neuron. U_{r,i}(t) is the output of the ith interneuron. I_{p,i}(t) is an external stimulus to projection neuron i. I_{g,i}(t) is a global oscillation of the cell potential coming from other regions of the brain such as the thalamus (Abeles, Prut, Bergman, Vaadia, & Aertsen, 1993), defined as I_{g,i}(t) = I_g sin 2πνt (I_g = 0.3 and ν = 10 Hz). The output activities of the projection neurons and the interneurons are given by

$$\mathrm{Prob}[V_{p,i}(t) = 1] = f_p[u_{p,i}(t)], \qquad (3.3)$$

$$U_{r,i}(t) = f_r[u_{r,i}(t)], \qquad (3.4)$$

$$f_x[y] = \frac{1}{1 + e^{-g_x (y - h_x)}}, \qquad (x = p, r), \qquad (3.5)$$
where g_x and h_x are the steepness and the threshold of the sigmoid function f_x, respectively, for the x kind of neuron. Equation 3.3 defines the probability of firing of the projection neuron i; that is, the probability of V_{p,i}(t) = 1 is given by f_p. The synaptic connections between the projection neurons change temporally; that is, they decay continuously but are modified according to the Hebb rule whenever any stimulus is presented to the network. The synaptic strength w_{pp,ij}(t, d_{ij}) is represented as the sum of the excitatory synaptic strength w^{ex}_{pp,ij}(t, d_{ij}) and the inhibitory synaptic strength w^{ih}_{pp,ij}(t, d_{ij}). The dynamic evolutions of these synapses are determined by

$$\tau^{ex}_{wpp}\,\frac{dw^{ex}_{pp,ij}(t, d_{ij})}{dt} = -w^{ex}_{pp,ij}(t, d_{ij}) + k\cdot\varepsilon_{ij}(d_{ij})\cdot V_{p,i}(t)\cdot V_{p,j}(t - d_{ij}), \qquad (3.6)$$

$$\tau^{ih}_{wpp}\,\frac{dw^{ih}_{pp,ij}(t, d_{ij})}{dt} = -w^{ih}_{pp,ij}(t, d_{ij}) + k\cdot\varepsilon_{ij}(d_{ij})\cdot (V_{p,i}(t) - 1)\cdot V_{p,j}(t - d_{ij}), \qquad (3.7)$$

where τ^{ex}_{wpp} and τ^{ih}_{wpp} are time constants. k is 1 if a stimulus is applied to the network, and 0 otherwise. ε_{ij}(d_{ij}) is the rate of synaptic modification. The excitatory synaptic strengths are strengthened when the neurons i and j fire with a positive correlation. The inhibitory synaptic strengths are strengthened when the two neurons fire with a negative correlation. Numerical calculations have been carried out for a network composed of 81 neurons (N = 81). The values of the network parameters used here are τ_{p,i} = τ_{r,i} = 5 msec, d_max = 10 msec, h_p = 0.1, h_r = 0.9, g_p = 14.0, g_r = 21.0, τ^{ex}_{wpp} = τ^{ih}_{wpp} = 100 sec, and ε_{ij}(d_{ij}) = 1.0. The synaptic connection strengths between the projection neurons and interneurons are chosen as constant values, w_{pr} = −1.0 and w_{rp} = 1.0. The real values observed in the brain are a few to hundreds of milliseconds for the response time constants of neurons (Koch, Rapp, & Segev, 1996), 0 ∼ hundreds of milliseconds for the signal delay (Miller, 1987), and seconds to minutes (even hours, months, or years) for synaptic time constants. The values used in the model are in the ranges of these observed ones, and some of the other values (h, g) were chosen so as to obtain a good model performance that shows clearly the essential dynamic behavior of the network. The parameter set for I_{p,i}(t) in equation 3.1 will be explained in the next section.
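The dynamics of equations 3.1 through 3.7 can be sketched as a simple Euler integration in Python. This is a minimal sketch under our own assumptions, not the authors' code: we assume a 1 msec integration step, random initial potentials, and all-zero initial synapses, and the helper names (`step`, `V_hist`, `f`) are ours. The parameter values are those quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameter values quoted in the text (times in msec)
N = 81
tau_p = tau_r = 5.0            # decay times of the cell potentials
d_max = 10                     # maximum signal-propagation delay
h_p, h_r = 0.1, 0.9            # sigmoid thresholds
g_p, g_r = 14.0, 21.0          # sigmoid steepness
w_pr, w_rp = -1.0, 1.0         # interneuron <-> projection-neuron weights
tau_w = 100_000.0              # synaptic time constants (100 sec in msec)
eps = 1.0                      # synaptic-modification rate
dt = 1.0                       # Euler step (our assumption)

def f(y, g, h):
    """Sigmoid output function, equation 3.5."""
    return 1.0 / (1.0 + np.exp(-g * (y - h)))

u_p = rng.uniform(0.0, 0.1, N)            # projection-cell potentials
u_r = np.zeros(N)                         # interneuron potentials
V_hist = np.zeros((d_max + 1, N))         # V_p(t - d) for d = 0..d_max
w_ex = np.zeros((d_max + 1, N, N))        # w_pp^ex(t, d)
w_ih = np.zeros((d_max + 1, N, N))        # w_pp^ih(t, d)

def step(I_p, t, stimulus_on):
    """One Euler step of equations 3.1-3.7; returns the binary spike vector."""
    global u_p, u_r
    # Delayed recurrent input, summed over all propagation delays (eq. 3.1)
    rec = sum((w_ex[d] + w_ih[d]) @ V_hist[d] for d in range(d_max + 1))
    I_g = 0.3 * np.sin(2.0 * np.pi * 0.01 * t)   # 10 Hz global oscillation
    U_r = f(u_r, g_r, h_r)                       # interneuron output (eq. 3.4)
    u_p = u_p + dt / tau_p * (-u_p + rec + w_pr * U_r + I_p + I_g)
    u_r = u_r + dt / tau_r * (-u_r + w_rp * V_hist[0])        # eq. 3.2
    V = (rng.random(N) < f(u_p, g_p, h_p)).astype(float)      # eq. 3.3
    k = 1.0 if stimulus_on else 0.0              # Hebbian gate
    for d in range(d_max + 1):
        pre = V_hist[d]
        w_ex[d] += dt / tau_w * (-w_ex[d] + k * eps * np.outer(V, pre))
        w_ih[d] += dt / tau_w * (-w_ih[d] + k * eps * np.outer(V - 1.0, pre))
    V_hist[1:] = V_hist[:-1].copy()              # shift the delay line
    V_hist[0] = V
    return V
```

With a stimulus applied (k = 1), correlated pre- and postsynaptic firing slowly builds up excitatory weights while anticorrelated firing builds up inhibitory weights, exactly the Hebb/anti-Hebb split of equations 3.6 and 3.7.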
4 Formation of a Hierarchical Dynamical Map
The network was trained to memorize a scene that contains three objects: A, B, and C. Each object is composed of three features belonging to different modalities, such as shape, color, and size. The three features belonging to the three different modalities, S, C, and Z, are denoted as Xs, Xc, and Xz, respectively (X = A, B, C). Each feature is encoded as a relevant on-off pixel pattern {ξ_i(X_n)}, as shown in Figure 1a, where n = s, c, z and i = 1 ∼ N. When the feature X_n is presented to a subject, the stimulus received by the network of projection neurons of the subject is represented by

$$I_{p,i}(t) = c\,\xi_i(X_n), \qquad (4.1)$$

where c denotes the intensity of the feature stimulus X_n. ξ_i(X_n) takes the value 1 for on pixels and 0 for off pixels. The on-off pixel patterns {ξ_i(X_n)} are set to be
orthogonal to each other. Equation 4.1 means that the stimulus is presented not to the retina but to the cortex, as neural input coming through afferent fibers. The value of c is 10 for the memorization process and 0.1 for priming tasks. Such a high intensity (c = 10) is necessary for the creation of the dynamical map and corresponds to the sensitivity increase that arises from some kinds of behaviorally attentional processes (Hoshino et al., 1997, 1998). In order to investigate the dynamic behaviors of the model, we calculated the measure "overlap" O(X_n; t) of the output pattern {V_{p,i}(t)} of the network with the on-off pixel pattern {ξ_i(X_n)}, which is given as

$$O(X_n; t) = \frac{1}{N}\sum_{i=1}^{N} \bigl(2\xi_i(X_n) - 1\bigr)\bigl(2V_{p,i}(t) - 1\bigr). \qquad (4.2)$$
When V_{p,i}(t) = ξ_i(X_n) for i = 1 ∼ N, O(X_n; t) becomes unity. This means that the firing state of the network is equivalent to the firing pattern {ξ_i(X_n)}. As shown in Figure 4a, each of the three objects, A, B, and C, is separately presented as input to the network. Stimuli arising from features Xs, Xc, and Xz belonging to object X are presented repeatedly in the sequence [{ξ_i(Xs)} → {ξ_i(Xc)} → {ξ_i(Xz)} → {ξ_i(Xs)} → · · ·] during the application period for object X. Because the input strength is strong enough (c = 10), the activity of the network is almost the same as the input pattern, as shown in Figure 4a. During this memorization period, the synaptic weights are modified using equations 3.6 and 3.7. After the memorization periods of the three objects, the activity state of the network itinerates randomly among the three limit-cycle attractors corresponding to the three objects, as shown in Figure 4b. Each of the limit-cycle attractors represents one object, X, and contains three point attractors into which the three features (Xs, Xc, Xz) of the object are encoded. The global oscillation I_{g,i}(t) in equation 3.1 is necessary for the creation of the hierarchical dynamical map. Without such a global oscillation during the memorization process, old memories tend to deteriorate as a new memorization process goes on (Hoshino et al., 1997). For example, in the case shown in Figure 4, if object C is presented without global oscillation, the point attractors and the limit-cycle attractor corresponding to object A are deleted from the dynamics of the network. We use the terms limit-cycle attractor and point attractor for descriptive convenience because, in a broad sense, the three firing patterns {ξ_i(Xs)}, {ξ_i(Xc)}, and {ξ_i(Xz)} appear cyclically in the basin of the limit-cycle attractor corresponding to object X, and the one firing pattern {ξ_i(X_n)} appears consecutively in the basin of the point attractor corresponding to feature X_n. 
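The overlap measure of equation 4.2 translates directly into a few lines of Python; a short sketch (the function name `overlap` is ours):

```python
import numpy as np

def overlap(xi, V):
    """Overlap O(Xn; t) of network output V with stored pattern xi (eq. 4.2).

    Both arguments are 0/1 arrays of length N. The overlap is 1 when the
    firing state matches the pattern exactly and -1 when it is its complement.
    """
    xi = np.asarray(xi, dtype=float)
    V = np.asarray(V, dtype=float)
    return np.mean((2.0 * xi - 1.0) * (2.0 * V - 1.0))
```

Mapping the 0/1 pixels to ±1 before averaging is what makes a perfect match score exactly 1; in Figure 4, a firing state is counted as "on pattern X_n" whenever this quantity exceeds 0.8.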
However, as shown in Figure 5, the real dynamic properties of these attractors differ from the nominal properties (see Hoshino et al., 1996, 1997, 1998). This is revealed using the autocorrelation functions of local field potentials, given by

$$AF(X_n; \tau) = \lim_{T \to \infty} \frac{1}{T} \int_0^T \bar{u}_p(X_n; t + \tau)\,\bar{u}_p(X_n; t)\,dt, \qquad (4.3)$$
where the local field potential ū_p(X_n; t) is defined by

$$\bar{u}_p(X_n; t) = \sum_{i \in X_n} u_{p,i}(t). \qquad (4.4)$$

Here, the sum is taken only over the projection neurons that are responsive specifically to feature X_n.
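For a finite simulation run, equations 4.3 and 4.4 can be estimated with a discrete-time approximation of the integral; a minimal sketch (the function names and the finite-lag truncation are our own assumptions):

```python
import numpy as np

def local_field_potential(u_p_trace, members):
    """ū_p(Xn; t): summed cell potentials of the neurons in pattern Xn (eq. 4.4).

    u_p_trace: array of shape (T, N), projection-cell potentials over time.
    members: indices of the projection neurons responsive to feature Xn.
    """
    return u_p_trace[:, members].sum(axis=1)

def autocorrelation(lfp, max_lag):
    """Finite-T, discrete-time estimate of AF(Xn; tau) in equation 4.3."""
    T = len(lfp)
    return np.array([np.mean(lfp[lag:] * lfp[:T - lag])
                     for lag in range(max_lag)])
```

A strongly periodic local field potential (the limit-cycle case of Figure 5b) yields an autocorrelation that oscillates at the cycle period, while a nearly random itinerant state yields a flat function with only a weak periodic ripple.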
In the randomly itinerant state, the neurons relevant to As, that is, those firing in the pattern {ξ_i(As)}, are activated in a temporally almost random manner, except for a weakly periodic manner, as shown in Figure 5a. This arises from the random itinerancy of the network among limit-cycle attractors. When the network state is within the basin of the limit-cycle attractor A, the neurons relevant to As behave in a more periodic manner, as shown in Figure 5b. When the network state is within the basin of the point attractor relevant to As, the neurons relevant to As fire almost continuously, but not always, because of a very weak periodic fluctuation in the activity state of the neurons, as seen in Figure 5c. Through the object memorization process, the features belonging to the object and their relation are encoded into these feature-specific point attractors and object-specific limit-cycle attractors, respectively. The whole image of the scene is represented on the hierarchical dynamical map as the randomly itinerant state among all the limit-cycle attractors. The fundamental dynamic properties and the significance of the random itinerancy among dynamic attractors in neural information processing in the brain have been discussed in detail by Hoshino et al. (1997). A dynamical behavior similar to random itinerancy, chaotic itinerancy, has been described by Tsuda (1991, 1992, 1994). Chaotic itinerancy is characterized by itinerant transitions of neural network dynamic states among low-dimensional coherent states (attractors) and enables neural networks to search for the attractors relevant to the input in the dynamical phase spaces of network states. The dynamical behavior of the model could correspond to chaotic itinerancy in the broad sense that a randomly itinerant state also shows transitions among attractors.

5 Priming Process Based on the Hierarchical Dynamical Map
In this section, we show how the hierarchical dynamical map serves as a cognitive processor in the perceptual task known as priming. We carried
Figure 4: Facing page. (a) Memorization of the three objects A, B, and C. Stimuli from features Xs, Xc, and Xz belonging to object X are presented repeatedly in the sequence [{ξ(Xs)} → {ξ(Xc)} → {ξ(Xz)} → {ξ(Xs)} → · · ·] during the application period (horizontal bars) for object X. {ξ(X_n)} denotes the on-off pixel pattern relevant to the feature n of object X, as depicted in Figure 1a. One vertical bar is drawn on row X_n when the measure "overlap" O(X_n; t) ≥ 0.8 (see text), which means that the current firing pattern of the network is almost the same as the on-off pixel pattern corresponding to X_n. (b) Random itinerancy among the three limit-cycle attractors corresponding to the three objects, A, B, and C. Each limit-cycle attractor contains three point attractors corresponding to features Xs, Xc, and Xz of object X. (ā) denotes the application period shown in (a).
Figure 5: Autocorrelation functions of local field potentials of projection neurons in three different dynamic states: (a) the randomly itinerant state, (b) the state within the basin of the limit-cycle attractor corresponding to object A, and (c) the state within the basin of the point attractor corresponding to feature As. The projection neurons used for the calculation are the neurons that fire in the pattern As, as shown in Figure 1a.
Hierarchical Dynamical Map for Priming
1795
out two types of priming: priming across different feature modalities and priming within the same feature modality. In the first type, the main origin of the facilitation of target recognition is the stabilization, induced by the priming signal, of the limit-cycle attractor that includes the point attractor relevant to the target feature. In the second type, the stabilization of the point attractor relevant to the target feature, in addition to the stabilization of the limit-cycle attractor, is an additional origin.

5.1 Priming Across Different Modalities. We primed the network with feature Bc of object B; that is, the weak input (c = 0.1) represented by the firing pattern Bc is applied to the network during the priming period. As shown in Figure 6a, brief exposure of the network to the feature causes the state of the network to change to the limit-cycle attractor corresponding to object B, whereby the network activity changes almost cyclically among the three firing patterns relevant to Bs, Bc, and Bz. After switching off the stimulus, the state returns to a randomly itinerant one. Then we present the test cue B̂z, which is composed of feature Bz and fragments of the other features (Az and Cz) belonging to the same modality z, as shown in Figure 6a. For the identification of Bz, the network has to select only feature Bz among the three stimuli included in the test cue. The identification process proceeds in a hierarchical manner in the following order: (1) the state of the network itinerates randomly among the three limit-cycle attractors corresponding to objects A, B, and C; (2) the state changes from an itinerant state to the limit-cycle attractor corresponding to object B; and (3) the state finally changes to the point attractor corresponding to feature Bz.
The hierarchical transition, from the itinerant state to the limit-cycle attractor and then to the point attractor, indicates that initially the subject recalls object B through partial information about the object and then selects the feature Bz relevant to B in the test cue. After switching off the input stimulus, the state of the network becomes randomly itinerant again. This reverse transition is a clearing process that gets the state of the network ready for new inputs. In order to clarify the effect of priming on the identification of Bz, we investigated how the network responds to the same test cue B̂z without the priming. The result is shown in Figure 7. A comparison of Figures 6b and 7b shows that the period from the initiation of test cue B̂z to the identification of target Bz is shortened by the priming. The shortening of the period required for identification occurs in the process from stage 1 to 2. This indicates that a subject who recalled object B instantaneously or unconsciously slightly before the presentation of the test cue can recall the object more quickly from even partial information included in the test cue than can a subject without the priming. However, once the subjects have recalled object B, the length of time required for the identification of the target Bz does not noticeably depend on whether the subjects were presented with the indirect priming.
The value of the intensity (c = 0.1) for the priming was chosen so that the subject unconsciously recognizes object B, in which case the state transition from stage 1 to 2 occurs during the priming period. If the intensity c gets stronger, the subject consciously recognizes the object, and the state makes the transition from stage 1 to 2 and finally to 3. On the other hand, if the intensity gets weaker, the subject does not recognize the object, and the state of the network keeps itinerating among the limit-cycle attractors during the whole priming period.

5.2 Priming Within the Same Modality. In order to investigate the effect of priming within the same feature modality, we applied the input corresponding to feature Bz to the network, as shown in Figure 8. A brief exposure of the network to feature Bz causes the state of the network to change to the limit-cycle attractor corresponding to object B. After switching off the stimulus, the state returns to a randomly itinerant one. Then we present the test cue B̂z composed of feature Bz and fragments of the other features (Az and Cz) belonging to the same modality z. The identification of feature Bz included in the test cue is made through two kinds of dynamical state transitions: (1) the transition from a randomly itinerant state to the limit-cycle attractor B and (2) the transition from the limit-cycle attractor B to the point attractor Bz. These two kinds of dynamical state transitions are the same as those in the task with priming across different modalities. The shortening of the identification process induced by priming within the same modality arises from the shortening of the time lengths required for transitions 1 and 2, as seen by comparing Figure 8b with Figure 7b.
It is also noted, by comparing Figure 8b with Figure 6b, that the time length required for transition 1 in the case of the same modality is similar to that in the case of different modalities, but the time length required for transition 2 in the former case is noticeably shorter than that in the latter case. That is, transition 2 causes the difference in the time length required for identification of the target between priming across different modalities and priming within the same modality.

5.3 Neuronal Processes Underlying the Two Types of Priming. In order to clarify the neuronal processes underlying the priming and the neuronal

Figure 6: Facing page. (a) Priming across different modalities. Priming with feature Bc causes the network state to change to the limit-cycle attractor corresponding to object B. After switching off the feature, the network returns to a randomly itinerant state. Then test cue B̂z is presented. The identification of Bz proceeds in a dynamically hierarchical manner: the transitions from the itinerant state to the limit-cycle attractor corresponding to object B and finally to the point attractor corresponding to feature Bz. (b) Expansion in timescale for (a).
Figure 7: (a) Identification without priming. (b) Expansion in timescale for (a). Note that the period of the randomly itinerant state is longer than that in Figure 6b.
Figure 8: (a) Priming within the same modality. Priming with feature Bz is followed by presenting the test cue B̂z. (b) Expansion in timescale for (a). Note that the period of the limit-cycle attractor state corresponding to object B is shorter than those in Figures 6b and 7b.
origin generating the difference between the priming across different modalities and that within the same modality, we investigate how priming affects the dynamic behavior of the network. Because the instantaneous (unconscious) recall of object B induced by the priming with feature Bc is considered the main cause of the facilitation of the subsequent identification of Bz, we calculated the cross-correlation functions between the activities of neurons in the randomly itinerant state before the priming and in the same state after the priming, and investigated how these functions are changed by priming. The cross-correlation function (i ≠ j) is given by

CF(Xmi, Xnj; τ) = (1 / (T2 − T1)) ∫_{T1}^{T2} u_{p,i}(t + τ) u_{p,j}(t) dt,   (5.1)
where neurons i and j are specifically activated under presentation of features Xm and Xn, respectively; that is, neuron i contributes to the firing pattern {ξi(Xm)} representing feature Xm. T1 and T2 are the initial and final times, respectively, of the relevant period of the randomly itinerant state. The cross-correlation (i ≠ j) within the same modality as the priming feature Bc (Xm = Xn = Bc) is shown as a function of τ in Figure 9a for the period before the priming and in Figure 9b for the period after the priming. The correlation at τ = 0 is increased noticeably by the priming. This holds true for any pair of neurons specifically responsive to feature Bc. Therefore, the point attractor representing Bc is stabilized through the priming process. The cross-correlation across feature modalities (Xm = Bc, Xn = Bz) is shown as a function of τ in Figure 10a for the period before priming and in Figure 10b for the period after. The correlation is noticeably increased by priming. Similar increases of the correlation are obtained in the cases (Xm = Bc, Xn = Bs) and (Xm = Bz, Xn = Bs). This means that the limit-cycle attractor corresponding to object B is stabilized by the priming process. This stabilization is made through the dynamical linkage of the point attractor Bc with the other point attractors Bz and Bs. The point attractors Bz and Bs are also stabilized through this linkage by priming with feature Bc. However, the stabilization of point attractors Bz and Bs is slightly weak compared with that of Bc, which corresponds to the priming feature. These stabilizations of attractors, established during the priming period, arise from the small and fast synaptic increases between relevant neurons according to equation 3.6. That is, the frequent appearance of the three point attractors corresponding to object B during the priming period eventually strengthens the relevant synaptic connections according to equation 3.6.
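In discrete time, the integral in equation 5.1 reduces to a lagged average product of the two activity traces over the itinerant period. A minimal numerical sketch (the trace names and the normalization to [−1, 1] are illustrative assumptions, matching the "normalized" correlation reported in Figures 9 through 11):

```python
import numpy as np

def cross_correlation(u_i, u_j, lags):
    """Discrete analog of CF(i, j; tau): lagged correlation of two
    neuron activity traces over the period [T1, T2], normalized to [-1, 1]."""
    u_i = np.asarray(u_i, dtype=float) - np.mean(u_i)
    u_j = np.asarray(u_j, dtype=float) - np.mean(u_j)
    norm = np.sqrt(np.dot(u_i, u_i) * np.dot(u_j, u_j))
    out = []
    for tau in lags:
        if tau >= 0:
            out.append(np.dot(u_i[tau:], u_j[:len(u_j) - tau]) / norm)
        else:
            out.append(np.dot(u_i[:tau], u_j[-tau:]) / norm)
    return np.array(out)

# Two synthetic traces sharing a common fluctuation, as two neurons of the
# same stabilized attractor would after priming (illustrative data):
rng = np.random.default_rng(1)
common = rng.normal(size=1000)
u_i = common + 0.3 * rng.normal(size=1000)
u_j = common + 0.3 * rng.normal(size=1000)
cf = cross_correlation(u_i, u_j, [0])
```

A larger shared component raises the correlation at τ = 0, which is the quantity the paper uses to diagnose attractor stabilization.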
The connection between attractor stabilization and synaptic increase has been investigated quantitatively (Hoshino et al., 1996). On the other hand, the limit-cycle attractors representing objects A and C are destabilized through the decrease in strength of the synaptic connections between neurons specifically responsive to the features Xn (X = A or C).
Figure 9: Normalized cross-correlation functions CF(Bci, Bcj; τ) for randomly itinerant states: (a) before priming and (b) after priming. The correlation at τ = 0 is increased noticeably by priming.
This occurs because the frequencies of activation of the six point attractors (An and Cn, n = s, c, z) relevant to objects A and C are decreased during the priming period, and as a result the relevant synaptic connections are weakened according to equation 3.6. Figures 11a and 11b show that the cross-correlation after priming becomes slightly weaker than before. The difference in the time required for identification of the target Bz between priming across different modalities and priming within the same modality is due to the difference in the stabilization of the point attractor Bz between the two types of priming. The point attractor corresponding to the target Bz included in the test cue is slightly more stabilized when primed
Figure 10: Normalized cross-correlation functions CF(Bci, Bzj; τ) for randomly itinerant states: (a) before priming and (b) after priming. The correlation is increased noticeably by priming.
with the same feature, Bz, as the target than when primed with feature Bc, which differs from the target. Thus, the time required for the transition from the limit-cycle attractor B to the point attractor Bz shortens when primed with Bz.

6 Relation of the Model to Various Observed Results
In order to evaluate the plausibility of the model as a neural model of the priming process, we considered, based on the model, possible neuronal origins of various observed results relevant to priming.
Figure 11: Normalized cross-correlation functions CF(Aci, Azj; τ) for randomly itinerant states: (a) before priming and (b) after priming. The correlation is decreased noticeably by priming.
6.1 Difference in Facilitation of Target Identification Between Cross-Modality and Within-Modality Primings. In the model, the time period from the presentation of a test cue to the identification of the target is definitely shorter in the case of within-modality priming than in the case of cross-modality priming, as shown in Figures 6b and 8b. This difference in facilitation between the two types of priming has been reported in various psychophysical experiments. Clarke and Morton (1983) showed it using visual recognition tasks with auditory priming, and Roediger and Blaxton (1987) and Curran, Schacter, and Galluccio (1999) showed it using visual word-stem completion tasks with auditory priming.
6.2 Effect of Interval Between Priming Stimulation and Test Cue Presentation. Tulving, Hayman, and Macdonald (1991) and Cave and Squire (1992) found, based on psychological experiments, that the effect of priming on the facilitation of target identification decreases as the interval increases. However, the decay rate is so small that the priming effect can be retained for days or even months. The model explored here generated an interval effect similar to the observed results. This is because the stabilization of the relevant attractors induced by priming decreases through the decay term in equations 3.6 and 3.7. However, for the model to have very long retention periods, we need to consider more complex decay processes, because the existing decay term mainly allows the network state to revert quickly to a randomly itinerant state.

6.3 Effect of Duration of Priming Stimulation. Many psychological experiments (Jacoby & Dallas, 1981; Neill, Beck, Bottalico, & Molloy, 1990; Hirshman & Mulligan, 1991) have shown that performance on priming tasks is not noticeably changed by priming duration when the duration is varied from 10 milliseconds to seconds. In the model, the length of the period from the presentation of the test cue to the identification of the target is not noticeably changed by priming duration. As the priming duration is increased, the attractors relevant to the priming stimulus become more stabilized. However, the effect of this increase in attractor stabilization on identification time is small. When the priming duration is increased by a second, the identification time decreases only by about 10 milliseconds. This is within the standard error (of the order of 10 milliseconds) of the mean time required for identification that has been observed in many psychological experiments (Nissen & Bullemer, 1987; Cave & Squire, 1992). The rather weak dependence of identification time on priming duration in the model comes from the fact that the test cue is presented to the network only after the network state returns to a randomly itinerant state.
This procedure is required to satisfy the condition that the priming stimulation gives an unconscious or implicit influence to the subject, however long the priming duration becomes.

6.4 Effect of Repetition of the Same Priming Task. Behavioral and neurophysiological experiments on priming (Buckner et al., 1998; Wiggs & Martin, 1998) have shown that perceptual performance is greatly enhanced and that the activity of some cortical neurons decreases significantly as the same priming task is repeated. The enhancement of perceptual performance is well explained by the model. The attractors relevant to the priming task are stabilized in every task. If the interval between successive tasks is not too long, the effect of the stabilization remains in the map. That is, the identification of the target becomes easier as the same task is repeated. Wiggs and Martin (1998) suggested that the decrease of some neuronal activity induced by repetition of the same priming task arises from the sharpening of the stimulus representation in the relevant cortical area. The remaining neurons, which continue to fire under application of the sensory stimulation, then carry the critical information about the stimuli. In the model, the synaptic connections between the neurons specifically responsive to the relevant attractors are strengthened in every task by equation 3.6. But the connections between the neurons responsive to the relevant attractors and those responsive to the attractors irrelevant to the task are weakened in every task by equation 3.7. These variations in the synaptic connections make the representation of the relevant stimuli sharper in every task; that is, they increase the network's ability to respond to the stimuli relevant to the task.

6.5 Fast Synaptic Change Underlying Implicit Memory. Tulving et al. (1982) showed, using a word-fragment completion test, that the (implicit) memory used for the priming process is different from the (explicit) memory used for the explicit recognition of the priming cue. Many researchers (Schacter, 1995; Schacter & Buckner, 1998) have used the priming task as a useful tool for the investigation of implicit memory in cognitive systems. We propose here that the neural process underlying priming is the stabilization of the relevant dynamical attractors, which is made by fast synaptic changes during the priming period. The neural processing of implicit memory may be made through these changes. Fast synaptic changes in cortices induced by brief sensory stimulation have been found in various sensory systems such as the visual system (Gilbert & Wiesel, 1992; Kapadia, Gilbert, & Westheimer, 1994), the auditory system (Ahissar et al., 1992), and the olfactory system (Freeman, 1991).

7 Discussion
We have described how the hierarchical dynamical map responds to each stimulus presented in a priming task, and we showed that the hierarchical dynamical map has the functional ability to manage the neural processes of priming. In this section, we discuss the relation of the model to other theoretical work in order to place the concept of the hierarchical dynamical map in context. We also discuss the quantitative aspects of the model in order to understand the robustness of the network performance in the priming task.

7.1 Relations of the Model to Other Relevant Models. Hierarchical pattern formation in neural networks can be made in two different ways. One is processing by networks consisting of multiple layers, in which lower layers process primitive information (e.g., orientation, edges, color) and higher layers process complex information (e.g., faces or specific objects). The hierarchy proceeds from the lower to the higher layers. Hierarchical
processing based on such layered networks has been considered for pattern recognition by Fukushima (1980, 1988) and Hummel and Biederman (1992) and for facial recognition in monkeys by Körner, Gewaltig, Körner, Richter, and Rodemann (1999). The other kind of hierarchical pattern formation, which we followed, is based on a dynamical hierarchy, in which even a single network is able to process hierarchical information. Amari (1977) and Hopfield (1982, 1984), pioneers in this field, showed that an input to the network may make it undergo a transition from an initial state to the one point attractor whose basin contains the initial state. The meaning of the transition from one point attractor to another in information processing has been described by Amit (1994; Amit & Brunel, 1995). Tsuda (1991, 1992) proposed a role for the transition from a chaotic attractor to a point attractor corresponding to an external stimulus in the recognition of the stimulus. Érdi, Gröbler, and Kaski (1993) studied the role of the chaotic dynamics of the olfactory bulb in the learning and memory of odors. Those investigations used point attractors of the networks to represent entities in the distributed coding scheme. Building on their work, we have used complex attractors that have dynamical internal structures. That is, the limit-cycle attractor contains the three point attractors corresponding to a single object. The randomly itinerant state contains the three limit-cycle attractors corresponding to a single scene. The hierarchical pattern formation presented here is made on the basis of those internal structures.

7.2 Robustness of the Model. In order to evaluate the robustness of the network performance in the identification of a target included in the test cue, we investigated the response of the network to various test cues made by gradually adding irrelevant on-pixels to the original input pattern, as shown in Figure 12. In the figure, the rate of correct identification is plotted against the level of interference.
The priming task procedure is the same as in Figure 6 except that the stimulus pattern B̂z, applied as the test cue, is varied as illustrated along the abscissa in Figure 12. At each level of interference, we applied to the network 50 slightly different patterns of B̂z in which the number of on-pixels belonging to the features Az and Cz is kept constant but their locations are randomly changed within the locations relevant to the same features. Even if the complete patterns representing the features of the other objects (Az and Cz) are added to the target stimulus Bz, the network can identify Bz with a probability of 62.5%. This value is greater than the chance level (33.3%) that would be observed if priming did not work. Finally, we consider the relation of the network size to the maximum numbers of objects and features. In the neural network model, the minimum fraction of active neurons by which the attractors can be created is 6/81. That is, at least 6 of the 81 neurons have to be fired to make a stable point attractor. Therefore, at most 13 features and 4 objects can be
Figure 12: Rate of identification versus level of interference. The priming task procedure is the same as in Figure 6 except that the stimulus pattern B̂z, applied as the test cue, is varied as illustrated along the abscissa.
embedded when one object consists of 3 features, as in the present case. If one object consists of 2 features, 6 objects can be embedded. In the calculation here, we used 9 neurons for a single point attractor. This number is required for the network dynamics to be restored robustly even after many tasks are carried out, in each of which the synaptic connections are modified.

References

Abeles, M., Prut, Y., Bergman, H., Vaadia, E., & Aertsen, A. (1993). Integration, synchronicity, and periodicity. In A. Aertsen and W. von Seelen (Eds.), Spatiotemporal aspect of brain function (pp. 106–121). Amsterdam: Elsevier.
Ahissar, E., Vaadia, E., Ahissar, M., Bergman, H., Arieli, A., & Abeles, M. (1992). Dependence of cortical plasticity on correlated activity of single neurons and on behavioral context. Science, 257, 1412–1415.
Amari, S. (1977). Neural theory of association and concept-formation. Biol. Cybern., 26, 175–185.
Amit, D. J. (1994). Persistent delay activity in cortex: A Galilean phase in neurophysiology? Network, 5, 429–436.
Amit, D. J., & Brunel, N. (1995). Learning internal representations in an attractor neural network with analog neurons. Network, 6, 359–388.
Buckner, R. L., Goodman, J., Burock, M., Rotte, M., Koutstaal, W., Schacter, D., Rosen, B., & Dale, A. M. (1998). Functional-anatomic correlates of object priming in humans revealed by rapid presentation event-related fMRI. Neuron, 20, 285–296.
Cave, C. B., & Squire, L. R. (1992). Intact and long-lasting repetition priming in amnesia. J. Exp. Psychol.: Learn., Mem., and Cogn., 18, 509–520.
Clarke, R., & Morton, J. (1983). Cross modality facilitation in tachistoscopic word recognition. Quart. J. Exp. Psychol., 35A, 79–96.
Curran, T., Schacter, D. L., & Galluccio, L. (1999). Cross-modal priming and explicit memory in patients with verbal production deficits. Brain and Cognition, 39, 133–146.
Érdi, P., Gröbler, T., & Kaski, K. (1993). Dynamics of olfactory bulb: Bifurcations, learning, and memory. Biol. Cybern., 69, 57–66.
Favorov, O. V., & Diamond, M. E. (1990). Demonstration of discrete place-defined columns—segregates—in the cat SI. J. Comparative Neurol., 298, 97–112.
Freeman, W. J. (1991). The physiology of perception. Sci. Am., 264, 78–85.
Freeman, W. J., & Baird, B. (1987). Relation of olfactory EEG to behavior: Spatial analysis. Behav. Neurosci., 101, 393–408.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36, 193–202.
Fukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1, 119–130.
Gilbert, C. D., & Wiesel, T. N. (1992). Receptive field dynamics in adult primary visual cortex. Nature, 356, 150–152.
Haist, F., Musen, G., & Squire, R. (1991). Intact priming of words and nonwords in amnesia. Psychobiology, 19, 275–285.
Hasegawa, I., Fukushima, T., Ihara, T., & Miyashita, Y. (1998). Callosal window between prefrontal cortices: Cognitive interaction to retrieve long-term memory. Science, 281, 814–818.
Hirshman, E., & Mulligan, N. (1991).
Perceptual interference improves explicit memory but does not enhance data-driven processing. J. Exp. Psychol.: Learn., Mem., and Cogn., 17, 507–513.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Hoshino, O., Kashimori, Y., & Kambara, T. (1996). Self-organized phase transitions in neural networks as a neural mechanism of information processing. Proc. Natl. Acad. Sci. USA, 93, 3303–3307.
Hoshino, O., Kashimori, Y., & Kambara, T. (1998). An olfactory recognition model based on spatio-temporal encoding of odor quality in olfactory bulb. Biol. Cybern., 79, 109–120.
Hoshino, O., Usuba, N., Kashimori, Y., & Kambara, T. (1997). Role of itinerancy among attractors as dynamical map in distributed coding scheme. Neural Networks, 10, 1375–1390.
Hubel, D. H., & Wiesel, T. N. (1977). Functional architecture of macaque monkey visual cortex. Proc. of Royal Soc. of London (Biol.), 198, 1–59.
Hummel, J. E., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psychol. Rev., 99, 480–517.
Jacoby, L. L., & Dallas, M. (1981). On the relationship between autobiographical memory and perceptual learning. J. Exp. Psychol.: Gen., 110, 306–340.
Kapadia, M. K., Gilbert, C. D., & Westheimer, G. (1994). A quantitative measure for short-term cortical plasticity in human vision. J. Neurosci., 14, 451–457.
Knudsen, E. I., DuLac, S., & Esterly, S. D. (1987). Computational maps in the brain. Annual Rev. Neurosci., 10, 41–65.
Knudsen, E. I., & Konishi, M. (1978). A neural map of auditory space in the owl. Science, 200, 795–797.
Koch, C., Rapp, M., & Segev, I. (1996). A brief history of time (constants). Cerebral Cortex, 6, 93–101.
Körner, E., Gewaltig, M. O., Körner, U., Richter, A., & Rodemann, T. (1999). A model of computation in neocortical architecture. Neural Networks, 12, 989–1005.
Lam, Y.-W., Cohen, L. B., Wachowiak, M., & Zochowski, M. R. (2000). Odors elicit three different oscillations in the turtle olfactory bulb. J. Neurosci., 20, 749–762.
Livingston, M. S., & Hubel, D. H. (1984). Anatomy and physiology of a color system in the primate visual cortex. J. Neurosci., 4, 309–356.
Miller, R. (1987). Representation of brief temporal patterns, Hebbian synapses, and the left-hemisphere dominance for phoneme recognition. Psychobiology, 15, 241–247.
Musen, M., & Squire, R. (1992). Implicit learning of one-trial color-word "associations" in amnesic patients. Society for Neurosci. Abst., 18, 386.
Neill, W. T., Beck, J. L., Bottalico, K. S., & Molloy, R. D. (1990). Effects of intentional versus incidental learning on explicit and implicit tests of memory. J. Exp. Psychol.: Learn., Mem., and Cogn., 16, 457–463.
Nissen, M. J., & Bullemer, P. (1987).
Attentional requirements of learning: Evidence from performance measures. Cogn. Psychol., 19, 1–32.
Roediger III, H. L., & Blaxton, T. A. (1987). In D. S. Gorfein & R. H. Hoffman (Eds.), The Ebbinghaus Centennial Conference (pp. 349–379). Hillsdale, NJ: Erlbaum.
Roediger III, H. L., & McDermott, K. B. (1993). Implicit memory in normal subjects. In F. Boller & J. Grafman (Eds.), Handbook of neuropsychology, 8 (pp. 63–131). Amsterdam: Elsevier.
Rolls, E. T., & Baylis, L. L. (1994). Gustatory, olfactory, and visual convergence within the primate orbitofrontal cortex. J. Neurosci., 14, 5437–5452.
Epstein, R., & Kanwisher, N. (1998). A cortical representation of the local visual environment. Nature, 392, 598–601.
Sakurai, Y. (1996). Hippocampal and neocortical cell assemblies encode memory processes for different types of stimuli in the rat. J. Neurosci., 158, 181–184.
Schacter, D. L. (1995). Implicit memory: A new frontier for cognitive neuroscience. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 815–824). Cambridge, MA: MIT Press.
Schacter, D. L., & Buckner, R. L. (1998). Priming and the brain. Neuron, 20, 185–195.
Tanaka, K. (1996). Representation of visual features of objects in the inferotemporal cortex. Neural Networks, 9, 1459–1475.
Tanaka, K. (1997). Mechanisms of visual object recognition: Monkey and human studies. Curr. Opin. Neurobiol., 7, 523–529.
Tsuda, I. (1991). Chaotic itinerancy as a dynamical basis of hermeneutics in brain and mind. World Futures, 32, 167–184.
Tsuda, I. (1992). Dynamic link of memory—chaotic memory map in nonequilibrium neural networks. Neural Networks, 5, 313–326.
Tsuda, I. (1994). Can stochastic renewal of maps be a model for cerebral cortex? Physica D, 75, 165–178.
Tulving, E., Schacter, D. L., & Stark, H. A. (1982). Priming effects in word-fragment completion are independent of recognition memory. J. Exp. Psychol.: Learn., Mem., and Cogn., 8, 336–342.
Tulving, E., Hayman, C. A. G., & Macdonald, C. A. (1991). Long-lasting perceptual priming and semantic learning in amnesia: A case experiment. J. Exp. Psychol.: Learn., Mem., and Cogn., 17, 595–717.
van der Loos, H., & Dörfl, J. (1978). Does the skin tell the somatosensory cortex how to construct a map of the periphery? Neurosci. Lett., 7, 23–30.
Wiggs, C. L., & Martin, A. (1998). Properties and mechanisms of perceptual priming. Curr. Opin. Neurobiol., 8, 227–233.

Received August 25, 1999; accepted September 26, 2000.
LETTER
Communicated by Richard Hahnloser
Dynamical Stability Conditions for Recurrent Neural Networks with Unsaturating Piecewise Linear Transfer Functions

Heiko Wersing*
Faculty of Technology, University of Bielefeld, D-33501 Bielefeld, Germany

Wolf-Jürgen Beyn
Faculty of Mathematics, University of Bielefeld, D-33501 Bielefeld, Germany

Helge Ritter
Faculty of Technology, University of Bielefeld, D-33501 Bielefeld, Germany

We establish two conditions that ensure the nondivergence of additive recurrent networks with unsaturating piecewise linear transfer functions, also called linear threshold or semilinear transfer functions. As Hahnloser, Sarpeshkar, Mahowald, Douglas, and Seung (2000) showed, networks of this type can be efficiently built in silicon and exhibit the coexistence of digital selection and analog amplification in a single circuit. To obtain this behavior, the network must be multistable and nondivergent, and our conditions allow determining the regimes where this can be achieved with maximal recurrent amplification. The first condition can be applied to nonsymmetric networks and has a simple interpretation: it requires that the strength of local inhibition match the sum over excitatory weights converging onto a neuron. The second condition is restricted to symmetric networks but can also take into account the stabilizing effect of nonlocal inhibitory interactions. We demonstrate the application of the conditions on a simple example and on the orientation-selectivity model of Ben-Yishai, Lev Bar-Or, and Sompolinsky (1995). We show that the conditions can be used to identify in their model regions of maximal orientation-selective amplification and symmetry breaking.

1 Introduction
A prominent model of the coupled dynamics of networks of neurons is the additive recurrent model,

$$\dot{x}_i = -x_i + \sigma_i\Big(\sum_j w_{ij} x_j + h_i\Big), \qquad (1.1)$$
* Current address: HONDA R&D Europe (Germany), Carl-Legien-Str. 30, 63073 Offenbach/Main, Germany
Neural Computation 13, 1811–1825 (2001) © 2001 Massachusetts Institute of Technology
1812
H. Wersing, W-J. Beyn, and H. Ritter
Figure 1: Transfer functions. The sigmoidal function (a) is continuous and saturates for $x \to \infty$. The limiter function (b) is linear between its saturation points. The linear threshold or semilinear transfer function (c) does not saturate as $x \to \infty$.
where $x_i$ denotes the activity of neuron $i$. The synaptic weights $w_{ij}$ denote the strength of the synaptic connection from neuron $j$ onto neuron $i$. The weights can be positive or negative, corresponding to an excitatory or inhibitory synapse, respectively. The function $\sigma_i(\cdot)$ is called the transfer or gain function of neuron $i$ and can be interpreted as computing the cell's average spike rate as a function of its membrane potential. The constants $h_i$ denote external inputs. The form of the transfer function $\sigma(\cdot)$ (see Figure 1) plays an important role in bounding the dynamics of equation 1.1. If the transfer function has lower and upper saturation limits according to $a \le \sigma(x) \le b$, then the dynamics is obviously bounded, since $\dot{x}_i \ge 0$ for $x_i \le a$ and $\dot{x}_i \le 0$ for $x_i \ge b$. Examples of such saturating transfer functions are the logistic or Fermi function,
$$\sigma(x) = \frac{1}{1 + e^{-2\beta x}}, \qquad (1.2)$$
and the limiter function (also referred to as the saturating linear function or the piecewise linear sigmoidal function):

$$\sigma(x) = \begin{cases} a & \text{if } x < a, \\ x & \text{if } a \le x \le b, \\ b & \text{if } x > b. \end{cases} \qquad (1.3)$$
There has been a growing literature on biologically based models (Hartline & Ratliff, 1958; von der Malsburg, 1973; Douglas, Mahowald, & Martin, 1994; Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Salinas & Abbott, 1996; Adorjan, Levitt, Lund, & Obermayer, 1999; Bauer, Scholz, Levitt, Obermayer, & Lund, 1999), where the transfer function, denoted as semilinear or linear
threshold (LT) function, is nonsaturating and of the form

$$\sigma(x) = \begin{cases} 0 & \text{if } x < \theta, \\ k(x - \theta) & \text{if } x \ge \theta, \end{cases} \qquad (1.4)$$
where $\theta$ is the threshold and $k > 0$ is the gain of the transfer function. It has been argued (Douglas, Koch, Mahowald, Martin, & Suarez, 1995) that this transfer function is more appropriate, because cortical neurons rarely operate close to saturation, despite strong recurrent excitation. This indicates that the upper saturation may not be involved in the actual computations of the recurrent network, since the activation is dynamically bounded due to the effective net interactions. Another important field of application for nonsaturating linear threshold transfer functions is winner-take-all (WTA) networks (Lippmann, 1987; Hahnloser, 1998; Ritter, 1990; Wersing, Steil, & Ritter, 2001), where the lack of an upper saturation is necessary to exclude spurious ambiguous states. Some properties of neural networks with piecewise linear transfer functions have been investigated. Feng and Hadeler (1996) and Kuhn and Loehwen (1987) proved conditions that ensure the uniqueness of a stationary state. Feng, Pan, and Roychowdhury (1996) analyzed the general fixed-point structure of networks with piecewise linear sigmoidal functions, and a convergence condition for asynchronous update procedures was given by Feng (1997). Hahnloser (1998) considered nonsaturating LT neurons for WTA networks and discussed their piecewise linear analysis. Wersing, Steil, and Ritter (2001) showed that complex perceptual grouping tasks can be accomplished in an LT network composed of competitive layers, where the overall contextual modulation can be enhanced by operating the model close to the stability limits. Recently, Hahnloser, Sarpeshkar, Mahowald, Douglas, and Seung (2000) demonstrated an efficient silicon design for LT networks and discussed the coexistence of analog amplification and digital selection in their circuit.
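For concreteness, the three transfer functions of Figure 1 can be written down directly. The following minimal sketch is our own illustration; the parameter defaults are arbitrary choices, not values from the article:

```python
import math

def fermi(x, beta=1.0):
    """Logistic (Fermi) function, equation 1.2: saturates at 0 and 1."""
    return 1.0 / (1.0 + math.exp(-2.0 * beta * x))

def limiter(x, a=0.0, b=1.0):
    """Limiter function, equation 1.3: linear between the saturation points a and b."""
    return min(max(x, a), b)

def linear_threshold(x, theta=0.0, k=1.0):
    """Linear threshold (LT) function, equation 1.4: zero below theta, unbounded above."""
    return 0.0 if x < theta else k * (x - theta)

# Only the LT function lacks an upper saturation:
print(linear_threshold(10.0), limiter(10.0), round(fermi(10.0), 6))   # 10.0 1.0 1.0
```

The absence of an upper saturation in the last function is exactly what makes the nondivergence analysis of this article necessary.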
They showed that the stability analysis for symmetric networks can be reduced to the discussion of sets that contain the active neurons, where an active superset of an unstable attractor is also unstable and a subset of a stable attractor is also stable. They also stated a condition for nondivergence of symmetric systems, which, however, depends on the eigenvalues of all possible subsystems and is therefore difficult to evaluate for complex systems. To summarize, in spite of their wide application in biological and cortically inspired circuit models, little attention has been paid so far to the general analysis of the conditions that ensure nondivergence in LT networks. This article gives an in-depth analysis of this issue.

2 Monostable and Multistable Dynamics
We consider only the case where $k_i = 1$ and $\theta_i = 0$, since any network can be reduced to this form by an affine transformation of the activities that leaves
the boundedness property of the dynamics invariant.¹ The dynamics is then given by

$$\dot{x}_i = -x_i + \sigma\Big(\sum_j w_{ij} x_j + h_i\Big), \qquad (2.1)$$
where the transfer function is of the form $\sigma(x) = \max(0, x)$. The standard approach to obtaining nondiverging dynamics as in equation 2.1 in the general case of nonsymmetric weights is to choose the combined gains of the transfer function and the weights sufficiently small (see Steil, 1999, for a review). A simple example is the condition given by Hirsch (1989), which is based on the property that all eigenvalues of the symmetric parts of the Jacobians of the vector field in equation 2.1 must be negative. This gives for equation 2.1 the conditions

$$w_{ii} + \frac{1}{2}\sum_{j \ne i}\big(|w_{ij}| + |w_{ji}|\big) < 1 \quad \text{for all } i, \qquad (2.2)$$
which poses a limit on the sum of the magnitudes of the incoming and outgoing weights of each neuron $i$, compared to its self-coupling weight $w_{ii}$. These conditions, however, imply global asymptotic stability (that is, convergence to a single and unique attractor) and can therefore be employed only if the application requires monostable dynamical behavior (see Figure 2). Nevertheless, there are many applications where multistable dynamics is essential to the model's operation. One example is a WTA network, where, depending on the external input (or the initial value), only the neuron with the strongest input (or highest initial value) should remain active. If arbitrarily small differences can be amplified by this competitive process, then the system is said to exhibit symmetry-breaking behavior. In the piecewise linear LT systems, this is possible only if there are unstable directions in the state-space, corresponding to local Jacobians with eigenvalues with positive real parts. The conditions that we present in the following allow for this multistable behavior.

3 Conditions for Nondivergence

3.1 Nonsymmetric Networks. Hahnloser (1998) has proved a condition for nondivergence of LT networks that is based on a principle of global inhibition. For a given excitatory weight matrix with components $a_{ij} \ge 0$,¹
¹ If $\sigma_i(x_i) = k_i \max(0, x_i - \theta_i)$, the transformation is given by $x'_i = \sqrt{k_i}\,(x_i - \theta_i)$, and the corresponding coefficients are $w'_{ij} = \sqrt{k_i k_j}\, w_{ij}$ and $h'_i = (h_i - \theta_i)\sqrt{k_i}$. Note also that for $w_{ij}$ being invertible, equation 2.1 is dynamically equivalent to $\dot{m}_i = -m_i + \sum_j w_{ij}\,\sigma(m_j) + h_i$, where $m_i = \sum_j w_{ij} x_j + h_i$.
[Figure 2: for each system, the left panel plots activity against neuron index and the right panel plots the activity of neuron 2 against the activity of neuron 1.]

Figure 2: Difference between monostable and multistable dynamics. (a) A monostable system. Different initial activations (dashed lines) converge to the same and unique attractor (the straight line on the left and the filled circle on the right). (b) A multistable system. Many attractors can coexist, and the dynamics also has unstable fixed points (open circle), which allow for unstable modes and symmetry breaking.
he assumes a global inhibition with a strength equal to the sum over all excitatory weights diverging from a neuron. This leads to an effective weight matrix $W$ of the form

$$w_{ij} = a_{ij} - \sum_k a_{kj} \quad \text{for all } i, j. \qquad (3.1)$$
Although the result is rather restrictive and global all-to-all interactions are prohibitive in large networks, it is an example of a condition that does not imply monostability² and can be applied to some WTA networks (Hahnloser, 1998).
² Suppose the matrix $A$ with components $a_{ij}$ is symmetric positive definite and has an eigenvector of 1's. Since the column-constant shift in equation 3.1 affects only this eigenvalue, the resulting matrix $W$ still has positive eigenvalues, causing unstable dynamical modes.
In the following, we prove a more general criterion, which shows that in fact local inhibition is sufficient to achieve boundedness of the LT dynamics. The condition is based on a reduced connection matrix $W^+$ with entries

$$w^+_{ij} = \delta_{ij}\, w_{ij} + (1 - \delta_{ij}) \max(0, w_{ij}), \qquad (3.2)$$

which is constructed from the diagonal and the positive off-diagonal entries of $W$, and where the Kronecker delta is defined by $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \ne j$.

Theorem 1. If there exists a vector $\hat{v}$ with positive components $\hat{v}_i > 0$ satisfying

$$\sum_j w^+_{ij}\, \hat{v}_j < \hat{v}_i \quad \text{for all } i, \qquad (3.3)$$

then the LT dynamics $\dot{x}_i = -x_i + \sigma\big(\sum_j w_{ij} x_j + h_i\big)$ is bounded: if $0 \le x_i(0) \le c\hat{v}_i$ for all $i$, where $c > 0$ is sufficiently large to satisfy $c\big(\hat{v}_i - \sum_j w^+_{ij}\hat{v}_j\big) > h_i$ for all $i$, then $0 \le x_i(t) \le c\hat{v}_i$ for all $i$ and $t > 0$.
Proof. First note that initially nonnegative activities remain nonnegative, since $\dot{x}_i \ge 0$ for $x_i = 0$, and therefore the activity variables cannot cross zero to obtain negative values (see footnote 3). The boundedness is proved by constructing a "box" in the positive domain that cannot be left by the dynamics. The $i$th face of the box is defined by

$$x_i = c\hat{v}_i, \qquad x_{j \ne i} \le c\hat{v}_j. \qquad (3.4)$$

Now we have either $\dot{x}_i = -x_i \le 0$ or

$$\dot{x}_i = -x_i + \sum_j w_{ij} x_j + h_i \qquad (3.5)$$
$$\le -x_i + \sum_j w^+_{ij} x_j + h_i. \qquad (3.6)$$

Inserting equation 3.4, and due to the positiveness of both the activities and $\hat{v}$, we obtain (note that for the diagonal term $w_{ii} x_i = w_{ii} c\hat{v}_i = w^+_{ii} c\hat{v}_i$, since we sit on face $i$):

$$\dot{x}_i \le -x_i + \sum_j w^+_{ij}\, c\hat{v}_j + h_i \qquad (3.7)$$
$$= c\Big(\sum_j w^+_{ij}\, \hat{v}_j - \hat{v}_i\Big) + h_i. \qquad (3.8)$$
Therefore, if we choose $c > 0$ sufficiently large such that $c > \max_i\big(h_i / (\hat{v}_i - \sum_j w^+_{ij}\hat{v}_j)\big)$, then $\dot{x}_i \le 0$. Thus, the vector field defining the dynamics is pointing inward on all sides of the box, and the neural activities are confined to this bounded region of the state-space.³

Using the theory of nonnegative matrices⁴ (Berman & Plemmons, 1979), condition 3.3 may be expressed in terms of eigenvalues of the matrix $W^+$. It is known that the maximal real part $\mathrm{Re}(\lambda_{\max})$ of the eigenvalues of $W^+$ is an eigenvalue itself with corresponding nonnegative eigenvector $\hat{v}$, $\hat{v}_i \ge 0$ for all $i$. Condition 3.3 is then equivalent to $\lambda_{\max}\{W^+\} < 1$. This shows that condition 3.3 is sharp if the eigenvector $\hat{v}$ is positive and $W = W^+$ (no off-diagonal inhibitory weights). In fact, if $W$ has an eigenvalue $\lambda > 1$ with positive eigenvector, then the dynamics obviously tends to infinity after initialization with a suitable multiple of this eigenvector. Another well-known characterization of condition 3.3 (see Berman & Plemmons, 1979) states that the linear system

$$\hat{v}_i - \sum_j w^+_{ij}\, \hat{v}_j = 1 \qquad (3.9)$$

has a unique solution that has positive components $\hat{v}_i > 0$. Thus, condition 3.3 can be tested by solving this system. Finally we notice that the special case $\hat{v}_i = 1$ for all $i$ in theorem 1 yields an especially simple criterion:

Corollary 1. If the self-coupling weights satisfy $w_{ii} < 1 - \sum_{j \ne i} w^+_{ij}$ for all $i$, then the LT dynamics is bounded.
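Corollary 1 lends itself to a mechanical check. The sketch below is our own illustration (the example matrices are arbitrary, not from the article): it builds the reduced matrix $W^+$ of equation 3.2 and evaluates the criterion:

```python
def w_plus(W):
    """Reduced connection matrix W+ of eq. 3.2: diagonal plus positive off-diagonal entries."""
    n = len(W)
    return [[W[i][j] if i == j else max(0.0, W[i][j]) for j in range(n)]
            for i in range(n)]

def corollary1_holds(W):
    """Corollary 1: w_ii < 1 - sum_{j != i} w+_ij for all i implies bounded LT dynamics."""
    Wp = w_plus(W)
    n = len(W)
    return all(W[i][i] < 1.0 - sum(Wp[i][j] for j in range(n) if j != i)
               for i in range(n))

# Self-inhibition matching the excitatory input satisfies the criterion;
# a large positive self-coupling does not (the criterion is sufficient, not necessary).
W_ok  = [[-1.0, 0.4], [0.4, -1.0]]
W_bad = [[ 0.9, 0.4], [0.4,  0.9]]
print(corollary1_holds(W_ok), corollary1_holds(W_bad))   # True False
```

Alternatively, condition 3.3 itself can be tested by solving the linear system of equation 3.9 for $\hat{v}$ and checking that all components come out positive.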
If we compare this to the global inhibition principle of Hahnloser, expressed in equation 3.1, we see that starting from the same $a_{ij}$, local inhibition mediated by a self-inhibitory term is sufficient for nondivergence with

$$w_{ij} = a_{ij} - \delta_{ij} \sum_k a_{ik}. \qquad (3.10)$$
Since self-interacting synapses (autapses) are rare in cortical circuits, the local inhibition should be mediated by local interneurons in a biologically more realistic setting. This inhibition, however, must be sufficiently fast to avoid a qualitative change in the dynamics (Li & Dayan, 1999). The conditions of corollary 1 can be interpreted as a measure of inhibitory diagonal dominance, which is weaker than the condition of equation 2.2 and thus allows multistable dynamics. This is complementary to networks with
³ In dynamical systems theory, this is called a subtangent condition (see Amann, 1990).
⁴ The results can be applied regardless of the sign of the diagonal matrix elements.
saturating nonlinearities, where excitatory diagonal dominance (Greenberg, 1988) can be used to enforce multistability by a locally divergent dynamics that drives the system into the corners of the bounded state-space.

3.2 Symmetric Networks. In the following we assume that the weights are symmetric, with $w_{ij} = w_{ji}$ for all $i, j$. In this case, an additional criterion can be given that allows improving the stability margins by also taking into account the nonlocal inhibitory contributions. The result is based on the energy function of the system, which can be constructed for symmetric weights.
Theorem 2. The LT dynamics is bounded if there exists a symmetric matrix $\hat{W}$ with $\hat{w}_{ij} \ge w_{ij}$ for all $i, j$, for which $\lambda_{\max}\{\hat{W}\} < 1$.

Proof. The LT dynamics has an energy function (Cohen & Grossberg, 1983) of the form

$$E = -\frac{1}{2}\sum_{ij}(w_{ij} - \delta_{ij})\,x_i x_j - \sum_j h_j x_j. \qquad (3.11)$$

If we take $E_i = \partial E/\partial x_i = -\big(\sum_j w_{ij} x_j - x_i + h_i\big)$, then we see that $E$ is nonincreasing under the dynamics:

$$\dot{E} = \sum_j E_j\, \dot{x}_j = \sum_j E_j\big({-x_j} + \sigma(-E_j + x_j)\big) \le 0. \qquad (3.12)$$

Since $x_i \ge 0$, we simply obtain

$$E = -\frac{1}{2}\sum_{ij}(w_{ij} - \delta_{ij})\,x_i x_j - \sum_j h_j x_j \qquad (3.13)$$
$$\ge -\frac{1}{2}\sum_{ij}(\hat{w}_{ij} - \delta_{ij})\,x_i x_j - \sum_j h_j x_j \qquad (3.14)$$
$$\ge -\frac{1}{2}\big(\lambda_{\max}\{\hat{W}\} - 1\big)\,\|x\|^2 - \sum_j h_j x_j \to \infty, \qquad (3.15)$$

for $\|x\| \to \infty$. Since $E$ is bounded from below and $E \to \infty$ for $\|x\| \to \infty$ according to equation 3.15, $E$ is a global Lyapunov function, and the dynamics must be bounded and converge to the set of equilibria (Hirsch, 1989).

The matrix $W^+$ used in theorem 1 provides an example of a possible choice of $\hat{W}$ with nonnegative off-diagonal elements. The components of $\hat{W}$, however, need not be positive. In the following we give an argument
based on linear perturbation theory, which shows that negative entries can be used to infer larger stability margins. Suppose the weight matrix $W$ is decomposed into the excitatory part $W^+$ and an inhibitory off-diagonal part $W^-$ according to $w_{ij} = w^+_{ij} + \alpha\, w^-_{ij}$, where $\alpha$ controls the strength of the off-diagonal inhibitory part. The condition of theorem 1 amounts to choosing $\alpha = 0$, taking $\hat{W} = W^+$, and neglecting the off-diagonal inhibition. The eigenvector of $W^+$ with the largest eigenvalue $\lambda_{\max}$ has nonnegative components $\hat{v}_i \ge 0$. According to linear perturbation theory, the eigenvalue of the dominant eigenvector changes to first order by $\lambda'_{\max} = \lambda_{\max} + \alpha\, \hat{v}^T W^- \hat{v}$. Since $W^-$ has only nonpositive entries, $\lambda'_{\max} < \lambda_{\max}$, and by raising $\alpha$ from zero, the stability margin is increased by a decrease of the maximal eigenvalue. Since this, however, may in turn increase other eigenvalues, for which then $\lambda_i > 1$, the increase in stability can be inferred only for small values of $\alpha$. Note that although $\hat{W}$ corresponds to a monostable network, $W$ can correspond to a multistable network, since $\lambda_{\max}\{W\} > \lambda_{\max}\{\hat{W}\}$ for small $\alpha$.

What happens if the strength of the nonlocal inhibition is increased to large values of $\alpha$? For such a strong inhibition, only pools of neurons that share only excitatory interactions can remain active. Since then only these pools contribute to the dynamics, we can reconsider theorem 1, applied to all possible active pools. Since the number of excitatory interactions that must be considered is now smaller, strong lateral inhibition has the expected effect of decreasing the local self-inhibition that is necessary to avoid runaway excitation.

4 Examples

4.1 A Simple Example. Let us consider an example network consisting of four neurons, which is simple enough to allow a complete analytic discussion of the different parameter regimes of the presented conditions for nondivergence.
The interaction of the neural activities $x_i$, $i = 1, 2, 3, 4$ (see Figure 3) is given by
$$W = \begin{pmatrix} a & b & -c & b \\ b & a & b & -c \\ -c & b & a & b \\ b & -c & b & a \end{pmatrix}. \qquad (4.1)$$
The network is a simplified version of the lateral inhibition network with closed boundary conditions considered by Ben-Yishai et al. (1995), which is discussed in the next section. The parameter $b \ge 0$ characterizes the cooperation between neighboring neurons, $c \ge 0$ implements inhibition between diagonally opposing neurons, and $a$ is the self-coupling weight. The circulant matrix $W$ has the corresponding eigenvectors $v_i$ and eigen-
Figure 3: A simple network consisting of four neurons, which is a simple example of a lateral inhibition network with local cooperation.
values $\lambda_i$:

$$\begin{aligned}
v_1 &= (1, 1, 1, 1), & \lambda_1 &= a + 2b - c,\\
v_2 &= (1, 0, -1, 0), & \lambda_2 &= a + c,\\
v_3 &= (0, 1, 0, -1), & \lambda_3 &= a + c,\\
v_4 &= (1, -1, 1, -1), & \lambda_4 &= a - 2b - c.
\end{aligned} \qquad (4.2)$$
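The eigenpairs in equation 4.2 follow from the circulant structure of $W$ and can be verified directly. The following check is our own illustration, with arbitrary sample parameters; it confirms $W v_i = \lambda_i v_i$ for all four pairs:

```python
def matvec(W, v):
    """Matrix-vector product for plain nested lists."""
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def circulant_W(a, b, c):
    """Weight matrix of equation 4.1 for the four-neuron ring."""
    return [[ a,  b, -c,  b],
            [ b,  a,  b, -c],
            [-c,  b,  a,  b],
            [ b, -c,  b,  a]]

a, b, c = 0.5, 0.3, 0.2
W = circulant_W(a, b, c)
eigenpairs = [([1, 1, 1, 1],   a + 2 * b - c),
              ([1, 0, -1, 0],  a + c),
              ([0, 1, 0, -1],  a + c),
              ([1, -1, 1, -1], a - 2 * b - c)]
for v, lam in eigenpairs:
    Wv = matvec(W, v)
    assert all(abs(Wv[i] - lam * v[i]) < 1e-12 for i in range(4))
print("eigenpairs of equation 4.2 verified")
```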
We now address the following question: Given $b \ge 0$ and $c \ge 0$, how small must the self-coupling $a$ be to keep the dynamics nondivergent? If we apply corollary 1, which neglects all off-diagonal inhibitory contributions, we obtain the nondivergence condition

$$a < 1 - 2b, \qquad (4.3)$$
which does not depend on $c$. According to theorem 2, we can infer nondivergence of a system with parameters $(a, b, c)$ from the nondivergence of a system with parameters $(a', b', c')$, where $a < a'$, $b < b'$, and $c > c'$. The maximal eigenvalue is given by

$$\lambda_{\max} = \begin{cases} a + 2b - c & \text{if } b \ge c, \\ a + c & \text{if } b < c. \end{cases} \qquad (4.4)$$

The condition $\lambda_{\max} < 1$ then results in

i. $\;a < 1 - 2b + c$ for $b > c$, (4.5)
ii. $\;a < 1 - c$ for $b \le c$. (4.6)
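Conditions 4.3 through 4.6 can be probed by simply integrating the LT dynamics. The sketch below is our own illustration (the parameters, step size, and input are arbitrary choices): it uses Euler integration and contrasts a parameter set satisfying condition i with one violating it:

```python
def simulate_lt(W, h, x0, dt=0.01, steps=5000):
    """Euler integration of the LT dynamics dx_i/dt = -x_i + max(0, sum_j w_ij x_j + h_i)."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        u = [sum(W[i][j] * x[j] for j in range(n)) + h[i] for i in range(n)]
        x = [x[i] + dt * (-x[i] + max(0.0, u[i])) for i in range(n)]
    return x

def W4(a, b, c):
    """Weight matrix of equation 4.1."""
    return [[ a,  b, -c,  b],
            [ b,  a,  b, -c],
            [-c,  b,  a,  b],
            [ b, -c,  b,  a]]

h = [0.1, 0.11, 0.1, 0.1]   # weak, slightly anisotropic input
x0 = [0.01] * 4

# a = 0.2, b = 0.4, c = 0.3: condition i (a < 1 - 2b + c = 0.5) holds,
# although corollary 1 (a < 1 - 2b = 0.2) does not: theorem 2's extra coverage.
x_stable = simulate_lt(W4(0.2, 0.4, 0.3), h, x0)

# a = 0.8: lambda_1 = a + 2b - c = 1.3 > 1, so the activity grows without bound.
x_diverg = simulate_lt(W4(0.8, 0.4, 0.3), h, x0, steps=3000)

print(max(x_stable) < 10.0, max(x_diverg) > 1e3)   # True True
```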
[Figure 4: both panels plot the self-coupling $a$ against the inhibition strength $c$; the region boundaries in (a) lie at $a = 1$, $a = 1 - b$, and $a = 1 - 2b$.]

Figure 4: Nondivergence domains and regions of mono- and multistable dynamics. (a) The nondivergent domains depending on the parameter $a$ and the inhibition strength $c > 0$. Region A is bounded according to theorem 1. Theorem 2 covers boundedness in the additional region B. Note that due to theorem 2 the nondivergence for a pair of parameters $(a, c)$ implies nondivergence for all $(a', c')$ with $a' < a$ and $c' > c$. (b) In region C, the monostable attractor has (approximately) equal activity in all neurons. In region D, the system is multistable, and four attractors of pairwise active neurons coexist.
Case i improves the stability margin toward larger values of $a$, with a maximum at $c = b$ of $a < 1 - b$. Case ii offers no advantage, since $b \le c$ implies that $1 - c \le 1 - b$. The domains of nondivergence for large $c$ and small $a$ that can be obtained from these inequalities are shown in Figure 4. The diagrams demonstrate that by taking into account off-diagonal inhibitory contributions, theorem 2 covers a larger area of the parameter space. Figure 4b shows the domains of mono- and multistable dynamics if the system receives a weak input $h_i = \mu + \epsilon_i$, where $\mu$ is constant and $\epsilon_i \ll \mu$ is a small, random modulation. Only in the multistable regime is the system capable of symmetry breaking, where the dynamics causes a nonlinear amplification of small input differences.

4.2 A Model for Orientation Selectivity. Ben-Yishai et al. (1995) have considered a one-dimensional continuum model for orientation selectivity that is of the form
$$\dot{x}(\varphi) = -x(\varphi) + \sigma\big(F(\varphi)\big), \qquad -\pi/2 \le \varphi \le \pi/2, \qquad (4.7)$$
where

$$F(\varphi) = \frac{1}{\pi}\int_{-\pi/2}^{\pi/2} f(\varphi - \varphi')\, x(\varphi')\, d\varphi' + h_{\mathrm{ext}}(\varphi) \qquad (4.8)$$
with $f(y) = -J_0 + J_2\cos(2y)$ and

$$\sigma(x) = \begin{cases} 0 & \text{if } x \le 0, \\ kx & \text{if } 0 < x \le 1/k, \\ 1 & \text{if } x > 1/k. \end{cases} \qquad (4.9)$$

The parameters $J_0 > 0$ and $J_2 > 0$ characterize an "on center–off surround" interaction between a continuum of neurons in a one-dimensional orientation space with closed boundary conditions. The external input is of the form $h_{\mathrm{ext}}(\varphi) = c\big(1 - \epsilon + \epsilon\cos(2\varphi)\big)$, where $c$ denotes the contrast and $\epsilon \ll 1$ is the amplitude of a weakly biased input stimulus. The parameter $k$ is the gain of the transfer function. Of particular interest is a region in the $J_0 \times J_2$ space where the model exhibits a so-called marginal phase. In this phase, the model spontaneously generates orientation selectivity by recurrently amplifying arbitrarily small anisotropies in the input for $\epsilon \to 0$. As the results of Ben-Yishai et al. (1995) demonstrate, this occurs in a regime where the upper saturation of the transfer function is not reached, which corresponds to the case of nondivergent linear threshold dynamics. Therefore, we use in the following the presented conditions for nondivergence with multistable dynamics to identify this regime in the parameter space. The linear integral operator in equation 4.8 has two continuous eigenfunctions $x_0, x_2$ with eigenvalues $\lambda_0, \lambda_2$, where

$$x_0(\varphi) = 1, \quad \lambda_0 = -1 - J_0 k; \qquad x_2(\varphi) = \cos(2\varphi), \quad \lambda_2 = -1 + k J_2/2. \qquad (4.10)$$
As Ben-Yishai et al. (1995) have stated, a necessary condition for the system's being in the marginal phase is that $J_2 > 2k^{-1}$, which in turn makes the $x_2$ mode unstable with $\lambda_2 > 0$. We can now use corollary 1 to infer a condition that ensures the nondivergence of the system even if unstable modes are present. The straightforward generalization to the continuum case is given by the condition

$$\frac{k}{\pi}\int_{-\pi/2}^{\pi/2} \max\big(0, f(\varphi)\big)\, d\varphi < 1. \qquad (4.11)$$

This can be interpreted geometrically as the condition that the area under the positive part of $f$ is smaller than $\pi/k$. If $J_0 > J_2$, the lateral interaction is completely negative, and the condition is trivially satisfied. Due to the absence of excitatory interactions, the activity settles down to zero for a weak unspecific input. For $J_0^c < J_0 < J_2$, the area can be estimated from above by the larger rectangle with height $J_2 - J_0$ and width equal to the distance between the zeroes of $f$ (see Figure 5). Using the approximations $J_0 = J_2 - \epsilon$ with $\epsilon$ small and $\epsilon \arccos(1 - \epsilon) \approx \sqrt{2}\,\epsilon^{3/2}$, we obtain the condition

$$J_0 > J_0^c = J_2 - \frac{\pi^{2/3}}{2^{1/3}\, k^{2/3}}\, J_2^{1/3}. \qquad (4.12)$$
[Figure 5: panel (a) plots $f$ against $\varphi \in [-\pi/2, \pi/2]$ with the estimating rectangle shaded; panel (b) shows the phase diagram in the $J_0$–$J_2$ plane with regions A, B, M, and C and the line $J_2 = 2/k$.]
Figure 5: Angular on center–off surround interaction and phase diagram of the orientation selectivity model by Ben-Yishai et al. (1995). The shaded rectangle in (a) is an estimate of the sum over excitatory interactions. With this estimate, the phase diagram shown in (b) can be obtained. In region A (shaded area), all eigenvalues are negative, and the system always converges to an isotropic state. In region B, the activity remains small due to large inhibition with $J_0 > J_2$, with the same unique isotropic attractor. Region M is the marginal phase, where a sharp and peaked response can be generated from isotropic input. In region C, the network saturates due to strong excitation. The border between M and C is given by the nondivergence condition; the bold straight line gives the theoretical prediction for the critical inhibition $J_0^c$, and the dashed line was obtained from a simulation of the system. Within region M, there is a gradual increase of peak height toward this boundary.
This condition can be used effectively to identify the region in the $J_0 \times J_2$ space where the lateral interaction is capable of recurrently amplifying small fluctuations in the activity distribution without reaching the upper saturation limit (see Figure 5). This leads to localized peaks with increasing amplification for $J_0 \to J_0^c$. For $J_0 < J_0^c$ the network runs into saturation. A comparison with a numerical simulation of the dynamics shows that equation 4.12 captures the border between maximal amplification and saturation rather precisely.

5 Conclusion
The two presented conditions ensure the nondivergence of LT networks without restricting them to be monostable. Although, at least for symmetric systems, the second condition covers a larger parameter regime, it requires knowledge of eigenvalues, which may not be readily available. In contrast, the first condition can be easily applied by choosing appropriate local (self-inhibitory) interactions that match the sum over excitatory weights. Although the conditions are only sufficient and not necessary, our discussion of the orientation-selectivity model shows that they can be used effectively to approach the regime of critical amplification in LT networks,
where small stimulus contrasts can be greatly enhanced through dynamical symmetry breaking. LT networks provide a promising step toward neuromorphic analog circuit design and have been shown to accomplish complex tasks like visual perceptual grouping. Nondivergence is a crucial property for the coexistence of analog amplification and digital selection in these circuits. Our results largely extend earlier conditions, which required either global inhibitory interactions or symmetry of the weights, and open the door for more complex models and circuit designs.

Acknowledgments
We thank J. J. Steil for helpful discussions. We also thank the anonymous referees for comments that improved the clarity of the presentation. H. W. was supported by DFG grant GK-231.

References

Adorjan, P., Levitt, J. B., Lund, J. S., & Obermayer, K. (1999). A model for the intracortical origin of orientation preference and tuning in macaque striate cortex. Visual Neuroscience, 16, 303–318.
Amann, H. (1990). Ordinary differential equations. Berlin: De Gruyter.
Bauer, U., Scholz, M., Levitt, J. B., Obermayer, K., & Lund, J. S. (1999). A biologically-based neural network model for geniculocortical information transfer in the primate visual system. Vision Research, 39, 613–629.
Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Nat. Acad. Sci. USA, 92, 3844–3848.
Berman, A., & Plemmons, R. J. (1979). Nonnegative matrices in the mathematical sciences. Orlando, FL: Academic Press.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Systems, Man, Cybernetics, 13, 815–826.
Douglas, R., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Douglas, R., Mahowald, M., & Martin, K. (1994). Hybrid analog-digital architectures for neuromorphic systems. In IEEE International Conference on Neural Networks (Vol. 3, pp. 1848–1853). Orlando, FL.
Feng, J. (1997). Lyapunov functions for neural nets with nondifferentiable input-output characteristics. Neural Computation, 9(1), 43–49.
Feng, J., & Hadeler, K. P. (1996). Qualitative behaviour of some simple networks. Journal of Physics A, 29, 5019–5033.
Feng, J., Pan, H., & Roychowdhury, V. P. (1996). On neurodynamics with limiter function and Linsker's development model. Neural Computation, 8, 1003–1019.
Greenberg, H. J. (1988). Equilibria of the brain-state-in-a-box neural model. Neural Networks, 1, 323–324.
Hahnloser, R. L. T. (1998). On the piecewise analysis of linear threshold neurons. Neural Networks, 11, 691–697.
Hahnloser, R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 947–951.
Hartline, H. K., & Ratliff, F. (1958). Spatial summation of inhibitory influence in the eye of Limulus and the mutual interaction of receptor units. Journal of General Physiology, 41, 1049–1066.
Hirsch, M. W. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2, 331–349.
Kuhn, D., & Loehwen, R. (1987). Piecewise affine bijections of $R^n$ and the equation $Sx^+ - Tx^- = y$. Linear Algebra and Its Appl., 96, 109–129.
Li, Z., & Dayan, P. (1999). Computational differences between asymmetrical and symmetrical networks. Network, 10, 59–77.
Lippmann, R. (1987). An introduction to computing with neural nets. IEEE ASSP Mag., 4, 4–22.
Ritter, H. (1990). A spatial approach to feature linking. In Proc. Int. Neur. Netw. Conf. Paris, 2, 898–901.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative responses in parietal cortex. Proc. Nat. Acad. Sci. USA, 93, 11956–11961.
Steil, J. J. (1999). Input-output stability of recurrent neural networks. PhD thesis, University of Bielefeld; published by Cuvillier, Göttingen.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
Wersing, H., Steil, J. J., & Ritter, H. (2001). A competitive layer model for feature binding and sensory segmentation. Neural Computation, 13(2), 357–387.

Received January 14, 2000; accepted October 4, 2000.
LETTER
Communicated by Scott Kirkpatrick
An Information-Based Neural Approach to Constraint Satisfaction

Henrik Jönsson, Bo Söderberg, Complex Systems Division, Department of Theoretical Physics, Lund University, S-223 62 Lund, Sweden

A novel artificial neural network approach to constraint satisfaction problems is presented. Based on information-theoretical considerations, it differs from a conventional mean-field approach in the form of the resulting free energy. The method, implemented as an annealing algorithm, is numerically explored on a testbed of K-SAT problems. The performance shows a dramatic improvement over that of a conventional mean-field approach and is comparable to that of a state-of-the-art dedicated heuristic (GSAT+walk). The real strength of the method, however, lies in its generality. With minor modifications, it is applicable to arbitrary types of discrete constraint satisfaction problems.

1 Introduction
In the context of difficult optimization problems, artificial neural networks (ANN) based on the mean-field approximation provide a powerful and versatile alternative to problem-specific heuristic methods and have been successfully applied to a number of different problem types (Hopfield & Tank, 1985; Peterson & Söderberg, 1998). In this article, an alternative ANN approach to combinatorial constraint satisfaction problems (CSP) is presented. It is derived from a very general information-theoretical idea, which leads to a modified cost function as compared to the conventional mean-field-based neural approach. A particular class of binary CSP that has attracted recent attention is K-SAT (Papadimitriou, 1994; Du, Gu, & Pardalos, 1997). Many combinatorial optimization problems can be cast in K-SAT form. We will demonstrate in detail how to apply the information-based ANN approach, to be referred to as INN, to K-SAT as a modified mean-field annealing algorithm. The method is evaluated by means of extensive numerical explorations on suitable test beds of random K-SAT instances. The resulting performance shows a substantial improvement over that of the conventional ANN approach and is comparable to that of a good dedicated heuristic: GSAT+walk (Selman, Kautz, & Cohen, 1994; Gu, Purdom, Franco, & Wah, 1997).

Neural Computation 13, 1827–1838 (2001) © 2001 Massachusetts Institute of Technology
1828
Henrik Jönsson and Bo Söderberg
The real strength of the INN approach lies in its generality. The basic idea can easily be applied to arbitrary types of constraint satisfaction problems, not necessarily binary ones.

2 K-SAT
A CSP amounts to determining whether a given set of simple constraints over a set of discrete variables can be simultaneously fulfilled. Most heuristic approaches to a CSP attempt to find a solution, an assignment of values to the variables consistent with the constraints, and are hence incomplete in the sense that they cannot prove unsatisfiability. If the heuristic succeeds in finding a solution, satisfiability is proven; a failure, however, does not imply unsatisfiability. A commonly studied class of binary CSP is K-SAT. A K-SAT instance is defined as follows: For a set of $N$ boolean variables $x_i$, determine whether an assignment can be found such that a given boolean function $U$ evaluates to true, where $U$ has the form

$$U = (a_{11}\ \mathrm{OR}\ a_{12}\ \mathrm{OR}\ \cdots\ \mathrm{OR}\ a_{1K})\ \mathrm{AND}\ (a_{21}\ \mathrm{OR}\ \cdots\ \mathrm{OR}\ a_{2K})\ \mathrm{AND}\ \cdots\ \mathrm{AND}\ (a_{M1}\ \mathrm{OR}\ \cdots\ \mathrm{OR}\ a_{MK}). \qquad (2.1)$$
U is the boolean conjunction of M clauses, indexed by m = 1, ..., M, each defined as the boolean disjunction of K simple statements (literals) a_mk, k = 1, ..., K. Each literal represents one of the elementary boolean variables x_i or its negation ¬x_i. For K = 2 we have a 2-SAT problem, for K = 3 a 3-SAT problem, and so on. If the clauses are not restricted to having equal length, the problem is referred to as a satisfiability problem (SAT). There is a fundamental difference between K-SAT problems for different values of K. While a 2-SAT instance can be solved exactly in a time polynomial in N, K-SAT with K ≥ 3 is NP-complete. Every K-SAT instance with K > 3 can be transformed in polynomial time into a 3-SAT instance (Papadimitriou, 1994). In this article we focus on 3-SAT.

3 Conventional ANN Approach

3.1 ANN Approach to CSP in General. In order to apply the conventional mean-field-based ANN approach as a heuristic to a boolean CSP, the latter is encoded in terms of a nonnegative cost function H(s) of a set of N binary (±1) spin variables, s = {s_i, i = 1, ..., N}, such that a solution corresponds to a combination of spin values that makes the cost function vanish. The cost function can be extended to continuous arguments in a unique way, by demanding that it be a multilinear polynomial in the spins (i.e., containing no squared spins). Assuming a multilinear cost function H(s), one
A Neural Approach to Constraint Satisfaction
1. Initiate the mean-field spins v_i to random values close to zero and T to a high value.
2. Repeat the following (a sweep) until the mean-field variables have saturated (i.e., become close to ±1): For each spin, calculate its local field from equation 3.2, and update the spin according to equation 3.1. Decrease T slightly (typically by a few percent).
3. Extract the resulting solution candidate, using s_i = sign(v_i).

Figure 1: Mean-field annealing ANN algorithm.
considers mean-field variables (or neurons) v_i ∈ [−1, 1], approximating the thermal spin averages ⟨s_i⟩ in a Boltzmann distribution P(s) ∝ exp(−H(s)/T). They are defined by the mean-field equations

  v_i = tanh(u_i / T),   (3.1)
  u_i = −∂H(v)/∂v_i,   (3.2)

where u_i is referred to as the local field for spin i. Here, T is an artificial temperature, and v denotes the collection of mean-field variables. Equations 3.1 and 3.2 can be seen as conditions for a local minimum of the mean-field free energy F(v),

  F(v) = H(v) − T S(v),   (3.3)

where S(v) is the spin entropy,

  S(v) = −Σ_i [ (1 + v_i)/2 log((1 + v_i)/2) + (1 − v_i)/2 log((1 − v_i)/2) ].   (3.4)
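To make the update concrete, here is a minimal sketch (our own illustration, not code from the article; all function names are ours). It exploits the fact that for a multilinear H the derivative in equation 3.2 equals a two-point difference, so no symbolic gradient is needed:

```python
import math

def local_field(H, v, i):
    # u_i = -dH/dv_i; for a multilinear H this equals
    # -(H(v with v_i=+1) - H(v with v_i=-1)) / 2.
    v_plus = v[:i] + [1.0] + v[i + 1:]
    v_minus = v[:i] + [-1.0] + v[i + 1:]
    return -(H(v_plus) - H(v_minus)) / 2.0

def sweep(H, v, T):
    # One sweep: update every neuron in place by equation 3.1.
    for i in range(len(v)):
        v[i] = math.tanh(local_field(H, v, i) / T)

# Toy multilinear cost H(s) = (1 - s_1 s_2)/2, which vanishes
# iff the two spins agree.
H = lambda v: (1.0 - v[0] * v[1]) / 2.0
v = [0.1, -0.05]
T = 1.0
for _ in range(60):      # simple geometric annealing schedule
    sweep(H, v, T)
    T *= 0.9
# v has now saturated, with both neurons carrying the same sign,
# so H(v) is close to zero.
```

With this toy cost the extracted assignment s_i = sign(v_i) satisfies the "constraint"; for K-SAT the same loop would be driven by the cost of equation 3.5.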
The conventional ANN algorithm consists of solving the mean-field equations 3.1 and 3.2 iteratively, combined with annealing in the temperature. A typical algorithm is described in Figure 1.

3.2 Application to K-SAT. When applying the ANN approach to K-SAT, the boolean variables are encoded using ±1-valued spin variables s_i, i = 1, ..., N, with s_i = +1 representing true and s_i = −1 false. In terms of the spins, a suitable multilinear cost function H(s) is given by the following expression,

  H(s) = Σ_{m=1}^{M} Π_{i∈M_m} (1 − C_mi s_i)/2,   (3.5)
where M_m denotes the set of spins involved in the mth clause. H(s) evaluates to the number of broken clauses and vanishes iff s represents a solution. The M × N matrix C defines the K-SAT instance. An element C_mi equals +1 (or −1) if the mth clause contains the ith boolean variable as is (or negated); otherwise, C_mi = 0. The cost function (see equation 3.5) defines a problem-specific set of mean-field equations, 3.1 and 3.2, in terms of mean-field variables v_i ∈ [−1, 1]. In the mean-field annealing approach (see Figure 1), the temperature T is initiated at a high value and then slowly decreased (annealing), while a solution to equations 3.1 and 3.2 is tracked iteratively. At high temperatures, there will be a stable fixed point with all neurons close to zero, while at a low temperature they will approach ±1 (the neurons have saturated), and an assignment can be extracted. For the K-SAT cost function, equation 3.5, the local field u_i in equation 3.2 is given by

  u_i = Σ_m (C_mi / 2) Π_{j∈M_m, j≠i} (1 − C_mj v_j)/2,   (3.6)
which, due to the multilinearity of H, does not depend on v_i. This lack of self-coupling is beneficial for the stability of the dynamics.

4 Information-Based ANN Approach: INN

4.1 The Basic Idea. For problems of the CSP type, we suggest an information-based neural network approach, based on the idea of balance of information, considering the variables as sources of information and the constraints as consumers thereof. This suggests constructing an objective function (or free energy) F of the general form
  F = const. × (information demand) − const. × (available information),   (4.1)
that is, to be minimized. The meaning of the two terms can be made precise in a mean-field-like setting, where a factorized artificial Boltzmann distribution is assumed, with each boolean variable having an independent probability to be assigned the value true. We will give a detailed derivation below for K-SAT. Other problem types can be treated in an analogous way. We will refer to this type of approach as INN.

4.2 INN Approach to K-SAT. Here we describe in detail how to apply the general ideas above to the specific case of K-SAT.
The average information resource residing in a spin is given by its entropy,

  S(s_i) = −P_{s_i=1} log P_{s_i=1} − P_{s_i=−1} log P_{s_i=−1},   (4.2)

where P are probabilities. If the spin is completely random, P_{s_i=1} = P_{s_i=−1} = 1/2 and S(s_i) = log(2), representing an unused resource of one bit of information. If the spin is set to a definite value (s_i = ±1), no more information is available, and S(s_i) = 0. For a clause, the interesting property is the expected amount of information needed to satisfy it. For the mth clause, this can be estimated as

  I_m = −log P_m^sat = −log(1 − P_m^unsat),   (4.3)

in terms of the probability P_m^sat for the clause to be satisfied in a given probability distribution for the spins. Of the 2^K distinct states available to the K spins appearing in the clause, only one corresponds to the clause being unsatisfied. Then, for a totally undetermined clause (all K spins having random values), we have P_m^unsat = 2^−K, yielding I_m = −log(1 − 2^−K). For a definitely satisfied clause, on the other hand, we must have P_m^unsat = 0, giving I_m = 0. Finally, a broken clause m corresponds to P_m^unsat = 1, leading to I_m → ∞. Assuming a mean-field-like probability distribution, with each spin obeying independent probabilities,
in terms of the probability Psat m for the clause to be satised in a given probability distribution for the spins. Of the 2K distinct states available to the K spins appearing in the clause, only one corresponds to the clause being unsatised. Then, for a totally undetermined clause (all ¡K spins having random values), we have P unsat D m ¢ 2 ¡K , yielding Im D ¡ log 1 ¡ 2 ¡K . For a denitely satised clause, on the other hand, we must have Punsat D 0, giving Im D 0. Finally, a broken clause m corresponds to Punsat D 1, leading to Im ! 1. m Assuming a mean-eld-like probability distribution, with each spin obeying independent probabilities, Psi D §1 D
1 § vi , 2
(4.4)
in terms of mean-eld variables vi D hsi i 2 [¡1, 1], the probabilities used above for the clauses become Punsat D m
Y 1 ( 1 ¡ Cmi vi ) . 2 i2M
(4.5)
m
The unused spin information is given by the entropy S of the spins (see equation 3.4), and the information I needed by the clauses is

  I(v) = −Σ_{m=1}^{M} log( 1 − Π_{i∈M_m} (1 − C_mi v_i)/2 ).   (4.6)
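As an illustration (a sketch of our own; the helper names and the sparse clause representation are assumptions, not the authors' code), the clause probabilities of equation 4.5 and the information demand of equation 4.6 can be computed directly on a randomly generated K-SAT instance of the kind used later in section 5.1:

```python
import math
import random

def random_ksat(N, M, K, rng):
    # A random K-SAT instance as a sparse clause list: clause m is a
    # list of (i, C_mi) pairs with C_mi = +1 (plain) or -1 (negated).
    clauses = []
    for _ in range(M):
        variables = rng.sample(range(N), K)   # K distinct variables
        clauses.append([(i, rng.choice((1, -1))) for i in variables])
    return clauses

def p_unsat(clause, v):
    # Equation 4.5: probability that a single clause is unsatisfied.
    p = 1.0
    for i, c in clause:
        p *= (1.0 - c * v[i]) / 2.0
    return p

def info_demand(clauses, v):
    # Equation 4.6: total information needed by the clauses.
    # (Diverges if some clause is definitely broken, p_unsat = 1.)
    return -sum(math.log(1.0 - p_unsat(cl, v)) for cl in clauses)

clauses = random_ksat(N=50, M=100, K=3, rng=random.Random(0))
v = [0.0] * 50                  # fully undetermined spins
I0 = info_demand(clauses, v)    # equals -M * log(1 - 2**-K) here
```

For fully undetermined spins every clause contributes −log(1 − 2^−K), matching the discussion below equation 4.3.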
We now have the necessary prerequisites to define an information-based free energy, which we choose as F(v) = I(v) − T S(v) (in analogy with ANN), which is to be minimized. Demanding that F have a local minimum with
respect to the mean-field variables yields equations similar to the mean-field equations, 3.1 and 3.2, but with H(v) replaced by I(v):

  u_i = −∂I/∂v_i.   (4.7)
Note that for discrete arguments, v_i = ±1, the information demand I will be infinite for any nonsolving assignment.

4.3 Algorithmic Details. Based on the analysis, we propose an information-based annealing algorithm similar to mean-field annealing, but with the multilinear cost function H (see equation 3.5) replaced by the clause information I (see equation 4.6). Note that the contribution I_m to I from a single clause m is a simple function of the corresponding contribution H_m to H,

  I_m = −log(1 − H_m).   (4.8)
As a result, the effective cost function I is not multilinear, and measures have to be taken to ensure stability of the dynamics. The resulting self-couplings can be avoided by using, instead of the derivative in equation 4.7, the difference

  u_i = −( I|_{v_i=1} − I|_{v_i=−1} ) / 2,   (4.9)
which coincides with the derivative for a multilinear I (Ohlsson, Peterson, & Söderberg, 1993). The resulting INN annealing algorithm is summarized in Figure 2. At high temperatures, information is expensive, and the neurons stay fuzzy, v_i ≈ 0. As T is decreased, information becomes cheaper, and the more useful neurons begin to saturate. As T → 0, all neurons are eventually forced to saturate, yielding a definite spin state, v_i ≈ s_i = ±1.

5 Numerical Explorations

5.1 Test Beds. For performance investigations, we have considered two distinct test beds. One consists of uniform random K-SAT problems with N and a = M/N fixed (Cook & Mitchell, 1997). For every problem instance, each of the M clauses is independently generated by choosing at random a set of K distinct variables (among the N available). Each selected variable is negated with probability 1/2. For this ensemble of problems, the fraction of unsatisfiable problems increases with the parameter a. In the thermodynamic limit (N → ∞) there is a sharp satisfiability transition at a K-dependent critical a-value a_c(K) (Hogg, Hubermann, & Williams, 1996; Monasson, Zecchina,
1. Choose a suitable high initial temperature T, such that the equilibrium neurons are close to zero.
2. Do a sweep: update all neurons according to equations 3.1 and 4.9.
3. Lower the temperature T by a fixed factor m.
4. If the stop criteria are not met, repeat from 2.
5. Extract a solution by means of s_i = sign(v_i).
Note: A typical m value is 0.95–0.99, and suitable stop criteria are that all neurons are either saturated (|v_i| > 0.99) or redundant (|v_i| < 0.01).

Figure 2: The INN annealing algorithm for K-SAT.
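The following sketch (ours, not the authors' implementation; the names and the clause representation are assumptions) shows one neuron update in the spirit of Figure 2, combining the difference rule of equation 4.9 with the separate bookkeeping of infinite clause contributions described in section 5.3.2:

```python
import math
import random

def clause_info(clause, v):
    # I_m = -log(1 - P_m^unsat); a broken clause gives I_m = infinity.
    p = 1.0
    for j, c in clause:
        p *= (1.0 - c * v[j]) / 2.0
    return math.inf if p >= 1.0 else -math.log(1.0 - p)

def inn_field(clauses, v, i):
    # Difference rule (4.9), restricted to the clauses containing spin i.
    # Returns the finite part of u_i together with the counts of +inf
    # and -inf clause contributions.
    finite, pos_inf, neg_inf = 0.0, 0, 0
    old = v[i]
    for clause in clauses:
        if all(j != i for j, _ in clause):
            continue
        v[i] = 1.0
        i_plus = clause_info(clause, v)
        v[i] = -1.0
        i_minus = clause_info(clause, v)
        v[i] = old
        if math.isinf(i_minus):
            pos_inf += 1          # clause broken at v_i = -1: pushes toward +1
        if math.isinf(i_plus):
            neg_inf += 1          # clause broken at v_i = +1: pushes toward -1
        if not math.isinf(i_plus) and not math.isinf(i_minus):
            finite += -(i_plus - i_minus) / 2.0
    return finite, pos_inf, neg_inf

def inn_update(clauses, v, i, T, rng):
    finite, pos_inf, neg_inf = inn_field(clauses, v, i)
    if pos_inf > neg_inf:
        v[i] = 1.0
    elif neg_inf > pos_inf:
        v[i] = -1.0
    elif pos_inf > 0:             # balanced infinities: randomize
        v[i] = rng.choice((1.0, -1.0))
    else:
        v[i] = math.tanh(finite / T)

# A single broken clause (x_0) with v_0 = -1 forces v_0 back to +1.
v = [-1.0]
inn_update([[(0, 1)]], v, 0, T=1.0, rng=random.Random(0))
```

For undetermined spins with no broken clauses, the finite branch reduces to the ordinary tanh update of equation 3.1.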
Kirkpatrick, Selman, & Troyansky, 1999). For problems where a < a_c(K), almost all generated problems are satisfiable, and for a > a_c(K) almost all are unsatisfiable. For 3-SAT, a_c ≈ 4.25 (Cook & Mitchell, 1997; Monasson et al., 1999). We used a set of N-values between 100 and 2000, and for each N, a set of a-values between 3.7 and 4.3. For each N and a, 200 problem instances are generated. In addition, test beds consisting purely of satisfiable instances are useful to gauge the efficiency of a heuristic. Such a test bed can be generated by filtering out unsatisfiable instances (using a complete [exact] algorithm) from the uniform random distribution described above. For a second test bed, we collected a set of instances of this type from SATLIB,^1 consisting of satisfiable random problems for different N between 20 and 250, with a fixed close to a_c. For natural reasons, this test bed does not include very large N.

5.2 Comparison Algorithms. To gauge the performance of the INN algorithm, in addition to the conventional ANN algorithm, we also applied a state-of-the-art dedicated heuristic to our test beds. A wealth of algorithms has been tested on SAT problems. (For a survey, see Gu et al., 1997.) A local search method proven to be competitive is the GSAT+walk algorithm, which we will use as a second reference algorithm. GSAT+walk starts with a random assignment and then uses two types of local moves to proceed. A local move consists of flipping the state of a single variable between true and false. The first type of move is greedy; the flip that increases the number of satisfied clauses the most is chosen. The second type of move is a restricted random walk move: a clause among those that are unsatisfied is chosen at random, and then a randomly chosen variable in this clause is flipped.
^1 http://www.informatik.tu-darmstadt.de/AI/SATLIB
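A compact sketch of the two move types just described (our own paraphrase of the published algorithm; the function and parameter names are ours) might look as follows:

```python
import random

def gsat_walk(clauses, N, max_flips, p_greedy, rng):
    # GSAT+walk sketch. clauses: list of clauses, each a list of
    # (variable, sign) literals; a literal holds when sign * s[var] == 1.
    s = [rng.choice((1, -1)) for _ in range(N)]

    def satisfied(clause):
        return any(c * s[i] == 1 for i, c in clause)

    def n_satisfied_if_flipped(i):
        s[i] = -s[i]
        n = sum(satisfied(cl) for cl in clauses)
        s[i] = -s[i]
        return n

    for _ in range(max_flips):
        unsat = [cl for cl in clauses if not satisfied(cl)]
        if not unsat:
            return s                      # all clauses satisfied
        if rng.random() < p_greedy:
            # Greedy move: the flip that satisfies the most clauses.
            i = max(range(N), key=n_satisfied_if_flipped)
        else:
            # Walk move: a random variable from a random unsatisfied clause.
            i, _ = rng.choice(rng.choice(unsat))
        s[i] = -s[i]
    return None                           # no solution within the budget

# Tiny 2-SAT example with the unique solution (true, true).
clauses = [[(0, 1), (1, 1)], [(0, -1), (1, 1)], [(0, 1), (1, -1)]]
solution = gsat_walk(clauses, N=2, max_flips=1000, p_greedy=0.5,
                     rng=random.Random(1))
```

The parameter p_greedy corresponds to the greedy-versus-walk probability that the authors set to 0.5 in section 5.3.3.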
5.3 Implementation Details. In order to have a fair comparison of performance, we chose the parameter values such that the three algorithms used approximately equal CPU time for each problem size.
5.3.1 ANN. For ANN, a preliminary initial temperature of 3.0 was used, which was dynamically adjusted upward until the neurons were close to zero (Σ_i v_i² < 0.1 N), in order to ensure a start close to the high-T fixed point. The annealing rate was set to 0.99. At each temperature, up to 10 sweeps were allowed in order for the neurons to converge, as signaled by the maximal change in value for a single neuron being less than 0.01. At every tenth temperature value, the cost function was evaluated using the signs of the mean-field variables, s_i = sign(v_i); if it vanished, a solution was found and the algorithm exited. If no solution was found when the temperature reached a certain lower bound (set to 0.1), the algorithm also exited; at that temperature, most neurons typically will have stabilized close to ±1 (or occasionally 0). Neurons that wind up at zero are those that are not needed at all or are needed equally as +1 and −1.

5.3.2 INN. For the INN approach, the same temperature parameters as in ANN were used, except for the low-T bound, which was set to 0.5. Because of the divergent nature of the cost function I (see equation 4.6) and the local field u_i (see equation 4.9), extra precaution had to be taken when updating the neurons; infinities appear when all the neurons in a clause are ±1 with the wrong sign: v_i = −C_mi. When calculating u_i, the infinite clause contributions were counted separately. If the positive (negative) infinities were more numerous, v_i was set to +1 (−1); if infinities existed but in equal numbers, v_i was randomly set to ±1; otherwise, the finite part of u_i was used. This introduced randomness in the low-temperature region if a solution had not been found; the algorithm then acquired a local search behavior, increasing its ability to find a solution.
In this mode, the neurons do not change smoothly, and the maximum number of updates per temperature sweep (set to 10) is frequently used, which explains why INN needs more time than the conventional ANN on difficult problem instances. Performance can be improved, at the cost of increased CPU time, by a slower annealing rate or a lower low-T bound, or both. Restarts of the algorithm also improve performance.

5.3.3 GSAT+walk. The source code for GSAT+walk can be found at SATLIB.^2 We have attempted to follow the recommendations in the enclosed documentation for parameter settings. The probability at each flip of choosing a greedy move instead of a restricted random walk move was set

^2 http://www.informatik.tu-darmstadt.de/AI/SATLIB
Figure 3: Fraction of unsatisfied problems (f_U) versus a for ANN (A), INN (B), and GSAT+walk (C), for N = 100, 500, 1000, 1500, and 2000. The fractions are calculated from 200 instances; the error in each point is less than 0.035.
to 0.5. We chose to use a single run with 200 × N flips per problem, instead of several runs with fewer flips per try, since this appeared to improve overall performance. Making several runs or using more flips per run will improve performance at the cost of increased CPU consumption.

5.4 Results. Following are the results from our numerical investigations for the two test beds. All explorations were made on a 600 MHz AMD Athlon computer running Linux. The results from INN, ANN, and GSAT+walk for the uniform random test bed are summarized in Figures 3, 4, and 5. In Figure 3 the fraction of the problems not satisfied by the separate algorithms (f_U) is shown as a function of a for different problem sizes N. The three algorithms show different transitions in a above which they fail to find solutions. For INN and GSAT+walk the transition appears slightly beneath the real a_c, and for ANN, the transition is situated below a = 3.7. The average number of unsatisfied clauses per problem instance (H) is presented in Figure 4 for the three algorithms. H is shown as a function of a for different N. This can be used as a performance measure also when an algorithm fails to find solutions.^3 The average CPU time consumption (t) is shown in Figure 5 for all algorithms. The CPU time is presented as a function of N for different a in order to show how the algorithms scale with problem size. The results (f_U, H, t) for the solvable test bed for all three algorithms are summarized in Table 1.
^3 Finding a maximal number of satisfied clauses for an SAT instance is referred to as MAXSAT (Papadimitriou, 1994).
Figure 4: Number of unsatisfied clauses H per instance versus a, for ANN (A), INN (B), and GSAT+walk (C), for N = 100, 500, 1000, 1500, and 2000. Average over 200 instances.
Figure 5: Log-log plot of used CPU time t (in seconds) versus N, for ANN (A), INN (B), and GSAT+walk (C), for a = 3.7, 3.9, 4.1, and 4.3. N ranges from 100 to 2000. Averaged over 200 instances.
5.5 Discussion. The first point to be made is the dramatic performance improvement of INN as compared to ANN. This is partly due to the divergent nature of the INN cost function I, leading to a progressively increased focus on the neurons involved in the relatively few critical clauses on the verge of becoming unsatisfied. This improves the revision capability, which is beneficial for the performance. The choice of randomizing v_i to ±1 (which appears very natural) in cases of balancing infinities in u_i contributes to this effect. A performance comparison of INN and GSAT+walk indicates that the latter appears to have the upper hand for small N. For larger N, however, INN seems to be quite comparable to GSAT+walk.
Table 1: Results for the Solvable 3-SAT Problems Close to a_c.

                Number of          ANN                  INN               GSAT+walk
  N     M      Instances     fU     H     t       fU     H     t       fU     H     t
  20    91       1000       0.231  0.248  0.01   0.076  0.078  0.01   0.000  0.000  0.01
  50    218      1000       0.607  0.759  0.04   0.194  0.208  0.05   0.008  0.008  0.02
  75    325       100       0.84   1.3    0.07   0.41   0.44   0.11   0.05   0.05   0.05
  100   430      1000       0.844  1.485  0.09   0.315  0.362  0.13   0.072  0.074  0.09
  125   538       100       0.88   1.72   0.11   0.39   0.41   0.18   0.10   0.10   0.13
  150   645       100       0.89   2.07   0.14   0.34   0.4    0.23   0.16   0.17   0.19
  175   753       100       0.98   2.6    0.17   0.51   0.61   0.39   0.27   0.28   0.33
  200   860       100       1      3.06   0.20   0.6    0.81   0.52   0.32   0.34   0.39
  225   960       100       0.97   3.15   0.22   0.52   0.67   0.51   0.35   0.37   0.46
  250   1075      100       0.99   3.53   0.25   0.58   0.77   0.65   0.39   0.44   0.53

Notes: fU is the fraction of problems not satisfied by the algorithm, H is the average number of unsatisfied clauses (see equation 3.5), and t is the average CPU time used (in seconds). Number of instances is the number of instances in the problem set.
6 Conclusion
We have presented a heuristic algorithm, INN, for binary satisfiability problems. It is a modification of the conventional mean-field-based ANN annealing algorithm and differs from it mainly by the replacement of the usual multilinear cost function by one derived from an information-theoretic argument. This modification is shown empirically to enhance dramatically the performance on a test bed of random K-SAT problem instances. For large problem sizes, the resulting performance is comparable to that of a good dedicated heuristic tailored to K-SAT. An important advantage of the INN approach is its generality. The basic philosophy, the balance of information, can be applied to a host of different types of binary as well as nonbinary problems. Work in this direction is in progress.

Acknowledgments
Thanks are due to C. Peterson for inspiring discussions. This work was supported in part by the Swedish Foundation for Strategic Research.

References

Cook, S. A., & Mitchell, D. G. (1997). Finding hard instances of the satisfiability problem: A survey. In D. Du, J. Gu, & P. M. Pardalos (Eds.), Satisfiability
problem: Theory and applications (pp. 1–18). Providence, RI: American Mathematical Society.
Du, D., Gu, J., & Pardalos, P. M. (Eds.). (1997). Satisfiability problem: Theory and applications. Providence, RI: American Mathematical Society.
Gu, J., Purdom, P. W., Franco, J., & Wah, B. W. (1997). Algorithms for the satisfiability (SAT) problem: A survey. In D. Du, J. Gu, & P. M. Pardalos (Eds.), Satisfiability problem: Theory and applications (pp. 19–152). Providence, RI: American Mathematical Society.
Hogg, T., Hubermann, B. A., & Williams, C. P. (1996). Special volume on frontiers in problem solving: Phase transitions and complexity. Artificial Intelligence, 81(1–2).
Hopfield, J. J., & Tank, D. W. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.
Monasson, R., Zecchina, R., Kirkpatrick, S., Selman, B., & Troyansky, L. (1999). Determining computational complexity from characteristic "phase transitions". Nature, 400(6740), 133–137.
Ohlsson, M., Peterson, C., & Söderberg, B. (1993). Neural networks for optimization problems with inequality constraints: The knapsack problem. Neural Computation, 5(2), 331–339.
Papadimitriou, C. H. (1994). Computational complexity. Reading, MA: Addison-Wesley.
Peterson, C., & Söderberg, B. (1998). Neural optimization. In M. A. Arbib (Ed.), The handbook of brain research and neural networks (2nd ed., pp. 617–622). Cambridge, MA: Bradford Books/MIT Press.
Selman, B., Kautz, H. A., & Cohen, B. (1994). Noise strategies for improving local search. In Proceedings of AAAI-94 (pp. 46–51). Menlo Park, CA: AAAI Press.

Received July 6, 2000; accepted September 28, 2000.
LETTER
Communicated by Helge Ritter
Recurrence Methods in the Analysis of Learning Processes

S. Mendelson
Department of Mathematics, Technion, and Institute of Computer Science, Hebrew University, Jerusalem 91120, Israel

I. Nelken
Department of Physiology, Hebrew University–Hadassah Medical School, and the Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem 91120, Israel

The goal of most learning processes is to bring a machine into a set of "correct" states. In practice, however, it may be difficult to show that the process enters this target set. We present a condition that ensures that the process visits the target set infinitely often almost surely. This condition is easy to verify and is true for many well-known learning rules. To demonstrate the utility of this method, we apply it to four types of learning processes: the perceptron, learning rules governed by continuous energy functions, the Kohonen rule, and the committee machine.

1 Introduction
A central issue in the study of learning processes is that of their long-term behavior: if the process converges, what its limit points are, and if not, which sets are visited frequently by the orbits of the process (the sequences of states generated by the application of the learning rule). This article introduces a method that can be applied to a wide range of well-known learning rules and may considerably reduce the difficulty of answering this question. Given a learning rule, one may define a target set into which the orbits of the learning process should enter often. For example, consider the set of correct states in a supervised learning process. In the supervised learning setup, the student is adapted until it agrees with the teacher on every possible example, which means that it enters the set of correct states. In this case, the target set is absorbing: once an orbit enters the set of correct states, it will never leave it. However, in the case of many unsupervised learning processes, there is no guarantee that the target set is absorbing. A property of the target set that is weaker than being absorbing is recurrence. The target set is recurrent if the orbits enter it infinitely often with probability 1. Recurrence of the target set is also weaker than ergodicity. When a process is ergodic, every set of states with positive measure

© 2001 Massachusetts Institute of Technology. Neural Computation 13, 1839–1861 (2001)
is recurrent. In many learning processes, we would eventually like to limit the set of states visited by the orbits of the process, and thus we would like only specific subsets of the state-space (those containing the target set) to be recurrent. Although proving that the target set is recurrent does not ensure convergence of the process to this set, such a proof is useful. A recurrence proof can serve as part of a convergence proof: if the target set is not recurrent, convergence cannot be achieved. If the target set is recurrent, the other piece of information required in order to prove convergence is that once the process enters this target set, it will not leave it again easily (in the sense that escape probabilities from the target set are small enough). Therefore, the main result of this article is a method that allows us to determine, under very general conditions, whether a target set is recurrent. The method consists of two parts. First, we need to prove that the learning process satisfies a (rather weak) continuity assumption. We show that if this assumption holds and if from any initial condition there is a positive probability of entering the target set, the orbits will enter the target set infinitely often with probability 1. However, there may be initial conditions with a zero probability of entering the target set. The identification of this set of "undesirable" initial conditions, which we denote by A, is the second part of our method. We demonstrate that with probability 1, the orbits either converge to the set A or enter the target set infinitely often. In particular, if A is empty, the target set is recurrent. This approach does not deal directly with the issue of how long it takes to reach a state in the target set starting from a generic initial state. We do not obtain quantitative results, nor do we attempt to estimate the time it takes to reach the target set.
Rather, we focus on a "soft" approach, which makes it easy to check whether the target set is recurrent. We present four examples of learning rules to which our method may be applied. The first is the perceptron learning rule (Haykin, 1994). We prove that even under conditions that are less restrictive than those required by the perceptron convergence theorem, the orbits of the process come arbitrarily close to the teacher with probability 1, although convergence does not necessarily occur. The second example is the Kohonen learning rule (Kohonen, 1989; Ritter, Martinetz, & Schulten, 1992). Again, we show that the continuity assumption holds. We define a target set for a small two-dimensional (2D) Kohonen process and demonstrate that with probability 1, the orbits of the process must enter the target set infinitely often. Then, we investigate the 2D Kohonen process with respect to a lattice order. We introduce an algorithm that should allow us to construct a sequence of inputs that takes a generic initial state to the target set, that is, a set that has a lattice order. Thus, with probability 1, the orbits of the process become ordered. The third example is a general convergence proof, using the theory presented here, for processes for which it is possible to build a smooth energy function. Finally, we examine a version of a "bounded change" learning algorithm suggested
for the committee machine (Kim & Sompolinsky, 1996). We demonstrate that although the continuity assumption is fulfilled, if the bound on the step size is too tight, there is a nontrivial set of initial conditions from which it is impossible to enter the set of correct states.

2 General Theory
In many learning processes, there is some set of states to which one hopes that the orbits of the learning process converge. The minimal requirement is that this set of states is recurrent, in the sense that the orbits of the process enter it infinitely often almost surely. Our goal is to formulate an easy-to-verify condition that ensures that the target set is indeed recurrent. We start with some notation. For a set A, denote by χ_A the characteristic function of A (a function that is 1 on A and vanishes outside A). Let d(x, A) denote the distance between x and the set A. Finally, B_x(r) is the open ball with radius r centered at x. We view the learning process T as a dynamical process on some compact state-space X. During each learning step, the system is exposed to an input v selected from some set V, and it adapts by the learning rule x → T(x, v). The inputs are selected by an independent and identically distributed (i.i.d.) law on V, which is given by a Borel measure ν. Thus, our process is a time-homogeneous Markov process (X_n, V_n) defined by the transition density

  P( (X_1, V_1) = (a, b) | (X_0, V_0) ) = P( T(X_0, V_0) = a ) P( V_1 = b ).

Hence, (V_n) are selected independently while (X_n) are adapted according to T. Equip X with a probability measure μ, let P_0 be the measure induced on the orbits (X_n), and set P_x to be the induced measure given that X_0 = x. For every k-tuple (v_1, ..., v_k) ∈ V^k, set T^k(x; v_1, ..., v_k) to be the state of the initial condition x after it was exposed to the examples v_1, ..., v_k. The basis of our discussion is the following theorem:

Theorem 1. Let (X_n) be a homogeneous Markov process, and set A, O ⊂ X. Assume that inf_{x∈A} P_x(X_n ∈ O for some n) > 0. Then
  {X_n ∈ A infinitely often} ⊂ {X_n ∈ O infinitely often}  P_0-almost surely.

Roughly speaking, the assertion of the theorem is that if the transition probability from every x ∈ A to some set O is strictly positive (that is, larger than some positive constant), an orbit that visits A infinitely often will also visit O infinitely often. The proof of theorem 1 is deferred to the appendix. This theorem is useful for proving that a set is recurrent, but given some O ⊂ X, it is usually difficult to prove the existence of a positive lower bound on P_x(X_n ∈ O for some n). We bypass this difficulty by showing that the function h_O(x) = P_x(X_n ∈ O for some n) is lower semicontinuous (defined below) under mild assumptions on the process T and on the set O.
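For intuition (a toy of our own, not from the article), the quantity h_O(x) = P_x(X_n ∈ O for some n) can be estimated by Monte Carlo for a concrete process T; the finite horizon and the uniform input law are illustrative assumptions:

```python
import random

def hit_probability(T, x0, in_O, rng, horizon=200, trials=2000):
    # Monte Carlo estimate of h_O(x0) = P_x0(X_n in O for some n),
    # truncated at a finite horizon.
    hits = 0
    for _ in range(trials):
        x = x0
        for _ in range(horizon):
            x = T(x, rng.random())      # inputs v drawn i.i.d. ~ U[0, 1]
            if in_O(x):
                hits += 1
                break
    return hits / trials

# Toy contracting process on [0, 1]: T(x, v) = (x + v)/2, with target
# set O = (0.4, 0.6). Each step lands in O with probability >= 0.2
# regardless of x, so h_O(x) = 1 for every starting point.
rng = random.Random(0)
h = hit_probability(lambda x, v: (x + v) / 2.0, 0.0,
                    lambda x: 0.4 < x < 0.6, rng)
```

Here the estimate comes out at (essentially) 1; in general such a simulation can only suggest, never prove, a positive lower bound of the kind theorem 1 requires.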
Definition 1. A real-valued function f is lower semicontinuous if for every x_n → x, lim inf_{n→∞} f(x_n) ≥ f(x).
An important property of lower semicontinuous functions, which is stated in the following lemma, is that like continuous functions, they attain their minimum on compact sets.

Lemma 1. Let f be a real-valued lower semicontinuous function. Then on every compact set, f attains its minimum.
The proof of the lemma is easy, and we shall omit it. From lemma 1, it follows that in order to prove that a set O is recurrent, it is sufficient to demonstrate that h_O is lower semicontinuous and that for every x ∈ X, h_O(x) > 0. Indeed, since X is compact and h_O is lower semicontinuous, it attains its minimum on X, implying that this minimum is positive. Hence

  inf_{x∈X} h_O(x) = inf_{x∈X} P_x(X_n ∈ O for some n) > 0.
By theorem 1, {X_n ∈ X infinitely often} ⊂ {X_n ∈ O infinitely often} almost surely, and our claim follows. Unfortunately, one sometimes encounters a process in which there is a set A on which h_O(x) = 0, and X\A is not compact. Thus, the idea we used above does not apply directly to this case. However, only a small modification is required to prove the following:

Theorem 2. Assume that h_O(x) is lower semicontinuous and that A ⊂ X is a set such that for every x ∈ X\A, h_O(x) > 0. Then

  P_0( {d(X_n, A) → 0} ∪ {X_n ∈ O infinitely often} ) = 1.
Sketch of proof. Fix some integer m and set R_m = {x | d(x, A) ≥ 1/m}. If an orbit (X_n) does not converge to A, then it belongs to some R_m infinitely often. Since R_m is compact and h_O is positive on this set, there is some δ > 0 such that h_O(x) > δ on R_m, which implies that the orbit enters O infinitely often.
Next, we shall formulate a sufficient condition on the dynamical process T that ensures that for every open set O, h_O is lower semicontinuous.

Assumption 1. For every x ∈ X and every sequence (y_n) that converges to x, T(y_n, ·) → T(x, ·) almost surely; that is, there is some set V′ ⊂ V (which may depend on both x and (y_n)) with ν(V′) = 1 such that for every v ∈ V′, T(y_n, v) → T(x, v).
Note that this assumption is considerably weaker than a condition of ν-almost-everywhere continuity in x, since the set V′ may depend on the specific sequence (y_n). This weak condition suffices to ensure the stability of the stochastic process in the sense that for every integer k there is a neighborhood of x and a "large" set of inputs such that for every y in the neighborhood and every such k-tuple of inputs v_1, . . . , v_k, T^k(y; v_1, . . . , v_k) is close to T^k(x; v_1, . . . , v_k). This stability condition is at the heart of the proof of the next theorem:

Theorem 3. Assume that assumption 1 holds and that O is an open subset of X. Then h_O(x) = P_x(X_n ∈ O for some n) is lower semicontinuous.
The proof of this theorem is deferred to the appendix. However, the proof presented there uses a less direct approach; therefore, we present a sketch of a direct proof of theorem 3 in which we assume that T is continuous.

Sketch of proof. Let B_x^n = {v_1, . . . , v_n | T^n(x; v_1, . . . , v_n) ∈ O} and set C_x^n to be a compact subset of B_x^n whose measure is "almost" the measure of B_x^n. Since T^n is continuous, T^n(x; C_x^n) is a compact subset of O, implying a positive distance between this set and X∖O. Therefore, there is a neighborhood U of x such that for every y ∈ U and every (v_1, . . . , v_n) ∈ C_x^n, T^n(y; v_1, . . . , v_n) is "close" to T^n(x; C_x^n); hence T^n(y; v_1, . . . , v_n) belongs to O. By this argument, it follows that for every ε > 0, there is a neighborhood U_x such that for every y ∈ U_x,
ν(∪_{i=1}^n B_y^i) ≥ ν(∪_{i=1}^n B_x^i) − ε.
Hence, for every y ∈ U_x,

P_y{X_i ∈ O for some i ≤ n} ≥ P_x{X_i ∈ O for some i ≤ n} − ε. (2.1)
Define a stopping time τ to be the time at which the process first enters O, and set f_n(x) = P_x{τ ≤ n}. Thus, by equation 2.1, f_n is an increasing sequence of lower semicontinuous functions that converges pointwise to h_O. Therefore, h_O is lower semicontinuous.

Corollary 1. If assumption 1 holds, O is open, and A is a set such that for every x ∉ A, h_O(x) > 0, then
P_0({d(X_n, A) → 0} ∪ {X_n ∈ O infinitely often}) = 1. (2.2)
3 Examples
In this section, we give some examples of learning processes to which our method can be applied. We chose well-known learning processes in order to compare the theory presented here with previous treatments of the same processes. Each example has two parts: first, it is required to show that assumption 1 holds. Second, given a target open set O, it is necessary to study the function h_O(x), giving the probability of entering O starting at any state x. The second step is the more complicated one, usually requiring ad hoc arguments adapted for each case individually.

3.1 The Perceptron. The perceptron is a classical supervised learning model. Here, both the teacher T and the student X are positive sides of given maximal subspaces. Let

sign(x) = 1 if x ≥ 0, and 0 if x < 0. (3.1)
Then for any perceptron X, χ_X(v) = sign(⟨x, v⟩). Thus, χ_X(v) is the decision function of the perceptron X. Let T be the teacher perceptron and put X to be the perceptron we are training. Note that X and T agree on an example v if and only if χ_T(v) = χ_X(v). Hence, ℝ^d is divided into two sets: on the first, χ_T and χ_X agree; on the other, χ_T and χ_X disagree (see Figure 1). We can identify each maximal subspace X with its normal unit vector x. With this identification, X = {y | ⟨x, y⟩ ≥ 0} and the boundary is ∂X = {y | ⟨x, y⟩ = 0}. ∂X is called the decision boundary of X. Furthermore, as far as the perceptron is concerned, an example y is classified in the same way as any example cy for c > 0. Thus, we can assume that the input space is the set of unit vectors, which is the unit sphere in ℝ^d, S^{d−1} = {x | ‖x‖ = 1}. Next, we have to select the probability distribution ν with which the examples for the training process are chosen. Assume that the measure is smooth in the sense that it does not assign positive probability to sets whose area is 0. In particular, it assigns zero probability to any specific input. Formally, we require that ν is absolutely continuous with respect to the Haar measure, which is the natural extension to the sphere of the notion of arc length on the circle. Given a student X and a teacher T, define the function f_{T,X}: S^{d−1} → ℝ by f_{T,X} = χ_T − χ_X. Thus, f_{T,X}(v) = 0 if and only if T and X agree on the input v. A perceptron X is correct if it classifies essentially all inputs correctly, that is, if the set on which X and T disagree has probability 0. Hence, we say that the perceptron X is correct if ν({v ∈ V | f_{T,X}(v) ≠ 0}) = 0. The standard perceptron training algorithm can now be formulated in the following way.
Define a function T_p and a dynamical process x_{n+1} = T_p(x_n, v_n) that takes the current perceptron X_n, represented by the unit vector x_n, and an input example v_n drawn independently according to the probability distribution
Figure 1: Geometry of perceptron learning. In two dimensions, hyperplanes are lines dividing the plane into a positive and a negative half, denoted here by the + and − signs. The teaching hyperplane T is in this case a vertical line, and its positive half-plane is to its right. The student hyperplane X gives identical answers to the teacher in the regions marked Correct, but in the other two regions the student gives answers opposite to those of the teacher. As long as there is a positive probability for training examples to be chosen from those regions, the student perceptron will continue to adapt. x and t are the unit vectors orthogonal to X and T, respectively.
ν, which produces as output a new perceptron x_{n+1}. The function T_p is computed in two steps. First, the direction of the new perceptron is chosen by the standard perceptron update rule, and then it is normalized to unit length:

x̃_{n+1} = x_n + ε f_{T,X_n}(v_n) v_n,   x_{n+1} = T_p(x_n, v_n) = x̃_{n+1} / ‖x̃_{n+1}‖. (3.2)
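The two-step update of equation 3.2 is easy to state in code. The following sketch is an illustration under assumed parameter values (the dimension, step size, teacher, and all names are our own choices; examples are drawn uniformly from the sphere):

```python
import numpy as np

def chi(x, v):
    """Decision function chi_X(v) = sign(<x, v>), with values in {0, 1}."""
    return 1 if np.dot(x, v) >= 0 else 0

def perceptron_step(x, t, v, eps=0.1):
    """One step of T_p (equation 3.2): move by eps * f_{T,X}(v) * v on
    disagreement with the teacher t, then renormalize to the unit sphere."""
    f = chi(t, v) - chi(x, v)                 # f_{T,X}(v) in {-1, 0, 1}
    x_tilde = x + eps * f * v
    return x_tilde / np.linalg.norm(x_tilde)

rng = np.random.default_rng(0)
d = 3
t = np.eye(d)[0]                              # teacher's unit normal vector
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                        # random initial student
for _ in range(5000):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                    # example ~ uniform on S^{d-1}
    x = perceptron_step(x, t, v)
```

With a constant step size ε the student does not converge; it typically ends up oscillating in a neighborhood of t whose size is on the order of ε, visiting every ball B_t(r) with r larger than that again and again.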
The standard perceptron convergence theorem states that for a finite training set and for any sequence of examples taken from this set such that all the members of the training set appear an infinite number of times, the algorithm converges to a perceptron that agrees with the teacher on the training set. A crucial part of the proof is that the smallest angle between a training example and the teacher is always greater than zero. This theorem can be generalized to an infinite set of training examples and a more general distribution on that set, but then the assumption that the angle between the training examples and the teacher is bounded from below by a positive number must be made explicitly. Here, we extend the perceptron convergence theorem even further. Assume that the measure ν by which
the examples are selected is absolutely continuous with respect to the Haar measure. We show that with probability 1, running the perceptron training algorithm brings us arbitrarily close to the teacher infinitely often. The price paid for the weaker assumption is that although the orbits approach the teacher infinitely often, they may not converge, and the student may oscillate around the position of the teacher. To prove recurrence, first we have to show that the function T_p satisfies assumption 1. Indeed, for every x ∈ S^{d−1}, the set on which T_p(x, ·) is not continuous is (∂X ∪ ∂T) ∩ S^{d−1} and consists of the points of discontinuity of f_{X,T}. Fix x ∈ S^{d−1} and a sequence y_n → x, and set V′ = V ∖ (∂X ∪ ∂T ∪ ∪_{n=1}^∞ ∂Y_n). Then ν(V′) = 1, and for every v ∈ V′, T_p(y_n, v) → T_p(x, v). Therefore, the conditions of assumption 1 hold. Next, we have to define the candidate recurrent set O and study the function h_O(x). We select B_t(r), the open ball with radius r around the teacher, as the set O of corollary 1, and show that with probability 1, the orbits enter B_t(r) infinitely often.

Corollary 2. Denote by (x_n) the orbits of the perceptron learning rule T_p. Then for every r > 0, P_0{x_n ∈ B_t(r) i.o.} = 1.
Proof. The idea behind the proof is to use the classical perceptron convergence theorem to show that h_O(x) > 0 for O = B_t(r). Let V′ = S^{d−1} ∖ {x | d(t, x) ≤ r}. Note that the set of correct states for the process 3.2 with V′ as its input set and T as the teaching halfspace is B_t(r), and that d(∂T, V′) > 0. Therefore, by the standard perceptron convergence theorem, the orbits of the process 3.2 using V′ and T as the input set and the teacher, respectively, enter B_t(r) in a finite number of steps almost surely. This implies that if V = S^{d−1}, then for every x ∈ S^{d−1}, h_{B_t(r)}(x) > 0. By the discussion above, h_{B_t(r)} is lower semicontinuous; therefore, by theorem 2 applied with A = ∅, it follows that P_0{x_n ∈ B_t(r) i.o.} = 1, as claimed.
Remark 1. We wish to emphasize that the fact that the orbits are arbitrarily close to the teacher infinitely often does not imply that they converge to the teacher. Let T = {x ∈ ℝ^d | x_d ≤ 0} and assume that ν is the Haar measure restricted to S^{d−1} ∩ T. Then, by corollary 2, (x_n) enters B_t(r) infinitely often almost surely. However, it can be shown (Mendelson, 1998) that P_0{x_n → t} = 0. In other words, although the X_n are arbitrarily close to T almost surely, process 3.2 does not converge to the teacher with P_0-probability 1.
3.2 The Multidimensional Kohonen Process. The Kohonen adaptive process is defined using two sets: the (finite) set we wish to adapt and the set of possible inputs. We assume that both sets are contained in some finite
dimensional normed space and that the set of inputs is compact (that is, closed and bounded). Again, denote the input set by V and the adapting set at the n-th stage by x_n. Note that x_n is a set consisting of the current configuration of the adapted set. Given v ∈ V, let P_{x_n}(v) be the nearest point to v in x_n = {x_n^1, . . . , x_n^m}, and let i(y) be the index of y in the set x. Our learning process T_k generates the next state x_{n+1} by moving some of the points in x_n in the direction of the current input. The points to be moved are determined by a function f that depends on the index of the nearest point to v, i(P_{x_n}(v)), and on the index of the point being considered, j:

x_{n+1}^j = [T_k(x_n, v_n)]^j = x_n^j + ε f(i(P_{x_n}(v_n)), j) (v_n − x_n^j), (3.3)
where x_n^j ∈ x_n, ε > 0, and f: ℕ × ℕ → [0, 1]. Thus, the function f implicitly defines a neighborhood relationship on the indices. Pairs for which f(i, j) is large are "close," whereas pairs of points for which f(i, j) is small or zero are "far." At each step of the process, points that are close to P_{x_n}(v) in this sense take a large step in the direction of the current input v and, in particular, approach each other. Points that are farther away take a relatively smaller step or do not move at all. The goal of the process is to cause points that are close in terms of indices (as determined by the function f) to be close in the sense of the metric distance. Each step of the Kohonen process improves this correspondence to some extent, but only for the nearest point i(P_{x_n}(v)) and its neighbors, because they move closer to each other. In so doing, however, the same step may worsen the situation for other points. The main experimental observation about the Kohonen process is that in spite of the local character of each learning step, it is possible to generate global order. However, a proof of this fact is available only in the one-dimensional case (under some restrictions on the neighborhood function). Usually it is assumed that the step size ε decreases to 0 in some way, but not too quickly. This ensures that at the later stages of the process, the modification of the state at each moment becomes very small. However, if ε depends on the stage of the learning process (that is, ε is time dependent), the theory presented in this article is not applicable. Thus, we analyze a "homogeneous" Kohonen process in which the step size ε is constant. One way to justify the analysis of this case is that in practical implementations of the Kohonen process, it is often necessary to keep ε relatively large for some time in order to reach approximate order, and only then is ε decreased in order to "freeze" this order.
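A single step of the homogeneous process of equation 3.3 can be sketched as follows. This is a minimal illustration: the neighborhood function f, the step size, and the dimensions are assumptions of ours, not the values used in the paper.

```python
import numpy as np

def kohonen_step(x, v, f, eps):
    """One step of the homogeneous Kohonen process (equation 3.3).
    x: (m, d) array of current points; v: (d,) input vector;
    f: (m, m) neighborhood function with values in [0, 1]; eps: step size."""
    i = int(np.argmin(np.linalg.norm(x - v, axis=1)))  # index of nearest point
    return x + eps * f[i][:, None] * (v - x)           # point j moves by eps*f(i,j)

# toy chain of m = 5 points in the unit square, with a hypothetical
# neighborhood function f(i, j) = max(0, 1 - 0.5*|i - j|)
m, d = 5, 2
idx = np.arange(m)
f = np.maximum(0.0, 1.0 - 0.5 * np.abs(np.subtract.outer(idx, idx)))
rng = np.random.default_rng(1)
x = rng.random((m, d))
for _ in range(2000):
    x = kohonen_step(x, rng.random(d), f, eps=0.05)
```

Because eps * f(i, j) ≤ 1 here, each coordinate of x_{n+1}^j is a convex combination of x_n^j and v_n, so the configuration stays inside the convex hull of the input set.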
The analysis presented here is then relevant to the first stage of such implementations, and as we demonstrate below, the homogeneous case already ensures the emergence of order in the multidimensional Kohonen process, although only transiently (in the sense that the set of ordered states is not absorbing). The homogeneous Kohonen process has also been analyzed by others (e.g., Fort & Pages, 1995).
To apply the theory presented here, it is first necessary to check that the process has the required continuity property. Clearly, for every x = {x^1, . . . , x^m}, the set on which T_k(x, ·) is not continuous is a subset of

∪_{i≠j} {v | d(x^i, v) = d(x^j, v)}.
Thus, by the same method as in section 3.1, one can show that the conditions of assumption 1 hold, given that the measure ν assigns 0 probability to hyperplanes. In the multidimensional case, there is no known natural definition of order for which the set of ordered states is absorbing (see also Fort & Pages, 1996), nor do we supply one here. Instead, we use the theory developed here to find recurrent sets of states that capture some intuitive features of order. Thus, although we cannot demonstrate that the Kohonen process converges to an ordered state, we do demonstrate that order emerges as a result of the application of the Kohonen process even in the multidimensional case, in the sense that the process visits such ordered states infinitely often. To illustrate the essential features of this application in a toy example, consider the two-dimensional setup, where the states of the Kohonen process are composed of five points denoted by {1, 2, 3, 4, 5}, with the natural neighborhood relationships as presented in Figure 2A. The relative weights for moving the points are chosen so that each time an input v is presented, the point x closest to v is influenced most by the input v, while the influence of v on the other points decreases as a function of their neighborhood relationships with x. For example, in the notation above, we take f(3, 3) = 1, f(3, j) = 0.5 for other values of j, f(1, 1) = 1, f(1, 3) = 0.5, f(1, 2) = f(1, 5) = 0.25, and f(1, 4) = 0. To emphasize the fact that it is the function f that implicitly defines the order, we can define an ordered state as a state in which the distances between the points are compatible with the neighborhood structure imposed by the function f. Thus, a state is ordered if the distance between a point x and its neighbors (as imposed by f) is smaller than the distance between x and points that are not its neighbors.
In addition, its distance to a closer neighbor (with a larger value of f) should be smaller than the distance to a farther neighbor (with a smaller value of f). The state in Figure 2B is ordered according to this definition. Denote the set of all ordered states according to this definition by O. Some of the problems in the definition of order in the multidimensional Kohonen process are apparent even in this toy problem. First, the definition of order is not unique. For example, the state in Figure 2C is not in O, although it still captures some aspects of the ordering of the five points. Furthermore, the set O is not absorbing: even if the process enters O, it is easy to see that there is always a positive probability of leaving it in a finite number of steps. We tested a hierarchy of such definitions of order, all of
Figure 2: (A) Canonical ordered set. (B) Example of a state in which the distances between the points are in accordance with the neighborhood function f. This state belongs to O. (C) An example of a state that preserves some of the features of the order in A but does not belong to O. For example, point 4 is closer to point 2 than to point 1, although f(4, 2) = 0, whereas f(4, 1) = 0.25. (D) Number of unordered points during a typical run of the learning process.
which suffered from the same two problems. We chose to work with the set O as defined above, since the constraints it imposes are less stringent than the requirement of keeping an exact square arrangement of the points. On the other hand, it still imposes some metric conditions on the target set. Such metric conditions are absent from a more "topological" definition of order, which would have included the state in Figure 2C.
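The ordered-state definition above can be checked mechanically. The sketch below is our own formalization of it (one possible reading, not code from the paper): a point is locally ordered if, whenever f ranks one point as a closer neighbor than another, the metric distances agree with that ranking.

```python
import numpy as np

def unordered_count(x, f):
    """Count points whose distances to the other points are not compatible
    with the neighborhood structure imposed by f.  The state x (an (m, d)
    array of point positions) is ordered iff the count is 0."""
    m = len(x)
    bad = 0
    for j in range(m):
        ok = True
        for k in range(m):
            for l in range(m):
                if j in (k, l):
                    continue
                d_jk = np.linalg.norm(x[j] - x[k])
                d_jl = np.linalg.norm(x[j] - x[l])
                if f[j][k] > f[j][l] and d_jk >= d_jl:
                    ok = False   # f says k is the closer neighbor; metric disagrees
        bad += 0 if ok else 1
    return bad

# cross-shaped canonical state: center plus four arms, with a hypothetical
# symmetric f (center-arm 0.5, adjacent arms 0.25, opposite arms 0)
f = np.ones((5, 5))
for a in (1, 2, 3, 4):
    f[0, a] = f[a, 0] = 0.5
for a, b in ((1, 2), (2, 3), (3, 4), (4, 1)):
    f[a, b] = f[b, a] = 0.25
f[1, 3] = f[3, 1] = f[2, 4] = f[4, 2] = 0.0
canonical = np.array([[0., 0.], [1., 0.], [0., 1.], [-1., 0.], [0., -1.]])
```

For this configuration unordered_count(canonical, f) is 0, that is, the state is ordered; exchanging the positions of two non-neighboring points produces a positive count.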
To demonstrate that the set O is visited infinitely often, we present an algorithm that takes any nondegenerate initial state containing five points to a state in O by generating a finite, deterministic sequence of input examples, none of which is on a discontinuity line of the process. This algorithm is described in the appendix. Once such a deterministic sequence is generated, we can (using continuity) replace each point in this sequence by a small neighborhood, and any sequence of points selected from the neighborhoods still takes the initial condition into O. If the measure on the set of inputs is equivalent to the Lebesgue measure, the probability of this set of sequences is greater than 0. This argument proves that h_O(x) > 0 for every nondegenerate state. Applying corollary 1, we conclude that with probability 1, an orbit either visits O infinitely often or converges to a degenerate state. O is not absorbing, so the process also leaves O with probability 1. In fact, there are open sets in the complement of O that have the same property; the orbits enter them with positive probability from any initial condition. Thus, the process does not converge to the set O. To illustrate this, define a function e(x_n) on the set of states, which measures how many points are locally unordered, in the sense that their distances from the other points are not compatible with the neighborhood structure imposed by f. Thus, x ∈ O if and only if e(x) = 0. Figure 2D shows e(x_n) for a typical run of the Kohonen process. Initially, e fluctuates between 2 and 4, but it decreases to smaller values rather rapidly. Eventually it reaches 0, meaning that the orbit entered O, but from time to time it increases again, showing that the process left O. This is the typical behavior described by corollary 1: although the orbits may leave the set of ordered states, they will eventually enter this set again. Next, we turn to the example of the two-dimensional lattice.
Here, each state consists of n × n points in [0, n]². Each point is indexed by a pair (i, j), with 0 ≤ i, j ≤ n. The neighbors of a point (i, j) are all the points (i′, j′) such that |i − i′| + |j − j′| ≤ 1. Define f((i, j), (i′, j′)) = 0.5 if the two points are neighbors, and 0 otherwise. A state (x_{i,j}) is said to be ordered if the points (1, j), . . . , (n, j) are linearly ordered along the x-axis for each j (either increasing for all j or decreasing for all j) while the points (j, 1), . . . , (j, n) are linearly ordered along the y-axis for every j (again, either increasing for all j or decreasing for all j), or if the reverse situation occurs, that is, the points (1, j), . . . , (n, j) are linearly ordered along the y-axis whereas the points (j, 1), . . . , (j, n) are linearly ordered along the x-axis. We would like to show that this set is recurrent, which would imply that, theoretically, the Kohonen process can be used to order the two-dimensional lattice. In practice, the waiting time may be very long. In fact, in simulations, the process often seems to be trapped in a set of states in which there are several domains, each of which is ordered, but the domains are incompatible. Figure 3A shows an initial condition in which each of the upper and lower halves of a 20 × 20 lattice is ordered, but the two halves are reversed with respect to each other. After 800,000 iterations, the initial mismatch between the upper and the lower part remains (see Figure 3B).
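The lattice order just defined can also be tested in code. The sketch below is our own formalization of the definition (the function names and the strict-monotonicity test are assumptions):

```python
import numpy as np

def monotone(a):
    """True if the 1-D array a is strictly increasing or strictly decreasing."""
    d = np.diff(a)
    return bool(np.all(d > 0) or np.all(d < 0))

def lattice_ordered(x):
    """x: (n, n, 2) array of points indexed by (i, j).  Checks whether each
    row is linearly ordered along one axis and each column along the other,
    with a common direction, in either of the two axis assignments."""
    n = x.shape[0]
    for ax_row, ax_col in ((0, 1), (1, 0)):
        rows = [x[:, j, ax_row] for j in range(n)]       # i varies
        cols = [x[i, :, ax_col] for i in range(n)]       # j varies
        rows_ok = all(monotone(r) for r in rows)
        cols_ok = all(monotone(c) for c in cols)
        same_row_dir = len({np.sign(r[-1] - r[0]) for r in rows}) == 1
        same_col_dir = len({np.sign(c[-1] - c[0]) for c in cols}) == 1
        if rows_ok and cols_ok and same_row_dir and same_col_dir:
            return True
    return False
```

The regular grid x[i, j] = (i, j) passes the test; exchanging two of its points breaks the monotonicity of some row or column and the test fails.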
Figure 3: Ordering the lattice with the Kohonen process. (A) Initial state of the simulations. To improve visibility, only one out of every four points is shown. (B) State of the process after 800,000 iterations. There are roughly three domains. The top half of the square keeps roughly its initial order; the bottom half is divided into two domains, rotated by 90 degrees relative to each other. In this simulation run, the three domains were already apparent after 200,000 iterations (ε = 1). (C) State of the process at the end of the second stage of our algorithm, when an ordered state is formed at the bottom left part of the set of examples. Note the change in scale of the x- and y-axes. (D) Number of disordered rows and columns during the operation of the algorithm. Since the simulations were run on a 20 × 20 lattice, the maximal number of disordered rows and columns is 40. The arrows show transitions between states of the algorithm (further details are in the text).
In spite of this negative experimental result, we conjecture that with probability 1, the mismatch will eventually be resolved. As usual, the difficulty lies in the attempt to show that the set of ordered states can be reached with positive probability from any nondegenerate initial state. We present an algorithm that should ensure this. Begin the process by finding the quarter of the square in which the point (1, 1) lies. Then pull all other points to a small square near the opposite corner. This can be done (with positive probability), because it requires choosing examples in the small square while making sure that the points (1, 1), (1, 2), (2, 1), and (2, 2) are never the closest points to the chosen example. Within a finite number of steps, all points (except (1, 1)) are pulled to that corner. Next, choose points that are in the quarter square in which (1, 1) lies. At first, point (1, 1) is the closest to any example we choose; hence, it pulls its nearest neighbors. As its neighbors are pulled toward (1, 1), they enter the area from which examples are chosen. Therefore, they become the nearest point to some of the examples, and they pull their own neighbors in. Since the points are pulled into the quarter square in the right order, we conjecture that they will always end up in the correct lattice order. Simulations show that the suggested algorithm is useful for ordering nontrivial states. Unfortunately, we were not able to prove this conjecture formally. If the conjecture is true, then h_O(x) > 0 for any nondegenerate state, and it follows from corollary 1 that with probability 1, topologically ordered states will be visited infinitely often. In Figure 3C we show the state of the orbit the first time we detected entry into the set of ordered states. All the points are concentrated in one corner, but they are ordered according to our definition. Figure 3D shows a plot of the number of disordered rows and columns during the operation of the algorithm.
Since the initial state of the algorithm is almost ordered, at first the amount of disorder actually increases. All points except (1, 1) arrived at their corner after 29,510 steps, marked by the left arrow in the plot. Next, the points are pulled to the opposite corner. At first, this increases the disorder, but eventually an ordered state is reached, as indicated by the right arrow in Figure 3D. The simulation was continued afterward, allowing examples to be chosen from the full 20 × 20 square without limitations. Since the points are already in a nearly correct order, ordered states are often hit. Note that the number of steps required for the sequence in the simulation to take the initial condition to O is very large (approximately 10^5 steps). This indicates why the Kohonen process was not able to order the initial state in Figure 3 in 800,000 steps. If the simulation is indeed a "typical" example, then the probability of entering O in approximately 10^5 steps is (1/9)^(10^5) (since the area of the small squares we used in the application of the algorithm had one-ninth the area of the full square). Thus, the expected value of the time needed to enter O is approximately 10^5 × 9^(10^5), which is huge compared
with 800,000. Although this estimate is an upper bound on the time to enter O, it may be safe to say that it will be greater than 800,000. Thus, although the Kohonen rule may be used to order a lattice, it may not be very useful for this purpose.

3.3 Processes with Energy Functionals.
3.3.1 General Considerations. In the context of dynamical systems, a Lyapunov functional is an "energy function" that decreases along the orbits of the process. For the nondeterministic systems studied here, the requirement that for every given input and every state the energy of T(x, v) be smaller than the energy of x may be too restrictive. A less strict requirement is that instead of decreasing along the orbit, the energy may increase from time to time, but averaged over all examples, the energy decreases. Even this definition is too strict: roughly speaking, near minima of the energy function, the stochastic noise is expected to increase, rather than decrease, the energy on average. Therefore, we first treat a general case in which the functional is allowed both to increase and to decrease, and later we show that under additional assumptions on the structure of the process, neighborhoods of the critical points are recurrent. Let F be a continuous functional for T on the space X. Set G(x) = E_ν F(T(x, v)) − F(x). Since X is compact and F is continuous, F and G must be bounded. For every δ > 0, let O_δ = {x | G(x) > −δ}; O_δ is the set of states on which the average "energy" either increases or decreases by less than δ.

Theorem 4. Let T be a process that has a continuous functional and satisfies assumption 1. Then, for every δ > 0,
P_0(X_n ∈ O_δ infinitely often) = 1.

Proof. Intuitively, since the energy function is bounded, it cannot decrease by more than δ an infinite number of times. Formally, we shall show that the conditions of corollary 1 hold. Using the notation of corollary 1, let A be the empty set. To use the corollary, we have to show that O_δ is open and that for every x ∈ X, h_{O_δ}(x) > 0. First, note that G is a continuous function. Indeed, let (x_n) ⊂ X be a sequence that converges to x. By assumption 1, there is a set V′ ⊂ V such that ν(V′) = 1 and for every v ∈ V′, T(x_n, v) converges to T(x, v). Since F is continuous, F(T(x_n, v)) − F(x_n) tends almost surely to F(T(x, v)) − F(x). Because F is bounded, by the bounded convergence theorem, G(x_n) → G(x). Thus, O_δ is open as the inverse image of an open set by a continuous function. Next, fix some x ∈ X and recall that
h_{O_δ}(x) = P_x(X_n ∈ O_δ for some n).
If h_{O_δ}(x) = 0, then for every integer n the conditional expectation

E(F(X_{n+1}) − F(X_n) | X_n) ≤ −δ.

Therefore, for every integer n,

E(F(X_{n+1}) − F(X_1)) = Σ_{i=1}^n E(F(X_{i+1}) − F(X_i)) = Σ_{i=1}^n E(E(F(X_{i+1}) − F(X_i) | X_i)) ≤ −nδ,
which is impossible because F is bounded on X.

In the special case when there is a continuous and bounded energy function F that decreases along orbits, the sets O_δ are neighborhoods of the minima of F, and theorem 4 demonstrates that the process visits such neighborhoods infinitely often. However, there are situations in which a natural definition for the energy function exists, but the energy does not necessarily decrease along orbits. Even in these cases, it will now be shown that with some additional assumptions, neighborhoods of the critical points of the energy function are recurrent.

3.3.2 Stochastic Gradient. To apply these results in a more concrete setup, let H: X × V → ℝ be a twice differentiable function. Set

T(x, v) = x − ε D_x H(x, v), (3.4)
where D_x H(x, v) is the derivative of H with respect to the first variable and ε is some positive parameter. If F(x) = E_ν H(x, v), then one can show that for every x ∈ X, the derivative of F at x is DF_x = E_ν D_x H(x, v). Put G(x) = E_ν F(T(x, v)) − F(x). Note that since T(x, v) is continuous with respect to both variables, G is continuous. Next, we show that the set O_δ, which is the candidate recurrent set by theorem 4, consists of states for which the expected value of the gradient of H is small.

Lemma 2. There is a constant C such that for every ε, δ > 0,

O_δ ⊂ {x | ‖DF_x‖² ≤ δ/ε + Cε}.

Proof. For every v ∈ V, apply Taylor's expansion to F. Since F has a bounded second derivative,
F(T(x, v)) − F(x) = ⟨DF_x, T(x, v) − x⟩ + O(‖T(x, v) − x‖²).
By the definition of T, it follows that there is a constant C such that ‖T(x, v) − x‖ = ε ‖D_x H(x, v)‖ ≤ Cε. Hence, if G(x) = E_ν F(T(x, v)) − F(x) ≥ −δ, then (for another constant C),

−δ ≤ E_ν F(T(x, v)) − F(x) ≤ −ε E_ν ⟨DF_x, D_x H(x, v)⟩ + Cε² = −ε ‖DF_x‖² + Cε².
Therefore, ‖DF_x‖² ≤ δ/ε + Cε, as claimed.
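The behavior promised for the process of equation 3.4 is easy to observe numerically. The following sketch uses a toy example of ours, not one from the paper: H(x, v) = (x − v)² with v uniform on [0, 1], so that DF_x = 2x − 1 and the critical state is x = 1/2; it counts visits to a near-critical set.

```python
import random

def dH(x, v):
    """Gradient in x of the sample energy H(x, v) = (x - v)^2."""
    return 2.0 * (x - v)

def orbit(x0, eps, n_steps, seed=0):
    """Iterate T(x, v) = x - eps * D_x H(x, v) (equation 3.4), v ~ U[0, 1]."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        x = x - eps * dH(x, rng.random())
        yield x

eps = 0.05
# DF_x = 2x - 1; count visits to the near-critical set {x : |DF_x|^2 <= eps}
visits = sum(1 for x in orbit(5.0, eps, 20000) if (2.0 * x - 1.0) ** 2 <= eps)
```

Consistent with the corollary, the orbit enters the near-critical set again and again: visits is a large fraction of the 20,000 steps, even though the constant step size keeps the orbit fluctuating around x = 1/2 rather than converging to it.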
Corollary 2. There is some constant C such that for every 0 < ε < 1, the orbits X_n of the process 3.4 enter the set {‖DF_x‖² ≤ Cε} infinitely often almost surely.
Proof. Clearly, T satisfies assumption 1. If δ = O(ε³), then by lemma 2,

O_δ ⊂ {‖DF_x‖² ≤ Cε}.

Hence, by theorem 4, our claim follows.

Roughly speaking, when the process is at a state x in which ‖DF_x‖² = 0, the change in the state of the process 3.4 is 0 on average. Therefore, these states are the candidate states for convergence of the process. Thus, the corollary implies that the process will visit neighborhoods of these critical states infinitely often, as is expected intuitively.

3.4 The Committee Machine with Bounded Change. The committee machine is an extension of the perceptron. It consists of an odd number of perceptrons, and the decision is taken by a majority vote: if an example belongs to the majority of the closed halfspaces determined by the perceptrons, the committee machine assigns the value 1 to that example; otherwise, the result will be −1. We analyze below two variants of a learning algorithm for the committee machine. In both cases, there is a nontrivial set A from which the target set of states is not accessible. Thus, the learning algorithm may never reach a "good" target state. We prove that A is large, in the sense that it has a nonempty interior. Formally, let
sgn(x) = 1 if x ≥ 0, and −1 if x < 0, (3.5)
and put X_i to be a perceptron in ℝ^d. For every X = {X_1, . . . , X_n} and v ∈ S^{d−1}, let the decision function of the committee machine X be g_X(v) = sgn(Σ_{i=1}^n sgn⟨x_i, v⟩). Just as in the previous case, we define the error function that enables us to determine whether two states agree or disagree on a
S. Mendelson and I. Nelken
given input. For every committee machine X, let the online error function be

f_{X,T}(ω) = {0 if g_X(ω) = g_T(ω); 1 otherwise},

where T = {T_1, ..., T_n} is the teaching committee machine.

There are several possible ways to define a learning process for committee machines, and we shall focus on two of them. Both learning rules are based on the same idea: given an example on which the student disagrees with the teacher, we wish to move several perceptrons from the (wrong) majority vote to the correct minority one. In the first learning rule, given such an example, we find the perceptron whose decision boundary is nearest to the example and then change it so that it agrees with the teaching committee machine. The problem with this learning rule is that even now, the student may be wrong, because the wrong opinion may still be in the majority. Thus, the second learning rule is to take a large enough set of perceptrons and change their votes, just as we did with the single perceptron. We shall take the smallest set possible, and the perceptrons selected are those with decision boundaries closest to the example.

We define the learning process T_c in the following way. If f_{X,T}(ω) = 1, let W = W(x, ω) be the set of perceptrons (X_i, i = 1, ..., n) such that the response of the perceptron X_i does not agree with that of the teaching committee machine T. From W, select X_j such that ∂X_j is the nearest to ω. If ω and ∂X_j are "close enough" in the sense that d(∂X_j, ω) ≤ d_n, then X_j is adapted so that it agrees with the teacher on ω with a minimal change in its location, where d_n may be fixed, or even a sequence decreasing to 0. Otherwise, if the boundary of X_j is too far away from ω, the student remains stationary. The assumption that no learning takes place on "far" examples is called bounded change. This assumption is made to help in cases where the teacher itself may randomly make mistakes (e.g., Kim & Sompolinsky, 1996).
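The bounded-change rule above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: perceptrons are taken as hyperplanes through the origin with normal vectors X_i, the "minimal change" update simply removes (slightly more than) the component of X_j along the example, and the function names (`committee_vote`, `bounded_change_step`) are hypothetical.

```python
import numpy as np

def committee_vote(X, w):
    """Majority vote of the perceptrons (rows of X) on example w; sgn(0) = 1 as in equation 3.5."""
    votes = np.sign(X @ w)
    votes[votes == 0] = 1
    return 1 if votes.sum() > 0 else -1

def bounded_change_step(X, T, w, delta):
    """One step of the first learning rule with bounded change parameter delta."""
    if committee_vote(X, w) == committee_vote(T, w):
        return X                                   # no error, no update
    target = committee_vote(T, w)
    votes = np.sign(X @ w)
    votes[votes == 0] = 1
    wrong = np.where(votes != target)[0]           # the set W of disagreeing perceptrons
    # distance from the example w to the decision boundary {u : <X_i, u> = 0}
    dists = np.abs(X[wrong] @ w) / np.linalg.norm(X[wrong], axis=1)
    j = wrong[np.argmin(dists)]
    if dists.min() > delta:
        return X                                   # bounded change: example too far, stay put
    X = X.copy()
    # minimal move making X_j agree with the teacher on w: subtract slightly
    # more than the component of X_j along w, which flips sign(<X_j, w>)
    X[j] = X[j] - (1 + 1e-6) * (X[j] @ w) / (w @ w) * w
    return X

# a 3-perceptron teacher and a student differing in one perceptron
T = np.array([[1., 0.], [0., 1.], [1., 1.]])
X = np.array([[1., 0.], [0., 1.], [-1., -1.]])
w = np.array([1., -0.5])                           # example on which they disagree
X_new = bounded_change_step(X, T, w, delta=1.0)
assert committee_vote(X_new, w) == committee_vote(T, w)
```

With a large bound, the disagreeing perceptron nearest to the example is adapted to agree with the teacher; with a small bound, the student remains stationary, which is exactly the behavior the example that follows exploits.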
Note that when both the teacher and the student are single perceptrons (n = 1), this learning process guarantees that d(x_n, t) → 0 almost surely (Mendelson, 1998).

Now we apply the theory of this article to the committee machine. Using a method similar to the one presented in section 3.1, it is easy to see that the conditions of corollary 1 are fulfilled in this case. However, as the following example shows, the set A of states from which the target set is inaccessible may be large, and (X_n) may remain far away from the set of the system's correct states, even when n = 3 (see Figure 4).

Example. The only set on which the student and the teacher do not agree is the domain bounded by perceptrons 2 and 3. For every input selected from that set, we assume that the only ∂X_i that is near enough to allow
Recurrence Methods in the Analysis of Learning Processes
Figure 4: The teaching committee machine is shown in A, and the student appears in B. The + signs near each hyperplane represent the positive side of the hyperplane, on which the perceptron votes +. The numbers represent the final vote of the committee machine on examples selected from that area. For example, +1 means that the example is on the positive side of two perceptrons and on the negative side of the third. In C we see the teacher and student together. The area bounded by perceptrons 1 and 3 is the only area in which the teacher and the student disagree. If the bounded change parameter is too small, then 2 is the only perceptron that can adapt, and it will converge to 3, to form an absorbing state that is not correct.
the adaptation process is perceptron 2. Hence, perceptron 2 converges to perceptron 3, P_0-almost surely, and the limit state is an absorbing state of the process that is not a correct state.
Using this idea, one can demonstrate that for every odd n, the committee machine containing n perceptrons can behave in the same manner as in the example above. It is possible to construct such examples in which both the teacher and the student cannot be realized by a single perceptron. Clearly, only a slight modification of the example above is needed to show that the set A has a nonempty interior. Indeed, a small perturbation of the student in Figure 4 yields a state that, by the same argument as above, converges to the same limit state as the student in Figure 4.

These examples indicate that one should be careful when selecting the bounded change parameter d. A wise choice of d is necessary; otherwise, one may stumble on such "bad" states. Of course, when the geometry of the problem is not known, it may be difficult to make this selection. Also, even if the bounded change parameter works for the initial state, there may be a positive probability of moving to a state for which d is too small. Modifying the learning algorithm to move sets of perceptrons instead of a single perceptron does not solve this problem, and in fact the same example is a counterexample to such learning algorithms as well.

4 Conclusion
We presented a theory that makes it possible to check whether a target set of a learning algorithm is recurrent. The main disadvantage of this method is that it does not yield a quantitative estimate of the time required to enter the target set. However, the first step in obtaining any such estimate is to study the properties of h_O. Our method takes at least a step in this direction, in that it gives conditions under which a positive lower bound exists. The advantage of this method is its simplicity. As we showed in the examples, it applies in many standard settings, simplifying, unifying, and extending existing results.

Appendix

A.1 Proofs of the Theorems from Section 2. The appendix is devoted to the proofs of the main theorems in section 2. We begin with the proof of the well-known recurrence result, which may be found, for example, in Orey (1971). We present the proof for the sake of completeness.
Proof of Theorem 1. The first part of the proof is a version of a 0-1 law that is due to Lévy (Loève, 1963). Let Y_1, Y_2, ... be a sequence of random variables, and let Y be a random variable defined on Y_1, Y_2, ... such that E|Y| < ∞. Note that Z_n = E(Y | Y_1, ..., Y_n) forms a martingale; thus, by the martingale convergence theorem (Loève, 1963), Z_n converges almost surely to Y. In particular, if we set U_i = {X_i ∈ O}, U = {X_n ∈ O i.o.}, Y_n = X_n, and Y = χ_U, then P_0(U | Y_1, ..., Y_n) = E(Y | Y_1, ..., Y_n) converges almost
surely to χ_U, and P_0(∪_{i=k}^∞ U_i | Y_1, ..., Y_n) tends to χ_{∪_{i=k}^∞ U_i} for every fixed k. On the other hand, for every k ≤ n, note that

P_0(∪_{i=k}^∞ U_i | Y_1, ..., Y_n) ≥ P_0(∪_{i=n}^∞ U_i | Y_1, ..., Y_n) ≥ P_0(U | Y_1, ..., Y_n).

Thus, by taking n → ∞,

χ_{∪_{i=k}^∞ U_i} ≥ lim sup_{n→∞} P_0(∪_{i=n}^∞ U_i | Y_1, ..., Y_n) ≥ lim inf_{n→∞} P_0(∪_{i=n}^∞ U_i | Y_1, ..., Y_n) ≥ χ_U.

Again, taking k → ∞, the left side converges almost surely to χ_U; hence, P_0(∪_{i=n}^∞ U_i | Y_1, ..., Y_n) tends to χ_U. Denote L_n(X_n, O) = P_0(∪_{i=n}^∞ U_i | X_1, ..., X_n); then by the 0-1 law, L_n(X_n, O) tends to the characteristic function of the set {X_n ∈ O i.o.}. Since L_n(X_n, O) is the probability that an orbit visits O for some m > n given its history X_1, ..., X_n, then by our assumption L_n(·, O) is strictly positive on the set {X_n ∈ A i.o.}, implying that L_n(·, O) → 1 on that set. Hence, {X_n ∈ O i.o.} ⊃ {X_n ∈ A i.o.} P_0-almost surely.

Next we shall prove that under assumption 1, h_O is lower semicontinuous. We begin the proof with the following lemma:

Lemma 3. If assumption 1 holds, then for every x ∈ X, every sequence y_n → x, and every integer k, there is a set V ⊂ Ω^k for which ν^k(V) = 1, and for every k-tuple (ω_1, ..., ω_k) ∈ V, T^k(y_n; ω_1, ..., ω_k) → T^k(x; ω_1, ..., ω_k).
Proof. We will prove our claim for k = 2; the general case follows by induction. Fix x ∈ X and a sequence (y_n) such that y_n → x. Set z_n(ω) = T(y_n, ω) and z(ω) = T(x, ω). Hence, by assumption 1, z_n → z ν-almost surely. Again by assumption 1, for every such ω, T(z_n(ω), ω′) → T(z(ω), ω′) ν-almost surely. Note that T²(y_n; ω, ω′) = T(z_n(ω); ω′) and that T²(x; ω, ω′) = T(z(ω); ω′). Thus, our claim follows from Fubini's theorem.
Proof of Theorem 3. Define stopping times τ_x as the first time at which X_n enters O given that X_0 = x. If the orbit does not enter O, we set τ_x = ∞. Let φ_k(x) = P_x{τ_x ≤ k}. Clearly, (φ_k) is an increasing sequence that converges to h_O. Thus, to show that h_O is lower semicontinuous, it is enough to show that each φ_k is lower semicontinuous. Fix x ∈ X and a sequence (y_n) converging to x. By lemma 3, for every integer k, there is a set V ⊂ Ω^k such that ν^k(V) = 1, and T^k(y_n; ·) → T^k(x; ·) on V. Note that if O is open and if T^k(x; ω_1, ..., ω_k) ∈ O, then for n large enough, so is T^k(y_n; ω_1, ..., ω_k). Thus,
χ{τ_x ≤ k} ≤ lim inf_{n→∞} χ{τ_{y_n} ≤ k}.   (A.1)
Table 1: Position of Points for the Simple Two-Dimensional Kohonen Process.

            After Step 3       After Step 4       After Step 5       After Step 6
Point        x       y          x       y          x       y          x       y
  1       0.5000  0.5000     0.3282  0.5625     0.4122  0.3672     0.4122  0.3672
  2       0.1250  0.1250    −0.3745  0.5625    −0.3745  0.5625    −0.4527  0.3672
  3       0.2500  0.2500    −0.0310  0.4375     0.2268  0.0781    −0.0799 −0.1914
  4       0.1250  0.1250     0.1250  0.1250     0.5625 −0.4375     0.3672 −0.5078
  5       0       0         −0.1092  0.1250     0.0294 −0.0156    −0.4853 −0.5078
Taking the expectation and applying Fatou's lemma, it follows that

Eχ{τ_x ≤ k} ≤ E lim inf_{n→∞} χ{τ_{y_n} ≤ k} ≤ lim inf_{n→∞} Eχ{τ_{y_n} ≤ k}.   (A.2)
Hence, φ_k(x) ≤ lim inf_{n→∞} φ_k(y_n).

A.2 Algorithm for the Simple 2D Kohonen Process.

1. Given the random initial state, denote by d the minimum distance between point 3 and any of the four corners of the unit square. Assume that d ≥ 3ε. Otherwise, it is necessary to "pull" point 3 away from the border so that this condition is fulfilled. Alternatively, one may decrease the step size ε of the process to be less than d/3.

2. Choose point 3. All other four points will approach point 3, so that the maximum distance between point 3 and any of the other four points decreases (by a factor of 1 − ε/2). Repeat this step until the maximum distance between point 3 and the other points is small enough (e.g., smaller than ε/10⁶). At this point, we do the following rescaling. Since point 3 did not move, it is still at a distance of at least 3ε from the border. Build a square around point 3 with sides of length 2ε. The whole square will be inside the borders of the original region. To simplify notation, rescale the coordinates so that 1 corresponds to ε. With this rescaling, the small square now has corners [±1, ±1], and the new ε is equal to 0.5. For the rest of the discussion, all the coordinates refer to this new coordinate system. The coordinates of all five points are [0, 0] (up to perturbations of size 10⁻⁶).

3. At least one corner of the unit square is closer to one of points 1, 2, 4, 5 than to point 3. Choose one such corner (up to a perturbation of size 10⁻⁶) as the next example. We assume without loss of generality that it is corner [1, 1] and that it is closest to point 1. The coordinates after this step are given in Table 1, up to perturbations of size 2 × 10⁻⁶.

4. Consider the point [−0.874, 1]. This point is (up to the small perturbations) close to points 2 and 4 and farther away from all other points. Clearly, we may assume that it is actually closest to point 2. Also, it can be slightly perturbed so that it does not have two nearest neighbors.
The coordinates are as given in Table 1, up to perturbations of size 3 × 10⁻⁶.
5. The next example is the point [−1, 1], which is closest to point 4. It is easy to check that all points whose distance from [−1, 1] is smaller than 10⁻⁶ will also be closest to point 4. The coordinates are as given in Table 1, up to perturbations of size 4 × 10⁻⁶.

6. The last example is the point [−1, −1], which is closest to point 5 (together with a neighborhood of size 10⁻⁶). The coordinates are as given in Table 1, up to perturbations of size 5 × 10⁻⁶.

At this point, the relative distances between the points are consistent with the function f. For example, the squared distances between point 1 and points 2, 3, 4, and 5 are 0.7481, 0.5542, 0.7677, and 1.5712, respectively. Point 1 is thus closest to point 3, farther from points 2 and 4, and farthest from point 5, as required.

References

Fort, J. C., & Pagès, G. (1995). On the a.s. convergence of the Kohonen algorithm with a general neighborhood function. Annals of Applied Probability, 5, 1177–1216.

Fort, J. C., & Pagès, G. (1996). About the Kohonen algorithm: Strong or weak self-organization. Neural Networks, 9, 773–785.

Haykin, S. S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.

Kim, J. W., & Sompolinsky, H. (1996). On-line Gibbs learning. Physical Review Letters, 76, 3021–3024.

Kohonen, T. (1989). Self-organization and associative memory. Berlin: Springer-Verlag.

Loève, M. (1963). Probability theory (3rd ed.). New York: Van Nostrand.

Mendelson, S. (1998). Mathematical aspects of learning in neural networks. Unpublished doctoral dissertation, Technion-I.I.T.

Orey, S. (1971). Limit theorems for Markov chain transition probabilities. New York: Van Nostrand Reinhold.

Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps. Reading, MA: Addison-Wesley.
Received September 27, 1999; accepted September 27, 2000.
LETTER
Communicated by Grace Wahba
Subspace Information Criterion for Model Selection

Masashi Sugiyama
Hidemitsu Ogawa
Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8552, Japan

The problem of model selection is of considerable importance for acquiring higher levels of generalization capability in supervised learning. In this article, we propose a new criterion for model selection, the subspace information criterion (SIC), which is a generalization of Mallows's C_L. It is assumed that the learning target function belongs to a specified functional Hilbert space and that the generalization error is defined as the Hilbert space squared norm of the difference between the learning result function and the target function. SIC gives an unbiased estimate of the generalization error so defined. SIC assumes the availability of an unbiased estimate of the target function and the noise covariance matrix, which are generally unknown. A practical calculation method of SIC for least-mean-squares learning is provided under the assumption that the dimension of the Hilbert space is less than the number of training examples. Finally, computer simulations on two examples show that SIC works well even when the number of training examples is small.

1 Introduction
Supervised learning is obtaining an underlying rule from training examples made up of sample points and corresponding sample values. If the rule is successfully acquired, then appropriate output values corresponding to unknown input points can be estimated. This ability is called the generalization capability. So far, many supervised learning methods have been developed, including the stochastic gradient descent method (Amari, 1967), the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986a, 1986b), regularization learning (Tikhonov & Arsenin, 1977; Poggio & Girosi, 1990), Bayesian inference (Savage, 1954; MacKay, 1992), projection learning (Ogawa, 1987), and support vector machines (Vapnik, 1995; Schölkopf, Burges, & Smola, 1998). In these learning methods, the quality of the learning results depends heavily on the choice of models. Here, models refer to, for example, Hilbert spaces to which the learning target function belongs in the projection learning and support vector regression cases, multilayer perceptrons (MLPs) with different numbers of hidden units in the backpropagation case, pairs

Neural Computation 13, 1863–1889 (2001) © 2001 Massachusetts Institute of Technology
of the regularization term and the regularization parameter in the regularization learning case, and families of probability distributions in the Bayesian inference case.

If the model is too complicated, learning results tend to overfit noisy training examples. And if the model is too simple, then it is not capable of fitting the training examples, causing learning results to become underfitted. In general, both over- and underfitted learning results have lower levels of generalization capability. Therefore, the problem of finding an appropriate model, referred to as model selection, is important for acquiring higher levels of generalization capability.

The problem of model selection has been studied mainly in the field of statistics. Mallows (1964) proposed C_P for the selection of subset-regression models (see also Gorman & Toman, 1966; Mallows, 1973). C_P gives an unbiased estimate of the predictive training error, that is, the error between estimated and true values at sample points contained in the training set. Mallows (1973) extended the range of application of C_P to the selection of arbitrary linear regression models. This extension is called C_L or the unbiased risk estimate (Wahba, 1990). C_L may require a good estimate of the noise variance. In contrast, generalized cross-validation (Craven & Wahba, 1979; Wahba, 1990), which is an extension of traditional cross-validation (Mosteller & Wallace, 1963; Allen, 1974; Stone, 1974; Wahba, 1990), is a criterion for finding the model minimizing the predictive training error without knowledge of the noise variance. Li (1986) showed the asymptotic optimality of C_L and generalized cross-validation; that is, they asymptotically select the model minimizing the predictive training error (see also Wahba, 1990). However, these methods do not explicitly evaluate the error at unknown input points.
In contrast, model selection methods explicitly evaluating the generalization error have been studied from various standpoints: information statistics (Akaike, 1974; Takeuchi, 1976; Konishi & Kitagawa, 1996), Bayesian statistics (Schwarz, 1978; Akaike, 1980; MacKay, 1992), stochastic complexity (Rissanen, 1978, 1987, 1996; Yamanishi, 1998), and structural risk minimization (Vapnik, 1995; Cherkassky, Shao, Mulier, & Vapnik, 1999). In particular, information-statistics-based methods have been extensively studied. Akaike's information criterion (AIC) (Akaike, 1974) is one of the most eminent methods of this type. Many successful applications of AIC to real-world problems have been reported (e.g., Bozdogan, 1994; Akaike & Kitagawa, 1994, 1995; Kitagawa & Gersch, 1996). AIC assumes that models are faithful.¹ Takeuchi (1976) extended AIC to be applicable to unfaithful models. This criterion is called Takeuchi's modification of AIC (TIC) (see also Stone, 1977; Shibata, 1989). The learning method with which TIC can deal is restricted to maximum likelihood estimation. Konishi and Kitagawa
1 A model is said to be faithful if the learning target can be expressed by the model (Murata, Yoshizawa, & Amari, 1994).
(1996) relaxed the restriction and derived the generalized information criterion for a class of learning methods represented by statistical functionals.

The common characteristic of AIC and its derivatives described above is that they give an asymptotic unbiased estimate of the expected log likelihood. This implies that when the number of training examples is small, these criteria are no longer valid. To overcome this weakness, two approaches have been taken. One is to calculate an exact unbiased estimate of the expected log likelihood for each model. This type of modification can be found in many articles (e.g., Sugiura, 1978; Hurvich & Tsai, 1989, 1991, 1993; Noda, Miyaoka, & Itoh, 1996; Fujikoshi & Satoh, 1997; Satoh, Kobayashi, & Fujikoshi, 1997; Hurvich, Simonoff, & Tsai, 1998; Simonoff, 1998; McQuarrie & Tsai, 1998). The other approach is to use the bootstrap method (Efron, 1979; Efron & Tibshirani, 1993) for numerically evaluating the bias when the expected log likelihood is estimated by the log likelihood. The idea of bootstrap bias correction was introduced by Wong (1983) and Efron (1986), and it was then formalized as a model selection criterion by Ishiguro, Sakamoto, and Kitagawa (1997) (see also Davison & Hinkley, 1997; Cavanaugh & Shumway, 1997; Shibata, 1997).

In the neural network community, AIC has been extended in a different direction. Murata et al. (1994) generalized the loss function of TIC and proposed the network information criterion (NIC). NIC assumes that the quasi-optimal estimator minimizing the empirical error, say, the maximum likelihood estimator when the log loss is adopted as the loss function, has been exactly obtained. However, when we are concerned with MLP learning, it is difficult to obtain the quasi-optimal estimator in real time, since MLP learning is generally performed by iterative methods such as the stochastic gradient descent method (Amari, 1967) and the backpropagation algorithm (Rumelhart et al., 1986a, 1986b).
To cope with this problem, information criteria taking into account the discrepancy between the quasi-optimal estimator and the obtained estimator have been devised (Wada & Kawato, 1991; Onoda, 1995).

In this article, we propose a new criterion for model selection from the functional analytic viewpoint: the subspace information criterion (SIC). SIC differs from AIC-type criteria mainly in three respects. The first is the generalization measure. In AIC-type criteria, the generalization error averaged over all training sets is adopted as the generalization measure, and the averaged terms are replaced with particular values calculated from one given training set. In contrast, the generalization measure adopted in SIC is not averaged over training sets. The fact that the replacement of the averaged terms is unnecessary for SIC is expected to result in better selection than AIC-type methods. The second is the approximation method. AIC-type criteria use asymptotic approximation and give an asymptotic unbiased estimate of the generalization error. In contrast, SIC uses the noise characteristics and gives an exact unbiased estimate of the generalization error. Our computer simulations show that SIC works well even when the number of training
examples is small. The third is the restriction on models. Takeuchi (1983) pointed out that AIC-type criteria are effective only in the selection of nested models (see also Murata et al., 1994). In SIC, no restriction is imposed on models.

This letter is organized as follows. Section 2 formulates the problem of model selection. In section 3, our main result, SIC, is derived. In section 4, SIC is compared with Mallows's C_L and AIC-type criteria. In section 5, SIC is applied to the selection of least-mean-squares learning models, and a complete algorithm of SIC is described. Finally, section 6 is devoted to computer simulations demonstrating the effectiveness of SIC.

2 Mathematical Foundation of Model Selection
Let us consider the supervised learning problem of obtaining an approximation to a target function from a set of training examples. Let the learning target function be f(x), a function of L variables defined on a subset D of the L-dimensional Euclidean space R^L. The training examples are made up of sample points x_m in D and corresponding sample values y_m in C:

{(x_m, y_m) | y_m = f(x_m) + ε_m}_{m=1}^M,   (2.1)
where y_m is degraded by additive noise ε_m. Let θ be a set of factors determining the learning results, such as the type and number of basis functions, and parameters in learning algorithms. We call θ a model. Let f̂_θ be a learning result obtained with a model θ. Assuming that f and f̂_θ belong to a Hilbert space H, the problem of model selection is described as follows.

Definition 1 (model selection).
From a given set of models, find the model minimizing the generalization error defined as

E_ε ‖f̂_θ − f‖²,   (2.2)

where E_ε denotes the ensemble average over the noise, and ‖·‖ denotes the norm.

3 Subspace Information Criterion
In this section, we derive SIC, which gives an unbiased estimate of the generalization error. Let y, z, and ε be M-dimensional vectors whose mth elements are y_m, f(x_m), and ε_m, respectively:

y = z + ε.   (3.1)
y and z are called a sample value vector and an ideal sample value vector,
respectively. Let X_θ be a mapping from y to f̂_θ:

f̂_θ = X_θ y.   (3.2)
X_θ is called a learning operator. In the derivation of SIC, we assume the following conditions:

1. The learning operator X_θ is linear.

2. The mean noise is zero:

E_ε ε = 0.   (3.3)

3. A linear operator X_u that gives an unbiased learning result f̂_u is available:

E_ε f̂_u = f,   (3.4)

where

f̂_u = X_u y.   (3.5)

Assumption 1 implies that the range of X_θ becomes a subspace of H. Linear learning operators include various learning methods such as least-mean-squares learning (Ogawa, 1992), regularization learning (Nakashima & Ogawa, 1999), projection learning (Ogawa, 1987), and parametric projection learning (Oja & Ogawa, 1986).

It follows from equations 3.5, 3.1, and 3.3 that

E_ε f̂_u = E_ε X_u y = X_u z + E_ε X_u ε = X_u z.   (3.6)

Hence, assumption 3 yields

X_u z = f.   (3.7)
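In a finite-dimensional setting, assumption 3 is easy to realize concretely. The following sketch is an illustration with hypothetical dimensions, not part of the article: it takes H = R^n with z = A f for a full-column-rank matrix A, so the pseudoinverse gives a linear unbiased learning operator.

```python
import numpy as np

# Hypothetical setup: H = R^n, z = A f, rank(A) = n <= M,
# so X_u = pinv(A) satisfies X_u z = f for every f (equation 3.7).
rng = np.random.default_rng(0)
M, n = 10, 4
A = rng.normal(size=(M, n))       # sampling matrix: z = A f
X_u = np.linalg.pinv(A)           # linear, and X_u A = I when rank(A) = n

f = rng.normal(size=n)            # target (coefficient vector)
z = A @ f                         # ideal sample value vector
assert np.allclose(X_u @ z, f)    # X_u z = f

# with zero-mean noise, E f_u = X_u z = f, so f_u = X_u y is unbiased (eq. 3.4)
```

This is one concrete way to satisfy the assumption; section 5 of the article gives the general construction when dim H ≤ M.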
Note that, as discussed in section 5, assumption 3 holds if the dimension of the Hilbert space H is not larger than the number M of training examples.

Based on the above setting, we shall first give an estimation method for the generalization error of f̂_θ. The unbiased learning result f̂_u and the learning operator X_u are used for this purpose. The generalization error of f̂_θ is decomposed into the bias and variance (see, e.g., Takemura, 1991; Geman, Bienenstock, & Doursat, 1992; Efron & Tibshirani, 1993):

E_ε ‖f̂_θ − f‖² = ‖E_ε f̂_θ − f‖² + E_ε ‖f̂_θ − E_ε f̂_θ‖².   (3.8)
It follows from equations 3.2 and 3.1 that equation 3.8 yields

E_ε ‖f̂_θ − f‖² = ‖X_θ z − f‖² + E_ε ‖X_θ ε‖² = ‖X_θ z − f‖² + tr(X_θ Q X_θ*),   (3.9)

where tr(·) denotes the trace of an operator, Q is the noise covariance matrix, and X_θ* denotes the adjoint operator of X_θ. Let X_0 be an operator defined as

X_0 = X_θ − X_u.   (3.10)
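The bias-variance decomposition in equation 3.9 can be checked numerically. The sketch below is illustrative only: it assumes a finite-dimensional H = R^n, a ridge-type learning operator, and a diagonal noise covariance, none of which come from the article.

```python
import numpy as np

# Monte Carlo check of equation 3.9 in a hypothetical finite-dimensional setup.
rng = np.random.default_rng(0)
M, n = 8, 5
A = rng.normal(size=(M, n))                      # sampling matrix: z = A f
f = rng.normal(size=n)
z = A @ f
Q = np.diag(rng.uniform(0.1, 0.5, size=M))       # noise covariance (need not be i.i.d.)
X_t = np.linalg.solve(A.T @ A + 0.1 * np.eye(n), A.T)   # ridge-type learning operator

bias2 = np.sum((X_t @ z - f) ** 2)               # ||X_theta z - f||^2
var = np.trace(X_t @ Q @ X_t.T)                  # tr(X_theta Q X_theta^*)

L = np.linalg.cholesky(Q)
E = L @ rng.normal(size=(M, 100_000))            # noise samples with covariance Q
err = np.sum((X_t @ (z[:, None] + E) - f[:, None]) ** 2, axis=0)
assert abs(err.mean() - (bias2 + var)) < 0.05 * (bias2 + var)
```

The empirical mean of ‖f̂_θ − f‖² matches bias² + variance to Monte Carlo accuracy, which is exactly the content of equation 3.9.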
Then the bias of f̂_θ can be expressed by using f̂_u as

‖X_θ z − f‖² = ‖f̂_θ − f̂_u‖² − ‖f̂_θ − f̂_u‖² + ‖X_θ z − f‖²
  = ‖f̂_θ − f̂_u‖² − ‖X_θ z + X_θ ε − (X_u z + X_u ε)‖² + ‖X_θ z − X_u z‖²
  = ‖f̂_θ − f̂_u‖² − ‖X_0 z + X_0 ε‖² + ‖X_0 z‖²
  = ‖f̂_θ − f̂_u‖² − ‖X_0 z‖² − 2 Re⟨X_0 z, X_0 ε⟩ − ‖X_0 ε‖² + ‖X_0 z‖²
  = ‖f̂_θ − f̂_u‖² − 2 Re⟨X_0 z, X_0 ε⟩ − ‖X_0 ε‖²,   (3.11)

where Re stands for the real part of a complex number and ⟨·, ·⟩ denotes the inner product. The second and third terms on the right-hand side of equation 3.11 cannot be directly calculated, since z and ε are unknown. Here, we shall replace them with their averages over the noise. Then the second term vanishes because of equation 3.3, and the third term yields

E_ε ‖X_0 ε‖² = tr(X_0 Q X_0*).   (3.12)
This approximation immediately gives the following criterion.

Definition 2 (subspace information criterion). The following functional is called the subspace information criterion:

SIC = ‖f̂_θ − f̂_u‖² − tr(X_0 Q X_0*) + tr(X_θ Q X_θ*).   (3.13)
The model minimizing SIC is called the minimum SIC model (MSIC model), and the learning result obtained by the MSIC model is called the MSIC learning result. The generalization capability of the MSIC learning result measured by equation 2.2 is expected to be the best; this expectation is theoretically supported by the fact that SIC gives an unbiased estimate of the generalization error, since it follows from equations 3.13, 3.11, 3.3, 3.12,
and 3.9 that

E_ε SIC = E_ε (‖f̂_θ − f̂_u‖² − tr(X_0 Q X_0*) + tr(X_θ Q X_θ*))
        = E_ε (‖X_θ z − f‖² + 2 Re⟨X_0 z, X_0 ε⟩ + ‖X_0 ε‖²) − tr(X_0 Q X_0*) + tr(X_θ Q X_θ*)
        = ‖X_θ z − f‖² + tr(X_θ Q X_θ*)
        = E_ε ‖f̂_θ − f‖².   (3.14)
Since the bias is always nonnegative by definition, we can also consider the following corrected SIC (cSIC):

cSIC = [‖f̂_θ − f̂_u‖² − tr(X_0 Q X_0*)]₊ + tr(X_θ Q X_θ*),   (3.15)

where [·]₊ is defined as

[t]₊ = max(0, t).   (3.16)
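The unbiasedness property of equation 3.14 can be illustrated by Monte Carlo simulation. The setup below is hypothetical and only a sketch: H = R^n, X_u the pseudoinverse of the sampling matrix (so assumption 3 holds), a ridge-type candidate X_θ, and i.i.d. noise for simplicity.

```python
import numpy as np

# Monte Carlo sketch: the mean of SIC (eq. 3.13) over the noise equals the
# generalization error of eq. 3.9. All concrete dimensions are illustrative.
rng = np.random.default_rng(1)
M, n = 10, 4
A = rng.normal(size=(M, n))
f = rng.normal(size=n)
z = A @ f
sigma2 = 0.3
Q = sigma2 * np.eye(M)                            # i.i.d. noise for simplicity

X_u = np.linalg.pinv(A)                           # unbiased operator: X_u z = f
X_t = np.linalg.solve(A.T @ A + 0.5 * np.eye(n), A.T)   # candidate (ridge) model
X0 = X_t - X_u                                    # equation 3.10

true_gen = np.sum((X_t @ z - f) ** 2) + np.trace(X_t @ Q @ X_t.T)   # eq. 3.9

Y = z[:, None] + np.sqrt(sigma2) * rng.normal(size=(M, 100_000))
sic = (np.sum((X_t @ Y - X_u @ Y) ** 2, axis=0)   # ||f_theta - f_u||^2
       - np.trace(X0 @ Q @ X0.T)
       + np.trace(X_t @ Q @ X_t.T))               # equation 3.13
assert abs(sic.mean() - true_gen) < 0.1 * true_gen + 0.02
```

Each individual SIC value is computed from one noisy sample vector y only, yet its average recovers the true generalization error; this is the exact (nonasymptotic) unbiasedness that distinguishes SIC from AIC-type criteria.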
4 Discussion
In this section, SIC is compared with C_L and AIC-type criteria.

4.1 Comparison with C_L. The idea of estimation used in SIC generalizes C_L by Mallows (1973). The problem considered in Mallows's work is to estimate the ideal sample value vector z from the training examples {(x_m, y_m)}_{m=1}^M. The predictive training error is adopted as the error measure:

E_ε Σ_{m=1}^M (f̂_θ(x_m) − f(x_m))² = E_ε ‖B_θ y − z‖²,   (4.1)
where B_θ is a mapping from y to an M-dimensional vector whose mth element is f̂_θ(x_m). Analogous to equations 3.8 and 3.9, the predictive training error can be decomposed into the bias and variance:

E_ε ‖B_θ y − z‖² = ‖E_ε B_θ y − z‖² + E_ε ‖B_θ y − E_ε B_θ y‖² = ‖B_θ z − z‖² + tr(B_θ Q B_θ*).   (4.2)

Then C_L is given as

C_L = ‖B_θ y − y‖² − tr((B_θ − I) Q (B_θ − I)*) + tr(B_θ Q B_θ*).   (4.3)
Although the problem Mallows considered was to estimate z, acquiring a higher level of generalization capability is implicitly expected. However, minimizing the predictive training error does not generally mean minimizing the generalization error defined by equation 2.2. In contrast, we consider the problem of minimizing the generalization error, and SIC directly gives an unbiased estimate of the generalization error. Mallows employed the sample value vector y as an unbiased estimate of the target z. In contrast, we assumed the availability of the unbiased learning result f̂_u of the target function f. f̂_u plays a role similar to that of y in Mallows's case.

4.2 Comparison with AIC-Type Methods. In AIC-type criteria, the relation between the generalization error and the empirical error is first evaluated in the sense of the average over all training sets {x_m}_{m=1}^M. Then the averaged terms are replaced with particular values calculated from one given training set. In contrast, the generalization measure adopted in SIC (see equation 2.2) is not averaged over training sets. The fact that the replacement of the averaged terms is unnecessary for SIC is expected to result in better selection than AIC-type methods.

AIC-type criteria give an asymptotic unbiased estimate of the generalization error. In contrast, SIC gives an exact unbiased estimate of the generalization error (see equation 3.14). Therefore, SIC is expected to work well even when the number of training examples is small. Indeed, the computer simulations performed in section 6 support this claim. Takeuchi (1983) pointed out that AIC-type criteria are effective only in the selection of nested models (see also Murata et al., 1994):
S_1 ⊂ S_2 ⊂ ···.   (4.4)
In SIC, no restriction is imposed on models except that the range of X_θ is included in H.

AIC-type methods compare models under the learning method minimizing the empirical error. In contrast, SIC can consistently compare models with different learning methods, for example, least-mean-squares learning (Ogawa, 1992), regularization learning (Nakashima & Ogawa, 1999), projection learning (Ogawa, 1987), and parametric projection learning (Oja & Ogawa, 1986). The type of learning method is also included in the model.

AIC-type criteria do not explicitly require a priori information on the class to which the target function belongs. In contrast, SIC requires a priori information on the Hilbert space H to which the target function f and a learning result f̂_θ belong. If H is unknown, a Hilbert space including all models is practically adopted as H (see the experiments in section 6.2).

SIC requires the noise covariance matrix Q, while AIC-type criteria do not. However, as shown in section 5, we can cope with the case where Q is not available.

In the derivation of AIC-type methods, terms that are not dominant for model selection are neglected (see, e.g., Murata et al., 1994). This implies that the value of an AIC-type criterion is not an estimate of the generalization error itself. In contrast, SIC gives an estimate of the generalization error. This difference can be clearly seen in the top graph in Figure 4 (see section 6.1).
Subspace Information Criterion for Model Selection
AIC-type methods assume that training examples are independently and identically distributed (i.i.d.). In contrast, SIC can deal with correlated noise if the noise covariance matrix Q is available.

5 SIC for Least-Mean-Squares Learning
In this section, SIC is applied to the selection of least-mean-squares (LMS) learning models. LMS learning obtains a learning result f̂_h(x) in a subspace S of H minimizing the training error

Σ_{m=1}^{M} ( f̂_h(x_m) − y_m )²  (5.1)
from training examples {(x_m, y_m)}_{m=1}^M. In the LMS learning case, a model h refers to a subspace S of H. Here, we assume that S has a reproducing kernel (see Aronszajn, 1950; Bergman, 1970; Wahba, 1990; Saitoh, 1988, 1997). Let K_S(x, x') be the reproducing kernel of S and D be the domain of the functions in S. Then K_S(x, x') satisfies the following conditions:

For any fixed x' in D, K_S(x, x') is a function of x in S.

For any function f in S and for any x' in D, it holds that

⟨f(·), K_S(·, x')⟩ = f(x').  (5.2)
Note that the reproducing kernel is unique if it exists. Let m be the dimension of S. Then for any orthonormal basis {φ_j}_{j=1}^m in S, K_S(x, x') is expressed as

K_S(x, x') = Σ_{j=1}^{m} φ_j(x) φ_j(x').  (5.3)
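As a quick numerical illustration (a sketch, not part of the paper's derivation), equation 5.3 can be checked for the trigonometric space that reappears in section 6: under the inner product ⟨f, g⟩ = (1/2π)∫ f g dx, an orthonormal basis is {1, √2 sin px, √2 cos px}, and summing φ_j(x)φ_j(x') recovers the Dirichlet-type closed form of equation 6.8.

```python
import numpy as np

def trig_basis(n):
    # Orthonormal basis of S_n under <f, g> = (1/2pi) int_{-pi}^{pi} f g dx:
    # {1, sqrt(2) sin(px), sqrt(2) cos(px)}, p = 1, ..., n (dimension 2n + 1).
    fns = [lambda x: 1.0]
    for p in range(1, n + 1):
        fns.append(lambda x, p=p: np.sqrt(2) * np.sin(p * x))
        fns.append(lambda x, p=p: np.sqrt(2) * np.cos(p * x))
    return fns

def K_basis(x, xp, n):
    # Equation 5.3: K_S(x, x') = sum_j phi_j(x) phi_j(x')
    return sum(f(x) * f(xp) for f in trig_basis(n))

def K_dirichlet(x, xp, n):
    # Closed form of the same kernel (it reappears as equation 6.8).
    u = x - xp
    if abs(np.sin(u / 2)) < 1e-12:
        return float(2 * n + 1)
    return np.sin((2 * n + 1) * u / 2) / np.sin(u / 2)

assert np.isclose(K_basis(0.3, -1.1, 4), K_dirichlet(0.3, -1.1, 4))
```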
Let A_S be an operator from H to the M-dimensional unitary space C^M defined as

A_S = Σ_{m=1}^{M} ( e_m ⊗ K_S(·, x_m) ),  (5.4)

where (· ⊗ ·) denotes the Neumann-Schatten product,² and e_m is the mth vector of the so-called standard basis in C^M. A_S is called a sampling operator

² For any fixed g in a Hilbert space H_1 and any fixed f in a Hilbert space H_2, the Neumann-Schatten product (f ⊗ g) is an operator from H_1 to H_2 defined by using any h ∈ H_1 as (Schatten, 1970): (f ⊗ g) h = ⟨h, g⟩ f.
1872
Masashi Sugiyama and Hidemitsu Ogawa
since it follows from equation 5.2 that

A_S f = z  for any f in S.  (5.5)

Then LMS learning is rigorously defined as follows.

Definition 3 (least-mean-squares learning) (Ogawa, 1992).
An operator X is called the LMS learning operator for the model h if X minimizes the functional

J[X] = Σ_{m=1}^{M} ( f̂_h(x_m) − y_m )² = ‖A_S X y − y‖².  (5.6)
Let A† be the Moore-Penrose generalized inverse³ of A. Then the following proposition holds.

Proposition 1 (Ogawa, 1992). The LMS learning operator for the model h is given by

X_h = A_S†.  (5.7)
SIC requires an unbiased learning result f̂_u and an operator X_u providing f̂_u. Here, we show a method of obtaining f̂_u and X_u by LMS learning. Let us assume that H is also a reproducing kernel Hilbert space and regard H itself as a model. Let A_H be a sampling operator defined with the reproducing kernel of H:

A_H = Σ_{m=1}^{M} ( e_m ⊗ K_H(·, x_m) ).  (5.8)
Since f belongs to H, it follows from equation 5.2 that the ideal sample value vector z is expressed as

z = A_H f.  (5.9)
Let f̂_H be a learning result obtained with the model H:

f̂_H = X_H y,  (5.10)

³ An operator X is called the Moore-Penrose generalized inverse of an operator A if X satisfies the following four conditions (see Albert, 1972; Ben-Israel & Greville, 1974): AXA = A, XAX = X, (AX)* = AX, and (XA)* = XA. The Moore-Penrose generalized inverse is unique and is denoted by A†.
where X_H is the LMS learning operator with the model H:

X_H = A_H†.  (5.11)
Now let us assume that the range of A_H* agrees with H.⁴ This holds only if the number M of training examples is larger than or equal to the dimension of H. Then it follows from equations 5.10, 3.1, 3.3, 5.9, and 5.11 that

E_ε f̂_H = E_ε X_H y = X_H z + E_ε X_H ε = X_H z = X_H A_H f = A_H† A_H f = f,  (5.12)

which implies that f̂_H is unbiased. Hence, we can put

f̂_u = f̂_H,  (5.13)

X_u = X_H.  (5.14)
For evaluating SIC, the noise covariance matrix Q is required (see equation 3.13). One practical measure is to use

Q̂ = σ̂² I  (5.15)

as an estimate of the noise covariance matrix, where σ̂² is estimated from the training examples {(x_m, y_m)}_{m=1}^M as

σ̂² = Σ_{m=1}^{M} ( f̂_u(x_m) − y_m )² / ( M − dim(H) ),  (5.16)
where M > dim(H) is implicitly assumed. When Q is really of the form Q = σ² I and σ² is estimated by equation 5.16, SIC still gives an unbiased estimate of the generalization error, since equation 5.16 is an unbiased estimate of σ² (see theorem 1.5.1 in Fedorov, 1972).

Based on the above discussion, we shall show a calculation method for LMS learning results and the value of SIC by matrix operations. Let T_S and T_H be M-dimensional matrices whose (m, m')th elements are K_S(x_m, x_{m'}) and K_H(x_m, x_{m'}), respectively. Then the following theorem holds:

Theorem 1 (calculation of LMS learning results). LMS learning results f̂_h(x) and f̂_u(x) can be calculated as

f̂_h(x) = Σ_{m=1}^{M} ⟨T_S† y, e_m⟩ K_S(x, x_m),  (5.17)

f̂_u(x) = Σ_{m=1}^{M} ⟨T_H† y, e_m⟩ K_H(x, x_m).  (5.18)
⁴ This condition is sufficient for f̂_H to be unbiased, but not necessary.
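Theorem 1 can be exercised numerically. The sketch below (illustrative variable names, using the trigonometric kernel of section 6) builds T_S from the kernel, evaluates f̂_h through equation 5.17, and cross-checks it against an explicit least-squares fit on a basis of S_n, which spans the same subspace and therefore yields the same fitted function:

```python
import numpy as np

def K_S(x, xp, n):
    # Reproducing kernel of the trigonometric space S_n (equation 6.8).
    u = x - xp
    if abs(np.sin(u / 2)) < 1e-12:
        return float(2 * n + 1)
    return np.sin((2 * n + 1) * u / 2) / np.sin(u / 2)

def lms_fit(xs, y, n):
    # Theorem 1: fhat_h(x) = sum_m <T_S† y, e_m> K_S(x, x_m)
    M = len(xs)
    T_S = np.array([[K_S(xs[i], xs[j], n) for j in range(M)] for i in range(M)])
    coef = np.linalg.pinv(T_S) @ y
    return lambda x: sum(coef[m] * K_S(x, xs[m], n) for m in range(M))

rng = np.random.default_rng(2)
M, n = 30, 3
xs = -np.pi + (2 * np.arange(1, M + 1) - 1) * np.pi / M
y = np.sin(xs) + 0.1 * rng.normal(size=M)
fhat = lms_fit(xs, y, n)

# Cross-check: an explicit least-squares fit on a basis of S_n.
def design(x):
    cols = [np.ones_like(x)]
    for p in range(1, n + 1):
        cols += [np.sin(p * x), np.cos(p * x)]
    return np.column_stack(cols)

b = np.linalg.lstsq(design(xs), y, rcond=None)[0]
assert np.isclose(fhat(0.7), float(design(np.array([0.7])) @ b))
```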
A proof of theorem 1 is provided in appendix A. Let T be an M-dimensional matrix defined as

T = T_S† − T_H† T_S T_S† − T_S† T_S T_H† + T_H†.  (5.19)
Then the following theorem holds:

Theorem 2 (calculation of SIC for LMS learning). When the noise covariance matrix Q is estimated by equations 5.15 and 5.16, SIC for LMS learning is given as

SIC = ⟨Ty, y⟩ − σ̂² tr(T) + σ̂² tr(T_S†),  (5.20)

where σ̂² is given as

σ̂² = ( ‖y‖² − ⟨T_H† T_H y, y⟩ ) / ( M − dim(H) ).  (5.21)
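Theorem 2 reduces SIC to matrix operations on the two Gram matrices. The following sketch (hypothetical variable names; it assumes the kernels are available as ordinary functions) computes equations 5.19 through 5.21 directly:

```python
import numpy as np

def sic_lms(T_S, T_H, y, dim_H):
    # Theorem 2: SIC for LMS learning from the Gram matrices T_S and T_H.
    TSp = np.linalg.pinv(T_S)
    THp = np.linalg.pinv(T_H)
    T = TSp - THp @ T_S @ TSp - TSp @ T_S @ THp + THp            # equation 5.19
    sigma2 = (y @ y - y @ (THp @ (T_H @ y))) / (len(y) - dim_H)  # equation 5.21
    return y @ (T @ y) - sigma2 * np.trace(T) + sigma2 * np.trace(TSp)  # eq. 5.20

# Toy usage with the trigonometric kernel of section 6 (equation 6.8):
def K(x, xp, n):
    u = x - xp
    if abs(np.sin(u / 2)) < 1e-12:
        return float(2 * n + 1)
    return np.sin((2 * n + 1) * u / 2) / np.sin(u / 2)

rng = np.random.default_rng(0)
M, nH = 40, 8
xs = np.linspace(-3.0, 3.0, M)
y = np.sin(xs) + 0.1 * rng.normal(size=M)
gram = lambda n: np.array([[K(a, b, n) for b in xs] for a in xs])
val = sic_lms(gram(3), gram(nH), y, 2 * nH + 1)
```

Model selection then amounts to evaluating `sic_lms` for each candidate subspace and keeping the minimizer.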
A proof of theorem 2 is given in appendix B.

In practice, the calculation of the Moore-Penrose generalized inverse is sometimes unstable. To overcome the instability, we recommend using Tikhonov's regularization (Tikhonov & Arsenin, 1977):

T_S† ← T_S ( T_S² + aI )⁻¹,  (5.22)

where a is a small constant, say, a = 10⁻³. A complete algorithm of the MSIC procedure—finding the best model from {S_n}_n—is described in pseudocode in Figure 1.

6 Computer Simulations
In this section, computer simulations are performed to demonstrate the effectiveness of SIC compared with the network information criterion (NIC) (Murata et al., 1994), which is a generalized AIC.

6.1 Illustrative Example. Let the target function f(x) be

f(x) = √2 sin x + 2√2 cos x − √2 sin 2x − 2√2 cos 2x + √2 sin 3x − √2 cos 3x + 2√2 sin 4x − √2 cos 4x + √2 sin 5x − √2 cos 5x,  (6.1)

and the training examples {(x_m, y_m)}_{m=1}^M be

x_m = −π + (2m − 1)π / M,  (6.2)

y_m = f(x_m) + ε_m,  (6.3)
Figure 1: The MSIC procedure in pseudocode: find the best model from {S_n}_n.
where the noise ε_m is independently subject to the normal distribution with mean 0 and variance 3:

ε_m ∼ N(0, 3).  (6.4)
Let us consider the following models:

{S_n}_{n=1}^{20},  (6.5)

where S_n is a trigonometric polynomial space of order n, that is, a Hilbert space spanned by

{1, sin px, cos px}_{p=1}^{n},  (6.6)

and the inner product is defined as

⟨f, g⟩_{S_n} = (1/2π) ∫_{−π}^{π} f(x) g(x) dx.  (6.7)
The reproducing kernel of S_n is expressed as

K_{S_n}(x, x') = sin( (2n + 1)(x − x') / 2 ) / sin( (x − x') / 2 )   if x ≠ x',
K_{S_n}(x, x') = 2n + 1                                              if x = x'.  (6.8)

Our task is to find the best model S_n minimizing

Error = (1/2π) ∫_{−π}^{π} ( f̂_{S_n}(x) − f(x) )² dx,  (6.9)
where f̂_{S_n}(x) is the learning result obtained with the model S_n.

Let us consider the following two model selection methods:

A. SIC: We use SIC given by equation 3.15. LMS learning is adopted. The largest model S_20, which includes all models, is employed as H. The unbiased learning result f̂_u and the estimate σ̂² of the noise variance are obtained by equations 5.13 and 5.16, respectively.

B. NIC: The squared loss is adopted as the loss function. In this case, learning results obtained by the stochastic gradient-descent method (Amari, 1967) converge to the LMS estimator. The distribution of sample points given by equation 6.2 is regarded as a uniform distribution.

In both methods, no a priori information is used, and the LMS estimator is commonly adopted. Hence, the efficiency of SIC and NIC can be fairly compared by this simulation.

Figures 2, 3, and 4 show the simulation results when the numbers of training examples are 50, 100, and 200, respectively. The top graphs show the values of the error measured by equation 6.9, SIC, and NIC for each model, denoted by the solid, dashed, and dash-dotted lines, respectively. The horizontal axis denotes the highest order n of trigonometric polynomials in the model S_n. The bottom-left graphs show the target function f(x), the training examples, and the minimum SIC (MSIC) learning result, denoted by the solid line, the empty circles, and the dashed line, respectively. The bottom-right graphs show the target function f(x), the training examples, and the minimum NIC (MNIC) learning result, denoted by the solid line, empty circles, and the dash-dotted line, respectively.

When M = 50, the minimum value of the error measured by equation 6.9 is 0.98, attained by the model S_5. The MSIC model is S_5, and the error of the MSIC learning result is 0.98. The MNIC model is S_20, and the error of the MNIC learning result is 3.36. When M = 100, the minimum value of the error is 0.37, attained by the model S_5. The MSIC model is S_5, and the error is 0.37, while the MNIC model is S_9 and the error is 0.75. When M = 200, the minimum value of the error is 0.11, attained by the model S_5. The MSIC model is S_5 and the error is 0.11, while the MNIC model is S_6 and the error is 0.17.
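The whole experiment of this subsection can be condensed into a short, self-contained sketch (a reconstruction under stated assumptions, not the authors' code): data are generated per equations 6.1 through 6.4, Gram matrices are built from the kernel of equation 6.8, the Tikhonov-regularized pseudoinverse of equation 5.22 replaces the exact one, and theorem 2 supplies the SIC value for each candidate order.

```python
import numpy as np

def f_target(x):
    # equation 6.1
    s2 = np.sqrt(2)
    return (s2 * np.sin(x) + 2 * s2 * np.cos(x) - s2 * np.sin(2 * x)
            - 2 * s2 * np.cos(2 * x) + s2 * np.sin(3 * x) - s2 * np.cos(3 * x)
            + 2 * s2 * np.sin(4 * x) - s2 * np.cos(4 * x)
            + s2 * np.sin(5 * x) - s2 * np.cos(5 * x))

def K(x, xp, n):
    # reproducing kernel of S_n (equation 6.8)
    u = x - xp
    if abs(np.sin(u / 2)) < 1e-12:
        return float(2 * n + 1)
    return np.sin((2 * n + 1) * u / 2) / np.sin(u / 2)

def gram(xs, n):
    return np.array([[K(a, b, n) for b in xs] for a in xs])

def reg_pinv(T, a=1e-3):
    # Tikhonov-regularized pseudoinverse (equation 5.22)
    return T @ np.linalg.inv(T @ T + a * np.eye(len(T)))

def sic(T_S, T_H, y, dim_H):
    # Theorem 2 (equations 5.19-5.21)
    TSp, THp = reg_pinv(T_S), reg_pinv(T_H)
    T = TSp - THp @ T_S @ TSp - TSp @ T_S @ THp + THp
    sigma2 = (y @ y - y @ (THp @ (T_H @ y))) / (len(y) - dim_H)
    return y @ (T @ y) - sigma2 * np.trace(T) + sigma2 * np.trace(TSp)

rng = np.random.default_rng(0)
M, nH = 100, 20
m = np.arange(1, M + 1)
xs = -np.pi + (2 * m - 1) * np.pi / M               # equation 6.2
y = f_target(xs) + np.sqrt(3) * rng.normal(size=M)  # equations 6.3-6.4
T_H = gram(xs, nH)
best = min(range(1, nH + 1), key=lambda n: sic(gram(xs, n), T_H, y, 2 * nH + 1))
print("MSIC model: S_%d" % best)  # the paper reports S_5 for these settings
```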
Figure 2: Simulation results when the number M of training examples is 50. (Top) The values of the error measured by equation 6.9, SIC, and NIC for each model, denoted by the solid, dashed, and dash-dotted lines, respectively. The horizontal axis denotes the highest order n of trigonometric polynomials in the model S_n. (Bottom left) The target function f(x), training examples, and MSIC learning result, denoted by the solid line, empty circles, and the dashed line, respectively. (Bottom right) The target function f(x), training examples, and the MNIC learning result, denoted by the solid line, empty circles, and the dash-dotted line, respectively. The minimum value of the error is 0.98, attained by the model S_5. The MSIC model is S_5, and the error is 0.98. The MNIC model is S_20, and the error is 3.36.
These results show that when M is large, both SIC and NIC give reasonable learning results (see Figures 3 and 4). However, when M = 50, SIC outperforms NIC (see Figure 2). This implies that SIC works well even when the number of training examples is small.
Figure 3: Simulation results when the number M of training examples is 100. The minimum value of the error is 0.37, attained by the model S_5. The MSIC model is S_5, and the error is 0.37. The MNIC model is S_9, and the error is 0.75.
As mentioned in section 4.2, NIC neglects terms that are not dominant for model selection, while SIC gives an unbiased estimate of the generalization error. The top graph in Figure 4 clearly shows this difference: SIC approximates the true error well, while the value of NIC is larger than the true error. Generally, neglecting nondominant terms does not affect the performance of model selection. However, this graph shows that for 5 ≤ n ≤ 20, the slope of NIC is gentler than that of the true error, while the slope of SIC is in good agreement with it. This implies that SIC is expected to be robust against noise, since the discrimination of models by SIC is clearer than by NIC.
Figure 4: Simulation results when the number M of training examples is 200. The minimum value of the error is 0.11, attained by the model S_5. The MSIC model is S_5, and the error is 0.11. The MNIC model is S_6, and the error is 0.17.
6.2 Interpolation of Chaotic Series. Let us consider the problem of interpolating a chaotic series created by the Mackey-Glass delay-difference equation (see, e.g., Platt, 1991). The models S_n are spaces of polynomials of degree at most n on [−1, 1], whose reproducing kernels are expressed as

K_{S_n}(x, x') = ( (n + 1) / (2(x − x')) ) [ P_{n+1}(x) P_n(x') − P_n(x) P_{n+1}(x') ]   if x ≠ x',
K_{S_n}(x, x') = ( (n + 1)² / (2(1 − x²)) ) [ P_n(x)² − 2x P_n(x) P_{n+1}(x) + P_{n+1}(x)² ]   if x = x',  (6.18)
where P_n(x) is the Legendre polynomial of order n, defined as

P_n(x) = ( 1 / (2ⁿ n!) ) dⁿ/dxⁿ (x² − 1)ⁿ.  (6.19)
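The closed form of equation 6.18 can be checked numerically against the direct expansion Σ_{p=0}^{n} ((2p + 1)/2) P_p(x) P_p(x'), which is the reproducing kernel of polynomials of degree at most n under the inner product ∫_{−1}^{1} f g dx (an assumption here, consistent with the Christoffel-Darboux formula):

```python
import numpy as np
from numpy.polynomial import legendre

def P(p, x):
    # Legendre polynomial of order p (equation 6.19)
    return legendre.legval(x, [0] * p + [1])

def K_closed(x, xp, n):
    # closed form of the reproducing kernel (equation 6.18)
    if not np.isclose(x, xp):
        return (n + 1) / (2 * (x - xp)) * (
            P(n + 1, x) * P(n, xp) - P(n, x) * P(n + 1, xp))
    return (n + 1) ** 2 / (2 * (1 - x ** 2)) * (
        P(n, x) ** 2 - 2 * x * P(n, x) * P(n + 1, x) + P(n + 1, x) ** 2)

def K_sum(x, xp, n):
    # direct expansion over the orthonormalized Legendre basis of S_n
    return sum((2 * p + 1) / 2 * P(p, x) * P(p, xp) for p in range(n + 1))

assert np.isclose(K_closed(0.3, -0.5, 6), K_sum(0.3, -0.5, 6))
assert np.isclose(K_closed(0.3, 0.3, 6), K_sum(0.3, 0.3, 6))
```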
We again compare SIC and NIC. S_40 is adopted as H in SIC. The error is measured by

Error = (1/200) Σ_{t=1}^{200} ( f̂_h( −0.995 + (2/200)(t − 1) ) − f_t )².  (6.20)
The distributions of errors over 1000 trials are shown in Figures 6, 7, and 8, where the numbers of training examples are 50, 150, and 250, respectively. The horizontal axis denotes the error measured by equation 6.20, while the vertical axis denotes the number of trials in which the corresponding generalization error is obtained. The distributions of the errors by SIC and NIC are almost the same when the number M of training examples is 250 (see Figure 8). However, when M = 50 and M = 150, SIC tends to outperform NIC (see Figures 6 and 7). This simulation again shows that SIC works well even when the number of training examples is small.

7 Conclusion
We have proposed a new model selection criterion, the subspace information criterion (SIC). It is assumed that the learning target function belongs to a specified functional Hilbert space, and the generalization error is defined as the Hilbert space squared norm of the difference between the learning result function and the target function. SIC gives an unbiased estimate of the generalization error. SIC assumes the availability of an unbiased estimate of
Figure 6: Distributions of errors by 1000 trials when the number M of training examples is 50. The horizontal axis denotes the error measured by equation 6.20, while the vertical axis denotes the number of trials.
the target function and the noise covariance matrix, which are generally unknown. A practical calculation method of SIC for least-mean-squares learning was provided under the assumption that the dimension of the Hilbert space is smaller than the number of training examples. The range of application of SIC for LMS learning is thus limited to the finite-dimensional Hilbert space case. A practical estimation method of an unbiased learning result and the
Figure 7: Distributions of errors by 1000 trials when the number M of training examples is 150.
noise covariance matrix for the infinite-dimensional Hilbert space case is our important future work.

Appendix A: Proof of Theorem 1
It follows from equation 5.2 that

⟨K(·, x_{m'}), K(·, x_m)⟩ = K(x_m, x_{m'}).  (A.1)
Figure 8: Distributions of errors by 1000 trials when the number M of training examples is 250.
Hence, equations 5.4 and 5.8 yield

A_S A_S* = Σ_{m=1}^{M} Σ_{m'=1}^{M} K_S(x_m, x_{m'}) ( e_m ⊗ e_{m'} ) = T_S,  (A.2)

A_H A_H* = Σ_{m=1}^{M} Σ_{m'=1}^{M} K_H(x_m, x_{m'}) ( e_m ⊗ e_{m'} ) = T_H.  (A.3)
From equations 3.2, 5.7, 3.5, 5.14, and 5.11, we have

f̂_h = X_h y = A_S† y = A_S* ( A_S A_S* )† y = A_S* T_S† y,  (A.4)

f̂_u = X_u y = A_H† y = A_H* ( A_H A_H* )† y = A_H* T_H† y,  (A.5)

which imply equations 5.17 and 5.18 because of equations 5.4 and 5.8.

Appendix B: Proof of Theorem 2
Since S is a subspace of H, it follows from equations 5.8, 5.4, A.1, and A.2 that

A_H A_S* = Σ_{m=1}^{M} Σ_{m'=1}^{M} ⟨K_S(·, x_{m'}), K_H(·, x_m)⟩ ( e_m ⊗ e_{m'} )
         = Σ_{m=1}^{M} Σ_{m'=1}^{M} K_S(x_m, x_{m'}) ( e_m ⊗ e_{m'} )
         = T_S.  (B.1)
It follows from equations A.4, A.5, B.1, and 5.19 that

‖f̂_h − f̂_u‖² = ‖A_S* T_S† y − A_H* T_H† y‖²
             = ⟨( A_S* T_S† − A_H* T_H† )* ( A_S* T_S† − A_H* T_H† ) y, y⟩
             = ⟨Ty, y⟩.  (B.2)
It follows from equations 3.10, 5.7, 5.14, 5.11, 5.15, A.2, B.1, and 5.19 that

tr( X_0 Q X_0* ) = σ̂² tr( ( A_S† − A_H† )( A_S† − A_H† )* )
                 = σ̂² tr( ( A_S† − A_H† )* ( A_S† − A_H† ) )
                 = σ̂² tr( T ).  (B.3)

It follows from equations 5.7, 5.15, and A.2 that

tr( X_h Q X_h* ) = σ̂² tr( A_S† ( A_S† )* ) = σ̂² tr( ( A_S† )* A_S† )
                 = σ̂² tr( ( A_S A_S* )† ) = σ̂² tr( T_S† ).  (B.4)
From equations 5.16, 5.2, A.5, and A.3, we have

σ̂² = ‖A_H f̂_u − y‖² / ( M − dim(H) )
   = ‖A_H A_H* T_H† y − y‖² / ( M − dim(H) )
   = ( ‖y‖² − ⟨T_H† T_H y, y⟩ ) / ( M − dim(H) ).  (B.5)
Substituting equations B.2 through B.5 into equation 3.13, we have equations 5.20 and 5.21.
Acknowledgments
We thank the anonymous referees for bringing C_L to our attention and for their valuable comments.

References

Abramowitz, M., & Segun, I. A. (Eds.). (1964). Handbook of mathematical functions with formulas, graphs, and mathematical tables. New York: Dover.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19(6), 716–723.
Akaike, H. (1980). Likelihood and the Bayes procedure. In N. J. Bernardo, M. H. DeGroot, D. V. Lindley, & A. F. M. Smith (Eds.), Bayesian statistics (pp. 141–166). Valencia: University Press.
Akaike, H., & Kitagawa, G. (Eds.). (1994). The practice of time series analysis I. Tokyo: Asakura Syoten. (In Japanese)
Akaike, H., & Kitagawa, G. (Eds.). (1995). The practice of time series analysis II. Tokyo: Asakura Syoten. (In Japanese)
Albert, A. (1972). Regression and the Moore-Penrose pseudoinverse. New York: Academic Press.
Allen, D. (1974). The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16, 125–127.
Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16(3), 299–307.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Ben-Israel, A., & Greville, T. N. E. (1974). Generalized inverses: Theory and applications. New York: Wiley.
Bergman, S. (1970). The kernel function and conformal mapping. Providence, RI: American Mathematical Society.
Bozdogan, H. (Ed.). (1994). Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach. Norwell, MA: Kluwer.
Cavanaugh, J. E., & Shumway, R. H. (1997). A bootstrap variant of AIC for state space model selection. Statistica Sinica, 7, 473–496.
Cherkassky, V., Shao, X., Mulier, F. M., & Vapnik, V. N. (1999). Model complexity control for regression using VC generalization bounds. IEEE Transactions on Neural Networks, 10(5), 1075–1089.
Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31, 377–403.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1), 1–26.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81(394), 461–470.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Fedorov, V. V. (1972). Theory of optimal experiments. New York: Academic Press.
Freud, G. (1966). Orthogonal polynomials. Oxford: Pergamon Press.
Fujikoshi, Y., & Satoh, K. (1997). Modified AIC and Cp in multivariate linear regression. Biometrika, 84, 707–716.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Gorman, J. W., & Toman, R. J. (1966). Selection of variables for fitting equations to data. Technometrics, 8(1), 27–51.
Hurvich, C. M., Simonoff, J. S., & Tsai, C. L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society, Series B, 60, 271–293.
Hurvich, C. M., & Tsai, C. L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297–307.
Hurvich, C. M., & Tsai, C. L. (1991). Bias of the corrected AIC criterion for underfitted regression and time series models. Biometrika, 78, 499–509.
Hurvich, C. M., & Tsai, C. L. (1993). A corrected Akaike information criterion for vector autoregressive model selection. Journal of Time Series Analysis, 14, 271–279.
Ishiguro, M., Sakamoto, Y., & Kitagawa, G. (1997). Bootstrapping log likelihood and EIC, an extension of AIC. Annals of the Institute of Statistical Mathematics, 49, 411–434.
Kitagawa, G., & Gersch, W. (1996). Smoothness priors analysis of time series. New York: Springer-Verlag.
Konishi, S., & Kitagawa, G. (1996). Generalized information criterion in model selection. Biometrika, 83, 875–890.
Li, K. (1986). Asymptotic optimality of C_L and generalized cross-validation in ridge regression with application to spline smoothing. Annals of Statistics, 14(3), 1101–1112.
MacKay, D. (1992). Bayesian interpolation. Neural Computation, 4(3), 415–447.
Mallows, C. L. (1964). Choosing variables in a linear regression: A graphical aid. Paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, KS.
Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15(4), 661–675.
McQuarrie, A. D. R., & Tsai, C. L. (1998). Regression and time series model selection. Singapore: World Scientific.
Mosteller, F., & Wallace, D. (1963). Inference in an authorship problem. A comparative study of discrimination methods applied to the authorship of the disputed Federalist Papers. Journal of the American Statistical Association, 58, 275–309.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—Determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6), 865–872.
Nakashima, A., & Ogawa, H. (1999). How to design a regularization term for improving generalization. In Proceedings of ICONIP'99, the 6th International Conference on Neural Information Processing, 1 (pp. 222–227). Perth, Australia.
Noda, K., Miyaoka, E., & Itoh, M. (1996). On bias correction of the Akaike information criterion in linear models. Communications in Statistics: Theory and Methods, 25, 1845–1857.
Ogawa, H. (1987). Projection filter regularization of ill-conditioned problem. In Proceedings of SPIE, Inverse Problems in Optics, 808 (pp. 189–196).
Ogawa, H. (1992). Neural network learning, generalization and over-learning. In Proceedings of the ICIIPS'92, International Conference on Intelligent Information Processing and System, 2 (pp. 1–6). Beijing, China.
Oja, E., & Ogawa, H. (1986). Parametric projection filter for image and signal restoration. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-34(6), 1643–1653.
Onoda, T. (1995). Neural network information criterion for the optimal number of hidden units. In Proceedings of IEEE International Conference on Neural Networks, ICNN'95 (pp. 275–280). Perth, Australia.
Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3(2), 213–225.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1481–1497.
Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Society, Series B, 49(3), 223–239.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, IT-42(1), 40–47.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986a). Learning representations by back-propagating errors. Nature, 323, 533–536.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986b). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, 1 (pp. 318–362). Cambridge, MA: MIT Press.
Saitoh, S. (1988). Theory of reproducing kernels and its applications. London: Longman Scientific & Technical.
Saitoh, S. (1997). Integral transform, reproducing kernels and their applications. London: Longman.
Sato, K., Kobayashi, M., & Fujikoshi, Y. (1997). Variable selection for the growth curve model. Journal of Multivariate Analysis, 60, 277–292.
Savage, L. J. (1954). The foundation of statistics. New York: Wiley.
Schatten, R. (1970). Norm ideals of completely continuous operators. New York: Springer-Verlag.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1998). Advances in kernel methods: Support vector machines. Cambridge, MA: MIT Press.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Shibata, R. (1989). Statistical aspects of model selection. In J. C. Willems (Ed.), From data to model (pp. 375–394). New York: Springer-Verlag.
Shibata, R. (1997). Bootstrap estimate of Kullback-Leibler information for model selection. Statistica Sinica, 7, 375–394.
Simonoff, J. S. (1998). Three sides of smoothing: Categorical data smoothing, nonparametric regression, and density estimation. International Statistical Review, 66, 137–156.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111–147.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B, 39, 44–47.
Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections. Communications in Statistics: Theory and Methods, 7(1), 13–26.
Szegő, G. (1939). Orthogonal polynomials. Providence, RI: American Mathematical Society.
Takemura, A. (1991). Modern mathematical statistics. Tokyo: Sobunsya. (In Japanese)
Takeuchi, K. (1976). Distribution of information statistics and validity criteria of models. Mathematical Science, 153, 12–18. (In Japanese)
Takeuchi, K. (1983). On the selection of statistical models by AIC. Journal of the Society of Instrument and Control Engineering, 22(5), 445–453. (In Japanese)
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington, DC: V. H. Winston.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wada, Y., & Kawato, M. (1991). Estimation of generalization capability by combination of new information criterion and cross validation. Transactions of the IEICE, J74-D-II(7), 955–965. (In Japanese)
Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics.
Wong, W. (1983). A note on the modified likelihood for density estimation. Journal of the American Statistical Association, 78(382), 461–463.
Yamanishi, K. (1998). A decision-theoretic extension of stochastic complexity and its application to learning. IEEE Transactions on Information Theory, IT-44, 1424–1439.
Received September 7, 1999; accepted October 2, 2000.
LETTER
Communicated by John Platt
Convergent Decomposition Techniques for Training RBF Neural Networks

C. Buzzi
L. Grippo
Dipartimento di Informatica e Sistemistica, Università di Roma "La Sapienza," Via Buonarroti 12, 00185 Roma, Italy

M. Sciandrone
Istituto di Analisi dei Sistemi ed Informatica del CNR, Viale Manzoni 30, 00185 Roma, Italy

In this article we define globally convergent decomposition algorithms for supervised training of generalized radial basis function neural networks. First, we consider training algorithms based on the two-block decomposition of the network parameters into the vector of weights and the vector of centers. Then we define a decomposition algorithm in which the selection of the center locations is split into sequential minimizations with respect to each center, and we give a suitable criterion for choosing the centers that must be updated at each step. We prove the global convergence of the proposed algorithms and report the computational results obtained for a set of test problems.

1 Introduction
Neural Computation 13, 1891–1920 (2001). © 2001 Massachusetts Institute of Technology

We consider a generalized radial basis function (RBF) neural network with M hidden nodes and one output unit, whose input-output mapping is defined by

y(x; w_1, …, w_M, c_1, …, c_M) = Σ_{i=1}^{M} w_i φ( ‖x − c_i‖² ),

where x ∈ R^n is the input vector, w_i ∈ R, for i = 1, …, M, are the output weights, c_i ∈ R^n, for i = 1, …, M, are the centers, φ: R⁺ → R⁺ is an RBF, and ‖·‖ is the Euclidean norm. Given a set of input-output training pairs (x^p, t^p) ∈ R^n × R, for p = 1, …, P, and letting w = (w_1, …, w_M) and c = (c_1, …, c_M), we denote by E_p(w, c) a measure of the distance between the desired output t^p corresponding to the input x^p and the actual output y(x^p; w, c) of the network, so that
the overall error function E is given by

E(w, c) = Σ_{p=1}^{P} E_p(w, c).
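A minimal sketch of the network mapping and the error E follows. The Gaussian choice φ(r) = exp(−r) applied to squared distances and the squared output error for E_p are assumptions for illustration; the article leaves both generic.

```python
import numpy as np

def rbf_output(x, w, c, phi=lambda r: np.exp(-r)):
    # y(x; w, c) = sum_i w_i * phi(||x - c_i||^2)
    r2 = ((c - x) ** 2).sum(axis=1)   # squared distances to the M centers
    return w @ phi(r2)

def total_error(X, t, w, c):
    # E(w, c) = sum_p E_p(w, c), with E_p taken as the squared output error
    return sum((rbf_output(xp, w, c) - tp) ** 2 for xp, tp in zip(X, t))

rng = np.random.default_rng(0)
n, M, P = 2, 4, 10
c = rng.normal(size=(M, n))    # centers
w = rng.normal(size=M)         # output weights
X = rng.normal(size=(P, n))    # training inputs
t = rng.normal(size=P)         # training targets
E = total_error(X, t, w, c)
assert E >= 0.0
```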
Two basic approaches are commonly adopted for tackling the learning problem in generalized RBF networks (Bishop, 1995; Haykin, 1999): (1) unsupervised selection of the centers and supervised learning of the output weights and (2) supervised learning of both center positions and output weights.

The first strategy consists of fixing the center positions c_i (and possibly the other internal parameters of each unit) using an unsupervised technique and then estimating the linear weights w_i of the output layer by means of a supervised technique, which consists of minimizing E with respect to w and typically corresponds to solving a linear least-squares problem. The simplest criterion for choosing the center positions is to set them equal to a random subset of the input vectors from the training set. As an improvement on this procedure, several clustering techniques have also been suggested, such as the K-means clustering algorithm of Moody and Darken (1989), which attempt to estimate locations for the centers that more accurately reflect the distribution of the data points (see Bishop, 1995, and Haykin, 1999, for a review). We note that the use of unsupervised techniques takes no account of the target values associated with the training inputs, and it may lead to the use of an unnecessarily large number of basis functions in order to achieve adequate performance.

The alternative training strategy consists of the supervised selection of both center positions and output weights (Poggio & Girosi, 1990), and it can be implemented by minimizing the training error E with respect to both w and c using gradient-descent techniques. Computational experiments performed on the NETtalk task (Wettschereck & Dietterich, 1992) have shown that generalized RBF networks with supervised selection of center positions yield improved generalization performance in comparison with RBF networks based on unsupervised selection of centers and supervised learning of the output weights.
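The first strategy is easy to sketch: with the centers frozen, E is quadratic in w, so the supervised step is a single linear least-squares solve. The sketch below uses the simplest unsupervised choice (a random subset of inputs as centers) and a Gaussian basis function; both are illustrative assumptions.

```python
import numpy as np

def rbf_design(X, c, phi=lambda r: np.exp(-r)):
    # P x M design matrix with entries phi(||x_p - c_i||^2)
    return phi(((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2))

def fit_weights(X, t, c):
    # With the centers fixed, E is quadratic in w:
    # solve the linear least-squares problem min_w ||Phi w - t||^2.
    return np.linalg.lstsq(rbf_design(X, c), t, rcond=None)[0]

rng = np.random.default_rng(3)
P, n, M = 40, 2, 5
X = rng.normal(size=(P, n))
t = np.sin(X[:, 0]) + X[:, 1] ** 2

# Simplest unsupervised choice: centers = random subset of the training inputs
# (a clustering method such as K-means could be substituted here).
c = X[rng.choice(P, size=M, replace=False)]
w = fit_weights(X, t, c)
resid = np.linalg.norm(rbf_design(X, c) @ w - t)
assert resid <= np.linalg.norm(t) + 1e-9
```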
However, choosing the basis function parameters by supervised learning may constitute a difficult nonlinear optimization problem, which can be quite computationally intensive for large values of the input dimension and of the size of the training set.

In order to overcome this difficulty to some extent, while retaining the advantages of supervised learning, we can attempt to decompose the optimization problem into a sequence of simpler problems by partitioning the problem variables into different blocks and then employing some block-descent technique based on alternate partial minimizations with respect to the different blocks. This permits the use of specialized techniques for the solution of the subproblems, which can be advantageous in terms of both computing times and error values. However, appropriate conditions should be satisfied in order to guarantee convergence toward minimizers
Techniques for Training RBF Neural Networks
1893
of the error function. In fact, block descent methods may not converge in the nonconvex case unless suitable precautions are adopted for avoiding oscillatory behavior (Bertsekas, 1999; Powell, 1973). In this article, on the basis of some recent results on block descent methods for nonlinear optimization (Grippo & Sciandrone, 1999, 2000), we dene convergent decomposition algorithms for supervised training of generalized RBF networks. First, we consider a two-block training scheme, based on alternating the optimization with respect to the output weights with the optimization of the center locations. In this scheme, as in the case of unsupervised learning, we retain the possibility of computing a global optimum with respect to the output weights by solving a least-squares problem, but we can also guarantee convergence toward stationary points of the error function with respect to both c and w. However, when the number M of centers and the dimension n of the input space are very large, even the approximate minimization with respect to c can be computationally expensive. Therefore, we dene a decomposition algorithm in which the optimization of the center locations is split into sequential minimizations with respect to each center ci , and we introduce a suitable criterion for selecting the centers that must be updated at each step. In this way, unnecessary operations can be avoided; moreover, when a single center is updated, considerable computational savings can be obtained by exploiting the structure of the objective function in the minimization process. The convergence of this technique is ensured by imposing appropriate conditions on the approximate minimizations with respect to ci . In section 2 we formally dene the learning problem for RBF networks and introduce some basic notation. 
In section 3 we describe two-block learning algorithms based on alternating the optimization with respect to the output weights with the optimization of the center locations, and we prove the convergence of these schemes. In section 4 we define a globally convergent learning algorithm based on the additional decomposition of the vector $c$ into $M$ blocks. In section 5 we show the results obtained with the proposed algorithms in the solution of some standard test problems taken from Prechelt (1994); in particular, we report the results of early-stopping experiments and comparisons with an efficient reduced-memory quasi-Newton algorithm of the Numerical Algorithms Group (NAG) library, in terms of both computing times and generalization errors. Finally, some concluding remarks are given in section 6. In appendixes A and B we collect the convergence proofs of the proposed algorithms. In appendix C we describe in detail a line search procedure employed in the convergence analysis and the computational experimentation.
C. Buzzi, L. Grippo, and M. Sciandrone
2 Problem Formulation and Basic Notation
Given the set of input-output training pairs $(x^p, t^p) \in \mathbb{R}^n \times \mathbb{R}$, for $p = 1, \ldots, P$, we define the least-squares error:

$$E_{LS}(w, c) = \sum_{p=1}^{P} \left( \sum_{i=1}^{M} w_i\, \varphi\!\left(\|x^p - c_i\|^2\right) - t^p \right)^2,$$
where it is assumed that the function $\varphi$ is continuously differentiable on $\mathbb{R}_+$. If we introduce the $P \times M$ matrix $W(c)$ with elements $(W(c))_{p,i} = \varphi(\|x^p - c_i\|^2)$, and we set $t = (t^1, \ldots, t^P)^T$, then the error function $E_{LS}$ can be put into the form $E_{LS}(w, c) = \|W(c)w - t\|^2$. For the solution of the training problem, we consider a regularized error function (Bishop, 1995), which is obtained by adding to $E_{LS}$ a regularization term expressed as the squared norm of the network parameters, that is:

$$E(w, c) = E_{LS}(w, c) + \varepsilon \left( \|w\|^2 + \|c\|^2 \right), \tag{2.1}$$

where $\varepsilon$ is a given positive number. We note that the function $E$ can be put into the form

$$E(w, c) = \left\| \begin{pmatrix} W(c) \\ \sqrt{\varepsilon}\, I \end{pmatrix} w - \begin{pmatrix} t \\ 0 \end{pmatrix} \right\|^2 + \varepsilon \|c\|^2,$$

and hence, if we introduce the $(P + M) \times M$ matrix $\widetilde{W}(c)$ and the $(P + M)$-vector $\tilde{t}$ defined by

$$\widetilde{W}(c) := \begin{pmatrix} W(c) \\ \sqrt{\varepsilon}\, I \end{pmatrix}, \qquad \tilde{t} := \begin{pmatrix} t \\ 0 \end{pmatrix},$$

where $I$ is the $M \times M$ identity matrix, we can also write

$$E(w, c) = \left\| \widetilde{W}(c)\, w - \tilde{t} \right\|^2 + \varepsilon \|c\|^2.$$

Given a vector $(w^0, c^0) \in \mathbb{R}^M \times \mathbb{R}^{Mn}$ of network parameters, we can define the level set of $E$ corresponding to $E(w^0, c^0)$, that is:

$$\mathcal{L}^0 = \left\{ (w, c) \in \mathbb{R}^M \times \mathbb{R}^{Mn} : E(w, c) \le E(w^0, c^0) \right\}.$$
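The augmented formulation is convenient computationally, since the $w$-subproblem stays an ordinary linear least-squares problem. A small numerical check of the identity, with a placeholder design matrix standing in for $W(c)$:

```python
import numpy as np

def regularized_weights(W, t, eps):
    """Solve min_w ||Wtilde w - ttilde||^2 with Wtilde = [W; sqrt(eps) I]."""
    P, M = W.shape
    Wt = np.vstack([W, np.sqrt(eps) * np.eye(M)])   # (P+M) x M
    tt = np.concatenate([t, np.zeros(M)])           # (P+M)-vector
    w, *_ = np.linalg.lstsq(Wt, tt, rcond=None)
    return w, Wt, tt

rng = np.random.default_rng(1)
W = rng.normal(size=(30, 5))    # placeholder for W(c)
t = rng.normal(size=30)
eps = 1e-6
w, Wt, tt = regularized_weights(W, t, eps)

# the augmented residual equals the least-squares error plus eps*||w||^2
lhs = np.sum((Wt @ w - tt) ** 2)
rhs = np.sum((W @ w - t) ** 2) + eps * np.sum(w ** 2)
```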
We note that the function $E$ has the following properties:

$E$ is a strictly convex quadratic function of $w$ for fixed $c$.

The level set $\mathcal{L}^0$ is compact for any given $(w^0, c^0) \in \mathbb{R}^M \times \mathbb{R}^{Mn}$.

$E$ admits a global minimum point in $\mathbb{R}^M \times \mathbb{R}^{Mn}$.
Then the supervised training problem associated with the given data set can be formulated as the optimization problem

$$\min_{w \in \mathbb{R}^M,\; c \in \mathbb{R}^{Mn}} E(w, c). \tag{2.2}$$
We will indicate by $\nabla_w E(w, c) \in \mathbb{R}^M$, $\nabla_{c_i} E(w, c) \in \mathbb{R}^n$, and $\nabla_c E(w, c) \in \mathbb{R}^{Mn}$ the partial gradients of $E(w, c)$ with respect to $w$, $c_i$, and $c$, that is:

$$\nabla_w E(w, c) = 2\, \widetilde{W}(c)^T \left( \widetilde{W}(c)\, w - \tilde{t} \right)$$

$$\nabla_{c_i} E(w, c) = -4 w_i \left[ \sum_{p=1}^{P} \left( \sum_{h=1}^{M} w_h\, \varphi\!\left(\|x^p - c_h\|^2\right) - t^p \right) \varphi'\!\left(\|x^p - c_i\|^2\right) (x^p - c_i) \right] + 2 \varepsilon\, c_i$$

$$\nabla_c E(w, c) = \begin{pmatrix} \nabla_{c_1} E(w, c) \\ \nabla_{c_2} E(w, c) \\ \vdots \\ \nabla_{c_M} E(w, c) \end{pmatrix},$$

where $\varphi'$ is the first-order derivative of $\varphi$.

3 Two-Block Learning Algorithms
In this section, we consider two-block training procedures, which are defined as a sequence of iterations consisting of two steps. In the first step, starting from the current estimate $(w^k, c^k)$, we compute an estimate $w^{k+1}$ of the output weights, holding fixed the vector $c^k$. In the second step, holding fixed $w^{k+1}$, we compute a new vector $c^{k+1}$ that updates the center positions. In this approach, as in the case of unsupervised learning of center locations, we retain a distinction between the optimization of the output weights, which can be performed by solving a linear least-squares problem, and the construction of the hidden units. However, the activation functions of the hidden units are now computed through a nonlinear optimization technique, and the repeated alternation of the two steps asymptotically yields a stationary point in the extended space of the parameters $w$ and $c$. Different schemes can be defined in relation to the technique used for updating the vector $c$. A first possibility is to perform an approximate minimization of the error function with respect to $c$, using some
convergent minimization algorithm, which terminates with a new vector $c^{k+1}$ when some stopping criterion is satisfied. This strategy corresponds to a modified version of the block nonlinear Gauss-Seidel method (see, e.g., Bertsekas, 1999), and it is described formally in the following scheme, where the stopping rule is defined through an adjustable upper bound $\xi^k$ on the norm of the partial gradient $\nabla_c E$:

Algorithm 1

Data. $w^0 \in \mathbb{R}^M$, $c^0 \in \mathbb{R}^{Mn}$, and a sequence $\xi^k > 0$ such that $\xi^k \to 0$.

Step 0. Set $k = 0$.

Step 1. Compute $w^{k+1} = \arg\min_w E(w, c^k)$, by solving the linear least-squares problem

$$\min_{w \in \mathbb{R}^M} \left\| \widetilde{W}(c^k)\, w - \tilde{t} \right\|^2.$$

Step 2. Using a minimization method, compute $c^{k+1}$ such that

$$E(w^{k+1}, c^{k+1}) \le E(w^{k+1}, c^k) \quad \text{and} \quad \left\| \nabla_c E(w^{k+1}, c^{k+1}) \right\| \le \xi^k.$$

Step 3. Set $k = k + 1$ and go to step 1.
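A minimal sketch of algorithm 1 in code, assuming a Gaussian basis and a plain backtracking gradient descent as the inner minimization method for step 2 (the scheme leaves that method open; the basis, data, and constants here are illustrative choices, not the paper's setup):

```python
import numpy as np

phi  = lambda d2: np.exp(-d2)        # illustrative Gaussian basis
dphi = lambda d2: -np.exp(-d2)       # its derivative phi'

def design(X, C):
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # P x M
    return phi(d2), d2

def error(w, C, X, t, eps):
    W, _ = design(X, C)
    return np.sum((W @ w - t) ** 2) + eps * (w @ w + np.sum(C ** 2))

def grad_c(w, C, X, t, eps):
    W, d2 = design(X, C)
    r = W @ w - t                    # residuals, length P
    G = np.empty_like(C)
    for i in range(len(C)):
        # -4 w_i sum_p r_p phi'(||x^p - c_i||^2)(x^p - c_i) + 2 eps c_i
        G[i] = -4 * w[i] * (r * dphi(d2[:, i])) @ (X - C[i]) + 2 * eps * C[i]
    return G

def algorithm1(X, t, C0, eps=1e-6, outer=15):
    C = C0.copy()
    for k in range(outer):
        # step 1: global optimum in w via regularized linear least squares
        W, _ = design(X, C)
        M = C.shape[0]
        Wt = np.vstack([W, np.sqrt(eps) * np.eye(M)])
        tt = np.concatenate([t, np.zeros(M)])
        w, *_ = np.linalg.lstsq(Wt, tt, rcond=None)
        # step 2: descend in c until ||grad_c E|| <= xi_k, with xi_k -> 0
        xi = 1.0 / (k + 1)
        for _ in range(100):
            G = grad_c(w, C, X, t, eps)
            if np.linalg.norm(G) <= xi:
                break
            a, E0 = 1e-1, error(w, C, X, t, eps)
            while error(w, C - a * G, X, t, eps) > E0 and a > 1e-12:
                a *= 0.5             # backtrack until E decreases
            C = C - a * G
    return w, C

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
t = np.sin(X[:, 0])
C0 = X[:5].copy()
w, C = algorithm1(X, t, C0)
```

Both steps only decrease $E$, so the error is monotonically nonincreasing over the outer iterations.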
Although block-coordinate methods may not converge in the general nonconvex case, Grippo and Sciandrone (1999) have shown that convergence can be established in the case of a two-block decomposition, even in the absence of any convexity assumption on the objective function. In our case, the convergence proof can be further simplified owing to the fact that $E$ is strictly convex with respect to $w$. Taking this into account, we can establish the following proposition, whose proof is reported in appendix A.

Proposition 1. Let $\{(w^k, c^k)\}$ be the sequence generated by algorithm 1. Then:

i. $\{(w^k, c^k)\}$ has limit points.

ii. The sequence $\{E(w^k, c^k)\}$ converges to a limit.

iii. Every limit point of $\{(w^k, c^k)\}$ is a stationary point of $E$.

In practice, step 2 of algorithm 1 can be performed by using any convergent algorithm for unconstrained optimization, such as the steepest-descent method. However, a more convenient approach could be to perform only a fixed number of minimization steps with respect to $c$. In this case, in order to define a convergent algorithm, we must guarantee that appropriate conditions are satisfied at each iteration. Sufficient convergence conditions can be imposed, for instance, by choosing as a search direction in $\mathbb{R}^{Mn}$ the negative partial gradient with respect to $c$, that is, $d^k = -\nabla_c E(w^{k+1}, c^k)$,
and then requiring that $E(w^{k+1}, c^{k+1}) \le E(w^{k+1}, c^k + \alpha^k d^k)$, where the step size $\alpha^k$ along $d^k$ is computed by means of a "convergent" line search algorithm. More specifically, we can require that the step size $\alpha^k$ is produced by a line search along $d^k$ such that the following condition holds:

Condition 1. For all $k$ we have:

$$E(w^{k+1}, c^k + \alpha^k d^k) \le E(w^{k+1}, c^k). \tag{3.1}$$

Moreover, if

$$\lim_{k \to \infty} \left[ E(w^{k+1}, c^k) - E(w^{k+1}, c^k + \alpha^k d^k) \right] = 0, \tag{3.2}$$

then for every subsequence $\{(w^{k+1}, c^k)\}_K$ converging to some $(\bar{w}, \bar{c})$ we have

$$\lim_{k \in K,\, k \to \infty} \nabla_c E(w^{k+1}, c^k) = 0. \tag{3.3}$$
Condition 1 imposes, in essence, that the line search algorithm must be able to drive to zero the directional derivative of $E$ along $d^k$ (and hence the partial gradient $\nabla_c E(w^{k+1}, c^k)$), at least when equation 3.2 holds and the sequence $\{(w^{k+1}, c^k)\}$ has limit points. The steepest-descent direction appearing in condition 1 can also be replaced with any "gradient-related" (Bertsekas, 1999) search direction. Many different algorithms are available for determining a step size $\alpha^k$ in a way that condition 1 is satisfied. In fact, we can introduce simple modifications in the convergence proofs of standard line search algorithms (see, e.g., Bertsekas, 1999) to take into account the fact that the objective function now also depends on $w^k$. The simplest idea is to employ an Armijo-type inexact line search, which consists of choosing

$$\alpha^k = \max_{j = 0, 1, \ldots} \left\{ \delta^j \rho^k : E\!\left(w^{k+1}, c^k + \delta^j \rho^k d^k\right) \le E(w^{k+1}, c^k) - \gamma\, \delta^j \rho^k \|d^k\|^2 \right\}, \tag{3.4}$$

where $\rho^k \ge \bar{\rho} > 0$ is some initial estimate of the step size, $\delta \in (0, 1)$, and $\gamma \in (0, 1)$. More generally, we can adopt acceptability conditions for the step size that ensure that the objective function has been "sufficiently reduced" and the step size is "sufficiently large." In equation 3.4, the first requirement is satisfied through the condition

$$E(w^{k+1}, c^k + \alpha^k d^k) \le E(w^{k+1}, c^k) - \gamma\, \alpha^k \|d^k\|^2,$$
while the second is implicitly imposed by requiring both that the initial tentative step size is bounded away from zero ($\rho^k \ge \bar{\rho} > 0$) and that the contraction factor $\delta$ is fixed. Another particular example, where the preceding conditions are taken into account, can be derived, as a special case, from the line search algorithm described in appendix C. Making use of condition 1, we can define the following training scheme:

Algorithm 2

Data. $c^0 \in \mathbb{R}^{Mn}$.

Step 0. Set $k = 0$.

Step 1. Compute $w^{k+1} = \arg\min_w E(w, c^k)$, by solving the linear least-squares problem

$$\min_{w \in \mathbb{R}^M} \left\| \widetilde{W}(c^k)\, w - \tilde{t} \right\|^2.$$

Step 2. If $\|\nabla_c E(w^{k+1}, c^k)\| = 0$, then stop; otherwise:

a. Set $d^k = -\nabla_c E(w^{k+1}, c^k)$, and compute $\alpha^k$ using any line search algorithm satisfying condition 1.

b. Choose $c^{k+1}$ as any vector such that $E(w^{k+1}, c^{k+1}) \le E(w^{k+1}, c^k + \alpha^k d^k)$.

Step 3. Set $k = k + 1$ and go to step 1.
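The Armijo-type rule of equation 3.4, which can supply the step size in step 2a, is straightforward to implement. A sketch on a toy quadratic objective; the objective, $\gamma = 10^{-4}$, $\delta = 0.5$, and $\rho = 1$ are illustrative choices:

```python
import numpy as np

def armijo(f, c, d, rho=1.0, delta=0.5, gamma=1e-4):
    """Armijo-type step size of equation 3.4: the largest delta^j * rho,
    j = 0, 1, ..., such that f(c + a*d) <= f(c) - gamma * a * ||d||^2."""
    f0, dd = f(c), d @ d
    a = rho
    while f(c + a * d) > f0 - gamma * a * dd:
        a *= delta
        if a < 1e-16:        # safeguard against an endless loop
            break
    return a

# toy objective with minimizer at the origin; d is the negative gradient
f = lambda c: float(c @ c)
c = np.array([2.0, -1.0])
d = -2.0 * c
a = armijo(f, c, d)          # backtracks once here: a == 0.5
```

The returned step satisfies both requirements: sufficient decrease through the test itself, and a step "sufficiently large" because backtracking starts from $\rho$ and stops at the first acceptable $\delta^j \rho$.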
The following proposition, whose proof is reported in appendix A, states that algorithm 2 retains essentially the same convergence properties as algorithm 1.

Proposition 2. Suppose that algorithm 2 generates an infinite sequence $\{(w^k, c^k)\}$. Then:

i. $\{(w^k, c^k)\}$ has limit points.

ii. The sequence $\{E(w^k, c^k)\}$ converges to a limit.

iii. Every limit point of $\{(w^k, c^k)\}$ is a stationary point of $E$.

The condition of step 2b can be satisfied, for instance, by taking $c^{k+1} = c^k + \alpha^k d^k$. More generally, starting from the point $c^k + \alpha^k d^k$, some iterations of any descent method could be performed to generate $c^{k+1}$. Thus, since conditions a and b of step 2 are usually met in most computational implementations of unconstrained optimization methods (such as conjugate gradient methods or quasi-Newton methods), the preceding analysis ensures that a convergent two-block training scheme can be realized, in practice, by running some standard code for a fixed number of iterations each time $w^k$ is updated through a least-squares technique.
4 A Block Decomposition Algorithm
In this section we define a decomposition algorithm in which the problem variables $(w, c)$ are partitioned into the $M + 1$ blocks corresponding to the weights $w$ and the center positions $c_i$, for $i = 1, \ldots, M$. As in the algorithms of the preceding section, the weights are computed by solving a linear least-squares problem for fixed values of $c$, but now the center positions are updated in sequence for $i = 1, \ldots, M$. As the problem is nonconvex in $(w, c)$ and the variables are partitioned into a number of blocks larger than two, suitable restrictions must be imposed on the computation of the updated values for $c_i$, in order to guarantee the convergence of the iterative process. In fact, to prevent the occurrence of oscillatory behavior, we must ensure that

$$\left\| c_i^{k+1} - c_i^k \right\| \to 0, \quad \text{for } i = 1, \ldots, M. \tag{4.1}$$
We can again use a line search along the steepest-descent direction; however, in this case, the line search algorithm must also be compatible with equation 4.1. More specifically, for each $k \ge 0$ and $i = 1, \ldots, M$, letting

$$d_i^k = -\nabla_{c_i} E\!\left(w^{k+1}, c_1^{k+1}, \ldots, c_i^k, \ldots, c_M^k\right),$$

we suppose that a line search along $d_i^k$ computes a step size $\alpha_i^k$ such that the following condition holds:

Condition 2. For all $k \ge 0$ and for all $i = 1, \ldots, M$:

$$E\!\left(w^{k+1}, c_1^{k+1}, \ldots, c_i^k + \alpha_i^k d_i^k, \ldots, c_M^k\right) \le E\!\left(w^{k+1}, c_1^{k+1}, \ldots, c_i^k, \ldots, c_M^k\right). \tag{4.2}$$

Moreover, if

$$\lim_{k \to \infty} \left[ E\!\left(w^{k+1}, c_1^{k+1}, \ldots, c_i^k, \ldots, c_M^k\right) - E\!\left(w^{k+1}, c_1^{k+1}, \ldots, c_i^k + \alpha_i^k d_i^k, \ldots, c_M^k\right) \right] = 0, \tag{4.3}$$

then:

i. We have

$$\lim_{k \to \infty} \alpha_i^k \left\| d_i^k \right\| = 0. \tag{4.4}$$

ii. For every subsequence $\{(w^{k+1}, c_1^{k+1}, \ldots, c_i^k, \ldots, c_M^k)\}_K$ converging to some $(\bar{w}, \bar{c}_1, \ldots, \bar{c}_i, \ldots, \bar{c}_M)$, we have

$$\lim_{k \in K,\, k \to \infty} \nabla_{c_i} E\!\left(w^{k+1}, c_1^{k+1}, \ldots, c_i^k, \ldots, c_M^k\right) = 0. \tag{4.5}$$
In comparison with condition 1, we note that condition 2 imposes similar requirements on the objective function reduction and on the limit behavior of the partial gradients, but it also requires that equation 4.4 is satisfied. We note that this condition could be satisfied in principle by imposing a constant upper bound on the step size $\alpha_i^k$ in the Armijo-type line search algorithm of equation 3.4. A more flexible line search algorithm that satisfies condition 2 without imposing a priori upper or lower bounds on the step size is described in detail in appendix C. A convergent training algorithm, where condition 2 is taken into account, is defined in the following scheme:

Algorithm 3

Data. $c_1^0, \ldots, c_M^0 \in \mathbb{R}^n$, $\tau_i > 0$, and sequences $\xi_i^k > 0$ such that $\xi_i^k \to 0$ for $i = 1, \ldots, M$.

Step 0. Set $k = 0$.

Step 1. Compute $w^{k+1} = \arg\min_w E(w, c^k)$, by solving the linear least-squares problem

$$\min_{w \in \mathbb{R}^M} \left\| \widetilde{W}(c^k)\, w - \tilde{t} \right\|^2.$$

Step 2. For $i = 1, \ldots, M$: if $\|\nabla_{c_i} E(w^{k+1}, c^k)\| \le \xi_i^k$, then set $c_i^{k+1} = c_i^k$; otherwise:

a. Set $d_i^k = -\nabla_{c_i} E(w^{k+1}, c_1^{k+1}, \ldots, c_i^k, \ldots, c_M^k)$, and compute $\alpha_i^k$ using any line search algorithm satisfying condition 2.

b. Compute $\tilde{c}_i^k$ such that

$$E\!\left(w^{k+1}, c_1^{k+1}, \ldots, \tilde{c}_i^k, \ldots, c_M^k\right) \le E\!\left(w^{k+1}, c_1^{k+1}, \ldots, c_i^k + \alpha_i^k d_i^k, \ldots, c_M^k\right);$$

if $E\!\left(w^{k+1}, c_1^{k+1}, \ldots, \tilde{c}_i^k, \ldots, c_M^k\right) \le E\!\left(w^{k+1}, c_1^{k+1}, \ldots, c_i^k, \ldots, c_M^k\right) - \tau_i \left\| \tilde{c}_i^k - c_i^k \right\|^2$, then set $c_i^{k+1} = \tilde{c}_i^k$; else set $c_i^{k+1} = c_i^k + \alpha_i^k d_i^k$.

Step 3. Set $k = k + 1$ and go to step 1.
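One pass of step 2 of algorithm 3 can be sketched as follows. The skip test uses the tolerance $\xi_i$, and a simple backtracking search with a sufficient-decrease test stands in for the appendix C line search; the basis, data, and constants are illustrative assumptions:

```python
import numpy as np

phi  = lambda d2: np.exp(-d2)        # illustrative Gaussian basis
dphi = lambda d2: -np.exp(-d2)

def error(w, C, X, t, eps):
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.sum((phi(d2) @ w - t) ** 2) + eps * (w @ w + np.sum(C ** 2))

def grad_ci(w, C, X, t, eps, i):
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    r = phi(d2) @ w - t
    return -4 * w[i] * (r * dphi(d2[:, i])) @ (X - C[i]) + 2 * eps * C[i]

def sweep_centers(w, C, X, t, eps, xi):
    """One pass of step 2: update the centers sequentially, skipping
    center i when ||grad_{c_i} E|| <= xi_i."""
    C = C.copy()
    for i in range(len(C)):
        g = grad_ci(w, C, X, t, eps, i)
        if np.linalg.norm(g) <= xi[i]:
            continue                  # small gradient: c_i is not updated
        d, a, E0 = -g, 1.0, error(w, C, X, t, eps)
        while a > 1e-16:              # backtracking with sufficient decrease
            Ctry = C.copy()
            Ctry[i] = C[i] + a * d
            if error(w, Ctry, X, t, eps) <= E0 - 1e-4 * a * (d @ d):
                C = Ctry
                break
            a *= 0.5
    return C

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
t = X[:, 0] ** 2
C = X[:3].copy()
w = np.array([0.5, -0.2, 0.3])
E_before = error(w, C, X, t, 1e-6)
C_new = sweep_centers(w, C, X, t, 1e-6, xi=[1e-9] * 3)
E_after = error(w, C_new, X, t, 1e-6)
```

Because a center moves only when the sufficient-decrease test passes, the error cannot increase during a sweep.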
When the condition $\|\nabla_{c_i} E(w^{k+1}, c^k)\| \le \xi_i^k$ at step 2 is satisfied, the center $c_i$ is not updated. In fact, if the gradient is "small," the minimization with respect to $c_i$ is not expected to produce a significant reduction of the error function, and hence we can speed up the training process by avoiding unnecessary operations. A heuristic rule for the choice of the sequences $\{\xi_i^k\}$ is given in section 5, where implementation details of algorithm 3 are described.

Remark 1. It is possible, in principle, to adopt alternative rules for selecting at each iteration the centers to be updated. In order to ensure the convergence of the sequence, it can be sufficient, for instance, to guarantee that each center will be considered within a fixed number of successive iterations (Grippo & Sciandrone, 1999).

The conditions of step 2b, on the one hand, guarantee a sufficient reduction of the objective function and, on the other, ensure that the distance $\|c_i^{k+1} - c_i^k\|$ between successive points goes to zero. These conditions can be satisfied, for instance, by setting $c_i^{k+1} = c_i^k + \alpha_i^k d_i^k$. However, any minimization method can be adopted to generate the candidate point $\tilde{c}_i^k$.

Remark 2. We note that at step 2, when a center $c_i$ must be updated, at least one evaluation of the error function at new positions of $c_i$ must be performed. However, the evaluation of the error function, which depends on the outputs $y(x^p; w^{k+1}, c_1^{k+1}, \ldots, c_i, \ldots, c_M^k)$ corresponding to the input patterns $x^p$ for $p = 1, \ldots, P$, can be organized so that considerable computational savings are obtained, at the expense of storing the outputs obtained with $c_i^k$. In fact, we can write

$$y\!\left(x^p; w^{k+1}, c_1^{k+1}, \ldots, c_i, \ldots, c_M^k\right) = y\!\left(x^p; w^{k+1}, c_1^{k+1}, \ldots, c_i^k, \ldots, c_M^k\right) - w_i^{k+1} \varphi\!\left(\|x^p - c_i^k\|^2\right) + w_i^{k+1} \varphi\!\left(\|x^p - c_i\|^2\right),$$
so that the new output is obtained by means of a few elementary operations.

The convergence properties of algorithm 3 are given in the following proposition, whose proof is reported in appendix B.

Proposition 3. Let $\{(w^k, c^k)\}$ be the sequence generated by algorithm 3. Then:

i. $\{(w^k, c^k)\}$ has limit points.

ii. The sequence $\{E(w^k, c^k)\}$ admits a limit.

iii. We have $\lim_{k \to \infty} \|w^{k+1} - w^k\| = 0$.

iv. For $i = 1, \ldots, M$, we have $\lim_{k \to \infty} \|c_i^{k+1} - c_i^k\| = 0$.

v. Every limit point of $\{(w^k, c^k)\}$ is a stationary point of $E$.
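The incremental output update of remark 2 is easy to check numerically. A sketch with an illustrative Gaussian basis and random data (any differentiable basis, such as the paper's multiquadric, would behave identically):

```python
import numpy as np

phi = lambda d2: np.exp(-d2)    # illustrative basis choice

def outputs(X, C, w):
    """Full O(P*M) evaluation of y(x^p) = sum_i w_i phi(||x^p - c_i||^2)."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return phi(d2) @ w

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
C = rng.normal(size=(8, 4))
w = rng.normal(size=8)

y = outputs(X, C, w)            # cached outputs for the current centers

# move a single center and patch the cache with O(P) work:
# subtract the old center's contribution and add the new one
i, c_new = 2, rng.normal(size=4)
d2_old = ((X - C[i]) ** 2).sum(axis=1)
d2_new = ((X - c_new) ** 2).sum(axis=1)
y_inc = y - w[i] * phi(d2_old) + w[i] * phi(d2_new)

C[i] = c_new
y_full = outputs(X, C, w)       # recomputation from scratch, for comparison
```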
5 Computational Experiments
In this section we present the numerical results obtained with algorithms 2 and 3 on a set of four test problems taken from Prechelt (1994), and we compare the performance of the proposed algorithms with that of an efficient algorithm (routine E04DGF of the NAG library), which implements a reduced-memory quasi-Newton method for the joint minimization of the error function with respect to $w$ and $c$. We adopted as basis function $\varphi$ the direct multiquadric function, defined as

$$\varphi\!\left(\|x - c\|^2\right) = \left( \|x - c\|^2 + \sigma \right)^{1/2},$$

where the scaling parameter $\sigma$ has been chosen equal to 0.01. The regularization parameter $\varepsilon$ in equation 2.1 has been fixed to the value $10^{-6}$. In what follows we report a short description of the problems, the implementation details of the training algorithms, and the computational results.

5.1 Training Problems.
5.1.1 Problem P1 (Credit Card Problem). In this classification problem, the task is to predict the approval or nonapproval of a credit card for a customer. The data set consists of 690 pairs $(x^p, t^p)$, where $x^p \in \mathbb{R}^{51}$ and $t^p \in \{0, 1\}$.

5.1.2 Problem P2 (Heart Problem). In this classification problem, the task is to decide whether at least one of four major vessels of the heart is reduced in diameter by more than 50%. The diagnosis must be made on the basis of personal data (e.g., age and smoking habits). The data set consists of 303 pairs $(x^p, t^p)$, where $x^p \in \mathbb{R}^{35}$ and $t^p \in \{0, 1\}$.

5.1.3 Problem P3 (Cancer Problem). In this classification problem, the task is to classify a breast lump as either benign or malignant. This classification must be performed on the basis of information regarding, for instance, the uniformity of cell size and cell shape. The data set consists of 699 pairs $(x^p, t^p)$, where $x^p \in \mathbb{R}^9$ and $t^p \in \{0, 1\}$.

5.1.4 Problem P4 (Building Problem). In this approximation problem, the task is to predict energy consumption in a building, taking into account the outside temperature, outside air humidity, and other similar information. The data set consists of 4208 pairs $(x^p, t^p)$, where $x^p \in \mathbb{R}^{14}$ and $t^p \in \mathbb{R}$.

5.2 Training Algorithms.
5.2.1 Algorithm Q-N. The algorithm makes use of the E04DGF routine of the NAG library. This routine implements a preconditioned, limited-memory (two-step) quasi-Newton conjugate gradient method. It is intended for use on large-scale problems; in fact, it does not require matrix operations and has storage costs that grow linearly with the number of variables. It can be considered quite efficient in terms of computational cost, because of the "good" convergence properties of quasi-Newton methods. A description of the method can be found in Gill and Murray (1979) and Gill, Murray, and Wright (1981). We have used the default values of the parameters in the E04DGF routine.

5.2.2 Algorithm 2 (Section 3). The updating of the output weight vector $w$ (step 1) is performed by using a standard routine (the F04JAF routine of the NAG library) for solving the linear least-squares problem. This routine, which implements a direct method, computes the minimum-norm solution of a linear least-squares problem by means of the pseudoinverse matrix (determined by the singular value decomposition). The vector of centers $c$ is updated (step 2) by executing a fixed number of iterations (10 in our case) of the quasi-Newton method implemented by the E04DGF routine. In this case, the conditions of step 2 are satisfied automatically, as the first iteration of the quasi-Newton method consists of a line search along the negative gradient, and the objective function is reduced in the subsequent iterations. The iteration counter $k$ is updated at step 3, so that a main iteration of algorithm 2 involves the solution of a linear least-squares problem (for the updating of the vector $w$) and 10 inner iterations of the E04DGF routine (for the updating of the vector $c$).

5.2.3 Algorithm 3 (Section 4). The vector $w$ is updated (step 1) as in algorithm 2, while each center $c_i$ is updated by using the line search algorithm (algorithm LS) defined in appendix C, and then letting $c_i^{k+1} = c_i^k + \alpha_i^k d_i^k$. The implementation of algorithm LS is described below. Regarding the sequences $\xi_i^k$, $i = 1, \ldots, M$, we used the rule

$$\xi_i^k = \theta^k \left( 1 + E\!\left(w^{k+1}, c_1^{k+1}, \ldots, c_i^k, \ldots, c_M^k\right) \right),$$
where, starting from $\theta^0 = 0.5$, we set

$$\theta^{k+1} = \begin{cases} \theta^k & \text{if } c_j^{k+1} \ne c_j^k \text{ for at least one } j \in \{1, \ldots, M\}, \\ 0.5\, \theta^k & \text{otherwise.} \end{cases}$$
The iteration counter $k$ is updated at step 3, so that a single main iteration of algorithm 3 involves the solution of a linear least-squares problem (for the updating of $w$) and one run of algorithm LS for each center $c_i$ that is updated.

5.2.4 Algorithm LS (Appendix C). The line search algorithm has been implemented according to the pseudocode given in appendix C with the following values of the parameters:

$$\gamma_l^i = 10^{-4}, \quad \gamma_u^i = 10^3, \quad \eta_l^i = 2, \quad \eta_u^i = 5, \quad \delta_l^i = 10^{-2}, \quad \delta_u^i = 0.9,$$