ARTICLE
Communicated by Peter Dayan

Bayesian Computation in Recurrent Neural Circuits

Rajesh P. N. Rao
[email protected]
Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, U.S.A.
A large number of human psychophysical results have been successfully explained in recent years using Bayesian models. However, the neural implementation of such models remains largely unclear. In this article, we show that a network architecture commonly used to model the cerebral cortex can implement Bayesian inference for an arbitrary hidden Markov model. We illustrate the approach using an orientation discrimination task and a visual motion detection task. In the case of orientation discrimination, we show that the model network can infer the posterior distribution over orientations and correctly estimate stimulus orientation in the presence of significant noise. In the case of motion detection, we show that the resulting model network exhibits direction selectivity and correctly computes the posterior probabilities over motion direction and position. When used to solve the well-known random dots motion discrimination task, the model generates responses that mimic the activities of evidence-accumulating neurons in cortical areas LIP and FEF. The framework we introduce posits a new interpretation of cortical activities in terms of log posterior probabilities of stimuli occurring in the natural world.
1 Introduction

Bayesian models of perception have proved extremely useful in explaining results from a number of human psychophysical experiments (see, e.g., Knill & Richards, 1996; Rao, Olshausen, & Lewicki, 2002). These psychophysical tasks range from inferring 3D structure from 2D images and judging depth from multiple cues to perception of motion and color (Bloj, Kersten, & Hurlbert, 1999; Weiss, Simoncelli, & Adelson, 2002; Mamassian, Landy, & Maloney, 2002; Jacobs, 2002). The strength of the Bayesian approach lies in its ability to quantitatively model the interaction between prior knowledge and sensory evidence. Bayes' rule prescribes how prior probabilities and stimulus likelihoods should be combined, allowing the responses of subjects during psychophysical tasks to be interpreted in terms of the resulting posterior distributions. Neural Computation 16, 1–38 (2004)
© 2003 Massachusetts Institute of Technology
Additional support for Bayesian models comes from recent neurophysiological and psychophysical studies on visual decision making. Carpenter and colleagues have shown that the reaction time distribution of human subjects making eye movements to one of two targets is well explained by a model computing the log-likelihood ratio of one target over the other (Carpenter & Williams, 1995). In another study, the saccadic response time distribution of monkeys could be predicted from the time taken by neural activity in area FEF to reach a fixed threshold (Hanes & Schall, 1996), suggesting that these neurons are involved in accumulating evidence (interpreted as log likelihoods) over time. Similar activity has also been reported in the primate area LIP (Shadlen & Newsome, 2001). A mathematical model based on log-likelihood ratios was found to be consistent with the observed neurophysiological results (Gold & Shadlen, 2001). Given the growing evidence for Bayesian models in perception and decision making, an important open question is how such models could be implemented within the recurrent networks of the primate brain. Previous models of probabilistic inference in neural circuits have been based on the concept of neural tuning curves or basis functions and have typically focused on the estimation of static quantities such as stimulus position or orientation (Anderson & Van Essen, 1994; Zemel, Dayan, & Pouget, 1998; Deneve, Latham, & Pouget, 1999; Pouget, Dayan, & Zemel, 2000). Other models have relied on mean-field approximations or various forms of Gibbs sampling for perceptual inference (Hinton & Sejnowski, 1986; Dayan, Hinton, Neal, & Zemel, 1995; Dayan & Hinton, 1996; Hinton & Ghahramani, 1997; Rao & Ballard, 1997, 1999; Rao, 1999; Hinton & Brown, 2002). In this article, we describe a new model of Bayesian computation in a recurrent neural circuit.
We specify how the feedforward and recurrent connections may be selected to perform Bayesian inference for arbitrary hidden Markov models (see sections 2–4). The approach is illustrated using two tasks: discriminating the orientation of a noisy visual stimulus and detecting the direction of motion of moving stimuli. In the case of orientation discrimination, we show that the model network can infer the posterior distribution over orientations and correctly estimate stimulus orientation in the presence of considerable noise (see section 5.1). In the case of motion detection, we show that the resulting model network exhibits direction selectivity and correctly computes the posterior probabilities over motion direction and position (see section 5.2). We then introduce a decision-making framework based on computing log posterior ratios from the outputs of the motion detection network. We demonstrate that for the well-known random dots motion discrimination task (Shadlen & Newsome, 2001), the decision model produces responses that are qualitatively similar to the responses of "evidence-accumulating" neurons in primate brain areas LIP and FEF (see section 5.3). Some predictions of the proposed framework are explored in section 6. We conclude in sections 7 and 8 by discussing several possibilities for neural encoding of log probabilities and suggest a probabilistic framework for on-line learning of synaptic parameters.

2 Modeling Cortical Networks

We begin by considering a commonly used neural architecture for modeling cortical response properties: a recurrent network with firing-rate dynamics (see, e.g., Dayan & Abbott, 2001). Let I denote the vector of input firing rates to the network, and let v represent the output firing rates of the recurrently connected neurons in the network. Let W represent the feedforward synaptic weight matrix and M the recurrent weight matrix. The following equation describes the dynamics of the network,

    τ dv/dt = −v + WI + Mv,  (2.1)
where τ is a time constant. The equation can be written in a discrete form as follows:

    v_i(t+1) = v_i(t) + ε(−v_i(t) + w_i I(t) + Σ_j M_ij v_j(t)),  (2.2)
where ε is the integration rate, v_i is the ith component of the vector v, w_i is the ith row of the matrix W, and M_ij is the element of M in the ith row and jth column. The above equation can be rewritten as

    v_i(t+1) = ε w_i I(t) + Σ_j m_ij v_j(t),  (2.3)
where m_ij = ε M_ij for i ≠ j and m_ii = 1 + ε(M_ii − 1). How can Bayesian computation be achieved using the dynamics given by equation 2.3? We approach this problem by considering Bayesian inference in a simple hidden Markov model.

3 Bayesian Computation in a Cortical Network

Let θ_1, ..., θ_N be the hidden states of a hidden Markov model. Let θ(t) represent the hidden state at time t with transition probabilities denoted by P(θ(t) = θ_i | θ(t−1) = θ_j) for i, j = 1, ..., N. Let I(t) be the observable output governed by the probabilities P(I(t) | θ(t)). Then the posterior probability of being in state θ_i at time t is given by Bayes' rule,

    P(θ(t) = θ_i | I(t), ..., I(1)) = k P(I(t) | θ(t) = θ_i) P(θ(t) = θ_i | I(t−1), ..., I(1)),  (3.1)
where k is a normalization constant. The equation can be made recursive with respect to the posterior probabilities as follows:

    P(θ_i^t | I(t), ..., I(1)) = k P(I(t) | θ_i^t) Σ_j P(θ_i^t | θ_j^{t−1}) P(θ_j^{t−1} | I(t−1), ..., I(1)),  (3.2)
where we have written "θ(t) = θ_i" as θ_i^t for notational convenience. This equation can be written in the log domain as

    log P(θ_i^t | I(t), ..., I(1)) = log P(I(t) | θ_i^t) + log k
        + log[ Σ_j P(θ_i^t | θ_j^{t−1}) P(θ_j^{t−1} | I(t−1), ..., I(1)) ].  (3.3)

A recurrent network governed by equation 2.3 can implement equation 3.3 if

    v_i(t+1) = log P(θ_i^t | I(t), ..., I(1))  (3.4)
    ε w_i I(t) = log P(I(t) | θ_i^t)  (3.5)
    Σ_j m_ij v_j(t) = log[ Σ_j P(θ_i^t | θ_j^{t−1}) P(θ_j^{t−1} | I(t−1), ..., I(1)) ],  (3.6)
with the normalization term log k being added after each update. In other words, the network activities are updated according to

    v_i(t+1) = ε w_i I(t) + Σ_j m_ij v_j(t) − log Σ_j e^{u_j(t)},  (3.7)
where u_j(t) = ε w_j I(t) + Σ_k m_jk v_k(t). The negative log term, which implements divisive normalization in the log domain, can be interpreted as a form of global recurrent inhibition.¹ In equation 3.7, the log likelihoods can be computed using linear filters F(θ_i) (= ε w_i). For example, F(θ_i) could represent a spatially localized gaussian or an oriented Gabor filter when the inputs I(t) are images. Note that as formulated above, the updating of probabilities is directly tied to the time constant of neurons in the network, as specified by the parameter τ in equation 2.1 (or ε in equation 2.2). However, it may be possible to achieve longer integration time constants by combining slow synapses (e.g., NMDA synapses) with relatively strong recurrent excitation (see, e.g., Seung, 1996; Wang, 2001).

¹ If normalization is omitted, the network outputs can be interpreted as representing log joint probabilities, that is, v_i(t+1) = log P(θ_i^t, I(t), ..., I(1)). However, we have found that the lack of normalization makes such a network prone to instabilities.

A particularly challenging problem is to pick recurrent weights m_ij such that equation 3.6 holds true (the alternative of learning such weights is addressed in section 8.1). For equation 3.6 to hold true, we need to approximate a log-sum with a sum-of-logs. We tackled this problem by generating a set of random probabilities x_j(t) for t = 1, ..., T and finding a set of weights m_ij that satisfy

    Σ_j m_ij log x_j(t) ≈ log[ Σ_j P(θ_i^t | θ_j^{t−1}) x_j(t) ]  (3.8)
for all i and t (as an aside, note that a similar approximation can be used to compute a set of recurrent inhibitory weights for implementing the negative log term in equation 3.7). We examine this sum-of-logs approximation in more detail in the next section.

4 Approximating Transition Probabilities Using Recurrent Weights

A set of recurrent weights m_ij can be obtained for any given set of transition probabilities P(θ_i^t | θ_j^{t−1}) by using the standard pseudoinverse method. Let A represent the matrix of recurrent weights, that is, A_ij = m_ij, and let L represent a matrix of log probabilities, the jth element in the tth column representing log x_j(t) for t = 1, ..., T. Let B represent the matrix of log sums, that is, B_it = log[ Σ_j P(θ_i^t | θ_j^{t−1}) x_j(t) ]. To minimize the squared error in equation 3.8 with respect to the recurrent weights m_ij (= A_ij), we need to solve the equation

    A L = B.  (4.1)

Multiplying both sides by the pseudoinverse of L, we obtain the following expression for the recurrent weight matrix:

    A = B Lᵀ (L Lᵀ)⁻¹.  (4.2)
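As a concrete sketch of equation 4.2, the following code computes recurrent weights for a random transition matrix and measures the residual of equation 3.8. The matrix sizes, random seed, and the 0.01 floor on the training probabilities (playing the role of the smallest representable probability discussed in the text) are illustrative assumptions, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T_steps = 50, 500                 # neurons and training samples (illustrative)

# Transition matrix P[i, j] = P(theta_i^t | theta_j^{t-1}); columns sum to 1.
P = rng.random((N, N))
P /= P.sum(axis=0)

# Random training probabilities x_j(t), floored away from zero so that
# log x_j(t) is always defined.
X = rng.random((N, T_steps)) + 0.01
X /= X.sum(axis=0)

L = np.log(X)                        # L[j, t] = log x_j(t)
B = np.log(P @ X)                    # B[i, t] = log sum_j P[i, j] x_j(t)

# Equation 4.2: A = B L^T (L L^T)^(-1), the least-squares solution of A L = B.
A = B @ L.T @ np.linalg.inv(L @ L.T)

avg_err = np.mean(np.abs(A @ L - B))  # average absolute error of equation 3.8
print(avg_err)
```

Because equation 4.2 is the least-squares minimizer of ||A L − B||, the fitted weights are guaranteed to do no worse than the trivial solution A = 0; the absolute size of the residual depends on the density choices studied in Figures 1 and 2.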
We first tested this approximation as a function of the transition probabilities P(θ_i^t | θ_j^{t−1}) for a fixed set of uniformly random probabilities x_j(t). The matrix of transition probabilities was chosen to be a sparse random matrix. The degree of sparseness was varied by choosing different values for the density of the matrix, defined as the fraction of nonzero elements. Approximation accuracy was measured in terms of the absolute value of the error between the left and right sides of equation 3.8, averaged over both time
Figure 1: Approximation accuracy for different classes of transition probability matrices. Each data point represents the average approximation error from equation 3.8 for a particular number of neurons in the recurrent network for a fixed set of conditional probabilities x_j(t) = P(θ_j^{t−1} | I(t−1), ..., I(1)). Each curve represents the errors for a particular density d of the random transition probability matrix (d is the fraction of nonzero entries in the matrix).
and neurons. As shown in Figure 1, for any fixed degree of sparseness in the transition probability matrix, the average error decreases as a function of the number of neurons in the recurrent network. Furthermore, the overall errors also systematically decrease as the density of the random transition probability matrix is increased (different curves in the graph). The approximation failed in only one case, when the density was 0.1 and the number of neurons was 25, due to the occurrence of a log 0 term in B (see discussion below). We also tested the accuracy of equation 3.8 as a function of the conditional probabilities x_j(t) = P(θ_j^{t−1} | I(t−1), ..., I(1)) for a fixed set of uniformly random transition probabilities P(θ_i^t | θ_j^{t−1}). The matrix of conditional probabilities was chosen to be a sparse random matrix (rows denoting indices of neurons and columns representing time steps). The degree of sparseness was varied as before by choosing different values for the density of the matrix. It is clear that the approximation fails if any of the x_j(t) equals exactly 0, since the corresponding entry log x_j(t) in L will be undefined. We therefore added a small arbitrary positive value to all entries of the random matrix before renormalizing it. The smallest value in this normalized matrix may be regarded as the smallest probability that can be represented by neurons in the network. This ensures that x_j(t) is never zero while allowing us to test
Figure 2: Approximation accuracy for different sets of conditional probabilities P(θ_j^{t−1} | I(t−1), ..., I(1)) (= x_j(t)). The data points represent the average approximation error from equation 3.8 as a function of the number of neurons in the network for a fixed set of transition probabilities. Each curve represents the errors for a particular density d of the random conditional probability matrix (see the text for details).
the stability of the approximation for small values of x_j(t). We define the density of the random matrix to be the fraction of entries that are different from the smallest value obtained after renormalization. As shown in Figure 2, for any fixed degree of sparseness of the conditional probability matrix, the average error decreases as the number of neurons is increased. The error curves also show a gradual decrease as the density of the random conditional probability matrix is increased (different curves in the graph). The approximation failed in two cases: when the number of neurons was 75 or fewer and the density was 0.1, and when the number of neurons was 25 and the density was 0.25. The approximation error was consistently below 0.1 for all densities above 0.5 and numbers of neurons greater than 100. In summary, the approximation of transition probabilities using recurrent weights, based on approximating a log sum with a sum of logs, was found to be remarkably robust to a range of choices for both the transition probabilities and the conditional probabilities P(θ_j^{t−1} | I(t−1), ..., I(1)). The only cases where the approximation failed were when two conditions both held true: (1) the number of neurons used was very small (25 to 75) and (2) a majority of the transition probabilities were equal to zero or a majority of the conditional probabilities were near zero. The first of these conditions is not a major concern given the typical number of neurons in cortical circuits. For the second condition, we need only assume that the conditional probabilities can take on values between a small positive number ε and 1, where log ε can be related to the neuron's background firing rate. This seems a reasonable assumption given that cortical neurons in vivo typically have a nonzero background firing rate. Finally, it should be noted that the approximation above is based on assuming a simple linear recurrent model of a cortical network. The addition of a more realistic nonlinear recurrent function in equation 2.1 can be expected to yield more flexible and accurate ways of approximating the log sum in equation 3.6.

5 Results

5.1 Example 1: Estimating Orientation from Noisy Images. To illustrate the approach, we first present results from a simple visual discrimination task involving inference of stimulus orientation from a sequence of noisy images containing an oriented bar. The underlying hidden Markov model uses a set of discrete states θ_1, θ_2, ..., θ_M that sample the space of possible orientations of an elongated bar in the external world. During each trial, the state is assumed to be fixed:

    P(θ_i^t | θ_j^{t−1}) = 1 if i = j  (5.1)
                         = 0 otherwise.  (5.2)
Finally, the likelihoods are given by the product of the image vector I(t) with a set of oriented filters F(θ_i) (see Figure 3A) corresponding to the orientations represented by the states θ_1, θ_2, ..., θ_M:

    log P(I(t) | θ_i^t) = ε w_i I(t) = F(θ_i) I(t).  (5.3)
For each trial, the sequence of input images was generated by adding zero-mean gaussian white noise of fixed variance to an image containing a bright oriented bar against a dark background (see Figure 3B). The goal of the network was to compute the posterior probabilities of each orientation given an input sequence of noisy images containing a single oriented bar. Clearly, the likelihood values, given by equation 5.3, can vary significantly over time (see Figure 4, middle row), the amount of variation being determined by the variance of the additive gaussian noise. The recurrent activity of the network, however, allows a stable estimate of the underlying stimulus orientation to be computed across time (see Figure 4, bottom row). This estimate can be obtained using the maximum a posteriori (MAP) method: the estimated orientation at any time is given by the preferred orientation of the neuron with the maximum activity, that is, the orientation with the maximum posterior probability. We tested the MAP estimation ability of the network as a function of noise variance for a stimulus set containing 36 orientations (0 to 180 degrees in 10-degree steps). For these
Figure 3: Estimating stimulus orientation from noisy images. (A) Twelve of the 36 oriented filters used in the recurrent network for estimating stimulus orientation. These filters represent the feedforward connections F(θ_i). Bright pixels denote positive (excitatory) values and dark pixels denote negative (inhibitory) values. (B) Example of a noisy input image sequence containing a bright oriented bar embedded in additive zero-mean gaussian white noise with a standard deviation of 1.5.
Figure 4: Example of noisy orientation estimation. (Top) Example images at different time steps from a noisy input sequence containing an oriented bar. (Middle) Feedforward inputs representing log-likelihood values for the inputs shown above. Note the wide variation in the peak of the activities. (Bottom) Output activities shown as posterior probabilities of the 36 orientations (0 to 180 degrees in 10 degree steps) encoded by the neurons in the network. Note the stable peak in activity achieved through recurrent accumulation of evidence over time despite the wide variation in the input likelihoods.
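The stable-peak behavior illustrated in Figure 4 follows from equation 3.7 with identity recurrent weights, since the transition matrix of equations 5.1 and 5.2 is the identity. A minimal sketch of this accumulation, using simple orthogonal 1D templates as stand-ins for the oriented filters of Figure 3A (the state count, image size, and templates are illustrative; the noise standard deviation of 1.5 matches Figure 3B):

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_pix, steps, sigma = 8, 16, 250, 1.5

# Toy orthogonal templates F(theta_i), stand-ins for the oriented filters.
F = np.zeros((n_states, n_pix))
for i in range(n_states):
    F[i, 2 * i: 2 * i + 2] = 1.0

true_state = 5
v = np.log(np.ones(n_states) / n_states)        # uniform log prior

for _ in range(steps):
    I = F[true_state] + sigma * rng.standard_normal(n_pix)   # noisy image
    u = F @ I + v                # equation 3.7 with identity recurrent weights
    u -= u.max()                 # stabilize the normalization numerically
    v = u - np.log(np.exp(u).sum())              # log-domain normalization

map_estimate = int(np.argmax(v))                 # MAP readout (section 5.1)
print(map_estimate)
```

Despite the per-step likelihoods fluctuating with the noise, the accumulated log posterior settles on the true state, which is the behavior shown in the bottom row of Figure 4.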
Figure 5: Orientation estimation accuracy as a function of noise level. The data points show the classification rates obtained for a set of 36 oriented bar stimuli embedded in additive zero-mean gaussian white noise with standard deviation values as shown on the x-axis. The dotted line shows the chance classification rate (1/36 × 100) obtained by the strategy of randomly guessing an orientation.
experiments, the network was run for a fixed number of time steps (250 time steps), and the MAP estimate was computed from the final set of activities. As shown in Figure 5, the classification rate of the network using the MAP method is high (above 80%) for zero-mean gaussian noise of standard deviations less than 1 and remains above chance (1/36) for standard deviations up to 4 (i.e., noise variances up to 16).

5.2 Example 2: Detecting Visual Motion. To illustrate more clearly the dynamical properties of the model, we examine the application of the proposed approach to the problem of visual motion detection. A prominent property of visual cortical cells in areas such as V1 and MT is selectivity to the direction of visual motion. In this section, we interpret the activity of such cells as representing the log posterior probability of stimulus motion in a particular direction at a spatial location, given a series of input images. For simplicity, we focus on the case of 1D motion in an image consisting of X pixels with two possible motion directions: leftward (L) or rightward (R). Let the state θ_ij represent motion at a spatial location i in direction j ∈ {L, R}. Consider a network of N neurons, each representing a particular state θ_ij (see Figure 6A). The feedforward weights are gaussians, that is, F(θ_iR) = F(θ_iL) = F(θ_i) = a gaussian centered at location i with standard deviation σ. Figure 6B depicts the feedforward weights for a network of 30 neurons: 15 encoding leftward and 15 encoding rightward motion.
Figure 6: Recurrent network for motion detection. (A) A recurrent network of neurons, shown for clarity as two chains selective for leftward and rightward motion, respectively. The feedforward synaptic weights for neuron i (in the leftward or rightward chain) are determined by F(θ_i). The recurrent weights reflect the transition probabilities P(θ_iR | θ_kR) and P(θ_iL | θ_jL). (B) Feedforward weights F(θ_i) for neurons i = 1, ..., 15 (rightward chain). The feedforward weights for neurons i = 16, ..., 30 (leftward chain) are identical. (C) Transition probabilities P(θ(t) = θ_i | θ(t−1) = θ_j) for i, j = 1, ..., 30. Probability values are proportional to pixel brightness. (D) Recurrent weights m_ij computed from the transition probabilities in C using equation 4.2.
To detect motion, the transition probabilities P(θ_ij | θ_kl) must be selected to reflect both the direction of motion and the speed of the moving stimulus. For this study, the transition probabilities for rightward motion from the state θ_kR (i.e., P(θ_iR | θ_kR)) were set according to a gaussian centered at location k + x, where x is a parameter determined by stimulus speed. The transition probabilities for leftward motion from the state θ_kL were set to gaussian values centered at k − x. The transition probabilities from states near the
Figure 7: Approximating transition probabilities with recurrent weights. (A) A comparison of the right and left sides of equation 3.8 for random test data x_j(t) (different from the training x_j(t)), using the transition probabilities and weights shown in Figures 6C and 6D, respectively. The solid line represents the right side of equation 3.8 and the dotted line the left side. Only the results for a single neuron (i = 8) are shown, but similar results were obtained for the rest of the neurons. (B) The approximation error (difference between the two sides of equation 3.8) over time.
two boundaries (i = 1 and i = X) were chosen to be uniformly random values. Figure 6C shows the matrix of transition probabilities. Given these transition probabilities, the recurrent weights m_ij can be computed using equation 4.2 (see Figure 6D). Figure 7 shows that for the given set of transition probabilities, the log-sum in equation 3.8 can indeed be approximated quite accurately by a sum-of-logs using the weights m_ij.

5.2.1 Bayesian Estimation of Visual Motion. Figure 8 shows the output of the network in the middle of a sequence of input images depicting a bar moving either leftward or rightward. As shown in the figure, for a leftward-moving bar at a particular location i, the highest network output is for the neuron representing location i and direction L, while for a rightward-moving bar, the neuron representing location i and direction R has the highest output. The output firing rates were computed from the log posteriors v_i using a simple linear encoding model: f_i = [c · v_i + m]⁺, where c is a positive constant (= 12 for this plot), m is the maximum firing rate of the neuron (= 100 in this example), and [·]⁺ denotes rectification (see section 7.1 for more details). Note that although the log likelihoods are the same for leftward- and rightward-moving inputs, the asymmetric recurrent weights (which represent the transition probabilities) allow the network to distinguish between leftward- and rightward-moving stimuli. The plot of posteriors in Figure 8 shows that the network correctly computes posterior probabilities close to 1 for the states θ_iL and θ_iR for leftward and rightward motion, respectively, at location i.

Figure 8: Network output for a moving stimulus. (Left) The four plots depict, respectively, the log likelihoods, log posteriors, neural firing rates, and posterior probabilities observed in the network for a rightward-moving bar when it arrives at the central image location. Note that the log likelihoods are the same for the rightward- and leftward-selective neurons (the first 15 and last 15 neurons, respectively, as dictated by the feedforward weights in Figure 6B), but the outputs of these neurons correctly reflect the direction of motion as a result of recurrent interactions. (Right) The same four plots for a leftward-moving bar as it reaches the central location.

5.2.2 Encoding Multiple Stimuli. An important question that needs to be addressed is whether the network can encode uncertainties associated with multiple stimuli occurring in the same image. For example, can the motion detection network represent two or more bars moving simultaneously in the image? We begin by noting that the definition of multiple stimuli is closely related to the notion of state. Recall that our underlying probabilistic model (the hidden Markov model) assumes that the observed world can be in only one of M different states. Thus, any input containing "multiple stimuli" (such as two moving bars) will be interpreted in terms of probabilities for each of the M states (a single moving bar at each location).
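The linear rectified encoding used in section 5.2.1 to produce the firing rates of Figure 8 can be sketched directly. The constants c = 12 and m = 100 are the values quoted in the text; the example posterior is illustrative:

```python
import numpy as np

def firing_rates(v, c=12.0, m=100.0):
    """Map log posteriors v_i to rates: f_i = [c * v_i + m]_+ (section 5.2.1)."""
    return np.maximum(0.0, c * v + m)

# Example: a posterior sharply peaked on the first of four states.
posterior = np.array([0.90, 0.05, 0.03, 0.02])
rates = firing_rates(np.log(posterior))
print(rates)   # monotone in the posterior, capped at m since log p <= 0
```

Because log probabilities are nonpositive, the rate never exceeds the maximum m, and states with posterior below e^(−m/c) are silenced by the rectification.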
Figure 9: Encoding multiple motions: Same-direction motion. (A) Six images from an input sequence depicting two bright bars, both moving rightward. (B) The panels from top to bottom on the left and continued on the right show the activities (left) and posterior probabilities (right) for the image sequence in A in a motion detection network of 60 neurons (the first 30 encoding rightward motion and the next 30 encoding leftward motion). Note the two large peaks in activity among the right-selective neurons, indicating the presence of two rightward-moving bars. The corresponding posterior probabilities remain close to 0.5 until only one moving bar remains in the image, with a corresponding single peak of probability close to 1.
We tested the ability of the motion detection network to encode multiple stimuli by examining its responses to an image sequence containing two simultaneously moving bars under two conditions: (1) both bars moving in the same direction and (2) the bars moving in opposite directions. The results are shown in Figures 9 and 10, respectively. As seen in these figures, the network is able to represent the two simultaneously moving stimuli under both conditions. Note the two peaks in activity representing the location and direction of motion of the two moving bars. As expected from the underlying probabilistic model, the network computes a posterior probability close to 0.5 for each of the two peaks, although the exact values fluctuate depending on the locations of the two bars. The network thus interprets
Figure 10: Encoding multiple motions: Motion in opposite directions. (A) Six images from an input sequence depicting two bright bars moving in opposite directions. (B) The panels show the activities in the motion-detection network (left) and posterior probabilities (right) for the image sequence in A. Note the two large peaks in activity moving in opposite directions, indicating the presence of both a leftward- and a rightward-moving bar. The corresponding posterior probabilities remain close to 0.5 throughout the sequence (after an initial transient).
the two simultaneously moving stimuli as providing evidence for two of the states it can represent, each state representing motion at a particular location and direction. Such a representation could be fed to a higher-level network, whose states could represent frequently occurring combinations of lower-level states, in a manner reminiscent of the hierarchical organization of sensory and motor cortical areas (Felleman & Van Essen, 1991). We discuss possible hierarchical extensions of the model in section 8.2.

5.2.3 Encoding Multiple Velocities. The network described is designed to estimate the posterior probability of a stimulus moving at a fixed velocity, as given by the transition probability matrix (see Figure 6C). How does the network respond to velocities other than the ones encoded explicitly in the transition probability matrix?
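Velocity-specific transition matrices of the kind compared below (and shown in Figures 6C and 11) can be sketched as columns of gaussians centered at k + Δx. Here σ = 1 and simple column renormalization at the boundaries are illustrative simplifications of the text's construction, which instead assigns uniformly random values to the boundary states:

```python
import numpy as np

def motion_transitions(n_loc, dx, sigma=1.0):
    """Transition matrix for one direction-selective chain (section 5.2).

    Column k holds P(theta_i^t | theta_k^{t-1}): a gaussian centered at
    location k + dx (use a negative dx for the leftward chain).  Boundary
    columns are simply renormalized, a simplification of the text's
    uniform boundary treatment.
    """
    i = np.arange(n_loc)[:, None]        # destination locations
    k = np.arange(n_loc)[None, :]        # source locations
    T = np.exp(-((i - (k + dx)) ** 2) / (2.0 * sigma ** 2))
    return T / T.sum(axis=0)             # normalize each column to sum to 1

T_slow = motion_transitions(15, dx=1)    # speed 1 pixel/step (cf. Figure 11A)
T_fast = motion_transitions(15, dx=4)    # speed 4 pixels/step (cf. Figure 11C)
print(np.argmax(T_slow[:, 7]), np.argmax(T_fast[:, 7]))   # peaks at 8 and 11
```

Feeding either matrix to the pseudoinverse procedure of equation 4.2 yields the recurrent weights of the corresponding network; the shift Δx sets the preferred speed, as in the two networks examined next.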
We investigated the question of velocity encoding by using two separate networks based on two transition probability matrices. The recurrent connections in the first network encoded a speed of 1 pixel per time step (see Figure 11A), while those in the second network encoded a speed of 4 pixels per time step (see Figure 11C). Each network was exposed to velocities ranging from 0.5 to 5 pixels per time step in increments of 0.5. The resulting responses of a neuron located in the middle of each network are shown in Figures 11B and 11D, respectively. The bars depict the peak response of the neuron during stimulus motion at a particular velocity in the neuron's preferred and nonpreferred (null) direction. The response is plotted as a posterior probability for convenience (see equation 3.4). As evident from the figure, the neurons exhibit a relatively broad tuning curve centered on their respective preferred velocities (1 and 4 pixels per time step, respectively). These results suggest that the model can encode multiple velocities using a tuning curve even though the recurrent connections are optimized for a single velocity. In other words, the posterior distribution over velocities can be represented by a fixed number of networks optimized for a fixed set of preferred velocities, in much the same way as a fixed number of neurons can sample the continuous space of orientation, position, or any other continuous state variable. In addition, as discussed in section 8.1, synaptic learning rules could be used to allow the networks to encode the most frequently occurring stimulus velocities, thereby allowing the state space of velocities to be tiled in an adaptive manner.

5.3 Example 3: Bayesian Decision Making in a Random Dots Task. To establish a connection to behavioral data, we consider the well-known random dots motion discrimination task (see, e.g., Shadlen & Newsome, 2001).
The stimulus consists of an image sequence showing a group of moving dots, a fixed fraction of which are randomly selected at each frame and moved in a fixed direction (e.g., either left or right). The rest of the dots are moved in random directions. The fraction of dots moving in the same direction is called the coherence of the stimulus. Figure 12A depicts the stimulus for two different levels of coherence. The task is to decide the direction of motion of the coherently moving dots for a given input sequence. A wealth of data exists on the psychophysical performance of humans and monkeys, as well as on the neural responses in brain areas such as MT and LIP in monkeys performing the task (see Shadlen & Newsome, 2001, and references therein). Our goal is to explore the extent to which the proposed model can explain the existing data for this task and make testable predictions. The motion detection network in the previous section can be used to decide the direction of coherent motion by computing the posterior probabilities for leftward and rightward motion, given the input images. These probabilities can be computed by marginalizing the posterior distribution
Figure 11: Encoding multiple velocities. (A) Transition probability matrix that encodes a stimulus moving at a speed of 1 pixel per time step in either direction (see also Figure 6C). (B) Output of a neuron located in the middle of the network as a function of stimulus velocity. The peak outputs of the neuron during stimulus motion in both the neuron's preferred ("Pref") direction and the null direction are shown as posterior probabilities. Note the relatively broad tuning curve indicating the neuron's ability to encode multiple velocities. (C) Transition probability matrix that encodes a stimulus moving at a speed of 4 pixels per time step. (D) Peak outputs of the same neuron as in B (but with recurrent weights approximating the transition probabilities in C). Once again, the neuron exhibits a tuning curve (now centered around 4 pixels per time step) when tested on multiple velocities. Note the shift in velocity sensitivity due to the change in the transition probabilities. A small number of such networks can compute a discrete posterior distribution over the space of stimulus velocities.
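The recurrent weights in each network approximate a transition probability matrix tuned to one speed. The sketch below is an illustration, not the article's implementation: the function name, the 1D wrap-around state space, and the use of deterministic transitions are all assumptions. It shows how such a matrix shifts a probability distribution over positions by the encoded speed at each time step:

```python
import numpy as np

def transition_matrix(n_pos, speed):
    """Transition probabilities for rightward motion at `speed` pixels
    per time step on a 1D wrap-around position space (assumed)."""
    T = np.zeros((n_pos, n_pos))
    for i in range(n_pos):
        T[(i + speed) % n_pos, i] = 1.0  # P(next = i + speed | current = i)
    return T

# A distribution peaked at position 10 moves to 11 under speed 1
# and to 14 under speed 4 after one prediction step.
n = 30
p = np.zeros(n)
p[10] = 1.0
T1 = transition_matrix(n, 1)
T4 = transition_matrix(n, 4)
print(np.argmax(T1 @ p))  # 11
print(np.argmax(T4 @ p))  # 14
```

A stimulus moving at a nonpreferred speed would only partially match the shift encoded by a given matrix, which is one way to picture the broad velocity tuning curves in Figures 11B and 11D.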
computed by the neurons for leftward (L) and rightward (R) motion over all spatial positions i:

P(L | I(t), ..., I(1)) = Σ_i P(θ_i^L | I(t), ..., I(1))    (5.4)
P(R | I(t), ..., I(1)) = Σ_i P(θ_i^R | I(t), ..., I(1)).    (5.5)
Note that the log of these marginal probabilities can also be computed directly from the log posteriors (i.e., the outputs of the neurons) using the log-sum approximation method described in sections 3 and 4. The log posterior ratio r(t) of leftward over rightward motion can be used to decide the direction of motion:

r(t) = log P(L | I(t), ..., I(1)) − log P(R | I(t), ..., I(1))    (5.6)
     = log [P(L | I(t), ..., I(1)) / P(R | I(t), ..., I(1))].    (5.7)
If r(t) > 0, the evidence seen so far favors leftward motion, and vice versa for r(t) < 0. The instantaneous ratio r(t) is susceptible to rapid fluctuations due to the noisy stimulus. We therefore use the following decision variable d_L(t) to track the running average of the log posterior ratio of L over R,

d_L(t + 1) = d_L(t) + α(r(t) − d_L(t)),    (5.8)
and likewise for d_R(t) (the parameter α is between 0 and 1). We assume that the decision variables are computed by a separate set of "decision neurons" that receive inputs from the motion detection network. These neurons are once again leaky-integrator neurons as described by equation 5.8, with the driving inputs r(t) determined by inhibition between the summed inputs from the two chains in the motion detection network (as in equation 5.6). The output of the model is L if d_L(t) > c and R if d_R(t) > c, where c is a "confidence threshold" that depends on task constraints (e.g., accuracy versus speed requirements; Reddi & Carpenter, 2000).

Figure 12: Facing page. Output of decision neurons in the model. (A) Depiction of the random dots task. Two different levels of motion coherence (50% and 100%) are shown. A 1D version of this stimulus was used in the model simulations. (B, C) Outputs d_L(t) and d_R(t) of model decision neurons for two different directions of motion. The decision threshold is shown as a horizontal line (labeled c in B). (D) Outputs of decision neurons for three different levels of motion coherence. Note the increase in the rate of evidence accumulation at higher coherencies. For a fixed decision threshold, the model predicts faster reaction times for higher coherencies (dotted arrows). (E) Activity of a neuron in area FEF for a monkey performing an eye movement task (from Schall & Thompson, 1999, with permission, from the Annual Review of Neuroscience, Volume 22, copyright 1999 by Annual Reviews, www.annualreviews.org). Faster reaction times were associated with a more rapid rise to a fixed threshold (see the three different neural activity profiles). The arrows point to the initiation of eye movements, which are depicted at the top.
(F) Averaged firing rate over time of 54 neurons in area LIP during the random dots task, plotted as a function of motion coherence (from Roitman & Shadlen, 2002, with permission, from the Journal of Neuroscience, Volume 22, copyright 2002 by the Society for Neuroscience). Solid and dashed curves represent trials in which the monkey judged motion direction toward and away from the receptive field of a given neuron, respectively. The slope of the response is affected by motion coherence (compare, e.g., responses for 51.2% and 25.6%) in a manner similar to the model responses shown in D.
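The decision computation of equations 5.4 through 5.8, together with the log prior initialization of equations 6.1 and 6.2, can be sketched in a few lines. This is an illustrative reconstruction, not the article's simulation code: the function names, the parameter values, and the input log posteriors are made up for the example.

```python
import numpy as np

def log_marginal(log_post):
    """Log of a marginal probability from a vector of log posteriors
    (cf. equations 5.4 and 5.5), via a numerically stable log-sum-exp."""
    m = np.max(log_post)
    return m + np.log(np.sum(np.exp(log_post - m)))

def decide(logp_L_seq, logp_R_seq, alpha=0.2, c=1.0, prior_L=0.5):
    """Leaky accumulation of the log posterior ratio (equations 5.6 and 5.8),
    with decision variables initialized to log prior ratios (equations 6.1, 6.2)."""
    d_L = np.log(prior_L) - np.log(1.0 - prior_L)
    d_R = -d_L
    for lp_L, lp_R in zip(logp_L_seq, logp_R_seq):
        r = log_marginal(lp_L) - log_marginal(lp_R)  # equation 5.6
        d_L += alpha * (r - d_L)                     # equation 5.8
        d_R += alpha * (-r - d_R)
        if d_L > c:
            return 'L'
        if d_R > c:
            return 'R'
    return None  # no decision reached within the trial

# Made-up log posteriors that consistently favor leftward motion:
L_seq = [np.array([-0.5, -1.0, -2.0])] * 50
R_seq = [np.array([-2.0, -2.5, -3.0])] * 50
print(decide(L_seq, R_seq))  # 'L'
```

Stronger evidence (a larger gap between the two sets of log posteriors, as with higher coherence) makes r(t) larger in magnitude, so d_L reaches the threshold c in fewer steps, mirroring the faster rise at higher coherencies in Figure 12D.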
Figures 12B and 12C show the responses of the two decision neurons over time for two different directions of motion and two levels of coherence. Besides correctly computing the direction of coherent motion in each case, the model also responds faster when the stimulus has higher coherence. This phenomenon can be appreciated more clearly in Figure 12D, which predicts progressively shorter reaction times for increasingly coherent stimuli (dotted arrows).

5.3.1 Comparison to Neurophysiological Data. The relationship between faster rates of evidence accumulation and shorter reaction times has received experimental support from a number of studies. Figure 12E shows the activity of a neuron in the frontal eye fields (FEF) for fast, medium, and slow responses to a visual target (Schall & Hanes, 1998; Schall & Thompson, 1999). Schall and collaborators have shown that the distribution of monkey response times can be reproduced using the time taken by neural activity in FEF to reach a fixed threshold (Hanes & Schall, 1996). A similar rise-to-threshold model by Carpenter and colleagues has received strong support in human psychophysical experiments that manipulate the prior probabilities of targets (Carpenter & Williams, 1995) and the urgency of the task (Reddi & Carpenter, 2000) (see section 6.2). In the case of the random dots task, Shadlen and collaborators have shown that in primates, one of the cortical areas involved in making the decision about coherent motion direction is area LIP. The activities of many neurons in this area progressively increase during the motion-viewing period, with faster rates of rise for more coherent stimuli (see Figure 12F) (Roitman & Shadlen, 2002).
This behavior is similar to the responses of "decision neurons" in the model (see Figures 12B–D), suggesting that the outputs of the recorded LIP neurons could be interpreted as representing the log posterior ratio of one task alternative over another (see Carpenter & Williams, 1995; Gold & Shadlen, 2001, for related suggestions).

5.3.2 Performance of the Network. We examined the performance of the network as a function of motion coherence for different values of the decision threshold. As shown in Figure 13, the results exhibit two trends that are qualitatively similar to human and monkey psychophysical performance on this task. First, for any fixed decision threshold T, the accuracy rate (% correct responses) increases from chance levels (50% correct) to progressively higher values (up to 100% correct for higher thresholds) as the motion coherence is increased from 0 to 1. These graphs are qualitatively similar to psychometric performance curves obtained in human and monkey experiments (see, e.g., Britten, Shadlen, Newsome, & Movshon, 1992). We did not attempt to quantitatively fit the model curves to psychophysical data because the model currently operates only on 1D images.

Figure 13: Performance of the model network as a function of motion coherence. The three curves show the accuracy rate (% correct responses) of the motion network in discriminating between the two directions of motion in the random dots task for three different decision thresholds T. Two trends, which are also observed in human and monkey performance on this task, are the rapid increase in accuracy as a function of motion coherence (especially for large thresholds) and the increase in accuracy for each coherence value with increasing thresholds, with an asymptote close to 100% for large enough thresholds.

A second trend is observed when the decision threshold T is varied. As the threshold is increased, the accuracy also increases for each coherence value, revealing a set of curves that asymptote around 95% to 100% accuracy for large enough thresholds. However, as expected, larger thresholds also result in longer reaction times, suggesting a correlate in the model for the type of speed-accuracy trade-offs typically seen in psychophysical discrimination and search experiments. The dependence of accuracy on decision threshold suggests a model for handling task urgency: faster decisions can be made by lowering the decision threshold, albeit at the expense of some accuracy (as illustrated in Figure 13). Predictions of such a model for incorporating urgency in the decision-making process are explored in section 6.2.

6 Model Predictions

6.1 Effect of Multiple Stimuli in a Receptive Field. A first prediction of the model arises from the log probability interpretation of firing rates and the effects of normalization in a probabilistic system. In a network of neurons in which each neuron encodes the probability of a particular state, the probabilities should sum to one. Thus, if an input contains multiple stimuli consistent with multiple states, the probability for each state would be appropriately reduced to reflect probability normalization.
The current model predicts that one should see a concomitant logarithmic decrease in the firing rates, as illustrated in the example below. Consider the simplest case of two static stimuli in the receptive field, reflecting two different states encoded by two neurons (say, 1 and 2). Let v_1 and v_2 be the outputs (log posteriors) of neurons 1 and 2, respectively, when their preferred stimulus is shown alone in the receptive field. In the ideal noiseless case, both log posteriors would be zero. Suppose both stimuli are now shown simultaneously. The model predicts that the new outputs v'_1 and v'_2 should obey the relation e^{v'_1} + e^{v'_2} = 1. If both states are equally likely, e^{v'_1} = 0.5, that is, v'_1 = −0.69, a nonlinear decrease from the initial value of zero. To see the impact on firing rates, consider the case where the neuronal firing rate f is proportional to the log posterior, that is, f = [c · v + m]_+, where c is a positive constant and m is the maximum firing rate (see section 7.1 for a discussion). Then the new firing rate f'_1 of neuron 1 is related to the old firing rate f_1 (= m) by f'_1 = −0.69c + m = f_1 − 0.69c. Note that this is different from a simple halving of the single-stimulus firing rate, as might be predicted from a linear model. Thus, the model predicts that in the presence of multiple stimuli, the firing rate of a cell encoding one of the stimuli will be nonlinearly reduced in such a way that the sum of the probabilities encoded by the network approximates 1. Evidence for nonlinear reductions in firing rates for multiple stimuli in the receptive field comes from some experiments in V1 (DeAngelis, Robson, Ohzawa, & Freeman, 1992) and other attention-related experiments in higher cortical areas (Reynolds, Chelazzi, & Desimone, 1999). The existing data are, however, insufficient to quantitatively validate or falsify the above prediction.
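The arithmetic of this prediction can be checked directly. The values of c and m below are arbitrary choices for illustration, not values used in the article:

```python
import numpy as np

c, m = 10.0, 100.0  # gain and maximum firing rate (arbitrary example values)

def rate(v):
    """Firing rate from log posterior v: f = [c*v + m]_+ (cf. equation 7.1)."""
    return max(c * v + m, 0.0)

f1 = rate(0.0)           # single stimulus: posterior = 1, so v = 0
v1_new = np.log(0.5)     # two equally likely stimuli: e^{v'_1} + e^{v'_2} = 1
f1_new = rate(v1_new)

# The rate drops by 0.69*c (here ~6.9 Hz), not by half:
print(f1, f1_new)
```

For these values the rate falls from 100 Hz to about 93.1 Hz, a subtractive rather than divisive change, which is the signature the model predicts for multiple stimuli in the receptive field.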
More sophisticated experiments involving multiunit recordings may be needed to shed light on this prediction derived from the log probability hypothesis of neural coding.

6.2 Effect of Changing Priors and Task Urgency. The strength of a Bayesian model lies in its ability to predict the effects of varying prior knowledge or various types of biases in a given task. We explored the consequences of varying two types of biases in the motion discrimination task: (1) changing the prior probability of a particular motion direction (L or R) and (2) lowering the decision threshold to favor speed over accuracy. Changing the prior probability for a particular motion direction amounts to initializing the decision neuron activities d_L and d_R to the appropriate log prior ratios:

d_L(0) = log P(L) − log P(R)    (6.1)
d_R(0) = log P(R) − log P(L),    (6.2)
where P(L) and P(R) denote the prior probabilities of leftward and rightward motion, respectively. Note that in the equiprobable case, these variables are initialized to 0, as in the simulations presented thus far.

Figure 14: Reaction time distributions as a function of prior probability and urgency. (A) The left panel shows the distribution of reaction times for the motion discrimination task obtained from model simulations when leftward and rightward motion are equiprobable. The right panel shows the distribution for the case where leftward motion is more probable than rightward motion (0.51 versus 0.49). Note the leftward shift of the distribution toward shorter reaction times. Both distributions were restricted to leftward-motion trials only. (B) The right panel shows the result of halving the decision threshold, simulating an urgency constraint that favors speed over accuracy. Compared to the nonurgent case (left panel), the reaction time distribution again exhibits a leftward shift toward faster (but potentially inaccurate) responses.

We examined the distribution of reaction times obtained as a result of assuming a slightly higher prior probability for L than for R (0.51 versus 0.49). Figure 14A compares this distribution (shown on the right) to the distribution obtained for the unbiased (equiprobable) case (shown on the left). Both distributions are for trials in which the direction of motion was L. The model predicts that the entire distribution for a particular task choice should
shift leftward (toward shorter reaction times) when the subject assumes a higher prior for that particular choice. Similarly, when time is at a premium, a subject may opt to lower the decision threshold to reach decisions quickly, albeit at the risk of making some incorrect decisions. Such an urgency constraint was simulated in the model by lowering the decision threshold to half its original value, keeping the stimulus coherence constant (60% in this case). As shown in Figure 14B, the model again predicts a leftward shift of the entire reaction time histogram, corresponding to faster (but sometimes inaccurate) responses. Although these predictions are yet to be tested in the context of the motion discrimination task (see Mazurek, Ditterich, Palmer, & Shadlen, 2001, for some preliminary results), psychophysical experiments by Carpenter and colleagues lend some support to the model's predictions. In these experiments (Carpenter & Williams, 1995), a human subject was required to make an eye movement to a dim target that appeared randomly to the left or right of a fixation point after a random delay of 0.5 to 1.5 seconds. The latency distributions of the eye movements were recorded as a function of different prior probabilities for one of the targets. As predicted by the model for the motion discrimination task (see Figure 14A), the response latency histogram was observed to shift leftward for increasing prior probabilities. In a second set of experiments, Reddi and Carpenter (2000) examined the effects of imposing urgency constraints on the eye movement task. They asked subjects to respond as rapidly as possible and to worry less about the accuracy of their responses. Once again, the latency histogram shifted leftward, similar to the model simulation results in Figure 14B.
Furthermore, in a remarkable follow-up experiment, Reddi and Carpenter (2000) showed that this leftward shift in the histogram due to urgency could be countered by a corresponding rightward shift due to a lower prior probability for one of the targets, resulting in no overall shift in the latency histogram. This provides strong evidence for a rise-to-threshold model of decision making, in which the threshold is chosen according to the constraints of the task at hand.

7 Discussion

In this section, we address the problem of how neural activity may encode log probability values, which are nonpositive. We also discuss the benefits of probabilistic inference in the log domain for neural systems and conclude the section by summarizing related work in probabilistic neural models and probabilistic decision making.

7.1 Neural Encoding of Log Probabilities. The model introduced above assumes that the outputs of neurons in a recurrent network reflect the log posterior probabilities of a particular state, given a set of inputs. However, log probabilities take on nonpositive values between −∞ and 0, raising two
important questions: (1) How are these nonpositive values encoded using firing rates or a spike-based code? (2) How do these codes capture the precision and range of probability values that can be encoded? We address these issues in the context of rate-based and spike-based encoding models. A simple encoding model is to assume that firing rates are linearly related to the log posteriors encoded by the neurons. Thus, if v_i denotes the log posterior encoded by neuron i, then the firing rate of the neuron is given by

f_i = [c · v_i + m]_+,    (7.1)

where c is a positive gain factor, m is the maximum firing rate of the neuron, and [·]_+ denotes rectification ([x]_+ = x if x > 0 and 0 otherwise). Note that the firing rate f_i attains the maximum value m when the log posterior v_i = 0, that is, when the posterior probability P(θ_i^t | I(t), ..., I(1)) = 1. Likewise, f_i attains its minimal value of 0 for posteriors below e^{−m/c}. Thus, the precision and range of probability values that can be encoded in the firing rate are governed by both m and c. Since e^{−x} quickly becomes negligible (e.g., e^{−10} = 0.000045), for typical values of m in the range 100 to 200 Hz, values of c in the range 10 to 20 would allow the useful range of probability values to be accurately encoded in the firing rate. These firing rates can be easily decoded when received as synaptic inputs using the inverse relationship v_i = (f_i − m)/c. A second, straightforward but counterintuitive encoding model is to assume that f_i = −c · v_i, up to some maximum value m.² In this model, a high firing rate implies a low posterior probability, that is, an unexpected or novel stimulus. Such an encoding model allows neurons to be interpreted as novelty detectors. Thus, the suppression of neural responses in some cortical areas due to the addition of contextual information, such as surround effects in V1 (Knierim & Van Essen, 1992; Zipser, Lamme, & Schiller, 1996), or due to increasing familiarity with a stimulus, such as response reduction in IT due to stimulus repetition (Miller, Li, & Desimone, 1991), can be interpreted as increases in the posterior probabilities of the state encoded by the recorded neurons. Such a model of response suppression is similar in spirit to predictive coding models (Srinivasan, Laughlin, & Dubs, 1982; Rao & Ballard, 1999) but differs from these earlier models in assuming that the responses represent log posterior probabilities rather than prediction errors.
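This rate-based encoding and its inverse can be sketched directly. The values of c and m follow the ranges mentioned above; the function names are assumptions made for the example:

```python
import numpy as np

def encode(v, c=10.0, m=100.0):
    """Firing rate from log posterior v (equation 7.1): f = [c*v + m]_+."""
    return np.maximum(c * v + m, 0.0)

def decode(f, c=10.0, m=100.0):
    """Recover the log posterior from the firing rate: v = (f - m)/c."""
    return (f - m) / c

# Posteriors above e^{-m/c} = e^{-10} are encoded faithfully:
v = np.log(0.25)
f = encode(v)
print(f, np.exp(decode(f)))   # rate ~86.1 Hz, recovered posterior 0.25

# Posteriors below e^{-10} are clipped to a zero firing rate:
print(encode(np.log(1e-6)))   # 0.0, since log(1e-6) < -m/c
```

The clipping at e^{−m/c} makes concrete how m and c jointly set the precision and range of probability values the rate code can carry.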
A spike-based encoding model can be obtained by interpreting the leaky integrator equation (see equation 2.1) in terms of the membrane potential of a neuron rather than its firing rate. Such an interpretation is consistent with the traditional RC-circuit model of membrane potential dynamics (see, e.g., Koch, 1999). We assume that the membrane potential V_i^m is

² A different model, also based on a negative log encoding scheme, has been suggested by Barlow (1969, 1972).
proportional to the log posterior v_i,

V_i^m = k · v_i + T,    (7.2)

where k is a constant gain factor and T is a constant offset. The dynamics of the neuron's membrane potential are then given by

dV_i^m/dt = k dv_i/dt,    (7.3)

where dv_i/dt is computed as described in sections 2 and 3. Thus, in this model, changes in the membrane potential due to synaptic inputs are proportional to changes in the log posterior being represented by the neuron. More interestingly, if T is regarded as analogous to the spiking threshold, one can define the probability of neuron i spiking at time t + 1 as

P(spike_i(t + 1) | V_i^m(t + 1)) = e^{(V_i^m(t+1) − T)/k} = e^{v_i(t+1)} = P(θ_i^t | I(t), ..., I(1)).    (7.4)
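Equation 7.4 can be illustrated with a short sketch. The values of k and T are arbitrary, and the log posterior value is made up; the point is only that inverting equation 7.2 inside the exponential recovers the posterior exactly:

```python
import numpy as np

k, T = 5.0, -65.0  # gain and threshold/offset (arbitrary example values)

def membrane_potential(v):
    """V = k*v + T (equation 7.2), with v the log posterior."""
    return k * v + T

def spike_probability(V):
    """P(spike) = e^{(V - T)/k} (equation 7.4)."""
    return np.exp((V - T) / k)

# The spike probability equals the posterior probability of the
# neuron's preferred state:
v = np.log(0.7)  # assume P(theta_i | inputs) = 0.7
V = membrane_potential(v)
print(spike_probability(V))  # 0.7
```

In this reading, a PSTH-derived spiking probability is a direct estimate of the posterior probability of the neuron's preferred state.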
Such a model for spiking can be viewed as a variant of the noisy threshold model, in which the cell spikes if its potential exceeds a noisy threshold. As defined above, the spiking probability of a neuron becomes exactly equal to the posterior probability of the neuron's preferred state θ_i given the inputs (see also Anastasio, Patton, & Belkacem-Boussaid, 2000; Hoyer & Hyvärinen, in press). This provides a new interpretation of traditional neuronal spiking probabilities computed from a poststimulus time histogram (PSTH). An intriguing open question, given such an encoding model, is how these spike probabilities are decoded "on-line" from input spike trains by synapses and converted to the log domain to influence the membrane potential as in equation 7.3. We hope to explore this question in future studies.

7.2 Benefits of Probabilistic Inference in the Log Domain. A valid question to ask is why a neural circuit should encode probabilities in the log domain in the first place. Indeed, it has been suggested that multiplicative interactions between inputs may occur in the dendrites of cortical cells (Mel, 1993), which could perhaps allow equation 3.2 to be implemented directly in a recurrent circuit (cf. Bridle, 1990). However, there are several reasons why representing probabilities in the log domain could be beneficial to a neural system:

• Neurons have a limited dynamic range. A logarithmic transformation allows the most useful range of probabilities to be represented by a small number of activity levels (see the previous discussion on neural encoding of log probabilities).

• Implementing probabilistic inference in the log domain allows the use of addition and subtraction instead of multiplication and division. The former operations are readily implemented through excitation and inhibition in neural circuits, while the latter are typically much harder to implement neurally. As shown earlier in this article, a standard leaky-integrator model suffices to implement probabilistic inference for a hidden Markov model if performed in the log domain.

• There is growing evidence that the brain uses quantities such as log-likelihood ratios during decision making (see section 7.4). These quantities can be readily computed if neural circuits are already representing probabilities in the log domain.

Nevertheless, it is clear that representing probabilities in the log domain also makes certain operations, such as addition or convolution, harder to implement. Specific examples include computing the OR of two distributions or convolving a given distribution with noise. To implement such operations, neural circuits operating in the log domain would need to resort to approximations, such as the log-sum approximation for addition discussed in section 4.

7.3 Neural Encoding of Priors. An important question for Bayesian models of brain function is how prior knowledge about the world can be represented and used during neural computation. In the model discussed in this article, "priors" influence computation in two ways. First, long-term prior knowledge about the statistical regularities of the environment is stored within the recurrent and feedforward connections of the network. Such knowledge could be acquired through evolution or learning, or some combination of the two. For example, the transition probabilities stored in the recurrent connections of the motion detection network capture prior knowledge about the expected trajectory of a moving stimulus at each location. This stored knowledge could be changed by adapting the recurrent weights if the statistics of moving stimuli in the organism's environment suddenly change.
Similarly, the likelihood function stored in the feedforward connections captures properties of the physical world and can be adapted according to the statistics of the inputs. The study of such stored "priors" constitutes a major research direction in Bayesian psychophysics (e.g., Bloj et al., 1999; Weiss et al., 2002; Mamassian, Landy, & Maloney, 2002). Second, at shorter timescales, recurrent activity provides "priors" for expected inputs based on the history of past inputs. These priors are then combined with the feedforward input likelihoods to produce the posterior probability distribution (cf. equations 3.3 and 3.7). Note that the priors provided by the recurrent activity at each time step reflect the long-term prior knowledge about the environment (the transition probabilities) stored in the recurrent connections.
7.4 Related Work. This article makes contributions in two related areas: neural models of probabilistic inference and models of probabilistic decision making. Below, we review previous work in these two areas with the aim of comparing and contrasting our approach with these earlier approaches.

7.4.1 Neural Models of Probabilistic Inference. Perhaps the earliest neural model of probabilistic inference is the Boltzmann machine (Hinton & Sejnowski, 1983, 1986), a stochastic network of binary units. Inference in the network proceeds by randomly selecting a unit at each time step and setting its activity to 1 with a probability given by a sigmoidal function of a weighted sum of its inputs. This update procedure, also known as Gibbs sampling, allows the network to converge to a set of activities characterized by the Boltzmann distribution (see, e.g., Dayan & Abbott, 2001, for details). The current model differs from the Boltzmann machine in both the underlying generative model and the inference mechanism. Unlike the Boltzmann machine, our model uses a generative model with explicit dynamics in the state-space and can therefore model probabilistic transitions between states. For inference, the model operates explicitly on probabilities (in the log domain), thereby avoiding the need for sampling-based inference techniques, which can be notoriously slow. These features also differentiate the proposed approach from other, more recent probabilistic neural network models such as the Helmholtz machine (Dayan et al., 1995; Lewicki & Sejnowski, 1997; Hinton & Ghahramani, 1997). Bridle (1990) first suggested that statistical inference for hidden Markov models could be implemented using a modification of the standard recurrent neural network. In his model, known as an Alpha-Net, the computation of the likelihood for an input sequence is performed directly using an analog of equation 3.2.
Thus, Bridle's network requires a multiplicative interaction of input likelihoods with the recurrent activity. The model we have proposed performs inference in the log domain, thereby requiring only additive interactions, which are more easily implemented in neural circuitry. In addition, experimental data from human and monkey decision-making tasks have suggested an interpretation of neural activities in terms of log probabilities, supporting a model of inference in the log domain. The log domain implementation, however, necessitates the approximation of the transition probabilities using equation 4.2, an approximation that is avoided if multiplicative interactions are allowed (as in Bridle's approach). A second set of probabilistic neural models stems from efforts to understand the encoding and decoding of information in populations of neurons. One class of approaches uses basis functions to represent probability distributions within neuronal ensembles (Anderson & Van Essen, 1994; Anderson, 1996; Deneve & Pouget, 2001; Eliasmith & Anderson, 2003). In this approach, a distribution P(x) over stimuli x is represented using a linear combination of basis functions (kernels),

P(x) = Σ_i r_i b_i(x),    (7.5)
where r_i is the normalized response (firing rate) and b_i the implicit basis function associated with neuron i in the population. The basis function of each neuron is assumed to be linearly related to the tuning function of the neuron as measured in physiological experiments. The basis function approach is similar to the approach proposed here in that the stimulus space is spanned by a limited number of neurons with preferred stimuli or state vectors. The two approaches differ in how probability distributions are decoded from neural responses, one using an additive method and the other a logarithmic transformation, as discussed above. A limitation of the basis function approach is that, due to its additive nature, it cannot represent distributions that are sharper than the component distributions. A second class of models addresses this problem using a generative approach, in which an encoding model is first assumed and a Bayesian decoding model is used to estimate the stimulus x (or its distribution), given a set of responses r_i (Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Pouget, Zhang, Deneve, & Latham, 1998; Zemel et al., 1998; Zemel & Dayan, 1999; Wu, Chen, Niranjan, & Amari, 2003). For example, in the distributional population coding (DPC) method (Zemel et al., 1998; Zemel & Dayan, 1999), the responses are assumed to depend on general distributions P(x), and a Bayesian method is used to decode a probability distribution over possible distributions over x. The best estimate in this method is not a single value of x but an entire distribution over x, which is assumed to be represented by the neural population. The underlying goal of representing entire distributions within neural populations is common to both the DPC approach and the model proposed in this article.
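The additive code of equation 7.5 can be sketched with gaussian kernels. This is an illustration only; the stimulus grid, kernel widths, and response values are all assumptions:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 201)           # discretized stimulus space (assumed)
centers = np.linspace(-4.0, 4.0, 9)       # preferred stimuli of 9 neurons
basis = np.array([np.exp(-(x - mu) ** 2 / 2.0) for mu in centers])
basis /= basis.sum(axis=1, keepdims=True)  # each kernel sums to 1 on the grid

# Normalized responses r_i weight the kernels: P(x) = sum_i r_i b_i(x)
r = np.zeros(len(centers))
r[4], r[5] = 0.7, 0.3   # made-up responses of the neurons preferring 0 and 1
P = r @ basis           # decoded distribution over the grid

print(P.sum())  # ~1.0: responses summing to 1 give a valid distribution
```

Because the decoded P is a sum of the kernels, it can never be narrower than a single kernel, which is the limitation of the additive code noted above.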
However, the approaches differ in how they achieve this goal: the DPC method assumes prespecified tuning functions for the neurons and a sophisticated, nonneural decoding operation, whereas the method introduced in this article directly instantiates a probabilistic generative model (here, a hidden Markov model) with a comparatively straightforward linear decoding operation, as described above. Several probabilistic models have also been suggested for solving specific problems in visual motion processing, such as the aperture problem (Simoncelli, 1993; Koechlin, Anton, & Burnod, 1999; Zemel & Dayan, 1999; Ascher & Grzywacz, 2000; Freeman, Haddon, & Pasztor, 2002; Weiss & Fleet, 2002; Weiss et al., 2002). These typically rely on a prespecified bank of spatiotemporal filters to generate a probability distribution over velocities, which is processed according to Bayesian principles. The model proposed in this article differs from these previous approaches in making use of an explicit generative model for capturing the statistics of time-varying inputs (i.e., a hidden Markov model). Thus, the selectivity for direction and velocity is an emergent property of the network and not a consequence of tuned spatiotemporal filters chosen a priori. Finally, Rao and Ballard have suggested a hierarchical model for neural computation based on Kalman filtering (or predictive coding) (Rao & Ballard, 1997, 1999; Rao, 1999). The Kalman filter model is based on a generative model similar to the one used in the hidden Markov model discussed in this article, except that the states are assumed to be continuous-valued, and both the observation and transition probability densities are assumed to be gaussian. As a result, the inference problem becomes one of estimating the mean and covariance of the gaussian distribution over the state at each time step. Rao and Ballard showed how the mean could be computed in a recurrent network in which feedback from a higher layer to the input layer carries predictions of expected inputs and the feedforward connections convey the errors in prediction, which are used to correct the estimate of the mean state. The neural estimation of the covariance was not addressed. In contrast to the Kalman filter model, the current model does not require the propagation of predictive error signals and allows the full distribution over states to be estimated, rather than just the mean. In addition, the observation and transition distributions can be arbitrary and need not be gaussian. The main drawback is the restriction to discrete states, which implies that distributions over continuous state variables are approximated using discrete samples. Fortunately, these discrete samples, which are the preferred states θ_i of the neurons in the network, can be adapted by changing the feedforward and feedback weights (W and M) in response to inputs. Thus, the distribution over a continuous state-space can be modeled to different degrees of accuracy in different parts of the state-space by a finite number of neurons that adapt their preferred states to match the statistics of the inputs. Possible procedures for adapting the feedforward and feedback weights are discussed in section 8.1.

7.4.2 Models of Bayesian Decision Making. Recent experiments using psychophysics and electrophysiology have shed light on the probabilistic basis of visual decision making in humans and monkeys.
There is converging evidence that decisions are made based on the log of the likelihood ratio of one alternative over another (Carpenter & Williams, 1995; Gold & Shadlen, 2001). Carpenter and colleagues have suggested a mathematical model called LATER (linear approach to threshold with ergodic rate) (Carpenter & Williams, 1995; Reddi & Carpenter, 2000) for explaining results from their visual decision-making task (described in section 6.2). In their model, the evidence for one alternative over another is accumulated in the form of a log-likelihood ratio at a constant rate for a given trial, resulting in a decision in favor of the first alternative if the accumulated variable exceeds a fixed threshold. This is similar to the decision-making process posited in this article, except that we do not assume a linear rise to threshold. Evidence for the LATER model is presented in Carpenter and Williams (1995) and Reddi and Carpenter (2000). A related and more detailed model has been proposed by Gold and Shadlen (2001) to explain results from the random dots task. The Gold-Shadlen model is related to "accumulator models" commonly used in psychology (Ratcliff, Zandt, & McKoon, 1999; Luce, 1986; Usher & McClelland, 2001) and treats decision making as a diffusion process that is biased by a log-likelihood ratio. Rather than assuming that entire probability distributions are represented within cortical circuits, as proposed in this article, the Gold-Shadlen model assumes that the difference between two firing rates (e.g., from two different motion-selective neurons in cortical area MT) can be interpreted as a log-likelihood ratio of one task alternative over another (e.g., one motion direction over another). We believe that there are enormous advantages to be gained if the brain could represent entire probability distributions of stimuli, because such a representation allows not just decision making when faced with a discrete number of alternatives but also a wide variety of other useful tasks related to probabilistic reasoning, such as estimation of an unknown quantity and planning in uncertain environments. A number of models of decision making, based typically on diffusion processes or races between variables, have been proposed in the psychological literature (Ratcliff et al., 1999; Luce, 1986; Usher & McClelland, 2001). Other recent models have focused on establishing a relation between the effects of rewards on decision making and concepts from decision theory and game theory (Platt & Glimcher, 1999; Glimcher, 2002). These models, like the models of Carpenter, Shadlen, and colleagues, are mathematical models aimed mainly at explaining behavioral data, and they are formulated at a higher level of abstraction than the neural implementation level. The model presented in this article seeks to bridge this gap by showing how rigorous mathematical models of decision making could be implemented within recurrent neural circuits. A recent model proposed by Wang (2002) shares the same goal but approaches the problem from a different viewpoint. Starting with a biophysical circuit that implements an "attractor network," Wang shows that the network exhibits many of the properties seen in cortical decision-making neurons.
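The LATER model's linear rise to threshold admits an especially compact simulation: on each trial the log-likelihood ratio climbs at a constant rate drawn afresh from a gaussian, so the reaction time is simply threshold divided by rate. The parameter values below are illustrative, not fitted to any data discussed here:

```python
import random

def later_rt(threshold=10.0, mean_rate=2.0, rate_sd=0.5):
    """Reaction time under a linear rise to a fixed threshold, with the
    accumulation rate redrawn on each trial (the 'ergodic rate' of LATER)."""
    r = random.gauss(mean_rate, rate_sd)
    while r <= 0:                 # discard the rare non-positive rates
        r = random.gauss(mean_rate, rate_sd)
    return threshold / r

random.seed(0)
rts = [later_rt() for _ in range(10000)]
# Lowering the threshold (greater urgency) should shorten reaction times,
# as in the urgency manipulations of Reddi and Carpenter (2000).
fast = [later_rt(threshold=5.0) for _ in range(10000)]
```

Because the rate varies across trials while the threshold stays fixed, the model produces the characteristic skewed reaction time distributions that motivate the "reciprocal latency" analyses used by Carpenter and colleagues.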
The inputs to Wang's decision-making model are not based on a generative model of time-varying inputs but are assumed to be two abstract inputs modeled as two Poisson processes whose rates are determined by gaussian distributions. The means of these distributions depend linearly on the motion coherence level, such that the two means approach each other as the coherence is reduced. Thus, the effects of motion coherence on the inputs to the decision-making process are built in rather than being computed from input images as in our model. Consequently, Wang's model, like the other decision-making models discussed here, does not directly address the issue of how probability distributions of stimuli may be represented and processed in neural circuits.

8 Conclusions and Future Work

We have shown that a recurrent network of leaky integrator neurons can approximate Bayesian inference for a hidden Markov model. Such networks have previously been used to model a variety of cortical response properties (Dayan & Abbott, 2001). Our results suggest a new interpretation of neural
activities in such networks as representing log posterior probabilities. Such a hypothesis is consistent with the suggestions made by Carpenter, Shadlen, and others based on studies in visual decision making. We illustrated the performance of the model using the examples of visual orientation discrimination and motion detection. In the case of motion detection, we showed how a model cortical network can exhibit direction selectivity and compute the log posterior probability of motion direction, given a sequence of input images. The model thus provides a probabilistic interpretation of direction-selective responses in cortical areas V1 and MT. We also showed how the outputs of the motion detection network could be used to decide the direction of coherent motion in the well-known random dots task. The activity of the "decision neurons" in the model during this task resembles the activity of evidence-accumulating neurons in LIP and FEF, two cortical areas known to be involved in visual decision making. The model correctly predicts reaction times as a function of stimulus coherence and produces reaction time distributions that are qualitatively similar to those obtained in eye movement experiments that manipulate prior probabilities of targets and task urgency. Although the results described thus far are encouraging, several important questions remain: (1) How can the feedforward weights W and recurrent weights M be learned directly from input data? (2) How can the approach be generalized to graphical models that are more sophisticated than hidden Markov models? (3) How can the approach be extended to incorporate the influence of rewards on Bayesian estimation, learning, and decision making? We conclude by discussing some potential strategies for addressing these questions in future studies.

8.1 Learning Synaptic Weights.
We intend to explore the question of synaptic learning using biologically plausible approximations to the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). For standard hidden Markov models, the E-step in the EM algorithm is realized by the "forward-backward" algorithm (Rabiner & Juang, 1986). The network described in this article implements the "forward" part of this algorithm and can be regarded as an approximate E-step for the case of on-line estimation. One possible likelihood function for the on-line case is:

$$Q_t(W, M) = \sum_j P(\theta_j^t \mid I(t), \ldots, I(1), W, M) \log P(\theta_j^t, I(t) \mid I(t-1), \ldots, I(1), W, M).$$

This function can be rewritten in terms of the outputs $v_j$ of the model network and the summed inputs $u_j$ from section 3 for the current values of the weights W and M:

$$Q_t(W, M) = \sum_j e^{v_j(t+1)} u_j(t). \tag{8.1}$$

(Footnote 3: This approximates the full on-line likelihood function $Q_t^F(W, M) = \sum_j P(\theta_j^t \mid I(t), \ldots, I(1), W, M) \log P(\theta_j^t, I(t), I(t-1), \ldots, I(1) \mid W, M)$.)
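The on-line likelihood of equation 8.1 is a simple function of the network variables. A minimal sketch with hypothetical values for the outputs and summed inputs (the gain computed at the end anticipates the learning rules 8.2 and 8.3 derived next):

```python
import math

def online_likelihood_Q(v_next, u):
    """Equation 8.1: Q_t(W, M) = sum_j exp(v_j(t+1)) * u_j(t), where the
    outputs v_j are log posteriors and the u_j are the summed inputs."""
    return sum(math.exp(vj) * uj for vj, uj in zip(v_next, u))

# Hypothetical values: a normalized log posterior over three states
v_next = [math.log(0.7), math.log(0.2), math.log(0.1)]
u = [1.5, 0.4, -0.2]
Q = online_likelihood_Q(v_next, u)     # 0.7*1.5 + 0.2*0.4 + 0.1*(-0.2)
g = [1.0 + ui - Q for ui in u]         # gain g_i(t) = 1 + u_i(t) - Q_t(W, M)
```

Note that because the outputs are a normalized log posterior, $Q_t$ is a posterior-weighted average of the summed inputs, so the gain $g_i$ is positive only for neurons whose input exceeds that average minus one.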
For synaptic learning, we can perform an on-line M-step to increase $Q_t$ through gradient ascent with respect to W and M (this actually implements a form of incremental or generalized EM algorithm; Neal & Hinton, 1998):

$$\Delta w_i(t) = \alpha \frac{\partial Q_t}{\partial w_i} = \alpha \epsilon\, g_i(t)\, e^{v_i(t+1)}\, I(t) \tag{8.2}$$

$$\Delta m_{ik}(t) = \beta \frac{\partial Q_t}{\partial m_{ik}} = \beta\, g_i(t)\, e^{v_i(t+1)}\, v_k(t), \tag{8.3}$$

where $\alpha$ and $\beta$ are learning rates, and $g_i(t) = (1 + u_i(t) - Q_t(W, M))$ is a gain function. The above expressions were derived by substituting the right-hand side of the equations for $v_i(t+1)$ and $u_i(t)$, respectively (see section 3), and approximating the gradient by retaining only the first-order terms pertaining to time step t. The learning rules above are interesting variants of the well-known Hebb rule (Hebb, 1949): the synaptic weight change is proportional to the product of the input ($I(t)$ or $v_k(t)$) with an exponential function of the output $v_i(t+1)$ (rather than simply the output), thereby realizing a soft form of winner-take-all competition. Furthermore, the weight changes are modulated by the gain function $g_i(t)$, which is positive only when $(u_i(t) - Q_t(W, M)) > -1$, that is, when $P(\theta_i^t, I(t) \mid I(t-1), \ldots, I(1), W, M) > (1/e)\,e^{Q_t(W,M)}$. Thus, both learning rules remain anti-Hebbian and encourage decorrelation (Földiák, 1990) unless the net feedforward and recurrent input ($u_i(t)$) is high enough to make the joint probability of the state i and the current input exceed the threshold of $(1/e)\,e^{Q_t(W,M)}$. In that case, the learning rules switch to a Hebbian regime, allowing the neuron to learn the appropriate weights to encode state i. An in-depth investigation of these and related learning rules is underway and will be discussed in a future article.

8.2 Generalization to Other Graphical Models. A second open question is how the suggested framework could be extended to allow neural implementation of hierarchical Bayesian inference in graphical models that are more general than hidden Markov models (cf. Dayan et al., 1995; Dayan & Hinton, 1996; Hinton & Ghahramani, 1997; Rao & Ballard, 1997, 1999). Such models would provide a better approximation to the multilayered architecture of the cortex and the hierarchical connections that exist between cortical areas (Felleman & Van Essen, 1991). In future studies, we hope to
investigate neural implementations of various types of graphical models by extending the method described in this article for implementing equation 3.2. In particular, equation 3.2 can be regarded as a special case of the general sum-product rule for belief propagation in a Bayesian network (Jordan & Weiss, 2002). Thus, in addition to incorporating "belief messages" from the previous time step and the current input as in equation 3.2, a more general rule for Bayesian inference would also incorporate messages from other hierarchical levels and potentially from future time steps. Investigating such generalizations of equation 3.2 and their neural implementation is an important goal of our ongoing research efforts.

8.3 Incorporating Rewards. A final question of interest is how the rewards associated with the various choices available in a task can be made to influence the activities of decision neurons and the decision-making behavior of the model. We expect ideas from reinforcement learning and Bayesian decision theory, as well as recent neurophysiological results in the monkey (Platt & Glimcher, 1999), to be helpful in guiding our research in this direction.

Acknowledgments

This research was supported by a Sloan Research Fellowship, NSF grant no. 130705, and an ONR Young Investigator Award. I am grateful to Peter Dayan, Sophie Deneve, Aaron Hertzmann, John Palmer, Michael Rudd, Michael Shadlen, Eero Simoncelli, and Josh Tenenbaum for discussions on various ideas related to this work. I also thank the two anonymous reviewers for their constructive comments and suggestions.

References

Anastasio, T. J., Patton, P. E., & Belkacem-Boussaid, K. (2000). Using Bayes' rule to model multisensory enhancement in the superior colliculus. Neural Computation, 12(5), 1165–1187. Anderson, C. H. (1996). Unifying perspectives on neuronal codes and processing. In E. Ludena, P. Vashishta, & R. Bishop (Eds.), Proceedings of the XIX International Workshop on Condensed Matter Theories.
New York: Nova Science. Anderson, C. H., & Van Essen, D. C. (1994). Neurobiological computational systems. In J. M. Zurada, R. J. Marks II, & C. J. Robinson (Eds.), Computational intelligence: Imitating life (pp. 213–222). New York: IEEE Press. Ascher, D., & Grzywacz, N. M. (2000). A Bayesian model for the measurement of visual velocity. Vision Research, 40, 3427–3434. Barlow, H. B. (1969). Pattern recognition and the responses of sensory neurons. Annals of the New York Academy of Sciences, 156, 872–881. Barlow, H. B. (1972). Single units and cognition: A neurone doctrine for perceptual psychology. Perception, 1, 371–394.
Bloj, M. G., Kersten, D., & Hurlbert, A. C. (1999). Perception of three-dimensional shape influences colour perception through mutual illumination. Nature, 402, 877–879. Bridle, J. S. (1990). Alpha-Nets: A recurrent "neural" network architecture with a hidden Markov model interpretation. Speech Communication, 9(1), 83–92. Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. Journal of Neuroscience, 12, 4745–4765. Carpenter, R. H. S., & Williams, M. L. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377, 59–62. Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press. Dayan, P., & Hinton, G. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403. Dayan, P., Hinton, G., Neal, R., & Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7, 889–904. DeAngelis, G. C., Robson, J. G., Ohzawa, I., & Freeman, R. D. (1992). Organization of suppression in receptive fields of neurons in cat visual cortex. J. Neurophysiol., 68(1), 144–163. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39, 1–38. Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neuroscience, 2(8), 740–745. Deneve, S., & Pouget, A. (2001). Bayesian estimation by interconnected neural networks (abstract no. 237.11). Society for Neuroscience Abstracts, 27. Eliasmith, C., & Anderson, C. H. (2003). Neural engineering: Computation, representation, and dynamics in neurobiological systems. Cambridge, MA: MIT Press. Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47. Földiák, P. (1990).
Forming sparse representations by local anti-Hebbian learning. Biol. Cybern., 64, 165–170. Freeman, W. T., Haddon, J., & Pasztor, E. C. (2002). Learning motion analysis. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 97–115). Cambridge, MA: MIT Press. Glimcher, P. (2002). Decisions, decisions, decisions: Choosing a biological science of choice. Neuron, 36(2), 323–332. Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends in Cognitive Sciences, 5(1), 10–16. Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430. Hebb, D. O. (1949). The organization of behavior. New York: Wiley. Hinton, G. E., & Brown, A. D. (2002). Learning to use spike timing in a restricted Boltzmann machine. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.),
Probabilistic models of the brain: Perception and neural function (pp. 285–296). Cambridge, MA: MIT Press. Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. Roy. Soc. Lond. B, 352, 1177–1190. Hinton, G., & Sejnowski, T. (1983). Optimal perceptual inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 448–453). New York: IEEE. Hinton, G., & Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press. Hoyer, P. O., & Hyvärinen, A. (in press). Interpreting neural response variability as Monte Carlo sampling of the posterior. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press. Jacobs, R. A. (2002). Visual cue integration for depth perception. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 61–76). Cambridge, MA: MIT Press. Jordan, M. I., & Weiss, Y. (2002). Graphical models: Probabilistic inference. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed.). Cambridge, MA: MIT Press. Knierim, J., & Van Essen, D. C. (1992). Neural responses to static texture patterns in area V1 of the alert macaque monkey. Journal of Neurophysiology, 67, 961–980. Knill, D. C., & Richards, W. (1996). Perception as Bayesian inference. Cambridge: Cambridge University Press. Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press. Koechlin, E., Anton, J. L., & Burnod, Y. (1999). Bayesian inference in populations of cortical neurons: A model of motion integration and segmentation in area MT. Biological Cybernetics, 80, 25–44. Lewicki, M. S., & Sejnowski, T. J. (1997).
Bayesian unsupervised learning of higher order structure. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 529–535). Cambridge, MA: MIT Press. Luce, R. D. (1986). Response times: Their role in inferring elementary mental organization. New York: Oxford University Press. Mamassian, P., Landy, M., & Maloney, L. T. (2002). Bayesian modelling of visual perception. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 61–76). Cambridge, MA: MIT Press. Mazurek, M. E., Ditterich, J., Palmer, J., & Shadlen, M. N. (2001). Effect of prior probability on behavior and LIP responses in a motion discrimination task (abstract no. 58.11). Society for Neuroscience Abstracts, 27. Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiology, 70(3), 1086–1101.
Miller, E. K., Li, L., & Desimone, R. (1991). A neural mechanism for working and recognition memory in inferior temporal cortex. Science, 254, 1377–1379. Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer. Platt, M. L., & Glimcher, P. W. (1999). Neural correlates of decision variables in parietal cortex. Nature, 400, 233–238. Pouget, A., Dayan, P., & Zemel, R. S. (2000). Information processing with population codes. Nature Reviews Neuroscience, 1(2), 125–132. Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10(2), 373–401. Rabiner, L., & Juang, B. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3, 4–16. Rao, R. P. N. (1999). An optimal estimation approach to visual perception and learning. Vision Research, 39(11), 1963–1989. Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9(4), 721–763. Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive field effects. Nature Neuroscience, 2(1), 79–87. Rao, R. P. N., Olshausen, B. A., & Lewicki, M. S. (2002). Probabilistic models of the brain: Perception and neural function. Cambridge, MA: MIT Press. Ratcliff, R., Zandt, T. V., & McKoon, G. (1999). Connectionist and diffusion models of reaction time. Psychological Review, 106, 261–300. Reddi, B. A., & Carpenter, R. H. (2000). The influence of urgency on decision time. Nature Neuroscience, 3(8), 827–830. Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. Journal of Neuroscience, 19, 1736–1753. Roitman, J. D., & Shadlen, M. N. (2002).
Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience, 22, 9475–9489. Schall, J. D., & Hanes, D. P. (1998). Neural mechanisms of selection and control of visually guided eye movements. Neural Networks, 11, 1241–1251. Schall, J. D., & Thompson, K. G. (1999). Neural selection and control of visually guided eye movements. Annual Review of Neuroscience, 22, 241–259. Seung, H. S. (1996). How the brain keeps the eyes still. Proceedings of the National Academy of Sciences, 93, 13339–13344. Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology, 86(4), 1916–1936. Simoncelli, E. P. (1993). Distributed representation and analysis of visual motion. Doctoral dissertation, MIT. Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B, 216, 427–459.
Usher, M., & McClelland, J. (2001). On the time course of perceptual choice: The leaky competing accumulator model. Psychological Review, 108, 550–592. Wang, X.-J. (2001). Synaptic reverberation underlying mnemonic persistent activity. Trends in Neuroscience, 24, 455–463. Wang, X.-J. (2002). Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36(5), 955–968. Weiss, Y., & Fleet, D. J. (2002). Velocity likelihoods in biological and machine vision. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 77–96). Cambridge, MA: MIT Press. Weiss, Y., Simoncelli, E. P., & Adelson, E. H. (2002). Motion illusions as optimal percepts. Nature Neuroscience, 5(6), 598–604. Wu, S., Chen, D., Niranjan, M., & Amari, S.-I. (2003). Sequential Bayesian decoding with a population of neurons. Neural Computation, 15, 993–1012. Zemel, R. S., & Dayan, P. (1999). Distributional population codes and multiple motion models. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 174–180). Cambridge, MA: MIT Press. Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10(2), 403–430. Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: A unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79(2), 1017–1044. Zipser, K., Lamme, V. A. F., & Schiller, P. H. (1996). Contextual modulation in primary visual cortex. Journal of Neuroscience, 16(22), 7376–7389. Received August 2, 2002; accepted June 4, 2003.
LETTER
Communicated by Kechen Zhang
Probability of Stimulus Detection in a Model Population of Rapidly Adapting Fibers
Burak Güçlü
[email protected]
Stanley J. Bolanowski
[email protected]
Institute for Sensory Research, Syracuse, NY 13244-5290, and Department of Bioengineering and Neuroscience, Syracuse University, Syracuse, NY 13244
The goal of this study is to establish a link between somatosensory physiology and psychophysics at the probabilistic level. The model for a population of monkey rapidly adapting (RA) mechanoreceptive fibers by Güçlü and Bolanowski (2002) was used to study the probability of stimulus detection when a 40 Hz sinusoidal stimulation is applied with a constant contactor size (2 mm radius) on the terminal phalanx. In the model, detection was assumed to be mediated by one or more active fibers. Two hypothetical receptive field organizations (uniformly random and gaussian) with varying average innervation densities were considered. At a given stimulus-contactor location, changing the stimulus amplitude generates sigmoid probability-of-detection curves for both receptive field organizations. The psychophysical results superimposed on these probability curves suggest that 5 to 10 active fibers may be required for detection. The effects of the contactor location on the probability of detection reflect the pattern of innervation in the model. However, the psychophysical data do not match the predictions from the populations with uniform or gaussian-distributed receptive field centers. This result may be due to some unknown mechanical factors along the terminal phalanx, or simply because a different receptive field organization is present. It has been reported that human observers can detect a single spike in an RA fiber. By considering the probability of stimulus detection across subjects and RA populations, this article proves that more than one active fiber is indeed required for detection.
1 Introduction

Rapidly adapting (RA) fibers are one of four classes of low-threshold mechanoreceptive nerve fibers responsible for the sense of touch in human glabrous skin (Knibestöl & Vallbo, 1970; Johansson, 1976, 1978). Fibers with similar response properties have been found in the monkey (Talbot, Darian-Smith,
Neural Computation 16, 39–58 (2004)
© 2003 Massachusetts Institute of Technology
40
B. Güçlü and S. Bolanowski
Kornhuber, & Mountcastle, 1968), cat (Jänig, Schmidt, & Zimmermann, 1968), raccoon (Pubols & Pubols, 1976), and other mammalian species. The sensory end organs of these fibers are Meissner corpuscles (Lindblom, 1965) located in the dermis. The response of RA fibers to sinusoidal displacements has been studied by numerous researchers (Talbot et al., 1968; Johnson, 1974; Goodwin, Youl, & Zimmerman, 1981; Johansson, Landström, & Lundström, 1982; Güçlü & Bolanowski, 2001, 2003). The average firing rate is a piecewise function of the stimulus amplitude, always made up of ramps and plateaus. This fact has been used in population models for RA fibers (Johnson, 1974; Goodwin & Pierce, 1981; Güçlü & Bolanowski, 1999, 2002). Johnson (1974) statistically modeled the distribution of rate-intensity functions. Goodwin and Pierce (1981) considered the case of an "average" RA fiber. Güçlü and Bolanowski (1999, 2002) presented a computational model that consists of anatomical distributions of receptive field centers as well as statistically modeled rate-intensity functions. Here, this latter model will be used to study the probability of activity in a model RA population caused by 40 Hz sinusoidal stimulation on the terminal phalanx. This frequency of stimulation is the best frequency for this fiber class (Talbot et al., 1968). RA fibers mediate the psychophysical NP I channel, which is one of the four channels that contribute to tactile sensitivity (Bolanowski, Gescheider, Verrillo, & Checkosky, 1988; Gescheider, Bolanowski, & Hardick, 2001). Activation of RA fibers produces a flutter sensation (Ochoa & Törebjörk, 1983), and it has been reported that a single action potential in an RA fiber can give rise to sensation (e.g., Vallbo, 1981). Hensel and Boman (1960) found that the thresholds eliciting a single impulse in a single mechanoreceptive fiber were comparable to those for arousing a subjective touch sensation.
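The ramp-and-plateau shape of these rate-intensity functions (formalized as equation 2.1 in the Methods) can be sketched directly; the threshold parameters and amplitudes below are illustrative, not measured values:

```python
def ra_rate(A, fs, a0, a1, a2, a3):
    """Ramp-and-plateau rate-intensity function of an RA fiber (the form
    of equation 2.1); requires a0 < a1 < a2 < a3."""
    if A < a0:
        return 0.0                                # below threshold: inactive
    if A < a1:
        return fs * (A - a0) / (a1 - a0)          # ramp to one spike per cycle
    if A < a2:
        return fs                                 # plateau: one spike per cycle
    if A < a3:
        return fs * (1.0 + (A - a2) / (a3 - a2))  # ramp to two spikes per cycle
    return 2.0 * fs                               # plateau: two spikes per cycle

# Illustrative parameters for a 40 Hz stimulus (amplitudes in arbitrary units)
rates = [ra_rate(A, fs=40.0, a0=1.0, a1=2.0, a2=5.0, a3=8.0)
         for A in (0.5, 1.5, 3.0, 6.5, 10.0)]
```

The plateaus at $f_s$ and $2f_s$ correspond to one and two entrained spikes per stimulus cycle, which is why the stimulus frequency appears as the natural unit of the firing rate.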
Vallbo and Johansson's experiments (1976) show that the neural and perceptive threshold curves coincide for RA units located on the distal phalanx. During percutaneous recording, a spike was almost always coincident with a "yes" response. Intraneural microstimulation studies also confirm this (Vallbo, Olsson, Westberg, & Clark, 1984). Therefore, one active fiber in an RA population may trigger a psychophysical response through the NP I channel, at least at the distal phalanx. The main goal of this study is to establish a reliable link between RA population physiology and NP I channel psychophysics. This task is accomplished by deriving a probabilistic measure, using the population model by Güçlü and Bolanowski (2002), that can be compared directly to psychophysical results (Güçlü, 2003). Specifically, the effects of stimulus amplitude, its location, and the average innervation density on the probability of detection in the model population are presented. The probability of detection is found by assuming that one or more active fibers contribute to the detection of the stimulus. Comparing the psychophysical probability of detection with the derived functions gives an estimate of how many fibers are necessary for detection.
The probability of at least one active fiber is equivalent to the probability of at least one spike in the population model and is the minimum limit for detection. Since the model does not include the temporal response properties of fibers, it is not reliable to count action potentials using the very low average firing rates at the detection level. Therefore, the probabilities of multiple spikes are not calculated. It is also important to note that the derived psychometric functions do not give probabilities of detection in a single subject but represent probabilities of activity across subjects. This is because the variation in the output of the population model is due to the variation of rate-intensity functions in the population and not to the variation of the output in a single fiber. Studying the probability of detection across subjects is unique in the literature, and it contributes further to the significance of the work presented here.

2 Methods

2.1 Theory. The responses of RA fibers to a 40 Hz sinusoidal steady-state stimulus are considered. It was demonstrated (Johnson, 1974; Güçlü & Bolanowski, 2001, 2003) that the rate-intensity functions of the fibers can be approximated by

$$r = \begin{cases} 0, & 0 < A < a_0 \\ \dfrac{f_s}{a_1 - a_0}(A - a_0), & a_0 < A < a_1 \\ f_s, & a_1 < A < a_2 \\ f_s\left(1 + \dfrac{A - a_2}{a_3 - a_2}\right), & a_2 < A < a_3 \\ 2f_s, & a_3 < A \end{cases} \tag{2.1}$$

for $a_0 < a_1 < a_2 < a_3$, where A is the effective amplitude at the fiber's receptive field center, $f_s$ the stimulus frequency, and ($a_0$, $a_1$, $a_2$, $a_3$) are the parameters that are distributed randomly among fibers. The first parameter $a_0$ ($\equiv a$) is of importance here, since the fiber will be considered to be active ($r \neq 0$) if A is larger than this parameter. It is important to note that the experiments to obtain the statistics regarding equation 2.1 were done using long (1 s) stimuli. Consequently, the following analyses presume constant-duration and sufficiently long stimuli. The probability of a psychophysical response is assumed to be equal to the probability of at least n active fibers in a population of N independent fibers indexed by $i, j, k, \ldots$ That is, at least n fibers should have nonzero average firing rates in the model. This can also be written as the complement
42
B. Gu¸ ¨ clu¨ and S. Bolanowski
of the probability of n − 1 or fewer active fibers:

$$\hat{P} = \Pr\{\text{at least } n \text{ active fibers}\}$$
$$= 1 - \big(\Pr\{\text{no active fibers}\} + \Pr\{1 \text{ active fiber}\} + \Pr\{2 \text{ active fibers}\} + \cdots + \Pr\{n-1 \text{ active fibers}\}\big)$$
$$= 1 - \prod_{i=1}^{N} \Pr\{\text{fiber } i \text{ inactive}\} - \sum_{i=1}^{N} \Big( \Pr\{\text{fiber } i \text{ active}\} \prod_{j \neq i} \Pr\{\text{fiber } j \text{ inactive}\} \Big)$$
$$\quad - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \Big( \Pr\{\text{fiber } i \text{ active}\} \Pr\{\text{fiber } j \text{ active}\} \prod_{\substack{k \neq i \\ k \neq j}} \Pr\{\text{fiber } k \text{ inactive}\} \Big) - \cdots$$
$$\quad - \sum_{i=1}^{N-n+1} \sum_{j=i+1}^{N-n+2} \cdots \sum_{y} \Big( \Pr\{\text{fiber } i \text{ active}\} \Pr\{\text{fiber } j \text{ active}\} \cdots \Pr\{\text{fiber } y \text{ active}\} \prod_{\substack{z \neq i,\, z \neq j \\ \ldots,\, z \neq y}} \Pr\{\text{fiber } z \text{ inactive}\} \Big). \tag{2.2}$$
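Because all fibers are drawn from the same stochastic model, this inclusion-exclusion expression collapses to a binomial tail (the simplification carried out next, in equation 2.3). A numerical sketch, with an illustrative fiber count and inactivity probability:

```python
from math import comb

def p_detect(n, N, p_inactive):
    """Probability that at least n of N independent, identically modeled
    fibers are active: one minus the binomial probability of n-1 or fewer."""
    p, q = p_inactive, 1.0 - p_inactive
    return 1.0 - sum(comb(N, k) * q**k * p**(N - k) for k in range(n))

# Illustrative: 100 fibers, each inactive with probability 0.95
single = p_detect(1, 100, 0.95)   # detection mediated by any one active fiber
strict = p_detect(10, 100, 0.95)  # detection requiring ten active fibers
```

Raising the required number of active fibers n at a fixed stimulus level lowers the predicted probability of detection, which is exactly the knob used later to read off how many active fibers best match the psychophysical curves.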
The fibers in equation 2.2 are sampled from the same stochastic model. Therefore, by letting p = Pr{fiber i inactive} = Pr{fiber j inactive} = ..., equation 2.2 can be simplified to:

$$\hat{P} = \Pr\{\text{at least } n \text{ active fibers}\} = 1 - p^N - N(1-p)p^{N-1} - \binom{N}{2}(1-p)^2 p^{N-2} - \cdots - \binom{N}{n-1}(1-p)^{n-1} p^{N-n+1}$$
$$= \Pr\{\text{at least } n-1 \text{ active fibers}\} - \binom{N}{n-1}(1-p)^{n-1} p^{N-n+1}, \tag{2.3}$$

where the large parentheses are the binomial coefficients. If p can be derived, the probability of detection from the contribution of any n or more out of N fibers can be sequentially found by substituting p into equation 2.3. For fiber i to be inactive, the effective amplitude of the stimulus at that fiber's receptive field center ($x_i$, $y_i$) must be smaller than the fiber's threshold. The effective amplitude (A) as a function of location (x, y) was given (Johnson, 1974; Güçlü & Bolanowski, 2002) as:

$> K_1$ and $|U_j^i| > K_2$, then the two decision boundaries, equations 3.10 and 3.11, nearly intersect orthogonally. That is, if, for any $1 \le j \le M$, both the absolute values of the real and imaginary parts of the net input (complex number) to the hidden neuron j are sufficiently large, then the decision boundaries nearly intersect orthogonally. Therefore, the following theorem can be obtained:

Theorem. The decision boundaries for the real and imaginary parts of an output neuron in the three-layered complex-valued neural network approach orthogonality as both the absolute values of the real and imaginary parts of the net inputs to all hidden neurons grow.

4 Utility of the Orthogonal Decision Boundaries

In this section, we will show the utility that the properties of the decision boundary discovered in the previous section bring about. Minsky and Papert (1969) clarified the limitations of two-layered real-valued neural networks (i.e., no hidden layers): in a large number of interesting cases, the two-layered real-valued neural network is incapable of solving the problems. A classic example of this case is the exclusive-or (XOR) problem, which has a long history in the study of neural networks, and many other difficult problems involve the XOR as a subproblem. Another example is the detection-of-symmetry problem. Rumelhart et al.
(1986a, 1986b) showed that the three-layered real-valued neural network (i.e., one with a hidden layer) can solve such problems, including the XOR problem and the detection-of-symmetry problem, and that interesting internal representations can be constructed in the weight space. As described above, the XOR problem and the detection-of-symmetry problem cannot be solved with the two-layered real-valued neural network. First, contrary to expectation, it will be proved that such problems can be solved by the two-layered complex-valued neural network (i.e., no hidden layers) with the orthogonal decision boundaries, which reveals the potent computational power of complex-valued neural nets. In addition, it will be shown, as an application of this computational power, that the fading equalization problem can be successfully solved by the two-layered complex-valued neural network with the highest generalization ability. Rumelhart
82
T. Nitta
Table 1: The XOR Problem.

    Input         Output
    x1    x2      y
    0     0       0
    0     1       1
    1     0       1
    1     1       0
et al. (1986a, 1986b) showed that increasing the number of layers raises the computational power of neural networks. In this section, we show that extending the dimensionality of neural networks to complex numbers has a similar effect. This may be a new direction for enhancing the ability of neural networks. Second, we will present simulation results on the generalization ability of the three-layered complex-valued neural networks trained using the Complex-BP (called the Complex-BP network) (Nitta & Furuya, 1991; Nitta, 1993, 1997) and will compare them with those of the three-layered real-valued neural networks trained using the Real-BP (called the Real-BP network) (Rumelhart et al., 1986a, 1986b).

4.1 The XOR Problem. In this section, it is proved that the XOR problem can be solved by a two-layered complex-valued neural network (i.e., no hidden layers) with the orthogonal decision boundaries. The input-output mapping in the XOR problem is shown in Table 1. In order to solve the XOR problem with complex-valued neural networks, the input-output mapping is encoded as shown in Table 2, where the outputs 1 and i are interpreted as the output 0 of the original XOR problem (Table 1), and the outputs 0 and 1 + i as the output 1. We use a 1-1 complex-valued neural network (i.e., no hidden layers) with a weight w = u + iv ∈ C between the input neuron and the output neuron (we assume that it has no threshold parameters). The activation function is
Table 2: An Encoded XOR Problem for Complex-Valued Neural Networks.

    Input           Output
    z = x + iy      Z = X + iY
    −1 − i          1
    −1 + i          0
    1 − i           1 + i
    1 + i           i
Orthogonality of Decision Boundaries
83
defined as

    1_C(z) = 1_R(x) + i 1_R(y),  z = x + iy,    (4.1)

where 1_R is a real-valued step function defined on R, that is, 1_R(u) = 1 if u ≥ 0, and 0 otherwise, for any u ∈ R. The decision boundary of the 1-1 complex-valued neural network described above consists of the following two straight lines, which intersect orthogonally:

    [u  −v] · t[x  y] = 0,    (4.2)
    [v   u] · t[x  y] = 0,    (4.3)

for any input signal z = x + iy ∈ C, where u and v are the real and imaginary parts of the weight parameter w = u + iv, respectively. Expressions 4.2 and 4.3 are the decision boundaries for the real and imaginary parts of the 1-1 complex-valued neural network, respectively. Letting u = 0 and v = 1 (i.e., w = i), we obtain the decision boundary shown in Figure 1, which divides the input space (the decision region) into four equal sections and has the highest generalization ability for the XOR problem. On the other hand, the decision boundary of the three-layered real-valued neural network for the XOR problem does not always have the highest generalization ability (Lippmann, 1987). In addition, the required number of learnable parameters
Figure 1: The decision boundary in the input space of the 1-1 complex-valued neural network that solves the XOR problem. In the (Re, Im) plane, the decision boundary for the real part is the line y = 0 and that for the imaginary part is the line x = 0. A black circle indicates that the output in the XOR problem is 1, and a white circle that it is 0.
is only 2 (i.e., only w = u + iv), whereas at least nine parameters are needed for the three-layered real-valued neural network to solve the XOR problem (Rumelhart et al., 1986a, 1986b), where a complex-valued parameter z = x + iy (with i = √−1) is counted as two because it consists of a real part x and an imaginary part y.

4.2 The Detection of Symmetry. Another interesting task that cannot be done by two-layered real-valued neural networks is the detection-of-symmetry problem (Minsky & Papert, 1969). In this section, a solution to this problem using a two-layered complex-valued neural network (i.e., no hidden layers) with the orthogonal decision boundaries is given. The problem is to detect whether the binary activity levels of a one-dimensional array of input neurons are symmetrical about the center point. For example, the input-output mapping in the case of three inputs is shown in Table 3. We used patterns of various lengths (from two to six) and could solve all the cases with two-layered complex-valued neural networks. Only a solution to the case with six inputs is presented here because the other cases can be handled in a similar way. We use a 6-1 complex-valued neural network (i.e., no hidden layers) with weights w_k = u_k + iv_k ∈ C between input neuron k and the output neuron (1 ≤ k ≤ 6) (we assume that it has no threshold parameters). In order to solve the problem with the complex-valued neural network, the input-output mapping is encoded as follows: an input x_k ∈ R is encoded as an input x_k + iy_k ∈ C to input neuron k, where y_k = 0 (1 ≤ k ≤ 6); the output 1 ∈ R is encoded as 1 + i ∈ C; and the output 0 ∈ R is encoded as 1 or i ∈ C, determined according to the inputs (e.g., the output corresponding to the input t[0 0 0 0 1 0] is i). The activation function is the same as in expression 4.1.
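The single-weight construction of section 4.1 can be checked numerically. The following is a minimal sketch (function and variable names are ours, not from the article) of the 1-1 complex-valued neuron with w = i applied to the encoded XOR mapping of Table 2:

```python
# Minimal sketch of section 4.1: a single complex-valued neuron with
# weight w = i and no threshold solves the encoded XOR problem (Table 2).

def step(u):
    """The real-valued step function 1_R: 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0

def activate(z):
    """The activation 1_C(z) = 1_R(Re z) + i * 1_R(Im z) of expression 4.1."""
    return complex(step(z.real), step(z.imag))

w = 1j  # u = 0, v = 1, i.e., w = i

# Encoded XOR mapping of Table 2: input z -> desired output Z.
encoded = {
    -1 - 1j: 1 + 0j,  # (x1, x2) = (0, 0), XOR = 0
    -1 + 1j: 0 + 0j,  # (0, 1), XOR = 1
     1 - 1j: 1 + 1j,  # (1, 0), XOR = 1
     1 + 1j: 0 + 1j,  # (1, 1), XOR = 0
}

for z, Z in encoded.items():
    assert activate(w * z) == Z  # the neuron reproduces Table 2 exactly
```

With w = i, the boundary lines of expressions 4.2 and 4.3 become y = 0 and x = 0, so each encoded input falls in its own quadrant of the net input plane.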
Table 3: The Detection-of-Symmetry Problem with Three Inputs.

    Input             Output
    x1    x2    x3    y
    0     0     0     1
    0     0     1     0
    0     1     0     1
    1     0     0     0
    0     1     1     0
    1     0     1     1
    1     1     0     0
    1     1     1     1

Note: Output 1 means that the corresponding input is symmetric, and 0 asymmetric.
The decision boundary of the 6-1 complex-valued neural network described above consists of the following two straight lines, which intersect orthogonally:

    [u1 ... u6  −v1 ... −v6] · t[x1 ... x6  y1 ... y6] = 0,    (4.4)
    [v1 ... v6   u1 ... u6] · t[x1 ... x6  y1 ... y6] = 0,    (4.5)

for any input signal z_k = x_k + iy_k ∈ C, where u_k and v_k are the real and imaginary parts of the weight parameter w_k = u_k + iv_k, respectively (1 ≤ k ≤ 6). Expressions 4.4 and 4.5 are the decision boundaries for the real and imaginary parts of the 6-1 complex-valued neural network, respectively. Letting t[u1 ... u6] = t[−1 2 −4 4 −2 1] and t[v1 ... v6] = t[1 −2 4 −4 2 −1] (i.e., w1 = −1 + i, w2 = 2 − 2i, w3 = −4 + 4i, w4 = 4 − 4i, w5 = −2 + 2i, and w6 = 1 − i), we obtain the orthogonal decision boundaries shown in Figure 2, which successfully detect the symmetry of the 2^6 (= 64) input patterns. In addition, the required number of learnable parameters is 12 (i.e., 6 complex-valued weights), whereas at least 17 parameters are needed for the three-layered real-valued neural network to solve the detection of symmetry (Rumelhart et al., 1986a, 1986b), where a complex-valued parameter z = x + iy (with i = √−1) is counted as two, as in section 4.1.
Figure 2: The decision boundary in the net input space of the 6-1 complex-valued neural network that solves the detection-of-symmetry problem. Note that the plane is the net input space, not the input space, because the input space is six-dimensional and cannot be drawn in a two-dimensional plane. A black circle marks a net input for a symmetric input and a white circle one for an asymmetric input; there is only one black circle, at the origin. The four circled complex numbers are the output values of the 6-1 complex-valued neural network in their respective regions.
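The weight assignment above can be verified exhaustively over all 64 input patterns. A small sketch under the encoding described in the text (variable names are ours):

```python
# Sketch of section 4.2: a 6-1 complex-valued neuron with the weights
# given in the text classifies every binary pattern of length 6 by
# symmetry about the center point.
from itertools import product

def step(u):
    return 1 if u >= 0 else 0

def activate(z):
    # 1_C(z) = 1_R(Re z) + i * 1_R(Im z), as in expression 4.1
    return complex(step(z.real), step(z.imag))

# w_k = u_k + i v_k with t[u] = t[-1 2 -4 4 -2 1] and t[v] = -t[u]
w = [-1 + 1j, 2 - 2j, -4 + 4j, 4 - 4j, -2 + 2j, 1 - 1j]

for x in product((0, 1), repeat=6):
    net = sum(wk * xk for wk, xk in zip(w, x))  # inputs have y_k = 0
    out = activate(net)
    if x == x[::-1]:
        # symmetric patterns are encoded as output 1 + i
        assert out == 1 + 1j
    else:
        # asymmetric patterns are encoded as output 1 or i
        assert out in (1 + 0j, 0 + 1j)
```

Because v_k = −u_k, the net input is U(1 − i) with U = Σ u_k x_k, and U vanishes exactly on the symmetric patterns, which is why the single neuron suffices.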
Table 4: Input-Output Mapping in the Fading Equalization Problem.

    Input       Output
    −1 − i      −1 − i
    −1 + i      −1 + i
    1 − i       1 − i
    1 + i       1 + i
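The mapping of Table 4 is realized in section 4.3 below by a single complex weight w = 1. A brief sketch (the noisy received samples are invented here for illustration):

```python
# Sketch of section 4.3: the 1-1 complex-valued neuron with w = 1
# classifies noisy received symbols by quadrant, recovering the four
# transmitted values of Table 4.

def step(u):
    return 1 if u >= 0 else 0

def activate(z):
    # 1_C(z) = 1_R(Re z) + i * 1_R(Im z)
    return complex(step(z.real), step(z.imag))

w = 1  # u = 1, v = 0

# Encoded outputs (Table 5) mapped back to the transmitted symbols.
decode = {0 + 0j: -1 - 1j, 0 + 1j: -1 + 1j, 1 + 0j: 1 - 1j, 1 + 1j: 1 + 1j}

# Hypothetical noisy receptions and the symbols that produced them.
noisy = [-0.9 - 1.2j, 1.1 + 0.84j, -1.05 + 0.7j, 0.8 - 1.3j]
truth = [-1 - 1j, 1 + 1j, -1 + 1j, 1 - 1j]

for z, t in zip(noisy, truth):
    assert decode[activate(w * z)] == t  # every symbol recovered
```

The two boundary lines x = 0 and y = 0 split the plane into the four quadrants centered on the transmitted values, which is why this equalizer tolerates any noise that does not push a sample across a quadrant boundary.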
4.3 The Fading Equalization Technology. In this section, it is shown that two-layered complex-valued neural networks with orthogonal decision boundaries can be successfully applied to fading equalization technology (Lathi, 1998). Channel equalization in a digital communication system can be viewed as a pattern classification problem. The digital communication system receives a transmitted signal sequence with additive noise and tries to estimate the true transmitted sequence. A transmitted signal can take one of the following four possible complex values: −1 − i, −1 + i, 1 − i, and 1 + i (i = √−1). Thus, the received signal takes values around −1 − i, −1 + i, 1 − i, and 1 + i (for example, −0.9 − 1.2i, 1.1 + 0.84i, and the like) because noise is added to them. We need to estimate the true complex values from such noisy complex values, so a method with excellent generalization ability is needed for the estimation. The input-output mapping of the problem is shown in Table 4. We use the same 1-1 complex-valued neural network as in section 4.1. In order to solve the problem with the complex-valued neural network, the input-output mapping in Table 4 is encoded as shown in Table 5. Letting u = 1 and v = 0 (i.e., w = 1), we obtain the orthogonal decision boundary shown in Figure 3, which has the highest generalization ability for the fading equalization problem and can estimate true signals without errors. In addition, the required number of learnable parameters is only two (i.e., only w = u + iv).

Table 5: An Encoded Fading Equalization Problem for Complex-Valued Neural Networks.

    Input       Output
    −1 − i      0
    −1 + i      i
    1 − i       1
    1 + i       1 + i

Figure 3: The decision boundary in the input space of the 1-1 complex-valued neural network that solves the fading equalization problem. A black circle marks an input in the fading equalization problem. The four circled complex numbers are the output values of the 1-1 complex-valued neural network in their respective regions.

4.4 Generalization Ability of Three-Layered Complex-Valued Neural Networks. We present the simulation results on the generalization ability of the three-layered complex-valued neural networks trained using the Complex-BP (Nitta & Furuya, 1991; Nitta, 1993, 1997) and compare them with those of the three-layered real-valued neural networks trained using the Real-BP (Rumelhart et al., 1986a, 1986b). In the experiments, the three sets of (complex-valued) learning patterns shown in Tables 6 to 8 were used, and the learning constant ε was 0.5. The initial components of the weights and the thresholds were chosen to be random real numbers between −0.3 and 0.3. We judged that learning had finished when √(Σ_p Σ_{k=1}^N |T_k^(p) − O_k^(p)|²) = 0.05 held, where T_k^(p), O_k^(p) ∈ C denote the desired and actual output values of output neuron k for pattern p, and N denotes the number of neurons in the output layer. We regarded presenting a set of learning patterns to the neural network as one learning cycle. We used four kinds of three-layered Complex-BP networks: 1-3-1, 1-6-1, 1-9-1, and 1-12-1 networks. After training, by presenting the 1,681 (= 41 × 41) points of the complex plane [−1, 1] × [−1, 1] (x + iy, where x = −1.0, −0.95, ..., 0.95, 1.0 and y = −1.0, −0.95, ..., 0.95, 1.0), the actual output points formed the decision boundaries. Figure 4 shows an example of the decision boundary of the Complex-BP network. In Figure 4, the number 1 denotes the region in which the real part of the output value of the neural
Table 6: Learning Pattern 1.

    Input Pattern       Output Pattern
    −0.03 − 0.03i       1 + i
    0.03 − 0.03i        i
    0.03 + 0.03i        0
    −0.03 + 0.03i       1

Table 7: Learning Pattern 2.

    Input Pattern       Output Pattern
    −0.03 − 0.03i       i
    0.03 − 0.03i        0
    0.03 + 0.03i        1
    −0.03 + 0.03i       1 + i
network is OFF (0.0–0.5) and the imaginary part is OFF; region 2, the real part ON (0.5–1.0) and the imaginary part OFF; region 3, the real part OFF and the imaginary part ON; and region 4, the real part ON and the imaginary part ON. The decision boundary for the real part (i.e., the boundary between regions 1+3 and 2+4) and that for the imaginary part (i.e., the boundary between regions 1+2 and 3+4) intersect orthogonally. We also conducted corresponding experiments for the Real-BP networks. We chose the 2-4-2 Real-BP network as the comparison object for the 1-3-1 Complex-BP network because the numbers of parameters (weights and thresholds) were almost the same: the number of parameters for the 1-3-1 Complex-BP network was 20, and that for the 2-4-2 Real-BP network 22, where a complex-valued parameter z = x + iy (with i = √−1) was counted as two because it consists of a real part x and an imaginary part y. Similarly, the 2-7-2, 2-11-2, and 2-14-2 Real-BP networks were chosen as the comparison objects for the 1-6-1, 1-9-1, and 1-12-1 Complex-BP networks, respectively. Their numbers of parameters are shown in Table 9. In the Real-BP networks, the real component of a complex number was input into the first input neuron, and the imaginary component was input into

Table 8: Learning Pattern 3.

    Input Pattern       Output Pattern
    −0.03 − 0.03i       0
    0.03 − 0.03i        1
    0.03 + 0.03i        1 + i
    −0.03 + 0.03i       i
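The region labeling just described, which is used again in Figures 4 and 5, amounts to thresholding each part of the output at 0.5. A small sketch (treating an exact 0.5 as ON is our assumption, since the text only gives the ranges):

```python
# Sketch of the region labeling of Figures 4 and 5: an output X + iY is
# assigned to one of four regions by thresholding each part at 0.5
# (OFF: 0.0-0.5, ON: 0.5-1.0). The >= convention at exactly 0.5 is ours.

def region(out):
    re_on = out.real >= 0.5
    im_on = out.imag >= 0.5
    if not re_on and not im_on:
        return 1  # real OFF, imaginary OFF
    if re_on and not im_on:
        return 2  # real ON, imaginary OFF
    if not re_on and im_on:
        return 3  # real OFF, imaginary ON
    return 4      # real ON, imaginary ON

assert region(0.2 + 0.1j) == 1
assert region(0.8 + 0.1j) == 2
assert region(0.2 + 0.9j) == 3
assert region(0.8 + 0.9j) == 4
```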
Figure 4: An example of the decision boundary of the 1-12-1 Complex-BP network trained with learning pattern 1. The numerals mean the following. 1: real part OFF (0.0–0.5), imaginary part OFF; 2: real part ON (0.5–1.0), imaginary part OFF; 3: real part OFF, imaginary part ON; 4: real part ON, imaginary part ON. The decision boundary for the real part (the boundary between regions 1+3 and 2+4) and that for the imaginary part (the boundary between regions 1+2 and 3+4) intersect orthogonally.

Table 9: Number of Parameters in the Real-BP and Complex-BP Networks.

    Complex-BP network      1-3-1   1-6-1   1-9-1   1-12-1
    Number of parameters    20      38      56      74
    Real-BP network         2-4-2   2-7-2   2-11-2  2-14-2
    Number of parameters    22      37      57      72
the second input neuron; the output from the first output neuron was interpreted as the real component of a complex number, and the output from the second output neuron as the imaginary component. Figure 5 shows an example of the decision boundary of the Real-BP network, where the numbers 1–4 have the same meanings as in Figure 4. We can see from Figure 5 that the decision boundary for the real part (the boundary between regions 1+3 and 2+4) and that for the imaginary part (the boundary between regions 1+2 and 3+4) do not intersect orthogonally.
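The parameter counts used above for pairing the networks follow directly from the layer sizes. A sketch of the bookkeeping (function names are ours), counting each complex weight or threshold as two real parameters:

```python
# Sketch reproducing the parameter counts of Table 9. Every layer is
# fully connected, every hidden and output neuron has a threshold, and
# a complex parameter counts as two real ones.

def real_params(n_in, n_hid, n_out):
    weights = n_in * n_hid + n_hid * n_out
    thresholds = n_hid + n_out
    return weights + thresholds

def complex_params(n_in, n_hid, n_out):
    # each complex weight/threshold = one real part + one imaginary part
    return 2 * real_params(n_in, n_hid, n_out)

pairs = [((1, 3, 1), (2, 4, 2)), ((1, 6, 1), (2, 7, 2)),
         ((1, 9, 1), (2, 11, 2)), ((1, 12, 1), (2, 14, 2))]

counts = [(complex_params(*c), real_params(*r)) for c, r in pairs]
assert counts == [(20, 22), (38, 37), (56, 57), (74, 72)]  # as in Table 9
```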
Figure 5: An example of the decision boundary of the 2-14-2 Real-BP network trained with learning pattern 1. The numbers 1–4 have the same meanings as in Figure 4. The decision boundary for the real part (the boundary between regions 1+3 and 2+4) and that for the imaginary part (the boundary between regions 1+2 and 3+4) do not intersect orthogonally.
First, under the experimental conditions described above, we measured by visual observation the angles between the decision boundary for the real part (the boundary between regions 1+3 and 2+4) and that for the imaginary part (the boundary between regions 1+2 and 3+4), which are the components of the decision boundary of the output neuron in the Complex-BP networks. In the visual observation, the observed angles of the decision boundaries were roughly classified into three classes: 30, 60, and 90 degrees. The result of the observation is shown in Table 10: most trials were 90 degrees, a few trials were 60 degrees, and no trials were 30 degrees. The average and the standard deviation of the angles over 100 trials for each of the three learning patterns and each of the four network structures were then used as the evaluation criteria. Although we stopped learning at the 200,000th iteration, all trials succeeded in converging. We also measured the same quantities for the Real-BP networks for comparison. The results of the experiments are shown in Table 11. We see in Table 11 that all the average angles for the Complex-BP networks are almost 90 degrees, independent of the learning patterns and the network structures, whereas those of the Real-BP networks are around 70 to 80 degrees. In addition, the standard deviations of the angles for the Complex-BP networks are around 0 to 5 degrees and those for the Real-BP networks around 20 degrees. Thus, we can conclude from the experimental results that the decision boundary for the real part and that for the imaginary part, which are the components of the decision boundary of the output neuron in the three-layered Complex-BP networks, intersect almost orthogonally, whereas those for the Real-BP networks do not. There is thus a possibility of improving the generalization ability of neural networks through the orthogonality of the decision boundaries of the network.

Table 10: Result of the Visual Observation of the Angles of the Decision Boundaries: Complex-BP Network.

                                          30 degrees   60 degrees   90 degrees
    Learning pattern 2 (1-3-1 network)    0            4            96
    Learning pattern 3 (1-6-1 network)    0            3            97
    Learning pattern 3 (1-9-1 network)    0            1            99
    Other 9 cases                         0            0            100

Notes: Angles were roughly classified into three classes: 30, 60, and 90 degrees. Each numeral is the number of trials in which the observed angle was classified into that class.

Table 11: Comparison of the Angles of the Decision Boundaries (in degrees).

    Pattern 1
      Complex-BP network     1-3-1   1-6-1   1-9-1   1-12-1
        Average              90      90      90      90
        Standard deviation   0       0       0       0
      Real-BP network        2-4-2   2-7-2   2-11-2  2-14-2
        Average              78      72      76      80
        Standard deviation   19      22      17      17
    Pattern 2
      Complex-BP network     1-3-1   1-6-1   1-9-1   1-12-1
        Average              89      90      90      90
        Standard deviation   6       0       0       0
      Real-BP network        2-4-2   2-7-2   2-11-2  2-14-2
        Average              85      77      77      75
        Standard deviation   16      20      18      21
    Pattern 3
      Complex-BP network     1-3-1   1-6-1   1-9-1   1-12-1
        Average              90      89      90      90
        Standard deviation   0       5       3       0
      Real-BP network        2-4-2   2-7-2   2-11-2  2-14-2
        Average              86      76      72      73
        Standard deviation   11      20      22      22

Next, we measured the discrimination rate of the Complex-BP network for unlearned patterns in order to clarify how the orthogonality of the decision boundary of the three-layered Complex-BP network changed its generalization ability. Specifically, we counted the number of test patterns for which the Complex-BP network could give the correct output in the same experiments described above on the angles of decision boundaries of
100 trials for each of the three learning patterns and each of the four network structures. We defined correctness as follows: the output value X + iY (0 ≤ X, Y ≤ 1) of the Complex-BP network for an unlearned pattern x + iy (−1 ≤ x, y ≤ 1) was correct if |X − A| < 0.5 and |Y − B| < 0.5, provided that the closest input learning pattern to the unlearned pattern x + iy was a + ib, whose corresponding output learning pattern was A + iB (A, B = 0 or 1). For example, the output value X + iY (0 ≤ X, Y ≤ 1) of the Complex-BP network for an unlearned pattern x + iy (0 ≤ x, y ≤ 1) was correct if both the real and imaginary parts of the output value of the network took a value less than 0.5, provided that the corresponding output learning pattern for the input learning pattern 0.03 + 0.03i was 0. Then the average and the standard deviation of the discrimination rate over 100 trials for each of the three learning patterns and each of the four network structures were used as the evaluation criterion. The results of the experiments, including the Real-BP network case, appear in Table 12. The simulation results clearly suggest that the Complex-BP network has better generalization performance than the Real-BP network. Furthermore, we investigated the causality between the high generalization ability of the Complex-BP network and the orthogonal decision boundary. Table 13 shows the average of the discrimination rate for the three upper cases shown in Table 10. Table 13 clearly suggests that the average discrimination rate of the three-layered complex-valued neural network with the orthogonal decision boundary is superior to that of the one with a nonorthogonal decision boundary. Thus, we believe that the orthogonality of the decision boundaries causes the high generalization ability of the Complex-BP network.
Table 12: Comparison of the Discrimination Rate (%).

    Pattern 1
      Complex-BP network     1-3-1   1-6-1   1-9-1   1-12-1
        Average              92      95      97      98
        Standard deviation   6       5       3       2
      Real-BP network        2-4-2   2-7-2   2-11-2  2-14-2
        Average              88      90      93      93
        Standard deviation   8       7       4       4
    Pattern 2
      Complex-BP network     1-3-1   1-6-1   1-9-1   1-12-1
        Average              93      95      96      97
        Standard deviation   6       4       4       3
      Real-BP network        2-4-2   2-7-2   2-11-2  2-14-2
        Average              88      91      92      93
        Standard deviation   8       7       5       6
    Pattern 3
      Complex-BP network     1-3-1   1-6-1   1-9-1   1-12-1
        Average              92      94      97      97
        Standard deviation   7       4       3       3
      Real-BP network        2-4-2   2-7-2   2-11-2  2-14-2
        Average              87      90      90      92
        Standard deviation   9       7       7       6
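The correctness criterion behind the discrimination rates can be sketched directly. The learning mapping below is that of Table 6 (learning pattern 1); the sample network outputs are invented for illustration:

```python
# Sketch of the correctness criterion of section 4.4: an output X + iY
# for an unlearned input x + iy is "correct" if it is within 0.5, in each
# part, of the output learning pattern A + iB belonging to the closest
# input learning pattern.

# Input learning patterns and their outputs (learning pattern 1, Table 6).
learning = {
    -0.03 - 0.03j: 1 + 1j,
     0.03 - 0.03j: 0 + 1j,
     0.03 + 0.03j: 0 + 0j,
    -0.03 + 0.03j: 1 + 0j,
}

def is_correct(z, out):
    """z: unlearned input; out: network output X + iY with 0 <= X, Y <= 1."""
    nearest = min(learning, key=lambda p: abs(z - p))
    target = learning[nearest]
    return (abs(out.real - target.real) < 0.5
            and abs(out.imag - target.imag) < 0.5)

# A test point in the quadrant of 0.03 + 0.03i (target 0): correct iff
# both parts of the output are below 0.5.
assert is_correct(0.5 + 0.4j, 0.1 + 0.2j)
assert not is_correct(0.5 + 0.4j, 0.7 + 0.2j)
```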
Table 13: Average of the Discrimination Rate for the Three Upper Cases Shown in Table 10 (%).

                                          30 degrees   60 degrees   90 degrees
    Learning pattern 2 (1-3-1 network)    —            88           93
    Learning pattern 3 (1-6-1 network)    —            88           93
    Learning pattern 3 (1-9-1 network)    —            84           96
Table 14: Comparison of the Learning Speed, by Learning Cycle.

    Pattern 1
      Complex-BP network     1-3-1    1-6-1    1-9-1    1-12-1
        Average              10,770   10,178   9,766    9,529
        Standard deviation   438      472      210      144
      Real-BP network        2-4-2    2-7-2    2-11-2   2-14-2
        Average              31,647   29,945   28,947   28,230
        Standard deviation   1,268    944      697      566
    Pattern 2
      Complex-BP network     1-3-1    1-6-1    1-9-1    1-12-1
        Average              10,608   9,932    9,713    9,539
        Standard deviation   418      167      148      110
      Real-BP network        2-4-2    2-7-2    2-11-2   2-14-2
        Average              31,781   29,842   28,902   28,267
        Standard deviation   2,126    809      721      576
    Pattern 3
      Complex-BP network     1-3-1    1-6-1    1-9-1    1-12-1
        Average              10,746   10,055   9,713    9,502
        Standard deviation   740      412      228      188
      Real-BP network        2-4-2    2-7-2    2-11-2   2-14-2
        Average              34,038   29,620   28,980   28,471
        Standard deviation   5,182    806      603      584
Finally, we investigated the average and the standard deviation of the learning speed (i.e., the number of learning cycles needed to converge) over 100 trials for each of the three learning patterns and each of the four network structures in the experiments described above. The results are shown in Table 14. We find from these experiments that the learning speed of the Complex-BP is several times faster than that of the Real-BP, and that the standard deviation of the learning speed of the Complex-BP is smaller than that of the Real-BP.

5 Discussion

We proved in section 3.1 that the decision boundary for the real part of the output of a single complex-valued neuron and that for the imaginary part intersect orthogonally. Since this property is completely different from that of a usual real-valued neuron, one needs to design a complex-valued neural network for real applications, and its learning algorithm, taking the orthogonal property of the complex-valued neuron into account, whatever the type of the network (multilayered or mutually connected). Moreover, it has been constructively proved, using the orthogonal property of the decision boundary, that the XOR problem and the detection-of-symmetry problem can be solved by two-layered complex-valued neural networks (i.e., complex-valued neurons), which cannot be solved by two-layered real-valued neural networks (i.e., real-valued neurons). These results reveal the potent computational power of complex-valued neural nets. Raising the dimensionality of neural networks (e.g., to complex numbers) may be a new direction for enhancing the ability of neural networks. In addition, it has been shown, as an application of this computational power, that the fading equalization problem can be successfully solved by the two-layered complex-valued neural network with the highest generalization ability. One should therefore use two-layered complex-valued neural networks rather than three-layered ones when solving the fading equalization problem with complex-valued neural networks. As we made clear in section 3.2, it is not always guaranteed that the decision boundary of the three-layered complex-valued neural network has this orthogonality. We therefore derived in section 3.2 a sufficient condition for the decision boundaries in the three-layered complex-valued neural network to intersect almost orthogonally (the Theorem): both the absolute values of the real and imaginary parts of the net inputs to all hidden neurons are sufficiently large. This is a characterization of the structure of the decision boundaries in the three-layered complex-valued neural network. The theorem will be useful if a learning algorithm is devised that makes both the absolute values of the real and imaginary parts of the net inputs to all hidden neurons sufficiently large, because, as we have seen in section 4.4, there is a possibility that the orthogonality of the decision boundaries of the network can improve the generalization ability of three-layered complex-valued neural networks.
That is, there is a possibility that we can use the theorem to improve the generalization ability of the three-layered complex-valued neural network. However, a situation in which the theorem is directly useful for the Complex-BP network cannot be envisioned for now, because controlling the net input is difficult as long as the Complex-BP algorithm (that is, a steepest-descent method) is used. The Complex-BP is only one of the learning algorithms for complex-valued neural networks; thus, it should be noted that the usefulness of the theorem depends on the learning algorithm used. Although the orthogonality of the decision boundaries in the three-layered complex-valued neural network can be guaranteed only conditionally, as described above, we found from the experiments in section 4.4 that most of the decision boundaries in the three-layered Complex-BP network intersect orthogonally. Moreover, the experiments suggest that the orthogonality of the decision boundaries in the three-layered Complex-BP network improves its generalization ability. The decision boundary of the complex-valued neural network, which consists of two orthogonal hypersurfaces, divides a decision region into four equal sections, so it is intuitively plausible that this orthogonality improves generalization; we showed the possibility experimentally. A theoretical evaluation of the generalization ability of the complex-valued neural network, using criteria such as the Vapnik-Chervonenkis dimension (VC dimension) (Vapnik, 1998), could clarify how the orthogonal property of the decision boundaries of the complex-valued neural network influences its generalization ability. It has already been reported that the average learning speed of the Complex-BP is several times faster than that of the Real-BP (Nitta, 1997). We could confirm this in the experiments on the orthogonality of the decision boundary and the generalization ability of the three-layered Complex-BP network (section 4.4). We also found that the standard deviation of the learning speed of the Complex-BP is smaller than that of the Real-BP, which had not been reported in Nitta (1997).

6 Conclusions

We have clarified the differences between the real-valued neural network and the complex-valued neural network through theoretical and experimental analyses of their fundamental properties, and have clarified the utility that the properties discovered in this work bring to the complex-valued neural network. It turned out that the complex-valued neural network has a basically orthogonal decision boundary as a result of the extension to complex numbers. The decision boundary of a single complex-valued neuron consists of two hypersurfaces that intersect orthogonally and divides a decision region into four equal sections. The XOR problem and the detection-of-symmetry problem, which cannot be solved with two-layered real-valued neural networks, can be solved by two-layered complex-valued neural networks with orthogonal decision boundaries, which reveals the potent computational power of complex-valued neural nets.
Furthermore, the fading equalization problem can be successfully solved by the two-layered complex-valued neural network with the highest generalization ability. The decision boundary of a three-layered complex-valued neural network has the orthogonal property as a basic structure, and its two hypersurfaces approach orthogonality as all the net inputs to each hidden neuron grow. In particular, most of the decision boundaries in the three-layered complex-valued neural network intersect orthogonally when the network is trained using the Complex-BP algorithm; as a result, the orthogonality of the decision boundaries improves its generalization ability. Its theoretical proof is a future topic. Furthermore, the average learning speed of the Complex-BP is several times faster than that of the Real-BP, and the standard deviation of the learning speed of the Complex-BP is smaller than that of the Real-BP. For the above reasons, the complex-valued neural network and the related Complex-BP algorithm are natural methods for learning complex-valued patterns and are expected to be used effectively in fields dealing with complex numbers.

Acknowledgments

I give special thanks to S. Akaho, Y. Akiyama, M. Asogawa, T. Furuya, and the anonymous reviewers for valuable comments.

References

Arena, P., Fortuna, L., Muscato, G., & Xibilia, M. G. (1998). Neural networks in multidimensional domains. Lecture Notes in Control and Information Sciences 234. Berlin: Springer-Verlag.
Arena, P., Fortuna, L., Re, R., & Xibilia, M. G. (1993). On the capability of neural networks with complex neurons in complex valued functions approximation. In Proc. IEEE Int. Conf. on Circuits and Systems (pp. 2168–2171). Washington, DC: IEEE.
Benvenuto, N., & Piazza, F. (1992). On the complex backpropagation algorithm. IEEE Trans. Signal Processing, 40(4), 967–969.
Georgiou, G. M., & Koutsougeras, C. (1992). Complex domain backpropagation. IEEE Trans. Circuits and Systems–II: Analog and Digital Signal Processing, 39(5), 330–334.
ICONIP. (2002). Complex-valued neural networks. In Proc. International Conference on Neural Information Processing (Vol. 3, pp. 1074–1103). Washington, DC: IEEE.
KES. (2001). Complex-valued neural networks and their applications. In N. Baba, L. C. Jain, & R. J. Howlett (Eds.), Knowledge-based intelligent information engineering systems and allied technologies (Part I, pp. 550–580). Tokyo: IOS Press.
KES. (2002). Complex-valued neural networks. In E. Damiani, R. J. Howlett, L. C. Jain, & N. Ichalkaranje (Eds.), Knowledge-based intelligent information engineering systems and allied technologies (Part I, pp. 623–647). Amsterdam: IOS Press.
Kim, M. S., & Guest, C. C. (1990). Modification of backpropagation networks for complex-valued signal processing in frequency domain. In Proc. Int. Joint Conf. on Neural Networks (Vol. 3, pp. 27–31). Washington, DC: IEEE.
Kim, T., & Adali, T. (2000). Fully complex backpropagation for constant envelope signal processing. In Proc. IEEE Workshop on Neural Networks for Signal Processing (pp. 231–240). Washington, DC: IEEE.
Kuroe, Y., Hashimoto, N., & Mori, T. (2002). On energy function for complex-valued neural networks and its applications. In Proc. Int. Conf. on Neural Information Processing (Vol. 3, pp. 1079–1083). Washington, DC: IEEE.
Lathi, B. P. (1998). Modern digital and analog communication systems (3rd ed.). New York: Oxford University Press.
Lippmann, R. P. (1987, April). An introduction to computing with neural nets. IEEE Acoustics, Speech and Signal Processing Magazine, 4–22.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Miyauchi, M., Seki, M., Watanabe, A., & Miyauchi, A. (1992). Interpretation of optical flow through neural network learning. In Proc. IAPR Workshop on Machine Vision Applications (pp. 523–528).
Miyauchi, M., Seki, M., Watanabe, A., & Miyauchi, A. (1993). Interpretation of optical flow through complex neural network. In Proc. Int. Workshop on Artificial Neural Networks: Lecture Notes in Computer Science 686 (pp. 645–650). Berlin: Springer-Verlag.
Nitta, T. (1993). A complex numbered version of the back-propagation algorithm. In Proc. World Congress on Neural Networks (Vol. 3, pp. 576–579). Mahwah, NJ: Erlbaum.
Nitta, T. (1997). An extension of the back-propagation algorithm to complex numbers. Neural Networks, 10(8), 1392–1415.
Nitta, T., & Furuya, T. (1991). A complex back-propagation learning. Transactions of Information Processing Society of Japan, 32(10), 1319–1329 (in Japanese).
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986a). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986b). Learning representations by back-propagating errors. Nature, 323, 533–536.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Watanabe, A., Yazawa, N., Miyauchi, A., & Miyauchi, M. (1994). A method to interpret 3D motions using neural networks. IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences, E77-A(8), 1363–1370.

Received July 9, 2002; accepted May 5, 2003.
LETTER
Communicated by Sumio Watanabe and Katsuyuki Hagiwara
On the Asymptotic Distribution of the Least-Squares Estimators in Unidentifiable Models

Taichi Hayasaka
[email protected] Department of Information and Computer Engineering, Toyota National College of Technology, Toyota, Aichi 471-8525, Japan
Masashi Kitahara
[email protected] Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi, Aichi 441–8580, Japan
Shiro Usui
[email protected] Laboratory for Neuroinformatics, RIKEN Brain Science Institute, Wako, Saitama 351–0198, Japan
Neural Computation 16, 99–114 (2004). © 2003 Massachusetts Institute of Technology

In order to analyze the stochastic properties of multilayered perceptrons or other learning machines, we deal with simpler models and derive the asymptotic distribution of the least-squares estimators of their parameters. In the case where a model is unidentifiable, we show results that differ from those for traditional linear models: the well-known property of asymptotic normality never holds for the estimates of redundant parameters.

1 Introduction

Studying the fundamental properties of multilayered perceptrons (MLP), treated as a parametric stochastic model in mathematical statistics, is a challenging problem. White (1989), who did pioneering work in this area, identified one big difference between MLP and such traditional models as linear regression and ARMA: the unidentifiability of optimal parameters, due to the nonuniqueness of the combination of parameter values that gives an equivalent input-output relation. This result suggests that it is fruitless to apply some traditional methods to a number of problems in applications of MLP. For example, consider the problem of selecting a suitable network structure, or number of units, that has good generalization ability. This is formulated as a statistical model selection problem, and various criteria to evaluate the generalization ability of models have been proposed. It is known that the probability distribution of the maximum likelihood estimators (MLE) of parameters converges to the gaussian distribution as the number of given samples goes to infinity. From this property, called asymptotic normality, the well-known criterion AIC (Akaike information criterion; Akaike, 1974) is derived. In the case of MLP, however, we cannot show the asymptotic normality of estimators because of the unidentifiability of optimal parameters, so the effectiveness of AIC is not ensured (Hagiwara, Toda, & Usui, 1993; Anders & Korn, 1999). Although some other criteria, such as NIC (network information criterion; Murata, Yoshizawa, & Amari, 1994) and GPE (generalized prediction error; Moody, 1992), have been proposed for MLP, they are effective only when asymptotic normality holds. This implies that the determination of network structure remains an open problem (Hagiwara, 2002). Therefore, several researchers have been working on MLP's mathematical properties related to its generalization ability (e.g., Fukumizu, 1996; Hagiwara, Hayasaka, Toda, Usui, & Kuno, 2001; Watanabe, 2001). White (1989) showed the asymptotic normality of the estimates by supposing the uniqueness of optimal parameters and compactness of the parameter space; Murata et al. (1994) obtained a similar result. Furthermore, White asserted that the asymptotic distribution of parameter estimators in MLP becomes the limiting mixed gaussian (LMG), introduced by Phillips (1989), in case the optimal values of parameters are not unique but are locally identifiable. However, White did not prove whether the local identifiability of optimal parameters holds for MLP. Kitahara, Hayasaka, Toda, and Usui (2001) suggested by numerical experiments that compactness is not satisfied for MLP parameters. They also obtained the empirical distribution of the estimators, which cannot be represented by a family of gaussian or LMG distributions. We can say from these results that White's idea does not express the exact stochastic property of MLP.
The goal of this study is to derive such a distribution function theoretically. We first deal with a simple model under the unidentifiable condition and derive the asymptotic distribution of its parameter estimators in a context of regression estimation. This is equivalent to uniting the asymptotic normality for identifiable parameters (Murata et al., 1994; White, 1989) and the asymptotic distribution of the squares of unidentifiable parameters for a gaussian noise sequence (Hagiwara et al., 2001; Hayasaka, Toda, Usui, & Hagiwara, 1996). Then we discuss whether a similar result holds for the models provided by a subset of the parameter space of MLP, by applying the theory of extremes in a non–independent and identically distributed (i.i.d.) gaussian sequence.

2 Problem Formulation

Let us consider regression estimation from a given set of N pairs of independent input-output samples (x, y)^N = \{(x_i, y_i); i = 1, \ldots, N\}, x_i \in R, y_i \in R. Outputs of a target system y_i for predetermined inputs x_i are
generated by adding noise to a function,

y_i = h(x_i) + \varepsilon_i^*, \quad i = 1, \ldots, N, \qquad (2.1)

where \varepsilon_i^* is an independent random variable with a common gaussian distribution N(0, \sigma_*^2). The true function h determines an invariant input-output relation of the system. Applying MLP with one output unit, s hidden units, and one input unit for estimating h, we describe a generating rule of samples (x, y)^N as follows:

y_i = \sum_{j=1}^{s} \frac{c_j}{1 + \exp(-b_{j1} x_i - b_{j0})} + \varepsilon_i, \quad i = 1, \ldots, N, \qquad (2.2)

where c_j, b_{j0}, b_{j1} \in R, and \varepsilon_i is an independent random variable with a common gaussian distribution N(0, \sigma_s^2). c_j corresponds to the connection weight between the output layer and the hidden layer, and b_{j1} is the weight between the hidden layer and the input layer. b_{j0} is regarded as the threshold. b_{j0} and b_{j1} are nonlinear parameters in terms of the input-output relationship.

We define the context of regression estimation in this study. A combination of basis functions is usually used for estimating the unknown function h, namely, the regression model,

y_i = \sum_{j=1}^{s} c_j \phi(x_i; b_j) + \varepsilon_i, \quad i = 1, \ldots, N, \qquad (2.3)

where c_j \in R and b_j \in R^t. Equation 2.3 is represented in matrix notation as

y_N = \Phi_s^{(b_s)} c_s + \varepsilon_N, \qquad (2.4)

where y_N = (y_1, \ldots, y_N)^T, c_s = (c_1, \ldots, c_s)^T, \varepsilon_N = (\varepsilon_1, \ldots, \varepsilon_N)^T, and \Phi_s^{(b_s)} is the N \times s design matrix whose component in the ith row and jth column is \phi(x_i; b_j), provided by b_s = (b_1, \ldots, b_s). We call c_j and b_j a coefficient and a basis parameter, respectively. In the case of MLP, b_j = (b_{j0}, b_{j1}) for all j. Throughout this article, we deal with the case where the squared error is employed as the loss function for parameter estimation:

r_{emp}(c_s, b_s; (x, y)^N) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \sum_{j=1}^{s} c_j \phi(x_i; b_j) \right)^2. \qquad (2.5)

The estimators based on (x, y)^N are assumed to be the parameters that minimize the empirical loss, that is, the least-squares estimators (LSE):

(\hat{c}_s, \hat{b}_s) \equiv \operatorname{argmin}_{c_s, b_s} \, r_{emp}(c_s, b_s; (x, y)^N). \qquad (2.6)
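As a concrete sketch of equations 2.5 and 2.6: when the basis parameter space is a finite set (as assumed later in section 3), the LSE can be computed by solving the normal equation for each candidate basis parameter and keeping the minimizer of the empirical loss. The data, basis family, and sizes below are illustrative assumptions, not taken from the article:

```python
import numpy as np

# Minimal sketch of LSE for the regression model 2.3 with s = 1 and a finite
# basis parameter set (names and sizes are illustrative, not from the paper).
rng = np.random.default_rng(0)
N = 200
x = 2 * np.pi * np.arange(1, N + 1) / N
y = np.sin(x) + 0.1 * rng.standard_normal(N)      # h(x) = sin(x) plus noise

def phi(x, b):
    # cosine basis function parameterized by b = (b0, b1, b2), eq. 3.6 style
    b0, b1, b2 = b
    return b0 * np.cos(b1 * x - b2)

# finite candidate set B of basis parameters (illustrative)
B = [(np.sqrt(2.0), k, p) for k in range(1, 6) for p in (0.0, np.pi / 2)]

def lse(x, y, bs):
    # normal-equation LSE of the coefficients for a fixed basis parameter vector,
    # returning the coefficients and the empirical loss of equation 2.5
    Phi = np.column_stack([phi(x, b) for b in bs])
    cs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return cs, np.mean((y - Phi @ cs) ** 2)

# joint minimization of eq. 2.6 = exhaustive search over the finite set B
best_b, (best_c, best_loss) = min(
    ((b, lse(x, y, [b])) for b in B), key=lambda t: t[1][1])
```

Here the search recovers the basis (\sqrt{2}, 1, \pi/2), since \sin x = (1/\sqrt{2}) \cdot \sqrt{2}\cos(x - \pi/2), with coefficient near 1/\sqrt{2}.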
The LSE \hat{c}_s = (\hat{c}_1, \ldots, \hat{c}_s) and \hat{b}_s = (\hat{b}_1, \ldots, \hat{b}_s) are equivalent to the MLE because of the assumption of gaussian noise. Suppose that the true function h is represented by the linear combination of s^* basis functions with optimal parameter values c_j^* and b_j^*, j = 1, \ldots, s^*:

h(x_i) \equiv \sum_{j=1}^{s^*} c_j^* \phi(x_i; b_j^*). \qquad (2.7)

When we apply the regression model with s basis functions (see equation 2.3) and s > s^*, the optimal coefficient values for the redundant basis functions, c_j^*, j > s^*, are regarded as zero. In this case, b_j^* can take any value; we say that the optimal basis parameter values are unidentifiable, because every b_j gives the same function representation (and the same value of the squared error, equation 2.5) when c_j = 0. If the basis parameters are constant, equation 2.3 is a traditional linear regression model. Therefore, \hat{c}_s has a joint gaussian distribution asymptotically, and the inverse of the Hessian of the loss function (equivalent to the information matrix in this case) corresponds to the covariance matrix of the gaussian. When the optimal basis parameter values are unidentifiable, however, the Hessian becomes singular. Since it is impossible to calculate its inverse, asymptotic normality cannot be shown for regression models with variable basis parameters like MLP (White, 1989; Hagiwara et al., 1993; Anders & Korn, 1999). In the following sections, we consider the probability distribution of \hat{c}_s in such a case.

3 Asymptotic Distribution for the Regression Model with Orthogonal Basis Functions

The parameters of regression models usually take real values or positive real values. In order to simplify the analysis, we set the basis parameter space to a finite set B \equiv \{b_1, \ldots, b_N\} \subset R^t so that the basis functions \phi(x; b_1), \ldots, \phi(x; b_N) are linearly independent. Since this model also includes unidentifiable optimal parameters, its property may be reflected in that of MLP or other learning machines represented by a superposition of adaptive basis functions. Let \hat{c}_s^{(b_s)} be the LSE of c_s given by solving the normal equation,

\hat{c}_s^{(b_s)} = (\Phi_s^{(b_s)T} \Phi_s^{(b_s)})^{-1} \Phi_s^{(b_s)T} y_N, \qquad (3.1)
provided by a basis parameter vector b_s. The least-squared error can then be obtained by

\min_{b_s} \min_{c_s} \frac{1}{N} (y_N - \Phi_s^{(b_s)} c_s)^T (y_N - \Phi_s^{(b_s)} c_s)
  = \min_{b_s} \frac{1}{N} (y_N - \Phi_s^{(b_s)} \hat{c}_s^{(b_s)})^T (y_N - \Phi_s^{(b_s)} \hat{c}_s^{(b_s)})
  = \min_{b_s} \frac{1}{N} (y_N^T y_N - y_N^T P_s^{(b_s)} y_N), \qquad (3.2)

where P_s^{(b_s)} = \Phi_s^{(b_s)} (\Phi_s^{(b_s)T} \Phi_s^{(b_s)})^{-1} \Phi_s^{(b_s)T}. The LSE are obtained by

\hat{b}_s = \operatorname{argmax}_{b_s} \, y_N^T P_s^{(b_s)} y_N, \qquad (3.3)

\hat{c}_s = (\Phi_s^{(\hat{b}_s)T} \Phi_s^{(\hat{b}_s)})^{-1} \Phi_s^{(\hat{b}_s)T} y_N. \qquad (3.4)
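The identity in equation 3.2 (the minimized empirical loss equals \frac{1}{N}(y_N^T y_N - y_N^T P_s^{(b_s)} y_N) for a fixed basis parameter vector) can be checked numerically; the design matrix below is an arbitrary illustrative choice:

```python
import numpy as np

# Numeric check of equation 3.2 for one fixed basis parameter vector.
# The three basis functions are arbitrary illustrative choices.
rng = np.random.default_rng(1)
N = 50
x = np.linspace(0.0, 1.0, N)
Phi = np.column_stack([x, np.exp(-x), np.cos(3 * x)])   # N x s design matrix
y = rng.standard_normal(N)

c_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)         # eq. 3.1
P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T            # projection onto span(Phi)

loss_direct = np.mean((y - Phi @ c_hat) ** 2)           # minimized empirical loss
loss_proj = (y @ y - y @ P @ y) / N                     # right side of eq. 3.2
```

Both quantities agree to machine precision, and P is idempotent, as a projection matrix must be.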
From equations 3.3 and 3.4, we can say that the probability distribution of \hat{c}_s is related to the stochastic property of the maximum of the random sequence \{y_N^T P_s^{(b_s)} y_N\} over all combinations of the elements in b_s. If \phi(x; b_1), \ldots, \phi(x; b_N) satisfy an orthogonality condition under an appropriate choice of an ordered set of input values, it is possible to analyze the property of the LSE in detail. We assume that

\sum_{i=1}^{N} \phi(x_i; b_j) \phi(x_i; b_k) = \begin{cases} \lambda & j = k \\ 0 & j \neq k, \end{cases} \qquad (3.5)

where the squared norm \lambda of each basis function usually diverges as N goes to infinity so that the true value of each coefficient c_j should be a constant. Suppose that \lambda is a constant function of N. For example, consider trigonometric basis functions for x_i \equiv 2\pi i / N, i = 1, \ldots, N,

\phi(x_i; b_j) = b_{j0} \cos(b_{j1} x_i - b_{j2}), \qquad (3.6)

where each basis parameter b_j = (b_{j0}, b_{j1}, b_{j2}) is in the basis parameter space B = \{(1, 0, 0), (\sqrt{2}, 1, 0), (\sqrt{2}, 1, \pi/2), (\sqrt{2}, 2, 0), (\sqrt{2}, 2, \pi/2), \ldots, (\sqrt{2}, N/2 - 1, \pi/2), (1, N/2, 0)\}. In this case, equation 3.5 is satisfied, and \lambda = N. We have

\max_{b_s} y_N^T P_s^{(b_s)} y_N = \max_{b_s} \hat{c}_s^{(b_s)T} \Phi_s^{(b_s)T} \Phi_s^{(b_s)} \hat{c}_s^{(b_s)} = \lambda \max_{b_s} \sum_{j=1}^{s} (\hat{c}^{(b_j)})^2, \qquad (3.7)

where

\hat{c}^{(b_j)} = \frac{1}{\lambda} \sum_{i=1}^{N} y_i \, \phi(x_i; b_j). \qquad (3.8)
The LSE \hat{c}_j and \hat{b}_j, j = 1, \ldots, s, are regarded as \hat{b}_j = b_{l_j} and \hat{c}_j = \hat{c}^{(b_{l_j})}, where l_1, \ldots, l_N correspond to the sequence rearranged in descending order of the squared coefficients (\hat{c}^{(b_{l_j})})^2, that is, (\hat{c}^{(b_{l_1})})^2 \geq \cdots \geq (\hat{c}^{(b_{l_N})})^2. Then (\hat{c}^{(b_j)})^2 \lambda / \sigma_*^2, j = 1, \ldots, N, are independent random variables with the noncentral \chi^2 distribution with 1 degree of freedom and noncentrality parameter (c_j^*)^2 \lambda / \sigma_*^2. The independence of the random sequence makes it easier to study the property of the maxima, because the classical extreme value theory (e.g., Leadbetter & Rootzén, 1988) can be applied. We show the following theorem for the asymptotic distribution of the LSE of the coefficients.

Theorem 1. Assume that (c_1^*)^2 > (c_2^*)^2 > \cdots > (c_{s^*}^*)^2 and \lambda > O(\log N). Then the probability distribution of \hat{c}_j, j \leq s^*, converges to the gaussian distribution

P\{\hat{c}_j \leq x\} \to N\left(c_j^*, \frac{\sigma_*^2}{\lambda}\right), \qquad (3.9)

as N \to \infty. For j > s^*, we have

P\{\hat{c}_j \leq x\} \to \frac{1}{2} + \frac{\operatorname{sgn}(x)}{2} \exp(-e^{-u_N(\tilde{x})}) \sum_{n=0}^{j-s^*-1} \frac{e^{-n \cdot u_N(\tilde{x})}}{n!}, \qquad (3.10)

as N \to \infty, where \tilde{x} = x^2 \lambda / \sigma_*^2, u_N(\tilde{x}) = \alpha_N (\tilde{x} - \beta_N), \alpha_N = \sqrt{2 \log(N - s^*)}, and \beta_N = \alpha_N - (\log \log(N - s^*) + \log \pi) / 2\alpha_N.

Proof. Let \xi_j \equiv (\hat{c}^{(b_j)})^2 \lambda / \sigma_*^2 and \mu_j \equiv (c_j^*)^2 \lambda / \sigma_*^2, j = 1, \ldots, N. Consider
\xi_j = (m_j + \epsilon_j)^2, where m_j is a constant such that \mu_j = m_j^2, and \epsilon_j is a random variable with N(0, 1). Then

\xi_j = \mu_j + 2 m_j \epsilon_j + \epsilon_j^2 \leq \mu_j + 2 |m_j| |\epsilon_j| + \epsilon_j^2 \leq \mu_j + 2 |m_j| \max_i |\epsilon_i| + \max_i \epsilon_i^2, \qquad (3.11)

for j \leq s^*. Deo (1972) showed that

\max_i |\epsilon_i| \to \sqrt{2 \log N}, \quad \text{a.s.} \qquad (3.12)

This turns into

\max_i \epsilon_i^2 \to 2 \log N, \quad \text{a.s.} \qquad (3.13)

Since \mu_j > O(\log N), we have \xi_j \to O(\mu_j), a.s. Let M_N^{(j)} be the jth largest maximum in the sequence \{\xi_{s^*+1}, \ldots, \xi_N\}. From equation 3.13, M_N^{(1)} \to 2 \log(N - s^*), a.s., because \mu_j = 0, j = s^* + 1, \ldots, N. Then

\xi_1 > \xi_2 > \cdots > \xi_{s^*} > M_N^{(1)}, \quad \text{a.s.} \qquad (3.14)

Equation 3.14 implies \hat{b}_j \to b_j, a.s., j \leq s^*, because \hat{b}_j corresponds to the jth largest maximum in \{\xi_1, \ldots, \xi_N\}, j \leq s^*. Therefore, \hat{c}_j (\equiv \hat{c}^{(\hat{b}_j)}) converges to \hat{c}^{(b_j)} almost surely (which implies convergence in probability). Since each \hat{c}^{(b_j)} is a random variable with the distribution N(c_j^*, \sigma_*^2 / \lambda), we can prove by applying the following lemma (Rao, 1973, p. 122) that the asymptotic distribution of \hat{c}_j is given by the gaussian distribution, equation 3.9, for j \leq s^*.

Lemma 1. Let \{W_N, Z_N\}, N = 1, 2, \ldots, be a sequence of pairs of variables. Then

|W_N - Z_N| \to 0 \text{ in probability}, \; Z_N \to Z \text{ in distribution} \;\Rightarrow\; W_N \to Z \text{ in distribution}, \qquad (3.15)

as N \to \infty; that is, the asymptotic distribution of W_N exists and is the same as that of Z.
In the case where j > s^*, it is obvious that |\hat{c}_j| \to \sqrt{\sigma_*^2 M_N^{(j-s^*)} / \lambda}, a.s., and M_N^{(j)} converges almost surely to the (j-s^*)th largest maximum of the absolute values of independent random variables with N(0, 1). From lemma 3 in the appendix, the asymptotic distribution of the extremes M_N^{(j)} is the type 1 distribution,

P\{\alpha_N (M_N^{(j)} - \beta_N) \leq x\} \to \exp(-e^{-x}) \sum_{n=0}^{j-s^*-1} \frac{e^{-nx}}{n!}, \quad \text{as } N \to \infty, \qquad (3.16)

where \alpha_N = \sqrt{2 \log(N - s^*)} and \beta_N = \alpha_N - (\log \log(N - s^*) + \log \pi) / 2\alpha_N. Since the distribution of each \hat{c}^{(b_j)} is N(0, \sigma_*^2 / \lambda) and this density function is symmetric with regard to the origin, equation 3.10 holds by using lemma 1.

The asymptotic normality of \hat{c}_j holds only when j \leq s^*, which is equivalent to the limit theorems for identifiable parameters (Murata et al., 1994; White, 1989). Equation 3.10 is the asymptotic distribution of the LSE of coefficients for redundant basis functions, which represent noise components included in the given samples, and it can also be derived from lemma 7 in Hagiwara et al. (2001). They dealt with the case where the regression model with n-point-fitting basis functions is applied to samples provided by a gaussian noise sequence only (i.e., h(x) = 0), and analyzed its asymptotic property by adopting the more general solutions for the regression model with orthonormal basis functions by Hayasaka et al. (1996), who also used the classical extreme value theory. Theorem 1 shows that the same result holds with probability one even if h(x) \neq 0. As shown in the example of trigonometric basis functions, the assumption \lambda > O(\log N) in theorem 1 is satisfied in general. If not, equation 3.10 holds for all \hat{c}_j, j \leq s^*, in the case where the true coefficient values should be constant in terms of N. Figure 1 illustrates the probability density function of the asymptotic distribution of \hat{c}_j derived from theorem 1. The symmetry with regard to the origin and the double-peaked shape of the density represented in Figure 1b are similar to the histogram of estimated parameters in MLP obtained by Kitahara et al. (2001). Although the regression model with orthogonal basis functions gives a very different function representation from MLP, both density shapes suggest that theorem 1 may be extended to other unidentifiable models with nonorthogonal basis functions.
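The double-peaked behavior of equation 3.10 can be illustrated by a small Monte Carlo sketch (sizes and seed below are illustrative assumptions, not from the article): for pure-noise samples and an orthogonal basis, the LSE of the leading coefficient is the orthogonal projection with the largest square, so its mass is pushed away from the origin while remaining symmetric.

```python
import numpy as np

# Monte Carlo sketch of theorem 1 when h(x) = 0 (all coefficients redundant).
rng = np.random.default_rng(2)
N, trials = 64, 4000
x = 2 * np.pi * np.arange(1, N + 1) / N
ks = np.arange(1, N // 2)
# orthogonal columns with squared norm lambda = N, as in equation 3.5
Phi = np.column_stack([np.sqrt(2) * np.cos(np.outer(x, ks)),
                       np.sqrt(2) * np.sin(np.outer(x, ks))])

c1 = np.empty(trials)
for t in range(trials):
    y = rng.standard_normal(N)          # pure gaussian noise
    c = Phi.T @ y / N                   # projections, eq. 3.8
    c1[t] = c[np.argmax(c ** 2)]        # LSE of the leading coefficient

scaled = c1 * np.sqrt(N)                # winner among ~N(0,1) projections
frac_near_zero = np.mean(np.abs(scaled) < 1.0)
```

Almost no mass remains near the origin (the winning projection behaves like the largest of many absolute gaussians), and the sign is balanced, matching the symmetric double peak of Figure 1b.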
Figure 1: Probability density function of the asymptotic distribution of \hat{c}_j for the regression model with orthogonal basis functions. (a) Equation 3.9 with \sigma_*^2 = 1 and \lambda = 1. The asymptotic normality of \hat{c}_j holds for j \leq s^*. (b) Equation 3.10 with \sigma_*^2 = 1, \lambda = 1, j = s^* + 1, and N = 16. A density with a double-peaked shape is obtained for \hat{c}_j, j > s^*.
4 Asymptotic Distribution for the Regression Model with Step-Type Basis Functions

In this section, we deal with the simplest case: the number of basis functions is s = 1 and c_j^* = 0 for all j, that is, h(x) = 0 and s^* = 0. Under these assumptions, the optimal parameter values are not identifiable, independent of the type of model. Again, we set the basis parameter space to a finite set with N elements and \sum_{i=1}^{N} \phi^2(x_i; b_j) = \lambda for all j. In the case of nonorthogonal basis functions, the LSE \hat{c}_1 is obtained by maximizing the squared coefficient values provided by all basis parameters, as in the orthogonal case:

\hat{c}_1 = \hat{c}^{(\hat{b}_j)}, \qquad (4.1)

\hat{b}_j = \operatorname{argmax}_{b_j} (\hat{c}^{(b_j)})^2, \quad \hat{c}^{(b_j)} = \frac{1}{\lambda} \sum_{i=1}^{N} y_i \, \phi(x_i; b_j). \qquad (4.2)
The LSE of the coefficient is equivalent to the maximum of a sequence of N random variables with the \chi^2 distribution with 1 degree of freedom, though the sequence is dependent. Since the classical extreme value theory cannot be applied directly to analyze the asymptotic distribution of the LSE, the problem becomes more difficult. First, we show a sufficient condition for \hat{c}_1 to have the same asymptotic distribution as in the orthogonal case.

Theorem 2. Assume that y_i \sim N(0, \sigma_*^2) for an appropriate choice of an ordered set of input values x_i \in R, i = 1, \ldots, N. If the regression model, equation 2.3 with s = 1, satisfies the condition

\lim_{N \to \infty} \sum_{l=1}^{N-1} \sup_{|j_1 - j_2| \geq l} \left( \frac{1}{\lambda} \sum_{i=1}^{N} \phi(x_i; b_{j_1}) \phi(x_i; b_{j_2}) \right)^2 < \infty, \qquad (4.3)

then the LSE of the coefficient parameter has the asymptotic distribution 3.10.

Proof. From the proof of theorem 1, we need a sufficient condition under which the asymptotic distribution of the maximum among absolute values of the dependent random variables converges to that for independent ones with N(0, 1). This can be shown under certain natural conditions on the dependency of the sequence. Deo (1972, theorem 1) shows the following:

Lemma 2. Let \xi_i, i = 1, \ldots, N, be random variables with N(0, 1) and M_N be the maximum among \{|\xi_1|, \ldots, |\xi_N|\}. Let \delta_l = \sup_{|j_1 - j_2| \geq l} |E[\xi_{j_1} \xi_{j_2}]|. If \sum_l \delta_l^2 < \infty, then \alpha_N (M_N - \beta_N) converges in distribution to \exp(-e^{-x}), where \alpha_N = \sqrt{2 \log N} and \beta_N = \alpha_N - (\log \log N + \log \pi) / 2\alpha_N.
We can show that \{E[\xi_{j_1} \xi_{j_2}]\}^2 = \operatorname{Corr}[\xi_{j_1}^2, \xi_{j_2}^2]. Then

\delta_l^2 = \sup_{|j_1 - j_2| \geq l} \left( \frac{1}{\lambda} \sum_{i=1}^{N} \phi(x_i; b_{j_1}) \phi(x_i; b_{j_2}) \right)^2, \qquad (4.4)

so that the asymptotic distribution 3.10 holds for the maximum of |\hat{c}^{(b_j)}| if \sum_{l=1}^{N-1} \delta_l^2 < \infty as N \to \infty. This ends the proof.
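The identity \{E[\xi_{j_1}\xi_{j_2}]\}^2 = \operatorname{Corr}[\xi_{j_1}^2, \xi_{j_2}^2] used above can be checked by Monte Carlo for a pair of correlated standard gaussians; the correlation \rho and sample size below are arbitrary illustrative values:

```python
import numpy as np

# Monte Carlo check: for jointly gaussian N(0,1) variables with correlation rho,
# {E[xi1 * xi2]}^2 = rho^2 = Corr[xi1^2, xi2^2].
rng = np.random.default_rng(4)
rho = 0.6
n = 500_000
z1 = rng.standard_normal(n)
z2 = rho * z1 + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

lhs = np.mean(z1 * z2) ** 2                   # {E[xi1 xi2]}^2, estimates rho^2
rhs = np.corrcoef(z1 ** 2, z2 ** 2)[0, 1]     # Corr[xi1^2, xi2^2]
```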
For example, let us consider the regression model with the basis functions

\phi^{(r)}(x_i; b) = \begin{cases} 1 & (b - r)_+ < i \leq b \\ 0 & \text{otherwise}, \end{cases} \qquad (4.5)

where the basis parameter b \in \{1, \ldots, N\} and the hyperparameter r = 1, 2, \ldots, with r \ll N. Here (b - r)_+ stands for \max(0, b - r). If r = 1, equation 4.5 is equivalent to the n-point-fitting function (Hagiwara et al., 2001). When normalizing \phi^{(r)}(x_i; b) so as to make \sum_{i=1}^{N} \{\phi^{(r)}(x_i; b)\}^2 equal to a constant \lambda for all b, it is easy to show that the assumption of theorem 2 is satisfied even if r > 1. In fact, the basis function of such a model can be constructed from two step-type basis functions,

\phi^{(r)}(x_i; b) = \sqrt{b} \cdot \phi_{step}(x_i; b) - \sqrt{(b - r)_+} \cdot \phi_{step}(x_i; b - r), \qquad (4.6)

where

\phi_{step}(x_i; b) = \begin{cases} 1/\sqrt{b} & i \leq b \\ 0 & \text{otherwise}. \end{cases} \qquad (4.7)
Equation 4.7 is regarded as a limit of a sigmoidal basis function with respect to the basis parameter value, so we may say that the above example is provided by a subset of the function representation realized by MLP. It can be expected that the property of the regression model with the basis function, equation 4.7, alone is even more similar to that of MLP. Unfortunately, theorem 2 does not hold in this case. However, equation 3.10 can still represent the asymptotic distribution of the coefficient parameter, as shown by the following theorem:

Theorem 3. Assume that y_i \sim N(0, \sigma_*^2) for an appropriate choice of an ordered set of input values x_i \in R, i = 1, \ldots, N. For the regression model with one basis function represented by equation 4.7, the LSE of the coefficient parameter has the asymptotic distribution, equation 3.10, with the constants \lambda = 1, \alpha_N = \sqrt{2 \log \log N}, and \beta_N = \alpha_N + (\log \log \log N - \log \pi) / 2\alpha_N.

Proof. Let

\hat{c}^{(j)} = \sum_{i=1}^{N} y_i \, \phi(x_i; j) = \frac{1}{\sqrt{j}} \sum_{i=1}^{j} y_i. \qquad (4.8)

Then the LSE of the coefficient \hat{c}_1 is given by

\hat{c}_1 = \hat{c}^{(\hat{j})}, \quad \hat{j} = \operatorname{argmax}_{1 \leq j \leq N} (\hat{c}^{(j)})^2 = \operatorname{argmax}_{1 \leq j \leq N} \left| \frac{1}{\sqrt{j}} \sum_{i=1}^{j} y_i \right|. \qquad (4.9)

Then we have |\hat{c}_1| = \max_j |\sum_{i=1}^{j} y_i| / \sqrt{j}. Applying the Darling–Erdős theorem for sums of independent gaussian variables (Darling & Erdős, 1956), the following equation holds:

P\left\{ \alpha_N \left( \max_{1 \leq j \leq N} \frac{1}{\sqrt{j}} \left| \sum_{i=1}^{j} y_i \right| - \beta_N \right) \leq x \right\} \to \exp(-e^{-x}), \quad \text{as } N \to \infty, \qquad (4.10)

where \alpha_N = \sqrt{2 \log \log N} and \beta_N = \alpha_N + (\log \log \log N - \log \pi) / 2\alpha_N. Hence, we conclude theorem 3.

Compared to the result for orthogonal basis functions, the convergence rate of the tail probability is different, but the type of the limit distribution is the same. From the Darling–Erdős theorem (Darling & Erdős, 1956), it is possible to generalize theorem 3 regardless of the distributions of the input and output samples, respectively. Notice again that the function representation of this regression model is regarded as the limit of MLP with respect to the basis parameters.

5 Discussion and Conclusion

We showed that asymptotic distributions different from the gaussian are obtained for the LSE of parameters in a context of regression estimation. By applying the theory of extremes in a gaussian sequence, we derived the asymptotic distributions for models with orthogonal basis functions (Hayasaka et al., 1996; Hagiwara et al., 2001). In this article, we treated such a result as a template of the stochastic property of other unidentifiable models (Figure 1 shows it graphically) and extended it to the nonorthogonal case. Although it cannot be applied directly to the case of the models provided by a subset of the parameter space of MLP, a similar interesting property was shown. The shape of the density of the asymptotic distribution we derived here has two peaks and takes its minimum value at the optimal value.
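The Darling–Erdős statistic \max_{1 \leq j \leq N} |\sum_{i \leq j} y_i| / \sqrt{j} from the proof of theorem 3, together with the normalizing constants \alpha_N and \beta_N, can be computed directly (seeded illustrative data, not the article's experiment):

```python
import numpy as np

# The statistic |c_hat_1| of eq. 4.9 for a gaussian noise sequence, with the
# normalizing constants of theorem 3 (illustrative sketch).
rng = np.random.default_rng(3)
N = 100_000
y = rng.standard_normal(N)
partial = np.cumsum(y)
stat = np.max(np.abs(partial) / np.sqrt(np.arange(1, N + 1)))   # |c_hat_1|

alpha_N = np.sqrt(2 * np.log(np.log(N)))
beta_N = alpha_N + (np.log(np.log(np.log(N))) - np.log(np.pi)) / (2 * alpha_N)
centered = alpha_N * (stat - beta_N)   # converges in law to exp(-e^{-x}), eq. 4.10
```

The statistic concentrates near \beta_N, which grows only like \sqrt{2 \log \log N}, reflecting the slower convergence rate noted above.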
Such a function cannot be constructed from the family of LMG, which White (1989) regards as the asymptotic distribution in the case of unidentifiable models. We assumed that the basis parameter space is a finite set in order to study the asymptotic property of MLP. The function representation defined by a linear combination of basis functions with basis parameters is called a dictionary representation, and stochastic models with abilities superior to linear models (e.g., projection pursuit regression, support vector machines) can be represented in this form (Cherkassky, Shao, Mulier, & Vapnik, 1999). We cannot conclude that there is no relationship between MLP and the models derived from assuming a finite parameter space, because they provide some measure of dictionary representations for MLP. We analyzed the asymptotic distribution of the LSE for the regression model with step-type basis functions as an example that gives a function representation close to MLP. In the typical case of unidentifiable optimal parameters, that is, when MLP is applied to samples that contain gaussian noise only, Kitahara et al. (2001) numerically showed that the optimal values of the basis parameters in a sigmoidal function tended to infinity and its shape became discontinuous, namely, step-type. Our numerical and theoretical results suggest that theorem 3 explains the property of the LSE for the coefficients of redundant basis functions in MLP. We will have to reduce the number of restrictions on the models (e.g., the number of basis functions, input dimensionality) in order to probe its actual state.

Appendix: Asymptotic Distributions of Extremes

A main issue of extreme value theory is the asymptotic property of extremes (i.e., maxima and minima) of a stochastic sequence. Its basic importance is given by the extremal types theorem, which deals with the distribution of maxima in i.i.d. random sequences, as noted in the proof of the following lemma.

Lemma 3. Let \xi_i, i = 1, \ldots, N, be random variables with N(0, 1) and M_N^{(k)} be the kth largest maximum among \{|\xi_1|, |\xi_2|, \ldots, |\xi_N|\}. Then

P\{\alpha_N (M_N^{(k)} - \beta_N) \leq x\} \to \exp(-e^{-x}) \sum_{n=0}^{k-1} \frac{e^{-nx}}{n!}, \quad \text{as } N \to \infty, \qquad (A.1)

where \alpha_N = \sqrt{2 \log N} and \beta_N = \alpha_N - (\log \log N + \log \pi) / 2\alpha_N.
Proof. Let F(x) be the common distribution function of |\xi_1|, \ldots, |\xi_N|, and S_N be the number of |\xi_1|, \ldots, |\xi_N| such that |\xi_i| > u_N for a level u_N. Then S_N is a random variable whose distribution is binomial:

P\{S_N \leq k\} = \sum_{n=0}^{k} \binom{N}{n} F^{N-n}(u_N) (1 - F(u_N))^n, \quad k = 0, 1, 2, \ldots. \qquad (A.2)
We have

P\{M_N^{(1)} \leq u_N\} = P\{|\xi_1| \leq u_N, |\xi_2| \leq u_N, \ldots, |\xi_N| \leq u_N\}. \qquad (A.3)

Equation A.3 shows that the event \{M_N^{(1)} \leq u_N\} is equivalent to none of |\xi_1|, \ldots, |\xi_N| exceeding the level u_N, namely, P\{S_N = 0\}. In the same way, the following holds:

P\{M_N^{(k)} \leq u_N\} = P\{S_N < k\} = \sum_{n=0}^{k-1} \binom{N}{n} F^{N-n}(u_N) (1 - F(u_N))^n, \quad k = 1, 2, \ldots. \qquad (A.4)

If there exists u_N such that N(1 - F(u_N)) \to \tau (\tau a constant in terms of N) as N \to \infty, the distribution of S_N converges to the Poisson distribution,

P\{S_N < k\} \to e^{-\tau} \sum_{n=0}^{k-1} \frac{\tau^n}{n!}, \quad k = 1, 2, \ldots, \qquad (A.5)

as N \to \infty. N(1 - F(u_N)) \to \tau as N \to \infty implies F^N(u_N) \to e^{-\tau} as N \to \infty. If there exists a distribution function G such that F^N(u_N) \to G(x) as N \to \infty, where u_N = \alpha_N^{-1} x + \beta_N for some constants \alpha_N > 0 and \beta_N, we have

P\{\alpha_N (M_N^{(k)} - \beta_N) \leq x\} = P\{S_N < k\} \to G(x) \sum_{n=0}^{k-1} \frac{(-\log G(x))^n}{n!}, \quad k = 1, 2, \ldots, \qquad (A.6)

as N \to \infty. It is known that G(x) has one of the following three forms (extremal types theorem, theorem 1.2.1 in Leadbetter & Rootzén, 1988):

I: G(x) = \exp(-e^{-x}), \qquad (A.7)
II: G(x) = \exp(-x^{-a}), \quad x > 0, \qquad (A.8)
III: G(x) = \exp(-(-x)^{a}), \quad x \leq 0, \qquad (A.9)

where a is a shape parameter. Now we seek to determine which of the three types of limit distributions applies to the random variables |\xi_i| with the distribution F(x). A useful condition has been obtained as follows:
Lemma 4 (de Haan, 1976, theorem 5). Suppose that the distribution F(x) of an i.i.d. random sequence is absolutely continuous with a density function f \equiv F' and that f has a derivative f' \equiv F'' such that f'(x) < 0 for all x in (x_0, \infty). If

\lim_{t \to \infty} \frac{f'(t)(1 - F(t))}{f^2(t)} = -1, \qquad (A.10)

then the asymptotic distribution of the maximum of the sequence is type 1 in the extremal types theorem.

Let \Phi(x) and \varphi(x) be the probability distribution function and the density function of \xi_i, respectively. Then F(x) is given by

F(x) = \int_{-x}^{x} \varphi(t) \, dt = \int_{-\infty}^{x} \varphi(t) \, dt - \int_{-\infty}^{-x} \varphi(t) \, dt = \Phi(x) - \Phi(-x) = 2\Phi(x) - 1. \qquad (A.11)

The density function f(x) of F(x) and its derivative f'(x) are

f(x) = 2\varphi(x), \qquad (A.12)

f'(x) = -2x\varphi(x). \qquad (A.13)

Since 1 - \Phi(u) \sim \varphi(u)/u as u \to \infty,

\frac{f'(x)(1 - F(x))}{f^2(x)} = \frac{-2x\varphi(x)(2 - 2\Phi(x))}{4\varphi^2(x)} = -\frac{x}{\varphi(x)} (1 - \Phi(x)) \sim -\frac{x}{\varphi(x)} \cdot \frac{\varphi(x)}{x} = -1, \qquad (A.14)

as x \to \infty. From lemma 4, G(x) is given by equation A.7. The normalizing constants \alpha_N and \beta_N can be obtained by a process similar to example 2 in Resnick (1987). Equation A.1 is valid for k \geq 2 by substituting equation A.7 into equation A.6.

Acknowledgments

We wish to thank Katsuyuki Hagiwara at Nagoya Institute of Technology and Naohiro Toda at Aichi Prefectural University for their useful comments.
References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Contr., AC-19, 716–723.
Anders, U., & Korn, O. (1999). Model selection in neural networks. Neural Networks, 12, 309–323.
Cherkassky, V., Shao, X., Mulier, F. M., & Vapnik, V. (1999). Model complexity control for regression using VC generalization bounds. IEEE Trans. Neural Networks, 10, 1075–1089.
Darling, D. A., & Erdős, P. (1956). A limit theorem for the maximum of normalized sums of independent random variables. Duke Math. J., 23, 143–155.
de Haan, L. (1976). Sample extremes: An elementary introduction. Statist. Neerlandica, 30, 161–172.
Deo, C. M. (1972). Some limit theorems for maxima of absolute values of gaussian sequences. Sankhyā, A, 34, 289–292.
Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9, 871–879.
Hagiwara, K. (2002). On the problem in model selection of neural network regression in an overrealizable scenario. Neural Computation, 14, 1979–2002.
Hagiwara, K., Hayasaka, T., Toda, N., Usui, S., & Kuno, K. (2001). Upper bound of the expected training error of neural network regression for a gaussian noise sequence. Neural Networks, 14, 1419–1429.
Hagiwara, K., Toda, N., & Usui, S. (1993). On the problem of applying AIC to determine the structure of a layered feed-forward neural network. In Proceedings of International Joint Conference on Neural Networks III (pp. 2263–2266). New York: IEEE.
Hayasaka, T., Toda, N., Usui, S., & Hagiwara, K. (1996). On the least square error and prediction square error of function representation with discrete variable basis. In Proceedings of the 1996 IEEE Signal Processing Society Workshop (pp. 72–81). New York: IEEE.
Kitahara, M., Hayasaka, T., Toda, N., & Usui, S. (2001). On the probability distribution of estimators of regression model using 3-layered neural networks. In Proceedings of 2001 International Symposium on Nonlinear Theory and Its Applications (Vol. 2, pp. 533–536). Zao, Miyagi, Japan.
Leadbetter, M. R., & Rootzén, H. (1988). Extremal theory for stochastic processes. Ann. Prob., 16, 431–478.
Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). San Mateo, CA: Morgan Kaufmann.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Phillips, P. C. B. (1989). Partially identified econometric models. Econometric Theory, 5, 181–240.
Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York: Wiley.
114
T. Hayasaka, M. Kitahara, and S. Usui
Resnick, S. I. (1987). Extreme values, regular variation, and point processes. Berlin: Springer-Verlag. Watanabe, S. (2001). Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13, 899–933. White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425–464. Received November 6, 2002; accepted June 3, 2003.
LETTER
Communicated by John Platt
Asymptotic Properties of the Fisher Kernel

Koji Tsuda
[email protected]
Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany, and AIST Computational Biology Research Center, Koto-ku, Tokyo, 135-0064, Japan

Shotaro Akaho
[email protected]
AIST Neuroscience Research Institute, Tsukuba, 305-8568, Japan

Motoaki Kawanabe
[email protected]
Fraunhofer FIRST, 12489 Berlin, Germany

Klaus-Robert Müller
[email protected]
Fraunhofer FIRST, 12489 Berlin, Germany, and University of Potsdam, 14482 Potsdam, Germany
This letter analyzes the Fisher kernel from a statistical point of view. The Fisher kernel is a particularly interesting method for constructing a model of the posterior probability that makes intelligent use of unlabeled data (i.e., of the underlying data density). It is important to analyze and ultimately understand the statistical properties of the Fisher kernel. To this end, we first establish sufficient conditions that the constructed posterior model is realizable (i.e., it contains the true distribution). Realizability immediately leads to consistency results. Subsequently, we focus on an asymptotic analysis of the generalization error, which elucidates the learning curves of the Fisher kernel and how unlabeled data contribute to learning. We also point out that the squared or log loss is theoretically preferable to other losses such as the exponential loss (both yield consistent estimators) when a linear classifier is used together with the Fisher kernel. Therefore, this letter underlines that the Fisher kernel should be viewed not as a heuristic but as a powerful statistical tool with well-controlled statistical properties.

Neural Computation 16, 115–137 (2004)  © 2003 Massachusetts Institute of Technology

1 Introduction

Recently, the Fisher kernel (Jaakkola & Haussler, 1999) has been successfully applied as a feature extractor in supervised classification (Jaakkola & Haussler, 1999; Tsuda, Kawanabe, Rätsch, Sonnenburg, & Müller, 2002; Sonnenburg, Rätsch, Jagota, & Müller, 2002; Smith & Gales, 2002; Vinokourov & Girolami, 2002). The original intuition (Jaakkola & Haussler, 1999) for the Fisher kernel was to construct a probabilistic model of the data in order to induce a metric for a subsequent discriminative training. Two problems could be addressed simultaneously. First, it became possible to compare "apples and oranges," as the Fisher kernel approach measures distances in the space of the respective probabilistic model parameters. So, for example, a DNA sequence of length, say, 100 and another one of length 1000 can be easily compared by using a representation in the respective hidden Markov model (HMM) parameter space. Thus, the Fisher kernel is very much in contrast to alignment methods that compare sequences directly, essentially using dynamic programming techniques (e.g., Gotoh, 1982). A second feature of the Fisher kernel is that it allows incorporating prior knowledge about the data distribution into the classification process in a highly principled manner. In the practical use of support vector machines (SVMs) (e.g., Vapnik, 1998; Cristianini & Shawe-Taylor, 2000; Müller, Mika, Rätsch, Tsuda, & Schölkopf, 2001; Schölkopf & Smola, 2002), where the choice of the kernel is of crucial importance, either the kernel can be engineered using all available prior knowledge (e.g., Zien et al., 2000), or it can be derived as the Fisher kernel from a probabilistic model (e.g., Jaakkola & Haussler, 1999; Tsuda et al., 2002; Sonnenburg et al., 2002; Smith & Gales, 2002).¹

In spite of its practical success, a theoretical analysis of the Fisher kernel has not been sufficiently explored so far, the exceptions being Jaakkola, Meila, and Jebara (1999), Tsuda and Kawanabe (2002), Seeger (2002), and Tsuda, Kawanabe, and Müller (in press). For example, Jaakkola et al. (1999) showed how to determine the prior distribution of parameters to recover the Fisher kernel in the framework of maximum entropy discrimination. And Seeger (2002) pointed out that the Fisher kernel can be perceived as an approximation of the mutual information kernel. This article presents theoretical results from a statistical point of view. In particular, we perceive the Fisher kernel as a method of constructing a model of the posterior probability of the class labels. The Fisher kernel can be derived as follows: Let $\mathcal{X}$ denote the domain of objects, which can be discrete or continuous. Let us assume that a probabilistic model $q(x \mid \theta)$, $x \in \mathcal{X}$, $\theta \in \Theta \subseteq \mathbb{R}^d$, is available. The Fisher score is defined as

$$f_\theta(x) = \nabla_\theta \log q(x \mid \theta). \qquad (1.1)$$

¹ Of course, a brute force search over all possible kernels can be pursued using cross-validation procedures or bounds from learning theory to select the "best" kernel (cf. Müller et al., 2001).
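To make the Fisher score of equation 1.1 concrete, here is a minimal sketch (not from the letter; the univariate Gaussian model and all names are assumptions) that computes $f_\theta(x)$ for $\theta = (\mu, \sigma)$ and the plain inner-product Fisher kernel:

```python
import numpy as np

def fisher_score(x, mu, sigma):
    """Fisher score f_theta(x) = grad_theta log q(x | theta) for a
    univariate Gaussian marginal model with theta = (mu, sigma)."""
    d_mu = (x - mu) / sigma**2                      # d log q / d mu
    d_sigma = ((x - mu)**2 - sigma**2) / sigma**3   # d log q / d sigma
    return np.array([d_mu, d_sigma])

def fisher_kernel(x1, x2, mu, sigma):
    """Plain inner product of Fisher scores (identity metric); the full
    Fisher kernel would insert the inverse Fisher information matrix
    between the two scores."""
    return fisher_score(x1, mu, sigma) @ fisher_score(x2, mu, sigma)

k = fisher_kernel(0.5, -0.2, mu=0.0, sigma=1.0)
```

Note that two inputs of different "sizes" would be compared in the same fixed-dimensional parameter space, which is exactly the point made above about HMMs and sequences of different lengths.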
The Fisher kernel refers to the inner product in this space. When used in supervised classification, the Fisher kernel is commonly combined with a linear classifier such as SVMs (Vapnik, 1998), where a linear function is trained to discriminate two classes. Since the Fisher kernel can efficiently make use of prior knowledge about the marginal distribution $p(x)$ (which can be estimated rather well using unlabeled samples), it is especially attractive in vision, text classification, and bioinformatics, where we can expect a lot of unlabeled samples (Zhang & Oles, 2000; Seeger, 2001). First, we will show the sufficient conditions that the obtained posterior model is realizable, that is, that it contains the true posterior distribution, which then immediately leads to consistency. Once realizability is ensured, we can evaluate the expected generalization error in large sample situations by means of asymptotic statistics (Barndorff-Nielsen & Cox, 1989). This enables us to elucidate learning curves and how unlabeled samples contribute to reducing the generalization error. In addition, we point out that when a linear classifier is combined with the Fisher kernel, the log loss and the squared loss are theoretically preferable to other loss functions. This result recommends using a classifier based on the log loss or the squared loss.

2 Realizability Conditions

Let $y \in \{+1, -1\}$ be the set of class labels. Denote by $p(x)$, $P(y \mid x)$, and $p(x, y)$ the true underlying marginal, posterior, and joint distributions, respectively. Let $\partial_\alpha f = \partial f / \partial \alpha$, let $\nabla_\theta f = (\partial_{\theta_1} f, \ldots, \partial_{\theta_d} f)^\top$, and let $\nabla^2_\theta f$ denote the $d \times d$ Hessian matrix whose $(i, j)$th element is $\partial^2 f / (\partial\theta_i\,\partial\theta_j)$. For statistical learning, we construct a model of the posterior probability $P(y \mid x)$ out of the Fisher score, equation 1.1. The posterior probability is described by a linear function followed by an activation function $h$,

$$Q(y \mid x, \eta) = h(y[w^\top f_\theta(x) + b]), \qquad (2.1)$$

where $w \in \mathbb{R}^d$, $b \in \mathbb{R}$, and …

$$w^{*\top} f_{\theta^*}(x) + b^* = P(y = +1 \mid x) - P(y = -1 \mid x). \qquad (2.5)$$

To prove the lemma for $q_0$, it is sufficient to show the existence of $w, b \in \mathbb{R}$ such that

$$w\,\partial_\alpha \log q_0(x \mid \alpha^*) + b = P(y = +1 \mid x) - P(y = -1 \mid x). \qquad (2.6)$$

The Fisher score for $q_0(x \mid \alpha^*)$ can be written as

$$\partial_\alpha \log q_0(x \mid \alpha^*) = \frac{P(y = +1 \mid x)}{1 - \alpha^*} - \frac{P(y = -1 \mid x)}{\alpha^*}.$$
When $w = 2\alpha^*(1 - \alpha^*)$ and $b = 2\alpha^* - 1$, equation 2.6 holds.

2.2 Deriving Realizability Conditions. Since we do not know the true class distributions $p(x \mid y)$, the core model $q_0(x \mid \alpha)$ in lemma 1 is never available. In the following, the result of lemma 1 is therefore relaxed to a more general class of probability models. Denote by $M$ the set of probability distributions $M = \{q_0 \mid q_0(x \mid \alpha),\ \alpha \in [0, 1]\}$. According to information geometry (Amari & Nagaoka, 2001), $M$ is regarded as a manifold in a Riemannian space. Let $Q$ denote the manifold of $q(x \mid \theta)$: $Q = \{q \mid q(x \mid \theta),\ \theta \in \Theta\}$.

…

$$[\hat{w}, \hat{b}] = \mathop{\mathrm{argmax}}_{w, b} \sum_{i=1}^{n} \log h(y_i[w^\top f_{\hat\theta}(x_i) + b]). \qquad (3.2)$$

The second step maximizes the conditional likelihood $Q(y \mid x, \eta)$. Let $\ell(y, y')$ denote a loss function. Then equation 3.2 is generalized as

$$[\hat{w}, \hat{b}] = \mathop{\mathrm{argmin}}_{w, b} \sum_{i=1}^{n} \ell(w^\top f_{\hat\theta}(x_i) + b,\ y_i), \qquad (3.3)$$

where the loss function in equation 3.2 corresponds to

$$\ell(y, y') = -\log h(y y'). \qquad (3.4)$$

First, we prove that consistency is ensured for the log loss (see equation 3.4); that is, in the limit as $n$ goes to infinity, the estimator $\hat\eta$ converges to the true one, $\eta^*$. Further, it will be shown that consistency can be proved for the squared loss, which has practical advantages.

Lemma 2. Assume that $q(x \mid \theta)$ satisfies the realizability conditions in theorem 1. The two-step estimator with the log loss (see equation 3.4) is consistent.
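The two-step scheme of equations 3.2 through 3.4 can be sketched numerically. The following is an illustrative toy implementation, not the authors' code: the univariate Gaussian marginal model, the logistic sigmoid for $h$, the synthetic data, and the plain gradient-descent fit are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian classes; unlabeled points come from the mixture.
n, m = 200, 1000
y = rng.choice([-1, 1], size=n)
x_lab = y * 1.0 + rng.normal(size=n)                       # labeled inputs
x_unl = rng.choice([-1, 1], size=m) + rng.normal(size=m)   # unlabeled inputs

# Step 1: maximum likelihood for the marginal model q(x | theta),
# here a single Gaussian theta = (mu, sigma), using ALL x's.
x_all = np.concatenate([x_lab, x_unl])
mu, sigma = x_all.mean(), x_all.std()

def fisher_score(x):
    """f_theta(x) for the Gaussian marginal model at the fitted theta."""
    return np.stack([(x - mu) / sigma**2,
                     ((x - mu)**2 - sigma**2) / sigma**3], axis=-1)

# Step 2: fit [w, b] on labeled data with the log loss (eqs. 3.3/3.4),
# h = logistic sigmoid, via plain batch gradient descent.
F = fisher_score(x_lab)
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(2000):
    margin = y * (F @ w + b)
    g = -y / (1.0 + np.exp(margin))   # d(-log h(margin)) / d(w^T f + b)
    w -= 0.1 * (F.T @ g) / n
    b -= 0.1 * g.mean()

pred = np.sign(F @ w + b)
accuracy = (pred == y).mean()
```

Because step 1 sees labeled and unlabeled inputs alike, the estimate of $\theta$ improves with $m$ even though the labels never enter it, which foreshadows the role of the ratio $r = m/n$ in the analysis below.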
Proof. In the two-step scheme, $\theta$ is estimated separately by maximum likelihood 3.1, so obviously $\hat\theta$ converges to $\theta^*$. When we have infinite samples, equation 3.3 is written as

$$[w^+, b^+] = \mathop{\mathrm{argmin}}_{w, b} \sum_{j \in \{+1, -1\}} P(y = j) \int \ell(w^\top f_{\theta^*}(x) + b,\ j)\, p(x \mid y = j)\, dx.$$

Therefore, we should prove $w^+ = w^*$ and $b^+ = b^*$. In other words, this problem is rewritten as a constrained variational problem (Gelfand & Fomin, 1963), where we find a function $g\colon \mathcal{X} \to \mathbb{R}$ that minimizes the following functional,

$$L(g) = \sum_{j \in \{+1, -1\}} P(y = j) \int \ell(g(x), j)\, p(x \mid y = j)\, dx, \qquad (3.5)$$

subject to the constraint $g \in G$, where

$$G = \{g \mid g(x) = w^\top f_{\theta^*}(x) + b,\ w \in \mathbb{R}^d,\ b \in \mathbb{R}\}.$$

…

$$\Gamma = \lim_{n \to \infty} \frac{1}{n}\, \mathrm{E}[\nabla_\eta v(z^n, x^u_m, \eta^*)], \qquad (4.6)$$

$$\Lambda = \lim_{n \to \infty} \mathrm{E}[\zeta \zeta^\top]. \qquad (4.7)$$
We calculate the asymptotic distribution of the M-estimator $\hat\eta$. From the estimating equation and equation 4.4, we have

$$0 = \frac{1}{n}\, v(z^n, x^u_m, \eta^*) + \frac{1}{n}\, \nabla_\eta v(z^n, x^u_m, \eta^*)(\hat\eta - \eta^*) + O_p(\|\hat\eta - \eta^*\|^2) = \frac{1}{\sqrt{n}}\,\zeta + \Gamma(\hat\eta - \eta^*) + O_p(n^{-1}),$$

where $\zeta = v(z^n, x^u_m, \eta^*)/\sqrt{n}$. Therefore, the estimator $\hat\eta$ can be approximated as

$$\sqrt{n}\,(\hat\eta - \eta^*) = -\Gamma^{-1}\zeta + O_p(n^{-1/2}), \qquad (4.8)$$

and it is asymptotically gaussian distributed,

$$\sqrt{n}\,(\hat\eta - \eta^*) \sim \mathcal{N}(0,\ \Gamma^{-1}\Lambda\Gamma^{-\top}), \qquad (4.9)$$

where $\Gamma^{-\top} = (\Gamma^\top)^{-1}$.
4.2 Asymptotic Expansion of the Generalization Errors. Next, let us consider the asymptotic expansion of generalization error 4.1. By Taylor expansion, we can calculate the expectation of $R(\hat\eta)$ over labeled and unlabeled samples $z^n$ and $x^u_m$,

$$\mathrm{E}[R(\hat\eta)] = R(\eta^*) + \nabla^\top_\eta R(\eta^*)\,\mathrm{E}[\hat\eta - \eta^*] + \frac{1}{2}\,\mathrm{tr}\{\nabla^2_\eta R(\eta^*)\,\mathrm{E}[(\hat\eta - \eta^*)(\hat\eta - \eta^*)^\top]\} + O(n^{-3/2}) = R(\eta^*) + \frac{1}{2n}\,\mathrm{tr}\{\nabla^2_\eta R(\eta^*)\,\Gamma^{-1}\Lambda\Gamma^{-\top}\} + O(n^{-3/2}). \qquad (4.10)$$

When we adopt the KL divergence, equation 4.1, it turns out that $R(\eta^*) = 0$, and the Hessian is equal to the Fisher information matrix (Barndorff-Nielsen & Cox, 1989),

$$\nabla^2_\eta R(\eta^*) = -\mathrm{E}_{x,y}[\nabla^2_\eta \log q(x, y \mid \eta^*)] = G,$$

where

$$G = \mathrm{E}_{x,y}[\nabla_\eta \log q(x, y \mid \eta^*)\, \nabla^\top_\eta \log q(x, y \mid \eta^*)].$$

Therefore, the generalization error is described as

$$\mathrm{E}[R(\hat\eta)] = \frac{1}{2n}\,\mathrm{tr}\{G\,\Gamma^{-1}\Lambda\Gamma^{-\top}\} + O(n^{-3/2}). \qquad (4.11)$$
Notice that the derivation of the generalization error 4.11 relies heavily on the regularity conditions governing the limiting properties of M-estimators (Barndorff-Nielsen & Cox, 1989). When the regularity conditions do not hold, we need different mathematical machinery to analyze the generalization error (Watanabe, 2001).

4.3 Generalization Error of the Two-Step Estimator. The two-step estimator 3.1 and 3.2 is regarded as a special case of M-estimators, where the estimating functions are

$$v_\theta(z^n, x^u_m, \hat\theta) = \sum_{i=1}^{n} \nabla_\theta \log q(x_i \mid \hat\theta) + \sum_{j=1}^{m} \nabla_\theta \log q(x^u_j \mid \hat\theta) = 0, \qquad (4.12)$$

$$v_\xi(z^n, x^u_m, \hat\xi, \hat\theta) = \sum_{i=1}^{n} \nabla_\xi \log h\{y_i(\hat{w}^\top f_{\hat\theta}(x_i) + \hat{b})\} = 0, \qquad (4.13)$$

where we write $\xi = (w^\top, b)^\top$ for convenience. Following the general approach presented in section 4.1, the generalization error can be derived. To this end, let us first define some important notation. Let us decompose the Fisher information matrix $G$ as

$$G = \begin{pmatrix} G_{\xi\xi} & G_{\xi\theta} \\ G_{\theta\xi} & G_{\theta\theta} \end{pmatrix},$$
where $G_{\xi\xi}$, $G_{\xi\theta}$, $G_{\theta\theta}$ are matrices of size $(d+1) \times (d+1)$, $(d+1) \times d$, and $d \times d$, respectively, and $G_{\theta\xi} = G^\top_{\xi\theta}$. Then its inverse is written as

$$G^{-1} = \begin{pmatrix} S_{\xi\xi} & S_{\xi\theta} \\ S_{\theta\xi} & S_{\theta\theta} \end{pmatrix},$$

where $S_{\theta\theta} = (G_{\theta\theta} - G_{\theta\xi} G^{-1}_{\xi\xi} G_{\xi\theta})^{-1}$ (others not shown for brevity). From these submatrices, we define the effective Fisher information (Kawanabe & Amari, 1994) as

$$G^E_{\theta\theta} := S^{-1}_{\theta\theta} = G_{\theta\theta} - G_{\theta\xi} G^{-1}_{\xi\xi} G_{\xi\theta}, \qquad (4.14)$$

which is the net information of $\theta$ after subtracting the amount shared with the other parameter $\xi$. We also define

$$U_{\theta\theta} = \mathrm{E}_x[\nabla_\theta \log q(x \mid \theta^*)\, \nabla_\theta \log q(x \mid \theta^*)^\top].$$

Then the generalization error is derived as follows:

Theorem 2. The generalization error of the two-step estimator is

$$\mathrm{E}[R(\hat\eta)] = \frac{1}{2n}\left\{d + 1 + \frac{1}{1+r}\,\mathrm{tr}\left(G^E_{\theta\theta} U^{-1}_{\theta\theta}\right)\right\} + O(n^{-3/2}). \qquad (4.15)$$

The proof is described in appendix A.

4.4 Cramér-Rao Bound. Now that we have derived the generalization error (see equation 4.15), the next question is how it compares to other estimators. In order to answer this question, we will consider the lower bound of the generalization errors among a reasonable set of estimators. It is well known that the parameter variance of any asymptotically unbiased estimator⁴ is lower-bounded by means of the Fisher information (e.g., Barndorff-Nielsen & Cox, 1989).

Theorem 3 (Asymptotic Cramér-Rao bound). Assume that there are $n$ samples $x_1, \ldots, x_n$ drawn i.i.d. from $p(x \mid \eta^*)$. Also assume that an estimator $\hat\eta(x_1, \ldots, x_n)$ is asymptotically unbiased, that is,

$$\mathrm{E}[\hat\eta(x_1, \ldots, x_n)] = \eta^* + o(n^{-1/2}).$$

The covariance matrix of the estimator is asymptotically lower-bounded as

$$\lim_{n \to \infty} n\,\mathrm{V}[\hat\eta - \eta^*] \geq J^{-1}, \qquad (4.16)$$

⁴ One could consider asymptotically biased estimators, but typically such estimators are too tricky to be used in practice.
where $J$ is the Fisher information matrix,

$$J = \mathrm{E}_x[\nabla_\eta \log p(x \mid \eta^*)\, \nabla_\eta \log p(x \mid \eta^*)^\top],$$

and $A \geq B$ means that $A - B$ is positive semidefinite.

Our problem is slightly more complicated than stated in this theorem, because we have both $n$ labeled and $m$ unlabeled samples. In this case, the total Fisher information is simply the sum of the Fisher information of labeled and unlabeled data (e.g., Zhang & Oles, 2000; Seeger, 2001). Therefore, fixing the ratio $r = m/n$, the bound 4.16 is rewritten as

$$\lim_{n \to \infty} n\,\mathrm{V}[\hat\eta - \eta^*] \geq (G + rU)^{-1},$$

where $U$ is the Fisher information of the marginal model $q(x \mid \theta)$:

$$U = \mathrm{E}_x[\nabla_\eta \log q(x \mid \theta^*)\, \nabla^\top_\eta \log q(x \mid \theta^*)].$$

Once the parameter variance is bounded, we can bound the generalization error asymptotically as follows:

Theorem 4. The generalization error of any asymptotically unbiased estimator is lower-bounded as

$$\lim_{n \to \infty} n\,\mathrm{E}[R(\hat\eta)] \geq \frac{1}{2}\,\mathrm{tr}(I + rG^{-1}U)^{-1}. \qquad (4.17)$$

Proof. Let us abbreviate $\mathrm{V}[\hat\eta - \eta^*]$ as $V$. As seen in equation 4.11, the generalization error is asymptotically expanded as

$$\mathrm{E}[R(\hat\eta)] = \frac{1}{2n}\,\mathrm{tr}\{GV\} + O(n^{-3/2}). \qquad (4.18)$$

Since $\lim_{n \to \infty} nV \geq (G + rU)^{-1}$, we derive equation 4.17 as

$$\lim_{n \to \infty} n\,\mathrm{E}[R(\hat\eta)] \geq \frac{1}{2}\,\mathrm{tr}\,G(G + rU)^{-1} = \frac{1}{2}\,\mathrm{tr}(I + rG^{-1}U)^{-1}.$$
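The final algebraic step, $\mathrm{tr}\,G(G + rU)^{-1} = \mathrm{tr}(I + rG^{-1}U)^{-1}$, follows from $(G + rU)^{-1}G = (G^{-1}(G + rU))^{-1}$ and the cyclic property of the trace. It is easy to confirm numerically (a quick sanity check with randomly generated matrices; not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 5, 3.0

# Random symmetric positive-definite G and positive-semidefinite U.
A = rng.normal(size=(d, d))
G = A @ A.T + d * np.eye(d)
B = rng.normal(size=(d, d))
U = B @ B.T

lhs = np.trace(G @ np.linalg.inv(G + r * U))
rhs = np.trace(np.linalg.inv(np.eye(d) + r * np.linalg.solve(G, U)))
# lhs and rhs agree up to floating-point error
```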
4.5 Effect of Unlabeled Data. As we compare the generalization error 4.15 with the lower bound 4.17, it is obvious that the generalization error does not achieve the lower bound with equality. This means that the two-step estimator fails to exploit all the Fisher information provided by the samples. Intuitively, this is because we use only the $x$'s in estimating $\theta$ at the first step, discarding the information in $y$.
However, we will show that the difference to the lower bound gets smaller as the number of unlabeled samples increases. In order to compare the generalization error and the lower bound, the lower bound is expanded as follows:

Lemma 4. When expanded with respect to $r$, the lower bound in equation 4.17 is described as

$$\lim_{n \to \infty} n\,\mathrm{E}[R(\hat\eta)] \geq \frac{1}{2}\,\mathrm{tr}(I + rG^{-1}U)^{-1} = \frac{1}{2}\left\{d + 1 + \frac{1}{r}\,\mathrm{tr}(G^E_{\theta\theta} U^{-1}_{\theta\theta})\right\} + O(r^{-2}). \qquad (4.19)$$

The proof is in appendix B. The $n^{-1}$ coefficient of the generalization error 4.15 is described as

$$\frac{1}{2}\left\{d + 1 + \frac{1}{r}\,\mathrm{tr}(G^E_{\theta\theta} U^{-1}_{\theta\theta})\right\} + O(r^{-2}). \qquad (4.20)$$

Thus, the difference to the lower bound is within the order of $r^{-2}$, which becomes very small when $r$ is large. In order to illustrate this result, we calculate the learning curves for a simple model. The Fisher score is derived from the core model 2.3, where the class distributions are one-dimensional unit-variance gaussians centered on $-1$ and $1$, respectively:

$$x \mid y = +1 \sim \mathcal{N}(-1, 1), \qquad x \mid y = -1 \sim \mathcal{N}(1, 1).$$
The learning curves at $r = 0, 1$, and $3$ are shown in Figure 2. When there are no unlabeled samples ($r = 0$), the difference between the learning curve and the lower bound is substantially large. However, the difference gets smaller quickly as $r$ increases, and the two curves become almost identical at $r = 3$. This illustrative result underlines our theoretical analysis and suggests the importance of unlabeled samples in learning with the Fisher kernel.

Figure 2: Theoretical learning curves of the Fisher kernel classifier (panels for $r = 0$, $r = 1$, and $r = 3$). The horizontal axis shows the number of labeled samples $n$, and the vertical axis shows the generalization error $\mathrm{E}[R(\hat\eta)]$. The solid and broken curves correspond to the generalization error of the two-step estimator and the lower bound determined by the Cramér-Rao bound, respectively. As the unlabeled/labeled ratio $r$ increases, the two curves get closer.

5 Conclusion

In this article, we have investigated several theoretical aspects of the Fisher kernel. One contribution is that we have put the Fisher kernel into the framework of statistical inference by showing the realizability conditions. This allows for subsequent analysis of discriminative classifiers, consistency, and learning curves of the generalization error (including unlabeled data). Thus, our study has put the Fisher kernel approach on a more solid statistical basis, from which new algorithmic directions can be explored (e.g., Bayes inference). We examined in this article only one option for feature extraction from marginal models: the combination of the Fisher kernel, a linear classifier, and a linear activation function. In practice, it makes sense to consider alternative combinations. Ultimately, our goal is to construct a universal statistical theory of feature extraction from marginal models, which allows even wider practical use and a better inclusion of prior knowledge (e.g., hidden in unlabeled data or in industrial domain knowledge) into kernel-based learning methods.
Appendix A: Proof of Theorem 2

Let us decompose the matrices $\Gamma$, $\Lambda$ as

$$\Gamma = \begin{pmatrix} \Gamma_{\xi\xi} & \Gamma_{\xi\theta} \\ \Gamma_{\theta\xi} & \Gamma_{\theta\theta} \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \Lambda_{\xi\xi} & \Lambda_{\xi\theta} \\ \Lambda_{\theta\xi} & \Lambda_{\theta\theta} \end{pmatrix}.$$

The submatrices are computed as follows:

$$\Gamma_{\xi\xi} = \lim_{n \to \infty} \frac{1}{n}\,\mathrm{E}\left[\nabla_\xi \nabla_\xi \sum_{i=1}^{n} \log Q(y_i \mid x_i, \eta^*)\right] = \mathrm{E}[\nabla^2_\xi \log q(x, y \mid \eta^*) - \nabla^2_\xi \log q(x \mid \theta^*)] = -G_{\xi\xi},$$

$$\Gamma_{\xi\theta} = \lim_{n \to \infty} \frac{1}{n}\,\mathrm{E}\left[\nabla_\theta \nabla_\xi \sum_{i=1}^{n} \log Q(y_i \mid x_i, \eta^*)\right] = -G_{\xi\theta},$$

$$\Gamma_{\theta\xi} = \lim_{n \to \infty} \frac{1}{n}\,\mathrm{E}\left[\nabla_\xi \left\{\sum_{i=1}^{n} \nabla_\theta \log q(x_i \mid \theta^*) + \sum_{j=1}^{m} \nabla_\theta \log q(x^u_j \mid \theta^*)\right\}\right] = 0,$$

$$\Gamma_{\theta\theta} = \lim_{n \to \infty} \frac{1}{n}\,\mathrm{E}\left[\nabla_\theta \left\{\sum_{i=1}^{n} \nabla_\theta \log q(x_i \mid \theta^*) + \sum_{j=1}^{m} \nabla_\theta \log q(x^u_j \mid \theta^*)\right\}\right] = -(1 + r)\,U_{\theta\theta},$$

$$\Lambda_{\xi\xi} = \lim_{n \to \infty} \frac{1}{n}\,\mathrm{E}\left[\sum_{i=1}^{n} \nabla_\xi \log Q(y_i \mid x_i, \eta^*)\, \sum_{i=1}^{n} \nabla_\xi \log Q(y_i \mid x_i, \eta^*)^\top\right] = G_{\xi\xi},$$

$$\Lambda_{\xi\theta} = \lim_{n \to \infty} \frac{1}{n}\,\mathrm{E}\left[\sum_{i=1}^{n} \nabla_\xi \log Q(y_i \mid x_i, \eta^*) \left\{\sum_{i=1}^{n} \nabla_\theta \log q(x_i \mid \theta^*) + \sum_{j=1}^{m} \nabla_\theta \log q(x^u_j \mid \theta^*)\right\}^\top\right] = 0,$$

$$\Lambda_{\theta\xi} = 0,$$

$$\Lambda_{\theta\theta} = \lim_{n \to \infty} \frac{1}{n}\,\mathrm{E}\left[\left\{\sum_{i=1}^{n} \nabla_\theta \log q(x_i \mid \theta^*) + \sum_{j=1}^{m} \nabla_\theta \log q(x^u_j \mid \theta^*)\right\} \left\{\sum_{i=1}^{n} \nabla_\theta \log q(x_i \mid \theta^*) + \sum_{j=1}^{m} \nabla_\theta \log q(x^u_j \mid \theta^*)\right\}^\top\right] = (1 + r)\,U_{\theta\theta}.$$

In summary, we have the following:

$$\Gamma = -\begin{pmatrix} G_{\xi\xi} & G_{\xi\theta} \\ 0 & (1 + r)\,U_{\theta\theta} \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} G_{\xi\xi} & 0 \\ 0 & (1 + r)\,U_{\theta\theta} \end{pmatrix}.$$

The inverse matrix of $\Gamma$ becomes

$$\Gamma^{-1} = -\begin{pmatrix} G_{\xi\xi} & G_{\xi\theta} \\ 0 & (1 + r)\,U_{\theta\theta} \end{pmatrix}^{-1} = -\begin{pmatrix} G^{-1}_{\xi\xi} & -\frac{1}{1+r}\,G^{-1}_{\xi\xi} G_{\xi\theta} U^{-1}_{\theta\theta} \\ 0 & \frac{1}{1+r}\,U^{-1}_{\theta\theta} \end{pmatrix}.$$

The asymptotic covariance of $\hat\eta$ is

$$\Gamma^{-1}\Lambda\Gamma^{-\top} = \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix},$$

where

$$A = \Gamma^{-1}_{\xi\xi}\Lambda_{\xi\xi}\Gamma^{-1}_{\xi\xi} + \Gamma^{-1}_{\xi\xi}\Gamma_{\xi\theta}\Gamma^{-1}_{\theta\theta}\Lambda_{\theta\theta}\Gamma^{-1}_{\theta\theta}\Gamma^\top_{\xi\theta}\Gamma^{-1}_{\xi\xi} = G^{-1}_{\xi\xi} + \frac{1}{1+r}\,G^{-1}_{\xi\xi} G_{\xi\theta} U^{-1}_{\theta\theta} G_{\theta\xi} G^{-1}_{\xi\xi},$$

$$B = -\Gamma^{-1}_{\xi\xi}\Gamma_{\xi\theta}\Gamma^{-1}_{\theta\theta}\Lambda_{\theta\theta}\Gamma^{-1}_{\theta\theta} = -\frac{1}{1+r}\,G^{-1}_{\xi\xi} G_{\xi\theta} U^{-1}_{\theta\theta},$$

$$D = \Gamma^{-1}_{\theta\theta}\Lambda_{\theta\theta}\Gamma^{-1}_{\theta\theta} = \frac{1}{1+r}\,U^{-1}_{\theta\theta}.$$

Therefore,

$$G\,\Gamma^{-1}\Lambda\Gamma^{-\top} = \begin{pmatrix} G_{\xi\xi} & G_{\xi\theta} \\ G_{\theta\xi} & G_{\theta\theta} \end{pmatrix} \begin{pmatrix} A & B \\ B^\top & D \end{pmatrix} = \begin{pmatrix} I & 0 \\ G_{\theta\xi} G^{-1}_{\xi\xi} - \frac{1}{1+r}\,G^E_{\theta\theta} U^{-1}_{\theta\theta} G_{\theta\xi} G^{-1}_{\xi\xi} & \frac{1}{1+r}\,G^E_{\theta\theta} U^{-1}_{\theta\theta} \end{pmatrix}.$$

By substituting it into equation 4.11, we get the asymptotic expansion of the generalization error as

$$\mathrm{E}[R(\hat\eta)] = \frac{1}{2n}\left\{d + 1 + \frac{1}{1+r}\,\mathrm{tr}(G^E_{\theta\theta} U^{-1}_{\theta\theta})\right\} + O(n^{-3/2}). \qquad (A.1)$$
We remark that $G^E_{\theta\theta} > U_{\theta\theta}$ in this case. This can be shown as follows. The conditional information matrix,

$$J(Y \mid X) = \mathrm{E}_{x,y}[\nabla_\eta \log Q(y \mid x, \eta^*)\, \nabla_\eta \log Q(y \mid x, \eta^*)^\top] = -\mathrm{E}_{x,y}[\nabla_\eta \nabla_\eta \log Q(y \mid x, \eta^*)] = \begin{pmatrix} G_{\xi\xi} & G_{\xi\theta} \\ G_{\theta\xi} & G_{\theta\theta} - U_{\theta\theta} \end{pmatrix},$$

is positive definite if the probabilistic model is regular. Let us transform the information matrix as

$$F\,J(Y \mid X)\,F^\top = \begin{pmatrix} G_{\xi\xi} & 0 \\ 0 & G^E_{\theta\theta} - U_{\theta\theta} \end{pmatrix}, \quad \text{where} \quad F = \begin{pmatrix} I & 0 \\ -G_{\theta\xi} G^{-1}_{\xi\xi} & I \end{pmatrix}.$$

Since the matrix $F\,J(Y \mid X)\,F^\top$ is positive definite, $G^E_{\theta\theta} - U_{\theta\theta}$ must be positive definite too.

Although we showed the result only in the case of the log-likelihood loss, it is possible to calculate the generalization error for general loss functions. The formula becomes

$$\mathrm{E}[R(\hat\eta)] = \frac{1}{2n}\left\{\mathrm{tr}(G_{\xi\xi} L^{-1}_{\xi\xi} \Lambda_{\xi\xi} L^{-1}_{\xi\xi}) + \frac{1}{1+r}\,\mathrm{tr}(H U^{-1}_{\theta\theta})\right\} + O(n^{-3/2}), \qquad (A.2)$$

where

$$H = G_{\theta\theta} - G_{\theta\xi} L^{-1}_{\xi\xi} L_{\xi\theta} - L_{\theta\xi} L^{-1}_{\xi\xi} G_{\xi\theta} + L_{\theta\xi} L^{-1}_{\xi\xi} G_{\xi\xi} L^{-1}_{\xi\xi} L_{\xi\theta},$$

$$L_{\eta\eta} = -\mathrm{E}_{x,y}[\nabla_\eta \nabla_\eta\, \ell(w^\top f_\theta(x) + b,\ y)],$$

$$\Lambda_{\xi\xi} = \mathrm{E}_{x,y}[\nabla_\xi\, \ell(w^\top f_\theta(x) + b,\ y)\; \nabla_\xi\, \ell(w^\top f_\theta(x) + b,\ y)^\top].$$

Appendix B: Proof of Lemma 4
In order to prove equation 4.19, we will use the following expansion (Sugiyama, 2001):

Lemma 5. For any symmetric matrix $Z$ and $r \neq 0$, $(I + rZ)^{-1}$ is expanded as follows:

$$(I + rZ)^{-1} = (I - ZZ^\dagger) - \sum_{j=1}^{k}\left(-\frac{1}{r}\,Z^\dagger\right)^{j} - \left(-\frac{1}{r}\,Z^\dagger\right)^{k+1}\left(I + \frac{1}{r}\,Z^\dagger\right)^{-1}, \qquad (A.1)$$

where $\dagger$ indicates the Moore-Penrose pseudoinverse (Campbell & Meyer, 1979) and $k$ is an arbitrary positive integer.

The proof is described in appendix C. The lower bound in equation 4.17 is rewritten as

$$\frac{1}{2}\,\mathrm{tr}(I + rG^{-1}U)^{-1} = \frac{1}{2}\,\mathrm{tr}[(G^{1/2} + rG^{-1/2}U)^{-1} G^{1/2}] = \frac{1}{2}\,\mathrm{tr}[G^{1/2}(G^{1/2} + rG^{-1/2}U)^{-1}] = \frac{1}{2}\,\mathrm{tr}(I + rG^{-1/2} U G^{-1/2})^{-1}.$$

Setting $Z = G^{-1/2} U G^{-1/2}$, it is expanded as

$$\frac{1}{2}\,\mathrm{tr}(I + rG^{-1}U)^{-1} = \frac{1}{2}\left\{\xi_0 + \frac{\xi_1}{r}\right\} + O(r^{-2}), \qquad (A.2)$$
where the coefficients are described as

$$\xi_0 = \mathrm{tr}(I - ZZ^\dagger), \qquad \xi_1 = \mathrm{tr}(Z^\dagger). \qquad (A.3)$$

Equation 4.19 is proved because the coefficients are derived as follows:

Lemma 6. The coefficients $\xi_0$ and $\xi_1$ are described as

$$\xi_0 = d + 1, \qquad \xi_1 = \mathrm{tr}(G^E_{\theta\theta} U^{-1}_{\theta\theta}), \qquad (A.4)$$

respectively.

Proof. Since $q(x \mid \theta)$ does not depend on $w$ and $b$, $U$ is described as

$$U = \begin{pmatrix} 0 & 0 \\ 0 & U_{\theta\theta} \end{pmatrix}.$$

Then $Z$ is rewritten as

$$Z = G^{-1/2} U G^{-1/2} = BB^\top, \qquad (A.5)$$

where $B$ is a $(2d + 1) \times d$ matrix:

$$B = G^{-1/2}\begin{pmatrix} 0 \\ U^{1/2}_{\theta\theta} \end{pmatrix}.$$

In terms of $B$, the pseudoinverse of $Z$ is written as $Z^\dagger = B(B^\top B)^{-2} B^\top$. The coefficient $\xi_0$ is rewritten as

$$\xi_0 = \mathrm{tr}(I - ZZ^\dagger) = (2d + 1) - \mathrm{tr}(ZZ^\dagger),$$

where

$$\mathrm{tr}(ZZ^\dagger) = \mathrm{tr}(BB^\top B (B^\top B)^{-2} B^\top) = \mathrm{tr}(B (B^\top B)^{-1} B^\top) = \mathrm{tr}(B^\top B (B^\top B)^{-1}) = d.$$

We thus have $\xi_0 = d + 1$. The coefficient $\xi_1 = \mathrm{tr}(Z^\dagger)$ is rewritten as

$$\mathrm{tr}(Z^\dagger) = \mathrm{tr}[B (B^\top B)^{-2} B^\top] = \mathrm{tr}(B^\top B)^{-1} = \mathrm{tr}\left(\begin{bmatrix} 0 & U^{1/2}_{\theta\theta} \end{bmatrix} G^{-1} \begin{bmatrix} 0 \\ U^{1/2}_{\theta\theta} \end{bmatrix}\right)^{-1} = \mathrm{tr}[U^{1/2}_{\theta\theta} S_{\theta\theta} U^{1/2}_{\theta\theta}]^{-1} = \mathrm{tr}(S^{-1}_{\theta\theta} U^{-1}_{\theta\theta}) = \mathrm{tr}(G^E_{\theta\theta} U^{-1}_{\theta\theta}).$$
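Lemma 6 is easy to check numerically (an illustrative verification, not from the paper; the random matrices, conditioning terms, and block sizes are arbitrary choices). With $U$ of the block form $\mathrm{diag}(0, U_{\theta\theta})$, $\mathrm{tr}(I - ZZ^\dagger)$ should equal $d + 1$ and $\mathrm{tr}(Z^\dagger)$ should equal $\mathrm{tr}(S^{-1}_{\theta\theta} U^{-1}_{\theta\theta})$:

```python
import numpy as np

d = 3
rng = np.random.default_rng(2)

# Random SPD "Fisher information" G of size (2d+1) x (2d+1).
A = rng.normal(size=(2 * d + 1, 2 * d + 1))
G = A @ A.T + (2 * d + 1) * np.eye(2 * d + 1)

# U has block form diag(0, U_tt): q(x | theta) does not depend on (w, b).
C = rng.normal(size=(d, d))
U_tt = C @ C.T + d * np.eye(d)
U = np.zeros_like(G)
U[d + 1:, d + 1:] = U_tt

# Z = G^{-1/2} U G^{-1/2}, using the symmetric inverse square root of G.
evals, evecs = np.linalg.eigh(G)
G_isqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
Z = G_isqrt @ U @ G_isqrt
Zp = np.linalg.pinv(Z)

xi0 = np.trace(np.eye(2 * d + 1) - Z @ Zp)     # expected: d + 1
S_tt = np.linalg.inv(G)[d + 1:, d + 1:]        # S_{theta theta} block of G^{-1}
xi1 = np.trace(Zp)                             # expected: tr(S_tt^{-1} U_tt^{-1})
xi1_direct = np.trace(np.linalg.inv(S_tt) @ np.linalg.inv(U_tt))
```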
Appendix C: Proof of Lemma 5

This expansion was originally derived in lemma 4.8 of Sugiyama (2001). In the following, we quote his proof for readers' convenience. Let us define $\alpha = 1/r$. According to theorem 4.8 in Albert (1972), the following holds for a symmetric matrix $Z$:

$$(I + \alpha^{-1} Z)^{-1} = (I - ZZ^\dagger) + \alpha Z^\dagger (I + \alpha Z^\dagger)^{-1}. \qquad (A.1)$$

Also, for any matrix $B$, we have the following equation:

$$(I + B)^{-1} = I(I + B)^{-1} = (I + B - B)(I + B)^{-1} = I - B(I + B)^{-1}.$$

Defining $B = \alpha Z^\dagger$, we have

$$(I + \alpha Z^\dagger)^{-1} = I - \alpha Z^\dagger (I + \alpha Z^\dagger)^{-1}. \qquad (A.2)$$

By repeatedly applying equation A.2 to equation A.1, we have

$$(I + \alpha^{-1} Z)^{-1} = (I - ZZ^\dagger) + \alpha Z^\dagger [I - \alpha Z^\dagger (I + \alpha Z^\dagger)^{-1}] = (I - ZZ^\dagger) + \alpha Z^\dagger - (\alpha Z^\dagger)^2 (I + \alpha Z^\dagger)^{-1} = (I - ZZ^\dagger) + \alpha Z^\dagger - (\alpha Z^\dagger)^2 + (\alpha Z^\dagger)^3 (I + \alpha Z^\dagger)^{-1} = \cdots = (I - ZZ^\dagger) - \sum_{j=1}^{k} (-\alpha Z^\dagger)^j - (-\alpha Z^\dagger)^{k+1} (I + \alpha Z^\dagger)^{-1}.$$
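The identity from Albert (1972) that the proof starts from, $(I + \alpha^{-1}Z)^{-1} = (I - ZZ^\dagger) + \alpha Z^\dagger(I + \alpha Z^\dagger)^{-1}$, can likewise be confirmed numerically for a rank-deficient symmetric $Z$ (a quick check, not part of the quoted proof; the matrix sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha = 6, 0.7

# Rank-deficient symmetric positive-semidefinite Z (rank 3 < n).
B = rng.normal(size=(n, 3))
Z = B @ B.T
Zp = np.linalg.pinv(Z)   # Moore-Penrose pseudoinverse
I = np.eye(n)

lhs = np.linalg.inv(I + Z / alpha)
rhs = (I - Z @ Zp) + alpha * Zp @ np.linalg.inv(I + alpha * Zp)
# lhs == rhs up to floating-point error
```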
Acknowledgments

K.R.M. acknowledges partial financial support by DFG (MU 987/1-1) and BMBF under contract FKZ 01IBB02A.

References

Albert, A. (1972). Regression and the Moore-Penrose pseudoinverse. Orlando, FL: Academic Press.
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5, 140–153.
Amari, S., & Nagaoka, H. (2001). Methods of information geometry. Providence, RI: American Mathematical Society.
Barndorff-Nielsen, O., & Cox, D. (1989). Asymptotic techniques for use in statistics. London: Chapman and Hall.
Baum, E., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Campbell, S., & Meyer, C. (1979). Generalized inverses of linear transformations. New York: Pitman Publishing.
Cox, D., & Hinkley, D. (1974). Theoretical statistics. London: Chapman & Hall.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer-Verlag.
Eguchi, S., & Copas, J. (2001). Information geometry on discriminant analysis and recent development. Journal of the Korean Statistical Society, 27, 101–117.
Gelfand, I., & Fomin, S. (1963). Calculus of variations. Englewood Cliffs, NJ: Prentice-Hall.
Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162, 705–708.
Haussler, D., Kearns, M., Seung, H., & Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25, 195–236.
Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 487–493). Cambridge, MA: MIT Press.
Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination (Tech. Rep. No. AITR-1668). Cambridge, MA: MIT Press.
Kawanabe, M., & Amari, S. (1994). Estimation of network parameters in semiparametric stochastic perceptron. Neural Computation, 6, 1244–1261.
Malzahn, D., & Opper, M. (2002). A variational approach to learning curves. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 463–469). Cambridge, MA: MIT Press.
Müller, K.-R., Finke, M., Schulten, K., Murata, N., & Amari, S. (1996). A numerical study on learning curves in stochastic multi-layer feed-forward networks. Neural Computation, 8, 1085–1106.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Trans. Neural Networks, 12, 181–201.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Seeger, M. (2001). Learning with labeled and unlabeled data (Tech. Rep.). Edinburgh: Institute for Adaptive and Neural Computation, University of Edinburgh. Available on-line: http://www.dai.ed.ac.uk/homes/seeger/papers/review.ps.gz.
Seeger, M. (2002). Covariance kernels from Bayesian generative models. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 905–912). Cambridge, MA: MIT Press.
Seung, S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Physical Review A, 45, 6056.
Smith, N., & Gales, M. (2002). Speech recognition using SVMs. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 1197–1204). Cambridge, MA: MIT Press.
Sonnenburg, S., Rätsch, G., Jagota, A., & Müller, K.-R. (2002). New methods for splice site recognition. In J. Dorronsoro (Ed.), Artificial neural networks – ICANN 2002 (pp. 329–336). New York: Springer-Verlag.
Sugiyama, M. (2001). A theory of model selection and active learning for supervised learning. Unpublished doctoral dissertation, Tokyo Institute of Technology.
Tsuda, K., & Kawanabe, M. (2002). The leave-one-out kernel. In J. Dorronsoro (Ed.), Artificial neural networks – ICANN 2002 (pp. 727–732). New York: Springer-Verlag.
Tsuda, K., Kawanabe, M., & Müller, K.-R. (in press). Clustering with the Fisher score. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Tsuda, K., Kawanabe, M., Rätsch, G., Sonnenburg, S., & Müller, K.-R. (2002). A new discriminative kernel from probabilistic models. Neural Computation, 14, 2397–2414.
van der Vaart, A. (1998). Asymptotic statistics. Cambridge: Cambridge University Press.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vinokourov, A., & Girolami, M. (2002). A probabilistic framework for the hierarchic organization and classification of document collections. Journal of Intelligent Information Systems, 18, 153–172.
Watanabe, S. (2001). Algebraic analysis for non-identifiable learning machines. Neural Computation, 13, 899–933.
Watkin, T., Rau, A., & Biehl, M. (1993). The statistical mechanics of learning a rule. Reviews of Modern Physics, 65, 499.
Zhang, T., & Oles, F. (2000). The value of unlabeled data for classification problems. In P. Langley (Ed.), Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1191–1198). San Mateo, CA: Morgan Kaufmann.
Zien, A., Rätsch, G., Mika, S., Schölkopf, B., Lengauer, T., & Müller, K.-R. (2000). Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, 16, 799–807.

Received February 3, 2003; accepted June 3, 2003.
LETTER
Communicated by David Stork
Adaptive Hybrid Learning for Neural Networks Rob Smithies
[email protected] Said Salhi
[email protected] Nat Queen
[email protected] School of Mathematics and Statistics, University of Birmingham, Birmingham B15 2TT, U.K.
A robust locally adaptive learning algorithm is developed via two enhancements of the Resilient Propagation (RPROP) method. Remaining drawbacks of the gradient-based approach are addressed by hybridization with gradient-independent Local Search. Finally, a global optimization method based on recursion of the hybrid is constructed, making use of tabu neighborhoods to accelerate the search for minima through diversification. Enhanced RPROP is shown to be faster and more accurate than standard RPROP in solving classification tasks based on natural data sets taken from the UCI repository of machine learning databases. Furthermore, the use of Local Search is shown to improve Enhanced RPROP by solving the same classification tasks as part of the global optimization method.

1 Introduction

Neural networks solve problems by learning the functional relationship between paired input-output data characterizing the problem. They therefore see use in a diverse range of applications, including pattern recognition, classification, and optimization. How this learning is performed is key to the success of the neural network. The most popular technique is supervised learning: the neural network is exposed to training examples of the functional relationship sampled from the paired input-output data, so that the connection weights affecting signals propagated through the neural network can be adapted. The standard architecture for supervised learning is the multilayer feedforward network (MLFF) (Rumelhart & McClelland, 1986), consisting of a number of units linked by weighted connections and organized into input, hidden, and output layers. This study therefore focuses on the training of MLFFs. Each training example consists of an input pattern $x_p$ and a target output pattern $t_p$, such that the functional relationship between paired input-output data is represented by a

Neural Computation 16, 139–157 (2004)
© 2003 Massachusetts Institute of Technology
set P of these training examples. The goal of supervised learning is to tune the connection weights so that when x_p is presented to the input layer and propagated through the neural network, the output pattern y_p computed in the output layer will match t_p. The training is guided by a cost function

    E = Σ_{p∈P} E_p,                                              (1.1)
where E_p is a measure of error between y_p and t_p. The aim of supervised learning is now equivalent to searching for a global minimum of E over the vector space spanned by the weights of the neural network. Typically, gradient descent is used for this type of optimization, in which each connection weight w_ij is adapted on iteration t by a step taken in the opposite direction to the gradient, scaled by the learning rate ε:

    Δw_ij(t) = −ε ∂E/∂w_ij(t),                                    (1.2)

    w_ij(t+1) = w_ij(t) + Δw_ij(t).

The standard method for computing the first-order derivative of equation 1.2, and thus for training MLFFs, is the backpropagation algorithm, devised by Rumelhart and McClelland (1986). It works by recursively propagating the training examples through the neural network in order to backpropagate the errors E_p of equation 1.1. If on-line learning is used, weights are updated for each training example p using the gradient ∂E_p/∂w_ij to minimize E_p, which works well if the data set contains a lot of redundant information. If off-line learning is used, weights are updated for all training examples using the gradient ∂E/∂w_ij to minimize E. Off-line learning is more reliable, as the shape of the entire cost function is considered, and so it is the strategy of choice in this study.

This letter tackles the problem of choosing an appropriate learning rate ε for training, which is often difficult and problem specific, being dependent on the shape of the error function. For instance, a small learning rate leads to slow convergence in shallow regions of the error function, whereas a large learning rate can lead to oscillations in the search. A number of proposed approaches deal with this problem of gradient descent and can be broadly classified as either globally or locally adaptive techniques. Globally adaptive techniques such as the conjugate gradient method (Moller, 1993) use knowledge about the state of the entire neural network, whereas locally adaptive techniques use weight-specific information only. Although local strategies require less information, Schiffman and Joost (1993) argue that they outperform global strategies for many applications, and so this letter focuses on locally adaptive techniques.
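As a minimal illustration, one off-line iteration of equation 1.2 can be sketched as follows (a sketch only; the quadratic error function and the names used here are ours, for illustration):

```python
import numpy as np

def gradient_descent_step(W, grad_E, eps=0.01):
    """One off-line gradient-descent iteration: each weight takes a step
    opposite its partial derivative, scaled by the learning rate eps."""
    dW = -eps * grad_E(W)     # Δw_ij(t) = −ε ∂E/∂w_ij(t)
    return W + dW             # w_ij(t+1) = w_ij(t) + Δw_ij(t)

# Toy quadratic error E(W) = ||W||^2, whose gradient is 2W:
W = np.array([1.0, -2.0])
for _ in range(100):
    W = gradient_descent_step(W, lambda W: 2 * W, eps=0.1)
# W is now very close to the minimum at the origin
```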
The article is organized as follows. A brief review of locally adaptive techniques is given in section 2. These form the basis of the new local and global optimization methods presented in section 3. The new methods are tested using natural data sets, and computational results are presented in section 4. A concluding summary follows in section 5.

2 A Review of Locally Adaptive Techniques

QuickProp is a locally adaptive technique due to Fahlman (1988), in which the local error function for each weight is assumed to be a parabolic basin whose gradient is unaffected by the updating of other weights. Estimates of the position of the minimum for each weight are given by the following update:

    Δw_ij(t) = −ε ∂E/∂w_ij(t) + [∂E/∂w_ij(t) / (∂E/∂w_ij(t−1) − ∂E/∂w_ij(t))] Δw_ij(t−1),    (2.1)
where the first term on the right-hand side is the standard gradient-descent step, and the second term is equivalent to a local application of Newton's approximation method (Riedmiller, 1994). Furthermore, in order to avoid excessively large update steps due to a small denominator in the second term of equation 2.1, the update step is restricted to be at most ν times as large as the previous update step. QuickProp consistently outperforms gradient descent and is one of the most popular adaptive learning techniques (Fahlman, 1988).

Delta-Bar-Delta, proposed by Jacobs (1988), uses weight-specific learning rates based on the reasoning that the error function may well differ in shape from weight to weight. Each learning rate is adapted using a local estimate of the shape of the error function given by the gradients at consecutive weight updates. If the gradients have the same sign, the learning rate is linearly increased by a small constant κ to accelerate learning in shallow regions of the error function. Otherwise, a local minimum has been traversed, as the previous weight update was too large, and so the learning rate is exponentially decreased by multiplying it with a constant η⁻:

    ε_ij(t) = κ + ε_ij(t−1),   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) > 0
              η⁻ ε_ij(t−1),    if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) < 0          (2.2)
              ε_ij(t−1),       otherwise,

where 0 < η⁻ < 1. The weight update is given by

    Δw_ij(t) = −ε_ij(t) ∂E/∂w_ij(t) + μ Δw_ij(t−1),               (2.3)
which is similar to the standard weight update equation 1.2 of gradient descent except that it includes an extra momentum term, with a momentum coefficient μ scaling the influence of the previous update on the current update. The Delta-Bar-Delta method also outperforms gradient descent and is more robust with respect to the choice of parameters (Jacobs, 1988).

Super SAB, described by Tollenaere (1990), is based on weight-specific learning rates like Delta-Bar-Delta, but differs slightly in that the learning rate is increased exponentially in shallow regions of the error function instead of linearly, thus providing a more reactive algorithm:

    ε_ij(t) = η⁺ ε_ij(t−1),   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) > 0
              η⁻ ε_ij(t−1),   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) < 0
              ε_ij(t−1),      otherwise,

where 0 < η⁻ < 1 < η⁺. The weight update is once again performed by equation 2.3.

The adaptive methods seen thus far suffer from a counterintuitive problem. When large update steps are required to accelerate learning along shallow regions of the error function, the associated small partial derivatives give rise to small update steps. Alternatively, when small update steps are needed to converge to a solution carefully, as in steep regions of the error function in the vicinity of local minima, the associated large partial derivatives give rise to large update steps. The best of the gradient-based methods in terms of convergence speed, accuracy, and robustness with respect to its parameters is Resilient Propagation (RPROP), developed by Riedmiller and Braun (1993). Only the sign of the partial derivative ∂E/∂w_ij is used by RPROP to determine each weight update. Therefore, the size of each weight update is independent of the size of the partial derivative, overcoming the problem mentioned above. Indeed, RPROP updates each weight w_ij based on a step size Δ_ij controlled by the following heuristic:

    Δ_ij(t) = η⁺ Δ_ij(t−1),   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) > 0
              η⁻ Δ_ij(t−1),   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) < 0           (2.4)
              Δ_ij(t−1),      otherwise,

where 0 < η⁻ < 1 < η⁺. If the gradient ∂E/∂w_ij does not change sign for consecutive weight steps, the step size is increased exponentially to accelerate learning; otherwise, the step size is decreased in an attempt to converge on the local minimum that has been traversed. The first two heuristics presented in section 3 are enhancements of equation 2.4. It should be noted that step sizes are bounded above by Δ_max to prevent excessively large weights. Moreover, the choice of initial step size Δ_ij(0) has some impact on performance, though RPROP is less sensitive to this than its rivals are to
the choice of their initial learning rate, as has been shown experimentally by Riedmiller (1994).

A common attempt at improving learning algorithms such as RPROP is to use weight backtracking (Silva & Almeida, 1990; Riedmiller & Braun, 1993), where a weight update is reversed if it impinges on the learning achieved for that weight of the neural network. If weight backtracking is used, each weight update Δw_ij is given by the RPROP+ heuristic:

    Δw_ij(t) = −sign(∂E/∂w_ij(t)) Δ_ij(t),   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) ≥ 0               (2.5)

and

    Δw_ij(t) = −Δw_ij(t−1)  and  ∂E/∂w_ij(t) = 0,   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) < 0,       (2.6)

whereas if weight backtracking is omitted, it is given by the RPROP− heuristic:

    Δw_ij(t) = −sign(∂E/∂w_ij(t)) Δ_ij(t).                        (2.7)

Enhancements of the RPROP+ and RPROP− heuristics have been proposed in the literature. An improved version of RPROP+ due to Igel and Hüsken (2000), known as iRPROP+, is based on the idea that a change in sign of the partial derivative implies a traversal of a local minimum but does not reveal whether the weight update leads to an increase or decrease in error. Therefore, reversing a weight step when the error decreases is counterproductive, and so weight backtracking should be performed only if the overall error increases. This is done by enhancing equation 2.6 to give the following heuristic:

    Δw_ij(t) = −Δw_ij(t−1),   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) < 0 and E(t) > E(t−1),           (2.8)

where iRPROP+ is given by combining equation 2.5 with 2.8. Close to a local minimum, the error function can be regarded as quadratic, and so if step sizes are adapted to be small enough to maintain the search in the associated basin of attraction, then each weight update traversing the minimum will always take a step closer to that minimum. Compared to RPROP+, only one additional variable needs to be stored: the previous overall error E(t−1).

An improved version of RPROP− due to Igel and Hüsken (2000), known as iRPROP−, incorporates the idea of decreasing the step size if a local minimum has been traversed, while avoiding unnecessary weight modification in the next iteration. This is done via the following heuristic:

    ∂E/∂w_ij(t) = 0,   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) < 0,       (2.9)
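A per-weight sketch combining the step-size rule of equation 2.4 with the RPROP− update of equation 2.7 and the gradient reset of equation 2.9 may help make the family concrete (an illustrative sketch only; the function name is ours, and the parameter values are those quoted for iRPROP− in section 4):

```python
def irprop_minus_step(w, delta, grad, grad_prev,
                      eta_plus=1.2, eta_minus=0.5, delta_max=50.0):
    """One iRPROP- update for a single weight w.

    Step size (equation 2.4): grow by eta_plus while the gradient keeps
    its sign, shrink by eta_minus when the sign flips (a minimum was
    traversed); cap at delta_max.
    """
    if grad_prev * grad > 0:
        delta = min(delta * eta_plus, delta_max)
    elif grad_prev * grad < 0:
        delta = delta * eta_minus
        grad = 0.0        # equation 2.9: suppress the next adaptation
    # Equation 2.7: only the sign of the gradient sets the direction.
    sign = (grad > 0) - (grad < 0)
    w = w - sign * delta
    return w, delta, grad   # grad is carried over as grad_prev next time
```

Note that zeroing the stored gradient on a sign change makes the weight step vanish for that iteration (sign(0) = 0) and forces the "otherwise" branch of equation 2.4 on the next one.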
where iRPROP− is given by combining equations 2.7 and 2.9. According to Igel and Hüsken (2000), iRPROP+ and iRPROP− outperform RPROP+ and RPROP− over a variety of problems, including optimization of artificial test functions, classification tasks taken from the PROBEN1 benchmark collection of natural data sets (Prechelt, 1994), and approximation of differential equation solutions. There appears to be little difference in performance between iRPROP+ and iRPROP−, but iRPROP+ requires additional computation of the previous error E(t−1), and so the enhancements of RPROP presented in this article will be tested in conjunction with the iRPROP− heuristic.

It should be noted that gradient-based methods require the existence of a gradient in order to descend toward a local minimum and thus fail in flat regions of the error function. However, they can also falter if the gradient is shallow enough to be deemed effectively flat by round-off due to a lack of computational precision. Moreover, local optimization strategies cannot guarantee a globally optimal solution to any given problem, as they are prone to becoming trapped in the basin of attraction of some local minimum. Such issues will be addressed later in the article, by hybridizing Enhanced iRPROP− with gradient-independent techniques and then constructing a global optimization method based on the hybrid.

3 Methodology

Two further enhancements of RPROP will be proposed with the aim of producing a new local optimization method, before being hybridized with a proposed form of Local Search to create a new global optimization method.

3.1 Enhancing the RPROP Algorithm.

3.1.1 The POWER Heuristic. Although RPROP is less sensitive to the choice of initial step sizes Δ_ij(0) than other learning algorithms, they must necessarily be small to avoid excessively large weights. Hence the exponential increases of RPROP due to equation 2.4 require a warm-up period before optimum step sizes are reached. This deficiency is overcome through a modification of equation 2.4, giving the following power-based heuristic, known as POWER:

    δ_ij(t) = δ_ij(t−1)^(1/η⁺),   if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) > 0
              η⁻ δ_ij(t−1),       if ∂E/∂w_ij(t−1) · ∂E/∂w_ij(t) < 0       (3.1)
              δ_ij(t−1),          otherwise,

    Δ_ij(t) = δ_ij(t) Δ_max,

where 0 < η⁻ < 1 < η⁺ and 0 < δ_ij(0) < 1.

Figure 1: The step size increase due to POWER is initially more rapid than the exponential increase of RPROP or the linear increase of Delta-Bar-Delta, resulting in a more responsive algorithm.

Any acceleration of the learning done by increasing the step size using equation 3.1 is more immediate than
in the standard RPROP heuristic, equation 2.4, or in the Delta-Bar-Delta heuristic, equation 2.2, resulting in a more responsive algorithm, since the adaptation of the step sizes is much more rapid, as illustrated in Figure 1. Therefore, RPROP enhanced by POWER is more robust than the standard RPROP, being even less sensitive to the choice of initial step sizes. Note that δ_ij(0) < 1 in order for the exponent of equation 3.1 to increase the step size. In fact, making use of the rapid adaptation, one can effectively eliminate δ_ij(0) as a heuristic parameter, since any positive value close to zero will suffice. A further benefit of using equation 3.1 is that over sustained periods, the step size will increase smoothly to a maximum given by Δ_max, unlike the sharp cut-offs that characterize the standard RPROP heuristic, equation 2.4.

3.1.2 The TRIG Heuristic. Using the fact that close to a local minimum the error function is assumed to be quadratic, the heuristic parameter η⁻ of equation 2.4 or 3.1 can be replaced by a parameter η⁻_ij(t), adapted at each iteration of RPROP by the following trigonometric-based heuristic, referred to as TRIG:

    η⁻_ij(t) = cos(θ(t−1)) / [cos(θ(t−1)) + cos(θ(t))],   if weight backtracking is used       (3.2)
               cos(θ(t)) / [cos(θ(t−1)) + cos(θ(t))],     otherwise,

    θ_ij(t) = tan⁻¹(∂E/∂w_ij(t)).                                 (3.3)
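A per-weight sketch of the two heuristics (an illustrative sketch only; the function names are ours, and the parameter values follow those quoted for iRPROP− with POWER in section 4):

```python
import math

def power_step(delta_frac, grad_prev, grad,
               eta_plus=1.05, eta_minus=0.5, delta_max=1.0):
    """POWER heuristic (equation 3.1) for one weight: the fraction
    delta_frac in (0, 1) is raised to the power 1/eta_plus while the
    gradient keeps its sign, so the step size delta_frac * delta_max
    rises quickly at first and then approaches delta_max smoothly."""
    if grad_prev * grad > 0:
        delta_frac = delta_frac ** (1.0 / eta_plus)
    elif grad_prev * grad < 0:
        delta_frac = eta_minus * delta_frac
    return delta_frac, delta_frac * delta_max

def trig_eta_minus(grad_prev, grad, backtracking=False):
    """TRIG heuristic (equations 3.2 and 3.3) for one weight: the fixed
    decrease factor eta_minus is replaced by one derived from the angles
    of the current and previous gradients."""
    theta_prev, theta = math.atan(grad_prev), math.atan(grad)
    denom = math.cos(theta_prev) + math.cos(theta)
    return (math.cos(theta_prev) if backtracking else math.cos(theta)) / denom
```

When the two heuristics are combined, the value returned by `trig_eta_minus` would replace the fixed `eta_minus` used on a sign change.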
If weight update Δw_ij traverses a local minimum, then the assumption of a quadratic shape of the error function close to the local minimum can be used to detect any increase or decrease in the error for weight w_ij, provided the step size is small enough to maintain the search in the associated basin of attraction. This idea is depicted in Figure 2 for the case when weight backtracking is used. An increase in the error occurs if ∂E/∂w_ij(t−1) < ∂E/∂w_ij(t) or, equivalently, if θ(t) > θ(t−1) by equation 3.3, as illustrated in the top diagram of Figure 2. A decrease in the error occurs if ∂E/∂w_ij(t−1) > ∂E/∂w_ij(t) or, equivalently, if θ(t) < θ(t−1) by equation 3.3, as indicated in the bottom diagram of Figure 2. This angular difference allows step-size decrements to be tailored to the local state of the error function by equation 3.2, depending on whether weight backtracking is used, and is a more flexible way of achieving effective convergence to the local minimum than using the fixed parameter η⁻.

3.2 Hybridizing Enhanced RPROP with Local Search.

If Enhanced RPROP (that is, RPROP with TRIG & POWER) fails to converge to a local minimum for a given problem, it is likely to be either oscillating about the minimum along several of the weight-specific gradients or stagnating in an effectively flat region of the error function. Provided such situations can be detected, hybridizing Enhanced RPROP with a gradient-independent Local Search (Aarts & Lenstra, 1997) can reinvigorate exploration of the weight space. A neural network can be trained by Local Search using the following weight update for each iteration t:

    W(t+1) = W(t) + ΔW(t),   if E(W(t) + ΔW(t)) < E(W(t))         (3.4)
             W(t),           otherwise,

    ΔW(t) ∈ N(W(t)),                                              (3.5)
where N(W(t)) is a region of weight space centered on W(t), known as the neighborhood of W(t). Note that since weight space is continuous, N(W(t)) must also be continuous. Key to the use of Local Search after a run of Enhanced RPROP is the choice of neighborhood. If Enhanced RPROP is oscillating, the neighborhood should be small enough in the relevant dimensions of weight space in order to guide the search toward a minimum in those dimensions. Alternatively, if Enhanced RPROP is stagnating, the neighborhood should be large enough in the relevant dimensions of weight space to give sufficient scope to the search. Both ideals can be achieved by using the step sizes Δ^F_ij given by the final iteration of Enhanced RPROP to construct a neighborhood,

    N(W(t)) = {W(t) + ΔW(t) : |Δw_ij| < Δ^F_ij for all i and j},
Figure 2: The assumption of a quadratic shape to the error function in the vicinity of a local minimum is used by TRIG to detect increases (top) or decreases (bottom) in error, resulting in faster convergence.
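The Local Search scheme of equations 3.4 and 3.5, with the neighborhood built from the final Enhanced RPROP step sizes Δ^F_ij, can be sketched as follows (a sketch only; uniform sampling of ΔW(t) from N(W(t)) is our assumption, as the sampling distribution is not specified):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_search_step(W, E, delta_F):
    """One Local Search iteration (equations 3.4 and 3.5): draw a candidate
    update from the neighborhood built from the final Enhanced RPROP step
    sizes delta_F, and accept it only if it lowers the error E."""
    dW = rng.uniform(-delta_F, delta_F)    # ΔW(t) ∈ N(W(t))
    if E(W + dW) < E(W):                   # equation 3.4: improving moves only
        return W + dW
    return W
```

Here W is the flattened weight vector and E the network error as a function of it; since only improving moves are accepted, the error is non-increasing over iterations.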
where Δw_ij is the (i, j)th entry of the weight update matrix ΔW(t). This provides an estimated basin of attraction of the local minimum.

The switch from Enhanced RPROP to the new version of Local Search, and the subsequent termination of the local optimization method, are controlled by criteria defined by Prechelt (1994). First, there is the relative increase of the validation error over the current minimum validation error, known as the generalization loss at iteration t:

    GL(t) = Eva(t) / min_{t′≤t} Eva(t′) − 1 = Eva(t) / Eopt(t) − 1,

where Etr(t) is the training error, Eva(t) is the validation error, and Eopt(t) is the lowest validation error up to iteration t. A high generalization loss is one reason to switch to Local Search or to terminate training, which occurs if

    GL(t) > (GL_j)_threshold                                      (3.6)

for j successive iterations t−j+1, …, t. Second, there is a measure of how much larger the average training error over k successive epochs is compared to the minimum training error for that period, known as the training progress at iteration t:

    TP_k(t) = [ Σ_{t′ ∈ {t−k+1, …, t}} Etr(t′) ] / [ k · min_{t′ ∈ {t−k+1, …, t}} Etr(t′) ] − 1.

This measure is high for unstable periods of training but is guaranteed to approach zero unless the whole training phase oscillates. Therefore, the switch to Local Search or termination of training occurs if

    TP_k(t) < (TP_k)_threshold.                                   (3.7)
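The two criteria are straightforward to compute from the recorded error histories (a sketch; the list-based bookkeeping is ours):

```python
def generalization_loss(val_errors):
    """GL(t): relative increase of the current validation error over the
    lowest validation error observed so far."""
    return val_errors[-1] / min(val_errors) - 1

def training_progress(train_errors, k=5):
    """TP_k(t): average training error over the last k epochs relative to
    the minimum training error in that window."""
    window = train_errors[-k:]
    return sum(window) / (k * min(window)) - 1
```

With the thresholds used in section 4, control switches to Local Search (or training stops) once GL(t) > 0.05 for j = 8 successive iterations, or TP_k(t) < 10⁻⁵ with k = 5.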
Combining these criteria with Enhanced RPROP & Local Search gives a hybrid method for local optimization that draws on the relative strengths of both gradient-based and gradient-independent techniques.

3.3 Tabu Neighborhoods for Global Optimization.

Existing global optimization strategies like the branch-and-bound algorithm are mostly confined to problems associated with discrete solution spaces, such as combinatorial optimization problems. Exceptions include hill climbing algorithms such as simulated annealing (Kirkpatrick & Gelatt, 1983). The global optimization strategy presented here involves the recursive application of the Enhanced RPROP & Local Search hybrid method, where the final solution of each recursion can be used to improve the search of subsequent recursions. This is achieved via an application of Tabu Search (Salhi & Queen, in
press), an extension of Local Search in which previous candidate updates are prevented from reselection by surrounding tabu regions that collectively form a tabu space:

    T = T ∪ N(ΔW(t)),   if E(W(t) + ΔW(t)) ≥ E(W(t))
        T,              otherwise,

and hence equation 3.5 is replaced by the following rule:

    ΔW(t) ∈ N(W(t)) \ T.

The weight update is the same as that of Local Search, namely equation 3.4. The first recursion of the hybrid method begins with the uniformly random selection of an initial solution from the initial weight space. Each time a final solution W^F generated by a recursion of the hybrid method has some chosen error measure E(W^F) worse than that of the current best final solution W^F_best, the corresponding initial solution W^I is surrounded by a tabu neighborhood N(W^I) subsequently incorporated into tabu space:

    T = T ∪ N(W^I),   if E(W^F) ≥ E(W^F_best)
        T,            otherwise.

Otherwise, W^F becomes the new best final solution:

    W^F_best = W^F,        if E(W^F) < E(W^F_best)
               W^F_best,   otherwise.

Initially, T = ∅ and E(W^F_best) ≫ 1, and subsequent initial solutions are chosen to satisfy

    W^I ∉ T.                                                      (3.8)

The idea of using Tabu Search in this context is to encourage exploration of new regions of weight space. Therefore, the tabu neighborhood N(W^I) is defined in such a way that a minimum of m neighborhoods provides a covering of the initial weight space. Suppose the initial weight space has volume V, and each tabu neighborhood N(W^I) is an n-dimensional hypercube subspace of the initial weight space, of side 2r and hence volume (2r)^n:

    N(W^I) = {W^I + ΔW^I : max_{i,j} |Δw^I_ij| < r},              (3.9)

where Δw^I_ij is the (i, j)th entry of the matrix ΔW^I. It is then evident that V ≥ m(2r)^n, which can be rearranged to give a recommended minimum for the parameter r in equation 3.9:

    r = (1/2)(V/m)^(1/n).                                         (3.10)
Figure 3: Main steps of the global optimization method.
Finally, if the initial weight space itself is defined as an n-dimensional hypercube of side 2R, then V = (2R)^n, and substituting this into equation 3.10 gives

    r = R / m^(1/n).                                              (3.11)
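Equations 3.8, 3.9, and 3.11 amount to only a few lines of code (a sketch; the function names and the tuple representation of weight vectors are ours):

```python
def tabu_radius(R, n, m):
    """Half side-length r of each tabu hypercube (equation 3.11):
    m hypercubes of side 2r tile an initial weight space of side 2R."""
    return R / m ** (1 / n)

def is_tabu(W_candidate, tabu_centres, r):
    """Equations 3.8 and 3.9: a candidate initial solution is tabu if it
    lies within Chebyshev distance r of any stored tabu centre W^I."""
    return any(max(abs(w - c) for w, c in zip(W_candidate, centre)) < r
               for centre in tabu_centres)
```

For the cancer1 task of section 4 (R = 0.5, n = 109, m = 1000), this gives r ≈ 0.47, so in high dimension each tabu hypercube is nearly as wide as the initial weight space itself in every coordinate.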
The methodology of the proposed global optimization method is summarized in Figure 3.

4 Computational Results

All algorithms were coded in the Java programming language and tested on two documented classification tasks involving the cancer1 and diabetes1 natural data sets taken from the UCI repository of machine learning databases (Prechelt, 1994). The results that are specifically commented on in this section have been found to be statistically significant by the nonparametric Mann-Whitney U test for unpaired data, taken at the 5% significance level.

4.1 Local Optimization.

The impact of the new heuristics POWER and TRIG on the behavior of RPROP is analyzed by using the best-performing
version of RPROP taken from Igel and Hüsken (2000): iRPROP−. Performances of QuickProp, Delta-Bar-Delta, Super SAB, iRPROP−, iRPROP− with POWER, and iRPROP− with TRIG & POWER are compared when solving the cancer1 and diabetes1 classification tasks. The parameter settings of the local optimization methods are:

• QuickProp: ε = 0.001, ν = 1.75
• Delta-Bar-Delta: ε_ij(0) = 0.01, μ = 0.5, κ = 0.01, η⁻ = 0.5, Δ_max = 50
• Super SAB: ε_ij(0) = 0.01, μ = 0.5, η⁺ = 1.2, η⁻ = 0.5, Δ_max = 50
• iRPROP−: Δ_ij(0) = 0.0125, η⁺ = 1.2, η⁻ = 0.5, Δ_max = 50
• iRPROP− with POWER: δ_ij(0) = 10⁻¹⁰, η⁺ = 1.05, η⁻ = 0.5, Δ_max = 1
• iRPROP− with TRIG & POWER: δ_ij(0) = 10⁻¹⁰, η⁺ = 1.05, Δ_max = 1
where those for iRPROP− are taken from Igel and Hüsken (2000), while the remainder have been found empirically. The experimental set-up follows that of Prechelt (1994). The architectures used for each task incorporate linear output units and hidden units with sigmoidal outputs of the following form:

    y_i = net_i / (1 + |net_i|),                                  (4.1)

whose output is in the range [−0.5, 0.5]. The cost function, equation 1.1, is a normalized form of sum-of-squares error, where E_p is defined as

    E_p = [(y_max − y_min)/(P·N)] Σ_{n∈N} (t^p_n − y^p_n)²,       (4.2)

where N is the set of output units and the targets t^p_n of outputs y^p_n are from the range [y_min, y_max]. Each local optimization method is subject to 100 trials, attempting to correctly classify as many of the examples as possible over the 3000 epochs of each trial. The speed of convergence in each trial is measured by recording the training epoch resulting in the lowest validation error. This is a standard early stopping criterion to ensure generalization of the learning and, hence, good neural network performance for examples other than those used for training. The accuracy of each trial is gauged by the percentage of examples correctly classified according to a 40%-20%-40% criterion due to Fahlman (1988). That is, an example is correctly classified if each target output of −1 corresponds to an actual output in the lower 40% of the output range and each target output of 1 corresponds to an actual output in the upper 40% of the output range. This criterion is used rather than some reciprocal of the normalized sum-of-squares error, equation 4.2, as it provides a better measure of how close a learning algorithm is to achieving perfect classification.
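Equation 4.1 and the 40%-20%-40% criterion can be sketched as follows (a sketch only; the target range [−1, 1] used as the default here is our assumption, matching the {1, −1} targets of the tasks below):

```python
def sigmoid(net):
    """Hidden-unit activation of equation 4.1."""
    return net / (1 + abs(net))

def correctly_classified(targets, outputs, y_min=-1.0, y_max=1.0):
    """Fahlman's 40%-20%-40% criterion: a target of -1 must map to the
    lower 40% of the output range and a target of 1 to the upper 40%;
    outputs falling in the middle 20% band count as misclassified."""
    lower = y_min + 0.4 * (y_max - y_min)   # top of the lower 40% band
    upper = y_max - 0.4 * (y_max - y_min)   # bottom of the upper 40% band
    return all((t == -1 and y < lower) or (t == 1 and y > upper)
               for t, y in zip(targets, outputs))
```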
4.1.1 Breast Cancer Classification Task. This task, labeled as cancer1 in Prechelt (1994) and used by Igel and Hüsken (2000) to compare iRPROP+ and iRPROP− with RPROP+ and RPROP−, is to decide whether a tumor is benign or malignant on the basis of cell data collated by microscopic examination (e.g., clump thickness, uniformity of cell size, cell shape, amount of marginal adhesion, frequency of bare nuclei). The associated data set, courtesy of William H. Wolberg at the University of Wisconsin Hospitals, consists of nine continuous input attributes, two output attributes (with a target of {1, −1} for each malignant tumor and {−1, 1} for each benign tumor), and 699 patients (65.5% of whom have benign tumors). As suggested by Prechelt (1994), the data set is sequentially split in a 50%-25%-25% ratio, giving 349 training, 175 validation, and 175 testing examples, respectively. Moreover, the 9-4-2-2 neural network topology (including all shortcut connections) recommended by Prechelt (1994) is used. Finally, the initial weights w_ij(0) are chosen from a weight space that is a 109-dimensional hypercube, the range of each dimension being [−0.5, 0.5].

The results are summarized in Figures 4 and 5. Figure 4 shows the mean percentage of correctly classified examples accompanied by error bars indicating standard deviation. It is evident that both QuickProp and Delta-Bar-Delta are less accurate than the other learning algorithms, while Super SAB compares favorably with iRPROP−. However, the addition of the POWER heuristic improves iRPROP−, enabling perfect training (since all training examples are correctly classified) without compromising the ability of the neural network to generalize (since there is no significant drop in the percentage of validation examples correctly classified). Furthermore, the inclusion of the TRIG heuristic does not severely impinge on the accuracy achieved by iRPROP− with POWER.

Figure 4: Accuracy of local optimization algorithms solving the cancer1 task.

Figure 5 shows the mean number of training epochs required to minimize the validation error. It is clear that QuickProp, Delta-Bar-Delta, and Super SAB have slower convergence than the other learning algorithms. A more crucial point is that there is no significant effect on the convergence speed of iRPROP− when POWER and TRIG are added.

Figure 5: Speed of local optimization algorithms solving the cancer1 task.

4.1.2 Diabetes Classification Task. This task, labeled as diabetes1 in Prechelt (1994) and again used by Igel and Hüsken (2000), is to decide whether a Pima Indian individual is diabetic given personal data (e.g., age, number of pregnancies) and medical data (e.g., blood pressure, body mass index, glucose tolerance). The data set consists of eight continuous input attributes, two output attributes (with a target of {1, −1} for each diabetic individual and {−1, 1} otherwise), and 768 individuals (34.9% of whom are diabetic). Once more, the data set is split in a 50%-25%-25% ratio to give 384 training, 192 validation, and 192 testing examples, respectively. Also, the 8-2-2-2 neural network topology (including all shortcut connections) proposed by Prechelt (1994) is used. Finally, the initial weights are chosen from a weight
space that is a 74-dimensional hypercube, the range of each dimension once again being [−0.5, 0.5].

The results are collated in Figures 6 and 7. Figure 6 shows once again that both QuickProp and Delta-Bar-Delta are less accurate than the other learning algorithms and that Super SAB compares well with iRPROP−. Again, the training accuracy of iRPROP− is improved by the inclusion of POWER and TRIG.

Figure 6: Accuracy of local optimization algorithms solving the diabetes1 task.

Figure 7 indicates that the inclusion of POWER and TRIG substantially improves the convergence speed of iRPROP−.

Figure 7: Speed of local optimization algorithms solving the diabetes1 task.

4.2 Global Optimization.

The impact of using Local Search in the global optimization method is analyzed by comparing the performances of Enhanced iRPROP− (that is, iRPROP− with TRIG & POWER) with Enhanced iRPROP− & Local Search when solving problems based on the cancer1 and diabetes1 data sets. For both tasks, the parameter settings for Enhanced iRPROP− are δ_ij(0) = 10⁻¹⁰, η⁺ = 1.05, and Δ_max = 1. The data sets are again divided using a 50%-25%-25% ratio, and the same architectures are used as in the local optimization experiments: a 9-4-2-2 architecture for the cancer1 task and an 8-2-2-2 architecture for the diabetes1 task. Both Enhanced iRPROP− and Enhanced iRPROP− & Local Search are subject to 1000 trials. For each trial of the hybrid, control switches to Local Search when the stagnation criteria are satisfied. Furthermore, the current trial finishes if the termination criteria are satisfied. For both tasks, the stagnation and termination criteria are given by Prechelt (1994), and both are satisfied if either the generalization loss (see equation 3.6) exceeds (GL_j)_threshold = 0.05 where j = 8, or the training progress (see equation 3.7) falls below (TP_k)_threshold = 10⁻⁵ where k = 5. For each trial, the percentages of correctly classified examples are recorded, along with the epoch at which termination occurs. In addition, a log is kept of the percentages of correctly classified examples for the best of the 1000 trials (the trial with the lowest validation error). Considering the number of weights and biases in the respective architectures, the initial weight space for the cancer1 task is a 109-dimensional hypercube and for the diabetes1 task a 74-dimensional hypercube. Since each dimension is in the range [−0.5, 0.5], it is clear that R = 0.5, and letting m = 1000 (the number of trials) in equation 3.11 allows the size of the tabu neighborhood hypercubes to be defined by r = 0.5/1000^(1/109) for the cancer1 task and r = 0.5/1000^(1/74) for the diabetes1 task. The results are summarized in Table 1.

Table 1 indicates that for both the cancer1 and diabetes1 tasks, the mean percentages of correctly classified examples are noticeably higher when Local Search is used. Therefore, the addition of Local Search is seen to improve the accuracy of each trial, without causing a significant increase in the mean number of epochs per trial. The relatively low percentages for the diabetes1 task are due to initial solutions in flat regions of the error function, causing
Table 1: Accuracy and Speed of Global Optimization Algorithms.

                            Breast Cancer Task                 Diabetes Task
Performance            Enhanced    Enhanced iRPROP−    Enhanced    Enhanced iRPROP−
Measures               iRPROP−     & Local Search      iRPROP−     & Local Search

Correctly classified examples (% to 4 significant figures)
Training     Mean        72.85         76.10             34.84         45.08
             SD          34.49         29.83             19.92         19.84
             Best        91.98         97.71             60.68         67.71
Validation   Mean        72.32         80.52             35.91         50.26
             SD          36.69         27.60             21.18         22.69
             Best        97.14         97.71             54.17         71.35
Testing      Mean        73.88         82.99             34.01         44.10
             SD          38.54         28.12             19.54         39.34
             Best        98.86         98.29             54.69         64.06
All          Mean        72.97         78.93             34.90         46.13
             SD          35.94         28.57             19.97         20.38
             Best        94.99         97.85             57.56         69.52

Number of epochs to convergence (to nearest epoch)
Per trial    Mean           93           104               175           180
             SD             27            31               360           395

Note: The figures in boldface type are the best results found by the hybrids.
early stagnation of Enhanced iRPROP−, though this is tempered by the use of Local Search. Finally, the best percentages of correctly classified examples for both tasks are mostly higher when Local Search is used.
5 Conclusion

Two proposed enhancements of the standard RPROP algorithm have been incorporated into iRPROP− and tested using classification tasks based on the cancer1 and diabetes1 natural data sets taken from the UCI repository of machine learning databases. The POWER heuristic improves the accuracy and the convergence speed of iRPROP− in solving both classification tasks without much increase in algorithm complexity, and the TRIG heuristic eliminates a fixed parameter of iRPROP− without impinging on performance. Moreover, hybridizing Enhanced iRPROP− with Local Search leads to further advancements in terms of accuracy when incorporated as part of the proposed global optimization method. Together, the new local and global optimization methods provide an efficient supervised learning approach for training neural networks to solve real-world problems.
Adaptive Hybrid Learning for Neural Networks
157
Acknowledgments

We thank both referees for their constructive suggestions, which improved the content and the presentation of this article, and also EPSRC for sponsoring R.S.

References

Aarts, E., & Lenstra, J. (1997). Local search in combinatorial optimization. New York: Wiley.
Fahlman, S. (1988). An empirical study of learning speed in back-propagation networks (Tech. Rep. No. CMU-CS-88-162). Pittsburgh, PA: Carnegie-Mellon University.
Igel, C., & Hüsken, M. (2000). Improving the Rprop learning algorithm. In Proceedings of the Second International Symposium on Neural Computation, NC2000 (pp. 115–121). Berlin: ICSC Academic Press.
Jacobs, R. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1, 295–307.
Kirkpatrick, S., Gelatt, C., & Vecchi, M. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Møller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525–533.
Prechelt, L. (1994). PROBEN1—A set of benchmarks and benchmarking rules for neural network training algorithms (Tech. Rep. No. 21/94). Karlsruhe: Fakultät für Informatik, Universität Karlsruhe.
Riedmiller, M. (1994). Advanced supervised learning in multi-layer perceptrons—from backpropagation to adaptive learning algorithms. Computer Standards and Interfaces, 16, 265–278.
Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks (ICNN) (Vol. 16, pp. 586–591). Piscataway, NJ: IEEE.
Rumelhart, D., & McClelland, J. (1986). Parallel distributed processing. Cambridge, MA: MIT Press.
Salhi, S., & Queen, N. (in press). Determining local and global minima for functions with multiple minima. European Journal of Operational Research.
Schiffman, W., Joost, M., & Werner, R. (1993). Optimization of the backpropagation algorithm for training multilayer perceptrons (Tech. Rep.). Koblenz: University of Koblenz, Institute of Physics.
Silva, F., & Almeida, L. (1990). Advanced neural computers. Amsterdam: North-Holland.
Tollenaere, T. (1990). SuperSAB: Fast adaptive backpropagation with good scaling properties. Neural Networks, 3, 561–573.

Received January 2, 2003; accepted June 5, 2003.
Communicated by Kenji Fukumizu
LETTER
Divergence Function, Duality, and Convex Analysis Jun Zhang
[email protected]
Department of Psychology, University of Michigan, Ann Arbor, MI 48109, U.S.A.
From a smooth, strictly convex function $\Phi: \mathbb{R}^n \to \mathbb{R}$, a parametric family of divergence functions $D_\Phi^{(\alpha)}$ may be introduced:

$$D_\Phi^{(\alpha)}(x, y) = \frac{4}{1-\alpha^2}\left(\frac{1-\alpha}{2}\,\Phi(x) + \frac{1+\alpha}{2}\,\Phi(y) - \Phi\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right)\right)$$

for $x, y \in \operatorname{int\,dom}(\Phi) \subset \mathbb{R}^n$ and for $\alpha \in \mathbb{R}$, with $D_\Phi^{(\pm 1)}$ defined through taking the limit of $\alpha$. Each member is shown to induce an $\alpha$-independent Riemannian metric, as well as a pair of dual $\pm\alpha$-connections, which are generally nonflat, except for $\alpha = \pm 1$. In the latter case, $D_\Phi^{(\pm 1)}$ reduces to the (nonparametric) Bregman divergence, which is representable using $\Phi$ and its convex conjugate $\Phi^*$ and becomes the canonical divergence for dually flat spaces (Amari, 1982, 1985; Amari & Nagaoka, 2000). This formulation based on convex analysis naturally extends the information-geometric interpretation of divergence functions (Eguchi, 1983) to allow the distinction between two different kinds of duality: referential duality ($\alpha \leftrightarrow -\alpha$) and representational duality ($\Phi \leftrightarrow \Phi^*$). When applied to (not necessarily normalized) probability densities, the concept of conjugate representations of densities is introduced, so that $\pm\alpha$-connections defined on probability densities embody both referential and representational duality and are hence themselves bidual. When restricted to a finite-dimensional affine submanifold, the natural parameters of a certain representation of densities and the expectation parameters under its conjugate representation form biorthogonal coordinates. The alpha representation (indexed by $\beta$ now, $\beta \in [-1, 1]$) is shown to be the only measure-invariant representation. The resulting two-parameter family of divergence functionals $D^{(\alpha,\beta)}$, $(\alpha, \beta) \in [-1, 1] \times [-1, 1]$, induces identical Fisher information but bidual alpha-connection pairs; it reduces in form to Amari's alpha-divergence family when $\alpha = \pm 1$ or when $\beta = 1$, but to the family of Jensen difference (Rao, 1987) when $\beta = -1$.

Neural Computation 16, 159–195 (2004)

© 2003 Massachusetts Institute of Technology
160
J. Zhang
1 Introduction

Divergence functions play an important role in many areas of neural computation like learning, optimization, estimation, and inference. Roughly, they measure the directed (asymmetric) difference of two probability density functions in the infinite-dimensional functional space, or of two points in a finite-dimensional vector space that defines the parameters of a statistical model. An example is the Kullback-Leibler (KL) divergence (cross-entropy) between two probability densities p and q, here expressed in its extended form (i.e., without requiring $p, q$ to be normalized),

$$K(p, q) = \int \left( q - p - p \log \frac{q}{p} \right) d\mu = K^*(q, p), \tag{1.1}$$
which reaches the unique, global minimum value of zero on the diagonal of the product manifold (i.e., $p = q$). Many learning algorithms and/or proofs of their convergence rely on properties of the KL divergence; the common ones are the Boltzmann machine (Ackley, Hinton, & Sejnowski, 1985; Amari, 1991; Amari, Kurata, & Nagaoka, 1992), the em algorithm and its comparison with the EM algorithm (Amari, 1995), and the wake-sleep algorithm of the Helmholtz machine (Ikeda, Amari, & Nakahara, 1999). Another class of divergence functions widely used in the optimization and convex programming literature is the so-called Bregman divergence (see below). It plays an essential role in unifying the class of projection and alternating minimization algorithms (Lafferty, Della Pietra, & Della Pietra, 1997; Della Pietra, Della Pietra, & Lafferty, 2002; Bauschke & Combettes, 2002; Bauschke, Borwein, & Combettes, 2002). Parametric families of Bregman divergence were used in blind source separation (Mihoko & Eguchi, 2002) and for boosting machines (Lebanon & Lafferty, 2002; Eguchi, 2002). Divergence function or functional¹ is an essential subject in information geometry, the differential geometric study of the manifold of (parametric or nonparametric) probability distributions (Amari, 1982, 1985; Eguchi, 1983, 1992; Amari & Nagaoka, 2000). As first demonstrated by Eguchi (1983), a well-defined divergence function (also called a contrast function) with vanishing first-order (in the vicinity of $p = q$) term will induce a Riemannian metric g by its second-order properties and a pair of dual (also called conjugate) connections $(\Gamma, \Gamma^*)$ by its third-order properties, where the dual connections jointly preserve the metric under parallel transport. A manifold

¹ Strictly speaking, when the underlying space is a finite-dimensional vector space, for example, that of parameters for a statistical or neural network model, then the term function is appropriate. However, if the underlying space is an infinite-dimensional function space, for example, that of nonparametric probability densities, then the term functional ought to be used. The latter implicitly defines a divergence function (through pullback) if the probability densities are embedded into a finite-dimensional space as a statistical model.
endowed with $\{g, \Gamma, \Gamma^*\}$ is known as a statistical manifold; conditions for its affine realization through its embedding into a higher-dimensional space have been clarified (Kurose, 1994; Matsuzoe, 1998, 1999; Uohashi, Ohara, & Fujii, 2000).

1.1 Alpha-, Bregman, and Csiszár's f-Divergence and Their Relations. Amari (1982, 1985) introduced and investigated an important parametric family of divergence functionals, called α-divergence,²
$$A^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2} \int \left( \frac{1-\alpha}{2}\,p + \frac{1+\alpha}{2}\,q - p^{\frac{1-\alpha}{2}}\, q^{\frac{1+\alpha}{2}} \right) d\mu, \qquad \alpha \in \mathbb{R}. \tag{1.2}$$
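For a discrete sample space, equation 1.2 can be evaluated directly. The following sketch (Python assumed; the function names are illustrative, not from the article) computes the extended α-divergence and its α → −1 limit, the extended KL divergence of equation 1.1.

```python
import math

def alpha_divergence(p, q, alpha):
    """Extended alpha-divergence A^(alpha)(p, q) of equation 1.2, with the
    integral replaced by a sum over a discrete sample space; p and q are
    positive weight vectors, not necessarily normalized (alpha != ±1)."""
    c = 4.0 / (1.0 - alpha**2)
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return c * sum(a * pi + b * qi - pi**a * qi**b for pi, qi in zip(p, q))

def extended_kl(p, q):
    """K(p, q) of equation 1.1, the alpha -> -1 limit of A^(alpha)(p, q)."""
    return sum(qi - pi - pi * math.log(qi / pi) for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.4]   # deliberately unnormalized
q = [0.3, 0.3, 0.5]
```

Taking α close to −1 numerically reproduces `extended_kl(p, q)`, and swapping both the sign of α and the order of the arguments leaves the value unchanged, previewing the duality discussed below.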
The α-divergence, which specializes to $K(p, q)$ for $\alpha = -1$ and to $K^*(p, q)$ for $\alpha = 1$ (by taking the limit of α), induces on the statistical manifold the family of α-connections (Chentsov, 1982; Amari, 1982). The duality between $\alpha \leftrightarrow -\alpha$ is reflected in that the $\pm\alpha$-connections form a pair of dual connections that jointly preserve the metric, and that an α-connection is flat if and only if the $(-\alpha)$-connection is flat (Amari, 1985; Lauritzen, 1987). As a special case, the exponential family ($\alpha = 1$) and the mixture family ($\alpha = -1$) of densities generate dually flat statistical manifolds. Alpha divergence is a special case of the measure-invariant f-divergence introduced by Csiszár (1967), which is associated with any convex function $f_c: \mathbb{R}^+ \to \bar{\mathbb{R}}^+$ satisfying $f_c(1) = 0$, $f_c'(1) = 0$:

$$F_{f_c}(p, q) = \int p\, f_c\!\left(\frac{q}{p}\right) d\mu, \tag{1.3}$$

where $\bar{\mathbb{R}}^+ \equiv \mathbb{R}^+ \cup \{0\}$. This can be seen by $f_c$ taking the following family of convex functions,³ indexed by a parameter α:

$$f^{(\alpha)}(t) = \frac{4}{1-\alpha^2}\left(\frac{1-\alpha}{2} + \frac{1+\alpha}{2}\,t - t^{\frac{1+\alpha}{2}}\right), \qquad \alpha \in \mathbb{R}. \tag{1.4}$$
² This form of α-divergence first appeared in Zhu and Rohwer (1995, 1997), where it was called the δ-divergence, $\delta = (1-\alpha)/2$. The term $\frac{1-\alpha}{2}\,p + \frac{1+\alpha}{2}\,q$ in equation 1.2 is, after integration, simply 1 for normalized densities; this was how Amari (1982, 1985) introduced the α-divergence as a specific family of Csiszár's f-divergence. See note 3.

³ Note that this form differs slightly from the function given in Amari (1985) and Amari and Nagaoka (2000) by the additional term $\frac{1-\alpha}{2} + \frac{1+\alpha}{2}\,t$. This term is needed for the form of α-divergence expressed in equation 1.2, which is "extended" from the original definition given in Amari (1982, 1985) to allow denormalized densities, in much the same way that the extended Kullback-Leibler divergence (see equation 1.1) differs from its original form (without the $p - q$ or $q - p$ term). This additional term in $f^{(\alpha)}$ allows the condition $f^{(\alpha)}(1) = 0$ to be satisfied. It also provides a unified treatment of the $\alpha = \pm 1$ cases, since $\lim_{\alpha \to 1} f^{(\alpha)}(t) = 1 - t + t \log t$ and $\lim_{\alpha \to -1} f^{(\alpha)}(t) = t - 1 - \log t$.
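That $f^{(\alpha)}$ generates the α-divergence through equation 1.3 can be checked numerically. A minimal Python sketch with illustrative names, assuming a discrete sample space so that the integral becomes a sum:

```python
def f_divergence(p, q, fc):
    """Csiszar f-divergence F_fc(p, q) of equation 1.3 on a discrete sample space."""
    return sum(pi * fc(qi / pi) for pi, qi in zip(p, q))

def f_alpha(alpha):
    """The convex function f^(alpha) of equation 1.4 (alpha != ±1)."""
    c = 4.0 / (1.0 - alpha**2)
    return lambda t: c * ((1.0 - alpha) / 2.0 + (1.0 + alpha) / 2.0 * t
                          - t**((1.0 + alpha) / 2.0))

p = [0.2, 0.5, 0.4]   # unnormalized positive weights
q = [0.3, 0.3, 0.5]
# p * f^(alpha)(q / p) expands term by term to the integrand of equation 1.2,
# so f_divergence(p, q, f_alpha(alpha)) equals the alpha-divergence A^(alpha)(p, q).
```

Note also that $f^{(\alpha)}(1) = 0$ holds exactly, which is the normalization the footnote's additional term provides.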
Eguchi (1983) showed that any f-divergence induces a statistical manifold with a metric proportional to Fisher information, with the constant of proportionality $f_c''(1)$, and an equivalent α-connection,

$$\alpha = 3 + 2\, f_c'''(1)/f_c''(1). \tag{1.5}$$
We note in passing that for a general, smooth, and strictly convex function $f: \mathbb{R} \to \mathbb{R}$, we can always induce a measure-invariant divergence by using $f_c(t) = g(t)$ in equation 1.3, where

$$g(t) \equiv f(t) - f(1) - f'(1)(t - 1). \tag{1.6}$$
That the right-hand side of the above is nonnegative can be easily proved by showing that $t = 1$ is a global minimum with $g(1) = g'(1) = 0$. Another kind of divergence function, called Bregman divergence, is defined for any two points $x = [x^1, \ldots, x^n]$, $y = [y^1, \ldots, y^n]$ in an n-dimensional vector space $\mathbb{R}^n$ (Bregman, 1967). It is induced by a smooth and strictly convex function $\Phi: \mathbb{R}^n \to \mathbb{R}$:

$$B_\Phi(x, y) = \Phi(y) - \Phi(x) - \langle y - x, \nabla\Phi(x) \rangle, \tag{1.7}$$

where $\nabla$ is the gradient (or, more strictly, subdifferential $\partial\Phi$ if the differentiability condition is removed) operator and $\langle \cdot, \cdot \rangle$ denotes the standard inner product of two vectors. It is also called (actually is proportional to) the geometric divergence (Kurose, 1994), proposed in the context of affine realization of a statistical manifold. Bregman divergence $B_\Phi(x, y)$ specializes to the KL divergence upon setting $\Phi(x) = \sum_i e^{x^i}$, introducing new variables $x^i = \log p^i$, $y^i = \log q^i$, and changing $\int d\mu$ into $\sum_i$. More generally, as observed by Eguchi (2002), Csiszár's f-divergence (see equation 1.3) is naturally related (but not identical) to Bregman divergence (see equation 1.7), having taken $\Phi(x) = \sum_i f(x^i)$ with $y^i = q^i/p^i$ and $x^i = 1$. In this case (with a slight abuse of notation),

$$F_f(p, q) = \sum_i p^i\, B_f\!\left(\frac{q^i}{p^i}, 1\right).$$

It is now known (Kass & Vos, 1997) that Bregman divergence is essentially the canonical divergence (Amari & Nagaoka, 2000) on a dually flat manifold equipped with a pair of biorthogonal coordinates induced from a pair of "potential functions" under the Legendre transform (Amari, 1982, 1985). It is a distance-like measure on a finite-dimensional Riemannian manifold that is essentially flat, and is very different from the α-divergence (see equation 1.2), which is defined over the space of positively valued, infinite-dimensional functions on the sample space (i.e., positive measures) and is generally nonflat. However, if the positive measures p and q can be affinely embedded
into some finite-dimensional submanifold, the Legendre potentials for α-divergence could exist. Technically, this corresponds to the so-called α-affine manifold, where the embedded α-representation of the densities ($\alpha \in \mathbb{R}$),

$$l^{(\alpha)}(p) = \begin{cases} \log p & \alpha = 1 \\ \dfrac{2}{1-\alpha}\, p^{(1-\alpha)/2} & \text{else,} \end{cases} \tag{1.8}$$

can be expressed as a linear combination of a countable set of basis functions of the infinite-dimensional functional space (the definition of α-affinity). If and only if such embedding is possible for a certain value of α, a potential function (and its dual) can be found so that equation 1.2 becomes 1.7. In general, Bregman divergence and α-divergence are very different in terms of both the dimensionality and the flatness of the underlying manifold on which they are defined, though both may induce dual connections. Given the fundamental importance of α-connections in information geometry, it is natural to ask whether the parameter α may arise other than from the context of α-embedding of density functions. How is the $\alpha \leftrightarrow -\alpha$ duality related to the pair of biorthogonal coordinates and the canonical divergence they define? Does there exist an even more general expression of divergence functions that would include the α-divergence, the Bregman divergence, and the f-divergence as special cases yet would still induce the dual $\pm\alpha$-connections? The existence of a divergence function on a statistical manifold given the Riemannian metric and a pair of dual, torsion-free connections was answered positively by Matumoto (1993). Here, the interest is to find explicit forms for such general divergence functions, in particular, measure-invariant ones. The goal of this article is to introduce a unifying perspective for the α-, Bregman, and Csiszár's f-divergence as arising from certain fundamental convex inequalities and to clarify two different senses of duality embodied by divergence functions and functionals and the statistical manifolds they define: referential duality and representational duality.
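The Bregman divergence of equation 1.7, and its KL specialization noted earlier, can be sketched in a few lines of Python (names illustrative; the exponential Φ and the log change of variables follow the text):

```python
import math

def bregman(phi, grad_phi, x, y):
    """B_Phi(x, y) = Phi(y) - Phi(x) - <y - x, grad Phi(x)>  (equation 1.7)."""
    g = grad_phi(x)
    return phi(y) - phi(x) - sum((yi - xi) * gi for xi, yi, gi in zip(x, y, g))

# Specialization noted in the text: Phi(x) = sum_i exp(x^i) with x^i = log p^i,
# y^i = log q^i turns B_Phi into the extended KL divergence of equation 1.1.
phi_exp  = lambda v: sum(math.exp(vi) for vi in v)
grad_exp = lambda v: [math.exp(vi) for vi in v]

p, q = [0.2, 0.5, 0.4], [0.3, 0.3, 0.5]
x = [math.log(pi) for pi in p]
y = [math.log(qi) for qi in q]
kl = sum(qi - pi - pi * math.log(qi / pi) for pi, qi in zip(p, q))
```

Here `bregman(phi_exp, grad_exp, x, y)` agrees with `kl` to machine precision, and vanishes exactly at `x == y`.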
Starting from the definition of a smooth, strictly convex function $\Phi: \mathbb{R}^n \to \mathbb{R}$, a parametric family of divergence $D_\Phi^{(\alpha)}(x, y)$, $\alpha \in \mathbb{R}$, over points $x, y \in S = \operatorname{int\,dom}(\Phi)$ can be introduced that will be shown to induce a single Riemannian metric with a parametric family of affine connections indexed by α, the convex mixture parameter. These α-connections are nonflat unless $\alpha = \pm 1$, when $D_\Phi^{(\pm 1)}$ turns out to be the Bregman divergence, which can be cast into the canonical form using a pair of convex conjugate functions $\Phi$ and $\Phi^*$ (Amari's potential functions) that obey Legendre-Fenchel duality. The biorthogonal coordinates x and u, originally introduced for a dually flat manifold, can now be extended to define the divergence function on any nonflat manifold ($\alpha \ne \pm 1$) as well. A distinction is drawn between two kinds of duality ("biduality") of statistical manifolds, in the sense of mutual reference $x \leftrightarrow y$, as reflected by the $\alpha \leftrightarrow -\alpha$ duality, and in the sense of conjugate representations $u = \nabla\Phi(x) \leftrightarrow x = (\nabla\Phi^*)(u) = (\nabla\Phi)^{-1}(u)$,
as reflected by the $\Phi \leftrightarrow \Phi^*$ duality. In the infinite-dimensional case, representational duality is achieved through conjugate representations of any (not necessarily normalized) density function; here, conjugacy is with respect to a strictly convex function defined on the real line, $f: \mathbb{R} \to \mathbb{R}$. Our notion of conjugate representations of density functions, which generalizes the notion of alpha representation (see equation 1.8), proves to be useful in characterizing the affine embedding of a density function into a finite-dimensional submanifold; the natural and expectation parameters become the pair of biorthogonal coordinates, and this case completely reduces to the one discussed earlier. Of particular practical importance is that our analysis provides a two-parameter family of measure-invariant divergence function(al)s $D^{(\alpha,\beta)}(p, q)$ under the alpha representation $l^{(\beta)}(p)$ of densities (indexed by β now, $\beta \in [-1, 1]$, with α reserved to index the convex mixture), which induce an identical Fisher information metric and a family of alpha-connections where the product αβ serves as the "alpha" parameter. The two indices themselves, $(\alpha, \beta) \in [-1, 1] \times [-1, 1]$, precisely express referential duality and representational duality, respectively. Interestingly, at the level of divergence functionals, $D^{(\alpha,\beta)}$ turns out to be the family of alpha divergence for $\alpha = \pm 1$ or for $\beta = 1$, and the family of "Jensen difference" (Rao, 1987) for $\beta = -1$. Thus, the Kullback-Leibler divergence, the one-parameter families of α-divergence and of Jensen difference, and the two-parameter family of $(\alpha, \beta)$-divergence form nested families of measure-invariant divergence function(al)s associated with the same statistical manifold studied in classical information geometry.

2 Divergence on Finite-Dimensional Parameter Space

In this section, we consider the finite-dimensional vector space $\mathbb{R}^n$, or a convex subset thereof, that defines the parameters of a probabilistic (e.g., neural network) model.
The goal is to introduce, with the help of an arbitrary strictly convex function $\Phi: \mathbb{R}^n \to \mathbb{R}$, a family of asymmetric, nonnegative measures between two points in such a space, called divergence functions (see proposition 1), and through which to induce a well-defined statistical manifold with a Riemannian metric and a pair of dual connections (see proposition 2). The procedure used for linking a divergence function(al) to the underlying statistical manifold is due to Eguchi (1983); our notion of referential duality is reflected in the construction of dual connections. The notion of representational duality is introduced through equation 2.15 and proposition 5, based on the convex conjugacy operation (see remark 2.3.1). Biduality is thus shown to be the fundamental property of a statistical manifold induced by the family of divergence functions based on a convex function Φ.

2.1 Convexity and Divergence Functions. Consider the n-dimensional vector space (e.g., the parameter space in the case of parametric probability
density functions or neural network models). A set $S \subseteq \mathbb{R}^n$ is called convex if for any two points $x = [x^1, \ldots, x^n] \in S$, $y = [y^1, \ldots, y^n] \in S$ and any real number $\alpha \in [-1, 1]$, the convex mixture

$$\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y \in S,$$

that is, the line segment connecting the two points x and y, belongs to the set S. A strictly convex function of several variables, $\Phi(x)$, is a function defined on a nonempty convex set $S = \operatorname{int\,dom}(\Phi) \subseteq \mathbb{R}^n$ such that for any two points $x \in S$, $y \in S$ and any real number $\alpha \in (-1, 1)$, the following,

$$\Phi\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right) \le \frac{1-\alpha}{2}\,\Phi(x) + \frac{1+\alpha}{2}\,\Phi(y), \tag{2.1}$$
is valid, with equality holding only when $x = y$. Equation 2.1 will sometimes be referred to as the fundamental convex inequality below. Intuitively, the difference between the left-hand side and the right-hand side of equation 2.1 depends on some kind of proximity between the two points x and y in question, as well as on the degree of convexity (loosely speaking) of the function Φ. For convenience, Φ is assumed to be differentiable up to third order, though this condition can be slightly relaxed to the class of so-called essentially smooth and essentially strictly convex functions, or convex functions of the Legendre type (Rockafellar, 1970), without affecting much of the subsequent analysis. Note that for $\alpha = \pm 1$, the equality in equation 2.1 holds uniformly for all $x, y$; for $\alpha \ne \pm 1$, the equality will not hold uniformly unless $\Phi(x)$ is the linear function $\langle a, x \rangle + b$ with a a constant vector and b a scalar.

Proposition 1. For any smooth, strictly convex function $\Phi: \mathbb{R}^n \to \mathbb{R}$, $x \mapsto \Phi(x)$, and $\alpha \in \mathbb{R}$, the function

$$D_\Phi^{(\alpha)}(x, y) = \frac{4}{1-\alpha^2}\left(\frac{1-\alpha}{2}\,\Phi(x) + \frac{1+\alpha}{2}\,\Phi(y) - \Phi\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right)\right) \tag{2.2}$$

with

$$D_\Phi^{(\pm 1)}(x, y) = \lim_{\alpha \to \pm 1} D_\Phi^{(\alpha)}(x, y) \tag{2.3}$$

is a parametric family of nonnegative functions of $x, y$ that equal zero if and only if $x = y$. Here, the points $x, y$, and $z = \frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y$ are all assumed to belong to $S = \operatorname{int\,dom}(\Phi)$.
Proof. Clearly, for any $\alpha \in (-1, 1)$, $1 - \alpha^2 > 0$, so from equation 2.1 the functions $D_\Phi^{(\alpha)}(x, y) \ge 0$ for all $x, y \in S$, with equality holding if and only if $x = y$. When $\alpha > 1$, we rewrite $y = \frac{2}{\alpha+1}\,z + \frac{\alpha-1}{\alpha+1}\,x$ as a convex mixture of z and x (i.e., $\frac{2}{\alpha+1} = \frac{1-\alpha'}{2}$, $\frac{\alpha-1}{\alpha+1} = \frac{1+\alpha'}{2}$ with $\alpha' \in (-1, 1)$). Strict convexity of Φ guarantees

$$\frac{2}{\alpha+1}\,\Phi(z) + \frac{\alpha-1}{\alpha+1}\,\Phi(x) \ge \Phi(y),$$

or explicitly

$$\frac{2}{1+\alpha}\left(\frac{1-\alpha}{2}\,\Phi(x) + \frac{1+\alpha}{2}\,\Phi(y) - \Phi\!\left(\frac{1-\alpha}{2}\,x + \frac{1+\alpha}{2}\,y\right)\right) \le 0.$$

This, along with $1 - \alpha^2 < 0$, proves the nonnegativity $D_\Phi^{(\alpha)}(x, y) \ge 0$ for $\alpha > 1$, with equality holding if and only if $z = x$, that is, $x = y$. The case of $\alpha < -1$ is similarly proved by applying equation 2.1 to the three points $y, z$ and their convex mixture $x = \frac{2}{1-\alpha}\,z + \frac{-1-\alpha}{1-\alpha}\,y$. Finally, continuity of $D_\Phi^{(\alpha)}(x, y)$ with respect to α guarantees that the above claim is also valid in the case of $\alpha = \pm 1$. ∎

Remark 2.1.1. These functions are asymmetric, $D_\Phi^{(\alpha)}(x, y) \ne D_\Phi^{(\alpha)}(y, x)$ in general, but satisfy the dual relation

$$D_\Phi^{(\alpha)}(x, y) = D_\Phi^{(-\alpha)}(y, x). \tag{2.4}$$
Therefore $D_\Phi^{(\alpha)}$, as parameterized by $\alpha \in \mathbb{R}$ for a fixed Φ, properly forms a family of divergence functions (also known as deviations or contrast functions) in the sense of Eguchi (1983, 1992), Amari (1982, 1985), Kass & Vos (1997), and Amari & Nagaoka (2000).
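Proposition 1 and the dual relation 2.4 are easy to probe numerically. A Python sketch with illustrative names, using a quadratic Φ — for which a short calculation shows $D_\Phi^{(\alpha)}(x, y) = \|x - y\|^2$ for every α, a convenient closed form for testing nonnegativity at α both inside and outside $[-1, 1]$:

```python
import math

def D(phi, x, y, alpha):
    """D_Phi^(alpha)(x, y) of equation 2.2 (alpha != ±1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    z = [a * xi + b * yi for xi, yi in zip(x, y)]
    return 4 / (1 - alpha**2) * (a * phi(x) + b * phi(y) - phi(z))

quad = lambda v: sum(vi * vi for vi in v)        # strictly convex Phi; here D = |x - y|^2
expo = lambda v: sum(math.exp(vi) for vi in v)   # another convex choice, for eq. 2.4

x0, y0 = [1.0, 2.0], [3.0, 0.0]
```

The dual relation $D_\Phi^{(\alpha)}(x, y) = D_\Phi^{(-\alpha)}(y, x)$ holds for any convex Φ, quadratic or not.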
Example 2.1.2. Take the negative of the entropy function, $\Phi(x) = \sum_i x^i \log x^i$, which is easily seen to be convex. Then equation 2.2 becomes

$$\frac{4}{1-\alpha^2} \sum_i \left( \frac{1-\alpha}{2}\, x^i \log \frac{x^i}{\frac{1-\alpha}{2}\,x^i + \frac{1+\alpha}{2}\,y^i} + \frac{1+\alpha}{2}\, y^i \log \frac{y^i}{\frac{1-\alpha}{2}\,x^i + \frac{1+\alpha}{2}\,y^i} \right) \equiv E^{(\alpha)}(x, y), \tag{2.5}$$

a family of divergence measures called the Jensen difference (Rao, 1987), apart from the $\frac{4}{1-\alpha^2}$ factor. The Kullback-Leibler divergence, equation 1.1, is recovered by letting $\alpha \to \pm 1$ in $E^{(\alpha)}(x, y)$.
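A numerical sketch of equation 2.5 (Python, illustrative names), checking the dual relation 2.4 and the stated KL limit on unnormalized positive vectors:

```python
import math

def jensen_difference(x, y, alpha):
    """E^(alpha)(x, y) of equation 2.5 for positive vectors (alpha != ±1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = 0.0
    for xi, yi in zip(x, y):
        zi = a * xi + b * yi
        total += a * xi * math.log(xi / zi) + b * yi * math.log(yi / zi)
    return 4 / (1 - alpha**2) * total

p = [0.2, 0.5, 0.4]
q = [0.3, 0.3, 0.5]
kl = sum(qi - pi - pi * math.log(qi / pi) for pi, qi in zip(p, q))
# As alpha -> 1, E^(alpha)(p, q) approaches the extended KL divergence K(p, q).
```

Evaluating at α close to 1 reproduces `kl` to good accuracy, while $E^{(\alpha)}(p, q) = E^{(-\alpha)}(q, p)$ holds exactly.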
Example 2.1.3. Take $\Phi(x) = \sum_i e^{x^i}$ while denoting $x^i = \log p^i$, $y^i = \log q^i$. Then $D_\Phi^{(\alpha)}(x, y)$ becomes the α-divergence $A^{(\alpha)}(p, q)$ in its discrete version.

2.2 Statistical Manifold Induced by $D_\Phi^{(\alpha)}$. The divergence function $D_\Phi^{(\alpha)}(x, y)$ provides a quantitative measure of the asymmetric (directed) distance of a comparison point y as measured from a fixed reference point x. Although this function is globally defined for x, y at large, information geometry provides a standard technique, due to Eguchi (1983), to investigate the differential geometric structure induced on S from any divergence function, through taking $\lim_{x \to x_0,\, y \to x_0} D_\Phi^{(\alpha)}(x, y)$. The most important geometric objects on a differential manifold are the Riemannian metric g and the affine connection Γ. The metric tensor fixes the inner product operation on the manifold, whereas the affine connection establishes the affine correspondence among neighboring tangent spaces and defines covariant differentiation.
Proposition 2. The statistical manifold $\{S, g, \Gamma^{(\alpha)}, \Gamma^{*(\alpha)}\}$ associated with $D_\Phi^{(\alpha)}$ is given (in component form) by

$$g_{ij} = \Phi_{ij} \tag{2.6}$$

and

$$\Gamma_{ij,k}^{(\alpha)} = \frac{1-\alpha}{2}\,\Phi_{ijk}, \qquad \Gamma_{ij,k}^{*(\alpha)} = \frac{1+\alpha}{2}\,\Phi_{ijk}. \tag{2.7}$$
Here, $\Phi_{ij}$ and $\Phi_{ijk}$ denote, respectively, the second and third partial derivatives of $\Phi(x)$:

$$\Phi_{ij} = \frac{\partial^2 \Phi(x)}{\partial x^i \partial x^j}, \qquad \Phi_{ijk} = \frac{\partial^3 \Phi(x)}{\partial x^i \partial x^j \partial x^k}.$$
Proof. Assuming Fréchet differentiability of Φ, we calculate the Taylor expansion of $D_\Phi^{(\alpha)}(x, y)$ around the reference point $x_0$, in the direction ξ for the first variable (i.e., $x = x_0 + \xi$) and in the direction η for the second variable (i.e., $y = x_0 + \eta$), while renaming $x_0$ as x for clarity:⁴

$$D_\Phi^{(\alpha)}(x + \xi, x + \eta) \simeq \frac{1}{2} \sum_{i,j} \Phi_{ij}\,(\xi^i - \eta^i)(\xi^j - \eta^j)$$

⁴ We try to follow the conventions of tensor algebra for upper and lower indices, but do not invoke the Einstein summation convention since many of the equalities are not in coordinate-invariant or tensorial form.
$$\;+\; \frac{1}{6} \sum_{i,j,k} \Phi_{ijk} \left( \frac{3-\alpha}{2}\,\xi^i \xi^j \xi^k + \frac{3+\alpha}{2}\,\eta^i \eta^j \eta^k - \frac{3+3\alpha}{2}\,\eta^i \eta^j \xi^k - \frac{3-3\alpha}{2}\,\xi^i \xi^j \eta^k \right) + o(\xi^m \eta^l),$$

where higher orders in the expansion (i.e., $m + l \ge 4$) have been collected into $o(\cdot)$. Following Eguchi (1983), the metric tensor of the Riemannian geometry induced by $D_\Phi^{(\alpha)}$ is

$$g_{ij}(x) = -\left.\frac{\partial^2}{\partial \xi^i \partial \eta^j}\, D_\Phi^{(\alpha)}(x + \xi, x + \eta)\right|_{\eta=\xi=0}, \tag{2.8}$$

whereas the pair of dual affine connections Γ and Γ* is

$$\Gamma_{ij,k}(x) = -\left.\frac{\partial^3}{\partial \xi^i \partial \xi^j \partial \eta^k}\, D_\Phi^{(\alpha)}(x + \xi, x + \eta)\right|_{\eta=\xi=0}, \tag{2.9}$$

$$\Gamma_{ij,k}^*(x) = -\left.\frac{\partial^3}{\partial \eta^i \partial \eta^j \partial \xi^k}\, D_\Phi^{(\alpha)}(x + \xi, x + \eta)\right|_{\eta=\xi=0}. \tag{2.10}$$
Carrying out the differentiation yields equations 2.6 and 2.7. ∎

Remark 2.2.1. Clearly, the metric tensor $g_{ij}$, which is symmetric and positive semidefinite due to the strict convexity of Φ, is actually independent of α, whereas the α-dependent affine connections are torsion-free (since $\Gamma_{ij,k}^{(\alpha)} = \Gamma_{ji,k}^{(\alpha)}$) and satisfy the duality

$$\Gamma_{ij,k}^{*(\alpha)} = \Gamma_{ij,k}^{(-\alpha)},$$

in accordance with equation 2.4, the duality in the selection of reference versus comparison point in $D_\Phi^{(\alpha)}$. Dual α-connections in the form of equation 2.7 were formally introduced and investigated in Lauritzen (1987). Here, the family of $D_\Phi^{(\alpha)}$-divergences is shown to induce these α-connections. Clearly, when $\alpha = 0$, the connection $\Gamma^{(0)} = \Gamma^{*(0)} \equiv \Gamma^{LC}$ is the self-dual, metric (Levi-Civita) connection, as through direct verification it satisfies

$$\Gamma_{ij,k}^{LC} = \frac{1}{2}\left(\frac{\partial g_{ik}(x)}{\partial x^j} + \frac{\partial g_{kj}(x)}{\partial x^i} - \frac{\partial g_{ij}(x)}{\partial x^k}\right).$$

The Levi-Civita connection and the other members of the α-connection family are related through

$$\Gamma_{ij,k}^{LC} = \frac{1}{2}\left(\Gamma_{ij,k}^{(\alpha)} + \Gamma_{ij,k}^{*(\alpha)}\right).$$
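The Eguchi relations 2.8 and 2.9 can be checked by finite differences. A one-dimensional Python sketch with $\Phi(x) = e^x$, so that $g = \Phi'' = e^x$ and $\Gamma^{(\alpha)} = \frac{1-\alpha}{2}\,e^x$ by proposition 2 (function names and step sizes are illustrative):

```python
import math

def D(u, v, alpha, phi=math.exp):
    """D_Phi^(alpha)(u, v) of equation 2.2 in one dimension (alpha != ±1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha**2) * (a * phi(u) + b * phi(v) - phi(a * u + b * v))

def metric(x, alpha, h=1e-3):
    """-d^2 D(x+xi, x+eta)/(d xi d eta) at xi = eta = 0, central differences (eq. 2.8)."""
    m = (D(x + h, x + h, alpha) - D(x + h, x - h, alpha)
         - D(x - h, x + h, alpha) + D(x - h, x - h, alpha)) / (4 * h * h)
    return -m

def connection(x, alpha, h=1e-2):
    """-d^3 D(x+xi, x+eta)/(d xi^2 d eta) at xi = eta = 0 (eq. 2.9)."""
    def second_xi(eta):   # second central difference in xi at fixed eta
        return (D(x + h, x + eta, alpha) - 2 * D(x, x + eta, alpha)
                + D(x - h, x + eta, alpha)) / (h * h)
    return -(second_xi(h) - second_xi(-h)) / (2 * h)

# Expected: metric is independent of alpha (Phi'' = e^x), while the
# connection scales as (1 - alpha)/2 * e^x, as in equation 2.7.
```

This makes the α-independence of the metric, and the α-linearity of the connection, directly visible.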
Note that the covariant form of the affine connection, $\Gamma_{ij,k}$, is related to its contravariant form $\Gamma_{ij}^k$ through $g_{ij}$:

$$\sum_k g_{kl}\,\Gamma_{ij}^k = \Gamma_{ij,l}$$

(actually, $\Gamma_{ij}^k$ is the more primitive quantity since it does not involve the metric). The curvature or flatness of a connection is described by the Riemann-Christoffel curvature tensor,

$$R_{j\mu\nu}^i(x) = \frac{\partial \Gamma_{\nu j}^i(x)}{\partial x^\mu} - \frac{\partial \Gamma_{\mu j}^i(x)}{\partial x^\nu} + \sum_k \Gamma_{\mu k}^i(x)\,\Gamma_{\nu j}^k(x) - \sum_k \Gamma_{\nu k}^i(x)\,\Gamma_{\mu j}^k(x),$$

or equivalently by

$$R_{ij\mu\nu} = \sum_l g_{il}\, R_{j\mu\nu}^l.$$

It is antisymmetric when $i \leftrightarrow j$ or when $\mu \leftrightarrow \nu$, and symmetric when $(i, j) \leftrightarrow (\mu, \nu)$. Since the curvature $R_{ij\mu\nu}^*$ of the dual connection Γ* equals $R_{ij\mu\nu}$ (Lauritzen, 1987),

$$R_{ij\mu\nu}^{*(\alpha)} = R_{ij\mu\nu}^{(\alpha)} = R_{ij\mu\nu}^{(-\alpha)}.$$
Proposition 3. The Riemann-Christoffel curvature tensor for the α-connection $\Gamma_{ij,k}^{(\alpha)}$ induced by $D_\Phi^{(\alpha)}$ is

$$R_{ij\mu\nu}^{(\alpha)} = \frac{1-\alpha^2}{4} \sum_{l,k} \left( \Phi_{il\nu}\,\Phi_{jk\mu} - \Phi_{il\mu}\,\Phi_{jk\nu} \right) \Phi^{lk}, \tag{2.11}$$

where $\Phi^{ij} = g^{ij}$ is the matrix inverse of $\Phi_{ij}$.

Proof. First, from its definition, $R_{ij\mu\nu}$ can be recast into

$$R_{ij\mu\nu} = \left( \frac{\partial \Gamma_{\nu j,i}}{\partial x^\mu} - \frac{\partial \Gamma_{\mu j,i}}{\partial x^\nu} \right) + \sum_k \left( \left( \frac{\partial g_{ik}}{\partial x^\mu} - \Gamma_{\mu k,i} \right) \Gamma_{\nu j}^k - \left( \frac{\partial g_{ik}}{\partial x^\nu} - \Gamma_{\nu k,i} \right) \Gamma_{\mu j}^k \right). \tag{2.12}$$

Substituting in the expression of the α-connections (equation 2.7), the first two terms cancel and the terms under $\sum_k$ give rise to equation 2.11. ∎
Remark 2.2.2. The metric (see equation 2.6), dual α-connections (see equation 2.7), and the curvature (see equation 2.11) in such forms first appeared in Amari (1985), where $\Phi(x)$ is the cumulant generating function of an exponential family. Here, the statistical manifold $\{S, g, \Gamma^{(\alpha)}, \Gamma^{*(\alpha)}\}$ is induced by a divergence function via the Eguchi relation, and $\Phi(x)$ can be any (smooth and strictly) convex function. The purpose of this proposition is to remind readers that for any convex Φ in general, the curvature is determined by both an α-dependent factor $\frac{1-\alpha^2}{4}$ and a Φ-related component, the latter depending on $\Phi_{ij}$ plus its derivatives and inverse. This leads to the following conformal property:

Corollary 1. If two smooth, strictly convex functions $\Phi(x)$ and $\hat{\Phi}(x)$ are conformally related, that is, if there exists some positive function $\sigma(x) > 0$ such that $\hat{\Phi}_{ij} = \sigma\,\Phi_{ij}$, then the curvatures of their respective α-connections satisfy

$$\hat{R}_{ij\mu\nu}^{(\alpha)} = \sigma\, R_{ij\mu\nu}^{(\alpha)}. \tag{2.13}$$

Proof. Observe that

$$\hat{\Phi}_{il\nu} = \sigma\,\Phi_{il\nu} + \sigma_i\,\Phi_{l\nu},$$

where $\sigma_i$ denotes $\partial\sigma/\partial x^i$. Permuting the index set $(i, l, \nu)$ to $(i, l, \mu)$, to $(j, k, \mu)$, and to $(j, k, \nu)$ yields three other similar identities. Noting $\hat{\Phi}^{ij} = \sigma^{-1}\,\Phi^{ij}$, direct substitution of these relations into the expression of $R_{ij\mu\nu}^{(\alpha)}$ in equation 2.11 yields equation 2.13. ∎

2.3 Dually Flat Statistical Manifold ($\alpha = \pm 1$). When $\alpha = \pm 1$, all components of the curvature tensor vanish; that is, $R_{ij\mu\nu}^{(\pm 1)} = 0$. In this case, there exists a coordinate system under which either $\Gamma_{ij,k}^{(1)} = 0$ or $\Gamma_{ij,k}^{*(-1)} = 0$. This is the well-studied dually flat statistical manifold (Amari, 1982, 1985; Amari & Nagaoka, 2000), under which a pair of global biorthogonal coordinates, related to each other through the Legendre transform with respect to Φ, can be found to cast the divergence function into its canonical form. The Riemannian manifold with metric tensor given by equation 2.6, along with the dually flat $\Gamma^{(1)}$ and $\Gamma^{*(-1)}$, is known as the Hessian manifold (Shima, 1978; Shima & Yagi, 1997).
Proposition 4. When $\alpha \to \pm 1$, $D_\Phi^{(\alpha)}$ reduces to the Bregman divergence (see equation 1.7):

$$D_\Phi^{(-1)}(x, y) = D_\Phi^{(1)}(y, x) = B_\Phi(x, y),$$
$$D_\Phi^{(1)}(x, y) = D_\Phi^{(-1)}(y, x) = B_\Phi(y, x).$$
Proof. Assuming that the Gâteaux derivative of Φ,

$$\lim_{\lambda \to 0} \frac{\Phi(x + \lambda\xi) - \Phi(x)}{\lambda},$$

exists and equals $\langle \xi, \nabla\Phi(x) \rangle$, where ∇ is the gradient (subdifferential) operator and $\langle \cdot, \cdot \rangle$ denotes the standard inner product. Similarly,

$$\lim_{\lambda \to 1} \frac{\Phi(y + (1 - \lambda)\eta) - \Phi(y)}{1 - \lambda} = \langle \eta, \nabla\Phi(y) \rangle.$$

Taking $\xi = y - x$, $\eta = x - y$, and $\lambda = \frac{1 \pm \alpha}{2}$, and substituting these into equation 2.3 immediately yields the results. ∎

Remark 2.3.1. Introducing the convex conjugate of Φ through the Legendre-Fenchel transform (see, e.g., Rockafellar, 1970),
(2.14)
it can be shown that the function 8¤ is also convex (on a convex region S0 3 u where u D r8.x/) and has 8 as its conjugate, .8 ¤ /¤ D 8: Since r8 and r8¤ are inverse functions of each other, as veried by differentiating equation 2.14, it is convenient to denote this one-to-one correspondence between x 2 S and u 2 S0 by x D r8¤ .u/ D .r8/¡1 .u/;
u D r8.x/ D .r8¤ /¡1 .x/:
(2.15)
With these, it can be shown that the Bregman divergence $D_\Phi^{(\pm 1)}$ is actually the canonical divergence (Amari & Nagaoka, 2000) in disguise.

Corollary 2. The $D_\Phi^{(\pm 1)}$-divergence is the canonical divergence of a dually flat statistical manifold:

$$D_\Phi^{(1)}\big(x, (\nabla\Phi)^{-1}(v)\big) = A_\Phi(x, v) \equiv \Phi(x) + \Phi^*(v) - \langle x, v \rangle,$$
$$D_\Phi^{(-1)}\big((\nabla\Phi)^{-1}(u), y\big) = A_{\Phi^*}(u, y) \equiv \Phi(y) + \Phi^*(u) - \langle u, y \rangle. \tag{2.16}$$

Proof. Using the convex conjugate $\Phi^*$, we have

$$D_\Phi^{(1)}(x, y) = \Phi(x) - \langle x, \nabla\Phi(y) \rangle + \Phi^*(\nabla\Phi(y)),$$
$$D_\Phi^{(-1)}(x, y) = \Phi(y) - \langle y, \nabla\Phi(x) \rangle + \Phi^*(\nabla\Phi(x)).$$
Substituting $u = \nabla\Phi(x)$, $v = \nabla\Phi(y)$ yields equation 2.16. So $D_\Phi^{(1)}\big(x, (\nabla\Phi)^{-1}(v)\big)$, when viewed as a function of $x, v$, is the canonical divergence. So is $D_\Phi^{(-1)}$. ∎
Remark 2.3.2. The canonical divergence A8 .x; v/ based on the LegendreFenchel inequality is introduced by Amari (1982, 1985), where the functions 8; 8¤ , the cumulant generating functions of an exponential family, were referred to as dual potential functions. This form, equation 2.16, is “canonical” because it is uniquely specied in a dually at space where the pair of canonical coordinates (see equation 2.15) induced by the dual potential functions is biorthogonal, @ ui D gij .x/; @ xj
@ xi D gQ ij .u/; @u j
(2.17)
where gQ ij .u.x// D gij .x/ is the matrix inverse of gij .x/ given by equation 2.6. Remark 2.3.3. We point out that there are two kinds of duality associated with the divergence (directed distance) dened on a dually at statistical .¡1/ .1/ .¡1/ .1/ manifold: one between D8 $ D8 and D8 $ D8 ¤ ¤ , the other between .¡1/ .¡1/ .1/ .1/ D8 $ D8¤ and D 8 $ D8¤ . The rst kind is related to the duality in the choice of the reference and the comparison status for the two points (x versus y) for computing the value of the divergence, and hence is called referential duality. The second kind is related to the duality in the choice of the representation of the point as a vector in the parameter versus gradient space (x versus u) in the expression of the divergence function, and hence is called representational duality. More concretely, .¡1/ .¡1/ D8 .x; y/ D D8 ¤ .r8.y/; r8.x// .1/ .1/ D D8 ¤ .r8.x/; r8.y// D D8 .y; x/:
The biduality is compactly reflected in the canonical divergence as $A_\Phi(x, v) = A_{\Phi^*}(v, x)$.

2.4 Biduality of Statistical Manifold for General α. A natural question to ask is whether biduality is a general property of the divergence $D_\Phi^{(\alpha)}$ and hence a property of any statistical manifold admitting a metric and a pair of dual (but not necessarily flat) connections. Proposition 5 provides a positive answer to this question after considering the geometry generated by the pair of conjugate divergence functions, $D_\Phi^{(\alpha)}$ and $D_{\Phi^*}^{(\alpha)}$, for each $\alpha \in \mathbb{R}$.
Divergence Function, Duality, and Convex Analysis
173
Proposition 5. For the statistical manifold $\{S', \tilde g, \tilde\Gamma^{(\alpha)}, \tilde\Gamma^{*(\alpha)}\}$ induced by $D_{\Phi^*}^{(\alpha)}(u, v)$ defined on $u, v \in S'$, denote the Riemannian metric as $\tilde g^{mn}(u)$, the pair of dual connections as $\tilde\Gamma^{(\alpha)mn,l}(u), \tilde\Gamma^{*(\alpha)mn,l}(u)$, and the Riemann-Christoffel curvature tensor as $\tilde R^{(\alpha)klmn}(u)$. They are related to those (in lower scripts and without the tilde) induced by $D_\Phi^{(\alpha)}(x, y)$ via
$$\sum_l g_{il}(x)\, \tilde g^{ln}(u) = \delta_i^n,$$
and
$$\tilde\Gamma^{(\alpha)mn,l}(u) = -\sum_{i,j,k} \tilde g^{im}(u)\, \tilde g^{jn}(u)\, \tilde g^{kl}(u)\, \Gamma^{(\alpha)}_{ij,k}(x),$$
$$\tilde\Gamma^{*(\alpha)mn,l}(u) = -\sum_{i,j,k} \tilde g^{im}(u)\, \tilde g^{jn}(u)\, \tilde g^{kl}(u)\, \Gamma^{(-\alpha)}_{ij,k}(x),$$
$$\tilde R^{(\alpha)klmn}(u) = \sum_{i,j,\mu,\nu} \tilde g^{ik}(u)\, \tilde g^{jl}(u)\, \tilde g^{\mu m}(u)\, \tilde g^{\nu n}(u)\, R^{(\alpha)}_{ij\mu\nu}(x),$$
where the dual correspondence (see equation 2.15) is invoked.

Proof. Following the proof of proposition 2 (i.e., using the Eguchi relation), the metric and dual connections induced on $S'$ are simply $\tilde g^{mn} = (\Phi^*)^{mn}$ and
$$\tilde\Gamma^{(\alpha)mn,l} = \frac{1-\alpha}{2}\, (\Phi^*)^{mnl}, \qquad \tilde\Gamma^{*(\alpha)mn,l} = \frac{1+\alpha}{2}\, (\Phi^*)^{mnl},$$
with the corresponding Riemann-Christoffel curvature of $\tilde\Gamma^{(\alpha)mn,l}$ given as
$$\tilde R^{(\alpha)klmn} = \frac{1-\alpha^2}{4} \sum_{\rho,\tau} \left( (\Phi^*)^{k\rho n} (\Phi^*)^{l\tau m} - (\Phi^*)^{k\rho m} (\Phi^*)^{l\tau n} \right) (\Phi^*)_{\rho\tau},$$
where
$$(\Phi^*)^{mn} = \frac{\partial^2 \Phi^*(u)}{\partial u_m\, \partial u_n}, \qquad (\Phi^*)^{mnl} = \frac{\partial^3 \Phi^*(u)}{\partial u_m\, \partial u_n\, \partial u_l},$$
and $(\Phi^*)_{\rho\tau}$ is the matrix inverse of $(\Phi^*)^{mn}$. That $\sum_l g_{il}(x)\, \tilde g^{ln}(u(x)) = \sum_l g_{il}(x(u))\, \tilde g^{ln}(u) = \delta_i^n$ is due to equations 2.15 and 2.17. Differentiating this
identity with respect to $x^k$ yields
$$\sum_m \frac{\partial g_{im}(x)}{\partial x^k}\, \tilde g^{mn}(u) = -\sum_m g_{im}(x)\, \frac{\partial \tilde g^{mn}(u)}{\partial x^k} = -\sum_m g_{im}(x) \left( \sum_l \frac{\partial (\Phi^*)^{mn}(u)}{\partial u_l} \frac{\partial u_l}{\partial x^k} \right),$$
or
$$\sum_m \Phi_{imk}(x)\, \tilde g^{mn}(u) = -\sum_{m,l} g_{im}(x)\, g_{kl}(x)\, (\Phi^*)^{mnl}(u),$$
which immediately gives rise to the desired relations between the α-connections. Simple substitution yields the relation between $\tilde R^{(\alpha)klmn}$ and $R^{(\alpha)}_{ij\mu\nu}$. ∎

Remark 2.4.1. The biorthogonal coordinates $x$ and $u$ were originally defined on the manifold $S$ and its dual $S'$, respectively. Because of the bijectivity of the mapping between $x$ and $u$, we may identify points in $S$ with points in $S'$ and simply view $x \leftrightarrow u$ as coordinate transformations on the same underlying manifold. The relations between the metric $g$, the dual connections $\Gamma^{(\pm\alpha)}$, or the curvature $R^{(\alpha)}$ written in superscripts with tilde and those written in subscripts without tilde are merely expressions of the same geometric entities using different coordinate systems. The dualistic geometric structure $\Gamma^{(\alpha)} \leftrightarrow \Gamma^{*(\alpha)}$ under $g$, which reflects referential duality, is preserved under the mapping $x \leftrightarrow u$, which reflects representational duality. When the manifold is dually flat ($\alpha = \pm 1$), $x$ and $u$ enjoy the additional property of being geodesic coordinates.

Remark 2.4.2. Matumoto (1993) proved that a divergence function always exists for a statistical manifold equipped with an arbitrary metric tensor and a pair of dual connections. Given a convex function $\Phi$, along with its unique conjugate $\Phi^*$, are there other families of divergence functions $\mathcal{D}_\Phi^{(\alpha)}(x, y)$ and $\mathcal{D}_{\Phi^*}^{(\alpha)}(u, v)$ that induce the same bidualistic statistical manifold $\{S, g, \Gamma^{(\alpha)}, \Gamma^{*(\alpha)}\}$? The answer is positive. Consider the family of divergence functions,
$$\mathcal{D}_\Phi^{(\alpha)}(x, y) = \frac{1-\alpha}{2}\, D_\Phi^{(-1)}(x, y) + \frac{1+\alpha}{2}\, D_\Phi^{(1)}(x, y),$$
and their conjugate (replacing $\Phi$ with $\Phi^*$). Recall from proposition 2 that the metric tensor induced by $D_\Phi^{(-1)}(x, y)$ and $D_\Phi^{(1)}(x, y)$ is the same $g_{ij}$, while the induced connections satisfy $\Gamma^{(-1)}_{ij,k} = \Gamma^{*(1)}_{ij,k} = \Phi_{ijk}$, $\Gamma^{(1)}_{ij,k} = \Gamma^{*(-1)}_{ij,k} = 0$.
Since the Eguchi relations, equations 2.8 to 2.10, are linear with respect to the inducing functions, the family of divergence functions $\mathcal{D}_\Phi^{(\alpha)}(x, y)$, as a convex mixture of $D_\Phi^{(-1)}(x, y)$ and $D_\Phi^{(1)}(x, y)$, will necessarily induce the metric
$$\frac{1-\alpha}{2}\, g_{ij} + \frac{1+\alpha}{2}\, g_{ij} = g_{ij}$$
and dual connections
$$\frac{1-\alpha}{2}\, \Gamma^{(-1)}_{ij,k} + \frac{1+\alpha}{2}\, \Gamma^{(1)}_{ij,k} = \Gamma^{(\alpha)}_{ij,k}, \qquad \frac{1-\alpha}{2}\, \Gamma^{*(-1)}_{ij,k} + \frac{1+\alpha}{2}\, \Gamma^{*(1)}_{ij,k} = \Gamma^{*(\alpha)}_{ij,k}.$$
Similar arguments apply to $\mathcal{D}_{\Phi^*}^{(\alpha)}(u, v)$. That is, $\mathcal{D}_\Phi^{(\alpha)}(x, y)$ and $\mathcal{D}_{\Phi^*}^{(\alpha)}(u, v)$ form another pair of families of divergence functions that induce the same statistical manifold $\{S, g, \Gamma^{(\alpha)}, \Gamma^{*(\alpha)}\}$. The two pairs, $(D_\Phi^{(\alpha)}(x, y), D_{\Phi^*}^{(\alpha)}(u, v))$ and $(\mathcal{D}_\Phi^{(\alpha)}(x, y), \mathcal{D}_{\Phi^*}^{(\alpha)}(u, v))$, agree at $\alpha = \pm 1$, the dually flat case when there is a single form of canonical divergence. They differ for any other values of $\alpha$, including the self-dual element ($\alpha = 0$). The two pairs coincide with each other up to the third order when Taylor expanded in $(1 \pm \alpha)(y - x)$; that is why they induce an identical statistical manifold. They differ in the fourth-order terms of their expansions.
3 Divergence on Probability and Positive Measures

The previous sections have discussed divergence functions defined between points in $\mathbb{R}^n$ or in its dual space, or both. In particular, they apply to probability measures of finite support, or to the finite-dimensional parameter space that defines parametric probability models. Traditionally, divergence functionals are also defined with respect to infinite-dimensional probability densities (or positive measures in general, if the normalization constraint is removed). To the extent that a probability density function can be embedded into a finite-dimensional parameter space, a divergence measure on density functions will implicitly induce a divergence on the parameter space (technically, via pullback). In fact, this is the original setup in Amari (1985), where each α-divergence ($\alpha \in \mathbb{R}$) is seen as the canonical divergence arising from the α-embedding of the probability density function into an affine submanifold (the condition of α-affinity). The approach outlined below avoids such a flatness assumption while still achieving conjugate representations (embeddings) of probability densities, and therefore extends the notion of biduality to the infinite-dimensional case. It will be proved (in proposition 9) that if the embedding manifold is flat, then the induced
α-connections reduce to the ones introduced in the previous section, with the natural and expectation parameters arising out of these conjugate representations forming biorthogonal coordinates, just like the ones induced by dual potential functions in the finite-dimensional case.

To follow the procedures of section 2.1 and construct divergence functionals, a smooth and strictly convex function defined on the real line, $f: \mathbb{R} \to \mathbb{R}$, is introduced. Recall that any such $f$ can be written as an integral of a strictly monotone-increasing function $g$ and vice versa: $f(\gamma) = \int_c^\gamma g(t)\, dt + f(c)$, with $g'(t) > 0$. The convex conjugate $f^*: \mathbb{R} \to \mathbb{R}$ is given by $f^*(\delta) = \int_{g(c)}^\delta g^{-1}(t)\, dt + f^*(g(c))$, with $g^{-1}$ also strictly monotonic and $\gamma, \delta \in \mathbb{R}$. Here, the monotonicity condition replaces the requirement of a positive semidefinite Hessian in the case of a convex function of several variables. The Legendre-Fenchel inequality $f(\gamma) + f^*(\delta) \ge \gamma\delta$ can be rewritten as Young's inequality,
$$\int_c^\gamma g(t)\, dt + \int_{g(c)}^\delta g^{-1}(t)\, dt + c\, g(c) \ge \gamma\delta,$$
with equality holding if and only if $\delta = g(\gamma)$. The use of a pair of strictly monotonic functions $f' = g$ and $(f^*)' = g^{-1}$, which we call ρ- and τ-representations below, provides a means to define conjugate embeddings (representations) of density functions and therefore a method to extend the analysis in the previous sections to the infinite-dimensional manifold of positive measures (after integrating over the sample space).

3.1 Divergence Functional Based on Convex Function on the Real Line. Recall that the fundamental inequality (see equation 2.1) of a strictly convex function $\Phi$, now for $f: \mathbb{R} \to \mathbb{R}$, can be used to define a nonnegative quantity (for any $\alpha \in \mathbb{R}$),
$$\frac{4}{1-\alpha^2} \left( \frac{1-\alpha}{2}\, f(\gamma) + \frac{1+\alpha}{2}\, f(\delta) - f\left( \frac{1-\alpha}{2}\gamma + \frac{1+\alpha}{2}\delta \right) \right).$$
Note that here, $\gamma$ and $\delta$ are numbers instead of finite-dimensional vectors. In particular, they can be the values of two probability density functions, $\gamma = p(\zeta)$ and $\delta = q(\zeta)$, where $\zeta \in \mathcal{X}$ is a point in the sample space $\mathcal{X}$. The nonnegativity of the above expression for each value of $\zeta$ allows us to define a global divergence measure, called a divergence functional, over the space of (denormalized) probability density functions after integrating over the sample space (with appropriate measure $\mu(d\zeta) = d\mu$),
$$d_f^{(\alpha)}(p, q) = \int d_f^{(\alpha)}(p(\zeta), q(\zeta))\, d\mu = \frac{4}{1-\alpha^2} \left\{ \frac{1-\alpha}{2} \int f(p)\, d\mu + \frac{1+\alpha}{2} \int f(q)\, d\mu - \int f\left( \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q \right) d\mu \right\},$$
with
$$d_f^{(-1)}(p, q) = d_f^{(1)}(q, p) = \int \left( f(q) - f(p) - (q - p)\, f'(p) \right) d\mu \tag{3.1}$$
$$= \int \left( f(q) + f^*(f'(p)) - q\, f'(p) \right) d\mu \equiv A_f(q, f'(p)), \tag{3.2}$$
where $f^*: \mathbb{R} \to \mathbb{R}$, defined by $f^*(t) = t\, (f')^{-1}(t) - f((f')^{-1}(t))$, is the convex conjugate of $f$, with $(f^*)^* = f$ and $(f^*)' = (f')^{-1}$.

In this way, $d_f^{(\alpha)}(p, q)$ over the infinite-dimensional functional space is defined in much the same way as $D_\Phi^{(\alpha)}(x, y)$ defined on the finite-dimensional vector space. The integration $\int f(p)\, d\mu = \int f(p(\zeta))\, d\mu$, which is a nonlinear functional of $p$, assumes the role of $\Phi$ of the finite-dimensional case discussed in section 2; this is most transparent if we consider, for the latter, the special class of "separable" convex functions $\Phi(x) = \sum_{i=1}^n f(x^i)$, $x \in \mathbb{R}^n$, such that $\nabla\Phi(x)$ is simply $[f'(x^1), \ldots, f'(x^n)]$. With $\sum_i \leftrightarrow \int d\mu$, the expressions of the divergence function and the divergence functional look similar. However, one should not conclude that divergence functions are special cases of divergence functionals or vice versa. There is a subtle but important difference: in the former, the inducing function $\Phi(x)$ is strictly convex in $x$; in the latter, $f(p)$ is strictly convex in $p$, but its pullback on $\mathcal{X}$, $f(p(\zeta))$, is not assumed to be convex at all. So even when the sample space is finite, $(f \circ p)(\zeta)$ is generally not a convex function of $\zeta$.
Example 3.1.1. Take $f(t) = t \log t - t$ ($t > 0$), with conjugate function $f^*(u) = e^u$. The divergence
$$A_f(p, u) = \int \left( (p \log p - p) + e^u - p\, u \right) d\mu = \int \left( p \log\frac{p}{e^u} - p + e^u \right) d\mu$$
is the Kullback-Leibler divergence $K(p, e^u)$ between $p(\zeta)$ and $q(\zeta) = e^{u(\zeta)}$.

Example 3.1.2. Take $f(t) = \frac{t^r}{r}$ ($t > 0$) with the conjugate function $f^*(t) = \frac{t^{r'}}{r'}$, where the pair of conjugated real exponents $r > 1$, $r' > 1$ satisfies
$$\frac{1}{r} + \frac{1}{r'} = 1.$$
The divergence $A_f$ is a nonnegative expression based on Hölder's inequality for two functions $u(\zeta), v(\zeta)$,
$$A_f(u, v) = \int \left( \frac{u^r}{r} + \frac{v^{r'}}{r'} - u\, v \right) d\mu \ge 0,$$
with equality holding if and only if $u^r(\zeta) = v^{r'}(\zeta)$ for all $\zeta \in \mathcal{X}$. Denote $r = \frac{2}{1-\alpha}$ and $r' = \frac{2}{1+\alpha}$, with $\alpha \in (-1, 1)$. The above divergence is just $A^{(\alpha)}(p, q)$ between $p(\zeta) = (u(\zeta))^{\frac{2}{1-\alpha}}$ and $q(\zeta) = (v(\zeta))^{\frac{2}{1+\alpha}}$, apart from a factor $\frac{4}{1-\alpha^2}$.
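Example 3.1.1 is straightforward to check on a discretized sample space. The sketch below (grid and densities are our illustrative choices, not from the article) verifies that $A_f(p, u)$ with $f(t) = t\log t - t$ recovers the extended Kullback-Leibler divergence:

```python
import numpy as np

# Discretized check of example 3.1.1: with f(t) = t log t - t and
# f*(u) = exp(u), the canonical form A_f(p, u) equals the (extended)
# Kullback-Leibler divergence K(p, q) with q = exp(u).
z = np.linspace(-5, 5, 2001)
dz = z[1] - z[0]
p = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)          # N(0, 1), illustrative
q = np.exp(-0.5 * (z - 1)**2) / np.sqrt(2 * np.pi)    # N(1, 1), illustrative
u = np.log(q)

A_f = np.sum((p * np.log(p) - p) + np.exp(u) - p * u) * dz
KL  = np.sum(p * np.log(p / q) - p + q) * dz          # extended KL
assert abs(A_f - KL) < 1e-9
assert A_f > 0          # nonnegativity of the divergence functional
```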
3.2 Conjugate Representations and Induced Statistical Manifold. We introduce the notion of ρ-representation of a (not necessarily normalized) probability density by defining a mapping $\rho: \mathbb{R}_+ \to \mathbb{R}$, $p \mapsto \rho(p)$, where ρ is a strictly monotone increasing function. This is a generalization of the α-representation (Amari, 1985; Amari & Nagaoka, 2000), where $\rho(p) = l^{(\alpha)}(p)$, as given by equation 1.8. For a smooth and strictly convex function $f: \mathbb{R} \to \mathbb{R}$, the τ-representation of the density function, $p \mapsto \tau(p)$, is said to be conjugate to the ρ-representation with respect to $f$ if
$$\tau(p) = f'(\rho(p)) = ((f^*)')^{-1}(\rho(p)) \iff \rho(p) = (f')^{-1}(\tau(p)) = (f^*)'(\tau(p)). \tag{3.3}$$
Just like the construction in section 3.1 of divergence functionals for two densities $p, q$, one may construct divergence functionals for two densities under ρ-representations, $\mathcal{D}_{f,\rho}^{(\alpha)}(p, q) \equiv d_f^{(\alpha)}(\rho(p), \rho(q))$, or under τ-representations, $\mathcal{D}_{f^*,\tau}^{(\alpha)}(p, q) \equiv d_{f^*}^{(\alpha)}(\tau(p), \tau(q))$.
Proposition 6. For $\alpha \in \mathbb{R}$, $\mathcal{D}_{f,\rho}^{(\alpha)}(p, q)$, $\mathcal{D}_{f,\tau}^{(\alpha)}(p, q)$, $\mathcal{D}_{f^*,\rho}^{(\alpha)}(p, q)$, and $\mathcal{D}_{f^*,\tau}^{(\alpha)}(p, q)$ each forms a one-parameter family of divergence functionals, with the $(\pm 1)$-divergence functionals
$$\mathcal{D}_{f,\rho}^{(1)}(p, q) = \mathcal{D}_{f,\rho}^{(-1)}(q, p) = \mathcal{D}_{f^*,\tau}^{(1)}(q, p) = \mathcal{D}_{f^*,\tau}^{(-1)}(p, q) = A_f(\rho(p), \tau(q)),$$
$$\mathcal{D}_{f^*,\rho}^{(1)}(p, q) = \mathcal{D}_{f^*,\rho}^{(-1)}(q, p) = \mathcal{D}_{f,\tau}^{(1)}(q, p) = \mathcal{D}_{f,\tau}^{(-1)}(p, q) = A_{f^*}(\rho(p), \tau(q)),$$
taking the following canonical form:
$$A_f(\rho(p), \tau(q)) \equiv \int \left( f(\rho(p)) + f^*(\tau(q)) - \rho(p)\, \tau(q) \right) d\mu = A_{f^*}(\tau(q), \rho(p)). \tag{3.4}$$
Proof. The proof of nonnegativity of these functionals for all $\alpha \in \mathbb{R}$ follows that of proposition 2. Taking $\lim_{\alpha\to\pm 1} \mathcal{D}_{f,\rho}^{(\alpha)}(p, q)$ and noting equation 3.3 immediately leads to the expressions of the $(\pm 1)$-divergence functionals. ∎

Example 3.2.1. Amari's α-embedding, where $\rho(p) = l^{(\alpha)}(p)$, $\tau(p) = l^{(-\alpha)}(p)$, corresponds to (assuming $\alpha \ne \pm 1$)
$$f(t) = \frac{2}{1+\alpha} \left( \frac{1-\alpha}{2}\, t \right)^{\frac{2}{1-\alpha}}, \qquad f^*(t) = \frac{2}{1-\alpha} \left( \frac{1+\alpha}{2}\, t \right)^{\frac{2}{1+\alpha}}.$$
Writing out $A_f(\rho(p), \tau(q))$ explicitly yields the α-divergence in the form of equation 1.2. For $\alpha = \pm 1$, see example 3.1.1.

Now we restrict attention to a finite-dimensional submanifold of probability densities whose ρ-representations are parameterized using $\theta = [\theta^1, \ldots, \theta^n] \in M_\theta$. Under such a statistical model, the divergence functional of any two densities $p$ and $q$, assumed to be specified by $\theta_p$ and $\theta_q$, respectively, becomes an implicit function of $\theta_p, \theta_q \in \mathbb{R}^n$. In other words, by introducing parametric models (i.e., a finite-dimensional submanifold) of the infinite-dimensional manifold of probability densities, we again arrive at divergence functions over the vector space. We denote the ρ-representation of a parameterized probability density as $\rho(p(\zeta; \theta))$, or sometimes simply $\rho(\theta_p)$, while suppressing the sample space variable $\zeta$, and denote the corresponding divergence function by
$$\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q) = \frac{4}{1-\alpha^2}\, E_\mu \left\{ \frac{1-\alpha}{2}\, f(\rho(\theta_p)) + \frac{1+\alpha}{2}\, f(\rho(\theta_q)) - f\left( \frac{1-\alpha}{2}\rho(\theta_p) + \frac{1+\alpha}{2}\rho(\theta_q) \right) \right\}, \tag{3.5}$$
where $E_\mu\{\cdot\}$ denotes $\int \{\cdot\}\, d\mu$. We will also use $E_p\{\cdot\}$ to denote $\int \{\cdot\}\, p\, d\mu$ later. Similarly, the parametrically embedded probability density under τ-representation is denoted $\tau(p(\zeta; \theta))$ or simply $\tau(\theta_p)$.

Proposition 7.
The family of divergence functions $\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ induces a dually affine Riemannian manifold $\{M_\theta, g, \Gamma^{(\alpha)}, \Gamma^{*(\alpha)}\}$ for each $\alpha \in \mathbb{R}$, with the metric tensor
$$g_{ij}(\theta) = E_\mu \left\{ f''(\rho(\theta))\, \frac{\partial\rho}{\partial\theta^i} \frac{\partial\rho}{\partial\theta^j} \right\} \tag{3.6}$$
and the dual affine connections
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu \left\{ \frac{1-\alpha}{2}\, f'''(\rho)\, A_{ijk} + f''(\rho)\, B_{ijk} \right\}, \tag{3.7}$$
$$\Gamma^{*(\alpha)}_{ij,k}(\theta) = E_\mu \left\{ \frac{1+\alpha}{2}\, f'''(\rho)\, A_{ijk} + f''(\rho)\, B_{ijk} \right\}. \tag{3.8}$$
Here, ρ and all its partial derivatives (with respect to θ) are functions of θ and ζ, while $A_{ijk}$, $B_{ijk}$ denote
$$A_{ijk}(\zeta; \theta) = \frac{\partial\rho}{\partial\theta^i} \frac{\partial\rho}{\partial\theta^j} \frac{\partial\rho}{\partial\theta^k}, \qquad B_{ijk}(\zeta; \theta) = \frac{\partial^2\rho}{\partial\theta^i\, \partial\theta^j} \frac{\partial\rho}{\partial\theta^k}.$$

Proof. We follow the same technique of section 2.2 and expand the value of the divergence measure $\mathcal{D}_{f,\rho}^{(\alpha)}(\theta_p, \theta_q)$ around $\theta_p = \theta + \xi$, $\theta_q = \theta + \eta$ for small $\xi, \eta \in \mathbb{R}^n$. Considering the order of expansion $o(\xi^m \eta^l)$ with nonnegative integers $m, l$, the terms with $m + l \le 1$ vanish uniformly. The terms with $m + l = 2$ are
for any $\xi = [\xi^1, \ldots, \xi^n] \in \mathbb{R}^n$, due to the linear independence of the $\lambda_i$'s and the strict convexity of $f$. Now
$$\frac{\partial\Phi}{\partial\theta^i} = \int \frac{\partial f(\rho)}{\partial\theta^i}\, d\mu = \int f'(\rho)\, \frac{\partial\rho}{\partial\theta^i}\, d\mu = \int \tau(p(\zeta))\, \lambda_i(\zeta)\, d\mu = \eta_i$$
by definition (equation 3.14). We can verify straightforwardly that
$$\frac{\partial^2\Phi}{\partial\theta^i\, \partial\theta^j} = \frac{\partial\eta_i}{\partial\theta^j} = \int f''(\rho(\zeta))\, \frac{\partial\rho}{\partial\theta^j}\, \lambda_i(\zeta)\, d\mu = g_{ij}(\theta)$$
is positive definite, so $\Phi(\theta)$ must be a strictly convex function. Parts iv and then iii follow proposition 2 once the strict convexity of $\Phi(\theta)$ is established. Differentiating both sides of equation 3.14 with respect to $\eta_j$ yields
$$\delta_i^j = \int \frac{\partial\tau}{\partial\eta_j}\, \lambda_i(\zeta)\, d\mu.$$
Thus,
$$\frac{\partial\Phi^*}{\partial\eta_i} = \int \frac{\partial f^*(\tau)}{\partial\eta_i}\, d\mu = \int (f^*)'(\tau)\, \frac{\partial\tau}{\partial\eta_i}\, d\mu = \int \rho(p(\zeta))\, \frac{\partial\tau}{\partial\eta_i}\, d\mu = \int \left( \sum_j \theta^j \lambda_j(\zeta) \right) \frac{\partial\tau}{\partial\eta_i}\, d\mu = \sum_j \theta^j \left( \int \lambda_j(\zeta)\, \frac{\partial\tau}{\partial\eta_i}\, d\mu \right) = \sum_j \theta^j \delta_j^i = \theta^i.$$
Part ii, namely the biorthogonality of θ and η, is thus established. Evaluating
$$\sum_i \theta^i \frac{\partial\Phi}{\partial\theta^i} - \Phi(\theta) = \int \tau(p(\zeta; \theta)) \left( \sum_i \theta^i \lambda_i(\zeta) \right) d\mu - \int f(\rho(p(\theta)))\, d\mu = \int \tau(p(\zeta; \theta))\, \rho(p(\zeta; \theta))\, d\mu - \int f(\rho(p(\zeta; \theta)))\, d\mu = \int f^*(\tau(p(\zeta; \eta)))\, d\mu = \Phi^*(\eta)$$
establishes the conjugacy between Φ and Φ*, and hence the strict convexity of Φ*, as claimed in part i. Finally, substituting these expressions into equation 3.4 establishes part v. Therefore, we have proved all the relations stated in this proposition. ∎
Remark 3.4.1. This is a generalization of the results about α-affine manifolds (Amari, 1985; Amari & Nagaoka, 2000), where the ρ- and τ-representations are just the α- and (−α)-representations, respectively. Proposition 9 says that when the $\lambda_i(\zeta)$'s are used as the basis functions of the sample space, θ is the natural (contravariant) coordinate to express $\rho(p)$, while η is the expectation (covariant) coordinate to express $\tau(p)$. They are biorthogonal,
$$\int \frac{\partial\rho}{\partial\theta^i} \frac{\partial\tau}{\partial\eta_j}\, d\mu = \delta_i^j,$$
when the ρ- (or τ-) representation of the density function is embeddable into the finite-dimensional affine space. The natural and expectation parameters are related to the τ- and to the ρ-representation, respectively, via
$$\tau(p(\zeta)) = f'\left( \sum_i \theta^i \lambda_i(\zeta) \right), \qquad \eta_i = \int f'(\rho(p))\, \lambda_i(\zeta)\, d\mu.$$
With the expectation parameter η, one may express the divergence functional $\mathcal{D}_{f,\rho}^{(\alpha)}(\eta_p, \eta_q)$ and obtain the corresponding metric and dual connection pair. The properties of the statistical manifold on $M_\eta$ are shown by the next proposition.
With the expectation parameter ´, one may express the divergence functional D .®/ and obtain the corresponding metric and dual connection f;½ .´p ; ´q / pair. The properties of the statistical manifold on M´ are shown by the next proposition. Proposition 10. The metric tensor gO ij and the dual connections 0O .®/ij;k , 0O ¤.®/ij;k induced by D .®/ f;½ .´p ; ´q / are related to those (expressed in lower induces) induced .®/
by D f;½ .µp ; µq / via X l
gil .µ / gO lm .´/ D ±im ;
(3.16)
and 0O .®/ij;k .´/ D ¡
X l;m;n
0O ¤.®/ij;k .´/ D ¡
.¡®/ gO im .´/ gO jn .´/ gO kl .´/0ml;n .µ /;
X l;m;n
.®/ gO im .´/ gO jn .´/ gO kl .´/0ml;n .µ /;
(3.17)
(3.18)
where ´ and µ are biorthogonal. Proof. The relation 3.16 follows proposition 9. To prove equation 3.17, we write out 0O .®/ij;k following proposition 8 (note that upper- and lower-case
here are pro forma):
$$\hat\Gamma^{(\alpha)ij,k} = E_\mu\left\{ \frac{1-\alpha}{2} \frac{\partial^2\tau}{\partial\eta_i\, \partial\eta_j} \frac{\partial\rho}{\partial\eta_k} + \frac{1+\alpha}{2} \frac{\partial^2\rho}{\partial\eta_i\, \partial\eta_j} \frac{\partial\tau}{\partial\eta_k} \right\}$$
$$= E_\mu\left\{ \frac{1-\alpha}{2} \left( \sum_l \frac{\partial\rho}{\partial\theta^l}\frac{\partial\theta^l}{\partial\eta_k} \right) \left( \sum_m \frac{\partial\theta^m}{\partial\eta_i} \frac{\partial}{\partial\theta^m}\left(\frac{\partial\tau}{\partial\eta_j}\right) \right) + \frac{1+\alpha}{2} \left( \sum_l \frac{\partial\tau}{\partial\theta^l}\frac{\partial\theta^l}{\partial\eta_k} \right) \left( \sum_m \frac{\partial\theta^m}{\partial\eta_i} \frac{\partial}{\partial\theta^m}\left(\frac{\partial\rho}{\partial\eta_j}\right) \right) \right\}$$
$$= \sum_{l,m} \frac{\partial\theta^l}{\partial\eta_k} \frac{\partial\theta^m}{\partial\eta_i}\, E_\mu\left\{ \frac{1-\alpha}{2} \frac{\partial\rho}{\partial\theta^l} \frac{\partial}{\partial\theta^m}\left( \sum_n \frac{\partial\tau}{\partial\theta^n} \frac{\partial\theta^n}{\partial\eta_j} \right) + \frac{1+\alpha}{2} \frac{\partial\tau}{\partial\theta^l} \frac{\partial}{\partial\theta^m}\left( \sum_n \frac{\partial\rho}{\partial\theta^n} \frac{\partial\theta^n}{\partial\eta_j} \right) \right\}$$
$$= \sum_{l,m,n} \frac{\partial\theta^l}{\partial\eta_k} \frac{\partial\theta^m}{\partial\eta_i} \frac{\partial\theta^n}{\partial\eta_j}\, E_\mu\left\{ \frac{1-\alpha}{2} \frac{\partial\rho}{\partial\theta^l} \frac{\partial^2\tau}{\partial\theta^m\, \partial\theta^n} + \frac{1+\alpha}{2} \frac{\partial\tau}{\partial\theta^l} \frac{\partial^2\rho}{\partial\theta^m\, \partial\theta^n} \right\} + \sum_{l,m,n} \frac{\partial\theta^l}{\partial\eta_k} \frac{\partial\theta^m}{\partial\eta_i}\, E_\mu\left\{ \frac{1-\alpha}{2} \frac{\partial\rho}{\partial\theta^l} \frac{\partial\tau}{\partial\theta^n} + \frac{1+\alpha}{2} \frac{\partial\tau}{\partial\theta^l} \frac{\partial\rho}{\partial\theta^n} \right\} \frac{\partial}{\partial\theta^m}\left(\frac{\partial\theta^n}{\partial\eta_j}\right)$$
$$= \sum_{l,m,n} \frac{\partial\theta^l}{\partial\eta_k} \frac{\partial\theta^m}{\partial\eta_i} \left( \frac{\partial\theta^n}{\partial\eta_j}\, \Gamma^{(\alpha)}_{mn,l} + \frac{\partial}{\partial\theta^m}\left(\frac{\partial\theta^n}{\partial\eta_j}\right) g_{nl} \right).$$
Since
$$\sum_n \frac{\partial}{\partial\theta^m}\left(\frac{\partial\theta^n}{\partial\eta_j}\right) g_{nl} = \sum_n \frac{\partial \hat g^{jn}}{\partial\theta^m}\, g_{nl} = -\sum_n \hat g^{jn}\, \frac{\partial g_{nl}}{\partial\theta^m} = -\sum_n \hat g^{jn} \left( \Gamma^{(\alpha)}_{mn,l} + \Gamma^{(-\alpha)}_{ml,n} \right),$$
where the last step is from
$$\frac{\partial g_{nl}}{\partial\theta^m} = \Gamma^{(\alpha)}_{mn,l} + \Gamma^{*(\alpha)}_{ml,n} = \Gamma^{(\alpha)}_{mn,l} + \Gamma^{(-\alpha)}_{ml,n},$$
assertion 3.17 is proved after direct substitution. Observing the duality $\hat\Gamma^{*(\alpha)ij,k} = \hat\Gamma^{(-\alpha)ij,k}$ leads to equation 3.18. ∎

Remark 3.4.2. The relation between $g$ and $\Gamma$ in their subscript and superscript forms is analogous to that stated by proposition 5. However, note the
conjugacy of α in the correspondence $\hat\Gamma^{(\alpha)ij,k} \leftrightarrow \Gamma^{(-\alpha)}_{ml,n}$, due to the change between θ- and η-coordinates, both under the ρ-representation. On the other hand, similar to corollary 3, the metric $\bar g^{ij}$ and the dual affine connections $\bar\Gamma^{(\alpha)ij,k}, \bar\Gamma^{*(\alpha)ij,k}$ of the statistical manifold (denoted using a bar) induced by the conjugate divergence functions $\mathcal{D}_{f^*,\tau}^{(\alpha)}(\eta_p, \eta_q)$ are related to those (denoted using a hat) induced by $\mathcal{D}_{f,\rho}^{(\alpha)}(\eta_p, \eta_q)$ via
$$\bar g^{ij}(\eta) = \hat g^{ij}(\eta), \qquad \bar\Gamma^{(\alpha)ij,k}(\eta) = \hat\Gamma^{(-\alpha)ij,k}(\eta), \qquad \bar\Gamma^{*(\alpha)ij,k}(\eta) = \hat\Gamma^{(\alpha)ij,k}(\eta).$$
3.5 Divergence Functional from Generalized Mean. When $f$ is, in addition to being strictly convex, strictly monotone increasing, we may set $\rho = f^{-1}$, so that the divergence functional becomes
$$\mathcal{D}_\rho^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2} \int \left( \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - \rho^{-1}\left( \frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q) \right) \right) d\mu. \tag{3.19}$$
Note that for $\alpha \in [-1, 1]$,
$$M_\rho^{(\alpha)}(p, q) \equiv \rho^{-1}\left( \frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q) \right)$$
defines a generalized mean ("quasi-linear mean" in Hardy, Littlewood, & Pólya, 1952) associated with a concave and monotone function $\rho: \mathbb{R}_+ \to \mathbb{R}$. Viewed in this way, the divergence is related to the departure of the linear (arithmetic) mean from a quasi-linear mean induced by a nonlinear function with nonzero concavity/convexity.

Example 3.5.1. Take $\rho(p) = \log p$; then $M_\rho^{(\alpha)}(p, q) = p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}$, and $\mathcal{D}_\rho^{(\alpha)}(p, q)$ is the α-divergence (see equation 1.2). For a general concave ρ,
$$\mathcal{D}_\rho^{(1)}(p, q) = \int \left( p - q - (\rho^{-1})'(\rho(q))\, (\rho(p) - \rho(q)) \right) d\mu = \mathcal{D}_\rho^{(-1)}(q, p)$$
is an immediate generalization of the extended Kullback-Leibler divergence in equation 1.1.

To further explore the divergence functionals associated with the quasi-linear mean operator, we impose a homogeneity requirement, such that the
divergence is invariant after scaling ($\kappa \in \mathbb{R}_+$):
$$\mathcal{D}_\rho^{(\alpha)}(\kappa p, \kappa q) = \kappa\, \mathcal{D}_\rho^{(\alpha)}(p, q).$$

Proposition 11. The only measure-invariant divergence functional associated with the quasi-linear mean operator $M_\rho^{(\alpha)}$ is a two-parameter family,
$$\mathcal{D}^{(\alpha,\beta)}(p, q) \equiv \frac{4}{1-\alpha^2} \frac{2}{1+\beta} \int \left( \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - \left( \frac{1-\alpha}{2}\, p^{\frac{1-\beta}{2}} + \frac{1+\alpha}{2}\, q^{\frac{1-\beta}{2}} \right)^{\frac{2}{1-\beta}} \right) d\mu, \tag{3.20}$$
which results from the alpha-representation (indexed by β here), $\rho(p) = l^{(\beta)}(p)$, as given by equation 1.8. Here $(\alpha, \beta) \in [-1, 1] \times [-1, 1]$, and the factor $2/(1+\beta)$ is introduced to make $\mathcal{D}^{(\alpha,\beta)}(p, q)$ well defined for $\beta = -1$.

Proof. The homogeneity requirement implies that
$$\rho^{-1}\left( \frac{1-\alpha}{2}\rho(\kappa p) + \frac{1+\alpha}{2}\rho(\kappa q) \right) = \kappa\, \rho^{-1}\left( \frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q) \right).$$
By a lemma in Hardy et al. (1952, p. 68), the general solution to the above functional equation is
$$\rho(t) = \begin{cases} a t^s + b & s \ne 0 \\ a \log t + b & s = 0, \end{cases}$$
with corresponding
$$M_s^{(\alpha)}(p, q) = \left( \frac{1-\alpha}{2}\, p^s + \frac{1+\alpha}{2}\, q^s \right)^{1/s}, \qquad M_0^{(\alpha)}(p, q) = p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}.$$
Here $a, b, s$ are all constants. Strict concavity of ρ requires $0 \le s \le 1$ and $a > 0$. Since it is easily verified that $\mathcal{D}_\rho^{(\alpha)} = \mathcal{D}_{a\rho+b}^{(\alpha)}$, without loss of generality we have $\rho(p) = l^{(\beta)}(p)$, $\beta \in [-1, 1]$, where $s = \frac{1-\beta}{2}$. This gives rise to equation 3.20. ∎
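A numerical sketch of the quasi-linear-mean construction (our illustration; the densities and grid are arbitrary choices, not from the article): with $\rho = \log$, the divergence of equation 3.19 should coincide with the α-divergence built from the geometric mean $M_0^{(\alpha)}(p, q) = p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}$:

```python
import numpy as np

# Discretized sample space and two illustrative Gaussian densities.
z = np.linspace(-6, 6, 4001)
dz = z[1] - z[0]
p = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
q = np.exp(-0.5 * (z - 0.7)**2 / 1.21) / np.sqrt(2 * np.pi * 1.21)

def D_rho(p, q, a, rho, rho_inv):
    """Equation 3.19: arithmetic mean minus the quasi-linear mean."""
    m = rho_inv((1 - a) / 2 * rho(p) + (1 + a) / 2 * rho(q))
    return 4 / (1 - a**2) * np.sum((1 - a) / 2 * p + (1 + a) / 2 * q - m) * dz

def alpha_div(p, q, a):
    """Alpha-divergence with the geometric mean M_0^{(alpha)}."""
    return 4 / (1 - a**2) * np.sum(
        (1 - a) / 2 * p + (1 + a) / 2 * q
        - p**((1 - a) / 2) * q**((1 + a) / 2)) * dz

for a in (-0.5, 0.0, 0.5):
    assert abs(D_rho(p, q, a, np.log, np.exp) - alpha_div(p, q, a)) < 1e-9
    assert D_rho(p, q, a, np.log, np.exp) > 0   # AM-GM nonnegativity
```

The positivity check reflects the arithmetic-geometric mean inequality applied pointwise, as described in the text.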
Proposition 12 (corollary to proposition 7). The two-parameter family of divergence functions $\mathcal{D}^{(\alpha,\beta)}(\theta_p, \theta_q)$ induces a statistical manifold with the Fisher information as its metric and the generic alpha-connections as its dual connection pair,
$$g_{ij} = E_p \left\{ \frac{\partial\log p}{\partial\theta^i} \frac{\partial\log p}{\partial\theta^j} \right\},$$
$$\Gamma^{(\alpha,\beta)}_{ij,k} = E_p \left\{ \frac{\partial^2\log p}{\partial\theta^i\, \partial\theta^j} \frac{\partial\log p}{\partial\theta^k} + \frac{1-\alpha\beta}{2} \frac{\partial\log p}{\partial\theta^i} \frac{\partial\log p}{\partial\theta^j} \frac{\partial\log p}{\partial\theta^k} \right\},$$
$$\Gamma^{*(\alpha,\beta)}_{ij,k} = E_p \left\{ \frac{\partial^2\log p}{\partial\theta^i\, \partial\theta^j} \frac{\partial\log p}{\partial\theta^k} + \frac{1+\alpha\beta}{2} \frac{\partial\log p}{\partial\theta^i} \frac{\partial\log p}{\partial\theta^j} \frac{\partial\log p}{\partial\theta^k} \right\}.$$

Proof. Applying formulas 3.6 to 3.8 to the measure-invariant divergence functional $\mathcal{D}_\rho^{(\alpha)}(p, q)$ with $\rho(p) = \log p$ and $f = \rho^{-1}$ gives rise to the desired result. ∎
Remark 3.5.2. This two-parameter family of affine connections $\Gamma^{(\alpha,\beta)}_{ij,k}$, indexed now by the numerical product $\alpha\beta \in [-1, 1]$, is actually in the generic form of an alpha-connection,
$$\Gamma^{(\alpha,\beta)}_{ij,k} = \Gamma^{(-\alpha,-\beta)}_{ij,k},$$
with biduality compactly expressed as
$$\Gamma^{*(\alpha,\beta)}_{ij,k} = \Gamma^{(-\alpha,\beta)}_{ij,k} = \Gamma^{(\alpha,-\beta)}_{ij,k}. \tag{3.21}$$
The parameters $\alpha \in [-1, 1]$ and $\beta \in [-1, 1]$ reflect referential duality and representational duality, respectively. Among this two-parameter family, the Levi-Civita connection results when either α or β equals 0. When $\alpha = \pm 1$ or $\beta = \pm 1$, each case reduces to the one-parameter version of the generic alpha-connection. The family $\mathcal{D}^{(\alpha,\beta)}$ is then a generalization of Amari's alpha-divergence, equation 1.2, with
$$\lim_{\alpha\to -1} \mathcal{D}^{(\alpha,\beta)}(p, q) = A^{(-\beta)}(p, q), \qquad \lim_{\alpha\to 1} \mathcal{D}^{(\alpha,\beta)}(p, q) = A^{(\beta)}(p, q), \qquad \lim_{\beta\to 1} \mathcal{D}^{(\alpha,\beta)}(p, q) = A^{(\alpha)}(p, q),$$
where the last equation is due to $\lim_{\beta\to 1} M_s^{(\alpha)} = M_0^{(\alpha)} = p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}$. On the other hand, when $\beta \to -1$, we have the interesting asymptotic relation
$$\lim_{\beta\to -1} \mathcal{D}^{(\alpha,\beta)}(p, q) = E^{(\alpha)}(p, q),$$
where $E^{(\alpha)}$ is the Jensen difference, equation 2.5, discussed by Rao (1987).

3.6 Parametric Family of Csiszár's f-Divergence. The fact (see proposition 12) that our two-parameter family of divergence functions $\mathcal{D}^{(\alpha,\beta)}$ actually induces a one-dimensional family of alpha-connections is by no means surprising. This is because $\mathcal{D}^{(\alpha,\beta)}$ obviously falls within Csiszár's
f-divergence (see equation 1.3), the generic form for measure-invariant divergences, where
$$f^{(\alpha,\beta)}(t) = \frac{8}{(1-\alpha^2)(1+\beta)} \left( \frac{1-\alpha}{2} + \frac{1+\alpha}{2}\, t - \left( \frac{1-\alpha}{2} + \frac{1+\alpha}{2}\, t^{\frac{1-\beta}{2}} \right)^{\frac{2}{1-\beta}} \right) \tag{3.22}$$
is now a two-parameter family with $f^{(\alpha,\beta)}(1) = 0$, $(f^{(\alpha,\beta)})'(1) = 0$, $(f^{(\alpha,\beta)})''(1) = 1$. That the alpha index is given by the product αβ in this case follows explicitly from calculating $(f^{(\alpha,\beta)})'''(1)$ using equation 1.5. What is interesting in this regard is the distinct roles played by α (for referential duality) and by β (for representational duality). The parameters $(\alpha, \beta) \in [0, 1] \times [0, 1]$ form an interesting topological structure of a Moebius band in the space of divergence functions, all with identical Fisher information and family of alpha-connections.

We may generalize Csiszár's f-divergence to construct a family of measure-invariant divergence functionals in the following way. Given a smooth, strictly convex function $f(t)$, construct the family (for $\gamma \in \mathbb{R}$)
$$G_f^{(\gamma)}(t) = \frac{4}{1-\gamma^2} \left( \frac{1-\gamma}{2}\, f(1) + \frac{1+\gamma}{2}\, f(t) - f\left( \frac{1-\gamma}{2} + \frac{1+\gamma}{2}\, t \right) \right),$$
with $G_f^{(-1)}(t) = g(t)$ as given in equation 1.6. It is easy to verify that for an arbitrary γ, $G_f^{(\gamma)}$ is a proper Csiszár function with $G_f^{(\gamma)}(1) = 0$, $(G_f^{(\gamma)})'(1) = 0$, and that
$$(G_f^{(\gamma)})''(1) = f''(1), \qquad (G_f^{(\gamma)})'''(1) = \frac{\gamma+3}{2}\, f'''(1),$$
so the statistical manifold generated by $G_f^{(\gamma)}$ has the same metric as that generated by $f$ but a family of parameterized alpha-connections. If we take $f(t) = f^{(\alpha)}(t)$ as in equation 1.4, then $G^{(\gamma,\alpha)}$ will generate a two-parameter family of alpha-connections with the effective alpha value $3 + (\alpha - 3)(\gamma + 3)/2$. We note in passing that repeating this process, by having $G^{(\gamma,\alpha)}$ now take the role of $f$, may lead to nested (e.g., two-, three-parameter) families of alpha-connections.
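The stated properties of $G_f^{(\gamma)}$ are easy to confirm by finite differences. The sketch below uses the illustrative Csiszár function $f(t) = t\log t$ (our choice), for which $f''(1) = 1$ and $f'''(1) = -1$:

```python
import numpy as np

def G(t, gamma, f):
    """G_f^{(gamma)}(t) from section 3.6 (illustrative implementation)."""
    c, s = (1 - gamma) / 2, (1 + gamma) / 2
    return 4 / (1 - gamma**2) * (c * f(1.0) + s * f(t) - f(c + s * t))

f = lambda t: t * np.log(t)       # a standard Csiszar function
h = 1e-3                          # finite-difference step
for gamma in (-0.5, 0.2, 0.7):
    g = lambda t: G(t, gamma, f)
    d1 = (g(1 + h) - g(1 - h)) / (2 * h)
    d2 = (g(1 + h) - 2 * g(1.0) + g(1 - h)) / h**2
    d3 = (g(1 + 2*h) - 2*g(1 + h) + 2*g(1 - h) - g(1 - 2*h)) / (2 * h**3)
    assert abs(g(1.0)) < 1e-10                         # G(1) = 0
    assert abs(d1) < 1e-6                              # G'(1) = 0
    assert abs(d2 - 1.0) < 1e-4                        # G''(1) = f''(1) = 1
    assert abs(d3 - (gamma + 3) / 2 * (-1.0)) < 1e-2   # G'''(1) = (gamma+3)/2 f'''(1)
```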
4 General Discussion

This article introduced several families of divergence functions and functionals, all based on the fundamental inequality of an arbitrary smooth and strictly convex function. In the finite-dimensional case, the convex mixture parameter α, which reflects referential duality, turns out to correspond to the α parameter in the one-parameter family of α-connections in the sense of Lauritzen (1987), which includes the flat connections ($\alpha = \pm 1$) induced by the Bregman divergence. The biorthogonal coordinates related to the inducing convex function and its conjugate (Amari's dual potentials) reflect representational duality. In the infinite-dimensional case, with the notion of conjugate (i.e., ρ- and τ-) embeddings of density functions, the form of the constructed divergence functionals generalizes the familiar ones (α-divergence and f-divergence). The resulting α-connections, equation 3.7, or equivalently equation 3.10, have the most general yet explicit form found in the literature. When densities are ρ-affine, they specialize to α-connections in the finite-dimensional vector space mentioned above. When measure invariance is imposed, they specialize to the family of alpha-connections proper, but now with two parameters: one reflecting referential duality and the other representational duality.

These findings will enrich the theory of information geometry and make it applicable to finite-dimensional vector spaces (not necessarily of parameters of probability densities) as well as to infinite-dimensional functional spaces (not necessarily of normalized density functions). In terms of neural computation, to the extent that alpha-divergence and alpha-connections generate deep analytic insights (e.g., Amari, Ikeda, & Shimokawa, 2001; Takeuchi & Amari, submitted), these theoretical results may help facilitate those analyses by clarifying the meaning of duality in projection-based algorithms.

Previously, alpha-divergence, in its extended form (see equation 1.2), was shown (Amari & Nagaoka, 2000) to be the canonical divergence for the α-affine family of densities (densities that, under the α-representation $l^{(\alpha)}$, are spanned by an affine subspace). Therefore, for a given α value, there is only one such family that induces the flat α-connection, with all components zero when expressed in suitable coordinates (as special cases, $\Gamma^{(1)} = 0$ for the exponential family and $\Gamma^{(-1)} = 0$ for the mixture family). This is slightly different from the view of Zhu and Rohwer (1995, 1997), who, in their Bayesian inference framework, simply treated α as a parameter in the entire class of α-divergences (between any two densities), which yields, through the Eguchi relation, flat connections only when $\alpha = \pm 1$. These apparently disparate interpretations, despite being subtly so, have now been straightened out. The current framework points out two related but different senses of duality in information geometry: representational duality and referential duality. Further, it has been clarified how the same one-parameter family of dual alpha-connections may actually embody both kinds of duality. Future research will illuminate how this notion of biduality in characterizing the asymmetric difference of two density functions or two parameters may have captured the very essence of computational algorithms of inference, optimization, and adaptation.
Acknowledgments

I thank Huaiyu Zhu for introducing the topic of divergence functions and information geometry. Part of this work was presented at the International Conference on Information Geometry and Its Applications, Pescara, 2002, where I benefited from direct feedback and extensive discussions with many conference participants, including S. Amari and S. Eguchi in particular. Discussions with Matt Jones at the University of Michigan have helped improve the presentation.
References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Amari, S. (1982). Differential geometry of curved exponential families: Curvatures and information loss. Annals of Statistics, 10, 357–385.
Amari, S. (1985). Differential geometric methods in statistics. New York: Springer-Verlag.
Amari, S. (1991). Dualistic geometry of the manifold of higher-order neurons. Neural Networks, 4, 443–451.
Amari, S. (1995). Information geometry of the EM and em algorithm for neural networks. Neural Networks, 8, 1379–1408.
Amari, S., Kurata, K., & Nagaoka, H. (1992). Information geometry of Boltzmann machines. IEEE Transactions on Neural Networks, 3, 260–271.
Amari, S., Ikeda, S., & Shimokawa, H. (2001). Information geometry and mean field approximation: The α-projection approach. In M. Opper & D. Saad (Eds.), Advanced mean field methods: Theory and practice (pp. 241–257). Cambridge, MA: MIT Press.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Bauschke, H. H., Borwein, J. M., & Combettes, P. L. (2002). Bregman monotone optimization algorithms. CECM Preprint 02:184. Available on-line: http://www.cecm.sfu.ca/preprints/2002pp.html.
Bauschke, H. H., & Combettes, P. L. (2002). Iterating Bregman retractions. CECM Preprint 02:186. Available on-line: http://www.cecm.sfu.ca/preprints/2002pp.html.
Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Physics, 7, 200–217.
Chentsov, N. N. (1982). Statistical decision rules and optimal inference. Providence, RI: AMS.
Csiszár, I. (1967). On topological properties of f-divergences. Studia Scientiarum Mathematicarum Hungarica, 2, 329–339.
Della Pietra, S., Della Pietra, V., & Lafferty, J. (2002).Duality and auxiliary functions for Bregman distances (Tech. Rep. No. CMU-CS-01-109).Pittsburgh, PA: School of Computer Science, Carnegie Mellon University. Eguchi, S. (1983). Second order efciency of minimum contrast estimators in a curved exponential family. Annals of Statistics, 11, 793–803. Eguchi, S. (1992). Geometry of minimum contrast. Hiroshima Mathematical Journal, 22, 631–647. Eguchi, S. (2002). U-boosting method for classication and information geometry. Paper presented at the SRCCS International Statistical Workshop, Seoul National University, June. Hardy, G., Littlewood, J. E., & Polya, ´ G. (1952). Inequalities. Cambridge: Cambridge University Press. Ikeda, S., Amari, S., & Nakahara, H. (1999) Convergence of the wake-sleep algorithm. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 239–245). Cambridge, MA: MIT Press. Kaas, R. E., & Vos, P. W. (1997). Geometric foundation of asymptotic inference. New York: Wiley. Kurose, T. (1994). On the divergences of 1-conformally at statistical manifolds. T¨ohoko Mathematical Journal, 46, 427–433. Lafferty, J., Della Pietra, S., & Della Pietra, V. (1997). Statistical learning algorithms based on Bregman distances. In Proceedings of 1997 Canadian Workshop on Information Theory, pp. 77–80. Toronto, Canada: Fields Institute. Lauritzen, S. (1987). Statistical manifolds. In S. Amari, O. Barndorff-Nielsen, R. Kass, S. Lauritzen, and C. R. Rao (Eds.), Differential geometry in statistical inference (pp. 163–216). Hayward, CA: Institute of Mathematical Statistics. Lebanon, G., & Lafferty, J. (2002). Boosting and maximum likelihood for exponential models. In T. G. Dietterich, S. Becker, & Z. Ghahramani (eds.) Advances in neural information processingsystems, 14 (pp. 447–454). Cambridge, MA: MIT Press. Matsuzoe, H. (1998). On realization of conformally-projectively at statistical manifolds and the divergences. 
Hokkaido Mathematical Journal, 27, 409–421. Matsuzoe, H. (1999). Geometry of contrast functions and conformal geometry. Hiroshima Mathematical Journal, 29, 175–191. Matumoto, T. (1993). Any statistical manifold has a contrast function—On the C3 -functions taking the minimum at the diagonal of the product manifold. Hiroshima Mathematical Journal, 23, 327–332. Mihoko, M., & Eguchi, S. (2002). Robust blink source separation by betadivergence. Neural Computation, 14, 1859–1886. Rao, C. R. (1987). Differential metrics in probability spaces. In S. Amari, O. Barndorff-Nielsen, R. Kass, S. Lauritzen, & C. R. Rao (Eds.), Differential geometry in statistical inference. (pp. 217–240). Hayward, CA: Institute of Mathematical Statistics. Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ: Princeton University Press. Shima, H. (1978). Compact locally Hessian manifolds. Osaka Journal of Mathematics, 15, 509–513.
Divergence Function, Duality, and Convex Analysis
195
Received October 29, 2002; accepted June 26, 2003.
LETTER
Communicated by Brendan Frey
Linear Response Algorithms for Approximate Inference in Graphical Models

Max Welling
[email protected] Department of Computer Science, University of Toronto, Toronto M5S 3G4 Canada
Yee Whye Teh
[email protected] Computer Science Division, University of California at Berkeley, Berkeley, CA 94720, U.S.A.
Belief propagation (BP) on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. However, it does not prescribe a way to compute joint distributions over pairs of distant nodes in the graph. In this article, we propose two new algorithms for approximating these pairwise probabilities, based on the linear response theorem. The first is a propagation algorithm that is shown to converge if BP converges to a stable fixed point. The second algorithm is based on matrix inversion. Applying these ideas to gaussian random fields, we derive a propagation algorithm for computing the inverse of a matrix.

1 Introduction

Like Markov chain Monte Carlo sampling and variational methods, belief propagation (BP) has become an important tool for approximate inference on graphs with cycles. Especially in the field of error correction decoding, it has brought performance very close to the Shannon limit (Frey & MacKay, 1997). A number of studies of BP have gradually increased our understanding of the convergence properties and accuracy of the algorithm (Weiss & Freeman, 2001; Weiss, 2000). In particular, recent developments show that the stable fixed points are local minima of the Bethe free energy (Yedidia, Freeman, & Weiss, 2000; Heskes, in press). This insight paved the way for more sophisticated generalized BP algorithms (Yedidia, Freeman, & Weiss, 2002) and convergent alternatives to BP (Yuille, 2002; Teh & Welling, 2001). Other developments include the expectation propagation algorithm designed to propagate sufficient statistics of members of the exponential family (Minka, 2001).

Despite its success, BP does not provide a prescription to compute joint probabilities over pairs of nonneighboring nodes in the graph. When the

Neural Computation 16, 197–221 (2004)
© 2003 Massachusetts Institute of Technology
graph is a tree, there is a single chain connecting any two nodes, and dynamic programming can be used to integrate out the internal variables efficiently. However, when cycles exist, it is not clear what approximate procedure is appropriate. It is precisely this problem that we address in this article. We show that the required estimates can be obtained by computing the sensitivity of the node marginals to small changes in the node potentials. Based on this idea, we present two algorithms to estimate the joint probabilities of arbitrary pairs of nodes.

These results are interesting in the inference domain and may have future applications to learning graphical models from data. For instance, information about dependencies between random variables is relevant for learning the structure of a graph and the parameters encoding the interactions. Another possible application area is active learning. Since the node potentials encode the external evidence flowing into the network and since we compute the sensitivity of the marginal distributions to changing this external evidence, this information can be used to search for good nodes to collect additional data. For instance, nodes that have a big impact on the system seem to be good candidates.

The letter is organized as follows. Factor graphs are introduced in section 2. Section 3 reviews the Gibbs free energy and two popular approximations: the mean field and Bethe approximations. In section 4, we explain the ideas behind the linear response estimates of pairwise probabilities and prove a number of useful properties that they satisfy. We derive an algorithm to compute the linear response estimates by propagating "supermessages" around the graph in section 5; section 6 describes an alternative method based on inverting a matrix. Section 7 describes an application of linear response theory to gaussian networks that gives a novel algorithm to invert matrices. In experiments (section 8), we compare the accuracy of the new estimates against other methods. We conclude with a discussion of our work in section 9.

2 Factor Graphs

Let $V$ index a collection of random variables $\{X_i\}_{i \in V}$. Let $x_i$ denote values of $X_i$. For a subset of nodes $\alpha \subset V$, let $X_\alpha = \{X_i\}_{i \in \alpha}$ be the variable associated with that subset and $x_\alpha$ be values of $X_\alpha$. Let $A$ be a family of such subsets of $V$. The probability distribution over $X := X_V$ is assumed to have the following form,

$$P_X(X = x) = \frac{1}{Z} \prod_{\alpha \in A} \psi_\alpha(x_\alpha) \prod_{i \in V} \psi_i(x_i),$$  (2.1)

where $\psi_\alpha, \psi_i$ are positive potential functions defined on subsets and single nodes, respectively. $Z$ is the normalization constant (or partition function),
Figure 1: Example of a factor graph. [Variable nodes 1–4 connect to factor nodes $\alpha$, $\beta$, $\gamma$, with $P \propto \psi_\alpha(1,2)\,\psi_\beta(2,3)\,\psi_\gamma(1,3,4)\,\psi_1(1)\,\psi_2(2)\,\psi_3(3)\,\psi_4(4)$.]
given by

$$Z = \sum_x \prod_{\alpha \in A} \psi_\alpha(x_\alpha) \prod_{i \in V} \psi_i(x_i),$$  (2.2)
where the sum runs over all possible states $x$ of $X$. In the following, we will write $P(x) := P_X(X = x)$ for notational simplicity.

The decomposition of equation 2.1 is consistent with a factor graph with function nodes over $X_\alpha$ and variable nodes $X_i$. Figure 1 shows an example. Neighbors in a factor graph are defined as nodes that are connected by an edge (e.g., subset $\alpha$ and variable 2 are neighbors in Figure 1). For each $i \in V$, denote its neighbors by $N_i = \{\alpha \in A : \alpha \ni i\}$, and for each subset $\alpha$, its neighbors are simply $N_\alpha = \{i \in V : i \in \alpha\}$.

Factor graphs are a convenient representation for structured probabilistic models and subsume undirected graphical models and acyclic directed graphical models (Kschischang, Frey, & Loeliger, 2001). Further, there is a simple message-passing algorithm for approximate inference that generalizes the BP algorithms on both undirected and acyclic directed graphical models. For that reason, we will state the results of this article in the language of factor graphs.

3 The Gibbs Free Energy

Let $B(x)$ be a variational probability distribution, and let $b_\alpha, b_i$ be its marginal distributions over $\alpha \in A$ and $i \in V$, respectively. Consider minimizing the following objective, called the Gibbs free energy,

$$G(B) = -\sum_\alpha \sum_{x_\alpha} b_\alpha(x_\alpha) \log \psi_\alpha(x_\alpha) - \sum_i \sum_{x_i} b_i(x_i) \log \psi_i(x_i) - H(B),$$  (3.1)

where $H(B)$ is the entropy of $B(x)$,

$$H(B) = -\sum_x B(x) \log B(x).$$  (3.2)
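The quantities in equations 2.1 through 3.2 can be checked by brute-force enumeration on a small model. The following sketch is ours, not from the article: a hypothetical three-variable binary model with two pairwise factors, on which we compute $Z$, the exact joint, and the Gibbs free energy of a candidate distribution $B$ (evaluated on the full joint, where the factor marginals of $B$ coincide with $B$ itself):

```python
import itertools
import math

# Toy model (ours): three binary variables, factor subsets A = {(0,1), (1,2)}.
psi_factor = {
    (0, 1): lambda xa: math.exp(0.8 if xa[0] == xa[1] else -0.8),
    (1, 2): lambda xa: math.exp(0.5 if xa[0] == xa[1] else -0.5),
}
psi_node = [lambda x: math.exp(0.3 * x), lambda x: 1.0, lambda x: math.exp(-0.2 * x)]

states = list(itertools.product([0, 1], repeat=3))

def unnorm(x):
    """Product of factor and node potentials: equation 2.1 without the 1/Z."""
    p = 1.0
    for a, psi in psi_factor.items():
        p *= psi(tuple(x[i] for i in a))
    for i, psi in enumerate(psi_node):
        p *= psi(x[i])
    return p

Z = sum(unnorm(x) for x in states)        # equation 2.2
P = {x: unnorm(x) / Z for x in states}    # exact joint distribution

def gibbs_free_energy(B):
    """Equation 3.1 evaluated on a full joint B: -E_B[log potentials] - H(B)."""
    G = 0.0
    for x, b in B.items():
        if b > 0:
            G += b * math.log(b) - b * math.log(unnorm(x))
    return G

# Perturb P to get another normalized distribution Q.
Q = {x: P[x] + 0.05 for x in states}
s = sum(Q.values())
Q = {x: q / s for x, q in Q.items()}

assert abs(gibbs_free_energy(P) + math.log(Z)) < 1e-9   # G(P) = -log Z
assert gibbs_free_energy(P) < gibbs_free_energy(Q)      # P minimizes G
```

The two assertions illustrate the variational principle the text states next: $G$ attains its minimum, $-\log Z$, exactly at $B = P$.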
It is easy to show that the Gibbs free energy is precisely minimized at $B(x) = P(x)$. In the following, we will use this variational formulation to describe two types of approximations: the mean field and the Bethe approximations.

3.1 The Mean Field Approximation. The mean field approximation uses a restricted set of variational distributions: those that assume independence between all variables $x_i$: $B^{MF}(x) := \prod_i b_i^{MF}(x_i)$. Plugging this into the Gibbs free energy, we get

$$G^{MF}(\{b_i^{MF}\}) = -\sum_\alpha \sum_{x_\alpha} \Bigl( \prod_{i \in \alpha} b_i^{MF}(x_i) \Bigr) \log \psi_\alpha(x_\alpha) - \sum_i \sum_{x_i} b_i^{MF}(x_i) \log \psi_i(x_i) - H^{MF}(\{b_i^{MF}\}),$$  (3.3)

where $H^{MF}$ is the mean field entropy:

$$H^{MF}(\{b_i^{MF}\}) = -\sum_i \sum_{x_i} b_i^{MF}(x_i) \log b_i^{MF}(x_i).$$  (3.4)
Minimizing this with respect to $b_i^{MF}(x_i)$ (holding the remaining marginal distributions fixed), we derive the following update equation,

$$b_i^{MF}(x_i) \leftarrow \frac{1}{\gamma_i}\, \psi_i(x_i) \exp \Biggl( \sum_{\alpha \in N_i} \sum_{x_{\alpha \setminus i}} \log \psi_\alpha(x_\alpha) \prod_{j \in \alpha \setminus i} b_j^{MF}(x_j) \Biggr),$$  (3.5)
where $\gamma_i$ is a normalization constant. Sequential updates that replace each $b_i^{MF}(x_i)$ by the right-hand side of equation 3.5 are a form of coordinate descent on the mean field–Gibbs free energy, which implies that they are guaranteed to converge to a local minimum.

3.2 The Bethe Approximation: Belief Propagation. The mean field approximation ignores all dependencies between the random variables and therefore overestimates the entropy of the model. To obtain a more accurate approximation, we sum the entropies of the subsets $\alpha \in A$ and the nodes $i \in V$. However, this overcounts the entropies on the overlaps of the subsets $\alpha \in A$, which we therefore subtract off as follows,

$$H^{BP}(\{b_\alpha^{BP}, b_i^{BP}\}) = -\sum_\alpha \sum_{x_\alpha} b_\alpha^{BP}(x_\alpha) \log b_\alpha^{BP}(x_\alpha) - \sum_i \sum_{x_i} c_i\, b_i^{BP}(x_i) \log b_i^{BP}(x_i),$$  (3.6)
where the overcounting numbers are $c_i = 1 - |N_i|$. The resulting Gibbs free energy is thus given by (Yedidia et al., 2000)

$$G^{BP}(\{b_i^{BP}, b_\alpha^{BP}\}) = -\sum_\alpha \sum_{x_\alpha} b_\alpha^{BP}(x_\alpha) \log \psi_\alpha(x_\alpha) - \sum_i \sum_{x_i} b_i^{BP}(x_i) \log \psi_i(x_i) - H^{BP}(\{b_i^{BP}, b_\alpha^{BP}\}),$$  (3.7)
where the following local constraints need to be imposed,^1

$$\sum_{x_{\alpha \setminus i}} b_\alpha^{BP}(x_\alpha) = b_i^{BP}(x_i) \qquad \forall \alpha \in A,\ i \in \alpha,\ x_i,$$  (3.8)
in addition to the constraints that all marginal distributions should be normalized. Yedidia et al. (2000) showed that this constrained minimization problem may be solved by propagating messages over the links of the graph. Since the graph is bipartite, we need only to introduce messages from factor nodes to variable nodes $m_{\alpha i}(x_i)$ and messages from variable nodes to factor nodes $n_{i\alpha}(x_i)$. The following fixed-point equations can now be derived that solve for a local minimum of the BP-Gibbs free energy,

$$n_{i\alpha}(x_i) \leftarrow \psi_i(x_i) \prod_{\beta \in N_i \setminus \alpha} m_{\beta i}(x_i)$$  (3.9)

$$m_{\alpha i}(x_i) \leftarrow \sum_{x_{\alpha \setminus i}} \psi_\alpha(x_\alpha) \prod_{j \in N_\alpha \setminus i} n_{j\alpha}(x_j).$$  (3.10)
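For a concrete sense of these updates, here is a minimal sketch of equations 3.9 and 3.10 on a three-node binary chain with pairwise factors; the model, couplings, and all names are ours, not from the article. Node beliefs are read out from the converged messages as in the marginal equations that follow, and since a chain is a tree, they match brute-force marginals exactly:

```python
import itertools
import math

# Hypothetical model (ours): binary chain 0 - 1 - 2 with edge factors.
V = [0, 1, 2]
A = [(0, 1), (1, 2)]

def psi_a(a, xa):                      # factor potentials
    J = {(0, 1): 0.8, (1, 2): -0.5}[a]
    return math.exp(J if xa[0] == xa[1] else -J)

def psi_i(i, xi):                      # node potentials
    return math.exp([0.3, 0.0, -0.2][i] * xi)

# Messages m[(a, i)] (factor to node) and n[(i, a)] (node to factor).
m = {(a, i): [1.0, 1.0] for a in A for i in a}
n = {(i, a): [1.0, 1.0] for a in A for i in a}

for _ in range(50):                    # iterate to a fixed point
    for a in A:
        for i in a:
            # Equation 3.9: n_{ia}(x_i) <- psi_i(x_i) prod_{b in N_i \ a} m_{bi}(x_i).
            for x in (0, 1):
                n[(i, a)][x] = psi_i(i, x) * math.prod(
                    m[(b, i)][x] for b in A if i in b and b != a)
    for a in A:
        for i in a:
            # Equation 3.10: sum out the other variable j of the edge factor.
            j = a[0] if a[1] == i else a[1]
            for x in (0, 1):
                m[(a, i)][x] = sum(
                    psi_a(a, (x, xj) if a[0] == i else (xj, x)) * n[(j, a)][xj]
                    for xj in (0, 1))

def belief(i):
    """Node belief, normalized: psi_i times all incoming factor messages."""
    b = [psi_i(i, x) * math.prod(m[(a, i)][x] for a in A if i in a) for x in (0, 1)]
    s = sum(b)
    return [v / s for v in b]

# On a tree, BP is exact: compare against brute-force marginals.
def joint(x):
    p = math.prod(psi_a(a, (x[a[0]], x[a[1]])) for a in A)
    return p * math.prod(psi_i(i, x[i]) for i in V)

Z = sum(joint(x) for x in itertools.product((0, 1), repeat=3))
for i in V:
    exact = sum(joint(x) for x in itertools.product((0, 1), repeat=3)
                if x[i] == 1) / Z
    assert abs(belief(i)[1] - exact) < 1e-8
```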
Finally, marginal distributions over factor nodes and variable nodes are expressed in terms of the messages as follows,

$$b_\alpha(x_\alpha) = \frac{1}{\gamma_\alpha}\, \psi_\alpha(x_\alpha) \prod_{i \in N_\alpha} n_{i\alpha}(x_i)$$  (3.11)

$$b_i(x_i) = \frac{1}{\gamma_i}\, \psi_i(x_i) \prod_{\alpha \in N_i} m_{\alpha i}(x_i),$$  (3.12)

where $\gamma_i, \gamma_\alpha$ are normalization constants. On tree-structured factor graphs, there exists a scheduling such that each message needs to be updated only once in order to compute the exact marginal distributions on the factors and the nodes. On factor graphs with loops, iterating the messages does not always converge, but if they converge, they often give accurate approximations to the exact marginals (Murphy,

^1 Note that although the beliefs $\{b_i, b_\alpha\}$ satisfy local consistency constraints, they need not actually be globally consistent in that they do not necessarily correspond to the marginal distributions of a single probability distribution $B(x)$.
Weiss, & Jordan, 1999). Further, the stable fixed points of the iterations can only be local minima of the BP-Gibbs free energy (Heskes, in press). We note that theoretically, there is no need to normalize the messages themselves (as long as one normalizes the estimates of the marginals), but that it is desired computationally to avoid numerical overflow or underflow.

4 Linear Response

The mean field and BP algorithms described above provide estimates for single node marginals (both mean field and BP) and factor node marginals (BP only), but not for joint marginal distributions of distant nodes. The linear response (LR) theory can be used to estimate joint marginal distributions over an arbitrary pair of nodes. For pairs of nodes inside a single factor, this procedure even improves on the estimates that can be obtained from BP by marginalization of factor node marginals.

The idea here is to study changes in the system when we perturb the single node potentials,

$$\log \psi_i(x_i) = \log \psi_i^0(x_i) + \theta_i(x_i).$$  (4.1)
The superscript 0 in equation 4.1 indicates unperturbed quantities in the following. Let $\theta = \{\theta_i\}$, and define the free energy

$$F(\theta) = -\log \sum_x \prod_{\alpha \in A} \psi_\alpha(x_\alpha) \prod_{i \in V} \psi_i^0(x_i)\, e^{\theta_i(x_i)}.$$  (4.2)
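On a small model, $F(\theta)$ can be evaluated by enumeration, which gives a direct finite-difference check of the derivative identities derived next. The binary model and helper names below are ours, with uniform base potentials $\psi_i^0 = 1$:

```python
import itertools
import math

# Toy model (ours): three binary variables on a cycle; F is exact either way.
V = [0, 1, 2]
A = [(0, 1), (1, 2), (0, 2)]
J = {(0, 1): 0.7, (1, 2): -0.4, (0, 2): 0.2}

def F(theta):
    """Equation 4.2: F(theta) = -log sum_x prod_a psi_a(x_a) prod_i e^{theta_i(x_i)}."""
    Z = 0.0
    for x in itertools.product((0, 1), repeat=3):
        logp = sum(J[a] * (1 if x[a[0]] == x[a[1]] else -1) for a in A)
        logp += sum(theta[i][x[i]] for i in V)
        Z += math.exp(logp)
    return -math.log(Z)

def marginal(j, xj):
    """Exact single-node marginal p_j(x_j) at theta = 0, by enumeration."""
    num, Z = 0.0, 0.0
    for x in itertools.product((0, 1), repeat=3):
        p = math.exp(sum(J[a] * (1 if x[a[0]] == x[a[1]] else -1) for a in A))
        Z += p
        if x[j] == xj:
            num += p
    return num / Z

eps = 1e-6
theta0 = {i: [0.0, 0.0] for i in V}
for j in V:
    for xj in (0, 1):
        theta = {i: list(t) for i, t in theta0.items()}
        theta[j][xj] += eps
        deriv = (F(theta) - F(theta0)) / eps        # dF / d theta_j(x_j)
        assert abs(deriv + marginal(j, xj)) < 1e-4  # equals -p_j(x_j)
```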
$-F(\theta)$ is the cumulant generating function for $P(X)$, up to irrelevant constants. Differentiating $F(\theta)$ with respect to $\theta$ gives

$$\left. \frac{\partial F(\theta)}{\partial \theta_j(x_j)} \right|_{\theta=0} = -p_j(x_j)$$  (4.3)

$$\left. -\frac{\partial^2 F(\theta)}{\partial \theta_i(x_i)\, \partial \theta_j(x_j)} \right|_{\theta=0} = \left. \frac{\partial p_j(x_j)}{\partial \theta_i(x_i)} \right|_{\theta=0} = \begin{cases} p_{ij}(x_i, x_j) - p_i(x_i)\, p_j(x_j) & \text{if } i \neq j \\ p_i(x_i)\, \delta_{x_i, x_j} - p_i(x_i)\, p_j(x_j) & \text{if } i = j, \end{cases}$$  (4.4)

where $p_i, p_{ij}$ are single and pairwise marginals of $P(x)$. Hence, second-order perturbations in the system (4.4) give the covariances between any two nodes in the graph. The desired joint marginal distributions are then obtained by adding back the $p_i(x_i)\, p_j(x_j)$ term. Expressions for higher-order cumulants can be derived by taking further derivatives of $-F(\theta)$.

4.1 Approximate Linear Response. Notice from equation 4.4 that the covariance estimates are obtained by studying the perturbations in $p_j(x_j)$ as
we vary $\theta_i(x_i)$. This is not practical in general since calculating $p_j(x_j)$ itself is intractable. Instead, we consider perturbations of approximate marginal distributions $\{b_j\}$. In the following, we will assume that $b_j(x_j; \theta)$ are the beliefs at a local minimum of the approximate Gibbs free energy under consideration (possibly subject to constraints).

In analogy to equation 4.4, let $C_{ij}(x_i, x_j) = \left. \frac{\partial b_j(x_j; \theta)}{\partial \theta_i(x_i)} \right|_{\theta=0}$ be the linear response estimated covariance, and define the linear response estimated joint pairwise marginal as

$$b_{ij}^{LR}(x_i, x_j) = C_{ij}(x_i, x_j) + b_i^0(x_i)\, b_j^0(x_j),$$  (4.5)

where $b_i^0(x_i) := b_i(x_i; \theta = 0)$. We will show that $b_{ij}^{LR}$ and $C_{ij}$ satisfy a number of important properties of joint marginals and covariances.

First, we show that $C_{ij}(x_i, x_j)$ can be interpreted as the Hessian of a well-behaved convex function. We focus here on the Bethe approximation (the mean field case is simpler). First, let $\mathcal{C}$ be the set of beliefs that satisfy the constraints 3.8 and normalization constraints. The approximate marginals $\{b_i^0\}$ along with the joint marginals $\{b_\alpha^0\}$ form a local minimum of the Bethe-Gibbs free energy (subject to $b^0 := \{b_i^0, b_\alpha^0\} \in \mathcal{C}$). Assume that $b^0$ is a strict local minimum of $G^{BP}$.^2 That is, there is an open domain $\mathcal{D}$ containing $b^0$ such that $G^{BP}(b^0) < G^{BP}(b)$ for each $b \in \mathcal{D} \cap \mathcal{C} \setminus b^0$. Now we can define

$$G^*(\theta) = \inf_{b \in \mathcal{D} \cap \mathcal{C}} \Bigl[ G^{BP}(b) - \sum_{i, x_i} b_i(x_i)\, \theta_i(x_i) \Bigr].$$  (4.6)
$G^*(\theta)$ is a concave function since it is the infimum of a set of linear functions in $\theta$. Further, $G^*(0) = G(b^0)$, and since $b^0$ is a strict local minimum when $\theta = 0$, small perturbations in $\theta$ will result in small perturbations in $b^0$, so that $G^*$ is well behaved on an open neighborhood around $\theta = 0$. Differentiating $G^*(\theta)$, we get $\frac{\partial G^*(\theta)}{\partial \theta_j(x_j)} = -b_j(x_j; \theta)$, so that we now have

$$C_{ij}(x_i, x_j) = \left. \frac{\partial b_j(x_j; \theta)}{\partial \theta_i(x_i)} \right|_{\theta=0} = \left. -\frac{\partial^2 G^*(\theta)}{\partial \theta_i(x_i)\, \partial \theta_j(x_j)} \right|_{\theta=0}.$$  (4.7)
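Equation 4.7 suggests a simple, if inefficient, numerical reading: run the approximate inference algorithm at $\theta = 0$ and at a slightly perturbed $\theta$, and difference the beliefs. The sketch below does this with loopy BP on a three-node cycle; the model and names are ours, with $\psi_i^0 = 1$ (the propagation algorithm of section 5 computes the same derivatives without finite differences):

```python
import math

# Toy model (ours): three binary nodes on a cycle, so BP is approximate.
V = [0, 1, 2]
A = [(0, 1), (1, 2), (0, 2)]
J = {(0, 1): 0.6, (1, 2): -0.3, (0, 2): 0.2}

def bp_beliefs(theta):
    """Loopy BP with node potentials e^{theta_i}; returns node beliefs."""
    m = {(a, i): [1.0, 1.0] for a in A for i in a}
    for _ in range(200):
        for a in A:
            for i in a:
                j = a[0] if a[1] == i else a[1]
                new = []
                for x in (0, 1):
                    s = 0.0
                    for xj in (0, 1):
                        xa = (x, xj) if a[0] == i else (xj, x)
                        psi = math.exp(J[a] * (1 if xa[0] == xa[1] else -1))
                        # n_{ja}(x_j): node potential times other incoming messages
                        nval = math.exp(theta[j][xj]) * math.prod(
                            m[(b, j)][xj] for b in A if j in b and b != a)
                        s += psi * nval
                    new.append(s)
                z = sum(new)
                m[(a, i)] = [v / z for v in new]   # normalize for stability
    beliefs = {}
    for i in V:
        bi = [math.exp(theta[i][x]) * math.prod(
              m[(a, i)][x] for a in A if i in a) for x in (0, 1)]
        z = sum(bi)
        beliefs[i] = [v / z for v in bi]
    return beliefs

eps = 1e-5
theta0 = {i: [0.0, 0.0] for i in V}
b0 = bp_beliefs(theta0)
theta = {i: list(t) for i, t in theta0.items()}
theta[0][1] += eps
b1 = bp_beliefs(theta)
C_01 = (b1[1][1] - b0[1][1]) / eps     # estimate of C_{01}(x_0=1, x_1=1)
# The direct positive coupling J_{01} makes nodes 0 and 1 positively correlated.
assert C_01 > 0
```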
In essence, we can interpret $G^*(\theta)$ as a local convex dual of $G^{BP}(b)$ (by restricting attention to $\mathcal{D}$). Since $G^{BP}$ is an approximation to the exact Gibbs free energy (Welling & Teh, 2003), which is in turn dual to $F(\theta)$ (Georges & Yedidia, 1991), $G^*(\theta)$ can be seen as an approximation to $F(\theta)$ for small values of $\theta$. For that reason, we can take its second derivatives $C_{ij}(x_i, x_j)$ as approximations to the exact covariances (which are second derivatives of $-F(\theta)$). These relationships are shown pictorially in Figure 2.

^2 The strict local minimality is in fact attained if we use loopy BP (Heskes, in press).
Figure 2: Diagrammatic representation of the different objective functions discussed in the article. The free energy $F$ is the cumulant generating function (up to a constant), $G$ is the Gibbs free energy, $G^{BP}$ is the Bethe-Gibbs free energy, which is an approximation to the true Gibbs free energy, and $G^*$ is the approximate cumulant generating function. The actual approximation is performed in the dual space, and the dashed arrow indicates that the overall process gives $G^*$ as an approximation to $F$.
We now proceed to prove a number of important properties of the covariance $C$.

Theorem 1. The approximate covariance satisfies the following symmetry:

$$C_{ij}(x_i, x_j) = C_{ji}(x_j, x_i).$$  (4.8)

Proof. The covariances are second derivatives of $-G^*(\theta)$ at $\theta = 0$, and we can interchange the order of the derivatives since $G^*(\theta)$ is well behaved on a neighborhood around $\theta = 0$.

Theorem 2. The approximate covariance satisfies the following "marginalization" conditions for each $x_i, x_j$:

$$\sum_{x_i'} C_{ij}(x_i', x_j) = \sum_{x_j'} C_{ij}(x_i, x_j') = 0.$$  (4.9)

As a result, the approximate joint marginals satisfy local marginalization constraints:

$$\sum_{x_i'} b_{ij}^{LR}(x_i', x_j) = b_j^0(x_j), \qquad \sum_{x_j'} b_{ij}^{LR}(x_i, x_j') = b_i^0(x_i).$$  (4.10)
Proof. Using the definition of $C_{ij}(x_i, x_j)$ and marginalization constraints for $b_j^0$,

$$\sum_{x_j'} C_{ij}(x_i, x_j') = \sum_{x_j'} \left. \frac{\partial b_j(x_j'; \theta)}{\partial \theta_i(x_i)} \right|_{\theta=0} = \left. \frac{\partial}{\partial \theta_i(x_i)} \sum_{x_j'} b_j(x_j'; \theta) \right|_{\theta=0} = \left. \frac{\partial\, 1}{\partial \theta_i(x_i)} \right|_{\theta=0} = 0.$$  (4.11)

The constraint $\sum_{x_i'} C_{ij}(x_i', x_j) = 0$ follows from the symmetry 4.8, while the corresponding marginalization, equation 4.10, follows from equation 4.9 and the definition of $b_{ij}^{LR}$.
Since ¡F.µ / is convex, its Hessian matrix with entries given in equation 4.4 is positive semidenite. Similarly, since the approximate covariances Cij .xi ; xj / are second derivatives of a convex function ¡G¤ .µ /, we have: Theorem 3. The matrix formed from the approximate covariances Cij .xi ; xj / by varying i and xi over the rows and varying j; xj over the columns is positive semidefinite.
Using the above results, we can reinterpret the linear response correction as a “projection” of the (only locally consistent) beliefs fb0i ; b0® g onto a set of beliefs fb0i ; bLR ij g that is both locally consistent (see theorem 2) and satises the global constraint of being positive semidenite (see theorem 3). This is depicted in Figure 3. Indeed, the idea to include global constraints such as positive semideniteness in approximate inference algorithms was proposed in Wainwright and Jordan (2003). It is surprising that a simple post hoc projection can achieve the same result. 5 Propagation Algorithms for Linear Response Although we have derived an expression for the covariance in the linear response approximation (see equation 4.3), we have not yet explained how to compute it efciently. In this section, we derive a propagation algorithm to that end and prove some convergence results and in the next section we present an algorithm based on a matrix inverse.
Figure 3: Constraint sets discussed in the article. The inner set is the set of all marginal distributions that are consistent with some global distribution $B(x)$, the outer set is the constraint set of all locally consistent marginal distributions, and the middle set consists of locally consistent marginal distributions with positive semidefinite covariance. The linear response algorithm performs a correction on the joint pairwise marginals such that the covariance matrix is symmetric and positive semidefinite, while all local consistency relations are still respected.
Recall from equation 4.5 that we need the first derivative of $b_i(x_i; \theta)$ with respect to $\theta_j(x_j)$ at $\theta = 0$. This does not automatically imply that we need an analytic expression for $b_i(x_i; \theta)$ in terms of $\theta$. Instead, we need only to keep track of first-order dependencies by expanding all quantities and equations up to first order in $\theta$. For the beliefs we write,^3

$$b_i(x_i; \theta) = b_i^0(x_i) \Bigl( 1 + \sum_{j, y_j} R_{ij}(x_i, y_j)\, \theta_j(y_j) \Bigr).$$  (5.1)

The "response matrix" $R_{ij}(x_i, y_j)$ measures the sensitivity of $\log b_i(x_i; \theta)$ at node $i$ to a change in the log node potentials $\log \psi_j(y_j)$ at node $j$. Combining equation 5.1 with equation 4.7, we find that

$$C_{ij}(x_i, x_j) = b_i^0(x_i)\, R_{ij}(x_i, x_j).$$  (5.2)

^3 The unconventional form of this expansion will make subsequent derivations more transparent.
The constraints, equation 4.9 (which follow from the normalization of $b_i(x_i; \theta)$ and $b_i^0(x_i)$), translate into

$$\sum_{x_i} b_i^0(x_i)\, R_{ij}(x_i, y_j) = 0,$$  (5.3)

and it is not hard to verify that the following shift can be applied to accomplish this,^4

$$R_{ij}(x_i, y_j) \leftarrow R_{ij}(x_i, y_j) - \sum_{x_i} b_i^0(x_i)\, R_{ij}(x_i, y_j).$$  (5.4)
5.1 The Mean Field Approximation. Let us assume that we have found a local minimum of the mean field–Gibbs free energy by iterating equation 3.5 until convergence. By inserting the expansions, equations 4.1 and 5.1, into equation 3.5 and equating terms linear in $\theta$, we derive the following update equations for the response matrix in the mean field approximation,

$$R_{ik}(x_i, y_k) \leftarrow \delta_{ik}\, \delta_{x_i y_k} + \sum_{\alpha \in N_i} \sum_{x_{\alpha \setminus i}} \log \psi_\alpha(x_\alpha) \Bigl( \prod_{j \in \alpha \setminus i} b_j^{MF,0}(x_j) \Bigr) \Bigl( \sum_{j \in \alpha \setminus i} R_{jk}(x_j, y_k) \Bigr).$$  (5.5)

This update is followed by the shift 5.4 in order to satisfy the constraint 5.3, and the process is initialized with $R_{ik}(x_i, y_k) = 0$. After convergence, we compute the approximate covariance according to equation 5.2.

Theorem 4. The propagation algorithm for computing the linear response estimates of pairwise probabilities in the mean field approximation is guaranteed to converge to a unique fixed point using any scheduling of the updates.

For a full proof, we refer to the proof of theorem 6, which is very similar. However, it is easy to see that for sequential updates, convergence is guaranteed because equation 5.5 is the first-order term of the mean field equation, 3.5, which converges for arbitrary $\theta$.

5.2 The Bethe Approximation. In the Bethe approximation, we follow a similar strategy as in the previous section for the mean field approximation. First, we assume that belief propagation has converged to a stable fixed

^4 The shift can be derived by introducing a $\theta$-dependent normalizing constant in equation 5.1, expanding it to first order in $\theta$, and using the first-order terms to satisfy constraint 5.3.
point, which by Heskes (in press) is guaranteed to be a local minimum of the Bethe-Gibbs free energy. Next, we expand the messages $n_{i\alpha}(x_i)$ and $m_{\alpha i}(x_i)$ up to first order in $\theta$ around the stable fixed point,

$$n_{i\alpha}(x_i) = n_{i\alpha}^0(x_i) \Bigl( 1 + \sum_{k, y_k} N_{i\alpha,k}(x_i, y_k)\, \theta_k(y_k) \Bigr)$$  (5.6)

$$m_{\alpha i}(x_i) = m_{\alpha i}^0(x_i) \Bigl( 1 + \sum_{k, y_k} M_{\alpha i,k}(x_i, y_k)\, \theta_k(y_k) \Bigr).$$  (5.7)
Inserting these expansions and the expansion 4.1 into the BP equations, 3.9 and 3.10, and matching first-order terms, we arrive at the following update equations for the "supermessages" $M_{\alpha i,k}(x_i, y_k)$ and $N_{i\alpha,k}(x_i, y_k)$,

$$N_{i\alpha,k}(x_i, y_k) \leftarrow \delta_{ik}\, \delta_{x_i y_k} + \sum_{\beta \in N_i \setminus \alpha} M_{\beta i,k}(x_i, y_k)$$  (5.8)

$$M_{\alpha i,k}(x_i, y_k) \leftarrow \sum_{x_{\alpha \setminus i}} \frac{\psi_\alpha(x_\alpha)}{m_{\alpha i}^0(x_i)} \prod_{l \in N_\alpha \setminus i} n_{l\alpha}^0(x_l) \sum_{j \in N_\alpha \setminus i} N_{j\alpha,k}(x_j, y_k).$$  (5.9)
The supermessages are initialized at $M_{\alpha i,k} = N_{i\alpha,k} = 0$ and "normalized" as follows:^5

$$N_{i\alpha,k}(x_i, y_k) \leftarrow N_{i\alpha,k}(x_i, y_k) - \sum_{x_i} N_{i\alpha,k}(x_i, y_k)$$  (5.10)

$$M_{\alpha i,k}(x_i, y_k) \leftarrow M_{\alpha i,k}(x_i, y_k) - \sum_{x_i} M_{\alpha i,k}(x_i, y_k).$$  (5.11)
After the above fixed-point equations have converged, we compute the response matrix $R_{ij}(x_i, x_j)$ by inserting the expansions 5.1, 4.1, and 5.7 into equation 3.12 and matching first-order terms:

$$R_{ij}(x_i, x_j) = \delta_{ij}\, \delta_{x_i x_j} + \sum_{\alpha \in N_i} M_{\alpha i,j}(x_i, x_j).$$  (5.12)

We then normalize the response matrix as in equation 5.4 and compute the approximate covariances as in equation 5.2. We now prove a number of useful results concerning the iterative algorithm proposed above.

Theorem 5. If the factor graph has no loops, then the linear response estimates defined in equation 5.2 are exact. Moreover, there exists a scheduling of the supermessages such that the algorithm converges after just one iteration (i.e., every message is updated just once).

^5 The derivation is along similar lines as explained in the previous section for the mean field case. Note also that unlike the mean field case, normalization is desirable only for reasons of numerical stability.
Proof. Both results follow from the fact that BP on tree-structured factor graphs computes the exact single-node marginals for arbitrary $\theta$. Since the supermessages are the first-order terms of the BP updates with arbitrary $\theta$, we can invoke the exact linear response theorem given by equations 4.3 and 4.4 to claim that the algorithm converges to the exact joint pairwise marginal distributions. Moreover, the number of iterations that BP needs to converge is independent of $\theta$, and there exists a scheduling that updates each message exactly once (inward-outward scheduling). Since the supermessages are the first-order terms of the BP updates, they inherit these properties.

For graphs with cycles, BP is not guaranteed to converge. We can, however, still prove the following strong result:

Theorem 6. If the messages $\{m_{\alpha i}^0(x_i), n_{i\alpha}^0(x_i)\}$ have converged to a stable fixed point, then the update equations for the supermessages (see equations 5.8, 5.9, and 5.11) will also converge to a unique stable fixed point, using any scheduling of the supermessages.

Sketch of proof. As a first step, we combine the BP message updates (see equations 3.9 and 3.10) into one set of fixed-point equations by inserting equation 3.9 into equation 3.10. Next, we linearize the fixed-point equations for the BP messages around the stable fixed point. We introduce a small perturbation in the logarithm of the messages: $\delta \log m_{\alpha i}(x_i) := \tilde{M}_{\alpha i}(x_i) = \tilde{M}_a$, where we have collected the message index $\alpha i$ and the state index $x_i$ into one "flattened" index $a$. The linearized equation takes the general form,

$$\log m_a + \tilde{M}_a \leftarrow \log m_a + \sum_b L_{ab}\, \tilde{M}_b,$$  (5.13)
where the matrix $L$ is given by the first-order term of the Taylor expansion of the fixed-point equation. Since we know that the fixed point is stable, we infer that the absolute values of the eigenvalues of $L$ are all smaller than 1, so that $\tilde{M}_a \to 0$ as we iterate the fixed-point equations.

Similarly for the supermessages, we insert equation 5.8 into equation 5.9 and include the normalization, equation 5.11, explicitly so that equations 5.8, 5.9, and 5.11 collapse into one linear equation. We now observe that the collapsed update equations for the supermessages are linear and of the form

$$M_{a\mu} \leftarrow A_{a\mu} + \sum_b L_{ab}\, M_{b\mu},$$  (5.14)

where we introduced new flattened indices $\mu = (k, x_k)$ and where $L$ is identical to the $L$ in equation 5.13. The constant term $A_{a\mu}$ comes from the
fact that we also expanded the node potential $\psi_k$ as in equation 4.1. Next, we recall that for the linear dynamics 5.14, there can only be one fixed point at

$$M_{a\mu} = \sum_b \bigl[ (I - L)^{-1} \bigr]_{ab}\, A_{b\mu},$$  (5.15)
which exists only if $\det(I - L) \neq 0$. Finally, since the eigenvalues of $L$ are less than 1, we conclude that $\det(I - L) \neq 0$, so the fixed point exists, the fixed point is stable, and the (parallel) fixed-point equations, 5.14, will converge to that fixed point.

The above proves the result for parallel updates of the supermessages. However, for linear systems, the Stein-Rosenberg theorem now guarantees that any scheduling will converge to the same fixed point and, moreover, that sequential updates will do so faster.

6 Noniterative Algorithms for Linear Response

In section 5, we described propagation algorithms to compute the approximate covariances $\frac{\partial b_i(x_i)}{\partial \theta_k(x_k)}$ directly. In this section, we describe an alternative method that first computes $\frac{\partial \theta_i(x_i)}{\partial b_k(x_k)}$, and then inverts the matrix formed by $\frac{\partial \theta_i(x_i)}{\partial b_k(x_k)}$, where we have flattened $\{i, x_i\}$ into a row index and $\{k, x_k\}$ into a column index. This method is a direct extension of Kappen and Rodriguez (1998). The intuition is that while perturbations in a single $\theta_i(x_i)$ affect the whole system, perturbations in a single $b_i(x_i)$ (while keeping the others fixed) affect each subsystem $\alpha \in A$ independently (see also Welling & Teh, 2003). This makes it easier to compute $\frac{\partial \theta_i(x_i)}{\partial b_k(x_k)}$ than to compute $\frac{\partial b_i(x_i)}{\partial \theta_k(x_k)}$.

First, we propose minimal representations for $b_i$ and $\theta_k$. Notice that the current representations of $b_i$ and $\theta_k$ are redundant: we always have $\sum_{x_i} b_i(x_i) = 1$ for all $i$, while for each $k$, adding a constant to all $\theta_k(x_k)$ does not change the beliefs. This means that the matrix is actually not invertible: it has eigenvalues of 0. To deal with this noninvertibility, we propose a minimal representation for $b_i$ and $\theta_i$. In particular, we assume that for each $i$, there is a distinguished value $x_i = 0$, and set $\theta_i(0) = 0$ while functionally defining $b_i(0) = 1 - \sum_{x_i \neq 0} b_i(x_i)$. Now the matrix formed by $\frac{\partial \theta_i(x_i)}{\partial b_k(x_k)}$ for each $i, k$ and $x_i, x_k \neq 0$ is invertible; its inverse gives us the desired covariances for $x_i, x_k \neq 0$. Values for $x_i = 0$ or $x_k = 0$ can then be computed using equation 4.9.

6.1 The Mean Field Approximation. Taking the log of the mean field fixed-point equation, 3.5, and differentiating with respect to $b_k(x_k)$, we get,
after some manipulation, for each i, k and x_i, x_k ≠ 0,

∂θ_i(x_i)/∂b_k(x_k) = δ_ik [ δ_{x_i x_k}/b_i(x_i) + 1/b_i(0) ] − (1 − δ_ik) Σ_{α∈N_i∩N_k} Σ_{x_{α∖{i,k}}} log ψ_α(x_α) Π_{j∈α∖{i,k}} b_j(x_j).   (6.1)
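For the binary pairwise case (the setting of Kappen & Rodriguez, 1998), equation 6.1 collapses to a small matrix that can be assembled and inverted in a few lines. Below is a minimal sketch of our own: with 0/1 states, pairwise log-potentials J_ik, and MF means m_i = b_i(1), the flattened matrix is δ_ik/(m_i(1 − m_i)) − J_ik; the toy model and function name are ours.

```python
import numpy as np

def mf_linear_response(J, h, iters=2000):
    """Mean field plus linear response for a binary (0/1) pairwise MRF
    with symmetric couplings J (zero diagonal) and biases h.
    Returns the MF means m and the LR covariance estimate."""
    n = len(h)
    m = np.full(n, 0.5)
    for _ in range(iters):                 # damped MF fixed-point updates
        m = 0.5 * m + 0.5 / (1.0 + np.exp(-(h + J @ m)))
    # Binary form of equation 6.1: d theta_i / d m_k
    A = np.diag(1.0 / (m * (1.0 - m))) - J
    return m, np.linalg.inv(A)             # invert to get covariances

# Two nodes with a positive coupling: LR predicts a positive covariance,
# whereas plain MF would report C = 0.
J = np.array([[0.0, 1.0], [1.0, 0.0]])
h = np.array([-0.5, -0.5])
m, C = mf_linear_response(J, h)
```

For these symmetric biases the MF means settle at m = (0.5, 0.5), so the matrix is [[4, −1], [−1, 4]] and its inverse assigns the two nodes a covariance of 1/15.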
Inverting this matrix thus results in the desired estimates of the covariances (see also Kappen & Rodriguez, 1998, for the binary case).

6.2 The Bethe Approximation. In addition to using the minimal representations for b_i and θ_i, we will also need minimal representations for the messages. This can be achieved by defining new quantities λ_iα(x_i) = log[n_iα(x_i)/n_iα(0)] for all i and x_i. The λ_iα's can be interpreted as Lagrange multipliers that enforce the consistency constraints, equation 3.8 (Yedidia et al., 2000). We will use these multipliers instead of the messages in this section. Reexpressing the fixed-point equations, 3.9 through 3.12, in terms of the b_i's and λ_iα's only, and introducing the perturbations θ_i, we get:

b_i(x_i)/b_i(0) = [ψ_i(x_i)/ψ_i(0)]^{c_i} e^{θ_i(x_i)} Π_{α∈N_i} e^{−λ_iα(x_i)}   for all i, x_i ≠ 0   (6.2)

b_i(x_i) = [Σ_{x_{α∖i}} ψ_α(x_α) Π_{j∈N_α} e^{λ_jα(x_j)}] / [Σ_{x_α} ψ_α(x_α) Π_{j∈N_α} e^{λ_jα(x_j)}]   for all i, α ∈ N_i, x_i ≠ 0.   (6.3)
The division by the values at 0 in equation 6.2 is to get rid of the proportionality constant. The above forms a minimal set of fixed-point equations that the single-node beliefs b_i and Lagrange multipliers λ_iα need to satisfy at any local minimum of the Bethe free energy. Differentiating the logarithm of equation 6.2 with respect to b_k(x_k), we get

∂θ_i(x_i)/∂b_k(x_k) = c_i δ_ik [ δ_{x_i x_k}/b_i(x_i) + 1/b_i(0) ] + Σ_{α∈N_i} ∂λ_iα(x_i)/∂b_k(x_k),   (6.4)

remembering that b_i(0) is a function of the b_i(x_i), x_i ≠ 0. Notice that we need values for ∂λ_iα(x_i)/∂b_k(x_k) in order to solve for ∂θ_i(x_i)/∂b_k(x_k). Since perturbations in b_k(x_k) (while keeping the other b_j's fixed) do not affect nodes not directly connected to k, we have ∂λ_iα(x_i)/∂b_k(x_k) = 0 for k ∉ α. When k ∈ α, these can in turn be obtained by solving, for each α, a matrix inverse. Differentiating equation 6.3 by b_k(x_k),
M. Welling and Y. W. Teh
we obtain

δ_ik δ_{x_i x_k} = Σ_{j∈α} Σ_{x_j≠0} C_{α,ij}(x_i, x_j) ∂λ_jα(x_j)/∂b_k(x_k)   (6.5)

C_{α,ij}(x_i, x_j) = b_α(x_i, x_j) − b_i(x_i) b_j(x_j)       if i ≠ j
C_{α,ij}(x_i, x_j) = b_i(x_i) δ_{x_i x_j} − b_i(x_i) b_j(x_j)   if i = j   (6.6)

for each i, k ∈ N_α and x_i, x_k ≠ 0. Flattening the indices in equation 6.5 (varying i, x_i over rows and k, x_k over columns), the left-hand side becomes the identity matrix, while the right-hand side is a product of two matrices. The first is a covariance matrix C_α whose ij-th block is C_{α,ij}(x_i, x_j), while the second matrix consists of all the desired derivatives ∂λ_jα(x_j)/∂b_k(x_k). Hence, the derivatives are given as elements of the inverse covariance matrix C_α⁻¹. Finally, plugging the values of ∂λ_jα(x_j)/∂b_k(x_k) into equation 6.4 now gives ∂θ_i(x_i)/∂b_k(x_k), and inverting that matrix will now give us the desired approximate covariances over the whole graph. Interestingly, the method requires access to the beliefs only at the local minimum, not to the potentials or Lagrange multipliers.

7 A Propagation Algorithm for Matrix Inversion

Up to this point, all considerations have been in the discrete domain. A natural question is whether linear response can also be applied in the continuous domain. In this section, we use linear response to derive a propagation algorithm that computes the exact covariance matrix of a gaussian Markov random field. A gaussian random field is a real-valued Markov random field with pairwise interactions. Its energy is
E = ½ Σ_{ij} W_ij x_i x_j + Σ_i α_i x_i,   (7.1)
where the W_ij are the interactions and the α_i are the biases. Since gaussian distributions are completely described by their first- and second-order statistics, inference in this model reduces to the computation of the mean and covariance,

μ ≐ ⟨x⟩ = −W⁻¹ α,   Σ ≐ ⟨x xᵀ⟩ − μ μᵀ = W⁻¹.   (7.2)
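Equation 7.2 is worth keeping at hand as the exact baseline for everything in this section; a few lines verify it numerically. The matrix W below is an arbitrary (diagonally dominant, hence positive definite) example of our own choosing.

```python
import numpy as np

# Equation 7.2: for the gaussian random field with energy
# E = 0.5 x^T W x + a^T x, the mean is -W^{-1} a and the covariance is W^{-1}.
W = np.array([[2.0, 0.5, 0.0],
              [0.5, 2.0, 0.5],
              [0.0, 0.5, 2.0]])
a = np.array([1.0, -1.0, 0.5])

mu = -np.linalg.solve(W, a)        # mean: mu = -W^{-1} a
Sigma_exact = np.linalg.inv(W)     # covariance: Sigma = W^{-1}

assert np.allclose(W @ mu + a, 0)              # stationarity of the energy
assert np.allclose(Sigma_exact @ W, np.eye(3)) # Sigma is indeed W^{-1}
```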
Weiss and Freeman (1999) showed that BP (when it converges) will compute the exact means μ_i but only approximate variances Σ_ii and covariances Σ_ij between neighboring nodes. We will now show how to compute the exact covariance matrix using linear response, which through equation 7.2 translates into a perhaps unexpected algorithm to invert the matrix W.
First, we introduce a small perturbation to the biases, α → α + ν, and note that

Σ_ij = −∂μ_i/∂ν_j |_{ν=0} = −∂²F(ν)/∂ν_i ∂ν_j |_{ν=0}.   (7.3)
Our strategy will thus be to compute μ(ν) ≈ μ⁰ − Σν up to first order in ν. This can again be achieved by expanding the propagation updates to first order in ν. It will be convenient to collapse the two sets of message updates, equations 3.9 and 3.10, into one set of messages by inserting equation 3.9 into equation 3.10. Because the subsets α correspond to pairs of variables in the gaussian random field model, we change notation for the messages from α → j with α = {i, j} to i → j. Using the following definitions for the messages and potentials,

m_ij(x_j) ∝ exp(−½ a_ij x_j² − b_ij x_j)   (7.4)

ψ_ij(x_i, x_j) = exp(−W_ij x_i x_j),   ψ_i(x_i) = exp(−½ W_ii x_i² − α_i x_i),   (7.5)

we derive the update equations,⁶

a_ij ← −W_ij² / (W_ii + Σ_{k∈N_i∖j} a_ki),   b_ij ← (a_ij / W_ij) (α_i + Σ_{k∈N_i∖j} b_ki),   (7.6)

μ_i = −(α_i + Σ_{k∈N_i} b_ki) / τ_i,   τ_i = W_ii + Σ_{k∈N_i} a_ki,   (7.7)
where the means μ_i are exact at convergence, but the precisions τ_i are approximate (Weiss & Freeman, 1999). We note that the a_ij messages do not depend on α, so the perturbation α → α + ν has no effect on them. Perturbing the b_ij messages as b_ij = b⁰_ij + Σ_k B_{ij,k} ν_k, we derive the following update equations for the "supermessages" B_{ij,l}:

B_{ij,l} = (a_ij / W_ij) (δ_il + Σ_{k∈N_i∖j} B_{ki,l}).   (7.8)
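Updates 7.6 and 7.8, together with the covariance readout of equation 7.9 in theorem 7 below, make a complete matrix-inversion scheme. A minimal sketch of ours (the function name and test matrix are our own; we assume a symmetric, diagonally dominant W so that BP converges, and we use plain parallel sweeps):

```python
import numpy as np

def gaussian_bp_inverse(W, iters=200):
    """Propagation algorithm of section 7: invert W via the precision
    messages of update 7.6 and the supermessages of update 7.8, reading
    out the covariance with equation 7.9."""
    n = W.shape[0]
    nbrs = [[k for k in range(n) if k != i and W[i, k] != 0] for i in range(n)]
    a = np.zeros((n, n))        # a[i, j]: precision message i -> j
    B = np.zeros((n, n, n))     # B[i, j, l]: supermessage i -> j, channel l
    for _ in range(iters):
        a_new = np.zeros_like(a)
        B_new = np.zeros_like(B)
        for i in range(n):
            for j in nbrs[i]:
                inc = [k for k in nbrs[i] if k != j]      # N_i \ j
                a_new[i, j] = -W[i, j] ** 2 / (W[i, i] + sum(a[k, i] for k in inc))
                for l in range(n):
                    B_new[i, j, l] = (a_new[i, j] / W[i, j]) * (
                        (1.0 if i == l else 0.0) + sum(B[k, i, l] for k in inc))
        a, B = a_new, B_new
    Sigma = np.zeros((n, n))
    for i in range(n):
        tau = W[i, i] + sum(a[k, i] for k in nbrs[i])     # equation 7.7
        for l in range(n):
            Sigma[i, l] = ((1.0 if i == l else 0.0)
                           + sum(B[k, i, l] for k in nbrs[i])) / tau  # eq. 7.9
    return Sigma

# Diagonally dominant chain: BP converges and Sigma matches W^{-1}.
W = np.array([[3.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 3.0]])
Sigma = gaussian_bp_inverse(W)
```

On this tree-structured example the messages settle after a few sweeps and the readout reproduces W⁻¹ to machine precision, as theorem 7 predicts.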
Note that given a solution to equation 7.8, it is no longer necessary to run the updates for b_ij (see equation 7.6), since b_ij can be computed as b_ij = Σ_l B_{ij,l} α_l.

Theorem 7. If BP has converged to a stable fixed point (i.e., message updates 7.6 have converged to a stable fixed point), then the message updates 7.8 will converge
⁶ Here we used the following identity: ∫ dx exp(−½ a x² − b x) = exp(b²/2a) √(2π/a).
to a unique stable fixed point. Moreover, the exact covariance matrix Σ = W⁻¹ is given by the following expression,

Σ_il = (1/τ_i) (δ_il + Σ_{k∈N_i} B_{ki,l}),   (7.9)
with τ_i given by equation 7.7.

Sketch of proof. The convergence proof is similar to the proof of theorem 6 and is based on the observation that equation 7.8 is a linearization of the fixed-point equation for b_ij, equation 7.6, and so has the same convergence properties. The exactness proof is similar to the proof of theorem 5 and uses the fact that BP computes the means exactly, so equation 7.3 computes the exact covariance, which is what we compute with equation 7.9.

Weiss and Freeman (1999) further showed that for diagonally dominant weight matrices (|W_ii| > Σ_{j≠i} |W_ij| for all i), convergence of BP (i.e., of message updates 7.6) is guaranteed. Combined with the above theorem, this ensures that the proposed iterative algorithm to invert W will converge for diagonally dominant W. Whether the class of problems that can be solved using this method can be enlarged, possibly at the expense of an approximation, is an open question.

From equation 7.2, we observe that the exact covariance matrix may also be computed by running BP N times with α = −e_i, i = 1, …, N, where e_i is the unit vector in direction i. The exact means {μ_i} computed using BP then form the columns of the matrix W⁻¹. This idea was exploited in the proof of claim 2 in Weiss and Freeman (1999). The complexity of the above algorithm is O(N × E) per iteration, where N is the number of nodes and E the number of edges in the graph. Consequently, it will improve on a straight matrix inversion only if the graph is sparse (i.e., the matrix to invert has many zeros).

8 Experiments

In the following experiments, we compare five methods for computing approximate estimates of the covariance matrix C_ij(x_i, x_j) = p_ij(x_i, x_j) − p_i(x_i) p_j(x_j):

MF: Since mean field assumes independence, we have C = 0. This will act as a baseline.
BP: Estimates computed directly from equation 3.11 by integrating out variables that are not considered (in fact, in the experiments below, the factors α consist of pairs of nodes, so no integration is necessary). Note that nontrivial estimates exist only if there is a factor node that contains
Figure 4: (a) Square grid used in the experiments. The rows are collected into supernodes which then form a chain. (b) Spanning tree of the nodes on the square grid used in the experiments.
both nodes. The BP messages were uniformly initialized at m_αi(x_i) = n_iα(x_i) = 1 and run until convergence. No damping was used.
MF+LR: Estimates computed from the linear response correction to the mean field approximation (see section 5.1). The MF beliefs were first uniformly initialized at b_i(x_i) = 1/D and run until convergence, while the response matrix R_ij(x_i, x_j) was initialized at 0.

BP+LR: Estimates computed from the linear response correction to the Bethe approximation (see section 5.2). The supermessages {M_αi,j(x_i, x_j), N_iα,j(x_i, x_j)} were all initialized at 0.

COND: Estimates computed using the following conditioning procedure. Clamp a certain node j to a specific state x_j = a. Run BP to compute the conditional distributions b^BP(x_i | x_j = a). Do this for all nodes and all states to obtain all conditional distributions b^BP(x_i | x_j). The joint distribution is then computed as b^COND_ij(x_i, x_j) = b^BP(x_i | x_j) b^BP(x_j). Finally, the covariance is computed as

C^COND_ij(x_i, x_j) = b^COND_ij(x_i, x_j) − [Σ_{x_j} b^COND_ij(x_i, x_j)] [Σ_{x_i} b^COND_ij(x_i, x_j)].   (8.1)

Note that C^COND is not symmetric and that the marginal Σ_{x_j} b^COND_ij(x_i, x_j) is not consistent with b^BP(x_i).
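The COND combination step (build the joint from clamped conditionals, then apply equation 8.1) is easy to make concrete. The sketch below is ours: in the paper the conditionals come from BP runs with x_j clamped, whereas here we feed in exact conditionals from a hypothetical true joint as a consistency check.

```python
import numpy as np

def cond_covariance(b_cond, b_j):
    """COND step: b_cond[a_i, a_j] = b(x_i = a_i | x_j = a_j) and b_j is
    the estimated marginal of x_j.  Returns the joint distribution and
    the covariance of equation 8.1."""
    joint = b_cond * b_j[None, :]      # b(x_i, x_j) = b(x_i | x_j) b(x_j)
    mi = joint.sum(axis=1)             # marginal over x_j
    mj = joint.sum(axis=0)             # marginal over x_i
    C = joint - np.outer(mi, mj)       # equation 8.1
    return joint, C

# Exact conditionals reproduce the true joint, and the covariance rows
# and columns sum to zero, as they must for any valid covariance table.
p = np.array([[0.3, 0.1], [0.2, 0.4]])   # hypothetical true joint
b_j = p.sum(axis=0)
b_cond = p / b_j[None, :]
joint, C = cond_covariance(b_cond, b_j)
```

With approximate (BP) conditionals, the two marginals of `joint` generally disagree with each other, which is exactly the asymmetry noted in the text.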
The methods were halted when the maximum change in absolute value of all beliefs (MF) or messages (BP) was smaller than 10⁻⁸. The graphical model in the first two experiments has nodes placed on a square 6 × 6 grid (i.e., N = 36), with only nearest neighbors connected (see Figure 4a). Each node is associated with a random variable that can be in one of three states (D = 3). The factors were chosen to be all pairs of neighboring nodes in the graph.
[Figure 5 appears here: log-log plots of the absolute error of the estimated covariances (C = 0, BP, MF+LR, BP+LR, COND) versus σ_edge on the 6 × 6 grid, with panels for neighboring, next-to-nearest-neighboring, and remaining node pairs, at σ_node = 0 and σ_node = 2.]

Figure 5: Absolute error for the estimated covariances for the 6 × 6 square grid.
[Figure 6 appears here: plots of the absolute error of the estimated covariances (C = 0, BP, MF+LR, BP+LR, COND) versus σ_edge as the spanning tree is smoothly changed into the 6 × 6 grid, with panels for neighboring, next-to-nearest-neighboring, and remaining node pairs, at σ_node = 0.]
Figure 6: Absolute error for the spanning tree in Figure 4b that is smoothly changed into the 6 × 6 square grid.
[Figure 7 appears here: plots of the absolute error of the estimated covariances (C = 0, BP, MF+LR, BP+LR, COND) versus σ_edge on the fully connected graph, at σ_node = 0 and σ_node = 2.]
Figure 7: Absolute error computed on a fully connected network with 10 nodes.
By clustering the nodes in each row into supernodes, exact inference is still feasible using the forward-backward algorithm. Pairwise probabilities between nodes in nonconsecutive layers were computed by integrating out the intermediate supernodes. The error in the estimated covariances was computed as the absolute difference between the estimated and the true values, averaged over pairs of nodes and their possible states, and averaged over 15 random draws of the network, as described below. An instantiation of a network was generated by randomly drawing the logarithm of the node and edge potentials from a gaussian with zero mean and standard deviations σ_node and σ_edge, respectively.

In the first experiment, we generated networks randomly with the scale σ_edge varying over the range [0, 2] and two settings of the scale σ_node, namely {0, 2}. The results in Figure 5 are plotted separately for neighboring nodes, next-to-nearest neighboring nodes, and the remaining nodes in order to show the decay of dependencies with distance. Estimates for BP are absent in Figures 5b and 5e and in 5c and 5f because BP does not provide nontrivial estimates for nonneighbors.

In the next experiment, we generated a single network with σ_edge = 1 and {ψ_i} = 1 on the 6 × 6 square grid used in the previous experiment. The edge strengths of a subset of the edges forming a spanning tree of the graph were held fixed (see Figure 4b), while the remaining edge strengths were multiplied by a factor increasing from 0 to 2 on the x-axis. The results are shown in Figure 6. Note that BP+LR and COND are exact on the tree.

Finally, we generated fully connected graphs with 10 nodes and 3 states per node (i.e., all nodes are neighbors). We used varying edge strengths (σ_edge ranging over [0, 1]) and two values of σ_node: {0, 2}. The results are shown in Figure 7. If we further increase the edge strengths in this fully connected network, we find that BP often fails to converge.
We could probably improve this situation a little by damping the BP updates, but because of the many tight loops, BP is doomed to fail for relatively large σ_edge.

All experiments confirm that the LR estimates of the covariances in the Bethe approximation improve significantly on the LR estimates in the MF approximation. It is well known that the MF approximation usually improves for large, densely connected networks. This is probably the reason MF+LR performed better on the fully connected graph, but it never performed as well as BP+LR or COND. The COND method performed surprisingly well, at either the same level of accuracy as BP+LR or a little better. It was, however, checked numerically that the symmetrized estimate of the covariance matrix was not positive semidefinite and that the various marginals computed from the joint distributions b^COND_ij(x_i, x_j) were inconsistent with each other. In the next section, we further discuss the differences between BP+LR and COND. Finally, as expected, the BP+LR and COND estimates are exact on
a tree—the error is of the order of machine precision—but the error increases when the graph contains cycles, with increasing edge strengths.

9 Discussion

Loosely speaking, the "philosophy" of this article for computing estimates of covariances is as follows (see Figure 2). First, we observe that the log partition function is the cumulant generating function. Next, we define its conjugate dual—the Gibbs free energy—and approximate it (e.g., with the mean field or the Bethe approximation). Finally, we transform back to obtain a convex approximation to the log partition function, from which we estimate the covariances.

We have presented linear response algorithms on factor graphs. In the discrete case, we have discussed the mean field and the Bethe approximations, while for gaussian random fields, we have shown how the proposed linear response algorithm translates into a surprising propagation algorithm to compute a matrix inverse. The computational complexity of the iterative linear response algorithm scales as O(N × E × D³) per iteration, where N is the number of nodes, E the number of edges, and D the number of states per node. The noniterative algorithm scales slightly worse, O(N³ × D³), but is based on a matrix inverse, for which very efficient implementations exist.

A question that remains open is whether we can improve the efficiency of the iterative algorithm when we are interested only in the joint distributions of neighboring nodes. On tree-structured graphs, we know that belief propagation computes those estimates exactly in O(E × D²), but the linear response algorithm still seems to scale as O(N × E × D³), which indicates that some useful information remains unused. Another hint pointing in that direction comes from the fact that in the gaussian case, an efficient algorithm was proposed in Wainwright, Sudderth, and Willsky (2000) for the computation of variances and neighboring covariances on a loopy graph.

There are still a number of generalizations worth exploring.
First, instead of the MF or Bethe approximations, we can use the more accurate Kikuchi approximation defined over larger clusters of nodes and their intersections (see also Tanaka, 2003). Another candidate is the convexified Bethe free energy (Wainwright, Jaakkola, & Willsky, 2002). Second, in the case of the Bethe approximation, belief propagation is not guaranteed to converge. However, convergent alternatives have been developed in the literature (Teh & Welling, 2001; Yuille, 2002), and the noniterative linear response algorithm can still be applied to compute joint pairwise distributions. For reasons of computational efficiency, it may be desirable to develop iterative algorithms for this case. Third, the presented method easily generalizes to the computation of higher-order cumulants. It is straightforward (but cumbersome) to develop iterative linear response algorithms for this as well. Finally, we
are investigating whether linear response algorithms may also be applied to fixed points of the expectation propagation algorithm.

The most important feature distinguishing the proposed LR algorithm from the conditioning procedure described in section 8 is that the covariance estimate is automatically positive semidefinite. The idea of including global constraints such as positive semidefiniteness in approximate inference algorithms was proposed in Wainwright and Jordan (2003). LR may be considered a post hoc projection onto this constraint set (see section 4.1 and Figure 3). Another difference is the lack of a convergence proof for conditioned BP runs, given that BP has converged without conditioning (convergence for BP+LR was proven in section 5.2). Even if the various runs of conditioned BP do converge, different runs might converge to different local minima of the Bethe free energy, making the obtained estimates inconsistent and less accurate (although in the regime we worked with in the experiments, we did not observe this behavior). Finally, the noniterative algorithm is applicable to all local minima of the Bethe-Gibbs free energy, even those that correspond to unstable fixed points of BP. These minima can, however, still be identified using convergent alternatives (Yuille, 2002; Teh & Welling, 2001).

Acknowledgments

We thank Martin Wainwright for discussion and the referees for valuable feedback. M.W. thanks Geoffrey Hinton for support. Y.W.T. thanks Mike Jordan for support.

References

Frey, B., & MacKay, D. (1997). A revolution: Belief propagation in graphs with cycles. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 479–485). Cambridge, MA: MIT Press.
Georges, A., & Yedidia, J. (1991). How to expand around mean-field theory using high-temperature expansions. J. Phys. A: Math. Gen., 24, 2173–2192.
Heskes, T. (in press). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Kappen, H., & Rodriguez, F. (1998). Efficient learning in Boltzmann machines using linear response theory. Neural Computation, 10, 1137–1156.
Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.
Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (pp. 362–369). San Francisco: Morgan Kaufmann.
Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.
Tanaka, K. (2003). Probabilistic inference by means of cluster variation method and linear response theory. IEICE Transactions in Information and Systems, E86-D(7), 1–15.
Teh, Y., & Welling, M. (2001). The unified propagation and scaling algorithm. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.
Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (pp. 536–543). San Francisco: Morgan Kaufmann.
Wainwright, M., & Jordan, M. (2003). Semidefinite relaxations for approximate inference on graphs with cycles (Tech. Rep. No. UCB/CSD-3-1226). Berkeley: CS Division, University of California, Berkeley.
Wainwright, M., Sudderth, E., & Willsky, A. (2000). Tree-based modeling and estimation of gaussian processes on graphs with cycles. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 661–667). Cambridge, MA: MIT Press.
Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12, 1–41.
Weiss, Y., & Freeman, W. (1999). Correctness of belief propagation in gaussian graphical models of arbitrary topology. In S. A. Solla, T. K. Leen, & K. R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 673–679). Cambridge, MA: MIT Press.
Weiss, Y., & Freeman, W. (2001). On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Transactions on Information Theory, 47, 723–735.
Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143, 19–50.
Yedidia, J., Freeman, W., & Weiss, Y. (2000). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.
Yedidia, J., Freeman, W., & Weiss, Y. (2002). Constructing free energy approximations and generalized belief propagation algorithms (Tech. Rep. TR-2002-35). Cambridge, MA: Mitsubishi Electric Research Laboratories.
Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14(7), 1691–1722.

Received October 11, 2002; accepted July 8, 2003.
ARTICLE
Communicated by Pamela Reinagel
Analyzing Neural Responses to Natural Signals: Maximally Informative Dimensions

Tatyana Sharpee
[email protected] Sloan–Swartz Center for Theoretical Neurobiology and Department of Physiology, University of California at San Francisco, San Francisco, CA 94143, U.S.A.
Nicole C. Rust
[email protected] Center for Neural Science, New York University, New York, NY 10003, U.S.A.
William Bialek
[email protected] Department of Physics, Princeton University, Princeton, NJ 08544, U.S.A., and Sloan–Swartz Center for Theoretical Neurobiology and Department of Physiology, University of California at San Francisco, San Francisco, CA 94143, U.S.A.
We propose a method that allows for a rigorous statistical analysis of neural responses to natural stimuli that are nongaussian and exhibit strong correlations. We have in mind a model in which neurons are selective for a small number of stimulus dimensions out of a high-dimensional stimulus space, but within this subspace the responses can be arbitrarily nonlinear. Existing analysis methods are based on correlation functions between stimuli and responses, but these methods are guaranteed to work only in the case of gaussian stimulus ensembles. As an alternative to correlation functions, we maximize the mutual information between the neural responses and projections of the stimulus onto low-dimensional subspaces. The procedure can be done iteratively by increasing the dimensionality of this subspace. Those dimensions that allow the recovery of all of the information between spikes and the full unprojected stimuli describe the relevant subspace. If the dimensionality of the relevant subspace indeed is small, it becomes feasible to map the neuron’s input-output function even under fully natural stimulus conditions. These ideas are illustrated in simulations on model visual and auditory neurons responding to natural scenes and sounds, respectively.
Neural Computation 16, 223–250 (2004)
© 2003 Massachusetts Institute of Technology
T. Sharpee, N. Rust, and W. Bialek
1 Introduction From olfaction to vision and audition, a growing number of experiments are examining the responses of sensory neurons to natural stimuli (Creutzfeldt & Northdurft, 1978; Rieke, Bodnar, & Bialek, 1995; Baddeley et al., 1997; Stanley, Li, & Dan, 1999; Theunissen, Sen, & Doupe, 2000; Vinje & Gallant, 2000, 2002; Lewen, Bialek, & de Ruyter van Steveninck, 2001; Sen, Theunissen, & Doupe, 2001; Vickers, Christensen, Baker, & Hildebrand, 2001; Ringach, Hawken, & Shapley, 2002; Weliky, Fiser, Hunt, & Wagner, 2003; Rolls, Aggelopoulos, & Zheng, 2003; Smyth, Willmore, Baker, Thompson, & Tolhurst, 2003). Observing the full dynamic range of neural responses may require using stimulus ensembles that approximate those occurring in nature (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Simoncelli & Olshausen, 2001), and it is an attractive hypothesis that the neural representation of these natural signals may be optimized in some way (Barlow, 1961, 2001; von der Twer & Macleod, 2001; Bialek, 2002). Many neurons exhibit strongly nonlinear and adaptive responses that are unlikely to be predicted from a combination of responses to simple stimuli; for example, neurons have been shown to adapt to the distribution of sensory inputs, so that any characterization of these responses will depend on context (Smirnakis, Berry, Warland, Bialek, & Meister, 1996; Brenner, Bialek, & de Ruyter van Steveninck, 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001). Finally, the variability of a neuron’s responses decreases substantially when complex dynamical, rather than static, stimuli are used (Mainen & Sejnowski, 1995; de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Kara, Reinagel, & Reid, 2000; de Ruyter van Steveninck, Borst, & Bialek, 2000). All of these arguments point to the need for general tools to analyze the neural responses to complex, naturalistic inputs. 
The stimuli analyzed by sensory neurons are intrinsically high-dimensional, with dimensions D ∼ 10²–10³. For example, in the case of visual neurons, the input is commonly specified as light intensity on a grid of at least 10 × 10 pixels. Each of the presented stimuli can be described as a vector s in this high-dimensional stimulus space (see Figure 1). The dimensionality becomes even larger if stimulus history has to be considered as well. For example, if we are interested in how the past N frames of the movie affect the probability of a spike, then the stimulus s, being a concatenation of the past N samples, will have dimensionality N times that of a single frame. We also assume that the probability distribution P(s) is sampled ergodically during an experiment, so that we can exchange averages over time with averages over the true distribution as needed.

Although direct exploration of a D ∼ 10²–10³-dimensional stimulus space is beyond the constraints of experimental data collection, progress can be made provided we make certain assumptions about how the response has been generated. In the simplest model, the probability of response can be described by one receptive field (RF) or linear filter (Rieke et al., 1997).
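A toy version of this simplest model makes the setup concrete: the response depends on the stimulus only through its projection onto a single filter ê₁. The filter, nonlinearity, and sizes below are hypothetical choices of ours; for a gaussian white noise ensemble, the spike-triggered average recovers the direction of ê₁, as discussed for the reverse correlation method below.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 100, 5000
e1 = rng.standard_normal(D)
e1 /= np.linalg.norm(e1)                   # unit-norm relevant dimension
stimuli = rng.standard_normal((T, D))      # gaussian white noise ensemble

proj = stimuli @ e1                        # the only relevant projection
p_spike = 0.5 / (1.0 + np.exp(-3.0 * (proj - 0.5)))  # sigmoidal nonlinearity
spikes = rng.random(T) < p_spike           # draw spikes from P(spike | s)

# Reverse correlation: the spike-triggered average points along e1
# because the stimulus ensemble is gaussian and white.
sta = stimuli[spikes].mean(axis=0)
cos = sta @ e1 / np.linalg.norm(sta)
```

With natural (nongaussian, correlated) stimuli substituted for `stimuli`, this alignment is no longer guaranteed, which is the problem the article addresses.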
Figure 1: Schematic illustration of a model with a one-dimensional relevant subspace: ê₁ is the relevant dimension, and ê₂ and ê₃ are irrelevant ones. Shown are three example stimuli, s, s′, and s″; the receptive field of a model neuron (the relevant dimension ê₁); and our guess v for the relevant dimension. The probabilities of a spike, P(spike|s·ê₁) and P(spike|s·v), are calculated by first projecting all of the stimuli s onto each of the two vectors ê₁ and v, respectively, and then applying equations 2.3, 1.2, and 1.1 sequentially. Our guess v for the relevant dimension is adjusted during the progress of the algorithm in such a way as to maximize I(v) of equation 2.5, which makes the vector v approach the true relevant dimension ê₁.

The RF can be thought of as a template or special direction ê₁ in the stimulus space¹ such that the neuron's response depends on only a projection of a given stimulus s onto ê₁, although the dependence of the response on this

¹ The notation ê denotes a unit vector, since we are interested only in the direction the vector specifies and not in its length.
projection can be strongly nonlinear (cf. Figure 1). In this simple model, the reverse correlation method (de Boer & Kuyper, 1968; Rieke et al., 1997; Chichilnisky, 2001) can be used to recover the vector ê₁ by analyzing the neuron's responses to gaussian white noise. In a more general case, the probability of the response depends on projections s_i = ê_i · s of the stimulus s on a set of K vectors {ê₁, ê₂, …, ê_K}:

P(spike|s) = P(spike) f(s₁, s₂, …, s_K),   (1.1)
Analyzing Neural Responses to Natural Signals

where P(spike|s) is the probability of a spike given a stimulus s and P(spike) is the average firing rate. In what follows, we will call the subspace spanned by the set of vectors {ê_1, ê_2, ..., ê_K} the relevant subspace (RS).[2] We reiterate that the vectors {ê_i}, 1 ≤ i ≤ K, may also describe how the time dependence of the stimulus s affects the probability of a spike. An example of such a relevant dimension would be a spatiotemporal RF of a visual neuron. Although the ideas developed below can be used to analyze input-output functions f with respect to different neural responses, such as patterns of spikes in time (de Ruyter van Steveninck & Bialek, 1988; Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Reinagel & Reid, 2000), for illustration purposes we choose a single spike as the response of interest.[3]

Equation 1.1 in itself is not yet a simplification if the dimensionality K of the RS is equal to the dimensionality D of the stimulus space. In this article, we will assume that the neuron's firing is sensitive only to a small number of stimulus features (K ≪ D). While the general idea of searching for low-dimensional structure in high-dimensional data is very old, our motivation here comes from work on the fly visual system, where it was shown explicitly that patterns of action potentials in identified motion-sensitive neurons are correlated with low-dimensional projections of the high-dimensional visual input (de Ruyter van Steveninck & Bialek, 1988; Brenner, Bialek, et al., 2000; Bialek & de Ruyter van Steveninck, 2003). The input-output function f in equation 1.1 can be strongly nonlinear, but it is presumed to depend on only a small number of projections. This assumption appears to be less stringent than that of approximate linearity, which one makes when characterizing a neuron's response in terms of Wiener kernels (see, e.g., the discussion in section 2.1.3 of Rieke et al., 1997). The most difficult part in reconstructing the input-output function is to find the RS. Note that for K > 1, a description in terms of any linear combination of the vectors {ê_1, ê_2, ..., ê_K} is just as valid, since we did not make any assumptions as to a particular form of the nonlinear function f. Once the relevant subspace is known, the probability P(spike|s) becomes a function of only a few parameters, and it becomes feasible to map this function experimentally, inverting the probability distributions according to Bayes' rule:

    f(s_1, s_2, \ldots, s_K) = \frac{P(s_1, s_2, \ldots, s_K \mid \mathrm{spike})}{P(s_1, s_2, \ldots, s_K)}.   (1.2)

[2] Since the analysis does not depend on a particular choice of basis within the full D-dimensional stimulus space, for clarity we choose the basis in which the first K basis vectors span the relevant subspace and the remaining D − K vectors span the irrelevant subspace.

[3] We emphasize that our focus here on single spikes is not equivalent to assuming that the spike train is a Poisson process modulated by the stimulus. No matter what the statistical structure of the spike train is, we can always ask what features of the stimulus are relevant for setting the probability of generating a single spike at one moment in time. From an information-theoretic point of view, asking for stimulus features that capture the mutual information between the stimulus and the arrival times of single spikes is a well-posed question, even if successive spikes do not carry independent information. Note also that spikes' carrying independent information is not the same as spikes' being generated as a Poisson process. On the other hand, if (for example) different temporal patterns of spikes carry information about different stimulus features, then analysis of single spikes will result in a relevant subspace of artifactually high dimensionality. Thus, it is important that the approach discussed here carries over without modification to the analysis of relevant dimensions for the generation of any discrete event, such as a pattern of spikes across time in one cell or synchronous spikes across a population of cells. For a related discussion of relevant dimensions and spike patterns using covariance matrix methods, see de Ruyter van Steveninck and Bialek (1988) and Agüera y Arcas, Fairhall, and Bialek (2003).
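Equation 1.2 is just Bayes' rule applied to the projection values: the ratio of the spike-conditional to the prior distribution of projections equals P(spike|x)/P(spike). A small numerical sketch (ours, not from the article; assuming NumPy) confirms this identity for a discrete toy neuron:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy check of equation 1.2: with x standing in for the projections
# (s_1, ..., s_K), the ratio P(x|spike)/P(x) equals P(spike|x)/P(spike).
x = rng.integers(0, 5, size=500_000)                 # binned projection values
p_spike_given_x = np.array([0.01, 0.05, 0.2, 0.5, 0.8])
spike = rng.random(x.size) < p_spike_given_x[x]      # draw spikes per stimulus

p_x = np.bincount(x, minlength=5) / x.size                   # P(x)
p_x_spk = np.bincount(x[spike], minlength=5) / spike.sum()   # P(x|spike)
lhs = p_x_spk / p_x                                  # left side of eq. 1.2
rhs = p_spike_given_x / spike.mean()                 # Bayes: P(spike|x)/P(spike)
assert np.allclose(lhs, rhs, rtol=0.1)
```

The same ratio-of-histograms construction is what the article later uses to sample the nonlinear input-output function along the recovered dimension.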
If stimuli are chosen from a correlated gaussian noise ensemble, then the neural response can be characterized by the spike-triggered covariance method (de Ruyter van Steveninck & Bialek, 1988; Brenner, Bialek, et al., 2000; Schwartz, Chichilnisky, & Simoncelli, 2002; Touryan, Lau, & Dan, 2002; Bialek & de Ruyter van Steveninck, 2003). It can be shown that the dimensionality of the RS is equal to the number of nonzero eigenvalues of a matrix given by the difference between the covariance matrices of all presented stimuli and of stimuli conditional on a spike. Moreover, the RS is spanned by the eigenvectors associated with the nonzero eigenvalues, multiplied by the inverse of the a priori covariance matrix. Compared to the reverse correlation method, we are no longer limited to finding only one of the relevant dimensions {ê_i}, 1 ≤ i ≤ K. Both the reverse correlation and the spike-triggered covariance methods, however, give rigorously interpretable results only for gaussian distributions of inputs. In this article, we investigate whether it is possible to lift the requirement for stimuli to be gaussian. When using natural stimuli, which certainly are nongaussian, the RS cannot be found by the spike-triggered covariance method. Similarly, the reverse correlation method does not give the correct RF, even in the simplest case where the input-output function in equation 1.1 depends on only one projection (see appendix A for a discussion of this point). However, the vectors that span the RS are clearly special directions in the stimulus space, independent of assumptions about P(s). This notion can be quantified by the Shannon information. We note that the stimuli s do not have to lie on a low-dimensional manifold within the overall D-dimensional
T. Sharpee, N. Rust, and W. Bialek
space.[4] However, since we assume that the neuron's input-output function depends on a small number of relevant dimensions, the ensemble of stimuli conditional on a spike may exhibit clear clustering. This makes the proposed method of looking for the RS complementary to the clustering of stimuli conditional on a spike done in the information bottleneck method (Tishby, Pereira, & Bialek, 1999; see also Dimitrov & Miller, 2001). Non-information-based measures of similarity between the probability distributions P(s) and P(s|spike) have also been proposed to find the RS (Paninski, 2003a).

To summarize our assumptions:

• The sampling of the probability distribution of stimuli P(s) is ergodic and stationary across repetitions. The probability distribution is not assumed to be gaussian. The ensemble of stimuli described by P(s) does not have to lie on a low-dimensional manifold embedded in the overall D-dimensional space.

• We choose a single spike as the response of interest (for illustration purposes only). An identical scheme can be applied, for example, to particular interspike intervals or to synchronous spikes from a pair of neurons.

• The subspace relevant for generating a spike is low dimensional and Euclidean (cf. equation 1.1).

• The input-output function, which is defined within the low-dimensional RS, can be arbitrarily nonlinear. It is obtained experimentally by sampling the probability distributions P(s) and P(s|spike) within the RS.

The article is organized as follows. In section 2 we discuss how an optimization problem can be formulated to find the RS. A particular algorithm used to implement the optimization scheme is described in section 3. In section 4 we illustrate how the optimization scheme works with natural stimuli for model orientation-sensitive cells with one and two relevant dimensions, much like the simple and complex cells found in primary visual cortex, as well as for a model auditory neuron responding to natural sounds.
We also discuss the convergence of our estimates of the RS as a function of data set size. We emphasize that our optimization scheme does not rely on any specific statistical properties of the stimulus ensemble and thus can be used with natural stimuli.

[4] If one suspects that neurons are sensitive to low-dimensional features of their input, one might be tempted to analyze neural responses to stimuli that explore only the (putative) relevant subspace, perhaps along the lines of the subspace reverse correlation method (Ringach et al., 1997). Our approach (like the spike-triggered covariance approach) is different because it allows the analysis of responses to stimuli that live in the full space; instead, we let the neuron "tell us" which low-dimensional subspace is relevant.
2 Information as an Objective Function

When analyzing neural responses, we compare the a priori probability distribution of all presented stimuli with the probability distribution of stimuli that lead to a spike (de Ruyter van Steveninck & Bialek, 1988). For gaussian signals, the probability distribution can be characterized by its second moment, the covariance matrix. However, an ensemble of natural stimuli is not gaussian, so that in general neither the second nor any finite number of moments is sufficient to describe the probability distribution. In this situation, Shannon information provides a rigorous way of comparing two probability distributions. The average information carried by the arrival time of one spike is given by (Brenner, Strong, et al., 2000)

    I_{\mathrm{spike}} = \int d\mathbf{s}\, P(\mathbf{s} \mid \mathrm{spike}) \log_2 \frac{P(\mathbf{s} \mid \mathrm{spike})}{P(\mathbf{s})},   (2.1)

where ∫ds denotes integration over the full D-dimensional stimulus space. The information per spike as written in equation 2.1 is difficult to estimate experimentally, since it requires either sampling of the high-dimensional probability distribution P(s|spike) or a model of how spikes were generated, that is, knowledge of the low-dimensional RS. However, it is possible to calculate I_spike in a model-independent way if stimuli are presented multiple times to estimate the probability distribution P(spike|s). Then

    I_{\mathrm{spike}} = \left\langle \frac{P(\mathrm{spike} \mid \mathbf{s})}{P(\mathrm{spike})} \log_2 \frac{P(\mathrm{spike} \mid \mathbf{s})}{P(\mathrm{spike})} \right\rangle_{\mathbf{s}},   (2.2)

where the average is taken over all presented stimuli. This can be useful in practice (Brenner, Strong, et al., 2000), because we can replace the ensemble average ⟨···⟩_s with a time average and P(spike|s) with the time-dependent spike rate r(t). Note that for a finite data set of N repetitions, the obtained value I_spike(N) will on average be larger than I_spike(∞). The true value I_spike can be found either by subtracting an expected bias value, which is of the order of ∼ 1/(2 P(spike) N ln 2) (Treves & Panzeri, 1995; Panzeri & Treves, 1996; Pola, Schultz, Petersen, & Panzeri, 2002; Paninski, 2003b), or by extrapolating to N → ∞ (Brenner, Strong, et al., 2000; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998). Measurement of I_spike in this way provides a model-independent benchmark against which we can compare any description of the neuron's input-output relation.

Our assumption is that spikes are generated according to a projection onto a low-dimensional subspace. Therefore, to characterize the relevance of a particular direction v in the stimulus space, we project all of the presented stimuli onto v and form the probability distributions P_v(x) and P_v(x|spike) of projection values x for the a priori stimulus ensemble and for the ensemble conditional on a spike, respectively:

    P_{\mathbf{v}}(x) = \langle \delta(x - \mathbf{s} \cdot \mathbf{v}) \rangle_{\mathbf{s}},   (2.3)

    P_{\mathbf{v}}(x \mid \mathrm{spike}) = \langle \delta(x - \mathbf{s} \cdot \mathbf{v}) \mid \mathrm{spike} \rangle_{\mathbf{s}},   (2.4)
where δ(x) is a delta function. In practice, both the average ⟨···⟩_s ≡ ∫ds ··· P(s) over the a priori stimulus ensemble and the average ⟨···|spike⟩_s ≡ ∫ds ··· P(s|spike) over the ensemble conditional on a spike are calculated by binning the range of projection values x. The probability distributions are then obtained as histograms, normalized in such a way that the sum over all bins gives 1. The mutual information between spike arrival times and the projection x, by analogy with equation 2.1, is

    I(\mathbf{v}) = \int dx\, P_{\mathbf{v}}(x \mid \mathrm{spike}) \log_2 \frac{P_{\mathbf{v}}(x \mid \mathrm{spike})}{P_{\mathbf{v}}(x)},   (2.5)
which is also the Kullback-Leibler divergence D[P_v(x|spike) ‖ P_v(x)]. Notice that this information is a function of the direction v. The information I(v) provides an invariant measure of how much the occurrence of a spike is determined by the projection on the direction v. It is a function only of direction in the stimulus space and does not change when the vector v is multiplied by a constant. This can be seen by noting that for any probability distribution and any constant c, P_cv(x) = c⁻¹ P_v(x/c) (see also theorem 9.6.4 of Cover & Thomas, 1991). When evaluated along any vector v, I(v) ≤ I_spike. The total information I_spike can be recovered along one particular direction only if v = ê_1, and only if the RS is one-dimensional. By analogy with equation 2.5, one could also calculate the information I(v_1, ..., v_n) along a set of several directions {v_1, ..., v_n}, based on the multipoint probability distributions of the projection values x_1, x_2, ..., x_n along the vectors v_1, v_2, ..., v_n of interest:

    P_{\mathbf{v}_1, \ldots, \mathbf{v}_n}(\{x_i\} \mid \mathrm{spike}) = \left\langle \prod_{i=1}^{n} \delta(x_i - \mathbf{s} \cdot \mathbf{v}_i) \,\Big|\, \mathrm{spike} \right\rangle_{\mathbf{s}},   (2.6)

    P_{\mathbf{v}_1, \ldots, \mathbf{v}_n}(\{x_i\}) = \left\langle \prod_{i=1}^{n} \delta(x_i - \mathbf{s} \cdot \mathbf{v}_i) \right\rangle_{\mathbf{s}}.   (2.7)
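The quantities in equations 2.3 through 2.5 translate directly into histogram operations. The sketch below (our illustration; the function name is ours, assuming NumPy) bins projections of the stimuli onto a candidate direction and evaluates I(v):

```python
import numpy as np

def info_along_direction(stims, spikes, v, bins=25):
    """Estimate I(v) of equation 2.5: bin the projections x = s . v to get
    histogram estimates of P_v(x) (eq. 2.3) and P_v(x|spike) (eq. 2.4),
    then sum P_v(x|spike) * log2[P_v(x|spike) / P_v(x)] over the bins."""
    x = stims @ (v / np.linalg.norm(v))
    edges = np.histogram_bin_edges(x, bins=bins)
    p_x, _ = np.histogram(x, bins=edges)
    p_xs, _ = np.histogram(x, bins=edges, weights=spikes.astype(float))
    p_x = p_x / p_x.sum()
    p_xs = p_xs / p_xs.sum()
    ok = (p_x > 0) & (p_xs > 0)
    return float(np.sum(p_xs[ok] * np.log2(p_xs[ok] / p_x[ok])))
```

For a model cell whose firing depends on a single projection, I(v) evaluated along the true filter should approach I_spike, while a random direction captures far less (cf. Figure 2).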
If we are successful in finding all of the directions {ê_i}, 1 ≤ i ≤ K, contributing to the input-output relation, equation 1.1, then the information evaluated in this subspace will be equal to the total information I_spike. When we calculate information along a set of K vectors that are slightly off from the RS, the answer is of course smaller than I_spike and is initially quadratic in the small deviations δv_i. One can therefore hope to find the RS by maximizing information with respect to K vectors simultaneously. The information does
not increase if more vectors outside the RS are included. For uncorrelated stimuli, any vector or set of vectors that maximizes I(v) belongs to the RS. On the other hand, as discussed in appendix B, the result of optimization with respect to a number of vectors k < K may deviate from the RS if stimuli are correlated. To find the RS, we first maximize I(v) and compare this maximum with I_spike, which is estimated according to equation 2.2. If the difference exceeds that expected from finite sampling corrections, we increment the number of directions with respect to which information is simultaneously maximized.

3 Optimization Algorithm

In this section, we describe a particular algorithm we used to look for the most informative dimensions in order to find the relevant subspace. We make no claim that our choice of algorithm is the most efficient. However, it does give reproducible results for different starting points and for spike trains with differences taken to simulate neural noise. Overall, the choices for an algorithm are broad because the information I(v) as defined by equation 2.5 is a continuous function whose gradient can be computed. We find (see appendix C for a derivation)

    \nabla_{\mathbf{v}} I = \int dx\, P_{\mathbf{v}}(x) \left[ \langle \mathbf{s} \mid x, \mathrm{spike} \rangle - \langle \mathbf{s} \mid x \rangle \right] \cdot \frac{d}{dx} \left[ \frac{P_{\mathbf{v}}(x \mid \mathrm{spike})}{P_{\mathbf{v}}(x)} \right],   (3.1)

where

    \langle \mathbf{s} \mid x, \mathrm{spike} \rangle = \frac{1}{P(x \mid \mathrm{spike})} \int d\mathbf{s}\, \mathbf{s}\, \delta(x - \mathbf{s} \cdot \mathbf{v})\, P(\mathbf{s} \mid \mathrm{spike}),   (3.2)

and similarly for ⟨s|x⟩. Since the information does not change with the length of the vector, we have v · ∇_v I = 0, as can also be seen directly from equation 3.1. As an optimization algorithm, we have used a combination of gradient ascent and simulated annealing: successive line maximizations were done along the direction of the gradient (Press, Teukolsky, Vetterling, & Flannery, 1992). During line maximizations, a point with a smaller value of information was accepted according to Boltzmann statistics, with probability ∝ exp[(I(v_{i+1}) − I(v_i))/T]. The effective temperature T is reduced by a factor of 1 − ε_T upon completion of each line maximization. The parameters of the simulated annealing algorithm to be adjusted are the starting temperature T_0 and the cooling rate ε_T, ΔT = −ε_T T. When maximizing with respect to one vector, we used values T_0 = 1 and ε_T = 0.05. When maximizing with respect to two vectors, we either used the cooling schedule with ε_T = 0.005 and repeated it several times (four times in our case), or allowed the effective temperature T to increase by a factor of 10 upon convergence to a local maximum (keeping T ≤ T_0 always), while limiting the total number of line maximizations.
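The loop just described can be sketched generically. The code below is our schematic, not the authors' implementation: the crude grid line-search, the step range, and the parameter values are illustrative, and an arbitrary objective stands in for I(v):

```python
import numpy as np

def anneal_maximize(objective, grad, v0, t0=1.0, eps_t=0.05,
                    n_steps=200, seed=0):
    """Line maximizations along the gradient with simulated annealing:
    moves that decrease the objective are accepted with probability
    exp(dI / T), and T shrinks by a factor (1 - eps_t) per line search."""
    rng = np.random.default_rng(seed)
    v = v0 / np.linalg.norm(v0)
    t = t0
    best_v, best_val = v, objective(v)
    for _ in range(n_steps):
        g = grad(v)
        # crude line maximization along the gradient direction
        cands = [v + a * g for a in np.linspace(0.1, 1.0, 10)]
        cands = [c / np.linalg.norm(c) for c in cands]
        vals = [objective(c) for c in cands]
        v_new = cands[int(np.argmax(vals))]
        d_val = max(vals) - objective(v)
        # accept uphill moves always; downhill moves by Boltzmann statistics
        if d_val >= 0 or rng.random() < np.exp(d_val / t):
            v = v_new
        if objective(v) > best_val:
            best_v, best_val = v, objective(v)
        t *= 1.0 - eps_t          # cooling schedule, dT = -eps_t * T
    return best_v
```

Because I(v) depends only on direction (v · ∇_v I = 0), each candidate is renormalized to unit length before evaluation.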
Figure 2: The probability distribution of information values, in units of the total information per spike, in the case of (a) uncorrelated binary noise stimuli, (b) correlated gaussian noise with the power spectrum of natural scenes, and (c) stimuli derived from natural scenes (patches of photos). The distribution was obtained by calculating information along 10^5 random vectors for a model cell with one relevant dimension. Note the different scales in the two panels.
The problem of maximizing a function is often related to the problem of making a good initial guess. It turns out, however, that the choice of a starting point is much less crucial in cases where the stimuli are correlated. To illustrate this point, we plot in Figure 2 the probability distribution of information along random directions v for both white noise and naturalistic stimuli in a model with one relevant dimension. For uncorrelated stimuli, not only is the information equal to zero for a vector that is perpendicular to the relevant subspace, but in addition the derivative is equal to zero. Since a randomly chosen vector has on average a small projection on the relevant subspace (compared to its length), v_r/|v| ∼ √(n/d), the corresponding information can be found by expanding in v_r/|v|:

    I \approx \frac{v_r^2}{2|\mathbf{v}|^2} \int dx\, P_{\hat{e}_{ir}}(x) \left( \frac{P'_{\hat{e}_{ir}}(x)}{P_{\hat{e}_{ir}}(x)} \right)^2 \left[ \langle s_{\hat{e}_r} \mid \mathrm{spike} \rangle - \langle s_{\hat{e}_r} \rangle \right]^2,   (3.3)

where the vector v = v_r ê_r + v_ir ê_ir is decomposed into its components inside and outside the RS, respectively. The average information for a random vector is therefore ∼ ⟨v_r²⟩/|v|² = K/D.

In cases where stimuli are drawn from a gaussian ensemble with correlations, the expression for the information values has a similar structure to equation 3.3. To see this, we transform to Fourier space and normalize each Fourier component by the square root of the power spectrum S(k).
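The scaling ⟨v_r²⟩/|v|² = K/D for a random direction is easy to verify numerically. In this sketch (ours, assuming NumPy) the relevant subspace is taken to be the first K coordinate axes, which entails no loss of generality for white stimuli:

```python
import numpy as np

# Mean squared projection of a random unit vector onto a K-dimensional
# relevant subspace of a D-dimensional stimulus space: <v_r^2>/|v|^2 = K/D.
rng = np.random.default_rng(0)
D, K, n_vectors = 400, 2, 20_000
v = rng.standard_normal((n_vectors, D))
v /= np.linalg.norm(v, axis=1, keepdims=True)      # random unit vectors
vr2 = np.sum(v[:, :K] ** 2, axis=1)                # squared norm inside the RS
assert abs(vr2.mean() - K / D) < 1e-3              # K/D = 0.005 here
```

This is why, for uncorrelated stimuli, a random starting vector typically carries almost no information about the relevant subspace.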
In this new basis, both the vectors {ê_i}, 1 ≤ i ≤ K, forming the RS and the randomly chosen vector v along which information is being evaluated are to be multiplied by √S(k). Thus, if we now substitute for the dot product v_r² the convolution weighted by the power spectrum, Σ_{i}^{K} (v ∗ ê_i)², where

    \mathbf{v} * \hat{e}_i = \frac{\sum_k v(k)\, \hat{e}_i(k)\, S(k)}{\sqrt{\sum_k v^2(k) S(k)}\, \sqrt{\sum_k \hat{e}_i^2(k) S(k)}},   (3.4)

then equation 3.3 will describe the information values along randomly chosen vectors v for correlated gaussian stimuli with power spectrum S(k). Although both v_r and v(k) are gaussian variables with variance ∼ 1/D, the weighted convolution not only has a much larger variance but is also strongly nongaussian (the nongaussian character is due to the normalizing factor Σ_k v²(k)S(k) in the denominator of equation 3.4). As for the variance, it can be estimated as ⟨(v ∗ ê_1)²⟩ = 4π/ln²D in cases where stimuli are taken as patches of correlated gaussian noise with the two-dimensional power spectrum S(k) = A/k². The large values of the weighted dot product v ∗ ê_i, 1 ≤ i ≤ K, result not only in significant information values along a randomly chosen vector, but also in large magnitudes of the derivative ∇I, which is no longer dominated by noise, contrary to the case of uncorrelated stimuli. In the end, we find that randomly choosing one of the presented frames as a starting guess is sufficient.

4 Results

We tested the scheme of looking for the most informative dimensions on model neurons that respond to stimuli derived from natural scenes and sounds. As visual stimuli, we used scans across natural scenes, which were taken as black and white photos digitized to 8 bits, with no corrections made for the camera's light-intensity transformation function. Some statistical properties of the stimulus set are shown in Figure 3. Qualitatively, they reproduce the known results on the statistics of natural scenes (Ruderman & Bialek, 1994; Ruderman, 1994; Dong & Atick, 1995; Simoncelli & Olshausen, 2001). The most important properties for this study are the strong spatial correlations, evident from the power spectrum S(k) plotted in Figure 3b, and the deviations of the probability distribution from a gaussian one.
The nongaussian character can be seen in Figure 3c, where the probability distribution of intensities is shown, and in Figure 3d, which shows the distribution of projections on a Gabor filter (in what follows, the units of projections, such as s_1, will be given in units of the corresponding standard deviations). Our goal is to demonstrate that although the correlations present in the ensemble are nongaussian, they can be removed successfully from the estimate of the vectors defining the RS.
Figure 3: Statistical properties of the visual stimulus ensemble. (a) One of the photos. Stimuli would be 30 × 30 patches taken from the overall photograph. (b) The power spectrum, in units of light intensity variance σ²(I), averaged over orientation as a function of the dimensionless wave vector ka, where a is the pixel size. (c) The probability distribution of light intensity in units of σ(I). (d) The probability distribution of projections between stimuli and a Gabor filter, also in units of the corresponding standard deviation σ(s_1).
4.1 A Model Simple Cell. Our first example is based on the properties of simple cells found in the primary visual cortex. A model phase- and orientation-sensitive cell has a single relevant dimension ê_1, shown in Figure 4a. A given stimulus s leads to a spike if the projection s_1 = s · ê_1 reaches a threshold value θ in the presence of noise:

    \frac{P(\mathrm{spike} \mid \mathbf{s})}{P(\mathrm{spike})} \equiv f(s_1) = \langle H(s_1 - \theta + \xi) \rangle,   (4.1)

where a gaussian random variable ξ of variance σ² models additive noise, and the function H(x) = 1 for x > 0 and zero otherwise. Together with the RF ê_1, the threshold θ and the noise variance σ² determine the input-output function.
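A minimal simulation of this model cell (our sketch, assuming NumPy; the function name is ours) draws the noise ξ per stimulus and thresholds the noisy projection:

```python
import numpy as np

def simple_cell_spikes(stims, e1, theta, sigma, rng):
    """Spikes from the model simple cell of equation 4.1: a stimulus s
    spikes when s1 - theta + xi > 0, with s1 = s . e1 and xi gaussian
    noise of standard deviation sigma (H(x) = 1 for x > 0)."""
    s1 = stims @ e1
    xi = rng.normal(0.0, sigma, size=s1.shape)
    return s1 - theta + xi > 0
```

Averaging the spike indicator over many noise draws at fixed s_1 recovers f(s_1), a smoothed step centered on θ.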
Figure 4: Analysis of a model simple cell with the RF shown in (a). (b) The "exact" spike-triggered average v_sta. (c) An attempt to remove correlations according to the reverse correlation method, C⁻¹_{a priori} v_sta. (d) The normalized vector v̂_max found by maximizing information. (e) Convergence of the algorithm according to the information I(v) and the projection v̂ · ê_1 between normalized vectors as a function of the inverse effective temperature T⁻¹. (f) The probability of a spike P(spike|s · v̂_max) (crosses) is compared to the P(spike|s_1) used in generating spikes (solid line). Parameters of the model are σ = 0.31 and θ = 1.84, both given in units of the standard deviation of s_1, which is also the unit for the x-axis in (f).
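The decorrelation compared in panels (b) and (c) of Figure 4 can be sketched numerically for the gaussian case, where the correction v ∝ C⁻¹v_sta is exact. The example below is our illustration (assuming NumPy); the filter shape and covariance are ours, not the article's:

```python
import numpy as np

# For *gaussian* correlated stimuli, multiplying the spike-triggered
# average by the inverse a priori covariance matrix recovers the model
# filter; the text shows this fails for nongaussian natural ensembles.
rng = np.random.default_rng(0)
D, N = 16, 200_000
idx = np.arange(D)
C = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 3.0)   # a priori covariance
s = rng.standard_normal((N, D)) @ np.linalg.cholesky(C).T
e1 = np.exp(-((idx - 8.0) ** 2) / 8.0)                   # localized model RF
e1 /= np.linalg.norm(e1)
spike = s @ e1 > 1.0                                     # threshold nonlinearity
v_sta = s[spike].mean(axis=0)             # raw STA, broadened by correlations
v_dec = np.linalg.solve(C, v_sta)         # decorrelated STA
cos = lambda a, b: abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert cos(v_dec, e1) > cos(v_sta, e1)    # decorrelation sharpens the estimate
assert cos(v_dec, e1) > 0.98
```

For gaussian stimuli, E[s|spike] ∝ C ê_1 regardless of the nonlinearity, which is why the C⁻¹ correction works here and only here.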
When the spike-triggered average (STA), or reverse correlation function, is computed from the responses to correlated stimuli, the resulting vector is broadened by the spatial correlations present in the stimuli (see Figure 4b). For stimuli drawn from a gaussian probability distribution, the effects of correlations could be removed by multiplying v_sta by the inverse of the a priori covariance matrix, according to the reverse correlation method, v̂_Gaussian est ∝ C⁻¹_{a priori} v_sta (equation A.2). However, this procedure tends to amplify noise. To separate errors due to neural noise from those
due to the nongaussian character of the correlations, note that in a model, the effect of neural noise on our estimate of the STA can be eliminated by averaging the presented stimuli weighted with the exact firing rate, as opposed to using a histogram of responses to estimate P(spike|s) from a finite set of trials. We have used this "exact" STA,

    \mathbf{v}_{\mathrm{sta}} = \int d\mathbf{s}\, \mathbf{s}\, P(\mathbf{s} \mid \mathrm{spike}) = \frac{1}{P(\mathrm{spike})} \int d\mathbf{s}\, P(\mathbf{s})\, \mathbf{s}\, P(\mathrm{spike} \mid \mathbf{s}),   (4.2)
in the calculations presented in Figures 4b and 4c. Even with this noiseless STA (the equivalent of collecting an infinite data set), the standard decorrelation procedure is not valid for nongaussian stimuli and nonlinear input-output functions, as discussed in detail in appendix A. The result of such a decorrelation in our example is shown in Figure 4c. It is clearly missing some of the structure in the model filter, with projection ê_1 · v̂_Gaussian est ≈ 0.14. The discrepancy is not due to neural noise or finite sampling, since the "exact" STA was decorrelated; the absence of noise in the exact STA also means that there would be no justification for smoothing the results of the decorrelation. The discrepancy between the true RF and the decorrelated STA increases with the strength of the nonlinearity in the input-output function. In contrast, it is possible to obtain a good estimate of the relevant dimension ê_1 by maximizing information directly (see Figure 4d). Typical progress of the simulated annealing algorithm with decreasing temperature T is shown in Figure 4e. There we plot both the information along the vector and its projection on ê_1. We note that while the information I remains almost constant, the value of the projection continues to improve. Qualitatively, this is because the probability distributions depend exponentially on the information. The final value of the projection depends on the size of the data set, as discussed below. In the example shown in Figure 4, there were ≈ 50,000 spikes with an average probability of a spike of ≈ 0.05 per frame, and the reconstructed vector has a projection v̂_max · ê_1 = 0.920 ± 0.006. Having estimated the RF, one can proceed to sample the nonlinear input-output function. This is done by constructing histograms for P(s · v̂_max) and P(s · v̂_max|spike) of projections onto the vector v̂_max found by maximizing information, and taking their ratio, as in equation 1.2.
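The histogram-ratio step just described can be sketched as follows (our illustration, assuming NumPy; the function name and the sigmoidal nonlinearity used to exercise it are ours):

```python
import numpy as np

def sample_io_function(x, spike, bins=16):
    """Recover the input-output function along an estimated dimension:
    histogram the projections x for all stimuli and for spike-triggering
    stimuli, then take their ratio (equation 1.2). Returns the bin
    centers and the estimate of P(spike|x)/P(spike)."""
    edges = np.histogram_bin_edges(x, bins=bins)
    n_all, _ = np.histogram(x, bins=edges)
    n_spk, _ = np.histogram(x[spike], bins=edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    ratio = np.zeros(bins)
    ok = n_all > 0
    # P(x|spike)/P(x) = P(spike|x)/P(spike) by Bayes' rule
    ratio[ok] = (n_spk[ok] / spike.sum()) / (n_all[ok] / x.size)
    return centers, ratio
```

Multiplying the returned ratio by the mean firing probability gives the absolute spike probability per frame, which is what Figure 4f compares against the model nonlinearity.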
In Figure 4f, we compare P(spike|s · v̂_max) (crosses) with the probability P(spike|s_1) used in the model (solid line).

4.2 Estimated Deviation from the Optimal Dimension. When information is calculated from a finite data set, the (normalized) vector v̂ which maximizes I will deviate from the true RF ê_1. The deviation δv = v̂ − ê_1 arises because the probability distributions are estimated from experimental histograms and differ from the distributions found in the limit of infinite data size. For a simple cell, the quality of reconstruction can be characterized by the projection v̂ · ê_1 = 1 − ½δv², where both v̂ and ê_1 are normalized and δv is by definition orthogonal to ê_1. The deviation δv ∼ A⁻¹∇I, where A is the Hessian of the information. Its structure is similar to that of a covariance matrix:

    A_{ij} = \frac{1}{\ln 2} \int dx\, P(x \mid \mathrm{spike}) \left[ \frac{d}{dx} \ln \frac{P(x \mid \mathrm{spike})}{P(x)} \right]^2 \left( \langle s_i s_j \mid x \rangle - \langle s_i \mid x \rangle \langle s_j \mid x \rangle \right).   (4.3)

When averaged over possible outcomes of N trials, the gradient of information is zero for the optimal direction. Here, in order to evaluate ⟨δv²⟩ = Tr[A⁻¹⟨∇I∇Iᵀ⟩A⁻¹], we need to know the variance of the gradient of I. Assuming that the probability of generating a spike is independent for different bins, we can estimate ⟨∇I_i ∇I_j⟩ ∼ A_ij/(N_spike ln 2). Therefore, the expected error in the reconstruction of the optimal filter is inversely proportional to the number of spikes. The corresponding expected value of the projection between the reconstructed vector and the relevant direction ê_1 is given by

    \hat{\mathbf{v}} \cdot \hat{e}_1 \approx 1 - \tfrac{1}{2} \langle \delta v^2 \rangle = 1 - \frac{\mathrm{Tr}'[A^{-1}]}{2 N_{\mathrm{spike}} \ln 2},   (4.4)
where Tr′ means that the trace is taken in the subspace orthogonal to the model filter.[5] The estimate in equation 4.4 can be calculated without knowledge of the underlying model; it is ∼ D/(2N_spike). This behavior should also hold in cases where the stimulus dimensions are expanded to include time. The errors are expected to increase in proportion to the increased dimensionality. In the case of a complex cell with two relevant dimensions, the error is expected to be twice that for a cell with a single relevant dimension, as also discussed in section 4.3.

We emphasize that the error estimate of equation 4.4 is of the same order as the errors of the reverse correlation method when it is applied to gaussian ensembles. The latter are given by (Tr[C⁻¹] − C⁻¹₁₁)/[2N_spike ⟨f′²(s_1)⟩]. Of course, if the reverse correlation method were applied to a nongaussian ensemble, the errors would be larger. In Figure 5, we show the result of simulations for various numbers of trials, and therefore of N_spike. The average projection of the normalized reconstructed vector v̂ on the RF ê_1 behaves initially as 1/N_spike (dashed line). For smaller data sets, in this case N_spike ≲ 30,000, corrections ∼ N_spike⁻² become important for estimating the expected errors of the algorithm. Happily, these corrections have a sign such that smaller data sets are more effective than one might have expected from the asymptotic calculation. This can be verified from the expansion v̂ · ê_1 = [1 + δv²]^{−1/2} ≈ 1 − ½⟨δv²⟩ + ⅜⟨δv⁴⟩, where only the first two terms were taken into account in equation 4.4.

[5] By definition, δv_1 = δv · ê_1 = 0, and therefore ⟨δv_1²⟩ ∝ A⁻¹₁₁ is to be subtracted from ⟨δv²⟩ ∝ Tr[A⁻¹]. Because ê_1 is an eigenvector of A with zero eigenvalue, A⁻¹₁₁ is infinite. Therefore, the proper treatment is to take the trace in the subspace orthogonal to ê_1.
Figure 5: The projection of the vector v̂_max that maximizes information on the RF ê_1, plotted as a function of the number of spikes. The solid line is a quadratic fit in 1/N_spike, and the dashed line is the leading linear term in 1/N_spike. This set of simulations was carried out for a model visual neuron with one relevant dimension from Figure 4a and the input-output function of equation 4.1, with parameter values σ ≈ 0.61σ(s_1), θ ≈ 0.61σ(s_1). For this model neuron, the linear approximation for the expected error is applicable for N_spike ≳ 30,000.
4.3 A Model Complex Cell. A sequence of spikes from a model cell with two relevant dimensions was simulated by projecting each of the stimuli on vectors that differ by π/2 in their spatial phase, taken to mimic the properties of complex cells, as in Figure 6. A particular frame leads to a spike according to a logical OR, that is, if either |s_1| or |s_2| exceeds a threshold value θ in the presence of noise, where s_1 = s · ê_1 and s_2 = s · ê_2. Similarly to equation 4.1,

    \frac{P(\mathrm{spike} \mid \mathbf{s})}{P(\mathrm{spike})} = f(s_1, s_2) = \langle H(|s_1| - \theta - \xi_1) \vee H(|s_2| - \theta - \xi_2) \rangle,   (4.5)

where ξ_1 and ξ_2 are independent gaussian variables. The sampling of this input-output function by our particular set of natural stimuli is shown in Figure 6c. As is well known, reverse correlation fails in this case because the spike-triggered average stimulus is zero, although with gaussian stimuli, the spike-triggered covariance method would recover the relevant dimensions (Touryan et al., 2002). Here we show that searching for maximally informative dimensions allows us to recover the relevant subspace even under more natural stimulus conditions.
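A quick simulation (our sketch, assuming NumPy; a noise-free version of equation 4.5) shows why the STA carries no signal for this model: the response depends only on |s_1| and |s_2|, so triggering stimuli of opposite sign cancel in the average:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, theta = 10, 400_000, 1.5
e1, e2 = np.eye(D)[0], np.eye(D)[1]
s = rng.standard_normal((N, D))
# noise-free "OR" model of equation 4.5: spike if |s1| or |s2| exceeds theta
spike = (np.abs(s @ e1) > theta) | (np.abs(s @ e2) > theta)
v_sta = s[spike].mean(axis=0)
assert 0.1 < spike.mean() < 0.5          # the cell fires on ~1/4 of the frames
assert np.all(np.abs(v_sta) < 0.03)      # yet its STA is indistinguishable from 0
```

With gaussian stimuli the spike-triggered *covariance* would still expose the two relevant axes through their inflated conditional variance; the information-maximization approach is what extends this to natural ensembles.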
Figure 6: Analysis of a model complex cell with relevant dimensions ê_1 and ê_2, shown in (a) and (b), respectively. Spikes are generated according to an "OR" input-output function f(s_1, s_2) with threshold θ ≈ 0.61σ(s_1) and noise standard deviation σ = 0.31σ(s_1). (c,d) Vectors v_1 and v_2 found by maximizing the information I(v_1, v_2).
We start by maximizing information with respect to one vector. Contrary to the result in Figure 4e for a simple cell, one optimal dimension recovers only about 60% of the total information per spike (see equation 2.2). Perhaps surprisingly, because of the strong correlations in natural scenes, even a projection onto a random vector in the D ∼ 10³-dimensional stimulus space has a high probability of explaining 60% of the total information per spike, as can be seen in Figure 2. We therefore go on to maximize information with respect to two vectors. As a result of the maximization, we obtain two vectors v_1 and v_2, shown in Figure 6. The information along them is I(v_1, v_2) ≈ 0.90, which is within the range of information values obtained along different linear combinations of the two model vectors, I(ê_1, ê_2)/I_spike = 0.89 ± 0.11. Therefore, the description of the neuron's firing in terms of the vectors v_1 and v_2 is complete up to the noise level, and we do not have to look for extra relevant dimensions. Practically, the number of relevant dimensions can be determined by comparing I(v_1, v_2) to either I_spike or I(v_1, v_2, v_3), the latter being the result of maximization with respect to three vectors simultaneously. As mentioned in section 1, information along a set of vectors does not increase when extra dimensions are added to the relevant subspace. Therefore, if I(v_1, v_2) = I(v_1, v_2, v_3) (again, up to the noise level), this means that
there are only two relevant dimensions. Using I_spike for comparison with I(v_1, v_2) has the advantage of not having to look for an extra dimension, which can be computationally intensive. However, I_spike might be subject to larger systematic bias errors than I(v_1, v_2, v_3). The vectors v_1 and v_2 obtained by maximizing I(v_1, v_2) are not exactly orthogonal and are also rotated with respect to ê_1 and ê_2. However, the quality of the reconstruction, as well as the value of the information I(v_1, v_2), is independent of the particular choice of basis within the RS. The appropriate measure of similarity between the two planes is the dot product of their normals. In the example of Figure 6, n̂_(ê_1,ê_2) · n̂_(v_1,v_2) = 0.82 ± 0.07, where n̂_(ê_1,ê_2) is the normal to the plane passing through the vectors ê_1 and ê_2. Maximizing information with respect to two dimensions requires a significantly slower cooling rate and, consequently, longer computational times. However, the expected error in the reconstruction, 1 − n̂_(ê_1,ê_2) · n̂_(v_1,v_2), scales as 1/N_spike, similar to equation 4.4, and is roughly twice that for a simple cell given the same number of spikes. We make the vectors v_1 and v_2 orthogonal to each other upon completion of the algorithm.

4.4 A Model Auditory Neuron with One Relevant Dimension. Because stimuli s are treated as vectors in an abstract space, the method of looking for the most informative dimensions can be applied equally well to auditory as well as to visual neurons. Here we illustrate the method by considering a model auditory neuron with one relevant dimension, which is shown in Figure 7c and is taken to mimic the properties of cochlear neurons. The model neuron is probed by two ensembles of naturalistic stimuli: one is a recording of a native Russian speaker reading a piece of Russian prose, and the other is a recording of a piece of English prose read by a native English speaker.
Both ensembles are nongaussian and exhibit amplitude distributions with long, nearly exponential tails (see Figure 7a), which are qualitatively similar to those of light intensities in natural scenes (Voss & Clarke, 1975; Ruderman, 1994). However, the power spectrum is different in the two cases, as can be seen in Figure 7b. The differences in the correlation structure in particular lead to different STAs across the two ensembles (cf. Figure 7d). Both of the STAs also deviate from the model filter shown in Figure 7c. Despite the differences in the probability distributions P(s), it is possible to recover the relevant dimension of the model neuron by maximizing information. In Figure 7e we show the two most informative vectors found by running the algorithm for the two ensembles and replot the model filter from Figure 7c to show that the three vectors overlap almost perfectly. Thus, different nongaussian correlations can be successfully removed to obtain an estimate of the relevant dimension. If the most informative vector changes with the stimulus ensemble, this can be interpreted as caused by adaptation to the probability distribution.
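The information along any candidate vector v can be estimated directly from the two histograms of the projection x = s · v, with and without a spike. The sketch below illustrates this on a toy threshold neuron; the binning, sample sizes, and model neuron are our own illustrative choices, not the paper's:

```python
import numpy as np

def info_per_spike(stimuli, spikes, v, n_bins=20):
    """Histogram estimate of the information per spike carried by the
    projection x = s . v (binning details are our own choice)."""
    x = stimuli @ v                                   # project stimuli onto v
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    p_x, _ = np.histogram(x, bins=edges)              # ~ P_v(x)
    p_x_spike, _ = np.histogram(x, bins=edges, weights=spikes)  # ~ P_v(x|spike)
    p_x = p_x / p_x.sum()
    p_x_spike = p_x_spike / p_x_spike.sum()
    mask = (p_x > 0) & (p_x_spike > 0)
    return np.sum(p_x_spike[mask] * np.log2(p_x_spike[mask] / p_x[mask]))

# toy example: a model neuron spiking as a threshold function of s . e1
rng = np.random.default_rng(0)
D = 50
e1 = np.zeros(D)
e1[0] = 1.0
s = rng.normal(size=(20000, D))                       # uncorrelated gaussian stimuli
spikes = (s @ e1 > 1.0).astype(float)                 # deterministic threshold nonlinearity
i_relevant = info_per_spike(s, spikes, e1)
i_random = info_per_spike(s, spikes, rng.normal(size=D) / np.sqrt(D))
assert i_relevant > i_random                          # the true filter is most informative
```

Note the contrast with the text: for this uncorrelated toy ensemble a random direction carries little information, whereas for strongly correlated natural scenes the text reports that even a random projection can capture much of the information per spike.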
Analyzing Neural Responses to Natural Signals

[Figure 7 appears here. Panels: (a) P(Amplitude) vs. amplitude (units of RMS); (b) power spectrum S(w) vs. frequency (kHz); (c)–(e) filter components vs. t (ms); (f) P(spike | s · v̂max) vs. s · v̂max.]
Figure 7: A model auditory neuron is probed by two natural ensembles of stimuli: a piece of English prose (1) and a piece of Russian prose (2). The size of the stimulus ensemble was the same in both cases, and the sampling rate was 44.1 kHz. (a) The probability distribution of the sound pressure amplitude in units of standard deviation for both ensembles is strongly nongaussian. (b) The power spectra for the two ensembles. (c) The relevant vector of the model neuron, of dimensionality D = 500. (d) The STA is broadened in both cases but differs between the two cases due to differences in the power spectra of the two ensembles. (e) Vectors that maximize information for either of the ensembles overlap almost perfectly with each other and with the model filter, which is also replotted here from c. (f) The probability of a spike P(spike | s · v̂max) (crosses) is compared to P(spike | s1) used in generating spikes (solid line). The input-output function had parameter values σ ≈ 0.9σ(s1) and θ ≈ 1.8σ(s1).
5 Summary

Features of the stimulus that are most relevant for generating the response of a neuron can be found by maximizing information between the sequence of responses and the projection of stimuli on trial vectors within the stimulus space. Calculated in this manner, information becomes a function of
direction in stimulus space. Those vectors that maximize the information and account for the total information per response of interest span the relevant subspace. The method allows multiple dimensions to be found. The reconstruction of the relevant subspace is done without assuming a particular form of the input-output function. It can be strongly nonlinear within the relevant subspace and is estimated from experimental histograms for each trial direction independently. Most important, this method can be used with any stimulus ensemble, even those that are strongly nongaussian, as in the case of natural signals. We have illustrated the method on model neurons responding to natural scenes and sounds. We expect the current implementation of the method to be most useful when several most informative vectors (≤ 10, depending on their dimensionality) are to be analyzed for neurons probed by natural scenes. This technique could be particularly useful in describing sensory processing in poorly understood regions of higher-level sensory cortex (such as visual areas V2, V4, and IT and auditory cortex beyond A1), where white noise stimulation is known to be less effective than naturalistic stimuli.

Appendix A: Limitations of the Reverse Correlation Method

Here we examine what sort of deviations one can expect when applying the reverse correlation method to natural stimuli, even in the model with just one relevant dimension. There are two factors that, when combined, invalidate the reverse correlation method: the nongaussian character of the correlations and the nonlinearity of the input-output function (Ringach, Sapiro, & Shapley, 1997). In its original formulation (de Boer & Kuyper, 1968), the neuron is probed by white noise, and the relevant dimension ê1 is given by the STA, ê1 ∝ ⟨s r(s)⟩. If the signals are not white, that is, the covariance matrix Cij = ⟨si sj⟩ is not a unit matrix, then the STA is a broadened version of the original filter ê1.
This can be seen by noting that for any function F(s) of the gaussian variables {si}, the following identity holds:

⟨si F(s)⟩ = ⟨si sj⟩ ⟨∂j F(s)⟩,   ∂j ≡ ∂/∂sj.  (A.1)
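Identity A.1 is a form of Stein's lemma for gaussian variables and is easy to verify numerically; the toy covariance and test function below are our own choices:

```python
import numpy as np

# Numerical check of identity A.1: <s_i F(s)> = <s_i s_j> <dF/ds_j>,
# here with F(s) = tanh(s_0) and a 2-d correlated gaussian ensemble.
rng = np.random.default_rng(0)
C = np.array([[1.0, 0.6], [0.6, 1.0]])              # toy covariance matrix
s = rng.normal(size=(2_000_000, 2)) @ np.linalg.cholesky(C).T

F = np.tanh(s[:, 0])
lhs = (s * F[:, None]).mean(axis=0)                 # <s_i F(s)>
grad = np.stack([1.0 - F**2, np.zeros(len(F))], axis=1)  # dF/ds = (sech^2(s_0), 0)
rhs = C @ grad.mean(axis=0)                         # <s_i s_j> <d_j F>
assert np.allclose(lhs, rhs, atol=0.01)             # identity holds up to sampling noise
```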
When property A.1 is applied to the vector components of the STA, ⟨si r(s)⟩ = Cij ⟨∂j r(s)⟩. Since we work within the assumption that the firing rate is a (nonlinear) function of the projection onto one filter ê1, r(s) = r(s1), the latter average is proportional to the model filter itself, ⟨∂j r⟩ = ê1j ⟨r′(s1)⟩. Therefore, we arrive at the prescription of the reverse correlation method,

ê1i ∝ [C⁻¹]ij ⟨sj r(s)⟩.  (A.2)
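For a correlated gaussian ensemble, the prescription of equation A.2 does recover the model filter, as the following sketch shows; the covariance, localized filter, and threshold nonlinearity are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 20, 200_000

# correlated gaussian ensemble with covariance C_ij = exp(-|i-j|/3)
idx = np.arange(D)
C = np.exp(-np.abs(idx[:, None] - idx[None, :]) / 3.0)
s = rng.normal(size=(N, D)) @ np.linalg.cholesky(C).T

# a localized model filter and a threshold input-output function (our choices)
e1 = np.zeros(D)
e1[8], e1[10] = 1.0, -1.0
rate = (s @ e1 > 0.5).astype(float)

sta = (s * rate[:, None]).mean(axis=0)       # <s r(s)>: broadened version of e1
decorrelated = np.linalg.solve(C, sta)       # [C^{-1}]_ij <s_j r(s)>, equation A.2

cos = lambda a, b: abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert cos(decorrelated, e1) > cos(sta, e1)  # the raw STA is broadened
assert cos(decorrelated, e1) > 0.95          # filter recovered up to a scale factor
```

With nongaussian stimuli of the same covariance, the same decorrelation step leaves a biased estimate, which is the point of equation A.6 and Figure 8.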
The gaussian property is necessary in order to represent the STA as a convolution of the covariance matrix Cij of the stimulus ensemble and the model
filter. To understand how the reconstructed vector obtained according to equation A.2 deviates from the relevant one, we consider weakly nongaussian stimuli, with the probability distribution

PnG(s) = (1/Z) P0(s) e^{εH1(s)},  (A.3)
where P0(s) is the gaussian probability distribution with covariance matrix C, and the normalization factor is Z = ⟨e^{εH1(s)}⟩. The function H1 describes deviations of the probability distribution from gaussian, and therefore we will set ⟨si H1⟩ = 0 and ⟨si sj H1⟩ = 0, since these averages can be accounted for in the gaussian ensemble. In what follows, we will keep only the first-order terms in the perturbation parameter ε. Using property A.1, we find the STA to be given by

⟨si r⟩nG = ⟨si sj⟩ [⟨∂j r⟩ + ε ⟨r ∂j H1⟩],  (A.4)
where the averages are taken with respect to the gaussian distribution. Similarly, the covariance matrix Cij evaluated with respect to the nongaussian ensemble is given by

Cij = (1/Z) ⟨si sj e^{εH1}⟩ = ⟨si sj⟩ + ε ⟨si sk⟩ ⟨sj ∂k H1⟩,  (A.5)
so that to first order in ε, ⟨si sj⟩ = Cij − ε Cik ⟨sj ∂k H1⟩. Combining this with equation A.4, we get

⟨si r⟩nG = const × [Cij ê1j + ε Cij ⟨(r − s1 ⟨r′⟩) ∂j H1⟩].  (A.6)
The second term in equation A.6 prevents the application of the reverse correlation method to nongaussian signals. Indeed, if we multiply the STA, equation A.6, by the inverse of the a priori covariance matrix Cij according to the reverse correlation method, equation A.2, we no longer obtain the RF ê1. The deviation of the obtained answer from the true RF increases with ε, which measures the deviation of the probability distribution from gaussian. Since natural stimuli are known to be strongly nongaussian, this makes the use of reverse correlation problematic when analyzing neural responses to natural stimuli. The difference in applying reverse correlation to stimuli drawn from a correlated gaussian ensemble versus a nongaussian one is illustrated in Figures 8b and 8c. In the first case, shown in Figure 8b, stimuli are drawn from a correlated gaussian ensemble with covariance matrix equal to that of natural images. In the second case, shown in Figure 8c, patches of photos are taken as stimuli. The STA is broadened in both cases. Although the two-point correlations are just as strong in the case of gaussian stimuli
Figure 8: The nongaussian character of the correlations present in natural scenes invalidates the reverse correlation method for neurons with a nonlinear input-output function. (a) A model visual neuron has one relevant dimension ê1 and the nonlinear input-output function. The “exact” STA is used (see equation 4.2) to separate the effects of neural noise from alterations introduced by the method. The decorrelated “exact” STA is obtained by multiplying the “exact” STA by the inverse of the covariance matrix, according to equation A.2. (b) Stimuli are taken from a correlated gaussian noise ensemble. The effect of correlations on the STA can be removed according to equation A.2. (c) When patches of photos are taken as stimuli for the same model neuron as in b, the decorrelation procedure gives an altered version of the model filter. The two stimulus ensembles have the same covariance matrix.
as they are in the natural stimulus ensemble, gaussian correlations can be successfully removed from the STA according to equation A.2 to obtain the model filter. On the contrary, an attempt to use reverse correlation with natural stimuli results in an altered version of the model filter. We reiterate that for this example, the apparent noise in the decorrelated vector is not
due to neural noise or finite data sets, since the “exact” STA has been used (see equation 4.2) in all calculations presented in Figures 8 and 9. The reverse correlation method gives the correct answer for any distribution of signals if the probability of generating a spike is a linear function of si, since then the second term in equation A.6 is zero. In particular, a linear input-output relation could arise due to neural noise whose variance is much larger than the variance of the signal itself. This point is illustrated in Figures 9a, 9b, and 9c, where the reverse correlation method is applied to a threshold input-output function at low, moderate, and high signal-to-noise ratios. For small signal-to-noise ratios, where the noise standard deviation is similar to that of the projections s1, the threshold nonlinearity in the input-output function is masked by noise and is effectively linear. In this limit, reverse correlation can be applied with the exact STA. However, for an experimentally calculated STA at low signal-to-noise ratios, the decorrelation procedure results in strong noise amplification. At higher signal-to-noise ratios, decorrelation fails due to the nonlinearity of the input-output function, in accordance with equation A.6.

Appendix B: Maxima of I(v): What Do They Mean?

The relevant subspace of dimensionality K can be found by maximizing information simultaneously with respect to K vectors. Maximization with respect to a number of vectors smaller than the true dimensionality of the relevant subspace may produce vectors that have components in the irrelevant subspace. This happens only in the presence of correlations in stimuli. As an illustration, we consider the situation where the dimensionality of the relevant subspace is K = 2 and vector ê1 describes the most informative direction within the relevant subspace.
We show here that although the gradient of information is perpendicular to both ê1 and ê2, it may have components outside the relevant subspace. Therefore, the vector vmax that corresponds to the maximum of I(v) will lie outside the relevant subspace. We recall from equation 3.1 that

∇I(ê1) = ∫ ds1 P(s1) (⟨s|s1, spike⟩ − ⟨s|s1⟩) d/ds1 [P(s1|spike)/P(s1)].  (B.1)

We can rewrite the conditional averages as ⟨s|s1⟩ = ∫ ds2 P(s1, s2) ⟨s|s1, s2⟩ / P(s1) and ⟨s|s1, spike⟩ = ∫ ds2 f(s1, s2) P(s1, s2) ⟨s|s1, s2⟩ / P(s1|spike), so that

∇I(ê1) = ∫ ds1 ds2 P(s1, s2) ⟨s|s1, s2⟩ [P(spike|s1, s2) − P(spike|s1)] / P(spike) × d/ds1 ln [P(s1|spike)/P(s1)].  (B.2)

Because we assume that the vector ê1 is the most informative within the relevant subspace, ê1 · ∇I = ê2 · ∇I = 0, so that the integral in equation B.2
Figure 9: Application of the reverse correlation method to a model visual neuron with one relevant dimension ê1 and a threshold input-output function, for decreasing values of the noise variance: σ/σ(s1) ≈ 6.1, 0.61, 0.06 in a, b, and c, respectively. The model P(spike|s1) becomes effectively linear when the signal-to-noise ratio is small. Reverse correlation can be used together with natural stimuli if the input-output function is linear. Otherwise, the deviations between the decorrelated STA and the model filter increase with the nonlinearity of P(spike|s1).
is zero for those directions in which the component of the vector ⟨s|s1, s2⟩ changes linearly with s1 and s2. For uncorrelated stimuli, this is true for all directions, so that the most informative vector within the relevant subspace is also the most informative in the overall stimulus space. In the presence of correlations, the gradient may have nonzero components along some irrelevant directions if the projection of the vector ⟨s|s1, s2⟩ on those directions is not a linear function of s1 and s2. By looking for a maximum of information, we will therefore be driven outside the relevant subspace. The deviation of vmax from the relevant subspace is also proportional to the
strength of the dependence on the second parameter s2, because of the factor [P(s1, s2|spike)/P(s1, s2) − P(s1|spike)/P(s1)] in the integrand.

Appendix C: The Gradient of Information

According to expression 2.5, the information I(v) depends on the vector v only through the probability distributions Pv(x) and Pv(x|spike). Therefore, we can express the gradient of information in terms of the gradients of those probability distributions:

∇vI = (1/ln 2) ∫ dx [ ln (Pv(x|spike)/Pv(x)) ∇v Pv(x|spike) − (Pv(x|spike)/Pv(x)) ∇v Pv(x) ],  (C.1)
where we took into account that ∫ dx Pv(x|spike) = 1 and does not change with v. To find the gradients of the probability distributions, we note that

∇v Pv(x) = ∇v [ ∫ ds P(s) δ(x − s · v) ] = − ∫ ds P(s) s δ′(x − s · v) = − d/dx [ Pv(x) ⟨s|x⟩ ],  (C.2)
and analogously for Pv(x|spike):

∇v Pv(x|spike) = − d/dx [ Pv(x|spike) ⟨s|x, spike⟩ ].  (C.3)
Substituting expressions C.2 and C.3 into equation C.1 and integrating once by parts, we obtain

∇vI = ∫ dx Pv(x) [⟨s|x, spike⟩ − ⟨s|x⟩] d/dx [Pv(x|spike)/Pv(x)],

which is expression 3.1 of the main text.

Acknowledgments

We thank K. D. Miller for many helpful discussions. Work at UCSF was supported in part by the Sloan and Swartz foundations and by a training grant from the NIH. Our collaboration began at the Marine Biological Laboratory in a course supported by grants from NIMH and the Howard Hughes Medical Institute.
References

Agüera y Arcas, B., Fairhall, A. L., & Bialek, W. (2003). Computation in a single neuron: Hodgkin and Huxley revisited. Neural Comp., 15, 1715–1749.
Baddeley, R., Abbott, L. F., Booth, M. C. A., Sengpiel, F., Freeman, T., Wakeman, E. A., & Rolls, E. T. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B, 264, 1775–1783.
Barlow, H. (1961). Possible principles underlying the transformations of sensory images. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. (2001). Redundancy reduction revisited. Network: Comput. Neural Syst., 12, 241–253.
Bialek, W. (2002). Thinking about the brain. In H. Flyvbjerg, F. Jülicher, P. Ormos, & F. David (Eds.), Physics of biomolecules and cells (pp. 485–577). Berlin: Springer-Verlag. See also physics/0205030.⁶
Bialek, W., & de Ruyter van Steveninck, R. R. (2003). Features and dimensions: Motion estimation in fly vision. Unpublished manuscript.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552. See also physics/9902067.
Chichilnisky, E. J. (2001). A simple white noise analysis of neuronal light responses. Network: Comput. Neural Syst., 12, 199–213.
Creutzfeldt, O. D., & Nothdurft, H. C. (1978). Representation of complex visual stimuli in the brain. Naturwissenschaften, 65, 307–318.
Cover, T. M., & Thomas, J. A. (1991). Information theory. New York: Wiley.
de Boer, E., & Kuyper, P. (1968). Triggered correlation. IEEE Trans. Biomed. Eng., 15, 169–179.
de Ruyter van Steveninck, R. R., & Bialek, W. (1988). Real-time performance of a movement-sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. Proc. R. Soc. Lond.
B, 265, 259–265.
de Ruyter van Steveninck, R. R., Borst, A., & Bialek, W. (2000). Real time encoding of motion: Answerable questions and questionable answers from the fly's visual system. In J. M. Zanker & J. Zeil (Eds.), Motion vision: Computational, neural and ecological constraints (pp. 279–306). New York: Springer-Verlag. See also physics/0004060.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808. See also cond-mat/9603127.

⁶ Where available we give references to the physics e-print archive, which may be found on-line at http://arxiv.org/abs/*/*; thus, Bialek (2002) is available on-line at http://arxiv.org/abs/physics/0205030. Published papers may differ from the versions posted to the archive.
Dimitrov, A. G., & Miller, J. P. (2001). Neural coding and decoding: Communication channels and quantization. Network: Comput. Neural Syst., 12, 441–472.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network: Comput. Neural Syst., 6, 345–358.
Fairhall, A. L., Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412, 787–792.
Kara, P., Reinagel, P., & Reid, R. C. (2000). Low response variability in simultaneously recorded retinal, thalamic, and cortical neurons. Neuron, 27, 635–646.
Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Neural coding of naturalistic motion stimuli. Network: Comput. Neural Syst., 12, 317–329. See also physics/0103088.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Paninski, L. (2003a). Convergence properties of three spike-triggered analysis techniques. Network: Comput. Neural Syst., 14, 437–464.
Paninski, L. (2003b). Estimation of entropy and mutual information. Neural Computation, 15, 1191–1253.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network: Comput. Neural Syst., 7, 87–107.
Pola, G., Schultz, S. R., Petersen, R., & Panzeri, S. (2002). A practical guide to information analysis of spike trains. In R. Kotter (Ed.), Neuroscience databases: A practical guide (pp. 137–152). Norwell, MA: Kluwer.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20, 5392–5400.
Rieke, F., Bodnar, D. A., & Bialek, W. (1995). Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proc. R. Soc. Lond.
B, 262, 259–265.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Ringach, D. L., Hawken, M. J., & Shapley, R. (2002). Receptive field structure of neurons in monkey visual cortex revealed by stimulation with natural image sequences. Journal of Vision, 2, 12–24.
Ringach, D. L., Sapiro, G., & Shapley, R. (1997). A subspace reverse-correlation technique for the study of visual neurons. Vision Res., 37, 2455–2464.
Rolls, E. T., Aggelopoulos, N. C., & Zheng, F. (2003). The receptive fields of inferior temporal cortex neurons in natural scenes. J. Neurosci., 23, 339–348.
Ruderman, D. L. (1994). The statistics of natural images. Network: Comput. Neural Syst., 5, 517–548.
Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. Lett., 73, 814–817.
Schwartz, O., Chichilnisky, E. J., & Simoncelli, E. (2002). Characterizing neural gain control using spike-triggered covariance. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems (pp. 269–276). Cambridge, MA: MIT Press.
Sen, K., Theunissen, F. E., & Doupe, A. J. (2001). Feature analysis of natural sounds in the songbird auditory forebrain. J. Neurophysiol., 86, 1445–1458.
Simoncelli, E., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annu. Rev. Neurosci., 24, 1193–1216.
Smirnakis, S. M., Berry, M. J., Warland, D. K., Bialek, W., & Meister, M. (1996). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Smyth, D., Willmore, B., Baker, G. E., Thompson, I. D., & Tolhurst, D. J. (2003). The receptive-field organization of simple cells in primary visual cortex of ferrets under natural scene stimulation. J. Neurosci., 23, 4746–4759.
Stanley, G. B., Li, F. F., & Dan, Y. (1999). Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. J. Neurosci., 19, 8036–8042.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Theunissen, F. E., Sen, K., & Doupe, A. J. (2000). Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J. Neurosci., 20, 2315–2331.
Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Allerton Conference on Communication, Control and Computing (pp. 368–377). Urbana: University of Illinois. See also physics/0004057.
Touryan, J., Lau, B., & Dan, Y. (2002). Isolation of relevant visual features from random stimuli for cortical complex cells. J. Neurosci., 22, 10811–10818.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Comp., 7, 399–407.
von der Twer, T., & MacLeod, D. I. A. (2001). Optimal nonlinear codes for the perception of natural colours. Network: Comput. Neural Syst., 12, 395–407.
Vickers, N. J., Christensen, T. A., Baker, T., & Hildebrand, J. G. (2001).
Odour-plume dynamics influence the brain's olfactory code. Nature, 410, 466–470.
Vinje, W. E., & Gallant, J. L. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287, 1273–1276.
Vinje, W. E., & Gallant, J. L. (2002). Natural stimulation of the nonclassical receptive field increases information transmission efficiency in V1. J. Neurosci., 22, 2904–2915.
Voss, R. F., & Clarke, J. (1975). “1/f noise” in music and speech. Nature, 258, 317–318.
Weliky, M., Fiser, J., Hunt, R., & Wagner, D. N. (2003). Coding of natural scenes in primary visual cortex. Neuron, 37, 703–718.

Received January 21, 2003; accepted August 6, 2003.
LETTER
Communicated by Laurence Abbott
Rapid Temporal Modulation of Synchrony by Competition in Cortical Interneuron Networks

P.H.E. Tiesinga
[email protected] Department of Physics and Astronomy, University of North Carolina, Chapel Hill, NC 27599, U.S.A.
T.J. Sejnowski
[email protected] Sloan-Swartz Center for Theoretical Neurobiology, Salk Institute, La Jolla, CA 92037, Computational Neurobiology Lab, Salk Institute, La Jolla, CA 92037, Howard Hughes Medical Institute, Salk Institute, La Jolla, CA 92037, and Department of Biology, University of California–San Diego, La Jolla, CA 92093, U.S.A.
The synchrony of neurons in extrastriate visual cortex is modulated by selective attention even when there are only small changes in firing rate (Fries, Reynolds, Rorie, & Desimone, 2001). We used Hodgkin-Huxley type models of cortical neurons to investigate the mechanism by which the degree of synchrony can be modulated independently of changes in firing rates. The synchrony of local networks of model cortical interneurons interacting through GABAA synapses was modulated on a fast timescale by selectively activating a fraction of the interneurons. The activated interneurons became rapidly synchronized and suppressed the activity of the other neurons in the network, but only if the network was in a restricted range of balanced synaptic background activity. During stronger background activity, the network did not synchronize, and for weaker background activity, the network synchronized but did not return to an asynchronous state afterward. The inhibitory output of the network blocked the activity of pyramidal neurons during asynchronous network activity, and during synchronous network activity, it enhanced the impact of the stimulus-related activity of pyramidal cells on receiving cortical areas (Salinas & Sejnowski, 2001). Synchrony by competition provides a mechanism for controlling synchrony with minor alterations in rate, which could be useful for information processing. Because traditional methods such as cross-correlation and the spike field coherence require several hundred milliseconds of recordings and cannot measure rapid changes in the degree of synchrony, we introduced a new method to detect rapid changes in the degree of coincidence and precision of spike timing.

Neural Computation 16, 251–275 (2004)
© 2003 Massachusetts Institute of Technology
252
P. Tiesinga and T. Sejnowski
1 Introduction

Selective attention greatly enhances the ability of the visual system to detect visual stimuli and to store and recall these stimuli. For instance, in change blindness, differences between two images shown after a brief delay are not visible unless the subject pays attention to the particular spatial location of the change (Simons, 2000; Rensink, 2000). The neural correlate of selective attention has recently been studied in macaque monkeys (Connor, Gallant, Preddie, & Van Essen, 1996; Connor, Preddie, Gallant, & Van Essen, 1997; Luck, Chelazzi, Hillyard, & Desimone, 1997; McAdams & Maunsell, 1999; Treue & Martinez Trujillo, 1999; Reynolds, Pasternak, & Desimone, 2000; Fries, Reynolds, Rorie, & Desimone, 2001; Moore & Armstrong, 2003). A key finding is that attention modulates both the mean firing rate of a neuron in response to a stimulus (McAdams & Maunsell, 1999; Reynolds et al., 2000) and the coherence of spiking with other neurons responsive to the stimulus (Fries et al., 2001). Increased synchrony can boost the impact of spikes on subsequent cortical areas to which these neurons project (Salinas & Sejnowski, 2000, 2001, 2002). These results suggest that mean activity and the degree of coherence may represent independent signals that can be independently controlled. It is unclear biophysically what mechanisms are responsible. For example, when excitatory networks are depolarized, there can be an increase in synchrony, but it is always associated with a large increase in the mean output rate (Aoyagi, Takekawa, & Fukai, 2003). Here we explore the hypothesis that interneuron networks are involved in modulating synchrony. The issue of how the interneuron network affects the pyramidal cells is a complex one that is not the subject of this article (but see section 4). We found that interneuron networks can dynamically modulate their synchrony on timescales of 100 ms with only moderate changes in the mean firing rate.
Synchrony arose by competition between two groups of interneurons and required a specific range of synaptic background activity. To analyze the temporal dynamics of synchrony modulation, we introduced time-resolved measures for the degree of coincidence and precision. The sensitivity of these measures is compared to the spike field coherence (Fries et al., 2001).

2 Methods

2.1 Network Model. The model consisted of a network of N interneurons connected all-to-all. Each neuron had a sodium current INa, a potassium current IK, and a leak current IL and was driven by a white-noise current η with mean Ii and variance 2D. The equation for the membrane potential V reads:

Cm dV/dt = −INa − IK − IL + η(t) + Isyn.  (2.1)
Rapid Modulation of Synchrony
253
Here, Cm = 1 μF/cm² is the membrane capacitance (normalized by area), and Isyn is the synaptic input from the other neurons in the network. For the first NA neurons, the current was Ii = IA + δIi (i = 1, …, NA), and for the remainder, it was Ii = IB + δIi (i = NA + 1, …, N). Heterogeneity in neural properties was represented by δIi, which was drawn from a gaussian distribution with standard deviation σI. IA was increased from I to I + ΔI during a 500 ms interval between t = 1000 and 1500 ms; IB was equal to I for the duration of the simulation. The neurons were connected by GABAA synapses with unitary strength gsyn/N and time constant τsyn = 10 ms. The model equations and implementation are exactly as described by Tiesinga and José (2000a) and are not repeated here. The model used here was adapted from that introduced by Wang and Buzsáki (1996). The standard set of parameters was N = 1000, NA = 250, NB = N − NA = 750, I = 0.7 μA/cm², ΔI = 0.3 μA/cm², gsyn = 0.4 mS/cm², σI = 0.02 μA/cm², and D = 0.02 mV²/ms.

2.2 Statistical Analysis. Spike times were recorded when the membrane potential (V in equation 2.1) crossed 0 mV from below. tni is the ith spike time of the nth neuron; likewise, tmj is the jth spike time of the mth neuron. N is the total number of neurons in the network, and Ns is the number of spikes (by all neurons) generated during a given simulation run. For some calculations, it is convenient to pool the spike times of all neurons into one set, {t1, …, tν, …, tNs}, ordered from low to high values. The ordered set is indexed by ν, where ν = 1 is the earliest spike and ν = Ns is the latest. Most of the statistical quantities determined from the simulations were spike based. That is, for each measurement yν there is an associated spike time tν. For instance, yν could be an interspike interval, and tν could be the first spike time of the interval. The specific choices for y are given below.
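A minimal sketch of the drive and integration scheme of section 2.1; the Hodgkin-Huxley currents of the Wang-Buzsáki model are replaced here by a leaky integrate-and-fire membrane, the synapses are omitted, and the network is scaled down, so only the published drive parameters are the paper's own (the LIF constants g_L, E_L, V_th, V_reset are illustrative):

```python
import numpy as np

# published drive parameters (muA/cm^2, mV^2/ms), network scaled down 10x
N, N_A = 100, 25
I, dI, sigma_I, D = 0.7, 0.3, 0.02, 0.02
# illustrative leaky integrate-and-fire constants (NOT the paper's HH model)
C_m, g_L, E_L, V_th, V_reset = 1.0, 0.05, -65.0, -55.0, -70.0
dt, T = 0.05, 2000.0                              # ms

rng = np.random.default_rng(0)
delta_I = rng.normal(0.0, sigma_I, size=N)        # frozen heterogeneity delta I_i

def drive(t):
    """Per-neuron current I_i(t): group A gets the step dI for 1000-1500 ms."""
    base = I + delta_I
    if 1000.0 <= t < 1500.0:
        base[:N_A] += dI
    return base

V = np.full(N, E_L)
count_a = count_b = 0.0                           # spikes during the step window
for k in range(int(T / dt)):
    t = k * dt
    # Euler-Maruyama step: white-noise current with mean I_i and variance 2D
    eta = drive(t) + np.sqrt(2.0 * D / dt) * rng.normal(size=N)
    V += dt / C_m * (-g_L * (V - E_L) + eta)
    fired = V >= V_th
    V[fired] = V_reset
    if 1000.0 <= t < 1500.0:
        count_a += fired[:N_A].sum()
        count_b += fired[N_A:].sum()

# while the current step is on, group A fires faster than group B
assert count_a / N_A > count_b / (N - N_A)
```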
The simulation runs were divided into three intervals: (I) the baseline state, 0 < t < 1000 ms; (II) the activated state, 1000 ms < t < 1500 ms, during which NA of the neurons received a current step; and (III) the return to baseline, 1500 ms < t < Tmax. Here Tmax is the duration of the simulated trial. Each interval is divided into subintervals A and B: A is the transient during which the network adapts from the old state to the new state, and B is the steady state reached after the transient. For instance, IIA is the onset of the synchronous state, and IIB is the synchronous state itself. Note that the subdivision into A and B depends on network parameters such as I and D. Averages of y were determined for each of these intervals, using all the yν associated with a tν in the interval under consideration. Since the tν are ordered, this is a contiguous set, νb ≤ ν ≤ νe (b stands for begin and e stands for end):
⟨y⟩ = (1/(νe + 1 − νb)) Σ_{ν=νb}^{νe} yν,   ⟨t⟩ = (1/(νe + 1 − νb)) Σ_{ν=νb}^{νe} tν.  (2.2)
We also determined time-resolved averages, ⟨y⟩k and ⟨t⟩k, using a sliding window of length Tav that was translated along the time axis in increments equal to Tincr. The position of the sliding window is indicated by the index k, and the sum is over all ν values with (k − 1)Tincr ≤ tν < (k − 1)Tincr + Tav. For convenience, the set of points (⟨t⟩k, ⟨y⟩k) connected by lines is denoted by y(t). Tav and Tincr are expressed in terms of nav and nincr using the relations Tav = Tmax/nav and Tincr = (Tmax − Tav)/nincr. We used nav = 30, nincr = 90, and Tmax = 3000 ms.

2.2.1 Standard Spike Train Statistics. The ith interspike interval (ISI) of the nth neuron is τni = tn,i+1 − tn,i. The mean ISI for neuron n is denoted by τn, and its average across all neurons is τ. The coefficient of variation (CV) of the ISIs is the standard deviation of the τni across all neurons divided by τ. The interval τni can be associated with three times: the starting spike time of the interval, tni; the end spike time of the interval, tn,i+1; and the mean, (tni + tn,i+1)/2. For the sliding window average, we used tν = (tni + tn,i+1)/2. Each individual ISI was plotted in a scatter plot as a point with coordinates (tni, τni). The mean firing rate f is defined as the number of spikes divided by the duration of the measurement interval. The mean firing rate is determined for each time interval (I–III) for all N network neurons together, as well as separately for the NA activated neurons and the NB nonactivated neurons.

2.2.2 Binned Spike Train Statistics. A binned representation Xn(t) of the spike train of the nth neuron is obtained by setting Xn(t) = 1 when there is a spike time between t − Δt/2 and t + Δt/2 and Xn(t) = 0 otherwise. We took Δt = 1 ms. The spike time histogram, STH = (1000/(N Δt)) X(t), is proportional to the sum of Xn over all neurons, X = Σ_{n=1}^{N} Xn (since Δt is expressed in ms, the factor 1000 is necessary so that STH is in Hz).
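The ISI and binned statistics above can be sketched as follows (NumPy rather than the paper's MATLAB; bin width Δt = 1 ms as in the text, the two synthetic test trains are our own):

```python
import numpy as np

def isi_cv(spike_times):
    """Coefficient of variation of the interspike intervals of one train."""
    isis = np.diff(np.sort(spike_times))
    return isis.std() / isis.mean()

def spike_time_histogram(all_spikes, n_neurons, t_max, dt=1.0):
    """Binned population rate in Hz (times in ms, Delta t = 1 ms by default)."""
    counts, _ = np.histogram(all_spikes, bins=np.arange(0.0, t_max + dt, dt))
    return 1000.0 * counts / (n_neurons * dt)

# a Poisson train has CV near 1; a perfectly periodic train has CV 0
rng = np.random.default_rng(0)
poisson = np.cumsum(rng.exponential(20.0, size=2000))   # ~50 Hz Poisson train
periodic = np.arange(0.0, 1000.0, 20.0)                 # exact 50 Hz clock
assert abs(isi_cv(poisson) - 1.0) < 0.1
assert isi_cv(periodic) < 1e-9
sth = spike_time_histogram(periodic, n_neurons=1, t_max=1000.0)
assert abs(sth.mean() - 50.0) < 1.0                     # mean rate recovered in Hz
```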
The local field potential (LFP) is estimated by filtering X with an exponential filter, exp(−t/τ)/τ, with timescale τ = 5 ms. The sliding window average, σ_LFP(t), of the standard deviation of the LFP was used as a synchrony measure. We also determined the normalized autocorrelation, AC(t) = ⟨(X(s) − X̄)(X(s − t) − X̄)⟩_s / ⟨(X(s) − X̄)(X(s) − X̄)⟩_s, where ⟨·⟩_s denotes the time average and X̄ = ⟨X⟩_s.

2.2.3 Spectral Measures for Neural Synchrony. The coherence of spike trains with the underlying network oscillations was determined using the spike-triggered average (STA) (see Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Dayan & Abbott, 2001, for details). The STA was calculated by taking the LFP from the full network centered around the combined spike times of four neurons, two of which were part of the activated group and the other two of the nonactivated group. The STA power spectrum was calculated using the MATLAB routine psd with standard windowing and a 2,048-point fast Fourier transform. The spike field coherence (SFC) is the STA power spectrum divided by the power spectrum of the LFP (Fries et al., 2001). The SFC for the standard simulations was calculated based on 300 ms long LFP segments (IB: 700 ms ≤ t ≤ 1000 ms and IIB: 1200 ms ≤ t ≤ 1500 ms). This was necessary since synchrony was modulated on a fast timescale. Simulations were also performed for the situation where the baseline state and the activated state both lasted for 2000 ms. The SFC in that case was calculated based on 1500 ms long segments.

2.2.4 Event-Based Measures for Neural Synchrony. During synchronized network activity, the histogram X(t) contains peaks of high activity (events) during which many neurons spike at approximately the same time. Spike times t_ν were clustered into events using MATLAB's fuzzy k-means clustering algorithm, fcm (Ripley, 1996). For each cluster k, the algorithm returns the indices ν_k, ν_k + 1, ..., ν_{k+1} − 1 of the spike times that are part of the cluster. The number of spike times N_k = ν_{k+1} − ν_k in event k and their standard deviation σ_k were determined. For low synchrony, peaks in the STH are broad and overlap; as a result, the clustering procedure could not reliably cluster the spikes into events.

2.2.5 Interspike Distance Synchrony Measures. The CVP measure is based on the idea that during synchronous states, the minimum distance between spikes of different neurons is reduced compared with asynchronous states. The interspike interval of the combined set of network spikes is τ_ν = t_{ν+1} − t_ν. Note that these interspike intervals are between different neurons. The CV is

CVP = sqrt(⟨τ_ν²⟩_ν − ⟨τ_ν⟩_ν²) / ⟨τ_ν⟩_ν,

where P stands for population and ⟨·⟩_ν denotes the average over all intervals. As with regular ISIs, the interval τ_ν can be identified with three times: t_ν, t_{ν+1}, and the mean (t_ν + t_{ν+1})/2. For the sliding window average, CVP(t), the last was used. In an asynchronous network, each neuron fires independently at a constant rate.
The sum of the spike trains is then, to a good approximation, a Poisson spike train, with a coefficient of variation, CVP, equal to one. For a fully synchronous network with N active neurons oscillating with period T, the set τ_ν consists of N_s/N − 1 intervals equal to T and N_s − N_s/N intervals equal to Δ → 0. In this limit,

CVP = sqrt(α − 1), with α = (N_s − 1)/(N_s/N − 1).

For large N_s, this reduces to CVP ≈ √N. Thus, (CVP − 1)/√N is a measure of synchrony normalized between 0 and 1. CVP is sensitive to the precision of the network discharge as well as to the degree of coincidence.
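A minimal sketch of the population CV in Python (the function name is ours): pool the spike times of all neurons, sort them, and take std/mean of the pooled interspike intervals.

```python
import numpy as np

def cv_pop(all_spike_times):
    """CVP: coefficient of variation (std/mean) of the interspike intervals
    of the pooled, sorted spike train of the whole network."""
    t = np.sort(np.asarray(all_spike_times, dtype=float))
    isi = np.diff(t)
    return isi.std() / isi.mean()
```

For N neurons firing in perfect unison over several cycles, the pooled train contains N_s/N − 1 intervals of length T and the rest of length zero, which reproduces CVP = sqrt(α − 1); for a pooled Poisson train, CVP is close to 1.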
The nearest-neighbor distance of spike t_ni from the spikes of neuron m is

Δ_mni = min_j |t_ni − t_mj|.
The minimum is taken over all spike times t_mj of neuron m. There are N_p = (N − 1)N_s different pairs. Note that each pair is counted twice, Δ_mni = Δ_nmj, because it is associated with t_ni as well as with t_mj. The coincidence factor κ is defined as the number of Δ_mni < P, divided by the maximum number of coincidences N_c. Here, P is a preset precision, typically equal to 2 ms. The number of Δ_mni < P cannot exceed N_p; hence, taking N_c = N_p = (N − 1)N_s would yield a number between 0 and 1 (note that N_s is now the total number of spikes in the time interval under consideration). However, when the minimum single-neuron interspike interval is larger than P, the number of coincidences in a pair of neurons (n, m) cannot be larger than the minimum of the number of spikes produced by neurons n and m, formally equal to min(Σ_t X_n(t), Σ_t X_m(t)). Hence, the normalization should be

N_c = Σ_{n≠m} min(Σ_t X_n(t), Σ_t X_m(t)). (2.3)
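A direct implementation of κ with the normalization of equation 2.3, together with the related dispersion estimate Δ used later in this section, might look as follows (Python; a brute-force O(N²·N_s) sketch rather than the authors' code, with all function names ours):

```python
import numpy as np

def nearest_distances(trains):
    """For each spike t_ni and each other neuron m, the nearest-neighbour
    distance Delta_mni = min_j |t_ni - t_mj|.  Returns a flat array."""
    dists = []
    for n, train_n in enumerate(trains):
        for m, train_m in enumerate(trains):
            if m == n or len(train_m) == 0:
                continue
            for t in train_n:
                dists.append(np.min(np.abs(np.asarray(train_m) - t)))
    return np.array(dists)

def coincidence_factor(trains, precision=2.0):
    """kappa: number of Delta_mni below the precision P (2 ms in the text),
    normalized by N_c = sum_{n != m} min(S_n, S_m) from equation 2.3."""
    counts = [len(tr) for tr in trains]
    n_c = sum(min(counts[n], counts[m])
              for n in range(len(trains)) for m in range(len(trains)) if m != n)
    return np.sum(nearest_distances(trains) < precision) / n_c

def jitter_estimate(trains, n_delta):
    """Delta: mean of the n_delta smallest nearest-neighbour distances per
    spike, averaged over all spikes (a proxy for spike time dispersion)."""
    per_spike = []
    for n, train_n in enumerate(trains):
        for t in train_n:
            d = sorted(np.min(np.abs(np.asarray(train_m) - t))
                       for m, train_m in enumerate(trains)
                       if m != n and len(train_m) > 0)
            per_spike.append(np.mean(d[:n_delta]))
    return float(np.mean(per_spike))
```

Because N_c uses min(S_n, S_m) per pair, κ reaches 1 for perfectly coincident discharge even when the two neurons of a pair fire at very different rates, which is the point of equation 2.3.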
During intervals I and III, when the network dynamics is homogeneous, N_p ≈ N_c (the difference is 10% or less). However, when there is a bimodal distribution of firing rates, this approximation is not appropriate. The sliding window average κ(t) was calculated based on Δ_mni, as described above. The density of coincidences was also determined at a finer temporal resolution: K(t) is the number of Δ_mni < P with t − Δt/2 ≤ t_ni ≤ t + Δt/2, normalized by N_p (N_c could not be estimated reliably for these small intervals). The κ used here is related to the quantity described in Wang and Buzsáki (1996), denoted here by κ_WB:

κ_WB = (1/(N(N − 1))) Σ_{n≠m} [Σ_t X_n(t) X_m(t)] / sqrt(Σ_t X_n(t) Σ_t X_m(t)), (2.4)

where Σ_t denotes the sum over the discrete bins (width Δt). The measure κ used here differs from κ_WB in three aspects:
1. κ_WB is based on discrete bins. Hence, it is possible that two spike times with a distance less than Δt are in different bins and considered noncoincident (White, Chow, Ritt, Soto-Treviño, & Kopell, 1998).

2. The normalization of κ_WB does not work if the mean firing rates of the neurons are significantly different. In that case, κ_WB would be smaller than 1 even if all the spikes were coincident. This is because the maximum number of coincident spikes is min(Σ_t X_n(t), Σ_t X_m(t)) rather than the sqrt(Σ_t X_n(t) Σ_t X_m(t)) used in equation 2.4. The normalization by N_c of κ(t) accounts for these effects.
3. The calculation of κ(t) is better suited to tracking fast changes in synchrony.

An estimate of the dispersion of spike times in an event can be obtained from Δ_mni. For a given spike time t_ni, we take the mean of the N_Δ lowest values of Δ_mni. This value is averaged across all t_ni in an averaging time interval of length T_av to obtain Δ(t). When N_Δ is somewhat smaller than the number of neurons firing on each cycle, Δ(t) is proportional to the dispersion in spike times. Typically it is slightly less than the dispersion obtained directly from clustering. The relationship between the value of Δ and the dispersion of the spike times on a cycle was studied using a set of N = 300 deviates of a gaussian probability distribution with a variance of 1. Δ was equal to 0.53 ± 0.02 for N_Δ = 150, asymptoted to 1.13 ± 0.05 for N_Δ = 300, and reached 1 for N_Δ = 278 (errors are the standard deviation across 100 different realizations).

3 Results

3.1 Competition Leads to Dynamical Modulation of Synchrony. We study the dynamics of an all-to-all connected interneuron network. Key parameters are the mean level of depolarizing current I, the variance 2D of the white noise current, and neuronal heterogeneity, modeled as a dispersion σ_I in the driving current across neurons. These three parameters can account for changes occurring in vivo in the level of background synaptic activity and the effects of neurotransmitters and neuromodulators. Previous work shows that synchrony is stable only up to σ_I ≈ 0.10 μA/cm² (Wang & Buzsáki, 1996) and D ≈ 0.10 mV²/ms (Tiesinga & José, 2000a); for higher values, only the asynchronous state is stable. For moderate values of σ_I and D (below these critical values), synchrony is stable only for a large enough driving current I. Based on these results, it is expected that applying a current pulse to a network in a low-synchrony state will lead to increased synchrony. Representative simulation results are shown in Figure 1.
There were N = 1000 neurons in the network; the other parameters were D = 0.02 mV²/ms and σ_I = 0.02 μA/cm². Initially, the driving current was I = 0.7 μA/cm². In this baseline state, the mean firing rate per neuron was f = 8.18 Hz with a CV = 0.38. During the time interval between 1000 ms and 1500 ms, the drive to all neurons was increased to I = 1.0 μA/cm². The firing rate increased by approximately 50% to f = 12.25 Hz; the CV was 0.25 (see Figure 1A). There was little change in synchrony. In contrast, when the depolarizing current was increased for only 250 of the 1000 interneurons, a robust increase in synchrony was observed (see Figure 1B). During that period, the mean firing rate increased only weakly, from f = 8.18 Hz to f = 9.48 Hz (CV = 0.40 → 1.55). It is also possible to induce synchrony by increasing the driving current to all neurons in the network. However, the necessary increase in current is
Figure 1: Competition dynamically modulates synchrony. (A) Driving current I = 0.7 μA/cm² was increased to I = 1.0 μA/cm² during the time interval between 1000 ms and 1500 ms (indicated by the bar in B). (B) Driving current to N_A = 250 out of N = 1000 neurons was increased from I = 0.7 μA/cm² to I = 1.0 μA/cm². The other neurons remained at baseline. Roman numerals in B are explained in the text. Parameters for these simulations were σ_I = 0.02 μA/cm², D = 0.02 mV²/ms, g_syn = 0.4 mS/cm², and τ_syn = 10 ms.
much higher than that used in Figure 1A, and the resulting firing rate increase is also much larger than with the competitive mechanism. In the remainder, we investigate the properties of the synchrony-by-competition mechanism.

3.2 Synchrony Modulation by Competition. We have labeled four distinct time intervals in the histogram shown in Figure 1B. The rastergram for all neurons and the histograms for the activated and nonactivated neurons are shown separately in Figure 2 for the time interval between 500 ms and 2000 ms. (IA) 0 < t < 500 ms: The membrane potential of each neuron at t = 0 was drawn independently from a uniform distribution between −70 mV and −50 mV. Initially, the network was partially synchronized due to the particular realization of the initial condition. The amount of synchrony decreased over the course of 500 ms. (IB) 500 < t < 1000 ms: During the baseline state there was only weak synchrony. (II) 1000 < t < 1500 ms: A subset of N_A = 250 neurons was activated. Their firing rate increased, and they became synchronized within
200 ms. Initially, the N_B = 750 nonactivated neurons were completely suppressed by the low-synchrony activated neurons, but as the synchrony of the activated neurons increased, the nonactivated neurons resumed firing, albeit at a much lower rate. (IIIA) 1500 < t < 1700 ms: The additional drive to the activated neurons was reduced back to baseline. Their firing rate decreased, which released the nonactivated neurons from inhibition. Because their membrane potentials were similar, the network was transiently synchronized. Over the course of 200 ms, the network returned to the asynchronous state. During the baseline state, there were brief episodes of spontaneously enhanced synchrony (see below). The changes in network dynamics are mirrored in the distribution of interspike intervals (see Figure 2B). The interspike intervals (ISIs) are shown as dots in a scatter plot (see Figure 2Ba). The sliding window averages of the mean ISI and its coefficient of variation are shown in Figures 2Bb and 2Bc, respectively. In the baseline state, the ISI distribution was homogeneous, but it had a large dispersion (see Figure 2Ba). However, during the synchronized period, the distribution became bimodal: a set of short intervals with a small dispersion, corresponding to the activated neurons, and a set of long intervals with a large dispersion, due to cycle skipping, associated with the nonactivated neurons (see Figure 2Ba). The sliding window average of the mean ISI was sensitive to the onset and offset of the bimodality associated with synchrony (see Figure 2Bb). Similarly, the CV of the intervals increased during the synchronized period because of the onset of bimodality in the ISI distribution (see Figure 2Bc). However, these time-resolved statistics are not a good measure of synchrony, since synchrony may increase without introducing bimodality in the ISIs.

3.3 Spectral Analysis of Synchrony.
In the experimental analysis of attentional modulation of synchrony, there was an increase in the coherence of multiunit spike trains with the local field potential (LFP) in response to attended stimuli (Fries et al., 2001). Here we used a filtered version of the spike time histogram (STH) as an estimate of the LFP, and we used the spikes obtained by combining the spike trains of four neurons (two activated, two nonactivated) as the multiunit spike train (see Figure 3A). The autocorrelation function AC(t) of the spike time histogram during a 300 ms long synchronized period showed strong periodic peaks, whereas during an equally long asynchronous period, the AC showed periodic modulations at a much lower amplitude (see Figure 3B). Hence, the AC was sensitive to synchrony. The spike-triggered average (STA) of the LFP triggered on the multiunit spike train was also determined for the same period. The spike field coherence (SFC) is the power spectrum of the STA divided by that of the LFP. For the 300 ms segments used before, there was no clear difference in the SFC between the synchronized and asynchronous segments across a single trial (results not shown). The analysis was also performed on 1500 ms long segments from a longer simulation of the same network (see
Figure 2: Mechanism for synchrony modulation by competition. (A) Firing rate dynamics. (a) Rastergram showing the spike trains of 200 of the 1000 neurons. The bottom 50 neurons were activated by the current pulse, and the top 150 neurons received the baseline current throughout the simulation. Spike time histogram of (b) the 250 activated neurons and (c) the 750 nonactivated neurons. (B) Dynamics of the interspike interval (ISI) statistics. (a) Scatter plot: the y-coordinate is the length of the interval, and the x-coordinate is the starting spike time of the interval. Sliding window average of (b) the mean τ̄ and (c) the coefficient of variation CV of the interspike intervals. Data are from the simulation results shown in Figure 1B.
Figure 3: Spike field coherence (SFC) is increased during the synchronous episode. (A, top) The simulated local field potential (LFP) was calculated by convolving the spike time histogram with an exponential filter (time constant 5 ms). (A, bottom) Simulated multiunit recording consisting of the combined spike trains of two activated and two nonactivated neurons. (B) Autocorrelation of the spike time histogram during the synchronous period (solid line; IIB, 1200–1500 ms) and the asynchronous period (dotted line; IB, 700–1000 ms). (C, D) Spike-triggered average (STA) of the LFP triggered on the simulated multiunit recording, and the resulting SFC, during a 1500 ms long synchronous period (solid line) and an asynchronous period of equal length (dashed line). Data for A and B are from the simulation results shown in Figure 1B. Data in C and D are from a 6000 ms long simulation with the same parameters, except that the current pulse was applied during the time interval between 2000 and 4000 ms. The synchronous period used for analysis was between 2500 and 4000 ms, and the asynchronous period was 4500 to 6000 ms.
Figures 3C and 3D). The SFC in the gamma frequency band (30–60 Hz) was two orders of magnitude larger for the synchronized segment than for the asynchronous segment. For our simulation data, the SFC is not a robust measure of changes in synchrony on a 100 ms timescale, but it does work well when longer time intervals are available for averaging.
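The spectral pipeline used here (LFP from the exponentially filtered STH, autocorrelation, STA, and SFC) can be sketched compactly. The Python version below uses numpy's FFT in place of the MATLAB psd routine, so it is a simplified, unwindowed sketch rather than the authors' exact analysis; window lengths and function names are illustrative:

```python
import numpy as np

def lfp_estimate(sth, dt=1.0, tau=5.0):
    """LFP estimate: convolve the STH with a causal exponential kernel
    exp(-t/tau), tau = 5 ms, normalized to unit sum."""
    t = np.arange(0.0, 10.0 * tau, dt)
    kernel = np.exp(-t / tau)
    kernel /= kernel.sum()
    return np.convolve(sth, kernel)[: len(sth)]

def autocorrelation(x, max_lag):
    """AC(t) = <(x(s) - xbar)(x(s - t) - xbar)>_s / <(x(s) - xbar)^2>_s."""
    xc = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.mean(xc * xc)
    return np.array([np.mean(xc[lag:] * xc[: len(xc) - lag])
                     for lag in range(max_lag + 1)]) / denom

def spike_triggered_average(lfp, spike_bins, half_window):
    """Average LFP segments centred on each multiunit spike bin."""
    segs = [lfp[b - half_window : b + half_window] for b in spike_bins
            if half_window <= b <= len(lfp) - half_window]
    return np.mean(segs, axis=0)

def power(x, dt=1.0):
    """One-sided power spectrum; frequencies in Hz for dt in ms."""
    f = np.fft.rfftfreq(len(x), d=dt / 1000.0)
    return f, np.abs(np.fft.rfft(x - np.mean(x))) ** 2

def spike_field_coherence(lfp, spike_bins, half_window=50, dt=1.0):
    """SFC: power spectrum of the STA divided by the LFP power spectrum,
    both computed over windows of 2*half_window bins."""
    f, p_sta = power(spike_triggered_average(lfp, spike_bins, half_window), dt)
    _, p_lfp = power(lfp[: 2 * half_window], dt)
    return f, p_sta / np.maximum(p_lfp, 1e-30)  # guard empty frequency bins
```

Spikes that are phase-locked to a network oscillation yield an STA that retains the oscillation, so the SFC at the oscillation frequency approaches 1; unlocked spikes average the oscillation away and the SFC stays small.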
3.4 Spike-Distance-Based Synchrony Measures. During the synchronous state, many neurons fire at similar times; hence, the distance between spikes of different neurons is reduced compared with the asynchronous state. We calculate two synchrony measures using this fact (see section 2). The coefficient of variation CVP is obtained by pooling the spike times of all neurons, ordering them from low to high values, and then calculating the coefficient of variation of the resulting interspike intervals (see section 2). For an asynchronous population, the combined spike train will form a Poisson process with CVP = 1. For a synchronized network, there will be many small intervals between spikes within a peak and a few long intervals between the first spike of the next peak and the last spike of the preceding peak, yielding CVP ~ √N (here, N is the number of neurons that fire on a given cycle). Note that CVP is different from the CV shown in Figure 2Bc: the latter is based on the interspike intervals between spikes of the same neuron. The sliding window average of CVP could resolve fast changes in neural synchrony and was sensitive to the transient synchrony at the start of the simulation and to an episode of spontaneous synchrony at t ≈ 2800 ms (the circle in Figure 4A). The oscillations in the LFP also increased during synchrony. Hence, the sliding window average σ_LFP of the standard deviation of the LFP was also sensitive to changes in synchrony (see Figure 4B). There are two different aspects of synchrony: the number of coincidences (the number of neurons firing at approximately the same time) and the temporal dispersion of the spike times. CVP and σ_LFP confound these aspects, since for CVP only the distance between a given spike and the closest other spike time is taken into account. A new time-resolved coincidence measure that extends CVP takes into account the nearest spike time of each of the other neurons separately.
Hence, for each spike t_ni (the ith spike of the nth neuron), there are N − 1 nearest distances Δ_mni (m = 1, ..., n − 1, n + 1, ..., N; see section 2). We determined the set of Δ_mni smaller than P = 2 ms. The density K(t) of the coincident pairs and the sliding window average κ(t) are shown in Figures 4C and 4D, respectively. κ(t) was normalized by the number of expected coincidences, as explained in section 2. Note that when κ(t) was normalized by the total number of pairs (N_p; see section 2), it significantly underestimated the number of coincidences during interval II (filled circles versus solid line in Figure 4D). An estimate of the jitter Δ(t) (the inverse of the precision) was obtained by calculating the mean distance of the N_Δ = 100 nearest spike times to a given spike time and averaging over all spike times (see Figure 4E). The number of coincidences and the dispersion of spike times were also calculated using a clustering algorithm for comparison purposes. During the synchronous state, peaks of elevated firing rate occur in the STH (see Figure 5A). The spikes that are part of a peak were determined using a fuzzy k-means clustering algorithm. The number of peaks during a particular time interval was determined by visual inspection and supplied as a parameter to the algorithm. The number N_k of neurons firing on a particular cycle and
Figure 4: Spike distance measures can track synchrony modulation on fast timescales. (A) Sliding window average of the coefficient of variation (CVP) of the interspike intervals between all the network neurons. The arrow points to the decay of the transient synchrony due to the initial condition, and the circle highlights an episode of spontaneously elevated synchrony. (B) Sliding window average σ_LFP of the standard deviation of the local field potential (LFP). The number of spike pairs with a distance Δ_mni smaller than 2 ms, plotted as (C) a histogram K(t) (bin width 1 ms) and (D) a sliding window average κ(t). The small circles in D are κ normalized by N_c, and the solid line is κ normalized by N_p. (E) The sliding window average Δ(t) of the mean distance of the N_Δ = 100 closest spike times to a given spike time. The analysis in A and B used all the spike data, whereas the analysis in C and E was based on the spike trains of 125 activated neurons and 375 nonactivated neurons. Data are from the simulation results shown in Figure 1B.
their temporal dispersion σ_k were calculated and plotted as a function of the centroid of the peak (see Figure 5B). After the onset of the current pulse, the jitter σ_k decreased over time, whereas the number of neurons N_k firing per cycle remained approximately constant. Likewise, when the current pulse was turned off, the jitter increased, and it did so at a much faster rate. Changes in synchrony can thus be resolved on a cycle-by-cycle basis using the jitter. The temporal modulation of Δ(t) (see Figure 4E) closely mimics that of σ_k, shown in Figure 5B.
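The event-based quantities N_k and σ_k can be approximated with any one-dimensional clustering of the pooled spike times. The sketch below uses a plain k-means as a simple stand-in for the MATLAB fuzzy c-means routine (fcm) used in the text; the function name and initialization are ours:

```python
import numpy as np

def cluster_events(spike_times, n_events, n_iter=100):
    """Assign pooled spike times to n_events oscillation cycles with 1-D
    k-means; return event centroids, spike counts N_k, and dispersions sigma_k.

    As in the text, the number of events must be supplied by the user."""
    t = np.sort(np.asarray(spike_times, dtype=float))
    centroids = np.linspace(t[0], t[-1], n_events)  # evenly spaced start
    labels = np.zeros(len(t), dtype=int)
    for _ in range(n_iter):
        # assign each spike to the nearest centroid, then recenter
        labels = np.argmin(np.abs(t[:, None] - centroids[None, :]), axis=1)
        for k in range(n_events):
            if np.any(labels == k):
                centroids[k] = t[labels == k].mean()
    n_k = np.array([np.sum(labels == k) for k in range(n_events)])
    sigma_k = np.array([t[labels == k].std() if n_k[k] > 0 else 0.0
                        for k in range(n_events)])
    return centroids, n_k, sigma_k
```

As noted in section 2, such clustering is only reliable when the peaks in the STH are well separated; for low synchrony the events overlap and the assignment becomes arbitrary.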
Figure 5: Clustering of spike times during the onset of the synchronized period. Each event corresponds to a cycle of the network oscillation. (A) Spike time histogram with the odd-numbered clusters shown in black and the even-numbered clusters in white. (B, left-hand scale) The standard deviation σ_k of the spike times and (B, right-hand scale) the number N_k of neurons that spiked during a cycle, plotted versus the mean spike time during the cycle. The oscillation period between 1200 ms and 1500 ms was 27.5 ms. Data are from the simulation results shown in Figure 1B.
3.5 Synaptic Background Activity Is Necessary for Dynamic Changes in Synchrony. High synchrony occurs when the network discharge has a jitter (σ_k or Δ) of less than 5 ms; for low synchrony, the network discharges with strong temporal modulation of the firing rate but with a jitter of more than 5 ms. An asynchronous network discharge does not have significant firing rate modulations. The neurons in the network received incoherent background activity, which was studied by systematically varying the variance D of a balanced white noise current. Increasing the variability of the input current without increasing its mean corresponds to a larger value of D (Tiesinga, José, & Sejnowski, 2000; Chance, Abbott, & Reyes, 2002). For each value of D, a current pulse of ΔI = 0.3 μA/cm² was applied to 250 out of 1000 neurons during the time interval between 1000 ms and 1500 ms. As before, the simulation was run for 3000 ms and divided into
three intervals, I–III. During the first (I) and third (III) intervals, the network was in its baseline state, and during the second (II) interval, it was in its activated state. The spike time histograms are shown in Figure 6A. For D = 0 mV²/ms, the network synchronized during the first interval and remained synchronized throughout the second and third intervals. During the second interval, the network split into two populations with different firing rates (f = 38 Hz for the activated neurons, whereas the nonactivated population spiked on only 1 in 13 cycles of the network oscillation). Because there was neural heterogeneity, σ_I = 0.02 μA/cm², the jitter σ_k was nonzero, but CVP was close to its fully synchronized value, √N. For D = 0.0004 mV²/ms, network synchrony was stable in the baseline network state: starting from a synchronized state, the network remained synchronized, and starting from an asynchronous state, it became synchronized over time. During the first interval, the network synchronized; synchrony increased further during the second interval, and in the third interval the degree of synchrony continued to increase. At the start of the third interval, the nonactivated neurons all had a similar voltage. Once these neurons were released from the inhibition projected by the activated neurons, they immediately synchronized into a two-cluster state (see below). The mean firing rate was doubled compared with that during the first interval. For D = 0.0084 mV²/ms, the network converged to a low-synchrony state during the first interval, switched to a high-synchrony state during the second interval, and returned slowly to the low-synchrony state in the third interval. The dynamics for D = 0.0164 mV²/ms was similar, except that the return to the low-synchrony state during the third interval was faster. The same holds for D = 0.0284 mV²/ms, except that the degree of synchrony reached during the second interval was reduced.
For D = 0.0364 mV²/ms, the network was in an asynchronous state during the first and third intervals; the current pulse during the second interval was unable to drive the network to a synchronous state. The network dynamics for different values of D is also reflected in the time-resolved coincidence measure κ(t) and jitter Δ(t). The state with dynamic modulation of synchrony is characterized by low values of κ(t) and high values of Δ(t) during the first and third intervals and high values of κ(t) and low values of Δ(t) during the second interval. This is the case only for the curves with D = 0.0084 mV²/ms and D = 0.0204 mV²/ms (see Figure 6B). In Figure 7, the mean firing rate and the synchrony measures CVP, κ, and Δ are plotted versus D for the intervals IB, II, and IIIB. For D < 0.005 mV²/ms, there was multistability. Depending on the distribution of initial voltages across the neurons, the network converged to states with two, three, or more clusters (these clusters should not be confused with those obtained
Figure 6: Synaptic background activity is required for fast synchrony modulation. (A) Five spike time histograms with D increasing from top to bottom. The value of D is indicated in the top left corner of each graph. (B) Sliding window average of (a) the number κ(t) of coincidences and (b) the estimate Δ(t) of the spike time dispersion for D = (1) 0.0004, (2) 0.0044, (3) 0.0084, (4) 0.0204, and (5) 0.0364 (D values in mV²/ms). A current pulse ΔI = 0.3 μA/cm² was applied to 250 of the 1000 neurons during the time interval between t = 1000 ms and 1500 ms (indicated by the bar in each graph). Network parameters are the same as in Figure 1B except that D is varied.
Figure 7: Optimal level of synaptic background activity required for synchrony modulation. (A) The mean firing rate f, (B) CVP, (C) κ, and (D) Δ versus noise strength D for three time intervals: (IB, circles) 500–1000 ms, (II, squares) 1000–1500 ms, and (IIIB, diamonds) 2000–3000 ms. The inset in B is a close-up. Network parameters are the same as in Figure 1B except that D is varied.
in the preceding section using a clustering procedure). For instance, in a two-cluster state, neurons fired once every two cycles, yielding a firing rate of approximately 20 Hz; each neuron belonged to the cluster that was active on either the even or the odd cycles. The dynamics of states with a higher number of clusters is similar (Kopell & LeMasson, 1994). The number of clusters in IB depended on the initial condition, and f, CVP, κ, and Δ had different values depending on the number of clusters. The graphs of these quantities versus D would therefore have a ragged appearance due to this sensitivity to initial conditions; hence, we did not plot them in Figure 7 for small D values in intervals I and II. However, after the application of the current pulse during the second interval, the network settled into the most synchronous state with the highest firing rate. As a result, the network state reached in IIIB did not depend on the initial conditions. For higher D values, these synchronized states became unstable, and the firing rate as well as the values of CVP and κ in the third interval dropped precipitously (see Figure 7, arrow). In summary, synchrony can be modulated in time only within a finite range of background activity, consistent with D values around D = 0.02 mV²/ms. For weaker background activity, it is possible to increase synchrony, but it cannot be shut off; furthermore, the firing rate during synchrony is much higher than at lower synchrony. For stronger background activity, synchrony cannot be established.

4 Discussion

4.1 Measuring Synchrony. Physiological correlates of attention have been observed in the responses of cortical neurons in macaque monkeys to either oriented gratings (area V4) or random dot patterns moving in different directions (area MT) (McAdams & Maunsell, 1999; Treue & Martinez Trujillo, 1999). When attention is directed into the receptive field of the recorded neuron, the gain of the firing rate response may increase (McAdams & Maunsell, 1999; Treue & Martinez Trujillo, 1999), as may the coherence of the neuron's discharge with the local field potential (Fries et al., 2001). The coherence was quantified using the SFC, the power spectrum of the spike-triggered average divided by that of the local field potential. The experimental recordings were 500 ms to 5000 ms in duration, long enough to obtain statistical differences with attention focused either inside or outside the receptive field.
Here we study how synchrony can be modulated dynamically, as it would be during normal behavior. The speed with which attention can be covertly shifted is quite rapid, in the range of a few hundred milliseconds (Shi & Sperling, 2002). Attentional shifts caused by sudden changes in stimulus properties, such as brightness and motion, can take as little as 100 ms. For our simulations, the SFC across a single trial could not resolve rapid changes in synchrony that were obvious from visual inspection of the spike time histogram. For oscillatory synchrony in the gamma frequency range, the fastest timescale on which synchrony can change is the cycle length. Event-based methods could resolve changes in synchrony on these timescales (see Figure 5). We introduced two non-event-based measures that could track the temporal dynamics of synchrony on timescales of about 100 ms: the sliding window average of the standard deviation of the LFP, and CVP. These measures were able to resolve the changes in synchrony that were visible in
the spike time histogram. These measures confound two different aspects of synchrony: the number of coincident spikes per event and their temporal dispersion, corresponding to N_k and σ_k, respectively, in the event-based analysis. Either increasing the number of spikes or decreasing their temporal dispersion will increase the value of σ_LFP or CVP. However, the distribution of the pairwise distances between spikes, used to define κ and Δ, does distinguish between changes in the number of coincident neurons and changes in the temporal dispersion. Two parameter values are involved: P, the precision of the coincidences, and N_Δ, the number of pairwise distances in the average for Δ. For κ, all N_p = (N − 1)N_s ~ N² f T_max pairwise distances need to be determined. The computational load increases quadratically with the network size N and linearly with the firing rate f and the length T_max of the measurement interval; hence, κ is less efficient to compute than CVP. κ(t) is a time-resolved version of the quantity used by Wang and Buzsáki (1996); we adapted it so that it is correctly normalized for networks with a bimodal distribution of firing rates. In summary, CVP is a fast and easy statistic for quantifying synchrony modulations, but κ(t) and Δ(t) separate changes in temporal dispersion from changes in the number of coincident spikes.

4.2 Modulating Synchrony.
The synchronization properties of model networks of interneurons have recently been intensively investigated (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996; Wang & Buzsáki, 1996; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Chow, 1998; Ermentrout & Kopell, 1998; Brunel & Hakim, 1999; Kopell, Ermentrout, Whittington, & Traub, 2000; White, Banks, Pearce, & Kopell, 2000; Tiesinga & José, 2000a; Hansel & Golomb, 2000; Bressloff & Coombes, 2000; Gerstner, 2000; Tiesinga, Fellous, José, & Sejnowski, 2001; Aradi, Santhakumar, Chen, & Soltesz, 2002; Bartos et al., 2002; Borgers & Kopell, 2003; Olufsen, Whittington, Camperi, & Kopell, 2003; Hansel & Mato, 2003; Fransen, 2003). The degree of synchrony, and its robustness against noise (D) and heterogeneity (σ_I), depends primarily on the synaptic coupling strength g_syn and the degree of depolarization I (Wang & Buzsáki, 1996; White et al., 1998; Tiesinga & José, 2000a). For I = 0.7 μA/cm² and g_syn = 0.4 mS/cm², the network studied here is asynchronous (with D = 0.02 mV²/ms and σ_I = 0.02 μA/cm²). During the application of a current pulse to the N_A network neurons, the other N_B neurons were almost completely suppressed. Hence, the full network behaves approximately as a network of N_A neurons with effective synaptic coupling g_eff = (N_A/N)g_syn = 0.1 mS/cm² and driving current I = 1.0 μA/cm². For these parameters, the network is synchronous (Wang & Buzsáki, 1996; Tiesinga & José, 2000a). Synchrony by competition is effective when the baseline state is asynchronous and reached quickly from a synchronized network state (corresponding to an initial condition with all neurons having a similar membrane potential).

P. Tiesinga and T. Sejnowski

The activated state should be synchronous and should be established quickly from an asynchronous state (corresponding to an initial condition with a wide dispersion of membrane potentials across different neurons). In a previous study (Tiesinga & José, 2000a), the focus was on the degree of synchronization in the asymptotic state of the network. Here, we study a network with competition between otherwise identical neurons, and the focus is on the speed with which the network can switch between synchronous and asynchronous states. The strength of the white noise current D, corresponding to the level of balanced synaptic background activity, was critical. In the noiseless network (D = 0), there are multiple stable cluster states. For the baseline state, we found states with two, three, or four approximately equal-sized clusters that were reached from the set of random initial conditions used in the simulations (Golomb & Rinzel, 1993, 1994; Kopell & LeMasson, 1994; van Vreeswijk, 1996; Tiesinga & José, 2000b). The cluster states were stable against weak noise. The transition to the activated state could switch the baseline state from one cluster state to another, but it would not affect the degree of synchrony. The firing rate did, however, vary strongly. For strong noise, neither the baseline nor the activated state was synchronized. Only for intermediate noise strengths could the degree of synchrony be modulated. In this regime, the degree of synchrony reached during the activated state decreased with increasing D, but the transition from the synchronized state to the asynchronous state during interval IIIA was speeded up. For what parameter values can synchrony by competition be obtained? We varied the value of the amplitude of the current pulse, ΔI; the mean of the common driving current, I; the number of activated neurons, N_A; the total number of neurons in the network, N; and the strength of synaptic coupling, g_syn (data not shown).
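The effective-network argument above is simple arithmetic and can be made explicit in a short sketch. The helper name below is ours; N = 100 and N_A = 25 are illustrative values matching the quoted ratio N_A/N = 1/4, and the pulse amplitude ΔI = 0.3 μA/cm² is inferred from the baseline (0.7 μA/cm²) and activated (1.0 μA/cm²) drives quoted above.

```python
def activated_network_params(g_syn, N, N_A, I, dI):
    """With the N_B = N - N_A nonactivated neurons suppressed, the full
    network behaves approximately like N_A neurons with scaled coupling
    and increased drive."""
    g_eff = (N_A / N) * g_syn   # effective synaptic coupling, mS/cm^2
    I_eff = I + dI              # effective driving current, uA/cm^2
    return g_eff, I_eff

# Values from the text: g_syn = 0.4 mS/cm^2, N_A/N = 1/4, I = 0.7 uA/cm^2.
g_eff, I_eff = activated_network_params(0.4, 100, 25, 0.7, 0.3)
```

This reproduces the quoted activated-state parameters, g_eff = 0.1 mS/cm² and I = 1.0 μA/cm².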
For large networks, N > 100, with all-to-all coupling, the network state did not depend on N (Tiesinga & Jos´e, 2000a). Let us denote the ring rate of the baseline network by f1 and that of the NA neurons in the activated network by f2 . f2 should be below but within 10% of the oscillation frequency (40 Hz for gamma frequency range). Mean activity was constant when f1 =f2 ¼ NA =N. Networks with NA > 100 robustly synchronize (Wang & Buzs a´ ki, 1996), and we found that N=NA > 2 works best. The values of I, gsyn, and 1I are less critical: I and gsyn should lead to a ring rate f1 in the appropriate range, whereas I C 1I and .NA =N/gsyn should lead to a synchronous state with a ring rate equal to f2 . This is possible since the higher ring rate state is usually more synchronous (Tiesinga & Jos´e, 2000a). The specic value of 1I determines how fast synchrony is attained in interval II. These parameter values are specic to the model used in our investigation. This raises the issue whether and to what extent the results reported here generalize to other models. When the maximum conductance of the sodium and potassium currents in the model were varied by less than 10%,
we obtained similar results to those reported here. Hence, there is robustness against small changes in model neuron parameters. The key requirements are that the model network has a synchronous state in the gamma frequency range and an asynchronous state and that the degree of synchrony depends on the coupling strength and driving current. These properties hold across different Hodgkin-Huxley-type models (Wang & Buzsáki, 1996; White et al., 1998) and for leaky integrate-and-fire model neurons (Brunel & Hakim, 1999; Hansel & Golomb, 2000). This suggests that synchrony modulation by competition will also be present in these models. The firing rate of neurons recorded in vitro adapts in response to a sustained tonic depolarizing current: at the onset of current injection, neurons fire at a high rate, but the rate decreases and saturates at a lower value (McCormick, Connors, Lighthall, & Prince, 1985; Shepherd, 1998). The role of adaptation in synchronous oscillations has been studied in models (van Vreeswijk & Hansel, 2001; Fuhrman, Markram, & Tsodyks, 2002). We expect that synchrony modulation by competition also works in the presence of adaptation, but that the temporal dynamics of the transition between the activated and nonactivated states may be different. Further investigation of this issue is needed.

4.3 Functional Consequences of Temporal Modulation of Synchrony. The driving hypothesis of this investigation is that attention modulates the synchrony of cortical interneuron networks. We have shown that inhibitory networks can modulate their synchrony on timescales of the order of 100 ms without increasing their mean activity. The question is, How do these synchrony modulations relate to the effects of attention on putative pyramidal neurons recorded in vivo (Connor et al., 1996, 1997; Luck et al., 1997; McAdams & Maunsell, 1999; Treue & Martinez Trujillo, 1999; Reynolds et al., 2000; Fries et al., 2001)?
Attentional modulation can be studied in models using the following conceptual framework. The neuron is thought to receive two sets of inputs: predominantly excitatory, stimulus-related inputs from upstream cortical areas and modulatory synaptic inputs (Chance et al., 2002). Some of the modulatory inputs are from local interneuron networks that project to the principal output neurons in cortical layer 2/3 (Galarreta & Hestrin, 2001). The response to stimulus-related inputs can be characterized using the firing rate versus input (f-I) curve. The f-I curve is determined by measuring the firing rate while the amplitude I of the tonic depolarizing current is systematically varied (sometimes the firing rate of presynaptic excitatory afferents is varied instead). Modulatory inputs alter the f-I curve, leading to a different firing rate in response to the same input. The modulatory changes fall into two categories: leftward shifts of the f-I curve, leading to higher sensitivity as the neuron can respond to weaker inputs, and gain changes, where the firing rate response to any input is multiplied by a constant factor. The attentional modulation of firing rate tuning curves observed
in McAdams and Maunsell (1999) and Treue and Martinez Trujillo (1999) can be explained as a multiplicative gain change of the f-I curve. Recent investigations reveal that the gain of the f-I curve is modulated by the amount of tonic inhibition that a neuron receives (Prescott & De Koninck, 2003; Mitchell & Silver, 2003). However, measurements of the LFP (Fries et al., 2001) indicate that neurons receive temporally patterned inputs in the gamma frequency range. The spike time coherence in the gamma-frequency range increased with attention (Fries et al., 2001). Our hypothesis is that the modulations in gamma-frequency coherence are due to synchrony modulation of local interneuron networks. We studied the f-I curves and the coherence with the LFP of model neurons receiving synchronous inhibition in the gamma-frequency range (José, Tiesinga, Fellous, Salinas, & Sejnowski, 2002). The results are summarized here; full details will be reported elsewhere. We found that changing the degree of synchrony, quantified as the jitter (σ_k in Figure 5), led to gain changes as well as shifts. The mean number of inputs per oscillation cycle (N_k in Figure 5) determined whether the change in gain dominated the shift or vice versa. For N_k > 50, the modulation of the f-I curve was shift dominated. In that case, the firing rate of output neurons could saturate for strong inputs, and an increase in synchrony led to enhanced coherence with the LFP but not to an increase in firing rate. For weak inputs, the input synchrony acted as a gate. For low-synchrony inhibitory inputs, the receiving cortical neuron was shut down, whereas for high synchrony, it could transmit stimulus-related information in its output. For N_k ≈ 10, there were large changes in the gain of the f-I curve that were consistent with the gain modulation of orientation tuning curves reported in McAdams and Maunsell (1999).
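The distinction between gain and shift modulation of the f-I curve can be made operational with a small fitting procedure. The sketch below is illustrative only (the function name and the threshold-linear test curve are our own, not taken from the model in this article): it fits a pure multiplicative-gain model and a pure shift model to a modulated curve and reports which describes the change better.

```python
import numpy as np

def classify_fI_change(I, f_base, f_mod, max_shift=2.0):
    """Fit two one-parameter descriptions of how f_mod differs from f_base:
    a gain model, f_mod(I) ~ g * f_base(I), and a shift model,
    f_mod(I) ~ f_base(I + s). Return the better-fitting description."""
    f_base = np.asarray(f_base, float)
    f_mod = np.asarray(f_mod, float)
    # Least-squares gain: g = <f_base, f_mod> / <f_base, f_base>.
    g = float(np.dot(f_base, f_mod) / np.dot(f_base, f_base))
    gain_err = float(np.mean((f_mod - g * f_base) ** 2))
    # Scan candidate shifts; np.interp clamps outside the sampled range.
    shifts = np.linspace(-max_shift, max_shift, 801)
    errs = [float(np.mean((f_mod - np.interp(I + s, I, f_base)) ** 2))
            for s in shifts]
    best = int(np.argmin(errs))
    if gain_err <= errs[best]:
        return "gain", g
    return "shift", float(shifts[best])

I = np.linspace(0.0, 3.0, 301)
f = 20.0 * np.clip(I - 1.0, 0.0, None)  # threshold-linear f-I curve
kind, value = classify_fI_change(I, f, 1.5 * f)
```

For a purely multiplied curve this returns a "gain" classification with g = 1.5, whereas constructing the modulated curve by translating the baseline along the I axis yields a "shift" classification with the corresponding shift value.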
The modulation of firing rate with input synchrony depended on the oscillation frequency of the interneuron network. We found that the strongest modulation was obtained with oscillations in the gamma-frequency range. To summarize, synchrony by competition in the gamma-frequency range can selectively gate cortical information flow at approximately constant metabolic expense in the interneuron networks. Further investigation is necessary to determine what type of regulation can be performed using this mechanism.

References

Aoyagi, T., Takekawa, T., & Fukai, T. (2003). Gamma rhythmic bursts: Coherence control in networks of cortical pyramidal neurons. Neural Computation, 15, 1035–1062.
Aradi, I., Santhakumar, V., Chen, K., & Soltesz, I. (2002). Postsynaptic effects of GABAergic synaptic diversity: Regulation of neuronal excitability by changes in IPSC variance. Neuropharmacology, 43, 511–522.
Bartos, M., Vida, I., Frotscher, M., Meyer, A., Monyer, H., Geiger, J., & Jonas, P. (2002). Fast synaptic inhibition promotes synchronized gamma oscillations in hippocampal interneuron networks. Proc. Natl. Acad. Sci., 99, 13222–13227.
Borgers, C., & Kopell, N. (2003). Synchronization in networks of excitatory and inhibitory neurons with sparse random connectivity. Neural Computation, 15, 509–538.
Bressloff, P., & Coombes, S. (2000). Dynamics of strongly coupled spiking neurons. Neural Computation, 12, 91–129.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Chance, F., Abbott, L., & Reyes, A. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Chow, C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Connor, C., Gallant, J., Preddie, D., & Van Essen, D. C. (1996). Responses in area V4 depend on the spatial relationship between stimulus and attention. J. Neurophysiol., 75, 1306–1308.
Connor, C., Preddie, D., Gallant, J., & Van Essen, D. C. (1997). Spatial attention effects in macaque area V4. J. Neurosci., 17, 3201–3214.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Ermentrout, G., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci., 95, 1259–1264.
Fransen, E. (2003). Coexistence of synchronized oscillatory and desynchronized rate activity in cortical networks. Neurocomputing, 52–54, 763–769.
Fries, P., Reynolds, J., Rorie, A., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Fuhrman, G., Markram, H., & Tsodyks, M. (2002). Spike frequency adaptation and neocortical rhythms. J. Neurophys., 88, 761–770.
Galarreta, M., & Hestrin, S. (2001). Electrical synapses between GABA-releasing interneurons. Nat. Rev. Neurosci., 2, 425–433.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking.
Neural Computation, 12, 43–89.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Hansel, D., & Golomb, D. (2000). The number of synaptic inputs and the synchrony of large sparse neuronal networks. Neural Computation, 12, 1095–1139.
Hansel, D., & Mato, G. (2003). Asynchronous states and the emergence of synchrony in large networks of interacting excitatory and inhibitory neurons. Neural Computation, 15, 1–56.
José, J., Tiesinga, P., Fellous, J., Salinas, E., & Sejnowski, T. (2002). Is attentional gain modulation optimal at gamma frequencies? Society for Neuroscience Abstracts, 28, 55.6.
Kopell, N., Ermentrout, G., Whittington, M., & Traub, R. (2000). Gamma rhythms and beta rhythms have different synchronization properties. Proc. Natl. Acad. Sci., 97, 1867–1872.
Kopell, N., & LeMasson, G. (1994). Rhythmogenesis, amplitude modulation, and multiplexing in a cortical architecture. Proc. Natl. Acad. Sci., 91, 10586–10590.
Luck, S., Chelazzi, L., Hillyard, S., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J. Neurophys., 77, 24–42.
McAdams, C., & Maunsell, J. (1999). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. J. Neurosci., 19, 431–441.
McCormick, D., Connors, B., Lighthall, J., & Prince, D. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophys., 54, 782–806.
Mitchell, S., & Silver, R. (2003). Shunting inhibition modulates neuronal gain during synaptic excitation. Neuron, 38, 433–445.
Moore, T., & Armstrong, K. (2003). Selective gating of visual signals by microstimulation of frontal cortex. Nature, 421, 370–373.
Olufsen, M., Whittington, M., Camperi, M., & Kopell, N. (2003). New roles for the gamma rhythm: Population tuning and preprocessing for the beta rhythm. J. Comput. Neurosci., 14, 33–54.
Prescott, S., & De Koninck, Y. (2003). Gain control of firing rate by shunting inhibition: Roles of synaptic noise and dendritic saturation. Proc. Natl. Acad. Sci., 100, 2071–2081.
Rensink, R. (2000). The dynamic representation of scenes. Visual Cognition, 7, 17–42.
Reynolds, J., Pasternak, T., & Desimone, R. (2000). Attention increases sensitivity of V4 neurons. Neuron, 26, 703–714.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Salinas, E., & Sejnowski, T. (2000).
Impact of correlated synaptic input on output variability in simple neuronal models. J. Neurosci., 20, 6193–6209.
Salinas, E., & Sejnowski, T. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Salinas, E., & Sejnowski, T. (2002). Integrate-and-fire neurons driven by correlated stochastic input. Neural Computation, 14, 2111–2155.
Shepherd, G. (1998). Synaptic organization of the brain (4th ed.). New York: Oxford University Press.
Shi, S., & Sperling, G. (2002). Measuring and modeling the trajectory of visual spatial attention. Psychological Review, 109, 260–305.
Simons, D. (2000). Current approaches to change blindness. Visual Cognition, 7, 1–15.
Tiesinga, P., Fellous, J.-M., José, J., & Sejnowski, T. (2001). Computational model of carbachol-induced delta, theta and gamma oscillations in the hippocampus. Hippocampus, 11, 251–274.
Tiesinga, P., & José, J. (2000a). Robust gamma oscillations in networks of inhibitory hippocampal interneurons. Network, 11, 1–23.
Tiesinga, P., & José, J. (2000b). Synchronous clusters in a noisy inhibitory network. J. Comp. Neurosci., 9, 49–65.
Tiesinga, P., José, J., & Sejnowski, T. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Traub, R., Whittington, M., Colling, S., Buzsáki, G., & Jefferys, J. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484.
Treue, S., & Martinez Trujillo, J. (1999). Feature-based attention influences motion processing gain in macaque visual cortex. Nature, 399, 575–579.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., & Hansel, D. (2001). Patterns of synchrony in neural networks with spike adaptation. Neural Computation, 13, 959–992.
Wang, X., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
White, J., Banks, M., Pearce, R., & Kopell, N. (2000). Networks of interneurons with fast and slow γ-aminobutyric acid type A (GABA_A) kinetics provide substrate for mixed gamma-theta rhythm. Proc. Natl. Acad. Sci., 97, 8128–8133.
White, J., Chow, C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.

Received June 24, 2003; accepted August 6, 2003.
LETTER
Communicated by Jonathan Victor
Dynamic Analyses of Information Encoding in Neural Ensembles Riccardo Barbieri
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital/Harvard Medical School, Boston, MA 02114, U.S.A.
Loren M. Frank
[email protected] David P. Nguyen
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital; Division of Health, Sciences, and Technology, Harvard Medical School/MIT, Boston, MA 02114, U.S.A.
Michael C. Quirk
[email protected] Picower Center for Learning and Memory, Riken-MIT Neuroscience Research Center, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Victor Solo
[email protected] School of Electrical Engineering and Telecommunications, University of New South Wales,2052,Sydney, Australia;Martinos Center for BiomedicalImaging,Massachusetts General Hospital/Harvard Medical School, Boston, MA 02114, U.S.A.
Matthew A. Wilson
[email protected] Picower Center for Learning and Memory, Riken-MIT Neuroscience Research Center, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Emery N. Brown
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital; Division of Health, Sciences, and Technology, Harvard Medical School/MIT, Boston, Massachusetts 02114-2698, U.S.A.
Neural Computation 16, 277–307 (2004)
© 2003 Massachusetts Institute of Technology
Neural spike train decoding algorithms and techniques to compute Shannon mutual information are important methods for analyzing how neural systems represent biological signals. Decoding algorithms are also one of several strategies being used to design controls for brain-machine interfaces. Developing optimal strategies to design decoding algorithms and compute mutual information are therefore important problems in computational neuroscience. We present a general recursive filter decoding algorithm based on a point process model of individual neuron spiking activity and a linear stochastic state-space model of the biological signal. We derive from the algorithm new instantaneous estimates of the entropy, entropy rate, and the mutual information between the signal and the ensemble spiking activity. We assess the accuracy of the algorithm by computing, along with the decoding error, the true coverage probability of the approximate 0.95 confidence regions for the individual signal estimates. We illustrate the new algorithm by reanalyzing the position and ensemble neural spiking activity of CA1 hippocampal neurons from two rats foraging in an open circular environment. We compare the performance of this algorithm with a linear filter constructed by the widely used reverse correlation method. The median decoding error for Animal 1 (2) during 10 minutes of open foraging was 5.9 (5.5) cm, the median entropy was 6.9 (7.0) bits, the median information was 9.4 (9.4) bits, and the true coverage probability for 0.95 confidence regions was 0.67 (0.75) using 34 (32) neurons. These findings improve significantly on our previous results and suggest an integrated approach to dynamically reading neural codes, measuring their properties, and quantifying the accuracy with which encoded information is extracted.
1 Introduction

Neural spike train decoding algorithms are commonly used methods for analyzing how neural systems represent biological signals (Georgopoulos, Kettner, & Schwartz, 1986; Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Wilson & McNaughton, 1993; Warland, Reinagel, & Meister, 1997; Brown, Frank, Tang, Quirk, & Wilson, 1998; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Stanley, Li, & Dan, 1999; Pouget, Dayan, & Zemel, 2000). More recently, these algorithms have also become one of several strategies being used to design controls for neural prosthetic devices (Chapin, Moxon, Markowitz, & Nicolelis, 1999; Wessberg et al., 2000; Shoham, 2001; Serruya, Hatsopoulos, Paninski, Fellows, & Donoghue, 2002; Taylor, Tillery, & Schwartz, 2002) and other brain-machine interfaces (Donoghue, 2002; Wickelgren, 2003). Developing optimal strategies for constructing and testing decoding algorithms is therefore an important question in computational neuroscience. The rat hippocampus has specialized neurons known as place cells with spatial receptive fields that carry a representation of the animal's environment (O'Keefe & Dostrovsky, 1971). The spatial information carried by the
spiking activity of the hippocampal place cells is believed to be an important component of the rat's spatial navigation system (O'Keefe & Nadel, 1978). We previously derived the Bayes' filter algorithm to decode the position of a rat foraging in an open environment from the ensemble spiking activity of pyramidal neurons in the CA1 region of the hippocampus (Brown et al., 1998). We represented the spatial receptive field of each neuron as a two-dimensional gaussian surface and assumed that the position of the animal in the receptive field through time parameterized the rate function of an inhomogeneous Poisson process. We modeled the animal's path as a random walk. The Bayes' filter was a causal, recursive decoding algorithm developed by computing gaussian approximations to the well-known Bayes' rule and the Chapman-Kolmogorov system of equations for state-space modeling (Mendel, 1995; Kitagawa & Gersch, 1996). The algorithm computed position estimates and their confidence regions at 33 msec intervals. Because the algorithm depended critically on the form of the spatial gaussian model, it could not be used with other models of the place fields. While representing the place fields as gaussian surfaces and the path as a random walk was reasonable, these models gave only an approximate description of these data (Brown et al., 1998). Methods to compute the Shannon mutual information (Skaggs, McNaughton, Gothard, & Marcus, 1993; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Dan, Alonso, Usrey, & Reid, 1998; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998; Reinagel & Reid, 2000; Reich, Melcher, & Victor, 2001; Johnson, Gruner, Baggerly, & Seshagiri, 2001; Nirenberg, Carcieri, Jacobs, & Latham, 2001; Victor, 2002), rather than decoding algorithms, are perhaps the most widely used techniques for analyzing how neural systems encode biological signals.
None of these methods uses a parametric statistical model of the relation between spiking activity and the biological signal, and none computes mutual information estimates instantaneously (i.e., on the millisecond timescale of the decoded signal updates). Moreover, current analyses do not fully exploit the link between mutual information and decoding. We derive a general recursive filter decoding algorithm by using a point process to represent the spiking activity of individual neurons and a linear stochastic state-space model (Smith & Brown, 2003) to represent the animal's path. From the algorithm, we derive new instantaneous estimates of the mutual information between position and the ensemble spiking activity, the entropy, and the entropy rate. We assess the accuracy of the algorithm by computing, along with the decoding error, the true coverage probability of the approximate 0.95 confidence regions for the individual position estimates. We illustrate the new algorithm by reanalyzing the position and ensemble neural spiking activity of CA1 hippocampal neurons from two rats foraging in an open circular environment (Brown et al., 1998). For the point process, we use an inhomogeneous Poisson model with spatial receptive fields modeled as both gaussian surfaces and Zernike polynomial
expansions (Born & Wolf, 1989), and for the linear state-space model of the path, we use a bivariate first-order autoregressive process. We compare the performance of these algorithms with a linear filter algorithm constructed by the widely used reverse correlation method (Warland, Reinagel, & Meister, 1997; Stanley et al., 1999; Wessberg et al., 2000; Serruya et al., 2002).

2 Theory

In this section, we develop the theoretical framework for our approach. First, in the encoding analysis, we define the relation between position and spiking activity for individual place cell neurons with two inhomogeneous Poisson models. In the first model, the dependence of the rate function on space is defined as a gaussian surface, whereas in the second model, the spatial dependence is represented as an expansion in Zernike polynomials. We describe how the two models are fit to the experimental data by maximum likelihood. Second, in the decoding analysis, we describe the autoregressive model of the path and the recursive filter algorithm we will use to decode position from the ensemble place cell spiking activity. Third, we describe several properties of the recursive filter algorithm. Fourth, we present the reverse correlation algorithm, and finally, we present our algorithms for computing instantaneous entropy, entropy rate, and mutual information.

2.1 Encoding Analysis: Place Cell Model. To define the point process observation model, we consider two inhomogeneous Poisson models for representing the place cell spiking activity.
In the first, we model the rate or conditional intensity function λ_G^c(t | x(t), θ_G^c) as a two-dimensional gaussian surface defined as

\[
\lambda_G^c(t \mid x(t), \theta_G^c) = \exp\left\{ \alpha^c - \frac{1}{2}\,(x(t) - \mu^c)'(Q^c)^{-1}(x(t) - \mu^c) \right\}, \tag{2.1}
\]

where x(t) = (x_1(t), x_2(t)) are the coordinates of the animal's position at time t, μ^c = (μ_1^c, μ_2^c) are the coordinates of the place field's center,

\[
Q^c = \begin{pmatrix} (\sigma_1^c)^2 & 0 \\ 0 & (\sigma_2^c)^2 \end{pmatrix} \tag{2.2}
\]

is the scale matrix, α^c is the log maximum firing rate (Brown et al., 1998), and θ_G^c = (α^c, μ^c, Q^c)', where c indexes a neuron, for c = 1, …, C. As in Brown et al. (1998), we have taken the off-diagonal elements of Q^c to be zero because for each neuron, these parameter estimates were all close to zero. In the second, we model the conditional intensity function λ_z^c(t | x(t), θ_z^c) as an exponentiated linear combination of Zernike polynomials given as

\[
\lambda_z^c(t \mid x(t), \theta_z^c) = \exp\left\{ \sum_{\ell=0}^{L} \sum_{m=-\ell}^{\ell} \theta_{\ell,m}^c \, z_\ell^m(\rho(t), \phi(t)) \right\}, \tag{2.3}
\]
where z_ℓ^m is the mth component of the ℓth order Zernike polynomial, θ_{ℓ,m}^c is the associated coefficient, ρ(t) = r^{-1}[(x_1(t) − η_1)² + (x_2(t) − η_2)²]^{1/2}, φ(t) = tan^{-1}[(x_2(t) − η_2)(x_1(t) − η_1)^{-1}], (η_1, η_2) are the coordinates of the center of the circular environment, r is the radius of the circular environment, and η_1 = η_2 = r = 35 cm. We have θ_z^c = {{θ_{ℓ,m}^c}_{m=-ℓ}^{ℓ}}_{ℓ=0}^{L}, the coefficients in equation 2.3. The Zernike polynomials form an orthogonal basis whose support is restricted to the circular environment (Born & Wolf, 1989) and are defined as

\[
z_\ell^m(\rho(t), \phi(t)) =
\begin{cases}
R_\ell^m(\rho(t)) \sin(m\phi(t)) & m > 0 \\
R_\ell^m(\rho(t)) \cos(m\phi(t)) & m < 0
\end{cases} \tag{2.4}
\]

\[
R_\ell^m(\rho(t)) =
\begin{cases}
\displaystyle\sum_{j=0}^{(\ell-|m|)/2} \frac{(-1)^j (\ell - j)!}{j!\left(\frac{\ell+m}{2} - j\right)!\left(\frac{\ell-m}{2} - j\right)!}\; \rho(t)^{\ell-2j} & (\ell - m) \text{ even} \\[2ex]
0 & (\ell - m) \text{ odd,}
\end{cases} \tag{2.5}
\]
where 0 ≤ ρ(t) ≤ 1, 0 ≤ φ(t) ≤ 2π, and 0 ≤ |m| ≤ ℓ. To balance the tradeoff between model flexibility and computational complexity, we chose L = 3. Based on equation 2.4, there are 10 nonzero coefficients for L = 3 in equation 2.3. We assume that during the experiment, the neural spiking activity of an ensemble of C neurons is recorded on an interval (0, T]. We take the subinterval (0, T^e] to be the encoding interval and the subinterval (T^e, T] to be the decoding interval. The spiking activity recorded during the encoding interval will be used to estimate the model parameters, whereas the spiking activity recorded during the decoding interval will be used to estimate, or decode, the animal's position from the ensemble neural spiking activity through time. We define the encoding interval to be the first 15 minutes of the experiment for Animal 1 and the first 13 minutes for Animal 2. Let 0 < μ_1^c < μ_2^c

[...]

|W_{k|k}| and is positive if |W_{k-1|k-1}| < |W_{k|k}|. If there is less uncertainty about the animal's position at step k than at step k − 1, that is, |W_{k-1|k-1}| > |W_{k|k}|, then the entropy decreases at step k, whereas if there is more uncertainty about the animal's position at step k than at step k − 1, that is, |W_{k|k}| > |W_{k-1|k-1}|, then the entropy at step k increases. We can use the relation between the covariance matrices and the entropy rate to analyze whether the value of the entropy rate in ((k − 1)Δ, kΔ] is due to either the AR(1) model of the path or the spiking activity of the neural ensemble. By equation 2.11, W_{k|k-1} = F W_{k-1|k-1} F' + W_ε. Therefore, if |W_{k|k-1}| > |W_{k-1|k-1}|, then the AR(1) path model tends to increase the uncertainty, and hence the entropy, about the animal's location at any step k of
the algorithm. That is, the evolution of the path as described by the AR(1) model makes the entropy rate positive in ((k − 1)Δ, kΔ]. For the entropy rate to be negative or, equivalently, for |W_{k|k}| to decrease, the ensemble spiking activity must cause the sum on the right-hand side of equation 2.13 to be negative. The activity of the neural ensemble can further increase the entropy rate beyond the increase due to the path model if the spiking activity makes the second term on the right-hand side of equation 2.13 positive. However, because the path model makes the entropy rate positive, a decrease in the entropy, or a negative entropy rate in ((k − 1)Δ, kΔ], can be due only to the spiking activity of the neural ensemble in that interval. Because this conclusion is predicated on |W_{k|k-1}| > |W_{k-1|k-1}|, we will check this condition at each decoding step. As mentioned above, although we use a gaussian approximation to estimate the probability densities, the point process likelihood in equations A.2 and A.6, the ensemble spiking activity enters the computation of W_{k|k} through equations 2.12 and 2.13 in a nongaussian manner.

2.7 Mutual Information. Because the animal's position as a function of time is modeled as a bivariate gaussian AR(1) process, its marginal probability density p(x_k) is the gaussian probability density with mean μ = [I − F]^{-1} μ_x and covariance matrix W_x = [I − FF']^{-1} W_ε. Using the definition of the Shannon mutual information and the gaussian approximation to p(x_k | N_{0:k}), the mutual information between position at kΔ and the ensemble spiking activity in (0, kΔ] is (Cover & Thomas, 1991; Twum-Danso, 1997)

\[
\begin{aligned}
I(x_k; N_{0:k}) &= -\int p(x_k) \log_2[p(x_k)]\, dx_k - \int \left[ -\int p(x_k \mid N_{0:k}) \log_2[p(x_k \mid N_{0:k})]\, dx_k \right] p(N_{0:k})\, dN_{0:k} \\
&= 2^{-1} \int \log_2\!\left( \frac{|W_x|}{|W_{k|k}|} \right) p(N_{0:k})\, dN_{0:k},
\end{aligned} \tag{2.17}
\]

where |W_x| is the determinant of W_x and p(N_{0:k}) is the marginal probability mass function of the neural ensemble in (0, kΔ].
Equation 2.17 can be evaluated at each step k of the algorithm and defines, on average, the number of bits the random variable N_{0:k}, that is, the ensemble neural spiking activity in (0, kΔ], provides about the random variable x_k, the animal's position at kΔ. This is in contrast to the conditional entropy at kΔ, which describes the uncertainty about the animal's position at kΔ given the ensemble spiking activity in (0, kΔ]. The mutual information describes a property of the system, whereas the instantaneous entropy provides a measure of uncertainty at a given time in a given realization of the ensemble spiking activity (Twum-Danso, 1997). Given the inhomogeneous Poisson model for the spike trains and the AR(1) model for the path, we can easily compute the integral in the last line of equation 2.17 by Monte Carlo.
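As a concrete illustration of this Monte Carlo computation, the sketch below evaluates the last line of equation 2.17 by averaging (1/2) log2(|W_x|/|W_{k|k}|) over sampled posterior covariances. The AR(1) parameters and the posterior covariance samples are illustrative assumptions, not the values fitted in this study.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical AR(1) path parameters (not the fitted values from the paper).
F = np.diag([0.99, 0.99])
W_eps = np.diag([0.1, 0.1])

# Stationary (marginal) covariance W_x solves W_x = F W_x F' + W_eps.
W_x = solve_discrete_lyapunov(F, W_eps)

def gaussian_entropy_bits(W):
    """Differential entropy (bits) of a gaussian with covariance W."""
    d = W.shape[0]
    return 0.5 * np.log2((2 * np.pi * np.e) ** d * np.linalg.det(W))

def mutual_information_bits(W_x, W_post_samples):
    """I(x_k; N_{0:k}) = H(x_k) - E[H(x_k | N_{0:k})] (cf. eq. 2.17).
    The (2*pi*e)^d factors cancel, leaving 0.5 * E[log2(|W_x| / |W_{k|k}|)]."""
    return np.mean([0.5 * np.log2(np.linalg.det(W_x) / np.linalg.det(W))
                    for W in W_post_samples])

# The posterior covariances W_{k|k} would come from Monte Carlo runs of the
# decoder; here we use placeholder samples purely for illustration.
rng = np.random.default_rng(0)
W_post = [W_x * s for s in rng.uniform(0.05, 0.2, size=50)]
I = mutual_information_bits(W_x, W_post)
```

Because each posterior covariance here is strictly smaller than the marginal covariance, the estimated mutual information is positive, as it must be.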
R. Barbieri et al.
3 Application

As in our previous work (Brown et al., 1998), we divide the experimental data into two parts and the analysis into two stages: an encoding analysis and a decoding analysis. In the encoding analysis, we use the first part of the experimental data to estimate the relation between spiking activity and position for each neuron for both animals and to estimate for each animal the parameters of the AR(1) path model. In the decoding analysis, we apply the path and place cell models, with their parameters estimated in the encoding analysis, to the second part of the experimental data to decode the position of each animal from the ensemble spiking activity of its respective hippocampal neurons. We perform the decoding analysis by comparing the gaussian and Zernike decoding models across a range of learning rates and two update intervals. For the optimal model, we analyze the mutual information between the spiking activity and the path. We begin with a brief review of the experimental protocol.

3.1 Experimental Protocol. We applied our paradigm to place cell spike train and position data recorded from two Long-Evans rats freely foraging in a circular environment 70 cm in diameter with walls 30 cm high and a fixed visual cue. A multiunit electrode array was implanted into the CA1 region of the hippocampus of each animal. The simultaneous activity of 34 (32) place cells was recorded from the electrode array while the two animals foraged in the environment for 25 (23) minutes. Simultaneously with the recording of the place cell activity, the position of each animal was measured at 30 Hz by a camera tracking the location of two infrared diodes mounted on the animal's headstage (Brown et al., 1998).

3.2 Encoding Analysis. The maximum likelihood estimates of the spatial gaussian and Zernike inhomogeneous Poisson rate functions computed from the first 15 (13) minutes for Animal 1 (2) are shown in Figure 1. Neuron 34 for Animal 1 and neurons 29 to 32 for Animal 2 had split fields.
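As an aside, the maximum likelihood fit of a spatial gaussian rate function can be sketched as follows. The simulated path, the peak rate, the field centre, and the field width below are hypothetical values chosen for illustration, not the recorded data; the objective is the standard discrete-time inhomogeneous Poisson log likelihood.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
dt = 0.0333  # 33 msec position sampling interval
# Simulated path, uniform over a 70 cm arena (illustration only).
path = rng.uniform(-35, 35, size=(20000, 2))

def rate(path, log_peak, cx, cy, log_w):
    """Gaussian place field rate: peak * exp(-|x - c|^2 / (2 w^2))."""
    d2 = (path[:, 0] - cx) ** 2 + (path[:, 1] - cy) ** 2
    return np.exp(log_peak - 0.5 * d2 / np.exp(log_w) ** 2)

# Hypothetical "true" field: 15 Hz peak, centre (5, -3) cm, width 8 cm.
true_theta = (np.log(15.0), 5.0, -3.0, np.log(8.0))
lam = rate(path, *true_theta)
spikes = rng.poisson(lam * dt)  # spike counts per time bin

def neg_log_lik(theta):
    # discrete-time inhomogeneous Poisson log likelihood
    lam = rate(path, *theta)
    return -(np.sum(spikes * np.log(lam * dt)) - np.sum(lam * dt))

fit = minimize(neg_log_lik, x0=(np.log(5.0), 0.0, 0.0, np.log(15.0)),
               method="Nelder-Mead")
```

With a few hundred spikes, the recovered centre and width land close to the generating values.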
Therefore, for the analysis with the gaussian model, the spiking activity of each of these neurons was fit with two gaussian surfaces to capture this feature. The spiking activity of each neuron was also fit with both a single Zernike model and two Zernike models for comparison. The spatial model fits in Figure 1 are displayed along with smoothed spatial rate maps. The empirical rate maps are displayed as an approximation to the raw data. In general, both the gaussian and Zernike model estimates of the place fields agree, with several noticeable exceptions. The Zernike place field estimates are concentrated in a smaller area and have a wider range of asymmetric shapes. The gaussian model attempts to capture the asymmetry by placing the center of the field outside the circle. The Zernike models can fit split fields with a single polynomial expansion, whereas two gaussian models are required to fit split fields. In a few cases when a place field is concentrated on one border, the Zernike models produce an artifact of nonzero spiking activity on the opposite side of the environment. By the BIC, 29 (30) of the 34 (32) cells of Animal 1 (2) were best fit by the Zernike spatial model, and the remaining 5 (2) were best fit by the spatial gaussian model, suggesting that the former is an important improvement over the latter.

Figure 1: Spatial gaussian and Zernike place fields. Smoothed spatial spike histogram (row 1) and pseudocolor maps of place field estimates from the spatial gaussian (row 2) and Zernike (row 3) models for the 34 place cells for Animal 1 (A) and the 32 place cells for Animal 2 (B).

3.3 Decoding Analysis. We estimated the F and W_ε matrices for both Animal 1 and Animal 2 by maximum likelihood from the 15 (13) minutes of the respective path data in the encoding stage. The diagonal elements of the F matrices were 0.99 for both animals, and the off-diagonal elements were less than 0.01. This means that the decoding estimates from the AR(1) model will be similar to those from the random walk model and that the determinants of the F matrices are close to 1. We decoded the position from the ensemble spiking activity of the 34 (32) neurons from the last 10 minutes of foraging for Animal 1 (2) using 20 different combinations of parameters and models for our algorithm (see equations 2.10-2.13). These were two rate function models, spatial gaussian or Zernike; two updating intervals, Δ = 3.3 or 33.3 msec; and five values of the LRSF, R = 1, 2, 5, 10, or 20 (see equation 2.10). For both animals, the median decoding error was a minimum when the LRSF was 5 (see Figure 2), suggesting that below 5 the new spiking activity should be weighted more, whereas above 5 it should be weighted less. For the five LRSFs, the Zernike models with both 3.3 and 33 msec updating had smaller median decoding errors than the corresponding spatial gaussian models. For Animal 2, the 3.3 msec updating gave a smaller median error for both the Zernike and spatial gaussian models at all LRSFs, whereas for Animal 1, there was no difference in median error between the two sets of models for the 3.3 versus the 33 msec updating until the LRSF was greater than 5. The Zernike model with the 3.3 msec updating and an LRSF of 5 gave the smallest median decoding error: 5.9 (5.5) cm for Animal 1 (2). This is an improvement of 2.0 (2.2) cm from the 7.9 (7.7) cm using the spatial gaussian model and either the random walk (Brown et al., 1998) or AR(1) path models (the red circles in Figure 2). Of this improvement, 1.2 (1.3) cm for Animal 1 (2) was due to optimizing the learning rate, whereas 0.7 (0.7) cm was due to the switch from the spatial gaussian to the Zernike model. Because increasing the LRSF increases the variance of the position estimate (see equations 2.11 and 2.13), the areas of the confidence regions necessarily grow with increasing LRSF. Beyond an LRSF of 2, this increase is faster for all algorithms for both animals (see Figure 2). For all the algorithms, the confidence region areas are smaller for the Zernike models than those of their corresponding spatial gaussian models. At an LRSF of 5, the median confidence region area for the Zernike model was 8.62 (9.32) cm2, compared with 9.32 (10.52) cm2, for Animal 1 (2). The true coverage probability for the 0.95 confidence regions (see equation 2.13) also increases with increasing LRSF. However, at an LRSF of 5, it reaches an approximate plateau for both animals for all algorithms. Up to an LRSF of 5, the coverage probability for both animals is greater for any Zernike model than for its corresponding spatial gaussian model. The coverage probabilities plateau after 5 because even though the confidence region areas increase with higher learning rates (see Figures 2C and 2D), the higher learning rates beyond this point give less accurate decoding (see Figures 2A and 2B). At an LRSF of 5, for the Zernike model with 3.3 msec updating, the coverage probability of the 0.95 confidence regions was 0.67 (0.75) for Animal 1 (2) (black circles in Figures 2E and 2F). This is a significant improvement over the 0.31 (0.40) for the spatial gaussian model with either the random walk or AR(1) path models (red circles in Figures 2E and 2F).

Figure 2: Facing page. Summary of decoding algorithm performance. (A, B) Median decoding error computed as the difference between the true and estimated path at each decoding update. (C, D) Median square root of the 0.95 confidence region area. (E, F) Median coverage probability as a function of the learning rate scale factor for the spatial gaussian model with a 33 msec update (dark blue), the spatial gaussian model with a 3.3 msec update (light blue), the Zernike model with a 33 msec update (light red), and the Zernike model with a 3.3 msec update (dark red) for Animal 1 (left column) and Animal 2 (right column). The black circles in A and B indicate the minimum median error for the Zernike model with 3.3 msec updating and LRSF = 5. The red circles in A and B indicate the median error of the spatial gaussian model with 33 msec updating and LRSF = 1. Although the spatial gaussian model also used the AR(1) path model, the results are identical to the decoding results reported for this model in Brown et al. (1998) using the random walk path model. The black circles in E and F indicate the median coverage probability of the Zernike model with 3.3 msec updating and LRSF = 5. The red circles indicate the median coverage probability of the spatial gaussian model with 33 msec updating and LRSF = 1. Because the Zernike model with 3.3 msec updating and LRSF = 5 achieved for both animals nearly the highest median coverage probability with the smallest median error, we termed it the optimal Zernike model.
We term the decoding algorithm with the Zernike model, an LRSF of 5, and updates at 3.3 msec the optimal Zernike model. Illustrations of the recursive filter decoding algorithm with the spatial gaussian and optimal Zernike models, along with a reverse correlation analysis, are shown for a 60 sec continuous segment of the path from Animal 1 in Figure 3. The estimated trajectories from the optimal Zernike model resembled the true trajectories more closely than did those of the spatial gaussian model and had the smallest median error in each of the four 15 sec segments. In contrast, the reverse correlation algorithm performed poorly over the entire segment. The differences in performance of the algorithms in decoding the spike train data for both animals can be best appreciated in the videos of this analysis on our Web site (http://neurostat.mgh.harvard.edu). To quantify further the relative performance of the decoding algorithms, we computed the decoding error, defined for each algorithm as the difference between the true and estimated path at each 33 msec time step, for the 10 min of decoding. Box plot summaries of these error distributions for the three decoding algorithms are shown in Figure 4. For Animal 1 (2),
Figure 3: Continuous 60 sec segment of the true path (gray) displayed in four 15 sec intervals with the spatial gaussian decoding estimate with LRSF = 1 and 33 msec update (black, left column), the optimal Zernike model (black, center column), and the reverse correlation (black, right column). Numbers are the median error for the segment.
Figure 4: Box plot summary of decoding error (true position minus estimated position) distributions. Box plot summaries of the decoding error distributions for the spatial gaussian model with LRSF = 1 and 33 msec updating (left), the optimal Zernike model (center), and the reverse correlation (right) for Animals 1 (A) and 2 (B). The lower border of the box is the 25th percentile of the distribution, and the upper border is the 75th percentile. The white bar within the box is the median of the distribution. The distance between the 25th and 75th percentiles is the interquartile range (IQR). The lower (upper) whisker is at 1.5 times the IQR below (above) the 25th (75th) percentile. All the black bars below (above) the lower (upper) whiskers are far outliers. For reference, less than 0.35% of the observations from a gaussian distribution would lie beyond the 75th percentile plus 1.5 × IQR.
the median decoding error for the spatial gaussian model was 7.7 (7.9) cm, with a 25th percentile of 5.2 (4.9) cm, a 75th percentile of 10.7 (10.9) cm, a minimum of 0.12 (0.06) cm, and a maximum of 44.6 (30.7) cm. For Animal 1 (2), the median decoding error for the Zernike model was 5.9 (5.5) cm, with a 25th percentile of 3.9 (3.3) cm, a 75th percentile of 8.6 (8.4) cm, a minimum of 0.8 (0.4) cm, and a maximum of 22.3 (43.8) cm. The decoding error distribution for Animal 1 was smaller for the optimal Zernike model than for the spatial gaussian model (see Figure 4). For Animal 2, the decoding error distribution for the Zernike model was smaller up to the upper whisker of the box plot; beyond this point, however, this error distribution had a larger tail than the one for the corresponding spatial gaussian model. For both animals, the reverse correlation decoding errors were much larger. For Animal 1 (2), the median decoding error for the reverse correlation method was 27.5 (27.3) cm, with a 25th percentile of 14.7 (14.3) cm, a 75th percentile of 49.7 (43.8) cm, a minimum of 0.2 (0.7) cm, and a maximum of 143.3 (146.8) cm.

The optimal Zernike model was also noticeably more accurate than the spatial gaussian model near the borders of the environment (see Figure 5). This is because the Zernike model gave a more accurate description of the place fields (see Figure 1) and because, by equation 2.5, the support of the Zernike model is restricted to the circle. The more realistic trajectory estimates, the smaller decoding error, and the greater coverage probability with smaller confidence regions (see Figures 2B and 2C, 4, and 5) show that the recursive filter decoding algorithm with the Zernike model was more accurate than either this algorithm with the spatial gaussian model or the reverse correlation procedure.

Figure 5: Decoding analysis: 95% confidence regions. Fifteen sec of true path (gray), optimal Zernike model estimate (black, A), the corresponding spatial gaussian model estimate (black, B), and 0.95 confidence regions (ellipses) computed at position estimates spaced 1.5 sec apart. The median decoding error for the Zernike (gaussian) model is 5.99 (7.28) cm, and the median confidence region area is 9.32 (10.62) cm2 on this segment.

Figure 6: Decoding and mutual information analysis. (A) Instantaneous entropy. (B) Entropy rate. (C) Square root of the confidence region area. (D) Decoding error. (E) Coverage probability for the continuous 60 sec segment in Figure 3, computed by decoding with the optimal Zernike model.

Figure 7: Box plots of the (A) instantaneous entropy, (B) entropy rate, and (C) Shannon mutual information distributions for the optimal Zernike model for Animal 1 (left) and Animal 2 (right) for the 10 min of decoding. See Figure 4 for an interpretation of the box plots.

3.4 Instantaneous Entropy, Entropy Rate, and Mutual Information. Along with the decoding estimate at each time step, we computed from the decoding analysis with the optimal Zernike model the instantaneous entropy, the entropy rate, and the Shannon mutual information between the animal's position and the ensemble spiking activity. Figure 6 shows the instantaneous entropy, entropy rate, confidence region area, decoding error, and coverage probability for the 60 sec segment of data from Animal 1 considered in Figure 3. By equations 2.14 and 2.15, the entropy of position given the ensemble spiking activity (see Figure 6A) is directly related to the area of the 0.95 confidence region at that instant (see Figure 6C). The entropy provided a different assessment of the algorithm's performance from the decoding error. For example, at 12 sec, the decoding error was 20 cm (see Figure 6D), the confidence region area was 172 cm2 (see Figure 6C), and the entropy was 9.3 bits (see Figure 6A). At 44 sec, however, the decoding error was nearly the same, yet the confidence region area was 72 cm2 and the entropy was 8.0 bits. Because the entropy rate approximates the derivative of the entropy, it was positive when the instantaneous entropy increased and negative when it decreased (see Figure 6B). For both animals, we verified that at each time step k of the algorithm, |W_{k|k−1}| > |W_{k−1|k−1}|. Therefore, as stated in section 2.6, the negative entropy rates were due solely to the ensemble spiking activity. The mean and median of the entropy rate were zero during this 60 sec segment, suggesting that there was no overall increase or decrease in the uncertainty about position given the ensemble spiking activity. The instantaneous coverage probabilities for this 60 sec segment range from 0.71 to 0.85 (see Figure 6E). The value of 0.73 at the end of the segment is close to the total decoding-stage coverage probability of 0.67 for this animal. Together, the decoding error, the entropy, the entropy rate, the confidence regions, and the coverage probability give a comprehensive assessment of algorithm performance not appreciated from applying any single measure alone. To appreciate the range of entropies and entropy rates observed during the 10 minutes of decoding with the optimal Zernike model, we summarized as box plots in Figure 7 the distributions of these values for each animal. During 10 min of decoding with the optimal Zernike model, the median
entropy was 6.9 (7.0) bits for Animal 1 (2), with a minimum of 3.5 (5.2) bits and a maximum of 9.3 (12.0) bits (see Figure 7). For the entire 10 min decoding stage, as in the analysis of the 60 sec segment of ensemble spiking activity in Figure 6, the median entropy rate was −0.08 (−0.09) and the mean entropy rate was 0.002 (0.004) for Animal 1 (2). The corresponding 25th and 75th percentiles of the entropy rate distributions were −0.57 (−0.63) and 0.67 (0.73) for Animal 1 (2), respectively. These findings show that there is no overall growth or decline in the entropy of position given the ensemble spiking activity during the decoding interval. We computed the mutual information in equation 2.17 by using Monte Carlo methods to simulate 50 realizations of the path (equation 2.7) and of the ensemble spiking activity given the path model (equation 2.3) and the maximum likelihood estimates of the Zernike model parameters computed in the encoding analysis. The median and mean mutual information between position and ensemble spiking activity are 9.38 (9.45) bits, with 25th and 75th percentiles of 9.37 (9.43) and 9.41 for Animal 1 (2), respectively.

4 Discussion

Our Bayes' filter previously estimated the rat's position with a median error of 8 cm during 10 min of decoding using approximately 30 place cells, with a coverage probability of approximately 0.35 (Brown et al., 1998). The recursive point process decoding algorithm with the optimal Zernike model reduced the median error to less than 6 cm and increased the coverage probability to approximately 0.70, suggesting that the ensemble spiking activity carries more information than we previously estimated. At any instant, approximately 30% of the 10^5 (3 × 10^4) neurons in the hippocampus are active in a given environment (Wilson & McNaughton, 1993). Therefore, our results based on approximately 30 neurons suggest that this brain region maintains a precise, dynamic representation of the animal's spatial location.
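The recursion underlying these results can be sketched compactly. The code below implements a gaussian-approximation point process filter in the spirit of equations 2.10 to 2.13: an AR(1) one-step prediction followed by a posterior update driven by the spike-count innovations of gaussian place fields. All parameter values (field centres, peak rate, width, F, and the noise covariance) are illustrative assumptions, not the values fitted in this study.

```python
import numpy as np

rng = np.random.default_rng(2)
dt = 0.0033                 # 3.3 msec update interval
F = 0.99 * np.eye(2)        # AR(1) state matrix (illustrative)
W_eps = 4.5 * np.eye(2)     # white-noise covariance per step (illustrative)

# Gaussian place fields with centres on a ring (hypothetical ensemble).
angles = np.linspace(0, 2 * np.pi, 30, endpoint=False)
centres = 20.0 * np.column_stack([np.cos(angles), np.sin(angles)])
peak, width = 15.0, 8.0     # Hz, cm

def rates(x):
    d2 = np.sum((centres - x) ** 2, axis=1)
    return peak * np.exp(-0.5 * d2 / width ** 2)

def filter_step(x_prev, W_prev, dN):
    # one-step prediction from the AR(1) path model (cf. eq. 2.10)
    x_pred = F @ x_prev
    W_pred = F @ W_prev @ F.T + W_eps
    lam = rates(x_pred)
    grads = (centres - x_pred) / width ** 2   # gradient of log lambda_c
    hess = -np.eye(2) / width ** 2            # Hessian of log lambda_c
    # gaussian-approximation posterior update (cf. eqs. 2.11-2.13)
    W_inv = np.linalg.inv(W_pred)
    for c in range(len(centres)):
        W_inv += (np.outer(grads[c], grads[c]) * lam[c] * dt
                  - hess * (dN[c] - lam[c] * dt))
    W_post = np.linalg.inv(W_inv)
    x_post = x_pred + W_post @ (grads.T @ (dN - lam * dt))
    return x_post, W_post

# Decode a short simulated segment with the true position held fixed.
x_true = np.array([10.0, 0.0])
x_est, W_est = np.zeros(2), 25.0 * np.eye(2)
estimates = []
for _ in range(3000):
    dN = rng.poisson(rates(x_true) * dt)
    x_est, W_est = filter_step(x_est, W_est, dN)
    estimates.append(x_est)
```

With these toy parameters, the time-averaged estimate settles near the true position, and the posterior covariance remains positive definite, consistent with the condition checked at each decoding step in the paper.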
4.1 Decoding, Coverage Probability, Entropy, and Mutual Information. The central concept in our decoding paradigm is the recursive estimation of the posterior probability density (see equation 2.8). The decoding estimates, entropy, mutual information, confidence regions, and coverage probability are all functions of the posterior probability density that can be computed directly once this density has been estimated. The relation among these functions reflects the distinction between the message in the code, certain properties of the code, and the accuracy with which the algorithm reads the code. The decoded position estimate, together with its 0.95 confidence region, gives a read-out of the message in the code at a given instant. In contrast, the entropy of the animal's position given the ensemble spiking activity, and the area of the 0.95 confidence region at that instant, are second-order properties of the code. The smaller the number of bits in the entropy, the smaller is the confidence region. In this way, the entropy rate,
like the changes in the areas of the confidence regions, identifies when the spiking activity contributes "meaningfully" to the decoding estimation. Because the mutual information is computed by integrating the entropy with respect to the marginal probability mass function of the ensemble spiking activity (equation 2.17), it measures the average number of bits at each time point that the ensemble provides about the path. Hence, the entropy provides a measure of uncertainty for the current experiment, whereas the mutual information defines the number of bits on average, that is, over many realizations, that the neural ensemble conveys about the animal's position. The decoding error and the coverage probability measure the accuracy with which the algorithm reads the neural code. The coverage probability measures simultaneously the accuracy of the algorithm's first-order (position estimate) and second-order (position uncertainty) properties. Although the coverage probability of no model reached the expected 0.95, the coverage probability of each Zernike model was greater, and its confidence region areas smaller, than those of the corresponding spatial gaussian model. Moreover, the coverage probability of the optimal Zernike model was greater than that of our original spatial gaussian model with either the random walk or AR(1) path models. These findings suggest that the new algorithm with the new spatial model is more accurate. Our computation of the instantaneous entropy and the entropy rate follows directly from applying the definition of entropy to the Bayes' rule Chapman-Kolmogorov updating formula in equations 2.8 and 2.9. We used these calculations to make explicit the relation between recursively computing a decoding estimate and computing the mutual information. In the current experiments, the entropy was constant at approximately 7 bits, and the median and mean entropy rates were zero.
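The link between the entropy in bits and the confidence region can be made explicit under the bivariate gaussian approximation: the 0.95 region is an ellipse whose area scales with |W|^{1/2}, that is, with 2 raised to the entropy. The sketch below illustrates this relation; the covariance values are hypothetical, and the exact constants of equations 2.14 and 2.15 are not reproduced here.

```python
import numpy as np
from scipy.stats import chi2

def entropy_bits(W):
    """Differential entropy of a bivariate gaussian, in bits."""
    return np.log2(2 * np.pi * np.e) + 0.5 * np.log2(np.linalg.det(W))

def conf_region_area(W, level=0.95):
    """Area of the gaussian confidence ellipse: pi * chi2 quantile * sqrt(|W|)."""
    return np.pi * chi2.ppf(level, df=2) * np.sqrt(np.linalg.det(W))

# Hypothetical posterior covariances (cm^2): doubling the linear scale of
# the uncertainty adds exactly 2 bits of entropy and quadruples the area.
W1 = np.diag([4.0, 4.0])
W2 = np.diag([16.0, 16.0])
```

So a one-bit drop in the instantaneous entropy corresponds to halving the confidence region area, which is why the two panels in Figure 6 track each other.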
This suggests that the uncertainty in position determined by the ensemble of hippocampal neurons did not change during the experiment. Because place fields do evolve as an animal moves through both familiar (Mehta, Lee, & Wilson, 2002) and novel environments (Frank, Stanley, & Brown, 2003), it would not have been surprising to see an increase in entropy and more positive entropy rates as the decoding proceeded. This is because, as the place fields evolve, their estimates computed during the encoding stage no longer represent the spatial location that the neuron encoded during the decoding stage. To carry out decoding in this case would require combining our decoding algorithm with adaptive estimation algorithms (Brown et al., 2001) to track place field evolution during decoding. Equation 2.16 and our analysis in section 2.6 showed that, by the construction of our algorithm, a negative entropy rate at any step or, equivalently, a decrease in the entropy came solely from the spiking activity and not from the AR(1) path model. This was because the AR(1) path model contributed at each step a positive entropy rate, since, at each step, |W_{k|k−1}| > |W_{k−1|k−1}|. Furthermore, because under the Poisson assumption used in the current analysis the spiking activity of neurons in
nonoverlapping intervals is independent, the decrease in entropy due to the spiking activity did not arise from correlated activity. The point process conditional intensity framework is well suited to carry out the important next step of repeating the current entropy and mutual information analyses with models of history-dependent spiking activity. Our parametric modeling paradigm provides dynamic estimates of the entropy, the entropy rate, and the mutual information, using the Bayes' rule and Chapman-Kolmogorov equations (see equations 2.8 and 2.9) to model explicitly the temporal evolution of the relation between position and ensemble spiking activity. The paradigm relates decoding and mutual information analyses and offers a plausible alternative to static mutual information analyses computed over long time intervals. Moreover, our approach to estimating entropy and mutual information obviates concerns about choosing word lengths and about the accuracy of information estimates, particularly when word lengths are large relative to the length of the spike train (Victor, 2002).

4.2 The State-Space Paradigm Applied to Neural Spike Train Decoding. The current decoding algorithm is part of the paradigm we have been developing to conduct state estimation from point process measurements (Smith & Brown, 2003). We reviewed there the relation of our paradigm to other approaches to state estimation from point process observations. We previously compared our Bayes' filter algorithm to several other approaches (Brown et al., 1998). In this study, we compared our new decoding algorithm to reverse correlation because it was not included in our previous comparison and because it is one of the most widely used methods for decoding ensemble neural spike train activity (Warland et al., 1997; Stanley et al., 1999; Serruya et al., 2002).
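For comparison, the reverse correlation (optimal linear filter) decoder of Warland et al. (1997) reduces to a linear regression of position on lagged spike counts. The sketch below shows only its structure; the counts and the path are random placeholders, so no decoding accuracy is implied.

```python
import numpy as np

rng = np.random.default_rng(3)
T, C, L = 5000, 30, 10    # time bins, cells, filter lags (all illustrative)
counts = rng.poisson(0.1, size=(T, C))                    # binned ensemble spikes
pos = np.cumsum(rng.normal(0, 0.5, size=(T, 2)), axis=0)  # toy 2D path

# Design matrix: position at bin t is regressed on the spike counts of
# every cell over the preceding L bins, plus an intercept column.
X = np.ones((T - L, 1 + C * L))
for lag in range(L):
    X[:, 1 + lag * C:1 + (lag + 1) * C] = counts[L - lag:T - lag]

# Least-squares estimate of the linear reconstruction filters.
beta, *_ = np.linalg.lstsq(X, pos[L:], rcond=None)
pred = X @ beta
```

Unlike the recursive filter, this estimator uses neither the path dynamics nor the point process structure of the spike trains, which is consistent with its poorer performance reported above.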
The reverse correlation algorithm, while simple to implement, performed significantly worse than any version of our algorithm because it made no use of the spatial and temporal structure in this problem. This finding suggests that applying our algorithm to other decoding and neural prosthetic control problems where reverse correlation methods have performed successfully (Bialek et al., 1991; Warland et al., 1997; Stanley et al., 1999; Wessberg et al., 2000; Serruya et al., 2002) may yield important improvements. The decoding error, confidence regions, and coverage probabilities used together are more meaningful measures of algorithm performance than the R^2 coefficient typically reported in reverse correlation analyses (Wessberg et al., 2000; Serruya et al., 2002; Taylor et al., 2002). Equations 2.10 to 2.13 generalize our Bayes' filter decoding algorithm to one using arbitrary point process and linear stochastic state-space models. To define our new algorithm or, equivalently, the posterior probability density (see equation 2.8), it suffices to specify these two models. This general formulation was necessary for our current analysis because our original algorithm depended critically on the form of the spatial gaussian model (Brown et al., 1998) and hence could not be applied to the Zernike model. The current formulation of the algorithm can use any point process model
that can be defined in terms of a conditional intensity function (see equation A.1). The inhomogeneous Poisson observation model based on the Zernike expansion gave more accurate decoding results than the one based on the gaussian model for two reasons. First, the Zernike model was more flexible and represented more realistically the complex place field shapes in the circular environment, as indicated by the plots in Figure 1 and the BIC goodness-of-fit assessments. In particular, the support of the Zernike models was restricted to the circular environment, whereas the support of the gaussian model was all of R^2. Second, the algorithm with the Zernike Poisson model decoded more accurately because this model more accurately estimated the probability of a neuron's spiking or not spiking in a given location. The decoding update (see equation 2.11) "listens" to both spiking and silent neurons in each time window. Because few spikes occur within the 3.3 and 33 msec time windows, the new information (innovations) at each update comes mostly from neurons that do not spike. Using the information from the neurons that do not spike in order to decode was a feature of our Bayes' filter algorithm (Brown et al., 1998). Wiener and Richmond (2003) recently noted this feature as well. This observation is consistent with the idea that once a brain region develops a representation, the neurons that do not spike, or are inhibited, can convey as much information as those that do. We chose an AR(1) model as our linear stochastic state-space model. There were no differences between the decoding results for the random walk model and those for the AR(1) model. However, because the AR(1) model is stationary, it could be used, unlike the random walk model, to compute meaningful estimates of the mutual information. Scaling the model's white noise variance (see equation 2.7) scaled the learning rate.
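The restriction of the Zernike model's support to the circular environment follows directly from the basis construction, which can be sketched as follows. The particular basis orders and coefficients below are arbitrary choices for illustration; the exact parameterization of equation 2.5 is not reproduced here.

```python
import numpy as np
from math import factorial

def zernike_radial(n, m, rho):
    """Radial Zernike polynomial R_n^m on [0, 1]."""
    m = abs(m)
    out = np.zeros_like(rho)
    for k in range((n - m) // 2 + 1):
        c = ((-1) ** k * factorial(n - k)
             / (factorial(k) * factorial((n + m) // 2 - k)
                * factorial((n - m) // 2 - k)))
        out += c * rho ** (n - 2 * k)
    return out

def zernike(n, m, rho, phi):
    """Full Zernike basis function; defined to be zero outside the unit disk."""
    ang = np.cos(m * phi) if m >= 0 else np.sin(-m * phi)
    return np.where(rho <= 1.0, zernike_radial(n, m, rho) * ang, 0.0)

def zernike_rate(theta, x, y, radius=35.0):
    """Log-linear inhomogeneous Poisson rate on a circular arena:
    lambda(x) = exp(sum_j theta_j Z_j(rho, phi)), support restricted to the disk.
    The basis orders below are an arbitrary illustrative choice."""
    rho = np.hypot(x, y) / radius
    phi = np.arctan2(y, x)
    basis = [(0, 0), (1, -1), (1, 1), (2, -2), (2, 0), (2, 2)]
    log_lam = sum(t * zernike(n, m, rho, phi)
                  for t, (n, m) in zip(theta, basis))
    return np.where(rho <= 1.0, np.exp(log_lam), 0.0)
```

Because every basis function vanishes for rho > 1, the rate is identically zero outside the arena, unlike a gaussian surface whose support is the whole plane.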
This scaling enhanced significantly the algorithm's accuracy by balancing the weight given to the prediction based on the previous estimate (see equation 2.10) against the modification produced by the innovations (see equation 2.12). Further improvements can be readily incorporated into the current framework. First, these include combining the Zernike spatial model with models of known hippocampal temporal dynamics such as bursting (Quirk & Wilson, 1999; Barbieri et al., 2001), theta phase modulation (Skaggs, McNaughton, Wilson, & Barnes, 1996; Brown et al., 1998; Jensen & Lisman, 2000), and phase precession (O'Keefe & Recce, 1993). Second, we can consider both higher-order linear state-space models (Shoham, 2001; Gao, Black, Bienenstock, Shoham, & Donoghue, 2002; Smith & Brown, 2003) and, perhaps, nonlinear state-space models to describe the path dynamics more accurately (Kitagawa & Gersh, 1996). Finally, we developed gaussian approximations to evaluate the Bayes' rule and Chapman-Kolmogorov equations (equations 2.8 and 2.9) as a plausible initial approach. We can improve our algorithm and evaluate the accuracy of this approximation by applying nongaussian approximations (Pawitan, 2001), exact numerical integration techniques (Genz & Kass, 1994; Kitagawa & Gersh, 1996), and Monte Carlo methods (Doucet, de Freitas, Gordon, & Smith, 2001; Shoham, 2001)
to evaluate these equations. We suggest our point process linear state-space model framework as a broadly applicable approach for studying decoding problems as well as for designing algorithms to control neural prosthetic devices and brain-machine interfaces.

Appendix: Derivation of the Decoding Algorithm

During the decoding interval (0, T_d], we record the spiking activity of C neurons. Let 0 < u_1^c < u_2^c

0.03τ_m is necessary at SOA = 0). Another important feature in Figure 7 is that the inhibitory strength-duration curves are not monotonic and exhibit minima in a certain duration
Y. Miyawaki and M. Okada
range. These results indicate that there is an optimal duration for suppressing the network activity efficiently with a perturbation of lower intensity. For example, a duration of 0.1τ_m is the most effective at SOA = 0, under which condition an intensity of only 2.45 is required for suppression. If the duration of the perturbation is around 0.1τ_m, which is comparable to a typical TMS pulse duration (see also section 4.1), the inhibitory strength-duration curves for SOA = 0 and 1 exhibit lower values than under other SOA conditions. This means that the transient period around the afferent input onset is the most susceptible to the perturbation (see also Figure 6A). Furthermore, in the transient period, the bottom of the inhibitory strength-duration curve is almost flat with respect to the duration of the perturbation. The network can hence be suppressed by a similar intensity even if the duration changes to a certain degree. For the other SOAs, the minimum intensity for suppression is easily affected by a change in duration. For example, if the duration is varied between 0.5 and 0.05τ_m, the minimum intensity for suppression varies by 10.4 at SOA = 5, whereas it varies by only 1.23 at SOA = 0. These results indicate that a perturbation close to the afferent input onset can suppress the network activity efficiently, with a lower intensity, and more robustly with respect to variation in the duration of the perturbation.

4 Discussion

4.1 Consistency with Experiments. We observed the suppressive effects induced in the simple neural network model by a TMS-like perturbation. There is a critical period in which the perturbation can completely suppress the network activity. In addition, the network activity is suppressed efficiently and robustly if the perturbation is close to the onset of the afferent input. Assuming that the time constant τ_m is 10 ms, these results can be considered from a quantitative viewpoint.
The duration of the perturbation we use in this article would be 1 ms, which is of the same order as the pulse duration produced by a typical TMS system. The SOA range in which the perturbation can suppress the network would be over 100 ms, which is also consistent with experimental data for occipital TMS (Amassian et al., 1989; Kamitani & Shimojo, 1999; Kammer & Nusseck, 1998; Masur et al., 1993). We also observed the parametric relationship between the intensity of the afferent input and the degree of suppression (see Figure 6). Raising the intensity of the afferent input increased the minimum intensity of the suppressive perturbation and narrowed the suppressive SOA range. Only when the afferent input is slightly above the threshold at which the network has a local excitation did the suppressive SOA range reach more than 100 ms. Parallel results have been obtained in experimental studies of occipital TMS. Most of these experiments have demonstrated a suppressive SOA range of about 100 ms, but the range actually varies with the visibility of the stimuli. Kammer and Nusseck (1998) demonstrated that raising the contrast of the visual stimuli decreased the error rate of the recognition task and narrowed
A Network Model of TMS
the suppressive SOA range. Amassian, Cracco, Maccabee, and Cracco (2002) also reported that suppression cannot be achieved by occipital TMS if the visual stimuli (alphabetical letters in their experiments) can be recognized too easily. These results indicate that to induce a powerful suppression over a wide time range of more than 100 ms, the afferent input given as sensory stimuli should be just slightly above the threshold. This is indeed the condition under which the suppressive SOA range reached a large value of over 100 ms in our simulation. Another important point regarding the temporal properties of TMS is the peak timing at which the perturbation acts most suppressively. In most cases of occipital stimulation, a delay of about 100 ms from presentation of the visual stimulus is the most effective for suppressing its percept (see Figure 1). However, this delay varies according to the cortical area where TMS is applied. Beckers and Zeki (1995) and Hotson and Anand (1999) used visual motion stimuli and found a difference in the optimal SOA for suppressing motion perception between V1 and V5 stimulation. Corthout, Uttl, Walsh, Hallett, and Cowey (1999) and Corthout, Uttl, Ziemann, Cowey, and Hallett (1999), on the other hand, observed that TMS over the occipital pole could induce two distinct suppressive periods, and they discussed these periods in terms of the time course of feedforward and feedback visual processing. In most of these studies, variations in the absolute timing of the suppressive SOA have been considered to reflect differences in the arrival timing of the afferent input and the transmission delay from one cortical area to another (Pascual-Leone & Walsh, 2001). In addition, the suppressive SOA period starts later as the intensity of the visual stimulus is decreased. This delay has also been considered to be due to reduced transmission speed on the afferent visual pathway (Kammer & Nusseck, 1998; Miller et al., 1996; Masur et al., 1993; Wilson & Anstis, 1969).
The results presented here are measured from the onset of the afferent input to the network, not from the time of presentation of the external stimulus. In terms of absolute timing, the simulation results therefore need to be offset by the delay in transmission of the neural signals to the target neural population. The temporal profile of the degree of suppression, however, can be considered independent of this absolute timing bias. In typical experiments, the temporal profile of TMS-induced perceptual suppression has been depicted directly with a continuous measure such as the percentage of correct answers (see Figure 1), whereas in our simulation, the minimum intensity of the suppressive perturbation was used as the measure because the suppression occurs in an all-or-nothing fashion. Although the measures are different, both results equivalently represent the temporal variation of susceptibility to a brief perturbation. They show quite similar temporal profiles and a parallel dependence on the intensity of the afferent stimuli, and the range of the suppressive period is also of the same order. Thus, assuming the proper delay in the afferent input arrival, the temporally selective
suppression in the network model agrees well with experimental data for TMS-induced perceptual suppression.

4.2 Neural Mechanisms of TMS-Induced Suppression. The neural mechanisms of TMS-induced suppression constitute the major issue still under debate and also the primary question of this article. A recent computational study suggested the involvement of a calcium-dependent potassium current inducing a long hyperpolarization period after stimulation (Kamitani et al., 2001). This mechanism might contribute to the suppression as an elemental component. On the other hand, several experimental studies have suggested that TMS-induced suppression is mediated by the activation of GABAergic inhibitory interneurons, based on evidence from experiments using GABAergic drugs (Ziemann et al., 1996a, 1996b; Inghilleri et al., 1996), hyperventilation (Priori et al., 1995), and multiple-pulse techniques (Ziemann, Rothwell, & Ridding, 1996; Romeo et al., 2000; Inghilleri et al., 1993). These reports imply the involvement of an inhibitory network in the suppression, although the concrete mechanisms have not yet been clearly described. As Walsh and Cowey (2000) pointed out in their review, TMS is highly unlikely to evoke a particular activity pattern of the kind coordinated by a local cortical network or an afferent projection to the target area. Rather, TMS might induce random activities regardless of the activity pattern existing in the target area, as if injecting random noise into ordered neural processes (see Figure 8A). In this article, we simply approximated TMS as a uniform perturbation by which all neurons are stimulated regardless of the present activity pattern in the network. Walsh's idea and our modeling are essentially equivalent in the sense that TMS delivers neural stimulation unrelated to the cortical activity formed by the afferent input and the cortical network.
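As a loose illustration of this uniform-perturbation idea (not the authors' model — the equations, parameters, and time units below are invented for the sketch), a threshold-linear two-population rate model with a self-sustained active state can be silenced by a brief, uniform excitatory pulse: the pulse transiently recruits the inhibitory population, which then pulls the excitatory population below its self-sustaining level.

```python
def simulate(pulse_amp, T=10.0, dt=0.01):
    """Two-population threshold-linear rate model (hypothetical parameters).

    Time is in units of the membrane time constant. A brief uniform pulse
    of amplitude `pulse_amp` is delivered to BOTH populations at t = 3,
    mimicking a TMS-like perturbation that ignores the network's state.
    """
    relu = lambda v: v if v > 0.0 else 0.0
    E, I = 0.7273, 0.0818            # start at the stable self-sustained state
    trace = []
    for step in range(int(T / dt)):
        t = step * dt
        p = pulse_amp if 3.0 <= t < 3.1 else 0.0
        dE = -E + relu(1.5 * E - 2.0 * I - 0.2 + p)   # recurrent excitation
        dI = -I + relu(0.8 * E - 0.5 + p)             # inhibitory recruitment
        E, I = E + dt * dE, I + dt * dI
        trace.append(E)
    return trace

quiet = simulate(pulse_amp=0.0)[-1]   # no pulse: activity persists
kicked = simulate(pulse_amp=5.0)[-1]  # pulse recruits I, activity collapses
```

With these (hand-picked) weights, the network is bistable: without the pulse the excitatory rate stays near 0.73, whereas the pulse drives the inhibitory population high enough to counter-suppress the excitatory one, after which both decay to the quiescent state.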
As a consequence of such a perturbation, the inactive neurons initially suppressed by the active neurons can also be activated, and they then begin to act suppressively on the active neurons via lateral inhibitory connections (see Figure 8B). Once the local excitation is well established, the inhibitory input from the active neurons to the inactive neurons becomes so strong that it is difficult for the perturbation to pull the activity of the inactive neurons above threshold, resulting in failure to produce a counter-suppression of the active neurons. This temporal relationship between the perturbation and the local excitation limits the critical time range for suppressing the network activity after the afferent input onset. Before the afferent input is given, only the uniform inhibition J0 has a dominant effect because there are no second-order Fourier components. Thus, the origin of the suppressive effect in that period is the recurrent inhibition brought about by J0 after a brief perturbation-induced excitation, and its relaxation time limits the critical time range for suppressing the network activity. This is the basic framework for the dynamics of TMS-induced suppression provided by the network model of feature selectivity in the visual cortex
Figure 8: (A) The hypothesis of random stimulation of a neural population by TMS. TMS may have difficulty inducing a particular localized activity pattern; rather, it could randomly stimulate the population of neurons under the coil (see also Walsh & Cowey, 2000). (B) The basic framework of TMS-induced suppression in the network with feature-specific interaction. TMS is modeled as a uniform excitatory perturbation across the entire network, so that the activity of all neurons is increased first. If the perturbation is strong enough to pull the activity of inactive neurons above the threshold, the active neurons receive a counter-suppression from the inactive neurons via inhibitory connections.
and the TMS model of a uniform perturbation. However, the functional architecture of feature-selective excitatory and inhibitory interaction like that used in this article might not be necessary to produce similar results. The essential property of our network model is bistability generated by recurrent excitatory and inhibitory synaptic connections and afferent input channels. Hence, a sparsely connected two-population (excitatory and inhibitory) model with suitable synaptic connections and external input channels might be sufficient (e.g., Amit & Brunel, 1997; van Vreeswijk & Sompolinsky, 1996). Suppose a strong perturbation were given to both populations in such a model; inhibitory units would be activated directly or transsynaptically, and the total activity would then be decreased through inhibitory interactions. TMS would act like a reset input in the excitatory-inhibitory balanced network. TMS-induced suppressive effects can also be observed in brain areas other than the visual cortex. For example, TMS of the motor cortex induces a muscle twitch and suppression of the surface EMG of an upper limb. The functional architecture of the motor cortex is distinct from that of the visual cortex; however, the essential property underlying the TMS-induced suppression discussed above is a ubiquitous feature of the brain and exists in the motor cortex as well. Thus, the basic mechanisms of TMS-induced suppression might be shared across different cortical areas, and the typical experimental results of TMS of the motor cortex could also be explained by a general model like a two-population network. There would
also be specificity originating from differences in the functional architecture and the coding scheme of neural signals in each cortical area. It is necessary to take such area specificity into account to identify neural mechanisms that yield phenomena unique to the targeted cortical site. The issue of area specificity remains open for future study. In this article, we focused on perceptual suppression by TMS; in particular, we assumed stimulation of the visual cortex and chose a model that has the typical functional architecture for feature selectivity or, more specifically, the model of the orientation tuning function in the primary visual cortex. We obtained temporal properties of the suppressive effect consistent with the typical data from TMS experiments in the occipital area, in which a visual stimulus evokes a coordinated neural activity pattern organized according to a columnar structure and retinotopic topology (e.g., Tootell et al., 1998). Therefore, TMS might randomly activate the neurons in the target area, and such random activities may suppressively affect the coordinated neural processes via local inhibitory connections. These results suggest that TMS-induced suppression is transsynaptically mediated by the inhibitory network and that dynamical interaction in a neural population plays an important role in the temporal properties of the suppression.

Acknowledgments

This work was supported by the Special Postdoctoral Researchers Program of RIKEN, Grant-in-Aid for Scientific Research on Priority Areas No. 14084212, and Grant-in-Aid for Scientific Research (C) No. 14580438.

References

Adorjan, P., Levitt, J., Lund, J., & Obermayer, K. (1999). A model for the intracortical origin of orientation preference and tuning in macaque striate cortex. Vis. Neurosci., 16, 303–318.
Amassian, V., Cracco, R., Maccabee, P., & Cracco, J. (2002). Handbook of transcranial magnetic stimulation. In A. Pascual-Leone, N. Davey, J. Rothwell, E. Wassermann, & B.
Puri (Eds.), Visual system (chapter 5, pp. 323–334). London: Arnold.
Amassian, V., Cracco, R., Maccabee, P., Cracco, J., Rudell, A., & Eberle, L. (1989). Suppression of visual perception by magnetic coil stimulation of human occipital cortex. Electroencephalogr. Clin. Neurophysiol., 74, 458–462.
Amassian, V., Cracco, R., Maccabee, P., Cracco, J., Rudell, A., & Eberle, L. (1993). Unmasking human visual perception with the magnetic coil and its relationship to hemispheric asymmetry. Brain Res., 605, 312–316.
Amit, D., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7, 237–252.
Barker, A., Jalinous, R., & Freeston, I. (1985). Non-invasive magnetic stimulation of human motor cortex. Lancet, 1, 1106–1107.
A Network Model of TMS
329
Basser, P., & Roth, B. (2000). New currents in electrical stimulation of excitable tissues. Annu. Rev. Biomed. Eng., 2, 377–397.
Beckers, G., & Homberg, V. (1992). Cerebral visual motion blindness: Transitory akinetopsia induced by transcranial magnetic stimulation of human area V5. Proc. R. Soc. Lond. B. Biol. Sci., 249, 173–178.
Beckers, G., & Zeki, S. (1995). The consequences of inactivating areas V1 and V5 on visual motion perception. Brain, 118, 49–60.
Ben-Yishai, R., Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Bredfeldt, C., & Ringach, D. (2002). Dynamics of spatial frequency tuning in macaque V1. J. Neurosci., 22, 1976–1984.
Carandini, M., & Ringach, D. (1997). Predictions of a recurrent model of orientation selectivity. Vision Res., 37, 3061–3071.
Corthout, E., Uttl, B., Walsh, V., Hallett, M., & Cowey, A. (1999). Timing of activity in early visual cortex as revealed by transcranial magnetic stimulation. Neuroreport, 10, 2631–2634.
Corthout, E., Uttl, B., Ziemann, U., Cowey, A., & Hallett, M. (1999). Two periods of processing in the (circum)striate visual cortex as revealed by transcranial magnetic stimulation. Neuropsychologia, 37, 137–145.
Hallett, M. (1995). Transcranial magnetic stimulation: Negative effects. Adv. Neurol., 67, 107–113.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed., pp. 499–567). Cambridge, MA: MIT Press.
Hilgetag, C., Theoret, H., & Pascual-Leone, A. (2001). Enhanced visual spatial attention ipsilateral to rTMS-induced "virtual lesions" of human parietal cortex. Nat. Neurosci., 4, 953–957.
Hotson, J., & Anand, S. (1999). The selectivity and timing of motion processing in human temporo-parieto-occipital and occipital cortex: A transcranial magnetic stimulation study. Neuropsychologia, 37, 169–179.
Inghilleri, M., Berardelli, A., Cruccu, G., & Manfredi, M. (1993). Silent period evoked by transcranial stimulation of the human cortex and cervicomedullary junction. J. Physiol., 466, 521–534.
Inghilleri, M., Berardelli, A., Marchetti, P., & Manfredi, M. (1996). Effects of diazepam, baclofen and thiopental on the silent period evoked by transcranial magnetic stimulation in humans. Exp. Brain Res., 109, 467–472.
Kamitani, Y., Bhalodi, V., Kubota, Y., & Shimojo, S. (2001). A model of magnetic stimulation of neocortical neurons. Neurocomputing, 38–40, 697–703.
Kamitani, Y., & Shimojo, S. (1999). Manifestation of scotomas created by transcranial magnetic stimulation of human visual cortex. Nat. Neurosci., 2, 767–771.
Kammer, T., & Nusseck, H. (1998). Are recognition deficits following occipital lobe TMS explained by raised detection thresholds? Neuropsychologia, 36, 1161–1166.
Kastner, S., Demmer, I., & Ziemann, U. (1998). Transient visual field defects induced by transcranial magnetic stimulation over human occipital pole. Exp. Brain Res., 118, 19–26.
330
Y. Miyawaki and M. Okada
Marg, E., & Rudiak, D. (1994). Phosphenes induced by magnetic stimulation over the occipital brain: Description and probable site of stimulation. Optom. Vis. Sci., 71, 301–311.
Masur, H., Papke, K., & Oberwittler, C. (1993). Suppression of visual perception by transcranial magnetic stimulation—experimental findings in healthy subjects and patients with optic neuritis. Electroencephalogr. Clin. Neurophysiol., 86, 259–267.
Matthews, N., Luber, B., Qian, N., & Lisanby, S. (2001). Transcranial magnetic stimulation differentially affects speed and direction judgments. Exp. Brain Res., 140, 397–406.
Meyer, B., Diehl, R., Steinmetz, H., Britton, T., & Benecke, R. (1991). Magnetic stimuli applied over motor and visual cortex: Influence of coil position and field polarity on motor responses, phosphenes, and eye movements. Electroencephalogr. Clin. Neurophysiol. Suppl., 43, 121–134.
Miller, M., Fendrich, R., Eliassen, J., Demirel, S., & Gazzaniga, M. (1996). Transcranial magnetic stimulation: Delays in visual suppression due to luminance changes. Neuroreport, 7, 1740–1744.
Nagarajan, S., Durand, D., & Warman, E. (1993). Effects of induced electric fields on finite neuronal structures: A simulation study. IEEE Trans. Biomed. Eng., 40, 1175–1188.
Pascual-Leone, A., & Walsh, V. (2001). Fast backprojections from the motion to the primary visual area necessary for visual awareness. Science, 292, 510–512.
Priori, A., Berardelli, A., Mercuri, B., Inghilleri, M., & Manfredi, M. (1995). The effect of hyperventilation on motor cortical inhibition in humans: A study of the electromyographic silent period evoked by transcranial brain stimulation. Electroencephalogr. Clin. Neurophysiol., 97, 69–72.
Rattay, F. (1986). Analysis of models for external stimulation of axons. IEEE Trans. Biomed. Eng., 33, 974–977.
Ringach, D., Hawken, M., & Shapley, R. (1997). Dynamics of orientation tuning in macaque primary visual cortex. Nature, 387, 281–284.
Romeo, S., Gilio, F., Pedace, F., Ozkaynak, S., Inghilleri, M., Manfredi, M., & Berardelli, A. (2000). Changes in the cortical silent period after repetitive magnetic stimulation of cortical motor areas. Exp. Brain Res., 135, 504–510.
Roth, B., & Basser, P. (1990). A model of the stimulation of a nerve fiber by electromagnetic induction. IEEE Trans. Biomed. Eng., 37, 588–597.
Schmolesky, M., Wang, Y., Hanes, D., Thompson, K., Leutgeb, S., Schall, J., & Leventhal, A. (1998). Signal timing across the macaque visual system. J. Neurophysiol., 79, 3272–3278.
Somers, D., Nelson, S., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15, 5448–5465.
Sompolinsky, H., & Shapley, R. (1997). New perspectives on the mechanisms for orientation selectivity. Curr. Opin. Neurobiol., 7, 514–522.
Tootell, R., Hadjikhani, N., Vanduffel, W., Liu, A., Mendola, J., Sereno, M., & Dale, A. (1998). Functional analysis of primary visual cortex (V1) in humans. Proc. Natl. Acad. Sci. USA, 95, 811–817.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Walsh, V., & Cowey, A. (2000). Transcranial magnetic stimulation and cognitive neuroscience. Nat. Rev. Neurosci., 1, 73–79.
Wilson, J., & Anstis, S. (1969). Visual delay as a function of luminance. Am. J. Psychol., 82, 350–358.
Ziemann, U., Lonnecker, S., Steinhoff, B., & Paulus, W. (1996a). The effect of lorazepam on the motor cortical excitability in man. Exp. Brain Res., 109, 127–135.
Ziemann, U., Lonnecker, S., Steinhoff, B., & Paulus, W. (1996b). Effects of antiepileptic drugs on motor cortex excitability in humans: A transcranial magnetic stimulation study. Ann. Neurol., 40, 367–378.
Ziemann, U., Rothwell, J., & Ridding, M. (1996). Interaction between intracortical inhibition and facilitation in human motor cortex. J. Physiol., 496, 873–881.

Received December 27, 2002; accepted July 8, 2003.
LETTER
Communicated by Kwokleung Chan
Adaptive Two-Pass Median Filter Based on Support Vector Machines for Image Restoration Tzu-Chao Lin
[email protected] Pao-Ta Yu
[email protected] Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi, Taiwan 62107, R.O.C.
In this letter, a novel adaptive filter, the adaptive two-pass median (ATM) filter based on support vector machines (SVMs), is proposed to preserve more image details while effectively suppressing impulse noise for image restoration. The proposed filter is composed of a noise decision maker and two-pass median filters. Our new approach basically uses an SVM impulse detector to judge whether the input pixel is noise. If a pixel is detected as corrupted, the noise-free reduction median filter is triggered to replace it; otherwise, it remains unchanged. Then, to improve the quality of the restored image, a decision impulse filter is put to work in the second-pass filtering procedure. In suppressing both fixed-valued and random-valued impulse noise without degrading the quality of the fine details, the results of our extensive experiments demonstrate that the proposed filter outperforms earlier median-based filters in the literature. Our new filter also provides excellent robustness at various percentages of impulse noise.

1 Introduction

Noise cancellation is an important step in image processing applications such as edge detection, image segmentation, and data compression. In the literature, nonlinear methods seem to perform better than linear methods in the case of corruption by impulse noise. The median filter, one of the most popular nonlinear filters, has been extensively studied due to its ability to suppress impulse noise in a computationally efficient manner (Astola & Kuosmanen, 1997). Although noise suppression is indeed achieved, the median filter tends to blur fine details and lines in many cases; that is, it usually removes desirable details. This problem becomes especially serious when the size of the filter window is large. Some modified median-based filters have been proposed to overcome this problem, including weighted median (WM) filters and fuzzy rules-based median filters as well as decision-based median filters (Yin,
Neural Computation 16, 333–354 (2004)
© 2003 Massachusetts Institute of Technology
Yang, Gabbouj, & Neuvo, 1996; Ko & Lee, 1991; Sun & Neuvo, 1994; Chen, Ma, & Chen, 1999; Chen & Wu, 2001; Abreu & Mitra, 1995; Arakawa, 1996). Acceptable results have been obtained using these filters. Nevertheless, weighted median filters still tend either to remove fine details from the image or to retain too much impulse noise; that is, a trade-off exists between noise reduction and detail preservation (Ko & Lee, 1991). On the other hand, fuzzy rules-based median filters are controlled by fuzzy rules that judge the possibility of the existence of impulse noise (Arakawa, 1996; Tsai & Yu, 1999, 2000; Yu & Chen, 1995; Yu, Lee, & Kuo, 1995). The learning process is realized by training over a reference image. Because the values of the membership function parameters depend on the reference image, some situations may remain untrained; in other words, the generalization capability is poor (Arakawa, 1996). Recently, decision-based median filters, realized by thresholding operations, have been studied (Sun & Neuvo, 1994; Chen et al., 1999). They mainly use a predefined threshold value to control the filter so that it is activated only for contaminated pixels; that is, damage to good pixels is avoided. The switching schemes (Schemes I and II) of Sun and Neuvo (1994) prove more effective than uniformly applied methods. In both schemes, the final output switches between the output of the median filter and the identity. The tristate decision-making mechanism of Chen et al. (1999) incorporates the median filter and the center weighted median (CWM) filter, which is a special case of the WM filter, taking either the identity, the output of the median filter, or that of the CWM filter as the new pixel value. In general, decision-based median filters can be realized quite easily, but it is difficult to design a threshold function on local signal statistics that reliably judges whether an impulse is present.
Moreover, such hard switching schemes may work well on large fixed-valued impulses but poorly on random-valued ones, and vice versa. In this letter, we propose a novel adaptive two-pass median (ATM) filter based on SVMs and switching schemes to overcome the drawbacks of these methods. Basically, the proposed filter applies a new SVM approach: an impulse detection algorithm is put to use before the filtering process, and the detection results are used to decide whether a pixel should be modified. SVMs have recently been proposed as new kinds of feedforward networks for pattern recognition because of their good generalization capability (Haykin, 1999; Vapnik, 1995, 1998). In addition, to improve the filtering performance, the second pass uses a simple decision impulse filter to remove first-pass residual noise. With this new filtering framework, our filter can effectively eliminate both fixed-valued and random-valued impulse noise, and the preservation of the original signals is greatly improved in comparison with other median-based filters, without sacrificing too much computational complexity. Furthermore, the new filter shows excellent robustness at different impulse ratios, and it is independent of the reference image in the training phase.
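To make the decision-based (switching) idea concrete before the formal development, here is a minimal sketch of a hard-switching median filter of the kind attributed above to Sun and Neuvo: a pixel is replaced by the window median only when it deviates from that median by more than a threshold. This is an illustration only, not the ATM filter itself; the threshold and window size are arbitrary.

```python
import numpy as np

def switching_median(img, threshold=40.0, win=3):
    """Hard-switching median filter sketch: output is the window median for
    pixels that deviate strongly from it (suspected impulses), and the
    identity (the untouched pixel) otherwise."""
    pad = win // 2
    padded = np.pad(img, pad, mode="edge")   # replicate borders
    out = img.copy()
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            window = padded[i:i + win, j:j + win]
            med = np.median(window)
            if abs(float(img[i, j]) - med) > threshold:   # likely an impulse
                out[i, j] = med
    return out
```

On a flat patch with a single outlier, only the outlier is rewritten; every other pixel passes through unchanged, which is exactly the trade-off switching schemes aim for.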
Adaptive Median Filter Based on Support Vector Machines
In section 2, some basic concepts are reviewed, and the noise model is defined. In section 3, the design of the new SVM technique is illustrated. In section 4, the design of the new ATM filter is described in detail. In section 5, extensive simulations demonstrate that the proposed ATM filter outperforms other median-based filters. The conclusion is given in section 6.

2 Basic Concepts and Impulse Noise Model

2.1 Basic Concepts. Before introducing the proposed ATM filter, some notation must be defined. Let X denote a two-dimensional h × m image represented by

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{im} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{h1} & x_{h2} & \cdots & x_{hj} & \cdots & x_{hm} \end{pmatrix} = [x_{ij}]_{h \times m}, \quad (2.1)$$

where h and m are its height and width, respectively, and $x_{ij} \in \{0, 1, 2, \ldots, 255\}$ is the pixel value at position (i, j) in X. A filter window (or sliding window) whose size is $S = (2\tau + 1)^2 = 2n + 1$ (S is odd, in general) covers the image X at location (i, j) to obtain an observed sample matrix $W_{ij}$ defined by

$$W_{ij} = \begin{pmatrix} x_{i-\tau,j-\tau} & \cdots & x_{i-\tau,j} & \cdots & x_{i-\tau,j+\tau} \\ \vdots & & \vdots & & \vdots \\ x_{i,j-\tau} & \cdots & x_{ij} & \cdots & x_{i,j+\tau} \\ \vdots & & \vdots & & \vdots \\ x_{i+\tau,j-\tau} & \cdots & x_{i+\tau,j} & \cdots & x_{i+\tau,j+\tau} \end{pmatrix}_{(2\tau+1) \times (2\tau+1)}, \quad (2.2)$$

where $1 \le i \le h$ and $1 \le j \le m$. The central pixel value of the filter window $W_{ij}$ is $x_{ij}$. Let the filter window $W_{ij}$ slide over the image X from left to right, top to bottom, in raster-scan fashion. For convenience, applying the row-major method to the signal set $W_{ij}$ in equation 2.2, the observed sample matrix can be represented by

$$W_{ij} = (x_{i-\tau,j-\tau}, x_{i-\tau,j-\tau+1}, \ldots, x_{ij}, \ldots, x_{i+\tau,j+\tau-1}, x_{i+\tau,j+\tau}), \quad (2.3)$$

where the subscript ij can be replaced with a scalar running index $k = (i - 1) \times m + j$. Hence, transferring the $(2\tau + 1) \times (2\tau + 1)$ window into a one-dimensional vector, equation 2.3 can be written as

$$w(k) = (x_{-n}(k), x_{-n+1}(k), \ldots, x_0(k), \ldots, x_{n-1}(k), x_n(k)), \quad (2.4)$$
Figure 1: A 3 × 3 filter window around the center pixel x₀(k).
where x₀(k) (or x(k)) is the original central pixel value at location k. For example, consider a 3 × 3 filter window centered on x₀(k), as shown in Figure 1, such that

$$w(k) = (x_{-4}(k), \ldots, x_{-1}(k), x_0(k), x_1(k), \ldots, x_4(k)). \quad (2.5)$$

The output value of the usual median filter at location k, for a filter window of size 2n + 1, is denoted m(k):

$$m(k) = \mathrm{MED}\, w(k) = \mathrm{MED}\{x_{-n}(k), \ldots, x_{-1}(k), x_0(k), x_1(k), \ldots, x_n(k)\}, \quad (2.6)$$

where MED denotes the median operation.

2.2 The Impulse Noise Model. In this work, source images corrupted by impulse noise with probability of occurrence p can be described as

$$x(k) = \begin{cases} n(k), & \text{with probability } p, \\ a(k), & \text{with probability } 1 - p, \end{cases} \quad (2.7)$$

where n(k) denotes a pixel contaminated by an impulse at noise ratio p, and a(k) denotes a noise-free pixel. There are two types of impulse noise: fixed-valued and random-valued. In a gray-scale image, the fixed-valued impulse, also known as salt-and-pepper noise, shows up as a value equal to or near the maximum or minimum of the allowable dynamic range with equal probability (i.e., p/2 each), while the random-valued impulse is uniformly distributed over the range [0, 255] at ratio p.
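The noise model of equation 2.7 can be sketched directly in code. The function below is an illustrative generator, not part of the authors' method; the helper name and the use of a seeded NumPy generator are our choices.

```python
import numpy as np

def add_impulse_noise(img, p, kind="fixed", rng=None):
    """Corrupt an image with impulse noise at ratio p, following eq. 2.7:
    each pixel independently becomes n(k) with probability p, else stays a(k).
    kind="fixed": salt-and-pepper values 0 or 255, p/2 each;
    kind="random": uniform over [0, 255]."""
    rng = np.random.default_rng(rng)
    noisy = img.copy()
    mask = rng.random(img.shape) < p                 # which pixels become n(k)
    if kind == "fixed":
        values = rng.choice([0, 255], size=img.shape)
    else:
        values = rng.integers(0, 256, size=img.shape)
    noisy[mask] = values[mask]
    return noisy
```

For a mid-gray test image, roughly a fraction p of pixels end up at 0 or 255 in the fixed-valued case, which matches the salt-and-pepper description above.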
3 SVM Classifier

Among neural network approaches, many models could be chosen for the classification problem (Lin & Yu, 2003). Since SVMs have high generalization ability without the addition of a priori knowledge, even when the dimension of the input space is very high, the SVM approach is a good candidate for the classifier. We describe the basic theory of this promising technique, SVMs for classification (Haykin, 1999; Müller, Mika, Rätsch, Tsuda, & Schölkopf, 2001; Cristianini & John, 2000), and then apply the SVM technique to our noise detector.

3.1 Support Vector Machines. SVMs are basically aimed at a two-class classification problem. Generally, the two classes are separated by an optimal discrimination function, which is induced from training examples. Let $\{(x_i, y_i), i = 1, \ldots, N\}$ be a set of training examples, where each example $x_i \in R^t$ (t is the dimension of the input space) belongs to a class labeled by $y_i \in \{-1, 1\}$, with a hyperplane $w \cdot x + b = 0$ that divides the set of examples so that all examples with the same label lie on the same side of the hyperplane. If the set of examples is separated without error and the margin is maximal, the optimal separating hyperplane (OSH) is obtained. The margin can be seen as a measure of generalization ability: the larger the margin, the better the generalization expected (Vapnik, 1995, 1998). A linearly separating hyperplane in canonical form must satisfy the constraint

$$y_i[(w \cdot x_i) + b] \ge 1, \quad i = 1, \ldots, N. \quad (3.1)$$

The distance of an example x to the hyperplane is

$$d(w, b; x) = \frac{|w \cdot x + b|}{\|w\|}. \quad (3.2)$$

The margin between the closest examples (called the support vectors) and the hyperplane is $1/\|w\|$. Hence, the optimal separating hyperplane is the one that minimizes the cost function

$$\Phi(w) = \frac{1}{2}\|w\|^2. \quad (3.3)$$
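A toy numerical check of the canonical-form constraint (equation 3.1), the margin $1/\|w\|$, and the polynomial-kernel decision function that appears next in equation 3.4 is given below. The hyperplane, support vectors, and multipliers are hand-picked for illustration, not the output of any training procedure.

```python
import numpy as np

# Canonical-form check (eq. 3.1): y_i[(w . x_i) + b] >= 1 for all i,
# plus the geometric margin 1/||w|| for a hand-picked hyperplane.
w, b = np.array([2.0, 0.0]), 0.0
X = np.array([[0.5, 0.0], [-0.5, 0.0]])   # two support vectors on the margin
y = np.array([1.0, -1.0])
assert np.all(y * (X @ w + b) >= 1.0)
margin = 1.0 / np.linalg.norm(w)          # here 1/2

# Kernel decision function (eq. 3.4) with a degree-2 polynomial kernel
# K(u, v) = (u . v + 1)^2; the alphas below are made up for illustration.
def f(x, sv, alpha, y_sv, b_sv=0.0):
    K = (sv @ x + 1.0) ** 2               # kernel between each SV and x
    return 1 if np.sum(alpha * y_sv * K) + b_sv >= 0 else -1

sv = np.array([[1.0], [-1.0]])
alpha = np.array([0.5, 0.5])
y_sv = np.array([1.0, -1.0])
```

With these values, f classifies inputs on the positive side of the first support vector as +1 and those near the second as −1, mirroring the sign rule in equation 3.4.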
The solution to the optimization problem in equation 3.3 under the constraint of equation 3.1 can be obtained with Lagrange multipliers. In this work, we must consider both the linearly nonseparable case and the nonlinear case, for the training noisy examples include both. The nonlinear discrimination function is of the form

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right), \quad (3.4)$$

where $K(x_i, x)$ is a positive definite kernel function, and the nonnegative variables $\alpha_i$ are the Lagrange multipliers. Three kernel functions are widely used: polynomial kernels, radial basis function (RBF) kernels, and sigmoidal kernels (Haykin, 1999). In this work, polynomial kernels are used to implement the proposed filter (Lin, 2001, 2002; Chang, Hsu, & Lin, 2000; Hsu & Lin, 2002; Chang & Lin, 2002).

3.2 Impulse Noise Detector Using SVMs. In this subsection, we design a decision median filter based on the SVM technique. The structure of the SVM classifier is illustrated in Figure 2. The class label d is equal to 1 or −1, indicating that the input example x(k) is a noisy or a noise-free pixel, respectively. Note that x(k) and z(k) represent the input pixel value and the output value at location k, respectively. Figure 3 shows the feedforward network architecture of support vector machines for noise detection. Before training the SVMs, an observation vector must be extracted from the filter window w(k) to reduce the size of the SVM structure. The details of feature extraction are discussed in the next section.

4 The Design of the ATM Filter

The proposed ATM filter uses the SVM technique to classify each signal as either noise free or noise corrupted so that only noisy signals are filtered and good signals are preserved. Subsection 4.1 describes the structure of the ATM filter. Subsection 4.2 then presents a new approach to decide whether
Figure 2: Structure of the decision median filter.
Adaptive Median Filter Based on Support Vector Machines
Figure 3: Feedforward network architecture of support vector machines.
a pixel is corrupted by impulse noise. After that, the SVM impulse detector is devised to solve the classification problem. The two-pass filtering process is described in subsection 4.3.

4.1 The Structure of the ATM Filter. In section 3, we presented the SVM two-class recognition scheme. In other words, the SVM recognizes the input signals as two separate classes: noise-free and noise-corrupted signals. Now we describe the detailed structure of the proposed ATM filter, which consists of two-pass median filters to remove impulse noise, as shown in Figures 4a and 4b. In general, to reduce impulse noise effectively without degrading the quality of the original image, the filtering should operate only on the corrupted pixels; the noise-free pixels should be kept unchanged. This is achieved by the SVM detection process in the first pass of the proposed ATM filter. The first-pass filtering has two main steps, as shown in Figure 4a. Before the noise reduction filtering, the input signals are first
Figure 4: ATM filter. (a) First-pass ATM filter. (b) Second-pass ATM filter.
judged by the SVM impulse detector to be noise-free or noise-corrupted pixels. A binary flag map I(i, j), 1 ≤ i ≤ h, 1 ≤ j ≤ m, is used to indicate whether the pixels have been detected as impulses in the test image. (The feature extraction and the SVM impulse detector are discussed in the next subsection.) The input signal x(k) is determined to be noise free or noise corrupted depending on the binary flag map I(i, j). If a pixel is determined to be corrupted, it is replaced with a median estimated from only the noise-free pixels within the filter window w(k). Otherwise, it is kept unchanged. We call this special filter the noise-free reduction median
(NFR) filter. Because the SVM impulse detector might mistakenly leave some noisy pixels undetected or misdetect good pixels, a second filtering pass by the decision median filter is added to remove the remaining noisy pixels. In order to remove only residual impulses with very small signal distortion, a hard switching scheme is also part of our new filter.

4.2 A New Approach to Judge Whether an Impulse Noise Exists. In this work, we propose a new and efficient classifier approach to identify noise by using the SVM method. The SVM impulse detector, as shown in Figure 4a, uses the local characteristics in the filter window w(k) as observation vectors to classify the signals into two classes: noise-free and noise-corrupted signals. This approach can minimize not only the risk of misclassifying the examples in the training set (i.e., training errors) but also that of misclassifying the examples of the test set (i.e., generalization errors). The design of the detection approach consists of two steps: (1) feature extraction and (2) training the SVM impulse detector.

4.2.1 Feature Extraction Using a Local Information Measure. Before the noise-reduction filtering, an observation vector must be extracted from the filter window w(k) to identify noisy pixels (Lin & Yu, in press). We define the following three variables to generate the observation vector O(k), which serves as the input to the SVM impulse detector. In general, the amplitudes of most impulse noises are more prominent than the fine changes of the signal. Thus, the difference between the value x(k) of the input pixel and the output of the median filter provides an efficient measurement for identifying noisy pixels.

Definition 1. The variable u(k) denotes the absolute difference between the input x(k) and the median value m(k):

$$u(k) = |x(k) - m(k)|. \tag{4.1}$$
Note that u(k) is a measure of how likely the input x(k) is to be contaminated. A large value of u(k) indicates that the input x(k) is dissimilar to the median value of the filter window w(k); that is, the central pixel x(k) is likely corrupted by impulse noise. The correctness of the SVM impulse detector is decisive in the filtering process, because the whole noise filtering depends on the noise detection results. Most impulse noise can be detected using the variable u(k) as the indicator. Nevertheless, using only u(k) as the noise detector raises two problems. The first problem is that the value of u(k) alone cannot fully separate impulse noise from image structure. For example, line components one pixel wide are common in images; if x(k) is located on such a line, it may be misinterpreted as impulse noise and removed. This problem is shown in the following
example, where a skew line is located in a filter window as follows:

$$\begin{pmatrix} 200 & 10 & 10 \\ 10 & 200 & 10 \\ 10 & 10 & 200 \end{pmatrix}.$$
To avoid incorrect judgments, it is necessary to add other observations. Thus, another variable, v(k), is provided, defined as follows:

Definition 2.

$$v(k) = \frac{|x_0(k) - x_{c1}(k)| + |x_0(k) - x_{c2}(k)|}{2}, \tag{4.2}$$

where

$$|x_0(k) - x_{c1}(k)| \le |x_0(k) - x_{c2}(k)| \le |x_0(k) - x_i(k)|, \quad -n \le i \le n, \; i \ne 0, c1, c2.$$

Note that x_{c1}(k) and x_{c2}(k) are the pixel values closest to that of x_0(k) among its adjacent pixels in the filter window w(k), as shown in Figure 1. If the variable v(k) is applied, the line component in the filter window w(k) will not be detected as noise but will be preserved, because its v(k) is small. The second problem is that a good pixel with a large u(k) is easily misjudged as impulse noise. Such pixels typically lie in edge areas. Therefore, it is important to distinguish noise components from edge components for effective noise filtering and edge preservation. For example, edges in a 3 × 3 filter window can generally be represented as in Figure 5 (Lee, Hwang, & Sohn, 1998). That is, edges generally have more than three similar pixels in the 3 × 3 filter window, and they should not be detected
Figure 5: Edges in a 3 × 3 window.
as noisy pixels. To overcome this problem, another variable, q(k), is provided.

Definition 3.

$$c_{w_0}(k) = \mathrm{MED}\{x_{-n}(k), \ldots, x_{-1}(k), w_0 \diamond x_0(k), x_1(k), \ldots, x_n(k)\}, \tag{4.3}$$

where

$$\mathrm{MED}\{x_{-n}(k), \ldots, w_0 \diamond x_0(k), \ldots, x_n(k)\} = \mathrm{MED}\{x_{-n}(k), \ldots, \underbrace{x_0(k), \ldots, x_0(k)}_{w_0 \text{ times}}, \ldots, x_n(k)\}.$$

Here MED denotes the median operation, w_0 is a nonnegative integer weight, and w_0 ⋄ x_0(k) means that there are w_0 copies of the input pixel x_0(k) (Ko & Lee, 1991).

Definition 4.

$$q(k) = |x_0(k) - c_3(k)|. \tag{4.4}$$
If the variable q(k) is applied with center weight w_0 = 3, the edge components in the filter window w(k) will not be detected as noise but will be preserved, so we can decide that no impulse noise is located at the pixel x(k). In this work, based on the above three variables u(k), v(k), and q(k), the observation vectors are given by

$$O(k) = (u(k), v(k), q(k)). \tag{4.5}$$
The observation vectors O(k) are derived to extract the feature information from the filter window for training the SVM impulse detector. This feature information is also used in the filtering stage.

4.2.2 Training the SVM Impulse Detector. In the above steps, local features are extracted in an unsupervised manner. For training the SVM impulse detector, supervised class information for the training examples is combined with the extracted observation vectors O(k). After learning, the discrimination function between the noise-free and noise-corrupted classes is obtained; it is represented by several support vectors together with their combination coefficients, as in equation 3.4. That is, the optimal separating hyperplane is obtained. The training process is shown in Figure 6.
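As a concrete illustration, the three local measures can be sketched in a few lines of NumPy. This is a hedged reconstruction of definitions 1 through 4 for a single filter window, not the authors' code: the function name and the default center weight w_0 = 3 follow the text, but the implementation details are our own.

```python
import numpy as np

def observation_vector(window, w0=3):
    """Compute O(k) = (u(k), v(k), q(k)) for the centre pixel of a
    (2n+1) x (2n+1) filter window (definitions 1-4 in the text)."""
    flat = window.astype(float).ravel()
    c = len(flat) // 2              # index of the centre pixel x0(k)
    x0 = flat[c]

    # u(k): |x(k) - m(k)|, distance of the centre from the window median
    u = abs(x0 - np.median(flat))

    # v(k): mean distance to the two neighbours closest in value to x0;
    # small on one-pixel-wide lines, large on isolated impulses
    others = np.delete(flat, c)
    d = np.sort(np.abs(others - x0))
    v = (d[0] + d[1]) / 2.0

    # q(k): |x0 - c_{w0}(k)| with a centre-weighted median (w0 copies of x0);
    # small on edges that have several similar pixels in the window
    weighted = np.concatenate([others, np.repeat(x0, w0)])
    q = abs(x0 - np.median(weighted))

    return u, v, q
```

On the skew-line window shown above, u(k) is large (190) but v(k) is 0, which is exactly the cue that prevents the line pixel from being flagged as an impulse.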
Figure 6: Training framework of the SVM impulse detector.
4.3 Two-Pass Filtering

4.3.1 The First-Pass Filtering. The ATM first-pass filter consists mainly of two steps: an SVM impulse detector and a noise-free reduction filter, as shown in Figure 4a.

Step 1: SVM impulse detector. First, the feature information O(k) is extracted from the test image, and then the SVM impulse detector decides whether x(k) is a noise-free or noise-corrupted pixel according to the discrimination function learned in the training stage. The map I(i, j), called the noise flag map, records the locations of the impulse noise in the test image:

$$I(i, j) = \begin{cases} 1, & \text{if the SVM classifies the pixel as noisy} \\ 0, & \text{otherwise.} \end{cases} \tag{4.6}$$
Step 2: Noise-free reduction filter. If the input pixel is classified as an impulse according to the noise flag map I(i, j), the noise-free reduction median filter replaces it by the median value. Otherwise, the input pixel is not replaced, and its original intensity is the output. The median value here is selected from only the noise-free pixels, as decided by the flag map I(i, j), within the filter window w(k) (Wang & Zhang, 1999). The noise-free pixels, sorted in ascending order, define the vector

$$g(k) = (g_1(k), g_2(k), \ldots, g_R(k)), \tag{4.7}$$

where g_1(k) ≤ g_2(k) ≤ ⋯ ≤ g_R(k) are the elements of w(k) that are good pixels according to their corresponding I(i, j), and R denotes the number of pixels with I(i, j) = 0 in the filter window w(k). If R is odd, then

$$m(k) = \mathrm{MED}\,g(k) \tag{4.8}$$
$$= \mathrm{MED}\{g_1(k), g_2(k), \ldots, g_R(k)\}. \tag{4.9}$$
If R is even but not 0, then

$$m(k) = \frac{g_{R/2}(k) + g_{R/2+1}(k)}{2}, \tag{4.10}$$

where g_{R/2}(k) and g_{R/2+1}(k) are the (R/2)th and (R/2 + 1)th values of the sorted noise-free pixel vector g(k). The value of x(k) is modified only when the pixel x(k) is an impulse and R is greater than 0. The output y_0(k) of the first-pass filter is then

$$y_0(k) = \begin{cases} m(k), & \text{if } I(i, j) = 1 \text{ and } R > 0 \\ x(k), & \text{otherwise.} \end{cases} \tag{4.11}$$
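The first-pass rule of equations 4.6 through 4.11 can be sketched in NumPy as follows. This is an illustrative reconstruction under our own assumptions, not the authors' code: it fixes the window at 3 × 3 and takes a precomputed binary flag map in place of the SVM impulse detector's output.

```python
import numpy as np

def nfr_filter(image, flag):
    """First-pass noise-free reduction (NFR) median filter (eqs. 4.6-4.11).

    image : 2-D array of pixel values.
    flag  : binary map I(i, j); 1 where the detector marked an impulse.
    Corrupted pixels are replaced by the median of the *noise-free*
    pixels in their 3x3 window; good pixels pass through unchanged.
    """
    h, m = image.shape
    out = image.astype(float).copy()
    for i in range(h):
        for j in range(m):
            if flag[i, j] != 1:
                continue                        # noise-free: keep unchanged
            i0, i1 = max(i - 1, 0), min(i + 2, h)
            j0, j1 = max(j - 1, 0), min(j + 2, m)
            win = image[i0:i1, j0:j1].astype(float)
            good = win[flag[i0:i1, j0:j1] == 0]  # g(k), the noise-free pixels
            if good.size > 0:                    # R > 0
                # np.median averages the two middle values when R is even,
                # which matches eqs. 4.8-4.10
                out[i, j] = np.median(good)
    return out
```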
4.3.2 The Second-Pass Filtering. The first-pass filtering involves the SVM impulse detector and the noise-free reduction filter. Because the SVM impulse detector may make mistakes, two problems need to be addressed. First, undetected noisy pixels may remain in the restored image because the SVM impulse detector failed to identify them. Second, misdetected pixels may appear in the restored image, because the noise-free reduction filter modifies input pixels that are in fact good. Therefore, in addition to the SVM impulse detector and the noise-free reduction filter, we need a decision impulse filter to improve the performance of the ATM filter by reducing both undetected and misdetected pixels.
The decision impulse filter in the second-pass filtering detects corrupted pixels based on the difference between the output y_0(k) of the first-pass filter and the output y_{0,med}(k) of the median filter in the filter window w(k), as shown in Figure 4b. This difference is compared with a threshold value T. The second-pass ATM filter then uses the threshold to decide the final output y(k):

$$y(k) = \begin{cases} y_{0,\mathrm{med}}(k), & \text{if } |y_0(k) - y_{0,\mathrm{med}}(k)| \ge T \\ y_0(k), & \text{otherwise.} \end{cases} \tag{4.12}$$
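A minimal sketch of the second-pass decision rule of equation 4.12, again assuming a 3 × 3 window; the default T = 30 is our own illustrative choice from within the useful range reported later in the experiments.

```python
import numpy as np

def second_pass(y0, T=30):
    """Second-pass decision filter (eq. 4.12): replace a first-pass output
    pixel by the plain 3x3 median of the first-pass output only when it
    deviates from that median by at least T; otherwise keep it."""
    h, m = y0.shape
    y = y0.astype(float).copy()
    for i in range(h):
        for j in range(m):
            win = y0[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].astype(float)
            ymed = np.median(win)           # y0med(k) over the window w(k)
            if abs(y0[i, j] - ymed) >= T:   # hard switching on the residual
                y[i, j] = ymed
    return y
```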
A potential advantage of the decision impulse filter is that pixels not corrupted by impulse noise are unaffected by the filtering, while residual noisy pixels are still removed. Thus, the new ATM filter can suppress both fixed-valued and random-valued impulses without the degradation of fine details seen in other median-based filters.

5 Experimental Results

The proposed ATM filter has been tested to see how well it removes impulse noise and enhances image restoration performance. Extensive experiments have been conducted on a number of 512 × 512 test images to evaluate and compare the performance of the proposed ATM filter with a number of existing impulse removal techniques, which are variants of the standard median filter. The peak signal-to-noise ratio (PSNR) criterion is adopted to measure the restoration performance quantitatively. The optimal separating hyperplane is obtained from a reference image in the training process; that is, the images being filtered are outside the training set. The ATM filter has stronger generalization capability than other median-based filters. Moreover, its performance does not depend strongly on the training reference image; for example, the filtering results obtained with the Couple training image are close to those obtained with Lena or other training images. Note that to demonstrate the generalization capability of the ATM filter, the optimized discrimination function is used to restore images outside the training set; the Couple image, shown in Figure 7, is used as the training reference throughout the following experiments.
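For reference, the PSNR criterion used throughout the experiments can be computed as below; the peak value of 255 assumes 8-bit images, which the text does not state explicitly.

```python
import numpy as np

def psnr(original, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB: PSNR = 10 * log10(peak^2 / MSE)."""
    mse = np.mean((original.astype(float) - restored.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```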
The following experiments examine the effectiveness of our new filter: (1) the effect of the threshold T on the ATM filter, (2) a comparison with several other median-based filters in terms of noise suppression and detail preservation, and (3) a demonstration of the robustness of the ATM filter with respect to different impulse noise percentages. The influence of the threshold T on the second-pass filter is investigated in the first experiment. The PSNR values obtained by adjusting the threshold T for different test images are presented graphically
Figure 7: Original training Couple image.
in Figure 8. Proper threshold values, which do not require a laborious search, are obtained experimentally for different test images. With the value 255 as the threshold T, the filter degenerates into the ATM(I) filter (the first pass of ATM alone). Note that a threshold T of 0 corresponds to a standard median filter. The median filter may degrade fine details even as it suppresses noise. Figure 8 shows that the PSNR is significantly improved by using our ATM filter with appropriate thresholds. For the majority of images we tested, the threshold T employed by the ATM filter lies approximately in the range [20, 46] at p = 20%. We observe that the threshold range is quite consistent for both fixed-valued and random-valued impulses; thus, the performance is rather stable. The second experiment compares the ATM filter with the standard median (MED) filter, the center-weighted median (CWM) filter (Ko & Lee, 1991), switching scheme I (SWM-I) (Sun & Neuvo, 1994), switching scheme II (SWM-II) (Sun & Neuvo, 1994), the tristate median (TSM) filter (Chen et al., 1999), and the fuzzy median (FM) filter (Arakawa, 1996) in terms of noise removal capability (measured in PSNR). Table 1 compares the PSNR results for removing both fixed-valued and random-valued impulse noise at p = 20%. Table 1 reveals that the ATM filter achieves significant improvement over the other filters in suppressing both types of impulse noise. Note that
Figure 8: Effect of threshold T with respect to PSNR for the ATM filter. Training processes are all conducted with the Couple reference image. (a) Fixed-valued impulse noise. (b) Random-valued impulse noise. (Boat, Goldhill, Lake, Bridge, and Lena are the different test images.)
ATM(I) represents the first-pass filter and ATM represents our final proposed filter in Table 1. Figure 9 shows the comparative restoration results for the Lena image corrupted by 20% random-valued impulse noise and filtered by the MED, FM, ATM(I), and ATM filters, respectively. The ATM filter clearly produces a restored image of better subjective visual quality, with more noise
Table 1: Comparative Restoration Results in PSNR (dB) for 20% Impulse Noise.

(a) Fixed-valued noise

Filters   Boat    Goldhill  Lake    Bridge  Lena
MED       29.20   28.84     27.19   24.98   30.18
CWM       29.81   29.87     28.11   25.67   30.38
SWM-I     31.57   31.95     28.82   26.76   33.37
SWM-II    30.99   30.90     29.03   26.74   32.43
TSM       31.16   31.55     29.73   27.65   31.84
FM        30.86   30.95     28.61   27.41   31.32
ATM(I)    32.06   31.91     28.93   27.98   32.53
ATM       33.67   34.08     31.26   28.41   36.57

(b) Random-valued noise

Filters   Boat    Goldhill  Lake    Bridge  Lena
MED       30.14   29.71     27.84   25.33   31.72
CWM       30.99   30.89     28.89   26.34   32.42
SWM-I     31.57   31.95     28.82   26.76   33.37
SWM-II    30.99   30.90     29.03   26.74   32.43
TSM       32.29   32.44     30.22   27.39   34.13
FM        32.09   32.23     27.37   27.67   33.38
ATM(I)    31.75   31.74     26.69   27.43   33.02
ATM       32.56   32.41     30.22   27.77   34.45

Notes: Boat, Goldhill, Lake, Bridge, and Lena are the test images. For the FM and ATM filters, training used the Couple image at p = 20%.
suppression and detail preservation. In fact, the ATM(I) filter alone is enough to obtain satisfactory restored images. The third experiment demonstrates the robustness of the trained discrimination function at different percentages of impulse noise, regardless of what is used in the training set. In this experiment, the Couple reference image with 20% impulse noise is taken as the training image, independent of the actual corruption percentage. Figure 10 shows the comparative PSNR results for the restored Boat image corrupted by fixed-valued and random-valued impulse noise ranging from 10% to 30%. As Figure 10 shows, the ATM filter exhibits satisfactory robustness even when a matched learning image is not available or there is no accurate information about the impulse noise during training. Figure 11 shows the restoration results for the Boat image corrupted by random-valued impulse noise of 10% to 30%.
Figure 9: Subjective visual qualities of restored images of Lena. (a) Original image. (b) Image corrupted by 20% random-valued impulse noise, filtered by (c) MED filter, (d) FM filter, (e) ATM(I) filter, and (f) ATM filter.
6 Conclusion

In this article, a novel adaptive two-pass median filter has been proposed that improves on median-based filters by preserving more image details while effectively suppressing impulse noise. The proposed filter is composed of
Figure 10: Comparative restoration results in PSNR (dB) for filtering the Boat image corrupted by impulse noise at different percentages. Training processes for the FM and ATM filters are all conducted with the Couple reference image at p = 20%. (a) Fixed-valued impulse noise. (b) Random-valued impulse noise.
Figure 11: Robustly restored Boat image. (a) Original image. (b) Noisy image (20%). (c) Filtered image for 10% degradation. (d) Filtered image for 20% degradation. (e) Filtered image for 25% degradation. (f) Filtered image for 30% degradation by random-valued impulse noise.
a noise decision maker and a two-pass median filter. A new approach based on the SVM impulse detector is built into our filter to judge whether an input pixel is noisy. Owing to the excellent generalization capability of SVMs, the decision maker can reliably decide whether a pixel is noisy, so that the mean square error of the filter output is minimized. The extensive experimental results presented here demonstrate that the proposed ATM filter is superior to a number of well-accepted median-based filters in the literature. The ATM filter shows desirable robustness in suppressing noise and provides satisfactory image quality.

Acknowledgments

We thank Chih-Jen Lin for providing the software LIBSVM, a library for support vector machines (version 2.32).

References

Abreu, E., & Mitra, S. K. (1995). A signal-dependent rank ordered mean (SD-ROM) filter: A new approach for removal of impulses from highly corrupted images. In Proceedings of IEEE ICASSP. Detroit, MI: IEEE.
Arakawa, K. (1996). Median filters based on fuzzy rules and its application to image restoration. Fuzzy Sets and Systems, 77, 3–13.
Astola, J., & Kuosmanen, P. (1997). Fundamentals of nonlinear digital filtering. Boca Raton, FL: CRC.
Chang, C.-C., Hsu, C.-W., & Lin, C.-J. (2000). The analysis of decomposition methods for support vector machines. IEEE Trans. Neural Networks, 11, 1003–1008.
Chang, C.-C., & Lin, C.-J. (2002). Training nu-support vector regression: Theory and algorithms. Neural Computation, 14, 1959–1977.
Chen, T., Ma, K. K., & Chen, L. H. (1999). Tri-state median filter for image denoising. IEEE Trans. Image Processing, 8, 1834–1838.
Chen, T., & Wu, H. R. (2001). Application of partition-based median type filters for suppressing noise in images. IEEE Trans. Image Processing, 6(10), 829–836.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Hsu, C.-W., & Lin, C.-J. (2002). A simple decomposition method for support vector machines. Machine Learning, 46, 291–314.
Ko, S. J., & Lee, Y. H. (1991). Center weighted median filters and their applications to image enhancement. IEEE Trans. Circuits and Systems, 38, 984–993.
Lee, K.-C., Hwang, J., & Sohn, K.-H. (1998). Detection-estimation based approach for impulsive noise removal. Electronics Letters, 5(34), 449–450.
Lin, C.-J. (2001). On the convergence of the decomposition method for support vector machines. IEEE Trans. Neural Networks, 6, 1288–1298.
Lin, C.-J. (2002). A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE Trans. Neural Networks, 13, 1045–1052.
Lin, T.-C., & Yu, P.-T. (2003). Centroid neural network adaptive resonance theory for vector quantization. Signal Processing, 83, 649–654.
Lin, T.-C., & Yu, P.-T. (in press). Partition fuzzy median filter based on fuzzy rules for image restoration. Fuzzy Sets and Systems.
Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Trans. Neural Networks, 12, 181–201.
Sun, T., & Neuvo, Y. (1994). Detail-preserving median based filters in image processing. Pattern Recognition Letters, 15, 341–347.
Tsai, H.-H., & Yu, P.-T. (1999). Adaptive fuzzy hybrid multichannel filters for removal of impulsive noise from color images. Signal Processing, 74, 127–151.
Tsai, H.-H., & Yu, P.-T. (2000). Genetic-based fuzzy hybrid multichannel filter for color image restoration. Fuzzy Sets and Systems, 114, 203–224.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wang, Z., & Zhang, D. (1999). Progressive switching median filter for the removal of impulse noise from highly corrupted images. IEEE Trans. Circuits and Systems-II: Analog and Digital Signal Processing, 1(46), 78–80.
Yin, L., Yang, R., Gabbouj, M., & Neuvo, Y. (1996). Weighted median filters: A tutorial. IEEE Trans. Circuits Sys., 3(43), 157–192.
Yu, P.-T., & Chen, R. C. (1995). An optimal design of fuzzy (m, n) rank order filtering with hard decision neural learning. In Proceedings of IEEE ISCAS, International Symposium on Circuits and Systems (pp. 567–593). Seattle, WA: IEEE.
Yu, P.-T., Lee, C.-S., & Kuo, Y.-H. (1995). Weighted fuzzy mean filters for heavy-tailed noise removal. In Proceedings of IEEE ISUMA-NAFIPS, Third International Symposium on Uncertainty Modeling and Analysis (pp. 601–606).
Received May 21, 2003; accepted July 10, 2003.
Communicated by David Saad
LETTER
Improving Generalization Performance of Natural Gradient Learning Using Optimized Regularization by NIC

Hyeyoung Park
[email protected] Brain Science Institute, RIKEN, Saitama, Japan
Noboru Murata
[email protected] Waseda University, Tokyo, Japan
Shun-ichi Amari
[email protected] Brain Science Institute, RIKEN, Saitama, Japan
Natural gradient learning is known to be efficient in escaping plateaus, which are a main cause of the slow learning speed of neural networks. The adaptive natural gradient learning method for practical implementation has also been developed, and its advantage in real-world problems has been confirmed. In this letter, we deal with the generalization performance of the natural gradient method. Since natural gradient learning makes parameters fit the training data quickly, overfitting may easily occur, which results in poor generalization performance. To solve this problem, we introduce a regularization term in natural gradient learning and propose an efficient method for optimizing the scale of regularization using a generalized Akaike information criterion (the network information criterion, NIC). We discuss the properties of the regularization strength optimized by NIC through theoretical analysis as well as computer simulations. We confirm the computational efficiency and generalization performance of the proposed method in real-world applications through computational experiments on benchmark problems.

1 Introduction

The natural gradient method is a stochastic gradient method originating in information geometry. Amari (1998) proposed the concept of natural gradient learning and proved that it is Fisher efficient when the negative log likelihood is used as the loss function. Its dynamical properties have also been studied by statistical-mechanical analysis in the large-system-size limit, and it has been shown that the natural gradient can escape from plateaus more efficiently than the standard gradient (Rattray, Saad, &

Neural Computation 16, 355–382 (2004)
© 2003 Massachusetts Institute of Technology
Amari, 1998). The convergence properties of on-line learning methods, including natural gradient, have also been analyzed using stochastic approximation theory (Bottou, 1998). In order to implement the concept of natural gradient for learning in feedforward neural networks, Amari, Park, and Fukumizu (2000) proposed an adaptive method for obtaining an estimate of the natural gradient. It is called the adaptive natural gradient method, and they extended it to various stochastic neural network models. A number of computational experiments on well-known benchmark problems have confirmed that adaptive natural gradient learning can be applied successfully to various practical applications and can alleviate or even avoid plateaus. Besides learning speed, generalization performance is another important problem in neural networks. When a network structure, an error function, and training data are fixed, all learning algorithms based on gradient descent have the same equilibrium points in the error surface. Thus, in a theoretical sense, it is hard to say that the generalization performance of the natural gradient method differs much from that of the standard gradient method. In practice, however, the results may differ. Consider the following typical practical situation. First, we do not know the optimal complexity of network models for a given problem, and thus we use a large network with sufficient complexity. Second, since we do not know the minimum error achievable by the network, we stop learning when the decrement of the training error has been very small for a while and no further improvement is expected. In this situation, the solution obtained by the natural gradient method can frequently differ from that obtained by the standard method, because standard gradient learning is liable to be trapped in a long plateau, which can easily be misinterpreted as the end of the learning process. In this case, it is obvious that the generalization performances of the two methods differ. Therefore, more careful consideration of generalization performance is necessary when natural gradient learning is applied to practical problems. Park (2001) has investigated this situation and suggested addressing it with a regularization term. When applying regularization, it is very important to optimize the scale of regularization in order to obtain good generalization performance. Various methods have been developed for obtaining the optimal scale of regularization (Sigurdsson, Larsen, & Hansen, 2000). A simple method is to use a validation set for optimizing the scale. Composing a validation set, however, requires setting aside part of the learning data, so this method is not desirable when the learning data are scarce. To address this, the cross-validation method (Stone, 1974) has been widely used; however, obtaining an accurate estimate of the regularization parameter by cross-validation is computationally expensive. There have also been theoretical approaches to estimating the generalization error (Bishop, 1995; Moody, 1992; Rissanen, 1978; Vapnik,
Natural Gradient with Regularization Using NIC
1995) and applying the estimates to optimization of the regularization strength. However, these methods have shortcomings in terms of efficient practical implementation, and some of them require additional assumptions. In addition, there have been few theoretical analyses of the properties of the optimal values obtained with those estimators. In this letter, we use the network information criterion (NIC) to optimize the regularization strength. The NIC (Murata, Yoshizawa, & Amari, 1994), one of the estimators of generalization error, was developed for general error functions and network models. Since natural gradient learning (Amari, 1998) and its adaptive version for stochastic neural networks (Park, Amari, & Fukumizu, 2000) are derived from the same theoretical background as NIC, the natural gradient and NIC can be combined and applied to various stochastic learning models. In addition, since the theoretical meaning of applying NIC to optimize the regularization strength has been discussed by Murata (2001), the proposed method has strong theoretical justification as well. In this letter, we confirm the theoretical results by computer simulations. Furthermore, since NIC and natural gradient learning use the same Fisher information matrix, we can also achieve computational efficiency by sharing the matrix, as discussed in section 3.

2 Theory of Learning with Regularization

2.1 Natural Gradient Learning. We begin with a brief introduction of the natural gradient method. Since natural gradient learning is a kind of stochastic gradient descent, we consider a stochastic neural network model defined by

$$y = S\{f(x; \theta)\}, \tag{2.1}$$
where x is an n-dimensional input subject to a probability distribution q(x); θ = (θ_1, …, θ_m)^⊤ is an m-dimensional column vector that plays the role of a coordinate system in the space of networks; ⊤ denotes transposition; and the deterministic function f(x; θ) specifies the network structure. Although the most popular form of f is f(x, θ) = Σ_i v_i φ(w_i · x), the three-layer perceptron, the discussion in this letter places no restriction on the form of f beyond regularity conditions such as differentiability. The output y is emitted through a stochastic process denoted by S{·}. The stochastic process can be defined by a conditional probability density function p(y | x; θ) conditioned on the input x and the weight parameter θ, which can be regarded as the parameter of the density. Although the most typical example of p(y | x; θ) is the gaussian noise model of the form

$$p(y \mid x; \theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\{y - f(x, \theta)\}^2\right), \tag{2.2}$$
natural gradient learning and NIC can be applied to various types of stochastic models (see Park et al., 2000, and Murata, et al., 1994, for details). Moreover, we proceed with the scalar output for simplicity, but its generalization to multivariate outputs can also be easily done. From this stochastic viewpoint, we can consider a space of probability density functions fp.x; yI µ / j µ 2 <m g and dene an appropriate pointwise loss function d.z ; µ / D d.x; y; µ/ for each piece of data z D .x; y/ and parameter µ on the space. The ultimate goal of learning is to obtain the optimal parameter dened by
    θ* = argmin_θ E_Z[d(Z, θ)],    (2.3)
where E_Z denotes the expectation with respect to the true distribution of the random variable Z. Here, optimality is discussed in terms of minimizing the expected loss under a given environment represented by the joint distribution of input and output, p(z; θ) = p(x, y; θ) = p(y | x; θ) q(x). A natural way of estimating the optimal parameter is to adopt the empirical minimum loss estimator defined by

    θ̂ = argmin_θ L(D, θ) = argmin_θ (1/n) Σ_{p=1}^{n} d(z_p, θ),    (2.4)
when a sample set D = {(x_p, y_p)}_{p=1,…,n} is observed. To obtain θ̂, gradient descent learning is widely used. The natural gradient learning method is based on the fact that the space of p(x, y; θ) is a Riemannian space in which the metric tensor is given by the Fisher information matrix F(θ), defined by

    F(θ) = E_x[ E_{y|x;θ}[ ∇ log p(y | x; θ) ∇ log p(y | x; θ)ᵀ ] ].
(2.5)
Here, we use ∇ to denote the differential operator with respect to the parameter θ, and we will also use ∂ᵢ in some cases to denote the partial differential operator with respect to the ith element of θ. E_x[·] and E_{y|x;θ}[·] denote the expectations with respect to q(x) and p(y | x; θ), respectively. Using the inverse of the Fisher information matrix of equation 2.5, we obtain the natural gradient ∇̃L and its learning algorithm for stochastic systems,

    θ_{t+1} = θ_t − η_t ∇̃L(θ_t) = θ_t − η_t (F(θ_t))⁻¹ ∇L(θ_t),
(2.6)
where ∇L(θ) is the derivative of the loss function L with respect to θ. Since it is difficult to obtain the Fisher information matrix F(θ) and its inverse in a practical implementation, we need an approximation method.
Natural Gradient with Regularization Using NIC
359
The adaptive natural gradient (Amari et al., 2000) was originally developed for estimating the inverse of the Fisher information matrix adaptively in online learning. In this letter, we present a similar estimation method for batch learning. Since we have a set of training data {(x_p, y_p)}_{p=1,…,n} in batch learning, we can use the empirical distribution to derive an estimator F̃_n(θ) of the Fisher information matrix,

    F̃_n(θ) = (1/n) Σ_{p=1}^{n} h_p(θ) h_p(θ)ᵀ,
(2.7)
where h_p(θ) h_p(θ)ᵀ = E_{y|x;θ}[∇ log p(y | x_p; θ) ∇ log p(y | x_p; θ)ᵀ]. The explicit form of h_p(θ) can be obtained analytically by defining a specific form of p(y | x_p; θ). In the case of the gaussian noise model of equation 2.2, h_p(θ) is simply given by ∇f(x_p; θ) (see Park et al., 2000, for details). Then its inverse, F̃_n⁻¹ = (F̃_n(θ))⁻¹, can be calculated successively by using the iterative equation,

    F̃_n⁻¹ = (n/(n−1)) [ F̃_{n−1}⁻¹ − (F̃_{n−1}⁻¹ h_n h_nᵀ F̃_{n−1}⁻¹) / (n − 1 + h_nᵀ F̃_{n−1}⁻¹ h_n) ],    (2.8)
where the initial matrix F̃₀ is a positive definite symmetric matrix such as the identity matrix. This iterative technique can also be applied to calculating the NIC, as we show in section 3. We should emphasize here that the definition of the Fisher information matrix depends not on the error function directly but on the stochastic neural network model that we use. Therefore, the natural gradient can be applied to arbitrary error functions satisfying some regularity conditions, such as differentiability. Although a typical error function is L(θ) of equation 2.4, we can also use other types of error functions, as we show in section 3.

2.2 Regularization Method. Due to noise in the observed samples and the limited number of samples, the estimate θ̂ often deviates from θ*, depending on the sample set, which results in large generalization errors. In order to decrease the generalization error E_Z[d(Z, θ̂)], the regularization method is widely used. In the regularization method, we adopt an error function that includes a regularization term, defined by
    L_reg(D, α, θ) = (1/n) Σ_{p=1}^{n} d(z_p, θ) + (α/n) r(θ),
(2.9)
where α is the regularization parameter. Note that in this definition, the contribution of the regularization term is scaled as 1/n. By minimizing this error function, we obtain an estimator,
    θ̄ = argmin_θ L_reg(D, α, θ).

    r̂ = card{ σ̂ᵢ such that σ̂ᵢ > σ̂₁ ε, i = 1 to q }.    (1.11)
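As an illustration of the rank estimate in equation 1.11, the following sketch counts the singular values exceeding the threshold σ̂₁ε. It is plain Python rather than the Matlab of appendix B, and the singular values of a two-column matrix are obtained from the eigenvalues of its 2 × 2 Gram matrix, which is adequate for a toy illustration but not how a production SVD routine proceeds (squaring worsens the conditioning):

```python
import math

def singular_values_2col(col1, col2):
    """Singular values of an N x 2 matrix from the eigenvalues of its
    2 x 2 Gram matrix Z^T Z (illustration only; a real implementation
    would use an SVD routine, which avoids squaring the conditioning)."""
    a = sum(x * x for x in col1)
    b = sum(x * y for x, y in zip(col1, col2))
    c = sum(y * y for y in col2)
    mean = (a + c) / 2.0
    disc = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam1, lam2 = mean + disc, max(mean - disc, 0.0)
    return math.sqrt(lam1), math.sqrt(lam2)

def numerical_rank(sigmas, eps=2.2e-16):
    """Equation 1.11: r_hat = number of sigma_i exceeding sigma_1 * eps."""
    s1 = max(sigmas)
    return sum(1 for s in sigmas if s > s1 * eps)

# A fixed stand-in for the gaussian draws c_k of the article's example.
ones = [1.0] * 4
c = [0.3, -1.2, 0.7, 0.4]
alpha = 1e-6
s1, s2 = singular_values_2col(ones, [1.0 + alpha * ck for ck in c])
print(numerical_rank([s1, s2]))        # both columns independent: prints 2
s1, s2 = singular_values_2col(ones, ones)  # alpha = 0: exactly collinear
print(numerical_rank([s1, s2]))        # prints 1
```

Note that, as the next sentence of the article stresses, r̂ = q by itself does not certify that (ZᵀZ)⁻¹ is accurate; the condition number must be examined as well.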
However, the decision to discard a network should not be based on r̂, because one may have r̂ = q while κ(Z) > 10⁸, and hence while (ZᵀZ)⁻¹ is inaccurate. When r = q, one could think of computing the leverages using equation 1.8, since H = Z(ZᵀZ)⁻¹Zᵀ. However, as mentioned in Rivals and Personnaz (2000), it is better to use (see appendix A):

    ĥ_kk(R&P) = Σ_{i=1}^{r̂} (û_ki)²    for k = 1 to N,
(1.12)
where r̂ is computed with equation 1.11. As opposed to equation 1.8, expression 1.12 does not involve the inverses of the squares of possibly inaccurate small singular values. The leverages computed with equation 1.12 are hence less sensitive to the ill conditioning of Z than (ZᵀZ)⁻¹ computed with equation 1.8. However, we must consider the error on the q first columns of U, which span the range of Z. Let U_q denote the matrix of the q first columns of U. The angle between the range of U_q and that of its computed version Û_q (see appendix A) is approximately bounded by (Anderson et al., 1999)

    angle(R(U_q), R(Û_q)) ≤ σ₁ ε / min_{j≠i} |σᵢ − σⱼ|.
(1.13)
Jacobian Conditioning Analysis for Model Validation
405
Thus, even if a model is very ill conditioned, as long as the singular values of its Jacobian are not too close to one another, the leverages can be computed accurately with equation 1.12. It can also be shown that Û is quasi-orthogonal (Golub & Van Loan, 1983):

    Û = W + ΔW, with WᵀW = WWᵀ = I_N and ‖ΔW‖₂ ≤ ε,
(1.14)
where I_N denotes the identity matrix of order N. Thus, for leverage values obtained with equation 1.12, even if Û is not an accurate version of U, the relations 1.6 and 1.7 are satisfied to roughly unit roundoff.¹ Note that if one is interested only in the leverages, they can be computed as accurately using the QR decomposition as with equation 1.12 (Dongarra, Moler, Bunch, & Stewart, 1979) (see appendix A). The advantage is that the QR decomposition demands far fewer computations than the SVD.

2 Comment on the Proposition of Monari and Dreyfus (2002)

Monari and Dreyfus do not consider the notion of condition number but focus on the values of the leverages. They choose to discard a model when the relations 1.6 and 1.7 are not satisfied for the computed values of its leverages. There are two problems with this proposition:

1. Property 1.7 is an equality, but Monari and Dreyfus do not specify a numerical threshold that could be used in order to make a decision. There is a similar problem with equation 1.6.

2. Instead of using equation 1.12, Monari and Dreyfus compute the leverages according to
    ĥ_kk(M&D) = Σ_{i=1}^{q} [ (1/σ̂ᵢ) Σ_{j=1}^{q} z_kj v̂_ji ]²    for k = 1 to N,
(2.1)
equation A.4 in Monari and Dreyfus (2002). This equation is derived from the expression ZV(ΣᵀΣ)⁻¹VᵀZᵀ of the hat matrix when Z has full rank,¹

¹ Proof that equation 1.6 is satisfied to unit roundoff:

    ĥ_kk(R&P) = Σ_{i=1}^{r̂} (û_ki)² ≤ Σ_{i=1}^{N} (û_ki)² = 1    to unit roundoff.

Proof that equation 1.7 is satisfied to unit roundoff:

    Σ_{k=1}^{N} ĥ_kk(R&P) = trace(Û_q (Û_q)ᵀ) = trace((Û_q)ᵀ Û_q) = r̂    to unit roundoff.
406
I. Rivals and L. Personnaz
an expression that is, strangely enough, obtained by using the SVD of Z for (ZᵀZ)⁻¹ (expression A.5 of this article), but not for Z itself. Whereas the computed values of the leverages using equation 1.12 are accurate provided the singular values are not too close (property 1.13), and always satisfy equations 1.6 and 1.7 to unit roundoff (see property 1.14), there is no such result concerning the computed values obtained with equation 2.1. This has the following consequences:

• Assuming that a consistent threshold were specified for equations 1.6 and 1.7, because the leverages computed with equation 2.1 may be inaccurate, models well enough conditioned for the accurate estimation of confidence intervals may wrongly be discarded.

• Conversely, in the case of models whose Jacobian Z is ill conditioned, the leverages computed with equation 2.1 (and a fortiori with equation 1.12) may satisfy equations 1.6 and 1.7, and hence the corresponding models may not be discarded, while confidence intervals are meaningless.

In the next section, these consequences are illustrated with numerical examples.

3 Numerical Examples

3.1 Accuracy of the Leverages Computed with Equations 1.12 and 2.1. We construct an (N, 2) matrix Z such that its condition number can be adjusted by tuning a parameter α, while the range of Z does not depend on the value of α:

    Z = [1  1 + αc],    i.e., row k of Z is (1, 1 + αc_k) for k = 1 to N,    (3.1)

where the {c_k} are realizations of gaussian random variables (see the Matlab program given in appendix B). For a given c, R(Z) = span(1, c), and hence the true leverages {h_kk} have the same values for all α ≠ 0. As an example, for a single realization of Z, with N = 4 and α = 10⁻¹², we obtain the results reproduced in appendix B. In order to give statistically significant results, we performed 10,000 realizations of the matrix Z. Table 1 displays the averaged results.
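The invariance of the true leverages with respect to α, together with properties 1.6 and 1.7, can be checked with a short sketch. It is plain Python rather than the Matlab of appendix B, uses a fixed vector c as a stand-in for the gaussian draws, and builds an orthonormal basis of R(Z) by modified Gram-Schmidt, which plays the role of the columns of U_q in equation 1.12 (the article's QR-based route of equation A.13):

```python
import math

def orthonormal_basis(cols):
    """Modified Gram-Schmidt: an orthonormal basis of the span of `cols`."""
    basis = []
    for col in cols:
        v = list(col)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis

def leverages(cols):
    """h_kk = sum_i u_ki^2 over an orthonormal basis of the range
    (the mechanism of equations 1.12 and A.11/A.13)."""
    basis = orthonormal_basis(cols)
    n = len(cols[0])
    return [sum(q[k] ** 2 for q in basis) for k in range(n)]

N, alpha = 4, 1e-6
c = [0.3, -1.2, 0.7, 0.4]          # stand-in for gaussian realizations
ones = [1.0] * N
h_true = leverages([ones, c])      # leverages of [1, c], the "true" values
h_alpha = leverages([ones, [1.0 + alpha * ck for ck in c]])

assert all(0.0 <= h <= 1.0 + 1e-9 for h in h_alpha)       # relation 1.6
assert abs(sum(h_alpha) - 2.0) < 1e-12                    # relation 1.7, q = 2
assert all(abs(a - b) < 1e-6 for a, b in zip(h_true, h_alpha))
```

The sum of the leverages equals q to roundoff because each basis vector is normalized, regardless of α; this is the mechanism behind property 1.14.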
When using equation 1.12, the computed values {ĥ_kk(R&P)} satisfy equation 1.6, and their sum satisfies equation 1.7 to roughly unit roundoff for values of κ̂(Z) up to 10¹⁶: as a matter of fact, this is ensured by property 1.14 of the estimate Û. Moreover, according to property 1.13, since σ̂₁ − σ̂₂ ≈ 2.8 for all α, the values of the {ĥ_kk(R&P)} themselves are accurate.
Table 1: Estimation of the Leverages with Formulas 1.12 ({ĥ_kk(R&P)}) and 2.1 ({ĥ_kk(M&D)}).

α        ⟨κ̂(Z)⟩          ⟨|r̂ − Σ_k ĥ_kk(R&P)|⟩    ⟨|q − Σ_k ĥ_kk(M&D)|⟩
10⁻⁶     3.2154×10⁶       3.7881×10⁻¹⁶              1.2029×10⁻¹⁰
10⁻⁸     3.2154×10⁸       1.8516×10⁻¹⁶              1.2316×10⁻⁸
10⁻¹²    3.2154×10¹²      1.8443×10⁻¹⁶              1.2213×10⁻⁴
10⁻¹⁵    Inf              1.5901×10⁻¹⁶              –
When using equation 2.1, the sum of the {ĥ_kk(M&D)} falls far short of satisfying equation 1.7 to unit roundoff, even when Z is well conditioned. The {ĥ_kk(M&D)} may also fail to satisfy equation 1.6, whereas the {ĥ_kk(R&P)} always do (run the program of appendix B with α = 10⁻¹⁵). This example illustrates the lower accuracy of equation 2.1 with respect to equation 1.12. If the values of the leverages are central to a given procedure, they definitely should be computed according to equation 1.12. The results are similar for any N and any vector c, as well as for the Jacobian matrices of actual problems.²

3.2 Neural Modeling. We consider a single-input process simulated with

    y_k = sinc(10(x_k + 1)) + w_k    for k = 1 to N,
(3.2)
where sinc denotes the cardinal sine function and N = 50. The noise values {w_k} are drawn from a gaussian distribution with variance σ² = 2×10⁻³, and the input values {x_k} from a uniform distribution in [−1, 1]. Neural models with one layer of n_h "tanh" hidden neurons and a linear output neuron are considered, except the network without hidden neurons, which consists of a single linear neuron. They are trained 50 times using a quasi-Newton algorithm starting from different small random initial parameter values, in order to increase the chance of reaching an absolute minimum; the parameters corresponding to the lowest value of the cost function (1.2) are kept. The corresponding mean square training error is denoted by MSTE. The approximate leave-one-out score computed with equations 1.5 and 1.12 is denoted by ALOOS. The ALOOS is to be compared to a better, unbiased estimate of the performance, the MSPE, computed on an independent test set of 500 examples drawn from the same distribution as the training set (such a test set is usually not available in real life).

² The economic QR decomposition of Z can also be used (see appendix A): the values computed with equation A.13 do not differ from those computed with equation 1.12 by more than roughly the computer unit roundoff (check with the Matlab program of appendix B).
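The data generation of equation 3.2 can be sketched as follows (plain Python; the training of the tanh networks themselves is not reproduced here, and the normalization of sinc is taken as sin(u)/u, which the article does not spell out):

```python
import math
import random

def sinc(u):
    """Cardinal sine, taken here as sin(u)/u, with the removable
    singularity at u = 0 filled in."""
    return 1.0 if u == 0.0 else math.sin(u) / u

def make_training_set(n=50, noise_var=2e-3, seed=1):
    """Inputs uniform in [-1, 1]; targets y = sinc(10(x + 1)) + gaussian
    noise of variance noise_var, as in equation 3.2."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [sinc(10.0 * (x + 1.0)) + rng.gauss(0.0, math.sqrt(noise_var))
          for x in xs]
    return xs, ys

xs, ys = make_training_set()
```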
Table 2: Training of Neural Models with an Increasing Number n_h of Hidden Neurons.

n_h   MSTE          ALOOS         MSPE          κ̂(Z)
0     3.699×10⁻²    4.136×10⁻²    7.039×10⁻²    1.8
1     9.506×10⁻³    1.144×10⁻²    1.083×10⁻²    6.7×10²
2     3.181×10⁻³    4.831×10⁻³    6.866×10⁻³    8.4×10²
3     2.153×10⁻³    4.039×10⁻³    4.783×10⁻³    2.1×10⁹
4     1.888×10⁻³    5.316×10⁻⁷    4.436×10⁻³    9.0×10⁹

Note: The bottom two rows correspond to κ̂(Z) > 10⁸ (ill-conditioned networks).
The simulations are programmed in the C language. The SVD is performed with the Numerical Recipes routine "svdcmp" (Press, Teukolsky, Vetterling, & Flannery, 2002), but in double precision. The leverages {ĥ_kk(M&D)} (see equation 2.1) and the {ĥ_kk(R&P)} (see equation 1.12) are then computed as in the Matlab program given in appendix B. The results obtained are shown in Table 2. The outputs and the residuals of the networks with two and three hidden neurons are shown in Figures 1 and 2, respectively. Although the model with three hidden neurons has a slightly lower MSPE than the network with two, it is unreliable in the sense that one is unable to estimate correct confidence intervals for the regression with this network: computing (ZᵀZ)⁻¹ with equation 1.8 and multiplying it by ZᵀZ leads to a matrix that differs significantly from the identity matrix (by more than two for some of its elements). Fortunately, following Rivals and Personnaz (2000), we can discard this network right away on the basis of its condition number, which is too large. Suppose that κ̂(Z) is ignored and that, following Monari and Dreyfus (2002), only the computed leverage values are considered. The results obtained with equations 1.12 and 2.1 are given in Table 3. Both relations 1.6 and 1.7 are satisfied for networks with three and even four hidden neurons, whether the leverages are computed with equation 1.12 ({ĥ_kk(R&P)}) or with equation 2.1 ({ĥ_kk(M&D)}).

Table 3: Observing the Computed Leverage Values.
n_h   max(ĥ_kk(R&P))    r̂ − Σ_k ĥ_kk(R&P)   max(ĥ_kk(M&D))    q − Σ_k ĥ_kk(M&D)
0     8.1648771×10⁻²    2.22×10⁻¹⁶           8.1648771×10⁻²    0.0
1     8.5215146×10⁻¹    0.0                  8.5215146×10⁻¹    1.1×10⁻¹⁴
2     8.6468122×10⁻¹    0.0                  8.6468122×10⁻¹    9.8×10⁻¹⁵
3     8.8121097×10⁻¹    1.8×10⁻¹⁵            8.8121097×10⁻¹    6.2×10⁻¹¹
4     9.9999983×10⁻¹    0.0                  9.9999981×10⁻¹    1.6×10⁻⁸
Checking only equations 1.6 and 1.7 for the leverage values therefore does not lead to discarding the unusable models with three and four hidden neurons. Let us take a closer look at the performance of the models with two and three hidden neurons, and at the interpretation of the leverage values. Figures 1 and 2 exhibit the training examples and the residuals corresponding to leverage values larger than 0.5. For both networks, the largest leverage value corresponds to an example that lies at the boundary of the input domain explored by the training set. This is a typical situation of an influential example. For the network with two hidden neurons, a second leverage value is larger than 0.5: the fact that the corresponding example is located at an inflection point of the model output is the sign of its large influence on the parameter estimate. Let us now examine the network with four hidden neurons. Four of its leverage values are larger than 0.5.
Figure 1: Network with two hidden neurons (κ̂(Z) = 8.4×10²). (a) Regression (dotted line), model output (continuous line), training set (circles). (b) Residuals. The training example and the residual corresponding to the largest leverage (8.65×10⁻¹) are marked with a circle filled in black. A second leverage is larger than 0.5 (5.89×10⁻¹), and the corresponding training example and residual are marked with a circle filled in gray.
Figure 2: Network with three hidden neurons (κ̂(Z) = 2.1×10⁹). (a) Regression (dotted line), model output (continuous line), training set (circles). (b) Residuals. The training example and the residual corresponding to the largest leverage (8.81×10⁻¹) are marked with a circle filled in black.

These values are the following, from the smallest to the largest corresponding abscissa (see Figure 3):

    ĥ₃₈,₃₈(R&P) = 7.3268195×10⁻¹
    ĥ₃₉,₃₉(R&P) = 9.9999983×10⁻¹
    ĥ₃₃,₃₃(R&P) = 9.9981558×10⁻¹
    ĥ₁,₁(R&P) = 8.4477304×10⁻¹

The large leverage values correspond to the two extreme examples (38 and 1) and to two overfitted examples (39 and 33). In fact, large leverage values (i.e., values close to one but not necessarily larger than one) are the symptom of local overfitting at the corresponding training examples, or of extreme examples of the training set that are relatively isolated, and hence influential. However, checking only equations 1.6 and 1.7 for the leverage values would not lead to the detection of the misbehavior of this network. Finally, this example shows that ill conditioning is not systematically related to leverage values close to one: the largest leverage value of the very ill-conditioned neural network with three hidden neurons (κ̂(Z) = 2.1×10⁹) equals 8.81×10⁻¹ and is, hence, not much larger than the largest leverage value of the well-conditioned neural network with two hidden
Figure 3: Network with four hidden neurons (κ̂(Z) = 9.0×10⁹). (a) Regression (dotted line), model output (continuous line), training set (circles). (b) Residuals. The training example and the residual corresponding to the largest leverage (9.9999983×10⁻¹) are marked with a circle filled in black. The training examples and residuals corresponding to three other leverages larger than 0.5 are marked with circles filled in gray.

neurons (κ̂(Z) = 8.4×10²), which equals 8.65×10⁻¹. Ill conditioning is not systematically related to local overfitting but rather to a global parameter redundancy.

4 Conclusion

Our conclusion is threefold:

• In order to validate only neural candidates whose approximate parameter covariance matrix and confidence intervals can be reliably estimated, the first condition should be that the condition number of their Jacobian does not exceed the square root of the inverse of the computer unit roundoff (usually 10⁸).

• If a procedure relies heavily on the computed values of the diagonal elements of the hat matrix, the leverages, the latter should be computed according to expression 1.12, as recommended in Rivals and Personnaz (2000), rather than according to expression 2.1 given in Monari and Dreyfus (2002). Only the computation according to equation 1.12
ensures that the computed hat matrix is a projection matrix and that it is accurate.

• For candidates whose condition number is small enough and for which the leverages have been computed as accurately as possible according to equation 1.12, one may additionally check that none of the leverage values is close to one, as already proposed in Rivals and Personnaz (1998). Leverage values close to, but not necessarily larger than, one are indeed the symptom of overfitted examples or of isolated examples at the border of the input domain delimited by the training set.

5 Other Comment for Monari and Dreyfus (2002)

This comment concerns the estimation of the performance of the selection method presented in Monari and Dreyfus (2002), for its comparison to selection methods proposed by other authors. As in Anders and Korn (1999), the process to be modeled is simulated, and its output is corrupted by a gaussian noise with known variance σ². In order to perform statistically significant comparisons between selection methods, 1000 realizations of a training set of size N are generated. A separate test set of 500 examples is used for estimating the generalization mean square error (GMSE) of the selected models, and the following performance index is computed:
    ρ = (GMSE − σ²) / σ²,    (5.1)
equation 5.3 in Monari and Dreyfus (2002). In the case N = 100, two values of ⟨ρ⟩, the average value of ρ over the 1000 training sets, are given: (1) a value of 126% corresponding to the above definition, and (2) a value of 27% corresponding to a GMSE computed on part of the test set only. Strangely enough, 3% of the examples of the test set are considered as outliers and discarded from the test set. This value of 27% is compared to the values of ⟨ρ⟩ obtained by other selection procedures with the whole test set. This second value of 27% is meaningless. Putting aside the fact that considering examples of the performance-estimation set as outliers is questionable, let us call GMSE* the GMSE obtained in (2). In the most favorable case for Monari and Dreyfus (2002), that is, the case where we assume that the examples Monari and Dreyfus discarded correspond to the largest values of the gaussian noise (and not to model errors), this GMSE* should be compared not to σ² but to the variance σ*² of a noise with a truncated gaussian distribution (without its two 1.5% tails). In the example, σ² = 5×10⁻³ and σ*² = 4.4×10⁻³. Thus, the ratio
    ρ* = (GMSE* − σ*²) / σ*² > (GMSE* − σ²) / σ²    (5.2)

would be more representative of the actual model performance.
To conclude, the second value of ⟨ρ⟩ obtained in Monari and Dreyfus (2002) by discarding some examples of the test set can by no means be compared to those obtained by other selection procedures that correctly use the whole test set for the performance estimation.

Appendix A

This appendix summarizes the results used in this article; for details, see Golub and Van Loan (1983).

A.1 Theorem for the Singular Value Decomposition (SVD). Consider an (N, q) matrix Z with N ≥ q and rank(Z) = r ≤ q. There exist an (N, N) orthogonal matrix U and a (q, q) orthogonal matrix V such that

    Z = U Σ Vᵀ,
(A.1)
where Σ is an (N, q) matrix such that [Σ]_ij = 0 for i ≠ j, and whose elements {[Σ]_ii}, denoted by {σᵢ}, are termed the singular values of Z, with

    σ₁ ≥ σ₂ ≥ ⋯ ≥ σ_q ≥ 0.
(A.2)
If rank(Z) = r < q, then σ₁ ≥ ⋯ ≥ σ_r > σ_{r+1} = ⋯ = 0. The r first columns of U form an orthonormal basis of the range of Z.

A.2 Condition Number Using the SVD. If rank(Z) = q, then using the matrix two-norm,³ the condition number κ(Z) of Z is expressed as

    κ(Z) = ‖Z‖₂ ‖Z⁻¹‖₂ = σ₁ / σ_q.
(A.3)
If κ(Z) is large, the matrix Z is ill conditioned. We have the property

    κ(ZᵀZ) = (κ(Z))².
(A.4)
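Property A.4 is the reason behind the 10⁸ threshold of the conclusion: once κ(Z) exceeds the square root of 1/ε, the condition number of ZᵀZ exceeds 1/ε itself. It is easy to check numerically on a matrix whose singular values can be read off directly; a toy sketch in plain Python, with the columns chosen orthogonal so that the σᵢ are just the column norms:

```python
import math

# An (N, 2) matrix with orthogonal columns [3, 0, 0, 0] and [0, 0.5, 0, 0]:
# its singular values are the column norms, 3 and 0.5.
col1 = [3.0, 0.0, 0.0, 0.0]
col2 = [0.0, 0.5, 0.0, 0.0]
sigma1 = math.sqrt(sum(x * x for x in col1))
sigma2 = math.sqrt(sum(x * x for x in col2))
kappa_Z = sigma1 / sigma2              # condition number of Z (equation A.3)

# Z^T Z is diagonal with entries sigma1^2 and sigma2^2, so its condition
# number is the ratio of these, i.e. kappa(Z)^2 (property A.4).
kappa_ZtZ = sigma1 ** 2 / sigma2 ** 2
assert kappa_ZtZ == kappa_Z ** 2
print(kappa_Z, kappa_ZtZ)              # prints 6.0 36.0
```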
A.3 Inverse of ZᵀZ Using the SVD. If rank(Z) = q, the inverse of ZᵀZ exists and, using the SVD of Z, can be expressed as

    (ZᵀZ)⁻¹ = V (ΣᵀΣ)⁻¹ Vᵀ,    (A.5)

³ The two-norm of a matrix A is defined as ‖A‖₂ = sup_{x≠0} ‖Ax‖₂ / ‖x‖₂.
where (ΣᵀΣ)⁻¹ is a (q, q) diagonal matrix with

    [(ΣᵀΣ)⁻¹]_ii = 1/σᵢ²    for i = 1 to q.
(A.6)
A.4 Pseudo-Inverse of Z Using the SVD. Any (N, q) matrix Z with rank r ≤ q has a pseudo-inverse. It equals

    Z^I = V Σ^I Uᵀ,
(A.7)
where Σ^I is a (q, N) matrix whose only nonzero elements are the first r diagonal elements:

    [Σ^I]_ii = 1/σᵢ    for i = 1 to r.
(A.8)
A.5 Orthogonal Projection Matrix onto the Range of Z Using the SVD. The (N, N) projection matrix H onto the range of any (N, q) matrix Z is given by

    H = Z Z^I.
(A.9)
Using the SVD of Z, we obtain

    H = U Σ Vᵀ V Σ^I Uᵀ = U Σ Σ^I Uᵀ,
(A.10)
where the matrix Σ Σ^I is hence an (N, N) diagonal matrix whose r first diagonal elements are equal to 1 and all the others to 0 (see Golub & Van Loan, 1983). Thus, the diagonal elements of H, the leverages, are given by

    h_kk = Σ_{i=1}^{r} (u_ki)²    for k = 1 to N.
(A.11)
A.6 Theorem for the QR Decomposition. Consider an (N, q) matrix Z with N ≥ q and rank(Z) = q. There exist an (N, N) orthogonal matrix Q and an upper triangular (q, q) matrix R such that

    Z = Q [ R
            0 ],
(A.12)
The q first columns of Q form an orthonormal basis of the range of Z.

A.7 Leverages Using QR. Using the QR decomposition of Z, we obtain

    h_kk = Σ_{i=1}^{q} (q_ki)²    for k = 1 to N.
(A.13)
A.8 Angle Between Two Subspaces. Let S₁ and S₂ denote the ranges of two (N, q) matrices Z₁ and Z₂, and H₁ and H₂ the orthogonal projection matrices onto S₁ and S₂. The distance between the two subspaces S₁ and S₂ is defined as

    dist(S₁, S₂) = ‖H₁ − H₂‖₂.    (A.14)

The angle between S₁ and S₂ is defined as

    angle(S₁, S₂) = arcsin(dist(S₁, S₂)).
(A.15)
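Definitions A.14 and A.15 can be made concrete for one-dimensional subspaces of the plane, where the projection matrices are 2 × 2 and the two-norm has a closed form (largest absolute eigenvalue of a symmetric matrix). The following is a plain-Python sketch, not part of the article's programs:

```python
import math

def proj_matrix_2d(u):
    """Orthogonal projection onto the line spanned by the 2-vector u:
    H = u u^T / (u . u)."""
    nrm2 = u[0] ** 2 + u[1] ** 2
    return [[u[0] * u[0] / nrm2, u[0] * u[1] / nrm2],
            [u[1] * u[0] / nrm2, u[1] * u[1] / nrm2]]

def two_norm_sym_2x2(m):
    """Two-norm of a symmetric 2 x 2 matrix: largest absolute eigenvalue."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mean = (a + c) / 2.0
    disc = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return max(abs(mean + disc), abs(mean - disc))

def subspace_angle(u, v):
    """Equations A.14 and A.15 for one-dimensional subspaces of the plane."""
    h1, h2 = proj_matrix_2d(u), proj_matrix_2d(v)
    d = [[h1[i][j] - h2[i][j] for j in range(2)] for i in range(2)]
    dist = min(two_norm_sym_2x2(d), 1.0)   # clamp rounding noise above 1
    return math.asin(dist)

# For two lines at angle pi/6, dist = sin(pi/6) and the recovered
# subspace angle is pi/6 itself.
theta = math.pi / 6
ang = subspace_angle([1.0, 0.0], [math.cos(theta), math.sin(theta)])
assert abs(ang - theta) < 1e-12
```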
Appendix B

Following is the text of a Matlab program that constructs an ill-conditioned matrix Z (for a small value of α) and computes the leverage values using the SVD, with formulas 1.12 and 2.1, and also using the more economical QR decomposition, which is as accurate as equation 1.12:

clc
clear all
format compact
format short;

% construction of the (N,q) matrix Z
randn('seed',12345);
N=4; q=2;
alpha = 1e-12
c = randn(N,1);
Z = [ones(N,1) ones(N,1)+alpha*c];
condZ = cond(Z)

% singular value decomposition of Z
[U,S,V] = svd(Z);
s = diag(S);
diff_s = s(1)-s(2)

% "True" leverages
Z1 = [ones(N,1) c];
[U1,S1,V1] = svd(Z1);
diagH_true = zeros(N,1);
for k=1:N
  for i=1:q
    diagH_true(k) = diagH_true(k) + U1(k,i)^2;
  end
end
diagH_true = diagH_true

% Rivals and Personnaz estimates (1.12) of the leverages
tol = max(s)*eps;
r = sum(s > tol);
diagH_RP = zeros(N,1);
for k=1:N
  for i=1:r
    diagH_RP(k) = diagH_RP(k) + U(k,i)^2;
  end
end
diagH_RP = diagH_RP
r_sumd_RP = r-sum(diagH_RP)

% Monari and Dreyfus estimates (2.1) of the leverages
diagH_MD = zeros(N,1);
for k=1:N
  for i=1:q
    toto = 0;
    for j=1:q
      toto = toto + Z(k,j)*V(j,i);
    end
    diagH_MD(k) = diagH_MD(k) + (toto/s(i))^2;
  end
end
diagH_MD = diagH_MD
q_sumd_MD = q-sum(diagH_MD)

% Economic estimates of the leverages using the QR decomposition
[Q,R] = qr(Z);
diagH_QR = zeros(N,1);
for k=1:N
  for i=1:q
    diagH_QR(k) = diagH_QR(k) + Q(k,i)^2;
  end
end
diagH_QR = diagH_QR
r_sumd_QR = r-sum(diagH_QR)
Output of the program:

alpha =
   1.0000e-12
condZ =
   2.8525e+12
diff_s =
   2.8284
diagH_true =
   0.2719
   0.2580
   0.7758
   0.6943
diagH_RP =
   0.2719
   0.2580
   0.7759
   0.6943
r_sumd_RP =
   0
diagH_MD =
   0.2720
   0.2579
   0.7756
   0.6944
q_sumd_MD =
   1.8643e-05
diagH_QR =
   0.2719
   0.2580
   0.7759
   0.6943
r_sumd_QR =
   0
References

Anders, U., & Korn, O. (1999). Model selection in neural networks. Neural Networks, 12, 309–323.

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., & Sorensen, D. (1999). LAPACK users' guide (3rd ed.). Philadelphia: SIAM.

Dongarra, J., Moler, C. B., Bunch, J. R., & Stewart, G. W. (1979). LINPACK users' guide. Philadelphia: SIAM.

Golub, G. H., & Reinsch, C. (1970). Singular value decomposition and least-squares solutions. Numerische Mathematik, 14, 403–420.
Golub, G. H., & Van Loan, C. F. (1983). Matrix computations. Baltimore, MD: Johns Hopkins University Press.

Monari, G., & Dreyfus, G. (2002). Local overfitting control via leverages. Neural Computation, 14, 1481–1506.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2002). Numerical recipes in C. Cambridge: Cambridge University Press.

Rivals, I., & Personnaz, L. (1998). Construction of confidence intervals in neural modeling using a linear Taylor expansion. In Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling. Leuven.

Rivals, I., & Personnaz, L. (2000). Construction of confidence intervals for neural networks based on least squares estimation. Neural Networks, 13, 463–484.

Received May 31, 2002; accepted January 21, 2003.
LETTER
Communicated by Andrew Barto
Reply to the Comments on "Local Overfitting Control via Leverages" in "Jacobian Conditioning Analysis for Model Validation" by I. Rivals and L. Personnaz

Yacine Oussar
[email protected]
Gaétan Monari
[email protected]
Gérard Dreyfus
[email protected] ESPCI, Laboratoire d’Electronique, F-75005 Paris, France
"Jacobian Conditioning Analysis for Model Validation" by Rivals and Personnaz in this issue is a comment on Monari and Dreyfus (2002). In this reply, we disprove their claims. We point to flawed reasoning in the theoretical comments and to errors and inconsistencies in the numerical examples. Our replies are substantiated by seven counterexamples, inspired by actual data, which show that the comments on the accuracy of the computation of the leverages are unsupported and that following the approach they advocate leads to discarding valid models or validating overfitted models.

1 Introduction

"Jacobian Conditioning Analysis for Model Validation" by Rivals and Personnaz (this issue) is a detailed comment on Monari and Dreyfus (2002), from the second sentence of their abstract to the last paragraph of their text. The authors claim that we "followed" a previous, controversial (Larsen & Hansen, 2001) article of theirs (Rivals & Personnaz, 2000). In this reply, we disprove all of their claims. We point to flawed reasoning in the theoretical comments and to errors and inconsistencies in the numerical examples. Our replies to the comments are substantiated by seven counterexamples, which show that their comments on the accuracy of the computation of the leverages are unsupported and that following their approach leads to making wrong decisions: discarding valid models or validating overfitted models. This letter begins by disproving the comments in "Jacobian Conditioning Analysis" on the accuracy of the computation of the leverages. We then show that the authors' comments on model validation are erroneous, and we provide several counterexamples in which decisions made on the basis

Neural Computation 16, 419–443 (2004)
© 2003 Massachusetts Institute of Technology
420
Y. Oussar, G. Monari, and G. Dreyfus
of the condition-number selection criterion that the authors advocate are wrong. We conclude by summarizing the arguments that disprove all three conclusions of Rivals and Personnaz's comments.

2 Reply to the Comments on the Accuracy of the Computation of the Leverages

Section 1 of "Jacobian Conditioning Analysis" is entitled "On the Jacobian Matrix of a Nonlinear Model." The authors first recall theoretical results, and they recall the validation method advocated in Rivals and Personnaz (2000). That part will be replied to in section 3 of this letter. Section 2 of "Jacobian Conditioning Analysis" is entitled "Comment on the Proposition of Monari and Dreyfus (2002)." The authors claim that relation A.4 of our article (relation 2.1 of their comment) is not accurate enough for the purpose of model validation and that a more classical computation method should be used instead. In order to substantiate their claims, they exhibit a small handcrafted numerical example (section 3.1 of their article): we show below that it is irrelevant to model selection and that it is inconsistent with the claims made by the authors in the previous section of their article. They show that the traditional method can compute the leverages with an accuracy of 10⁻¹⁶; however, we show in the following that the authors fail to provide evidence that such accuracy is relevant in the context of nonlinear model selection, and that they fail to provide evidence that our method does not meet the actual accuracy requirements in that context. The recommended approach to a problem in numerical analysis consists of asking two questions: (1) What numerical accuracy should be achieved in order to get insight into the problem at hand? (2) How can that accuracy be achieved? Obviously, this approach should be used in discussing the accuracy of the computation of the leverages for model selection. In Monari and Dreyfus (2002), we discussed nonlinear model selection in the context of machine learning.
The computation of the leverages of the observations is one of the ingredients of the original validation method that we describe. Therefore, question 1, which is not asked in "Jacobian Conditioning Analysis," is: For the purpose of model selection, what is the desired accuracy for the computation of the leverages of models obtained by training from examples? The answer to that question is straightforward: the accuracy should be of the magnitude of the "noise" on the leverages. Since training is generally performed by iterative optimization of a cost function, it is stopped when the two models obtained in two successive iterations are considered "identical." Therefore, the numerical noise on the leverages is the difference between the values of the leverages of two models that are considered identical. Two models are considered identical if some criterion is met, for example, if the variation of the cost function is smaller than a prescribed value, if the variation of the magnitude of the parameter vector is smaller than a prescribed value, or similar criteria. Rivals and Personnaz
Reply to Rivals and Personnaz
421
do not address the question of estimating the numerical noise on the leverages; they claim that the accuracy should be on the order of $10^{-16}$, without providing any evidence that such accuracy is relevant to the problem at hand. Therefore, their criticisms are unsupported in the context of machine learning. Conversely, we describe, in section 3, examples of nonlinear model selection. Neural networks were trained from examples by minimizing the least-squares cost function with the Levenberg-Marquardt algorithm. Training was terminated when the magnitude of the relative variation of the cost function between two iterations of the algorithm became smaller than $10^{-10}$, which is an extremely conservative stopping criterion (a typical stopping criterion would be a relative difference of $10^{-5}$ or even more). The root mean square of the variations of the leverages was computed as
$$\Delta h(k) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(h_{ii}(k+1) - h_{ii}(k)\bigr)^{2}},$$
where $h_{ii}(k)$ is the leverage of observation $i$ at iteration $k$ of the training algorithm, and $N$ is the number of examples. Iterations $k$ and $k+1$ were chosen such that the relative variation of the cost function was smaller than $10^{-10}$. The values of $\Delta h$ were consistently found to be substantially larger than $10^{-10}$: clearly, there is no point in computing the leverages with an accuracy of $10^{-16}$, while the noise on the leverages is actually larger by more than six orders of magnitude, even in unusually conservative conditions. Furthermore, the discrepancy between the leverages computed by the traditional method (advocated in the comment) and by our method was smaller than $10^{-12}$: it is smaller, by several orders of magnitude, than the noise on the leverages and hence is not significant. Actually, in most real-life applications, training will be terminated much earlier; therefore, the root-mean-square difference between the leverages of two models that are considered equivalent will be larger by still many more orders of magnitude.
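The noise estimate $\Delta h(k)$ takes only a few lines to compute. The sketch below uses toy Jacobians (the sizes, seed, and $10^{-6}$ perturbation are hypothetical, not taken from our experiments) and obtains the leverages from an orthonormal basis of the Jacobian's column space, a standard numerically stable route; relation A.4 itself is not restated in this letter:

```python
import numpy as np

def leverages(Z):
    """Leverages h_ii: diagonal of the hat matrix Z (Z^T Z)^{-1} Z^T,
    obtained here from an orthonormal basis of the column space of Z."""
    Q, _ = np.linalg.qr(Z)           # economy-size QR: Z = Q R
    return np.sum(Q ** 2, axis=1)    # h_ii = squared row norms of Q

def leverage_noise(Z_k, Z_k1):
    """Root-mean-square variation Delta h(k) of the leverages between the
    Jacobians of two successive training iterations."""
    return np.sqrt(np.mean((leverages(Z_k1) - leverages(Z_k)) ** 2))

# Toy Jacobians of two models considered "identical" at the end of training
# (sizes and the 1e-6 perturbation are hypothetical illustrations):
rng = np.random.default_rng(0)
Z = rng.standard_normal((35, 5))
Z_next = Z + 1e-6 * rng.standard_normal((35, 5))
print(leverage_noise(Z, Z_next))     # far above 1e-16
```

Even for this tiny perturbation between two "identical" models, the leverage noise comes out many orders of magnitude above $10^{-16}$.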
In order to substantiate their claims, Rivals and Personnaz provide a handcrafted numerical example whose results are shown in Table 1 of their article. In the following, we show that those results are inconsistent with the authors' claims in "Jacobian Conditioning Analysis" and irrelevant to model selection. The authors consider a Jacobian matrix with two columns: the elements of one of them are equal to one, and the second column vector is equal to the sum of the first column vector and of a "small" vector, which is a normally distributed random vector multiplied by a scalar ranging from $10^{-6}$ to $10^{-15}$. Most results presented in that table are inconsistent and irrelevant, for the following reasons:

1. Rivals and Personnaz fail to state that most models used to derive the numerical results presented in their Table 1 are actually discarded by
422
Y. Oussar, G. Monari, and G. Dreyfus
the selection criterion that they advocate. Out of 10,000 random realizations of the Jacobian matrix, all models with $a = 10^{-12}$ and $a = 10^{-15}$ are discarded for having a condition number larger than $10^{+8}$; 9,988 models out of 10,000 with $a = 10^{-8}$ are discarded for the same reason; conversely, all models with $a = 10^{-6}$ are accepted. Therefore, the only results of Table 1 that are relevant to model selection, according to the selection criterion advocated by the authors, are the results pertaining to $a = 10^{-6}$ (first row of the table). The other results are irrelevant, since there is no point in discussing the accuracy of the computation of leverages for models that are discarded by the selection criterion advocated by the authors. Moreover, the results reported in the first row show that our method computes the leverages with an accuracy of $10^{-10}$, which is on the order of the noise on the leverages, as shown above.

2. The structure of the Jacobian matrix indicates that the model has two parameters, $\theta_0$ and $\theta_1$, and is of the form $y(x) = \theta_0 + f(x, \theta_1)$, with
$$\frac{\partial f}{\partial \theta_1} = 1 + \varepsilon(x, \theta_1),$$
where the values of $\varepsilon(x, \theta_1)$ can be modeled as realizations of a random variable equal to a normal variable multiplied by a factor of $10^{-6}$ to $10^{-15}$. Therefore, the model is of the form $y(x) = \theta_0 + \theta_1 + \gamma(x, \theta_1)$, with
$$\frac{\partial \gamma}{\partial \theta_1} = \varepsilon(x, \theta_1).$$
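The discard statistics quoted in point 1 above are easy to reproduce: the sketch below builds the handcrafted Jacobian (ones in the first column, ones plus $a$ times a gaussian vector in the second) and applies the $K < 10^{+8}$ screening criterion; the number of rows and the seed are hypothetical, not those of the comment.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50                        # hypothetical number of rows; not stated here
ones = np.ones(N)

conds = {}
for a in (1e-6, 1e-8, 1e-12, 1e-15):
    eps = a * rng.standard_normal(N)
    # The handcrafted Jacobian of the comment: columns (1, 1 + eps)
    Z = np.column_stack([ones, ones + eps])
    conds[a] = K = np.linalg.cond(Z)
    verdict = "accepted" if K < 1e8 else "discarded"
    print(f"a = {a:.0e}   K = {K:.1e}   -> {verdict} by the K < 1e8 criterion")
```

Because the two columns are nearly parallel, the condition number grows roughly as $1/a$, so the smaller values of $a$ are rejected by the authors' own criterion.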
No modeler would ever design a model with such a functional form. An obvious reparameterization of the model consists of defining a new parameter $\theta_2 = \theta_0 + \theta_1$, so that the first column is still made of 1's (derivative of $y$ with respect to $\theta_2$) and the second column is made of the values of $\varepsilon(x, \theta_1)$, modeled as small, random gaussian variables; thus, the Jacobian matrix is much better conditioned. If the same experiment is performed as in Table 1, after reparameterization, (1) for $a = 10^{-6}$, the difference between the number of parameters and the sum of the leverages computed by our method becomes $2 \times 10^{-16}$, so that the traditional method is not more accurate than ours; and (2) for $a = 10^{-8}$, about 25% of the models are accepted by the selection criterion advocated by Rivals and Personnaz, and, as above, the accuracy of the computation of the leverages by our method is on the order of $10^{-16}$; that is, it is comparable to that of the traditional method. Therefore, the example displayed in Table 1 of "Jacobian Conditioning Analysis" is handcrafted to prove that the traditional method of computing the leverages is more accurate than ours in that specific case; however, that
case is irrelevant to model selection because (1) there is no point in computing accurately the leverages of models that, according to the selection criterion advocated in the article, would actually be discarded, and (2) no knowledgeable modeler would design a model having a Jacobian matrix of that form. To summarize, in "Jacobian Conditioning Analysis," Rivals and Personnaz do not provide any evidence that an accuracy of $10^{-16}$ for the computation of the leverages is desirable in the context of machine learning or that our method does not achieve the accuracy required in realistic conditions. Therefore, their claim that the traditional method is superior to ours in the machine learning context is unsupported. In the next sections, we provide additional counterexamples to support the above conclusions further. Furthermore, it should be noted that the issue of numerical accuracy was far from central in our paper, relation A.4 being presented in an appendix. Therefore, we did not find it necessary to elaborate on that question, since it can be answered in a straightforward fashion, as shown above.

3 Reply to the Comments on Model Validation

In the previous section, we showed that the second conclusion stated in section 4 of "Jacobian Conditioning Analysis" is unsupported. In this section, we disprove the other two conclusions of the comment, together with claims made in Rivals and Personnaz (2000). We show that, contrary to the statement of the authors, we did not follow the approach advocated in that article, since it is not correct from a numerical analysis point of view and leads to making wrong decisions for model validation.

3.1 Reply to Section 1.1 of "Jacobian Conditioning Analysis". The use of the Jacobian matrix (relations 1.1 and 1.2 of the article) for the analysis of the identifiability of nonlinear models is classical in statistics. It can be traced back to 1956. Relation 1.4 describes one of several standard confidence interval estimates for nonlinear models.
It can be found in textbooks on nonlinear regression (see, e.g., Seber & Wild, 1989). In view of the contents of section 3.2 of our reply, it is relevant to note that many other confidence interval estimation methods for nonlinear models can be found in the literature (Tibshirani, 1996). Relation 1.5 in "Jacobian Conditioning Analysis" is an unfounded claim. The authors claim that in Rivals and Personnaz (2000), they established an upper bound of the leave-one-out error. Nothing of that kind can be found in that article, in which, following Hansen and Larsen (1996) and Sorensen, Norgard, Hansen, and Larsen (1996), they derived an approximation of the leave-one-out error under the assumption of the validity of a first-order Taylor approximation, in parameter space, of the output of the model. We provided a more rigorous proof in Monari (1999) and Monari and Dreyfus (2000). Appendix 1 is standard textbook material.
3.2 Reply to Section 1.2 of "Jacobian Conditioning Analysis". Section 1.2, entitled "Numerical Considerations," is erroneous in two respects:

1. Model selection aims at finding the statistical model that generalizes best among different candidate models. Poor generalization may be due to either poor training, which is very easy to detect, or overfitting. Overfitting occurs when the model has too many adjustable parameters in view of the complexity of the problem. Therefore, model selection and model validation aim at detecting, and discarding, models that are likely to exhibit overfitting. It has long been known in statistics that a preliminary screening can be performed by ascertaining that the Jacobian matrix of the model has full rank. Rivals and Personnaz dismiss that method and state that "the notion of the rank of Z is not usable in order to take the decision to discard a model." They claim that the criterion should be "whether the inverse of $Z^T Z$ needed for the estimation of a confidence interval can be estimated accurately." That shift of focus to a completely different issue is a scientific reasoning flaw: the fact that the confidence intervals that the authors advocate cannot be computed "accurately" (see the next paragraph for a discussion of that issue) for a given model does not mean that the model should be discarded; it means that the confidence interval estimation method should be discarded. Another confidence interval estimate should be used instead, one that does not rely on matrix inversion (e.g., bootstrap methods). Thus, the comments on our approach to model validation are unfounded, and the discussion of model selection presented in Rivals and Personnaz (2000) is essentially irrelevant. That will be further substantiated by six counterexamples in section 3.4.

2.
Rivals and Personnaz state that "a decision [of discarding a model] should be taken on the basis of whether the inverse of $Z^T Z$ needed for the estimation of a confidence interval can be estimated accurately." We have just shown that statement to be erroneous; nevertheless, let us discuss it from a numerical point of view. The authors claim that the confidence intervals should be estimated accurately, but they do not explain what "accurately" means. More specifically, they do not ask the relevant question: How accurately should the confidence interval be estimated in order to get insight into the problem of model validation? A part of the answer is the following: since the estimation of the confidence interval is derived from a first-order Taylor expansion of the output of the model, in parameter space, around a minimum of the cost function, there is no point in computing the confidence interval, and hence the inverse of $Z^T Z$, with a numerical accuracy that is better than the accuracy of the Taylor expansion. Despite the fact that many authors investigated that issue (see, e.g., Bates & Watts, 1998), Rivals and Personnaz claim that the condition number of the Jacobian matrix should be smaller than $10^{8}$, regardless of
the problem and of the model.¹ That cannot be true, since the inaccuracy due to the first-order Taylor expansion may be much higher than the numerical accuracy required by that criterion, and since the accuracy of the Taylor expansion is problem dependent. Therefore, that "universal" condition-number selection criterion can be expected to lead to discarding perfectly valid models, as shown below with simple counterexamples. To summarize our reply to the theoretical part of the comments concerning model validation:

1. From a basic point of view, there is a flaw in the scientific reasoning on which those comments are based. Instead of discarding models that do not generalize well, the selection method that Rivals and Personnaz advocate discards models for which the authors' favorite confidence interval estimation method fails. Actually, the authors should blame their confidence interval estimates for not being accurate, instead of blaming the model for not lending itself to their confidence interval estimation method.

2. From a numerical point of view, their selection criterion does not take into account the accuracy of the approximations on which the estimation of the confidence intervals is based, so that the criterion may be much more stringent than actually required.

The first misconception may lead to accepting models that are invalid; the second may lead to rejecting models that are valid. Both situations will be exemplified below by seven counterexamples.

3.3 Reply to the Numerical Example. As a further criticism of our article, section 3.2 of "Jacobian Conditioning Analysis," entitled "Neural Modeling," considers the simplest example that was presented in our article: the neural modeling of a $\sin x / x$ function with added noise. Rivals and Personnaz consider, just as we did, models with one to four neurons.
They discuss that problem in much more detail than we did, and they come to the conclusion that two-neuron architectures are appropriate, which is exactly the conclusion we stated in our article. Therefore, their example can hardly be considered as supporting their criticisms. Moreover, their numerical example contains errors and inconsistencies:

1. Table 2 of "Jacobian Conditioning Analysis" displays numerical results related to that example. In that table, as well as in the rest of the article, the authors use notations that are different from the notations of the article they are commenting on, which makes things confusing

¹ The same statement was made recently in Rivals and Personnaz (2003), reporting results obtained in 2000.
to nonspecialist readers. Moreover, they do not provide any definition of the quantities that they use. The quantity dubbed ALOOS seems to be the square of the estimated leave-one-out score denoted by $E_p$ in our article. The quantity called MSTE is the square of the training root mean square error TMSE of our article. The quantity termed MSPE is the square of the validation root mean square error denoted by VMSE in our article. Those quantities are defined as
$$\mathrm{TMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} R_i^2}, \qquad \mathrm{MSTE} = \mathrm{TMSE}^2,$$
$$E_p = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{R_i}{1 - h_{ii}}\right)^2}, \qquad \mathrm{ALOOS} = E_p^2,$$
$$\mathrm{VMSE} = \sqrt{\frac{1}{N_V}\sum_{i=1}^{N_V} R_i^2}, \qquad \mathrm{MSPE} = \mathrm{VMSE}^2,$$
where $R_i$ is the modeling error on example $i$ of the relevant set, $h_{ii}$ is the leverage of example $i$, $N$ is the number of examples in the training set, and $N_V$ is the number of examples in the validation set. The last line of the table reports an MSTE of $1.9 \times 10^{-3}$ and an ALOOS of $5.3 \times 10^{-7}$. That cannot be true. Since $0 < h_{ii} < 1$, each term of the sum in $E_p$ is larger than the corresponding term in TMSE, so that one has $E_p > \mathrm{TMSE}$ or, equivalently, $\mathrm{ALOOS} > \mathrm{MSTE}$. Therefore, the result reported for four hidden neurons is wrong by at least four orders of magnitude.

2. Rivals and Personnaz agree with us that a two-hidden-neuron architecture is the most appropriate for the problem under investigation. Although our article does not discuss the models with three hidden neurons, they claim that our method would have accepted models with three hidden neurons, whereas they dismiss that architecture. That is worth investigating. For architectures with three hidden neurons, we performed 100 trainings with different initial values of the parameters, in the conditions that Rivals and Personnaz describe. That generates a relatively small number of significantly different models. Taking advantage of the fact that the "true" regression function $f$ is known, the mean square distance $D$ between the model and the true
Table 1: Mean Square Prediction Errors, Squared Distance, and Condition Number of the Jacobian Matrix for Models with Three Hidden Neurons.

              Minimum 1       Minimum 2    Minimum 3    Other minima
MSPE          2.9 × 10⁻³      3.8 × 10⁻³   4.3 × 10⁻³   > 8 × 10⁻³
D²            1.2 × 10⁻³      2.1 × 10⁻³   1.6 × 10⁻³   > 5 × 10⁻³
K             2 to 7 × 10⁺⁹   10⁺¹⁶        4.0 × 10⁺⁶   10⁺⁸ to 10⁺¹⁷
Occurrences   16              12           67           5
regression function was computed as
$$D = \sqrt{\frac{1}{N_D}\sum_{k=1}^{N_D}\bigl[f(x_k) - g(x_k)\bigr]^2},$$
where $N_D = 5{,}000$ (drawn from a uniform distribution), $f(x) = \mathrm{sinc}[10(x+1)/\pi]$, and $g(x)$ is the output of the model. $D$ is the best estimate of the theoretical risk, that is, of the generalization ability of the model. Table 1 shows the MSPE, the distance $D$, the condition number $K$ of the Jacobian matrix, and the number of occurrences of the cost function minimum. The best model (the model with the smallest $D$, which also has the smallest MSPE) is discarded by the condition-number selection criterion advocated by Rivals and Personnaz. It is also worth noting that models that correspond to the same minimum of the cost function have widely varying condition numbers, even though the MSPEs agree to four decimal places. Other examples of similar situations, where the condition-number selection criterion rejects valid models, are exhibited below.

3. Rivals and Personnaz claim that they generated a training set by adding noise to the function $\mathrm{sinc}[10(x+1)]$. From a cursory look at Figures 1 and 2 of their article, it is clear that such is not the case. Actually, it seems that $\mathrm{sinc}[10(x+1)/\pi]$ was implemented.

To summarize, in order to prove that their selection method is superior to ours, Rivals and Personnaz investigated one of the examples presented in our article. Their conclusion is exactly the same as ours, so that their example cannot be considered as evidence of the superiority of their approach. Moreover, we showed that their example contains a result that is erroneous by at least four orders of magnitude. In addition, their validation method leads to discarding perfectly valid models, for reasons that we explained in section 3.2 here.
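The inequality invoked in point 1 above, $E_p > \mathrm{TMSE}$ whenever $0 < h_{ii} < 1$, can be checked term by term in a few lines; the residuals and leverages below are hypothetical values, not those of the commented example:

```python
import numpy as np

def tmse(residuals):
    """Training root-mean-square error."""
    return np.sqrt(np.mean(residuals ** 2))

def ep(residuals, h):
    """Leave-one-out score estimate E_p: each residual R_i is inflated
    by 1/(1 - h_ii), which is > 1 whenever 0 < h_ii < 1."""
    return np.sqrt(np.mean((residuals / (1.0 - h)) ** 2))

# Hypothetical residuals and leverages (any values with 0 < h_ii < 1 will do):
rng = np.random.default_rng(2)
R = 0.1 * rng.standard_normal(35)
h = rng.uniform(0.01, 0.9, size=35)

print("MSTE  =", tmse(R) ** 2)       # MSTE = TMSE^2
print("ALOOS =", ep(R, h) ** 2)      # ALOOS = E_p^2, necessarily larger
```

Whatever the residuals, ALOOS can never fall below MSTE, which is why the last line of the comment's Table 2 cannot be correct.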
Figure 1: Liquidus temperature vs. lithium oxide concentration.
3.4 Additional Counterexamples. The following counterexamples disprove the claim that the stated limit on the condition number of the Jacobian matrix is an appropriate screening criterion, which invalidates the first conclusion in "Jacobian Conditioning Analysis" and a large part of the contents of Rivals and Personnaz (2000). The counterexamples additionally show that, as we have already demonstrated, the accuracy of the computation of the leverages is not critical for model selection; that invalidates the second point of the conclusion of the comment, as well as another part of Rivals and Personnaz (2000). The problem that we address here is inspired by real data. The quantity to be modeled is a thermodynamic parameter (the liquidus temperature) of industrial glasses, as a function of the oxide contents of the latter. The estimation of the liquidus temperature is important for glass manufacturing processes (a detailed description of the application can be found in Dreyfus & Dreyfus, 2003). Figure 1 shows the simplest instance of real data on that problem. It is interesting because the singular points actually have a physical meaning, related to phase transitions. That application prompted us to investigate the modeling of data generated from the function
$$y = |\sin x| + \cos x, \qquad (3.1)$$
which is plotted on Figure 2. We exhibit three pairs of counterexamples. In each pair, one counterexample shows the condition-number selection criterion discarding a valid model, and the other counterexample shows that criterion validating a
Figure 2: Academic example inspired from Figure 1.
model that is overfitted. In all of the following, the values of the leverages computed by our method and by the traditional method advocated in "Jacobian Conditioning Analysis" are in excellent agreement (e.g., they agree within nine decimal places for counterexample 1 and within 11 decimal places for counterexample 2). Therefore, those examples do not provide any support to the claim that the traditional method is superior to ours in the machine learning context. In all the following, neural network training was performed by the Levenberg-Marquardt algorithm. For a given number of hidden neurons, 100 trainings were performed with different parameter initializations. Seven thousand equally spaced examples were generated as a validation set. Experiments were performed under Matlab on a standard PC. Distance $D$ (defined in section 3.3) was also computed from 7,000 equally spaced points. Following the notations of Monari and Dreyfus (2002), we denote by TMSE the root-mean-square error on the training set (as defined in section 3.3) and by VMSE the equivalent quantity computed on the validation set of 7,000 examples (an excellent estimate of the generalization error). Before discussing the counterexamples, it may be useful to emphasize the main point of Monari and Dreyfus (2002): we claim that overfitting can be efficiently monitored by checking the distribution of the leverages (hence the title of our article, "Local Overfitting Control via Leverages").
The leverages obey the following relations:
$$0 \le h_{ii} \le 1 \quad \forall i, \qquad \sum_{i=1}^{N} h_{ii} = q.$$
Since the sum of the leverages is equal to the number of parameters $q$, the leverage of example $i$ can be interpreted as the fraction of the degrees of freedom of the model that is devoted to fitting the model to observation $i$. Therefore, ideally, all examples should have essentially the same leverage, equal to $q/N$: if one or several points have leverages close to 1, the model has devoted a large fraction of its degrees of freedom to fitting those points and, hence, may have fitted the noise on those examples very accurately. In other words, the more peaked the distribution of the leverages around $q/N$, the less prone to overfitting the model. Rivals and Personnaz were unaware of that point in their previous articles, as evidenced by the fact that they never used the word leverage prior to "Jacobian Conditioning Analysis." In order to give a quantitative assessment of the "distance" of the model to a model where all leverages are equal to $q/N$, we defined (Monari & Dreyfus, 2002) the parameter
$$\mu = \frac{1}{N}\sum_{i=1}^{N}\sqrt{\frac{N h_{ii}}{q}}.$$
The closer $\mu$ is to 1, the more peaked the distribution of the leverages around its mean $q/N$; $\mu = 1$ if all leverages are equal to $q/N$. Hence, the closer $\mu$ to 1, the less overfitted the model. Alternatively, one may use the normalized standard deviation $\sigma_n$ of the leverages, defined as
$$\sigma_n = \sqrt{\frac{N}{q(N-q)}\sum_{i=1}^{N}\left(h_{ii} - \frac{q}{N}\right)^2}.$$
$\sigma_n = 0$ if all leverages are equal to $q/N$, and $\sigma_n = 1$ in the worst case of overfitting, where $q$ leverages are equal to 1 and $N - q$ leverages are equal to zero. Hence, the smaller $\sigma_n$, the less prone to overfitting the model. Both quantities are computed in the following counterexamples.

3.4.1 Counterexamples 1 and 2. In a first set of experiments, 35 equally spaced points were generated from relation 3.1 to serve as a training set. Uniform noise was added, with standard deviation 0.1. Figure 3 shows the generating function, the training data, and the output of a model with five hidden neurons, together with the values of the leverages.
The relevant figures for that model are summarized in the first row of Table 2.
Figure 3: Counterexample 1. The condition-number selection criterion discards a valid model.
Table 2: Relevant Data for Counterexamples 1 and 2.

                   TMSE    VMSE    μ       σ_n    Distance D   K           Leverages > 0.95
Counterexample 1   0.078   0.117   0.984   0.38   0.062        1.5 × 10⁹   3
Counterexample 2   0.066   0.123   0.979   0.56   0.072        2.9 × 10⁶   11

Notes: TMSE, VMSE, μ, σ_n, and D as defined in the text. K: Condition number of the Jacobian matrix. Last column: Number of leverages that are larger than 0.95.
The condition number of its Jacobian matrix exceeds the limit stated in "Jacobian Conditioning Analysis" by more than one order of magnitude, so that it should be discarded according to those comments. Actually, it is a valid model: its generalization error is small, since its VMSE, computed from 7,000 examples, is only slightly larger than the standard deviation of the noise, and the distance $D$ is even smaller. $\mu$ is close to 1, and $\sigma_n$ is far from 1. Finally, the leverage values computed by the traditional method and by ours (relation A.4 of Monari & Dreyfus, 2002) are in excellent agreement: the root mean square of the differences between the leverages computed by our method and by the traditional method is on the order of $3.7 \times 10^{-10}$. The three points with high leverages are located at the boundaries of the input range, as expected. The W-shape of the leverage graph indicates that the points located in the vicinity of the main minimum are influential, as expected. Thus, counterexample 1 is an example of the condition-number selection criterion stated in "Jacobian Conditioning Analysis" discarding a valid model. Counterexample 2 (see Figure 4) is a model with seven hidden neurons, trained in the same conditions and with the same data as counterexample 1. The characteristics of the model are shown in the second row of Table 2. The root mean square of the differences between the leverages computed by our method and by the traditional method advocated by Rivals and Personnaz is on the order of $1.7 \times 10^{-12}$. The condition number of the Jacobian matrix of the model is way below the limit stated by Rivals and Personnaz. Hence, according to their comment, the model should be accepted. However, it is clear from Figure 4 that the model is strongly overfitted. That is further substantiated by three facts:

1. The validation error VMSE is twice as large as the training error TMSE.

2. Eleven leverages (almost one out of three) are larger than 0.95, and 13 of them are larger than 0.90.
The high leverages are located between $x = 1$ and $x = 2$, where overfitting is clearly apparent in Figure 4, and also at the boundaries of the input range, as usual.

3. $\mu$ is substantially smaller than 1, or, equivalently, the standard deviation of the leverages $\sigma_n$ is larger than that of counterexample 1.

The selection method that Rivals and Personnaz advocate nevertheless fails to detect that gross overfitting. As usual, computing the leverages by the traditional method and by relation A.4 of our article does not make any significant difference. The difference between the two models is also clear from Figure 5, which shows the histograms of the leverages for the model that is discarded by the condition-number selection criterion (top figure) and for the model that is accepted by that criterion (bottom figure). As explained above, the distribution of the leverages should be as peaked as possible around $q/N$. Clearly, the leverage distribution for the model accepted by the suggested
Figure 4: Counterexample 2. The condition-number selection criterion fails to detect overfitting.
condition-number selection criterion is extremely far from complying with that condition, having a very large number of leverages that are close to 1. To summarize, counterexample 2 shows an example of the condition-number selection criterion failing to discard an overfitted model.

Figure 5: Leverage distributions of counterexample 1 (top) and counterexample 2 (bottom).

3.4.2 Counterexamples 3 and 4. The second set of experiments was performed as follows. One hundred neural networks were first trained with a large number of points (350) generated by the regression function 3.1, without added noise. Then the values of the parameters of those models were used as initial values for a retraining with only 35 points with added noise. Since the latter training starts with the initial parameters of excellent models, it may be expected that the resulting models are the best models that one can hope to obtain, given the limited number of points and the noise in the training set. Figure 6 shows a good model with five hidden neurons, whose characteristics are reported in the first row of Table 3. Despite its performance, the model is rejected by the condition-number selection criterion, since its condition number is larger, by one order of magnitude, than the rejection limit specified by its authors. Figure 7 shows the behavior of a model that also has five hidden neurons; its characteristics are summarized in the second row of Table 3. That model is much worse than counterexample 3: its VMSE is higher, while its TMSE is
Figure 6: Counterexample 3. The condition-number selection criterion discards a valid model.
smaller (a clear sign of overfitting), the distance between the model and the regression function is higher, and 20% of the leverages are larger than 0.95. $\mu$ is substantially smaller than for counterexample 3, and $\sigma_n$ is almost twice as large as that of counterexample 3. Nevertheless, the condition number is one order of magnitude below the rejection limit, so that the condition-number
Table 3: Relevant Data for Counterexamples 3 and 4.

                   TMSE    VMSE   μ      σ_n    Distance D   K           Leverages > 0.95
Counterexample 3   0.078   0.12   0.99   0.36   0.063        10⁹         2
Counterexample 4   0.068   0.13   0.95   0.62   0.079        1.7 × 10⁷   7

Notes: TMSE, VMSE, μ, σ_n, and D as defined in the text. K: Condition number of the Jacobian matrix. Last column: Number of leverages that are larger than 0.95.
Table 4: Relevant Data for Counterexamples 5 and 6.

                   TMSE    VMSE   μ      σ_n    Distance D   K            Leverages > 0.95
Counterexample 5   0.082   0.12   0.97   0.47   0.069        4.2 × 10¹⁴   1
Counterexample 6   0.068   0.14   0.95   0.62   0.091        1.3 × 10⁵    6

Notes: TMSE, VMSE, μ, σ_n, and D as defined in the text. K: Condition number of the Jacobian matrix. Last column: Number of leverages that are larger than 0.95.
selection criterion fails to detect that gross overfitting. Furthermore, the leverages computed by the traditional method and by our method agree within $10^{-11}$.

3.4.3 Counterexamples 5 and 6. In a final set of numerical experiments, the training set was constructed with a larger number of examples in the vicinity of the singular points, the total number of points being kept constant, equal to 35. Training was performed with random parameter initialization, as in counterexamples 1 and 2. Figure 8 shows the behavior of a model with five hidden neurons, whose characteristics are summarized in Table 4. As expected, no overfitting occurs in the vicinity of the minima of the function, but since the total number of points was kept constant, leverages become higher between the minima. Nevertheless, this is a very reasonable model given the training data. It is discarded by the condition-number selection criterion, since its condition number is larger than the rejection limit by six orders of magnitude. By contrast, Figure 9 shows a model with five hidden neurons, whose characteristics are summarized in the second row of Table 4. This is again an overfitted model, whose distance to the regression function, and VMSE, are much poorer than those of counterexample 5; nevertheless, it is accepted by the condition-number selection criterion.

3.5 Relevance of the Condition-Number Selection Criterion to Overfitting. In the previous section, we showed several examples of the condition-number selection criterion accepting overfitted models or discarding valid models. The above counterexamples are just a selection among many more
Figure 7: Counterexample 4. The condition-number selection criterion fails to discard an overfitted model.
similar counterexamples, so that it is natural to wonder how frequently such situations occur. More specifically, one can ask the following question: What is the probability that the parameters of a network with one input and five hidden neurons can be estimated reliably from 35 equally spaced points (counterexamples 1, 2, 3, and 4)? In order to gain some insight into that question, the following numerical experiment was performed: 10,000 different
Y. Oussar, G. Monari, and G. Dreyfus
Figure 8: Counterexample 5. The condition-number selection criterion discards a valid model.
neural networks with one input and five hidden neurons were generated, with random parameters, uniformly distributed with variance 10. For each model, the leverages of 35 equally spaced points between −1 and +1, and the normalized standard deviation σₙ of their distribution, were computed. The condition number K of the Jacobian matrix was also computed.
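The two quantities at the heart of this experiment can be computed directly from the Jacobian of the model with respect to its parameters. A minimal sketch (the QR route to the hat-matrix diagonal and the random toy Jacobian are our choices here, not the authors' code):

```python
import numpy as np

def leverages_and_condition(J):
    """Leverages and condition number of a Jacobian J (one row per
    training example, one column per model parameter).

    The leverage of example i is the i-th diagonal entry of the hat
    matrix J (J^T J)^{-1} J^T; a thin QR factorization J = QR gives it
    stably as the squared norm of row i of Q.
    """
    Q, _ = np.linalg.qr(J)
    h = np.sum(Q**2, axis=1)        # hat-matrix diagonal (leverages)
    K = np.linalg.cond(J)           # ratio of extreme singular values
    return h, K

# Toy full-rank Jacobian: 35 examples, 16 parameters (the parameter
# count of a one-input, five-hidden-neuron network).
rng = np.random.default_rng(0)
h, K = leverages_and_condition(rng.normal(size=(35, 16)))
print(round(h.sum()))   # → 16: leverages sum to the number of parameters
```

The normalized standard deviation σₙ plotted in Figure 10 is then a statistic of h; since the text does not spell out the normalization, we leave that step out of the sketch.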
Figure 9: Counterexample 6. The condition-number selection criterion fails to discard an overfitted model.
Figure 10 shows σₙ as a function of K: each network is shown as a dot; dots lying on the x-axis are networks whose Jacobian matrix is rank deficient. For models whose Jacobian matrix has full rank, no trend can be found in that graph: thus, the condition number has essentially nothing to do with the distribution of the leverages and, hence, is essentially irrelevant to overfitting. Any model located to the right of the vertical line would be
Figure 10: Normalized standard deviation of the distribution of the leverages of 35 equally spaced points, and Jacobian matrix condition number, for 1000 neural networks with five hidden neurons.
discarded by the condition-number selection criterion, despite the fact that some of them have excellent leverage distributions and, hence, are functions whose parameters can legitimately be estimated from data pertaining to 35 equally spaced points. Actually, the “best” network (the network with the most peaked leverage distribution, i.e., with the smallest σₙ) is discarded, whereas several poor networks (with large values of σₙ), including the network with the largest σₙ, are accepted. Similarly, no clear trend can be found in the graph of σₙ versus K when data are more abundant.

4 Conclusion

“Jacobian Conditioning Analysis for Model Validation” is essentially a comment on our article (Monari & Dreyfus, 2002). In this reply, we disproved all comments made in “Jacobian Conditioning Analysis.” The first conclusion of Rivals and Personnaz states that the condition number of the Jacobian matrix of the model should be used as a criterion for model validation: a model with a condition number larger than 10⁺⁸ should be discarded; they made the same statement in Rivals and Personnaz (2000).
We proved that statement wrong in two respects:

1. The models that are discarded by that criterion are models for which a particular type of confidence intervals, obtained by a specific estimation method, cannot be computed accurately. This does not mean that the model should be discarded. It means that the confidence interval estimation method should be discarded; that misconception may lead to discarding valid models.

2. Rivals and Personnaz’s comment on the accuracy required to compute the confidence intervals is erroneous, because they overlook the fact that the confidence intervals stem from a first-order Taylor expansion of the model output. Therefore, there is no point in computing the confidence intervals with an accuracy that is better than the accuracy of that first-order approximation. Therefore, the accuracy requested for the computation of the confidence interval is completely problem dependent: the “universal” criterion exhibited by the authors cannot be valid.

In addition to disproving, on theoretical grounds, the comments made by Rivals and Personnaz, we presented six counterexamples: three of them are instances of the authors’ selection criterion discarding valid models; the other three are instances of the authors’ criterion accepting overfitted models. In short, the condition-number selection criterion, advocated in the first conclusion of Rivals and Personnaz, is at best useless and very frequently leads to making wrong decisions. The claim that we “followed” that approach is unsupported.

In the second paragraph of their conclusion, Rivals and Personnaz state that the numerical method for computing the leverages, which was indicated in an appendix of our article (Monari & Dreyfus, 2002), is not accurate enough. We proved that their statement is unsupported.
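Point 2 can be made concrete: under the first-order Taylor (linearization) approach to nonlinear least squares, the standard error of the fitted value at example i is s√hᵢ, with hᵢ the leverage and s the residual standard deviation. A hedged sketch of that computation (our illustration, not the authors' estimator; we use a large-sample normal quantile in place of Student's t to stay dependency-free):

```python
import numpy as np
from statistics import NormalDist

def taylor_ci_halfwidths(J, residuals, level=0.95):
    """Approximate confidence half-widths for the fitted values of a
    nonlinear least-squares model, from the first-order Taylor
    expansion of the model output: half-width_i ~= z * s * sqrt(h_i)."""
    n, q = J.shape
    Q, _ = np.linalg.qr(J)
    h = np.sum(Q**2, axis=1)                      # leverages
    s = np.sqrt(np.sum(residuals**2) / (n - q))   # residual std. dev.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # large-sample quantile
    return z * s * np.sqrt(h)

# Toy data: 35 examples, 16 parameters, small residuals.
rng = np.random.default_rng(1)
J = rng.normal(size=(35, 16))
hw = taylor_ci_halfwidths(J, rng.normal(scale=0.1, size=35))
```

Since the interval is only as good as the linearization it rests on, demanding that the leverages entering it be computed to machine precision buys nothing, which is the point made above.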
They claim that their method can reach an accuracy of 10⁻¹⁶; however, they do not provide any example, in the context of machine learning, where such accuracy is required, that is, where the “noise” on the estimation of the leverages is on the order of 10⁻¹⁶. Conversely, we provided examples where the difference between the leverages of two models that are obtained at two successive iterations after convergence of the training algorithm (i.e., between two models that are considered “identical”) exceeds 10⁻¹⁶ by several orders of magnitude. Therefore, the accuracy of the method advocated by Rivals and Personnaz is irrelevant in such situations, and they did not provide any example where it might be relevant. In addition to disproving their point theoretically, we have shown that even in the very example that Rivals and Personnaz proffered, the differences between the leverages computed by the traditional method and those computed by our method are negligibly small.

In order to substantiate their claims, Rivals and Personnaz study the numerical example that we investigated in Monari and Dreyfus (2002). They
come exactly to the conclusion that we reached, so their example cannot be considered a convincing counterexample. However, they reach that conclusion by faulty reasoning and computing. We pointed to inconsistencies in their presentation, and we provided a proof that one of their numerical results is wrong by at least four orders of magnitude.

The third paragraph of the conclusion of “Jacobian Conditioning Analysis” is: “For candidates whose condition number is small enough and for which the leverages have been computed as accurately as possible according to equation 1.12, one may check additionally if none of the leverage values is close to one, as already proposed in Rivals and Personnaz (1998).” That statement is not acceptable for three reasons:

1. We showed that Rivals and Personnaz provide no evidence that using equation 1.12 is necessary or that our method for the computation of the leverages is inappropriate for model validation.

2. We showed that models with a “small enough” condition number, that is, a condition number below the limit stated by Rivals and Personnaz, may have several leverages close to 1 and, hence, exhibit strong overfitting (counterexamples 2, 4, and 6), and that, conversely, models with high condition numbers may have reasonable leverages and, hence, be acceptable (counterexamples 1, 3, and 5).

3. Rivals and Personnaz did not state in Rivals and Personnaz (1998) that leverages should be checked for values close to 1. They made a different suggestion: checking that the sum of the leverages is equal to the number of parameters, and that all leverages are smaller than 1. Actually, they did not realize the significance of leverages close to 1 before reading our article (Monari & Dreyfus, 2002), as evidenced by the fact that the very word leverage is used neither in Rivals and Personnaz (1998) nor in Rivals and Personnaz (2002).
By contrast, the last sentence of the conclusion of “Jacobian Conditioning Analysis” (“Leverage values close to, but not necessarily larger than, one are indeed the symptom of overfitted examples or of isolated examples at the border of the input domain delimited by the training set”) is unquestionable: it is actually the very central idea of our work (Monari & Dreyfus, 2002).

To summarize, in this letter, we disprove all comments of Rivals and Personnaz in “Jacobian Conditioning Analysis” on our previous article (Monari & Dreyfus, 2002).² Additionally, our replies invalidate a substantial part of the contents of other articles by the same authors on the same subject.
² Rivals and Personnaz add, after their conclusion, a section entitled “Other Comment for Monari and Dreyfus (2002).” In view of the fact that all the other claims of Rivals and Personnaz were disproved, we will not discuss that final comment.
References

Bates, D. M., & Watts, D. G. (1998). Nonlinear regression analysis and its applications. New York: Wiley.

Dreyfus, C., & Dreyfus, G. (2003). A machine learning approach to the estimation of the liquidus temperature of glass-forming oxide blends. Journal of Non-Crystalline Solids, 318, 63–78.

Hansen, L. K., & Larsen, J. (1996). Linear unlearning for cross-validation. Advances in Computational Mathematics, 5, 269–280.

Larsen, J., & Hansen, L. K. (2001). Comments for: Rivals I., Personnaz L., Construction of confidence intervals for neural networks based on least squares estimation. Neural Networks, 15, 141–142.

Monari, G. (1999). Sélection de modèles non linéaires par leave-one-out: étude théorique et application des réseaux de neurones au procédé de soudage par points. Thèse de l'Université Pierre et Marie Curie, Paris.

Monari, G., & Dreyfus, G. (2000). Withdrawing an example from the training set: An analytic estimation of its effect on a nonlinear parameterized model. Neurocomputing, 35, 195–201.

Monari, G., & Dreyfus, G. (2002). Local overfitting control via leverages. Neural Computation, 14, 1481–1506.

Rivals, I., & Personnaz, L. (1998). Construction of confidence intervals in neural modeling using a linear Taylor expansion. In J. A. Suykens & J. Vandewalle (Eds.), Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling. Leuven, Belgium: Katholieke Universiteit Leuven.

Rivals, I., & Personnaz, L. (2000). Construction of confidence intervals for neural networks based on least squares estimation. Neural Networks, 13, 463–484.

Rivals, I., & Personnaz, L. (2003). MLPs (mono-layer polynomials and multilayer perceptrons) for nonlinear modeling. Journal of Machine Learning Research, 3, 1383–1398.

Seber, G. A. F., & Wild, C. J. (1989). Nonlinear regression. New York: Wiley.

Sorensen, P. H., Norgard, M., Hansen, L. K., & Larsen, J. (1996). Cross-validation with LULOO. In S. I. Amari, L. Yu, L. W. Chan, I. King, & K. S. Leung (Eds.), Proceedings of the International Conference on Neural Information Processing—ICONIP’96 (pp. 1305–1310). New York: Springer-Verlag.

Tibshirani, R. J. (1996). A comparison of some error estimates for neural models. Neural Computation, 8, 152–163.

Received March 31, 2003; accepted July 10, 2003.
ARTICLE
Communicated by David Fitzpatrick
Geometrical Computations Explain Projection Patterns of Long-Range Horizontal Connections in Visual Cortex

Ohad Ben-Shahar
[email protected] Steven Zucker
[email protected] Department of Computer Science and the Interdepartmental Neuroscience Program, Yale University, New Haven, CT 06520, U.S.A.
Neurons in primary visual cortex respond selectively to oriented stimuli such as edges and lines. The long-range horizontal connections between them are thought to facilitate contour integration. While many physiological and psychophysical findings suggest that collinear or association field models of good continuation dictate particular projection patterns of horizontal connections to guide this integration process, significant evidence of interactions inconsistent with these hypotheses is accumulating. We first show that natural random variations around the collinear and association field models cannot account for these inconsistencies, a fact that motivates the search for more principled explanations. We then develop a model of long-range projection fields that formalizes good continuation based on differential geometry. The analysis implicates curvature(s) in a fundamental way, and the resulting model explains both consistent data and apparent outliers. It quantitatively predicts the (typically ignored) spread in projection distribution, its nonmonotonic variance, and the differences found among individual neurons. Surprisingly, and for the first time, this model also indicates that texture (and shading) continuation can serve as alternative and complementary functional explanations to contour integration. Because current anatomical data support both (curve and texture) integration models equally and because both are important computationally, new testable predictions are derived to allow their differentiation and identification.
1 Introduction

The receptive fields (RFs) of neurons in visual cortex characterize their response to patterns of light in the visual field. In primary visual cortex, this response is often selective for stimulus orientation in a small region (Hubel & Wiesel, 1977). The clustered long-range horizontal connections between such cells (Rockland & Lund, 1982) link those with nonoverlapping RFs and

Neural Computation 16, 445–476 (2004)
© 2004 Massachusetts Institute of Technology
O. Ben-Shahar and S. Zucker
are thought to facilitate contour integration (Field, Hayes, & Hess, 1993). However, there is no direct physiological evidence that these connections only support curve integration, while there also remains much ambiguity about the precise connections required to support the integration of curves. Our goal in this article is to address both of these concerns.

1.1 Biological Data and Integration Models. The argument that associates long-range horizontal connections with curve integration begins with the realization that the finite spatial extent of RFs and their broad orientation tuning lead to significant uncertainties in the position and the local orientation measured from visual stimuli. This causes a further uncertainty in determining which of the many nearby RFs signal the next section of a curve (see Figure 1a). All of these uncertainties underlying curve integration can be reduced by interactions between neurons whose RFs are close in retinotopic coordinates. Starting with Mitchison and Crick (1982) and their hypothesis about interactions between iso-oriented RFs, physiological and anatomical findings have been accumulating to suggest a roughly collinear interaction. The main evidence supporting this conclusion is based on the distribution of angular differences between preferred orientations of connected cells. These distributions are computed by taking the orientation difference between a target cell and every other cell it is connected to with a long-range horizontal connection. Indeed, as is exemplified in Figure 1b, these distributions have been shown to be unimodal on average, with maximal interaction between iso-oriented RFs (Ts’o, Gilbert, & Wiesel, 1986; Gilbert & Wiesel, 1989; Weliky, Kandler, Fitzpatrick, & Katz, 1995; Schmidt, Goebel, Löwel, & Singer, 1997; Buzás, Eysel, & Kisvárday, 1998; Bosking, Zhang, Schofield, & Fitzpatrick, 1997; Malach, Amir, Harel, & Grinvald, 1993; Sincich & Blasdel, 2001; Schmidt & Löwel, 2002).
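The statistic described here, the histogram of preferred-orientation differences between a target cell and the cells it connects to, is straightforward to compute. A sketch under our own conventions (the bin width and the folding of differences into the 180°-periodic orientation domain are our choices, not taken from any of the cited studies):

```python
import numpy as np

def orientation_difference_distribution(target_deg, connected_deg, bin_deg=20):
    """Percentage of long-range connections per orientation-difference
    bin, for one target cell and the preferred orientations of the
    cells it connects to. Orientation has period 180 degrees, so
    differences are folded into [-90, 90)."""
    d = np.asarray(connected_deg, float) - float(target_deg)
    d = (d + 90.0) % 180.0 - 90.0                 # fold into [-90, 90)
    counts, edges = np.histogram(d, bins=np.arange(-90, 91, bin_deg))
    return edges, 100.0 * counts / counts.sum()   # percentage per bin

# Tiny example: a 0-degree target connected mostly to iso-oriented cells
# (171 degrees folds to -9 degrees, i.e., nearly iso-oriented).
edges, pct = orientation_difference_distribution(
    0.0, [2, -5, 10, 171, -8, 85, 3])
```

Pooling such per-cell histograms across many injection sites, as in Figure 1b, is then just an average over cells.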
Furthermore, direct anatomical studies reveal long-range interactions between coaxial cells (Bosking et al., 1997; Schmidt et al., 1997), and indirect psychophysical experiments report a general association field (Field et al., 1993; Kapadia, Ito, Gilbert, & Westheimer, 1995; Kapadia, Westheimer, & Gilbert, 2000), which emphasizes straight or slowly varying continuations while allowing some support for more rapidly varying continuations as well (see Figure 2a).

With the accumulation of these data, however, are a growing number of observations that are difficult to reconcile with the intuition that neural spatial integration is based on collinearity or that it serves only curve integration. Facilitatory interaction between cells of significant orientation difference (Kapadia et al., 1995), short-range coaxial inhibition (Polat & Sagi, 1993), iso-orientation side facilitation (Adini, Sagi, & Tsodyks, 1997), and strong correlations between iso-oriented, nonoverlapping, and parallel receptive fields (Ts’o et al., 1986) are functionally inconsistent. Evidence of cross-orientation (Matsubara, Cynader, Swindale, & Stryker, 1985; Kisvárday, Tóth, Rausch, & Eysel, 1997) and nonaxial (Gilbert & Wiesel, 1989) connections, plus roughly
Figure 1: Visual integration and the distribution of long-range projections. (a) Broad tuning in orientation and position introduces uncertainty in curve integration even if a single curve model (thick red curve) is assumed through the RF. Determining which nearby RF the curve continues through can be facilitated by interaction between neurons with mutually aligned, retinotopically close RFs. (b) A fundamental measurable property of long-range connections is their distribution in the orientation domain, that is, the percentage of connections between interconnected neurons as a function of preferred orientation (angular) difference. This graph shows the median distribution of lateral connections (distance > 500 µm) of seven cell clusters in primary visual cortex of tree shrew (redrawn from Bosking et al., 1997, their Fig. 6c). Qualitatively similar (though coarser) measurements are available on primates as well (Malach et al., 1993). (c) Connectivity distribution of individual cell clusters reveals significant variability and qualitative differences between them. Shown here are distributions from two injection sites from Bosking et al. (1997).
Figure 2: Collinear facilitation, association fields, and their predicted distribution of connections. (a) Informally, two visual integration or continuation models are typically considered in the physiological and psychophysical literatures. Collinearity, the predominant model, predicts only a few possible curve continuations (top). On the other hand, many possible continuations reveal an association field (bottom), similar to those observed psychophysically (Field et al., 1993). (b) The corresponding distributions derived from the collinearity and association field models. Observe that collinearity predicts a very narrow distribution, which is clearly at odds with the significant spread frequently measured anatomically or electrophysiologically (compare to Figure 1b). The association field leads to a wider spread, but like collinearity, it predicts a fixed distribution for all cells, a hypothesis refuted in recent studies (see the text). The collinearity distribution (solid) was calculated from the field depicted in Figure 8a, while the association field distribution (dashed) was calculated from the field in Figure 8e. The dashed horizontal line depicts the uniform distribution.
isotropic retinotopic extent (Malach et al., 1993; Sincich & Blasdel, 2001), suggest anatomical inconsistencies.

These inconsistencies prompt a closer examination of the interactions within visual cortex and their population statistics. As the evidence suggests, individual cells, or small collections of adjacent cells captured in tracer injections, may have qualitatively different connectivity distributions (Bosking et al., 1997): some are narrow and high while others are very wide, as is illustrated in Figure 1c. When averaged, the pooled distribution of long-range connections (e.g., those extending beyond 500 µm in Bosking et al., 1997) is (see Figure 3a):

• Unimodal.
• Peaks at zero orientation offset.
• Indicates a nonnegligible fraction of connections linking cells of significantly different orientation preferences (Malach et al., 1993; Kisvárday, Kim, Eysel, & Bonhoeffer, 1994; Kisvárday et al., 1997; Bosking et al., 1997).

• Crosses the uniform distribution at approximately ±40 degrees.¹
• Has a nonmonotonically changing variance as the orientation difference increases (Malach et al., 1993; Bosking et al., 1997).

Neither collinearity nor association field models predict all of these features. While both models imply unimodal pooled distributions over orientation differences (see Figure 2b), they also suggest a fixed projection field, and thus neither predicts any variance for the pooled distribution, let alone a nonmonotonic one. Furthermore, collinearity is clearly at odds with the significant spread in the distribution of connections over orientation differences, whether it is measured via extracellular injections (e.g., Bosking et al., 1997) or the more elaborate intracellular protocol (Buzás et al., 1998).

The data in Bosking et al. (1997) contain one injection site of possibly different connection distribution, which may substantially contribute to the nonmonotonic nature of the variance. Since the variance will become central to this article, we examined whether this statistical feature depends critically on this one, possibly outlier, measurement. We reanalyzed the data from Bosking et al. (1997) after removing the data from this injection site and calculating the statistical properties of the rest. We further examined the robustness of the nonmonotonicity by running two additional analyses: one in which we removed the sample points (one from each orientation bin) that contribute the most to the variance, and another in which we removed those sample points (again, one from each bin) that maximized local changes in the variance. In all these tests, including the last one, which flattens the variance the most, the trimodal nonmonotonicity, and the two local minima at ±30 degrees, were preserved. All these findings suggest that the nonmonotonicity of the variance is a critical feature that deserves attention from both biologists and modelers.

1.2 Integration Models and Random Physiological Variations.
It is tempting to explain the apparent anomalies and inconsistencies between the predicted and measured distributions of long-range horizontal connections as random physiological variations, for example, by asserting that anatomy only approximates the correct connections. We tested this explanation by applying different noise models to the collinearity and association field connectivity distributions from Figure 2 and checked whether the resultant pooled distributions possess the properties listed above. The results of the most natural noise model are illustrated in Figure 3b. Under this model, each long-range horizontal connection, ideally designated to connect cells of orientation difference Δθ, is shifted to connect cells of orientation difference Δθ + ε_σ, where ε_σ is a wrapped gaussian (i.e., normally distributed and wrapped on S¹) random variable with zero mean and variance σ (see the appendix for details). As the figure shows, it takes an overwhelming amount of noise (s.d. ≥ 35 degrees) to transform the collinear distribution to one that resembles the measured data in terms of spread and peak height, but the nonmonotonic behavior of the variance is never reproduced. (For space considerations, we omit the results of other connection-based noise models, or the noisy distributions based on the association field model, all of which were even less reminiscent of the measured physiological data.)

A second possible source for the inconsistencies between the predicted and measured distributions may be the extracellular injection protocol commonly in use by physiologists to trace long-range horizontal connections (e.g., Gilbert & Wiesel, 1989; Malach et al., 1993; Kisvárday et al., 1994, 1997; Bosking et al., 1997; Schmidt et al., 1997; Sincich & Blasdel, 2001). Due to the site-selection procedure used, cells stained by these injections are likely to have similar orientation preferences (e.g., Bosking et al., 1997, p. 2113, or Schmidt et al., 1997, p. 1084). However, their orientation tuning may nevertheless be different, sometimes significantly (note such a cell in Bosking et al., 1997, Fig. 4B). Consequently, the distribution of presynaptic terminals (boutons) traced from the injection site may incorporate an artificial, random spread relative to the single orientation typically assumed at the injection site.

¹ This crossing point provides a reference for the bias of projection patterns toward particular orientations; considering the offsets where the connection distribution crosses the uniform line quantifies this bias in a way independent of scale or quantization level.
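The connection-level noise model is easy to reproduce. A sketch under our own conventions (orientation differences folded into [−90°, 90°), i.e., period 180°; the strictly collinear base distribution and the sample size are arbitrary choices for illustration):

```python
import numpy as np

def perturb_connections(base_offsets_deg, sd_deg, n=10000, seed=0):
    """Shift each ideal connection offset by wrapped gaussian noise:
    an offset dtheta becomes dtheta + eps, with eps ~ N(0, sd^2)
    wrapped onto the circular orientation-difference domain."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base_offsets_deg, float)
    ideal = base[rng.integers(len(base), size=n)]
    noisy = ideal + rng.normal(0.0, sd_deg, size=n)
    return (noisy + 90.0) % 180.0 - 90.0      # wrap into [-90, 90)

# Strictly collinear base model: every connection at zero offset.
offsets = perturb_connections([0.0], sd_deg=35.0)
counts, edges = np.histogram(offsets, bins=np.arange(-90, 91, 10))
```

Even at s.d. = 35°, the perturbed histogram stays unimodal and symmetric about zero offset; the model provides no mechanism for the nonmonotonic variance discussed above.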
Preliminary evidence from a recently developed single-cell protocol (Buzás et al., 1998) suggests that leakage in the injection site cannot bridge the gap between the predicted collinear distribution and those measured anatomically. However, we also examined this possibility computationally by modeling the leakage in the injection site as a wrapped gaussian random variable of predefined variance.² The base distributions (collinear or association field) of the computational cells selected by this process were then summed up and normalized, and the resultant (random) distribution was attributed to the original cell representing the injection site. Repeating this process many times yielded a collection of (different) distributions, for which we calculated an average and variance (see the appendix for details). The results are illustrated in Figure 3c. Similar to random variations at the level of individual connections, here too it takes an overwhelming amount of noise (s.d. ≥ 35 degrees) to transform the collinear distribution to one that resembles the measured data in terms of spread and peak height, but the nonmonotonic behavior of the variance is never reproduced.
² A wrapped gaussian model was particularly suitable here due to the injection site selection protocol typically used in the extracellular injection protocol; see the appendix.
Figure 3: Results of a statistical perturbation of the collinear connectivity distribution. (a) Mean connection distribution computed from the data in Bosking et al. (1997), shown here for reference. Error bars are ±1 standard deviation. Note the unimodal distribution that peaks at approximately 11%, the wide spread, the crossing of the uniform distribution (dashed horizontal line) around ±40 degrees, and the nonmonotonic variance. Can all these features be replicated by applying noise to the base distribution induced by the standard collinearity model? (b) Result of simulating physiological deviation at the individual connection level. The dashed line is the base collinear distribution. The gray region is the superposition of individual applications of the noise model to the base distribution. The solid graph is the expected distribution, and error bars are ±1 standard deviation. Permitting large enough developmental variations (shown here is the result of wrapped gaussian independent and identically distributed noise of s.d. = 35°) in the connections to model the first-order statistics significantly violates the underlying connectivity principle of good continuation but still cannot model the second-order statistics. (c) Results of simulating measurement errors due to leakage in the injection site. All parts are coded as in b. Again, permitting large enough injection spread to model the first-order statistics (shown here is the result of gaussian noise of s.d. = 35° and assuming 20 cells per injection site; Bosking et al., 1997) cannot model the second-order statistics.
The thinking around long-range horizontal connections has been dominated by their first-order statistics and its peak at zero orientation offset. However, the nonmonotonicity of the variance was first reported almost a decade ago (Fig. 3d in Malach et al., 1993), and we have further confirmed it from the more detailed measurements in Bosking et al. (1997), as was illustrated in Figure 3a. Since neither collinearity nor association field models can explain this aspect of the physiological data, even if much noise is allowed, it is necessary to consider whether this and the other subtle properties of the pooled data reflect genuine functional properties of long-range horizontal connections. We therefore developed a geometric model of projection patterns and examined quantitatively both pooled connection statistics and connectivity patterns of individual cells generated by it. Since many findings suggest that long-range horizontal connections are primarily excitatory, especially those extending beyond one hypercolumn (Ts’o et al., 1986; Gilbert & Wiesel, 1989; Kapadia et al., 1995; Kisvárday et al., 1997; Buzás et al., 1998; Sincich & Blasdel, 2001), our model concentrates on this class of connections.

2 From Differential Geometry to Integration Models

Curve integration, the hypothesized functional role ascribed to long-range horizontal connections, is naturally based in differential geometry. The tangent, or the local linear approximation to a curve, abstracts orientation preference, and the collection of all possible tangents at each (retinotopic) position can be identified with the orientation hypercolumn (Hubel & Wiesel, 1977). Formally, since position takes values in the plane R² (think of image coordinates x, y) and orientation in the circle S¹ (think of an angle θ varying between 0 and 2π), the primary visual cortex can be abstracted as the product space R² × S¹ (see Figure 4).
Points in this space represent both position and orientation to abstract visual edges of given orientation at a particular spatial (i.e., retinotopic) position. It is in this space that our modeling takes place. Since any single tangent is the limit of any smooth curve passing through a given (retinotopic) point in a given direction, the question of curve integration becomes one of determining how two tangents at nearby positions are related. (Collinearity, for example, asserts that the tangent orientation hardly changes for small displacements along the curve.) In general terms, the angular difference between RFs captures only one part of the relationship between nearby tangents; their relative spatial offset also must be considered. Thus, in the mathematical abstraction, relationships between tangents correspond to relationships between points in R² × S¹. Physiologically, these relationships are carried by the long-range horizontal connections, with variation in retinotopic position corresponding to R², and variation along orientation hypercolumns corresponding to S¹ (see Figure 5). Determining them amounts, in mathematical terms, to determining what is called a connection structure. As we discuss in the rest of this article, the relationship between these two types of connections, the mathematical and the physiological, is more than linguistic.

Figure 4: Abstracting the primary visual cortex as R² × S¹, or position × orientation space. (a) The “ice cube” cartoon of visual cortex (Hubel & Wiesel, 1977) (cytochrome-oxidase blobs and distortions due to cortical magnification factor are not shown). A tangential penetration in the superficial layers reveals an orientation hypercolumn of cells whose RFs have similar spatial (retinotopic) coordinates. With cells of similar orientation tuning grouped by color, the hypercolumn is cartooned as a horizontal cylinder. (b) With ocular dominance columns omitted, the superficial layers of the primary visual cortex can be viewed as a collection of (horizontally arranged) orientation hypercolumns. (c) Drawing the cylinders vertically emphasizes that RFs of cells within a column overlap in retinotopic coordinates (x, y) and makes explicit this aspect of their organization. (d) Since different hypercolumns correspond to different retinotopic positions, the set of all hypercolumns abstracts the visible subspace of R² × S¹, with each column corresponding to a different vertical fiber in that space. The θ axis in this space corresponds to a tangential penetration with V1 hypercolumns (colors within the column represent different orientation tunings), and the XY plane corresponds to retinotopic coordinates.

2.1 The Geometry of Orientation in the Retinal Plane. Orientation in the 2D (retinal) plane is best represented as a unit length tangent vector $\hat{E}(\vec{q})$ attached to a point of interest $\vec{q} = (x, y) \in \mathbb{R}^2$. Having such a tangent vector attached to every point of an object of interest (e.g., a smooth curve or oriented texture) results in a unit length vector field (O’Neill, 1966). Assuming good continuation (Wertheimer, 1955), a small translation $\vec{V}$ from the point $\vec{q}$ results in a small change (i.e., rotation) in the vector $\hat{E}(\vec{q})$. To apply
techniques from differential geometry, a suitable coordinate frame {Ê_T, Ê_N} is placed at the point q, and the basis vector Ê_T is identified with Ê(q), the tangent vector at q (see Figure 6). Note that Ê_T is drawn at an angle θ (the local orientation measured relative to the horizontal axis in retinotopic coordinates), such that (q, θ) ∈ R² × S¹. Nearby tangents are displaced in both position and orientation according to the covariant derivatives of the underlying pattern. These covariant derivatives, ∇_V Ê_T and ∇_V Ê_N, are naturally represented as vectors in the basis {Ê_T, Ê_N} itself:

\[
\begin{pmatrix} \nabla_{\vec V}\hat E_T \\ \nabla_{\vec V}\hat E_N \end{pmatrix}
=
\begin{pmatrix} w_{11}(\vec V) & w_{12}(\vec V) \\ w_{21}(\vec V) & w_{22}(\vec V) \end{pmatrix}
\begin{pmatrix} \hat E_T \\ \hat E_N \end{pmatrix} .
\qquad (2.1)
\]
The coefficients w_ij(V), known as 1-forms, are functions of the displacement direction vector V, and since the basis {Ê_T, Ê_N} is orthonormal, they are skew-symmetric: w_ij(V) = −w_ji(V). Thus, w_11(V) = w_22(V) = 0, and the system reduces to

\[
\begin{pmatrix} \nabla_{\vec V}\hat E_T \\ \nabla_{\vec V}\hat E_N \end{pmatrix}
=
\begin{pmatrix} 0 & w_{12}(\vec V) \\ -w_{12}(\vec V) & 0 \end{pmatrix}
\begin{pmatrix} \hat E_T \\ \hat E_N \end{pmatrix} .
\qquad (2.2)
\]
Figure 5: Facing page. Abstracting long-range horizontal connections as relationships between points in R² × S¹. (a) Since visual integration must involve not only the relative orientation between RFs but their spatial offset as well, it is more fully abstracted by relationships between points in R² × S¹. The exact nature of these relationships is determined by the underlying integration model. (b) Redrawing R² × S¹ fibers as orientation hypercolumns in V1 reveals the connection between the integration model in R² × S¹ and the distribution of long-range horizontal connections between the hypercolumns. (c) Collapsing the R² × S¹ abstraction to a cortical orientation map (i.e., flattening each orientation cylinder and redistributing its orientation-selective parts as orientation columns in the superficial cortical layers), the integration model implies a particular set of long-range horizontal connections between orientation domains (colors represent orientation tuning similar to panels a and b and Figure 4). Such links have been identified and measured through optical imaging and anatomical tracing (e.g., Malach et al., 1993; Bosking et al., 1997; Buzás et al., 1998) and thus can be compared to the model's predictions. (d) A real counterpart to the schematic in panel c. Reproduced from Bosking et al. (1997), this image shows an optical image of intrinsic signals combined with long-range horizontal connections traced through extracellular injection of biocytin. The white dots at the upper left corner represent the injection site, while the black dots represent labeled boutons. The white bar in the inset represents the orientation preference at the injection site.
This last system is known as Cartan's connection equation (O'Neill, 1966), and w_12(V) is called the connection form. Since w_12(V) is linear in V, it can be represented in terms of {Ê_T, Ê_N}:

\[
w_{12}(\vec V) = w_{12}(a\,\hat E_T + b\,\hat E_N) = a\,w_{12}(\hat E_T) + b\,w_{12}(\hat E_N) .
\]

The relationship between nearby tangents is thus governed by two scalars at each point. We define them as follows,

\[
\kappa_T \triangleq w_{12}(\hat E_T) \qquad \kappa_N \triangleq w_{12}(\hat E_N) ,
\qquad (2.3)
\]
and interpret them as tangential (κ_T) and normal (κ_N) curvatures, since they represent a directional rate of change of orientation in the tangential and normal directions, respectively. While the connection equation describes the local behavior of orientation for the general 2D case, it is equally useful for the 1D case of curves. Now, only ∇_{Ê_T} is relevant, and equation 2.2 simplifies to

\[
\begin{pmatrix} \nabla_{\hat E_T}\hat E_T \\ \nabla_{\hat E_T}\hat E_N \end{pmatrix}
=
\begin{pmatrix} 0 & w_{12}(\hat E_T) \\ -w_{12}(\hat E_T) & 0 \end{pmatrix}
\begin{pmatrix} \hat E_T \\ \hat E_N \end{pmatrix} .
\qquad (2.4)
\]
In its more familiar form, where T, N, and κ replace Ê_T, Ê_N, and κ_T, respectively, this is the classical Frenet equation (O'Neill, 1966) (primes denote derivatives with respect to arc length):

\[
\begin{pmatrix} T' \\ N' \end{pmatrix}
=
\begin{pmatrix} 0 & \kappa \\ -\kappa & 0 \end{pmatrix}
\begin{pmatrix} T \\ N \end{pmatrix} .
\qquad (2.5)
\]
2.2 Integration Models and Projection Patterns of Horizontal Connections. The geometrical analysis discussed above and illustrated in Figure 6 shows how the relationship between nearby tangents depends on the covariant derivative: for curves, the connection is dictated by one curvature; for texture flows, or oriented 2D patterns, two curvatures are required. By estimating these quantities at a given retinal point q, it is possible to approximate the underlying geometrical object, and thus a coherent distribution of tangents, around q. This, in turn, can be used to model the set of horizontal connections that are required to facilitate the response of a cell whose RF is embedded in a visual context that reflects good continuation. Naturally, to describe such a local approximation and to use it for building projection patterns, the appropriate domain of integration must be determined. However, since RF measurements provide only the tangent, and possibly curvature (Dobbins, Zucker, & Cynader, 1987; Versavel, Orban, & Lagae, 1990), but not whether the stimulus pattern is a curve (1D) or a texture (2D), it is necessary to consider continuations for both.
Figure 6: Visual integration under good continuation involves the question of how a measurement of orientation at one retinal position relates to another measurement of orientation at a nearby retinal position. Formally, this amounts to specifying how a tangent (orientation measurement) at position q relates to another nearby tangent displaced by a vector V. This tangent displacement amounts to rotation, and as shown above, this rotation can differ for different displacements. Formally, the rotation is specified locally by the covariant derivative ∇_V, and the mathematical analysis is facilitated by defining an appropriate coordinate frame. Shown is the Frenet basis {Ê_T, Ê_N}, where Ê_T corresponds to a unit vector in the orientation's tangential direction and Ê_N corresponds to a unit vector in the normal direction. Associated with this frame is an angle θ defined relative to an external fixed coordinate frame (the black horizontal line). The covariant derivative specifies the frame's initial rate of rotation for any direction vector V. The four different cases in this figure illustrate how this rotation depends on V, both quantitatively (i.e., different magnitudes of rotation) and qualitatively (i.e., clockwise, counterclockwise, or zero rotation). Since displacement is a 2D vector and ∇_V is linear, two numbers are required to fully specify the covariant derivative. These two numbers describe the initial rate of rotation in two independent displacement directions. Using the Frenet basis once again, two natural directions emerge. A pure displacement in the tangential direction (Ê_T) specifies one rotation component, and a pure displacement in the normal direction (Ê_N) specifies the other component. We call them the tangential curvature (κ_T) and the normal curvature (κ_N), respectively. If visual integration based on good continuation relates to 2D patterns of orientation, then both of these curvatures are required. For good continuation along individual curves, only the tangential curvature is required, since displacement is possible only in the tangential direction (that is, along the curve only).
Since estimates of curvature at a point q hold in a neighborhood containing the tangent, the discrete continuation for a curve is commonly obtained by approximating it locally by its osculating circle (do Carmo, 1976) and quantizing. This relationship, which is based on the constancy of curvature around q, is known as co-circularity (Parent & Zucker, 1989; Zucker, Dobbins, & Iverson, 1989; Sigman, Cecchi, Gilbert, & Magnasco, 2001; Geisler, Perry, Super, & Gallogly, 2001), and in R² × S¹ it takes the form of a helix (see Figures 7a and 7b). Different estimates of curvature give rise to different helices whose points define both the spatial position and the local orientation of nearby RFs that are compatible with the estimate at q (see Figure 7c). Together, these compatible cells induce a curvature-based field of long-range horizontal connections (see Figures 7a through 7c and 8a through 8d). While different curvatures induce different projection fields, the "sum" over curvatures gives an association field (see Figure 8e) reminiscent of recent psychophysical findings (Field et al., 1993). Note, however, that as a psychophysical entity, the association field is not necessarily a one-to-one reflection of connectivity patterns in the visual cortex. In fact, representing a "cognitive union" across displays of different continuations, the association field is unlikely to characterize any single cell. Similar considerations can be applied toward the local approximation of
texture flows, although now the construction of a rigorous local model is slightly more challenging. Unlike curves, this model must depend on the estimates of two curvatures at the point q, K_T = κ_T(q) and K_N = κ_N(q), but more important, these estimates cannot be held constant in the neighborhood of q, however small; they must covary for the pattern to be feasible (Ben-Shahar & Zucker, 2003b). Nevertheless, invariances between the curvatures do exist, and formal considerations of good continuation have been shown to yield a unique approximation that, in R² × S¹, takes the form of a right helicoid (see Figures 7c and 7d) and whose orientation function has the following expression:

\[
\theta(x, y) = \tan^{-1}\!\left( \frac{K_T\, x + K_N\, y}{1 + K_N\, x - K_T\, y} \right) .
\qquad (2.6)
\]

Figure 7: Facing page. Differential geometry, integration models, and horizontal connections between RFs. (a) Estimates of the tangent (light blue vector) and curvature at a point q permit modeling a curve with the osculating circle as a good-continuation approximation in its neighborhood. Given the approximation, compatible (green) and incompatible (pink) tangents at nearby locations can be explicitly derived. (b) With height representing orientation (see the scale along the θ-axis), the osculating circle lifts to a helix in R² × S¹ whose points define both the spatial location and orientation of compatible nearby tangents. Color-coded as in a, the green point is compatible with the blue one, while the pink points are incompatible with it. (c) The consistent structure in a and b illustrated as RFs and their spatial arrangement. As an abstraction for visual integration, the ideal geometrical model, the osculating circle, induces a discrete set of RFs, which can facilitate the response of the central cell. Shown here is an example for one particular curvature tuning at the central cell. (d) For textures, determination of good continuation requires two curvatures at a point. Based on these curvatures, a local model of good continuation can determine the position, orientation, and curvatures of (spatially) nearby coherent points. Given these two curvatures at a point, there exists a unique model of good continuation that guarantees identical covariation of the curvature functions. Given the approximation, compatible (green) and incompatible (pink) flow patches at nearby locations can be explicitly derived. (e) In R² × S¹, our model for 2D orientation good continuation lifts to a right helicoid, whose points define both the spatial location and orientation of compatible (green) nearby flow tangents. (f) As an abstraction for visual integration, the ideal geometric model, the right helicoid, induces a discrete set of RFs, which can facilitate the response of the central cell. Shown here is an example for one particular curvature tuning at the central cell. Note that broad RF tuning means that both the helix and the helicoid must be dilated appropriately, thus resulting in compatible "volumes" in R² × S¹ and possibly multiple compatible orientations at given spatial positions. This dilation should be reflected in the set of compatible RFs and the horizontal links to them, but to avoid clutter, we omit it from this figure. The effect of this dilation is illustrated in Figure 8 and consequently in all our calculations.
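Both lifts can be checked in a few lines. The sketch below (our own illustration; function names and step sizes are ours) implements the helicoid orientation function of equation 2.6 and verifies by finite differences that its curvatures at the origin are exactly (K_T, K_N); it also lifts the osculating circle of the co-circularity construction and verifies the classical property that, for a horizontal tangent at the origin, the compatible orientation equals twice the angle of the displacement vector.

```python
import math

def helicoid_theta(x, y, kT, kN):
    """Orientation function of the right helicoid (equation 2.6)."""
    return math.atan2(kT * x + kN * y, 1.0 + kN * x - kT * y)

def curvatures_at_origin(kT, kN, h=1e-6):
    """Central-difference directional derivatives of theta at the origin.
    At (0, 0), theta = 0, so E_T = (1, 0) and E_N = (0, 1)."""
    dT = (helicoid_theta(h, 0, kT, kN) - helicoid_theta(-h, 0, kT, kN)) / (2 * h)
    dN = (helicoid_theta(0, h, kT, kN) - helicoid_theta(0, -h, kT, kN)) / (2 * h)
    return dT, dN

def cocircular_lift(kappa, s):
    """1D analog: the osculating circle of curvature kappa through the origin
    (horizontal tangent), lifted to R^2 x S^1 at arc length s."""
    if abs(kappa) < 1e-12:                       # straight-line limit
        return s, 0.0, 0.0
    x = math.sin(kappa * s) / kappa
    y = (1.0 - math.cos(kappa * s)) / kappa
    return x, y, kappa * s                       # (position, orientation)

# The helicoid's curvatures at the origin reproduce (K_T, K_N) ...
dT, dN = curvatures_at_origin(0.1, 0.2)
print(abs(dT - 0.1) < 1e-6, abs(dN - 0.2) < 1e-6)   # -> True True

# ... and the helix satisfies co-circularity: compatible orientation
# equals twice the angle of the displacement vector.
x, y, theta = cocircular_lift(0.2, 1.0)
print(abs(theta - 2 * math.atan2(y, x)) < 1e-9)     # -> True
```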
The unique property of this object is that it induces an identical covariation of the two curvature functions κ_T and κ_N in the neighborhood of the point q. The osculating helicoid is the formal 2D analog of the osculating circle, and, as with co-circularity for curves, the fields of connections between neurons that this model generates (see Figures 8f through 8j) depend intrinsically on curvature(s). Such connectivity structures can be used to compute coherent texture and shading flows in a neural, distributed fashion (Ben-Shahar & Zucker, 2003b). Two examples are shown in Figure 9.

3 Results

The computational connection fields generated above contain all the geometrical information needed for predictions about long-range horizontal connections of individual cells (or, after some averaging, of tracer injection sites) in visual cortex. Thus, we now turn to the central question: How well do these connectivity maps match the available data about projection fields in visual cortex? In particular, do they make better predictions than those arising from collinearity or association field models? To answer these questions, we focused on anatomical studies that report population statistics (Malach et al., 1993; Bosking et al., 1997) and compared their data to predictions produced by performing "computational anatomy" on our model.³ We randomly sampled the population of model-generated fields analogously to the way anatomists sample cells, or injection sites, in neural tissue and computed both individual and population statistics of their connection distributions. To generate robust predictions, we repeated these sampling procedures many times and calculated the expected values and standard errors of the frequency distribution.

3.1 Computational Anatomy Predicts Biological Measurements. Figure 10 illustrates the main results computed from our models and compares them to the corresponding anatomical data reported in the literature (Malach et al., 1993; Bosking et al., 1997).
The agreement between the computational predictions and the biological data is striking, both qualitatively and quantitatively. As with the association field, our model correctly predicts the spread of the pooled distribution with similar peak height (approximately
3 Anatomical studies such as Bosking et al. (1997) and Malach et al. (1993) were preferred to psychophysical or electrophysiological studies, which typically contribute no population statistics and are generally more difficult to interpret directly in terms of the structure of horizontal connections.
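The "computational anatomy" procedure lends itself to a compact sketch. The code below is a toy version under our own assumptions (a co-circularity-style connection field on a small grid, five uniformly sampled curvature tunings, 10-degree orientation bins); it is not the authors' implementation, but it reproduces the qualitative signature discussed in the text: the pooled orientation-difference distribution peaks at zero.

```python
import math, random

BIN = 10  # orientation resolution in degrees, matching Bosking et al. (1997)

def field_histogram(kappa, radius=5):
    """Orientation-difference histogram of a toy co-circularity connection
    field for a cell tuned to curvature kappa. The grid size and the signed
    arc-length proxy are illustrative assumptions, not the article's model."""
    counts = {d: 0 for d in range(-90, 90, BIN)}
    n = 0
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if dx == 0 and dy == 0:
                continue
            s = math.hypot(dx, dy) * (1 if dx >= 0 else -1)  # signed arc-length proxy
            dtheta = math.degrees(kappa * s)                  # co-circular offset
            dtheta = (dtheta + 90) % 180 - 90                 # orientations are mod 180
            counts[int(dtheta // BIN) * BIN] += 1
            n += 1
    return {d: c / n for d, c in counts.items()}

def expected_stats(n_cells=7, n_reps=100, curvatures=(-0.2, -0.1, 0.0, 0.1, 0.2)):
    """Expected per-bin connection frequency over repeated samples of
    n_cells model fields with uniformly drawn curvature tunings."""
    means = {d: 0.0 for d in range(-90, 90, BIN)}
    for _ in range(n_reps):
        cells = [field_histogram(random.choice(curvatures)) for _ in range(n_cells)]
        for d in means:
            means[d] += sum(c[d] for c in cells) / n_cells / n_reps
    return means

random.seed(0)
stats = expected_stats()
print(max(stats, key=stats.get))  # the pooled distribution peaks at the 0-degree bin
```

Per-bin standard errors over the repetitions would be computed the same way, which is how the nonmonotonic variance discussed below is obtained in our actual computations.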
Figure 8: Illustration of connection fields for curves (top, based on co-circularity, Parent & Zucker, 1989) and textures (bottom, based on the right helicoidal model, Ben-Shahar & Zucker, 2003b). Each position in these fields represents one orientation hypercolumn, while individual bars represent the orientation preference of single neurons, all of which are connected to the central cell in each field. Multiple bars at any given point represent multiple neurons in the same hypercolumn that are connected to the central cell, a result of the dilation of the compatible structure due to broad RF tuning (see the caption of Figure 7). All fields assume that orientation tuning is quantized to 10 degrees, and their radius of influence is set to four to five nonoverlapping hypercolumns to reflect a 6 to 8 mm cortical range of horizontal connections (Gilbert & Wiesel, 1989) and a hypercolumn diameter of 1.5 mm (to account for ocular dominance domains). (a–d) Examples of co-circularity projection fields (Parent & Zucker, 1989) for cells with orientation preference of 150 degrees (center bars) and different values of curvature tuning, based on the implementation by Iverson (1994): (a) κ = 0.0 (curvature in units of pixels⁻¹); (b) κ = 0.08; (c) κ = 0.16; (d) κ = 0.24. (e) The union of all projection fields of all cells with the same orientation preference (0 degrees in this case) but different curvature tuning. Note the similarity to the schematic association field in Figure 6b. (f–j) Examples of texture flow projection fields (Ben-Shahar & Zucker, 2003b) for cells with horizontal orientation preference (center bars) and different curvature tuning. Note the intrinsic dependency on curvatures and the qualitatively different connectivity patterns that they induce: (f) (κ_T, κ_N) = (0.0, 0.0); (g) (κ_T, κ_N) = (0.2, 0.0); (h) (κ_T, κ_N) = (0.0, 0.2); (i) (κ_T, κ_N) = (0.1, 0.1); (j) (κ_T, κ_N) = (0.2, 0.2). Note that while the majority of connections link cells of roughly similar orientation, some connect cells with large orientation differences. The fields shown are just a few examples sampled from the models, both of which contain similar (rotated) connection fields for each of the possible orientation preferences in the central hypercolumn. The circles superimposed on d and i are used to characterize retinotopic distance zones for the predictions made in Figure 15.
Figure 9: Examples of coherent texture (a–d) and shading (e–g) flow computation based on contextual facilitation with right helicoidal connectivity patterns (Ben-Shahar & Zucker, 2003b). (a) Natural image of a tree stump with a perceptual texture flow. (b) A manually drawn flow structure as perceived by a typical observer. (c) Noisy orientation field reminiscent of RF responses. The computed measurements are based on the direction of the image intensity gradient. (d) The outcome of applying a contextual and distributed computation (Ben-Shahar & Zucker, 2003b) that facilitates the response of individual cells based on their interaction with nearby cells through the connectivity structures in Figure 8. Compare to b and note how the measurements in the area of the knot, where no RF is embedded in a coherent context, were rejected altogether. (e) An image of a plane. (f) Measured shading flow field (white) and edges (black). In biological terms, edges are measured by RFs of particular orientation preferences tuned to high spatial frequencies. The shading field may be measured by cells tuned to low frequencies. (g) Applying the right helicoidal-based computation to the shading information results in a coherent shading field on the plane's nose and a complete rejection of the incoherent shading information on the textured background. Such an outcome can be used to segment smoothly curved surfaces in the scene (Ben-Shahar & Zucker, 2003b), to resolve their shape (Lehky & Sejnowski, 1988), to identify shadows (Breton & Zucker, 1996), and to determine the occlusion relationships underlying edge classification (Huggins et al., 2001).
11% at an orientation resolution of 10 degrees) and a similar orientation offset at which it crosses the uniform distribution (approximately ±40 degrees). Unlike collinearity and association field models, however, ours predict qualitative differences between distributions of individual neurons, or injection sites, similar to findings in the literature (see Figure 10c). Most important, our model predicts the consistently nonmonotonic standard deviation. At an orientation resolution of 10 degrees, both the anatomical data and the com-
putational models exhibit local minima in variance at approximately ±30 degrees. This property holds both for a random sample of cells (see Figure 11) and for the computational population as a whole (not shown for space considerations).

3.2 Curvature Quantization and Population Statistics. The geometrical model discussed in this article must be quantized in both orientation and curvature before projection patterns can be computed and computational predictions can be made. We fixed the orientation quantization to the same level used in Bosking et al. (1997). Curvature quantization, however, is not addressed in the physiological literature, and thus it is necessary to examine its effect on the resultant connectivity distributions. We note that even with orientation represented to hyperacuity levels, there are sufficient numbers of cells to represent such quantization (Miller & Zucker, 1999). Broad orientation tuning implies discrete orientation quantization and suggests even more discrete curvature quantization. The results presented in Figures 10 and 11 are based on quantizing curvature into five classes.⁴ This is a likely upper bound, given the broad bandpass tuning of cortical neurons that has been observed (Dobbins et al., 1987; Versavel et al., 1990) and modeled (Dobbins, Zucker, & Cynader, 1989). However, to study the effect of curvature quantization, we repeated the entire set of computations with both a smaller (three) and a larger (seven) number of curvature classes. Three is clearly the lower limit, which may correspond to the tree shrew (Bosking et al., 1997) or other simple mammals, and seven is more than required computationally (Ben-Shahar & Zucker, 2003b). We found that all of the properties predicted initially remain invariant under these changes.
In particular, regardless of quantization level, the pooled distribution remains unimodal; it peaks at zero orientation difference at approximately 11%; it crosses the uniform distribution at ±40 degrees; and it has nonmonotonic variance with local minima at ±30 degrees (with somewhat increased variance around zero orientation for higher quantization levels). Qualitative differences between individual neurons are predicted regardless of the number of curvature classes. All these results are illustrated in Figure 12.

3.3 Relationship Between Cells' Distribution and Connections' Distribution. Since both anatomical and computational studies must sample the population of (biological or computational) cells to measure the distribution of their horizontal connections, an important consideration is whether the underlying distribution of cells (based on their curvature tuning) can affect the pooled distribution of connections. For example, if most cells in
4 In the context of curves, these five classes may be labeled as straight, slowly curving to the left, slowly curving to the right, rapidly curving to the left, and rapidly curving to the right.
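A minimal sketch of such a five-class quantization (the thresholds and the curvature range are our own illustrative assumptions; the article does not specify them):

```python
def quantize_curvature(kappa, k_max=0.24, n_classes=5):
    """Map a signed curvature value to one of n_classes discrete tuning classes,
    symmetric about zero (class 0 = straight). Classes are centered on equally
    spaced curvature values in [-k_max, k_max]; values beyond the range clamp."""
    assert n_classes % 2 == 1, "use an odd count so a 'straight' class exists"
    half = n_classes // 2
    step = k_max / half                  # spacing between class centers
    idx = round(kappa / step)            # nearest class center
    return max(-half, min(half, idx))    # clamp to the allowed range

# Labels follow footnote 4 (the left/right sign convention is our choice):
labels = {-2: "rapidly curving left", -1: "slowly curving left", 0: "straight",
          1: "slowly curving right", 2: "rapidly curving right"}
print(labels[quantize_curvature(0.1)])    # -> slowly curving right
print(labels[quantize_curvature(-0.2)])   # -> rapidly curving left
```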
Figure 10: Comparison of anatomical data and model predictions for the distribution of long-range horizontal connections in the orientation domain. In all graphs, dashed horizontal lines represent the uniform distribution, and error bars represent ±1 standard error. (a) Mean connection distribution of four injection sites from Malach et al. (1993) versus the computational prediction from our models (expected mean, N = 4, 100 repetitions). Note the dominant peak around zero orientation difference and the considerable width of the histogram. The asymmetry in the pooled distribution measured by Malach et al. (1993) likely derives from a bias at the injection site (see their Fig. 4D) rather than being intrinsic. (b) Median distribution of seven injection sites from Bosking et al. (1997) against the computational prediction from our models (expected median, N = 7, 100 repetitions). Note in particular the similarity in peak height and in the orientation offset at which the graphs cross the uniform distribution, and the strongly nonmonotonic behavior of the variance. (c) Two individual injection sites with qualitatively different connection distributions, reproduced from Bosking et al. (1997). The counterpart computational instances are sampled from our models. Solid graphs correspond to the fields in Figures 8b and 8i. Dashed graphs correspond to the fields in Figures 8c and 8j.
Figure 11: Although both the computational and the physiologically measured distributions of the mean are monotonically decreasing, their standard deviation is consistently nonmonotonic. (a) While Bosking et al. (1997) used the population median, we further analyzed their published data (from seven injection sites) to find its mean and standard deviation. It is evident that the standard deviation is nonmonotonic, with two local minima at ±30 degrees (marked with arrows). (b) Expected standard deviation for the texture model. (c) Expected standard deviation for the curve model. Both graphs depict the expected standard deviation for seven randomly selected cells (N = 7, 100 repetitions), and both show a similar nonmonotonic behavior with pronounced local minima in the standard deviation at approximately ±30 degrees. Note how the computational local minima coincide with the anatomical ones (arrows are copied from a and overlaid on the computational graphs). Compare also to the standard deviation on the median graphs in Figure 10b. Note that, as with the distributions themselves, both computational models produce quantitatively similar standard deviation results.
Figure 12: Different quantization levels of curvature tuning have little effect on the expected median distribution and its standard deviation. (a) Anatomical data from Bosking et al. (1997), shown for comparison with the computational predictions. (b) Computational predictions with three curvature classes. (c) Computational predictions with five curvature classes. (d) Computational predictions with seven curvature classes. In all cases, the left column depicts the expected median for seven cells (bars are ±1 s.d.), the middle column depicts the expected standard deviation for seven cells, and the right column shows two qualitatively different distributions from two different cells. For space considerations, we show the results from the texture model only.
the population are tuned for zero or very small curvature, the pooled connection distribution may differ from that of a population dominated by high-curvature cells. The results presented in Figures 10 and 11 are based on the assumption that cells of different curvature tuning (or, put differently, of different connectivity patterns) are distributed uniformly. Such an assumption follows naturally from the mathematical abstraction that allocates the same number of computational units to equal portions of R² × S¹. However, if this assumption were wrong, would a bias in the distribution of cells significantly affect the predictions made from our models? Unfortunately, few data about such distributions are available, partly because anatomists need not assume any particular distribution of cells for their measurements of projection fields, and partly because curvature tuning is rarely considered. Some data available on the distribution of endstopped cells (Kato, Bishop, & Orban, 1978; Orban, 1984), in conjunction with the functional equivalence of end stopping with curvature selectivity (Orban, Versavel, & Lagae, 1987; Dobbins et al., 1987, 1989; Versavel et al., 1990), suggest that cells are distributed bimodally in the curvature domain, with peaks at both zero and high curvature tuning. Alternatively, statistical studies of edge correlations in natural images (Dimitriv & Cowan, 1998; August & Zucker, 2000; Sigman et al., 2001; Kaschube, Wolf, Geisel, & Löwel, 2001; Geisler et al., 2001; Pedersen & Lee, 2002) show that collinear co-occurrences are more frequent than others. Although these co-occurrence measurements neither depend on curvature nor necessarily indicate any particular distribution of cells at the computational level, they may implicitly suggest that cells are distributed unimodally in the curvature domain, with a peak at zero curvature only.
Since our model raises the possibility of a curvature bias effect, we redistributed the population of our computational cells according to one or the other of these nonuniform (bimodal and unimodal) distributions and then repeated the computational anatomy process described in section 3.1. All computations were done on the more general 2D (texture) model. The bimodal distribution was modeled as a radial two-gaussian mixture model (GMM) parameterized by the total curvature κ = √(κ_T² + κ_N²) and parameters μ_0 = 0, σ_0, μ_1, and σ_1. The unimodal distribution was modeled as a 2D gaussian of zero mean and variances σ_T and σ_N in the κ_T and κ_N dimensions, respectively. Figure 13 illustrates one example of the resultant statistical measures for the bimodal cell distribution. In this example, σ_0 = 0.04 and σ_1 = 0.05, where the slight difference accounts for corresponding differences in the two modes as reported in Kato et al. (1978) and Orban (1984). As is shown, this nonuniform distribution hardly changes the expected median, while further emphasizing the nonmonotonic nature of the variance (compared to the statistics obtained with the same number of curvature classes and
Figure 13: A statistical confirmation that the properties of our models persist even when the population of cells is distributed bimodally (Kato et al., 1978; Orban, 1984). Illustrated here is the result from a distribution modeled as a GMM with μ_0 = 0, σ_0 = 0.04, μ_1 = 0.2, and σ_1 = 0.05. Since the bimodal nature of the distribution is best represented with a higher number of curvature classes, presented here is the case of the texture model with curvatures quantized to seven classes each. (a) The (radially) bimodal distribution of cells in the curvature domain, normalized for number of cells. The x- and y-axes represent tangential and normal curvature tuning, respectively, and the z-axis represents the number of such curvature-tuned cells for any given orientation tuning. (b) Expected median of seven cells. Error bars are ±1 standard deviation. (c) Expected standard error for seven cells. Compare both graphs to Figure 12d.
uniform cell distribution; see Figure 12d). Similar results were obtained with other values of σ_0 and σ_1, and for curvature quantized to five classes as well.⁵ Figure 14 illustrates another example, this time using the unimodal cell distribution mentioned above. In this example, σ_T = σ_N = 0.15, such that cells with zero curvature tuning are eight times more frequent than cells tuned to the maximum value of curvature. As expected, this strongly nonuniform distribution slightly elevated the peak of the population statistics, but otherwise all features predicted from the uniform cell distribution, and in particular the nonmonotonic variance, were fully preserved. Similar results were obtained with other values of σ_T and σ_N, and for all quantization levels of the curvatures (as in section 3.2). In summary, we have shown that even if cells in primary visual cortex were distributed nonuniformly in their curvature tuning, the pooled distribution of long-range horizontal connections in the orientation domain would preserve its fundamental properties, in particular its wide spread and nonmonotonic variance. Thus, our conclusions are not biased by an implicit assumption about a curvature-dependent distribution of cells.

4 Discussion

The findings presented from our computational anatomy support the functional identification of the long-range horizontal connections with those obtained mathematically. However, the question of why the texture model is necessary becomes unavoidable, and we believe this issue is more than just formal mathematics. Certain physiological and psychophysical findings, such as iso-orientation side facilitation (Adini et al., 1997), functional and anatomical connections between retinotopically parallel receptive fields (Ts'o et al., 1986; Gilbert & Wiesel, 1989), and the roughly isotropic retinotopic extent of projection fields (Malach et al., 1993; Sincich & Blasdel, 2001), suggest the perceptual integration of texture flows rather than curves.
Although this class of patterns may seem less important than curves as a factor in perceptual organization, their perceptual significance has been established (Glass, 1969; Kanizsa, 1979; Todd & Reichel, 1990). Furthermore, recent computational vision research implicates them in the analysis of visual shading (Lehky & Sejnowski, 1988; Huggins, Chen, Belhumeur, & Zucker, 2001), as was demonstrated in Figure 9, and even color (Ben-Shahar & Zucker, 2003a). Whether projection patterns of cells in primary visual cortex come in different flavors (i.e., curve versus texture or shading integration) is an open question. To answer it, one is likely to exploit the many physiologically measurable differences between these classes of projection patterns, as suggested by Figure 8.

⁵ Quantization of curvature to three classes was irrelevant in this case because the bimodality of the distribution could not be expressed using too few (three) samples.

Figure 14: A statistical confirmation that the properties of our models persist even when the population of cells is distributed normally (i.e., unimodally). Such a distribution is implicitly suggested by statistics of edges in natural images (Dimitrov & Cowan, 1998; August & Zucker, 2000; Sigman et al., 2001; Kaschube et al., 2001; Geisler et al., 2001; Pedersen & Lee, 2002), in which collinear edges are much more frequent. The case presented here (σ_T = σ_N = 0.15) induces a distribution in which cells of zero curvature tuning are eight times more frequent than those of maximal curvature tuning. The graphs in this figure correspond to the texture model with curvatures quantized to three classes each. Similar results were obtained with other quantization levels as well. (a) The normal distribution of cells in the curvature domain, normalized for number of cells. The x- and y-axes represent tangential and normal curvature tuning, respectively, and the z-axis represents the number of such curvature-tuned cells for any given orientation tuning. (b) Expected median of seven cells. Error bars are ±1 standard deviation. (c) Expected standard error for seven cells. Compare both graphs to Figure 12b.

Unfortunately, the statistical data obtained so far do not distinguish between the two (curve and texture) integration models; without a spatial dimension, the statistical differences between the two models in the orientation domain are too fine to measure relative to the accuracy of current laboratory techniques. Until full spatio-angular data are obtained, however, the inclusion of even weak spatial information is sufficient to generate further testable predictions. In particular, incorporating the retinotopic distance between linked cells into the statistics (or estimating it from their cortical distance) can produce predictions regarding the dependency of the distribution's spread and shape on the integration distance, as illustrated in Figure 15. Some verification for these predictions can be seen in the measurements of Kisvárday et al. (1997; the top row in their Fig. 9 shows developing peaks resembling the ones in Figures 15b and 15c). Similar annular analyses, which focus on sectors of the annuli in directions other than parallel to the RF's preferred orientation, provide measurable differences between curve and texture projection fields. In summary, we have presented mathematical analysis and computational models that predict both the pooled distribution of long-range horizontal connections and the distributions of individual cells and injection sites. For the first time, the modeling goes beyond the unimodal first-order data and falsifies a common conclusion from it. In particular, while coaligned facilitation entails the pooled first-order data, the converse is not necessarily true: these data are also consistent with curvature-dependent connections. The second-order (variance) data, however, remain consistent with curvature-dependent connections but not with coaligned facilitation. The explanatory force of our model derives from sensory integration, and we observed in section 1 that most researchers limit this to curve integration via collinearity.
We conclude in an enlarged context: differential geometry provides a foundation for connections in visual cortex that predicts both dependency on curvature(s) and an expanded functional capability, including curve, texture, and shading integration. Since the same geometrical analysis applies to many other domains in which orientation and direction fields are fundamental features, coherent motion processing (Seriès, Georges, Lorenceau, & Frégnac, 2002) and coherent color perception (Ben-Shahar & Zucker, 2003a) might also have been included. Since all follow from the geometry and all are important for vision, more targeted experiments are required to articulate their neural realization.

Appendix: Noise Models

Two basic noise models are used in this article to examine whether variations of the basic collinear distribution (see Figure 2) can produce pooled statistics with properties similar to the biological ones. This appendix describes both procedures in detail.
Figure 15: Model predictions of connection distributions by a retinotopic annulus. Left, middle, and right columns correspond to predictions based on small, medium, and large annuli, respectively (circles in Figures 8d and 8i). All annuli refer to distances beyond one orientation hypercolumn; thus, the small annulus should not be confused with distances less than the diameter of one hypercolumn. In a and b, the same sampling procedure and the same sample set sizes described in Figure 10 were repeated. For lack of space, we omit the very similar graphs of the mean and median of the entire population and present predictions from the texture model only. (a) Expected mean distribution and standard error (N = 4, 100 repetitions). Note the spread with increased retinotopic distance. (b) Expected median distribution and standard error (N = 7, 100 repetitions). Note the developing symmetric peaks that further depart from the iso-orientation domain as the spatial distance increases. The correspondence of these peaks to the minima of the standard error is remarkable, thus designating them as statistical anchors suitable for empirical verification. (c) Individually tuned cells show the qualitative difference between distributions of high and low curvature cells. In particular, note how the distributions of medium and high curvature cells (dashed graphs) are the ones that develop the peaks mentioned in b above.
To examine natural random variations at the level of individual connections, each long-range horizontal connection, ideally designated to connect cells of orientation difference Δθ, is shifted to connect cells of orientation difference Δθ + ε_σ, where ε_σ is wrapped gaussian noise with zero mean and variance σ. To do this computation, the base distribution (collinear or association field) from Figure 2, initially given as probabilities over 18 orientation bins of 10 degrees each, was normalized and quantized to a connection histogram in the range [0, N], where N represents the total number of connections a cell makes. To each such connection to orientation difference Δθ we then added wrapped gaussian noise ε_σ of zero mean and variance σ, and the new (random) connection was accumulated at the bin Δθ + ε_σ of the resultant histogram. This process was repeated 200 times to produce 200 different perturbations, from which both the expected distribution and the variance were computed bin-wise. The parameter σ was set to the value that produced an expected distribution of peak height and spread similar to the biological one. Since different anatomical studies and protocols indicate a different number of total connections (e.g., hundreds in Schmidt et al., 1997, approximately 3500 in Buzás et al., 1998, and up to 20,000 for injection sites of approximately 20 cells in Bosking et al., 1997), we repeated this statistical test for normalizations in different ranges. As expected, changing N only scaled the variance uniformly across the expected distribution but did not affect its mean. Thus, for better clarity of its monotonicity, the result in Figure 3b reflects a smaller number of total connections (N = 200), as in, for example, Schmidt et al. (1997).
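The procedure above can be sketched in code. The following is an illustrative sketch only: the gaussian-shaped base distribution, the noise level, and the trial count are placeholder values, not the ones used in the article.

```python
import numpy as np

def perturb_connections(base_probs, n_connections=200, sigma=20.0,
                        n_trials=200, bin_width=10.0, rng=None):
    """Shift each connection's target orientation bin by wrapped gaussian
    noise and return the bin-wise mean and std over the perturbation trials."""
    rng = np.random.default_rng(rng)
    base_probs = np.asarray(base_probs, dtype=float)
    n_bins = len(base_probs)
    # Normalize and quantize the base distribution into an integer
    # connection histogram (total roughly n_connections).
    counts = np.round(base_probs / base_probs.sum() * n_connections).astype(int)
    results = np.zeros((n_trials, n_bins))
    for t in range(n_trials):
        hist = np.zeros(n_bins)
        for b, c in enumerate(counts):
            # Orientation difference of this bin (degrees), plus noise.
            noisy = b * bin_width + rng.normal(0.0, sigma, size=c)
            # Wrapping via the modulo makes the gaussian noise "wrapped"
            # on the cyclic orientation-difference domain.
            new_bins = np.round(noisy / bin_width).astype(int) % n_bins
            np.add.at(hist, new_bins, 1)
        results[t] = hist
    return results.mean(axis=0), results.std(axis=0)

# Example: a collinear (peaked-at-zero) base distribution over 18 bins.
dist = np.minimum(np.arange(18), 18 - np.arange(18))   # cyclic bin distance
base = np.exp(-0.5 * (dist / 2.0) ** 2)
mean_hist, std_hist = perturb_connections(base)
```

Because every connection always lands in some bin, each trial conserves the total connection count; only the spread of the histogram changes with σ.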
To examine random variations due to "leakage" of tracer from an injection site of preferred orientation θ₀ to nearby orientation columns, we modeled such leakage by selecting i = 1, …, M cells of preferred orientation θᵢ = θ₀ + Δθ_σ, where Δθ_σ is a wrapped gaussian random variable of zero mean and variance σ. A normalized and quantized base distribution (collinear or association field) was then centered around each of the θᵢ, and all were summed and normalized to yield a resultant (random) distribution of connections for the injection site at θ₀. As before, we repeated this generation process 200 times to produce 200 different perturbations, from which both the expected distribution and the variance were computed bin-wise. The parameter σ was again set to the value that produced an expected distribution of peak height and spread similar to the biological one. The number of cells in an injection was set to M = 20, approximately as reported in Bosking et al. (1997). Unlike random variations at the level of individual connections, the range parameter N had no effect on the variance of the expected distribution.

Acknowledgments

We are grateful to Allan Dobbins, David Fitzpatrick, Kathleen Rockland, Terry Sejnowski, and Michael Stryker for reviewing this manuscript and for
providing valuable comments; to Lee Iverson for the curve compatibility fields in Figure 8; and to an anonymous reviewer for pointing out a possible artifact in the data (in section 1.1). This research was supported by AFOSR, ONR, and DARPA.

References

Adini, Y., Sagi, D., & Tsodyks, M. (1997). Excitatory-inhibitory network in the visual cortex: Psychophysical evidence. Proc. Natl. Acad. Sci. U.S.A., 94, 10426–10431.

August, J., & Zucker, S. (2000). The curve indicator random field: Curve organization via edge correlation. In K. Boyer & S. Sarkar (Eds.), Perceptual organization for artificial vision systems. Norwell, MA: Kluwer.

Ben-Shahar, O., & Zucker, S. (2003a). Hue fields and color curvatures: A perceptual organization approach to color image denoising. In Proc. Computer Vision and Pattern Recognition (pp. 713–720). Los Alamitos, CA: IEEE Computer Society.

Ben-Shahar, O., & Zucker, S. (2003b). The perceptual organization of texture flows: A contextual inference approach. IEEE Trans. Pattern Anal. Machine Intell., 25(4), 401–417.

Bosking, W., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in the tree shrew striate cortex. J. Neurosci., 17(6), 2112–2127.

Breton, P., & Zucker, S. (1996). Shadows and shading flow fields. In Proc. Computer Vision and Pattern Recognition (pp. 782–789). Los Alamitos, CA: IEEE Computer Society.

Buzás, P., Eysel, U., & Kisvárday, Z. (1998). Functional topography of single cortical cells: An intracellular approach combined with optical imaging. Brain Res. Prot., 3, 199–208.

Dimitrov, A., & Cowan, J. (1998). Spatial decorrelation in orientation-selective cortical cells. Neural Comput., 10, 1779–1795.

do Carmo, M. (1976). Differential geometry of curves and surfaces. Englewood Cliffs, NJ: Prentice-Hall.

Dobbins, A., Zucker, S., & Cynader, M. (1987). Endstopped neurons in the visual cortex as a substrate for calculating curvature. Nature, 329(6138), 438–441.
Dobbins, A., Zucker, S., & Cynader, M. (1989). Endstopping and curvature. Vision Res., 29(10), 1371–1387.

Field, D., Hayes, A., & Hess, R. (1993). Contour integration by the human visual system: Evidence for a local "association field." Vision Res., 33(2), 173–193.

Geisler, W., Perry, J., Super, B., & Gallogly, D. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Res., 41(6), 711–724.

Gilbert, C., & Wiesel, T. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9(7), 2432–2442.

Glass, L. (1969). Moiré effect from random dots. Nature, 223(5206), 578–580.
Hubel, D., & Wiesel, T. (1977). Functional architecture of macaque monkey visual cortex. Proc. R. Soc. London Ser. B, 198, 1–59.

Huggins, P., Chen, H., Belhumeur, P., & Zucker, S. (2001). Finding folds: On the appearance and identification of occlusion. In Proc. Computer Vision and Pattern Recognition (pp. 718–725). Los Alamitos, CA: IEEE Computer Society.

Iverson, L. A. (1994). Toward discrete geometric models for early vision. Unpublished doctoral dissertation, McGill University.

Kanizsa, G. (1979). Organization in vision: Essays on gestalt perception. New York: Praeger.

Kapadia, M., Ito, M., Gilbert, C., & Westheimer, G. (1995). Improvement in visual sensitivity by changes in local context: Parallel studies in human observers and in V1 of alert monkeys. Neuron, 15, 843–856.

Kapadia, M., Westheimer, G., & Gilbert, C. (2000). Spatial distribution of contextual interactions in primary visual cortex and in visual perception. J. Neurophysiol., 84, 2048–2062.

Kaschube, M., Wolf, F., Geisel, T., & Löwel, S. (2001). The prevalence of colinear contours in the real world. Neurocomputing, 38–40, 1335–1339.

Kato, H., Bishop, P., & Orban, G. (1978). Hypercomplex and the simple/complex cell classification in cat striate cortex. J. Neurophysiol., 41, 1071–1095.

Kisvárday, Z., Kim, D., Eysel, U., & Bonhoeffer, T. (1994). Relationship between lateral inhibition connections and the topography of the orientation map in cat visual cortex. J. Neurosci., 6, 1619–1632.

Kisvárday, Z., Tóth, É., Rausch, M., & Eysel, U. (1997). Orientation-specific relationship between populations of excitatory and inhibitory lateral connections in the visual cortex of the cat. Cereb. Cortex, 7, 605–618.

Lehky, S., & Sejnowski, T. (1988). Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature, 333, 452–454.

Malach, R., Amir, Y., Harel, M., & Grinvald, A. (1993).
Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocytin injections in primate striate cortex. Proc. Natl. Acad. Sci. U.S.A., 90, 10469–10473.

Matsubara, J., Cynader, M., Swindale, N., & Stryker, M. (1985). Intrinsic projections within visual cortex: Evidence for orientation-specific local connections. Proc. Natl. Acad. Sci. U.S.A., 82, 935–939.

Miller, D., & Zucker, S. (1999). Computing with self-excitatory cliques: A model and application to hyperacuity-scale computation in visual cortex. Neural Comput., 11, 21–66.

Mitchison, G., & Crick, F. (1982). Long axons within the striate cortex: Their distribution, orientation, and patterns of connections. Proc. Natl. Acad. Sci. U.S.A., 79, 3661–3665.

O'Neill, B. (1966). Elementary differential geometry. Orlando, FL: Academic Press.

Orban, G. (1984). Neural operations in the visual cortex. Berlin: Springer-Verlag.

Orban, G., Versavel, M., & Lagae, L. (1987). How do striate neurons represent curved stimuli? Abstracts of the Society for Neuroscience, 13, 404.10.

Parent, P., & Zucker, S. (1989). Trace inference, curvature consistency, and curve detection. IEEE Trans. Pattern Anal. Machine Intell., 11(8), 823–839.
Pedersen, K., & Lee, A. (2002). Toward a full probability model of edges in natural images (APPTS Tech. Rep. 02-1). Providence, RI: Division of Applied Mathematics, Brown University.

Polat, U., & Sagi, D. (1993). Lateral interactions between spatial channels: Suppression and facilitation revealed by lateral masking experiments. Vision Res., 33(7), 993–999.

Rockland, K., & Lund, J. (1982). Widespread periodic intrinsic connections in the tree shrew visual cortex. Science, 215(19), 1532–1534.

Schmidt, K., Goebel, R., Löwel, S., & Singer, W. (1997). The perceptual grouping criterion of colinearity is reflected by anisotropies in the primary visual cortex. Eur. J. Neurosci., 9, 1083–1089.

Schmidt, K., & Löwel, S. (2002). Long-range intrinsic connections in cat primary visual cortex. In B. Payne & A. Peters (Eds.), The cat primary visual cortex (pp. 387–426). Orlando, FL: Academic Press.

Seriès, P., Georges, S., Lorenceau, J., & Frégnac, Y. (2002). Orientation dependent modulation of apparent speed: A model based on the dynamics of feedforward and horizontal connectivity in V1 cortex. Vision Res., 42, 2781–2797.

Sigman, M., Cecchi, G., Gilbert, C., & Magnasco, M. (2001). On a common circle: Natural scenes and gestalt rules. Proc. Natl. Acad. Sci. U.S.A., 98(4), 1935–1940.

Sincich, L., & Blasdel, G. (2001). Oriented axon projections in primary visual cortex of the monkey. J. Neurosci., 21(12), 4416–4426.

Todd, J., & Reichel, F. (1990). Visual perception of smoothly curved surfaces from double-projected contour patterns. J. Exp. Psych.: Human Perception and Performance, 16(3), 665–674.

Ts'o, D., Gilbert, C., & Wiesel, T. (1986). Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci., 6(4), 1160–1170.

Versavel, M., Orban, G., & Lagae, L. (1990). Responses of visual cortical neurons to curved stimuli and chevrons. Vision Res., 30(2), 235–248.
Weliky, M., Kandler, K., Fitzpatrick, D., & Katz, L. (1995). Patterns of excitation and inhibition evoked by horizontal connections in visual cortex share a common relationship to orientation columns. Neuron, 15, 541–552.

Wertheimer, M. (1955). Laws of organization in perceptual forms. In W. Ellis (Ed.), A source book of gestalt psychology (pp. 71–88). London: Routledge & Kegan Paul.

Zucker, S., Dobbins, A., & Iverson, L. (1989). Two stages of curve detection suggest two styles of visual computation. Neural Comput., 1, 68–81.

Received April 9, 2003; accepted September 5, 2003.
NOTE
Communicated by Bruce Knight
Mean Instantaneous Firing Frequency Is Always Higher Than the Firing Rate

Petr Lánský
[email protected] Institute of Physiology, Academy of Sciences of the Czech Republic, 142 20 Prague 4, Czech Republic
Roger Rodriguez
[email protected] Centre de Physique Th´eorique, CNRS and Facult´e des Sciences de Luminy, Universit´e de la M´editerran´ee, F-13288 Marseille Cedex 09, France
Laura Sacerdote
[email protected] Department of Mathematics, University of Torino, via Carlo Alberto 10, 10 123 Torino, Italy
Neural Computation 16, 477–489 (2004)   © 2004 Massachusetts Institute of Technology

Frequency coding is considered one of the most common coding strategies employed by neural systems. This fact leads, in experiments as well as in theoretical studies, to the construction of so-called transfer functions, where the output firing frequency is plotted against the input intensity. The term firing frequency can be understood differently in different contexts. Basically, it means that the number of spikes over an interval of preselected length is counted and then divided by the length of the interval, but due to obvious limitations, the length of observation cannot be arbitrarily long. Then firing frequency is defined as the reciprocal of the mean interspike interval. In parallel, an instantaneous firing frequency can be defined as the reciprocal of the length of the current interspike interval, and by taking a mean of these, the definition can be extended to introduce the mean instantaneous firing frequency. All of these definitions of firing frequency are compared in an effort to contribute to a better understanding of the input-output properties of a neuron.

1 Introduction

For a constant signal or under steady-state conditions, characterization of the input-output properties of neurons, as well as of neuronal models, is commonly done by so-called frequency (input-output) transfer functions, in which the output frequency of firing is plotted against the strength (again, often frequency) of the input signal. By constructing the transfer functions,
it is implicitly presumed that the information in the investigated neuron is coded by the frequency of the action potentials, the classic coding schema in neural systems (Adrian, 1928). Characterization of the input signals by the frequency of evoked action potentials requires giving a proper definition of the firing frequency and extending it to transient signals. Up to now, various definitions of the term spiking frequency have been adopted (Awiszus, 1988; Ermentrout, 1998; Gerstner & van Hemmen, 1992), and the concept of rate coding is carefully treated by Gerstner and Kistler (2002). In the laboratory situation, the crucial question in identifying the firing rate (frequency, intensity) is the stationarity of the counting process of spikes; without this stationarity, speaking about the firing rate is meaningless. The most common and intuitive understanding of the firing frequency is based on counting events (spikes) appearing in an interval of prescribed duration and dividing this number by the length of this interval. An argument against using this method is that the duration of the observation interval can be limited: either the required stationarity disappears outside the interval, or its length is out of the control of the researcher. The conditions are often very variable during experiments. For example, in studies of hippocampal place cells, the dwell time of a freely moving animal in a particular spot always changes (Fenton & Muller, 1998). Similarly, in experiments with neurons in sensory systems, the observation periods have to be reduced to the time when the stimuli act on the neuron (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). In other cases, an applied pharmaceutical treatment has a limited duration, and the neuron must be recorded within this time period.

As we will note later, for a short observation period, the sensitive point is not only the limited length of the observation but also the time when the counting of spikes starts (whether or not it is identified with an action potential). Gerstner and Kistler (2002) discuss how to average over time (repetitions of experiments) or over different neurons to improve the estimates of the firing rate. A related method used for determining the firing frequency is based on calculating the inverse of the average interspike interval (ISI). We will see that this method and the previously mentioned one (based on the counting process) are identical under specific conditions, and thus in theoretical studies, the reciprocal of the mean ISI is usually declared to be the firing rate (Burkitt & Clark, 2000; Van Rullen & Thorpe, 2001). Of course, for small samples, both methods are influenced by the selection of the beginning and the end of the observation period (see Figure 1), but the aim of this article is not to study the effect of a small sample size. In the extreme situation, with only one spike available during the observation period, calculating the mean ISI fails to provide any information. With two spikes, speaking about a "mean" ISI is possible and probably less influenced by other factors than the number of spikes in a vaguely determined period. A so-called instantaneous firing frequency can be defined for a single ISI as the inverse of its length (see Pauluis & Baker, 2000, for historical notes on
Figure 1: A schematic example of experimental data. Observation starts at time zero and lasts for 0.2 s, with three spikes (or four spikes if there is a spike at the time origin). The firing frequencies, calculated in three different ways, are f₁ = 1/t̄ = 20 (25) s⁻¹, f₂ = #spikes/period = 15 (20) s⁻¹, and f₃ = (1/t)‾ = 23.81 (32.54) s⁻¹, where the numbers in parentheses hold if the observation starts with a spike.
this method). This implicitly assumes that the current ISI is the "mean" over a short period of time. Then, in a manner analogous to the firing frequency over an interval of some fixed length, the mean of these instantaneous firing frequencies can define the mean instantaneous firing rate. Similarly, Knight (1972) and Barbi, Carelli, Frediani, and Petracchi (1975) define the instantaneous rate as the reciprocal of the period since the last spike, and thus both definitions coincide at the moment of spike generation. These two methods, the inverse of the mean ISI and the mean of the inverse ISIs, are equally suitable for experimental and simulated (modeled) data. Formally, this means that if the available ISIs are denoted {t₁, …, tₙ}, which are independent realizations of a random variable T, then either 1/t̄ = 1/((1/n) Σ_{i=1}^{n} t_i) or (1/t)‾ = (1/n) Σ_{i=1}^{n} (1/t_i) is calculated. This corresponds to a situation in which it is assumed that the ISIs are realizations of a random variable T and either 1/E(T) or E(1/T), where E is used for the mean throughout this note, is evaluated. The differences in the definitions of firing frequency are shown in a simple numerical example (see Figure 1). In this note, we compare the methods of firing-rate quantification for common neuronal models and point out the possible differences and implications for inference on real data. We do not deal here with time-variable and stochastic rates.

2 Basic Results

The standard definition of the rate function of discharge (firing) is (Johnson, 1996)

f(t) = lim_{Δt→0} E(N(t + Δt) − N(t))/Δt,    (2.1)
where N is the counting process of spikes. If this function is independent of t and thus constant, which is the presumption considered in this note, it is called the firing rate. A natural way to calculate the firing rate of a neuron is to divide the number of elicited spikes, N(t), by the length of the observation period, t. The mean of the ISIs, E(T), is connected to the mean of the counting process, E(N(t)), by the asymptotic formula

1/E(T) = lim_{t→∞} E(N(t))/t,    (2.2)
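For a long, stationary renewal spike train, the count-based rate N(t)/t and the reciprocal of the sample mean ISI agree, in the spirit of equation 2.2. A quick simulation sketch (illustrative only; the Poissonian ISI distribution and rate are arbitrary choices, and the observation is taken to end exactly at a spike):

```python
import numpy as np

lam = 20.0                                       # example firing rate (spikes/s)
rng = np.random.default_rng(4)
isis = rng.exponential(1.0 / lam, size=200_000)  # renewal (Poissonian) ISIs
spike_times = np.cumsum(isis)

t_obs = spike_times[-1]                 # observation period ending at a spike
count_rate = len(spike_times) / t_obs   # N(t)/t
isi_rate = 1.0 / isis.mean()            # 1/E(T), estimated from the sample

# With the observation ending exactly at a spike, the two estimates coincide
# up to floating-point error; both approach lam for a long observation.
assert abs(count_rate - isi_rate) < 1e-6
assert abs(count_rate - lam) < 0.5
```

When the observation window is cut at an arbitrary time instead, the two estimates differ by the partial interval at the end, which is exactly the small-sample effect illustrated in Figure 1.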
(see, e.g., Cox & Lewis, 1966; Rudd & Brown, 1997). Formula 2.2 holds true for finite t only under the condition that N(t) is a stationary point process. Cox and Lewis (1966) show in detail how E(N(t)) is related to the inverse of the mean ISI. The necessary condition for the stationarity of the counting process is that it starts at an arbitrary time, and this implies that the sequence of ISIs cannot be stationary. However, for a renewal process, disregarding the time before the first spike (i.e., starting the sequence of ISIs with the first spike) solves the problem (nonrenewal models are outside the scope of this note). Nevertheless, as illustrated in Figure 1, equation 2.2 does not hold if the mean ISI is replaced by the sample average and the mean of the counting process by its sample value. The theoretical result given by equation 2.2 can be used if the observation period is sufficiently long. This usually is not the case in estimating firing rates, as the observation period contains rather few spikes. Gerstner and Kistler (2002) give 100 to 500 ms as the usual length of the observation period, and this obviously permits only a few spikes. To deal with this difficulty, the averaging over time appearing in equation 2.2 is replaced by averaging over different neurons or over repetitions of the experiment on the same neuron. In this case, the inference has to be based on realizations of the random variable T. Hence, we focus on the estimation and comparison of the firing frequency, 1/E(T), and the mean instantaneous frequency, E(1/T). In theoretical inference, we can use the fact that the function 1/t is convex, and by using Jensen's inequality (e.g., Rao, 1965) we can prove that for any positive random variable T,

E(1/T) ≥ 1/E(T),    (2.3)

which permits us to conclude that the mean instantaneous frequency is always higher than, or at least equal to, the firing frequency. This result will be illustrated and quantified for several common stochastic models of membrane depolarization and also for several generic statistical descriptors often used for the characterization of experimental data.
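Inequality 2.3 is straightforward to verify numerically. The following sketch is illustrative only (the lognormal ISI distribution and its parameters are arbitrary choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated ISIs from an arbitrary positive distribution (lognormal here).
isis = rng.lognormal(mean=-3.0, sigma=0.5, size=100_000)

firing_rate = 1.0 / isis.mean()          # 1/E(T)
mean_inst_rate = (1.0 / isis).mean()     # E(1/T)

# Jensen's inequality: E(1/T) >= 1/E(T) for any positive T; the sample
# version holds as well, since it is Jensen applied to the empirical law.
assert mean_inst_rate >= firing_rate
```

The gap between the two quantities grows with the variability of the ISIs and vanishes only in the deterministic (constant-ISI) case.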
3 Generic Models of Spike Trains

3.1 Gamma Distribution of ISIs. The gamma distribution is often and successfully fitted to experimentally observed histograms of ISIs and is often taken as a theoretical model from which further conclusions are derived (e.g., Baker & Gerstein, 2001). The probability density function of T is

g(t) = λ^v e^{−λt} t^{v−1} / Γ(v),    (3.1)

where Γ is the gamma function and v > 0 and λ > 0 are the parameters. (Below, probability density will be denoted by g(·).) The statistical moments are well known for this distribution, and thus we can directly write

1/E(T) = λ/v.    (3.2)

For X = 1/T, we have

g(x) = (1/(x Γ(v))) e^{−λ/x} (λ/x)^v,    (3.3)

and the mean for distribution 3.3 can be calculated, which yields, for v > 1,

E(1/T) = λ/(v − 1).    (3.4)
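Equations 3.2 and 3.4 are easy to verify by simulation. The sketch below is illustrative only; the parameter values λ = 40 s⁻¹ and v = 4 are arbitrary choices, not taken from the article:

```python
import numpy as np

lam, v = 40.0, 4.0          # rate and shape parameters (arbitrary example)
rng = np.random.default_rng(1)
# numpy parameterizes the gamma distribution by shape and scale = 1/rate.
isis = rng.gamma(shape=v, scale=1.0 / lam, size=1_000_000)

firing_rate = 1.0 / isis.mean()       # equation 3.2: lam/v = 10 spikes/s
mean_inst_rate = (1.0 / isis).mean()  # equation 3.4: lam/(v-1) ~ 13.33 spikes/s

assert abs(firing_rate - lam / v) < 0.1
assert abs(mean_inst_rate - lam / (v - 1)) < 0.1
```

As the text notes, increasing v sharpens the ISI distribution, and the two rates converge toward each other.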
Compared with equation 3.2, we can see that the difference between the two rates decreases with increasing v. This is expected, since for increasing v, the interval distribution becomes more sharply peaked (i.e., approaches the deterministic case), and in the deterministic case, the two measures become identical.

3.2 Poisson Process and Its Modification. The spikes are generated in accordance with a Poisson process if v = 1 in model 3.1. Then from equation 3.2, it follows that 1/E(T) = λ, and equation 3.4 implies that E(1/T) = ∞. This striking difference is caused by the existence of very short ISIs and has been noted in Johnson (1996). Let us thus assume that the model is the dead-time Poisson process (modeling a refractory period), in which intervals between events are exponentially distributed but cannot be shorter than a constant δ. Then the ISI distribution is λ exp(−λ(t − δ)), and 1/E(T) = λ/(λδ + 1). The mean instantaneous frequency is

E(1/T) = ∫₀^∞ λ exp(−λz)/(z + δ) dz,    (3.5)
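That the dead time makes E(1/T) finite, in contrast to the pure Poisson case, can also be checked by simulation (an illustrative sketch; λ = 20 s⁻¹ and δ = 5 ms are arbitrary example values):

```python
import numpy as np

lam, delta = 20.0, 0.005   # rate (1/s) and refractory period (s), example values
rng = np.random.default_rng(2)
# Dead-time Poisson ISIs: a constant refractory period plus an exponential part.
isis = delta + rng.exponential(scale=1.0 / lam, size=1_000_000)

firing_rate = 1.0 / isis.mean()        # analytic value: lam/(lam*delta + 1)
mean_inst_rate = (1.0 / isis).mean()   # finite, unlike the pure Poisson case

assert abs(firing_rate - lam / (lam * delta + 1)) < 0.1
assert mean_inst_rate > firing_rate
assert np.isfinite(mean_inst_rate)
```

Because no ISI can be shorter than δ, the integrand of equation 3.5 is bounded by λ exp(−λz)/δ, so the integral converges.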
Figure 2: Comparison of the input-output curves in dependency on the parameters of the models. (a) The firing rate (lower curve) and the mean instantaneous firing rate (upper curve) in dependency on the inverse value of the length of the refractory period, 1/δ, for Poissonian firing with rate λ = 20 s⁻¹. The frequencies and 1/δ are given in s⁻¹. (b–d) The firing rates (solid lines) and the mean instantaneous firing rates (dashed lines) are plotted as a function of the amplitude of the noise for different levels of the input. (b) LIF neuronal model: from top to bottom, μ = 1.5, 1, 0.5, and 0.1. The firing threshold S = 1, τ = 1. (c) Feller neuronal model: the same levels of the input μ and the same parameters as for the LIF, V_I = −1. (d) The Morris-Lecar neuronal model: from top to bottom, I_ext = 0.145, 0.125, 0.105, and 0.085.
which is finite. In Figure 2a, we show the firing rate and the mean instantaneous firing rate in dependency on the inverse of the length of the refractory period, δ. We can see that both definitions tend to give the same result when the refractory period is increased. This is expected, since the process becomes almost deterministic when the refractory period dominates T.

4 Simple Stochastic Models

4.1 Perfect Integrate-and-Fire (Gerstein-Mandelbrot Model). This simple model is closer to the statistical descriptors like those in the previous section than to the models, considered below, that seek to provide a realistic description of neurons. This means that even if the data fit the model perfectly, it can be used for their characterization, but few biophysical conclusions can be deduced from this fact. The probability density function of T is known as the inverse gaussian distribution,

g(t) = (S/(σ√(2πt³))) exp{−(S − μt)²/(2σ²t)},    (4.1)
where μ, σ², and S are constants characterizing the neuron and its input (Tuckwell, 1988). Its statistical moments are well known, and thus we can directly write

1/E(T) = \frac{\mu}{S}.   (4.2)
For X = 1/T, we have

g(x) = \frac{S}{\sigma \sqrt{2\pi x}} \exp\left\{ -\frac{(Sx - \mu)^{2}}{2\sigma^{2} x} \right\},   (4.3)
and the mean of distribution 4.3 can be calculated, which yields

E(1/T) = \frac{\mu}{S} + \frac{\sigma^{2}}{S^{2}}.   (4.4)
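Equations 4.2 and 4.4 can be verified by sampling. The sketch below (our illustration) uses NumPy's Wald (inverse gaussian) sampler; mapping the first-passage parameters to a mean of S/μ and a shape of S²/σ² is the standard parameterization, stated here as an assumption:

```python
import numpy as np

def ig_rate_check(mu=1.0, sigma=0.5, S=1.0, n=500_000, seed=1):
    """Perfect IF model: T is inverse gaussian with mean S/mu and shape
    S^2/sigma^2.  Returns the empirical (1/E(T), E(1/T))."""
    rng = np.random.default_rng(seed)
    T = rng.wald(S / mu, S**2 / sigma**2, size=n)
    return 1.0 / T.mean(), (1.0 / T).mean()

rate, inst = ig_rate_check()
# eq. 4.2 predicts 1/E(T) = mu/S = 1.0;
# eq. 4.4 predicts E(1/T) = mu/S + sigma^2/S^2 = 1.25.
```

The gap between the two empirical values is the σ²/S² term of equation 4.4.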
Comparing equations 4.2 and 4.4, we can see that the difference between the mean instantaneous frequency and the firing frequency increases with increasing σ, which characterizes the noise, and with decreasing S, which is the firing threshold of the model.

4.2 Leaky Integrate-and-Fire (Ornstein-Uhlenbeck Model). The Ornstein-Uhlenbeck stochastic process restricted by a threshold, called the leaky integrate-and-fire (LIF) model, is the most common compromise between tractability and realism in neuronal modeling (Tuckwell, 1988; Gerstner & Kistler, 2002). In this model, the behavior of the depolarization V of the membrane is described by the stochastic differential equation

dV = \left( -\frac{V}{\tau} + \mu \right) dt + \sigma\, dW, \quad V(0) = 0,   (4.5)

where τ > 0 is the time constant of the neuron, W is a standard Wiener process (gaussian noise with unit-sized delta function correlation), μ and σ² are constants characterizing the input, and t = 0 is the time of the last spike generation. The time of the previous spike can be identified with time zero because the sequence of intervals generated by the first passages of process 4.5 across the threshold S is a renewal process. The solution of the first-passage-time problem is not a simple task for model 4.5, and numerical and simulation techniques have been widely used
(Ricciardi & Sacerdote, 1979; Ricciardi & Sato, 1990; Musila, Suta, & Lansky, 1996). The Laplace transform of the first-passage-time probability density function is available (see Tuckwell, 1988, for historical references), and from it, its mean can be derived. The firing rate, f = 1/E(T), can be approximated by the following linear function,

f = \frac{1}{\pi \tau S} \left( \sigma \sqrt{\pi \tau} + 2\tau \mu - S \right),   (4.6)

(Lansky & Sacerdote, 2001). The ranges in which this approximation is valid are restricted to the values of the parameters for which the quantities \mu\sqrt{\tau}/\sigma and (\mu\sqrt{\tau} - S/\sqrt{\tau})/\sigma are small. This means that for sufficiently large amplitudes of noise, the response function is linear. Linearization 4.6 is quite robust and valid in wide ranges of parameters. By using the fact that the Laplace transform \tilde{g}(s) of the first-passage-time probability density function is available, the mean of 1/T can be calculated by using the primitive function of this Laplace transform,

E(1/T) = -\left[ \int^{s} \tilde{g}(w)\, dw \right]_{s=0}.   (4.7)
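For the LIF model, both rate measures can also be estimated by direct simulation of equation 4.5. The following Euler-Maruyama sketch (our illustration; the step size, parameter values, and function name are our choices) resets V to 0 at each threshold crossing and collects the ISIs:

```python
import numpy as np

def lif_isis(mu=1.5, sigma=0.5, S=1.0, tau=1.0, n_spikes=200,
             dt=1e-3, seed=2):
    """Euler-Maruyama simulation of dV = (-V/tau + mu) dt + sigma dW
    (eq. 4.5) with reset to V = 0 at the threshold S; returns the ISIs."""
    rng = np.random.default_rng(seed)
    isis, v, t = [], 0.0, 0.0
    sq = sigma * np.sqrt(dt)
    while len(isis) < n_spikes:
        v += (-v / tau + mu) * dt + sq * rng.standard_normal()
        t += dt
        if v >= S:
            isis.append(t)
            v, t = 0.0, 0.0
    return np.asarray(isis)

isi = lif_isis()
firing_rate = 1.0 / isi.mean()        # 1/E(T)
instantaneous = (1.0 / isi).mean()    # E(1/T)
# by Jensen's inequality, the empirical E(1/T) is never below 1/E(T)
```

Note that the ordering of the two estimates holds for any sample, since x ↦ 1/x is convex.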
The input-output frequency curves using the firing rate and the mean instantaneous rate are compared in Figure 2b. To improve the possibility of comparison with other models, model 4.5 was transformed into dimensionless variables; that is, time is in units of the time constant τ, and voltage is in units of the firing threshold S. We can see that only for large noise amplitude, almost half of the threshold value, does the difference become substantial.

4.3 Diffusion Model with Inhibitory Reversal Potential (Feller Model). To include some other features of real neurons in the LIF model, reversal potentials can be introduced into model 4.5. In one of the variants of this model, introduced by Lansky and Lanska (1987), the behavior of the depolarization V of the membrane is described by the stochastic differential equation

dV = \left( -\frac{V}{\tau} + \mu \right) dt + \sigma \sqrt{V - V_I}\, dW, \quad V(0) = 0,   (4.8)

where the parameters have the same interpretation as in equation 4.5 and V_I < 0 is the inhibitory reversal potential (Lansky, Sacerdote, & Tomassetti, 1995). As for the LIF model, the Laplace transform of the ISI can be written in a closed form,
\tilde{g}(s) = \frac{1}{\Phi\left( s\tau,\ \frac{2(\mu\tau - V_I)}{\tau\sigma^{2}};\ \frac{2(S - V_I)}{\tau\sigma^{2}} \right)},   (4.9)
where Φ(a, b; x) is the Kummer function (Abramowitz & Stegun, 1965). Equation 4.9 was used for numerical evaluation of the mean instantaneous frequency via equation 4.7, and the mean ISI was calculated by using the Siegert formula (Siegert, 1951). To see the difference between the LIF model and model 4.8, we attempted to use the same set of parameters in both models. A problem arises in fixing the noise amplitude, as in model 4.8 it depends on the actual level of the membrane depolarization V. The amplitude of the noise was made equal at the resting level, which is one of the methods applied in Lansky et al. (1995). The input-output frequency curves for the firing rate and the mean instantaneous rate are compared in Figure 2c. Again, as with the LIF model, the dimensionless variant of equation 4.8 was used. We can see that the difference between the two definitions of firing frequency is less pronounced here than it is for the LIF model.

5 Biophysical Model with Noise

To illustrate the achieved results on a realistic model of a neuron, we investigated the Morris-Lecar model, which is a simplification of the original Hodgkin-Huxley system. Two kinds of channels, voltage-gated Ca²⁺ channels and voltage-gated delayed-rectifier K⁺ channels, are present in this model of an excitable cell membrane (Morris & Lecar, 1981; Rinzel & Ermentrout, 1989). When a noisy external input is applied, the system of equations for the normalized membrane depolarization V(t) and for X(t), representing the fraction of open K⁺ channels, can be written in the form

\frac{dV}{dt} = g_{Ca}\, m_\infty(V)(V_{Ca} - V) + g_K X (V_K - V) + g_L (V_L - V) + I_{ext} + \eta(t),   (5.1)

and

\frac{dX}{dt} = k_X(V)\,(X_\infty(V) - X),   (5.2)

where the time is also normalized. The calcium current plays the role of the sodium current in the original Hodgkin-Huxley system. However, the calcium channels respond to voltage so rapidly that instantaneous activation is assumed for them, with the associated ionic conductance g_{Ca} m_\infty(V). I_{ext} is an applied external normalized current, and η(t) is a white noise perturbation such that ⟨η(s)η(t)⟩ = σ δ(s − t), where σ is a constant. The functions m_\infty(V), X_\infty(V), and k_X(V) are of the form

m_\infty(V) = \frac{1}{2} \left( 1 + \tanh\left( \frac{V - V_1}{V_2} \right) \right),

X_\infty(V) = \frac{1}{2} \left( 1 + \tanh\left( \frac{V - V_3}{V_4} \right) \right),

k_X(V) = \varphi \cosh\left( \frac{V - V_3}{2 V_4} \right).   (5.3)
In these equations, k_X(V(t)) is a relaxation constant for a given V(t); further, g_{Ca}, g_K, and g_L are constants representing normalized conductances, and V_{Ca}, V_K, and V_L are normalized resting potentials for the two different kinds of ions and for the leakage current. Finally, V_1, V_2, V_3, V_4, and φ are constants. The values of all constants were taken from Rinzel and Ermentrout (1989). With constant and gaussian white noise inputs applied, the random variable T was calculated. In Figure 2d, the inverse of the mean, 1/E(T), and the mean E(1/T) are shown as functions of the noise amplitude for different values of the constant input. The same effect as for the simple neuronal models is observed.

6 Discussion

The calculation of the number of spikes per long time window is rather unrealistic from the point of view of the neural system. Thus, it is assumed that the time averaging in real systems is replaced by population averaging, giving the same result (counting the spikes emitted by one neuron during a period of 1 minute is the same as counting the spikes of 600 neurons emitting spikes within 100 milliseconds). This possibility of replacing time averaging by population averaging would be especially important for evaluating firing rates in transient situations, as in evoked activity.

The terminology used when speaking about firing rates is not always clear. Sometimes a constant firing rate is called the mean rate, and on other occasions, the function f(t) given by equation 2.1 and averaged over some interval of time is called the mean firing rate. The instantaneous rate is also understood in different ways. In some cases, it is the firing probability in an infinitesimally short interval (Johnson, 1996; Fourcaud & Brunel, 2002). In other cases, the reciprocal of the ISIs or their smoothed version is used; Pauluis and Baker (2000) and Johnson (1996) compare these two approaches.
The difference is well illustrated in the extreme case of deterministic firing (constant ISIs), in which σ in the LIF model given by equation 4.5 tends to zero. From one point of view, the firing rate is a sequence of delta functions of time with peaks located at the moments of the spikes, and the instantaneous firing rate is either zero or 1/ISI. From another point of view, the one adopted here, the rate and the instantaneous rate coincide and are equal to the reciprocal of the ISI. While the first approach may be seen as more informative, we note that we presume a priori that the rate is constant over the entire period of observation. Van Rullen and Thorpe (2001) compared the counting method for calculating the firing rate with the ISI method and declared the latter potentially more accurate than the former. However, under the conditions valid in their article, the authors preferred the latency to the first spike as the neuronal code. It is again closely related to the ISI distribution but is not based on the counting process.

7 Conclusion

We compared two methods for evaluating the firing rate in this note. Although one of them gives a systematically larger value, the difference between them is not large, at least under the conditions investigated here. The only exception is Poissonian firing, which can only rarely be considered a realistic description of neuronal firing. For the other models, only strong noise makes the methods substantially different. One advantage of the mean instantaneous firing rate is that the statistical properties of the random variable 1/T can be derived; thus, confidence intervals can be found, and testing procedures for comparing the rates under different experimental situations can be applied. This is not so easy for the firing rate calculated as 1/E(T). The only way to overcome this shortcoming of the reciprocal of the mean ISI is to use known properties of spike counts (Treves, Panzeri, Rolls, Booth, & Wakeman, 1999; Settanni & Treves, 2000).

Acknowledgments

We thank an anonymous referee for many helpful comments. This work was supported by the INDAM Research Project and by grant 309/02/0168 from the Grant Agency of the Czech Republic.

References

Abramowitz, M., & Stegun, I. (Eds.). (1965). Handbook of mathematical functions. New York: Dover.
Adrian, E. D. (1928). The basis of sensation: The action of the sense organs. New York: Norton.
Awiszus, F. (1988). Continuous function determined by spike trains of a neuron subject to stimulation. Biol. Cybern., 58, 321–327.
Baker, S. N., & Gerstein, G. L. (2001). Determination of response latency and its application to normalization of cross-correlation measures. Neural Comput., 13, 1351–1378.
Barbi, M., Carelli, V., Frediani, C., & Petracchi, D. (1975). The self-inhibited leaky integrator: Transfer functions and steady state relations. Biol. Cybern., 20, 51–59.
Burkitt, A. N., & Clark, G. M. (2000). Calculation of interspike intervals for integrate-and-fire neurons with Poisson distribution of synaptic input. Neural Comput., 12, 1789–1820.
Cox, D. R., & Lewis, P. A. W. (1966). Statistical analyses of series of events. London: Methuen.
Ermentrout, B. (1998). Linearization of F-I curves by adaptation. Neural Comput., 10, 1721–1729.
Fenton, A. A., & Muller, R. U. (1998). Place cell discharge is extremely variable during individual passes of the rat through the firing field. Proc. Natl. Acad. Sci. USA, 95, 3182–3187.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Comput., 14, 2057–2110.
Gerstner, W., & van Hemmen, J. L. (1992). Universality in neural networks: The importance of the "mean firing rate." Biol. Cybern., 67, 195–205.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Johnson, D. H. (1996). Point process models of single-neuron discharges. J. Comput. Neurosci., 3, 275–300.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Lansky, P., & Lanska, V. (1987). Diffusion approximation of the neuronal model with synaptic reversal potentials. Biol. Cybern., 56, 19–26.
Lansky, P., & Sacerdote, L. (2001). The Ornstein-Uhlenbeck neuronal model with signal-dependent noise. Physics Letters A, 285, 132–140.
Lansky, P., Sacerdote, L., & Tomassetti, F. (1995). On the comparison of Feller and Ornstein-Uhlenbeck models for neural activity. Biol. Cybern., 76, 457–465.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Musila, M., Suta, D., & Lansky, P. (1996). Computation of first passage time moments for stochastic diffusion processes modelling the nerve membrane depolarization. Comp. Meth. Prog. Biomed., 49, 19–27.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of the instantaneous discharge probability, with application to joint-event analysis. Neural Comput., 12, 647–669.
Rao, C. R. (1965). Linear statistical inference and its applications. New York: Wiley.
Ricciardi, L. M., & Sacerdote, L. (1979). The Ornstein-Uhlenbeck process as a model of neuronal activity. Biol. Cybern., 35, 1–9.
Ricciardi, L. M., & Sato, S. (1990). Diffusion processes and first-passage-time problems. In L. M. Ricciardi (Ed.), Lectures in applied mathematics and informatics. Manchester: Manchester University Press.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rinzel, J., & Ermentrout, G. B. (1989). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks. Cambridge, MA: MIT Press.
Rudd, M. E., & Brown, L. G. (1997). Noise adaptation in integrate-and-fire neurons. Neural Comput., 9, 1047–1069.
Settanni, G., & Treves, A. (2000). Analytical model for the effects of learning on spike count distributions. Neural Comput., 12, 1773–1788.
Siegert, A. J. F. (1951). On the first passage time probability problem. Phys. Rev., 81, 617–623.
Treves, A., Panzeri, S., Rolls, E. T., Booth, M., & Wakeman, E. A. (1999). Firing rate distributions and efficiency of information transmission of inferior temporal cortex neurons to natural visual stimuli. Neural Comput., 11, 601–632.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Van Rullen, R., & Thorpe, S. J. (2001). Rate coding versus temporal order coding: What the retinal ganglion cells tell the visual cortex. Neural Comput., 13, 1255–1284.

Received February 4, 2003; accepted July 10, 2003.
NOTE
Communicated by Rajesh P. Rao
Kalman Filter Control Embedded into the Reinforcement Learning Framework

István Szita
[email protected]
András Lőrincz
[email protected]
Department of Information Systems, Eötvös Loránd University, Pázmány Péter sétány 1/C, H-1117 Budapest, Hungary
There is growing interest in using Kalman filter models in brain modeling. The question arises whether Kalman filter models can be used on-line not only for estimation but also for control. The usual method of optimal control for the Kalman filter makes use of an off-line backward recursion, which is not satisfactory for this purpose. Here, it is shown that a slight modification of the linear-quadratic-gaussian Kalman filter model allows the on-line estimation of optimal control by using reinforcement learning, overcoming this difficulty. Moreover, the emerging learning rule for value estimation has a Hebbian form weighted by the error of the value estimation.
1 Motivation

Kalman filters and their various extensions are well-studied and widely applied tools in both state estimation and control. Recently, there has been increasing interest in Kalman filters (KF) or Kalman filter-like structures as models for neurobiological substrates. It has been suggested that Kalman filtering may occur in sensory processing (Rao & Ballard, 1997, 1999) and in the acquisition of behavioral responses (Kakade & Dayan, 2000), may be the underlying computation of the hippocampus (Bousquet, Balakrishnan, & Honavar, 1998), and may be the underlying principle in control architectures (Todorov & Jordan, 2002a, 2002b). Detailed architectural similarities between the Kalman filter and the entorhinal-hippocampal loop, as well as between Kalman filters and the neocortical hierarchy, have been described recently (Lőrincz & Buzsáki, 2000; Lőrincz, Szatmáry, & Szirtes, 2002). The interplay between the dynamics of Kalman filter-like architectures and the learning of parameters of neuronal networks is promising for explaining known and puzzling phenomena such as priming, repetition suppression, and categorization (Lőrincz, Szirtes, Takács, Biederman, & Vogels, 2002; Kéri et al., 2002).

Neural Computation 16, 491–499 (2004)
© 2004 Massachusetts Institute of Technology
It is well known that Kalman filters provide an on-line estimate of the state of the system. On the other hand, optimal control typically cannot be computed on-line, because it is given by a backward recursion, the Riccati equations. (For on-line parameter estimation of Kalman filters, but without control aspects, see Rao, 1999.) The aim of this article is to demonstrate that Kalman filters can be embedded into a goal-oriented framework. We show that a slight modification of the linear-quadratic-gaussian (LQG) Kalman filter model suffices to integrate the LQG model into the reinforcement learning (RL) framework.

2 The Kalman Filter and the LQG Model

Consider a linear dynamical system with state x_t ∈ R^n, control u_t ∈ R^m, observation y_t ∈ R^k, and noises w_t ∈ R^n and e_t ∈ R^k (assumed to be uncorrelated gaussians with covariance matrices Ω_w and Ω_e, respectively) in discrete time t:

x_{t+1} = F x_t + G u_t + w_t,   (2.1)
y_t = H x_t + e_t.   (2.2)

Assume that the initial state has mean \hat{x}_1 and covariance Σ_1, and that executing control step u_t in state x_t costs

c(x_t, u_t) := x_t^T Q x_t + u_t^T R u_t.   (2.3)
Further, assume that after the Nth step, the controller halts and receives a final cost of x_N^T Q_N x_N. The task is to find a control sequence with minimum total cost.¹ This problem has the well-known solution

\hat{x}_{t+1} = F \hat{x}_t + G u_t + K_t (y_t - H \hat{x}_t)   (2.4)
K_t = F Σ_t H^T (H Σ_t H^T + Ω_e)^{-1}   (2.5)
Σ_{t+1} = Ω_w + F Σ_t F^T - K_t H Σ_t F^T   (2.6)

(state estimation) and

u_t = -L_t \hat{x}_t   (2.7)
L_t = (G^T S_{t+1} G + R)^{-1} G^T S_{t+1} F   (2.8)
S_t = Q_t + F^T S_{t+1} F - F^T S_{t+1} G L_t   (2.9)

(optimal control). Unfortunately, the optimal control equations are not on-line, because they can be solved only by stepping backward from the final (i.e., the Nth) step.

¹ In a more general setting, c is a general quadratic function of x_t, u_t, and an optional x_t^{track}, where x_t^{track} is a trajectory of a linear system to be tracked.
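As a concrete illustration of why this classical solution is off-line, the following sketch (ours; the variable names and the toy scalar example are our choices) implements the backward Riccati recursion of equations 2.8 and 2.9 together with a single forward filtering step of equations 2.4 to 2.6:

```python
import numpy as np

def lqr_gains(F, G, Q, R, QN, N):
    """Backward Riccati recursion, eqs. 2.8-2.9; returns [L_0, ..., L_{N-1}].
    The whole backward pass must finish before the first control is applied."""
    S, gains = QN.copy(), []
    for _ in range(N):
        L = np.linalg.solve(G.T @ S @ G + R, G.T @ S @ F)   # eq. 2.8
        S = Q + F.T @ S @ F - F.T @ S @ G @ L               # eq. 2.9
        gains.append(L)
    return gains[::-1]

def kalman_step(xhat, Sigma, u, y, F, G, H, Om_w, Om_e):
    """One forward step of the state estimator, eqs. 2.4-2.6."""
    K = F @ Sigma @ H.T @ np.linalg.inv(H @ Sigma @ H.T + Om_e)   # eq. 2.5
    xhat_new = F @ xhat + G @ u + K @ (y - H @ xhat)              # eq. 2.4
    Sigma_new = Om_w + F @ Sigma @ F.T - K @ H @ Sigma @ F.T      # eq. 2.6
    return xhat_new, Sigma_new

# scalar illustration (the numbers are ours, not from the article):
I = np.eye(1)
Ls = lqr_gains(F=I, G=I, Q=I, R=I, QN=I, N=50)
# far from the horizon, Ls[0] approaches the stationary gain (sqrt(5)-1)/2
```

The backward loop is exactly the dependence on the final step N that the next section removes by randomizing the stopping time.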
3 Integrating Kalman Filtering into the Reinforcement Learning Framework

First, we slightly modify the problem: the run time of the controller will not be a fixed number N. Instead, after each time step, the process will be stopped with some fixed probability p (and then the controller incurs the final cost c_f(x_f) := x_f^T Q_f x_f). This modification is commonly used in the RL literature; it makes the problem more amenable to mathematical treatment.

3.1 The Cost-to-Go Function. Let V_t^*(x) be the optimal cost-to-go function at time step t,

V_t^*(x) := \inf_{u_t, u_{t+1}, \ldots} E[c(x_t, u_t) + c(x_{t+1}, u_{t+1}) + \cdots + c_f(x_f) \mid x_t = x].   (3.1)

Considering that the controller is stopped with probability p, equation 3.1 assumes the following form,

V_t^*(x) = p \cdot c_f(x) + (1-p) \inf_u \left( c(x, u) + E_w[V_{t+1}^*(Fx + Gu + w)] \right),   (3.2)

for any state x. It is easy to show that the optimal cost-to-go function is time independent and is a quadratic function of x. That is, it assumes the form

V^*(x) = x^T \Pi^* x.   (3.3)
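Since equation 3.2 is a fixed-point equation for Π*, the optimal matrix can also be obtained off-line by iterating the corresponding Bellman operator. The sketch below (ours, for reference only) does this, dropping the additive constant contributed by the noise covariance, since it does not affect the greedy action:

```python
import numpy as np

def optimal_pi(F, G, Q, R, Qf, p, iters=200):
    """Iterate the Bellman operator implied by eq. 3.2 for V(x) = x^T Pi x.
    The additive constant tr(Pi * Om_w) from the noise is dropped; it does
    not change the minimizing control."""
    Pi = np.zeros_like(Q)
    for _ in range(iters):
        M = np.linalg.solve(R + G.T @ Pi @ G, G.T @ Pi @ F)  # greedy gain
        Pi = p * Qf + (1 - p) * (Q + F.T @ Pi @ F - F.T @ Pi @ G @ M)
    return Pi

# scalar example with illustrative values F = G = Q = R = Qf = 1, p = 0.1:
I = np.eye(1)
Pi_star = optimal_pi(I, I, I, I, I, 0.1)
# the fixed point solves Pi^2 - 0.9 Pi - 1 = 0, i.e., Pi is about 1.5466
```

The on-line methods that follow estimate the same Π* without knowing when the process will stop.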
Our task is to estimate the optimal value function (parameter matrix Π*) on-line. This can be done by the method of temporal differences. We start with an arbitrary initial cost-to-go function V_0(x) = x^T Π_0 x. After this, (1) control actions are selected according to the current value function estimate, (2) the value function is updated according to the experience, and (3) these two steps are iterated. The tth estimate of V* is V_t(x) = x^T Π_t x. The greedy control action according to this estimate is given by

u_t = \arg\min_u \left( E[c(x_t, u) + V_t(Fx_t + Gu + w)] \right)
    = \arg\min_u \left( u^T R u + (F\hat{x}_t + Gu)^T \Pi_t (F\hat{x}_t + Gu) \right)
    = -(R + G^T \Pi_t G)^{-1} (G^T \Pi_t F) \hat{x}_t.   (3.4)

For simplicity, the cost-to-go function will be updated by using the one-step temporal differencing (TD) method, although this could be replaced by more sophisticated methods like multistep TD or eligibility traces. The TD error is

δ_t = \begin{cases} c_f(\hat{x}_t) - V_t(\hat{x}_{t-1}), & \text{if } t = t_{STOP}; \\ (c(\hat{x}_{t-1}, u_{t-1}) + V_t(\hat{x}_t)) - V_t(\hat{x}_{t-1}), & \text{otherwise}; \end{cases}   (3.5)
and the update rule for the parameter matrix Π_t is

Π_{t+1} = Π_t + α_t \cdot δ_t \cdot \nabla_{Π_t} V_t(\hat{x}_t) = Π_t + α_t \cdot δ_t \cdot \hat{x}_t \hat{x}_t^T,   (3.6)

where α_t is the learning rate. Note that a Hebbian learning rule weighted by the value-estimation error has emerged.

3.2 Sarsa. The cost-to-go function is used to select control actions, so the estimation of the action-value function Q_t^*(x, u) is more appropriate here. The action-value function is defined as

Q_t^*(x, u) := \inf_{u_{t+1}, u_{t+2}, \ldots} E[c(x_t, u_t) + c(x_{t+1}, u_{t+1}) + \cdots + c_f(x_f) \mid x_t = x, u_t = u],   (3.7)
and analogously to V_t^*, it can be shown that it is time independent and assumes the form

Q^*(x, u) = \begin{pmatrix} x^T & u^T \end{pmatrix} \begin{pmatrix} Θ_{11}^* & Θ_{12}^* \\ Θ_{21}^* & Θ_{22}^* \end{pmatrix} \begin{pmatrix} x \\ u \end{pmatrix} = \begin{pmatrix} x^T & u^T \end{pmatrix} Θ^* \begin{pmatrix} x \\ u \end{pmatrix}.   (3.8)

If the tth estimate of Q^* is Q_t(x, u) = [x^T, u^T] Θ_t [x^T, u^T]^T, then the greedy control action is given by

u_t = \arg\min_u E(Q_t(x_t, u)) = -Θ_{22}^{-1} Θ_{21} \hat{x}_t,   (3.9)

where the subscript t of Θ has been omitted to improve readability. Note that the value function estimate (but no model) is needed to compute u_t. The estimation error and the weight update are quite similar to the state-value case:

δ_t = \begin{cases} c_f(\hat{x}_t) - Q_t(\hat{x}_{t-1}, u_{t-1}), & \text{if } t = t_{STOP}; \\ (c(\hat{x}_{t-1}, u_{t-1}) + Q_t(\hat{x}_t, u_t)) - Q_t(\hat{x}_{t-1}, u_{t-1}), & \text{otherwise}; \end{cases}   (3.10)

Θ_{t+1} = Θ_t + α_t \cdot δ_t \cdot \nabla_{Θ_t} Q_t(\hat{x}_t, u_t) = Θ_t + α_t \cdot δ_t \cdot \begin{pmatrix} \hat{x}_t \\ u_t \end{pmatrix} \begin{pmatrix} \hat{x}_t \\ u_t \end{pmatrix}^T.   (3.11)
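A minimal on-line realization of equations 3.9 to 3.11 can be sketched as follows (our illustration, with two simplifications that are ours, not the article's: the state is taken as fully observed, so the estimate equals the state, and gaussian exploration noise is added to the greedy action):

```python
import numpy as np

def sarsa_lqg(F, G, Q, R, Qf, p, episodes=500, alpha=1e-3, seed=3):
    """On-line Sarsa sketch for the LQG problem, eqs. 3.9-3.11.
    Simplifications (ours): fully observed state (x_hat = x) and
    gaussian exploration noise on the greedy action."""
    rng = np.random.default_rng(seed)
    n, m = F.shape[0], G.shape[1]
    Theta = 2.0 * np.eye(n + m)                  # Theta_0

    def greedy(x, Theta):
        # eq. 3.9: u = -Theta_22^{-1} Theta_21 x
        return -np.linalg.solve(Theta[n:, n:], Theta[n:, :n] @ x)

    for _ in range(episodes):
        x = rng.standard_normal(n)
        u = greedy(x, Theta) + 0.1 * rng.standard_normal(m)
        while True:
            z = np.concatenate([x, u])
            q = z @ Theta @ z                                # Q_t(x, u)
            x2 = F @ x + G @ u + 0.05 * rng.standard_normal(n)
            if rng.random() < p:                             # t = t_STOP
                delta = x2 @ Qf @ x2 - q                     # eq. 3.10
                Theta += alpha * delta * np.outer(z, z)      # eq. 3.11
                break
            u2 = greedy(x2, Theta) + 0.1 * rng.standard_normal(m)
            z2 = np.concatenate([x2, u2])
            delta = x @ Q @ x + u @ R @ u + z2 @ Theta @ z2 - q  # eq. 3.10
            Theta += alpha * delta * np.outer(z, z)          # eq. 3.11
            x, u = x2, u2
    return Theta

I = np.eye(1)
Theta = sarsa_lqg(I, I, I, I, I, p=0.1)
```

Because every update is a symmetric rank-one outer product, Θ_t stays symmetric throughout; the Hebbian character of the rule is the product of the value-estimation error δ_t and the activity outer product.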
3.3 Convergence. There are numerous results showing that simple RL algorithms with linear function approximation can diverge (see, e.g., Baird, 1995). There are also positive results dealing with the constant policy case (i.e., with policy evaluation) and with iterative policy improvements (Tsitsiklis & Van Roy, 1996). For our case, a convergence proof can be provided (the complete proof can be found in Szita & Lőrincz, 2003).
1. One can show that an appropriate form of the separation principle holds for value estimation. That is, using the state estimates \hat{x}_t for computing the control is equivalent to using the exact states.

2. One can apply Gordon's (2001) results, which state that under appropriate conditions, TD and Sarsa control with linear function approximation cannot diverge.

3. In our problem, where the system is linear and the costs are quadratic, Gordon's (2001) technique yields the stronger result:

Theorem. Suppose the system (F, H) is observable; Π_0 ≥ Π^* (or Θ_0 ≥ Θ^* in the case of Sarsa); there exists an L such that ‖F + GL‖ ≤ 1/\sqrt{1-p}; there exists an M such that ‖\hat{x}_t‖ ≤ M for all t; and the constants α_t satisfy 0 < α_t < 1/M^4, \sum_t α_t = ∞, and \sum_t α_t^2 < ∞. Then the combined TD-KF methods converge to the optimal policy with probability 1 for both state-value and action-value estimations.

4 Discussion

4.1 Possible Extensions. We have demonstrated that Kalman filtering can be integrated into the framework of reinforcement learning. There are numerous possibilities for extending this scheme on both sides; both KF and RL can be extended. We list some of these possibilities:

• Advanced RL algorithms. The one-step TD method can be replaced by more efficient algorithms like TD(λ) (eligibility traces), Sarsa(λ), and Q-learning.

• Parameter estimation. In its present form, the algorithm needs a model of the system (i.e., the parameters F, G, and H). However, with Q-learning, reinforcement learning may not require any model. In turn, the algorithm can be augmented by standard KF parameter estimation techniques like expectation maximization (see, e.g., Rao, 1999) and by information maximization methods (Lőrincz, Szatmáry, & Szirtes, 2002). As a result, an on-line, model-free control algorithm for the linear quadratic regulation problem can be obtained.

• Extended Kalman filtering. Kalman filtering makes the assumption that both the system and the observations are linear. This restriction can be overcome by using extended Kalman filters (EKF), adding further potential to our approach.

• Kalman smoother. To obtain more accurate and less noisy state estimates, we could use Kalman smoothing instead of the filtering equation. One-step smoothing does not impose much additional computational difficulty, because one-step lookahead is needed anyway for the temporal differencing method.
• More general quadratic and nonquadratic cost functions. One is not restricted to the simplest quadratic loss functions. The KF equations are independent of the costs, and RL can handle arbitrary costs. Moreover, in many cases, costs can be rewritten in the form of a linear function approximation (see, e.g., Bradtke, 1993), and convergence is then warranted.

4.2 Related Work in Reinforcement Learning. Reinforcement learning has been applied to the linear quadratic regulation problem (Landelius & Knutsson, 1996; Bradtke, 1993; ten Hagen & Kröse, 1998). The main differences are that these works assume fully observed systems and that either Q-learning or recursive least-squares methods were used as the RL component. It is important that our approach can be interpreted as a partially observed Markov decision process (POMDP) with continuous state and action spaces. In general, learning in POMDPs is a very difficult task (see Murphy, 2000, for a review). We consider the LQG approach attractive because it is a POMDP and yet can be handled efficiently owing to its particular properties: both the transitions and the observations are linear, and the uncertainties assume gaussian form.

4.3 Neurobiological Connections. The modeling of neural conditioning by TD learning is not new: a series of papers have been published on this topic (e.g., Montague, Dayan, & Sejnowski, 1996; Schultz, Dayan, & Montague, 1997). In these works, it is typically assumed that states can be fully observed. Recently, Daw, Courville, and Touretzky (in press) proposed partially observable semi-Markov processes as an underlying model to deal with nonobservability.

The motivation for our work comes from neurobiology. Some novel theoretical works in neurobiology (Rao & Ballard, 1997; Lőrincz & Buzsáki, 2000; Lőrincz, Szatmáry, & Szirtes, 2002; Todorov & Jordan, 2002a) claim that both sensory processing and control may be based on Kalman filter-like structures.
Notably, some odd predictions (Lőrincz & Buzsáki, 2000; Lőrincz, Szatmáry, & Szirtes, 2002) have been reinforced recently:

• Properties of neurons corresponding to the internal representation of the Kalman filter (layers V and VI of the entorhinal cortex, EC) versus properties of neurons corresponding to the reconstruction error and the filtered input (EC layers II and III), respectively

• Adaptable long-delay properties of neurons, with the putative role of eliminating the temporal convolutions that arise in Kalman filter-like structures with not-yet-tuned parameters

Both have been confirmed by experiments reported in Egorov, Hamam, Fransén, Hasselmo, and Alonso (2002) and in Henze, Wittner, and Buzsáki (2002), respectively.
Clearly, the brain is a goal-oriented system, and Kalman filters need to be embedded into a goal-oriented framework. Moreover, this framework should exhibit Hebbian learning properties. Further, although batch learning does have some plausibility in neurobiological models, for example, hippocampal replays of time sequences (Skaggs & McNaughton, 1996; Nadasdy, Hirase, Czurko, Csicsvari, & Buzsáki, 1999), the extent of such batch learning should be limited, given that the environment is subject to changes. In turn, our work reinforces the efforts dealing with the Kalman filter description of neocortical processing. From the point of view of parameter estimation, the Kalman filter seems to be an ideal neurobiological candidate (Lőrincz & Buzsáki, 2000; Lőrincz, Szatmáry, & Szirtes, 2002). On-line estimation of the Kalman gain, however, remains a problem. (But see Póczos & Lőrincz, 2003.)

It has been noted that smoothing should improve the efficiency of the algorithm. Although the importance of smoothing and of switching between smooth solutions is striking in the neurobiological context, it remains an open and intriguing issue whether and how the neurobiological framework may allow the integration of smoothing.

5 Conclusions

Recent research in theoretical neurobiology indicates that Kalman filters have intriguing potential in brain modeling. However, the classical method for Kalman filter control requires off-line computations, which are unlikely to take place in the brain. Our aim here was to embed Kalman filters into the reinforcement learning framework. This was achieved by applying the method of temporal differences. Theoretical results of reinforcement learning were used to establish the asymptotic optimality of the algorithm.
Although our algorithm is only asymptotically optimal and may be slower than classical approaches like solving the Riccati recursions (commonly used in engineering applications), it has several advantages: it works on-line, no model is needed for computing the control law, and the learning is Hebbian, weighted by the error of the value estimation. These properties make it an attractive approach for brain modeling and neurobiological applications. Furthermore, the algorithm described here admits several straightforward generalizations, including extended Kalman filters, parameter estimation, nonquadratic cost functions, and eligibility traces, which can extend its applicability and improve its performance.

Acknowledgments

We are grateful to the anonymous referees who called our attention to recent work on the topic. Their suggestions helped us to improve this note. This work was supported by the Hungarian National Science Foundation (Grant No. T-32487).
References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning (pp. 30–37). San Francisco: Morgan Kaufmann.
Bousquet, O., Balakrishnan, K., & Honavar, V. (1998). Is the hippocampus a Kalman filter? In Proceedings of the Pacific Symposium on Biocomputing (pp. 655–666). Singapore: World Scientific.
Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 295–302). San Mateo, CA: Morgan Kaufmann.
Daw, N., Courville, A., & Touretzky, D. S. (in press). Timing and partial observability in the dopamine system. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Egorov, A., Hamam, B., Fransén, E., Hasselmo, M., & Alonso, A. (2002). Graded persistent activity in entorhinal cortex neurons. Nature, 420, 173–178.
Gordon, G. J. (2001). Reinforcement learning with function approximation converges to a region. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 1040–1046). Cambridge, MA: MIT Press.
Henze, D., Wittner, L., & Buzsáki, G. (2002). Single granule cells reliably discharge targets in the hippocampal CA3 network in vivo. Nature Neuroscience, 5, 790–795.
Kakade, S., & Dayan, P. (2000). Acquisition in autoshaping. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 24–30). Cambridge, MA: MIT Press.
Kéri, S., Benedek, G., Janka, Z., Aszalós, P., Szatmáry, B., Szirtes, G., & Lőrincz, A. (2002). Categories, prototypes and memory systems in Alzheimer's disease. Trends in Cognitive Science, 6, 132–136.
Landelius, T., & Knutsson, H. (1996). Greedy adaptive critics for LQR problems: Convergence proofs (Tech. Rep. No. LiTH-ISY-R-1896). Linköping, Sweden: Computer Vision Laboratory.
Lőrincz, A., & Buzsáki, G. (2000). The parahippocampal region: Implications for neurological and psychiatric diseases. In H. Scharfman, M. Witter, & R. Schwarz (Eds.), Annals of the New York Academy of Sciences (Vol. 911, pp. 83–111). New York: New York Academy of Sciences.
Lőrincz, A., Szatmáry, B., & Szirtes, G. (2002). Mystery of structure and function of sensory processing areas of the neocortex: A resolution. J. Comp. Neurosci., 13, 187–205.
Lőrincz, A., Szirtes, G., Takács, B., Biederman, I., & Vogels, R. (2002). Relating priming and repetition suppression. Int. J. Neural Systems, 12, 187–202.
Montague, P., Dayan, P., & Sejnowski, T. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. Journal of Neuroscience, 16(5), 1936–1947.
Kalman Filter Control
499
Murphy, K. P. (2000). A survey of POMDP solution techniques. Available on-line at: http://www.ai.mit.edu/murphyk/Papers/pomdp.ps.gz. ˜ Nadasdy, Z., Hirase, H., Czurko, A., Csicsvari, J., & Buzs´aki, G. (1999). Replay and time compression of recurring spike sequences in the hippocampus. Journal of Neuroscience, 19, 9497–9507. Poczos, ´ B., & LÍorincz, A. (2003). Kalman-ltering using local interactions(Tech. Rep. No. NIPG-ELU-28-02-2003).Budapest: Faculty of Informatics, Eotv ¨ os ¨ Lor´and University. Available on-line at: http://arxiv.org/abs/cs.AI/0302039. Rao, R. (1999). An optimal estimation approach to visual perception and learning. Vision Research, 39, 1963–1989. Rao, R., & Ballard, D. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9, 721– 763. Rao, R., & Ballard, D. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-eld effects. Nature Neuroscience, 2, 79–87. Schultz, W., Dayan, P., & Montague, P. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599. Skaggs, W., & McNaughton, B. (1996). Replay of neuronal ring sequences in rat hippocampus during sleep following spatial experience. Science, 271, 1870–1873. Szita, I., & LÍorincz, A. (2003). Reinforcement learning with linear function approximation and LQ control coverves. (Tech. Rep. No. NIPG-ELU-22-06-2003). Budapest: Faculty of Informatics, Eotv ¨ os ¨ Lor´and University. Available on-line at: http://arvis.org/abs/cs.LG/0306120. ten Hagen, S., & Krose, ¨ B. (1998). Linear quadratic regulation using reinforcement learning. In F. Verdenius & W. van den Broek (Eds.), Proceedings of the 8th Belgian-Dutch Conf. on Machine Learning (pp. 39–46). Wageningen: BENELEARN-98. Todorov, E., & Jordan, M. (2002a). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5, 1226–1235. Todorov, E., & Jordan, M. (2002b). 
Supplementary notes for optimal feedback control as a theory of motor coordination. Available on-line at: http://www.nature. com/neuro/supplements/. Tsitsiklis, J. N., & Van Roy, B. (1996). An analysis of temporal-difference learning with function approximation (Tech. Rep. No. LIDS-P-2322). Cambridge, MA: MIT. Received December 16, 2002; accepted July 15, 2003.
LETTER
Communicated by Richard Zemel
Rapid Processing and Unsupervised Learning in a Model of the Cortical Macrocolumn

Jörg Lücke
[email protected]

Christoph von der Malsburg
[email protected]

Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
We study a model of the cortical macrocolumn consisting of a collection of inhibitorily coupled minicolumns. The proposed system overcomes several severe deficits of systems based on single neurons as cerebral functional units, notably limited robustness to damage and unrealistically large computation time. Motivated by neuroanatomical and neurophysiological findings, the utilized dynamics is based on a simple model of a spiking neuron with refractory period, fixed random excitatory interconnection within minicolumns, and instantaneous inhibition within one macrocolumn. A stability analysis of the system's dynamical equations shows that minicolumns can act as monolithic functional units for purposes of critical, fast decisions and learning. Oscillating inhibition (in the gamma frequency range) leads to a phase-coupled population rate code and high sensitivity to small imbalances in minicolumn inputs. Minicolumns are shown to be able to organize their collective inputs without supervision by Hebbian plasticity into selective receptive field shapes, thereby becoming classifiers for input patterns. Using the bars test, we critically compare our system's performance with that of others and demonstrate its ability for distributed neural coding.

1 Introduction

Simulations of artificial neural networks (ANNs) are a standard way to study neural information processing. Although a large amount of data about biological neural networks is available, there remain uncertainties regarding the way in which neurons process incoming action potentials, the way the neurons are interconnected, and the way in which interconnections change dynamically over time. These uncertainties have generated a broad variety of different models of neural networks.
They are based on different assumptions about connectivity (e.g., feedforward or symmetrically interconnected), neuron models (e.g., McCulloch-Pitts, integrate-and-fire, Hodgkin-Huxley), and different modification rules for synaptic weight changes (e.g., Hebbian
Neural Computation 16, 501–533 (2004)
© 2004 Massachusetts Institute of Technology
learning, backpropagation). ANNs like the Hopfield network (Hopfield, 1982; Hopfield & Tank, 1986) or perceptrons afford deep functional insight on the basis of mathematical analysis, which allowed the networks to be successful in various technical applications and influenced our views on learning and information processing in biological neural networks significantly. By now, it has become obvious, however, that the classic ANNs fall short in modeling the generalization abilities or computation times of biological networks. Many important reactions in the brain take place in times so short that individual neurons have time to transmit only very few or just a single action potential (see Nowak & Bullier, 1997, and Thorpe, Fize, & Marlot, 1996, for reaction times in the visual system). If graded signals are to be processed, models based on a single-neuron rate code fail to model the measured reaction times. Further, most ANNs do not reflect biologically plausible connectivity because they were motivated by the view that biological information processing is continuously distributed over the cortical surface or that information is processed strictly feedforward through layers of equal neurons. In the last two or three decades, a large amount of anatomical and physiological data has been accumulated, suggesting that the cortex is hierarchically organized in cellular columns as principal building blocks (see Mountcastle, 1997, and Buxhoeveden & Casanova, 2002, for overviews and Jones, 2000, for a critical discussion). Columnar organization is advantageous (1) with respect to implementation of a neural population rate code able to overcome the computational speed limitations of single-neuron rate codes and (2) with respect to connectivity and robustness. With evolutionary growth of the brain, individual building blocks had to connect to more and more other elements.
Groups of neurons can support many more connections than individual neurons, and a network based on neural columns as principal units can be expected to be much more robust against the loss of connections or dropout of neurons. In the cerebral cortex of mammals, neural columns can be identified on different scales. The minicolumn is believed to be the smallest neural entity, consisting of several tens up to a few hundred neurons, which are stacked orthogonally to the cortical surface (Peters & Yilmaze, 1993). The minicolumns themselves combine to what is called a macrocolumn or segregate (Favorov & Diamond, 1990; see Mountcastle, 1997, for an overview). Like minicolumns, macrocolumns can be identified both anatomically and physiologically (Favorov & Diamond, 1990; Elston & Rosa, 2000; Lübke, Egger, Sakmann, & Feldmeyer, 2000) and are shown to process stimuli from the same source, such as an area of the visual field or a patch of a body surface (Favorov & Whitsel, 1988; Favorov & Diamond, 1990). In the primary somatosensory cortex of the cat, macrocolumns were found to contain approximately 80 minicolumns. Although mini- and macrocolumns are best studied in primary sensory areas, they are found in higher cortical areas as well (Peters, Cifuentes, & Sethares, 1997; Constantinidis, Franowicz, & Goldman-Rakic, 2001) and are believed to represent the basic building
blocks of all areas of cortices of higher vertebrates (Mountcastle, 1997; Buxhoeveden & Casanova, 2002). The main part of a minicolumn is a collection of excitatory cells grouped around bundles of dendrites (Peters & Yilmaze, 1993) and axons (Peters & Sethares, 1996). Together with physiological findings (Thomson & Deuchars, 1994), this suggests that the excitatory cells of a minicolumn are mutually strongly interconnected (Mountcastle, 1997; Buxhoeveden & Casanova, 2002). For inhibitory feedback, double-bouquet cells and basket (clutch) cells play a central role (DeFelipe, Hendry, & Jones, 1989; DeFelipe, Hendry, Hashikawa, Molinari, & Jones, 1990; Peters & Sethares, 1997; Budd & Kisvarday, 2001). Dendritic branch and axonal field analysis suggests that the inhibitory cells are stimulated by the activities within the excitatory cells of their minicolumn and project back to a number of minicolumns in their neighborhood. In this article, we study a model of the macrocolumn motivated by the above findings. We model the macrocolumn as a collection of inhibitorily coupled minicolumns that consist of a collection of randomly interconnected excitatory neurons. The excitatory neurons are modeled explicitly. The neuron model is a very abstract one, but it takes into account the neurons' spiking character. The assumptions made allow for a detailed mathematical analysis that captures the basic properties of the neuron dynamics. It turns out that the spiking character of the neurons, in combination with the columnar interconnection structure and fast inhibitory feedback, leads to a dynamics with ideal properties for computing input mediated by afferent fibers to the macrocolumn. We will show that in the absence of input, the macrocolumn dynamics possesses stationary points (i.e., states of ongoing neural activity). The number of these depends exponentially on the number of minicolumns.
The stability of the stationary points is controlled by a single parameter of inhibition and changes at a single critical value of this parameter. The system is operated by letting the inhibition parameter oscillate about its critical value. An isolated macrocolumn is hereby successively forced to spontaneously break the symmetry between alternate stationary points. If the dynamics is weakly coupled to input by afferent fibers that are subject to Hebbian plasticity, a self-organization of minicolumnar receptive fields (RFs) is induced. The self-organization turns the macrocolumn into a decision unit with respect to the input. The result of a decision is identified with the active state of the macrocolumn at the maximum of the inhibitory oscillation (or ν-cycle). Possible ν-cycle periods may be very short, suggesting cortical oscillations in the γ-frequency range as their biological correlates and allowing very rapid functional decisions. The macrocolumn model will be shown to be able to classify input patterns and extract basic features from the input. The latter capability is demonstrated using the bars test (Földiák, 1990), and it can be shown that the presented model is competitive with all other systems able to pass the test. During the bars test, a system has to learn the basic components of input patterns made up of superpositions of these components. In the superpositions, the components do not add up linearly, which presents an additional difficulty. Systems passing the test must be able to represent an input pattern by exploiting constituent combinatorics. The passing of the bars test can be seen as a prerequisite for systems that are intended to handle large and varying natural input because natural input is expected to be most efficiently represented by combining elementary features. The presented network owes its computational abilities to the properties of the internal macrocolumnar neuron dynamics, which itself is emergent from the columnar interconnection structure, the spiking nature of neurons, and background oscillation. The network reflects, on the one hand, major properties of biological neural information processing and is, on the other hand, competitive with a class of systems that focus on functional performance in tasks of basic feature extraction. In combining these two aspects, the system distinguishes itself from all other column-based systems (Fukai, 1994; Favorov & Kelly, 1996; Somers, Nelson, & Sur, 1995; Fransen & Lansner, 1998; Lao, Favorov, & Lu, 2001) and from systems whose ability to extract basic input constituents was demonstrated using the bars benchmark test (Földiák, 1990; Saund, 1995; Dayan & Zemel, 1995; Marshall, 1995; Hinton, Dayan, Frey, & Neal, 1995; Harpur & Prager, 1996; Frey, Dayan, & Hinton, 1997; Hinton & Ghahramani, 1997; Fyfe, 1997; Charles & Fyfe, 1998; Hochreiter & Schmidhuber, 1999; O'Reilly, 2001; Spratling & Johnson, 2002). In section 2, we define the macrocolumn model and analyze its dynamical properties. In section 3, we introduce Hebbian plasticity of afferents to the macrocolumn and study the resulting input-driven self-organization of the minicolumns' RFs. In section 4, the computational abilities of the system are systematically studied for the problem of pattern classification and for the bars problem.
In section 5, we discuss the system's general properties in comparison with systems that have been applied to the bars problem and discuss our system's relation to neuroscience.

2 Neural Dynamics of the Macrocolumn

We first define and analyze the dynamics of a model of a single minicolumn and then proceed by studying the dynamical properties of the macrocolumn as a set of inhibitorily coupled minicolumns.

2.1 Model of the Minicolumn. We take a minicolumn to consist of excitatory neurons that are randomly interconnected, as motivated by the above-mentioned findings (see Mountcastle, 1997, and Buxhoeveden & Casanova, 2002, for review). The excitatory neurons are modeled as McCulloch-Pitts neurons with a refractory period of one time step. The dynamics of a minicolumn consisting of m neurons is described by the following set of difference
equations (i = 1, ..., m):

n_i(t+1) = H\Big(\sum_{j=1}^{m} T_{ij}\, n_j(t) - \Theta\Big) \cdot \underbrace{H(1 - n_i(t))}_{\text{refraction}}, \qquad H(x) := \begin{cases} 0 & \text{if } x \le 0 \\ 1 & \text{if } x > 0 \end{cases} \quad (2.1)
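Update rule 2.1 can be simulated directly once an interconnection matrix and a threshold are fixed. The following sketch is illustrative only (the random connectivity draw, the values of m, s, and the constant threshold are assumptions for the demonstration; the inhibitory threshold dynamics of equation 2.5 is introduced only later in the text). It verifies the refraction property that no neuron is active in two consecutive time steps:

```python
import random

def step(n, T, theta):
    # eq. 2.1: n_i(t+1) = H(sum_j T_ij n_j(t) - theta) * H(1 - n_i(t)),  H(x) = 1 iff x > 0
    m = len(n)
    return [1 if (sum(T[i][j] * n[j] for j in range(m)) > theta and n[i] == 0) else 0
            for i in range(m)]

random.seed(0)
m, s = 100, 15                 # neurons per minicolumn, synapses per neuron (illustrative)
c = 1.0 / s                    # synaptic weight, chosen as in the text
T = [[0.0] * m for _ in range(m)]
for i in range(m):             # each neuron receives s synapses from random presynaptic neurons
    for j in random.sample(range(m), s):
        T[i][j] = c

n = [1 if random.random() < 0.5 else 0 for _ in range(m)]
theta = 0.4                    # constant threshold for this sketch
history = [n]
for t in range(50):
    n = step(n, T, theta)
    history.append(n)

# refraction: a neuron active at t is never active at t+1
refraction_ok = all(not (a[i] and b[i]) for a, b in zip(history, history[1:]) for i in range(m))
```

The refraction factor H(1 - n_i(t)) guarantees this property by construction, which is what drives the alternating population activity analyzed below.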
For the interconnection T_{ij}, we assume that each neuron receives s synapses from other neurons of the minicolumn. We further assume that the dendrites and axons interconnect randomly, such that a natural choice for the probability of receiving a synapse from a given neuron is 1/m (cf. Anninos, Beek, Csermely, Harth, & Pertile, 1970). The synaptic weights we take to be equal to a constant c. Note that for any choice of c, the threshold Θ can be chosen such that the resulting dynamics is the same. Without loss of generality, we therefore choose c = 1/s, such that \frac{1}{m}\sum_{i,j=1}^{m} T_{ij} = 1. To further analyze the dynamics, we describe equations 2.1 in terms of the activation probability of a neuron, p(t), at time t. The probability depends, first, on the number of received inputs and, second, on the probability that the neuron was active in the preceding time step. Due to the interconnection (T_{ij}), the probability f_{bn}(x) of a neuron receiving x nonzero inputs from its presynaptic neurons is given by the binomial distribution

f_{bn}(x) = \binom{s}{x}\, p^x (1-p)^{s-x}. \quad (2.2)
For s ≫ 1, the distribution can be approximated by a gaussian probability density (theorem of Moivre-Laplace) of the form

f_g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-a}{\sigma}\right)^2}, \qquad a = sp, \quad \sigma = \sqrt{sp(1-p)}. \quad (2.3)
The probability p_A(t+1) that a neuron receives enough input to exceed threshold at (t+1) is thus given by the integral of f_g(x) at t over all x > Θ/c = sΘ. The probability p(t+1) that the neuron is activated at time (t+1) further depends on the probability p_B(t+1) that it is not refractory at (t+1). The probability p_B(t+1) is directly given by the complement of the probability that the neuron was active the time step before, p_B(t+1) = (1 - p(t)). To further analyze the dynamics, so-called coherence effects, which are caused by repeating (cycling) neural activity states, are considered. Such effects are a direct consequence of the random but fixed interconnection matrix (T_{ij}) and thus of interdependent neural activation and refraction probabilities. As we will discuss at the end of the section, the coherence effects can be suppressed, for example, by neural threshold noise. If the effects are sufficiently suppressed, we can assume the probabilities p_A(t+1) and p_B(t+1) to
be approximately independent, p(t+1) = p_A(t+1)\, p_B(t+1). The assumption permits a compact description of dynamics 2.1 in terms of the activation probability p(t) (see the appendix for details):

p(t+1) = \Phi_s\!\left( \frac{p(t) - \Theta}{\sqrt{p(t)(1 - p(t))}} \right) (1 - p(t)), \quad (2.4)
where \Phi_s(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\sqrt{s}\,x} e^{-\frac{1}{2}y^2}\, dy is the gaussian error integral parameterized by s. The inhibitory feedback to the minicolumnar activity is modeled indirectly as a rise of the threshold Θ. It is taken to be present already in the next time step and to be equally sensed by all neurons, which can be motivated by the axonal distribution of inhibitory double-bouquet neurons (DeFelipe et al., 1990; Peters & Sethares, 1997). The inhibitory neurons receive input from the excitatory neurons of the minicolumn. The inhibitory feedback we choose to depend linearly on the overall activity, \frac{1}{m}\sum_{i=1}^{m} n_i(t), of the minicolumn,

\Theta = \nu \frac{1}{m}\sum_{i=1}^{m} n_i(t) + \Theta_o = \nu\, p(t) + \Theta_o, \quad (2.5)
where ν is the proportionality factor of inhibition and Θ_o the constant threshold of the neurons. The choice represents a natural approximation of the feedback and allows for a further analytical treatment. Inserting equation 2.5 into 2.4, we get:

p(t+1) = \Phi_s\!\left( \frac{(1-\nu)\, p(t) - \Theta_o}{\sqrt{p(t)(1 - p(t))}} \right) (1 - p(t)). \quad (2.6)
The difference equation, 2.6, can be shown to possess a point of nonzero stable stationary activity for a wide range of parameters s, Θ_o, and ν, given by:

P_\nu^{s,\Theta_o} := \max\left\{ p \,\middle|\, p = \Phi_s\!\left( \frac{(1-\nu)\, p - \Theta_o}{\sqrt{p(1-p)}} \right)(1 - p) \right\}. \quad (2.7)
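P_ν can be found by scanning equation 2.6 for its largest fixed point and refining by bisection; a minimal sketch (the grid resolution and parameter values are illustrative choices):

```python
import math

def phi_s(x, s):
    # Phi_s(x) = (1/sqrt(2 pi)) * Integral_{-inf}^{sqrt(s) x} exp(-y^2 / 2) dy
    return 0.5 * (1.0 + math.erf(math.sqrt(s) * x / math.sqrt(2.0)))

def rhs(p, nu, s, theta_o):
    # right-hand side of eq. 2.6
    return phi_s(((1.0 - nu) * p - theta_o) / math.sqrt(p * (1.0 - p)), s) * (1.0 - p)

def stationary_activity(nu, s=15, theta_o=1.0 / 15):
    # largest p with p = rhs(p), i.e. P_nu of eq. 2.7 (grid scan plus bisection)
    pts = [i / 1000.0 for i in range(1, 1000)]
    root = 0.0
    for a, b in zip(pts, pts[1:]):
        if rhs(a, nu, s, theta_o) - a > 0.0 >= rhs(b, nu, s, theta_o) - b:
            lo, hi = a, b
            for _ in range(60):        # bisection on the bracketed crossing
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if rhs(mid, nu, s, theta_o) - mid > 0.0 else (lo, mid)
            root = 0.5 * (lo + hi)     # keep the largest crossing found
    return root

P = stationary_activity(0.5)
```

Only downward crossings of the diagonal are kept, so the returned root is the stable stationary point rather than the small unstable one near p = 0.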
P_ν^{s,Θ_o} (or P_ν for short) can be numerically determined, and its value is in good agreement with activity probabilities obtained by directly simulating equation 2.1 with inhibition 2.5. The dynamics 2.1 with 2.5 has to be simulated with additional neural threshold noise in order to suppress the coherence effects mentioned above. Fluctuations of the inhibitory feedback caused by a finite number of simulated neurons, m, also contribute to the noise but are, on their own, for a wide range of parameters not sufficient for an appropriate suppression of the effects. Note that the coherence effects are most
efficiently suppressed if the interconnection matrix (T_{ij}) is re-randomized at each time step. In this case, the assumption p(t+1) = p_A(t+1)\, p_B(t+1) and also the computation of p_A(t+1) via the binomial distribution can be adopted from Anninos et al. (1970). There the matter is thoroughly discussed for a dynamics with another type of inhibition, but the essential arguments carry over to our dynamics. Neural threshold noise as an alternative to re-randomization was first described in Lücke, von der Malsburg, and Würtz (2002). In specific simulations with fixed interconnections and threshold noise, we have validated that their behavior closely matches that of simulations with successively re-randomized interconnections, which shows that the analytical results are also applicable for the biologically realistic case of fixed interconnections, as used in this work.

2.2 Model of the Macrocolumn. As motivated by the distribution of synapses of inhibitory neurons (DeFelipe et al., 1989, 1990; Peters & Sethares, 1997; Budd & Kisvarday, 2001), the macrocolumn is modeled as a collection of inhibitorily coupled minicolumns. With the same assumptions as above, the dynamics is given by the following set of k m difference equations,

n_i^\alpha(t+1) = H\Big( \sum_{j=1}^{m} T_{ij}^\alpha\, n_j^\alpha(t) - I(t) - \Theta_o \Big) \cdot H(1 - n_i^\alpha(t)), \quad (2.8)
where α = 1, ..., k counts the minicolumns, i = 1, ..., m the neurons of the minicolumn, and where I(t) denotes a time-dependent inhibition. Note that equation system 2.8 assumes that there are no direct connections between excitatory neurons of the different minicolumns. For a fixed α, the interconnection (T_{ij}^α) is of the same type as in the previous section. By the statistical considerations of section 2.1, we can replace the k m difference equations by a set of k difference equations in terms of activation probabilities p_α of neurons of different minicolumns:

p_\alpha(t+1) = \Phi_s\!\left( \frac{p_\alpha(t) - I(t) - \Theta_o}{\sqrt{p_\alpha(t)(1 - p_\alpha(t))}} \right) (1 - p_\alpha(t)). \quad (2.9)

The inhibitory feedback I(t) is modeled as the maximum of the overall activities in the minicolumns, p_\beta = \frac{1}{m}\sum_{i=1}^{m} n_i^\beta(t),

I(t) = \nu \max_{\beta = 1, ..., k} \{ p_\beta(t) \}, \quad (2.10)
where the maximum operation is assumed to be implemented by the system of inhibitory neurons of the macrocolumn. The maximum operation can be biologically implemented in various ways (Yu, Giese, & Poggio, 2002). Some possibilities are based on shunting inhibition (cf. also Reichardt, Poggio,
& Hausen, 1983), whereas others use subtractive inhibition. On the functional side, inhibition proportional to the maximal minicolumnar activity 2.10 results in a qualitatively different and favorable behavior compared to a dynamics with inhibition proportional to the average activity, as was studied in Lücke et al. (2002). The dynamical difference and its functional implications will be discussed further later in this section.

2.3 Stability Analysis. The dynamical properties of the macrocolumn model can now be studied with a stability analysis of a system of k coupled nonlinear difference equations (α = 1, ..., k):

p_\alpha(t+1) = \Phi_s\!\left( \frac{p_\alpha(t) - \nu \max_\beta \{p_\beta(t)\} - \Theta_o}{\sqrt{p_\alpha(t)(1 - p_\alpha(t))}} \right) (1 - p_\alpha(t)) =: G_\alpha(\vec{p}(t)). \quad (2.11)
First, note that the system possesses the following set of potentially stable stationary points,
Q = \{ \vec{q} \mid \forall i = 1, ..., k : (q_i = 0 \lor q_i = P_\nu) \}, \quad (2.12)
for example, for k = 3: (0, 0, 0), (P_ν, 0, 0), (0, P_ν, 0), ..., (P_ν, P_ν, 0), ..., (P_ν, P_ν, P_ν), where P_ν is given in equation 2.7. The magnitude of Q is |Q| = 2^k. The vector with the smallest norm, (0, 0, 0), will be called q_min, and the vector with the largest norm, (P_ν, P_ν, P_ν), will be called q_max. The set Q without the trivial stationary point q_min will be denoted by Q^+ = Q - {q_min}. To analyze the stability of the stationary points in Q, we first approximate I(\vec{p}) = ν max_β {p_β} with a differentiable function:

\tilde{I}_\rho(\vec{p}) := \nu \left( \sum_{\beta=1}^{k} (p_\beta)^\rho \right)^{\frac{1}{\rho}} \;\Rightarrow\; \lim_{\rho \to \infty} \tilde{I}_\rho(\vec{p}) = I(\vec{p}). \quad (2.13)
A stationary point \vec{p} is stable if and only if the magnitudes of all eigenvalues of the Jacobian of \vec{G}(\vec{p}) (see equation 2.11) are smaller than one. Thanks to the symmetries of equation system 2.11 and to the substitution by \tilde{I}_\rho(\vec{p}), the eigenvalues can be computed in general and are, in the limit ρ → ∞, given by:¹

\lambda_0 = 0,

\lambda_1 = \frac{(1-\nu) + \frac{\Theta_o}{P_\nu}(1 - 2P_\nu)}{2\sqrt{P_\nu(1 - P_\nu)}}\, \Phi_s'(h(\nu)) - \Phi_s(h(\nu)),

¹ Due to the symmetries in the equations, we get, for \vec{q} \in Q, a symmetric Jacobian \vec{G}'(\vec{q}) whose eigenvalues are simple functions of the Jacobian matrix entries. The explicit equations for the eigenvalues can then be obtained by long but straightforward calculations.
\lambda_2 = \frac{1 + \left(\nu + \frac{\Theta_o}{P_\nu}\right)(1 - 2P_\nu)}{2\sqrt{P_\nu(1 - P_\nu)}}\, \Phi_s'(h(\nu)) - \Phi_s(h(\nu)), \qquad \text{where } h(\nu) = \frac{(1-\nu)P_\nu - \Theta_o}{\sqrt{P_\nu(1 - P_\nu)}}.
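The eigenvalue λ_2, as reconstructed here, can be checked numerically: scanning λ_2(ν) for the point where it first exceeds one recovers a critical inhibition close to the ν_c ≈ 0.69 reported in the text for s = 15, Θ_o = 1/s. A sketch (grid resolutions and the scan range are illustrative choices):

```python
import math

def phi_s(x, s):
    # Phi_s(x) = (1/sqrt(2 pi)) * Integral_{-inf}^{sqrt(s) x} exp(-y^2 / 2) dy
    return 0.5 * (1.0 + math.erf(math.sqrt(s) * x / math.sqrt(2.0)))

def dphi_s(x, s):
    # derivative of Phi_s with respect to its argument
    return math.sqrt(s / (2.0 * math.pi)) * math.exp(-0.5 * s * x * x)

def rhs(p, nu, s, theta_o):
    # right-hand side of eq. 2.6
    return phi_s(((1.0 - nu) * p - theta_o) / math.sqrt(p * (1.0 - p)), s) * (1.0 - p)

def stationary_activity(nu, s, theta_o):
    # P_nu, the largest fixed point of eq. 2.6 (grid scan plus bisection, eq. 2.7)
    pts = [i / 1000.0 for i in range(1, 1000)]
    root = 0.0
    for a, b in zip(pts, pts[1:]):
        if rhs(a, nu, s, theta_o) - a > 0.0 >= rhs(b, nu, s, theta_o) - b:
            lo, hi = a, b
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                lo, hi = (mid, hi) if rhs(mid, nu, s, theta_o) - mid > 0.0 else (lo, mid)
            root = 0.5 * (lo + hi)
    return root

def lambda_2(nu, s=15, theta_o=1.0 / 15):
    # eigenvalue controlling stability of states with l >= 2 active columns (reconstruction)
    P = stationary_activity(nu, s, theta_o)
    h = ((1.0 - nu) * P - theta_o) / math.sqrt(P * (1.0 - P))
    num = 1.0 + (nu + theta_o / P) * (1.0 - 2.0 * P)
    return num / (2.0 * math.sqrt(P * (1.0 - P))) * dphi_s(h, s) - phi_s(h, s)

# scan nu for the point where lambda_2 first exceeds one
nu_c = next(n / 1000.0 for n in range(500, 900) if lambda_2(n / 1000.0) > 1.0)
```

The agreement with the quoted value supports the reconstructed formula, but the expression should be taken as a sketch rather than a verbatim copy of the published one.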
If, for a given vector \vec{q} \in Q, l(\vec{q}) is the number of nonzero entries, then λ_0 is of multiplicity (k - l(\vec{q})), λ_1 is of multiplicity 1, and λ_2 is of multiplicity (l(\vec{q}) - 1). For fixed parameters s and Θ_o, the magnitudes of all eigenvalues are smaller than one for ν smaller than a critical value ν_c. For ν > ν_c, λ_2 gets larger than one, which implies that all \vec{q} \in Q with l(\vec{q}) ≥ 2 become unstable. Hence, a set of (2^k - k - 1) stationary points of Q^+ loses stability at the same value ν_c. In Figures 1 and 2, the properties of the dynamics are visualized using bifurcation diagrams. The critical value ν_c can be computed numerically and is, for s = 15 and Θ_o = 1/s, given by ν_c ≈ 0.69. For k = 2, the set Q^+ consists of three nontrivial stationary points, which are all stable
Figure 1: Bifurcation diagram of equation system 2.11 for parameters s = 15, Θ_o = 1/s, and k = 2. The points of Q^+ = {(P_ν, 0), (0, P_ν), (P_ν, P_ν)} are plotted together with the two unstable points of the dynamics. To obtain a two-dimensional bifurcation diagram, the stationary points for given ν are projected onto the one-dimensional space with normal vector \frac{1}{\sqrt{2}}(1, 1). The only stationary point not plotted is q_min because it projects onto the same point as q_max. The solid lines mark stable stationary points, and the dotted lines mark unstable points.
Figure 2: Bifurcation diagram of equation system 2.11 for parameters s = 15, Θ_o = 1/s, and k = 3. For each ν, the stationary points of the three-dimensional phase space are projected onto the plane given by the normal vector \frac{1}{\sqrt{3}}(1, 1, 1). The vectors p_1, p_2, and p_3 are projections of the trihedron of the phase space onto this plane. The only stationary point of the system that is not plotted is q_min because it projects onto the same point as q_max. The unstable stationary points are plotted as dotted lines; the stable stationary points, which are always elements of Q^+, are plotted as solid lines. For ν < ν_c, all elements of Q^+ are stable, but for ν > ν_c, only k = 3 stable points remain. All other stable points lose their stability in subcritical bifurcations at the same value of ν.
for ν < ν_c. Apart from the points in Q^+, there exist two unstable stationary points, which are numerically computed and are given by the dotted lines in Figure 1. For small values of ν, the unstable points lie in the vicinity of the antisymmetric stable stationary points (P_ν, 0) and (0, P_ν), which indicates that they attract only a small volume of neighboring phase space. For increasing ν, the unstable points approach the symmetric stable stationary point q_max = (P_ν, P_ν), which indicates that the phase-space volume of points attracted by q_max gets gradually smaller. At the point of structural instability, ν = ν_c, q_max finally loses its stability when it meets the unstable stationary points in a subcritical pitchfork bifurcation. To visualize the crucial and analytically derived property that (2^k - k - 1) stationary points of Q^+ lose their stability at the same value of ν, the bifurcation diagram of a network for k = 3 is given in Figure 2. In the
diagram, all stationary points of Q^+ are plotted, and we find for ν < ν_c the set's (2^k - 1) stable stationary points. Apart from the points in Q^+, we get a number of unstable stationary points, which all lie, for small ν, at the same distance from q_max and in the vicinity of the other points of Q^+. As ν increases, the unstable points get closer to the stable points with l(\vec{q}) ≥ 2 nonzero entries, and for ν = ν_c, these stable points of Q^+ lose their stability when they are met by the unstable points in subcritical bifurcations. Note that an inhibition proportional to the mean minicolumnar activity instead of the maximum as in equation 2.10 results in a dynamics whose stationary points lose their stability at different values of ν (compare also Lücke et al., 2002). This dynamical property is reflected by the eigenvalues of the dynamics's Jacobian, which depend on l(\vec{q}) for \tilde{I}_\rho(\vec{p}) with 0 < ρ < ∞.

2.4 Input. For the neuron dynamics 2.8, we have gained, by our stability analysis, far-reaching insight into the dynamical properties of the macrocolumnar model. This knowledge can now be exploited to investigate the dynamical behavior of the system if it is subject to perturbations in the form of externally induced input. As is customary in biology, we will call the positive contribution of a presynaptic neuron to the input of the postsynaptic neuron an excitatory postsynaptic potential (EPSP), and we will say that a neuron emits a spike at time t if it is active at time t. For the dynamics as investigated in the previous section, there are essentially three different modes of operation possible:

• For ν > ν_c, the macrocolumn can serve as a memory unit able to stabilize k different macroscopic states (i.e., stable stationary points of equation system 2.11). Switching between the states is possible by sending a sufficiently large quantity of EPSPs to the respective minicolumn.

• For ν < ν_c, the macrocolumn is able to stabilize 2^k - 1 different macroscopic states.
The transition between the states would be possible by inducing a sufficiently large quantity of EPSPs to an appropriate subset of minicolumns. If ν is chosen to be only slightly smaller than ν_c and if one starts with the stable stationary point q_max, already small differences in the input to the minicolumns are sufficient for the macrocolumn to change to a corresponding macroscopic state.

• If, for an initial state q_max, ν is continuously increased from ν < ν_c up to a value larger than ν_c, the system will change to another stable stationary point for some value of ν < ν_c depending on the input. For larger values of ν, the system can again change between stable points until ν is finally larger than ν_c and the dynamics is forced to one of the remaining stable stationary points where just one minicolumn is active. Consider, for instance, external EPSP input to a macrocolumn with k = 3 minicolumns. If the numbers of EPSPs induced per time step,
M_α^{EPSP}, are different for the three minicolumns, for example, M_1^{EPSP} : M_2^{EPSP} : M_3^{EPSP} = 0 : 1 : (1 + ε) for 0 < ε ≪ 1, the dynamics will stabilize the initial state q_max for small ν. If ν gets larger, the system will change to the stable point (0, P_ν, P_ν) for some ν_1 < ν_c because this point is less deflected by the input than q_max. The deflection of (0, P_ν, P_ν) caused by the input is sufficiently large, however, if ν is further increased. The system will therefore finally stabilize the point (0, 0, P_ν). The third possibility is the one with the most useful features. For given inputs, the dynamics first successively switches off the minicolumns with the smallest inputs. These macrostate transitions occur the earlier, the larger the differences between the inputs are, and can therefore encode neural population rate differences. If a new stable stationary point is reached, the process of switching off a minicolumn continues, each time without the perturbing influence of the input of the already quiescent columns. For a dynamics whose stationary points lose their stability at different values of ν (see Lücke et al., 2002), the number of active minicolumns is determined by ν and not by the input. Note in this context that equation 2.10 is not the only type of inhibition that results in a single critical value of ν. Other inhibition functions, for example, the average over the active columns,
$$I_{ac}(t) := \nu \, \frac{1}{|A|} \sum_{\beta \in A} p_\beta(t), \qquad A = \{\beta \mid p_\beta(t) > \tilde{p}\}, \tag{2.14}$$
with 0 < p̃ ≪ 1, can also be shown to possess this property. In general, the contribution of quiescent minicolumns to the inhibition has to be negligible as a prerequisite for a dynamics with qualitative behavior comparable to the one with inhibition 2.10. The simplicity of equation 2.10 and its good functional performance were the reasons for choosing an inhibition proportional to the maximum in this work. The macrostate transitions that depend on input differences but are induced by an increasing parameter ν are all performed near symmetry-breaking points. The transitions are theoretically infinitely sensitive to input differences, such that a macrocolumn can serve as an ideal decision unit (see also Lücke et al., 2002). For appropriate parameters s and Θ_o, the stabilization of stationary activity is achieved in a few iteration steps, so that the time to increase ν from its minimal to its maximal value lies within a few tens of time steps, which in addition makes decisions very fast. If the inhibition parameter oscillates, the macrocolumn can repeatedly select the strongest input or inputs. In the next section, this mode of operation is exploited and further discussed for a macrocolumn with explicitly modeled afferent fibers.
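As an illustration, the two inhibition variants can be sketched in a few lines of NumPy (a hedged sketch, not the authors' code: the maximum-based form restates the verbal description of equation 2.10, `inhibition_active_mean` implements equation 2.14, and the numerical value of the small activity threshold p̃ is chosen here for illustration only):

```python
import numpy as np

def inhibition_max(p, nu):
    # Inhibition proportional to the maximal minicolumn activity,
    # as described for equation 2.10 in the text.
    return nu * np.max(p)

def inhibition_active_mean(p, nu, p_tilde=0.01):
    # Equation 2.14: inhibition proportional to the mean activity of the
    # currently active minicolumns A = {beta | p_beta(t) > p_tilde}.
    active = p[p > p_tilde]
    if active.size == 0:
        # Quiescent minicolumns contribute nothing to the inhibition.
        return 0.0
    return nu * float(active.mean())

p = np.array([0.0, 0.4, 0.6])
print(inhibition_max(p, nu=1.0))          # 0.6
print(inhibition_active_mean(p, nu=1.0))  # 0.5
```

In both variants, the quiescent column (activity 0.0) leaves the inhibition unchanged, which is the prerequisite stated above.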
Processing and Learning in Macrocolumns
513
3 Afferent Fibers and Hebbian Plasticity

We consider the situation that the excitatory neurons of the macrocolumn, n^α_i, receive input from an input layer of N external neurons n^I_j, which are of the same type as the excitatory neurons of the minicolumns. In the following, we think of the neurons of the input layer as extracortical neurons in order to analyze the dynamical properties of the macrocolumn in a more convenient way. However, the input neurons can also be considered to be excitatory neurons of other macrocolumns, which would account for lateral excitation within the cortex. An afferent from an input neuron, n^I_j, to a neuron of a minicolumn, n^α_i, will be denoted by R^α_ij (see Figure 4A for a sketch of the system). Analogous to the internal connectivity, we demand that one neuron n^α_i receives a fixed number of r synapses from neurons of the input layer and that the synaptic weight of a synapse is given by c = 1/s. The receptive field (RF) vector of a minicolumn, R^α ∈ {0, c, 2c, …}^N, is defined as the sum of the RF vectors R^α_i = (R^α_{i1}, …, R^α_{iN}) of all neurons of the minicolumn α, R^α = Σ_{i=1}^m R^α_i.² Instead of reanalyzing the dynamics statistically, it is (for r significantly smaller than s) sufficient to treat the external input to the macrocolumn as a perturbation of the internal dynamics. The macrocolumn will be operated by repeatedly increasing the inhibition factor ν from a minimal value ν_min to a maximal value ν_max. The system is thereby forced to select the column(s) with the strongest input at the end of each period or ν-cycle (as we will call it from now on). At the beginning of a ν-cycle, the system has to be in the state q_max, which can be achieved under the influence of noise by setting ν to a sufficiently small value before starting to increase ν at ν_min (see Figure 3B). If, for a macrocolumn of k = 3 minicolumns, the RFs R^α are already given, the system is able to distinguish even strongly overlapping input patterns.
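The ν-cycle just described can be sketched as a simple schedule (an illustrative sketch, not the authors' code; the length of the reset phase, `T_reset`, is an assumption, since the text only calls it a short period, while ν_min = 0.5, ν_max = 1.12, and T_ν = 25 are the values used in section 4 and Figure 3B):

```python
import numpy as np

def nu_cycle(T_reset=5, T_ramp=25, nu_reset=0.1, nu_min=0.5, nu_max=1.12):
    # One nu-cycle as in Figure 3B: a short reset phase with small nu
    # (driving the dynamics back to the state of maximal activity q_max
    # under noise), followed by a linear increase from nu_min to nu_max.
    reset = np.full(T_reset, nu_reset)
    ramp = np.linspace(nu_min, nu_max, T_ramp)
    return np.concatenate([reset, ramp])

schedule = nu_cycle()   # one value of nu per time step
```

At the end of each such cycle, the inhibition has forced the macrostate down to the minicolumn(s) receiving the strongest input.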
The system first switches off the minicolumns with RFs very different from the presented stimulus and then decides between the remaining minicolumns with RFs similar to the stimulus (see Figure 3). In this way, the system can also gracefully handle simultaneously presented patterns. If a superposition of two patterns corresponding to the RFs of two minicolumns is presented, the dynamics switches off all irrelevant minicolumns except the two corresponding ones, whose activities are symmetrized. It then depends on the choice of ν_max whether this is the final state or whether the symmetry is broken to favor one of the patterns. We now proceed by introducing Hebbian plasticity of the afferent fibers to match neurophysiological experiments, which show input-dependent changes of neuron RFs. As the RFs of the neurons, R^α_i, change, the RFs of the
² In the neurons' RF vectors, only afferent connections are considered.
Figure 3: Operation and dynamic behavior of a system with parameters m = 100, s = 15, r = 7, Θ_o = 1/s, with k = 3 minicolumns, and an input layer of N = 16 × 16 neurons. (A) The RFs, R^α, of the minicolumns α = 1, 2, 3 are given as two-dimensional plots of the 16² vector entries. The entries are visualized as gray levels (black = 0). To make the RFs more conceivable, we have chosen them to be of the form of simple two-dimensional patterns. The input pattern is chosen to correspond to the RF of minicolumn α = 3. During the operation of the system, all neurons of the input layer that correspond to white pixels spike with probability 1/3; neurons that correspond to black pixels are not spiking. (B) The periodic change of the parameter of inhibition, ν, is visualized. After a short period with ν = 0.1, which serves to reset the dynamics to q_max, ν is linearly increased from ν_min to ν_max = 1.12. Three ν-cycles with period length T_ν = 25 are displayed. (C) The dynamic behavior of the system is visualized, with the activities p_α(t) of the minicolumns α = 1, 2, 3 plotted against time. At the beginning of a ν-cycle, the dynamics tends to symmetrize the activities, as predicted by the theoretical results and the bifurcation diagrams. The symmetry is first broken when minicolumn α = 1 is switched off because its RF receives the smallest number of EPSPs from the presented input. Afterward, the stationary point (0, P_ν, P_ν) is stabilized; the remaining two minicolumn activities are symmetrized until minicolumn α = 2 becomes quiescent because it receives less input than minicolumn α = 3. The qualitative behavior for each ν-cycle is the same, but quantitative differences exist due to the threshold noise of the neurons and the finitely many neurons per minicolumn.
Figure 4: (A) Sketch of a macrocolumn with k = 3 minicolumns connected to an input layer of N = 25 neurons. The m = 8 neurons per minicolumn are randomly interconnected; each minicolumnar neuron receives s = 3 synapses from within its minicolumn. The inhibition is symbolically sketched as one inhibitory neuron receiving input from all minicolumns and projecting back to all of them. Each minicolumnar neuron receives r = 2 synapses from neurons of the input layer. The randomly initialized RF, R^1, of minicolumn α = 1 is fully displayed, whereas RFs R^2 and R^3 are not. Lines within the input layer are displayed only for visualization purposes; there are no connections between neurons within the input layer. (B) Set of three different input patterns of 16 × 16 pixels. (C) Modifications of the RFs of a macrocolumn with k = 3 minicolumns and parameters m = 100, s = 15, r = 7, Θ_o = 1/s, ε = 0.03, ξ = 55, and N = 256. For 0 ν-cycles, the random initialization of the RFs is displayed. After five ν-cycles (and five presentations of patterns randomly chosen from the set of input patterns), R^1 is slowly specializing to pattern 2, and after 10 and 15 ν-cycles, R^2 and R^3 specialize to patterns 3 and 1, respectively. After 15 ν-cycles, the RF specialization further increases until the maximal specialization is reached after about 100 ν-cycles. From 100 ν-cycles on, the degree of specialization remains unchanged.
minicolumns, R^α, consequently change in time as well. As an update rule for the synaptic change, ΔR^α_ij(t) = R^α_ij(t) − R^α_ij(t − 1), we use elementary Hebbian plasticity; that is, an afferent connection R^α_ij is increased if the presynaptic neuron was active at the time step directly preceding the firing of the postsynaptic neuron. The state of maximal macrocolumnar activity, q_max, is the same for all inputs. Only after the selection process, that is, at lower levels of activity due to minicolumn inactivation, is the activity state able to distinguish between inputs. Therefore, synaptic plasticity has to be predominant at low levels of macrocolumnar activity, B(t) = Σ_{α=1}^k Σ_{i=1}^m n^α_i(t), in order to generate discriminating RFs. A simple and, as it turned out, functionally advantageous way to do this is to enable synaptic modification only if B(t) is smaller than a threshold ξ. As activity oscillations are ubiquitous in the cortex, it seems plausible that synaptic plasticity is phase coupled (see Wespatat, Tennigkeit, & Singer, 2003, for recent evidence of phase-coupled synaptic modification). For the dynamics of synaptic change, we further demand as a boundary condition that the number of synapses received by a minicolumnar neuron is limited to r in order to avoid unlimited synaptic growth. We get as dynamic equations for the synaptic weights (α = 1, …, k; i = 1, …, m; j = 1, …, N):

$$\Delta R^\alpha_{ij}(t) = A^\epsilon_t \, n^\alpha_i(t) \, n^I_j(t-1) \quad \text{iff } B(t) < \xi, \tag{3.1}$$

$$\frac{1}{c} \sum_{j=1}^{N} R^\alpha_{ij}(t) = r \quad \forall i, \alpha. \tag{3.2}$$

As our synaptic weights are discrete values, A^ε_t is not a real-valued growth factor but a probability that the synaptic weight is increased by c = 1/s.³ If R^α_ij is increased for a given (α, i), the neuron n^α_i randomly removes one afferent from the input layer in order to fulfill the boundary condition. We operate the system by periodically changing ν as in Figure 3B. Throughout the duration of a ν-cycle, we present a pattern randomly chosen from a set of input patterns. An input neuron that corresponds to a white pixel is spiking, and a neuron that corresponds to a black pixel is not. The RFs of the neurons are randomly initialized and are modified according to equations 3.1 and 3.2. If the set of training patterns is structured, in the sense that it contains a small number of patterns as in Figure 4B, we can observe a specialization of the RFs of the minicolumns to the different input patterns. In Figure 4C, the modification of the RFs, R^α, of a macrocolumn with minicolumns α = 1, 2, 3 is displayed, and it can be seen that the system organizes its RFs such that the macrocolumn becomes a decision unit for the input patterns.

³ To be more precise, for each (i, j, α), A^ε_t ∈ {0, c} is an independent Bernoulli sequence with probability P(0) = 1 − ε and P(c) = ε.
In the beginning, an input pattern affects all minicolumns equally, such that the system selects a subset of minicolumns by symmetry breaking. As soon as, initiated by random selection, an RF specializes for one class of input patterns, the corresponding minicolumn is more likely to be activated by patterns of this class, which further increases the specialization of the RF. This is the positive feedback loop of the self-organizing process, which amplifies small fluctuations and finally leads to an ordered state of the RFs. In addition to the self-organizing aspects, we have competition due to the minicolumn selection process and competition between afferent fibers induced by equation 3.2. In order to avoid mutual weakening of different patterns stored in the same minicolumnar RF, the system specializes its RFs to adequately different input patterns.

4 Experiments

We have seen that the system is able to specialize its RFs to different input patterns. So far, we have presented three different patterns to a network of three minicolumns (see Figure 4). We now investigate two more general situations. In the first experimental setting, we present to the network different patterns that can be grouped into different classes. In the second setting, the network's task is to extract basic constituents of a class of patterns generated by combining different bars, a task known as the bars problem (Földiák, 1990). For both tasks, the same network is used with the same set of parameters. All experiments use an input layer of 16 × 16 input neurons. If a binary (black-and-white) pattern of 16 × 16 pixels is presented, the input neuron n^I_j of a given pixel spikes with probability 1/3 if the corresponding pixel is white and is not spiking if the corresponding pixel is black. Gray levels can be coded by intermediate firing rates, but in the following, for simplicity, only binary input is considered. We use a network with m = 100 neurons per minicolumn.
Each neuron receives s = 15 synapses from presynaptic neurons of the same minicolumn and r = 7 synapses from neurons of the input layer. The neurons' constant threshold is set to Θ_o = 1/s ≈ 0.067; it is chosen such that a single EPSP is not sufficient to activate a neuron. The constant threshold is subject to Gaussian threshold noise with zero mean and a variance of (σ_Θ)² = 0.01. The oscillation of the inhibition is essentially governed by the parameters ν_min = 0.5 and ν_max = 1.12, and the length of a ν-cycle is T_ν = 25 time steps. To allow short ν-cycle periods, we use a ν-oscillation as given in Figure 3B, where the first part (with ν = 0.1 and additional noise) serves to reset the dynamics to the stationary point q_max. Note, however, that self-organization of RFs is also possible with other (e.g., sinusoidal) types of ν-oscillations. Hebbian plasticity (see equations 3.1 and 3.2) is determined by the synaptic modification rate ε = 0.03 and the parameter ξ = 55, which determines
the network activity for which synaptic modification is possible. The latter is chosen such that synaptic modification is enabled only close to the end of a ν-cycle (note that a value two to three times larger for ξ, with simultaneously reduced ε, results in a system with comparable qualitative behavior). All parameters are independent of the number of minicolumns k, which we allow to change between experiments. The parameters are chosen partly to reflect anatomical data, as in the case of the number of neurons per minicolumn, m = 100, and partly to optimize performance in the experiments. In the following, we will refer to these parameters as the standard set of parameters.

4.1 Pattern Classification. We have seen that the system is able to specialize its RFs to be sensitive to a number of input patterns. More realistic input would not consist of a repeated presentation of exactly the same patterns as in Figures 4B and 4C but rather of different patterns that can be grouped into different classes. In Figure 5, a pattern classification experiment for such a kind of input is illustrated. For input patterns V^a ∈ {0, 1}^256 as displayed in Figure 5A, we can define the distance measure

$$d_S(V^a, V^b) := \frac{|A \triangle B|}{|A \cup B|}, \tag{4.1}$$
where A = {i | V^a_i = 1}, B = {i | V^b_i = 1}, and where A △ B = (A ∪ B) − (A ∩ B) is the symmetric difference of sets. Distance function 4.1 can be derived directly by analyzing the RF-mediated input to the minicolumns (further detail would go beyond the scope of this article) and can be used to group the input patterns into different classes of mutually similar patterns, as can be seen in Figure 5B. In Figure 5C, a typical modification of the RFs of a macrocolumn with six minicolumns is displayed, and it can be observed that the system builds up representations of all classes identifiable in Figure 5B. If fewer minicolumns than pattern classes are available, the system builds up larger classes of mutually similar patterns (see Figure 5D). If more minicolumns than major classes are available, the system further subdivides the pattern classes (see Figure 5E). The subdivision may in this respect differ slightly from simulation to simulation. For example, for k = 9, the "square class" is in many cases represented by only one minicolumn and the "plus class" by three instead of two, as in Figure 5E. In the experiment, it can further be observed that the final representation depends on the substructure of the pattern classes rather than on their size; for example, the plus pattern appears more frequently than the St. Andrew's cross pattern, but the St. Andrew's cross tends to be represented by more RFs (see Figures 5C and 5E). Furthermore, the classification is independent of the number of white pixels per pattern because the impact of the patterns on the minicolumns is normalized by boundary condition 3.2. The independence is affected only by patterns approximately filling out the
Figure 5: (A) The set of input patterns is displayed. During each ν-cycle, one randomly chosen pattern of this set is presented. (B) Distance matrix generated using the distance measure d_S. The line and column indices enumerate the 42 input patterns in the same order as they appear in A. (C) The modification of the RFs of a macrocolumn with k = 6 minicolumns and the standard set of parameters is displayed. After 250 and 1000 ν-cycles, four and six different pattern classes are represented, respectively. The RFs' degree of specialization further increases thereafter, and it can be seen that RFs R^1 and R^2 further subdivide the pattern class formerly represented by R^1 only. (D) Final RF specialization (after 250 ν-cycles) if a macrocolumn with k = 3 minicolumns is used with the same input. (E) Final RF specialization (after 10,000 ν-cycles) if a macrocolumn with k = 9 is used.
whole input layer or by patterns having a number of white pixels close to zero.

4.2 Feature Extraction. So far, we have seen that if the training patterns contain v classes of patterns, the system is able to identify these classes if k ≥ v. There are situations, however, when the training patterns cannot
easily be grouped into pattern classes. This is the case, for instance, if we present, from a number of v patterns, not only the patterns themselves but also all possible superpositions. If k < 2^v, the system is not able to store all patterns in different RFs. We mentioned in section 3, however, that the internal dynamics of a macrocolumn is especially suitable for taking pattern superpositions into account, such that we can nevertheless expect the system to generate appropriate RFs. A method to evaluate the ability of a network to handle input that can be represented only by a combination of different patterns is the bars test. It was first introduced in Földiák (1990) and soon became a benchmark test for the generalization and combinatorial abilities of learning systems. The training patterns of the bars test consist of horizontal and vertical bars. On a quadratic input layer, b/2 (nonoverlapping) horizontal and b/2 (nonoverlapping) vertical bars can be displayed (with b an even integer), each appearing with probability 1/b and all of equal size (see Figure 6A for some examples with b = 8). Note that overlapping horizontal and vertical bars do not add up linearly, because two overlapping white pixels do not add but result in a white pixel as well. The bars test is passed if, after a training phase, the system has built up representations of all bars and is able to correctly classify input consisting of superpositions of the learned bars. We can operate our system without modification, with the same set of parameters as for the experiment in Figure 5, and it turns out that it passes the bars test without difficulty. The only prerequisite for a correct representation is that the number of minicolumns k is greater than or equal to the number of different bars, k ≥ b. If k > b, the RFs of some minicolumns remain uncommitted or specialize to a bar already represented by another minicolumn.
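The input generation of the bars test can be sketched as follows (a sketch under the conventions just stated: b/2 horizontal and b/2 vertical bars, each drawn independently with probability 1/b, nonlinear overlap, and white pixels spiking with probability 1/3; the function names are ours):

```python
import numpy as np

def bars_image(b=8, size=16, rng=None):
    # One bars-test pattern: b/2 horizontal and b/2 vertical bars, each
    # drawn independently with probability 1/b; overlapping bars combine
    # nonlinearly (white stays white).
    if rng is None:
        rng = np.random.default_rng()
    w = size // (b // 2)                   # bar width: 4 pixels for b = 8
    img = np.zeros((size, size), dtype=int)
    for i in range(b // 2):
        if rng.random() < 1.0 / b:         # horizontal bar i
            img[i * w:(i + 1) * w, :] = 1
        if rng.random() < 1.0 / b:         # vertical bar i
            img[:, i * w:(i + 1) * w] = 1
    return img

def spikes(img, rng=None):
    # Input-layer activity: the neuron of a white pixel fires with
    # probability 1/3; black-pixel neurons stay silent.
    if rng is None:
        rng = np.random.default_rng()
    return (rng.random(img.shape) < img / 3.0).astype(int)
```

Because a crossing of a horizontal and a vertical bar simply stays white, the patterns are not linear superpositions of their constituents, which is what makes the test hard for linear methods.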
For a bars test with b = 8 different bars, Figure 6B shows the modification of the RFs of a macrocolumn with k = 10 minicolumns. Starting from random initialization, the RFs specialize to different single bars even though the input patterns consist mainly of bar superpositions. In Figure 6B, a representation of all bars is clearly visible after 1000 ν-cycles, and the representation can be seen to stabilize further thereafter. During the learning phase, an RF sometimes specializes to a combination of two or more bars, as can be seen by looking at RF R^8 in Figure 6 (after 500 ν-cycles). Such an RF is not stable, however, because the parts of the RF that correspond to different bars compete via equation 3.2. The RF therefore rapidly specializes to one bar if another RF becomes sensitive to the other. An example is given by RFs R^7 and R^8 in Figure 6 from 500 to 1000 ν-cycles. In the experiment of Figure 6, two RFs remain unspecialized. In other experiments, or for a longer learning phase, the two supernumerary RFs often specialize to an already represented bar and thereby increase redundancy. The bars test has been used in different versions, with different numbers of bars and different systems. In Hinton et al. (1995), for instance, 8 bars were used; Hochreiter and Schmidhuber (1999) and others used 10 bars; Hinton
Figure 6: (A) A selection of 33 typical input patterns of the bars test with eight different bars. (B) Typical example of the self-organization of the RFs of a macrocolumn with 10 minicolumns and the standard set of parameters. During each ν-cycle, a randomly generated input pattern of the upper type is presented. After about 250 ν-cycles, the network has already found representations of seven bars. After 1000 ν-cycles, representations of all bars are found and are further stabilized.
and Ghahramani (1997) used 12 bars, and Földiák (1990) and others used 16 bars. To allow for comparison with these systems, we measured the performance of our system for bars tests of 8, 10, 12, 14, and 16 bars. For all tests, we used the same system, always with the standard set of parameters and an input layer of 16 × 16 neurons. The different bars tests required different
bar widths in order to cover the input layer appropriately. For the bars test with b = 8 bars, a bar width of four pixels was used (see Figure 6); for b = 10, three pixels; and for b = 12, 14, 16, the bars were two pixels wide. Consequently, the input layer is not uniformly covered for 10, 12, and 14 bars. In Figures 7A, 7B, and 7C, the results of different test series are presented. In Figure 7A, the number of minicolumns is equal to the number of different bars; in Figure 7B, the number of minicolumns exceeds the number of bars by two; and in Figure 7C, a surplus of four minicolumns is available. A measurement point in the diagrams corresponds to the number of ν-cycles after which there is a 50% probability that all bars are represented; for example, in 200 runs with 8 bars and k = 10 minicolumns, there were 100 runs in which a representation was found after 1050 ν-cycles (see the first measurement point in Figure 7B). A bar is taken to be represented by a minicolumn if the minicolumn remains active in 9 of 10 ν-cycles in which the bar is presented. A macrocolumn is said to have found a representation of all bars if all bars are represented by at least one minicolumn and no minicolumn represents two different bars. In all runs, the system finally found a correct representation. Once a representation was found, it remained stable in the sense that the minicolumns remained specialized to the same bars. For a given experimental setting, there can be relatively large differences between individual runs, however. For 8 bars and k = 10, for example, a correct representation of the bars was not found until after about 2100 ν-cycles in 20% of the 200 runs (indicated by the upper bound of the error bar), whereas another 20% of the experiments found representations within 400 ν-cycles (lower bound).
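The representation criterion just described can be written down directly (a sketch; `count_matrix` and the function name are our own bookkeeping, not part of the model):

```python
import numpy as np

def all_bars_represented(count_matrix, need=9):
    # count_matrix[k, b]: how often minicolumn k remained active over the
    # 10 nu-cycles in which bar b was presented.  A bar is represented if
    # some minicolumn was active in at least `need` = 9 of those cycles;
    # the macrocolumn has found a representation of all bars if every bar
    # is represented by at least one minicolumn and no minicolumn
    # represents two different bars.
    rep = count_matrix >= need
    every_bar = rep.any(axis=0).all()         # each bar has a minicolumn
    no_double = (rep.sum(axis=1) <= 1).all()  # no minicolumn codes two bars
    return bool(every_bar and no_double)
```

This is the acceptance test behind each measurement point in Figures 7A through 7C.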
The reason is that all bars but one find representations very early, whereas the remaining bar might take a long time to be represented, an effect that is also observable in Spratling and Johnson (2002) for the noisy bars test. In Figures 7B and 7C, a large reduction of learning time can be observed if the number of minicolumns is larger than the number of presented bars. A surplus of two minicolumns reduces the learning time to less than half of that without a surplus, and a surplus of four minicolumns results in a learning time of roughly a quarter. For the results in Figures 7A through 7C, we used a newly generated bars image for every ν-cycle. The same experiments can be carried out, however, by choosing randomly from a fixed set of u generated bars images. If u is several times larger than b, the results are qualitatively and quantitatively comparable. For a bars test with 8 bars and k = 10 minicolumns, for instance, u = 50 input patterns are fully sufficient to build up a correct representation of single bars. The results of Figures 7A through 7C further show that the learning time in terms of ν-cycles decreases as the number of bars does. This can be expected because the system has to learn a decreasing number of independent input constituents. On the other hand, there is an increasing overlap of bars, which makes it harder for the system to differentiate between two bars (compare
Figure 7: (A–C) Results of bars tests with b = 8, 10, 12, 14, 16 bars and a macrocolumn with the standard set of parameters. In A, the number of minicolumns of the used macrocolumns is always equal to the number of different bars. In B, the number of minicolumns exceeds the number of bars by two, and in C, there is a surplus of four minicolumns. (D–F) Results of bars tests with b = 8 bars and a macrocolumn with k = 10 minicolumns and standard parameters. In D, the input patterns are perturbed with bit-flip noise of 0 to 12%. In E, the bar widths are varied, and in F, the generation of the input patterns is altered such that for each run, four randomly chosen bars appear with probability (1/8)(1 − γ), whereas the other four appear with probability (1/8)(1 + γ). The measurement points of A, B, and C were obtained by taking 200 runs into account, and the measurement points of D, E, and F with 100 runs each. As a result, the number of ν-cycles is given after which a representation of all bars is found with a probability of 0.5. The lower and upper bounds of the error bars correspond to probabilities of 0.2 and 0.8, respectively. For each run, a newly generated macrocolumn with newly initialized RFs was used.
Dayan & Zemel, 1995, and Hochreiter & Schmidhuber, 1999). The positive effect of fewer constituents is predominant in our system. The negative effect of more overlap can, however, be made responsible for an increase of learning time when the bar widths are varied, in an experiment discussed below. Once a system has learned a correct representation of the bars, it can be used to analyze bars images. To test the accuracy of the recognition, we trained a macrocolumn with images generated according to the bars test until it found a representation. After some additional learning to stabilize the representation further, it was tested with newly generated bars images of the same type as the training images. If an image is presented, the minicolumns corresponding to the bars appearing in the image remain active longer than minicolumns associated with bars not appearing in the test image. At the end of a ν-cycle, a minicolumn is either active or not. A test image is considered to be correctly recognized if, for all minicolumns that correspond to bars appearing in the image, the probability of remaining active is above average, and if the probabilities of minicolumns corresponding to all other bars lie below average. Six macrocolumns of 16 minicolumns trained with bars images of 16 bars were each tested 100,000 times with images generated according to the bars test, except that, for convenience, we required each image to contain at least one bar. The networks classified the input correctly in all but two of the 600,000 cases. In the first case, one of seven bars was not recognized, and in the second, one of eight bars. In the usual bars test, an individual bar is always displayed identically, the bars are of the same size, and all bars occur with exactly the same probability. Systems solving the bars test can therefore be suspected of exploiting these artificial assumptions.
The system of Földiák (1990), for instance, not only exploits the fact that the bars occur with the same probability but also needs to know the exact value of the bars' probability of occurrence. How much a system relies on the assumptions of the bars test can be probed by relaxing them, and we present three test series showing the corresponding behavior of our system. For all three series, we use a bars test with b = 8 bars and a macrocolumn with k = 10 minicolumns and the standard set of parameters. To test the robustness against perturbed bar images, we presented input images with bit-flip noise during the learning phase (see Figure 8). In Figure 7D, the learning time is plotted for different degrees of noise. As can be observed, even low levels of noise have positive effects. However, with an increasing noise level, the final degree of specialization of the minicolumns' RFs is reduced. In Figure 8, the final specialization degree corresponds to the displayed RFs after about 2000 or 5000 ν-cycles. Compared to the final degree of specialization in Figure 6B, it can be seen that in the noisy case, the RFs have more overlap. The overlap increases with increasing noise, which leads to an increasing instability of a representation of all bars, until the system cannot find a representation of the bars anymore. For the standard set of parameters and for a bars test with 8 bars, no representations can be found for noise levels above about 12%. By decreasing the learning rate ε,
Figure 8: (A) A selection of 33 typical input patterns of a bars test with b = 8 different bars and 8% bit-flip noise. (B) Typical RF specialization corresponding to this input. RFs of a macrocolumn with 10 minicolumns and the standard set of parameters are displayed. After about 500 ν-cycles, representations of all bars are recognizable, and after about 2000 ν-cycles, the maximal degree of specialization is reached.
the robustness against noise can be increased such that representations can be found for noise levels above 15%. In the second test series, the bar size is varied. For b = 8, the bars are usually w = 4 pixels wide. If w = (w_1, w_2, w_3, w_4) denotes the bar widths for the four vertical as well as for the four horizontal bars, we can define

$$\delta_w = \sum_{i=1}^{b/2} |w_i - 4|$$

as a measure of the bar width variation. In Figure 7E, the results for the test series w = (4, 4, 4, 4), (3, 4, 4, 5), (3, 3, 5, 5), (2, 3, 5, 6), (1, 3, 5, 7) are given. The learning time increases with increasing δ_w, presumably because the maximal bar overlap increases; for example, for w = (1, 3, 5, 7), the horizontal 7-pixel-wide bar covers nearly half of the 1-pixel-wide vertical bar. The robustness of the system against relaxing the assumption that all bars occur with equal probability is investigated in the third test series. We reduce the appearance probability of four randomly chosen bars to the value p = p_o(1 − γ) and increase the appearance probability of the four other bars by the same amount to p = p_o(1 + γ). Here, γ is a parameter in the interval [0, 1], and p_o = 1/b = 1/8 is the usual appearance probability. In Figure 7F, the results for γ = 0.0, …, 0.8 are given; for γ = 0.9, the corresponding measurements are 3750, 8050, and 32,850 ν-cycles for probabilities of obtaining a correct representation of 0.2, 0.5, and 0.8, respectively. The measurements show that the system reliably learns a correct representation even if half of the bars appear nearly 20 times more frequently than the others, and it needs a longer learning phase only if half of the bars occur more than about four times more frequently. The bars' appearance probability can also be varied globally. If all bars appear with the same probability p_o and p_o is increased to values larger than 1/b, the probability of finding only a few or a single bar in an input image gets gradually smaller. For k = 10, b = 8, and the standard set of parameters, the learning time increases as p_o gets larger. For p_o = 1.5 · (1/b), the system found a stable representation in all of 100 runs and needed fewer than 1600 ν-cycles to find representations in 50% of them. In addition to a longer learning phase, the RFs representing the different bars become less disjunct, until the final representation becomes unstable for values of p_o larger than about 1.8 · (1/b). Representations for input with higher values of p_o can be found, however, if the synaptic modification rate ε is reduced, which in general stabilizes the representation.
For ε = 0.005 instead of ε = 0.03, the system always finds stable representations for po = 2.0 · (1/b) (after fewer than 18,600 ν-cycles in 50% of 100 runs). However, even for very low values of ε, there is a limit at po somewhat larger than 2.0 · (1/b), from which point on no stable representation can be found anymore. We have seen that the same network solves problems such as pattern classification and basic feature extraction. As demonstrated in the bars test, the network can build up a representation of the input, which allows us to classify patterns by using distributed neural coding. The network found correct representations of all bars in all 5700 simulations that were carried out to acquire the data given in Figure 7. After the learning phase, the classification for the usual bars test shows a reliability of virtually 100%. All experimental data given in Figures 5 to 8 were obtained with the same parameters. Different sets of parameters lead to different results, and for an individual task, the parameters can be optimized to obtain shorter learning times or higher robustness. We have chosen, however, to use the standard set of parameters for all experiments in order to demonstrate the universality and robustness of the system's dynamics.
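The bars-test input described in these test series can be sketched in a few lines. The following is an illustrative reconstruction, not the authors' code; the 16 × 16 image size, the OR-combination of overlapping bars, and the helper names are our assumptions:

```python
import random

def bars_image(widths=(4, 4, 4, 4), p=1.0 / 8):
    """One bars-test input: each horizontal and each vertical bar appears
    independently with probability p; overlapping bars combine by OR,
    so the image is a nonlinear superposition of its components."""
    size = sum(widths)                        # 16 for the standard widths
    img = [[0] * size for _ in range(size)]
    offsets = [sum(widths[:i]) for i in range(len(widths))]
    for off, w in zip(offsets, widths):
        if random.random() < p:               # horizontal bar of width w
            for r in range(off, off + w):
                for c in range(size):
                    img[r][c] = 1
        if random.random() < p:               # vertical bar of width w
            for r in range(size):
                for c in range(off, off + w):
                    img[r][c] = 1
    return img

def delta_w(widths):
    """Bar-width variation measure from the text: sum of |w_i - 4|."""
    return sum(abs(w - 4) for w in widths)
```

For the test series above, `delta_w((4, 4, 4, 4))` is 0 and `delta_w((1, 3, 5, 7))` is 8, matching the increasing-overlap trend discussed in the text.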
Processing and Learning in Macrocolumns
527
5 Discussion

From an elementary neuron model and a random but column-based interconnection, we derived a neural dynamics with properties that make the macrocolumn an ideal decision unit for input to its minicolumns. The dynamics is best exploited with an oscillating gain factor of the inhibition. If the afferent fibers to the macrocolumn are subject to elementary Hebbian plasticity that is also phase-locked to the oscillation of inhibition, we get a system that self-organizes the RFs of its minicolumns. The system is able to classify input patterns into different groups or to extract basic constituents of the input patterns, as was demonstrated using the bars test. The way the system represents the input depends only on the nature of the input, as the same system with the same set of parameters was used for the pattern grouping task and for the bars test.

5.1 Computational Aspects. There are various systems capable of learning without supervision. Important (not necessarily disjoint) classes are different types of ANN, probabilistic models, and, in a more general sense, independent component analysis (ICA) and principal component analysis (PCA) implementations. Among these systems, only a few are able to build up efficient combinatorial representations of the input. The bars test is a means of testing this ability, and it represents in this respect a hard problem because, in general, its components, the bars, do not add up linearly. Linear methods like ICA therefore fail to pass the test (see, e.g., Hochreiter & Schmidhuber, 1999). The problem is (in its more or less difficult versions) solved by merely a small subset of systems (Földiák, 1990; Saund, 1995; Dayan & Zemel, 1995; Marshall, 1995; Hinton et al., 1995; Harpur & Prager, 1996; Frey et al., 1997; Hinton & Ghahramani, 1997; Fyfe, 1997; Charles & Fyfe, 1998; Hochreiter & Schmidhuber, 1999; O'Reilly, 2001; Spratling & Johnson, 2002).
Some of them need additional knowledge about the input; Földiák (1990) and Marshall (1995), for example, require that all bars occur with equal probability. Other systems, such as those of Dayan and Zemel (1995) and Hinton and Ghahramani (1997), use hierarchical approaches. If these systems are applied to the bars test, a pattern is first represented as containing horizontal or vertical patterns, and then exact instances of those patterns are represented at the next level. The system as presented in this article is not hierarchical. However, the dynamics can be extended to allow for hierarchical learning in the sense that the input patterns are first subdivided into larger classes of patterns on the basis of the distance measure, equation 4.1. Such an extended system increases the parameter νmax during learning. A system based on this mechanism is currently being studied in our lab. Note, however, that such a system learns hierarchically but does not hierarchically represent a pattern as the systems of Dayan and Zemel (1995) and Hinton and Ghahramani (1997) do.
To compare systems that solve the bars test, their behavior under the relaxation of the bars test assumptions is one important criterion; their reliability (some systems do not always find correct representations) and the time they need to find a representation are others. Comparison among the systems is difficult in many cases, however, because important data (e.g., concerning robustness or reliability) are often missing. Even if data are available (e.g., in terms of the number of presentations of input images required to build up a correct representation), comparison remains difficult because systems specialized to the bars test assumptions4 can be expected to be much faster than systems that can also be applied to more general input.5 Our system was therefore tested against relaxations of the bars test assumptions and was shown to behave favorably (see Figures 7D–7F). In terms of pattern presentations, only the systems of Földiák (1990) and Spratling and Johnson (2002) are faster than the presented network.6 In Földiák (1990) the probability of bar occurrence has to be known ahead of time, however, and Spratling and Johnson (2002) do not report on the robustness of their system when bars are of different sizes or appear with different probabilities. A further possibility for comparing systems is the complexity of computation. A typical system with N input units and k internal computational units with all-to-all connectivity needs O(Nk + k²) elementary computations for one update in the learning phase. Spratling and Johnson (2002) report O(Nk²) computations, whereas our system needs O(Nk + k) because it is not using internal all-to-all connectivity.7

5.2 Neuroscientific Aspects. As discussed in section 1, we designed our model of the cortical macrocolumn in accordance with relevant neuroanatomical and neurophysiological facts.
4 The system of Földiák (1990), for example, required a learning time of 1200 presentations for 16 bars.
5 Hochreiter and Schmidhuber (1999), for example, needed 5000 passes through a training set of 500 patterns for 10 bars.
6 The system of Spratling and Johnson (2002) needs 210 cycles to get a correct representation in the majority of runs for 16 bars and a specially chosen set of parameters.
7 If N grows, the number of available afferents per input unit can be kept constant by proportionally increasing r and reducing the synaptic weights of the afferents accordingly.

We show that on discrimination and learning tasks, the resulting system can overcome two serious problems raised by the concept of single neurons as the brain's decision units: reaction time and limited fault tolerance. The essential components of our model are column-based interconnections, discrete neural spike signals, oscillatory activity, and Hebbian plasticity. These neural characteristics, usually seen as independent of each other, are shown here to form a natural alliance, with important functional consequences. The model requires little genetic information, being based on sparse, asymmetric, and random interconnections within the minicolumn. Our model makes several simplifying assumptions, using an abstract neuron model, discrete time, and direct inhibition. Experimental predictions of the model should therefore be treated with caution.

A fundamental property of our system is the ability to sustain neural activity without input. This property is based on a random interconnection matrix within a minicolumn: a relatively high number of EPSPs in one time step results in a relatively high number of EPSPs in the next. The amount of EPSPs is controlled by inhibitory feedback and refractoriness of the neurons. As studies of continuous-time systems suggest (e.g., Wilson & Cowan, 1973), this mechanism can be implemented in a continuous-time version of the presented minicolumn model as well, such that with a continuous inhibition between the minicolumns similar to equation 2.10, the qualitative dynamical behavior of the discrete model can be expected to carry over to a continuous one based, for example, on an integrate-and-fire or Hodgkin-Huxley neuron model. It can even be expected that convergence to stable stationary points of the dynamics is faster than in the discrete-time case, which would allow for a shorter ν-cycle period and, consequently, an even faster reaction time. For this reason, and because of the possibility of a better comparison with neurophysiology, continuous-time systems are the subject of further studies. Our system realizes neural populations with well-defined global behavior while realistically using local update rules for individual neurons and synapses. The resulting population code is based on a collective firing rate, evaluated by the macrocolumnar dynamics as an average over each minicolumn's population at a particular phase relative to the oscillating inhibition. We tentatively identify our inhibitory cycle with cortical oscillations in the gamma frequency range (about 30 to 60 Hz). Recent neurophysiological experiments (Perez-Orive et al., 2002; see Singer, 2003, for review) support this view of a phase-coupled population rate code.
For evidence of the phase dependence of Hebbian modification, see Wespatat et al. (2003), where membrane potential oscillations of 20 to 40 Hz were artificially induced in pyramidal cells.

A central issue for understanding the brain is the neural code. The currently dominant view is the single-neuron hypothesis (Barlow, 1972), according to which essential decisions of the brain can be linked directly to firing decisions of individual neurons. A fundamental difficulty for this view is the reaction times of the brain. These can be so short that single neurons can fire only once, which makes it impossible to express graded signals (see, however, the time-of-arrival hypothesis of Thorpe, 1988, which also advocates a firing phase). On the other hand, a population code can be the basis for very fast information processing. In our model with the standard set of parameters, individual neurons typically fire only 2 to 10 times before the macrocolumn makes a decision, and yet the decision is based in a precise, graded fashion on the input (if Tν is reduced to Tν = 10, the system shows qualitatively the same behavior, but neurons spike only one to four times before the first macrostate transition).
The other fundamental weakness of the single-neuron hypothesis is its lack of robustness against damage and accidents of wiring. The usual proposal to repair this weakness is a population code, and our model may be seen as an essential step toward establishing one. The minicolumn has a collective receptive field. This makes it fault-tolerant with respect to accidents in the afferent connections; the same can be said about intracortical connections. Moreover, self-tuning of the activity dynamics of minicolumns (e.g., of the parameters νmin and νmax) can make them robust to lesions or imperfections in ontogenesis.

In summary, our model, motivated by macrocolumn connectivity, has neurodynamic properties that solve important conceptual problems of neurophysiology. The spiking character of neurons, column-based interconnection structure, oscillatory inhibition, and Hebbian plasticity are shown to combine into an advanced information processing system that can solve problems such as pattern classification and, specifically, the bars benchmark problem, where it is highly competitive with other recent systems.

Appendix: Derivation of the Neuron Dynamics in Terms of Neuron Activation Probability

For the dynamics 2.1 with the above-described interconnection Tij, the probability p(t + 1) that a neuron is activated at time (t + 1) can be approximated by the product of the probability pA(t + 1) that it receives enough input to exceed threshold and the probability pB(t + 1) that the neuron is not refractory. Using equation 2.3, we get in the limit s → ∞:

p(t+1) = p_A(t+1)\,p_B(t+1)
       = \Big( \int_{s/2}^{\infty} f_g(x)\,dx \Big)\,(1 - p(t))
       = \Big( \frac{1}{\sqrt{2\pi}\,\sigma} \int_{s/2}^{\infty} e^{-\frac{1}{2}\left(\frac{x-a}{\sigma}\right)^2} dx \Big)\,(1 - p(t)), \qquad \big[a = s\,p(t),\ \sigma = \sqrt{s\,p(t)(1 - p(t))}\big]
       = \Big( \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(a - s/2)/\sigma} e^{-\frac{1}{2}x^2} dx \Big)\,(1 - p(t))
       = \Phi_s\!\left( \frac{p(t) - \frac{1}{2}}{\sqrt{p(t)\,(1 - p(t))}} \right) (1 - p(t)),

where

\Phi_s(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\sqrt{s}\,x} e^{-\frac{1}{2}y^2}\,dy.

The approximation has proven to be applicable even for relatively low neuron numbers m and relatively small values of s.
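The closed form above can be checked numerically against the exact binomial firing probability it approximates. The sketch below is illustrative (the function names are ours, not the paper's); it assumes, as in the derivation, that a neuron fires when at least s/2 of its s inputs are active:

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_next_gauss(p, s):
    """Gaussian approximation from the appendix: p(t+1) as a function of p(t)."""
    arg = math.sqrt(s) * (p - 0.5) / math.sqrt(p * (1.0 - p))
    return phi(arg) * (1.0 - p)

def p_next_exact(p, s):
    """Exact counterpart: P(Binomial(s, p) >= s/2) * (1 - p)."""
    tail = sum(math.comb(s, k) * p**k * (1.0 - p)**(s - k)
               for k in range(math.ceil(s / 2), s + 1))
    return tail * (1.0 - p)
```

For s = 100 and p(t) = 0.55 the two values agree to within about 0.01, and the agreement improves as s grows, consistent with the limit s → ∞ in which the approximation is derived.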
Acknowledgments

We thank Rolf P. Würtz for many discussions and suggestions and Jan D. Bouecke for helping us implement parts of the system. Partial funding by the EU in the RTN MUHCI (HPRN-CT-2000-00111), the German BMBF in the project LOKI (01 IN 504 E9), and the Körber Prize awarded to C. v. d. M. in 2000 is gratefully acknowledged.

References

Anninos, P. A., Beek, B., Csermely, T. J., Harth, E. M., & Pertile, G. (1970). Dynamics of neural structures. J. Theor. Biol., 26, 121–148.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology. Perception, 1, 371–394.
Budd, J. M. L., & Kisvarday, Z. F. (2001). Local lateral connectivity of inhibitory clutch cells in layer 4 of cat visual cortex. Exp. Brain Res., 140(2), 245–250.
Buxhoeveden, D. P., & Casanova, M. F. (2002). The minicolumn and evolution of the brain. Brain, Behavior and Evolution, 60, 125–151.
Charles, D., & Fyfe, C. (1998). Modelling multiple-cause structure using rectification constraints. Network: Computation in Neural Systems, 9, 167–182.
Constantinidis, C., Franowicz, M. N., & Goldman-Rakic, P. S. (2001). Coding specificity in cortical microcircuits: A multiple-electrode analysis of primate prefrontal cortex. J. Neuroscience, 21, 3646–3655.
Dayan, P., & Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, 7, 565–579.
DeFelipe, J., Hendry, S. H. C., Hashikawa, T., Molinari, M., & Jones, E. G. (1990). A microcolumnar structure of monkey cerebral cortex revealed by immunocytochemical studies of double bouquet cell axons. Neuroscience, 37, 655–673.
DeFelipe, J., Hendry, S. H. C., & Jones, E. G. (1989). Synapses of double bouquet cells in monkey cerebral cortex. Brain Res., 503, 49–54.
Elston, G. N., & Rosa, M. P. G. (2000). Pyramidal cells, patches, and cortical columns: A comparative study of infragranular neurons in TEO, TE, and the superior temporal polysensory areas of the macaque monkey. J. Neuroscience, 20, 1–5.
Favorov, O. V., & Diamond, M. (1990). Demonstration of discrete place-defined columns, segregates, in cat SI. J. Comparative Neurology, 298, 97–112.
Favorov, O. V., & Kelly, D. G. (1996). Stimulus-response diversity in local neuronal populations of the cerebral cortex. Neuroreport, 7(14), 2293–2301.
Favorov, O. V., & Whitsel, B. L. (1988). Spatial organization of the peripheral input to area 1 cell columns I. The detection of "segregates". Brain Res., 472, 25–42.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 165–170.
Fransen, E., & Lansner, A. (1998). A model of cortical associative memory based on a horizontal network of connected columns. Network: Computation in Neural Systems, 9(2), 235–264.
Frey, B. J., Dayan, P., & Hinton, G. E. (1997). A simple algorithm that discovers efficient perceptual codes. In M. Jenkin & L. R. Harris (Eds.), Computational and psychophysical mechanisms of visual coding. Cambridge: Cambridge University Press.
Fukai, T. (1994). A model of cortical memory processing based on columnar organization. Biological Cybernetics, 70, 427–434.
Fyfe, C. (1997). A neural network for PCA and beyond. Neural Processing Letters, 6, 33–41.
Harpur, G. F., & Prager, R. W. (1996). Development of low entropy coding in a recurrent network. Network: Computation in Neural Systems, 7, 277–284.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The "wake-sleep" algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. Royal Soc. London, Series B, Biological Sciences, 352, 1177–1190.
Hochreiter, S., & Schmidhuber, J. (1999). Feature extraction through LOCOCODE. Neural Computation, 11, 679–714.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625–633.
Jones, E. G. (2000). Microcolumns in the cerebral cortex. PNAS, 97, 5019–5021.
Lao, R., Favorov, O. V., & Lu, J. P. (2001). Nonlinear dynamical properties of a somatosensory cortical model. Information Science, 132, 53–66.
Lübke, J., Egger, V., Sakmann, B., & Feldmeyer, D. (2000). Columnar organization of dendrites and axons of single and synaptically coupled excitatory spiny neurons in layer 4 of the rat barrel cortex. J. Neuroscience, 20, 5300–5311.
Lücke, J., von der Malsburg, C., & Würtz, R. P. (2002). Macrocolumns as decision units. In Artificial Neural Networks—ICANN 2002, LNCS 2415 (pp. 57–62). New York: Springer-Verlag.
Marshall, J. A. (1995). Adaptive perceptual pattern recognition by self-organizing neural networks: Context, uncertainty, multiplicity, and scale. Neural Networks, 8, 335–362.
Mountcastle, V. B. (1997). The columnar organization of the neocortex. Brain, 120, 701–722.
Nowak, L. G., & Bullier, J. (1997). The timing of information transfer in the visual system. Cerebral Cortex, 12, 205–241.
O'Reilly, R. C. (2001). Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Computation, 13, 1199–1241.
Perez-Orive, J., Mazor, O., Turner, G. C., Cassenaer, S., Wilson, R. I., & Laurent, G. (2002). Oscillations and sparsening of odor representations in the mushroom body. Science, 297(5580), 359–365.
Peters, A., Cifuentes, J. M., & Sethares, C. (1997). The organization of pyramidal cells in area 18 of the rhesus monkey. Cerebral Cortex, 7, 405–421.
Peters, A., & Sethares, C. (1996). Myelinated axons and the pyramidal cell modules in monkey primary visual cortex. J. Comp. Neurology, 365, 232–255.
Peters, A., & Sethares, C. (1997). The organization of double bouquet cells in monkey striate cortex. J. Neurocytology, 26, 779–797.
Peters, A., & Yilmaze, E. (1993). Neuronal organization in area 17 of cat visual cortex. Cerebral Cortex, 3, 49–68.
Reichardt, W., Poggio, T., & Hausen, K. (1983). Figure-ground discrimination by relative movement in the visual system of the fly. Biol. Cybern., 46 (Suppl.), 1–30.
Saund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation, 7, 51–71.
Singer, W. (2003). Synchronization, binding and expectancy. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1136–1143). Cambridge, MA: MIT Press.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neuroscience, 15, 5448–5465.
Spratling, M. W., & Johnson, M. H. (2002). Preintegration lateral inhibition enhances unsupervised learning. Neural Computation, 14, 2157–2179.
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends in Neuroscience, 17, 119–126.
Thorpe, S. (1988). Identification of rapidly presented images by the human visual system. Perception, 17(A77), 415.
Thorpe, S., Fize, F., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Wespatat, V., Tennigkeit, F., & Singer, W. (2003). Phase sensitivity of Hebbian modifications in oscillating cells of rat visual cortex. Manuscript submitted for publication.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Yu, A. J., Giese, M. A., & Poggio, T. A. (2002). Biophysiologically plausible implementations of the maximum operation. Neural Computation, 14, 2857–2881.
Received March 25, 2003; accepted August 4, 2003.
LETTER
Communicated by Helge Ritter
Temporally Asymmetric Learning Supports Sequence Processing in Multi-Winner Self-Organizing Maps

Reiner Schulz
[email protected] James A. Reggia
[email protected] Departments of Computer Science and Neurology, and UMIACS, University of Maryland, College Park, MD 20742, U.S.A.
We examine the extent to which modified Kohonen self-organizing maps (SOMs) can learn unique representations of temporal sequences while still supporting map formation. Two biologically inspired extensions are made to traditional SOMs: selection of multiple simultaneous rather than single "winners," and the use of local intramap connections that are trained according to a temporally asymmetric Hebbian learning rule. The extended SOM is then trained with variable-length temporal sequences that are composed of phoneme feature vectors, with each sequence corresponding to the phonetic transcription of a noun. The model transforms each input sequence into a spatial representation (the final activation pattern on the map). Training improves this transformation by, for example, increasing the uniqueness of the spatial representations of distinct sequences, while still retaining map formation based on input patterns. The closeness of the spatial representations of two sequences is found to correlate significantly with the sequences' similarity. The extended model presented here raises the possibility that SOMs may ultimately prove useful as visualization tools for temporal sequences and as preprocessors for sequence pattern recognition systems.

1 Introduction

A self-organizing map (SOM) is an artificial neural network whose nodes are usually arranged in a two-dimensional grid. The SOM learns, in an unsupervised fashion, an organized transformation that converts a pattern from a typically high-dimensional input space to a pattern or map across the discrete surface of the grid. There currently exist two qualitatively different types of SOMs (see Table 1), both of which are actively in use. The first type, iterative multi-winner SOMs, has been largely of interest in theoretical neuroscience (Cho & Reggia, 1994; Pearson, Finkel, & Edelman, 1987; Sutton, Reggia, Armentrout, & D'Autrechy, 1994; von der Malsburg, 1973).
Neural Computation 16, 535–561 (2004)
© 2004 Massachusetts Institute of Technology

536
R. Schulz and J. Reggia

Table 1: Typical Features of the Two Types of Self-Organizing Maps.

                            Iterative Multi-Winner              One-Step Single-Winner
Seminal work                von der Malsburg (1973)             Kohonen (1982)
Primary applications        Neuroscience: modeling              Computer science: data
                            sensorimotor cortex                 visualization, speech processing
Input-to-map connectivity   Divergent, but localized            Full
Intramap connectivity       Lateral (excite immediate           None (implicit neighborhoods)
                            neighbors, inhibit more
                            distant ones)
Activation dynamics         Multiple winners: nonlinear         Single global winner: node most
                            differential equations              activated by the input
Learning rule               Hebbian/competitive                 Hebbian/competitive
Computational cost          High                                Low
Memory capacity             High                                Low
Examples                    Cho & Reggia (1994); Pearson        Callan et al. (1999); Kaski et al.
                            et al. (1987); Sutton et al.        (1998); Kohonen (1982); Kokkonen
                            (1994); von der Malsburg (1973)     and Torkkola (1990); Principe
                                                                et al. (1998)

The activity of each map node is the result of a relatively neurobiologically plausible, albeit computationally slow, iterative process during which laterally connected map nodes that are close to one another compete for activation. As a result, an activation pattern emerges in which multiple separated clusters of nodes (the "winners") become active in response to an input pattern, so a distributed (or "coarse coding") representation is being used. The second and more widely used type, one-step single-winner SOMs, is more oriented toward computational applications and less toward neuroscience (Callan, Kent, Roy, & Tasko, 1999; Kaski, Honkela, Lagus, & Kohonen, 1998; Kohonen, 1982; Kokkonen & Torkkola, 1990). These models are characterized by a significantly more efficient but neurobiologically implausible mechanism for computing the SOM's response to an input: the node that the input initially activates the most is declared the lone winner of a global competition, and it alone represents the input pattern, so a local representation is being used.

The majority of past work on SOMs of both classes, as well as related non-SOM methods (Bishop, Svensén, & Williams, 1998), has involved static, that is, time-invariant, input patterns, where a map's activation pattern in response to one input is not influenced by previous inputs. The results of these studies do not carry over directly to temporal sequences of inputs, a significant shortcoming given that sequential inputs are very common (e.g., language, motion in visual fields, motor feedback). In response to this problem, several extensions to the basic SOM method have been proposed during the past decade to support temporal sequence processing. These extended SOMs are very diverse, so we consider them first in terms of the tasks they address and second in terms of the methodologies they adopt.
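The one-step single-winner computation is simple enough to state in a few lines. The sketch below is ours, not from either paper, and its parameter values are illustrative: the winner is the node whose weight vector is nearest the input, and the standard Kohonen update then pulls the winner and its grid neighbors toward that input.

```python
import math

def find_winner(x, weights):
    """One-step single-winner selection: index of the node whose weight
    vector has the smallest squared Euclidean distance to input x."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))

def kohonen_update(x, weights, coords, winner, lr=0.1, radius=1.0):
    """Pull each node toward x, weighted by a Gaussian neighborhood
    function of its grid distance to the winner (updates in place)."""
    wy, wx = coords[winner]
    for j, (gy, gx) in enumerate(coords):
        h = math.exp(-((gy - wy) ** 2 + (gx - wx) ** 2) / (2.0 * radius ** 2))
        weights[j] = [wj + lr * h * (xi - wj) for wj, xi in zip(weights[j], x)]
```

A multi-winner variant of the kind discussed later would replace `find_winner` with a selection of several locally maximal nodes while keeping the one-step, noniterative character of the computation.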
Multi-Winner Maps
537
The specific temporal processing tasks that have been addressed are prediction, recall, recognition, and representation. Prediction is concerned with the accurate computation of the next element in a sequence from previously observed sequence elements. In Principe, Wang, and Motter (1998), for example, an SOM was successful at predicting artificial chaotic time series as well as controlling a wind tunnel, which required the prediction of wind speed changes. In Rao and Sejnowski (2000), an SOM-like network of two recurrently connected chains of neurons learned to predict the next in a series of left-to-right or right-to-left moving stimuli. The recall task takes prediction a step further, requiring that the SOM reproduce all elements of a sequence in the correct temporal order when given an initial cue, for example, the first element of the sequence. This has been accomplished in Kopecz (1995) and Abbott and Blum (1996) with 2D fully laterally connected SOMs for one or two low-dimensional sequences. In Gerstner, Ritz, and van Hemmen (1993), a fully laterally connected network of 1000 nodes (not arranged according to any topology) was shown to be capable of storing and retrieving four sequences, and its theoretical capacity was estimated at 100 sequences. Recognition of temporal sequences has generally focused on identifying a given input sequence as a member of a class by mapping it onto a particular map location or locations that correspond to class prototypes learned from previously seen sequences. There have been many efforts to achieve this (Chappell & Taylor, 1993; Euliano & Principe, 1999; Kangas, 1990; Somervuo, 1999, 2003; Varsta, Millan, & Heikkonen, 1997; Wiemer, 2003).
Finally, and most directly related to our work, is the problem of transforming temporal sequences into relatively unique spatial representations, that is, into relatively unique final activation patterns on the map grid that represent the sequences and thus might be viewed as reminiscent of "cell assemblies" (Hebb, 1949). Such a time-to-space representation may be beneficial in data visualization and as an initial input processing step in a larger neural system for sequence recognition (Chappell & Taylor, 1993). To our knowledge, the only other study to address this task was that of James and Miikkulainen (1995), but several of the sequence recognition models above are also necessarily concerned with how the prototypes are arranged on the map relative to one another.

These past temporal sequence processing SOMs can also be viewed from the perspective of the diverse methodologies they have proposed. The simplest approach has been to leave the original one-step single-winner SOM model untouched and to preprocess sequential inputs via an external short-term memory. For example, in some studies, a fixed number of successive input patterns were concatenated to form a single static pattern (Kangas, 1990). Others have suggested averaging the patterns in a sequence over time and feeding the average as a static pattern to the network (Carpinteiro, 1999). However, these approaches assume that the range of interpattern relations across time is quite limited. Many other forms of short-term memory are reviewed elsewhere (Barreto, Araujo, & Kremer, 2003; Mozer, 1993). Another approach has employed leaky-integrator or other temporal neuron models as the map nodes (Chappell & Taylor, 1993; Varsta et al., 1997), while yet another idea has been to capture temporal relations in the input via lateral connections between the map nodes, rendering the SOM a truly recurrent neural network (Abbott & Blum, 1996; Gerstner et al., 1993; Kopecz, 1995; Rao & Sejnowski, 2000). Finally, in Euliano and Principe (1999) and Wiemer (2003), spreading wavefronts of activation (activity diffusion) were used to alter the typical activation dynamics of the one-step single-winner SOM so that learning is characteristically affected by the temporal order of the inputs. At present, there is no general consensus as to how best to process sequences with SOMs, and this topic remains a very active focus of current neurocomputational research (Barreto et al., 2003).

In this context, the goal of our work is to extend one-step single-winner SOMs in biologically plausible ways to make them more effective in processing and representing large sets of variable-length sequences. Unlike most past related work described above, we focus solely on the task of developing a unique spatial representation for each of the input sequences, with the idea that this is also a precursor for effective pattern recognition. To achieve our goal, we extend traditional one-step single-winner SOMs (see Table 1) in two biologically motivated ways. First, our model supports map formation in the presence of multiple simultaneous winners rather than a single winner. Multiple islands of simultaneous activation (i.e., multiple "winners") are often observed in neocortical regions (e.g., Donoghue & Sanes, 1992; Georgopoulos, Kettner, & Schwartz, 1988). Such coarse coding in principle allows for the spatial encoding or representation of a much larger number of sequences.
However, for computational efciency and unlike past multi-winner SOMs, we retain the one-step, noniterative selection of winners, making our model a one-step multi-winner SOM that can be distinguished from both classes of previously studied SOMs. Second, as a mechanism for supporting sequence processing, we introduce into SOMs for the rst time the use of temporally asymmetric Hebbian learning to train local, or range-limited, intramap connections. These local lateral connections are very different from those used in past multi-winner SOMs: they are not used to produce a “Mexican hat” pattern of lateral interactions, and they are adaptive. Further, their adaptation is temporally asymmetric in a fashion inspired by recent experimental evidence showing that changes in biological synaptic efcacy in cortex (Markram, Luebke, Frotscher, & Sakmann, 1997) and other structures of the brain (Bi & Poo, 1998, 2001; Zhang, Tao, Holt, Harris, & Poo, 1998) are sometimes due to temporally asymmetric Hebbian learning: a synapse is strengthened (long-term potentiation, LTP) if presynaptic action potentials precede excitatory postsynaptic potentials by typically 20 to 50 ms, and weakened (long-term depression, LTD) if the time course is reversed. While a few past modeling studies have used temporally asymmetric Hebbian learning to store and retrieve sequences (Abbott & Blum, 1996;
Multi-Winner Maps
Gerstner et al., 1993; Rao & Sejnowski, 2000), these past studies were not concerned with either map formation or the transformation of sequences into spatial representations as we consider here. Our model can be distinguished from that of James and Miikkulainen (1995), which successfully dealt with the representation task but did not use multi-winner SOMs, lateral connectivity, or temporally asymmetric Hebbian learning as we do here. Our approach is also very different from the pattern recognition system of Somervuo (1999), which, after initial training of a standard one-shot single-winner SOM on nonsequential inputs, uses an external construction algorithm to convert the SOM into a network with connections between arbitrarily distant nodes (i.e., its lateral connections are neither local nor learned with temporally asymmetric Hebbian learning). In summary, the fundamental hypothesis we examine in this article is that training a one-step multi-winner SOM whose short-range lateral connections undergo temporally asymmetric Hebbian learning transforms variable-length temporal sequences into reasonably unique spatial patterns of activity, even while map formation of the input space persists.

2 Methods

While our model is very general, to assess its functionality, we use specific sequences of feature vectors. Each vector in a sequence is the feature-based encoding of an English phoneme. Each sequence corresponds to the phonetic transcription, based on the NetTalk corpus (Sejnowski & Rosenberg, 1987), of a 2- to 10-phoneme noun naming an object, taken from the widely used Snodgrass-Vanderwart word corpus (Snodgrass & Vanderwart, 1980). For example, /h ə l ɪ k a p t ə r/ is the phonetic sequence transcription of helicopter, and /p/, the fourth-from-last phoneme in the transcription, is equivalent to a distinct tuple of 34 binary feature values: [consonantal = 1, vocalic = 0, compact = 0, diffuse = 1, grave = 1, acute = 0, nasal = 0, oral = 1, tense = 1, lax = 0, …].
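To make the encoding concrete, here is a minimal sketch of mapping a phoneme's present features to a binary vector. The feature subset and the helper name `encode` are illustrative assumptions; the actual model uses all 34 features listed in the appendix.

```python
# Sketch: a phoneme as a binary distinctive-feature vector.
# Hypothetical feature subset; the full model uses the 34 features
# tabulated in the appendix.
FEATURES = ["consonantal", "vocalic", "compact", "diffuse", "grave",
            "acute", "nasal", "oral", "tense", "lax"]

def encode(present):
    """Map the set of present features to a 0/1 vector over FEATURES."""
    return [1.0 if f in present else 0.0 for f in FEATURES]

# /p/-like example from the text: consonantal, diffuse, grave, oral, tense
p_vec = encode({"consonantal", "diffuse", "grave", "oral", "tense"})
```

Each such vector is then normalized (see the appendix) before being presented as an input pattern.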
See the appendix for a complete set of phoneme encodings and further details about the input data. In this context, the SOM's task is the unsupervised acquisition of an internal representation for the spoken names of a set of objects, the representation for each name ideally being unique. Initially, before the first vector of an input sequence is presented to the SOM, all map nodes are inactive. From this initial state, the activation pattern develops deterministically at discrete time steps (one time step per input phoneme vector, hence "one-step") based on the current input vector and the activation pattern at the previous time step. This implies that, for example, in the case of bow (/b o/) and bowl (/b o l/), after the input of /o/, the respective activation patterns are identical. For bow, this is the final activation pattern, and hence its spatial representation. According to our hypothesis, the last feature vector /l/ of bowl should trigger a change in the
R. Schulz and J. Reggia
Figure 1: Network structure. (A) Global architecture. (B) A map node with its weighted connections from the input layer and from the map nodes in its immediate 8-neighborhood. The different widths of the solid lines (arcs) indicate that in general, the efficacies of the connections differ.
map activation pattern of the trained one-step multi-winner SOM so that the spatial representation for bowl differs from that of bow. Figure 1a shows the basic architecture of our one-step multi-winner SOM. The recurrent network transforms an input sequence of patterns into a final single static output pattern (the sequence's spatial representation) where each component of the output corresponds to a node of the map. The map itself consists of a regular, rectangular grid of R rows by C columns of Q = R·C nodes. We measure the distance on the map between two nodes i and i′ at positions (r, c) and (r′, c′) using the box-distance metric, d(i, i′) = max(|r − r′|, |c − c′|). As illustrated in Figure 1b, an arbitrary map node i receives a connection from each node of the input layer as well as from each map node within a circumscribed connection neighborhood, N_conn(i) = {j | d(i, j) ≤ r_conn}, centered at and including i. Every connection to i carries a real-valued synaptic weight indicating its efficacy. If the input layer consists of P nodes, then the connection to the ith map node from the jth input is weighted by w_ij, and w_i ∈ ℝ₊^P is the afferent weight vector of i. Analogously, the weights on the incoming lateral connections of map node i are stored in the lateral weight vector v_i ∈ ℝ₊^Q. In particular, v_ij corresponds to the weight on the lateral connection from j to i (d(i, j) ≤ r_conn), v_ii = β ∈ ℝ is an immutable weight on the self-connection of i, and v_ij = 0 if i and j are not connected (d(i, j) > r_conn). The level of activation of an arbitrary input or map node ranges between 0 (inactive) and 1 (fully active). The activation levels of all P input nodes compose the afferent vector x ∈ [0, 1]^P, normalized to be of unit length. Similarly, the activation levels of all Q map nodes form a vector y ∈ [0, 1]^Q, the activation pattern of the map. For any map node i, only those components of
y that correspond to activation levels of map nodes in i's connection neighborhood N_conn(i) are directly relevant, since i receives lateral connections only from those nodes. The activation levels of all nodes are updated synchronously at discrete time steps, one step per input vector in a sequence. Thus, the afferent input vector as well as the map activation pattern are time variant. Given an input sequence X = x(1), …, x(k), the map activation pattern, initialized as y(0) = 0, evolves over a period of k time steps. The final map activation pattern y(k) is then said to be the spatial representation of temporal sequence X. At the beginning of time step t ≥ 1, the net input h is computed independently for each map node i as

h_i(t) = α w_i^T x(t) + (1 − α) v_i^T y(t − 1),   (2.1)
where the fixed parameter α ∈ [0, 1] determines the relative contributions of afferent versus lateral input vectors, and T indicates the transpose of the column vectors w_i and v_i. To compute y(t) from h(t), we use a computationally efficient one-step mechanism that approximates the competitive activation dynamics (Mexican hat pattern) that has been implemented in some past iterative multi-winner SOMs via a computationally expensive numerical solution of differential equations (Reggia, D'Autrechy, Sutton, & Weinrich, 1992). However, unlike traditional one-step SOM models, multiple winners occur: every map node i that receives a net input greater than that of all other map nodes within i's connection neighborhood is taken to be a winner. Since parameter r_conn is usually chosen to be small relative to the size of the entire map, typically multiple winner nodes exist. Each winner is then made the center of a "peak" of activation. The distribution of activation within a single peak is such that winner node i at its center is maximally active (y_i = 1), and the activation level of each nonwinner node j within i's connection neighborhood decreases exponentially with increasing distance between j and i. The activation peak centered at i does not extend beyond the connection neighborhood of i. However, two or more peaks may partially overlap, in which case their contributions to the activation level of a map node in the region of overlap are added, but may not exceed 1. Specifically, the set V(t) of winner nodes at time t is

V(t) = {i | ∀j ≠ i: j ∈ N_conn(i) ⇒ h_j(t) < h_i(t)},   (2.2)

and the activation of map node j is

y_j(t) = min(1, Σ_{i ∈ V(t)} c_ij(t)),  with c_ij(t) = γ^{d(i,j)} if j ∈ N_conn(i), and 0 otherwise,   (2.3)

where γ ∈ [0, 1] determines the shape of each peak of activation (smaller γ means faster drop-off).
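The one-step activation update of equations 2.1 through 2.3 can be sketched as follows. This is a minimal NumPy sketch under assumed toy dimensions; the node indexing i = row·C + col and all parameter values are assumptions of the sketch, not part of the model specification.

```python
import numpy as np

def box_dist(i, j, C):
    """Box distance between map nodes indexed i = row*C + col."""
    (r1, c1), (r2, c2) = divmod(i, C), divmod(j, C)
    return max(abs(r1 - r2), abs(c1 - c2))

def step(x, y_prev, W, V, C, alpha=0.5, gamma=0.5, r_conn=1):
    """One activation time step: eq. 2.1 (net input), eq. 2.2 (winner
    selection), eq. 2.3 (peaked activation, clipped at 1)."""
    h = alpha * (W @ x) + (1 - alpha) * (V @ y_prev)          # eq. 2.1
    Q = len(h)
    nbrs = [[j for j in range(Q) if box_dist(i, j, C) <= r_conn]
            for i in range(Q)]
    # Winners: strictly greater net input than every other node in
    # their own connection neighborhood (eq. 2.2).
    winners = [i for i in range(Q)
               if all(h[j] < h[i] for j in nbrs[i] if j != i)]
    y = np.zeros(Q)
    for i in winners:                                         # eq. 2.3
        for j in nbrs[i]:
            y[j] += gamma ** box_dist(i, j, C)
    return np.minimum(y, 1.0)
```

Iterating `step` over the vectors of a sequence yields the final pattern y(k), the sequence's spatial representation.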
To test our central hypothesis—that our model learns to spatially represent the sequences in the training set fairly uniquely—we use the 1-norm to quantify the difference between two final activation patterns y and y′: d(y, y′) = ||y − y′||_1 = Σ_{i=1}^{Q} |y_i − y′_i|. Using distance measure d, we assess the overall performance of our model by measuring, over all distinct sequences X (of length k) and X′ (of length k′) from the training set, the distance between the spatial representations of X and X′. We use three quantitative measures of how the model performs overall in separating different sequences into unique final spatial representations. First, we count the number of pairs of distinct sequences in the training set for which the model ends up with the same final map activation pattern: |Z| = |{{X, X′}: X ≠ X′, d(y_X(k), y_{X′}(k′)) = 0}|. The model uniquely represents all sequences if |Z| = 0; otherwise, there are pairs of distinct sequences that the model "confuses." The second measure is the minimum distance d_min between two spatial representations computed over all pairs of distinct sequences in the training set: d_min = min_{X ≠ X′} d(y_X(k), y_{X′}(k′)). Notice that d_min = 0 for as long as |Z| > 0 and |Z| = 0 as soon as d_min > 0, and that d_min and |Z| are often complementary, not redundant. Training could, for example, significantly increase d_min from a pretraining value already greater than zero, while |Z| remains 0. Or training may decrease |Z| to a smaller value still greater than zero, while d_min remains 0. Finally, we measure the average distance between two spatial representations computed over all pairs of distinct sequences in the training set: d̄ = (1/|S|) Σ_{{X,X′}, X ≠ X′} d(y_X(k), y_{X′}(k′)), where |S| is the number of sequences in the training set.
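The three separation measures can be sketched as below. This is a minimal sketch; for simplicity it averages the pairwise distances over the pairs themselves rather than normalizing by the number of sequences as in the text.

```python
from itertools import combinations

def one_norm(y, y2):
    """1-norm distance between two final activation patterns."""
    return sum(abs(a - b) for a, b in zip(y, y2))

def separation_stats(reps):
    """reps: list of final map activation patterns, one per sequence.
    Returns (|Z|, d_min, mean pairwise distance)."""
    dists = [one_norm(a, b) for a, b in combinations(reps, 2)]
    n_confused = sum(d == 0 for d in dists)   # |Z|: identical patterns
    return n_confused, min(dists), sum(dists) / len(dists)
```

Note that, as in the text, d_min = 0 exactly when the confusion count is positive.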
Before training, each weight is independently initialized with a random value from the interval [0, 1], each afferent weight vector is normalized to unit length, and each lateral weight vector is normalized such that ∀i: Σ_{j ≠ i} v_ij = 1. The one-step multi-winner SOM learns by adjusting its weights in response to the repeated input of all temporal sequences in the training set in random order. The number of training epochs is 1000. The input of a single arbitrary temporal sequence of length k causes the SOM to pass through k time steps. At the end of each time step t, after the construction of activation pattern y(t), the weights of the SOM are modified. For the afferent weight vector w_i of the ith map node, the learning rule is:

w_i(t) = w_i(t − 1) + μ y_i(t) x(t),   (2.4)
w_i(t) = w_i(t) / ||w_i(t)||_2.        (2.5)

Equation 2.4 implements typical temporally symmetric Hebbian (or competitive) learning, where μ ∈ (0, 1] is the afferent learning rate. Renormalization in equation 2.5 restricts w_i to move across the surface of the unit hypersphere, generally in the direction of the current afferent input x(t). In contrast, the learning rule for the lateral weights is unusual in being a temporally asymmetric variant of Hebbian learning. As noted earlier, recent
experimental studies have found this learning rule to sometimes govern changes in synaptic efficacy in cortex (Markram et al., 1997) and other parts of the brain (Bi & Poo, 1998, 2001; Zhang et al., 1998). Given two map nodes i and j where 0 < d(i, j) ≤ r_conn, the efficacy of the connection v_ij to i from j at time t is increased proportional to y_j(t − 1), the activity of j at the previous time step, times max(0, y_i(t) − y_i(t − 1)), the increase in the activity of i relative to the previous time step:
v_ij(t) = v_ij(t − 1) + λ y_j(t − 1) max(0, y_i(t) − y_i(t − 1)),   (2.6)

where λ is the lateral learning rate.

Figure 5: Lateral connections with weights greater than 0.2 are shown as arrows pointing from the source toward the destination node. The length of an arrow is proportional to the square root of the connection weight. The distance of the destination node equals the number of concentric arcs at the arrow's base. The arrow is black if it points from a vowel to a consonant or vice versa; it is white if it points from a vowel (consonant) to a vowel (consonant). The pattern of strong lateral connections suggests that they represent frequent phoneme transitions in the training sequences. In the training sequences, a vowel is almost always followed by a consonant, and on the map, most connections originating at vowels indeed terminate at consonants (black arrows).
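A minimal sketch of the two weight updates of section 2: the Hebbian afferent rule with renormalization (equations 2.4 and 2.5), and a lateral update following the prose description of the temporally asymmetric rule. The learning rate `lam` and the renormalization of the incoming lateral weights are assumptions of this sketch, not the paper's exact equations 2.6 and 2.7.

```python
import numpy as np

def update_afferent(W, x, y, mu=0.1):
    """Hebbian afferent update (eq. 2.4) followed by renormalizing each
    node's afferent weight vector to unit length (eq. 2.5)."""
    W = W + mu * np.outer(y, x)
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def update_lateral(V, y_prev, y, mask, lam=0.1):
    """Temporally asymmetric lateral update (sketch): strengthen v_ij in
    proportion to j's activity at t-1 times the increase in i's activity
    at t (LTP-like). mask[i, j] = 1 for connected pairs i != j. The
    renormalization mirroring the initialization constraint
    (sum over j != i of v_ij = 1) is an assumption."""
    V = V + lam * np.outer(np.maximum(0.0, y - y_prev), y_prev) * mask
    s = V.sum(axis=1, keepdims=True)
    s = np.where(s > 0.0, s, 1.0)   # avoid dividing rows with no weight
    return V / s
```

Note the asymmetry: activity at j followed by rising activity at i strengthens the j-to-i connection, while the reverse temporal order leaves it unchanged.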
Table 3: Correlation of Lateral Weights and Phoneme Transition Frequency.

                   60 Sequences,       175 Sequences,
                   30 × 20 Nodes       40 × 30 Nodes
  Pretraining      −0.0664 (0.0165)    −0.0536 (0.0141)
  Posttraining      0.6939 (0.0277)     0.6263 (0.0311)
low as 0.2, that is, even if the afferent inputs have very little influence on the activation dynamics of the model. Figure 5 also shows all lateral connections whose weights have increased significantly during training.¹ A visual inspection suggests that nodes sensitive to vowels tend to send strong connections to nodes sensitive to consonants. It is one of the properties of the training set that in all but three cases (out of a total of 222), a vowel in a sequence is followed by a consonant. This gives rise to the hypothesis that strong lateral connections coincide with frequent transitions from a particular input phoneme to a particular next input phoneme in the sequences of the training set. To test this, we measured the correlation between lateral connection weights and phoneme transition frequencies. We recorded, for each possible input phoneme transition x(t) to x(t + 1), the sum of the weights on all lateral connections from a node i to a node j (i ≠ j) where x(t) maximizes w_i^T x(t) and x(t + 1) maximizes w_j^T x(t + 1). We then paired each greater-than-zero sum² with the absolute frequency with which the respective phoneme transition x(t), x(t + 1) occurs in the training sequences. These pairs are the data points from which the correlation coefficient is computed. This was done repeatedly and independently for both a small (30 × 20 nodes trained with 60 distinct sequences; 20 independent experiments) and a large (40 × 30 nodes trained with 175 distinct sequences; 7 independent experiments) version of our model, prior to and after training. Table 3 summarizes the results by providing the mean (and standard deviation) of each correlation coefficient, as estimated from the results of the respective number of independent experiments. Prior to training, the then random lateral connection weights are not correlated with phoneme transition frequencies.
After training, the two quantities are very highly positively correlated, lending strong support to our hypothesis.

3.4 Representation Distance and Sequence Similarity. We now consider whether similar input sequences are transformed into similar spatial

¹ The threshold is 0.2. Prior to training, for a map of 40 × 30 (30 × 20) nodes, the mean lateral weight is 0.0008 (0.0017) with a standard deviation of 0.0040 (0.0058).
² Lateral connections with a weight equal to zero are considered nonexistent. Hence, sums equal to zero are excluded from the analysis.
representations on the map. To measure the similarity of spatial representations (final activation patterns), we use both the 1-norm distance d that we have used all along plus a winner separation distance. Recall that at the end of training, when the parameter γ determining activation peak widths approaches zero (see equation 2.3 and γ_n in Table 2), only the winner nodes are significantly (and fully) active. Let the (row, column) positions of the winner nodes in the spatial representation y following the final phoneme of one input sequence be (r_1, c_1), (r_2, c_2), …, (r_k, c_k), and the positions in y′ for a different input sequence be (r′_1, c′_1), …, (r′_{k′}, c′_{k′}), and without loss of generality take k ≥ k′. We then define the winner separation distance d_sep between y and y′ to be the average distance on the map from a winner node in y to the closest winner node in y′: d_sep = (1/k) Σ_{i=1}^{k} min_{1 ≤ j ≤ k′} (|r_i − r′_j| + |c_i − c′_j|). For comparison purposes, we also need a measure or measures of the similarity of any two input sequences of phonemes used for training. In general terms, the similarity (dissimilarity) of two sequences is typically measured in terms of the two sequences' optimal alignment or "edit distance," and so we adopt this method here. The algorithm for computing an optimal alignment is described in detail in, for example, Gusfield (1997). In short, an alignment of two sequences is a recipe for translating one sequence into the other using essentially two operations: the insertion of a special blank element into a sequence and the substitution of an element in one sequence with an element at the same position in the other sequence. Each substitution operation in an alignment is associated with a cost or score that is a function of the two elements being substituted. The sum over all substitutions in an alignment is the score (cost) of the alignment.
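The winner separation distance d_sep defined above can be sketched as follows (a minimal sketch; winner positions are given as (row, column) pairs, with the first argument the longer list, as in the definition):

```python
def d_sep(winners_a, winners_b):
    """Average city-block distance from each winner node in one
    representation to the closest winner node in the other."""
    return sum(min(abs(r - r2) + abs(c - c2) for (r2, c2) in winners_b)
               for (r, c) in winners_a) / len(winners_a)
```

For example, two representations whose winner peaks sit at nearby grid positions yield a small d_sep, while widely separated peaks yield a large one.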
An optimal alignment maximizes (minimizes) the score (cost) of translating one sequence into the other. To avoid length-based biases, we normalize the score (cost) of each optimal alignment by its length. We adopt the convention that each inserted blank equals the blank's immediate predecessor in the sequence. As all input elements are binary-valued feature vectors, we adopt a Hamming distance cost measure (very similar results were also found using Euclidean distances instead). As a score measure, we use the Tversky feature count (Tversky, 1977; Tversky & Gati, 1978), a well-established method in linguistics for measuring the similarity of phonemes. With this latter measure, if two phonemes are encoded by the feature vectors x and x′, then their similarity equals the difference between the number of features they share and the number of features they do not share: |{i: x_i = x′_i = 1}| − |{i: x_i = 1, x′_i = 0}| − |{i: x_i = 0, x′_i = 1}|. The correlation analysis was performed for each of the four possible combinations of a representation distance measure (d or d_sep) compared to a sequence similarity (Tversky) or dissimilarity (Hamming) measure. Two different size versions of the model were used (30 × 20 nodes and 60 distinct sequences versus 40 × 30 nodes and 175 distinct sequences). The small (large) instance of the model was initialized 20 (7) times with different random initial weights and subsequently trained. In each of these
Table 4: Correlation of Representation Distance and Sequence Similarity.

                          60 Sequences, 30 × 20 Nodes      175 Sequences, 40 × 30 Nodes
  Pattern distance        1-norm        Winner sep.        1-norm        Winner sep.
  measure                 Pre    Post   Pre    Post        Pre    Post   Pre    Post
  Hamming distance        .3284  .2893  .3041  .3779       .3874  .3594  .3863  .4141
  Tversky feature count  −.3438 −.2917 −.3214 −.3641      −.4054 −.3950 −.3946 −.4245

Note: Italic entries indicate a statistically significant (p ≤ 0.01) decrease and bold entries an increase of the posttraining relative to the pretraining absolute level of correlation.
independent experiments, the four correlation coefficients were computed prior to and after training. The correlation coefficients, averaged over the respective number of independent experiments, are listed in Table 4. Overall, these results show that both before and after training, there is a substantial positive correlation between input sequence Hamming distances and their final activation pattern distances, and a substantial negative correlation between input sequence similarities (Tversky's measure) and their final activation pattern distances. Most intriguing is that the magnitudes of the correlations measured in terms of winner node separation are always increased by training.

4 Discussion

Most past work on SOMs has focused on processing nonsequential input patterns and has used Kohonen's approach to map formation. The latter bases learning on a single global winner node for each input pattern and uses a one-step "best match" winner selection process for computational efficiency. While very successful for the nonsequential tasks for which it is intended (Kaski et al., 1998), various past approaches extending such SOMs have been and continue to be developed to process temporal sequences (see section 1).
In this article, we have examined the specific question of the extent to which Kohonen SOMs can be modified to learn a unique spatial representation or encoding of temporal sequences while still retaining their traditional map formation properties. To facilitate sequence processing, we extended the standard one-step single-winner SOM methodology in two ways, both inspired by biological phenomena in cerebral cortex. First, instead of global single-winner activation dynamics, we used multiple simultaneous winner nodes. Such a distributed or coarse representation is motivated by its potential to encode or represent a larger number of temporal sequences. Using multiple local activation peaks like this is also more consistent with activity patterns in the cerebral cortex, and for this reason it has been adopted in several past SOMs directed at modeling neurobiological observations (Cho & Reggia, 1994; Pearson et al., 1987; Sutton et al., 1994; von der Malsburg, 1973). However, unlike these past studies with nonsequence processing tasks, we retained the one-step winner selection of Kohonen SOMs for computational efficiency. The second enhancement we made was to add local intramap lateral connections that undergo temporally asymmetric Hebbian learning. The motivation for this type of connection was to enable the now recurrent map network to capture temporal transitions via lateral shifts in activation peak locations. As discussed above, this extension also derives from biological data that have demonstrated such temporally asymmetric learning experimentally (Bi & Poo, 1998, 2001; Markram et al., 1997; Zhang et al., 1998). Our learning rule (see equations 2.6 and 2.7) intuitively tries to capture and enhance the causal relationships between activation peaks at one time instant and subsequent nearby activation peaks at the next time instant.
With these two extensions, the resulting one-step multi-winner SOM was found to be remarkably effective in developing unique spatial representations (unique final activation patterns across the map) for sizable sets of real-world temporal sequences. Even with the relatively small networks we used, maps could learn unique encodings for almost all of 60 sequences, or 175 sequences for the somewhat larger maps. While not perfect (typically a very few sequences remained confused after training), the learning process clearly and consistently increased the uniqueness of representations over time. As similar input sequences tended to produce similar final activation patterns over the map, not surprisingly the confused input phoneme sequences often were similar, especially in their initial or final subsequences (e.g., ball/bell and spider/sweater). A somewhat unexpected finding, at least to us, was that in spite of the sequential nature of the input, the multiple simultaneous winner nodes, and the lateral intramap connectivity that influenced the selection of winning nodes, well-organized maps of the individual phoneme input patterns still formed. These were reminiscent of maps seen with traditional one-shot (Kohonen) SOMs, with similar phonemes being generally adjacent to one another. For example, there was clear-cut separation of vowel and consonant phonemes
from one another. Of course, since unlike with traditional SOMs our model has multiple simultaneous winner nodes, multiple copies of such maps were present. This finding was very robust to variations in the weighting given to afferent versus lateral connections (parameter α). We believe the findings of this study imply that SOMs have a substantially greater role to play as useful tools for sequence processing than is generally recognized. Still, there is clearly room for future research to improve on the capabilities of SOMs in this regard. Perhaps most important, future theoretical and experimental studies are needed of ways to guarantee the uniqueness of the spatial representations that are learned for similar input sequences. While it might be true that simply using larger networks could resolve this issue, a more satisfying solution would use methods that increase the effectiveness of the time-to-space mapping without enlarging the maps. Some methods that might be explored, which we did not examine here, include the use of noise during training to encourage more separation of the final activation patterns of very similar sequences, or increasing the time span of learning effects on lateral intramap connections from one time step to two or three (reaching back further in time has proven effective in improving supervised sequence learning in some past non-SOM systems).

Appendix: Training Data

The words (nouns) used in this work are derived from the Snodgrass-Vanderwart corpus (Snodgrass & Vanderwart, 1980) and their phonemes based on the NetTalk corpus (Berndt, D'Autrechy, & Reggia, 1994; Sejnowski & Rosenberg, 1987).
The Snodgrass-Vanderwart corpus contains 260 names of physical objects (e.g., apple), from which we eliminated all multiword names (e.g., spool of thread), words for which, in experiments, subjects did not select the "correct" name for the corresponding picture at least 90% of the time (using % Corr(1) in Snodgrass & Yuditsky, 1996), and nouns that are not part of the NetTalk corpus. This leaves 175 nouns that we use as training data. The phoneme sequences corresponding to the selected nouns are taken from the NetTalk corpus. Altogether, 27 consonants and 15 vowels and diphthongs occur in the NetTalk corpus, for a total of 42 phonemes. Three of the consonants, /ul/, /um/, and /un/, which rarely or never are part of a selected noun (14, 0, and 3 times, respectively), are not distinguished but are considered to be equivalent to /l/, /m/, and /n/. Construction of distinctive feature vectors for each phoneme is challenging, as experts in phonology, phonetics, and linguistics sometimes disagree on what an ideal set of distinctive features should be (see, e.g., Frisch, 1996). Our distinctive features were not based on any modeling considerations but on well-known previously published feature sets. They provide a unique representation for each distinguished phoneme that captures at least some of the regularities that make some phonemes similar to others. All 34 components of a feature vector (input pattern), prior to normalization, are binary
Table 5: Distinctive Features: Vowels. [Matrix of + / − entries giving, for each vowel and diphthong phoneme, the values of the 34 distinctive features (Consonantal, Vocalic, Compact, Diffuse, Grave, Acute, Nasal, Oral, Tense, Lax, Continuant, Interrupted, Strident, Mellow, ±Voicing, ±Duration, ±(Af)Frication, Liquid, Glide, Retroflex, and the F2 and F1 formant-frequency intervals VH/H/HM/LM/L/VL).]
valued: + for a present feature (numerical value 1.0) and − for an absent feature (0.0) (see Tables 5, 6, and 7). The consonant features were taken mostly from the Jakobson, Fant, and Halle (1951) feature system (Singh, 1976), augmented for completeness with additional phonemes (e.g., /r/) and features by Singh and colleagues (Singh & Black, 1966; Singh, 1976). The vowel features include some of the same features as consonants, plus features based on the F1 and F2 formants, each divided into six discrete frequency intervals (VH = very high, H = high, HM = high-medium, M = medium, LM = low-medium, L = low), taken from Paget (1976). The diphthongs such as /ai/ and /au/ were taken as the average of their two nondiphthong components for simplicity.
Table 6: Distinctive Features: Consonants, Part I. [Matrix of + / − entries giving the 34 distinctive-feature values for the consonants p, b, m, t, d, n, tch, dj, k, g, f, v, th−, th+ (keyboard codes).]
For normalization, each feature vector is projected onto the unit hypersphere in the next higher dimension. The additional component v_r stores the minimal distance between the original feature vector and the surface of the smallest hypersphere enclosing all feature vectors: v_r = r − ||v||_2, where r is the length of the largest feature vector. The thus-extended feature vectors are then normalized to unit length to prevent input vectors with relatively large norms from having a greater influence on the activation dynamics. The prior projection step preserves topological information such as the nearest-neighbor relation between the vectors.
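The projection-then-normalize step just described can be sketched as follows (a minimal NumPy sketch; the batch-vectorized form is an assumption):

```python
import numpy as np

def normalize_inputs(vectors):
    """Append v_r = r - ||v||_2 to each feature vector (r = length of
    the longest vector in the set), then scale each extended vector to
    unit length."""
    V = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(V, axis=1)
    r = norms.max()
    vr = r - norms                      # distance to enclosing hypersphere
    X = np.hstack([V, vr[:, None]])     # project into one higher dimension
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```

The extra component compensates short vectors, so that after unit-length scaling no input dominates the net-input computation simply by having a larger norm.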
Table 7: Distinctive Features: Consonants, Part II. [Matrix of + / − entries giving the 34 distinctive-feature values for the consonants s, z, sh, zh, w, r, l, y, h, ng (keyboard codes).]
Acknowledgments

This work was supported by NINDS Award NS35460 and DOD Award N000140210810.

References

Abbott, L., & Blum, K. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cerebral Cortex, 6(3), 406–416.

Barreto, G., Araujo, A., & Kremer, S. (2003). A taxonomy for spatiotemporal connectionist networks revisited: The unsupervised case. Neural Computation, 15(6), 1255–1320.
Multi-Winner Maps
Received April 15, 2003; accepted September 8, 2003.
LETTER
Communicated by Misha Tsodyks
Neuronal Bases of Perceptual Learning Revealed by a Synaptic Balance Scheme

Osamu Hoshino
[email protected]
Department of Human Welfare Engineering, Oita University, 700 Dannoharu, Oita 870-1192, Japan
Our ability to perceive external sensory stimuli improves as we experience the same stimulus repeatedly. This perceptual enhancement, called perceptual learning, has been demonstrated for various sensory systems, such as vision, audition, and somatosensation. I investigated the contribution of the lateral excitatory and inhibitory synaptic balance to perceptual learning. I constructed a simple associative neural network model in which sensory features were expressed by the activities of specific cell assemblies. Each neuron is sensitive to a specific sensory feature, and the neurons belonging to the same cell assembly are sensitive to the same feature. During perceptual learning, the network was presented repeatedly with a stimulus composed of a sensory feature and noise, and the lateral excitatory and inhibitory synaptic connection strengths between neurons were modified according to a pulse-timing-based Hebbian rule. Perceptual learning enhanced the cognitive performance of the network, increasing the signal-to-noise ratio of neuronal activity. I suggest here that the alteration of the synaptic balance may be essential for perceptual learning, especially when the brain tries to adopt the most suitable strategy (signal enhancement, noise reduction, or both) for a given perceptual task.

1 Introduction

Our ability to perceive external sensory stimuli improves as we experience the same stimulus repeatedly. This perceptual enhancement is called perceptual learning and has been demonstrated for various sensory systems, such as vision, audition, and somatosensation (for a review, see Goldstone, 1998). Depending on the type of cognitive processing, the brain may use a different strategy: signal enhancement (Poggio, Fahle, & Edelman, 1992; Weiss, Edelman, & Fahle, 1993), noise reduction (McLaren, Kaye, & Mackintosh, 1988; McLaren, 1994; Gold, Bennett, & Sekuler, 1999), or both (Dosher & Lu, 1998).
Neural Computation 16, 563–594 (2004)
© 2004 Massachusetts Institute of Technology

It is often claimed that improvement in cognitive performance by perceptual learning may arise from the alteration of cortical circuitry, that is, changes in synaptic connection strengths between relevant neurons (Gilbert, 1994, 1996). Karni and Sagi (1991) have proposed a Hebbian process for synaptic alteration, in which use-dependent synaptic enhancement is induced when concurrent pre- and postsynaptic activation occurs. Weiss and colleagues (1993) carried out a simulation study of perceptual learning in hyperacuity in an early visual system. They demonstrated that signal but not noise was amplified by perceptual learning, in which activity-dependent modulation of feedforward synaptic connection strengths played a crucial role. It has been proposed that lateral synaptic modulation within cortical areas may play a crucial role in perceptual learning. Karni and Sagi (1991) have suggested that lateral synaptic modulation based on the Hebbian rule is essential for perceptual learning in vision. In particular, lateral inhibitory synaptic modulation is effective for reducing background noise and therefore strengthens the "pop-out" of a target stimulus. Schwartz, Maquet, and Frith (2002) have suggested that perceptual learning in visual texture discrimination tasks may involve changes in connections within the primary visual cortex. Hoshino, Inoue, Kashimori, and Kambara (2001) have demonstrated, by simulating a neural network model, that excitatory and inhibitory synaptic modulation within the network enhances cognitive ability in sensory-feature identification tasks, reducing the time required to identify sensory stimuli. Crist, Li, and Gilbert (2001) have suggested that a change in the excitatory-inhibitory synaptic balance may be essential for perceptual learning in area V1. Although there is clear evidence supporting the relevance of cortical plasticity to perceptual learning, little is known about the functional significance of the modulation of synaptic balance in perceptual learning.
The purpose of this study is to investigate how the lateral excitatory-inhibitory synaptic balance contributes to perceptual learning. I propose a synaptic balance scheme for perceptual learning, on the basis of which I try to develop a unified understanding of the neuronal mechanism underlying perceptual learning. I investigate how the brain adopts the most suitable strategy (signal enhancement, noise reduction, or both) for a given perceptual task. I construct a simple associative neural network model in which sensory features are expressed by the activities of specific cell assemblies. When the network is stimulated with a sensory feature, the neurons of the cell assembly corresponding to the applied feature fire action potentials, while the other neurons do not. During a perceptual learning process, the network is stimulated with a given feature repeatedly, and the lateral excitatory and inhibitory synaptic connection strengths between neurons are modified according to a pulse-timing-based Hebbian rule. After perceptual learning, neuronal responses to that feature are recorded. I evaluate the performance of the network by the signal-to-noise ratio of neuronal activity, defined as (feature-induced neuronal activity)/(noise-induced neuronal activity). This idea comes from the notion that there is a close relationship between an increase in the signal-to-noise ratio of neuronal activity and an improvement in cognitive performance (Gilbert, 1994). Other performance measures, such as reaction times to stimuli or the detectability of stimuli, which are often used for assessing the performance of associative neural networks (Hoshino et al., 2001; Hoshino, Zheng, & Kuroiwa, 2002), would also be possible. However, as will be shown in later sections, the signal-to-noise ratio of neuronal activity clearly demonstrates the relevance of the synaptic balance to perceptual learning.

I use a degraded feature stimulus as input, which is the superposition of two firing patterns of the network, expressing an original feature stimulus and a noise stimulus. Neurons relevant to the original feature, and some irrelevant ones, tend to be activated when the network is stimulated with this combined (feature plus noise) input. As the level of noise increases, the activity of the relevant neurons decreases and that of the irrelevant neurons increases. To investigate the relevance of the synaptic balance to perceptual learning, I vary the rates of excitatory and inhibitory synaptic modification and analyze neuronal behaviors before, during, and after perceptual learning.

For the simulation, I assume perceptual learning processes that are restricted to early sensory systems such as V1 in vision, A1 in audition, or S1 in somatosensation (Gilbert, 1994; Buonomano & Merzenich, 1998). V1 neurons are orientation specific, A1 neurons are sound-frequency specific, and S1 neurons are tuned to specific somatosensation of the body surface. That is, these neurons have simple and small receptive fields and are therefore single modal, responding to single sensory features but not to multiple sensory features. This notion allows me to build a neural network model consisting of single-modal neurons.
However, it is well known that perceptual learning is not limited to lower cortical areas but takes place in higher cortical areas as well. In general, since higher cortical areas integrate lower sensory information, the neurons of the higher areas have complex receptive fields and therefore tend to respond to multiple sensory features (Rolls & Baylis, 1994; Sakurai, 1996). I extend the simulation study to investigate how perceptual learning proceeds if the network contains multimodal neurons. I simulate a simple neural network model that contains both types (single-modal and bimodal) of neurons. The bimodal neurons are sensitive to two different features.

2 Neural Network Model

Figure 1a shows the structure of the neural network model. The model consists of an input (IP) and an output (OP) network. The IP network receives an external input and sends action potentials to the OP network by divergent and convergent feedforward projections. The OP network consists of neural units, each consisting of a pair of a pyramidal neuron and an interneuron. In each unit, an excitatory (inhibitory) synaptic connection is made from the pyramidal neuron (interneuron) to the interneuron (pyramidal neuron). The pyramidal neurons are mutually connected with excitatory and inhibitory synaptic connections.

Figure 1: The neural network model. (a) The model consists of an input (IP) and an output (OP) network. The IP network receives an external input and sends action potentials to the OP network via divergent and convergent feedforward projections. When the IP network is stimulated with feature FX (X = 1, 2, 3, 4, 5), the OP neurons corresponding to FX are activated. These groups of OP neurons have no overlapping region with each other. (b) Degraded input stimuli. The black circles and gray circles that denote the firing state of neurons represent signal and noise, respectively. The open circles denote the resting state of neurons.

I assume here excitatory and inhibitory synapses between pyramidal neurons. The excitatory synapses could be made by axo-dendritic connections between pyramidal neurons, and the inhibitory synapses could be expressed as the excitation of local interneurons (Martin, 1991; Amit, 1998), such as basket cells, Martinotti cells, or bitufted cells, which are GABAergic inhibitory neurons in the cortex (Gupta, Wang, & Markram, 2000).

The OP network is constructed as a so-called associative neural network. Sensory features, expressed by the activities of specific cell assemblies, or the specific firing patterns of the neural network, are memorized into the network dynamics through Hebbian learning. After learning, the neurons of each cell assembly become sensitive to one of the memorized features. These cell assemblies have no overlapping region with each other, thereby making feature-selective neurons. It is well known that feature-selective neurons generally prevail at early visual stages (Logothetis, 1998), such as motion-selective neuronal columns in the MT area (Kandel, 1991). The network dynamics allows a certain cell assembly to be activated, and therefore to be selected from among the others, when the network is stimulated with a relevant sensory feature. For simplicity, the IP network contains only pyramidal neurons, and there is no connection between them. That is, the IP network works exclusively as an input network. I apply degraded features as input stimuli to the IP network. These degraded stimuli are expressed as the superposition of the original feature (e.g., F2) and spatial noise, as shown in Figure 1b, where the black and gray circles denote the firing state of neurons and the open circles denote the resting state of neurons. The black circles represent signal information about feature F2, while the gray circles, which are irrelevant to feature F2, represent noise. As the level of deviation from the original feature representation (F2_0, or F2) increases, the F2-sensitive OP neurons receive less input from F2, while the other OP neurons receive greater noise input.

Note that the numbers of on-neurons (black and gray circles) of the degraded stimuli are the same. The dynamic evolution of the membrane potentials of pyramidal neurons and interneurons is defined by

$$\tau_p^{IP}\,\frac{du_{p,i}^{IP}(t)}{dt} = -\bigl(u_{p,i}^{IP}(t) - u_{p,\mathrm{rest}}^{IP}\bigr) + I_{p,i}^{IP}(t), \tag{2.1}$$

$$\tau_p^{OP}\,\frac{du_{p,i}^{OP}(t)}{dt} = -\bigl(u_{p,i}^{OP}(t) - u_{p,\mathrm{rest}}^{OP}\bigr) + \sum_{j=1}^{N_{OP}} \bigl(w_{pp,ij}^{ex}(t) + w_{pp,ij}^{ih}(t)\bigr) S_{p,j}^{OP}(t) + w_{pr} S_{r,i}^{OP}(t) + \sum_{j=1}^{N_{IP}} L_{ij}^{OP,IP} S_{p,j}^{IP}(t - \Delta t^{OP,IP}), \tag{2.2}$$

$$\tau_r^{OP}\,\frac{du_{r,i}^{OP}(t)}{dt} = -\bigl(u_{r,i}^{OP}(t) - u_{r,\mathrm{rest}}^{OP}\bigr) + w_{rp} S_{p,i}^{OP}(t), \tag{2.3}$$

$$\mathrm{Prob}\bigl[S_{x,i}^{Y}(t) = 1\bigr] = f_x\bigl[u_{x,i}^{Y}(t)\bigr], \quad (x = p, r;\; Y = IP, OP), \tag{2.4}$$

$$f_x[y] = \frac{1}{1 + e^{-\eta_x (y - \theta_x)}}, \tag{2.5}$$
where u^Y_{p,i}(t) and u^Y_{r,i}(t) are the membrane potentials of the ith pyramidal neuron and the ith interneuron of the Y (Y = IP, OP) network at time t, respectively. u^Y_{p,rest} and u^Y_{r,rest} are the resting potentials of pyramidal neurons and interneurons, respectively. τ_p^Y and τ_r^Y are the decay times of the membrane potentials. N_Y is the number of neuron units of the Y network. w^{ex}_{pp,ij}(t) and w^{ih}_{pp,ij}(t) are, respectively, the excitatory and inhibitory synaptic strengths from pyramidal neuron j to pyramidal neuron i. w_{pr} (w_{rp}) is the strength of the synaptic connection from interneuron (pyramidal neuron) to pyramidal neuron (interneuron). S^Y_{p,i}(t) and S^Y_{r,i}(t) are the action potentials of the ith pyramidal neuron and the ith interneuron, respectively. L^{OP,IP}_{ij} is the strength of the synaptic connection from IP pyramidal neuron j to OP pyramidal neuron i. A value of 1 or 0 is chosen for L^{OP,IP}_{ij} to make divergent and convergent feedforward projections, whereby a specific group of feature-selective OP neurons (a cell assembly) tends to be activated when the IP network is stimulated with a given feature. Δt^{OP,IP} denotes the delay time in signal transmission from the IP to the OP network. I^{IP}_{p,i}(t) is an external input to IP pyramidal neuron i. η_x and θ_x are the steepness and the threshold of the sigmoid function f_x for neurons of kind x. Equations 2.4 and 2.5 define the probability of neuronal firing; that is, the probability of S^Y_{x,i}(t) = 1 is given by f_x, and otherwise S^Y_{x,i}(t) = 0. When the ith neuron fires, S^Y_{x,i}(t) takes on a value of 1 for 1 msec, followed by a value of 0 for another 1 msec. This is a simplified representation of an action potential and a refractory period, and it determines the maximum firing rate of 500 Hz. After firing, the membrane potential is reset to the resting potential (u^Y_{p,rest} = −0.5, u^Y_{r,rest} = −0.5). The values of the other network parameters used in this study are as follows: N_IP = N_OP = 81, τ_p^IP = τ_p^OP = 100 msec, τ_r^OP = 100 msec, θ_p = 0.1, θ_r = 0.9, η_p = 10.2, η_r = 21.0, and Δt^{OP,IP} = 10 msec.

Synaptic connections within the OP network are modified according to a pulse-timing-based Hebbian rule, defined by

$$\tau_w\,\frac{dw_{pp,ij}^{ex}(t)}{dt} = -w_{pp,ij}^{ex}(t) + \int_{-\infty}^{0} \alpha_{ex} K(s)\, S_{p,i}^{OP}(t)\, S_{p,j}^{OP}(t+s)\, ds + \int_{0}^{\infty} \alpha_{ex} K(s)\, S_{p,i}^{OP}(t-s)\, S_{p,j}^{OP}(t)\, ds, \tag{2.6}$$

$$\tau_w\,\frac{dw_{pp,ij}^{ih}(t)}{dt} = -w_{pp,ij}^{ih}(t) + \int_{-\infty}^{0} \alpha_{ih} K(s)\, \bigl[S_{p,i}^{OP}(t) - 1\bigr]\, S_{p,j}^{OP}(t+s)\, ds + \int_{0}^{\infty} \alpha_{ih} K(s)\, \bigl[S_{p,i}^{OP}(t-s) - 1\bigr]\, S_{p,j}^{OP}(t)\, ds, \tag{2.7}$$
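The sign convention of the learning rule can be sketched as a pair-based approximation of equation 2.6 (the exponential kernel shape and all function names are my assumptions; Figure 2a, which defines the actual K(s), is not reproduced here):

```python
import numpy as np

def K(s, tau_k=20.0):
    """Assumed pairing kernel: positive when the presynaptic spike leads the
    postsynaptic spike (s < 0), negative when it lags (s > 0), decaying in |s|."""
    s = np.asarray(s, dtype=float)
    return np.where(s < 0, np.exp(s / tau_k), -np.exp(-s / tau_k))

def dw_ex(dt_ji, w, alpha_ex=10.0, tau_w=500.0):
    """Pair-based reading of equation 2.6: rate of change of an excitatory
    weight j -> i when presynaptic neuron j fires dt_ji msec before
    postsynaptic neuron i (dt_ji > 0 means pre-before-post)."""
    return (-w + alpha_ex * K(-dt_ji)) / tau_w

print(dw_ex(5.0, 0.0) > 0)    # pre fires 5 ms before post: LTP
print(dw_ex(-5.0, 0.0) < 0)   # pre fires 5 ms after post: LTD
```

The first term of each equation (s < 0) covers presynaptic spikes that precede the postsynaptic event, the second (s > 0) those that follow it; the decay term −w/τ_w pulls unused weights back toward zero.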
Figure 2: (a) The pairing function K(s) for Hebbian learning. For details, see the text. (b) Raster plots of firing pulses of the OP cell assemblies, which have specific sensitivity to feature FX, during and after the feature-memorization process. Each "bar" indicates the period of stimulation with feature FX (X = 1, 2, 3, 4, 5), during which synaptic construction takes place.
where τ_w is a time constant, and α_ex (α_ih) is the rate of excitatory (inhibitory) synaptic modification. The kernel K(s) (see Figure 2a) defines a pairing function for the spike-timing-based Hebbian learning rule. This pulse-timing-based Hebbian learning accords with recent experiments (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998) demonstrating that synaptic plasticity depends on the relative timing of presynaptic and postsynaptic spikes. The second and third terms on the right-hand side of equations 2.6 and 2.7 represent LTP (long-term potentiation) and LTD (long-term depression) for excitatory (equation 2.6) and inhibitory (equation 2.7) synaptic connections, respectively. According to K(s), when neuron j fires just before (after) neuron i fires, the excitatory connection w^{ex}_{pp,ij}(t) is strengthened (weakened), and thus LTP (LTD) takes place for excitatory synapses. When neuron j fires just before (after) neuron i ceases to fire, the inhibitory connection w^{ih}_{pp,ij}(t) is strengthened (weakened), and thus LTP (LTD) takes place for inhibitory synapses. Note that w^{ex}_{pp,ij}(t) and w^{ih}_{pp,ij}(t) are assigned positive and negative values, respectively. Parameter values for the synaptic dynamics are as follows: τ_w = 50 msec, α_ex = 20, and α_ih = 3 for the feature-memorization process. These values are used exclusively to reduce the time required for memorizing the features into the network as dynamic cell assemblies. τ_w = 500 msec, α_ex = 0–20.0, and α_ih = 0–1.0 are used for the various perceptual learning processes. For simplicity, the synaptic connection strengths between pyramidal neurons and interneurons are held constant, w_{pr} = −1.0 and w_{rp} = 1.0. Other network parameter values used in the model are chosen to obtain suitable model performance so that the essential neuronal behaviors of the network can be clearly demonstrated. The simplifications made here allow a more extensive exploration of the model by making it easier to identify critical parameters among a large number of possible candidates.

Each stimulus is expressed by an on-off pixel pattern {ξ_i(FX_D)}, as shown in Figure 1b, where D denotes the level of deviation from the original feature FX (or FX_0). When stimulus FX_D is presented to the IP network, the amount of stimulation received by neuron i of the IP network is defined by

$$I_{p,i}^{IP}(t) = \epsilon\,\xi_i(FX_D), \quad (X = 1\text{–}5;\; D = 0, 10, 20, 30, 40, 50), \tag{2.8}$$

where ε denotes the intensity of stimulus FX_D. ξ_i(FX_D) takes on a value of 1 for an on pixel (the black and gray circles) or 0 for an off pixel (the open circles). ε = 100 for the feature-memorization process, and ε = 0.2 for perceptual learning. The high intensity, ε = 100, is used exclusively to accelerate the feature-memorization process, thereby reducing the computational load for the perceptual learning processes that follow.

Figure 2b is the raster plot of firing pulses of the OP cell assemblies during and after the feature-memorization process. Stimulation with feature FX, as indicated by a bar, activates the OP cell assembly corresponding to FX. During the stimulation periods, synaptic connections between the OP pyramidal neurons are modified according to the Hebbian rule defined by equations 2.6 and 2.7. After the memorization process, the OP neurons fire action potentials in a random manner (see time = 6,000–10,000) at a firing rate of 10 to 30 Hz. Note that the neural network model has ongoing spontaneous activity even without external stimulation. The mutual excitatory connections within the cell assemblies might be responsible for the spontaneous neuronal firing. The randomness of firing might be due mainly to the self-inhibition of pyramidal neurons via inhibitory interneurons. In the V1 of alert macaque monkeys performing cognitive tasks, similar spontaneous neuronal firing (10–30 Hz) has been reported (Peres & Hochstein, 1994).
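A sketch of how a degraded stimulus FX_D and the input of equation 2.8 might be generated (helper names and the 16-pixel assembly size are illustrative assumptions): D% of the feature's on-pixels are switched off and the same number of off-pixels are switched on, so the total on-count stays fixed, as the text requires.

```python
import numpy as np

def degrade(feature, D, rng):
    """Return xi(FX_D): switch off D% of the feature's on-pixels and switch on
    an equal number of previously off pixels, keeping the on-count constant."""
    xi = feature.copy()
    on = np.flatnonzero(feature == 1)
    off = np.flatnonzero(feature == 0)
    n_flip = int(round(len(on) * D / 100.0))
    xi[rng.choice(on, n_flip, replace=False)] = 0    # removed signal pixels
    xi[rng.choice(off, n_flip, replace=False)] = 1   # added noise pixels
    return xi

def ip_input(xi, eps=0.2):
    """Equation 2.8: I_{p,i}^{IP}(t) = eps * xi_i(FX_D)."""
    return eps * xi

rng = np.random.default_rng(1)
F2 = np.zeros(81, dtype=int)       # N_IP = 81 pixels
F2[:16] = 1                        # assumed 16-pixel cell assembly for F2
F2_20 = degrade(F2, 20, rng)       # 20% deviation from the original feature
print(F2_20.sum())                 # on-count preserved: 16
```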
The ongoing spontaneous neuronal activity enables the OP network to respond quickly to an applied stimulus, activating the cell assembly corresponding to the stimulus. After the stimulus is switched off, the network returns to the ongoing spontaneous state. The dynamic properties of the ongoing spontaneous state and their relevance to neuronal information processing have been investigated in detail (Hoshino, Kashimori, & Kambara, 1996, 1998; Hoshino, Usuba, Kashimori, & Kambara, 1997; Hoshino et al., 2001; Hoshino et al., 2002).

I chose 0 for the initial synaptic values, w^{ex}_{pp,ij}(0) = w^{ih}_{pp,ij}(0) = 0, in equations 2.2, 2.6, and 2.7. The reason that I made no connections at t = 0 is to accelerate the learning process, by which the stable dynamic cell assemblies that represent the sensory features can be created as long-term memories. Nevertheless, the same result would be obtained with other initial synaptic strengths (e.g., random values for sparse connections), though the time required to create the stable dynamic cell assemblies would be prolonged by the reconstruction of the synaptic structure.

3 Results

3.1 Neuronal Behavior. I stimulated the IP network repeatedly with a degraded stimulus (F2_20; see Figure 1b) and recorded neuronal responses, where the rates of synaptic modification were set to α_ex = 10.0 and α_ih = 0.2. Figure 3a shows the raster plots of action potentials of the OP pyramidal neurons that have specific sensitivity to feature FX. During perceptual learning (marked by L), synaptic connections between the OP pyramidal neurons were modified according to equations 2.6 and 2.7. After perceptual learning, the same stimulus (F2_20) was presented (marked by P), with no synaptic modulation. I counted the number of neuronal spikes in a time window of 10 msec. Figures 3b and 3c show the total spike counts for the neurons that are sensitive to features F2 and F1, respectively.
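The 10-msec spike counting behind Figures 3b and 3c can be sketched as follows (illustrative helper and synthetic spike times, not the author's code):

```python
import numpy as np

def binned_counts(spike_times, t_start, t_stop, bin_ms=10.0):
    """Total spike counts of a neuron group in consecutive windows of bin_ms msec."""
    edges = np.arange(t_start, t_stop + bin_ms, bin_ms)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts

# Synthetic pooled spike times (msec) for one cell assembly
spikes = np.array([1.0, 3.0, 12.0, 15.0, 18.0, 27.0])
print(binned_counts(spikes, 0.0, 30.0))   # -> [2 3 1]
```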
As the network goes through the five trials of perceptual learning, the activity of the F2-sensitive neurons increases, while those of the other neurons (e.g., the F1-sensitive neurons of Figure 3c) do not, or decrease slightly. The increase in neuronal activity arises from the positive synaptic enhancement between F2-sensitive neurons, according to equation 2.6. Since the activity of the F2-sensitive neurons prevails during perceptual learning, the activities of the other neurons, that is, the FX (X ≠ 2)-sensitive neurons, tend to be suppressed by the inhibitory connections from the F2-sensitive neurons. Note that these inhibitory connections were established across the cell assemblies through the feature-memorization process (see Figure 2b).

I assessed the cognitive performance of the OP network in terms of the number of trials of perceptual learning. The simulation is the same as that in Figure 3a, except that the stimulus (F2_20) was presented up to 20 times (not shown). The firing rate was measured for an F2-sensitive and an F1-sensitive neuron. Note that the F1-sensitive neuron is irrelevant to the applied feature (F2). As shown in Figure 3d, the sensitivity of the F2-sensitive neuron gradually increases as the trials proceed (open circles) and saturates at a certain value, ~170 Hz. This result implies that the cognitive performance saturates at some level (Adini, Sagi, & Tsodyks, 2002). The key parameters that determine the maximal firing rate (~170 Hz) are the time constant (τ_p^OP) and the resting potential (u^{OP}_{p,rest}) of pyramidal neurons (see equation 2.2). The minimal firing rate (~3 Hz) of the F1-sensitive neuron (filled circles) is due to strong suppression by the active F2-sensitive neurons via inhibitory synapses, which have also been enhanced by perceptual learning (see equation 2.7).

Figure 3: (a) Raster plots of action potentials of the OP cell assemblies that have specific sensitivity to feature FX (X = 1, 2, 3, 4, 5). In perceptual learning, the IP network was stimulated repeatedly with the degraded stimulus F2_20 (L), during which the synaptic connections between the OP pyramidal neurons were modified according to equations 2.6 and 2.7. After perceptual learning, the same stimulus was presented (P), with no synaptic modulation. (b, c) Total spike counts within a bin size of 10 msec for the OP neurons that are sensitive to feature (b) F2 or (c) F1. (d) Cognitive performance of the OP network as a function of the number of trials (up to 20) of perceptual learning. The sensitivity of an F2-sensitive neuron (open circles) and an F1-sensitive neuron (filled circles) to the stimulation (P) was measured.

When synaptic modulation took place even during interstimulus intervals, the sensitivity of the network to the stimulus (F2_20) deteriorated (not shown). The decrease in synaptic strength due to natural decay, which is determined by the synaptic time constant τ_w in equation 2.6, is responsible for this deterioration, weakening the synaptic connections among F2-sensitive neurons. A longer synaptic time constant should be used to prevent such deterioration. Note that the synaptic connections between F2-sensitive neurons are unlikely to be enhanced during the interstimulus intervals, owing to fewer coincident action potentials.

3.2 Synaptic Balance and Perceptual Performance. To investigate the relevance of the synaptic balance between positive and negative connections to perceptual learning, I varied the ratio α_ex/α_ih of the rates of positive and negative synaptic modification during perceptual learning (see equations 2.6 and 2.7).
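The signal-to-noise measure used throughout these results (firing rate of feature-relevant F2 neurons over feature-irrelevant F1 neurons, both including spontaneous activity) can be sketched as:

```python
import numpy as np

def firing_rate(spike_counts, window_ms):
    """Mean firing rate in Hz from spike counts over a window given in msec."""
    return 1000.0 * np.sum(spike_counts) / window_ms

def snr(f2_counts, f1_counts, window_ms):
    """Signal-to-noise ratio of neuronal activity: rate of feature-relevant
    (F2-sensitive) neurons over feature-irrelevant (F1-sensitive) neurons."""
    return firing_rate(f2_counts, window_ms) / firing_rate(f1_counts, window_ms)

# Illustrative numbers near the reported saturation values (~170 Hz vs ~3 Hz)
print(snr([170], [3], 1000.0))
```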
Figure 4 shows how neurons respond to stimulus F2_D after perceptual learning in three typical cases: α_ex/α_ih = 10/0.2 (see Figure 4a), 10/0.5 (see Figure 4b), and 20/0.5 (see Figure 4c). As shown at the left of Figure 4a, when the network is stimulated repeatedly, the activity of the F2-sensitive neurons increases (solid lines), while the other neurons (e.g., F1-sensitive neurons, indicated by the dotted lines) do not for D < 30% but increase for D > 30%. This result may imply that signal but not noise changes at low noise levels and that both signal and noise enhancement occur at higher noise levels. I evaluated the performance of the model by the signal-to-noise ratio of neuronal activity, that is, the ratio of the firing rate of F2-sensitive neurons to that of F1-sensitive neurons. Note that both (feature-induced and noise-induced) neuronal activities include the ongoing spontaneous neuronal activity. As shown at the right of Figure 4a, the network performance can be improved if the noise level is low, D < 30%. The poor performance for higher-noise stimuli (D > 30%) is presumably due to the enhancement of the noise. As shown at the left of Figure 4b, when the rate of inhibitory synaptic modulation increases, the noise but not the signal tends to change. That
O. Hoshino
Figure 4: Dependence of neuronal responses and cognitive performance on the synaptic balance. The IP network is stimulated repeatedly with F2_20, where (a) α_ex/α_ih = 10/0.2, (b) α_ex/α_ih = 10/0.5, and (c) α_ex/α_ih = 20/0.5. Left: Firing rates of an F2-sensitive neuron (solid lines) and an F1-sensitive neuron (dotted lines). Right: Signal-to-noise ratio of neuronal activity. The circles, triangles, squares, diamonds, and asterisks indicate that the numbers of trials on perceptual learning are 1, 2, 3, 4, and 5, respectively.
is, only the noise decreases through perceptual learning. The noise reduction is not sufficient (dotted lines) when the network is stimulated at higher noise levels (D > 40%), and therefore the signal-to-noise ratio deteriorates, as shown at the right of Figure 4b. Figure 4c shows the case where both signal enhancement and noise reduction take place. One of the notable findings is that the synaptic balance set by α_ex/α_ih = 10/0.5 (Figure 4b) provides good performance for the lower-noise environments (D < 30%), while the balance set by α_ex/α_ih = 20/0.5 (Figure 4c) can still enhance the performance for the higher-noise environments (D > 30%). These results may imply that the synaptic balance between excitatory and inhibitory connections contributes to neuronal responses to both components (signal and noise) and therefore to network performance. To investigate the precise relationship between the synaptic balance and neuronal behaviors, I carefully varied the ratio (α_ex/α_ih) and recorded neuronal responses. Figure 5 shows the dependence of neuronal behaviors on the synaptic balance (α_ex/α_ih). Figures 5a and 5b are, respectively, the isocontours of sensitivity changes (% in firing rate) of an F2- and an F1-sensitive neuron after perceptual learning (see time = 22,000–23,000 in Figure 3a). The subdomains of Figure 5c express the spaces that guarantee the five distinct neuronal behaviors: signal enhancement without noise change (I), noise reduction without signal change (II), signal enhancement and noise reduction (III), signal and noise enhancement (IV), and signal and noise reduction (V). Although subdomains I and II should strictly lie on the 0% contour lines (filled arrows), the two dotted regions roughly indicate the parameter spaces within which the two distinct neuronal behaviors (signal enhancement without noise change and noise reduction without signal change) can readily be obtained, because of the sparseness of the isocontour lines.
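The five behaviors reduce to a small lookup on the signs of the two sensitivity changes. A sketch, where the tolerance band standing in for the 0% contour lines is an assumption of mine:

```python
# Inputs: percentage changes in firing rate of a feature-relevant ("signal",
# F2-sensitive) neuron and an irrelevant ("noise", F1-sensitive) neuron
# after learning. Regions I and II strictly lie on the 0% contour lines;
# the tolerance band is a hypothetical stand-in for them.

def classify_behavior(signal_change, noise_change, tol=1.0):
    """Map (% signal change, % noise change) to a region label I-V."""
    def sign(x):
        return "+" if x > tol else ("0" if abs(x) <= tol else "-")
    regions = {
        ("+", "0"): "I",    # signal enhancement without noise change
        ("0", "-"): "II",   # noise reduction without signal change
        ("+", "-"): "III",  # signal enhancement and noise reduction
        ("+", "+"): "IV",   # signal and noise enhancement
        ("-", "-"): "V",    # signal and noise reduction
    }
    return regions.get((sign(signal_change), sign(noise_change)))
```

Region III (signal up, noise down) is the regime the text identifies as most effective for the signal-to-noise ratio.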
As revealed by Figures 4b and 4c, the synaptic balances expressed by regions II and III are quite effective for the improvement in network performance, increasing the signal-to-noise ratio. An interesting finding is that noise reduction alone can improve the cognitive performance more than signal enhancement alone.

3.3 Perceptual Learning in Higher Sensory Systems. In the above simulation, I assumed an early sensory system, where neurons are feature specific. That is, the neurons that respond to a certain feature stimulus do not have sensitivity to another feature. However, perceptual learning is not limited to early sensory systems; it takes place in higher sensory systems as well, where neurons might have complex receptive fields and therefore tend to respond to multiple sensory features. I extended the simulation study to investigate how perceptual learning proceeds if the network contains bimodal neurons. I modified the neural network model as shown in Figure 6a, where the F2- and F4-sensitive IP neurons project to (dotted lines) the same group of OP neurons (the dotted circle). When the IP network is stimulated with feature
Figure 5: Dependence of neuronal behaviors on the synaptic balance α_ex/α_ih. (a) Isocontours of sensitivity changes (% in firing rate) of an F2-sensitive neuron after perceptual learning (see time = 22,000–23,000 in Figure 3a). (b) Isocontours for an F1-sensitive neuron. (c) Overall neuronal behavior. I: signal enhancement without noise change; II: noise reduction without signal change; III: signal enhancement and noise reduction; IV: signal and noise enhancement; and V: signal and noise reduction. Signal enhancement without noise change and noise reduction without signal change are readily obtained within the dotted regions because of the sparseness of the isocontour lines.
Figure 6: A neural network model that contains two types (single modal and bimodal) of neurons. The F2- and F4-sensitive IP neurons project to (dotted lines) the bimodal OP neurons (dotted circle). The bimodal F2-4 OP neurons are sensitive to both (F2 and F4) features.
F2, the OP neurons indicated by F2 and F2-4 are activated simultaneously. Note that "F2-4" denotes the specific OP neurons indicated by an arrow labeled "F2-4" in Figure 6 and means that these neurons are sensitive to both F2 and F4. Similarly, when the IP network is stimulated with feature F4, the OP neurons indicated by F4 and F2-4 are activated simultaneously. This means that the F2 and F4 OP neurons are single modal, while the F2-4 OP neurons are bimodal. The five sensory features are sequentially memorized (time = 1000–5500 in Figure 7a) through Hebbian learning. Since the F2-4 bimodal neurons are sensitive to both features, they are activated by F2 (time = 2000–2500) and F4 (time = 4000–4500). Note that the F2 neurons are also activated by F4 stimulation (see time = 4000–4500). This arises from an indirect association process. That is, the F4 stimulation (time = 4000–4500) directly activates the F4 and F2-4 OP neurons, which then activate the F2 OP neurons by associating F2 with F2-4. Such an associative property between F2 and F2-4 was established during time = 2000–2500. When the network is stimulated with F2 (time = 8000–9000), the F2-sensitive neurons and the bimodal F2-4 neurons are strongly activated, while the F4 neurons are weakly activated. Similarly, F4 stimulation strongly activates the F4 and F2-4 neurons (time = 12,000–13,000), while the F2 neurons are weakly activated. Figures 7b, 7c, and 7d, respectively, show how the
Figure 7: (a) Raster plots of firing pulses of the OP cell assemblies during the feature memorization period (time = 1000–5500) and the stimulation periods with F2 (time = 8000–9000) and F4 (time = 12,000–13,000). Each bar indicates the period of stimulation with feature FX. Total spike counts within a bin size of 10 msec for the OP neurons that are sensitive to feature (b) F2, (c) F4, or (d) F2-4.
F2, F4, and F2-4 neurons respond to the F2 (time = 8000–9000) or F4 (time = 12,000–13,000) stimulation, where the numbers of neuronal spikes occurring in a time window of 10 msec were counted. The stronger response of the F2 (F4) neurons during F2 (F4) stimulation arises from the direct activation of these neurons by the F2 (F4) stimulus, while the weaker response of the F4 (F2) neurons during F2 (F4) stimulation arises from the indirect activation of these neurons. The indirect activation is mediated by the F2-4 bimodal neurons. Note that the weak activation of the F4 neurons (time = 8000–9000) and the F2 neurons (time = 12,000–13,000), which seem irrelevant to the applied F2 and F4 features, respectively, might be considered a kind of noise. I show in Figure 8 how perceptual learning can reduce this noise signal. Before perceptual learning, the F2 stimulus activates the relevant F2 and F2-4 neurons and also the irrelevant F4 neurons (time = 8000–9000). After perceptual learning, during which the feature F2 is presented three times (time = 12,000–12,500, 14,000–14,500, 16,000–16,500), the responsiveness of the F4 neurons decreases, while that of the F2 neurons remains vigorous (time = 20,000–21,000). This result indicates that the noise activity of the F4 neurons has been suppressed and therefore that the cortical representation has been reorganized. Figure 9a shows this cortical reorganization. After perceptual learning, the F4-sensitive neurons do not respond to feature F2, thereby reducing the number of neurons that are sensitive to feature F2. Such a reduction in the number of feature-sensitive neurons has in fact been reported in experiments on perceptual learning. I discuss this issue in section 4.1. It is interesting to ask how the F4 neurons have been removed from the cortical representation. Before perceptual learning, the F2, F4, and F2-4 neurons simultaneously respond to the F2 stimulation (see time = 8000–9000 of Figure 8).
It might be inferred from the simultaneous activation of the three cell assemblies (F2, F4, F2-4) that these cell assemblies are likely to merge into one large dynamic cell assembly through perceptual learning. That is, Hebbian learning, which takes place in the perceptual learning process, strengthens the synaptic connections between these neurons. This might result in the formation of one large dynamic cell assembly that consists of the F2, F4, and F2-4 neurons. However, the result was quite different: the F2 and F2-4 neurons but not the F4 neurons were included in the merged dynamic cell assembly, as revealed in Figure 8a (see time = 20,000–21,000) and as shown at the right of Figure 9a. To answer the question of why the F4 neurons were excluded from the merged dynamic cell assembly, I calculated cross-correlation functions of action potentials during the F2 stimulation period before perceptual learning (time = 8000–9000 of Figure 8). The F2 and F2-4 neuronal spikes coincide (the filled arrow of Figure 9b) with a small time lag (~3 msec), while the F4 and F2-4 neuronal spikes do not (Figure 9c). The disynaptic delay (the open arrow of Figure 9c) is too long to enhance their positive synaptic connections according to kernel K(s) (see Figure 2a). It seems that the time required to associate F4 with F2-4 might be the main cause of that delay and thus of the lesser coincidence, resulting in the prevention of excitatory synaptic enhancement between the F2-4 and F4 neurons. In addition to the prevention of excitatory synaptic enhancement, the anticorrelations within several milliseconds (see the filled arrows of Figure 9c) allow these inhibitory synaptic connections to be enhanced according to equation 2.7. Such a combinatorial effect (i.e., the prevention of excitatory synaptic enhancement and the enhancement of inhibitory synaptic connections) may be essential for the suppression of the nontrained feature (F4).

Figure 8: (a) Raster plots of firing pulses of the OP cell assemblies before, during, and after perceptual learning. Each bar indicates the period of stimulation with feature F2. L denotes perceptual learning. Before perceptual learning, the F2, F4, and F2-4 neurons are activated by F2 stimulation (time = 8000–9000). After perceptual learning, the F2 and F2-4 neurons but not the F4 neurons are activated by F2 stimulation (time = 20,000–21,000). Total spike counts within a bin size of 10 msec for the OP neurons that are sensitive to feature (b) F2, (c) F4, or (d) F2-4. α_ex/α_ih = 3.0/1.0.

Figure 9: (a) Reorganization of cortical representation through perceptual learning. Stimulation with F2 before (left) and after (right) perceptual learning. The activity levels of the OP neurons are indicated by a white-to-black scale. (b) A cross-correlation function of action potentials between an F2- and an F2-4-sensitive neuron during the F2 stimulation period before perceptual learning (time = 8000–9000 of Figure 8). The F2 and F2-4 neuronal spikes coincide (the filled arrow) with a small time lag (~3 msec). (c) A cross-correlation function for an F4- and an F2-4-sensitive neuron. The positive peak (the open arrow) reflects a disynaptic delay between the two neurons. The filled arrows indicate anticorrelated firings.

3.4 Relevance to Experimental Observations. When the inhibitory and excitatory synapses were modulated separately during perceptual learning, that is, inhibitory synaptic modulation followed by an excitatory one and vice versa, noise reduction and signal enhancement were processed separately (not shown). I obtained the same cognitive performance in both cases as that in Figure 8. Furthermore, I could not find any difference in cognitive performance even when the LTD part was removed from the kernel K(s) (not shown). These results imply that the balance between excitatory and inhibitory synapses, but not LTD, is crucial for perceptual learning. The lack of an LTD effect is presumably due to the small contribution of the LTD part of kernel K(s) compared to that of the LTP part. In fact, as presented in the next paragraphs, LTD greatly affects the network performance if the LTD part of the kernel is allowed to spread wider than the LTP part. I made simulations for the bimodal case (see Figure 8a) in which the time window of the LTD part spreads wider than that of LTP. As the LTD part for excitatory synapses spreads (see the solid → dashed → dotted line of Figure 10a), neuronal activity (time = 20,000–21,000) for the nontrained feature (F4) tends to be suppressed (see Figures 10b → 10c → 10d, where α_ex/α_ih = 3.0/0.5). The suppression is presumably due to synaptic depression in excitatory connections from the F2-4 to F4 neurons, whose action potentials are anticorrelated (see Figure 9c). When the LTD part for inhibitory synapses spreads (see the solid → dashed → dotted line of Figure 11a), the neuronal activity (time = 20,000–21,000) for the nontrained feature (F4) is not suppressed but rather enhanced (see Figures 11b → 11c → 11d). The enhancement is presumably due to synaptic depression in inhibitory connections from the F2-4 to F4 neurons, which fire with anticorrelation (see Figure 9c). Concerning the roles of LTD, Feldman (2000) has experimentally demonstrated that long LTD windows can effectively depress postsynaptic action potentials that are uncorrelated with presynaptic action potentials. Song and Abbott (2001) have demonstrated by simulations that only if LTD dominates LTP will a faithful representation develop that remains stable. I also suggest here that the irrelevant neuronal activity (or noise) corresponding
Figure 10: Effects of LTD spread for excitatory synaptic modulation on bimodal perceptual learning. (a) LTD spread (solid → dashed → dotted line) and corresponding ((b) → (c) → (d)) raster plots of action potentials of the F2- and F4-sensitive neurons. Each solid bar indicates the period of stimulation with feature F2. L denotes perceptual learning. α_ex/α_ih = 3.0/0.5.
Figure 11: Effects of LTD spread for inhibitory synaptic modulation on bimodal perceptual learning. (a) LTD spread (solid → dashed → dotted line) and corresponding ((b) → (c) → (d)) raster plots of action potentials of the F2- and F4-sensitive neurons. Each solid bar indicates the period of stimulation with feature F2. L denotes perceptual learning. α_ex/α_ih = 3.0/0.5.
to the nontrained feature might effectively be suppressed if LTD dominates LTP for excitatory synapses. The membrane time constants τ_p^OP and τ_r^OP in equations 2.2 and 2.3 are critical parameters for the simulation and typically range between 10 and 100 msec (Dayan & Abbott, 2001). When these membrane time constants were reduced from 100 to 10 msec in the simulation of Figure 10b, the activity of the neurons responsible for the nontrained feature (F4) did not decrease (not shown), and therefore the suppression mechanism no longer worked. However, as shown in Figure 12a, if the inhibitory modulation rate (α_ih in equation 2.7) is increased from 0.5 to 1.0, the suppression of the nontrained feature occurs regardless of such a small membrane time constant. Figure 12b is a cross-correlation function of action potentials between an F4 (single-modal) and an F2-4 (bimodal) neuron during the F2 stimulation period before perceptual learning (time = 8000–9000 of Figure 12a). The positive peak (the open arrow) indicates that the F4 and F2-4 spikes coincide (within ~3 msec), which implies that the F4 neuronal activity is not suppressed but rather enhanced (see equation 2.6). However, the anticorrelation (the filled arrow) allows the inhibitory synaptic connections to be enhanced as well (see equation 2.7). This implies that the suppression mechanism can work if inhibitory modulation dominates excitatory modulation. I suggest that neuronal suppression of the nontrained feature might be possible for reasonable time constant values (5–10 msec) provided that the synaptic balance, α_ex/α_ih, is properly chosen. Concerning response times to sensory stimulation, it is well known that human subjects can respond to old (previously learned) words and nonwords more quickly (by tens of milliseconds) than to new (unlearned) words and nonwords (Haist, Musen, & Squire, 1991; Musen & Squire, 1993).
Cave and Squire (1992) have demonstrated that human subjects can name old pictures more quickly (by tens of milliseconds) than new pictures. Figure 13, reconstructed from Figure 8, shows that the response time of the F2-sensitive neurons is reduced in the shrunken cortical representation (see the right of Figure 9a), that is, after perceptual learning. In the expanded cortical representation (see the left of Figure 9a), before perceptual learning, the response latency of the neurons is ~60 msec from the onset of stimulation (top), while that for the shrunken cortical representation is ~25 msec (bottom). The reduction in response time (Δt = ~35 msec) in the shrunken cortical representation is presumably due to the stabilization of the dynamic cell assembly corresponding to feature F2, which is mediated by the enhancement of excitatory synaptic connections between the F2-sensitive neurons according to equation 2.6. This result may provide some insight into the underlying neuronal mechanisms for the reduction of response times to sensory stimulation through perceptual learning.
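The latency measure behind this comparison reduces to the time of the first spike after stimulus onset. A sketch with illustrative spike times, chosen only to reproduce the ~60 msec and ~25 msec latencies quoted above, not taken from the simulation:

```python
def response_latency(spike_times_ms, onset_ms):
    """First-spike latency (ms) after stimulus onset, or None if no spike."""
    post = [t - onset_ms for t in spike_times_ms if t >= onset_ms]
    return min(post) if post else None

# Before learning: first spike ~60 ms after onset of the F2 stimulus.
before = response_latency([8060.0, 8075.0, 8090.0], onset_ms=8000.0)
# After learning: first spike ~25 ms after onset.
after = response_latency([20025.0, 20031.0, 20040.0], onset_ms=20000.0)
reduction = before - after  # Delta t of Figure 13, here 35.0 ms
```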
Figure 12: Effects of membrane time constants on bimodal perceptual learning. (a) Raster plots of action potentials of the OP cell assemblies, where τ_p^OP and τ_r^OP were reduced from 100 to 10 msec, and the inhibitory modulation rate (α_ih in equation 2.7) was increased from 0.5 to 1.0. (b) A cross-correlation function of action potentials between an F4 (single-modal) and an F2-4 (bimodal) neuron during the F2 stimulation period before perceptual learning (time = 8000–9000). The positive peak (the open arrow) indicates that the F4 and F2-4 spikes coincide (within ~3 msec). The filled arrow indicates anticorrelation. α_ex/α_ih = 3.0/1.0.
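The cross-correlation analysis of Figures 9 and 12 can be sketched as a spike-count cross-correlogram: for every spike of one neuron, count the spikes of the other at each time lag. The binary spike trains below are illustrative; a peak at a lag of about +3 msec corresponds to the near-coincident firing described above.

```python
import numpy as np

def cross_correlogram(spikes_a, spikes_b, max_lag):
    """Spike-count cross-correlogram for lags -max_lag..+max_lag (1 ms bins)."""
    lags = np.arange(-max_lag, max_lag + 1)
    counts = np.zeros(len(lags))
    idx_a = np.flatnonzero(spikes_a)
    for i, lag in enumerate(lags):
        shifted = idx_a + lag
        valid = shifted[(shifted >= 0) & (shifted < len(spikes_b))]
        counts[i] = spikes_b[valid].sum()  # spikes of B at this lag from A
    return lags, counts

# Neuron B fires 3 ms after every spike of neuron A -> peak at lag +3.
a = np.zeros(1000)
a[np.arange(50, 1000, 50)] = 1
b = np.zeros(1000)
b[np.arange(53, 1000, 50)] = 1
lags, counts = cross_correlogram(a, b, max_lag=10)
peak_lag = lags[np.argmax(counts)]
```

A trough at small negative and positive lags, rather than a peak, would indicate the anticorrelated firing that favors enhancement of the inhibitory connection.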
4 Discussion

In this section, I discuss the functional significance of the synaptic balance in perceptual learning and then related simulation studies that tried to clarify neuronal mechanisms of perceptual learning based on synaptic plasticity. Finally, I discuss the assumptions that were made in this study.
Figure 13: Raster plots of action potentials of the F2-sensitive neurons during stimulation (F2) periods before (top) and after (bottom) perceptual learning (see Figure 8). The horizontal bars indicate stimulation periods. The response times of the F2-sensitive neurons are reduced by ~35 msec (Δt).
4.1 Significance of Synaptic Balance. As noted in section 1, there have been three theories of perceptual learning: signal enhancement, noise reduction, or both. I have focused my study on the relevance of the lateral excitatory-inhibitory synaptic balance to these theories. I have shown that the positive synaptic connections contribute to signal enhancement (see Figure 4a), while the negative ones contribute to noise reduction (see Figure 4b). I have shown in Figure 4c that signal enhancement and noise reduction both take place if the synaptic balance falls within a certain range (III of Figure 5c), where the cognitive performance of the neural network is greatly improved through perceptual learning even at higher noise levels (deviation > 30% in Figure 4c). I suggest that the alteration of the synaptic balance may be essential when the brain tries to adopt the most suitable strategy—signal enhancement, noise reduction, or both—for a given perceptual task. Concerning the relevance of neuronal behavior to perceptual learning, two contradictory experimental results have been reported: an increase (Skrandies, Lang, & Jedynak, 1996; Schwartz et al., 2002; Karni & Bertini,
1997) or a decrease (Wiggs & Martin, 1998; Schiltz et al., 1999) in neuronal activity through perceptual learning. I consider that such a difference in neuronal activity may arise from variety in synaptic modulation, where the brain may select the appropriate synaptic balance from among the possible candidates (see Figure 5). Some researchers (Jenkins, Merzenich, Ochs, Allard, & Guic-Robles, 1990; Recanzone, Schreiner, & Merzenich, 1993) have suggested that repeated sensory stimulation increases the number of cortical neurons relevant to the stimulus, while others (Wiggs & Martin, 1998; Ghose, Yang, & Maunsell, 2002) have suggested a decrease in that number. The increase in the number of neurons, which means an expansion of the cortical representation of sensory information, might be established by the enhancement of excitatory synaptic connections, which adds new neuronal members to the original cortical representation. How, then, does the decrease in the number of neurons, or the shrinkage of cortical representation, occur? I consider that the cortical shrinkage may arise from the enhancement of lateral inhibitory connections, whereby highly redundant sensory signals might be transformed into a more efficient code. For example, Figure 9a shows that the irrelevant information about F4, an associative signal with the applied stimulus F2, has been completely removed after perceptual learning. For the removal of the F4 neurons from the cortical representation, the enhancement of the lateral inhibitory connections was crucial. I evaluated the performance of the network in terms of the signal-to-noise ratio of neuronal activity, which was defined as the ratio of feature-induced to noise-induced neuronal activity.
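As a concrete instance of this performance measure, using the ~170 Hz and ~3 Hz rates reported in section 3.1 (the function name is mine, not the paper's):

```python
def snr(signal_rate_hz, noise_rate_hz):
    """Signal-to-noise ratio of neuronal activity: the ratio of the firing
    rate of feature-relevant neurons to that of irrelevant neurons. Both
    rates include ongoing spontaneous activity, so the denominator does
    not vanish."""
    return signal_rate_hz / noise_rate_hz

# After learning at low noise: ~170 Hz signal vs ~3 Hz noise.
ratio = snr(170.0, 3.0)
```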
To assess early perceptual learning, many cognitive performance measures have been proposed, such as the detectability of motion directions (Ball & Sekuler, 1982), global orientations (Ahissar & Hochstein, 1993), wave gratings (Fiorentini & Berardi, 1980), object features (Ahissar & Hochstein, 1993), hyperacuity (Poggio et al., 1992), and texture segregation (Karni & Sagi, 1991). In all these cognitive processes, the performance of subjects has been greatly improved by perceptual learning. I believe that the signal-to-noise ratio of neuronal activity may be one of the fundamental neural codes for cognitive processing, which is presumably transferred to later cortical areas, such as prefrontal cortex or motor cortex, where final decisions for given cognitive tasks are made.

4.2 Related Simulation Studies. Several neural network models have been proposed for perceptual learning. Weiss et al. (1993) made a model of Vernier hyperacuity. The model was a two-layered feedforward neural network. In the input network, the neurons that process signal information were affected by the activity of nearby neurons. The neuron of the output layer made a decision for orientation detection tasks. The researchers modified the feedforward synaptic connections and demonstrated that signal enhancement but not noise reduction occurred. I suggest that signal enhancement, noise reduction, or both is available even within single cortical networks when the brain alters the synaptic balance between positive and negative lateral connections. Peres and Hochstein (1994) made a recurrent neural network model for orientation detection tasks. The network consisted of hypercolumns, each of which contained orientation-sensitive neurons. The researchers demonstrated that the network performance was greatly improved if the synaptic ratio (intracolumnar excitatory connections)/(intercolumnar inhibitory connections) reached a certain value. The "synaptic balance scheme" that I propose here may support their result and may provide a unified understanding of the neuronal mechanism underlying perceptual learning. Herzog and Fahle (1998) proposed a three-layered neural network model for Vernier discrimination tasks. The researchers demonstrated that external feedback from another cortical region controlled a search process for visual stimuli and the speed of perceptual learning. In this study, I did not incorporate such a feedback contribution. I modeled an early sensory system, where bottom-up but not top-down signals seem to contribute to perceptual learning (Watanabe, Nanez, Koyama, Mukai, Liederman, & Sasaki, 2002). Adini et al. (2002) proposed a neural network model and demonstrated that perceptual learning enhanced the discrimination of visual contrast. The model consisted of excitatory (E) and inhibitory (I) neuronal subpopulations that were interconnected through both excitatory and inhibitory synaptic connections. The synaptic strengths were modulated according to the spike timing of E and I. The cognitive performance of the E subpopulation was improved when the inhibitory influence of I on E was reduced. The present neural network model has an interconnection between a pyramidal neuron and an interneuron, which is similar to that between the E and I subpopulations.
However, its functional role is quite different: the cognitive performance of the present model was improved by augmenting the excitatory influence among pyramidal neurons within cell assemblies, while that of their model was improved by reducing the inhibitory influence of I on E. This result may provide another plausible neuronal mechanism for perceptual learning. It might also be possible that the two distinct (inhibitory and excitatory) influences cooperate to improve overall cognitive performance. The model examined here has mutual inhibitory connections across cell assemblies. I have demonstrated that noise components can be reduced when the inhibitory connections are augmented, thereby improving the signal-to-noise ratio. The notable finding is that the alteration of the balance between excitatory and inhibitory synaptic connections is essential for improvement in overall cognitive performance. Another possible way of enhancing the signal-to-noise ratio is to increase the neuronal threshold for action potentials. This would also lead to the compactification of the neuronal representation when the network is presented with repetitive
stimuli, because the "iceberg" effect cuts off the noise and leaves only the salient activities (signal) alive. This might be the classic explanation of how the compactification and the enhancement of the signal-to-noise ratio work.

4.3 Assumptions. I assumed here self-inhibition of pyramidal neurons via their accompanying interneurons. This self-inhibition is essential for the performance of the model presented here. If the pyramidal neurons do not inhibit themselves, the ongoing state of the network becomes stationary, and the current state of the network is determined by its previous state. Owing to the robustness of the conventional associative neural network model without self-inhibition, the same firing pattern corresponding to F2 tended to appear even when the network was presented with the different stimuli F2_D (D = 0, 10, 20, 30, 40, or 50), where the noise contribution determined by D (see Figure 1b) was no longer reflected in neuronal activity (not shown). The self-inhibition was essential for mediating the ongoing spontaneous neuronal activity and thus for assessing the noise contribution to perceptual learning. These interneurons function to suppress the activities of pyramidal neurons. This neuronal suppression allows the network state to escape from the basin of a point attractor in which the current network state is trapped, and thus enables the network to respond easily to an applied stimulus. (For details, see Hoshino et al., 1996, 1997, 1998, 2001, 2002.) Such self-inhibition by neighboring inhibitory interneurons has in fact been found in the visual cortex (Martin, 2002). To reduce total simulation time, I used "one-shot" learning, in which the five sensory features were presented only once in the memorization process (see Figure 2b). The intense stimulation, ε = 100 in equation 2.8, was necessary for creating the stable dynamic cell assemblies (or the point attractors) in the network dynamics in such a short time interval.
In real situations, we are generally exposed repeatedly to sensory features or events over a longer period, and we store them as long-term memories. If I chose a weak intensity and let the model learn these features, similar dynamic cell assemblies would be created as well, though the simulation time would have to be prolonged. The anticorrelation learning rule that I have proposed here (see equation 2.7) was quite effective for both the construction and the modulation of inhibitory synapses. It allowed the inhibitory connections to the nonstimulated (inactive) neurons to grow (see Figure 2b), thereby successfully constructing the inhibitory synapses. Synaptic modulation according to the anticorrelation learning rule enabled the network to suppress a nontrained feature if it showed up (see Figure 8). Moreover, although noise is a rare event by definition, the anticorrelation learning rule enabled the network to reduce the noise effectively and therefore to enhance the signal-to-noise ratio, which seems difficult with another method of synaptic modulation, an inverted-correlation learning rule, as explained in the next paragraph. The inverted-correlation learning rule obtained with the kernel −K(s) (see Figure 2a), whose plausibility might be supported by an experimental study
(Holmgren & Zilberter, 2001), can modulate the inhibitory synapses as well. Buchs and Senn (2002) successfully used the −K(s) kernel for learning direction selectivity, where one needs to modulate existing inhibitory synapses but not build them up. The inverted-correlation rule will only modify inhibitory connections between active neurons; it cannot construct them, because that rule is based on the correlation between the active neurons. Furthermore, the inverted-correlation learning rule for inhibitory synapses tends to correlate synaptic activities and also to correlate signal and noise, and it is therefore less effective in increasing the signal-to-noise ratio, owing to the rareness of the noise event. I suggest that the anticorrelation learning rule could be an effective way of constructing and modulating the inhibitory synaptic connections.

Acknowledgments

I thank the anonymous referees for giving me valuable comments and suggestions on the earlier draft.

References

Adini, Y., Sagi, D., & Tsodyks, M. (2002). Context-enabled learning in the human visual system. Nature, 415, 790–793.
Ahissar, M., & Hochstein, S. (1993). Attentional control of early perceptual learning. Proc. Natl. Acad. Sci. USA, 90, 5718–5722.
Amit, D. J. (1998). Simulation in neurobiology: Theory or experiment? Trends Neurosci., 21, 231–237.
Ball, K., & Sekuler, R. (1982). A specific and enduring improvement in visual motion discrimination. Science, 218, 697–698.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modification in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Buchs, N. J., & Senn, W. (2002). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Simulation results. J. Comput. Neurosci., 13, 167–186.
Buonomano, D. V., & Merzenich, M. M. (1998). Cortical plasticity: From synapses to maps. Annu. Rev. Neurosci., 21, 149–186.
Cave, C. B., & Squire, L. R. (1992).
Intact and long-lasting repetition priming in amnesia. J. Experimental Psychology, 18, 509–520. Crist, R. E., Li, W., & Gilbert, C. D. (2001). Learning to see: Experience and attention in primary visual cortex. Nature Neurosci., 4, 519–525. Dayan, P., & Abbott, L. F. (2001). Model neurons I: Neuroelectronics. In P. Dayan & L. F. Abbott (Eds.), Theoretical neuroscience (pp. 153–194). Cambridge, MA: MIT Press. Dosher, B. A., & Lu, Z. L. (1998). Perceptual learning reects external noise ltering and internal noise reduction through channel reweighting. Proc. Natl. Acad. Sci. USA, 95, 13988–13993.
O. Hoshino
Feldman, D. E. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Fiorentini, A., & Berardi, N. (1980). Perceptual learning specific for orientation and spatial frequency. Nature, 287, 43–44.
Ghose, G. M., Yang, T., & Maunsell, J. H. (2002). Physiological correlates of perceptual learning in monkey V1 and V2. J. Neurophysiol., 87, 1867–1888.
Gilbert, C. D. (1994). Neural dynamics and perceptual learning. Curr. Biol., 4, 627–629.
Gilbert, C. D. (1996). Plasticity in visual perception and physiology. Curr. Opin. Neurobiol., 6, 269–274.
Gold, J., Bennett, P. J., & Sekuler, A. B. (1999). Signal but not noise changes with perceptual learning. Nature, 402, 176–178.
Goldstone, R. L. (1998). Perceptual learning. Annu. Rev. Psychol., 49, 585–612.
Gupta, A., Wang, Y., & Markram, H. (2000). Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science, 287, 273–278.
Haist, F., Musen, G., & Squire, L. R. (1991). Intact priming of words and nonwords in amnesia. Psychobiology, 19, 275–285.
Herzog, M., & Fahle, M. (1998). Modeling perceptual learning: Difficulties and how they can be overcome. Biol. Cybern., 78, 107–117.
Holmgren, C. D., & Zilberter, Y. (2001). Coincident spiking activity induces long-term changes in inhibition of neocortical pyramidal cells. J. Neurosci., 21, 8270–8277.
Hoshino, O., Inoue, S., Kashimori, Y., & Kambara, T. (2001). A hierarchical dynamical map as a basic frame for cortical mapping and its application to priming. Neural Comput., 13, 1781–1810.
Hoshino, O., Kashimori, Y., & Kambara, T. (1996). Self-organized phase transitions in neural networks as a neural mechanism of information processing. Proc. Natl. Acad. Sci. USA, 93, 3303–3307.
Hoshino, O., Kashimori, Y., & Kambara, T. (1998). An olfactory recognition model based on spatio-temporal encoding of odor quality in olfactory bulb. Biol. Cybern., 79, 109–120.
Hoshino, O., Usuba, N., Kashimori, Y., & Kambara, T. (1997). Role of itinerancy among attractors as dynamical map in distributed coding scheme. Neural Networks, 10, 1375–1390.
Hoshino, O., Zheng, M. H., & Kuroiwa, K. (2002). Roles of dynamic linkage of stable attractors across cortical networks in recalling long-term memory. Biol. Cybern., 88, 163–176.
Jenkins, W. M., Merzenich, M. M., Ochs, M. T., Allard, T., & Guic-Robles, E. (1990). Functional reorganization of primary somatosensory cortex in adult owl monkeys after behaviorally controlled tactile stimulation. J. Neurophysiol., 63, 82–104.
Kandel, E. R. (1991). Perception of motion, depth, and form. In E. R. Kandel, J. H. Schwartz, & T. M. Jessell (Eds.), Principles of neural science (pp. 440–466). East Norwalk, CT: Appleton & Lange.
Karni, A., & Bertini, G. (1997). Learning perceptual skills: Behavioral probes into adult cortical plasticity. Curr. Opin. Neurobiol., 7, 530–535.
Karni, A., & Sagi, D. (1991). Where practice makes perfect in texture discrimination: Evidence for primary visual cortex plasticity. Proc. Natl. Acad. Sci. USA, 88, 4966–4970.
Logothetis, N. (1998). Object vision and visual awareness. Curr. Opin. Neurobiol., 8, 536–544.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–216.
Martin, J. H. (1991). The collective electrical behavior of cortical neurons: The electroencephalogram and the mechanisms of epilepsy. In E. R. Kandel, J. H. Schwartz, & T. M. Jessell (Eds.), Principles of neural science (pp. 777–791). East Norwalk, CT: Appleton & Lange.
Martin, K. A. C. (2002). Microcircuits in visual cortex. Curr. Opin. Neurobiol., 12, 418–425.
McLaren, I. (1994). Representation development in associative systems. In J. Hogan & J. Bolhuis (Eds.), Causal mechanisms of behavioral development (pp. 377–402). Cambridge: Cambridge University Press.
McLaren, I., Kaye, H., & Mackintosh, N. (1988). An associative theory of the representation of stimuli: Applications to perceptual learning and latent inhibition. In R. Morris (Ed.), Parallel distributed processing: Implications for psychology and neurobiology (pp. 102–130). Oxford: Clarendon.
Musen, G., & Squire, R. (1993). On the implicit learning of novel associations by amnesic patients and normal subjects. Neuropsychol., 7, 19–135.
Peres, R., & Hochstein, S. (1994). Modeling perceptual learning with multiple interacting elements: A neural network model describing early visual perceptual learning. J. Comput. Neurosci., 1, 323–338.
Poggio, T., Fahle, M., & Edelman, S. (1992). Fast perceptual learning in visual hyperacuity. Science, 256, 1018–1021.
Recanzone, G. H., Schreiner, C. E., & Merzenich, M. M. (1993). Plasticity in the frequency representation of primary auditory cortex following discrimination training in adult owl monkeys. J. Neurosci., 13, 87–103.
Rolls, E. T., & Baylis, L. L. (1994). Gustatory, olfactory, and visual convergence within the primate orbitofrontal cortex. J. Neurosci., 14, 5437–5452.
Sakurai, Y. (1996). Hippocampal and neocortical cell assemblies encode memory processes for different types of stimuli in the rat. J. Neurosci., 158, 181–184.
Schiltz, C., Bodart, M., Dubois, S., Dejardin, S., Michel, C., Roucoux, M., Crommelinck, M., & Orban, G. A. (1999). Neuronal mechanisms of perceptual learning: Changes in human brain activity with training in orientation discrimination. NeuroImage, 9, 46–62.
Schwartz, S., Maquet, P., & Frith, C. (2002). Neural correlates of perceptual learning: A functional MRI study of visual texture discrimination. Proc. Natl. Acad. Sci. USA, 99, 17137–17142.
Skrandies, W., Lang, G., & Jedynak, A. (1996). Sensory thresholds and neurophysiological correlates of human perceptual learning. Spat. Vis., 9, 475–489.
Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike timing-dependent plasticity. Neuron, 32, 339–350.
Watanabe, T., Nanez, J. E., Koyama, S., Mukai, I., Liederman, J., & Sasaki, Y. (2002). Greater plasticity in lower-level than higher-level visual motion processing in a passive perceptual learning task. Nature Neurosci., 5, 1003–1009.
Weiss, Y., Edelman, S., & Fahle, M. (1993). Models of perceptual learning in Vernier hyperacuity. Neural Comput., 5, 695–718.
Wiggs, C. L., & Martin, A. (1998). Properties and mechanisms of perceptual priming. Curr. Opin. Neurobiol., 8, 227–233.

Received February 7, 2003; accepted September 5, 2003.
LETTER
Communicated by Wulfram Gerstner
How the Shape of Pre- and Postsynaptic Signals Can Influence STDP: A Biophysical Model

Ausra Saudargiene
[email protected] Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland, and Department of Informatics, Vytautas Magnus University, Kaunas, Lithuania
Bernd Porr
[email protected]

Florentin Wörgötter
[email protected] Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland
Spike-timing-dependent plasticity (STDP) is described by long-term potentiation (LTP), when a presynaptic event precedes a postsynaptic event, and by long-term depression (LTD), when the temporal order is reversed. In this article, we present a biophysical model of STDP based on a differential Hebbian learning rule (ISO learning). This rule correlates the NMDA channel conductance, as the presynaptic signal, with the derivative of the membrane potential at the synapse, as the postsynaptic signal. The model is able to reproduce the generic STDP weight change characteristic. We find that (1) the actual shape of the weight change curve strongly depends on the NMDA channel characteristics and on the shape of the membrane potential at the synapse. (2) The typical antisymmetrical STDP curve (LTD and LTP) can become similar to a standard Hebbian characteristic (LTP only) without having to change the learning rule. This occurs if the membrane depolarization has a shallow onset and is long lasting. (3) It is known that the membrane potential varies along the dendrite as a result of the active or passive backpropagation of somatic spikes or because of local dendritic processes. As a consequence, our model predicts that learning properties will be different at different locations on the dendritic tree. In conclusion, such site-specific synaptic plasticity would provide a neuron with powerful learning capabilities.

Neural Computation 16, 595–625 (2004)

1 Introduction

Hebbian (correlation-based) learning requires that pre- and postsynaptic spikes arrive within a certain small time window, which leads to an increase of the synaptic weight (Hebb, 1949). Originally it had been supposed that the temporal order of both signals is irrelevant (Bliss & Lomo, 1970, 1973;
© 2004 Massachusetts Institute of Technology
Bliss & Gardner-Medwin, 1973). However, rather early, first indications arose that temporal order is indeed important (Levy & Steward, 1983; Gustafsson, Wigstrom, Abraham, & Huang, 1987; Debanne, Gahwiler, & Thompson, 1994). Later, this was termed spike-timing-dependent plasticity (STDP), which refers to the observation that many synapses will decrease in strength when the postsynaptic signal precedes the presynaptic signal (defined here as T < 0), while they will grow if the temporal order is reversed (thus, T > 0) (Markram, Lübke, Frotscher, & Sakmann, 1997; Magee & Johnston, 1997; Bi & Poo, 2001). T denotes the temporal interval between post- and presynaptic signals (T := t_post − t_pre). This leads to the characteristic antisymmetrical weight change curve measured by several groups. Antisymmetrical learning curves were first observed in the entirely different context of classical conditioning models, which require much slower timescales. In their seminal study, Sutton and Barto (1981) introduced a learning rule based on the temporal difference between subsequent output signals, and they observed that this rule leads to inhibitory conditioning when the temporal order of conditioned and unconditioned stimulus is reversed. In their hands, this was an unwanted effect, because inhibitory conditioning is only very rarely observed in experiments (Prokasy, Hall, & Fawcett, 1962; Mackintosh, 1974, 1983; Gormezano, Kehoe, & Marshall, 1983). It gave, however, a hint that this class of "differential" algorithms would in general produce antisymmetrical learning curves. The TD learning rule (Sutton, 1988) also belongs to this class of algorithms. Accordingly, Rao and Sejnowski (2001) successfully used the TD algorithm to implement STDP. In the original TD learning rule, one specific signal is used as a dedicated reward, which is treated differently from the other inputs.
Thus, Rao and Sejnowski (2001) had to change the TD rule to some degree in order to better adapt it to STDP (see also section 4). The differential Hebbian ISO learning rule, recently introduced by us (Porr & Wörgötter, 2003a) in the context of machine control (Porr & Wörgötter, 2003b), on the other hand, treats all input lines equivalently. This prompted us to ask whether the ISO rule could also be applied to spiking neurons in a biophysically more realistic model of STDP. The mechanisms that underlie STDP are associated with the biophysics of long-term potentiation (LTP) and long-term depression (LTD; Martinez & Derrick, 1996; Malenka & Nicoll, 1999; Bennett, 2000). This involves complex calcium dynamics and the concerted action of several enzymes, such as α-calcium-calmodulin-dependent protein kinase II (CaMKII; e.g., see Teyler & DiScenna, 1987). Several kinetic models have been developed (Senn, Markram, & Tsodyks, 2000; Castellani, Quinlan, Cooper, & Shouval, 2001; Shouval, Bear, & Cooper, 2002) in order to arrive at a better understanding of some of these aspects, and some models reach a relatively high level of biophysical and biochemical complexity. As a consequence, however, they contain many degrees of freedom. The question of STDP will be addressed here in the context of a single-compartment neuron model applying the ISO learning rule. We motivate
the use of this rule for modeling STDP by the fact that it is a differential Hebbian learning rule that correlates input and output signals and produces antisymmetrical weight change curves (Porr & Wörgötter, 2003a). On the right timescale, these properties are similar to STDP, such that the ISO learning rule should in principle be applicable in this context too. Currently it is generally assumed that backpropagating spikes provide the necessary postsynaptic signal that represents the temporal reference for the considered synapse. This view has recently been questioned (see Goldberg, Holthoff, & Yuste, 2002, for a review), and a stronger emphasis has been laid on local dendritic processes. Therefore, we will specifically investigate how different possible sources of postsynaptic depolarization, modeled by different shapes of the potential change, will influence learning. The central finding of this modeling exercise is that the ISO learning rule leads in a robust and generic way to STDP, while the shape of the input signals distinctively influences the shape of the weight change curve. We believe that this study may help to further our understanding of more complex (compartmentalized or kinetic) models, because the question of how a certain STDP curve arises is reduced to the question of how the cellular parameters lead to the underlying input signal shapes.

2 Methods

2.1 Components of the Membrane Model. The model represents a small, nonspiking, dendritic compartment with synaptic connections that can take the shape of an AMPA or an NMDA characteristic. Thus, conductances g_A of AMPA and g_N of NMDA channels were modeled by state-variable equations:

$$g_A(t) = \bar{g}_A \, \hat{g}_A(t) = \bar{g}_A \, t \, e^{-t/t_{peak}} \tag{2.1}$$

$$g_N(t) = \bar{g}_N \, \hat{g}_N(t) = \bar{g}_N \, \frac{e^{-t/\tau_1} - e^{-t/\tau_2}}{1 + \eta \, [\mathrm{Mg}^{2+}] \, e^{-\gamma V}} \tag{2.2}$$

This slightly more complex notation is used because we will need the normalized conductance time functions ĝ_{A,N}(t) on their own when introducing the learning rule. All equations used in this study are numerically evaluated in 0.1 ms time steps. V is the membrane potential. Peak conductances are given by the ḡ values: ḡ_A = 5.436 nS/ms, ḡ_N = 4 nS. The other parameters were t_peak = 0.5 ms, τ₁ = 40 ms, τ₂ = 0.33 ms, η = 0.33/mM, [Mg²⁺] = 1 mM, γ = 0.06/mV (Koch, 1999). Reversal potentials used were E_A = E_N = 0 mV.
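As a concrete illustration (not part of the original model, which was written in C++), the two conductance functions with the parameters above can be sketched in a few lines of Python; all function and variable names here are ours:

```python
import math

# Parameters from section 2.1 (Koch, 1999)
G_A = 5.436        # nS/ms, AMPA scale factor (peak works out to ~1 nS)
G_N = 4.0          # nS, NMDA peak conductance
T_PEAK = 0.5       # ms
TAU1, TAU2 = 40.0, 0.33           # ms, NMDA decay and rise time constants
ETA, MG, GAMMA = 0.33, 1.0, 0.06  # 1/mM, mM, 1/mV

def g_ampa(t):
    """Equation 2.1: alpha-function AMPA conductance (nS), t in ms."""
    return G_A * t * math.exp(-t / T_PEAK) if t >= 0 else 0.0

def g_nmda(t, v):
    """Equation 2.2: NMDA conductance (nS) with voltage-dependent Mg2+ block."""
    if t < 0:
        return 0.0
    block = 1.0 + ETA * MG * math.exp(-GAMMA * v)
    return G_N * (math.exp(-t / TAU1) - math.exp(-t / TAU2)) / block

print(round(g_ampa(T_PEAK), 3))                 # AMPA peak, ~1.0 nS
print(g_nmda(5.0, -20.0) > g_nmda(5.0, -70.0))  # Mg2+ block relieved by depolarization
```

The comparison in the last line shows the essential NMDA property used throughout the article: the same presynaptic event produces a larger conductance when the membrane is depolarized.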
The conventional membrane equation (equation 2.3) was used to determine the momentarily existing membrane potential:

$$C \, \frac{dV(t)}{dt} = \sum_i (\rho_i + \Delta\rho_i) \, g_i(t) \, (E_i - V(t)) + \frac{V_{rest} - V(t)}{R} \tag{2.3}$$

with R = 100 MΩ, C = 50 pF, and V_rest = −70 mV (Koch, 1999). Here we have introduced synaptic weights ρ_i and their weight changes Δρ_i. This is done purely for convenience, because weights and peak conductances could also be combined multiplicatively. However, as we will see below, it makes sense to keep them separate, because the peak conductances ḡ can then serve as reference values for the growth (or shrinkage) of the synaptic weights. These equations were modeled using C++ in the Z-domain (Köhn & Wörgötter, 1998) in order to speed up simulations. A single synapse was assumed as the so-called plastic synapse (PS), on which the influence of the ISO learning rule was tested. This synapse can consist of varying NMDA and AMPA components, and we will call it ρ₁. Note that only the NMDA component drives the learning (see below); the AMPA component will only (mildly) influence the membrane potential, thereby possibly exerting a second-order influence on the learning. It will, however, turn out in the course of this study that the secondary AMPA influence is so small that it can be neglected in most cases. The plastic synapse receives the presynaptic spikes, modeled as δ-function input to equations 2.1 and 2.2. The influence of the NMDA component of the plastic synapse on the membrane potential is dependent on the membrane's depolarization level. We assume in this model that this is determined by the postsynaptic activity, and we tested how different postsynaptic events influence the weight change curve. To this end, three cases will be discussed: a postsynaptic influence that takes the shape of (1) an AMPA response, (2) an NMDA response, or (3) a backpropagating spike (BP spike; see Figure 1A). We will call these influences the postsynaptic depolarization source (DS). Since we will treat these cases one by one, we can associate them with the same weight (i.e., amplitude factor) ρ₀. When using a BP spike, we have generally set ρ₀ = 1.
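Equation 2.3 can be sketched with a minimal forward-Euler integrator (our own illustrative code, in SI units, with the Δρ term folded into ρ; the authors' actual simulations were done in C++ in the Z-domain):

```python
R = 100e6        # membrane resistance, ohm (100 MOhm)
C = 50e-12       # membrane capacitance, F (50 pF)
V_REST = -70e-3  # resting potential, V
DT = 1e-4        # 0.1 ms time step, as in the text

def step_membrane(v, synapses):
    """One Euler step of equation 2.3.

    synapses: iterable of (rho, g, e_rev) tuples, g in siemens, e_rev in volts."""
    i_syn = sum(rho * g * (e_rev - v) for rho, g, e_rev in synapses)
    dv_dt = (i_syn + (V_REST - v) / R) / C
    return v + DT * dv_dt

# Sanity check: with no synaptic input, the potential relaxes back to
# V_rest with the membrane time constant RC = 5 ms.
v = -60e-3
for _ in range(10000):   # 1 s of simulated time
    v = step_membrane(v, [])
print(abs(v - V_REST) < 1e-6)   # True
```

With DT much smaller than RC (0.1 ms versus 5 ms), the explicit Euler scheme is stable; a synaptic term with reversal potential 0 mV pushes the potential upward, as expected for AMPA or NMDA input.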
Technically, this was achieved by triggering the depolarization source with another δ-pulse that was shifted by a temporal interval T in relation to the presynaptic event. Physiologically, this is meant to be linked to the postsynaptic spike. Strictly, however, this association is valid only for the BP spike, which is causally related to the postsynaptic spike. The other (AMPA- or NMDA-) depolarization source events need not arise from such a causal relation but can be associated with other, independently converging influences from other synapses. The possibility of neuronal synchronization (Singer & Gray, 1995) with different lead or lag supports the possibility that clusters of other synapses could lead to the required depolarization. At this point, we note that it is not possible to rigorously define T in all
Figure 1: Schematic diagram of the model. (A) Components of the membrane model. The inset shows how to match the NMDA conductance function ĝ_N (see equation 2.2) with a resonator impulse response h₁. (B) Components of ISO learning. (C) Typical weight change curve obtained with ISO learning. Parameters to obtain this curve were Q = 0.51, f = 0.1 Hz.
instances. Experimentally, T is associated with the difference between pre- and postsynaptic spikes. At the site of the plastic synapse, one would, however, expect a 1 to 2 ms larger T due to the delay in backpropagating the spike. If other clusters of synapses are driving the (heterosynaptic) plasticity, T would have to be defined as the interval between the cluster activity and that at the plastic synapse. Figure 1A shows the different modeled depolarization sources in the context of a single-compartment model. At the summation point, the membrane potential is determined by the three depicted DS influences (see equations 2.1–2.3) as well as by the influence that comes from the plastic synapse. For practical purposes, we define T as the difference between the events as it occurs at the site of the plastic synapse, thus neglecting possible delays that are currently experimentally unresolvable. One can assume that all active processes involved in generating BP spikes will cease at (or close to) the synaptic density. Thus, locally, only the electrotonic membrane properties will prevail. At the synaptic density, they determine the membrane potential, which in turn influences the state of the Mg²⁺ block at the NMDA channels and thus the Ca²⁺ influx, which enters the CaMKII second-messenger chain (Teyler & DiScenna, 1987). Therefore,
we have decided to model the BP spike also through a conductance change g_BP at the summation point (see equation 2.4), which mimics the physiologically measured shapes of BP spikes without having to implement active processes (active channels). Note that the actual equations used do not have any physiological meaning; they are used only to design realistic backpropagating spike shapes:

$$g_{BP}(t) = \bar{g}_{BP} \left( \frac{1}{1 + e^{-t/\tau_{rise}}} - \frac{0.5}{1 + e^{-(t - \tau_{BP})/\tau_{fall}}} - 0.5 \right) \tag{2.4}$$

With the help of this equation, the rising (τ_rise) and falling (τ_fall) flanks, as well as the total width (τ_BP) of our backpropagating spikes, can be adjusted independently, while their amplitude is controlled by ḡ_BP. This allowed us to design different shapes of backpropagating spikes in a very specific way. BP spikes modeled according to equation 2.4 always have a typical shape. In order to cover transitory cases of a BP spike with a shape that is intermediate to the different shapes obtainable by equation 2.4, we used equation 2.5 (Krukowski & Miller, 2001):

$$g_{BP}(t) = \bar{g}_{BP} \left( f \, e^{-t/\tau_a} + (1 - f) \, e^{-t/\tau_b} - e^{-t/\tau_c} \right) \tag{2.5}$$
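The shape family of equation 2.4 is easy to inspect numerically. The sketch below (our illustration, not the authors' code) uses the parameter set quoted later for Figure 4A:

```python
import math

def g_bp(t, g_max, tau_rise, tau_fall, tau_bp):
    """Equation 2.4: descriptive BP-spike conductance (nS), t in ms."""
    rise = 1.0 / (1.0 + math.exp(-t / tau_rise))
    fall = 0.5 / (1.0 + math.exp(-(t - tau_bp) / tau_fall))
    return g_max * (rise - fall - 0.5)

# Figure 4A parameters: tau_rise = 1 ms, tau_fall = 10 ms,
# tau_bp = 25 ms, g_bar_BP = 59.8 nS; sample 0..50 ms in 0.1 ms steps.
trace = [g_bp(0.1 * k, 59.8, 1.0, 10.0, 25.0) for k in range(500)]

print(abs(trace[0]) < 3.0)            # starts near zero
print(max(trace) > 20.0)              # sizable transient peak
print(trace[-1] < 0.2 * max(trace))   # decays back toward baseline
```

The two sigmoids give independent control of onset and offset: shrinking tau_rise sharpens the upstroke, while tau_fall and tau_bp stretch the repolarizing flank, which is exactly the degree of freedom exploited in section 3.2.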
Actual parameters for equations 2.4 and 2.5 are given in the figure legends. To calculate the membrane potential, we assumed in all cases E_BP = 0 mV. A paired-pulse protocol was used to stimulate the inputs. The pulse interval between both inputs (defined as T) was varied between −50 and +50 ms. The interval between successive pulse pairs was 𝒯 = 250 ms, in order to prevent second-order interactions between pulse pairs (steady-state condition).

2.2 Components of ISO Learning. Figure 1B shows the circuit diagram of rate-based ISO learning for only two (δ-pulse) inputs x₀, x₁ (for a more general description, see Porr & Wörgötter, 2003a). The inputs are first bandpass filtered by means of heavily damped resonators h defined by

$$h(t) = \frac{1}{b} \, e^{at} \sin(bt) \tag{2.6}$$

with a := −πf/Q and b := √((2πf)² − a²), where f is the center frequency of the resonator and Q ≥ 0.5 the damping factor. Generally, we used Q = 0.51 (Porr & Wörgötter, 2003a), and this strong damping leads essentially to low-pass behavior. Elsewhere, we have discussed that this would suffice for learning, while using the equations for bandpass filters renders several advantageous mathematical properties (Porr & Wörgötter, 2003a). However, in the context of this study, the "bandpass" filters used are really rather low-pass filters. Such bandpass (low-pass) filtering takes place in a generic way
at almost all membrane processes, and this allows us to easily associate the abstract operations h with more realistic cellular operations below. The transformed inputs u₀,₁ converge onto the learning unit with weights ρ₀,₁, and its output is given by

$$v(t) = \rho_0(t) \, u_0(t) + \rho_1(t) \, u_1(t), \quad \text{where} \quad u_{0,1}(t) = x_{0,1}(t) * h(t) \tag{2.7}$$

The ∗ denotes a convolution. In this study, we keep the weight ρ₀ fixed. The other weight ρ₁ changes by the ISO learning rule, which uses the temporal derivative of the output:

$$\frac{d\rho_1}{dt} = \mu \, u_1(t) \, v'(t), \qquad \mu \ll 1 \tag{2.8}$$

In the original article, we had shown that ISO learning produces a linear weight change that can be calculated for all t ≥ 0 by solving

$$\rho_1 \rightarrow \rho_1 + \Delta\rho_1 \tag{2.9}$$

$$\Delta\rho_1(T) = \mu \int_0^\infty u_1(T + \tau) \, v'(\tau) \, d\tau \tag{2.10}$$
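The integral in equation 2.10 can also be evaluated numerically. In the sketch below (our illustration; the value of f is scaled so that the kernel spans tens of milliseconds, unlike the much slower timescale of Figure 1C, and the learning rate is an arbitrary choice), the antisymmetric shape of the weight change emerges directly from the rule:

```python
import math

F, Q, MU = 0.01, 0.51, 0.001   # resonator frequency (1/ms), damping, learning rate
DT = 0.1                        # ms

A = -math.pi * F / Q
B = math.sqrt((2.0 * math.pi * F) ** 2 - A ** 2)

def h(t):
    """Equation 2.6: heavily damped resonator impulse response."""
    return (1.0 / B) * math.exp(A * t) * math.sin(B * t) if t >= 0 else 0.0

def delta_rho1(T, rho0=1.0, t_max=500.0):
    """Equation 2.10 for one delta-pulse pair; x1 (pre) fires T ms before x0.

    Only the rho0*u0 term of v is kept: the rho1 self-term integrates
    to zero, since the integral of h*h' is [h^2 / 2] evaluated at 0 and
    infinity, where h vanishes."""
    total = 0.0
    for k in range(int(t_max / DT)):
        tau = k * DT
        dv = rho0 * (h(tau + DT) - h(tau)) / DT   # v'(tau)
        total += MU * h(T + tau) * dv * DT
    return total

print(delta_rho1(+10.0) > 0)   # x1 leads x0: weight grows (positive lobe)
print(delta_rho1(-10.0) < 0)   # x0 leads x1: weight shrinks (negative lobe)
```

With two identical filters, the curve Δρ₁(T) is antisymmetric around T = 0 and decays for large |T|, matching the qualitative shape of Figure 1C.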
This integral can be solved analytically and leads to the ISO learning weight change curve shown in Figure 1C for two identical bandpass filters h. Note that this curve becomes skewed if two different bandpass filters are used, which indicates that the shapes of the input functions u are critical in determining the shape of the weight change curve.

2.3 Associating the Membrane Model to ISO Learning. We need to associate the parameters of the ISO learning rule with those in the membrane model for the plastic synapse ρ₁:

• x₁: We assume that the presynaptic spike train at the plastic synapse represents the signal x₁ of ISO learning.

• h₁: The bandpass filter operation h₁ is represented by the conductance functions g of the plastic synapse, and we define

$$h_1(t) := \hat{g}_N(t) \tag{2.11}$$
• u₁: Since we are only dealing with spike trains modeled as δ-functions, we get u₁(t) = h₁(t) = ĝ_N(t). Corresponding curves are shown in the inset of Figure 1A. This shows that the conductance function essentially captures the characteristic of low-pass filtering the spike at the input. The match between the curves, however, is not exact, immediately indicating that the results of the membrane model will not be identical to those of ISO learning.
• v: The membrane potential V is associated straightforwardly with the output function v from ISO learning. Note that, as opposed to the original linear ISO learning rule, the biophysically adapted version introduced here is no longer linear. The results presented later, however, will show that the adapted rule still leads to generic antisymmetrical weight change curves. The nonlinearities introduced by the reversal potentials, as well as by the voltage dependence of the NMDA channel, influence the results only quantitatively.

• x₀: The signal x₀ from ISO learning is associated with three possible signals: a spike arriving at input 1 (AMPA) or 2 (NMDA), or a BP spike (3 in Figure 1A).

• u₀: It is not necessary to associate u₀ with any component of the membrane model, because it does not influence the plastic synapse ρ₁ directly. Instead, this happens only via the derivative of the membrane potential.¹ Essentially, u₀ represents the conductance change for any of the three introduced depolarization sources (see equations 2.1–2.3 and Figure 1A).

• ρ₀,₁: Synaptic weights were directly introduced into the membrane equation (see equation 2.3). ρ₁ is the initial value of the synaptic weight of the plastic synapse. ρ₀ is used as the amplitude factor for a possible second synapse or the BP spike. Thus, ρ₀ defines the strength of the depolarization source.

• Learning rule: As a consequence of these settings, the learning rule of ISO learning is rephrased in the context of this model as

$$\frac{d\rho_1}{dt} = \mu \, u_1(t) \, v'(t) = \mu \, \hat{g}_N(t) \, V'(t) \tag{2.12}$$
In section 4, we will address the question of the physiological relevance of the different parts of this learning rule and show how to associate pre- and postsynaptic events with the different terms. Here we only note that this rule in its derivative form can be associated with calcium flow through NMDA channels, or in its integrated form with the calcium concentration (see section 4).

¹ In a symmetrical learning situation (with ρ₀ changing also), one could associate u₀ with the corresponding conductances ĝ in a similar way as for u₁.

Note that this rule is (as usual) treated in an adiabatic condition, assuming that multiple spike pairs (with interspike interval T), which occur with a temporal distance of 𝒯 = 250 ms between them, do not influence each other. Thus, for the actual weight change Δρ obtained with one
spike pair at the inputs, we use equation 2.10 and calculate

$$\Delta\rho_1 = \int_0^t \frac{d\rho_1}{dt} \, dt \tag{2.13}$$

where T ≪ t ≪ 𝒯 (with 𝒯 = 250 ms, the interval between pulse pairs). We call Δρ an integrated weight change. The learning rate μ takes the unit of Volt⁻¹, because this way Δρ is rendered unit free. To regard μ as a voltage-dependent entity may make sense, given the observation that predepolarization enhances the induction of LTP and vice versa (Sourdet & Debanne, 1999). The physiological meaning behind the concept of a synaptic weight is still under debate. Several pre- and postsynaptic mechanisms contribute to the weight (Malenka & Nicoll, 1999), which in this study are subsumed under a single number ρ. However, using these settings, ρ can be treated as a multiplicative factor of the peak conductance ḡ; multiplied together, they can be interpreted as the strength of a given connection in the context of this model. At this point, it is important to note that the final value of μ is rather arbitrary, because it is just a multiplicative factor that changes the slope of the (linear, see below) learning curve. In physiology, there exist (intracellular) amplification mechanisms that could in principle be associated with such a μ-factor. This does not make sense in the context of this model, because such mechanisms are not implemented. Thus, we will set the value μ = 1 and provide an analysis of the range of μ within which the model operates linearly (see Figure 7).

3 Results

In this section, we present results obtained when using a pure NMDA synapse as the plastic synapse. At the end of this section, we discuss the physiologically more realistic case of a mixed AMPA/NMDA synapse, showing how to infer the corresponding results from what we have presented before. We use the three different sources for a postsynaptic depolarization introduced in section 2: a BP spike (3 in Figure 1A), which is currently believed to be the most likely source of depolarization, but also a pure AMPA influence (1 in Figure 1A) or a pure NMDA influence (2 in Figure 1A).
The goal of this section is to distinguish unrealistic from more realistic cases and to arrive at some conclusions concerning the robustness of the obtained results. In all cases, we set the relative initial strength of the plastic synapse to ρ₁ = 0.5, which means that the synaptic weight of this connection was initially at 0.5 ḡ_N. The weight ρ₀, usually set to 1, is kept constant.

3.1 Individual Weight Change Examples. Figure 2A shows the conductance g_N, Figure 2B the membrane potential, and Figure 2C its derivative,
Figure 2: Detailed curves for a single pulse-pair experiment with T = 10 ms and a BP spike as the depolarization source, ρ₀ = 1. (A) Conductance change g_N arising from the presynaptic input and, as a consequence of the BP spike, leading to positive feedback at the e^(−γV) term in equation 2.2. (B) Membrane potential change. (C) Derivative of the membrane potential. (D) Resulting integrated weight change. The weight stabilizes as soon as all curves have returned to their equilibrium.
and Figure 2D the development of the plastic synapse ρ₁ for a single input pulse pair, with the first spike arriving at t = 10 ms at the plastic synapse (presynaptic spike) and the BP spike arriving at t = 20 ms. Thus, we have a positive value of T = +10 ms. The initial membrane potential was at resting level (−70 mV) for this simulation. The plastic synapse was assumed to be a pure NMDA synapse. The small increase in the NMDA conductance g_N (see Figure 2A) starting at t = 10 ms is caused by a spike at the plastic synapse. The following large peak is due to the rising membrane potential as soon as the BP spike arrives at 20 ms. The membrane potential (see Figure 2B) increases slightly at t = 10 ms because of the activated NMDA channel and is dominated by the BP spike later. The upper part of the membrane potential curve (see Figure 2B) is not shown, in order to make the small NMDA channel response more visible. An integrated weight change Δρ₁ (see Figure 2D) occurs throughout the duration of the membrane potential excursion. It follows the rule given in
Figure 3: Detailed curves for a single pulse-pair experiment with T = −10 ms. Panels are the same as those in Figure 2; only in this case, a negative integrated weight change is obtained (D).
equation 2.12, integrated according to equation 2.13, and the weight finally stabilizes slightly above zero as soon as the membrane potential returns to resting level. The obtained change is small because only a single pulse pair was used. Essentially, the opposite situation is observed when inverting the pulse sequence to T = −10 ms (see Figure 3). The negative excursion of the derivative of the membrane potential (C) is now scaled with the full g_N influence (A), leading to a strong drop of the integral and finally to a reduced weight ρ₁ at steady state (D). Figure 4A shows complete weight change curves obtained at different resting potentials, V_rest = −40 to −70 mV. We observe that the shape of the curves remains essentially the same, while the magnitude of the weight change grows slightly when the membrane potential is depolarized. This is in accordance with observations that the amplitude of LTP can be augmented by predepolarizing the cell under study, as a consequence of the voltage dependence of the NMDA channel (Sourdet & Debanne, 1999). Part B of Figure 4 has been obtained with a second synapse as the depolarization source and shall be discussed later.
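The sign behavior seen in Figures 2D and 3D can be illustrated with a toy version of equations 2.12 and 2.13. In the sketch below (our illustration only: the depolarization source is replaced by a fixed, broad Gaussian voltage pulse standing in for the simulated BP spike, with no feedback onto the NMDA conductance), only the timing logic of the rule is reproduced:

```python
import math

TAU1, TAU2 = 40.0, 0.33   # ms, NMDA kinetics from section 2.1
DT = 0.1                   # ms

def g_hat_n(t):
    """Normalized NMDA conductance time course (the u1 of equation 2.12)."""
    return math.exp(-t / TAU1) - math.exp(-t / TAU2) if t >= 0 else 0.0

def v_toy(t, T):
    """Toy membrane potential (mV): broad depolarization centered at t = T."""
    return -70.0 + 40.0 * math.exp(-((t - T) / 6.0) ** 2)

def integrated_weight_change(T, mu=1e-4, t_max=300.0):
    """Equation 2.13: accumulate mu * g_hat_N(t) * V'(t) over one trial.

    The presynaptic spike at t = 0 opens the NMDA channel; the toy
    depolarization peaks at t = T."""
    total = 0.0
    for k in range(int(t_max / DT)):
        t = k * DT
        dv = (v_toy(t + DT, T) - v_toy(t, T)) / DT
        total += mu * g_hat_n(t) * dv * DT
    return total

print(integrated_weight_change(+10.0) > 0)   # pre leads post: positive, as in Figure 2D
print(integrated_weight_change(-10.0) < 0)   # post leads pre: negative, as in Figure 3D
```

For T > 0, the NMDA conductance is largest during the rising flank of the depolarization, so the positive V' contributions dominate; for T < 0, the conductance opens on the falling flank, where V' is negative throughout, yielding depression.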
606
A. Saudargiene, B. Porr, and F. Wörgötter
Figure 4: Weight change curves obtained at different resting potentials. (A) Using a BP spike as depolarization source. The BP spike is modeled with equation 2.4 and parameters τ_rise = 1 ms, τ_fall = 10 ms, τ_BP = 25 ms, ḡ_BP = 59.8 nS, ρ₀ = 1. (B) Using a second NMDA synapse as the depolarization source, ρ₀ = 10.
3.2 Influence of the Shape of the BP Spike. Figure 5 shows 17 weight change curves and the BP spikes with which they were obtained. Note that some of these spike shapes do not necessarily reflect realistic BP spike shapes. Instead, this modeling exercise is meant to cover a rather complete range of composable shapes, such that the characteristics of a weight change curve resulting from any other BP spike shape can be inferred from this diagram. In general, we observe that the negative part of the weight change curve dominates in most cases across all panels, which is in accordance with physiology (Debanne, Gähwiler, & Thompson, 1998; Feldman, 2000). Figure 5: Facing page. Weight change curves obtained with different BP spikes as the depolarization source. Top panels show the weight change curves and bottom panels the BP spikes with which they were obtained. BP spikes were modeled using equation 2.4, adjusted to the same amplitude. A and B also contain one example obtained with a BP spike of intermediate shape modeled with equation 2.5. This spike starts with the shape of the first BP spike in B and ends with the shape of the last spike. In all cases we used ρ₀ = 1. Individual parameters for the different BP spikes were: A, B, τ_rise = 1 ms: (τ_fall = 1 ms, τ_BP = 9 ms, ḡ_BP = 56 nS); (τ_fall = 5 ms, τ_BP = 15 ms, ḡ_BP = 61.4 nS); (τ_fall = 10 ms, τ_BP = 25 ms, ḡ_BP = 59.8 nS); (τ_fall = 20 ms, τ_BP = 30 ms, ḡ_BP = 67.5 nS). C, D, τ_rise = 5 ms: (τ_fall = 1 ms, τ_BP = 20 ms, ḡ_BP = 56 nS); (τ_fall = 5 ms, τ_BP = 25 ms, ḡ_BP = 64 nS); (τ_fall = 10 ms, τ_BP = 35 ms, ḡ_BP = 63 nS); (τ_fall = 20 ms, τ_BP = 45 ms, ḡ_BP = 67.5 nS). E, F, τ_rise = 10 ms: (τ_fall = 1 ms, τ_BP = 35 ms, ḡ_BP = 56 nS); (τ_fall = 5 ms, τ_BP = 40 ms, ḡ_BP = 62 nS); (τ_fall = 10 ms, τ_BP = 50 ms, ḡ_BP = 63.5 nS); (τ_fall = 20 ms, τ_BP = 60 ms, ḡ_BP = 69 nS).
G, H, τ_rise = 20 ms: (τ_fall = 1 ms, τ_BP = 60 ms, ḡ_BP = 57.5 nS); (τ_fall = 5 ms, τ_BP = 70 ms, ḡ_BP = 60 nS); (τ_fall = 10 ms, τ_BP = 80 ms, ḡ_BP = 62 nS); (τ_fall = 20 ms, τ_BP = 90 ms, ḡ_BP = 68 nS). Parameters for the BP spike with intermediate shape in A and B are τ_a = 7 ms, τ_b = 30 ms, τ_c = 4 ms, f = 0.9, ḡ_BP = 10.9 nS.
By comparing the curves within each panel, it can be seen that increasing fall times (τ_fall) of the BP spike mainly lead to an increase of the positive peak of the weight change curve, while the negative peak becomes smaller but more spread out toward negative values of T. By comparing curves across panels, one can assess the influence of increasing rise times (τ_rise). Here we observe that the typical STDP shape of the curves in Figure 5A (zero crossing at about T = 0 ms) becomes similar to standard Hebbian learning for values of T > −20 ms for a rather shallow rise time τ_rise = 20 ms of the BP spikes in Figure 5G. Only for T < −20 ms are negative weight changes again obtained. This effect becomes more pronounced when increasing the rise times to values τ_rise > 20 ms. Such shallow rise times may indeed occur at distal dendrites where, discounting possible active processes, the membrane capacitance has smeared out a BP spike substantially (Magee & Johnston, 1997; Larkum, Zhu, & Sakmann, 2001). When cells are driven, for example, by a stimulus, pre- and postsynaptic spikes will follow each other, often at intervals of less than 20 ms (Froemke & Dan, 2002). Thus, the values of a weight change curve for T beyond ±20 ms are probably often not relevant for a cell's synaptic plasticity (Froemke & Dan, 2002). Therefore, the shape-dependent leftward shift of the weight change curve, leading to LTP within rather large temporal intervals, could be of some theoretical interest, because it shows that we do not have to alter the learning rule in order to get either differential Hebbian learning or a characteristic similar to standard Hebbian learning at realistic interspike intervals. A changing input characteristic alone will do the trick. Note that in general, the plastic NMDA synapse will contribute almost nothing to the membrane potential change compared to the strong influence of the BP spike.
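The rise-time effect can be checked numerically. The sketch below is illustrative only and does not implement equation 2.4: both the NMDA conductance and the BP spike are generic normalized difference-of-exponentials transients, with all BP spikes scaled to the same amplitude, as in Figure 5. Slowing the rise of the depolarization keeps V′ positive long enough that a presynaptic spike arriving 10 ms after the BP spike still produces a net weight increase.

```python
import numpy as np

def transient(t, t0, tau_slow, tau_fast):
    """Normalized difference-of-exponentials pulse starting at t0 (ms)."""
    s = np.clip(t - t0, 0.0, None)
    h = np.exp(-s / tau_slow) - np.exp(-s / tau_fast)
    return h / h.max()

def delta_rho(T, bp_rise, bp_fall, mu=1.0, dt=0.02):
    """Weight change for one pairing with a BP spike of given rise/fall times;
    illustrative stand-in waveforms, all normalized to a 50 mV amplitude."""
    t = np.arange(0.0, 400.0, dt)
    g_nmda = transient(t, 100.0, tau_slow=40.0, tau_fast=0.33)
    V = 50.0 * transient(t, 100.0 + T, tau_slow=bp_fall, tau_fast=bp_rise)
    return mu * np.sum(g_nmda * np.gradient(V, dt)) * dt

sharp   = delta_rho(-10.0, bp_rise=1.0,  bp_fall=10.0)   # steep BP spike
shallow = delta_rho(-10.0, bp_rise=20.0, bp_fall=40.0)   # smeared-out BP spike
print(sharp)     # negative: STDP-like depression at T = -10 ms
print(shallow)   # positive: Hebbian-like potentiation at T = -10 ms
```

The fall time of the shallow spike (40 ms, an assumed value) is chosen so that the depolarization peaks tens of milliseconds after onset, mimicking a BP spike smeared out by the membrane capacitance.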
Thus, in the specific case of Figures 5G and 5H, the derivative of the membrane potential will remain positive for rather long durations as soon as the rise time is large. This leads to positive weight changes also for large negative T values. The one example of a BP spike with intermediate shape (see Figures 5A and 5B) shows, quite expectedly, that gradual spike shape transitions will also lead to gradual transitions of the shape of the weight change curves. This supports the notion that other shapes of weight change curves can basically be inferred from these examples. 3.3 Influence of Different NMDA Characteristics. It is known that during development, the relative frequency of different NMDA receptor types (NMDAR_A versus NMDAR_B) changes. This influences the electrophysiological properties of the NMDA channel. Figures 6A and 6B show three different NMDA characteristics, the steepest reflecting an adult stage. The other two stages are observed during development at postnatal days 26 to 29 (τ_decay = 380 ms) and 37 to 38 (τ_decay = 189 ms) in ferret at a +40 mV voltage
Figure 6: Learning curves obtained with different NMDA characteristics. (A) EPSC, (B) conductance of the NMDA synapse, both at +40 mV voltage clamp. (C) Weight change curves. Parameters for equation 2.2 were as follows. Normal (adult) NMDA: ḡ_N = 4 nS, τ₁ = 40 ms, τ₂ = 0.33 ms, which gives an EPSC with τ_decay = 41.7 ms. Young NMDA, before eye opening (26–29 days): ḡ_N = 4.02 nS, τ₁ = 363.1 ms, τ₂ = 0.033 ms, which gives an EPSC with τ_decay = 380 ms. Older NMDA, after eye opening (37–38 days): ḡ_N = 4.2 nS, τ₁ = 173.3 ms, τ₂ = 0.033 ms, which gives an EPSC with τ_decay = 189 ms. The BP spike was modeled according to equation 2.4 with parameters τ_rise = 1 ms, τ_fall = 10 ms, τ_BP = 2.5 ms, ḡ_BP = 59.8 nS, ρ₀ = 1.
clamp preparation, where there is no more Mg²⁺ blockage of the NMDA channel. The single decay values for τ_decay were taken from Roberts and Ramoa (1999), but we still modeled the NMDA characteristic using equation 2.2 by fitting our two τ-values to yield the curves reported by Roberts and Ramoa. To obtain the weight change curves, we used a BP spike with a short rise time (1 ms) and a medium fall time (10 ms; compare Figure 5B). Interestingly, we observe that both "young" NMDA synapses yield rather asymmetrical weight change curves with a strongly dominating LTD part. To our knowledge, so far very little is known about the actual physiological learning characteristics of early synapses. There are, however, indications that synaptic elimination dominates the early developmental stages (analyzed in a theoretical study by Chechik, Meilijson, & Ruppin, 1998). The
theoretical results obtained with our learning rule would possibly point in this direction. 3.4 Multiple Spike Pairs. Figures 7A and 7B show how weights develop when using a sequence of 30 pulse pairs with interpair intervals of T = 100 ms, which still guarantees the adiabatic condition for interspike intervals below T = ±50 ms (see Figure 7C). Different learning rates μ ≥ 100 and varying amplitudes of the BP spike were applied to obtain Figures 7A and 7B. These high learning rates were used in order to be able to use only a few pulse pairs for measuring the whole curve. We find two different types of behavior. Curves 1 to 6 show a gradual increase with almost unchanging slope (until saturation, curves 5 and 6); curves 7 to 9 show weak growth until a certain point, from which they grow much faster. Curves with gradual, unchanging growth (1–6) are obtained as soon as the amplitude of the BP spike is large. To obtain them, we kept the same BP spike amplitude and increased the learning rate. This way, weight growth can be adjusted to different values to match the physiologically obtained percentage weight changes if desired (Bi & Poo, 2001). Curves 7 to 9 were obtained with (very) small amplitudes of the BP spike. We observe that the left part of the curves still shows a very shallow increase, followed by a kink (or bend), which continues into a second, steeper part of the curve, followed by saturation in curves 8 and 9. These differences can be explained by looking at how the membrane potential develops over time for the two different cases, as shown in Figure 7B. Each depolarization represents the response of the model to a pulse pair (T = 25 ms). Most of the time, two peaks are seen in the potential; the Figure 7: Facing page. (A) Progress of the weight growth for a plastic NMDA synapse with initial value ρ₁ = 0.5 and BP spikes of different amplitudes. We used T = 5 ms as the temporal interval between pulses.
Different slopes of the curves were obtained with different learning rates μ, which were, for curves 1–6, increasing as μ = 100, 200, 300, 500, 1000, 3400; curve 7, μ = 1000; curves 8 and 9, μ = 3400. The BP spike was modeled according to equation 2.4 for all curves with parameters ρ₀ = 1, τ_rise = 1 ms, τ_fall = 10 ms, τ_BP = 25 ms; for curves 1–6, ḡ_BP = 59.8 nS; for curves 7 and 9, ḡ_BP = 4 nS; for curve 8, ḡ_BP = 0.4 nS. The resulting BP spike amplitude V_max was, for curves 1–6, V_max = −20 mV; curves 7 and 9, V_max = −61 mV; curve 8, V_max = −69 mV. Interpair interval T = 100 ms. (B) Membrane potential traces comparing the cases with and without steep increase. Here we used T = 25 ms to make the individual contributions of both pulses more visible. The top panel corresponds to curve 9, the bottom panel to curve 3. Interpair interval T = 100 ms. (C) Weight change curves obtained with multiple spike pairs at different interpair intervals T = 100, 50, 25 ms. Parameters were: μ = 1, ρ₀ = 1, τ_rise = 1 ms, τ_fall = 10 ms, τ_BP = 25 ms, ḡ_BP = 59.8 nS.
first comes from the plastic synapse and the second from the BP spike. The bottom trace represents a case where the amplitude of the BP spike was relatively large (as for curves 1–6). As a consequence, the general shape of the potential is very strongly dominated by the BP spike; the plastic synapse does not contribute much to it. In the top trace, we have used a very small BP spike amplitude (as for curves 7–9). This time, the potential is dominated
by the growing plastic synapse. Given that this is a pure NMDA synapse, we obtain positive feedback through the e^(−γV) term in equation 2.2. Hence, we get a very steep increase as soon as this synapse gets "overly" strong. As a result, we get a slope in curves 8 and 9 that is the same as that in curve 6, which was obtained with a large BP spike and the same high learning rate (μ = 3400). For the same reason, the slopes of curves 5 and 7 at the right side of the diagram are also similar. Note that if we use the small learning rate of μ = 1 that we normally applied, we still expect the same basic behavior, but only after many more input pairs and with a much shallower nonlinear increase as soon as the positive feedback sets in. It is not conceivable that the situation shown in curves 7 to 9 directly corresponds to physiology, because one rarely finds pure NMDA synapses that at the same time would have to grow very strongly before being able to elicit this effect. However, a cluster of mixed AMPA and NMDA plastic synapses at a peripheral dendrite (or at a spine), where the BP spike might be small, could indeed lead to local nonlinear potential changes resulting in such effects. Figure 7C shows weight change curves obtained with different interpair intervals T, as indicated in the figure. This shows that the adiabatic condition is violated as soon as interspike intervals T approach the interpair interval. As a consequence, secondary LTP (or LTD) regions appear, as expected from the results of Froemke and Dan (2002). Bi (2002) discusses various cases of spike pair combinations that could lead to such results; explicit experimental proof of this model prediction, however, is still lacking at the moment.
3.5 Weight Change Curves Obtained with a Second Synapse as Depolarization Source. It is known that the postsynaptic depolarization signal is needed in order to remove the Mg²⁺ block at the NMDA channels, without which no Ca²⁺ could enter the cell. A BP spike provides a very strong source of depolarization. More locally, however, other sources of postsynaptic depolarization can also exist, especially when considering clusters of synapses. Here, "any other" synapse could lead to a local depolarization affecting the plastic NMDA synapse under consideration. This would lead to heterosynaptic plasticity, and Goldberg, Staff, and Spruston (2002) discuss the biophysical implications of this alternative. Accordingly, Figure 8 shows two groups of curves obtained with a plastic NMDA synapse and different synaptic depolarization sources. Depolarization comes from a cluster of AMPA synapses for A curves and a cluster of NMDA synapses for B curves. Both types of curves, A and B, are antisymmetrical, but the generic shape differs. Curves A have a skewed asymmetry and a slight positive offset, while curves B possess almost equal LTD and LTP parts and are shifted above zero for weak depolarization.
Figure 8: Weight change curves obtained with depolarization source amplitudes ρ₀ of different strength. In general, an increase in depolarization source amplitude leads to an increase of the amplitude of the weight change curves. Insets show the shape of the two cut-off curves pointed to by the arrows at a reduced magnification. (A) Using an NMDA synapse as the depolarization source with ρ₀ = 0.25; 0.5; 1; 2; 5; 10. (B) Using an AMPA synapse with ρ₀ = 0.25; 0.5; 1; 2; 5; 10.
Nothing is known about STDP at synaptic clusters, and our results show that plasticity may take a different form when depolarization is caused by synchronized synapses and not by a BP spike. The different curves in Figures 8A and 8B were obtained by setting ρ₀ between 0.25 and 10, thus assuming different relative strengths of the depolarization source.² In all cases, we observed that the strength of the depolarization source acts as an amplification factor, leaving the shape of the curves essentially unchanged. As noted above, pure amplification is meaningless in the context of these simulations, because it can be achieved by a changed learning rate μ or a larger number of paired pulses as well. Interestingly, we found that a changing depolarization source strength not only affects amplification but also induces a shift of the curves with respect to zero. The smallest curves, which were obtained with a weak depolarization source, remain above zero all the time. Thus, in spite of their realistic-looking shape, these curves do not represent STDP. For larger values of DS (ρ₀ ≥ 0.5), a zero crossing is observed, and only for the largest curve (DS: ρ₀ = 10) does the negative part cover more area than the positive part, which seems to be the generic case for most STDP curves. Furthermore, weight changes are about tenfold smaller compared to the cases above, where we had used a BP spike as the depolarization source

² In principle, it would be possible to keep the relative strength of DS equal to 0.5 and vary the strength of PS. This, however, is essentially symmetrical to the experiments shown, because only the quotient between the strengths of PS and DS determines the shape of the resulting STDP curves. This holds apart from minor effects due to the recurrent influence of the NMDA channel.
(except for the cases ρ₀ = 10). This is due to the much stronger change in membrane potential arising from a BP spike. As a consequence, a tonic depolarization of the membrane potential will lead to a stronger amplification of the weight change curves when using a second synapse as depolarization source (see Figure 4B) compared to the situation where we had used a BP spike. 3.6 Mixed NMDA/AMPA Plastic Synapses. So far we have looked only at pure NMDA synapses as the plastic synapse. This is not in accordance with physiology, because most synapses that contain NMDA components are mixed NMDA and AMPA synapses (as depicted in Figure 1A), normally with a dominating AMPA part (Malenka & Nicoll, 1999). In addition, one finds that during learning, the AMPA component of such synapses undergoes much stronger changes than the NMDA component (Lüscher & Frerking, 2001; Song & Huganir, 2002). This happens in spite of the fact that it is the NMDA component that is the driving force of the learning. This last observation, however, justifies formulating the learning rule used here with only the normalized NMDA conductance function ĝ_N in equations 2.11 and 2.12. Thus, the AMPA component cannot directly enter learning through equation 2.12; it will, however, influence the membrane potential V and thereby exert two effects: (1) it influences the e^(−γV) term in the NMDA conductance function (see equation 2.2), and (2) it directly influences the derivative of the membrane potential. Both effects could change learning. However, at this stage of the analysis, we have arrived at the conclusion that the BP spike is in most cases the strongest and most influential depolarization source, which is in accordance with others (reviewed in Linden, 1999, but see Goldberg et al., 2002).
Above, we have argued that the depolarization that comes from a BP spike is normally so much stronger than that arising from the plastic synapse itself that the influence of the plastic synapse on the membrane potential can be neglected. This still holds for mixed AMPA and NMDA synapses. A realistic single excitatory postsynaptic potential (EPSP) is normally rather small. Thus, as long as the BP spike is still strong, a single EPSP will not substantially influence learning. As soon as the BP spike drops in amplitude, one would have to consider the effect of a mixed plastic synapse. In this case, the learning curve will gradually approach the shape shown in Figure 8B, where a pure AMPA component is assumed as the depolarization source, and we will get nonlinear learning behavior such as that observed in Figure 7, curves 7 to 9. The wide range of possible effects that might arise from this, however, exceeds the scope of this article. 4 Discussion This study consists essentially of two parts. The first part is the association between the ISO learning model and its biophysical counterpart, and the
second part concerns the actual findings obtained from the new model. We discuss both parts consecutively. At the end of this section, we compare our model to others found in the literature. 4.1 Discussing the Model Assumptions. ISO learning is a typical artificial neural network algorithm, and it is therefore rather unrelated to biophysics. As a consequence, any kind of biophysical reevaluation cannot immediately go all the way down to individual channel and calcium dynamics. Instead, in this study we have attempted to go one step in this direction by adapting the ISO learning algorithm to a traditional state-variable description of a neuronal (compartmental) model. One central assumption of ISO learning is the filtering of its inputs. In a neuron, low-pass filtering takes place in a very natural way as a consequence of the NMDA channel properties, as well as from the low-pass characteristics of natural membranes. Obviously, these low-pass filters are not identical to the technical bandpass filters used in ISO learning, but the deviations are small enough to yield the same basic STDP-like learning behavior. ISO learning was designed to correlate two inputs with each other in time (e.g., in a temporal sequence learning paradigm). STDP, however, takes place in relation to the temporal structure between a neuron's input (presynaptic) and its output (postsynaptic). In spite of this apparently different algorithmic structure, adaptation of both models is still straightforward when realizing that normally a postsynaptic spike has been elicited by some presynaptic influence. This justifies our approach of assuming either a second (cluster of) synchronized synapse(s), a dendritic spike, or a BP spike as a possible depolarization source (Linden, 1999; Goldberg et al., 2002). The learning rule consists of two components. The second term is given by the derivative of the membrane potential.
In most cases, the membrane potential is strongly dominated by the shape of the BP spike at the moment of pairing, while the contribution of the plastic synapse (or other synapses) can be neglected. This makes V′ a postsynaptic quantity. Given that V′ = I/C, we note that the learning rule can also be rewritten as dρ₁/dt = (μ/C) ĝ dQ/dt. This shows that the charge transfer dQ/dt across the (postsynaptic) membrane is a major driving force of learning. We can assume that a part of dQ/dt is contributed by calcium flow (Malenka & Nicoll, 1999). Then, after integration, the final weight change Δρ is determined by part of Q, the total amount of calcium ions that crossed the membrane. This interpretation is valid as long as the calcium contributes an approximately fixed part to the total current. The model does not take into account more complex calcium dynamics, buffering, enzymatic reactions, or other processes that take place during physiological weight changes. This was clearly not intended at this level of model complexity. As the first term of the learning rule, we have used the normalized NMDA conductance function ĝ_N, which represents the bandpass-filtered input u₁
of ISO learning's response to a δ-pulse input. We would argue that ĝ_N essentially subsumes the time course of all processes that occur for an NMDA receptor outside or directly at the membrane: thus, all presynaptic events, for example, glutamate release, binding to the receptors, unbinding, and elimination from the synaptic cleft. The efficiency with which this happens is encoded in the scaling factor ḡ_N. Thus, our learning rule uses a product of a presynaptic (ĝ) and a postsynaptic (V′) influence. From this, it is now clear that the association of V′ with calcium flow must be restricted to the proportion of the calcium that travels through the NMDA channels. Voltage-gated calcium channels, calcium-induced calcium release, and other calcium buffer release mechanisms, which could potentially influence synaptic changes, are not considered in this model. There is, however, wide-ranging support (especially for spines) that synaptic plasticity is indeed strongly dominated by calcium transfer through NMDA channels (Schiller, Schiller, & Clapham, 1998; Yuste, Majewska, Cash, & Denk, 1999; Malenka & Nicoll, 1999) and that the other calcium release mechanisms may play only a minor role (but see, e.g., Huemmeke, Eysel, & Mittmann, 2002). Note that in this study, we do not implement any mechanisms of short-term plasticity (Markram & Tsodyks, 1996; Fortune & Rose, 2001). In principle, this could be done using a fast model for short-term plasticity (Giugliano, Bove, & Grattarola, 1999) as a front end that continuously adjusts the base value of ρ as soon as an input spike train arrives. Here we are also faced with another problem. Any input spike train that fires the cell will lead to complex "pre-post-pre-etc." combinations (Froemke & Dan, 2002), discussed also in Bi (2002). Our model can cope with these effects too, and we obtain additional transitions from LTP to LTD or vice versa depending on the pre-post sequence, as shown in Figure 7C.
4.2 Discussion of the Findings. We believe that three of our findings could be of longer-lasting relevance for the understanding of synaptic learning, provided they withstand physiological scrutiny: (1) the shape of the weight change curves relies heavily on the shape of the input functions (see Figures 5 and 8); (2) differential Hebbian learning (STDP) can become similar to standard Hebbian learning (LTP) if the postsynaptic depolarization (i.e., the BP spike) has a shallow rise (see Figures 5A and 5B versus Figures 5F and 5G); and (3) weight growth can strongly change its characteristic in the course of learning if the membrane potential is locally dominated by the potential change arising from the plastic synapse itself (see Figure 7). 4.2.1 Finding 1. Physiological studies suggest that weight change curves can indeed have widely varying shapes (reviewed in Abbott & Nelson, 2000, and Roberts & Bell, 2002). In our study, both the NMDA characteristic and the characteristic of the depolarization source influence the shape of the weight change curve. The NMDA characteristic changes during
development, and in the ferret "young" NMDA channels are even slower than those found in adult animals (Roberts & Ramoa, 1999). We find that for "young" synapses, LTD strongly dominates (see Figure 6). Functionally, this would make sense in helping to stabilize (Song, Miller, & Abbott, 2000; Rubin, Lee, & Sompolinsky, 2001; Kempter, Gerstner, & van Hemmen, 2001) an immature network, where false "inverse-causal" correlations could still be frequent. The shape of the membrane potential locally at the synapse is also a source of differences in the shapes of the weight change curves. The physiological properties and morphology of dendritic trees will lead to locally different active back-propagation or passive attenuation of the BP spike (Magee & Johnston, 1997; Larkum et al., 2001). It has recently been shown that synaptic plasticity in distal dendrites may be triggered by local Na⁺- and/or Ca²⁺-mediated dendritic spikes (Golding et al., 2002), which are usually slower than BP spikes (Stuart, Spruston, Sakmann, & Häusser, 1997; Schiller, Schiller, Stuart, & Sakmann, 1997; Häusser, Spruston, & Stuart, 2000; Larkum et al., 2001). As a consequence of these different shapes of the depolarizing potentials, the resulting weight change curves would differ as well. Many theoretical studies on STDP assume some kind of "generic" weight change curve that is applied regardless of the morphology of the neuron (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Song et al., 2000; Kempter, Leibold, Wagner, & van Hemmen, 2001; Rubin et al., 2001). Others assume a gradually changing shape following, for example, the length of a dendrite, without making specific assumptions about the parameters that lead to the different weight change curves (Panchev, Wermter, & Chen, 2002; Sterratt & van Ooyen, 2002). In our study, we argue that the shape of the inputs determines the shape of the weight change curves.
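The LTD dominance for slow "young" NMDA channels can be illustrated with a hedged numerical sketch. The τ₁/τ₂ pairs below are the values quoted in the caption of Figure 6, but the waveforms themselves are generic normalized difference-of-exponentials stand-ins (not the paper's fitted equation 2.2), paired with a sharp illustrative BP spike; the sketch only probes the sign and relative size of the two lobes of the weight change curve.

```python
import numpy as np

def transient(t, t0, tau_slow, tau_fast):
    """Normalized difference-of-exponentials pulse starting at t0 (ms)."""
    s = np.clip(t - t0, 0.0, None)
    h = np.exp(-s / tau_slow) - np.exp(-s / tau_fast)
    return h / h.max()

def delta_rho(T, tau1, tau2, dt=0.02):
    """Weight change for one pairing, with stand-in NMDA time constants tau1/tau2."""
    t = np.arange(0.0, 600.0, dt)
    g_nmda = transient(t, 100.0, tau_slow=tau1, tau_fast=tau2)
    V = 50.0 * transient(t, 100.0 + T, tau_slow=10.0, tau_fast=1.0)  # sharp BP spike
    return np.sum(g_nmda * np.gradient(V, dt)) * dt

adult = dict(tau1=40.0, tau2=0.33)     # adult values from Figure 6
young = dict(tau1=363.1, tau2=0.033)   # "young" values, before eye opening

for label, p in [("adult", adult), ("young", young)]:
    ltp = delta_rho(+10.0, **p)        # pre-before-post lobe
    ltd = delta_rho(-10.0, **p)        # post-before-pre lobe
    print(label, ltp, ltd, abs(ltd) / ltp)
```

In this sketch, the slow "young" conductance barely decays during the pairing, so the LTP lobe shrinks while the LTD lobe keeps nearly its full size, skewing the ratio toward depression, in line with the asymmetry described for Figure 6.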
It is interesting to consider that in this way, local dendritic and spine properties would lead to different learning characteristics. For structures that are strongly electrically decoupled, the temporal structure and possible synchronization of the different inputs would be more important than the causality of pre- and postsynaptic signals. Note, however, that the electrical (de-)coupling of spines is still a matter of debate (Koch, 1999; Kovalchuk, Eilers, Lisman, & Konnerth, 2000; Sabatini, Maravall, & Svoboda, 2001). 4.2.2 Finding 2. Several theoretical articles have shown that differential Hebb rules will lead to STDP-like behavior, while plain (undifferentiated) Hebb rules will lead to temporally undirected LTP (Roberts, 1999; Xie & Seung, 2000; Roberts & Bell, 2002). Here we find that our differential Hebb rule can lead to plain LTP within rather wide correlation windows T as soon as the rising flank of our BP spike is shallow. In this case, the product of ĝ and V′ remains positive for rather large negative temporal shifts of the postsynaptic signal. This finding indicates that at this model level, it is not necessary to assume fundamentally different mechanisms for LTP and STDP.
Yet from physiology, it is known that different intracellular processes are involved in generating LTD or LTP. Furthermore, LTD is supposed to arise from low calcium influx, while LTP arises as soon as the calcium current is high (Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000). In an extended version, our model could be made compatible with these two aspects, because the duration of the depolarization (hence V′) and its temporal location with respect to the NMDA signal will determine how much calcium can flow, and this differs for different values of T. One could implement two processes that are differently susceptible to these different calcium levels and thereby extend the model accordingly. The implications of being able to change an STDP characteristic to an LTP characteristic (or back) depending on the shape of the membrane potential are interesting from a theoretical point of view. Hebbian learning is usually associated with the extraction and condensation of relevant signals (Infomax principle, principal component analysis; Oja, 1982; Linsker, 1988), while STDP addresses the causality of synaptic events. It seems that the same substrate would support both principles and, in a similar way as discussed above, it could be the location on the dendrite that determines what kind of behavior is found. Possibly distal dendrites, where the potential changes can be shallow (Magee & Johnston, 1997; Larkum et al., 2001), discounting possible active processes, would experience LTP, and proximal dendrites STDP. This, too, can be a matter of further theoretical and experimental investigation. 4.2.3 Finding 3. We found that weight growth can be linear or can contain a nonlinear transition when the positive feedback of the NMDA characteristic is dominant. This could again happen only at electrically more strongly decoupled parts of the membrane, such as spines.
There, small currents elicited by an active synapse will lead to large potential changes, which are required for this positive feedback effect. Such local potential changes cannot be measured behind the spine neck anymore because of its high resistance (Sabatini et al., 2001). In this case, the potential (change) that is the driving force of the calcium current would be strongly influenced by the local structure of the synaptic density, and correlation-based learning will take place between local inputs independent of the cell's soma (i.e., regardless of the firing of the cell). 4.3 Comparison to Other Models. A wide variety of models for STDP have been designed; they can roughly be subdivided into two groups of different biophysical complexity. Some of them are spike based and others rate based. Group 1 could be called abstract models. They assume a certain shape for a weight change curve as the learning rule (Gerstner et al., 1996; Song et al., 2000; Rubin et al., 2001; Kempter, Leibold, et al., 2001; Gerstner & Kistler, 2002) that remains unchanged across the local properties of the cell. Thus, these studies cannot discuss cellular properties but focus on network effects
instead. One interesting finding obtained with these models was that STDP leads to self-stabilization of weights and rates in the network as soon as the LTD side dominates over the LTP side in the weight change curve (Song et al., 2000; Kempter, Gerstner, et al., 2001). In addition, it was found that such networks can store patterns (Abbott & Blum, 1996; Seung, 1998; Abbott & Song, 1999; Rao & Sejnowski, 2000; Fusi, 2002). More recently, these models have also been successfully applied to generate (i.e., to develop) some physiological properties such as map structures (Song & Abbott, 2001; Kempter, Leibold, et al., 2001; Leibold & van Hemmen, 2002), direction selectivity (Buchs & Senn, 2002; Senn & Buchs, 2003), or temporal receptive fields (Leibold & van Hemmen, 2001). The biophysical realism of the learning rules used (really: weight change curves), however, must remain limited and cannot capture the wide variety of experimentally measured curves. Group 2 could be called state-variable models, to which we count our approach. Such models can adopt a rather descriptive approach (Abarbanel, Huerta, & Rabinovich, 2002), where appropriate functions are fit to the measured weight change curves. Others are closer to kinetic models in trying to fit phenomenological kinetic equations (Senn et al., 2000; Castellani et al., 2001; Karmarkar & Buonomano, 2002; Karmarkar, Najarian, & Buonomano, 2002; Shouval et al., 2002). Our approach tries to associate the functions used more closely with biophysics than that of Abarbanel et al. (2002), but, unlike some of the other models, we have not tried to fit any kinetic equations, because model complexity increases substantially when doing so. As a consequence, our model is most closely related to the study of Rao and Sejnowski (2001). These authors used a variant of the TD learning rule to implement STDP.
Dayan (2002) clarifies this issue, pointing out that the rule of Rao and Sejnowski (2001) is a temporal difference rule between output activity values, not between prediction values as in the traditional TD rule. As a consequence, their rule is strongly related to our approach, and they also observe that the shape of the BP spike will influence the weight change curve. We have, however, replaced the 10 ms discretization used by Rao and Sejnowski (2001) for calculating the temporal difference by a real derivative (using 0.1 ms steps), and in our model the presynaptic activity is modeled as a conductance. This recently allowed us to solve the integral equation for the weight change (see equation 2.10) analytically for a slightly simplified set of conductance functions (Porr, Saudargiene, & Wörgötter, 2004). Some of the other models implement a rather high degree of biophysical detail, including calcium, transmitter, and enzyme kinetics (Senn et al., 2000; Castellani et al., 2001). The power of such models lies in the chance to understand and predict intra- or subcellular mechanisms, for example, AMPA receptor phosphorylation (Castellani et al., 2001), which is known to centrally influence synaptic strength (Malenka & Nicoll, 1999; Lüscher & Frerking, 2001; Song & Huganir, 2002). The approaches of Shouval et al. (2002) as well as of Karmarkar and
620
A. Saudargiene, B. Porr, and F. Wörgötter
co-workers (Karmarkar & Buonomano, 2002; Karmarkar et al., 2002) are somewhat less detailed. Both models investigate the effects of different calcium concentration levels by assuming certain (e.g., exponential) functional characteristics to govern its changes. This allows them to address the question of how different calcium levels lead to LTD or LTP (Nishiyama et al., 2000), and one of the models (Karmarkar & Buonomano, 2002) proposes to employ two different coincidence detector mechanisms to this end. An interesting aspect of our study and that of Rao and Sejnowski (2001) is that these models require only a single coincidence detector, because with a differential Hebbian learning rule it is essentially the gradient of the Ca²⁺ concentration, and not its absolute value, that drives the learning (see Bi, 2002, for a detailed discussion of the gradient versus concentration model). Both model types (Shouval et al., 2002; Karmarkar & Buonomano, 2002; Karmarkar et al., 2002) were designed to produce a zero crossing (transition between LTD and LTP) at T = 0, which is not always the case in the measured weight change curves; these show transitions between more LTP- and more LTD-dominated shapes depending on the cell type and the stimulation protocol (Roberts & Bell, 2002). The differential Hebb rule we employed leads to the observed results because the derivative of any generic (unimodal) postsynaptic membrane signal (like a BP spike) is a bimodal curve. The relative temporal location of the presynaptic depolarization signal with respect to the positive (or negative) hump of this bimodal curve then determines whether the convolution product is positive (weight growth) or negative (weight shrinkage). The model of Shouval et al. (2002) implicitly also assumes such a differential Hebbian characteristic through the bimodal shape of their Ω-function, which they use to capture the calcium influence.
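The bimodal-curve argument can be checked numerically: convolving a presynaptic conductance pulse with the time derivative of a unimodal postsynaptic (BP-spike-like) waveform yields an STDP-like weight-change curve whose sign depends on relative spike timing. The sketch below uses generic alpha functions with illustrative time constants, not the fitted waveforms of this study:

```python
import numpy as np

dt = 0.1  # ms; fine step, matching the 0.1 ms derivative computation discussed above

def alpha(t, tau):
    """Generic unimodal waveform (alpha function): zero for t <= 0, peak 1 at t = tau."""
    return np.where(t > 0, (t / tau) * np.exp(1.0 - t / tau), 0.0)

t = np.arange(0.0, 200.0, dt)
g_pre = alpha(t, tau=5.0)          # presynaptic conductance pulse (assumed shape)
v_post = alpha(t, tau=20.0)        # unimodal postsynaptic depolarization (assumed shape)
dv_post = np.gradient(v_post, dt)  # its derivative is bimodal: positive hump, then negative hump

def weight_change(delta_T):
    """Differential Hebb: dw ~ integral of g_pre(t) * v_post'(t - delta_T) dt.
    delta_T > 0 means the postsynaptic spike follows the presynaptic one."""
    shift = int(round(delta_T / dt))
    v = np.roll(dv_post, shift)    # shift the postsynaptic derivative by delta_T
    if shift > 0:
        v[:shift] = 0.0            # zero out wrapped-around samples
    elif shift < 0:
        v[shift:] = 0.0
    return float(np.sum(g_pre * v) * dt)

# Pre-before-post overlaps the rising phase -> weight growth (LTP);
# post well before pre overlaps the falling phase -> weight shrinkage (LTD).
```

With these waveforms, `weight_change(10.0)` is positive and `weight_change(-30.0)` is negative; the zero crossing need not sit exactly at a timing difference of zero, in line with the discussion above.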
This group also discussed, among other aspects, the role of the shape of the BP spike, and they concluded that a slow afterdepolarization potential (more commonly known as repolarization) must exist in order to generate STDP. This assumption is essentially similar to that of a slow fall time of the BP spike in our study. Thus, in their study as well, the shape of the BP spike influences the shape of the weight change curve. In general, they find that the LTP part of the curve is stronger than the LTD part. This observation would prevent self-stabilization of the activity in network models (Song et al., 2000; Kempter, Gerstner, et al., 2001), which require a larger LTD part to achieve this effect. Interestingly, however, Shouval et al. (2002) find a second LTD part for larger positive values of T, which could perhaps be used to counteract such an activity amplification. In the hippocampus there is conflicting evidence as to whether such a second LTD part exists for large T (Pike, Meredith, Olding, & Paulsen, 1999; Nishiyama et al., 2000).

Acknowledgments

We acknowledge the support from SHEFC INCITE and EPSRC, GR/R74574/01. We are grateful to B. Graham, L. Smith, and D. Sterratt for their helpful comments on this manuscript.
References

Abarbanel, H. D. I., Huerta, R., & Rabinovich, M. I. (2002). Dynamical model of long-term synaptic plasticity. Proc. Natl. Acad. Sci. (USA), 99(15), 10132–10137.
Abbott, L. F., & Blum, K. I. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cereb. Cortex, 6, 406–416.
Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neurosci., 3, 1178–1183.
Abbott, L. F., & Song, S. (1999). Temporal asymmetric Hebbian learning, spike timing and neuronal response variability. In M. S. Kearns, S. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 69–75). Cambridge, MA: MIT Press.
Bennett, M. R. (2000). The concept of long term potentiation of transmission at synapses. Prog. Neurobiol., 60, 109–137.
Bi, G. Q. (2002). Spatiotemporal specificity of synaptic plasticity: Cellular rules and mechanisms. Biol. Cybern., 87, 319–332.
Bi, G.-Q., & Poo, M. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Annu. Rev. Neurosci., 24, 139–166.
Bliss, T. V., & Gardner-Medwin, A. R. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the unanaesthetized rabbit following stimulation of the perforant path. J. Physiol. (Lond.), 232, 357–374.
Bliss, T. V., & Lomo, T. (1970). Plasticity in a monosynaptic cortical pathway. J. Physiol. (Lond.), 207, 61P.
Bliss, T. V., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. (Lond.), 232, 331–356.
Buchs, N. J., & Senn, W. (2002). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Simulation results. J. Comput. Neurosci., 13, 167–186.
Castellani, G. C., Quinlan, E. M., Cooper, L. N., & Shouval, H. Z. (2001). A biophysical model of bidirectional synaptic plasticity: Dependence on AMPA and NMDA receptors. Proc. Natl. Acad. Sci. (USA), 98(22), 12772–12777.
Chechik, G., Meilijson, I., & Ruppin, E. (1998). Synaptic pruning in development: A computational account. Neural Comp., 10(7), 1759–1777.
Dayan, P. (2002). Matters temporal. Trends Cogn. Sci., 6(3), 105–106.
Debanne, D., Gähwiler, B. T., & Thompson, S. H. (1994). Asynchronous pre- and postsynaptic activity induces associative long-term depression in area CA1 of the rat hippocampus in vitro. Proc. Natl. Acad. Sci. (USA), 91, 1148–1152.
Debanne, D., Gähwiler, B., & Thompson, S. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol. (Lond.), 507, 237–247.
Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Fortune, E. S., & Rose, G. J. (2001). Short-term synaptic plasticity as a temporal filter. Trends Neurosci., 24, 381–385.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biol. Cybern., 87, 459–470.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., & Kistler, W. M. (2002). Mathematical formulations of Hebbian learning. Biol. Cybern., 87, 404–415.
Giugliano, M., Bove, M., & Grattarola, M. (1999). Fast calculation of short-term depressing synaptic conductances. Neural Comp., 11, 1413–1426.
Goldberg, J., Holthoff, K., & Yuste, R. (2002). A problem with Hebb and local spikes. Trends Neurosci., 25(9), 433–435.
Golding, N. L., Staff, P. N., & Spruston, N. (2002). Dendritic spikes as a mechanism for cooperative long-term potentiation. Nature, 418, 326–331.
Gormezano, I., Kehoe, E. J., & Marshall, B. S. (1983). Twenty years of classical conditioning research with the rabbit. In J. M. Sprague & A. N. Epstein (Eds.), Progress in psychobiology and physiological psychology (pp. 198–274). Orlando, FL: Academic Press.
Gustafsson, B., Wigström, H., Abraham, W. C., & Huang, Y.-Y. (1987). Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci., 7, 774–780.
Häusser, M., Spruston, N., & Stuart, G. J. (2000). Diversity and dynamics of dendritic signaling. Science, 11, 739–744.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological study. New York: Wiley.
Huemmeke, M., Eysel, U. T., & Mittmann, T. (2002). Metabotropic glutamate receptors mediate expression of LTP in slices of rat visual cortex. Eur. J. Neurosci., 15(10), 1641–1645.
Karmarkar, U. R., & Buonomano, D. V. (2002). A model of spike-timing dependent plasticity: One or two coincidence detectors? J. Neurophysiol., 88, 507–513.
Karmarkar, U. R., Najarian, M. T., & Buonomano, D. V. (2002). Mechanisms and significance of spike-timing dependent plasticity. Biol. Cybern., 87, 373–382.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comp., 13, 2709–2741.
Kempter, R., Leibold, C., Wagner, H., & van Hemmen, J. L. (2001). Formation of temporal-feature maps by axonal propagation of synaptic learning. Proc. Natl. Acad. Sci. (USA), 98(7), 4166–4171.
Koch, C. (1999). Biophysics of computation. New York: Oxford University Press.
Köhn, J., & Wörgötter, F. (1998). Employing the Z-transform to optimize the calculation of the synaptic conductance of NMDA and other synaptic channels in network simulations. Neural Comp., 10, 1639–1651.
Kovalchuk, Y., Eilers, J., Lisman, J., & Konnerth, A. (2000). NMDA receptor-mediated subthreshold Ca²⁺ signals in spines of hippocampal neurons. J. Neurosci., 20, 1791–1799.
Krukowski, A. E., & Miller, K. D. (2001). Thalamocortical NMDA conductances and intracortical inhibition can explain cortical temporal tuning. Nature Neurosci., 4, 424–430.
Larkum, M. E., Zhu, J. J., & Sakmann, B. (2001). Dendritic mechanisms underlying the coupling of the dendritic with the axonal action potential initiation zone of adult rat layer 5 pyramidal neurons. J. Physiol. (Lond.), 533, 447–466.
Leibold, C., & van Hemmen, J. L. (2001). Temporal receptive fields, spikes, and Hebbian delay selection. Neural Networks, 14(6–7), 805–813.
Leibold, C., & van Hemmen, J. L. (2002). Mapping time. Biol. Cybern., 87, 428–439.
Levy, W. B., & Steward, O. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neurosci., 8, 791–797.
Linden, D. J. (1999). The return of the spike: Postsynaptic action potentials and the induction of LTP and LTD. Neuron, 22, 661–666.
Linsker, R. (1988). Self-organisation in a perceptual network. Computer, 21(3), 105–117.
Lüscher, C., & Frerking, M. (2001). Restless AMPA receptors: Implications for synaptic transmission and plasticity. Trends Neurosci., 24(11), 665–670.
Mackintosh, N. J. (1974). The psychology of animal learning. Orlando, FL: Academic Press.
Mackintosh, N. J. (1983). Conditioning and associative learning. Oxford: Oxford University Press.
Magee, J. C., & Johnston, D. (1997). A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213.
Malenka, R. C., & Nicoll, R. A. (1999). Long-term potentiation—a decade of progress? Science, 285, 1870–1874.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Martinez, J. L., & Derrick, B. E. (1996). Long-term potentiation and learning. Annu. Rev. Psychol., 47, 173–203.
Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M., & Kato, K. (2000). Calcium release from internal stores regulates polarity and input specificity of synaptic modification. Nature, 408, 584–588.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3), 267–273.
Panchev, C., Wermter, S., & Chen, H. (2002). Spike-timing dependent competitive learning of integrate-and-fire neurons with active dendrites. In Lecture Notes in Computer Science, Proc. Int. Conf. Artificial Neural Networks (pp. 896–901). Berlin: Springer-Verlag.
Pike, F. G., Meredith, R. M., Olding, A. A., & Paulsen, O. (1999). Postsynaptic bursting is essential for "Hebbian" induction of associative long-term potentiation at excitatory synapses in rat hippocampus. J. Physiol. (Lond.), 518, 571–576.
Porr, B., Saudargiene, A., & Wörgötter, F. (2004). Analytical solution of spike-timing dependent plasticity based on synaptic biophysics. In Advances in neural information processing systems, 16. Cambridge, MA: MIT Press. In press.
Porr, B., & Wörgötter, F. (2003a). Isotropic sequence order learning. Neural Comp., 15, 831–864.
Porr, B., & Wörgötter, F. (2003b). Isotropic-sequence-order learning in a closed-loop behavioural system. Proc. Roy. Soc. B, 1811(361), 2225–2244.
Prokasy, W. F., Hall, J. F., & Fawcett, J. T. (1962). Adaptation, sensitization, forward and backward conditioning, and pseudo-conditioning of the GSR. Psychol. Rep., 10, 103–106.
Rao, R. P. N., & Sejnowski, T. J. (2000). Predictive sequence learning in recurrent neocortical circuits. In S. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Comp., 13, 2221–2237.
Roberts, E. B., & Ramoa, A. S. (1999). Enhanced NR2A subunit expression and decreased NMDA receptor decay time at the onset of ocular dominance plasticity in the ferret. J. Neurophysiol., 81, 2587–2591.
Roberts, P. D. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Comput. Neurosci., 7(3), 235–246.
Roberts, P. D., & Bell, C. C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biol. Cybern., 87, 392–403.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86(2), 364–367.
Sabatini, B. L., Maravall, M., & Svoboda, K. (2001). Ca²⁺ signaling in dendritic spines. Current Opinion Neurobiol., 11, 349–356.
Schiller, J., Schiller, Y., & Clapham, D. E. (1998). Amplification of calcium influx into dendritic spines during associative pre- and postsynaptic activation: The role of direct calcium influx through the NMDA receptor. Nat. Neurosci., 1, 114–118.
Schiller, J., Schiller, Y., Stuart, G., & Sakmann, B. (1997). Calcium action potentials restricted to distal apical dendrites of rat neocortical pyramidal neurons. J. Physiol., 505, 605–616.
Senn, W., & Buchs, N. J. (2003). Spike-based synaptic plasticity and the emergence of direction selective simple cells: Mathematical analysis. J. Comput. Neurosci., 14, 119–138.
Senn, W., Markram, H., & Tsodyks, M. (2000). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Comp., 13, 35–67.
Seung, H. S. (1998). Learning continuous attractors in recurrent networks. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 654–660). Cambridge, MA: MIT Press.
Shouval, H. Z., Bear, M. F., & Cooper, L. N. (2002). A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proc. Natl. Acad. Sci. (USA), 99(16), 10831–10836.
Singer, W., & Gray, C. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Song, I., & Huganir, R. L. (2002). Regulation of AMPA receptors during synaptic plasticity. Trends Neurosci., 25(11), 578–588.
Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike timing-dependent plasticity. Neuron, 32, 1–20.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neurosci., 3, 919–926.
Sourdet, V., & Debanne, D. (1999). The role of dendritic filtering in associative long-term synaptic plasticity. Learning and Memory, 6, 422–447.
Sterratt, D. C., & van Ooyen, A. (2002). Does morphology influence temporal plasticity? In J. R. Dorronsoro (Ed.), ICANN 2002 (Vol. LNCS 2415, pp. 186–191). Berlin: Springer-Verlag.
Stuart, G., Spruston, N., Sakmann, B., & Häusser, M. (1997). Action potential initiation and backpropagation in neurons of the mammalian central nervous system. Trends Neurosci., 20, 125–131.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Mach. Learn., 3, 9–44.
Sutton, R., & Barto, A. (1981). Towards a modern theory of adaptive networks: Expectation and prediction. Psychol. Review, 88, 135–170.
Teyler, T. J., & DiScenna, P. (1987). Long-term potentiation. Annu. Rev. Neurosci., 10, 131–161.
Xie, X., & Seung, S. (2000). Spike-based learning rules and stabilization of persistent neural activity. In S. A. Solla, T. K. Leen, & K. R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 199–208). Cambridge, MA: MIT Press.
Yuste, R., Majewska, A., Cash, S. S., & Denk, W. (1999). Mechanisms of calcium influx into hippocampal spines: Heterogeneity among spines, coincidence detection by NMDA receptors, and optical quantal analysis. J. Neurosci., 19, 1976–1987.

Received February 14, 2003; accepted August 20, 2003.
LETTER
Communicated by Ad Aertsen
Self-Organizing Dual Coding Based on Spike-Time-Dependent Plasticity Naoki Masuda
[email protected] Kazuyuki Aihara
[email protected] Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo, Japan
It has been a matter of debate how firing rates or spatiotemporal spike patterns carry information in the brain. Recent experimental and theoretical work has in part shown that these codes, especially a population rate code and a synchronous code, can be used dually in a single architecture. However, we are not yet able to relate the role of firing rates and synchrony to the spatiotemporal structure of inputs and the architecture of neural networks. In this article, we examine how feedforward neural networks encode multiple input sources in the firing patterns. We apply spike-time-dependent plasticity as a fundamental mechanism to yield synaptic competition and the associated input filtering. We use the Fokker-Planck formalism to analyze the mechanism for synaptic competition in the case of multiple inputs, which underlies the formation of functional clusters in downstream layers in a self-organizing manner. Depending on the types of feedback coupling and shared connectivity, clusters are independently engaged in population rate coding or synchronous coding, or they interact to serve as input filters. Classes of dual codings and functional roles of spike-time-dependent plasticity are also discussed.

1 Introduction

Neural coding is a subject of intense debate. Classically, firing rates of neurons play major roles in, for example, encoding amplitudes of sensory stimuli and delivering instructions to motor neurons. However, the need for temporal averaging is recognized as a severe drawback of single-neuron rate codes, and recent developments in recording techniques have enabled detection of synchronous firing, correlated firing, and their dynamical properties with high temporal resolution (Gray, König, Engel, & Singer, 1989; Aertsen, Gerstein, Habib, & Palm, 1989; Aertsen, Erb, & Palm, 1994; Vaadia et al., 1995; de Oliveira, Thiele, & Hoffmann, 1997; Salinas & Sejnowski, 2001).
These experimental results are suggestive of important functions provided by spatiotemporal spike codings based on individual spike timing and synchrony, which are beyond mere rate codings (von der Malsburg & Schneider, 1986; Abeles, 1991; Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996; Gerstner, Kempter, van Hemmen, & Wagner, 1996; Masuda & Aihara, 2002a). However, the limitations of the rate codes mentioned above can be overcome if populations of neurons are pooled (Shadlen & Newsome, 1998; Gerstner, 2000). To address the problem of which code is dominantly used, several articles consider the possibility that temporal spike coding in the sense of synchrony and population rate coding can be used dually in a single neural network (Tsodyks, Uziel, & Markram, 2000; Masuda & Aihara, 2002b, 2003; van Rossum, Turrigiano, & Nelson, 2002). Population rates based on asynchronously firing neurons encode external inputs (see Figure 1A) relatively accurately, as schematically shown in Figures 1B and 1D. Figures 1C and 1E illustrate that synchronous firing or its interval encodes information on instantaneous input amplitudes with a lower resolution, but it is efficient in the sense that postsynaptic firing occurs more easily with synchronous inputs than with asynchronous inputs (Abeles, 1991; Salinas & Sejnowski, 2000, 2001; Stroeve & Gielen, 2001; Masuda & Aihara, 2002b; Kuhn, Rotter, & Aertsen, 2002; Kuhn, Aertsen, & Rotter, 2003). Labeled groups of neurons identified by synchrony are also believed to encode object information (von der Malsburg & Schneider, 1986; Abeles, 1991; Diesmann, Gewaltig, & Aertsen, 1999). These codings can be dynamically switched by modulating, for example, the noise strength (Masuda & Aihara, 2002b; van Rossum et al., 2002), parameters specifying single-neuron properties or network structure (Masuda & Aihara, 2003), or the short-term dynamics of neurotransmitter release (Tsodyks et al., 2000). Experimental evidence shows that such switching may occur as a result of changes in situations during a task or changes in environments (de Oliveira et al., 1997; Riehle, Grün, Diesmann, & Aertsen, 1997). However, theoretical analysis of dual coding has considered only global synchrony in homogeneous networks, which abandons embedding the spatial structure of inputs in network states and therefore can encode at most one input source. In reality, multiple neuronal assemblies formed via synchronous firing (von der Malsburg & Schneider, 1986; Sompolinsky, Golomb, & Kleinfeld, 1990; Abeles, 1991; Diesmann et al., 1999) or correlated firing (Aertsen et al., 1989, 1994; Vaadia et al., 1995; Fujii et al., 1996; Salinas & Sejnowski, 2001) may coexist in the brain, facilitating the segregation of distinct entities. This framework may be a key to answering the binding problem and the superposition catastrophe (von der Malsburg & Schneider, 1986; Fujii et al., 1996; Salinas & Sejnowski, 2001). The presumed global synchrony should actually be replaced by synchrony constrained to a subgroup of neurons. It is also possible that these clusters of synchronously firing neurons interact to engage in further information processing. Clusters are induced by mechanisms such as synaptic learning (Aertsen et al., 1994; Horn, Levy, Meilijson, & Ruppin, 2000; Levy, Horn, Meilijson, & Ruppin, 2001), tuned delays (Fujii et al., 1996), specific coupling structure (Sompolinsky et al., 1990), and local excitatory interaction accompanied by global inhibition (von der Malsburg & Schneider, 1986).

Neural Computation 16, 627–663 (2004)
© 2004 Massachusetts Institute of Technology

Figure 1: Schematic diagram showing dual coding. A neural population, consisting of five neurons in this example, responds asynchronously (B, D) or synchronously (C, E) to an identical external stimulus (A). The temporal waveform of the stimulus is better approximated by population firing rates (D) obtained on the basis of asynchronous spike trains (B) than by population firing rates (E) obtained on the basis of synchronous spike trains (C).
In this letter, we discuss the emergence of clusters and the functional role of clusters with regard to synaptic learning and feedback coupling, especially from the viewpoint of duality between rate coding and temporal coding. A recent distinguished experimental finding associated with learning and cluster formation is spike-time-dependent plasticity (STDP), in which synaptic weights are updated depending on the relative fine timing between presynaptic and postsynaptic spikes (see Abbott & Nelson, 2000, for review). Multiple recordings of pyramidal neurons in layer 5 of the rat neocortex (Markram, Lübke, Frotscher, & Sakmann, 1997), tectal neurons of Xenopus tadpoles (Zhang, Tao, Holt, Harris, & Poo, 1998), and the CA1 region of the rat hippocampus (Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000) revealed that synapses are potentiated when a presynaptic spike arrives about 20 ms or less before a postsynaptic spike. When the order of firing is reversed, the synapse is depressed (Markram et al., 1997; Zhang et al., 1998). The STDP learning, which extends standard Hebbian learning, implies that spike timing and temporal codes are important for brain function. Model studies have shown possible functions of STDP learning. In feedback neural networks with appropriately tuned delays, STDP generally promotes self-organization of the coupling structure resulting in clustered firing. The clustered firing serves as a mechanism for memory retrieval (Levy, Horn, Meilijson, & Ruppin, 2001), realization of synfire chains (Horn et al., 2000), and recovery of cognitive maps (Song & Abbott, 2001). Moreover, numerous studies show a fundamental consequence of the STDP learning in feedforward architecture: synaptic competition, in which only the feedforward synapses representing more synchronous, precise, or early inputs are likely to survive (Song, Miller, & Abbott, 2000; Song & Abbott, 2001). Synaptic competition is a winner-take-all sort of input filtering, and it is useful for explaining high coefficients of variation of interspike intervals (Abbott & Song, 1999; Song et al., 2000; Câteau & Fukai, 2003), coincidence detection (Gerstner et al., 1996; Kistler & van Hemmen, 2000), stabilization of firing rates (Kempter, Gerstner, & van Hemmen, 2001), difference learning (Rao & Sejnowski, 2001), column formation (Song & Abbott, 2001), development and recovery of cortical maps (Song & Abbott, 2001), and others (Abbott & Nelson, 2000; Gerstner & Kistler, 2002b). There are two main approaches to understanding synaptic competition mathematically. Methods based on dynamical systems for mean synaptic weights are successful in explaining synaptic competition in which feedforward synapses linking to a common downstream neuron are strengthened only when the inputs are synchronous (Kempter, Gerstner, & van Hemmen, 1999; Kempter et al., 2001; Kistler & van Hemmen, 2000; Gerstner & Kistler, 2002a, 2002b; Gütig, Aharonov, Rotter, & Sompolinsky, 2003). Although mean dynamics are easier to derive, another approach, based on the Fokker-Planck equations, supplies more detailed information, such as the distribution functions of synaptic weights (van Rossum, Bi, & Turrigiano, 2000; Rubin, Lee, & Sompolinsky, 2001; Câteau & Fukai, 2003).
For example, the bimodal distribution of synaptic weights obtained by numerical simulations (Song et al., 2000; Song & Abbott, 2001) is explained quasianalytically. However, these methods largely ignore inputs with temporal and spatial structure, which are related to the synchrony presumably important for brain function. Exceptions are Rubin et al. (2001) and Gütig et al. (2003), which describe synaptic dynamics for temporally structured inputs. In this letter, we first analyze the formation of clusters via STDP when multiple inputs are presented. In section 2, we explain the feedforward neural networks used in this article. In section 3, using the Fokker-Planck equations and their reduction to effective deterministic dynamics, we examine how synaptic competition occurs, especially when multiple inputs with equal or different degrees of synchrony are presented to upstream neurons. The results provide the basis of cluster formation in the downstream layers. Then in section 4, we numerically investigate what kinds of information coding in terms of synchronous, clustered, and asynchronous states can be induced in networks. In particular, shared connectivity and feedback coupling within the downstream layers are modulated so that coding schemes switch. Consequently, clustered synchrony, which is related to the functional cell assemblies (Aertsen et al., 1989, 1994; Fujii et al., 1996), is featured. Section 5 is devoted to discussions of various dual codings and dynamical states, including those presented in section 4, and of the functional relation between learning and synchrony.
2 Model

The model neural network is shown schematically in Figure 2. The upstream layer, which is called the sensory layer because it receives external inputs directly, contains n_1 sensory neurons. The firing time of the ith sensory neuron is determined by an inhomogeneous Poisson process with rate function ν_i(t) given as the external input. Accordingly, the probability that the neuron fires once in [t, t + Δt] is ν_i(t)Δt as Δt → 0, regardless of the history of firing. The output spike train reflects possible deterministic information on ν_i(t) and stochasticity generated by, for example, the ongoing background activity (Arieli, Sterkin, Grinvald, & Aertsen, 1996; Shadlen & Newsome, 1998; Brunel, 2000; Gerstner, 2000). The downstream layer, or cortical layer, comprises n_2 cortical neurons that process spike inputs from the sensory layer. Cortical neurons are assumed to be leaky integrate-and-fire (LIF) neurons with membrane leak rate γ and a firing threshold equal to 1. The membrane potential of the ith cortical neuron is denoted by v_i(t). If the jth cortical neuron fires its kth spike
Figure 2: Architecture of the feedforward neural network. LIF = leaky integrate-and-fire neurons.
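The sensory-layer spike generation described above (firing probability ν_i(t)Δt per time bin, independent of firing history) can be sketched as per-bin Bernoulli sampling, a standard discretization of an inhomogeneous Poisson process. The sinusoidal rate function below is an arbitrary illustrative choice, not one used in the letter:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility
dt = 0.1e-3                      # time step in seconds (0.1 ms)
T = 1.0                          # total duration in seconds
t = np.arange(0.0, T, dt)

def rate(t):
    """Illustrative rate function nu_i(t) in Hz: a 10 Hz sinusoid around 40 Hz."""
    return 40.0 + 30.0 * np.sin(2 * np.pi * 10.0 * t)

# P(spike in [t, t + dt]) = nu(t) * dt, regardless of the history of firing
spikes = rng.random(t.size) < rate(t) * dt
spike_times = t[spikes]

# The empirical rate should be near the time average of nu(t), i.e. about 40 Hz
mean_rate = spikes.sum() / T
```

The approximation is valid when ν(t)·dt is much less than 1 in every bin, so that at most one spike per bin is plausible.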
0 at t D Tj;k , it sends a spike after synaptic delay ¿ D 0:3 ms to the other cortical neurons. The time course of the synaptic current in the ith neuron 0 receiving the spike is written as ²i;j g.t ¡ Tj;k ¡ ¿ /, where ²i;j is the feedback synaptic weight from the jth neuron to the ith neuron. Without losing generality, g.t/ satises g.t/ D 0 (t < 0), g.t/ ¸ 0, and limt!1 g.t/ D 0. After ring, the neuron is reset to the resting state vi D 0. In section 4, we consider three cases of feedback coupling: (1) when ²i;j is turned off, (2) when ²i;j is homogeneous and constant in time, and (3) when ²i;j changes dynamically as a result of learning; all of these cases yield different coding schemes. Generally, stronger coupling tends to induce synchronous ring, whereas weaker coupling leads to asynchronous ring that realizes efcient population rate coding (Tsodyks et al., 2000; Masuda & Aihara, 2003). Furthermore, this tendency is quite common to both excitatory and inhibitory coupling, although eventual ring rates become different (Brunel, 2000; Gerstner & Kistler, 2002b). Therefore, we assume that ²i;j ¸ 0 to focus on the dynamics of the STDP learning. The ith cortical neuron also receives feedforward input with time course wi;j g.t¡Tj;k / from the jth sensory neuron, which emits its kth spike at t D Tj;k . The delay for feedforward synapses is assumed to be absent or, equivalently, uniformly constant. The feedforward synaptic weight wi;j is assumed to change via STDP. The coupling strength wi;j from the jth sensory neuron to the ith cortical neuron is changed by ( AC exp .¡±t=¿ 0 / ; ±t > 0; (2.1) G.±t/ D ¡A¡ exp .±t=¿ 0 / ; ±t · 0;
where $A_+$ and $A_-$ are the sizes of the synaptic modification by a single STDP event, and $\delta t \equiv t_{\rm post} - t_{\rm pre}$ is the time difference between a postsynaptic spike instant $t_{\rm post}$ and a presynaptic spike instant $t_{\rm pre}$. The shape of the learning window is shown in Figure 3A. We confine $w$ to $[0, w_{\max}]$ by resetting it to $w_{\max}$ (or 0) whenever $w$ goes beyond $w_{\max}$ (or below 0). Excluding inhibitory feedforward synapses is justified by the fact that the weights of inhibitory synapses decay to small values as a result of the competitive dynamics, as we will see in section 3. We assume $A_+ < A_-$ to prohibit the explosion of synaptic weights and to induce synaptic competition (Kempter et al., 1999, 2001; Song et al., 2000; Song & Abbott, 2001; Gütig et al., 2003). For simplicity, we consider STDP only due to the presynaptic spike temporally closest to each postsynaptic spike (Kistler & van Hemmen, 2000; van Rossum et al., 2000; Câteau & Fukai, 2003). To summarize, the dynamics of the $i$th cortical neuron between two spikes at $t = T'_{i,k}$ and $t = T'_{i,k+1}$ is represented by
$$\frac{dv_i}{dt} = \sum_{j=1}^{n_1} \sum_k w_{i,j}\, g(t - T_{j,k}) + \sum_{j=1}^{n_2} \sum_k \epsilon_{i,j}\, g(t - T'_{j,k} - \tau) - \gamma v_i(t), \qquad v_i(T'_{i,k}) = 0. \qquad (2.2)$$
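To make the model concrete, here is a minimal single-neuron sketch of equations 2.1 and 2.2 (illustrative code, not the authors' implementation): one cortical LIF neuron, no feedback coupling, an exponential synaptic kernel as in equation 3.13 below, nearest-spike STDP, and homogeneous Poisson sensory input; the constants follow values quoted in sections 3 and 4 where available, and are otherwise made up.

```python
import numpy as np

rng = np.random.default_rng(1)

dt, t_end = 0.1, 200.0                  # time step and duration (ms)
n1, nu = 50, 0.05                       # sensory neurons; Poisson rate per ms (50 Hz)
gamma, tau_g, tau_0 = 0.05, 5.0, 20.0   # leak rate (1/ms); synaptic and STDP constants (ms)
A_plus, A_minus, w_max = 0.00375, 0.004, 0.07

w = np.full(n1, 0.5 * w_max)            # feedforward weights, confined to [0, w_max]
s = np.zeros(n1)                        # filtered presynaptic traces, g(t) = exp(-t/tau_g)
v, last_post = 0.0, -1e9
last_pre = np.full(n1, -1e9)            # latest presynaptic spike time per synapse
post_times = []

for step in range(int(t_end / dt)):
    t = step * dt
    pre = rng.random(n1) < nu * dt                       # Poisson sensory spikes
    w[pre] -= A_minus * np.exp((last_post - t) / tau_0)  # depression branch of eq. 2.1
    last_pre[pre] = t
    s = s * np.exp(-dt / tau_g) + pre                    # synaptic current traces
    v += dt * (w @ s - gamma * v)                        # eq. 2.2 without feedback
    if v >= 1.0:                                         # threshold 1: fire and reset
        v, last_post = 0.0, t
        post_times.append(t)
        w += A_plus * np.exp(-(t - last_pre) / tau_0)    # potentiation branch of eq. 2.1
    np.clip(w, 0.0, w_max, out=w)                        # hard bounds on the weights

print(len(post_times), round(w.mean(), 4))
```

With this drive the neuron fires repeatedly and the weights stay inside $[0, w_{\max}]$, which is all the sketch is meant to show.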
Self-Organizing Dual Coding
Figure 3: Asymmetric (A) and symmetric (B) learning windows used for the STDP learning. Single updating magnitudes are normalized by the maximum synaptic weights.
3 Analysis of Synaptic Competition with Multiple External Inputs

3.1 Fokker-Planck Formalism for Inhomogeneous Inputs. In this section, we introduce the Fokker-Planck equations to examine how the feedforward synapses become organized through the STDP learning. The asymptotic distributions of synaptic weights have been evaluated for spatially uncorrelated inputs (van Rossum et al., 2002; Câteau & Fukai, 2003) and for a prescribed input-output correlation (Rubin et al., 2001). On the other hand, the powerful analysis based on dynamical system theory (Kempter et al., 1999, 2001) has limitations in the vicinity of the boundaries, where the diffusive term plays an important role, as we will see below. Our particular interest is to understand the dynamics and the asymptotics of the synaptic weights when inhomogeneous external inputs are applied. The Fokker-Planck equation for $p(w,t)$, the probability distribution function of the synaptic weight $w$ at time $t$, is written as
$$\frac{\partial p(w,t)}{\partial t} = -\frac{\partial}{\partial w}\bigl(A(w)p(w,t)\bigr) + \frac{1}{2}\frac{\partial^2}{\partial w^2}\bigl(B(w)p(w,t)\bigr) \qquad (3.1)$$
with the following reflective boundary conditions:
$$J(w,t) \equiv A(w)p(w,t) - \frac{1}{2}\frac{\partial}{\partial w}\bigl(B(w)p(w,t)\bigr) = 0 \qquad (3.2)$$
at $w = 0$ and $w = w_{\max}$. These conditions guarantee in a natural way that the density function $p(w,t)$ does not extend beyond the boundaries. The drift
term $A(w)$ and the diffusion term $B(w)$ satisfy
$$A(w) = \int_{-\infty}^{\infty} d\delta t\, G(\delta t)\, T(w, \delta t), \qquad (3.3)$$
$$B(w) = \int_{-\infty}^{\infty} d\delta t\, G^2(\delta t)\, T(w, \delta t), \qquad (3.4)$$
where $T(w, \delta t)$ is the rate at which the event that the presynaptic spike time is $t_{\rm post} - \delta t$ occurs, given $w$ and a postsynaptic spike at $t = t_{\rm post}$ (Câteau & Fukai, 2003). Given that $w$ is confined to $[0, w_{\max}]$, we set $\partial p(w,t)/\partial t = 0$ in equation 3.1 to obtain the asymptotic distribution of $w$ as follows:
$$p(w) = N \exp\left( \int^{w} dw'\, \frac{2A(w') - \frac{\partial B(w')}{\partial w'}}{B(w')} \right), \qquad 0 \leq w \leq w_{\max}, \qquad (3.5)$$
where N is the normalization constant. We now derive the drift term for inhomogeneous inputs ºj .t/. The method for calculating actual distributions is explained in section 3.2 for the specic case of two input sources. To calculate the drift term, we ignore the membrane leak in equation 2.2 (van Rossum et al., 2000; Rubin et al., 2001). The initial membrane potentials are also assumed to be distributed uniformly in [0; 1]. With these assumptions, the distribution of membrane potentials is uniform in [0; 1] at the instant of a presynaptic spike, and the ring rate of a postsynaptic neuron at t D tpost conditioned by the presynaptic ring time and the synaptic weight is equal to the rate of increase in the membrane potential. Therefore, using equation 2.2, the conditioned ring rate is given by post
ºi
Z .wi;j ; tpost jtpre/ D wi;j g.tpost ¡ tpre / C Z C
tpost ¡¿ ¡1
¡1
n2 X
dt0
tpost
i0 D1;i0 6Di
dt0
n1 X j0 D1
post
²i;i0 ºi0
wi;j0 ºj0 .t0 /g.tpost ¡ t0 /
.t0 /g.tpost ¡ t0 ¡ ¿ /:
(3.6)
It follows from Bayes' theorem that
$$\nu_j(t_{\rm pre})\, \nu_i^{\rm post}(w_{i,j}, t_{\rm post} \mid t_{\rm pre}) = \nu_i^{\rm post}(t_{\rm post})\, T(w_{i,j}, t_{\rm post} - t_{\rm pre} \mid t_{\rm post}), \qquad (3.7)$$
where $\nu_i^{\rm post}(t_{\rm post})$ denotes the firing rate of the $i$th cortical neuron, and $T(w_{i,j}, \cdot \mid t_{\rm post})$ is the $T(w_{i,j}, \cdot)$ given $t_{\rm post}$. Substitution of $t_{\rm pre} = t_{\rm post} - \delta t$ and
equation 3.6 into equation 3.7 yields
$$T(w_{i,j}, \delta t \mid t_{\rm post}) = \frac{\nu_j(t_{\rm post} - \delta t)}{\nu_i^{\rm post}(t_{\rm post})} \Biggl\{ w_{i,j}\, g(\delta t) + \int_{-\infty}^{t_{\rm post}} dt' \sum_{j'=1}^{n_1} w_{i,j'}\, \nu_{j'}(t')\, g(t_{\rm post} - t') + \int_{-\infty}^{t_{\rm post}-\tau} dt' \sum_{i'=1,\, i' \neq i}^{n_2} \epsilon_{i,i'}\, \nu_{i'}^{\rm post}(t')\, g(t_{\rm post} - t' - \tau) \Biggr\}. \qquad (3.8)$$
Equation 3.8 guarantees that the effect of single spikes, represented by the first term, appears only for $\delta t > 0$, reflecting the causality of the time course of the synaptic current ($g(t) = 0$, $t < 0$). However, the second term of equation 3.8 shows that $T(w_{i,j}, \delta t \mid t_{\rm post})$ for $\delta t < 0$ is more complicated than a simple multiplication of the presynaptic and postsynaptic firing rates (Abbott & Song, 1999; van Rossum et al., 2000; Câteau & Fukai, 2003) because of possible correlation between external inputs (Kempter et al., 1999, 2001; Gerstner, 2001; Gerstner & Kistler, 2002a; Gütig et al., 2003). Because the STDP learning occurs on long timescales, $T(w_{i,j}, \delta t)$ is averaged over $t_{\rm post}$, which is proportional to
$$\langle \nu(\cdot) \rangle_t\, w_{i,j}\, g(\delta t) + \sum_{j'=1}^{n_1} w_{i,j'} \int_0^{\infty} dt'\, g(t')\, \langle \nu_j(\cdot)\, \nu_{j'}(\cdot + \delta t - t') \rangle_t + \sum_{i'=1,\, i' \neq i}^{n_2} \epsilon_{i,i'} \int_0^{\infty} dt'\, g(t')\, \langle \nu_j(\cdot)\, \nu_{i'}^{\rm post}(\cdot + \delta t - t' - \tau) \rangle_t, \qquad (3.9)$$
where $\langle \cdot \rangle_t$ denotes time averaging. Moreover, we assume statistical homogeneity of inputs: $\langle \nu_j(\cdot) \rangle_t = \langle \nu(\cdot) \rangle_t$ $(1 \leq j \leq n_1)$. Since a rescaling of $T(w_{i,j}, \delta t)$ merely changes the timescale of equation 3.1, we substitute equation 3.9 into equation 3.3 to obtain
$$\begin{aligned} A(w_{i,j}) ={}& \langle \nu(\cdot) \rangle_t\, w_{i,j} \int_0^{\infty} dt\, g(t)\, G(t) \\ &+ \langle \nu(\cdot) \rangle_t^2 \sum_{j'=1}^{n_1} w_{i,j'} \int_{-\infty}^{\infty} dt\, G(t) \int_0^{\infty} dt'\, g(t')\, C_{j,j'}(t - t') \\ &+ \langle \nu(\cdot) \rangle_t^2 \sum_{j'=1}^{n_1} w_{i,j'} \int_{-\infty}^{\infty} dt\, G(t) \int_0^{\infty} dt'\, g(t') \\ &+ \sum_{i'=1,\, i' \neq i}^{n_2} \epsilon_{i,i'} \int_{-\infty}^{\infty} dt\, G(t) \int_0^{\infty} dt'\, g(t') \times \langle \nu_j(\cdot)\, \nu_{i'}^{\rm post}(\cdot + t - t' - \tau) \rangle_t, \end{aligned} \qquad (3.10)$$
where
$$C_{j,j'}(t) = \frac{\langle \nu_j(\cdot)\, \nu_{j'}(\cdot + t) \rangle_t - \langle \nu(\cdot) \rangle_t^2}{\langle \nu(\cdot) \rangle_t^2} \qquad (3.11)$$
is the time correlation between $\nu_j(t)$ and $\nu_{j'}(t)$. The diffusion term is derived in a similar manner by combining equations 3.4 and 3.9. Feedforward synapses with larger $A(w_{i,j})$ are more likely to be potentiated, whereas synapses with smaller $A(w_{i,j})$ are suppressed. Three factors determine the drift term, in agreement with the results derived by mean-field analysis (Kempter et al., 1999, 2001; Song et al., 2000; Gerstner & Kistler, 2002a). In equation 3.10, the first term represents the single-spike effect, the second term expresses the effect of synchronous inputs, and the third term uniformly depresses the synaptic weights for normalization because $\sum_{j'=1}^{n_1} w_{i,j'} > 0$, $\int_{-\infty}^{\infty} dt\, G(t) < 0$, and $\int_0^{\infty} dt'\, g(t') > 0$. The existence of the second term is consistent with numerical results (Song & Abbott, 2001) and with the fact that inducing potentiation is more effective with synchronous inputs than with asynchronous inputs (Gerstner et al., 1996; Abbott & Song, 1999; Kempter et al., 1999).

3.2 Dynamics of Synaptic Competition. For a deeper understanding of the dynamical aspects of synaptic competition, let us prepare two temporally changing external inputs, each of which covers half of the sensory layer exclusively (Gütig et al., 2003). We set
$$C_{j,j'}(t) = \begin{cases} c_a \exp(-|t|/\tau_c), & 1 \leq j, j' \leq \frac{n_1}{2}, \\ c_b \exp(-|t|/\tau_c), & \frac{n_1}{2} + 1 \leq j, j' \leq n_1, \\ 0, & \text{otherwise}, \end{cases} \qquad (3.12)$$
where $c_a > 0$ and $c_b > 0$ are the strengths of the intragroup correlation. This is the correlation function for colored noise. However, the results that follow are not sensitive to the choice of $C_{j,j'}(t)$ as long as $C_{j,j'}(t)$ is symmetric, has a positive maximum at $t = 0$, and decays to zero as $t \to \pm\infty$. Furthermore, we assume the following time course of the synaptic current:
$$g(t) = \begin{cases} \exp(-t/\tau_g), & t \geq 0, \\ 0, & t < 0. \end{cases} \qquad (3.13)$$
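The correlation function of equations 3.11 and 3.12 is easy to estimate from discretized rate traces. The sketch below is illustrative only (synthetic sinusoidal rates rather than the paper's inputs, and a circular shift for the time average):

```python
import numpy as np

N = 1000
t = np.arange(N)
nu_j = 10.0 + np.sin(2 * np.pi * 10 * t / N)   # synthetic rate trace, mean 10
nu_jp = nu_j.copy()                            # a perfectly correlated partner
mean_rate = nu_j.mean()

def C(lag):
    """Equation 3.11 with time averages taken over the whole trace."""
    prod = (nu_j * np.roll(nu_jp, -lag)).mean()
    return (prod - mean_rate**2) / mean_rate**2

print(C(0))    # variance over squared mean: 0.5 / 100 = 0.005 here
print(C(25))   # quarter-period shift of the sinusoid: 0
```

The estimate is symmetric, peaks at zero lag, and decays away from it, which is all that the analysis above requires of $C_{j,j'}(t)$.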
The second term of equation 3.10 becomes positive because of the cooperation of $g(t)$, which implements biological time courses of the synaptic current such as the one in equation 3.13, and synchronous inputs. On average, $\int_0^{\infty} dt'\, C_{j,j'}(t - t')\, g(t')$ is larger for positive $t$, for which the synapses are potentiated, than for negative $t$. This feature is missed if overly simple time courses of the synaptic current, such as delta-shaped spikes
(van Rossum et al., 2000; Câteau & Fukai, 2003), are used. In the numerical simulations below, the second term is not large enough for the first term to be ignored, as it has been in other work (Kempter et al., 1999). The interplay of the effects of single-neuron firing and those of synchronous inputs induces competitive dynamics. We denote by $w_{i,a}$ and $w_{i,b}$ the representative synaptic weights corresponding to $w_{i,1}, w_{i,2}, \ldots, w_{i,n_1/2}$ and $w_{i,n_1/2+1}, \ldots, w_{i,n_1}$, respectively. To describe the dynamics of these mean weights on the basis of the Fokker-Planck formalism, we ignore the feedback coupling in equation 2.2 for the moment by setting $\epsilon_{i,i'} = 0$, $1 \leq i, i' \leq n_2$. Since it suffices to consider just one cortical neuron in this case, we omit the subscript of the cortical neuron from $w_{i,a}$ and $w_{i,b}$. By substituting equations 2.1, 3.12, and 3.13 into equation 3.10, we obtain
$$A(w_a) = A_1 w_a + A_2 c_a \langle w_a \rangle - A_3 (\langle w_a \rangle + \langle w_b \rangle) \equiv A(w_a, c_a, \langle w_a \rangle, \langle w_b \rangle), \qquad (3.14)$$
$$A(w_b) = A(w_b, c_b, \langle w_b \rangle, \langle w_a \rangle), \qquad (3.15)$$
$$\begin{aligned} B(w_a) ={}& \langle \nu(\cdot) \rangle_t\, w_a\, \frac{\tau_0 \tau_g A_+^2}{\tau_0 + 2\tau_g} \\ &+ \langle \nu(\cdot) \rangle_t^2\, \frac{n_1}{2}\, \langle w_a \rangle\, c_a \left( A_-^2\, \frac{\tau_c^2 \tau_g \tau_0}{(\tau_c + \tau_g)(2\tau_c + \tau_0)} + \left( \frac{2\tau_c \tau_g^3 \tau_0}{(\tau_g^2 - \tau_c^2)(\tau_0 + 2\tau_g)} - \frac{\tau_c^2 \tau_g \tau_0}{(\tau_g - \tau_c)(\tau_0 + 2\tau_c)} \right) A_+^2 \right) \\ &+ \langle \nu(\cdot) \rangle_t^2\, \frac{n_1}{2}\, (\langle w_a \rangle + \langle w_b \rangle)(A_-^2 + A_+^2)\, \frac{\tau_0 \tau_g}{2} \end{aligned} \qquad (3.16)$$
$$\equiv B(w_a, c_a, \langle w_a \rangle, \langle w_b \rangle), \qquad (3.17)$$
$$B(w_b) = B(w_b, c_b, \langle w_b \rangle, \langle w_a \rangle), \qquad (3.18)$$
where $\langle w_a \rangle$ and $\langle w_b \rangle$ denote the ensemble averages with respect to the corresponding probability distribution functions, and
$$A_1 \equiv \langle \nu(\cdot) \rangle_t\, \frac{\tau_0 \tau_g}{\tau_0 + \tau_g}\, A_+, \qquad (3.19)$$
$$A_2 \equiv \langle \nu(\cdot) \rangle_t^2\, \frac{n_1}{2} \left( A_-\, \frac{\tau_c^2 \tau_g \tau_0}{(\tau_c + \tau_g)(\tau_c + \tau_0)} - \left( \frac{2\tau_c \tau_g^3 \tau_0}{(\tau_g^2 - \tau_c^2)(\tau_0 + \tau_g)} - \frac{\tau_c^2 \tau_g \tau_0}{(\tau_g - \tau_c)(\tau_0 + \tau_c)} \right) A_+ \right), \qquad (3.20)$$
$$A_3 \equiv \langle \nu(\cdot) \rangle_t^2\, \frac{n_1}{2}\, (A_- - A_+)\, \tau_0 \tau_g. \qquad (3.21)$$
Distributions of $w_a$ (or $w_b$) are obtained by substituting equations 3.14 and 3.17 (or equations 3.15 and 3.18) into equation 3.5. The obtained distributions are in turn used for calculating $\langle w_a \rangle$ and $\langle w_b \rangle$, which are again used in equations 3.14, 3.15, 3.17, and 3.18. The asymptotic distributions $p_a(w_a)$ and $p_b(w_b)$ are obtained by repeating this procedure until the two distributions converge. The results up to this point would also hold for inhibitory feedforward synapses. However, equation 3.14 indicates that the drift terms for inhibitory synapses are generally smaller than those for excitatory synapses. Consequently, the inhibitory synapses are unlikely to survive in the end. For this reason, only the excitatory synapses are considered in our model. We then numerically calculate the dynamics of $\langle w_a \rangle$ and $\langle w_b \rangle$ with $\tau_g = 5$ ms, $\tau_c = 14$ ms, $\tau_0 = 20$ ms, $A_+ = 0.00375$, $A_- = 0.004$, and $w_{\max} = 0.07$. To inspect the dynamical aspects closely, we change $\langle w_a \rangle$ and $\langle w_b \rangle$ only by a small amount (5% at most) in each time step. The results for two input sources with different degrees of correlation, $c_a = 0.35$ and $c_b = 0.25$, are shown in Figure 4A. Synaptic competition is observed, and as expected, the ultimate distribution in which $\langle w_a \rangle$ wins has the larger basin of attraction with respect to the initial condition (Kempter et al., 1999). On the other hand, the results for $c_a = c_b = 0.25$ shown in Figure 4B indicate that the two inputs are equally likely to win (Song & Abbott, 2001), with the separatrix being the straight line $\langle w_a \rangle = \langle w_b \rangle$. If we start from random initial distributions of synaptic weights, half of the cortical neurons are devoted to encoding $\nu_a(t)$, whereas the other half encode $\nu_b(t)$. When the initial condition is close to the separatrix, the final distribution depends on the fluctuation expressed by equation 3.4 as well as on the initial condition.
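The self-consistent procedure just described can be sketched in a few lines. The snippet below is a toy illustration, not the paper's computation: the coefficients A1, A2, A3, B1, B3 are made up, the diffusion term is a linear stand-in for equation 3.16 kept identical for both groups, and only the structure of equations 3.5 and 3.14 with a damped (5%) update is reproduced.

```python
import numpy as np

# Made-up coefficients with the signs and structure of eqs. 3.14 and 3.16.
A1, A2, A3 = 50.0, 40.0, 30.0
B1, B3 = 0.05, 0.1
w_max, ca, cb = 0.07, 0.35, 0.25

w = np.linspace(0.0, w_max, 2001)
dw = w[1] - w[0]

def stationary_density(c, m_self, m_other):
    """Equation 3.5: p(w) = N exp( int_0^w dw' (2A - dB/dw') / B )."""
    A = A1 * w + A2 * c * m_self - A3 * (m_self + m_other)  # eq. 3.14
    B = B1 * w + B3 * (m_self + m_other)                    # linear stand-in for eq. 3.16
    integrand = (2.0 * A - B1) / B
    log_p = np.concatenate(
        ([0.0], np.cumsum(0.5 * (integrand[1:] + integrand[:-1]) * dw)))
    p = np.exp(log_p - log_p.max())                         # stabilize before normalizing
    return p / (p.sum() * dw)

m_a = m_b = 0.5 * w_max                                     # symmetric start
for _ in range(400):                                        # damped self-consistent loop
    p_a = stationary_density(ca, m_a, m_b)
    p_b = stationary_density(cb, m_b, m_a)
    m_a += 0.05 * ((p_a * w).sum() * dw - m_a)
    m_b += 0.05 * ((p_b * w).sum() * dw - m_b)

print(m_a, m_b)
```

With $c_a > c_b$, the iteration breaks the symmetry of the initial condition, and the more strongly correlated group retains the larger mean weight.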
This result coincides with the direct extension of the results based on dynamical systems (Kempter et al., 1999, 2001; Gerstner & Kistler, 2002a). In those articles, the cortical or downstream neurons are also assumed to be inherently stochastic, and the correlation between instantaneous firing rates is calculated. It is interesting that the same result is obtained when we consider the spiking mechanism in detail.

3.3 Effective Deterministic Dynamics. Figures 4A and 4B show that a saddle point and a separatrix are associated with synaptic competition. To understand the competitive dynamics more clearly, we reduce the stochastic dynamics of learning to deterministic dynamics. Specifically, we investigate how the interplay of the effective drift terms $\langle A(w_a) \rangle$ and $\langle A(w_b) \rangle$ determines the dynamics of $\langle w_a \rangle$ and $\langle w_b \rangle$. With the detailed calculations developed in appendix A, the results of this mean-field analysis are summarized in Figure 4C. The effect of the diffusion terms tied to the boundary conditions is manifested near the boundaries. The amount of noise specifies the level of the noise floor indicated in the figure. A saddle point and two stable fixed points exist, as shown in Figures 4A and 4B. The separatrix characterizes the two basins of attraction, in each of which the input $\nu_a(t)$ or $\nu_b(t)$ is dominantly encoded after learning. Individual synaptic weights are attracted to one of the two stable fixed points, the selection of which depends on the initial conditions as well as on the dynamical noise caused by the stochastic arrival of spikes and the random wiring between layers. If $\nu_a(t)$ is more significant than $\nu_b(t)$, with $c_a > c_b$, the slope of the nullcline $\langle \dot{w}_a \rangle = 0$ increases so as to enlarge the basin for $\nu_a(t)$. Our results extend the previous results (Kempter et al., 1999, 2001) in that the dynamical features of the synaptic weights are treated with realistic boundary conditions and the competition between two equally synchronous inputs is explicitly analyzed.

Figure 4: Dynamics of mean synaptic weights when the degrees of synchrony of two inputs are different (A) or the same (B). The fixed points and the nullclines for (B) are schematically shown in (C).
When $c_a = c_b$, the analysis in appendix B shows that the saddle point exists if
$$A_1 + c_a A_2 > 0, \qquad (3.22)$$
$$A_1 + c_a A_2 - 2A_3 < 0. \qquad (3.23)$$
Equation 3.22 is usually satisfied because the synaptic weights grow when the global depression term proportional to $A_3$ is absent. Equation 3.23 indicates that the depressing effect overrides the potentiating effect at the saddle point, which guarantees that the summed synaptic weights do not diverge (Kempter et al., 2001; Gerstner & Kistler, 2002a). We have relied on mean-field descriptions by averaging the drift terms, equations 3.14 and 3.15, over the weight space. This reduction renders our results essentially the same as those derived by simpler methods based on differential equations (Kempter et al., 1999, 2001; Gerstner & Kistler, 2002a). However, the drift terms actually differ across individual synapses even if the presynaptic neurons belong to the same group. Because the drift terms are monotonically increasing in $w_i$, a synapse that happens to have a large $w_i$ is likely to be enhanced, whereas a synapse with a small $w_i$ at a certain moment will be depressed. Compared with the differential equation approaches, the Fokker-Planck formalism enables us to look into the intragroup competition that occurs when there are excessive presynaptic neurons (Song et al., 2000; Song & Abbott, 2001). It is straightforward to apply these arguments to the case of more than two inputs (see appendix B). When $m$ inputs that are equally strong in terms of synchrony are received and the network is configured so that synaptic competition occurs, generalization of equations 3.22 and 3.23 results in
$$\lambda_1 = \cdots = \lambda_{m-1} = A_1 + c_a A_2 > 0, \qquad (3.24)$$
$$\lambda_m = A_1 + c_a A_2 - m A_3 < 0. \qquad (3.25)$$
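Equations 3.24 and 3.25 can be checked directly: the mean-field drift of equation 3.14, generalized to $m$ groups, has the Jacobian $\mathrm{diag}(A_1 + c_a A_2) - A_3 \times$ (all-ones matrix), whose spectrum a few lines of linear algebra reproduce. The coefficient values below are arbitrary, chosen only so that equations 3.22 and 3.23 hold:

```python
import numpy as np

A1, A2, A3 = 0.5, 2.0, 1.0   # arbitrary values satisfying eqs. 3.22 and 3.23
m, ca = 4, 0.25

# Jacobian of d<w_k>/dt = (A1 + ca A2) <w_k> - A3 sum_l <w_l>
J = (A1 + ca * A2) * np.eye(m) - A3 * np.ones((m, m))
eig = np.linalg.eigvalsh(J)  # ascending order; J is symmetric here

# m - 1 unstable modes at A1 + ca*A2 and one stable mode at A1 + ca*A2 - m*A3
print(eig)
```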
Accordingly, the saddle point has only a one-dimensional stable manifold, whose eigenvector is $(1, 1, \ldots, 1)$. The other $m - 1$ unstable eigenmodes underlie the symmetric competitive property of the STDP learning. Let us next assume that one input correlation is higher than the others, as follows:
$$c_1 > c_2 = c_3 = \cdots = c_m, \qquad (3.26)$$
where we use $c_1$ and $c_2$ instead of $c_a$ and $c_b$ to avoid confusion. Then it follows that
$$\lambda_1 = A_1 + c_1 A_2 - \frac{(c_1 - c_2)A_2}{2} + \frac{m A_3 \left(-1 + \sqrt{1 + D}\right)}{2}, \qquad (3.27)$$
$$\lambda_2 = \cdots = \lambda_{m-1} = A_1 + c_2 A_2 > 0, \qquad (3.28)$$
$$\lambda_m = A_1 + c_1 A_2 - \frac{(c_1 - c_2)A_2}{2} - \frac{m A_3 \left(1 + \sqrt{1 + D}\right)}{2}, \qquad (3.29)$$
where
$$D = \frac{(c_1 - c_2)^2 A_2^2 + (2m - 4)(c_1 - c_2) A_2 A_3}{m^2 A_3^2} > 0. \qquad (3.30)$$
In general, the configuration of synaptic weights is attracted along the eigenmode with the largest eigenvalue (Kempter et al., 2001; Gerstner & Kistler, 2002a). Consequently, only the synapses from the neurons encoding the input with autocorrelation $c_1$, which correspond to the eigenmode for $\lambda_1$, are likely to be eventually strengthened at the expense of the others if $A_2 > 0$. In the nondegenerate case, where
$$c_1 > c_2 > \cdots > c_m \qquad (3.31)$$
holds, the calculations in appendix B guarantee
$$\lambda_1 > \lambda_2 > \cdots > \lambda_{m-1} > 0 > \lambda_m. \qquad (3.32)$$
The basic dynamics of the mean weights associated with the $m$ groups is similar to that in the previous examples. The $m$-dimensional weight profile contracts along the one-dimensional stable manifold. Then, along the $(m-1)$-dimensional unstable manifold, it escapes from states in which all the weights are more or less balanced. In nondegenerate cases, the weights chiefly move along the eigenvector associated with $\lambda_1$, which is $\left( \frac{1}{A_1 + c_1 A_2 - \lambda_1}, \frac{1}{A_1 + c_2 A_2 - \lambda_1}, \ldots, \frac{1}{A_1 + c_m A_2 - \lambda_1} \right)^{\rm t}$, where ${\rm t}$ denotes the transpose. On the basis of equation B.14, the sign of the first component of this eigenvector differs from that of the others, and the mean weight for the input with correlation $c_1$ grows large while the others converge to tiny values. This is the likely outcome as long as the initial condition lies in the appropriate basin of attraction, which is presumably larger than those for the other presynaptic neurons encoding the inputs with correlations $c_2, \ldots, c_m$. A thorough analysis requires better knowledge of the basins of attraction and the positions of the fixed points, which appears difficult to obtain. Finally, we examine the weight dynamics in the presence of feedback coupling. For simplicity, we ignore axonal and synaptic delays and set $\tau = 0$ and $g(t) = \delta(t)$ for cortical neurons. Then, using equation 3.10, the drift term for the $i$th cortical neuron becomes
$$A(w_{i,a}) = A_1 w_{i,a} + A_2 c_a \langle w_{i,a} \rangle - A_3 (\langle w_{i,a} \rangle + \langle w_{i,b} \rangle) + \sum_{i'=1,\, i' \neq i}^{n_2} \bar{\epsilon}_{i,i'}\, A(w_{i',a}), \qquad (3.33)$$
$$A(w_{i,b}) = A_1 w_{i,b} + A_2 c_b \langle w_{i,b} \rangle - A_3 (\langle w_{i,a} \rangle + \langle w_{i,b} \rangle) + \sum_{i'=1,\, i' \neq i}^{n_2} \bar{\epsilon}_{i,i'}\, A(w_{i',b}), \qquad (3.34)$$
where
$$\bar{\epsilon}_{i,i'} = \frac{\epsilon_{i,i'}}{\nu_{i'}^{\rm post}(\cdot)}. \qquad (3.35)$$
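The codevelopment claim of the next paragraph can be illustrated numerically (a sketch under simplifying assumptions: equal correlations $c_a = c_b$, uniform normalized feedback $\bar{\epsilon}$, and arbitrary coefficient values in the competitive regime). Equations 3.33 and 3.34 give $A = (I \otimes K)\,w + \bar{\epsilon}\,(F \otimes I)\,A$, so the effective drift matrix is $(I - \bar{\epsilon}\, F \otimes I)^{-1}(I \otimes K)$:

```python
import numpy as np

n2, eps_bar = 4, 0.1   # cortical neurons; uniform normalized feedback strength
A1cA2, A3 = 1.0, 0.6   # A1 + c*A2 and A3, arbitrary values in the competitive regime

K = np.array([[A1cA2 - A3, -A3],
              [-A3, A1cA2 - A3]])          # per-neuron block from eqs. 3.33-3.34
F = np.ones((n2, n2)) - np.eye(n2)         # all-to-all feedback over i' != i

# A = (I x K) w + eps_bar (F x I) A  =>  effective drift matrix M below.
M = np.linalg.solve(np.eye(2 * n2) - eps_bar * np.kron(F, np.eye(2)),
                    np.kron(np.eye(n2), K))

vals, vecs = np.linalg.eig(M)
top = vecs[:, np.argmax(vals.real)].real
top /= top[0]
print(np.round(top, 6))   # alternating +1, -1: all neurons codevelop toward one input
```

Setting eps_bar to 0 instead makes M block diagonal, and the leading eigenmodes degenerate into the independent per-neuron patterns described in the text.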
Since feedback coupling makes the synaptic weights of different neurons correlated in general, we have to consider the dynamics in the $2n_2$-dimensional space of $(w_{1,a}, w_{1,b}, w_{2,a}, w_{2,b}, \ldots, w_{n_2,a}, w_{n_2,b})$. As shown in appendix C, the eigenvector corresponding to the largest eigenvalue in this space is $(1, -1, 1, -1, \ldots, 1, -1)$. This indicates that the feedforward synaptic weights of different postsynaptic neurons codevelop to represent a single target input. This result is in remarkable contrast with the consequence in the absence of feedback coupling, in which the $n_2$ leading eigenmodes are degenerate. They are represented as $(1, -1, 0, 0, \ldots, 0, 0)^{\rm t}$, $(0, 0, 1, -1, 0, 0, \ldots, 0, 0)^{\rm t}, \ldots, (0, 0, \ldots, 0, 0, 1, -1)^{\rm t}$, and the feedforward synaptic weights of each cortical neuron are allowed to evolve independently. Because of the codevelopment, the analysis of a single cortical neuron in section 3.2 is sufficient for understanding the network behavior.

4 Simulation Results

In this section, we return to the starting point and ask what kinds of coding schemes can operate in the feedforward networks shown in Figure 2 when multiple inputs are presented to the sensory layer. We are particularly interested in the coexistence of multiple clusters guided by multiple input sources, which is a key to solving binding problems and the superposition catastrophe (von der Malsburg & Schneider, 1986; Fujii et al., 1996; Salinas & Sejnowski, 2001). We consider a simple case of two temporal waveforms, each covering a part of the sensory layer in a complementary way:
$$\nu_i(t) = \begin{cases} 10.0 + 5.0\, \nu_a(t), & 1 \leq i \leq \bar{n}_1, \\ 10.0 + 5.0\, \nu_b(t), & \bar{n}_1 + 1 \leq i \leq n_1. \end{cases} \qquad (4.1)$$
A cortical neuron is assumed to receive synapses from $n'_1$ sensory neurons according to the following rule. When a cortical neuron is connected to more (or fewer) presynaptic neurons in assembly $\{1, 2, \ldots, \bar{n}_1\}$ than in assembly $\{\bar{n}_1 + 1, \ldots, n_1\}$, this cortical neuron will come to prefer $\nu_a(t)$ (or $\nu_b(t)$) as a result of learning. If the wiring structure forces the majority of cortical neurons to select the same assembly of sensory neurons, the cortical layer obviously encodes one of the inputs. Here we consider nontrivial situations in which
each cortical neuron receives a synapse from a sensory neuron in one of the two assemblies with equal probability. Accordingly, a cortical neuron is connected to $n'_1/2$ sensory neurons in each assembly on average. Then, in the absence of the recurrent connections, the feedforward synaptic weights of cortical neurons are subject to the dynamics discussed in section 3. The STDP learning makes each cortical neuron encode $\nu_a(t)$ or $\nu_b(t)$, depending on the initial wiring and the stochastic dynamics. The cortical neurons encoding the same input are expected to form clusters with correlated firing. This kind of self-organizing behavior has been discussed in feedback networks (Horn et al., 2000; Levy et al., 2001), and we examine it for feedforward networks in relation to emergent coding schemes. Relevant to the dynamical states is the shared connectivity between the $i$th and $j$th cortical neurons, denoted by $r_{i,j}$ (Shadlen & Newsome, 1998; Masuda & Aihara, 2003). This quantity measures how much input a pair of downstream neurons share. With a large value of $r_{i,j}$, two cortical neurons receive correlated inputs from the sensory layer, which promote synchrony and limit the performance of the cortical layer as a population rate coder (Masuda & Aihara, 2003). For networks with fixed structure, $r_{i,j}$ is calculated as the number of shared inputs divided by the total number of inputs. Since the synaptic weights are inhomogeneous and dynamically changing, we define $r_{i,j}$ as follows:
$$r_{i,j} = \sum_{k=1}^{n_1} \min\bigl(w_{i,k}, w_{j,k}\bigr) \bigg/ \frac{1}{2} \sum_{k=1}^{n_1} \bigl(w_{i,k} + w_{j,k}\bigr), \qquad (4.2)$$
where $\min(x, y)$ is the smaller of $x$ and $y$. We start with $w_{i,j} = \frac{1}{2} w_{\max}$ for all the existing synapses. If both the $i$th and $j$th cortical neurons purely encode $\nu_a(t)$ after learning, $r_{i,j} = \frac{n'_1}{2\bar{n}_1}$, whereas $r_{i,j} = \frac{n'_1}{2(n_1 - \bar{n}_1)}$ if both cortical neurons encode $\nu_b(t)$. If they encode different inputs, $r_{i,j}$ ideally converges to 0. Inhomogeneous shared connectivity allows the network to encode the two inputs in different manners (section 4.2). We also vary the feedback coupling strength $\epsilon_{i,j}$, which can switch network states (Tsodyks et al., 2000; Masuda & Aihara, 2003). The input rates for $5.0k\ {\rm ms} \leq t < 5.0(k+1)\ {\rm ms}$ are defined by
$$\nu_a(t) = \nu_{k,a} = 0.7\, \nu_{k-1,a} + \xi_{k,a}, \qquad (4.3)$$
$$\nu_b(t) = \nu_{k,b} = 0.7\, \nu_{k-1,b} + \xi_{k,b}, \qquad (4.4)$$
where $\nu_{-1,a} = \nu_{-1,b} \equiv 0$, and $\xi_{k,a}$ and $\xi_{k,b}$ are independently drawn from the gaussian distribution with mean 0 and standard deviation 1. The inputs can be approximated by the Ornstein-Uhlenbeck process with correlation time $\tau_c = -5\ {\rm ms}/\ln 0.7 \approx 14$ ms, which has been used in section 3. Furthermore, we set $n_2 = 50$, $n'_1 = 100$, and $\tau = 0.3$ ms.
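The shared-connectivity measure of equation 4.2 can be checked with a short sketch (illustrative code, not the authors'; the random wiring only mimics the setup described above, and the variable names are ours):

```python
import numpy as np

def shared_connectivity(wi, wj):
    """Equation 4.2: weighted overlap of two feedforward weight vectors."""
    return np.minimum(wi, wj).sum() / (0.5 * (wi + wj).sum())

# Hand-made weight vectors: identical, half-overlapping, and disjoint supports.
u = np.array([1.0, 1.0, 0.0, 0.0])
print(shared_connectivity(u, u))                               # 1.0
print(shared_connectivity(u, np.array([1.0, 0.0, 1.0, 0.0])))  # 0.5
print(shared_connectivity(u, np.array([0.0, 0.0, 1.0, 1.0])))  # 0.0

# Two neurons purely encoding nu_a: each keeps weight w_max on a random half of
# its n1p = 100 synapses, all drawn from the first assembly of n1_bar = 120
# sensory neurons; r_ij then fluctuates around n1p / (2 n1_bar).
rng = np.random.default_rng(0)
n1, n1_bar, n1p, w_max = 240, 120, 100, 0.07
wi, wj = np.zeros(n1), np.zeros(n1)
wi[rng.choice(n1_bar, n1p // 2, replace=False)] = w_max
wj[rng.choice(n1_bar, n1p // 2, replace=False)] = w_max
r_rand = shared_connectivity(wi, wj)
print(r_rand)   # around 100 / 240, i.e., roughly 0.42
```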
Figure 5: Comparison of theoretical (smooth lines) and numerical (steps) weight distributions after learning, for a single postsynaptic neuron. (A, B) The weight distributions for the synapses corresponding to $\nu_a(t)$ and $\nu_b(t)$, respectively.
To verify the validity of the analytical methods developed in section 3, we first compare the theoretical and numerical weight distributions for a single postsynaptic neuron in the nonleaky case ($\gamma = 0$). The two weight distributions, corresponding to the presynaptic neurons encoding $\nu_a(t)$ or $\nu_b(t)$, are calculated analytically and numerically for the parameter values described above, with $A_+ = 0.00375$, $A_- = 0.004$, and $w_{\max} = 0.03$. Figure 5 shows the agreement between the theoretical and numerical results. However, the analytical methods cannot directly predict the synaptic weight distributions obtained in the simulations below. Membrane potential leak, which we ignored in the analytical part, actually generates pseudocorrelation between the two input sources: a cortical neuron is more likely to fire when large instantaneous inputs, probably simultaneous large inputs from both input sources, are received. This pseudocorrelation decreases $A_3$; thus, the numerically obtained distributions are better approximated by the analytical results with an $A_+$ larger than the value actually used. Therefore, we set $A_+ = 0.00333$ rather than $A_+ = 0.00375$ in the following simulations. However, the qualitative features of the learning dynamics are preserved after this modification. We also set $\gamma = 0.05\ {\rm ms}^{-1}$ and $w_{\max} = 0.07$. The firing rates are measured by counting the number of spikes with a bin width of 5 ms. In particular, we stipulate that the relevant population firing rates are those based on the $n_2/5$ neurons that most prefer $\nu_a(t)$ or $\nu_b(t)$. The performance of the rate coding by a single neuron or by a population is evaluated by the correlation coefficient between the time-dependent firing rates and $\nu_a(t)$ or $\nu_b(t)$ over the last 800 ms of a simulation run (Masuda & Aihara, 2002b, 2003). The degree of synchrony is measured by the correlation coefficient between the spike counts of each pair of cortical neurons, with the use of the same bin width.

4.1 Even Shared Connectivities from Two Sources. We first consider the case in which the shared connectivities tied to the two assemblies in the sensory layer are the same: $\bar{n}_1 = n_1/2$. When feedback coupling is absent, each cortical neuron learns to encode $\nu_a(t)$ or $\nu_b(t)$ with equal probability (see Figure 4B). The simulation results with $n_1 = 240$ and $\epsilon_{i,j} = 0$ are shown in Figure 6, where the cortical neurons are sorted so that neurons with smaller indices prefer $\nu_a(t)$. As shown in Figures 6A and 6E, single cortical neurons with small (or large) indices can encode $\nu_a(t)$ (or $\nu_b(t)$) by rate coding, with only the synapses from the sensory neurons associated with $\nu_a(t)$ (or $\nu_b(t)$) surviving. Figure 6A shows that the performance of encoding $\nu_a(t)$ (or $\nu_b(t)$) by firing rates, measured by the correlation coefficients introduced above, is higher for population firing rates based on the first (or last) $n_2/5$ neurons than for single-neuron firing rates. The cortical neurons are divided into two loose clusters identified by the relatively high degree of synchrony between the neurons encoding the same input, as shown in Figure 6B. Despite the weak synchrony, population rate coding succeeds in encoding the inputs with high resolution. Figures 6C and 6D, which compare the shared connectivity before and after learning, show that it is self-organized toward a structured state through learning. In other words, the learning causes the cortical layer to specialize in two population rate codes. Let us denote this state by $R_a R_b$, which means that the cortical layer represents $\nu_a(t)$ and $\nu_b(t)$ by population rate codes ($R$) with two segregated subpopulations. Next, we introduce feedback coupling by setting $\epsilon_{i,j} = 0.006$ uniformly. Figure 7B shows that global synchrony is induced by the feedback coupling.
However, the population rate coding performance with feedback coupling (Figure 7A) is much lower than that without feedback coupling (Figure 6A). The cortical layer operates in a synchronous code mode, which is robust and efficient in terms of power consumption at the cost of coding accuracy (Masuda & Aihara, 2002b, 2003; van Rossum et al., 2002). In addition, the information on $\nu_a(t)$ and $\nu_b(t)$ is mixed because of the global coupling for most initial conditions. As the analytical results in section 3.3 predict, either $\nu_a(t)$ or $\nu_b(t)$, chosen randomly, is encoded by the whole cortical layer in the end. This filtering mechanism works not only at the single-neuron level but also at the network level, and it may be closely related to the formation of columnar structure and dynamical neuronal modules through specialization (Song & Abbott, 2001). This state is denoted by $S_a S_a$, where $S$ denotes synchrony. In this state, the cortical layer, which would consist of two clusters without feedback coupling, encodes $\nu_a(t)$ by globally synchronous firing. The existence of simultaneously active clusters identified by synchronous or correlated firing is a key to illuminating the binding problem and the
Figure 6: Simulation results when the shared connectivities for two inputs are equal. We set $n_1 = 240$, $n_2 = 50$, $n'_1 = 100$, $\bar{n}_1 = 120$, and $\epsilon_{i,j} = 0$ for all $i, j$. (A) Performance of single-neuron rate coding for $\nu_a(t)$ is indicated by squares, and that for $\nu_b(t)$ by circles. Performance of population rate coding for $\nu_a(t)$ is indicated by solid lines, and that for $\nu_b(t)$ by dotted lines. Performance is measured by the correlation coefficient between the firing rates and the external inputs, and population firing rates are calculated by taking the
Figure 7: Simulation results for two equal shared connectivities with constant feedback coupling. We set $n_1 = 240$, $n_2 = 50$, $n'_1 = 100$, $\bar{n}_1 = 120$, and $\epsilon_{i,j} = 0.006$. (A) Performance of single-neuron rate coding and population rate coding measured by the correlation function (see the caption of Figure 6 for details). (B) Degree of synchrony between cortical neurons. (C) Shared connectivity between cortical neurons.
Figure 6 (cont.): average firing rates of the $n_2/5 = 10$ neurons that are most engaged in encoding $\nu_a(t)$ or $\nu_b(t)$. (B) Degree of synchrony between pairs of cortical neurons. The horizontal and vertical axes correspond to the indices of cortical neurons. Brighter points correspond to larger values of the degree of synchrony. For clarity, a black point corresponds to a value less than 0.2, and a white point to a value more than 0.5. (C) Shared connectivity between pairs of cortical neurons before learning and (D) after learning. A black (or white) point corresponds to a value less than 0.1 (or more than 0.8). (E) The synaptic weights from the sensory neurons to the first and the $n_2$th cortical neurons after learning.
superposition catastrophe, as discussed in the contexts of the synfire chain (Abeles, 1991; Diesmann et al., 1999; Horn et al., 2000), dynamical cell assemblies (Aertsen et al., 1989, 1994; Vaadia et al., 1995; Fujii et al., 1996), object segregation (von der Malsburg & Schneider, 1986; Sompolinsky et al., 1990), and memory retrieval (Levy et al., 2001; Gerstner & Kistler, 2002b). It is possible for the network to encode $\nu_a(t)$ and $\nu_b(t)$ by two mutually exclusive synchronous clusters with uniform coupling as described above, but only for a small range of parameter values. A more robust mechanism is to apply a learning rule in which $\epsilon_{i,j}$ is increased only when the $i$th and $j$th cortical neurons fire almost synchronously and is decreased otherwise (Nishiyama et al., 2000; Abbott & Nelson, 2000). Here we use a simple learning window (see Figure 3B), defined as follows:
$$G(t) = \begin{cases} 0.0009\, \epsilon_{\max}, & 0 \leq |t| \leq 4\ {\rm ms}, \\ -0.0006\, \epsilon_{\max}, & 8\ {\rm ms} \leq |t| \leq 23\ {\rm ms}, \end{cases} \qquad (4.5)$$
where $\epsilon_{\max} = 0.006$. We run the learning procedure with the initial conditions $\epsilon_{i,j} = 0$ for all $i, j$ and the restriction $0 \leq \epsilon_{i,j} \leq \epsilon_{\max}$. Figure 8B, in comparison with Figure 7B, shows that two synchronous clusters coexist. Figure 8D shows that feedback coupling is eventually formed that separates the two clusters and enhances intracluster synchrony. This coding scheme is denoted by $S_a S_b$. However, it is difficult for these synchronous clusters to emerge with a standard learning window of the form of equation 2.1 or Figure 3A. Actually, in recurrent networks or within a layer, synaptic potentiation in one direction accompanies synaptic depression in the other direction. The final coupling strengths for the learning window of equation 2.1, with the learning rate slowed by the ratio $0.0009/0.004$, are shown in Figure 8E. No synaptic structure is formed with the asymmetric learning window, leaving the figure uniformly black. Similar to the uncoupled situation, the coding scheme in this case is $R_a R_b$. The numerical results are qualitatively the same for various ratios of the two learning rates.

4.2 Uneven Shared Connectivities from Two Sources. To investigate inhomogeneous information coding with respect to two external inputs, we impose different shared connectivities for $\nu_a(t)$ and $\nu_b(t)$ by setting $n_1 = 315$ and $\bar{n}_1 = 65$. By assumption, a cortical neuron still receives incident spikes from each assembly of sensory neurons with probability $1/2$, so the situation does not change for an isolated cortical neuron. However, population firing patterns may differ statistically between the two emergent clusters of cortical neurons because of the uneven shared connectivities. The shared connectivity $\frac{n'_1}{2\bar{n}_1} = 50/65$ of cortical neurons encoding $\nu_a(t)$ is higher than the shared connectivity $\frac{n'_1}{2(n_1 - \bar{n}_1)} = 50/250$ of cortical neurons encoding $\nu_b(t)$. The former subpopulation is more apt to synchronize (Shadlen &
Self-Organizing Dual Coding
649
Figure 8: Simulation results for two equal shared connectivities with adaptive feedback coupling for symmetric (A, B, C, D) and asymmetric (E) learning windows. We set $n_1 = 240$, $n_2 = 50$, $n_1^0 = 100$, $\tilde{n}_1 = 120$, and $\epsilon_{\max} = 0.006$. (A) Performance of single-neuron rate coding and population rate coding (see the caption of Figure 6 for details). (B) Degree of synchrony. (C) Shared connectivity. (D) Feedback coupling strength after learning with the symmetric learning window of the form of equation 4.5. A brighter point indicates a stronger coupling weight, and a black (or white) point corresponds to 0 (or $\epsilon_{\max}$). (E) Feedback coupling strength after learning with the asymmetric learning window of the form of equation 2.1.
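The symmetric learning window of equation 4.5 is simple to sketch numerically. The following is a minimal illustration only: the window is assumed to be zero outside the two listed intervals (the text does not state this explicitly), and the clipping of the couplings to $[0, \epsilon_{\max}]$ mirrors the stated restriction:

```python
import numpy as np

EPS_MAX = 0.006  # epsilon_max from the text

def G(t_ms):
    """Symmetric learning window of equation 4.5 (assumed zero elsewhere)."""
    t = abs(t_ms)
    if t <= 4.0:
        return 0.0009 * EPS_MAX      # potentiation for near-synchronous firing
    if 8.0 <= t <= 23.0:
        return -0.0006 * EPS_MAX     # depression at larger time lags
    return 0.0

def update_coupling(eps, dt_ms):
    """One pairwise update, with the coupling confined to [0, eps_max]."""
    return float(np.clip(eps + G(dt_ms), 0.0, EPS_MAX))

# Near-synchronous spike pairs strengthen the coupling; lagged pairs weaken it.
eps = 0.5 * EPS_MAX
eps_sync = update_coupling(eps, 1.0)    # |dt| = 1 ms  -> potentiation
eps_lag  = update_coupling(eps, 15.0)   # |dt| = 15 ms -> depression
```

Note that $G(-t) = G(t)$, which is exactly what distinguishes this window from the asymmetric STDP window of equation 2.1.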
650
N. Masuda and K. Aihara
Newsome, 1998; Salinas & Sejnowski, 2000, 2001; Masuda & Aihara, 2003). Although the supposed shared connectivity values are somewhat higher than experimentally observed ones (Shadlen & Newsome, 1998; Salinas & Sejnowski, 2000), the higher shared connectivity could occur in networks with dense local coupling. Without feedback coupling, the cortical neurons encoding $\nu_a(t)$ are actually more synchronized (see Figure 9B) and are accompanied by a higher intracluster shared connectivity (see Figure 9C) than are cortical neurons of the other subpopulation. However, as shown in Figure 9A, the neurons encoding $\nu_b(t)$ can be pooled to realize more accurate rate coding. Two equally synchronous inputs are processed in different ways, and this coding mode is symbolized as $S_aR_b$. In this way, different codes may be used simultaneously by a network or a specific part of the brain, with its neuronal resources segregated for different modes.
Figure 9: Simulation results for two different shared connectivities without feedback coupling. We set $n_1 = 315$, $n_2 = 50$, $n_1^0 = 100$, $\tilde{n}_1 = 65$, and $\epsilon_{i,j} = 0$. (A) Performance of single-neuron rate coding and population rate coding (see the caption of Figure 6 for details). (B) Degree of synchrony. (C) Shared connectivity.
Figure 10: Simulation results for two different shared connectivities with feedback coupling. We set $n_1 = 315$, $n_2 = 65$, $n_1^0 = 100$, $\tilde{n}_1 = 65$, and $\epsilon_{i,j} = 0.006$. (A) Performance of single-neuron rate coding and population rate coding (see the caption of Figure 6 for details). (B) Degree of synchrony. (C) Shared connectivity.
Figure 10 shows the numerical results when the feedback coupling strength is uniformly equal to $\epsilon_{i,j} = 0.006$. In this situation, the cluster that has been weak in terms of synchrony is absorbed into the strong cluster, so that the whole network encodes $\nu_a(t)$ by synchrony ($S_aS_a$). This is because a stronger cluster generally sends more correlated feedback inputs to a weaker cluster than the other way around. The neurons receiving more correlated inputs are then likely to become synchronous, with firing times locked to the inputs (Shadlen & Newsome, 1998; Salinas & Sejnowski, 2000; Masuda & Aihara, 2003), which in this case are the firing times of the stronger cluster.

5 Discussion

5.1 Various Types of Dual Coding. The results in section 4 show that multiple external inputs can be dealt with in inhomogeneous ways even in
a single network. We have found the rate code mode ($R_aR_b$), the clustered synchronous code mode ($S_aS_b$), the global synchronous code mode ($S_aS_a$), and the mixed code mode ($S_aR_b$). The analytical results in section 3 suggest that these modes can be extended to the case of a larger number of subpopulations in a layer. These modes are generalizations of our previous work on dual coding (Masuda & Aihara, 2003) to the inhomogeneous case. At the same time, these modes are generalizations of the synfire chain and of coding schemes by synchronous clusters (von der Malsburg & Schneider, 1986; Abeles, 1991; Diesmann et al., 1999). In particular, we have shown that the combined effects of input structure, coupling structure, and synaptic learning lead to various network states, including clustered firing. Similar to associative memory models, these complex network states, reflecting inputs and network structure, are probably related to intricate information processing in the brain. Our results suggest that rate codes and temporal codes may be used in the brain dually in terms of time and neuronal populations. For example, even in synfire chains, the neurons in the background may encode other information with population firing rates. Furthermore, the totally synchronous mode, the asynchronous mode, or other modes may alternate in time. This view is partially supported by experimental evidence (Gray et al., 1989; de Oliveira et al., 1997). The notion of dual coding has been used explicitly and implicitly by many authors. Although its definition may vary somewhat, dual coding generally refers to a mechanism in which the brain uses multiple codes. With our results in mind, let us review the different kinds of proposed dual coding:

• Dual coding by time sharing: Multiple codes, such as the rate code and the synchronous code, can emerge in different situations in the same network. In other words, time sharing triggered by external stimuli or by switching of cognitive states enables the codes to alternate. At the network level, the theoretical results in this work and others (Aertsen et al., 1994; Tsodyks et al., 2000; Araki & Aihara, 2001; van Rossum et al., 2002; Masuda & Aihara, 2002b, 2003), together with experimental evidence (de Oliveira et al., 1997; Riehle et al., 1997), support this idea. In our notation, this mode corresponds to the situation in which various coding schemes such as $R_aR_b$, $S_aS_b$, and $S_aS_a$ appear in turn. At the single-neuron level, the input frequency and the neurotransmitter release probability let the coding scheme of a single neuron switch between temporal coding based on coincidence detection and rate coding (Tsodyks & Markram, 1997).

• Dual coding by a multifunctional encoder: Different properties of neuronal states or spike patterns of a single neuron or a single population may simultaneously encode multiple entities. For example, it has been proposed that a cell assembly encodes the meaning, whereas the firing rates of the constituent neurons encode its significance or intensity
of inputs (Sompolinsky et al., 1990), the degree of participation of a single neuron in the assembly (Krüger & Becker, 1991), or the types of discriminative stimuli (Sakurai, 1999). Membership of a neuron in an assembly makes sense only by grouping neurons by, for example, synchrony or correlated firing. In a sense, this dual coding is equivalent to the synchronous mode $S$ in our notation, with its coarse firing rate carrying the quantitative information. In any case, this coding scheme can be understood as the superimposition of $S_a$ and $R_a$. Even at a single-neuron level, experiments show that information on time is encoded in the spike timing, whereas information on object identity is encoded in the spike count (Berry, Warland, & Meister, 1997). Experimental results indicative of dual coding by time sharing (de Oliveira et al., 1997; Riehle et al., 1997) can also be interpreted as examples of this type of dual coding; synchrony possibly represents internal events, whereas firing rates reflect external inputs.

• Dual coding by different use of subpopulations: Subpopulations of a neural network can be allocated to multiple information sources, as examined in section 4. Model studies suggest that neural networks can be divided into clusters, each encoding an input feature or an internal state by firing rates or synchrony (for example, $R_aR_b$, $S_aS_b$, and $S_aR_b$) (von der Malsburg & Schneider, 1986; Sompolinsky et al., 1990; Abeles, 1991; Fujii et al., 1996). Interaction of the clusters typically results in filtering out of relatively asynchronous clusters, which leads to, for example, $S_aS_a$. By definition, a single neuron cannot operate in this mode. Moreover, this coding scheme is different from that described by the experimental literature cited above. Real neural codes may be a mixture of the theoretically proposed ones stated above, enabling proper assignment of neuronal space and time for high performance and capacity.
The coding scheme can also change as signals pass through the layers; the transition from $R$ to $S$ is generally easy, whereas the transition from $S$ to $R$ is generally difficult (Song & Abbott, 2001; Masuda & Aihara, 2003). In this work, the sensory layer serves as the rate coder $R_aR_b$ because of the stochasticity of firing. We have observed that $R_aR_b$ is passed on to the cortical layer without transformation or with transformation into $S_aR_b$, $S_aS_a$, or $S_aS_b$, depending on the shared connectivity and the structure of the feedback coupling. We note that density approaches using the Fokker-Planck equations have been used for analyzing the dynamics of firing rates and membrane potentials. As a result, the conditions for synchrony or asynchrony in recurrent networks have been established for simple input schemes such as constant, white gaussian, or step inputs (Brunel, 2000; Gerstner, 2000; Gerstner & Kistler, 2002b). Although it may be interesting to apply these methods to the analysis of dual coding, such analysis requires extensive calculations in the case of spatiotemporal inputs and is therefore a subject of future work.
5.2 Functional Roles of Spike-Time-Dependent Plasticity Revisited. STDP has led to the emergence of various modes via unsupervised learning and the associated synaptic competition. The selection of synapses and self-organization through STDP are not new (see, e.g., Kempter et al., 1999; Kistler & van Hemmen, 2000; Song et al., 2000; van Rossum et al., 2000; Song & Abbott, 2001; Câteau & Fukai, 2003). In this article, we have discussed how learning influences the coding schemes in relation to synchrony, clustered states, and asynchrony. The mechanisms that cause synchrony, such as high shared connectivity and strong feedback coupling, should be contrasted with STDP and its functional roles (Gerstner & Kistler, 2002b; also see section 1). STDP learning enables neurons to respond fast (e.g., firing at low thresholds, Kistler & van Hemmen, 2000; latency reduction, Song et al., 2000), to be precise (e.g., coincidence detection, Kistler & van Hemmen, 2000; the associated temporal coding in barn owl auditory systems, Gerstner et al., 1996), and to be relational or sequential (e.g., difference learning, Rao & Sejnowski, 2001; alternately firing clusters, Horn et al., 2000, and Levy et al., 2001). However, STDP does not necessarily make neurons synchronous, because an accidental near-synchronous firing of two neurons potentiates the synaptic weight in one direction while it depresses the synapse in the opposite direction, if it exists. In more or less synchronous situations, jitter may cause the next near-synchronous firing event to occur with the converse order of firing. After the transient, the coupling strengths between these two neurons are likely to converge to the minimum in both directions, as verified in Figure 8E. In other words, a dynamical state in which the synapses are bidirectionally strong is unstable. Although rough synchrony can emerge via unidirectionally potentiated synapses, the order of firing becomes fixed within the synchronous cluster.
This order implies sequential relations beyond mere synchrony. Of course, this kind of rough synchrony may be used for practical functions such as column formation, without taking advantage of the possibly meaningful order of firing (Song & Abbott, 2001). Synchrony with a precision of 1 ms is widely found in experiments (Gray et al., 1989; de Oliveira et al., 1997), and its functional role has been discussed extensively (von der Malsburg & Schneider, 1986; Abeles, 1991; Diesmann et al., 1999; Tsodyks et al., 2000). Synchronization may be caused not only by common inputs but also by lateral coupling (Gray et al., 1989; Tsodyks et al., 2000). However, it seems difficult for the asymmetric STDP rule represented by equation 2.1, which has the characteristic timescale of $\tau_0 = 20$ ms, to provide a mechanism for synchronization. Therefore we have introduced the symmetric learning window for the feedback coupling represented by equation 4.5. The symmetric learning window is supported by experimental evidence for excitatory (Nishiyama et al., 2000) and inhibitory (Abbott & Nelson, 2000) neurons. Other mechanisms that lead to partially synchronous firing without full synchrony include enhancing synaptic weights for a short time in response to simultaneous increases in
the firing rates of two neurons (Sompolinsky et al., 1990) and the use of global inhibitors that prohibit simultaneous activation of multiple clusters (von der Malsburg & Schneider, 1986; Watanabe, Aihara, & Kondo, 1998). It is a future problem to include inhibitory synapses and to consider the cooperation of various types of synaptic plasticity.

Appendix A

Here we derive the dynamics of the averaged weights $\langle w_a \rangle$ and $\langle w_b \rangle$ under synaptic competition. Using equations 3.1 and 3.2, we have

$$\begin{aligned}
\langle \dot{w}_a \rangle &= \frac{\partial}{\partial t} \int_0^{w_{\max}} dw\, w\, p_a(w, t) \\
&= -\int_0^{w_{\max}} dw\, w\, \frac{\partial}{\partial w} \bigl( A(w) p_a(w, t) \bigr) + \frac{1}{2} \int_0^{w_{\max}} dw\, w\, \frac{\partial^2}{\partial w^2} \bigl( B(w) p_a(w, t) \bigr) \\
&= \int_0^{w_{\max}} dw\, A(w) p_a(w, t) - \frac{1}{2} \int_0^{w_{\max}} dw\, \frac{\partial}{\partial w} \bigl( B(w) p_a(w, t) \bigr) \\
&= A_1 \langle w_a \rangle + A_2 c_a \langle w_a \rangle + A_3 \bigl( \langle w_a \rangle + \langle w_b \rangle \bigr) \\
&\quad + \frac{1}{2} \bigl( B(0, c_a, \langle w_a \rangle, \langle w_b \rangle) p_a(0, t) - B(w_{\max}, c_a, \langle w_a \rangle, \langle w_b \rangle) p_a(w_{\max}, t) \bigr),
\end{aligned} \tag{A.1}$$
where we have used integration by parts. Consequently, the effective dynamics of $\langle w_a \rangle$ and $\langle w_b \rangle$ is written as follows:

$$\begin{pmatrix} \langle \dot{w}_a \rangle \\ \langle \dot{w}_b \rangle \end{pmatrix} = \begin{pmatrix} A_1 + c_a A_2 - A_3 & -A_3 \\ -A_3 & A_1 + c_b A_2 - A_3 \end{pmatrix} \begin{pmatrix} \langle w_a \rangle \\ \langle w_b \rangle \end{pmatrix} + \frac{1}{2} \begin{pmatrix} B(0, c_a, \langle w_a \rangle, \langle w_b \rangle) p_a(0, t) - B(w_{\max}, c_a, \langle w_a \rangle, \langle w_b \rangle) p_a(w_{\max}, t) \\ B(0, c_b, \langle w_b \rangle, \langle w_a \rangle) p_b(0, t) - B(w_{\max}, c_b, \langle w_b \rangle, \langle w_a \rangle) p_b(w_{\max}, t) \end{pmatrix}. \tag{A.2}$$
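The drift part of equation A.2 is a linear two-dimensional system, and its competitive character is easy to probe numerically. The sketch below integrates only the drift term, with the boundary terms replaced by simple clipping of the weights to $[0, w_{\max}]$; the parameter values ($A_1$, $A_2$, $A_3$, $c_a$, $c_b$) are illustrative choices, not values from the article:

```python
import numpy as np

# Illustrative parameters (not from the article); c_a > c_b makes pathway a
# "win" the synaptic competition.
A1, A2, A3 = 0.1, 1.0, 0.5
ca, cb = 0.6, 0.3
w_max = 1.0

# Drift matrix of equation A.2; the boundary terms are approximated by clipping.
M = np.array([[A1 + ca * A2 - A3, -A3],
              [-A3, A1 + cb * A2 - A3]])

w = np.array([0.5, 0.5])          # start both mean weights mid-range
dt = 0.01
for _ in range(20000):            # forward Euler integration
    w = np.clip(w + dt * (M @ w), 0.0, w_max)

print(w)  # the more strongly correlated pathway saturates, the other dies out
```

With these numbers the determinant of the drift matrix is negative, so the symmetric state is a saddle; the weights segregate to opposite boundaries, which is exactly the competition the appendix analyzes.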
The first term in equation A.2 describes drift-driven dynamics, which is dominant everywhere except near the boundaries. The correction to the dynamics in the second term results from the interplay of diffusion and the boundary conditions. The constraint that the weights are confined in $[0, w_{\max}]$ yields

$$p_x(w, t) = \delta(w), \quad \text{if } \langle w_x \rangle = 0, \tag{A.3}$$
$$p_x(w, t) = \delta(w - w_{\max}), \quad \text{if } \langle w_x \rangle = w_{\max}, \tag{A.4}$$
where $x = a$ or $b$. From equations A.3 and A.4 and the fact that the diffusion term is always positive, it follows that

$$\langle \dot{w}_x \rangle \big|_{\langle w_x \rangle = 0} > 0; \qquad \langle \dot{w}_x \rangle \big|_{\langle w_x \rangle = w_{\max}} < 0. \tag{A.5}$$

Accordingly, the dynamics is confined to the square $[0, w_{\max}] \times [0, w_{\max}]$. The second term in equation A.2 changes dramatically only near the boundaries, where $p_x(w, t)$ approaches a singular distribution. If we further suppose that the first (or second) component of this term is monotonically decreasing in $\langle w_a \rangle$ (or $\langle w_b \rangle$), the nullclines of equation A.2 look like the ones in Figure 4C.

Appendix B

To derive the conditions for the saddle point, let us assume $c_a = c_b$. Consequently, a possible fixed point $(w^*, w^*)$, $0 \le w^* \le w_{\max}$, on the straight line $\langle w_a \rangle = \langle w_b \rangle$ satisfies

$$(A_1 + c_a A_2 - 2A_3) w^* + \frac{1}{2} \bigl( B(0, c_a, w^*, w^*) p_a(0, t) - B(w_{\max}, c_a, w^*, w^*) p_a(w_{\max}, t) \bigr) = 0. \tag{B.1}$$
Equation B.1 actually has a solution because equations A.3 and A.4 guarantee that the left-hand side of equation B.1 is larger than 0 for $w^* = 0$ and smaller than 0 for $w^* = w_{\max}$. The stability of the fixed point is determined by the eigenvalues $\lambda_1$ and $\lambda_2$, which are the solutions of the following equation:

$$\begin{vmatrix} A_1 + c_a A_2 - A_3 + \dfrac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_a \rangle} - \lambda & -A_3 + \dfrac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_b \rangle} \\[2mm] -A_3 + \dfrac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_b \rangle} & A_1 + c_a A_2 - A_3 + \dfrac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_a \rangle} - \lambda \end{vmatrix} = 0, \tag{B.2}$$
where

$$\mathcal{B}(\langle w_a \rangle, \langle w_b \rangle) = \frac{1}{2} \bigl( B(0, c_a, \langle w_a \rangle, \langle w_b \rangle) p_a(0, t) - B(w_{\max}, c_a, \langle w_a \rangle, \langle w_b \rangle) p_a(w_{\max}, t) \bigr). \tag{B.3}$$
The saddle point $(w^*, w^*)$ must satisfy the following inequality:

$$\begin{aligned}
\lambda_1 \lambda_2 &= \left( A_1 + c_a A_2 - A_3 + \frac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_a \rangle} \right)^2 - \left( -A_3 + \frac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_b \rangle} \right)^2 \\
&= \left( A_1 + c_a A_2 + \frac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_a \rangle} - \frac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_b \rangle} \right) \cdot \left( A_1 + c_a A_2 - 2A_3 + \frac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_a \rangle} + \frac{\partial \mathcal{B}(w^*, w^*)}{\partial \langle w_b \rangle} \right) \\
&< 0.
\end{aligned} \tag{B.4}$$
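The factorization in inequality B.4 is just the identity that a symmetric 2x2 matrix with equal diagonal entries $p$ and off-diagonal entries $q$ has eigenvalues $p \pm q$, so their product is $(p - q)(p + q)$. A quick numerical confirmation, with arbitrary stand-in values for the two entries (not values derived from the article):

```python
import numpy as np

# p stands in for the diagonal term of B.2, q for the off-diagonal term;
# the numbers are arbitrary illustrations.
p, q = 0.35, -0.6

J = np.array([[p, q],
              [q, p]])
lam = np.linalg.eigvalsh(J)        # exact eigenvalues are p - q and p + q

print(lam[0] * lam[1], (p - q) * (p + q))   # the product appearing in B.4
```

A saddle corresponds to this product being negative, that is, to $p - q$ and $p + q$ having opposite signs.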
If the saddle point is not extremely close to the boundaries, the effect of the boundary conditions does not change dramatically near the saddle point. We ignore the derivative of $\mathcal{B}$ for this reason, and equation B.4 reduces to equations 3.22 and 3.23. When $m$ equally synchronous inputs are applied, a nontrivial fixed point is obtained by solving the following set of equations:

$$w^* = \langle w_a \rangle = \langle w_b \rangle = \cdots = \langle w_m \rangle, \tag{B.5}$$

$$(A_1 + c_a A_2 - m A_3) w^* + \frac{1}{2} \bigl( B(0, c_a, w^*, \ldots, w^*) p_a(0, t) - B(w_{\max}, c_a, w^*, \ldots, w^*) p_a(w_{\max}, t) \bigr) = 0, \tag{B.6}$$
which has an interior solution for reasons similar to those for $m = 2$. If we ignore the derivative of the last term of equation B.6, the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$ in equations 3.24 and 3.25 are derived by solving

$$\begin{vmatrix} A_1 + c_a A_2 - A_3 - \lambda & -A_3 & \cdots \\ -A_3 & A_1 + c_a A_2 - A_3 - \lambda & \cdots \\ \vdots & \vdots & \ddots \end{vmatrix} = 0. \tag{B.7}$$
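The matrix in equation B.7 can be written as $(A_1 + c_a A_2) I - A_3 J$, where $J$ is the all-ones matrix, so its eigenvalues are $A_1 + c_a A_2 - m A_3$ (once, for the uniform eigenvector) and $A_1 + c_a A_2$ (with multiplicity $m - 1$). A numerical check with illustrative parameter values (not taken from the article):

```python
import numpy as np

# Illustrative constants standing in for A1, A2, A3 and the common correlation c_a.
A1, A2, A3, ca = 0.1, 1.0, 0.05, 0.6
m = 5

d = A1 + ca * A2                       # common diagonal part (before the -A3)
M = d * np.eye(m) - A3 * np.ones((m, m))   # the matrix inside equation B.7

lam = np.sort(np.linalg.eigvalsh(M))
print(lam)   # one eigenvalue d - m*A3, and d repeated m - 1 times
```

The isolated eigenvalue $A_1 + c_a A_2 - m A_3$ belongs to the uniform mode, which is why condition B.16 below singles it out.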
We next consider the case in which all the input correlations are different, as represented in equation 3.31. For simplicity, the differences between the correlations are assumed to be small so that equation B.5 approximately holds. Then the characteristic equation similar to equation B.7 reads
as follows:

$$\sum_{i=1}^{m} \frac{A_3}{A_1 + c_i A_2 - \lambda} = 1. \tag{B.8}$$
To get insight into the ranges of the eigenvalues, let us put

$$\Lambda(\lambda) \equiv \left( \sum_{i=1}^{m} \frac{1}{A_1 + c_i A_2 - \lambda} \right)^{-1}. \tag{B.9}$$
Then the eigenvalues are the solutions of $\Lambda(\lambda) = A_3$. For $\Lambda(\lambda)$, there exist $x_1 < x_2 < \cdots < x_{m-1}$ such that:

$$\Lambda(\lambda) \to \infty \quad \text{as } \lambda \to x_i + 0 \text{ or } \lambda \to -\infty, \tag{B.10}$$
$$\Lambda(\lambda) \to -\infty \quad \text{as } \lambda \to x_i - 0 \text{ or } \lambda \to \infty, \tag{B.11}$$
$$\frac{d}{d\lambda} \Lambda(\lambda) < 0, \quad \lambda \ne x_i, \ 1 \le i \le m - 1, \tag{B.12}$$
$$\Lambda(A_1 + c_i A_2) = 0, \quad 1 \le i \le m. \tag{B.13}$$
Therefore, for positive $A_3$, we obtain

$$A_1 + c_1 A_2 > \lambda_1 > A_1 + c_2 A_2 > \lambda_2 > \cdots > A_1 + c_m A_2 > \lambda_m. \tag{B.14}$$
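This interlacing can be checked directly, since equation B.8 is the secular equation of the matrix $\mathrm{diag}(A_1 + c_i A_2) - A_3 J$, with $J$ the all-ones matrix. A numerical sketch with illustrative values of $A_1$, $A_2$, $A_3$ and distinct correlations $c_i$ (none taken from the article):

```python
import numpy as np

# Illustrative constants; c is sorted in decreasing order as in equation B.14.
A1, A2, A3 = 0.1, 1.0, 0.05
c = np.array([0.9, 0.6, 0.3])

d = A1 + c * A2                        # the "pole" locations A1 + c_i A2
M = np.diag(d) - A3 * np.ones((3, 3))  # matrix whose secular equation is B.8

lam = np.sort(np.linalg.eigvalsh(M))[::-1]   # eigenvalues, largest first

# Interlacing of equation B.14: d_1 > lam_1 > d_2 > lam_2 > d_3 > lam_3.
print(d)
print(lam)
```

The rank-one perturbation $-A_3 J$ pushes every eigenvalue strictly below the corresponding pole, which is exactly the pattern stated in equation B.14.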
The conditions for the synaptic competition are again generalized versions of equations 3.24 and 3.25, expressed as follows:

$$A_1 + c_i A_2 > 0, \quad 1 \le i \le m - 1, \tag{B.15}$$
$$A_1 + c_m A_2 - m A_3 < 0. \tag{B.16}$$
Combining equations B.14 and B.15 yields

$$\lambda_1 > \lambda_2 > \cdots > \lambda_{m-1} > 0. \tag{B.17}$$
With equation B.16, $\lambda_m < 0$ is assured by $\Lambda(0) < A_3$.
$\lambda_2 = \lambda_4$. Then the $n_2$-dimensional dominant eigenspace is spanned by $(1, -1, 0, 0, \ldots, 0, 0)^t$, $(0, 0, 1, -1, 0, 0, \ldots, 0, 0)^t, \ldots, (0, 0, \ldots, 0, 0, 1, -1)^t$.

Acknowledgments

We thank W. Gerstner for carefully reading and discussing the manuscript. We also thank T. Aihara, M. Watanabe, S. Katori, and H. Fujii for their helpful comments. This study is supported by the Japan Society for the Promotion of Science and also by the Advanced and Innovational Research Program in Life Sciences and a Grant-in-Aid on priority areas (C), Advanced Brain Science Project, from the Ministry of Education, Culture, Sports, Science, and Technology, the Japanese Government.

References

Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3, 1178–1183.

Abbott, L. F., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing and neuronal response variability. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 69–75). Cambridge, MA: MIT Press.

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.

Aertsen, A., Erb, M., & Palm, G. (1994). Dynamics of functional coupling in the cerebral cortex: An attempt at a model-based interpretation. Physica D, 75, 103–128.

Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.

Araki, O., & Aihara, K. (2001). Dual information representation with stable firing rates and chaotic spatiotemporal spike patterns in a neural network model. Neural Computation, 13, 2799–2822.

Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. USA, 94, 5411–5416.

Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.

Câteau, H., & Fukai, T. (2003). A stochastic method to predict the consequence of arbitrary forms of spike-timing-dependent plasticity. Neural Computation, 15, 597–620.

de Oliveira, S. C., Thiele, A., & Hoffmann, K.-P. (1997). Synchronization of neuronal activity during stimulus expectation in a direction discrimination task. J. Neurosci., 17(23), 9248–9260.

Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.

Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9(8), 1303–1350.

Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–89.

Gerstner, W. (2001). Coding properties of spiking neurons: Reverse and cross-correlations. Neural Networks, 14, 599–610.

Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.

Gerstner, W., & Kistler, W. M. (2002a). Mathematical formulations of Hebbian learning. Biol. Cybern., 87, 404–415.

Gerstner, W., & Kistler, W. M. (2002b). Spiking neuron models. Cambridge: Cambridge University Press.

Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.

Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neurosci., 23(9), 3697–3714.

Horn, D., Levy, N., Meilijson, I., & Ruppin, E. (2000). Distributed synchrony of spiking neurons in a Hebbian cell assembly. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 129–135). Cambridge, MA: MIT Press.

Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59(4), 4498–4514.

Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2741.

Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and postsynaptic action potentials. Neural Computation, 12, 385–405.

Krüger, J., & Becker, D. (1991). Recognizing the visual stimulus from neuronal discharges. Trends in Neurosciences, 14(7), 282–286.
Kuhn, A., Aertsen, A., & Rotter, S. (2003). Higher-order statistics of input ensembles and the response of simple model neurons. Neural Computation, 15, 67–101.

Kuhn, A., Rotter, S., & Aertsen, A. (2002). Correlated input spike trains and their effects on the response of the leaky integrate-and-fire neuron. Neurocomputing, 44–46, 121–126.

Levy, N., Horn, D., Meilijson, I., & Ruppin, E. (2001). Distributed synchrony in a cell assembly of spiking neurons. Neural Networks, 14, 815–824.

Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.

Masuda, N., & Aihara, K. (2002a). Spatio-temporal spike encoding of a continuous external signal. Neural Computation, 14, 1599–1628.

Masuda, N., & Aihara, K. (2002b). Bridging rate coding and temporal spike coding by effect of noise. Phys. Rev. Lett., 88(24), 248101.

Masuda, N., & Aihara, K. (2003). Duality of rate coding and temporal spike coding in multilayered feedforward networks. Neural Computation, 15, 103–125.

Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M., & Kato, K. (2000). Calcium stores regulate the polarity and input specificity of synaptic modification. Nature, 408, 584–588.

Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Computation, 13, 2221–2237.

Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differently involved in motor cortical function. Science, 278, 1950–1953.

Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86(2), 364–367.

Sakurai, Y. (1999). How do cell assemblies encode information in the brain? Neuroscience and Biobehavioral Reviews, 23, 785–796.

Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20(16), 6193–6209.

Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nature Reviews Neuroscience, 2, 539–550.

Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.

Sompolinsky, H., Golomb, D., & Kleinfeld, D. (1990). Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. USA, 87, 7200–7204.

Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike timing-dependent plasticity. Neuron, 32, 339–350.

Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3(9), 919–926.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Computation, 13, 2005–2029.

Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.

Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neurosci., 20, RC50 1–5.

Vaadia, E., Haalman, L., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.

van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20(23), 8812–8821.

van Rossum, M. C. W., Turrigiano, G. G., & Nelson, S. B. (2002). Fast propagation of firing rates through layered networks of noisy neurons. J. Neurosci., 22(5), 1956–1966.

von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biol. Cybern., 54, 29–40.

Watanabe, M., Aihara, K., & Kondo, S. (1998). A dynamic neural network with temporal coding and functional connectivity. Biol. Cybern., 78, 87–93.

Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received March 4, 2003; accepted September 8, 2003.
Communicated by Anthony Bell
NOTE
The Shape of Neural Dependence Rick L. Jenison
[email protected] Richard A. Reale
[email protected] Departments of Psychology and Physiology and the Waisman Center, University of Wisconsin–Madison, Madison, WI 53706, U.S.A.
The product-moment correlation coefficient is often viewed as a natural measure of dependence. However, this equivalence applies only in the context of elliptical distributions, most commonly the multivariate gaussian, where linear correlation indeed sufficiently describes the underlying dependence structure. Should the true probability distributions deviate from those with elliptical contours, linear correlation may convey misleading information on the actual underlying dependencies. It is often the case that probability distributions other than the gaussian distribution are necessary to properly capture the stochastic nature of single neurons, which as a consequence greatly complicates the construction of a flexible model of covariance. We show how arbitrary probability densities can be coupled to allow greater flexibility in the construction of multivariate neural population models.
Neural population coding of events in the world is thought to involve coordinated activity among neurons. In delineating the structure of this dependence among neurons in the brain, it is the time pattern of discharges from individual cells that represents the variables or marginal dimensions under study. A common misperception is that any marginal probability distribution can be used to construct a parametric model of this dependence among variables. The most commonly used measure of dependence, the Pearson product-moment correlation coefficient, was developed on the basis of normal marginals and addresses only linear dependence (Yule, 1897; Mari & Kotz, 2001). However, a new and innovative approach, the so-called copula method, provides the ability to couple arbitrary marginal densities (Sklar, 1959; Joe, 1997; Nelsen, 1999). The word copula is a Latin noun that refers to a bond and is used in linguistics to refer to a proposition that links a subject and predicate. In probability theory, it couples marginal distributions to form flexible multivariate distribution functions. The appeal of the copula is that we eliminate the implied reliance on the multivariate gaussian or the assumption that dimensions are independent. Neural Computation 16, 665–672 (2004)
© 2004 Massachusetts Institute of Technology
Recently, Nirenberg, Carcieri, Jacobs, and Latham (2001) indirectly estimated the relative contribution of neural dependence to mutual information, which accounted for about 10% of the information transmitted. Pola, Thiele, Hoffmann, and Panzeri (2003) derived a method to directly estimate the mutual information on the entire response joint probability distribution. In both of these reports, as well as in earlier studies of neural information, the shape of the dependence structure is generally ignored and considered to be absorbed into the joint probability density. In this communication, we show how the conditional response entropy can be factored into separate independent and dependent contributions to transmitted sensory information. These factors take the form of conditional differential entropies, and therefore the structure and impact of the dependence can be computed directly rather than indirectly. Sklar's theorem (Sklar, 1959) states that any multivariate distribution can be expressed as the copula function $C[u_1, u_2, \ldots, u_N]$ evaluated at each of the marginal distributions. By virtue of the probability integral transform (Casella & Berger, 1990), each marginal $u_i = F_i(x_i)$ has a uniform distribution on $[0, 1]$, where $F_i(x_i)$ is the cumulative integral of $p_i[x_i]$ for the random variables $X_i$. The joint probability density follows as

$$p[x_1, x_2, \ldots, x_N] = \prod_{i=1}^{N} p_i[x_i] \times c[u_1, u_2, \ldots, u_N], \tag{1}$$
where $p_i[x_i]$ is each marginal density and coupling is provided by $c[u_1, u_2, \ldots, u_N] = \partial^N C[u_1, u_2, \ldots, u_N] / (\partial u_1 \partial u_2 \cdots \partial u_N)$, which is itself a probability density. When the random variables are independent, the copula density $c[u_1, u_2, \ldots, u_N]$ is identically equal to one. The importance of equation 1 is that the independent portion, reflected as the product of the marginals, can be separated from the function $c[u_1, u_2, \ldots, u_N]$ describing the dependence structure or shape. There are several families of copulas, some of which are particularly descriptive of the dependence structure observed among the discharge patterns of sensory neurons. The choice of copula is a form of model selection that can be formalized using information criteria (Soofi, 2000) such as the Akaike information criterion (AIC) (Akaike, 1974). The AIC penalizes the negative log maximum likelihood of the estimated model by the number of parameters in the model [−2 log (maximum likelihood) + 2 (number of parameters)]. A smaller relative AIC represents a better model fit while taking into account the complexity of the model. We have shown previously the importance of considering the impact of correlated noise on the acuity of sound localization by an ensemble of auditory cortical neurons in the cat (Jenison, 2000). Figure 1A shows a scatter plot of the timing (latency) to the first evoked spike following a sound presented at a particular location, $\theta$, in space for a pair of neurons recorded from separate electrodes in the primary auditory (AI) field of an anesthetized cat.
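The factorization of equation 1 is easy to illustrate. The sketch below uses the bivariate Ali-Mikhail-Haq copula, $C[u_1, u_2] = u_1 u_2 / (1 - \alpha(1 - u_1)(1 - u_2))$, with inverse-gaussian marginals, mirroring the construction used in this note; the copula density is obtained here by numerical mixed differentiation rather than from its closed form, and all parameter values are illustrative, not fitted values from the recordings:

```python
import numpy as np
from scipy.stats import invgauss

def amh_copula(u1, u2, alpha):
    """Bivariate Ali-Mikhail-Haq copula C[u1, u2]."""
    return u1 * u2 / (1.0 - alpha * (1.0 - u1) * (1.0 - u2))

def copula_density(u1, u2, alpha, h=1e-5):
    """c[u1, u2] = d^2 C / (du1 du2), via central finite differences."""
    return (amh_copula(u1 + h, u2 + h, alpha) - amh_copula(u1 + h, u2 - h, alpha)
            - amh_copula(u1 - h, u2 + h, alpha) + amh_copula(u1 - h, u2 - h, alpha)) / (4 * h * h)

def joint_pdf(x1, x2, alpha, mu1=0.5, mu2=0.7):
    """Equation 1: product of the marginals times the copula density.

    invgauss marginals stand in for the skewed first-spike-latency densities;
    mu1 and mu2 are illustrative shape parameters."""
    u1, u2 = invgauss.cdf(x1, mu1), invgauss.cdf(x2, mu2)
    return invgauss.pdf(x1, mu1) * invgauss.pdf(x2, mu2) * copula_density(u1, u2, alpha)

# alpha = 0 recovers independence: the copula density is identically 1.
print(copula_density(0.3, 0.6, alpha=0.0))   # ~1.0
print(joint_pdf(0.4, 0.8, alpha=0.95))       # a dependent joint density value
```

The dependence shape thus lives entirely in `copula_density`, while the marginals can be chosen to match each neuron individually.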
The Shape of Neural Dependence
667
Figure 1: (A) Joint conditional probability density of first-spike latency recorded from two cortical neurons in response to a given sound source in space. Gray contours correspond to estimated iso-densities based on the AMH copula. The estimated single parameter α controls the shape of the copula, here α = 0.95. The inset shows conditional cross sections of the joint density based on the copula (gray) and estimated multivariate gaussian (black). (B) AMH copula density c[u_1, u_2 | θ] estimated from the data shown in A. (C) AMH copula conditional entropy as a function of dependence and neural ensemble size (N). Multiple integrals were evaluated numerically using quasi-Monte Carlo integration.
668
R. Jenison and R. Reale
These two neurons are representative of a common class of field AI cells described by their spatial sensitivity to free-field (actual or virtual) sound sources. In this regard, these AI neurons exhibited broad spatial receptive fields when measured with directionally dependent stimuli even at the low intensity levels of the sound source (Brugge, Reale, & Hind, 1996). Furthermore, their response latency was tightly locked to stimulus onset, and systematic gradients in latency were evident within the receptive field. Models have been successfully developed that capture this systematic response latency variability due to both sound-source direction and sound-source intensity (Jenison, Reale, Hind, & Brugge, 1998; Reale, Jenison, & Brugge, 2003). The joint probability density contours (gray) reflect maximum-likelihood-fit inverse-gaussian (IG) marginal probability densities (Jenison & Reale, 2003) coupled by the Ali-Mikhail-Haq (AMH) copula (see appendix A). The AMH copula is parameterized by a dependence measure α, which for the general multivariate case ranges between 0 and 1. The fit shown in Figure 1 yields α = 0.95 with an AIC equal to 3642. In comparison, for this data set, the AIC for independent IG marginals is 3728, and the multivariate gaussian AIC is 3766. This figure shows two common characteristics of the joint response of auditory cortical neurons. First, the marginal distributions are positively skewed, unlike the gaussian distribution, and second, the dependence structure is not elliptical, which is also inconsistent with a multivariate gaussian density. The inset shows comparisons of this copula-based joint density and the corresponding multivariate gaussian fit conditioned on several levels of first-spike latencies from neuron 2. The estimated AMH copula density c[u_1, u_2] is shown in Figure 1B, which can be viewed as a function that modulates the product density to form the joint probability density function as defined by equation 1.
In this case, the AMH copula density narrows in the lower tails (near zero) and broadens in the upper tails (near one). This is a characteristic of the AMH copula and relates to our general observations of AI cortical neurons, where the strength of association between ensemble neurons is greater for shorter first-spike latencies compared to responses of longer latencies. When a pair of neurons fires in response to a particular location in space, the shorter latencies (which reflect a strong response) are more consistent within the ensemble; longer latencies (weaker responses) tend to be less consistent. This analysis indicates not only that a multivariate gaussian density is no longer a necessary assumption in delineating the structure of dependence among neurons, but that employing a multivariate gaussian density with these data would actually provide misleading information, particularly for the behavior of the tail regions that greatly influence estimation. An important outcome of this improvement is that an ideal observer analysis of sound location using responses from AI neurons can now employ parametric probability models using the proper dependence structure and marginals (Jenison, 2000; Jenison & Reale, 2003).
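The Archimedean construction behind the AMH copula (detailed in appendix A) is easy to verify numerically. The sketch below is a hypothetical mini-implementation, not the authors' code: it builds the copula from the additive generator and checks the result against the closed form of equation A.2.

```python
import math

def phi(u, alpha):
    """AMH additive generator: phi(u; alpha) = log[(1 - alpha*(1 - u)) / u]."""
    return math.log((1.0 - alpha * (1.0 - u)) / u)

def phi_inv(y, alpha):
    """Inverse generator: phi_inv(y; alpha) = (1 - alpha) / (exp(y) - alpha)."""
    return (1.0 - alpha) / (math.exp(y) - alpha)

def amh_copula_generator(us, alpha):
    """Archimedean construction (equation A.1): C = phi_inv(sum_i phi(u_i))."""
    return phi_inv(sum(phi(u, alpha) for u in us), alpha)

def amh_copula_closed(us, alpha):
    """Closed form (equation A.2): (1-alpha) / (prod_i (1-alpha+alpha*u_i)/u_i - alpha)."""
    prod = 1.0
    for u in us:
        prod *= (1.0 - alpha + alpha * u) / u
    return (1.0 - alpha) / (prod - alpha)

us = [0.3, 0.7]
print(amh_copula_generator(us, 0.95), amh_copula_closed(us, 0.95))  # agree
print(amh_copula_generator(us, 0.0))  # alpha = 0 gives the product 0.3 * 0.7
```

At α = 0 the generator reduces to −log u, and the construction collapses to the independence copula ∏ u_i, matching the statement that c ≡ 1 under independence.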
Shannon information can provide a measure of the information available for localizing a sound source in space transmitted on average by a given ensemble of N neurons. Mutual information between stimulus and response is based on Shannon's entropy, which can be interpreted as the degree to which the total response entropy is reduced by the entropy conditioned on a particular stimulus setting θ. Let this conditional entropy be H[x_1, x_2, ..., x_N | θ], where x_i is the response of the ith neuron. Using the copula density, it is straightforward to show that the conditional entropy can be split into two terms: the sum of entropies due to the independent components of the conditional density and the quantity due strictly to the dependence structure defined by the copula (see appendix B). This can be expressed as

H[x_1, x_2, ..., x_N | θ] = Σ_{i=1}^{N} H[x_i | θ] + H_c[u_1, u_2, ..., u_N | θ],   (2)
where the copula entropy is

H_c[u_1, u_2, ..., u_N | θ] = −∫_{[0,1]^N} c[u_1, u_2, ..., u_N | θ] log c[u_1, u_2, ..., u_N | θ] du.   (3)
It follows as a consequence of equation 2 that the copula entropy must be mathematically equivalent to the negative of the mutual information between neurons, but with the benefit of being computed directly from the dependence structure. Figure 1C shows the reduction in the conditional response entropy due to the AMH copula as a function of dependence and parameterized by ensemble size N. The entropy of the copula is at its maximum at zero dependence, and the copula entropy declines as a function of dependence and ensemble size. So although the decline is rather modest for a pair of neurons, as suggested by Nirenberg et al. (2001), it accelerates here as the ensemble size increases. The importance of this separability is that it is generally complicated, and often intractable, to parametrically model the joint probability density if it is nongaussian. When independence is assumed and the product density is based on only the marginal densities, the joint probability is easy to compute. The spike latency joint probability density can now be constructed using the marginal densities coupled by the separately estimated copula density function, which greatly simplifies the construction. The gaussian assumption is no longer a necessary constraint, and this newly found freedom allows for much greater flexibility in computational modeling studies of the brain. Finally, given the separability, the dependence entropy can be computed directly to assess the impact of the shape of neural dependence on sensory coding.
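The decomposition of equation 2 can be checked in closed form for a copula whose entropy is elementary. The sketch below uses the gaussian copula purely as a stand-in for the AMH family (my choice for illustration, not the paper's): for a bivariate gaussian, the copula entropy is (1/2) ln(1 − ρ²), which is exactly the negative of the mutual information.

```python
import math

def gaussian_entropy(sigma):
    # Differential entropy of a univariate gaussian: (1/2) ln(2*pi*e*sigma^2).
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

def bivariate_gaussian_entropy(s1, s2, rho):
    # H[x1, x2] = (1/2) ln((2*pi*e)^2 * s1^2 * s2^2 * (1 - rho^2)).
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 2
                          * s1**2 * s2**2 * (1.0 - rho**2))

def gaussian_copula_entropy(rho):
    # H_c = H[x1, x2] - H[x1] - H[x2] = (1/2) ln(1 - rho^2) = -(mutual info).
    return 0.5 * math.log(1.0 - rho**2)

s1, s2, rho = 1.3, 0.7, 0.6
lhs = bivariate_gaussian_entropy(s1, s2, rho)
rhs = gaussian_entropy(s1) + gaussian_entropy(s2) + gaussian_copula_entropy(rho)
print(lhs, rhs)  # identical: the dependence term is exactly the copula entropy
```

The copula entropy is zero at ρ = 0 and strictly negative otherwise, mirroring the behavior plotted in Figure 1C for the AMH family.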
Appendix A: Ali-Mikhail-Haq (AMH) Copula

The AMH copula distribution (Ali, Mikhail, & Haq, 1978) is an Archimedean copula (Nelsen, 1999) that can be constructed by an additive generator function φ and its inverse φ^{−1}, if it exists, using the following form:

C_Archimedean[u_1, u_2, ..., u_N; α] = φ^{−1}[ Σ_{i=1}^{N} φ[u_i; α] ].   (A.1)

The generator function for the AMH copula is φ[u; α] = log[(1 − α(1 − u))/u], which has an inverse φ^{−1}[y; α] = (1 − α)/(e^y − α), and yields

C_AMH[u_1, u_2, ..., u_N; α] = (α − 1) / (α − ∏_{i=1}^{N} (1 − α + α u_i)/u_i),   (A.2)

where u_i = F_i(x_i) = ∫_{−∞}^{x_i} p_i(t) dt is the marginal cumulative distribution and α is the measure of dependence, with 0 ≤ α < 1. An Archimedean n-dimensional copula is a proper distribution if and only if the inverse generator function φ^{−1} is completely monotonic (Schweizer & Sklar, 1983).

Appendix B: Copula Differential Entropy

The joint entropy can be factorized into the sum of entropies due to each marginal density and the entropy of the copula (dependence structure):
H[x_1, x_2, ..., x_N] = Σ_{i=1}^{N} H[x_i] + H_c[u_1, u_2, ..., u_N]   (B.1)

−∫_{R^N} p[x_1, x_2, ..., x_N] log p[x_1, x_2, ..., x_N] dx
  = −∫_{R^N} ∏_{i=1}^{N} p_i[x_i] × c[u_1, u_2, ..., u_N] log{ ∏_{i=1}^{N} p_i[x_i] × c[u_1, u_2, ..., u_N] } dx
  = −∫_{R^N} ∏_{i=1}^{N} p_i[x_i] c[u_1, u_2, ..., u_N] { Σ_{i=1}^{N} log p_i[x_i] + log c[u_1, u_2, ..., u_N] } dx.   (B.2)
Integrate both sides of the equation over the unit hypercube:

−∫_{[0,1]^N} ∫_{R^N} p[x_1, x_2, ..., x_N] log p[x_1, x_2, ..., x_N] dx du
  = −∫_{[0,1]^N} ∫_{R^N} ∏_{i=1}^{N} p_i[x_i] c[u_1, u_2, ..., u_N] { Σ_{i=1}^{N} log p_i[x_i] + log c[u_1, u_2, ..., u_N] } dx du   (B.3)

−∫_{R^N} p[x_1, x_2, ..., x_N] log p[x_1, x_2, ..., x_N] dx
  = −Σ_{i=1}^{N} ∫ p_i[x_i] log p_i[x_i] dx_i − ∫_{[0,1]^N} c[u_1, u_2, ..., u_N] log c[u_1, u_2, ..., u_N] du   (B.4)

∴ H[x_1, x_2, ..., x_N] = Σ_{i=1}^{N} H[x_i] + H_c[u_1, u_2, ..., u_N].   (B.5)
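The copula entropy of equation 3 is easy to approximate numerically in low dimensions. The following sketch handles the bivariate AMH case, with simple midpoint-rule quadrature standing in for the quasi-Monte Carlo integration used for Figure 1C; the closed-form bivariate density is a textbook result not written out in this article.

```python
import math

def amh_density(u, v, alpha):
    """Bivariate AMH copula density (textbook closed form)."""
    d = 1.0 - alpha * (1.0 - u) * (1.0 - v)
    num = 1.0 + alpha * (u + v + u * v - 2.0) - alpha**2 * (u + v - u * v - 1.0)
    return num / d**3

def copula_entropy(alpha, n=400):
    """Midpoint-rule approximation of -int c log c du dv over the unit square."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        for j in range(n):
            v = (j + 0.5) * h
            c = amh_density(u, v, alpha)
            total -= c * math.log(c) * h * h
    return total

print(copula_entropy(0.0))   # zero at independence (c is identically 1)
print(copula_entropy(0.95))  # negative: dependence reduces conditional entropy
```

As Figure 1C illustrates for N = 2, the entropy is maximal (zero) at α = 0 and declines as dependence grows, consistent with its identity as the negative mutual information.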
Acknowledgments

This work was supported by National Institutes of Health grant DC-03554.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Auto. Control, 19, 723.
Ali, M. M., Mikhail, N. N., & Haq, M. S. (1978). A class of bivariate distributions including the bivariate logistic. J. Multivariate Anal., 8, 405–412.
Brugge, J. F., Reale, R. A., & Hind, J. E. (1996). The structure of spatial receptive fields of neurons in primary auditory cortex of the cat. J. Neurosci., 16, 4420–4437.
Casella, G., & Berger, R. L. (1990). Statistical inference. North Scituate, MA: Duxbury Press.
Jenison, R. L. (2000). Correlated cortical populations can enhance sound localization performance. J. Acoust. Soc. Am., 107, 414–421.
Jenison, R. L., & Reale, R. A. (2003). Likelihood approaches to sensory coding in auditory cortex. Network: Comp. Neural Sys., 14, 83–102.
Jenison, R. L., Reale, R. A., Hind, J. E., & Brugge, J. F. (1998). Modeling of auditory spatial receptive fields with spherical approximation functions. J. Neurophysiol., 80, 2645–2656.
Joe, H. (1997). Multivariate models and dependence concepts. London: Chapman & Hall.
Mari, D. D., & Kotz, S. (2001). Correlation and dependence. London: Imperial College Press.
Nelsen, R. B. (1999). An introduction to copulas. New York: Springer-Verlag.
Nirenberg, S., Carcieri, S. M., Jacobs, A. L., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Pola, G., Thiele, A., Hoffmann, K. P., & Panzeri, S. (2003). An exact method to quantify the information transmitted by different mechanisms of correlational coding. Network: Comp. Neural Sys., 14, 35–60.
Reale, R. A., Jenison, R. L., & Brugge, J. F. (2003). Directional sensitivity of neurons in the primary auditory (AI) cortex: Effects of sound-source intensity level. J. Neurophysiol., 89, 1024–1038.
Schweizer, B., & Sklar, A. (1983). Probabilistic metric spaces. Amsterdam: North-Holland.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Stat. Univ. Paris, 8, 229–231.
Soofi, E. S. (2000). Principal information theoretic approaches. J. Am. Stat. Assoc., 95, 1349–1353.
Yule, G. U. (1897). On the theory of correlation. J. Roy. Statist. Soc., 60, 812–854.

Received May 29, 2003; accepted September 16, 2003.
LETTER
Communicated by Bard Ermentrout
On the Phase Reduction and Response Dynamics of Neural Oscillator Populations

Eric Brown
[email protected]

Jeff Moehlis*
[email protected]

Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544, U.S.A.
Philip Holmes
[email protected]
Program in Applied and Computational Mathematics and Department of Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ 08544, U.S.A.
We undertake a probabilistic analysis of the response of repetitively firing neural populations to simple pulselike stimuli. Recalling and extending results from the literature, we compute phase response curves (PRCs) valid near bifurcations to periodic firing for Hindmarsh-Rose, Hodgkin-Huxley, FitzHugh-Nagumo, and Morris-Lecar models, encompassing the four generic (codimension one) bifurcations. Phase density equations are then used to analyze the role of the bifurcation, and the resulting PRC, in responses to stimuli. In particular, we explore the interplay among stimulus duration, baseline firing frequency, and population-level response patterns. We interpret the results in terms of the signal processing measure of gain and discuss further applications and experimentally testable predictions.

1 Introduction

This letter seeks to add to our understanding of how the firing rates of populations of neural oscillators respond to pulselike stimuli representing sensory inputs and to connect this to mechanisms of neural computation and modulation. In particular, we study how responses depend on oscillator type (classified by its bifurcation to periodic firing), baseline firing rate of the population, and duration of the input. As in, e.g., Fetz and Gustaffson (1983) and Herrmann and Gerstner (2001), our results also apply to the interpretation of peri-stimulus time histograms (PSTHs), which represent averages over ensembles of independent neuronal recordings.

* Current address: Department of Mechanical and Environmental Engineering, University of California, Santa Barbara, CA 93106, U.S.A.
Neural Computation 16, 673–715 (2004)
© 2004 Massachusetts Institute of Technology
674
E. Brown, J. Moehlis, and P. Holmes
We are motivated by attempts to understand different responses, in the form of PSTHs of spike rates in the brainstem organ locus coeruleus, of monkeys performing target identification and other tasks (Usher, Cohen, Servan-Schreiber, Rajkowsky, & Aston-Jones, 1999; Brown, Moehlis, Holmes, Clayton, Rajkowski, & Aston-Jones, 2003b), but there are many other situations in which populations of spiking neurons are reset by stimuli. For example, the multiple oscillator and beat frequency models of interval timing of Meck et al. (Matell & Meck, 2000) involve cortical oscillators of differing frequencies, and the 40 Hz synchrony reported by Gray and Singer and by Eckhorn et al. (see Gray, 2000, and Eckhorn, 1999, for reviews) also suggests the onset of coherent oscillations in visual cortex. For most neuron models, we find that the response of populations to a fixed stimulus current scales inversely with the prestimulus baseline firing rate of the population. While the firing rates of individual neurons also display this inverse relationship (encoded in their f–I curves; Rinzel & Ermentrout, 1998), the scaling of the population response differs from that of individual neurons. This effect suggests a possible role of baseline firing rate in cognitive processing by neural populations: decreasing baseline firing rates (via reduced inputs from other brain areas or via neuromodulators; e.g., Usher et al., 1999; Aston-Jones, Rajkowski, & Cohen, 2000; Aston-Jones, Chen, Zhu, & Oshinsky, 2001) can adjust the "fraction" of an incoming stimulus that is passed on to the next processing module. Recent data from the brainstem nucleus locus coeruleus (LC), for example, reflect this pattern: greater responsivity and better cognitive performance are both correlated with slower baseline firing rates (Aston-Jones, Rajkowski, Kubiak, & Alexinsky, 1994; Usher et al., 1999; Brown et al., 2003b).
We also find that for certain common neuron models, the maximum population response to a step stimulus of fixed strength can occur only (if it occurs at all) after stimulus removal. Moreover, in all cases, there are resonant stimulus durations for which there is no poststimulus response. Thus, the magnitude and timing of the maximal population response depend strongly on both neuron type and stimulus duration relative to the baseline period. This letter is organized as follows. Section 2 discusses phase reduction techniques for ordinary differential equations with attracting limit cycles. In the following section, we recall and compute phase response curves for familiar neuron models near the four codimension one bifurcations to periodic firing, using normal forms and numerical calculations (Ermentrout, 2002). These two sections review part of the broad literature on the topic and provide new results: PRCs valid near degenerate Hopf and homoclinic bifurcations and the scaling of PRCs with the frequency of the neurons from which they are derived. Section 4 then analyzes firing probabilities in response to simple stimuli, enabling us to predict spike histograms, describe their dependence on parameters characterizing the stimuli and neuron type, and emphasize similarities and differences among the responses of different models. These results are summarized in six roman-numbered boldface statements.
Phase Reduction and Response Dynamics
675
Section 5 interprets these results in terms of the gain, or signal amplification, of neural populations. Section 6 closes the letter with comments on further applications and possible experimental tests. Both phase reduction methods and population modeling have a rich history, including numerous applications in neuroscience. The classical phase coordinate transformation used in this article originated at least by 1949 (Malkin, 1949), with the complementary asymptotic phase ideas expanded in, among others, Coddington and Levinson (1955), Winfree (1974, 2001), and Guckenheimer (1975) and applied in Ermentrout and Kopell (1984, 1990, 1991), Hansel, Mato, and Meunier (1993, 1995), van Vreeswijk, Abbott, and Ermentrout (1994), Ermentrout (1996), Hoppensteadt and Izhikevich (1997), Kuramoto (1997), Kim and Lee (2000), Bressloff and Coombes (2000), Izhikevich (2000b), Brown, Holmes, and Moehlis (2003a), and Lewis and Rinzel (2003); see also the related "spike response method" in Gerstner, van Hemmen, and Cowan (1996) and Gerstner and Kistler (2002) and references therein. Voltage density approaches, primarily undertaken in an integrate-and-fire framework involving reinjection boundary conditions and in some cases involving distributed conductances, are developed and applied in, among others, Stein (1965), Wilson and Cowan (1972), Fetz and Gustaffson (1983), Gerstner (2000), Nykamp and Tranchina (2000), Omurtag, Knight, and Sirovich (2000), Herrmann and Gerstner (2001), Casti et al. (2001), Brunel, Chance, Fourcaud, and Abbott (2001), Fourcaud and Brunel (2002), and Gerstner and Kistler (2002). In particular, density formulations derived from integrate-and-fire models (Fetz & Gustaffson, 1983; Herrmann & Gerstner, 2001) demonstrate the inverse relationship between peak firing rates and baseline frequency (for populations receiving pulsed stimuli) that we extend to other neuron models in this article. The work of Brunel et al.
(2001) and Fourcaud and Brunel (2002) focuses on the transmission of stimuli by noisy integrate-and-fire populations. It explains how components of incoming signals are shifted and attenuated (or amplified) when output as firing rates of the population, depending on the frequency of the signal component and the characteristics of noise in the population. Some of the conclusions of our article (for integrate-and-fire neurons only) could presumably be reconstructed from the Brunel et al. results by decomposing our stepped stimuli into Fourier components; however, simpler methods applicable to our noise-free case allow our different analytical insights into response properties. Experiments on population responses to applied stepped and fluctuating currents have also been performed, for example, by Mainen and Sejnowski (1995) in cortical neurons. Due to noise inherent in their biological preparations, responses to stepped, but not fluctuating, stimuli are gradually damped (cf. also Gerstner, 2000; Gerstner & Kistler, 2002); these effects are studied using a phase density approach by Ritt (2003). The phase density formulation is also used in Kuramoto (1984) and Strogatz (2000), where the emphasis is on coupling effects in populations with distributed frequencies, generally without external stimuli. The approach
closest to ours is that of Tass (1999), who focuses on how pulsed input signals can desynchronize populations of noisy, coupled phase oscillators that have clustered equilibrium states; of particular interest is the critical stimulus duration T_crit for which the maximum desynchronizing effect is achieved. By contrast, this letter focuses on synchronizing responses of independent noiseless oscillators (with uniform stationary distributions) and, using analytical solutions to this simpler problem, stresses the influence of individual neuron properties. Specifically, we contribute a family of simple expressions for time-dependent firing rates in response to pulsed stimuli, derived from different nonlinear oscillator models via phase reductions and the method of characteristics. Our expressions allow us to identify a series of novel relationships between population dynamics during and after stepped stimuli and the frequencies and bifurcation types of the individual neurons making up the population. As already mentioned, we consider only uncoupled (and noiseless) neurons, but we note that our results remain generally valid for weakly coupled systems. In particular, in Brown et al. (2003b), we show that for a noisy neural population with synaptic and electrotonic couplings sufficient to reproduce observed variations in experimental cross-correlograms, the uncoupled limit is adequate for understanding key first-order modulatory effects.

2 Phase Equations for Nonlinear Oscillators with Attracting Limit Cycles

2.1 Phase Reductions. Following, e.g., Ermentrout (1996, 2002), Hoppensteadt and Izhikevich (1997), Guckenheimer (1975), and Winfree (1974, 2001), we first describe a coordinate change to phase variables that will simplify the analysis to come. Our starting point is a general, conductance-based model of a single neuron:

C V̇ = [I_g(V, n) + I_b + I(V, t)],   (2.1)
ṅ = N(V, n);  (V, n)^T ∈ R^N.   (2.2)
Here, V is the voltage difference across the membrane, the (N − 1)-dimensional vector n comprises gating variables and I_g(V, n) the associated membrane currents, and C is the cell membrane capacitance. The baseline inward current I_b effectively sets oscillator frequency and will correspond below to a bifurcation parameter. I(V, t) represents synaptic currents from other brain areas due to stimulus presentation; below, we neglect reversal potentials so that I(V, t) = I(t). We write this equation in the general form

ẋ = F(x) + G(x, t);  x = (V, n)^T ∈ R^N,   (2.3)
where F(x) is the baseline vector field, G(x, t) is the stimulus effect, and T denotes transpose. In our simplification, G(x, t) = (I(t), 0)^T; in a more general setting, perturbations in the gating equations 2.2 could also be included. We assume that the baseline (G ≡ 0) neural oscillator has a normally hyperbolic (Guckenheimer & Holmes, 1983), attracting limit cycle γ. This persists under small perturbations (Fenichel, 1971), and hereafter we assume that such a limit cycle always exists for each neuron. The objective is to simplify equation 2.3 by defining a scalar phase variable θ(x) ∈ [0, 2π) for all x in some neighborhood U of γ (within its domain of attraction), such that the phase evolution has the simple form dθ(x)/dt = ω for all x ∈ U when G ≡ 0. Here, ω = 2π/T, where T is the period of equation 2.3 with G ≡ 0. From the chain rule, this requires

dθ(x)/dt = (∂θ/∂x)(x) · F(x) + (∂θ/∂x)(x) · G(x, t) = ω + (∂θ/∂x)(x) · G(x, t).   (2.4)
Equation 2.4 defines a first-order partial differential equation (PDE) that the scalar field θ(·) must satisfy. Using the classical techniques of isochrons (Winfree, 1974, 2001; Guckenheimer, 1975; Kuramoto, 1997; cf. Hirsch, Pugh, & Shub, 1977), the unique (up to a translational constant) solution θ(·) to this PDE can be constructed indirectly. Even after θ(·) has been found (see the next subsection), equation 2.4 is not a phase-only (and hence self-contained) description of the oscillator dynamics. However, evaluating the vector field at x^γ(θ), which we define as the intersection of γ and the θ(x) level set (i.e., isochron), we have

dθ(x)/dt = ω + (∂θ/∂x)(x^γ(θ)) · G(x^γ(θ), t) + E,   (2.5)
where E is an error term of O(|G|²), and the scalar |G| bounds G(x, t) over all components as well as over x and t (cf. Kuramoto, 1997). Dropping this error term, we may rewrite equation 2.5 as the one-dimensional phase equation,

dθ/dt = ω + (∂θ/∂x)(θ) · G(θ, t),   (2.6)
which is valid (up to the error term) in the whole neighborhood U of γ.

2.2 Computing the Phase Response Curve. In the case of equations 2.1 and 2.2, the only partial derivative we must compute to fully define equation 2.6 is with respect to voltage, and we define the phase response curve (PRC; Winfree, 2001) as (∂θ/∂V)(θ) ≡ z(θ). Then equation 2.6 becomes

dθ/dt = ω + z(θ) I(t) ≡ v(θ, t),   (2.7)
the population dynamics of which is the subject of this letter. Note that equation 2.7 neglects reversal potential effects for the various synapses that contribute to the net I(t): if these were included, I(t) would be replaced by I(θ, t). Furthermore, if G had nonzero components in more than just the voltage direction, we would need to compute a vector-valued PRC; each component of this could be computed in a similar manner to that below.

2.2.1 Direct Method. We now describe a straightforward and classical way to compute z(θ) that is useful in experimental, numerical, and analytical studies (Winfree, 1974, 2001; Glass & Mackey, 1988; Kuramoto, 1997). By definition,

z(θ) = lim_{ΔV→0} Δθ/ΔV,   (2.8)
where Δθ = θ(x^γ + (ΔV, 0)^T) − θ(x^γ) is the change in θ(x) resulting from a perturbation V → V + ΔV from the base point x^γ on γ (see Figure 1). Since θ̇ = ω everywhere in the neighborhood of γ, the difference Δθ is preserved under the baseline (G = 0) phase flow; thus, it may be measured in the limit as t → ∞, when the perturbed trajectory has collapsed back to the limit cycle γ. That is, z(θ) can be found by comparing the phases of solutions in the infinite-time limit starting on and infinitesimally shifted from base points on γ. This is the idea of asymptotic phase. This method will be used in section 3 to compute PRCs for the normal forms commonly arising in neural models.
Figure 1: The direct method for computing ∂θ/∂V at the point indicated by * is to take the limit of Δθ/ΔV for vanishingly small perturbations ΔV. One can calculate Δθ in the limit t → ∞, as discussed in the text. (The figure shows the limit cycle γ, the perturbation ΔV, the resulting phase shift Δθ, and level sets of θ(x).)
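The direct method can be demonstrated end to end on an oscillator whose isochrons are known exactly. For the planar system ṙ = r(1 − r²), θ̇ = ω (chosen here purely for illustration; it is not one of the neuron models above), the radial dynamics are independent of θ, so the isochrons are radial lines, the asymptotic phase is simply the polar angle, and the PRC works out to z(θ) = −sin θ. The finite-difference quotient of equation 2.8 recovers this:

```python
import math

def asymptotic_phase(v, w):
    # For r' = r(1 - r^2), theta' = omega, isochrons are radial lines, so the
    # asymptotic phase of a point (v, w) is its polar angle.
    return math.atan2(w, v)

def prc_direct(theta, dv=1e-6):
    # Direct method (equation 2.8): perturb the voltage-like coordinate V by dv
    # at a base point on the unit cycle and form the difference quotient.
    v, w = math.cos(theta), math.sin(theta)
    return (asymptotic_phase(v + dv, w) - asymptotic_phase(v, w)) / dv

for theta in (0.5, 1.5, 2.5):
    print(prc_direct(theta), -math.sin(theta))  # the two columns should agree
```

For a general conductance-based model the isochrons are not available in closed form, and the phase comparison must instead be made by integrating the perturbed and unperturbed trajectories forward until both have collapsed onto γ, exactly as described in the text.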
2.2.2 Other Methods. Another technique for finding (∂θ/∂V)(θ) involves solving the adjoint problem associated with equations 2.1 and 2.2 (Hoppensteadt & Izhikevich, 1997; Ermentrout & Kopell, 1991); this procedure is automated in the program XPP (Ermentrout, 2002) and is equivalent to the direct method discussed above. This equivalence, described in appendix A, is implicit in the calculation of coupling functions presented in Hoppensteadt and Izhikevich (1997) and Ermentrout (2002). The implementation of the adjoint method in XPP is used to compute the PRCs for full neuron models that are compared with normal form predictions later in this article. Since only the partial derivatives ∂θ/∂x evaluated on γ enter equation 2.7, and not the value of the phase function θ itself, it is tempting to compute these partial derivatives directly from equation 2.4. However, when viewed as an algebraic equation for the vector field ∂θ/∂x, equation 2.4 yields infinitely many solutions, being only one equation for the N unknown functions ∂θ/∂x_j, j = 1, ..., N. Some of these solutions are much easier to construct than the phase response curve computed via the direct or the adjoint method. However, for such a solution, which we write as ∂θ₂/∂x (≠ ∂θ/∂x) to distinguish it from partial derivatives of the asymptotic phase θ, there is not necessarily a corresponding phase variable θ₂ such that dθ₂(x)/dt = ω, x ∈ U (in the absence of stimulus): recall the uniqueness of the solution θ(x) to equation 2.4. (See appendix B for a specific coordinate change from the literature in this context.)
2.3 Validity of the Phase Reduction. We shall always assume that the phase flow θ̇ is nonnegative at the spike point θ_s ≡ 0; otherwise equation 2.7 does not make sense as a neuron model (neurons cannot cross "backwards" through the spike and regain a state from which they can immediately fire again). For oscillators giving PRCs z(θ) with z(θ_s) ≠ 0, this assumption restricts admissible perturbing functions I(t) (or, in the more general case of equation 2.6, G(x, t)) to those satisfying

I(t) z(θ_s) > −ω.   (2.9)
Thus, for z(θ_s) > 0, excitatory input (I(t) > 0) is always admissible, but there is a lower bound on the strength of inhibitory input for which phase reductions hold. In particular, if I(t) contains a noise component, it must be bounded below; this requires trimming the white (diffusive) or Ornstein-Uhlenbeck noise processes commonly used to model variability in synaptic inputs. These problems do not arise for continuous PRCs having z(θ_s) = 0. We note that z(θ_s) = 0 approximately holds for the Hodgkin-Huxley (HH) and Hindmarsh-Rose (HR) neurons to be considered below, and indeed holds for any neuron model with a fast vector field surrounding the spike tip x_s on the limit cycle. In this case, asymptotic phase changes very little in a small neighborhood near x_s, since θ = ωt, and only a short time is spent in the neighborhood. A small perturbation in the V direction therefore
takes trajectories to isochrons with similar values of θ, and so has little effect on asymptotic phase. For the integrate-and-fire systems investigated below, spikes are not explicitly modeled. While this may be viewed as an artificial omission leading to z(θ_s) ≠ 0, the population dynamics of such systems are of interest because they are in rather common use.

3 Phase Response Curves for Specific Models

In this section, we derive or recall analytical approximations to PRCs for multidimensional systems with limit cycles that arise in the four (local and global) codimension one bifurcations (Guckenheimer & Holmes, 1983): these are appropriate to conductance-based models of the form of equations 2.1 and 2.2. We then give PRCs for one-dimensional (linear) integrate-and-fire models. Of these PRC calculations, the results for the homoclinic and degenerate Hopf bifurcations are new, while the results for other models, previously derived as referenced in the text, are summarized and recast to display their frequency dependence and for application to population models in what follows.

3.1 Phase Response Curves Near Codimension One Bifurcations to Periodic Firing. Bifurcation theory (Guckenheimer & Holmes, 1983) identifies four codimension one bifurcations that can give birth to a stable limit cycle for generic families of vector fields: a SNIPER bifurcation (saddle-node bifurcation of fixed points on a periodic orbit), a supercritical Hopf bifurcation, a saddle-node bifurcation of limit cycles, and a homoclinic bifurcation (see Figure 2). All four bifurcation types have been identified in specific neuron models as a parameter, here the baseline inward current I_b, varies; for example, SNIPER bifurcations are found for type I neurons (Ermentrout, 1996) like the Connor model and its two-dimensional Hindmarsh-Rose (HR) reduction (Rose & Hindmarsh, 1989), supercritical Hopf bifurcations may occur for the abstracted FitzHugh-Nagumo (FN) model (Keener & Sneyd, 1998), a saddle-node bifurcation of limit cycles is found for the Hodgkin-Huxley (HH) model (Hodgkin & Huxley, 1952; Rinzel & Miller, 1980), and a homoclinic bifurcation can occur for the Morris-Lecar (ML) model (Rinzel & Ermentrout, 1998). In this section, we calculate or summarize PRCs for limit cycles arising from all four bifurcations. This is accomplished, where possible, through use of one- and two-dimensional normal form equations. Normal forms are obtained through center manifold reduction of equations 2.1 and 2.2 at the bifurcation, followed by a similarity transformation to put the linear part of the equation into Jordan normal form, and finally by successive near-identity nonlinear coordinate transformations to remove as many terms as possible, a process that preserves the qualitative dynamics of the system (Guckenheimer & Holmes, 1983). To obtain the PRC in terms of the original variables, that is, ∂θ/∂V, rather than in terms of the normal form variables
Phase Reduction and Response Dynamics
681
Figure 2: (a) SNIPER bifurcation. Two fixed points die in a saddle-node bifurcation at η = 0, giving a periodic orbit for η > 0, assumed to be stable. (b) Supercritical Hopf bifurcation. A fixed point loses stability as α increases through zero, giving a stable periodic orbit (closed curve). (c) Bautin bifurcation. See the text for details. At α = c²/4f there is a saddle-node bifurcation of periodic orbits. Both a stable (solid closed curve) and an unstable (dashed closed curve) periodic orbit exist for c²/4f < α < 0; the unstable periodic orbit dies in a subcritical Hopf bifurcation at α = 0. The fixed point is stable (resp., unstable) for α < 0 (resp., α > 0). (d) Homoclinic bifurcation. A homoclinic orbit exists at μ = 0, giving rise to a stable periodic orbit for μ > 0.
682
E. Brown, J. Moehlis, and P. Holmes
(which we henceforth denote (x, y)), with associated PRCs ∂θ/∂x and ∂θ/∂y, it is necessary to undo these coordinate transformations. However, since the normal form coordinate transformations affect only nonlinear terms, we obtain the simple relationship

∂θ/∂V = ν_x ∂θ/∂x + ν_y ∂θ/∂y + O(x, y),  (3.1)

where

ν_x = ∂x/∂V |_{x=y=0},  ν_y = ∂y/∂V |_{x=y=0}.
The remainder term in equation 3.1 is assumed to be small near the bifurcations of relevance and is neglected below. This introduces vanishing error in the Hopf case, in which the bifurcating periodic orbits have arbitrarily small radii; the same is true near SNIPER and homoclinic bifurcations, where periodic orbits spend arbitrarily large fractions of their period near the origin. When using the Bautin normal form, however, we must tacitly assume that the nonzero onset radius of stable bifurcating orbits is small; failure of this assumption for the HH model may contribute to the discrepancy between PRCs derived by analytical and numerical methods (see section 3.3).

Before proceeding, a few notes regarding the normal form equations that we will consider are in order. For the SNIPER bifurcation, we consider the normal form for a saddle-node bifurcation of fixed points, which must be properly embedded globally in order to capture the presence of the periodic orbit (the unstable branch of the center manifold must close up and limit on the saddle node; cf. Figure 2a). For the saddle-node bifurcation of periodic orbits, we appeal to the sequence of bifurcations for type II neurons such as the HH model (Hodgkin & Huxley, 1952): namely, a subcritical Hopf bifurcation in which an unstable branch of periodic orbits bifurcates from the rest state, turns around, and gains stability in a saddle-node bifurcation of periodic orbits (Rinzel & Miller, 1980). This sequence is captured by the normal form of the Bautin (degenerate Hopf) bifurcation (Kuznetsov, 1998; cf. Guckenheimer & Holmes, 1983, sec. 7.1). Finally, for the homoclinic bifurcation, we consider only the linearized flow near the fixed point involved in the bifurcation. This is not strictly a normal form, and as for the SNIPER bifurcation, a proper global return interpretation is necessary to produce the periodic orbit.
Near the SNIPER, Hopf, and Bautin local bifurcations, there is a separation of timescales between dynamics along versus dynamics normal to the one- or two-dimensional attracting center manifold containing (or, in the SNIPER case, consisting of) the periodic orbit. In particular, sufficiently close to the bifurcation point, the time required for perturbed solutions to collapse back onto the manifold is negligible compared with the period of
the orbit. This implies that as the bifurcation is approached, the tangent space of any (N − 1)-dimensional isochron (computed at its intersection with the periodic orbit) becomes normal to the corresponding tangent space of the center manifold. Thus, sufficiently near these three bifurcations, the only relevant contribution that perturbations make to asymptotic trajectories is through their components along the center manifold, as captured by the terms ν_x and (additionally for the Hopf and Bautin bifurcations) ν_y above. Hence equation 3.1 captures the phase response curve for the full N-dimensional system. For the homoclinic global bifurcation, the same conclusion holds, although for a different reason: in this case, there is no low-dimensional center (i.e., locally slow) manifold. However, because the dynamics that asymptotically determine the PRC are linear for the homoclinic bifurcation (unlike the SNIPER, Hopf, and Bautin cases), a PRC valid for full N-dimensional systems can still be computed analytically, as described below.

We use the direct method of section 2.2.1 to compute PRCs from the normal form equations. This involves linearizing about the stable periodic orbit, which is appropriate because the perturbations ΔV to be considered are vanishingly small. The explicit solution of the normal form equations yields Δθ, and taking limits, we obtain the PRC (cf. equation 2.8). Without loss of generality, the voltage peak (spike) phase is set at θ_s = 0, and coordinates are defined so that phase increases at a constant rate ω in the absence of external inputs, as in section 2.1. Analogs of some of the following results have been previously derived by alternative methods, as noted in the text, and PRCs for relaxation oscillators have been discussed in Izhikevich (2000b). However, unlike the previous work, here we explicitly compute how the PRCs scale with oscillator frequency.

3.1.1 Saddle Node in a Periodic Orbit (SNIPER).
A SNIPER bifurcation occurs when a saddle-node bifurcation of fixed points takes place on a periodic orbit (see Figure 2a). Following the method of Ermentrout (1996), where details of the calculation may be found, we ignore the direction(s) transverse to the periodic orbit and consider the one-dimensional normal form for a saddle-node bifurcation of fixed points,

ẋ = η + x²,  (3.2)

where x may be thought of as local arc length along the periodic orbit. For η > 0, the solution of equation 3.2 traverses any interval in finite time; as in Ermentrout (1996), the period T of the orbit may be approximated by calculating the total time necessary for the solution of equation 3.2 to go from x = −∞ to x = +∞, the solution being made periodic by resetting x to −∞ every time it fires at x = ∞. This gives T = π/√η, hence ω = 2√η.
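The passage-time formula can be checked directly: T is the integral of dt = dx/(η + x²) over the whole line. A minimal numerical sketch (the value of η is arbitrary, chosen only for illustration) approximates this integral with the trapezoid rule on a large interval plus the analytic 1/x tail on each side:

```python
import math

def sniper_period(eta, x_max=1000.0, n=200001):
    """Approximate T = integral of dx/(eta + x^2) over (-inf, inf), the
    passage time of the SNIPER normal form dx/dt = eta + x^2."""
    h = 2.0 * x_max / (n - 1)
    total = 0.0
    for i in range(n):
        x = -x_max + i * h
        w = 0.5 if i in (0, n - 1) else 1.0   # trapezoid end weights
        total += w / (eta + x * x)
    # tails: integral from x_max to inf of dx/(eta + x^2) ~ 1/x_max, each side
    return total * h + 2.0 / x_max

eta = 0.04
T_numeric = sniper_period(eta)
T_exact = math.pi / math.sqrt(eta)   # the formula in the text
omega = 2.0 * math.sqrt(eta)         # = 2*pi / T_exact
```

The agreement confirms both T = π/√η and the frequency relation ω = 2√η.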
Since equation 3.2 is one-dimensional, Ermentrout (1996) immediately computes

∂θ/∂x = ω ∂t/∂x = ω / (dx/dt),  (3.3)

where dx/dt is evaluated on the solution trajectory of equation 3.2. This gives

∂θ/∂x = (2/ω)[1 − cos θ],  (3.4)

as first derived in Ermentrout (1996), but with explicit ω-dependence displayed here. Considering a voltage perturbation ΔV, we have

z_SN = ∂θ/∂V = (c_sn/ω)[1 − cos θ],  (3.5)
where c_sn = 2ν_x is a model-dependent constant (see equation 3.1). Note that ∂θ/∂V is nonnegative or nonpositive according to the sign of c_sn. Since in type I neuron models (Ermentrout, 1996) a positive voltage perturbation advances phase (and hence causes the neuron to fire sooner), in the following we will generally assume c_sn to be positive.

3.1.2 Generalized and Supercritical Hopf Bifurcations. The normal form for the (generalized) Hopf bifurcation (Guckenheimer & Holmes, 1983; Kuznetsov, 1998) is

ż = (α + iβ)z + (c + id)|z|²z + (f + ig)|z|⁴z;  (3.6)

in polar coordinates, this is

ṙ = αr + cr³ + fr⁵,  (3.7)
φ̇ = β + dr² + gr⁴.  (3.8)
We study two cases, always treating α as the bifurcation parameter. In the first case, we assume c < 0, yielding a supercritical Hopf bifurcation. For α < 0, there is a stable fixed point at the origin that loses stability as α increases through zero, giving birth to a stable periodic orbit with radius r_po,H = √(−α/c) (see Figure 2b). Crucially, r_po,H = 0 when α = 0, so that only terms of cubic order in equations 3.7 and 3.8 are required to capture (unfold) the supercritical Hopf dynamics. Hence we may set g = f = 0 for a local analysis. In the second case, we assume c > 0, so that equations 3.7 and 3.8 have a subcritical Hopf bifurcation at α = 0, and there is no stable periodic orbit
for any value of α when g = f = 0. Hence, we must reintroduce these terms to capture the relevant dynamics. Assuming additionally that f < 0, for α < 0 there is a stable fixed point at the origin that loses stability in a subcritical Hopf bifurcation at α = 0, giving rise to an unstable periodic orbit as α decreases through zero. The branch of unstable periodic orbits turns around at a saddle-node bifurcation of periodic orbits at α = c²/4f; for α > c²/4f, stable periodic solutions exist with radius r_po,B = [(1/2f)(−c − √(c² − 4αf))]^{1/2} (see Figure 2c). This is the generalized Hopf or Bautin bifurcation (identified by the subscript B).

In either case, the angular speed is constant on the stable periodic orbit; hence, we set the asymptotic phase θ equal to the polar angle φ on the periodic orbit itself. (However, radial level sets of φ extending off the periodic orbit are not isochrons, since φ̇ varies with r.) We calculate the PRC by linearizing about the attracting periodic orbit r_po. Letting r = r_po + r′, we obtain ṙ′ = λr′ + O(r′²), where λ is the transverse Floquet exponent (eigenvalue) of the stable periodic orbit. In the supercritical Hopf bifurcation, λ = λ_H = −2α < 0 and r_po = r_po,H; in the Bautin case, λ = λ_B = (1/f)(c² − 4αf + c√(c² − 4αf)) < 0 and r_po = r_po,B. Here and below, we drop terms of O(r′²) because we are concerned with arbitrarily small perturbations (cf. equation 2.8). Solving the linearized radial equation with initial condition r(0) = r_0, we obtain

r(t) = r_po + (r_0 − r_po) e^{λ_j t},  (3.9)

with j = H or B. Next, integrating equation 3.8 yields

φ(t) = ∫_0^t dφ = ∫_0^t [β + d(r(s))² + g(r(s))⁴] ds,  (3.10)
and taking φ(0) = φ_0, substituting equation 3.9 into 3.10, letting t → ∞, and dropping terms of O(r_0′²), we obtain the phase θ associated with the initial condition (r_0, φ_0):

θ(t) = φ_0 + (β + d r_po² + g r_po⁴) t − 2 r_po (d + 2g r_po²)(r_0 − r_po)/λ_j.  (3.11)
Here we have again used the fact that the polar angle φ and the phase θ are identical on the periodic orbit.

Suppose that we start with an initial condition (x_i, y_i) on the periodic orbit, with polar coordinates (r_po, φ_i). As t → ∞, the trajectory with this initial condition has asymptotic phase φ_i + (β + d r_po² + g r_po⁴) t. Now consider a perturbation Δx in the x-direction to (x_f, y_f) = (r_po cos φ_i + Δx, r_po sin φ_i). To lowest order in Δx, this corresponds, in polar coordinates, to

(r_f, φ_f) = (r_po + cos φ_i Δx, φ_i − (sin φ_i / r_po) Δx).
Setting (r_0, φ_0) = (r_f, φ_f) in equation 3.11 and subtracting the analogous expression with (r_0, φ_0) = (r_po,j, φ_i), j = H or B, we compute the change in asymptotic phase due to this perturbation,

∂θ/∂x = −[(2d r_po,j + 4g r_po,j³)/λ_j] cos θ − (1/r_po,j) sin θ,  (3.12)

where we have substituted θ for the polar angle φ_i, again using the fact that the two variables take identical values on the periodic orbit. Similarly, we find

∂θ/∂y = −[(2d r_po,j + 4g r_po,j³)/λ_j] sin θ + (1/r_po,j) cos θ.  (3.13)
We now express r_po,j and λ_j in terms of the frequencies of the periodic orbits. In the supercritical Hopf case (recall that we set g = f = 0 here), at the bifurcation point the phase frequency ω is φ̇ = ω_H = β, and from equation 3.8 we have ω − ω_H = d r_po,H², yielding

r_po,H = √|ω − ω_H| / √|d|.  (3.14)

Substituting for r_po,H, we have ω − ω_H = −αd/c, which together with the expression for λ_H gives

λ_H = (2c/d)(ω − ω_H).  (3.15)

In the Bautin case, we find that

ω − ω_SN = −(d/2f − gc/2f²)√(c² − 4αf) + (g/4f²)(c² − 4αf),  (3.16)

where ω_SN is the frequency of the periodic orbit at the saddle-node bifurcation (α = c²/4f). Thus, from equation 3.16,

√(c² − 4αf) = k|ω − ω_SN| + O(|ω − ω_SN|²),  (3.17)

where k = |2f²/(fd − gc)|, and we may use the expressions for r_po,B and λ_B to compute

r_po,B = √(−c/2f) + O(|ω − ω_SN|),  (3.18)
λ_B = (ck/f)|ω − ω_SN| + O(|ω − ω_SN|²).  (3.19)
Next, we substitute these expressions (equations 3.14, 3.15 and 3.18, 3.19) for r_po and λ into equations 3.12 and 3.13. For the supercritical Hopf case, this gives

∂θ/∂x = (1/√|ω − ω_H|)(√|d|/|c|)[d cos θ + c sin θ],  (3.20)
∂θ/∂y = (1/√|ω − ω_H|)(√|d|/|c|)[d sin θ − c cos θ].  (3.21)

In the Bautin case, we get

∂θ/∂x = (1/|ω − ω_SN|)[−2d√(−c/2f) − 4g(−c/2f)^{3/2}](f/ck) cos θ + O(1),  (3.22)
∂θ/∂y = (1/|ω − ω_SN|)[−2d√(−c/2f) − 4g(−c/2f)^{3/2}](f/ck) sin θ + O(1),  (3.23)
where we have explicitly written the terms of O(|ω − ω_SN|⁻¹), which dominate near the saddle node of periodic orbits. Note that the bifurcation parameter α enters only through the prefactor, so that as this parameter is varied, all other terms in equations 3.22 and 3.23 remain constant.

Equipped with equations 3.20 and 3.21, the PRC for a perturbation in the V-direction near a supercritical Hopf bifurcation is found from equation 3.1 to be

z_H(θ) = ∂θ/∂V = (c_H/√|ω − ω_H|) sin(θ − φ_H),  (3.24)

where the constant c_H = (√|d|/|c|)√((ν_x c + ν_y d)² + (ν_x d − ν_y c)²) and the phase shift φ_H = tan⁻¹((ν_y c − ν_x d)/(ν_x c + ν_y d)). The form of this PRC was originally presented as equation 2.11 of Ermentrout and Kopell (1984). See that article, as well as section 4 of Ermentrout (1996) and Hoppensteadt and Izhikevich (1997), for earlier, alternative methods and computations for the PRC near a supercritical Hopf bifurcation. For the Bautin bifurcation, we similarly arrive at

z_B(θ) = ∂θ/∂V = (c_B/|ω − ω_SN|) sin(θ − φ_B).  (3.25)

Here c_B = [−2d√(−c/2f) − 4g(−c/2f)^{3/2}](f/ck)√(ν_x² + ν_y²) is a constant (which can be positive or negative depending on d and g), and φ_B = tan⁻¹(ν_y/ν_x) is an ω-independent phase shift.
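The direct method behind equation 3.11 is easy to test numerically. A minimal sketch (all normal-form parameter values below are hypothetical, with g = f = 0 for the supercritical case) integrates ṙ = αr + cr³, φ̇ = β + dr² and compares the asymptotic phase shift of a radially perturbed initial condition against the closed-form offset −2r_po d(r_0 − r_po)/λ_H from equation 3.11:

```python
import math

# hypothetical supercritical Hopf normal-form parameters (g = f = 0)
alpha, beta, c, d = 0.25, 1.0, -1.0, 0.3
r_po = math.sqrt(-alpha / c)   # radius of the stable orbit (= 0.5)
lam = -2.0 * alpha             # transverse Floquet exponent lambda_H

def rhs(r, phi):
    return alpha * r + c * r**3, beta + d * r**2

def integrate(r, phi, t_final, dt=1e-3):
    """RK4 integration of the planar normal form in polar coordinates."""
    for _ in range(int(round(t_final / dt))):
        k1r, k1p = rhs(r, phi)
        k2r, k2p = rhs(r + 0.5 * dt * k1r, phi + 0.5 * dt * k1p)
        k3r, k3p = rhs(r + 0.5 * dt * k2r, phi + 0.5 * dt * k2p)
        k4r, k4p = rhs(r + dt * k3r, phi + dt * k3p)
        r += dt / 6 * (k1r + 2 * k2r + 2 * k3r + k4r)
        phi += dt / 6 * (k1p + 2 * k2p + 2 * k3p + k4p)
    return r, phi

# asymptotic phase shift of a small radial perturbation, by brute force
dr, T_long = 0.01, 40.0
_, phi_pert = integrate(r_po + dr, 0.0, T_long)
_, phi_ref = integrate(r_po, 0.0, T_long)
dtheta_numeric = phi_pert - phi_ref
dtheta_formula = -2.0 * r_po * d * dr / lam   # equation 3.11 with g = 0
```

The two phase shifts agree to the accuracy expected of the linearization in r′.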
3.1.3 Homoclinic Bifurcation. Finally, suppose that the neuron model has a parameter μ such that a homoclinic orbit to a hyperbolic saddle point p with real eigenvalues exists at μ = 0. Then there will be a periodic orbit γ for, say, μ > 0, but not for μ < 0. Specifically, we assume a single unstable eigenvalue λ_u smaller in magnitude than all of the stable eigenvalues, λ_u < |λ_{s,j}|, so that the bifurcating periodic orbit is stable (Guckenheimer & Holmes, 1983; see Figure 2d). If parameters are chosen close to the homoclinic bifurcation, solutions near the periodic orbit spend most of their time near p, where the vector field is dominated by its linearization. This may generically be written in the diagonal form

ẋ = λ_u x,  (3.26)
ẏ_j = λ_{s,j} y_j,  j = 1, …, N − 1,  (3.27)

where the x and y_j axes are tangent to the unstable and a stable manifold of p, respectively, and λ_{s,j} < 0 < λ_u are the corresponding eigenvalues. For simplicity, we assume here that the segments of the axes shown in Figure 3 are actually contained in the respective manifolds; this can always be achieved locally by a smooth coordinate change (Guckenheimer & Holmes, 1983). We define the box B = [0, Δ] × ··· × [0, Δ], which encloses γ for the dominant part of its period but within which equations 3.26 and 3.27 are still a good approximation; Δ is model dependent but fixed as different periodic orbits occur while a bifurcation parameter varies within the model. We do not explicitly model γ outside of B, but note that the trajectory is reinjected
Figure 3: The setup for deriving the PRC for oscillations near a homoclinic bifurcation, shown (for simplicity) with N = 2.
after negligible time (compared with that spent in B) at a distance ε from the stable manifold, where ε varies with the bifurcation parameter μ (see Figure 3). Thus, periodic orbits occurring closer to the bifurcation point correspond to lower values of ε and have larger periods. We approximate the period T(ε) as the time that the x coordinate of γ takes to travel from ε to Δ under equation 3.26:

T(ε) = (1/λ_u) ln(Δ/ε).  (3.28)

Notice that the x-coordinate of γ alone determines T(ε), and hence may be thought of as independently measuring the phase of γ through its cycle. We set θ = 0 at x = ε and, assuming instantaneous reinjection, θ = 2π at x = Δ. Then ω = 2π/T(ε), and as in equation 3.3,

∂θ/∂x = ω/(dx/dt) = ω/(λ_u x(θ)) = (ω/(λ_u ε)) exp(−λ_u θ/ω).  (3.29)

In the final equality, we used the solution to equation 3.26, x(t) = ε exp(λ_u t), with the substitution t = θ/ω. Since, as remarked above, motion in the y_j directions does not affect the phase of γ, only components of a perturbation ΔV along the x-axis contribute to the phase response curve; thus, the PRC is z_HC = ∂θ/∂V = ν_x ∂θ/∂x, where ν_x is as defined following equation 3.1. Using equation 3.28, ε = Δ exp(−2π λ_u/ω), which allows us to eliminate ε from equation 3.29:

z_HC(θ) = ∂θ/∂V = c_hc ω exp(2π λ_u/ω) exp(−λ_u θ/ω),  (3.30)
where c_hc = ν_x/(λ_u Δ) is a model-dependent constant. This is an exponentially decaying function of θ with maximum

z_max = c_hc ω exp(2π λ_u/ω)  (3.31)

and minimum

z_min = z_max exp(−2π λ_u/ω) = c_hc ω.  (3.32)

Here and below we assume c_hc > 0. z_HC is discontinuous at the spike point θ_s = 2π, which forces us to take a limit in defining population-averaged firing rates below but does not otherwise affect the following analysis.
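Equation 3.29 can be verified against a direct difference quotient on the linearized flow: a kick Δx shortens the remaining time-to-Δ of x(t) = ε e^{λ_u t}, and ω times that time advance is the phase advance. The sketch below uses hypothetical values of λ_u, Δ, and ε:

```python
import math

lam_u, Delta, eps = 0.8, 1.0, 1e-3   # hypothetical values

T = math.log(Delta / eps) / lam_u    # equation 3.28
omega = 2.0 * math.pi / T

def phase_advance(theta, dx):
    """Phase gained when x is kicked by dx at phase theta, from the change in
    the remaining time-to-Delta of x(t) = eps * exp(lam_u * t)."""
    x = eps * math.exp(lam_u * theta / omega)   # position at phase theta
    t_old = math.log(Delta / x) / lam_u
    t_new = math.log(Delta / (x + dx)) / lam_u
    return omega * (t_old - t_new)              # earlier arrival => advance

theta, dx = 2.0, 1e-9
z_numeric = phase_advance(theta, dx) / dx
z_formula = (omega / (lam_u * eps)) * math.exp(-lam_u * theta / omega)  # eq. 3.29
```

The difference quotient matches the exponentially decaying PRC to high accuracy for small dx.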
3.2 One-Dimensional Neuron Models. Generalized integrate-and-fire models have the form

V̇ = F(V) + G(V, t),  (3.33)

where V(t) is constrained to lie between a reset voltage V_r and a threshold V_th, and the following reset dynamics are externally imposed: if V(t) crosses V_th from below, a spike occurs, and V(t) is reset to V_r. Here, nothing is lost in transforming to the single phase equation 2.6; in particular, the error term of equation 2.5 does not apply. In fact, as noted in Ermentrout (1981), the crucial quantity ∂θ/∂V can be found directly from equation 3.33 with G(V, t) ≡ 0:

z(θ) = ∂θ/∂V = ω ∂t/∂V = ω/F(V(θ)),  (3.34)
where we recall that θ is defined such that θ̇ = ω. In the next two subsections, we compute phase response curves for two simple integrate-and-fire models.

3.2.1 Integrate-and-Fire Neuron. We first consider the simplest possible integrate-and-fire (IF) model:

C V̇ = I_b + I(t);  V_r = 0, V_th = 1,  (3.35)

where I_b is the baseline current, C is the membrane capacitance, and G(V, t) = I(t). Hereafter we set C = 1 for the IF model. The angular frequency of a baseline (I(t) = 0) oscillation is ω = 2π I_b, and equation 3.34 gives

z_IF(θ) = ω/F(V(θ)) = ω/I_b ≡ 2π.  (3.36)
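This can also be read off without equation 3.34: a kick ΔV advances the next spike by ΔV/I_b, which in phase units is ω ΔV/I_b = 2π ΔV, regardless of when the kick arrives. A one-line numerical restatement (the value of I_b below is arbitrary):

```python
import math

Ib = 0.25                   # arbitrary baseline current (C = 1)
T = 1.0 / Ib                # time from reset (V = 0) to threshold (V = 1)
omega = 2.0 * math.pi / T   # = 2*pi*Ib

dV = 1e-6                   # small voltage kick
dt_advance = dV / Ib        # the next spike occurs this much sooner
z_numeric = omega * dt_advance / dV   # phase advance per unit voltage = 2*pi
```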
Thus, the IF PRC is constant in θ and frequency independent.

3.2.2 Leaky IF Neuron. Next, we consider the leaky integrate-and-fire (LIF) model:

C V̇ = I_b + g_L(V_L − V) + I(t);  V_r = 0, V_th = 1 < V_L + I_b/g_L,  (3.37)

where I_b is the baseline current, g_L > 0 and V_L are the leak conductance and reversal potential, C is the capacitance, and G(V, t) = I(t). As above, we also set C = 1 for this model. We assume I_b ≥ g_L(1 − V_L), so that when I(t) = 0, the neuron fires periodically with frequency

ω = 2π g_L [ln((I_b + g_L V_L)/(I_b + g_L V_L − g_L))]⁻¹.  (3.38)
This expression shows how I_b enters as a bifurcation parameter, with I_b = g_L(1 − V_L) corresponding to the bifurcation point at which ω = 0. Solving equation 3.37 for V(t) with initial condition V(0) = V_r = 0, and then using θ = ωt and equation 3.34, gives

z_LIF(θ) = (ω/g_L)(1 − exp(−2π g_L/ω)) exp(g_L θ/ω),  (3.39)
equivalent to formulas previously derived in van Vreeswijk et al. (1994) and Lewis and Rinzel (2003). Thus, the PRC for the LIF model is an exponentially increasing function of θ, with a maximum that decreases with ω:

z_max(ω) = (ω/g_L)(e^{2π g_L/ω} − 1),  (3.40)

and minimum

z_min(ω) = z_max exp(−2π g_L/ω) = (ω/g_L)(1 − e^{−2π g_L/ω}).  (3.41)
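Because the LIF solution is available in closed form, equation 3.39 can be cross-checked with the direct method of section 2.2.1: kick V at phase θ and measure the spike advance. The parameter values in this sketch are hypothetical, chosen so that I_b ≥ g_L(1 − V_L):

```python
import math

gL, VL, Ib = 0.1, 0.0, 0.2
A = VL + Ib / gL                          # asymptotic voltage (= 2.0 > Vth = 1)
T = math.log(A / (A - 1.0)) / gL          # period; equivalent to equation 3.38
omega = 2.0 * math.pi / T

def z_lif(theta):
    """PRC of equation 3.39."""
    return (omega / gL) * (1.0 - math.exp(-2.0 * math.pi * gL / omega)) \
        * math.exp(gL * theta / omega)

def z_direct(theta, dV=1e-8):
    """Kick V by dV at phase theta and measure the spike advance, using the
    closed-form solution V(t) = A * (1 - exp(-gL * t)) from V(0) = 0."""
    V0 = A * (1.0 - math.exp(-gL * theta / omega))
    t_old = math.log((A - V0) / (A - 1.0)) / gL        # remaining time, unkicked
    t_new = math.log((A - V0 - dV) / (A - 1.0)) / gL   # remaining time, kicked
    return omega * (t_old - t_new) / dV
```

The two computations agree pointwise across the cycle, as they must, since both reduce to ω/F(V(θ)).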
Recall that the PRC near a homoclinic bifurcation is also an exponential function, but with the opposite slope: this is because both the essential dynamics near a homoclinic bifurcation and the LIF dynamics are linear, while trajectories accelerate following spikes in the homoclinic case and decelerate in the LIF. This is our final analytical PRC calculation; we summarize the results derived above in Table 1 and Figure 4.

Table 1: Phase Response Curves for the Different Neuron Models.

Bifurcation    z(θ)                                        z_max                        z_min
SNIPER         (c_sn/ω)[1 − cos θ]                          2c_sn/ω                      0
Hopf           (c_H/√|ω − ω_H|) sin(θ − φ_H)                c_H/√|ω − ω_H|               −c_H/√|ω − ω_H|
Bautin         (|c_B|/|ω − ω_SN|) sin(θ − φ_B) + O(1)       |c_B|/|ω − ω_SN| + O(1)      −|c_B|/|ω − ω_SN| + O(1)
Homoclinic     c_hc ω e^{2πλ_u/ω} e^{−λ_u θ/ω}              c_hc ω e^{2πλ_u/ω}           c_hc ω
IF             2π                                           2π                           2π
LIF            (ω/g_L)(1 − e^{−2πg_L/ω}) e^{g_L θ/ω}        (ω/g_L)(e^{2πg_L/ω} − 1)     (ω/g_L)(1 − e^{−2πg_L/ω})
Figure 4: PRCs for the various neuron models, from the formulas of section 3 and numerically computed using XPP (Ermentrout, 2002), all with θ_s = 0. The relevant bifurcations are noted where applicable. Dot-dashed, dashed, and dotted curves for each model correspond to increasing frequencies, respectively. HR: ω = 0.0102, 0.0201, 0.0316 rad/msec (corresp. 1.62, 3.20, 5.03 Hz); FN: ω = 0.204, 0.212, 0.214 (corresp. 32.5, 33.7, 34.1 Hz); HH: ω = 0.339, 0.355, 0.525 rad/msec (corresp. 54.2, 56.5, 83.6 Hz); ML: ω = 0.0572, 0.0664, 0.0802 rad/msec (corresp. 9.10, 10.6, 12.8 Hz); IF: (any frequency); LIF: ω = 0.419, 0.628, 1.26 rad/msec (corresp. 66.7, 100, 200 Hz). For the LIF model, g_L = 0.110. The normal forms 3.5, 3.24, 3.25, and 3.30 for the PRCs closest to bifurcation are shown solid (scale factors c_i fit by least squares); the IF and LIF PRCs are exact. PRC magnitudes decrease with ω for the HR, HH, ML, and LIF models; are constant for the IF model; and increase with ω for the FN model. The phase shifts φ_H and φ_B are chosen as π (yielding z(θ_s) = 0; see section 2.3). The inset to the ML plot displays the same information on a log scale, demonstrating exponential decay.
Figure 5: Scaling of z_max with ω for the various neuron models (and hence scaling of the population response FL_{d_max} − FL_b with ω; see section 4.2). Asterisks are numerical values from the PRCs of the full HR, FN, HH, and ML models, and curves show predictions of the normal forms 3.5, 3.25, 3.24, and 3.30 with least-squares fits (of PRC maxima) for scale factors. Results for the IF and LIF models are exact.
3.3 Accuracy of the Analytical PRCs. The range of parameters over which the PRCs of the full neuron models are well approximated by the analytical expressions derived above varies from model to model. One overall limitation, noted in Izhikevich (2000a), is that normal form calculations for the Bautin and supercritical Hopf bifurcations ignore the relaxation nature of the dynamics of typical neural oscillators. However, the analytical PRCs, equations 3.5, 3.24, 3.25, and 3.30, are qualitatively, and in many cases quantitatively, correct. Figure 4 compares these formulas with PRCs calculated using XPP (Ermentrout, 2002) for the HR, FN, HH, and ML models near the relevant bifurcations (PRCs for the integrate-and-fire models are exact). The companion Figure 5 demonstrates the scaling of PRC maxima with baseline frequency, which is also correctly predicted by the normal form analysis. Frequencies ω were varied by changing the bifurcation parameter, the baseline inward current I_b. Here and elsewhere, the neural models are as given in Rose and Hindmarsh (1989), Murray (2002), Hodgkin and Huxley (1952), and Rinzel and Ermentrout (1989). All parameter values used here are reproduced along with the equations in appendix C. Finally, looking forward to the next section, we note that the analytical PRCs derived here correctly predict key qualitative aspects of population responses to stimuli.

4 Probabilistic Analysis of Firing Rates

4.1 A Phase Density Equation. We now describe how time-dependent firing rates in response to external stimuli emerge from averages of oscillator population dynamics with appropriate initial conditions. Let ρ(θ, t) denote the probability density of solutions of equation 2.7. Thus, ρ(θ, t)dθ is the probability that a neuron's phase in an arbitrary trial lies in the interval [θ, θ + dθ] at time t. This density evolves via the advection equation

∂ρ(θ, t)/∂t = −(∂/∂θ)[v(θ, t) ρ(θ, t)].  (4.1)

Boundary conditions are periodic in the probability flux: v(0, t)ρ(0, t) = lim_{ψ→2π} v(ψ, t)ρ(ψ, t), which reduces to ρ(0, t) = ρ(2π, t) for smooth phase response curves z. A related phase density approach is used in Tass (1999) and Ritt (2003), and we first derived the solution in Brown et al. (2003b). In the presence of noise, an additional diffusion term enters equation 4.1 (Stein, 1965; Tass, 1999; Brown et al., 2003b). Multiple trials in which stimuli are not keyed to oscillator states may be modeled by averaging solutions of the linear PDE, equation 4.1, over suitably distributed initial conditions; since (unmodeled) noise and variable and/or drifting frequencies tend to distribute phases uniformly in the absence of stimuli, we set ρ_0 ≡ 1/2π. Histograms of firing times may then be extracted by noting that firing probabilities for arbitrary cells at time t are equal to the passage rate of the probability density through the spike phase, that is, the
probability flux:

FL(t) = lim_{ψ→θ_s⁻} v(ψ, t) ρ(ψ, t) = lim_{ψ→θ_s⁻} [ω + z(ψ)I(t)] ρ(ψ, t).  (4.2)
The limit from below allows for discontinuities in z(θ) (as in the homoclinic and LIF PRCs), since the relevant quantity is the flux across the spike threshold from lower values of V and hence from lower values of θ. If the PRC z(θ), and hence ρ(θ, t), is continuous at θ_s, equation 4.2 simply becomes FL(t) = [ω + z(θ_s)I(t)] ρ(θ_s, t). We emphasize that expression 4.2 equally describes the average firing rate of an entire uncoupled population on a single trial, the average firing rate of single neurons drawn from such a population over many sequential trials, as in Herrmann and Gerstner (2001), or a combination of both.

4.2 Patterns of Firing Probabilities and Conditions for Refractory Periods. Equation 4.1 can be explicitly solved for piecewise constant stimuli of duration d = t_2 − t_1: I(t) = Ī for t_1 ≤ t ≤ t_2 and I(t) = 0 otherwise. (Here and elsewhere, we assume Ī > 0 unless explicitly noted.) Specifically, the method of characteristics (Whitham, 1974; pp. 97–100 of Evans, 1998) yields

ρ(θ, t) = ρ_0(Θ_{θ,t}(0)) exp(−∫_0^t (∂v/∂θ)(Θ_{θ,t}(t′), t′) dt′)
        = (1/2π) exp(−Ī ∫_{t_1}^{t̃_2} z′[Θ_{θ,t}(s)] ds),  (4.3)

where t ≥ t_1, t̃_2 = min(t, t_2), and we take the initial condition ρ_0 = ρ(θ, 0) = 1/2π. Here, Θ_{θ,t}(s) lies on the characteristic curve given by

dΘ_{θ,t}(s)/ds = v(Θ_{θ,t}(s), s),  (4.4)
with end point condition Θ_{θ,t}(t) = θ. When Θ_{θ,t}(s) coincides with a discontinuity in z, the integrands in equation 4.3 are not defined, and we must appeal to the continuity of probability flux or, equivalently, to the following change of variables.

We now simplify expression 4.3. Using the fact that v(Θ_{θ,t}(s), s) = ω + Ī z(Θ_{θ,t}(s)) for t_1 ≤ s ≤ t_2, and changing variables from s to Θ_{θ,t}(s),

∫_{t_1}^{t̃_2} z′[Θ_{θ,t}(s)] ds = ∫_{Θ_{θ,t}(t_1)}^{Θ_{θ,t}(t̃_2)} z′[Θ_{θ,t}(s)] dΘ_{θ,t}(s) / (ω + Ī z(Θ_{θ,t}(s)))
  = (1/Ī) ln[(ω + Ī z(Θ_{θ,t}(t̃_2))) / (ω + Ī z(Θ_{θ,t}(t_1)))],  (4.5)
so that

ρ(θ, t) = (1/2π) [(ω + Ī z(Θ_{θ,t}(t_1))) / (ω + Ī z(Θ_{θ,t}(t̃_2)))].  (4.6)
This expression is valid everywhere it is defined. To obtain the terms in equation 4.6, we integrate equation 4.4 backward in time from the final condition at s = t until s = t_1 or s = t̃_2; this may be done analytically for the normal form PRCs of section 3 or numerically for PRCs from full neuron models. The integration yields the PRC-independent expression

Θ_{θ,t}(t̃_2) = θ − ω(t − t̃_2)  (4.7)

for all neuron models, while Θ_{θ,t}(t_1) is model dependent via the PRC. Note that while the stimulus is on (i.e., t_1 ≤ t ≤ t_2), t̃_2 = t, so that Θ_{θ,t}(t̃_2) = θ. After the stimulus turns off, v(θ, t) is independent of θ, and ρ is constant along curves of constant θ − ωt. Thus, for t > t_2, ρ(θ, t) is simply a traveling wave rotating with frequency ω, with ρ(θ, t_2) determining the phase density. From the definition 4.2, we have

FL(t) = lim_{ψ→θ_s} [(ω + z(ψ)I(t))/2π] [(ω + Ī z(Θ_{ψ,t}(t_1))) / (ω + Ī z(Θ_{ψ,t}(t̃_2)))].  (4.8)

Figure 6 shows examples of FL(t) for the various neuron models, computed via equation 4.8 with both numerically and analytically derived PRCs z, as well as via numerical simulations of the full neuron models. The phase reduction 4.8 gives qualitative, and in some cases precise, matches to the full numerical data. We recall that the accuracy of phase reductions from full neuron models improves with weaker stimuli Ī, and that the analytical PRCs better approximate their numerical counterparts as the bifurcation point is approached (i.e., as I_b is varied). Note that if lim_{ψ→θ_s} z(ψ) = 0, I(t) does not directly enter equation 4.8, so FL(t) depends only on variations in ρ resulting from the stimulus. However, (I) if lim_{ψ→θ_s} z(ψ) ≠ 0, the firing probability FL(t) jumps at stimulus onset and offset. See Figure 6, and recall that we set θ_s = 0. This is our first main result.

Some comments on the limit in equation 4.8 are appropriate. Since for all neuron models we always assume that v(θ) is positive and bounded and is defined except at isolated point(s), Θ_{ψ,t}(s) is a continuous function of ψ, s, and t. Nevertheless, as Θ_{ψ,t}(t_1) and Θ_{ψ,t}(t̃_2) pass through θ_s as t advances, discontinuities in z(·) give discontinuities in FL(t), but the limit in equation 4.8 ensures that FL(t) is always defined. As remarked above, if the PRC z(·) is a continuous function, then lim_{ψ→θ_s} z(ψ) = z(θ_s) and taking the limit is unnecessary.
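Equations 4.4 and 4.6 are straightforward to evaluate numerically for any given PRC. A minimal sketch (the constants ω, c_sn, and Ī below are hypothetical) uses the SNIPER PRC of equation 3.5, integrates the characteristic of equation 4.4 backward with RK4, and confirms that the density of equation 4.6 remains normalized while the stimulus is on:

```python
import math

omega, c_sn, Ibar = 1.0, 1.0, 0.3   # hypothetical constants

def z(theta):                        # SNIPER PRC, equation 3.5
    return (c_sn / omega) * (1.0 - math.cos(theta))

def v(theta):                        # phase velocity during the stimulus
    return omega + Ibar * z(theta)

def theta_back(theta, tau, dt=5e-3):
    """Integrate equation 4.4 backward for time tau (RK4), giving
    Theta_{theta,t}(t1) with tau = t - t1, stimulus on throughout."""
    for _ in range(int(round(tau / dt))):
        k1 = v(theta)
        k2 = v(theta - 0.5 * dt * k1)
        k3 = v(theta - 0.5 * dt * k2)
        k4 = v(theta - dt * k3)
        theta -= dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return theta

def rho(theta, tau):
    """Equation 4.6 during the stimulus (so Theta_{theta,t}(t~2) = theta)."""
    return v(theta_back(theta, tau)) / v(theta) / (2.0 * math.pi)

# total probability mass on a uniform grid of phases, at tau = 2.5
n = 400
mass = sum(rho(2.0 * math.pi * i / n, 2.5) for i in range(n)) * (2.0 * math.pi / n)
```

Mass conservation (total probability 1) is exactly the continuity-equation property that equation 4.6 encodes.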
Figure 6: (a–f) Phase density ρ(θ, t) in gray scale (darker corresponding to higher values) (top) and firing probability FL(t) in msec⁻¹ (bottom) for stimuli of length (3/2)P (indicated by black horizontal bars), from equations 4.6 and 4.8 via the method of characteristics. Dashed curves indicate FL(t) from the normal form PRCs of equations 3.5, 3.24, 3.25, 3.30, 3.36, and 3.39; solid curves, from numerical PRCs computed via XPP. Baseline frequencies and values of Ī for the HR, FN, HH, ML, IF, and LIF models are (0.0201, 0.212, 0.429, 0.08, 0.628, 0.628) rad/msec (corresp. 3.20, 33.7, 68.3, 12.7, 100, 100 Hz) and (0.1, 0.0015, 0.25, 0.0005, 0.05, 0.05) μA/cm², respectively. The vertical bars are PSTHs, numerically computed from the full conductance-based equations (see appendix C) using 10,000 initial conditions, with I_b set to match the frequencies of the corresponding phase models. Initial conditions were generated by evolving the full equations for a (uniformly distributed) random fraction of their period, from a fixed starting point. Note that FL(t) jumps discontinuously at stimulus onset and offset for the IF and LIF models, since for these models z(θ_s) ≠ 0 (point I in the text). Also, during the stimulus, FL(t) does not dip below the baseline value ω/2π for the HR, IF, and LIF models, because z_min ≈ 0 in these cases (point V).
While the stimulus is on, solutions to equation 4.4 are periodic with period

P = ∫_0^{2π} dθ / (ω + Ī z(θ))  (4.9)
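As an illustration, for a SNIPER-type PRC z(θ) = (c_sn/ω)(1 − cos θ) with hypothetical constants, the velocity has the form a − b cos θ, so equation 4.9 has the closed form 2π/√(a² − b²), and a characteristic started at θ = 0 should traverse exactly 2π in time P:

```python
import math

omega, c_sn, Ibar = 1.0, 1.0, 0.3    # hypothetical constants
a = omega + Ibar * c_sn / omega      # v(theta) = a - b*cos(theta)
b = Ibar * c_sn / omega

def v(theta):
    return a - b * math.cos(theta)

# equation 4.9 by the rectangle rule (spectrally accurate for periodic integrands)
n = 2000
P = sum(1.0 / v(2.0 * math.pi * i / n) for i in range(n)) * (2.0 * math.pi / n)
P_exact = 2.0 * math.pi / math.sqrt(a * a - b * b)

# a characteristic of equation 4.4, integrated forward (RK4) for time P
theta, steps = 0.0, 20000
dt = P / steps
for _ in range(steps):
    k1 = v(theta)
    k2 = v(theta + 0.5 * dt * k1)
    k3 = v(theta + 0.5 * dt * k2)
    k4 = v(theta + dt * k3)
    theta += dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
```

Since z ≥ 0 and Ī > 0 here, P is shorter than the baseline period 2π/ω, as expected.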
Figure 7: (a–d) Firing probabilities FL(t) for the HH and HR models, with stimulus characteristics chosen to illustrate the points in the text. Dashed and solid curves and vertical bars denote data obtained as in Figure 6. (a) A stimulus (Ī = 0.04 μA/cm²) of length exactly P = 232.50 msec (indicated by the horizontal black bar) for the HR model (ω = 0.0201 rad/msec) leaves no trace (point II). (b) A stimulus (Ī = 0.25 μA/cm²) of duration d_max = 11.46 msec for the HH model (ω = 0.429 rad/msec) yields the maximum response after the stimulus has switched off (because z_min < 0), but for the HR model (d) (ω = 0.0102 rad/msec) with stimulus duration d_max = 152.01 msec, the peak in FL(t) is achieved at t_2 (because z_min ≈ 0) (points III, IVa). Plots c and d illustrate point VI: the stimulus in c is identical to that of d, but the slower HR population in d (ω = 0.0102 versus 0.0201 rad/msec) displays the greater response.
(independent of the end point condition). Thus, equation 4.6 implies that ρ(θ, t) must also be P-periodic, so that the distribution returns to ρ(θ, t₁) ≡ 1/2π every P time units: that is, ρ(θ, t₁ + kP) ≡ 1/2π for integers k. If the stimulus is turned off after duration d = t₂ − t₁ = kP, this flat density therefore persists (recall that ρ evolves as a traveling wave), giving our second result: (II) for stimulus durations that are multiples of P, poststimulus firing probabilities f_L(t) return to the constant value ω/2π. This is illustrated in Figure 7a, corresponds to the absence of poststimulus refractory periods and ringing, and is related to the black holes discussed in Tass (1999). Figures 6 and 7 also illustrate the periodic regimes both during and after the stimulus. When the stimulus duration d is not a multiple of P (and provided z(θ) is not constant), ρ(θ, t₂) has at least one peak exceeding 1/2π and at least one valley less than 1/2π (see the phase density plots of Figure 6). Let the largest and smallest possible ρ values be ρ_max and ρ_min, respectively. Equation 4.6
Phase Reduction and Response Dynamics
then gives

ρ_max = (1/2π) [(ω + Ī z_max) / (ω + Ī z_min)],   ρ_min = (1/2π) [(ω + Ī z_min) / (ω + Ī z_max)],   (4.10)
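A quick check of equation 4.10 (the values of ω, Ī, and the PRC extrema below are arbitrary illustrative choices): the product of the two expressions is independent of the PRC, as noted next.

```python
import math

# Illustrative values only; any ω, Ī, z_max, z_min with ω + Ī z_min > 0 will do.
omega, I_bar, z_max, z_min = 0.429, 0.25, 1.0, -0.6

rho_max = (omega + I_bar * z_max) / (2 * math.pi * (omega + I_bar * z_min))
rho_min = (omega + I_bar * z_min) / (2 * math.pi * (omega + I_bar * z_max))

# The peak-height product is PRC-independent: ρ_min ρ_max = 1/(4π²).
assert abs(rho_max * rho_min - 1.0 / (4 * math.pi**2)) < 1e-12
```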
where z_min ≡ z(θ_min) and z_max ≡ z(θ_max) are the global extrema of the PRC; note the relationship ρ_min ρ_max = 1/4π². Recalling that Θ_{θ,t}(t̃₂) = θ during the stimulus, comparing equations 4.10 and 4.6 shows that ρ_max occurs at θ_min and ρ_min at θ_max. When it exists, the stimulus duration d_max (resp., d_min) for which a distribution with peak ρ_max (resp., valley ρ_min) occurs is essentially obtained by requiring (ignoring the limits required for discontinuous PRCs) that a characteristic curve passes through θ_max (resp., θ_min) at t₁ and through θ_min (resp., θ_max) at time t₂. Thus, (III) for stimulus durations d_max (resp., d_min), poststimulus firing probabilities f_L(t) exhibit their maximal deviation above (resp., below) the baseline rate ω/2π. These deviations may or may not be exceeded during the stimulus itself. (See Figure 7 for examples and Figure 6 for the evolution of phase density during a prolonged stimulus; in particular, note that while d_max is not strictly defined for the LIF model, shorter stimuli, of arbitrarily small duration, always give higher peaks.) We now determine whether maximal peaks and minimal valleys in firing rates occur during or after the stimulus for the various neuron types. Again using Θ_{ψ,t}(t̃₂) = ψ during the stimulus, equation 4.8 yields

f_L^d(t) = lim_{ψ→θ_s} [(ω + Ī z(ψ)) / 2π] [(ω + Ī z(Θ_{ψ,t}(t₁))) / (ω + Ī z(ψ))]
         = lim_{ψ→θ_s} (1/2π) [ω + Ī z(Θ_{ψ,t}(t₁))],   t₁ < t ≤ t₂.   (4.11)
The superscript on f_L^d(t) denotes "during" the stimulus, emphasizing that this expression is valid only for t₁ < t ≤ t₂. After the stimulus has turned off, a different special case of equation 4.8 is valid:

f_L^a(t) = lim_{ψ→θ_s} (ω/2π) [(ω + Ī z(Θ_{ψ,t}(t₁))) / (ω + Ī z(Θ_{ψ,t}(t₂)))],   t > t₂.   (4.12)

Here the superscript on f_L^a(t) denotes "after" the stimulus. We now use these expressions to write the maximum and minimum possible firing rates during and after the stimulus:

f_L^{d,max} = (1/2π) [ω + Ī z_max],   (4.13)
f_L^{a,max} = (ω/2π) [(ω + Ī z_max) / (ω + Ī z_min)],   (4.14)
f_L^{d,min} = (1/2π) [ω + Ī z_min],   (4.15)
f_L^{a,min} = (ω/2π) [(ω + Ī z_min) / (ω + Ī z_max)].   (4.16)
From equations 4.13 through 4.16, we have

f_L^{d,max} − f_L^{a,max} = (1/2π) [(ω + Ī z_max) / (ω + Ī z_min)] Ī z_min,   (4.17)
f_L^{d,min} − f_L^{a,min} = (1/2π) [(ω + Ī z_min) / (ω + Ī z_max)] Ī z_max.   (4.18)
Since we restrict to the case where v(θ, t) > 0 (i.e., there are no fixed points for the phase flow), the terms in the brackets of the preceding equations are always positive. This implies, for Ī > 0,

f_L^{a,max} ≥ f_L^{d,max} if and only if z_min ≤ 0;   (4.19)
f_L^{a,max} ≤ f_L^{d,max} if and only if z_min ≥ 0;   (4.20)
f_L^{a,min} ≤ f_L^{d,min} if and only if z_max ≥ 0;   (4.21)
f_L^{a,min} ≥ f_L^{d,min} if and only if z_max ≤ 0;   (4.22)
where the "equals" cases of the inequalities require z_max = 0 or z_min = 0. In other words, (IVa) for the specific stimulus durations that elicit maximal peaks in firing rates, these maximal peaks occur during the stimulus if z_min ≥ 0 but after the stimulus switches off if z_min ≤ 0; (IVb) for the specific (possibly different) stimulus durations that elicit minimal firing rate dips, these minimal dips occur during the stimulus if z_max ≤ 0 but after the stimulus switches off if z_max ≥ 0. We recall that z_min < 0 is a defining condition for type II neurons (Ermentrout, 1996). The poststimulus maximum (resp., minimum) firing rates are obtained as the peak (resp., valley) of the distribution ρ(θ, t) passes through θ_s. As Figure 7b shows, the delay from stimulus offset can be significant for typical neuron models. Defining the baseline rate valid for t < t₁,

f_L^b(t) ≡ ω/2π,   (4.23)
equation 4.15 shows that f_L^{d,min} ≥ f_L^b if and only if z_min ≥ 0. Thus, (V) if z_min ≥ 0, the firing rate does not dip below baseline values until (possibly) after the stimulus switches off. Table 2 summarizes the above results for the neuron models studied here.
Table 2: Predictions Using the Numerical PRCs of Figure 4.

Neuron Model | Response "Jumps" With Stimulus? (point I) | Maximum Response After Stimulus and Depressed Firing During Stimulus? (points IV and V)
HR  | No  | No
HH  | No  | Yes
FN  | Yes | Yes
ML  | Yes | No
IF  | Yes | No
LIF | Yes | No

Note: The conclusions follow from the limiting value of z(θ_s) (point I in text) and the value of the PRC minimum z_min (points IVa and V).
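Relations 4.13 through 4.22 are elementary to verify numerically. The sketch below (PRC extrema and parameters are arbitrary choices, not values from the article) checks that a type II-like PRC (z_min < 0) peaks after stimulus offset, while a nonnegative PRC peaks during the stimulus:

```python
import math

def extremal_rates(omega, I_bar, z_min, z_max):
    """Equations 4.13-4.16 for extremal firing rates (requires ω + Ī z_min > 0)."""
    fd_max = (omega + I_bar * z_max) / (2 * math.pi)
    fa_max = (omega / (2 * math.pi)) * (omega + I_bar * z_max) / (omega + I_bar * z_min)
    fd_min = (omega + I_bar * z_min) / (2 * math.pi)
    fa_min = (omega / (2 * math.pi)) * (omega + I_bar * z_min) / (omega + I_bar * z_max)
    return fd_max, fa_max, fd_min, fa_min

# Type II-like PRC (z_min < 0): maximal peak occurs after the stimulus (eq. 4.19).
fd_max, fa_max, fd_min, fa_min = extremal_rates(0.429, 0.25, -1.0, 1.0)
assert fa_max >= fd_max and fa_min <= fd_min

# Nonnegative PRC (z_min >= 0): the peak is attained during the stimulus (eq. 4.20).
fd_max2, fa_max2, _, _ = extremal_rates(0.0201, 0.1, 0.0, 2.0)
assert fa_max2 <= fd_max2
```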
We conclude this section by noting that Fourier transformation of the analog of equation 4.1 in the presence of noise shows that f_L(t) decays at exponential or faster rates due to noise and averaging over distributions of neuron frequencies (cf. Tass, 1999; Brown et al., 2003b). For mildly noisy or heterogeneous systems, results I through V remain qualitatively similar but are smeared; for example, ρ(θ, t) is no longer time periodic during or after the stimulus, but approaches a generally nonuniform equilibrium state via damped oscillations.

4.3 Frequency Scaling of Response Magnitudes. We now determine how the maximum and minimum deviations from baseline firing rates depend on the baseline (prestimulus) firing rate of the neural population. Following the discussion of the previous section, we separately compute the scaling of maximal (minimal) responses that are possible during the stimulus and the scaling of maximal (minimal) responses that are possible after stimuli switch off. Equations 4.13 through 4.16 and 4.23 yield

f_L^{d,max} − f_L^b = (1/2π) Ī z_max,   (4.24)
f_L^{d,min} − f_L^b = (1/2π) Ī z_min,   (4.25)
f_L^{a,max} − f_L^b = (ω/2π) [Ī (z_max − z_min) / (ω + Ī z_min)],   (4.26)
f_L^{a,min} − f_L^b = (ω/2π) [Ī (z_min − z_max) / (ω + Ī z_max)].   (4.27)
These expressions provide one set of measures of the sensitivity of population-level response at different baseline firing rates. Additionally, taking ratios with the prestimulus firing rate (e.g., finding (f_L^{d,max} − f_L^b)/f_L^b) determines the size of deviations relative to baseline activity. We use the information summarized in Table 1 to compile these measures for all neuron models in Tables 3 through 6. Note that in these tables, "moving away from the bifurcation" means varying parameters so that the frequency varies away from its value at the onset of firing, namely, ω = 0 for the SNIPER and homoclinic bifurcations and the IF and LIF models, ω_H for the supercritical Hopf bifurcation, and ω_SN for the Bautin bifurcation. The scaling of f_L^{d,max} − f_L^b, as an example, is confirmed by Figure 5. In summary, (VI) different neural models and bifurcations imply different scalings of maximal response magnitude with frequency. Most measures of population firing rate responses increase for frequencies closer to the bifurcation point (see Tables 3–6). If these models are parameterized so that frequency increases as the bifurcation parameter I_b increases through the bifurcation point, this means that populations at lower frequencies tend to display greater responses (see Figure 5 for examples). This effect is explored in the next section.
Table 3: Scaling of Deviations in Firing Rate During Stimulus, f_L^{d,max} − f_L^b, for the Different Neuron Models.

Bifurcation | f_L^{d,max} − f_L^b | Lowest-Order Scaling Near Bifurcation | Stronger or Weaker Effect as Move Away from Bifurcation, to Lowest Order: Unnormalized (Normalized by f_L^b)
SNIPER     | (1/2π)[2 Ī c_sn/ω] | ∼ 1/ω | Weaker (weaker)
Hopf       | (1/2π)[Ī c_H/√|ω − ω_H|] | ∼ 1/√|ω − ω_H| | Weaker (weaker)
Bautin     | (1/2π)[Ī |c_B|/|ω − ω_SN|] | ∼ 1/|ω − ω_SN| | Weaker (weaker)
Homoclinic | (1/2π)[Ī c_hc ω exp(2π λ_u/ω)] | ∼ ω exp(k/ω) | Weaker (weaker)
IF         | Ī | const. | Constant (weaker)
LIF        | (1/2π)[(Ī ω/g_L)(e^{2π g_L/ω} − 1)] | ∼ ω exp(k/ω) | Weaker (weaker)

Note: The positive constant k differs from case to case.
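One internal consistency check on these scalings (a sketch; the formulas quoted in the comments are the LIF deviation (Ī ω/2π g_L)(e^{2π g_L/ω} − 1) and the IF constant Ī, and the parameter values are arbitrary): the LIF entry should reduce to the IF entry as the leak conductance g_L → 0, and it should weaken as ω moves away from the bifurcation at ω = 0.

```python
import math

def lif_deviation(I_bar, omega, gL):
    """LIF during-stimulus deviation: (Ī ω / 2π g_L)(e^{2π g_L/ω} - 1)."""
    return (I_bar * omega / (2 * math.pi * gL)) * math.expm1(2 * math.pi * gL / omega)

I_bar, omega = 0.05, 1.0
# As g_L -> 0 the LIF reduces to the IF model, whose deviation is the constant Ī.
assert abs(lif_deviation(I_bar, omega, 1e-9) - I_bar) < 1e-7

# The deviation weakens as ω increases away from the bifurcation.
assert lif_deviation(I_bar, 2.0, 1.0) < lif_deviation(I_bar, 0.5, 1.0)
```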
Table 4: Scaling of Deviations in Firing Rate During Stimulus, f_L^{d,min} − f_L^b, for the Different Neuron Models.

Bifurcation | f_L^{d,min} − f_L^b | Lowest-Order Scaling Near Bifurcation | Stronger or Weaker Effect as Move Away from Bifurcation, to Lowest Order: Unnormalized (Normalized by f_L^b)
SNIPER     | 0 | Constant | Constant (constant)
Hopf       | −(1/2π)[Ī c_H/√|ω − ω_H|] | ∼ −1/√|ω − ω_H| | Weaker (weaker)
Bautin     | −(1/2π)[Ī |c_B|/|ω − ω_SN|] | ∼ −1/|ω − ω_SN| | Weaker (weaker)
Homoclinic | (1/2π) Ī c_hc ω | ∼ ω | Stronger (constant)
IF         | Ī | Constant | Constant (weaker)
LIF        | (1/2π)[(Ī ω/g_L)(1 − e^{−2π g_L/ω})] | ∼ ω | Stronger (constant)
Table 5: Scaling of Deviations in Firing Rate After Stimulus, f_L^{a,max} − f_L^b, for the Different Neuron Models.

Bifurcation | f_L^{a,max} − f_L^b | Lowest-Order Scaling Near Bifurcation | Stronger or Weaker Effect as Move Away from Bifurcation, to Lowest Order: Unnormalized (Normalized by f_L^b)
SNIPER     | (1/2π)[2 Ī c_sn/ω] | ∼ 1/ω | Weaker (weaker)
Hopf       | (1/2π)[2 Ī c_H ω / (ω √|ω − ω_H| − Ī c_H)] | ∼ 1/√|ω − ω_H| | Weaker (weaker)
Bautin     | (1/2π)[2 Ī |c_B| ω / (ω |ω − ω_SN| − Ī |c_B|)] | ∼ 1/|ω − ω_SN| | Weaker (weaker)
Homoclinic | (1/2π)[Ī c_hc ω / (1 + Ī c_hc)](exp(2π λ_u/ω) − 1) | ∼ ω exp(k/ω) | Weaker (weaker)
IF         | 0 | Constant | Constant (constant)
LIF        | (ω/2π)[Ī (1 − e^{−2π g_L/ω})(e^{2π g_L/ω} − 1) / (g_L + Ī (1 − e^{−2π g_L/ω}))] | ∼ ω exp(k/ω) | Weaker (weaker)

Note: The positive constant k differs from case to case.
Table 6: Scaling of Deviations in Firing Rate After Stimulus, f_L^{a,min} − f_L^b, for the Different Neuron Models.

Bifurcation | f_L^{a,min} − f_L^b | Lowest-Order Scaling Near Bifurcation | Stronger or Weaker Effect as Move Away from Bifurcation, to Lowest Order: Unnormalized (Normalized by f_L^b)
SNIPER     | −(1/2π)[2 Ī c_sn / (ω + 2 c_sn Ī/ω)] | ∼ −ω | Stronger (constant)
Hopf       | −(1/2π)[2 Ī c_H ω / (ω √|ω − ω_H| + Ī c_H)] | ∼ −1/√|ω − ω_H| | Weaker (weaker)
Bautin     | −(1/2π)[2 Ī |c_B| ω / (ω |ω − ω_SN| + Ī |c_B|)] | ∼ −1/|ω − ω_SN| | Weaker (weaker)
Homoclinic | (1/2π)[Ī c_hc ω (exp(−2π λ_u/ω) − 1) / (exp(−2π λ_u/ω) + Ī c_hc)] | ∼ −ω | Stronger (constant)
IF         | 0 | Constant | Constant (constant)
LIF        | (ω/2π)[Ī (e^{2π g_L/ω} − 1)(e^{−2π g_L/ω} − 1) / (g_L + Ī (e^{2π g_L/ω} − 1))] | ∼ −ω | Stronger (constant)
5 Gain of Oscillator Populations

In attempts to understand neural information processing, it is useful to understand how input signals are modified by transmission through various populations of spiking cells in different brain regions. The general way to treat this problem is via transfer functions (Servan-Schreiber, Printz, & Cohen, 1990; Gerstner & Kistler, 2002). Here we interpret the results of the previous section in terms of the amplification, or attenuation, of step function input stimuli by the neural population. We consider both extremal and average values of the firing rate f_L(t) during stepped stimuli of varying strengths and illustrate for neurons near a SNIPER bifurcation. We will use the word gain to describe the sensitivity of the resulting input-output relationship: systems with higher gain have a greater output range for a specific set of input strengths. The average firing rate during the stimulus is

⟨f_L^d⟩ ≡ 1/P,   (5.1)
where P is the period of an individual oscillator during the stimulus (see equation 4.9), and ⟨·⟩ is the average over one such period. For the special case of a population near a SNIPER bifurcation, P_SN = 2π/√(ω² + 2 c_sn Ī), so that

⟨f_{L,SN}^d⟩ = √(ω² + 2 c_sn Ī) / 2π.   (5.2)
These expressions describe the standard f–I curve typically studied for single neurons (Rinzel & Ermentrout, 1998). The instantaneous responses of neurons are in some cases of greater interest than averages such as equations 5.1 and 5.2. To derive the extremal (i.e., maximally above or below baseline) firing rates, we appeal to expressions 4.11 and 4.12, which are valid for both positive and negative values of Ī as long as v(θ, t) remains nonnegative. (However, the subsequent formulas of section 4.2 require modification: max and min must be appropriately interchanged when dealing with negative Ī.) Thus, the extremal value of f_L^d(t) is (cf. equation 4.13)

f_L^{d,ext} = (1/2π)[ω + Ī z_max]   (5.3)

in general, and in particular for the SNIPER bifurcation:

f_{L,SN}^{d,ext} = (1/2π)[ω + 2 c_sn Ī/ω].   (5.4)
In Figure 8, we plot f_{L,SN}^{d,ext} as a function of both baseline firing rate and stimulus strength Ī, where the latter takes both positive and negative values. For (here, negative) stimulus values and frequencies sufficient to cause the minimum of v(θ) to dip below zero, fixed points appear in the phase model, giving firing rates f_L^d(t) = ⟨f_{L,SN}^d⟩ = f_{L,SN}^{d,ext} = 0. Notice the increased sensitivity of extremal firing rates to changes in stimulus strength at low baseline frequencies. This "increased gain" is also shown in Figure 9a, which plots slices through Figure 8 for two different baseline frequencies. However, there is no analogous effect for the average firing rates of equation 5.2, which follow the standard frequency-current relationships for individual neurons (see Figure 9b). Note that there is always a crossing point between firing rate curves for near-SNIPER populations with high and low baseline frequencies (see Figure 9a). Above this crossing point, stimuli are more greatly amplified by the low-frequency population; below the crossing point, they are more greatly amplified by the high-frequency population. This is analogous to increasing the slope (= gain) of a sigmoidal response function as in Servan-Schreiber et al. (1990), the gain increase in Figure 1 of that article being analogous to a decrease of ω. Thus, if signal discrimination depends on extremal firing rates, the effects of gain modulation on signal and noise discrimination of Servan-Schreiber et al. (1990) could be produced by changes in baseline rate.
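The crossing point is easy to exhibit from equation 5.4 (a sketch; c_sn = 1 and the two baseline frequencies are arbitrary choices). Setting f^{d,ext}(ω₁, Ī) = f^{d,ext}(ω₂, Ī) gives the crossing current Ī* = ω₁ω₂/(2 c_sn):

```python
import math

def f_ext_sniper(omega, I_bar, c_sn=1.0):
    """Extremal rate near a SNIPER bifurcation, eq. 5.4: (ω + 2 c Ī/ω)/(2π)."""
    return (omega + 2 * c_sn * I_bar / omega) / (2 * math.pi)

c_sn = 1.0
w_low, w_high = 0.01, 0.02             # two illustrative baseline frequencies (rad/msec)
I_cross = w_low * w_high / (2 * c_sn)  # current at which the two response curves cross

# Above the crossing the low-frequency population responds more strongly;
# below it, the high-frequency population does.
assert f_ext_sniper(w_low, 2 * I_cross, c_sn) > f_ext_sniper(w_high, 2 * I_cross, c_sn)
assert f_ext_sniper(w_low, 0.5 * I_cross, c_sn) < f_ext_sniper(w_high, 0.5 * I_cross, c_sn)
assert abs(f_ext_sniper(w_low, I_cross, c_sn) - f_ext_sniper(w_high, I_cross, c_sn)) < 1e-12
```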
Figure 8: Maximum and minimum firing rate f_{L,SN}^{d,ext} of a population of stimulated HR neurons in Hz, as a function of baseline frequency (Hz) and applied current strength Ī (μA/cm²).
6 Discussion

We now provide further comments on how the mechanisms studied in this article could be applied and tested. As discussed in section 5, and with regard to the locus coeruleus (LC) in section 1, baseline frequency-dependent variations in the sensitivity of neural populations to external stimuli could be used to adjust gain in information processing. The effect could be to engage the processing units relevant to specific tasks and, as in Servan-Schreiber et al. (1990) and Usher et al. (1999), to additionally sensitize these units to salient stimuli. See Brown et al. (2003b) for details of the LC application. We recall that section 4.2 described the different types of poststimulus ringing of firing rates f_L(t) that occur for the various neuron models. This phase-resetting effect has long been studied in theoretical and experimental neuroscience (e.g., Winfree, 2001; Tass, 1999; Makeig et al., 2002). As we show here (see equation 4.19 and Figure 7), for neuron models having a phase response curve z(θ) that takes negative values, the greatest deviations from baseline firing rates can occur significantly after stimulus end. Subpopulations of such neurons could be used in detecting termination of sensory stimuli. Elevated firing rates f_L(t) that remain (or are enhanced)
0, we have

D_η(p̃, q_m) > D_η(p̃, q_{m+1}).   (2.26)
Thus, we can say that the model q_m(y|x) approaches the empirical distribution in the sense of D_η approximation.

3 Properties of η-Boost

3.1 Model Associated with η-Boost. We next fix a vector f(x) of weak hypotheses f_1(x), …, f_m(x) and a parameter α₀ ∈ R^m. Let g_{η₀}(x, y) be a probability density function of (X, Y), and assume that

g_{η₀}(x, y) = g(x) g_{η₀}(y|x) = g(x) [(1 − η₀) e^{α₀·f(x) y} + η₀] / [(1 − η₀)(e^{α₀·f(x)} + e^{−α₀·f(x)}) + 2η₀],   (3.1)
where 0 ≤ η₀ < 1 and α·f(x) denotes the inner product. This model is identical to equation 2.17 with η = η₀ and F*_η(x) = α₀·f(x). Since η₀ is basically unknown, we optimize the parameter α by the loss function generated by (1 − η₁)e^{−z} − η₁z for some fixed η₁, 0 ≤ η₁ < 1. Let

α(η₁) = argmin_α E[(1 − η₁) e^{−α·f(X)Y} − η₁ α·f(X)Y],   (3.2)
T. Takenouchi and S. Eguchi
where E denotes the expectation with respect to the joint distribution g_{η₀}(x, y) of (X, Y). We note that α(η₁) satisfies the equilibrium condition

E[f(X) Y {−(1 − η₁) e^{−α(η₁)·f(X)Y} − η₁}] = 0.   (3.3)

We now consider the abstract error rate for a discriminant function α(η₁)·f(x) as

L(α(η₁)·f) = E[I(α(η₁)·f(X) Y < 0)].   (3.4)
In practical situations, we cannot determine the exact value of this abstract error L; however, we can usually estimate this value by a validation technique based on given training data. Then we obtain the following theorem.

Theorem 1. For any fixed η₁, 0 ≤ η₁ < 1,

L(α(η₁)·f) ≥ L(α(η₀)·f) = L(α₀·f),   (3.5)
where α₀ is the parameter of the probability density function g_{η₀}(x, y) defined in equation 3.1.

Proof. First, we show that α(η₀) = α₀. If we replace α(η₀) with α₀, we have the following equation, from equation 3.3:

∫ { [((1 − η₀) e^{α₀·f(x)} + η₀)(−(1 − η₀) e^{−α₀·f(x)} − η₀) − ((1 − η₀) e^{−α₀·f(x)} + η₀)(−(1 − η₀) e^{α₀·f(x)} − η₀)] / [(1 − η₀)(e^{α₀·f(x)} + e^{−α₀·f(x)}) + 2η₀] } f(x) g(x) dx = 0,

since the terms in the braces cancel. Thus, α₀ is a solution of equation 3.3. Since the loss function generated by (1 − η₁)e^{−z} − η₁z is convex in z, equation 3.2 has a unique solution. Thus, we obtain α(η₀) = α₀. Next, we consider the Bayes rule as

λ_{η₀}(x) = log [g_{η₀}(1|x) / g_{η₀}(−1|x)].   (3.6)
Noting that α(η₀)·f(x) > 0 ⇔ λ_{η₀}(x) > 0, we have

L(α(η₀)·f) = E[I(α(η₀)·f(X) Y < 0)] = L(λ_{η₀}).   (3.7)

Since the Bayes rule minimizes the abstract error rate, equation 3.4 (see McLachlan, 1992), we can prove that L(α(η₁)·f) ≥ L(α(η₀)·f) for all 0 ≤ η₁ < 1.
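Two steps of this proof lend themselves to quick numerical checks (a sketch; η and the grid of z values are arbitrary choices): convexity of the loss generator (1 − η)e^{−z} − ηz, and the sign equivalence between α₀·f(x) and the Bayes log-odds λ_{η₀}(x):

```python
import math

def loss_gen(z, eta):
    """η-Boost loss generator: (1 - η)e^{-z} - η z."""
    return (1 - eta) * math.exp(-z) - eta * z

def bayes_log_odds(z, eta):
    """λ_η(x) = log g_η(1|x)/g_η(-1|x) for the contaminated model, z = α0·f(x)."""
    return math.log(((1 - eta) * math.exp(z) + eta) /
                    ((1 - eta) * math.exp(-z) + eta))

eta = 0.3
# Midpoint convexity of the loss generator on a few intervals.
for a, b in [(-2.0, 1.0), (0.0, 3.0), (-1.0, -0.2)]:
    assert loss_gen((a + b) / 2, eta) <= (loss_gen(a, eta) + loss_gen(b, eta)) / 2 + 1e-12

# sign(λ_η) = sign(α0·f): the contaminated Bayes rule agrees with the linear rule.
for z in (-2.0, -0.1, 0.1, 2.0):
    assert (bayes_log_odds(z, eta) > 0) == (z > 0)
```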
Robustifying AdaBoost
This theorem shows that the abstract error rate, equation 3.4, reaches a minimum when we use the loss function generated by (1 − η₀)e^{−z} − η₀z. Hence, estimating η₀ using this property would be beneficial. In fact, we evaluate the abstract error rate by 10-fold cross-validation, which will be described in a subsequent discussion.

3.2 Interpretation of Our Model. We now discuss the statistical interpretation of the model, equation 3.1. Consider a degree of deviation of g_{η₀}(y|x) from the logistic model,

g₀(y|x) = e^{α₀·f(x) y} / (e^{−α₀·f(x)} + e^{α₀·f(x)}).   (3.8)
See Eguchi and Copas (2002) regarding properties of the logistic model. We next observe that

g_{η₀}(y|x) = (1 − ε(η₀, α₀·f(x))) g₀(y|x) + ε(η₀, α₀·f(x)) g₀(−y|x),   (3.9)

where

ε(η₀, z) = η₀ / [(1 − η₀)(e^z + e^{−z}) + 2η₀].   (3.10)
Thus, we can interpret the model, equation 3.1, as a contamination model in which transpositions of the label y occur with probability ε(η₀, α₀·f(x)). Note that ε(η₀, α₀·f(x)) depends on η₀, α₀, and x: it is maximized at α₀·f(x) = 0, where its maximum value is η₀/2. Therefore, contamination occurs primarily near the boundary, and the probability of contamination decreases exponentially as x moves away from the boundary (see Figure 4).

3.3 Normalized Model. In section 2.3, we discussed the unnormalized set P. Here, we treat the set Δ in P, which is the set of normalized conditional probability distributions,

Δ = { p(y|x) ∈ P : Σ_{y∈Y} p(y|x) = 1 for each x ∈ X }.   (3.11)
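The decomposition in equations 3.9 and 3.10 can be verified directly (a sketch with a scalar discriminant value z = α₀·f(x) and an arbitrary η₀):

```python
import math

def g_eta(z, y, eta):
    """Model (3.1) conditional: ((1-η)e^{zy} + η) / ((1-η)(e^z + e^{-z}) + 2η)."""
    return ((1 - eta) * math.exp(z * y) + eta) / \
           ((1 - eta) * (math.exp(z) + math.exp(-z)) + 2 * eta)

def g0(z, y):
    """Logistic model (3.8)."""
    return math.exp(z * y) / (math.exp(z) + math.exp(-z))

def eps(eta, z):
    """Contamination probability (3.10)."""
    return eta / ((1 - eta) * (math.exp(z) + math.exp(-z)) + 2 * eta)

eta, z, y = 0.3, 0.7, 1
lhs = g_eta(z, y, eta)
rhs = (1 - eps(eta, z)) * g0(z, y) + eps(eta, z) * g0(z, -y)
assert abs(lhs - rhs) < 1e-12                 # decomposition (3.9)
assert abs(eps(eta, 0.0) - eta / 2) < 1e-12   # maximal contamination η0/2 at the boundary
assert eps(eta, 3.0) < eps(eta, 0.5)          # contamination decays away from the boundary
```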
Q q/ where .q.yjx/¡´/ We now consider the minimization problem of D´ .p; is proportional to the logistic model, equation 3.8, such that q.yjx/ ¡ ´ D g0 .yjx/: 1 ¡ 2´
(3.12)
Figure 4: Graph showing ε(η₀, z) with respect to η₀ and z.
This model can be rewritten as

q(y|x) = (1 − η) g₀(y|x) + η g₀(−y|x).   (3.13)
This model is a contamination model in which the label y is transposed with the constant probability η (Copas, 1988). In addition, the above minimization problem is equivalent to the minimization of the loss function generated by (1 − 2η) log(1 + e^{−2z}) − ηz.

4 Simulation Study

In this section, we describe a number of simulation studies using synthetic data sets and a real data set. One simulation was intended to demonstrate theorem 1, showing that η-Boost is more robust than AdaBoost under the contaminated model, equation 3.1. A second simulation dealt with a case in which the model was not described by equation 3.1 but in which the data set was affected by some noise. Throughout this section, we implemented AdaBoost, η-Boost, LogitBoost (Friedman et al., 2000), MadaBoost (Domingo & Watanabe, 2000), and AdaBoost_reg (Rätsch, Onoda, & Müller, 2001).

4.1 Case 1. We examined a case in which a training data set was generated by model 3.1. Weak hypotheses were stumps (Friedman et al., 2000).
Figure 5: Training data set generated by equation 4.1. Circled examples were contaminated. In this figure, 31 examples were contaminated.
Let g(x) be the probability density function of the two-dimensional uniform distribution on (−π, π) × (−π, π) and assume

g_{η₀}(y|x) = [(1 − η₀) e^{(x₂ − 3 sin x₁) y} + η₀] / [(1 − η₀)(e^{x₂ − 3 sin x₁} + e^{−(x₂ − 3 sin x₁)}) + 2η₀].   (4.1)
We generated 200 observations of training data and 2000 observations of test data, which followed equation 4.1. Figure 5 shows a training data set. We fixed η₀ = 0.5. Examples were observed to be particularly contaminated near the boundary. In Figure 6, the training error rate, the test error rate, and the 10-fold cross-validation error rate of η-Boost at M = 50 are shown for each η. The training error rate was observed to be minimized at η = 0, but the test error rate and the 10-fold cross-validation error rate were both minimized near η = 0.5. If we select η based on the 10-fold cross-validation error rate, η-Boost is more robust than AdaBoost. Figure 7 shows that the decision boundary constructed by AdaBoost was affected by contaminated examples.

Figure 6: Test error, 10-fold cross-validation error, and training error at M = 50 with respect to η. The circled point is the minimum value of each curve.

In contrast, the decision boundary constructed by η-Boost appeared robust to contaminated examples, as shown in Figure 8. Table 1 provides the test error rates and training error rates of several methods. The value of η in η-Boost was determined by 10-fold cross-validation, as were the tuning parameters C, p of AdaBoost_reg. η-Boost outperformed the other methods in terms of test error, as predicted by theorem 1.

4.2 Case 2. We next investigated the case of a data set that was not generated by model 3.1 but was affected by some type of noise, so that AdaBoost does not work well. The class F of weak hypotheses and the distribution of x were the same as in case 1, and we made the following assumptions:

g′_{η₀}(y|x) = (1 − η₀) e^{(x₂ − 3 sin x₁) y} / (e^{x₂ − 3 sin x₁} + e^{−(x₂ − 3 sin x₁)}) + η₀ e^{−(x₂ − 3 sin x₁) y} / (e^{x₂ − 3 sin x₁} + e^{−(x₂ − 3 sin x₁)}),   (4.2)
where η₀ was fixed as 0.1. This is a contamination model in which transposition of the label y occurs with probability η₀ uniformly throughout
Figure 7: The discriminant function constructed by AdaBoost.
the feature space. We generated 200 observations of training data and 2000 observations of test data from this model. Table 2 shows the results. η-Boost is superior to the other methods, except AdaBoost_reg, in terms of test error, although the simulation setting, equation 4.2, was not suited to that associated with η-Boost as in equation 4.1.

4.3 Breast Cancer Wisconsin. We used a real data set from the UC-Irvine machine learning archive, Breast Cancer Wisconsin. This data set consisted of 683 observations. A feature vector x consisted of nine variables, which were coded on a scale of 1 to 10, and the number of classes was two. We split the data into 500 training examples and 183 test examples. Table 3 shows that AdaBoost_reg was superior to the other methods in terms of test error. We then considered a further case in which some examples were deliberately given reversed labels. This artificial mislabeling was randomly performed only for examples near the decision boundary, which was obtained by one of the above methods applied to the real data set, for example, AdaBoost. The choice of method does not affect the result. Approximately 10% of the example labels were exchanged.
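The artificial mislabeling protocol just described can be sketched as follows (the function name, margin threshold, and handling of the 10% fraction are illustrative assumptions, not the authors' code):

```python
import random

def flip_labels_near_boundary(X, y, decision, frac=0.10, margin=0.5, seed=0):
    """Flip labels only for examples with |decision(x)| < margin,
    until roughly `frac` of all labels have been exchanged."""
    rng = random.Random(seed)
    near = [i for i, x in enumerate(X) if abs(decision(x)) < margin]
    rng.shuffle(near)
    n_flip = min(len(near), int(frac * len(y)))
    y_noisy = list(y)
    for i in near[:n_flip]:
        y_noisy[i] = -y_noisy[i]
    return y_noisy, n_flip

# Toy check: 1-d data with decision function F(x) = x.
X = [i / 10.0 - 1.0 for i in range(21)]          # -1.0, -0.9, ..., 1.0
y = [1 if x >= 0 else -1 for x in X]
y_noisy, n_flip = flip_labels_near_boundary(X, y, decision=lambda x: x)
assert n_flip == int(0.10 * len(y))
assert sum(a != b for a, b in zip(y, y_noisy)) == n_flip
assert all(abs(X[i]) < 0.5 for i in range(len(X)) if y[i] != y_noisy[i])
```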
Figure 8: The discriminant function constructed by η-Boost, where η was selected by the 10-fold cross-validation error.

Table 1: Test Error Rates and Training Error Rates for the Synthetic Data Set.
Method       | Test Error | Training Error
AdaBoost     | 0.2180 | 0.1224
η-Boost      | 0.1934 | 0.1410
LogitBoost   | 0.2352 | 0.1488
MadaBoost    | 0.2433 | 0.1588
AdaBoost_reg | 0.1972 | 0.0930
Table 2: Test Error Rates and Training Error Rates for the Synthetic Data Set.

Method       | Test Error | Training Error
AdaBoost     | 0.2461 | 0.1609
η-Boost      | 0.2439 | 0.1991
LogitBoost   | 0.2566 | 0.1740
MadaBoost    | 0.2565 | 0.1753
AdaBoost_reg | 0.2256 | 0.1145
Table 3: Test Error Rates and Training Error Rates for the Breast Cancer Wisconsin Data Set.

Method       | Test Error | Training Error
AdaBoost     | 0.0494 | 0.0168
η-Boost      | 0.0508 | 0.0386
LogitBoost   | 0.0549 | 0.0230
MadaBoost    | 0.0562 | 0.0222
AdaBoost_reg | 0.0285 | 0.0475
Table 4: Test Error Rates and Training Error Rates for the Breast Cancer Wisconsin Data Set with Synthetic Noise.

Method       | Test Error | Training Error
AdaBoost     | 0.0578 | 0.0646
η-Boost      | 0.0527 | 0.0773
LogitBoost   | 0.0631 | 0.0712
MadaBoost    | 0.0666 | 0.0754
AdaBoost_reg | 0.0539 | 0.0714
The results of this test are shown in Table 4. η-Boost outperformed the other methods in terms of test error.

4.4 Comparison. Let us now discuss the relation between η-Boost and the previously proposed boosting algorithms LogitBoost, MadaBoost, and AdaBoost_reg. LogitBoost can be viewed as the normalized version of AdaBoost (see Lebanon & Lafferty, 2001), in which the loss function is

L_Logit(F) = Σ_{i=1}^N log(1 + e^{−2 F(x_i) y_i}).
The loss function of MadaBoost is

L_Mada(F) = Σ_{i=1}^N U(F(x_i) y_i),
where U(z) is

U(z) = 1/2 − z   (z < 0);   U(z) = (1/2) e^{−2z}   (z ≥ 0).
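The growth of these losses for badly misclassified examples (large negative margin z = F(x)y) is what the robustness discussion below turns on; a small plain-Python comparison (the specific margin value is an arbitrary choice):

```python
import math

def exp_loss(z):            # AdaBoost's exponential loss
    return math.exp(-z)

def logit_loss(z):          # LogitBoost's loss
    return math.log(1 + math.exp(-2 * z))

def mada_loss(z):           # MadaBoost's U(z)
    return 0.5 - z if z < 0 else 0.5 * math.exp(-2 * z)

# At a large negative margin the exponential loss explodes, while the
# saturated losses grow only linearly (bounded slope) -- the B-robustness property.
z = -5.0
assert exp_loss(z) > 100
assert mada_loss(z - 1) - mada_loss(z) == 1.0          # exact unit slope for z < 0
assert logit_loss(z - 1) - logit_loss(z) < 2.0 + 1e-9  # slope bounded by 2
```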
A loss function of AdaBoost_reg is regularized as follows:

L_reg(F) = Σ_{i=1}^N exp(−F(x_i) y_i − C μ(F(x_i), y_i; p)),
where C, p are tuning parameters and μ(F(x_i), y_i; p) is a nonnegative regularization quantity, which takes a high value if an example is difficult to learn. If C = 0, AdaBoost_reg reduces to AdaBoost. (See Rätsch et al., 2001, for a detailed discussion.) In the above simulation results, η-Boost and AdaBoost_reg were observed to have good robustness for all types of noisy data sets. η-Boost has an advantage in computational simplicity compared with AdaBoost_reg. In the algorithm of AdaBoost_reg, we need numerical optimization for α, the coefficient of a weak hypothesis, as is also the case for LogitBoost and MadaBoost. Additionally, the algorithm has two tuning parameters C and p, and C takes a wide range of values. Thus, AdaBoost_reg has a high computational cost with respect to time. In contrast, the algorithm of η-Boost has the closed form of α given in equation 2.2, and the only tuning parameter is η, which lies between 0 and 1. The loss functions of LogitBoost and MadaBoost are saturated in the sense of growing only linearly with F(x)y. Such losses are said to be B-robust (Hampel et al., 1986) in the feature space x, and MadaBoost is the most B-robust (Murata et al., 2002). But in the classification context, it is more important and effective to consider robustness to outliers in the label space y. This consideration leads to the contamination modeling associated with the loss function of η-Boost. We observed that η-Boost is more effective for several noisy data sets than LogitBoost and MadaBoost.

5 Conclusion

We have derived η-Boost by adding a simple modification to AdaBoost. The algorithm used is equivalent to the sequential minimization of the loss function, which is a mixture of the exponential loss function and the naive error loss function. In addition, we can interpret this minimization problem as the sequential minimization of the η-divergence between the empirical distribution and the extended exponential model.
The key feature of η-Boost is the consideration of the uniform weight distribution and the naive error rate. With respect to the weight distribution for the training data set, η-Boost moderates the exponential update rule of AdaBoost by using the uniform weight distribution. In addition, the coefficient of a weak hypothesis f(x) in the output function is tuned by the unweighted error rate of f(x). These operations make the η-Boost algorithm robust. We have given a statistical interpretation of η-Boost, demonstrating that η-Boost is especially robust when dealing with a contaminated logistic model, which has more contamination near the class boundary. We numerically compared η-Boost with several other methods and confirmed this by computer simulation. For the real data set, we compared η-Boost with several other methods and confirmed that η-Boost attains better robustness for noisy data sets.
References Copas, J. (1988). Binary regression models for contaminated data. J. Royal Statist. Soc. B, 50, 225–265. Csisz a´ r, I. (1984). Sanov property, generalized I-projection and a conditional limit theorem. Ann. Probability, 12, 768–793. Domingo, C., & Watanabe, O. (2000). MadaBoost: A modication of AdaBoost. In Proceedings of the 13th Conference on Computational Learning Theory. San Mateo, CA: Morgan Kaufmann. Eguchi, S., & Copas, J. (2001). Recent developments in discriminant analysis from an information geometric point of view. J. Korean Statist. Soc., 30, 247– 264. Eguchi, S., & Copas, J. (2002). A class of logistic type discriminant functions. Biometrika, 89, 1–22. Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of online learning and an application to boosting. J. Computer and System Sciences, 55, 119–139. Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Ann. Statist., 28, 337–407. Hampel, F. R., Rousseeuw, P. J., Ronchetti, E. M., & Stahel, W. A. (1986). Robust statistics: The approach based on inuence functions. New York: Wiley. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer-Verlag. Huber, P. J. (1981). Robust statistics. New York: Wiley. Lebanon, G., & Lafferty, J. (2001). Boosting and maximum likelihood for exponential models. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press. Mason, L. Baxter, J. Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent in function space. In S. A. Solla, T. K. Leen, & K.-R. Muller ¨ (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press. Mclachlan, G. (1992). Discriminant analysis and statistical pattern recognition. New York: Wiley. Murata, N., Eguchi, S., Takenouchi, T., & Kanamori, T. (2002). 
Information geometry of U-boost and Bregman divergence (ISM research memorandum 860). Tokyo: Institute of Statistical Mathematics.
Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287–320.
Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5, 197–227.
Schapire, R. (1999). Theoretical view of boosting. In Proceedings of the 4th European Conference on Computational Learning Theory.

Received January 2, 2003; accepted June 27, 2003.
Communicated by Peter Bartlett
LETTER
Boosting with Noisy Data: Some Views from Statistical Theory Wenxin Jiang
[email protected] Department of Statistics, Northwestern University, Evanston, IL 60208, U.S.A.
This letter is a comprehensive account of some recent findings about AdaBoost in the presence of noisy data, approached from the perspective of statistical theory. We start from the basic assumption of weak hypotheses used in AdaBoost and study its validity and its implications for the generalization error. We recommend studying the generalization error and comparing it to the optimal Bayes error when data are noisy. Analytic examples are provided to show that running the unmodified AdaBoost forever will lead to overfitting. On the other hand, there exist regularized versions of AdaBoost that are consistent, in the sense that the resulting prediction will approximately attain the optimal performance in the limit of large training samples.
1 Introduction

During the past decade, especially after the invention of the popular AdaBoost algorithm (Freund & Schapire, 1997), boosting techniques have attracted much attention from researchers in computer science, machine learning, and statistics. These techniques are characterized by the use of a sequential linear combination of functions from a base hypothesis space for the purpose of classification or regression. At each time or step $t$, the linear combination incorporates one more term so as to optimize an objective function. Examples include AdaBoost (Freund & Schapire, 1997) for classification and Matching Pursuit (Mallat & Zhang, 1993) for regression. As Schapire (1999) stated in an overview article, "The most basic theoretical property of AdaBoost concerns its ability to reduce training error." The training error decreases exponentially fast, subject to an assumption of "weak base hypotheses," which we will discuss here. Furthermore, AdaBoost can also reduce the generalization error. In many cases, the generalization error continues to decrease after hundreds of steps, even after the training error becomes zero. There have been some explanations of this phenomenon based on upper bounds using the theory of the margin (Schapire, Freund, Bartlett, & Lee, 1998; Koltchinskii & Panchenko, 2002) or the top (Breiman, 1997a).

Neural Computation 16, 789–810 (2004)
© 2004 Massachusetts Institute of Technology
However, there are some limitations of these previous results. First, regarding the results on training error, the validity and implications of the assumption of weak base hypotheses are not clear. For general sequential optimization algorithms, the assumption of weak base hypotheses requires that the percentage improvement of the objective function be bounded away from zero at each time $t$, so as to guarantee an exponential rate of overall improvement. For AdaBoost, this is equivalent to requiring that the error $\hat\epsilon_t$ on a weighted data set be bounded away from 0.5, the value corresponding to random guessing. Schapire et al. (1998) are uncertain whether the "weak edge" $(\hat\epsilon_t - 0.5)$ will eventually deteriorate to zero. They state that "characterizing the conditions under which the increase is slow is an open problem." The first half of this article addresses this open problem. Second, regarding the previous results on generalization error, no comparison is made to the optimal standard: the Bayes error. Consider the case of noisy data. Here $P(Z = 1 \mid X) \notin \{0, 1\}$, where $X$ is the predictor or input (e.g., patient characteristics such as age, smoking status, family history) and $Z$ is the response or label (e.g., whether the patient has breast cancer). No prediction rule for $Z$ given $X$ can be perfect here, since two patients can have the same $X$ yet different $Z$'s. This is in contrast to traditional machine learning, where $X$ completely determines $Z$. The best possible generalization error for the noisy case is the so-called Bayes error (e.g., Devroye, Györfi, & Lugosi, 1996): $L^* \equiv E_X \min\{P(Z = 1 \mid X),\; P(Z = -1 \mid X)\} = \inf(\text{Gen. error})$, which is also a measure of the intrinsic level of noise and plays a role similar to the error variance in regression problems. Therefore, when data are noisy, it is more appropriate to study the difference (Gen. error) − (Bayes error) rather than (Gen. error) itself.
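To make the definition concrete, here is a minimal numeric sketch of the Bayes error for a discrete predictor; the distribution below is invented purely for illustration:

```python
# Sketch: Bayes error L* = E_X min{P(Z=1|X), P(Z=-1|X)}
# for a hypothetical discrete predictor X taking three values.

# P(X = x) and P(Z = 1 | X = x) -- invented numbers for illustration
p_x = {0: 0.5, 1: 0.3, 2: 0.2}
p_z1_given_x = {0: 0.9, 1: 0.5, 2: 0.2}

# L* = sum_x P(X = x) * min{P(Z = 1 | x), P(Z = -1 | x)}
bayes_error = sum(p * min(p_z1_given_x[x], 1 - p_z1_given_x[x])
                  for x, p in p_x.items())

print(round(bayes_error, 3))  # 0.5*0.1 + 0.3*0.5 + 0.2*0.2 = 0.24
```

The Bayes-optimal rule predicts the more probable label at each value of $X$; no classifier, boosted or otherwise, can have generalization error below $L^*$ on this distribution.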
Many previous works on AdaBoost bound the generalization error with sample quantities related to, for example, the margin (Schapire et al., 1998; Koltchinskii & Panchenko, 2002) or the top (Breiman, 1997a), without comparing to the Bayes error. However, when data are noisy (with a nonzero Bayes error), the generalization error (or any of its upper bounds) cannot be $o_n(1)$. (Here, $n$ denotes the sample size and $o_n(1)$ denotes a small term that converges to zero as $n$ increases.) One can only hope for consistency, that is, the difference being small: $(\text{Generalization error}) - (\text{Bayes error}) = o_n(1)$. The second half of this article compares the generalization error to the optimal Bayes error. We will consider the following questions:

- Under what conditions will the assumption of weak base hypotheses be valid? Is the assumption restrictive or widely valid? What are some implications of the assumption for the generalization error?
- Is it good to run AdaBoost forever? Is $\lim_{t \to \infty}$ (generalization error) good, or is it suboptimal when compared to the optimal Bayes error?
- How good is the best prediction generated during the process of AdaBoost? Is $\inf_t$ (generalization error) good, or is it suboptimal when compared to the optimal Bayes error?
- Is regularization unnecessary for AdaBoost, or is it potentially helpful?

A careful study of these questions is reported in this article, which summarizes the results from our less accessible technical reports and extended abstracts (Jiang, 1999, 2000a, 2000b, 2000c; Jiang, in press). Due to limitations in programming expertise, we have opted for an analytic approach from statistical theory. Therefore, the conclusions are often obtained in situations that are friendly to analytic treatment, and the focus is on AdaBoost. The context and conditions for each result are carefully reported. Despite these limitations, our preliminary study of the posed questions suggests a coherent picture of the behavior of AdaBoost when treating noisy data and emphasizes the importance of regularization. There are numerous works on other interesting aspects of boosting with general cost functions, to name a few: Lugosi and Vayatis (2002) and Mannor, Meir, and Zhang (2002) on consistency of regularized boosting algorithms; Zhang (in press), Lin (2001), and Bartlett, Jordan, and McAuliffe (2003) on cost functions and risk bounds; Zhang and Yu (2003) on convergence analysis; Blanchard, Lugosi, and Vayatis (2003) on rates of convergence; Rosset, Zhu, and Hastie (2002) on boosting and maximum margin; and Efron, Hastie, Johnstone, and Tibshirani (2002) on least angle regression. The rest of this article is organized as follows. Section 3 serves as a prelude, briefly summarizing some findings about Matching Pursuit (MP) (Mallat & Zhang, 1993), which is recognized as a regression-boosting algorithm by Friedman, Hastie, and Tibshirani (2000).
These results have led us to suspect that AdaBoost for classification may possess similar properties when handling noisy data; this forms the main topic of the article and will be considered in detail. We begin by introducing the formalisms for regression, classification, and boosting.

2 Regression, Classification, and Boosting

In statistical learning, we have a set of data $S = (X_i, Y_i)_1^n$, where the predictors $X_1^n$ are assumed here (only for convenience) to lie in $[0, 1]$ and to take distinct values. We allow the responses $Y_1^n$ to be random (conditional on the $X_i$'s) to accommodate potential noise in the data. We use lowercase letters to denote sample realizations of random objects; for example, $x_1^n$ denotes a set of values for $X_1^n$, called a set of design points. The sample size is $n$. The $Y_i$'s are real for regression problems and are $\{0, 1\}$ valued for classification problems, where the transform $Z_i = 2Y_i - 1$, valued in $\{-1, +1\}$ (the labels), is often used. In learning, we usually have a hypothesis space of real regression functions $\mathcal{H}_r$ or a hypothesis space of $\{\pm 1\}$-valued classification functions $\mathcal{H}_c$ to fit
the data. Here, a hypothesis space $\mathcal{H}_{r,c}$ is a set of functions $f: [0,1] \mapsto \Re$ or $f: [0,1] \mapsto \{\pm 1\}$, respectively. A relatively simple hypothesis space, called the base hypothesis space or base system $\mathcal{H}_{r,c}$, can be made more complex by taking linear combinations of $t$ members, giving the $t$-combined system or $t$-combined hypothesis space, denoted $\mathrm{lin}_t(\mathcal{H}_{r,c})$. Formally, $\mathrm{lin}_t(\mathcal{H}) = \{\sum_1^t \alpha_s f_s : (\alpha_s, f_s) \in \Re \times \mathcal{H}\}$. A sign-valued function $\hat f = \hat f(\cdot \mid S)$ from $\mathcal{H}_c$ can be selected, based on a training sample $S = (X_i, Z_i)_1^n$, to form a prediction of future labels given a future predictor. The training error is defined to be $n^{-1} \sum_{i=1}^n I[Z_i \ne \hat f(X_i \mid S)]$. The generalization error is defined to be $P[Z \ne \hat f(X \mid S)]$, the probability of misclassifying a future predictor-response pair $(X, Z)$. It is assumed that $(X_i, Z_i)_1^n$ and $(X, Z)$ are independent and identically distributed (i.i.d.). We briefly note that the corresponding concepts in the regression case can be defined based on the squared error. In boosting, a cost function $C(F \mid S)$ is used, which depends on the sample $S$ and is a functional of $F$, where $F \in \mathrm{lin}_t(\mathcal{H}_{r,c})$ for some $t$. A boosting algorithm acting on a base system $\mathcal{H}_{r,c}$ minimizes $C(F \mid S)$ with respect to $F$ sequentially and adaptively, in a way similar to the following. (We assume complete minimizations in step 2a for simplicity; approximate minimizations can also be allowed; see, e.g., Jiang, 1999.)

1. Let $\hat F_0 = 0$.
2. For all $t = 1, 2, \ldots$:

a. Let $\hat\alpha_t \hat f_t = \arg\min_{\alpha f} C(\hat F_{t-1} + \alpha f \mid S)$.

[…]

Theorem 2. Assume $E[M_\alpha(X, Y)] > 0$. Then the following inequality holds:

$$\mathcal{E}(F_\alpha) \le \frac{\mathrm{Var}[M_\alpha(X, Y)]}{E[M_\alpha(X, Y)]^2}. \tag{3.1}$$

Proof. The proof is immediate:

$$\begin{aligned}
\mathcal{E}(F_\alpha) &\le P[M_\alpha(X, Y) \le 0] \\
&= P\big[M_\alpha(X, Y) - E[M_\alpha(X, Y)] \le -E[M_\alpha(X, Y)]\big] \\
&\le P\big[(M_\alpha(X, Y) - E[M_\alpha(X, Y)])^2 \ge E[M_\alpha(X, Y)]^2\big] \\
&\le \mathrm{Var}[M_\alpha(X, Y)] / E[M_\alpha(X, Y)]^2,
\end{aligned}$$
where the last line results from Markov's inequality. Note that there exist various refinements and variants of this inequality. Amit and Geman (1997) and Amit and Blanchard (2001) derive a slightly different analysis for multiclass problems, where it is shown that the variance factor can essentially be interpreted as the average covariance, conditional on class, of two base classifiers in the aggregate. In Devroye, Györfi, and Lugosi (1996), a somewhat tighter bound based on the same quantities can be found. What is the significance of this inequality? On the one hand, it is certainly true that Chebyshev's inequality is very coarse and that the above bound cannot be very tight. On the other hand, the mean and variance of the margin function are two elementary statistical quantities that should be estimable from their empirical counterparts on the training set without excessive error (for a theoretical study supporting this point, see Blanchard, 2001). It is therefore more relevant to read this inequality as an indication of what we should try to do in practice: find some compromise between the mean and variance of the margin function (we want high mean but low variance), based on their empirical estimates. Furthermore, another argument supporting this approach comes from the following useful comparison with Fisher's discriminant. It is indeed interesting to note that while the minimax margin philosophy
G. Blanchard
was comparable to an SVM classifier in a linear classification framework, the mean-and-variance philosophy is comparable to the principle underlying Fisher's linear discriminant classifier. The following simple theorem makes the link apparent by showing that when we apply the same line of reasoning (based on Chebyshev's inequality) to a Euclidean classification setup, we naturally find Fisher's discriminant function.

Theorem 3. Consider a classification problem in a Euclidean space $\mathcal{F}$; let $m_1$ and $m_{-1}$ be the average values of class 1 and class $-1$. For $a \in \mathcal{F}$, let $f_a$ be the linear classifier defined by $f_a(x) = \mathrm{sign}(\langle a, x - (m_1 + m_{-1})/2 \rangle)$. Define the margin function $M_a(x, y) = y \langle a, x - (m_1 + m_{-1})/2 \rangle$. Then the following inequality holds for any $a$ such that $\langle a, m_1 \rangle > \langle a, m_{-1} \rangle$:

$$\mathcal{E}(f_a) \le \frac{\mathrm{Var}[M_a]}{E[M_a]^2} = 4\, \frac{p_1 s_1^2 + p_{-1} s_{-1}^2}{\langle a, m_1 - m_{-1} \rangle^2}, \tag{3.2}$$

where $p_i = P(Y = i)$ and $s_i^2 = \mathrm{Var}[\langle a, X \rangle \mid Y = i]$, $i = -1, 1$.

Proof. The first inequality results from Chebyshev's inequality, as in theorem 2. The second equality is just a little calculation: for $i = -1, 1$ we note that $E[M_a \mid Y = i] = \langle a, (m_1 - m_{-1})/2 \rangle$, and thus $E[M_a] = p_1 E[M_a \mid Y = 1] + p_{-1} E[M_a \mid Y = -1] = \langle a, (m_1 - m_{-1})/2 \rangle$, and $\mathrm{Var}[M_a] = E[\mathrm{Var}[M_a \mid Y]] + \mathrm{Var}[E[M_a \mid Y]] = p_1 s_1^2 + p_{-1} s_{-1}^2 + 0$.

Thus, a very similar mean-variance analysis performed in a linear classification setup leads us to inequality 3.2, where the right-hand side is exactly the inverse of Fisher's discriminant function, used precisely to choose the optimal direction $a$ of the linear classifier (see Figure 3). Note that Fisher's linear discriminant is traditionally justified by considering the idealized situation where the classes have gaussian distributions with identical covariance matrices, but this is rarely the case in practice; hence, in more generic situations it can be seen as essentially based on the mean-variance approach. The point of this philosophy is to be suitable in situations where the two classes form two overlapping clusters, that is, when there exist noisy regions, whereas the minimax philosophy is more suited to situations where
Different Paradigms for Choosing Sequential Reweighting Algorithms
Figure 3: A pictorial idealization of the mean-and-variance paradigm for Fisher's linear discriminant. Compare with Figure 1 (the data points are the same).
the two classes are well separated, although perhaps by an irregular boundary. A related analysis appears in Mika (2002), comparing the principles of SVMs and the Fisher discriminant in a linear feature space.

3.2 The Sequential Lower-Variance Minimization. In this section, we discuss how to implement in practice the method suggested by the mean-and-variance philosophy. First, if we believe that Chebyshev's inequality and the resulting inequality 3.1 are a good indication of what should be done in practice, then an immediate remark is that we can replace the variance of the margin in this inequality by what we call its lower variance:

$$\mathrm{Var}^-[X] := E[((X - E[X])^-)^2],$$

where $(\cdot)^-$ denotes the negative part (i.e., $(x)^- = -x$ when $x < 0$ and $(x)^- = 0$ otherwise). This makes sense since we want to take into account only the lower deviations of the margin in order to bound the probability that it becomes negative. The second point is that we would like to find an SRS (see section 1.3) corresponding to the mean-and-variance philosophy. It has been noted by several authors (Breiman, 1998b; Frean & Downs, 1998; Friedman, Hastie, & Tibshirani, 2000, among others) that the AdaBoost algorithm can be interpreted as gradient descent with an exponential cost function (this has
led to proposals of other SRSs based on different cost functions). We propose to follow this principle, applied to a cost function that reflects our philosophy. A first naive choice would be the right-hand side of equation 3.1. Two arguments lead us to choose another solution. First, we expect inequality 3.1 to be too coarse to capture in all generality the optimal trade-off between mean and variance. We therefore introduce a one-parameter family of cost functions corresponding to different possible trade-offs (the parameter will be selected by cross-validation). Second, our first experiments showed that a cost function defined as a ratio between mean and variance yielded quite unstable results, and we prefer an additive cost function combining these two quantities. Finally, we choose the following cost function, depending on the parameter $\beta > 0$:

$$C_\beta(G_\alpha) = |\alpha|^{-1} \left( \sqrt{\mathrm{Var}^-[M_\alpha]} - \beta E[M_\alpha] \right) \tag{3.3}$$
(the square root is for homogeneity). The normalization by $|\alpha|^{-1}$ is necessary if we want a scale-invariant cost (the scale does not change the final classification function). Note that this is very similar to a cost function we suggested earlier (Amit & Blanchard, 2001). What SRS do we derive from this cost function if we apply gradient descent to it? First, we note that this cost function is convex on the simplex $\{|\alpha| = 1\}$ (see the appendix for a proof), so that gradient descent is a legitimate method to minimize it. We mainly need to compute the gradient of the cost along the direction corresponding to adding a new classifier to the mixture. Denote by $G_0 = G_\alpha$ the normalized aggregated function corresponding to our current mixture $\alpha$; consider adding classifier $f$ to the mixture with a small weight, resulting in the normalized aggregated function $G_t = (1 - t)G_0 + tf$ for some small $t$. We thus want to compute $\partial C_\beta(G_t)/\partial t|_{t=0}$ and minimize this quantity as a function of $f$. We then have the following property (see the proof in the appendix):

Theorem 4. Denote $E_f(x, y) = 1\{f(x) \ne y\}$ and $w(x, y) = (M_\alpha - E[M_\alpha])^-$. Then the classifier $f$ minimizes $\partial C_\beta(G_t)/\partial t|_{t=0}$ iff it minimizes the quantity

$$J(\alpha, f) = E[w(X, Y) E_f(X, Y)] + \left( \beta \sqrt{E[w(X, Y)^2]} - E[w(X, Y)] \right) E[E_f(X, Y)]. \tag{3.4}$$

Keep in mind that the goal of the algorithm is to minimize the empirical cost function, that is, equation 3.3 with the variance and expectation operators taken under the empirical distribution. Similarly, in the above definition of $J(\alpha, f)$, the expectations are to be understood under the empirical distribution. The quantity $J(\alpha, f)$ can be interpreted as a weighted error: it can be put in the form $J(\alpha, f) = E[(w(X, Y) + C) E_f(X, Y)]$, with a constant $C$ depending on $\alpha$ but not on $(X, Y)$. Therefore, minimizing $J(\alpha, f)$ with
respect to $f$ amounts to minimizing the weighted error of $f$ when example $(X, Y)$ is given a weight proportional to $w(X, Y) + C$. It can actually happen, if $\beta$ is small, that some weights are negative because of the negative part in the constant term $C$. This is not a problem in practice, since we can then flip the label of the corresponding example and take the absolute value of the weight for the next training set. However, if one chooses $\beta \ge 1$, then the weights are always nonnegative by Jensen's inequality. As a consequence, the sequential lower-variance minimization algorithm goes as follows: at each step, call the weak classifier with the weights corresponding to equation 3.4, and add the output $f$ to the current combination with coefficient 1. Note that in the AdaBoost algorithm, the coefficient given to the classifier is itself the result of an optimization in the direction of $f$; but since this is not necessary to perform a simple gradient descent, we prefer the simpler uniform average here.

4 An Algorithmic Interpolation of the Two Philosophies

4.1 The Interpolated Algorithm. Now that we have explored the two paradigms concerning the margin distribution, we notice that the two algorithms built in the previous sections, Blackwell's strategy applied to classification and sequential lower-variance minimization (SLVM), have the interesting common point that the weights given to the training examples at each step $t$ are in both cases of the form $\omega_{t,i} \propto \Phi_{a_t, b_t}(M_{\alpha_{t-1}}(X_i, Y_i))$, where $\alpha_{t-1}$ is the current combination at the beginning of step $t$ and $\Phi_{a,b}(x) = (x - a)^- + b$ (see Figure 4). Moreover, both methods use only a simple vote among the base classifiers built (which corresponds to a coefficient of 1 given to each base classifier in the combination).
For Blackwell's strategy, $b_t = 0$ and $a_t = 1 - \frac{2}{t-1} \sum_{k=1}^{t-1} \varepsilon_k$, where $\varepsilon_k$ is the weighted error of base classifier $f_k$ at step $k$; in other words, $a_t$ is the average of the mean weighted margins of the base classifiers forming the combination up to the present step. For the SLVM, $a_t = E[M_{\alpha_{t-1}}]$ is the (empirical) average margin of the current combination, in other words, the average of the mean (unweighted) margins of the base classifiers built up to the present step. If $e_k$ denotes the (unweighted) empirical error of base classifier $f_k$, this is also equivalent to $a_t = 1 - \frac{2}{t-1} \sum_{k=1}^{t-1} e_k$; notice that compared with Blackwell's algorithm, we have just replaced the weighted error by the unweighted error in $a_t$. Additionally, $b_t = \beta \sqrt{E[w(X, Y)^2]} - E[w(X, Y)]$ (where $w$ is defined as in theorem 4) is a quantity depending on the parameter $\beta$. As noted earlier, the minimax philosophy is more suited to low-noise problems and the mean-variance philosophy to cases where there are noisy regions. This is in accordance with experimental results reporting that AdaBoost's performance can become extremely poor in the presence of noise on the labeling of training examples (Dietterich, 2000) and more generally
Figure 4: The weighting function $\Phi_{a,b}$.
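The weighting function of Figure 4 is simple enough to transcribe directly; the following sketch (with invented values for $a$, $b$, and the margins) just checks that examples with margin below $a$ receive weight proportional to their margin deficit, plus the offset $b$:

```python
def neg_part(x):
    """Negative part: (x)^- = -x if x < 0, else 0."""
    return -x if x < 0 else 0.0

def phi(a, b, x):
    """Weighting function Phi_{a,b}(x) = (x - a)^- + b applied to a margin x."""
    return neg_part(x - a) + b

# Invented threshold a = 0.3 and offset b = 0.1:
print(phi(0.3, 0.1, -0.5))  # margin well below a: large weight
print(phi(0.3, 0.1, 0.7))   # margin above a: only the offset b remains
```

With $b = 0$ this reduces to Blackwell's weighting; a positive $b$ keeps every example, even the well-classified ones, minimally active.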
that AdaBoost can actually exhibit overfitting (Grove & Schuurmans, 1998; Rätsch et al., 2001). Therefore, we would like an algorithm that is able to behave according to one or the other philosophy depending on the situation. More precisely, we want to derive a parameterized family of SRSs able to interpolate between the two. In view of the similar form taken by the reweighting in the two above algorithms, we propose the following simple solution: reweight the examples using a function $\Phi_{a_t, b_t}$ of the margins, where $(a_t, b_t)$ are given by a weighted mean, with coefficients $(1 - \delta, \delta)$, of the corresponding values for our two initial algorithms. This gives rise to the following choice for some $\delta \in [0, 1]$: recalling the notation $w(x, y) = (M_{\alpha_{t-1}}(x, y) - E[M_{\alpha_{t-1}}])^-$,

$$a_t = 1 - \frac{2}{t-1} \sum_{k=1}^{t-1} \big( (1 - \delta)\varepsilon_k + \delta e_k \big), \qquad b_t = \delta \left( \beta \sqrt{E[w(X, Y)^2]} - E[w(X, Y)] \right). \tag{4.1}$$
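Equation 4.1 can be checked with a small numeric sketch; the error sequences and weight values below are invented, and the function names are ours:

```python
import math

def interpolated_a(weighted_errs, unweighted_errs, delta):
    """a_t = 1 - (2/(t-1)) * sum_k ((1-delta)*eps_k + delta*e_k)  (eq. 4.1)."""
    t_minus_1 = len(weighted_errs)  # errors of steps 1, ..., t-1
    s = sum((1 - delta) * eps + delta * e
            for eps, e in zip(weighted_errs, unweighted_errs))
    return 1 - 2.0 / t_minus_1 * s

def interpolated_b(w, delta, beta):
    """b_t = delta * (beta * sqrt(E[w^2]) - E[w])  (eq. 4.1, empirical E)."""
    n = len(w)
    mean_w = sum(w) / n
    mean_w2 = sum(wi * wi for wi in w) / n
    return delta * (beta * math.sqrt(mean_w2) - mean_w)

# Invented values: three past base classifiers
eps = [0.30, 0.35, 0.40]   # weighted errors eps_k
e   = [0.20, 0.25, 0.30]   # unweighted errors e_k
print(interpolated_a(eps, e, 0.0))  # delta = 0: Blackwell's a_t = 1 - 2*mean(eps_k)
print(interpolated_a(eps, e, 1.0))  # delta = 1: SLVM's a_t = 1 - 2*mean(e_k)
```

For $\beta \ge 1$, `interpolated_b` is nonnegative (Jensen's inequality, as noted in section 3.2), so the offset never drags weights below zero.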
The resulting algorithm is summed up in Figure 5 (where it is slightly extended to also accept values of $\delta$ greater than 1).

4.2 Experiments. In our tests of the algorithm, we arbitrarily fixed $\beta = 2$ in the interpolated algorithm, in order to reduce to a single tuning constant $\delta > 0$ and also to ensure that the weights are always nonnegative. Note that the parameter $\beta$ could also be chosen by cross-validation in a reasonable range; in our experience, however, this made little difference.
1. For $i = 1, \ldots, N$, initialize weights $\omega_{i,1} = 1/N$ and vectors $R_{i,1} = 0$, $S_{i,1} = 0$. Put $\delta^* = \min(\delta, 1)$.

2. Iteration $t$: call the weak learner $W$ with the weights $(\omega_{i,t})$, resulting in classifier $f_t$. Let $e_t$ denote the empirical unweighted error of $f_t$ and $\varepsilon_t$ denote its weighted error.

3. For $i = 1, \ldots, N$, update the vectors $R$, $S$ as follows:
$$R_{i,t+1} = R_{i,t} + 1\{f_t(X_i) \ne Y_i\} - \varepsilon_t; \qquad S_{i,t+1} = S_{i,t} + 1\{f_t(X_i) \ne Y_i\} - e_t.$$

4. Put $V_{t+1} = \beta \sqrt{\frac{1}{N} \sum_{i=1}^N (S_{i,t+1})_+^2} - \frac{1}{N} \sum_{i=1}^N (S_{i,t+1})_+$. Update the weights as follows: for $i = 1, \ldots, N$,
$$\omega_{i,t+1} = \big( (1 - \delta^*) R_{i,t+1} + \delta^* S_{i,t+1} \big)_+ + \delta V_{t+1};$$
then renormalize so that the weights sum to 1.

5. If $t < T_{\max}$, proceed to the next iteration at point 2. If iteration $T_{\max}$ is reached, take for the aggregated classifier the simple uniform average (vote) of the $f_t$'s, $t = 1, \ldots, T_{\max}$.

Figure 5: The interpolated algorithm, depending on constants $\beta > 0$ and $\delta > 0$.
We performed two series of experiments: the first with classification trees of limited depth and the second with RBF networks. The first series aims at investigating the qualitative properties of the algorithm and at comparing it to AdaBoost, in particular in the presence of (labeling) noise. For this first series, we chose classification trees and stumps because they have been used as examples in other works on ensemble methods and because they are fast to compute, allowing us to perform large sets of experiments. On the other hand, since classification trees are not excellent classifiers taken individually, the goal is not to achieve record performance. In particular, it appears that the Random Forest (RF) algorithm often has better performance than the ensemble of trees obtained with our algorithm; however, RF is completely dedicated to classification trees: contrary to AdaBoost and other reweighting schemes, it is not divided clearly into a weak learner and an ensemble scheme. In particular, it is difficult to state exactly what the "weak learner" would be for RF, since the randomization is part of the tree-building procedure itself (a random subset of features is selected at each node, and the "ensemble" part merely consists in repeating the base procedure with a bootstrap). This makes it difficult to make a fair comparison with other ensemble schemes that use a "black box" weak learner. In any case, the fact that AdaBoost and the interpolated algorithm presented here can be applied to any weak learner proves a decisive advantage: in the
second series of experiments, using small-size RBF networks as base classifiers (which are generally better classifiers than single classification trees), the test error rates of the reweighting schemes are very clearly better than RF for a large majority of the data sets.

4.2.1 Experiments with Classification Trees and Stumps. For this first set of experiments, we tested the algorithm on seven benchmark data sets, six of which were used by Rätsch et al. (2001),$^3$ and the last one (Breast-c)$^4$ coming from the UCI repository. Our main point in this series of experiments is to compare the behavior of the interpolated algorithm with plain AdaBoost for an arbitrary weak learner, so we preferred a fast, but not very accurate, weak learner. We tried two procedures: decision trees of depths 1 (stumps) and 3, split using the Gini criterion, with no pruning and a coarse stopping rule. For each of the data sets, we ran the learning algorithms with labeling noise levels of 0%, 5%, 10%, and 20%, respectively; these noise levels correspond to the proportion of training examples whose labels are flipped before learning (the labels of the test samples used to estimate the error remain unchanged). For the interpolated algorithm, the constant $\delta$ was determined at each round by a five-fold cross-validation on the training set, choosing among an arbitrary set of nine values: $\{0, 0.05, 0.1, 0.15, 0.2, 0.5, 0.75, 1, 1.5\}$ (this could probably be improved). For each of these situations, the error rates were estimated with 100 different training and test sets (the same training and test sets are used with the different algorithms). Finally, for each of the SRSs used, we performed $T_{\max} = 500$ iterations at each round. First, we show in Figure 6 different empirical margin "profiles" (cumulative distribution functions) of the training set, obtained with different values of the parameter on the Waveform data set.
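The five-fold cross-validation used to select $\delta$ can be sketched generically; `train_fn` is a hypothetical stand-in for the interpolated algorithm, and the trivial learner at the end is only there to check the plumbing:

```python
import random

def cv_select_delta(X, Z, train_fn, deltas, n_folds=5, seed=0):
    """Choose delta by n-fold cross-validation on the training set.

    train_fn(X, Z, delta) must return a classifier x -> {-1, +1};
    it is a hypothetical stand-in for the interpolated algorithm.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    best_delta, best_err = None, float("inf")
    for d in deltas:
        mistakes = 0
        for k in range(n_folds):
            held_out = set(folds[k])
            Xtr = [X[i] for i in idx if i not in held_out]
            Ztr = [Z[i] for i in idx if i not in held_out]
            clf = train_fn(Xtr, Ztr, d)
            mistakes += sum(1 for i in folds[k] if clf(X[i]) != Z[i])
        err = mistakes / len(X)
        if err < best_err:
            best_delta, best_err = d, err
    return best_delta

# The nine-value grid used in the experiments:
GRID = [0, 0.05, 0.1, 0.15, 0.2, 0.5, 0.75, 1, 1.5]

# Toy check with a trivial "learner" that ignores delta and always predicts +1:
best = cv_select_delta([0, 1, 2, 3, 4], [1, 1, 1, 1, 1],
                       lambda X, Z, d: (lambda x: 1), GRID)
print(best)  # all deltas tie at zero error, so the first grid value wins: 0
```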
It is interesting to note that these profiles basically follow what we would expect from the construction of the algorithm: for low $\delta$, the distribution has an almost steep jump near what should be the minimax margin; for higher values of $\delta$, the distribution of the margins is more spread out but also has a higher mean. In the figure, it is noticeable that the AdaBoost algorithm does a better job than Blackwell's strategy (corresponding to $\delta = 0$) at pulling the minimum training margin higher; this is probably because the AdaBoost algorithm is based on an exponential cost function and therefore converges very quickly (this nice property of AdaBoost has been pointed out several times in the literature). By comparison, Blackwell's strategy uses only a linear function for its weighting and therefore needs more iterations to converge, and the 500 iterations performed here are probably not enough to reach the asymptotic regime (although the classification rates are very close). On the
$^3$ Made available on-line at http://ida.first.gmd.de/~raetsch/data/benchmarks.htm.
$^4$ Available on-line at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Figure 6: Margin profiles (top: on the training set; bottom: on the test set) obtained on the Waveform data set with depth-3 decision trees, for AdaBoost and the interpolated algorithm with different values of $\delta$.
other hand, Figure 6 shows that the proposed algorithm achieves precisely its initial purpose, which was to be able to sample different candidate ensemble classifiers achieving qualitatively different trade-offs in the shape (profile) of the margin cumulative distribution function (c.d.f.), the trade-off being basically between the proportion of examples having a high margin and the average margin over all the examples. The test set margin distributions, also reported in Figure 6, show that the qualitative differences appearing on the training set produce test profiles following the same qualitative patterns, so that varying the parameter indeed makes a difference for test sample classification. (In the example shown in the figure, AdaBoost actually offers the best solution, as seen from the test c.d.f. at the point 0, which represents the test error. This matches the results shown in Table 2.) The classification errors obtained on the different data sets are given in Tables 1 and 2. The interpolated algorithm does noticeably better than AdaBoost in general. It is interesting to note that the classification rates are generally worse (for both algorithms) using depth-3 trees than using stumps (except for the Waveform and Banana data sets). This indicates that the depth-3 trees must be overfitting in most of the cases, which has important consequences for the performance of the aggregated classifier. The interpolated algorithm outperforms AdaBoost most clearly for depth-3 trees, on the one hand, and for the higher labeling noise levels, on the other, which indicates that the interpolated algorithm is much more resistant to noise and overfitting. The German and Diabetes data sets are particularly instructive: when we compare the classification errors obtained with stumps and depth-3 trees, we see clearly that AdaBoost's performance is severely degraded, while the interpolated algorithm suffers only a small increase in classification errors.
The AdaBoost algorithm outperforms the interpolated algorithm with stumps in some cases on the data sets Waveform, German, and FSolar, but when we look at the numbers, the two algorithms actually appear mostly on par (only in three cases is the difference significant in the sense of a 95% t-test). One can sum up the results by saying that over all the cases considered, there are 32 wins of the interpolated algorithm versus AdaBoost, with 3 losses and 21 ties. As far as classification trees are concerned, it should be noted that for a majority of these data sets, the RF algorithm actually outperforms the interpolated algorithm (compare with Table 3). However, as pointed out earlier, it is not obvious how to identify a "weak learner" in the RF algorithm that could be used by the other ensemble algorithms for a fair comparison. Moreover, the reweighting schemes can be applied to other weak learners, yielding in most cases better final classification rates than RF, as shown in the next section.

4.2.2 Experiments with RBF Networks. For this second set of experiments, we used RBF networks as weak learners. In contrast to decision trees, RBF
Table 1: Results with Stumps as Base Classifiers (100 Training and Test Sets, 500 Iterations of the Ensemble Algorithms).

Labeling Noise  Method        Breast Cancer  Banana        German        FSolar        Heart         Diabetes      Waveform
 0%             AdaBoost      3.9 ± 2.1      27.8 ± 1.6    24.0 ± 2.3    33.0 ± 2.0    21.7 ± 4.0    24.7 ± 1.7    12.5 ± 0.6
 0%             Interpolated  3.7 ± 1.8 =    27.5 ± 1.7 =  24.3 ± 2.2 =  33.1 ± 2.0 =  18.6 ± 3.9 +  24.7 ± 1.8 =  12.4 ± 0.7 =
 5%             AdaBoost      5.1 ± 2.4      28.2 ± 1.8    25.0 ± 2.3    33.3 ± 1.6    23.8 ± 3.8    25.8 ± 1.9    15.4 ± 1.1
 5%             Interpolated  4.1 ± 2.2 +    27.9 ± 1.8 =  25.0 ± 2.2 =  33.5 ± 2.0 =  19.2 ± 4.3 +  25.0 ± 1.9 +  14.0 ± 1.2 +
10%             AdaBoost      5.5 ± 2.4      29.0 ± 2.2    25.9 ± 2.4    33.6 ± 1.9    26.1 ± 4.8    26.1 ± 2.1    17.7 ± 1.3
10%             Interpolated  4.4 ± 2.0 +    28.5 ± 2.1 =  25.9 ± 2.5 =  33.9 ± 2.3 =  21.1 ± 4.6 +  24.9 ± 1.8 +  14.9 ± 1.5 +
20%             AdaBoost      7.3 ± 3.2      30.4 ± 2.8    27.9 ± 2.7    34.0 ± 2.4    29.9 ± 5.2    28.8 ± 2.6    22.5 ± 2.0
20%             Interpolated  5.1 ± 2.5 +    29.8 ± 2.8 =  27.6 ± 2.8 =  34.9 ± 2.6    25.4 ± 5.0 +  26.6 ± 2.9 +  17.6 ± 2.0 +

Notes: The sign after an interpolated-algorithm result indicates the outcome of a 95% two-sided t-test of equality with the corresponding AdaBoost result: + or − indicates that the equality is rejected; = indicates that it is not. Boldface entries indicate the best raw results (least average test error) for each benchmark experiment.
Different Paradigms for Choosing Sequential Reweighting Algorithms 829
Table 2: Results with (Nonpruned) Depth 3 Tree Classifiers.

Labeling Noise  Method        Breast Cancer  Banana        German        FSolar        Heart         Diabetes      Waveform
 0%             AdaBoost      3.2 ± 1.9      13.9 ± 0.7    25.1 ± 2.3    34.6 ± 1.8    22.2 ± 4.2    27.3 ± 2.0    11.7 ± 0.6
 0%             Interpolated  3.7 ± 2.0 =    13.9 ± 1.2 =  24.3 ± 2.1 +  33.8 ± 1.9 +  21.3 ± 4.4 =  25.0 ± 1.9 +  12.3 ± 1.0 −
 5%             AdaBoost      5.1 ± 2.4      16.6 ± 1.4    27.6 ± 2.3    34.6 ± 1.7    24.3 ± 4.4    28.6 ± 2.2    12.9 ± 0.8
 5%             Interpolated  4.4 ± 2.6 +    15.8 ± 1.5 +  25.6 ± 2.4 +  34.1 ± 2.1 =  23.0 ± 4.8 +  25.6 ± 1.7 +  13.5 ± 1.0 −
10%             AdaBoost      6.8 ± 2.9      19.6 ± 1.6    29.2 ± 2.7    35.2 ± 2.0    26.2 ± 4.3    30.0 ± 2.3    14.7 ± 1.0
10%             Interpolated  4.4 ± 2.3 +    17.9 ± 1.8 +  26.0 ± 2.4 +  34.8 ± 2.2 =  24.1 ± 4.7 +  25.5 ± 2.3 +  14.7 ± 1.1 =
20%             AdaBoost      9.9 ± 4.2      25.8 ± 2.1    33.0 ± 2.8    35.6 ± 2.2    31.6 ± 5.3    34.0 ± 2.9    20.2 ± 2.0
20%             Interpolated  5.5 ± 2.9 +    23.1 ± 2.2 +  27.9 ± 2.7 +  35.8 ± 2.3 =  28.0 ± 5.7 +  27.9 ± 3.5 +  18.4 ± 2.1 +

Note: The same conditions as in Table 1 apply here.
830 G. Blanchard
Table 3: Performance Comparison Chart Between Random Forest, AdaBoost, AdaBoost-Reg, and the Interpolated Algorithm (Applied to RBF-Nets).

Data Set           AdaBoost RBF-net  Random Forest  AdaBoost-Reg RBF-net  Interpolated RBF-net
Banana             12.0 ± 0.6        12.5 ± 0.8     11.0 ± 0.6            10.7 ± 0.5
German             26.6 ± 2.5        23.0 ± 2.1     24.4 ± 2.3            23.9 ± 2.3
FSolar             35.5 ± 1.8        34.2 ± 1.7     33.1 ± 1.9            34.7 ± 1.7
Heart              20.5 ± 3.0        18.8 ± 4.0     16.9 ± 3.9            17.6 ± 2.9
Diabetes           26.9 ± 2.2        24.7 ± 1.7     23.9 ± 1.9            23.7 ± 2.0
Waveform           11.0 ± 0.6        11.1 ± 0.7     9.8 ± 0.4             9.9 ± 0.4
Titanic            22.6 ± 1.2        23.0 ± 2.2     22.4 ± 1.1            23.6 ± 2.1
Breast Cancer (2)  30.6 ± 4.8        26.6 ± 4.4     26.6 ± 4.7            27.3 ± 4.6
Ringnorm           5.4 ± 2.3         3.6 ± 0.3      1.6 ± 0.1             1.8 ± 0.2
Twonorm            4.7 ± 1.6         3.5 ± 0.5      3.5 ± 0.4             2.9 ± 0.3

Versus ABR: + = = = = = +
Versus RF:  + + + + = + +

Notes: The last two rows show the results of the 95% confidence two-sided t-test of the interpolated algorithm versus AdaBoost-Reg and Random Forest, respectively. Same conditions as Table 1.
networks are quite accurate weak learners; they are also slower to train, so in this case we made a more limited set of experiments, considering only data without labeling noise. The goal of this series of experiments is also to provide a performance comparison with the algorithm AdaBoost-Reg (Rätsch et al., 2001), which is an alternative regularized version of AdaBoost based on different principles. For a fair comparison, we followed the protocol used in Rätsch et al. (2001); more precisely, we used the data and the RBF parameters provided in Rätsch's repository. The interpolation parameter was estimated by cross validation for the first five realizations of each data set, and the median of these five values was used for the other realizations (which is exactly the protocol followed by Rätsch et al., 2001; this is rendered necessary to limit the computation time needed). We also used a larger number of benchmark data sets than in the first series of experiments (note that the Breast Cancer data set here is not the same as in the previous series). The results for this experiment are reported in Table 3. It is first interesting to note that the best ensemble method for RBF networks is never AdaBoost. This shows clearly that AdaBoost largely overfits when the base classifiers are too complex. Individually, the interpolated algorithm applied to RBF networks is almost always better than AdaBoost (only one loss), outperforms RF in a clear majority of cases (six victories, three losses, one tie), and is on par with AdaBoost-Reg (five ties, three losses, two victories). (An algorithm wins against another whenever the corresponding two-sided t-test for equality of the mean error rates is rejected.) The advantage of the interpolated algorithm with respect to the latter is that it is computationally simpler (in particular, AdaBoost-Reg involves a line search at each step to determine the coefficient of the classifier).
The raw results suggest that AdaBoost-Reg may have a slight edge over the interpolated algorithm, but a down-to-earth view of the results, taking into account the confidence intervals on test classification, leads to the conclusion that the two algorithms have essentially equivalent performance.

5 Conclusion

Numerous theoretical works in the past few years providing bounds on the classification error rates of ensemble methods have supported the intuition that one should generally try to obtain high margins on the training set. Once this principle is posed, there still remain some partially heuristic choices to be made regarding the trade-offs and the desired shape of the margin distribution. We have pointed out two philosophies that can be considered as different strategies to achieve this goal and that we showed or recalled to be linked with standard statistical learning procedures used in more classical frameworks (support vector machines and Fisher's linear discriminant), thus trying to provide a unifying point of view over different types of ensemble methods.
Based on this analysis and observing the remarkably similar form of two specific algorithms designed to achieve the aims of each paradigm, we built a very simple family of algorithms corresponding to interpolated parameter values between the two initial methods. In our opinion, the main virtue of this algorithm is to propose a simple method able to sample a variety of good candidate empirical margin profiles. The different types of profiles correspond basically to different trade-offs between the proportion of examples having a high enough margin and the average margin of the examples. Figure 6 shows clearly on an example how this goal is attained. It is then quite simple to pick among the sampled profiles by cross validation to choose the trade-off best suited to the data. This proved, on benchmark data sets, to be an efficient method to improve over the performance of AdaBoost, most noticeably in noisy situations or when the base classifiers tend to be too complex. This algorithm outperforms RF when used with good base classifiers (small-size RBF networks) and exhibits the same level of performance as the Regularized AdaBoost method, while being noticeably simpler.

Appendix

Proof of Theorem 1. Point (ii) is a direct consequence of point (i). By the minimax theorem, the minimax margin $\gamma$ is such that for any collection of weights $(\omega_i)$, there exists a classifier $f \in \mathcal{H}$ having average weighted margin higher than $\gamma$ or, in other words, weighted error less than $(1-\gamma)/2$. Since $\mathcal{W}$ finds a classifier with minimum weighted error, we must have $\varepsilon_t \le (1-\gamma)/2$ for all $t$. Now we have, for all $i = 1, \ldots, N$,

$$R_{i,t} = \sum_{k=1}^{t} \left( \mathbf{1}\{f_k(X_i)Y_i \le 0\} - \varepsilon_k \right) = \sum_{k=1}^{t} \left( \frac{1}{2}\bigl(1 - f_k(X_i)Y_i\bigr) - \varepsilon_k \right),$$

and point (i) hence implies that for any $\delta > 0$, for $t$ big enough we have, for all $i$,

$$\frac{1}{t}\sum_{k=1}^{t} f_k(X_i)Y_i \;\ge\; 1 - \frac{2}{t}\sum_{k=1}^{t}\varepsilon_k - \delta \;\ge\; \gamma - \delta,$$
which proves point (ii).

Proof of the convexity of the cost function $C_\beta(G_\alpha)$. We want to prove that $C_\beta(G_\alpha)$, given by equation 3.3, is convex on the simplex $\mathcal{S} = \{|\alpha| = 1\}$. Since the margin function $M_\alpha$ is a linear function of $\alpha$, this amounts to proving that the function $\sqrt{\mathrm{Var}_-[M_\alpha]}$ is convex. Let $\alpha_1, \alpha_2 \in \mathcal{S}$, $\eta \in [0,1]$,
and consider

$$A = \mathrm{Var}_-\!\left[M_{\eta\alpha_1 + (1-\eta)\alpha_2}\right] = E\!\left[\bigl(\eta(M_{\alpha_1} - E[M_{\alpha_1}]) + (1-\eta)(M_{\alpha_2} - E[M_{\alpha_2}])\bigr)_-^2\right]$$
$$\le E\!\left[\bigl(\eta(M_{\alpha_1} - E[M_{\alpha_1}])_- + (1-\eta)(M_{\alpha_2} - E[M_{\alpha_2}])_-\bigr)^2\right]$$
$$= \eta^2 E\!\left[(M_{\alpha_1} - E[M_{\alpha_1}])_-^2\right] + (1-\eta)^2 E\!\left[(M_{\alpha_2} - E[M_{\alpha_2}])_-^2\right] + 2\eta(1-\eta)\, E\!\left[(M_{\alpha_1} - E[M_{\alpha_1}])_- (M_{\alpha_2} - E[M_{\alpha_2}])_-\right],$$

where at the third line we have used the convexity of the negative part $(\cdot)_-$; and

$$B = \left(\eta\,\mathrm{Var}_-[M_{\alpha_1}]^{1/2} + (1-\eta)\,\mathrm{Var}_-[M_{\alpha_2}]^{1/2}\right)^2$$
$$= \eta^2 E\!\left[(M_{\alpha_1} - E[M_{\alpha_1}])_-^2\right] + (1-\eta)^2 E\!\left[(M_{\alpha_2} - E[M_{\alpha_2}])_-^2\right] + 2\eta(1-\eta)\left(E\!\left[(M_{\alpha_1} - E[M_{\alpha_1}])_-^2\right]\right)^{1/2}\left(E\!\left[(M_{\alpha_2} - E[M_{\alpha_2}])_-^2\right]\right)^{1/2};$$
we have $A \le B$ by the Cauchy-Schwarz inequality, hence the result.

Proof of Theorem 4. Let us start by recalling the following simple fact: if $h(x)$ is a differentiable function on $\mathbb{R}$, then it is easy to check that

$$\left.\frac{d\,(h(x))_-^2}{dx}\right|_{x=x_0} = -2\,(h(x_0))_-\, h'(x_0).$$
Let us assume without loss of generality that $|\alpha| = 1$, and put $M_f(x,y) = f(x)\,y$ (so that, since $f(x) \in \{-1, 1\}$, we have $M_f(x,y) = 1 - 2E_f(x,y)$); then

$$\mathrm{Var}_-\!\left[(1-t)M_\alpha + t M_f\right] = E\!\left[\bigl(M_\alpha - E[M_\alpha] + t(M_f - M_\alpha - E[M_f - M_\alpha])\bigr)_-^2\right],$$

so that, using the first remark,

$$\left.\frac{\partial}{\partial t}\,\mathrm{Var}_-\!\left[(1-t)M_\alpha + t M_f\right]\right|_{t=0} = -2E\!\left[(M_\alpha - E[M_\alpha])_- \bigl(M_f - M_\alpha - E[M_f - M_\alpha]\bigr)\right] = 4E[w(X,Y)E_f(X,Y)] - 4E[w(X,Y)]\,E[E_f] + C_\alpha,$$

where $C_\alpha$ is independent of $f$; similarly,

$$\left.\frac{\partial}{\partial t}\, E\!\left[(1-t)M_\alpha + t M_f\right]\right|_{t=0} = E[M_f - M_\alpha] = -2E[E_f] + C'_\alpha,$$
so that finally

$$\left.\frac{\partial}{\partial t}\, C_\beta(G_t)\right|_{t=0} = 2\left(\frac{E[w(X,Y)E_f(X,Y)] - E[w(X,Y)]\,E[E_f]}{\sqrt{\mathrm{Var}_-[M_\alpha]}} + \beta E[E_f]\right) + C''_\alpha.$$
The result follows by translation and multiplication by constants independent of $f$.

Acknowledgments

This work finds some of its roots in a joint research project with Y. Amit, whom I thank. I also thank K.-R. Müller, G. Rätsch, and S. Mika for stimulating discussions about this work. I finished this letter while I was hosted at Fraunhofer FIRST, Berlin.

References

Amit, Y., & Blanchard, G. (2001). Multiple randomized classifiers: MRCL (Tech. Rep.). Chicago: University of Chicago. Available on-line: http://galton.uchicago.edu/~amit/Papers/mrcl.ps.gz.
Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9, 1545–1588.
Blackwell, D. (1956). An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6, 1–8.
Blanchard, G. (2001). Mixture and aggregation of estimators for pattern recognition: Application to decision trees. Unpublished doctoral dissertation, Université Paris-Nord. (In English, with an introductory part in French.) Available on-line: http://www.math.u-psud.fr/~blanchard/publi/these.ps.gz.
Blanchard, G. (2003). Generalization error bounds for aggregate classifiers. In D. D. Denison, M. H. Hansen, C. C. Holmes, B. Mallick, & B. Yu (Eds.), Nonlinear estimation and classification. Berlin: Springer-Verlag.
Breiman, L. (1998a). Arcing classifiers. Annals of Statistics, 26(3), 801–849.
Breiman, L. (1998b). Prediction games and arcing algorithms (Tech. Rep.). Berkeley: Statistics Department, University of California at Berkeley. Available on-line: ftp://ftp.stat.berkeley.edu/pub/users/breiman/games.ps.Z.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. Berlin: Springer-Verlag.
Dietterich, T. G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning, 40, 139–158.
Frean, M., & Downs, T. (1998). A simple cost function for boosting (Tech. Rep.). Queensland, Australia: Department of Computer Science and Electrical Engineering, University of Queensland. Available on-line: http://www.boosting.org/papers/FreDow98.ps.gz.
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the 13th International Conference (pp. 148–156). San Mateo, CA: Morgan Kaufmann.
Freund, Y., & Schapire, R. E. (1996). Game theory, on-line prediction and boosting. In Proceedings of the 9th Annual Conference on Computational Learning Theory (pp. 325–332). New York: ACM Press.
Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28, 337–374.
Grove, A., & Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (pp. 692–699). New York: AAAI Press. Available on-line: http://www.boosting.org/papers/GroSch98.ps.gz.
Hart, S., & Mas-Colell, A. (2001). A general class of adaptive strategies. Journal of Economic Theory, 98, 26–54.
Koltchinskii, V., & Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1), 1–35.
Meir, R., & Rätsch, G. (2003). An introduction to boosting and leveraging. In S. Mendelson & A. Smola (Eds.), Advanced lectures on machine learning (pp. 119–184). Berlin: Springer-Verlag. Available on-line: http://www.boosting.org/papers/MeiRae03.ps.gz.
Mika, S. (2002). Kernel Fisher discriminants. Unpublished doctoral dissertation, University of Technology, Berlin.
Onoda, T., Rätsch, G., & Müller, K.-R. (1998). An asymptotic analysis of AdaBoost in the binary classification case. In L. Niklasson, M. Bodén, & T. Ziemke (Eds.), Proc. of the Int. Conf. on Artificial Neural Networks (ICANN'98) (pp. 195–200). Berlin: Springer-Verlag. Available on-line: http://www.boosting.org/papers/ICANN98.ps.gz.
Rätsch, G., Mika, S., Schölkopf, B., & Müller, K.-R. (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9), 1184–1199.
Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287–320.
Rätsch, G., & Warmuth, M. (2001). Marginal boosting (Tech. Rep.). London: Royal Holloway College.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651–1686.
Smola, A. J., Bartlett, P. L., Schölkopf, B., & Schuurmans, D. (Eds.). (2000). Advances in large margin classifiers. Cambridge, MA: MIT Press.
Viola, P., & Jones, M. J. (2001). Robust real-time object detection (Tech. Rep. CRL 2001/01). Cambridge, MA: COMPAQ.
Received November 6, 2002; accepted September 5, 2003.
LETTER
Communicated by Robert Hecht-Nielsen
Robust Formulations for Training Multilayer Perceptrons

Tommi Kärkkäinen
[email protected].
Erkki Heikkola
[email protected]
Department of Mathematical Information Technology, University of Jyväskylä, FIN-40014 University of Jyväskylä, Finland
The connection between robust statistical estimates and nonsmooth optimization is established. Based on the resulting family of optimization problems, robust learning problem formulations with regularization-based control of the model complexity of the multilayer perceptron network are described and analyzed. Numerical experiments for simulated regression problems are conducted, and new strategies for determining the regularization coefficient are proposed and evaluated.

1 Introduction

The multilayer perceptron (MLP) is the most commonly used neural network for nonlinear regression approximation. The simplest model of data in regression is to assume that the given targets are generated by

$$y_i = \phi(x_i) + \varepsilon_i, \tag{1.1}$$

where $\phi(x)$ is the unknown stationary function and the $\varepsilon_i$'s are sampled from an underlying noise process. Kärkkäinen (2002; see also Bishop, 1995) showed that for any MLP with a linear final layer and special regularization (weight decay without final-layer bias), the usual least-mean-squares (LMS) learning problem formulation corresponds to the gaussian assumption for the noise statistics in equation 1.1. In statistics, relaxation of this assumption underlies the so-called robust procedures (see, e.g., Huber, 1981; Rousseeuw & Leroy, 1987; Rao, 1988; Hettmansperger & McKean, 1998; Oja, 1999). In the neural networks literature, there have been some attempts to combine robust statistical procedures with learning problem formulations and training algorithms for MLP networks (see, e.g., Kosko, 1992; Chen & Jain, 1994; Liano, 1996). Moreover, the single-layer (linear) perceptron (SLP) for classification and robust regression approximation has been extensively studied in a special algorithmic setting by Raudys (1998a, 1998b, 2000). However, the general combination of robustness and the MLP, without focusing on special algorithms or architectures, has, as far as we know, not been considered on a solid basis.

Neural Computation 16, 837–862 (2004)
© 2004 Massachusetts Institute of Technology
838
T. Kärkkäinen and E. Heikkola
Based on initial settings other than equation 1.1, there exist many techniques for studying the learning properties of feedforward neural networks, especially probably approximately correct (PAC) learnability and the Vapnik-Chervonenkis (VC) dimension (see, e.g., Mitchell, 1997). The above-mentioned result concerning the LMS learning problem means that for every local solution of the learning problem, the output is equal to the sample mean of the given outputs. Hence, every locally optimal MLP provides an unbiased estimate of the true mean. The main theoretical observation of this article is the generalization of this result in accordance with the two most common robust estimates of location: the median and the spatial median. This yields generalized unbiasedness, so that we can then concentrate on controlling the variance (i.e., the complexity) of the MLP. For this purpose, we apply the special quadratic weight decay, which penalizes large values of weights using only a single hyperparameter and, being strictly convex, also improves the local properties of the learning problem. Surprisingly, in addition to numerous practical and theoretical studies, the analysis of the fat-shattering dimension (a scale-sensitive version of the VC dimension, more appropriate for studying neural networks) also supports the utilization of such an approach (Bartlett, 1998). The main emphasis here is to describe, analyze, and test robust learning problem formulations for the MLP in a batch mode. For numerical comparisons, we need to use black-box training algorithms for solving the optimization problems based on these formulations. An essential concept then is the convergence of an algorithm, which depends on the regularity of the optimization problem (Nocedal & Wright, 1999). Hence, rigorous treatment of a robust MLP requires us to establish a link between the norms behind the robust statistics and the regularity of such problems (Clarke, 1983; Mäkelä & Neittaanmäki, 1992).
As far as we know, this fundamental relation has not been explicitly established in other works. Another basis for this work is to treat the MLP transformation in a layer-wise form (Hagan & Menhaj, 1994; Kärkkäinen, 2002). This allows us to derive the optimality system in a compact form, which can be used in an efficient computer implementation of the proposed techniques. Together with the new heuristics given for controlling model complexity, the proposed approach allows a rigorous derivation of an MLP for real applications with different noise characteristics within the training data. In section 2, we establish, discuss, and illustrate the connection between robust statistics and nonsmooth optimization. We also present the layer-wise architecture and the family of learning problem formulations for training an MLP. In section 3, we compute the optimality conditions for the network learning and derive and discuss some of their consequences. In section 4, we report results of numerical experiments comparing the different formulations and introduce two novel techniques for determining the complexity of an MLP model. Finally, in section 5, we briefly draw some conclusions.
Robust Formulations for Training Multilayer Perceptrons
839
2 Preliminaries

Throughout the article, we denote by $(v)_i$ the $i$th component of a vector $v \in \mathbb{R}^n$. Without parentheses, $v_i$ represents one element in the set of vectors $\{v_i\}$. The $l_q$-norm of a vector $v$ is given by

$$\|v\|_q = \left( \sum_{i=1}^{n} |(v)_i|^q \right)^{1/q}, \qquad q < \infty. \tag{2.1}$$
2.1 Nonsmooth Optimization and Robust Statistics. In this section, we establish the connection between nonsmooth optimization and robust statistics. More details and further references on nonsmooth optimization can be found in Clarke (1983) and Mäkelä and Neittaanmäki (1992). Robust statistics are treated in Huber (1981), Rousseeuw and Leroy (1987), Rao (1988), Hettmansperger and McKean (1998), and Oja (1999).

Nonsmooth optimization concentrates on functionals and optimization problems that cannot be described using the classic ($C^1$) differential calculus. We consider the following unconstrained optimization problem,

$$\min_{u \in \mathbb{R}^n} \mathcal{J}(u), \tag{2.2}$$

where $\mathcal{J} : \mathbb{R}^n \to \mathbb{R}$ is a given Lipschitz continuous cost function.

Definition 1. The subdifferential $\partial \mathcal{J}$ (according to Clarke, 1983) of $\mathcal{J}$ at $u \in \mathbb{R}^n$ is defined by

$$\partial \mathcal{J}(u) = \{ \xi \in \mathbb{R}^n \mid \mathcal{J}^\circ(u; d) \ge \xi^T d \quad \forall d \in \mathbb{R}^n \}, \tag{2.3}$$

where $\mathcal{J}^\circ(u; d)$ is the generalized directional derivative $\limsup_{v \to u,\, t \searrow 0} \bigl(\mathcal{J}(v + t d) - \mathcal{J}(v)\bigr)/t$, which coincides with the usual directional derivative $\mathcal{J}'(u; d)$ when it exists. Notice that $\partial \mathcal{J}(u)$ defines a nonempty, convex, and compact set.

Theorem 1. Every local minimizer $u^* \in \mathbb{R}^n$ for problem 2.2 is substationary, that is, it satisfies

$$0 \in \partial \mathcal{J}(u^*). \tag{2.4}$$

Moreover, if $\mathcal{J}$ is convex, then the necessary optimality condition in equation 2.4 is also sufficient.

To summarize, in nonsmooth optimization, generalization of the directional derivative is the set-valued subdifferential, and, correspondingly,
generalization of the smooth, local indication of an extremum point $\nabla \mathcal{J}(u^*) = 0$ is the existence of a substationary point $0 \in \partial \mathcal{J}(u^*)$.

Let us illustrate the above definitions with the example $f(u) = |u|$ for $u \in \mathbb{R}$. The subdifferential of $f(u)$ is given by

$$\partial f(u) = \operatorname{sign}(u) = \begin{cases} -1, & \text{for } u < 0, \\ [-1, 1], & \text{for } u = 0, \\ 1, & \text{for } u > 0. \end{cases} \tag{2.5}$$

As can be seen, the subdifferential (i.e., the generalized sign function) coincides with the usual derivative in the well-defined case $u \ne 0$, and contains the whole set $[-1, 1]$, with end points given by the left and right converging directional derivatives $f'(u_i)$ for $u_i \to 0$. Moreover, $u^* = 0$ is the unique minimizer of $|u|$, because $0 \in \partial f(u)$ only for $u^* = 0$.

Next we turn our attention to robust statistics. Let $\{x_1, \ldots, x_N\}$ be a sample of a multivariate random variable $x \in \mathbb{R}^n$. Consider the following family of optimization problems:

$$\min_{u \in \mathbb{R}^n} \mathcal{J}_q^\alpha(u), \quad \text{for } \mathcal{J}_q^\alpha(u) = \frac{1}{\alpha} \sum_{i=1}^{N} \|u - x_i\|_q^\alpha. \tag{2.6}$$
We restrict ourselves to the following combinations (cf. Rao, 1988):

$q = \alpha = 2$ (average). In this case, problem 2.6 is the quadratic least-squares problem, and the gradient of $\mathcal{J}_q^\alpha$ is given by $\nabla \mathcal{J}_2^2(u) = \sum_{i=1}^{N} (u - x_i)$. From $\nabla \mathcal{J}_2^2(u^*) = 0$, we obtain the unique solution $a = u^*$ of problem 2.6 in the form

$$a = \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{2.7}$$
which is the marginal mean (average) for the given sample.

$q = \alpha = 1$ (median). This choice leads to the minimization of the sum of $l_1$-norms, which is a nonsmooth optimization problem. The subdifferential of $\mathcal{J}_1^1(u)$ reads as

$$\partial \mathcal{J}_1^1(u) = \sum_{i=1}^{N} \xi_i, \quad \text{where } (\xi_i)_j = \operatorname{sign}\bigl((u - x_i)_j\bigr). \tag{2.8}$$
In practice, the median is realized by the marginal middle values of the feature-wise ordered sample set. Hence, the median is unique for odd $N$, whereas for even $N$, all points in the closed interval between the two middle values satisfy equation 2.8 (see, e.g., Kärkkäinen & Heikkola, 2002).
$q = 2\alpha = 2$ (spatial median). By equation 2.3, the subgradient of $\mathcal{J}_2^1(u)$ is characterized by the condition

$$\partial \mathcal{J}_2^1(u) = \sum_{i=1}^{N} \xi_i, \quad \text{with } \begin{cases} \xi_i = \dfrac{u - x_i}{\|u - x_i\|_2}, & \text{for } \|u - x_i\|_2 \ne 0, \\[4pt] \|\xi_i\|_2 \le 1, & \text{for } \|u - x_i\|_2 = 0. \end{cases} \tag{2.9}$$
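The three location estimates can be compared numerically. The text characterizes the spatial median only through the subgradient condition above; the Weiszfeld-type fixed-point iteration used below is an assumed solver chosen for illustration, not one prescribed by the article:

```python
import numpy as np

def marginal_mean(X):
    # q = alpha = 2: unique minimizer of (1/2) sum_i ||u - x_i||_2^2 (equation 2.7)
    return X.mean(axis=0)

def marginal_median(X):
    # q = alpha = 1: coordinatewise median, a solution of 0 in dJ (equation 2.8)
    return np.median(X, axis=0)

def spatial_median(X, iters=200, eps=1e-12):
    # q = 2alpha = 2: argmin_u sum_i ||u - x_i||_2, via Weiszfeld-type iteration
    # (assumed solver; the article only states the condition 0 in dJ(s))
    u = X.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(X - u, axis=1), eps)  # guard against division by zero
        w = 1.0 / d
        u = (w[:, None] * X).sum(axis=0) / w.sum()
    return u

# Three points near the origin plus one gross outlier.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
```

For this sample, the mean is dragged to (2.75, 2.75) by the outlier, while both medians stay at (0.5, 0.5) (they happen to coincide on this symmetric toy set), illustrating the robustness discussed in the comparison of estimators.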
Thus, in this case, the solution of problem 2.6 is realized by the so-called spatial median $s$, which satisfies $0 \in \partial \mathcal{J}_2^1(s)$. Milasevic and Ducharme (1987) show that if the sample $\{x_1, \ldots, x_N\}$ belongs to a Euclidean space and is not concentrated on a line, the spatial median $s$ is unique. This result is generalized to strictly convex Banach spaces in Kemperman (1987).

2.1.1 Comparison of Different Estimators. In a statistical context, robustness refers to the insensitivity of estimators toward outliers: observations that do not follow the characteristic distribution of the rest of the data. The sensitivity of the average $a$ toward observations lying far from the origin (representing the mean-value estimator) is illustrated in Figure 1 (left), where the gradient field $\nabla f(x) = (x_1, x_2)$ of the 2D function $\|x\|_2^2$ is given. As we can see, the size of the gradient vector increases when moving away from the origin, so that those points are weighted more heavily at the equilibrium $\nabla \|x\|_2^2 = 0$. This readily explains why the (symmetric) gaussian distribution with enough samples is the intrinsic assumption behind the least-squares estimate $a$. On the other hand, an estimator with equal weight on all samples is obtained by dividing the gradient (not the samples!) by its length, and then we get precisely the spatial median $s$, which is illustrated through the gradient field of the function $\|x\|_2$ in Figure 1 (right).

Figure 1: Scaled (scale 0.4) gradient fields of $\|x\|_2^2$ (left) and $\|x\|_2$ (right).

As stated, for example, in Hettmansperger and McKean (1998), and also evident from equation 2.9, the corresponding estimating function $\mathcal{J}_2^1$ depends only on the directions and not on the magnitudes of $u - x_i$, $i = 1, \ldots, N$, which significantly decreases both the sensitivity toward outliers and the requirements concerning the necessary amount of data. Finally, in Figure 2, the gradient field of the function $\|x\|_1$ is depicted, where the insensitivity with respect to the distance but also the lack of rotational invariance (due to the different contour lines of the unit ball in the 1-norm) are clearly visible.

Figure 2: Scaled (scale 0.4) gradient field of $f(x) = \|x\|_1$.

2.2 MLP in a Layer-Wise Form. A compact description of the action of the multilayered perceptron neural network is given by Hagan and Menhaj (1994) and Kärkkäinen (2002):

$$o^0 = x, \qquad o^l = \mathcal{F}^l\bigl(W^l \hat{o}^{(l-1)}\bigr) \quad \text{for } l = 1, \ldots, L. \tag{2.10}$$
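The layer-wise recursion of equation 2.10 can be sketched directly. The snippet below is a minimal illustration: tanh stands in for the componentwise activation, the final layer is taken linear (as the article adopts for its learning problem), and the weights are random placeholders:

```python
import numpy as np

def hat(v):
    """Circumflex operator: extend a vector by unity so each W's first column acts as a bias."""
    return np.concatenate(([1.0], v))

def forward(weights, x, act=np.tanh):
    """Equation 2.10 with a linear final layer:
    o^0 = x;  o^l = F^l(W^l hat(o^{l-1})) for l = 1, ..., L-1;  o^L = W^L hat(o^{L-1}).
    Each W^l has shape (n_l, n_{l-1} + 1)."""
    o = x
    for W in weights[:-1]:
        o = act(W @ hat(o))
    return weights[-1] @ hat(o)

# n0 = 3 inputs, one hidden layer with n1 = 4 neurons, n2 = 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3 + 1))
W2 = rng.standard_normal((2, 4 + 1))
out = forward([W1, W2], np.array([0.1, -0.2, 0.3]))
```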
Here, the superscript $l$ corresponds to the layer number (starting from zero for the input), and by the circumflex we indicate the normal extension of a vector by unity. $\mathcal{F}^l(\cdot)$ denotes the usual componentwise activation on the $l$th level, which can be represented by using a diagonal function matrix $\mathcal{F} = \mathcal{F}(\cdot) = \mathrm{Diag}\{f_i(\cdot)\}_{i=1}^{m}$, supplied with the natural definition of the matrix-vector product $y = \mathcal{F}(v) \Leftrightarrow (y)_i = f_i((v)_i)$. Notice, though, that the following analysis generalizes in a straightforward manner to the case of an activation with a nondiagonal function matrix (Kärkkäinen, 2002). The dimensions of the weight matrices are given by $\dim(W^l) = n_l \times (n_{l-1} + 1)$, $l = 1, \ldots, L$, where $n_0$ is the length of an input vector $x$, $n_L$ the length of the output
vector $o^L$, and $n_l$, $0 < l < L$, determine the sizes (number of neurons) of the hidden layers. Due to the special bias weights in the first column, the column numbering of each weight matrix starts from zero. Instead of precisely equation 2.10, we consider an MLP architecture containing only a linear transformation in the final layer: $o_i^L = \mathcal{N}(\{W^l\})(x_i) = W^L \hat{o}_i^{(L-1)}$. With given training data $\{x_i, y_i\}_{i=1}^N$, $x_i \in \mathbb{R}^{n_0}$ and $y_i \in \mathbb{R}^{n_L}$, the unknown weight matrices $\{W^l\}_{l=1}^L$ are determined as the solution of the optimization problem,

$$\min_{\{W^l\}_{l=1}^{L}} \mathcal{L}_{q,\beta}^{\alpha}(\{W^l\}), \tag{2.11}$$

where the cost functional is of the general form

$$\mathcal{L}_{q,\beta}^{\alpha}(\{W^l\}) = \frac{1}{\alpha N} \sum_{i=1}^{N} \bigl\|\mathcal{N}(\{W^l\})(x_i) - y_i\bigr\|_q^{\alpha} + \frac{\beta}{2} \sum_{l=1}^{L} \sum_{(i,j) \in \mathcal{I}^l} |W_{i,j}^l|^2 \tag{2.12}$$
for $\beta \ge 0$. Here, the index set $\mathcal{I}^l$ contains all other indices of the unknown weight matrices except the ones corresponding to the bias vector of $W^L$, as suggested by the test results in Kärkkäinen (2002); see also Bishop (1995). All features in the training data $\{x_i, y_i\}$ are preprocessed to the range $[-1, 1]$ of the $k$-tanh functions

$$t_k(a) = \frac{2}{1 + \exp(-2\,k\,a)} - 1 \quad \text{for } k \in \mathbb{N}, \tag{2.13}$$
which are used in the activation. In this way, we balance the scaling of the unknowns (components of the weight matrices at different layers) in problem 2.11 (Kärkkäinen, 2002). It is well known (as suggested by equation 1.1) that for a successful application of an MLP, one needs to avoid overfitting by taking into account both the model complexity and the errors in the data. In equation 2.12, choosing the single hyperparameter, the weight decay coefficient $\beta$, positive favors smaller weights, thus further balancing their scale for any iterative training algorithm. In addition, this pushes the linear part of each neuron toward the linear region of the $k$-tanh activation functions. However, because in our approach we let $k$ in equation 2.13 run from 1 to $n_l$ in each layer, thus diminishing the linear region, this kind of penalization does not directly reduce the MLP to a linear transformation. Function 2.12 is also related to Bayesian statistics with compatible choices of the sample data and prior distributions (see, e.g., Rögnvaldsson, 1998). Moreover, we consider the same three combinations for the parameters $q$ and $\alpha$ as in the previous section: $q = \alpha = 2$, $q = \alpha = 1$, and $q = 2\alpha = 2$. Hence, we conclude that the considered family of learning problem formulations for the MLP results from a compound (though special) application of robust and Bayesian statistics.
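The $k$-tanh family of equation 2.13 is algebraically a rescaled hyperbolic tangent, $t_k(a) = \tanh(k a)$: the slope at the origin is $k$, so larger $k$ shrinks the near-linear region, which is why the weight decay does not collapse the network to a linear map. A quick numerical check of the identity:

```python
import math

def k_tanh(a, k):
    """k-tanh activation of equation 2.13; range (-1, 1), slope k at the origin."""
    return 2.0 / (1.0 + math.exp(-2.0 * k * a)) - 1.0

# Identical to tanh(k a), and steeper for larger k.
for k in (1, 2, 5):
    for a in (-1.0, -0.3, 0.0, 0.7):
        assert abs(k_tanh(a, k) - math.tanh(k * a)) < 1e-12
```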
For solving the optimization problems in equation 2.11, we use generalizations of gradient-based methods for nonsmooth problems known as bundle methods (Mäkelä & Neittaanmäki, 1992). We recall from Kärkkäinen (2002) that for $\alpha = 1$, the assumptions of convergence for ordinary training algorithms such as gradient descent (Lipschitz continuity of the gradient in batch mode; $C^2$ continuity for on-line stochastic iteration), CG (Lipschitz continuity of the gradient), and especially quasi-Newton methods ($C^2$ continuity) are violated (Haykin, 1994; Nocedal & Wright, 1999). As documented, for example, for the SLP in Raudys (2000), for the MLP in Saito and Nakano (2000), and for image restoration with similar functionals in Kärkkäinen, Majava, and Mäkelä (2001), this yields nonconvergence of ordinary training algorithms when the cost function does not fulfill the required smoothness assumptions. Furthermore, results in Kärkkäinen et al. (2001) indicate that simple smoothing techniques, like replacing a norm $\|v\|_2$ for $v \in \mathbb{R}^2$ by $\sqrt{v_1^2 + v_2^2 + \varepsilon}$ for $\varepsilon > 0$, are not sufficient to restore the convergence of ordinary optimization methods.

3 Sensitivity Analysis and Its Consequences

Next we apply a useful technique, also presented in Kärkkäinen (2002), to derive the optimality conditions for the network training problem 2.11. From now on, for any vector $v \in \mathbb{R}^n$, the notation $\operatorname{sign}[v]$ means a componentwise application of the sign function in equation 2.5, and the abbreviated notation $v / \|v\|_2$ actually refers to the existence of a vector $\xi$ such that

$$(\xi)_i = \frac{(v)_i}{\|v\|_2}, \ \text{for } \|v\|_2 \ne 0; \qquad \|\xi\|_2 \le 1, \ \text{for } \|v\|_2 = 0. \tag{3.1}$$
For simplicity, we assume that the activation functions in all function matrices $\mathcal{F}(\cdot)$ are differentiable, although the analysis below can be extended, and the given algorithms applied, also to nondifferentiable activation functions. Note that the use of nonsmooth activation functions (the step function or $a/(1+|a|)$; e.g., Prechelt, 1998) makes the learning problem nonsmooth even for $q = \alpha = 2$, and therefore ordinary gradient-based optimization algorithms cannot be used for its solution.

3.1 MLP with One Hidden Layer. For clarity, we start with an MLP with only one hidden layer. Then any local solution $(W^{1*}, W^{2*})$ of the minimization problem 2.11 is characterized by the conditions

$$0 \in \partial_{(W^1, W^2)} \mathcal{L}_{q,\beta}^{\alpha}(W^{1*}, W^{2*}) = \begin{bmatrix} \partial_{W^1} \mathcal{L}_{q,\beta}^{\alpha}(W^{1*}, W^{2*}) \\[2pt] \partial_{W^2} \mathcal{L}_{q,\beta}^{\alpha}(W^{1*}, W^{2*}) \end{bmatrix}. \tag{3.2}$$
Robust Formulations for Training Multilayer Perceptrons
845
Here, $\partial_{W^l} \mathcal{L}^\alpha_{q,\beta}$, $l = 1, 2$, are subdifferentials presented in a similar matrix form as the unknown weight matrices. We begin the derivation with some lemmata. The proofs are omitted here, because they follow exactly the same lines as the proofs of the corresponding lemmata in Kärkkäinen (2002), using the already introduced results on subdifferential calculus in section 2.1. We also assume that, except for the unknown variable defined in each cost function, all other fixed quantities (matrices and vectors) are given with appropriate dimensions. Furthermore, to compress the presentation, we introduce the following notation:

$$\xi_{\alpha q}(v) = \begin{cases} v, & \text{for } q = \alpha = 2, \\ \mathrm{sign}[v], & \text{for } q = \alpha = 1, \\ \dfrac{v}{\|v\|_2}, & \text{for } q = 2\alpha = 2. \end{cases}$$
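As a concrete illustration of this notation, the selector $\xi_{\alpha q}$ can be coded directly. The following sketch is ours, not the authors' implementation; at $v = 0$ in the third case it simply returns the zero vector, which is one valid element of the subdifferential per equation 3.1.

```python
import numpy as np

def xi(v, q, alpha):
    """Subgradient selector xi_{alpha q}(v) for the three (q, alpha) pairs.

    For q = 2, alpha = 1 (i.e., q = 2*alpha = 2), the convention of
    equation 3.1 allows any vector of norm <= 1 when v = 0; here we
    return the zero vector, one valid choice.
    """
    v = np.asarray(v, dtype=float)
    if q == 2 and alpha == 2:      # smooth least-squares case
        return v
    if q == 1 and alpha == 1:      # componentwise sign (least absolute deviations)
        return np.sign(v)
    if q == 2 and alpha == 1:      # spatial-median case: v / ||v||_2
        n = np.linalg.norm(v)
        return v / n if n > 0 else np.zeros_like(v)
    raise ValueError("unsupported (q, alpha) pair")
```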
Lemma 1. For the functional $J(W) = \frac{1}{\alpha}\|Wv - y\|_q^\alpha$, the matrix of subdifferentials is of the form

$$\partial_W J(W) = \xi_{\alpha q}(Wv - y)\, v^T.$$

Lemma 2. For the functional $J(u) = \frac{1}{\alpha}\|W\mathcal{F}(u) - y\|_q^\alpha$, the subgradient reads as

$$\partial_u J(u) = (W\mathcal{F}'(u))^T \xi_{\alpha q}(W\mathcal{F}(u) - y) = \mathrm{Diag}\{\mathcal{F}'(u)\}\, W^T \xi_{\alpha q}(W\mathcal{F}(u) - y).$$

Lemma 3. For the functional $J(W) = \frac{1}{\alpha}\|\bar{W}\mathcal{F}(Wv) - y\|_q^\alpha$, the matrix of subdifferentials is of the form

$$\partial_W J(W) = \mathrm{Diag}\{\mathcal{F}'(Wv)\}\, \bar{W}^T \xi_{\alpha q}(\bar{W}\mathcal{F}(Wv) - y)\, v^T.$$
Now we are ready to state the actual results for the perceptron with one hidden layer. In what follows, we denote by $W^2_1$ the submatrix $(W^2)_{i,j}$, $i = 1, \ldots, n_2$, $j = 1, \ldots, n_1$, which is obtained from $W^2$ by removing the first column $W^2_0$ containing the bias nodes. Furthermore, the error in the $i$th output is denoted by $e_i = W^2 \hat{\mathcal{F}}(W^1 \hat{x}^i) - y^i$.

Theorem 2. Matrices of subdifferentials $\partial_{W^2} \mathcal{L}^\alpha_{q,\beta}(W^1, W^2) \subset \mathbb{R}^{n_2 \times (n_1+1)}$ and $\partial_{W^1} \mathcal{L}^\alpha_{q,\beta}(W^1, W^2) \subset \mathbb{R}^{n_1 \times (n_0+1)}$ are of the form

$$\partial_{W^2} \mathcal{L}^\alpha_{q,\beta}(W^1, W^2) = \frac{1}{N}\sum_{i=1}^{N} \xi_{\alpha q}(e_i)\, [\hat{\mathcal{F}}(W^1 \hat{x}^i)]^T + \beta\, [\mathbf{0}\;\; W^2_1], \tag{3.3}$$

$$\partial_{W^1} \mathcal{L}^\alpha_{q,\beta}(W^1, W^2) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{Diag}\{\mathcal{F}'(W^1 \hat{x}^i)\}\, (W^2_1)^T \xi_{\alpha q}(e_i)\, (\hat{x}^i)^T + \beta\, W^1. \tag{3.4}$$
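For the smooth case $q = \alpha = 2$, where the subdifferential reduces to the ordinary gradient and $\xi_{\alpha q}(e) = e$, equations 3.3 and 3.4 can be checked numerically against finite differences. The following sketch is illustrative only: the layer sizes and random data are ours, and a plain tanh activation stands in for the paper's k-tanh functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2, N = 2, 4, 1, 10                  # illustrative sizes, not the paper's
W1 = rng.standard_normal((n1, n0 + 1))       # hidden weights incl. bias column
W2 = rng.standard_normal((n2, n1 + 1))       # output weights incl. bias column
X = rng.standard_normal((N, n0))
Y = rng.standard_normal((N, n2))

def grads(W1, W2, X, Y, beta):
    """Gradients of L^2_{2,beta} per equations 3.3-3.4 (q = alpha = 2, xi(e) = e)."""
    g1 = np.zeros_like(W1)
    g2 = np.zeros_like(W2)
    for x, y in zip(X, Y):
        xh = np.concatenate(([1.0], x))      # extended input \hat{x}
        o = np.tanh(W1 @ xh)                 # hidden activations
        oh = np.concatenate(([1.0], o))      # extended hidden output \hat{F}(W1 xh)
        e = W2 @ oh - y                      # output error e_i
        g2 += np.outer(e, oh)                # xi(e) [\hat{F}(W1 xh)]^T
        W2_1 = W2[:, 1:]                     # bias column removed
        g1 += np.outer((1.0 - o**2) * (W2_1.T @ e), xh)   # Diag{F'} (W2_1)^T xi(e) xh^T
    g2 /= len(X)
    g1 /= len(X)
    g2[:, 1:] += beta * W2[:, 1:]            # beta [0  W^2_1]: output bias not penalized
    g1 += beta * W1                          # beta W^1
    return g1, g2
```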
846
T. K¨arkka¨ inen and E. Heikkola
3.2 MLP with Several Hidden Layers. Next, we generalize the previous analysis to the case of several hidden layers.

Lemma 4. For the functional $J(W) = \frac{1}{\alpha}\|\bar{W}\bar{\mathcal{F}}(\tilde{W}\tilde{\mathcal{F}}(Wv)) - y\|_q^\alpha$, the matrix of subdifferentials is of the form

$$\partial_W J(W) = \mathrm{Diag}\{\tilde{\mathcal{F}}'(Wv)\}\, \tilde{W}^T \mathrm{Diag}\{\bar{\mathcal{F}}'(\tilde{W}\tilde{\mathcal{F}}(Wv))\}\, \bar{W}^T \xi_{\alpha q}(e)\, v^T,$$

where $e = \bar{W}\bar{\mathcal{F}}(\tilde{W}\tilde{\mathcal{F}}(Wv)) - y$.

Theorem 3. Matrices of subdifferentials $\partial_{W^l} \mathcal{L}^\alpha_{q,\beta}(\{W^l\})$, $l = L, \ldots, 1$, read as

$$\partial_{W^l} \mathcal{L}^\alpha_{q,\beta}(\{W^l\}) = \frac{1}{N}\sum_{i=1}^{N} \xi^l_i\, [\hat{o}^{(l-1)}_i]^T + \beta\, \tilde{W}^l,$$

where

$$\xi^L_i = \xi_{\alpha q}(e_i), \tag{3.5}$$

$$\xi^l_i = \mathrm{Diag}\{(\mathcal{F}^l)'(W^l \hat{o}^{(l-1)}_i)\}\, (W^{(l+1)}_1)^T \xi^{(l+1)}_i. \tag{3.6}$$

Furthermore, $\tilde{W}^l = [\mathbf{0}\;\; W^L_1]$ for $l = L$, and coincides with the whole matrix $W^l$ for $1 \leq l < L$.

The compact presentation of the optimality system in theorem 3 can be readily exploited in the implementation, which consists of a few basic linear-algebraic operations realizing equations 3.5 and 3.6. Moreover, the following result concerning every local minimizer of problem 2.11 holds.

Corollary 1. For a locally optimal MLP network $\{W^{l*}\}$ satisfying the conditions in theorem 3:

$$\begin{cases} \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} e^*_i = 0, & \text{for } q = \alpha = 2, \\[6pt] \mathbf{0} \in \displaystyle\sum_{i=1}^{N} \mathrm{sign}[e^*_i], & \text{for } q = \alpha = 1, \\[6pt] \mathbf{0} \in \displaystyle\sum_{i=1}^{N} \dfrac{e^*_i}{\|e^*_i\|_2}, & \text{for } q = 2\alpha = 2, \end{cases}$$

for all $\beta \geq 0$.
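The layer-wise system of equations 3.5 and 3.6 is essentially backpropagation with $\xi_{\alpha q}$ replacing the raw output error. A hedged sketch of the few linear-algebraic operations involved follows; tanh hidden activations and a linear final layer are our illustrative assumptions, and `xi` is any callable implementing $\xi_{\alpha q}$.

```python
import numpy as np

def subgradients(Ws, X, Y, xi, beta):
    """Layer-wise subgradient matrices per theorem 3 (tanh hidden layers assumed).

    Ws : list of weight matrices W^1 ... W^L, each with a leading bias column.
    xi : callable implementing xi_{alpha q} on the output error.
    """
    L = len(Ws)
    Gs = [np.zeros_like(W) for W in Ws]
    for x, y in zip(X, Y):
        # forward pass, storing extended layer outputs \hat{o}^{(l)}
        os = [np.concatenate(([1.0], x))]
        for l, W in enumerate(Ws):
            a = W @ os[-1]
            o = a if l == L - 1 else np.tanh(a)   # linear final layer
            os.append(np.concatenate(([1.0], o)))
        e = os[-1][1:] - y
        # backward recursion: xi^L = xi(e), then equation 3.6 layer by layer
        xs = xi(e)
        for l in range(L - 1, -1, -1):
            Gs[l] += np.outer(xs, os[l])          # xi^l [\hat{o}^{(l-1)}]^T
            if l > 0:
                o = os[l][1:]
                xs = (1.0 - o**2) * (Ws[l][:, 1:].T @ xs)  # Diag{F'} (W^{l+1}_1)^T xi^{l+1}
    for l in range(L):
        Gs[l] /= len(X)
        if l == L - 1:
            Gs[l][:, 1:] += beta * Ws[l][:, 1:]   # final layer: bias column not penalized
        else:
            Gs[l] += beta * Ws[l]                 # beta W^l for l < L
    return Gs
```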
Proof. The optimality condition

$$\mathbf{0} \in \partial_{W^L} \mathcal{L}^\alpha_{q,\beta}(\{W^{l*}\}) = \frac{1}{N}\sum_{i=1}^{N} \xi^{L*}_i\, [\hat{o}^{(L-1)*}_i]^T + \beta\, [\mathbf{0}\;\; W^{L*}_1]$$

(with the abbreviation $\hat{o}^{(L-1)}_i = \hat{o}^{(L-1)*}_i$) in theorem 3 can be written in the nonextended form as

$$\mathbf{0} \in \frac{1}{N}\sum_{i=1}^{N} \xi^{L*}_i\, [1\;\; (o^{(L-1)}_i)^T] + \beta\, [\mathbf{0}\;\; W^{L*}_1].$$

By taking the transpose on the right-hand side, we obtain

$$\frac{1}{N}\sum_{i=1}^{N} \begin{bmatrix} 1 \\ o^{(L-1)}_i \end{bmatrix} (\xi^{L*}_i)^T + \begin{bmatrix} \mathbf{0}^T \\ \beta\, (W^{L*}_1)^T \end{bmatrix} = \begin{bmatrix} \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} (\xi^{L*}_i)^T \\[8pt] \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} o^{(L-1)}_i (\xi^{L*}_i)^T + \beta\, (W^{L*}_1)^T \end{bmatrix}.$$
Finally, using the definitions in equation 3.5 for $\xi^{L*}_i$ in the first row shows the results.

The result of corollary 1 shows that, by means of the error distribution, the local optimality conditions for the three learning problem formulations coincide with the conditions satisfied by the three statistical estimates in equations 2.7 through 2.9. Hence, we draw the following conclusions:

• We are able to generate robust MLPs using the two nonsmooth norms for fitting. This also suggests that other good properties of robust estimates, like a smaller amount of learning data needed, could be a further benefit when training a network.

• The result of corollary 1 quantifies precisely the fault tolerance of neural networks by means of erroneous data according to equation 1.1.

• The proof of corollary 1 reveals that by means of the regression model, the role of the MLP is suppressed: we only needed a linear final layer with a separate bias. Therefore, these results are actually valid for all kinds of regression approximators with these two properties.

4 Numerical Experiments

4.1 Univariate Single-Valued Regression. In the first test setting, we study the use of the MLP network in the reconstruction of a given single-valued function of one variable, which is disturbed by random noise. We train the network by solving the optimization problem 2.11 with both $\alpha = q = 2$ and $\alpha = q = 1$, and we compare the results given by these two
approaches. In this case, $n_L = 1$, and thus the functionals $\mathcal{L}^1_{2,\beta}$ and $\mathcal{L}^1_{1,\beta}$ are identical. The minimizations are performed by the proximity control bundle method, which is applicable to both the smooth functional $\mathcal{L}^2_{2,\beta}$ and the nonsmooth functional $\mathcal{L}^1_{1,\beta}$ (Mäkelä & Neittaanmäki, 1992).

4.1.1 Definition of the Test Problem. We consider the reconstruction of the function $f(x) = \sin(x)$ in the interval $x \in [0, 2\pi]$. The input vectors of the training data are chosen to be $N$ uniformly spaced values $x_i$ from the interval $[0, 2\pi]$ given by $x_i = (i-1)\,2\pi/(N-1)$. The samples of function values involve two types of random noise: low-amplitude normally distributed noise affects the values over the whole interval $[0, 2\pi]$, and at some isolated points, the values are disturbed by high-amplitude uniformly distributed noise (outliers). Hence, we choose

$$y_i = \sin(x_i) + \delta\,\varepsilon_i + \zeta\,\eta_i, \tag{4.1}$$
where $\varepsilon_i \sim N(0, 1)$, $i \in O^C$, and $\eta_i \sim U(-1, 1)$, $i \in O$. Here, $U(-1, 1)$ denotes the uniform distribution on $(-1, 1)$, and $O$ is an index set of outliers such that $O \subset \{1, 2, \ldots, N\}$; $O^C$ denotes the complement of $O$. We use the MLP with one hidden layer (i.e., $L = 2$) considered in section 3.1. The activation is performed with the k-tanh functions 2.13 such that $\mathcal{F}(\cdot) = \mathrm{Diag}\{t_i(\cdot)\}_{i=1}^{n_1}$. The input and output dimensions are both equal to one ($n_0 = n_2 = 1$), and we use the values 5, 10, and 20 for the dimension $n_1$ of the hidden layer. The size $N$ of the training data is 30 or 120 (the case $N = 60$ is given in Kärkkäinen & Heikkola, 2002), and, correspondingly, the index set $O$ is chosen to contain 3 or 10 randomly selected indices between 1 and $N$. The amplitudes of the normally and uniformly distributed noise are $\delta = 0.3$ and $\zeta = 2$, respectively. The training data set $\{x_i, y_i\}$ in the case $N = 30$, created according to the definitions above (before scaling to the range of the activation functions), is illustrated in Figure 3.

4.1.2 Comparison of Formulations. Our goal is to estimate the approximation capability of the MLP networks corresponding to the minimization of the two functionals $\mathcal{L}^1_{1,\beta}$ and $\mathcal{L}^2_{2,\beta}$ with different values of the parameters $N$, $n_1$, and $\beta$. For this purpose, we define a validation set of input values by $\hat{x}_i = (i-1)\,2\pi/(N_t - 1)$, $N_t = 257$, which does not coincide with the input values $x_i$ of the training data. The difference between the MLP approximation and the exact function $f$ is then calculated by using the norm

$$\mathrm{err}(W^1, W^2) = \frac{1}{N_t}\sum_{i=1}^{N_t} |W^2 \hat{\mathcal{F}}(W^1 \hat{x}^i) - f(\hat{x}_i)|. \tag{4.2}$$
Let us emphasize that the choice of error measure is not based on favoring $\mathcal{L}^1_{1,\beta}$ but on the fact that this form weights equally both small and large deviations from the exact function.
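The training data of equation 4.1 and the validation error of equation 4.2 are easy to reproduce. In the following sketch, the random seed and the particular outlier indices are arbitrary choices rather than those used in the experiments, and `predict` stands in for a trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
N, delta, zeta = 30, 0.3, 2.0
x = np.arange(N) * 2.0 * np.pi / (N - 1)     # x_i = (i-1) 2*pi/(N-1), i = 1..N
eps = rng.standard_normal(N)                 # eps_i ~ N(0, 1)
O = rng.choice(N, size=3, replace=False)     # 3 random outlier indices for N = 30
eps[O] = 0.0                                 # per equation 4.1, normal noise acts on O^C
eta = np.zeros(N)
eta[O] = rng.uniform(-1.0, 1.0, size=O.size) # eta_i ~ U(-1, 1) on O only
y = np.sin(x) + delta * eps + zeta * eta     # equation 4.1

def l1_validation_error(predict, Nt=257):
    """Mean absolute deviation of equation 4.2 against f(x) = sin(x)."""
    xv = np.arange(Nt) * 2.0 * np.pi / (Nt - 1)
    return float(np.mean(np.abs(predict(xv) - np.sin(xv))))
```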
Figure 3: The training data set $\{x_i, y_i\}$ in the case $N = 30$ and the exact graph of function $f$. The circled markers also involve uniformly distributed noise.
We performed a series of tests with the three different values for $N$ and $n_1$. For fixed $N$ and $n_1$, the value of the regularization parameter $\beta$ varied in the interval $[0, 1]$, and for each $\beta$, we repeated the optimization algorithm 100 times with randomly created initial values for the weight matrices $\{W^1, W^2\}$ and computed the minimum and average values of the errors 4.2. The optimization problems to be solved for training the MLP are nonconvex, and thus there is a large number of local minima (all satisfying corollary 1) in the error surface to be explored by random initialization.

We monitored the value of the regularization term $\frac{\beta}{2}\sum_{l=1}^{L}\sum_{(i,j)\in I_l} |W^l_{i,j}|^2$ of the functional $\mathcal{L}^\alpha_{q,\beta}$ corresponding to the MLP with minimum error in equation 4.2 for finding an effective way to choose the parameter $\beta$. This computation was motivated by the fact that our previous studies in image restoration with similar functions to be optimized have shown a strong correlation between the reconstruction error and the value of the regularization term (Kärkkäinen & Majava, 2000). In addition to the well-known cross-validation techniques, simpler heuristics for this purpose have been proposed and tested with the backpropagation algorithm (e.g., in Rögnvaldsson, 1998).
Table 1: Optimal Values $\beta^*$ of the Regularization Parameter for Different Values of $N$ and $n_1$ with the Functionals $\mathcal{L}^1_{1,\beta}$ and $\mathcal{L}^2_{2,\beta}$.

                 L^1_{1,beta}          L^2_{2,beta}
   N    n1     beta*    err*^a       beta*    err*^a
  30     5    2.0e-2    1.2e-1      7.7e-3    1.6e-1
        10    4.1e-2    1.2e-1      2.0e-2    1.6e-1
        20    1.0e-1    1.3e-1      4.1e-2    1.6e-1
 120     5    1.9e-3    4.3e-2      1.2e-4    5.7e-2
        10    1.3e-2    4.7e-2      6.4e-4    5.9e-2
        20    3.1e-2    4.8e-2      2.6e-3    6.2e-2

^a The minimum error obtained with the value $\beta^*$ over 100 tests.
The computational results are illustrated in Figures 8 through 15 in the appendix. Each figure includes three graphs corresponding to the three dimensions $n_1$ of the hidden layer. For a certain functional and fixed value of $N$, the graphs represent either the average value of the errors in the 100 tests or the value of the regularization term. In each case, the norm 4.2 obtains its minimum value at a certain $\beta$, and these optimal points and minimal values of error are listed in Table 1.

We see that in each test case, the MLP network based on the minimization of the functional $\mathcal{L}^1_{1,\beta^*}$ leads to a better approximation of the exact function $f$ than the MLP based on $\mathcal{L}^2_{2,\beta}$. We can also make the natural conclusion that the error is reduced by increasing $N$. The figures show that with a larger dimension of the hidden layer, the overall values of the error become smaller, but the minimum value remains essentially the same. Moreover, with a larger value of $n_1$, the error of the MLP approximation becomes less sensitive to the choice of the regularization parameter. However, there is a remarkable difference in the behavior of the two learning problem formulations for $n_1 = 20$ (i.e., with high representation capability of the MLP): when $\beta$ grows from $\beta^*$, the average error for $\mathcal{L}^1_{1,\beta}$ essentially stays on the same level, whereas for $\mathcal{L}^2_{2,\beta}$, there is approximately a linear increase. In addition, for $\mathcal{L}^2_{2,\beta}$, small deviations from the optimal regularization parameter $\beta^*$ may lead to a large increase in the error. Interestingly, this suggests that the well-known approach in statistics "to integrate over the nuisance parameters" like $\beta$ would here yield a poorer result (especially for $\mathcal{L}^2_{2,\beta}$) than choosing an appropriate single value.

From the graphs of the regularization term, we conclude that the strong oscillation indicates that the value of $\beta$ is smaller than the optimal value $\beta^*$; in other words, the MLP is too complex with unnecessary variance.
However, otherwise the reconstruction error and the regularization term
are not clearly correlated, and thereby our first approach does not contain enough information to choose the parameter $\beta$ exactly. For $q = \alpha = 1$ and $n_1 = 20$, there seems to be some similarity in the graphs for different values of $N$ to indicate $\beta^*$, although this visual information is difficult to quantify precisely.

4.2 Bivariate Vector-Valued Regression. In the second set of experiments, we consider the reconstruction of a vector-valued function from noisy data. We use again the MLP network with one hidden layer and k-tanh activation and train the network by minimizing the functional $\mathcal{L}^\alpha_{q,\beta}$ with the three choices of $q$ and $\alpha$. The test function is formed as a sum of a global term, which affects the function values over the whole domain, and a local term, which is nonzero only in a small part of the domain. It is well known that the MLP is efficient in approximating the global behavior of a function, but due to its structure, it tends to ignore the local variations. Another commonly used neural architecture is the radial basis function network (RBFN) (Broomhead & Lowe, 1988), which builds approximations with local basis functions. Therefore, when properly focused, an RBFN can catch the local term but gives poor approximations to the global term. A simple idea to combine the advantages of locally and globally approximating networks is to augment the input of the MLP by the squares of the input values. Flake (1998) refers to such MLP networks as SQUARE-MLP (square unit augmented, radially extended, multilayer perceptron; see also Sarajedini & Hecht-Nielsen, 1992). Such networks retain the ability to form global representations, but they can also form local approximations with a single hidden node.

4.2.1 Definition of the Test Problem. We define the vector-valued function $f: \mathbb{R}^2 \to \mathbb{R}^2$, $f(x, y) = (f_1(x, y), f_2(x, y))$, as

$$f_1(x, y) = 4\exp\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{0.7}\right) - \frac{2}{1 + \exp(-6x)} + 1, \tag{4.3}$$

and $f_2(x, y) = f_1(y, -x)$ with $x_0 = y_0 = 2.5$. The function 4.3 is an example of a Hill-Plateau surface (Sarle, 1997), which is a sum of a local gaussian and a global tanh function. The approximation of such a function is known to be difficult for both MLP and RBFN. We attempt to reconstruct the function $f$ in $\Pi = [-5, 5] \times [-5, 5]$. The input vectors of the training data are obtained by first constructing a uniform grid in $\Pi$ with the grid points given by

$$(x_i, y_j) = ((i - 1)\,h - 5,\; (j - 1)\,h - 5), \quad i, j = 1, \ldots, n_g, \tag{4.4}$$
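The Hill-Plateau test function and the training grid can be written down directly from equations 4.3 and 4.4. In the sketch below, dividing the coordinates by 5 to reach the range $[-1, 1]$ is our assumption about the prescaling step; the paper does not spell out the exact scaling.

```python
import numpy as np

x0 = y0 = 2.5

def f1(x, y):
    """Equation 4.3: local gaussian 'hill' plus global sigmoidal 'plateau'."""
    return (4.0 * np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / 0.7)
            - 2.0 / (1.0 + np.exp(-6.0 * x)) + 1.0)

def f2(x, y):
    return f1(y, -x)

# uniform n_g x n_g training grid on [-5, 5]^2 per equation 4.4, h = 10 / n_g
ng = 21
h = 10.0 / ng
coords = np.arange(ng) * h - 5.0
X, Y = np.meshgrid(coords, coords, indexing="ij")

# SQUARE-MLP style input: prescaled coordinates augmented with their squares
Xs, Ys = X / 5.0, Y / 5.0                        # assumed prescaling to [-1, 1]
inputs = np.stack([Xs, Ys, Xs**2, Ys**2], axis=-1).reshape(-1, 4)
```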
where $h = 10/n_g$. For the tests, we choose $n_g = 21$ as in Flake (1998). These coordinate values are then prescaled to the range $[-1, 1]$ of the activation functions, and they are included in the input vector together with the squares of the scaled coordinates.

As in the previous section, all output vectors involve low-amplitude normally distributed noise, while some isolated outputs are also disturbed by high-amplitude outliers. More precisely, the output corresponding to the input $x^{i,j} = (x_i, y_j, x_i^2, y_j^2)^T$ is of the form $y^{i,j} = (y^{i,j}_1, y^{i,j}_2)^T$ with

$$y^{i,j}_k = f_k(x_i, y_j) + \delta\,\varepsilon^{i,j} + \zeta\,\eta^{i,j}, \tag{4.5}$$
where $\varepsilon^{i,j} \sim N(0, 1)$, $(i, j) \in O^C$, and $\eta^{i,j} \sim U(-1, 1)$, $(i, j) \in O$. In the tests, the index set $O$ is chosen to include approximately $0.05\,n_g^2$ randomly selected elements, $\delta = 0.1$, and $\zeta = 2$. The training data created according to the definitions above (before prescaling to the range of the activation function) are illustrated in Figure 4.

4.2.2 Comparison of Formulations. We performed tests by using the values 5, 10, and 20 for $n_1$ with the three different training formulations in equation 2.11, initially without regularization (i.e., $\beta = 0$). The error of the MLP approximations was measured using a uniform $49 \times 49$ validation grid over $\Pi$. Let us denote the input vectors of the validation data by $\hat{x}^{i,j}$ and the corresponding grid points by $(\hat{x}_i, \hat{y}_i)$. Then the error is calculated by

$$\mathrm{err}(W^1, W^2) = \frac{1}{49^2}\sum_{i=1}^{49}\sum_{j=1}^{49} \|W^2 \hat{\mathcal{F}}(W^1 \hat{x}^{i,j}) - f(\hat{x}_i, \hat{y}_i)\|_2. \tag{4.6}$$
With fixed $n_1$, we again repeated the optimization algorithm 100 times with random initialization and computed the minimum and average values of the errors (see equation 4.6).
Figure 4: The components of the output vectors in the training data ($f_1$ left, $f_2$ right).
Table 2: Minimum and Average Errors with the Functionals $\mathcal{L}^2_{2,\beta}$, $\mathcal{L}^1_{1,\beta}$, and $\mathcal{L}^1_{2,\beta}$.

          L^2_{2,beta}        L^1_{1,beta}        L^1_{2,beta}
  n1     err*     err        err*     err        err*     err
   5   9.6e-2   2.8e-1     3.7e-2   1.7e-1     3.4e-2   1.6e-1
  10   1.2e-1   2.4e-1     4.0e-2   1.5e-1     3.6e-2   1.4e-1
  20   1.1e-1   2.3e-1     4.7e-2   1.4e-1     4.4e-2   1.4e-1
The results are collected in Table 2. We conclude that the functionals $\mathcal{L}^1_{1,0}$ and $\mathcal{L}^1_{2,0}$ are more accurate and clearly more robust with respect to the initial guess than the smooth functional $\mathcal{L}^2_{2,0}$. In all cases, the smallest minimum error is achieved with the choice $q = 2\alpha = 2$, while the dimension $n_1$ does not have a strong effect on the accuracy. The best reconstruction, given by the functional $\mathcal{L}^1_{2,0}$, is illustrated in Figure 5.

4.2.3 Determination of the Regularization Parameter. In the previous section, the regularization parameter $\beta$ was equal to zero. However, as already pointed out by the results in section 4.1, the value of $\beta$ has a strong effect on the accuracy of results, and it is thereby important to be able to choose the value correctly. Next, we describe another strategy for choosing the value of $\beta$ and estimate the quality of the resultant MLP network. The dimension of the hidden layer is fixed to $n_1 = 20$, and we use the choice $q = 2\alpha = 2$. The learning data are exactly the same as in the previous tests and are first divided into two disjoint parts $C_1$ and $C_2$ of equal sizes. More precisely, for $N = 441$, the randomly chosen sets $C_1$ and $C_2$ contain $N_1 = 221$ and $N_2 = 220$ elements, respectively. Only the input-output pairs $(x^{i,j}, y^{i,j})$ in the set $C_1$ are used in the functional 2.12, while the set $C_2$ is
Figure 5: The components of the best reconstruction, obtained by minimizing $\mathcal{L}^1_{2,\beta}$ with $n_1 = 5$ ($f_1$ left, $f_2$ right).
reserved for testing. This choice originates from the relaxed requirements concerning the necessary amount of data for robust training. We define the two errors

$$\mathrm{err}_k(W^1, W^2) = \frac{1}{N_k}\sum_{(x^{i,j},\, y^{i,j}) \in C_k} \|W^2 \hat{\mathcal{F}}(W^1 \hat{x}^{i,j}) - y^{i,j}\|_2, \quad k = 1, 2, \tag{4.7}$$
while $\mathrm{err}_3$ refers to the already defined validation error in equation 4.6. We search for an optimal nonzero $\beta$ in the interval $[10^{-9}, 10^{-6}]$, which can be determined by monitoring the value of the regularization term as described within the univariate test. The interval is covered with a predefined set of values $l \cdot 10^{-s}$, $l = 1, 2, 4, 6, 8$; $s = 9, 8, 7$, and for each fixed $\beta$, the optimization algorithm is repeated 50 times with random initialization. We compute the errors $\mathrm{err}_1$ and $\mathrm{err}_2$ in all 50 tests and choose the MLP network with the smallest $\mathrm{err}_2$ to be the best one. For this MLP, we also compute the error $\mathrm{err}_3$. The results of the first stage of our strategy are illustrated by the graphs in Figure 6. We see that all three curves have similar behavior, which suggests that the errors $\mathrm{err}_1$ and $\mathrm{err}_2$, which can be computed without knowing the
Figure 6: Graphs of the three errors $\mathrm{err}_k$ as functions of $\beta$.
exact function $f$, contain the same information as the true error $\mathrm{err}_3$. Based on $\mathrm{err}_1$ and $\mathrm{err}_2$, we now limit the complexity of the MLP by choosing $\beta = 10^{-8}$.

After choosing the value of the regularization parameter, we proceed to the second stage of our strategy, which is the determination of the final MLP network. Because the model complexity of the MLP is now fixed along with $\beta$, we are able to use all of the data in the learning problem. We perform the minimization 100 times and choose the final MLP to be the one that corresponds to the smallest value of the local minima for $\mathcal{L}^1_{2,\beta}$ (i.e., the best candidate for the global minimum). We evaluated the quality of the obtained MLP by computing the corresponding $\mathrm{err}_3$, which was approximately 0.05. By comparing this value to the graph of Figure 6, we see that the approximation is improved from the first stage, and we obtain a very good overall error level. We also computed $\mathrm{err}_3$ in each of the 100 tests and compared these values to the local minima of $\mathcal{L}^1_{2,\beta}$. To study the correlation of these values and thus the validity of the final choice, we sorted the 100 tests in ascending order according to $\mathcal{L}^1_{2,\beta}$. The results of this procedure, together with the corresponding values of $\mathrm{err}_3$, are given in Figure 7. The increase of $\mathrm{err}_3$ from its smallest value stresses the importance of the final choice, which recovered almost the best alternative among the 100 candidates.
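The first stage of this strategy can be sketched as a generic search loop. In the sketch below, `train` and `err2_fn` are placeholders for minimizing $\mathcal{L}^1_{2,\beta}$ on $C_1$ and for evaluating equation 4.7 on $C_2$, and automatic minimization of the $\mathrm{err}_2$ curve stands in for the visual inspection used in the text.

```python
import numpy as np

def select_beta(train, err2_fn, betas, restarts=50):
    """First stage of the strategy: for each beta, train `restarts` networks
    on C1 and record the smallest validation error err2 on C2. The curve
    best[beta] is then minimized (in the text, inspected) to fix beta.
    """
    best = {}
    for beta in betas:
        best[beta] = min(err2_fn(train(beta)) for _ in range(restarts))
    beta_star = min(best, key=best.get)
    return beta_star, best

# the candidate grid used in the text: l * 10^-s, l in {1,2,4,6,8}, s in {9,8,7}
betas = [l * 10.0 ** (-s) for s in (9, 8, 7) for l in (1, 2, 4, 6, 8)]
```

With real training and error functions plugged in, the second stage would then retrain on all data with the selected `beta_star` and keep the run with the smallest cost-function value.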
Figure 7: The minimal value of $\mathcal{L}^1_{2,\beta}$ and $\mathrm{err}_3$ in the final 100 tests.
5 Conclusion

We considered robust learning problem formulations for the MLP network with regularization-based pruning. The MLP transformation was presented in a layer-wise form, which yielded a compact representation of the optimality systems and allowed a straightforward analysis and computer implementation of the proposed techniques. Different learning problem formulations were tested numerically for studying the effect of noise with outliers. We also proposed and tested two novel strategies for blind determination of the regularization parameter and thus the generality of the MLP. Altogether, the combination of robust training, square unit augmentation of input, and effective control of model complexity yielded very promising computational results with simulated data.

Appendix: Error and Regularization Graphs
Figure 8: Average error with the functional $\mathcal{L}^1_{1,\beta}$ and $N = 30$.
Figure 9: Regularization term with the functional $\mathcal{L}^1_{1,\beta}$ and $N = 30$.
Figure 10: Average error with the functional $\mathcal{L}^2_{2,\beta}$ and $N = 30$.
Figure 11: Regularization term with the functional $\mathcal{L}^2_{2,\beta}$ and $N = 30$.

Figure 12: Average error with the functional $\mathcal{L}^1_{1,\beta}$ and $N = 120$.
Figure 13: Regularization term with the functional $\mathcal{L}^1_{1,\beta}$ and $N = 120$.
Figure 14: Average error with the functional $\mathcal{L}^2_{2,\beta}$ and $N = 120$.
Figure 15: Regularization term with the functional $\mathcal{L}^2_{2,\beta}$ and $N = 120$.
Acknowledgments

We thank Hannu Oja and Docent Marko M. Mäkelä for their help during the course of this research. We also gratefully acknowledge the comments and suggestions made by the anonymous referees, which improved the presentation of the results. This work was financially supported by the Academy of Finland, grant 49006, and by the National Technology Agency of Finland, grant 2371/31/02 and the InBCT project.

References

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory, 44(2), 525–536.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Broomhead, D. S., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2(3), 321–355.

Chen, D. S., & Jain, R. C. (1994). A robust backpropagation learning algorithm for function approximation. IEEE Transactions on Neural Networks, 5(3), 467–479.

Clarke, F. H. (1983). Optimization and nonsmooth analysis. New York: Wiley.
Flake, G. W. (1998). Square unit augmented, radially extended, multilayer perceptrons. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 145–163). Berlin: Springer-Verlag.

Hagan, M., & Menhaj, M. (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6), 989–993.

Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.

Hettmansperger, T. P., & McKean, J. W. (1998). Robust nonparametric statistical methods. London: Edward Arnold.

Huber, P. J. (1981). Robust statistics. New York: Wiley.

Kärkkäinen, T. (2002). MLP-network in a layer-wise form with applications to weight decay. Neural Computation, 14(6), 1451–1480.

Kärkkäinen, T., & Heikkola, E. (2002). Robust MLP (Rep. No. C 1). Jyväskylä: University of Jyväskylä, Department of Mathematical Information Technology.

Kärkkäinen, T., & Majava, K. (2000). Determination of regularization parameter in monotone active set method for image restoration. In P. Neittaanmäki, T. Tiihonen, & P. Tarvainen (Eds.), Proceedings of the Third European Conference on Numerical Mathematics and Advanced Applications (pp. 641–648). Singapore: World Scientific.

Kärkkäinen, T., Majava, K., & Mäkelä, M. M. (2001). Comparison of formulations and solution methods for image restoration problems. Inverse Problems, 17(6), 1977–1995.

Kemperman, J. (1987). The median of a finite measure on a Banach space. In Y. Dodge (Ed.), Statistical data analysis based on the L1-norm and related methods (pp. 217–230). Amsterdam: North-Holland.

Kosko, B. (1992). Neural networks and fuzzy systems: A dynamical systems approach to machine intelligence. Englewood Cliffs, NJ: Prentice Hall.

Liano, K. (1996). Robust error measure for supervised neural network learning with outliers. IEEE Transactions on Neural Networks, 7(1), 246–250.

Mäkelä, M. M., & Neittaanmäki, P. (1992). Nonsmooth optimization. River Edge, NJ: World Scientific.
Milasevic, P., & Ducharme, G. (1987). Uniqueness of the spatial median. Ann. Statist., 15(3), 1332–1333.

Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.

Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer-Verlag.

Oja, H. (1999). Affine invariant multivariate sign and rank tests and corresponding estimates: A review. Scand. J. Statist., 26(3), 319–343.

Prechelt, L. (1998). Early stopping—but when? In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 55–70). Berlin: Springer-Verlag.

Rao, C. R. (1988). Methodology based on the L1-norm in statistical inference. Sankhyā Ser. A, 50(3), 289–313.

Raudys, Š. (1998a). Evolution and generalization of a single neurone: I. Single-layer perceptron as seven statistical classifiers. Neural Networks, 11(2), 283–296.
Raudys, Š. (1998b). Evolution and generalization of a single neurone: II. Complexity of statistical classifiers and sample size considerations. Neural Networks, 11(2), 297–313.

Raudys, Š. (2000). Evolution and generalization of a single neurone: III. Primitive, regularized, standard, robust and minimax regressions. Neural Networks, 13(4–5), 507–523.

Rögnvaldsson, T. S. (1998). A simple trick for estimating the weight decay parameter. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 71–92). Berlin: Springer-Verlag.

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley.

Saito, K., & Nakano, R. (2000). Second-order learning algorithm with squared penalty term. Neural Computation, 12(3), 709–729.

Sarajedini, A., & Hecht-Nielsen, R. (1992). The best of both worlds: Casasent networks integrate multilayer perceptrons and radial basis functions. In International Joint Conference on Neural Networks, 1992 (Vol. 3, pp. 905–910). Piscataway, NJ: IEEE.

Sarle, W. (1997). The comp.ai.neural-nets frequently asked questions list. Available on-line at: http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-14.html.
Received January 29, 2003; accepted August 20, 2003.
LETTER
Communicated by Daniela Pucci de Farias
An Extended Projection Neural Network for Constrained Optimization

Youshen Xia
Department of Applied Mathematics, Nanjing University of Posts and Telecommunications, China
Recently, a projection neural network has been shown to be a promising computational model for solving variational inequality problems with box constraints. This letter presents an extended projection neural network for solving monotone variational inequality problems with linear and nonlinear constraints. In particular, the proposed neural network can include the projection neural network as a special case. Compared with the modified projection-type methods for solving constrained monotone variational inequality problems, the proposed neural network has a lower complexity and is suitable for parallel implementation. Furthermore, the proposed neural network is theoretically proven to be exponentially convergent to an exact solution without a Lipschitz condition. Illustrative examples show that the extended projection neural network can be used to solve constrained monotone variational inequality problems.
Neural Computation 16, 863–883 (2004)
© 2004 Massachusetts Institute of Technology

1 Introduction

Constrained optimization problems arise in a wide variety of scientific and engineering applications, including signal processing, system identification, filter design, regression analysis, and robot control (Bazaraa, Sherali, & Shetty, 1993). In many practical applications such as robot motion control, related optimization problems have a time-varying parameter characteristic (Sciavicco & Siciliano, 2000) and thus have to be solved in real time. For such applications, conventional solution methods for static optimization may not be effective due to the time-varying nature of such optimization problems and the requirement on computational time.

Neural networks are composed of massively connected simple neurons. Having structures similar to their biological counterparts, artificial neural networks are representational and computational models processing information in a parallel distributed fashion (Hopfield, 1982). Feedforward neural networks and recurrent neural networks are two major classes of neural network models. Feedforward neural networks, such as the popular multilayer perceptron (Rumelhart, Hinton, & Williams, 1986), are usually used as representational models trained using a learning algorithm based on a set of sampled input-output data. It has been proven that multilayer feedforward neural networks are universal approximators (Hornik, Stinchcombe, & White, 1989). Recurrent neural networks, as dynamical systems, are usually used as computational models for solving computationally intensive problems (Golden, 1996). Because of the inherent nature of parallel and distributed information processing in neural networks, neural networks are promising computational models for real-time applications. Typical examples of recurrent neural network applications are NP-complete combinatorial optimization problems, minor component analysis, associative memory, and real-time optimization problems (Peterson & Soderberg, 1989; Ohlasson, Peterson, & Soderberg, 1993; Abe, Kawakami, & Hirasawa, 1992; Oja, 1992; Sevrani & Abe, 2000; Cichocki & Unbehauen, 1993).

The analog circuit research for solving linear programming problems perhaps stemmed from Pyne's work (Pyne, 1956) a half-century ago. Tank and Hopfield (1986) first developed a linear programming neural network suitable for applications that require on-line optimization. Their seminal work has inspired many researchers to investigate alternative and improved neural networks for solving linear and nonlinear programming problems. For example, Kennedy and Chua (1988) proposed an extended neural network for solving nonlinear programming problems. Because this network contains a penalty parameter, it generates approximate solutions only and has an implementation problem when the penalty parameter is very large (Lillo, Loh, & Hui, 1993). To avoid using penalty parameters, some significant work has been done in recent years. Rodríguez-Vázquez, Domínguez-Castro, Rueda, Huertas, and Sánchez-Sinencio (1990) proposed a switched-capacitor neural network for solving a class of optimization problems. This network is suitable only for the case that the optimal solution lies in the feasible region. Otherwise, the network may have no equilibrium point.
Zhang, Zhu, and Zou (1992) developed a second-order neural network for solving a class of nonlinear optimization problems with equality constraints. Although it has a fast convergence rate, this network requires computing a time-varying inverse matrix and thus has a high model complexity for implementation. Bouzerdoum and Pattison (1993) presented a neural network for solving quadratic optimization problems with bounded variables only. Xia (1996a) proposed a primal-dual neural network for solving linear programming problems, Xia (1996b) proposed another primal-dual network for solving linear and quadratic programming problems, and Xia and Wang (2001) proposed a dual neural network for solving strictly convex quadratic programming problems. Recently, Xia, Leung, and Wang (2002) developed a projection neural network with the following dynamical equation,
    dx/dt = λ{P_X(x − F(x)) − x},    (1.1)
An Extended Projection Neural Network
where F(x) is a differentiable vector-valued function from R^n into R^n, X = {x ∈ R^n | l ≤ x ≤ h}, and P_X : R^n → X is a projection operator defined componentwise by

    [P_X(x)]_i = l_i if x_i < l_i;  x_i if l_i ≤ x_i ≤ h_i;  h_i if x_i > h_i,

where l = [l_1, …, l_n]^T and h = [h_1, …, h_n]^T. This network has a low model complexity, and its equilibrium point coincides with the solution of the following variational inequality problem with box constraints:

    (x − x*)^T F(x*) ≥ 0,  x ∈ X.    (1.2)
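To make the box projection concrete, the sketch below (Python rather than MATLAB; the box, the mapping F(x) = x − c, and all numbers are illustrative choices, not from the paper) integrates equation 1.1 with forward Euler. For this particular F, the solution of the box-constrained VI is exactly P_X(c), so the trajectory can be checked against it:

```python
import numpy as np

# Box X = {x : l <= x <= h}; P_X is componentwise clipping.
l = np.zeros(3)
h = np.ones(3)

def P_X(x):
    return np.clip(x, l, h)

# Illustrative mapping F(x) = x - c (gradient of 0.5*||x - c||^2),
# whose box-constrained VI solution is the projection of c onto X.
c = np.array([2.0, -1.0, 0.5])
F = lambda x: x - c

# Forward-Euler integration of dx/dt = lam * {P_X(x - F(x)) - x}  (eq. 1.1).
lam, dt = 1.0, 0.1
x = np.array([0.3, 0.3, 0.3])
for _ in range(2000):
    x = x + dt * lam * (P_X(x - F(x)) - x)

print(x)   # approaches P_X(c) = [1.0, 0.0, 0.5]
```

For this F the dynamics reduces to dx/dt = λ(P_X(c) − x), a linear system that contracts to the VI solution at rate λ, which is the simplest instance of the exponential convergence studied later in the article.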
Xia et al. (2002) showed that the projection neural network in equation 1.1 is globally convergent to an equilibrium point when the Jacobian matrix ∇F(x) of F is symmetric and positive semidefinite, and is globally asymptotically stable at the equilibrium point when ∇F(x) is positive definite. Since it can be seen from the optimization literature (Kinderlehrer & Stampacchia, 1980) that many constrained optimization problems, such as linear and nonlinear complementarity problems, can be converted into problem 1.2, the projection neural network can be applied directly to solve monotone nonlinear complementarity problems and a class of constrained optimization problems. In this article, we are concerned with the following variational inequality problem with linear and nonlinear constraints,

    (x − x*)^T F(x*) ≥ 0,  x ∈ Ω,    (1.3)

where Ω = {x ∈ R^n | g(x) ≤ 0, Ax = b, x ∈ X}, A ∈ R^{r×n}, b ∈ R^r, and g(x) = [g_1(x), …, g_m(x)]^T is an m-dimensional vector-valued continuous function of n variables. The functions g_1, …, g_m are assumed to be convex and twice differentiable, and the existence of a solution of problem 1.3 is also assumed. Since the constraint set Ω is more complex than X, problem 1.3 is a significant extension of problem 1.2. Moreover, problem 1.3 has been viewed as a natural framework for unifying the treatment of constrained variational inequality problems and related optimization problems (Bertsekas, 1989; Harker & Pang, 1990; Ferris & Pang, 1997; Dupuis & Nagurney, 1993). The objective of this article is to present an extended projection neural network to solve problem 1.3. The proposed neural network is thus a significant extension of the projection neural network for solving constrained variational inequality problems from box constraints to linear and nonlinear constraints. Compared with the modified projection-type numerical method for solving problem 1.3 (Solodov & Tseng, 1996),
which requires a varying step length, the proposed neural network not only has a lower complexity but also is suitable for parallel implementation. Furthermore, the proposed neural network is theoretically proven to be exponentially convergent to the solution of problem 1.3 under only a strict monotonicity condition on F. As a result, the proposed neural network can be directly used to solve constrained monotone variational inequality problems and related optimization problems, including nonlinear convex programming problems. Numerical examples also show that the proposed neural network has a faster convergence rate than the modified projection-type method.

2 An Extended Projection Neural Network

In this section, problem 1.3 is equivalently reformulated as a set of nonlinear equations. An extended projection neural network model for solving problem 1.3 is then presented, and a comparison is given.

2.1 Problem Reformulation. It can be seen from the optimization literature (Marcotte, 1991; Bertsekas, 1989) that the Karush-Kuhn-Tucker (KKT) condition for problem 1.3 can be written as

    y ≥ 0, g(x) ≤ 0, l ≤ x ≤ h, Ax = b, y^T g(x) = 0,

and

    (F(x) + ∇g(x)y − A^T z)_i ≤ 0  if x_i = h_i,
    (F(x) + ∇g(x)y − A^T z)_i = 0  if l_i < x_i < h_i,
    (F(x) + ∇g(x)y − A^T z)_i ≥ 0  if x_i = l_i,

where y ∈ R^m, z ∈ R^r, ∇g(x) = (∇g_1(x), …, ∇g_m(x)), and ∇g_i(x) is the gradient of g_i for i = 1, …, m. According to the projection theorem (Kinderlehrer & Stampacchia, 1980), the above KKT condition can be equivalently represented as

    P_X(x − (F(x) + ∇g(x)y − A^T z)) = x,
    (y + g(x))^+ = y,    (2.1)
    Ax = b,

where (y)^+ = [(y_1)^+, …, (y_m)^+]^T with (y_i)^+ = max{0, y_i}. Thus, the following lemma holds.

Lemma 1. Assume that ∇F(x) is positive semidefinite. x* is a solution to problem 1.3 if and only if there exist y* ∈ R^m and z* ∈ R^r such that (x*, y*, z*) satisfies equation 2.1.
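The reformulation can be probed numerically. Below is a Python sketch (a hypothetical one-dimensional instance chosen for illustration, not taken from the paper) of the residual map whose zeros are exactly the solutions of the system in equation 2.1. For minimize (x − 3)² subject to x − 2 ≤ 0 on X = [0, 10] with no equality constraint, the KKT point is x* = 2 with multiplier y* = 2, and the residual vanishes there:

```python
import numpy as np

# Hypothetical 1-D instance: F(x) = gradient of (x-3)^2, g(x) = x - 2 <= 0,
# box X = [0, 10], and a trivial equality constraint (A = 0, b = 0).
F  = lambda x: 2.0 * (x - 3.0)
g  = lambda x: x - 2.0
dg = lambda x: np.array([1.0])          # gradient of g
A, b = np.zeros((1, 1)), np.zeros(1)
P_X = lambda x: np.clip(x, 0.0, 10.0)

def residual(x, y, z):
    """Zero exactly at points satisfying the reformulated KKT system (2.1)."""
    rx = P_X(x - (F(x) + dg(x) * y - A.T @ z)) - x
    ry = np.maximum(y + g(x), 0.0) - y
    rz = -A @ x + b
    return np.concatenate([rx, ry, rz])

# KKT point of this instance: x* = 2, y* = 2 (any z works, since A = 0).
print(residual(np.array([2.0]), np.array([2.0]), np.array([0.0])))
# prints [0. 0. 0.] -- the residual vanishes at the KKT point
```

The state equation introduced in section 2.2 is precisely gradient flow along this residual.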
2.2 Neural Dynamical Equations. To solve an optimization problem by neural computation, the key is to cast the problem into the architecture of a recurrent neural network whose stable state represents the solution to the optimization problem. Based on the equivalent formulation 2.1, we propose an extended projection neural network for solving problem 1.3 as follows.

State equation:

    du/dt = λ [ P_X(x − (F(x) + ∇g(x)y − A^T z)) − x
                (y + g(x))^+ − y
                −Ax + b ].    (2.2)

Output equation: w(t) = x(t), where u = (x, y, z) ∈ R^n × R^m × R^r is a state vector, w is an output vector, and λ > 0 is a scaling parameter. The state equation and the output equation can be implemented by a recurrent neural network with a simple structure as shown in Figure 1, where λ = 1. The circuit realizing the proposed neural network consists of n + m + r integrators, m + n piecewise activation functions for P_X(x) and (y)^+, n processors for F(x), m processors for g(x), m processors for ∇g(x), 2rn connection weights for A and A^T, and some summers. Therefore, the network complexity depends only on that of the mappings F(x) and g(x).

2.3 A Comparison. The extended projection neural network is a significant extension of the existing projection neural network. One major difference between the projection neural network in equation 1.1 and the extended projection neural network in equation 2.2 lies essentially in the treatment of different constraints. The projection neural network in equation 1.1 is used to solve problem 1.2 with box constraints, while the extended projection neural network in equation 2.2 is used to solve problem 1.3 with linear and nonlinear constraints. In the special case that Ω = X, g(x) = 0, A = O, and b = 0, the proposed neural network in equation 2.2 reduces to the projection neural network in equation 1.1. In addition, when the projection neural network is applied directly to problem 1.3, it is expressed as

    dx/dt = λ{P_Ω(x − F(x)) − x},

where P_Ω(u) is given by

    P_Ω(x) = arg min_{v ∈ Ω} ‖x − v‖,

where ‖·‖ denotes the l₂ norm. Since the set Ω is not a box or a sphere, P_Ω(u) cannot be expressed explicitly. As a result, the above projection neural network
Figure 1: A block diagram of the extended projection neural network in equation 2.2.
cannot be used to solve problem 1.3. Next, the extended projection neural network can be used to solve nonlinear convex programming problems. In fact, consider the case that F(x) = ∇f(x), where f(x) is a twice differentiable convex function and F(x) is the gradient, ∇f(x), of f. Then it can be seen from the optimization literature (Kinderlehrer & Stampacchia, 1980) that problem 1.3 becomes the well-known general nonlinear programming problem,

    minimize  f(x)
    subject to  g(x) ≤ 0, Ax = b, x ∈ X,    (2.3)

where g(x), A, and b are defined as above. The extended projection neural network in equation 2.2 then becomes

    dx/dt = λ{−x + P_X(x − (∇f(x) + ∇g(x)y − A^T z))},
    dy/dt = λ{−y + (y + g(x))^+},    (2.4)
    dz/dt = λ(−Ax + b),

where x ∈ R^n, y ∈ R^m, and z ∈ R^r. Compared to the Kennedy-Chua neural network for solving equation 2.3, the neural network in equation 2.4 has no penalty parameter and can globally converge to an exact optimal solution
of equation 2.3. Moreover, unlike the switched-capacitor neural networks for solving equation 2.3, the neural network in equation 2.4 allows the optimal solution of equation 2.3 to lie on the boundary of the feasible region. Finally, we compare the proposed neural network with existing numerical optimization methods. Note that when F(x) = ∇f(x) does not hold, numerical methods for linear and nonlinear programming, such as the gradient method and the penalty function method, cannot be used to solve problem 1.3. From the viewpoint of both computational implementation and required conditions, the modified projection-type method (Solodov & Tseng, 1996) is viewed as the best among existing numerical methods for constrained monotone variational inequalities. Thus we compare the proposed neural network with the modified projection-type method in terms of algorithm complexity and convergence conditions. The modified projection-type method is defined as

    u^(k+1) = u^(k) − βh(u^(k)){w^(k) + αG(P_X̂(w^(k))) − P_X̂(w^(k))},

where α > 0 and β > 0 are scaling parameters, w = u − αG(u), u = (x, y, z) ∈ R^{n+m+r}, P_X̂(u) = [P_X(x), (y)^+, z]^T, P_X(x) and (y)^+ are defined in equation 2.1, and

    G(u) = [ F(x) + ∇g(x)y − A^T z, −g(x), Ax − b ]^T,
    h(u) = ‖u − P_X̂(w)‖² / ‖w + αG(P_X̂(w)) − P_X̂(w)‖².

On the one hand, the modified projection-type method is not suitable for parallel implementation because of the varying step length h(u). It must compute the nonlinear terms P_X̂(w), G(P_X̂(w)), and h(u) per iteration, whereas the proposed neural network computes only P_X̂(w) per iteration and thus has a lower complexity. On the other hand, the global convergence of the modified projection-type method is guaranteed only when the Lipschitz condition

    (u − v)^T (G(u) − G(v)) ≤ β‖u − v‖²,  ∀u, v ∈ R^{n+m+r},

and α < 1/β are satisfied. The proposed neural network will be shown to converge exponentially to an exact solution of problem 1.3 without requiring the Lipschitz condition and α < 1/β. It is worth pointing out that when g(x) is nonlinear, the Lipschitz condition is not satisfied even if F(x) and g(x) are both Lipschitz continuous. As a result, the global convergence of the projection-type method cannot be guaranteed in such cases.

3 Exponential Convergence

In this section, we study the exponential convergence of the extended projection neural network. Before going further, we give two definitions and a lemma.
Definition 1. The Jacobian matrix, ∇F(x), of F is said to be positive semidefinite on X if

    v^T ∇F(x) v ≥ 0,  ∀x ∈ X, v ∈ R^n.

∇F(x) is positive definite if the strict inequality holds in the above equation whenever v ≠ 0, and uniformly positive definite if there is δ > 0 so that

    v^T ∇F(x) v ≥ δ‖v‖₂²,  ∀x ∈ X, v ∈ R^n,

where ‖·‖₂ denotes the l₂ norm.

Definition 2. Let x* be a solution of problem 1.3. The extended projection neural network is said to be exponentially convergent to x* if its output trajectory w(t) satisfies

    ‖w(t) − x*‖₂ ≤ c₀e^{−η(t−t₀)},  ∀t ≥ t₀,

where c₀ is a positive constant dependent on the initial point and η is a positive constant independent of the initial point.

Lemma 2. (i) The state equation 2.2 has at least one equilibrium point. Moreover, if (x*, y*, z*) is an equilibrium point of equation 2.2, then x* is a solution of problem 1.3. (ii) For any initial point, there exists a unique continuous solution of equation 2.2. (iii) Let u(t) = (x(t), y(t), z(t)) be the state trajectory of equation 2.2 with initial point u(t₀) = (x(t₀), y(t₀), z(t₀)). If x(t₀) ∈ X and y(t₀) ≥ 0, then x(t) ∈ X and y(t) ≥ 0.

Proof. (i) Since problem 1.3 has a solution, we see by lemma 1 that there exist y* and z* such that (x*, y*, z*) is an equilibrium point of equation 2.2. Moreover, if (x*, y*, z*) is an equilibrium point of equation 2.2, then (x*, y*, z*) satisfies equation 2.1, and thus, by lemma 1, x* is a solution of problem 1.3.

(ii) Note that F(x), ∇g(x), P_X(x), and (y)^+ are locally Lipschitz continuous. Then both P_X(x − (F(x) + ∇g(x)y − A^T z)) and (y + g(x))^+ are locally Lipschitz continuous, and hence so is the right-hand side of equation 2.2. According to the local existence theorem of ordinary differential equations (Slotine & Li, 1991), there exists a unique continuous solution u(t) of equation 2.2 on [t₀, T). Moreover, as seen in the proof of theorem 1, the solution u(t) is bounded, and thus T = +∞.

(iii) Since

    dx/dt + x = P_X(x − (F(x) + ∇g(x)y − A^T z)),
    dy/dt + y = (y + g(x))^+,

we have

    ∫_{t₀}^{t} (dx/ds + x)e^s ds = ∫_{t₀}^{t} e^s P_X(x − (F(x) + ∇g(x)y − A^T z)) ds,
    ∫_{t₀}^{t} (dy/ds + y)e^s ds = ∫_{t₀}^{t} e^s (y + g(x))^+ ds.

It follows that

    x(t) = e^{−(t−t₀)}x(t₀) + e^{−t} ∫_{t₀}^{t} e^s P_X(x − (F(x) + ∇g(x)y − A^T z)) ds,
    y(t) = e^{−(t−t₀)}y(t₀) + e^{−t} ∫_{t₀}^{t} e^s (y + g(x))^+ ds.

Since (y + g(x))^+ ≥ 0 and y(t₀) ≥ 0, y(t) ≥ 0. Furthermore, by the integral mean value theorem, the first equation above can be written as

    x(t) = e^{−(t−t₀)}x(t₀) + e^{−t} P_X(z₀) ∫_{t₀}^{t} e^s ds
         = e^{−(t−t₀)}x(t₀) + (1 − e^{−(t−t₀)})P_X(z₀),

where z₀ = x(s₀) − (F(x(s₀)) + ∇g(x(s₀))y(s₀) − A^T z(s₀)) and t₀ ≤ s₀ ≤ t. Since P_X(z₀) ∈ X and x(t₀) ∈ X, the convex combination x(t) ∈ X.

We now establish the main result on the exponential convergence of the extended projection neural network.

Theorem 1. Assume that ∇F(x) is positive definite on Ω. Then the extended projection neural network with any initial point u(t₀) = (x(t₀), y(t₀), z(t₀)) ∈ X × R₊^m × R^r is stable in the sense of Lyapunov and converges exponentially to x*.

Proof.
Define the set of equilibrium points of equation 2.2 by

    E₀ = {u* = (x*, y*, z*) | T(u*) = 0},

where

    T(u) = [ P_X(x − (F(x) + ∇g(x)y − A^T z)) − x, (y + g(x))^+ − y, −Ax + b ]^T.

It is necessary to show that x* is unique. Assume that u₁ = (x₁, y₁, z₁) ∈ E₀ and u₂ = (x₂, y₂, z₂) ∈ E₀. From lemma 1, it follows that x₁ and x₂ are solutions to problem 1.3. Then

    (x − x₁)^T F(x₁) ≥ 0, ∀x ∈ Ω  and  (x − x₂)^T F(x₂) ≥ 0, ∀x ∈ Ω.

Thus,

    (x₂ − x₁)^T F(x₁) ≥ 0,
    (x₁ − x₂)^T F(x₂) ≥ 0.

Hence,

    (x₁ − x₂)^T (F(x₂) − F(x₁)) ≥ 0.

Since ∇F(x) is positive definite, F(x) is strictly monotone, which contradicts the above inequality unless x₁ = x₂. It follows that x₁ = x₂.

We now show that the output trajectory of the proposed projection neural network is exponentially convergent to x*. Denote the state trajectory of equation 2.2 by u(t) with the initial point u(t₀). By lemma 2, we see that x(t) ∈ X and y(t) ≥ 0. Consider the following Lyapunov function,

    V(u) = −G(u)^T T(u) − (1/2)‖T(u)‖₂² + (1/2)‖u − u*‖₂²,

where u* = (x*, y*, z*) is an equilibrium point of equation 2.2 and

    G(u) = [ F(x) + ∇g(x)y − A^T z, −g(x), Ax − b ]^T.

Following the analysis given in Xia et al. (2002), in the projection inequality

    (v − P_Ω₀(v))^T (P_Ω₀(v) − u) ≥ 0,  v ∈ R^{n+m+r}, u ∈ Ω₀,

where Ω₀ = X × R₊^m × R^r and P_Ω₀(u) = [P_X(x), (y)^+, z]^T, we let v = u − G(u). Then

    (u − G(u) − P_Ω₀(u − G(u)))^T (P_Ω₀(u − G(u)) − u) ≥ 0;

that is,

    −G(u)^T {P_Ω₀(u − G(u)) − u} ≥ ‖P_Ω₀(u − G(u)) − u‖₂².

Note that T(u) = P_Ω₀(u − G(u)) − u. It follows that

    −G(u)^T T(u) ≥ ‖T(u)‖₂²
and

    V(u) ≥ (1/2)‖u − u*‖₂².

Similarly, we have

    dV(u)/dt ≤ λ{−G(u)^T (u − u*) − T(u)^T ∇G(u) T(u)},

where

    ∇G(u) = [ ∇F(x) + Σ_{i=1}^{m} y_i ∇²g_i(x)   ∇g(x)   −A^T ]
            [ −∇g(x)^T                            O₁       O₂   ]
            [ A                                   O₃       O₄   ],

and O₁ ∈ R^{m×m}, O₂ ∈ R^{m×r}, O₃ ∈ R^{r×m}, and O₄ ∈ R^{r×r} are zero matrices. Note that y_i(t) ≥ 0 and ∇²g_i(x) is positive semidefinite. Then ∇G(u) is positive semidefinite and T(u)^T ∇G(u) T(u) ≥ 0. It follows that

    dV(u)/dt ≤ −λG(u)^T (u − u*).

It follows that

    V(u(t)) ≤ V(u(t₀)) − λ ∫_{t₀}^{t} G(u(s))^T (u(s) − u*) ds

and

    ‖u(t) − u*‖₂² ≤ 2V(u(t₀)) − 2λ ∫_{t₀}^{t} G(u(s))^T (u(s) − u*) ds.

Note that u* satisfies

    (u − u*)^T G(u*) ≥ 0,  ∀u ∈ X × R₊^m × R^r.

Then

    G(u)^T (u − u*) ≥ (G(u) − G(u*))^T (u − u*).

Thus,

    ‖u(t) − u*‖₂² ≤ 2V(u(t₀)) − 2λ ∫_{t₀}^{t} (G(u(s)) − G(u*))^T (u(s) − u*) ds.
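The chain of inequalities above rests on the projection inequality and the resulting bound −G(u)^T T(u) ≥ ‖T(u)‖₂², which hold at every u ∈ Ω₀. Both are easy to spot-check numerically; the sketch below (Python, reusing a hypothetical one-dimensional instance that is not one of the paper's examples) samples points of Ω₀ = X × R₊^m × R^r and verifies the bound at each:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance: F(x) = 2(x-3), g(x) = x-2, X = [0, 10], A = 0, b = 0.
F   = lambda x: 2.0 * (x - 3.0)
g   = lambda x: x - 2.0
P_X = lambda x: np.clip(x, 0.0, 10.0)

def G(u):
    x, y, z = u
    return np.array([F(x) + 1.0 * y - 0.0 * z,   # F(x) + grad g(x) y - A^T z
                     -g(x),                       # -g(x)
                     0.0])                        # Ax - b (A = 0, b = 0)

def P_Omega0(u):                                  # projection onto X x R_+ x R
    x, y, z = u
    return np.array([P_X(x), max(y, 0.0), z])

def T(u):                                         # T(u) = P_Omega0(u - G(u)) - u
    return P_Omega0(u - G(u)) - u

for _ in range(1000):
    u = np.array([rng.uniform(0, 10), rng.uniform(0, 5), rng.uniform(-5, 5)])
    assert -G(u) @ T(u) >= T(u) @ T(u) - 1e-9
print("bound -G(u)^T T(u) >= ||T(u)||^2 holds at all sampled points")
```

The bound fails in general for u outside Ω₀, which is why lemma 2(iii) (invariance of X × R₊^m along trajectories) matters for the proof.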
On the one hand, using

    (G(u) − G(u*))^T (u − u*) = ∫₀¹ (u − u*)^T ∇G(u* + s(u − u*)) (u − u*) ds
                              = ∫₀¹ (u − u*)^T ∇G(û) (u − u*) ds,

where û = u* + s(u − u*), we get

    (G(u) − G(u*))^T (u − u*)
      = ∫₀¹ (x − x*)^T (∇F(x̂) + Σ_{i=1}^{m} ŷ_i ∇²g_i(x̂)) (x − x*) ds
        + ∫₀¹ (x − x*)^T ∇g(x̂)(y − y*) ds − ∫₀¹ (y − y*)^T ∇g(x̂)^T (x − x*) ds
        − ∫₀¹ (x − x*)^T A^T (z − z*) ds + ∫₀¹ (z − z*)^T A (x − x*) ds
      = ∫₀¹ (x − x*)^T (∇F(x̂) + Σ_{i=1}^{m} ŷ_i ∇²g_i(x̂)) (x − x*) ds,

where û = (x̂, ŷ, ẑ) and the cross terms cancel in pairs. On the other hand, note that dV(u)/dt ≤ 0 implies

    u(t) ∈ S = {u ∈ X × R₊^m × R^r | V(u) ≤ V(u(t₀))},

and V(u_k) → +∞ whenever ‖u_k‖₂ → +∞. Then {u(t) = (x(t), y(t), z(t))} and S are bounded. Let S₁ = {x | x = x* + s(x(t) − x*), 0 ≤ s ≤ 1, t ≥ t₀}. It follows that S₁ ⊂ S. Since ∇F(x) is positive definite on S,

    v̂^T ∇F(x) v̂ > 0,  ∀x ∈ S, ‖v̂‖₂ = 1.

The function (x, v̂) ↦ v̂^T ∇F(x) v̂ is continuous on the bounded set S × {v̂ : ‖v̂‖₂ = 1}. Thus, there exists δ > 0 such that

    v^T ∇F(x) v ≥ δ‖v‖₂²,  ∀x ∈ S, ∀v ∈ R^n.

Note that x* + s(x(t) − x*) ∈ S₁ for 0 ≤ s ≤ 1. Then for all t ≥ t₀,

    (x(t) − x*)^T ∇F(x* + s(x(t) − x*)) (x(t) − x*) ≥ δ‖x(t) − x*‖₂².

It follows that

    ‖x(t) − x*‖₂² ≤ 2V(u(t₀)) − β₁ ∫_{t₀}^{t} ‖x(s) − x*‖₂² ds,

where β₁ = 2λδ. According to the Bellman-Gronwall inequality (Slotine & Li, 1991), we obtain

    ‖w(t) − x*‖₂ = ‖x(t) − x*‖₂ ≤ √(2V(u(t₀))) e^{−β₁(t−t₀)/2},  ∀t ≥ t₀.
Therefore, the extended projection neural network converges exponentially to x*.

Remark 1. According to theorem 1, the output trajectory of the projection neural network converges to a solution with any given precision ε > 0 within a finite time. In fact, we observe that

    ‖w(t) − x*‖₂ ≤ √(2V(u(t₀))) e^{−β₁(t−t₀)/2} < ε

holds whenever

    e^{β₁(t−t₀)/2} > √(2V(u(t₀)))/ε,

that is,

    t − t₀ > (2/β₁) ln(√(2V(u(t₀)))/ε).

Thus, ‖w(t) − x*‖ < ε, provided that

    t ≥ t₀ + (2/β₁) ln(√(2V(u(t₀)))/ε).
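As a worked numeric instance of this settling-time bound (the values of V(u(t₀)), β₁, and ε below are hypothetical choices for illustration, not taken from the paper's examples):

```python
import math

# Hypothetical values: V(u(t0)) = 10, beta_1 = 2, precision eps = 1e-3.
V0, beta1, eps = 10.0, 2.0, 1e-3

# Settling time from remark 1: t >= t0 + (2/beta1) * ln(sqrt(2*V0)/eps).
t_settle = (2.0 / beta1) * math.log(math.sqrt(2.0 * V0) / eps)
print(t_settle)   # ~8.4: output stays within eps of x* for t >= t0 + t_settle
```

The logarithmic dependence on 1/ε is the hallmark of exponential convergence: tightening the precision by a factor of 10 adds only (2/β₁)ln 10 to the settling time.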
Furthermore, if the scaling parameter λ is chosen large enough in advance, then the output trajectory w(t) can reach the solution of problem 1.3 within a finite time. In fact, assume the initial point satisfies x(t₀) ≠ x*. Then there exist τ > 0 and μ > 0 such that

    ∫_{t₀}^{t₀+τ} ‖x(s) − x*‖₂² ds ≥ μτ > 0.

It follows that for any t ≥ t₀ + τ,

    ‖w(t) − x*‖₂² ≤ 2V(u(t₀)) − β₁ ∫_{t₀}^{t} ‖x(s) − x*‖₂² ds ≤ 2V(u(t₀)) − β₁μτ.

Taking λ = V(u(t₀))/(μτδ), so that β₁μτ = 2λδμτ = 2V(u(t₀)), we have

    ‖w(t) − x*‖₂² ≤ 2V(u(t₀)) − 2V(u(t₀)) = 0.

Therefore, w(t) = x* when t ≥ t₀ + τ.

Remark 2. It should be noted that because ∇G(u) is only positive semidefinite, the existing asymptotic stability result (Xia et al., 2002) cannot be used to ascertain the convergence of the extended projection neural network. Moreover, unlike the existing exponential stability result on the projection neural network in equation 1.1, theorem 1 does not require the condition that ∇G(u) be bounded and uniformly positive definite. Furthermore, compared with the existing convergence results on the modified projection-type method (Solodov & Tseng, 1996), the convergence result here does not require the additional condition that G(u) be Lipschitz continuous. Since the nonlinearity of g(x) usually means that G is not Lipschitz continuous, the global convergence of the modified projection-type method cannot be guaranteed in such cases. The illustrative examples below also show that the modified projection-type method may diverge if the scaling parameters α and β are not small enough.

4 Simulation Examples

In order to demonstrate the effectiveness and efficiency of the proposed neural network, we implemented it in MATLAB to solve a nonlinear programming problem and two constrained variational inequality problems. The ordinary differential equation solver used is ode23s. We compared the performance of this implementation with the gradient method, the penalty function method, and the modified projection-type method, respectively.

4.1 Example 1. Consider the following convex programming problem:

    minimize  f(x) = (x₁ − 2)² + (x₂ − 1)²
    subject to  0.25x₁² + x₂² ≤ 1,    (4.1)
                x₁ − 2x₂ = −1.

This problem has a unique optimal solution x* = (0.5(√7 − 1), 0.25(√7 + 1)). It can be seen that ∇f(x) = [2(x₁ − 2), 2(x₂ − 1)]^T, g(x) = 0.25x₁² + x₂² − 1, ∇g(x) = [0.5x₁, 2x₂]^T, A = [1, −2], b = −1, and

    ∇²f(x) = [ 2 0 ]
             [ 0 2 ]
is positive definite. When applied to solve problem 4.1, the extended projection neural network in equation 2.4 becomes

    dx₁/dt = λ(−2(x₁ − 2) − 0.5x₁y₁ + z₁),
    dx₂/dt = λ(−2(x₂ − 1) − 2x₂y₁ − 2z₁),    (4.2)
    dy₁/dt = λ((y₁ + 0.25x₁² + x₂² − 1)^+ − y₁),
    dz₁/dt = λ(−x₁ + 2x₂ − 1).

All simulation results show that the neural network in equation 4.2 with any initial point is always convergent to x* within a finite time. For example, Figure 2 displays the trajectory behavior of x(t) = (x₁(t), x₂(t)) with 10 different initial points. For comparison, we also solve this example using the gradient method and the penalty function method. Let t = 10 and λ = 3, and let the penalty parameter be 1000. The computational results are listed in Table 1, where the accuracy is defined by the l₂-norm error ‖x − x*‖₂. From Table 1, we see that the proposed neural network gives a better solution than the gradient method and the penalty function method.
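As an independent cross-check of equation 4.2, the forward-Euler sketch below (Python rather than the paper's MATLAB/ode23s setup; λ, the step size, the horizon, and the initial point are ad hoc choices) settles at x* = (0.5(√7 − 1), 0.25(√7 + 1)) ≈ (0.8229, 0.9114):

```python
import numpy as np

lam, dt = 1.0, 0.002
x1, x2, y1, z1 = -2.0, -2.0, 0.0, 0.0      # ad hoc initial point

for _ in range(100_000):                   # integrate to t = 200
    dx1 = -2.0 * (x1 - 2.0) - 0.5 * x1 * y1 + z1
    dx2 = -2.0 * (x2 - 1.0) - 2.0 * x2 * y1 - 2.0 * z1
    dy1 = max(y1 + 0.25 * x1**2 + x2**2 - 1.0, 0.0) - y1
    dz1 = -x1 + 2.0 * x2 - 1.0
    x1, x2, y1, z1 = (x1 + lam * dt * dx1, x2 + lam * dt * dx2,
                      y1 + lam * dt * dy1, z1 + lam * dt * dz1)

x_star = (0.5 * (np.sqrt(7.0) - 1.0), 0.25 * (np.sqrt(7.0) + 1.0))
print((x1, x2), x_star)                    # the two pairs agree closely
```

At the settled state the equality constraint x₁ − 2x₂ = −1 holds and the multiplier y₁ is nonnegative, as lemma 2(iii) guarantees for the continuous-time trajectory.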
Figure 2: The transient behavior of (x₁(t), x₂(t)) based on equation 4.2 with 10 different initial points in example 1.
4.2 Example 2. Consider the linearly constrained variational inequality 1.3, where

    F(x) = [ 5x₁ + (x₁ + 2)² + x₂ + x₃ + 10,
             5x₁ + 3x₂² + 10x₂ + 3x₃ + 10,
             10(x₁ + 2)² + 8x₂² + 4x₃ + 3x₃² ]^T,

and Ω = {x ∈ R³ | x₁ + x₂ + x₃ ≥ 4, x ≥ 0}. Then X = {x ∈ R³ | x ≥ 0}, A = O, b = 0, g(x) = −x₁ − x₂ − x₃ + 4, and ∇g(x) = [−1, −1, −1]^T. This problem has a unique solution x* = [2.5, 1.5, 0]^T. When applied to solve this problem, the extended projection neural network in equation 2.2 becomes

    dx/dt = λ{(x − (F(x) + ∇g(x)y))^+ − x},    (4.3)
    dy/dt = λ{(y + g(x))^+ − y},

where x ∈ R³ and y ∈ R. All simulation results show that the neural network in equation 4.3 with any initial point is always globally convergent to x* within a finite time. For example, let λ = 40. Figure 3 displays the convergence behavior of the l₂-norm error ‖x(t) − x*‖₂ based on equation 4.3 with 20 random initial points. Next, we compare the proposed method with the modified projection-type method. Table 2 displays the results obtained by the proposed method, and Table 3 the results obtained by the modified projection-type method, where the accuracy is defined by the l₂-norm error. From Tables 2 and 3, we can see that the proposed method not only gives a better solution but also has a faster convergence rate than the modified projection-type method. Moreover, from Table 3, we see that the modified projection-type method may diverge if the scaling parameters α and β are not small enough.

4.3 Example 3. Consider the nonlinearly constrained variational inequality 1.3, where Ω = {x ∈ R¹⁰ | g(x) ≤ 0, x ∈ X}, X = {x ∈ R¹⁰ | x ≥ 0}, g(x) = [g₁(x), …, g₈(x)]^T,

    F(x) = [2x₁ + x₂ − 14, 2x₂ + x₁ − 16, 2(x₃ − 10), 8(x₄ − 5), 2(x₅ − 3),
            4(x₆ − 1), 10x₇, 14x₈, 4x₉, 2(x₁₀ − 7)]^T,

and

    g₁(x) = 3(x₁ − 2)² + 4(x₂ − 3)² + 2x₃² − 7x₄ − 120,
    g₂(x) = 5x₁² + (x₃ − 6)² + 8x₂ − 2x₄ − 40,
    g₃(x) = (x₁ − 8)²/2 + 2(x₂ − 4)² + 3x₅² − x₆ − 30,
    g₄(x) = x₁² + 2(x₂ − 2)² − 2x₁x₂ + 14x₅ − 6x₆,
    g₅(x) = 4x₁ + 5x₂ − 3x₇ + 9x₈ − 105,
    g₆(x) = 10x₁ − 8x₂ − 17x₇ + 2x₈,
    g₇(x) = 12(x₉ − 8)² − 3x₁ + 6x₂ − 7x₁₀,
    g₈(x) = −8x₁ + 2x₂ + 5x₉ − 2x₁₀ − 12.
Figure 3: Convergence behavior of the norm error ‖x(t) − x*‖₂ based on equation 4.3 with 20 random initial points in example 2.
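Equation 4.3 for example 2 can likewise be sketched with forward Euler in Python (λ, the step size, the horizon, and the initial point are ad hoc choices, not the paper's ode23s setup); the iterate settles at x* = [2.5, 1.5, 0]^T:

```python
import numpy as np

def F(x):
    x1, x2, x3 = x
    return np.array([5*x1 + (x1 + 2)**2 + x2 + x3 + 10,
                     5*x1 + 3*x2**2 + 10*x2 + 3*x3 + 10,
                     10*(x1 + 2)**2 + 8*x2**2 + 4*x3 + 3*x3**2])

g  = lambda x: -x[0] - x[1] - x[2] + 4.0
dg = np.array([-1.0, -1.0, -1.0])            # gradient of g

lam, dt = 1.0, 0.002
x = np.ones(3)                               # ad hoc initial point
y = 0.0
for _ in range(100_000):                     # integrate to t = 200
    dx = np.maximum(x - (F(x) + dg * y), 0.0) - x   # (.)^+ is P_X for x >= 0
    dy = max(y + g(x), 0.0) - y
    x = x + lam * dt * dx
    y = y + lam * dt * dy

print(x)   # approaches [2.5, 1.5, 0.0]
```

At the settled state the constraint x₁ + x₂ + x₃ ≥ 4 is active (g(x) ≈ 0) and the multiplier y is positive, consistent with the KKT characterization in lemma 1.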
This problem has a unique solution (Charalambous, 1977):

    x* = [2.172, 2.364, 8.774, 5.096, 0.991, 1.431, 1.321, 9.829, 8.280, 8.376]^T.

We use the proposed neural network in equation 2.2 to solve the above problem. All simulation results show that the trajectory of equation 2.2 with any initial point is always convergent to u* = (x*, y*). For example, let λ = 8. Figure 4 shows the transient behavior of x(t) based on equation 2.2 with five random initial points. Next, we compare the proposed method with the modified projection-type method. Table 4 displays the results obtained by the proposed method, and Table 5 the results obtained by the modified projection-type method. From Tables 4 and 5, we can see that the proposed method not only gives a better solution but also has a faster convergence rate than the modified projection-type method. Moreover, from Table 5, we see that the modified projection-type method may diverge if the scaling parameters α and β are not small enough.

5 Conclusions

We have presented an extended projection neural network for solving nonlinearly constrained variational inequality problems. In contrast to the ex-
Figure 4: Transient behavior of x(t) based on equation 2.2 with five random initial points in example 3.
Table 1: Results for the Gradient Method, the Penalty Function Method, and the Proposed Method in Example 1.

    Method            Gradient Method    Penalty Method     Proposed Method
    Initial point     (−2, −2)           (−2, −2)           (−2, −2, −2, −2)
    Iterative number  29,813             28,296             82
    CPU time (sec.)   23.94              23.67              0.11
    Solution          (2.0019, 1.0015)   (0.8252, 0.9120)   (0.8229, 0.9114)
    l₂-norm error     1.1825             0.0024             6.4 × 10⁻⁶
Table 2: Results for the Proposed Method on the Linearly Constrained Variational Inequality in Example 2.

    Initial Point   Parameters        Iterative Number   CPU Time (sec.)   Accuracy
    −(2, 2, 2, 2)   λ = 4.5, t = 15   522                2.33              0.0028
    (2, 2, 2, 2)    λ = 4.5, t = 15   506                2.22              0.0032
Table 3: Results for the Projection-Type Method on the Linearly Constrained Variational Inequality in Example 2.

    Initial Point   Parameters            Iterative Number   CPU Time (sec.)   Accuracy
    −(2, 2, 2, 2)   β = 0.5, α = 0.035    3723               7.02              0.004
    −(2, 2, 2, 2)   β = 0.1, α = 0.05     13,432             25.2              0.004
    −(2, 2, 2, 2)   β = 0.1, α = 0.06     +∞                 +∞                +∞
    −(2, 2, 2, 2)   β = 0.5, α = 0.04     +∞                 +∞                +∞
    (2, 2, 2, 2)    β = 1, α = 0.04       1649               3.07              0.004
    (2, 2, 2, 2)    β = 0.7, α = 0.05     1922               3.59              0.004
    (2, 2, 2, 2)    β = 0.1, α = 0.055    +∞                 +∞                +∞
    (2, 2, 2, 2)    β = 1, α = 0.05       +∞                 +∞                +∞
Table 4: Results for the Proposed Method on the Nonlinearly Constrained Variational Inequality in Example 3, where e = [1, …, 1]^T ∈ R¹⁸.

    Initial Point   Parameters     Iterative Number   CPU Time (sec.)   Accuracy
    e               λ = 4, t = 2   235                3.01              0.033
    −e              λ = 4, t = 2   319                4.05              0.034
isting projection neural network, the extended projection neural network can be directly used to solve monotone variational inequality problems with linear and nonlinear constraints. Compared with the modified projection-type numerical method, the proposed neural network has a lower complexity and is suitable for parallel implementation. More important, unlike the existing projection neural network and the modified projection-type method, the extended projection neural network is theoretically proven to be exponentially convergent to an exact solution of problem 1.3 under a strict monotonicity condition on the mapping F. As a result, the proposed neural network not only has an exponential convergence rate but also can be directly
Table 5: Results for the Projection-Type Method on the Nonlinearly Constrained Variational Inequality in Example 3, where e = [1, …, 1]^T ∈ R¹⁸.

    Initial Point   Parameters             Iterative Number   CPU Time (sec.)   Accuracy
    e               β = 0.1, α = 0.005     2194               14.45             0.1
    e               β = 0.1, α = 0.005     14,339             71.4              0.08
    e               β = 0.1, α = 0.05      +∞                 +∞                +∞
    e               β = 0.9, α = 0.0005    28,678             142.7             0.098
    −e              β = 0.9, α = 0.0005    13,410             200               0.084
    −e              β = 0.9, α = 0.0005    3559               17.71             0.1
    −e              β = 0.9, α = 0.006     +∞                 +∞                +∞
    −e              β = 0.1, α = 0.05      +∞                 +∞                +∞
used to solve constrained monotone variational inequality problems and related optimization problems, including nonlinear convex programming problems. Simulation results further show that the extended projection neural network can obtain a better solution and has a faster convergence rate than the gradient method, the penalty function method, and the modified projection-type method. Further investigations will be aimed at engineering applications of the extended projection neural network to control and signal processing.

References

Abe, S., Kawakami, J., & Hirasawa, K. (1992). Solving inequality constrained combinatorial optimization problems by the Hopfield neural networks. Neural Networks, 5, 663–670.
Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming—Theory and algorithms (2nd ed.). New York: Wiley.
Bertsekas, D. P. (1989). Parallel and distributed computation: Numerical methods. Englewood Cliffs, NJ: Prentice Hall.
Bouzerdoum, A., & Pattison, T. R. (1993). Neural network for quadratic optimization with bound constraints. IEEE Transactions on Neural Networks, 4, 293–304.
Charalambous, C. (1977). Nonlinear least p-th optimization and nonlinear programming. Mathematical Programming, 12, 195–225.
Cichocki, A., & Unbehauen, R. (1993). Neural networks for optimization and signal processing. New York: Wiley.
Dupuis, P., & Nagurney, A. (1993). Dynamical systems and variational inequalities. Annals of Operations Research, 44, 19–42.
Ferris, M. C., & Pang, J. S. (1997). Engineering and economic applications of complementarity problems. SIAM Review, 39, 669–713.
Golden, R. M. (1996). Mathematical methods for neural network analysis and design. Cambridge, MA: MIT Press.
Harker, P. T., & Pang, J. S. (1990). Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms, and applications. Mathematical Programming, 48, 161–220.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554–2558.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Kennedy, M. P., & Chua, L. O. (1988). Neural networks for nonlinear programming. IEEE Transactions on Circuits and Systems, 35, 554–562.
Kinderlehrer, D., & Stampacchia, G. (1980). An introduction to variational inequalities and their applications. New York: Academic Press.
Lillo, W. E., Loh, M. H., & Hui, S. (1993). On solving constrained optimization problems with neural networks: A penalty method approach. IEEE Transactions on Neural Networks, 4, 931–940.
Marcotte, P. (1991). Application of Khobotov's algorithm to variational inequalities and network equilibrium problems. Inform. Systems Oper. Res., 29, 258–270.
Ohlsson, M., Peterson, C., & Soderberg, B. (1993). Neural networks for optimization problems with inequality constraints: The knapsack problem. Neural Computation, 5, 331–339.
Oja, E. (1992). Principal components, minor components and linear neural networks. Neural Networks, 5, 927–935.
Peterson, C., & Soderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1, 3–22.
Pyne, I. B. (1956). Linear programming on an electronic analogue computer. Trans. Amer. Inst. Elec. Eng., 75, 139–143.
Rodríguez-Vázquez, A., Domínguez-Castro, R., Rueda, A., Huertas, J. L., & Sánchez-Sinencio, E. (1990). Nonlinear switched-capacitor "neural networks" for optimization problems. IEEE Transactions on Circuits and Systems, 37, 384–397.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–366). Cambridge, MA: MIT Press.
Sciavicco, L., & Siciliano, B. (2000). Modelling and control of robot manipulators. New York: Springer-Verlag.
Sevrani, F., & Abe, K. (2000). On the synthesis of brain-state-in-a-box neural models with application to associative memory. Neural Computation, 12, 451–472.
Slotine, J. J., & Li, W. (1991). Applied nonlinear control. Englewood Cliffs, NJ: Prentice Hall.
Solodov, M. V., & Tseng, P. (1996). Modified projection-type methods for monotone variational inequalities. SIAM Journal on Control and Optimization, 34, 1814–1830.
Tank, D. W., & Hopfield, J. J. (1986). Simple "neural" optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Transactions on Circuits and Systems, 33, 533–541.
Xia, Y. S. (1996a). A new neural network for solving linear programming problems and its applications. IEEE Transactions on Neural Networks, 7, 525–529.
Xia, Y. S. (1996b). A new neural network for solving linear and quadratic programming problems. IEEE Transactions on Neural Networks, 7, 1544–1547.
Xia, Y. S., Leung, H., & Wang, J. (2002). A projection neural network and its application to constrained optimization problems. IEEE Transactions on Circuits and Systems—Part I, 49, 447–458.
Xia, Y. S., & Wang, J. (2001). A dual neural network for kinematic control of redundant robot manipulators. IEEE Transactions on Systems, Man and Cybernetics—Part B, 31, 147–154.
Zhang, S., Zhu, X., & Zou, L.-H. (1992). Second-order neural networks for constrained optimization. IEEE Transactions on Neural Networks, 3, 1021–1024.

Received April 28, 2003; accepted October 6, 2003.
LETTER
Communicated by Laurence Abbott
Spike-Timing-Dependent Plasticity: The Relationship to Rate-Based Learning for Models with Weight Dynamics Determined by a Stable Fixed Point

Anthony N. Burkitt
[email protected]
Hamish Meffin
[email protected]
David B. Grayden
[email protected]
The Bionic Ear Institute, East Melbourne, Victoria 3002, Australia
Experimental evidence indicates that synaptic modification depends on the timing relationship between the presynaptic inputs and the output spikes that they generate. In this letter, results are presented for models of spike-timing-dependent plasticity (STDP) whose weight dynamics is determined by a stable fixed point. Four classes of STDP are identified on the basis of the time extent of their input-output interactions. The effect on the potentiation of synapses with different rates of input is investigated to elucidate the relationship of STDP with classical studies of long-term potentiation and depression and rate-based Hebbian learning. The selective potentiation of higher-rate synaptic inputs is found only for models where the time extent of the input-output interactions is input restricted (i.e., restricted to time domains delimited by adjacent synaptic inputs) and that have a time-asymmetric learning window with a longer time constant for depression than for potentiation. The analysis provides an account of learning dynamics determined by an input-selective stable fixed point. The effect of suppressive interspike interactions on STDP is also analyzed and shown to modify the synaptic dynamics.

Neural Computation 16, 885–940 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

Experimental evidence indicates that synaptic modification can depend on the timing of presynaptic inputs (excitatory postsynaptic potentials, EPSPs) and postsynaptic spikes (action potentials, APs) (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998). The relationship of such spike-timing-dependent plasticity (STDP) to classical studies of long-term potentiation (LTP) and long-term depression (LTD), as well as previously studied models of Hebbian plasticity based on spiking rates (Bienenstock, Cooper, & Munro, 1982), is not well understood. Issues concerning competition between synapses and the stability of the learning algorithms have been a particular focus of attention (Miller & MacKay, 1994; Miller, 1996; Song, Miller, & Abbott, 2000; Abbott & Nelson, 2000; Rao & Sejnowski, 2001). The emergence of inhomogeneous distributions of synaptic strengths typically requires a competitive mechanism that causes some synapses to increase their strength and others to decrease theirs, while at the same time the synaptic modification should lead to a stable distribution of synaptic conductances that prevents their unconstrained increase. The majority of studies of STDP have assumed that changes in synaptic strengths are additive, that is, they do not scale with synaptic strength (Gerstner & van Hemmen, 1996; Abbott & Blum, 1996; Kempter, Gerstner, & van Hemmen, 1999, 2001; Roberts, 1999; Song et al., 2000; van Hemmen, 2001; Câteau & Fukai, 2003). These models typically exhibit strong competition between synapses and have a positive feedback instability, so that an upper bound on the synaptic strength must be imposed as a hard constraint to enforce stability. The learning dynamics (i.e., the evolution of the synaptic weights) of these additive STDP models is determined by an unstable fixed point (Kempter et al., 1999), which leads to the conductances displaying a bimodal distribution that is concentrated around zero and the upper bound. Consequently, the strong competition in these models leads to distributions of synaptic strengths that do not necessarily reflect patterns of activity in the input. Moreover, such a bimodal distribution does not reflect the experimentally observed unimodal distribution of synaptic strengths (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998).
However, these STDP models based on an unstable fixed point of the learning dynamics have been successfully combined with delay selection during development to provide an account of both the acuity of sound localization (Gerstner & van Hemmen, 1996) and the development of temporal feature maps in the avian auditory system (Leibold, Kempter, & van Hemmen, 2002; Burkitt & van Hemmen, in press). A recent analysis involving detailed comparison between STDP and experimental results supports the hypothesis that potentiation and depression have different weight dependence (Bi & Poo, 1998), and in particular that potentiation is approximately additive and depression is predominantly multiplicative (van Rossum, Bi, & Turrigiano, 2000), that is, the potentiating increments are independent of the value of the synaptic strength, whereas the depressing decrements are proportional to the synaptic strength. Such models of temporally asymmetric Hebbian plasticity have been studied recently by a number of authors (Kistler & van Hemmen, 2000; van Rossum et al., 2000; Rubin, Lee, & Sompolinsky, 2001; Gütig, Aharanov, Rotter, & Sompolinsky, 2003). A modeling study of STDP with multiplicative depression by van Rossum et al. (2000) in the presence of uncorrelated Poisson trains of spikes produced an equilibrium distribution of synaptic weights that compared favorably with that observed in cultured cortical pyramidal neurons (Turrigiano et al., 1998). Moreover, they found that spike-timing correlations
among a subset of the synaptic inputs increased the corresponding synaptic weights while leaving the weights of the uncorrelated inputs unchanged. This appears to indicate that there is no inherent competition in the STDP of such models (van Rossum et al., 2000). By contrast, in models where both the potentiation and depression are additive, competition tends to drive the weights to either zero or their maximally allowed value (Kempter et al., 1999; Song et al., 2000). The lack of competition in the model of STDP with additive potentiation and multiplicative depression may be overcome by incorporating activity-dependent scaling of the synaptic weights, in which the synaptic strengths are observed to change in a manner that depends on their long-term average spiking rate (Turrigiano et al., 1998). This effect occurs on a much slower timescale than STDP but is nevertheless capable of introducing competition between synapses. One somewhat surprising outcome of the study by van Rossum et al. (2000) was that their model does not recover rate-based Hebbian learning, since an increase in the rate of synaptic inputs does not have any effect on the synaptic strength in their model. Consequently, it would appear that this version of STDP is unable to potentiate weights that have a higher rate of synaptic input, but rather only synaptic inputs that have spike-timing correlations with other inputs will be potentiated. This result is at odds with the classical studies of LTP and LTD, which indicate that the rate of synaptic inputs plays a crucial role in synaptic modification (Lømo, 1971; Bliss & Lømo, 1973). The conventional methodology adopted for the study of LTP and LTD has been based on spike rate and has ignored the timing of individual spikes.
The concept of selective synaptic modification based on correlations between mean spiking rates forms the foundation for much of the extensive work on rate-based Hebbian learning in neural systems dating back many years (Sejnowski, 1977). Consequently there has been considerable recent interest in establishing the relationship between rate-based and spike-based learning; for recent reviews of the relationship between LTP, LTD, synaptic scaling, and STDP, see Sejnowski (1999), Abbott and Nelson (2000), and Bi and Poo (2001). However, a number of fundamental questions remain about the relationship between these Hebbian rate-based models of neural plasticity and STDP (Roberts & Bell, 2002). Most of the STDP models studied have resulted in a dynamics of the weights that is controlled by an unstable fixed point, which leads to the stability problems and bimodal distribution of weights discussed earlier. Consequently, it has not been possible to relate STDP to rate-based models in which the learning is determined by a stable fixed point, since existing models of STDP with such stable learning dynamics show an absence of selective potentiation of synapses with higher spiking rate inputs (van Rossum et al., 2000). One of the aims of this study is to show that STDP (together with synaptic scaling) could be capable of producing stable fixed-point dynamics of learning, as well as the selective potentiation of higher input-rate synapses that Hebbian rate-based models incorporate.
A recent focus of attention in STDP has been the time extent of the EPSP-AP interactions. At very low output spiking rates, an input will typically fall only within the STDP time window of a single output spike, while at higher output spiking rates, an incoming EPSP may fall within the STDP time window of a number of output APs. This leads to questions regarding how the plasticity signals from these multiple EPSP-AP interactions are to be integrated. These issues have been addressed in recent studies on thick tufted layer 5 pyramidal neurons in rat visual cortex, in which experimental measurements of STDP were compared with a number of models of EPSP-AP interactions (Sjöström, Turrigiano, & Nelson, 2001). They particularly analyzed three models in which an input arrives in the STDP time window of more than one output spike and the timing extent of EPSP-AP interactions was limited in various ways. The results indicate a complex dependence of LTP and LTD on the output spiking rate of the neurons and provide support for the hypothesis that the extent of the EPSP-AP interaction is restricted when the input EPSPs fall within more than one STDP time window of an output AP. In this study, we identify and systematically analyze models representing four classes of EPSP-AP interaction in order to elucidate how the different time extent of EPSP-AP interactions gives rise to plasticity rules that depend on input and output rates in a way that is consistent with classical LTP and LTD results. We examine the role, upon the potentiation of synapses with different rates of synaptic input, of (1) the time extent of input-output (EPSP-AP) interactions and (2) time asymmetry of the learning window. The model of STDP is presented in section 2.1, in which the STDP time window and the time constants associated with potentiation (τ_P) and depression (τ_D) are defined. Experimental results indicate that these time constants are different (Bi & Poo, 1998).
A framework for describing the STDP evolution of the synaptic weights, based on the nonlinear Langevin equation, is presented in section 2.1. This approach naturally relates the description of the evolution of the mean weights (Kempter et al., 1999) and the Fokker-Planck approach (van Kampen, 1992; van Rossum et al., 2000). In order to illustrate how a stable fixed-point model of STDP can form part of a larger neural learning paradigm of working memory, the notion of background and foreground synapses corresponding to non-input-selective (spontaneously active) and input-selective synapses is introduced. In this study we use a leaky integrate-and-fire neuronal model with synaptic conductances (Burkitt, 2001), described in section 2.2. Four different classes of EPSP-AP interaction, which are based on the time extent of the interaction, are presented in section 3. The results for each class of model are presented in sections 3.1, 3.2, 3.3, and 3.4. These results, discussed in section 4, indicate that the selective potentiation of synapses with higher-rate input requires both that the EPSP-AP interactions are limited in a way that depends on the input rate and that the STDP time window is time-asymmetric. The results also show how spike-timing-based models of neural plasticity can be integrated with established rate-based models (Sejnowski, 1977; Bienenstock et al., 1982). In section 4.1 the effect of suppressive interspike interactions on STDP (Froemke & Dan, 2002) is analyzed, and in section 4.2 the relationship of STDP with activity-dependent synaptic scaling is discussed. Section 4.3 contains a discussion of the width of the weight distribution, and section 4.4 gives a discussion of a number of the assumptions and approximations made in this study. The key mathematical details of the analysis are presented in the appendix.

2 The Model

The two components of the model are the weight changes caused by the spike-timing-dependent plasticity and the description of the output spikes in terms of the synaptic inputs. Both of these components, outlined in the two subsections below, are described in terms of stochastic processes that arise because of the random nature of the times of the synaptic inputs (described in terms of a Poisson process).

2.1 Spike-Timing-Dependent Plasticity. We consider a neuron with N_E excitatory synaptic inputs. The STDP is implemented by making the following changes to the (excitatory) weights w_i when a synaptic input on the ith fiber arrives at time t_in and an output spike is generated at time t_out:

\[ w_i \to w_i + \Delta w_i, \qquad \Delta w_i = \begin{cases} \alpha_D(w_i)\, F(t_{\mathrm{out}} - t_{\mathrm{in}}) & \text{for } t_{\mathrm{out}} < t_{\mathrm{in}}, \\ \alpha_P(w_i)\, F(t_{\mathrm{out}} - t_{\mathrm{in}}) & \text{for } t_{\mathrm{out}} \ge t_{\mathrm{in}}, \end{cases} \tag{2.1} \]

where α_P(w_i) and α_D(w_i) are weight-dependent potentiation and depression functions, respectively. The STDP time window function F(Δt) is given by

\[ F(\Delta t) = \begin{cases} -e^{\Delta t/\tau_D} & \text{for } \Delta t < 0, \\ e^{-\Delta t/\tau_P} & \text{for } \Delta t \ge 0, \end{cases} \tag{2.2} \]

where τ_P and τ_D are the time constants of the potentiation and depression, respectively. Figure 1 illustrates the changes in weights given by this form of the STDP time window. Bi and Poo (1998) measured values of τ_P = 17 ± 9 ms and τ_D = 34 ± 13 ms in cultured hippocampal neurons.
There are a number of STDP models that have been studied, and Table 1 defines the weight-dependent potentiation and depression functions α_P(w) and α_D(w) for the models discussed here. "Van Rossum et al. plasticity" has additive potentiation and multiplicative depression (van Rossum et al., 2000). For "Gütig et al. plasticity" (Gütig et al., 2003), the case μ = 0 corresponds to purely "additive plasticity" and the case μ = 1 corresponds to "multiplicative plasticity," but with the addition of an explicit upper bound of w_max on the weight, which for all nonzero μ and sufficiently small learning rate prevents the weight from leaving the range [0, w_max].
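The pairing rule of equations 2.1 and 2.2 can be sketched in a few lines. The sketch below uses the van Rossum et al. row of Table 1 (additive potentiation, multiplicative depression); the learning-rate constants c_P and c_D are illustrative values, not parameters taken from the paper.

```python
import math

TAU_P, TAU_D = 17.0, 34.0    # ms, as measured by Bi and Poo (1998)

def F(dt):
    """STDP time window of equation 2.2; dt = t_out - t_in, in ms."""
    return -math.exp(dt / TAU_D) if dt < 0 else math.exp(-dt / TAU_P)

def delta_w(w, t_out, t_in, c_P=0.01, c_D=0.01):
    """Weight change of equation 2.1 for a single EPSP-AP pairing.

    Table 1, van Rossum et al. plasticity: alpha_P(w) = c_P (additive)
    and alpha_D(w) = c_D * w (multiplicative). c_P, c_D are illustrative.
    """
    dt = t_out - t_in
    if dt < 0:                   # output precedes input: depression
        return c_D * w * F(dt)   # F(dt) < 0 here, so delta_w < 0
    return c_P * F(dt)           # potentiation, independent of w
```

An input arriving 10 ms before an output potentiates by c_P e^(-10/17), while the reverse ordering depresses in proportion to the current weight.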
Figure 1: The STDP time window F(Δt), equation 2.2: The change in synaptic strength is a function of the time difference between inputs and outputs (Δt = t_out − t_in). Synaptic inputs (EPSPs) that arrive before an output (AP) is generated give a positive contribution (potentiation) to the change in weight, whereas inputs that arrive after an output give a negative contribution (depression). The time constants here are taken to be τ_P = 17 ms and τ_D = 34 ms. Units on the vertical axis are dependent on the values of α_P(w_i) and α_D(w_i).
Table 1: Models of STDP, Giving the Form of Their Weight-Dependent Potentiation and Depression Functions.

Model                          α_D(w)       α_P(w)
Additive plasticity            c_D          c_P
Multiplicative plasticity      c_D w        c_P w
van Rossum et al. plasticity   c_D w        c_P
Gütig et al. plasticity        c_D w^μ      c_P (w_max − w)^μ

Note: The coefficients c_P and c_D are constants and 0 ≤ μ ≤ 1.

STDP describes the extent to which a synapse is potentiated or depressed solely in terms of the relative times of the individual input EPSPs and the output APs. In this study, a number of different models are examined in which the time extent of allowed EPSP-AP interactions is restricted in various ways. We identify four classes of models: (I) no restriction on EPSP-AP interactions, (II) input-restricted models in which the extent of the EPSP-AP interactions is limited to an interspike interval (ISI) of the input EPSPs, (III) output-restricted models in which the extent of the EPSP-AP interactions is limited to an ISI of the output APs, and (IV) input- and output-restricted
models in which the extent of the EPSP-AP interactions is limited in a more complex way that depends on both the input and output ISIs. The details of these models will be given in section 3. In this analysis, we assume that the input EPSP times have a Poisson distribution. The evolution of the individual weights under STDP is described by the nonlinear Langevin equation (van Kampen, 1992; Risken, 1996),

\[ \frac{dw_i(t)}{dt} = A(w_i, \{w_j\}) + C(w_i, \{w_j\})\, \xi(t), \tag{2.3} \]
where ξ(t) is Gaussian white noise with zero mean and unit variance. The functions A(w_i, {w_j}) and C(w_i, {w_j}) are functions of not only the individual weight w_i but also all of the other weights {w_j} through the implicit dependence of the output spiking rate on all weights (for notational convenience, we will henceforth suppress the dependence of these functions on the {w_j}). The underlying dynamics is described by the Poisson process governing the incoming EPSPs and the spike-generating dynamics of the leaky integrate-and-fire neuron with synaptic conductances, both of which combine to produce the distribution of relative times between the individual EPSPs and the APs that is necessary for the calculation of the weight changes, equation 2.1. Due to the central limit theorem, averaging over the EPSP-AP interaction time distribution for many independent small-amplitude weight increments gives rise to the effective white noise in the above Langevin equation and enables the coefficients A(w) and C(w) to be calculated. The validity criteria for this approximation are analyzed in more detail below in relation to the equivalent Fokker-Planck formalism (Itô interpretation). The coefficients A(w_i) and C(w_i) may vary for different weights w_i if their associated synaptic inputs have a different temporal distribution. The mean weight w̄_i evolves according to

\[ \frac{d\bar{w}_i(t)}{dt} = A(\bar{w}_i). \tag{2.4} \]
This equation gives the effective rate-based description of the change in the mean synaptic weight for any particular model of STDP. It gives the relationship between STDP and rate-based models for the dynamics of the synaptic weights, and hence forms the basis for the analysis of weight dynamics (Kempter et al., 1999). The "drift" function A(w_i) is calculated as

\[ A(w_i) = \int dr\, r\, Q(w_i; r), \tag{2.5} \]
where Q(w_i; r) is the probability of the weight w_i making an independent jump of size r. For STDP models, the jumps are given by equation 2.1, and A(w_i) is the sum of the potentiation and depression components, A(w_i) = A_P(w_i) − A_D(w_i). The potentiating contributions A_P(w_i) arise when the time difference between an output AP and an input EPSP is positive, and the depressing contributions A_D(w_i) arise when the time difference between an output AP and an input EPSP is negative. Consequently, A(w_i) can be expressed as an integral over the changes in the synaptic strengths weighted by the probability distribution P(Δt) of the input-output time differences:

\[ A_D(w_i) = \alpha_D(w_i)\, a_D(\{w_j\}) = \alpha_D(w_i) \int_{-\infty}^{0} d\Delta t\, F(\Delta t)\, P(\Delta t), \]
\[ A_P(w_i) = \alpha_P(w_i)\, a_P(\{w_j\}) = \alpha_P(w_i) \int_{0}^{\infty} d\Delta t\, F(\Delta t)\, P(\Delta t), \tag{2.6} \]
where a_D({w_j}) and a_P({w_j}) are the time integrals on the right side of the equations and typically depend on the output spiking distribution, which gives them an implicit dependence on all the weights {w_j} (henceforth suppressed for notational convenience). The difficulty in evaluating A(w_i) lies principally in the explicit determination of the distribution P(Δt) of input and output times, since the output spike times depend on the particular neural model adopted. Details of the neural model used here are presented in section 2.2, and the calculations of A(w_i) are presented in section 3 and appendix B. The stationary mean of the weight distribution, w̄_i, is given by the fixed-point equation A(w̄_i) = A_P(w̄_i) − A_D(w̄_i) = 0, which may be written as

\[ \frac{\alpha_D(\bar{w}_i)}{\alpha_P(\bar{w}_i)} = \frac{a_P}{a_D} = \frac{\int_0^{\infty} d\Delta t\, F(\Delta t)\, P(\Delta t)}{\int_{-\infty}^{0} d\Delta t\, F(\Delta t)\, P(\Delta t)}. \tag{2.7} \]
In this analysis, we consider only models in which the learning dynamics of each weight is governed by a stable fixed point (i.e., small fluctuations in the mean weight are suppressed). The stability condition for this fixed point in the case where all the weights are homogeneous (i.e., there is no selective stimulation of particular synapses) is

\[ \frac{\alpha_D'(\bar{w})}{\alpha_D(\bar{w})} + \frac{a_D'}{a_D} > \frac{\alpha_P'(\bar{w})}{\alpha_P(\bar{w})} + \frac{a_P'}{a_P}, \tag{2.8} \]
where w̄ is the fixed point of the mean of the weight distribution and the prime denotes the derivative with respect to the weight. In the more general case, the stability analysis involves the calculation of the Jacobian of the drift function, ∂_j A(w_i). Stability requires negative values for the real part of all of the eigenvalues of this Jacobian (evaluated at the fixed point). One such model of STDP with a stable fixed point is van Rossum et al. plasticity (see Table 1), for which the drift and diffusion functions have an explicit linear and quadratic dependence, respectively, on the weight: A(w_i) = A_0 − A_1 w_i and B(w_i) = B_0 − B_1 w_i + B_2 w_i² (note the sign of the linear terms: with these definitions all A_n, B_n > 0 for this model), where the coefficients of the drift function are given by

\[ A_0 = c_P\, a_P = c_P \int_{0}^{\infty} d\Delta t\, e^{-\Delta t/\tau_P}\, P(\Delta t), \qquad A_1 = c_D\, a_D = c_D \int_{-\infty}^{0} d\Delta t\, e^{\Delta t/\tau_D}\, P(\Delta t), \tag{2.9} \]
with c_D, c_P > 0. The fixed-point solution of equation 2.7 is

\[ \bar{w}_i = \frac{A_0}{A_1} = \frac{c_P\, a_P}{c_D\, a_D}. \tag{2.10} \]
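As a concrete check on the fixed point, the sketch below evaluates a_P and a_D numerically for a hypothetical input-output time-difference density P(Δt) (a two-sided exponential of width TAU_C, chosen purely for illustration) and compares the van Rossum et al. fixed point, obtained from the condition α_D(w̄) a_D = α_P(w̄) a_P of equation 2.7, with its closed form for this P(Δt).

```python
import math

TAU_P, TAU_D = 17.0, 34.0    # STDP time constants (ms), Bi and Poo (1998)
TAU_C = 20.0                 # width of the hypothetical P(dt); illustration only
C_P, C_D = 1.0, 1.0          # van Rossum: alpha_P(w) = C_P, alpha_D(w) = C_D * w

def P(dt):
    """Hypothetical two-sided exponential density of dt = t_out - t_in."""
    return math.exp(-abs(dt) / TAU_C) / (2.0 * TAU_C)

# a_P = int_0^inf e^(-dt/tau_P) P(dt) d(dt),
# a_D = int_{-inf}^0 e^(dt/tau_D) P(dt) d(dt)  (midpoint rule out to 400 ms)
h, n = 0.01, 40000
a_P = sum(math.exp(-(i + 0.5) * h / TAU_P) * P((i + 0.5) * h) * h
          for i in range(n))
a_D = sum(math.exp(-(i + 0.5) * h / TAU_D) * P(-(i + 0.5) * h) * h
          for i in range(n))

# Fixed point: C_D * w_bar * a_D = C_P * a_P
w_bar = (C_P * a_P) / (C_D * a_D)
```

For this P(Δt) the integrals reduce to a_X = τ_X / (2(τ_X + TAU_C)), so the numerical and analytic values can be compared directly.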
Although we will be mainly interested in the mean weights in this study, it is also of interest to know the distribution of the weights. The probability density of each weight, P(w_i, t), is described by the Fokker-Planck equation (van Kampen, 1992; Risken, 1996),

\[ \frac{\partial P(w_i, t)}{\partial t} = -\frac{\partial}{\partial w_i}\left[A(w_i)\, P(w_i, t)\right] + \frac{1}{2}\frac{\partial^2}{\partial w_i^2}\left[B(w_i)\, P(w_i, t)\right]. \tag{2.11} \]
The diffusion term, B(w_i), is calculated analogously to the drift term, equation 2.5:

\[ B(w_i) = \int dr\, r^2\, Q(w_i; r). \tag{2.12} \]

The complete weight distribution that STDP generates is given by the equilibrium solution of the Fokker-Planck equation (van Rossum et al., 2000),

\[ P(w) = \frac{N}{B(w)} \exp\left[\int^{w} dw'\, \frac{2A(w')}{B(w')}\right], \tag{2.13} \]
where N is a normalization factor. An estimate of the variance of the weight distribution is given by expanding the exponent in the above expression to second order about the fixed point w̄_i defined in equation 2.7:

\[ \sigma_i^2 \approx -\frac{B(\bar{w}_i)}{2A'(\bar{w}_i)}. \tag{2.14} \]
In order for the Fokker-Planck formalism to provide a valid description of the weight dynamics, two conditions must be fulfilled (van Kampen, 1992): (1) the drift function must be a slowly varying function of the weight (i.e., A'(w)/λ ≪ 1, where λ is the rate of independent weight steps r in equation 2.5), and (2) the mean square weight step ⟨r²⟩ must be much less than the variance of the weight distribution (i.e., ⟨r²⟩ ≪ σ_i²). Using equation 2.14 and ⟨r²⟩ ≈ B(w̄_i)/λ, we find that these two conditions are essentially equivalent, and we show in section 3 that they are fulfilled for the models studied here. For van Rossum et al. plasticity (Table 1), with the above linear and quadratic forms of A(w) and B(w), respectively, the equilibrium weight distribution is given by

\[ P(w) = \frac{N \exp\left\{ \dfrac{2A_0 - A_1 B_1/B_2}{\sqrt{B_0 B_2 - B_1^2/4}}\, \arctan\left[ \dfrac{B_2\left(w - \frac{B_1}{2B_2}\right)}{\sqrt{B_0 B_2 - B_1^2/4}} \right] \right\}}{\left(B_0 - B_1 w + B_2 w^2\right)^{1 + A_1/B_2}}. \tag{2.15} \]
In this case, it is possible to give a more accurate analytical expression than equation 2.14 for the variance of the weight distribution, valid when A(w) is linear and B(w) is quadratic, that is derived from a self-consistency relation (Burkitt, 2001; see appendix A for details),

\[ \sigma_i^2 = \frac{B_0 - B_1 A_0/A_1 + B_2 A_0^2/A_1^2}{2A_1 - B_2}. \tag{2.16} \]
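Equations 2.13 and 2.16 can be cross-checked numerically. The sketch below uses hypothetical coefficients (chosen only so that all coefficients are positive and B(w) > 0 on the grid; they are not fitted to any experiment), builds the equilibrium distribution of equation 2.13 by quadrature, and compares its mean and variance with A_0/A_1 and equation 2.16.

```python
import math

# Hypothetical, illustrative coefficients: B(w) > 0 everywhere on [0, 1]
A0, A1 = 1.0, 2.0                 # drift A(w) = A0 - A1*w (fixed point 0.5)
B0, B1, B2 = 0.01, 0.01, 0.02     # diffusion B(w) = B0 - B1*w + B2*w**2

def A(w): return A0 - A1 * w
def B(w): return B0 - B1 * w + B2 * w * w

# Equilibrium solution, equation 2.13: P(w) = (N / B(w)) exp(int^w 2A/B dw')
n = 4000
ws = [i / n for i in range(n + 1)]
logp, cum = [], 0.0
for i, w in enumerate(ws):
    if i > 0:   # trapezoidal accumulation of the exponent integral
        cum += 0.5 * (2 * A(ws[i - 1]) / B(ws[i - 1]) + 2 * A(w) / B(w)) / n
    logp.append(cum - math.log(B(w)))
m = max(logp)                      # subtract the max before exponentiating
p = [math.exp(lp - m) for lp in logp]
Z = sum(p) / n
p = [v / Z for v in p]             # normalized density on the grid
mean = sum(w * v for w, v in zip(ws, p)) / n
var = sum((w - mean) ** 2 * v for w, v in zip(ws, p)) / n

# Self-consistency variance, equation 2.16
var_sc = (B0 - B1 * A0 / A1 + B2 * (A0 / A1) ** 2) / (2 * A1 - B2)
```

With these numbers the quadrature mean sits at the fixed point A_0/A_1 and the two variances agree closely.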
Alternatively, the variance of the distribution can be evaluated numerically directly from the Fokker-Planck distribution, equation 2.13. Some of the issues associated with the width of the weight distribution are discussed in section 4.3. One of the central questions regarding any synaptic modification rule is what its role is in a wider learning paradigm. In order to illustrate the relationship between Hebbian rate-based models with stable learning dynamics that is input selective (i.e., the fixed point is dependent on the synaptic inputs) and STDP with a similar stable fixed-point learning dynamics, a simple paradigm is considered. We analyze the weights of a single neuron in which the majority of the synapses receive only a background level of synaptic inputs (corresponding to spontaneous activity), and a small subset of the synapses receive a higher rate of inputs (corresponding to inputs from "activated" neurons). These inputs and synapses are referred to as background and foreground, respectively, and consequently the higher-rate inputs are selective to the foreground synapses. For simplicity, we assume that all the background inputs and synapses are statistically identical, and likewise for the foreground inputs and synapses. The equivalence of the foreground inputs represents a considerable simplification that removes the spatial component of activation, but it allows the central features of the learning dynamics of the STDP models analyzed here to be illustrated. Hence the incoming EPSPs are divided into two sets: a larger set of size N_B with input rates that are given by the background (spontaneous) spiking rate, taken here to be λ_B = 5 Hz, and a smaller set of size N_F = N_E − N_B with an elevated input rate λ_F. There are correspondingly two sets of weights on the neuron, described as background and foreground (w_B and w_F, respectively), that also correspond to unstimulated and stimulated synapses, respectively, in studies of classical LTP and LTD. Consequently the equations obeyed by the means of the background and foreground weights, w̄_B and w̄_F, respectively, are given by equation 2.7 (with the subscript i replaced by B, F). While most of the analysis will be concerned with the mean values of these distributions, we will also discuss the full distribution in later sections. It is straightforward to show that the stability condition of the fixed point of this system of two weights corresponds to the requirement that the trace of the Jacobian ∂_X A(w_Y) (where X, Y = B, F), evaluated at the fixed point, be negative and the determinant be positive. The relationship of this simple learning paradigm with Hebbian rate-based models of learning is discussed in section 4. The results apply more generally to any spatial distribution of rates across inputs. In this analysis, we will be concerned with the effect of a large number of inputs whose spike times are uncorrelated. Consequently we neglect the spike-triggering effect, namely, the increase in the probability of a spike being generated due to a single synaptic input (this is equivalent to assuming that each individual synaptic input is small compared to the sum of all the synaptic inputs). The spike-triggering effect scales with the presynaptic firing rate (Kempter et al., 1999) and may be significant when there are correlations between the presynaptic inputs (van Rossum et al., 2000; Svirskis & Rinzel, 2000; Stroeve & Gielen, 2001). In particular, in the case where the weight dynamics is not determined by a stable fixed point, the spike-triggering effect could be greatly amplified by the instability (Gerstner, 2001; Kempter et al., 2001). For neurons with a large number of temporally uncorrelated small-amplitude inputs, this term is very small, and it will be neglected in this study.
In addition to STDP, cortical neurons also exhibit activity-dependent scaling of the synaptic weights (i.e., dependent on the long time average output spiking rate of the postsynaptic neurons), which is experimentally observed to occur on a much slower timescale than STDP (Turrigiano et al., 1998). This activity-dependent synaptic scaling (ADSS) has the effect of introducing a competitive mechanism between the weights (van Rossum et al., 2000). The slower timescale for ADSS means that we will not be directly concerned with modeling its effect in this study, although it nevertheless plays an important role in understanding the overall impact of STDP on neural functioning, as discussed in section 4.2. The parameters associated with ADSS can be treated as approximately constant over the timescale relevant to STDP, and the effect of ADSS will not be considered further here.

2.2 Leaky Integrate-and-Fire Neuronal Model with Synaptic Conductances. A central ingredient of STDP calculations is the determination of the output spike times, for which an explicit neural model is necessary. In this study, a one-compartment leaky integrate-and-fire neuron with synaptic conductances is used. The membrane potential V(t) in this model is the integrated activity of its excitatory and inhibitory synaptic inputs, and it decays in time with a characteristic time constant (Tuckwell, 1979),

\[ dV = -\frac{V - v_0}{\tau}\, dt + (V_E - V) \sum_{i=1}^{N_E} g_{E,i}\, dP_{E,i} + (V_I - V) \sum_{j=1}^{N_I} g_{I,j}\, dP_{I,j}. \tag{2.17} \]
The first term describes the passive leak of the membrane, with resting potential v_0 and passive membrane time constant τ. The second and third terms represent the synaptic contribution due to cortical background and foreground activity from N_E excitatory (dP_{E,i}) and N_I inhibitory (dP_{I,j}) neurons, respectively. The inputs dP_{E,i} and dP_{I,j} are trains of delta functions distributed as independent temporally homogeneous Poisson processes with constant intensities (spiking rates) λ_{E,i} and λ_{I,j}, respectively. V_E and V_I are the (constant) reversal potentials (V_I ≤ v_0 ≤ V(t) ≤ V_th < V_E). When the membrane potential reaches the threshold V_th, an output spike (action potential) is generated, and the membrane potential is reset to its resting value v_0. The parameters g_{E,i} and g_{I,j} represent the integrated conductances over the time course of the synaptic event (with unit weights) divided by the neural capacitance; they are nonnegative and dimensionless. Since we will be interested only in the plasticity of the excitatory synapses in this study, g_{I,j} = g_I is taken to be the same for all inhibitory synapses. In keeping with widespread convention about synaptic strengths, we will also use the notation w_i = g_{E,i} and refer to the excitatory synaptic conductances as weights.

In the absence of spike generation, the membrane potential approaches an equilibrium value, μ_Q, about which it fluctuates with variance σ_Q². The membrane potential approaches μ_Q with a time constant given by the effective membrane time constant τ_Q, which determines the decay of the membrane potential and is typically significantly shorter than the passive membrane time constant τ due to the activation of the synaptic conductances. The values of μ_Q, σ_Q, τ_Q are (Hanson & Tuckwell, 1983; Burkitt, 2001)

1/τ_Q = 1/τ + r_10,   μ_Q = τ_Q (v_0/τ + r_11),
σ_Q² = (μ_Q² r_20 − 2 μ_Q r_21 + r_22)/(2/τ_Q − r_20),   (2.18)
r_mn = V_E^n Σ_{i=1}^{N_E} g_{E,i}^m λ_{E,i} + N_I V_I^n g_I^m λ_I.
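As a concrete numerical illustration of equation 2.18, the sketch below uses the potential values quoted later in the text (V_E = 0 mV, V_I = −80 mV, v_0 = −70 mV, τ = 20 ms); the per-synapse conductances and inhibitory rate are illustrative assumptions, not values from the paper.

```python
import math

def membrane_moments(lam_E, g_E, lam_I, g_I, N_I,
                     V_E=0.0, V_I=-80.0, v0=-70.0, tau=0.020):
    """Equilibrium moments (mu_Q, sigma_Q, tau_Q) of the membrane
    potential, equation 2.18. Potentials in mV, rates in Hz, times in
    seconds; lam_E and g_E are per-synapse sequences."""
    def r(m, n):
        # r_mn = V_E^n sum_i g_Ei^m lam_Ei + N_I V_I^n g_I^m lam_I
        return ((V_E ** n) * sum((g ** m) * l for g, l in zip(g_E, lam_E))
                + N_I * (V_I ** n) * (g_I ** m) * lam_I)

    tau_Q = 1.0 / (1.0 / tau + r(1, 0))
    mu_Q = tau_Q * (v0 / tau + r(1, 1))
    var_Q = ((mu_Q ** 2) * r(2, 0) - 2.0 * mu_Q * r(2, 1) + r(2, 2)) \
            / (2.0 / tau_Q - r(2, 0))
    return mu_Q, math.sqrt(var_Q), tau_Q

# 1600 excitatory synapses (1500 background + 100 foreground) at 5 Hz,
# with illustrative conductances g_E = 0.0017 and g_I = 0.005.
mu_Q, sigma_Q, tau_Q = membrane_moments(
    lam_E=[5.0] * 1600, g_E=[0.0017] * 1600,
    lam_I=5.0, g_I=0.005, N_I=400)
```

With these numbers the equilibrium potential lies between V_I and V_th, and τ_Q comes out noticeably shorter than the passive τ, as the text describes.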
The output spiking rate is given as a function of μ_Q, σ_Q, and τ_Q by

λ_out = [τ_A + ∫_0^∞ t f_θ(t) dt]⁻¹
      = [τ_A + (τ_Q/σ_Q) √(π/2) ∫_{v_0}^{V_th} du exp[(u − μ_Q)²/(2σ_Q²)] (1 + erf((u − μ_Q)/(√2 σ_Q)))]⁻¹,   (2.19)
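The Siegert formula of equation 2.19 is straightforward to evaluate numerically; a minimal sketch using trapezoidal quadrature and the standard-library erf (the values of μ_Q, σ_Q, and τ_Q passed in below are illustrative):

```python
import math

def siegert_rate(mu_Q, sigma_Q, tau_Q, v0=-70.0, V_th=-55.0,
                 tau_A=0.0, n=4000):
    """Mean output spiking rate of the integrate-and-fire neuron,
    equation 2.19, via trapezoidal quadrature over u in [v0, V_th].
    Potentials in mV, times in seconds, rate in Hz."""
    def integrand(u):
        z = (u - mu_Q) / (math.sqrt(2.0) * sigma_Q)
        return math.exp(z * z) * (1.0 + math.erf(z))

    h = (V_th - v0) / n
    acc = 0.5 * (integrand(v0) + integrand(V_th))
    acc += sum(integrand(v0 + k * h) for k in range(1, n))
    mean_isi = tau_A + (tau_Q / sigma_Q) * math.sqrt(math.pi / 2.0) * acc * h
    return 1.0 / mean_isi
```

As expected, the rate rises as μ_Q approaches the threshold and falls as the effective time constant τ_Q lengthens.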
Spike-Timing-Dependent Plasticity
897
where τ_A is the absolute refractory period (neglected in the analysis here). This is the so-called Siegert formula (Siegert, 1951; Ricciardi, 1977; Tuckwell, 1988; Amit & Tsodyks, 1991), but with the membrane time constant τ replaced by the effective time constant τ_Q. Note that once the parameters V_th and v_0 have been chosen, this formula gives the mean spiking rate as a function of just the three variables μ_Q, σ_Q, and τ_Q.

In the calculations of the drift function A(w_i) described in the following sections, the output spike times are completely determined by the interspike interval (ISI) distribution f_θ(t). The Laplace transform of the ISI distribution, f_θ^L(s), plays an important role due to the assumed Poisson time distribution of the synaptic input. Because the spiking mechanism of the neural model is integrate-and-fire, the ISI distribution is the first passage time density of the membrane potential and is given by (see the appendix of Burkitt, Meffin, & Grayden, 2003)

f_θ^L(s) = p^L(s)/q^L(s),   (2.20)

where

p^L(s) = ∫_0^1 dx [τ_Q x^{τ_Q s − 1}/√(2πσ_Q²(1 − x²))] exp[−(y_th − y_r x)²/(1 − x²)],
q^L(s) = ∫_0^1 dx [τ_Q x^{τ_Q s − 1}/√(2πσ_Q²(1 − x²))] exp[−y_th²(1 − x)²/(1 − x²)],   (2.21)
with y_th = (V_th − μ_Q)/(√2 σ_Q), y_r = (v_0 − μ_Q)/(√2 σ_Q). The values of the potentials were chosen throughout this study to be V_E = 0 mV, V_I = −80 mV, v_0 = −70 mV, V_th = −55 mV, and the passive membrane time constant was τ = 20 ms. The numbers of background and foreground synapses are taken to be N_B = 1500 and N_F = 100, respectively, and the number of inhibitory synapses is N_I = 400.

3 Results

Results for four different classes of EPSP-AP interaction, based on the time extent of the interaction, are presented in the four sections below.

3.1 Model I: Unrestricted Time Extent of EPSP-AP Interactions. We first consider the model in which the synaptic inputs contribute to all the STDP time windows in which they fall (Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter et al., 1999; Song et al., 2000; Senn, Markram, & Tsodyks, 2001). In this model, a synaptic input gives potentiating contributions for all the subsequent spikes and depressing contributions for all the preceding spikes that lie within the STDP time window. The drift function,
898
A. Burkitt, H. Meffin, and D. Grayden
A(w_i), is given by (see section B.1 in appendix B for details)

A(w_i) = λ_out λ_i [τ_P α_P(w_i) − τ_D α_D(w_i)],   (3.1)
where λ_i is the rate of synaptic inputs to synapse i with weight w_i. In this expression for the drift function, the multiplicative factors λ_out and λ_i result from the cumulative nature of the changes to the weights induced by each output AP and input EPSP. The factors τ_P and τ_D represent the integral of the STDP window function F(Δt), equation 2.2, and hence incorporate the finite timescale of the EPSP-AP interactions represented by the STDP time window. The mean of the weight distribution for w_i is determined by the fixed-point equation A(w_i) = 0. Since the resulting expression is independent of both the input and output spiking rates, the mean weight is identical for all synapses, so that the means of both the background and foreground weight distributions, w̄_F, w̄_B, are given by

α_D(w̄_B)/α_P(w̄_B) = α_D(w̄_F)/α_P(w̄_F) = τ_P/τ_D.   (3.2)
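For van Rossum et al. plasticity (α_P(w) = c_P, α_D(w) = c_D w; Table 1), this fixed point can be checked directly from equation 3.1. In the sketch below, c_D = 0.01 follows the text, while the value of c_P is illustrative:

```python
def drift_model_I(w, lam_out, lam_i, tau_P=0.017, tau_D=0.034,
                  c_P=1e-4, c_D=0.01):
    """Drift function A(w), equation 3.1, with van Rossum et al.
    plasticity: alpha_P(w) = c_P, alpha_D(w) = c_D * w."""
    return lam_out * lam_i * (tau_P * c_P - tau_D * c_D * w)

# Fixed point w_bar = tau_P c_P / (tau_D c_D): the same for every
# synapse, whatever the input and output rates.
w_bar = (0.017 * 1e-4) / (0.034 * 0.01)
```

The sign of A(w) on either side of w̄ (positive below, negative above) confirms that the fixed point is stable.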
The actual fixed-point value for w̄_F = w̄_B is found by solving this implicit expression. Consequently, for this class of STDP model, in which the incoming EPSPs interact with all APs, all mean weights are identical, and in particular there is no selective potentiation of the higher-input-rate foreground synapses relative to the lower-input-rate background synapses. Moreover, this equality between the means of the two weight distributions is unaffected by any difference between the two STDP time constants τ_P and τ_D. In the particular case of van Rossum et al. plasticity (Table 1), the mean weight is given by w̄_i = w̄_B = w̄_F = τ_P c_P/(τ_D c_D).

The diffusion function B(w) cannot be calculated in the same straightforward way as the drift function A(w), since the successive changes in the weights are not independent (i.e., successive jumps are correlated), as discussed further in section 4.3. The calculation of B(w) with such correlations is outside the scope of this investigation.

3.2 Model II: Input-Restricted EPSP-AP Interactions. We consider now a model in which the time extent of the EPSP-AP interactions is restricted so that each output AP interacts only with its nearest synaptic inputs. Consequently, each output AP in this model has a potentiation component that arises by interaction with the most recent input EPSP and a depression component by interaction with the first subsequent EPSP (i.e., if additional EPSPs fall within the STDP time window of an output spike, these interactions are neglected; van Rossum et al., 2000). The EPSP-AP interactions included in this model are illustrated in Figure 2, which shows that APs interact only with the EPSPs that are within the EPSP ISI in which they are generated. Consequently, each ISI of the input EPSPs can be considered as an independent event.
Figure 2: Model II: Input-restricted EPSP-AP interactions (N = 1). Diagrammatic representation of the EPSP-AP interactions included (time goes from left to right). Each AP interacts only with the last and the next EPSP. Action potentials (APs) are represented by vertical arrows above the time line, and EPSPs are represented by vertical arrows below the time line. The brackets indicate the EPSP-AP interactions included in this model: brackets above the time line represent contributions that increase the weight (potentiation), and brackets below the time line represent contributions that decrease the weight (depression).
The drift function A(w) and diffusion function B(w) are evaluated using equation 2.5 and make use of the Poisson time distribution of the synaptic inputs. This is done by integrating over a single ISI of the EPSPs, that is, truncating the time window of integration of the EPSP-AP interaction (see section B.2 for details). The drift function is given for this model by

A(w_i) = λ_out λ_i [α_P(w_i)/(λ_i + 1/τ_P) − α_D(w_i)/(λ_i + 1/τ_D)].   (3.3)
The drift function contains the multiplicative factors λ_out and λ_i, which are common to all models (see equation 3.1 and the following discussion). The restrictions imposed on the EPSP-AP interactions (i.e., truncated to extend over only the ISI of the inputs) introduce the terms within the square brackets, which determine the fixed point of the weight. When averaged over the distribution of input-output time differences, the input-restricted interactions effectively result in a modification of the STDP window by the multiplicative factor e^{−λ_i t}, that is, a suppressive factor corresponding to the Poisson rate of the input EPSPs. Consequently, the fixed-point equation for w_i now depends on the input rate λ_i, so that for the mean background and
foreground weights, we obtain

α_D(w̄_X)/α_P(w̄_X) = (λ_X + 1/τ_D)/(λ_X + 1/τ_P),   X = B, F.   (3.4)
As for model I, the fixed-point equations for the background and foreground weights are identical if the two STDP time constants are equal, resulting in no selective potentiation. However, if the STDP time constants are different, then the potentiation is input selective. This is illustrated in the case of van Rossum et al. plasticity (Table 1), for which the drift function is given by A(w_i) = A_0 − A_1 w_i with

A_0 = λ_out λ_i c_P/(λ_i + 1/τ_P),   A_1 = λ_out λ_i c_D/(λ_i + 1/τ_D).   (3.5)
The validity of the Fokker-Planck formalism, as discussed in section 2.1, requires that A_0(w̄_i)/λ_i ≪ 1, which in this case is given by the condition 2c_D λ_out/(λ_i + 1/τ_D) ≪ 1 and is satisfied for sufficiently small c_D (in this study, c_D = 0.01). The means of the background and foreground weight distributions evolve toward their respective stable fixed points, equation 2.10,

w̄_X = c_P (λ_X + 1/τ_D)/(c_D (λ_X + 1/τ_P)),   X = B, F.   (3.6)
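Equation 3.6 is easy to evaluate; the sketch below pins the background weight at w̄_B = 0.0017 at λ_B = 5 Hz by the choice of c_P, as done for Figure 3 (that choice of c_P is the only assumption here):

```python
def mean_weight_model_II(lam, tau_P=0.017, tau_D=0.034, c_D=0.01,
                         lam_B=5.0, w_B=0.0017):
    """Fixed-point mean weight, equation 3.6, for model II (N = 1) with
    van Rossum et al. plasticity; c_P is chosen so that the weight
    equals w_B at the background rate lam_B."""
    c_P = w_B * c_D * (lam_B + 1.0 / tau_P) / (lam_B + 1.0 / tau_D)
    return c_P * (lam + 1.0 / tau_D) / (c_D * (lam + 1.0 / tau_P))
```

With R_τ = 2 the foreground weight at 50 Hz exceeds the background value, while for R_τ = 1 (τ_D = τ_P) the rate dependence cancels, reproducing the trends of Figure 3.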
Now for τ_D > τ_P (as observed experimentally) and λ_F > λ_B, it is clear from this equation that w̄_F > w̄_B. Consequently, by limiting the extent of EPSP-AP interactions in this model, the weight distributions for the foreground synapses with higher input rate λ_F become potentiated relative to the unchanged weights of the low-input-rate λ_B background synapses, in contrast to previous studies (van Rossum et al., 2000).

Figure 3 illustrates the effect of restricted EPSP-AP interactions and different time constants for potentiation and depression on the mean of the foreground weights. R_τ = τ_D/τ_P is the ratio of the STDP time constants, and the plot shows the change of the mean weight w̄_F as a function of the spiking rate of the foreground inputs λ_F. (The background inputs have a rate of λ_B = 5 Hz, and the average equilibrium background weight, which is fixed at w̄_B = 0.0017 for all values of R_τ by suitably choosing c_P, is independent of λ_F. This is also the initial value of w̄_F.) For the case where the time constants are equal, R_τ = 1, there is no potentiation, as observed previously (van Rossum et al., 2000). For larger values of the ratio of the STDP time constants, R_τ > 1, the results show that the mean foreground weight increases as the input rate of the foreground weights λ_F increases, and this effect becomes more pronounced for larger values of R_τ. The dashed line shows the case where the ratio of STDP time constants is reversed, namely, where τ_D < τ_P, which results in the weights of
[Figure 3 here: curves of w̄_F (%) versus λ_F (Hz) for R_τ = 4, 2, 1.5, 1, and 0.5.]
Figure 3: Model II, with N = 1 and van Rossum et al. plasticity, equation 3.6. The percentage change of the mean weight w̄_F plotted as a function of the spiking rate of the foreground inputs λ_F (the background inputs have a spiking rate of λ_B = 5 Hz). R_τ = τ_D/τ_P is the ratio of the STDP time constants, given by fixing τ_P = 17 ms and varying τ_D. Parameter values are c_D = 0.01, and c_P is chosen so that w̄_B = 0.0017 for each R_τ. All other parameter values are as given in the text. The dashed line corresponds to a value of R_τ < 1, in which the foreground set of weights falls below the background weights. Solid symbols represent the results for the average weight of the foreground synapses obtained by numerical simulation after 1000 APs.
the higher-input synapses becoming smaller than the background weights. The solid symbols in the figure represent the results of numerical simulation (averages taken after 1000 output spikes, in order for the weights to reach their steady-state distribution). Note that the values of the mean weights w̄_B, w̄_F, given by equation 3.6, are independent of the output spiking rate.

3.2.1 Model II: Generalization to N Input-Restricted EPSP-AP Interactions. The above results can be extended to the case in which the time extent of the EPSP-AP interactions is restricted to the N nearest synaptic inputs to each output AP. The calculation proceeds as described in section B.2.1: the integral over input and output spike times, equation B.4, is generalized to equation B.14 by replacing the single input integral with N input integrals for the successive inputs, in exactly the fashion described for model I. Consequently the model is N-input restricted, so that each AP interacts with
only the N closest EPSPs. The drift function is given by

A(w_i) = λ_out λ_i { α_P(w_i) τ_P [1 − (λ_i τ_P/(1 + λ_i τ_P))^N] − α_D(w_i) τ_D [1 − (λ_i τ_D/(1 + λ_i τ_D))^N] }.   (3.7)
Consequently, the fixed-point equations of the background and foreground weight distributions are

α_D(w̄_X)/α_P(w̄_X) = τ_P [1 − (λ_X τ_P/(1 + λ_X τ_P))^N] / (τ_D [1 − (λ_X τ_D/(1 + λ_X τ_D))^N]),   X = B, F.   (3.8)
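For van Rossum et al. plasticity, equation 3.8 gives a closed-form mean weight for any N. The sketch below (with an illustrative c_P) shows the potentiation shrinking toward the model I value as N grows, as in Figure 4:

```python
def mean_weight_model_II_N(lam, N, tau_P=0.017, tau_D=0.034,
                           c_P=3e-5, c_D=0.01):
    """Fixed-point mean weight from equation 3.8 with van Rossum et al.
    plasticity (alpha_P(w) = c_P, alpha_D(w) = c_D * w)."""
    x_P = lam * tau_P / (1.0 + lam * tau_P)
    x_D = lam * tau_D / (1.0 + lam * tau_D)
    return (c_P / c_D) * tau_P * (1.0 - x_P ** N) / (tau_D * (1.0 - x_D ** N))

w_N1 = mean_weight_model_II_N(50.0, 1)     # most input selective
w_N10 = mean_weight_model_II_N(50.0, 10)   # weaker selectivity
w_model_I = (3e-5 / 0.01) * 0.017 / 0.034  # N -> infinity (model I) limit
```

As N increases, the fixed-point weight approaches the rate-independent model I value τ_P c_P/(τ_D c_D), reproducing the non-input-selective behavior of model I.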
Clearly, as N → ∞, the results reduce to those for model I, equation 3.2, with unrestricted time extent of the EPSP-AP interactions, as we would expect. For all other values of N, the same conclusions hold as for the N = 1 case: the foreground synapses with higher input rate λ_F become potentiated relative to the background synapses with lower input rate λ_B if the STDP time constants are different. This is illustrated for the case of van Rossum et al. plasticity in Figure 4, which shows the change of the mean weight as a function of the input spiking rate for various values of N. Note that as N increases, the weight change becomes smaller and smaller, eventually reproducing the non-input-selective behavior of model I.

3.2.2 Model II: N = 1/2 Input-Restricted Model of EPSP-AP Interactions. A further variation of an input-restricted model is one in which each AP can interact with only the single closest EPSP; that is, an AP gives rise to either potentiation or depression depending on whether the preceding or the subsequent EPSP was closer in time. Consequently, the interactions are restricted to half of the ISI of the input EPSPs. We denote this as N = 1/2, since only half the EPSP-AP interactions of the N = 1 model are included. This is illustrated in Figure 5, which shows how each AP interacts only with its closest EPSP. The time extent of the EPSP-AP interactions in this model is restricted to domains delimited by adjacent synaptic inputs, as for the N = 1 case of model II. The drift function is given by (see section B.2.2 for details)

A(w_i) = λ_out λ_i [α_P(w_i)/(2λ_i + 1/τ_P) − α_D(w_i)/(2λ_i + 1/τ_D)].   (3.9)
[Figure 4 here: curves of w̄_F (%) versus λ_F (Hz) for N = 1/2, 1, 2, 3, 4, 6, 10, and 20.]
Figure 4: Model II, various N (see equation 3.8), for van Rossum et al. plasticity. The percentage change of the mean weight w̄_F plotted as a function of the spiking rate of the foreground inputs λ_F (the background inputs have a spiking rate of λ_B = 5 Hz). Different lines show the results for different values of N, the number of nearest synaptic inputs with which each output AP interacts. The value N = 1/2 corresponds to the case of interaction between the AP and the single closest EPSP, as described in section 3.2.2. The time constants are taken as τ_D = 34 ms, τ_P = 17 ms (i.e., R_τ = 2). Other parameter values are as for Figure 3.
Consequently, the means of the background and foreground weight distributions for van Rossum et al. plasticity (Table 1) are

w̄_X = c_P (2λ_X + 1/τ_D)/(c_D (2λ_X + 1/τ_P)),   X = B, F.   (3.10)
Similar to the previous input-restricted models, in this model the foreground synapses with higher input rate λ_F become potentiated relative to the background synapses with lower input rate λ_B, as illustrated by the N = 1/2 line in Figure 4.

3.3 Model III: Output-Restricted EPSP-AP Interactions. In this model, the time extent of the EPSP-AP interaction is restricted to domains delimited by adjacent output spikes; each EPSP is paired only with the last and the next AP. Consequently, each EPSP contributes exactly one potentiating
Figure 5: Model II, N = 1/2 input-restricted model of EPSP-AP interactions. Each AP interacts only with the single closest EPSP. Diagrammatic representation of the EPSP-AP interactions included in this model (see the caption of Figure 2).
term (by interaction with the first subsequent AP) and exactly one depressing term (by interaction with the immediately preceding AP). This pattern of EPSP-AP interaction is illustrated in Figure 6, which shows how each EPSP interacts only with its last and next AP. This model has been studied previously in the case of additive STDP (Câteau & Fukai, 2003; Izhikevich & Desai, 2003). The physiological rationale for this model of EPSP-AP interactions is that an output AP almost completely resets the internal state of a neuron, thereby reducing the effect of more temporally distant interactions. The drift function, given by Câteau and Fukai (2003), is (see section B.3 for details)

A(w_i) = λ_out λ_i { α_P(w_i) τ_P [1 − f_θ^L(1/τ_P)] − α_D(w_i) τ_D [1 − f_θ^L(1/τ_D)] }.   (3.11)
The terms containing f_θ^L(·) reflect the output spiking rate, and in the limit of small spiking rates, this becomes equivalent to Poisson-distributed outputs:

f_θ^L(s) → λ_out/(s + λ_out)   (Poisson).   (3.12)
Figure 6: Model III: Output-restricted EPSP-AP interactions (M = 1). Diagrammatic representation of the EPSP-AP interactions included in this model (see the caption of Figure 2). Each EPSP contributes only to the STDP time window of its last and next AP.
The drift function in this limit becomes

A(w_i) → λ_out λ_i [α_P(w_i)/(λ_out + 1/τ_P) − α_D(w_i)/(λ_out + 1/τ_D)]   (Poisson)   (3.13)

(cf. the analogous expression for A(w_i), equation 3.3, with Poisson inputs). The means of the background and foreground weight distributions are given by the solution of the fixed-point equation:

α_D(w̄_X)/α_P(w̄_X) = τ_P [1 − f_θ^L(1/τ_P)] / (τ_D [1 − f_θ^L(1/τ_D)]) → (λ_out + 1/τ_D)/(λ_out + 1/τ_P)   (Poisson).   (3.14)
As for models I and II, the drift function contains the multiplicative factors λ_out and λ_i (see equations 3.1 and 3.3 and the following discussion). In the output-restricted model, the EPSP-AP interactions are truncated to extend over only the ISI of the outputs, intuitively resulting in a similar multiplicative factor that attenuates the STDP time window, as seen for model II. As in the case of model I, the fixed-point equation for the mean weight has no explicit dependence on the input rate. Consequently, the mean weights of the higher-input-rate foreground synapses, w̄_F, and the low-input-rate background synapses, w̄_B, are identical. However, unlike model I, in this model the mean weights do depend on the output spiking rate, that
is, the mean weights will change in exactly the same fashion as the output spiking rate changes. This is illustrated in the case of van Rossum et al. plasticity (Table 1), for which the mean weights are given by

w̄_B = w̄_F = c_P τ_P [1 − f_θ^L(1/τ_P)] / (c_D τ_D [1 − f_θ^L(1/τ_D)]).   (3.15)
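Because f_θ^L depends on the output rate, and hence on the weights themselves, equation 3.15 must be solved self-consistently. A minimal sketch in the small-rate Poisson limit of equation 3.12, with a caller-supplied toy rate function standing in for the Siegert formula of equation 2.19 (both c_P and the linear rate function below are illustrative assumptions):

```python
def model_III_mean_weight(rate_of_weight, tau_P=0.017, tau_D=0.034,
                          c_P=3e-5, c_D=0.01, w0=0.0017, n_iter=200):
    """Iterate the model III fixed point in the Poisson-output limit:
    with f_L(s) = lam_out/(s + lam_out), equation 3.15 reduces to
    w = (c_P/c_D)(lam_out + 1/tau_D)/(lam_out + 1/tau_P),
    where lam_out itself depends on the mean weight w."""
    w = w0
    for _ in range(n_iter):
        lam_out = rate_of_weight(w)
        w = (c_P / c_D) * (lam_out + 1.0 / tau_D) / (lam_out + 1.0 / tau_P)
    return w

# Toy monotone rate function (stand-in for equation 2.19).
w_fix = model_III_mean_weight(lambda w: 2.0e4 * w)
```

Note that, as the text states, the input rate λ_i does not enter this fixed point at all; only the output rate does, so the iteration converges to the same weight from any reasonable starting value.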
In order to evaluate the mean weights, we solved equation 3.11 iteratively (by iterating to the stable fixed point), since f_θ^L(s) depends on w̄_B and w̄_F (Câteau & Fukai, 2003). (This was not necessary for model II, because w̄_B and w̄_F depended on only their respective input rates.) Figure 7 shows the change of the mean background and foreground weights (which are equal: w̄_B = w̄_F) as a function of the foreground input rate λ_F. Again the validity of the Fokker-Planck formalism, as discussed in section 2.1, requires that
Figure 7: Model III, with M = 1 and van Rossum et al. plasticity, equation 3.15. The percentage increase of the mean background and foreground weights, w̄_B = w̄_F, plotted as a function of the spiking rate of the foreground inputs λ_F (the background inputs have a spiking rate of λ_B = 5 Hz). The plot is for STDP time constants τ_P = 17 ms and τ_D = 34 ms (i.e., R_τ = 2). Parameter values are c_D = 0.01 and c_P = 3.0 × 10⁻⁵, so that w̄_B = w̄_F = 0.0017 at λ_F = 5 Hz. All other parameter values are as given in the text. Solid symbols represent the results for the average weight of the synapses obtained by numerical simulation after 10,000 APs.
A_0(w̄_i)/λ_out ≪ 1. In this case, the condition corresponds to 2c_D τ_D λ_i [1 − f_θ^L(1/τ_D)] ≪ 1, which is again satisfied for sufficiently small c_D.

3.3.1 Model III: Generalization to M Output-Restricted EPSP-AP Interactions. The above results can be extended to the case in which the time extent of the EPSP-AP interactions is restricted to the M nearest AP outputs to each input EPSP. The calculation proceeds as described in section B.3.1: the integral over input and output spike times, equation B.18, is generalized to equation B.25 by replacing the single output integral with M output integrals for the successive AP outputs, in the fashion described for model II. Consequently the model is M-output restricted, so that each EPSP interacts with only the M closest APs. The drift function is given by

A(w_i) = λ_out λ_i { α_P(w_i) τ_P [1 − (f_θ^L(1/τ_P))^M] − α_D(w_i) τ_D [1 − (f_θ^L(1/τ_D))^M] }.   (3.16)
The fixed-point equations of the background and foreground weight distributions are

α_D(w̄_B)/α_P(w̄_B) = α_D(w̄_F)/α_P(w̄_F) = τ_P [1 − (f_θ^L(1/τ_P))^M] / (τ_D [1 − (f_θ^L(1/τ_D))^M]).   (3.17)
Clearly, as M → ∞, the results again reduce to those for model I, equation 3.2, with unrestricted time extent of the EPSP-AP interactions, as we would expect. For all other values of M, the same conclusions hold as for the M = 1 case: the fixed-point weight depends on the output rate, and there is no input selectivity in the synapses, which all become equally potentiated as the input rate increases if the STDP time constants are different.

3.4 Model IV: Input- and Output-Restricted EPSP-AP Interactions. This model incorporates the restrictions of the time extent of EPSP-AP interactions contained in both models II and III: only the closest EPSP-AP interactions are included (i.e., EPSP-AP interactions that have no other intervening EPSPs or APs). The time extent of the EPSP-AP interactions in this model is consequently dependent on both the input and output spiking rates, and the pattern of EPSP-AP interactions that results is illustrated in Figure 8. This model is motivated by one of the models studied by Sjöström et al. (2001). Their rationale for this model of EPSP-AP interactions is that the amount of depression or potentiation could saturate with a single EPSP-AP interaction, making any further such interactions involving the same AP or EPSP superfluous. (Note that in Sjöström et al., 2001, the STDP window
Figure 8: Model IV: Input- and output-restricted EPSP-AP interactions (N = M = 1). Diagrammatic representation of the EPSP-AP interactions included in this model (see the caption of Figure 2). Each EPSP contributes only to the STDP time window of its last and next AP, with the additional condition that there is no intervening EPSP or AP (i.e., only the closest EPSP-AP interactions are included).
function was taken to be a step function rather than the exponential function used here and illustrated in Figure 1, so that a single input anywhere within the time window leads to saturation in their formulation.) The drift function for this model is given by (see section B.4 for details)

A(w_i) = λ_out λ_i { α_P(w_i) [1 − f_θ^L(λ_i + 1/τ_P)]/(λ_i + 1/τ_P) − α_D(w_i) [1 − f_θ^L(λ_i + 1/τ_D)]/(λ_i + 1/τ_D) }.   (3.18)
The means of the background and foreground weight distributions are given by the solution of the fixed-point equation

α_D(w̄_X)/α_P(w̄_X) = [1 − f_θ^L(λ_X + 1/τ_P)] (λ_X + 1/τ_D) / ([1 − f_θ^L(λ_X + 1/τ_D)] (λ_X + 1/τ_P)),   X = B, F.   (3.19)
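In the Poisson-output limit of equation 3.12, the factors 1 − f_θ^L(λ_X + 1/τ) in equation 3.19 become (λ_X + 1/τ)/(λ_X + 1/τ + λ_out), so for van Rossum et al. plasticity the fixed point collapses to a simple closed form. A sketch (c_P illustrative, λ_out treated as given):

```python
def mean_weight_model_IV(lam_in, lam_out, tau_P=0.017, tau_D=0.034,
                         c_P=3e-5, c_D=0.01):
    """Fixed-point mean weight from equation 3.19 with van Rossum et al.
    plasticity, in the Poisson-output limit f_L(s) = lam_out/(s + lam_out).
    The rate factors cancel, leaving
    w = (c_P/c_D)(lam_in + 1/tau_D + lam_out)/(lam_in + 1/tau_P + lam_out)."""
    return (c_P / c_D) * ((lam_in + 1.0 / tau_D + lam_out)
                          / (lam_in + 1.0 / tau_P + lam_out))
```

Both features described in the text are visible in this expression: for τ_D > τ_P the weight grows with λ_in (input selectivity, as in model II) and also with λ_out (output-rate dependence, as in model III), while for τ_D = τ_P the rate dependence cancels entirely.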
In the case of van Rossum et al. plasticity (Table 1), the mean weights are given by

w̄_X = c_P [1 − f_θ^L(λ_X + 1/τ_P)] (λ_X + 1/τ_D) / (c_D [1 − f_θ^L(λ_X + 1/τ_D)] (λ_X + 1/τ_P)),   X = B, F.   (3.20)

In the case where the STDP time constants are equal, there is no selective potentiation of the foreground synapses (i.e., w̄_F = w̄_B). When the time constant for depression is larger than that for potentiation, τ_D > τ_P, the foreground synapses are potentiated relative to the background synapses, as observed in the input-restricted case of model II. However, in addition, both the background and foreground weights also depend on the output spiking rate, as observed in the output-restricted case of model III. Consequently, this input- and output-restricted model displays features contained in models II and III: the foreground synapses are potentiated relative to the background synapses, and the background and foreground weights change with the output spiking rate. This is illustrated in Figure 9, which shows the change of the mean background (solid diamond symbols) and foreground (solid square symbols) weights w̄_B and w̄_F for R_τ = 2.

3.4.1 Model IV: Generalization to N Input-, M Output-Restricted EPSP-AP Interactions. The above results can be extended to the case in which the time extent of the EPSP-AP interactions is restricted to the N nearest EPSP inputs to each output AP and the M nearest AP outputs to each input EPSP. The calculation proceeds using the methods developed for each of the models separately. The drift function is given by

A(w_i) = λ_out λ_i { α_P(w_i) τ_P [(−1)^{N−1} λ_i^N/(N−1)!] (d^{N−1}/dλ_i^{N−1}) [(1/λ_i − 1/(λ_i + 1/τ_P)) (1 − (f_θ^L(1/τ_P))^M)]
− α_D(w_i) τ_D [(−1)^{N−1} λ_i^N/(N−1)!] (d^{N−1}/dλ_i^{N−1}) [(1/λ_i − 1/(λ_i + 1/τ_D)) (1 − (f_θ^L(1/τ_D))^M)] }.   (3.21)
The fixed-point equations can be read off in exactly the same way as for the other models. Clearly, as M → ∞ for fixed N, the results reduce to those for model II, equation 3.7, while as N → ∞ for fixed M, the results reduce to those for model III, equation 3.16, and as N, M → ∞ the results reduce to those for model I, equation 3.1, as expected. For finite N, M, the model shows behaviors evident in both the input-restricted and output-restricted models, as discussed above.
Figure 9: Model IV, with N = M = 1 and van Rossum et al. plasticity, equation 3.20. The percentage increase of the mean background and foreground weights, w̄_B and w̄_F, plotted as a function of the spiking rate of the foreground inputs λ_F (the background inputs have a spiking rate of λ_B = 5 Hz). The plot is for STDP time constants τ_P = 17 ms and τ_D = 34 ms (i.e., R_τ = 2). Parameter values are c_D = 0.01 and c_P = 2.75 × 10⁻⁵, giving w̄_B = w̄_F = 0.0017 at λ_F = 5 Hz. All other parameter values are as given in the text. Solid diamonds (squares) represent the results for the average weight of the foreground (background) synapses obtained by numerical simulation after 10,000 APs.
4 Discussion

The results for the four models of stable fixed-point STDP indicate how the time extent of EPSP-AP interactions and the time asymmetry of the learning window affect the distribution of synaptic strengths for synapses with different fixed rates of input. The effective rate-based description of the dynamics of the mean of a synaptic weight w̄_i (obtained by averaging over times that are long compared to both the STDP time window and the ISIs of the inputs and outputs) is given by equation 2.4, and the fixed-point value of the mean weight is determined by equation 2.7. In order to gain an understanding of the effect of different rates of synaptic input and to relate the STDP models investigated here to both classical LTP and LTD experiments and models of rate-based Hebbian learning, the inputs are divided into a background set of inputs (with a spontaneous rate of 5 Hz) and a foreground set of inputs (with a higher rate). The question that we
have addressed is the extent to which this stable fixed-point model of STDP is capable of generating synaptic weights that are input selective (i.e., whether the weights of foreground synapses are potentiated relative to background synapses at the fixed point) and is therefore analogous to rate-based Hebbian learning models with learning dynamics determined by a stable fixed point. Four classes of model are identified, based on the time extent of the allowed EPSP-AP interactions and defined in terms of how this extent is limited by the input ISIs or the output ISIs. The effect of both the restriction on the time extent of EPSP-AP interactions and the time asymmetry of the STDP window (i.e., different time constants for potentiation and depression) can be summarized as follows: (1) if the STDP time constants are equal, there is no selective potentiation of synapses with higher rates, and (2) if the time constant for depression is larger than that for potentiation (R_τ = τ_D/τ_P > 1), then there is selective potentiation of synapses with higher input rates only if the time extent of the EPSP-AP interactions displays some form of "input" restriction (such as in models II and IV).

In model I, in which all EPSPs arriving within the STDP time window contribute to the synaptic modification, there is no selective synaptic enhancement of the foreground synapses over the background synapses, even when the time constants of potentiation and depression (τ_P, τ_D) are different. This somewhat surprising result is a consequence of the balance of the potentiation and depression contributions: higher rates of synaptic input increase both contributions, but their ratio (which determines the average synaptic strength, equation 2.7) remains unchanged. When the extent of the EPSP-AP interaction is input restricted, as in models II and IV, and τ_D > τ_P, the foreground synapses are potentiated relative to the background synapses.
The amount of selective potentiation depends on both the rate of the foreground synaptic inputs and the ratio of the time constants of potentiation and depression (R_τ), as described by equations 3.4, 3.6, and 3.10 and illustrated in Figures 3 and 4. However, when the extent of EPSP-AP interaction is output restricted, as in model III, there is no selective potentiation of the higher-rate foreground synapses (but there is some dependence of both background and foreground weights on the output spiking rate), as described by equation 3.14 and illustrated in Figure 7. Model IV, which is input and output restricted in the time extent of its EPSP-AP interactions, displays both behaviors: both background and foreground synapses are potentiated, but the foreground synapses are selectively potentiated relative to the background synapses, as described by equation 3.19. Consequently, the results presented here indicate that the potentiation of foreground synapses in a classical LTP and LTD paradigm involves some form of input restriction on the time extent of the EPSP-AP interactions.

These input-restricted models of STDP are potentially significant in understanding the relationship between STDP and rate-based Hebbian learning. Hebbian rate-based learning, in which the synaptic modification is based on correlations between input and output rates, has been widely
used as a model of neural information processing, including memory (associative and working memory), sensory perception, motor control, and classical conditioning. Most previously studied models of STDP have been shown to have learning dynamics controlled by an unstable fixed point, which leads to instabilities and results in a bimodal distribution of weights. The existence of models of STDP with stable fixed-point behavior allows the possibility that such spike-timing-based models provide the underlying mechanisms of the corresponding rate-based models. Consider, for example, the Hopfield model of associative memory (Hopfield, 1982), which has subsequently been considerably elaborated as a model of working memory (Amit & Tsodyks, 1991; Amit & Brunel, 1997; Amit & Mongillo, 2003). In these models, there is an interplay between the dynamics of the neurons (on a faster timescale) and the dynamics of the synaptic weights (on a slower timescale). The models contain a background state that is a spontaneously active network state, representing a global attractor of the neuronal dynamics, and a set of attractor states of the neuronal dynamics, corresponding to specific memories. This learning paradigm requires that these input-selective attractor states of the network dynamics (the "memories") are generated by the learning dynamics of the weights. The results here demonstrate that STDP is potentially capable of generating such stable input-selective fixed points. However, such a learning paradigm would require a consideration of other synaptic plasticity mechanisms in order that the weight distribution remain stable once the selective input ceases. The exploration of this double dynamics and the associated additional mechanisms required to stabilize the learned patterns clearly offers many interesting challenges.
The calculations presented here for the four types of input-output time relationships are exactly the same for widely used STDP models whose learning dynamics is determined by an unstable fixed point. The drift functions for these models are exactly those presented here: the integrals over the input-output time difference probability distribution, a_D, a_P in equation 2.6, remain the same, and it is only the weight-dependent potentiation and depression functions, α_D(w_i), α_P(w_i), that differ. However, for these widely used STDP models, the learning dynamics is not changed in any fundamental way: both the value of the unstable fixed point and the rate of change of the weights may be altered, but the dynamics will still produce the bimodal distribution of weights characteristic of these STDP models.

4.1 Effect of Suppressive Interspike Interactions. Recent work has shown that in addition to spike-timing-dependent plasticity, there is some evidence of an interspike suppressive interaction effect (Froemke & Dan, 2002), in which the efficacy of a spike in synaptic modification is suppressed by the preceding spike in the same neuron (i.e., either the presynaptic or the postsynaptic neuron). The resulting change in weight of the synapse connecting the ith presynaptic neuron with the jth postsynaptic neuron is
modeled as

\Delta w_{ij} = \epsilon_i^{\mathrm{pre}}\, \epsilon_j^{\mathrm{post}}\, F(\Delta t_{ij}), \qquad (4.1)

where ε_i^pre, ε_j^post are the efficacies of the presynaptic and postsynaptic neurons, respectively, and Δt_ij = t_j^post − t_i^pre (the indices i and j will henceforth be dropped). Both of these efficacies depend on the times of their respective spikes, ε^pre(t_k^pre − t_{k−1}^pre) and ε^post(t_k^post − t_{k−1}^post), respectively, where t_k is the time of the kth spike (input or output, respectively). The particular form of these suppressive interaction functions chosen in Froemke and Dan (2002) is

\epsilon^q(t_k^q - t_{k-1}^q) = 1 - \exp\left[ -\frac{t_k^q - t_{k-1}^q}{\tau_S^q} \right], \qquad q \in \{\mathrm{pre}, \mathrm{post}\}, \qquad (4.2)

where τ_S^q is the suppression time constant. The values for the suppression time constants found by best fit to their experimental data from pyramidal neurons in layer 2/3 of rat visual cortical slices are approximately τ_S^pre = 34 ms and τ_S^post = 75 ms (Froemke & Dan, 2002). The introduction of the efficacies produces a dependence of the change in weight on the presence of other spikes in both the presynaptic and postsynaptic neurons. If the ISI preceding a spike is very small, the spike will contribute relatively little to the change in synaptic strength, whereas if the preceding ISI is large, the change in weight associated with the spike will not be attenuated. This suppression could be mediated by either inhibitory synaptic interactions in the local cortical circuitry or mechanisms in the pre- and postsynaptic neurons, and a number of possible mechanisms were proposed (Froemke & Dan, 2002). Observation of paired-pulse depression (Tsodyks & Markram, 1997; Varela et al., 1997) suggested that suppression between presynaptic spikes during synaptic modification may be due to either short-term depression of transmitter release (Zucker, 1999) or desensitization or saturation of postsynaptic glutamate receptors (Zorumski & Thio, 1992). Suppression between postsynaptic spikes during synaptic modification may be due to Ca²⁺-dependent conductances (Stuart & Sakmann, 1994) or other postsynaptic mechanisms that are involved in activity-dependent synaptic modification (Malenka & Nicoll, 1999). Including this suppressive interspike interaction in the STDP model I with unrestricted EPSP-AP interactions, using the formalism in appendix B, modifies the corresponding drift function, equation 3.1, to

A(w_i) = \lambda_{\mathrm{out}}\, \lambda_i\, E_{\mathrm{pre}}\, E_{\mathrm{post}}\, [\tau_P\, \alpha_P(w_i) - \tau_D\, \alpha_D(w_i)], \qquad (4.3)

where

E_{\mathrm{pre}} = \lambda_i\, \epsilon_L^{\mathrm{pre}}(\lambda_i), \qquad (4.4)

E_{\mathrm{post}} = g_{\theta L}(0) = \int_0^{\infty} dt\, f_{\theta}(t)\, \epsilon^{\mathrm{post}}(t), \qquad (4.5)
where ε_L^pre(s) is the Laplace transform of ε^pre(t), and g_{θL}(s) is the Laplace transform of g_θ(t) ≡ f_θ(t) ε^post(t). Consequently, the means of the background and foreground distributions are identical to those without the suppressive interspike interaction, equation 3.2. The effect of the suppressive interspike interaction on the drift function is an overall multiplicative factor E_pre E_post that is the same for the potentiation and depression contributions. This factor is simply the product of the two independent terms corresponding to the presynaptic and postsynaptic contributions. This results in a change in the dynamics that affects the rate at which the fixed point is reached but produces no change in the value of the fixed-point weights. For the choice of suppressive interaction functions chosen in Froemke and Dan (2002), the factors are E_pre = (1 + τ_S^pre λ_i)^{−1} and E_post = [1 − f_{θL}(1/τ_S^post)]. For the N-input-restricted case, model II, the corresponding drift function, equation 3.7, becomes

A(w_i) = \lambda_{\mathrm{out}} \lambda_i\, g_{\theta L}(0) \left\{ \alpha_P(w_i)\, \tau_P \lambda_i\, \epsilon_L^{\mathrm{pre}}(\lambda_i) \left[ 1 - \left( \frac{\lambda_i \tau_P}{1 + \lambda_i \tau_P} \right)^{N} \right] - \alpha_D(w_i)\, \tau_D \left[ \lambda_i\, \epsilon_L^{\mathrm{pre}}(\lambda_i) - \left( \lambda_i + \frac{1}{\tau_D} \right) \epsilon_L^{\mathrm{pre}}\!\left( \lambda_i + \frac{1}{\tau_D} \right) \right] \times \left( \frac{\lambda_i \tau_D}{1 + \lambda_i \tau_D} \right)^{N} \right\}. \qquad (4.6)

For the input-restricted model, only the presynaptic interspike suppressive interactions described by ε^pre(Δt) have an effect on the stable fixed-point weight. The postsynaptic interspike suppressive interactions ε^post(Δt) have an overall multiplicative effect on the drift function, which will affect the rate at which the fixed point is reached but will not affect the value of the fixed-point weight. The magnitude of the change in the fixed-point weight will depend on the precise form of the suppressive interaction function ε^pre(Δt) as well as the STDP time constants τ_D, τ_P. In the case of van Rossum et al. plasticity using the interspike suppressive interaction function of Froemke and Dan (2002), we observed that the suppressive interspike interactions tended to reduce the selective potentiation of the higher-rate input synapses and that this effect is largest for higher rates of input. For the M-output-restricted case, model III, the drift function, equation 3.16, becomes

A(w_i) = \lambda_{\mathrm{out}} \lambda_i\, \epsilon_L^{\mathrm{pre}}(\lambda_i) \left\{ \alpha_P(w_i)\, \tau_P \left[ g_{\theta L}^{2}(0) - g_{\theta L}\!\left( \frac{1}{\tau_P} \right) f_{\theta L}^{M-1}\!\left( \frac{1}{\tau_P} \right) \right] - \alpha_D(w_i)\, \tau_D\, g_{\theta L}(0) \left[ 1 - f_{\theta L}^{M}\!\left( \frac{1}{\tau_D} \right) \right] \right\}. \qquad (4.7)

For the output-restricted model, only the postsynaptic interspike suppressive interactions ε^post(Δt) have an effect on the stable fixed-point weight.
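Equations 4.1 and 4.2 can be made concrete with a short numerical sketch. This is hedged: the window F(Δt) below uses the two-exponential form of equation 2.2, but the amplitudes c_P and c_D are illustrative values, not the paper's fitted ones.

```python
import math

tau_S_pre, tau_S_post = 0.034, 0.075   # suppression time constants (s), Froemke & Dan (2002)
tau_P, tau_D = 0.017, 0.034            # STDP window time constants (s)
c_P, c_D = 0.005, 0.00525              # illustrative window amplitudes (assumed)

def efficacy(isi, tau_S):
    # Equation 4.2: a spike following a short ISI is strongly suppressed.
    return 1.0 - math.exp(-isi / tau_S)

def F(dt):
    # Two-exponential STDP window (equation 2.2 form): potentiation for dt > 0.
    return c_P * math.exp(-dt / tau_P) if dt > 0 else -c_D * math.exp(dt / tau_D)

def dw(dt, isi_pre, isi_post):
    # Equation 4.1: suppressed weight change for a single pre/post pairing.
    return efficacy(isi_pre, tau_S_pre) * efficacy(isi_post, tau_S_post) * F(dt)

# A pre-before-post pairing (dt = +10 ms) after long preceding ISIs is barely
# attenuated; the same pairing right after earlier spikes (short ISIs) is
# strongly suppressed.
print(dw(0.010, isi_pre=1.0, isi_post=1.0))
print(dw(0.010, isi_pre=0.005, isi_post=0.005))
```

The first pairing yields nearly the full window value F(10 ms), while the second is reduced by roughly two orders of magnitude, which is the qualitative effect the efficacies are designed to capture.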
The presynaptic interspike suppressive interactions ε^pre(Δt) produce a multiplicative effect on the drift function that will change the rate at which the fixed point is reached, but will not affect the value of the fixed-point weight. The magnitude of the fixed-point weight now depends on the precise form of the postsynaptic suppressive interaction function ε^post(Δt) as well as the STDP time constants τ_D, τ_P. The full expression for the N-input, M-output-restricted case, model IV, of the drift function, equation 3.21, is given by equation B.29. This expression essentially produces effects on the drift function and fixed-point weights that are a combination of those discussed in relation to the input- and output-restricted models above: the pre- and postsynaptic suppressive interaction functions will modify both the STDP fixed point of the weight and the rate at which the learning dynamics drives the weight to the fixed point. Consequently, suppressive interspike interactions will have an effect on the learning dynamics of STDP (i.e., a change of the corresponding drift function), but they cause a modification of the fixed-point weights only when the STDP is either input or output restricted. The expressions given above for the modified drift functions, equations 4.3, 4.6, 4.7, and B.29, are given in terms of arbitrary suppressive interaction functions, ε^pre(Δt), ε^post(Δt), and there is clearly a need for more experimental data on the form of these functions.

4.2 Activity-Dependent Synaptic Scaling (ADSS) and Metaplasticity. The results presented here for STDP have shown the relationship of input-selective spike-based learning with conventional Hebbian rate-based learning.
However, a complete understanding of the role of STDP in learning requires consideration of issues of metaplasticity (Abraham & Bear, 1996; Abbott & Nelson, 2000), that is, how synaptic plasticity is modulated and controlled by effects such as prior synaptic activity, membrane voltage (Fusi, 2003), and calcium concentration (Shouval, Castellani, Blais, Yeung, & Cooper, 2002). There are two main issues that need to be addressed in relation to the broader question of the role of STDP in specific learning schemes. First, our paradigm of input-selective spike-based learning relies on the existence of a stable fixed point of the background spiking rate (i.e., spontaneous activity) against which the input-selective weights are potentiated. It is clear that STDP alone cannot generate such a state, since a nonspiking network state would never evolve into a spiking state. Consequently, some form of activity-dependent scaling of the weights is necessary, and we present below a unified framework of STDP and ADSS. Second, once a set of input-selective foreground weights has been generated, there must be some metaplasticity mechanism to ensure that they remain stable and do not return to their unpotentiated (background) value when the selective input is turned off. While the full investigation of such issues is outside the scope of this investigation, some remarks concerning ADSS and its relationship to STDP are appropriate here, particularly since ADSS introduces competition between synapses. Activity-dependent scaling of the synaptic weights (i.e., dependent on the long-time-average output spiking rate of the postsynaptic neurons) has been experimentally observed in neocortical (Turrigiano et al., 1998), hippocampal (Lissin et al., 1998), and spinal cord neurons (O'Brien et al., 1998), and occurs on a much slower timescale than STDP. Recent experimental data suggest that there are two independent signals that regulate ADSS in cortical pyramidal neurons: (1) low levels of brain-derived neurotrophic factor (BDNF) cause excitatory synapses to scale up in strength (Desai, Rutherford, & Turrigiano, 1999), and (2) sustained depolarization of a cortical neuron over an extended time causes excitatory synapses to scale down in strength (Leslie, Nelson, & Turrigiano, 2001). The first BDNF-mediated mechanism appears to be responsible for increasing synaptic strength when the output spiking rate is low (but there is no evidence that high levels of BDNF reduce synaptic weights in cortical neurons). The second depolarization-mediated mechanism is responsible for reducing the synaptic strength of neurons with high output spiking rates, since the output spiking rate is strongly dependent on the mean membrane depolarization. Together, these mechanisms bring about a global multiplicative rescaling of the weights; all weights are scaled downward by the same multiplicative factor (Turrigiano et al., 1998). Both of these ADSS mechanisms appear to function on a long timescale (from a couple of hours up to two days) that is orders of magnitude slower than the STDP mechanism involving presynaptic and postsynaptic spikes considered here.
Nevertheless, it is possible to incorporate ADSS and STDP into a single framework by including terms in the weight update, equation 2.1, that are independent of the presynaptic and postsynaptic activity (α_0(w), corresponding to the BDNF mechanism), depend only on postsynaptic spikes (α_post(w), corresponding to the sustained depolarization mechanism outlined above), and depend only on presynaptic spikes (α_pre(w)). The generalization of equation 2.4 for the time evolution of the mean weight is (Kempter et al., 1999; van Hemmen, 2001)

\frac{d\bar{w}(t)}{dt} = \alpha_0(\bar{w}) + \alpha_{\mathrm{post}}(\bar{w})\, \lambda_{\mathrm{out}} + \alpha_{\mathrm{pre}}(\bar{w})\, \lambda_{\mathrm{in}} + A_{\mathrm{pre,post}}(\bar{w}), \qquad (4.8)

where A_pre,post(w̄) is the bilinear (Hebbian) term considered in this article. In such an expanded STDP model, the multiplicative ADSS scaling equation (van Rossum et al., 2000) is equivalent to the following form for the coefficients:

\alpha_0(w) = \beta\, w\, \lambda_B, \qquad \alpha_{\mathrm{post}}(w) = -\beta\, w, \qquad \alpha_{\mathrm{pre}}(w) = 0, \qquad (4.9)
where β > 0 is a constant determining the strength and timescale of the synaptic scaling, and λ_B is the (background) spontaneous spiking rate (i.e.,
the desired postsynaptic activity). In the absence of structured synaptic input (i.e., A_pre,post(w) = 0), this has the effect of scaling the synaptic weights in such a way that the output spiking rate evolves toward λ_B. There are two important effects of ADSS. First, the joint effect of the above two mechanisms (corresponding to α_0(w) and α_post(w) above) is to establish a homeostasis, whereby networks of neurons adjust their synaptic weights up or down as a function of the long-time average of their spiking activity. This maintains the neurons in a state that allows them to be responsive to the bilinear terms in STDP that contain information about structure in the synaptic input (Song et al., 2000). The second important function of ADSS is that it introduces a competitive mechanism between the synaptic weights (van Rossum et al., 2000). Competition has been extensively studied both in rate-based Hebbian learning (Miller & MacKay, 1994) and in additive STDP (Kistler & van Hemmen, 2000; Song et al., 2000; Kempter et al., 2001; Song & Abbott, 2001), where it arises naturally as a result of STDP. However, in stable fixed-point models of STDP, equation 2.1, and with equal STDP time constants (τ_P = τ_D), there is little or no competition between synapses (van Rossum et al., 2000): spike-time-correlated inputs give rise to larger weights but do not depress the weights of the uncorrelated inputs, as discussed above. ADSS acts as a competitive mechanism by rescaling all weights to lower values, thereby effectively depressing the weights of the potentiated inputs relative to those of the unpotentiated inputs. This same mechanism will function in the cases studied here where the STDP time constants are different and the foreground weights are potentiated relative to the background weights (i.e., models II and IV).
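The homeostatic effect of equations 4.8 and 4.9 can be sketched with a toy model. This is an illustration only: the Hebbian term A_pre,post is set to zero, and the output rate is assumed to be simply proportional to the mean weight, λ_out = k·w̄, an assumption made here for simplicity rather than the paper's integrate-and-fire rate.

```python
beta = 0.5        # scaling strength (illustrative)
lam_B = 5.0       # target background rate (Hz)
k = 10000.0       # toy gain: lam_out = k * w (assumed linear rate model)

w = 0.0001        # initial mean weight -> lam_out = 1 Hz, below target
dt = 0.01
for _ in range(100_000):
    lam_out = k * w
    # Equations 4.8-4.9 with A_pre,post = 0 and alpha_pre = 0:
    # dw/dt = beta*w*lam_B - beta*w*lam_out = beta*w*(lam_B - lam_out)
    w += dt * beta * w * (lam_B - lam_out)

print(round(k * w, 3))   # the output rate has relaxed toward lam_B
```

The two ADSS terms together drive the output rate to the background rate λ_B regardless of the initial weight, which is the homeostasis described in the text.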
Consequently, although there is no explicit competition between synapses in the stable fixed-point STDP investigated here, ADSS nevertheless provides such a competitive mechanism. Because ADSS occurs on a much slower timescale than STDP, it has not been included in our study, but its effect is simply to rescale the weights by a multiplicative factor after the changes to the weights that are generated by STDP.

4.3 Width of the Weight Distribution. A feature of spike-based neural plasticity that is not present in rate-based models is a description of the width of the weight distribution. The Fokker-Planck analysis, which is based on the stochastic nature of the weight increments that result from the input and output spike times, enables the calculation of the complete weight distribution, equation 2.13. The calculation of the drift function A(w) is straightforward, since it is only necessary to sum every individual contribution to the change in synaptic weight. However, the calculation of the diffusion function, B(w), requires that we identify those components of potentiation and depression that are dependent and those that are independent. (Previous studies using the Fokker-Planck formalism have typically obtained an approximation for the diffusion function, B(w), by assuming that each plasticity event is independent and that the output APs are independent; van Rossum et al., 2000.)
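Equation 2.13 itself lies outside this excerpt, but for a one-dimensional Fokker-Planck equation with drift A(w) and diffusion B(w), the stationary density has the standard form p(w) ∝ B(w)^{-1} exp[2∫^w A(u)/B(u) du], which can be evaluated numerically. A hedged sketch, assuming the van Rossum et al. forms (α_P(w) = c_P, α_D(w) = c_D w) and an independent-events diffusion of the equation 4.10 type, with parameter values quoted in the text:

```python
import numpy as np

c_P, c_D = 0.0000315, 0.01           # plasticity amplitudes (section 4.3 values)
tau_P, tau_D = 0.017, 0.034          # STDP time constants (s)
lam_out, lam_X = 35.2, 5.0           # output rate, input rate (Hz)

def A(w):  # drift: lam_out*lam_X*[tau_P*alpha_P(w) - tau_D*alpha_D(w)]
    return lam_out * lam_X * (tau_P * c_P - tau_D * c_D * w)

def B(w):  # naive diffusion (independent plasticity events, equation 4.10 form)
    return lam_out * lam_X * (c_P**2 / (lam_X + 2.0 / tau_P)
                              + (c_D * w)**2 / (lam_X + 2.0 / tau_D))

w = np.linspace(1e-6, 0.004, 4000)
phi = 2.0 * A(w) / B(w)              # integrand of the exponent
# cumulative trapezoid integral of phi over w:
pot = np.concatenate(([0.0], np.cumsum(0.5 * (phi[1:] + phi[:-1]) * np.diff(w))))
p = np.exp(pot - pot.max()) / B(w)   # unnormalized stationary density
p /= p.sum() * (w[1] - w[0])         # normalize

w_star = c_P * tau_P / (c_D * tau_D) # deterministic fixed point, A(w*) = 0
w_mode = w[np.argmax(p)]
print(round(w_star, 6), round(w_mode, 6))  # the mode sits at the drift fixed point
```

Under these assumptions the distribution is sharply peaked at the drift fixed point, with a width set by the balance between the restoring drift and the diffusion.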
A full calculation of the diffusion function, B(w), is beyond the scope of this article for model I (in which there is no restriction on the time extent of the input-output interactions), since the changes in weight caused by successive inputs or outputs, or both, are correlated. However, it is possible to calculate the diffusion function, B(w), for particular cases of the other models, which have a restricted time extent of the input-output interaction: for model II with N = 1 and 1/2, the contributions from the ISIs of the synaptic inputs are independent (they have a Poisson distribution and the EPSP-AP interactions are all within one ISI), whereas for model III with M = 1, the contributions from the ISIs of the output spikes are independent, and for model IV with M, N = 1, the contributions from both the input and output ISIs can be treated as independent. The width of the weight distributions can be evaluated numerically directly from the Fokker-Planck distribution, equation 2.13, or, in the case of van Rossum et al. plasticity, it can be obtained from the self-consistency relation, equation 2.16. An illustration of the resulting widths of the weight distributions is given in Figure 10 for model IV with M, N = 1 and van Rossum et al. plasticity, which shows a plot of σ_F versus λ_F in the case R_τ = 2.0 (other parameters are the same as in Figure 9). The solid line is the analytic solution given by equation 2.16, with values for the coefficients given by equations 3.18 and B.28, and the dashed line is the analytic solution obtained by taking the Poisson limit for the output spike distribution, equation 3.12. The solid symbols represent the results of numerical simulations. The value of σ_F depends on the output spiking rate, λ_out, which varies from 5 Hz (at λ_F = 5 Hz) to 147 Hz (at λ_F = 100 Hz).
Notice that the Poisson distribution results at low output spiking rates (i.e., low values of λ_F) converge with the results obtained using the first passage time of the leaky integrate-and-fire model with synaptic conductances. This is unsurprising, since at low output spiking rates, we expect the output spiking distribution to be Poisson-like. There are three principal effects contained in the calculation of B(w) in equation B.12: (1) there is a nonzero correlation between the potentiation and depression contributions (corresponding to the terms R_3, R_4, R_5 in equation B.13 and to the nonzero linear coefficient B_1 for van Rossum et al. plasticity), (2) there are terms that represent the correlations within the potentiation and depression contributions, respectively (corresponding to the terms R_1, R_2 in equation B.13 and to the additional terms in the coefficients B_0 and B_2 in van Rossum et al. plasticity), and (3) B(w) depends on the first passage time density f_θ(t) (through its Laplace transform f_{θL}(s)), and consequently the coefficients reflect the regularity of the output APs. This third effect becomes less important at low output spiking rates, where we find that approximating the output spiking rate by a Poisson distribution (with the same spiking rate) gives good agreement with numerical simulations, as can be seen for model IV in Figure 10 by the convergence of the plots as λ_F → λ_B = 5 Hz. Representative distributions for the background and foreground weights are illustrated in Figure 11 for model II with N = 1 and van Rossum et al.
Figure 10: Model IV with N = M = 1 and van Rossum et al. plasticity. The standard deviation σ_F plotted as a function of the spiking rate of the foreground inputs λ_F. Parameter values are as for Figure 9. The solid line corresponds to the analytic results of equations 2.16 and B.28, and the dashed line corresponds to the values obtained using a Poisson distribution of output spikes with the same output spiking rate. Solid symbols represent the results for the average weight of the foreground synapses obtained by numerical simulation after 10,000 APs.
plasticity, in which the histograms give the two distributions obtained from numerical simulation. The parameters used are N_B = 1500, N_I = 400, N_F = 100, λ_B = 5 Hz, λ_F = 50 Hz, g_I = 0.002, c_P = 0.0000315, c_D = 0.01, τ_P = 17 ms, and τ_D = 34 ms. These parameter values result in an output spiking rate of λ_out = 35.2 Hz. The histogram on the left (unshaded) represents the background synapses, and the histogram on the right (black) represents the foreground synapses that have a higher input rate. The distributions obtained from the Fokker-Planck equation, equation 2.13, with the analytic expressions for the coefficients A_i, B_i, equations 3.5 and B.13, are given by the solid lines, which provide a good fit to the numerical simulation data. The dot-dashed curve (plotted here only for the foreground weights), which has the largest variance of the three curves plotted for the foreground weight distribution, is the "naive" Fokker-Planck distribution obtained by assuming that each individual contribution to the change in weight is independent (i.e., neglecting correlations between the potentiation and depression components, which is equivalent to neglecting the coefficient B_1 in van Rossum et al. plasticity). The corresponding naive diffusion function B(w) is given by

B(w) = \lambda_{\mathrm{out}} \lambda_X \left[ \frac{\alpha_P^2(w)}{\lambda_X + 2/\tau_P} + \frac{\alpha_D^2(w)}{\lambda_X + 2/\tau_D} \right]. \qquad (4.10)

Figure 11: Distribution of weights for model II with N = 1 and van Rossum et al. plasticity. Histograms show results of numerical simulation (averaged over 10 independent weight configurations) of the distributions of the excitatory conductances g_E (weights) for the N_B = 1500 background synapses (larger unshaded peak on left) and N_F = 100 foreground synapses (smaller shaded peak on right). Input rates for background and foreground synapses are λ_B = 5 Hz and λ_F = 25 Hz. The solid lines are the (scaled) Fokker-Planck probability distribution, equation 2.13, with A(w) and B(w) given by equations 3.5 and B.13. The dashed lines (mostly obscured for the foreground synapses) are the Fokker-Planck distributions when the output spike distribution is approximated by a Poisson distribution with the same output spiking rate of 35.2 Hz. The dot-dashed line (only for the foreground weights) is the naive Fokker-Planck distribution, equation 4.10. Other parameter values are as given in section 2.2 and the text.
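Combined with a linear drift, a diffusion of the equation 4.10 form implies an Ornstein-Uhlenbeck-like width, σ² ≈ B(w*)/(2|A′(w*)|), for the stationary distribution. A hedged sketch (the van Rossum et al. forms α_P(w) = c_P and α_D(w) = c_D w are assumed, with the parameter values quoted in the text; this heuristic is only qualitatively analogous to the self-consistency relation, equation 2.16):

```python
import math

c_P, c_D = 0.0000315, 0.01          # plasticity amplitudes (values from the text)
tau_P, tau_D = 0.017, 0.034         # STDP time constants (s)
lam_out, lam_X = 35.2, 5.0          # output and input spiking rates (Hz)

w_star = c_P * tau_P / (c_D * tau_D)     # fixed point of the linear drift
dA_dw = -lam_out * lam_X * tau_D * c_D   # constant (negative) drift slope

# Naive diffusion of equation 4.10, evaluated at the fixed point:
B_star = lam_out * lam_X * (c_P**2 / (lam_X + 2.0 / tau_P)
                            + (c_D * w_star)**2 / (lam_X + 2.0 / tau_D))

sigma = math.sqrt(B_star / (2.0 * abs(dA_dw)))
print(w_star, sigma)
# sigma comes out on the 1e-4 scale of Figure 10; because equation 4.10
# neglects the correlation terms (the B_1 coefficient), this naive width is
# an overestimate, as the dot-dashed curve in Figure 11 illustrates.
```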
The effect of neglecting these correlation terms is that the distribution of weights is broader than that found in numerical simulations, as illustrated for model II in the dot-dashed line of Figure 11. The dashed lines on the plot (mostly obscured for the foreground synapses and with the intermediate
variance of the three curves representing the background synapses) are the distributions obtained by approximating the first passage time density of the output spikes by a Poisson distribution, equation 3.12. By comparing the actual weight distribution with that obtained using a Poisson distribution for the output APs, it is possible to gain some insight into the effect of the spiking dynamics. The spiking dynamics is contained in equation B.13 (the solid lines in Figure 11) but is neglected when a Poisson distribution for the output APs is used (dashed lines). Consequently, because the integrate-and-fire spiking dynamics causes the output spikes to be considerably more regular (i.e., smaller coefficient of variation of the output spike distribution) than the equivalent Poisson distribution, there is a resulting difference between the distributions, observed here in the solid and dashed lines in Figure 11.

4.4 Discussion of Assumptions and Approximations. A number of assumptions and approximations are made in this study. First, we have chosen a model of STDP whose weight dynamics is determined by a stable fixed point. An example of such a model is that investigated by van Rossum et al. (2000), with additive potentiation and multiplicative depression. While there is some experimental support for such weight-dependent STDP (Bi & Poo, 1998; van Rossum et al., 2000), this is still a matter of experimental investigation, and it may be present only in a particular class of neurons or neural systems. The results are given here for general weight-dependent potentiation and depression functions α_D(w), α_P(w), and the stability condition for the fixed point is given in terms of these functions (see equation 2.8 and the following text). It is clearly of considerable interest to determine more accurately the form of these functions from experimental studies.
Second, the STDP time window F(Δt) used in this study, equation 2.2, is modeled as two exponentials, with independent time constants for the potentiation and depression components. Such a model is both consistent with the experimental data (Bi & Poo, 1998) and relatively straightforward to handle mathematically. However, it is possible that different types of neurons have different shapes of STDP window, as proposed by Bi and Poo (2001) and examined in the case of additive STDP by Câteau and Fukai (2003). The effect of different STDP time windows will be to change the numerical value of the right side of equation 2.7. Consequently, the details of the STDP time window will affect the actual value of the stable fixed-point mean weights, but will not materially affect the conclusions of this study. Note that an important feature of STDP as formulated in this and other studies is that the weight-dependent and time-dependent components (corresponding to α(w) and F(Δt) in equation 2.1) are independent. Third, a Poisson distribution for the input spike trains is used. Although this is unlikely to be critical for the results presented here, the methods of analysis may not necessarily be applicable to networks, since the input spike-timing distribution for a neuron in a network could differ significantly from the Poisson distribution. Consequently, while our results show that an
input-selective stable fixed point of the learning dynamics exists, it remains for future studies to elucidate the extent to which such STDP fixed points determine the organization of the synaptic connectivity in recurrent networks under the combined influence of the network dynamics and learning dynamics. It should also be noted that studies of LTP and LTD typically use sequences of spikes that are highly temporally structured, in contrast to the Poisson distribution of synaptic inputs studied here. Fourth, learning of the inhibitory conductances (Holmgren & Zilberter, 2001) is not included, since the timing dependence of the synaptic modification of inhibitory synapses is not yet well established. However, it is straightforward to extend the formalism presented here to include inhibitory synapses. In addition, the spike-triggering effect is not included in the analysis here. This effect is small in the situation where the neuron receives a large number of small-amplitude synaptic inputs that have no spike-time correlation, and it will not have a significant effect on the results presented here. Likewise, we have considered only the case in which the input spikes do not have any spike-time correlations. When the weight dynamics is not determined by a stable fixed point, such spike correlations can cause the spike-triggering effect to be greatly amplified by the instability (Gerstner, 2001; Kempter et al., 2001). The effect of such correlations on the models studied here is the subject of a separate report. Also, the expressions have been derived in the diffusion approximation, which neglects terms of higher order in the weight step amplitude (i.e., the Fokker-Planck equation used here neglects higher-order terms in the Kramers-Moyal expansion), and this gives rise to the slight difference observed between the analytical and numerical results.
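The regularity claim made above (integrate-and-fire output spikes are considerably more regular than a Poisson train of the same rate) can be checked with a toy simulation. This is hedged: a current-based leaky integrate-and-fire neuron with illustrative parameters stands in for the paper's conductance-based model.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy leaky integrate-and-fire neuron driven by summed Poisson input
# (all parameters illustrative; the paper's conductance-based model is richer).
dt, tau_m = 0.001, 0.02          # time step (s), membrane time constant (s)
v_th, v_reset = 1.0, 0.0
# EPSP amplitude 0.01 times Poisson input spike counts per step (mean 8):
drive = 0.01 * rng.poisson(8.0, size=200_000)

v, last, isis, t = 0.0, None, [], 0.0
for d in drive:
    t += dt
    v += -v * (dt / tau_m) + d   # leak plus stochastic input
    if v >= v_th:                # output spike: record ISI and reset
        if last is not None:
            isis.append(t - last)
        last, v = t, v_reset

isis = np.array(isis)
cv = isis.std() / isis.mean()    # coefficient of variation of output ISIs
print(round(cv, 2))              # well below 1, the CV of a Poisson train
```

In this suprathreshold regime the output CV is a fraction of the Poisson value of 1, which is why replacing the first passage time density by a Poisson approximation changes the predicted weight distribution.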
5 Conclusions

In this study, we provide an analysis of spike-timing-based models of neural plasticity that examines their relationship with classical studies of LTP and LTD and with rate-based Hebbian models of learning whose weight dynamics is determined by a stable fixed point. The analysis is carried out using a leaky integrate-and-fire neuronal model with synaptic conductances. The evolution of the synaptic weights is based on the Langevin equation, which provides a description of the mean weights (Kempter et al., 1999) and is closely related to the Fokker-Planck approach (Kempter et al., 1999; van Rossum et al., 2000). We have examined the effect of two biologically important refinements of the timing dependence of synaptic plasticity. First, we analyze the effect of limiting the time extent of the EPSP-AP interactions, whose importance was highlighted in recent experimental studies (Sjöström et al., 2001). Second, we analyze the effect of time asymmetry of the STDP time window, in which the time constant associated with synaptic depression, τ_D, is larger than that associated with potentiation, τ_P (Bi & Poo, 1998).
One motivation for these studies was earlier results on a model of STDP with a stable fixed point of the learning dynamics (van Rossum et al., 2000) that failed to show input rate selectivity; that is, it was found that higher-rate uncorrelated synaptic inputs are not selectively potentiated relative to lower-rate inputs. The results here indicate that for the models whose weight dynamics is determined by a stable fixed point, there is no selective potentiation of synapses that have higher rates of uncorrelated input if the STDP time constants for potentiation and depression are equal. Selective potentiation of synapses with higher input rates is possible only for models in which the time extent of the EPSP-AP interactions is input restricted and the time constant for depression is larger than for potentiation (τ_D > τ_P). The effect of suppressive interspike interactions is analyzed for each of the input-output restricted models, and the results show that such interactions can modify the learning dynamics, but that the values of the STDP fixed-point weights are modified only for input- or output-restricted STDP models. The corresponding drift functions for all models studied here are given for arbitrary forms of the presynaptic and postsynaptic suppression functions. For particular cases of the three classes of models examined here involving some form of time-restricted input-output interaction (models II, III, and IV), we are able to calculate the full distribution of the excitatory conductances (i.e., the function B(w)) by identifying the independent components of potentiation and depression. The width of the distribution was found to depend on both the first passage time density of the output APs (B(w) is a function of f_θ(t)) and the correlation between potentiation and depression contributions (e.g., a nonzero value of B_1 in the case of van Rossum et al. plasticity).
In order to gain an understanding of the role of STDP in the wider context of learning and network activity, we have investigated how it may be related to Hebbian rate-based models of learning with stable fixed points of the weight dynamics that are input selective; that is, the (foreground) synapses associated with higher-rate inputs become potentiated relative to the remaining (background) synapses, whose only input is spontaneous activity. The input-restricted models of STDP are the only ones found to have the input-selective stable fixed-point behavior. A more comprehensive description of models of STDP whose weight dynamics is determined by a stable fixed point would require the analysis of correlated synaptic input, the spike-triggering effect, and multiple patterns. These issues, together with a more comprehensive analysis of suppressive interspike interactions and an examination of the relationship between STDP and models of working memory, are the subject of a separate investigation. It would also be of interest to model the evolution of inhibitory weights, based on the experimental results for their plasticity (Rutherford, Nelson, & Turrigiano, 1998; Holmgren & Zilberter, 2001). In addition to the stable fixed points of the weight evolution equation, equation 2.7, examined here, a more thorough study of the dynamics of weight evolution could be carried out, examining the trajectories of the weights and the time constant of
A. Burkitt, H. Meffin, and D. Grayden
approach to their stable fixed-point values. These and other questions are currently under investigation.

This study highlights the importance of experimental data that will provide accurate quantitative measures of a number of the key variables in the timing dependence of synaptic modification. (1) The results here have been formulated for arbitrary weight-dependent potentiation and depression functions, α_P(w), α_D(w), and the question of whether the weight dynamics is determined by a stable or unstable fixed point depends on the weight dependence of these functions, which could vary across different neurons or neural systems. (2) The implications of various restrictions in the time extent of the EPSP-AP interaction have been explored here, and our results show that such restrictions have significant implications for the learning behavior. (3) The analysis of the suppressive interspike interactions has been formulated in a fashion that allows it to be applied to any functional form of the presynaptic or postsynaptic suppression (ε^pre, ε^post, respectively), which require accurate experimental determination. In addition, experimental data could shed light on interesting questions concerning the weight dynamics, such as the timescale required for a weight distribution to reach its fixed point.

Appendix A: Self-Consistent Calculation of Variance

In the situation where the drift function, A(w), is linear and the diffusion function, B(w), is quadratic, it is possible to find the variance of the distribution of the weights in the following self-consistent way. Consider the Langevin equation 2.3 and use the following properties of the gaussian noise,

$$E\{\xi(\Delta t)\} = 0, \qquad E\{\xi^2(\Delta t)\} = \Delta t. \tag{A.1}$$

The time-dependent functions Υ_W(t; w_0) and Γ_W(t; w_0) are the following expectation values of the mean and variance:

$$\Upsilon_W(t; w_0) = E\{w(t);\, w(0) = w_0\}$$
$$\Gamma_W(t; w_0) = E\{w^2(t);\, w(0) = w_0\} - \left[E\{w(t);\, w(0) = w_0\}\right]^2.$$

The functions Υ_W(t; w_0) and Γ_W(t; w_0) are evaluated in a self-consistent way by considering how they change from time t to t + Δt (and neglecting terms of higher order in Δt):

$$E\{w(t+\Delta t);\, w(0) = w_0\} = \int_{-\infty}^{\infty} dw\, p_W(w, t \mid w_0, 0)\, [w + A(w)\Delta t]$$
$$E\{w^2(t+\Delta t);\, w(0) = w_0\} = \int_{-\infty}^{\infty} dw\, p_W(w, t \mid w_0, 0)\, [w^2 + 2wA(w)\Delta t + B(w)\Delta t].$$
Spike-Timing-Dependent Plasticity
It is possible to evaluate these expressions when A(w) ≡ A_0 − A_1 w is linear and B(w) ≡ B_0 − B_1 w + B_2 w² is quadratic, as for van Rossum et al. plasticity. The expressions for Υ_W(t; w_0) and Γ_W(t; w_0) are obtained by evaluating the resulting gaussian integrals in the above expressions and taking the limit Δt → 0:

$$\frac{d\Upsilon_W(t; w_0)}{dt} = A_0 - A_1 \Upsilon_W(t; w_0)$$
$$\frac{d\Gamma_W(t; w_0)}{dt} = (B_2 - 2A_1)\,\Gamma_W(t; w_0) + B_0 - B_1 \Upsilon_W(t; w_0) + B_2 \Upsilon_W^2(t; w_0).$$

The solutions are straightforwardly found using Laplace transforms,

$$\Upsilon_W(t; w_0) = \frac{A_0}{A_1}\left(1 - e^{-A_1 t}\right)$$
$$\Gamma_W(t; w_0) = \sigma_W^2\left[1 - e^{-(2A_1 - B_2)t}\right] + \Gamma_1\left[e^{-A_1 t} - e^{-(2A_1 - B_2)t}\right] + \Gamma_2\left[e^{-2A_1 t} - e^{-(2A_1 - B_2)t}\right],$$

where

$$\sigma_W^2 = \frac{B_0 - B_1 A_0/A_1 + B_2 A_0^2/A_1^2}{2A_1 - B_2}, \qquad \Gamma_1 = \frac{B_1 A_0/A_1 - 2B_2 A_0^2/A_1^2}{A_1 - B_2}, \qquad \Gamma_2 = -\frac{A_0^2}{A_1^2}.$$
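As a numerical sanity check, the stationary variance σ_W² can be compared against a direct Euler-Maruyama integration of the Langevin dynamics with linear drift A(w) = A₀ − A₁w and quadratic diffusion B(w) = B₀ − B₁w + B₂w². The coefficient values below are illustrative assumptions (chosen so that B(w) > 0 everywhere), not parameters from the article; a minimal sketch:

```python
import math
import random

random.seed(0)

# Illustrative coefficients for A(w) = A0 - A1*w and B(w) = B0 - B1*w + B2*w^2.
A0, A1 = 1.0, 1.0
B0, B1, B2 = 0.5, 0.0, 0.1   # B(w) = 0.5 + 0.1*w^2 > 0 everywhere

dt, steps, ntraj = 0.005, 800, 2000
finals = []
for _ in range(ntraj):
    w = A0 / A1  # start at the fixed point of the drift
    for _ in range(steps):
        drift = A0 - A1 * w
        diff = B0 - B1 * w + B2 * w * w
        w += drift * dt + math.sqrt(diff * dt) * random.gauss(0.0, 1.0)
    finals.append(w)

mean = sum(finals) / ntraj
var = sum((x - mean) ** 2 for x in finals) / ntraj
# Analytic stationary variance sigma_W^2 from the self-consistent calculation:
var_exact = (B0 - B1 * A0 / A1 + B2 * (A0 / A1) ** 2) / (2 * A1 - B2)
```

With these coefficients var_exact = 0.6/1.9 ≈ 0.316, and the sampled variance should agree to within Monte Carlo and discretization error.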
The variance of the distribution of weights as t → ∞ is given by σ_W². Note that 2A_1 > B_2 for all biologically plausible parameters. This expression for σ_W² is positive definite for all parameters investigated.

Appendix B: Calculation of A(w) and B(w)

B.1 Model I. The contribution to the change in synaptic weight due to a single output spike (AP) is calculated by integrating over the distribution of input times according to

$$\iint\gamma = \lambda_{out} \int_0^{\infty} dT_{P1}\, p_{in}(T_{P1}) \int_{T_{P1}}^{\infty} dT_{P2}\, p_{in}(T_{P2} - T_{P1}) \cdots \int_{T_{P(k-1)}}^{\infty} dT_{Pk}\, p_{in}(T_{Pk} - T_{P(k-1)}) \cdots$$
$$\times \int_0^{\infty} dT_{D1}\, p_{in}(T_{D1}) \int_{T_{D1}}^{\infty} dT_{D2}\, p_{in}(T_{D2} - T_{D1}) \cdots \int_{T_{D(k-1)}}^{\infty} dT_{Dk}\, p_{in}(T_{Dk} - T_{D(k-1)}) \cdots, \tag{B.1}$$

where the output spike is taken as occurring at time t = 0 and T_{Pk} (T_{Dk}) is the arrival time of the kth input EPSP before (following) the spike. The input EPSPs are taken to be Poisson distributed so that their probability density (and Laplace transform, denoted by the subscript L) is

$$p_{in}(t) = \lambda_{in} \exp(-\lambda_{in} t), \qquad p_{in,L}(s) = \frac{\lambda_{in}}{s + \lambda_{in}}. \tag{B.2}$$

The expression for A(w) is given by

$$A(w) \equiv \lambda_{in} \iint\gamma \left[ \alpha_P(w) \sum_{n=1}^{\infty} e^{-T_{Pn}/\tau_P} - \alpha_D(w) \sum_{n=1}^{\infty} e^{-T_{Dn}/\tau_D} \right]. \tag{B.3}$$
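The Laplace transform in equation B.2 is easy to verify numerically: integrating e^{−st} p_in(t) with the trapezoidal rule should reproduce λ_in/(s + λ_in). The rate and Laplace variable below are arbitrary illustrative choices:

```python
import math

lam, s = 20.0, 7.0      # illustrative input rate and Laplace variable
dt, T = 1e-4, 2.0       # step size and cutoff; exp(-(s+lam)*T) is negligible

total = 0.0
n = int(T / dt)
for i in range(n):
    t0, t1 = i * dt, (i + 1) * dt
    f0 = lam * math.exp(-(s + lam) * t0)   # exp(-s*t) * p_in(t)
    f1 = lam * math.exp(-(s + lam) * t1)
    total += 0.5 * (f0 + f1) * dt          # trapezoidal rule

exact = lam / (s + lam)                    # p_in,L(s) from equation B.2
```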
The resulting expression is given in equation 3.1. The above calculation has been carried out from the perspective of a single output spike. It is equally possible to perform the calculation from the perspective of a single input EPSP, and it is straightforward to show that the resulting expression for A(w) is identical. The calculation for B(w), however, is much more problematic, since the contributions arising from different APs (or equivalently, from individual EPSPs) are not independent. This can also be seen by calculating the naive expressions obtained from the perspective of an input EPSP and from an output spike, which are found to differ. The full calculation would involve the simultaneous averaging over the time distributions of both the input EPSPs and the output spikes.

B.2 Model II. The contribution to the change in synaptic weight due to a single incoming EPSP is calculated by considering a single ISI of the input EPSPs. Since the input EPSPs are Poisson distributed and the APs interact only with the EPSPs over one input ISI, the contribution from each input ISI can be considered and summed separately. We define the following (normalized) integral over the distribution of input and output times,

$$\iint\gamma \equiv \int_0^{\infty} dT^{in}\, p_{in}(T^{in}) \int_0^{\infty} dT_1\, p_1(T_1) \int_{T_1}^{\infty} dT_2\, f_\theta(T_2 - T_1) \cdots \int_{T_{k-1}}^{\infty} dT_k\, f_\theta(T_k - T_{k-1}) \cdots, \tag{B.4}$$

where the previous input is taken to occur at time t = 0, T^{in} is the arrival time of the subsequent input (which we take to be Poisson distributed), and T_k is the time of the kth output spike following the previous input EPSP. Note that this integral does not force the APs to lie within an input ISI, but this bound will be imposed in the definition of A(w) and B(w) below. The expression for p_{in}(t) is given in equation B.2. f_θ(t) is the first passage time density of the output spikes, whose Laplace transform is given by equation 2.20, and p_1(t) is the probability density of the first spike following the input EPSP at t = 0,

$$p_1(t) = \lambda_{out} \int_t^{\infty} dt'\, f_\theta(t'), \qquad p_{1L}(s) = \frac{\lambda_{out}\left[1 - f_{\theta L}(s)\right]}{s}. \tag{B.5}$$

The expressions for the drift function, A(w), and diffusion function, B(w), are obtained using the Fokker-Planck approach, equation 2.5 (van Kampen, 1992; Risken, 1996),

$$A(w) \equiv \lambda_{in} \iint\gamma \left[ \alpha_P(w) \sum_{n=1}^{\infty} e^{-T_n/\tau_P} H(T^{in} - T_n) - \alpha_D(w) \sum_{n=1}^{\infty} e^{-(T^{in} - T_n)/\tau_D} H(T^{in} - T_n) \right] \tag{B.6}$$
$$B(w) \equiv \lambda_{in} \iint\gamma \left[ \alpha_P(w) \sum_{n=1}^{\infty} e^{-T_n/\tau_P} H(T^{in} - T_n) - \alpha_D(w) \sum_{n=1}^{\infty} e^{-(T^{in} - T_n)/\tau_D} H(T^{in} - T_n) \right]^2.$$

Each integral represents a single ISI of the input EPSPs, and the factor λ_in is the rate of the input ISIs. The Heaviside step functions H(t) ensure that only the contributions from APs within the ISI are included in the respective potentiation and depression sums. The contributions are calculated using equation B.4: S_{P,n} and S_{D,n} are the contributions to the potentiation and depression, respectively, due to the nth spike,

$$S_{Pn}(\tau_P) \equiv \iint\gamma\, e^{-T_n/\tau_P} H(T^{in} - T_n) = \frac{\lambda_{out}\left[1 - f_{\theta L}(\lambda_{in} + \tfrac{1}{\tau_P})\right] f_{\theta L}^{\,n-1}(\lambda_{in} + \tfrac{1}{\tau_P})}{\lambda_{in} + \tfrac{1}{\tau_P}} \tag{B.7}$$
$$S_{Dn}(\tau_D) \equiv \iint\gamma\, e^{-(T^{in} - T_n)/\tau_D} H(T^{in} - T_n) = \frac{\lambda_{out}\left[1 - f_{\theta L}(\lambda_{in})\right] f_{\theta L}^{\,n-1}(\lambda_{in})}{\lambda_{in} + \tfrac{1}{\tau_D}}.$$

Summing over all output spikes gives

$$S_P(\tau_P) = \sum_{n=1}^{\infty} S_{Pn}(\tau_P) = \frac{\lambda_{out}}{\lambda_{in} + \tfrac{1}{\tau_P}}, \qquad S_D(\tau_D) = \sum_{n=1}^{\infty} S_{Dn}(\tau_D) = \frac{\lambda_{out}}{\lambda_{in} + \tfrac{1}{\tau_D}}. \tag{B.8}$$

The drift function, A(w), defined in equation 2.5, is given by

$$A(w) = \lambda_{in}\left[\alpha_P(w)\, S_P(\tau_P) - \alpha_D(w)\, S_D(\tau_D)\right]. \tag{B.9}$$
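Equation B.8 gives S_P = λ_out/(λ_in + 1/τ_P) independently of the detailed form of f_θ, so it can be spot-checked by Monte Carlo with the simplest choice of output statistics: a Poisson output train of rate λ_out, for which both p_1 and the output ISI density are exponential by memorylessness. The rates and time constant below are illustrative assumptions, not values from the article:

```python
import math
import random

random.seed(1)

lam_in, lam_out, tau_p = 20.0, 10.0, 0.02   # illustrative rates (Hz) and time constant (s)

def potentiation_sum():
    """Sum of exp(-T_n/tau_p) over output spikes inside one input ISI."""
    T_in = random.expovariate(lam_in)
    total, t = 0.0, random.expovariate(lam_out)  # first spike from p_1 (memoryless)
    while t < T_in:
        total += math.exp(-t / tau_p)
        t += random.expovariate(lam_out)
    return total

ntrials = 100_000
estimate = sum(potentiation_sum() for _ in range(ntrials)) / ntrials
exact = lam_out / (lam_in + 1.0 / tau_p)     # S_P from equation B.8
```

Here exact = 10/70 ≈ 0.143, and the Monte Carlo average should match it to within sampling error.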
In order to calculate the diffusion function, B(w), we require (for n > m)

$$R_{Pm,Pn} \equiv \iint\gamma\, e^{-T_m/\tau_P} H(T^{in} - T_m)\, e^{-T_n/\tau_P} H(T^{in} - T_n) = \frac{\lambda_{out}\left[1 - f_{\theta L}(\lambda_{in} + \tfrac{2}{\tau_P})\right] f_{\theta L}^{\,m-1}(\lambda_{in} + \tfrac{2}{\tau_P})\, f_{\theta L}^{\,n-m}(\lambda_{in} + \tfrac{1}{\tau_P})}{\lambda_{in} + \tfrac{2}{\tau_P}}$$

$$R_{Dm,Dn} \equiv \iint\gamma\, e^{-(T^{in}-T_m)/\tau_D} H(T^{in} - T_m)\, e^{-(T^{in}-T_n)/\tau_D} H(T^{in} - T_n) = \frac{\lambda_{out}\left[1 - f_{\theta L}(\lambda_{in})\right] f_{\theta L}^{\,m-1}(\lambda_{in})\, f_{\theta L}^{\,n-m}(\lambda_{in} + \tfrac{1}{\tau_D})}{\lambda_{in} + \tfrac{2}{\tau_D}} \tag{B.10}$$

$$R_{Dm,Pn} \equiv \iint\gamma\, e^{-(T^{in}-T_m)/\tau_D} H(T^{in} - T_m)\, e^{-T_n/\tau_P} H(T^{in} - T_n) = \frac{\lambda_{out}\lambda_{in}\left[1 - f_{\theta L}(\lambda_{in} + \tfrac{1}{\tau_P})\right] f_{\theta L}^{\,m-1}(\lambda_{in} + \tfrac{1}{\tau_P})\, f_{\theta L}^{\,n-m}(\lambda_{in} + \tfrac{1}{\tau_P} + \tfrac{1}{\tau_D})}{(\lambda_{in} + \tfrac{1}{\tau_P})(\lambda_{in} + \tfrac{1}{\tau_D})}$$

$$R_{Pm,Dn} \equiv \iint\gamma\, e^{-T_m/\tau_P} H(T^{in} - T_m)\, e^{-(T^{in}-T_n)/\tau_D} H(T^{in} - T_n) = \frac{\lambda_{out}\lambda_{in}\left[1 - f_{\theta L}(\lambda_{in} + \tfrac{1}{\tau_P})\right] f_{\theta L}^{\,m-1}(\lambda_{in} + \tfrac{1}{\tau_P})\, f_{\theta L}^{\,n-m}(\lambda_{in})}{(\lambda_{in} + \tfrac{1}{\tau_P})(\lambda_{in} + \tfrac{1}{\tau_D})}.$$

Summing over all output spikes gives

$$R_1(\tau_P) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Pm,Pn} = \frac{\lambda_{out}\, f_{\theta L}(\lambda_{in} + \tfrac{1}{\tau_P})}{(\lambda_{in} + \tfrac{2}{\tau_P})\left[1 - f_{\theta L}(\lambda_{in} + \tfrac{1}{\tau_P})\right]}$$
$$R_2(\tau_D) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Dm,Dn} = \frac{\lambda_{out}\, f_{\theta L}(\lambda_{in} + \tfrac{1}{\tau_D})}{(\lambda_{in} + \tfrac{2}{\tau_D})\left[1 - f_{\theta L}(\lambda_{in} + \tfrac{1}{\tau_D})\right]}$$
$$R_3(\tau_P, \tau_D) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Dm,Pn} = \frac{\lambda_{out}\lambda_{in}\, f_{\theta L}(\lambda_{in} + \tfrac{1}{\tau_P} + \tfrac{1}{\tau_D})}{(\lambda_{in} + \tfrac{1}{\tau_P})(\lambda_{in} + \tfrac{1}{\tau_D})\left[1 - f_{\theta L}(\lambda_{in} + \tfrac{1}{\tau_P} + \tfrac{1}{\tau_D})\right]}$$
$$R_4(\tau_P, \tau_D) = \sum_{m=1}^{\infty}\sum_{n=m+1}^{\infty} R_{Pm,Dn} = \frac{\lambda_{out}\lambda_{in}\, f_{\theta L}(\lambda_{in})}{(\lambda_{in} + \tfrac{1}{\tau_P})(\lambda_{in} + \tfrac{1}{\tau_D})\left[1 - f_{\theta L}(\lambda_{in})\right]}$$
$$R_5(\tau_P, \tau_D) = \sum_{n=1}^{\infty} R_{Pn,Dn} = \frac{\lambda_{out}\lambda_{in}}{(\lambda_{in} + \tfrac{1}{\tau_P})(\lambda_{in} + \tfrac{1}{\tau_D})}. \tag{B.11}$$

The diffusion function, B(w), defined in equation 2.5, is given in terms of the above expressions as

$$B(w) = \lambda_{in}\,\alpha_P^2(w)\left[\sum_{n=1}^{\infty} S_{Pn}\!\left(\frac{\tau_P}{2}\right) + 2R_1(\tau_P)\right] - 2\lambda_{in}\,\alpha_P(w)\alpha_D(w)\left[R_5(\tau_P,\tau_D) + R_3(\tau_P,\tau_D) + R_4(\tau_P,\tau_D)\right] + \lambda_{in}\,\alpha_D^2(w)\left[\sum_{n=1}^{\infty} S_{Dn}\!\left(\frac{\tau_D}{2}\right) + 2R_2(\tau_D)\right]. \tag{B.12}$$

The diffusion function, B(w), is thus given by

$$B(w) = \lambda_{out}\lambda_X \Bigg\{ \frac{\alpha_P^2(w)}{\lambda_X + \tfrac{2}{\tau_P}}\left[1 + \frac{2 f_{\theta L}(\lambda_X + \tfrac{1}{\tau_P})}{1 - f_{\theta L}(\lambda_X + \tfrac{1}{\tau_P})}\right] - \frac{2\alpha_P(w)\alpha_D(w)\,\lambda_X}{(\lambda_X + \tfrac{1}{\tau_P})(\lambda_X + \tfrac{1}{\tau_D})}\left[1 + \frac{f_{\theta L}(\lambda_X)}{1 - f_{\theta L}(\lambda_X)} + \frac{f_{\theta L}(\lambda_X + \tfrac{1}{\tau_P} + \tfrac{1}{\tau_D})}{1 - f_{\theta L}(\lambda_X + \tfrac{1}{\tau_P} + \tfrac{1}{\tau_D})}\right] + \frac{\alpha_D^2(w)}{\lambda_X + \tfrac{2}{\tau_D}}\left[1 + \frac{2 f_{\theta L}(\lambda_X + \tfrac{1}{\tau_D})}{1 - f_{\theta L}(\lambda_X + \tfrac{1}{\tau_D})}\right] \Bigg\}, \tag{B.13}$$
where f_{θL}(s) is the Laplace transform of the first passage time density, whose explicit expression is given in equations 2.20 and 2.21.

B.2.1 Model II: Arbitrary N. The above calculation of the drift function, A(w), and the diffusion function, B(w), is generalized for EPSP-AP interactions that have a time extent restricted to the N nearest synaptic inputs to each output AP. This is done by generalizing equation B.4:

$$\iint\gamma = \int_0^{\infty} dT_1^{in}\, p_{in}(T_1^{in}) \int_{T_1^{in}}^{\infty} dT_2^{in}\, p_{in}(T_2^{in} - T_1^{in}) \cdots \int_{T_{N-1}^{in}}^{\infty} dT_N^{in}\, p_{in}(T_N^{in} - T_{N-1}^{in})$$
$$\times \int_0^{\infty} dT_1^{out}\, p_1(T_1^{out}) \int_{T_1^{out}}^{\infty} dT_2^{out}\, f_\theta(T_2^{out} - T_1^{out}) \cdots \int_{T_{n-1}^{out}}^{\infty} dT_n^{out}\, f_\theta(T_n^{out} - T_{n-1}^{out}) \cdots, \tag{B.14}$$
where the previous input is taken as occurring at time t = 0 and the kth subsequent output AP and input EPSP times are explicitly denoted by T_k^out and T_k^in, respectively. In addition, each of the Heaviside step functions in equation B.6 is replaced by H(T_N^in − T_n^out), which gives the required restriction of the EPSP-AP interactions to the N nearest inputs (note also that each interaction is counted exactly once with these definitions).

B.2.2 Model II: N = 1/2. The calculation of the drift function, A(w), and the diffusion function, B(w), proceeds in a similar fashion to that above. The contributions are calculated using equation B.4 as before, but with

$$A(w) \equiv \lambda_{in} \iint\gamma \left[ \alpha_P(w) \sum_{n=1}^{\infty} e^{-T_n/\tau_P} H(T^{in} - T_n)\, H(T_n - T^{in}/2) - \alpha_D(w) \sum_{n=1}^{\infty} e^{-(T^{in}-T_n)/\tau_D} H(T^{in}/2 - T_n) \right] \tag{B.15}$$
$$B(w) \equiv \lambda_{in} \iint\gamma \left[ \alpha_P(w) \sum_{n=1}^{\infty} e^{-T_n/\tau_P} H(T^{in} - T_n)\, H(T_n - T^{in}/2) - \alpha_D(w) \sum_{n=1}^{\infty} e^{-(T^{in}-T_n)/\tau_D} H(T^{in}/2 - T_n) \right]^2.$$

The contributions to the potentiation and depression due to the nth spike are S̃_{P,n} and S̃_{D,n}, respectively:

$$\tilde{S}_{Pn} \equiv \iint\gamma\, e^{-T_n/\tau_P} H(T^{in} - T_n)\, H(T_n - T^{in}/2) = \frac{\lambda_{out}\left[1 - f_{\theta L}(2\lambda_{in} + \tfrac{1}{\tau_P})\right] f_{\theta L}^{\,n-1}(2\lambda_{in} + \tfrac{1}{\tau_P})}{2\lambda_{in} + \tfrac{1}{\tau_P}} \tag{B.16}$$
$$\tilde{S}_{Dn} \equiv \iint\gamma\, e^{-(T^{in}-T_n)/\tau_D} H(T^{in}/2 - T_n) = \frac{\lambda_{out}\left[1 - f_{\theta L}(2\lambda_{in} + \tfrac{1}{\tau_D})\right] f_{\theta L}^{\,n-1}(2\lambda_{in} + \tfrac{1}{\tau_D})}{2\lambda_{in} + \tfrac{1}{\tau_D}}.$$

Summing over all input spikes gives equation 3.9.

The calculation of the diffusion function, B(w), proceeds in much the same way as for model II. The cross-terms (required for B_1) are more straightforward to calculate because the integrals for the potentiation and depression terms are independent in this model. The resulting expression is given by

$$B(w) = \lambda_{out}\lambda_X \Bigg\{ \frac{\alpha_P^2(w)}{2\lambda_X + \tfrac{2}{\tau_P}}\left[1 + \frac{2 f_{\theta L}(2\lambda_X + \tfrac{1}{\tau_P})}{1 - f_{\theta L}(2\lambda_X + \tfrac{1}{\tau_P})}\right] - \frac{\alpha_P(w)\alpha_D(w)\,\lambda_{out}}{(2\lambda_X + \tfrac{1}{\tau_P})(2\lambda_X + \tfrac{1}{\tau_D})} + \frac{\alpha_D^2(w)}{2\lambda_X + \tfrac{2}{\tau_D}}\left[1 + \frac{2 f_{\theta L}(2\lambda_X + \tfrac{1}{\tau_D})}{1 - f_{\theta L}(2\lambda_X + \tfrac{1}{\tau_D})}\right] \Bigg\}. \tag{B.17}$$
B.3 Model III. The calculation of the drift function, A(w), and the diffusion function, B(w), proceeds in a fashion analogous to that in section B.2, except that for this model, the ISIs of the output APs define the independent events. The contributions are calculated by defining the following integral over the distribution of input and output times,

$$\iint\gamma \equiv \int_0^{\infty} dT^{out}\, f_\theta(T^{out}) \int_0^{\infty} dT_1\, p_{in}(T_1) \int_{T_1}^{\infty} dT_2\, p_{in}(T_2 - T_1) \cdots \int_{T_{n-1}}^{\infty} dT_n\, p_{in}(T_n - T_{n-1}) \cdots, \tag{B.18}$$

where the previous output AP is taken as occurring at time t = 0 and T^out is the output time of the subsequent AP, which has a time distribution given by the first passage time density f_θ(t), equation 2.20. T_k is the time of the kth input EPSP following the output AP at time t = 0, and each T_k has a Poisson distribution p_in(t), given by equation B.2. Note that, analogous to equation B.4, the above integral does not force the EPSPs to lie within an output ISI, but this bound on the input times will be imposed in the definition of A(w) and B(w) below. The drift function, A(w), and diffusion function, B(w), are given by

$$A(w) \equiv \lambda_{out} \iint\gamma \left[ \alpha_P(w) \sum_{n=1}^{\infty} e^{-(T^{out}-T_n)/\tau_P} H(T^{out} - T_n) - \alpha_D(w) \sum_{n=1}^{\infty} e^{-T_n/\tau_D} H(T^{out} - T_n) \right] \tag{B.19}$$
$$B(w) \equiv \lambda_{out} \iint\gamma \left[ \alpha_P(w) \sum_{n=1}^{\infty} e^{-(T^{out}-T_n)/\tau_P} H(T^{out} - T_n) - \alpha_D(w) \sum_{n=1}^{\infty} e^{-T_n/\tau_D} H(T^{out} - T_n) \right]^2.$$
Each of these integrals represents a single ISI of the output APs, and the factor λ_out is the rate of the output ISIs. The contributions are calculated using equation B.18: S_{P,n} and S_{D,n} are the contributions to the potentiation and depression, respectively, due to the nth input EPSP,

$$S_{Pn}(\tau_P) \equiv \iint\gamma\, e^{-(T^{out}-T_n)/\tau_P} H(T^{out} - T_n) = \int_0^{\infty} dT^{out}\, f_\theta(T^{out})\, e^{-T^{out}/\tau_P}\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,n}(s - \tfrac{1}{\tau_P})}{s}\right\}\!(T^{out}) \tag{B.20}$$
$$S_{Dn}(\tau_D) \equiv \iint\gamma\, e^{-T_n/\tau_D} H(T^{out} - T_n) = \int_0^{\infty} dT^{out}\, f_\theta(T^{out})\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,n}(s + \tfrac{1}{\tau_D})}{s}\right\}\!(T^{out}),$$

where $\mathcal{L}^{-1}\{f(s)\}$ is the inverse Laplace transform of f(s). Summing over all input spikes gives

$$S_P(\tau_P) = \sum_{n=1}^{\infty} S_{Pn}(\tau_P) = \lambda_{in}\tau_P\left[1 - f_{\theta L}\!\left(\frac{1}{\tau_P}\right)\right], \qquad S_D(\tau_D) = \sum_{n=1}^{\infty} S_{Dn}(\tau_D) = \lambda_{in}\tau_D\left[1 - f_{\theta L}\!\left(\frac{1}{\tau_D}\right)\right]. \tag{B.21}$$

This gives the drift function, A(w), in equation 3.11. In order to calculate the diffusion function, B(w), we require (for n > m)

$$R_{Pm,Pn} \equiv \iint\gamma\, e^{-(T^{out}-T_m)/\tau_P} H(T^{out}-T_m)\, e^{-(T^{out}-T_n)/\tau_P} H(T^{out}-T_n) = \int_0^{\infty} dT^{out} f_\theta(T^{out})\, e^{-2T^{out}/\tau_P}\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,m}(s - \tfrac{2}{\tau_P})\, p_{in,L}^{\,n-m}(s - \tfrac{1}{\tau_P})}{s}\right\}\!(T^{out})$$

$$R_{Dm,Dn} \equiv \iint\gamma\, e^{-T_m/\tau_D} H(T^{out}-T_m)\, e^{-T_n/\tau_D} H(T^{out}-T_n) = \int_0^{\infty} dT^{out} f_\theta(T^{out})\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,m}(s + \tfrac{2}{\tau_D})\, p_{in,L}^{\,n-m}(s + \tfrac{1}{\tau_D})}{s}\right\}\!(T^{out})$$

$$R_{Dm,Pn} \equiv \iint\gamma\, e^{-T_m/\tau_D} H(T^{out}-T_m)\, e^{-(T^{out}-T_n)/\tau_P} H(T^{out}-T_n) = \int_0^{\infty} dT^{out} f_\theta(T^{out})\, e^{-T^{out}/\tau_P}\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,m}(s + \tfrac{1}{\tau_D} - \tfrac{1}{\tau_P})\, p_{in,L}^{\,n-m}(s - \tfrac{1}{\tau_P})}{s}\right\}\!(T^{out})$$

$$R_{Pm,Dn} \equiv \iint\gamma\, e^{-(T^{out}-T_m)/\tau_P} H(T^{out}-T_m)\, e^{-T_n/\tau_D} H(T^{out}-T_n) = \int_0^{\infty} dT^{out} f_\theta(T^{out})\, e^{-T^{out}/\tau_P}\, \mathcal{L}^{-1}\!\left\{\frac{p_{in,L}^{\,m}(s + \tfrac{1}{\tau_D} - \tfrac{1}{\tau_P})\, p_{in,L}^{\,n-m}(s + \tfrac{1}{\tau_D})}{s}\right\}\!(T^{out}). \tag{B.22}$$
Summing over all input spikes and evaluating the geometric sums and inverse Laplace transforms for the Poisson input density, equation B.2, gives

$$R_1(\tau_P) = \frac{\lambda_{in}^2 \tau_P^2}{2}\left[1 - 2 f_{\theta L}\!\left(\frac{1}{\tau_P}\right) + f_{\theta L}\!\left(\frac{2}{\tau_P}\right)\right]$$
$$R_2(\tau_D) = \frac{\lambda_{in}^2 \tau_D^2}{2}\left[1 - 2 f_{\theta L}\!\left(\frac{1}{\tau_D}\right) + f_{\theta L}\!\left(\frac{2}{\tau_D}\right)\right]$$
$$R_3(\tau_P, \tau_D) = \lambda_{in}^2 \tau_P \left[ \tau_D\left(1 - f_{\theta L}\!\left(\frac{1}{\tau_D}\right)\right) - \frac{f_{\theta L}(\tfrac{1}{\tau_D}) - f_{\theta L}(\tfrac{1}{\tau_P})}{\tfrac{1}{\tau_P} - \tfrac{1}{\tau_D}} \right]$$
$$R_4(\tau_P, \tau_D) = \lambda_{in}^2 \tau_D \left[ \frac{f_{\theta L}(\tfrac{1}{\tau_D}) - f_{\theta L}(\tfrac{1}{\tau_P})}{\tfrac{1}{\tau_P} - \tfrac{1}{\tau_D}} - \tau_P f_{\theta L}\!\left(\frac{1}{\tau_D}\right) + \tau_P f_{\theta L}\!\left(\frac{1}{\tau_P} + \frac{1}{\tau_D}\right) \right]$$
$$R_5(\tau_P, \tau_D) = \lambda_{in}\, \frac{f_{\theta L}(\tfrac{1}{\tau_D}) - f_{\theta L}(\tfrac{1}{\tau_P})}{\tfrac{1}{\tau_P} - \tfrac{1}{\tau_D}}, \tag{B.23}$$
where the R's are defined in equation B.11 (here formed from the model III expressions of equation B.22). The diffusion function, B(w), is given in terms of the above expressions by equation B.12 for model III:

$$B(w) = \lambda_{out}\lambda_X \Bigg\{ \alpha_P^2(w)\left[ \frac{\tau_P}{2}\left(1 - f_{\theta L}\!\left(\tfrac{2}{\tau_P}\right)\right) + \lambda_X \tau_P^2\left(1 - 2 f_{\theta L}\!\left(\tfrac{1}{\tau_P}\right) + f_{\theta L}\!\left(\tfrac{2}{\tau_P}\right)\right) \right]$$
$$\quad - 2\alpha_P(w)\alpha_D(w)\left[ \left(1 - \lambda_X(\tau_P - \tau_D)\right) \frac{f_{\theta L}(\tfrac{1}{\tau_D}) - f_{\theta L}(\tfrac{1}{\tau_P})}{\tfrac{1}{\tau_P} - \tfrac{1}{\tau_D}} + \lambda_X \tau_P \tau_D\left(1 - 2 f_{\theta L}\!\left(\tfrac{1}{\tau_D}\right) + f_{\theta L}\!\left(\tfrac{1}{\tau_P} + \tfrac{1}{\tau_D}\right)\right) \right]$$
$$\quad + \alpha_D^2(w)\left[ \frac{\tau_D}{2}\left(1 - f_{\theta L}\!\left(\tfrac{2}{\tau_D}\right)\right) + \lambda_X \tau_D^2\left(1 - 2 f_{\theta L}\!\left(\tfrac{1}{\tau_D}\right) + f_{\theta L}\!\left(\tfrac{2}{\tau_D}\right)\right) \right] \Bigg\}, \qquad X = B, F. \tag{B.24}$$
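Equation B.21 also admits a quick Monte Carlo spot check: choosing an exponential first passage time density f_θ with rate λ_out gives f_θL(1/τ_P) = λ_out/(λ_out + 1/τ_P), so S_P = λ_in τ_P[1 − f_θL(1/τ_P)] reduces to λ_in/(λ_out + 1/τ_P). All numerical values below are illustrative assumptions, not values from the article:

```python
import math
import random

random.seed(2)

lam_in, lam_out, tau_p = 20.0, 10.0, 0.02   # illustrative rates (Hz) and time constant (s)

def potentiation_sum():
    """Sum of exp(-(T_out - T_n)/tau_p) over Poisson EPSPs inside one output ISI."""
    T_out = random.expovariate(lam_out)      # exponential f_theta for the check
    total, t = 0.0, random.expovariate(lam_in)
    while t < T_out:
        total += math.exp(-(T_out - t) / tau_p)
        t += random.expovariate(lam_in)
    return total

ntrials = 100_000
estimate = sum(potentiation_sum() for _ in range(ntrials)) / ntrials
exact = lam_in / (lam_out + 1.0 / tau_p)     # = lam_in*tau_p*(1 - f_thetaL(1/tau_p))
```

Here exact = 20/60 ≈ 0.333, and the Monte Carlo average should match it to within sampling error.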
B.3.1 Model III: Arbitrary M. The above calculation of the drift function, A(w), is generalized for EPSP-AP interactions that have a time extent restricted to the M nearest AP outputs to each input EPSP. This is done by generalizing equation B.18:

$$\iint\gamma = \int_0^{\infty} dT_1^{out}\, f_\theta(T_1^{out}) \int_{T_1^{out}}^{\infty} dT_2^{out}\, f_\theta(T_2^{out} - T_1^{out}) \cdots \int_{T_{M-1}^{out}}^{\infty} dT_M^{out}\, f_\theta(T_M^{out} - T_{M-1}^{out})$$
$$\times \int_0^{\infty} dT_1^{in}\, p_{in}(T_1^{in}) \int_{T_1^{in}}^{\infty} dT_2^{in}\, p_{in}(T_2^{in} - T_1^{in}) \cdots \int_{T_{k-1}^{in}}^{\infty} dT_k^{in}\, p_{in}(T_k^{in} - T_{k-1}^{in}) \cdots, \tag{B.25}$$

where the previous output AP is taken as occurring at time t = 0 and the kth subsequent output AP and input EPSP times are explicitly denoted by T_k^out and T_k^in, respectively. In addition, each of the Heaviside step functions in equation B.19 is replaced by H(T_M^out − T_n^in), which gives the required restriction of the EPSP-AP interactions to the M nearest outputs (again note that each interaction is counted exactly once with these definitions).

B.4 Model IV. The calculation of the drift function, A(w), and the diffusion function, B(w), proceeds in a fashion analogous to that in the previous sections. Each EPSP interacts with at most two APs, and each AP interacts with at most two EPSPs. We choose here to consider one ISI of the output APs and define

$$\iint\gamma \equiv \int_0^{\infty} dT^{out}\, f_\theta(T^{out}) \int_0^{T^{out}} dt_1\, p_{in}(t_1) \int_0^{T^{out}} dt_2 \left\{ \delta(t_1 - t_2)\, e^{-\lambda_{in}(T^{out} - t_2)} + p_{in}(T^{out} - t_2)\, H(t_2 - t_1) \right\}, \tag{B.26}$$

where t_1 is the first EPSP following the output AP at time t = 0, and t_2 is the last input EPSP before the subsequent AP at time T^out. The first term in the brackets represents the case where there is exactly one EPSP within the ISI of the output APs, and the second term represents the case where there are two or more EPSPs within the ISI. The drift function, A(w), and diffusion function, B(w), are given by

$$A(w) \equiv \lambda_{out} \iint\gamma \left[ \alpha_P(w)\, e^{-(T^{out}-t_2)/\tau_P} - \alpha_D(w)\, e^{-t_1/\tau_D} \right] \tag{B.27}$$
$$B(w) \equiv \lambda_{out} \iint\gamma \left[ \alpha_P(w)\, e^{-(T^{out}-t_2)/\tau_P} - \alpha_D(w)\, e^{-t_1/\tau_D} \right]^2.$$

These integrals are straightforward to evaluate, giving the drift function, equation 3.18, and the diffusion function, B(w):

$$B(w) = \lambda_{out}\lambda_X \Bigg[ \alpha_P^2(w)\, \frac{1 - f_{\theta L}(\lambda_X + \tfrac{2}{\tau_P})}{\lambda_X + \tfrac{2}{\tau_P}} - \frac{2\alpha_P(w)\alpha_D(w)}{\lambda_X + \tfrac{1}{\tau_P}} \left\{ \frac{\lambda_X\left[1 - f_{\theta L}(\lambda_X + \tfrac{1}{\tau_D})\right]}{\lambda_X + \tfrac{1}{\tau_D}} + \frac{f_{\theta L}(\lambda_X + \tfrac{1}{\tau_D}) - f_{\theta L}(\lambda_X + \tfrac{1}{\tau_P})}{1 - \tfrac{\tau_P}{\tau_D}} \right\} + \alpha_D^2(w)\, \frac{1 - f_{\theta L}(\lambda_X + \tfrac{2}{\tau_D})}{\lambda_X + \tfrac{2}{\tau_D}} \Bigg], \qquad X = B, F. \tag{B.28}$$
The expression for the drift function for the N input-, M output-restricted model IV with interspike suppressive interactions, section 4.1, calculated using the methods described in sections B.2.1 and B.3.1, is given by

$$A(w_i) = \lambda_{out}\lambda_i \Bigg[ \alpha_P(w_i)\, \tau_P\, \epsilon_L^{pre}(\lambda_i)\, \frac{(-1)^{N-1}\lambda_i^{N+1}}{(N-1)!}\, \frac{d^{N-1}}{d\lambda_i^{N-1}} \left\{ \left( \frac{1}{\lambda_i} - \frac{1}{\lambda_i + \tfrac{1}{\tau_P}} \right) \frac{g_{\theta L}(0) - g_{\theta L}(\lambda_i + \tfrac{1}{\tau_P})}{\lambda_i + \tfrac{1}{\tau_P}} \right\}$$
$$\quad - \alpha_D(w_i)\, \tau_D\, g_{\theta L}(0) \left\{ \lambda_i\, \epsilon_L^{pre}(\lambda_i) \left[ 1 - f_{\theta L}^{\,M}\!\left(\frac{1}{\tau_D}\right) \right] - \lambda_i^N \int_0^{\infty} dt\; \mathcal{L}^{-1}\{ f_{\theta L}^{\,M}(s) \}(t)\; \mathcal{L}^{-1}\!\left\{ \left( \frac{1}{s} - \frac{1}{s + \tfrac{1}{\tau_D}} \right) \frac{\epsilon_L^{pre}(s + \lambda_i + \tfrac{1}{\tau_D})}{(s + \lambda_i + \tfrac{1}{\tau_D})^{N-1}} \right\}\!(t) \right\} \Bigg], \tag{B.29}$$
where ε_L^{pre}(s) and g_{θL}(s) are defined following equation 4.5. All the other drift functions calculated above constitute special cases of this general expression.

Acknowledgments

This work was funded by the Australian Research Council (ARC Discovery Project #DP0211972) and the Bionic Ear Institute. We acknowledge W. Gerstner's suggestion for the inclusion of ADSS into the STDP framework (outlined in section 4.2, equations 4.8 and 4.9), and thank him for detailed helpful comments. We also thank G.-Q. Bi for drawing Froemke and Dan (2002) to our attention and suggesting we analyze it within the framework presented in this article, and an anonymous reviewer for useful comments on the manuscript.

References

Abbott, L. F., & Blum, K. I. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cerebral Cortex, 6, 406–416.
Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3, 1179–1183.
Abraham, W. C., & Bear, M. F. (1996). Metaplasticity: The plasticity of synaptic plasticity. Trends Neurosci., 19, 126–130.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Mongillo, G. (2003). Spike-driven synaptic dynamics generating working memory states. Neural Comput., 15, 565–596.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates: I. Substrate—Spikes, rates and neuronal gain. Network: Comput. Neur. Sys., 2, 259–273.
Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bi, G.-Q., & Poo, M.-M. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Ann. Rev. Neurosci., 24, 139–166.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Bliss, T. V., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anesthetized rabbit following stimulation of the perforant path. J. Physiol. (Lond.), 232, 331–356.
Burkitt, A. N. (2001). Balanced neurons: Analysis of leaky integrate-and-fire neurons with reversal potentials. Biol. Cybern., 85, 247–255.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2003). Study of neuronal gain in a conductance-based leaky integrate-and-fire neuron model with balanced excitatory and inhibitory synaptic input. Biol. Cybern., 89, 119–125.
Burkitt, A. N., & van Hemmen, J. L. (in press). How synapses in the auditory system wax and wane: Theoretical perspectives. Biol. Cybern.
Câteau, H., & Fukai, T. (2003). A stochastic method to predict the consequence of arbitrary forms of spike-timing-dependent plasticity. Neural Comput., 15, 597–620.
Debanne, D., Gähwiler, B. H., & Thompson, S. M. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507, 237–247.
Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). BDNF regulates the intrinsic excitability of cortical neurons. Learning and Memory, 6, 284–291.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Fusi, S. (2003). Spike-driven synaptic plasticity for learning correlated patterns of mean rates. Rev. Neurosci., 14, 73–84.
Gerstner, W. (2001). Coding properties of spiking neurons: Reverse and cross-correlations. Neural Networks, 14, 599–610.
Gerstner, W., & van Hemmen, J. L. (1996). What matters in neuronal locking? Neural Comput., 8, 1653–1676.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through non-linear temporally asymmetric Hebbian plasticity. J. Neurosci., 23, 3697–3714.
Hanson, F. B., & Tuckwell, H. C. (1983). Diffusion approximations for neuronal activity including synaptic reversal potentials. J. Theoret. Neurobiol., 2, 127–153.
Holmgren, C. D., & Zilberter, Y. (2001). Coincident spiking activity induces long-term changes in inhibition of neocortical pyramidal cells. J. Neurosci., 21, 8270–8277.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. USA, 79, 2554–2558.
Izhikevich, E., & Desai, N. S. (2003). Relating STDP to BCM. Neural Comput., 15, 1511–1523.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comput., 13, 2709–2741.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with timing of pre- and postsynaptic action potentials. Neural Comput., 12, 385–405.
Leibold, C., Kempter, R., & van Hemmen, J. L. (2002). How spiking neurons give rise to a temporal-feature map: From synaptic plasticity to axonal selection. Phys. Rev. E, 65, 051915.
Leslie, K. R., Nelson, S. B., & Turrigiano, G. G. (2001). Postsynaptic depolarization scales quantal amplitude in cortical pyramidal neurons. J. Neurosci., 21, RC170(1–6).
Lissin, D. V., Gomperts, S. N., Carroll, R. C., Christine, C. W., Kalman, D., Kitamura, M., Hardy, S., Nicoll, R. A., Malenka, R. C., & von Zastrow, M. (1998). Activity differentially regulates the surface expression of synaptic AMPA and NMDA glutamate receptors. Proc. Natl. Acad. Sci., 95, 7097–7102.
Lomo, T. (1971). Potentiation of monosynaptic EPSPs in the perforant path-dentate granule cell synapse. Exp. Brain Res., 12, 46–63.
Malenka, R. C., & Nicoll, R. A. (1999). Long-term potentiation—A decade of progress? Science, 285, 1870–1874.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Miller, K. D. (1996). Synaptic economics: Competition and cooperation in synaptic plasticity. Neuron, 17, 371–374.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comput., 6, 100–126.
O'Brien, R. J., Kamboj, S., Ehlers, M. D., Rosen, K. R., Fischbach, G. D., & Huganir, R. L. (1998). Activity-dependent modulation of synaptic AMPA receptor accumulation. Neuron, 21, 1067–1078.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Comput., 13, 2221–2237.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Risken, H. (1996). The Fokker-Planck equation (3rd ed.). Berlin: Springer-Verlag.
Roberts, P. D. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. J. Comput. Neurosci., 7, 235–246.
Roberts, P. D., & Bell, C. C. (2002). Spike-timing dependent synaptic plasticity: Mechanisms and implications. Biol. Cybern., 87, 392–403.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86, 364–367.
Rutherford, L. C., Nelson, S. B., & Turrigiano, G. G. (1998). BDNF has opposite effects on the quantal amplitude of pyramidal neuron and interneuron excitatory synapses. Neuron, 21, 521–530.
Sejnowski, T. J. (1977). Statistical constraints on synaptic plasticity. J. Theor. Biol., 69, 385–389.
Sejnowski, T. J. (1999). The book of Hebb. Neuron, 24, 773–776.
Senn, W., Markram, H., & Tsodyks, M. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and post-synaptic spike timing. Neural Comput., 13, 35–68.
Shouval, H. Z., Castellani, G. C., Blais, B. S., Yeung, L. C., & Cooper, L. N. (2002). Converging evidence for a simplified biophysical model of synaptic plasticity. Biol. Cybern., 87, 383–391.
Siegert, A. J. F. (1951). On the first passage time probability problem. Phys. Rev., 81, 617–623.
Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., & Abbott, L. F. (2001). Cortical development and remapping through spike-timing dependent plasticity. Neuron, 32, 339–350.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing dependent synaptic plasticity. Nature Neurosci., 3, 919–926.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Comput., 13, 2005–2029.
Stuart, G. J., & Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature, 367, 69–72.
Svirskis, G., & Rinzel, J. (2000). Influence of temporal correlation of synaptic input on the rate and variability of firing in neurons. Biophys. J., 79, 629–637.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tuckwell, H. C. (1979). Synaptic transmission in a model for stochastic neural activity. J. Theor. Biol., 77, 65–81.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: Vol. 2, Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Hemmen, J. L. (2001). Theory of synaptic plasticity. In F. Moss & S. Gielen (Eds.), Handbook of biological physics, Vol. 4: Neuro-informatics and neural modelling (pp. 771–823). Amsterdam: Elsevier.
van Kampen, N. G. (1992). Stochastic processes in physics and chemistry. Amsterdam: North-Holland.
van Rossum, M. C. W., Bi, G.-Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.
Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Zorumski, C. F., & Thio, L. L. (1992). Properties of vertebrate glutamate receptors: Calcium mobilization and desensitization. Prog. Neurobiol., 39, 295–336.
Zucker, R. S. (1999). Calcium- and activity-dependent synaptic plasticity. Curr. Opin. Neurobiol., 9, 305–313.

Received January 16, 2003; accepted September 29, 2003.
LETTER
Communicated by Jonathan Victor
Estimating the Temporal Interval Entropy of Neuronal Discharge

George N. Reeke
[email protected] Allan D. Coop
[email protected] Laboratory of Biological Modelling, Rockefeller University, New York, NY 10021, U.S.A.
To better understand the role of timing in the function of the nervous system, we have developed a methodology that allows the entropy of neuronal discharge activity to be estimated from a spike train record when it may be assumed that successive interspike intervals are temporally uncorrelated. The so-called interval entropy obtained by this methodology is based on an implicit enumeration of all possible spike trains that are statistically indistinguishable from a given spike train. The interval entropy is calculated from an analytic distribution whose parameters are obtained by maximum likelihood estimation from the interval probability distribution associated with a given spike train. We show that this approach reveals features of neuronal discharge not seen with two alternative methods of entropy estimation. The methodology allows for validation of the obtained data models by calculation of confidence intervals for the parameters of the analytic distribution and the testing of the significance of the fit between the observed and analytic interval distributions by means of Kolmogorov-Smirnov and Anderson-Darling statistics. The method is demonstrated by analysis of two different data sets: simulated spike trains evoked by either Poissonian or near-synchronous pulsed activation of a model cerebellar Purkinje neuron and spike trains obtained by extracellular recording from spontaneously discharging cultured rat hippocampal neurons.

1 Introduction

Spike trains generated by the discharge activity of neurons are composed of sequences of action potentials (spikes or impulses) that are generally accepted as providing the predominant mode of communication between neurons within the central nervous system (Perkel, 1970). Classically, the rate-coded properties of a spike train have been considered most important, for example, when sensory stimulus intensity is taken to be represented by the mean spike rate in a particular fiber or group of fibers (Adrian, 1928).
Neural Computation 16, 941–970 (2004)
© 2004 Massachusetts Institute of Technology
G. Reeke and A. Coop
However, a significant feature of neuronal discharge activity, particularly within the cortex, is its high variability (Buracas, Zador, DeWeese, & Albright, 1998; Noda & Adey, 1970; Shadlen & Newsome, 1998; Softky & Koch, 1993; Tolhurst, Movshon, & Dean, 1983; Whitsel, Schreiner, & Essick, 1977). Although less widely accepted, the presence of discharge variability suggests that the timing of neuronal impulses, and not only their rate, may be important (Segundo, 2000), and numerous alternatives to rate coding have been proposed (Buzsáki, Llinás, Singer, Berthoz, & Christen, 1994; Eggermont, 2001; Gilbert, 2001; Meister & Berry, 1999; Perkel & Bullock, 1968). In many cases, analysis is based on the temporal dispersion or pattern of neuronal discharge within a spike train, measures of which include the standard deviation (SD), coefficient of variation (CV = SD/mean), serial correlation of intervals, features of burst discharge such as the spike number and burst duration (e.g., Lisman, 1997), or metric-space techniques (Victor & Purpura, 1996). In contrast with the classical rate-based approach, the characteristic feature of such temporal coding is the importance of the exact time at which neuronal discharge occurs. A theoretically grounded measure of the weighted average probability of individual alternative events, the entropy, originated in statistical mechanics and is fundamental to communication theory (Segundo, 1970; Shannon & Weaver, 1949). This measure has subsequently been adapted (see, e.g., Brillouin, 1962; MacKay & McCulloch, 1952; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997) to provide the necessary foundation for estimation of the transmission of information within (Manwani & Koch, 2000) and between neurons (e.g., Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Borst & Theunissen, 1999; Brecht, Goebel, Singer, & Engel, 2001; Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Buracas & Albright, 1999; Buracas et al., 1998; Butts, 2003; deCharms & Zador, 2000; Dimitrov & Miller, 2001; Eckhorn & Pöpel, 1974; Gershon, Wiener, Latham, & Richmond, 1998; Liu, Tzonev, Rebrik, & Miller, 2001; Optican & Richmond, 1987; Reich, Mechler, Purpura, & Victor, 2000; Rieke et al., 1997; Stein, 1967; Szczepanski, Amigó, Wajnryb, & Sanchez-Vives, 2003; Werner & Mountcastle, 1965; Wiener & Richmond, 1999). One feature of the methodologies employed in many of these reports is their emphasis on the analysis of signal transmission following the sensory stimulation of visual pathways. Here, however, we are specifically concerned with estimating the entropy of individual spike trains, regardless of their dependence on any particular sensory stimulus. With the content and mechanisms of neural coding still in dispute (see, e.g., Eggermont, 1998; Friston, 1997), the estimation of entropy becomes more uncertain as neuronal impulse activity propagates from primary sensory areas to multi- and supramodal cortical areas. In part, this is due to difficulties in identifying the set of possible signals from which a particular neuronal response is selected and thus the probability with which a given response occurs. This is compounded by the lack of suitable experimental
Estimating The Temporal Interval Entropy of Neuronal Discharge
preparations capable of providing data sets of the length required to establish good estimates of the event probabilities underlying the calculated entropy. Here we describe the analytic distribution method (for convenience, the interval method) for the calculation of the temporal interval entropy or simply the interval entropy of a neuronal signal. The name interval entropy is proposed because probability estimates are based on hypothetical sets of all possible spike trains having the same distribution of interspike intervals as the spike train in question. The methodology requires no knowledge of stimulus properties and is applicable to individual spike trains. It is based on the interspike interval distribution obtained from a particular spike train and involves calculation of the entropy of an equivalent continuous analytic distribution as a means of accomplishing extrapolation to large data record lengths. In the following sections, the interval methodology is described and then applied to analyze spike trains obtained from two different sources. Our primary purpose is to demonstrate the methodology rather than to identify the functional relevance of any spike train properties that may be revealed by the analyses. Initially, artificially generated spike trains are analyzed. Results of these analyses are compared with entropy estimates for the same spike trains obtained with either the direct method (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998) or the nonparametric method of Kozachenko and Leonenko (1987). We also apply the interval methodology to spike trains obtained via extracellular single unit recording of spontaneous discharge activity from acutely dissociated, cultured rat hippocampal neurons.

2 Previous Entropy Measures

It has previously been suggested that entropy may provide a sensitive measure for quantification of neuronal discharge variability (Sherry & Klemm, 1981).
One of the simplest approaches to the analysis of neuronal discharge entropy relies on a rate-based description of the neural code. Here, the fundamental event is assumed to be the action potential (or spike), and the entropy depends on the neuronal discharge rate. When N different spike signals are equally likely, that is, the probability of each signal is 1/N, the entropy, H, is given by the relation H = \log_2 N (Hartley, 1924). Alternatively, the Shannon formulation derives a measure of the entropy of a signal from estimates of the probabilities that particular signals would be observed if drawn at random from among a finite set of all possible signals that might be emitted under a given set of conditions. It is given in the discrete case by
H = -\sum_{i=1}^{N} p_i \log_2 p_i,   (2.1)
where p_i is the probability of the ith event, N is the total number of events, and \sum_i p_i = 1. Equation 2.1 naturally generalizes to the case of a continuous range of values to give the so-called differential entropy function,

H(X) = -\int_S f(x) \log_2 f(x) \, dx,

where S is the support set of X with a density f(x) > 0. From H(X), the discrete entropy can be calculated by

H_{\Delta t}(X) = H(X) - \log_2 \Delta t,   (2.2)
where \Delta t is the temporal bin width (Cover & Thomas, 1991). Several complementary methods are currently available that, via various simplifying assumptions, allow upper or lower bounds of spike train entropy to be estimated (reviewed by, e.g., Borst & Theunissen, 1999). One rate-based measure originates in a derivation of the Shannon entropy function (see equation 2.1) in which a binary representation of a spike train is obtained by dividing the time axis into discrete bins of width \Delta t. Each bin is coded as 1 if a spike is present in that bin and 0 otherwise (Brillouin, 1962; MacKay & McCulloch, 1952). The observed spike train is taken to have been selected from the set of all possible spike trains with the same number of 1s and 0s. This approach has been developed by various simplifying assumptions to give the mean rate entropy in bits per spike as a function of \Delta t (from Rieke et al., 1997),

H_R \approx \log_2 \left( \frac{e}{\bar{r} \Delta t} \right),   (2.3)
where e is the base of natural logarithms, \bar{r} is the mean spike rate, and it is assumed that the probability of finding a spike in any one bin is \ll 1. Here, H_R is referred to as the rate entropy and is given as the mean entropy in bits per spike. It provides an upper bound to the entropy for a given activation rate and choice of \Delta t.

2.1 The Direct Method. As an example of a methodology for measuring entropy that takes into account both spike timing and any correlations that may be present within sequences of intervals, we briefly summarize the entropy estimation component of the so-called direct method of spike train analysis (de Ruyter van Steveninck et al., 1997; Strong et al., 1998). The direct method begins, like the calculation of the rate entropy H_R, with a binary representation of the given spike train in bins of width \Delta t. Consecutive bins (or "letters") are grouped into "words" of L bins. For a given word length L and resolution \Delta t, the entropy of the response to a particular stimulus in bits per second is obtained from the probability distribution of the observed
words by application of

H_D(L, \Delta t) = -\frac{1}{L \Delta t} \sum_{w \in W(L, \Delta t)} p(w) \log_2 p(w),   (2.4)
where w is a specific word (spike pattern), W(L, \Delta t) is the set of all possible words comprising L bins of width \Delta t, and p(w) is the probability of observing pattern w in the spike train. The true entropy for a given measurement precision can be obtained only when observation times are sufficiently long to obtain accurate estimates of p(w) and after extrapolation of H_D(L, \Delta t) to infinite L. In one version of this method, which is used for the calculations reported here, H_D is plotted versus 1/L, the point of minimal fractional change in slope is found, and the four values of L up to and including this point are used for the extrapolation (Reinagel, Godwin, Sherman, & Koch, 1999). H_D may then be divided by the mean spike rate to give an estimate of the mean entropy in bits per spike. The advantage of the direct method is that it makes no prior assumptions about the distribution of letters in a word or about the nature of the stimulus. However, as is generally the case for accurate estimation of spike train entropy, application of the method faces three obstacles (Optican, Gawne, Richmond, & Joseph, 1991). First, it is not always clear how to make a principled choice for the temporal resolution at which a spike train should be discretized. This is referred to here as the binning problem. Second, responses to a stimulus often appear to be highly variable (but see Reinagel & Reid, 2002). Third, experimental considerations frequently limit the sample size. The impracticality of collecting the large data sets required to estimate the occurrence probability of word patterns accurately, particularly long ones, is referred to here as the sampling problem.

2.2 The KL Method. A nonparametric method of entropy estimation has been proposed by Kozachenko and Leonenko (1987). Based on the method of Dobrušin (1958), it provides an asymptotically unbiased estimator that gives the entropy of a continuous random vector from a sample of independent observations.
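Before turning to the interval method itself, the bin-based estimators introduced so far (the Shannon entropy of equation 2.1, the rate entropy of equation 2.3, and the direct-method word entropy of equation 2.4) can be sketched in a few lines of illustrative Python. The function names and toy inputs are our own, the extrapolation to infinite L is omitted, and the authors' own code was written in MATLAB; this is only a minimal sketch of the formulas:

```python
import math
from collections import Counter

def shannon_entropy(probs):
    """Discrete Shannon entropy (equation 2.1), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def rate_entropy(rate, dt):
    """Upper-bound rate entropy in bits per spike (equation 2.3),
    valid when the spike probability per bin, rate*dt, is << 1."""
    return math.log2(math.e / (rate * dt))

def direct_word_entropy(spikes, dt, L, duration):
    """Direct-method entropy H_D(L, dt) in bits per second (equation 2.4):
    binarize the train into bins of width dt, group bins into words of
    length L, and apply the Shannon formula to the word distribution."""
    n_bins = int(duration / dt)
    binary = [0] * n_bins
    for t in spikes:
        binary[min(int(t / dt), n_bins - 1)] = 1
    words = [tuple(binary[i:i + L]) for i in range(0, n_bins - L + 1, L)]
    counts = Counter(words)
    total = sum(counts.values())
    return shannon_entropy([c / total for c in counts.values()]) / (L * dt)
```

Dividing the result of `direct_word_entropy` by the mean spike rate gives bits per spike, as described in the text.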
For convenience, it is referred to here as the KL method and provides an entropy estimate referred to as the KL entropy (H_{KL}). (H_{KL} should not be confused with the similarly named relative entropy or Kullback-Leibler distance, which may be calculated for two probability mass functions; e.g., Cover & Thomas, 1991.) In brief, let R^m be the m-dimensional Euclidean space with metric

\rho(x_1, x_2) = \left\{ \sum_{j=1}^{m} \left( x_1^{(j)} - x_2^{(j)} \right)^2 \right\}^{1/2},
where x_i = (x_i^{(1)}, \ldots, x_i^{(m)}) \in R^m, i = 1, 2, and m \geq 1. From the sample X_1, \ldots, X_N, N \geq 2, compute \rho_i = \min\{\rho(X_i, X_j), \; j \in \{1, 2, \ldots, N\} \setminus \{i\}\} and let

\bar{\rho}_N = \left\{ \prod_{i=1}^{N} \frac{\rho_i}{\Delta t} \right\}^{1/N},   (2.5)

where \Delta t is a bin width introduced to make \bar{\rho}_N a dimensionless quantity. The entropy may then be obtained in bits per interval from

H_{KL} = \frac{1}{\ln 2} \left[ m \ln \bar{\rho}_N + \ln c_1(m) + \ln \gamma + \ln(N - 1) \right],   (2.6)

where

c_1(m) = \frac{\pi^{m/2}}{\Gamma(\frac{m}{2} + 1)},
m = 1 for all analyses reported here, \ln \gamma = 0.577216 (the Euler constant), and \Gamma(\cdot) is the gamma function. In some runs (see section 4), the temporal resolution of spike timing was increased to that of the double floating-point number precision of MATLAB by the addition of uniform noise to each interval in a spike train. The noise was constrained such that interval durations were randomly varied within the 10 \mu s simulation resolution. Under this condition, all intervals contributed to the estimate of H_{KL}. With the exception of the lower curves in Figures 3A and 3B, all values reported for the KL entropy were obtained as the mean of 10 replicates for each data set where the random number seed used to generate the noise was different in each analysis.

3 Methods

3.1 The Interval Method. Based on the assumption that temporal precision may typically provide a more important limiting factor on information transmission than maximum impulse rate, an early theoretical study concluded that interval modulation provides a more efficient mode of encoding than purely rate-based modulation (MacKay & McCulloch, 1952). Although subsequent studies of interspike interval distributions obtained from spontaneously discharging neurons in slice preparations of rat cerebellar cortex showed that serial correlation of adjacent intervals may extend over clusters of up to three to five intervals (Klemm & Sherry, 1981; Sherry & Klemm, 1981), for simplicity, the interval method given here assumes independence of successive intervals, as does the KL methodology. The interval method proceeds by assuming that (1) a spike train consists of a sequence of symbols composed of interval-spike pairs where each temporal interval following the first spike is terminated by the occurrence of a spike, (2) the set of possible signals from which an observed spike train
is drawn is the set of all spike trains with the same distribution of interspike intervals, (3) the distribution of spike intervals in an arbitrarily long spike train can be obtained by maximum likelihood estimation (MLE) of the parameters that match a suitably chosen analytical distribution to an observed sample of that spike train, and (4) the entropy determined by the timing of neuronal discharge can be calculated from the continuous probability distribution for an appropriate choice of resolution, \Delta t. (See the end of the following section.)

3.1.1 Estimating the Interval Entropy (H_I). The interval methodology avoids the binning problem by obtaining with MLE techniques the parameter values for the underlying continuous distribution from which the intervals of a particular spike train are most likely to have been drawn. The entropy may then be obtained by analytic or numeric techniques from the continuous distribution at any desired temporal resolution. Within the context of spike train analysis, several different approaches have been formulated to address the sampling problem, that is, the estimation of a probability function from a finite sample size, and different correction factors have been proposed (Optican et al., 1991; Panzeri & Treves, 1996; Sanderson & Kobler, 1976; Wolpert & Wolf, 1995). Here, the sampling problem is addressed by employing a best-fit analytic distribution to approximate the limit of an infinitely long data set. This approach has the advantage that the spike train duration required to provide sufficient data can be estimated by noting where the entropy estimates obtained from analytic probability density functions fit to records of increasing durations asymptote to a constant value, with no further correction for undersampling required. The interval method is not limited to a particular analytical density function; any suitable function or combination of functions may be employed to estimate the interval probability density of a given spike train.
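As an aside, for the one-dimensional case m = 1 used throughout this article, the KL estimator of equations 2.5 and 2.6 reduces to a nearest-neighbor computation over the interspike intervals. The following is an illustrative, stdlib-only Python sketch with our own function names; it assumes the intervals have already been made distinct by the noise jittering described in section 2.2, and the geometric mean of equation 2.5 is computed via a mean of logarithms:

```python
import math

def kl_entropy(intervals, dt):
    """Kozachenko-Leonenko entropy estimate in bits per interval
    (equations 2.5 and 2.6) for the one-dimensional case m = 1.
    Assumes all intervals are distinct."""
    n = len(intervals)
    xs = sorted(intervals)
    # rho_i: distance from each interval to its nearest neighbor
    rhos = []
    for i, x in enumerate(xs):
        cands = []
        if i > 0:
            cands.append(x - xs[i - 1])
        if i < n - 1:
            cands.append(xs[i + 1] - x)
        rhos.append(min(cands))
    # log of the geometric mean of rho_i / dt (equation 2.5)
    log_rho_bar = sum(math.log(r / dt) for r in rhos) / n
    m = 1
    c1 = math.pi ** (m / 2) / math.gamma(m / 2 + 1)  # c1(1) = 2
    ln_gamma = 0.577216                              # ln(gamma) in eq. 2.6
    return (m * log_rho_bar + math.log(c1) + ln_gamma
            + math.log(n - 1)) / math.log(2)
```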
A sum of functions may be particularly useful when a neuron exhibits periodic burst discharge, a condition under which the observed interval probability density may contain multiple peaks. In the analyses reported here, typically one or both of two analytic probability density functions are employed: the generalized gamma distribution or the gaussian distribution. The generalized gamma distribution, in the case of a neuron exhibiting an absolute refractory period, requires a three-parameter form of the probability density function given by

\gamma(x; a, s, \tau) = \frac{(x - s)^{a-1}}{\tau^a \Gamma(a)} e^{-(x-s)/\tau}, \quad (x \geq s \geq 0; \; a, \tau > 0),   (3.1)
where a is the order or shape parameter and s and \tau give the time axis shift and the time scaling parameters of the distribution, respectively. This distribution is chosen as it is characteristic of simple models of neural spike production (see, e.g., Usher, Stemmler, & Koch, 1994) and because of its relation to the Poisson distribution when a = 1. Early studies reported excellent fits between neuronal discharge activity and the gamma distribution (Stein, 1965), as have others since (e.g., Tiesinga, José, & Sejnowski, 2000). In any event, this distribution appears to be justified heuristically by its ability to fit distributions that range from exponential to near gaussian as the order parameter, a, is increased. When all three parameters of a gamma distribution are unknown, an efficient method of parameter estimation is that of MLE (Mann, Schafer, & Singpurwalla, 1974). MLE values are estimated from the observed distribution, O, by maximizing the likelihood function

L = \prod_i p(x_i),
where in the gamma case p(x_i) = \gamma(x_i; a, s, \tau) for unimodal distributions and p(x_i) = f \gamma_1(x_i; a_1, s_1, \tau_1) + (1 - f) \gamma_2(x_i; a_2, s_2, \tau_2) for bimodal distributions. In the bimodal case, f is a constant (0 < f < 1) that distributes the analytic probability between the two peaks in O, and a_1 and a_2, s_1 and s_2, and \tau_1 and \tau_2 are the order, shift, and scaling parameters, respectively, of the two density functions as determined by the MLE procedure. It is often computationally advantageous to maximize the log likelihood instead of the likelihood function. Thus, for the unimodal case,

\ln L = \sum_i \ln[\gamma(i)],   (3.2)
and for the bimodal case,

\ln L = \sum_i \ln[f \gamma_1(i) + (1 - f) \gamma_2(i)],   (3.3)
where

\gamma_n(i) = \frac{(x_i - s_n)^{a_n - 1}}{\tau_n^{a_n} \Gamma(a_n)} e^{-(x_i - s_n)/\tau_n}, \quad n = 1, 2.
The extension to more than two component distributions is obvious. Once the parameters a, s, and \tau are known, the entropy of the fitted gamma distribution can be calculated by summation of the Shannon entropy function (see equation 2.1), where for the unimodal case, the probability that an interspike interval lies in the interval x = [i\Delta t, (i+1)\Delta t] is given by

p_i = Q((i+1)\Delta t; a, s, \tau) - Q(i\Delta t; a, s, \tau),   (3.4)
where i gives the bin number, \Delta t the bin width, and Q is the incomplete gamma function,

Q(x; a, s, \tau) = \gamma\!\left( \frac{x - s}{\tau}, a \right) \equiv \frac{1}{\Gamma(a)} \int_0^{(x-s)/\tau} t^{a-1} e^{-t} \, dt,

or analytically (see equation 2.2), where

H_{\Delta t}(X) = \frac{1}{\ln 2} \left[ \ln(\tau \Gamma(a)) + (1 - a) \frac{\Gamma'(a)}{\Gamma(a)} + a \right] - \log_2 \Delta t,   (3.5)
and \Gamma'(a) is the derivative of the gamma function. For consistency with the bimodal case where the analytic entropy is not available in closed form (see below), the entropy of unimodal distributions is also obtained from equation 2.1. To keep the set of possible messages finite (Shannon, 1948), the summation is truncated after N terms, when the p_i are sufficiently small to achieve numerical convergence (we use 1 - \sum p_i < 1 \times 10^{-8}). In the absence of an analytic expression for H(X) in the bimodal case, the entropy is estimated from the Shannon entropy function (see equation 2.1), now using

p_i = f \left[ Q((i+1)\Delta t; a_1, s_1, \tau_1) - Q(i\Delta t; a_1, s_1, \tau_1) \right] + (1 - f) \left[ Q((i+1)\Delta t; a_2, s_2, \tau_2) - Q(i\Delta t; a_2, s_2, \tau_2) \right].   (3.6)
One or more gaussian distributions or a sum of gaussian and gamma distributions may provide an appropriate description when the order parameter, a, of a gamma distribution becomes large, although we note that the matching of single gaussians to individual interval distributions obtained within the visual system has been unsuccessful (Berry, Warland, & Meister, 1997). The parameter values for the analytic gaussian probability density function,

G(x; m, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( \frac{-(x - m)^2}{2\sigma^2} \right), \quad (\sigma > 0),   (3.7)

from which the observed interval distribution, O, is most likely drawn, are obtained by MLE maximization of the log-likelihood function as given above for the gamma cases. In contrast with the gamma distribution, which extends over the range [0, +\infty], the gaussian distribution extends over the range [-\infty, +\infty]. As the probability of negative temporal intervals has no meaning, such values are avoided by truncating the distribution at x = 0 by renormalization. If one defines

G_T(x) = \left( \frac{2}{\mathrm{erfc}(-m/\sqrt{2}\sigma)} \right) G(x; m, \sigma),   (3.8)
where

\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2} \, dt

is the complementary error function, then

\int_0^{\infty} G_T(x) \, dx = 1 \quad \text{and} \quad \sum_{i=0}^{N} G_T(x_i) \approx 1,
where N is chosen as given above such that 1 - \sum p_i < 1 \times 10^{-8}. An example of a more complex distribution is that of a comb function composed of multiple discrete gaussians, the amplitudes of which are constrained by a gamma envelope. Here,

P(x) = \frac{1}{N_C} \sum_{j=1}^{g} \left( \frac{\mu_j - s}{\tau} \right)^{a-1} e^{-(\mu_j - s)/\tau} \, e^{-(x - \mu_j)^2 / 2\sigma_j^2},   (3.9)
where N_C is a normalization factor given by

N_C = \sqrt{\frac{\pi}{2}} \sum_{j=1}^{g} \left( \frac{\mu_j - s}{\tau} \right)^{a-1} e^{-(\mu_j - s)/\tau} \, \sigma_j \, \mathrm{erfc}\!\left( \frac{-\mu_j}{\sqrt{2}\sigma_j} \right),

g is the number of gaussian components within the gamma envelope, and \mu_j is the mean of the jth gaussian (\mu_j = s_0 + jT), where s_0 is the offset of the first gaussian and T the temporal interval separating the means of each gaussian component. The entropy of such a function may then be obtained from equation 2.1, where the probability of the ith bin is

p(i) = \frac{1}{N_C} \sum_{j=1}^{g} \left( \frac{\mu_j - s}{\tau} \right)^{a-1} e^{-(\mu_j - s)/\tau} \, \sigma_j \sqrt{\frac{\pi}{2}} \left[ \mathrm{erf}\!\left( \frac{(i+1)\Delta t - \mu_j}{\sqrt{2}\sigma_j} \right) - \mathrm{erf}\!\left( \frac{i\Delta t - \mu_j}{\sqrt{2}\sigma_j} \right) \right].   (3.10)
Because in multimodal cases the integral is not always available in closed form, summation is carried out over bins explicitly, a procedure that is equivalent to equation 2.2 in the limit of small \Delta t. The continuous distribution is divided into bins of width \Delta t, and the probability of each bin is calculated as a definite integral of the underlying continuous distribution over the width of the bin. The Shannon entropy (see equation 2.1) is calculated from the resulting bin probabilities p_i. If a distribution spans only a small number of bins, \Delta t, the entropy may be calculated for a reduced bin width and then converted to that of the required bin width from

H(\Delta t_2) = H(\Delta t_1) - \log_2\!\left( \frac{\Delta t_2}{\Delta t_1} \right),   (3.11)
where \Delta t_1 and \Delta t_2 are the smaller and larger bin widths, respectively. Note that binning is not required to obtain the underlying probability distribution, as the parameters of the analytic distribution are obtained by the MLE methodology. Thus, the assumption of a particular value of \Delta t for the entropy calculation plays no role in the determination of the analytic probability distribution P(X). Intuitively, there is a range of resolutions, \Delta t, at which the summation of the Shannon entropy function (see equation 2.1) might be performed, corresponding to the temporal resolution of the postsynaptic targets of the spike train. For the purpose of comparing the various entropy estimation methodologies, a bin width of 0.5 ms is employed here (Regehr & Stevens, 1999; Reinagel & Reid, 2000; Sabatini & Regehr, 1999).

3.1.2 Data Analysis. Analysis and fitting routines were developed by the authors in MATLAB Version 6.5 (MathWorks Inc., Natick, MA, http://www.mathworks.com). A MATLAB-supplied function that implements the Nelder-Mead simplex direct search algorithm, fminsearch, was employed to perform the fits. To enforce the requirement s_n \geq 0 (see equation 3.1), this variable is transformed to a variable \beta,

\beta = -\ln\!\left( \frac{x_{\min} - s - \varepsilon}{x_{\min}} \right), \quad -\infty < \beta < \infty,   (3.12)
where x_{\min} is the smallest interval in O and \varepsilon gives the tolerance to which (x_{\min} - s) approaches zero (we use \varepsilon = 1 \times 10^{-8}). During the MLE procedure, \beta is allowed to vary freely. The true value of s is recovered following parameter estimation via the inverse transformation,

s = x_{\min}(1 - e^{-\beta}) - \varepsilon.   (3.13)
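The transform pair of equations 3.12 and 3.13 is easy to check numerically (illustrative Python; function names are our own):

```python
import math

def beta_from_s(s, x_min, eps=1e-8):
    """Forward transform of equation 3.12, mapping the constrained
    shift s to an unconstrained variable beta for the MLE search."""
    return -math.log((x_min - s - eps) / x_min)

def s_from_beta(beta, x_min, eps=1e-8):
    """Inverse transform of equation 3.13, recovering the shift s."""
    return x_min * (1.0 - math.exp(-beta)) - eps
```

A round trip recovers s exactly, since x_min(1 - e^{-\beta}) - \varepsilon = x_min - (x_min - s - \varepsilon) - \varepsilon = s by construction.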
Confidence intervals were obtained from suitable modifications of a MATLAB-supplied function, mle, which uses the input data and the parameter values of the MLE fit to calculate the Fisher information matrix, the inverse of which gives the covariance matrix of the parameter value estimates, \theta. Confidence intervals, C_n (where n gives the magnitude of the confidence interval), are expressed as \theta \pm c\nu, where \nu is the square root of the corresponding diagonal element of the inverse Fisher information matrix, and c is a constant that depends on the specific confidence interval required. For the analyses reported here, this was typically 99%, that is, c = \sqrt{2}\,\mathrm{erf}^{-1}(0.99). As entropy is determined by the variance of a distribution and is independent of the mean (Rieke et al., 1997), m_n and s_n are not included in the values reported for confidence intervals.

3.2 Statistical Tests. The validity of the fit of each analytic distribution to the observed data was assessed by both the Kolmogorov-Smirnov (KS; Press, Teukolsky, Vetterling, & Flannery, 1992) and Anderson-Darling (AD; Anderson & Darling, 1954) statistical tests. The AD test is a modification of the KS test that increases the power of the KS test in the tails of a distribution. For each test, the probability (p_D and p_W for the KS and AD tests, respectively) of obtaining values of the test statistic (D and W for the KS and AD tests, respectively) that are equal to or greater in magnitude than the observed test statistic was calculated. A p-value close to zero indicates that a significant difference is likely to exist for the sample size used. Critical values for goodness-of-fit testing determined from estimated parameters must be calculated for each distribution, as under these circumstances, they are much smaller than tabulated values (e.g., Lilliefors, 1967), and in any event, values have not been tabulated for the composite distributions we used. As the parameters of an analytic distribution estimated by MLE from a given data set may be used directly (Kotz & Johnson, 1983), for each fit a number of intervals equal to the number in the observed data set are sampled with replacement from the analytic distribution and the appropriate test statistic is calculated. This procedure is repeated 5000 times to give a distribution of the test statistic from which critical and p-values may be obtained.
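The resampling scheme just described amounts to a parametric bootstrap. The following illustrative Python sketch (our own function names; an exponential distribution stands in for the fitted analytic distribution, and far fewer than the 5000 replicates used in the article are drawn, for brevity) computes the KS statistic and a bootstrap p-value:

```python
import math
import random

def ks_statistic(sample, cdf):
    """Kolmogorov-Smirnov statistic D = max |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        d = max(d, abs((i + 1) / n - fx), abs(i / n - fx))
    return d

def ks_bootstrap_pvalue(sample, cdf, sampler, n_boot=200, seed=0):
    """p-value for the fit by parametric bootstrap: resample data sets
    of the same size from the fitted analytic distribution and count
    how often the resampled D equals or exceeds the observed one."""
    rng = random.Random(seed)
    d_obs = ks_statistic(sample, cdf)
    n = len(sample)
    exceed = sum(
        ks_statistic([sampler(rng) for _ in range(n)], cdf) >= d_obs
        for _ in range(n_boot))
    return d_obs, exceed / n_boot
```

As the text notes (citing Lilliefors, 1967), estimating the parameters from the same data makes tabulated KS critical values invalid, which is exactly why the critical values are regenerated by this resampling.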
When resampling from the comb function, the fractional contribution of each gaussian component is obtained from

f = \frac{1}{N_C} \left( \frac{\mu_j - s}{\tau} \right)^{a-1} e^{-(\mu_j - s)/\tau} \sqrt{2\pi}\,\sigma_j.   (3.14)

The validity of the MLE fits was also assessed qualitatively by visual comparison of the observed and analytic distributions and quantitatively by calculation of the root mean squared (RMS) error of the fit with the observed distribution. In both cases, cumulative interval distributions are constructed by assuming that intervals that varied by less than the recording resolution are equivalent. The RMS error, E, may then be obtained as

[The excerpt breaks off here; the text below resumes mid-sentence in a different article from the same issue, "Stability-Based Validation of Clustering Solutions" by T. Lange, V. Roth, M. Braun, and J. Buhmann.]

[...] should be addressed in advance, for example, by an application-specific test of unimodality. Inclusion of such tests into validity measures can lead to unreliable results, as demonstrated in the experiments. However, the proposed stability index S̄ can be considered as a measure of confidence in solutions with k > 1. Unusually high instability for k > 1 supports the conclusion that only the trivial but uninformative solution (i.e., k = 1) contains "reliable" structure.

2.6 Using Stability in Practice. Ideally, one evaluates S̄ for several k and chooses the number of clusters k for which a minimum of S̄ is obtained. In practical applications, only one data set X is usually available. However, for evaluating S̄, an expected value (with regard to two data sets) has to be estimated using the given data. We propose here a simple subsampling scheme for emulating two independent data sets: iteratively split the data into two halves, and compare the solutions obtained for these halves. Compute estimates for S̄ for different k and identify the k* with the smallest estimate (in case of nonunique minima, the largest k is selected). This k* is chosen as the estimated number of clusters. A partition of the whole data set can be obtained by applying the clustering algorithm with the chosen k*. The steps of the procedure are summarized in Table 1.
A few comments are necessary to explain the subsampling scheme. The data should be split into two disjoint subsets because their overlap could otherwise already determine the group structure. This statistical dependence would lead to an undesirable, artificially induced stability. Hence, the use of bootstrapping can be dangerous as well in that it can lead to artificially low

Table 1: Overview of the Algorithm for Stability-Based Model Order Selection.

Repeat for each k in {k_min, ..., k_max}:
1. Repeat the following steps for r splits of the data to get an estimate Ŝ(A_k) by averaging:
   1a. Split the given data set into two halves X, X' and apply A_k to both.
   1b. Use (X, A_k(X)) to train the classifier φ and compute φ(X').
   1c. Calculate the distance of the two solutions φ(X') and A_k(X') for X' (see eq. 2.2).
2. Sample s random k-labelings, compare pairs of these, and compute the empirical average of the dissimilarities to estimate Ŝ(R_k).
3. Normalize each Ŝ(A_k) with Ŝ(R_k) to get an estimate for S̄(A_k).

Return k̂ = argmin_k S̄(A_k) as the estimated number of clusters.
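The procedure of Table 1 can be sketched as follows (illustrative, stdlib-only Python; a plain k-means and a nearest centroid classifier stand in for A_k and φ, the label distance is minimized by brute force over permutations, which is feasible only for small k, and all names are our own):

```python
import random
from itertools import permutations

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(X, k, rng, iters=50):
    """Minimal k-means, standing in for the clustering algorithm A_k.
    Returns (labels, centroids)."""
    cents = rng.sample(X, k)
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(x, cents[c])) for x in X]
        for c in range(k):
            pts = [x for x, l in zip(X, labels) if l == c]
            if pts:
                cents[c] = tuple(sum(v) / len(pts) for v in zip(*pts))
    return labels, cents

def disagreement(y1, y2, k):
    """Permutation-minimized fraction of label disagreements,
    brute-forced over all k! permutations (fine for small k)."""
    n = len(y1)
    return min(sum(p[a] != b for a, b in zip(y1, y2)) / n
               for p in permutations(range(k)))

def stability(X, k, rng, n_splits=10, n_rand=10):
    """Normalized stability index S_bar(A_k) following Table 1."""
    s_hat = 0.0
    for _ in range(n_splits):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        half = len(X) // 2
        X1 = [X[i] for i in idx[:half]]
        X2 = [X[i] for i in idx[half:2 * half]]
        _, cents = kmeans(X1, k, rng)   # A_k on X; centroids define phi
        phi = [min(range(k), key=lambda c: dist2(x, cents[c])) for x in X2]
        y2, _ = kmeans(X2, k, rng)      # A_k applied directly to X'
        s_hat += disagreement(phi, y2, k)
    s_hat /= n_splits
    # step 2: expected disagreement of random k-labelings
    s_rand = sum(disagreement([rng.randrange(k) for _ in range(50)],
                              [rng.randrange(k) for _ in range(50)], k)
                 for _ in range(n_rand)) / n_rand
    return s_hat / s_rand               # step 3: normalization
```

Evaluating `stability` over a range of k and taking the argmin (breaking ties toward the largest k, as the text specifies) yields the estimated number of clusters.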
disagreement between solutions. This is one of the reasons that the methods by Levine and Domany (2001) and Ben-Hur et al. (2002) can produce unreliable results (see section 3). Furthermore, the data sets should have (approximately) equal size so that an algorithm can find similar structure in both data sets. If there are too few samples in one of the two sets, the group structure might no longer be visible for a clustering algorithm. Hence, the option in Clest (see section 3) to have subsets of significantly different sizes can influence the assessment in a negative way and can render the obtained statistics useless in the worst case. Instead of the splitting scheme devised above, one could also fit a generative model to the data from which new (noisy) data sets can be resampled.

3 Experimental Evaluation

In this section, we provide empirical evidence for the usefulness of our approach to model order selection. By using toy data sets, we can study the behavior of the stability measure under well-controlled conditions. By applying our method to gene expression data, we demonstrate the competitive performance of the proposed stability index under real-world conditions. Preliminary results have been presented in Lange et al. (2003). For the experiments, a deterministic annealing variant of k-means (Rose, Gurewitz, & Fox, 1992) and path-based clustering (Fischer et al., 2001) optimized via an agglomerative heuristic are employed. Averaging is performed over r = s = 20 resamples and for 2 ≤ k ≤ 10. As discussed in section 2, the predictor should match the grouping principle. Therefore, we use (in conjunction with the stability index) the nearest centroid classifier for k-means and a variant of a nearest-neighbor classifier for path-based clustering that can be considered as a combination of single linkage and pairwise clustering (cf. Fischer et al., 2001; Hofmann & Buhmann, 1997). Concerning the classifier training, see Hastie, Tibshirani, and Friedman (2001).
We empirically compare our method to the Gap Statistic, Clest, Tibshirani's Prediction Strength, Levine and Domany's figure of merit, as well as the model explorer algorithm by Ben-Hur et al. (2002). These methods are described next, with an emphasis on differences and potential shortcomings. The results of this study are summarized in Tables 2 and 3. Here, k̂ is used to denote the estimated number of clusters.

Stability-Based Validation of Clustering Solutions

Table 2: Estimated Model Orders for the Toy Data Sets.

Data Set              Stability  Gap        Clest     Prediction  Levine's     Model      "True"
                      Method     Statistic            Strength    FOM          Explorer   Number k
3 gaussians           k̂ = 3      k̂ = 3      k̂ = 3     k̂ = 3       k̂ = 2, 3     k̂ = 2, 3   k = 3
5 gaussians           k̂ = 5      k̂ = 1      k̂ = 5     k̂ = 5       k̂ = 5        k̂ = 5      k = 5
3 rings k-means       k̂ = 7      k̂ = 1      k̂ = 7     k̂ = 1       k̂ = 8        —          k = 3
3 rings path-based    k̂ = 3      k̂ = 1      k̂ = 1     k̂ = 1       k̂ = 2, 3, 4  k̂ = 2, 3   k = 3
3 spirals k-means     k̂ = 6      k̂ = 1      k̂ = 10    k̂ = 1       k̂ = 6        k̂ = 6      k = 3
3 spirals path-based  k̂ = 3      k̂ = 1      k̂ = 6     k̂ = 1       k̂ = 2, 3     k̂ = 3, 6   k = 3

Table 3: Estimated Model Orders for the Gene Expression Data Sets.

Data Set                Stability  Gap        Clest    Prediction    Levine's  Model      "True"
                        Method     Statistic           Strength      FOM       Explorer   Number k
Golub et al. (1999)     k̂ = 3      k̂ = 10     k̂ = 3    k̂ = 2, 8, 10  k̂ = 2     k̂ = 2, 9   k ∈ {3, 2}
Alizadeh et al. (2000)  k̂ = 2      k̂ = 4      k̂ = 1    k̂ = 2         k̂ = 1     k̂ = 2      k = 3

3.1 Competing Methods

3.1.1 The Gap Statistic. Recently, the Gap Statistic has been proposed as a method for estimating the number of clusters (Tibshirani, Walther, & Hastie, 2001). It is not a model-free validation method since it encodes assumptions about the group structure to be extracted. The Gap Statistic relies on considering the total sum of within-class dissimilarities for a given number of clusters k, data set X, and clustering solution Y = A_k(X):

W_k := \sum_{1 \leq \nu \leq k} (2 n_\nu)^{-1} \sum_{i,j : Y_i = Y_j = \nu} D_{ij}.   (3.1)
Here, $D_{ij}$ denotes the dissimilarity between $X_i$ and $X_j$, and $n_\nu := |\{i \mid Y_i = \nu\}|$ the number of objects assigned to cluster $\nu$ by the labeling Y. If $X_i, X_j \in \mathbb{R}^d$ and $D_{ij} = \|X_i - X_j\|^2$, then $W_k$ corresponds to the squared-error criterion optimized by the k-means algorithm. The Gap Statistic investigates the relationship between $\log(W_k)$ for different values of k and the expectation of $\log(W_k)$ under a suitable null reference distribution through the definition of the gap:

$$\mathrm{gap}_k := E^*[\log(W_k)] - \log(W_k). \qquad (3.2)$$

Here, $E^*$ denotes the expectation under the null reference distribution (as in Tibshirani, Walther, & Hastie, 2001). A possible reference distribution is the uniform distribution on the smallest hyper-rectangle that contains the original data.

T. Lange, V. Roth, M. Braun, and J. Buhmann

In practice, the expected value is estimated by drawing B samples from the null distribution (B = 20 in the experiments); hence,

$$\widehat{\mathrm{gap}}_k := \underbrace{\frac{1}{B} \sum_b \log(W^*_{kb})}_{=: \bar{W}^*_k} - \log(W_k), \qquad (3.3)$$

where $W^*_{kb}$ is the total within-cluster scatter for the b-th data set drawn from the null reference distribution. Now let $\mathrm{std}_k$ be the standard deviation of the sampled $\log(W^*_{kb})$, and let $s_k := \mathrm{std}_k \sqrt{1 + 1/B}$. The Gap Statistic then selects the smallest number of clusters k for which the gap (corrected for the standard deviation) between $\bar{W}^*_k$ and $\log(W_k)$ is large:

$$\hat{k} := \min\{k \mid \widehat{\mathrm{gap}}_k \ge \widehat{\mathrm{gap}}_{k+1} - s_{k+1}\}. \qquad (3.4)$$
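As a concrete illustration, the following sketch implements equations 3.1 to 3.4 with a uniform reference distribution on the bounding hyper-rectangle; a simple farthest-first k-means stands in for the clustering algorithm $A_k$, and all names are ours rather than from any published package.

```python
import numpy as np

def kmeans_labels(X, k, iters=50, seed=0):
    # farthest-first initialization followed by Lloyd iterations
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def log_W(X, labels, k):
    # log of W_k (equation 3.1); with squared Euclidean dissimilarities this
    # equals the k-means squared-error criterion
    W = sum(((X[labels == v] - X[labels == v].mean(axis=0)) ** 2).sum()
            for v in range(k) if np.any(labels == v))
    return np.log(W)

def gap_statistic(X, k_max=10, B=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)    # bounding hyper-rectangle
    gap, s = [], []
    for k in range(1, k_max + 1):
        ref = []
        for _ in range(B):                   # reference data sets (eq. 3.3)
            Xb = rng.uniform(lo, hi, size=X.shape)
            ref.append(log_W(Xb, kmeans_labels(Xb, k), k))
        gap.append(np.mean(ref) - log_W(X, kmeans_labels(X, k), k))
        s.append(np.std(ref) * np.sqrt(1 + 1 / B))
    for k in range(1, k_max):                # selection rule (eq. 3.4)
        if gap[k - 1] >= gap[k] - s[k]:
            return k
    return k_max
```

Note that the uniform baseline is exactly the free parameter criticized below: on data whose clusters are not well matched to it, the gap curve can fail to rise and the rule returns k̂ = 1.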
Since the Gap Statistic relies on $W_k$ in equation 3.1, it presupposes spherically distributed clusters. Hence, it contains a structural bias that should be avoided. The null reference distribution is a free parameter of the method, which essentially determines when to vote for no structure in the data ($\hat{k} = 1$). The experiments in Tibshirani, Walther, and Hastie (2001), as well as the 5 gaussians data set in section 3.2, demonstrate a sensitivity to the choice of the baseline distribution.

3.1.2 Clest. An approach related to ours has been proposed by Dudoit and Fridlyand (2002). We mainly comment on the differences here. A given data set is repeatedly split into two disjoint sets, whose sizes are free parameters of the method. As we already pointed out in section 2, very unbalanced splitting schemes can lead to unreliable results. After clustering both data sets, a predictor is trained on one data set and tested on the other. The predictor is a free parameter of Clest; no guidance is given in Dudoit and Fridlyand (2002) concerning its choice. Unfortunately, the subsequent steps are largely determined by the reliability of the prediction step. A similarity measure for partitions is used to measure the similarity of the predicted labels to the cluster labels. The measure itself is again a free parameter, in contrast to the method proposed here. In the experiments, the Fowlkes and Mallows (FM) index (Fowlkes & Mallows, 1983) was used. Given a fixed k, the median $t_k$ of the similarity statistics obtained for B splits of the data is compared to those obtained for $B_0$ data sets drawn from a null reference distribution. The difference $d_k := t_k - t^0_k$ between the observed statistic $t_k$ and the average median statistic under the null hypothesis, $t^0_k := (1/B_0) \sum_b t_{k,b}$, together with $p_k := |\{b \mid t_{k,b} \ge t_k\}| B_0^{-1}$ as a measure of variation in the baseline samples, is used to select candidate numbers of clusters. The number of clusters $\hat{k}$ is chosen by bounding $p_k$ and $d_k$, yielding two additional free parameters $d_{min}$ and $p_{max}$. Dudoit and Fridlyand select

$$\hat{k} := \begin{cases} 1, & \text{if } K^- = \emptyset \\ \min K^-, & \text{otherwise} \end{cases} \qquad (3.5)$$

for $K^- := \{2 \le k \le M \mid p_k \le p_{max},\; d_k \ge d_{min}\}$. The whole procedure is repeated for $k \in \{2, \ldots, M\}$, where M is some predefined upper bound for the number of clusters. Note that the set $K^-$ is essentially determined by the bounds on $p_k$ and $d_k$; these can be chosen badly, so that $K^-$ is always empty, for example. We conclude that Clest comes with a large number of parameters that have to be set by the user. At the same time, little guidance is given on how to select reasonable parameter values in practice. This lack of parameter specification poses a severe practical problem, since the obtained statistics are of little value under poor parameter selections. In contrast, our method gives guidance concerning the most important parameters. Due to the large degrees of freedom in Clest, we consider it a conceptual framework, not a fully specified algorithm. For the experiments, we set $B = B_0 = 20$ and $p_{max} = d_{min} = 0.05$, and choose a simple linear discriminant analysis classifier as the predictor, following the choice in Dudoit and Fridlyand (2002).

3.1.3 Prediction Strength. This recently proposed method (Tibshirani, Walther, Botstein, et al., 2001) is related to ours. Again, the data are repeatedly split into two parts, and the nearest class centroid classifier is employed for prediction. The similarity of clustering solutions is assessed by essentially measuring the intersection of the two clusters in both solutions that match worst. A threshold on the average similarity score is used to estimate the number of clusters: the largest k is selected for which the average similarity is above the user-specified threshold. From a conceptual point of view as well as in practice, this procedure has two severe drawbacks: (1) due to the use of the nearest centroid predictor, it is reasonably applicable only to squared-error clustering, and (2) the similarity measure employed for comparing partitions can trivially drop to zero for large k.
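The selection rule of equation 3.5 can be sketched as follows, assuming the observed median statistics $t_k$ and the null statistics $t_{k,b}$ have already been computed; the function and argument names are illustrative.

```python
import numpy as np

def clest_select(t_obs, t_null, p_max=0.05, d_min=0.05):
    """Equation 3.5: t_obs maps k -> observed median similarity t_k;
    t_null maps k -> the B0 statistics t_{k,b} under the null."""
    K_minus = []
    for k in sorted(t_obs):
        t0 = float(np.mean(t_null[k]))      # average null statistic t0_k
        d_k = t_obs[k] - t0
        p_k = float(np.mean(np.asarray(t_null[k]) >= t_obs[k]))
        if p_k <= p_max and d_k >= d_min:   # candidate set K^-
            K_minus.append(k)
    return min(K_minus) if K_minus else 1
```

The sketch makes the criticism above tangible: with badly chosen bounds `p_max` and `d_min`, `K_minus` can stay empty for every k, and the method always returns 1.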
In particular, the latter point severely limits the applicability of the Prediction Strength method. In the experiments, we averaged over 20 splits, and the threshold was set to 0.9.

3.1.4 Model Explorer Algorithm. This method (Ben-Hur et al., 2002) clusters nondisjoint data sets: given a data set of size n, two subsamples of size $\lceil fn \rceil$ are generated, where $f \in (0.5, 1)$ ($f := 0.8$ in the experiments). The solutions obtained for these subsamples are compared on the intersection of the two sets. The similarity measure for partitions is a free parameter of the method (in the experiments, the FM index was used). To estimate k, the experimental section of Ben-Hur et al. (2002) suggests looking for "a jump in"

$$P\{S_k > \eta\} \approx \frac{1}{r} \sum_{j=1}^{r} \mathbf{1}\{s_{j,k} > \eta\}, \qquad (3.6)$$

where $S_k$ is the similarity score of two k-partitions and $s_{j,k}$ denotes the empirically measured similarity for the jth (out of r) subsample pair for some fixed $\eta$. We chose $\eta = 0.9$ in the experiments. Looking for a "jump" in the distribution is not a well-defined criterion for determining a number of clusters k, as it is qualitative in nature. This vagueness can result in situations where no decision can be made. Furthermore, the Model Explorer algorithm can be biased (toward smaller k) due to the overlapping data sets, where the shared data can potentially feign stability. As for the other methods, 20 pairs of subsamples were used in the experimental evaluation.

3.1.5 Levine and Domany's Resampling Approach. This method (described here in a simplified way) creates r subsamples of size $\lceil fn \rceil$, where $f \in [0, 1]$, from the original data (Levine & Domany, 2001). Solutions are computed for the full data and for the resamples. The authors define a figure of merit M that assesses the average similarity of the solutions obtained on the subsamples with the one obtained on the full sample. The matrix $T \in \{0, 1\}^{n \times n}$ with $T_{ij} := \mathbf{1}\{i \ne j \text{ and } i, j \text{ are in the same cluster}\}$, where $i, j \in \{1, \ldots, n\}$, is called the cluster connectivity matrix; it is a different representation of a clustering solution. The resampling results in a matrix T for the full data and r matrices $T^{(1)}, \ldots, T^{(r)}$ of size $\lceil fn \rceil \times \lceil fn \rceil$ for the subsamples. For the parameter k, the authors define their figure of merit as

$$M(k) := \frac{1}{r} \sum_{\rho=1}^{r} \frac{\sum_{i \in J_\rho} \sum_{j \in N_{\rho,i}} \delta_{T_{ij},\, T^{(\rho)}_{ij}}}{\sum_{i' \in J_\rho} |N_{\rho,i'}|}, \qquad (3.7)$$
where $J_\rho$ is the set of samples in the $\rho$th resample and $N_{\rho,i}$, $i \in J_\rho$, defines a neighborhood between samples. In the original article, the neighborhood definition was left as a free parameter. The criterion we have chosen for our experiments is the $\kappa$-mutual nearest neighbor definition (see Levine, 1999). M(k) measures the extent to which the grouping computed on a subsample agrees with the solution on the full data set; thus, M(k) = 1 indicates perfect agreement. The authors suggest choosing the parameter(s) k at which local maxima of M are observed. Several maxima can occur, and it is then unclear how to choose a single number of clusters. In the experiments, all of them are taken into consideration. We have set $f = 2/3$, $r = 20$, and $\kappa = 20$.
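Under the stated simplifications, the figure of merit of equation 3.7 can be sketched as follows. The neighborhood structure is passed in explicitly, and restricting a sample's neighbors to those that also appear in the subsample is our simplification of the κ-mutual-nearest-neighbor definition; all names are illustrative.

```python
import numpy as np

def connectivity(labels):
    # cluster connectivity matrix: T_ij = 1 iff i != j and i, j share a cluster
    L = np.asarray(labels)
    T = (L[:, None] == L[None, :]).astype(int)
    np.fill_diagonal(T, 0)
    return T

def figure_of_merit(full_labels, sub_labels, sub_indices, neighbors):
    """Simplified M(k) of equation 3.7: average, over resamples, of the
    fraction of neighborhood pairs on which the subsample solution agrees
    with the full-data solution."""
    T = connectivity(full_labels)
    scores = []
    for labels_rho, J in zip(sub_labels, sub_indices):
        pos = {i: p for p, i in enumerate(J)}   # index of i within the subsample
        T_rho = connectivity(labels_rho)
        agree = total = 0
        for i in J:
            for j in neighbors[i]:
                if j in pos:                    # neighbor was also resampled
                    agree += int(T[i, j] == T_rho[pos[i], pos[j]])
                    total += 1
        scores.append(agree / total if total else 1.0)
    return float(np.mean(scores))
```

A value of 1.0 corresponds to perfect agreement between a subsample solution and the full-data solution; the selection heuristic then looks for local maxima of this score over k.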
3.2 Experiments Using Toy Data. The first data set consists of three fairly well separated point clouds, generated from three gaussian distributions (25 points each were drawn from the first and the second and 50 points from the third); it has also been used by Dudoit and Fridlyand (2002), Tibshirani, Walther, Botstein et al. (2001), and Tibshirani, Walther, and Hastie (2001). For some k, for example, k = 5 in Figure 2B, the variance in the stability over different resamples is relatively high. This effect can be explained by the model mismatch: for k = 5, the clustering of the three classes depends highly on the subset selected in the resampling. We conclude that, apart from the absolute value of the stability index, additional information about the fit can be obtained from the distribution of the stability index values over the resampled subsets. For this data set, all methods under comparison are able to infer the "true" number of clusters k = 3. Figures 2A and 2B show the clustered data set and the proposed stability index.

Figure 2: Results of the stability index on the toy data (see section 3.2). (A) Clustering of the 3 gaussians data for k = 3. (B) The stability index for the 3 gaussians data set with k-means. (C) Clustering of the 5 gaussians data with the estimated k = 5. (D) The stability index for the 5 gaussians data set with k-means.

For k = 2, the
stability is relatively high due to the hierarchical structure of the data set, which enables stable merging of the two smaller subclusters with 25 data points. The second proof-of-concept data set consists of n = 800 points in R², drawn from 5 gaussians (400 from the center one and 100 from each of the four outer gaussians; see Figure 2C). All methods here estimate the correct number of clusters, k = 5, except for the Gap Statistic. In this example, the uniformity hypothesis chosen for the Gap Statistic is too strict: groupings of uniform data sets drawn from the baseline distribution have costs very similar to those of the actual data. Hence, there is no large gap for k > 1. Figure 2D shows the stability index for this data set. Note the high variability and the poor stability index value for k < 5. This high instability is caused by the merging of clusters that do not belong together; which clusters are merged depends solely on the current noise realization (i.e., on the current subsample). In the three-ring data set (depicted in Figures 3A and 3C), which was also used by Levine and Domany (2001), three ring-shaped clusters can be naturally distinguished. These clusters obviously violate the modeling assumption of k-means of spherically distributed clusters. With k = 7, k-means is able to identify the inner circle as a cluster. Thus, the stability for this number of clusters is highest (see Figure 3B). Clest infers k̂ = 7, and Levine's FOM suggests k̂ = 8, while the Gap Statistic and Prediction Strength estimate k̂ = 1. The Ben-Hur et al. (2002) method does not lead to any interpretable result. Applying the proposed stability estimator with path-based clustering on the same data set yields the highest stability for k = 3, the "correct" number of clusters (see Figures 3C and 3D). Here, most of the other methods fail and estimate k̂ = 1. The Gap Statistic fails here because it directly incorporates the assumption of spherically distributed data.
Similarly, the Prediction Strength measure and Clest (in the form used here) employ classifiers that support only linear decision boundaries, which obviously cannot discriminate between the three ring-shaped clusters. In all these cases, the basic requirement for a validation scheme is violated: it should not incorporate additional assumptions about the group structure in a data set that go beyond the assumptions of the clustering principle employed. Levine's figure of merit as well as Ben-Hur's method infer multiple numbers of clusters. Apart from this observation, it is noteworthy that the stability of k-means is significantly worse than that achieved with path-based clustering. This stability ranking indicates that the latter is the preferred choice for this data set. Again, the stability can be considered a confidence measure for solutions. Experiments on uniform data sets, for example, lead to very large values of the stability index (for all k > 1), rendering solutions on such data questionable.

Figure 3: Results of the stability index on the toy data (see section 3.2). (A) k-means clustering solution on the full three-ring data set for k = 7. (B) The stability index for the three-ring data set with k-means clustering. (C) Clustering solution on the full data set for k = 3 with path-based clustering. (D) The stability index for the three-ring data set with path-based clustering.

The spiral arm data set (shown in Figures 4A and 4C) consists of three clusters formed by three spiral arms. Note that the assumption of compact
clusters is violated again. With k-means, the stability measure overestimates the "true" number of clusters, since k̂ = 6 is returned (see Figure 4B). Most of the other methods fail in this case, estimating k̂ = 1, except for Clest, Levine's FOM, and the Model Explorer algorithm. Clest favors the 10-cluster solution; note, however, that this k̂ is returned only because the maximum number of clusters considered is 10. Levine's FOM and Ben-Hur's method both suggest k = 6, as our method does. When path-based clustering is employed, the "correct" number of clusters k = 3 is inferred by the stability-based approach (see Figure 4D). In this case, however, most competitors again fail to provide a useful estimate. Only the methods by Levine and Domany (2001) and Ben-Hur et al. (2002) estimate k = 3 among other numbers of clusters.

Figure 4: Results of the stability index on the toy data (see section 3.2). (A) Clustering solution on the full spiral arm data set for k = 6 with k-means. (B) The stability index for the three-spirals data set with k-means clustering. (C) Clustering solution on the full spiral arm data set for k = 3 with path-based clustering. (D) The stability index for this data set.

Again, the minimum stability index values for path-based clustering are significantly below the index values for k-means. This again indicates that the
stability index quantifies the reliability of clustering solutions and, hence, provides useful information for model (order) selection and for the validation of model assumptions.

3.3 Analysis of Gene Expression Data Sets. With the advent of microarrays, the expression levels of thousands of genes can be simultaneously monitored in routine biomedical experiments (Lander, 1999). Today, the huge amount of data arising from microarray experiments poses a major challenge for gaining biological or medical insights. Cluster analysis has turned out to be a useful and widely employed exploratory technique (see, e.g., Tamayo et al., 1999; Eisen, Spellman, Botstein, & Brown, 1998; Shamir & Sharan, 2001) for uncovering natural patterns of gene expression: for example, it is used to find groups of similarly expressed genes. Genes that are clustered together in this way are candidates for being coregulated and hence
are likely to be functionally related. Several studies have investigated their data using such an approach (e.g., Iyer et al., 1999). Another application domain is that of class discovery. Recently, several authors have investigated how novel tumor classes can be identified based exclusively on gene expression data (Golub et al., 1999; Alizadeh et al., 2000; Bittner et al., 2000). A fully Bayesian approach to class discovery is proposed in Roth and Lange (in press). Viewed as a clustering problem, a partition of the arrays is sought that collects samples of the same disease type in one class. Such a partitioning can be used in subsequent steps to identify indicative or explanatory genes for the different disease types. The final goal of many such studies is to detect marker genes that can be used to reliably classify and identify new cases. Due to the random nature of expression measurements, the main problem remains to assess and interpret the clustering solutions. We reinvestigate here the data sets studied by Golub et al. (1999) and Alizadeh et al. (2000), using the stability method to determine a suitable model order. Ground-truth information on the correct number of clusters is available for both data sets, which have also been reanalyzed by Dudoit and Fridlyand (2002).

3.3.1 Clustering of Leukemia Samples. Golub et al. (1999) studied in their analysis the problem of classifying acute leukemias. They used self-organizing maps (SOMs; see Duda et al., 2000) for the study of unsupervised cancer classification. Nevertheless, the important question of inferring an appropriate model order remains unaddressed in their article, since a priori knowledge is used to select the number of clusters k. In practice, however, such knowledge is often not available. Acute leukemias can be roughly divided into two groups, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). Furthermore, ALL can be subdivided into B-cell ALL and T-cell ALL. Golub et al.
(1999) used a data set of 72 leukemia samples (25 AML and 47 ALL, of which 38 are B-cell ALL samples).² For each sample, gene expression was monitored using Affymetrix expression arrays. Following the authors of the leukemia study, three preprocessing steps are applied to the expression data. First, all expression levels are restricted to the interval [100, 16,000], with values outside this interval being set to the boundary values. Second, genes are excluded for which the quotient of maximum and minimum expression levels across samples is smaller than 5 or the difference between maximum and minimum expression is smaller than 500. Finally, the data were log10-transformed. An additional step standardizes the samples so that they have zero mean and unit variance across genes. The resulting data set consisted of 3571 genes and 72 samples.
² Available online at http://www-genome.wi.mit.edu/cancer/.
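The preprocessing pipeline described above can be sketched as follows. This is a hedged illustration: the function and parameter names are ours, and a genes × samples matrix layout is assumed; the variance-based gene selection applied later in the analysis is included as a final step.

```python
import numpy as np

def preprocess(E, floor=100.0, ceil=16000.0, min_ratio=5.0, min_diff=500.0,
               n_top=100):
    """E: genes x samples matrix of raw expression levels."""
    E = np.clip(E, floor, ceil)                            # 1. threshold to [100, 16000]
    mx, mn = E.max(axis=1), E.min(axis=1)
    E = E[(mx / mn >= min_ratio) & (mx - mn >= min_diff)]  # 2. drop uninformative genes
    E = np.log10(E)                                        # 3. log10-transform
    E = (E - E.mean(axis=0)) / E.std(axis=0)               # standardize each sample
    order = np.argsort(E.var(axis=1))[::-1]                # highest-variance genes first
    return E[order[:n_top]]
```

Note that clipping to the floor value first also guards the ratio filter against division by zero for genes with very low raw expression.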
Figure 5: Stability index for the leukemia data set.
For the purpose of cluster analysis, the feature set was additionally reduced by retaining only the 100 genes with the highest variance across samples, because genes with low variance across samples are unlikely to be informative for the purpose of grouping. This step is adopted from Dudoit and Fridlyand (2002). The final data set consists of 100 genes and 72 samples. Cluster analysis has been performed with k-means in conjunction with the nearest centroid classifier. Figure 5 shows the stability curve of the analysis for 2 ≤ k ≤ 10. For k = 3, we obtain the lowest empirical stability index. We expect that clustering with k = 3 separates AML, B-cell ALL, and T-cell ALL samples from each other. Figure 6B shows the resulting labeling for this case. With respect to the known ground-truth labels, 91.6% of the samples (66 samples) are correctly classified (the bipartite matching is used again to map the clusters onto the ground-truth labeling). Note that for k = 2, similar stability is achieved. Hence, we cluster the data set again for k = 2 and compare the result with the ALL-AML labeling of the data. The result is shown in Figure 6A. Here, 86.1% of the samples (62 samples) are correctly identified. The Gap Statistic overestimates the "true" number of clusters, while Prediction Strength does not provide any useful information by returning k̂ = 1. Clest infers the same number of clusters as our method does. The Model Explorer algorithm reveals a bias toward a smaller number of clusters, while Levine's FOM generates several local minima on this data set (not including k = 3). We conclude that our method is able to infer biologically relevant model orders. Furthermore, it suggests the model order for which a high accuracy is achieved.
Had the true labeling been unknown, this reanalysis shows that the different cancer classes could have been discovered based exclusively on the expression data, by utilizing our model order selection principle together with k-means clustering.
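The bipartite matching used above to align cluster labels with ground-truth classes before computing accuracies can be sketched as follows; for the small k considered here, a brute-force search over label permutations suffices (the Hungarian method, Kuhn, 1955, solves the same assignment problem in polynomial time). The function name is illustrative.

```python
import numpy as np
from itertools import permutations

def matched_accuracy(pred, truth):
    """Agreement after optimally mapping predicted cluster labels onto
    ground-truth classes (maximum-weight bipartite matching, done here
    by brute force over the k! label permutations)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    k = int(max(pred.max(), truth.max())) + 1
    best = 0
    for perm in permutations(range(k)):
        mapped = np.array([perm[p] for p in pred])
        best = max(best, int((mapped == truth).sum()))
    return best / len(pred)
```

For example, a clustering that recovers the true classes under a relabeling scores 1.0, while mismatched samples lower the score proportionally.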
Figure 6: Results of the cluster analysis of the leukemia data set (100 genes, 72 samples) for the most stable model orders k = 2 (top) and k = 3 (bottom). The data are arranged according to the ground-truth labeling (the vertical bar labeled "True") for both k. The vertical bar "Pred" indicates the predicted cluster labels.
3.3.2 Clustering of Lymphoma Samples. Alizadeh et al. (2000) measured gene expression patterns for three different lymphoid malignancies, diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), and chronic lymphocytic leukemia (CLL), by utilizing a special microarray chip, the lymphochip. The lymphochip, a cDNA microarray, uses a set of genes that is of importance in the context of lymphoid diseases. The study in Alizadeh et al. (2000) produced measurements for 4682 genes and 62 samples (42 samples of DLBCL, 11 samples of CLL, and 9 samples of FL). The data set contained missing values, which were set to 0. After that, the data were standardized as above. Furthermore, the 200 genes with the highest variance across samples were selected for further processing, leading to a data set of 62 samples and 200 genes. Cluster analysis has been performed with k-means and the nearest centroid rule. The resulting stability curve is depicted in Figure 7. We estimate
Figure 7: Stability index for the lymphoma data set.
Figure 8: The lymphoma data set extracted from Alizadeh et al. (2000), together with a 2-means clustering solution and ground-truth information. Here, k-means almost perfectly separates DLBCL from CLL and FL. The bar labeled "Pred" indicates the cluster labels, while "True" refers to the ground-truth label information.
here k̂ = 2, since we obtain the best stability index value for this number of clusters. Note that this contradicts the ground-truth number of clusters, k = 3. A closer look at the solutions (see Figure 8) reveals, however, that the split of the data produced by k-means with k = 2 is biologically plausible, since it separates DLBCL samples from FL and CLL samples. Furthermore, a 3-means solution does not separate DLBCL, FL, and CLL but instead splits the DLBCL cluster, generating two clusters consisting of FL, CLL, and DLBCL samples and one cluster of DLBCL samples only. In total, we get an agreement with the ground-truth labeling of over 98% for k = 2 but of only ≈ 57% for k = 3. Hence, our method
estimated the k for which the most reliable grouping with k-means can be achieved. Clest and Ben-Hur's method achieve the same result as we do. Levine's FOM leads to estimates of k = 2 and k = 9. The Prediction Strength method suggests an uninformative one-cluster solution, while the Gap Statistic proposes k = 4; the corresponding k-means grouping again mixes the three different classes in the same clusters. Again, we conclude that we find a biologically plausible partitioning of the data in an unsupervised way. Furthermore, the stability index has selected the number of clusters for which the most consistent data partition with regard to the ground truth is extracted by k-means clustering.
4 Conclusion

We have introduced the concept of stability for assessing the quality of clustering solutions and successfully applied it to the model order selection problem in unsupervised learning. The stability measure quantifies the expected dissimilarity of two solutions, where one solution is extended by a predictor to the other one. In contrast to other approaches, the important role of the predictor is appreciated in the concept of cluster stability. Under the assumption that the labels produced by the clustering algorithm are the true labels, the disagreement can be interpreted as the attainable misclassification risk for two independent data sets from the same source. Normalizing the stability measure by the stability costs of a random predictor allows us to assess the suitability of different numbers of clusters k for a given data set and clustering algorithm in an objective way. In order to estimate the stability costs in practice, an empirical estimator is used that emulates independent samples by resampling. Experiments were conducted on simulated data and on well-known microarray data sets (Alizadeh et al., 2000; Golub et al., 1999). On the toy data, the stability method demonstrated its competitive performance under well-controlled conditions; furthermore, we have pointed out where and why competing methods fail or provide unreliable answers. Concerning the gene expression data sets, our reanalysis shows that the stability index is a suitable technique for identifying reliable partitions of the data. The final groupings are in accordance with biological prior knowledge. We conclude that our validation scheme leads to reliable results and is therefore appropriate for assessing the quality of clustering solutions, particularly in biological applications, where prior knowledge is rarely available.
Acknowledgments

This work has been supported by the German Research Foundation, grants Buh 914/4 and Buh 914/5.
References

Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, A., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J. Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O., & Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
Ben-Hur, A., Elisseeff, A., & Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing 2002 (pp. 6–17). Singapore: World Scientific.
Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., Sampas, N., Dougherty, E., Wang, E., Marincola, F., Gooden, C., Lueders, J., Glatfelter, A., Pollock, P., Carpten, J., Gillanders, E., Leja, D., Dietrich, K., Beaudry, C., Berens, M., Alberts, D., & Sondak, V. (2000). Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406(3), 536–540.
Breckenridge, J. (1989). Replicating cluster analysis: Method, consistency and validity. Multivariate Behavioral Research, 24, 147–161.
Buhmann, J. M. (1995). Data clustering and learning. In M. Arbib (Ed.), Handbook of brain theory and neural networks (pp. 278–282). Cambridge, MA: MIT Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. New York: Wiley.
Dudoit, S., & Fridlyand, J. (2002). A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology, 3(7). Available online: http://genomebiology.com/2002/317/research/0036.
Eisen, M., Spellman, P., Botstein, D., & Brown, P. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.
Fischer, B., Zöller, T., & Buhmann, J. M. (2001). Path based pairwise data clustering with application to texture segmentation. In M. A. T. Figueiredo, J. Zerubia, & A. K. Jain (Eds.), Energy minimization methods in computer vision and pattern recognition (LNCS). Berlin: Springer-Verlag.
Fowlkes, E., & Mallows, C. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78, 553–584.
Fraley, C., & Raftery, A. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer Journal, 41(8), 578–588.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., & Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Gordon, A. D. (1999). Classification (2nd ed.). London: Chapman & Hall.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer-Verlag.
Hofmann, T., & Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), 1–14.
Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Hudson, J. Jr., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D., & Brown, P. O. (1999). The transcriptional program in the response of human fibroblasts to serum. Science, 283(1), 83–87.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Upper Saddle River, NJ: Prentice Hall.
Jain, A. K., Murty, M., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 265–323.
Kuhn, H. (1955). The Hungarian method for the assignment problem. Naval Res. Logist. Quart., 2, 83–97.
Lander, E. (1999). Array of hope. Nature Genetics Supplement, 21, 3–4.
Lange, T., Braun, M., Roth, V., & Buhmann, J. (2003). Stability-based model selection. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 617–624). Cambridge, MA: MIT Press.
Levine, E. (1999). Unsupervised estimation of cluster validity: Methods and applications. Master's thesis, Weizmann Institute of Science.
Levine, E., & Domany, E. (2001). Resampling method for unsupervised estimation of cluster validity. Neural Computation, 13, 2573–2593.
Rose, K., Gurewitz, E., & Fox, G. (1992). Vector quantization and deterministic annealing. IEEE Trans. Inform. Theory, 38(4), 1249–1257.
Roth, V., & Lange, T. (in press). Bayesian class discovery in microarray datasets. IEEE Transactions on Biomedical Engineering.
Shamir, R., & Sharan, R. (2001). Algorithmic approaches to clustering gene expression data. In T. Jiang, T. Smith, Y. Xu, & M. Q. Zhang (Eds.), Current topics in computational biology. Cambridge, MA: MIT Press.
Sharan, R., & Shamir, R. (2000). CLICK: A clustering algorithm with applications to gene expression analysis. In ISMB'00 (pp. 307–316). Menlo Park, CA: AAAI Press.
Smyth, P. (1998). Model selection for probabilistic clustering using cross-validated likelihood (Tech. Rep. 98-09). Irvine, CA: Information and Computer Science, University of California, Irvine.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S., & Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. PNAS, 96, 2907–2912.
Tibshirani, R., Walther, G., Botstein, D., & Brown, P. (2001). Cluster validation by prediction strength (Tech. Rep.). Stanford, CA: Statistics Department, Stanford University.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters via the gap statistic. J. Royal Statist. Soc. B, 63(2), 411–423.
Yeung, K., Haynor, D., & Ruzzo, W. (2001). Validating clustering for gene expression data. Bioinformatics, 17(4), 309–316.

Received May 15, 2003; accepted November 24, 2003.
NOTE
Communicated by Kechen Zhang
Bayesian Estimation of Stimulus Responses in Poisson Spike Trains Sidney R. Lehky
[email protected] Cognitive Brain Mapping Laboratory, RIKEN Brain Science Institute, Saitama 3510198, Japan, and Laboratory of Brain and Cognition, National Institute of Mental Health, Bethesda, MD 20892, U.S.A.
A Bayesian method is developed for estimating neural responses to stimuli, using likelihood functions incorporating the assumption that spike trains follow either pure Poisson statistics or Poisson statistics with a refractory period. The Bayesian and standard estimates of the mean and variance of responses are similar and asymptotically converge as the size of the data sample increases. However, the Bayesian estimate of the variance of the variance is much lower. This allows the Bayesian method to provide more precise interval estimates of responses. Sensitivity of the Bayesian method to the Poisson assumption was tested by conducting simulations perturbing the Poisson spike trains with noise. This did not affect Bayesian estimates of mean and variance to a significant degree, indicating that the Bayesian method is robust. The Bayesian estimates were less affected by the presence of noise than estimates provided by the standard method.

1 Introduction

The goal here is to use Bayesian methods to improve estimates of neural responses. The Bayesian analysis accomplishes this by incorporating additional information about spike train statistics into the analysis relative to standard methods, which simply calculate moments (e.g., mean, variance) of the data. In this analysis, we incorporate the assumption that spike counts are Poisson distributed. There have been previous applications of Bayesian methods to neural data (Brown et al., 1998; Martignon et al., 2000; Sanger, 1996; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998). However, the focus of those studies was determining the most likely stimulus given a set of responses within a neural population. Here we are not concerned with determining the most likely stimulus, but rather move the analysis to a lower level and infer the probability distribution of the response firing rate to one particular stimulus, given observations of spike counts within fixed time periods.
The emphasis is on practical data analysis methods rather than theoretical issues in neural coding.

Neural Computation 16, 1325–1343 (2004) © 2004 Massachusetts Institute of Technology
1326
S. Lehky
The Bayes formulation for the case at hand is

    prob(λ | n, T) = prob(n | λ, T) prob(λ, T) / prob(n, T),                       (1.1)

where n is the observed spike count during time interval T and λ is the estimated spike rate. Omitting the normalization factor prob(n, T) in the denominator leaves the proportionality:

    prob(λ | n, T) ∝ prob(n | λ, T) prob(λ, T)
    posterior ∝ likelihood × prior.                                                (1.2)
2 The Likelihood Function

The experimental situation we envisage is that a particular stimulus is presented during multiple trials. Within each trial, there is a prestimulus period, when activity of the neuron is at its spontaneous level, and a stimulus period. Spike counts from the two periods are used to form likelihood functions for the spontaneous and stimulus firing rates, λ_spont and λ_stim. From these, the likelihood function of the response λ_stim − λ_spont is derived. The Poisson probability density function (pdf) we assume gives the distribution of spike count n during period T, given mean spike count λT:

    prob(n | λ, T) = ((λT)^n / n!) e^{−λT}.                                        (2.1)
What we actually need for a Bayesian estimate, however, is the likelihood of λ, which involves keeping the same equation but now making n a constant and λ a variable, opposite to the situation in equation 2.1. This change in perspective converts equation 2.1 into a gamma-distribution-shaped likelihood function of λ, with shape parameter n + 1 and scale parameter 1/T:

    prob(n | λ, T) = likelihood(λ | n, T) = ((λT)^n / Γ(n + 1)) e^{−λT}            (2.2a)
                                          = C λ^n e^{−λT}.                         (2.2b)
(Multiplying this likelihood function by the constant factor T would normalize it to a gamma pdf.) In equation 2.2b, all the multiplicative constants are collected into C. The precise value of C is unimportant for likelihood calculations, as we are interested in relative rather than absolute values of the function. Assuming prestimulus and stimulus spike rates are independent, the joint likelihood of λ_spont and λ_stim is formed by multiplying their individual likelihoods:

    likelihood_i(λ_stim, λ_spont | n^i_stim, n^i_spont, T_stim, T_spont)
        = C_i λ_stim^{n^i_stim} λ_spont^{n^i_spont} e^{−(λ_stim T_stim + λ_spont T_spont)}          (2.3)
Bayesian Estimation of Neural Responses
1327
(for the ith stimulus presentation). All multiplicative constants are grouped into C_i, whose value again is not important. The joint likelihood for N stimulus trials is then found by multiplying the likelihood functions for the individual trials:

    likelihood(λ_stim, λ_spont | n⃗_stim, n⃗_spont, T_stim, T_spont)
        = ∏_{i=1}^{N} [C_i λ_stim^{n^i_stim} λ_spont^{n^i_spont} e^{−(λ_stim T_stim + λ_spont T_spont)}].    (2.4)
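As a concrete numerical sketch (invented counts and windows, not the article's code), the joint log-likelihood of equation 2.4 can be evaluated on a grid of candidate rates and maximized:

```python
import numpy as np

def joint_log_likelihood(lam_stim, lam_spont, n_stim, n_spont, t_stim, t_spont):
    """Log of equation 2.4 with the constants C_i dropped: summing per-trial
    log-likelihoods is the same as multiplying the per-trial likelihoods."""
    n_trials = len(n_stim)
    return (np.sum(n_stim) * np.log(lam_stim)
            + np.sum(n_spont) * np.log(lam_spont)
            - n_trials * (lam_stim * t_stim + lam_spont * t_spont))

# Hypothetical counts from 3 trials with 1 s stimulus and prestimulus windows.
n_stim = np.array([24, 20, 26])
n_spont = np.array([8, 7, 9])
grid = np.linspace(0.1, 60.0, 600)
ll = joint_log_likelihood(grid[:, None], grid[None, :], n_stim, n_spont, 1.0, 1.0)
i, j = np.unravel_index(np.argmax(ll), ll.shape)
print(round(grid[i], 1), round(grid[j], 1))  # peak near mean count / T for each rate
```

As expected for a Poisson model, the grid maximum sits at the total count divided by the total observation time for each rate.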
3 Prior Probability Distribution

We define two prior distributions as examples. One possibility, the agnostic assumption, is that all response magnitudes are equally likely. This is represented by a two-dimensional uniform distribution,

    priorProb_Ag(λ_stim, λ_spont) = 1/a²,  0 ≤ λ_spont, λ_stim ≤ a,
                                  = 0,     otherwise,                              (3.1)

with a defining a biophysically plausible range of firing rates. Another possibility, the skeptical assumption, is that there is no stimulus response. That is, λ_spont and λ_stim are on average identical. Given the long-term mean spontaneous spike rate, λ̄_spont, this distribution, analogous to equation 2.3, is:

    priorProb_Sk(λ_stim, λ_spont | λ̄_spont, T_stim, T_spont)
        = C (λ_stim)^{λ̄_spont T_stim} (λ_spont)^{λ̄_spont T_spont} e^{−(λ_stim T_stim + λ_spont T_spont)},    (3.2)
where C is the normalizing constant. Besides these example prior distributions, others may be suggested by the particular aspects of an experiment.

4 Posterior Probability Distribution

The posterior probability of the firing rates is proportional to the product of the prior probability (see equation 3.1 or 3.2) and the joint likelihood function (see equation 2.4):

    postProb(λ_stim, λ_spont | n⃗_stim, n⃗_spont, T_stim, T_spont)
        = C [priorProb(λ_stim, λ_spont)
             × likelihood(λ_stim, λ_spont | n⃗_stim, n⃗_spont, T_stim, T_spont)],          (4.1)

in which C normalizes the area under postProb(λ_stim, λ_spont).
What we require, however, rather than the joint pdf of λ_spont and λ_stim, is the distribution of the response λ_resp = λ_stim − λ_spont. This is accomplished by first changing variables from prob(λ_stim, λ_spont) to prob(λ_stim − λ_spont, λ_stim + λ_spont), which involves a rotation and dilation of the coordinate axes. Then prob(λ_stim − λ_spont, λ_stim + λ_spont) is integrated along the λ_stim + λ_spont axis to give the marginal distribution prob(λ_stim − λ_spont). For convenience, we relabel the transformed variables as follows:

    λ_sum = λ_stim + λ_spont
    λ_dif = λ_stim − λ_spont.                                                      (4.2)
Under the transformed variables, the likelihood function for a single trial (analogous to equation 2.3) becomes:

    likelihood_i(λ_sum, λ_dif | n^i_stim, n^i_spont, T_stim, T_spont)
        = C_i (λ_sum + λ_dif)^{n^i_stim} (λ_sum − λ_dif)^{n^i_spont}
          × e^{−(1/2)[(λ_sum + λ_dif) T_stim + (λ_sum − λ_dif) T_spont]}.          (4.3)
The joint likelihood over N stimulus trials is then found by taking the product over individual trials, analogous to equation 2.4. The agnostic prior, equation 3.1, remains essentially unchanged under the transformed coordinates because it is a constant. The transformed equation for the skeptical prior, equation 3.2, becomes:

    priorProb_Sk(λ_sum, λ_dif | λ̄_spont, T_stim, T_spont)
        = C (λ_sum + λ_dif)^{λ̄_spont T_stim} (λ_sum − λ_dif)^{λ̄_spont T_spont}
          × e^{−(1/2)[(λ_sum + λ_dif) T_stim + (λ_sum − λ_dif) T_spont]}.          (4.4)
Given the transformed likelihood and prior probability functions, the posterior probability is

    postProb(λ_sum, λ_dif | n⃗_stim, n⃗_spont, T_stim, T_spont)
        = C [priorProb(λ_sum, λ_dif)
             × likelihood(λ_sum, λ_dif | n⃗_stim, n⃗_spont, T_stim, T_spont)],             (4.5)

where again C normalizes the distribution. Integrating equation 4.5 with respect to λ_sum takes us from postProb(λ_sum, λ_dif) to postProb(λ_dif):

    postProb(λ_dif | n⃗_stim, n⃗_spont, T_stim, T_spont)
        = C ∫ priorProb(λ_sum, λ_dif)
            × likelihood(λ_sum, λ_dif | n⃗_stim, n⃗_spont, T_stim, T_spont) dλ_sum.        (4.6)
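Under the agnostic (uniform) prior, the chain from equation 4.1 to equation 4.6 can be carried out numerically. The sketch below (invented counts, not the article's code) computes the posterior on a (λ_stim, λ_spont) grid and marginalizes onto λ_dif directly, which is equivalent to the rotation-and-integration route:

```python
import numpy as np

# Hypothetical single-trial counts in 1 s stimulus and prestimulus windows.
n_stim, n_spont, t_stim, t_spont = 24, 8, 1.0, 1.0

lam = np.linspace(0.01, 60.0, 800)             # grid spanning the agnostic prior [0, a]
ls, lp = np.meshgrid(lam, lam, indexing="ij")  # ls = lam_stim, lp = lam_spont

# Joint posterior (equation 4.1) under a flat prior: proportional to equation 2.3.
log_post = n_stim * np.log(ls) + n_spont * np.log(lp) - (ls * t_stim + lp * t_spont)
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Marginal over lam_dif = lam_stim - lam_spont: weight each grid cell's
# rate difference by its posterior mass (equations 4.2-4.6, done by binning).
dif = (ls - lp).ravel()
w = post.ravel()
mean_dif = np.sum(dif * w)
var_dif = np.sum((dif - mean_dif) ** 2 * w)
print(round(mean_dif, 1), round(var_dif, 1))
```

With a flat prior each rate has a gamma-shaped posterior (shape n + 1, rate T), so the response mean here is (24 + 1) − (8 + 1) = 16 spikes/s and the response variance is the sum of the two posterior variances, 25 + 9 = 34.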
This is the final result we seek: the posterior probability distribution of the stimulus response.

5 Refractory Periods

Thus far, the analysis has been carried out for a pure Poisson process. A significant perturbation away from this ideal is caused by the existence of a refractory period. The refractory period acts to regularize a spike train, reducing the variance of spike counts within a fixed period, and therefore the variance of observed firing rates. Given mean interspike interval t̄_ISI and refractory period t_ref, the fractional variance relative to a pure Poisson process is

    ν = Var(λ)_ref / Var(λ)_pure = ((t̄_ISI − t_ref) / t̄_ISI)².                    (5.1)

(By definition of t_ref, t̄_ISI < t_ref is impossible.) Since t̄_ISI = T/n:

    ν = (1 − t_ref n / T)².                                                        (5.2)
To take into account the refractory period, therefore, the variance of the likelihood function (as well as of nonuniform priors) should be reduced by the refractory period correction factor ν. A modification to the likelihood function of equation 2.2b that approximates that transform is:

    likelihood(λ | n, T, ν) = C λ^{(n+1)/ν − 1} e^{−λT/ν},                         (5.3)

which in the limit ν → 0 tends to δ(λ − (n + 1)/T).
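As a numerical illustration (invented numbers, not from the article), the correction factor of equation 5.2 and its effect on the gamma-shaped likelihood variance can be computed directly:

```python
def refractory_correction(n, T, t_ref):
    """Equation 5.2: nu = (1 - t_ref * n / T) ** 2."""
    return (1.0 - t_ref * n / T) ** 2

# The gamma-shaped likelihood of equation 2.2b has variance (n + 1) / T**2.
# Dividing both its shape and rate by nu multiplies that variance by nu:
# shape/rate**2 -> ((n + 1)/nu) / (T/nu)**2 = nu * (n + 1) / T**2.
n, T, t_ref = 24, 1.0, 0.003  # 24 spikes in 1 s, 3 ms refractory period
nu = refractory_correction(n, T, t_ref)
var_pure = (n + 1) / T ** 2
var_corrected = nu * var_pure
print(round(nu, 4), round(var_corrected, 2))
```

For these example values the correction factor is (1 − 0.072)² ≈ 0.86, so the refractory period shrinks the likelihood variance by roughly 14%.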
Parameters for the spontaneous firing rate were μ = 2.02, σ = 0.35, and for the stimulus rate, μ = 3.15, σ = 0.22. Those values satisfied two criteria. First, the mean spike densities were the same as used previously, r̄_spont = 8 and r̄_stim = 24, and second, the Fano factors were approximately 2.0 in both cases. Using noise-perturbed spike trains, stimulus response statistics were calculated as before. Results for the two noise conditions are given in Tables 3 and 4, to be compared with results for homogeneous Poisson trains in Table 2.

Comparison of Tables 2 and 3 shows that fast-fluctuation noise has no effect on either Bayesian or standard estimates, not surprising given that on extended timescales, spike count is conserved under this noise. Slow-fluctuation noise produces no significant change in Bayesian estimates of E(Var(λ_resp)), as a comparison of Tables 2 and 4 shows. On the other hand, standard estimates of E(Var(λ_resp)) were increased substantially. Thus, under this noise condition, Bayesian variance estimates become more accurate than standard ones (closer to the noise-free estimates), in addition to being more precise (smaller variance of the variance). In the slow-fluctuation noise case, we have spike trains that appear non-Poisson by one criterion, a Fano factor of two, yet the estimates of a Poisson-based model are little affected, as spike generation is still Poisson at a local timescale. The insensitivity of Bayesian estimates to either type of noise perturbation away from a homogeneous Poisson process indicates the robustness of the method.

Table 3: Response Statistics Using Inhomogeneous Poisson Spike Trains with a 3 Msec Refractory Period and Whose Firing Rates Are Perturbed with Fast-Fluctuation Noise.

            E(λ_resp)             E(Var(λ_resp))        Var(Var(λ_resp))
Trials   Standard  Bayesian    Standard  Bayesian    Standard  Bayesian
1          16.1      16.1         –        63.1         –        133
2          16.0      16.0       28.8       29.9       1574       16.8
5          16.0      16.0       11.6       11.6       68.2       1.11
10         16.0      16.0       5.82       5.72       7.64      0.138

Notes: Fast-fluctuation noise varies rapidly relative to the duration of a stimulus trial, as defined in equation 8.1. Simulation parameters are as given in Table 1.

Table 4: Response Statistics Using Inhomogeneous Poisson Spike Trains with a 3 Msec Refractory Period and Whose Firing Rates Are Perturbed with Slow-Fluctuation Noise.

            E(λ_resp)             E(Var(λ_resp))        Var(Var(λ_resp))
Trials   Standard  Bayesian    Standard  Bayesian    Standard  Bayesian
1          16.2      16.2         –        62.8         –        207
2          15.9      15.9       46.6       29.8       4520       26.7
5          16.0      16.0       18.9       11.5        201       1.76
10         16.1      16.1       9.36       5.71       21.4      0.225

Notes: Slow-fluctuation noise is slow relative to the duration of a stimulus trial, as defined in equation 8.2. Simulation parameters are as given in Table 1.

As a final item on noise-perturbed spike trains, we apply the Bayesian method to actual data recorded from monkey striate cortex (see Table 5). In this case, E(λ_resp) is slightly different for the two methods because T_stim ≠ T_spont (see equation 6.4). The pattern of E(Var(λ_resp)) values for the actual data appears to follow those of the simulations with slow-fluctuation noise added, in which Bayesian variances are substantially smaller than standard variances. We noted previously that under the slow-fluctuation noise condition, the smaller Bayesian variance estimates are more accurate than the standard estimates.

Table 5: Example of Bayesian Estimation Applied to Actual Data.

            E(λ_resp)             E(Var(λ_resp))        Var(Var(λ_resp))
Trials   Standard  Bayesian    Standard  Bayesian    Standard  Bayesian
5          48.1      47.9       89.4       47.8       2230       19.7

Notes: The data consisted of 10 trials recorded from a unit in macaque striate cortex, presented with grating stimuli. The Fano factor was 1.7. Bayesian estimation was applied to 50 replications of a bootstrap resampling of the data, with 5 of the 10 trials chosen at random with replacement for each resampling.

9 Discussion

The Bayesian and standard methods provide similar estimates for the mean E(λ_resp) and variance E(Var(λ_resp)) of the response firing rate for parameters typical of experimental conditions. The estimates of both methods converge
as the data sample size increases, by increasing either the number of trials or the duration of each trial, so that the Bayesian model is asymptotically unbiased. Given a uniform prior, the Bayesian estimate is a maximum likelihood estimate and therefore asymptotically efficient (given certain regularity conditions) (Kay, 1993). That is, the Bayesian E(Var(λ_resp)) estimate asymptotically approaches the Cramér-Rao bound. As the Bayesian estimate also asymptotically approaches the standard estimate, the implication is that the standard estimate is, at minimum, also asymptotically efficient.

The Bayesian method provides much smaller estimates of the variance of the variance, Var(Var(λ_resp)), than the standard method, despite having almost identical values of expected variance E(Var(λ_resp)). That is because the two methods compute variances using entirely different procedures. Under the standard method, the variance of the firing rate Var(λ) is computed directly from the sample of λ itself by plugging those values into the standard formula for variance, without making assumptions about the distribution of λ. Under the Bayesian method, statistics are calculated indirectly from the data in a two-step process, by first fitting a model (the likelihood model) and then determining statistics from the model curve fit rather than directly from the data.

We can identify three benefits of using the Bayesian analysis relative to the standard analysis. The first is smaller values for the variance of the variance. This allows more precise interval estimates of a parameter, which in turn improves the precision of tests of significance. The second is reduced sensitivity of variance estimates to perturbations in the data caused by noise. This can be seen by comparing Tables 2 and 4. The final benefit is the ability to estimate response variance after only a single trial. Underlying the Bayesian analysis was the assumption that spike trains have Poisson statistics.
Sensitivity to this assumption was tested by perturbing spike trains away from Poisson statistics with two kinds of noise, whose fluctuations were fast or slow relative to the timescale of a single trial. Despite the noise perturbations, Bayesian estimates of response mean and variance were not significantly affected. This indicates that the Bayesian method is robust. Given the benefits of Bayesian analysis listed above and the robustness of the method, it appears to be a promising candidate for practical data analysis.
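The contrast between the two procedures can be sketched in a simplified single-rate simulation (a flat-prior sketch with invented parameters, not the article's code): the standard method takes sample moments of the per-trial rates, while the Bayesian method reads the variance off the gamma-shaped posterior implied by the total count.

```python
import numpy as np

rng = np.random.default_rng(0)
lam_true, T, n_trials = 16.0, 1.0, 5  # hypothetical rate, window, trial count

# Simulated spike counts for one stimulus condition.
counts = rng.poisson(lam_true * T, size=n_trials)

# Standard method: moments taken directly from the sample of firing rates.
rates = counts / T
std_var = rates.var(ddof=1)

# Bayesian method, flat prior: pooling trials gives a posterior over lam that
# is gamma-shaped with shape sum(n) + 1 and rate n_trials * T, so its variance
# is available in closed form -- even from a single trial.
bayes_var = (counts.sum() + 1) / (n_trials * T) ** 2
print(std_var, bayes_var)
```

The standard estimate fluctuates from sample to sample far more than the model-based one, which is the "variance of the variance" advantage discussed above.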
Acknowledgments

I thank Keiji Tanaka and Leslie Ungerleider for their support. Portions of this work were completed while I was visiting the Sloan-Schwartz Center for Theoretical Neurobiology at the Salk Institute.
References

Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. J. Neurosci., 18, 7411–7425.
Kay, S. M. (1993). Fundamentals of statistical signal processing: Estimation theory. Upper Saddle River, NJ: Prentice Hall.
Martignon, L., Deco, G., Laskey, K., Diamond, M., Freiwald, W., & Vaadia, E. (2000). Neural coding: Higher-order temporal patterns in the neurostatistics of cell assemblies. Neural Comput., 12, 2621–2653.
Sanger, T. D. (1996). Probability density estimation for the interpretation of neural population codes. J. Neurophysiol., 76, 2790–2793.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol., 79, 1017–1044.

Received April 16, 2003; accepted December 3, 2003.
NOTE
Communicated by John Platt
Comments on “A Parallel Mixture of SVMs for Very Large Scale Problems” Xiaomei Liu
[email protected] Department of Computer Science and Engineering, University of Notre Dame, South Bend, IN 46556, U.S.A.
Lawrence O. Hall
[email protected] Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620, U.S.A.
Kevin W. Bowyer
[email protected] Department of Computer Science and Engineering, University of Notre Dame, South Bend, IN 46556, U.S.A.
Collobert, Bengio, and Bengio (2002) recently introduced a novel approach to using a neural network to provide a class prediction from an ensemble of support vector machines (SVMs). This approach has the advantage that the required computation scales well to very large data sets. Experiments on the Forest Cover data set show that this parallel mixture is more accurate than a single SVM, with 90.72% accuracy reported on an independent test set. Although this accuracy is impressive, their article does not consider alternative types of classifiers. We show that a simple ensemble of decision trees results in a higher accuracy, 94.75%, and is computationally efficient. This result is somewhat surprising and illustrates the general value of experimental comparisons using different types of classifiers.

1 Introduction

Support vector machines (SVMs) are most directly used for two-class classification problems (Vapnik, 1995). They have been shown to have high accuracy in a number of problem domains, for example, character recognition (Decoste & Scholkopf, 2002). The training time required by SVMs is an issue for large data sets. Collobert, Bengio, and Bengio (2002) introduced a method using a mixture of SVMs created in parallel to address the issue of scaling to large data sets. Results were reported on an example data set, the Forest Cover Type data set from the UC Irvine repository (Blake & Merz,

Neural Computation 16, 1345–1351 (2004) © 2004 Massachusetts Institute of Technology
1346
X. Liu, L. Hall, and K. Bowyer
1998). The training data were converted to a two-class problem. It was shown that the mixture of SVMs is more accurate than a single SVM, is faster to train, and has the potential to scale to large data sets. The accuracy from training on 100,000 examples is shown to be 90.72% on an unseen test set of 50,000 examples. The parallel mixture of SVMs was not compared with other classifiers. In this comment, we compare an ensemble of randomized C4.5 decision trees (Dietterich, 1998) to the parallel mixture of SVMs and, perhaps contrary to our expectations and those of others, find that the ensemble of decision trees results in a more accurate classifier. Further, decision trees scale reasonably well to large data sets (Chawla et al., 2003). This result seems to reinforce the idea that it is always useful to compare a classifier to other approaches. In the next section, we briefly discuss the two ensemble classifiers compared. Section 3 provides the details of our experiments and our experimental results. Section 4 presents the discussion and our conclusion.

2 Background

2.1 SVM and a Parallel Mixture of SVMs. The SVM was introduced by Vapnik (1995). SVM classifiers are used to solve two-class classification problems. The training time for an SVM is high: it is at least quadratic in the number of training patterns. In order to decrease the time cost of the SVM, Collobert et al. (2002) proposed a parallel mixture of SVMs. They use only part of the training set for each SVM, so the time cost is decreased significantly. It is conjectured that the time cost of a parallel mixture of SVMs is subquadratic in the number of training patterns for large-scale problems. They claim that the performance of a parallel mixture of SVMs is at least as good as a single SVM and show it to be so.

2.2 Decision Trees. A decision tree (DT) is a tree-structured classifier. Each internal node of the DT contains a test on an attribute of the example to be classified, and the example is sent down a branch according to the attribute value. Each leaf node of the decision tree has the class value of the majority class for the training examples that ended up at the leaf. A DT typically builds axis-parallel class boundaries. Pruning is a useful method to decrease overfitting for individual DTs, but is not so useful for ensembles of decision trees. The randomized C4.5 algorithm (Dietterich, 1998) was based on the C4.5 (Quinlan, 1993) algorithm. The main idea of randomized C4.5 is to modify the strategy for choosing a test at a node. When choosing an attribute and test at an internal node, C4.5 selects the best one based on the gain ratio. In the randomized C4.5 algorithm, the m best splits are calculated (m is a positive constant with the default value 20), and one of them is chosen randomly with a uniform probability distribution. When calculating the m best tests, it
A Parallel Mixture of SVMs
1347
Table 1: Class Distribution of the Forest Cover Type Data Set.

Class Label   Meaning              Number of Records
1             Spruce/Fir                211,840
2             Lodgepole Pine            283,301
3             Ponderosa Pine             35,754
4             Cottonwood/Willow           2747
5             Aspen                       9493
6             Douglas-fir                17,367
7             Krummholz                  20,510
is not required that they be from different attributes. In an extreme situation, the m candidate tests may all be from the same attribute.

3 The Forest Cover Type Data Set

The original Forest Cover Type data set (Blackard & Dean, 1999) contains 581,012 instances.[1] For each instance, there are 54 features. There are seven class labels, numbered 1 to 7. The distribution of the seven classes is not even; Table 1 shows the class distribution. Since an SVM is most directly used for two-class classification, the original seven-class data set was transformed into a two-class data set in the experiments of Collobert et al. (2002). The problem became to differentiate the majority class (class 2) from the other classes. Since we are going to compare the performance of a DT ensemble with the parallel mixture of SVMs, we use the same two-class version of the Forest Cover Type data set in our experiment. We downloaded the data sets.[2] They normalized the original forest cover data by dividing each of the 10 continuous features or attributes by the maximum value in the training set. There were 50,000 patterns in the testing set, 10,000 patterns in the validation set, and 400,000 patterns in the training set. We used the downloaded testing set and validation set as our testing set and validation set, accordingly. However, we did not actually tune our ensemble based on the validation set, so for us, it serves as a second test set. We used the first 100,000 patterns in the downloaded training set as our training set. These are the exact data sets used in Collobert et al. (2002).
[1] The data set is available on-line at http://ftp.ics.uci.edu/pub/machine-learning-databases/covtype/covtype.info.
[2] Available on-line at http://www.idiap.ch/collober/forest.tgz.
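The split-selection rule of randomized C4.5 described in section 2.2 can be sketched as follows; the tests and scores here are invented for illustration, and this is not the USFC4.5 implementation:

```python
import random

def randomized_split(candidate_tests, score, m=20, rng=None):
    """Dietterich's randomized C4.5 rule: rank candidate tests by their
    gain ratio, keep the m best, and pick one uniformly at random."""
    rng = rng or random
    ranked = sorted(candidate_tests, key=score, reverse=True)
    return rng.choice(ranked[:m])

# Toy example: (attribute, threshold) tests with made-up gain-ratio scores.
tests = [(attr, t) for attr in ("elevation", "slope") for t in range(10)]
scorer = random.Random(42)
scores = {test: scorer.random() for test in tests}
pick = randomized_split(tests, scores.get, m=5, rng=random.Random(0))
print(pick in sorted(tests, key=scores.get, reverse=True)[:5])  # True
```

Because the m best tests may come from the same attribute, no per-attribute constraint appears in the rule; that matches the extreme case noted above.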
4 Experimental Results

We used the USFC4.5 software, which is based on C4.5 release 8 (Quinlan, 1993) and modified by researchers at the University of South Florida (Eschrich, 2003), to do the DT experiments.

4.1 An Ensemble of 50 Randomized C4.5 Trees on the Full Training Set. Typically, a randomized C4.5 ensemble would consist of 200 decision trees. To compare with the 50 SVMs, we restricted our ensemble to 50 decision trees. Each tree was built on the whole training set of size 100,000. Since there were no differences in the data set of each individual tree, we used randomized C4.5 to create each tree, generating a diverse set of trees for the ensemble (Banfield, Hall, Bowyer, & Kegelmeyer, 2003). A random choice from among the top 20 tests was used to build the trees. The trees in the ensemble were unpruned. The ensemble prediction was obtained by unweighted voting, so the class with the most votes from individual classifiers was the prediction of the ensemble. As shown in Table 2, the ensemble accuracy on the training data set was 99.81%, on the validation set it was 94.85%, and on the testing set it was 94.75%. We also list the minimum, maximum, and average accuracy of the 50 individual DTs included in the ensemble in Table 2. The test set accuracy compares favorably with the 90.72% accuracy of the parallel mixture of SVMs.

4.2 An Ensemble of 100 C4.5 Trees on Half of the Training Set. To get an idea of how much the randomized C4.5 was helping the classification accuracy, we created an ensemble of 100 trees, each built on one-half of the training data. Each tree was trained on a randomly selected 50,000 examples from the 100,000-example training set. It is not guaranteed that each instance appears exactly 50 times in the training sets of the 100 DTs. Since each training data set is clearly unique, we built a standard C4.5 decision tree on each of them. The trees were not pruned. Each of our trees was built on 25 times more training data than each SVM.
However, only 100,000 unique examples are used. Each tree can be built in 1 CPU minute. The ensemble performance on the testing set was 92.76%; the minimum single tree performance was 86.23%, the maximum single tree performance was 87.60%, and the average single tree performance was 87.10%. So the SVM mixture was outperformed by an ensemble of plain C4.5 DTs, with each tree grown on half the training data of one of the SVMs.

Table 2: Accuracy of Dietterich's Randomized C4.5 on the Forest Cover Type Data Set, 50 Trees.

                  Ensemble   Minimum   Maximum   Average
Training set       99.81%     97.27%    97.73%    97.51%
Validation set     94.85      88.62     89.92     89.42
Testing set        94.75      88.81     89.66     89.25

Table 3: Comparison of Accuracy Between Randomized C4.5 DTs and a Parallel Mixture of SVMs on the Forest Cover Type Data Set.

                  Randomized C4.5 DTs   Parallel Mixture of SVMs
Training set            99.81%                   94.09%
Validation set          94.85                    91.00
Testing set             94.75                    90.72

5 Discussion and Conclusion

According to the results reported in Collobert et al. (2002), the best performance of their parallel mixture of SVMs (using 150 hidden units and 50 SVMs) on the training set was 94.09%; on the validation set it was around 91% (estimated from Figure 4 in Collobert et al., 2002); and on the testing set it was 90.72%. In our experiments, the ensemble of 50 randomized DTs had much better performance. As shown in Table 3, its accuracy on the training set was 99.81%, on the validation set it was 94.85%, and on the testing set it was 94.75%. We did build an ensemble of 200 trees, but the accuracies were only slightly greater; the ensemble becomes good quite quickly.

Each SVM in the ensemble of classifiers was built from a disjoint data set of 2000 examples. The high accuracy obtained from such small training data sets and the scalability of the algorithm are impressive. Each SVM classifier used significantly fewer than the 100,000 or 50,000 training examples used for each DT in the ensemble. The accuracy of the DT ensemble with half the size of the data was 2% less than with all training examples. Clearly, DT accuracy will decline with fewer examples. However, below, we show some timings indicating that a DT ensemble can likely be built in time comparable to or less than the SVM ensemble.

As to the running time, according to Collobert et al. (2002), their method needs 237 minutes when using 1 CPU and 73 minutes when using 50 CPUs. We ran our experiments on a single computer. The CPU time to create the ensemble of 50 random C4.5 DTs is approximately 108 minutes. We used a Pentium III system, and each processor had a 1 GHz clock speed. Since it is an ensemble, it could be created in parallel, with each processor starting with a different random seed.
The CPU time to build each tree is approximately 2 minutes. The parallel time would then be on the order of 2 minutes plus communication time.
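The unweighted ensemble vote used in our experiments can be sketched as follows (hypothetical per-tree predictions; not the USFC4.5 code):

```python
from collections import Counter

def ensemble_predict(tree_predictions):
    """Unweighted vote: the class predicted by the most trees wins."""
    return Counter(tree_predictions).most_common(1)[0][0]

# 50 hypothetical trees voting on the two-class forest cover problem.
votes = ["class 2"] * 28 + ["other"] * 22
print(ensemble_predict(votes))  # class 2
```

Because the vote is unweighted, trees trained on different random subsets or with different random split choices contribute equally, which is what makes the parallel construction trivial.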
Further, an ensemble of 100 trees, each created on a randomly selected 50,000 examples, was still 2% more accurate than the ensemble of SVMs. Each of these trees could be built in approximately 1 minute of CPU time in parallel. From our experiments, it is shown that for the two-class Forest Cover Type data set, an ensemble of DTs has very good predictive accuracy. This advantage is not limited to the two-class Forest Cover Type data set: we also did some experiments on the original seven-class Cover Type data set using a single DT. The performance of a single DT is promising too; it is much better than that of a single feedforward backpropagation neural network in both accuracy and speed.

The comparative results provided here underscore the need to compare classifier results with other types of classifiers even when it seems that the answer would be known a priori. For a given data set, most people would guess that an SVM would be much better than a DT. So if one designs a classifier that is even better than a single SVM, intuitively it seems unnecessary to compare with classical approaches with known limits such as DTs. We are certain that a parallel mixture of SVMs will outperform DTs on some training data sets, but not this one. As noted above, DTs result in good classification accuracy on the Forest Cover data set. They are both faster to construct than SVMs on this data set and more accurate.

Acknowledgments

This work was supported in part by the U.S. Department of Energy through the Sandia National Laboratories ASCI VIEWS Data Discovery Program, contract number DE-AC04-76DO00789, as well as the U.S. Navy, Office of Naval Research, under grant number N00014-02-1-0266.

References

Banfield, R. E., Hall, L. O., Bowyer, K. W., & Kegelmeyer, W. P. (2003). In T. Windeatt & F. Roli (Eds.), Multiple classifier systems: Fourth International Workshop, MCS (pp. 306–316). New York: Springer-Verlag.
Blackard, J. A., & Dean, D. J. (1999).
Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24, 131–151.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine: University of California, Department of Information and Computer Science. Available on-line at: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Chawla, N. V., Moore, Jr., T. E., Hall, L. O., Bowyer, K. W., Kegelmeyer, W. P., & Springer, C. (2003). Distributed learning with bagging-like performance. Pattern Recognition Letters, 24(1–3), 455–471.
Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel mixture of SVMs for very large scale problems. Neural Computation, 14, 1105–1114.
Decoste, D., & Schölkopf, B. (2002). Training invariant support vector machines. Machine Learning, 46(1–3), 161–190.
Dietterich, T. G. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2), 139–158.
Eschrich, S. (2003). Learning from less: A distributed method for machine learning. Unpublished doctoral dissertation, University of South Florida.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Received October 2, 2003; accepted January 8, 2004.
LETTER
Communicated by Edward White
Automated Algorithms for Multiscale Morphometry of Neuronal Dendrites

Christina M. Weaver
[email protected] Department of Biomathematical Sciences and Computational Neurobiology and Imaging Center, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.
Patrick R. Hof
[email protected] Fishberg Research Center for Neurobiology,Kastor Neurobiology of Aging Laboratories, Computational Neurobiology and Imaging Center, and Advanced Imaging Program, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.
Susan L. Wearne
[email protected] Department of Biomathematical Sciences, Fishberg Research Center for Neurobiology, and Computational Neurobiology and Imaging Center, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.
W. Brent Lindquist
[email protected] Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794, U.S.A.
We describe the synthesis of automated neuron branching morphology and spine detection algorithms to provide multiscale three-dimensional morphological analysis of neurons. The resulting software is applied to the analysis of a high-resolution (0.098 µm × 0.098 µm × 0.081 µm) image of an entire pyramidal neuron from layer III of the superior temporal cortex in rhesus macaque monkey. The approach provides a highly automated, complete morphological analysis of the entire neuron; each dendritic branch segment is characterized by several parameters, including branch order, length, and radius as a function of distance along the branch, as well as by the locations, lengths, shape classification (e.g., mushroom, stubby, thin), and density distribution of spines on the branch. Results for this automated analysis are compared to published results obtained by other computer-assisted manual means.
Neural Computation 16, 1353–1383 (2004) © 2004 Massachusetts Institute of Technology
1 Introduction

Ever since the pioneering work of Ramón y Cajal (1893) and Tanzi (1893), there have been indications that the processes of learning and memory may require physical changes in neuronal anatomy (see Wallace, Hawrylak, & Greenough, 1991, for review). There is a strong correlation between the functional complexity of a neuron and its morphological complexity (Hübener, Schwarz, & Bolz, 1990; Scheibel et al., 1985; Scheibel, Conrad, Perdue, Tomiyasu, & Wechsler, 1990). As a result, many studies have investigated the morphological variation between neurons located in different regions of the cortex (Elston & Rosa, 1997, 1998; Elston, Tweedale, & Rosa, 1999a, 1999b, 1999c; Elston, 2000; Elston & Rockland, 2002; Jacobs, Driscoll, & Shall, 1997; Jacobs et al., 2001; Nimchinsky, Sabatini, & Svoboda, 2002). Dendritic branching and spine morphology vary throughout the life cycle of an animal (Anderson, Classey, Condé, Lund, & Lewis, 1995; Duan et al., 2003; Harris, Jensen, & Tsao, 1992; Page et al., 2002), and can be influenced by environmental conditions (Bartesaghi, Severi, & Guidi, 2003; Kolb, Gibb, & Gorny, 2003; Seib & Wellman, 2003) and neurological disorders (DeFelipe & Fariñas, 1992; Glantz & Lewis, 2000; Irwin et al., 2001, 2002; Nimchinsky, Oberlander, & Svoboda, 2001; Purpura, 1974). There have been several demonstrations of linkage between the morphology of a neuron and its physiological and electrotonic properties (Baer & Rinzel, 1991; Gilbert & Wiesel, 1992; Holmes, 1989; Krichmar, Nasuto, Scorcioni, Washington, & Ascoli, 2002; Mainen & Sejnowski, 1996; Rumberger, Schmidt, Lohmann, & Hoffmann, 1998; Stratford, Mason, Larkman, Major, & Jack, 1989; Surkis, Peskin, Tranchina, & Leonard, 1998; Wilson, 1984, 1988). These experimental and theoretical results motivate an interest in obtaining detailed information regarding the morphology of a neuron.
Current morphological analysis typically involves a significant component of computer-assisted manual labor in which a human operator performs crucial structure recognition decisions. As a result, such analysis can be very time-consuming and is susceptible to fatigue- and habituation-related bias (Anderson et al., 1995; Elston & Rosa, 1997; Feldman & Peters, 1979). The goal of our work is to contribute to the development of a newer generation of software tools that provide greater structural recognition capability, drastically reducing the human labor component. Koh, Lindquist, Zito, Nimchinsky, and Svoboda (2002) described a software package for detection of dendritic spines in images obtained from laser scanning microscopy and benchmarked it against manual recognition. In this article, we combine the spine detection algorithms with dendritic tracing algorithms (Koh, 2001) to produce a package capable of morphometry on an entire neuron, imaged at high resolution (0.1 µm), with vastly reduced user interaction. Automated methods for tracing dendritic arbors have been described in Al-Kofahi et al. (2002), Koh (2001), and Cohen, Roysam, and Turner (1994). Al-Kofahi et al. (2002) generate seed points to find probable locations of dendritic segments, and then use predetermined templates of variable size to find local dendritic boundaries and trace dendritic processes. They take advantage of two-dimensional (2D) projections of three-dimensional (3D) image stacks to locate the seed points, resulting in an efficient trace of dendritic segments, including the basic connectivity of these segments. Koh and Cohen et al. utilize skeletonization (medial axis) construction to automate arbor tracing. Both use the medial axis to generate a graph-theoretic representation of a neuron, from which branch segments are identified. Cohen et al. do not discuss how to handle loops in the medial axis structure that cause ambiguity when determining, for example, branch order and dendritic length from the soma. The arbor tracing method utilized by Koh et al. distinctly resolves such loops using minimum kink angle and branch length criteria. Automated methods for dendritic spine detection have also been demonstrated (Herzog et al., 1997; Koh et al., 2002; Rusakov & Stewart, 1995; Watzel, Braun, Hess, Scheich, & Zuschratter, 1995). The medial axis construction has again been the most popular technique for dendritic spine detection. Rusakov and Stewart construct the medial axis of 2D images of dendritic fragments, using manual supervision to distinguish dendritic segments from spine segments. A stereological procedure is used to estimate the 3D spine length distribution from 2D images of the same fragment at different tilt angles. Watzel et al. demonstrate a 3D medial axis-based algorithm to extract the centerline from images containing a single dendrite and identify “spurs” on the medial axis as potential spines. True spines are distinguished from medial axis artifacts using a minimum-length criterion. Herzog et al. use a parametric model of cylinders with hemispherical ends to fit the shape of dendrites and spines appearing in an image.
Thin-necked and short spines are hard to detect with this method and have to be added manually. Koh et al. construct a modified medial axis to locate dendritic branch centerlines and detect spines as geometric surface protrusions relative to the centerlines. The algorithms described in this study are implemented as part of the 3DMA-Neuron software package (Weaver & Lindquist, 2003). To date, the dendritic spine analysis in 3DMA-Neuron has been utilized (Koh et al., 2002) only on images detailing subregions of the dendritic arbor, while the branching morphology analysis has been used (Koh, 2001; Maravall, Shepherd, Koh, Lindquist, & Svoboda, in press) on images whose resolution is too low to permit accurate measurement of dendritic spines. We present here an integration of these two algorithms, demonstrating both branching morphology and fine-scale structure analysis of a single, full-neuron cell image. The neuron analyzed is a pyramidal neuron from layer III of the rhesus macaque monkey superior temporal cortex. Extracted morphological parameters from the automated analysis are compared with published results obtained by other means. This article is divided into the following sections. The biological preparation, imaging details, and image preprocessing techniques are presented
briefly in section 2. The algorithms used to analyze the neuron morphology are presented in section 3. Morphometry results, including comparison with measurements available from the literature, are presented in section 4. Final discussion of the image analysis follows in section 5.
2 Materials and Methods

2.1 Tissue Preparation and Imaging. Tissue sections of 200 to 400 µm thickness were obtained from the neocortex of a rhesus macaque monkey, aged about 22 years, as described elsewhere (Duan, Wearne, Morrison, & Hof, 2002; Duan et al., 2003). All experimental protocols were conducted according to the National Institutes of Health guidelines for animal research and were approved by the Institutional Animal Care and Use Committee at Mount Sinai School of Medicine. Sections were mounted on nitrocellulose paper and immersed in phosphate-buffered saline (PBS). The sections were incubated with 4,6-diamidino-2-phenylindole (DAPI; Sigma, St. Louis, MO) for at least 20 minutes to reveal the cellular architecture under ultraviolet (UV) excitation. Neurons were then impaled with micropipettes and loaded with Lucifer Yellow (LY; Molecular Probes, Eugene, OR) in distilled H2O under a DC current of 3 to 8 nA until the dye had filled distal processes (10–12 minutes) and no further loading was observed (DeLima, Voigt, & Morrison, 1990; Duan et al., 2002; Nimchinsky, Hof, Young, & Morrison, 1996). Generally, up to 15 cells were injected per slice, with injections sufficiently separated to avoid overlapping of the dendritic trees. After neuron labeling, sections were fixed again in 4% paraformaldehyde and 0.125% glutaraldehyde in PBS for 4 hours at 4°C, washed, and stored in PBS. Sections were mounted on uncoated slides, and loaded neurons were imaged with a TCS-SP (UV) confocal laser scanning microscope (Leica Microsystems, Heidelberg, Germany) equipped with confocal detector channels having 8-bit photomultiplier tubes. The pyramidal neuron imaged for this study is from layer III of the superior temporal cortex (STC) and was shown to project ipsilaterally to the prefrontal cortex (Duan et al., 2003). A 100× oil immersion objective lens with numerical aperture (NA) of 1.4 was used to collect the image, which was scanned at a resolution of 1024 × 1024 voxels.
Voxel dimensions are 0.098 µm × 0.098 µm in plane. Optical sections were collected every 0.081 µm via a scan stage with a 40 nm step size. As the working distance of the high-NA oil immersion lens is limited to 100 µm, only 1000 to 1200 optical sections could be collected per column. After imaging an entire stack through-focus, the stage was refocused and translated in either the z (optical axis) direction, to collect another stack from the remainder of the column, or the xy-plane, to collect from an adjacent column. Sufficient overlap between columns was allowed to ensure accurate assembly of the full neuron montage. Scanning continued until all areas containing the neuron were imaged.
2.2 Deconvolution, Alignment, and Integration of Image Stacks. Each imaged stack was deconvolved separately using the blind deconvolution-based package AutoDeblur (AutoQuant Imaging, Inc.). Typically (Turner et al., 1997, 2000), the optical point spread function smears the image along the optical axis (z-direction) about three times that in the x- and y-directions (see Figure 1a). While deconvolution greatly reduces this spread (see Figure 1b), the effect is still noticeable. Inspection shows that compression by voxel averaging of 2 → 1 in the z-direction nearly eliminates the spread effect (see Figure 1c). We perform this z-compression once the image stacks are integrated (as described next) into a single volume. The imaged stacks were processed into a single volume using the Volume Integration and Alignment System (VIAS) (Rodriguez et al., 2003). VIAS provides the user with an interface by which adjacent stacks can be aligned with one another. 2D projections of a stack, rather than the whole stack, are utilized in the alignment, greatly reducing the computer memory requirements of the program. Working with the xy-projections of two adjacent stacks, an approximate alignment is determined manually via a mouse “click-and-drag” operation. This is followed by an automated alignment that minimizes the L1 norm of the intensity differences in the overlapping region of the projection, using a relatively small neighborhood of offsets around the approximate manual alignment. Completion of this autoalignment step finalizes horizontal registration of the two stacks. Vertical registration is done by performing an automated alignment step using the xz- (or yz-) projections of the stack with the y (or x) coordinate of the alignment fixed. By considering adjacent stacks one at a time, all stacks can be registered by a displacement vector relative to the first.
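The automated alignment step lends itself to a compact sketch. The following is a hypothetical NumPy illustration, not the VIAS implementation; the function name `l1_align` and the search radius are our own assumptions. It minimizes the (per-voxel) L1 norm of intensity differences over a small neighborhood of integer offsets:

```python
import numpy as np

def l1_align(ref, mov, max_shift=3):
    """Find the integer (dy, dx) shift of `mov` that minimizes the mean L1
    intensity difference against `ref` over their overlapping region,
    searching a small neighborhood of offsets around zero."""
    best, best_shift = np.inf, (0, 0)
    h, w = ref.shape
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # overlapping region of ref and of mov translated by (dy, dx)
            ry0, ry1 = max(0, dy), min(h, h + dy)
            rx0, rx1 = max(0, dx), min(w, w + dx)
            my0, my1 = max(0, -dy), min(h, h - dy)
            mx0, mx1 = max(0, -dx), min(w, w - dx)
            diff = ref[ry0:ry1, rx0:rx1] - mov[my0:my1, mx0:mx1]
            # mean L1 per voxel, so overlaps of different sizes compare fairly
            cost = np.abs(diff).mean()
            if cost < best:
                best, best_shift = cost, (dy, dx)
    return best_shift
```

An exhaustive search over a small offset window suffices here because the manual click-and-drag step has already brought the projections close to registration.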
With all stacks registered, a global coordinate system can be assigned that covers all stacks, and a minimum-size rectangular volume defined that just contains all stacks. Fluorescence intensity values from individual stacks are then assigned to each appropriate voxel position in the enclosing volume. If voxels from two or more stacks overlap in position, the average intensity value is assigned. Under the voxel resolution used here, the entire neuron volume (4,826 × 5,609 × 597 voxels) is too large (25 Gbytes) to store in memory on any single-processor machine. Thus, the entire neuron volume is assembled slice-wise; each slice is output into a separate file before the next slice is assembled. A 2D projection of the final result of the VIAS procedure for this neuron is shown in Figure 2.

3 Image Analysis

For brevity, we shall refer to the entire z-compressed VIAS neuron image as If. The morphology of If is analyzed in a three-step approach. The first step involves local fine-scale analysis of the structure of dendritic branch segments, including spines. The second step determines the global dendritic branching morphology, including assignment of labels for each dendritic
Figure 1: xy- (top) and xz-projections (bottom) of a subregion from a single stack after (a) confocal imaging and (b) deconvolution by AutoDeblur. (c) The region in (b) after compression by a factor of 2 in the z (optical) direction.
Figure 2: xy-projection of the entire neuron after stack alignment and integration by the VIAS system. Individual tiles manually selected for spine detection analysis are also indicated.
branch. In the third step, the fine-scale morphology measurements from step 1 must be registered with the dendritic branch labeling established in step 2.

3.1 Analysis of Fine-Scale Morphology. For the fine-scale analysis, computer memory limitations require dividing If into nonoverlapping tiles. For this analysis, If was manually divided into 31 tiles (see Figure 2), with
no tile larger than 532³ voxels, a limit imposed by the 1 Gbyte memory limitation of the Linux-based PCs on which the software was implemented. Note that tiles are chosen to avoid regions containing no neuron structure. Fine structure analysis is performed separately for each tile; consequently, fine structure analysis proceeds without global label information for the dendritic segments present in the tile. The analysis sequence for the fine structure in a single tile consists of segmentation; skeletonization of the neuron phase and reduction of the skeleton to one consisting of midlines for all dendrite candidates; dendritic radius determination; spine detection as protrusions relative to the dendritic midline; and morphological characterization of each spine. Dendrite midlines are also referred to as backbones, and the reduced skeleton composed of midlines is referred to as the backbone skeleton. These algorithms have been described in detail in Koh et al. (2002); for brevity, we comment only on changes incorporated for, and issues specific to, this entire-neuron study. As mentioned, If was acquired as a union of separately imaged, overlapping stacks. During imaging, microscope parameters were adjusted for each separate stack to balance visibility of spines and thin dendritic processes against fluorescent brightness of thicker processes. This produces intensity variations throughout If. Hence, a segmentation method that is locally adaptive is preferable to one employing global parameters. We employ a method (Oh & Lindquist, 1999) based on indicator kriging, which (using information in the local spatial neighborhood) performs a maximum likelihood estimation of the probability that any given voxel belongs to the neuron phase. The method requires that a subpopulation of voxels of each phase be positively identified first. This subpopulation is identified by conservatively establishing a window of image intensity values delimited by two thresholds.
Intensity values above the upper threshold are assumed to be positively identified as neuron, and those below the lower threshold are assumed to be background tissue. The classification of the remaining voxels is then performed by indicator kriging. The intensity variations in If necessitated tile-to-tile manual adjustment of these two threshold values. The determination of dendritic branch radius was not discussed explicitly in Koh et al. (2002); however, it follows directly from the skeletonization procedure discussed there. Skeletonization naturally associates a radius with each voxel comprising the dendrite’s midline; thus, the radius measure is generated as a function of position along any dendritic branch segment. Technically, the radius measured is one-half of the shortest diameter of the cross-sectional area of the branch at the voxel in question. A major thrust of the study by Koh et al. (2002) was to benchmark the spine detection algorithm against manual identification standards. The spine detection algorithms have false-positive and false-negative detection rates that primarily reflect the amount of fluorescent cellular debris located near neuron tissue and also reflect nonspine dendrite surface structures such as incipient filopodia. False detection rates are controlled by a user-set
length parameter that determines both the maximum spine length allowed and the maximum distance an (apparently disconnected) spine head can lie from a dendritic backbone. As dendritic segments and spines are detected while examining only single tiles, final confirmation of all dendrite and spine candidates so detected is reserved until the fine-scale analysis of all tiles has been collated and registered with the global dendrite morphology.

3.2 Analysis of Dendritic Branching Morphology. Memory requirements to store If also preclude an analysis of the dendritic branching morphology at fine-scale resolution. Let Si denote the backbone skeleton obtained from the fine-scale analysis of the neural tissue in the ith tile. Let S be the union of the Si over all tiles. S forms a fine-scale description of the backbone skeleton of the neuron; this skeleton preserves the geometry of the branching of the dendritic arbors. To reduce the memory requirements needed to store the volume required to analyze the dendritic branching of S, a coarsened image Sc is obtained by a simple contraction operation. Let v be a voxel in S with global coordinates (xv, yv, zv). (Because the image is digital, the coordinates are integers.) Define the coarsened global position of v as ([xv/f], [yv/f], [zv/f]), where f is a positive integer coarsening factor, and the [ ] operation denotes “integer part of.” Qualitatively, the effect of this contraction procedure is to replace each string of f voxels in the skeleton S by a single voxel (of f times the physical dimension in each direction). The set of coarsened voxels produces a coarsened skeleton Sc, which can be stored in a coarsened volume Ic. The branching morphology analysis is performed on Sc in Ic. Choice of the value for f represents a compromise between memory reduction and loss of accuracy of the branching structure due to a change in topology between S and Sc.
If two dendritic processes are locally separated by less than f voxels in S, they will be locally merged under coarsening, producing a branch point in Sc that has no correspondence in S. For this neuron, a coarsening factor of f = 5 was used; effectively, the skeleton of the dendritic tree was discretized into coarse segments of length 0.5 µm rather than a fine-scale length of 0.1 µm. Since the individual pieces Si of the skeleton were determined from individual tiles, the union skeleton S, and hence its coarsened version Sc, contains disconnected pieces corresponding to cellular debris that could not be positively identified as such by examining only individual tiles, as well as spurious segments at tile boundaries that can be resolved only once skeleton pieces from neighboring tiles are connected. The true neuron skeleton is therefore identifiable as the largest connected piece of Sc; any disconnected skeletal pieces are deleted as corresponding to cellular debris. Finally, the skeleton Sc is trimmed as described in Weaver (2003) to remove spurious, tile-boundary-related segments and to retain only a 26-connected skeletal structure.
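The contraction operation described above amounts to integer division on voxel coordinates. The following is a hypothetical sketch (the function name is our own, not part of 3DMA-Neuron) of coarsening a set of fine-scale skeleton voxels by a factor f, so that any run of up to f consecutive voxels collapses to a single coarse voxel:

```python
import numpy as np

def coarsen_skeleton(voxels, f=5):
    """Contract fine-scale skeleton voxel coordinates by an integer factor f:
    each (x, y, z) maps to ([x/f], [y/f], [z/f]), where [ ] is integer part.
    Returns the set of distinct coarse voxels (the coarsened skeleton Sc)."""
    coords = np.asarray(voxels, dtype=int)
    return set(map(tuple, coords // f))
```

For example, a straight run of 10 collinear fine voxels along x contracts, at f = 5, to just two coarse voxels; this is also why two processes closer than f fine voxels can merge into a spurious branch point.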
In the trimmed skeleton (which we still refer to as Sc), a (generally complicated) compact region of the backbone structure Sc corresponds to the soma of the neuron. Using the graphical user interface (GUI) provided, the software allows the region of the soma to be delineated manually by simple click-and-drag operations (Koh, 2001). The backbone structure of Sc that lies inside the soma region is then omitted from further analysis. The remainder of the backbone structure of Sc corresponds to the dendritic trees and axon. Axon collaterals in the image are not yet handled by our software, which currently assumes that all structures leaving the soma are dendritic. To handle axon collaterals, provision is made in the GUI to manually inspect and reject axons using mouse point-and-click operations. During this manual rejection procedure, any click on a dendritic midline will remove it and the entire centrifugal subtree of midlines for which it is the root. After manual removal of the axon, using the GUI, a point-and-click operation is used to identify one (or more) trees that comprise the apical arbor. The software assumes all other trees comprise the basal arbor. We refer to each dendritic backbone segment exiting the soma region as the root of a dendritic tree. Using breadth-first spanning trees generated in the centrifugal direction starting from each root, the dendritic branch segments in each tree are automatically labeled according to a centrifugal nomenclature (Uylings, Ruiz-Marcos, & van Pelt, 1986). Each branch has a unique label of the form B:t:l:i, where B = a, b is the arbor label (apical or basal); t = 1, 2, 3, … denotes the tree label; l = 1, 2, 3, … is the branch order; and i is an index from 1 to n_l^a (or n_l^b), where n_l^a (resp. n_l^b) is the total number of apical (resp. basal) branches at branch order l. Any looping structures in the arbor skeleton must be resolved prior to labeling.
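The centrifugal labeling amounts to a breadth-first traversal from each root. The following hypothetical sketch (the function name, the adjacency-dictionary representation, and the per-tree scope of the index i are our own assumptions, not the 3DMA-Neuron implementation) assigns labels of the form B:t:l:i to the segments of one dendritic tree:

```python
from collections import deque, defaultdict

def label_tree(daughters, root, arbor="b", tree=1):
    """Assign centrifugal labels B:t:l:i to one dendritic tree.
    `daughters` maps each branch segment to its daughter segments;
    `root` is the segment exiting the soma (branch order l = 1).
    Here i indexes branches at order l within this single tree."""
    labels = {}
    count_at_order = defaultdict(int)
    queue = deque([(root, 1)])  # (segment, branch order l)
    while queue:
        seg, order = queue.popleft()
        count_at_order[order] += 1
        labels[seg] = f"{arbor}:{tree}:{order}:{count_at_order[order]}"
        for d in daughters.get(seg, []):
            queue.append((d, order + 1))
    return labels
```

Because the traversal is breadth-first and centrifugal, every daughter receives branch order one greater than its parent, and siblings at the same order are numbered in the order they are reached.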
We utilize minimum kink angle and branch length criteria in doing so (Koh, 2001).

3.3 Global Registration of the Fine-Scale Results. The fine-scale analysis allows only local labeling of all dendritic segments and spines found in each tile. Global labeling of the fine-scale results requires registration of the tile analysis with the dendritic mapping analysis of Sc (and hence of S). This registration is done with the same contraction operation used to produce Sc. Let D denote the (fine-scale) midline of a (presumed) dendritic segment analyzed in a single tile. (D may, in fact, correspond to cellular debris or to an extraneous skeleton segment at a tile edge.) Let Dc be its coarsened representation, obtained by applying the contraction operation of section 3.2 to D. We attempt to register Dc with one (or more) dendrite midline segments in Sc, allowing for the possibility of incomplete voxel alignments. Overlaps (voxels having the same coordinates) between Dc and midline segments of Sc are recorded. If any overlap is found, alignment is checked as follows. Assume Dc overlaps with branch midline segment B. If the length of segment Dc is shorter than B, Dc is further divided into three contiguous segments
Dc1, Dc2, Dc3 of effectively equal length. If overlaps between Dci and B exist for at least two of the three segments, Dc is considered to align with B. (If B is shorter than Dc, B is divided into three segments, and overlaps of each segment with Dc are checked.) If no overlapping, aligning branch in Sc is found for Dc, it is assumed to correspond to cellular debris or a tile boundary artifact, and it, along with any spine candidates that may have been associated with it, is deleted. Since tile boundaries may cut through dendritic branches, a single branch B in Sc may register with a sequence of dendritic segments from different tiles. In this case, the fine-scale data for each dendritic segment are registered appropriately along B. A final level of manual control is inserted at the end of the global registration step. The GUI provides a facility by which the user can scan 2D projected images of each fine-scale tile showing the final results of the automated identification and global registration of dendritic segments and spines. Via mouse click, the user can manually remove dendritic segments and spines deemed to be false-positive signals. False-positive signals in a small subregion of the image are illustrated in Figure 3. The two false positives, marked C and D, require manual intervention.

3.4 Morphological Measurements. Recorded for each dendritic branch segment are its global label, total length, running diameter, number of spines, spine density, mean spine length, and numbers of spines of each classification type. Recorded for each spine are the global label of the dendritic branch segment on which it lies; its global position on the dendritic backbone; its length, volume, and head and neck diameters; and its shape classification (thin, mushroom, stubby) (Harris et al., 1992; Nimchinsky et al., 2002). Implementation of shape classification is discussed in Koh et al. (2002).
Spine volume is currently an estimation, determined as the sum of the volumes of all voxels comprising the spine. This method of estimating spine volume has not as yet been verified for accuracy. The accuracy of such an estimation depends strongly on the relative sizes of the voxel volume versus the microscope focal volume. (The focal volume for this image is estimated as 0.139 µm × 0.139 µm × 0.236 µm.) The dendritic branching morphology is output in SWC format (Cannon, Turner, Pyapali, & Wheal, 1998). More detailed output, including summary details about the spine shapes on each branch, is generated in NeuroML format (Goddard et al., 2001, 2002). These formats allow the data to be easily incorporated into compartmental modeling software such as NEURON (Hines, 1994) and GENESIS (Bower & Beeman, 1994).

4 Results

For the entire neuron studied here, execution of the three morphology analysis steps required 26 hours of computing time on a Pentium III 500 MHz
Figure 3: xy-projections of a subregion showing (a) raw data and (b) dendritic backbones and spines (colored black) detected by the automated algorithms. Indicated are instances of cellular debris (C) and of dendritic segment tips (D), which have been falsely identified as spines and must be manually deleted.
PC running RedHat Linux 6.2. (This does not include deconvolution time.) Most of this time was spent in the segmentation (34%) and spine detection (53%) algorithms. Employing a coarsening factor of f = 5 produced a few topology differences between S and Sc. In Sc, one axon collateral merged at two points with two separate dendritic branches of the apical arbor. Since axons were deleted, however, this had no consequences. In the basal arbor, there are a couple of places where branches merge under contraction. No fine-scale information is lost during registration, but branch labels are based on the topology of Sc. Manual intervention for the analysis of this neuron consisted of the following. During the dendritic branching determination step, all axon collaterals in the image were deleted. During the final tile-by-tile visual review, of the 2646 spines detected on the entire neuron, 76 were manually removed as definitely false positive.
Our primary interests are to validate the results of the automated analysis of this neuron against manual analysis of the same data and to compare these automated results with published results based on computer-aided manual analysis of similar neurons. We discuss the dendritic and spine morphometry results separately.

4.1 Dendritic Morphometry. The basal arbor of the neuron studied consisted of nine dendritic trees; the apical arbor consisted of a single tree. Following convention (DeFelipe & Fariñas, 1992; Feldman, 1984), the segments of the apical tree were manually reclassified into a primary shaft (assigned branch order 1), oblique subtrees, and terminal tuft branches. There were 10 oblique subtrees and 2 branches in the terminal tuft. To validate our automated dendrite tracing algorithms, a limited manual analysis of the dendritic branches was performed on the neuron image. The manual analysis was performed by looking at the principal 2D projections of individual tiles in the image. Manual determinations of the number of dendritic segments (NDS), number of dendritic branch points (NBP), and maximum branch order (MBO), for the basal and apical arbors separately, are compared with the automated determinations in Table 1. The automated results identify three fewer dendritic segments in the apical arbor and two more segments in the basal arbor. This is then reflected in the manual versus automatic differences observed in NBP and MBO. Figure 4 displays the automated and manual differences in branch count as a function of branch order. Two mechanisms are responsible for the differences seen in the apical arbor, where branch count differences occur in three places. The first difference occurs along the apical shaft, where the manual interpretation is that the parent shaft splits into three daughter segments (one of which is labeled as the continuation of the shaft).
The automated algorithm, however, detects the parent shaft splitting into two daughters, with one of these daughter branches (the continuation of the shaft) proceeding for only a very short length before splitting further into two branches (one of which is again labeled as the continuation of the shaft). The other two places are near branch points where the manual method identifies an oblique dendrite parent branch splitting into two daughters. In both cases, a break occurs in the parent branch just before the branch point. The automated method cannot bridge the break and concludes that the dendrite trace terminates at the break. The reason for these breakages is discussed below. Note that these two mechanisms oppose each other; compared to the automated methods, one leads to fewer manual counts, the other to more manual counts. Differences in counts observed on the basal arbor are due to these two mechanisms, and a third: topological differences between S and Sc induced by close dendritic processes. Table 1 presents comparisons on NDS, NBP, and MBO, as well as total dendritic length (TDL) and average dendritic segment length (DSL), with published results on pyramidal neurons in rhesus macaque (Duan et al., 2003),
C. Weaver, P. Hof, S. Wearne, and W. Lindquist
Table 1: Comparison of Automated and Manual Results from This Study with Published Results on Similar Pyramidal Neurons in Monkeys and Humans.

Species                  NDS          TDL (mm)     NS            SD (µm^-1)
Apical
  Rhesus macaque^a       41           1.86         1219          0.65
  Rhesus macaque^b       44           --           --            --
  Rhesus macaque^c       30.0 (3.7)   1.92 (0.15)  791.3 (51.7)  0.41 (0.02)
  Patas monkey^d         28.3 (7.0)   1.76 (0.21)  738 (63)      0.42 (0.01)
  Long-tailed macaque^e  32.8 (2.4)   1.85 (0.13)  --            --
Basal
  Rhesus macaque^a       57           2.05         1278          0.62
  Rhesus macaque^b       55           --           --            --
  Rhesus macaque^c       51.8 (4.1)   2.82 (0.16)  1078 (70.5)   0.38 (0.03)
  Patas monkey^d         48.8 (2.7)   2.45 (0.32)  955 (80)      0.39 (0.03)
  Long-tailed macaque^e  42.6 (2.0)   2.67 (0.15)  --            --
  Human^f                49.0 (1.0)   3.30 (0.10)  903 (57)      0.21 (0.01)

Species                  NBP          MBO          DSL (µm)      SDS (µm^-1)
Apical
  Rhesus macaque^a       20           12           45.4 (6.3)    0.729 (0.090)
  Rhesus macaque^b       21           11           --            --
  Long-tailed macaque^e  15.5 (1.2)   9.10 (0.53)  58.5 (2.5)    0.586 (0.030)
Basal
  Rhesus macaque^a       24           6            36.0 (3.5)    0.545 (0.060)
  Rhesus macaque^b       23           6            --            --
  Long-tailed macaque^e  18.6 (0.9)   5.77 (0.25)  64.8 (1.9)    0.633 (0.024)

Notes: All values represent mean (S.E.M.) per neuron except for SDS, which is per dendritic segment. NDS: number of dendritic segments; TDL: total dendritic length; NS: number of spines; SD: spine density; NBP: number of branch points; MBO: maximum branch order; DSL: dendrite segment length; SDS: spine density per segment. ^a This study, automated. ^b This study, manual. ^c Duan et al. (2003). ^d Page et al. (2002). ^e Soloway et al. (2002). ^f Jacobs et al. (2001).
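Several columns of Table 1 are tied together by simple identities: average segment length is total dendritic length divided by the segment count, and spine density is the spine count divided by total length. The sketch below checks this study's automated per-neuron numbers; note that the pairing of values to table rows is read off the flattened source, so treat it as an assumption rather than a quotation.

```python
# Consistency check for Table 1: DSL = TDL / NDS and SD = NS / TDL.
# Values are this study's automated per-neuron numbers (apical and basal).

def dsl_um(tdl_mm, nds):
    """Average dendritic segment length in micrometers."""
    return tdl_mm * 1000.0 / nds

def spine_density_per_um(ns, tdl_mm):
    """Spines per micrometer of dendrite."""
    return ns / (tdl_mm * 1000.0)

apical = dict(nds=41, tdl_mm=1.86, ns=1219)
basal = dict(nds=57, tdl_mm=2.05, ns=1278)

print(round(dsl_um(apical["tdl_mm"], apical["nds"]), 1))  # close to the tabulated 45.4 um
print(round(dsl_um(basal["tdl_mm"], basal["nds"]), 1))    # close to the tabulated 36.0 um
```

The recovered values agree with the tabulated DSL and SD figures to within rounding, which supports the column assignments above.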
patas (Page et al., 2002), and long-tailed macaque (Soloway, Pucak, Melchitzky, & Lewis, 2002) monkeys, and in humans (Jacobs et al., 1997, 2001). Except as noted below, all neurons in the comparison studies are STC layer III pyramidal neurons having ipsilateral projections to the prefrontal cortex area 46. In Duan et al., the results are averaged over 19 neurons from two adult male and three adult female macaque monkeys, ages 24 to 25 years. In Page et al., the results are averaged over three neurons from three adult (21–25 years) patas monkeys. The three neurons are from either layer III or layer V. In Soloway et al., individual results are averages of between 29 and 56 neurons having ipsilateral projections into prefrontal areas 9 or 46. The neurons were taken from young adult long-tailed macaque monkeys. Approximately 75% of the neurons were from the supragranular layers (II, III) of the STC; the remainder were from the infragranular layers (V, VI). In Jacobs et al., the results are averages over supragranular pyramidal neurons from Brodmann's area 22 in neurologically normal human subjects, of mean age 30 (±17) years. One hundred neurons were studied: 10 neurons each from 10 subjects. The projections of these neurons are unknown. Only basal arbors were examined in the Jacobs et al. (2001) study.

Figure 4: Number of dendritic segments of each branch order for the neuron shown in Figure 2 for the (a) oblique branches of the apical arbor and (b) basal arbors, as determined by manual and automated methods. Apical oblique branch orders are numbered beginning with order 2; the apical shaft is order 1.

For the apical arbor, both manual and automated counts of NDS are higher than reported in the literature, with the consequence that NBP and MBO are also high. For the basal arbor, our results for NDS and NBP are slightly high (though reasonable agreement exists on NDS with the study of Duan et al.), agreement on MBO is very good, and our results for TDL and DSL are low. Figures 5a and 5b compare TDL as a function of branch order for the apical and basal arbors against data reported in Duan et al. (2003). There is unfortunately a difference in definition of branch order. In Duan et al., if a parent of order n splits into two daughters of roughly equal diameter and equal angle of divergence from the parent, both daughters are assigned order n + 1. However, if one daughter continues relatively straight with respect to the parent and is judged to be of roughly equal diameter to the parent, that daughter branch remains order n, whereas the other(s) become n + 1. In the algorithm employed in our automated routines, for the trees in the basal arbor, all daughter branches of a branch order n parent are assigned order
Figure 5: Total dendritic length (a, b) and Sholl analysis (c, d) versus branch order for the apical (a, c) and basal (b, d) arbors obtained for the neuron shown in Figure 2, compared to results from Duan et al. (2003) for macaque monkeys of similar age. The Duan et al. plots show mean values; error bars denote ± S.E.M. Sholl spheres were analyzed at radius increments of 20 µm.
n + 1. For the apical arbor, however, we have manually reclassified branch segments (and hence branch order) as either apical shaft, apical oblique, or terminal tuft. Thus, until the terminal tuft is reached, if an apical shaft parent branch (order 1) splits, one daughter is identified as (the continuation of the) apical shaft (branch order 1), and the other daughter(s) are identified as apical oblique branches (branch order 2). When an apical oblique branch of order n splits, all daughters are assigned order n + 1. Consequently, we observe reasonable agreement for TDL versus branch order with the results of Duan et al. (2003) for the apical arbor (where there is some consistency in how branch order is assigned) but very little agreement on the basal arbor data. This comparison appears to emphasize the importance of the choice of branch labeling scheme.

To avoid the branch order designation differences, Figures 5c and 5d compare Sholl plots (Sholl, 1953) for the apical and basal arbors against data reported in Duan et al. (2003). The results are now qualitatively similar. The major differences consist of a higher peak and a more rapid "tailing off" in the Sholl plots for our neuron. Comparison of our Figure 2 with Figure 3 of Duan et al. (2003) does show qualitative differences between our neuron and representative examples of those studied in Duan et al. Thus, the differences are partly a consequence of the fact that our sample consists of a single neuron. However, the more rapid drop-off in the dendritic density at higher radius (and the low value measured for TDL in Table 1) of both arbors results from breakages in dendrites.

Breakage is caused by two mechanisms. The first is caused by very slight rotations of the specimen mount that occurred when different stacks were imaged during the period of time over which the entire neuron image was collected. Such a rotation is demonstrated in Figure 6. Rotations prevent exact alignment of all dendritic segments crossing between two neighboring tiles, with the result that branch breakage occurs, preventing automated branch tracing across the break. Breakage also occurs where intensity variation in thin processes, due to nonuniform dye filling, results in breaks in the segmented image. Such errors tend to occur in thin axon collaterals and near the tips of dendritic branches, leading to the observed rapid loss of dendritic length with radius. Figure 7 shows the final reconstruction of dendrites and spines for this neuron. Arrows indicate where dendrite truncation occurred due to these two mechanisms. Filling and segmentation errors are in the majority.
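The Sholl analysis used in Figures 5c and 5d reduces an arbor to counts of intersections with concentric spheres centered on the soma, sampled here at 20 µm increments. A minimal sketch follows; representing branches as straight 3D segments given by endpoint pairs is a simplification of a real trace, adopted only for illustration.

```python
import math

# Sketch of a Sholl analysis (Sholl, 1953): for each sphere radius, count
# the dendritic segments whose endpoints straddle the sphere.

def sholl_counts(segments, soma, radii):
    """For each radius, count segments crossing the sphere of that radius."""
    counts = []
    for r in radii:
        n = 0
        for p, q in segments:
            dp = math.dist(p, soma)
            dq = math.dist(q, soma)
            if min(dp, dq) <= r <= max(dp, dq):  # segment crosses the sphere
                n += 1
        counts.append(n)
    return counts

soma = (0.0, 0.0, 0.0)
# Two toy branches (coordinates in um): one reaching ~50 um, one ~130 um.
segments = [((0, 0, 0), (30, 0, 0)), ((30, 0, 0), (50, 10, 0)),
            ((0, 0, 0), (0, 70, 0)), ((0, 70, 0), (0, 130, 20))]
radii = list(range(20, 141, 20))  # 20 um increments, as in Figures 5c and 5d
print(sholl_counts(segments, soma, radii))  # [2, 2, 1, 1, 1, 1, 0]
```

A truncated branch simply stops contributing crossings beyond its break point, which is why breakage shows up as the rapid tail-off described above.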
Figure 8 presents our measurements of mean dendrite diameter as a function of branch order. To our knowledge, data on comparable species are not available in the literature. These data should be relatively unaffected by branch truncation errors, as such errors will reduce sample size and perhaps result in slight overestimation of average diameter at higher branch order values. Dendrite diameter decreases sharply across low branch orders and continues to taper gradually throughout the dendritic arbor. These observations are consistent with dendritic morphology reported elsewhere (DeLima et al., 1990; Feldman, 1984).

4.2 Dendritic Spine Analysis. The spine detection and characterization algorithms used here have been shown (Koh et al., 2002) to be consistent with manual analysis. The lower signal-to-noise ratio in our data as compared to those of Koh et al., where the imaging involved two-photon laser-scanning microscopy of cells tagged with fluorescent proteins via ballistic transfection, motivated a manual count of a sampling of spines contained in our data. We performed a manual spine count on two distinct tiles shown (in
Figure 6: xy and xz projective views of a region in Figure 2. Arrows indicate boundaries between imaged regions joined after tile registration. A very small rotation in the right-hand tile prevents complete alignment of dendritic segments. The misalignment is most noticeable in the xz projection.
Figure 7: Final reconstruction of the neuron dendrites and spines in Figure 2. Open arrows indicate places where dendrites are truncated due to segmentation breakage; solid arrows indicate where minor rotations in the image tiles prevent accurate tile registration, with the consequence of dendrite breakage.
xy-projection) in Figure 9: one from the proximal section of one of the basal trees, in which all dendrites and spines are clearly visible, and one from a distal section of the apical tree, in which some of the dendrites and spines appear faint. Spines appearing in the xy-projection were counted manually. To distinguish cellular debris from true spines, the xz- and yz-projections were inspected as well. Table 2 summarizes the results. The 90% agreement in spines counted in common for the proximal tile is consistent with results reported in Koh et al. (2002). The lower agreement for the distal tile reflects reduced performance, probably in both methods, when spines and dendrites appear faint.

Table 2: Comparison of Automated and Manual Spine Counts for Two Tiles Located Proximally and Distally to the Soma.

Tile      Average    Common       Automated Only    Manual Only
Proximal  228        205 (90%)    31 (14%)          15 (6.6%)
Distal    111        88 (79%)     24 (22%)          22 (20%)

Notes: Average: average of automated and manual counts. Common: number of spines identified by both methods. Automated only: additional spines identified by the automated method only. Manual only: additional spines identified by the manual method only. Percentages are relative to the average number of counts.

Figure 8: Mean diameter (in µm) of the apical and basal arbors at each branch order, for the neuron of Figure 2.

Figure 9: xy-projections of tiles (a) proximal to and (b) distal from the soma used for comparison of manual and automated spine counts.

Table 1 compares our automated results on (per neuron) average number of spines and spine density with results from Duan et al. (2003), Page et al. (2002), and Jacobs et al. (2001). Table 1 also summarizes the comparison of spine density per dendrite segment with the results of Soloway et al. (2002). Our total spine numbers and spine densities are higher, except for the per-segment density in the basal arbors of the neurons studied in Soloway et al. Figures 10a and 10b compare total spine count as a function of branch order for the apical and basal arbors, respectively, with the results from Duan et al. Direct comparison is difficult due to the different methods of assignment of branch orders, as discussed for the TDL results of Figures 5a and 5b. Based on that discussion, we would expect better agreement for the apical arbor (except for possibly a more rapid fall-off at higher branch order due to the truncation effects discussed above) than for the basal. Indeed, we see greater qualitative agreement for the apical arbor, though with higher spine counts recorded for the automated method.
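The percentages in Table 2 are taken relative to the average of the automated total (common plus automated-only) and the manual total (common plus manual-only). That bookkeeping can be reproduced directly from the three raw counts:

```python
# Reproduce the Table 2 bookkeeping for one tile: the "average" count is
# the mean of the automated and manual totals, and each percentage is
# taken relative to that average.

def agreement_stats(common, auto_only, manual_only):
    automated = common + auto_only
    manual = common + manual_only
    average = (automated + manual) / 2

    def pct(n):
        return 100.0 * n / average

    return average, pct(common), pct(auto_only), pct(manual_only)

avg, common_pct, auto_pct, manual_pct = agreement_stats(205, 31, 15)
print(avg)               # 228.0 spines on the proximal tile
print(round(common_pct)) # 90, the agreement quoted in the text
```

Running the same function on the distal counts (88, 24, 22) recovers the average of 111 and the 79% agreement reported for that tile.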
Figure 10: Morphological measurements on the spines obtained for the neuron shown in Figure 2 compared to results from Duan et al. (2003) for macaque monkeys of similar age. (a, b) Spine count for the apical and basal arbors, respectively. (c, d) Spine density for the apical and basal arbors, respectively. Plots show mean ± S.E.M. values.
Comparison of spine density with branch order is shown in Figures 10c and 10d. Note that spine density results will not be as affected by truncated dendrite branches, though differences in branch order numbering schemes will play some role. For the apical arbor, our data suggest a decrease in spine density with branch order, whereas the data of Duan et al. (2003) suggest a relatively constant density with branch order. For the basal arbor, both data
sets suggest an increase in spine density with branch order, though the rise is faster in our neuron. This difference can be explained by the difference in branch order labeling methods. If spine densities increase on daughter branches relative to their common parent, then labeling the parent and one daughter with the same branch order number, as in Duan et al. (2003), will average densities, lowering the rate of increase seen with branch order. We note the results of DeLima et al. (1990) and Feldman (1984), which report increasing spine density with branch order until distal branches are reached, where the density then declines.

The automated method detects many more spines (and therefore finds higher spine densities) than published manual methods. It has been documented (Anderson et al., 1995; Elston & Rosa, 1997; Feldman & Peters, 1979) that many current methods tend to underestimate spine counts. We note in particular that spine counts in the study of Duan et al. (2003) are obtained by examining 2D slices separated by 1 µm increments, whereas the automated spine analysis examines the entire data set in 3D at 0.1 µm resolution. Manual analysis of a subset of the data presented here reveals that automated and manual spine identifications can agree on 90% of the spines. Disagreement on the remaining identifications can be attributed to small spines, which are difficult to identify manually; small cellular debris near dendrites, which cannot be clearly distinguished from spines; and segmentation artifacts in the automated method.

Spine length and shape analyses have been reported in rodents (Harris & Stevens, 1988; Harris et al., 1992; Irwin et al., 2002; Nimchinsky et al., 2001; Wilson, Groves, Kitai, & Linder, 1983) but not in primates. We therefore present our results for spines on the apical shaft, apical oblique branches, and basal dendritic branches with no comparison against previous literature.
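The effect of the two branch-order labeling conventions discussed above is easy to see on a toy tree. In the sketch below, a node is a tuple (is_continuation, children), where is_continuation marks the daughter judged to continue its parent branch; this representation is purely illustrative and is not the data structure used by either study.

```python
# Two branch-order labeling conventions on the same toy tree.

def label_all_daughters(node, order=1, out=None):
    """Convention used here for basal arbors: every daughter gets order n + 1."""
    if out is None:
        out = []
    out.append(order)
    for child in node[1]:
        label_all_daughters(child, order + 1, out)
    return out

def label_with_continuation(node, order=1, out=None):
    """Duan et al. (2003)-style: a continuing daughter keeps its parent's order n."""
    if out is None:
        out = []
    out.append(order)
    for child in node[1]:
        label_with_continuation(child, order if child[0] else order + 1, out)
    return out

# A parent that splits twice; at each split, one daughter continues the parent.
tree = (False, [(True, [(True, []), (False, [])]),
                (False, [])])

print(label_all_daughters(tree))      # [1, 2, 3, 3, 2]
print(label_with_continuation(tree))  # [1, 1, 1, 2, 2]
```

Under the continuation convention, a parent and one daughter share a label, so any per-order statistic (spine density, TDL) averages across what the other convention treats as distinct orders, flattening trends exactly as described in the text.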
Figure 11 presents histograms of the spine length, shape classification, and volume results. Spines are categorized according to dendrite type and distance from the soma. For each category, the Shapiro-Wilk statistic (Shapiro & Wilk, 1965) was used to test the hypothesis that the spine length distributions are normal. The Shapiro-Wilk statistic rejects this hypothesis for every category except apical shaft spines less than 150 µm from the soma and apical oblique spines less than 75 µm from the soma. Shape classification frequencies are uniformly distributed within each distance-dendrite type category. Mushroom shapes dominate, and thin spines are the least frequent. As noted in section 3.4, our volume estimation procedure, based on straight voxel counts, has not yet been validated. Based on exponential fits to the histograms, the results are consistent with an exponentially decreasing volume distribution for apical shaft spines 150 to 225 µm from the soma, for apical oblique spines more than 75 µm from the soma, and for basal spines.

Figure 11: Measured distributions for spine length, shape, and volume obtained from the neuron shown in Figure 2. Spines are categorized according to dendrite branch type and distance from the soma.

We have also analyzed the dependence on branch order of branch length, branch diameter, spine density, mean spine length, and mushroom spine shape fraction from our data, using a one-way analysis of variance to test the hypothesis of no change of the dependent variable with branch order. This was done separately for data on the apical oblique and the basal dendrites. A post hoc Duncan's multiple-range test was applied to detect branch order groups with significantly different measures for each variable tested. Because this design analyzed one dependent measure at a time, a Bonferroni correction with α = 0.0125 was applied to ensure an experiment-wise α of 0.05. For the apical oblique dendrites, the analysis of variance revealed no significant variation in any of these measures with branch order. For the basal dendrites, the analysis of variance revealed significant variation in dendrite diameter [F(5,51) = 9.33, p < 0.0001] and spine density [F(5,51) = 10.02, p < 0.0001] with branch order. Duncan's test revealed no significant variation in branch diameter with branch order for dendrites in the basal arbor with branch order two and higher. The mean values for the basal dendrites
(see Figure 8) do, however, suggest a trend toward smaller diameter. Duncan's test revealed a significant increase in spine density between proximal (one and two) and distal (four and higher) branch orders for dendrites in the basal arbor.

5 Discussion

Our results suggest that a next generation of automated tracing software containing feature recognition is a feasible, near-term goal. With respect to dendritic arbor tracing, the major hurdle is one of truncated branch traces due to breaks. The ability of the human eye to use nonlocal information to bridge a gap is a skill that is challenging to duplicate with fast computer algorithms, where speed is usually attained by depending on only local information. The problem of tracing across segmentation-induced breaks has been satisfactorily handled in 2D tracing of neurite outgrowth from explants (Ford-Holevinski, Dahlberg, & Agranoff, 1986; Weaver, Pinezich, Lindquist, & Vazquez, 2003) and should be adaptable to segmentation-induced breaks in tracing dendrites in 3D. It may also be adaptable to handling breaks induced by minor rotation effects, though a resolution of this problem may more appropriately lie in data acquisition techniques.

There remain manual steps still to be automated. The retiling needed in section 3.1 is one such step. We believe we can construct an algorithm that will produce nonoverlapping rectangular tiles that avoid coverage of regions devoid of neural processes, are relatively balanced as to the volume of neural processes each contains (load balancing), and minimize the number of dendritic segments cut by tile boundaries. It is unlikely, at least in the near future, that manual intervention can be completely eliminated. (The reasonable goal is to reduce manual intervention to a minimum time involvement.) As mentioned, the GUI provides the ability to manually eliminate spine and dendrite identifications judged to be false positive by a human observer.
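As a rough illustration of the kind of gap bridging discussed above, trace endpoints can be paired when they are mutually close and each dying branch points toward the other. The thresholds, and the representation of an endpoint as a position plus a unit direction, are assumptions made for this sketch; the cited 2D method is considerably more elaborate.

```python
import math

# Illustrative endpoint bridging: pair up trace endpoints that are close
# and roughly collinear with their branches. Not the method of the text.

def bridge_breaks(endpoints, max_gap=5.0, min_alignment=0.8):
    """Return index pairs of endpoints judged to belong to one broken branch."""
    bridges = []
    for i in range(len(endpoints)):
        for j in range(i + 1, len(endpoints)):
            (p, dp), (q, dq) = endpoints[i], endpoints[j]
            gap = math.dist(p, q)
            if gap > max_gap or gap == 0:
                continue
            # Unit vector from p to q; the branch ending at p should point
            # toward q, and the branch ending at q should point back toward p.
            u = tuple((qc - pc) / gap for pc, qc in zip(p, q))
            fwd = sum(a * b for a, b in zip(dp, u))
            bwd = -sum(a * b for a, b in zip(dq, u))
            if fwd >= min_alignment and bwd >= min_alignment:
                bridges.append((i, j))
    return bridges

# Two endpoints 2 um apart pointing at each other; a third far away.
endpoints = [((0, 0, 0), (1, 0, 0)),
             ((2, 0, 0), (-1, 0, 0)),
             ((50, 0, 0), (1, 0, 0))]
print(bridge_breaks(endpoints))  # [(0, 1)]
```

A purely local rule like this would still miss breaks accompanied by tile rotation, consistent with the suggestion above that rotation-induced breaks are better addressed at acquisition time.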
It would be useful to include the possibility of manual false-negative corrections. Intensity variations between imaged tiles distal from and proximal to the soma result from adjustment of microscope settings to optimize the contrast of dendrite and spine intensities with that of the background. Such intensity variation throughout the image required manual adjustment of the two threshold parameters in the segmentation algorithm applied to each tile in the fine-scale analysis. We expect that this parameter adjustment can be automated by directly exploiting recorded microscope settings.

The major effort in this article is to demonstrate the applicability of automated dendrite and spine detection software and to validate results against current state-of-the-art, manually aided analysis. We have made a determined effort to control the comparison between similar cell types from the same species and age range. However, there remain a number of factors potentially contributing to disagreement between our automated results and results from the literature reported here. These include (1) a single-neuron sample size; (2) primate species variation; (3) age variation (the data in Soloway et al. (2002) are from young adult monkeys); (4) cells from different cortical layers (Page et al., 2002; Soloway et al., 2002); (5) staining method: intracellular injection (this study; Duan et al., 2002; Duan et al., 2003; Elston et al., 1999b; Page et al., 2002; Soloway et al., 2002) versus Golgi stain (Jacobs et al., 2001); (6) the method of labeling branch order; (7) lower manual spine counts, which can occur when spines hidden behind dendrites lie undetected, with many more spines detected by the 3D automated method; and (8) habituation-related biases in manual counting.

This work also demonstrates the potential for access to morphological measures not commonly accessed. Due to the involved manual reconstruction required, detailed morphology (length, volume, shape) of dendritic spines and detailed determinations of dendritic radius are not readily available in other primate studies. Automated analysis removes this restriction. We note finally that the morphological output from automated algorithms can be put into a form compatible with the input required by compartmental modeling software, allowing data to be incorporated into a very detailed compartmental model with no additional manipulation.

Acknowledgments

We thank H. Duan for access to her macaque data (Duan et al., 2003). This work is supported by the NSF Division of Mathematical Sciences, grant DMS-0107893 (C.M.W., W.B.L.), NSF Division of Biological Infrastructure, grant DBI-0305799 (C.M.W.), and NIH grants MH58911 (P.R.F.), DC04632, MH06073, and RR16754 (S.L.W.). We thank S. Henderson for help with the confocal microscopy.

References

Al-Kofahi, K. A., Lasek, S., Szarowski, D. H., Pace, C. J., Nagy, G., Turner, J. N., & Roysam, B. (2002).
Rapid automated three-dimensional tracing of neurons from confocal image stacks. IEEE Trans. Information Tech. Biomed., 6, 171–187.
Anderson, S. A., Classey, J. D., Condé, F., Lund, J. S., & Lewis, D. A. (1995). Synchronous development of pyramidal neuron dendritic spines and parvalbumin-immunoreactive chandelier neuron axon terminals in layer III of monkey prefrontal cortex. Neurosci., 67, 7–22.
Baer, S. M., & Rinzel, J. (1991). Propagation of dendritic spikes mediated by excitable spines: A continuum theory. J. Neurophysiol., 65, 874–890.
Bartesaghi, R., Severi, S., & Guidi, S. (2003). Effects of early environment on pyramidal neuron morphology in field CA1 of the guinea-pig. Neuroscience, 116, 715–732.
Bower, J. M., & Beeman, D. (1994). The book of GENESIS: Exploring realistic neural models with the GEneral NEural SImulation system. Berlin: TELOS/Springer-Verlag.
Cannon, R. C., Turner, D. A., Pyapali, G. K., & Wheal, H. V. (1998). An online archive of reconstructed hippocampal neurons. J. Neurosci. Methods, 84, 49–54.
Cohen, A. R., Roysam, B., & Turner, J. N. (1994). Automated tracing and volume measurements of neurons from 3-D confocal fluorescence microscopy data. J. Microsc., 171, 103–114.
DeFelipe, J., & Fariñas, I. (1992). The pyramidal neuron of the cerebral cortex: Morphological and chemical characteristics of the synaptic inputs. Prog. Neurobiol., 39, 563–607.
DeLima, A. D., Voigt, T., & Morrison, J. H. (1990). Morphology of the cells within the inferior temporal gyrus that project to the prefrontal cortex in the macaque monkey. J. Comp. Neurol., 296, 159–172.
Duan, H., Wearne, S. L., Morrison, J. H., & Hof, P. R. (2002). Quantitative analysis of the dendritic morphology of corticocortical projection neurons in the macaque monkey association cortex. Neuroscience, 114, 349–359.
Duan, H., Wearne, S. L., Rocher, A. B., Macedo, A., Morrison, J. H., & Hof, P. R. (2003). Age-related dendritic and spine changes in corticocortically projecting neurons in macaque monkeys. Cereb. Cortex, 13, 950–961.
Elston, G. N. (2000). Pyramidal cells of the frontal lobe: All the more spinous to think with. J. Neurosci., 20, RC95.
Elston, G. N., & Rockland, K. S. (2002). The pyramidal cell of the sensorimotor cortex of the macaque monkey: Phenotypic variation. Cereb. Cortex, 12, 1071–1078.
Elston, G. N., & Rosa, M. G. (1997). The occipitoparietal pathway of the macaque monkey: Comparison of pyramidal cell morphology in layer III of functionally related cortical visual areas. Cereb. Cortex, 7, 432–452.
Elston, G. N., & Rosa, M. G. (1998). Morphological variation of layer III pyramidal neurones in the occipitotemporal pathway of the macaque monkey visual cortex. Cereb. Cortex, 8, 278–294.
Elston, G. N., Tweedale, R., & Rosa, M. G. (1999a). Cellular heterogeneity in cerebral cortex: A study of the morphology of pyramidal neurones in visual areas of the marmoset monkey. J. Comp. Neurol., 415, 33–51.
Elston, G. N., Tweedale, R., & Rosa, M. G. (1999b). Cortical integration in the visual system of the macaque monkey: Large-scale morphological differences in the pyramidal neurons in the occipital, parietal and temporal lobes. Proc. R. Soc. Lond. Ser. B, 266, 1367–1374.
Elston, G. N., Tweedale, R., & Rosa, M. G. (1999c). Supragranular pyramidal neurones in the medial posterior parietal cortex of the macaque monkey: Morphological heterogeneity in subdivisions of area 7. NeuroReport, 10, 1925–1929.
Feldman, M. L. (1984). Morphology of the neocortical pyramidal neuron. In E. G. Jones & A. Peters (Eds.), Cerebral cortex (Vol. 1, pp. 123–200). New York: Plenum Press.
Feldman, M. L., & Peters, A. (1979). Technique for estimating total spine numbers on Golgi-impregnated dendrites. J. Comp. Neurol., 118, 527–542.
Ford-Holevinski, T., Dahlberg, T. A., & Agranoff, B. W. (1986). A microcomputer-based image analyzer for quantifying neurite outgrowth. Brain Res., 368, 339–346.
Gilbert, C. D., & Wiesel, T. N. (1992). Receptive-field dynamics in adult primary visual cortex. Nature, 356, 150–152.
Glantz, L. A., & Lewis, D. A. (2000). Decreased dendritic spine density on prefrontal cortical pyramidal neurons in schizophrenia. Arch. Gen. Psychiatry, 57, 65–73.
Goddard, N. H., Beeman, D., Cannon, R., Cornelis, H., Gewaltig, M. O., Hood, G., Howell, F., Rogister, P., De Schutter, E., Shankar, K., & Hucka, M. (2002). NeuroML for plug and play neuronal modeling. Neurocomputing, 44, 1077–1081.
Goddard, N. H., Hucka, M., Howell, F., Cornelis, H., Shankar, K., & Beeman, D. (2001). Towards NeuroML: Model description methods for collaborative modelling in neuroscience. Philos. Trans. R. Soc. B, 356, 1209–1228.
Harris, K. M., Jensen, F. E., & Tsao, B. (1992). Three-dimensional structure of dendritic spines and synapses in rat hippocampus (CA1) at postnatal day 15 and adult ages: Implications for the maturation of synaptic physiology and long-term potentiation. J. Neurosci., 12, 2685–2705.
Harris, K. M., & Stevens, J. K. (1988). Dendritic spines of rat cerebellar Purkinje cells: Serial electron microscopy with reference to their biophysical characteristics. J. Neurosci., 8, 4455–4469.
Herzog, A., Krell, G., Michaelis, B., Wang, J., Zuschratter, W., & Braun, K. (1997). Restoration of three-dimensional quasi-binary images from confocal microscopy and its application to dendritic trees. In C. J. Cogswell, J. Conchello, & T. Wilson (Eds.), Three-dimensional microscopy: Image acquisition and processing IV, Proceedings of SPIE. Bellingham, WA: SPIE.
Hines, M. L. (1994). The NEURON simulation program. In J. Skrzypek (Ed.), Neural network simulation environments (pp. 147–163). Norwell, MA: Kluwer.
Holmes, W. R. (1989). The role of dendritic diameter in maximizing the effectiveness of synaptic inputs. Brain Res., 478, 127–137.
Hübener, M., Schwarz, C., & Bolz, J. (1990). Morphological types of projection neurons in layer 5 of cat visual cortex. J. Comp. Neurol., 301, 655–674.
Irwin, S. A., Idupulapati, M., Gilbert, M. E., Harris, J. B., Chakravarti, A. B., Rogers, E. J., Crisostomo, R. A., Larsen, B. P., Mehta, A., Alcantara, C. J., Patel, B., Swain, R. A., Weiler, I. J., Oostra, B. A., & Greenough, W. T. (2002). Dendritic spine and dendritic field characteristics of layer V pyramidal neurons in the visual cortex of fragile-X knockout mice. Am. J. Med. Genet., 111, 140–146.
Irwin, S. A., Idupulapati, M., Harris, J. B., Crisostomo, R. A., Larsen, B. P., Kooy, F., Willems, P. J., Cras, P., Kozlowski, P. B., Swain, R. A., Weiler, I. J., & Greenough, W. T. (2001). Abnormal dendritic spine characteristics in the temporal and visual cortices of patients with fragile-X syndrome: A quantitative examination. Am. J. Med. Genet., 98, 161–167.
Jacobs, B., Driscoll, L., & Schall, M. (1997). Life-span dendritic and spine changes in areas 10 and 18 of human cortex: A quantitative Golgi study. J. Comp. Neurol., 386, 661–680.
Jacobs, B., Schall, M., Prather, M., Kapler, E., Driscoll, L., Baca, S., Jacobs, J., Ford, K., Wainwright, M., & Treml, M. (2001). Regional dendritic and spine variation in human cerebral cortex: A quantitative Golgi study. Cereb. Cortex, 11, 558–571.
Koh, Y. Y. (2001). Automated recognition algorithms for neural studies. Unpublished doctoral dissertation, Stony Brook University.
Koh, I. Y. Y., Lindquist, W. B., Zito, K., Nimchinsky, E. A., & Svoboda, K. (2002). An image analysis algorithm for dendritic spines. Neural Comput., 14, 1283–1310.
Kolb, B., Gibb, R., & Gorny, G. (2003). Experience-dependent changes in dendritic arbor and spine density in neocortex vary qualitatively with age and sex. Neurobiol. Learn. Mem., 79, 1–10.
Krichmar, J. L., Nasuto, S. J., Scorcioni, R., Washington, S. D., & Ascoli, G. A. (2002). Effects of dendritic morphology on CA3 pyramidal cell electrophysiology: A simulation study. Brain Res., 941, 11–28.
Mainen, Z. F., & Sejnowski, T. J. (1996). Influence of dendritic structure on firing pattern in model neocortical neurons. Nature, 382, 363–366.
Maravall, M., Shepherd, G. M. G., Koh, I. Y. Y., Lindquist, W. B., & Svoboda, K. (in press). Experience-dependent changes in dendritic morphology of layer 2/3 pyramidal neurons during a critical period for developmental plasticity in rat barrel cortex. Cerebral Cortex.
Nimchinsky, E. A., Hof, P. R., Young, W. G., & Morrison, J. H. (1996). Neurochemical, morphologic, and laminar characterization of cortical projection neurons in the cingulate motor areas of the macaque monkey. J. Comp. Neurol., 374, 136–160.
Nimchinsky, E. A., Oberlander, A. M., & Svoboda, K. (2001). Abnormal development of dendritic spines in FMR1 knock-out mice. J. Neurosci., 21, 5139–5146.
Nimchinsky, E. A., Sabatini, B. L., & Svoboda, K. (2002). Structure and function of dendritic spines. Annu. Rev. Physiol., 64, 313–352.
Oh, W., & Lindquist, W. B. (1999). Image thresholding by indicator kriging. IEEE Trans. Pattern Anal. Mach. Intell., 21, 590–602.
Page, T. L., Einstein, M., Duan, H., He, Y., Flores, T., Rolshud, D., Erwin, J. M., Wearne, S. L., Morrison, J. H., & Hof, P. R. (2002). Morphological alterations in neurons forming corticocortical projections in the neocortex of aged Patas monkeys. Neurosci. Lett., 317, 37–41.
Purpura, D. P. (1974). Dendritic spine dysgenesis and mental retardation. Science, 186, 1126–1128.
Ramón y Cajal, S. (1893). Neue Darstellung vom histologischen Bau des Centralnervensystems. Arch. Anat. Entwickl. [Anatomische Abteilung des Archiv für Anatomie und Physiologie], 319–428.
Rodriguez, A., Ehlenberger, D., Kelliher, K., Einstein, M., Henderson, S. C., Morrison, J. H., Hof, P. R., & Wearne, S. L. (2003). Automated reconstruction
1382
C. Weaver, P. Hof, S. Wearne, and W. Lindquist
of three-dimensional neuronal morphology from laser scanning microscopy images. Methods, 30, 94–105. Rumberger, A., Schmidt, M., Lohmann, H., & Hoffmann, K. P. (1998). Correlation of electrophysiology, morphology, and functions in corticotectal and corticopretectal projection neurons in rat visual cortex. Exp. Brain Res., 119, 375–390. Rusakov, D. A., & Stewart, M. G. (1995). Quantication of dendritic spine populations using image analysis and a tilting disector. J. Neurosci. Methods, 60, 11–21. Scheibel, A. B., Conrad, T., Perdue, S., Tomiyasu, U., & Wechsler, A. (1990). A quantitative study of dendrite complexity in selected areas of the human cerebral cortex. Brain Cogn., 12, 85–101. Scheibel, A. B., Paul, L. A., Fried, I., Forsythe, A. B., Tomiyasu, U., Wechsler, A., Kao, A., & Slotnick, J. (1985). Dendritic organization of the anterior speech area. Exp. Neurol., 87, 109–117. Seib, L. M., & Wellman, C. L. (2003). Daily injections alter spine density in rat medial prefrontal cortex. Neurosci. Lett., 337, 29–32. Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3–4), 591–611. Sholl, D. A. (1953). Dendritic organization in the neurons of the visual and motor cortices of the cat. J. Anat., 87, 387–406. Soloway, A. S., Pucak, M. L., Melchitzky, D. S., & Lewis, D. A. (2002). Dendritic morphology of callosal and ipsilateral projection neurons in monkey prefrontal cortex. Neuroscience, 109, 461–471. Stratford, K., Mason, A., Larkman, A., Major, G., & Jack, J. (1989). The modeling of pyramidal neurons in the visual cortex. In R. Durbin, C. Miall, & G. Mitchison (Eds.), The computing neuron (pp. 296–321). Reading, MA: AddisonWesley. Surkis, A., Peskin, C. S., Tranchina, D., & Leonard, C. S. (1998). Recovery of cable properties through active and passive modeling of subthreshold membrane responses from laterodorsal tegmental neurons. J. Neurophysiol., 80, 2593– 2607. Tanzi, E. (1893). 
I fatti i le induzioni nell’odierna istologia del sistema nervoso. Riv. Sper. Freniatr. Med. Leg. Alien. Ment., 19, 419–422. Turner, J. N., Ancin, H., Becker, D. E., Szarowski, D. H., Holmes, M., O’Connor, N., Wang, M., Holmes, T., & Roysam, B. (1997). Automated image analysis technologies for biological 3D light microscopy. Int. J. Imag. Syst. Tech., 8, 240–254. Turner, J. N., Shain, W., Szarowski, D. H., Lasek, S., Dowell, N., Sipple, B., Can, A., Al-Kofahi, K., & Roysam, B. (2000). Three-dimensional light microscopy: Observation of thick objects. J. Histotechnol., 23, 205–217. Uylings, H. B. M., Ruiz-Marcos, A., & van Pelt, J. (1986). The metric analysis of three-dimensional dendritic tree patterns: A methodological review. J. Neurosci. Methods, 18, 127–151. Wallace, C., Hawrylak, N., & Greenough, W. T. (1991). Studies of synaptic structural modications after long-term potentiation and kindling: Context
Algorithms for Morphometry of Neuronal Dendrites
1383
for a molecular morphology. In M. Baudry & J. L. Davis (Eds.), Long-term potentiation: A debate of current issues (pp. 189–332). Cambridge, MA: MIT Press. Watzel, R., Braun, K., Hess, A., Scheich, H., & Zuschratter, W. (1995). Detection of dendritic spines in 3-dimensional images. Bielefeld: Springer-Verlag. Weaver, C. M. (2003). Automated morphology of neural cells. Unpublished doctoral dissertation, Stony Brook University. Weaver, C. M., & Lindquist, W. B. (2003). 3DMA-neuron users manual. Available on-line at: http://www.ams.sunysb.edu/»lindquis/3dma neuron/ 3dma neuron.html. Weaver, C. M., Pinezich, J. D., Lindquist, W. B., & Vazquez, M. E. (2003). An algorithm for neurite outgrowth reconstruction. J. Neurosci. Methods, 124, 197–205. Wilson, C. J. (1984). Passive cable properties of dendritic spines and spiny neurons. J. Neurosci., 4, 281–297. Wilson, C. J. (1988). Cellular mechanisms controlling the strength of synapses. J. Electron Microsc. Tech., 10, 293–313. Wilson, C. J., Groves, P. M., Kitai, S. T., & Linder, J. C. (1983). Three-dimensional structure of dendritic spines in the rat neostriatum. J. Neurosci., 3, 383–398. Received November 24, 2003; accepted January 7, 2004.
LETTER
Communicated by Misha Tsodyks
Computing and Stability in Cortical Networks

Peter E. Latham
[email protected]
Sheila Nirenberg
[email protected]
Department of Neurobiology, University of California at Los Angeles, Los Angeles, CA 90095-1763, U.S.A.
Cortical neurons are predominantly excitatory and highly interconnected. In spite of this, the cortex is remarkably stable: normal brains do not exhibit the kind of runaway excitation one might expect of such a system. How does the cortex maintain stability in the face of this massive excitatory feedback? More importantly, how does it do so during computations, which necessarily involve elevated firing rates? Here we address these questions in the context of attractor networks—networks that exhibit multiple stable states, or memories. We find that such networks can be stabilized at the relatively low firing rates observed in vivo if two conditions are met: (1) the background state, where all neurons are firing at low rates, is inhibition dominated, and (2) the fraction of neurons involved in a memory is above some threshold, so that there is sufficient coupling between the memory neurons and the background. This allows "dynamical stabilization" of the attractors, meaning feedback from the pool of background neurons stabilizes what would otherwise be an unstable state. We suggest that dynamical stabilization may be a strategy used for a broad range of computations, not just those involving attractors.

1 Introduction

Attractor networks—networks that exhibit multiple stable states—have served as a key theoretical model for several important computations, including associative memory (Hopfield, 1982, 1984), working memory (Amit & Brunel, 1997a; Brunel & Wang, 2001), and the vestibular-ocular reflex (Seung, 1996), and determining whether these models apply to real biological networks is an active area of experimental research (Miyashita & Hayashi, 2000; Aksay, Gamkrelidze, Seung, Baker, & Tank, 2001; Ojemann, Schoenfield-McNeill, & Corina, 2002; Naya, Yoshida, & Miyashita, 2003).
A definitive determination, however, has been difficult, mainly because attractors cannot be observed directly; instead, inferences must be made about their existence by comparing experimental data with model predictions.

Neural Computation 16, 1385–1412 (2004) © 2004 Massachusetts Institute of Technology
To make these comparisons, it is necessary to have realistic models. Construction of such models has proved difficult because of what we refer to as the stability problem. The stability problem arises primarily because cortical networks are highly recurrent—a typical neuron in the cortex receives input from 5,000 to 10,000 others, most of which are nearby (Braitenberg & Schüz, 1991). While this high connectivity undoubtedly provides immense computational power, it also has a downside: it can lead to instabilities in the form of runaway excitation. For example, even mild electrical stimulation applied periodically can eventually lead to seizures (McIntyre, Poulter, & Gilby, 2002), and epilepsy, a sign of intrinsic instability, occurs in 0.5 to 1% of the human population (Bell & Sander, 2001; Hauser, 1997). The severity of the stability problem lies in the fact that recurrent connections in attractor networks have to be strong enough to allow activity in the absence of input, but not so strong that the activity can occur spontaneously. Moreover, in areas where attractor networks are thought to exist, such as prefrontal cortex (Fuster & Alexander, 1971; Wilson, Scalaidhe, & Goldman-Rakic, 1993; Freedman, Riesenhuber, Poggio, & Miller, 2001) and inferior temporal cortex (Fuster & Jervey, 1982; Miyashita & Chang, 1988), the firing rates associated with attractor states are not much higher than those associated with background activity. The two differ by only a few Hz, with typical background rates ranging from 1 to 10 Hz and attractor rates ranging from 10 to 20 Hz (Fuster & Alexander, 1971; Miyashita & Chang, 1988; Nakamura & Kubota, 1995). This small difference makes a hard stability problem even harder, as there is almost no barrier preventing spontaneous jumps to attractors.
Indeed, in previous attempts to build realistic models of attractor networks (Amit & Brunel, 1997a; Wang, 1999; Brunel, 2000; Brunel & Wang, 2001), fine-tuning of parameters was required to support attractors at the low rates seen in vivo. This was mainly because firing rates in those models were effectively set by single neuron saturation—by the fact that neurons simply cannot fire for extended periods above a maximum rate. Here we propose a solution to the fine-tuning problem, one that allows attractors to exist at low rates over a broad range of parameters. The basic idea is to use natural interactions between the attractors and the background to limit firing rates, so that rates are set by network rather than single neuron properties. Limiting firing rates in this way may be a general computational strategy, not one used just by attractor networks. Thus, quantifying this mechanism in the context of attractor networks may serve as a general model for how cortical circuits carry out a broad range of computations while avoiding instabilities.

2 Reduced Model System

To understand qualitatively the properties of attractor networks, we analyze a model system in which the neurons are described by their firing rates. For simplicity, we start with a network that exhibits two stable equilibria: a
background state, where all neurons fire at low rates, and a memory state, where a subpopulation of neurons fires at an elevated rate. Our results, however, apply to multiattractor networks—networks that exhibit multiple memories—and we will consider such networks below. The goal of the analysis in this section is to obtain a qualitative understanding of attractor networks, in particular, how they can fire at low rates without destabilizing the background. We will then use this qualitative understanding to guide network simulations of multiattractor networks with spiking neurons (containing up to 50 attractors), and use those simulations to verify the results obtained from the reduced model. We construct our reduced model attractor network in two steps. First, we build a randomly connected network of excitatory and inhibitory neurons that fire at a few Hz on average. Second, we pick out a subpopulation of the excitatory neurons (which we refer to as memory neurons) and strengthen the connections among them (see Figure 1). If the parameters are chosen properly, this strengthening will produce a network in which the memory neurons can fire at either an elevated rate or the background rate, resulting in a bistable network.
Figure 1: Schematic of network architecture. Excitatory and inhibitory neurons are synaptically coupled via the connectivity matrix J. The inhibitory neurons are homogeneous; the excitatory neurons break into two populations, memory and nonmemory. The coupling among memory neurons is higher than among nonmemory neurons by a factor of β(1 − f); the coupling from nonmemory to memory neurons is lower by a factor of βf. The memory neurons make up a fraction f of the excitatory population, so the decrease in coupling from nonmemory to memory neurons ensures that the total synaptic drive to both memory and nonmemory neurons is the same.
To analyze the stability and computational properties of this network, we assume that, in equilibrium, the firing rate of a neuron is a function of only the firing rates of the neurons presynaptic to it. Then, near equilibrium, we expect Wilson and Cowan–like equations (Wilson & Cowan, 1972) in which the firing rates obey first-order dynamics. Augmenting the standard Wilson and Cowan equations to allow increased connectivity among a subpopulation of neurons (see Figure 1 and appendix A), we have

\tau_E \frac{d\nu_{Ei}}{dt} + \nu_{Ei} = \phi_E\left( J_{EE}\,\nu_E - J_{EI}\,\nu_I + \frac{\beta}{N_E f(1-f)}\, \xi_i \sum_j (\xi_j - f)\,\nu_{Ej} \right)   (2.1a)

\tau_I \frac{d\nu_I}{dt} + \nu_I = \phi_I\left( J_{IE}\,\nu_E - J_{II}\,\nu_I \right).   (2.1b)
Here, τ_E and τ_I are the excitatory and inhibitory time constants, ν_Ei is the firing rate of the ith excitatory neuron (i = 1, …, N_E), ν_E and ν_I are the average firing rates of the excitatory and inhibitory neurons, respectively, the Js are the average coupling coefficients among the excitatory and inhibitory populations, φ_E and φ_I are the average excitatory and inhibitory gain functions, respectively, β is the effective strength of the coupling among the memory neurons, N_E is the number of excitatory neurons, and ξ is a random binary vector: ξ_i = 1 with probability f and 0 with probability 1 − f. The factor ξ_j − f ensures postsynaptic normalization: on average, the total synaptic strength to both memory and nonmemory neurons is the same. The ξ-dependent term in equation 2.1a is a standard one for constructing attractor networks (Hopfield, 1982; Tsodyks & Feigel'man, 1988; Buhmann, 1989).

The gain functions, φ_E and φ_I, play a key role in this analysis. These functions, which hide all the single neuron properties, have a natural interpretation: each one corresponds to the average firing rate of a population of neurons as a function of the average firing rates of neurons presynaptic to it. They thus have standard shapes when plotted versus ν_E: they look like f-I curves—firing rate versus injected current (McCormick, Connors, Lighthall, & Prince, 1985)—that have been smoothed at low firing rates (Brunel & Sergi, 1998; Tiesinga, José, & Sejnowski, 2000; Fourcaud & Brunel, 2002; Brunel & Latham, 2003). A generic gain function is shown in Figure 2. This curve does not correspond to any particular neuron or class of neurons—it is just illustrative. There are two important aspects to its shape: (1) it has a convex region, and (2) the transition from convex to concave occurs at firing rates well above 10 to 20 Hz on the output side (that is, the transition occurs when φ ≫ 10–20 Hz).
These properties are typical of both model neurons (Brunel & Sergi, 1998; Tiesinga et al., 2000; Fourcaud & Brunel, 2002; Brunel & Latham, 2003) and real neurons (McCormick et al., 1985; Chance, Abbott, & Reyes, 2002), and in the next two sections we will see how they connect to the problem of robust, low firing-rate attractors.
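The two-step construction above (a random excitatory–inhibitory network plus a strengthened subpopulation) can be simulated directly from equations 2.1a–b. The sketch below is a minimal illustration, not the paper's simulation: the logistic gain function, the Euler integration, and every parameter value (couplings, time constants, external drive, and the weak coupling β = 0.2, which keeps the network in the background-only regime) are assumptions chosen only to produce a stable low-rate background.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not the paper's values).
N_E, f = 200, 0.1                     # excitatory neurons; memory fraction
tau_E, tau_I = 10.0, 5.0              # time constants (arbitrary units)
J_EE, J_EI, J_IE, J_II = 1.0, 2.0, 1.5, 1.0
beta = 0.2                            # weak coupling: background state only
xi = (rng.random(N_E) < f).astype(float)   # xi_i = 1 w.p. f (memory neurons)

def phi(u, r_max=100.0):
    """Logistic stand-in for the gain of Figure 2: convex at low input,
    saturating near r_max ~ 100 Hz."""
    return r_max / (1.0 + np.exp(-(u - 20.0) / 4.0))

def step(nu_E, nu_I, I_ext, dt=0.1):
    """One Euler step of equations 2.1a-b (I_ext is an added external drive)."""
    nu_bar = nu_E.mean()
    memory_drive = beta / (N_E * f * (1 - f)) * xi * ((xi - f) @ nu_E)
    dnu_E = (-nu_E + phi(J_EE * nu_bar - J_EI * nu_I + memory_drive + I_ext)) / tau_E
    dnu_I = (-nu_I + phi(J_IE * nu_bar - J_II * nu_I)) / tau_I
    return nu_E + dt * dnu_E, nu_I + dt * dnu_I

nu_E, nu_I = np.full(N_E, 2.0), 2.0
for _ in range(5000):
    nu_E, nu_I = step(nu_E, nu_I, I_ext=10.0)
print(f"background rate ~ {nu_E.mean():.1f} Hz, inhibition ~ {nu_I:.1f} Hz")
```

With weak β the network settles into a single low-rate background state; strengthening β (step 2 of the construction) is what eventually produces bistability.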
Figure 2: Generic gain function versus average excitatory firing rate, ν_E. There are three distinct regimes: First, at small ν_E, the mean current induced by presynaptic firing is too low to cause a postsynaptic neuron to fire, so postsynaptic activity is due to fluctuations in the current. This is the noise-dominated regime, and here φ is convex. Second, for slightly larger ν_E, φ becomes approximately linear or slightly concave. Finally, for large enough ν_E, the firing rate saturates—at around 100 Hz for typical cortical pyramidal cells (McCormick et al., 1985). This gain function is a schematic and does not correspond to any particular neuron model.
Although equations 2.1a and 2.1b have a fairly broad range of applicability, they are not all encompassing. In particular, they leave out the possibility of instability via synchronous oscillations (Abbott & van Vreeswijk, 1993; Gerstner & van Hemmen, 1993; Hansel & Mato, 2001), and they ignore the effects of higher-order moments of the firing rate (Amit & Brunel, 1997a; Latham, 2002). We thus verify all predictions using large-scale simulations of synaptically coupled neurons. As indicated in Figure 1, the network described by equations 2.1a and 2.1b contains a preferentially connected subpopulation. We are interested in determining under what conditions the network supports two states: a "memory" state in which this subpopulation fires at elevated rate compared to the background, and a background state in which the subpopulation fires at the same rate as the background. Since it is the difference between the attractor and background firing rates that is important, we define a variable, m, that is proportional to this difference; for convenience, we let
the proportionality constant be 1/(1 − f):

m \equiv \frac{1}{1-f} \left[ \frac{1}{N_E f} \sum_i \xi_i \nu_{Ei} - \frac{1}{N_E} \sum_i \nu_{Ei} \right].   (2.2)
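As a concrete check of equation 2.2, a small sketch with made-up rates and a made-up memory pattern (hypothetical numbers, not data from the paper):

```python
import numpy as np

def memory_overlap(nu_E, xi, f):
    """m of equation 2.2: the difference between the subpopulation's mean
    rate and the overall mean rate, scaled by 1/(1 - f)."""
    N_E = len(nu_E)
    sub_rate = (xi * nu_E).sum() / (N_E * f)   # mean rate of memory neurons
    mean_rate = nu_E.mean()                    # mean excitatory rate nu_E
    return (sub_rate - mean_rate) / (1.0 - f)

f = 0.1
xi = np.zeros(100)
xi[:10] = 1.0                   # 10 of 100 neurons carry the memory
nu = np.full(100, 5.0)          # background: everyone at 5 Hz
assert memory_overlap(nu, xi, f) == 0.0        # background state: m = 0

nu[:10] = 15.0                  # memory neurons elevated to 15 Hz
m = memory_overlap(nu, xi, f)
# Consistency with the text: subpopulation rate = nu_E + (1 - f) * m.
print(round(m, 6), round(nu.mean() + (1 - f) * m, 6))   # → 10.0 15.0
```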
The first expression inside the brackets is the average firing rate of the subpopulation; the second is the average firing rate of the whole population, ν_E. Thus, the average firing rate of the subpopulation is ν_E + (1 − f)m, and the background state corresponds to m = 0. Intuitively, we expect that network dynamics should be governed by three variables: the firing rate associated with the memory state, m, the average excitatory firing rate, ν_E, and the average inhibitory rate, ν_I. In fact, if we average equations 2.1a and 2.1b over the index, i, we can derive differential equations for these variables. As shown in appendix A, the averaged equations are

\tau_E \frac{dm}{dt} + m = \phi_E(J_{EE}\,\nu_E - J_{EI}\,\nu_I + \beta m) - \phi_E(J_{EE}\,\nu_E - J_{EI}\,\nu_I)   (2.3a)

\tau_E \frac{d\nu_E}{dt} + \nu_E = (1 - f)\,\phi_E(J_{EE}\,\nu_E - J_{EI}\,\nu_I) + f\,\phi_E(J_{EE}\,\nu_E - J_{EI}\,\nu_I + \beta m)   (2.3b)

\tau_I \frac{d\nu_I}{dt} + \nu_I = \phi_I(J_{IE}\,\nu_E - J_{II}\,\nu_I).   (2.3c)
To simplify our analysis, we adopt the effective response function approach of Mascaro and Amit (1999), which can be implemented by taking the limit of fast inhibition, τ_I → 0. The main caveat to this limit is that we may overestimate the stability of equilibria: equilibria can be stable when τ_I = 0 but unstable when τ_I is above some threshold (Wilson & Cowan, 1972; van Vreeswijk & Sompolinsky, 1996; Latham, Richmond, Nelson, & Nirenberg, 2000a; Hansel & Mato, 2001). With τ_I = 0, we can solve equation 2.3c for ν_I in terms of ν_E. This allows us to write ν_I = ν_I(ν_E), where the function ν_I(ν_E) is determined implicitly from equation 2.3c with τ_I = 0. Replacing ν_I with ν_I(ν_E) in equations 2.3a and 2.3b, we find that all the ν_E-dependence on the right-hand side of these equations can be lumped into the single expression J_EE ν_E − J_EI ν_I(ν_E). We denote this −γ(ν_E), so that

\gamma(\nu_E) \equiv -J_{EE}\,\nu_E + J_{EI}\,\nu_I(\nu_E).   (2.4)
With ν_I eliminated, we are left with just two equations:

\tau_E \frac{dm}{dt} + m = \phi_E(-\gamma(\nu_E) + \beta m) - \phi_E(-\gamma(\nu_E)) \equiv \Delta\phi_E(\nu_E, m)   (2.5a)

\tau_E \frac{d\nu_E}{dt} + \nu_E = \phi_E(-\gamma(\nu_E)) + f\,\Delta\phi_E(\nu_E, m).   (2.5b)
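The reduction can be mirrored numerically. In the sketch below (illustrative logistic gain and couplings, all assumed), ν_I(ν_E) is obtained from equation 2.3c with τ_I = 0 by bisection—the residual ν_I − φ_I(J_IE ν_E − J_II ν_I) is strictly increasing in ν_I, so the root is unique—and from it γ(ν_E) of equation 2.4 and Δφ_E of equation 2.5a:

```python
import numpy as np

J_EE, J_EI, J_IE, J_II = 1.0, 2.0, 1.5, 1.0   # assumed couplings
R_MAX = 100.0

def phi(u):
    """Logistic stand-in for both gain functions (an assumption)."""
    return R_MAX / (1.0 + np.exp(-(u - 20.0) / 4.0))

def nu_I_of(nu_E):
    """Solve nu_I = phi_I(J_IE nu_E - J_II nu_I), i.e. eq. 2.3c with tau_I = 0.
    The residual is strictly increasing in nu_I, so bisection finds the
    unique root in [0, R_MAX]."""
    lo, hi = 0.0, R_MAX
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mid - phi(J_IE * nu_E - J_II * mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def gamma(nu_E):
    """Equation 2.4."""
    return -J_EE * nu_E + J_EI * nu_I_of(nu_E)

def delta_phi(nu_E, m, beta):
    """Delta-phi_E(nu_E, m) of equation 2.5a."""
    g = gamma(nu_E)
    return phi(-g + beta * m) - phi(-g)

# Delta-phi_E vanishes at m = 0, and for these parameters gamma increases
# over the range shown (more excitation recruits more inhibition).
print(delta_phi(10.0, 0.0, beta=1.0), gamma(10.0) < gamma(20.0))
```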
These equations are similar to ones derived previously by Brunel and colleagues (Amit & Brunel, 1997a; Brunel, 2000; Brunel & Wang, 2001).

2.1 The Sparse Coding Limit.

The question we are interested in is: under what conditions do equations 2.5a and 2.5b admit two stable solutions—one with m = 0 (the background state) and one with m ≠ 0 (the memory state)? Before we answer this question in general, let us consider the sparse coding limit, f → 0, as this limit is relatively simple and it allows us to make contact with previous work. When f = 0, equations 2.5a and 2.5b decouple. In this regime, we can solve equation 2.5b for the excitatory firing rate, then solve equation 2.5a for m. Let us assume that equation 2.5b has a stable solution, meaning the network admits a stable background state, and focus on equation 2.5a. This equation is best solved graphically, by plotting Δφ_E(ν_E, m) versus m and looking for intersections of this plot with the 45 degree line; those intersections correspond to equilibria (see Figure 3). Stability can be read off the
Figure 3: Gain functions and m-equilibria. (A) Δφ_E(ν_E, m) versus m for different values of β. Points where the curves cross the 45 degree line are equilibria. In the sparse coding limit, f → 0, an equilibrium is stable if the slope of the curve is less than one and unstable otherwise. Dotted line: weak coupling (small β). The only equilibrium is at the background, and it is stable. Solid line: intermediate coupling. Two more equilibria have appeared. The one with intermediate m is unstable; the one with large m is stable. Dashed line: large coupling. The background has become destabilized, and the memory is easily activated without input. Since Δφ_E(ν_E, m) is essentially an average single-neuron f-I curve, it saturates at the maximum firing rate of single neurons, ∼100 Hz. Thus, the upper equilibrium occurs at high rate. (B) To reduce the firing rate of the upper equilibrium, one must operate in a very narrow parameter regime: small changes in network parameters would either rotate the curve counterclockwise, which would destabilize the background and/or move the equilibrium to high rate, or rotate it clockwise, which would eliminate the memory state. (C) Blowup of the region below 15 Hz in B.
plots by looking at the slope: an equilibrium with slope less than 1 is stable and one with slope greater than 1 is unstable. The number and stability of the equilibria are determined by β. If β is small—weak coupling among the neurons in the subpopulation—there is only one equilibrium, at m = 0, and it is stable (see Figure 3A, dotted line). This makes sense, as weak coupling should not have much effect on network behavior. As β increases, so does the slope of Δφ_E(ν_E, m), and eventually a new pair of equilibria appear (see Figure 3A, solid line). Of these, the one at higher firing rate (larger m) is stable, since its slope is less than 1, and the one at lower, but nonzero, rate is unstable, since its slope is greater than 1. Finally, for large enough β, the unstable equilibrium slides past zero, at which point the equilibrium at m = 0, and thus the background, becomes unstable (see Figure 3A, dashed line). The intermediate β regime, where the network can support both a memory and a stable background, is of the most interest to us. It is in this regime that the network can actually compute, in the sense that input to the network controls which state it is in (memory or background). The low and high β regimes are not so interesting, however, at least if the goal is to construct a network that can store memories. If β is too low, the network is stuck in the background state; if it is too high, the network can easily jump spontaneously from the background to the memory state. As Figure 3A shows, it is not hard to build a network that can exhibit two states—so long as one is willing to allow firing rates near saturation, meaning at a healthy fraction of 100 Hz. It is much more difficult to build a network in which the memory state exhibits low firing rates—in the 10 to 20 Hz range, as is observed experimentally (Fuster & Alexander, 1971; Miyashita & Chang, 1988; Nakamura & Kubota, 1995).
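The graphical construction of Figure 3 is easy to reproduce numerically. In this sketch (assumed logistic gain; the background drive −γ is frozen at an arbitrary value, as is appropriate in the f → 0 limit), equilibria are crossings of Δφ_E(m) with the 45 degree line, classified as stable when the slope βφ′_E there is below one:

```python
import numpy as np

def phi(u, r_max=100.0):
    """Assumed logistic gain, saturating at r_max."""
    return r_max / (1.0 + np.exp(-(u - 20.0) / 4.0))

def dphi(u, r_max=100.0):
    """Derivative of the logistic gain."""
    p = phi(u, r_max)
    return p * (1.0 - p / r_max) / 4.0

def equilibria(beta, drive=-2.0):
    """Crossings of Delta-phi_E(m) = m at fixed background drive -gamma =
    `drive` (the f -> 0 limit), with slope-based stability as in Figure 3A."""
    m = np.linspace(0.0, 120.0, 120001)
    resid = phi(drive + beta * m) - phi(drive) - m
    out = []
    for i in np.flatnonzero(np.diff(np.sign(resid)) != 0):
        m0 = float(m[i])
        stable = bool(beta * dphi(drive + beta * m0) < 1.0)
        out.append((round(m0, 2), stable))
    return out

# Intermediate coupling (the solid line of Figure 3A): a stable background
# at m = 0, an unstable intermediate equilibrium, and a stable equilibrium
# near single-neuron saturation.
for m0, stable in equilibria(beta=1.0):
    print(m0, "stable" if stable else "unstable")
```

Note how the stable upper equilibrium lands near the saturation rate, illustrating why low-rate memories are hard to obtain in this limit.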
This is because low firing rates require too many bends in Δφ_E(ν_E, m) in too small a firing-rate range, where "too small" is relative to the saturation firing rate of ∼100 Hz (see Figures 3B and 3C). These qualitative arguments were quantified in networks of leaky integrate-and-fire neurons using both analytic techniques and numerical simulations (Brunel, 2000; Brunel & Wang, 2001). For attractors that fired at about 15 Hz, the coupling, β, among the subpopulations had to be tuned to within 1% to ensure that both the background and memories were stable. At slightly higher firing rates of 25 to 40 Hz, the tuning was somewhat more forgiving, 3% to 5%. It should be pointed out, however, that these networks were more robust to other parameters: multiple attractors were supported at reasonably low rates (20–40 Hz) when external input varied over a 40% range and the strengths of different receptor types (AMPA, NMDA, and GABA) were varied over a range of 5% to 15%.

2.2 Beyond the Sparse Coding Limit.

What these results indicate is that parameters need to be finely tuned for attractor networks to exist at low rates, at least in the f → 0 limit. When f is finite, however, the picture
changes, since the memories (m) and background (ν_E) couple (see equation 2.5). This means that the slope of Δφ_E(ν_E, m) with respect to m is no longer the sole factor determining stability; instead, stability depends on the interaction between m and ν_E. Consequently, an equilibrium in which the slope of Δφ_E(ν_E, m) is greater than 1 can be stable. For example, the equilibrium on the solid curve in Figure 3A that occurs at about 15 Hz, which is unstable in the f → 0 limit, could become stable when f > 0. If this were to happen, attractors could exist robustly at low rate. We refer to the regime in which the attractors are stable even though the slope of Δφ_E(ν_E, m) is greater than 1 as the dynamically stabilized regime. This is because an equilibrium that would be unstable when one considers only the time evolution of the memory is dynamically stabilized by feedback from the background. Networks that operate in this regime have been investigated by Sompolinsky and colleagues (Rubin & Sompolinsky, 1989; Golomb, Rubin, & Sompolinsky, 1990). Although the firing rates of the attractors in those networks were low, the networks were not realistic in an important respect: they did not exhibit a background state in which all neurons fired at about the same rate; instead, the only stable states were ones in which a subpopulation of the neurons was active. Our goal here is to overcome this problem and build a network that both operates in the dynamically stabilized regime, and thus at low rates, and supports a low firing-rate background. To do this, we must evaluate the stability of the equilibria in m-ν_E space. The network equilibria are found by setting dm/dt and dν_E/dt to zero in equations 2.5a and 2.5b and solving the resulting algebraic equations. With the time derivatives set to zero, the solutions to these equations are curves in m-ν_E space. These curves are referred to as nullclines, and their intersections correspond to equilibria.
Constructing nullclines from the properties of φ_E is straightforward (Latham et al., 2000a) but tedious. We thus skip the details and simply plot them; once we have made the plots, it is relatively easy to see that they have the correct qualitative shape. One possible set of nullclines is shown in Figure 4A, with the black curve corresponding to the ν_E-nullcline (the solution to equation 2.5b with dν_E/dt = 0) and the gray curves corresponding to the m-nullcline (the solution to equation 2.5a with dm/dt = 0; note that there are two pieces—a smooth curve and a vertical line). The shape of the ν_E-nullcline is relatively easy to understand: it is an increasing function of m, reflecting the fact that as m gets larger, there is more excitatory drive to the network. This can also be seen from equation 2.5b, where ν_E is coupled to m through the term fΔφ_E(ν_E, m), and Δφ_E(ν_E, m) is an increasing function of m. The m-nullcline is a little more complicated, primarily because it consists of two pieces rather than one. To understand its shape, we must reexamine Figure 3A. This figure shows three plots of Δφ_E(ν_E, m) versus m. These plots correspond to three values of β; however, because ν_E as well as β affects Δφ_E(ν_E, m), they could just as easily have corresponded to three values of
ν_E. In fact, had we fixed β and varied ν_E, we would have obtained a set of curves qualitatively similar to the ones shown in Figure 3A. In other words, depending on the value of ν_E, we would have seen three distinct regimes: (1) one equilibrium at m = 0 (dotted line in Figure 3A), (2) one equilibrium at m = 0 and two at m > 0 (solid line), and (3) one equilibrium at m = 0, one at m < 0, and one at m > 0 (dashed line). These three regimes are reflected in the m-nullcline in Figure 4A: when ν_E is large, there is a single equilibrium at m = 0 (regime 1); when ν_E is intermediate, there is an additional pair of equilibria at m > 0 (regime 2); and when ν_E is small, one of the equilibria in that pair becomes negative (regime 3). The order in which the regimes appear in Figure 4A—one equilibrium at zero when ν_E is large, two positive ones when ν_E is slightly smaller,
Figure 4: Nullclines in various parameter regimes. The gray curve, including the vertical line, is the nullcline for equation 2.5a; the black curve is the nullcline for equation 2.5b. The solid branches of the gray curves are stable at fixed ν_E; the dashed branches are unstable at fixed ν_E. The black arrows indicate typical trajectories; they are derived from equation 2.5. (A) Δφ_E(ν_E, m) is a decreasing function of ν_E; f is reasonably large. (B) Δφ_E(ν_E, m) is an increasing function of ν_E; f is reasonably large. (C) Δφ_E(ν_E, m) is a decreasing function of ν_E; f ≪ 1. (D) Δφ_E(ν_E, m) is an increasing function of ν_E; f ≪ 1.
and a negative one when ν_E is smaller still—corresponds to a particular dependence of Δφ_E(ν_E, m) on ν_E. Examining Figure 3A, we see that this progression corresponds to a dependence in which Δφ_E(ν_E, m) decreases as ν_E increases. Thus, for Figure 4A to correctly describe the m-nullcline, the network parameters must be such that Δφ_E(ν_E, m) is a decreasing function of ν_E. Below, we derive explicit conditions under which this happens. First, however, we examine its consequences. The black and gray nullclines in Figure 4 intersect in three places, and thus exhibit three equilibria. One is at m = 0, and two are at m > 0. The equilibrium at m = 0 corresponds to the background state; the other two are candidates for memory states. To determine which are stable, we have plotted a few typical trajectories (black arrows), which are derived from equation 2.5. These trajectories tell us that the stable equilibria occur at m = 0 and at the right-most intersection. Importantly, the stable equilibrium at m > 0 occurs on the unstable branch of the m-nullcline, in the dynamically stabilized regime. (We can tell that this branch is unstable because at fixed ν_E, the trajectories point away from it; flow is to the right below the m-nullcline and to the left above it. The ν_E-nullcline, on the other hand, consists of one stable branch, since flow is toward it at fixed m. This reflects the fact that the background is always stable at fixed m, independent of the firing rate of the memory state.) The unstable branches, which are drawn with dashed lines, correspond to points where the slope of Δφ_E(ν_E, m) is greater than one.
All previous work in realistic networks that we know of (Amit & Brunel, 1997a; Brunel, 2000; Brunel & Wang, 2001) puts the memory state on the stable branch of the m-nullcline— the part with slope less than one. The stable branch corresponds to the upper equilibrium in Figure 3A, and so tends to occur at high ring rate—at some reasonable fraction of the saturation ring rate for single neurons. It may seem counterintuitive to have a stable equilibrium on the unstable branch, but it is well known that this can happen (Rinzel & Ermentrout, 1989). The only caveat is that there is no guarantee of stability: the equilibrium may become unstable via a Hopf bifurcation (Marsden & McCracken, 1976), leading to oscillations. In fact, oscillations were occasionally observed—mainly near or above the stability boundary in Figure 7. The fact that the network did not oscillate in most of the parameter regime explored was, we believe, because the background is strongly stable (there is strong attraction to the ºE -nullcline in Figure 4). What this analysis shows is that two conditions are necessary for building an attractor network in which the memory states occur on the unstable branch of the m-nullcline, and thus re at low rates: 1ÁE .ºE ; m/ must be a decreasing function of ºE , and the fraction, f , of neurons involved in a memory must be large enough to give the ºE -nullcline signicant curvature.
P. Latham and S. Nirenberg
What happens if either of these conditions is violated? Figure 4B shows the nullclines when Δφ_E(ν_E, m) is an increasing function of ν_E; in this regime, the m-nullcline turns upside down. A low-firing-rate equilibrium still exists, but it is unstable, as indicated by the black arrows. The stable memory state in this regime is at high firing rate, near single-neuron saturation—too high to be consistent with experiment. Figures 4C and 4D show nullclines when f ≪ 1. Again, the only stable memory states are near saturation, and thus at high rate. Is the condition that Δφ_E(ν_E, m) be a decreasing function of ν_E satisfied for realistic networks? In other words, is it reasonable to expect ∂Δφ_E(ν_E, m)/∂ν_E < 0? To answer this, we use equation 2.5a to write

∂Δφ_E(ν_E, m)/∂ν_E = −γ′(ν_E) [φ_E′(−γ(ν_E) + βm) − φ_E′(−γ(ν_E))],   (2.6)
where a prime after a function denotes a derivative with respect to its argument. The term in brackets on the right-hand side of equation 2.6 is typically positive for a memory lying on the unstable branch of the m-nullcline; this is because the unstable branch corresponds to the intermediate equilibrium in Figure 3A, where φ_E is generally convex. Thus, the sign of ∂Δφ_E(ν_E, m)/∂ν_E is determined solely by the sign of γ′(ν_E). For typical cortical networks, it is likely that γ′(ν_E) > 0. That is because cortical networks operate in the high-gain regime, in which one excitatory action potential is capable of causing more than one excitatory action potential somewhere else in the network (Abeles, 1991; Matsumura, Chen, Sawaguchi, Kubota, & Fetz, 1996). The only way to stabilize such a system is to provide strong feedback from excitatory to inhibitory neurons, so that the inhibitory response to small increases in excitation dominates over the excitation (van Vreeswijk & Sompolinsky, 1996; Amit & Brunel, 1997b; Brunel & Wang, 2001). In terms of our network variables, this means that d(J^EI ν_I(ν_E))/dν_E must be greater than d(J^EE ν_E)/dν_E, which implies, via equation 2.4, that γ′(ν_E) > 0. Thus, the condition ∂Δφ_E(ν_E, m)/∂ν_E < 0 is naturally satisfied in cortical networks. Finally, we point out that a necessary condition for a low-firing-rate memory and a stable background state is that the ν_E-nullcline intersect the m = 0 line above the exchange-of-stability point (the point where the two branches of the m-nullcline intersect), and then intersect twice more on the unstable branch of the m-nullcline. A sufficient condition for this to happen is that the slope of the ν_E-nullcline is less than the slope of the m-nullcline at m = 0. If that condition is satisfied, then β can be increased until the ν_E-nullcline is sufficiently far above the exchange-of-stability point to guarantee two intersections on the unstable branch.
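The sign argument around equation 2.6 can be checked numerically: with any convex gain function φ_E and an increasing γ(ν_E), the increment Δφ_E(ν_E, m) = φ_E(−γ(ν_E) + βm) − φ_E(−γ(ν_E)) decreases with ν_E. A minimal sketch, using an illustrative softplus gain and a linear γ (both hypothetical choices, not the paper's functions):

```python
import math

beta, m = 1.0, 0.5

def phi(x):
    # Softplus: a smooth, convex gain function (illustrative choice).
    return math.log1p(math.exp(x))

def gamma(nu):
    # Effective inhibition increasing with nu_E (gamma' > 0), per the text.
    return 1.0 + 2.0 * nu

def delta_phi(nu):
    # Delta-phi_E(nu_E, m) = phi_E(-gamma + beta*m) - phi_E(-gamma)
    return phi(-gamma(nu) + beta * m) - phi(-gamma(nu))

# With gamma' > 0 and phi convex, delta_phi decreases as nu_E grows.
values = [delta_phi(nu / 10) for nu in range(20)]
assert all(a > b for a, b in zip(values, values[1:]))
```

Because φ is convex, φ(x + βm) − φ(x) grows with x; raising ν_E lowers x = −γ(ν_E), so the increment shrinks, exactly the condition the text derives.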
The slopes of the two nullclines, which can be determined by implicitly differentiating equations 2.5a and 2.5b, are given by

dν_E/dm |_(m-nullcline) = (1/γ′) [β φ_E′(−γ + βm) − 1] / [φ_E′(−γ + βm) − φ_E′(−γ)]   (2.7a)
Computing and Stability in Cortical Networks
dν_E/dm |_(ν_E-nullcline) = β f φ_E′(−γ + βm) / {1 + γ′ [(1 − f) φ_E′(−γ) + f φ_E′(−γ + βm)]}.   (2.7b)
To calculate the slope of the m-nullcline at m = 0, it is necessary to take the m → 0 limit of equation 2.7a; that will ensure that we do not pick up the vertical piece of the m-nullcline, which has infinite slope. Using the fact that βφ_E′(−γ + βm) = 1 at the exchange-of-stability point and Taylor expanding the numerator and denominator in equation 2.7a around m = 0, we see that the slope of the m-nullcline is equal to β/γ′ at m = 0. A simple calculation then tells us that the slope of the ν_E-nullcline at m = 0 is a factor of f/(1 + 1/(γ′φ_E′)) smaller. This ensures that for some range of β, the nullclines will exhibit the set of equilibria shown in Figure 4C. We can now use Figure 4 to understand why the two features listed in the beginning of section 2 (convexity over some range and transition to concavity at firing rates well above 10–20 Hz) are necessary if attractor networks are to exhibit robust, low-firing-rate equilibria in the dynamically stabilized regime. First, if gain functions are not convex over some finite range of m, then they will intersect the 45 degree line with a slope that is less than 1 (disregarding the trivial intersection at m = 0), which eliminates the possibility of operating in the dynamically stabilized regime (see Figure 5). Second, if the transition from a convex to a concave gain function occurs at an output rate that is less than, say, 20 Hz, then it would be possible for the equilibria in Figures 4B to 4D to occur at firing rates less than 20 Hz, which would imply that a stable, low-firing-rate equilibrium can exist on the stable branch of the m-nullcline, and thus not in the dynamically stabilized regime.
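The m → 0 limit of equation 2.7a can also be verified numerically. With an exponential gain (an illustrative choice for which φ_E′ = φ_E) and γ chosen so that βφ_E′(−γ) = 1 at the exchange-of-stability point, the slope should approach β/γ′; a sketch (all parameter values are ours, not the paper's):

```python
import math

beta = 2.0
gamma0 = math.log(beta)   # chosen so beta * phi'(-gamma0) = 1 (exchange point)
gamma_prime = 3.0         # gamma'(nu_E), treated as a fixed parameter here

def phi_prime(x):
    # Exponential gain: phi(x) = exp(x), so phi'(x) = exp(x).
    return math.exp(x)

def m_nullcline_slope(m):
    # Equation 2.7a: d(nu_E)/dm = (1/gamma') *
    #   [beta*phi'(-gamma + beta*m) - 1] / [phi'(-gamma + beta*m) - phi'(-gamma)]
    num = beta * phi_prime(-gamma0 + beta * m) - 1.0
    den = phi_prime(-gamma0 + beta * m) - phi_prime(-gamma0)
    return num / (gamma_prime * den)

# As m -> 0 the slope approaches beta / gamma' = 2/3.
assert abs(m_nullcline_slope(1e-4) - beta / gamma_prime) < 1e-6
```

For this particular gain the ratio is in fact β/γ′ at every m, which makes the small-m check exact rather than approximate.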
Figure 5: Concave gain functions. When the gain functions have no convex region, they intersect the 45 degree line (marked with arrows) with slope less than one, indicating that the network operates on the stable branch of the m-nullcline. Thus, the only way for the network to operate in the dynamically stabilized regime is for the gain functions to be convex for some range of m (see Figure 3).
The above analysis tells us what is possible in principle; to see what is possible in practice, and to determine whether attractor networks can operate at low rates in a reasonably large parameter range, we now turn to simulations.
3 Simulations

To verify the reduced model described above and to determine the range of parameters that supports multiple attractors, we performed simulations with a large network of synaptically coupled, spiking neurons. Network connectivity was based on the firing-rate model given in equation 2.1, with two enhancements to make it more realistic. First, we used random, sparse background connectivity rather than uniform, all-to-all connectivity. Second, we allowed multiple attractors—multiple memories—rather than just a single memory. To implement multiple memories, we included in the connectivity matrix a term proportional to β Σ_(μ=1)^p ξ_i^μ (ξ_j^μ − f), where p is the number of memories and the ξ^μ are uncorrelated, random binary vectors with a fraction f of their components equal to one (Hopfield, 1982; Tsodyks & Feigel'man, 1988; Buhmann, 1989). This term is a natural extension of the one given in equation 2.1a. A detailed description of the network is provided in appendix B. We are interested not only in whether the network can exhibit memories at low rates, but also in whether it can do so without fine-tuning parameters. Ideally, we would like to explore the whole parameter space. However, there are 17 parameters (see Table 1 in appendix B), making this prohibitively time-consuming. Instead, we chose a network with reasonable parameters (e.g., synaptic and membrane time constants and PSP size; Table 1), and then varied two parameters over a broad range. The ones we varied were V_PSP^EE, the excitatory-to-excitatory EPSP (excitatory postsynaptic potential) size, and β, the increase in EPSP size among the neurons contained in each of the p memories. At a range of points in this two-dimensional space, we checked how many memories could be embedded out of the p memories we attempted to embed and whether any memories were spontaneously activated. A run—a simulation at a particular set of parameters—lasted 12 seconds.
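The multi-memory connectivity term described above, proportional to β Σ_μ ξ_i^μ(ξ_j^μ − f) with random binary patterns ξ^μ, can be sketched as follows (array names and sizes are illustrative, not those of the actual simulation, and the normalization factor is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, f, beta = 1000, 5, 0.1, 0.2   # illustrative sizes, not the paper's network

# p uncorrelated random binary patterns; each entry is 1 with probability f
xi = (rng.random((p, N)) < f).astype(float)

# Structured connectivity: beta * sum_mu xi_i^mu * (xi_j^mu - f)
A = beta * xi.T @ (xi - f)

# Neurons sharing a memory are coupled more strongly than the average pair
in_mem = xi[0] == 1
assert A[np.ix_(in_mem, in_mem)].mean() > A.mean()
```

Subtracting f inside the outer product keeps the term's average contribution near zero, so strengthening the within-memory couplings does not systematically inflate the background excitation.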
The first 5 seconds consisted of background activity, in which the neurons fired at low average rate. At 5 seconds, all neurons in one of the memories received a 100 ms barrage of EPSPs. After 2 seconds, the same neurons received a second 100 ms barrage, this time of inhibitory postsynaptic potentials. The network then ran for another 5 seconds at background. A successful run—one in which the desired memory was activated and deactivated at the right times, and no other memories were spontaneously activated—is shown in Figure 6A. Note that the firing rate during a memory is relatively low, about 14 Hz. Thus, for at least one set of parameters, an attractor can exist at low rate.
[Figure 6 panels A–C: firing rate (Hz, 0–18) versus time (s, 0–12).]
Figure 6: Simulations in which a set of memory neurons (black line) was activated at t = 5 seconds (first vertical dashed line) and deactivated at t = 7 seconds (second vertical dashed line). The gray line in all plots is the background firing rate. (A) The activated memory is successfully embedded for the full 2 seconds. (B) The activated memory lasts for only 1.5 seconds, indicating that the attractor is not stable. (C) A spurious memory (dashed line) is spontaneously activated, indicating that the background is not stable. Parameters are given in Table 1, with V_PSP^EE = 0.48 mV, β = 0.21 mV, and f = 0.1. Onset times for the memories were 50 to 100 ms, consistent with what is observed in vivo (Wilson et al., 1993; Tomita, Ohbayashi, Nakahara, Hasegawa, & Miyashita, 1999; Naya et al., 2003).
Figure 7: Summary of simulations. (A) Number of memories successfully embedded out of 50, the number attempted. (B) Average firing rate of memory neurons. The black line in both plots is the stability boundary: below it, the background is stable, meaning no spurious memories were activated; above it, the background is unstable. For EPSPs ranging from 0.2 to 0.5 mV, the network supports multiple attractors and a stable background for values of β varying over a 15% to 25% range. Scale bar is shown to the right. Parameters are given in Table 1, with f = 0.1.
Not all runs were successful, of course. Two things can go wrong: the desired memory might not stay active for the whole 2 seconds (see Figure 6B), or a spurious memory might become spontaneously active (see Figure 6C). To determine whether a particular set of parameters can support multiple memories without any of them becoming unstable, we activated and deactivated, one at a time, each of the p memories. A particular memory was considered to be successfully embedded if it stayed active for the requisite 2 seconds, and a particular set of parameters was considered to be stable if no spurious memories were activated in any of the p runs. The results of simulations with p = 50 are summarized in Figure 7. Figure 7A shows the number of memories that were successfully embedded versus V_PSP^EE and β. The black line in this figure is the stability boundary: the background state is stable in the region below it, meaning no spurious memories were activated (see Figure 6A); it is unstable in the region above it, meaning at least one spurious memory was activated (see Figure 6C). Figure 7B shows the average firing rate of the memory neurons for those memories that were successfully embedded. As predicted by the reduced model, the firing rate is uniformly low, never exceeding 15 Hz in the stable regime. The background rate (not shown) was ~0.1 to 0.2 Hz, lower than what is seen in vivo. For these network parameters, 50 memories was close to capacity: increasing p to 75 drastically reduced the size of the stable regime.
Consistent with the analysis of the reduced model, the parameter regime that supports multiple memories and a stable background is large: for EPSPs ranging from 0.2 to 0.5 mV, the region with multiple (>45) memories extended about 0.04 mV below the stability boundary. If we consider the dynamic range of β to lie between zero and the stability boundary, then this corresponds to a parameter range for β of 15% to 25%. Although the low firing rates of the memory neurons and the large parameter regime that supports multiple memories are consistent with the reduced model introduced above, they are not a direct confirmation of it. What we would like to know is whether the equilibria we observed in the simulations really do lie on the unstable branch of the m-nullcline, as in Figure 4A, or whether they in fact lie on the stable one, as in Figures 4B to 4D. One way to find out is to manipulate the nullclines, and thus the equilibrium firing rates, and then check to see if the firing rates change as predicted by the reduced model. A convenient manipulation is one in which the ν_E-nullcline is raised or lowered while the m-nullcline remains fixed. This is because raising the ν_E-nullcline lowers the average firing rate during the activation of a memory only when the equilibrium is on the unstable branch of the m-nullcline (see Figure 4A and the inset in Figure 8); if the equilibrium is on the stable branch, raising the ν_E-nullcline increases the average firing rate (see Figures 4B–4D). Examining equation 2.5, we see that the ν_E-nullcline can be raised without affecting the m-nullcline by increasing f, the fraction of neurons involved in a memory. The prediction from the reduced model, then, is that the background firing rate during the activation of a memory should decrease as f increases.
This prediction is verified in Figure 8, thus providing strong evidence that the reduced model really does explain the simulations and that the simulations operate in a regime corresponding to the nullclines shown in Figure 4A. Note also that the range of f that supports stable memories is large, ~30%, providing further evidence for the robustness of the network.

4 Discussion

The question we address in this article is: how can cortical networks, which are highly interconnected and dominated by excitatory neurons, carry out computations and yet maintain stability? We answered this question in the context of attractor networks—networks that support multiple stable states, or memories. We chose attractor networks for two reasons. First, they are thought to underlie several computations in the brain, including associative memory (Hopfield, 1982, 1984), working memory (Amit & Brunel, 1997a; Brunel & Wang, 2001), and the vestibulo-ocular reflex (Seung, 1996). Consequently, they are an active area of experimental research (Miyashita & Hayashi, 2000; Aksay et al., 2001; Ojemann et al., 2002; Naya et al., 2003). Second, the stability problem in attractor networks is especially severe, so
[Figure 8: ν_E (Hz, 0.8–1.6) plotted against f (0.08–0.12), with an inset showing the ν_E–m nullclines.]
Figure 8: Average firing rate during the activation of a memory, ν_E, as a function of the fraction of neurons involved in a memory, f, for two parameter sets. Black curve: V_PSP^EE = 0.40 mV, β = 0.18 mV; gray curve: V_PSP^EE = 0.20 mV, β = 0.21 mV; other parameters are given in Table 1. Averages are over all 50 memories; error bars are standard deviations. As indicated in the inset, the prediction of our model is that the firing rate, ν_E, should drop as f increases and drives the ν_E-nullcline up (the upper ν_E-nullcline corresponds to larger f; see the text). This prediction is verified by the decreasing firing rate versus f.
if we can solve this problem for attractor networks, we should be able to solve it for networks implementing other kinds of computations. In previous models of realistic attractor networks, the strategy was to operate on the stable branch of the m-nullcline (Amit & Brunel, 1997a; Wang, 1999; Brunel, 2000; Brunel & Wang, 2001), where firing rates are set essentially by the saturation rates of single neurons (see Figure 4C). Those rates are ~100 Hz, making the experimentally observed 10 to 20 Hz (Fuster & Alexander, 1971; Miyashita & Chang, 1988; Nakamura & Kubota, 1995) hard to reach without fine-tuning parameters. Here we showed, using a reduced, two-variable model and simulations with large networks of spiking neurons, that attractor networks can operate on the unstable branch of the m-nullcline, and thus at low rates. There are two key components to this model. The first is strong coupling between excitatory and inhibitory cells. This coupling, which is essential in networks with strong recurrent excitatory connections like those found in the cortex, means that any upward fluctuation in excitatory firing rate is matched by a larger upward fluctuation in inhibitory rate. (Here "larger" has a technical definition: it means that γ′(ν_E) > 0; see equations 2.4 and 2.6.) This makes the background state effectively inhibitory, a fact that is somewhat counterintuitive but extremely valuable, as it allows the background to act as a stabilizing pool. The second component is strong coupling between
the attractors and the background state. Since the background is effectively inhibitory, this coupling can dynamically stabilize what would otherwise be an unstable state, and thus allow operation on the unstable branch of the m-nullcline. The approach we used, in which the network operates on the unstable branch of the m-nullcline, is fundamentally different from approaches in which networks operate on the stable branch. In particular, the latter require fine-tuning of network parameters to keep neurons from firing near saturation (see Figure 3C); the former does not. Importantly, the fine-tuning problem associated with operation on the stable branch cannot be eliminated simply by including effects like synaptic and spike-frequency adaptation, or by adding inhibition. This is because the fine-tuning problem is associated with the structure of the nullclines in Figure 4 and the fact that neurons saturate at about 100 Hz (even with spike frequency adaptation taken into account), neither of which is changed by these manipulations. To verify the predictions of the reduced model, we performed large-scale simulations with networks of spiking neurons. We found that ~50 overlapping memories could be embedded in a network of 10,000 neurons. As predicted, the firing rates of the neurons in the attractor were low, between 10 and 15 Hz on average—approximately what is seen in vivo. More important, the network was stable over a broad range of parameters: background EPSP amplitude could vary by ~100%, the connection strength among the neurons involved in a memory could vary by ~25%, and the fraction of neurons involved in a memory could vary by ~30%, all without degrading network operation (see Figures 7 and 8).

4.1 Implications for Memory Storage

An important outcome of our model is that it predicts that the fraction of neurons involved in a memory, f, must be above some threshold.
Otherwise, the feedback between the memory and the background would be too weak to stabilize activity at low rates, as can be seen by comparing Figures 4A and 4C. As shown by Gardner (1988), Tsodyks and Feigel'man (1988), Buhmann (1989), and Treves, Skaggs, and Barnes (1996), the maximum number of memories that can be embedded in a network is ~0.2K/|f log f|, where K is the number of connections per neuron. This relation, combined with our result that f must be above some minimum, means that there is a limit to the number of memories that can be stored in any one network. This precludes the possibility of increasing the number of memories by adding neurons to a network while keeping the number of neurons in a memory fixed: this manipulation would decrease f and thus increase the maximum number of allowed memories, but as f becomes too small, it would ultimately make the network unable to operate at low rates. If we take f = 0.1 (Fuster & Jervey, 1982) and K = 5000 (Braitenberg & Schüz, 1991), then the maximum number of memories that can be stored in any strongly interconnected network in the brain is ~4000. This has major implications for how we store memories. Humans, for example, can remember significantly more than 4000 items, even within a particular category (words, for example). Thus, our model implies that memory must be highly distributed: local networks store at most thousands of items each, and during recall, those items must be dynamically linked to form a complete memory.

4.2 Biological Plausibility of Our Model

Our simulations contained garden-variety neurons and synapses—simple point neurons and fast synapses. This allowed us to cleanly separate behavior due to collective interactions from behavior due to single-neuron properties. However, it leaves open the possibility that more realistic neurons and synapses might change our findings, especially the size of the parameter range in which the network is stable. Probably the most important parameters for stability are the excitatory and inhibitory synaptic time constants, with long excitatory time constants being stabilizing and long inhibitory time constants destabilizing (Tegner, Compte, & Wang, 2002). In our networks, we used the same time constant for both (3 ms). More realistic would be a longer inhibitory time constant (Salin & Prince, 1996; Xiang, Huguenard, & Prince, 1998) and a mix of long (NMDA) and short (AMPA) excitatory time constants (Andrasfalvy & Magee, 2001). Fortunately, such a mix has an overall stabilizing effect (Wang, 1999). Thus, we expect that with more realistic synapses, the parameter regime that supports a large number of memories should, in fact, be larger than the one we found. Because slow excitatory synapses reduce fluctuations, they may increase the number of memories that can be embedded. This is important, since the number of memories in our network, 50, was smaller than what we expect of realistic networks, and much smaller than the theoretical maximum of 2000 for our network (derived using the above formula, 0.2K/|f log f|, with K = 2500 and f = 0.1).
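The arithmetic behind these capacity estimates is a one-line computation (the function name is ours):

```python
import math

def max_memories(K, f):
    # p_max ~ 0.2 * K / |f log f| (Gardner 1988; Tsodyks & Feigel'man 1988)
    return 0.2 * K / abs(f * math.log(f))

# Cortical estimate from the text: f = 0.1, K = 5000 connections per neuron
print(round(max_memories(5000, 0.1)))   # on the order of 4000

# Theoretical maximum for the simulated network: K = 2500, f = 0.1
print(round(max_memories(2500, 0.1)))   # on the order of 2000
```

Note that the logarithm here is natural, which is what makes f = 0.1, K = 5000 come out near 4000 rather than at a round 1000.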
The background firing rate in our simulations was ~0.1 to 0.2 Hz, lower than the 1 to 10 Hz observed in vivo (Fuster & Alexander, 1971; Miyashita & Chang, 1988; Nakamura & Kubota, 1995). These low rates can be traced to the fact that our network consisted of two pools of neurons, one near threshold and one well below threshold, with the latter pool firing at a very low rate. The below-threshold pool was needed to ensure that the average gain function, φ_E, was convex at the background firing rate, which is necessary for memories to exist (see Figure 3). With other methods for achieving convex gain functions, such as coherent external input or nonlinearities in the current-to-firing-rate transformation (especially the nonlinearity provided by persistent neurons; Egorov, Hamam, Fransen, Hasselmo, & Alonso, 2002), it is likely that a larger background rate could be supported. A nonlinear current-to-firing-rate transformation is especially helpful: when we included even a slight nonlinearity, the background rate increased to about 1 Hz and the capacity of the network increased to 100 memories (data
not shown). Alternatively, it is possible that our network was consistent with real cortex, and the firing rates in vivo are overestimated. This could happen because experimental recordings are, by necessity, biased toward neurons that fire at detectable rates, whereas we averaged firing rates over all neurons in the network, some of which rarely fired. This possibility is strengthened by the recent finding that, based on energy considerations, average firing rates in the cortex are likely to be less than 1 Hz (Lennie, 2003).

4.3 Summary

These results demonstrate that dynamical stabilization can be used to embed multiple, overlapping, low-firing-rate attractors over a broad range of network parameters. This opens the door to detailed comparison between theory and experiment, and should help resolve the question of whether attractor networks really do exist in the brain. In addition, dynamical stabilization can be used for all kinds of networks, not just those involving attractors, and may turn out to be a general computational principle.

Appendix A: Derivation of Reduced Model Equations

In large neuronal networks in which neurons are firing asynchronously, it is reasonable to model neurons by their firing rates (Amit & Brunel, 1997a). While this approach is unlikely to capture the full temporal network dynamics (Treves, 1993), it is useful for studying equilibria. Moreover, near an equilibrium, we expect the firing rates to obey first-order dynamics. Thus, a firing-rate model that uses linear summation followed by a nonlinearity would be described by the equations

τ_E dν_Ei/dt + ν_Ei = φ_Ei( Σ_j J_ij^EE ν_Ej − Σ_j J_ij^EI ν_Ij )   (A.1a)

τ_I dν_Ii/dt + ν_Ii = φ_Ii( Σ_j J_ij^IE ν_Ej − Σ_j J_ij^II ν_Ij ),   (A.1b)
where ν_Ei and ν_Ii are the firing rates of individual excitatory and inhibitory neurons, respectively, φ_Ei and φ_Ii are their gain functions, and J_ij determines the connection strength from neuron j to neuron i. The behavior of the network described by equation A.1 depends critically on connectivity. If the connectivity is purely random, then the network can be described qualitatively by the Wilson and Cowan model (Wilson & Cowan, 1972). However, if the connectivity among a subpopulation of excitatory neurons is strengthened, as it is in this study, we need to augment the Wilson and Cowan model. Ignoring firing-rate fluctuations, an approximation that is valid qualitatively (Amit & Brunel, 1997a; Latham, 2002), we can construct heuristically such an augmented model by letting

J_ij^EE → N_E^(−1) J^EE + β [N_E f(1 − f)]^(−1) ξ_i (ξ_j − f)   (A.2a)

J_ij^LM → N_M^(−1) J^LM,   LM = EI, IE, II.   (A.2b)
In these expressions, N_E and N_I are the number of excitatory and inhibitory neurons, respectively, and ξ is a random binary vector: ξ_i is 1 with probability f and 0 with probability 1 − f. If we apply the replacements given in equation A.2 to equation A.1, average equation A.1b over index, i, define φ_I(x) ≡ N_I^(−1) Σ_i φ_Ii(x), and let φ_Ei → φ_E, then equation A.1 turns into equation 2.1. Replacing φ_Ei with φ_E was done for convenience only: it simplifies the resulting equations without detracting from the main result. To derive equations 2.3a and 2.3b from equation 2.1, we must average φ_E and ξ_i φ_E over index. To perform these averages, we first use the definition of m, equation 2.2, to simplify the argument of φ_E; this definition implies that φ_E = φ_E(J^EE ν_E − J^EI ν_I + βmξ_i). We then note that for any function F(ξ_i),

N_E^(−1) Σ_i F(ξ_i) = N_E^(−1) Σ_i (1 − ξ_i) F(ξ_i) + N_E^(−1) Σ_i ξ_i F(ξ_i)
                    = N_E^(−1) Σ_i (1 − ξ_i) F(0) + N_E^(−1) Σ_i ξ_i F(1)   (A.3)
                    = (1 − f) F(0) + f F(1).

The second line follows because ξ_i is either 0 or 1; the third (which is strictly valid only in the N_E → ∞ limit) because ξ_i averages to f. Similarly,

(f N_E)^(−1) Σ_i ξ_i F(ξ_i) = (f N_E)^(−1) Σ_i ξ_i F(1) = F(1).   (A.4)
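The averaging identities A.3 and A.4 are easy to confirm numerically for a large random binary vector; a sketch with an arbitrary test function F (all names and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N, f = 200_000, 0.1
xi = (rng.random(N) < f).astype(float)

def F(x):
    # Arbitrary test function; only F(0) and F(1) matter, since xi is binary.
    return 3.0 * x + 2.0

# Equation A.3: N^-1 sum_i F(xi_i) -> (1 - f) F(0) + f F(1)
lhs = F(xi).mean()
rhs = (1 - f) * F(0) + f * F(1)
assert abs(lhs - rhs) < 0.02

# Equation A.4: (f N)^-1 sum_i xi_i F(xi_i) -> F(1)
assert abs((xi * F(xi)).sum() / (f * N) - F(1)) < 0.2
```

The tolerances reflect the finite-N sampling error; both identities become exact only in the N_E → ∞ limit, as the text notes.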
Equations A.3 and A.4, along with the definition of m given in equation 2.2, can be used to derive equations 2.3a and 2.3b.

Appendix B: Simulation Details

The network we simulate consists of N_E excitatory and N_I inhibitory quadratic integrate-and-fire neurons (Ermentrout & Kopell, 1986; Ermentrout, 1996; Gutkin & Ermentrout, 1998; Brunel & Latham, 2003). The membrane potential of the ith neuron, V_i, and the conductance change at neuron i in response to a spike at neuron j, s_ij, evolve according to

τ dV_i/dt = (V_i − V_r)(V_i − V_t)/(V_t − V_r) + V_0i − (V_i − E_E) Σ_(j∈E) J_ij s_ij(t) − (V_i − E_I) Σ_(j∈I) J_ij s_ij(t)   (B.1a)

ds_ij/dt = −s_ij/τ_s + Σ_l δ(t − t_jl).   (B.1b)
Here τ is the cell time constant, V_r and V_t are the nominal resting and threshold voltages, V_0i is the product of the applied current and the membrane resistance (in units of voltage), J_ij is the (dimensionless) connection strength from cell j to cell i, E_E and E_I are the excitatory and inhibitory reversal potentials, respectively, the notation j ∈ M means sum over only those cells of type M, δ(·) is the Dirac δ-function, and t_jl is the lth spike emitted by neuron j. To mimic the heterogeneity seen in cortical neurons, we let V_0i, which determines the distance from resting membrane potential to threshold, have a range of values. This range is captured by the distributions

P_E(V_0) = 0.75 exp[−(V_0 − 1.5)²/2(0.5)²]/[2π(0.5)²]^(1/2) + 0.25 exp[−(V_0 − 3.75)²/2(1.0)²]/[2π(1.0)²]^(1/2)

P_I(V_0) = 1/4.5 if 0.5 < V_0 < 5.0, and 0 otherwise,

where P_E(V_0) and P_I(V_0) are the distributions for excitatory and inhibitory neurons, respectively. In the absence of synaptic drive, the distance between resting membrane potential and threshold is (V_t − V_r)[1 − 4V_0/(V_t − V_r)]^(1/2) (see equation B.1a). Since we use V_t − V_r = 15 mV (see Table 1), V_0 = 3.75 corresponds to a resting membrane potential that is equal to threshold while V_0 = 1.5 corresponds to a resting membrane potential about 12 mV below threshold. Neurons for which V_0 > 3.75 are endogenously active—they fire repeatedly without input. The distribution P_E(V_0) tells us that the excitatory neurons consist of two pools: one with resting membrane potential well below threshold, and one with resting membrane potential near threshold. About half the neurons in the latter pool are above threshold, and thus endogenously active. In realistic networks, this endogenous activity could be due to external input, intrinsic single-neuron properties (Latham, Richmond, Nirenberg, & Nelson, 2000b), or a combination of the two. While endogenously active cells greatly facilitate the existence of a low-firing-rate background state (Latham et al., 2000a), they are not absolutely necessary for it (Hansel & Mato, 2001). A spike is emitted from neuron j whenever V_j reaches +∞, at which point it is reset to −∞. To attain the values ±∞ in our numerical integration, we make the change of variables V = (V_r + V_t)/2 + (V_t − V_r) tan θ and integrate θ instead of V.
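A single quadratic integrate-and-fire neuron with the tan θ change of variables described above can be sketched as follows (no synaptic input; τ, V_r, V_t from Table 1, with our own integration settings):

```python
import math

tau, Vr, Vt = 10.0, -65.0, -50.0   # ms, mV (Table 1)
delta = Vt - Vr                     # 15 mV

def simulate(V0, T=1000.0, dt=0.05):
    """Integrate the QIF neuron in theta, where V = (Vr+Vt)/2 + delta*tan(theta);
    a spike is V -> +inf, i.e. theta crossing pi/2, with reset to -pi/2."""
    theta, spikes = 0.0, 0
    for _ in range(int(T / dt)):
        c, s = math.cos(theta), math.sin(theta)
        # tau dV/dt = (V-Vr)(V-Vt)/delta + V0, rewritten in theta:
        # tau*delta*dtheta/dt = delta*(sin^2 - cos^2/4) + V0*cos^2
        dtheta = (delta * (s * s - 0.25 * c * c) + V0 * c * c) / (tau * delta)
        theta += dt * dtheta
        if theta >= math.pi / 2:
            spikes += 1
            theta = -math.pi / 2
    return spikes

# V0 = 3.75 mV puts rest at threshold; above it the neuron fires endogenously,
# below it theta settles at a fixed point (the resting potential) and is silent
assert simulate(5.0) > 0
assert simulate(1.5) == 0
```

The change of variables keeps θ bounded in [−π/2, π/2], so the "spike at infinity" of the quadratic model becomes an ordinary threshold crossing for the integrator.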
The connectivity matrix, J_ij, consists of two parts. One is random, corresponding to background connectivity before learning; the other is structured, corresponding to memories. In addition, we impose sparse connectivity by making the connection probability less than 1. We thus write

J_ij = g(c_ij (W_ij + A_ij)),

where c_ij is 1 with probability c_connect and zero otherwise, W_ij corresponds to the background connectivity, A_ij corresponds to the structured connectivity, and g is a clipping function chosen so that 0 ≤ g(x) ≤ g_max; for convenience, we choose g(x) to be threshold linear and truncated at g_max: g(x) = max(x, 0) − max(x − g_max, 0). The clipping function ensures that a particular connection strength neither violates Dale's law by falling below zero nor exceeds physiological levels by becoming too large. The random part of the connectivity matrix, W, is chosen to produce realistic postsynaptic potentials. Specifically, we let

W_ij = w_ij V_PSP^LM / V_M,

where neuron i is of type L, neuron j is of type M (L, M = E, I), w_ij is a random variable uniformly distributed between 1 − √3 and 1 + √3 (so that its variance is 1), and

V_M ≡ (E_M − V_r) / {(τ/τ_s) exp[ln(τ/τ_s)/(τ/τ_s − 1)]}.

With this choice for the connection matrix, a neuron of type L will exhibit postsynaptic potentials on the order of V_PSP^LM when a neuron of type M fires, assuming that V_0 = 0 for the neuron of type L (Latham et al., 2000a). The structured part of the connectivity matrix is the natural multi-memory extension of the matrix given in equation 2.1a, except with a slightly different normalization,

A_ij = (β/V_E) [N_E f(1 − f)]^(−1) Σ_(μ=1)^p ξ_i^μ (ξ_j^μ − f),

where p is the number of memories and the ξ^μ are uncorrelated, random binary vectors: ξ_i^μ is 1 with probability f and 0 with probability 1 − f. Only excitatory neurons participate in the memories. The factor 1/V_E is included so that β has units of voltage. The parameters in the model are listed in Table 1. All were fixed throughout the simulation except for β and V_PSP^EE, which were varied to explore robustness (see Figure 7), and f, which was 0.1 except in Figure 8, where it ranged from 0.085 to 0.115.
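The clipping step J_ij = g(c_ij(W_ij + A_ij)) can be sketched as follows; the random and structured parts here are simple stand-ins, not the full construction in the text, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, c_connect, g_max = 500, 0.25, 2.5   # illustrative; g_max from Table 1

def g(x):
    # Threshold-linear, truncated at g_max: keeps every weight in [0, g_max],
    # enforcing Dale's law at the low end and a physiological cap at the top.
    return np.maximum(x, 0.0) - np.maximum(x - g_max, 0.0)

W = rng.uniform(1 - 3**0.5, 1 + 3**0.5, (N, N))     # random background part
A = rng.normal(0.0, 0.5, (N, N))                    # stand-in structured part
c = (rng.random((N, N)) < c_connect).astype(float)  # sparse connectivity mask

J = g(c * (W + A))
assert J.min() >= 0.0 and J.max() <= g_max
```

Writing g as a difference of two rectifications, rather than as an explicit clamp, matches the threshold-linear form given in the text and makes the bounds easy to verify.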
Table 1: Parameters Used in the Simulations.

Excitatory neurons: 8000
Inhibitory neurons: 2000
c_connect (E, I): 0.25, 0.25
τ: 10 ms
τ_s: 3 ms
V_r: −65 mV
V_t: −50 mV
E_E: 0 mV
E_I: −80 mV
V_EPSP (E → E): 0.2–0.8 mV
V_EPSP (E → I): 1.0 mV
V_IPSP (I → E, I → I): −1.5 mV
g_max: 2.5 mV
β: 0.08–0.28 mV
p: 50
f: 0.085–0.115
Time step: 0.5 ms
Acknowledgments

We thank Nicolas Brunel and Alex Pouget for insightful comments on the manuscript. This work was supported by NIMH Grant R01 MH62447.

References

Abbott, L., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Aksay, E., Gamkrelidze, G., Seung, H., Baker, R., & Tank, D. (2001). In vivo intracellular recording and perturbation of persistent activity in a neural integrator. Nature Neurosci., 4, 184–193.
Amit, D., & Brunel, N. (1997a). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.
Amit, D., & Brunel, N. (1997b). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7, 237–252.
Andrasfalvy, B., & Magee, J. (2001). Distance-dependent increase in AMPA receptor number in the dendrites of adult hippocampal CA1 pyramidal neurons. J. Neurosci., 21, 9151–9159.
Bell, G., & Sander, J. (2001). The epidemiology of epilepsy: The size of the problem. Seizure, 10, 306–314.
Braitenberg, V., & Schüz, A. (1991). Anatomy of the cortex. Berlin: Springer-Verlag.
1410
P. Latham and S. Nirenberg
Brunel, N. (2000). Persistent activity and the single-cell frequency-current curve in a cortical network model. Network: Computation in Neural Systems, 11, 261–280.
Brunel, N., & Latham, P. (2003). Firing rate of the noisy quadratic integrate-and-fire neuron. Neural Comput., 15, 2281–2306.
Brunel, N., & Sergi, S. (1998). Firing frequency of leaky integrate-and-fire neurons with synaptic current dynamics. J. Theor. Biol., 195, 87–95.
Brunel, N., & Wang, X. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. J. Comput. Neurosci., 11, 63–85.
Buhmann, J. (1989). Oscillations and low firing rates in associative memory neural networks. Phys. Rev. A, 40, 4145–4148.
Chance, F., Abbott, L., & Reyes, A. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Egorov, A., Hamam, B., Fransen, E., Hasselmo, M., & Alonso, A. (2002). Graded persistent activity in entorhinal cortex neurons. Nature, 420, 173–178.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comput., 8, 979–1001.
Ermentrout, B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Comput., 14, 2057–2110.
Freedman, D., Riesenhuber, M., Poggio, T., & Miller, E. (2001). Categorical representation of visual stimuli in the primate prefrontal cortex. Science, 291, 312–316.
Fuster, J., & Alexander, G. (1971). Neuron activity related to short-term memory. Science, 173, 652–654.
Fuster, J., & Jervey, J. (1982). Neuronal firing in the inferotemporal cortex of the monkey in a visual memory task. J. Neurosci., 2, 361–375.
Gardner, E. (1988). The space of interactions in neural network models. J. Phys. A: Math. Gen., 21, 257–270.
Gerstner, W., & van Hemmen, L. (1993). Coherence and incoherence in a globally coupled ensemble of pulse-emitting units. Phys. Rev. Lett., 71, 312–315.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Phys. Rev. A, 41, 1843–1854.
Gutkin, B., & Ermentrout, B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Comput., 10, 1047–1065.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178.
Hauser, W. (1997). Incidence and prevalence. In J. Engel & T. A. Pedley (Eds.), Epilepsy: A comprehensive textbook (Vol. 1, pp. 47–58). Philadelphia: Lippincott-Raven.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci., 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded responses have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci., 81, 3088–3092.
Latham, P. (2002). Associative memory in realistic neuronal networks. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Latham, P., Richmond, B., Nelson, P., & Nirenberg, S. (2000a). Intrinsic dynamics in neuronal networks. I. Theory. J. Neurophysiol., 83, 808–827.
Latham, P., Richmond, B., Nirenberg, S., & Nelson, P. (2000b). Intrinsic dynamics in neuronal networks. II. Experiment. J. Neurophysiol., 83, 828–835.
Lennie, P. (2003). The cost of cortical computation. Curr. Biol., 13, 493–497.
Marsden, J., & McCracken, M. (1976). The Hopf bifurcation and its applications. Berlin: Springer-Verlag.
Mascaro, M., & Amit, D. (1999). Effective neural response function for collective population states. Network, 10, 351–373.
Matsumura, M., Chen, D., Sawaguchi, T., Kubota, K., & Fetz, E. (1996). Synaptic interactions between primate precentral cortex neurons revealed by spike-triggered averaging of intracellular membrane potentials in vivo. J. Neurosci., 16, 7757–7767.
McCormick, D., Connors, B., Lighthall, J., & Prince, D. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–806.
McIntyre, D., Poulter, M., & Gilby, K. (2002). Kindling: Some old and some new. Epilepsy Res., 50, 79–92.
Miyashita, Y., & Chang, H. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70.
Miyashita, Y., & Hayashi, T. (2000). Neural representation of visual objects: Encoding and top-down activation. Curr. Opin. Neurobiol., 10, 187–194.
Nakamura, K., & Kubota, K. (1995). Mnemonic firing of neurons in the monkey temporal pole during a visual recognition memory task. J. Neurophysiol., 74, 162–178.
Naya, Y., Yoshida, M., & Miyashita, Y. (2003). Forward processing of long-term associative memory in monkey inferotemporal cortex. J. Neurosci., 23, 2861–2871.
Ojemann, G., Schoenfield-McNeill, J., & Corina, D. (2002). Anatomic subdivisions in human temporal cortical neuronal activity related to recent verbal memory. Nature Neurosci., 5, 64–71.
Rinzel, J., & Ermentrout, G. (1989). Models for excitable cells and networks. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 137–173). Cambridge, MA: MIT Press.
Rubin, N., & Sompolinsky, H. (1989). Neural networks with low local firing rates. Europhys. Lett., 10, 465–470.
Salin, P. A., & Prince, D. A. (1996). Spontaneous GABAA receptor-mediated inhibitory currents in adult rat somatosensory cortex. J. Neurophysiol., 75, 1573–1588.
Seung, H. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci., 93, 13339–13344.
Tegner, J., Compte, A., & Wang, X. (2002). The dynamical stability of reverberatory neural circuits. Biol. Cybern., 87, 471–481.
Tiesinga, P., José, J., & Sejnowski, T. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Tomita, H., Ohbayashi, M., Nakahara, K., Hasegawa, I., & Miyashita, Y. (1999). Top-down signal from prefrontal cortex in executive control of memory retrieval. Nature, 401, 699–703.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Treves, A., Skaggs, W., & Barnes, C. (1996). How much of the hippocampus can be explained by functional constraints? Hippocampus, 6, 666–674.
Tsodyks, M., & Feigel'man, M. (1988). The enhanced storage capacity in neural networks with low activity level. Europhys. Lett., 6, 101–105.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Wang, X. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. J. Neurosci., 19, 9587–9603.
Wilson, F., Scalaidhe, S., & Goldman-Rakic, P. (1993). Dissociation of object and spatial processing domains in primate prefrontal cortex. Science, 260, 1955–1958.
Wilson, H., & Cowan, J. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.
Xiang, Z., Huguenard, J., & Prince, D. (1998). GABAA receptor-mediated currents in interneurons and pyramidal cells of rat visual cortex. J. Physiol., 506, 715–730.

Received July 17, 2003; accepted November 21, 2003.
LETTER
Communicated by Peter Latham
Real-Time Computation at the Edge of Chaos in Recurrent Neural Networks
Nils Bertschinger
[email protected] Institute for Theoretical Computer Science, Technische Universitaet Graz, A-8010 Graz, Austria
Thomas Natschläger
[email protected] Software Compentence Center Hagenberg, A-4232 Hagenberg, Austria
Depending on the connectivity, recurrent networks of simple computational units can show very different types of dynamics, ranging from totally ordered to chaotic. We analyze how the type of dynamics (ordered or chaotic) exhibited by randomly connected networks of threshold gates driven by a time-varying input signal depends on the parameters describing the distribution of the connectivity matrix. In particular, we calculate the critical boundary in parameter space where the transition from ordered to chaotic dynamics takes place. Employing a recently developed framework for analyzing real-time computations, we show that only near the critical boundary can such networks perform complex computations on time series. Hence, this result strongly supports conjectures that dynamical systems that are capable of doing complex computational tasks should operate near the edge of chaos, that is, the transition from ordered to chaotic dynamics.
Neural Computation 16, 1413–1436 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

A central issue in the field of computer science is to analyze the computational power of a given system. Unfortunately, there exists no full agreement regarding the meaning of abstract concepts such as computation and computational power. In the following, we mean by a computation simply an algorithm, system, or circuit that assigns outputs to inputs. The computational power of a given system can be assessed by evaluating the complexity and diversity of associations of inputs to outputs that can be implemented on it. Most work regarding the computational power of neural networks is devoted to the analysis of artificially constructed networks (see, e.g., Siu, Roychowdhury, & Kailath, 1995). Much less is known about the computational
capabilities¹ of networks with a biologically more adequate connectivity structure. Often, networks connected randomly (with regard to a particular distribution) are considered a suitable model for biological neural systems (see, e.g., van Vreeswijk & Sompolinsky, 1996). Obviously, not only the dynamics of such random networks but also their computational capabilities depend on the connectivity structure. Hence, it is of particular interest to analyze the relationship between the dynamical behavior of a system and its computational capabilities. It has been proposed that extensive computational capabilities are achieved by systems whose dynamics is neither chaotic nor ordered but somewhere in between order and chaos. This has led to the idea of "computation at the edge of chaos." Early evidence for this hypothesis was reported by Langton (1990) and Packard (1988). Langton based his conclusion on data gathered from a parameterized survey of cellular automaton behavior. Packard used a genetic algorithm to evolve a particular class of complex rules. He found that as evolution proceeded, the population of rules tended to cluster near the critical region (the region between order and chaos) identified by Langton. Similar experiments involving Boolean networks have been conducted by Kauffman (1993). The main focus of his experiments was to determine the transition from ordered to chaotic behavior in the space of a few parameters controlling the network structure. In fact, the results of numerous numerical simulations suggested a sharp transition between the ordered and chaotic regimes. Later, these results were confirmed theoretically by Derrida and others (Derrida & Pomeau, 1986; Derrida & Weisbuch, 1986).
They used ideas from statistical physics to develop an accurate mean-field theory for random Boolean networks (Derrida & Pomeau, 1986; Derrida & Weisbuch, 1986) and for Ising-spin models (networks of threshold elements) (Derrida, 1987), which allows the critical parameters to be determined analytically. Because of the physical background, this theory focused on the autonomous dynamics of the system, that is, its relaxation from an initial state (the input) to some terminal state (the output) without any external influences. However, such "off-line" computations (whose main inspiration comes from their similarity to batch processing by Turing machines) do not adequately describe the input-to-output relation of systems like animals or autonomous robots, which must react in real time to a continuously changing stream of sensory input. Such computations are more adequately described as mappings, or filters, from a time-varying input signal to a time-varying output signal. In this article, we will focus on time-series computations that shift the emphasis toward on-line computations and anytime algorithms (i.e.,
¹ Usually one talks about the computational power only of a dedicated computational model, which we have not yet defined. We prefer the term computational capabilities in combination with a randomly constructed system.
algorithms that output at any time the currently best guess of an appropriate response), and away from off-line computations. (See also Maass, Natschläger, & Markram, 2002.) Our purpose is to investigate how the computational capabilities of randomly connected recurrent neural networks in the domain of real-time processing and the type of dynamics of the network are related to each other. In particular, we will develop answers to the following questions:

• Are there general dynamical properties of input-driven recurrent networks that support a large diversity of complex real-time computations, and if yes,
• Can such recurrent networks be utilized to build arbitrarily complex filters, that is, mappings from input time series to output time series?

A similar study using leaky integrate-and-fire neurons, which was purely based on computer simulation, was conducted in Maass, Natschläger, and Markram (2002).

The rest of the letter is organized as follows. After defining the network model in section 2, we extend the mean-field theory developed in Derrida (1987) to the case of input-driven networks. In particular, we will determine in section 3 where the transition from ordered to chaotic dynamics occurs in the space of a few connectivity parameters. In section 4, we propose a new complexity measure, which can be calculated using the mean-field theory developed in section 3 and which serves as a predictor for the computational capability. Finally, we investigate in section 5 the relationship between network dynamics and the computational capabilities in the time-series domain.

2 Network Dynamics

To analyze real-time computations on time series in recurrent neural networks, we consider input-driven recurrent networks consisting of N threshold gates with states x_i ∈ {−1, +1}, i = 1, …, N. Each node i receives nonzero incoming weights w_ij from exactly K randomly chosen units j ∈ {1, …, N}. Such a nonzero connection weight w_ij is randomly drawn from a Gaussian distribution with zero mean and variance σ². Furthermore, the network is driven by an external input signal u(·), which is applied to each threshold gate. Hence, in summary, the update of the network state x(t) = (x_1(t), …, x_N(t)) is given by

x_i(t) = Θ(Σ_{j=1}^{N} w_ij · x_j(t − 1) + u(t)),   (2.1)

which is applied for all neurons i ∈ {1, …, N} in parallel, and where Θ(h) = +1 if h ≥ 0 and Θ(h) = −1 otherwise.
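The update rule (2.1) can be sketched directly in NumPy. This is an illustrative sketch: the helper names `make_network` and `step` are ours, and the parameters follow the examples used later in the letter (N = 250, K = 4).

```python
import numpy as np

def make_network(N=250, K=4, sigma2=1.0, seed=0):
    """Each unit receives exactly K nonzero incoming weights, drawn
    from a zero-mean Gaussian with variance sigma2."""
    rng = np.random.default_rng(seed)
    W = np.zeros((N, N))
    for i in range(N):
        idx = rng.choice(N, size=K, replace=False)   # K random presynaptic units
        W[i, idx] = rng.normal(0.0, np.sqrt(sigma2), size=K)
    return W

def step(W, x, u):
    """One parallel update, eq. 2.1: x_i(t) = Theta(sum_j w_ij x_j(t-1) + u(t))."""
    return np.where(W @ x + u >= 0, 1, -1)

rng = np.random.default_rng(1)
W = make_network()
x = np.where(rng.random(250) < 0.5, 1, -1)   # random initial state in {-1,+1}^N
x = step(W, x, u=1.0)                        # e.g. u(t) = u_bar + 1 with u_bar = 0
```

Note that the same input value u(t) is broadcast to every gate, as the text specifies.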
In the following, we consider a randomly drawn binary input signal u(·): at each time step, u(t) assumes the value ū + 1 with probability r and the value ū − 1 with probability 1 − r. Note that the input bias ū can also be considered the mean threshold of the nodes. Similar to autonomous systems, these networks can exhibit very different types of dynamics, ranging from completely ordered all the way to chaotic, depending on the connectivity structure. Informally, we will call a network chaotic if arbitrarily small differences in an (initial) network state x(0) are highly amplified and do not vanish. In contrast, a totally ordered network forgets immediately about an (initial) network state x(0), and the current network state x(t) is determined to a large extent by the current input u(t) (see section 3 for a precise definition of order and chaos). The top row of Figure 1 shows typical examples of ordered, critical, and chaotic dynamics, while the lower panel (phase plot) indicates which values of the model or system parameters correspond to a given type of dynamics.² We will concentrate on the three system parameters K, σ², and ū, where K (number of incoming connections per neuron) and σ² (variance of the nonzero weights) determine the connectivity structure, whereas ū describes a statistical property (the mean, or bias) of the input signal u(·). A phase transition occurs if varying such a parameter leads to a qualitative change in the network's behavior, like switching from ordered to chaotic dynamics. An example can be seen in Figure 1, where increasing the variance σ² of the weights leads to chaotic behavior. We refer to the transition from the ordered to the chaotic regime as the critical line.

3 Finding the Critical Line

To define the chaotic and the ordered phase of a discrete dynamical system, Derrida and Pomeau (1986) proposed the following approach: consider two (initial) network states with a certain (normalized) Hamming distance.
These states are mapped to their corresponding following states defined by the network dynamics, and the change in the Hamming distance is observed. If the distance tends to grow, this is a sign of chaos, whereas if the distance decreases, this is a signature of order (Derrida & Pomeau, 1986; Derrida & Weisbuch, 1986; Derrida, 1987). This is closely related to the computation of the Lyapunov exponent in a continuous system (see, e.g., Luque & Sole, 2000), since this approach also emphasizes the idea of considering a system as chaotic if it is very sensitive to differences in its initial conditions. We will apply the same basic approach to define the chaotic and ordered regime for an input-driven network: two (initial) states are mapped to their corresponding following states with the same input in each case, while the change in the Hamming distance is observed. Again, if the distances tend to increase (decrease), this is a sign for chaos (order).

² A Mathematica (www.mathematica.com) notebook that contains the code to reproduce most of the figures in this letter is available on-line at: http://www.igi.tugraz.at/tnatschl/edge-of-chaos.
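The two-trajectory test just described is easy to run empirically. The sketch below (helper name and parameter choices are ours; network construction follows section 2) perturbs a fraction d0 of one state's components, feeds both copies the identical random input sequence, and records the normalized Hamming distance over time.

```python
import numpy as np

def hamming_divergence(sigma2, K=4, N=250, T=40, d0=0.1, seed=0):
    """Empirical Derrida-style test: two copies of one network, same inputs
    (u_bar = 0, r = 0.5), initial states at normalized Hamming distance d0."""
    rng = np.random.default_rng(seed)
    W = np.zeros((N, N))
    for i in range(N):
        idx = rng.choice(N, size=K, replace=False)
        W[i, idx] = rng.normal(0.0, np.sqrt(sigma2), size=K)
    x1 = np.where(rng.random(N) < 0.5, 1, -1)
    x2 = x1.copy()
    flip = rng.choice(N, size=int(d0 * N), replace=False)
    x2[flip] *= -1                                  # initial perturbation
    d = []
    for _ in range(T):
        u = 1.0 if rng.random() < 0.5 else -1.0     # identical input for both
        x1 = np.where(W @ x1 + u >= 0, 1, -1)
        x2 = np.where(W @ x2 + u >= 0, 1, -1)
        d.append(float(np.mean(x1 != x2)))
    return d

# Very small weight variance -> input-dominated, ordered dynamics:
# the distance d(t) should die out.
d_ordered = hamming_divergence(sigma2=0.01)
```

In the ordered regime the recorded distances shrink toward zero; in the chaotic regime they saturate at a nonzero value, mirroring the fixed points discussed below.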
Figure 1: Networks of randomly connected threshold gates can exhibit ordered, critical, and chaotic dynamics (see the text for a definition of order and chaos). In the upper row, examples of the time evolution of the network state x(t) are shown (black: x_i(t) = +1, white: x_i(t) = −1) for three different networks with parameters taken from the ordered, critical, and chaotic regime, respectively. Parameters: K = 4, N = 250, and σ² and ū as indicated in the phase plot at the bottom of the figure.
Following closely the arguments by Derrida (1987), we develop a mean-field theory (i.e., we consider the limit N → ∞), which allows us to analyze the time evolution of the (normalized) Hamming distance d(t) = |{i : x_{1,i}(t) ≠ x_{2,i}(t)}|/N between two network states x_1(t) and x_2(t), where u(·) is the input signal (x_{l,i}(t) denotes the ith component of the network state vector x_l(t), l = 1, 2). As shown in appendix A, the Hamming distance d(t) between two trajectories at time t is related to the Hamming distance d(t+1; u) at time t + 1 (given that u is the input in that time step) through the equation

d(t+1; u) = Σ_{c=0}^{K} C(K, c) · d(t)^c · (1 − d(t))^{K−c} · P_BF(c, u),   (3.1)
where P_BF(c, u) is the probability that exactly c bit-flips out of the K inputs to a unit will cause a different output of that unit in the two trajectories (see appendix A for details).³ In contrast to previous studies (Derrida, 1987; Derrida & Weisbuch, 1986), the Hamming distance d(t+1; u) in equation 3.1 depends on the input u, which leads to the new states at time t + 1. If one assumes a certain distribution of the input signal, one can calculate the expected value of d(t+1; u). Let us consider the case where u = ū + 1 and u = ū − 1 with probability r and 1 − r, respectively. This input distribution leads to the equation

d_fade(t+1) := FADE(d(t)) := r · d(t+1; ū+1) + (1−r) · d(t+1; ū−1),   (3.2)
for the time evolution of the Hamming distance. The map defined by equation 3.2 determines whether a network is in the ordered phase. If for all initial values d(0) ∈ [0, 1] the Hamming distance d_fade(t) converges to zero as t → ∞, we know that any state differences will die out, and the network is in the ordered phase. If, on the other hand, a small difference d(t) > 0 is amplified and never dies out, the network is in the chaotic phase. Whether the Hamming distance will converge to zero can easily be seen in a so-called Derrida plot, where d_fade(t+1) = FADE(d(t)) is plotted versus d(t). A network is in the ordered phase iff d*_fade = 0 is the only stable fixed point d*_fade = FADE(d*_fade) of equation 3.2. See Figure 2A for examples of
Derrida plots.⁴ We also compared the theoretical predictions of equation 3.2 to simulations of networks. The results are shown in Figure 2B. Even though the theory assumes an infinite network size and that the weights are randomly drawn at each time step (the annealing approximation; Derrida & Pomeau, 1986), a good correspondence with simulation results is already obtained for networks containing a few hundred nodes and with the weights randomly drawn prior to the simulation (see appendix A for a more detailed discussion of the annealing approximation). The stability can be checked by analyzing the slope α of the Derrida plot at d(t) = 0. The fixed point d*_fade = 0 is stable if |α| < 1. As shown in appendix B, the slope α is given as

α = ∂d_fade(t+1)/∂d(t) |_{d(t)=0} = K · (r · P_BF(1, ū+1) + (1−r) · P_BF(1, ū−1)).
³ Note that in the appendix, d(t+1; u) and P_BF(c, u) are denoted as d(t+1; u, u) and P_BF(c, u, u), respectively.
⁴ Equation 3.2 was solved numerically to yield the Derrida plots.
Figure 2: (A) Derrida plots. The Derrida plots show the dependence of d_fade(t+1) on d(t). Shown are Derrida plots for networks with parameters ū = 0, σ² = 1, r = 0.5, and three different values of K, as denoted in the figure. For each plot, the numerical value of the stable fixed point d*_fade = FADE(d*_fade), that is, the intersection with the identity function (dashed line), is given. (B) Comparison of the mean-field theory and numerical simulations. For each of the considered values of K, we compare the theoretical evolution of the Hamming distance obtained by iterating equation 3.2 (solid line) to results from numerical simulations (crosses). Simulation results are averages over 50 runs of the same network of N = 250 threshold gates, where in each run the input u(·) was randomly generated with "rate" r = Pr{u(t) = ū + 1}.
Hence, the stability of the fixed point d*_fade = 0 depends only on the probability P_BF(1, ·) that a single changed state component leads to a change of the output. Therefore, the so-called critical line |α| = 1, where the phase transition from ordered to chaotic behavior occurs, is given by

r · P_BF(1, ū+1) + (1−r) · P_BF(1, ū−1) = 1/K.   (3.3)
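P_BF(1, u) can be estimated by a simple Monte Carlo experiment: draw K Gaussian weights and K random presynaptic states, flip one state, and count how often the gate's output changes. The sketch below (helper name is ours) uses the simplifying assumption that presynaptic states are ±1 with equal probability, which holds by symmetry for ū = 0 and r = 0.5; the letter's appendix A treats the general case.

```python
import numpy as np

def p_bf_1(K, sigma2, u, n=200_000, seed=0):
    """Monte Carlo estimate of P_BF(1, u): probability that flipping a single
    one of a unit's K inputs changes its output Theta(sum_j w_j x_j + u).
    Assumes presynaptic states are +/-1 with equal probability."""
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, np.sqrt(sigma2), (n, K))
    x = np.where(rng.random((n, K)) < 0.5, 1, -1)
    h = (w * x).sum(axis=1) + u                 # field before the flip
    h_flip = h - 2.0 * w[:, 0] * x[:, 0]        # field after flipping input 0
    return float(np.mean((h >= 0) != (h_flip >= 0)))

# Left-hand side of criterion (3.3), times K, for r = 0.5 and u_bar = 0:
# alpha > 1 places the network in the chaotic regime.
K, sigma2 = 4, 1.0
alpha = K * (0.5 * p_bf_1(K, sigma2, +1.0) + 0.5 * p_bf_1(K, sigma2, -1.0))
```

For K = 4, σ² = 1, ū = 0 this estimate comes out slightly above 1, consistent with these parameters lying on the chaotic side of the critical line in Figures 1 and 2.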
This criterion is related to criticality criteria that are derived using a single-bit-flip approximation directly (Rohlf & Bornholdt, 2002). The corresponding phase diagram is shown in Figure 3. For fixed values of ū and σ², more incoming connections K tend to make the network more
Figure 3: The critical line. Shown is the critical line for parameters σ² and ū in dependence of K (denoted in the figure). As before, we set r = 0.5. Note that the ordered (chaotic) regime is to the left (right) of the critical line.
chaotic. Note that for K = 1 and K = 2, the network cannot be made chaotic (see appendix B for details). Ordered dynamics can be achieved by means of smaller internal connection weights (smaller σ²), which will increase the impact of the input signal, or by biasing the network toward −1 or +1 by changing the mean ū of the input signal. This second way to achieve ordered dynamics is analogous to using biased Boolean functions in random Boolean networks (Kauffman, 1993). Since the state encoding of the networks is symmetric, using a positive or negative mean ū is equivalent. Usually a higher bias is expected to give more ordered dynamics. However, for high values of K, this is not true for ū near ±1, which is an effect due to the binary input signals used here. Note that the ordered phase can be described using the notion of fading memory (Boyd & Chua, 1985). Informally, a network has fading memory if the current state x(t) depends just on the values of its input u(t′) from some finite time window t′ ∈ [t−T, t] into the past (Maass, Natschläger,
& Markram, 2002).⁵ A slight reformulation of this property (Jaeger, 2002) shows that it is equivalent to the requirement that all state differences vanish (i.e., being in the ordered phase). This property is called the echo state property in Jaeger (2002). Fading memory plays an important role in the "liquid state machine" framework (Maass, Natschläger, & Markram, 2002), called echo state networks in Jaeger (2002), which we will employ (and discuss in more detail) in section 5 to show that networks operating near the critical line support complex computations. Intuitively speaking, in a network with fading memory, a state x(t) is fully determined by a finite history u(t−T), u(t−T+1), …, u(t−1), u(t) of the input u(·).⁶ This would in principle allow an appropriate readout function to deduce the recent input, or any function of it, from the network state. If, on the other hand, the network does not have fading memory (i.e., is in the chaotic regime), a given network state x(t) also contains "spurious" information about the initial conditions, and hence it is hard, or even impossible, to deduce any features of the recent input.

4 Separation as a Predictor for Computational Power

A further dynamical property that is especially important if a network is to be useful for computations on input time series is the separation property (see Maass, Natschläger, & Markram, 2002): any two different input time series that should produce different outputs (with regard to the readout function) drive the recurrent network into two (significantly) different states. From the point of view of the readout function, it is clear that different outputs can be produced only for different network states. Hence, only if different input signals separate the network state (i.e., different inputs result in different states) is it possible for a readout function to respond differently.
Furthermore, it is desirable that the separation (distance between network states) increase with the difference of the input signals. However, a simple separation measurement cannot be directly related to the computational power, since chaotic networks separate even minor differences in the input to a very high degree. This is because any single differing bit of the network state (eventually caused by the differing inputs) results in drastically different trajectories but is not a direct consequence of the different inputs. Therefore, we will propose in this section the so-called
⁵ More formally, a network is said to have the fading memory property if the following holds: for all ε > 0 and all initial conditions x_1(0), x_2(0) and input signals u_1(·), u_2(·), there exist δ > 0 and T ∈ ℕ such that ‖u_1 − u_2‖_input < δ ⇒ ‖x_1(T) − x_2(T)‖_state < ε. Here, x_1(T), x_2(T) denote the states that are obtained by running the network T time steps on initial conditions x_1(0), x_2(0) and input signals u_1(·), u_2(·), respectively. ‖·‖_input and ‖·‖_state are norms that turn the space of input signals and network states into compact vector spaces.
⁶ More precisely, there exists a function E such that x(t) = E(u(t−T), u(t−T+1), …, u(t−1), u(t)) for some finite T ∈ ℕ.
network-mediated separation (NM-separation, for short), which aims to capture those portions of the separation that are due to the differences in the input signals u_1(·) and u_2(·). As before, we measure differences of network states by their Hamming distance. The distance of the input signals u_1(·) and u_2(·) is defined by

b = lim_{T→∞} (1/T) · Σ_{t=0}^{T} (1 − δ(u_1(t), u_2(t))),

which measures the average number of differences in the input signals, where δ(x, y) = 1 if x = y and δ(x, y) = 0 otherwise. b can also be interpreted as the probability that u_1(t) and u_2(t) are different, that is, b = Pr{u_1(t) ≠ u_2(t)}. The mean-field theory we have developed (see appendix A) can easily be extended to calculate the state distance (separation) that results from applying inputs u_1(·) and u_2(·) with a mean distance of b (r denotes the "rate" r = Pr{u_1(t) = ū + 1} of u_1(·)):

d_sep(t+1) = SEP(d(t))   (4.1)
           = b · (r · d(t+1; ū+1, ū−1) + (1−r) · d(t+1; ū−1, ū+1))
           + (1−b) · (r · d(t+1; ū+1, ū+1) + (1−r) · d(t+1; ū−1, ū−1)).   (4.2)
As in section 3, we numerically solved equation 4.2 for the stable fixed point d*_sep = SEP(d*_sep), that is, the separation that results from long runs with inputs u_1(·) and u_2(·) at a distance of b. The part of this separation that is caused by the input distance b, and not by the distance of some initial states, is then given by d*_sep − d*_fade, since d*_fade measures the state distance that is caused by differences in the initial states and remains even after long runs with the same inputs (see section 3). Note that d*_fade is always zero in the ordered phase and nonzero in the chaotic phase. Furthermore, we also have to consider the immediate separation d_inp that is expected from the given input distribution (determined by ū and b). This term can be estimated as

d_inp := b · (2 · q(ū, r) − 1)²,

where q(ū, r) is the fraction of nodes that "copy the input" (see appendix C for details).⁷ The immediate separation d_inp is especially high for input-driven networks (low values of σ² and values of q(ū, r) close to 1.0). Obviously, this immediate separation cannot be utilized by any readout function that needs access to information about the input a few time steps ago, because the network state x(t) is dominated to a very large extent by the current input at time t. More informally, input-driven networks do not have any memory about recent inputs.

⁷ A unit copies the input if it outputs +1 when the input has the value ū + 1 and outputs −1 when the input has the value ū − 1.
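The separation caused by input differences can also be probed empirically. The sketch below (helper name and the tail-averaging choice are ours) drives one network with two input streams that differ with probability b, starting from identical states, and averages the resulting Hamming distance over the second half of the run; this is a rough empirical analogue of d*_sep, not the mean-field quantity itself.

```python
import numpy as np

def empirical_separation(b, sigma2=1.0, K=4, N=250, T=200, seed=0):
    """Drive one network with inputs u1, u2 that differ with probability b
    (u_bar = 0, r = 0.5), from identical initial states; return the average
    tail Hamming distance between the two trajectories."""
    rng = np.random.default_rng(seed)
    W = np.zeros((N, N))
    for i in range(N):
        idx = rng.choice(N, size=K, replace=False)
        W[i, idx] = rng.normal(0.0, np.sqrt(sigma2), size=K)
    x1 = np.where(rng.random(N) < 0.5, 1, -1)
    x2 = x1.copy()
    dists = []
    for _ in range(T):
        u1 = 1.0 if rng.random() < 0.5 else -1.0
        u2 = -u1 if rng.random() < b else u1     # inputs differ with prob. b
        x1 = np.where(W @ x1 + u1 >= 0, 1, -1)
        x2 = np.where(W @ x2 + u2 >= 0, 1, -1)
        dists.append(float(np.mean(x1 != x2)))
    return float(np.mean(dists[T // 2:]))
```

With b = 0 the two runs are identical and the separation is exactly zero; any b > 0 produces a positive state distance, which is the raw quantity that the NM-separation then corrects.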
Real-Time Computation at the Edge of Chaos
Since we want the NM-separation to be a predictor for computational power, we have to correct d*_sep − d*_fade by the immediate separation d_inp in order to take the loss of memory of input-driven networks into account. Hence, a suitable measure of the network-mediated separation due to input differences is given by

NM_sep = d*_sep − d*_fade − b · (2 · q(ū, r) − 1)².  (4.3)
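The quantities entering equation 4.3 can also be estimated directly by simulation rather than from the mean-field map. The sketch below assumes a threshold network with update rule x_i(t+1) = sign(Σ_j w_ij x_j(t) + u(t)), K nonzero incoming weights per unit drawn from N(0, σ²), and inputs u(t) ∈ {ū−1, ū+1}; all parameter values and helper names are illustrative, not taken from the article, and a faithful estimate would average over many networks and runs.

```python
import numpy as np

def step(W, x, u):
    # Threshold-unit update, assuming x_i(t+1) = sign(sum_j w_ij x_j(t) + u(t))
    return np.where(W @ x + u >= 0, 1.0, -1.0)

def run(W, x, u_seq):
    for u in u_seq:
        x = step(W, x, u)
    return x

def hamming(x1, x2):
    return float(np.mean(x1 != x2))

rng = np.random.default_rng(0)
N, K, sigma2, u_bar, r, b, T = 200, 4, 0.5, 0.4, 0.5, 0.1, 300

# K nonzero incoming weights per unit, drawn from N(0, sigma^2)
W = np.zeros((N, N))
for i in range(N):
    W[i, rng.choice(N, size=K, replace=False)] = rng.normal(0.0, np.sqrt(sigma2), K)

bits1 = rng.random(T) < r                            # u1(t) = u_bar + 1 with prob. r
bits2 = np.where(rng.random(T) < b, ~bits1, bits1)   # u2 differs from u1 with prob. b
u1 = np.where(bits1, u_bar + 1, u_bar - 1)
u2 = np.where(bits2, u_bar + 1, u_bar - 1)

x0a = np.where(rng.random(N) < 0.5, 1.0, -1.0)
x0b = np.where(rng.random(N) < 0.5, 1.0, -1.0)

d_sep = hamming(run(W, x0a, u1), run(W, x0a, u2))    # distance due to input distance b
d_fade = hamming(run(W, x0a, u1), run(W, x0b, u1))   # distance due to initial states

# q(u_bar, r): fraction of units that "copy the input" (footnote 7),
# estimated empirically along the trajectory
x, copies = x0a, []
for bit, u in zip(bits1, u1):
    x = step(W, x, u)
    copies.append(np.mean(x == (1.0 if bit else -1.0)))
q = float(np.mean(copies))

nm_sep = d_sep - d_fade - b * (2 * q - 1) ** 2       # equation 4.3
print(f"d_sep={d_sep:.3f} d_fade={d_fade:.3f} q={q:.3f} NM_sep={nm_sep:.3f}")
```

A single run is noisy; the article's Figure 4 values correspond to the stable fixed point of the mean-field recursion, which this Monte Carlo estimate only approximates.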
In Figures 4A and 4B, the uncorrected separation d*_sep − d*_fade and the NM-separation are plotted in dependence on the input distance b for three representative parameter settings (the same as in Figure 1). As expected, both measures increase with b. Note that for the ordered dynamics, the uncorrected separation assumes values similar to (or even higher than) the uncorrected separation for critical parameters (this is because the network is very much input driven; see above). In contrast, the NM-separation, which is corrected by the immediate separation d_inp, is highest for critical parameters. In Figures 4C and 4D, the NM-separation resulting from an input difference of b = 0.1 is shown in dependence on the network parameters ū and σ². It can be clearly seen that the separation peaks at the critical line. Because of the computational importance of the separation property, this also suggests that the computational capabilities of the networks will peak at the onset of chaos, which is confirmed in the next section.

5 Real-Time Computations at the Edge of Chaos

To assess the computational power of a network, we make use of the so-called liquid state machine framework proposed by Maass et al. (2002) and independently by Jaeger (2002), who called it "echo state networks." They put forward the idea that any complex time-series computation can be implemented by a system that consists of two conceptually different parts: a dynamical system with rich dynamics followed by a memoryless readout function (the basic architecture is depicted in Figure 5). This idea is based on the following observation: if one excites a sufficiently complex recurrent circuit with an input signal and looks at a later time at the current internal state of the circuit, then this state is likely to hold a substantial amount of information about recent inputs.
In order to implement a specific task, called the target filter in the following, it is enough that the readout function is able to extract the relevant information from the network state. This amounts to a classical pattern recognition problem, since the temporal dynamics of the input stream has been transformed by the recurrent circuit into a high-dimensional spatial pattern. Following this line of argument, the construction of a liquid state machine that implements an arbitrary filter can be decomposed into two steps: (1) choose a proper general-purpose recurrent network, and (2) train a readout function to map the network state to the desired output.
N. Bertschinger and T. Natschläger
Figure 4: NM-separation assumes high values on the critical line. (A) The uncorrected separation d*_sep − d*_fade is shown for ordered (σ² = 0.1), critical (σ² = 0.5), and chaotic (σ² = 5) dynamics in dependence on the input distance b (K = 4, ū = 0.4, r = 0.5). (B) Same as A but for the NM-separation. (C) The gray-coded image shows the NM-separation in dependence on σ² and ū for K = 4, r = 0.5, and b = 0.1 (white: low values; dark gray: high values). The solid line marks the critical line. (D) Same as C but for K = 8.
This approach is potentially successful if the general-purpose network encodes the relevant features of the input signal in the network state in such a way that the readout function can easily extract them. In the context of the network model considered in this article, we have already investigated in depth two properties that support such a requirement: fading memory (i.e., ordered dynamics) and (network-mediated) separation. However, it is not yet clear whether these properties are enough to ensure that the information can be extracted easily. In fact, up to now, it can be proven only
[Figure 5 schematic: the input u(·) drives the recurrent network; a linear classifier with trainable weights w, w0 maps the network state to the output v(t); the target filter F supplies the training target y(t) = (F u)(t).]
Figure 5: Setup used to train a linear classifier for a given computational task. The input u(·) drives the dynamic reservoir (the randomly connected recurrent network in our case), while a memoryless readout function C, which has access to the full network state x(t), computes the actual output v(t) = C(x(t)). Only the weights w ∈ R^N, w0 ∈ R of the linear classifier are adjusted such that the actual network output v(t) = Θ(w · x(t) + w0) is as close as possible to the target values y(t) = (F u)(t). The signals shown are an example of the performance of a trained network on a one-step-delayed XOR task, that is, y(t) = XOR(u(t−1), u(t−2)).
(Maass et al., 2002) that any time-invariant filter that has fading memory can be approximated with any desired degree of precision if (1) the recurrent network has fading memory, (2) the recurrent network has the separation property, and (3) the readout has the capability to approximate any given continuous function that maps current states to current outputs. Hence, in a worst-case scenario, it may happen that the readout function must be arbitrarily complex to implement the desired target filter. However, we will show that near the critical line, the network encodes the information in such a way that a simple linear classifier C(x(t)) = Θ(w · x(t) + w0) suffices to implement a broad range of complex nonlinear functions.⁸

To assess the computational power in a principled way, networks with different parameters were tested on a delayed three-bit parity task. Here, the desired output signal y_τ(t) is given by PARITY(u(t−τ), u(t−τ−1), u(t−τ−2)) for increasing delays τ ≥ 0. The task is quite complex for the networks considered here since the parity function is not linearly separable and

⁸ In order to train the network for a given target filter F, only the parameters w ∈ R^N, w0 ∈ R of the linear classifier are adjusted such that the actual network output v(t) is as close as possible to the target values y(t) = (F u)(t). The weights w ∈ R^N, w0 ∈ R are determined using standard linear regression in an off-line fashion.
it requires memory. Hence, to achieve good performance, it is necessary that a state x(t) contains information about several input bits u(t′), t′ < t, in a nonlinearly transformed form such that a linear classifier C is sufficient to perform the nonlinear computation of the parity task. In Figure 6A, the mutual information MI(v_τ, y_τ)⁹ measured on a test set between the network output v_τ(·) trained on a delay of τ and the target signal y_τ(·) is shown for increasing delays τ (cf. Natschläger & Maass, forthcoming).¹⁰ Following Jaeger (2002), we define a memory capacity MC = Σ_{τ=0}^{∞} MI(v_τ, y_τ). In Figure 6B, the dependence of MC on the parameters of the randomly drawn networks is shown. Each data point shows the mean over 10 different networks with the given parameters.¹¹ The highest performance is clearly achieved for parameter values close to the critical line, where the phase transition occurs. This has been noted before (Langton, 1990; Packard, 1988). In Packard (1988), for example, cellular automata were evolved to solve one particular task. Even though there is evidence that evolution will find solutions near the critical line, the specific dynamics selected are highly task specific. For this and other reasons, edge-of-chaos interpretations of these results were criticized (Mitchell, Hraber, & Crutchfield, 1993). In contrast, the networks used here are not optimized for any specific tasks; only a linear readout function is trained to extract the task-specific information that is already present in the state of the system. Furthermore, each network is evaluated for many different tasks, such as the 3-bit parity of increasingly delayed input values. Therefore, a network that is specifically designed for a single task will not show good performance in this setup.
⁹ The mutual information MI(v, y) (in bits) between two signals v(·) and y(·) is defined as MI = Σ_{v′} Σ_{y′} p(v′, y′) log₂ [p(v′, y′) / (p(v′) p(y′))], where the sums are over all possible values ({−1, +1} in our case) of v(·) and y(·), p(v′, y′) = Pr{v(t) = v′ ∧ y(t) = y′} is the joint probability, and p(v′) = Pr{v(t) = v′} and p(y′) = Pr{y(t) = y′} are the marginal distributions. Note that all these probabilities can reliably be estimated simply by counting, since v(·) and y(·) are binary signals in our case.

¹⁰ Note that for each delay τ, a separate classifier is trained. For training a single linear classifier, a training set {⟨x_l, y_l⟩}, l = 1 … 9000, where x_l are network states and y_l ∈ {+1, −1} are the target values given by the considered target filter, was constructed as follows. The considered network was run from 10 randomly drawn initial states on randomly drawn input signals of length 5000 and rate r = 0.5. The first 500 states were discarded, and then only the network states of every fifth time step were added with the according target value y_l to the training set. The weights w0 ∈ R, w = ⟨w1, …, wN⟩ ∈ R^N of the linear classifier were then determined by standard linear regression, that is, by minimizing the squared error Σ_l (w0 + w · x_l − y_l)² via the pseudo-inverse solution. The performance of the trained classifier was then measured on a test set. Therefore, the network was run again from 10 initial states for 2000 time steps. The initial states and input signals were again randomly drawn and different from the ones used during training. As before, the first 500 states were discarded.

¹¹ Standard deviations are not shown because they are quite small (< 0.5 bit of MC). The performance differences across the critical line are therefore highly significant.
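The training and evaluation procedure of footnotes 9 and 10 can be sketched end to end. The code below is a scaled-down illustration, not the article's exact setup: it assumes the same sign-threshold network as before, uses a single run instead of 10, smaller sizes, and no subsampling of states; all concrete values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, sigma2, u_bar, r, tau = 150, 4, 0.2, 0.0, 0.5, 1

# Random recurrent network: K nonzero incoming weights per unit from N(0, sigma^2)
W = np.zeros((N, N))
for i in range(N):
    W[i, rng.choice(N, size=K, replace=False)] = rng.normal(0.0, np.sqrt(sigma2), K)

def collect_states(T, discard=100):
    """Drive the network with random +-1 input bits; return states and inputs."""
    bits = np.where(rng.random(T) < r, 1.0, -1.0)
    x = np.where(rng.random(N) < 0.5, 1.0, -1.0)
    states = np.empty((T, N))
    for t in range(T):
        x = np.where(W @ x + u_bar + bits[t] >= 0, 1.0, -1.0)
        states[t] = x
    return states[discard:], bits[discard:]

def parity_target(bits, tau):
    """y(t) = PARITY(u(t-tau), u(t-tau-1), u(t-tau-2)) as a +-1 signal."""
    y = np.full(len(bits), np.nan)
    for t in range(tau + 2, len(bits)):
        y[t] = bits[t - tau] * bits[t - tau - 1] * bits[t - tau - 2]
    return y

def mutual_information(v, y):
    """MI in bits between two binary +-1 signals, estimated by counting (footnote 9)."""
    mi = 0.0
    for v0 in (-1.0, 1.0):
        for y0 in (-1.0, 1.0):
            pj = np.mean((v == v0) & (y == y0))
            if pj > 0:
                mi += pj * np.log2(pj / (np.mean(v == v0) * np.mean(y == y0)))
    return mi

# Train the readout w0 + w.x by linear regression via the pseudo-inverse (footnote 10)
Xtr, btr = collect_states(3000)
ytr = parity_target(btr, tau)
m = ~np.isnan(ytr)
coef = np.linalg.pinv(np.hstack([np.ones((m.sum(), 1)), Xtr[m]])) @ ytr[m]

# Evaluate on fresh input, thresholding the regression output to +-1
Xte, bte = collect_states(2000)
yte = parity_target(bte, tau)
m = ~np.isnan(yte)
v = np.where(np.hstack([np.ones((m.sum(), 1)), Xte[m]]) @ coef >= 0, 1.0, -1.0)
mi = mutual_information(v, yte[m])
print(f"MI(v, y) = {mi:.3f} bits")
```

With these toy parameters the measured MI will generally be well below the values reported in Figure 6; the point of the sketch is the pipeline, not the numbers.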
Figure 6: (A) Memory curve for the three-bit parity task. Shown is the mutual information MI(v, y) between the classifier output v(·) and the target function y(t) = PARITY(u(t−τ), u(t−τ−1), u(t−τ−2)) on a test set (see the text for details) for various delays τ. Parameters: N = 250, K = 4, σ² = 0.2, ū = 0, and r = 0.5. (B) The gray-coded image (an interpolation between the data points marked with open diamonds) shows the performance of trained networks in dependence on the parameters σ² and ū for the same task as in A. Performance is measured as the memory capacity MC = Σ_τ MI(τ), that is, the "area" under the memory curve. Remaining parameters as in A.
These considerations suggest the following hypothesis regarding the computational function of generic recurrent neural circuits: to serve as a general-purpose temporal integrator and simultaneously as a kernel (i.e., a nonlinear projection into a higher-dimensional space) that facilitates the subsequent linear readout of information whenever it is needed. The network size N determines the dimension of the space into which the input signal is nonlinearly transformed by the network and in which the linear classifier operates. It is expected that MC increases with N due to the enhanced discrimination power of the readout function. Hence, it is worthwhile to investigate how the computational power (in terms of MC) scales with the network size N (see Figure 7). Interestingly, the steepest increase of MC with increasing N is observed for critical parameters (almost perfect logarithmic scaling). In contrast, at noncritical parameters, the performance on the delayed 3-bit parity task grows only slightly with increasing network size. To further explore the idea of computation at the edge of chaos, the above simulations were repeated for different values of K and different tasks. The networks were trained on a delayed 1-bit (actually just a delay line), a delayed 3-bit, and a delayed 5-bit parity task, as well as on 50 randomly drawn boolean functions of the last five inputs, that is, y(t) = f(u(t), u(t−1), u(t−2), u(t−3), u(t−4)) for a randomly drawn boolean function
Figure 7: Computational capabilities scale with network size. Shown is the performance on the 3-bit parity task in dependence on the network size N (mean MC ± standard deviation over 10 randomly drawn networks for each parameter setting). The remaining parameters were chosen as K = 4, r = 0.5, ū = 0.4, and σ² = 0.1 (ordered), σ² = 0.5 (critical), and σ² = 5 (chaotic).
f: {−1, +1}⁵ → {−1, +1}. Average results over 10 randomly drawn networks for each parameter setting are shown in Figure 8. In almost all cases, the best performance is achieved close to the critical line. Just for the simplest task considered, a delay line, the best performance is achieved in the chaotic regime. In the case of K = 2, the network can never be made chaotic by increasing σ², but still the performance peaks at some intermediate value of σ². The reasons remain to be uncovered. It is also notable that even though the best performances are usually achieved close to the critical line, the performance varies considerably along it. Especially at high values of K, performance drops significantly if the network is stabilized by using large biases ū. Hence, it is advantageous to use unbiased networks (i.e., ū = 0). This is probably related to the fact that for ū = 0, the entropy of the network states is highest. Normalizing MC by an entropy bound obtained from the mean activity¹² suggests that it indeed

¹² For a given mean activity of a = Pr{x_i(t) = +1}, the maximum entropy of a single unit is H(a) = −a log a − (1 − a) log(1 − a). Note that 0 ≤ H(a) ≤ 1 and H(0.5) = 1. Hence, an upper bound for the entropy of the network states is N · H(a), which reaches
Figure 8: Performance of trained networks for various parameters and different tasks with increasing complexity. The performance (as measured in Figure 6B) is shown in dependence on the parameters σ² and ū for K = 2, 4, 8 (left to right) and the 1-, 3-, and 5-bit parity task as well as for an average over 50 randomly drawn boolean functions (top to bottom).
its maximum for a = 0.5. To compensate for the entropy drop, we considered the scaling MC · N · H(0.5) / (N · H(a)) = MC / H(a).

explains a large amount of the performance drop (not shown). Just for the randomly drawn boolean functions, slightly biased networks show a significantly higher performance, because unbiased networks cannot implement symmetric functions (about half of the randomly drawn functions will be symmetric) since they are antisymmetric devices (compare Jaeger, 2002).

6 Discussion

We developed a mean-field theory for input-driven networks, which allows an accurate determination of the position of the transition line between
ordered and chaotic dynamics with respect to the parameters controlling the network connectivity and input statistics. To our knowledge, this is the first time that networks that receive a time-varying input signal have been considered in this context. Additionally, a clear correlation between the network dynamics and computational power for real-time time-series processing was established. We found that near the critical line, the type of networks considered here supports a broad class of complex computations. These results provide further evidence for the idea of computation at the edge of chaos (Langton, 1990). In the context of the questions raised in the introduction, our findings point to the following answers:

• Dynamics near the critical line are a general property of input-driven dynamical systems that support complex real-time computations.

• Such systems can be used by training a simple readout function to approximate an arbitrarily complex filter.

Furthermore, this suggests that to exhibit sophisticated information-processing capabilities, an adaptive system should stay close to critical dynamics. Since the dynamics is also influenced by the statistics of the input signal (e.g., the mean, as shown here), these results indicate a new role for plasticity rules: stabilize the dynamics at the critical line. Such rules would then implement what could be called dynamical homeostasis, which could be achieved in an unsupervised manner. Examples of such rules that adjust the connectivity K toward the critical line in threshold networks can be found in Bornholdt and Rohlf (2000) and Bornholdt and Rohlf (2003). Such a system then actively adjusts its dynamics toward criticality, so that it shows self-organized criticality (Bak, Tang, & Wiesenfeld, 1987, 1988). The rule described in Bornholdt and Rohlf (2003) is also interesting from a biological point of view since it is based on local Hebb-like correlations.
In the context of artificial neural networks, it is shown in Dauce, Quoy, Cessac, Doyon, and Samuelides (1998) that Hebbian learning drives the dynamics from the chaotic into the ordered regime. Hence, combining task-specific optimization provided by (supervised) learning rules (Hertz, Krogh, & Palmer, 1991) with a more general adjustment of the dynamics would allow a system to gather specific information while self-organizing its dynamics toward criticality in order to be able to react flexibly to other incoming signals. We believe that this viewpoint will be a fruitful avenue for further exploring the mysteries of brain function. By combining self-organized criticality with learning, it might be possible to uncover some of the principles underlying the impressive adaptability exhibited by brains.
Appendix A: Time Evolution of Hamming Distance d(t)

To determine the time evolution of the Hamming distance, we consider a randomly drawn set of weights w_ij and the following scenario:

1. The state x1(t) is mapped via equation 2.1 to the state x1(t+1), where u1 is given as input.

2. The state x2(t) is mapped by the same network to the state x2(t+1), where u2 is given as input.

We will derive an equation that relates the Hamming distance d(t) at time t between the two states x1(t) and x2(t) to the Hamming distance d(t+1; u1, u2) at time t+1 between the two states x1(t+1) and x2(t+1) given the two inputs u1 and u2. Assume that out of the K inputs x_j(t) to a given unit i, there are exactly c (0 ≤ c ≤ K) that are different between the states x1(t) and x2(t), that is, c = |D| with D = {j : x1,j(t) ≠ x2,j(t) ∧ w_ij ≠ 0}. The resulting internal activations at time t+1 of a unit i in the two cases 1 and 2 can be written as (index i and t omitted for brevity)

h1 = Σ_j w_ij x1,j + u1 = a + b + u1
h2 = Σ_j w_ij x2,j + u2 = a − b + u2,

where a = Σ_{j∈E} w_ij x1,j is a sum over the K − c state components that are equal (E = {j : x1,j(t) = x2,j(t) ∧ w_ij ≠ 0}) and b = Σ_{j∈D} w_ij x1,j is the sum over the c differing state components (cf. Derrida, 1987). In the following, we apply the annealed approximation introduced by Derrida and others (Derrida & Pomeau, 1986). The basic assumption of the annealed approximation in our context is that at each time step, the whole weight matrix is randomly generated. This has the effect that correlations between units that would otherwise be built up during previous time steps need not be considered. Although this is a rather radical assumption, the predictions obtained for the quenched case (weight matrix generated once) are quite accurate (see Figure 2). This is due to the fact that if the number of inputs to a unit is rather small compared to the network size (K

U-loss decreases monotonically. This property is closely related with that in the expectation-maximization (EM) algorithm for obtaining the maximum likelihood estimator. (See Amari, 1995, for geometric considerations of the EM algorithm.)

3.4.1 Note on Binary Classification. In this section, we summarize some characteristics of binary classification problems, and in the following, we suppose the initial measure is simplified as ξ(q0) = 0. A classifier h(x) outputs either {+1} or {−1} in the binary case; therefore, h can be regarded as a real-valued function, and the corresponding decision
N. Murata, T. Takenouchi, T. Kanamori, and S. Eguchi
function f is written as

f(x, y) = (y h(x) + 1) / 2.  (3.35)
In this case, the combined decision function satisfies

F_t(x_i, y) − F_t(x_i, y_i) = { 0, if y = y_i;  −y_i Σ_{k=1}^{t} α_k h_k(x_i), if y ≠ y_i.  (3.36)

The U-loss for the empirical model is written as

L_U(q) = (1/n) Σ_{i=1}^{n} Σ_{y∈Y} U(F_{t−1}(x_i, y) − F_{t−1}(x_i, y_i) + α(f_t(x_i, y) − f_t(x_i, y_i)))
       = (1/n) Σ_{i=1}^{n} [ U(0) + U(−y_i (Σ_{k=1}^{t−1} α_k h_k(x_i) + α h_t(x_i))) ].
Therefore, α_t in step 2 is given by

α_t = argmin_α L_U(q) = argmin_α Σ_{i=1}^{n} U(−y_i (Σ_{k=1}^{t−1} α_k h_k(x_i) + α h_t(x_i))).  (3.37)
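Equation 3.37 is a one-dimensional minimization in α, which can be carried out numerically for any U-function. As a sanity check, the sketch below (with hypothetical margins and a ±1-valued weak learner; the data are made up for illustration) compares a grid minimizer for U(z) = exp(z) against the exponential-loss closed form α = ½ log((1−ε)/ε), where ε is the exponentially weighted error; the two agree, and the same grid search handles U(z) = log(1 + exp(z)), which has no closed form.

```python
import numpy as np

def alpha_step(U, margins, yh, grid=np.linspace(-3.0, 3.0, 2001)):
    """Minimize (3.37) on a grid: sum_i U(-(m_i + alpha * y_i h_t(x_i))),
    where m_i = y_i * sum_k alpha_k h_k(x_i) are the current margins."""
    losses = [np.sum(U(-(margins + a * yh))) for a in grid]
    return float(grid[int(np.argmin(losses))])

rng = np.random.default_rng(2)
n = 100
margins = rng.normal(0.5, 1.0, n)      # hypothetical margins y_i * F_{t-1}(x_i)
yh = rng.choice([-1.0, 1.0], n)        # y_i * h_t(x_i) for a +-1 weak learner

# U(z) = exp(z) (AdaBoost): the grid minimizer should match the closed form
a_num = alpha_step(np.exp, margins, yh)
w = np.exp(-margins); w /= w.sum()     # exponential weights, normalized
eps = w[yh < 0].sum()                  # weighted error of h_t
a_closed = 0.5 * np.log((1 - eps) / eps)
print(f"grid: {a_num:.3f}  closed form: {a_closed:.3f}")

# U(z) = log(1 + exp(z)) (the logistic-type U-function): grid search only
a_logit = alpha_step(lambda z: np.log1p(np.exp(z)), margins, yh)
```

The closed form follows by setting the derivative of Σ_i w_i exp(−α y_i h_t(x_i)) to zero when h_t takes values in {±1}.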
In the case of the KL divergence, where U(z) = exp(z), this procedure is equivalent to AdaBoost:

α_t = argmin_α Σ_{i=1}^{n} exp(−y_i (Σ_{k=1}^{t−1} α_k h_k(x_i) + α h_t(x_i)))

(cf. Lebanon & Lafferty, 2001). Also, for U(z) = exp(z), the constraint for the normalized model is described as

Σ_{y∈Y} q(y|x) = Σ_{y∈Y} exp(log q_{t−1}(y|x) + α f_t(x, y) − φ_t(x, α)) = 1.

The normalization term φ_t is solved as

φ_t(x, α) = log ( Σ_{y∈Y} q_{t−1}(y|x) exp(α f_t(x, y)) ),
Information Geometry of U-Boost and Bregman Divergence
and α_t is given by

α_t = argmin_α Σ_{i=1}^{n} log [ Σ_{y∈Y} q_{t−1}(y|x_i) exp(α f_t(x_i, y)) / (q_{t−1}(y_i|x_i) exp(α f_t(x_i, y_i))) ]
    = argmin_α Σ_{i=1}^{n} log [ Σ_{y∈Y} exp(F_{t−1}(x_i, y) + α f_t(x_i, y)) / exp(F_{t−1}(x_i, y_i) + α f_t(x_i, y_i)) ]
    = argmin_α Σ_{i=1}^{n} log Σ_{y∈Y} exp(F_{t−1}(x_i, y) − F_{t−1}(x_i, y_i) + α(f_t(x_i, y) − f_t(x_i, y_i)))
    = argmin_α Σ_{i=1}^{n} log ( 1 + exp(−y_i (Σ_{k=1}^{t−1} α_k h_k(x_i) + α h_t(x_i))) ).

This representation is equivalent to the U-Boost for the empirical model with U(z) = log(1 + exp(z)), and it shows that the normalized U-Boost associated with U(z) = exp(z) conducts the same procedure as LogitBoost (Friedman et al., 2000). Also, we note that the above discussion is an example in which the U-Boost algorithms with the empirical and normalized models with the same U-function result in different solutions.

4 Statistical Properties of U-Boost

4.1 Error Rate Property. One of the important characteristics of the AdaBoost algorithm is the evolution of its weighted error rates; that is, the classifier h_t at step t shows the worst performance, which is equivalent to a random guess, under the distribution at the next step t + 1. Let us define the weight at step t + 1 by
D_{t+1}(i, y) = q_t(y|x_i) / Z_{t+1},  (4.1)

where Z_{t+1} is a normalization constant defined by

Z_{t+1} = Σ_{i=1}^{n} Σ_{y≠y_i} q_t(y|x_i).  (4.2)

Then define the weighted error of the classifier h with its decision function f by

ε_{t+1}(f) = Σ_{i=1}^{n} Σ_{y≠y_i} [(f(x_i, y) − f(x_i, y_i) + 1) / 2] D_{t+1}(i, y),  (4.3)
that is,

2 ε_{t+1}(f) − 1 = Σ_{i=1}^{n} Σ_{y≠y_i} (f(x_i, y) − f(x_i, y_i)) D_{t+1}(i, y) ∝ ⟨q_t, f − f̃⟩_μ̃,  (4.4)

where the meaning of the factor in the weight is interpreted by considering the following four cases:

(f(x_i, y) − f(x_i, y_i) + 1) / 2 = { 0, if y_i ∈ h(x_i) and y ∉ h(x_i);  1/2, if y_i ∈ h(x_i) and y ∈ h(x_i);  1/2, if y_i ∉ h(x_i) and y ∉ h(x_i);  1, if y_i ∉ h(x_i) and y ∈ h(x_i).  (4.5)
Intuitively speaking, the first case is that h_t is correct for y and y_i, the second and third are the cases where h_t is partially correct or partially wrong, and the last is the case where h_t is wrong, because the correct classification rule for x_i is to output {y_i}:

ε_{t+1}(h) = Σ_{1≤i≤n; y_i∉h(x_i); y∈h(x_i)} D_{t+1}(i, y)
           + (1/2) Σ_{1≤i≤n; y_i∈h(x_i); y∈h(x_i); y≠y_i} D_{t+1}(i, y)
           + (1/2) Σ_{1≤i≤n; y_i∉h(x_i); y∉h(x_i); y≠y_i} D_{t+1}(i, y).  (4.6)
Note that f(x_i, y) − f(x_i, y_i) vanishes when y = y_i regardless of whether h(x_i) is correct; hence, we omit q_t(y_i|x_i) from the weighted error, that is, only the weights of labels that are different from the given examples are defined, as in AdaBoost.M2 (Freund & Schapire, 1996). Also note that the correct rate is written as

1 − ε_{t+1}(f) = Σ_{1≤i≤n; y_i∈h(x_i); y∉h(x_i)} D_{t+1}(i, y)
              + (1/2) Σ_{1≤i≤n; y_i∈h(x_i); y∈h(x_i); y≠y_i} D_{t+1}(i, y)
              + (1/2) Σ_{1≤i≤n; y_i∉h(x_i); y∉h(x_i); y≠y_i} D_{t+1}(i, y),  (4.7)
and that the second and third terms on the right-hand side are the same as in the error rate. Then we can prove the following interesting property of the error rate of normalized and empirical U-Boost algorithms:

Theorem 3. The U-Boost algorithm updates the distribution into the least favorable at each step, which means

ε_{t+1}(f_t) = 1/2.  (4.8)
Proof. By differentiating the U-loss for the empirical model with respect to α, we know that α_t satisfies

Σ_{i=1}^{n} Σ_{y∈Y} (f_t(x_i, y) − f_t(x_i, y_i)) u(ξ(q_{t−1}(y|x_i)) + α_t (f_t(x_i, y) − f_t(x_i, y_i))) = 0;  (4.9)

by using the definition

q_t(y|x_i) = u(ξ(q_{t−1}(y|x_i)) + α_t (f_t(x_i, y) − f_t(x_i, y_i))),

the above equation is rewritten as

⟨f_t − f̃_t, q_t⟩_μ̃ = 0.

Similarly, by differentiating the U-loss for the normalized model, α_t satisfies

−Σ_{i=1}^{n} (f_t(x_i, y_i) − φ′_t(x_i, α_t)) + Σ_{i=1}^{n} Σ_{y∈Y} (f_t(x_i, y) − φ′_t(x_i, α_t)) u(ξ(q_{t−1}(y|x_i)) + α_t f_t(x_i, y) − φ_t(x_i, α_t)) = 0.  (4.10)

Using the definition of q_t(y|x_i) and the constraint Σ_{y∈Y} q_t(y|x_i) = 1, the above equation is rewritten as

−Σ_{i=1}^{n} (f_t(x_i, y_i) − φ′_t(x_i, α_t)) + Σ_{i=1}^{n} Σ_{y∈Y} (f_t(x_i, y) − φ′_t(x_i, α_t)) q_t(y|x_i)
  = −⟨f_t − φ′_t, p̃⟩_μ̃ + ⟨f_t − φ′_t, q_t⟩_μ̃ = ⟨f_t − φ′_t, q_t − p̃⟩_μ̃
  = ⟨f_t, q_t − p̃⟩_μ̃   (φ′_t does not depend on y)
  = ⟨f_t, q_t⟩_μ̃ − f̃_t
  = ⟨f_t − f̃_t, q_t⟩_μ̃   (⟨1, q_t⟩_μ̃ = 1)
  = 0.
Therefore, for both models, the optimality condition for q_t is written in terms of the decision function and its conditional expectation as

⟨f_t − f̃_t, q_t⟩_μ̃ = 0.

This condition is interpreted as

Σ_{i=1}^{n} Σ_{y∈Y} (f_t(x_i, y) − f_t(x_i, y_i)) q_t(y|x_i)
  = Σ_{1≤i≤n; y_i∉h_t(x_i); y∈h_t(x_i)} q_t(y|x_i) − Σ_{1≤i≤n; y_i∈h_t(x_i); y∉h_t(x_i)} q_t(y|x_i) = 0,

that is,

Σ_{1≤i≤n; y_i∉h_t(x_i); y∈h_t(x_i)} q_t(y|x_i) = Σ_{1≤i≤n; y_i∈h_t(x_i); y∉h_t(x_i)} q_t(y|x_i).  (4.11)

This means that the accumulated probabilities of correctly classified examples and wrongly classified examples are balanced under q_t. By imposing the above relation on error rate 4.6 and correct rate 4.7, we observe

ε_{t+1}(h_t) = 1 − ε_{t+1}(h_t),

which concludes

ε_{t+1}(h_t) = 1/2.  (4.12)
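For binary AdaBoost this "least favorable" property can be checked numerically in a few lines: after taking the optimal step α_t = ½ log((1−ε_t)/ε_t) and reweighting, the weighted error of h_t is exactly 1/2. The labels, weak-learner outputs, and previous margins below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
y = rng.choice([-1.0, 1.0], n)                 # labels y_i
h = np.where(rng.random(n) < 0.7, y, -y)       # weak learner h_t, roughly 70% correct
m = rng.normal(0.3, 1.0, n)                    # hypothetical margins y_i * F_{t-1}(x_i)

D_t = np.exp(-m); D_t /= D_t.sum()             # weights induced by the current model
eps_t = D_t[h != y].sum()                      # weighted error of h_t
alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)    # optimal step for U(z) = exp(z)

D_next = D_t * np.exp(-alpha_t * y * h)        # reweight with the new margin term
D_next /= D_next.sum()
eps_next = D_next[h != y].sum()
print(round(eps_next, 6))                      # -> 0.5: h_t becomes a "random guess"
```

Algebraically, ε e^{α} / (ε e^{α} + (1−ε) e^{−α}) = 1/2 exactly when e^{2α} = (1−ε)/ε, which is what the optimal α_t enforces.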
In this way, the U-Boost algorithm is constructed by updating the distribution into the least favorable for the present step. A set of decision functions defined by

R(p̃, q_t) = { f | Σ_{y∈Y} Σ_{i=1}^{n} (f(x_i, y) − f(x_i, y_i)) q_t(y|x_i) = 0 }  (4.13)
Figure 12: Relationship between “random guess” and the tangent space of Q.
can be regarded as "random guess" associated with the weighted error. As seen in the above proof, the condition on R is equivalent to

⟨p̃ − q_t, f − b′|_{α=0}⟩_μ̃ = 0,  (4.14)

and hence the set of "random guesses," which is a set of the "worst" classifiers, is included in the tangent space of Q_t at q_t. Therefore, step 1 of the U-Boost claims that the next classifier must be chosen from outside of the random guess set (see Figure 12).

4.2 Consistency and Bayes Rule Equivalence. Using the basic property of the Bregman divergence, we can show the consistency of the U-loss as follows:

Theorem 4. Let p(y|x) be the true conditional distribution and F(x, y) be the minimizer of the U-loss H_U(q) with q = u(F). The classification rule given by F becomes Bayes optimal:

ŷ(x) = argmin_{y∈Y} F(x, y) = argmin_{y∈Y} p(y|x).

Proof. From the property of the Bregman divergence,

D_U(p, q) = 0 ⇔ p(y|x) = q(y|x)  (a.e. x),  (4.15)
and equivalence relation 3.8, the minimizer of the U-cross-entropy H_U(p, q; μ) with respect to q is given by

F(x, y) = argmin_F H_U(p, u_F) = ξ(p(y|x)).

The statement follows from the monotonicity of ξ:

argmin_{y∈Y} p(y|x) = argmin_{y∈Y} ξ(p(y|x)).

In the U-Boost algorithm, F(x, y) is chosen from a class of functions that are a linear combination of f_t(x, y), t = 1, …, T, with some bias function q0 and b. In the case that the true distribution is not always in the considered U-model, which happens in practical cases, U-Boost cannot achieve Bayes optimality, and the closest point in the model is chosen in the sense of U-loss. If the number of functions T is sufficiently large and the functions f_t, t = 1, …, T, are chosen from sufficiently various decision functions, the U-model can approximate the true distribution well. This depends on the richness of the decision functions, which are the basis of the discriminant function F. For a discussion of the richness of linear combinations of simple functions, see, for example, Barron (1993) and Murata (1996).

For the binary case in particular, we can show the following interesting relationship between the U-function and the log-likelihood ratio. In a binary classification problem, as shown in equation 3.37, the objective function of the U-loss is simplified as

∫_X Σ_{y∈{±1}} U(−y F(x)) p(y|x) μ(x) dx,  (4.16)

where q_t and F are linked as q = u(y F(x)), and the discriminant function F gives the classification rule as

y = { +1, if F(x) > 0;  −1, if F(x) < 0. }

Theorem 5. The minimizer of the U-cross-entropy gives the Bayes optimal rule, that is,

{x | F(x) > 0} = { x | log [p(+1|x) / p(−1|x)] > 0 }.
Moreover, if

log [u(z) / u(−z)] = 2z  (4.17)

holds, F coincides with the log-likelihood ratio

F(x) = (1/2) log [p(+1|x) / p(−1|x)].
Proof. By the usual variational arguments, the minimizer of equation 4.16 satisfies

∫_X (p(+1|x) u(−F(x)) − p(−1|x) u(F(x))) Δ(x) μ(x) dx = 0

for any function Δ(x). Hence,

log [p(+1|x) / p(−1|x)] = log [u(F(x)) / u(−F(x))]  (a.e. x),

knowing that for any convex function U,

ρ(z) = log [u(z) / u(−z)]

is monotonically increasing and satisfies ρ(0) = 0. This directly shows the first part of the theorem, and by imposing ρ(z) = 2z, the second part is proved. The last part of the theorem agrees with the result in Eguchi and Copas (2001, 2002). The U-functions for AdaBoost, LogitBoost, and MadaBoost satisfy condition 4.17.

4.3 Asymptotic Covariance. To see the efficiency of the U-model, we investigate the asymptotic covariance of α in this section. Let us consider the generic U-model in the form of

q(y|x) = u( ξ(q0(y|x)) + Σ_{t=1}^{T} α_t f_t(x, y) − b(x, α) ),

parameterized by α = (α_t; t = 1, …, T), and consider the case in which the auxiliary function b does not depend on the data. Let p(y|x) be the true
conditional distribution. The optimal point q* in the U-model is given by

α* = argmin_α H_U(p, q; μ) = argmin_α ∫_X Σ_{y∈Y} { U(ξ(q)) − p ξ(q) } dμ,  (4.18)

and for given n examples, the estimate of α is given by

α̂ = argmin_α L_U(q) = argmin_α H_U(p̃, q; μ̃) = argmin_α Σ_{x∈X̃} Σ_{y∈Y} { U(ξ(q)) − p̃ ξ(q) } dμ̃,  (4.19)

in abstract form. When n is sufficiently large, the covariance of α̂ with respect to all the possible sample sets is given as follows:

Theorem 6.
The asymptotic covariance of α̂ is given by

Cov(α̂) = (1/n) H⁻¹ G H⁻¹ + o(1/n),  (4.20)

where H and G are T × T matrices defined by

H = ∂²/∂α∂αᵀ H_U(p, q*; μ) = ∫_X ∂²/∂α∂αᵀ r(x, α*) μ(x) dx,

G = ∫_X Σ_{y∈Y} [∂/∂α (U(ξ(q*)) − ξ(q*))] [∂/∂αᵀ (U(ξ(q*)) − ξ(q*))] p dμ
  = ∫_X Σ_{y∈Y} (∂/∂α r(x, α*) − f(x, y)) (∂/∂α r(x, α*) − f(x, y))ᵀ p(y|x) μ(x) dx,

where r is the function of x defined by

r(x, α) = Σ_{y∈Y} U( ξ(q0(y|x)) + Σ_{t=1}^{T} α_t f_t(x, y) − b(x, α) ) + b(x, α),

and f = (f1, f2, …, fT)ᵀ. The proof is given by the usual asymptotic arguments (see Murata, Yoshizawa, and Amari, 1994, for example).
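When the model is correctly specified, the sandwich H⁻¹GH⁻¹ collapses to the inverse Fisher information, so Var(α̂) ≈ 1/(nI). The sketch below checks this numerically for a one-parameter logistic model q(y|x) = 1/(1 + exp(−2yαx)) with x uniform on (−1, 1); the model, sample sizes, and seeds are illustrative assumptions, and the maximum likelihood estimate is computed by a plain Newton iteration rather than any boosting procedure.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def mle(x, y, iters=15):
    """Newton's method for the one-parameter model q(y|x) = 1/(1 + exp(-2*y*a*x))."""
    a = 0.0
    for _ in range(iters):
        s = sigma(2 * y * a * x)
        grad = np.sum(2 * y * x * (1 - s))
        hess = -np.sum(4 * x * x * s * (1 - s))
        a = float(np.clip(a - grad / hess, -10.0, 10.0))
    return a

alpha_true, n, reps = 1.0, 400, 1000
est = np.empty(reps)
for k in range(reps):
    x = rng.uniform(-1, 1, n)
    y = np.where(rng.random(n) < sigma(2 * alpha_true * x), 1.0, -1.0)
    est[k] = mle(x, y)

# Fisher information per observation, I = E[4 x^2 sigma(2ax) sigma(-2ax)],
# approximated by averaging over the uniform input density
xs = np.linspace(-1, 1, 20001)
I = np.mean(4 * xs**2 * sigma(2 * alpha_true * xs) * sigma(-2 * alpha_true * xs))

ratio = est.var() * n * I    # should be close to 1 (the Cramer-Rao bound)
print(f"n * I * Var(alpha_hat) = {ratio:.2f}")
```

The ratio approaching 1 illustrates the efficiency statement made below for LogitBoost; for other U-functions the sandwich is generally larger than I⁻¹.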
When the true distribution is included in the U-model, that is, $p = q^*$, the asymptotic covariance for LogitBoost becomes

$$\operatorname{Cov}(\hat\alpha) = \frac{1}{n} I^{-1} + o\!\left(\frac{1}{n}\right),$$

where $I$ is the Fisher information matrix of the logistic model. This means that LogitBoost attains the Cramér-Rao bound asymptotically, that is, LogitBoost is asymptotically efficient. In general, the asymptotic covariance of U-Boost algorithms is inferior to the Cramér-Rao bound; hence, from this point of view, U-Boost is not efficient. However, instead of efficiency, some U-Boost algorithms show robustness, as discussed in the next section. The expected U-loss of $q_t$ estimated with given $n$ examples is asymptotically bounded as

$$E(L_U(q_t)) = E(H_U(\tilde p, q_t; \tilde\mu)) = H_U(p, q^*; \mu) + \frac{1}{2n}\operatorname{tr}(H^{-1}G) + o\!\left(\frac{1}{n}\right), \tag{4.21}$$
where $E$ is the expectation over all the possible sample sets (cf. Murata et al., 1994).

4.4 Robustness of U-Boost. In this section, we examine the robustness of U-Boost for the binary classification problem. First, we consider the robustness condition for U-functions; then we discuss the robustness of the U-Boost algorithm.

4.4.1 Most B-Robust U-Function. Let us consider the statistical model with one parameter $\alpha$,

$$Q^{\mathrm{norm}}(q_0, \{yh(x)\}) = \{\, q \in \mathcal{P} \mid \log q(y\mid x, \alpha) = \log q_0(y\mid x) + \alpha y h(x) - b(x,\alpha),\ \alpha \in \mathbb{R} \,\}, \tag{4.22}$$

where $q_0 \in \mathcal{P}$ and $h(x)$ takes $+1$ or $-1$. Let us define the likelihood ratio of $q_0$ by

$$F(x) = \frac{1}{2}\log\frac{q_0(+1\mid x)}{q_0(-1\mid x)}.$$

We define an estimator of $\alpha$ with the U-function as

$$\alpha_U(q\mu) = \operatorname*{argmin}_{\alpha} \int_X \sum_{y\in Y} q(y\mid x)\, U\!\bigl(-y(F(x) + \alpha h(x))\bigr)\, \mu(x)\, dx,$$

where $q\mu$ is the joint distribution of $x$ and $y$. As in the previous section, when the U-function satisfies condition 4.17, the estimator given by the
U-function satisfies $\alpha_U(q_\alpha\mu) = \alpha$ for any $q_\alpha \in Q^{\mathrm{norm}}(q_0, \{yh(x)\})$. We measure the robustness of the estimator by the gross error sensitivity (Hampel, Rousseeuw, Ronchetti, & Stahel, 1986)

$$\gamma(U, q_0) = \sup_{(\tilde x, \tilde y)} \lim_{\epsilon\to+0} \frac{(\alpha_U(\tilde p_\epsilon) - \alpha_U(q_0\mu))^2}{\epsilon^2}, \tag{4.23}$$

where

$$\tilde p_\epsilon = (1-\epsilon)\, q_0\mu + \epsilon\, \delta(\tilde x, \tilde y)$$

and $\delta(\tilde x, \tilde y)$ is the probability distribution with a point mass at $(\tilde x, \tilde y)$. The gross error sensitivity measures the worst influence that a small amount of contamination can have on the estimate. The estimator that minimizes the gross error sensitivity is called the most B-robust estimator. For the choice of a robust U-function, we show the following theorem.

Theorem 7. The U-function that derives the MadaBoost algorithm minimizes the gross error sensitivity among the U-functions with property 4.17.

Proof. By a brief calculation, the gross error sensitivity of the estimator is written as

$$\gamma(U, q_0) = \sup_{(\tilde x, \tilde y)} u(\tilde y F(\tilde x))^2 \left(\int_X \sum_{y\in Y} u'(-F(x)y)\, q_0(y\mid x)\, \mu(x)\, dx\right)^{-2}. \tag{4.24}$$

Knowing that the conditional probability is written with $F$ as

$$q_0(y\mid x) = \frac{1}{1 + \exp(-2yF(x))} = \frac{u(yF(x))}{u(F(x)) + u(-F(x))},$$

and that $u$ satisfies

$$\frac{u'(z)}{u(z)} + \frac{u'(-z)}{u(-z)} = 2,$$

which follows from the consistency condition on $u$,

$$\log\frac{u(z)}{u(-z)} = 2z,$$
we get

$$u'(-F(x))\, q_0(1\mid x) + u'(F(x))\, q_0(-1\mid x) = \frac{u'(-F(x))\, u(F(x))}{u(F(x)) + u(-F(x))} + \frac{u'(F(x))\, u(-F(x))}{u(F(x)) + u(-F(x))} = \frac{2\, u(F(x))\, u(-F(x))}{u(F(x)) + u(-F(x))} = 2\, u(F(x))\, q_0(-1\mid x).$$

By imposing the above relation, we obtain an expression of the gross error sensitivity involving only the function $u$:

$$\gamma(U, q_0) = \sup_{(\tilde x, \tilde y)} u(\tilde y F(\tilde x))^2 \left(2 \int_X u(F(x))\, q_0(-1\mid x)\, \mu(x)\, dx\right)^{-2}. \tag{4.25}$$

From the above formula, we find that the gross error sensitivity diverges if $u(yF(x))$ is unbounded while the integral of $u(F(x))$ against $q_0(-1\mid x)\mu(x)$ is bounded. Therefore, we focus on the case in which $u$ is bounded. Without loss of generality, we can suppose

$$\sup_{(\tilde x, \tilde y)} u(\tilde y F(\tilde x))^2 = 1,$$

because multiplying the U-function by a positive value does not change the estimator. To minimize the gross error sensitivity, we need to find a U-function that maximizes

$$\int_X u(F(x))\, q_0(-1\mid x)\, \mu(x)\, dx.$$

The pointwise maximization of $u(z)$ under the conditions

$$u(-z) = u(z)\, e^{-2z} \qquad\text{and}\qquad \sup_{(\tilde x, \tilde y)} u(\tilde y F(\tilde x)) = 1$$

leads to

$$u(z) = \begin{cases} 1, & z \ge 0, \\ \exp(2z), & z < 0, \end{cases}$$

and this coincides with the MadaBoost U-function.
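Both properties that single out MadaBoost can be checked numerically. In the sketch below (our illustration), `u_mada` is the function just derived, and `u_ada(z) = exp(z)` is the usual exponential-loss choice; both satisfy the consistency condition $\log u(z)/u(-z) = 2z$, but only the MadaBoost $u$ stays bounded, which is what keeps the numerator of equation 4.25 finite.

```python
import math

def u_ada(z):
    # the usual AdaBoost choice, u(z) = exp(z)
    return math.exp(z)

def u_mada(z):
    # the MadaBoost U-function derived in the text
    return 1.0 if z >= 0 else math.exp(2 * z)

for u in (u_ada, u_mada):
    for z in (-1.5, -0.3, 0.0, 0.7, 2.0):
        # consistency condition: rho(z) = log u(z)/u(-z) = 2z
        assert abs(math.log(u(z) / u(-z)) - 2 * z) < 1e-12

# boundedness: u_mada is capped at 1, while u_ada grows without bound,
# so only MadaBoost keeps the gross error sensitivity finite
print(u_mada(50.0), u_ada(50.0) > 1e20)   # prints: 1.0 True
```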
4.4.2 Robustness of the Boosting Algorithm. Next, we study the robustness of U-Boost. Let us define the normalized U-model $Q^{\mathrm{norm}}(q_0, \mathcal{F})$ with the set $\mathcal{F}$ of binary decision functions,

$$\mathcal{F} = \{yh_1(x), \ldots, yh_T(x)\},$$

where $h_t(x)$ takes $+1$ or $-1$. When the probability $q_0(y\mid x)$ changes to $\tilde p_\epsilon$, the U-Boost estimator is altered from $F(x)$ to

$$F(x) + \alpha_U(\tilde p_\epsilon)\, \tilde h(x),$$

where $\tilde h(x)$ is an element of $\{h_1(x), \ldots, h_T(x)\}$ that depends on $\tilde p_\epsilon$. The probability $q_\epsilon(y\mid x) \in Q^{\mathrm{norm}}(q_0, \mathcal{F})$ specified by the above decision function is written as

$$\log q_\epsilon(y\mid x) = \log q_0(y\mid x) + \alpha_U(\tilde p_\epsilon)\, y\tilde h(x) - b(x, \alpha_U(\tilde p_\epsilon)) \in Q^{\mathrm{norm}}(q_0, \{y\tilde h(x)\}).$$

Let us measure the robustness of U-Boost by the gross error sensitivity of the distribution estimated by the algorithm, which is defined with the KL divergence as

$$\gamma_{\mathrm{boost}}(U, q_0) = \sup_{(\tilde x, \tilde y)} \lim_{\epsilon\to+0} \frac{2}{\epsilon^2} \int_X \sum_{y\in Y} q_0(y\mid x)\bigl(\log q_0(y\mid x) - \log q_\epsilon(y\mid x)\bigr)\, \mu(x)\, dx. \tag{4.26}$$
Intuitively, this measures the sensitivity of the KL divergence between the true and the estimated distributions under a small contamination. For a fixed $(\tilde x, \tilde y)$, the chosen classifier $\tilde h$ does not depend on the value of $\epsilon$, because

$$\langle \tilde p_\epsilon - q_0,\ yh_t(x) - b'_t \rangle = \epsilon\,\langle \delta(\tilde x, \tilde y) - q_0,\ yh_t(x) - b'_t \rangle$$

holds. Therefore, we find that

$$\lim_{\epsilon\to+0} \frac{2}{\epsilon^2} \int_X \sum_{y\in Y} q_0(y\mid x)\bigl(\log q_0(y\mid x) - \log q_\epsilon(y\mid x)\bigr)\, \mu(x)\, dx = \lim_{\epsilon\to+0} I(\tilde h)\cdot \frac{(\alpha_U(\tilde p_\epsilon) - \alpha_U(q_0\mu))^2}{\epsilon^2}$$

by the asymptotic expansion, where $I(\tilde h)$ is the Fisher information of $Q^{\mathrm{norm}}(q_0, y\tilde h(x))$ at $\alpha = 0$. From the property $(yh(x))^2 = 1$, we find that $I(\tilde h)$ does not depend on $h$; let us define $I_0$ as the common value of $I(h_t)$ for $t = 1, \ldots, T$.
From the above argument, we find that

$$\gamma_{\mathrm{boost}}(U, q_0) = \sup_{(\tilde x, \tilde y)} \lim_{\epsilon\to+0} I(\tilde h)\cdot \frac{(\alpha_U(\tilde p_\epsilon) - \alpha_U(q_0\mu))^2}{\epsilon^2} = I_0 \sup_{(\tilde x, \tilde y)} \lim_{\epsilon\to+0} \frac{(\alpha_U(\tilde p_\epsilon) - \alpha_U(q_0\mu))^2}{\epsilon^2} = I_0\, \gamma(U, q_0).$$
Hence, the U-function of MadaBoost also minimizes $\gamma_{\mathrm{boost}}(U, q_0)$. As a consequence, MadaBoost minimizes the influence of outliers around the true distribution.

4.5 Illustrative Examples. In the following numerical experiments, we study the two-dimensional binary classification problem with "stumps" (Friedman et al., 2000). We generate labeled examples subject to a fixed probability, and a few examples are flipped by the contamination, as shown in Figure 13. The detailed setup is

$$x = (x_1, x_2) \in X = [-\pi, \pi] \times [-\pi, \pi], \qquad y \in Y = \{+1, -1\},$$
$$\mu(x):\ \text{uniform on } X, \qquad p(y\mid x) = \frac{1 + \tanh(yF(x))}{2}, \quad\text{where } F(x) = x_2 - 3\sin(x_1),$$

and a% contaminated data are generated according to the following procedure. First, examples are sorted in descending order of $|F(x_i)|$, and from the top 10a% of examples, a% are randomly chosen and flipped without replacement. This means that contamination is avoided around the classification boundary. The plots are made by averaging 50 different runs. In each run, 300 training data are produced, and the classification error rate is calculated with 4000 test data. The training results of three boosting methods (AdaBoost, LogitBoost, and MadaBoost) are compared from the viewpoint of robustness. In Figure 14, we show the evolution of the test error with respect to the number of boosting rounds. All the boosting methods show overfitting as the number of rounds increases. We can see that AdaBoost is quite sensitive to the contaminated data. To show the robustness to the contamination, we plot the differences in test error against the number of boosting rounds. In Figure 15, the difference between 1% contamination and no contamination and the difference between 2% contamination and no contamination are plotted, respectively. In fact, Figures 14b and 14c, for the examples with outliers, show that LogitBoost and MadaBoost stably provide an optimal performance around
Figure 13: Typical examples with contamination.
the boosting number 30 to 50, while AdaBoost fails to attain the optimal performance observed in the noncontaminated case, as seen in Figure 14a. Thus, we conclude that AdaBoost is more sensitive to outliers than MadaBoost with respect to the learning curves.
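The data-generating setup of section 4.5 can be reproduced in a few lines. The sketch below (the function names are ours) samples $x$ uniformly on $[-\pi,\pi]^2$, labels with $p(y\mid x) = (1 + \tanh(yF(x)))/2$ for $F(x) = x_2 - 3\sin x_1$, and then flips a% of the labels chosen among the top 10a% of examples ranked by $|F|$, so that the injected outliers lie far from the decision boundary, as in the text.

```python
import math
import random

random.seed(1)

def F(x1, x2):
    # true decision function: F(x) = x2 - 3*sin(x1)
    return x2 - 3.0 * math.sin(x1)

def make_data(n, contamination=0.0):
    data = []
    for _ in range(n):
        x1 = random.uniform(-math.pi, math.pi)
        x2 = random.uniform(-math.pi, math.pi)
        # label with p(y = +1 | x) = (1 + tanh(F(x))) / 2
        y = 1 if random.random() < (1.0 + math.tanh(F(x1, x2))) / 2.0 else -1
        data.append([x1, x2, y])
    # flip a% of the labels, chosen among the top 10a% examples ranked by
    # |F(x)|, so contamination is avoided near the classification boundary
    k = int(contamination * n)
    if k > 0:
        by_margin = sorted(data, key=lambda d: -abs(F(d[0], d[1])))
        for d in random.sample(by_margin[:10 * k], k):
            d[2] = -d[2]
    return data

train = make_data(300, contamination=0.02)   # 2% outliers, as in Figure 14c
```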
5 Conclusion

In this article, we formulated boosting algorithms as sequential updates of conditional measures, and we introduced a class of boosting algorithms by considering the relation with the Bregman divergence. Using a statistical framework, we discussed the properties of consistency, efficiency, and robustness. Still, detailed studies of some properties, such as the rate of convergence and stopping criteria of boosting, are needed to avoid the overfitting problem. Here we treated only the classification problem, but the formulation can be extended to the case where y lies in a continuous space, as in regression and density estimation. This remains a subject for future work.
Figure 14: Test error of boosting algorithms. (a) Training data are not contaminated. (b) 1% contamination. (c) 2% contamination.
References

Amari, S. (1985). Differential-geometrical methods in statistics. Berlin: Springer-Verlag.
Amari, S. (1995). Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8(9), 1379-1408.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Information Theory, 39(3), 930-945.
Figure 15: Difference of test errors of the original examples and that of the contaminated examples. (a) Difference between 1% contamination and noncontamination. (b) Difference between 2% contamination and noncontamination.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Collins, M., Schapire, R. E., & Singer, Y. (2000). Logistic regression, AdaBoost and Bregman distances. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (pp. 158-169). San Francisco: Morgan Kaufmann.
Domingo, C., & Watanabe, O. (2000). MadaBoost: A modification of AdaBoost. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (pp. 180-189). San Francisco: Morgan Kaufmann.
Eguchi, S., & Copas, J. B. (2001). Recent developments in discriminant analysis from an information geometric point of view. Journal of the Korean Statistical Society, 30, 247-264.
Eguchi, S., & Copas, J. B. (2002). A class of logistic type discriminant functions. Biometrika, 89, 1-22.
Eguchi, S., & Kano, Y. (2001). Robustifying maximum likelihood estimation by psi-divergence (ISM Research Memorandum 802). Tokyo: Institute of Statistical Mathematics.
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256-285.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148-156). San Francisco: Morgan Kaufmann.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28, 337-407.
Hampel, F. R., Rousseeuw, P. J., Ronchetti, E. M., & Stahel, W. A. (1986). Robust statistics. New York: Wiley.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer-Verlag.
Kearns, M., & Valiant, L. G. (1988). Learning Boolean formulae or finite automata is as hard as factoring (Tech. Rep. TR-14-88). Cambridge, MA: Harvard University Aiken Computation Laboratory.
Kivinen, J., & Warmuth, M. K. (1999). Boosting as entropy projection. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory (pp. 134-144). New York: ACM Press.
Lebanon, G., & Lafferty, J. (2001). Boosting and maximum likelihood for exponential models (Tech. Rep. CMU-CS-01-144). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.
McLachlan, G. J. (1992). Discriminant analysis and statistical pattern recognition. New York: Wiley.
Minami, M., & Eguchi, S. (2002). Robust blind source separation by beta-divergence. Neural Computation, 14, 1859-1886.
Murata, N. (1996). An integral representation with ridge functions and approximation bounds of three-layered network. Neural Networks, 9(6), 947-956.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion: Determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5(6), 865-872.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5, 197-227.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651-1686.
Takenouchi, T., & Eguchi, S. (2004). Robustifying AdaBoost by adding the naive error rate. Neural Computation, 16(4), 767-787.
Vapnik, V. (1995). The nature of statistical learning theory.
Berlin: Springer-Verlag. Received December 3, 2003; accepted January 8, 2004.
LETTER
Communicated by Joachim Diederich
A New Approach to the Extraction of ANN Rules and to Their Generalization Capacity Through GP

Juan R. Rabuñal
[email protected]
Julián Dorado
[email protected]
Alejandro Pazos
[email protected]
Javier Pereira
[email protected]
Daniel Rivero
[email protected]
Department of Information and Communications Technologies, University of A Coruña, Facultade de Informática, Campus Elviña s/n, 15192 A Coruña, Spain
Various techniques for the extraction of ANN rules have been used, but most of them have focused on certain types of networks and their training. There are very few methods that treat ANN rule extraction as independent of the network's architecture, training, and internal distribution of weights, connections, and activation functions. This article proposes a methodology for the extraction of ANN rules, regardless of their architecture, based on genetic programming. The strategy aims at achieving the generalization capacity that is characteristic of ANNs by means of symbolic rules that are understandable to human beings.

1 Introduction

Artificial neural networks (ANNs) are easily implemented and used, and they present a number of features that make them ideal for problem solving in various fields. Nevertheless, many developers prefer not to use them because they perceive them as black boxes: systems that produce certain outputs from certain inputs without explaining why or how. Researchers from a wide range of fields feel that ANNs are not trustworthy. In medicine, for instance, a computational system that is designed to provide medical diagnoses should not only be correct but also be able to justify how and why the diagnoses were reached. Expert systems (ES) are typically successful because they provide a reason for their solutions or answers. The aim of this letter is to find a way of explaining how ANNs work and to translate their information into a symbolic format that maintains their major
Neural Computation 16, 1483-1523 (2004) © 2004 Massachusetts Institute of Technology
characteristic: the fact that they learn from examples. We shall propose a system that is conditioned by the following requirements (Tickle, Andrews, Golea, & Diederich, 1998):

- It must function with any type of ANN architecture. It must be compatible with all kinds of neural networks, including those that are highly interconnected and recurrent.
- It must function with any type of ANN training. Most algorithms focus on a given ANN training process for rule extraction and are not valid for other types of training.
- It must be correct. Many rule extraction mechanisms generate only approximations to the functioning of the ANN, whereas the rules should be as precise as possible.
- It must have a high level of expression. The rule language represents the extracted symbolic knowledge.

2 State of the Art

2.1 Genetic Programming. Genetic programming (GP) evolved from the so-called genetic algorithms (GAs). During the first conferences on GAs, at the International Conference on Genetic Algorithms (ICGA), two papers appeared that are generally considered the predecessors of GP (Cramer, 1985; Fujiki, 1987); however, back in 1958 and 1959, Friedberg had already initiated his work on the automatic creation of programs in machine language (Friedberg, 1958; Friedberg, Dunham, & North, 1959). But it was John R. Koza who coined the term that was also the title of his book, Genetic Programming (Koza, 1992). This book formally established the basis of today's work on GP.

GP is based on the evolution of a population in which each individual represents a solution to a problem that needs to be solved. In our particular problem, these solutions will be rules that explain the behavior of ANNs, and, as we will see, they will take the shape of arithmetic or logical rules. The evolutionary process takes place after an initial population of random solutions is created; this initial population is called generation 0. The evolution proceeds by selecting the best individuals (although the worst ones also have a small probability of being selected) and combining them in order to create new solutions. With these new solutions, a new population is created, replacing the old one. The process is based on selection, mutation, and crossover operators. After a few generations, the population is expected to have reached a solution that is able to solve the problem.

In GP, the codification of solutions is shaped like a tree: the user specifies which terminals (leaves) and functions can be used by the algorithm. Complex expressions can be produced either mathematically (i.e., involving
arithmetical operators) or logically (involving Boolean or relational operators). Several directions or branches have been developed within GP. With regard to knowledge discovery (KD), fuzzy rules (Fayyad, Piatetsky-Shapiro, Smyth, & Uthurusamy, 1996; Bonarini, 1996), which derive from the combination of fuzzy logic and rule-based systems, are among the most promising developments.

2.2 Extraction of ANN Rules. Andrews (Andrews, Diederich, & Tickle, 1995; Andrews et al., 1996) identifies three techniques for the extraction of rules:

1. The decompositional technique, which refers to extraction at the level of each ANN neuron, especially the neurons in the hidden layers and the output layer. Rules are extracted from each neuron and from its relation with the other neurons; as a result, the methods based on this technique depend completely on the architecture of the ANN and have limited applicability.

2. The pedagogical technique, which treats the ANN like a black box, where the analysis takes place through the neurons of the hidden layers and by means of network inputs, from which the relevant rules are extracted. This means that the system observes only the relations between the inputs and the outputs of the ANN. The main purpose of this technique is to obtain the function that is computed by the ANN.

3. The eclectic technique, which uses the ANN's architecture and the input-output pairs as a complement to a symbolic training algorithm. This technique combines elements from the two previous ones.

Towell and Shavlik (1994) apply the first technique by using the connections between neurons as rules, based on simple transformations. This limits the extraction to networks with a certain multilayer architecture and with few processing elements (Andrews et al., 1995). Another example of this "decompositional" technique is given by Andrews and Geva (1994), who present the CEBP (constrained error backpropagation) algorithm. This algorithm constitutes a special class of ANNs, built with local functions that behave similarly to radial basis functions. Whereas the basic underlying motive for rule extraction in CEBP is decompositional, in this case it is possible to exploit the characteristic properties of local functions to clearly identify dominant inputs (and, hence, rules).

Thrun (1995) has developed the most important approach of the second technique, the so-called validity interval analysis (VI-A). His algorithm uses linear programming (SIMPLEX) by applying value intervals to the activations of each neuron in each layer. The system extracts "possibly correct" rules by expanding those intervals forward and backward through the ANN. Since it uses linear programming, this algorithm may present an
1486
J. Rabunal, ˜ J. Dorado, A. Pazos, J. Pereira, and D. Rivero
exponential complexity; it may also need an unacceptable amount of time to reach a solution when confronted with a large number of processing elements. Other approaches that use the second technique are the RULENEG (Pop, Hayward, & Diederich, 1994) and TREPAN (Craven, 1996) algorithms. The first is based on a simple sequential change of the truth value of the inputs: if this affects the output, rule modification occurs. But rule discovery techniques that focus only on training data lose the generalization capacity that is typical of ANNs. The TREPAN algorithm, proposed by Craven (Craven, 1996; Craven & Shavlik, 1996), introduces the concepts of representation language and extraction strategy. The representation language is the language used by the extraction method; the languages used by various extraction methods are inference rules (if-then), M-of-N rules, fuzzy rules, and decision trees. The extraction strategy is the method used to represent a trained ANN in the representation language: it explains how the method explores a space of possible characteristics. When the rules are extracted, they must describe the ANN's behavior. The TREPAN algorithm is similar to conventional decision tree algorithms such as CART and C4.5, which learn directly from a training set. These algorithms build their decision trees by recursively partitioning the input space. Each internal node in such a tree represents a splitting criterion that partitions a part of the input space, and each leaf represents a predicted class. The TREPAN algorithm converts a neural network into a decision tree of the M-of-N type.

Decompositional and pedagogical rule extraction techniques can be combined. DEDEC (Tickle, Orlowski, & Diederich, 1996), for instance, uses the trained ANN to create examples from which the underlying rules can be extracted. However, an important difference is that it extracts additional information from the trained ANN by subjecting the resulting weight vectors to further analysis (that is, a partial decompositional approach). This information is then used to direct the strategy for the generation of a (minimal) set of examples for the learning phase. It also uses an efficient algorithm for the rule extraction phase. Other rule discovery techniques (Chalup, Hayward, & Diederich, 1998; Visser, Tickle, Hayward, & Andrews, 1996; Tickle et al., 1998) are based on several of the approaches already mentioned.

Several authors have done research on the equivalence between ANNs and fuzzy rules (Jang & Sun, 1992; Buckley, Hayashi, & Czogala, 1993; Benítez, Castro, & Requena, 1997). Most results establish that equivalence through a process of successive approximations. Apart from being purely theoretical solutions, they require a large number of rules in order to approximate the ANN's functioning (Benítez et al., 1997). The work of Jang and Sun (1992), on the other hand, provides an equivalence between radial ANNs and fuzzy systems that requires a finite number of rules or neurons. It remains, however, limited to ANNs of the RBF type.

The methods for logical rule extraction developed by Duch (Duch, Adamczak, & Grabczewski, 2001) are based on constrained multilayer perceptron backpropagation networks (the MLP2LN method) and their constructive version, C-MLP2LN. MLP2LN is based on multilayer perceptrons that are first trained on the data and subsequently simplified to obtain a skeleton network with 0 or ±1 weights. It uses the same modification of the error function as the C-MLP2LN method, but the algorithm is not constructive, and more than one hidden layer may be used. C-MLP2LN is based on a constructive multilayer perceptron algorithm, adding one new neuron per class and enforcing logical interpretation by a modification of the error function, which ensures a smooth transition to 0, +1, or -1 weights. After this transition, only a few relevant inputs are usually left. Dominant rules are recovered from the analysis of the first input weights, interpreted as 0 = irrelevant feature, +1 = positive evidence, and -1 = negative evidence.

GAs have recently been used to search for and discover ANN rules, because of the advantages of evolutionary methods in the analysis of complex systems. Keedwell (Keedwell, Narayanan, & Savic, 2000), as a follower of the pedagogical technique, uses a GA in which chromosomes are multicondition rules based on value intervals or ranges applied to the ANN inputs. These values are obtained from training patterns. The work of Wong and Leung (2000) applies GP to the extraction of database knowledge and presents LOGENPRO (logic grammar based genetic programming). This first approach shows the advantages of GP as an extraction technique in knowledge discovery in databases (KDDB). More recently, Engelbrecht, Rouwhorst, and Schoeman (2001) applied the GP technique and decision trees in order to extract database knowledge by designing an algorithm called BGP (building-block approach to genetic programming).

One of the most remarkable aspects of each of these discovery processes is the optimization of the rules obtained from the analysis of the ANN. It should be noted that the discovered rules may have redundant conditions, many specific rules may be contained in general rules, and many of the discovered logic expressions could be simplified by writing them differently. Therefore, rule optimization consists of the simplification and elaboration of symbolic operations with the rules. Depending on the discovery method and the type of rules obtained, various optimization techniques can be applied. They can be classified into two main groups: a posteriori methods and implied optimization methods. The first group usually consists of a syntactic analysis algorithm that is applied to the discovered rules in order to simplify them. For instance, Duch et al. (2001) used Prolog as a programming language to carry out the postprocessing of the obtained rules. The results are optimal linguistic variables that serve to find simplified rules that use those variables. Implied optimization methods are techniques used in rule discovery algorithms that intrinsically cause the algorithm to produce increasingly improved rules.
3 Development Premises

The most important aspect in the design of an extraction algorithm is its modus operandi. As mentioned before, extraction techniques can be classified into three main groups: decompositional, pedagogical, and eclectic. The algorithm's design must aim to satisfy the four features of Tickle et al. (1998): independence from the ANN's architecture, independence from the ANN's training algorithm, correctness, and a high level of expression. According to the first two characteristics, the system must be able to deal with the ANN as a black box structure, which means that it must be an algorithm of general application. According to the last two features, the algorithm must be able to work with symbols that represent the knowledge expressed in the shape of rules; the algorithm needs to use symbolic models.

This letter applies GP as the algorithm that builds a syntactic tree that reflects, while imitating the functioning of an ANN, a set of rules. We use symbolic regression applied to input-output patterns, which are sequences of inputs applied to an ANN and of the outputs produced by it. The result is a set of rules and formulas that imitate the ANN's behavior. This technique is pedagogical because the ANN is treated like a black box: it is not necessary to know the ANN's internal functioning. However, a rule extraction algorithm that is able to work with black box structures should also be able to implement some kind of mechanism that allows the a priori incorporation of knowledge about that black box, which would considerably reduce the search space of the system's rules. Such structures are known as "gray boxes." The use of GP makes this possible, because it allows us to control and determine the number and type of terminal and nonterminal elements that take part in the search process. If we know, for example, that the ANN carries out classification tasks, we can limit the type of terminal nodes to Boolean nodes, thus avoiding floating-point values. The advantage is enormous, because we eliminate a priori all the possible combinations of mathematical operations. We shall therefore apply the eclectic technique rather than the pedagogical one.

3.1 ANN Generalization Capacity. Another fundamental aspect of the design of an extraction mechanism is that it should be expandable in order to capture the most characteristic and successful feature of the ANN: its generalization capacity. We must extract the knowledge acquired by the ANN not only from the training patterns but also from test cases that were not presented during the learning process. This means that we need to design an algorithm for the creation of new test cases to be applied to the ANN, allowing the analysis of its behavior in the face of new situations and of its functional limitations. This aspect is of particular relevance to the applicability of ANNs in fields with high practical risk, such as the monitoring of a nuclear power plant or the diagnosis of certain medical
conditions. This aspect has not been clearly addressed in the scientific literature on ANNs and rule extraction until now.

4 Description of the System

We will carry out a process of ANN rule extraction based on input-output patterns. The inputs are applied to the ANN, and the resulting outputs form the basis of the "inputs (training set) - outputs (ANN)" patterns. The inductive algorithm (GP) will use these patterns in order to quantify the accuracy achieved by the obtained rules, through the pairs "outputs (ANN) - outputs (rules)." Once the ANNs have been designed and trained, the same training and test values can be used to generate a second data pattern that will search for the rules acquired by the ANN during the training process. Figure 1 shows a general chart of the process.

A very interesting aspect of this process is the fact that we can incorporate the knowledge extracted from the expert into the extraction algorithm as a priori or innate knowledge. The insertion of knowledge into a neural network has been known since the KBANN algorithm (Towell & Shavlik, 1994), which works by translating a domain theory, consisting of a set of propositional rules, directly into the neural network. In this article, however, knowledge insertion is different. The knowledge about the problem and the ANN can be reflected, if we have some data about the ANN's structure, the
Figure 1: Rule extraction process of the ANN.
J. Rabuñal, J. Dorado, A. Pazos, J. Pereira, and D. Rivero
Figure 2: Rule extraction process.
types of data it will treat, or the kind of problems it will solve, by choosing the operators and the types of data the algorithm is going to work with. The inductive algorithm's search space is therefore more or less reduced, depending on the amount of knowledge we possess about the ANN and the kind of problem it is dealing with. Figure 2 shows the extraction process diagram that was followed in the development of the rule extraction system.

4.1 Extraction of the Generalization Capacity. With regard to the extraction of the generalization capacity, we have designed an algorithm (the algorithm for the creation of new patterns) that allows us to create new test cases. Through these cases, we can extract rules and expressions from those parts of the ANN's internal functioning that have not been explored by the learning patterns. We are not certain whether the behavior of the ANN matches the behavior of the reality outside; we can only hope so. The only purpose of the new test cases is to mimic the ANN model more closely. Once good accuracy is obtained, the algorithm for creating new patterns is applied, generating a series of new input patterns that are applied to the ANN, giving us the relevant outputs. These new "new inputs—outputs ANN"
patterns are added to the training patterns, and the rule extraction process is renewed. It should be noted that in order to continue with the extraction process, the first step is to reassess each individual of the population in order to calculate its new fitness. Once this has been achieved, the discovery process can continue. In order to avoid an unlimited number of test cases, the number of new test cases is bounded by an elimination rule for test cases: if there is a new test case where the ANN's outputs coincide with those produced by the best combination of obtained rules, that test case is eliminated (the obtained rules are assumed to have acquired the knowledge represented by that test case). The result is an iterative process: a number of generations of the rule extraction algorithm are simulated, new test cases are generated, and those test cases whose result coincides with the ANN's result are eliminated. Once all the redundant test cases are eliminated, new cases are created in order to fill in the gaps left by the deleted ones. Each new case is compared with the existing ones in order to determine whether it is already present in the test cases. If so, it is eliminated, a new case is created, and the process starts all over, running the extraction algorithm for N generations and repeating once again the elimination and creation of test cases. This process ends when the user determines that the obtained rules are the appropriate ones. Figure 3 shows a general chart of the process. The following parameters are involved in the algorithm for the creation of new patterns:

• Percentage of new patterns: The user must indicate the number of new patterns as a percentage. If the number of training cases is 100, for instance, and the user indicates 25% of new patterns, this means that 125 cases are used for the rule extraction (100 training cases and 25 new cases).
These new cases are based on the learning patterns that were used for the ANN training, and they are checked to be different from the existing ones.

• Probability of change: The process of creating new test cases focuses on generating new sequences from the training data. A sequence of inputs is chosen at random from the training file. For each input, either the same value is kept or a new one is generated at random, following a certain probability. This technique is similar to the EC mutation process. The main advantage of this procedure is that by modifying only isolated values of the input sequences of the training file, the newly created example cases are less likely to be very different (in the search space) from the training cases, and there will be an increased production of new representative sequences and a decreased generation of cases that are inconsistent with the problem.

A more formal specification of the algorithm for the creation of new patterns is the following, where S is the pattern set used in the knowledge
Figure 3: Generalization rule extraction process.
process, and Pc the probability of change:

    insert the set "inputs_training_set---outputs_ANN" in S
    repeat
        while (not full S) do
            choose a pattern from the training set
            change its values with a probability of Pc
            if (new pattern is not present in S) then
                evaluate the new pattern with the best rule obtained
                if (output of ANN is different from output of the rule) then
                    insert the new pattern in S
                end if
            end if
        end while
        run the discovery algorithm N generations
        for (each pattern in S) do
            if (output of ANN == output of the rule) then
                eliminate this pattern from S
            end if
        end for
    until (rules are good enough)
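The inner pattern-creation step above can be sketched as runnable code. In this hypothetical sketch, the trained ANN and the current best rule are stood in by simple Python callables; in the real system the ANN is the trained black-box network and the rule is the best GP individual.

```python
import random

def create_new_patterns(train_inputs, ann, best_rule, n_new, pc, rng):
    """Collect n_new mutated cases on which the best rule disagrees with
    the ANN; cases the rule already reproduces are discarded."""
    s = []
    while len(s) < n_new:
        base = rng.choice(train_inputs)
        # Mutate each input value with probability pc (the EC-style change).
        case = tuple(rng.uniform(0.0, 1.0) if rng.random() < pc else v
                     for v in base)
        if case in s:
            continue                        # already present: create another
        if best_rule(case) != ann(case):    # informative: rule still wrong here
            s.append(case)
    return [(case, ann(case)) for case in s]

ann = lambda x: x[0] + x[1] > 1.0           # stand-in for the trained ANN
best_rule = lambda x: x[0] > 0.5            # stand-in for the best rule so far
train = [(0.2, 0.9), (0.8, 0.1), (0.6, 0.6)]
new_patterns = create_new_patterns(train, ann, best_rule,
                                   n_new=3, pc=0.3, rng=random.Random(1))
```

Each returned pair is a "new inputs—outputs ANN" pattern that is added to the training set for the next rule extraction round.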
4.2 Rule Optimization. The technique used in our GP implementation for rule optimization is called depth penalty (parsimony parameter). Conceptually, when the adaptation level of an individual of the GP (a tree) is evaluated, its adaptation level decreases to a certain extent if the tree possesses a large number of terminal and nonterminal nodes. An individual with a bad fitness but few nodes may still be rated better than an individual with a good fitness but plenty of nodes. This is a way of dynamically favoring the existence of simple individuals. Therefore, if we are looking for rules (syntactic trees), we are also intrinsically stimulating the appearance of simple (optimized) rules.

5 Parameter Adjustment

The next step in rule extraction is the numeric calculation of the parameters involved in the algorithm. Experimentation is used to adjust the GP-related parameters. The implemented parameters are shown in Table 1.

Table 1: GP Parameters.

    Parameter                           Options
    Population creation algorithm       Full / Partial / Intermediate
    Selection algorithm                 Tournament / Roulette / Stochastic remainder /
                                        Stochastic universal / Deterministic sample
    Mutation algorithm                  Subtree / Punctual
    Elitist strategy                    Yes-No
    Use of nondestructive crossover     Yes-No
    Crossover rate                      0-100%
    Mutation rate                       0-100%
    Nonterminal selection probability   0-100%
    Population size                     Number of individuals
    Parsimony level                     Penalty value
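The fidelity-based fitness and the depth penalty of section 4.2 can be combined into a single score. The following is a hedged sketch, not the authors' implementation: the fraction of patterns on which a rule matches the ANN output, minus a small cost per tree node; trees are represented as nested tuples (operator, child, ...), and all names are illustrative.

```python
def count_nodes(tree):
    """Count terminal and nonterminal nodes of a nested-tuple GP tree."""
    if isinstance(tree, tuple):                     # nonterminal node
        return 1 + sum(count_nodes(c) for c in tree[1:])
    return 1                                        # terminal node

def fitness(rule_fn, tree, patterns, parsimony=0.001):
    """Fidelity to the ANN outputs, penalized by tree size (higher is better)."""
    fidelity = sum(rule_fn(x) == y for x, y in patterns) / len(patterns)
    return fidelity - parsimony * count_nodes(tree)

# Invented "inputs -- ANN output" patterns and two trees with equal behavior.
patterns = [((0.2, 0.9), True), ((0.8, 0.1), False), ((0.3, 0.7), True)]
small_tree = (">", "X2", 0.5)                                      # 3 nodes
big_tree = ("AND", (">", "X2", 0.5), ("OR", "X1", ("NOT", "X1")))  # 8 nodes

rule = lambda x: x[1] > 0.5        # stand-in behavior shared by both trees
f_small = fitness(rule, small_tree, patterns)
f_big = fitness(rule, big_tree, patterns)
```

With equal fidelity, the smaller tree scores higher, which is exactly how the parsimony parameter favors simple rules.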
When specifying terminal and nonterminal operators, we must establish a typology: each node has a type, and the nonterminal nodes demand a particular type from their offspring (Montana, 1995). This tactic guarantees that the generated trees will satisfy the grammar specified by the user. Also, both sets of specified operators must fulfill two requirements, closure and sufficiency: it must be possible to build correct trees with the specified operators and to express the solution to the problem (the expression we are searching for) by means of those operators. The possibility of assigning a type to each node takes advantage of the domain knowledge about the type of the output function and the type of each input. By using node typing, we allow GP to create and combine only expressions that are grammatically correct. This means that GP will not allow, for example, arithmetical or relational expressions with boolean variables, or logical operations with continuous variables. The structure of the expressions developed by GP will follow the syntactic rules provided by the types of the nodes and therefore will contain some knowledge of the domain about the types of inputs and outputs and the operations allowed. An empirical analysis must be carried out in order to adjust the ideal parameters, trying combinations so as to adjust the various values progressively until the best results are achieved. The following parameter adjustment is related to the terminal and nonterminal nodes that will take part in the algorithm's execution. Our previous knowledge about the ANN, together with the types of problems for which it was designed, will be of consequence here. The parameters shown in Table 2 must be selected. The table shows a wide variety of possible configurations. Logical and relational operators are included in order to make boolean expressions
Table 2: Parameters of the Terminal and Nonterminal Nodes in the GP Algorithm.

    Parameter               Options
    Logical operators       AND, OR, NOT
    Relational operators    <, >, =
    Arithmetic functions    +, −, ×, % (protected division)
    Decision functions      If-then-else over boolean or real values
    Constants               Random values within a range
    Input variables         Problem inputs (X1, X2, ...)
    Nature of the outputs   Boolean or real

... X3) OR (36.335 < X2 < 50.811)) AND (((15.26 < X1 < 18.042) OR ((8.4 < X1 < 13.165) OR (36.335 < X2 < 49.797))) AND ((X3 > X1) OR (16.932 > X1)))

These rules correspond to the following confusion matrix:

    P = ( 50    0 )
        (  0  100 )

The following rules are obtained from the original (normalized) patterns that correspond to the classification of Iris virginica and produce a success
Table 4: Summary of Rule Extraction Results for the Iris Data Set.

    Method       Accuracy   Reference
    Our method   99.33%
    ReFuNN       95.7       Kasabov (1996)
    C-MLP2LN     98.0       Duch et al. (2001)
    SSV          98.0       Duch et al. (2001)
    ANN          98.67      Martínez and Goddard (2001)
    Grobian      100.0      Browne, Düntsch, and Gediga (1998)
    GA+NN        100.0      Jagielska, Matthews, and Whitfort (1996)
    NEFCLASS     96.7       Nauck, Nauck, and Kruse (1996)
    FuNe-I       96.0       Halgamuge and Glesner (1994)
rate of 99.33%:

    (((X1 > X2) OR (X2 > 0.7182)) AND ((X2 > X4) OR (((50.991 < X2 < 52.785) OR (X4 > 71.321)) OR (X1 > X3))))

These rules correspond to the following confusion matrix:

    P = ( 50   1 )
        (  0  99 )

The error of the latter expression produces, as a system output, the classification of both Iris versicolor and Iris virginica (a "false-true-true" output), which is an invalid output and can be eliminated. The system can thus detect its own mistakes, namely when no classification is produced or when more than one is produced. In this case, the pattern is counted as an error (because it cannot be classified), and the final success rate is 99.33%. Table 4 draws an accuracy comparison with other extraction techniques. In order to carry out a detailed analysis of the behavior of the obtained rules, we have built an analysis file that contains all the possible input values for the four variables, at regular intervals. The intervals are 7 mm for X1, 2.5 mm for X2, 4.4 mm for X3, and 7.9 mm for X4; the analysis file contains 14,641 examples, and we can analyze the whole possible range of classifications carried out by the rules (see Figure 6). If we group the three classes into the same chart (see Figure 7) and compare them to those from the training file, we can see that the rule extraction system gathers the input values that correspond to each classification and isolates them from those values that correspond to the other classifications. The intersection areas are those where the system produces incorrect output values.

7.1.2 ANN Rules Extraction. Martínez and Goddard (2001) have proved that a maximum success of 98.67% (two errors) is achieved with six neurons
Figure 6: Distribution of the three classes. (A) Iris setosa. (B) Iris versicolor. (C) Iris virginica.
Figure 7: Distribution of the three classes and those proceeding from the training set.
in the hidden layer. The system put forward by Rabuñal (1999) decreases the number of hidden neurons (five), and by using hyperbolic tangent activation functions and a threshold function of 0.5 in the output neurons, it also reaches a success rate of 98.67%. In the Iris setosa cases, no error is obtained, whereas in the Iris versicolor and Iris virginica cases, two errors are made, but they are not detected because the ANN produces a valid classification (only one true output) that is nevertheless erroneous. Figure 8 shows the obtained architecture. By applying the rule discovery system with the previously mentioned parameters, we obtain the following rules:

• The rule obtained from the inputs and outputs of the ANN that corresponds to the classification of Iris setosa produces a fidelity of 100% correct answers:

    (X1 < 7.79)

• The rule obtained from the inputs and outputs of the ANN that corresponds to the classification of Iris versicolor produces a fidelity of 100% correct answers:

    ((7.23 < X1 < 13.29) OR (((X3 > X2) OR ((X4 > 60.379) AND (X2 > X1))) AND (36.6804 < X2 < 50.149)))

• The rule obtained from the inputs and outputs of the ANN that corresponds to the classification of Iris virginica produces a fidelity of 100%
Figure 8: Obtained ANN.
correct answers:

    (((X1 > X3) AND (X1 > X2)) OR (((13.742 < X1) AND ((24.1868 > X3) OR (50.225 < X2))) AND (46.8303 < X2)))

The analysis of the results shows the distributions carried out by the ANN and by the obtained rules through the use of the analysis file (see Figure 9). Since we obtained a 100% success rate on the three distributions from the ANN, the resulting expressions show a 100% fidelity to the ANN behavior. The figures of these distributions are very similar to those of Figure 6: the first two distributions are almost identical to those of Figure 6, and the third distribution is similar in the cases of the training set. These results obtained with the analysis file are the fidelity values of the rules, because the outputs of the analysis file are the outputs produced by the ANN with the inputs of the analysis file.

7.1.3 ANN Generalization. Figures 10A through 10C show the distribution of the three species that result from the application of the analysis file to the trained ANN. If we compare the distributions produced by the ANN in Figures 10A through 10C to those obtained by the rule extraction system with the initial data (see Figure 6) and the one obtained from the rules from the ANN (see Figure 9), there may seem to be little similarity. This is due to the fact that the rules provided by the extraction system have very geometrical shapes (since they use rules of the if-then-else type over ranges). As a consequence, the obtained rules provide a perfect fit in the areas that
Figure 9: Distribution of the three classes produced by the rules. (A) Iris setosa. (B) Iris versicolor. (C) Iris virginica.
correspond to training examples but a very bad accuracy with regard to the ANN's global behavior. If we combine the rule extraction mechanism with the algorithm for new pattern creation, the values with the best behavior in the algorithm are between 20% and 35% of new patterns and between 20% and 25% of change probability. With regard to the new-test-case percentage parameter, it should be noted that if this percentage is too high (over 60%), the extraction algorithm does not improve the obtained rules, because when the new values are generated, the problem is diverted to an extent that makes the algorithm look for contradictory rules, passing continuously from one kind of rule to another. The idea underlying the whole process is to take the training files as the core, since they concentrate an accumulation of representative values for each classified value. For example, when we used a value of 75% of new cases for this parameter, the learning patterns were divided into 80% of cases representing Iris versicolor, 12% Iris setosa, and 8% Iris virginica. If the rules always classify Iris versicolor with a success level of 80%, improving this local minimum is a complicated task. With regard to the change probability parameter, something similar occurs: if the probability is high, the new values of the parameters depend too much on randomness. Therefore, the classifications that represent the bigger areas in the search space are strongly favored, and the rules obtained from the extraction process focus on those classifications. The dynamic creation method (i.e., using the algorithm for creating new patterns) obtained a dynamic success of 96.27% for Iris setosa, 89.17% for Iris versicolor, and 93.13% for Iris virginica. The success is dynamic because it depends on the new test cases that are generated dynamically, and it is measured at the moment when the process was stopped.
These values are the fidelity values obtained in the dynamic training. Obviously, these fitness values do not correspond to the real ones, but they offer a representative idea of how the process develops with the training set and the extended test cases. The analysis file is applied to the obtained rules in order to obtain the real fitness values. With regard to the rules obtained from only the training patterns, a real fidelity of 63.34% is achieved for Iris setosa, while 56.97% corresponds to Iris versicolor and 57.23% to Iris virginica. The analysis file has been built using as outputs the outputs of the ANN, so the accuracy obtained with the rules on this file is the fidelity of these rules. The normalized rule obtained for the classification of Iris setosa with the algorithm for the creation of new patterns produces a real success (fidelity) of 95.45% on the analysis file:

    (((−0.1941) > (X2 − X3)) AND ((X3 − X2) > (0.6381 × X1)))
The normalized rule obtained for the classification of Iris versicolor produces a fidelity of 88.81% on the analysis file:

    (((X4 % 0.8032) > X3) AND (((X1 + (X2 % 0.7875)) > X3) AND (X2 < (((X4 % 0.7875) − (0.4966 × X1)) % 0.8896))))
The normalized rule obtained for the classification of Iris virginica produces a fidelity of 77.88% on the analysis file:

    ((X4 < ((X1 − (−(0.4925 × (X2 × 0.5280)))) − (X4 − X2))) AND ((X3 < X2) OR (X1 > (X3 − X2))))
These three rules are normalized, and their variables take only values between 0 and 1. The distribution of the rules obtained (not normalized) can be seen in Figures 10D through 10F. The distributions obtained from the ANN with GP and dynamic creation of patterns (see Figures 10D to 10F) are very similar to the actual figures of the ANN (see Figures 10A to 10C). These figures are very similar because of the high success rate that was obtained in the extraction of knowledge from the ANN. This leads us to believe that the final behavior of the rules is very similar to the one produced by the ANN and that the algorithm has performed well. Table 5 draws a comparative chart of the fidelities achieved through the exclusive use of the learning patterns and of those achieved through the use of the algorithm for the creation of new patterns. It shows the accuracy values produced by the rules obtained with both methods during the analysis of the file patterns (14,641 examples). The fidelity obtained with the algorithm for the creation of new patterns is better than the fidelity obtained without this dynamic creation. This is due to the fact that with this algorithm, new patterns not present in the training set are used for training the rules, whereas without dynamic creation no training is made on those inputs outside the training set. Therefore, without dynamic creation, the rules perform poorly on those patterns different from the ones present in the training set (but present in the analysis file).

7.2 Wisconsin Breast Cancer Data. The Wisconsin cancer data set is a small medical database from UCI (Merz & Murphy, 2002). It contains 699 entries, with 458 benign cases (65.5%) and 241 malignant cases (34.5%). Each instance has 9 attributes with integer values in the range 1-10, and they are normalized into the 0-1 range (we may assume that they are continuous values):

    (X1) Clump thickness            1-10
    (X2) Uniformity of cell size    1-10
    (X3) Uniformity of cell shape   1-10
    (X4) Marginal adhesion          1-10
Figure 10: Distribution of the three classes produced by the ANN: (A) Iris setosa, (B) Iris versicolor, (C) Iris virginica. Distribution of the three classes produced by the rules obtained with the algorithm for the creation of new patterns: (D) Iris setosa, (E) Iris versicolor, (F) Iris virginica.

Table 5: Fidelities Achieved Through Exclusive Use of the Learning Patterns and Use of the Algorithm to Create New Patterns.

    Method                     Iris setosa   Iris versicolor   Iris virginica
    Without dynamic creation   63.34%        56.97%            57.23%
    With dynamic creation      95.45         88.81             77.88
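Fidelity figures like those in Table 5 reduce to pattern-by-pattern agreement between a rule and a reference output. The following is a minimal sketch with invented data; it also produces the kind of 2x2 confusion matrix quoted in the text (rows: rule says true/false; columns: reference says true/false).

```python
def confusion_matrix(rule, samples):
    """Tabulate rule output against reference output over labeled samples."""
    m = [[0, 0], [0, 0]]
    for inputs, reference in samples:
        row = 0 if rule(inputs) else 1
        col = 0 if reference else 1
        m[row][col] += 1
    return m

def fidelity(rule, samples):
    """Fraction of samples on which the rule agrees with the reference."""
    m = confusion_matrix(rule, samples)
    return (m[0][0] + m[1][1]) / len(samples)

rule = lambda x: x[0] > x[1]                 # stand-in extracted rule
samples = [((60.0, 51.5), True), ((10.0, 55.0), False),
           ((70.0, 20.0), True), ((30.0, 40.0), True)]
cm = confusion_matrix(rule, samples)
fid = fidelity(rule, samples)
```

When the reference outputs are the ANN's outputs, this quantity is exactly the fidelity of the rule to the network.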
Table 6: Parameters of the Algorithm for the Wisconsin Breast Cancer Data Problem.

    Parameter              Value
    Constants              50 random values in the range [0-1]
    Variables              X1, X2, X3, X4, X5, X6, X7, X8, X9
    Relational operators   <, >, =
    Logical operators      AND, OR, NOT
    Decision function      If-then-else over boolean values
    Selection algorithm    Tournament
    Crossover rate         92%
    Mutation rate          8%
    Population size        1000 individuals
    Parsimony level        0.001

    (X5) Single epithelial cell size   1-10
    (X6) Bare nuclei                   1-10
    (X7) Bland chromatin               1-10
    (X8) Normal nucleoli               1-10
    (X9) Mitoses                       1-10
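Table 6 restricts GP to typed operators so that only grammatically correct boolean trees are assembled, as described in section 5. A simplified sketch of such node typing follows; the type table and checker are illustrative, not the authors' implementation, and terminals are assumed real-valued.

```python
# Each operator declares (output type, required child types).
NODE_TYPES = {
    "AND": ("bool", ("bool", "bool")),
    "OR":  ("bool", ("bool", "bool")),
    "NOT": ("bool", ("bool",)),
    ">":   ("bool", ("real", "real")),
    "<":   ("bool", ("real", "real")),
    "=":   ("bool", ("real", "real")),
}

def tree_type(tree):
    """Return the type a tree evaluates to, or raise TypeError if invalid."""
    if not isinstance(tree, tuple):          # terminal: X-variable or constant
        return "real"
    op, *children = tree
    out_type, child_types = NODE_TYPES[op]
    if len(children) != len(child_types):
        raise TypeError(f"{op} expects {len(child_types)} children")
    for child, expected in zip(children, child_types):
        if tree_type(child) != expected:
            raise TypeError(f"{op} expected a {expected} child")
    return out_type

# A valid classification tree: comparisons feed booleans into AND.
valid = tree_type(("AND", (">", "X1", 0.5098), ("<", "X6", "X5")))
```

Trees that violate the typing, such as AND applied directly to real-valued inputs, are rejected before they ever enter the population.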
The GP parameters that produce the best results are shown in Table 6. The following rule, obtained from the original (normalized) patterns, produces a fitness of 99.56%:

    (((IF (IF (IF (X1 > X7) THEN (X2 > 0.4788)
               ELSE (0.0000 < X7 < 0.3719))
           THEN (X3 > 0.3780)
           ELSE ((0.2920 < X3 < 0.37192) AND (0.0000 < X6 < 0.6564)))
       THEN (NOT ((0.0669 < X4 < 0.3719) AND (NOT (0.2920 < X6 < 0.3719))))
       ELSE (X1 > 0.8431))
      OR (IF ((X3 > 0.4788) AND (X7 > 0.3780))
          THEN (NOT (0.0000 < X4 < 0.3719))
          ELSE ((X3 < X7) AND (NOT (0.0669 < X1 < 0.6564)))))
     OR ((IF (IF (IF (0.5996 < X1 < 0.7400) THEN (X1 > X7)
                 ELSE (NOT (0.0000 < X6 < 0.6564)))
             THEN (IF (IF (X6 > 0.8431) THEN (X3 < X4)
                       ELSE ((0.0000 < X3 < 0.3719) AND (0.0000 < X6 < 0.6564)))
                   THEN (X7 > 0.3780)
                   ELSE (0.0000 < X1))
             ELSE FALSE)
          THEN (IF (X7 > 0.3781) THEN (X7 > 0.3781) ELSE (X1 > X3))
          ELSE (X2 > 0.4788))))

This rule corresponds to the following confusion matrix, calculated on the test file and including missing values:

    P = ( 239    4 )
        (   2  454 )

Table 7 draws an accuracy comparison with other extraction techniques. We use the system proposed by Rabuñal (1999); although several ANNs have been trained, we have not been able to achieve 100% correct classifications. With seven neurons in one hidden layer and a linear activation function, we can obtain only 98.28% correct classifications. If we use the same architecture but with hyperbolic tangent functions, the success rate reaches 98.68% correct classifications. In order to carry out a detailed analysis of the behavior of the obtained rules, we have built an analysis file with all the possible input values for the nine variables, taking regular intervals of 0.25 between 0 and 1 (normalized inputs). With these intervals, the analysis file contains 1,953,125 examples, and the whole possible range of classifications carried out by the rules can be analyzed. With this set, the outputs were calculated as the outputs of the ANN with these inputs, so the accuracies obtained from this analysis file with the rules are the fidelity values of these rules to the ANN. If we apply the analysis file over the rules set, we obtain a fidelity of 64.56% correct classifications. If we combine the rule extraction mechanism with the algorithm for new pattern creation on the ANN, the result is that values between 15% and 35% of new patterns and between 35% and 40% of change probability are the ones with the best behavior in the algorithm.
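The exhaustive analysis file described above can be sketched as a grid enumeration: nine normalized inputs sampled at steps of 0.25 give 5^9 = 1,953,125 rows, each labeled with the black-box ANN's output. The snippet below enumerates a tiny three-variable version to keep it fast; the ANN stand-in and the rule are invented.

```python
from itertools import product

steps = [i * 0.25 for i in range(5)]        # 0.0, 0.25, 0.5, 0.75, 1.0
full_size = len(steps) ** 9                 # rows in the nine-variable file

ann = lambda p: p[0] + p[1] > 0.9           # stand-in for the trained network
rule = lambda p: p[0] + p[1] > 0.9          # a perfectly faithful rule

# A small three-variable analysis file; the real one uses repeat=9.
grid = list(product(steps, repeat=3))
fid = sum(rule(p) == ann(p) for p in grid) / len(grid)
```

Because the grid's labels come from the ANN itself, the resulting agreement score is the fidelity of the rule to the network over the whole input range, not its accuracy on the original data.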
Table 7: Summary of Rule Extraction Results for the Wisconsin Breast Cancer Data Set.

    Method          Accuracy   Reference
    Our method      99.56%
    C-MLP2LN        99.0       Duch et al. (2001)
    IncNet          97.1       Jankowski and Kadirkamanathan (1997)
    k-NN            97.0       Duch et al. (2001)
    Fisher LDA      96.8       Ster and Dobnikar (1996)
    MLP with BP     96.7       Ster and Dobnikar (1996)
    LVQ             96.6       Ster and Dobnikar (1996)
    Bayes           96.6       Ster and Dobnikar (1996)
    Naive Bayes     96.4       Ster and Dobnikar (1996)
    DB-CART         96.2       Shang and Breiman (1996)
    LDA             96.0       Ster and Dobnikar (1996)
    LFC, ASI, ASR   95.6       Ster and Dobnikar (1996)
    CART            93.5       Shang and Breiman (1996)
The best rule obtained (normalized) through the rule extraction mechanism combined with the algorithm for new pattern creation is the following:

    ((((X2 > 0.3009) AND (X1 > X6)) OR ((((X1 > 0.5098) OR (X8 > X5)) OR (X6 > 0.5098)) AND ((X3 > 0.3009) OR (X6 > X5)))) OR (X3 > 0.5098))

This rule produces a success of 98.83% on the ANN outputs with the training inputs and 71.91% (fidelity) on the analysis file. We obtained another normalized rule that provides less accuracy but is much simpler:

    X6 > ((0.8928 − X3) − X2)

It produces an accuracy of 98.44% on the ANN outputs with the training inputs and 70.47% on the analysis file. The interpretation of this rule is the following:

    "Bare nuclei" > (9.035 − "Uniformity of cell shape" − "Uniformity of cell size")
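The compact normalized rule above can be written directly as a predicate. In this sketch the parameter names follow the attribute list (X2: uniformity of cell size, X3: uniformity of cell shape, X6: bare nuclei); the sample values are invented for illustration.

```python
def extracted_rule(x2, x3, x6):
    """X6 > (0.8928 - X3) - X2, all inputs normalized to [0, 1]."""
    return x6 > (0.8928 - x3) - x2

fires = extracted_rule(x2=0.7, x3=0.6, x6=0.9)   # high scores fire the rule
quiet = extracted_rule(x2=0.1, x3=0.1, x6=0.2)   # uniformly low scores do not
```

A single inequality over three of the nine attributes is what makes this simpler rule attractive despite its slightly lower accuracy.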
Table 8 draws a comparative chart of the fidelities achieved by using only the learning patterns and those achieved by using the algorithm for creating new patterns. It shows the accuracy values produced by the rules that were obtained with both methods during the evaluation of the analysis file patterns (1,953,125 examples).

7.3 Hepatitis. This is another small medical database from UCI (Merz & Murphy, 2002). It contains 155 samples that belong to two different classes:
Table 8: Comparative Fidelities Achieved Through Exclusive Use of the Learning Patterns and Use of the Algorithm to Create New Patterns.

    Method                     Fidelity
    Without dynamic creation   64.56%
    With dynamic creation      71.91
32 "die" cases and 123 "live" cases. There are 19 attributes, 13 binary and 6 with 6 to 9 discrete values:

    (X1)  AGE:              10, 20, 30, 40, 50, 60, 70, 80
    (X2)  SEX:              male, female
    (X3)  STEROID:          no, yes
    (X4)  ANTIVIRALS:       no, yes
    (X5)  FATIGUE:          no, yes
    (X6)  MALAISE:          no, yes
    (X7)  ANOREXIA:         no, yes
    (X8)  LIVER BIG:        no, yes
    (X9)  LIVER FIRM:       no, yes
    (X10) SPLEEN PALPABLE:  no, yes
    (X11) SPIDERS:          no, yes
    (X12) ASCITES:          no, yes
    (X13) VARICES:          no, yes
    (X14) BILIRUBIN:        0.39, 0.80, 1.20, 2.00, 3.00, 4.00
    (X15) ALK PHOSPHATE:    33, 80, 120, 160, 200, 250
    (X16) SGOT:             13, 100, 200, 300, 400, 500
    (X17) ALBUMIN:          2.1, 3.0, 3.8, 4.5, 5.0, 6.0
    (X18) PROTIME:          10, 20, 30, 40, 50, 60, 70, 80, 90
    (X19) HISTOLOGY:        no, yes
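Several of the rule expressions in this section use %, which Table 9 identifies as protected division, a standard GP device: division that returns a safe fallback when the denominator is zero, so that every randomly assembled tree remains evaluable over the whole input range. A sketch follows; the fallback value of 1.0 is a common convention, not taken from the paper.

```python
def protected_div(a, b):
    """GP protected division: ordinary a / b, with a safe value at b == 0."""
    return a / b if b != 0 else 1.0

ratio = protected_div(1.0, 4.0)    # ordinary division
safe = protected_div(3.0, 0.0)     # zero denominator: fallback, not an error
```

Closure of the operator set, discussed in section 5, is exactly what this guarantees: any subtree can feed any real-typed slot without runtime failure.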
The parameters in Table 9 produce the best results. The rule obtained from the original (normalized) patterns produces an accuracy of 98.06% correct answers:

    IF (IF (X10 OR X7)
        THEN (IF (X3 OR X4)
              THEN (IF X7
                    THEN (IF (IF X19
Table 9: Parameters of the Algorithm for the Hepatitis Data Set.

    Parameter                Value
    Constants                50 random values in the range [0-1]
    Variables                X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11,
                             X12, X13, X14, X15, X16, X17, X18, X19
    Relational operators     <, >, =
    Arithmetical operators   +, −, ×, % (protected division)
    Logical operators        AND, OR, NOT
    Decision function        If-then-else over boolean values
                             If-then-else over real values
    Selection algorithm      Tournament
    Crossover rate           90%
    Mutation rate            10%
    Population size          800 individuals
    Parsimony level          0.001
                              THEN X10 ELSE X11)
                          THEN X12 ELSE X2)
                    ELSE X9)
              ELSE X9)
        ELSE X3)
    THEN (IF (IF X19 THEN X3 ELSE X6) THEN X12 ELSE (0.0816 < X14 < 0.4534))
    ELSE (IF (X9 OR X7)
          THEN (X5 OR (0.5350 < X18 < 0.8474))
          ELSE (IF (IF X10 THEN X12 ELSE X4)
Table 10: Summary of Rule Extraction Results for the Hepatitis Data Set.

    Method                    Accuracy   Reference
    Our method                98.06%
    C-MLP2LN                  96.1       Duch et al. (2001)
    k-NN, k = 18, Manhattan   90.2       Duch et al. (2001)
    FSM + rotations           89.7       Duch et al. (2001)
    LDA                       86.4       Ster and Dobnikar (1996)
    Naive Bayes               86.3       Ster and Dobnikar (1996)
    IncNet + rotations        86.0       Jankowski and Kadirkamanathan (1997)
    QDA                       85.8       Ster and Dobnikar (1996)
    1-NN                      85.3       Ster and Dobnikar (1996)
    ASR                       85.0       Ster and Dobnikar (1996)
    FDA                       84.5       Ster and Dobnikar (1996)
    LVQ                       83.2       Ster and Dobnikar (1996)
    CART                      82.7       Ster and Dobnikar (1996)
    MLP with BP               82.1       Ster and Dobnikar (1996)
    ASI                       82.0       Ster and Dobnikar (1996)
    LFC                       81.9       Ster and Dobnikar (1996)
                THEN (X1 < 0.6620) ELSE X13))

This rule corresponds to the following confusion matrix, calculated on patterns without any missing values:

    P = ( 64   1 )
        (  3  12 )

Table 10 draws an accuracy comparison with other discovery techniques. We use the system proposed by Rabuñal (1999) to train an ANN. With five neurons in one hidden layer and hyperbolic tangent activation functions, we achieve a 98.06% fitness. In order to carry out a detailed analysis of the behavior of the obtained rules, we built an analysis file with all the possible input values for the 13 boolean variables and with regular intervals of 0.25 between 0 and 1 (normalized inputs) for the discrete values. With these intervals, the analysis file has 5,971,968 examples, and we can analyze the possible range of classifications carried out by the rules. With this set, the outputs were calculated as the outputs of the ANN with these inputs, so the accuracies obtained from this analysis file with the rules are the fidelity values of these rules to the ANN. If we apply the analysis file on the rules set, we obtain a fidelity of 59.95% correct classifications. If we combine the rule extraction mechanism with the algorithm for new pattern creation on the ANN, the result is that values between 25% and 45% of new patterns and between 30% and 50%
of change probability are the ones with the best behavior in the algorithm. The best rule, obtained with the rule extraction mechanism combined with the algorithm for creating new patterns, is the following:

    (IF (IF (X16 = 0.3813) OR (((−0.9846) + (((−0.8851) % (−X15)) + (X5 − 0.5614))) >= (−(((X9 + X10) + (−X11)) % (−0.3598)))))

The analysis file was built with all the possible input values for the 4 boolean variables, taking regular intervals of 0.5 between 0 and 1 (normalized inputs) for the discrete values. With these intervals, the analysis file contains 6,198,727,824 examples. With this set, the outputs were calculated as the outputs of the ANN with these inputs, so the accuracies obtained from this analysis file with the rules are the fidelity values of these rules to the ANN. We can evaluate both expressions (with and without the algorithm for new pattern creation) with the analysis file. In this case, the results are as shown in Table 14.

7.5 Appendicitis. This is another small medical database. It was obtained from Shalom Weiss (Weiss & Kulikowski, 1991). (Our special thanks to him for his support in the development of this experiment.)
Table 14: Comparative Fidelities Achieved Through Exclusive Use of Learning Patterns and the Use of the Algorithm for Creating New Patterns.

Method                      Fidelity
Without dynamic creation    77.44%
With dynamic creation       82.51%
It contains 106 samples that belong to two different classes: 88 cases with acute appendicitis and 18 cases with other problems. There are eight attributes, all of them with continuous values:

(X1) WBC1    (X2) MNEP    (X3) MNEA    (X4) MBAP
(X5) MBAA    (X6) HNEP    (X7) HNEA    (X8) CRP
The GP parameters that produce the best results are shown in Table 15. The rule obtained from the original (normalized) patterns produces a success rate of 98.89%:

IF (IF ((X8 + X4) >= X7) AND (X1 > X8)) AND (0.5225 >= ((X6 * X2) + X1)))).
The analysis file has been built taking regular intervals of 0.125 between 0 and 1 (normalized inputs). With these intervals, the analysis file contains 134,217,728 examples. With this set, the outputs were calculated as the outputs of the ANN with these inputs, so the accuracies obtained from this analysis file with the rules are the fidelity values of these rules to the ANN. We can evaluate both expressions (with and without the algorithm for new pattern creation) with the analysis file. In this case, the results are as shown in Table 17.
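The fidelity computation used throughout these experiments (exhaustively enumerating an analysis file and scoring the agreement between the extracted rule and the ANN) can be sketched at toy scale as follows. The `ann` and `rule` functions below are hypothetical stand-ins, not the networks or rules from the experiments:

```python
from itertools import product

def ann(x):
    # Hypothetical stand-in for the trained network's thresholded output.
    return int(x[0] + 0.5 * x[1] - 0.25 * x[2] > 0.9)

def rule(x):
    # Hypothetical extracted symbolic rule approximating the ANN.
    return int(x[0] > 0.5)

def analysis_file(n_bool, n_cont, step):
    """Cartesian product of all boolean values and a regular grid
    over [0, 1] for the continuous (normalized) inputs."""
    bools = [0.0, 1.0]
    grid = [i * step for i in range(int(1 / step) + 1)]
    return product(*([bools] * n_bool + [grid] * n_cont))

def fidelity(n_bool, n_cont, step):
    # Fraction of analysis-file examples on which the rule
    # classifies exactly as the ANN does.
    total = agree = 0
    for x in analysis_file(n_bool, n_cont, step):
        total += 1
        agree += ann(x) == rule(x)
    return agree / total

print(round(fidelity(n_bool=1, n_cont=2, step=0.25), 4))  # → 0.92
```

With one boolean and two continuous inputs at step 0.25, the analysis file holds 2 × 5 × 5 = 50 examples, mirroring at toy scale the 5,971,968- and 134,217,728-example files described above.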
Table 17: Comparative Fidelities Achieved Through Exclusive Use of the Learning Patterns and Use of the Algorithm for Creating New Patterns.

Method                      Fidelity
Without dynamic creation    78.93%
With dynamic creation       85.97%
8 Discussion

The results presented here have shown that the rule discovery algorithm based on GP is comparable to other existing methods, while offering the great advantage of being applicable to any ANN architecture with any type of activation function. Depending on the task for which the ANN was designed, we only need to carry out the appropriate experimentation and build the subsequent learning patterns by selecting the operator types that will be used by the GP.

The four requirements for an ideal rule discovery system have been met:

• The ANN is treated as a black box: only the inputs and outputs produced by the ANN are taken into account, and there is no dependence on architecture or training algorithm (requirements 1 and 2).

• The rules discovered in the ANN are as faithful to the ANN's functioning as the fitness value produced by the algorithm (GP) indicates, and as expressive as the semantic level used by the expression trees that are codified in the GP (requirements 3 and 4).

We can therefore conclude that the algorithms based on GP fit the ideal needs of rule extraction systems. Their results are very positive if we compare them to other methods that are more specific to the ANN architecture or to its training. Moreover, the results obtained from the extraction of the generalization capacity provide a rather accurate emulation of the ANN's behavior with respect to the possible value combinations that may occur in the inputs. The high success rates of the rule extraction from the ANNs indicate a reliable behavior of the ANNs. We may also state that the knowledge of the ANN has been obtained in an explicit and comprehensible way from a human's point of view, which allows us to express the ANN's generalization capacity by means of a symbolic rule.

Another advantage of the rule extraction system is that it drastically reduces the complexity of the obtained rules in all the analyzed cases. As the previous results show, the number of operators and constants that intervene in the rules is considerably smaller than in the rules extracted from the training patterns, while always maintaining the obtained fitness levels.
1520
J. Rabuñal, J. Dorado, A. Pazos, J. Pereira, and D. Rivero
9 Future Works

The algorithm presented here works with data sequences that are generated by the ANN. This means that it could also be applied to ANNs with recurrent architecture (RANN) and to dynamic problems, where the RANN outputs depend on the present inputs and on the inputs in N previous cycles. It is therefore necessary to redefine the types of rules and operators in order to be able to represent these data structures, including terminal nodes that store the values of the previous cycles. This is an attractive development that can be applied to dynamic models such as the prediction of time series, where recurrent ANNs have been applied to many problems of a similar nature.

Another future development will be the analysis of the parameters that take part in the algorithm's correct functioning, depending on the type of problem that is solved by the ANN. We shall consider the ANN not as a black box but as a gray box, where the ANN's activation function is known; through its incorporation as a mathematical operator for GP, we shall analyze which rules and expressions can be discovered by this operator. In order to accelerate the rule extraction process, we can use a network of computers that searches in a distributed and recurring manner, exchanging rules (subtrees) internally.

Acknowledgments

This work has been supported by project INBIOMED (Storage, Integration and Analysis of Clinical, Genetic and Epidemiological Data and Images Oriented to Research on Pathologies Platform) G03/160, financed by the General Subbureau of Sanitary Research and the Institute of Health Carlos III.

References

Andrews, R., Cable, R., Diederich, J., Geva, S., Golea, M., Hayward, R., Ho-Stuart, C., & Tickle, A. B. (1996). An evaluation and comparison of techniques for extracting and refining rules from artificial neural networks (QUT NRC Tech. Rep.). Queensland: Queensland University of Technology, Neurocomputing Research Centre.

Andrews, R., Diederich, J., & Tickle, A. (1995).
A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge Based Systems, 8, 373–389.

Andrews, R., & Geva, S. (1994). Rule extraction from a constrained error backpropagation MLP. In Proceedings of the Australian Conference on Neural Networks (pp. 9–12). Bendigo, Australia: Artificial Intelligence Research Group.
Benítez, J. M., Castro, J. L., & Requena, I. (1997). Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8(5), 1156–1164.

Bonarini, A. (1996). Evolutionary learning of fuzzy rules: Competition and cooperation. In W. Pedrycz (Ed.), Fuzzy modelling: Paradigms and practice. Norwell, MA: Kluwer.

Browne, C., Düntsch, I., & Gediga, G. (1998). IRIS revisited: A comparison of discriminant and enhanced rough set data analysis. In L. Polkowski & A. Skowron (Eds.), Rough sets in knowledge discovery (Vol. 2, pp. 345–368). Heidelberg: Physica Verlag.

Buckley, J. J., Hayashi, Y., & Czogala, E. (1993). On the equivalence of neural nets and fuzzy expert systems. Fuzzy Sets and Systems, 53, 129–134.

Chalup, S., Hayward, R., & Diederich, J. (1998). Rule extraction from artificial neural networks trained on elementary number classification task (QUT NRC Tech. Rep.). Queensland: Queensland University of Technology, Neurocomputing Research Centre.

Cramer, N. L. (1985). A representation for the adaptive generation of simple sequential programs. In Proceedings of the First International Conference on Genetic Algorithms. Mahwah, NJ: Erlbaum.

Craven, M. W. (1996). Extracting comprehensible models from trained neural networks. Unpublished doctoral dissertation, University of Wisconsin.

Craven, M. W., & Shavlik, J. W. (1996). Extracting tree-structured representations of trained networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.

Duch, W., Adamczak, R., & Grąbczewski, K. (2001). A new methodology of extraction, optimisation and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks, 12, 277–306.

Engelbrecht, A. P., Rouwhorst, S. E., & Schoeman, L. (2001). A building block approach to genetic programming for rule discovery. In A. Abbass, R. Sarkar, & C. Newton (Eds.), Data mining: A heuristic approach. Idea Group Publishing.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data mining. Cambridge, MA: MIT Press.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.

Friedberg, R. M. (1958). A learning machine: Part I. IBM Journal of Research and Development, 2, 2–13.

Friedberg, R. M., Dunham, B., & North, J. H. (1959). A learning machine: Part II. IBM Journal of Research and Development, 3, 282–287.

Fujiki, C. (1987). Using the genetic algorithm to generate lisp source code to solve the Prisoner's Dilemma. In Proceedings of the International Conference on GAs (pp. 236–240). Mahwah, NJ: Erlbaum.

Halgamuge, S. K., & Glesner, M. (1994). Neural networks in designing fuzzy systems for real world applications. Fuzzy Sets and Systems, 65, 1–12.

Jagielska, I., Matthews, C., & Whitfort, T. (1996). The application of neural networks, fuzzy logic, genetic algorithms and rough sets to automated knowledge acquisition. In Proceedings of the Fourth International Conference on Soft Computing, IIZUKA'96 (Vol. 2, pp. 565–569). Singapore: World Scientific.
Jang, J., & Sun, C. (1992). Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks, 4, 156–158.

Jankowski, N., & Kadirkamanathan, V. (1997). Statistical control of RBF-like networks for classification. In Proceedings of the Seventh International Conference on Artificial Neural Networks (pp. 385–390). Berlin: Springer-Verlag.

Kasabov, N. (1996). Foundations of neural networks, fuzzy systems and knowledge engineering. Cambridge, MA: MIT Press.

Keedwell, E., Narayanan, A., & Savic, D. (2000). Creating rules from trained neural networks using genetic algorithms. International Journal of Computers, Systems and Signals, 1, 30–42.

Koza, J. (1992). Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press.

Martínez, A., & Goddard, J. (2001). Definición de una red neuronal para clasificación por medio de un programa evolutivo. Journal of Ingeniería Biomédica, 22, 4–11.

Merz, C. J., & Murphy, P. M. (2002). UCI repository of machine learning databases. Available on-line at: http://www-old.ics.uci.edu/pub/machine-learning-databases.

Montana, D. J. (1995). Strongly typed genetic programming. In M. Schoenauer et al. (Eds.), Evolutionary computation (pp. 199–200). Cambridge, MA: MIT Press.

Nauck, D., Nauck, U., & Kruse, R. (1996). Generating classification rules with the neuro-fuzzy system NEFCLASS. In Proceedings of the Biennial Conference of the North American Fuzzy Information Processing Society (NAFIPS'96) (pp. 466–470). Los Alamitos, CA: IEEE.

Pop, E., Hayward, R., & Diederich, J. (1994). RULENEG: Extracting rules from a trained ANN by stepwise negation (Neurocomputing Research Centre Tech. Rep.). Queensland: Queensland University of Technology.

Rabuñal, J. R. (1999). Entrenamiento de Redes de Neuronas Artificiales mediante Algoritmos Genéticos. Graduate thesis, University of A Coruña, Spain.

Shang, N., & Breiman, L. (1996).
Distribution based trees are more accurate. International Conference on Neural Information Processing, 1, 133–138.

Ster, B., & Dobnikar, A. (1996). Neural networks in medical diagnosis: Comparison with other methods. In Proceedings of the International Conference on Engineering Applications of Neural Networks (pp. 427–430).

Thrun, S. (1995). Extracting rules from networks with distributed representations. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.

Tickle, A. B., Andrews, R., Golea, M., & Diederich, J. (1998). The truth will come to light: Directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Transactions on Neural Networks, 9, 1057–1068.

Tickle, A. B., Orlowski, M., & Diederich, J. (1996). DEDEC: A methodology for extracting rules from trained artificial neural networks (Tech. Rep.). Queensland: Queensland University of Technology, Neurocomputing Research Centre.
Towell, G., & Shavlik, J. W. (1994). Knowledge-based artificial neural networks. Artificial Intelligence, 70, 119–165.

Visser, U., Tickle, A., Hayward, R., & Andrews, R. (1996). Rule-extraction from trained neural networks: Different techniques for the determination of herbicides for the plant protection advisory system PRO PLANT. In Proceedings of the Rule Extraction from Trained Artificial Neural Networks Workshop (pp. 133–139). Queensland: Queensland University of Technology.

Weiss, S. M., & Kulikowski, C. A. (1990). An empirical comparison of pattern recognition, neural nets and machine learning classification methods. In J. W. Shavlik & T. G. Dietterich (Eds.), Machine learning. San Francisco: Morgan Kaufmann.

Weiss, S. M., & Kulikowski, C. A. (1991). Computer systems that learn: Classification and prediction methods from statistics, neural nets, machine learning, and expert systems. San Francisco: Morgan Kaufmann.

Wong, M. L., & Leung, K. S. (2000). Data mining using grammar-based genetic programming and applications. Norwell, MA: Kluwer.

Received April 9, 2003; accepted January 8, 2004.
LETTER
Communicated by Steven Nowlan
A Classification Paradigm for Distributed Vertically Partitioned Data

Jayanta Basak
[email protected] Ravi Kothari
[email protected] IBM India Research Laboratory, Indian Institute of Technology, New Delhi 110016, India
In general, pattern classification algorithms assume that all the features are available during the construction of a classifier and its subsequent use. In many practical situations, data are recorded in different servers that are geographically apart, and each server observes features of local interest. The underlying infrastructure and other logistics (such as access control) in many cases do not permit continual synchronization. Each server thus has a partial view of the data in the sense that feature subsets (not necessarily disjoint) are available at each server. In this article, we present a classification algorithm for this distributed vertically partitioned data. We assume that local classifiers can be constructed based on the local partial views of the data available at each server. These local classifiers can be any one of the many standard classifiers (e.g., neural networks, decision tree, k nearest neighbor). Often these local classifiers are constructed to support decision making at each location, and our focus is not on these individual local classifiers. Rather, our focus is on constructing a classifier that can use these local classifiers to achieve an error rate that is as close as possible to that of a classifier having access to the entire feature set. We empirically demonstrate the efficacy of the proposed algorithm and also provide theoretical results quantifying the loss that results as compared to the situation where the entire feature set is available to any single classifier.

1 Introduction

Existing approaches to classifier design assume that all the features are available for constructing the classifier and during its subsequent use (Duda & Hart, 1973; Jain, Duin, & Mao, 2000). In some situations, all the features may not be observable at a single location. Even if they are observable from one location, all the features may not be relevant in a particular context, and an observer may choose selected features relevant within its context.
For example, in the domain of e-commerce, Web-based merchant-customer

Neural Computation 16, 1525–1544 (2004) © 2004 Massachusetts Institute of Technology
1526
J. Basak and R. Kothari
interaction may be recorded in one server. Such interaction may comprise past transaction data, clickstream (the sequence of Web pages visited by an individual), and other related information (Cooley, 2000). However, a different server may record phone-based merchant-customer interaction. Yet another server may record brick-and-mortar-based interactions. In some cases, the underlying infrastructure does not permit continual synchronization among the different servers. In other cases, a merchant may not have felt the need to utilize all the data in the past but may now feel the need to do so. A solution in which classifiers, local to a server and built using the partial view available at the server, are utilized to arrive at an overall classification may be far more attractive than an infrastructure upgrade for co-locating data.

There are other examples also. In the domain of computer vision, for example, multiple cameras may observe an object from different viewpoints. To support certain types of queries, there might be a need to utilize information gathered from the multiple viewpoints. Co-locating the data might be an acceptable solution in some situations. However, from a different perspective, one can think of integrating the output of (local) classifiers designed on the basis of (local) partial views.

In summary, there exist several reasons for considering the problem of decision making in this vertically partitioned environment (Caragea, Silvescu, & Honavar, 2001). Depending on the domain of application, the following, either individually or in combination, motivate the problem setting:

• The data sources may not be available at a single location, and it may be expensive (or difficult due to existing infrastructure constraints) to co-locate the partial views.
• Even if it is possible to co-locate the partial views, it may be difficult to solve the correspondence problem, that is, knowing which (partial) pattern in one data source corresponds to which (partial) pattern in some other data source.

• Even if it is possible to solve the correspondence problem, it may be difficult to build a single classifier in a large-dimensional feature space (Ho & Basu, 2002). Vertical partitioning may allow scaling in complex situations.

• Even if all the above are resolved, issues related to privacy and security may preclude one data source's fully releasing and sharing its data. In such situations, co-locating the partial views may not be possible.

While one would expect a lower error rate when all the features are simultaneously available, the proposed classification paradigm is more feasible and can be attractive if a comparable error rate is achievable. One intuitive approach is to construct individual classifiers from the partial views of the data and then use a master classifier to combine the decisions of the individual slave classifiers. Some expert systems are in fact constructed using such an approach (Behrouz & Koono, 1996). However, this requires the master classifier to have access to some data specified in terms of the entire feature set.

At first glance, it appears that the significant activity in the area of the so-called mixture-of-experts model (Jacobs, Jordan, & Nowlan, 1991; Breiman, 1996; Jordan & Jacobs, 1994; Nowlan & Hinton, 1990, 1991; Nowlan & Sejnowski, 1995) is usable when data are vertically partitioned. In the typical mixture-of-experts model, there are a number of classifiers (or modules) that simultaneously receive the input x. A gating module produces the overall output based on selection or negotiation among the outputs of the individual modules. The fundamental motivation of such networks is divide-and-conquer: each expert in the mixture-of-experts framework can solve a simpler problem, and the combination of the outputs of the individual experts provides a solution to the more complex problem. Though typically each expert in such a mixture-of-experts framework sees the entire input x, it is conceivable that each expert could observe certain features and the entire framework be usable when data are vertically partitioned. However, each expert in a mixture-of-experts framework partitions the input space and establishes local regression surfaces in each partition. When used with vertically partitioned data, such regression surfaces would be defined over regions of a subspace, and there is no guarantee that the approximation will be anything close to the desired approximation (unless it happens that the function to be approximated is separable). From another perspective, a mixture-of-experts network typically reduces the bias (over a single classifier) and experiences an increase in the variance. "Soft splits" (Jordan & Jacobs, 1994) alleviate this to some extent by allowing smooth transitions between the regions of influence of two experts.
However, when the experts see different feature subsets, it becomes difficult to produce smooth transitions and reduce the variance. Closely related to the mixture-of-experts model (within this context) is the so-called bagging technique (Jacobs et al., 1991; Breiman, 1996), in which the weights of the individual classifiers are adjusted to minimize the generalization error. In the bagging technique, however, there is a "supervisor" that indicates the overall error. Such a supervisor may not be present in many situations.

Inducing a classifier with vertically partitioned data may also be viewed from the perspective of missing data. A classifier induced from the features in a data partition may view the unobserved features as features whose value is always missing. Ahmad and Tresp (1993), for example, compute the posterior probabilities by marginalizing out the missing features in a manner similar to the one proposed in this article. However, they use a single classifier, which is different from the multiple classifiers induced using individual feature subsets in this work. Correlations between the multiple classifiers that are exploited in this work, and not addressed in the previous works (Ahmad & Tresp, 1993), are an advantage of the proposed paradigm.
In this article, we assume that each individual classifier is constructed based on the partial view of the data that is available. Correspondence of the patterns in the different data sources is not necessary; a classifier is constructed based on the available data. Of course, in order for the decisions made by the classifiers to be consistent, the data sets available to the classifiers should be sampled from the same (fixed though unknown) distribution. However, we assume that a test pattern is observable across the classifiers. Our approach to arriving at the combined classification is based on utilizing the posterior probabilities computed by the individual classifiers.

The rest of the letter is organized as follows. In section 2, we provide an outline of the problem formulation in a Bayesian framework. In section 3, we quantify the error introduced into the classification procedure due to the combination of the different classifiers. In section 4, we provide certain case studies for different types of classifiers and the simulation results obtained for different data sets vertically partitioned across the classifiers. Finally, we present our conclusions in section 5.

2 Problem Formulation in a Bayesian Framework

We consider the situation in which there are q independent observers of a phenomenon. The ith observer records events pertaining to the phenomenon in terms of a set of features $F_i$. The entire feature set is given as $F = F_1 \cup F_2 \cup \cdots \cup F_q$. The partial view recorded by each observer may be interpreted within the traditional nondistributed approach by visualizing that a data set, in which each row comprises a pattern, is vertically (columnwise) partitioned into q (possibly overlapping) partitions. Let the (partial) view of a pattern $x$ as viewed by the ith observer be denoted by $x_{F_i}$. More specifically, let $x_{F_i}$ denote the vector representation of a pattern $x$ comprising the features of $x$ present in $F_i$.
Associated with each observer $i$ is a classifier $C_i$ constructed on the basis of $x_{F_i}$. In this distributed setting, we approach the problem of finding the class label of a test pattern $x$. We assume that the output of each classifier is the a posteriori probability computed on the basis of the partial view available to the classifier.

2.1 Problem Formulation. Here and in the rest of the article, for notational simplicity, we use $P(\cdot)$ to represent the empirically estimated probabilities and $\hat{P}(\cdot)$ to represent the posterior probability as approximated by the proposed approach. Let $P(\omega_j \mid x_{F_i})$ denote the posterior probability for class $\omega_j$ as determined by classifier $i$ based on a partial view of $x$ (i.e., $x_{F_i}$), that is,

$$P(\omega_j \mid x_{F_i}) = \frac{P(\omega_j)\, P(x_{F_i} \mid \omega_j)}{P(x_{F_i})}, \qquad (2.1)$$
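Concretely, each observer's view $x_{F_i}$ is just a column subset of the full pattern. A minimal sketch, with a hypothetical five-feature pattern and two overlapping index sets standing in for $F_1$ and $F_2$:

```python
# Partial views of a pattern as column subsets of the full feature vector.
# The pattern and the feature index sets F1 and F2 are hypothetical;
# they overlap on feature index 2.
x = [0.9, 0.1, 0.4, 0.7, 0.3]   # full pattern over F = {0, 1, 2, 3, 4}
F1 = [0, 1, 2]
F2 = [2, 3, 4]

def view(x, feature_set):
    """x_{F_i}: the components of x indexed by the feature subset F_i."""
    return [x[i] for i in feature_set]

x_F1 = view(x, F1)                  # what classifier C1 sees
x_F2 = view(x, F2)                  # what classifier C2 sees
shared = sorted(set(F1) & set(F2))  # F1 ∩ F2, visible to both observers
x_shared = view(x, shared)

print(x_F1, x_F2, x_shared)  # → [0.9, 0.1, 0.4] [0.4, 0.7, 0.3] [0.4]
```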
where the prior probabilities are assumed to be the same across the different partitions. As a consequence, there is no $i$ associated with $P(\omega_j)$. If the feature sets observed by the different classifiers are nonoverlapping, then the combined decision can be obtained based on the assumption that the subsets of features (observed by the classifiers) are independent of each other. However, if these subsets are overlapping, then the independence assumption may not hold. In fact, in the nonoverlapping case the independence assumption may not hold true either, but in the absence of dependency information, the best that can be obtained is based on the independence assumption of the feature subsets.

The problem is to make a decision based on the output of the individual classifiers. We model the problem as one of estimating the posterior probability based on the posterior probabilities estimated by the individual classifiers. The following theorem describes one approximation technique by which the overall posterior probability can be estimated:

Theorem 1. If we marginalize those features that are not visible by more than one classifier, then the overall posterior probability is approximated as

$$\hat{P}(\omega_j \mid x) = \frac{\left[\prod_k P(\omega_j \mid x_{F_k})\right]\left[\prod_{k,l,m} P(\omega_j \mid x_{(F_k \cap F_l \cap F_m)})\right]\cdots}{\left[\prod_{k,l} P(\omega_j \mid x_{(F_k \cap F_l)})\right]\left[\prod_{k,l,m,n} P(\omega_j \mid x_{(F_k \cap F_l \cap F_m \cap F_n)})\right]\cdots} \qquad (2.2)$$

where $P(\omega_j \mid x_{(F_k \cap F_l \cap F_m)})$ denotes the posterior probability for class $\omega_j$ based on the feature subset $(F_k \cap F_l \cap F_m)$ and $x_{(F_k \cap F_l \cap F_m)}$ is the corresponding view of $x$. $\hat{P}$ is the approximated posterior probability.

Proof. Let us first consider that there are only two classifiers such that the feature set can be expressed as $F = F_1 \cup F_2$. Therefore,

$$P(x) = P(x_{F_1}, x_{F_2}) = P(\hat{x}_{F_1}, \hat{x}_{F_2}, x_{(F_1 \cap F_2)}), \qquad (2.3)$$

where $\hat{x}_{F_1}$ and $\hat{x}_{F_2}$ are the partial views of $x$ corresponding to the feature subsets $F_1 - F_2$ and $F_2 - F_1$, respectively. Since the feature subsets $F_1 - F_2$ and $F_2 - F_1$ are not observable across the classifiers, the probability distribution can be marginalized over these subsets (according to the assumption), and therefore

$$\hat{P}(x) = P(\hat{x}_{F_1} \mid x_{(F_1 \cap F_2)})\, P(\hat{x}_{F_2} \mid x_{(F_1 \cap F_2)})\, P(x_{(F_1 \cap F_2)}) = \frac{P(\hat{x}_{F_1}, x_{(F_1 \cap F_2)})}{P(x_{(F_1 \cap F_2)})}\, \frac{P(\hat{x}_{F_2}, x_{(F_1 \cap F_2)})}{P(x_{(F_1 \cap F_2)})}\, P(x_{(F_1 \cap F_2)}) = \frac{P(x_{F_1})\, P(x_{F_2})}{P(x_{(F_1 \cap F_2)})}. \qquad (2.4)$$
By similar arguments and approximation, the class-conditional probability can be approximated as

$$\hat{P}(x \mid \omega_j) = \frac{P(x_{F_1} \mid \omega_j)\, P(x_{F_2} \mid \omega_j)}{P(x_{(F_1 \cap F_2)} \mid \omega_j)}. \qquad (2.5)$$

Therefore, from equations 2.4 and 2.5, the posterior probability can be approximated as

$$\hat{P}(\omega_j \mid x) = \frac{P(\omega_j \mid x_{F_1})\, P(\omega_j \mid x_{F_2})}{P(\omega_j \mid x_{(F_1 \cap F_2)})}. \qquad (2.6)$$
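Equation 2.6 can be checked with a small numeric illustration. The per-view posterior values below are hypothetical; since the approximation need not sum to one over the classes, the unnormalized scores are renormalized before comparison:

```python
# Hypothetical per-view posteriors for a two-class problem:
# P(w|x_F1), P(w|x_F2), and P(w|x_{F1 ∩ F2}) for w in {w1, w2}.
p_F1     = {"w1": 0.8, "w2": 0.2}
p_F2     = {"w1": 0.7, "w2": 0.3}
p_shared = {"w1": 0.6, "w2": 0.4}

# Equation 2.6: P̂(w|x) ∝ P(w|x_F1) P(w|x_F2) / P(w|x_{F1 ∩ F2}).
score = {w: p_F1[w] * p_F2[w] / p_shared[w] for w in ("w1", "w2")}
total = sum(score.values())
post = {w: s / total for w, s in score.items()}  # renormalized

print(max(post, key=post.get))  # → w1
```

Here the shared view's posterior divides out the evidence that both classifiers counted twice, which is exactly the role of the denominator in equation 2.6.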
Let us now prove the generalized approximation by the method of induction. We have to show that the approximation is valid for $N$ feature subsets if it is valid for $(N-1)$ subsets. Let the feature subset collectively available to the first $(N-1)$ classifiers be represented as $F_C = \bigcup_{i=1}^{N-1} F_i$, so that $F = F_C \cup F_N$. Let $x_{F_C}$ represent the partial view of $x$ corresponding to the feature subset $F_C$. Since the approximation is valid for the $(N-1)$ subsets, we can express

$$\hat{P}(\omega_j \mid x_{F_C}) = \frac{\left[\prod_{k=1}^{N-1} P(\omega_j \mid x_{F_k})\right]\left[\prod_{k,l,m=1}^{N-1} P(\omega_j \mid x_{(F_k \cap F_l \cap F_m)})\right]\cdots}{\left[\prod_{k,l=1}^{N-1} P(\omega_j \mid x_{(F_k \cap F_l)})\right]\left[\prod_{k,l,m,n=1}^{N-1} P(\omega_j \mid x_{(F_k \cap F_l \cap F_m \cap F_n)})\right]\cdots} \qquad (2.7)$$

From equation 2.6, the posterior for the overall decision can be approximated as

$$\hat{P}(\omega_j \mid x) = \frac{P(\omega_j \mid x_{F_N})\, \hat{P}(\omega_j \mid x_{F_C})}{\hat{P}(\omega_j \mid x_{(F_C \cap F_N)})}. \qquad (2.8)$$

Again,

$$F_C \cap F_N = \bigcup_{i=1}^{N-1} (F_i \cap F_N) \qquad (2.9)$$

and also

$$(F_i \cap F_N) \cap (F_j \cap F_N) = F_i \cap F_j \cap F_N. \qquad (2.10)$$
Therefore, with the similar approximation for the $N-1$ partial views, the combined posterior from partial views of $x$ observed by the pairwise intersections of the feature sets can be expressed as

$$\hat{P}(\omega_j \mid x_{(F_C \cap F_N)}) = \frac{\left[\prod_{k=1}^{N-1} P(\omega_j \mid x_{(F_k \cap F_N)})\right]\left[\prod_{k,l,m=1}^{N-1} P(\omega_j \mid x_{(F_k \cap F_l \cap F_m \cap F_N)})\right]\cdots}{\left[\prod_{k,l=1}^{N-1} P(\omega_j \mid x_{(F_k \cap F_l \cap F_N)})\right]\left[\prod_{k,l,m,n=1}^{N-1} P(\omega_j \mid x_{(F_k \cap F_l \cap F_m \cap F_n \cap F_N)})\right]\cdots} \qquad (2.11)$$

From equations 2.8 and 2.11, the overall posterior can be approximated as

$$\hat{P}(\omega_j \mid x) = \frac{\left[\prod_{k=1}^{N} P(\omega_j \mid x_{F_k})\right]\left[\prod_{k,l,m=1}^{N} P(\omega_j \mid x_{(F_k \cap F_l \cap F_m)})\right]\cdots}{\left[\prod_{k,l=1}^{N} P(\omega_j \mid x_{(F_k \cap F_l)})\right]\left[\prod_{k,l,m,n=1}^{N} P(\omega_j \mid x_{(F_k \cap F_l \cap F_m \cap F_n)})\right]\cdots} \qquad (2.12)$$
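A direct (unoptimized) rendering of equation 2.12 enumerates all intersections of the feature subsets, multiplying the odd-order intersection posteriors into the numerator and the even-order ones into the denominator. The sketch below is schematic: the feature sets and the per-subset posterior table are hypothetical, and empty intersections (which the text approximates by the common prior) are simply skipped:

```python
from itertools import combinations
from functools import reduce

def combined_posterior(feature_sets, posterior):
    """Unnormalized P̂(w|x) of equation 2.12. `posterior(S)` returns
    P(w | x_S) for a frozenset S of feature indices; `feature_sets`
    is the list F_1, ..., F_N. Odd-order intersections multiply the
    numerator, even-order intersections the denominator."""
    num = den = 1.0
    n = len(feature_sets)
    for order in range(1, n + 1):
        for idx in combinations(range(n), order):
            inter = reduce(frozenset.intersection,
                           (frozenset(feature_sets[i]) for i in idx))
            if not inter:
                # Empty intersection: its posterior reduces to the
                # prior, assumed common; skipped in this sketch.
                continue
            if order % 2 == 1:
                num *= posterior(inter)
            else:
                den *= posterior(inter)
    return num / den

# Hypothetical per-subset posteriors for one class, three overlapping views.
F = [{0, 1, 2}, {2, 3}, {2, 4}]
table = {frozenset({0, 1, 2}): 0.8, frozenset({2, 3}): 0.7,
         frozenset({2, 4}): 0.6, frozenset({2}): 0.55}
print(round(combined_posterior(F, table.__getitem__), 4))
```

Because the result is unnormalized, it is only meaningful relative to the score computed for the other classes, as in the algorithm of section 2.2.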
Thus, the approximation of the posterior is valid for two partitions of the feature set, and therefore it is valid for three partitions (since it is valid for $N$ partitions if it is valid for $N-1$ partitions). The argument can be extended inductively to any arbitrary number of partitions of the feature set. In this approximation, if a subset of attributes across the classifiers is empty, then the posterior for that attribute subset is approximated by the a priori probability.

2.2 Algorithm. In equation 2.12, it is observed that the odd subsets appear in the numerator, and the even subsets appear in the denominator. Let there be two partitions $F_1$ and $F_2$ ($F = F_1 \cup F_2$ and $F_1 \cap F_2 \neq \emptyset$). Let a pattern $x$ belong to class $\omega_1$ such that $P(\omega_1 \mid x) > P(\omega_2 \mid x)$, where $P(\cdot)$ is the true posterior. Let the classifiers be such that a correct decision is made with each individual feature partition and also with the intersection of the feature partitions, that is, $P(\omega_1 \mid x_{F_1}) > P(\omega_2 \mid x_{F_1})$, $P(\omega_1 \mid x_{F_2}) > P(\omega_2 \mid x_{F_2})$, and $P(\omega_1 \mid x_{(F_1 \cap F_2)}) > P(\omega_2 \mid x_{(F_1 \cap F_2)})$. Therefore, equation 2.12 may not guarantee that

$$\hat{P}(\omega_1 \mid x) > \hat{P}(\omega_2 \mid x), \qquad (2.13)$$

although each subclassifier makes the decision correctly. A necessary condition to satisfy inequality 2.13 is (from equation 2.12) that

$$P(\omega_1 \mid x_{F_1}),\; P(\omega_1 \mid x_{F_2}) \;\geq\; P(\omega_1 \mid x_{(F_1 \cap F_2)}), \qquad (2.14)$$
which indicates that P.!2 jxF1 /; P.!2 jxF2 / ¸ P.!2 jxF1 S F2 / for a two-class problem. Condition 2.14 implies that a classier is able to make the decision more accurately if more features are available to it. In other words, the performance of a classier is not disturbed by the presence of redundant features. Note that certain classiers may not have this property. For example, the performance of k-NN classiers may be deteriorated by the presence of noisy features. A classier can have a certain internal logic to check the redundancy of the features (using certain feature selection mechanism) and behave as a consistent classier. We call a classier i to be consistent for a pattern x if there exists some 0 0 class label ! such that P.!jx Fi / > P.! jxFi /Tfor all ! 6D !, and P.!jxFi / > S P.!jxFi T Fj / for all Fj , Fi Fj ¾ Fi , and Fi Fj 6D ;. It is to be noted here that the Bayesian framework of deriving the approximate posterior (see equation 2.12) is valid only for the set of consistent classiers. If all classiers for the vertically partitioned data sets are consistent, then the overall classication score can be computed from equation 2.12. In our implementation, we have not imposed the restriction of consistency of each classier. We compute the overall approximated posterior for a pattern based on only the consistent classifers. Therefore, the overall algorithm is as follows. For a given pattern x, ² We nd the set of consistent classiers for each class label (note that one classier can be consistent for at most one class label for a given pattern). Let Äi be the set of classiers for the class label !i . O i jx/ ² For all non-null Äi , we compute the approximate posterior P.! based on only the classiers in Äi . O i jx/ is maximum. ² Assign the class label for which P.!
• If there exist no consistent classifiers for any class label, then we approximate the posterior by the product of the posteriors of the individual classifiers.

3 Information Loss Due to the Approximation

In this section, we provide an alternate view of the approximation of the posterior due to partitioning of the feature set. According to the Bayes decision rule for a two-class problem,
\[
P(\omega_1 \mid x) \ge P(\omega_2 \mid x) \text{ implies that } x \in \text{Class } \omega_1, \text{ otherwise } x \in \text{Class } \omega_2. \tag{3.1}
\]
Let us consider
\[
g(x) = \frac{P(\omega_1 \mid x)}{P(\omega_2 \mid x)} \quad \text{and} \quad \hat{g}(x) = \frac{\hat{P}(\omega_1 \mid x)}{\hat{P}(\omega_2 \mid x)}, \tag{3.2}
\]
A Classification Paradigm for Distributed Vertically Partitioned Data
1533
where $\hat{P}(\cdot)$ is the approximated posterior as given by equation 2.12. There is no error due to approximation if
\[
\hat{g}(x) = g(x). \tag{3.3}
\]
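The condition in equation 3.3 can be illustrated numerically. The sketch below is our own minimal illustration on a hypothetical two-class, two-feature Gaussian toy problem (names and parameter values are ours, not from the article): when the partial views are class-conditionally independent, the odds ratio $\hat{g}(x)$ built from two single-feature classifiers, combined by a product rule with one factor of the prior divided out (the disjoint-partition special case of the approximation in equation 2.12), matches $g(x)$ exactly.

```python
import math

def gauss(x, mu, sigma=1.0):
    # univariate normal density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical toy problem: two classes, two features that are
# independent given the class (so the joint density factorizes).
priors = {1: 0.5, 2: 0.5}
means = {1: (0.0, 0.0), 2: (2.0, 1.0)}

def posterior_full(x):
    """Posterior using the complete feature vector."""
    like = {w: gauss(x[0], means[w][0]) * gauss(x[1], means[w][1]) for w in (1, 2)}
    z = sum(priors[w] * like[w] for w in (1, 2))
    return {w: priors[w] * like[w] / z for w in (1, 2)}

def posterior_view(x, i):
    """Posterior of the classifier that sees only feature i."""
    like = {w: gauss(x[i], means[w][i]) for w in (1, 2)}
    z = sum(priors[w] * like[w] for w in (1, 2))
    return {w: priors[w] * like[w] / z for w in (1, 2)}

def g(x):
    p = posterior_full(x)
    return p[1] / p[2]

def g_hat(x):
    """Odds ratio from the two partial-view posteriors, combined by a
    product rule with one factor of the prior divided out."""
    p0, p1 = posterior_view(x, 0), posterior_view(x, 1)
    return (p0[1] * p1[1] / priors[1]) / (p0[2] * p1[2] / priors[2])

x = (0.7, -0.3)
# Disjoint, class-conditionally independent views: zero approximation error.
assert abs(g(x) - g_hat(x)) < 1e-12
```

If the two features were instead correlated within a class, $g$ and $\hat{g}$ would differ, and that gap is what the KL-divergence terms derived later in this section (see equation 3.14) bound.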
In other words, an error in the approximation will occur if $g(x)$ is not equal to $\hat{g}(x)$ for an observation $x$. Thus, the error due to approximation in the classification can be formally measured as
\[
E = \int \log\left(\frac{g(x)}{\hat{g}(x)}\right) p(x)\,dx, \tag{3.4}
\]
that is,
\[
E = \int \left[\log\left(\frac{P(\omega_1 \mid x)}{\hat{P}(\omega_1 \mid x)}\right) - \log\left(\frac{P(\omega_2 \mid x)}{\hat{P}(\omega_2 \mid x)}\right)\right] p(x)\,dx. \tag{3.5}
\]
From equation 2.12, the approximation error, equation 3.5, can be simplified as
\[
E = \int p(x)\left[\log\left(\frac{p(x \mid \omega_1)\prod p_{ij}(x_{ij} \mid \omega_1)\prod p_{ijkl}(x_{ijkl} \mid \omega_1)\cdots}{\prod p_i(x_i \mid \omega_1)\prod p_{ijk}(x_{ijk} \mid \omega_1)\cdots}\right) - \log\left(\frac{p(x \mid \omega_2)\prod p_{ij}(x_{ij} \mid \omega_2)\prod p_{ijkl}(x_{ijkl} \mid \omega_2)\cdots}{\prod p_i(x_i \mid \omega_2)\prod p_{ijk}(x_{ijk} \mid \omega_2)\cdots}\right)\right] dx, \tag{3.6}
\]
where $x_i$, $x_{ij}$, $x_{ijk}$, and $x_{ijkl}$ are simplified representations of the partial views $x_{F_i}$, $x_{F_i \cap F_j}$, $x_{F_i \cap F_j \cap F_k}$, and $x_{F_i \cap F_j \cap F_k \cap F_l}$, respectively. The densities $p_i(x_i \mid \omega)$, $p_{ij}(x_{ij} \mid \omega)$, $p_{ijk}(x_{ijk} \mid \omega)$, and $p_{ijkl}(x_{ijkl} \mid \omega)$ are the marginal density functions of the respective class in the feature subsets $F_i$, $F_i \cap F_j$, $F_i \cap F_j \cap F_k$, and $F_i \cap F_j \cap F_k \cap F_l$, respectively, and $p(x)$ is the overall density function of the patterns. Considering that $p(x) = P(\omega_1)p(x \mid \omega_1) + P(\omega_2)p(x \mid \omega_2)$, with $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$ being the class-conditional densities for $\omega_1$ and $\omega_2$, respectively, and that the class-conditional density of class $\omega_1$ is independent of that of class $\omega_2$, we can rewrite equation 3.6 as
\[
E = \int \left[P(\omega_1)\log\left(\frac{p(x \mid \omega_1)\prod p_{ij}(x_{ij} \mid \omega_1)\prod p_{ijkl}(x_{ijkl} \mid \omega_1)\cdots}{\prod p_i(x_i \mid \omega_1)\prod p_{ijk}(x_{ijk} \mid \omega_1)\cdots}\right) p(x \mid \omega_1) - P(\omega_2)\log\left(\frac{p(x \mid \omega_2)\prod p_{ij}(x_{ij} \mid \omega_2)\prod p_{ijkl}(x_{ijkl} \mid \omega_2)\cdots}{\prod p_i(x_i \mid \omega_2)\prod p_{ijk}(x_{ijk} \mid \omega_2)\cdots}\right) p(x \mid \omega_2)\right] dx. \tag{3.7}
\]
Let
\[
a(x \mid \omega_1) = \frac{p(x \mid \omega_1)\prod p_{ij}(x_{ij} \mid \omega_1)\prod p_{ijkl}(x_{ijkl} \mid \omega_1)\cdots}{\prod p_i(x_i \mid \omega_1)\prod p_{ijk}(x_{ijk} \mid \omega_1)\cdots}. \tag{3.8}
\]
The expression for $a(x \mid \omega_1)$ can be rewritten as
\[
a^2(x \mid \omega_1) = p(x \mid \omega_1)\; p(x_{ijk\cdots N} \mid \omega_1)^{(-1)^N}\; \frac{p(x \mid \omega_1)\prod p_{ij}(x_{ij} \mid \omega_1)\cdots}{\prod p_i(x_i \mid \omega_1)\prod p_{ijk}(x_{ijk} \mid \omega_1)\cdots}\; \frac{\prod p_{ij}(x_{ij} \mid \omega_1)\prod p_{ijkl}(x_{ijkl} \mid \omega_1)\cdots}{\prod p_i(x_i \mid \omega_1)\prod p_{ijk}(x_{ijk} \mid \omega_1)\cdots}. \tag{3.9}
\]
Therefore,
\[
\log(a(x \mid \omega_1)) = \frac{1}{2}\Bigg[\log(p(x \mid \omega_1)) + (-1)^N \log(p(x_{ijk\cdots N} \mid \omega_1)) + \log\left(\frac{p(x \mid \omega_1)}{\prod p_i(x_i \mid \omega_1)}\right) + \log\left(\frac{\prod p_{ij}(x_{ij} \mid \omega_1)}{\prod p_{ijk}(x_{ijk} \mid \omega_1)}\right) + \cdots - \log\left(\frac{\prod p_i(x_i \mid \omega_1)}{\prod p_{ij}(x_{ij} \mid \omega_1)}\right) - \log\left(\frac{\prod p_{ijk}(x_{ijk} \mid \omega_1)}{\prod p_{ijkl}(x_{ijkl} \mid \omega_1)}\right) - \cdots\Bigg]. \tag{3.10}
\]
Thus, the error in approximation (from equations 3.7 and 3.10) is bounded as
\[
E \le P(\omega_1)\left[\frac{1}{2} I(\omega_1) + \sum_n \frac{(-1)^n}{2} D_n(\omega_1) + \frac{(-1)^N}{2} I_N(\omega_1)\right] + P(\omega_2)\left[\frac{1}{2} I(\omega_2) + \sum_n \frac{(-1)^n}{2} D_n(\omega_2) + \frac{(-1)^N}{2} I_N(\omega_2)\right], \tag{3.11}
\]
where
\[
D_n(\omega_i) = \int \log\left(\frac{\prod p(x_{ij\cdots(n-1)} \mid \omega_i)}{\prod p(x_{ij\cdots n} \mid \omega_i)}\right) p(x \mid \omega_i)\,dx \tag{3.12}
\]
and
\[
D_1(\omega_i) = \int \log\left(\frac{p(x \mid \omega_i)}{\prod_j p_j(x_j \mid \omega_i)}\right) p(x \mid \omega_i)\,dx \tag{3.13}
\]
reflect the Kullback-Leibler (KL) divergence of the marginalized density in the subspace of the feature subset $F_1 \cap F_2 \cap \cdots \cap F_{n-1}$ and in the subspace of the feature subset $F_1 \cap F_2 \cap \cdots \cap F_n$ corresponding to class $\omega_i$, $I(\omega_i)$ is the information contained in the class $\omega_i$, and $I_N(\omega_i)$ is the information content
of class $\omega_i$ in the subspace of the feature subset $F_1 \cap F_2 \cap F_3 \cap \cdots \cap F_N$. In the analysis, it is assumed that $\bigcup_i F_i = F$, that is, the union of all feature subsets available to the classifiers spans the entire feature set. As a special case, when the feature set is split into $N$ disjoint subsets, the error is
\[
E \le |P(\omega_1) D(\omega_1)| + |P(\omega_2) D(\omega_2)|, \tag{3.14}
\]
where $D(\omega_i) = D_1(\omega_i)$ is the KL divergence between the original class-conditional density and the marginalized density in the space of the feature subsets. From equation 3.14, it is evident that the approximation error goes to zero if the class-conditional divergence is zero. In other words, if the feature split is such that the feature subsets are independent, then the approximation error in the classification is zero. Equation 3.11 also provides a guideline for an optimal feature split such that $E$ is minimum (an algorithm for finding the optimal feature split is beyond the scope of this discussion).

4 Experimental Results

In the proposed approach, the combined classification is based on the posterior probabilities computed by the individual classifiers. We first show how the posterior probabilities can be obtained from the individual classifiers when the classifiers use some standard classification techniques.

4.1 Obtaining the Posterior Probabilities.

4.1.1 k-NN Classifier. In the k-nearest-neighbor (k-NN) classifier, we can approximate the posterior for a class $\omega_i$ as
\[
P_{est}(\omega_i \mid x) = \frac{k_i}{k}, \tag{4.1}
\]
where $k_i$ is the number of samples belonging to class $\omega_i$ among the nearest $k$ samples. The estimated posterior becomes better as the value of $k$ increases. As a special case, $P_{est}$ may become zero for some particular classifier (e.g., out of the $k$ nearest-neighboring samples, the number of samples belonging to class $\omega_i$ is zero); then in equation 2.2, either the numerator or the denominator becomes zero. In the latter case, it is difficult to obtain the posterior probability. In order to handle such situations, we regularize the posterior as
\[
P_{est}(\omega_i \mid x) = \frac{\epsilon P(\omega_i) + k_i}{\epsilon + k}, \tag{4.2}
\]
where $\epsilon$ is a small regularizing constant and $P(\omega_i)$ is the a priori probability of class $\omega_i$ obtained from the training set available to the classifier. Thus, the
posterior estimated in equation 4.2 becomes equal to that in equation 4.1 when $k$ is large.

4.1.2 Decision Tree Classifiers. The posterior for a class in a decision tree can be estimated as
\[
P_{est}(\omega_j \mid x) = 1 - \text{impurity}, \tag{4.3}
\]
where impurity is the impurity at the leaf node, which measures the number of samples not belonging to the assigned class $\omega_j$ of that leaf node divided by the total number of samples corresponding to that particular leaf node. For a large decision tree (not properly pruned), we can obtain a zero impurity for every leaf node. In that case, we can regularize the measure of the estimated posterior in the same way as in the case of the k-NN classifier. However, since a properly pruned decision tree usually provides better generalization capability, the posterior is better estimated from a pruned decision tree.

4.1.3 Multilayered Perceptrons. The output of a multilayered perceptron with an appropriately chosen error and activation function approaches the posterior probability, provided that the network is sufficiently complex and the training can reach the global minimum (Haykin, 1999). Although these conditions are not always satisfied in practice, the adaptation of the basis functions in a multilayered perceptron leads to a better utilization of the available data than is possible with many other forms of estimation (Barron, 1993). In a multilayered perceptron having $n$ output nodes representing $n$ different class labels, the posterior of a sample $x$ for a class $\omega_i$ is given as
\[
P(\omega_i \mid x) = v_i, \tag{4.4}
\]
where $v_i \in [0, 1]$ is the output of the $i$th node.

4.2 Results. The method has been simulated in a MATLAB environment. The data sets that we considered are shown in Table 1; all are readily available from Merz and Murphy (1996). For each data set, we partitioned the feature set vertically (there can be overlap between the partitions) and assigned one partition to each classifier. Each classifier performs the classification based on a k-NN algorithm with $k = 10$. It is not necessary that each local classifier use the same classification paradigm, and the posteriors can be obtained by any other consistent method. We obtained the posterior probability using equation 4.2 with $\epsilon = 1$.

We partition the feature set in the following way. For every feature, we randomly decide the number of classifiers to which it should belong. The
Table 1: Data Sets Used to Obtain the Experimental Results.

Data Set                    |F|   Number of Classes   Training Data Size   Test Data Size
Cancer                      30    2                   456                  113
Pima                        8     2                   576                  192
Diagnostic Breast Cancer    30    2                   398                  171
Digit Recognition           64    10                  3823                 1797
Optical Digit               144   10                  5000                 5000
number of classifiers to which the feature should belong is decided as
\[
N_f \in [1, \alpha N], \tag{4.5}
\]
where $\alpha$ is a factor deciding the amount of overlap between the classifiers. We then randomly allocate the feature to $N_f$ different classifiers. We experimented with overlap factors of 0.4 and 0.6. Thus, in each run of the experiment, a partition is randomly generated with a given number of classifiers and a given amount of overlap.

In our experiments, we obtained the classification accuracy with the two overlap factors and with the feature set partitioned into different numbers of classifiers. In each case, we obtained the classification accuracy over independent runs (as described above) and report the average test accuracy along with the maximum and minimum accuracies. The classification accuracy obtained from a classifier having a complete view and that obtained from classifiers having partial views are illustrated in Figures 1 through 5. In these figures, we have plotted the average test accuracy and have also indicated the maximum and minimum test accuracy (vertical lines) obtained from five independent runs.

There are several interesting observations that can be made from the figures. First, in general there is a degradation in performance with an increasing number of partitions. However, comparing Figures 1 through 4 with Figure 5, we find that the degradation is much less with the Digit data set. The Digit data set has a much larger number of features than the other data sets. This can be explained to some extent by the fact that features in higher-dimensional spaces tend to be more correlated; a partition is therefore likely to contain a proportionally larger share of the classification information. It is also interesting to observe that in certain cases, the classification accuracy with a partitioned feature set is better than that with the entire feature set. This is due to the fact that we have used a k-NN algorithm for each classifier, and the performance of the k-NN algorithm is, in general, perturbed by the presence of noisy features.
In the partitioned case, if the number of redundant features in one classifier is smaller than the number of informative features, then the classification accuracy can be even better than if all features are taken into account.
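The experimental pipeline of this section (random overlapping partition, regularized k-NN posteriors, decision-level fusion) can be sketched compactly. The Python sketch below is our own minimal illustration on hypothetical toy data: it implements equations 4.5 and 4.2 but replaces the full combination rule of equation 2.12 with a plain product of posteriors, omitting the consistency test and the overlap-correction terms.

```python
import random
from collections import Counter

def random_partition(num_features, num_classifiers, alpha, rng):
    """Assign each feature to N_f in [1, alpha*N] randomly chosen
    classifiers (overlapping vertical partition, cf. equation 4.5)."""
    views = [[] for _ in range(num_classifiers)]
    max_copies = max(1, int(alpha * num_classifiers))
    for f in range(num_features):
        n_f = rng.randint(1, max_copies)
        for c in rng.sample(range(num_classifiers), n_f):
            views[c].append(f)
    return [v for v in views if v]          # drop classifiers with empty views

def knn_posterior(train_x, train_y, x, view, k, eps, priors):
    """Regularized k-NN posterior on a partial view (cf. equation 4.2)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((train_x[i][f] - x[f]) ** 2 for f in view))
    counts = Counter(train_y[i] for i in order[:k])
    return {w: (eps * priors[w] + counts.get(w, 0)) / (eps + k) for w in priors}

def fuse(posteriors, priors):
    """Product-of-posteriors fusion across classifiers (a simplified
    form of equation 2.12 that ignores the overlap corrections)."""
    score = dict(priors)
    for p in posteriors:
        for w in score:
            score[w] *= p[w] / priors[w]
    z = sum(score.values())
    return {w: s / z for w, s in score.items()}

# Toy data: class 0 clustered near the origin, class 1 near (5, 5, 5, 5).
rng = random.Random(0)
train_x = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(40)] + \
          [[rng.gauss(5, 1) for _ in range(4)] for _ in range(40)]
train_y = [0] * 40 + [1] * 40
priors = {0: 0.5, 1: 0.5}

views = random_partition(num_features=4, num_classifiers=3, alpha=0.6, rng=rng)
x = [0.2, -0.1, 0.3, 0.1]                   # clearly a class-0 pattern
posts = [knn_posterior(train_x, train_y, x, v, k=10, eps=1.0, priors=priors)
         for v in views]
fused = fuse(posts, priors)
label = max(fused, key=fused.get)
```

The product rule used in `fuse` is exact only for disjoint, class-conditionally independent views (the situation analyzed in section 3); with overlapping views, the intersection terms of equation 2.12 would also be needed.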
[Plot: percentage accuracy vs. no. of partitions]
Figure 1: Results obtained with the Cancer data set for α = 0.4 (top) and α = 0.6 (bottom). Maximum and minimum accuracy on the test data are indicated by vertical lines with an x at the extremities. The average performance is computed over five trials. Each individual classifier is a k-NN classifier with k = 25.
[Plot: percentage accuracy vs. no. of partitions]
Figure 2: Results obtained with the Pima data set for α = 0.4 (top) and α = 0.6 (bottom). Maximum and minimum accuracy on the test data are indicated by vertical lines with an x at the extremities. The average performance is computed over five trials. Each individual classifier is a k-NN classifier with k = 25.
[Plot: percentage accuracy vs. no. of partitions]
Figure 3: Results obtained with the Diagnostic Breast Cancer data set for α = 0.4 (top) and α = 0.6 (bottom). Maximum and minimum accuracy on the test data are indicated by vertical lines with an x at the extremities. The average performance is computed over five trials. Each individual classifier is a k-NN classifier with k = 25.
[Plot: percentage accuracy vs. no. of partitions]

Figure 4: Results obtained with the Optical Digit data set for α = 0.6. Maximum and minimum accuracy on the test data are indicated by vertical lines with an x at the extremities. The average performance is computed over five trials. Each individual classifier is a k-NN classifier with k = 10.
5 Discussion and Conclusion

In this article, we provided a model-free approach to combining the decisions made by local classifiers, each of which is constructed from a partial view of the feature set. Our formulation is based on the assumption that each local classifier is able to estimate the posterior probability for a given sample. We obtained a theoretical estimate of the error made in approximating the overall posterior probabilities from the posterior probabilities estimated from partial views. We observed that for a smaller number of features, the degradation in performance increases more rapidly with the number of partitions than it does for a larger number of features.

The proposed technique of decision-level fusion can be useful in many different situations in Web-based technologies, such as knowledge management, grid computing, e-commerce, and agent-based systems, where the number of attributes is very large. The proposed method can have an impact on reducing the complexity of classifier design; designing a classification system comprising local classifiers, each constructed from a partial view of the feature set; and handling the process of fusing the decisions without co-
[Plot: percentage accuracy vs. no. of partitions]
Figure 5: Results obtained with the Digit data set for α = 0.6. The performance is computed over a single trial. Each individual classifier is a k-NN classifier with k = 10.
locating the data (which may not be possible because of security or privacy issues). In combining the decisions, it is necessary to communicate the decisions or the posterior probabilities between the servers over the network. The communication or transfer of the decisions needs to be performed so that the total communication cost is minimized. Analysis of the communication and network protocol issues is an area for further study.

We combined the decisions in terms of the posterior probabilities. However, a similar algorithm can also be developed using other measures, such as uncertainty or fuzzy ambiguity (Klir & Folger, 1991). Different partitions of a data set provide different classification scores. A combined decision is best obtained when the classifiers are maximally independent, that is, when the approximation error in terms of the information loss (as derived in section 3) is minimum. This also constitutes a scope for future study, where it is required to determine the optimal partition under certain constraints (e.g., the minimum and maximum number of attributes that a processor should handle). The problem, in a sense, is related to the problem of feature subset selection (Devijver & Kittler, 1982).

We also assumed that the individual classifiers were unbiased. Interesting variations arise when this is not the case (either intentionally or unin-
tentionally). Such variations can be dealt with to some extent through the use of robust statistics (Rousseeuw, 1984; Hampel, Rousseeuw, Ronchetti, & Stahel, 1986).
References

Ahmad, S., & Tresp, V. (1993). Some solutions to the missing feature problem in vision. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems. San Mateo, CA: Morgan Kaufmann.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.
Behrouz, H. F., & Koono, Z. (1996). Ex-w-pert system: A web-based distributed expert system for groupware design. In J. K. Lee, J. Liebowitz, & Y. M. Chae (Eds.), Proc. of the Third World Congress on Expert Systems (pp. 545–552). Seoul, Korea: Scholium International.
Breiman, L. (1996). Bagging predictors. Machine Learning, 26(2), 123–140.
Caragea, D., Silvescu, A., & Honavar, V. (2001). Towards a theoretical framework for analysis and synthesis of agents that learn from distributed dynamic data sources. In S. Wermter, J. Austin, & D. Willshaw (Eds.), Emerging neural architectures based on neuroscience. Berlin: Springer-Verlag.
Cooley, R. W. (2000). Web usage mining: Discovery and application of interesting patterns from web data. Unpublished doctoral dissertation, University of Minnesota.
Devijver, P. A., & Kittler, J. (1982). Pattern recognition: A statistical approach. Upper Saddle River, NJ: Prentice Hall.
Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.
Hampel, F. R., Rousseeuw, P. J., Ronchetti, E. M., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Ho, T. K., & Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 289–300.
Jacobs, R. A., Jordan, M. I., & Nowlan, S. J. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4–37.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.
Klir, G. J., & Folger, T. A. (1991). Fuzzy sets, uncertainty and information. Upper Saddle River, NJ: Prentice Hall.
Merz, C. J., & Murphy, P. M. (1996). UCI repository of machine learning databases (Tech. Rep.). Irvine: Department of Information and Computer Science, University of California at Irvine. Available online at: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Nowlan, S. J., & Hinton, G. E. (1990). Evaluation of adaptive mixtures of competing experts. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 2 (pp. 774–780). San Mateo, CA: Morgan Kaufmann.
Nowlan, S. J., & Hinton, G. E. (1991). Evaluation of adaptive mixtures of competing experts. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems (pp. 774–780). San Mateo, CA: Morgan Kaufmann.
Nowlan, S. J., & Sejnowski, T. J. (1995). Filter selection model for generating visual motion signals. In C. L. Giles, S. J. Hansen, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 5. San Mateo, CA: Morgan Kaufmann.
Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871–880.

Received December 13, 2002; accepted January 7, 2004.
LETTER
Communicated by Hugh Wilson
A Coarse-to-Fine Disparity Energy Model with Both Phase-Shift and Position-Shift Receptive Field Mechanisms

Yuzhi Chen
[email protected]
Ning Qian
[email protected]
Center for Neurobiology and Behavior and Department of Physiology and Cellular Biophysics, Columbia University, New York, NY 10032, U.S.A.
Numerous studies suggest that the visual system uses both phase- and position-shift receptive field (RF) mechanisms for the processing of binocular disparity. Although the difference between these two mechanisms has been analyzed before, previous work mainly focused on disparity tuning curves instead of population responses. However, tuning curves and population responses can exhibit different characteristics, and it is the latter that determine disparity estimation. Here we demonstrate, in the framework of the disparity energy model, that for relatively small disparities, the population response generated by the phase-shift mechanism is more reliable than that generated by the position-shift mechanism. This is true over a wide range of parameters, including the RF orientation. Since the phase model has its own drawbacks of underestimating large stimulus disparities and covering only a restricted range of disparity at a given scale, we propose a coarse-to-fine algorithm for disparity computation with a hybrid of phase-shift and position-shift components. In this algorithm, disparity at each scale is always estimated by the phase-shift mechanism to take advantage of its higher reliability. Since the phase-based estimation is most accurate at the smallest scale when the disparity is correspondingly small, the algorithm iteratively reduces the input disparity from coarse to fine scales by introducing a constant position-shift component to all cells for a given location in order to offset the stimulus disparity at that location. The model also incorporates orientation pooling and spatial pooling to further enhance reliability. We have tested the algorithm on both synthetic and natural stereo images and found that it often performs better than a simple scale-averaging procedure.
1 Introduction

Position-shift and phase-shift (or phase-difference) models are two distinct mechanisms that have been proposed for describing receptive field (RF) profiles and disparity sensitivity of V1 binocular cells (Bishop & Pettigrew,

Neural Computation 16, 1545–1577 (2004) © 2004 Massachusetts Institute of Technology
1546
Y. Chen and N. Qian
1986; Ohzawa, DeAngelis, & Freeman, 1990; see Qian, 1997, for a review). The position-shift model assumes that the left and right RFs of a simple cell are always identical in shape but can be centered at different spatial locations, while according to the phase-shift model, the two RFs of a cell can have different shapes but are always tuned to the same spatial location. Recent studies indicate that both the phase- and position-shift models contribute to the representation of binocular disparity in the brain (Zhu & Qian, 1996; Ohzawa, DeAngelis, & Freeman, 1997; Anzai, Ohzawa, & Freeman, 1997, 1999a; Livingstone & Tsao, 1999; Prince, Cumming, & Parker, 2000) and that the phase-shift model appears to be more prominent, although position shift may also play a significant role at high spatial frequencies (Ohzawa et al., 1997; Anzai et al., 1997, 1999a). These findings are consistent with earlier psychophysical observations that there is a correlation between the perceived disparity limit and spatial frequency, as predicted by the phase model, but that the disparity range at high spatial frequencies is larger than what a purely phase-shift model allows (Schor & Wood, 1983; Smallman & MacLeod, 1994).

These studies raise at least two questions: What are the relative strengths and weaknesses of the phase- and position-shift mechanisms in disparity computation, and what is the advantage, if any, of having both mechanisms? We (Zhu & Qian, 1996; Qian & Zhu, 1997) and others (Smallman & MacLeod, 1994; Fleet, Wagner, & Heeger, 1996) have previously analyzed the similarities and differences between the two RF models. However, that work mainly focused on the disparity tuning properties of a given cell.
For the task of disparity estimation faced by the brain, it is the population responses of many cells to a given (unknown) disparity that are most relevant.¹ Tuning curves and population response curves have different shapes and properties in general; they are identical only under some special conditions (Teich & Qian, 2003). Qian and Zhu (1997) did compare the disparity maps computed from the population responses of the phase- and position-shift RF models, but that study did not examine the properties of the population responses in detail and was done under the condition where the difference between the two RF models is minimal (see section 2.1). In addition, previous studies have not explored how to properly combine the phase- and position-shift mechanisms in a stereo algorithm to achieve better results than either mechanism alone.

Another limitation of most previous studies on disparity computation is the exclusion of orientation tuning. Indeed, with a few exceptions (Mikaelian
¹ As we will detail in section 2, for a set of cells with a range of preferred disparities, a population response curve is obtained by plotting each cell's response to a fixed stimulus disparity against the cell's preferred disparity. The stimulus disparity can be estimated from either the peak or the mean of the population response curve. In this article, we use the peak location. In contrast, one cannot estimate stimulus disparity directly from a disparity tuning curve (Qian, 1997).
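The distinction drawn in this footnote can be made concrete with a toy readout. In the sketch below (our own illustration with a hypothetical Gaussian response profile, not the energy model of section 2), a tuning curve fixes one cell and varies the stimulus disparity, a population response curve fixes the stimulus and varies the cells, and the stimulus disparity is read out from the population peak:

```python
import math

def response(d_pref, d_stim, sigma=0.5):
    # Hypothetical unimodal response profile: strongest when the stimulus
    # disparity matches the cell's preferred disparity (not the energy model).
    return math.exp(-((d_stim - d_pref) ** 2) / (2 * sigma ** 2))

disparities = [i * 0.1 - 1.0 for i in range(21)]   # sample grid, -1.0 .. 1.0

# Tuning curve: fix one cell (preferred disparity 0.3), vary the stimulus.
tuning = [(d_stim, response(0.3, d_stim)) for d_stim in disparities]

# Population response: fix the stimulus (D = 0.3), vary the cells' preferred
# disparities, which serve as the stimulus-independent abscissa.
population = [(d_pref, response(d_pref, 0.3)) for d_pref in disparities]

# The stimulus disparity is estimated from the peak of the population response.
d_est = max(population, key=lambda pair: pair[1])[0]
```

Here the two curves happen to coincide because the assumed profile is symmetric; as the text notes, they differ in general, and only the population curve supports direct disparity estimation.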
& Qian, 1997, 2000; Matthews, Meng, Xu, & Qian, 2003), most previous computational studies of the disparity energy model employed one-dimensional (1D) Gabor filters (Qian, 1994; Zhu & Qian, 1996; Fleet et al., 1996; Qian & Zhu, 1997; Tsai & Victor, 2003) instead of the two-dimensional (2D), oriented filters found in the visual cortex.² Of course, a 1D algorithm can be readily extended to 2D by replacing 1D Gabor filters with 2D Gabor filters. However, without the additional refinements presented in this article, the performance of the resulting 2D algorithm is much worse than that of the corresponding 1D algorithm. The reason is that at depth boundaries, 2D filters tend to mix up regions of different disparities more than 1D filters do (Chen & Qian, unpublished observations).

In this article, we analyze the differences between the phase-shift and position-shift RF models in terms of both disparity tuning curves and population responses. We consider stimuli with and without orientations and examine various offsets between the preferred orientation of the cells and the stimulus orientation. We find that although the two RF models generate disparity tuning curves of similar quality, the phase-shift mechanism provides more reliable population responses than the position-shift mechanism does for relatively small stimulus disparities. Here, reliability is measured by the standard deviation of the peak location of the tuning curve or population response curve when certain stimulus details unrelated to disparity (such as the lateral position of a bar or the dot distribution in a random dot pattern) are varied.
Based on these and other considerations, and on our previous finding that phase-shift-based disparity computation is accurate only when the stimulus disparity is considerably smaller than the RF size (Qian, 1994; Qian & Zhu, 1997), we propose an iterative algorithm that employs both the phase- and position-shift mechanisms in a specific way, and we demonstrate that the hybrid mechanism is more reliable and accurate than either mechanism alone. After incorporating pooling across spatial location, orientation, and spatial scale, we present a coarse-to-fine model for disparity computation and demonstrate its effectiveness on both synthetic and natural stereograms. We also compare this coarse-to-fine algorithm with a simple scale-averaging procedure applied to the population responses of the position-shift mechanism.

2 Analyses and Simulations

2.1 Phase-Shift vs. Position-Shift RF Models. The details of the disparity energy model used in this work have been described previously (Ohzawa et al., 1990; Qian, 1994). Briefly, according to quantitative physiological studies, the spatial RFs of a binocular simple cell can be well described
² Some studies mentioned orientation pooling but did not really address the issue due to the 1D filters used.
by 2D Gabor functions (Daugman, 1985; Jones & Palmer, 1987; Ohzawa et al., 1990). The position-shift and phase-shift RF models can be expressed as an overall positional difference between the left-eye and right-eye Gabor functions and as a phase difference between the sinusoidal modulations of the Gabor functions, respectively. We first consider a vertically oriented Gabor function centered at the origin:
\[
g(x, y; \phi) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left(-\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2}\right) \cos(\omega x + \phi), \tag{2.1}
\]
where $\omega$ is the preferred spatial frequency, $\sigma_x$ and $\sigma_y$ determine the x and y RF dimensions, and $\phi$ is the phase parameter of the sinusoidal modulation. (Other preferred orientations and RF centers can be obtained by rotating and translating this function, respectively.) The phase-shift model (Ohzawa et al., 1990; DeAngelis, Ohzawa, & Freeman, 1991; Ohzawa, DeAngelis, & Freeman, 1996) posits that the left and right RFs of a simple cell are expressed as
\[
g_l^{pha}(x, y) = g(x, y; \phi) \tag{2.2}
\]
\[
g_r^{pha}(x, y) = g(x, y; \phi + \Delta\phi). \tag{2.3}
\]
Thus, the two RFs have the same position (both centered at the origin), but there is a phase difference $\Delta\phi$ between their sinusoidal modulations. In contrast, the position-shift model assumes that the left and right RFs take the form
\[
g_l^{pos}(x, y) = g(x, y; \phi) \tag{2.4}
\]
\[
g_r^{pos}(x, y) = g(x + d, y; \phi). \tag{2.5}
\]
These two RFs have identical shape, but there is an overall shift $d$ between their horizontal positions. Using these RF profiles, one can compute simple and complex cell responses based on the disparity energy model (Ohzawa et al., 1990; Qian, 1994). We focus on complex cell responses here because simple cell responses are much less reliable due to their stronger dependence on the Fourier phases of the stimuli (Ohzawa et al., 1990; Qian, 1994; Zhu & Qian, 1996; Qian & Zhu, 1997; Chen, Wang, & Qian, 2001). It can be shown (see the appendix) that the response of a complex cell (constructed from a quadrature pair of simple cells) to a stereo image patch of disparity $D$ can be written as
\[
r_q^{pha} \approx 4A^2 \cos^2\left(\frac{\omega D - \Delta\phi}{2}\right) + 4AB\,\frac{D}{\sigma_x}\cos\left(\frac{\omega D - \Delta\phi}{2}\right)\cos\left(\alpha - \beta + \frac{\omega D - \Delta\phi}{2}\right) \tag{2.6}
\]
\[
r_q^{pos} \approx 4A^2 \cos^2\left(\frac{\omega(D - d)}{2}\right) + 4AB\,\frac{D - d}{\sigma_x}\cos\left(\frac{\omega(D - d)}{2}\right)\cos\left(\alpha - \beta + \frac{\omega(D - d)}{2}\right) \tag{2.7}
\]
for the phase- and position-shift RF models, respectively. Here $A$ and $\alpha$ are the local Fourier amplitude and phase (evaluated at the RF's preferred frequency) of the stimulus patch filtered by the RF gaussian envelope; $B$ and $\beta$ are the corresponding amplitude and phase of the stimulus patch filtered by the first-order derivative of the RF gaussian envelope (see the appendix). The phases $\alpha$ and $\beta$ are more dependent on the detailed luminance distribution of the stimulus than the amplitudes $A$ and $B$. The two terms in each of equations 2.6 and 2.7 are, respectively, the zeroth- and first-order terms in $D/\sigma_x$ or $(D - d)/\sigma_x$.

Before exploring the implications of the above expressions, we need to define the disparity tuning curve and the population response curve explicitly. For a given cell with fixed RF parameters, if we vary the stimulus disparity $D$ and plot the response of the cell as a function of $D$, we obtain a disparity tuning curve. The preferred disparity of the cell is the stimulus disparity that generates the peak response in the tuning curve. For a fixed stimulus disparity $D$, we can consider a set of cells that prefer different disparities (e.g., have different $\Delta\phi$ or $d$ parameters; see below) but are otherwise identical. If we plot the responses of these cells to the same $D$ against their preferred disparities, we get a population response curve. A complication is that a cell's preferred disparity depends not only on its intrinsic RF parameters but also on the stimulus to some degree (Poggio, Gonzalez, & Krause, 1988; Zhu & Qian, 1996; Chen et al., 2001). The abscissa of the population response curve, however, should not depend on any stimulus parameters, as these parameters are assumed to be unknown during disparity computation. In other words, the abscissa should be a function only of the intrinsic RF parameters that uniquely label each cell in the population.
We will therefore use an intrinsic parameter (or a combination of intrinsic parameters) as the abscissa that approximates the preferred disparity of each cell. To do so, note that in equations 2.6 and 2.7, the first term (zeroth order) is usually larger than the second term (first order). If we keep only the first term, as we did in some of our previous work (Qian, 1994; Zhu & Qian, 1996; Qian & Zhu, 1997), the preferred disparity of a cell is a function of its intrinsic RF parameters only, given by

\[
D_{pref}^{pha} \approx \frac{\Delta\phi}{\omega} \tag{2.8}
\]
\[
D_{pref}^{pos} \approx d \tag{2.9}
\]

for the phase- and position-shift models, respectively. We will therefore use Δφ/ω or d to label different cells along the abscissa of population response
Y. Chen and N. Qian
curves. Also note that if we keep only the first terms, the stimulus disparity D can be estimated from the preferred disparity of the most active cell (denoted by an asterisk) of the population response curve according to

\[
D_{est}^{pha} \approx \frac{\Delta\phi^*}{\omega} \tag{2.10}
\]
\[
D_{est}^{pos} \approx d^* \tag{2.11}
\]
for the phase- and position-shift models, respectively (Qian, 1994; Zhu & Qian, 1996; Qian & Zhu, 1997). We will use the same equations for disparity estimation in this article as we did previously. However, the more accurate analyses done here (the second terms in equations 2.6 and 2.7) will allow us to determine the conditions under which equations 2.8 to 2.11 are good approximations, and will lead to a better hybrid model with both phase- and position-shift mechanisms. (Footnote 3: Alternatively, one could solve D from equation 2.6 or 2.7, or use those equations as templates to estimate D. The resulting method, however, would involve complex procedures that are unlikely to be implemented by visual cells.)

We can now consider the properties of disparity tuning curves and population response curves generated with the phase- and position-shift RF models. We are particularly interested in whether and how much these curves depend on stimulus details other than disparity (such as the lateral position of a bar or the dot distribution in a random dot pattern). Such dependence is clearly undesirable, as it will make disparity estimation unreliable.

We first consider disparity tuning curves. According to the definitions in the appendix, α and β in equations 2.6 and 2.7 strongly depend on the detailed luminance profile of the stimulus. If the second terms in the equations can be neglected, then the first term will be scaled only by A. Consequently, the shape of the tuning curves will be largely independent of the stimulus details, and the peak locations will accurately follow equations 2.8 and 2.9 for the phase- and position-shift models, respectively. If the second terms are not very small, however, the tuning curves of a cell will change with stimulus details, and the peak locations may deviate significantly from equations 2.8 and 2.9. Since the stimulus disparity D has to vary over a wide range to generate a tuning curve, D/σ_x or (D − d)/σ_x cannot always be small, and thus the second terms cannot always be neglected.

Therefore, complex-cell tuning curves based on either the position-shift or the phase-shift RF model must be somewhat unreliable, albeit much more reliable than simple-cell tuning curves (Qian, 1994; Zhu & Qian, 1996; Chen et al., 2001). The situation is different for population responses, however. Here the stimulus disparity D is fixed while the cell parameter Δφ or d varies over a wide range. For the phase-shift model, if D is fixed at a value much smaller than σ_x, then the second term is negligible over the entire range of Δφ because Δφ
appears only inside the bounded cosine functions. In this case, equation 2.10 is well satisfied at the peak location of the population response curve and can be used to estimate D accurately regardless of the other details of the stimulus. Therefore, population responses generated with the phase-shift RF model should be highly reliable when the stimulus disparity is small compared with the RF size. In contrast, population responses generated with the position-shift model can never be very reliable. The reason is that the cells' position-shift parameter d is present both inside and outside the cosine functions in equation 2.7. As d varies over a wide range, the second term cannot always be small for any fixed value of D. Intuitively, when d is significantly different from D, the image patches covered by the left and right RFs will be very different, and this difference will introduce large variability into the population response curve.

We therefore conclude that among the four cases of disparity tuning curve and population response curve generated with the phase- and position-shift RF mechanisms, only the population response curves from the phase-shift mechanism have highly reliable peak locations for small stimulus disparities. In all other cases, the results will in general vary with stimulus details unrelated to disparity. We have performed extensive computer simulations to confirm this conclusion. Two examples are shown in Figures 1 and 2 for bar and random dot stimuli, respectively. We did not include spatial pooling (Zhu & Qian, 1996; Qian & Zhu, 1997) in these simulations, in order to see the difference between the two RF models more clearly. The difference is reduced (but does not vanish) when spatial or orientation pooling is introduced.
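This asymmetry between the two mechanisms can be checked numerically. The sketch below is an illustrative simulation (not the paper's actual stimulus code): it draws random stimulus phases γ = α − β, builds population response curves from the two-term expressions of equations 2.6 and 2.7, and compares the scatter of the peak locations; the amplitudes A, B and the grids are assumptions.

```python
import math
import random
import statistics

OMEGA, SIGMA_X, A, B = 2 * math.pi / 16, 8.0, 1.0, 0.5

def phase_peak(D, gamma):
    # Eq. 2.6: vary the cell parameter dphi; abscissa is x = dphi/omega (pixels).
    xs = [0.25 * k for k in range(-32, 33)]
    def resp(x):
        u = (OMEGA * D - OMEGA * x) / 2
        return (4 * A**2 * math.cos(u)**2
                + 4 * A * B * (D / SIGMA_X) * math.cos(u) * math.cos(gamma + u))
    return max(xs, key=resp)

def pos_peak(D, gamma):
    # Eq. 2.7: vary the cell parameter d; abscissa is d itself (pixels).
    xs = [0.25 * k for k in range(-32, 33)]
    def resp(d):
        u = OMEGA * (D - d) / 2
        return (4 * A**2 * math.cos(u)**2
                + 4 * A * B * ((D - d) / SIGMA_X) * math.cos(u) * math.cos(gamma + u))
    return max(xs, key=resp)

random.seed(0)
gammas = [random.uniform(-math.pi, math.pi) for _ in range(200)]
D = 1.0  # small stimulus disparity
s_pha = statistics.pstdev([phase_peak(D, g) for g in gammas])
s_pos = statistics.pstdev([pos_peak(D, g) for g in gammas])
```

With these parameters, the phase-model peaks stay pinned near D (the second term is uniformly scaled by the small factor D/σ_x), while the position-model peaks wander with the stimulus phase, mirroring the conclusion above.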
Figure 1 shows simulated disparity tuning curves (of a given complex cell in response to a range of stimulus disparities) and population response curves (of an array of model complex cells to a given stimulus disparity) for both phase- and position-shift RF mechanisms. The orientation of both the RFs and the stimuli (bars) was vertical in these simulations. To measure reliability in each case, we simulated 1000 disparity tuning curves or 1000 population response curves by randomly varying the lateral bar position while keeping all the other parameters constant (see the figure caption for details). For each bar position, a given stimulus disparity was introduced by shifting the left and right eyes' images in opposite directions by half the disparity value. For clarity, only 30 tuning curves and 30 population curves are shown in panels a and b, respectively, but the peak location histograms in panels c and d are compiled from all 1000 simulations. The numbers inside each histogram panel are the mean (m) and standard deviation (s) of the distribution. The vertical lines in the tuning curve panels indicate the cell's preferred disparity as determined by its parameters according to equation 2.8 (for phase shift) or equation 2.9 (for position shift). The vertical lines in the population response panels indicate the stimulus disparity. For tuning curves, small or large disparity refers to the cell's preferred disparity (1 or 5 pixels). For population response curves, small or large disparity refers to the stimulus disparity (also 1 or 5 pixels).
It should be clear from the figure that the disparity tuning curves (panels a and c) are not very reliable in all cases: the peak location distributions all show significant spread, indicating some dependence of the preferred disparity on the lateral bar position. For the population responses, the same unreliability holds for the position-shift model (panels b and d in Figures 1B and 1D). In contrast, the population response curves of the phase-shift model to small stimulus disparity (panels b and d in Figure 1A) are both reliable (the peaks from 1000 simulations are well aligned) and accurate (the peak location agrees with the actual stimulus disparity). These results are consistent with our analysis above. The population response of the phase-shift model to large stimulus disparity (panels b and d in Figure 1C) is also reliable. As we will show in Figure 2, this is not generally true, but happens to be so for the bar stimuli because the α and β parameters in equation 2.6 approximately cancel each other (see the appendix) and the two terms of the equation can be combined. Note, however, that in this case, the peak location is not accurate, as it underestimates the stimulus disparity. This underestimation is due to a zero-disparity bias of the phase-shift model demonstrated previously (Qian & Zhu, 1997), and it grows with stimulus disparity size. If equation 2.6 is expanded up to the second-order term, then the zero-disparity bias of the phase-shift model becomes apparent (results not shown). Also note that the curves in Figure 1 usually have side peaks (Qian, 1994; Zhu & Qian, 1996) outside the range plotted here.
[Figure 1 appears here. Panels A–D show, for small and large disparity under the phase- and position-shift RF models, tuning curves (a), population responses (b), and the corresponding peak-location frequency histograms (c, d) with their means m and standard deviations s; abscissae are D, Δφ/ω, or d in pixels.]
We also repeated the above simulations with random dot stereograms, and the results are shown in Figure 2 in the same format as Figure 1. The disparity tuning curves and population response curves were simulated with 1000 sets of random dot stereograms that all contain the same disparity values but different dot patterns. These curves are more variable than those in Figure 1, as reflected by the larger standard deviations of all the histograms, due to the stochastic nature of random dots. Nevertheless, Figure 2 clearly shows that the population response of the phase-shift model to small disparity (panels b and d in Figure 2A) is much more reliable and accurate than all the other cases, consistent with the results for bars in Figure 1 and with the prediction of our analysis. Note that in the large disparity case, both phase-shift and position-shift mechanisms show similarly unreliable population responses. This explains why Qian and Zhu (1997) found that the two RF mechanisms generate similar results; that study was done under the large disparity condition (D/σ_x ≥ 0.5).

Figure 1: Facing page. Disparity tuning and population response curves of model complex cells with the phase- and position-shift RFs in response to bar stereograms. Four cases are considered: (A) small disparity with the phase-shift RFs; (B) small disparity with the position-shift RFs; (C) large disparity with the phase-shift RFs; and (D) large disparity with the position-shift RFs. For the tuning curves of a given cell to a range of stimulus disparities, small or large disparity refers to the cell's preferred disparity of 1 or 5 pixels (indicated by the vertical lines). For population response curves from an array of cells to a given stimulus disparity, small or large disparity refers to a stimulus disparity of 1 or 5 pixels (also indicated by the vertical lines). We use Δφ/ω and d as approximate measures of the cells' preferred disparities in the population response plots (see equations 2.8 and 2.9). To test the reliability of tuning curves and population responses, 1000 curves were obtained for each case by randomly varying the bar's lateral position in the cells' RF with subpixel interpolation. For the tuning curves, the bar's disparity varied from −8 to 8 pixels in steps of 1 pixel. For the population response curves, a group of model complex cells whose preferred disparities varied from −8 to 8 pixels in steps of 1 pixel was used. For clarity, only 30 tuning or population response curves are shown in panels a or b, and they are normalized by the strongest response. The peak-location distribution histogram in panels c or d was compiled from all 1000 curves. Each peak location was computed by a parabolic fit of three points around the peak. The numbers in the histograms represent the mean peak location m and the standard deviation s, respectively. Parameters: In each stereogram, the bar's lateral position was confined within ±8 pixels from the center of the RF. The width and height of the bar were 8 and 97 pixels, respectively. Both RFs and bars were oriented vertically. The RF parameters of model complex cells were ω/2π = 1/16 cycle per pixel, σ_x = 8 pixels, σ_y = 16 pixels. The RFs were computed in a 2D region of 49 × 97 pixels. The bin size in each histogram was 0.5 pixel.
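The subpixel peak localization mentioned in the caption can be sketched as a three-point parabolic fit, a standard interpolation technique; the implementation below is our own sketch, not the paper's code.

```python
def parabolic_peak(xs, ys):
    # Fit a parabola through the discrete argmax and its two neighbors,
    # and return the abscissa of the parabola's vertex (subpixel peak).
    i = max(range(1, len(ys) - 1), key=lambda k: ys[k])
    y0, y1, y2 = ys[i - 1], ys[i], ys[i + 1]
    dx = xs[1] - xs[0]                       # assume a uniform grid
    offset = 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2)
    return xs[i] + offset * dx

# Example: samples of a parabola whose true peak lies at x = 1.2,
# between the grid points x = 1 and x = 2.
xs = [float(x) for x in range(-3, 4)]
ys = [-(x - 1.2) ** 2 for x in xs]
peak = parabolic_peak(xs, ys)
```

For exactly parabolic data the fit recovers the true peak; for the cos²-shaped curves in the text it gives a good local approximation near the maximum.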
[Figure 2 appears here, in the same format as Figure 1 but for random dot stereograms: tuning curves, population responses, and peak-location histograms (mean m, standard deviation s) for the phase- and position-shift RF models at small and large disparities.]
Figure 2: Disparity tuning and population response curves of model complex cells with the phase- and position-shift RFs in response to random dot stereograms. The size of the random dot stereograms is 49 × 97 pixels. The dot size and density are 2 × 2 pixels and 50%, respectively. All RF parameters and the presentation format are identical to those in Figure 1.
2.2 An Iterative Hybrid Algorithm for Disparity Estimation. We have shown above that for the phase-shift model and small stimulus disparity (relative to the cells' RF size), the peak location of the population response curve provides a both reliable and accurate estimate of the stimulus disparity, while the position-shift model is always less reliable in comparison. However, the phase-shift model has its own limitations. First, for cells with preferred horizontal spatial frequency ω, the range of disparity they can detect is confined between −π/ω and π/ω (Qian, 1994) due to the periodicity of the population response as a function of Δφ, which in turn is due to the periodicity of the Gabor RFs as a function of φ. Any disparity beyond this range cannot be correctly detected. Second, for stimulus disparity within the range, the reliability of the population responses gets worse as the disparity approaches the limits of the range (compare Figure 2Cd with Figure 2Ad). The accuracy also decreases with increasing disparity magnitude because the underestimation caused by the zero-disparity bias increases.
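The (−π/ω, π/ω) limit can be seen directly from the periodicity of the zeroth-order population readout. In the toy sketch below (the parameter values and grid are illustrative assumptions), a disparity inside the range is recovered, while one outside the range is aliased by 2π/ω:

```python
import math

def phase_readout(D, omega, n=129):
    # Peak of the zeroth-order phase-shift population response,
    # r(dphi) ~ cos^2((omega*D - dphi)/2), with dphi sampled over [-pi, pi].
    dphis = [-math.pi + 2 * math.pi * k / (n - 1) for k in range(n)]
    best = max(dphis, key=lambda p: math.cos((omega * D - p) / 2) ** 2)
    return best / omega  # readout of eq. 2.10

omega = 2 * math.pi / 16              # detectable range: (-8, 8] pixels
inside = phase_readout(5.0, omega)    # recovered correctly
outside = phase_readout(10.0, omega)  # aliased by 2*pi/omega: 10 - 16 = -6
```

The wrap-around for `outside` is exactly the failure mode that the hybrid algorithm of this section, and later the coarse-to-fine scheme, is designed to avoid.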
Since there is ample evidence indicating that both the phase- and position-shift mechanisms are involved in disparity processing (Schor & Wood, 1983; Smallman & MacLeod, 1994; Zhu & Qian, 1996; Anzai et al., 1997, 1999a; Livingstone & Tsao, 1999; Prince et al., 2000), it is natural to consider whether a hybrid of the two RF models could provide a better solution. Similar to the derivations of equations 2.6 and 2.7, the response of a complex cell with both a position shift d and a phase shift Δφ between the two eyes can be written approximately as

\[
r_q^{hyb} \approx 4A^2 \cos^2\!\left(\frac{\omega(D-d)-\Delta\phi}{2}\right) + 4AB\,\frac{D-d}{\sigma_x}\,\cos\!\left(\frac{\omega(D-d)-\Delta\phi}{2}\right)\cos\!\left(\alpha-\beta+\frac{\omega(D-d)-\Delta\phi}{2}\right) \tag{2.12}
\]
If we only consider the first term in equation 2.12, the preferred disparity of the cell is given by

\[
D_{pref}^{hyb} \approx \frac{\Delta\phi}{\omega} + d. \tag{2.13}
\]
Since (D − d) appears outside the cosine functions in the second term of equation 2.12, just as in equation 2.7, whenever the position-shift parameter d is varied over a range for a population response curve, the computed disparity will not be reliable. Therefore, one should always rely on the phase difference Δφ for disparity computation, and d should be kept a constant close to D for all cells used in a given population response curve. The best scenario occurs if d happens to be equal to D, since the second term of equation 2.12 will then vanish and the residual disparity (D − d) for the phase mechanism to estimate will be zero; this is when the phase model is most accurate. These considerations lead us to the following iterative algorithm with a hybrid of both phase- and position-shift RFs. For each image location, we start (iteration 0) with a population of cells all with d = 0 and with Δφ covering the full range from −π to π, and apply the phase mechanism to get an estimate D_0 of the stimulus disparity D as we did before (Qian, 1994). Next, we use a set of cells all with the fixed d = D_0 and the full range of Δφ, and apply the phase mechanism again to get a new estimate D_1 (iteration 1). Since the original stimulus disparity D has been offset by the constant position shift d = D_0 of all the cells, D_1 is a measure of the residual disparity D − d = D − D_0. Thus, the estimated stimulus disparity D_est from the first iteration is D_0 + D_1. This process can be repeated such that at the nth iteration, cells will all have the same position shift d = D_0 + D_1 + ··· + D_{n−1}, and the newly computed D_n from the phase mechanism can be added to d to form
[Figure 3 appears here. (A) Flowchart of the iterative hybrid algorithm: initialize the position shift d = 0; estimate the disparity D_est from the population responses of complex cells with the fixed d but the full range of Δφ; reset d = D_est and repeat. (B) Performance histograms for iterations 0 to 4, with maximum histogram values 0.190, 0.646, 0.774, 0.832, and 0.850, means m ≈ 4.1–4.3 pixels, and standard deviations s ≈ 2.4–2.7 pixels.]
Figure 3: Iterative hybrid algorithm. (A) Flowchart of the algorithm. For a given stimulus, first initialize the position shift d to zero, and compute the population responses of a set of complex cells all with the same d but the full range of the phase shift Δφ from −π to π. The peak location of the population responses is extracted as the estimated disparity D_est according to equation 2.13. Then reset the position shift d = D_est, and repeat the above procedure until a stable disparity value is obtained. (B) The performance histograms for the first five iterations of the algorithm, compiled from the same 1000 random dot stereograms used in Figure 2Cd. The format of the histograms is the same as that of Figure 2Cd except that here, the maximum value of each distribution is shown above the peak in each panel. All parameters but d are the same as those used in Figure 2Cd. To simplify the simulations, we generated in advance 17 populations of cells with fixed d's from −8 to 8 pixels in steps of 1 pixel. At each iteration, we selected the population whose fixed d was closest to the estimated disparity D_est in the previous iteration.
the current estimate of disparity, D_est = D_0 + D_1 + ··· + D_n. This algorithm is shown schematically in Figure 3A. Note that at each iteration, the (residual) disparity is always computed with the phase mechanism. The position-shift parameter d is simply introduced to offset the stimulus disparity D so that the phase mechanism operates on a progressively smaller (residual) disparity and therefore becomes more and more reliable and accurate. A demonstration of this algorithm is presented in Figure 3B. We applied the method to the same 1000 random dot stereograms (all with D = 5 pixels) as in Figure 2C. Figure 3B shows the distribution histograms of the estimated disparity up to the fourth iteration; little change occurs afterward. Since d equals 0 in the 0th iteration (far left panel), its histogram is identical to that in Figure 2Cd, with a broad distribution of estimated disparity around
the true disparity (vertical line). The situation quickly improved with a few iterations. The fraction of estimations falling within a bin of 0.5 pixel width around the true disparity increased from 19% to 85% in four iterations. This result suggests that the iterative hybrid algorithm converges quickly and that the estimation is much more accurate and reliable than a pure phase- or position-shift mechanism alone.

It should be noted in Figure 3B that at each iteration, the estimated disparity always has some very small probability of being far away from the true stimulus disparity (note the negative-disparity tail barely visible in the histograms). A closer examination reveals that if an estimation is far from the true disparity at the start (e.g., has the wrong sign), subsequent iterations usually cannot correct the error. The reason is that if the estimated disparity is very wrong, the fixed d introduced in the next step may make the residual disparity larger than the original disparity, and the phase mechanism will become less reliable and find it harder to recover from the error. We will show that one way to reduce the occurrence of such runaway behavior is to pool across space, orientation, and scale.

2.3 Pooling. Pooling information from different sources can often improve performance. We (Zhu & Qian, 1996; Qian & Zhu, 1997) have previously demonstrated that spatial pooling of quadrature-pair responses within a small neighborhood can significantly improve the quality of computed disparity maps. Here we focus on orientation pooling and spatial frequency (i.e., scale) pooling.

2.3.1 Orientation Pooling. Before considering orientation pooling, we first examine how the disparity population responses of model complex cells depend on their preferred orientations.
To generate RFs with orientation θ (measured from the positive horizontal axis), one can rotate the corresponding vertically oriented RF (equation 2.1) by θ − 90° with respect to the RF center (a positive angle means counterclockwise rotation). For the phase-shift RF model, the left and right RF centers are the same. Therefore, the response of a complex cell with preferred orientation θ to any stimulus is equal to the response of the corresponding vertically oriented complex cell to the stimulus rotated by an angle of 90° − θ. If the original stimulus has a horizontal disparity D, the rotated stimulus then has components D sin θ and D cos θ orthogonal and parallel to the RF orientation, respectively. Since the RF profile along the preferred orientation changes much more slowly than that along the orthogonal axis, the parallel component D cos θ can be ignored, and the complex cell response with the phase-shift mechanism and preferred orientation θ can be obtained approximately by replacing D in equation 2.6 by D sin θ:

\[
r_q^{pha}(\theta) \approx 4A'^2 \cos^2\!\left(\frac{\omega D\sin\theta-\Delta\phi}{2}\right) + 4A'B'\,\frac{D\sin\theta}{\sigma_\perp}\,\cos\!\left(\frac{\omega D\sin\theta-\Delta\phi}{2}\right)\cos\!\left(\alpha'-\beta'+\frac{\omega D\sin\theta-\Delta\phi}{2}\right), \tag{2.14}
\]
where A′, B′, α′, and β′ are similar to A, B, α, and β in equations 2.6 and 2.7 except that they refer to the stimulus rotated by an angle of 90° − θ. Here, σ_⊥ represents the gaussian width of the RF in the direction perpendicular to the cell's preferred axis; it equals σ_x for vertically oriented RFs. For the position-shift mechanism, the left and right RF centers are not the same in general. Therefore, the rotational equivalence mentioned above has to be applied to the left and right RF responses separately before binocular combination. The final result is that the complex cell response with the position-shift mechanism and preferred orientation θ can be obtained approximately by replacing D − d in equation 2.7 by (D − d) sin θ:

\[
r_q^{pos}(\theta) \approx 4A'^2 \cos^2\!\left(\frac{\omega(D-d)\sin\theta}{2}\right) + 4A'B'\,\frac{(D-d)\sin\theta}{\sigma_\perp}\,\cos\!\left(\frac{\omega(D-d)\sin\theta}{2}\right)\cos\!\left(\alpha'-\beta'+\frac{\omega(D-d)\sin\theta}{2}\right). \tag{2.15}
\]
This result depends on the assumption that the parallel component (D − d) cos θ can be ignored. Since d has to vary over a large range for a population response curve, equation 2.15 is not as good an approximation as equation 2.14. For vertical RF orientation, θ equals 90 degrees, and equations 2.14 and 2.15 reduce to equations 2.6 and 2.7, respectively. Similar to the discussion following equations 2.6 and 2.7, equations 2.14 and 2.15 also indicate that only the population response of the phase-shift mechanism to small stimulus disparity D (compared with the RF size) can be reliable and accurate. A new feature is that the sin θ factor makes the tuning curves and population response curves broader as the RF orientation θ deviates further from vertical. In the extreme case of horizontal orientation (θ = 0), the curves become flat, with infinite width. For phase-shift RFs with orientation θ, the preferred disparity expression should be generalized from equation 2.8 to

\[
D_{pref}^{pha} \approx \frac{\Delta\phi}{\omega\sin\theta} \equiv \frac{\Delta\phi}{\omega_x}, \tag{2.16}
\]

and the detectable disparity range becomes (−π/(ω sin θ), π/(ω sin θ)) or (−π/ω_x, π/ω_x), where ω_x = ω sin θ (Qian, 1994; Mikaelian & Qian, 2000;
Matthews, Meng, Zu, & Qian, 2003). This range is smallest for the vertically oriented RFs and increases when the RF orientation θ deviates from vertical. Here ω is the preferred spatial frequency along the axis perpendicular to the RF orientation, and ω_x is the preferred spatial frequency along the horizontal axis; they equal each other when the RF is vertically oriented. For the position-shift RFs, equation 2.9 remains the same, since d is assumed to be a horizontal shift regardless of the RF orientation. (If d is instead assumed to be the shift orthogonal to the RF orientation, then equation 2.9 becomes D_pref^pos ≈ d/sin θ.)

The simulated disparity population responses and the peak-location histograms of complex cells are shown in Figure 4 for a bar with a small horizontal disparity of 1 pixel (marked by the vertical line in each panel) and an orientation of 67.5 degrees. Eight RF orientations evenly distributed over the entire 180-degree range were considered. Since the complex cells with horizontal orientation are not sensitive to horizontal disparity and generate flat curves, we present only simulations from the seven nonhorizontal preferred orientations. The results for the phase- and position-shift RF models are shown in Figures 4A and 4B, respectively. For all seven preferred orientations, the population responses with the phase-shift mechanism are more reliable than those with the position-shift mechanism. For the phase-shift mechanism (see Figure 4A), the peak location of the population response depends on the difference between the RF and bar orientations. Only when the RF orientation matches the bar orientation (67.5 degrees) does the peak location agree with the actual stimulus disparity. Otherwise, the peak location underestimates the stimulus disparity magnitude. In particular, when the preferred orientation is 157.5 degrees, orthogonal to the bar orientation, the peak locates at zero disparity.
As predicted by the analysis, the width of the curves in Figure 4 varies with the RF orientation. (Footnote 4: However, for the position-shift RF model (see Figure 4B), the sharpest response curves occur when the cells' orientation matches the bar orientation (67.5 degrees) instead of when it is vertical (90 degrees). This is not predicted by the approximate result of equation 2.15.) The maximum response in each panel (max) is shown at the top of the panel. Not surprisingly, the largest response occurs when the RF orientation matches the bar orientation. We have also done similar simulations with a random dot stereogram. The results (not shown) are similar, except that since the stimulus is nonoriented, the maximum responses across all RF orientations do not differ much, and there is no orientation-dependent underestimation of the small disparity.

We now consider pooling across cells with different orientations. The above results suggest that one should first average together the population response curves from all orientations and then use the peak location of the
[Figure 4 appears here: disparity population responses of cells with preferred orientations 22.5° to 157.5° for (A) the phase-shift and (B) the position-shift mechanisms, with the maximum response and the peak-location statistics (m, s) for each orientation; abscissae are Δφ/(ω sin θ) or d in pixels.]
Figure 4: Population responses to bar stereograms from model complex cells with different preferred orientations and with (A) the phase-shift and (B) the position-shift RF mechanisms. The bar stereograms have a fixed disparity of 1 pixel and a fixed orientation of 67.5 degrees. The preferred orientations of the cells are indicated above the first row of the curves, and the case where the preferred orientation matches the bar orientation is marked by an asterisk. All simulation parameters (except orientation) and the presentation format are the same as those for the population responses in Figures 1A and 1B. The number over the curves in each panel represents the maximum response to the 1000 stimuli in arbitrary units. Although the disparity range covered by a family of cells increases when the preferred orientation is closer to horizontal (see text), for ease of presentation and comparison, we confined the disparity range of all cell families to that of the vertically oriented cells.
averaged curve to estimate disparity. (Footnote 5: This procedure is analogous to what we did previously with spatial pooling; Zhu & Qian, 1996; Qian & Zhu, 1997.) For oriented stimuli such as bars, the response from cells with the matched orientation is the most accurate, and it will dominate the average because it is also the strongest. For nonoriented stimuli such as random dots, cells with different orientations respond similarly, and the averaging helps reduce noise. An alternative method is to average the estimated disparities from all orientations weighted by the
[Figure 5 appears here: distribution histograms of the estimated disparity after orientation pooling, for bar and random dot stimuli under the phase-shift and position-shift mechanisms. (A) bar, phase shift: m = 0.894, s = 0.002; (B) bar, position shift: m = 0.964, s = 0.841; (C) random dot, phase shift: m = 0.968, s = 0.107; (D) random dot, position shift: m = 1.014, s = 0.603. Abscissae: disparity in pixels.]
Figure 5: Distribution histograms of the estimated disparity after orientation pooling. Both phase- and position-shift mechanisms were applied to bar and random dot stimuli. The bar stimuli are the same as in Figure 4, and the random dot stimuli are identical to those for panel Ad or Bd of Figure 2. The seven preferred orientations in Figure 4 were used in the pooling.
corresponding peak responses. We found that this method is usually not as good, and we will report simulation results only with the first method. Figure 5 shows the distribution histograms of the estimated disparity with orientation pooling applied to bar and random dot stimuli. The bar stimuli are the same as in Figure 4. As expected, the distribution for the phase-shift mechanism in Figure 5A is as good as the case where the RF orientation matches the bar orientation (67.5 degrees) in Figure 4A. The random dot stimuli are identical to those for Figure 2Ad, and the distribution in Figure 5C is much more reliable due to the pooling. In contrast, the orientation pooling is much less effective for the position-shift RF model (Figures 5B and 5D). Note that we used a relatively small stimulus disparity D; if D is increased to about half of the maximum value allowed by the phase-shift model, the difference between the two RF models will disappear (results not shown).
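The first pooling method can be sketched as follows. Using only the zeroth-order term of equation 2.14 and assuming a nonoriented stimulus, so that the amplitude factor is the same at every orientation (the parameter values and grids below are illustrative assumptions), each orientation's population curve is resampled on a common disparity axis, the curves are averaged, and the peak of the averaged curve is read out:

```python
import math

OMEGA = 2 * math.pi / 16  # preferred spatial frequency of the cells

def pooled_readout(D):
    # A cell with orientation theta and phase shift dphi prefers the
    # disparity x = dphi / (omega * sin(theta)) (eq. 2.16), so its
    # zeroth-order response at candidate disparity x is
    # cos^2(omega * sin(theta) * (D - x) / 2).
    thetas = [math.radians(a) for a in (22.5, 45, 67.5, 90, 112.5, 135, 157.5)]
    xs = [0.25 * k for k in range(-32, 33)]   # common disparity axis, pixels
    def avg_resp(x):
        return sum(math.cos(OMEGA * math.sin(t) * (D - x) / 2) ** 2
                   for t in thetas) / len(thetas)
    return max(xs, key=avg_resp)
```

Every orientation's curve peaks at the true disparity, so the averaged curve does too; the side lobes at different orientations fall at different disparities and are suppressed by the averaging, which is why pooling sharpens the combined peak.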
2.3.2 Scale Pooling and a Coarse-to-Fine Algorithm. In addition to orientation, disparity-selective cells are also tuned to spatial frequency. Cells in each frequency band (or scale) can be used to compute disparity (Marr & Poggio, 1979; Sanger, 1988; Qian, 1994), and an important question is how to combine information from different scales. Obviously, the scale pooling can be done according to the same methods for orientation pooling. The rst method is to average all the population response curves from different scales and then estimate the disparity.6 Alternatively, one could estimate one disparity from each scale and then average the estimates with proper weighting factors (Sanger, 1988; Qian & Zhu, 1997; Mikaelian & Qian, 2000). However, although these approaches work reasonably well for orientation pooling, they may not be adequate for scale pooling because the large and small scales have different problems that are unlikely to cancel each other through averaging. Cells at large scales have large RFs, and they tend to mix together different disparities in the image area covered by the RFs. This will make transitions at disparity boundaries less sharp than our perception (Qian & Zhu, 1997) and render disparities in small image regions difcult to detect. At small scales, the detectable disparity range by the cells is correspondingly smaller (Sanger, 1988; Qian, 1994), and large disparities in a stereogram may lead to completely wrong estimations. If one simply averages across the scales, the resulting disparity map will not be accurate unless the majority of the scales included perform reasonably well. With too many large scales included, the computed disparity map will lose sharp details, and with too many small scales included, disparity estimations at some locations may be totally wrong. If one knew the true stimulus disparities beforehand, one could pick a few appropriate scales and average across them only. 
However, the purpose of a stereo algorithm is to compute disparity maps from arbitrary stereograms with unknown disparities. A method known to alleviate these problems is coarse-to-fine tracking across scales. This method has been applied to stereovision previously (Marr & Poggio, 1979), but to our knowledge, its role in the disparity energy model with the phase- and position-shift RF mechanisms has not been explored. It is most natural to introduce the coarse-to-fine technique into our iterative algorithm presented in section 2.2. The only modification is to start at a large scale and then reduce the scale of the RFs through the iterations. By starting at a large scale, the algorithm can cover a large range of disparities. With each iteration, the disparity is reduced by a constant position shift, and the residual disparity can thus be estimated by the phase mechanism at a smaller scale, which sharpens the disparity boundaries. This procedure can be continued until a fine disparity map is obtained at the smallest scale.
6 This approach has been applied to disparity tuning curves (Fleet et al., 1996) and can be extended to population responses for disparity computation.
A Coarse-to-Fine Disparity Energy Model
1563
[Figure 6 flowchart, rendered as steps:]
1. Start at the largest scale; initialize the position shift d = 0 at each location.
2. Compute the disparity population responses of complex cells with the fixed d but the full range of Δφ and θ.
3. Apply spatial pooling and orientation pooling, and estimate the disparity D_est from the pooled population responses.
4. Move to the next smaller scale, reset d = D_est, and repeat from step 2.

Figure 6: The coarse-to-fine algorithm with both phase- and position-shift RFs.
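A minimal sketch of this loop, under the simplifying assumption that the phase-shift readout at each scale returns the residual disparity clipped to the half-period detectable at that scale (`estimate_residual` is a stand-in for the full population-response computation):

```python
import numpy as np

def estimate_residual(true_d, d, omega):
    """Idealized phase-shift estimate of the residual disparity true_d - d,
    clipped to the +-pi/omega range detectable at this scale."""
    half_period = np.pi / omega
    return float(np.clip(true_d - d, -half_period, half_period))

def coarse_to_fine(true_d, sigmas):
    d = 0.0                              # position shift, updated per iteration
    for sigma in sigmas:                 # largest scale first
        omega = np.pi / sigma            # constant bandwidth: omega * sigma = pi
        d = d + estimate_residual(true_d, d, omega)   # d <- D_est
    return d

sigmas = 8.0 / np.sqrt(2.0) ** np.arange(5)   # geometric series, ratio sqrt(2)
d_est = coarse_to_fine(6.5, sigmas)           # beyond the small scales' range
```

Starting at the largest scale lets the first iteration capture most of a large disparity; the smallest scale alone would saturate at its half-period limit.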
The full algorithm, with the spatial pooling and orientation pooling procedures incorporated, is shown schematically in Figure 6. At each iteration, the spatial pooling step combines quadrature pair responses in a local area to improve the reliability of the population responses (Zhu & Qian, 1996; Qian & Zhu, 1997), and the orientation pooling further improves the quality of the estimated disparity as described in section 2.3.1. Based on the above considerations, the largest scale should be chosen according to the largest disparity the system should be able to extract, and the smallest scale should correspond to the finest details of disparity the system should be able to recover. For the simulations reported below, we employed a set of scales whose σ's follow a geometric series with a ratio of √2. To keep the cells' frequency bandwidth constant at different scales, we fixed the product ωσ⊥ = π for all scales, where σ⊥ is again the gaussian width in the direction orthogonal to the RF orientation. At each scale, there are several sets of cell populations, each with a constant position shift d and the full range of Δφ. Only the population whose d is closest to the estimated disparity at the previous scale will actually be used. We simply let the range of d's be the same across all scales and equal to the disparity range of the phase-shift mechanism at the largest scale. For spatial pooling,
we used a 2D gaussian function to combine the responses of quadrature pairs around each location (Zhu & Qian, 1996; Qian & Zhu, 1997). Finally, to reduce computational time, we used five RF orientations (30, 60, 90, 120, and 150 degrees from horizontal) to perform orientation pooling instead of the seven orientations in Figure 5. Figure 7C is an example of applying our coarse-to-fine algorithm to a random dot stereogram. To test the model's performance for both large and small stimulus disparities, we picked a far disparity of 5 pixels for the central region and a near disparity of −1 pixel for the surround. The panels in Figure 7C show the estimated disparity maps at each of the five iterations or scales (with one iteration per scale). At the largest scale, the transition at the disparity boundaries is poor, and the disparity magnitudes of both the center and the surround are underestimated due to the zero-disparity bias of the phase mechanism (Qian & Zhu, 1997).7 However, since this disparity map is generally in the right direction, the subsequent smaller scales were able to refine it. At the smallest scale, the map has sharp transition boundaries and accurate disparity values for both center and surround regions. The final map is much better than those computed from any individual scale with either the phase- or position-shift mechanism. We also compared the coarse-to-fine algorithm with the simple method of averaging population responses across scales mentioned above. Since at small scales the phase-shift RF model can cover only very small ranges of stimulus disparity, we used the position-shift model with the scale-averaging method. The results of applying the method to the same random dot stereogram are shown in Figure 7D. The spatial pooling and orientation pooling were applied as in Figure 7C.
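The constant-bandwidth scale series can be checked numerically. Assuming the standard half-amplitude bandwidth definition for a Gabor filter, b = log2((ωσ + c)/(ωσ − c)) octaves with c = √(2 ln 2), fixing ωσ⊥ = π at every scale reproduces the 1.14-octave figure quoted in the Figure 7 caption:

```python
import math

# Half-amplitude bandwidth (in octaves) of a Gabor at each scale of a
# geometric series with ratio sqrt(2), under the constraint omega*sigma = pi.
c = math.sqrt(2.0 * math.log(2.0))           # half-amplitude point of a gaussian
sigmas = [8.0 / math.sqrt(2.0) ** k for k in range(5)]
bandwidths = []
for sigma in sigmas:
    omega = math.pi / sigma                  # omega * sigma = pi at all scales
    ws = omega * sigma
    bandwidths.append(math.log2((ws + c) / (ws - c)))    # ~1.14 at every scale
```

Because ωσ is the same at every scale, the bandwidth in octaves is identical across the series; only the center frequency changes.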
In Figure 7D, the panels from left to right show the disparity maps computed by gradually including more scales in the averaging process, with the left-most panel showing the result from the largest scale alone and the right-most panel the result of averaging all five scales. As expected, in the left two panels where the scales are large, the disparity map is fuzzy at the transition boundaries. With more smaller scales included, the estimated disparity at some locations is totally wrong (the black and white spots in the right panels of Figure 7D). In the simulations of different scales, we fixed the stimuli and generated RFs of different sizes. Alternatively, we could fix the RFs and scale the stimuli appropriately. With this approach, both the coarse-to-fine and the scale-averaging methods generate results similar to that in Figure 7C. This suggests that the scale-averaging method is more sensitive to implementation details. Also note that for the simulations in Figure 7, at each scale, only eight complex cells were used for the coarse-to-fine simulation but 33 cells were used for the scale-averaging simulation. The coarse-to-fine method requires fewer cells because at each finer scale, the cells used are more focused around the stimulus disparity. No such adjustment is present in the scale-averaging method, and if Figure 7D used only eight cells, the results (not shown) would be much worse. The scale-averaging method is also more sensitive to the frequency contents of a stereogram; it performs worse for narrowband stimuli (results not shown). In summary, the scale-averaging method is not as robust as the coarse-to-fine method.

7 As we noted earlier, without scale pooling, the single-scale disparity maps computed with 2D filters in this article are much worse than those computed with 1D filters reported previously (Qian, 1994; Qian & Zhu, 1997) because 2D filters tend to mix up disparities in a larger region.

Figure 7: The coarse-to-fine and scale-averaging algorithms applied to a random dot stereogram. (A) A 200 × 200 random dot stereogram with a dot density of 50% and a dot size of 1 pixel. The central 100 × 100 area has a disparity of 5 pixels, while the surround has a disparity of −1 pixel. (B) True disparity map. The white and black colors represent near and far disparities, respectively. (C) Estimated disparity maps at five scales and iterations obtained with the coarse-to-fine algorithm. The spatial pooling function was a 2D gaussian with both standard deviations equal to σ⊥ at each scale. Orientation pooling covered five orientations from 30 to 150 degrees in steps of 30 degrees. The other parameters for the RFs in the left-most panel were the same as those in Figure 3. The scales of the other panels were successively reduced from left to right by a factor of √2, as indicated by the σ⊥ value under each panel. Note that for all five scales, ωσ⊥ = π; thus, the frequency bandwidths of all scales were fixed at 1.14 octaves. For each scale, the position shift d always varied from −8 pixels to 8 pixels in steps of 0.5 pixel, while the phase shift Δφ covered a period from −π to π in steps of π/4. (D) Estimated disparity maps with the scale-averaging procedure and the position-shift mechanism. From left to right, a smaller scale with the indicated σ⊥ was added to the average at each panel. Thus, the left-most panel shows the result from the largest scale alone, while the right-most panel is the average result of all five scales. The spatial pooling and orientation pooling were also applied as in C. The RF parameters were the same as those in C, except that Δφ was kept at 0.

2.4 Application of the Coarse-to-Fine Algorithm to More Complex Stereograms. We also applied the coarse-to-fine algorithm to more complex synthetic stereograms and to real-world stereograms. Figure 8 shows the results for a disparity ramp and a Gabor disparity profile. The stereograms were created by starting with a reference random dot image and then shifting each dot horizontally in opposite directions in the left and right images by half of the disparity value prescribed to the dot. Gray-level interpolation was used to represent subpixel disparities. Figures 8C and 8F demonstrate that the coarse-to-fine algorithm works well on these stereograms. Except for the slightly blurred disparity boundaries, the estimated disparity maps closely match the true disparity maps for both stereograms, with most errors within ±0.25 pixel (true for 89% and 93% of the total pixels for the ramp and Gabor stereograms, respectively). For comparison, we also show in Figure 8 the results from the scale-averaging method used for Figure 7D; again, the estimated disparities at some locations are completely wrong. Figure 9 shows three real-world stereograms and the estimated disparity maps with our coarse-to-fine algorithm and the scale-averaging algorithm. Since the true disparity maps are unknown, we can assess the performance only qualitatively.
The results computed with our coarse-to-fine algorithm seem to be quite reasonable for the Pentagon and Tree stereograms, but less accurate for the Shrub stereogram. In general, the method works well on image areas with relatively high-contrast textures (e.g., the grass ground of the Tree stereogram) but fails at low-contrast regions (the foreground pavement of the Shrub stereogram). This problem is not surprising, as the low-contrast areas generate only weak complex-cell responses that are more prone to noise. Another problem is exemplified by the small black spot in the Pentagon disparity map: if a large scale reports a very wrong disparity, the smaller scales usually cannot correct it. Solving these problems may require more global but selective interactions of disparity information at different locations and a bidirectional information flow among the scales (see section 3). With the scale-averaging method, the problem of spots with wrong disparities is more pronounced, similar to Figures 7 and 8. Moreover, the estimated disparity maps appear more blurred than those obtained with the coarse-to-fine algorithm (e.g., the signpost in Shrub).
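The quantitative criterion used for the synthetic stereograms, the fraction of pixels whose estimated disparity falls within ±0.25 pixel of ground truth, is straightforward to compute; the maps below are toy placeholders:

```python
import numpy as np

def fraction_within(est, true, tol=0.25):
    """Fraction of pixels with |estimated - true| disparity <= tol pixels."""
    return float(np.mean(np.abs(est - true) <= tol))

true_map = np.zeros((8, 8))
good_frac = fraction_within(true_map + 0.1, true_map)   # all within tolerance
bad_frac = fraction_within(true_map + 0.3, true_map)    # all outside tolerance
```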
Figure 8: More complex synthetic stereograms. (A) A ramp stereogram with a size of 200 × 200 pixels. In the central 160 × 160 area, the disparity varies linearly from −5 pixels to 5 pixels, while the surround has zero disparity. The gray level of a pixel is randomly chosen between 0 and 1. (B) True disparity map for the ramp stereogram. The white and black colors represent near and far disparities, respectively. (C) Estimated disparity map with the coarse-to-fine algorithm (left panel) and the scale-averaging algorithm (right panel). (D) Gabor stereogram with the same size as the ramp stereogram. The disparity map is created according to a Gabor function: D(x, y) = D_{max} \exp\left(-\frac{x^2 + y^2}{2\sigma_D^2}\right) \cos(\omega_D \sin(\theta_D) x + \omega_D \cos(\theta_D) y + \phi_D). The parameters of the Gabor function are: D_{max} = 5 pixels, \omega_D / 2\pi = 1/80 cyc/pixel, \sigma_D = 40 pixels, \phi_D = 1.39, and disparity orientation \theta_D = 30°. (E) True disparity map for the Gabor stereogram. (F) Estimated disparity map with the coarse-to-fine algorithm (left panel) and the scale-averaging algorithm (right panel). All the RF parameters applied to both ramp and Gabor stereograms are the same as those in Figure 7.
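The Gabor disparity profile in the caption can be generated directly from the stated parameters; taking (x, y) relative to the image center is an assumption about the coordinate convention:

```python
import numpy as np

# D(x, y) = Dmax * exp(-(x^2 + y^2) / (2 sigma_D^2))
#         * cos(w_D sin(theta_D) x + w_D cos(theta_D) y + phi_D)
def gabor_disparity(nx=200, ny=200, d_max=5.0, freq=1.0 / 80.0,
                    sigma_d=40.0, phi_d=1.39, theta_d=np.deg2rad(30.0)):
    w_d = 2.0 * np.pi * freq                 # w_D / 2pi = 1/80 cyc/pixel
    y, x = np.mgrid[0:ny, 0:nx].astype(float)
    x -= nx / 2.0                            # assumed: origin at image center
    y -= ny / 2.0
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma_d ** 2))
    carrier = np.cos(w_d * np.sin(theta_d) * x
                     + w_d * np.cos(theta_d) * y + phi_d)
    return d_max * envelope * carrier

dmap = gabor_disparity()
```

Shifting each dot of a reference image by ±D(x, y)/2 in the two half-images, as described above, would then turn this map into the stereogram itself.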
3 Discussion

We have demonstrated through analyses and simulations that in the framework of the disparity energy model, the phase-shift RF mechanism is better suited for disparity computation than the position-shift mechanism. Although the two RF mechanisms generate similar tuning curve properties, the phase-shift mechanism provides more reliable population response curves than the position-shift mechanism does. The phase-shift model, however, has its own limitations, such as the restricted range of detectable disparity (Qian, 1994) and a zero-disparity bias (Qian & Zhu, 1997). To overcome
Figure 9: Real-world stereograms and the estimated disparity maps with the coarse-to-fine and the scale-averaging algorithms. (A) Pentagon stereogram. The size of the stereogram is 256 × 256 pixels. (B) Shrub stereogram. The size is 256 × 240 pixels. (C) Tree stereogram. The size is 256 × 233 pixels. The stereograms have all been scaled to the same size in the figure for ease of presentation. All the RF parameters are the same as those in Figure 8. To attenuate the interference of the DC component in the stimuli, we subtracted the mean luminance from each stereogram before applying the filters. All three stereograms were obtained from the Carnegie Mellon University Image Database.
these problems, we have proposed an iterative procedure in which the disparity is always estimated with the phase-shift mechanism but the magnitude of the disparity is gradually reduced by introducing a constant position-shift parameter to all cells at each iteration. We then considered integrating information across different RF orientations and spatial scales and found that although a simple averaging procedure (used previously for spatial pooling; Zhu & Qian, 1996; Qian & Zhu, 1997) seems sensible for orientation pooling, it may not be a good approach for scale pooling. Instead, it is better to treat scale pooling as a coarse-to-fine process. Such a coarse-to-fine process can be easily combined with the iterative procedure, with a smaller scale used at each new iteration. The final algorithm is an iterative coarse-to-fine procedure with spatial and orientation pooling incorporated and with both position-shift and phase-shift RF components.
We have applied the algorithm to a variety of stereograms. Our simulations demonstrate that the algorithm can recover both sharp disparity boundaries and gradual disparity changes in the stimuli. However, a couple of problems with the algorithm are also revealed by real-world stereograms. Unlike synthetic patterns, real-world images tend to have regions of very low contrast. Model cells whose RFs cover only a low-contrast area will not respond well, and the results will tend to be very unreliable. The solution to this problem might involve introducing long-range spatial interactions to allow disparity information to propagate from high-contrast regions to low-contrast regions. The challenge is that the interactions should also be selective in order not to smear out real disparity boundaries. It is not clear how to introduce such interactions in a physiologically plausible manner without resorting to a list of ad hoc rules. Another problem with the current algorithm is that if a large scale gives a completely wrong estimation of the stimulus disparity, it is impossible for the subsequent smaller scales to correct it. This does not happen often, due to the inclusion of spatial and orientation pooling, which greatly improves the reliability of estimation at any scale, but when it does happen, the estimated disparity can be far from the true value. This problem may be solved by a bidirectional information flow across the scales instead of the current unidirectional flow from large to small scales. Indeed, there is psychophysical evidence suggesting that scale interaction is bidirectional (Wilson, Blake, & Halpern, 1991; Smallman & MacLeod, 1994). The scale-averaging approach may be viewed as bidirectional. However, it does not appear to be adequate because the wrong-disparity problem becomes worse with such a method. In general, we found that the coarse-to-fine algorithm is superior to the scale-averaging method, particularly with complex or natural stereograms.
But scale averaging is much easier to implement and may be useful when the precision of the disparity map is not critical. It is interesting to compare the effects of varying RF orientation and varying RF scale on disparity computation. When the RF orientation is farther away from the vertical, the cells' detectable disparity range increases. This is similar to an increase of the RF scale. However, as the orientation gets closer to horizontal, there will be fewer cycles of RF modulation along the horizontal dimension, the tuning curves and population response curves will become broader, and the disparity computation will become less meaningful. This problem does not occur when one uses vertically oriented cells at larger scales to cover a wider disparity range, as long as the spatial-frequency bandwidth (and thus the number of RF modulation cycles) is kept constant across scales. At a fixed scale, cells with different orientations cover the same stimulus area, whereas cells with different scales cover very different stimulus areas. These differences suggest that orientation pooling and scale pooling should be treated differently, as we did in this article. In particular, since cells with different orientations at a given scale all cover the same image patch, their responses can be averaged together to represent the total
disparity energy of that patch. In contrast, cells with different scales cover different image sizes, and it appears more sensible to use a coarse-to-fine algorithm that can gradually recover disparity details. Also note that the averaging procedure of orientation pooling appears to be more effective for the phase-shift model than for the position-shift model, at least for small disparities (see Figure 5). We finally discuss the physiological relevance of our model. Our algorithm is based on the disparity energy model, which has been found to provide a good approximation of real complex-cell responses, although some discrepancies have also been noted (Freeman & Ohzawa, 1990; Ohzawa et al., 1990, 1997; DeAngelis et al., 1991; Anzai, Ohzawa, & Freeman, 1999b, 1999c; Chen et al., 2001; Livingstone & Tsao, 1999; Cumming, 2002). In addition, experimental evidence indicates that both the phase- and position-shift RF mechanisms are employed by binocular cells for disparity representation (Hubel & Wiesel, 1962; Bishop & Pettigrew, 1986; Poggio, Motter, Squatrito, & Trotter, 1985; Ohzawa et al., 1990, 1997; DeAngelis et al., 1991; Anzai et al., 1999a; Prince et al., 2000). We (Zhu & Qian, 1996; Qian & Zhu, 1997) have pointed out previously that spatial pooling of quadrature pair responses to construct complex-cell responses is consistent with the fact that the RFs of complex cells are, on average, larger than those of simple cells (Hubel & Wiesel, 1962; Schiller, Finlay, & Volman, 1976). It is also reasonable to incorporate orientation pooling, since physiological studies have demonstrated that obliquely oriented cells are tuned to disparity (Poggio & Fischer, 1977; Ohzawa et al., 1996, 1997) and thus must contribute to disparity estimation.
This is further supported by the psychophysical finding that binocular correspondence appears to be solved by oriented filters along the contours in a stimulus (Farell, 1998) and does not follow the epipolar constraint (Stevenson & Schor, 1997). On the other hand, there is no evidence for (or against) our specific proposal of using the phase- and position-shift mechanisms in an iterative, coarse-to-fine process. In particular, we do not know whether real binocular cells rely on the phase-shift mechanism to estimate disparity and use the position-shift mechanism to reduce the disparity magnitude that the phase mechanism has to process. Our coarse-to-fine procedure is similar to that proposed by Marr and Poggio (1979), but with one important difference. Marr and Poggio assumed that a large disparity is reduced by vergence eye movement before being processed by smaller scales. Since vergence eye movement shifts stimulus disparity globally, a particular vergence state will not be able to reduce stimulus disparities at all spatial locations, and a different vergence state would have to be assumed for each image location. Our model requires only a single vergence state that brings stimulus disparity within a reasonable range; further disparity reduction during the estimation process is carried out by the position-shift mechanism locally at each point. This is consistent with the psychophysical observation that scale interaction does not depend
on eye movement (Rohaly & Wilson, 1993).8 In addition, the model is consistent with the finding that with vergence minimized, the fusional range decreases with spatial frequency (Schor, Wood, & Ogawa, 1984) because higher-frequency stimuli benefit less from the guidance of the larger scales. A recent report indicates that individual V1 cells may undergo a coarse-to-fine process over time independent of eye movement (Menz & Freeman, 2003). Therefore, it may not be necessary to implement our coarse-to-fine algorithm with several different populations of cells as we did in our simulations; instead, a single cell population progressively reducing its scale over time might be sufficient.

Appendix: Derivation of Equations 2.6 and 2.7

Based on the disparity energy model, the response of a simple cell to a stereo image pair I(x, y) and I(x + D, y) can be written as (Ohzawa et al., 1990; DeAngelis, Ohzawa, & Freeman, 1993; Qian, 1994; Anzai et al., 1999b; Chen et al., 2001):

r_s = \left[ \iint_{-\infty}^{\infty} \{ g_l(x, y) I(x, y) + g_r(x, y) I(x + D, y) \} \, dx \, dy \right]^2
    = \left[ \iint_{-\infty}^{\infty} \{ g_l(x, y) I(x, y) + g_r(x - D, y) I(x, y) \} \, dx \, dy \right]^2,    (A.1)
where D is the stimulus disparity, and g_l(x, y) and g_r(x, y) are the left and right RFs of the simple cell. Note that the full squaring used here is equivalent to a push-pull pair of half-squaring simple cells. We define the linear filtering of the left and right images as

r_{sl} = \iint_{-\infty}^{\infty} g_l(x, y) I(x, y) \, dx \, dy,    (A.2)

r_{sr} = \iint_{-\infty}^{\infty} g_r(x - D, y) I(x, y) \, dx \, dy,    (A.3)
8 The model may also be consistent with the finding that the stereo threshold elevation with base disparity is not eliminated by the addition of a low-frequency component (Rohaly & Wilson, 1993) because the low-frequency component itself has an elevated threshold with base disparity and thus becomes unreliable at large base disparity.
so that r_s = (r_{sl} + r_{sr})^2. The left RF of a simple cell has the same form in the phase- and position-shift models, and we rewrite equations 2.2 and 2.4 as

g_l(x, y) = \cos(\phi) \, g_{\cos}(x, y) - \sin(\phi) \, g_{\sin}(x, y),    (A.4)

where

g_{\cos}(x, y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left( -\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2} \right) \cos(\omega x),
g_{\sin}(x, y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left( -\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2} \right) \sin(\omega x).

Then, r_{sl} for both the phase- and position-shift RF models becomes

r_{sl} = \cos(\phi) \iint_{-\infty}^{\infty} g_{\cos}(x, y) I(x, y) \, dx \, dy - \sin(\phi) \iint_{-\infty}^{\infty} g_{\sin}(x, y) I(x, y) \, dx \, dy
       = A \cos(\alpha + \phi),    (A.5)
where

A = \sqrt{A_1^2 + A_2^2}, \qquad \alpha = \arctan(A_2 / A_1),    (A.6)

A_1 = \iint_{-\infty}^{\infty} g_{\cos}(x, y) I(x, y) \, dx \, dy,
A_2 = \iint_{-\infty}^{\infty} g_{\sin}(x, y) I(x, y) \, dx \, dy.
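Equations A.5 and A.6 can be verified numerically: the left-eye response to any image equals A cos(α + φ), with A and α computed from the quadrature-pair inner products. The filter sizes and the random test image below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sx, sy, omega = 4.0, 4.0, np.pi / 4.0        # arbitrary RF parameters
y, x = np.mgrid[-16:17, -16:17].astype(float)
envelope = np.exp(-x ** 2 / (2 * sx ** 2)
                  - y ** 2 / (2 * sy ** 2)) / (2 * np.pi * sx * sy)
g_cos = envelope * np.cos(omega * x)         # even (cosine-phase) RF
g_sin = envelope * np.sin(omega * x)         # odd (sine-phase) RF

image = rng.standard_normal(x.shape)         # arbitrary test image
A1 = np.sum(g_cos * image)                   # discrete stand-ins for the
A2 = np.sum(g_sin * image)                   # integrals in equation A.6
A = np.hypot(A1, A2)
alpha = np.arctan2(A2, A1)

phi = 0.7                                    # arbitrary left-eye phase
r_sl = np.cos(phi) * A1 - np.sin(phi) * A2   # filter response, equation A.5
```

Expanding A cos(α + φ) = A1 cos φ − A2 sin φ shows the two expressions agree identically, which the assertion checks to machine precision.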
Since the right RF of a simple cell has different forms in the two RF models, we consider them separately. In equation A.3, g_r(x - D, y) can be written as

g_r^{pha}(x - D, y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left( -\frac{(x - D)^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2} \right) \cos(\omega (x - D) + \phi + \Delta\phi),    (A.7)

g_r^{pos}(x - D, y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left( -\frac{(x - (D - d))^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2} \right) \cos(\omega (x - (D - d)) + \phi),    (A.8)

for the phase- and position-shift mechanisms, respectively. In the previous modeling analyses (Qian, 1994; Zhu & Qian, 1996; Fleet et al., 1996; Qian
& Zhu, 1997; Chen et al., 2001), a simplifying assumption is that the horizontal envelope shifts (D in the gaussian term of equation A.7 and D - d in the gaussian term of equation A.8) are small enough to be ignored. For the phase-shift mechanism, the assumption can be satisfied as long as the stimulus disparity D is small enough. However, for the position-shift mechanism, the above assumption requires the stimulus disparity D to be close to the position shift d. This is obviously a more stringent requirement. To compare the two mechanisms, the envelope shifts should be considered. Here, we take the first-order Taylor expansion in D/\sigma_x and (D - d)/\sigma_x for the two mechanisms, respectively:

\exp\left( -\frac{(x - D)^2}{2\sigma_x^2} \right) \approx \left( 1 + \frac{xD}{\sigma_x^2} \right) \exp\left( -\frac{x^2}{2\sigma_x^2} \right),    (A.9)

\exp\left( -\frac{(x - (D - d))^2}{2\sigma_x^2} \right) \approx \left( 1 + \frac{x(D - d)}{\sigma_x^2} \right) \exp\left( -\frac{x^2}{2\sigma_x^2} \right).    (A.10)
Under these approximations, the right RFs in equations A.7 and A.8 can be written as

g_r^{pha}(x - D, y) \approx \frac{1}{2\pi \sigma_x \sigma_y} \exp\left( -\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2} \right) \cos(\omega (x - D) + \phi + \Delta\phi) \left( 1 + \frac{xD}{\sigma_x^2} \right),    (A.11)

g_r^{pos}(x - D, y) \approx \frac{1}{2\pi \sigma_x \sigma_y} \exp\left( -\frac{x^2}{2\sigma_x^2} - \frac{y^2}{2\sigma_y^2} \right) \cos(\omega (x - (D - d)) + \phi) \left( 1 + \frac{x(D - d)}{\sigma_x^2} \right).    (A.12)

Then, similar to the derivation of equation A.5, we have

r_{sr}^{pha} \approx A \cos(\alpha + \phi + \Delta\phi - \omega D) + \frac{D}{\sigma_x} B \cos(\beta + \phi + \Delta\phi - \omega D),    (A.13)

r_{sr}^{pos} \approx A \cos(\alpha + \phi - \omega (D - d)) + \frac{D - d}{\sigma_x} B \cos(\beta + \phi - \omega (D - d)),    (A.14)

for the phase- and position-shift models, respectively, with

B = \sqrt{B_1^2 + B_2^2}, \qquad \beta = \arctan(B_2 / B_1),    (A.15)
B_1 = \iint_{-\infty}^{\infty} \frac{x}{\sigma_x} \, g_{\cos}(x, y) I(x, y) \, dx \, dy,
B_2 = \iint_{-\infty}^{\infty} \frac{x}{\sigma_x} \, g_{\sin}(x, y) I(x, y) \, dx \, dy.
The final expressions for the simple-cell responses are:

r_s^{pha} \approx \left[ 2A \cos\left( \alpha + \phi + \frac{\Delta\phi - \omega D}{2} \right) \cos\left( \frac{\Delta\phi - \omega D}{2} \right) + \frac{D}{\sigma_x} B \cos(\beta + \phi + \Delta\phi - \omega D) \right]^2,    (A.16)

r_s^{pos} \approx \left[ 2A \cos\left( \alpha + \phi - \frac{\omega (D - d)}{2} \right) \cos\left( \frac{\omega (D - d)}{2} \right) + \frac{D - d}{\sigma_x} B \cos(\beta + \phi - \omega (D - d)) \right]^2.    (A.17)
Based on the well-known quadrature pair method for the energy models (Adelson & Bergen, 1985; Watson & Ahumada, 1985; Pollen, 1981; Ohzawa et al., 1990; Emerson, Bergen, & Adelson, 1992; Qian, 1994), the complex cell receives the inputs from two simple cells, both with identical \Delta\phi but with their \phi differing by \pi/2. The resulting complex-cell responses, to the first-order approximation, are:

r_q^{pha} \approx 4A^2 \cos^2\left( \frac{\omega D - \Delta\phi}{2} \right) + 4AB \, \frac{D}{\sigma_x} \cos\left( \frac{\omega D - \Delta\phi}{2} \right) \cos\left( \alpha - \beta + \frac{\omega D - \Delta\phi}{2} \right),    (A.18)

r_q^{pos} \approx 4A^2 \cos^2\left( \frac{\omega (D - d)}{2} \right) + 4AB \, \frac{D - d}{\sigma_x} \cos\left( \frac{\omega (D - d)}{2} \right) \cos\left( \alpha - \beta + \frac{\omega (D - d)}{2} \right).    (A.19)
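To zeroth order (dropping the B term), equation A.18 says a phase-shift complex cell's response peaks where \Delta\phi = \omega D, so the stimulus disparity can be read out as the preferred phase shift divided by \omega. A quick numerical illustration with arbitrary values:

```python
import numpy as np

omega, D, A = np.pi / 8.0, 2.0, 1.0          # arbitrary illustrative values
dphi = np.linspace(-np.pi, np.pi, 721)       # phase shifts of the population
r_q = 4.0 * A ** 2 * np.cos((omega * D - dphi) / 2.0) ** 2   # zeroth order of A.18
d_est = dphi[np.argmax(r_q)] / omega         # preferred phase shift / omega
```

The cos^2 dependence on (\omega D - \Delta\phi)/2 also makes the detectable range of the phase mechanism a half period \pm\pi/\omega, the limitation that motivates the coarse-to-fine procedure.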
These are equations 2.6 and 2.7. For a relatively thin bar at x_o, I(x, y) \approx \delta(x - x_o, y), and equations A.6 and A.15 show \alpha \approx \beta.

Acknowledgments

This work was supported by NIH grant MH54125. We thank the two anonymous reviewers for their helpful comments.
References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2, 284–299.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: Position vs. phase. Proc. Nat. Acad. Sci. USA, 94, 5438–5443.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999a). Neural mechanisms for encoding binocular disparity: Receptive field position vs. phase. J. Neurophysiol., 82, 874–890.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999b). Neural mechanisms for processing binocular information: I. Simple cells. J. Neurophysiol., 82, 891–908.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999c). Neural mechanisms for processing binocular information: II. Complex cells. J. Neurophysiol., 82, 909–924.
Bishop, P. O., & Pettigrew, J. D. (1986). Neural mechanisms of binocular vision. Vision Res., 26, 1587–1600.
Chen, Y., Wang, Y., & Qian, N. (2001). Modeling V1 disparity tuning to time-varying stimuli. J. Neurophysiol., 86(1), 143–155.
Cumming, B. G. (2002). An unexpected specialization for horizontal disparity in primate primary visual cortex. Nature, 418, 633–636.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A, 2, 1160–1169.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1991). Depth is encoded in the visual cortex by a specialized receptive field structure. Nature, 352, 156–159.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1993). Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. II. Linearity of temporal and spatial summation. J. Neurophysiol., 69, 1118–1135.
Emerson, R. C., Bergen, J. R., & Adelson, E. H. (1992). Directionally selective complex cells and the computation of motion energy in cat visual cortex. Vision Res., 32, 203–218.
Farell, B. (1998). Two-dimensional matches from one-dimensional stimulus components in human stereopsis. Nature, 395, 689–692.
Fleet, D. J., Wagner, H., & Heeger, D. J. (1996). Encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Res., 36, 1839–1858.
Freeman, R. D., & Ohzawa, I. (1990). On the neurophysiological organization of binocular vision. Vision Res., 30, 1661–1676.
Hubel, D. H., & Wiesel, T. (1962). Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in the cat striate cortex. J. Neurophysiol., 58, 1187–1211.
Livingstone, M. S., & Tsao, D. T. (1999). Receptive fields of disparity-selective neurons in macaque striate cortex. Nature Neurosci., 2(9), 825–832.
Marr, D., & Poggio, T. (1979). A computational theory of human stereo vision. Proc. R. Soc. Lond. B, 204, 301–328.
Matthews, N., Meng, X., Xu, P., & Qian, N. (2003). A physiological theory of depth perception from vertical disparity. Vision Res., 43, 85–99.
Menz, M. D., & Freeman, R. D. (2003). Stereoscopic depth processing in the visual cortex: A coarse-to-ne mechanism. Nature Neurosci., 6(1), 59–65. Mikaelian, S., & Qian, N. (1997). Disparity attraction and repulsion in a twodimensional stereo model. Soc. Neurosci. Abs., 23, 569. Mikaelian, S., & Qian, N. (2000). A physiologically-based explanation of disparity attraction and repulsion. Vision Res., 40, 2999–3016. Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041. Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1996). Encoding of binocular disparity by simple cells in the cat’s visual cortex. J. Neurophysiol., 75, 1779– 1805. Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat’s visual cortex. J. Neurophysiol., 77, 2879– 2909. Poggio, G. F., & Fischer, B. (1977). Binocular interaction and depth sensitivity in striate and prestriate cortex of behaving rhesus monkey. J. Neurophysiol., 40, 1392–1405. Poggio, G. F., Gonzalez, F., & Krause, F. (1988). Stereoscopic mechanisms in monkey visual cortex: Binocular correlation and disparity selectivity. J. Neurosci., 8, 4531–4550. Poggio, G. F., Motter, B. C., Squatrito, S., & Trotter, Y. (1985). Responses of neurons in visual cortex (V1 and V2) of the alert macaque to dynamic randomdot stereograms. Vision Res., 25, 397–406. Pollen, D. A. (1981). Phase relationship between adjacent simple cells in the visual cortex. Nature, 212, 1409–1411. Prince, S. J. D., Cumming, B. G., & Parker, A. J. (2000). Range and mechanism of encoding of horizontal disparity in macaque V1. J. Neurophysiol., 87, 209–221. Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Comput., 6, 390–404. Qian, N. (1997). Binocular disparity and the perception of depth. Neuron, 18, 359–368. Qian, N., & Zhu, Y. (1997). 
Physiological computation of binocular disparity. Vision Res., 37, 1811–1827. Rohaly, A. M., & Wilson, H. R. (1993). Nature of coarse-to-fine constraints on binocular fusion. J. Opt. Soc. Am. A, 10, 2433–2441. Sanger, T. D. (1988). Stereo disparity computation using Gabor filters. Biol. Cybern., 59, 405–418. Schiller, P. H., Finlay, B. L., & Volman, S. F. (1976). Quantitative studies of single-cell properties in monkey striate cortex: I. Spatiotemporal organization of receptive fields. J. Neurophysiol., 39, 1288–1319. Schor, C. M., & Wood, I. (1983). Disparity range for local stereopsis as a function of luminance spatial frequency. Vision Res., 23, 1649–1654. Schor, C. M., Wood, I., & Ogawa, J. (1984). Binocular sensory fusion is limited by spatial resolution. Vision Res., 24(7), 661–665. Smallman, H. S., & MacLeod, D. I. (1994). Size-disparity correlation in stereopsis at contrast threshold. J. Opt. Soc. Am. A, 11, 2169–2183.
A Coarse-to-Fine Disparity Energy Model
Stevenson, S. B., & Schor, C. M. (1997). Human stereo matching is not restricted to epipolar lines. Vision Res., 37, 2717–2723. Teich, A. F., & Qian, N. (2003). Learning and adaptation in a recurrent model of V1 orientation selectivity. J. Neurophysiol., 89, 2086–2100. Tsai, J. J., & Victor, J. D. (2003). Reading a population code: A multi-scale neural model for representing binocular disparity. Vision Res., 43, 445–466. Watson, A. B., & Ahumada, A. J. (1985). Model of human visual-motion sensing. J. Opt. Soc. Am. A, 2, 322–342. Wilson, H. R., Blake, R., & Halpern, D. L. (1991). Coarse spatial scales constrain the range of binocular fusion on fine scales. J. Opt. Soc. Am. A, 8, 229–236. Zhu, Y., & Qian, N. (1996). Binocular receptive fields, disparity tuning, and characteristic disparity. Neural Comput., 8, 1611–1641. Received July 30, 2003; accepted January 30, 2004.
LETTER
Communicated by Ning Qian
A Preference for Phase-Based Disparity in a Neuromorphic Implementation of the Binocular Energy Model Eric K. C. Tsang
[email protected] Bertram E. Shi
[email protected] Department of Electrical and Electronic Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
The relative depth of objects causes small shifts in the left and right retinal positions of these objects, called binocular disparity. This letter describes an electronic implementation of a single binocularly tuned complex cell based on the binocular energy model, which has been proposed to model disparity-tuned complex cells in the mammalian primary visual cortex. Our system consists of two silicon retinas representing the left and right eyes, two silicon chips containing retinotopic arrays of spiking neurons with monocular Gabor-type spatial receptive fields, and logic circuits that combine the spike outputs to compute a disparity-selective complex cell response. The tuned disparity can be adjusted electronically by introducing either position or phase shifts between the monocular receptive field profiles. Mismatch between the monocular receptive field profiles caused by transistor mismatch can degrade the relative responses of neurons tuned to different disparities. In our system, the relative responses between neurons tuned by phase encoding are better matched than those between neurons tuned by position encoding. Our numerical sensitivity analysis indicates that the relative responses of phase-encoded neurons are least sensitive to the receptive field parameters that vary the most in our system. We conjecture that this robustness may be one reason for the existence of phase-encoded disparity-tuned neurons in biological neural systems.

1 Introduction

The accurate perception of the relative depth of objects enables both biological organisms and artificial autonomous systems to interact successfully with their environment. Binocular disparity, the positional shift between the image locations of an object in two eyes or cameras caused by the difference in their vantage points, is one important cue that can be used to infer depth.
Neural Computation 16, 1579–1600 (2004) © 2004 Massachusetts Institute of Technology

Neurons in the mammalian visual cortex combine signals from the left and right eyes to generate responses selective for a particular disparity
(Barlow, Blakemore, & Pettigrew, 1967; Poggio, Motter, Squatrito, & Trotter, 1985). Ohzawa, DeAngelis, and Freeman (1990) proposed the binocular disparity energy model to explain the responses of binocular complex cells and found that the predictions of this model match measured data from the cat. Cumming and Parker (1997) showed that this model also matches data from the macaque. Although there are discrepancies between the model and the biological data (Ohzawa, DeAngelis, & Freeman, 1997; Chen, Wang, & Qian, 2001; Cumming, 2002), it remains a good first-order approximation. In the model, a neuron can achieve a particular disparity tuning by a position or a phase shift between its monocular receptive field (RF) profiles of the left and right eyes. Based on an analysis of a population of binocular cells, Anzai, Ohzawa, and Freeman (1999a) suggest that the cat primarily encodes disparity via a phase shift, although position shifts may play a larger role at higher spatial frequencies. A preference for phase encoding is also consistent with human psychophysical data (Schor, Wood, & Ogawa, 1984; DeValois & DeValois, 1988). More recent data from the alert macaque suggest that a mixture of position and phase encodings is common (Prince, Cumming, & Parker, 2002; Tsao, Conway, & Livingstone, 2003). Since computational studies indicate that populations of neurons tuned by either encoding can be used to recover disparity (Qian & Zhu, 1997), the natural question that arises is, What are the relative advantages of one encoding scheme over the other? This letter suggests one possible advantage of neurons tuned by phase shifts: robustness in the presence of neuronal variability, which we discovered in the course of developing a neuromorphic implementation of disparity-tuned neurons.
Although physiological measurements reveal a large diversity of responses from populations of neurons, most computational studies of the binocular energy model assume retinotopic arrays of neurons that are identical except for position or phase shifts. In our neuromorphic implementation, variability is unavoidable due to manufacturing tolerances. Thus, we face a problem similar to that of biological systems: computing with inexact computing elements. Solutions to this problem will lead to more robust engineering systems, as well as possible insights into biological computational strategies. Section 2 reviews the binocular energy model and the encoding of disparity by position and phase shifts. Section 3 describes our implementation, which is constructed using a pair of silicon retinas that feed into a pair of silicon chips containing retinotopic arrays of neurons with monocular Gabor-type RF profiles. Section 4 presents measured results from the system. We find that the relative responses of neurons tuned by phase encoding are in better agreement with theoretical predictions than those of neurons tuned by position encoding. In our system, random transistor mismatch introduces mismatch between the nominally identical monocular neurons used to construct the binocular responses. We investigated the sensitivities of the position and phase models numerically to variations in different parameters of the monocular RF profiles. For most parameters, phase encoding
is more robust than position encoding. We also characterized the variations in the monocular RF profiles within one chip. Serendipitously, we find that the phase model is least sensitive to the parameters that vary the most in our system. Section 5 summarizes our results and future directions.

2 The Binocular Energy Model

Ohzawa et al. (1990) proposed the binocular energy model to explain the response of binocular complex cells measured in the cat. The model is similar to the motion energy model previously proposed by Adelson and Bergen (1985) for human perception of visual motion. Anzai, Ohzawa, and Freeman (1999a, 1999b, 1999c) further refined the model. In the model, the response of a binocular complex cell is the linear combination of the outputs of four binocular simple cells, as shown in Figure 1. Anzai et al. (1999b) model the response of a binocular simple cell using a linear binocular filter followed by a half-squaring nonlinearity. The linear binocular filter sums the responses of two monocular filters modeled by Gabor functions. Let I(x) denote image intensity, where x ∈ R². The output of a monocular Gabor filter m(c, φ; I) is given by

m(c, φ; I) = ∫ g(x; c, φ) I(x) dx

g(x; c, φ) = κ exp(−½ (x − c)ᵀ C⁻¹ (x − c)) cos(Ωᵀ(x − c) + φ),
Figure 1: The binocular energy model of a complex cell (adapted from Anzai et al., 1999a). The response of a binocular complex cell is the linear combination of four binocular simple cell outputs. Each binocular simple cell is constructed from a linear binocular filter followed by a half-squaring nonlinearity. The monocular filters combined by the binocular filter are modeled as Gabor functions.
where Ω ∈ R² and C ∈ R^{2×2} control the spatial frequency tuning and bandwidth of the filter, and κ controls the gain. These parameters are assumed to be the same in all of the simple cells that make up a complex cell. However, the center position c ∈ R² and the phase φ ∈ R vary between the two eyes and among the four simple cells. The linear binocular filter output is

b(c_R, c_L, φ_R, φ_L) = m(c_R, φ_R; I_R) + m(c_L, φ_L; I_L),

where the subscripts R and L differentiate the right and left eye center positions, phases, and image intensities. The simple cell output is given by

r_s = (|b(c_R, c_L, φ_R, φ_L)|⁺)²,
where |b|⁺ = max{b, 0} is the positive half-wave rectifying nonlinearity. While the response of simple cells depends heavily on the stimulus contrast and phase in addition to the disparity and position of the stimulus, the response of complex cells is largely independent of the phase and contrast. The binocular energy model posits that complex cells achieve phase invariance by linearly combining the outputs of four simple cells whose binocular filters differ in phase by π/2. If r_c(c_R, c_L, φ_R, φ_L) is the complex cell output, then

r_c(c_R, c_L, φ_R, φ_L) = Σ_{k=0}^{3} r_s(c_R, c_L, φ_R + kπ/2, φ_L + kπ/2).
Since filters that differ in phase by π are equal except for a change in sign,

r_s(c_R, c_L, φ_R + π, φ_L + π) = (|b(c_R, c_L, φ_R, φ_L)|⁻)²
r_s(c_R, c_L, φ_R + 3π/2, φ_L + 3π/2) = (|b(c_R, c_L, φ_R + π/2, φ_L + π/2)|⁻)²,

where |b|⁻ = max{−b, 0} is the negative half-wave rectifying nonlinearity. Thus, we can construct the four required simple cell outputs from the positive and negative half-wave rectified outputs of two binocular filters. Complex cells constructed according to the binocular energy model are selective for disparities in the direction orthogonal to their preferred orientation. For example, cells tuned to vertical orientations are selective for horizontal disparities. Their disparity tuning depends on the relative center positions and the relative phases of the two monocular filters that make up the binocular linear filter. Define the disparity, d, of a binocular stimulus to be the shift of the right eye stimulus with respect to the left eye stimulus, I_R(x) ≈ I_L(x − d). A binocular complex cell whose monocular filters are shifted by Δc = c_R − c_L and Δφ = φ_R − φ_L will respond maximally when the input has a preferred disparity

D_pref ≈ Δc − Δφ/Ω,
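As a check on these relations, the one-dimensional energy model above can be simulated directly. The sketch below (with illustrative parameter values chosen for the example, not those of the chips described later) builds a complex cell from four half-squared binocular simple cells in phase quadrature and verifies that a phase shift of Δφ = −π/2 yields a tuning peak near D_pref ≈ −Δφ/Ω:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 256                    # image width in pixels (illustrative)
Omega = 2 * np.pi / 16     # carrier spatial frequency: period 16 pixels
sigma = 8.0                # Gaussian envelope width
x = np.arange(N)
c = N // 2                 # shared RF center (position shift Delta-c = 0)

def gabor(phi):
    # 1-D monocular Gabor RF profile g(x; c, phi)
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2)) * np.cos(Omega * (x - c) + phi)

def complex_cell(IL, IR, dphi):
    # Sum of four half-squared binocular simple cells in phase quadrature;
    # dphi = phi_R - phi_L is the monocular phase shift.
    r = 0.0
    for k in range(4):
        phiL = k * np.pi / 2
        b = gabor(phiL + dphi) @ IR + gabor(phiL) @ IL
        r += max(b, 0.0) ** 2
    return r

disparities = np.arange(-8, 9)
tuning = {dphi: np.zeros(len(disparities)) for dphi in (0.0, -np.pi / 2)}
for _ in range(200):                     # average over random-dot stimuli
    IL = rng.standard_normal(N)
    for i, d in enumerate(disparities):
        IR = np.roll(IL, d)              # I_R(x) = I_L(x - d)
        for dphi in tuning:
            tuning[dphi][i] += complex_cell(IL, IR, dphi)

peak = {dphi: disparities[np.argmax(t)] for dphi, t in tuning.items()}
# Expect peak[0.0] near 0 and peak[-pi/2] near (pi/2)/Omega = 4 pixels.
```

Averaging over many random stimuli is needed because, as the text notes, the measured peak also depends on the frequency content of the input; for any single stimulus the tuning curve is noisy.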
where Ω is the spatial frequency tuning in the horizontal direction. However, this is only an approximation, as the preferred disparity also depends on the frequency content of the input stimulus (Qian, 1994; Zhu & Qian, 1996). Disparity is encoded by a position shift if Δc ≠ 0 and Δφ = 0. Disparity is encoded by a phase shift if Δc = 0 and Δφ ≠ 0. The cell uses a hybrid encoding if both Δc ≠ 0 and Δφ ≠ 0. Phase encoding and position encoding are equivalent for the zero-disparity-tuned cell. Assuming that the monocular RFs have the same centers, the complex cell output of Figure 1 is tuned to zero disparity.

3 Experimental Setup

3.1 System Overview. Figure 2 shows a block diagram of our binocular cell system, which uses a combination of analog and digital processing. The visual front end consists of two silicon retina chips (Zaghloul & Boahen, 2004a, 2004b) mounted behind 4 mm lenses that focus light onto them. Each retina contains a 60 × 96 pixel array of phototransistors that sample the incident light and convert it into a current. Processing circuits on the chip generate spike outputs that mimic the spiking responses of sustained-ON, sustained-OFF, transient-ON, and transient-OFF retinal ganglion cells. Our system uses only the sustained outputs, which respond continuously to a static image. ON and OFF cells respond to positive and negative contrasts, respectively. The ON and OFF cells are arranged in arrays of 30 × 48 neurons. Each chip has an asynchronous digital output bus that communicates the spikes in the array off chip using the address event representation (AER) protocol (Boahen, 2000). The two Gabor-type chips perform the required monocular filtering operations (Choi, Shi, & Boahen, 2004). Each chip can process a 32 × 64 pixel input image, which is encoded as the difference in spike rate of ON and OFF input channels, which are supplied using the AER protocol. Our system uses only the upper left 30 × 48 pixels due to the lower resolution of the silicon retinas.
Each chip filters its input image with both EVEN and ODD symmetric Gabor-type filters, corresponding to phases 0 and −π/2. Four silicon neuron circuits at each pixel encode the positive (ON) and negative (OFF) components of the EVEN and ODD filter outputs as spike trains, which are communicated off chip using AER. We refer to these signals using the abbreviations e+, e−, o+, and o−. The Gabor-type filters differ from Gabor filters in that the modulating function is not gaussian but still decays with distance from the origin. For a one-dimensional image, the RF profile can be approximated by

g(x; c, φ) = κ (ΔΩ/2) exp(−ΔΩ|x|) cos(Ω(x − c) + φ),     (3.1)

where x ∈ R indexes pixel position, the parameters Ω ∈ R and ΔΩ ∈ R control the spatial frequency tuning and bandwidth, κ controls the gain, c ∈ R indicates the center position, and φ ∈ R represents the phase of the profile. Since the distance between the points where the Fourier transform of the RF profile drops to approximately half its peak value is 2ΔΩ, we refer to ΔΩ as the half-bandwidth. The difference in the filter shape should not affect the resulting binocular complex cell responses significantly; Qian and Zhu (1997) have shown that the binocular complex cell responses in the energy model are insensitive to the exact shape of the modulating envelope. Because we expect primarily horizontal disparities, we tune the Gabor-type chips for vertical orientations.

Figure 2: Block diagram of the neuromorphic implementation. Connections via AER are denoted with oppositely oriented arrows. The outputs of the address filters are ON-OFF encoded voltage spike trains. The outputs of the binocular combination block are spike rate estimates obtained by binning. (A) The system configured for position encoding. (B) The system configured for phase encoding of positive disparity. Negative disparity can be encoded by swapping the routing between the ON and OFF channels from one retina in the binocular combination block.

The two address filters and the binocular combination block combine the outputs of the two Gabor-type chips to implement two phase-quadrature binocular filters. Each address filter passes only those spikes from the four neurons at a desired retinal location and demultiplexes them as voltage pulses on four separate digital lines. The binocular combination block merges the spike trains from the two eyes onto four lines that represent the outputs of two phase-quadrature binocular filters differentially. By changing the retinal location selected from the left and right eyes in the AER address filters, we can change the position encoding of disparity. By altering the routing in the binocular combination block, we can change the phase encoding. We implement the address filters and the binocular combination block using Xilinx complex programmable logic devices (CPLDs). The spike trains on the four output lines of the binocular combination block are binned over a 40 ms time window to estimate the spike rate and then differenced to recover the two binocular filter outputs. The binocular filter outputs are then positive and negative half-squared to compute four simple cell outputs. The simple cell outputs are summed to compute the complex cell output. Spike binning is performed on the same Xilinx CPLD used to implement the binocular combination block.
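Equation 3.1 is easy to probe numerically. The sketch below evaluates the Gabor-type profile (with parameter values of the same order as the chip tunings reported later in Table 1, but chosen here only for illustration) and checks that, centered at the origin, the EVEN (φ = 0) profile is even-symmetric and the ODD (φ = −π/2) profile is odd-symmetric:

```python
import numpy as np

def gabor_type(x, c, phi, kappa=1.0, Omega=0.55, dOmega=0.27):
    # Equation 3.1: exponentially decaying (not Gaussian) modulating envelope.
    # Note the envelope depends on |x|, while c and phi shift only the carrier.
    return kappa * (dOmega / 2) * np.exp(-dOmega * np.abs(x)) \
                 * np.cos(Omega * (x - c) + phi)

x = np.arange(-30, 31)              # pixel positions around the RF origin
even = gabor_type(x, 0, 0.0)        # EVEN filter, phase 0
odd = gabor_type(x, 0, -np.pi / 2)  # ODD filter, phase -pi/2
```

Since cos(Ωx − π/2) = sin(Ωx), the ODD profile changes sign under x → −x while the EVEN profile does not, which is the symmetry the EVEN/ODD naming refers to.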
The final differencing, squaring, and summing operations are done by an 8051 microcontroller (MCU) running at 24 MHz. The final binocular cell responses are sent to a PC via the MCU serial port for analysis. The remaining sections describe the design of the AER address filter and the binocular combination block in more detail and can be skipped without loss of continuity. The silicon retina and Gabor-type transceiver chip have been described elsewhere. The final steps computed by the microcontroller are not discussed further, as they are quite straightforward.

3.2 Address Filter. The AER output of each Gabor-type chip encodes the spike activity of all the neurons on that chip. The address filter extracts only those spikes corresponding to the four neurons at a selected retinal location. The AER protocol communicates spikes from one chip, the transmitter, to another, the receiver. The transmitter signals the occurrence of a spike in the array by placing an address on the bus that uniquely identifies the neuron that spiked. Each neuron is assigned a unique X (column) and Y (row) address pair. The time that the address appears on the bus encodes the spike time directly.

Figure 3: Block diagram of the address filter showing the handshaking signals used by the AER protocol.

As illustrated at the top of Figure 3, spike addresses are represented using a word-serial format consisting of bursts of addresses, each of which encodes the locations of all simultaneously spiking neurons within one row. The first address within each burst corresponds to a chip address, identifying the chip that the spikes originate from. The next address identifies the row. Subsequent addresses identify the columns containing spiking neurons. Simultaneous spikes from different rows are handled by an arbiter that arranges them into sequential bursts. Three handshaking signals (ReqY, ~ReqX, and Ack) ensure that addresses are communicated correctly. The transmitter takes ReqY high and low to signal the start and the end of a burst. It takes ~ReqX low to signal the row and the subsequent column addresses. The receiver makes a transition on Ack to acknowledge every transition on either of the two request lines. For the Gabor-type chip, each retinal position has four associated neurons, which are arranged in 2 × 2 blocks. ON and OFF neurons are interleaved in alternating columns. EVEN and ODD neurons are located in alternating rows. Thus, the least significant bit (LSB) of the X address encodes ON-OFF, and the LSB of the Y address encodes EVEN-ODD. As shown in Figure 3, the address filter extracts only those spikes corresponding to the four neurons at a desired retinal position and demultiplexes these spikes as voltage pulses on four separate wires. As addresses appear on the AER bus, the filter generates the Ack signal that is sent back to the Gabor-type chip. Two latches latch the row and column addresses of each spike, which are compared with the row and column addresses of the desired retinal location. In our addressing scheme, the retinal location is encoded on bits 1 to 6 of the address. Once the decoder detects a spike from the desired retinal location, it generates a voltage pulse, which is demultiplexed onto one of four output lines depending on the LSBs of the latched row and column addresses. To avoid losing events, we implement the address filter on a Xilinx XC9500 series CPLD to minimize the time that the address filter requires to process each spike. We chose this series because of its speed and flexibility. The block delay in each macrocell is only 7 ns. The series supports in-system programming, enabling rapid debugging during system design. Because the AER protocol is asynchronous, we needed to pay particular attention to the timing in the signal path to ensure that addresses are latched correctly and to avoid glitches that could be interpreted as output spikes. First, we connected the ~ReqX signal from the Gabor-type chips that signals the validity of both row and column addresses to an ungated global clock pin (GCLK) to generate the signal that latches the row and column addresses. Because the GCLK pin is routed to every logic block directly, this ensures that the flip-flops latch the address bits simultaneously.
Second, rather than using automatic place and route, we manually equalized the path delays in the address comparison circuits and in the 1-to-4 decoder. This was essential since the routing delays in the switching matrix can be on the same order as the block delays.

3.3 Binocular Combination Block. The binocular combination block combines the eight monocular spike trains (four from each eye) to compute two phase-quadrature binocular filter outputs. For a zero-disparity-tuned cell, whose required routing is shown in Figure 2A, we set the AER address filters so that they extract spikes from monocular neurons corresponding to the same retinal location in the left and right eyes (Δc = 0). To compute the output of the first binocular filter, B1, the binocular combination block sums the outputs of the left and right eye EVEN filters by merging the spikes from the e+ channels onto a positive output line, B1+, and the spikes from the e− channels onto a negative output line, B1−. Although the difference between the spike rates on B1+ and B1− encodes the B1 filter output, the B1+ and B1− spike rates do not represent the ON and OFF components of B1, since they may be positive simultaneously. To compute the output of the second
filter, B2, the binocular combination block merges the spikes from the ODD channels similarly. We can electronically reconfigure the system to construct the binocular filter outputs required for neurons tuned to nonzero disparities. For position encoding, we change the relative addresses selected by the AER address filters to set Δc ≠ 0 but leave the binocular combination block unchanged. For phase encoding, we leave the AER address filters unchanged and alter the routing in the binocular combination block. Because the RF profiles of the Gabor-type chips have two phase values, altering the routing can yield four distinct binocular filters with monocular filter phase shifts of Δφ = −π/2, 0, π/2, and π. Neurons with phase shifts of −π/2, whose required routing is shown in Figure 2B, are tuned to positive disparities. Neurons with phase shifts of π/2, which can be obtained by swapping the routing of the ON and OFF signals from one eye in Figure 2B, are tuned to negative disparities. Neurons with phase shifts of π, which can be obtained by swapping the routing of the ON and OFF signals from one eye in Figure 2A, are tuned inhibitory complex cells whose outputs are minimized by zero-disparity stimuli. We implement the binocular combination block using a Xilinx XC9500 series CPLD. Inputs control the monocular phase shift of the resulting binocular filter by modifying the routing. For simplicity, we merge spikes using inclusive OR gates without arbitration. Although simultaneous spikes on the left and right channels could be merged into a single spike, the probability that this will happen is negligible, since the width of the voltage pulse representing each spike (~32 ns) is much smaller than the interspike intervals, which are on the order of milliseconds.
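The digital path described in Sections 3.2 and 3.3 — filter spikes by address, merge ON/OFF channels according to the desired phase shift, then difference and half-square the binned counts — can be sketched in software. The code below is an illustrative model of that logic, not the CPLD design: the channel names follow the text, but the exact address bit layout and the spike counts are invented for the example:

```python
from collections import Counter

# Demultiplexed channel names keyed by (Y LSB, X LSB); which LSB value maps
# to EVEN/ODD and ON/OFF is an assumption made for this sketch.
CHANNELS = {(0, 0): 'e+', (0, 1): 'e-', (1, 0): 'o+', (1, 1): 'o-'}

def address_filter(spikes, row, col):
    """Keep spikes whose upper address bits encode the desired retinal
    location and demultiplex them into four channels by the two LSBs."""
    out = Counter({name: 0 for name in CHANNELS.values()})
    for x, y in spikes:
        if (x >> 1) == col and (y >> 1) == row:
            out[CHANNELS[(y & 1, x & 1)]] += 1
    return out

# Routing table: monocular phase shift (in units of pi/2) -> (source of the
# positive line, source of the negative line).  A shift of pi just negates a
# filter, so it is implemented by swapping the ON and OFF channels.
ROUTING = {0: ('e+', 'e-'), -1: ('o+', 'o-'), 1: ('o-', 'o+'), 2: ('e-', 'e+')}

def binocular_complex(left, right, shift):
    """Complex cell response from binned channel counts, with the right eye's
    monocular phases shifted by shift * pi/2 purely through re-routing."""
    response = 0.0
    for base in (0, -1):                 # the two quadrature filters B1, B2
        lp, lm = ROUTING[base]
        rp, rm = ROUTING[(base + shift + 1) % 4 - 1]  # wrap into {-1,0,1,2}
        b = (left[lp] + right[rp]) - (left[lm] + right[rm])
        response += max(b, 0.0) ** 2 + max(-b, 0.0) ** 2  # |b|+^2 + |b|-^2
    return response
```

The re-routing trick relies on the fact that the chip's two physical phases (0 and −π/2) plus ON/OFF swapping span all four phase shifts, so no extra filtering hardware is needed to retune the cell.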
4 Results

This section describes the measured responses from the binocular complex cell implementation described in the previous section, the results of a sensitivity analysis of the responses to mismatch in the parameters of the monocular RFs, and a characterization of the monocular RF mismatch in our system.

4.1 Binocular Receptive Field Maps—Technique. Figure 4 shows our experimental setup for measuring the binocular RFs of the complex cells. Both Gabor-type chips were tuned for vertical orientations with similar RF parameters, as listed in Table 1. The RF parameters were estimated by fitting the monocular Gabor-type chip responses to equation 3.1. We presented each retina with a 2.4 cm wide vertical dark bar on a bright background using two LCD monitors placed 55 cm from the retinas. This arrangement meant that the bar spanned four adjacent retinal positions, corresponding to about one-third of the period of the sinusoidal modulation in the RF profile and two-thirds of the width of the modulating envelope,
Figure 4: The experimental setup for measuring the binocular RF maps. The monitors were moved closer to the retinas to enable more detail of the hardware system to be shown. The insert board converts the AER address format used by the silicon retina to that used by the Gabor-type chips by inserting a chip address into each burst.

Table 1: Tunings for the Two Gabor-Type Chips in the Experimental Setup.

        Left     Right
κ       2152     2265
ΔΩ      0.3059   0.2494
Ω       0.5226   0.5748
which is defined as 2(ΔΩ)⁻¹. Using two monitors gives us precise control over the input disparity. We stimulated 31 × 31 pairs of left and right stimulus positions. The visual angle between two adjacent stimulus positions was 0.935 degree. We characterized the responses of five binocular complex cells: one cell tuned to zero disparity and two pairs of cells tuned to positive and negative disparities by phase or position encoding. We obtained the zero-disparity-tuned cell and the phase-encoded cells by combining the spikes from pixels with (row, column) address of (15, 24) in the left and right chips. For the position-encoded cells, we combined the spikes from pixel (15, 24) in the right chip with the spikes from pixels (15, 20) and (15, 28) in the left chip.

4.2 Binocular Receptive Field Maps—Results. Figure 5 shows the binocular RF maps measured as described in the previous section.

Figure 5: Binocular RF maps shown as images, where the gray level of each pixel encodes the response of the complex cell to a pair of spatial bar stimuli applied to the left and right Gabor-type chips. Darker gray levels indicate higher firing rates. (A, D) The RF maps of neurons tuned to negative disparities by phase and position encoding. (B) The regions of the maps corresponding to different disparities. (C, F) The RF maps of neurons tuned to positive disparities by phase and position encoding. (E) The map of the neuron tuned to zero disparity, which is the same for both position and phase encoding.

The map for the zero-disparity-tuned neuron (Figure 5E) shows the expected peak in the center of the map. Maps for neurons tuned to positive (Figures 5C and 5F) or negative (Figures 5A and 5D) disparities show this peak shifted above or below the main diagonal. For position encoding (Figures 5D and
5F), the peaks are qualitatively similar to the peak of the zero-disparity neuron, only shifted. The responses of the neurons tuned by phase encoding (Figures 5A and 5C) are qualitatively different from the response of the zero-disparity-tuned neuron, exhibiting suppression in their response to the opposite disparity.

4.3 Disparity Tuning Curves. Figure 6 shows the disparity tuning curves for complex cells tuned to negative, zero, and positive disparities by phase and position encoding, which we computed by integrating the responses in Figure 5 along lines of constant disparity.

Figure 6: Disparity tuning curves for cells tuned by the phase model and the position model. (A, B) The measured tuning curves for the phase and position models. Tuning curves for different neurons are denoted by lines of different widths. Thick line: zero-disparity-tuned neuron. Medium line: negative-disparity-tuned neuron. Thin line: positive-disparity-tuned neuron. (C, D) The ideal tuning curves assuming perfectly matched monocular RF profiles, using the least-squares-fitted RF parameters of the measured monocular responses from one eye. Disparities are expressed in units of discrete stimulus position (0.935 degrees/position).

We also computed
the ideal disparity tuning curves by assuming that the monocular responses to inputs at a position x had the form

m(c, φ; x) = κ (ΔΩ/2) exp(−ΔΩ|x|) cos(Ω(x − c) + φ),

where the parameters κ, Ω, and ΔΩ were obtained by a least-squares fit of the expression to the measured monocular responses from one eye, and φ was either 0 or π/2. Although we do observe the expected shifts in the locations of the peak responses for different disparity tunings, the relative sizes of the peaks for the phase-encoded neurons appear to match the theoretical predictions of the disparity energy model better than those of the position-encoded neurons. The model predicts that the disparity tuning curves for the position-encoded neurons should be identical in size and shape, just shifted to the left or the right. However, the measured responses show three peaks with varying heights. These deviations between expected and measured responses are due primarily to mismatch between the monocular RF profiles of the Gabor-type chips. Because subsequent processing is implemented in digital hardware, there is no variability introduced by these stages, except due to quantization, which is negligible here. The disparity energy model generally assumes that the left and right monocular RFs have the same gain and spatial frequency tunings. However, random transistor mismatch within the Gabor-type chips causes the gain and spatial frequency tuning to vary from neuron to neuron. The presence of this mismatch and its observed effect on the relative responses of neurons tuned to different disparities raises a question: Which of the two encoding schemes is more robust in the presence of these variations? As described below, we addressed this question in two parts. First, how does the variation in the parameters controlling the monocular RF profiles affect the relative outputs of complex cells? Second, how much do the RF parameters vary from neuron to neuron on the chip?

4.4 Sensitivity Analysis. We model the mismatch by assuming that the responses of the monocular filters have the same parameterized form, but that the parameters vary from pixel to pixel.
For simplicity, we assume one-dimensional input images consisting of a single spatial impulse and model the response of the four neurons at the origin of retinal coordinates to an impulse at pixel x by

h_e+(x) = κ_e+ |h_e(x) − ξ_e|⁺
h_e−(x) = κ_e− |h_e(x) − ξ_e|⁻
h_o+(x) = κ_o+ |h_o(x) − ξ_o|⁺
h_o−(x) = κ_o− |h_o(x) − ξ_o|⁻,
A Preference for Phase-Based Disparity
1593
where

h_e(x) = \frac{\Delta\Omega}{2} e^{-\Delta\Omega |x|} \cos(\Omega x + \phi_e) \quad \text{and} \quad h_o(x) = \frac{\Delta\Omega}{2} e^{-\Delta\Omega |x|} \sin(\Omega x + \phi_o).

We examined the effect of mismatch in five RF parameters: the spatial frequency Ω, the half-bandwidth ΔΩ, the offsets (ξ_e, ξ_o), the phase offsets (φ_e, φ_o), and the gains (κ_{e+}, κ_{e−}, κ_{o+}, κ_{o−}). Mismatch in the analog transistors implementing the Gabor-type filters introduces mismatch in the spatial frequency, half-bandwidth, phases, and offsets. Mismatch in the transistors of the input integrators and output spiking neurons introduces gain mismatch. We show in the next section that this mathematical model captures approximately 90% of the RF variability on the Gabor-type chips. To examine the robustness of the relative responses, we need to establish metrics on the quality of the relative responses. We concentrate on the relative responses of populations of neurons tuned to different disparities but similar frontoparallel positions, since we can compare these to estimate disparity. Perhaps the simplest method classifies a stimulus into three categories depending on which neuron has the largest response, as shown in Figure 7. In this case, we are interested in how monocular RF mismatch affects the locations of the decision boundaries L and R. Although this is only one possible method for estimating disparity, other methods should display a similar dependence on the parameter mismatch. We examined how much these boundaries shifted in response to monocular RF parameter mismatch by perturbing the RF parameters in the left and right neurons by additive gaussian offsets with zero mean and varying percentage deviations. We used the mean of the left and right RF parameters listed in Table 1 as the nominal parameters. Deviations in the offsets, which are nominally zero, were expressed as a percentage of the peak value of the EVEN filter impulse response. Similarly, the deviations in the phase offsets
Figure 7: Decision boundaries obtained by comparing the relative responses of neurons tuned to negative, zero, and positive disparities. L and R denote the boundaries between negative/zero and zero/positive disparity stimuli. The parameters D = R − L and C = (L + R)/2 describe the distance between, and the centroid of, the decision boundaries.
1594
E. Tsang and B. Shi
were expressed as a percentage of the maximum phase shift before phase wraparound, π. We found that in general, variations in L and R were coupled, but that the variations in the distance D = R − L between the boundaries and their centroid C = (L + R)/2 were approximately independent. Deviations in the decision boundary parameters D and C are expressed as a percentage of the nominal value of D. Thus, a 100% shift in C indicates that the centroid of the left and right boundaries has shifted so much that the original and the perturbed regions corresponding to fixation share no overlap if D remains constant. The variations in the decision boundary parameters increase linearly for small RF parameter deviations. We define the sensitivity of each decision boundary parameter in response to variations in each RF parameter as the slope of this line, which we estimate by linear least-squares fitting. If the sensitivity is less than one, we say that the decision boundary parameter is robust to variations in the RF parameter. Figure 8 compares the sensitivities of the position and phase models. Except for one combination (the sensitivity of D to variations in Ω), the phase model is more robust than the position model. In addition, the sensitivity of neurons tuned by phase encoding is always less than one, indicating that phase encoding is robust to variation in the monocular RF profiles. The distance D is more sensitive to variations than the centroid C.

4.5 Intrachip Variability of RF Parameters. The previous section showed that in general, the phase model is more robust than the position model. However, since there are RF parameters for which phase encoding is more sensitive than position encoding, it is important to establish how much each RF parameter varies within one chip. Although transistor parameters will vary from chip to chip, the effect of this interchip variation can be minimized by adjusting the external bias voltages appropriately.
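Before turning to the measured variability, the monocular model of section 4.4 and the sensitivity estimate can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the function names are ours, and the slope here is a least-squares line through the origin (the paper does not specify the intercept handling).

```python
import numpy as np

# Monocular Gabor-type impulse responses (1-D model from the text);
# Omega = spatial frequency, dOmega = half-bandwidth, phi = phase offset.
def h_even(x, Omega, dOmega, phi_e=0.0):
    return 0.5 * dOmega * np.exp(-dOmega * np.abs(x)) * np.cos(Omega * x + phi_e)

def h_odd(x, Omega, dOmega, phi_o=0.0):
    return 0.5 * dOmega * np.exp(-dOmega * np.abs(x)) * np.sin(Omega * x + phi_o)

def rectified_responses(x, Omega, dOmega, kappas, xi_e=0.0, xi_o=0.0):
    """Four half-wave-rectified outputs kappa * |h - xi|_{+/-}."""
    he = h_even(x, Omega, dOmega) - xi_e
    ho = h_odd(x, Omega, dOmega) - xi_o
    k_ep, k_em, k_op, k_om = kappas
    return (k_ep * np.maximum(he, 0.0), k_em * np.maximum(-he, 0.0),
            k_op * np.maximum(ho, 0.0), k_om * np.maximum(-ho, 0.0))

def sensitivity(deviations, shifts):
    """Slope of boundary-parameter shift vs. RF-parameter deviation."""
    d, s = np.asarray(deviations, float), np.asarray(shifts, float)
    return float(d @ s / (d @ d))
```

With unit gains and zero offsets, the positive and negative channels recombine to the underlying profile, h_{e+} − h_{e−} = h_e, which is a quick sanity check on the model.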
However, intrachip variability makes it impossible to match the responses of all pairs of neurons on two different chips and is therefore the limiting factor. We characterized the variability by measuring the variability of the RF parameters obtained by least-squares fits to the RF profiles measured at seven neurons located in row 16 and columns 26, 28, 30, 32, 34, 36, and 38 of one chip. We chose a restricted range of columns and rows to avoid edge effects due to the chip boundaries. The least-squares fits of the RF profiles of these neurons are illustrated in Figure 9. We measured the quality of the fits by the percentage accuracy

A\% = \left( 1 - \sqrt{ \frac{ \sum_k \left[ (h_e[k] - \hat{h}_e[k])^2 + (h_o[k] - \hat{h}_o[k])^2 \right] }{ \sum_k \left[ \hat{h}_e[k]^2 + \hat{h}_o[k]^2 \right] } } \right) \times 100\%,

where

h_e[k] = \sum_j \left( h_{e+}[j, k] - h_{e-}[j, k] \right) \quad \text{and} \quad h_o[k] = \sum_j \left( h_{o+}[j, k] - h_{o-}[j, k] \right)
Figure 8: The sensitivity of position and phase encoding. (A) Sensitivity of the distance D between the decision boundaries to mismatch in different RF parameters. (B) Sensitivity of the centroid C.
are the measured one-dimensional EVEN and ODD RF profiles of the neurons, \hat{h}_e[k] and \hat{h}_o[k] are their fitted RF profiles, and the indices j and k correspond to the row and column position. The fits shown in Figure 9 closely matched the measured profiles. The percentage accuracy for about half of the neurons was over 90% (the average value was 88.37%). It turns out that the phase variability is quite small: fixing the phase offsets φ_e and φ_o to be zero degraded the quality of the fits by only 1.2%. The percentage standard deviations are plotted in Figure 10. The filter gains vary the most, followed by the half-bandwidth. The spatial frequency, phase, and offset vary the least. This figure also shows that there is a complementary relationship between the sensitivity of the distance between the decision boundaries, D, for phase-encoded neurons and the RF parameter variability. Although the gains vary the most, the decision boundaries are not very sensitive to their variations. On the other hand, the decision bound-
Figure 9: The least-squares fits of measured Gabor-type RF profiles. (A–G) The fits for the EVEN and ODD profiles corresponding to the seven neurons located in row 16 and columns 26, 28, 30, 32, 34, 36, and 38 of a chip. The discrete data points, +, represent the measured responses. The solid lines represent the fits.
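The fit-quality metric A% defined above is straightforward to compute. A minimal sketch (the function name is ours; it assumes the measured and fitted profiles are given as equal-length arrays):

```python
import numpy as np

def percent_accuracy(he, ho, he_fit, ho_fit):
    """Percentage accuracy A% between measured (he, ho) and fitted
    (he_fit, ho_fit) one-dimensional EVEN/ODD RF profiles."""
    he, ho = np.asarray(he, float), np.asarray(ho, float)
    he_fit, ho_fit = np.asarray(he_fit, float), np.asarray(ho_fit, float)
    err = np.sum((he - he_fit) ** 2 + (ho - ho_fit) ** 2)  # residual energy
    ref = np.sum(he_fit ** 2 + ho_fit ** 2)                # fitted-profile energy
    return (1.0 - np.sqrt(err / ref)) * 100.0
```

A perfect fit yields 100%; a residual whose energy equals the fitted profile's energy yields 0%.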
Figure 10: Comparison between the sensitivity of the distance D between decision boundaries using phase-encoded neurons to variations in different RF parameters and the measured variation of the parameters in the system.
aries are relatively more sensitive to variations in the spatial frequency and offset, but these do not vary much on the chip.

5 Conclusion

We implemented a neuromorphic multichip model of disparity-tuned complex cells in the mammalian primary visual cortex based on the disparity energy model. We evaluated two possible mechanisms for disparity tuning, phase and position offsets between the monocular RF profiles, and demonstrated that our system prefers phase encoding, because the relative responses of neurons tuned to different disparities are less sensitive to mismatch in the monocular RF profiles. We found that the phase model is least sensitive to the RF parameters that vary the most in our system. Biological systems would also benefit from this type of robustness, as they must also extract information from the responses of a population of neurons. Intuitively, the better robustness of our phase-encoded neurons arises because binocular disparity-selective neurons tuned to different disparities are constructed from the same sets of monocular outputs, but in different combinations. Thus, different binocular neurons are subject to the same variations. On the other hand, our position-encoded neurons are constructed from different sets of monocular outputs. The biological analog of the large variability in the gain of the RF profiles we measured may be the large variation in the firing rate of real neurons. Thus, our result implies that phase encoding is one mechanism that enables real cells to deal with firing rate variability. Our system architecture is similar to a recent cortical
model of stereopsis, which proposes that monocular neural outputs from layer 4 of V1 are combined binocularly in layer 3B (Grossberg & Howe, 2003). Interestingly, the population response of phase-encoded neurons is also more robust in another sense: the peak of the population response is an accurate estimate of the disparity regardless of the detailed luminance profile of the stimulus, while this is not true for the population responses of position-encoded neurons (Chen & Qian, 2004). This implementation is an initial step toward the development of a multichip neuromorphic system capable of extracting depth information about the visual environment using silicon neurons with physiologically based functionality. Currently, the system can compute only the output of a single cell tuned to a particular disparity and retinal position. However, the estimation of disparities in a visual scene requires a population of neurons tuned to different disparities and retinal positions. Although we could accomplish this by exploiting the electronic reconfigurability of the system using time multiplexing, this approach is inefficient and time-consuming. Our current work seeks to implement this population in a fully parallel manner. By combining the AER outputs from the left and right Gabor-type chips onto a chip containing a two-dimensional array of neurons with the appropriate address remapping, we can implement an array of neurons tuned to the same disparity but varying retinal locations. Additional chips could represent neurons tuned to other disparities. Computing the neuronal outputs in parallel will enable us to investigate the roles of additional processing steps such as pooling (Fleet, Wagner, & Heeger, 1996; Qian & Zhu, 1997) and normalization (Albrecht & Geisler, 1991; Heeger, 1992).

Acknowledgments

This work was supported in part by the Hong Kong Research Grants Council under grants HKUST6218/01E and HKUST6200/03E.
It was inspired by an initial project at the 2002 Telluride Neuromorphic Workshop performed with Y. Miyawaki. We thank K. A. Boahen for supplying the silicon retina and AER receiver chips used in this work and for helpful discussions, and T. Choi for his assistance in building the system.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2, 284–299.
Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast response functions of simple cells in the visual cortex. Visual Neuroscience, 7, 531–546.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999a). Neural mechanisms for encoding binocular disparity: Position vs. phase. J. Neurophysiol., 82, 874–890.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999b). Neural mechanisms for processing binocular information I. Simple cells. J. Neurophysiol., 82, 891–908.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999c). Neural mechanisms for processing binocular information II. Complex cells. J. Neurophysiol., 82, 909–924.
Barlow, H. B., Blakemore, C., & Pettigrew, J. D. (1967). The neural mechanism of binocular depth discrimination. J. Physiol. Lond., 193, 327–342.
Boahen, K. A. (2000). Point-to-point connectivity between neuromorphic chips using address events. IEEE Transactions on Circuits and Systems–II: Analog and Digital Signal Processing, 47, 416–434.
Chen, Y., Wang, Y., & Qian, N. (2001). Modeling V1 disparity tuning to time-varying stimuli. J. Neurophysiol., 86(1), 143–155.
Chen, Y., & Qian, N. (2004). A coarse-to-fine disparity energy model with both phase-shift and position-shift receptive field mechanisms. Neural Comput., 16, 1545–1577.
Choi, T. Y. W., Shi, B. E., & Boahen, K. (2004). An ON-OFF orientation selective address event representation image transceiver chip. IEEE Transactions on Circuits and Systems–I: Fundamental Theory and Applications, 51(2), 342–353.
Cumming, B. G. (2002). An unexpected specialization for horizontal disparity in primate primary visual cortex. Nature, 418, 633–636.
Cumming, B. G., & Parker, A. J. (1997). Responses of primary visual cortical neurons to binocular disparity without depth perception. Nature, 389, 280–283.
DeValois, R. L., & DeValois, K. K. (1988). Spatial vision. New York: Oxford University Press.
Fleet, D. J., Wagner, H., & Heeger, D. J. (1996). Neural encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Res., 36, 1839–1857.
Grossberg, S., & Howe, P. D. L. (2003). A laminar cortical model of stereopsis and three-dimensional surface perception. Vision Res., 43, 801–829.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., 77, 2879–2909.
Poggio, G. F., Motter, B. C., Squatrito, S., & Trotter, Y. (1985). Responses of neurons in visual cortex (V1 and V2) of the alert macaque to dynamic random-dot stereograms. Vision Res., 25, 397–406.
Prince, S. J., Cumming, B. G., & Parker, A. J. (2002). Range and mechanism of encoding of horizontal disparity in macaque V1. J. Neurophysiol., 87(1), 209–221.
Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Comput., 6, 390–404.
Qian, N., & Zhu, Y. (1997). Physiological computation of binocular disparity. Vision Res., 37, 1811–1827.
Schor, C., Wood, I., & Ogawa, J. (1984). Binocular sensory fusion is limited by spatial resolution. Vision Res., 24, 661–665.
Tsao, D. Y., Conway, B. R., & Livingstone, M. S. (2003). Receptive fields of disparity-tuned simple cells in macaque V1. Neuron, 38(1), 103–114.
Zaghloul, K. A., & Boahen, K. (2004a). Optic nerve signals in a neuromorphic chip I: Outer and inner retina models. IEEE Transactions on Biomedical Engineering, 51(4), 657–666.
Zaghloul, K. A., & Boahen, K. (2004b). Optic nerve signals in a neuromorphic chip II: Testing and results. IEEE Transactions on Biomedical Engineering, 51(4), 667–675.
Zhu, Y., & Qian, N. (1996). Binocular receptive field models, disparity tuning, and characteristic disparity. Neural Comput., 8, 1611–1641.

Received August 18, 2003; accepted February 2, 2004.
LETTER
Communicated by Misha Tsodyks
Learning Classification in the Olfactory System of Insects

Ramón Huerta
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A., and GNB, Escuela Técnica Superior de Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain
Thomas Nowotny
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A.
Marta García-Sánchez
[email protected]
GNB, Escuela Técnica Superior de Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain, and Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A.
H. D. I. Abarbanel
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A., and Department of Physics and Marine Physical Laboratory, Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA 92093-0402, U.S.A.
M. I. Rabinovich
[email protected]
Institute for Nonlinear Science, University of California San Diego, La Jolla, CA 92093-0402, U.S.A.
We propose a theoretical framework for odor classification in the olfactory system of insects. The classification task is accomplished in two steps. The first is a transformation from the antennal lobe to the intrinsic Kenyon cells in the mushroom body. This transformation into a higher-dimensional space is an injective function and can be implemented without any type of learning at the synaptic connections. In the second step, the encoded odors in the intrinsic Kenyon cells are linearly classified in the mushroom body lobes. The neurons that perform this linear classification are equivalent to hyperplanes whose connections are tuned by local Hebbian learning and by competition due to mutual inhibition.

Neural Computation 16, 1601–1640 (2004) © 2004 Massachusetts Institute of Technology

We
calculate the range of values of activity and size of the network required to achieve efficient classification within this scheme in insect olfaction. We are able to demonstrate that biologically plausible control mechanisms can accomplish efficient classification of odors.

1 Introduction

Odor classification by nervous systems involves several quite different computational tasks: (1) similar odor receptions need to be classified as the same odor, (2) distinct odors have to be discriminated from each other, and (3) some quite different odors might carry the same meaning and therefore must be associated with each other. From studies of insect olfaction, we have developed a framework that may provide insights into how the olfactory system of insects accomplishes these different tasks with the available natural technology. Insects have three known processing stages of odor information before classification: the antenna, the antennal lobe (AL), and the mushroom body (MB) (see Figure 1 for a description). To put our work into the right context, we will describe the known facts about the system. Each olfactory receptor cell in the antenna expresses one type of receptor, and all olfactory receptor cells expressing the same receptor type connect to the same glomerulus in the AL (Gao, Yuan, & Chess, 2000; Vosshall, Wong, & Axel, 2000; Scott et al., 2001). Thus, a chemosensory map of receptor activity in the antenna is formed in the AL: the genetically encoded architecture induces a stimulus-dependent spatial code in the glomeruli (Rodrigues, 1988; Distler, Bausenwein, & Boeckh, 1998; Joerges, Kuettner, Galizia, & Menzel, 1997; Galizia, Joerges, Kuettner, Faber, & Menzel, 1997; Galizia, Nagler, Holldobler, & Menzel, 1998). Moreover, the spatial code is conserved across individuals of the same species (Galizia, Sachse, Rappert, & Menzel, 1999; Wang, Wong, Flores, Vosshal, & Axel, 2003), as would be expected given the genetic origin of the code.
There are many fewer neurons in the first relay station (the AL) than there are receptor types, by a ratio of 1:10. The reasons for such a strong convergence are not well understood and remain an interesting theoretical question for future study. It is known that increasing odor concentrations recruit increasing numbers of glomeruli (Ng et al., 2002; Wang et al., 2003). A simple transduction of the glomerular activity would therefore increase the number of active neurons in the MB with increasing odor concentration as well. Calcium recordings in the MB of Drosophila show, however, that the activity in the MB indicated by calcium concentrations is independent of odor concentration for the odors ethyl acetate and benzaldehyde (Wang et al., 2001). Moreover, recent recordings from the AL in the locust indicate that the activity in the projections of the excitatory neurons in the AL into the MB is nearly constant in time (Stopfer, Jayaraman, & Laurent, 2003). Therefore, a gain control mechanism maintaining a nearly constant average neuronal activity in the AL
Figure 1: Description of the structural organization of the first few processing layers of the olfactory system of insects, including a list of what is known in terms of coding, odor concentration dependence, and learning of odor conditioning.
must exist. It seems that the AL performs some preprocessing of the data to feed an adequate representation of it into the area of the insect brain that is responsible for learning odor conditioning, the MB. Although the MBs are not critical for normal behavior of insects, they are crucial for odor conditioning (Heisenberg, Borst, Wagner, & Byers, 1985). Genetic manipulation has proved to be a powerful tool to identify the MB as the locus for learning (de Belle & Heisenberg, 1994; Zars, Fischer, Schulz, & Heisenberg, 2000; Pascual & Preat, 2001; Dubnau, Grady, Kitamoto, & Tully, 2001; McGuire, Le, & Davis, 2001; Connolly et al., 1996; Menzel, 2001; Heisenberg, 2003; Dubnau, Chiang, & Tully, 2003). Dubnau and colleagues (2001) propose that "the Hebbian processes underlying olfactory associative learning reside in mushroom body dendrites or upstream of the mushroom body and that the resulting alterations in synaptic strength modulate mushroom body output during memory retrieval." This is an important idea that supports our hypothesis that the important synaptic changes to achieve
classification of odors have to occur in the connections from the intrinsic KCs to the MB lobes, not earlier or later. We propose a theory of odor classification that relies on a large coding display in the MB, random connectivity among neurons, Hebbian learning, and mutual inhibition. We show that these elements of an odor identification network are sufficient to accomplish the classification tasks described above (see Figure 1). Our main hypothesis, as stated in Figure 1, is that the classification decision occurs in the MB lobes that receive inputs from the intrinsic KCs. This hypothesis is suggested by the concept of support vector machines (SVMs) (Cortes & Vapnik, 1995) and Cover's work (Cover, 1965), which were developed in the context of the classification problem based on linear threshold devices when the input classes are not linearly separable. The strategy consists of casting the input into a high-dimensional space by means of a nonlinear transformation and then using linear threshold devices for classification in this higher-dimensional representation. The nonlinear transformation is designed to separate the previously inseparable input classes in the new high-dimensional space. In this article, we consider the AL as the input space, the intrinsic KCs form the high-dimensional space, and the neurons in the MB lobes are the linear threshold devices that can separate the classes. The experimental support for having such an SVM is based on the known large divergence of connections from the AL to the MB and the observed role of the MB as the focus of learning odor conditioning. Let us elaborate on the necessary processing layers of the classification circuit and our hypotheses about these different stages of information processing. We consider the AL as the input layer to the classification layers in the MB. Figure 2 shows the proposed classification scheme.
The first processing stage is the nonlinear transformation of the information contained in the AL activity into the large screen of intrinsic KCs. Our hypothesis for this stage is that the mapping of activity patterns resulting from the transformation from the AL to the MB should be statistically injective (Garcia-Sanchez & Huerta, 2003). Injectivity means that if two different states of the AL, arising from distinct odors, are projected into the screen of intrinsic KCs, the resulting states in the intrinsic KC layer should be distinct as well. The next hypothesis is that once the odor representations in the AL have been widely separated in the intrinsic KC layer, they can be classified at the next processing stage using simple linear threshold devices (see Figure 2); we interpret the neurons of the MB lobes as such linear threshold devices. At this stage, we then employ biologically feasible learning mechanisms to allow memory formation and self-organization in this part of the network. In addition to showing that the developed framework indeed allows successful classification, we also determine the range of possible values for the sparseness of activity in the intrinsic KC layer. The results are consistent with experimental recordings of intrinsic KCs in the MB of locust (Perez-Orive et al., 2002) and thus give a theoretical explanation for the observed
Figure 2: Main elements required to account for efficient classification in the insect olfactory system. The first stage is an injective function from the AL to the intrinsic KC layer using nonspecific connectivity matrices (left). By nonspecific, we mean that the connectivity matrix does not need to be learned or be genetically encoded in detail. For classification purposes, a large number of neurons in the intrinsic KC layer and a sparse code are needed, as we will show. The second phase (linear classification) is the decision phase, where linear classification takes place. It is characterized by converging connections, Hebbian learning, and mutual inhibition between MB lobe neurons.
sparseness of the code. Finally, we demonstrate that the large size of the display or screen in the intrinsic KC layer is critical for efficient classification, and we quantify the interdependence of storage capacity and KC layer size. Previous theoretical studies have revealed the possibility of using the temporal dynamics of the first relay station of olfactory processing for odor recognition (Hendin, Horn, & Hopfield, 1994; Hendin, Horn, & Tsodyks, 1998; White, Dickinson, Walt, & Kauer, 1998; Li & Hertz, 2000). White et al. (1998) propose delay lines to improve discrimination between odors. Hendin et al. (1994, 1998) show how the olfactory bulb (equivalent to the AL) can, in principle, implement an associative memory and segmentation of odors. Li and Hertz (2000) emphasize the importance of the feedback from the cortex (equivalent to the MB, the location of odor recognition) to the olfactory bulb (equivalent to the AL) for odor segmentation. In this work, we do not consider time or odor segmentation. Time is important to enhance the separation of similar odors (Friedrich & Laurent, 2001; Linster & Cleland, 2001; Laurent, 2002; Laurent et al., 2001). The problem of segmentation addresses the question of recognizing odor A or odor B when
encountering a mixture of both. Both the role of time and the problem of odor segmentation require detailed knowledge of the AL representation of odors and mixtures of odors. To date, there is no theoretical work that addresses these issues, especially the code for mixtures of odors in the AL, in a satisfactory way. We therefore restrict our analysis to the spatial aspects of the representation of pure odors in an appropriate sense. The model that we present in this article is, to our knowledge, the first to address quantitatively the levels of neural activity, the convergent and divergent connectivities, and the function of the MB in odor classification. The role of time is left for further investigation. We need to understand the limitations of spatial code processing before exploring the advantages of using time in the neural code. The organization of the letter is as follows. We start with a description of the elements of the system. Then we present two analyses: an analytical description of a single decision neuron and a systematic numerical analysis demonstrating that the AL and MB structure together with Hebbian learning and mutual inhibition account for efficient classification.

2 Description of the Processing Stages

As mentioned above and shown in Figure 2, there are two essential stages in our model of odor classification: a nonlinear transformation followed by a linear classification. We will see that the similarities of this core concept of SVMs to the observed biological system are striking, even though the specific implementation is different because of the actual composition, the available "wetware," of the insect olfactory network.

2.1 Nonlinear Transformation from the AL to the MB. Recently, strong evidence has been collected in the locust (Perez-Orive et al., 2002) that the spatiotemporal activity in the AL is read by the MB as pictures or snapshots.
The two observations supporting this hypothesis are the strong feedforward inhibition from the lateral horn interneurons onto the Kenyon cells and the short integration time of the intrinsic KCs. The periodic strong inhibition resets the activity in the KC layer every 50 ms, whereas the short integration time of the KCs makes them coincidence detectors that cannot process input across more than one inhibition cycle. Because we are not addressing the temporal aspects of the system in this work, the input for our classification system is a single snapshot. The hypothesis for the nonlinear transformation from the AL to the MB is then that every such snapshot or code word in the AL has a unique corresponding code word in the MB: the nonlinear transformation needs to be an injective function, at least in a statistical sense. In previous work (Garcia-Sanchez & Huerta, 2003), we proposed a method to select the parameter values that allow one to construct such an injective function from the AL to the intrinsic KC layer with very high probability. The appropriate parameters are used
throughout this article for designing the nonlinear transformation from the AL to the MB. It is known that the activity among intrinsic KCs is very sparse (Perez-Orive et al., 2002). Most of the intrinsic KCs fire just one or two spikes during one odor presentation of a few seconds. Given this low endogenous activity of the intrinsic KC neurons, we chose simple McCulloch-Pitts "neurons" (McCulloch & Pitts, 1943) to represent them. The state of this model neuron is described by a binary number (0 = no spike and 1 = spike). In particular, the McCulloch-Pitts neuron is described by

y_j = \Theta\left( \sum_{i=1}^{N_{AL}} c_{ji} x_i - \theta_{KC} \right), \quad j = 1, 2, \ldots, N_{KC}.  (2.1)
Here x is the state vector of the AL neurons. It has dimension N_AL, where N_AL is the number of AL neurons. The components of the vector x = [x_1, x_2, \ldots, x_{N_{AL}}] are 0s and 1s. y is the state vector for the intrinsic KC layer; it is N_KC-dimensional. The c_{ji} are the components of the connectivity matrix, which is N_KC × N_AL in size; its elements are also 0s and 1s. θ_KC is an integer that gives the firing threshold of an intrinsic KC. The Heaviside function Θ(·) is unity when its argument is positive and zero when its argument is negative. To determine the statistical degree of injectivity of the connectivity between the AL and the intrinsic KCs, we first calculate the probability of having identical outputs given different inputs for a given connectivity matrix: P(y = y′ | x ≠ x′, C), where C is one of the possible connectivity matrices (see Garcia-Sanchez & Huerta, 2003, for details) and the notation x ≠ x′ stands for {(x, x′) : x ≠ x′}. We want this probability, which we call the probability of confusion, to be as small as possible, on average, over all inputs and over all connectivity matrices. We write this average as

P(\text{confusion}) = \left\langle \left\langle P(y = y' \mid x \neq x', C) \right\rangle_{x \neq x'} \right\rangle_C,

where ⟨·⟩_{x≠x′} is the average over all nonidentical input pairs (x, x′), and ⟨·⟩_C is the average over all connectivity matrices C. This gives us a measure of clarity, the opposite of confusion, as

I = 1 - P(\text{confusion}).  (2.2)
The closer I is to 1, the better is our statistically injective transformation from the states x of the AL to the states y of the intrinsic KCs. There are two parameters of the model that can be adjusted using the measure of clarity. One is the probability p_C of having a connection between a given neuron in the AL and a given intrinsic KC. The second is the firing threshold θ_KC of the intrinsic KCs. Fixed parameters in the model are the probability p_AL of having an active neuron in the AL layer, the number N_AL of input neurons, and the number N_KC of intrinsic KCs. p_C and θ_KC can be
1608
R. Huerta et al.
estimated using the following inequality,

$$I \le 1 - \left\{p_{KC}^2 + (1 - p_{KC})^2 + 2\sigma^2\right\}^{N_{KC}}, \tag{2.3}$$

where p_KC is the firing probability of a single neuron in the intrinsic KC layer. It can be calculated for inputs and connection matrices generated by a Bernoulli process with probabilities p_AL and p_C as

$$p_{KC} = \sum_{i=\theta_{KC}}^{N_{AL}} \binom{N_{AL}}{i} (p_{AL} p_C)^i (1 - p_{AL} p_C)^{N_{AL} - i}. \tag{2.4}$$
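Equations 2.3 and 2.4 can be evaluated directly; a minimal sketch (function names are ours, and the variance term σ² is taken as an input rather than derived):

```python
from math import comb

def p_kc(N_AL, p_AL, p_C, theta_KC):
    """Equation 2.4: probability that one intrinsic KC fires, when AL activity
    and AL->KC connections are independent Bernoulli(p_AL) and Bernoulli(p_C)."""
    p = p_AL * p_C  # a given AL neuron is both active and connected
    return sum(comb(N_AL, i) * p**i * (1 - p)**(N_AL - i)
               for i in range(theta_KC, N_AL + 1))

def clarity_bound(p, N_KC, sigma2=0.0):
    """Equation 2.3: upper bound I <= 1 - {p^2 + (1-p)^2 + 2*sigma^2}^N_KC."""
    return 1.0 - (p**2 + (1.0 - p)**2 + 2.0 * sigma2) ** N_KC
```

With θ_KC = 0 every iKC fires, so p_KC = 1; raising the threshold lowers the firing probability, which is the knob (together with p_C) that regulates sparseness.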
This probability has variance σ² (the term above) when we average over all possible inputs and connectivity matrices. The formula for the probability of confusion can be intuitively understood if we assume that the activity of every intrinsic KC is statistically independent of the activity of the others. If so, the probability of confusion in one output neuron is the sum of the probability of having a one for two inputs plus the probability of having a zero for both: p_KC² + (1 − p_KC)². Thus, the probability of confusion in all N_KC output neurons is (p_KC² + (1 − p_KC)²)^{N_KC} in the approximation of independent inputs. This bound on I should be close to unity for any set of parameter values we choose. The inequality for the measure of clarity becomes an equality for sparse connectivity matrices. This, fortunately, is the case in the locust, where every intrinsic KC receives an average of only 10 to 20 connections from the AL, where nearly 900 are possible.

2.2 Linear Classification in the MB Lobes. As shown in Figure 2, we hypothesize that the classification decision takes place at the MB lobes. The neurons in the MB lobes are again modeled by McCulloch-Pitts neurons, which are simple linear threshold devices,

$$z_l = \Theta\left(\sum_{j=1}^{N_{KC}} w_{lj} \cdot y_j - \theta_{LB}\right), \quad l = 1, 2, \ldots, N_{LB}. \tag{2.5}$$
Here, the index LB denotes the MB lobes. The vector z is the state of the MB lobes; it has dimension N_LB. θ_LB is the threshold for the decision neurons in the MB lobes. The N_KC × N_LB connectivity matrix w_{lj} has entries 0 or 1. We introduce a plasticity rule for the entries w_{lj} below. Every column vector w_l (w_{lj} for fixed l) of a connectivity matrix defines a hyperplane in the intrinsic KC layer coordinates y (see Figure 3 for an illustration of this). This is the plane that is normal to w_l. This hyperplane discriminates between the points that are "above it" and the points that are "below it." There is a different hyperplane for each MB lobe neuron, and the combinatorial placement of the planes determines the classification space.
Learning Classification in the Olfactory System of Insects
1609
Figure 3: Hypothesized functional diagram of the olfactory system of insects. The data of the AL are encoded in snapshots. Every snapshot is a binary code of spikes (1) or no spikes (0). The projection from the AL to the intrinsic KC layer separates the encoded odors to be classified. The connectivity matrix from the intrinsic KC layer to the MB lobes defines a set of hyperplanes w that are able to discriminate between different sets of encoded odors. Note that without the projection into a higher-dimensional space, it would not be possible to linearly discriminate the dark points from the light ones.
2.2.1 Local Synaptic Changes: Hebbian Learning. Another key ingredient in addressing the classification problem is a learning mechanism. Hebbian learning is the classical choice for local synaptic changes (Hebb, 1949). The resulting local synaptic modifications are made at the linear classification stage (see Figure 2) and efficiently solve the classification problem. Hebbian plasticity is used to adjust the connectivity matrix w_{lj} from the intrinsic KC layer to the MB lobes. No other areas need to be involved in learning for our model of classification. The plasticity rule is applied by first choosing a connectivity matrix with some randomly chosen initial entries. Then inputs are presented to the system. The entries of the connectivity matrix at the time of the nth input are denoted by w_{lj}(n). The values after the next input, w_{lj}(n + 1), are given by the rule

$$w_{lj}(n+1) = H\left(z_l, y_j, w_{lj}(n)\right),$$

where

$$H(1, 1, w_{lj}(n)) = \begin{cases} 1 & \text{with probability } p_+ \\ w_{lj}(n) & \text{with probability } 1 - p_+ \end{cases} \tag{2.6}$$
$$H(1, 0, w_{lj}(n)) = \begin{cases} 0 & \text{with probability } p_- \\ w_{lj}(n) & \text{with probability } 1 - p_- \end{cases} \tag{2.7}$$

$$H(0, 1, w_{lj}(n)) = w_{lj}(n), \qquad H(0, 0, w_{lj}(n)) = w_{lj}(n).$$
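A minimal sketch of the stochastic rule in equations 2.6 and 2.7 for a single output neuron (function and variable names are our own):

```python
import random

def hebbian_update(w, y, z, p_plus, p_minus, rng):
    """One presentation of the rule in equations 2.6-2.7 for one output neuron.

    w: current 0/1 weight vector from the iKCs to this output neuron.
    y: 0/1 activity vector of the iKC layer; z: 0/1 output of the neuron.
    """
    w = list(w)
    if z == 1:
        for j, yj in enumerate(y):
            if yj == 1 and rng.random() < p_plus:
                w[j] = 1        # input and output both active: potentiate
            elif yj == 0 and rng.random() < p_minus:
                w[j] = 0        # output active, input silent: depress
    return w                    # z == 0: synapses are left unaltered
```

Applied repeatedly to one input with z = 1, the weight vector converges to the input pattern itself on the timescales 1/p_+ and 1/p_−, as described in the text.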
A synaptic connection is activated (it becomes 1 when it was 0) with probability p_+ if the input activity is accompanied by an output activity. The connection is removed (it becomes 0 if it was 1) with probability p_− if an output occurs in the absence of input activity. In the remaining cases, no input and no output, and input but no output, the synapse remains unaltered. For example, let us apply the plasticity rule to a given input y and one output z = 1. This is the case that we primarily analyze in the next section. It can be easily calculated that the average number of iterations it takes for the connection w_j to become 1 if y_j = 1 and w_j = 0 is 1/p_+. If w_j was already 1, then the number of iterations is 0. On the other hand, the average number of iterations it takes for the connection w_j to become 0 if y_j = 0 and w_j = 1 is 1/p_−. In other words, the inverses of the probabilities p_+ and p_− are just the timescales of synaptic changes. Therefore, if we apply this rule sufficiently long for one given input activity at the intrinsic KC layer and an active response z = 1 in an output neuron, the set of active connections will eventually be the set of active inputs itself.

2.2.2 Mutual Inhibition: n_W-Winner-Take-All. We hypothesize that mutual inhibition exists and, in joint action with Hebbian learning, is able to organize a nonoverlapping response of the decision neurons in the MB lobes. This is often considered a neural mechanism for self-organization in the brain. The combination of Hebbian learning and mutual inhibition has already been proposed as a biologically feasible mechanism to account for learning in neural networks (O'Reilly, 2001). Mutual inhibition is implemented artificially in the finite automata model with McCulloch-Pitts neurons. We allow only the subset of decision neurons that receive the highest synaptic input to fire. The size of this subset is fixed to allow n_W winners in a winner-take-all configuration for every odor presentation. Let us define the vector u with components u_μ = Σ_{j=1}^{N_KC} w_{μj} y_j − θ_LB. Then equation 2.5 is rewritten as

$$z_l = \Theta\left(\sum_{j=1}^{N_{KC}} w_{lj} y_j - \theta_{LB} - \{u\}_{n_W+1}\right), \tag{2.8}$$

where {u}_{n_W+1} denotes the (n_W + 1)th largest component of the vector u. Since we force the system to have exactly n_W active extrinsic KCs, θ_LB does not play any role because it is canceled in equation 2.8, and we can simply
Table 1: Summary of the Parameters and Variables of All the Processing Layers.

             AL            AL→KC    KC                                        KC→lobe     MB lobe
Variables    x             c_{ji}   y                                         w_{lj}      z
Parameters   N_AL, p_AL    p_C      N_KC, θ_KC, p_KC(N_AL, p_AL, p_C, θ_KC)   p_+, p_−    N_LB, n_W
write

$$z_l = \Theta\left(\sum_{j=1}^{N_{KC}} w_{lj} y_j - \left\{\sum_{j=1}^{N_{KC}} w_{\mu j} y_j\right\}_{n_W+1}\right). \tag{2.9}$$
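Equation 2.9 can be sketched as follows (a direct translation with names of our own choosing; in this simple version, ties at the cutoff can leave fewer than n_W winners):

```python
def mb_lobe_response(W, y, n_w):
    """Equation 2.9: n_W-winner-take-all readout of the MB lobe layer.

    W: 0/1 weight matrix (one row per lobe neuron); y: 0/1 iKC activity.
    Only neurons whose synaptic drive exceeds the (n_W+1)-th largest drive fire,
    so the threshold theta_LB cancels out, as noted in the text.
    """
    u = [sum(w_j * y_j for w_j, y_j in zip(row, y)) for row in W]
    cutoff = sorted(u, reverse=True)[n_w]  # the (n_W+1)-th largest component
    return [1 if u_l > cutoff else 0 for u_l in u]
```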
The parameters and variables of the model are summarized in Table 1. The AL parameters are the number of neurons N_AL and the probability of firing p_AL. The connectivity probability from the AL to the MB is p_C. The parameters of the intrinsic KC layer are the number of neurons N_KC and the threshold for activation θ_KC. The firing probability p_KC is a function of N_AL, p_AL, p_C, and θ_KC. The dynamics of the connections from the intrinsic KC layer to the MB lobes is governed by the probabilities (timescales) for increasing and decreasing connection strength, p_+ and p_−. Finally, the response of the decision layer depends on the number of allowed active neurons in the MB lobes, n_W. For clarity, we also include a summary of the equations of the model in Table 2.

Table 2: Summary of the Equations.

Intrinsic KC layer:     y_j = Θ( Σ_{i=1}^{N_AL} c_{ji} x_i − θ_KC )
Mushroom body lobes:    z_l = Θ( Σ_{j=1}^{N_KC} w_{lj} y_j − { Σ_{j=1}^{N_KC} w_{μj} y_j }_{n_W+1} )
Plasticity rule:        w_{lj}(n+1) = 1 with p_+ if y_j = 1 and z_l = 1;
                        0 with p_− if y_j = 0 and z_l = 1;
                        w_{lj}(n) in all other cases

Note: The index n represents the iteration number for presentations of a snapshot of AL activity to the intrinsic KCs.

3 Results

In the following sections, we divide our analysis into two parts. First, we analyze the parameter values that allow classification with one single output neuron. For this analysis, we use basic probability theory in
a similar framework to that of Caticha, Palo Tejada, Lancet, and Domany (2002). In principle, Shannon information could be employed as well, as in Nadal (1991) and Frolov and Murav'ev (1993) for associative memories. For our purposes, basic probability theory yields straightforward results that are easy to interpret. For simplicity, we will use a simple convention to separate the classification problem into two subproblems, which we call discrimination and association:

• The discrimination task is the ability of a single output neuron to distinguish one input from the rest, assuming uncorrelated input patterns in the antennal lobe. The probability space of AL activity will be unchanged throughout our analysis.

• The association problem is defined as the capability of a single output neuron to fire for d given uncorrelated inputs while not responding to the rest. The discrimination problem can be seen as a special case of the association problem with d = 1.

In the second part of the results, we present simulations of a complete system with many output neurons that organizes via the plasticity rule into winner-take-all configurations. This analysis differs from the previous one in that the system self-organizes without supervision.

3.1 Discrimination for a Single Output Neuron. We consider a single output neuron in the MB lobes, usually referred to as an extrinsic Kenyon cell (eKC). We will show how the ability of this neuron to discriminate one odor from the rest depends on the total number of neurons in the intrinsic KC layer and the level of KC activity. The obtained discrimination ability can be enhanced by redundancy, for example, by using several decision neurons. The single-neuron calculations presented here are in this sense a lower bound on the ability of the system to perform discrimination. The application of error-correcting codes (Hamming, 1950) based on redundancy will be investigated in future work.
The goal of the investigation presented in this section is to obtain exact quantitative statements, rather than general trends, in order to compare our results to experimental findings. Let us proceed with the analysis as illustrated in Figure 4. As a first step, we need to calculate the probability distribution of the number of active intrinsic KC neurons (iKCs). The expectation value for the number of active iKCs is given by E[n_KC] = N_KC p_KC, where N_KC is the number of iKCs and p_KC is given by equation 2.4 (see section A.1 in the appendix). The probability distribution for the number of active iKCs is, however, not binomial, despite the random connectivity matrix between the AL and the MB. It can be calculated using the same procedure as in Nowotny and Huerta (2003). The details are explained in the appendix. The probability distribution P(n_KC = k) cannot be simplified into a more compact form. However, it can be calculated for given p_C
[Figure 4 schematic: PNs in the antennal lobe project to iKCs in the mushroom body, which in turn drive an eKC with threshold θ_LB; the lower panels plot P(# active PNs) against the number of active PNs and P(# active iKCs) against the number of active iKCs.]
Figure 4: Characteristics of the first processing stage, the nonlinear expansion. The input probability space in the AL is fixed as a Bernoulli process with p_AL = 0.15 in 50-millisecond intervals, as observed experimentally. The resulting probability distribution, which is binomial, is shown in the lower left graph. The intrinsic KCs receive direct input from the PNs according to equation 2.1. The main parameter here is the probability of connection from the AL to the MB, p_C. By changing this parameter, we can regulate the expected activity in the intrinsic KC layer. The resulting probability distribution for the number of active iKCs is not a binomial distribution (dotted line in the lower right graph) but a much wider distribution (solid line in the lower right graph) with an expected value given by equation 2.4. Given this distribution regulated by p_C, we study the ability to discriminate uncorrelated inputs in the AL by using a single extrinsic KC. This investigation gives us a lower limit on the performance of this system that can be quantitatively compared to experimental data.
and p_AL and stored in the computer for further calculations. As shown in Figure 4, the distribution P(n_KC = k) is three to four times wider than the binomial distribution obtained if one assumes independence of firing events in the iKCs. The distributions in the figure were obtained for the typical parameter values of the locust. To quantitatively determine the ability of the system to discriminate, we then generate a set of random inputs x^0, x^1, …, x^N in the AL by independent identical Bernoulli processes with probability p_AL. In section 2.1, we give the conditions such that the map from the AL activity to the MB activity is an injective function. If these conditions are fulfilled, the set of random inputs in the AL corresponds (with probability close to one) uniquely to a set y^0, y^1, …, y^N of activity vectors in the MB. The activities n_KC(y^j) of the activity vectors obey the probability distribution given in equation A.4. The first pattern y^0 is then trained in one single classification output neuron: one neuron is forced to spike in response to this pattern, while for all other input patterns, the system acts autonomously. The Hebbian plasticity rule given in equations 2.6 and 2.7 generates a learned connectivity vector w = y^0 if the rule is applied on the order of max{1/p_−, 1/p_+} times. Given this learned vector w, we determine the structural parameters such that z(y^1) = z(y^2) = ⋯ = z(y^N) = 0 with probability close to one. The probability of discrimination P(z(y^1) = 0, z(y^2) = 0, …, z(y^N) = 0) is calculated in the appendix in section A.2. For successful discrimination, it needs to be as close to one as possible. To obtain the probability of discrimination, we basically estimate the degree of overlap of each of the uncorrelated inputs with the learned one. If the learned vector has l(w) ones, the probability of having i overlapping 1s between w and any given y^j is
$$p(i \text{ overlapping ones}) = \frac{\binom{l(w)}{i}\binom{N_{KC} - l(w)}{l(y^j) - i}}{\binom{N_{KC}}{l(y^j)}}. \tag{3.1}$$
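Equation 3.1 is a hypergeometric probability; a small sketch (function names are our own), together with the below-threshold probability used in the argument that follows:

```python
from math import comb

def p_overlap(i, l_w, l_y, N_KC):
    """Equation 3.1: probability of i overlapping 1s between the learned
    vector w (l_w ones) and a random activity vector y (l_y ones)."""
    return comb(l_w, i) * comb(N_KC - l_w, l_y - i) / comb(N_KC, l_y)

def p_silent(l_w, l_y, N_KC, theta_LB):
    """Probability that the overlap stays below the eKC threshold theta_LB,
    i.e., that the eKC does not fire for the wrong input."""
    i_max = min(theta_LB - 1, l_w, l_y)
    return sum(p_overlap(i, l_w, l_y, N_KC) for i in range(i_max + 1))
```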
The overlapping activity vectors are read by an eKC that has a threshold of activation, θ_LB. In order for the eKC not to spike in response to the wrong activity vector y^j, the probability of overlaps with i ≥ θ_LB has to be close to 0. It can quickly be seen that for sparse activity, the probability of overlaps decreases. The advantage of using a sparse code for associative memory was pointed out by Marr (1969) and others (Willshaw, Buneman, & Longuet-Higgins, 1969) 30 years ago, using nonoverlapping patterns of activity. In this work, we rigorously calculate the probability of discrimination in order to determine the minimum size of the intrinsic KC layer, the maximal and minimal activity in the intrinsic KC layer, the capacity of the system, and the ability to discriminate similar odors. This will allow us to compare our results to the real system on a quantitative level.
3.1.1 Dependence on the Intrinsic KC Layer Size. It is interesting to investigate the size of the intrinsic KC layer because of the contradiction that nature seems to have chosen in insect olfaction. On the one hand, the number of 50,000 intrinsic KCs in the locust is very large. On the other hand, the locust seems to use only a few of these cells. That raises the question: Is there an optimal size of the intrinsic KC layer that allows the system to work most efficiently? There are two main cases: keep p_KC constant, leading to an increasing number of active neurons n_KC = p_KC · N_KC with increasing size N_KC, or keep the number of active neurons n_KC constant by adjusting p_KC appropriately. Using constant p_KC is problematic because the learned vector will have too many active connections, due to the Hebbian plasticity rule, when N_KC is large. Since the threshold θ_LB of the decision neuron is not size dependent, this causes the neuron to fire for any input, and the network's discrimination ability drops drastically for large sizes. We therefore concentrate on the more interesting second case, ensuring a constant average number of active neurons by using equation 2.4 to modify p_C appropriately. Figure 5A shows the probability of discrimination for a total of N = 100 inputs, for various values of the (fixed) average number n_KC of active intrinsic KCs, as a function of the intrinsic KC layer size N_KC. For better discrimination, one obviously needs to increase the number of neurons in the intrinsic KC layer. Note that although the scale shown is logarithmic, one can see a relatively small region in which the system makes a transition from failure to function. This points to the possibility that there is a critical number of intrinsic KCs that can be quantitatively compared to experimental data. The most important result is that the minimum iKC layer size for successful discrimination scales as a power law of the average number of active iKCs.
The results are shown in Figure 5B for three different values of N. As one can see in the log-log plot, the exponent does not depend on N. When we fit these values, N_KC ∝ n_KC^{2.12}. We can theoretically estimate that N_KC ∝ n_KC² (this will be published elsewhere). However, we do not know at this point where the discrepancy between the exponents stems from, because we took severe limiting cases, like p_AL p_C → 0 and N_AL ≫ θ_KC, which has an effect on the scaling exponent. The implication of this result is that for discrimination purposes, if the internal representation in the intrinsic KC layer is large, the size of this layer needs to be even larger. The effects of noise can be strong, because noise would require increasing the number of active neurons, as explained in Garcia-Sanchez and Huerta (2003). We conjecture that the intrinsic KC neurons need to have very low endogenous noise.
3.1.2 What Is the Maximum Number of Active iKC That Allows the System to Discriminate? It has been experimentally observed that the representation
[Figure 5 panels: (A) P(z = 0 | N = 100) versus N_KC for n_KC = 20, 50, and 100; (B) the minimum N_KC versus n_KC on log-log axes.]
Figure 5: (A) Probability of discrimination for N = 100 inputs for several numbers of active neurons n_KC in the intrinsic KC layer, for a threshold value of activation in the MB lobe layer of θ_LB = 7. (B) Scaling law of the boundary of discrimination. Solid line: N = 1000 inputs; middle dashed line: N = 100; dashed line: N = 10. The exponent of the power law is 2.12, N_KC ∝ n_KC^{2.12}.
of information in the intrinsic KC layer of the locust is sparse (Perez-Orive et al., 2002). The percentage of iKC neurons that do not respond to the presentation of any given odor is 90% (the statistic was gathered with fewer than 100 neurons). This means that only ≈10% are active during 1 second of odor response. Therefore, the probability that an iKC will be active in a window of 50 milliseconds is p_KC ≈ 0.005. This means that on average, there are 250 active neurons in every 50 ms snapshot. If we compare this value to the theoretical sparseness values presented in Figure 6A, we obtain that the probability of discrimination is 97% for N = 10 uncorrelated inputs, 84% for N = 100, and 56% for N = 1000. As explained in section 3.1, these values can be improved by using redundant error-correcting-code techniques. It turns out that there is a mechanism that increases the boundary of successful discrimination and thus induces a better correspondence with the experimental observations. Let us assume in the following, for convenience of the analysis, that active connections from the iKCs to the extrinsic KC are initially very unlikely. Recall that our learning paradigm so far assumed that the learning time is much larger than max{1/p_−, 1/p_+}, such that w = y(x^0). For finite time, however, there is a nonzero probability that w ≠ y(x^0). The probability of having an active connection at time t during learning is p_t = 1 − (1 − p_+)^t, where t denotes the number of presentations of input x^0. This probability is valid only for the connections whose presynaptic neuron is active. Therefore, we can write

$$p(l(w)) = \sum_{i=l(w)}^{N_{KC}} \binom{i}{l(w)} p_t^{\,l(w)} (1 - p_t)^{i - l(w)} P(n_{KC} = i), \tag{3.2}$$
Figure 6: (A) Probability of discrimination for N = 1000 (solid line), N = 100 (middle dashed line), and N = 10 (dashed line) inputs as a function of the average number of active neurons n_KC in the intrinsic KC layer, for a threshold value of activation in the MB lobe layer of θ_LB = 7 and N_KC = 50,000 as in the locust. The experimental values for the locust are located near the boundary of the discrimination regime. (B) Effects of having less time to reinforce the positive odor. The solid line represents p_t = 1, the dotted line p_t = 0.9, and the dashed line p_t = 0.8. A limited time for learning increases the boundaries of the discrimination regime.
where P(n_KC = i) is calculated in the appendix. The probability p(l(w)) of having l(w) connections from the iKCs to an eKC is used in equations A.17 and A.22 in the appendix. If we recalculate the probability of discrimination, we obtain the results shown in Figure 6B. If the transition probability p_+ is decreased, the sparse region increases. So in this case, a limited time to learn the vector to be discriminated can increase the boundary of the parameter region for successful discrimination.

3.1.3 Discrimination Capacity. Given a set of structural parameters such as the threshold for activation θ_LB, the size N_KC of the intrinsic KC layer, and the level of activity p_KC, we try to determine the maximal number of inputs that can be discriminated; we determine the maximum value N_max of N such that P(z(y^1) = 0, …, z(y^N) = 0) ≥ 1 − ε, where ε is a fixed small error tolerance. In the appendix (section A.2.4), we calculate the upper bound for the capacity

$$N_{max} \le \frac{\log P(z(y^1) = 0, \ldots, z(y^{N-1}) = 0)}{\log P(z = 0)} \approx \frac{\epsilon}{P(z = 1)}. \tag{3.3}$$
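A numerical sketch of the approximation in equation 3.3 (names are our own; p1 stands for the per-input false-firing probability P(z = 1)):

```python
from math import log

def capacity_bound(p1, eps=1e-6):
    """Equation 3.3: maximal N with P(no false firing on N inputs) >= 1 - eps.

    Assuming independent inputs, P(z = 0)**N >= 1 - eps gives
    N <= log(1 - eps) / log(1 - p1), which is approximately eps / p1."""
    exact = log(1.0 - eps) / log(1.0 - p1)
    return exact, eps / p1
```

For small ε and small P(z = 1), the exact bound and the approximation ε/P(z = 1) agree closely, illustrating the statement that capacity is proportional to the inverse of the misclassification probability.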
The result shows that capacity is proportional to the inverse of the probability of misclassication. It is trivial to say that if we increase the probability of discrimination, we can increase the capacity to discriminate. However, it
is not trivial to conclude that the capacity to discriminate is proportional to the inverse of the probability of misclassification.

3.1.4 Resolution of Discrimination. In the previous sections, we dealt with an uncorrelated set of inputs in the AL. This is a good starting point for achieving good discrimination abilities. The next question is what the resolution of the insect olfactory system is when the inputs are highly correlated, or very similar. Let us consider an input x that is learned, as the vector x^0 was; that is, the representation of x in the intrinsic KC layer, y(x), corresponds to w. Now let us consider another vector x′, which is at distance l(x′ − x) = μ. This means that the vectors x′ and x differ in μ ones. Let us define f as the event that the input x produces a spike in a neuron of the intrinsic KC layer and f′ as the event that the input x′ produces a spike in the same neuron of the intrinsic KC layer. We want to determine the probability that the input x′ produces a spike given that the (similar) input x already produced a spike and that the distance between x′ and x is μ, l(x′ − x) = μ. As always, we also need to condition on a given number of active AL neurons. We denote the desired probability as q ≡ P(f′ | n_AL = k, f, l = μ). In the appendix (section A.2.5), we explain how to calculate q. After obtaining q, we can calculate whether there are sufficient 1s in the intersection of y(x′) and w = y(x) such that the eKC under consideration is above threshold for input x′. Given that there are n_KC = r active intrinsic KCs in y(x), the probability of having i 1s in the new vector y(x′) is

$$P(i \text{ coincident 1s} \mid n_{AL} = k, n_{KC} = r) = \binom{r}{i} q^i (1 - q)^{r - i}. \tag{3.4}$$

Therefore,

$$P(i \text{ coincident 1s}) = \sum_{k=0}^{N_{AL}} \sum_{r=i}^{N_{KC}} P(i \text{ coincident 1s} \mid n_{AL} = k, n_{KC} = r) \, P(n_{KC} = r \mid n_{AL} = k) \, P(n_{AL} = k), \tag{3.5}$$

where P(n_KC = r | n_AL = k) and P(n_AL = k) are given in the appendix (section A.1). We now have the tools to calculate the probability that z(x′) = 1 given that z(x) = 1 and l(x′ − x) = μ:

$$P(z(x') = 1 \mid z(x) = 1, l(x' - x) = \mu) = 1 - \sum_{i=0}^{\theta_{LB} - 1} P(i \text{ coincident 1s}). \tag{3.6}$$
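The conditional part of equations 3.4 and 3.6 can be sketched as follows (our own names; the full calculation of equation 3.5 additionally averages over n_AL and n_KC):

```python
from math import comb

def p_fire_similar(q, r, theta_LB):
    """Equations 3.4 and 3.6, conditioned on n_KC = r active iKCs in y(x):
    probability that the eKC also fires for the similar input x', where q is
    the per-neuron probability that an active iKC stays active for x'."""
    p_below = sum(comb(r, i) * q**i * (1 - q)**(r - i) for i in range(theta_LB))
    return 1.0 - p_below
```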
Figure 7 shows the ability of the system to discriminate similar inputs as a function of the separation of the input states and for different average levels of activity in the intrinsic KC layer. The system is able to separate or
[Figure 7 plots P(z = 1 | μ) against the distance μ = 0, …, 5 for n_KC = 50, 100, 200, 300, and 400.]
Figure 7: Ability to discriminate similar inputs as a function of the distance μ between the inputs, for different levels of average activity in the intrinsic KC layer. The probability of separating the first learned vector from a similar vector is high when the distance is larger than 2, 3, or 4, depending on the level of sparseness of the activity in the KC layer.
discriminate when the distance between the inputs is larger than 4, which is a rather small distance considering the dimension of the input space, that is, the number of AL neurons. We therefore conclude that the system has a great ability to discriminate similar inputs.

3.2 Association for One Single Output. In the general problem of association, one has a set of uncorrelated inputs in the intrinsic KC layer, y^1, y^2, …, y^d. All these uncorrelated inputs need to produce a 1 in the classification neuron, z(y) = 1, using Hebbian learning. The rule in equations 2.6 and 2.7 has two limiting cases depending on the choice of p_+ and p_−. The first limiting case is given by p_+ → 0 and p_− = 1. Note that in this case, every time there is a 0 in the input y_j, the connectivity w_j immediately drops to 0. Only the 1s occurring in all d inputs will produce a w_j = 1, in an average time 1/p_+. Therefore, the connectivity vector w is eventually equal to the intersection of all the inputs, that is, w = y^1 ∩ y^2 ∩ ⋯ ∩ y^d. The second limiting case is given by the extreme p_+ = 1 and p_− → 0. Following a similar argument as before, it can easily be seen that w = y^1 ∪ y^2 ∪ ⋯ ∪ y^d in an average time 1/p_−. Playing with p_+ and p_−, one can move w between these two limiting cases. Let us study the disadvantages of the first case, (p_+ → 0, p_− = 1). To allow any activation of the output neuron, the number of 1s in w needs to be greater than or equal to θ_LB. One therefore needs to find parameters such
that the probability of having fewer than θ_LB 1s in w is very small, that is, P(l(w) < θ_LB) = ε for some small given ε. Here, l(w) = Σ_j w_j is the number of 1s in w. The proper method to calculate this probability is explained in the appendix in section A.3. However, it is not possible to carry out due to its computational cost. Nevertheless, we can make a simple approximation that can contribute to clarifying the association problem. Let us approximate the probability distribution of the number of active neurons in the intrinsic KC layer by a binomial distribution with probability p_KC. Then P(l(w) < θ_LB) can easily be calculated as

$$P(l(w) < \theta_{LB}) = \sum_{i=0}^{\theta_{LB} - 1} \binom{N_{KC}}{i} (p_{KC}^d)^i (1 - p_{KC}^d)^{N_{KC} - i}. \tag{3.7}$$
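Equation 3.7 can be evaluated directly; a minimal sketch (function name is our own):

```python
from math import comb

def p_too_few_ones(p_kc, d, N_KC, theta_LB):
    """Equation 3.7: binomial approximation to the probability that the
    intersection of d uncorrelated inputs leaves fewer than theta_LB ones in w.

    A given iKC is active in all d patterns with probability p_kc**d."""
    p = p_kc ** d
    return sum(comb(N_KC, i) * p**i * (1 - p)**(N_KC - i)
               for i in range(theta_LB))
```

Because p_KC^d shrinks rapidly with d, this probability quickly approaches 1 unless p_KC is raised to unrealistic levels, which is the effect shown in Figure 8.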
Using this probability, one can calculate the necessary level of activity p_KC in the intrinsic KC layer as a function of d for a fixed tolerance ε. Figure 8 shows that in order to be able to associate even a few uncorrelated inputs, one needs to strongly increase the level of activity in the network, up to unrealistic levels. And this is not all. One also needs to make sure that one can still discriminate the learned inputs from another sequence of uncorrelated inputs y^{d+1}, y^{d+2}, …, y^{N+d+1} with the activity p_KC obtained from Figure 8. The way to measure this ability is to calculate P(z(y^{d+1}) = 0, z(y^{d+2}) = 0, …, z(y^{N+d+1}) = 0). How to calculate this probability follows from equation A.22. The result of this calculation is that for N = 1 and d = 2, P(z(y^{d+1}) = 0, …, z(y^{N+d+1}) = 0) = 10^{−7}, whereas for larger
Figure 8: Necessary activity level p_KC for an intrinsic KC layer of 50,000 neurons to be able to associate d independent inputs in the intersection scenario (p_KC plotted against d = 10 to 50). The threshold value of activation in the MB lobe layer was θ_LB = 7, and we used a tolerance level of ε = 10^{−6}. Details are explained in the text.
values of d, the probability is practically 0. This means that in principle, one can associate d uncorrelated inputs; however, one can then no longer distinguish them from the remainder of the N inputs. Therefore, the plasticity parameters that led to the intersection of the d inputs, (p_+ → 0, p_− = 1), are not suitable for association. Let us proceed to the second scenario, in which one has the union of the d inputs, (p_+ = 1, p_− → 0). In this training process, active states of the preceding stimulus are diffused to avoid the restrictive condition of the intersecting sets. One obtains a connectivity vector that is the union of all the training input vectors, w = y^1 ∪ y^2 ∪ ⋯ ∪ y^d. The probability distribution p(l(y^1 ∪ y^2 ∪ ⋯ ∪ y^d)) can in principle be calculated using the inclusion-exclusion principle to obtain l(∪_{1≤i≤d} y^i), namely,

$$l\Bigl(\bigcup_{1 \le i \le d} y^i\Bigr) = \sum_{1 \le i_1 \le d} l(y^{i_1}) - \sum_{1 \le i_1 < i_2 \le d} l(y^{i_1} \cap y^{i_2}) + \cdots + (-1)^{d+1} \, l\Bigl(\bigcap_{i=1}^{d} y^i\Bigr).$$

[…]

and 0 otherwise,

$$P\Bigl(\sum_j c_j x_j = \eta \,\Bigm|\, n_{AL} = k\Bigr) = \binom{k}{\eta} \, p_C^{\eta} (1 - p_C)^{k - \eta}, \tag{A.33}$$

and P(f | n_AL = k) is given by equation A.2.

A.3 P(l(y^1 ∩ y^2 ∩ ⋯ ∩ y^d) < θ_LB). The probability of having i_1 intersecting 1s given two vectors y^1 and y^2, whose lengths are l_1 and l_2, is

$$P(i_1 \mid l_1, l_2) = \frac{\binom{l_1}{i_1}\binom{N_{KC} - l_1}{l_2 - i_1}}{\binom{N_{KC}}{l_2}}. \tag{A.34}$$
The probability of having i_2 intersecting 1s of the vector y^3, given i_1 coincidences of the first two vectors, is

$$P(i_2 \mid l_1, l_2, l_3, i_1) = \frac{\binom{i_1}{i_2}\binom{N_{KC} - i_1}{l_3 - i_2}}{\binom{N_{KC}}{l_3}}. \tag{A.35}$$

Given that we know P(i_1 | l_1, l_2), we can now write

$$P(i_2 \mid l_1, l_2, l_3) = \sum_{i_1 = 0}^{\min\{l_1, l_2, l_3\}} \frac{\binom{i_1}{i_2}\binom{N_{KC} - i_1}{l_3 - i_2}}{\binom{N_{KC}}{l_3}} \, \frac{\binom{l_1}{i_1}\binom{N_{KC} - l_1}{l_2 - i_1}}{\binom{N_{KC}}{l_2}}. \tag{A.36}$$

We can continue this procedure to calculate the probability of having i_{d−1} 1s in the vector (y^1 ∩ y^2 ∩ ⋯ ∩ y^d), which is

$$P(i_{d-1} \mid l_1, l_2, \ldots, l_d) = \frac{1}{\prod_{j=2}^{d} \binom{N_{KC}}{l_j}} \sum_{\substack{i_{d-2}, i_{d-3}, \ldots, i_1 = 0 \\ i_{d-2} \le i_{d-3} \le \cdots \le i_1}}^{l_{min}} \binom{l_1}{i_1}\binom{N_{KC} - l_1}{l_2 - i_1} \prod_{j=1}^{d-2} \binom{i_j}{i_{j+1}}\binom{N_{KC} - i_j}{l_{j+2} - i_{j+1}}, \tag{A.37}$$

where l_min = min{l_1, l_2, …, l_d}.
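The recursion in equations A.34 to A.36 can be iterated numerically for small sizes; a sketch (function name is our own) that propagates the overlap distribution one vector at a time instead of forming the explicit d-fold sum of equation A.37:

```python
from math import comb

def p_intersection_sizes(lengths, N_KC):
    """Distribution of the number of common 1s of d random binary vectors of
    dimension N_KC with given numbers of ones (equations A.34-A.36)."""
    # Overlap distribution after the first two vectors (equation A.34).
    l1, l2 = lengths[0], lengths[1]
    dist = {i: comb(l1, i) * comb(N_KC - l1, l2 - i) / comb(N_KC, l2)
            for i in range(min(l1, l2) + 1)}
    # Fold in each further vector via equation A.35, as in equation A.36.
    for lk in lengths[2:]:
        new = {}
        for i_prev, p_prev in dist.items():
            for i in range(min(i_prev, lk) + 1):
                p = comb(i_prev, i) * comb(N_KC - i_prev, lk - i) / comb(N_KC, lk)
                new[i] = new.get(i, 0.0) + p_prev * p
        dist = new
    return dist
```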
The not-firing probability of the extrinsic KC is P(l(y^1 ∩ y^2 ∩ ⋯ ∩ y^d) < θ_LB). The conditional not-firing probability given (l_1, l_2, …, l_d) is

$$P(\text{no fire} \mid l_1, l_2, \ldots, l_d) = \sum_{i_{d-1} = 0}^{\theta_{LB} - 1} P(i_{d-1} \mid l_1, l_2, \ldots, l_d). \tag{A.38}$$

Finally,

$$P(l(y^1 \cap y^2 \cap \cdots \cap y^d) < \theta_{LB}) = \sum_{l_1, l_2, \ldots, l_d = 0}^{N_{KC}} P(\text{no fire} \mid l_1, l_2, \ldots, l_d) \prod_{j=1}^{d} P(n_{KC} = l_j), \tag{A.39}$$
where $P(n_{KC} = l_j)$ is given in section A.1. The main problem with this calculation is the computer time required to evaluate it. The minimum number of calls to the hypergeometric function based on the multiprecision library (Torbjorn, 2001) is $(N_{\max}(d-1)\theta_{LB})^d$. For the parameter values corresponding to the locust, we have $\theta_{LB} = 7$ and $N_{\max} = 2000$, where $N_{\max}$ is defined by the condition

$$\sum_{i=0}^{N_{\max}} P(n_{KC} = i) = 1 - \epsilon, \qquad (A.40)$$

with $\epsilon = 10^{-6}$. The number of associated inputs that we can analyze, $d \le 4$, is therefore rather small. An approximation is required to draw conclusions for larger d.

Acknowledgments

We are very grateful to Gilles Laurent for conversations about this work. We also thank Ildiko Aradi and Gabriel Mindlin for constructive criticisms. This work was partially supported by the National Science Foundation under grants NSF/EIA-0130708 and NSF PHY0097134, and the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Engineering and Geosciences, under grants DE-FG03-90ER14138 and DE-FG03-96ER14592. This work was also partially supported by M. Ciencia y Tecnología BFI20000157 (M.G.).

References

Amari, S. (1989). Characteristics of sparsely encoded associative memory. Neural Networks, 2, 451–457.
R. Huerta et al.
Buhmann, J. (1989). Oscillations and low firing rates in associative memory neural networks. Phys. Rev. A, 40(7), 4145–4148.
Caticha, C., Palo Tejada, J. E., Lancet, D., & Domany, E. (2002). Computational capacity of an odorant discriminator: The linear separability of curves. Neural Comput., 14, 2201–2220.
Connolly, J. B., Roberts, I. J., Armstrong, J. D., Kaiser, K., Forte, M., Tully, T., & O'Kane, C. J. (1996). Associative learning disrupted by impaired Gs signaling in Drosophila mushroom bodies. Science, 274(5295), 2104–2107.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
Cover, T. (1965). Geometric and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Elect. Comput., 14, 326.
de Belle, J. S., & Heisenberg, M. (1994). Associative odor learning in Drosophila abolished by chemical ablation of mushroom bodies. Science, 263, 692–695.
Distler, P. G., Bausenwein, B., & Boeckh, J. (1998). Localization of odor-induced neuronal activity in the antennal lobes of the blowfly Calliphora vicina: A [3H] 2-deoxyglucose labeling study. Brain Res., 805, 263–266.
Dubnau, J., Chiang, A. S., & Tully, T. (2003). Neural substrates of memory: From synapse to system. J. Neurobiol., 54(1), 238–253.
Dubnau, J., Grady, L., Kitamoto, T., & Tully, T. (2001). Disruption of neurotransmission in Drosophila mushroom body blocks retrieval but not acquisition of memory. Nature, 411(6836), 476–480.
Finelli, L. A., Haney, S., Bazhenov, M., Stopfer, M., Sejnowski, T. J., & Laurent, G. (2004). Effects of a synaptic learning rule on the sparseness of odor representations in a model of the locust olfactory system. Unpublished manuscript, Salk Institute, San Diego, CA.
Friedrich, R., & Laurent, G. (2001). Dynamical optimization of odor representations in the olfactory bulb by slow temporal patterning of mitral cell activity. Science, 291, 889–894.
Frolov, A. A., & Murav'ev, I. P. (1993). Informational characteristics of neural networks capable of associative learning based on Hebbian plasticity. Network, 4, 495–536.
Galizia, C. G., Joerges, J., Kuettner, A., Faber, T., & Menzel, R. (1997). A semi-in-vivo preparation for optical recording of the insect brain. J. Neurosci. Meth., 76, 61–69.
Galizia, C. G., Nagler, K., Holldobler, B., & Menzel, R. (1998). Odor coding is bilaterally symmetrical in the antennal lobes of honeybees (Apis mellifera). Eur. J. Neurosci., 10, 2964–2974.
Galizia, C. G., Sachse, S., Rappert, A., & Menzel, R. (1999). The glomerular code for odor representation is species specific in the honeybee Apis mellifera. Nature Neurosci., 2, 473–478.
Gao, Q., Yuan, B., & Chess, A. (2000). Convergent projections of Drosophila olfactory neurons to specific glomeruli in the antennal lobe. Nat. Neurosci., 3(8), 780–785.
Garcia-Sanchez, M., & Huerta, R. (2003). Design parameters of the fan-out phase of sensory systems. J. Comput. Neurosci., 15, 5–17.
Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 147–160.
Hebb, D. (1949). The organization of behavior. New York: Wiley.
Heisenberg, M. (2003). Mushroom body memoir: From maps to models. Nature Rev. Neurosci., 4, 266–275.
Heisenberg, M., Borst, A., Wagner, S., & Byers, D. (1985). Drosophila mushroom body mutants are deficient in olfactory learning. J. Neurogenet., 2, 1–30.
Hendin, O., Horn, D., & Hopfield, J. J. (1994). Decomposition of a mixture of signals in a model of the olfactory bulb. Proc. Natl. Acad. Sci. USA, 91(13), 5942–5946.
Hendin, O., Horn, D., & Tsodyks, M. V. (1998). Associative memory and segmentation in an oscillatory neural model of the olfactory bulb. J. Comput. Neurosci., 5(2), 157–169.
Hosler, J. S., Buxton, K. L., & Smith, B. H. (2000). Impairment of olfactory discrimination by blockade of GABA and nitric oxide activity in the honey bee antennal lobes. Behav. Neurosci., 114(3), 514–525.
Joerges, J., Kuettner, A., Galizia, C. G., & Menzel, R. (1997). Representations of odors and odor mixtures visualized in the honeybee brain. Nature, 387, 285–288.
Laurent, G. (2002). Olfactory network dynamics and the coding of multidimensional signals. Nature Rev. Neurosci., 3, 884–895.
Laurent, G., Stopfer, M., Friedrich, R. W., Rabinovich, M. I., Volkovskii, A., & Abarbanel, H. D. I. (2001). Odor encoding as an active, dynamical process: Experiments, computation, and theory. Annu. Rev. Neurosci., 24, 263–297.
Li, Z., & Hertz, J. (2000). Odour recognition and segmentation by a model olfactory bulb and cortex. Network Comput. Neural Syst., 11, 83–102.
Linster, C., & Cleland, T. A. (2001). How spike synchronization among olfactory neurons can contribute to sensory discrimination. J. Comput. Neurosci., 10(2), 187–193.
Marr, D. (1969). A theory of cerebellar cortex. J. Physiol., 202, 437–470.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
McGuire, S. E., Le, P. T., & Davis, R. L. (2001). The role of Drosophila mushroom body signaling in olfactory memory. Science, 293(5533), 1330–1333.
Menzel, R. (2001). Searching for the memory trace in a mini-brain, the honeybee. Learn. Mem., 8(2), 53–62.
Nadal, J. P. (1991). Associative memory: On the (puzzling) sparse coding limit. J. Phys. A, 24, 1093–1101.
Nadal, J. P., & Toulouse, G. (1990). Information storage in sparsely coded memory nets. Network, 1, 61–74.
Ng, M., Roorda, R. D., Lima, S. Q., Zemelman, B. V., Morcillo, P., & Miesenbock, G. (2002). Transmission of olfactory information between three populations of neurons in the antennal lobe of the fly. Neuron, 36(3), 463–474.
Nowotny, T., & Huerta, R. (2003). Explaining synchrony in feed-forward networks: Are McCulloch-Pitts neurons good enough? Biol. Cyber., 89(4), 237–241.
O'Reilly, R. C. (2001). Generalization in interactive networks: The benefits of inhibitory competition and Hebbian learning. Neural Comput., 13, 1199–1241.
O'Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade-off. Hippocampus, 4(6), 661–682.
Palm, G. (1980). On associative memory. Biol. Cyber., 36, 59–71.
Pascual, A., & Preat, T. (2001). Localization of long-term memory within the Drosophila mushroom body. Science, 294(5544), 1115–1117.
Perez-Orive, J., Mazor, O., Turner, G. C., Cassenaer, S., Wilson, R. I., & Laurent, G. (2002). Oscillations and the sparsening of odor representations in the mushroom body. Science, 297, 359–365.
Perez Vicente, C. J., & Amit, D. J. (1989). Optimised network for sparsely coded patterns. J. Phys. A, 22(5), 559–569.
Rodrigues, V. (1988). Spatial coding of olfactory information in the antennal lobe of Drosophila melanogaster. Brain Res., 453, 299–307.
Scott, K., Brady, R. Jr., Cravchik, A., Morozov, P., Rzhetsky, A., Zuker, C., & Axel, R. (2001). A chemosensory gene family encoding candidate gustatory and olfactory receptors in Drosophila. Cell, 104(5), 661–673.
Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390(6655), 70–74.
Stopfer, M., Jayaraman, V., & Laurent, G. (2003). Intensity versus identity coding in an olfactory system. Neuron, 39(6), 991–1004.
Torbjorn, G. (2001). GNU MP: The GNU multiple precision arithmetic library, version 4.0.1. Available on-line at: http://www.swox.com/gmp/.
Tsodyks, M. V., & Feigel'man, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhys. Lett., 6, 101–105.
Vosshall, L. B., Wong, A. M., & Axel, R. (2000). An olfactory sensory map in the fly brain. Cell, 102(2), 147–159.
Wang, Y., Wright, J. D., Guo, H.-F., Zuoping, X., Svoboda, K., Malinow, R., Smith, D. P., & Zhong, Y. (2001). Genetic manipulation of the odor-evoked distributed neural activity in the Drosophila mushroom body. Neuron, 29, 267–276.
Wang, J. W., Wong, A. M., Flores, J., Vosshall, L. B., & Axel, R. (2003). Two-photon calcium imaging reveals an odor-evoked map of activity in the fly brain. Cell, 112, 271–282.
Willshaw, D., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Non-holographic associative memory. Nature, 222, 960.
White, J., Dickinson, T. A., Walt, D. R., & Kauer, J. S. (1998). An olfactory neuronal network for vapor recognition in an artificial nose. Biol. Cyber., 78, 245–251.
Zars, T., Fischer, M., Schulz, R., & Heisenberg, M. (2000). Localization of a short-term memory in Drosophila. Science, 288, 672–675.

Received June 9, 2003; accepted January 30, 2004.
LETTER
Communicated by Erkki Oja
Adaptive Blind Separation with an Unknown Number of Sources

Ji-Min Ye
Key Lab for Radar Signal Processing and School of Science, Xidian University, Xi'an 710071, China

Xiao-Long Zhu
[email protected]

Xian-Da Zhang
[email protected]
Department of Automation, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China
The blind source separation (BSS) problem with an unknown number of sources is an important practical issue that is usually skipped by assuming that the source number n is known and equal to the number m of sensors. This letter studies the general BSS problem satisfying m ≥ n. First, it is shown that the mutual information of the outputs of the separation network is a cost function for BSS, provided that the mixing matrix is of full column rank and the m × m separating matrix is nonsingular. The mutual information reaches its local minima at the separation points, where the m outputs consist of n desired source signals and m − n redundant signals. Second, it is proved that the natural gradient algorithm proposed primarily for complete BSS (m = n) can be generalized to deal with the overdetermined BSS problem (m > n), but it would diverge inevitably due to the lack of a stationary point. To overcome this shortcoming, we present a modified algorithm, which can perform BSS steadily and provide the desired source signals at specified channels if some matrix is designed properly. Finally, the validity of the proposed algorithm is confirmed by computer simulations on artificially synthesized data.
1 Introduction

The blind source separation (BSS) problem consists of recovering statistically independent but otherwise unobserved source signals from their linear mixtures without knowing the mixing coefficients. This kind of blind technique has significant potential applications in various fields, such as wireless telecommunication systems, sonar and radar systems, image enhancement, speech processing, and biomedical signal processing.

Neural Computation 16, 1641–1660 (2004) © 2004 Massachusetts Institute of Technology
Since the pioneering work of Jutten and Hérault (1991), a variety of algorithms have been proposed for BSS (see, e.g., Amari & Cichocki, 1998; Sanchez A., 2002; Douglas, 2003; Hyvarinen, Karhunen, & Oja, 2001; Cichocki & Amari, 2002). Generally, the existing algorithms can be divided into four major categories: neural network–based algorithms (Karhunen, Oja, Wang, Vigario, & Joutsensalo, 1997; Amari & Cichocki, 1998; Cichocki, Karhunen, Kasprzak, & Vigario, 1999; Pajunen & Karhunen, 1998; Zhu & Zhang, 2002), density model–based algorithms (Comon, 1994; Amari, Cichocki, & Yang, 1996; Yang & Amari, 1997; Lee, Girolami, & Sejnowski, 1999; Choi, Cichocki, & Amari, 2000; Cao, Murata, Amari, Cichocki, & Takeda, 2003), algebraic algorithms (Cardoso, 1993; Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Hyvarinen & Oja, 1997; Lindgren & Broman, 1998; Pham & Cardoso, 2001; Feng, Zhang, & Bao, 2003), and information-theoretic algorithms (Bell & Sejnowski, 1995; Igual, Vergara, Camacho, & Miralles, 2003). Most of the algorithms for BSS in the literature assume that the number of sources is known a priori; typically, it is taken to be equal to the number of sensors. In practice, however, such an assumption does not often hold. The BSS problem with fewer sensors than sources is referred to as underdetermined or overcomplete BSS (Amari, 1999; Lewicki & Sejnowski, 2000). Since in this case at most a subset of the source signals can be successfully extracted unless additional prior information about the mixing matrix or source signals is at hand (Cao & Liu, 1996; Li & Wang, 2002), we focus on the overdetermined and determined BSS problem, where the number m of available mixtures is greater than or equal to the true number n of sources: m ≥ n. To deal with the overdetermined BSS problem with an unknown number of sources, several neural network architectures, together with the associated learning algorithms, have been presented (Cichocki et al., 1999).
When a prewhitening (sphering) layer is used to estimate the source number and reduce the dimension of the data vectors, separation results may be poor for ill-conditioned mixing matrices or weak source signals. Similar disadvantages exist if possible data compression takes place in the separation layer instead of the prewhitening layer. Alternatively, some neural algorithms proposed primarily for determined BSS are generalized to learn an m × m separating matrix, and it was observed via simulation experiments (Cichocki et al., 1999) that among the m outputs of the BSS network at convergence, n components are mutually independent or as independent as possible, and they are the desired rescaled or rearranged source signals; the remaining m − n components are copies of some independent component and thus redundant. Since a signal and its copy are coherent (in the ideal case) or highly correlated (in the practical case), one or more copies of a signal can be easily removed using decorrelation approaches. That is, a postprocessing layer is appended to the separation network for elimination of the redundant signals and determination of the number of active source signals.
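The decorrelation-based postprocessing step described above can be sketched as follows. This is a minimal illustration, not code from the letter; `drop_redundant` and its threshold are hypothetical names and values.

```python
import numpy as np

def drop_redundant(outputs, threshold=0.95):
    """outputs: (m, T) array of separated signals.  Keep a row only if its
    absolute correlation with every previously kept row stays below the
    threshold, so each recovered source survives exactly once."""
    kept = []
    for i in range(outputs.shape[0]):
        if all(abs(np.corrcoef(outputs[j], outputs[i])[0, 1]) < threshold
               for j in kept):
            kept.append(i)
    return outputs[kept]

# Toy check: two sources, three outputs, third output a rescaled copy.
rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))
y = np.vstack([s[0], s[1], 2.0 * s[0]])
print(drop_redundant(y).shape)  # -> (2, 1000)
```

The surviving row count is also an estimate of the number of active sources, which is the point of the postprocessing layer.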
Although extensive experiments show the above phenomenon on redundant signals, a complete theoretical proof has not been found in the literature. One of the goals of this article is to fill this gap. Additionally, we provide new insight into the existing mutual information of output components when used as a cost function of the overdetermined BSS problem. Another goal is to show that the natural gradient algorithm for BSS has no stationary point if there are more mixtures than sources, resulting in inevitable divergence. To perform BSS steadily, a modified algorithm is proposed here, which includes the orthogonal natural gradient algorithm (Amari, Chen, & Cichocki, 2000) as a special case and can provide the desired source signals at the first n channels if some matrix is designed properly. Section 2 presents the problem statement. Section 3 discusses the mutual information contrast function. In section 4, we analyze the behaviors of the natural gradient BSS algorithm in the overdetermined case and then propose a modified natural gradient algorithm. The effectiveness of the new approach is justified by computer simulations on artificially synthesized data in section 5. Finally, conclusions are set out in section 6.

2 Problem Statement

Assume that there exist n zero-mean source signals $s_1(t), \ldots, s_n(t)$ that are scalar valued and mutually statistically independent at each time instant t. For simplicity, we restrict ourselves to the noise-free case. Denote by the superscript T the transpose of a vector or matrix; then the available sensor vector $x_t = [x_1(t), \ldots, x_m(t)]^T$ is given by

$$x_t = A s_t, \qquad (2.1)$$
where $A \in \mathbb{R}^{m \times n}$ is a constant but unknown mixing matrix, and $s_t = [s_1(t), \ldots, s_n(t)]^T$ represents the vector of the unobserved source signals. According to the relation between the source number n and the sensor number m, the BSS problem can be divided into three cases: the determined/complete BSS (m = n), the overdetermined/undercomplete BSS (m > n), and the underdetermined/overcomplete BSS (m < n). For the difficult case where there are fewer mixtures than sources (Cao & Liu, 1996; Amari, 1999; Lewicki & Sejnowski, 2000; Li & Wang, 2002), separation may be achievable only in special instances; that is, not all the source signals can be successfully extracted. Therefore, we focus here on the overdetermined and determined BSS problem (m ≥ n) and make the further assumptions that the mixing matrix A is of full column rank and at most one of the source signals is allowed to be gaussian distributed (Tong, Liu, Soon, & Huang, 1991; Cao & Liu, 1996; Amari, Chen, & Cichocki, 1997; Amari & Cichocki, 1998). The task of the BSS problem is to reconstruct all the source signals from the observations subject to unknown probability distributions. For this purpose, standard neural and adaptive BSS approaches adjust the so-called separating matrix W such that the output vector

$$y_t = W x_t \qquad (2.2)$$
can provide the separated source signals. If the number n of sources is equal to that of sensors or is known a priori, then the BSS problem is easily handled (see, e.g., Amari & Cichocki, 1998; Zhang, Cichocki, & Amari, 1999). One can confine the separating matrix W in equation 2.2 to be an n × m matrix with rank n and update it according to some on-line learning rule until the output $y_t$ is a permuted or rescaled version of $s_t$, that is,

$$y_t = P D s_t, \qquad (2.3)$$

where P is a permutation matrix and D is a nonsingular diagonal matrix. The product of P and D is a generalized permutation matrix, which describes the order indeterminacy and scale ambiguity inherent in the BSS problem (Tong et al., 1991; Hyvarinen et al., 2001; Lu & Rajapakse, 2003). When the source number n is unknown, the BSS problem becomes much more difficult. Since this is the situation in many practical applications, we discuss such a general case. If the source signals are extracted one by one (Delfosse & Loubaton, 1995; Hyvarinen & Oja, 1997; Thawonmas, Cichocki, & Amari, 1998; Feng et al., 2003), then it is unnecessary to estimate the number of sources. However, sequential BSS approaches have some shortcomings too. Some of them require prewhitening of the observations, and the quality of the separated source signals would be increasingly degraded by accumulated errors from the deflation process. More important, signal extraction at subsequent extraction units cannot proceed until the current deflation process has converged to its equilibrium point, which is prohibitive in some real-time situations such as wireless communication systems. As to parallel algorithms that recover the source signals jointly, there are several schemes to tackle the unknown number of sources, including model selection criteria (Sato, 2001; Sugiyama & Ogawa, 2001) and the cross-validation technique (Cao et al., 2003). Cichocki et al. (1999) outline three other approaches. One applies a prewhitening layer to estimate the source number and reduce the dimension of the data vectors. The second method performs the possible data compression in the separation layer that follows the prewhitening layer. Both methods, however, suffer from poor separation results for ill-conditioned mixing matrices or weak source signals. In the third approach, some algorithms for determined BSS are employed directly to learn an m × m separating matrix W.
If it is realized that n components of the m-dimensional output vector $y_t$ are the original source signals or their rescaled versions, and the remaining m − n components consist of copies of some source signal (that is, all the source signals appear at least once in the network outputs), then the m − n redundant signals may be easily removed using decorrelation techniques. The idea of utilizing a postprocessing layer to drop the redundant signals and estimate the source number has the advantage that the structure of the separation network can remain unchanged when the number of sources varies over time, which is of key importance for some engineering applications. Clearly, the third approach is valid only on the condition that the m outputs are composed of n independent components and m − n redundant components. Although such a phenomenon has been reported by several researchers (Cichocki et al., 1999), a rigorous theoretical proof remains open.

3 Contrast Function

A contrast function is a function whose maximization leads to source separation under some restrictions (Comon, 1994; Moreau & Macchi, 1996; Moreau & Thirion-Moreau, 1999). There exist in the literature many contrast functions (contrasts) for BSS, including maximum likelihood contrasts (Cardoso, 1997; Pham & Garrat, 1997), mutual information contrasts (Comon, 1994; Amari et al., 1996; Yang & Amari, 1997), nonlinear principal component analysis contrasts (Oja, 1997; Karhunen, Pajunen, & Oja, 1998), and high-order statistics contrasts (Moreau & Macchi, 1996; Moreau & Thirion-Moreau, 1999; Cardoso, 1999). As a mainstream technique for the BSS problem, independent component analysis, or ICA (Comon, 1994), usually employs the mutual information of the outputs of the separation system as the cost function (minus contrast function).
The mutual information describes the dependence among the components of $y_t$, and it can be measured by the Kullback–Leibler divergence (KLD) (Cover & Thomas, 1991; Amari et al., 1996; Yang & Amari, 1997) between the joint probability density function (PDF) $p_Y(y_t)$ of $y_t$ and its factorized version $\tilde{p}_Y(y_t) = \prod_{i=1}^{m} p_{Y_i}(y_i(t))$, where $p_{Y_i}(y_i(t))$ is the marginal PDF of $y_i(t)$. For convenience, we omit the time instant t, and thus we have

$$I(y; W) = \mathrm{KLD}[p_Y(y) \,\|\, \tilde{p}_Y(y)] = \int_{-\infty}^{\infty} p_Y(y) \ln \frac{p_Y(y)}{\tilde{p}_Y(y)}\, dy, \qquad (3.1)$$

in which $\ln(\cdot)$ denotes the natural logarithm. Using the property of the KLD (Cover & Thomas, 1991), it is easy to show that $I(y; W) \ge 0$, with equality if and only if the $y_i$ ($i = 1, \ldots, m$) are mutually independent. When m = n, it is proved that $I(y; W)$ is a cost function of the BSS problem (Comon, 1994), meaning

$$I(y; W) = 0 \quad \text{iff} \quad WA = G, \qquad (3.2)$$
1646
J. Ye, X. Zhu, and X. Zhang
where G is an $n \times n$ generalized permutation matrix. Since the m components of y are linear combinations of the original sources, there are at most n independent components in y when m > n. Therefore, $I(y; W)$ is not equal to zero unless the outputs are composed of independent random signals and deterministic zero signals. When adaptive learning rules are applied to search for maxima of the contrast function, no output component of y can stay at zero; hence, we do not consider in this article the special case where some output component is a zero signal; namely, $I(y; W)$ is always greater than zero for m > n. Provided that the $m \times m$ separating matrix W is invertible and the $m \times n$ mixing matrix A is of full column rank, the global transfer matrix C = WA has rank n, which implies that there must exist an $n \times m$ submatrix $W_1$ of W such that $W_1 A$ is nonsingular. Without loss of generality, we suppose that $W_1$ is composed of the first n rows of W, and denote by $W_2$ the submatrix constituted by the remaining m − n rows (m > n). Let $\hat{y} = W_1 x = [y_1, \ldots, y_n]^T$ with the PDF $p_{\hat{Y}}(y_1, \ldots, y_n)$ and $\bar{y} = W_2 x = [y_{n+1}, \ldots, y_m]^T$ with the PDF $p_{\bar{Y}}(y_{n+1}, \ldots, y_m)$. One has

$$\frac{p_Y(y_1, \ldots, y_m)}{p_{Y_1}(y_1) \cdots p_{Y_m}(y_m)} = \frac{p_{\hat{Y}}(y_1, \ldots, y_n)}{p_{Y_1}(y_1) \cdots p_{Y_n}(y_n)} \cdot \frac{p_{\bar{Y}}(y_{n+1}, \ldots, y_m)}{p_{Y_{n+1}}(y_{n+1}) \cdots p_{Y_m}(y_m)} \cdot \frac{p_Y(y_1, \ldots, y_m)}{p_{\hat{Y}}(y_1, \ldots, y_n)\, p_{\bar{Y}}(y_{n+1}, \ldots, y_m)}. \qquad (3.3)$$
Substitution of equation 3.3 into equation 3.1 results in

$$I(y; W) = I(y_1, \ldots, y_n) + I(y_{n+1}, \ldots, y_m) + I(\{y_1, \ldots, y_n\}, \{y_{n+1}, \ldots, y_m\}), \qquad (3.4)$$

which is true as well for m = n + 1 if the convention $I(y_{n+1}) = 0$ is made.¹ By definition, $W_1 A$ is an $n \times n$ invertible matrix; therefore, $s = (W_1 A)^{-1} \hat{y}$, and thus $\bar{y} = W_2 A \cdot s = W_2 A (W_1 A)^{-1} \cdot \hat{y}$, which indicates that the conditional entropy

$$H(y_{n+1}, \ldots, y_m \mid y_1, \ldots, y_n) = 0. \qquad (3.5)$$

In addition, $I(\bar{y}, \hat{y}) = H(\bar{y}) - H(\bar{y} \mid \hat{y})$; it holds that

$$I(\{y_1, \ldots, y_n\}, \{y_{n+1}, \ldots, y_m\}) = H(y_{n+1}, \ldots, y_m). \qquad (3.6)$$
¹ Note that $I(y_{n+1}) \ne I(y_{n+1}; y_{n+1})$, since the right-hand-side term is actually the entropy of $y_{n+1}$, defined by $H(y_{n+1}) = I(y_{n+1}; y_{n+1}) = -\int_{-\infty}^{\infty} p_{Y_{n+1}}(y_{n+1}) \ln p_{Y_{n+1}}(y_{n+1})\, dy_{n+1}$, which is not equal to zero unless $y_{n+1}$ is a deterministic variable (Cover & Thomas, 1991).
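The identity $\bar{y} = W_2 A (W_1 A)^{-1} \hat{y}$ behind equation 3.5 is easy to confirm numerically. The sketch below (with arbitrary random full-rank matrices, not from the letter) shows that the last m − n outputs are a deterministic linear function of the first n, so their conditional entropy given $\hat{y}$ vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, T = 3, 2, 500
A = rng.standard_normal((m, n))          # mixing matrix, full column rank
W = rng.standard_normal((m, m))          # nonsingular separating matrix
s = rng.standard_normal((n, T))          # independent sources

x = A @ s
y = W @ x
W1, W2 = W[:n], W[n:]                    # first n rows / remaining rows
y_hat, y_bar = y[:n], y[n:]

# y_bar = W2 A (W1 A)^{-1} y_hat  (equation 3.5's deterministic relation)
reconstructed = W2 @ A @ np.linalg.inv(W1 @ A) @ y_hat
print(np.allclose(reconstructed, y_bar))  # -> True
```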
Then, using the property (Cover & Thomas, 1991; Comon, 1994; Yang & Amari, 1997)

$$I(y_{n+1}, \ldots, y_m) = \sum_{k=1}^{m-n} H(y_{n+k}) - H(y_{n+1}, \ldots, y_m), \qquad (3.7)$$

and combining equations 3.4, 3.6, and 3.7, we have

$$I(y; W) = I(y_1, \ldots, y_n) + \sum_{k=1}^{m-n} H(y_{n+k}). \qquad (3.8)$$
The first term on the right-hand side (RHS) of equation 3.8 attains its global minimum of zero if and only if $y_1, \ldots, y_n$ are mutually independent. On the other hand,

$$y_{n+k} = \sum_{j=1}^{n} [W_2 A]_{n+k,j}\, s_j, \qquad k = 1, \ldots, m-n, \qquad (3.9)$$

in which $B_{pq}$ denotes the (p, q)th entry of matrix B. Since a random variable that is the sum of several independent components has an entropy not less than that of any individual component (Cover & Thomas, 1991), that is,

$$H(y_{n+k}) \ge \max_{j \in \{1, \ldots, n\}} H([W_2 A]_{n+k,j}\, s_j), \qquad (3.10)$$

we can infer that the second term on the RHS of equation 3.8 reaches its local minima when each component $y_{n+k}$ is a copy of some source signal. Let $\xi_t$ be an m-dimensional vector whose first n components are the n source signals and whose remaining m − n components each consist of some source signal. Clearly, there are $n! \cdot n^{m-n}$ possible versions of $\xi_t$ in all, and they form the set $\Omega$. The mutual information is invariant to nonzero rescaling and reordering of its arguments, so the following lemma is straightforward.

Lemma 1. The mutual information $I(y_t; W)$ of the m output components of the network 2.2 with a nonsingular separating matrix W is a cost function for BSS with an unknown number of sources, and it attains a local minimum if and only if $y_t = G \xi_t$, where G is an $m \times m$ generalized permutation matrix and $\xi_t \in \Omega$.

Theorem 1. If the $m \times n$ mixing matrix A is of full column rank (m > n), then the mutual information $I(y_t; W)$ reaches local minima under the sufficient and necessary condition that the $m \times m$ separating matrix $W = G[(A^+)^T, N(A^T) + \bar{W}^T]^T$, in which G is an $m \times m$ generalized permutation matrix, $A^+$ is the pseudoinverse of the mixing matrix A, $N(A^T)$ denotes the $m \times (m-n)$ matrix constructed from the basis vectors of the null space of $A^T$, and $\bar{W}$ denotes an $(m-n) \times m$ matrix whose rows are made up of some row(s) of $A^+$.
Proof. Since the $m \times n$ mixing matrix A is of full column rank, $A^+ A = (A^T A)^{-1} A^T A = I$ is the $n \times n$ identity matrix, and $(N(A^T))^T A = (A^T N(A^T))^T = O$ is an $(m-n) \times n$ null matrix. Therefore, the separating matrix $W = G[(A^+)^T, N(A^T) + \bar{W}^T]^T$ implies that the output vector $y_t = W A s_t = G[s_t^T, (\bar{W} A s_t)^T]^T = G \xi_t$, and vice versa. Clearly, the theorem holds by lemma 1.

If the mixing matrix A is of full column rank and an $m \times m$ nonsingular separating matrix W is updated such that the mutual information is minimized, then at convergence the output $y_t$ contains n independent components, which are the separated source signals, and m − n redundant components. Such empirical findings were first reported by Cichocki et al. (1999). Here we have presented the theoretical justification behind them from the viewpoint of optimization.

4 A Modified Natural Gradient Algorithm

Based on the relation between the mutual information and the differential entropy (Cover & Thomas, 1991; Amari et al., 1996; Yang & Amari, 1997), the cost function 3.1 can be rewritten as

$$I(y; W) = \sum_{k=1}^{m} H(y_k) - H(y_1, \ldots, y_m). \qquad (4.1)$$
If the $m \times m$ separating matrix is confined to be invertible, then (Bell & Sejnowski, 1995; Yang & Amari, 1997)

$$H(y_1, \ldots, y_m) = H(x_1, \ldots, x_m) + \ln |\det(W)|, \qquad (4.2)$$

where $|\det(W)|$ is the absolute value of the determinant. Substituting equation 4.2 into 4.1 and applying the natural gradient learning rule (Amari, 1998)² or, equivalently, the relative gradient learning rule (Cardoso & Laheld, 1996),

$$\frac{dW}{dt} = -\frac{\partial I(y; W)}{\partial W} W^T W, \qquad (4.3)$$

to search for local minima of equation 4.1 or 3.8, one has the discrete-time natural gradient algorithm for BSS (Amari et al., 1996; Yang & Amari, 1997; Amari, 1998; Zhang et al., 1999):

$$W_{t+1} = W_t + \eta_t \big[ I - \phi(y_t) y_t^T \big] W_t, \qquad (4.4)$$

² The natural gradient on the Stiefel manifold (Amari, 1999), where the separating matrix is confined to be an orthonormal matrix instead of a nonsingular one, is given by $\frac{dW}{dt} = -\big[\frac{\partial I(y;W)}{\partial W} W^T - W \big(\frac{\partial I(y;W)}{\partial W}\big)^T\big] W$.
in which $\eta_t > 0$ is a learning rate, $\phi(y_t) = [\varphi_1(y_1), \ldots, \varphi_m(y_m)]^T$ is a nonlinearly transformed version of $y_t$, and

$$\varphi_i(y_i) = -\frac{1}{p_{Y_i}(y_i)} \cdot \frac{d p_{Y_i}(y_i)}{d y_i}, \qquad i = 1, \ldots, m, \qquad (4.5)$$

is referred to as the score function in BSS. From the derivation of learning rule 4.4, it can be seen that the natural gradient algorithm for the complete BSS problem (m = n) can be used to deal directly with the overdetermined BSS problem (m > n). Moreover, algorithm 4.4 is clearly independent of the source number n, so it works as well when the number of sources is unknown. At convergence, one local minimum of the cost function is achieved, and the outputs of the BSS network provide the desired source signals. However, the natural gradient algorithm 4.4 cannot be stable in such a state, since the stationary condition

$$E\{ I - \phi(y_t) y_t^T \} = O, \qquad (4.6)$$
in which $E\{\cdot\}$ is the expectation operator, does not hold when $y_t = G \xi_t$. To understand the behavior of algorithm 4.4 at the separation points, we consider the simplest case, where n < m ≤ 2n, the first n components of $y_t$ are the original source signals in order, and the remaining m − n components correspond respectively to the first m − n sources, that is,

$$y_i(t) = \begin{cases} s_i(t), & i = 1, \ldots, n \\ s_{i-n}(t), & i = n+1, \ldots, m. \end{cases} \qquad (4.7)$$
The outputs in equation 4.7 lead to³

$$E\{ I - \phi(y_t) y_t^T \} = -\big[\, I_{n+1}^{1}, \ldots, I_{m}^{m-n},\; 0^{m-n+1}, \ldots, 0^{n},\; I_{1}^{n+1}, \ldots, I_{m-n}^{m} \,\big] \ne O, \qquad (4.8)$$

where $I_p$ is the pth column vector of the $m \times m$ identity matrix I, $0^q$ is an $m \times 1$ zero vector, and the superscript q denotes the column index. Therefore,

$$E\left\{ \frac{dW}{dt} \right\} = -\eta_t \big[\, w_{n+1}^T, \ldots, w_m^T,\; 0^T, \ldots, 0^T,\; w_1^T, \ldots, w_{m-n}^T \,\big]^T, \qquad (4.9)$$
At convergence, the scaling of the separated source signals satises ’i .yi .t//yi .t/ D 1.
1650
J. Ye, X. Zhu, and X. Zhang
the same source signal influence each other. By theorem 1, these two rows may differ by some transposed basis vector of the null space of A^T. Hence, as long as the whole separating matrix updates within the equivalence class (Amari et al., 2000) defined by

C_W = { W̃ | W̃^T = (G A^+)^T + N(A^T) C },   (4.10)

where G, A^+, N(A^T), and W are the same as those in theorem 1, the BSS system can still provide the separated source signals (before divergence), although algorithm 4.4 has no stationary point. We stress that, owing to overflow of some rows of W caused by this accumulated inter-impact, the natural gradient algorithm 4.4 inevitably diverges if the learning rate η_t is not sufficiently small, which is verified by our simulation experiments. Moreover, although the above analysis is based on the simplest output model 4.7, similar conclusions hold when m > 2n or when several copies of one or more source signals appear at different output channels. To perform BSS steadily when there are more mixtures than sources, we modify the natural gradient algorithm as

W_{t+1} = W_t + η_t [R − φ(y_t) y_t^T] W_t,   (4.11)
where R is an m × m matrix. If R = I is the identity matrix, equation 4.11 is obviously the original natural gradient algorithm 4.4; however, that choice cannot work stably in the overdetermined case m > n. Recently, a nonholonomic orthogonal learning algorithm for BSS was presented (Amari et al., 2000), where

R = diag{φ_1(y_1(t)) y_1(t), …, φ_m(y_m(t)) y_m(t)}   (4.12)

is a diagonal matrix. The orthogonal natural gradient algorithm makes the redundant components decay to zero, and it is particularly efficient when the magnitudes of the source signals are intermittent or change rapidly over time (e.g., speech sounds, electroencephalogram signals). One drawback of this algorithm is that the redundant signals, although very weak in scale, may still be mixtures of several source signals. Furthermore, the magnitudes of some separated source signals may be very small as well; hence, it is sometimes difficult to discriminate them from the redundant signals. Besides the above two selections, we propose another choice of R, given by

R = E{φ(y_t) y_t^T}|_{y_t = G ξ_t},   (4.13)
which is clearly no longer a diagonal matrix if the number of sensors is greater than the number of sources.⁴ Unlike the diagonal matrix 4.12, a properly designed matrix 4.13 can control the positions of the redundant signals if the source number n is known a priori or has been estimated in advance.

Theorem 2. If the m × n mixing matrix A is of full column rank, S = [s_{t+1}, …, s_{t+K}] is an n × K full-row-rank matrix of the independent source signals (K ≥ n), and x_t = A s_t is the noise-free observation vector, then A^T has the same null space as the K × m data matrix X^T = [x_{t+1}, …, x_{t+K}]^T, and the source number n is equal to the rank of X.

Proof. For any m × 1 vector b that lies in the null space of A^T, denoted null(A^T), we have A^T b = 0. Using this together with X = AS, we get

X^T b = S^T A^T b = 0,   (4.14)

which implies b ∈ null(X^T). Conversely, for every b ∈ null(X^T), it follows from equation 4.14 and the full-row-rank assumption⁵ on S that A^T b = 0, which means b ∈ null(A^T). Since null(A^T) = null(X^T), it is immediate that the source number equals the rank of X.

By theorem 2, the source number can be determined directly from K samples of the noise-free observation vector. In practice, such a method performs satisfactorily for moderately large K, for example, K = m ~ 2m, if the sources are all stationary random signals. Next, we consider the design of R in equation 4.13. If the source number is unknown, we may start R with the identity matrix and then take it as an exponentially windowed average of equation 4.13, that is,

R_{t+1} = λ R_t + (1 − λ) φ(y_t) y_t^T,   (4.15)
where 0 < λ < 1 is a forgetting factor. If the source number is known a priori or has been estimated by theorem 2 or by other schemes (Sato, 2001; Sugiyama & Ogawa, 2001; Cao et al., 2003), we may exploit the magnitude ambiguity and order indeterminacy inherent in the BSS problem (Tong et al., 1991; Hyvarinen et al., 2001; Lu & Rajapakse, 2003) to design the proper matrix R such that the original source

⁴ When m = n, equation 4.13 simplifies to equation 4.12, and in this sense the proposed algorithm can be regarded as a generalized orthogonal natural gradient algorithm for BSS.
⁵ By statistical theory (Cramér, 1946), when K is sufficiently large, S = [s_{t+1}, …, s_{t+K}] has, with probability 1, the same rank as the correlation matrix R_ss = E{s_t s_t^T}. In the BSS problem, the source signals are assumed to be mutually independent, so S is of full row rank with probability 1 for sufficiently large K.
signals are separated at specified channels and the redundant signals are copies of the outputs of certain channels. For example, if the source number n satisfies n < m ≤ 2n, and it is desired that all the source signals be recovered at the first n channels while the other m − n redundant components are copies of the outputs of the first m − n channels, then we design

R = I − Θ,   (4.16)

in which I is the identity matrix and Θ is defined as in equation 4.8. Alternatively, we can take

R = I + [1_{m−n}^1, 0^2, …, 0^n, 1_{m−n}^{n+1}, …, 1_{m−n}^m],   (4.17)

where 1_{m−n}^q is a column vector whose first n elements equal zero and whose last m − n elements equal 1 (the superscript q again denoting the column index), so that among the m outputs of algorithm 4.11, the first n components are the separated source signals, whereas the remaining m − n components are all copies of the first component.

5 Simulation Results

In order to verify the effectiveness of the proposed algorithm 4.11, we consider separation of the following source signals (Amari et al., 1996; Yang & Amari, 1997):

S1: Sign signal: s_1(t) = sign(cos(2π 155 t))
S2: High-frequency sinusoid signal: s_2(t) = sin(2π 800 t)
S3: Low-frequency sinusoid signal: s_3(t) = sin(2π 90 t)
S4: Amplitude-modulated signal: s_4(t) = sin(2π 9 t) sin(2π 300 t)
S5: Phase-modulated signal: s_5(t) = sin(2π 300 t + 6 cos(2π 60 t))
S6: Noise signal: n(t) is uniformly distributed in [−1, +1], and s_6(t) = n(t).

Figure 1 shows the source signals in one run. In the simulations, nine sensors are used, and the mixing matrix A is fixed in each run, but its elements are randomly generated from the uniform distribution on [−1, +1] such that A has full column rank. The mixed signals are sampled at a rate of 10 kHz, that is, the sampling period is T_s = 0.0001 s. The proposed algorithm 4.11 is determined by three ingredients. The first is the matrix R. The second is the learning rate; for good separation accuracy, we take η_t = 30 × T_s in the simulations. The third is the nonlinear score functions. Since the PDFs of the source signals are unknown in the BSS problem, the score function is not available, and it is usually replaced by another nonlinear function, the activation function. To perform BSS steadily,
Figure 1: Waveforms of the six original source signals in one run.
the activation function should satisfy the local stability condition (Amari et al., 1997, 2000; Cardoso, 2000; Hyvarinen et al., 2001). Since the above six sources are all subgaussian signals, one may select φ_i(y_i(t)) = y_i^3(t). Here, we apply the adaptive activation function (Yang & Amari, 1997):

φ_i(y_i(t)) = −((1/2) κ_3^i + (9/4) κ_3^i κ_4^i) y_i^2(t) + (−(1/6) κ_4^i + (3/2) (κ_3^i)^2 + (3/4) (κ_4^i)^2) y_i^3(t),   (5.1)

where κ_3^i = E{y_i^3(t)} and κ_4^i = E{y_i^4(t)} − 3 denote, respectively, the skewness and the kurtosis of y_i(t). They are updated as follows:

dκ_3^i/dt = −μ_t (κ_3^i − y_i^3(t)),   (5.2)
dκ_4^i/dt = −μ_t (κ_4^i − y_i^4(t) + 3),   (5.3)

in which μ_t is the step-size parameter, taken as μ_t = 50 × T_s in the simulations.
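The adaptive activation function and its cumulant updates can be sketched in a few lines. This is our own illustration, not code from the article: equations 5.2 and 5.3 are discretized by a simple Euler step with a fixed step size, the function is exercised on a toy unit-variance uniform signal (for which κ₃ = 0 and κ₄ = −1.2), and all variable names are ours.

```python
import numpy as np

def adaptive_phi(y, k3, k4, mu=0.005):
    """Equation 5.1 evaluated with running cumulant estimates, which are
    updated by Euler-discretized versions of equations 5.2 and 5.3."""
    k3 += mu * (y**3 - k3)            # kappa_3 tracks the skewness E{y^3}
    k4 += mu * (y**4 - 3.0 - k4)      # kappa_4 tracks the kurtosis E{y^4} - 3
    phi = (-(0.5 * k3 + 2.25 * k3 * k4) * y**2
           + (-k4 / 6.0 + 1.5 * k3**2 + 0.75 * k4**2) * y**3)
    return phi, k3, k4

rng = np.random.default_rng(1)
k3, k4 = 0.0, 0.0
for y in rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), 5000):  # unit-variance uniform signal
    phi, k3, k4 = adaptive_phi(y, k3, k4)

print(round(k3, 1), round(k4, 1))   # estimates settle near 0 and -1.2
```

Because the cumulants are tracked online, the same code adapts automatically between subgaussian and supergaussian outputs without the user choosing a fixed nonlinearity.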
Two experiments are executed. In the first, the source number is unknown, and we run the natural gradient algorithm 4.4 (η_t = 30 × T_s), the orthogonal natural gradient algorithm 4.11 with 4.12 (η_t = 180 × T_s), and the proposed algorithm 4.11 with 4.13 at the same time. In the latter two algorithms, both R's start with the identity matrix and are then taken as exponentially windowed averages of equations 4.12 and 4.13, respectively, with the same forgetting factor λ = 0.99. To evaluate the performance of the BSS algorithms, we use the cross-talking error as the performance index (Amari et al., 1996; Yang & Amari, 1997),

PI = (1/m) Σ_{p=1}^{m} (Σ_{q=1}^{n} |c_pq| / max_k |c_pk| − 1) + (1/n) Σ_{q=1}^{n} (Σ_{p=1}^{m} |c_pq| / max_k |c_kq| − 1),   (5.4)

where C = WA = {c_pq} is the combined mixing-separating matrix. Figure 2 plots the curves of the performance indexes averaged over 500 independent runs (both the mixing matrix A and the noise source n(t) are randomly generated in each run).

Figure 2: Average performance indexes over 500 independent runs of the natural gradient algorithm 4.4, the orthogonal natural gradient algorithm 4.11 with 4.12, and the proposed algorithm 4.11 with 4.13.

Figure 3: Average performance indexes over 500 independent runs of the proposed algorithm 4.11 with 4.16 and 4.11 with 4.17.

Clearly, both the orthogonal natural gradient algorithm and the proposed algorithm 4.11 can perform BSS steadily,⁶ while algorithm 4.4 diverges for the reason presented in the previous section. In the second experiment, we apply the method described by theorem 2 with K = m to estimate the unknown source number, and we simulate algorithm 4.11 with 4.16 and with 4.17 at the same time. Since the desired source signals appear at the first n channels, we use the cross-talking error 5.4 with C = W_1 A, where W_1 is the submatrix composed of the first n rows of W, to measure the performance. Another 500 independent runs are executed. Figure 3 shows the average results, and Figures 4 through 6 plot, respectively, the performance indexes and the waveforms of the recovered source signals in one run. From these figures, it can be seen that besides the n source signals separated at the first n channels, the proposed algorithm 4.11 makes the remaining m − n redundant signals copies of the outputs of the first m − n

⁶ The orthogonal natural gradient algorithm makes the redundant components, as well as some of the separated source signals, very small in scale; hence, its steady-state cross-talking error is somewhat less than those of the original natural gradient algorithm (before divergence) and the proposed one, but this does not necessarily imply a higher recovery quality of the separated source signals.
Figure 4: Performance indexes of the proposed algorithm 4.11 with 4.16 and 4.11 with 4.17 in one run.
channels by using design 4.16, while making them all copies of the first component by using design 4.17, in agreement with what we expect.

6 Conclusions

This article studies the general BSS problem in which the number m of sensors is not less than the number n of sources. First, it is shown that the mutual information of the output components is a cost function for BSS and that it reaches a local minimum if and only if the m outputs consist of n independent components and m − n redundant components. This provides a rigorous theoretical justification for the empirical findings previously reported by Cichocki et al. (1999). Second, it is shown that when the natural gradient algorithm for complete BSS (m = n) is used in the overdetermined case (m > n), it inevitably diverges. As a solution to this problem, we propose a modified learning rule that includes the standard and orthogonal natural gradient algorithms (Amari et al., 2000) as special cases. Moreover, the proposed rule can deliver the desired source signals at specified channels and control what the redundant components are and where they appear, provided the source number is known or has been estimated in advance.
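The source-number estimation of theorem 2 amounts to a rank computation on a short block of noise-free observations. The sketch below is our own illustration (dimensions follow the simulation setup of section 5, with six sources and nine sensors); it recovers n = 6 from K = 2m samples.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 9                          # six sources, nine sensors, as in section 5
K = 2 * m                            # theorem 2: K = m ~ 2m samples suffice in practice

S = rng.uniform(-1.0, 1.0, (n, K))   # independent sources -> full row rank (w.p. 1)
A = rng.uniform(-1.0, 1.0, (m, n))   # random mixing matrix, full column rank (w.p. 1)
X = A @ S                            # noise-free observations x_t = A s_t

n_est = np.linalg.matrix_rank(X)     # theorem 2: rank(X) = n
print(n_est)                         # -> 6
```

With noisy observations, the hard rank would have to be replaced by counting singular values above a noise-dependent threshold, which is why the article also cites model-selection schemes (Sato, 2001; Sugiyama & Ogawa, 2001; Cao et al., 2003).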
Figure 5: Waveforms of the recovered source signals by the proposed algorithm 4.11 with 4.16 in one run.
Figure 6: Waveforms of the recovered source signals by the proposed algorithm 4.11 with 4.17 in one run.
Acknowledgments

We are grateful to the anonymous reviewers and Terrence J. Sejnowski for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China under contract 60375004.

References

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Amari, S., Chen, T. P., & Cichocki, A. (1997). Stability analysis of learning algorithms for blind source separation. Neural Networks, 10(8), 1345–1351.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Amari, S., & Cichocki, A. (1998). Adaptive blind signal processing: Neural network approaches. Proceedings of the IEEE, 86(10), 2026–2048.
Amari, S. (1999). Natural gradient learning for over- and under-complete bases in ICA. Neural Computation, 11(8), 1875–1883.
Amari, S., Chen, T. P., & Cichocki, A. (2000). Nonholonomic orthogonal learning algorithms for blind source separation. Neural Computation, 12(6), 1463–1484.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Belouchrani, A., Abed-Meraim, K., Cardoso, J. F., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Trans. on Signal Processing, 45(2), 434–444.
Cao, J. T., Murata, N., Amari, S., Cichocki, A., & Takeda, T. (2003). A robust approach to independent component analysis of signals with high-level noise measurements. IEEE Trans. on Neural Networks, 14(3), 631–645.
Cao, X. R., & Liu, R. W. (1996). A general approach to blind source separation. IEEE Trans. on Signal Processing, 44(3), 562–571.
Cardoso, J. F., & Souloumiac, A. (1993). Blind beamforming for non-Gaussian signals. IEE Proceedings F, Radar and Signal Processing, 140(6), 362–370.
Cardoso, J. F. (1997). Infomax and maximum likelihood for blind source separation. IEEE Signal Processing Letters, 4(4), 112–114.
Cardoso, J. F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Cardoso, J. F. (2000). On the stability of source separation algorithms. Journal of VLSI Signal Processing, 26(1), 7–14.
Cardoso, J. F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3029.
Choi, S., Cichocki, A., & Amari, S. (2000). Flexible independent component analysis. Journal of VLSI Signal Processing, 26(1), 25–39.
Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing: Learning algorithms and applications. New York: Wiley.
Cichocki, A., Karhunen, J., Kasprzak, W., & Vigario, R. (1999). Neural networks for blind separation with unknown number of sources. Neurocomputing, 24(1), 55–93.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36(3), 287–314.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Cramér, H. (1946). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45(1), 59–83.
Douglas, S. C. (2003). Blind source separation and independent component analysis: A crossroads of tools and ideas. In S. Amari, A. Cichocki, S. Makino, & N. Murata (Eds.), International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003), 4 (pp. 1–10). Nara, Japan.
Feng, D. Z., Zhang, X. D., & Bao, Z. (2003). An efficient multistage decomposition approach for independent components. Signal Processing, 83(1), 181–197.
Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvarinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
Igual, J., Vergara, L., Camacho, A., & Miralles, R. (2003). Independent component analysis with prior information about the mixing matrix. Neurocomputing, 50, 419–438.
Jutten, C., & Hérault, J. (1991). Blind separation of sources: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–10.
Karhunen, J., Oja, E., Wang, L., Vigario, R., & Joutsensalo, J. (1997). A class of neural networks for independent component analysis. IEEE Trans. on Neural Networks, 8(3), 486–504.
Karhunen, J., Pajunen, P., & Oja, E. (1998). The nonlinear PCA criterion in blind source separation: Relations with other approaches. Neurocomputing, 22(1), 5–20.
Lee, T. W., Girolami, M., & Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11(2), 409–433.
Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Li, Y. Q., & Wang, J. (2002). Sequential blind extraction of instantaneously mixed sources. IEEE Trans. on Signal Processing, 50(5), 997–1006.
Lindgren, U. A., & Broman, H. (1998). Source separation using a criterion based on second-order statistics. IEEE Trans. on Signal Processing, 46(7), 1837–1850.
Lu, W., & Rajapakse, J. C. (2003). Eliminating indeterminacy in ICA. Neurocomputing, 50, 271–290.
Moreau, E., & Macchi, O. (1996). High-order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10(1), 19–46.
Moreau, E., & Thirion-Moreau, N. (1999). Nonsymmetrical contrasts for source separation. IEEE Trans. on Signal Processing, 47(8), 2241–2252.
Oja, E. (1997). The nonlinear PCA learning rule and signal separation: Mathematical analysis. Neurocomputing, 17(1), 25–46.
Pajunen, P., & Karhunen, J. (1998). Least-squares methods for blind source separation based on nonlinear PCA. International Journal of Neural Systems, 8, 601–612.
Pham, D. T., & Garat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. on Signal Processing, 45(7), 1712–1725.
Pham, D. T., & Cardoso, J. F. (2001). Blind separation of instantaneous mixtures of nonstationary sources. IEEE Trans. on Signal Processing, 49(9), 1837–1848.
Sánchez A., V. D. (2002). Frontiers of research in BSS/ICA. Neurocomputing, 49(1), 7–23.
Sato, M. (2001). Online model selection based on the variational Bayes. Neural Computation, 13(7), 1649–1681.
Sugiyama, M., & Ogawa, H. (2001). Subspace information criterion for model selection. Neural Computation, 13(8), 1863–1889.
Thawonmas, R., Cichocki, A., & Amari, S. (1998). A cascade neural network for blind extraction without spurious equilibria. IEICE Trans. on Fundamentals of Electronics, Communications, and Computer Science, E81-A(9), 1833–1846.
Tong, L., Liu, R., Soon, V. C., & Huang, Y. F. (1991). Indeterminacy and identifiability of blind identification. IEEE Trans. on Circuits and Systems, 38(5), 499–509.
Yang, H. H., & Amari, S. (1997). Adaptive on-line learning algorithms for blind separation: Maximum entropy and minimum mutual information. Neural Computation, 9(5), 1457–1482.
Zhang, L. Q., Cichocki, A., & Amari, S. (1999). Natural gradient algorithm for blind separation of over-determined mixtures with additive noise. IEEE Signal Processing Letters, 6(11), 293–295.
Zhu, X. L., & Zhang, X. D. (2002). Adaptive RLS algorithm for blind source separation using a natural gradient. IEEE Signal Processing Letters, 9(12), 432–435.

Received July 16, 2003; accepted January 30, 2004.
LETTER
Communicated by Maneesh Sahani
Unsupervised Spike Detection and Sorting with Wavelets and Superparamagnetic Clustering

R. Quian Quiroga
[email protected]
Z. Nadasdy
[email protected]
Division of Biology, California Institute of Technology, Pasadena, CA 91125, U.S.A.

Y. Ben-Shaul
[email protected]
ICNC, Hebrew University, Jerusalem, Israel
This study introduces a new method for detecting and sorting spikes from multiunit recordings. The method combines the wavelet transform, which localizes distinctive spike features, with superparamagnetic clustering, which allows automatic classification of the data without assumptions such as low variance or gaussian distributions. Moreover, an improved method for setting amplitude thresholds for spike detection is proposed. We describe several criteria for implementation that render the algorithm unsupervised and fast. The algorithm is compared to other conventional methods using several simulated data sets whose characteristics closely resemble those of in vivo recordings. For these data sets, we found that the proposed algorithm outperformed conventional methods.
1 Introduction

Many questions in neuroscience depend on the analysis of neuronal spiking activity recorded under various behavioral conditions. For this reason, data acquired simultaneously from multiple neurons are invaluable for elucidating principles of neural information processing. Recent advances in commercially available acquisition systems allow recordings of up to hundreds of channels simultaneously, and the reliability of these data critically depends on accurately identifying the activity of individual neurons. However, the development of efficient and reliable computational methods for classifying multiunit data, that is, spike sorting algorithms, lags behind the capabilities afforded by current hardware. In practice, supervised spike sorting of a large number of channels is highly time-consuming and nearly impossible to perform during the course of an experiment.

Neural Computation 16, 1661–1687 (2004) © 2004 Massachusetts Institute of Technology
The basic algorithmic steps of spike classification are as follows: (1) spike detection, (2) extraction of distinctive features from the spike shapes, and (3) clustering of the spikes by these features. Spike sorting methods are typically based on clustering predefined spike shape features such as peak-to-peak amplitude, width, or principal components (Abeles & Goldstein, 1977; Lewicki, 1998). Nevertheless, it is impossible to know beforehand which of these features is optimal for discriminating between spike classes in a given data set. In the specific case where the spike features are projections onto the first few principal components, the planes onto which the spikes are projected maximize the variance of the data but do not necessarily provide an optimal separation between the clusters. A second critical issue is that even when optimal features from a given data set are used for classification, the distribution of the data imposes additional constraints on the clustering procedure. In particular, violation of normality in a given feature's distribution compromises most unsupervised clustering algorithms, and therefore manual clustering of the data is usually preferred. However, besides being a very time-consuming task, manual clustering introduces errors due to both the limited dimensionality of the cluster-cutting space and human biases (Harris, Henze, Csicsvari, Hirase, & Buzsáki, 2000). An alternative approach is to define spike classes by a set of manually selected thresholds (window discriminators) or with spike templates. Although this is computationally very efficient and can be implemented on-line, it is reliable only when the signal-to-noise ratio is very high, and it is limited to the number of channels a human operator is able to supervise. In this article, we introduce a new method that improves spike separation in the feature space and implements a novel unsupervised clustering algorithm.

Combining these two features results in an unsupervised spike sorting system. The cornerstones of our method are the wavelet transform, which is a time-frequency decomposition of the signal with optimal resolution in both the time and the frequency domains, and superparamagnetic clustering (SPC; Blatt, Wiseman, & Domany, 1996), a relatively new clustering procedure developed in the context of statistical mechanics. The complete algorithm encompasses three principal stages: (1) spike detection, (2) selection of spike features, and (3) clustering of the selected spike features. In the first step, spikes are detected with an automatic amplitude threshold on the high-pass filtered data. In the second step, a small set of wavelet coefficients from each spike is chosen as input for the clustering algorithm. Finally, SPC classifies the spikes according to the selected set of wavelet coefficients. We stress that the entire process of detection, feature extraction, and clustering is performed without supervision and relatively quickly. In this study, we compare the performance of the algorithm with that of other methods using simulated data that closely resemble real recordings. The rationale for using simulated data is to obtain an objective measure of performance, since the simulation sets the identity of the spikes.
2 Theoretical Background

2.1 Wavelet Transform. The wavelet transform (WT) is a time-frequency representation of the signal that has two main advantages over conventional methods: it provides an optimal resolution in both the time and the frequency domains, and it eliminates the requirement of signal stationarity. It is defined as the convolution between the signal x(t) and the wavelet functions ψ_{a,b}(t),

W_ψ X(a, b) = ⟨x(t) | ψ_{a,b}(t)⟩,   (2.1)

where the ψ_{a,b}(t) are dilated (contracted) and shifted versions of a unique wavelet function ψ(t),

ψ_{a,b}(t) = |a|^{−1/2} ψ((t − b)/a),   (2.2)
where a and b are the scale and translation parameters, respectively. Equation 2.1 can be inverted, thus providing the reconstruction of x(t). The WT maps a signal represented by one independent variable t onto a function of two independent variables a, b. This procedure is redundant and inefficient for algorithmic implementations; therefore, the WT is usually defined at discrete scales a and discrete times b by choosing the set of parameters {a_j = 2^{−j}; b_{j,k} = 2^{−j} k}, with integers j and k. Contracted versions of the wavelet function match the high-frequency components, while dilated versions match the low-frequency components. Then, by correlating the original signal with wavelet functions of different sizes, we can obtain details of the signal at several scales. These correlations with the different wavelet functions can be arranged in a hierarchical scheme called multiresolution decomposition (Mallat, 1989). The multiresolution decomposition algorithm separates the signal into details at different scales and a coarser representation of the signal named the "approximation" (for details, see Mallat, 1989; Chui, 1992; Samar, Swartz, & Raghveer, 1995; Quian Quiroga, Sakowicz, Basar, & Schürmann, 2001; Quian Quiroga & Garcia, 2003). In this study, we implemented a four-level decomposition using Haar wavelets, which are rescaled square functions. Haar wavelets were chosen due to their compact support and orthogonality, which allows the discriminative features of the spikes to be expressed with a few wavelet coefficients and without a priori assumptions about the spike shapes.

2.2 Superparamagnetic Clustering. The following is a brief description of the key ideas of superparamagnetic clustering (SPC), which is based on simulated interactions between each data point and its K nearest neighbors (for details, see Blatt et al., 1996; Blatt, Wiseman, & Domany, 1997). The method is implemented as a Monte Carlo iteration of a Potts model. The
Potts model is a generalization of the Ising model in which, instead of spins taking the values ±1/2, there are q different states per particle (Binder & Heermann, 1988). The first step is to represent the m selected features of each spike i by a point x_i in an m-dimensional phase space. The interaction strength between points x_i and x_j is then defined as

J_ij = (1/K) exp(−‖x_i − x_j‖² / (2a²)) if x_i, x_j are nearest neighbors, and J_ij = 0 otherwise,   (2.3)

where a is the average nearest-neighbor distance and K is the number of nearest neighbors. Note that the strength of interaction J_ij between nearest-neighbor spikes falls off exponentially with increasing Euclidean distance d_ij = ‖x_i − x_j‖, which corresponds to the similarity of the selected features (i.e., similar spikes interact strongly). In the second step, an initial random state s from 1 to q is assigned to each point x_i. Then N Monte Carlo iterations are run for different temperatures T using the Wolff algorithm (Wolff, 1989; Binder & Heermann, 1988). Blatt et al. (1997) used a Swendsen-Wang algorithm instead, but its implementation and performance are both very similar. The advantage of both algorithms over simpler approaches such as the Metropolis algorithm is their enhanced performance in the superparamagnetic regime (see Binder & Heermann, 1988; Blatt et al., 1997, for details). The main idea of the Wolff algorithm is that, given an initial configuration of states s, a point x_i is randomly selected and its state s changed to a new state s_new, randomly chosen between 1 and q. The probability that the nearest neighbors of x_i will also change their state to s_new is given by

p_ij = 1 − exp(−(J_ij / T) δ_{s_i,s_j}),   (2.4)

where T is the temperature (see below). Note that only those nearest neighbors of x_i that were in the same previous state s are candidates to change their values to s_new. Neighbors that change their value create a "frontier" and cannot change their value again during the same iteration. Points that do not change their value in a first attempt can do so if revisited during the same iteration. Then, for each point of the frontier, we apply equation 2.4 again to calculate the probability that their respective neighbors change state to s_new. The frontier is updated, and the update is repeated until the frontier no longer changes. At that stage, we start the procedure again from another point and repeat it several times in order to obtain representative statistics. Points that are relatively close together (i.e., belonging to a given cluster) will change their state together. This observation can be quantified by measuring the point-point correlation ⟨δ_{s_i,s_j}⟩ and defining x_i, x_j to be members of the same cluster if ⟨δ_{s_i,s_j}⟩ ≥ θ, for a given threshold θ.
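Equations 2.3 and 2.4 are straightforward to compute. The sketch below is our own illustration (the function name and the toy data are ours, not from the article): it builds the K-nearest-neighbor interactions J_ij and the corresponding bond probabilities p_ij, assuming the two points are currently in the same Potts state (δ_{s_i,s_j} = 1), for two well-separated point clouds.

```python
import numpy as np

def spc_bonds(points, K=11, T=0.05):
    """Interactions J_ij of equation 2.3 and bond probabilities p_ij of
    equation 2.4, assuming delta_{s_i,s_j} = 1 (points in the same state)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, 1:K + 1]             # K nearest neighbors (self excluded)
    a = d[np.arange(len(points))[:, None], nn].mean()  # average nearest-neighbor distance
    J = np.zeros_like(d)
    rows = np.repeat(np.arange(len(points)), K)
    J[rows, nn.ravel()] = np.exp(-d[rows, nn.ravel()]**2 / (2 * a**2)) / K
    J = np.maximum(J, J.T)        # symmetrize the neighbor relation
    p = 1.0 - np.exp(-J / T)      # probability that a neighbor joins the new state
    return J, p

rng = np.random.default_rng(3)
pts = np.vstack([rng.normal(0.0, 0.2, (50, 2)),   # two tight, well-separated clouds
                 rng.normal(3.0, 0.2, (50, 2))])
J, p = spc_bonds(pts)
print(p[:50, 50:].max(), p.max() > 0.5)   # no bonds across clouds; strong bonds within
```

Because points in different clouds are never among each other's K nearest neighbors, their bond probability is exactly zero, which is why Wolff updates at a superparamagnetic temperature flip the two clouds independently.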
As in Blatt et al. (1996), we used q = 20 states, K = 11 nearest neighbors, N = 500 iterations, and θ = 0.5. It has indeed been shown that the clustering results depend mainly on the temperature and are robust to small changes in these parameters (Blatt et al., 1996). Let us now discuss the role of the temperature T. Note from equation 2.4 that high temperatures correspond to a low probability of neighboring points changing state together, whereas low temperatures correspond to a higher probability, regardless of how weak the interaction J_ij is. This has a physical analogy with a spin glass: at relatively high temperature, all the spins switch randomly, regardless of their interactions (paramagnetic phase); at low temperature, the entire spin glass changes its state together (ferromagnetic phase). However, in a certain intermediate range of temperatures, the system reaches a "superparamagnetic" phase in which only spins that are grouped together change their state simultaneously. In terms of our clustering problem: at low temperatures, all points change their state together and are therefore considered a single cluster; at high temperatures, many points change their state independently of one another, partitioning the data into several clusters with only a few points each; and at temperatures corresponding to the superparamagnetic phase, only points that are grouped together change their state simultaneously. Figure 1A shows a two-dimensional (2D) example in which 2400 2D points were distributed in three different clusters. Note that the clusters partially overlap, they have a large variance, and, moreover, their centers fall outside the clusters. In particular, the distance between arbitrarily chosen points of the same cluster can be much larger than the distance between points in different clusters. These features render conventional clustering algorithms unreliable.
The different markers represent the outcome after clustering with SPC. Clearly, most of the points were correctly classified. In fact, only 102 of the 2400 (4%) data points were left unclassified because they were near the boundaries of the clusters. Figure 1B shows the number of elements assigned to each cluster as a function of the temperature. At low temperatures, we have a single cluster containing all 2400 data points. At a temperature between 0.04 and 0.05, this cluster breaks down into three subclusters, corresponding to the superparamagnetic transition. The classification shown in the upper plot was performed at T = 0.05. At about T = 0.08, we observe the transition to the paramagnetic phase, where the clusters break down into several groups with a few members each. Note that the algorithm is based on K-nearest-neighbor interactions and therefore does not assume that clusters are nonoverlapping, have low variance, or follow a gaussian distribution.

Figure 1: Example showing the performance of superparamagnetic clustering. (A) The two-dimensional data points used as inputs. The different markers represent the outcome of the clustering algorithm. (B) Cluster size vs. temperature. At temperature 0.05, the transition to the superparamagnetic phase occurs, and the three clusters are separated.

3 Description of the Method

Figure 2 summarizes the three principal stages of the algorithm: (1) spikes are detected automatically via amplitude thresholding; (2) the wavelet transform is calculated for each of the spikes, and the optimal coefficients for separating the spike classes are automatically selected; and (3) the selected wavelet coefficients then serve as the input to the SPC algorithm, and clustering is performed after automatic selection of the temperature corresponding to the superparamagnetic phase. (A Matlab implementation of the algorithm can be obtained on-line from www.vis.caltech.edu/~rodri.)
3.1 Spike Detection. Spike detection was performed by amplitude thresholding after bandpass filtering the signal (300–6000 Hz, four-pole
Figure 2: Overview of the automatic clustering procedure. (A) Spikes are detected by setting an amplitude threshold. (B) A set of wavelet coefficients representing the relevant features of the spikes is selected. (C) The SPC algorithm is used to cluster the spikes automatically.
butterworth filter). The threshold (Thr) was automatically set to

    Thr = 4σ_n,    σ_n = median{ |x| / 0.6745 },    (3.1)
where x is the bandpass-filtered signal and σ_n is an estimate of the standard deviation of the background noise (Donoho & Johnstone, 1994). Note that taking the standard deviation of the signal (including the spikes) could lead to very high threshold values, especially in cases with high firing rates and large spike amplitudes. In contrast, by using the estimate based on the median, the interference of the spikes is diminished (under the reasonable assumption that spikes amount to a small fraction of all samples). To demonstrate this, we generated a segment of 10 sec of background noise with unit standard deviation, and in successive simulations, we added a distinct spike class with different firing rates. Figure 3 shows that for noise alone (i.e., zero firing rate), both estimates are equal, but as the firing rate increases, the standard deviation of the signal (the conventional estimate) gives an increasingly erroneous estimate of the noise level, whereas the improved estimate from equation 3.1 remains close to the real value. For each detected spike, 64 samples (i.e., ~2.5 ms) were saved for further analysis. All spikes were aligned to their maximum at data point 20. To avoid spike misalignments due to the low sampling rate, spike maxima were determined from interpolated waveforms of 256 samples, using cubic splines.

3.2 Selection of Wavelet Coefficients. After spikes are detected, their wavelet transform is calculated, yielding 64 wavelet coefficients for each spike. We implemented a four-level multiresolution decomposition using Haar wavelets. As explained in section 2.1, each wavelet coefficient characterizes the spike shapes at different scales and times. The goal is to select a few coefficients that best separate the different spike classes. Clearly, such coefficients should have a multimodal distribution (unless there is only one spike class).
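The four-level Haar decomposition of a 64-sample spike can be sketched as a cascade of pairwise sums and differences (a minimal orthonormal Haar transform written out by hand; a production implementation would use a wavelet library, and the toy spike below is illustrative):

```python
import math

def haar_dwt(x, levels=4):
    """Four-level Haar multiresolution decomposition of a 64-sample spike:
    returns the 64 coefficients ordered as [A4, D4, D3, D2, D1]."""
    approx, details = list(x), []
    s = 1 / math.sqrt(2)
    for _ in range(levels):
        a = [s * (approx[2 * i] + approx[2 * i + 1]) for i in range(len(approx) // 2)]
        d = [s * (approx[2 * i] - approx[2 * i + 1]) for i in range(len(approx) // 2)]
        details.insert(0, d)   # keep coarser detail levels first
        approx = a
    coeffs = approx[:]         # A4: 4 approximation coefficients
    for d in details:          # D4 (4), D3 (8), D2 (16), D1 (32)
        coeffs += d
    return coeffs

spike = [math.exp(-((i - 20) / 4.0) ** 2) for i in range(64)]  # toy spike, peak at sample 20
coeffs = haar_dwt(spike)  # 64 coefficients per spike
```

Because the transform is orthonormal, the 64 coefficients carry exactly the energy of the 64 input samples; the selection step below then decides which of them are informative.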
To perform this selection automatically, the Lilliefors modification of a Kolmogorov-Smirnov (KS) test for normality was used (Press, Teukolsky, Vetterling, & Flannery, 1992). Note that we do not rely on any particular distribution of the data; rather, we are interested in deviation from normality as a sign of a multimodal distribution. Given a data set x, the test compares the cumulative distribution function of the data, F(x), with that of a gaussian distribution with the same mean and variance, G(x). Deviation from normality is then quantified by

    max( |F(x) − G(x)| ).    (3.2)
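A minimal version of this selection step, computing the deviation of equation 3.2 for each coefficient against a gaussian with the sample mean and variance, might look as follows (function and variable names are ours):

```python
import math
from statistics import mean, pstdev

def normality_deviation(values):
    """max |F(x) - G(x)| between the empirical CDF and a gaussian CDF with
    the sample mean and variance (Lilliefors-style statistic, equation 3.2)."""
    xs = sorted(values)
    mu, sd = mean(xs), pstdev(xs)
    n = len(xs)
    dev = 0.0
    for i, x in enumerate(xs):
        g = 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2))))
        # Compare G against the empirical CDF just before and at x.
        dev = max(dev, abs((i + 1) / n - g), abs(i / n - g))
    return dev

def select_coefficients(coeff_matrix, k=10):
    """Indices of the k coefficients deviating most from normality,
    where coeff_matrix[s][c] is coefficient c of spike s."""
    n_coeffs = len(coeff_matrix[0])
    scores = [normality_deviation([row[c] for row in coeff_matrix])
              for c in range(n_coeffs)]
    return sorted(range(n_coeffs), key=lambda c: scores[c], reverse=True)[:k]
```

A coefficient with a bimodal distribution (two spike classes taking different values) scores much higher on this statistic than one with a gaussian spread, which is the property the selection relies on.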
In our implementation, the first 10 coefficients with the largest deviation from normality were used. The selected set of wavelet coefficients provides
[Figure 3 here: noise standard deviation vs. firing rate (Hz), comparing the conventional estimation, the improved estimation, and the real value.]

Figure 3: Estimation of the noise level used for determining the amplitude threshold. Note how the conventional estimation based on the standard deviation of the signal increases with the firing rate, whereas the improved estimation from equation 3.1 remains close to the real value. See the text for details.
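The comparison in Figure 3 can be reproduced in a few lines; the spike shape, rate, and amplitude below are illustrative rather than those of the article:

```python
import math
import random
from statistics import median, pstdev

def noise_sigma(x):
    """Median-based noise estimate: sigma_n = median(|x|) / 0.6745 (equation 3.1)."""
    return median(abs(v) for v in x) / 0.6745

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(24000)]   # 1 s at 24 kHz, true sigma = 1
signal = list(noise)
for start in range(0, len(signal), 1200):                # add spikes at 20 Hz
    for k in range(10):                                  # 10-sample half-sine spike
        signal[start + k] += 8.0 * math.sin(math.pi * k / 10)

print("conventional std :", pstdev(signal))     # inflated by the spikes
print("median estimate  :", noise_sigma(signal))  # stays near the true value of 1
threshold = 4 * noise_sigma(signal)
```

As in the figure, the conventional standard deviation grows with firing rate and spike amplitude, while the median-based estimate is barely affected because the spikes occupy only a small fraction of the samples.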
a compressed representation of the spike features that serves as the input to the clustering algorithm. Overlapping spikes (i.e., spikes from different neurons appearing quasi-simultaneously) introduce outliers in the distribution of the wavelet coefficients that cause deviations from normality in unimodal (as well as multimodal) distributions, thus compromising the use of the KS test as an estimate of multimodality. In order to minimize this effect, for each coefficient we only considered values within ±3 standard deviations.

3.3 SPC and Localization of the Superparamagnetic Phase. Once the selected set of wavelet coefficients is chosen, we run the SPC algorithm for a wide range of temperatures spanning the ferromagnetic, superparamagnetic, and paramagnetic phases. In order to localize the superparamagnetic phase automatically, a criterion based on the cluster sizes is used. The idea is that in both the paramagnetic and ferromagnetic phases, temperature increases can only lead to the creation of clusters with few members each. Indeed, in the paramagnetic phase (i.e., at high temperature), the clusters break down into several small ones, and in the ferromagnetic phase, there are almost no changes when the temperature is increased. In contrast, in the superparamagnetic phase, increasing the temperature creates new clusters with a large number of members. In our implementation, we varied the temperature from 0 to 0.2 in increments of 0.01 and looked for the highest temperature at which a cluster containing more than 60 points appeared (not having been present at lower temperatures). Since our simulations were 60 sec long, this means that we considered clusters corresponding to neurons with a mean firing rate of at least 1 Hz. The threshold of 1 Hz gave optimal results for all our simulations, but it should be decreased if one considers neurons with lower firing rates. Alternatively, one can consider a fraction of the total number of spikes. If no cluster with a minimum of 60 points was found, we kept the minimum temperature value. Using this criterion, we can automatically select the optimal temperature for cluster assignments, and the whole clustering procedure therefore becomes unsupervised.

4 Data Simulation

Simulated signals were constructed using a database of 594 different average spike shapes compiled from recordings in the neocortex and basal ganglia. For generating background noise, spikes randomly selected from the database were superimposed at random times and amplitudes. This was done for half of the sample times. The rationale was to mimic the background noise of actual recordings, which is generated by the activity of distant neurons.
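The noise construction just described can be sketched as follows; here `templates` stands in for the database of normalized spike shapes, and the event rate and amplitude range are our assumptions for illustration:

```python
import random

def make_background_noise(templates, n_samples, rate=0.5, seed=0):
    """Build background noise by superimposing randomly chosen spike shapes
    at random times and amplitudes. `templates` is a hypothetical list of
    spike waveforms normalized to a peak of 1."""
    rng = random.Random(seed)
    noise = [0.0] * n_samples
    n_events = int(rate * n_samples)      # one event per two samples, on average
    for _ in range(n_events):
        shape = rng.choice(templates)
        amp = rng.uniform(0.0, 0.5)       # small amplitudes: "distant" neurons
        t0 = rng.randrange(n_samples - len(shape))
        for k, v in enumerate(shape):
            noise[t0 + k] += amp * v
    return noise
```

Because the noise is built from spike shapes, it shares the spikes' spectral characteristics, which is exactly what makes detection and clustering harder than with white noise.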
Next, we superimposed a train of three distinct spike shapes (also preselected from the same database) on the noise signal at random times. The amplitude of the three spike classes was normalized to a peak value of 1. The noise level was determined from its standard deviation, which was equal to 0.05, 0.1, 0.15, and 0.2 relative to the amplitude of the spike classes. In one case, since clustering was relatively easy, we also considered noise levels of 0.25, 0.30, 0.35, and 0.4. Spike times and identities were saved for subsequent evaluation of the clustering algorithm. The data were first simulated at a sampling rate of 96,000 Hz, and by using interpolated waveforms of the original spike shapes, we allowed the spike times to fall continuously between samples (to machine precision). Finally, the data were downsampled to 24,000 Hz. This procedure was introduced in order to imitate actual recording conditions, in which samples do not necessarily fall on the same features within a spike (i.e., the peak of the signal does not necessarily coincide with a discrete sample).
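The oversample-and-decimate step can be sketched as below (linear interpolation for brevity, where the article used cubic splines; the toy 256-sample template is illustrative):

```python
import math

def place_spike(template_96k, offset_frac):
    """Shift a 96 kHz spike template by a fractional sample (linear
    interpolation here; the article used cubic splines), then decimate
    4:1 to 24 kHz so the peak need not coincide with a stored sample."""
    n = len(template_96k)
    shifted = []
    for i in range(n):
        t = i - offset_frac
        j = math.floor(t)
        if 0 <= j < n - 1:
            frac = t - j
            shifted.append((1 - frac) * template_96k[j] + frac * template_96k[j + 1])
        else:
            shifted.append(0.0)
    return shifted[::4]  # 96 kHz -> 24 kHz: 256 samples -> 64

spike_96k = [0.0] * 120 + [0.2, 0.6, 1.0, 0.6, 0.2] + [0.0] * 131  # 256 samples
spike_24k = place_spike(spike_96k, offset_frac=0.37)
```

Varying `offset_frac` moves the spike peak between the 24 kHz samples, reproducing the variability in sampled spike shapes that real recordings exhibit.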
In all simulations, the three distinct spike trains had a Poisson distribution of interspike intervals with a mean firing rate of 20 Hz. A 2 ms refractory period between spikes of the same class was introduced. Note that the background noise reproduces the spike shape variability of biological signals (Fee, Mitra, & Kleinfeld, 1996; Pouzat, Mazor, & Laurent, 2002). Moreover, constructing the noise from spikes ensures that it shares a similar power spectrum with the spikes themselves (a 1/f spectrum). The realistic simulation conditions applied here render the entire spike sorting procedure more challenging than, for example, assuming a white noise distribution of the background activity. Further complications of real recordings (e.g., overlapping spikes, bursting activity, moving electrodes) will be addressed in the next section. Figure 4 shows one of the simulated data sets, with a noise level of 0.1. Figure 4A shows the three spike shapes that were added to the background noise, which is shown in Figure 4B. Figure 4C shows a section of the data in Figure 4B at finer temporal resolution. Note the variance in shape and amplitude between spikes of the same class (identified with markers of the same gray level) due to the additive background noise. Figure 5 shows another example, with noise level 0.15, in which classification is considerably more difficult than in the first data set. Here, the three spike classes share the same peak amplitudes and very similar widths and shapes. The differences
Figure 4: Simulated data set used for spike sorting. (A) The three template spike shapes. (B) The previous spikes embedded in the background noise. (C) The same data with a magnified timescale. Note the variability of spikes from the same class due to the background noise.
Figure 5: Another simulated data set. (A) The three template spike shapes. (B) The previous spikes embedded in the background noise. (C) The same data with a magnified timescale. Here the spike shapes are more difficult to differentiate. Note in the lower plot that the variability in the spike shapes makes their classification difficult.
between them are relatively small and temporally localized. With the background noise added, it appears very difficult to identify the three spike classes (see Figure 5C). As with the previous data set, the variability of spikes within the same class is apparent. All the data sets used in this article are available on-line at www.vis.caltech.edu/~rodri.

5 Results

The method was tested using four generic examples of 60 sec length, each simulated at four different noise levels, as described in the previous section. Since the first example was relatively easy to cluster, in this case we also generated four extra time series with higher noise levels.

5.1 Spike Detection. Figures 4 and 5 show two of the simulated data sets. The horizontal lines drawn in Figures 4B and C and 5B and C are the thresholds for spike detection using equation 3.1. Table 1 summarizes the performance of the detection procedure for all data sets and noise levels. Detection performance for overlapping spikes (i.e., spike pairs within 64 data points) is reported separately (values in brackets). Overlapping
Table 1: Number of Misses and False Positives for the Different Data Sets.

Example (Noise Level)   Number of Spikes   Misses      False Positives
Example 1 (0.05)        3514 (785)          17 (193)   711
Example 1 (0.10)        3522 (769)           2 (177)    57
Example 1 (0.15)        3477 (784)         145 (215)    14
Example 1 (0.20)        3474 (796)         714 (275)    10
Example 2 (0.05)        3410 (791)           0 (174)     0
Example 2 (0.10)        3520 (826)           0 (191)     2
Example 2 (0.15)        3411 (763)          10 (173)     1
Example 2 (0.20)        3526 (811)         376 (256)     5
Example 3 (0.05)        3383 (767)           1 (210)    63
Example 3 (0.10)        3448 (810)           0 (191)    10
Example 3 (0.15)        3472 (812)           8 (203)     6
Example 3 (0.20)        3414 (790)         184 (219)     2
Example 4 (0.05)        3364 (829)           0 (182)     1
Example 4 (0.10)        3462 (720)           0 (152)     5
Example 4 (0.15)        3440 (809)           3 (186)     4
Example 4 (0.20)        3493 (777)         262 (228)     2

Notes: The noise level is given as its standard deviation relative to the peak amplitude of the spikes. All spike classes had a peak value of 1. Values in brackets are for overlapping spikes.
spikes hamper the detection performance because they are detected as single events when they appear too close in time. In comparison with the other examples, a relatively large number of spikes were not detected in data set 1 at the highest noise levels (0.15 and 0.2). This is due to the spike class with opposite polarity (class 2 in Figure 4). In fact, setting up an additional negative threshold reduced the number of misses from 145 to 5 for noise level 0.15 and from 714 to 178 for noise level 0.2. For the overlapping spikes, the reduction is from 360 to 52 and from 989 to 134, respectively. In all other cases, the number of undetected spikes was relatively low. With the exception of the first two noise levels in example 1 and the first noise level in example 3, the number of false positives was very small (less than 1%). Lowering the threshold value in equation 3.1 (e.g., to 3.5σ_n) would indeed reduce the number of misses but would also increase the number of false positives. The optimal trade-off between misses and false positives depends on the experimenter's preference, but we remark that the automatic threshold of equation 3.1 gives a good value across the different noise levels. In the case of example 1 (noise levels 0.05 and 0.1) and example 3 (noise level 0.05), the large number of false positives is exclusively due to double detections. Since the noise level is very low in these cases, the threshold is also low, and consequently, the second positive peak of the class 3 spike shown in Figure 4 is detected. One solution would be to take a higher threshold value (e.g., 4.5σ_n), but this would not be optimal for high
noise levels. Although double detections decrease the performance of the detection algorithm, they do not represent a problem for the whole spike sorting procedure. In practice, the false positives show up as an additional cluster that can be disregarded later. For further testing of the clustering algorithm, the complete set of simulated spikes (both detected and undetected) will be used.

5.2 Feature Extraction. Figure 6 shows the wavelet coefficients for the spikes of the data sets shown in Figures 4 and 5. For clarity, wavelet coefficients of overlapping spikes are not plotted. Coefficients corresponding to individual spikes are superimposed, each representing how closely the spike waveform matches the wavelet function at a particular scale and time. The coefficients are organized in detail levels (D1–D4) and a final approximation (A4), which correspond to the different frequency bands into which the spike shapes are decomposed. Especially in Figure 6A, we observe that some of the coefficients cluster around different values for the different spike classes and are thus well suited for classification. Most of these coefficients are chosen by the KS test, as shown with black markers. For comparison, the 10 coefficients with maximum variance are also marked. It is clear from this figure that the coefficients showing the best discrimination are not necessarily the ones with the largest variance. In particular, the maximum variance criterion misses several coefficients from the high-frequency scales (D1–D2) that allow a good separation between the different spike shapes. Figure 7 shows the distribution of the 10 best wavelet coefficients from Figure 6B (in this case, including coefficients corresponding to overlapping spikes) selected using the KS criterion versus the maximum variance criterion.
Three of the ten wavelet coefficients selected using the KS criterion show a deviation from normality that is not associated with a multimodal distribution: coefficient 42, which has a skewed distribution, and coefficients 19 and 10, which, in addition to skewed distributions, have significant kurtosis, mainly due to the outliers introduced by the overlapping spikes. In the remaining cases, the KS criterion selected coefficients with a multimodal distribution. In contrast, with the exception of coefficient 20, the variance criterion selects coefficients with a uniform distribution that hampers classification. For the same data, Figure 8 shows the best three-dimensional (3D) projections of the wavelet coefficients selected with the KS criterion (Figure 8A) and the variance criterion (Figure 8B), and the projection onto the first three principal components (Figure 8C). In all cases, the clustering was done automatically with SPC and is represented with different gray levels. We observe that using the KS criterion, it is possible to clearly identify the three clusters. In contrast, when choosing the coefficients with the largest variance, it is possible to identify only two of the three clusters, and when using the first three principal components, only a single cluster is detected (the number of classification errors is shown in Table 2, example 2, noise 0.15). Note also that the cluster shapes can be quite elongated, thus challenging any cluster-
Figure 6: Wavelet transform of the spikes from Figure 4 and Figure 5 (panels A and B, respectively). Each curve represents the wavelet coefficients of a given spike, with gray levels denoting the spike class after clustering with SPC. (A) Several wavelet coefficients are sensitive to localized features. (B) Separation is much more difficult due to the similarity of the spike shapes. The markers show coefficients selected based on the variance criterion and coefficients selected based on deviation from normality. D1–D4 are the detail levels, and A4 corresponds to the last approximation level.
Figure 7: Distribution of the wavelet coefficients corresponding to Figure 6B. (A) The coefficients selected with the Kolmogorov-Smirnov criterion. (B) The coefficients selected with a maximum variance criterion. The coefficient number is shown at the left of each panel. Note that the first criterion is more appropriate, as it selects coefficients with multimodal distributions.
Figure 8: Best projections of the wavelet coefficients selected with the (A) KS criterion and (B) variance criterion. (C) The projection onto the first three principal components. Note that only with the wavelet coefficients selected using the KS criterion is it possible to separate the three clusters. Cluster assignment (shown with different gray levels) was done after use of SPC.
Table 2: Number of Classification Errors for All Examples and Noise Levels Obtained Using SPC, K-means, and Different Spike Features.

                                 ---------------------- SPC ----------------------   ------ K-means ------
Example (Noise)   No. of Spikes  Wavelets   PCA        Spike Shape  Feature Set      Wavelets   PCA
Example 1 (0.05)  2729              1          1          0          863                0          0
Example 1 (0.10)  2753              5         17          0          833                0          0
Example 1 (0.15)  2693              5         19          0         2015 (2)            0          0
Example 1 (0.20)  2678             12        130         24          614               17         17
Example 1 (0.25)  2586             64        911        266         1265 (2)           69         68
Example 1 (0.30)  2629            276       1913        838         1699 (1)          177        220
Example 1 (0.35)  2702            483       1926 (2)   1424 (2)     1958 (1)          308        515
Example 1 (0.40)  2645            741       1738 (1)   1738 (1)     1977 (1)          930        733
Example 2 (0.05)  2619              3          4          2          502                0          0
Example 2 (0.10)  2694             10        704         59         1893 (1)            2         53
Example 2 (0.15)  2648             45       1732 (1)   1054 (2)     2199 (1)           31        336
Example 2 (0.20)  2715            306       1791 (1)   2253 (1)     2199 (1)          154        740
Example 3 (0.05)  2616              0          7          3          619                0          1
Example 3 (0.10)  2638             41       1781        794         1930 (1)          850        184
Example 3 (0.15)  2660             81       1748 (1)   2131 (1)     2150 (1)          859        848
Example 3 (0.20)  2624            651       1711 (1)   2449 (1)     2185 (1)          874       1170
Example 4 (0.05)  2535              1       1310         24         1809 (1)          686        212
Example 4 (0.10)  2742              8        946 (2)    970 (2)     1987 (1)          271        579
Example 4 (0.15)  2631            443       1716 (2)   1709 (1)     2259 (1)          546        746
Example 4 (0.20)  2716           1462 (2)   1732 (1)   1732 (1)     1867 (1)          872       1004
Average           2662            232       1092        873         1641              332        371

Notes: In parentheses is the number of clusters detected, when different from 3. The numbers corresponding to the example shown in Figure 8 are underlined in the original.
ing procedure based on Euclidean distances to the cluster centers, such as K-means.

5.3 Clustering of the Spike Features. In Figure 9, we show the performance of the algorithm for the first data set (shown in Figure 4). In Figure 9A, we plot the cluster sizes as a function of the temperature. At a temperature T = 0.02, the transition to the superparamagnetic phase occurs. As the temperature is increased further, a transition to the paramagnetic regime takes place at T = 0.12. The temperature T = 0.02 (vertical dotted line) is selected for clustering based on the criterion described in section 3.3. Figure 9B shows the classification after clustering, and Figure 9C the original spike shapes (without the noise). In this case, the spike shapes are easy to differentiate owing to the large negative phase of the class 1 spikes and the initial negative peak of the class 3 spikes. Figure 10 shows the other three data sets at a noise level of 0.1. In all these cases, the classification errors were very low (see Table 2). Note also that many overlapping spikes were correctly classified (those in color), especially when the latency between the spike peaks was larger than about 0.5 ms. Pairs of spikes appearing with a lower time separation are not clustered
Figure 9: (A) Cluster size vs. temperature. Based on the stability criterion for the clusters, a temperature of 0.02 was automatically chosen for separating the three spike classes. (B) All spikes, with gray levels according to the outcome of the clustering algorithm. Note the presence of overlapping spikes. (C) The original spike shapes.
by the algorithm (shown in gray) but can, in principle, be identified in a second stage by using the clustered spikes as templates. One would then look for the combination of templates (allowing delays between them) that best reproduces the nonclustered spike shapes. This procedure is outside the scope of this study and will not be addressed further.
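Such a second stage might look like the following sketch, which exhaustively tries pairs of clustered templates at all relative lags and keeps the combination with the smallest squared error (the function name and criterion are our choices, not the article's):

```python
def best_template_pair(waveform, templates, max_lag=12):
    """Second-stage sketch for overlapping spikes: try every pair of
    clustered templates and every relative lag, keeping the combination
    whose sum best reproduces the unclassified waveform."""
    n = len(waveform)

    def shifted(t, lag):
        return [t[i - lag] if 0 <= i - lag < len(t) else 0.0 for i in range(n)]

    best = None
    for a, ta in enumerate(templates):
        for b, tb in enumerate(templates):
            for lag in range(0, max_lag + 1):
                model = [u + v for u, v in zip(shifted(ta, 0), shifted(tb, lag))]
                err = sum((w - m) ** 2 for w, m in zip(waveform, model))
                if best is None or err < best[0]:
                    best = (err, a, b, lag)
    return best  # (squared error, template a, template b, lag of template b)
```

For realistic data, the residual error against the best single template would first decide whether a waveform is an overlap at all; the exhaustive pair search above is only practical for a small number of templates and lags.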
Figure 10: Outcome of the clustering algorithm for the three remaining examples. The inset plots show the original spike shapes. Most of the classification errors (gray traces) are due to overlapping spikes with short temporal separation.
5.4 Comparison with Other Spike Features. Errors of spike identification derive cumulatively from two sources: incorrect feature extraction and incorrect clustering. First, we compared the discrimination power of wavelets at different noise levels with that of other feature extraction methods, using the same clustering algorithm, SPC. Specifically, we compared the outcome of classification with wavelets, principal component analysis (using the first three principal components), the whole spike shape, and a fixed set of spike features. The spike features were the mean square value of the signal (energy), the amplitude of the main peak, and the amplitudes of the peaks preceding and following it. The width of the spike (given by the positions of the three peaks) was also tested but was found to worsen the spike classification. Table 2 summarizes the results (examples 1 and 2 were shown in Figures 4 and 5, respectively). Performance was quantified in terms of the number of classification errors and the number of clusters detected. Values in brackets denote the number of clusters detected, when different from 3. Errors due to overlapping spikes are not included in the table (but overlapping spikes were also inputs to the algorithm). Since example 1 was easier to cluster, in this case we also analyzed four higher noise levels. In general, the best performance was achieved using the selected wavelet coefficients as spike features. In fact, with wavelets, all three clusters are correctly detected, with the exception of example 4 at noise level 0.2. With the other features, only one cluster is detected as the noise level increases. Considering all wavelet coefficients as inputs to SPC (i.e., without the selection based on the KS test) gave nearly the same results as using the entire spike shape (not shown). This is not surprising, since the wavelet transform is linear and thus amounts to a rotation or rescaling of the original high-dimensional space.
The clear benefit comes from considering only those coefficients that allow a good separation between the clusters, selected using the KS test. Note that the PCA features gave slightly worse results than the entire spike shapes (and clearly worse than wavelets). PCA therefore gives a reasonable solution when using clustering algorithms that cannot handle high-dimensional spaces (although this is not a problem for SPC). Fixed features, such as the peak amplitudes and mean squared values, were much less efficient. This was expected, since the spike classes were generated with the same peak value and, in some cases, with very similar shapes that could not be discriminated by these features. We remark that the number of detected clusters decreases monotonically with the noise level, but the number of classification errors does not necessarily increase monotonically, because even when three clusters are correctly recognized, a large number of spikes may remain unassigned to any of them (see, e.g., example 4 with PCA for noise levels 0.05 and 0.1).

5.5 Comparison with Other Clustering Algorithms. A number of different clustering algorithms can be applied to spike sorting, and the choice of an optimal one is important in order to exploit the discrimination power
of the feature space. The most commonly used algorithms are supervised and usually assume gaussian cluster shapes and specific properties of the noise distribution. In order to illustrate the difference from these methods, we compare results obtained using SPC with those obtained using K-means. The partitioning of data by K-means keeps objects of the same cluster as close as possible (using Euclidean distance in our case) and as far as possible from objects in the other clusters. The standard K-means algorithm leaves no items unclassified, but the total number of clusters must be predefined (making it a supervised method). These constraints simplify the clustering problem and give K-means an advantage over SPC, since in the former case we know that each object should be assigned to one of the three clusters. The rightmost two columns of Table 2 show the clustering performance using K-means with wavelets and with PCA. Despite being unsupervised, SPC applied to the wavelet features gives the best performance. For spike shapes that are relatively easy to differentiate (Table 2, examples 1 and 2), the outcomes with wavelets are similar using K-means or SPC. However, the advantage of SPC with wavelets becomes apparent when the spike shapes are more similar (Table 2, examples 3 and 4). We remark that with SPC, points may remain unclassified, whereas with K-means, all points are assigned to one of the three clusters (and thus have at least a 33% chance of being correctly classified). This led K-means to outperform SPC for example 4 at noise level 0.2, where only two of the three clusters were identified with SPC. In general, the number of classification errors using PCA with K-means is higher than using wavelets with K-means. The few exceptions where PCA outperformed wavelets with K-means (example 3 at noise level 0.1; example 4 at noise level 0.05) can be attributed to the more elongated cluster shapes obtained with wavelets, which K-means fails to separate.
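For reference, the K-means baseline can be sketched in a few lines (a textbook Lloyd iteration with Euclidean distance, not the article's exact implementation):

```python
import random

def kmeans(points, k=3, iters=50, seed=0):
    """Minimal K-means (Lloyd's algorithm) with Euclidean distance.
    Unlike SPC, every point is always assigned to one of the k clusters,
    and k must be given in advance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Recompute each center as its cluster mean (keep old center if empty).
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Two well-separated point triads: K-means recovers them as two clusters.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
```

The hard assignment of every point to a cluster is exactly the property discussed above: it guarantees a nonzero chance of a correct label but cannot leave ambiguous points, such as unresolved overlaps, unclassified the way SPC does.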
5.6 Simulations with Nongaussian Spike Distributions. In this section, we consider conditions of real recordings that may compromise the performance of the proposed clustering algorithm. In particular, we simulate the effect of an electrode moving with respect to one of the neurons, bursting activity, and a correlation between the spikes and the local field potential. Clearly, these situations are difficult to handle with algorithms that assume a particular distribution of the noise. In fact, all of these cases add a nongaussian component to the spike shape variance. For simulating a moving electrode, we used the example shown in Figure 5 (example 2, noise 0.15 in Table 2), but in this case, we progressively (linearly) decreased the amplitude of the first spike class with time, from a value of 1 at the beginning of the recording to 0.3 at the end. Using SPC with wavelets, it was still possible to detect all three clusters, and of a total of 2692 spikes, the number of classification errors was only 48. Second, we simulated a bursting cell, based also on the example shown in Figure 5. The first class consisted of three consecutive spikes with amplitudes 1.0, 0.7, and 0.5, separated by 3 ms on average (SD 1 ms, range 1–5 ms). Of a total of 2360 spikes, the three clusters were again correctly detected, and we had 25 classification errors using wavelets with SPC. Finally, we considered a correlation between the spike amplitudes and the background activity, similar to the condition in which spikes co-occur with local field events. We used the same spike shapes and noise level shown in Figure 5, but the spike amplitudes varied from 0.5 to 1 depending on the amplitude of the background activity at the time of the spike (0.5 when the background activity reached its minimum and 1.0 when it reached its maximum). In this case, again, the three clusters were correctly detected, and we had 439 classification errors out of a total of 2706 spikes.

6 Discussion

We presented a method for the detection and sorting of neuronal multiunit activity. The procedure is fully unsupervised and fast, making it particularly interesting for the classification of spikes from a large number of simultaneously recorded channels. To obtain a quantitative measure of its performance, the method was tested on simulated data sets with different noise levels and similar spike shapes. The noise was generated by the superposition of a large number of small-amplitude spikes, resembling the characteristics of real recordings. This makes the spectral characteristics of the noise and the spikes similar, thus increasing the difficulty of detection and clustering. The proposed method had an overall better performance than conventional approaches, such as using PCA for extracting spike features or K-means for clustering. Spike detection was achieved by applying an amplitude threshold to the high-pass filtered data. The threshold value was calculated automatically using the median of the absolute value of the signal.
The advantage of this estimation, rather than using the variance of the overall signal, is that it diminishes the dependence of the threshold on the firing rate and the peak-to-peak amplitude of the spikes, thus giving an improved estimate of the background noise level. Indeed, high firing rates and high spike amplitudes lead to an overestimation of the appropriate threshold value. In terms of the number of misses and false positives, the proposed automatic detection procedure performed well for the different examples and noise levels. The advantage of using the wavelet transform as a feature extractor is that very localized shape differences between units can be discerned. The information about the shape of the spikes is distributed over several wavelet coefficients, whereas with PCA, most of the information about the spike shapes is captured by only the first three principal components (Letelier & Weber, 2000; Hulata, Segev, Shapira, Benveniste, & Ben-Jacob, 2000; Hulata, Segev, & Ben-Jacob, 2002), which are not necessarily optimal for cluster identification (see Figure 7). Moreover, wavelet coefficients are localized in time. In agreement with these considerations, a better performance of
wavelet coefficients in comparison with PCA was shown for several examples generated with different noise levels. For comparison, we also used the whole spike shape as input to the clustering algorithm. As shown in Table 2, the dimensionality reduction achieved with the KS test clearly improves the clustering performance. Since the wavelet transform is linear, using all the wavelet coefficients yields nearly the same results as taking the entire spike shape (it is just a rescaling of the space). Since the need for a low-dimensional space is a limiting factor for many clustering algorithms, the dimensionality reduction achieved by combining wavelets with the KS test may be of broad interest. The use of wavelets for spike sorting has been proposed recently by Letelier et al. (2000) and Hulata et al. (2000, 2002). Our approach differs from theirs in several respects, most notably in the implementation of our algorithm as a single unsupervised process. One key feature of our algorithm is the choice of wavelet coefficients using a Kolmogorov-Smirnov test of normality, thus selecting the features that give an optimal separation between the different clusters. Letelier et al. (2000) suggested visually selecting those wavelet coefficients with the largest mean, variance, and, most fundamentally, multimodal distribution. However, neither a large average nor a large variance makes a given coefficient the best separator. In contrast, we considered only the multimodality of the distributions. In this respect, we showed that coefficients in the low-frequency bands, with a large variance and uniform distribution, are not appropriate for separating distinct clusters. On the contrary, they introduce dimensions with nonsegregated distributions that in practice may compromise the performance of the clustering algorithm.
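This selection step can be sketched as follows, using a Lilliefors-style KS statistic (the distance between the empirical CDF and a gaussian CDF with the sample's own mean and standard deviation); the exact test used by the authors may differ in detail:

```python
import numpy as np
from math import erf, sqrt

def ks_deviation_from_normality(values):
    """One-sample KS distance of `values` from a gaussian fitted to the
    sample. Larger values indicate multimodal or otherwise nongaussian
    coefficients, which are the useful ones for separating clusters."""
    v = np.sort(np.asarray(values, dtype=float))
    mu, sd = v.mean(), v.std()
    # theoretical gaussian CDF evaluated at each sorted sample
    cdf = np.array([0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0)))) for x in v])
    n = len(v)
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    return float(max(np.abs(ecdf_hi - cdf).max(), np.abs(ecdf_lo - cdf).max()))

def select_coefficients(coeffs, k=10):
    """Indices of the k columns (one wavelet coefficient per column,
    one spike per row) deviating most from normality."""
    scores = [ks_deviation_from_normality(coeffs[:, j])
              for j in range(coeffs.shape[1])]
    return np.argsort(scores)[::-1][:k]

# Three gaussian coefficients plus one bimodal coefficient: the bimodal
# one is ranked first, even though a variance criterion would also pick
# a broad unimodal coefficient.
rng = np.random.default_rng(1)
n = 500
gauss = rng.normal(0.0, 1.0, (n, 3))
bimodal = np.concatenate([rng.normal(-3, 0.5, n // 2),
                          rng.normal(3, 0.5, n - n // 2)])
X = np.hstack([gauss, bimodal[:, None]])
best = select_coefficients(X, k=1)
```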
A caveat of the KS test as an estimator of multimodality is that it can also select unimodal nongaussian distributions (those that are skewed or have large kurtosis). In fact, this was the case for three coefficients shown in Figure 7. Despite this limitation, the selection of wavelet coefficients with the KS test gave optimal results that indeed outperformed other feature selection methods. The main caveat of PCA is that the eigenvectors accounting for the largest variance of the data are selected, but these directions do not necessarily provide the best separation of the spike classes. In other words, it may well be that the information for separating the clusters is represented in principal components with low eigenvalues, which are usually disregarded. In this respect, our method is more reminiscent of independent component analysis (ICA), where directions with minimal mutual information are chosen. Moreover, it has been shown that minimum mutual information is related to maximum deviation from normality (Hyvarinen & Oja, 2000). Hulata et al. (2000, 2002) proposed a criterion based on the mutual information of all pairs of cluster combinations for selecting the best wavelet packet coefficients. However, such an approach cannot be implemented in an unsupervised way. In fact, Hulata and coworkers used prior knowledge of the spike classes to select the best wavelet packets. A second caveat
is the difficulty of estimating mutual information (Quian Quiroga, Kraskov, Kreuz, & Grassberger, 2002) in comparison with the KS test of normality. The second stage of the clustering algorithm is based on superparamagnetic clustering. This method relies on K-nearest-neighbor interactions and therefore does not require low-variance or nonoverlapping clusters, or a priori assumptions of a particular distribution of the data (e.g., gaussian). Superparamagnetic clustering has already been applied with excellent results to several clustering problems (Blatt et al., 1996, 1997; Domany, 1999). In our study, we demonstrated for the first time its application to spike sorting. Moreover, it is possible to locate the superparamagnetic regime automatically, thus making the entire sorting procedure unsupervised. Besides the obvious advantage of unsupervised clustering, we compared the results obtained with SPC to those obtained using K-means (with Euclidean distance). Although this comparison should not be generalized to all existing clustering methods, it exemplifies the advantages of SPC over methods that rely on gaussian distributions, clusters with centers inside the cluster (see Figure 1 for counterexamples), nonoverlapping clusters with low variance, and so on. The performance of K-means could in principle be improved by using another distance metric. However, this would generally imply assumptions about the noise distribution and its interference with spike variability. Such assumptions may improve the clustering performance in some situations but may also be violated in other conditions of real recordings. Note that this comparison was in principle unfair to SPC, since K-means is a supervised algorithm to which the total number of clusters is given as an extra input. Of course, the total number of clusters is usually not known in real recording situations.
In general, besides the advantage of being unsupervised, SPC gave better results than K-means. Overall, the presented results show an optimal performance of the clustering algorithm in situations that resemble real recording conditions. However, we stress that this clustering method should not be taken as a black box giving the optimal spike sorting. When possible, it is always desirable to confirm the validity of the results based on the shape and variance of the spike waveforms, the interspike interval distribution, the presence of a refractory period, and so forth. Finally, we anticipate the generalization of the method to tetrode recordings. Indeed, adding spike features from adjacent channels should improve spike classification and reduce ambiguity.
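One of the sanity checks mentioned above, the presence of a refractory period, is easy to automate. A minimal sketch (the 2 ms refractory period and the `refractory_violations` helper are illustrative assumptions, not part of the authors' method):

```python
import numpy as np

def refractory_violations(spike_times_ms, refractory_ms=2.0):
    """Fraction of inter-spike intervals shorter than the refractory
    period; a large value suggests the cluster mixes several units."""
    isi = np.diff(np.sort(spike_times_ms))
    return float(np.mean(isi < refractory_ms)) if isi.size else 0.0

rng = np.random.default_rng(0)
clean = np.cumsum(rng.uniform(5.0, 50.0, 1000))   # one unit: all ISIs >= 5 ms
mixed = np.sort(rng.uniform(0.0, 1000.0, 2000))   # dense mixture: many short ISIs
```

A well-isolated cluster should score essentially zero, while a cluster that lumps several units together shows a substantial fraction of sub-refractory intervals.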
Acknowledgments

We are very thankful to Richard Andersen and Christof Koch for support and advice. We also acknowledge very useful discussions with Noam Shental, Moshe Abeles, Ofer Mazor, Bijan Pesaran, and Gabriel Kreiman. We are indebted to Eytan Domany for providing us the SPC code and to Alon Nevet
who provided the original spike data for the simulation. This work was supported by the Sloan-Swartz Foundation and DARPA.

References

Abeles, M., & Goldstein, M. (1977). Multispike train analysis. Proc. IEEE, 65, 762–773.
Binder, K., & Heermann, D. W. (1988). Monte Carlo simulations in statistical physics: An introduction. Berlin: Springer-Verlag.
Blatt, M., Wiseman, S., & Domany, E. (1996). Super-paramagnetic clustering of data. Phys. Rev. Lett., 76, 3251–3254.
Blatt, M., Wiseman, S., & Domany, E. (1997). Data clustering using a model granular magnet. Neural Computation, 9, 1805–1842.
Chui, C. (1992). An introduction to wavelets. San Diego, CA: Academic Press.
Domany, E. (1999). Super-paramagnetic clustering of data: The definitive solution of an ill-posed problem. Physica A, 263, 158–169.
Donoho, D., & Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425–455.
Fee, M. S., Mitra, P. P., & Kleinfeld, D. (1996). Variability of extracellular spike waveforms of cortical neurons. J. Neurophysiol., 76, 3823–3833.
Harris, K. D., Henze, D. A., Csicsvari, J., Hirase, H., & Buzsáki, G. (2000). Accuracy of tetrode spike separation as determined by simultaneous intracellular and extracellular measurements. J. Neurophysiol., 84, 401–414.
Hulata, E., Segev, R., & Ben-Jacob, E. (2002). A method for spike sorting and detection based on wavelet packets and Shannon's mutual information. J. Neurosci. Methods, 117, 1–12.
Hulata, E., Segev, R., Shapira, Y., Benveniste, M., & Ben-Jacob, E. (2000). Detection and sorting of neural spikes using wavelet packets. Phys. Rev. Lett., 85, 4637–4640.
Hyvarinen, A., & Oja, E. (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13, 411–430.
Letelier, J. C., & Weber, P. P. (2000). Spike sorting based on discrete wavelet transform coefficients. J. Neurosci. Methods, 101, 93–106.
Lewicki, M. (1998).
A review of methods for spike sorting: The detection and classification of neural action potentials. Network: Comput. Neural Syst., 9, R53–R78.
Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Analysis and Machine Intell., 11, 674–693.
Pouzat, C., Mazor, O., & Laurent, G. (2002). Using noise signature to optimize spike-sorting and to assess neuronal classification quality. J. Neurosci. Methods, 122, 43–57.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Quian Quiroga, R., & Garcia, H. (2003). Single-trial event-related potentials with wavelet denoising. Clin. Neurophysiol., 114, 376–390.
Quian Quiroga, R., Kraskov, A., Kreuz, T., & Grassberger, P. (2002). Performance of different synchronization measures in real data: A case study on electroencephalographic signals. Phys. Rev. E, 65, 041903.
Quian Quiroga, R., Sakowicz, O., Basar, E., & Schürmann, M. (2001). Wavelet transform in the analysis of the frequency composition of evoked potentials. Brain Research Protocols, 8, 16–24.
Samar, V. J., Swartz, K. P., & Raghuveer, M. R. (1995). Multiresolution analysis of event-related potentials by wavelet decomposition. Brain and Cognition, 27, 398–438.
Wolff, U. (1989). Comparison between cluster Monte Carlo algorithms in the Ising spin model. Phys. Lett. B, 228, 379–382.

Received December 3, 2002; accepted January 30, 2004.
LETTER
Communicated by John Platt
Decomposition Methods for Linear Support Vector Machines Wei-Chun Kao
[email protected] Kai-Min Chung
[email protected] Chia-Liang Sun
[email protected] Chih-Jen Lin
[email protected] Department of Computer Science and Information Engineering, National Taiwan University, Taipei 106, Taiwan
In this letter, we show that decomposition methods with alpha seeding are extremely useful for solving a sequence of linear support vector machines (SVMs) with more data than attributes. This strategy is motivated by Keerthi and Lin (2003), who proved that for an SVM with data not linearly separable, once C is large enough, the dual solutions have the same free and bounded components. We explain why a direct use of decomposition methods for linear SVMs is sometimes very slow and then analyze why alpha seeding is much more effective for linear than for nonlinear SVMs. We also conduct comparisons with other methods that are efficient for linear SVMs and demonstrate the effectiveness of alpha seeding techniques in model selection.
1 Introduction

Solving linear and nonlinear support vector machines (SVMs) has been considered as two different tasks. For linear SVMs without too many attributes per data instance, people have been able to train on millions of data points (e.g., Mangasarian & Musicant, 2000), but for other types of problems, in particular nonlinear SVMs, the requirements of huge memory and computational time have prohibited us from solving very large problems. Currently, the decomposition method, a specially designed optimization procedure, is one of the main tools for nonlinear SVMs. In this letter, we show the drawbacks of existing decomposition methods, in particular sequential minimal optimization (SMO)-type algorithms, for linear SVMs. To remedy these drawbacks, using theorem 3 of Keerthi and Lin (2003), we develop effective strategies so that decomposition methods become efficient for solving linear SVMs.

Neural Computation 16, 1689–1704 (2004) © 2004 Massachusetts Institute of Technology
First, we briefly describe linear and nonlinear SVMs. Given training vectors x_i ∈ R^n, i = 1, ..., l, in two classes, and a vector y ∈ R^l such that y_i ∈ {1, −1}, the standard SVM formulation (Cortes & Vapnik, 1995) is as follows:

    min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
    subject to   y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,                 (1.1)
                 ξ_i ≥ 0,  i = 1, ..., l.

If φ(x) = x, we usually say equation 1.1 is the form of a linear SVM. On the other hand, if φ maps x to a higher-dimensional space, equation 1.1 is a nonlinear SVM. For a nonlinear SVM, the number of variables depends on the size of w and can be very large (even infinite), so people solve the following dual form:

    min_α  (1/2) α^T Q α − e^T α
    subject to  y^T α = 0,                                        (1.2)
                0 ≤ α_i ≤ C,  i = 1, ..., l,

where Q is an l × l positive semidefinite matrix with Q_ij = y_i y_j φ(x_i)^T φ(x_j), e is the vector of all ones, and K(x_i, x_j) = φ(x_i)^T φ(x_j) is the kernel function. Equation 1.2 is solvable because its number of variables is the size of the training set, independent of the dimensionality of φ(x). The primal-dual relation shows

    w = Σ_{i=1}^{l} α_i y_i φ(x_i),                               (1.3)

so

    sgn(w^T φ(x) + b) = sgn( Σ_{i=1}^{l} α_i y_i K(x_i, x) + b )

is the decision function. Unfortunately, for a large training set, Q becomes such a huge, dense matrix that traditional optimization methods cannot be directly applied. Currently, specially designed approaches, such as decomposition methods (Osuna, Freund, & Girosi, 1997; Joachims, 1998; Platt, 1998) and finding the nearest points of two convex hulls (Keerthi, Shevade, Bhattacharyya, & Murthy, 2000), are the major ways of solving equation 1.2.
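As a concrete check of equations 1.2 and 1.3, the kernel-expansion form of the decision function must agree with sgn(w^T x + b) when the kernel is linear. A small numerical sketch (the α values here are random stand-ins, not an actual dual solution):

```python
import numpy as np

def decision(X_train, y, alpha, b, x):
    """sgn(sum_i alpha_i y_i K(x_i, x) + b) with the linear kernel
    K(u, v) = u^T v; rows of X_train are the training vectors x_i."""
    k = X_train @ x                      # K(x_i, x) for all i
    return np.sign(np.dot(alpha * y, k) + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
alpha = rng.uniform(0.0, 1.0, 10)

# Equation 1.3 with a linear kernel: w = sum_i alpha_i y_i x_i,
# so the kernel expansion and w^T x + b must give the same label.
w = (alpha * y) @ X
x_new = rng.normal(size=3)
```

This identity is what makes the primal form attractive when n ≪ l: w can be carried explicitly instead of summing over all training points.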
Decomposition Methods for Linear Support Vector Machines
1691
On the other hand, for linear SVMs, if n ≪ l, w is not a huge vector variable, so equation 1.1 can be solved by many regular optimization methods. As at the optimal solution ξ_i = max(0, 1 − y_i(w^T x_i + b)), in a sense we mainly have to find w and b. Therefore, if the number of attributes n is small, there are few main variables w and b in equation 1.1, no matter how large the training set is. Currently, people have been able to train a linear SVM with millions of data points (e.g., Mangasarian & Musicant, 2000), but for a nonlinear SVM with many fewer data, we need more computational time as well as more computer memory. Therefore, it is natural to ask whether in SVM software, linear and nonlinear SVMs should be treated differently and solved by two methods. It is also interesting to see how capable nonlinear SVM methods (e.g., decomposition methods) are for linear SVMs. By linear SVMs, we mean those with n < l. If n ≥ l, the dual form, equation 1.2, has fewer variables than the primal's w, a situation similar to nonlinear SVMs. As the rank of Q is less than or (usually) equal to min(n, l), the linear SVMs we are interested in here are those with low-rank Q. Recently, in many situations, linear and nonlinear SVMs have been considered together. Some approaches (Lee & Mangasarian, 2001; Fine & Scheinberg, 2001) approximate nonlinear SVMs by different problems, which are in the form of linear SVMs (Lin & Lin, in press; Lin, 2002) with n ≪ l. In addition, for nonlinear SVM model selection with the gaussian kernel, Keerthi and Lin (2003) proposed an efficient method, which has to conduct linear SVM model selection first (i.e., linear SVMs with different C). Therefore, it is important to discuss optimization methods for linear and nonlinear SVMs at the same time. In this letter, we focus on decomposition methods. In section 2, we show that existing decomposition methods are inefficient for training linear SVMs.
Section 3 demonstrates theoretically and experimentally that the alpha seeding technique is particularly useful for linear SVMs. Some implementation issues are discussed in section 4. The decomposition method with alpha seeding is compared with existing linear SVM methods in section 5. In section 6, we apply the new implementation to solve the sequence of linear SVMs required by the model selection method of Keerthi and Lin (2003). Discussion and concluding remarks are in section 7.

2 Drawbacks of Decomposition Methods for Linear SVMs with n ≪ l

The decomposition method is an iterative procedure. In each iteration, the index set of variables is separated into two sets B and N, where B is the working set. Then in that iteration, the variables corresponding to N are fixed while a subproblem on the variables corresponding to B is minimized. If q is the size of the working set B, in each iteration only q columns of the Hessian matrix Q are required. They can be calculated and stored in computer memory when needed. Thus, unlike regular optimization methods, which usually require
access to the whole Q, here the memory problem is solved. Clearly, decomposition methods are specially designed for nonlinear SVMs. Throughout this article, we use DSVM to refer to an SVM solver that adopts the decomposition method, such as LIBSVM (Chang & Lin, 2001b) and SVMlight (Joachims, 1998). When the size of its working set is two, we say it is of the SMO type (Platt, 1998). Unlike popular optimization methods such as Newton or quasi-Newton, which enjoy fast convergence, decomposition methods converge slowly, as in each iteration only very few variables are updated. We will show that the situation is even worse when solving linear SVMs. It has been demonstrated experimentally (Hsu & Lin, 2002b) that if C is large and the Hessian matrix Q is not well conditioned, decomposition methods converge very slowly. For linear SVMs, if n ≪ l, then Q is a low-rank and hence ill-conditioned matrix. In Figure 1, we demonstrate a simple example using the problem heart from the statlog database (Michie, Spiegelhalter, & Taylor, 1994). Each attribute is scaled to [−1, 1]. We use LIBSVM (Chang & Lin, 2001b) to solve linear and nonlinear (RBF kernel e^{−‖x_i−x_j‖²/(2σ²)} with 1/(2σ²) = 1/n) SVMs with C = 2^{−8}, 2^{−7.5}, ..., 2^8 and present the number of iterations. Though two different problems are solved (in particular, their Q_ij's are in different ranges), Figure 1 clearly indicates the huge number of iterations for solving the linear SVMs. Note that for linear SVMs, the slope is greater than that for nonlinear SVMs and is very close to one, especially when C is large. This means that doubling C doubles the number of iterations.
Figure 1: Number of decomposition iterations for solving SVMs with linear (the thick line) and RBF (the thin line) kernel on heart_scale. (Axes: iterations, 100 to 10^6 on a log scale, versus log(C) from −8 to 8.)
Decomposition Methods for Linear Support Vector Machines
1693
The following theorems, which hold only for linear SVMs, help us realize the difficulty the decomposition methods suffer from. Theorem 2 can further explain why the number of iterations is nearly doubled when C is doubled.

Theorem 1 (Keerthi & Lin, 2003). The dual linear SVM has the following properties:

- There is C* such that for all C ≥ C*, there are optimal solutions at the same face.
- In addition, for all C ≥ C*, the primal solution w is the same.

By the "face" of α, we mean three types of value of each α_i: (1) lower-bounded, that is, α_i = 0; (2) upper-bounded, that is, α_i = C; and (3) free, that is, 0 < α_i < C. More precisely, the face of α can be represented by a length-l vector whose components are in {lower-bounded, upper-bounded, free}. This theorem indicates that after C ≥ C*, exponentially increasing numbers of iterations are wasted in order to obtain the same primal solution w. Even if we could detect C* and stop training SVMs, for C not far below C*, the number of iterations may already be huge. Therefore, it is important to have an efficient linear SVM solver that can handle both large and small C. Next, we try to explain the nearly doubled iterations by the difficulty of locating the faces of the dual solution α.

Theorem 2. Assume that any two parallel hyperplanes in the feature space do not contain more than n + 1 points of {x_i} on them. We have (see footnote 1):

1. Any optimal solution of equation 1.2 has no more than n + 1 free components.
2. There is C* such that after C ≥ C*, all optimal solutions of equation 1.2 share at least the same l − n − 1 bounded α variables.

The proof is available in Kao, Chung, Sun, and Lin (2002). This result indicates that when n ≪ l, most components of optimal solutions are at bounds. Furthermore, dual solutions at C and 2C share at least the same l − 2(n + 1) upper- and lower-bounded components. If an upper-bounded α_i at C remains upper-bounded at 2C, a direct use of decomposition methods means that α_i is updated from 0 to C and from 0 to 2C, respectively. Thus, we anticipate that the efforts are roughly doubled. We confirm this explanation
We conrm this explanation 1 Note that a pair of parallel hyperplanes is decided by n C 1 numbers (the n number decides one hyperplane in the feature space Rn , and another one decides the other hyperplane parallel to it). So the assumption of theorem 2 would be violated if m linear equations in n C 1 variables, where m > n C 1, have solutions. The occurrence of this scenario is of measure zero. This explains that the assumption of theorem 2 is generic.
Figure 2: The error face rate (i.e., the difference between the current face and the one at the final solution) for solving the linear SVM with C = 128 and C = 256 on heart_scale. (Axes: error face rate, 0 to 80, versus iterations × 100.)
by comparing the error face rate (i.e., the difference between the current face and the one at the final solution) with C = 2^7 and C = 2^8. As shown in Figure 2, the two curves are quite similar except that the scale of the x-axis differs by a factor of two. This indicates that α travels through similar faces for C = 2^7 and C = 2^8, and the number of iterations spent on each face with C = 2^8 is roughly doubled.

3 Alpha Seeding for Linear SVMs

Theorem 2 implies that for linear SVMs, dual solutions may share many upper- and lower-bounded variables. Therefore, we conjecture that if α^1 is an optimal solution at C = C_1, then α^1 C_2/C_1 can be a very good initial point for solving equation 1.2 with C = C_2. The reason is that α^1 C_2/C_1 is at the same face as α^1, and it is likely to be at a similar face to one optimal solution at C = C_2. This technique, called alpha seeding, was originally proposed for SVM model selection (DeCoste & Wagstaff, 2000), where several problems (see equation 1.2) with different C have to be solved. Earlier work focusing on nonlinear SVMs mainly uses alpha seeding as a heuristic. For linear SVMs, the speed can be significantly boosted due to the above analysis. The following theorem further supports the use of alpha seeding:

Theorem 3. There are two vectors A and B and a number C* such that for any C ≥ C*, AC + B is an optimal solution of equation 1.2.

The proof is in Kao et al. (2002). If A_i > 1, then A_i C + B_i > C once C is large enough, which violates the bound constraints in equation 1.2. Similarly, A_i cannot be less than zero, so 0 ≤ A_i ≤ 1. Therefore, we can consider the following three situations for the vectors A and B:

1. 0 < A_i ≤ 1
Table 1: Comparison of Iterations (Linear Kernel), With and Without Alpha Seeding.

                    With α-Seeding                          Without α-Seeding
Problem      Total Iter.   C*        w^T w        Total Iter.    Iter. (C = 2^7.5)   Iter. (C = 2^8)
heart             27,231   2^3.5     5.712          2,449,067          507,122            737,734
australian        79,162   2^2.5     2.071         20,353,966        3,981,265          5,469,092
diabetes          33,264   2^6.5     16.69          1,217,926          274,155            279,062
german           277,932   2^10      3.783         42,673,649        6,778,373         14,641,135
web           24,044,242   Unstable  Unstable         ≥ 10^8        74,717,242             ≥ 10^8
adult          3,212,093   Unstable  Unstable         ≥ 10^8        56,214,289         84,111,627
ijcnn            590,645   2^6       108.6         41,440,735        8,860,930         13,927,522
2. A_i = 0, B_i = 0

3. A_i = 0, B_i > 0
For the second case, α_i^1 C_2/C_1 = A_i C_2 + B_i C_2/C_1 = 0, and for the first case, A_i C ≫ B_i once C is large enough, so α_i^1 C_2/C_1 = A_i C_2 + B_i C_2/C_1 ≈ A_i C_2 + B_i. In both cases, alpha seeding is very useful. On the other hand, by theorem 2, there are few (≤ n + 1) components satisfying the third case. Next, we conduct some comparisons between DSVM with and without alpha seeding. Here, we consider two-class problems only. Some statistics of the data sets used are in Tables 1 and 2. The four small problems are from the statlog collection (Michie et al., 1994). The problem adult was compiled by Platt (1998) from the UCI adult data set (Blake & Merz, 1998). Problem web is also from Platt. Problem ijcnn is from the first problem of the IJCNN 2001 challenge (Prokhorov, 2001). Note that we use the winner's transformation of the raw data (Chang & Lin, 2001a). We train linear SVMs with C ∈ {2^−8, 2^−7.5, ..., 2^8}. That is, [2^−8, 2^8] is discretized into 33 points with equal ratio.

Table 2: Comparison of Iterations (RBF Kernel), With and Without Alpha Seeding.

Problem         l       n     α-Seeding   Without α-Seeding
heart         270      13        43,663              56,792
australian    690      14       230,983             323,288
diabetes      768       8       101,378             190,047
german       1000      24       191,509             260,774
web        49,749     300       633,788             883,319
adult      32,561     123     2,380,265           4,110,663
ijcnn      49,990      22       891,563           1,968,396

Table 1 presents the total number
of iterations of training 33 linear SVMs using the alpha seeding approach. We also individually solve them by LIBSVM and list the number of iterations (total, C = 2^7.5, and C = 2^8). The alpha seeding implementation is described in detail in section 4. We also list the approximate C* for which linear SVMs with C ≥ C* have the same decision function. In addition, the constant w^T w for C ≥ C* is given. For some problems (e.g., web and adult), w^T w does not reach a constant until C is very large, so we mark them as "unstable" in Table 1. To demonstrate that alpha seeding is much more effective for linear than for nonlinear SVMs, Table 2 presents the number of iterations using the radial basis function (RBF) kernel K(x_i, x_j) = e^{−‖x_i−x_j‖²/(2σ²)} with 1/(2σ²) = 1/n. It is clear that for the RBF kernel, the number of iterations saved by alpha seeding is marginal. In addition, compared with the "Total Iterations" column of Table 1, we confirm again the slow convergence of linear SVMs without alpha seeding. The alpha seeding approach performs so well that its total number of iterations is much smaller than that of solving one single linear SVM with the original decomposition implementation. Therefore, if we intend to solve one linear SVM with a particular C, it may be more efficient to start with a small initial C_0 and then use the proposed alpha seeding method, gradually increasing C. Furthermore, since we have then solved linear SVMs with different C, model selection by cross validation is already done. From the discussion in section 2, solving several linear SVMs without alpha seeding is time-consuming, and model selection is not easy. Note that in Table 1, web is the most difficult problem and requires the largest number of iterations. Theorem 2 helps to explain this: since web's large number of attributes may lead to more free variables during the iterations or at the final solution, alpha seeding is less effective.
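The seeding rule itself is one line: scale the previous solution by C_2/C_1. The sketch below also checks the face-preservation property that motivates it (the `face` helper for the face labels of section 2 is illustrative, not from the authors' code):

```python
import numpy as np

def alpha_seed(alpha_prev, C_prev, C_new):
    """Scale the dual solution at C_prev into a warm start for C_new.
    Components at the old bound C_prev land exactly on the new bound
    C_new, zeros stay zero, and free components stay strictly inside,
    so the seed lies on the same face as alpha_prev."""
    return alpha_prev * (C_new / C_prev)

def face(alpha, C, tol=1e-12):
    """0 = lower-bounded, 1 = free, 2 = upper-bounded."""
    return np.where(alpha <= tol, 0, np.where(alpha >= C - tol, 2, 1))

C1, C2 = 2.0, 4.0
alpha1 = np.array([0.0, 0.7, 2.0, 2.0, 1.3])   # a dual solution at C = 2
seed = alpha_seed(alpha1, C1, C2)
```

In a model-selection loop over C ∈ {2^−8, 2^−7.5, ..., 2^8}, each solve would start from `alpha_seed` of the previous solution instead of α = 0.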
4 Implementation

Though the concept is simple, an efficient and elegant implementation requires considerable thought. Most DSVM implementations maintain the gradient of the dual objective function during the iterations. The gradient is used for selecting the working set and checking the stopping condition. For nonlinear SVMs, calculating the gradient Qα − e requires O(l²n) operations (O(n) for each kernel evaluation), which is expensive. Therefore, many forms of DSVM software use α = 0 as the initial solution, which makes the initial gradient −e immediately available. However, in DSVM with alpha seeding, the initial solution is obtained from the previous problem, so the initial gradient is not a constant vector. Fortunately, for linear SVMs, the situation is not as bad as for nonlinear SVMs. In this case, the kernel matrix has the form Q = X^T X, where X = [y_1 x_1, ..., y_l x_l] is an n × l matrix. We can calculate
the gradient by Qα − e = X^T(Xα) − e, which requires only O(ln) operations. The first decomposition software to use this for linear SVMs is SVMlight (Joachims, 1998). Similarly, if α is changed by Δα between two consecutive iterations, then the change of the gradient is Q(Δα) = X^T(XΔα). Since there are only q nonzero elements in Δα, the gradient can be updated with O(nq) + O(ln) = O(ln) operations, where q is the size of the working set. Note that because l ≫ q, increasing q from 2 to some other small constant does not affect the cost of updating the gradient. In contrast, for nonlinear SVMs, if Q is not in the cache, updating the gradient means computing Q(Δα), which requires q columns of Q and takes O(lnq) time. Thus, while an implementation for nonlinear SVMs may choose an SMO-type implementation (i.e., q = 2) for a lower cost per iteration, we should use a larger q for linear SVMs, because the gradient update is independent of q and the total number of iterations may be reduced. For nonlinear SVMs, constructing the kernel matrix Q is expensive, so a cache storing recently used elements of Q is necessary. For linear SVMs, however, neither the kernel matrix Q nor the cache is needed.
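The O(ln) gradient maintenance can be sketched directly in numpy; X, α, and the working-set indices below are random stand-ins for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 5, 200
X = rng.normal(size=(n, l))      # columns are y_i * x_i, so Q = X^T X
alpha = rng.uniform(0.0, 1.0, l)
e = np.ones(l)

# Full gradient without forming the l x l matrix Q: O(ln), not O(l^2 n).
grad = X.T @ (X @ alpha) - e

# A working-set step changes only q components; update the gradient
# with X^T (X delta) in O(ln), independent of q.
delta = np.zeros(l)
delta[[3, 17]] = [0.1, -0.05]    # q = 2 nonzero changes
grad = grad + X.T @ (X @ delta)

# Recomputing from scratch gives the same vector.
recomputed = X.T @ (X @ (alpha + delta)) - e
```

Because the update cost does not grow with q, a larger working set costs nothing extra per iteration here, which is the design point made above.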
ASVM and LSVM both consider a squared error term in the objective function:

    min_{w,b,ξ}  (1/2)(w^T w + b²) + C Σ_{i=1}^{l} ξ_i².          (5.1)

The dual problem of equation 5.1 is

    min_α  (1/2) α^T (Q + yy^T + I/(2C)) α − e^T α
    subject to  0 ≤ α_i,  i = 1, ..., l,                           (5.2)

where I is the identity matrix. The solution of equation 5.2 has far more free components than that of equation 1.2, as upper-bounded variables of
Table 3: Comparison of Different Approaches for Linear SVMs.

             DSVM (LIBSVM)                                          ASVM                  LSVM
             With α-Seeding                Without α-Seeding
Problem      Acc.    Time: C ≤ 2^8 (≤ 2^3)  Time: C ≤ 2^3           Acc.    Time          Time
australian   85.51   4.1 (3.6)                        7.0           88.70   8.1           2.3
heart        85.56   1.4 (0.9)                        1.0           85.56   5.8           1.8
diabetes     79.69   2.4 (2.3)                        1.6           81.64   6.2           2.0
german       73.65   10.2 (6.2)                      18.2           73.95   16.4          9.0
ijcnn        92.65   981.9 (746.0)                3,708.6           92.51   725.2         17,496.4
adult        85.02   1,065.9 (724.8)             12,026.8           84.90   3,130.7       13,445.4
web          98.67   18,035.3 (1,738.2)           7,035.6           98.63   10,315.9      43,060.1

Notes: Acc.: test accuracy using the parameter obtained from cross validation. Time (in seconds): total training time of five-fold cross validation trying C = 2^−10, 2^−9.5, ..., 2^8 (or up to 2^3 where specified).
this equation are likely to be free now. With different formulations, their stopping conditions are not exactly the same. We use conditions from similar derivations; details are discussed in Kao et al. (2002).

In this experiment, we consider LIBSVM for DSVM (with and without alpha seeding). For ASVM, we use the authors' C++ implementation available on-line at http://www.cs.wisc.edu/dmi/asvm. The authors of LSVM provide only MATLAB programs, so we implement it by modifying LIBSVM. The experiments were done on an Intel Xeon 2.8 GHz machine with 1024 MB RAM using the gcc compiler.

Using the same benchmark problems as in section 3, we perform the comparisons in Table 3 as follows: for each problem, we randomly select two-thirds of the data for training and leave the remainder for testing. For all algorithms except DSVM without alpha seeding, five-fold cross validation with C = 2^{-10}, 2^{-9.5}, ..., 2^8 on the training set is conducted. For DSVM without alpha seeding, as the training time is huge, only C up to 2^3 is tried. Then, using the C that gives the best cross-validation rate, we train a model and predict the test data. Both test accuracy and total computational time are reported.

In Table 3, alpha seeding with C up to 2^8 is competitive with solving C up to only 2^3 without alpha seeding. For these problems, considering C ≤ 2^3 is enough, and if alpha seeding stops at 2^3 as well, it is several times faster than without alpha seeding. Since alpha seeding is not applied to ASVM and LSVM, we acknowledge that their computational time could be further improved. The results here also serve as the first comparison between ASVM and LSVM. Clearly, ASVM is faster. Moreover, due to the huge computational time, we set the maximal number of iterations of LSVM at 1000. For the problems adult and web, once C is large, the iteration limit is reached before the stopping conditions are satisfied.
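The warm-starting idea behind alpha seeding can be illustrated with a toy dual coordinate descent solver for the L1-loss linear SVM dual, warm-started along an increasing grid of C values. This is a minimal sketch, not the LIBSVM implementation: the bias term and equality constraint are omitted, and the solver, data, and sweep counts are assumptions for the example:

```python
import numpy as np

def dcd_linear_svm(X, y, C, alpha=None, sweeps=50):
    """Toy dual coordinate descent for min 0.5*a'Qa - e'a, 0 <= a_i <= C,
    with Q_ij = y_i y_j x_i'x_j (bias and equality constraint omitted).
    Maintains w = sum_i a_i y_i x_i so each coordinate update is O(n)."""
    l, n = X.shape
    a = np.zeros(l) if alpha is None else np.clip(alpha, 0.0, C)
    w = (a * y) @ X                        # stays consistent with the seeded a
    diag = np.einsum('ij,ij->i', X, X)     # Q_ii = x_i . x_i
    for _ in range(sweeps):
        for i in range(l):
            g = y[i] * (w @ X[i]) - 1.0    # partial gradient of the dual
            new = min(max(a[i] - g / diag[i], 0.0), C)
            w += (new - a[i]) * y[i] * X[i]
            a[i] = new
    return a, w

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=80) > 0, 1.0, -1.0)

a = None
for C in [2.0 ** k for k in range(-4, 5)]:   # C = 2^-4, ..., 2^4
    a, w = dcd_linear_svm(X, y, C, alpha=a)  # previous alpha seeds this solve
```

Each solve in the loop starts from the previous optimal α (clipped to the new box), which is the essence of the seeding strategy studied in this section.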
In addition to comparing DSVM with ASVM and LSVM, we compare the performance of the SMO type (q = 2) and that with a larger working set
Decomposition Methods for Linear Support Vector Machines
Table 4: Comparison of Different Subproblem Sizes in Decomposition Methods for Linear SVMs.

             Decomposition Methods with Alpha Seeding
             q = 2 (SVMlight)               q = 30 (SVMlight)
Problem      Total Iterations   Time        Total Iterations   Time
australian   50,145             0.71        6,533              1.19
heart        25,163             0.21        1,317              0.33
diabetes     30,265             0.40        5,378              0.31
german       182,051            2.90        6,006              3.68
ijcnn        345,630            185.85      79,847             115.03
adult        1,666,607          1,455.20    414,798            516.71
web          NA^a               NA^a        2,673,578          1,885.10

Notes: Time in seconds; q: size of the working set. The time here is shorter than that in Table 3 because we do not perform cross validation.
a. SVMlight faced numerical difficulties.
(q = 30) for DSVM with alpha seeding in Table 4 by modifying the software SVMlight, which allows an adjustable q. All default settings of SVMlight are used. In this experiment, we solve linear SVMs with C = 2^{-8}, 2^{-7.5}, ..., 2^8 and report the computational time and total number of iterations. Note that when q is two, SVMlight and LIBSVM use the same algorithm and differ only in some implementation details. The results in Table 4 show that the implementation with a larger working set takes less time than that with a smaller one. This is consistent with our earlier statement that for linear SVMs, SMO-type decomposition methods are less favorable.

Regarding the computational time reported in this section, we must caution that quite a few implementation details may affect it. For example, each iteration of ASVM and LSVM involves several matrix-vector multiplications. Hence, it is possible to use finely tuned dense linear algebra subroutines. For the LSVM implementation here, by using ATLAS (Whaley, Petitet, & Dongarra, 2000), the time for large problems is reduced by two-thirds. Thus, it is possible to further reduce the time of ASVM in Table 3, though we find it too complicated to modify the authors' program. Using such tools also means that X is treated as a dense matrix. In contrast, X is currently treated as a sparse matrix in both LIBSVM and SVMlight, where each iteration requires two matrix-vector multiplications X(X^T(α^{k+1} − α^k)). This sparse format creates some overhead when the data are dense.

6 Experiments on Model Selection

If the RBF kernel

\[
K(x_i, x_j) = e^{-\|x_i - x_j\|^2/(2\sigma^2)}
\]
is used, Keerthi and Lin (2003) propose the following model selection procedure for finding good C and σ²:

Algorithm 1. Two-line model selection

1. Search for the best C of linear SVMs and call it C̃.

2. Fix C̃ from step 1 and search for the best (C, σ²) satisfying log σ² = log C − log C̃ using the RBF kernel.

That is, we solve a sequence of linear SVMs first and then a sequence of nonlinear SVMs with the RBF kernel. The advantage of algorithm 1 over an exhaustive search of the parameter space is that only parameters on two lines are considered. If decomposition methods are directly used for both linear and nonlinear SVMs here, then, due to the huge number of iterations, solving the linear SVMs becomes a bottleneck. Our goal is to show that by applying the alpha-seeding technique to linear SVMs, the computational time spent on the linear part becomes similar to that spent on the nonlinear SVMs.

Earlier, in Keerthi and Lin (2003), due to the difficulty of solving linear SVMs, algorithm 1 was tested only on small two-class problems. Here, we would like to evaluate this algorithm on large multiclass data sets. We consider the problems dna, satimage, letter, and shuttle, which were originally from the statlog collection (Michie et al., 1994) and were used in Hsu and Lin (2002a). Except for dna, whose attributes take the two possible values 0 and 1, each attribute of all training data is scaled to [−1, 1]. Then test data are adjusted using the same linear transformation. Since LIBSVM contains a well-developed cross-validation procedure, we use it as the DSVM solver in this experiment. We search for C̃ by five-fold cross validation on linear SVMs using uniformly spaced log2 C̃ values in [−10, 10] (with grid space 1). As LIBSVM considers γ = 1/(2σ²) as the kernel parameter, the second step is to search for good (C, γ) satisfying

\[
-1 - \log_2\gamma = \log_2 C - \log_2\tilde{C}.
\qquad (6.1)
\]
We discretize [−10, 4] as the values of log2 γ and calculate log2 C from equation 6.1. To avoid log2 C landing in an abnormal region, we consider only points with −2 ≤ log2 C ≤ 12, so the second step may solve fewer SVMs than the first step. The same computational environment as that for section 3 is used.

Since this model selection method is based on the analysis of binary SVMs, a multiclass problem has to be decomposed into several binary SVMs. We employ the one-against-one approach: if there are k classes of data, all k(k − 1)/2 two-class combinations are considered. For any two classes of data, the model selection is conducted to find the best (C, σ²). With
Table 5: Comparison of Different Model Selection Methods.

           Complete Grid Search                          Algorithm 1
           1 (C, σ²)   k(k−1)/2 (C, σ²)
Problem    Accuracy    Accuracy    Time       Time     Time (linear)  Time (nonlinear)  Accuracy
dna        95.62       95.11       4,945      202      123            79                94.86 (94.77)
satimage   91.9        92.2        7,860      1,014    743            271               91.55 (90.55)
letter     97.9        97.72       56,753     5,365    3,423          1,942             96.54 (95.9)
shuttle    99.92       99.94       104,904    4,196    2,802          1,394             99.81 (99.7)

Notes: Accuracies of algorithm 1 enclosed in parentheses are the accuracies if we search log2 C̃ ∈ [−10, 3] in step 1 of algorithm 1. Time is in seconds.
the k(k − 1)/2 best (C, σ²) and the corresponding decision functions, a voting strategy is used for the final prediction.

In Table 5, we compare this approach with two versions of a complete grid search. First, for any two classes of data, five-fold cross validation is conducted on 225 points, a discretization of the (log2 C, log2 γ) ∈ [−2, 12] × [−10, 4] space. The second way follows the cross-validation procedure adopted by LIBSVM for multiclass data, where a list of (C, σ²) is selected first, and then for each (C, σ²), the one-against-one method is used to estimate the cross-validation accuracy of the multiclass data. Therefore, for the final optimal model, all k(k − 1)/2 decision functions share the same C and σ². Since the same number of nonlinear SVMs is trained, the time for the two complete grid searches is exactly the same, but the performance (test accuracy) may differ. There has been no such comparison so far, so we present a preliminary investigation here.

Table 5 presents the experimental results. For each problem, we compare the test accuracy of the two complete grid searches and of algorithm 1. The two grid searches are denoted "1 (C, σ²)" and "k(k − 1)/2 (C, σ²)," respectively, depending on how many (C, σ²) are used by the decision functions. The performances of the three approaches are very similar. However, the total model selection time of algorithm 1 is much shorter. In addition, we list in parentheses the accuracy of algorithm 1 if we search only log2 C̃ in [−10, 3] in step 1. We find that the accuracy is consistently lower if we search only C̃ in this smaller region. In fact, if we search log2 C̃ in [−10, 3], many of the selected log2 C̃ equal 3 in this experiment. This means that [−10, 3] is too small to cover good parameter regions. Alpha seeding is what makes algorithm 1 practical: without it, the time for solving the linear SVMs increases greatly, and the proposed model selection loses its advantage.
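The candidate set searched in step 2 can be written down directly from equation 6.1 (a small illustration; the grid ranges follow the text, while the integer discretization of log2 γ and the function name are our assumptions):

```python
def step2_grid(log2_C_tilde):
    """Candidate (log2 C, log2 gamma) pairs for step 2 of algorithm 1.
    Equation 6.1 gives log2 C = log2 C_tilde - 1 - log2 gamma."""
    grid = []
    for log2_gamma in range(-10, 5):        # discretize [-10, 4]
        log2_C = log2_C_tilde - 1 - log2_gamma
        if -2 <= log2_C <= 12:              # keep log2 C in the normal region
            grid.append((log2_C, log2_gamma))
    return grid

grid = step2_grid(log2_C_tilde=3)           # suppose step 1 returned C-tilde = 2^3
assert all(-1 - g == c - 3 for c, g in grid)
```

Every retained pair lies on the single line defined by equation 6.1, which is why step 2 trains far fewer nonlinear SVMs than a full two-dimensional grid search.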
We then investigate the stability of the new model selection approach. Due to timing restrictions, we consider two smaller problems, banana and adult small, tested in Keerthi and Lin (2003). adult small, a subset of adult used in section 3, is a binary problem with 1605 examples. Table 6 shows
Table 6: Mean and Standard Deviation of Two Model Selection Methods.

              Complete Grid Search                                     Algorithm 1
              log2 C          log2 γ          Accuracy                 log2 C̃          Accuracy
Problem       Mean    SD      Mean    SD      Mean     SD              Mean    SD       Mean     SD
banana        7       4.45    -0.4    1.51    87.91    0.47            -1.9    2.18     76.36    12.21
adult small   5.4     2.37    -7.6    1.71    83.82    0.27            0.3     4.08     83.20    1.28
dna           5.4     3.34    -5      0       95.56    0.19            -       -        94.85    0.20
satimage      2.5     0.71    0.1     0.57    91.74    0.24            -       -        91.19    0.28

Note: Each method was applied 10 times.
the means and standard deviations of the parameters and the accuracy obtained by applying the k(k − 1)/2 (C, σ²) grid search and algorithm 1 10 times each. For algorithm 1, we list only C̃'s variance, because the variances of the parameters C and σ², which are computed from equation 6.1, are less meaningful. Note that when the same method is applied 10 times, the differences in parameters and accuracy are due to the randomness of cross validation.

From Table 6, we can see that although the performance (test accuracy and consumed time) of model selection algorithm 1 is good, it may be less stable. That is, the variance of its accuracy is significantly larger than that of the complete grid search method, while the variances of both parameters are large. We think that in the complete grid search method, the cross-validation estimation bounds the overall error. Thus, the variances of the obtained parameters do not affect the testing performance. However, in the two-line search method (algorithm 1), two-stage cross validations are utilized. Thus, the variance in the first stage may affect the best performance of the second stage.

7 Discussion and Conclusion

It is arguable that we may have used too strict a stopping condition in DSVM when C is large. One possibility is to use a stopping tolerance proportional to C. This would reduce the number of iterations so that directly solving linear SVMs with large C may be possible. However, in the appendix of Chung et al. (2002), we show that even in this setting, DSVM with alpha seeding is still several times faster than the original DSVM, especially for large data sets. Moreover, a stopping tolerance that is too large will cause DSVM to stop with wrong solutions.

In conclusion, we hope that based on this work, SVM software using decomposition methods can be suitable for all types of problems, both n ≪ l and n ≫ l.
Acknowledgments

This work was supported in part by the National Science Council of Taiwan, grant NSC 90-2213-E-002-111. We thank Thorsten Joachims for help on modifying SVMlight for the experiments in section 5.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases (Tech. Rep.). Irvine, CA: University of California, Department of Information and Computer Science. Available on-line at: http://www.ics.uci.edu/~mlearn/MLRepository.html.

Chang, C.-C., & Lin, C.-J. (2001a). IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN. New York: IEEE.

Chang, C.-C., & Lin, C.-J. (2001b). LIBSVM: A library for support vector machines. Available on-line at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Cortes, C., & Vapnik, V. (1995). Support-vector network. Machine Learning, 20, 273–297.

DeCoste, D., & Wagstaff, K. (2000). Alpha seeding for support vector machines. In Proceedings of the International Conference on Knowledge Discovery and Data Mining. New York: ACM Press.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.

Hsu, C.-W., & Lin, C.-J. (2002a). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2), 415–425.

Hsu, C.-W., & Lin, C.-J. (2002b). A simple decomposition method for support vector machines. Machine Learning, 46, 291–314.

Joachims, T. (1998). Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Kao, W.-C., Chung, K.-M., Sun, T., & Lin, C.-J. (2002). Decomposition methods for linear support vector machines (Tech. Rep.). Taipei: Department of Computer Science and Information Engineering, National Taiwan University. Available on-line at: http://www.csie.ntu.edu.tw/~cjlin/papers/linear.pdf.

Keerthi, S. S., & Lin, C.-J. (2003). Asymptotic behaviors of support vector machines with gaussian kernel. Neural Computation, 15(7), 1667–1689.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2000). A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Transactions on Neural Networks, 11(1), 124–136.

Lee, Y.-J., & Mangasarian, O. L. (2001). RSVM: Reduced support vector machines. In Proceedings of the First SIAM International Conference on Data Mining. Philadelphia: SIAM.

Lin, K.-M. (2002). Reduction techniques for training support vector machines. Unpublished master's thesis, National Taiwan University.

Lin, K.-M., & Lin, C.-J. (2003). A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14(6), 1449–1559.
Mangasarian, O. L., & Musicant, D. R. (2000). Active set support vector machine classification. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 577–583). Cambridge, MA: MIT Press.

Mangasarian, O. L., & Musicant, D. R. (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161–177.

Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification. Englewood Cliffs, NJ: Prentice Hall. Data available on-line at: ftp.ncc.up.pt/pub/statlog/.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97 (pp. 130–136). New York: IEEE.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Prokhorov, D. (2001). IJCNN 2001 neural network competition. Slide presentation at IJCNN'01, Ford Research Laboratory. Available on-line at: http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf.

Whaley, R. C., Petitet, A., & Dongarra, J. J. (2000). Automatically tuned linear algebra software and the ATLAS project (Tech. Rep.). Chattanooga: Department of Computer Sciences, University of Tennessee.

Received March 7, 2003; accepted January 7, 2004.
LETTER
Communicated by Manfred Opper
An Asymptotic Statistical Theory of Polynomial Kernel Methods

Kazushi Ikeda
[email protected] Graduate School of Informatics, Kyoto University, Sakyo, Kyoto 606-8501, Japan
The generalization properties of learning classifiers with a polynomial kernel function are examined. In kernel methods, input vectors are mapped into a high-dimensional feature space where the mapped vectors are linearly separated. It is well known that a linear dichotomy has an average generalization error, or learning curve, proportional to the dimension of the input space and inversely proportional to the number of given examples in the asymptotic limit. However, this does not hold in the case of kernel methods, since the feature vectors lie on a submanifold of the feature space, called the input surface. In this letter, we discuss how the asymptotic average generalization error depends on the relationship between the input surface and the true separating hyperplane in the feature space, where the essential dimension of the true separating polynomial, named the class, is important. We show its upper bounds in several cases and confirm these using computer simulations.
1 Introduction

Kernel classifiers, such as the support vector machine (SVM) and the kernel perceptron learning algorithm, have received much attention in recent years as successful new pattern classification techniques (Vapnik, 1995, 1998; Schölkopf, Burges, & Smola, 1998; Cristianini & Shawe-Taylor, 2000; Smola, Bartlett, Schölkopf, & Schuurmans, 2000). A kernel classifier maps an input vector x ∈ X to the corresponding feature vector f(x) in a high-dimensional feature space F and discriminates it linearly, that is, outputs the sign of the inner product of the feature vector and a parameter vector w (Aizerman, Braverman, & Rozonoer, 1964).

An SVM shows good generalization performance in experiments, and its generalization ability has been analyzed theoretically based on the probably approximately correct (PAC) learning approach (Vapnik, 1995, 1998). In other words, a sufficient number of examples is given as a function of δ and ε such that the probability that the generalization error is larger than ε becomes less than δ, regardless of the distribution of the input vectors (Valiant, 1984). This framework considers the worst case in a sense, and the

Neural Computation 16, 1705–1719 (2004) © 2004 Massachusetts Institute of Technology
VC dimension plays an essential role in representing the complexity of the learning machines (Vapnik & Chervonenkis, 1971).

Another approach to generalization ability is to directly estimate the generalization error or, more specifically, to evaluate the average generalization error, defined as the generalization error averaged over the given examples and a new input. The average generalization error as a function of the number of given examples is often called a learning curve. One tool for deriving the learning curve is statistical mechanics (Baum & Haussler, 1989; Györgyi & Tishby, 1990; Opper & Haussler, 1991), which has been applied to SVM learning under the assumption that the input is binary (Dietrich, Opper, & Sompolinsky, 1999). The results given by statistical mechanics explain the phenomenon well. However, the method has yet to be justified mathematically, and it is applicable only to the case where the dimension of the input space and the number of examples tend to infinity together. Another approach is asymptotic statistics (Amari, Fujita, & Shinomoto, 1992; Amari, 1993; Amari & Murata, 1993; Murata, Yoshizawa, & Amari, 1994), which assumes that the number of examples is sufficiently large. When the learning machine is not differentiable, as in the case of a linear dichotomy, a stochastic geometrical method can also be employed (Ikeda & Amari, 1996). The analyses above have shown that the generalization error is, in general, asymptotically proportional to the number of parameters and inversely proportional to the number of given examples, although its exact coefficient is still unknown.

In this letter, we discuss the learning curve of kernel classifiers. Since they are essentially linear dichotomies in a high-dimensional feature space F, it could be expected that their generalization errors are proportional to the number of parameters, which coincides with the dimension of the space F.
However, analyses in the PAC learning framework have shown that the generalization ability does not depend on the apparently high dimension of the feature space, even if the feature space is of infinite dimension (Vapnik, 1995, 1998). Kernel methods differ from linear dichotomies in that examples in the feature space are located only on a low-dimensional submanifold onto which the input vectors are mapped. Hence, the learning curve of kernel methods does not match that of linear dichotomies in the framework of asymptotic statistics.

The purpose of this letter is to derive the asymptotic learning curve of kernel methods and to clarify the generalization properties of kernel learning methods when a polynomial kernel is employed as the kernel function. When the input space is one-dimensional, it was shown that the average generalization error depends only on the complexity of the true separating polynomial and does not increase even if the degree of the polynomial kernel becomes larger (Ikeda, forthcoming). However, this result is restrictive and not applicable to general cases, since it depends significantly on the fact that any polynomial of one variable is factorizable into polynomials of degree one or two.
To extend the above result to an input space of general dimension, we introduce a new index, named the class, that characterizes the relationship between the submanifold that the input vectors form and the true separating function in the feature space. The class describes the number of essential parameters, and the average generalization error is proportional to it. We give its upper bounds in some cases.

The rest of the letter is organized as follows. Section 2 reviews the results on the learning curve of linear dichotomies. The mathematical formulation of polynomial kernel methods is given in section 3. A new index for the generalization ability is introduced in section 4, and its bounds in several cases are given in section 5. Section 6 is devoted to computer simulations, and conclusions and discussions are given in section 7.

2 Learning Curve of Linear Dichotomies

Since a kernel classifier is a linear dichotomy in feature space, we first review the learning curve of linear dichotomies in this section.

2.1 Formulation. Consider a deterministic dichotomy machine that outputs y ∈ {±1} for an input vector f = (f_1, f_2, ..., f_c)^T in a bounded subset F of a c-dimensional Euclidean space R^c by calculating y = sgn[w^T f], where w is a c-dimensional vector called the parameter vector, or simply the parameter, which has the prior distribution q_pr(w). We assume that f has a positive density function p(f) > 0 in F such that the difference between the outputs with w and w + Δw is proportional to ‖Δw‖ when ‖Δw‖ is small, that is,

\[
\int \bigl|\,\mathrm{sgn}[(w + \Delta w)^T f] - \mathrm{sgn}[w^T f]\,\bigr|\, df = a(w)\|\Delta w\| + o(\|\Delta w\|),
\]

where a(w) > 0, as assumed in Amari et al. (1992). An example consists of an input f and its corresponding output y = sgn[w_o^T f], where w_o is the true parameter vector of a fixed supervisor chosen according to q_pr(w).
Given N examples, denoted Ξ_N = {(f_n, y_n), n = 1, ..., N}, a learning machine is trained so that it can reproduce y_n for any f_n in the example set Ξ_N. Let ŵ be such a parameter of the learning machine. The learning machine with parameter ŵ predicts the output y_{N+1} corresponding to a test input vector f_{N+1} independently chosen from the same distribution as the examples. The generalization error L(Ξ_N, f_{N+1}, y_{N+1}) of the trained machine is defined as the probability that the predicted output ŷ for a test input f_{N+1} differs from the true output y_{N+1}, taken over the posterior distribution q_po(w) of the true parameter w_o. Since the pair (f_{N+1}, y_{N+1}) can be regarded as an example, we denote the generalization error by L(Ξ_{N+1}) for short. The average generalization error
is defined as the expectation of the generalization error and is denoted by

\[
\bar{L}_N = E_{\Xi_{N+1}}\bigl[L(\Xi_{N+1})\bigr].
\qquad (2.1)
\]

\bar{L}_N as a function of N is called a learning curve.

2.2 Learning Algorithms and Admissible Region. The average generalization error depends on how the parameter is estimated. We choose it randomly according to the posterior distribution q_po(w), from a Bayesian standpoint. This method is called the Gibbs algorithm (Levin, Tishby, & Solla, 1990). In the case of a deterministic dichotomy, q_po(w) ∝ q_pr(w) when w ∈ A_N and q_po(w) = 0 otherwise, where A_N is called the admissible region (Ikeda & Amari, 1996) or the version space (Mitchell, 1982), defined as

\[
A_N = \{\, w \mid w^T f_n y_n > 0,\ n = 1, \dots, N \,\}
\qquad (2.2)
\]

(see Figure 1). Hence, the posterior distribution is expressed as q_po(w) = q_pr(w)/Z_N, where Z_N = ∫_{A_N} q_pr(w) dw is a measure of the admissible region A_N, called the partition function in statistical mechanics.

2.3 A Bound on the Average Generalization Error. We give here an asymptotic upper bound on the average generalization error. By definition, the partition function can be rewritten as

\[
Z_N = \int_{A_N} q_{pr}(w)\, dw
\qquad (2.3)
\]
\[
\phantom{Z_N} = \int q_{pr}(w) \prod_{n=1}^{N} I[w^T f_n y_n > 0]\, dw,
\qquad (2.4)
\]
Figure 1: Examples and admissible region in the parameter space.
An Asymptotic Statistical Theory of Polynomial Kernel Methods
1709
where I[·] denotes the indicator function. The generalization error is therefore described using the partition function as

\[
L(\Xi_{N+1}) = \int q_{po}(w)\, I[w^T f_{N+1} y_{N+1} < 0]\, dw
\qquad (2.5)
\]
\[
\phantom{L(\Xi_{N+1})} = 1 - \frac{1}{Z_N} \int q_{pr}(w) \prod_{n=1}^{N+1} I[w^T f_n y_n > 0]\, dw
\qquad (2.6)
\]
\[
\phantom{L(\Xi_{N+1})} = 1 - \frac{Z_{N+1}}{Z_N}.
\qquad (2.7)
\]
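The chain of equations 2.5 to 2.7 can be checked numerically with a small Monte Carlo estimate of the partition function. This is a sketch under assumed settings (a uniform prior over directions, c = 3, and sample counts chosen for speed); because the same prior samples are reused for Z_N and Z_{N+1}, the identity holds exactly on the sample:

```python
import numpy as np

rng = np.random.default_rng(0)
c, N, S = 3, 10, 100_000

w_o = rng.normal(size=c)                  # true parameter of the supervisor
F = rng.normal(size=(N + 1, c))           # N training inputs plus one test input
y = np.sign(F @ w_o)                      # y_n = sgn(w_o . f_n)

# Draw S samples from the prior (isotropic directions) and record which
# examples each sampled parameter classifies correctly.
W = rng.normal(size=(S, c))
ok = (W @ F.T) * y > 0                    # ok[s, n] = I[w_s . f_n y_n > 0]

Z_N = ok[:, :N].all(axis=1).mean()        # Monte Carlo Z_N       (eq. 2.4)
Z_N1 = ok.all(axis=1).mean()              # Monte Carlo Z_{N+1}

# Gibbs algorithm: the posterior is the prior restricted to the version
# space A_N, so L is the posterior mass misclassifying the test example.
in_AN = ok[:, :N].all(axis=1)
L = (~ok[in_AN, N]).mean()                # eq. 2.5

assert abs(L - (1 - Z_N1 / Z_N)) < 1e-9   # eqs. 2.5-2.7 agree
```

The derivation is purely a ratio of admissible-region volumes, so any sampling scheme that estimates Z_N and Z_{N+1} with shared samples reproduces it without error.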
From the convexity of the logarithm function, −log(Z_{N+1}/Z_N) is an upper bound on equation 2.7, and hence

\[
\bar{L}_N \le E_{\Xi_N}\bigl[\log Z_N\bigr] - E_{\Xi_{N+1}}\bigl[\log Z_{N+1}\bigr] = \frac{c}{N} + o\!\left(\frac{1}{N}\right)
\qquad (2.8)
\]

asymptotically holds true. (See Amari et al., 1992, and Amari, 1993, for details.) In the following, we use equation 2.8 to derive upper bounds for the kernel methods.

2.4 Restricted Input Vectors. Suppose that the input vectors are located in a c′-dimensional subspace of the input space. We can denote the input vector by f = (f′^T, 0^T)^T without loss of generality, where f′ is a c′-dimensional input vector and 0 is a (c − c′)-dimensional null vector. Since the last (c − c′) components of the parameter vector w affect neither the output y_{N+1} for f_{N+1} nor the examples (f_n, y_n), n = 1, ..., N, we can regard this problem as a c′-dimensional one. Hence, an upper bound on the average generalization error in this case is given by c′/N. (See Amari et al., 1992, for a similar discussion.)

3 Polynomial Kernel Methods

Consider a deterministic dichotomy that outputs y ∈ {±1} for an input vector x = (x_1, x_2, ..., x_m)^T in an m-dimensional space X = R^m by calculating sgn[w^T f(x)], where f(x) is an M-dimensional vector called the feature vector, with M = \binom{m+p}{m}. Each feature vector consists of M monomials of degrees up to p, that is,

\[
f(x) = \Bigl\{\, a_{d_1,\dots,d_m}\, x_1^{d_1} x_2^{d_2} \cdots x_m^{d_m} \,\Bigm|\, \sum_{i=1}^{m} d_i \le p \,\Bigr\},
\qquad (3.1)
\]

where a_{d_1,...,d_m} is a constant determined by K(x, x′) = (x^T x′ + 1)^p. In the case of m = 2 and p = 2, for example,

\[
f(x) = \bigl(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ 1\bigr).
\qquad (3.2)
\]
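Equation 3.2 is easy to verify numerically: the explicit feature map reproduces the polynomial kernel. A quick check of the m = 2, p = 2 case (the test inputs are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, x') = (x.x' + 1)^2 with m = 2 (eq. 3.2)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, s * x1 * x2, x2**2, s * x1, s * x2, 1.0])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose(phi(x) @ phi(xp), (x @ xp + 1.0) ** 2)
```

The map has M = 6 components, matching M = \binom{2+2}{2}, and its inner products agree with the kernel, which is the identity exploited throughout this section.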
(See Schölkopf & Smola, 2002, for details on the relationship between a kernel function and the corresponding feature space.) Note that any polynomial of degree up to p can be expressed in the form w^T f(x). An example consists of an input x drawn from a certain probability density p(x) > 0 and its corresponding output y = sgn[w_o^T f(x)], where w_o is the true parameter vector of a supervisor. A learning machine given N examples predicts the output y_{N+1} for a test input vector x_{N+1} with the Gibbs algorithm; that is, the estimate ŵ is chosen according to the posterior distribution q_po(w) = p(w | Ξ_N), where Ξ_N = {(x_n, y_n), n = 1, ..., N}. The average generalization error \bar{L}_N is defined as the probability that the predicted output ŷ is false, averaged over all training sets and test inputs Ξ_{N+1} and all estimates ŵ.

We treat only the Gibbs algorithm in this letter, but the results shown below can easily be extended to other algorithms based on the admissible region, such as Bayesian estimation (Neal, 1996) and SVM learning, since the results mainly depend on how the input space is embedded into the feature space and how the true hyperplane intersects the input space there. Note that the estimator given by an SVM is the center of the hypersphere inscribed in the admissible region in the feature space, where the radius of the hypersphere corresponds to its margin. (See Herbrich, 2002, for details on the geometrical view of SVMs.)

4 Embedding and Generalization Error

As seen in section 2.4, a learning curve depends on the number of essential parameters, that is, the dimension of the space the input vectors span. A kernel classifier also has localized examples and test inputs when it is regarded as a linear dichotomy in the feature space, since the feature vectors form a submanifold determined by the feature map f(x). Hence, we consider how the localization of the feature vectors affects the learning curve.

4.1 Input Surface and Separating Curve.
We define the input surface as the m-dimensional submanifold of the feature space wherein the input vectors are mapped. Since a separating hyperplane in the feature space is described as

\[
w^T f(x) = 0,
\qquad (4.1)
\]

the input surface and the true separating hyperplane have an (m − 1)-dimensional intersection in the feature space. We call this the separating curve (see Figure 2). The task of a learning machine is to estimate the separating curve using positive examples from one side of the curve and negative ones from the other side.

As a simple example, we discuss the one-dimensional case, that is, m = 1. The intersection of the input surface (called the input curve here) and the
Figure 2: Input surface and the separating hyperplane.
separating hyperplane then consists of isolated points. So the learning machine is required to find the zero points of the true separating polynomial between the positive and negative examples in the neighborhood of the true zero points, rather than to estimate the true polynomial itself, since uncertainty exists between the positive and negative examples on the input curve. When the number N of examples is sufficiently large, the distances of the closest positive and negative examples from a true zero point go to zero. Hence, the problem is divided into p_o one-dimensional linear dichotomies when the true separating polynomial w_o^T f(x) = 0 has p_o zero points. Therefore, an upper bound on the average generalization error based on equation 2.8 is asymptotically p_o/N and does not depend on the degree p of the kernel, which corresponds to the dimension of the feature space (Ikeda, forthcoming).

Consider the above case in the feature space. The input space (the left line in Figure 3) is mapped to a curve in the feature space (the right curve) and intersects the separating hyperplane at several points (the black circles). Since the given examples, as well as the test input vector, exist only on the input curve, the generalization error is determined by how much an intersecting point of the estimated hyperplane and the input curve differs from the corresponding true one. In the framework of asymptotic statistics, assuming N ≫ 1, the direction of the difference can be approximated by the tangent vector of the input curve at each true zero point (the thick arrows). Hence, only the subspace spanned by the tangent vectors at all of the zero points affects the generalization properties. Since there are p_o such tangent vectors, df(x)/dx for x such that w_o^T f(x) = 0, the vectors have different relative orientations and span p_o dimensions. This means that the components of the parameter vector orthogonal to this space do not affect the output and
1712
K. Ikeda
Figure 3: Tangent vectors of the input curve.
Figure 4: Separating curve and the tangent plane.
that the number of effective parameters is only p_o. Therefore, the problem is equivalent to a linear dichotomy of p_o dimensions.

4.2 Class and Generalization Error. Let us extend this idea to higher-dimensional cases, m ≥ 2. Asymptotically, we consider only the examples and the test input vector in the subspace spanned by the tangent planes at the points x in the separating curve, that is, x such that w_o^T f(x) = 0 (see Figure 4). We call the dimension of this subspace the class of the separating polynomial (Ikeda, 2003). Since any vector in the tangent plane at x is described as

∂f(x)/∂x · Δx,
(4.2)
where Δx is a certain m-dimensional vector, the class corresponds to the dimension of the subspace spanned by the vectors of the form 4.2 for an x such that w_o^T f(x) = 0 and an arbitrary vector Δx. We consider the meaning of the class from another viewpoint. Denote the separating polynomial by ψ(x; w) = w^T f(x) so as to emphasize that the separating curve is a function of x with parameter w. Suppose that a zero point x exists in the separating curve with parameter w. When the
An Asymptotic Statistical Theory of Polynomial Kernel Methods
1713
parameter w changes to w + Δw, so does the zero point x to x + Δx, that is,

ψ(x; w_o) = 0,   (4.3)
ψ(x + Δx; w_o + Δw) = 0.   (4.4)
Assuming that Δx and Δw are small in the asymptotic limit, we derive from equations 4.3 and 4.4

∂ψ/∂w · Δw + ∂ψ/∂x · Δx = 0,   (4.5)
whose first term represents a polynomial of a degree up to p with coefficients Δw, since ∂ψ/∂w = f(x). This means that the class is the dimension of the subspace spanned by small variations Δw in the coefficient vector from the true vector w_o that change the zero point of ψ(x; w_o) by Δx. In other words, a variation in the coefficient vector that is orthogonal to this subspace does not change the zero point (the intersection of the input surface and the separating hyperplane), and hence it does not affect the generalization error. Therefore, the number of parameters that can influence the generalization error is equal to the class. So when the class is c, y = sgn[w^T f(x)] can be regarded as a dichotomy with c parameters, and its upper bound based on equation 2.8 is asymptotically c/N.

5 Bounds on the Class

From the above discussion, the class is the number of essential parameters of the learning machine. In this section, we discuss upper bounds on the class. A reasonable and rather constructive idea for deriving upper bounds is to explicitly give vectors orthogonal to any vector in the tangent planes at any x in the separating curve. If we find c′ such vectors, an upper bound on the class is given by M − c′, where M is the dimension of the feature space. As an example, we derive a trivial upper bound on the class. Since one component (the Mth, without loss of generality) of f(x) is a constant term, ∂f(x)/∂x · Δx is necessarily orthogonal to the constant vector (0, …, 0, 1)^T, regardless of x and Δx. This leads to an upper bound of M − 1 on the class.

5.1 Class of Irreducible Polynomials. We prove here that the class of an irreducible polynomial is equal to the possible maximum, M − 1. Suppose that w_o^T f(x) is an irreducible polynomial of degree p. Then the m polynomials
w_o^T ∂f(x)/∂x_i,   i = 1, …, m   (5.1)
are mutually prime; that is, there exists an m-dimensional polynomial vector of x, denoted by Δx(x), such that w_o^T ∂f(x)/∂x · Δx(x) = 1, where x satisfies
w_o^T f(x) = 0 and the components of Δx(x) are mutually prime. In this case,

w^T ∂f(x)/∂x · Δx(x)   (5.2)

does not constantly vanish as a polynomial for any w except (0, …, 0, 1)^T. If such a w existed, each w^T ∂f(x)/∂x_i would have to be null for any x such that w_o^T f(x) = 0 due to the primeness of the components of Δx(x). However, this never happens, since w^T ∂f(x)/∂x_i is a polynomial of a degree up to (p − 1), whereas w_o^T f(x) is an irreducible polynomial of degree p. Hence, the class c takes its maximum, M − 1. Note that this value is almost the same as the dimension M of the feature space; that is, the localization of the input vector in the feature space does not affect the generalization error at all, and the high dimension enlarges the generalization error. However, this seems inevitable, since the true separating polynomial has such a high complexity. This result is consistent with that in Dietrich et al. (1999), where the true parameter is randomly chosen. The class of a reducible polynomial may be smaller but is still unknown (Ikeda, 2003).

5.2 Upper Bounds on Redundant Polynomials. The case mentioned above treats a rather difficult problem, since a polynomial of degree p is necessary to classify inputs correctly. Hence, the pessimistic result that the average generalization error is proportional to (M − 1)/N is natural in a sense. In this section, we consider a more interesting problem: we employ a polynomial kernel of degree p while the true separating polynomial, denoted by ψ_o(x; w_o), is of degree p_o less than p. How does the average generalization error depend on p and p_o? In this case, the learning machine can correctly discriminate any inputs, since the separating polynomial induced by the kernel of degree p can express any polynomial of a degree up to p, including the true polynomial of degree p_o. When m = 1, the average generalization error p_o/N based on equation 2.8 depends not on p but only on p_o, as shown in section 4.1.
We extend the result to an input space of general dimension m ≥ 2 and give two upper bounds in a constructive manner as before, that is, by giving orthogonal vectors explicitly. The first bound is given by considering the space spanned by a tangent vector ∂f(x)/∂x_i. Since ∂f(x)/∂x_i is a polynomial of a degree up to (p − 1), it spans a _{p+m−1}C_m-dimensional space if x is arbitrary. However, x has the constraint ψ_o(x; w_o) = 0 in the definition of the class. Any polynomial written as

w^T ∂f(x)/∂x_i = φ(x) ψ_o(x; w_o),   (5.3)

where φ(x) is an arbitrary polynomial of a degree up to (p − p_o − 1), takes the value zero for x such that ψ_o(x; w_o) = 0. Such w spans a space of dimension _{p−p_o+m−1}C_m due
to the degree of φ(x). Hence, the space spanned by ∂f(x)/∂x_i has a dimension of at most _{p+m−1}C_m − _{p−p_o+m−1}C_m. Considering each i = 1, 2, …, m in the same way, an upper bound on the class is given by

m (_{p+m−1}C_m − _{p−p_o+m−1}C_m).   (5.4)
The second bound is given by considering vectors orthogonal to w^T ∂f(x)/∂x_i more directly. Suppose p ≥ 2p_o. For any polynomial written as

w^T f(x) = φ(x) ψ_o(x; w_o)²,   (5.5)

where φ(x) is an arbitrary polynomial of a degree up to (p − 2p_o), w^T ∂f(x)/∂x_i = 0 holds true for any i and any x such that ψ_o(x; w_o) = 0. Hence,

w^T ∂f(x)/∂x · Δx = 0   (5.6)

holds for any Δx. This means that any tangent vector has no component in the direction of a w satisfying equation 5.5, and such w spans a space of dimension _{p−2p_o+m}C_m due to the degree of φ(x). Hence, an upper bound on the class is given by

_{p+m}C_m − _{p−2p_o+m}C_m.   (5.7)
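As an illustration, the two upper bounds 5.4 and 5.7 are easy to evaluate numerically (a minimal sketch; the function names are ours, not from the text):

```python
from math import comb

def bound_tangent(p, m, p_o):
    """First upper bound on the class, equation 5.4:
    m * (C(p+m-1, m) - C(p-p_o+m-1, m))."""
    return m * (comb(p + m - 1, m) - comb(p - p_o + m - 1, m))

def bound_orthogonal(p, m, p_o):
    """Second upper bound on the class, equation 5.7 (requires p >= 2*p_o):
    C(p+m, m) - C(p-2*p_o+m, m)."""
    assert p >= 2 * p_o
    return comb(p + m, m) - comb(p - 2 * p_o + m, m)

# For m = 2, p_o = 2 (the elliptic example of section 6) the two bounds
# coincide for every p; for m = 3 the second bound is tighter for large p.
for p in (6, 10, 14):
    print(p, bound_tangent(p, 3, 2), bound_orthogonal(p, 3, 2))
```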
Compare the two bounds, 5.4 and 5.7. It is easily shown that both are polynomials in p of degree (m − 1) with leading coefficients m²p_o and 2mp_o, respectively. Hence, equation 5.7 is tighter when p is large, although this comparison is not general, since both bounds have terms of lower degrees and equation 5.7 carries the additional assumption p ≥ 2p_o.

6 Computer Simulations

To confirm the validity of the theoretical analyses above, some computer simulations have been carried out. In the experiments below, the kernel-adatron algorithm (Anlauf & Biehl, 1989; Friess, Cristianini, & Campbell, 1998; Nishimori, 2001) is employed as an approximation of the Gibbs algorithm:

1. Initialize α_n, n = 1, 2, …, N, randomly.
2. Calculate the margin of each example (x_n, y_n) by

   z_n = y_n Σ_{n′=1}^{N} α_{n′} y_{n′} K(x_n, x_{n′}).   (6.1)

3. If there exists a negative z_n, set α_n ← α_n − 2z_n and go to step 2.
4. Stop if all z_n are positive.

Note that the distribution of the estimate given by the kernel-adatron algorithm is unknown and depends on the input vectors. Hence, it is not clear how well it approximates the Gibbs algorithm. However, we can intuitively regard the prior distribution of the true parameter as proportional to the density function of the estimate given by the kernel-adatron, so that both distributions are the same, since the bound 2.8 does not depend on the prior. In each trial, the initial values of α_n, n = 1, 2, …, N, are independently chosen according to the normal distribution N(0, 1). The input vectors x_n, n = 1, 2, …, N, also independently obey the normal distribution N(0, I_m), where I_m is the m × m identity matrix. Note that this is equivalent to the case where x_n is uniformly distributed on the unit hypersphere S^{m−1}. The average of the generalization error is taken over 100 trials. Figure 5 shows the result in the case of m = 2, where the true separating polynomial is x_1²/4 + x_2² − 1. The x's, o's, and *'s show the experimental values of the average generalization error times the number of examples, where N = 1000, N = 3000, and N = 10,000, respectively. The solid line represents the theoretical value given by equations 5.4 and 5.7, which are the same in this case. Since it is known that the generalization error with the Gibbs algorithm is almost two-thirds of the upper bound given in equation 2.8 (Györgyi & Tishby, 1990; Opper & Haussler, 1991; Ikeda & Amari, 1996), two-thirds of equation 5.4 is also drawn with the dashed line for convenience.
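The iteration in steps 1 to 4 can be sketched as follows (a minimal sketch assuming NumPy, with two liberties taken for numerical stability: the kernel matrix is normalized to a unit diagonal, and only the most violated example is updated per pass, whereas the text subtracts 2z_n whenever some margin is negative):

```python
import numpy as np

def poly_kernel(X1, X2, p=2):
    """Polynomial kernel of degree p (with a constant term)."""
    return (X1 @ X2.T + 1.0) ** p

def kernel_adatron(K, y, max_iter=5000, seed=0):
    """Kernel-adatron iteration of section 6 (stabilized sketch)."""
    d = np.sqrt(np.diag(K))
    K = K / np.outer(d, d)                 # normalize so that K_nn = 1
    rng = np.random.default_rng(seed)
    alpha = rng.standard_normal(len(y))    # step 1: random initialization
    for _ in range(max_iter):
        z = y * (K @ (alpha * y))          # step 2: margins z_n, eq. 6.1
        n = int(np.argmin(z))
        if z[n] >= 0:                      # step 4: all margins nonnegative
            break
        alpha[n] -= z[n]                   # step 3: raise the violated margin
    return alpha, y * (K @ (alpha * y))

# Toy data labeled by the elliptic polynomial x1^2/4 + x2^2 - 1 of section 6.
rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(60, 2))
y = np.sign(X[:, 0] ** 2 / 4 + X[:, 1] ** 2 - 1)
alpha, z = kernel_adatron(poly_kernel(X, X, p=2), y)
```

Since the true discriminant is a degree-2 polynomial, the data are separable in the feature space of the degree-2 kernel, and the margins z_n become nonnegative as the iteration proceeds.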
[Plot: Coefficient versus Degree of Polynomial Kernel.]
Figure 5: Average generalization error versus degree of polynomial kernels (m = 2).
[Plot: Coefficient versus Degree of Polynomial Kernel.]
Figure 6: Average generalization error versus degree of polynomial kernels (m = 3).
This figure shows that the experimental values are bounded from above by the theoretical line regardless of the number of examples, which confirms the validity of the asymptotic theory. The results also imply that the average generalization error is smaller than the asymptotic theoretical value when the number of examples is not large enough for the asymptotic theory to apply. This is an interesting and important phenomenon to be analyzed in the future. The same tendency can be seen in the case of m = 3, shown in Figure 6, where the true separating polynomial is x_1²/4 + x_2² + x_3²/8 − 1. The x's, o's, and *'s show the experimental values of the average generalization error times the number of examples, where N = 1000, N = 3000, and N = 10,000, respectively. The dashed and solid lines represent the theoretical values given by equations 5.4 and 5.7 multiplied by two-thirds, respectively.

7 Conclusions

The generalization properties of kernel classifiers with a polynomial kernel have been examined. First, it was shown that the generalization error of kernel methods depends not on the dimensions of the input or feature spaces themselves but on how the input space is embedded into the feature space; more specifically, on the class, defined as the dimension of the subspace spanned by the tangent planes at the zero points of the true separating polynomial.
Second, the relationship between the class and the true separating polynomial was discussed. When the true separating polynomial is irreducible (i.e., smaller models cannot separate the inputs correctly), the average generalization error is comparable to that of a dichotomy having an input space of the same dimension as the feature space. When the true separating polynomial is reducible, the average generalization error may be smaller than in the irreducible case, but the class, or bounds on it, remain open except for easy cases such as m = 1. When a smaller model can realize a correct separation, the generalization error is small compared to the dimension of the feature space. The validity of these analyses was confirmed by computer simulations. Let us note finally that the analyses shown above do not assume the fact, important in kernel methods, that the weight vector is a sum of the given input vectors mapped into the feature space. Hence, the results are also applicable to three-layered polynomial neural networks with polynomial neurons of a degree up to p in the hidden layer and a sign neuron in the output layer.
Acknowledgments

This study was supported in part by a Grant-in-Aid for Scientific Research (14084210, 15700130) and by the COE Program (Informatics Research Center for Development of Knowledge Society Infrastructure) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.
References

Aizerman, M. A., Braverman, E. M., & Rozonoer, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821–837.
Amari, S. (1993). A universal theorem on learning curves. Neural Networks, 6, 161–166.
Amari, S., Fujita, N., & Shinomoto, S. (1992). Four types of learning curves. Neural Computation, 4, 605–618.
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5, 140–153.
Anlauf, J. K., & Biehl, M. (1989). The AdaTron: An adaptive perceptron algorithm. Europhysics Letters, 10, 687–692.
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
Dietrich, R., Opper, M., & Sompolinsky, H. (1999). Statistical mechanics of support vector networks. Physical Review Letters, 82(14), 2975–2978.
Friess, T.-T., Cristianini, N., & Campbell, C. (1998). The kernel-adatron algorithm: A fast and simple learning procedure for support vector machines. In Proc. Int'l Conf. Machine Learning. San Francisco: Morgan Kaufmann.
Györgyi, G., & Tishby, N. (1990). Statistical theory of learning a rule. In K. Thuemann & R. Koeberle (Eds.), Neural networks and spin glasses (pp. 31–36). Singapore: World Scientific.
Herbrich, R. (2002). Learning kernel classifiers: Theory and algorithms. Cambridge, MA: MIT Press.
Ikeda, K. (2003). Generalization error analysis for polynomial kernel methods—algebraic geometrical approach. In O. Kaynak, E. Alpaydin, E. Oja, & L. Xu (Eds.), Artificial neural networks and neural information processing—ICANN/ICONIP, LNCS 2714 (pp. 201–208). New York: Springer-Verlag.
Ikeda, K. (forthcoming). Geometry and learning curves of kernel methods with polynomial kernels. Systems and Computers in Japan.
Ikeda, K., & Amari, S. (1996). Geometry of admissible parameter region in neural learning. IEICE Trans. Fundamentals, E79-A, 409–414.
Levin, E., Tishby, N., & Solla, S. A. (1990). A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78, 1568–1574.
Mitchell, T. M. (1982). Generalization as search. Artificial Intelligence, 18(2), 203–226.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterions—determining the number of parameters for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Nishimori, H. (2001). Statistical physics of spin glasses and information processing: An introduction. Oxford: Oxford University Press.
Opper, M., & Haussler, D. (1991). Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Proc. Ann. Workshop Comp. Learning Theory, 4 (pp. 75–87). San Francisco: Morgan Kaufmann.
Schölkopf, B., Burges, C., & Smola, A. J. (1998). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Smola, A. J., Bartlett, P. L., Schölkopf, B., & Schuurmans, D. (2000). Advances in large margin classifiers. Cambridge, MA: MIT Press.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280.

Received April 7, 2003; accepted February 4, 2004.
LETTER
Communicated by Eric Baum
A Neural Root Finder of Polynomials Based on Root Moments

De-Shuang Huang
[email protected] Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China, and AIMtech Center, Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
Horace H.S. Ip
[email protected] Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031, China
Zheru Chi
[email protected] Center for Multimedia Signal Processing, Hong Kong Polytechnic University, Hong Kong
This letter proposes a novel neural root finder (NRF) based on the root moment method (RMM) to find the arbitrary roots (including complex ones) of arbitrary polynomials. This NRF was designed based on feedforward neural networks (FNN) and trained with a constrained learning algorithm (CLA). Specifically, we have incorporated a priori information about the root moments of polynomials into the conventional backpropagation algorithm (BPA) to construct a new CLA. The resulting NRF is shown to be able to rapidly estimate the distributions of the roots of polynomials. We study and compare the advantages of the RMM-based NRF over the previous root coefficient method-based NRF, the traditional Muller and Laguerre methods, and the Mathematica Roots function, as well as the behaviors, estimated root accuracies, and training speeds of two specific structures of this FNN root finder: the log Σ and the Σ–Π FNNs. We also analyze, theoretically and experimentally, the effects of the three controlling parameters {δP_0, θ_p, η} of the CLA on the two NRFs. Finally, we present computer simulation results to support our claims.

Neural Computation 16, 1721–1762 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

There exist many root-finding problems in practical applications, such as filter design, image fitting, speech processing, and encoding and decoding in communication (Aliphas, Narayan, & Peterson, 1983; Hoteit, 2000;
Schmidt & Rabiner, 1977; Steiglitz & Dickinson, 1982; Thomas, Arani, & Honary, 1997). Although these problems could be solved using many traditional root-finding methods, high accuracy and fast processing speed are very difficult to achieve simultaneously, due to the trade-offs between speed and accuracy in the design of existing root finders (Hoteit, 2000; Thomas et al., 1997; Lang & Frenzel, 1994). Specifically, many traditional root-finding methods need to obtain the initial root distributions before iterating. Moreover, it is well known that the traditional methods can find the roots only one by one, that is, by the deflation method: the next root is obtained from the deflated polynomial after the former root is found (William, Saul, William, & Brian, 1992). This means that the traditional root-finding algorithms are inherently sequential, and increasing the number of processors will not increase the speed of finding the solutions. In addition, root finding by deflation cannot guarantee the accuracy of the roots estimated from the reduced polynomials, which is greatly affected. Recently, we showed that feedforward neural networks (FNN) can be formulated to find the roots of polynomials (Huang, 2000; Huang, Ip, Chi, & Wong, 2003). The advantage of the neural approach is that it provides more flexible structures and suitable learning algorithms for root-finding problems. More importantly, the neural root finder (NRF) presented in Huang (2000) and Huang et al. (2003) exploits the parallel structure of neural computing to obtain all roots simultaneously, resulting in a very efficient solution to root-finding problems. Obviously, if the neural root-finding approach can be cast into optimized software or functions, and in particular if a new neural parallel processing-based hardware system with many interconnecting processing nodes like the brain is developed in the future, the designed NRFs will without question surpass traditional non-NRFs in speed and accuracy.
Hence, such a novel approach to the root-finding problem is a vitally important research topic in the field of neural and intelligent computation. Briefly, the idea of using FNNs to find the roots of polynomials is to factorize the polynomial into many subfactors on the outputs of the hidden layer of the network. The connection weights (i.e., the roots) from the input layer to the hidden layer are then trained by a suitable learning algorithm until the defined output error between the actual output and the desired output (the polynomial) converges to a given error accuracy. The converged connection weights obtained in this way are the roots of the underlying polynomial. For instance, consider an n-order arbitrary polynomial f(x),

f(x) = a_0 x^n + a_1 x^{n−1} + ⋯ + a_{n−1} x + a_n,   (1.1)

where n ≥ 2 and a_0 ≠ 0. Without loss of generality, the coefficient a_0 of x^n is usually set to 1. Suppose that there exist n approximate roots (real or
complex) of equation 1.1; then equation 1.1 can be factorized as follows:

f(x) = x^n + a_1 x^{n−1} + ⋯ + a_{n−1} x + a_n ≈ ∏_{i=1}^{n} (x − w_i),   (1.2)
where w_i is the ith root of f(x). To design a neural network (i.e., feedforward neural network) model for finding the roots of polynomials, we expand the corresponding polynomial into an FNN and use the FNN to express or approximate the polynomial so that the hidden-layer weights of the FNN represent the coefficients of the individual linear monomial factors such as 1 or x. As a result, a two-layered FNN model for finding the roots of polynomials can be constructed, as shown in Figure 1. This model, which is somewhat similar to the Σ–Π neural network structure (Hormis, Antonion, & Mentzelopoulou, 1995), is in essence a one-layer linear network extended by a sum-product (Σ–Π) unit. The network has two input nodes corresponding to the terms 1 and x, n hidden nodes forming the differences between the input x and the connection weights w_i, and one output product node that performs the multiplication of the outputs of the n hidden nodes. Only the weights between the input node clamped at value 1 and the hidden nodes need to be trained. The weights between the input x and the hidden nodes, and those between the hidden nodes and the output node, are always fixed at 1. In mathematical terminology,
[Network diagram: inputs 1 and x; hidden-layer weights w_1, …, w_n.]
Figure 1: Feedforward neural network architecture for finding the roots of polynomials.
the output corresponding to the ith hidden node can be represented as

y_i = x − w_i · 1 = x − w_i,   (1.3)
where w_i (i = 1, 2, …, n) are the network weights (the roots of the polynomial). The output of the output layer, performing multiplication on the outputs of the hidden layer, can be written as

ŷ(x) = ∏_{i=1}^{n} y_i = ∏_{i=1}^{n} (x − w_i).   (1.4)
The outer-supervised signal defined at the output of this network model is the polynomial f(x). Here, this NRF structure is referred to as the Σ–Π model. In fact, equation 1.4 can be computed by means of the logarithmic operator. Thus, we obtain the following equation:

ȳ(x) = ln |ŷ(x)| = Σ_{i=1}^{n} ln |x − w_i|.   (1.5)
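The two network outputs, equations 1.4 and 1.5, can be sketched as follows (a minimal illustration assuming NumPy; the function names are ours):

```python
import numpy as np

def sigma_pi_output(x, w):
    """Eq. 1.4: the Sigma-Pi model output, prod_i (x - w_i)."""
    return np.prod(x - w)

def log_sigma_output(x, w):
    """Eq. 1.5: the log-Sigma model output, sum_i ln|x - w_i|."""
    return np.sum(np.log(np.abs(x - w)))

# With the weights set to the true roots, the Sigma-Pi output reproduces
# the polynomial value: f(x) = (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6.
w = np.array([1.0, 2.0, 3.0])
x = 0.5
f_x = x**3 - 6 * x**2 + 11 * x - 6
print(np.isclose(sigma_pi_output(x, w), f_x))                # True
print(np.isclose(log_sigma_output(x, w), np.log(abs(f_x))))  # True
```

During training, w holds the current root estimates rather than the true roots, and the discrepancy between the network output and f(x) drives the weight updates.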
As a result, another NRF model, which is structurally similar to the factorization network model proposed by Perantonis, Ampazis, Varoufakis, and Antoniou (1998), can be easily derived. In this model, the resulting hidden nodes, which originally computed the linear differences (in other words, make the summation) between the input x and the connection weight w in the 6 ¡ 5 model, become nonlinear one with a logarithmic operator activation function, and the resulting output node performs a linear summation instead of multiplication. Therefore, this resulting NRF structure is referred to as the log 6 model (Huang, 2000). For such constructed NRFs, the corresponding training algorithm is generally the traditional backpropagation algorithm (BPA) with gradient-descent type, which has a very slow training speed and often leads to unsatisfactory solutions when training the network unless a suitably chosen set of initialized connection weights is given in advance (Huang, 2000). Although the initializing connection weights can be derived from separating different roots of polynomials by means of polynomial theory, it will still take a long time to achieve the upper and lower bounds of the roots of polynomials, especially for the higher-order polynomials (Huang, 2000; Huang & Chi, 2001). In order to alleviate this difculty, we adopt the idea of a constrained learning algorithm (CLA) rst proposed by Karras and Perantonis (1995) and incorporate the a priori information about the relationship between the roots (the connection weights) and the coefcients of a polynomial into the BPA. This facilitates the learning process and leads to better solutions. As a result, we have developed a CLA for nding the roots of a polynomial (Huang & Chi, 2001), which did speed up the convergent process and
improve the accuracy of finding the roots of polynomials with respect to previous results (Huang, 2000). Specifically, the key point to stress is that this CLA is not sensitive to the initial weights (roots). It was also observed that a large number of the computations in this CLA are spent on computing the constrained relation between the roots and the coefficients of the polynomial, which is of computational complexity of order O(2^n), and the corresponding derivatives. This problem becomes more serious for very high-order polynomials. Therefore, a method that requires fewer computations of the constrained relation between roots and coefficients will significantly reduce training time. In light of the fundamental idea of CLAs, a constrained learning method based on the constrained relation between the root moments (Stathaki & Constantinides, 1995; Stathaki, 1998; Stathaki & Fotinopoulos, 2001) and the coefficients of the polynomial can be constructed to solve this problem. Indeed, we have proved that the corresponding computational complexity of the root moment method (RMM), which is of order O(nm³), is much lower than that of the root coefficient method (RCM) (Huang, Chi, & Siu, 2000). As a result, we consider that the roots of high-order polynomials can be found by this approach. In this article, we present the constrained relation between the root moments and the coefficients of a polynomial, derive the corresponding CLA, and compare the computational complexities of the RMM-based NRF (RMM-NRF) with the RCM-based NRF (RCM-NRF). It has been found that the neural methods have a significant advantage over nonneural methods like Muller and Laguerre in the Mathematica Roots function, and that the RMM-NRF is significantly faster than the RCM-NRF. We then focus on discussing how to apply this new RMM-NRF to find the arbitrary roots of arbitrary polynomials. We discuss and compare the performance of two specific structures of the NRFs: the Σ–Π and log Σ models.
It was found that the total performance of the log Σ model is better than that of the Σ–Π model. Moreover, it was also found that there are seldom local minima on the error surface of the NRFs if all the a priori information from the polynomials has been appropriately encoded into the CLA. In addition, to apply this CLA to practical problems more conveniently, we investigate the effects of the three controlling parameters of the CLA on the performance of the two NRFs. The simulation results show that the training speed improves as the values of the three controlling parameters increase, while the estimated root accuracies and variances remain almost unchanged. Section 2 of this letter presents the constrained relation between the root moments and the coefficients of polynomials, and derives the corresponding constrained learning algorithm based on the root moments for finding the roots of polynomials. In section 3, the computational complexities of the two NRFs, the RMM and the RCM, as well as of traditional methods, are discussed, and the performance of the neural root-finding method is compared to that of the nonneural methods. Experimental results are reported and discussed in section 4. Section 5 presents some concluding remarks.

2 Complex Constrained Learning Algorithm Based on Root Moments of a Polynomial

2.1 The Root Moments of a Polynomial. We have discussed the fundamental constrained relationship, referred to as the root coefficient relation, between the (real or complex) roots and the coefficients of polynomials, and formulated this relation into the conventional BPA to get the corresponding CLA for finding the roots of polynomials (Huang & Chi, 2001; Huang et al., 2003). There exists another important, well-known constrained relationship, referred to as the root moment relation, between the roots and the root moments of a polynomial, first formulated by Sir Isaac Newton; the resulting relationships are known as the Newton identities (Stathaki & Constantinides, 1995; Stathaki, 1998; Stathaki & Fotinopoulos, 2001). The root moment of a polynomial is defined as follows:

Definition 1. For an n-order polynomial described by equation 1.1, assume that the corresponding n roots are w_1, w_2, …, w_n. Then the m-order (m ∈ Z) root moment of the polynomial is defined as

S_m = w_1^m + w_2^m + ⋯ + w_n^m = Σ_{i=1}^{n} w_i^m.   (2.1)
Obviously, Sm ’s are possibly complex numbers that depend on wi ’s. Furm thermore, there hold S0 D n and dS D mwm¡1 . i dwi According to this denition of root moment, a recursive relationship between the m-order root moment and the coefcients of polynomial can be obtained as follows. First, for the case of m ¸ 0, we have (Stathaki, 1998): 8 S1 C a 1 D 0 > > > > > <S2 C a1 S1 C 2a2 D 0 :: (2.2) : > > > > C C ¢ ¢ ¢ C D 0; · S a S ma .m n/ 1 m¡1 m > : m Sm C a1 Sm¡1 C ¢ ¢ ¢ C an Sm¡n D 0; .m > n/: Second, for the case of m < 0, we have: 8 > >an S¡1 C an¡1 D 0 > > > > > > a S C an¡1 SmC1 C ¢ ¢ ¢ C jmjanCm D 0; > : n m an Sm C an¡1 SmC1 C ¢ ¢ ¢ C SmCn D 0;
(2.3) .m ¸ ¡n; a0 D 1/ .m < ¡n/
In fact, it can be proven that the two following formulas hold:

S_m + a_1 S_{m−1} + ⋯ + a_n S_{m−n} = 0,   (m > n)   (2a.2)

and

a_n S_m + a_{n−1} S_{m+1} + ⋯ + S_{m+n} = 0,   (m < −n).   (2a.3)

Consequently, for |m| > n (outside the polynomial coefficients window), we obtain a unified recursive relation as follows:

S_m + a_1 S_{m−1} + ⋯ + a_n S_{m−n} = 0,   (|m| > n).   (2.4)
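The forward recursion 2.2 can be checked numerically against the power sums of the roots (a minimal sketch assuming NumPy; `root_moments` is our name):

```python
import numpy as np

def root_moments(a, n_moments):
    """Root moments S_1..S_{n_moments} of a monic polynomial with
    coefficients a = [1, a_1, ..., a_n], computed via eq. 2.2:
    S_m = -m*a_m - sum_{k=1}^{m-1} a_k S_{m-k}  (m <= n),
    S_m = -sum_{k=1}^{n} a_k S_{m-k}            (m > n)."""
    n = len(a) - 1
    S = []
    for m in range(1, n_moments + 1):
        s = -m * a[m] if m <= n else 0.0
        for k in range(1, min(m, n) + 1):
            if k < m:
                s -= a[k] * S[m - k - 1]
        S.append(s)
    return S

a = [1.0, -6.0, 11.0, -6.0]      # (x - 1)(x - 2)(x - 3)
S = root_moments(a, 4)
w = np.roots(a)                   # roots computed independently
power_sums = [np.sum(w ** m).real for m in range(1, 5)]
print(S)                          # [6.0, 14.0, 36.0, 98.0]
print(np.allclose(S, power_sums))  # True
```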
The recursive relationships in equations 2.2 and 2.3 are named the Newton identities. From these Newton identities, we obtain the following theorem:

Theorem 1. Suppose that an n-order polynomial described by equation 1.1 is known. Then a set of parameters (root moments) {S_m, m = 1, 2, …, n} can be uniquely determined recursively by equation 2.2. Conversely, given n root moments {S_m, m = 1, 2, …, n}, an n-order polynomial described by equation 1.1 can be uniquely determined recursively by equation 2.2.

For the case of m < 0, a similar theorem can be stated as follows:

Theorem 2. Suppose that an n-order polynomial described by equation 1.1 is known. Then a set of parameters (root moments) {S_m, m = −1, −2, …, −n} can be uniquely determined recursively by equation 2.3. Conversely, given n root moments {S_m, m = −1, −2, …, −n}, an n-order polynomial described by equation 1.1 can be uniquely determined recursively by equation 2.3.

In fact, we can also derive the root moment from the Cauchy residue theorem, as

S_m = (1/2πj) ∮_Γ Σ_{i=1}^{n} [z^m/(z − r_i)] dz,   (2.5)
where Γ is a closed contour, defined as z = ρ(θ)e^{jθ}, that contains the roots of the required factor of f(z). From equation 1.1, we have
$$ f'(z) = \sum_{i=1}^{n} \frac{f(z)}{z - r_i}. \tag{2.6} $$
1728
D.-S. Huang, H. Ip. and Z. Chi
Hence, equation 2.5 can be rewritten as
$$ S_m = \frac{1}{2\pi j} \oint_{\Gamma} \frac{f'(z)}{f(z)}\, z^m\, dz. \tag{2.7} $$
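Equation 2.7 can be evaluated numerically by discretizing a circle that encloses the roots; for such periodic integrands the trapezoidal rule converges very rapidly. A sketch assuming NumPy, with illustrative roots (not from the article):

```python
import numpy as np

def contour_root_moment(coeffs, m, radius, K=4096):
    """Approximate S_m = (1/2*pi*j) * integral of f'(z)/f(z) z^m dz over |z| = radius,
    by the trapezoidal rule. coeffs lists the monic polynomial, highest power first."""
    theta = 2 * np.pi * np.arange(K) / K
    z = radius * np.exp(1j * theta)
    f = np.polyval(coeffs, z)
    df = np.polyval(np.polyder(coeffs), z)
    # dz = j z dtheta, so the factor 1/(2*pi*j) reduces the sum to a plain average
    return np.sum(df / f * z ** (m + 1)) / K

coeffs = np.poly([2.0, 0.5 + 0.5j, -0.3])     # hypothetical roots, all inside |z| = 3
for m in range(1, 5):
    exact = sum(r ** m for r in [2.0, 0.5 + 0.5j, -0.3])
    assert abs(contour_root_moment(coeffs, m, radius=3.0) - exact) < 1e-8
```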
From the above definition of the root moment, we have the following corollary:

Corollary 1. The root moments of the product f(z) = f_1(z) f_2(z) and the ratio f(z) = f_1(z)/f_2(z) can be respectively derived as
$$ S_m^{f(z)} = S_m^{f_1(z)} + S_m^{f_2(z)} \tag{2.8} $$
$$ S_m^{f(z)} = S_m^{f_1(z)} - S_m^{f_2(z)}. \tag{2.9} $$
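The product rule of equation 2.8 is easy to verify numerically, for example by comparing the moments of a product polynomial (coefficients obtained by convolution) with the sum of the factors' moments; the two factor polynomials below are illustrative:

```python
import numpy as np

def moments(coeffs, M):
    """S_m = sum of m-th powers of the roots, via numpy.roots (monic, highest first)."""
    r = np.roots(coeffs)
    return np.array([np.sum(r ** m) for m in range(1, M + 1)])

p1 = np.array([1.0, -0.3 + 0.4j, 0.8])        # hypothetical f1
p2 = np.array([1.0, 1.1, -0.2j])              # hypothetical f2
prod = np.convolve(p1, p2)                    # coefficients of f1 * f2
assert np.allclose(moments(prod, 5), moments(p1, 5) + moments(p2, 5))
```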
Moreover, we can derive another corollary about the asymptotic behavior of the root moments as the order m becomes sufficiently large or small:

Corollary 2. Assume that r_max and r_min are respectively the maximum- and minimum-modulus roots of f(z). Then
$$ \lim_{m \to \infty} S_m = r_{\max}^m \tag{2.10} $$
$$ \lim_{m \to -\infty} S_m = r_{\min}^m \tag{2.11} $$
That is, when m is sufficiently large, the root r_max with the maximum modulus dominates the sum S_m, while the root r_min with the minimum modulus dominates S_m when m is sufficiently small (large and negative). In addition, it is easy to derive another corollary about the reciprocals of the roots:

Corollary 3. If a new polynomial is defined as g(z) = z^n f(z^{−1}), then its roots are the reciprocals of those of the original polynomial f(z).

These corollaries and conclusions form the basis for using the root moments of polynomials to find their roots. In the following, we derive the corresponding complex CLA, based on the constraint relations that the root moments impose on a polynomial, to train the NRFs for finding the arbitrary roots of arbitrary polynomials.

2.2 Complex Constrained Learning Algorithm for Finding the Arbitrary Roots of an Arbitrary Polynomial. From Huang (2000), Huang et al. (2000), and Huang et al. (2003), it can be seen that the root-finding method based on the CLA is considerably faster than the simple BP learning rule, which relies on a careful selection of initial synaptic weights. Therefore, in the following, we deduce this CLA and extend it to a more general complex version.
Suppose that there are P training patterns selected from the complex region |x| < 1. An error cost function (ECF) is defined at the output of the FNN root finder,
$$ E(w) = \frac{1}{2P} \sum_{p=1}^{P} |e_p(w)|^2 = \frac{1}{2P} \sum_{p=1}^{P} |o_p - y_p|^2, \tag{2.12} $$
where w is the set of all connection weights in the network model; o_p = f(x_p) or ln f(x_p) denotes the target (outer-supervised) signal to be rooted; y_p = Π_{i=1}^{n} (x_p − w_i) or Σ_{i=1}^{n} ln|x_p − w_i| represents the actual output of the network; and p = 1, 2, ..., P is an index labeling the training patterns. The above variables and functions are all possibly complex, so in the following derivations the rules for functions of complex variables must be observed. An arbitrary complex variable is written z = x + iy, where x and y are the real and imaginary parts of z, and i = √−1. For convenience in deducing the learning algorithm, we give a definition of the derivative of a real-valued function with respect to a complex variable:

Definition 2. Assume that a real-valued function U(w) is a function of the complex variable w with real and imaginary parts w_1 and w_2. Then the derivative of U(w) with respect to w is defined as ∂U(w)/∂w = ∂U(w)/∂w_1 + i ∂U(w)/∂w_2.

Consequently, the BPA based on gradient descent can easily be deduced from E(w). Taking the partial derivative of E(w) with respect to w_i, we obtain
$$ J_i = \frac{\partial E(w)}{\partial w_i} = \frac{1}{P} \sum_{p=1}^{P} e_p(w) \cdot \prod_{j \ne i} (x_p - w_j) \quad\text{or}\quad \frac{1}{P} \sum_{p=1}^{P} e_p(w) \cdot \frac{x_p - w_i}{|x_p - w_i|^2}. \tag{2.13} $$
As a result, the BPA based on gradient descent is described by
$$ dw_i = -\eta J_i, \tag{2.14} $$
where dw_i = w_i(k) − w_i(k − 1) denotes the difference between the current weight w_i(k) and the past weight w_i(k − 1). The constraint conditions formulated above from the root moments defined in equations 2.2 to 2.4 can be written uniformly as
$$ \Phi = 0, \tag{2.15} $$
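For concreteness, the plain BPA of equations 2.12 to 2.14 for the Σ−Π model can be sketched as follows. This is a minimal NumPy illustration, not the article's Fortran implementation: the cubic, the learning rate, and the near-solution initialization are chosen for a quick local demonstration, and J_i carries the conjugate (Wirtinger) factor needed so that dw_i = −η J_i descends the real-valued ECF:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical cubic whose roots the Sigma-Pi model should recover
true_roots = np.array([0.9, -0.5 + 0.6j, 0.2 - 0.7j])
coeffs = np.poly(true_roots)

P = 100
x = rng.uniform(-1, 1, P) + 1j * rng.uniform(-1, 1, P)   # training inputs x_p
o = np.polyval(coeffs, x)                                # target signal o_p = f(x_p)

# initialize the weights (roots) near the solution for a quick local demonstration
w = true_roots + 0.1 * (rng.standard_normal(3) + 1j * rng.standard_normal(3))
eta = 0.05
for _ in range(5000):
    diffs = x[:, None] - w[None, :]                      # (P, n) array of x_p - w_i
    e = o - np.prod(diffs, axis=1)                       # e_p = o_p - y_p
    # J_i of equation 2.13, in conjugated (Wirtinger) form for gradient descent
    prods = np.array([np.prod(np.delete(diffs, i, axis=1), axis=1)
                      for i in range(3)]).T              # prod_{j != i} (x_p - w_j)
    J = (e[:, None] * np.conj(prods)).mean(axis=0)
    w = w - eta * J                                      # dw_i = -eta * J_i

assert np.allclose(w, true_roots, atol=1e-3)
```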
where Φ = [Φ_1, Φ_2, ..., Φ_m]^T (m ≤ n; T denotes the transpose of a vector or matrix) is a vector composed of the constraint conditions of any one of equations 2.2 to 2.4. Because the ECF possibly contains many long, narrow troughs (Hush, Horne, & Salas, 1992), a constraint is imposed on the updated connection weights in order to avoid missing the global minimum. Consequently, the sum of the squared absolute values of the individual weight changes is required to take a predetermined positive value (δP)² (δP > 0 is a constant):
$$ \sum_{i=1}^{n} |dw_i|^2 = (\delta P)^2. \tag{2.16} $$
This means that at each epoch, the search for an optimal new point in weight space is restricted to a small hypersphere with radius δP centered at the point defined by the current weight vector. If δP is small enough, the changes to E(w) and to Φ induced by changes in the weights can be approximated by the first differentials, dE(w) and dΦ, respectively. In order to derive the corresponding CLA based on the constraint conditions of equations 2.2 or 2.3 and equation 2.16, assume that dΦ is equal to a predetermined vector quantity δQ, designed to bring Φ closer to its target (zero). The objective of the learning process is to ensure that the maximum possible change in |dE(w)| is achieved at each epoch. Usually the maximization of |dE(w)| can be carried out analytically by introducing suitable Lagrange multipliers. Thus, a vector V = [v_1, v_2, ..., v_m]^T of Lagrange multipliers is needed to take into account the constraints in equation 2.15, and another Lagrange multiplier μ is introduced for equation 2.16. Introducing the function ε, dε can be expanded as follows:
$$ d\varepsilon = \sum_{i=1}^{n} J_i\, dw_i + \left(\delta Q^{H} - \sum_{i=1}^{n} dw_i\, F_i^{H}\right) V + \mu \left[ (\delta P)^2 - \sum_{i=1}^{n} |dw_i|^2 \right], \tag{2.17} $$
where F_i = [F_i^{(1)}, F_i^{(2)}, ..., F_i^{(m)}]^T with F_i^{(j)} = ∂Φ_j/∂w_i (i = 1, 2, ..., n; j = 1, 2, ..., m), and m ≤ n denotes the number of constraint conditions in equation 2.15. Further, to maximize |dε| (in fact, minimize dε) at each epoch, we demand that
$$ d^2\varepsilon = \sum_{i=1}^{n} \left(J_i - F_i^{H} V - 2\mu\, dw_i\right) d^2 w_i = 0 \tag{2.18} $$
$$ d^3\varepsilon = -2\mu \sum_{i=1}^{n} (d^2 w_i)^2 < 0. \tag{2.19} $$
As a result, the coefficients of d²w_i in equation 2.18 should vanish; that is,
$$ dw_i = \frac{J_i}{2\mu} - \frac{F_i^{H} V}{2\mu}, \tag{2.20} $$
where the values of the Lagrange multipliers μ and V can be readily evaluated from equations 2.16 and 2.20 together with the condition (δQ^H − Σ_{i=1}^{n} dw_i F_i^H)V = 0 embodied in equation 2.15, with the following results:
$$ \mu = -\frac{1}{2} \left[ \frac{I_{JJ} - I_{JF}^{H}\, I_{FF}^{-1}\, I_{JF}}{(\delta P)^2 - \delta Q^{H}\, I_{FF}^{-1}\, \delta Q} \right]^{1/2} \tag{2.21} $$
$$ V = -2\mu\, I_{FF}^{-1}\, \delta Q + I_{FF}^{-1}\, I_{JF}, \tag{2.22} $$
where I_JJ = Σ_{i=1}^{n} |J_i|² is a scalar and I_JF is a vector whose components are defined by
$$ I_{JF}^{(j)} = \sum_{i=1}^{n} J_i F_i^{(j)}, \quad j = 1, 2, \ldots, m. \tag{2.23} $$
Specifically, I_FF is a matrix whose elements are defined by I_FF^{jk} = Σ_{i=1}^{n} F_i^{(j)} F_i^{(k)} (j, k = 1, 2, ..., m). Obviously, there are (m + 1) parameters, δP and δQ_j (j = 1, 2, ..., m), that need to be set before the learning process begins. The parameter δP is often selected adaptively as (Huang, 2001)
$$ \delta P(t) = \delta P_0 \left( 1 - e^{-\theta_p / t} \right), \quad \delta P(0) = \delta P_0, \tag{2.24} $$
where δP_0 is the initial value of δP, usually chosen to be fairly large; t > 0 is the time index of training; and θ_p is the scale coefficient of the time index t, usually set as θ_p > 1. The vector parameters δQ_j (j = 1, 2, ..., m) are generally selected proportional to Φ_j, that is, δQ_j = −kΦ_j (j = 1, 2, ..., m and k > 0), which ensures that the constraints Φ move toward zero at an exponential rate as training progresses. From equation 2.19, we note that k should satisfy k ≤ δP (Φ^H I_FF^{−1} Φ)^{−1/2}. In practice, the simplest choice is k = ηδP / (Φ^H I_FF^{−1} Φ)^{1/2}, where 0 < η < 1 is a further free parameter of the algorithm besides δP.

3 Computational Complexity Estimates of the RMM-NRF and the RCM-NRF

In this section, we compare the computational complexity of our proposed root-moment-based (RM-based) root-finding method with that of the original root-coefficient-based (RC-based) root-finding method, as well as with traditional root finders such as Muller's and Laguerre's methods (William et al., 1992; Anthony & Philip, 1978), and show theoretically and experimentally that the RMM-NRF is significantly faster than the RCM-NRF and those traditional root finders.

It can be seen from the CLA that at each iterative epoch, we have to compute the values of the constraint conditions Φ and their derivatives, ∂Φ/∂w, which dominate all CLA computations (multiplication and division operations). In order to estimate the number of multiplications and divisions for the RMM, we can, from the constraint conditions of equation 2.2, obtain ∂Φ/∂w = [F^{(1)}, F^{(2)}, ..., F^{(n)}]^T as follows:
$$
\begin{cases}
F^{(1)} = [1, 1, \ldots, 1]^T\\
F^{(2)} = [2w_1 + a_1,\ 2w_2 + a_1,\ \ldots,\ 2w_n + a_1]^T\\
\quad\vdots\\
F^{(n)} = [n w_1^{n-1} + (n-1)a_1 w_1^{n-2} + \cdots + a_{n-1},\ \ldots,\ n w_n^{n-1} + (n-1)a_1 w_n^{n-2} + \cdots + a_{n-1}]^T.
\end{cases}
\tag{3.1}
$$
Consequently, we can estimate the number of multiplication operations needed at each epoch to compute the constraint conditions Φ of equation 2.2 and their derivatives ∂Φ/∂w, as stated in the following remark (Huang et al., 2000; Huang, 2004):¹

Remark 1. At each epoch of finding the complex roots of a given arbitrary polynomial f(x) of order n by the constrained learning neural network, using the first m constraint conditions of equation 2.2, the estimated number of multiplication operations needed to compute Φ and ∂Φ/∂w, CE_RMM(n, m), is
$$ CE_{RMM}(n, m) = \frac{2}{3}(m - 1)\left(nm^2 + 10nm + 3m - 12n + 6\right). \tag{3.2} $$
Obviously, CE_RMM(n, m) is of order O(nm³) multiplication operations. Specifically, when m = n, equation 3.2 becomes CE_RMM(n, n) = (2/3)(n − 1)(n³ + 10n² − 9n + 6).

¹ Here we consider the most general case of complex computations, which require four times as many multiplications as real ones.

In addition, for the RCM, we can obtain ∂Φ/∂w = [F^{(1)}, F^{(2)}, ..., F^{(m)}]^T from the constraint conditions relating the roots and the coefficients of the polynomial (see Huang, 2000) as follows:
$$
\begin{cases}
F^{(1)} = [1, 1, \ldots, 1]^T\\[2pt]
F^{(2)} = \Big[\sum_{i \ne 1}^{n} w_i,\ \sum_{i \ne 2}^{n} w_i,\ \ldots,\ \sum_{i \ne n}^{n} w_i\Big]^T\\
\quad\vdots\\
F^{(m)} = \Big[\prod_{i \ne 1}^{m} w_i,\ \prod_{i \ne 2}^{m} w_i,\ \ldots,\ \prod_{i \ne n}^{m} w_i\Big]^T.
\end{cases}
\tag{3.3}
$$
As a result, the estimated number of multiplication operations at each epoch for a given polynomial f(x) of order n is stated as follows (Huang et al., 2000; Huang, in press):

Remark 2. At each epoch of finding the complex roots of a given arbitrary polynomial f(x) of order n by the constrained learning neural network, using the first m constraint conditions of equation 3.3, the estimated number of multiplication operations needed to compute Φ and ∂Φ/∂w, CE_RCM(n, m), is
$$ CE_{RCM}(n, m) = 4\left( \sum_{j=0}^{m} j^2 C_n^j - \sum_{j=0}^{m} j C_n^j - \sum_{j=0}^{m} C_n^j + n + 1 \right). \tag{3.4} $$
Obviously, CE_RCM(n, m) is of order O(C_n^m m²) multiplication operations, which is considerably higher than that of the RMM. Specifically, when m = n, equation 3.4 becomes CE_RCM(n, n) = 4[(n² − n − 4)2^{n−2} + n + 1], which is of order O(2^n) multiplication operations. If we define the ratio r_c = CE_RCM(n, n)/CE_RMM(n, n), then it can readily be deduced that
$$ \lim_{n \to \infty} r_c = \lim_{n \to \infty} \frac{CE_{RCM}(n, n)}{CE_{RMM}(n, n)} = \infty. \tag{3.5} $$
Table 1 compares the computational complexities of the RCM-NRF and the RMM-NRF as the polynomial order n grows. From the results, it is easily seen that the RMM-NRF has the lower computational cost as n increases; when n is small (e.g., n ≤ 5), the conclusion is the opposite. Therefore, it is for higher-order polynomials that the RMM-NRF shows its strong potential in computational complexity. In addition, the computational complexities of traditional root finders such as Muller's and Laguerre's methods (Mourrain, Pan, & Ruatta, 2003; Milovanovic & Petkovic, 1986) are generally of order O(3^n),
Table 1: Comparisons of the Computational Complexities of the Original RCM and the RMM.

n           1    2    3    4    5    6     7     8      9      10     11      12
CE_RCM(n)   0    4    32   148  536  1692  4896  13,348 34,856 88,108 217,136 524,340
CE_RMM(n)   0    24   128  388  896  1760  3104  5,068  7,808  11,496 16,320  22,484
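Equations 3.2 and 3.4 can be checked directly against Table 1. A short Python sketch (using `math.comb` for the binomial coefficients C_n^j) reproduces every entry with m = n:

```python
from math import comb

def ce_rmm(n, m):
    # equation 3.2; (m-1)(nm^2 + 10nm + 3m - 12n + 6) is always divisible by 3
    return 2 * (m - 1) * (n * m**2 + 10 * n * m + 3 * m - 12 * n + 6) // 3

def ce_rcm(n, m):
    # equation 3.4
    s2 = sum(j * j * comb(n, j) for j in range(m + 1))
    s1 = sum(j * comb(n, j) for j in range(m + 1))
    s0 = sum(comb(n, j) for j in range(m + 1))
    return 4 * (s2 - s1 - s0 + n + 1)

# reproduce Table 1 (m = n)
assert [ce_rcm(n, n) for n in range(1, 13)] == [
    0, 4, 32, 148, 536, 1692, 4896, 13348, 34856, 88108, 217136, 524340]
assert [ce_rmm(n, n) for n in range(1, 13)] == [
    0, 24, 128, 388, 896, 1760, 3104, 5068, 7808, 11496, 16320, 22484]
```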
while the computational complexity of the fastest methods, like Jenkins–Traub, is only of order O(n⁴). These are substantially higher than our proposed RMM, especially our derived recursive RMM (see Huang, 2004).

To further verify the correctness of these theoretical analyses, we take a seven-order polynomial, f(x) = x⁷ + (1.2 − 0.5i)x⁶ + (−6.5 + 1.4i)x⁴ + 2.93i x² + (1.7 + 2.4i)x + 2.4 − 0.3i, as an example to compare the performance of the RCM-NRF and the RMM-NRF. Here, we consider only the log Σ model. Assume that the controlling parameters of the CLA for both methods are selected as δP_0 = 2.0, θ_p = 5.0, and η = 0.6, and let the termination error be e_r = 1.0 × 10⁻⁸. The training sample pairs (x, f(x)) are obtained from the complex domain |x| < 1, with the total number of training samples P fixed at 100. The experiments were run on a Pentium III with a CPU clock of 795 MHz and 256 MB of RAM, with the programs coded in Visual Fortran 90. After the NRFs trained by these two methods converged, it was found that the RCM-NRF and the RMM-NRF take 9196 and 2783 iterations, respectively, with CPU times of 277 seconds and 41 seconds. That is, the training speed of the RMM is almost seven times that of the RCM, which fully supports our theoretical analyses. Figure 2 depicts the learning error curves (LEC) of the two methods, and Figure 3 shows the LEC for the BPA-based NRF with learning coefficient η = 0.1. From Figures 2 and 3, it can be seen that the two CLAs are significantly faster than the BPA, and that the RMM-based CLA is substantially faster than the RCM-based CLA. In addition, we use the same parameters as above, but with termination accuracy e_r = 1.0 × 10⁻²⁵, to compare the performance of the RCM-NRF and the RMM-NRF, as well as the two nonneural methods of Muller and Laguerre.

Here, the important point is that Muller's and Laguerre's methods are also coded in Visual Fortran 90, performing root finding according to formulas 9.5.2–9.5.3 and 9.5.4–9.5.11 in William et al. (1992), respectively. Moreover, to evaluate the different root finders statistically, the experiments were repeated 10 times with different initial weight values drawn from the uniform distribution on [−1, 1] (for the two nonneural methods of Muller and Laguerre, the initial roots are determined by polynomial theory; Huang, 2000, 2004). The average iterating numbers, the average CPU times, and the average estimated variances for the four methods
Figure 2: Learning error curves of the original RCM-NRF and the RMM-NRF for a seven-order polynomial.
Figure 3: Learning error curve for the BPA-based NRF in the case of the learning coefficient η = 0.1 for a seven-order polynomial.
Table 2: Statistical Performance Comparisons of the Original RCM-NRF and the RMM-NRF for a Seven-Order Polynomial f(x).

Method      Average Iterating Numbers   Average CPU Times (seconds)   Average Estimated Variances
RMM         23,242                      972.4                         3.26367E−026
RCM         276,425                     1207.2                        2.32545E−026
Muller      519,374                     2542.2                        1.32443E−026
Laguerre    432,463                     2312.4                        4.23151E−026
Figure 4: Histogram comparisons among the iterating numbers for four root-finding methods.
are shown in Table 2, and the corresponding histograms of the average iterating numbers and the average CPU times are illustrated in Figures 4 and 5, respectively.

Figure 5: Histogram comparisons among the CPU times consumed for four root-finding methods.

Table 2 shows that the RMM-NRF is significantly faster than the RCM-NRF, and that the two NRFs are also faster than the two nonneural methods. Moreover, the two neural approaches achieve higher accuracies and shorter CPU times than the two nonneural methods. This is because Muller's and Laguerre's methods iteratively obtain one root at a time, so that the accuracy of the successively deflated polynomial suffers, which results in lower root accuracies and longer iterating times (including the time spent carefully selecting initial root values).
Comments. Why revisit polynomial root finding when so many numerical methods already exist? Besides the inherent parallelism of neural methods, a further motivation is that neural methods find all roots simultaneously, whereas traditional numerical methods find the roots only sequentially, by deflation, so that each new root is known with only finite accuracy and errors creep into the coefficients of the successively deflated polynomial (William et al., 1992). Hence, the accuracy of a nonneural numerical method is fundamentally limited and cannot surpass that of a neural root finder. In addition, all nonneural methods need good candidate initial root values; otherwise, the designed root finder will not converge (William et al., 1992), which increases computational complexity and consumes additional processing time. In contrast, our proposed neural approaches are instructed by a priori information from the polynomial, imposed on the CLA, so that they need no computed initial root values beyond randomly selecting them from the uniform distribution on [−1, 1].
Considering these points, we focus on the RMM-NRF in presenting and discussing the experimental results.

4 Experimental Results and Discussions

In order to verify the effectiveness and efficiency of our proposed approaches, we conducted several experiments under two common assumptions. First, for each polynomial f(x) involved, the input training sample pairs (x, f(x)) are obtained from the complex domain |x| < 1, and the total number P of input training sample pairs is fixed at 100. Second, the initial weight (root) values of the FNNRF are randomly selected from the uniform distribution on [−1, 1].

4.1 Polynomial with Known Roots. Consider a six-order test polynomial f_1(x) whose complex roots are generated from r_i = 0.9 × exp(jπi/3) (i = 1, 2, ..., 6); their distribution is shown in Figure 6. We use an FNNRF with a 2-6-1 structure to find the roots of this polynomial. Assume that the three controlling parameters of the CLA are chosen as δP_0 = 10.0, θ_p = 10.0, and η = 0.3, which are kept unchanged in this example, and that three termination errors, e_r = 0.1, e_r = 0.01, and e_r = 0.001, are considered in turn. To evaluate the statistical performance of the NRFs, for each case we conducted 30 repeated experiments, choosing different initial connection weights from the uniform distribution on [−1, 1], to
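Since the roots r_i = 0.9 exp(jπi/3) are the sixth roots of unity scaled by 0.9, f_1(x) collapses to x⁶ − 0.9⁶. A quick NumPy cross-check (independent of the NRF itself) confirms this and recovers the roots with a conventional solver:

```python
import numpy as np

# the six known roots r_i = 0.9 * exp(j*pi*i/3), i = 1..6
roots = 0.9 * np.exp(1j * np.pi * np.arange(1, 7) / 3)
f1 = np.poly(roots)                     # coefficients of f1(x), highest power first

# f1 collapses to x^6 - 0.9^6, since the r_i are the scaled sixth roots of unity
assert np.allclose(f1, [1, 0, 0, 0, 0, 0, -0.9**6])

# an eigenvalue-based solver recovers the same six roots
est = np.roots(f1)
assert all(np.abs(est - r).min() < 1e-8 for r in roots)
```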
Figure 6: Roots distribution of f_1(x) in the complex plane.
observe the experimental results for the Σ−Π and log Σ models. After the corresponding FNNRFs were trained by the CLA to converge to the given accuracies, the estimated convergent roots of the two models were plotted; they are depicted in Figures 7 through 9, where the true roots are represented by the larger black dots. Figures 7 to 9 show that the estimated root accuracies become higher and higher, and the scatter of the estimated roots smaller and smaller, as the termination error decreases. Figure 7 shows that for the termination error e_r = 0.1, the accuracy of the Σ−Π model is clearly higher than that of the log Σ model. The reason is that the logarithmic operator used in the log Σ model primarily acts as a transformation, compressing a larger numerical value into a smaller one. Thus, the termination error e_r = 0.1 in the log Σ model is, to a certain extent, equivalent to e_r = e^{0.1} in the Σ−Π model; it is this lower effective training accuracy that results in the poorer estimated root accuracy. The same conclusion holds for the other two termination errors; however, the differences between the two models shrink as the training accuracy increases, that is, as the termination error decreases (see Figures 8 and 9). At the same time, this phenomenon also indirectly shows that the FNNRF based on root moments converges very fast. In addition, assume that the termination error is fixed at e_r = 1.0 × 10⁻⁹.
We repeat the 30 experiments to obtain the average estimated roots (including the average estimated variances), the average iterating numbers, the average CPU times, and the average relative estimate errors (d_r), as shown in Table 3.² From Table 3, it can be seen that, in the statistical sense, the Σ−Π model has higher estimated accuracy than the log Σ model, but the latter has a shorter training time: the average CPU time for the Σ−Π model is 52 seconds longer than that for the log Σ model. The reason is again the compressing transformation of the log Σ model. Figures 10 and 11 show two sets of logarithmic learning root curves of the two models for one of the 30 experiments, where the iterating numbers are plotted logarithmically in order to magnify the starting parts of the curves. In each plot there are 12 curves, representing the real and imaginary parts of the six complex roots produced by the NRF (in fact, the six complex roots can be obtained only by means of the NRF). From Figures 10 and 11, it can be seen that the Σ−Π model yields drastic fluctuations around the true (estimated) root values, which may result in many long,

² The average relative estimate error is defined as
$$ d_r = \frac{1}{K} \cdot \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n} \left| \frac{x_i - \hat{x}_i^{(k)}}{x_i} \right|, $$
where K is the number of repeated experiments, x_i is the ith true (exact) root of the given polynomial, and x̂_i^{(k)} is the kth estimated value of the ith root (i.e., the weight w_i).
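A direct implementation of this error measure might look as follows (a sketch assuming NumPy; the two-experiment example data are hypothetical):

```python
import numpy as np

def relative_estimate_error(true_roots, estimates):
    """d_r of footnote 2: mean over K experiments and n roots of |(x_i - x_hat_i^(k)) / x_i|."""
    x = np.asarray(true_roots)            # shape (n,)
    xhat = np.asarray(estimates)          # shape (K, n); row k = estimates of experiment k
    return float(np.mean(np.abs((x - xhat) / x)))

# two hypothetical repeated experiments on two known roots
d = relative_estimate_error([1.0, 0.5j], [[1.01, 0.5j], [1.0, 0.49j]])
assert abs(d - 0.0075) < 1e-12            # mean of [0.01, 0, 0, 0.02]
```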
Figure 7: Estimated roots distributions of f_1(x) in the complex plane for the Σ−Π and log Σ models in the case of e_r = 0.1. (a) Σ−Π model. (b) log Σ model.
Figure 8: Estimated roots distributions of f_1(x) in the complex plane for the Σ−Π and log Σ models in the case of e_r = 0.01. (a) Σ−Π model. (b) log Σ model.
Figure 9: Estimated roots distributions of f_1(x) in the complex plane for the Σ−Π and log Σ models in the case of e_r = 0.001. (a) Σ−Π model. (b) log Σ model.
Table 3: Statistical Performance Comparison of the Σ−Π– and log Σ–Based FNNRF Models for f_1(x) (δP_0 = 10.0, θ_p = 10.0, η = 0.3).

Indices                                  log Σ Model                      Σ−Π Model
Average estimated roots
  w1                                     (0.8999990, 1.3729134E−06)       (0.8999987, −4.1471349E−06)
  w2                                     (−0.8999991, −1.1557200E−07)     (−0.9000001, −8.5237025E−07)
  w3                                     (0.4499939, 0.7794214)           (0.4499981, 0.7794229)
  w4                                     (−0.4500013, −0.7794217)         (−0.4499988, −0.7794209)
  w5                                     (−0.4499971, 0.7794194)          (−0.4499974, 0.7794240)
  w6                                     (0.4499970, −0.7794210)          (0.4499984, −0.7794222)
Average iterating number                 137,541                          226,690
Average CPU times (seconds)              118.24                           170.33
Average relative estimated error (d_r)   1.133025533350818E−005           8.708582759808792E−006
Average estimated variance               2.706395396765315E−005           2.147191933435669E−005
Figure 10: The 12 learning weights (roots) curves of f_1(x) for the Σ−Π model versus the logarithmic iterating numbers.
Figure 11: The 12 learning weights (roots) curves of f_1(x) for the log Σ model versus the logarithmic iterating numbers.
narrow troughs that are flat in one direction and steep in the others. This phenomenon slows the corresponding training significantly. In contrast, the log Σ model is capable of compressing the input dynamic range, so that it produces a smoother cost-function landscape, avoiding deep valleys and thus facilitating learning. Therefore, for practical problems, the log Σ model is preferred.

4.2 Arbitrary Polynomial with Unknown Roots. To verify the efficiency and effectiveness of the NRF based on the root moments of a polynomial, a nine-order arbitrary polynomial f_2(x) = x⁹ + 2.1i x⁷ + 1.1x⁶ + (1.3 − i)x³ + (1.4 + 0.6i)x + 5.3i is given to test the performance of the NRF. For this problem, an FNNRF with a 2-9-1 structure is constructed to find the roots of the polynomial. Assume that the three controlling parameters of the CLA are chosen as δP_0 = 8.0, θ_p = 5.0, and η = 0.4, which are again kept unchanged. For the three termination errors e_r = 0.1, e_r = 0.01, and e_r = 0.001, we conducted 30 repeated experiments each, choosing different initial connection weights from the uniform distribution on [−1, 1], to evaluate the statistical performance of the two NRFs. After the NRFs converge, the estimated convergent roots of the two models are depicted in the complex plane in Figures 12 to 14; only one root lies inside the unit circle. These three figures show phenomena similar to those observed in the previous experiments.
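The claim that f_2(x) has exactly one root inside the unit circle can be cross-checked with a conventional eigenvalue-based solver (a NumPy sketch, independent of the NRF):

```python
import numpy as np

# f2(x) = x^9 + 2.1i x^7 + 1.1 x^6 + (1.3 - i) x^3 + (1.4 + 0.6i) x + 5.3i
f2 = [1, 0, 2.1j, 1.1, 0, 0, 1.3 - 1j, 0, 1.4 + 0.6j, 5.3j]
r = np.roots(f2)
assert len(r) == 9
assert np.sum(np.abs(r) < 1.0) == 1     # exactly one root inside the unit circle
```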
Figure 12: Estimated roots distributions of f_2(x) in the complex plane for the Σ−Π and log Σ models in the case of e_r = 0.1. (a) Σ−Π model. (b) log Σ model.
Figure 13: Estimated roots distributions of f_2(x) in the complex plane for the Σ−Π and log Σ models in the case of e_r = 0.01. (a) Σ−Π model. (b) log Σ model.
Figure 14: Estimated roots distributions of f_2(x) in the complex plane for the Σ−Π and log Σ models in the case of e_r = 0.001. (a) Σ−Π model. (b) log Σ model.
Here we must stress that, in the experiments, it was found that for some initial connection weights the Σ−Π model always oscillates around some local minimum; that is, there may be local minima in the error surface of the Σ−Π model. Similarly, let e_r = 1.0 × 10⁻⁹. We repeat the 30 experiments to obtain the average estimated roots (including the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, the average CPU times, and the average estimate accuracies³ (C_p), as shown in Table 4. From Table 4, it can again be observed that, in the statistical sense, the log Σ model trains faster than the Σ−Π model, but the latter has higher estimated accuracy. In addition, Figures 15 and 16 illustrate two sets of logarithmic learning weights (roots) curves of the two models for one of the 30 repeated experiments with identical initial connection weight values; in each subplot there are two curves, representing the real (solid line) and imaginary (dashed line) parts of one complex root. From Figures 15 and 16, it can be seen that the Σ−Π model converges more slowly than the log Σ model, and its fluctuations around the true (estimated) root values are somewhat more conspicuous. Moreover, for some initial connection weights, we again found oscillations around certain local minima during the search for solutions. Therefore, it is again suggested that, in applying this NRF to practical problems, the log Σ model should be preferred to the direct Σ−Π model.

4.3 Effects of the Parameters of the CLA on the Performance of the NRFs.
In order for the CLAs to be applied more conveniently to practical root-finding problems, we investigate the effects of the three controlling parameters of the CLA on the performance of the NRFs. Given an arbitrary five-order test polynomial with unknown roots, f_3(x) = x⁵ + 2.2x⁴ + (−3.1 − 0.5i)x³ + (2.4 − 1.7i)x² + 4.3ix + 2.35 − 1.23i, we build an FNNRF with a 2-5-1 structure to discuss the effects of the three controlling parameters used in the CLA, {δP_0, θ_p, η}, on the performance of this NRF. In the following experiments, we always fix the termination error at e_r = 1 × 10⁻⁸ for all cases; three sets of different values for each parameter of {δP_0, θ_p, η} with the CLA are

³ The average estimate accuracy is defined as
$$ C_p = \frac{1}{K} \cdot \frac{1}{n} \sum_{k=1}^{K} \sum_{i=1}^{n} \left| a_i - \hat{a}_i^{(k)} \right|, $$
where K is the number of repeated experiments, {a_i} are the coefficients of the polynomial, and â_i^{(k)} is the ith reconstructed polynomial coefficient in the kth experiment.
Table 4: Statistical Performance Comparison of the Σ−Π– and log Σ–Based FNNRF Models for f_2(x) (δP_0 = 8.0, θ_p = 5.0, η = 0.4).

Indices                            log Σ Model                        Σ−Π Model
Average estimated roots
  w1                               (−1.138758, −0.1249457)            (−1.138757, −0.1249474)
  w2                               (0.9814540, −0.3930945)            (0.9814527, −0.3930941)
  w3                               (1.029028, 0.5315055)              (1.029029, 0.5315061)
  w4                               (−0.6140527, −0.7163082)           (−0.6140542, −0.7163094)
  w5                               (0.2785297, 1.157246)              (0.2785297, 1.157247)
  w6                               (−0.5205935, 1.071421)             (−0.5205949, 1.071422)
  w7                               (−1.275355, 0.8272572)             (−1.275356, 0.8272576)
  w8                               (1.105366, −1.250805)              (1.105367, −1.250804)
  w9                               (0.1543829, −1.102276)             (0.1543843, −1.102277)
Average reconstructed polynomial coefficients
  ā1                               (−1.3808409E−6, 1.0450681E−6)      (6.6061808E−8, 1.0728835E−7)
  ā2                               (−3.4644756E−06, 2.100000)         (−2.8174892E−07, 2.100000)
  ā3                               (1.100003, −4.3437631E−06)         (1.100000, −1.0875538E−06)
  ā4                               (−9.9966510E−6, −5.5188639E−6)     (−4.694555E−9, 1.4712518E−6)
  ā5                               (2.8562122E−6, 5.2243431E−6)       (1.5409139E−6, 1.7070481E−6)
  ā6                               (1.299983, −1.000008)              (1.300000, −0.9999995)
  ā7                               (−1.1211086E−5, 2.4979285E−5)      (2.8605293E−6, 1.9240313E−6)
  ā8                               (1.399990, 0.6000041)              (1.399996, 0.6000024)
  ā9                               (2.6458512E−5, 5.299979)           (6.8408231E−6, 5.300003)
Average iterating number           234,239                            267,529
Average CPU times (seconds)        135.16                             278.54
Average estimated accuracy (C_p)   5.855524581083182E−005             8.679319172366640E−006
Average estimated variance         2.164147025607199E−005             3.248978567496603E−006
in turn chosen, while keeping the other two parameters unchanged; for each case, we conducted 30 repeated experiments, choosing different initial connection weights from the uniform distribution on [−1, 1], to evaluate the statistical performance of the two NRFs.

4.3.1 Case 1. The parameters θ_p = 5.0 and η = 0.5 remain unchanged, while δP_0 is chosen as 15.0, 27.0, and 39.0 in turn. For this case, we run the CLA with these parameters to train the two FNNRFs with 30 different initial weight values until the termination error is reached. Table 5 lists the average estimated roots (including the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, and the average estimate accuracies (C_p). From Table 5, it can be seen that the iterating number (and hence the training time) grows as the parameter δP_0 increases.
Table 5: Performance Comparisons Based on the Test Polynomial f3(x) Between the Σ−Π and log Σ Models Under the Conditions of θp = 5.0, η = 0.5, and δP0 = 15.0, 27.0, 39.0. [For each model and each δP0, the table lists the average estimated roots (w1–w5), the average reconstructed polynomial coefficients (ā1–ā5), the average iterating number, the average CPU time in seconds, the average estimated accuracy (Cp), and the average estimated variance; the flattened numeric columns are omitted here.]
A Neural Root Finder of Polynomials Based on Root Moments (D.-S. Huang, H. Ip, and Z. Chi)
Figure 15: Logarithmic learning weights (roots) curves of the Σ−Π model for f2(x) versus the logarithmic iterating numbers: (a) w1. (b) w2. (c) w3. (d) w4. (e) w5. (f) w6. (g) w7. (h) w8. (i) w9.
Figure 16: Logarithmic learning weights (roots) curves of the log Σ model for f2(x) versus the logarithmic iterating numbers: (a) w1. (b) w2. (c) w3. (d) w4. (e) w5. (f) w6. (g) w7. (h) w8. (i) w9.

Moreover, it can be observed that, in the statistical sense, the log Σ model has a faster training speed than the Σ−Π model, but the latter achieves a higher estimated accuracy than the former owing to its longer training time. In addition, Figure 17 shows the logarithmic learning error curves for the two models in the case of three different δP0's for a single trial. Figure 17 shows that the convergence becomes slower (i.e., the iterating number grows) as δP0 increases. This phenomenon can be explained from the particulars of the CLA. In fact, from equation 2.20, $\delta Q_j = -\eta \Phi_j \,\delta P / \sqrt{\Phi^H I_{FF}^{-1} \Phi}$ and $\xi = I_{JJ} - I_{JF}^H I_{FF}^{-1} I_{JF} \ge 0$, we obtain

$$\mu = -\frac{1}{2\,\delta P}\cdot\frac{\xi}{\sqrt{1-\eta^2}}. \qquad (4.1)$$

It can be derived that when $\mu \to 0$, equation 2.21 approaches $V \approx I_{FF}^{-1} I_{JF}$; equation 2.19 can then be rewritten as

$$dw_i \approx \frac{J_i - F_i^H I_{FF}^{-1} I_{JF}}{2\mu}, \qquad (4.2)$$
Figure 17: Logarithmic learning error curves for the (a) Σ−Π and (b) log Σ models under the conditions of θp = 5.0 and η = 0.5 and δP0 = 15.0, 27.0, 39.0 for a single trial.
which resembles the gradient descent formula of the BPA apart from the second term, $-F_i^H I_{FF}^{-1} I_{JF}/2\mu$, which is related to the a priori information of the root-finding problem. When $\mu \to \infty$, equation 2.21 becomes $V \approx -2\mu I_{FF}^{-1}\,\delta Q$; equation 2.19 can then be rewritten as

$$dw_i \approx F_i^H I_{FF}^{-1}\,\delta Q, \qquad (4.3)$$

which depends more strongly on the a priori information of the problem at hand. Therefore, as δP0 increases, equation 4.1 shows that μ becomes smaller and smaller, so that the iteration for the connection weights switches to the gradient descent search described in equation 4.2. Because δP(t) is adaptively chosen by equation 2.24, a bigger δP0 entails a longer training time before the iteration for the connection weights switches to the a priori information search described in equation 4.3. Consequently, the convergence will certainly be slower for a bigger δP0, which shows that our experimental results are completely consistent with our theoretical analysis. In addition, from Figure 17, it can be seen that the fluctuation for the Σ−Π model is much more drastic than for the log Σ model, which is also consistent with the previous analyses.

4.3.2 Case 2. The parameters δP0 = 12.0 and θp = 10.0 are kept unchanged, while η is chosen as 0.4, 0.6, and 0.8, respectively. We design the corresponding CLA with these chosen parameters to train the two NRFs with 30 different initial weight values until the termination error is reached. Table 6 shows the average estimated roots (including the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, and the average estimated accuracies (Cp). From Table 6, it can be observed that the iterating number increases as the parameter η increases. Moreover, it can be seen that, in the statistical sense, the Σ−Π model has a slower training speed than the log Σ model, but the former achieves a higher estimated accuracy than the latter. Figure 18 shows the logarithmic learning error curves for the two models in the case of three different η's for a single trial. From this figure, it can be seen that the convergence becomes slower as the parameter η increases.
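Both trends follow from equation 4.1's dependence on δP and η. A quick numeric check (with ξ held fixed at an arbitrary positive value, since ξ depends on the data and is not specified here):

```python
import math

def mu(delta_p, eta, xi=1.0):
    # Equation 4.1: mu = -(1 / (2 deltaP)) * xi / sqrt(1 - eta^2),
    # with xi >= 0 fixed at an illustrative value.
    return -xi / (2.0 * delta_p * math.sqrt(1.0 - eta ** 2))

# Larger deltaP0 drives |mu| toward 0, the gradient-descent regime of eq. 4.2:
mags_dp = [abs(mu(dp, 0.5)) for dp in (15.0, 27.0, 39.0)]
assert mags_dp[0] > mags_dp[1] > mags_dp[2]

# Larger eta inflates |mu|, slowing the gradient-descent-phase search:
mags_eta = [abs(mu(12.0, e)) for e in (0.4, 0.6, 0.8)]
assert mags_eta[0] < mags_eta[1] < mags_eta[2]
```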
This phenomenon can be explained as follows. Since δP(t) is adaptively chosen by equation 2.24, the learning process always proceeds from μ → 0 (the gradient-descent-based phase) to μ → ∞ (the a priori information-based searching phase). If η (0 < η < 1) is chosen as a bigger value, then from equation 4.1, μ also becomes bigger. As a result, the role of η is dominant in the gradient-descent-based phase, so a bigger η results in a slower search over the error surface. On the other hand, when the learning process switches to the a priori information-based searching phase, the role of δP(t) becomes dominant; during this phase, a different η will not affect the convergence speed. Therefore, from these analyses, it can be deduced that the convergence speed drops as the parameter η goes up, which shows that our experimental results are completely consistent with our theoretical analyses. From Figure 18, it can also be observed that the fluctuation for the Σ−Π model is much more drastic than for the log Σ model.

Table 6: Performance Comparisons Based on the Test Polynomial f3(x) Between the Σ−Π and log Σ Models Under the Conditions of δP0 = 12.0, θp = 10.0, and η = 0.4, 0.6, 0.8. [For each model and each η, the table lists the average estimated roots (w1–w5), the average reconstructed polynomial coefficients (ā1–ā5), the average iterating number, the average CPU time in seconds, the average estimated accuracy (Cp), and the average estimated variance; the flattened numeric columns are omitted here.]

Figure 18: Logarithmic learning error curves for the (a) Σ−Π and (b) log Σ models under the conditions of δP0 = 12.0 and θp = 10.0 and η = 0.4, 0.6, 0.8 for a single trial.

4.3.3 Case 3. The parameters δP0 = 8.0 and η = 0.3 are kept unchanged, while θp is chosen as 10.0, 20.0, and 30.0, respectively. As in the other two cases, we design the corresponding CLA with these chosen parameters to train the two NRFs with 30 different initial weight values until the termination error is satisfied. Table 7 shows the average estimated roots (including the average estimated variances), the average reconstructed polynomial coefficients, the average iterating numbers, and the average estimated accuracies (Cp). From Table 7, it can also be observed that the iterating number increases as the parameter θp increases. Furthermore, it can be seen that, in the statistical sense, the log Σ model has a faster training speed than the Σ−Π model, but the latter achieves a higher estimated accuracy than the former. Finally, Figure 19 illustrates the logarithmic learning error curves for the two models in the case of three different θp's for a single trial. From this figure, it can be seen that the convergence speed drops as θp increases. The experimental phenomenon here can be completely explained by means of the analysis of case 1, since δP(t) is a functional of θp, and a bigger θp corresponds to a bigger δP(t).

5 Conclusions

This article proposed a novel feedforward neural network root finder (FNNRF), which can be specialized into two kinds of structures, referred to as the Σ−Π and log Σ models, to find the arbitrary roots (including complex ones) of arbitrary polynomials.
For this FNNRF, a constructive learning algorithm, referred to as the constrained learning algorithm (CLA), was derived by imposing a priori information, the root moments of polynomials, on the output error cost function of the FNNRF. The experimental results show that this CLA based on the root moments of polynomials has an extremely fast training speed and can compute the solutions (roots) of polynomial functions efficiently. We showed in theory and in experiment that the computational complexity of the RMM-NRF is significantly lower than that of the RCM-NRF, and that the computational complexities of the two neural root finders are generally lower than those of the nonneural root finders in Mathematica's roots function. We also compared the performance of the two NRFs, based on the RMM and the RCM, with the two nonneural methods of Muller and Laguerre. The experimental results showed that both the RMM-NRF and the RCM-NRF have a faster convergence speed and higher accuracy than the traditional Muller and Laguerre methods. Moreover, the neural methods do not need to carefully select the initial root values as the nonneural methods do, beyond randomly selecting them from the uniform distribution on [−1, 1]. The important point to stress here is that the neural root-finding methods will have an advantage over traditional nonneural methods if a neural computer with many interconnected processing nodes, like the brain, or a computer with an inherently parallel algorithmic structure is developed in the near future.

Table 7: Performance Comparisons Based on the Test Polynomial f3(x) Between the Σ−Π and log Σ Models Under the Conditions of δP0 = 8.0, η = 0.3, and θp = 10.0, 20.0, 30.0. [For each model and each θp, the table lists the average estimated roots (w1–w5), the average reconstructed polynomial coefficients (ā1–ā5), the average iterating number, the average CPU time in seconds, the average estimated accuracy (Cp), and the average estimated variance; the flattened numeric columns are omitted here.]

Figure 19: Logarithmic learning error curves for the (a) Σ−Π and (b) log Σ models under the conditions of δP0 = 8.0 and η = 0.3 and θp = 10.0, 20.0, 30.0 for a single trial.

In addition, we took two polynomials as examples to discuss and compare the performance of the two models, Σ−Π and log Σ. The experimental results illustrated that, under identical termination accuracy, the Σ−Π model has a higher estimated accuracy but a slower training speed, and it exhibits more drastic fluctuations while searching for global minima on the error surface, relative to the log Σ model. Specifically, it can sometimes be observed in experiments that local minima exist on the error surface of the Σ−Π model, so that the CLA oscillates around a local minimum or even diverges. In contrast, the log Σ model avoids these drawbacks thanks to its logarithmic operator, which compresses the dynamic range of the inputs and gives a smoother error surface at the hidden nodes. Moreover, there are no local minima with the log Σ model, provided the parameters of the CLA are suitably chosen, since all the a priori information has been utilized. Therefore, in real applications, the log Σ model should be preferred to the direct Σ−Π model. Finally, we discussed the effects of the three controlling parameters {δP0, θp, η} of the CLA on the performance of the two NRFs. The experimental results showed that the training speeds could be increased by decreasing these three controlling parameters while maintaining the same estimated accuracies (including the estimated variances). We showed that this performance behavior is consistent with our theoretical analysis.
Specifically, the drastic fluctuation phenomena with the direct Σ−Π model were again observed in these experiments. Future work will explore how to use the NRFs to find the maximum- or minimum-modulus root of arbitrary polynomials and to apply them to more practical problems in signal processing.

Acknowledgments

This work was supported by the National Natural Science Foundation of China and by a grant from the 100-Talents Program of the Chinese Academy of Sciences.
Received July 30, 2002; accepted February 2, 2004.
NOTE
Communicated by Kechen Zhang
A Note on the Applied Use of MDL Approximations

Daniel J. Navarro
[email protected]
Department of Psychology, Ohio State University, Columbus, OH 43210, U.S.A.
An applied problem is discussed in which two nested psychological models of retention are compared using minimum description length (MDL). The standard Fisher information approximation to the normalized maximum likelihood is calculated for these two models, with the result that the full model is assigned a smaller complexity, even for moderately large samples. A geometric interpretation for this behavior is considered, along with its practical implications.

1 Introduction

The minimum description length (MDL) principle (Rissanen, 1978; see also Grünwald, 1998) has attracted interest in applied fields because it allows comparisons between nonnested and misspecified models without requiring restrictive assumptions. Of particular interest is the normalized maximum likelihood (NML) criterion,

$$\mathrm{NML} = \frac{p(X \mid \theta^*(X))}{\int p(Y \mid \theta^*(Y))\,dY},$$

where θ*(X) denotes the maximum likelihood estimate (MLE) for the data X. While the NML has many desirable properties (Rissanen, 2001), a practical difficulty is that the normalization term is difficult to evaluate. Consequently, an approach based on Fisher information (FIA) is often used in its place (e.g., Rissanen, 1996). This criterion is given by

$$\mathrm{FIA} = -\ln p(X \mid \theta^*(X)) + \frac{k}{2}\ln\frac{N}{2\pi} + \ln\int\sqrt{|I(\theta)|}\,d\theta + o(1),$$

where k denotes the number of free parameters, N denotes the sample size, and I(θ) denotes the expected Fisher information matrix of sample size one. The second and third terms are often referred to as a complexity penalty. Under regularity conditions (Rissanen, 1996), it can be shown that the FIA asymptotically converges to −ln(NML).¹

¹ The regularity conditions, most important the asymptotic normality of the MLE, are satisfied for models that constitute compact (i.e., closed and bounded) subsets of an exponential family, such as those discussed here.
Neural Computation 16, 1763–1768 (2004) © 2004 Massachusetts Institute of Technology
The practical advantage of the FIA lies in its calculation, since it requires only integration over the parameter space, and the Fisher information matrix is sometimes easier to find than the maximum likelihood. However, while the optimality of the NML criterion holds for all N, the FIA is guaranteed only asymptotically, which can be problematic in some applications. This note presents one such case. Since applied researchers are often forced by necessity to use asymptotic measures such as the FIA, it is useful to take note of circumstances under which they are inappropriate and to give consideration to the reasons.

2 The Applied Problem

The applied problem originates in the study of human forgetting ("retention"; see Navarro, Pitt, & Myung, in press). In a typical retention experiment, participants are presented with a list of words and asked to recall them later. Measurements at different times after stimulus presentation produce a "retention curve": the probability of accurate recall is initially high but tends toward zero over time. A retention model takes the form p(C|t) = f(t, θ), where p(C|t) denotes the probability of correct recall at time t, and f(t, θ) is the hypothesized form of the retention curve, parameterized by θ. Obviously, f(t, θ) must lie on [0, 1] for all t. Additionally, t is normalized to lie between 0 and 1. The classic retention model is the exponential (EX) model, f(t, θ) = a exp(−bt), where θ = (a, b) such that a ∈ [0, 1] denotes the initial retention probability and b gives the decay rate. While definitive bounds on b are not easy to specify, experience with a large database suggests that [0, 100] is reasonable (see Rubin & Wenzel, 1996, or Navarro et al., in press). A second retention model is provided by Wickelgren's (1972) strength-resistance (SR) theory, which hypothesizes that f(t, θ) = a exp(−bt^w), where θ = (a, b, w).
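The two retention curves are easy to state in code. The sketch below (function names are mine, not from the original) also makes the nesting explicit: fixing w = 1 in the SR model recovers the EX model.

```python
import math

def f_ex(t, a, b):
    # Exponential (EX) retention model: a * exp(-b * t).
    return a * math.exp(-b * t)

def f_sr(t, a, b, w):
    # Strength-resistance (SR) model: a * exp(-b * t^w).
    return a * math.exp(-b * t ** w)

# EX is the w = 1 slice of SR:
for t in (0.1, 0.5, 1.0):
    assert abs(f_sr(t, 0.9, 2.0, 1.0) - f_ex(t, 0.9, 2.0)) < 1e-12
```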
Importantly, w is constrained to lie on [0, 1], since w < 0 results in an increasing retention function and w > 1 results in faster-than-exponential decay, neither of which is consistent with the theory. Note that the EX model is a special case of the SR model, and by inspection, the NML denominator term is always smaller for the EX model than for the SR model, regardless of sample size.

3 Fisher Information Matrices

In any retention experiment, continuous measurements are impractical, so retention is measured at some number m of fixed time intervals t1, …, tm. Thus, the observed data may be treated as N observations from an m-variate binomial distribution, where the ith Bernoulli probability is described by f(ti, θ). A standard result (e.g., Schervish, 1995; Su, Myung, & Pitt, in press) allows the uvth element of the Fisher information matrix of sample size one for these models to be written

$$I_{uv}(\theta) = \sum_{i=1}^{m} \frac{1}{f(t_i,\theta)\,\bigl(1 - f(t_i,\theta)\bigr)}\,\frac{\partial f(t_i,\theta)}{\partial \theta_u}\,\frac{\partial f(t_i,\theta)}{\partial \theta_v}.$$

The partial derivatives of f(ti, θ) are simple. For the EX model, ∂f(ti, θ)/∂a = exp(−bti) and ∂f(ti, θ)/∂b = −a ti exp(−bti). For the SR model, the partial derivatives are ∂f(ti, θ)/∂a = exp(−bti^w), ∂f(ti, θ)/∂b = −a ti^w exp(−bti^w), and ∂f(ti, θ)/∂w = −a b ti^w ln(ti) exp(−bti^w). Substitution into the Fisher information formula yields

$$I(a,b) = \begin{pmatrix} \dfrac{1}{a}\sum_{i=1}^m y(i) & -\sum_{i=1}^m t_i\,y(i) \\[1.5ex] -\sum_{i=1}^m t_i\,y(i) & a\sum_{i=1}^m t_i^2\,y(i) \end{pmatrix}$$

for the EX model, where y(i) = 1/(exp(bti) − a). For the SR model,

$$I(a,b,w) = \begin{pmatrix} \dfrac{1}{a}\sum_{i=1}^m z(i) & -\sum_{i=1}^m t_i^w z(i) & -b\sum_{i=1}^m t_i^w \ln(t_i)\,z(i) \\[1.5ex] -\sum_{i=1}^m t_i^w z(i) & a\sum_{i=1}^m t_i^{2w} z(i) & ab\sum_{i=1}^m t_i^{2w} \ln(t_i)\,z(i) \\[1.5ex] -b\sum_{i=1}^m t_i^w \ln(t_i)\,z(i) & ab\sum_{i=1}^m t_i^{2w} \ln(t_i)\,z(i) & ab^2\sum_{i=1}^m t_i^{2w} \ln^2(t_i)\,z(i) \end{pmatrix},$$
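As a sanity check on the EX matrix above, the closed form in terms of y(i) can be compared against direct substitution of the partial derivatives into the general formula for I_uv(θ). A small sketch (function names are mine):

```python
import math

def fisher_ex_direct(a, b, ts):
    # General formula: I_uv = sum_i f_u f_v / (f (1 - f)),
    # with f = a exp(-b t), f_a = exp(-b t), f_b = -a t exp(-b t).
    I = [[0.0, 0.0], [0.0, 0.0]]
    for t in ts:
        f = a * math.exp(-b * t)
        grad = (math.exp(-b * t), -a * t * math.exp(-b * t))
        for u in range(2):
            for v in range(2):
                I[u][v] += grad[u] * grad[v] / (f * (1.0 - f))
    return I

def fisher_ex_closed(a, b, ts):
    # Closed form with y(i) = 1 / (exp(b t_i) - a).
    y = [1.0 / (math.exp(b * t) - a) for t in ts]
    s0 = sum(y)
    s1 = sum(t * yi for t, yi in zip(ts, y))
    s2 = sum(t * t * yi for t, yi in zip(ts, y))
    return [[s0 / a, -s1], [-s1, a * s2]]

# Eight evenly spaced measurement times on (0, 1], as in the note's design.
ts = [i / 8 for i in range(1, 9)]
A = fisher_ex_direct(0.8, 3.0, ts)
B = fisher_ex_closed(0.8, 3.0, ts)
assert all(abs(A[u][v] - B[u][v]) < 1e-9 for u in range(2) for v in range(2))
```

The two computations agree because f(1 − f) = a exp(−2bt)(exp(bt) − a), which cancels the exponentials in the products of partial derivatives.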
where z(i) = 1/(exp(bti^w) − a).

4 Small Sample Behavior

The experimental design was assumed to consist of eight evenly spaced ti values, though the results are consistent across a range of designs. Without a closed form for the integral in the FIA, numerical estimates were obtained using Monte Carlo methods (e.g., Robert & Casella, 1999). Given the low dimensionality of the integrals and the unnecessarily extensive sampling (10^8 samples were taken), even simple Monte Carlo methods provide adequate results. The integral term ∫√|I(θ)| dθ for the EX model is approximately 8.08, whereas for the SR model, the value is approximately 0.44. Substituting this into the FIA expression indicates that the SR model has a higher estimated complexity only when N ≥ 2096, as shown in Figure 1. This presents a substantial difficulty for the applied problem, since not one of the 77 data sets considered by Navarro et al. (in press) had a sample size this large. Therefore, every data set that could possibly be observed would be better fit by SR, yet the nested EX model would be penalized more severely for excess complexity.

Figure 1: Complexity assigned by the Fisher information approximation (FIA) for the EX and SR models (before taking logarithms).

It is worth considering the source of this problem. Following Myung, Balasubramanian, and Pitt (2000) and Balasubramanian (1997), the complexity terms in the FIA can be viewed as approximations to (the logarithm of) the ratio of two Riemannian volumes. The first is the volume occupied by the model in the space of probability distributions (assuming the Fisher information metric), given by V_f = ∫√|I(θ)| dθ. The second volume is that of a little ellipsoid around θ*, intended to quantify the size of a "region of appreciable fit," given by V_c = (2π/N)^{k/2} √(|J(θ*)|/|I(θ*)|), where

$$J(\theta^*) = -\frac{1}{N}\sum_{t}\left.\frac{\partial^2 \ln p(C \mid t)}{\partial\theta_u\,\partial\theta_v}\right|_{\theta=\theta^*}$$

is the observed information. In general, the observed information J(θ*) will differ from the expected information I(θ*), but for current purposes, it suffices to note that there are data sets for which they are very similar, yielding J(θ*) ≈ I(θ*). As these are binomial models, an example of such a data set would be one in which Ci ≈ N f(ti, θ*) for all i, since the data are then almost identical to their expected values at the MLE parameters. Note that these are the data sets that are well fit by the model and are thus precisely those of most interest to applied research. In such cases, V_c ≈ (2π/N)^{k/2}. When the observed information matrix closely approximates the expected information matrix, it is reasonable to view the complexity penalty for the FIA as approximately equivalent to ln(V_f/V_c). At N = 1, we observe that for the EX model, V_c = 2π < 8.08 ≈ V_f. The little ellipsoid is smaller than the model, as it should be. However, by expanding the EX model into a third dimension, one obtains the SR model, for which V_c = (2π)^{3/2} > 0.44 ≈ V_f. The volume of the three-dimensional ellipsoid is now larger than the volume of the entire model. Taken together, these observations suggest that the extension of the model along the new dimension induced by the addition of the w parameter is so tiny that the "small" ellipsoid now protrudes extensively, like a marble embedded in a sheet of paper.
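Plugging the reported volume estimates into the FIA penalty makes the crossover point easy to reproduce. The sketch below uses the rounded values 8.08 and 0.44 quoted above, so it lands near, rather than exactly at, the note's N ≥ 2096 (which was presumably computed from unrounded Monte Carlo estimates):

```python
import math

def fia_penalty(n, k, v_f):
    # FIA complexity penalty: (k/2) ln(N / 2 pi) + ln V_f.
    return (k / 2) * math.log(n / (2 * math.pi)) + math.log(v_f)

V_F_EX, V_F_SR = 8.08, 0.44  # reported Monte Carlo estimates of the integral

# At modest sample sizes, the three-parameter SR model is assigned the
# SMALLER complexity, which is the anomaly discussed in the text:
assert fia_penalty(500, 3, V_F_SR) < fia_penalty(500, 2, V_F_EX)

# Solve for the crossover: the penalties are equal when
# (1/2) ln(N / 2 pi) = ln(V_F_EX / V_F_SR), i.e. N = 2 pi (V_F_EX / V_F_SR)^2.
n_cross = 2 * math.pi * (V_F_EX / V_F_SR) ** 2
assert 2000 < n_cross < 2200  # close to the N >= 2096 reported in the note
```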
Most of the "region of good fit" no longer lies within the region occupied by the model. In short, until N gets very large, θ* does not seem to lie sufficiently within the model manifold to support the gaussian approximations that underlie the FIA.

It does not appear that the problem can be entirely solved by incorporating the √(|J(θ*)|/|I(θ*)|) term. After all, by judicious choice of t and C, data sets could be chosen for which the observed information is precisely equal to the expected information even at small samples. For instance, if t = (.25, .5, .75, 1) and N = 16, then the data set C = (8, 4, 2, 1) yields MLE parameters a* = 1 and b* = ln 16 for both models, and w* = 1 for SR. In both cases, f(ti, θ*) = Ci/N for all i: the data take on their expected values at the MLE. Accordingly, V_c = (2π/N)^{k/2} for these data. Even so, V_f ≈ 0.07 for the SR model, while V_f ≈ 3.39 for the EX model, indicating that the little ellipsoid is not well located for the SR model.

5 Discussion

Asymptotic approximations are useful in practice only insofar as they can be relied on to give the right answers. Clearly, if V_c < V_f is not satisfied, the standard FIA expression is not necessarily reliable. In psychology, for instance, it is rare to find studies with N > 2000, yet V_c can remain smaller than V_f even at this sample size. This "well-locatedness" requirement is mentioned by Balasubramanian (1997) as a condition of his asymptotic expansions but is not generally thought of as a regularity condition because it holds asymptotically (i.e., V_c tends to zero as N tends to infinity). In finite samples, some care is needed. While observing V_c < V_f does not guarantee that θ* is well located, observing that V_c > V_f does imply that it is not. It may be possible to use this observation to formulate approximations that perform better in small samples, a question left to more theoretically inclined researchers.
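The worked example in the preceding paragraph is easy to verify numerically: with a* = 1 and b* = ln 16, the EX curve passes exactly through Ci/N at each of the four measurement times.

```python
import math

a_star, b_star, n = 1.0, math.log(16), 16
ts = (0.25, 0.5, 0.75, 1.0)
cs = (8, 4, 2, 1)

for t, c in zip(ts, cs):
    # EX model prediction at the MLE: f(t) = a* exp(-b* t) = 16^(-t),
    # so N * f(t) = 16^(1 - t), which hits 8, 4, 2, 1 exactly.
    expected = n * a_star * math.exp(-b_star * t)
    assert abs(expected - c) < 1e-9  # data equal their expected values
```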
In any case, it is clear that the source of the current difficulty does not lie with the MDL principle itself: the NML criterion does not suffer from this problem at any N. It is only that the o(1) term in the asymptotic criterion can be large enough to make the approximation impractical for smaller samples. While the point is somewhat obvious, it does imply the need to take care in the use of the criterion. To that end, it is worth ensuring that V_c < V_f before using the approximation.

Acknowledgements

This research was supported by NIH grant R01-MH57472 and by a grant from the Office of Research at OSU. I thank Peter Grünwald, Michael Lee, In Jae Myung, and Mark Pitt for helpful suggestions and two anonymous reviewers for comments that substantially improved the article.
References

Balasubramanian, V. (1997). Statistical inference, Occam’s razor and statistical mechanics on the space of probability distributions. Neural Computation, 9, 349–368.
Grünwald, P. (1998). The minimum description length principle and reasoning under uncertainty. Unpublished doctoral dissertation, CWI, the Netherlands.
Myung, I. J., Balasubramanian, V., & Pitt, M. A. (2000). Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences USA, 97, 11170–11175.
Navarro, D. J., Pitt, M. A., & Myung, I. J. (in press). Assessing the distinguishability of models and the informativeness of data. Cognitive Psychology.
Rissanen, J. (1978). Modeling by the shortest data description. Automatica, 14, 465–471.
Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47.
Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47, 1712–1717.
Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. New York: Springer-Verlag.
Rubin, D. C., & Wenzel, A. (1996). One hundred years of forgetting: A quantitative description of retention. Psychological Review, 103, 734–760.
Schervish, M. J. (1995). Theory of statistics. New York: Springer-Verlag.
Su, Y., Myung, I. J., & Pitt, M. A. (in press). Minimum description length and cognitive modeling. In P. Grünwald, I. J. Myung, & M. A. Pitt (Eds.), Advances in minimum description length: Theory and applications. Cambridge, MA: MIT Press.
Wickelgren, W. A. (1972). Trace resistance and decay of long-term memory. Journal of Mathematical Psychology, 9, 418–455.

Received September 15, 2003; accepted February 4, 2004.
NOTE
Communicated by Chris Burges
Optimal Reduced-Set Vectors for Support Vector Machines with a Quadratic Kernel Thorsten Thies
[email protected] Frank Weber
[email protected] Cognitec Systems GmbH, D–01139 Dresden, Germany
To reduce computational cost, the discriminant function of a support vector machine (SVM) should be represented using as few vectors as possible. This problem has been tackled in different ways. In this article, we develop an explicit solution in the case of a general quadratic kernel k(x, x′) = (C + D x⊤x′)². For a given number of vectors, this solution provides the best possible approximation and can even recover the discriminant function if the number of used vectors is large enough. The key idea is to express the inhomogeneous kernel as a homogeneous kernel on a space having one dimension more than the original one and to follow the approach of Burges (1996).

1 Introduction

Support vector machines (as described in Boser, Guyon, & Vapnik, 1992; Vapnik, 1998; Schölkopf & Smola, 2002) have been widely used for pattern recognition and regression estimation for several years. Aside from their numerous advantages, they suffer from being quite slow in the test phase compared to other learning techniques. To compensate for this drawback, there have been different approaches (e.g., Burges, 1996; Osuna & Girosi, 1998; Tipping, 2001). Specifically, we are concerned with reducing the expense of computing a discriminant function of the form

f(x) = b + Σ_{i=1}^m α_i k(x_i, x),   (x ∈ X),   (1.1)
where k is a positive semidefinite kernel on some input space X = Rⁿ. The support vectors x_i ∈ X and the α_i ∈ R can be computed by the support vector algorithm (in fact, some quadratic program). As the complexity of the computation in equation 1.1 scales linearly with the number m of support vectors (which may be quite large), we are interested in finding a set of ν vectors (with ν < m) and corresponding coefficients whose expansion approximates the discriminant function as well as possible. Throughout, we consider the quadratic kernel k(x, x′) = (C + D x⊤x′)² with C, D > 0. The reason for this restriction is that if C and D have different signs, the kernel is not positive semidefinite in general.¹

2 Notation

For any positive semidefinite kernel k on a sample space X, there is a feature space H: a Hilbert space with dot product ⟨·, ·⟩ and a feature map Φ: X → H such that

k(x, x′) = ⟨Φ(x), Φ(x′)⟩,   (x, x′ ∈ X).

We denote the norm in H as well as in X by ‖·‖. The concrete form of Φ and H for the occurring kernels does not matter here; we introduce this concept simply to explain what we mean by the best possible approximation of a discriminant function. Suppose we solved the SVM’s quadratic program and received the shift value b ∈ R, the support vectors x₁, …, x_m ∈ X, and the corresponding α₁, …, α_m ∈ R\{0}. Note that the α_i can be negative here. Their sign corresponds to the class labels y_i ∈ {−1, 1} used in Burges (1996).
¹ As a simple example, consider D = 1, C = −1, and n = 2. Then x₁ = (1/√2)(1, 1)⊤ and x₂ = (−1, 0)⊤ lead to a Gram matrix that is not positive semidefinite, so k is not positive semidefinite by definition.
Thus, we derived the normal vector w ∈ H to the decision hyperplane in H corresponding to the decision surface in X, that is, f(x) = b + ⟨Φ(x), w⟩, where w = Σ_{i=1}^m α_i Φ(x_i). We are now looking for vectors z₁, …, z_ν ∈ X and coefficients γ₁, …, γ_ν ∈ R such that the distance

ρ = ‖ Σ_{i=1}^m α_i Φ(x_i) − Σ_{p=1}^ν γ_p Φ(z_p) ‖   (2.1)

between the normal vector w and its approximation is minimized. ρ² can be written as

ρ² = ⟨ Σ_{i=1}^m α_i Φ(x_i) − Σ_{p=1}^ν γ_p Φ(z_p), Σ_{j=1}^m α_j Φ(x_j) − Σ_{q=1}^ν γ_q Φ(z_q) ⟩.

Using the bilinearity of ⟨·, ·⟩, we obtain an expression for ρ² without using Φ:

ρ² = Σ_{i,j=1}^m α_i α_j k(x_i, x_j) − 2 Σ_{i=1}^m Σ_{p=1}^ν α_i γ_p k(x_i, z_p) + Σ_{p,q=1}^ν γ_p γ_q k(z_p, z_q).   (2.2)
To separate the different kinds of kernels used in this article, we use the following notation: for C, D > 0, let k_n^C and K_n^{C,D} be the general homogeneous (resp. inhomogeneous) quadratic kernels on Rⁿ given by

k_n^C(x, x′) = C² (x⊤x′)²   and   K_n^{C,D}(x, x′) = (C + D x⊤x′)².
3 Review of the Homogeneous Case

Let us first review the case of a general homogeneous quadratic kernel k_n^C. We give a detailed proof of the following theorem, which is essentially contained in section 3.1 of Burges (1996).

Theorem 1. Let C > 0, x₁, …, x_m ∈ Rⁿ and α₁, …, α_m ∈ R\{0}. Let λ₁, …, λ_n ∈ R be the eigenvalues of S = Σ_{i=1}^m α_i x_i x_i⊤, ordered by their absolute values: |λ₁| ≥ ⋯ ≥ |λ_n|. Then for given ν ∈ {1, …, n}, the squared distance

ρ² = Σ_{i,j=1}^m α_i α_j k_n^C(x_i, x_j) − 2 Σ_{i=1}^m Σ_{p=1}^ν α_i γ_p k_n^C(x_i, z_p) + Σ_{p,q=1}^ν γ_p γ_q k_n^C(z_p, z_q)

is minimized among all γ₁, …, γ_ν ∈ R and z₁, …, z_ν ∈ Rⁿ if we choose z₁, …, z_ν to be orthogonal eigenvectors of S corresponding to λ₁, …, λ_ν and set γ_p = λ_p/‖z_p‖² for p = 1, …, ν. In this case, ρ² = C² Σ_{p=ν+1}^n λ_p². Thus, for ν = n we get ρ = 0, and the discriminant function is recovered.
Proof. We use induction on ν.

Step 1. We prove the claim in the case ν = 1. The only candidates (γ, z) ∈ R × Rⁿ for a minimum of ρ² = ρ²(γ, z) are the critical points of ρ². These are characterized by the conditions ∂ρ²/∂z = 0 (I) and ∂ρ²/∂γ = 0 (II). Observe that we have

ρ² = Σ_{i,j=1}^m α_i α_j k_n^C(x_i, x_j) − 2 Σ_{i=1}^m α_i γ k_n^C(x_i, z) + γ² k_n^C(z, z)
   = C² [ Σ_{i,j=1}^m α_i α_j (x_i⊤x_j)² − 2 Σ_{i=1}^m α_i γ (x_i⊤z)² + γ² (z⊤z)² ].

So condition I is equivalent to

Σ_{i=1}^m α_i (x_i⊤z) x_i = γ ‖z‖² z.   (3.1)

Using the matrix S, equation 3.1 reads as

Sz = γ ‖z‖² z.   (3.2)

Condition II is equivalent to

2 Σ_{i=1}^m α_i (x_i⊤z)² = 2γ (z⊤z)²   ⟺   γ ‖z‖⁴ = z⊤Sz.   (3.3)
Thus, the only candidates for a minimum of ρ are (γ, z), where z is an eigenvector of S with corresponding eigenvalue λ and γ = λ/‖z‖². We now show that the minimum of ρ (for ν = 1) occurs if λ = λ₁, the eigenvalue of the largest absolute size. To this end, consider ρ² (as previously computed). The first term of ρ² equals

C² Σ_{i,j=1}^m α_i α_j (x_i⊤x_j)² = C² Σ_{i,j=1}^m α_i α_j x_i⊤x_j x_i⊤x_j
  = C² Σ_{i,j=1}^m α_i α_j Σ_{k,l=1}^n x_{ik} x_{jk} x_{il} x_{jl}
  = C² Σ_{k,l=1}^n ( Σ_{i=1}^m α_i x_{ik} x_{il} ) ( Σ_{j=1}^m α_j x_{jk} x_{jl} )
  = C² tr(S²).
Therefore, we get

ρ² = C² ( tr(S²) − 2γ Σ_{i=1}^m α_i z⊤x_i x_i⊤z + γ² (z⊤z)² )
   = C² ( tr(S²) − 2γ z⊤Sz + γ² ‖z‖⁴ )
   = C² ( tr(S²) − 2γ² ‖z‖² z⊤z + γ² ‖z‖⁴ )
   = C² ( tr(S²) − γ² ‖z‖⁴ )
   = C² ( tr(S²) − λ² ),

which is minimized if we choose z to be the eigenvector z₁ of S corresponding to the eigenvalue λ = λ₁ and γ₁ = λ₁/‖z₁‖². This proves the claim for ν = 1.

Step 2. We prove the claim for ν > 1 supposing it is true for ν − 1. We assume z₁, …, z_{ν−1} and γ₁, …, γ_{ν−1} to be chosen as in the theorem. Let us define

x_{m+p} := z_p,   α_{m+p} := −γ_p,   (p = 1, …, ν − 1).

Then the error ρ², which we want to minimize with respect to (γ, z) ∈ R × Rⁿ, reads as

ρ² = ‖ Σ_{i=1}^m α_i Φ(x_i) − Σ_{p=1}^{ν−1} γ_p Φ(z_p) − γ Φ(z) ‖²
   = C² [ Σ_{i,j=1}^{m+ν−1} α_i α_j (x_i⊤x_j)² − 2 Σ_{i=1}^{m+ν−1} α_i γ (x_i⊤z)² + γ² (z⊤z)² ].

From the first step, with S′ = Σ_{i=1}^{m+ν−1} α_i x_i x_i⊤ instead of S, we get that ρ² is minimized if we set z to be an eigenvector of S′ corresponding to the eigenvalue λ′ of S′ of largest absolute size. Because

S′ = S − Σ_{p=1}^{ν−1} λ_p (z_p z_p⊤)/(z_p⊤z_p)

has the same eigenvectors as S, but with z₁, …, z_{ν−1} ∈ ker S′, we conclude that this is the case if λ′ = λ_ν, z = z_ν, and γ = γ_ν = λ_ν/‖z_ν‖².

Because tr S² = Σ_{i=1}^n λ_i², we get ρ² = 0 for ν = n.
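Theorem 1’s recipe can be checked numerically: build S, keep the ν eigenvectors of largest absolute eigenvalue as reduced-set vectors, and evaluate the residual ρ² through kernels alone, as in equation 2.2. The following is my own sketch on random data (variable names are assumptions, not from the note):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, nu, C = 5, 20, 2, 2.0
X = rng.normal(size=(m, n))            # support vectors x_i as rows
alpha = rng.normal(size=m)             # expansion coefficients alpha_i

def k(u, v):
    """Homogeneous quadratic kernel k_n^C."""
    return C**2 * (u @ v) ** 2

S = (X.T * alpha) @ X                  # S = sum_i alpha_i x_i x_i^T
lam, V = np.linalg.eigh(S)
order = np.argsort(-np.abs(lam))       # sort by absolute eigenvalue
lam, V = lam[order], V[:, order]

Z = V[:, :nu].T                        # reduced-set vectors z_p (unit norm)
gamma = lam[:nu]                       # gamma_p = lambda_p / ||z_p||^2 = lambda_p here

# rho^2 evaluated through kernels alone (equation 2.2)
Kxx = np.array([[k(a, b) for b in X] for a in X])
Kxz = np.array([[k(a, b) for b in Z] for a in X])
Kzz = np.array([[k(a, b) for b in Z] for a in Z])
rho2 = alpha @ Kxx @ alpha - 2 * alpha @ Kxz @ gamma + gamma @ Kzz @ gamma

# Theorem 1: the residual equals C^2 times the discarded squared eigenvalues
assert np.isclose(rho2, C**2 * np.sum(lam[nu:] ** 2))
```

Because the eigenvectors returned by `eigh` have unit norm, γ_p = λ_p/‖z_p‖² reduces to γ_p = λ_p in this sketch.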
4 The Inhomogeneous Case

Now assume that we are given the inhomogeneous kernel

K_n^{C,D}(x, x′) = (C + D x⊤x′)²   with x, x′ ∈ Rⁿ and C, D > 0.

Then we can write

K_n^{C,D}(x, x′) = (C + D x⊤x′)²
  = C² ( 1 + (D/C) x⊤x′ )²
  = C² ( (√(D/C) x, 1)⊤ (√(D/C) x′, 1) )²
  = k_{n+1}^C ( (√(D/C) x, 1), (√(D/C) x′, 1) ).

If we identify x ∈ Rⁿ with x̃ ∈ Rⁿ⁺¹ using the embedding

η: Rⁿ → Rⁿ⁺¹,   x ↦ (√(D/C) x, 1) =: x̃,

we can express the inhomogeneous kernel K_n^{C,D} as a homogeneous kernel k_{n+1}^C on Rⁿ⁺¹:

K_n^{C,D}(x, x′) = k_{n+1}^C(x̃, x̃′).   (4.1)
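Equation 4.1 is easy to verify numerically. The following minimal sketch (with hypothetical parameter values of my own choosing) checks the identity on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, D = 4, 2.0, 0.5
x, xp = rng.normal(size=n), rng.normal(size=n)

def eta(u):
    """Embedding R^n -> R^{n+1}: u -> (sqrt(D/C) u, 1)."""
    return np.append(np.sqrt(D / C) * u, 1.0)

K_inhom = (C + D * (x @ xp)) ** 2           # K_n^{C,D}(x, x')
k_hom = C**2 * (eta(x) @ eta(xp)) ** 2      # k_{n+1}^C(x~, x~')
assert np.isclose(K_inhom, k_hom)           # equation 4.1
```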
We will prove the following:

Theorem 2. Let C, D > 0, x₁, …, x_m ∈ Rⁿ and α₁, …, α_m ∈ R\{0}. Let λ₁, …, λ_{n+1} ∈ R be the eigenvalues of S̃ = Σ_{i=1}^m α_i x̃_i x̃_i⊤, ordered by their absolute values, |λ₁| ≥ ⋯ ≥ |λ_{n+1}|, and assume that there is a corresponding orthogonal eigenbasis v₁, …, v_{n+1} of S̃ whose last components are all equal to one. Then for given ν ∈ {1, …, n + 1}, the squared distance (see equations 2.1 and 2.2)

ρ² = Σ_{i,j=1}^m α_i α_j K_n^{C,D}(x_i, x_j) − 2 Σ_{i=1}^m Σ_{p=1}^ν α_i γ_p K_n^{C,D}(x_i, z_p) + Σ_{p,q=1}^ν γ_p γ_q K_n^{C,D}(z_p, z_q)
is minimized among all γ₁, …, γ_ν ∈ R and z₁, …, z_ν ∈ Rⁿ if we choose the rth component of z_p to be

z_{p,r} = √(C/D) v_{p,r},   (r = 1, …, n),

and set

γ_p := λ_p/‖v_p‖²,   (p = 1, …, ν).
Remarks.

• As a consequence, we get the best possible approximation of the discriminant function f using ν vectors:

f(x) = b + Σ_{i=1}^m α_i K_n^{C,D}(x_i, x) ≈ b + Σ_{p=1}^ν γ_p K_n^{C,D}(z_p, x),   (x ∈ X),
where b ∈ R is the shift value. We recover f exactly for ν = n + 1.

• The supposition on S̃ is necessary to find an orthogonal basis of eigenvectors of S̃ in the subset η(Rⁿ) ⊂ Rⁿ⁺¹. By rescaling, it is necessary only to have an orthogonal eigenbasis with nonvanishing last components. If we want to find ν < n + 1 reduced-set vectors, it may still be weakened: suppose that v_{i,n+1} ≠ 0 for i = 1, …, ν only.

• From the definition of γ_p, z_p for p = 1, …, ν, it is obvious that only those γ_p, z_p are relevant for the solution where the eigenvalue λ_p is different from 0. So if the kernel of S̃ has dimension d, we get an exact solution (ρ = 0) even for ν = n + 1 − d.

• It is possible to determine an optimal solution even if the supposition on S̃ is not valid: let v_p ∈ Rⁿ⁺¹ be an eigenvector of S̃ of norm 1 with corresponding eigenvalue λ_p ≠ 0 such that the last component of v_p is 0. We replace the last component by ε > 0, name the resulting vector v_p^ε, and take

z_p^ε := √(C/D) (1/ε) (v_{p,1}, …, v_{p,n})⊤

to get η(z_p^ε) = z̃_p^ε = (1/ε) v_p^ε. Correspondingly, we have

γ_p^ε := λ_p / ( (D/C) ‖z_p^ε‖² + 1 ) = ε² λ_p / (1 + ε²).
Then the corresponding term in the decision function f can be written as

γ_p^ε K_n^{C,D}(x, z_p^ε) = ( ε² λ_p/(1 + ε²) ) (C + D x⊤z_p^ε)²
  = ( λ_p/(1 + ε²) ) ( εC + √(CD) x⊤(v_{p,1}, …, v_{p,n})⊤ )²,

which leads to minimum ρ for ε → 0.

Proof. From theorem 1 with x̃₁, …, x̃_m ∈ Rⁿ⁺¹ instead of x₁, …, x_m, we get that for given ν ∈ {1, …, n + 1},

Σ_{i,j=1}^m α_i α_j k_{n+1}^C(x̃_i, x̃_j) − 2 Σ_{i=1}^m Σ_{p=1}^ν α_i γ_p k_{n+1}^C(x̃_i, v_p) + Σ_{p,q=1}^ν γ_p γ_q k_{n+1}^C(v_p, v_q)   (4.2)

equals C² Σ_{p=ν+1}^{n+1} λ_p² and is minimized if we choose v₁, …, v_ν ∈ Rⁿ⁺¹ to be orthogonal eigenvectors of S̃ corresponding to the eigenvalues λ₁, …, λ_ν, and set γ_p = λ_p/‖v_p‖² for p = 1, …, ν. From the hypotheses on S̃, we know that we can assume the last components of all v₁, …, v_ν ∈ Rⁿ⁺¹ to be equal to one.
Now define z₁, …, z_ν ∈ Rⁿ by setting their components z_{p,r} = √(C/D) v_{p,r} for r = 1, …, n. Then we have v_p = z̃_p for p = 1, …, ν. Therefore, from equation 4.2, we get

C² Σ_{p=ν+1}^{n+1} λ_p² = Σ_{i,j=1}^m α_i α_j k_{n+1}^C(x̃_i, x̃_j) − 2 Σ_{i=1}^m Σ_{p=1}^ν α_i γ_p k_{n+1}^C(x̃_i, z̃_p) + Σ_{p,q=1}^ν γ_p γ_q k_{n+1}^C(z̃_p, z̃_q).

This is the minimum of ρ² among all γ_p ∈ R and z̃_p ∈ η(Rⁿ) ⊂ Rⁿ⁺¹ for p = 1, …, ν. Using equation 4.1, we see that this is equal to

Σ_{i,j=1}^m α_i α_j K_n^{C,D}(x_i, x_j) − 2 Σ_{i=1}^m Σ_{p=1}^ν α_i γ_p K_n^{C,D}(x_i, z_p) + Σ_{p,q=1}^ν γ_p γ_q K_n^{C,D}(z_p, z_q).

This is exactly the squared distance ρ² we want to minimize.
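Putting theorem 2 together, the whole reduced-set construction for the inhomogeneous kernel can be sketched numerically. The sketch below is my own illustration on random data (not the authors' code); it assumes the generic case in which every eigenvector of S̃ has a nonvanishing last component, so that rescaling to last component one is possible, and checks the exact-recovery claim for ν = n + 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, C, D = 3, 15, 1.5, 0.7
X = rng.normal(size=(m, n))                 # support vectors x_i as rows
alpha = rng.normal(size=m)                  # expansion coefficients alpha_i

def K(u, v):
    """Inhomogeneous quadratic kernel K_n^{C,D}."""
    return (C + D * (u @ v)) ** 2

# Embed the support vectors: x~_i = (sqrt(D/C) x_i, 1)
Xt = np.hstack([np.sqrt(D / C) * X, np.ones((m, 1))])
St = (Xt.T * alpha) @ Xt                    # S~ = sum_i alpha_i x~_i x~_i^T
lam, V = np.linalg.eigh(St)
order = np.argsort(-np.abs(lam))
lam, V = lam[order], V[:, order]

V = V / V[-1, :]                            # rescale: last component of each v_p = 1
Z = np.sqrt(C / D) * V[:-1, :].T            # z_{p,r} = sqrt(C/D) v_{p,r}
gamma = lam / np.sum(V**2, axis=0)          # gamma_p = lambda_p / ||v_p||^2

# With nu = n + 1 vectors, the discriminant function is recovered exactly
x = rng.normal(size=n)
f_full = alpha @ np.array([K(xi, x) for xi in X])
f_red = gamma @ np.array([K(zp, x) for zp in Z])
assert np.isclose(f_full, f_red, rtol=1e-6)
```

Taking only the first ν rows of `Z` and entries of `gamma` gives the best ν-vector approximation of the theorem.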
References

Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Burges, C. (1996). Simplified support vector decision rules. In L. Saitta (Ed.), 13th International Conference on Machine Learning (pp. 71–77). San Mateo, CA: Morgan Kaufmann.
Osuna, E., & Girosi, F. (1998). Reducing the run-time complexity of support vector machines. In International Conference on Pattern Recognition. Brisbane, Australia: IEEE.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Received September 5, 2003; accepted February 2, 2004.
LETTER
Communicated by A. Yuille
Stochastic Reasoning, Free Energy, and Information Geometry Shiro Ikeda
[email protected] Institute of Statistical Mathematics, Tokyo 106-8569, Japan, and Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
Toshiyuki Tanaka
[email protected] Department of Electronics and Information Engineering, Tokyo Metropolitan University, Tokyo 192-0397, Japan
Shun-ichi Amari
[email protected] RIKEN Brain Science Institute, Saitama 351-0198, Japan
Belief propagation (BP) is a universal method of stochastic reasoning. It gives exact inference for stochastic models with tree interactions and works surprisingly well even if the models have loopy interactions. Its performance has been analyzed separately in many fields, such as AI, statistical physics, information theory, and information geometry. This article gives a unified framework for understanding BP and related methods and summarizes the results obtained in many fields. In particular, BP and its variants, including tree reparameterization and the concave-convex procedure, are reformulated in information-geometrical terms, and their relations to the free energy function are elucidated from an information-geometrical viewpoint. We then propose a family of new algorithms. The stabilities of the algorithms are analyzed, and methods to accelerate them are investigated.

1 Introduction

Stochastic reasoning is a technique used in wide areas of AI, statistical physics, information theory, and others to estimate the values of random variables based on partial observation of them (Pearl, 1988). Here, a large number of mutually interacting random variables are represented in the form of a joint probability. However, the interactions often have specific structures such that some variables are independent of others when a set of variables is fixed. In other words, they are conditionally independent, and their interactions take place only through these conditioning variables. When such a structure is represented by a graph, it is called a graphical model

© 2004 Massachusetts Institute of Technology Neural Computation 16, 1779–1810 (2004)
(Lauritzen & Spiegelhalter, 1988; Jordan, 1999). The problem is to infer the values of unobserved variables based on observed ones by reducing the conditional joint probability distribution to the marginal probability distributions. When the random variables are binary, their marginal probabilities are determined by the conditional expectation, and the problem is to calculate them. However, when the number of binary random variables is large, the calculation is computationally intractable from the definition. Apart from sampling methods, one way to overcome this problem is to use belief propagation (BP), proposed in AI (Pearl, 1988). It is known that BP gives exact inference when the underlying causal graphical structure does not include any loop, but it is also applied to loopy graphical models (loopy BP) and gives amazingly good approximate inference. The idea of loopy BP has been successfully applied to the decoding algorithms of turbo codes and low-density parity-check (LDPC) codes as well as to spin-glass models and Boltzmann machines. It should also be noted that some variants have been proposed to improve the convergence property of loopy BP. Tree reparameterization (TRP) (Wainwright, Jaakkola, & Willsky, 2002) is one of them, and the convex concave computational procedure (CCCP) (Yuille, 2002; Yuille & Rangarajan, 2003) is another algorithm reported to have better convergence properties. The reason that loopy BP works so well is not fully understood, and there are a number of theoretical approaches that attempt to analyze its performance. The statistical physical framework uses the Bethe free energy (Yedidia, Freeman, & Weiss, 2001a) or something similar (Kabashima & Saad, 1999, 2001), and a geometrical theory was initiated by Richardson (2000) to understand turbo decoding.
Information geometry (Amari & Nagaoka, 2000), which has been successfully used in the study of the mean field approximation (Tanaka, 2000, 2001; Amari, Ikeda, & Shimokawa, 2001), gives a framework to elucidate the mathematical structure of BP (Ikeda, Tanaka, & Amari, 2002, in press). A similar framework is also given to describe TRP (Wainwright et al., 2002). The problem is interdisciplinary, with various concepts and frameworks originating from AI, statistics, statistical physics, information theory, and information geometry. In this letter, we focus on undirected graphs, which are a general representation of graphical models, and give a unified framework to understand BP, CCCP, their variants, and the role of the free energy, based on information geometry. To this end, we propose a new function of the free energy type to which the Bethe free energy (Yedidia et al., 2001a) and that of Kabashima and Saad (2001) are closely related. By constraining the search space in proper ways, we obtain a family of algorithms including BP, CCCP, and a variant of CCCP without double loops. We also give their stability analysis. The error analysis was given in Ikeda et al. (in press).

This letter is organized as follows. In section 2, the problem is stated compactly, followed by preliminaries of information geometry. Section 3 introduces an information-geometrical view of BP, the characteristics of its equilibrium, and related algorithms, TRP and CCCP. We discuss the free energy related to BP in section 4, and new algorithms are proposed with stability analysis in section 5. Section 6 gives some extensions of BP from an information-geometrical viewpoint, and finally section 7 concludes the letter.

2 Problem and Geometrical Framework

2.1 Basic Problem and Strategy. Let x = (x₁, …, x_n)⊤ be hidden and y = (y₁, …, y_m)⊤ be observed random variables. We start with the case where each x_i is binary, that is, x_i ∈ {−1, +1}, for simplicity. An extension to a wider class of distributions will be given in section 6.1. The conditional distribution of x given y is written as q(x|y), and our task is to give a good inference of x from the observations. We hereafter simply write q(x) for q(x|y) and omit y. One natural inference of x is the maximum a posteriori (MAP) estimate, that is,

x̂_map = argmax_x q(x).

This minimizes the probability that x̂_map does not coincide with the true x. However, this calculation is not tractable when n is large because the number of candidates for x increases exponentially with respect to n. The maximization of the posterior marginals (MPM) is another inference that minimizes the number of component errors. If each marginal distribution q(x_i), i = 1, …, n, is known, the MPM inference decides x̂_i = +1 when q(x_i = +1) ≥ q(x_i = −1) and x̂_i = −1 otherwise. Let η_i be the expectation of x_i with respect to q(x), that is,

η_i = E_q[x_i] = Σ_{x_i} x_i q(x_i).
The MPM inference gives xˆ i = sgn ηi , which is directly calculated if we know the marginal distributions q(xi ), or the expectation:
η = E_q[x].

This article focuses on the method to obtain a good approximation to η, which is equivalent to the inference of ∏_{i=1}^n q(x_i). For any q(x), ln q(x) can be expanded as a polynomial of x up to degree n, because every x_i is binary. However, in many problems, mutual interactions of random variables exist only in specific manners. We represent ln q(x) in the form

ln q(x) = h · x + Σ_{r=1}^L c_r(x) − ψ_q,
Figure 1: Boltzmann machine.
where h · x = Σ_i h_i x_i is the linear term, c_r(x), r = 1, …, L, is a simple polynomial representing the rth clique among related variables, and ψ_q is the logarithm of the normalizing factor or the partition function, which is called the (Helmholtz) free energy,

ψ_q = ln Σ_x exp [ h · x + Σ_r c_r(x) ].   (2.1)

In the case of Boltzmann machines (see Figure 1) and conventional spin-glass models, c_r(x) is a quadratic function of x_i, that is, c_r(x) = w_ij x_i x_j, where r is the index of the edge that corresponds to the mutual coupling between x_i and x_j. It is more common to define the true distribution q(x) of an undirected graph as a product of clique functions:

q(x) = (1/Z_q) ∏_{i=1}^n φ_i(x_i) ∏_{r∈C} φ_r(x_r),

where C is the set of cliques. In our notation, φ_i(x_i) and φ_r(x_r) are denoted as follows:

h_i = (1/2) ln [ φ_i(x_i = +1)/φ_i(x_i = −1) ],   c_r(x) = ln φ_r(x_r),   ψ_q = ln Z_q.
When there are only pairwise interactions, φ_r(x_r) has the form φ_r(x_i, x_j).

2.2 Important Family of Distributions. Let us consider the set of probability distributions

p(x; θ, v) = exp[ θ · x + v · c(x) − ψ(θ, v) ],   (2.2)

parameterized by θ and v, where v = (v₁, …, v_L)⊤, c(x) = (c₁(x), …, c_L(x))⊤, and v · c(x) = Σ_{r=1}^L v_r c_r(x). We name the family of the probability distributions S, which is an exponential family,

S = { p(x; θ, v) | θ ∈ Rⁿ, v ∈ Rᴸ },   (2.3)

where its canonical coordinate system is (θ, v). The joint distribution q(x) is included in S, which is easily proved by setting θ = h and v = 1_L = (1, …, 1)⊤: q(x) = p(x; h, 1_L). We define M₀ as a submanifold of S specified by v = 0,

M₀ = { p₀(x; θ) = exp[ h · x + θ · x − ψ₀(θ) ] | θ ∈ Rⁿ }.

Every distribution of M₀ is an independent distribution, which includes no mutual interaction between x_i and x_j (i ≠ j), and the canonical coordinate system of M₀ is θ. The product of marginal distributions of q(x), that is, ∏_{i=1}^n q(x_i), is included in M₀. The ultimate goal is to derive ∏_{i=1}^n q(x_i) or the corresponding coordinate θ of M₀.

2.3 Preliminaries of Information Geometry. In this section, we give preliminaries of information geometry (Amari & Nagaoka, 2000; Amari, 2001). First, we define e–flat and m–flat submanifolds of S:

e–flat submanifold: Submanifold M ⊂ S is said to be e–flat when, for all t ∈ [0, 1] and q(x), p(x) ∈ M, the following r(x; t) belongs to M:

ln r(x; t) = (1 − t) ln q(x) + t ln p(x) + c(t),

where c(t) is the normalization factor. Obviously, {r(x; t) | t ∈ [0, 1]} is an exponential family connecting the two distributions p(x) and q(x). When an e–flat submanifold is a one-dimensional curve, it is called an e–geodesic. In terms of the e–affine coordinates θ, a submanifold M is e–flat when it is linear in θ.

m–flat submanifold: Submanifold M ⊂ S is said to be m–flat when, for all t ∈ [0, 1] and q(x), p(x) ∈ M, the following mixture r(x; t) belongs to M:

r(x; t) = (1 − t)q(x) + tp(x).
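The two notions of flatness can be checked by brute force on a toy family: log-linear mixing of two independent distributions over {−1, +1}² stays in M₀ (so M₀ is e–flat), while the ordinary mixture generally leaves it (M₀ is not m–flat). A small sketch of my own, with hypothetical helper names:

```python
import itertools

import numpy as np

n = 2
states = np.array(list(itertools.product([-1, 1], repeat=n)))

def indep(theta):
    """Independent distribution p_0(x; theta) over {-1,+1}^2 (h = 0 here)."""
    w = np.exp(states @ theta)
    return w / w.sum()

def is_independent(p):
    """Is p the product of its own marginals?"""
    m = p @ states                       # component means E_p[x_i]
    prod = np.prod((1 + states * m) / 2, axis=1)
    return np.allclose(p, prod)

p, q = indep(np.array([0.5, -0.3])), indep(np.array([-0.8, 0.9]))
t = 0.3

# e-geodesic: log-linear interpolation stays inside M0
r_e = np.exp((1 - t) * np.log(q) + t * np.log(p))
r_e /= r_e.sum()
assert is_independent(r_e)

# m-geodesic: the mixture generally leaves M0
r_m = (1 - t) * q + t * p
assert not is_independent(r_m)
```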
When an m–flat submanifold is a one-dimensional curve, it is called an m–geodesic. Hence, the above mixture family is the m–geodesic connecting them. From the definition, any exponential family is an e–flat manifold. Therefore, S and M₀ are e–flat. Next, we define the m–projection (Amari & Nagaoka, 2000).

Definition 1. Let M be an e–flat submanifold in S, and let q(x) ∈ S. The point in M that minimizes the Kullback-Leibler (KL) divergence from q(x) to M, denoted by

π_M ◦ q(x) = argmin_{p(x)∈M} D[q(x); p(x)],   (2.4)

is called the m–projection of q(x) to M. Here, D[·; ·] is the KL divergence defined as

D[q(x); p(x)] = Σ_x q(x) ln [ q(x)/p(x) ].

The KL divergence satisfies D[q(x); p(x)] ≥ 0, and D[q(x); p(x)] = 0 when and only when q(x) = p(x) holds for every x. Although symmetry D[q; p] = D[p; q] does not hold in general, it is regarded as an asymmetric squared distance. Finally, the m–projection theorem follows.

Theorem 1. Let M be an e–flat submanifold in S, and let q(x) ∈ S. The m–projection of q(x) to M is unique and given by a point in M such that the m–geodesic connecting q(x) and π_M ◦ q is orthogonal to M at this point in the sense of the Riemannian metric due to the Fisher information matrix.

Proof. A detailed proof is found in Amari and Nagaoka (2000), and the following is a sketch of it. First, we define the inner product and prove the orthogonality. A rigorous definition concerning the tangent space of a manifold is found in Amari and Nagaoka (2000). Let us consider a curve p(x; α) ∈ S, which is parameterized by a real-valued parameter α. Its tangent vector is represented by a random vector ∂_α ln p(x; α), where ∂_α = ∂/∂α. For two curves p₁(x; α) and p₂(x; β) that intersect at α = β = 0, p(x) = p₁(x; 0) = p₂(x; 0), we define the inner product of the two tangent vectors by

E_{p(x)}[ ∂_α ln p₁(x; α) ∂_β ln p₂(x; β) ]|_{α=β=0}.

Note that this definition is consistent with the Riemannian metric defined by the Fisher information matrix.
Let p*(x) be an m–projection of q(x) to M, and let the m–geodesic connecting q(x) and p*(x) be r_m(x; α), which is defined as

r_m(x; α) = α q(x) + (1 − α) p*(x),   α ∈ [0, 1].

The derivative of ln r_m(x; α) along the m–geodesic at p*(x) is

∂_α ln r_m(x; α)|_{α=0} = [ q(x) − p*(x) ] / r_m(x; α) |_{α=0} = [ q(x) − p*(x) ] / p*(x).

Let an e–geodesic included in M be r_e(x; β), which is defined as

ln r_e(x; β) = β ln p′(x) + (1 − β) ln p*(x) + c(β),   p′(x) ∈ M,   β ∈ [0, 1].

The derivative of ln r_e(x; β) along the e–geodesic at p*(x) is

∂_β ln r_e(x; β)|_{β=0} = ln p′(x) − ln p*(x) + c′(0).

The inner product becomes

E_{p*(x)}[ ∂_α ln r_m(x; α) ∂_β ln r_e(x; β) ]|_{α=β=0} = Σ_x [ q(x) − p*(x) ][ ln p′(x) − ln p*(x) ].   (2.5)

The fact that p*(x) is an m–projection from q(x) to M gives ∂_β D[q; r_e(β)]|_{β=0} = 0, that is,

∂_β D[q(x); r_e(x; β)]|_{β=0} = Σ_x q(x)[ ln p*(x) − ln p′(x) ] − c′(0) = 0.   (2.6)

Moreover, since D[p*; r_e(β)] is minimized to 0 at β = 0, we have

∂_β D[p*(x); r_e(x; β)]|_{β=0} = Σ_x p*(x)[ ln p*(x) − ln p′(x) ] − c′(0) = 0.   (2.7)

Equation 2.5 is proved to be zero by combining equations 2.6 and 2.7. Furthermore, it immediately proves the Pythagorean theorem:

D[q(x); p′(x)] = D[q(x); p*(x)] + D[p*(x); p′(x)].

This holds for every p′(x) ∈ M. Suppose the m–projection is not unique, and let another point be p**(x) ∈ M, which satisfies D[q; p**] = D[q; p*]. Then the following equation holds:

D[q(x); p**(x)] = D[q(x); p*(x)] + D[p*(x); p**(x)] = D[q(x); p*(x)].
This is true only if p*(x) = p**(x), which proves the uniqueness of the m–projection.

2.4 MPM Inference. We show that the MPM inference is immediately given if the m–projection from q(x) to M₀ is given. From the definition in equation 2.4, the m–projection of q(x) to M₀ is characterized by θ*, which satisfies p₀(x; θ*) = π_{M₀} ◦ q(x). Hereafter, we denote the m–projection to M₀ in terms of the parameter θ as

θ* = π_{M₀} ◦ q(x) = argmin_θ D[q(x); p₀(x; θ)].

By taking the derivative of D[q(x); p₀(x; θ)] with respect to θ, we have

Σ_x x q(x) − ∂_θ ψ₀(θ*) = 0,   (2.8)

where ∂_θ denotes the derivative with respect to θ. From the definition of the exponential family,

∂_θ ψ₀(θ) = ∂_θ ln Σ_x exp(h · x + θ · x) = Σ_x x p₀(x; θ).   (2.9)

We define the new parameter η₀(θ) in M₀ as

η₀(θ) = Σ_x x p₀(x; θ) = ∂_θ ψ₀(θ).   (2.10)

This is called the expectation parameter (Amari & Nagaoka, 2000). From equations 2.8, 2.9, and 2.10, the m–projection is equivalent to marginalizing q(x). Since translation between θ and η₀ is straightforward for M₀, once the m–projection or, equivalently, the product of marginals of q(x) is obtained, the MPM inference is given immediately.

3 BP and Variants: Information-Geometrical View

3.1 BP

3.1.1 Information-Geometrical View of BP. In this section, we give the information-geometrical view of BP. The well-known definition of BP is found elsewhere (Pearl, 1988; Lauritzen & Spiegelhalter, 1988; Weiss, 2000); details are not given here. We note that our derivation is based on BP for undirected graphs. For loopy graphs, it is well known that BP does not necessarily converge, and even if it does, the result is not equal to the true marginals.
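For binary ±1 variables, equations 2.8 to 2.10 give the m–projection in closed form: matching expectation parameters yields θ* = arctanh(E_q[x]) − h componentwise. The sketch below is my own toy example (not from the letter); it checks this minimizer and the Pythagorean relation D[q; p₀(θ)] = D[q; p₀(θ*)] + D[p₀(θ*); p₀(θ)] by brute force:

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n = 3
h = rng.normal(size=n)
states = np.array(list(itertools.product([-1, 1], repeat=n)))

q = rng.random(2**n)                  # an arbitrary positive q(x)
q /= q.sum()

def p0(theta):
    """Independent model p_0(x; theta) = exp[h.x + theta.x - psi_0]."""
    w = np.exp(states @ (h + theta))
    return w / w.sum()

def KL(a, b):
    return float(np.sum(a * np.log(a / b)))

eta = q @ states                      # E_q[x]
theta_star = np.arctanh(eta) - h      # solves eta_0(theta*) = E_q[x] (eqs. 2.8-2.10)

# theta* minimizes D[q; p_0(theta)], and the Pythagorean relation holds exactly
for _ in range(50):
    theta = theta_star + 0.2 * rng.normal(size=n)
    lhs = KL(q, p0(theta))
    rhs = KL(q, p0(theta_star)) + KL(p0(theta_star), p0(theta))
    assert np.isclose(lhs, rhs)
    assert lhs >= KL(q, p0(theta_star)) - 1e-12
```

The exact Pythagorean equality in the loop is a direct instance of the theorem proved above, since M₀ is e–flat.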
Figure 2: (A) Belief graph. (B) Graph with a single edge. (C) Graph with all edges.
Figure 2 shows three important graphs for BP. The belief graph in Figure 2A corresponds to p₀(x; θ), and that in Figure 2C corresponds to the true distribution q(x). Figure 2B shows an important distribution that includes only a single edge. This distribution is defined as p_r(x; ζ_r), where

p_r(x; ζ_r) = exp[ h · x + c_r(x) + ζ_r · x − ψ_r(ζ_r) ],   r = 1, …, L.

This can be generalized without any change to the case when c_r(x) is a polynomial. The set of the distributions p_r(x; ζ_r) parameterized by ζ_r is an e–flat manifold defined as

M_r = { p_r(x; ζ_r) | ζ_r ∈ Rⁿ },   r = 1, …, L.

Its canonical coordinate system is ζ_r. We also define the expectation parameter η_r(ζ_r) of M_r as follows:

η_r(ζ_r) = ∂_{ζ_r} ψ_r(ζ_r) = Σ_x x p_r(x; ζ_r),   r = 1, …, L.   (3.1)

In M_r, only the rth edge is taken into account; all the other edges are replaced by a linear term ζ_r · x, and p₀(x; θ) ∈ M₀ is used to integrate all the information from p_r(x; ζ_r), r = 1, …, L, giving θ, which is the parameter of p₀(x; θ), to infer ∏_i q(x_i). In the iterative process of BP, the ζ_r of p_r(x; ζ_r), r = 1, …, L, are modified by using the information of θ, which is renewed by integrating the local information {ζ_r}. Information geometry has elucidated its geometrical meaning for special graphs for error-correcting codes (Ikeda et al., in press; see also Richardson, 2000), and we give the framework for general graphs in the following.

BP is stated as follows: Let p_r(x; ζ_r^t) be the approximation to q(x) at time t, which each M_r, r = 1, …, L, specifies.

Information-Geometrical View of BP

1. Set t = 0, ξ_r^t = 0, ζ_r^t = 0, r = 1, …, L.
2. Increment t by one and set ξ_r^{t+1}, r = 1, \dots, L, as follows:

\[ \xi_r^{t+1} = \pi_{M_0}\circ p_r(x;\zeta_r^t) - \zeta_r^t. \quad (3.2) \]

3. Update θ^{t+1} and ζ_r^{t+1} as follows:

\[ \zeta_r^{t+1} = \sum_{r'\neq r}\xi_{r'}^{t+1}, \qquad \theta^{t+1} = \sum_r \xi_r^{t+1} = \frac{1}{L-1}\sum_r \zeta_r^{t+1}. \]

4. Repeat steps 2 and 3 until convergence.

The algorithm is summarized as follows. Calculate iteratively:

\[ \theta^{t+1} = \sum_r \big[\pi_{M_0}\circ p_r(x;\zeta_r^t) - \zeta_r^t\big], \qquad \zeta_r^{t+1} = \theta^{t+1} - \big[\pi_{M_0}\circ p_r(x;\zeta_r^t) - \zeta_r^t\big], \qquad r = 1, \dots, L.
\]
We have introduced two sets of parameters {ξ_r} and {ζ_r}. Let the converged point of BP be {ξ_r^*}, {ζ_r^*}, and θ^*, where θ^* = Σ_r ξ_r^* = Σ_r ζ_r^*/(L−1) and θ^* = ξ_r^* + ζ_r^*. With these relations, the true distribution q(x) and its final approximations p_0(x; θ^*) ∈ M_0 and p_r(x; ζ_r^*) ∈ M_r are written as

\[ \begin{aligned}
q(x) &= \exp[h\cdot x + c_1(x) + \cdots + c_r(x) + \cdots + c_L(x) - \psi_q] \\
p_0(x;\theta^*) &= \exp[h\cdot x + \xi_1^*\cdot x + \cdots + \xi_r^*\cdot x + \cdots + \xi_L^*\cdot x - \psi_0(\theta^*)] \\
p_r(x;\zeta_r^*) &= \exp[h\cdot x + \xi_1^*\cdot x + \cdots + c_r(x) + \cdots + \xi_L^*\cdot x - \psi_r(\zeta_r^*)].
\end{aligned} \]
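The iteration above can be sketched in a few lines of Python on a toy binary model. This is a minimal illustration of ours, not part of the original formulation: the graph, couplings, and variable names are hypothetical, and the m–projection π_{M0} is computed by brute-force enumeration (for x ∈ {−1,+1}^n, projecting onto the independent model reduces to matching means, θ_i = arctanh(η_i) − h_i).

```python
import numpy as np
from itertools import product

# Toy model: x in {-1,+1}^3, pairwise factors c_r(x) = J_r x_i x_j on a loopy graph.
n = 3
h = np.array([0.3, -0.2, 0.1])
edges = [(0, 1, 0.5), (1, 2, -0.4), (0, 2, 0.3)]     # (i, j, J_r): a 3-cycle
L = len(edges)
states = np.array(list(product([-1, 1], repeat=n)))  # all 2^n configurations

def mean_of(extra):
    """Expectation parameter (mean of x) of p(x) proportional to exp(h.x + extra(x))."""
    logp = states @ h + extra
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return states.T @ p

def m_project(extra):
    """pi_{M0}: theta of the factorized model matching the means, tanh(h+theta) = eta."""
    return np.arctanh(mean_of(extra)) - h

# BP in the information-geometrical form (steps 2 and 3 above)
theta = np.zeros(n)
zeta = np.zeros((L, n))
for t in range(200):
    xi = np.array([m_project(J * states[:, i] * states[:, j] + states @ z) - z
                   for (i, j, J), z in zip(edges, zeta)])
    theta = xi.sum(axis=0)          # theta = sum_r xi_r
    zeta = theta - xi               # zeta_r = sum_{r' != r} xi_r'

eta_bp = np.tanh(h + theta)         # BP's approximate marginal means
eta_true = mean_of(sum(J * states[:, i] * states[:, j] for i, j, J in edges))
print(eta_bp, eta_true)             # close, but not equal, on a loopy graph
```

On a tree the two vectors would coincide; on this 3-cycle they differ slightly, a numerical instance of the discrepancy discussed below.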
The idea of BP is to approximate c_r(x) by ξ_r^*·x, taking the information from M_{r'}, r' ≠ r, into account. The independent distribution p_0(x; θ) integrates all the information.

3.1.2 Common BP Formulation and the Information-Geometrical Formulation. BP is generally described as a set of message-updating rules. Here we describe the correspondence between the common and the information-geometrical formulations. In graphs with pairwise interactions, messages and beliefs are updated as

\[ m_{ij}^{t+1}(x_j) = \frac{1}{Z}\sum_{x_i}\phi_i(x_i)\,\phi_{ij}(x_i,x_j)\prod_{k\in N(i)\setminus j} m_{ki}^t(x_i), \qquad b_i(x_i) = \frac{1}{Z}\,\phi_i(x_i)\prod_{k\in N(i)} m_{ki}^{t+1}(x_i), \]

where Z is the normalization factor and N(i) is the set of vertices connected to vertex i. The vector ξ_r corresponds to m_{ij}(x_j). More precisely, when r is an edge connecting i and j,

\[ \xi_{r,j} = \frac{1}{2}\ln\frac{m_{ij}(x_j=+1)}{m_{ij}(x_j=-1)}, \qquad \xi_{r,i} = \frac{1}{2}\ln\frac{m_{ji}(x_i=+1)}{m_{ji}(x_i=-1)}, \qquad \xi_{r,k} = 0 \ \ \text{for } k \neq i, j, \]
where ξ_{r,i} denotes the ith component of ξ_r. Note that the ith component of ξ_r is not generally 0 if the rth edge includes vertex i, and that equation 3.2 updates m_{ij}(x_j) and m_{ji}(x_i) simultaneously. Now it is not difficult to understand the following correspondences:

\[ \theta_i = \sum_{r'} \xi_{r',i} = \frac{1}{2}\sum_{k\in N(i)}\ln\frac{m_{ki}(x_i=+1)}{m_{ki}(x_i=-1)}, \qquad \zeta_{r,i} = \theta_i - \xi_{r,i} = \frac{1}{2}\sum_{k\in N(i)\setminus j}\ln\frac{m_{ki}(x_i=+1)}{m_{ki}(x_i=-1)}, \]

where θ_i and ζ_{r,i} are the ith components of θ and ζ_r, respectively, and r corresponds to the edge connecting i and j. Note that ζ_{r,k} = θ_k holds for k ≠ i, j.

3.1.3 Equilibrium of BP. The following theorem, proved in Ikeda et al. (in press), characterizes the equilibrium of BP.

Theorem 2. The equilibrium (θ^*, {ζ_r^*}) satisfies

1. m–condition: θ^* = π_{M_0}∘p_r(x; ζ_r^*), r = 1, \dots, L.

2. e–condition: \( \theta^* = \frac{1}{L-1}\sum_{r=1}^L \zeta_r^*. \)
It is easy to check from equation 3.2 and θ^* = ζ_r^* + ξ_r^* that the m–condition is satisfied at the equilibrium of the BP algorithm. In order to check the e–condition, we note that ξ_r^* corresponds to a message. If the same set of messages is used to calculate the belief of each vertex, the e–condition is automatically satisfied; therefore, at each iteration of BP, the e–condition is satisfied. In some algorithms, multiple sets of messages are defined, and a different set is used to calculate each belief. In such cases, the e–condition plays an important role. In order to have an information-geometrical view, we define two submanifolds M^* and E^* of S (see equation 2.3) as follows:

\[ M^* = \Big\{ p(x) \,\Big|\, p(x) \in S,\ \sum_x x\,p(x) = \sum_x x\,p_0(x;\theta^*) = \eta_0(\theta^*) \Big\} \]
\[ E^* = \Big\{ p(x) = C\,p_0(x;\theta^*)^{t_0} \prod_{r=1}^L p_r(x;\zeta_r^*)^{t_r} \,\Big|\, \sum_{r=0}^L t_r = 1,\ t_r \in \mathbb{R} \Big\}, \quad (3.3) \]

where C is a normalization factor. Note that M^* and E^* are an m–flat and an e–flat submanifold, respectively.
The geometrical implications of these conditions are as follows:

m–condition: The m–flat submanifold M^*, which includes p_r(x; ζ_r^*), r = 1, \dots, L, and p_0(x; θ^*), is orthogonal to M_r, r = 1, \dots, L, and M_0; that is, they are the m–projections of each other.

e–condition: The e–flat submanifold E^* includes p_0(x; θ^*), p_r(x; ζ_r^*), r = 1, \dots, L, and q(x).

The equivalence between the e–condition of theorem 2 and the geometrical one stated above is proved straightforwardly by setting t_0 = −(L−1) and t_1 = \cdots = t_L = 1 in equation 3.3. From the m–condition, \(\sum_x x\,p_0(x;\theta^*) = \sum_x x\,p_r(x;\zeta_r^*)\) holds, and from the definitions in equations 2.10 and 3.1, we have

\[ \eta_0(\theta^*) = \eta_r(\zeta_r^*), \qquad r = 1, \dots, L. \quad (3.4) \]

It is not difficult to show that equation 3.4 is the necessary and sufficient condition for the m–condition, and it implies not only that the m–projection of p_r(x; ζ_r^*) to M_0 is p_0(x; θ^*) but also that the m–projection of p_0(x; θ^*) to M_r is p_r(x; ζ_r^*), that is,

\[ \zeta_r^* = \pi_{M_r}\circ p_0(x;\theta^*), \qquad r = 1, \dots, L, \]
where π_{M_r} denotes the m–projection to M_r. When BP converges, the e–condition and the m–condition are satisfied, but this does not necessarily imply q(x) ∈ M^*, in other words, p_0(x; θ^*) = \(\prod_{i=1}^n q(x_i)\), because there is a discrepancy between M^* and E^*. This is shown schematically in Figure 3. It is well known that in graphs with tree structures, BP gives the true marginals, that is, q(x) ∈ M^* holds. In this case, we have the following relation:

\[ q(x) = \frac{\prod_{r=1}^L p_r(x;\zeta_r^*)}{p_0(x;\theta^*)^{L-1}}. \quad (3.5) \]

This relationship gives the following proposition.

Proposition 1. When q(x) is represented with a tree graph, q(x), p_0(x; θ^*), and p_r(x; ζ_r^*), r = 1, \dots, L, are included in M^* and E^* simultaneously.

This proposition shows that when a graph is a tree, q(x) and p_0(x; θ^*) are included in M^*, and the fixed point of BP is the correct solution. In the case of a loopy graph, q(x) ∉ M^*, and the correct solution is not generally a fixed point of BP. However, we still hope that BP gives a good approximation to the correct marginals. The difference between the correct marginals and the BP solution is regarded as the discrepancy between E^* and M^*, and if we can
Figure 3: Structure of equilibrium.
quantitatively evaluate it, the error of the BP solution can be estimated. We have given a preliminary analysis in Ikeda et al. (2002, in press), which showed that the principal term of the error is directly related to the e–curvature (see Amari & Nagaoka, 2000, for the definition) of M^*, which mainly reflects the influence of the possible shortest loops in the graph.

3.2 TRP. Some variants of BP have been proposed, and information geometry gives a general framework for understanding them. We begin with TRP (Wainwright et al., 2002). TRP selects a set of trees {T_i}, where each tree T_i consists of a set of edges, and renews the related parameters in the process of inference. Let the set of all edges be L, and let T_i ⊂ L, i = 1, \dots, K, be its subsets, where each graph with the edges T_i does not have any loop. The choice of the sets {T_i} is arbitrary, but every edge must be included in at least one of the trees. In order to give the information-geometrical view, we use the parameters ζ_r, θ_r, r = 1, \dots, L, and θ. The information-geometrical view of TRP is given as follows:

Information-Geometrical View of TRP

1. Set t = 0, ζ_r^t = θ_r^t = 0, r = 1, \dots, L, and θ^t = 0.
2. For a tree T_i, construct a tree distribution p_{T_i}^t(x) as follows:

\[ p_{T_i}^t(x) = C\,p_0(x;\theta^t)\prod_{r\in T_i}\frac{p_r(x;\zeta_r^t)}{p_0(x;\theta_r^t)} = C\exp\Big[h\cdot x + \sum_{r\in T_i} c_r(x) + \Big(\sum_{r\in T_i}(\zeta_r^t - \theta_r^t) + \theta^t\Big)\cdot x\Big]. \quad (3.6) \]

By applying BP, calculate the marginal distribution of p_{T_i}^t(x), and let θ^{t+1} = π_{M_0}∘p_{T_i}^t(x). Then update θ_r^{t+1} and ζ_r^{t+1} as follows:

For r ∈ T_i,
\[ \theta_r^{t+1} = \theta^{t+1}, \qquad \zeta_r^{t+1} = \pi_{M_r}\circ p_{T_i}^t(x). \]

For r ∉ T_i,
\[ \theta_r^{t+1} = \theta_r^t, \qquad \zeta_r^{t+1} = \zeta_r^t. \]

3. Repeat step 2 for the trees T_j ∈ {T_i}.

4. Repeat steps 2 and 3 until θ_r^{t+1} = θ^{t+1} holds for every r and {ζ_r^{t+1}} converges.

Let us show that the e– and the m–conditions are satisfied at the equilibrium of TRP. Since p_{T_i}^t(x) is a tree distribution, BP in step 2 gives the exact inference of its marginal distributions. Moreover, from equations 3.5 and 3.6, we have

\[ p_{T_i}^t(x) = C\,p_0(x;\theta^t)\prod_{r\in T_i}\frac{p_r(x;\zeta_r^t)}{p_0(x;\theta_r^t)} = \frac{\prod_{r\in T_i} p_r(x;\zeta_r^{t+1})}{p_0(x;\theta^{t+1})^{|T_i|-1}}, \]

where |T_i| is the cardinality of T_i. By comparing the second and third terms and using θ_r^{t+1} = θ^{t+1}, r ∈ T_i,

\[ \sum_{r\in T_i}(\zeta_r^t - \theta_r^t) + \theta^t = \sum_{r\in T_i}\zeta_r^{t+1} - (|T_i|-1)\theta^{t+1} = \sum_{r\in T_i}(\zeta_r^{t+1} - \theta_r^{t+1}) + \theta^{t+1}. \]

Since \(\sum_{r\notin T_i}(\zeta_r^t - \theta_r^t)\) does not change through step 2, the quantity \(\sum_r(\zeta_r^t - \theta_r^t) + \theta^t\) is invariant and remains at its initial value 0. This gives the following relation, which shows that the e–condition holds at the convergent point of TRP:

\[ \sum_r \zeta_r^* - (L-1)\theta^* = \sum_r(\zeta_r^* - \theta_r^*) + \theta^* = \sum_r(\zeta_r^t - \theta_r^t) + \theta^t = 0. \]
When TRP converges, the operation of step 2 shows that each tree distribution has the same marginal distribution, which shows p∗Ti (x) ∈ M∗ , where p∗Ti (x) is the tree distribution constructed with the converged parameters. Since ζ ∗r = πMr ◦p∗Ti (x), r ∈ Ti holds, pr (x; ζ ∗r ) ∈ M∗ also holds for r = 1, . . . , L, which shows the m–condition is satisfied at the convergent point.
3.3 CCCP. CCCP is an iterative procedure for obtaining the minimum of a function that is represented as the difference of two convex functions (Yuille & Rangarajan, 2003). The idea of CCCP was applied to the inference problem of loopy graphs, with the Bethe free energy, which we discuss in section 4, as the energy function (Yuille, 2002); the procedure is therefore CCCP–Bethe, but in the following we refer to it simply as CCCP. The details of the derivation are given in the appendix, and CCCP is defined as follows in the information-geometrical framework.

Information-Geometrical View of CCCP

Inner loop: Given θ^t, calculate {ζ_r^{t+1}} by solving

\[ \pi_{M_0}\circ p_r(x;\zeta_r^{t+1}) = L\theta^t - \sum_{r'}\zeta_{r'}^{t+1}, \qquad r = 1, \dots, L. \quad (3.7) \]

Outer loop: Given the set {ζ_r^{t+1}} resulting from the inner loop, calculate

\[ \theta^{t+1} = L\theta^t - \sum_r \zeta_r^{t+1}. \quad (3.8) \]
From equations 3.7 and 3.8, one obtains
θ t+1 = πM0 ◦ pr (x; ζ t+1 r ),
r = 1, . . . , L,
which means that CCCP enforces the m–condition at each iteration. On the other hand, the e–condition is satisfied only at the convergent point, which is easily verified by letting θ^{t+1} = θ^t = θ^* in equation 3.8, yielding the e–condition (L−1)θ^* = Σ_r ζ_r^*. One can therefore see that the inner and outer loops of CCCP solve the m–condition and the e–condition, respectively.

4 Free Energy Function

4.1 Bethe Free Energy. We have described the information-geometrical view of BP and related algorithms. It characterizes the equilibrium points, but it is not enough to describe the approximation accuracy or the dynamics of the algorithms. An energy function helps to clarify these, and several functions have been proposed for this purpose. The most popular one is the Bethe free energy. The Bethe free energy itself is well known in the statistical mechanics literature, where it is used in formulating the so-called Bethe approximation (Itzykson & Drouffe, 1989). As far as we know, Kabashima and Saad (2001) were the first to point out that BP can be derived from a variational extremization of a free energy. It was Yedidia et al. (2001a) who introduced to the machine-learning community the formulation of BP based on the Bethe
free energy. Following Yedidia et al. (2001a) and using their terminology, the definition of the free energy is given as follows:

\[ F_\beta = \sum_r \sum_{x_r} b_r(x_r)\ln\frac{b_r(x_r)}{\exp[h_i x_i + h_j x_j + c_r(x)]} - \sum_i (l_i - 1)\sum_{x_i} b_i(x_i)\ln\frac{b_i(x_i)}{\exp(h_i x_i)}. \]

Here, x_r denotes the pair of vertices included in the edge r, b_i(x_i) and b_r(x_r) are a belief and a pairwise belief, respectively, and l_i is the number of neighbors of vertex i. From the definition, \(\sum_{x_i} b_i(x_i) = 1\) and \(\sum_{x_r} b_r(x_r) = 1\) are satisfied. In the information-geometrical formulation, b_r(x_r) = p_r(x_r; ζ_r), and by setting

\[ p_r(x_k;\zeta_r) = p_0(x_k;\theta), \qquad k \notin r\text{th edge}, \]

the Bethe free energy becomes

\[ F_\beta = \sum_r [\zeta_r\cdot\eta_r(\zeta_r) - \psi_r(\zeta_r)] - (L-1)[\theta\cdot\eta_0(\theta) - \psi_0(\theta)]. \quad (4.1) \]
In Yedidia et al. (2001a, 2001b), the following reducibility conditions (also called the marginalization conditions) are further imposed:

\[ b_i(x_i) = \sum_{x_j} b_{ij}(x_i,x_j), \qquad b_j(x_j) = \sum_{x_i} b_{ij}(x_i,x_j). \quad (4.2) \]
These conditions are equivalent to the m–condition, that is, η_r(ζ_r) = η_0(θ), r = 1, \dots, L (see equation 3.1), so that each ζ_r is no longer an independent variable but depends on θ. With these constraints, the Bethe free energy is simplified as follows:
\[ F_\beta^m(\theta) = (L-1)\psi_0(\theta) - \sum_r \psi_r(\zeta_r(\theta)) + \Big[\sum_r \zeta_r(\theta) - (L-1)\theta\Big]\cdot\eta_0(\theta). \quad (4.3) \]
At each step of the BP algorithm, equation 4.2 is not satisfied, but the e–condition is. Therefore, assuming equation 4.2 for the original BP immediately gives the equilibrium, and no free parameter is left. Without any free parameter, it is not possible to take derivatives, which precludes any further analysis in terms of the Bethe free energy. Thus,
it is important in any analysis based on the free energy to specify what the independent variables are, in order to provide a proper argument. Finally, we mention the relation between the Bethe free energy and the conventional (Helmholtz) free energy ψ_q, the logarithm of the partition function of q(x) defined in equation 2.1. When the e–condition is satisfied, F_β^m(θ) becomes

\[ F_\beta^m(\theta) = (L-1)\psi_0(\theta) - \sum_r \psi_r(\zeta_r(\theta)) = -\Big(\psi_0(\theta) + \sum_r\big[\psi_r(\zeta_r(\theta)) - \psi_0(\theta)\big]\Big). \]
This formula shows that the Bethe free energy can be regarded as an approximation to the conventional free energy by a linear combination of ψ_0 and {ψ_r}. Moreover, if the graph is a tree, proposition 1 and equation 3.5 give ψ_q = Σ_r ψ_r − (L−1)ψ_0 by matching normalization constants, so the Bethe free energy is equivalent to −ψ_q.

4.2 A New View on Free Energy. Instead of assuming equation 4.2, let us start from the free energy defined in equation 4.3 without any constraint on the parameters; that is, all of θ, ζ_1, \dots, ζ_L are free parameters:

\[ F(\theta,\zeta_1,\dots,\zeta_L) = (L-1)\psi_0(\theta) - \sum_r \psi_r(\zeta_r) + \Big[\sum_r \zeta_r - (L-1)\theta\Big]\cdot\eta_0(\theta). \quad (4.4) \]
The above function is rewritten in terms of the KL divergence as

\[ F(\theta,\zeta_1,\dots,\zeta_L) = D[p_0(x;\theta);q(x)] - \sum_r D[p_0(x;\theta);p_r(x;\zeta_r)] + C, \]

where C is a constant. The following theorem is easily derived.

Theorem 3. The equilibrium (θ^*, {ζ_r^*}) of BP is a critical point of F(θ, ζ_1, \dots, ζ_L).

Proof. By calculating

\[ \frac{\partial F}{\partial\zeta_r} = 0, \]

we easily have

\[ \eta_r(\zeta_r) = \eta_0(\theta), \]
which is the m–condition. By calculating

\[ \frac{\partial F}{\partial\theta} = 0, \quad (4.5) \]

we are led to the e–condition \((L-1)\theta = \sum_r \zeta_r\).
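The KL-divergence form of F, and the fact that the constant is C = −ψ_q, can be checked numerically by brute force on a small binary model. The model and all names below are an illustrative sketch of ours, not part of the original derivation:

```python
import numpy as np
from itertools import product

# Check that F(theta, {zeta_r}) of eq. 4.4 equals D[p0;q] - sum_r D[p0;p_r]
# up to a constant (-psi_q), for arbitrary parameter values.
rng = np.random.default_rng(0)
n = 3
h = np.array([0.2, -0.3, 0.1])
edges = [(0, 1, 0.4), (1, 2, -0.3), (0, 2, 0.2)]
L = len(edges)
states = np.array(list(product([-1, 1], repeat=n)))

def log_Z(logpot):
    """Log partition function by enumeration (stable log-sum-exp)."""
    m = logpot.max()
    return m + np.log(np.exp(logpot - m).sum())

def F_minus_KL(theta, zetas):
    lp0 = states @ (h + theta)
    psi0 = log_Z(lp0)
    p0 = np.exp(lp0 - psi0)
    eta0 = states.T @ p0
    lprs = [states @ (h + z) + J * states[:, i] * states[:, j]
            for (i, j, J), z in zip(edges, zetas)]
    psis = [log_Z(lp) for lp in lprs]
    # F from eq. 4.4
    F = (L - 1) * psi0 - sum(psis) + (sum(zetas) - (L - 1) * theta) @ eta0
    # KL form
    lq = states @ h + sum(J * states[:, i] * states[:, j] for i, j, J in edges)
    D_q = p0 @ ((lp0 - psi0) - (lq - log_Z(lq)))
    D_r = sum(p0 @ ((lp0 - psi0) - (lp - psi)) for lp, psi in zip(lprs, psis))
    return F - (D_q - D_r)

c1 = F_minus_KL(rng.normal(size=n), [rng.normal(size=n) for _ in range(L)])
c2 = F_minus_KL(rng.normal(size=n), [rng.normal(size=n) for _ in range(L)])
print(c1, c2)   # the same value for any parameters: the constant -psi_q
```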
The theorem shows that equation 4.4 works as the free energy function without any constraint.

4.3 Relation to Other Free Energies. The function F(θ, ζ_1, \dots, ζ_L) works as a free energy function, but it is also important to compare it with other "free energies." First, we compare it with the one proposed by Kabashima and Saad (2001). It is a function of (ζ_1, \dots, ζ_L) and (ξ_1, \dots, ξ_L), given by

\[ F_{KS}(\zeta_1,\dots,\zeta_L;\xi_1,\dots,\xi_L) = F(\theta,\zeta_1,\dots,\zeta_L) + \sum_r D[p_0(x;\theta);p_0(x;\zeta_r+\xi_r)], \]

where θ = Σ_r ξ_r. It is clear from the definition that the choice of ξ_r that minimizes F_{KS} is ξ_r = θ − ζ_r for all r, for which F_{KS} becomes equivalent to F. Next, we consider the dual form of the free energy F_β in equation 4.1. The dual form is defined by introducing Lagrange multipliers (Yedidia et al., 2001a) and redefining the free energy as a function of them. The multipliers are attached to the reducibility conditions \(b_i(x_i) = \sum_{x_j} b_{ij}(x_i,x_j)\) and \(b_j(x_j) = \sum_{x_i} b_{ij}(x_i,x_j)\), which are equivalent to η_r(ζ_r) = η_0(θ), the m–condition of the information-geometrical formulation. Let λ_r ∈ ℝ^n, r = 1, \dots, L, be the Lagrange multipliers; the free energy becomes

\[ G(\theta,\{\zeta_r\},\{\lambda_r\}) = F_\beta(\theta,\{\zeta_r\}) - \sum_r \lambda_r\cdot[\eta_r(\zeta_r) - \eta_0(\theta)], \qquad \lambda_r \in \mathbb{R}^n. \]
The original extremal problem is equivalent to the extremal problem of G with respect to θ, {ζ_r}, and {λ_r}. The dual form G_β is derived by redefining G as a function of {λ_r}, with the extremal problems in θ and {ζ_r} solved. Solving ∂_θ G = 0 gives

\[ \theta(\{\lambda_r\}) = \frac{1}{L-1}\sum_r \lambda_r, \]

while ∂_{ζ_r} G = 0 gives

\[ \zeta_r(\lambda_r) = \lambda_r. \]
Finally, the dual form G_β becomes

\[ G_\beta(\{\lambda_r\}) = (L-1)\psi_0(\theta(\{\lambda_r\})) - \sum_r \psi_r(\zeta_r(\lambda_r)). \quad (4.6) \]
Although F in equation 4.4 becomes equivalent to G_β when the e–condition is assumed, F itself is free from the e– and the m–conditions and is thus different from G_β. From the definition of the Lagrange multipliers, G_β is introduced to analyze the extremal problem of F_β under the m–condition, where the e–condition is not satisfied. The m–constrained free energy F_β^m in equation 4.3 shows that F is equivalent to F_β under the m–condition. We summarize as follows: under the m–condition, F is equivalent to F_β, and under the e–condition, F is equivalent to the dual form G_β.

4.4 Property of Fixed Points. Let us study the stability of the fixed points of F_β or, equivalently, of F under the m–condition. Since the m–condition is satisfied, every ζ_r is a dependent variable of θ, and we consider the derivative with respect to θ. From the m–condition, we have

\[ \eta_r(\zeta_r) = \eta_0(\theta), \qquad \frac{\partial\zeta_r}{\partial\theta} = I_r(\zeta_r)^{-1} I_0(\theta), \qquad r = 1, \dots, L. \quad (4.7) \]

Here, I_0(θ) and I_r(ζ_r) are the Fisher information matrices of p_0(x; θ) and p_r(x; ζ_r), respectively, defined as

\[ I_0(\theta) = \partial_\theta\eta_0(\theta) = \partial_\theta^2\psi_0(\theta), \qquad I_r(\zeta_r) = \partial_{\zeta_r}\eta_r(\zeta_r) = \partial_{\zeta_r}^2\psi_r(\zeta_r), \qquad r = 1, \dots, L. \]

Equation 4.7 is proved as follows:

\[ \eta_r(\zeta_r + \delta\zeta_r) \simeq \eta_r(\zeta_r) + I_r(\zeta_r)\,\delta\zeta_r = \eta_0(\theta + \delta\theta) \simeq \eta_0(\theta) + I_0(\theta)\,\delta\theta \ \Rightarrow\ \delta\zeta_r = I_r(\zeta_r)^{-1} I_0(\theta)\,\delta\theta. \quad (4.8) \]

The condition for an equilibrium is equation 4.5, which yields the e–condition, and the second derivative gives the behavior around the stationary point, that is,

\[ \frac{\partial^2 F}{\partial\theta^2} = I_0(\theta) + I_0(\theta)\Big(\sum_r\big[I_r(\zeta_r)^{-1} - I_0(\theta)^{-1}\big]\Big) I_0(\theta) + \Delta, \quad (4.9) \]

where Δ is the term related to the derivative of the Fisher information matrix, which vanishes when the e–condition is satisfied. If equation 4.9 is positive definite at the stationary point, the Bethe free energy is at least locally minimized at the equilibrium, but it is not always positive definite. Therefore, the conventional gradient descent method applied to F_β or F may fail.
5 Algorithms and Their Convergences

5.1 e–constraint Algorithm. Since the equilibrium of BP is characterized by the e– and the m–conditions, there are two possible types of algorithm for finding the equilibrium. One is to constrain the parameters always to satisfy the e–condition and search for the parameters that satisfy the m–condition (e–constraint algorithm). The other is to constrain the parameters to satisfy the m–condition and search for the parameters that satisfy the e–condition (m–constraint algorithm). In this section, we discuss e–constraint algorithms. BP is an e–constraint algorithm, since the e–condition is satisfied at each step, but its convergence is not guaranteed. We give an alternative e–constraint algorithm with a better convergence property. Let us begin by proposing a new cost function,

\[ F_e(\{\zeta_r\}) = \sum_r \big\|\eta_0(\theta) - \eta_r(\zeta_r)\big\|^2, \quad (5.1) \]

under the e–constraint θ = Σ_r ζ_r/(L−1). If the cost function is minimized to 0, the m–condition is satisfied, and the point is an equilibrium. A naive method for minimizing F_e is gradient descent. The gradient is

\[ \frac{\partial F_e}{\partial\zeta_r} = -2 I_r(\zeta_r)\big[\eta_0(\theta) - \eta_r(\zeta_r)\big] + \frac{2}{L-1}\, I_0(\theta)\sum_{r'}\big[\eta_0(\theta) - \eta_{r'}(\zeta_{r'})\big]. \quad (5.2) \]

If the derivative is available, ζ_r and θ are updated as

\[ \zeta_r^{t+1} = \zeta_r^t - \delta\,\frac{\partial F_e}{\partial\zeta_r}, \qquad \theta^{t+1} = \frac{1}{L-1}\sum_r \zeta_r^{t+1}, \]

where δ is a small positive learning rate. It is not difficult to calculate η_0(θ), η_r(ζ_r), and I_0(θ); what remains is the first term of equation 5.2. Fortunately, we have the relation

\[ I_r(\zeta_r)\,h = \lim_{\alpha\to 0}\frac{\eta_r(\zeta_r + \alpha h) - \eta_r(\zeta_r)}{\alpha}. \]

If η_0(θ) − η_r(ζ_r) is substituted for h, this becomes the first term of equation 5.2. We now propose a new algorithm.
A New e–Constraint Algorithm

1. Set t = 0, θ^t = 0, ζ_r^t = 0, r = 1, \dots, L.

2. Calculate η_0(θ^t), I_0(θ^t), and η_r(ζ_r^t), r = 1, \dots, L.

3. Let h_r = η_0(θ^t) − η_r(ζ_r^t) and calculate η_r(ζ_r^t + αh_r) for r = 1, \dots, L, where α > 0 is small. Then calculate

\[ g_r = \frac{\eta_r(\zeta_r^t + \alpha h_r) - \eta_r(\zeta_r^t)}{\alpha}. \]

4. Update ζ_r^{t+1} and θ^{t+1} as follows:

\[ \zeta_r^{t+1} = \zeta_r^t - \delta\Big(-2 g_r + \frac{2}{L-1}\, I_0(\theta^t)\sum_{r'} h_{r'}\Big), \qquad \theta^{t+1} = \frac{1}{L-1}\sum_r \zeta_r^{t+1}. \]

5. If \(F_e(\{\zeta_r\}) = \sum_r \|\eta_0(\theta) - \eta_r(\zeta_r)\|^2 > \epsilon\) (\(\epsilon\) is a threshold) holds, set t+1 → t and go to 2.
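A minimal numerical sketch of this algorithm, on a toy binary model with brute-force computation of η_r. The graph, the constants α and δ, and the iteration count are our own ad hoc choices for illustration:

```python
import numpy as np
from itertools import product

# Sketch of the proposed e-constraint algorithm on a small loopy binary model.
n = 3
h = np.array([0.2, -0.1, 0.15])
edges = [(0, 1, 0.3), (1, 2, -0.25), (0, 2, 0.2)]
L = len(edges)
states = np.array(list(product([-1, 1], repeat=n)))
alpha, delta = 1e-4, 0.2

def eta_r(zeta, edge):
    """Expectation parameter of p_r(x; zeta) by brute-force enumeration."""
    i, j, J = edge
    logp = states @ (h + zeta) + J * states[:, i] * states[:, j]
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return states.T @ p

zeta = np.zeros((L, n))
for t in range(3000):
    theta = zeta.sum(axis=0) / (L - 1)              # e-constraint
    eta0 = np.tanh(h + theta)                       # eta_0; I_0 is diagonal 1-eta0^2
    etas = np.array([eta_r(z, e) for z, e in zip(zeta, edges)])
    hs = eta0 - etas                                # h_r (step 3)
    gs = np.array([(eta_r(z + alpha * hr, e) - er) / alpha      # g_r ~ I_r h_r
                   for z, e, hr, er in zip(zeta, edges, hs, etas)])
    grad = -2 * gs + (2 / (L - 1)) * ((1 - eta0 ** 2) * hs.sum(axis=0))  # eq. 5.2
    zeta -= delta * grad                            # step 4

print((hs ** 2).sum())   # F_e shrinks toward 0: m-condition met under the e-constraint
```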
This algorithm is an e–constraint algorithm and does not include double loops, similar to the BP algorithm, but we have introduced a new parameter α, which can affect the convergence. We have checked with small-scale numerical simulations that this problem is avoided if α is sufficiently small, but further theoretical analysis is needed. Another problem is that this algorithm may converge to any fixed point of BP, even one that is not a stable fixed point of BP. For example, when ζ_r and θ are extremely large, every component of η_r and η_0 becomes close to 1, which is a trivial, useless fixed point of this algorithm. In order to avoid this, it is natural to use a Riemannian metric for the norm instead of the squared norm in equation 5.1. The local metric modifies the cost function to

\[ F_e^R(\{\zeta_r\}) = \sum_r \big[\eta_0(\theta) - \eta_r(\zeta_r)\big]^T I_0(\theta_0)^{-1}\big[\eta_0(\theta) - \eta_r(\zeta_r)\big], \]

where θ_0 is the convergent point. Since I_0(θ_0)^{-1} diverges at the trivial fixed points mentioned above, we expect F_e^R({ζ_r}) to be a better cost function. The gradient can be calculated similarly by fixing θ_0, which is unknown; hence we replace it by θ^t. The calculation of g_r should also be modified to

\[ \tilde{g}_r = \frac{\eta_r(\zeta_r^t + \alpha I_0(\theta^t)^{-1} h_r) - \eta_r(\zeta_r^t)}{\alpha} \]

from the point of view of the natural gradient method (Amari, 1998). We finally have

\[ \zeta_r^{t+1} = \zeta_r^t - 2\delta\, I_0(\theta^t)^{-1}\Big(-\tilde{g}_r + \frac{1}{L-1}\sum_{r'} h_{r'}\Big). \]

Since I_0(θ) is a diagonal matrix, the computation is simple.

5.2 m–constraint Algorithm. The other possibility is to constrain the parameters always to satisfy the m–condition and modify the parameters to satisfy the e–condition. Since the m–condition is satisfied, {ζ_r} are dependent on θ.
A naive idea is to repeat the following two steps:

Naive m–Constraint Algorithm

1. For r = 1, \dots, L,

\[ \zeta_r^t = \pi_{M_r}\circ p_0(x;\theta^t). \quad (5.3) \]
2. Update the parameters as

\[ \theta^{t+1} = L\theta^t - \sum_r \zeta_r^t. \]

Starting from θ^t, the algorithm finds {ζ_r^t} that satisfy the m–condition by equation 5.3, and θ^{t+1} is adjusted to satisfy the e–condition. This is a simple recursive algorithm without double loops. We call it the naive m–constraint algorithm. One may instead use an advanced iteration method that uses the new ζ_r^{t+1} instead of ζ_r^t. In this case, the algorithm is

\[ \zeta_r^{t+1} = \pi_{M_r}\circ p_0(x;\theta^{t+1}), \qquad \text{where}\ \ \theta^{t+1} = L\theta^t - \sum_r \zeta_r^{t+1}. \]

In this algorithm, starting from θ^t, one must solve a nonlinear equation in θ^{t+1}, because {ζ_r^{t+1}} are functions of θ^{t+1}. This algorithm therefore uses double loops: an inner loop and an outer loop. This is the idea of CCCP, and it is also an m–constraint algorithm.

5.2.1 Stability of the Algorithms. Although the naive m–constraint algorithm and CCCP share the same equilibria θ^* and {ζ_r^*}, their local stabilities at an equilibrium are different; CCCP is reported to have superior properties in this respect. The local stability of BP was analyzed by Richardson (2000) and also by Ikeda et al. (in press) in geometrical terms; the stability condition of BP is given by conditions on the eigenvalues of a matrix defined by the Fisher information matrices. In this article, we give the local stability of the other algorithms. If we eliminate the intermediate variables {ζ_r} of the inner loop, the naive m–constraint algorithm is

\[ \theta^{t+1} = L\theta^t - \sum_r \pi_{M_r}\circ p_0(x;\theta^t), \quad (5.4) \]

and CCCP is represented as

\[ \theta^{t+1} = L\theta^t - \sum_r \pi_{M_r}\circ p_0(x;\theta^{t+1}). \quad (5.5) \]
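Equation 5.4 can be sketched directly on a toy binary model. This is a hypothetical setup of ours: the m–projection π_{Mr}∘p_0 is computed by a small Newton iteration with a brute-force Fisher matrix, and the couplings are deliberately weak so that the single-loop iteration stays inside the stability region discussed below.

```python
import numpy as np
from itertools import product

# Sketch of the naive m-constraint algorithm (eq. 5.4) on a toy binary model.
n = 3
h = np.array([0.2, -0.1, 0.15])
edges = [(0, 1, 0.25), (1, 2, -0.2), (0, 2, 0.15)]
L = len(edges)
states = np.array(list(product([-1, 1], repeat=n)))

def pi_Mr(edge, eta_target):
    """pi_{Mr} o p0: Newton solve for zeta_r with eta_r(zeta_r) = eta_target."""
    i, j, J = edge
    c_r = J * states[:, i] * states[:, j]
    zeta = np.zeros(n)
    for _ in range(50):
        logp = states @ (h + zeta) + c_r
        p = np.exp(logp - logp.max())
        p /= p.sum()
        m = states.T @ p
        if np.max(np.abs(eta_target - m)) < 1e-13:
            break
        fisher = (states * p[:, None]).T @ states - np.outer(m, m)
        zeta += np.linalg.solve(fisher, eta_target - m)
    return zeta

theta = np.zeros(n)
for t in range(500):
    eta0 = np.tanh(h + theta)                   # eta_0(theta)
    zetas = [pi_Mr(e, eta0) for e in edges]     # m-condition enforced (eq. 5.3)
    theta_new = L * theta - sum(zetas)          # eq. 5.4
    if np.max(np.abs(theta_new - theta)) < 1e-12:
        break
    theta = theta_new

# At the fixed point the e-condition also holds: theta = sum_r zeta_r / (L-1)
print(np.max(np.abs(theta - sum(zetas) / (L - 1))))
```

Replacing θ^t by θ^{t+1} inside pi_Mr, solved by an inner loop, would turn this into the CCCP variant of equation 5.5.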
In order to derive the variational equation at the equilibrium, we note that for the m–projection

\[ \zeta_r = \pi_{M_r}\circ p_0(x;\theta), \]

a small perturbation δθ of θ is mapped to δζ_r = I_r(ζ_r)^{-1} I_0(θ) δθ (see equation 4.8). The variational equations are hence, for equation 5.4,

\[ \delta\theta^{t+1} = \Big[L E - \sum_r I_r(\zeta_r)^{-1} I_0(\theta)\Big]\delta\theta^t, \]

where E is the identity matrix, and, for equation 5.5,

\[ \delta\theta^{t+1} = L\Big[E + \sum_r I_r(\zeta_r)^{-1} I_0(\theta)\Big]^{-1}\delta\theta^t, \]

respectively. Let K be the matrix defined by

\[ K = \frac{1}{L}\sum_r I_0(\theta)^{1/2}\, I_r(\zeta_r)^{-1}\, I_0(\theta)^{1/2}, \]

and let \(\delta\tilde{\theta}^t\) be a new variable defined as

\[ \delta\tilde{\theta}^t = I_0(\theta)^{1/2}\,\delta\theta^t. \]

The variational equations for equations 5.4 and 5.5 are then

\[ \delta\tilde{\theta}^{t+1} = (L E - L K)\,\delta\tilde{\theta}^t, \qquad \delta\tilde{\theta}^{t+1} = L\,(E + L K)^{-1}\,\delta\tilde{\theta}^t, \]
respectively. The equilibrium is stable when the absolute values of the eigenvalues of the respective coefficient matrices are smaller than 1. Let λ_1, \dots, λ_n be the eigenvalues of K. They are all real and positive, since K is a symmetric positive-definite matrix. We note that the λ_i are close to 1 when I_r(ζ_r) ≈ I_0(θ), that is, when M_r is close to M_0. The following theorem shows that CCCP has a good convergence property.

Theorem 4. The equilibrium of the naive m–constraint algorithm in equation 5.4 is stable when

\[ 1 + \frac{1}{L} > \lambda_i > 1 - \frac{1}{L}, \qquad i = 1, \dots, n. \]

The equilibrium of CCCP is stable when the eigenvalues of K satisfy

\[ \lambda_i > 1 - \frac{1}{L}, \qquad i = 1, \dots, n. \quad (5.6) \]
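Theorem 4 follows directly from the eigenvalues μ_i of the two coefficient matrices above; the following short check of ours makes the step explicit:

```latex
% Naive m-constraint (explicit map):
\delta\tilde{\theta}^{t+1} = (LE - LK)\,\delta\tilde{\theta}^{t}
\;\Longrightarrow\; \mu_i = L(1-\lambda_i), \qquad
|\mu_i| < 1 \iff 1 - \tfrac{1}{L} < \lambda_i < 1 + \tfrac{1}{L}.

% CCCP (implicit map), using \lambda_i > 0 so that \mu_i > 0:
\delta\tilde{\theta}^{t+1} = L\,(E + LK)^{-1}\delta\tilde{\theta}^{t}
\;\Longrightarrow\; \mu_i = \frac{L}{1 + L\lambda_i}, \qquad
|\mu_i| < 1 \iff \lambda_i > 1 - \tfrac{1}{L}.
```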
Under the m–constraint, the Hessian of F(θ) at an equilibrium point is equal to (cf. equation 4.9)

\[ I_0(\theta)^{1/2}\,\big[L K - (L-1)E\big]\,I_0(\theta)^{1/2}, \]

so that the stability condition (see equation 5.6) for CCCP is equivalent to the condition that the equilibrium is a local minimum of F under the m–constraint, that is, of the m–constrained Bethe free energy F_β^m(θ). The theorem therefore states that CCCP is locally stable around an equilibrium if and only if the equilibrium is a local minimum of F_β^m(θ), whereas the naive m–constraint algorithm is not necessarily stable even if the equilibrium is a local minimum. A similar result is obtained in Heskes (2003). It should be noted that the above local stability result for CCCP does not follow from the global convergence result given by Yuille (2002). Yuille showed that CCCP decreases the cost function and converges to an extremal point of F_β^m(θ), which means the fixed point is not necessarily a local minimum but can be a saddle point. Our local linear analysis shows that a stable fixed point of CCCP is a local minimum of F_β^m(θ).

5.2.2 Natural Gradient and Discretization. Let us consider a gradient rule for updating θ to find a minimum of F under the m–condition,

\[ \dot{\theta} = -\frac{\partial F(\theta)}{\partial\theta}. \]

When we have a metric measuring distance in the space of θ, it is natural to use this metric for the gradient (natural gradient; see Amari, 1998). For statistical models, the Riemannian metric given by the Fisher information matrix is a natural choice, since it is derived from the KL divergence. The natural gradient version of the update rule is

\[ \dot{\theta} = -I_0^{-1}(\theta)\,\frac{\partial F}{\partial\theta} = (L-1)\theta - \sum_r \pi_{M_r}\circ p_0(x;\theta). \quad (5.7) \]
For the implementation, it is necessary to discretize the continuous-time update rule. The "fully explicit" scheme of discretization (Euler's method) reads

\[ \theta^{t+1} = \theta^t + \Delta t\Big[(L-1)\theta^t - \sum_r \pi_{M_r}\circ p_0(x;\theta^t)\Big]. \quad (5.8) \]

When Δt = 1, this is equivalent to the naive m–constraint algorithm (see equation 5.4). However, we do not necessarily have to set Δt = 1; we may use an arbitrary positive value for Δt. We show later how the convergence rate is affected by the choice of Δt.
The "fully implicit" scheme yields

\[ \theta^{t+1} = \theta^t + \Delta t\Big[(L-1)\theta^{t+1} - \sum_r \pi_{M_r}\circ p_0(x;\theta^{t+1})\Big], \quad (5.9) \]

which, after rearrangement of terms, becomes

\[ \big[1 - \Delta t(L-1)\big]\,\theta^{t+1} = \theta^t - \Delta t\sum_r \pi_{M_r}\circ p_0(x;\theta^{t+1}). \]

When Δt = 1/L, this equation is equivalent to CCCP in equation 5.5. Again, we are not bound to the choice Δt = 1/L; the relation between Δt and the convergence rate is also shown later. We have just shown that the naive m–constraint algorithm and CCCP can be viewed as first-order discretization methods applied to the continuous-time natural gradient system of equation 5.7. The local stability result for CCCP proved in theorem 4 can also be understood as an instance of the well-known absolute stability property of the fully implicit scheme applied to linear systems. It should also be noted that other, more sophisticated methods for solving ordinary differential equations, such as Runge-Kutta methods (possibly with adaptive step-size control) and the Bulirsch-Stoer method (Press, Teukolsky, Vetterling, & Flannery, 1992), are applicable for formulating m–constraint algorithms with better properties, for example, better stability. In this article, however, we do not discuss possible extensions along this line any further.
When t = 1/L, this equation is equivalent to CCCP in equation 5.5. Again, we do not have to be bound to the choice t = 1/L. We will also show the relation between t and the convergence rate later. We have just shown that the naive m–constraint algorithm and CCCP can be viewed as first-order methods of discretization applied to the continuoustime natural gradient system shown in equation 5.7. The local stability result for CCCP proved in theorem 4 can also be understood as an example of the well-known absolute stability property of the fully implicit scheme applied to linear systems. It should also be noted that other more sophisticated methods for solving ordinary differential equations, such as Runge-Kutta methods (possibly with adaptive step-size control) and the Bulirsch-Stoer method Press, Teukolsky, Vetterling, and Flannery (1992), are applicable for formulating m–constraint algorithms with better properties, for example, better stability. In this article, however, we do not discuss possible extension along this line any further. 5.2.3 Acceleration of m–Constraint Algorithms. equations 5.8 and 5.9 in this section. The variational equation for equation 5.8 is δ θ˜
t+1
We give the analysis of
t = {E − [LK − (L − 1)E]t}δ θ˜ .
Let λ1 ≤ λ2 ≤ · · · ≤ λn
(5.10)
be the eigenvalues of K. Then the convergence rate is improved by choosing an adequate t. The convergence rate is governed by the largest absolute values of the eigenvalues of E − [LK − (L − 1)E]t, which are given by µi = 1 − [Lλi − (L − 1)]t. From equation 5.10, we have µ1 ≥ µ2 ≥ · · · ≥ µn . The stability condition is |µi | < 1 for all i. At a locally stable equilibrium point, µ1 < 1 always holds, so that the algorithm is stable if µn > −1 holds. The convergence to a locally
1804
S. Ikeda, T. Tanaka, and S. Amari
stable equilibrium point is most accelerated when µ1 + µn = 0, which holds by taking topt =
2 L(λ1 + λn − 1) + 2
.
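A worked example, with hypothetical eigenvalues of our own choosing, shows how the optimal step size can rescue an unstable explicit iteration:

```latex
\mu_i = 1 - [L\lambda_i - (L-1)]\,\Delta t, \qquad L = 3,\ \lambda_1 = 0.9,\ \lambda_n = 1.5.

% \Delta t = 1 (the naive algorithm):
% \mu_1 = 1 - 0.7 = 0.3, \quad \mu_n = 1 - 2.5 = -1.5,
% so |\mu_n| > 1: unstable (indeed \lambda_n = 1.5 > 1 + 1/L = 4/3).

% Optimal step from \mu_1 + \mu_n = 0:
\Delta t_{\mathrm{opt}}
 = \frac{2}{[L\lambda_1 - (L-1)] + [L\lambda_n - (L-1)]}
 = \frac{2}{0.7 + 2.5} = 0.625,
\qquad \mu_1 = -\mu_n = 1 - 0.7 \times 0.625 = 0.5625.
```

With Δt_opt, the spectral radius drops to 0.5625 and the explicit scheme converges.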
The variational equation for equation 5.9 is

\[ \delta\tilde{\theta}^{t+1} = \big\{E + [L K - (L-1)E]\,\Delta t\big\}^{-1}\,\delta\tilde{\theta}^t, \]

and the convergence rate is governed by the largest of the absolute values of the eigenvalues of \(\{E + [LK - (L-1)E]\,\Delta t\}^{-1}\), which should be smaller than 1 for convergence. The eigenvalues are

\[ \mu_i = \frac{1}{1 + [L\lambda_i - (L-1)]\,\Delta t}. \]
We again have μ_1 ≥ μ_2 ≥ \cdots ≥ μ_n. At a locally stable equilibrium point, 0 < μ_n and μ_1 < 1 always hold, so that the algorithm is always stable. In principle, the smaller μ_1 becomes, the faster the algorithm converges, so that taking Δt → +∞ yields the fastest convergence. However, the algorithm in this limit reduces to directly solving the e–condition under the m–constraint in a single update step. This is the fastest if it is possible, but it is usually infeasible for loopy graphs.

6 Extension

6.1 Extending the Framework to a Wider Class of Distributions. In this section, two important extensions of BP are given in the information-geometrical framework. First, we extend the model to the case where the marginal distribution of each vertex is an exponential family. A similar extension is given in Wainwright, Jaakkola, and Willsky (2003). Let t_i be the sufficient statistics of the marginal distribution of x_i, that is, of q(x_i). The marginal distribution belongs to the family of distributions

\[ p(x_i;\theta_i) = \exp[\theta_i\cdot t_i - \varphi_i(\theta_i)]. \]

This family includes many important distributions, for example, the multinomial and gaussian distributions. Let us define t = (t_1^T, \dots, t_n^T)^T and θ = (θ_1^T, \dots, θ_n^T)^T, and let the true distribution be

\[ q(x) = \exp[h\cdot t + c(x) - \psi_q]. \]

We can now redefine equation 2.2 as

\[ p(x;\theta,v) = \exp[\theta\cdot t + v\cdot c(x) - \psi(\theta,v)], \]
and S in equation 2.3 as S = {p(x; θ, v) | θ ∈ Θ, v ∈ V}. When the problem is to infer the marginals q(x_i) of q(x), we can redefine the BP algorithm in this new S by redefining M_0 and M_r. This extension based on the new definitions is simple, and we do not give further details in this article.

6.2 Generalized Belief Propagation. In this section, we show the information-geometrical framework for generalized belief propagation (GBP; Yedidia et al., 2001b), which is an important extension of BP. A naive explanation of GBP is that the cliques are reformulated as subsets of L, the set of all edges. This yields a new implementation of the algorithm and a different inference. In the information-geometrical formulation, we define c_s(x) as a new clique function, which summarizes the interactions of the edges in L_s, s = 1, \dots, L′, that is,

\[ c_s(x) = \sum_{r\in L_s} c_r(x), \]
where L_s ⊆ L. The L_s may have overlaps, and they must be chosen to satisfy ∪_s L_s = L. GBP is a general framework that includes many possible cases. We categorize them into three important classes and give an information-geometrical framework for each.

Case 1. In the simplest case, each L_s does not have any loop. This is equivalent to TRP. As we saw in section 3.2, the algorithm is explained in the information-geometrical framework.

Case 2. In this case, each L_s can have loops, but there is no overlap, that is, L_s ∩ L_{s′} = ∅ for s ≠ s′. The extension to this case is also simple. We can apply information geometry by redefining M_r as M_s, defined as

M_s = {p_s(x; ζ_s) = exp[h · x + c_s(x) + ζ_s · x − ψ_s(ζ_s)] | ζ_s ∈ ℝⁿ}.

Since some loops are treated in a different way, the result might differ from that of BP.

Case 3. Finally, we describe the case where each L_s can have loops and overlaps with the other sets. In this case, we have to extend the framework. Suppose L_s and L_{s′} have an overlap, and both have loops. We explain this case with the example in Figure 4.
S. Ikeda, T. Tanaka, and S. Amari
Figure 4: Case 3. (Four vertices x₁–x₄ connected by five edges, forming two loops that share edge 3.)
Let us first define the following distributions:

q(x) = exp[h · x + Σ_{i=1}^{5} c_i(x) − ψ_q]

p₀(x; θ) = exp[h · x + θ · x − ψ₀(θ)]  (6.1)

p₁(x; ζ₁) = exp[h · x + Σ_{i=1}^{3} c_i(x) + ζ₁ · x − ψ₁(ζ₁)]  (6.2)

p₂(x; ζ₂) = exp[h · x + Σ_{i=3}^{5} c_i(x) + ζ₂ · x − ψ₂(ζ₂)].  (6.3)
Even if ζ₁, ζ₂, and θ satisfy the e–condition θ = ζ₁ + ζ₂, this does not imply that

C p₁(x; ζ₁) p₂(x; ζ₂) / p₀(x; θ)

is equivalent to q(x), since c₃(x) is counted twice. Therefore, we introduce another model p₃(x; ζ₃), which has the following form:

p₃(x; ζ₃) = exp[h · x + c₃(x) + ζ₃ · x − ψ₃(ζ₃)].  (6.4)

Now,

C p₁(x; ζ₁) p₂(x; ζ₂) / p₃(x; ζ₃)

becomes equal to q(x), where ζ₃ = ζ₁ + ζ₂ is the e–condition. Next, we look at the m–condition. Its original form is

Σ_x x p₀(x; θ) = Σ_x x p_s(x; ζ_s),
but in this case, this form is not enough. We need the further condition that the marginal

p_s(x₂, x₃; ζ_s) = Σ_{x₁,x₄} p_s(x; ζ_s)

should be the same for s = 1, 2, 3. The models in equations 6.1, 6.2, 6.3, and 6.4 are not sufficient, since we do not have enough parameters to specify a joint distribution of (x₂, x₃), and the model must be extended. In the binary case, we can extend the models by adding one variable as follows:

p₁(x; ζ₁, v₁) = exp[h · x + Σ_{i=1}^{3} c_i(x) + ζ₁ · x + v₁x₂x₃ − ψ₁(ζ₁, v₁)]

p₂(x; ζ₂, v₂) = exp[h · x + Σ_{i=3}^{5} c_i(x) + ζ₂ · x + v₂x₂x₃ − ψ₂(ζ₂, v₂)]

p₃(x; ζ₃, v₃) = exp[h · x + c₃(x) + ζ₃ · x + v₃x₂x₃ − ψ₃(ζ₃, v₃)],

and the m–condition becomes

Σ_x x p₀(x; θ) = Σ_x x p_s(x; ζ_s, v_s),  s = 1, 2, 3,

Σ_x x₂x₃ p₁(x; ζ₁, v₁) = Σ_x x₂x₃ p₂(x; ζ₂, v₂) = Σ_x x₂x₃ p₃(x; ζ₃, v₃).

We revisit the e–condition, which is now extended as

ζ₃ = ζ₁ + ζ₂,  v₃ = v₁ + v₂.
This is a simple example, but any GBP problem can be described in the information-geometrical framework in a similar way.

7 Conclusion

Stochastic reasoning is an important technique widely used for graphical models, with many interesting applications. BP is a useful method for solving it, and in order to analyze its behavior and give it a theoretical foundation, a variety of approaches have been proposed from AI, statistical physics, information theory, and information geometry. We have shown a unified framework for understanding these interdisciplinary concepts and algorithms from the viewpoint of information geometry. Since information geometry captures the essential structure of the manifold of probability distributions, we have been able to clarify the intrinsic geometrical structures of the various algorithms proposed so far and the differences between them. The BP solution is characterized by the e– and the m–conditions. We have shown that BP and TRP explore the solution in the subspace where the
e–condition is satisfied, while CCCP does so in the subspace where the m–condition is satisfied. This analysis makes it possible to obtain new, efficient variants of these algorithms. We have proposed new e– and m–constraint algorithms. Possible acceleration methods for the m–constraint algorithm and CCCP are given, with local stability and convergence rate analysis. We have clarified the relations among the free-energy-like functions and have proposed a new one. Finally, we have shown possible extensions of BP from an information-geometrical viewpoint. This work is a first step toward an information-geometrical understanding of BP. By using this framework, we expect that further understanding and new improvements of the methods will emerge.

Appendix: Information-Geometrical View of CCCP

In this section, we derive the information-geometrical view of CCCP. The following two theorems play important roles in CCCP.

Theorem 5 (Yuille & Rangarajan, 2003, sect. 2). Let E(x) be an energy function with bounded Hessian ∂²E(x)/∂x∂x. Then we can always decompose it into the sum of a convex function and a concave function.

Theorem 6 (Yuille & Rangarajan, 2003, sect. 2). Consider an energy function E(x) (bounded below) of the form E(x) = E_vex(x) + E_cave(x), where E_vex(x) and E_cave(x) are convex and concave functions of x, respectively. Then the discrete iterative CCCP algorithm x_t → x_{t+1} given by

∇E_vex(x_{t+1}) = −∇E_cave(x_t)

is guaranteed to monotonically decrease the energy E(x) as a function of time and hence to converge to a minimum or saddle point of E(x) (or even a local maximum if it starts at one).

The idea of CCCP was applied to solve the inference problem of loopy graphs, with the Bethe free energy F_β in equation 4.1 as the energy function (Yuille, 2002). The convex and concave functions are defined as follows:

F_β(θ, {ζ_r}) = Σ_r [ζ_r · η_r(ζ_r) − ψ_r(ζ_r)] − (L − 1)[θ · η₀(θ) − ψ₀(θ)]
= F_vex(θ, {ζ_r}) + F_cave(θ, {ζ_r}),

F_vex(θ, {ζ_r}) = Σ_r [ζ_r · η_r(ζ_r) − ψ_r(ζ_r)] + [θ · η₀(θ) − ψ₀(θ)],

F_cave(θ) = −L[θ · η₀(θ) − ψ₀(θ)].

Let the m–condition be satisfied, so that F_vex is a function of θ. Next, since η₀ and θ have a one-to-one relation, let η₀ be the coordinate system. The
gradients of F_vex and F_cave are given as

∇_{η₀} F_vex(η₀) = θ + Σ_r ζ_r,  −∇_{η₀} F_cave(η₀) = Lθ.
Finally, the CCCP algorithm is written as

∇_{η₀} F_vex(η₀^{t+1}) = −∇_{η₀} F_cave(η₀^t),

that is,

θ^{t+1} + Σ_r ζ_r^{t+1} = Lθ^t.  (A.1)
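As a concrete toy illustration of the CCCP update of theorem 6 (our own sketch, unrelated to the Bethe free energy), consider the one-dimensional energy E(x) = x⁴ − x², decomposed into the convex part E_vex = x⁴ and the concave part E_cave = −x². The implicit update ∇E_vex(x_{t+1}) = −∇E_cave(x_t) reduces to 4x_{t+1}³ = 2x_t, that is, x_{t+1} = (x_t/2)^{1/3}:

```python
import numpy as np

# Toy CCCP iteration on E(x) = x**4 - x**2, split as
#   E_vex(x) = x**4 (convex),  E_cave(x) = -x**2 (concave).
# The update solves grad E_vex(x_{t+1}) = -grad E_cave(x_t):
#   4 x_{t+1}**3 = 2 x_t  =>  x_{t+1} = (x_t / 2) ** (1/3).

def energy(x):
    return x**4 - x**2

x = 1.5
energies = [energy(x)]
for _ in range(50):
    x = np.cbrt(x / 2.0)          # closed-form CCCP step for this toy energy
    energies.append(energy(x))

# Energy decreases monotonically, as theorem 6 guarantees, and the
# iterate converges to the minimizer x = 1/sqrt(2).
assert all(e1 >= e2 - 1e-12 for e1, e2 in zip(energies, energies[1:]))
assert np.isclose(x, 1.0 / np.sqrt(2.0))
```

The monotone decrease holds for any starting point; the implicit step is what distinguishes CCCP from plain gradient descent.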
Since the m–condition is not satisfied in general, the inner loop solves the condition, while the outer loop updates the parameters as in equation A.1.

Acknowledgments

We thank the anonymous reviewers for valuable feedback. This work was supported by the Grant-in-Aid for Scientific Research 14084208 and 14084209, MEXT, Japan, and 14654017, JSPS, Japan.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.

Amari, S. (2001). Information geometry on hierarchy of probability distributions. IEEE Trans. Information Theory, 47, 1701–1711.

Amari, S., Ikeda, S., & Shimokawa, H. (2001). Information geometry and mean field approximation: The α-projection approach. In M. Opper & D. Saad (Eds.), Advanced mean field methods—Theory and practice (pp. 241–257). Cambridge, MA: MIT Press.

Amari, S., & Nagaoka, H. (2000). Methods of information geometry. Providence, RI: AMS, and New York: Oxford University Press.

Heskes, T. (2003). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Ikeda, S., Tanaka, T., & Amari, S. (2002). Information geometrical framework for analyzing belief propagation decoder. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 407–414). Cambridge, MA: MIT Press.

Ikeda, S., Tanaka, T., & Amari, S. (in press). Information geometry of turbo codes and low-density parity-check codes. IEEE Trans. Information Theory.

Itzykson, C., & Drouffe, J.-M. (1989). Statistical field theory. Cambridge: Cambridge University Press.

Jordan, M. I. (1999). Learning in graphical models. Cambridge, MA: MIT Press.

Kabashima, Y., & Saad, D. (1999). Statistical mechanics of error-correcting codes. Europhysics Letters, 45, 97–103.
Kabashima, Y., & Saad, D. (2001). The TAP approach to intensive and extensive connectivity systems. In M. Opper & D. Saad (Eds.), Advanced mean field methods—Theory and practice (pp. 65–84). Cambridge, MA: MIT Press.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.

Richardson, T. J. (2000). The geometry of turbo-decoding dynamics. IEEE Trans. Information Theory, 46, 9–23.

Tanaka, T. (2000). Information geometry of mean-field approximation. Neural Computation, 12, 1951–1968.

Tanaka, T. (2001). Information geometry of mean-field approximation. In M. Opper & D. Saad (Eds.), Advanced mean field methods—Theory and practice (pp. 259–273). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). Tree-based reparameterization for approximate inference on loopy graphs. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 1001–1008). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2003). Tree-reweighted belief propagation algorithms and approximate ML estimate by pseudo-moment matching. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of Ninth International Workshop on Artificial Intelligence and Statistics. Available on-line at: http://www.research.microsoft.com/conferences/aistats2003/.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12, 1–41.

Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2001a). Bethe free energy, Kikuchi approximations, and belief propagation algorithms (Tech. Rep. No. TR2001–16). Cambridge, MA: Mitsubishi Electric Research Laboratories.

Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2001b). Generalized belief propagation. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. L. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Yuille, A. L., & Rangarajan, A. (2003). The concave-convex procedure. Neural Computation, 15, 915–936.

Received August 12, 2003; accepted March 3, 2004.
LETTER
Communicated by Sebastian Seung
Blind Separation of Positive Sources by Globally Convergent Gradient Search

Erkki Oja
[email protected] Neural Networks Research Centre, Helsinki University of Technology, 02015 HUT, Finland
Mark Plumbley
[email protected] Department of Electrical Engineering, Queen Mary, University of London, London E1 4NS, U.K.
The instantaneous noise-free linear mixing model in independent component analysis is largely a solved problem under the usual assumption of independent nongaussian sources and full column rank mixing matrix. However, with some prior information on the sources, like positivity, new analysis and perhaps simplified solution methods may yet become possible. In this letter, we consider the task of independent component analysis when the independent sources are known to be nonnegative and well grounded, which means that they have a nonzero pdf in the region of zero. It can be shown that in this case, the solution method is basically very simple: an orthogonal rotation of the whitened observation vector into nonnegative outputs will give a positive permutation of the original sources. We propose a cost function whose minimum coincides with nonnegativity and derive the gradient algorithm under the whitening constraint, under which the separating matrix is orthogonal. We further prove that in the Stiefel manifold of orthogonal matrices, the cost function is a Lyapunov function for the matrix gradient flow, implying global convergence. Thus, this algorithm is guaranteed to find the nonnegative well-grounded independent sources. The analysis is complemented by a numerical simulation, which illustrates the algorithm.
1 Introduction

The problem of independent component analysis (ICA) has been studied by many authors in recent years (for a review, see Hyvärinen, Karhunen, & Oja, 2001; Amari & Cichocki, 2002). In the simplest form of ICA, we assume that we have a sequence of observations {x(k)}, which are samples of a random

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 1811–1825 (2004)
E. Oja and M. Plumbley
observation vector x generated according to

x = As,  (1.1)
where s = (s₁, . . . , sₙ)ᵀ is a vector of real independent random variables (the sources), all but perhaps one of them nongaussian, and A is a nonsingular n × n real mixing matrix. The task in ICA is to identify A given just the observation sequence, using the assumption of independence of the s_i, and hence to construct an unmixing matrix B = RA⁻¹ giving y = Bx = BAs = Rs, where R is a matrix that permutes and scales the sources. Typically, we assume that the sources have unit variance, with any scaling factor being absorbed into the mixing matrix A, so y will be a permutation of s with just a sign ambiguity. Common cost functions for ICA are based on maximizing the nongaussianity of the elements of y, and they may involve approximations by higher-order cumulants such as kurtosis. The observations x are usually assumed to be zero mean, or transformed to be so, and are commonly prewhitened by some matrix, z = Vx, so that E{zzᵀ} = I before an optimization algorithm is applied to find the separating matrix. This basic ICA model can be considered solved, with a multitude of practical algorithms and software available. However, if one makes some further assumptions that restrict or extend the model, then there is still ground for new analysis and solution methods. One such assumption is positivity or nonnegativity of the sources and perhaps of the mixing coefficients. Nonnegativity is a natural condition for many real-world applications, for example, in the analysis of images (Parra, Spence, Sajda, Ziehe, & Müller, 2000; Lee, Lee, Choi, & Lee, 2001), text (Tsuge, Shishibori, Kuroiwa, & Kita, 2001), or air quality (Henry, 2002). The constraint of nonnegative sources, perhaps with an additional constraint of nonnegativity on the mixing matrix A, is often known as positive matrix factorization (Paatero & Tapper, 1994) or nonnegative matrix factorization (Lee & Seung, 1999).
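For concreteness, the following is a minimal sketch (ours, with illustrative names and parameter values) of the mixing model 1.1 with nonnegative, well-grounded, unit-variance sources:

```python
import numpy as np

# A minimal instance of x = As (equation 1.1): sources uniform on
# [0, sqrt(12)], which are nonnegative, well grounded (nonzero pdf down
# to zero), and have unit variance, as the text assumes.

rng = np.random.default_rng(0)
n, p = 2, 10000
S = np.sqrt(12.0) * rng.uniform(size=(n, p))  # nonnegative unit-variance sources
A = rng.normal(size=(n, n))                   # random mixing matrix (nonsingular a.s.)
X = A @ S                                     # observed mixtures

assert np.all(S >= 0.0)                        # nonnegativity of the sources
assert np.allclose(S.var(axis=1), 1.0, atol=0.1)   # sample variances near 1
```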
A nonnegativity constraint has been suggested for a number of other neural network models too (Xu, 1993; Fyfe, 1994; Harpur, 1997; Charles & Fyfe, 1998). We refer to the combination of nonnegativity and independence assumptions on the sources as nonnegative independent component analysis. Recently, one of us considered the nonnegativity assumption on the sources (Plumbley, 2002, 2003) and introduced an alternative way of approaching the ICA problem, as follows. We call a source si nonnegative if Pr(si < 0) = 0, and such a source will be called well grounded if Pr(si < δ) > 0 for any δ > 0, that is, that si has nonzero pdf all the way down to zero. The following key result was proven (Plumbley, 2002): Theorem 1. Suppose that s is a vector of nonnegative well-grounded independent unit-variance sources si , i = 1, . . . , n, and y = Us where U is a square orthonormal rotation, that is, UT U = I. Then U is a permutation matrix, that is,
the elements y_j of y are a permutation of the sources s_i, if and only if all y_j are nonnegative.

Actually, the requirement of independence of the sources in theorem 1 is a technicality that simplifies the proof; it could be relaxed to second-order independence, or uncorrelatedness. The proof would be lengthy, however, and we rely on the existing theorem 1 in this letter. The result of theorem 1 can be used for a simple solution of the nonnegative ICA problem. Note that y = Us can also be written as y = Wz, with z the prewhitened observation vector and W an unknown orthogonal (rotation) matrix. It therefore suffices to find an orthogonal matrix W for which y = Wz is nonnegative. This brings an additional benefit over the other ICA methods that we know of: if successful, we always obtain a positive permutation of the sources, since both s and y are nonnegative. The sign ambiguity present in usual ICA vanishes here. Plumbley (2002) further suggested that a suitable cost function for finding the rotation could be constructed as follows. Suppose we have an output truncated at zero, y⁺ = (y₁⁺, . . . , yₙ⁺)ᵀ with y_i⁺ = max(0, y_i), and we construct a reestimate of z = Wᵀy given by ẑ = Wᵀy⁺. Then a suitable cost function is given by

J(W) = E{‖z − ẑ‖²} = E{‖z − Wᵀy⁺‖²},  (1.2)
because obviously its value will be zero if W is such that all the y_i are positive, that is, y = y⁺. We considered the minimization of this cost function by various numerical algorithms in Plumbley (2003) and Plumbley and Oja (2004). In Plumbley (2003), explicit axis rotations as well as geodesic search over the Stiefel manifold of orthogonal matrices were used. In Plumbley and Oja (2004), the cost function 1.2 was taken as a special case of nonlinear PCA, for which an algorithm was suggested earlier by Oja (1997). However, a rigorous convergence proof for the nonlinear PCA method could not be constructed except in some special cases; general convergence appears to be a very challenging problem. The new key result shown in this article is that the cost function 1.2 has very desirable properties. In the Stiefel manifold of rotation matrices, the function has no local minima, and it is a Lyapunov function for its gradient matrix flow. The gradient algorithm suggested in the following is therefore monotonically converging and is guaranteed to find the absolute minimum of the cost function. The minimum is zero, giving positive components y_i, which by theorem 1 must be a positive permutation of the original unknown sources s_j. Some preliminary results along these lines were given in Oja and Plumbley (2003). In the next section, we present the whitening for non-zero-mean observations and further illustrate by a simple example why a rotation into positive outputs y_i will give the sources. In section 3, we consider the cost function
1.2 in more detail. Section 4 gives a gradient algorithm whose monotonic global convergence is proven. Section 5 relates this orthogonalized algorithm to the nonorthogonal "nonlinear PCA" learning rule previously introduced by one of the authors, illustrating both algorithms in a pictorial example. Finally, section 6 gives some conclusions.

2 Prewhitening and Axis Rotations

In order to reduce the ICA problem to one of finding the correct orthogonal rotation, the first stage in our ICA process is to whiten the observed data x. This gives

z = Vx,  (2.1)

where the n × n real whitening matrix V is chosen so that Σ_z = E{(z − z̄)(z − z̄)ᵀ} = I_n, with z̄ = E{z}. If E is the orthogonal matrix of eigenvectors of the data covariance matrix Σ_x = E{(x − x̄)(x − x̄)ᵀ} and D = diag(d₁, . . . , dₙ) is the diagonal matrix of corresponding eigenvalues, so that Σ_x = EDEᵀ and EᵀE = EEᵀ = I_n, then a suitable whitening matrix is V = Σ_x^{−1/2} = ED^{−1/2}Eᵀ, where

D^{−1/2} = diag(d₁^{−1/2}, . . . , dₙ^{−1/2}).
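A minimal numerical sketch of this whitening step (illustrative code, not the authors' implementation): the whitening matrix is built from the eigendecomposition of the sample covariance, and the covariance is taken about the mean while the mean itself is kept in the data:

```python
import numpy as np

# Whitening V = Sigma_x^{-1/2} = E D^{-1/2} E^T from the sample covariance.
# The mean is not subtracted from z itself (needed for nonnegative ICA),
# but the covariance is computed about the mean, so Cov(z) = I.

def whitening_matrix(X):
    """X: n x p data matrix, one observation per column."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Sigma = Xc @ Xc.T / X.shape[1]          # sample covariance Sigma_x
    d, E = np.linalg.eigh(Sigma)            # Sigma = E diag(d) E^T
    return E @ np.diag(d ** -0.5) @ E.T     # V = E D^{-1/2} E^T

rng = np.random.default_rng(1)
S = rng.uniform(0.0, 1.0, size=(2, 5000))   # nonnegative sources
A = rng.normal(size=(2, 2))                 # random mixing matrix
Z = whitening_matrix(A @ S) @ (A @ S)       # whitened observations

Zc = Z - Z.mean(axis=1, keepdims=True)
assert np.allclose(Zc @ Zc.T / Z.shape[1], np.eye(2))   # Cov(z) = I
```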
Σ_x is normally estimated from the sample covariance (Hyvärinen et al., 2001). Note that for nonnegative ICA, we do not remove the mean of the data, since this would lose information about the nonnegativity of the sources (Plumbley, 2002). Suppose that our sources s_j have unit variance, such that Σ_s = I_n, and let U = VA be the s-to-z transform. Then I_n = Σ_z = UΣ_sUᵀ = UUᵀ, so U is an orthonormal matrix. It therefore suffices to search for a further orthonormal matrix W such that y = Wz = WUs is a permutation of the original sources s. Figure 1 illustrates the process of whitening for nonnegative data in two dimensions. Whitening has succeeded in making the axes of the original sources orthogonal to each other (see Figure 1b), but there is a remaining orthonormal rotation ambiguity. A typical ICA algorithm might search for a rotation that makes the resulting outputs as nongaussian as possible, for example, by finding an extremum of kurtosis, since any sum of independent random variables will make the result "more gaussian" (Hyvärinen et al., 2001). However, Figure 1 immediately suggests another approach: we should search for a rotation where all the data fit into the positive quadrant. As long as the distribution of the original sources is "tight" down to the axes, it is intuitively clear that this will be a unique solution, apart from a permutation and scaling of the axes. This explains why theorem 1 works. Note also that after this rotation, the two sources s₁ and s₂ are indeed independent, even
Figure 1: Original data (a) are whitened (b) to remove second-order correlations.
if they are not zero mean, because the joint density (in this case, uniform on a square) is the product of the marginal (uniform) densities.

3 The Cost Function for Nonnegative BSS

In the following, we show that the minimum of the cost function 1.2 in the set of orthogonal (rotation) matrices will give the sources. For brevity of notation, let us denote the truncation nonlinearity by g⁺(y_i) = max(0, y_i), which is zero for negative y_i and equal to y_i otherwise. We are now ready to state and prove theorem 2:

Theorem 2. Assume the n-element random vector z is a whitened linear mixture of nonnegative well-grounded independent unit-variance sources s₁, . . . , sₙ, and y = Wz with W constrained to be a square orthogonal matrix. If W is obtained as the minimum of the cost function 1.2, rewritten as J(W) = E{‖z − Wᵀg⁺(Wz)‖²}, then the elements of y will be a permutation of the original sources s_i.

Proof.
Because W is square orthogonal, we get:
J(W) = E{‖z − Wᵀg⁺(Wz)‖²}  (3.1)
= E{‖Wz − WWᵀg⁺(Wz)‖²}  (3.2)
= E{‖y − g⁺(y)‖²}  (3.3)
= Σ_{i=1}^{n} E{[y_i − g⁺(y_i)]²}  (3.4)
= Σ_{i=1}^{n} E{min(0, y_i)²}  (3.5)
= Σ_{i=1}^{n} E{y_i² | y_i < 0} P(y_i < 0).  (3.6)
This is always nonnegative and becomes zero if and only if each y_i is nonnegative with probability one. Thus, if W is obtained as the minimum of the cost function, then the elements of y will be nonnegative. On the other hand, because y = Wz with W orthogonal, it also holds that y = Us with U orthogonal. Theorem 1 now implies that because the elements of y are nonnegative, they must be a permutation of the elements of s.

4 A Converging Gradient Algorithm

Theorem 2 leads us naturally to consider the use of a gradient algorithm for minimizing equation 3.1 under the orthogonality constraint. In order to derive the gradient, let us first write equation 3.1 in the simple form of equation 3.5:

J(w₁, . . . , wₙ) = Σ_{i=1}^{n} E{min(0, y_i)²},  y_i = w_iᵀz,  (4.1)

where the vectors w_iᵀ are the rows of matrix W and thus satisfy

w_iᵀw_j = δ_{ij}.  (4.2)

The gradient with respect to one of the vectors w_i is straightforward:

∂J/∂w_i = E{[∂ min(0, y_i)²/∂y_i][∂y_i/∂w_i]} = 2E{min(0, y_i)z}.  (4.3)
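The cost 4.1 and gradient 4.3 can be estimated from a sample batch. The sketch below (our own, with illustrative names; expectations are replaced by sample averages) checks the gradient formula against a finite difference:

```python
import numpy as np

# Sample-based versions of equation 4.1, J = sum_i E{min(0, y_i)^2} with
# y = W z, and equation 4.3, dJ/dw_i = 2 E{min(0, y_i) z}.

def cost(W, Z):
    Y = W @ Z
    return np.mean(np.sum(np.minimum(Y, 0.0) ** 2, axis=0))

def grad(W, Z):
    Y = W @ Z
    # Row i of the result is the sample estimate of 2 E{min(0, y_i) z^T}.
    return 2.0 * np.minimum(Y, 0.0) @ Z.T / Z.shape[1]

rng = np.random.default_rng(0)
Z = rng.normal(size=(2, 2000))
W = np.linalg.qr(rng.normal(size=(2, 2)))[0]   # a random orthogonal matrix

# Central finite difference on one entry of W confirms the gradient formula.
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 1] += eps
Wm[0, 1] -= eps
fd = (cost(Wp, Z) - cost(Wm, Z)) / (2.0 * eps)
assert np.isclose(grad(W, Z)[0, 1], fd, atol=1e-4)
```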
A possible way to do the constrained minimization by gradient descent is to divide each descent step into two parts: a step in the direction of the unconstrained gradient, followed by a consequent projection of the new point onto the constraint set 4.2. For the vectors w_i, this gives the update rule

w̃_i = w_i − γ ∂J/∂w_i  (4.4)
= w_i − 2γ E{min(0, y_i)z},  (4.5)

which is the unconstrained gradient descent step with step size γ, and

W = (W̃W̃ᵀ)^{−1/2} W̃,  (4.6)

which is the projection onto the constraint set of orthonormal w_iᵀ vectors, because equation 4.6 implies WWᵀ = I. Obviously, W̃ is the matrix with the vectors w̃_iᵀ as its rows. This is just one of the possibilities for performing the orthonormalization; another alternative would be the Gram-Schmidt algorithm. However, we prefer a symmetrical orthonormalization, as there is no reason to order the w_i vectors in any way. In matrix form, equation 4.5 reads

W̃ = W − 2γ E{fzᵀ},  (4.7)
where f = f(y) is the column vector with elements min(0, y_i). This is the gradient descent step. Now, to derive the projection in equation 4.6, we have

W̃W̃ᵀ = WWᵀ − 2γ E{Wzfᵀ + fzᵀWᵀ} + 4γ² E{fzᵀ}E{zfᵀ}
= I − 2γ E{Wzfᵀ + fzᵀWᵀ} + O(γ²).

Thus, assuming γ small,

(W̃W̃ᵀ)^{−1/2} = I + γ E{Wzfᵀ + fzᵀWᵀ} + O(γ²),

and finally

(W̃W̃ᵀ)^{−1/2} W̃ = W − 2γ E{fzᵀ} + γ E{Wzfᵀ + fzᵀWᵀ}W + O(γ²)
= W − γ E{fzᵀ} + γ E{yfᵀ}W + O(γ²).

Omitting the O(γ²) terms, the change from W at the previous step to the new W = (W̃W̃ᵀ)^{−1/2}W̃ at the next step is therefore ΔW = −γ E{fzᵀ − yfᵀW}, which can be further written in the form

ΔW = −γ E{fyᵀ − yfᵀ}W.  (4.8)
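A sample-based sketch of one step of rule 4.8 (our own illustration; the expectation is replaced by a batch average over the columns of Z). The matrix E{fyᵀ − yfᵀ} is skew-symmetric by construction:

```python
import numpy as np

# One step of equation 4.8, Delta W = -gamma * E{f y^T - y f^T} W,
# with f_i = min(0, y_i), estimated over a batch of whitened samples.

def update_step(W, Z, gamma=0.05):
    Y = W @ Z
    F = np.minimum(Y, 0.0)                   # f(y), elementwise
    M = (F @ Y.T - Y @ F.T) / Z.shape[1]     # batch estimate of E{f y^T - y f^T}
    return W - gamma * M @ W

rng = np.random.default_rng(2)
W = np.eye(2)
Z = rng.normal(size=(2, 100))

# Skew-symmetry of M, which keeps W near the orthogonal manifold.
Y = W @ Z
F = np.minimum(Y, 0.0)
M = (F @ Y.T - Y @ F.T) / Z.shape[1]
assert np.allclose(M, -M.T)
```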
Substituting min(0, y_i) for the elements of f(y) gives the gradient descent algorithm for the constrained problem. The form of the update rule 4.8 is the same as that derived by Cardoso and Laheld (1996) (although with different notation) for minimizing a general contrast under the orthogonality constraint. However, they project the unconstrained gradient onto the space of skew-symmetric matrices. It seems that from our derivation, higher-order terms with respect to the step size γ, and thus a more accurate projection, could be obtained more easily by continuing the series expansion. The skew-symmetric form of the matrix fyᵀ − yfᵀ in equation 4.8 ensures that W tends to stay orthogonal from step to step, although to fully guarantee orthogonality in a discrete-time gradient algorithm, an explicit orthonormalization of the rows of W should be done from time to time. Instead of analyzing this learning rule directly, let us look at the averaged differential equation corresponding to the discrete-time algorithm 4.8. It becomes

dW/dt = −MW,  (4.9)
where we have denoted the continuous-time deterministic solution also by W, and the elements µ_{ij} of the matrix M = E{fyᵀ − yfᵀ} are

µ_{ij} = E{min(0, y_i)y_j − y_i min(0, y_j)}.  (4.10)
Note that M is a nonlinear function of the solution W, because y = Wz. Yet we can formally write the solution of equation 4.9 as

W(t) = exp[−∫₀ᵗ M(s) ds] W(0).  (4.11)
The solution W(t) is always an orthogonal matrix if W(0) is orthogonal. This can be shown as follows:

W(t)W(t)ᵀ = exp[−∫₀ᵗ M(s) ds] W(0)W(0)ᵀ exp[−∫₀ᵗ M(s)ᵀ ds]
= exp[−∫₀ᵗ (M(s) + M(s)ᵀ) ds].
But matrix M is skew-symmetric; hence, M(s) + M(s)ᵀ = 0 for all s, and W(t)W(t)ᵀ = exp[0] = I. We can now analyze the stationary points of equation 4.9 and their stability in the class of orthogonal matrices. The stationary points (for which dW/dt = 0) are easily solved. They must be the roots of the equation MW = 0, which is equivalent to M = 0 because of the orthogonality of W. We see that if all y_i are positive or all of them are negative, then M = 0. Namely, if y_i and y_j are both positive, then min(0, y_i) and min(0, y_j) in equation 4.10 are both zero. If they are both negative, then min(0, y_i) = y_i and min(0, y_j) = y_j, and the two terms in equation 4.10 cancel out. Thus, in these two cases, W is a stationary point. The case when all y_i are positive corresponds to the minimum value (zero) of the cost function J(W). By theorem 1, y is then a permutation of s, which is the correct solution we are looking for. We would hope that this stationary point is the only stable one, because then the ordinary differential equation (ODE) will converge to it. The case when all the y_i are negative corresponds to the maximum value of J(W), equal to Σ_{i=1}^{n} E{y_i²} = n. As it is stationary too, we have to consider the case when it is taken as the initial value in the ODE. In all other cases, at least some of the y_i have opposite signs. Then M is not zero, and W is not stationary, as seen from equation 4.11. We could look at the local stability of the two stationary points. However, we can do even better and perform a global analysis. It turns out that equation 3.1 is in fact a Lyapunov function for the matrix flow 4.9; it is strictly decreasing whenever W changes according to the ODE 4.9, except at the stationary points. Let us prove this in the following.
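As a quick numerical aside (our own check, not part of the original argument), the orthogonality property used above — that the matrix exponential of a skew-symmetric matrix is orthogonal — can be verified with a truncated power series, which is adequate for small matrices:

```python
import numpy as np

# exp(-A) is orthogonal when A is skew-symmetric (A^T = -A), the fact
# used in equation 4.11 to show W(t) stays orthogonal.

def expm_series(M, terms=30):
    """Matrix exponential via truncated power series (fine for small norm)."""
    out, term = np.eye(M.shape[0]), np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

A = np.array([[0.0, 0.7], [-0.7, 0.0]])   # skew-symmetric
Q = expm_series(-A)
assert np.allclose(Q @ Q.T, np.eye(2), atol=1e-10)   # Q is orthogonal
```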
Theorem 3. If W follows the ODE 4.9, then dJ(W)/dt < 0, except at the point where all y_i are nonnegative or all are nonpositive.
Proof. Consider the ith term in the sum J(W), given in equation 3.5. Denoting it by e_i, we have e_i = E{min(0, y_i)²}, whose derivative with respect to y_i is 2E{min(0, y_i)}. If w_iᵀ is the ith row of matrix W, then y_i = w_iᵀz. Thus,

de_i/dt = (de_i/dy_i)(dy_i/dt) = 2E{min(0, y_i)(dw_iᵀ/dt)z}.  (4.12)

From the ODE 4.9, we get

dw_iᵀ/dt = −Σ_{k=1}^{n} µ_{ik} w_kᵀ,

with µ_{ik} given in equation 4.10. Substituting this in equation 4.12 gives

de_i/dt = −2 Σ_{k=1}^{n} µ_{ik} E{min(0, y_i)y_k}
= −2 Σ_{k=1}^{n} E²{min(0, y_i)y_k} + 2 Σ_{k=1}^{n} E{min(0, y_k)y_i} E{min(0, y_i)y_k}.

If we denote α_{ik} = E{min(0, y_i)y_k}, we have

dJ(W)/dt = Σ_{i=1}^{n} de_i/dt = 2[− Σ_{i=1}^{n} Σ_{k=1}^{n} α_{ik}² + Σ_{i=1}^{n} Σ_{k=1}^{n} α_{ik} α_{ki}].

By the Cauchy-Schwarz inequality, this is strictly negative unless α_{ik} = α_{ki} for all i, k, and thus J(W) is decreasing. We still have to look at the condition that α_{ik} = α_{ki} for all i, k and show that this implies nonnegativity or nonpositivity for all the y_i. Now, because y = Us with U orthogonal, each y_i is a projection of the positive source vector s on one of the n orthonormal rows u_iᵀ of U. If the vectors u_i are aligned with the original coordinate axes, then the projections of s on them are nonnegative. For any rotation that is not aligned with the coordinate axes, one of the vectors u_i (or −u_i) must be in the positive octant due to the orthonormality of the vectors. Without loss of generality, assume that this vector is u₁; then it holds that P(y₁ = u₁ᵀs ≥ 0) = 1 (or 0). But if P(y₁ ≥ 0) = 1, then min(0, y₁) = 0 and α_{1k} = E{min(0, y₁)y_k} = 0 for all k. If symmetry holds for the α_{ij}, then also α_{k1} = E{min(0, y_k)y₁} = E{y₁y_k | y_k ≤ 0}P(y_k ≤ 0) = 0. But y₁ is nonnegative, so P(y_k ≤ 0) must be zero too for all k. The same argument carries over to the case when P(y₁ ≥ 0) = 0, which implies that if one y_i is nonnegative, then all y_k must be nonnegative in the case of symmetrical α_{ij}.

The behavior of the learning rule 4.8 is now well understood. The function J(W) in equation 1.2 is a Lyapunov function for the averaged differential equation 4.9 for all orthogonal matrices W, except for the point with all outputs y_i nonpositive. Therefore, recalling that if W(0) is an orthogonal matrix then it is constrained to remain so, W(t) converges to the minimum of J(W) from almost everywhere in the Stiefel manifold of orthogonal matrices. For a discussion of optimization and learning on the Stiefel manifold, see Edelman, Arias, and Smith (1998) and Fiori (2001). This minimum corresponds to all y_i nonnegative. By theorem 1, these must be a permutation of the original sources s_j, which therefore have been found. The result was proven only for the continuous-time averaged version of the learning rule; the exact connection between this and the discrete-time online algorithm has been clarified in the theory of stochastic approximation (see Oja, 1983). In practice, even if the starting point happened to be the "bad" stationary point in which all y_i are nonpositive, numerical errors will deviate the solution from this point, and the cost function J(W) starts to decrease.

5 Experiments

We illustrate the operation of the algorithm using a blind image separation problem. We use the same images and performance measures as in Plumbley and Oja (2004), in which the nonlinear PCA algorithm (Oja, 1997) was used instead to solve the nonnegative ICA problem. As performance measures, we use a mean squared error e_MSE, an orthonormalization error e_Orth, and a permutation error e_Perm, defined as follows:

e_MSE = (1/np) Σ_{k=1}^{p} ‖z_k − Wᵀg⁺(y_k)‖²  (5.1)

e_Orth = (1/n²) ‖I − (WVA)ᵀWVA‖²_F  (5.2)

e_Perm = (1/n²) ‖I − abs(WVA)ᵀ abs(WVA)‖²_F,  (5.3)
where abs(M) returns the absolute value of each element of M, so that e_Perm = 0 only for a positive permutation matrix. The measures have been scaled (by 1/(np) or 1/n²) to allow more direct comparison between the results of simulations using different values of n and p.
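For readers who want to reproduce the two matrix-based measures, they can be computed in a few lines. The sketch below is our own illustration, not the authors' code, and assumes the combined source-to-output matrix M = WVA is available as a NumPy array.

```python
import numpy as np

def e_orth(M):
    """Orthonormalization error, eq. 5.2: (1/n^2) ||I - M^T M||_F^2."""
    n = M.shape[0]
    return np.linalg.norm(np.eye(n) - M.T @ M, "fro") ** 2 / n**2

def e_perm(M):
    """Permutation error, eq. 5.3: the same distance, computed on abs(M)."""
    n = M.shape[0]
    A = np.abs(M)
    return np.linalg.norm(np.eye(n) - A.T @ A, "fro") ** 2 / n**2

# A positive permutation matrix gives zero for both measures:
P = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
print(e_orth(P), e_perm(P))  # -> 0.0 0.0

# A 45-degree rotation is orthogonal (e_orth ~ 0) but far from a
# permutation (e_perm ~ 0.5):
c = np.sqrt(2) / 2
R = np.array([[c, -c], [c, c]])
print(e_orth(R), e_perm(R))
```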
Four image patches of size 252 × 252 were selected from a set of images of natural scenes and downsampled by a factor of 4 in both directions to yield 63 × 63 pixel images. Each of the n = 4 images was treated as one source, with its pixel values representing the p = 63 × 63 = 3969 samples. The source image values were shifted to have a minimum of zero, ensuring they were well grounded, and scaled to unit variance. After scaling, the source covariance matrix was found to be

⟨s s^T⟩ − s̄ s̄^T =
[  1.000   0.074  −0.003   0.050
   0.074   1.000  −0.071   0.160
  −0.003  −0.071   1.000   0.130
   0.050   0.160   0.130   1.000 ],   (5.4)

giving an acceptably small covariance between the images. A mixing matrix A was generated randomly and used to construct x = As. For the algorithm, the demixing matrix W was initialized to the identity matrix, ensuring initial orthogonality of W. Instead of algorithm 4.8, the theoretical expectation was replaced by a batch update method: denoting by X the observation matrix having all the data vectors x(1), . . . , x(T) as its columns and defining the matrix of outputs as Y = WX, we update W with the incremental change

ΔW = −(µ/p) { f(Y) Y^T − Y f(Y)^T } W.   (5.5)
A constant update factor of µ = 0.03 was used, and W was renormalized to an orthogonal matrix after each step using equation 4.6. Figure 2 shows the performance of learning over 2 × 10^4 steps, with Figure 3 showing the original, mixed, and separated images and their histograms. After 2 × 10^4 iteration steps, in which each step is one update following presentation of the batch of 3969 samples (1610 s, or about 27 min, of CPU time on an 850 MHz Pentium III), the source-to-output matrix WVA was found to be

WVA =
[  1.002   0.015  −0.040   0.026
  −0.099   0.067  −0.111   1.007
  −0.016   1.009  −0.089   0.018
   0.004  −0.056   1.021  −0.058 ],   (5.6)

with e_MSE = 6.07 × 10^−5, e_Orth = 8.88 × 10^−3, and e_Perm = 1.07 × 10^−2. The mean squared error and orthogonalization error are slightly better than for the nonnegative PCA algorithm, which gave 9.30 × 10^−5 and 9.02 × 10^−3, respectively, for the same number of iterations. The final permutation error for the same number of iterations is also smaller than the value obtained with the nonnegative PCA algorithm, which was 1.68 × 10^−2.
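The batch update of equation 5.5 is easy to sketch in code. The function below is our own hedged illustration, not the authors' implementation: it uses the truncation nonlinearity f(y) = min(0, y) of the nonnegative ICA algorithm and substitutes symmetric (SVD-based) orthogonalization for the renormalization of equation 4.6, whose exact form lies outside this excerpt.

```python
import numpy as np

def f(Y):
    """Truncation nonlinearity used by the algorithm: f(y) = min(0, y)."""
    return np.minimum(0.0, Y)

def batch_step(W, X, mu=0.03):
    """One batch update, eq. 5.5: Delta W = -(mu/p) {f(Y)Y^T - Y f(Y)^T} W,
    followed by reorthonormalization of W (here via the SVD)."""
    p = X.shape[1]
    Y = W @ X
    FY = f(Y)
    W = W - (mu / p) * (FY @ Y.T - Y @ FY.T) @ W
    # Project back onto the orthogonal group: W <- U V^T from W = U S V^T.
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt
```

Note that the update matrix f(Y)Yᵀ − Y f(Y)ᵀ is antisymmetric, so for small µ each step is approximately a rotation; this is why W stays close to the orthogonal group even before the explicit renormalization.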
[Figure 2 plots: upper panel, MSE versus iteration; lower panel, performance (ePerm and eOrth) versus iteration; both panels on logarithmic axes spanning 10^0 to 10^5 iterations.]
Figure 2: Simulation results on image data. Performance in the lower graph is measured as distance from permutation ePerm (upper curve) and distance from orthogonality eOrth (lower curve). See the text for definitions.
The lower bound on e_Orth and e_Perm is determined by the accuracy of the prewhitening stage: recall that prewhitening is estimated from the statistics of the input data, without any access to the original mixing matrix A. Calculating the equivalent error in VA from orthonormality,

e_White = (1/n²) || I − (VA)^T (VA) ||_F²,   (5.7)

we find e_White = 8.88 × 10^−3 = e_Orth to within machine accuracy (i.e., |e_Orth − e_White| < 10^−15), as we might expect, since W is orthonormalized at each iteration, so W W^T = I. Therefore, e_Perm is 20.3% above its lower bound for this algorithm, compared to 80.3% for the nonnegative PCA algorithm.

Figure 3: Images and histograms for the image separation task, showing (a) the original source images (b) and their histograms, (c, d) the mixed images and their histograms, and (e, f) the separated images and their histograms.

6 Discussion

We have considered the problem of nonnegative ICA, that is, independent component analysis where the sources are known to be nonnegative. Elsewhere, one of us introduced algorithms to solve this problem based on the use of orthogonal rotations, related to Stiefel manifold approaches (Plumbley, 2003). In this article, we considered a gradient-based algorithm operating on prewhitened data, related to the "nonlinear PCA" algorithms investigated by one of the authors (Oja, 1997, 1999). We refer to these algorithms, which use a truncation nonlinearity, as nonnegative PCA algorithms. By theoretical analysis of algorithm 4.8, we showed the key result of the article: asymptotically, as the learning rate becomes very small, the algorithm is guaranteed to find a permutation of the well-grounded nonnegative sources. Such a global convergence result is rare among ICA gradient methods. The convergence was verified experimentally for a small learning rate, using a set of positive images as sources with a random mixing matrix.

Acknowledgments

Patrik Hoyer kindly supplied the images used in section 5. Part of this work was undertaken while M.P. was visiting the Neural Networks Research Centre at the Helsinki University of Technology, supported by a Leverhulme Trust Study Abroad Fellowship. This work is also supported by grant GR/R54620 from the UK Engineering and Physical Sciences Research Council, as well as by the project New Information Processing Principles, 44886, of the Academy of Finland.

References

Amari, S.-I., & Cichocki, A. (2002). Adaptive blind signal and image processing. New York: Wiley.
Cardoso, J.-F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12), 3017–3030.
Charles, D., & Fyfe, C. (1998). Modelling multiple-cause structure using rectification constraints. Network: Computation in Neural Systems, 9, 167–182.
Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20(2), 303–353.
Fiori, S. (2001). A theory for learning by weight flow on Stiefel-Grassman manifold. Neural Computation, 13, 1625–1647.
Fyfe, C. (1994). Positive weights in interneurons. In G. Orchard (Ed.), Neural computing: Research and applications II. Proceedings of the Third Irish Neural Networks Conference, Belfast, Northern Ireland, 1–2 Sept 1993 (pp. 47–58). Belfast, NI: Irish Neural Networks Association.
Harpur, G. F. (1997). Low entropy coding with unsupervised neural networks. Unpublished doctoral dissertation, Cambridge University.
Henry, R. C. (2002). Multivariate receptor models—current practice and future trends. Chemometrics and Intelligent Laboratory Systems, 60(1–2), 43–48.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 788–791.
Lee, J. S., Lee, D. D., Choi, S., & Lee, D. S. (2001). Application of nonnegative matrix factorization to dynamic positron emission tomography. In T.-W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), Proceedings of the International Conference on Independent Component Analysis and Signal Separation (ICA2001), San Diego, California (pp. 629–632). San Diego, CA: Institute of Neural Computation, University of California, San Diego.
Oja, E. (1983). Subspace methods of pattern recognition. Baldock, U.K.: Research Studies Press, and New York: Wiley.
Oja, E. (1997). The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17(1), 25–46.
Oja, E. (1999). Nonlinear PCA criterion and maximum likelihood in independent component analysis. In Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'99) (pp. 143–148). Aussois, France: ICA 1999 Organizing Committee, Institut National Polytechnique de Grenoble, France.
Oja, E., & Plumbley, M. D. (2003). Blind separation of positive sources using nonnegative ICA. In Proc. Int. Workshop on Independent Component Analysis and Signal Separation (ICA'03). Kyoto, Japan: ICA 2003 Organizing Committee, NTT Communication Science Laboratories.
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A nonnegative factor model with optimal utilization of error estimates of data values. Environmetrics, 5, 111–126.
Parra, L., Spence, C., Sajda, P., Ziehe, A., & Müller, K.-R. (2000). Unmixing hyperspectral data. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 942–948). Cambridge, MA: MIT Press.
Plumbley, M. D. (2002). Conditions for nonnegative independent component analysis. IEEE Signal Processing Letters, 9(6), 177–180.
Plumbley, M. D. (2003). Algorithms for non-negative independent component analysis. IEEE Transactions on Neural Networks, 14(3), 534–543.
Plumbley, M. D., & Oja, E. (2004). A "non-negative PCA" algorithm for independent component analysis. IEEE Transactions on Neural Networks, 15(1), 66–76.
Tsuge, S., Shishibori, M., Kuroiwa, S., & Kita, K. (2001). Dimensionality reduction using non-negative matrix factorization for information retrieval. In IEEE International Conference on Systems, Man, and Cybernetics (Vol. 2, pp. 960–965). Piscataway, NJ: IEEE.
Xu, L. (1993). Least mean square error reconstruction principle for self-organizing neural-nets. Neural Networks, 6(5), 627–648.

Received July 9, 2003; accepted February 5, 2004.
LETTER
Communicated by Aapo Hyvärinen
A New Concept for Separability Problems in Blind Source Separation Fabian J. Theis
[email protected] Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
The goal of blind source separation (BSS) lies in recovering the original independent sources of a mixed random vector without knowing the mixing structure. A key ingredient for performing BSS successfully is to know the indeterminacies of the problem—that is, to know how the separating model relates to the original mixing model (separability). For linear BSS, Comon (1994) showed using the Darmois-Skitovitch theorem that the linear mixing matrix can be found except for permutation and scaling. In this work, a much simpler, direct proof for linear separability is given. The idea is based on the fact that a random vector is independent if and only if the Hessian of its logarithmic density (resp. characteristic function) is diagonal everywhere. This property is then exploited to propose a new algorithm for performing BSS. Furthermore, first ideas of how to generalize separability results based on Hessian diagonalization to more complicated nonlinear models are studied in the setting of postnonlinear BSS.

1 Introduction

In independent component analysis (ICA), one tries to find statistically independent data within a given random vector. An application of ICA lies in blind source separation (BSS), where it is furthermore assumed that the given vector has been mixed using a fixed set of independent sources. The advantage of applying ICA algorithms to BSS problems, in contrast to correlation-based algorithms, is that ICA tries to make the output signals as independent as possible by also including higher-order statistics. Since the introduction of independent component analysis by Hérault and Jutten (1986), various algorithms have been proposed to solve the BSS problem (Comon, 1994; Bell & Sejnowski, 1995; Hyvärinen & Oja, 1997; Theis, Jung, Puntonet, & Lang, 2002). Good textbook-level introductions to ICA are given in Hyvärinen, Karhunen, and Oja (2001) and Cichocki and Amari (2002).
Neural Computation 16, 1827–1850 (2004) © 2004 Massachusetts Institute of Technology

Separability of linear BSS states that, under weak conditions on the sources, the mixing matrix is determined uniquely by the mixtures except for permutation and scaling, as shown by Comon (1994) using the Darmois-Skitovitch theorem. We propose a direct proof based on the concept of
separated functions, that is, functions that can be split into a product of one-dimensional functions (see definition 1). If the function is positive, this is equivalent to the fact that its logarithm has a diagonal Hessian everywhere (see lemma 1 and theorem 1). A similar lemma has been shown by Lin (1998) for what he calls block diagonal Hessians. However, he omits discussion of the separatedness of densities with zeros, which plays a minor role for the separation algorithm he is interested in but is important for deriving separability. Using separatedness of the density, respectively, the characteristic function (Fourier transformation), of the random vector, we can then show separability directly (in two slightly different settings, for which we provide a common framework). Based on this result, we propose an algorithm for linear BSS by diagonalizing the logarithmic density of the Hessian. We recently found that this algorithm has already been proposed (Lin, 1998), but without considering the necessary assumptions for successful algorithm application. Here we give precise conditions for when to apply this algorithm (see theorem 3) and show that points satisfying these conditions can indeed be found if the sources contain at most one gaussian component (see lemma 5). Lin uses a discrete approximation of the derivative operator to approximate the Hessian. We suggest using kernel-based density estimation, which can be directly differentiated. A similar algorithm based on Hessian diagonalization has been proposed by Yeredor (2000) using the characteristic function of a random vector. However, the characteristic function is complex valued, and additional care has to be taken when applying a complex logarithm. Basically, this is well defined locally only at nonzeros. In algorithmic terms, the characteristic function can be easily approximated by samples (which is equivalent to our kernel-based density approximation using gaussians before Fourier transformation). 
Yeredor suggests joint diagonalization of the Hessian of the logarithmic characteristic function (which is problematic because of the nonuniqueness of the complex logarithm) evaluated at several points in order to avoid the locality of the algorithm. Instead of joint diagonalization, we use a combined energy function based on the previously defined separator, which also takes into account global information but does not have the drawback of being singular at zeros of the density, respectively, characteristic function. Thus, the algorithmic part of this article can be seen as a general framework for the algorithms proposed by Lin (1998) and Yeredor (2000).

Section 2 introduces separated functions, giving local characterizations of the densities of independent random vectors. Section 3 then introduces the linear BSS model and states the well-known separability result. After giving an easy and short proof in two dimensions with positive densities, we provide a characterization of gaussians in terms of a differential equation and provide the general proof. The BSS algorithm based on finding separated densities is proposed and studied in section 4. We finish with a generalization of the separability result to the postnonlinear mixture case in section 5.
2 Separated and Linearly Separated Functions

Definition 1. A function f : R^n → C is said to be separated, respectively, linearly separated, if there exist one-dimensional functions g1, . . . , gn : R → C such that f(x) = g1(x1) · · · gn(xn), respectively, f(x) = g1(x1) + · · · + gn(xn), for all x ∈ R^n.

Note that the functions gi are uniquely determined by f up to a scalar factor, respectively, an additive constant. If f is linearly separated, then exp f is separated. Obviously, the density function of an independent random vector is separated. For brevity, we often use the tensor product and write f ≡ g1 ⊗ · · · ⊗ gn for separated f, where for any functions h, k defined on a set U, h ≡ k if h(x) = k(x) for all x ∈ U.

Separatedness can also be defined on any open parallelepiped (a1, b1) × · · · × (an, bn) ⊂ R^n in the obvious way. We say that f is locally separated at x ∈ R^n if there exists an open parallelepiped U such that x ∈ U and f|U is separated. If f is separated, then f is obviously everywhere locally separated. The converse, however, does not necessarily hold, as shown in Figure 1.
Figure 1: Density of a random vector S with a locally but not globally separated density. Here, pS := c χ_{[−2,2]×[−2,0] ∪ [0,2]×[1,3]}, where χ_U denotes the function that is 1 on U and 0 everywhere else. Obviously, pS is not separated globally, but it is separated when restricted to squares of side length < 1. Plotted is a smoothed version of pS.
The function f is said to be positive if f is real and f(x) > 0 for all x ∈ R^n, and nonnegative if f is real and f(x) ≥ 0 for all x ∈ R^n. A positive function f is separated if and only if ln f is linearly separated. Let C^m(U, V) be the ring of all m-times continuously differentiable functions from U ⊂ R^n to V ⊂ C, U open. For a C^m-function f, we write ∂_{i1} · · · ∂_{im} f := ∂^m f / ∂x_{i1} · · · ∂x_{im} for the m-fold partial derivatives. If f ∈ C²(R^n, C), denote by H_f(x) := (∂_i ∂_j f(x))_{i,j=1}^{n} the symmetric (n × n) Hessian of f at x ∈ R^n. Linearly separated functions can be classified using their Hessian (if it exists):

Lemma 1. A function f ∈ C²(R^n, C) is linearly separated if and only if H_f(x) is diagonal for all x ∈ R^n.

A similar lemma for block diagonal Hessians has been shown by Lin (1998).

Proof. If f is linearly separated, its Hessian is obviously diagonal everywhere by definition. Assume the converse. We prove that f is linearly separated by induction over the dimension n. For n = 1, the claim is trivial. Now assume that we have shown the lemma for n − 1. By the induction assumption, f(x1, . . . , xn−1, 0) is linearly separated, so f(x1, . . . , xn−1, 0) = g1(x1) + · · · + gn−1(xn−1) for all xi ∈ R and some functions gi on R. Note that gi ∈ C²(R, C). Define a function h : R → C by h(y) := ∂_n f(x1, . . . , xn−1, y), y ∈ R, for fixed x1, . . . , xn−1 ∈ R. Note that h is independent of the choice of the xi, because ∂_n ∂_i f ≡ ∂_i ∂_n f is zero everywhere, so xi ↦ ∂_n f(x1, . . . , xn−1, y) is constant for fixed xj, y ∈ R, j ≠ i. By definition, h ∈ C¹(R, C), so h is integrable on compact intervals. Define k : R → C by k(y) := ∫₀^y h. Then f(x1, . . . , xn) = g1(x1) + · · · + gn−1(xn−1) + k(xn) + c, where c ∈ C is a constant, because both functions have the same derivative and R^n is connected. If we set gn := k + c, the claim follows.

This lemma also holds for functions defined on any open parallelepiped (a1, b1) × · · · × (an, bn) ⊂ R^n. Hence, an arbitrary real-valued C²-function f is locally separated at x with f(x) ≠ 0 if and only if the Hessian of ln |f| is locally diagonal.

For a positive function f, the Hessian of its logarithm is diagonal everywhere if it is separated, and it is easy to see that for positive f, the converse
also holds globally (see theorem 1(ii)). In this case, we have for i ≠ j,

0 ≡ ∂_i ∂_j ln f ≡ (f ∂_i ∂_j f − (∂_i f)(∂_j f)) / f²,

so f is separated if and only if f ∂_i ∂_j f ≡ (∂_i f)(∂_j f) for i ≠ j, or even just i < j. This motivates the following definition:

Definition 2. For i ≠ j, the operator

R_ij : C²(R^n, C) → C⁰(R^n, C),  f ↦ R_ij[f] := f ∂_i ∂_j f − (∂_i f)(∂_j f)

is called the ij-separator.

Theorem 1. Let f ∈ C²(R^n, C).

i. If f is separated, then R_ij[f] ≡ 0 for i ≠ j or, equivalently,

f ∂_i ∂_j f ≡ (∂_i f)(∂_j f)   (2.1)

holds for i ≠ j.

ii. If f is positive and R_ij[f] ≡ 0 holds for all i ≠ j, then f is separated.

If f is assumed to be only nonnegative, then f is locally separated but not necessarily globally separated (if the support of f has more than one component). See Figure 1 for an example of a nonseparated density with R_12[f] ≡ 0.

Proof of Theorem 1. i. If f is separated, then f(x) = g1(x1) · · · gn(xn), or, in short, f ≡ g1 ⊗ · · · ⊗ gn, so ∂_i f ≡ g1 ⊗ · · · ⊗ g_{i−1} ⊗ gi′ ⊗ g_{i+1} ⊗ · · · ⊗ gn and ∂_i ∂_j f ≡ g1 ⊗ · · · ⊗ gi′ ⊗ · · · ⊗ gj′ ⊗ · · · ⊗ gn for i < j. Hence equation 2.1 holds.

ii. Now assume the converse and let f be positive. Then, according to the remarks after lemma 1, H_{ln f}(x) is diagonal everywhere, so lemma 1 shows that ln f is linearly separated; hence, f is separated.
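Theorem 1(i) is easy to check numerically. The following finite-difference sketch (our own illustration, not from the article) approximates R_12[f] at a point: it is close to zero for a separated product of two one-dimensional functions and clearly nonzero for a nonseparated one.

```python
import numpy as np

def separator_12(f, x, h=1e-4):
    """Finite-difference approximation of the separator
    R_12[f](x) = f d1 d2 f - (d1 f)(d2 f) at a point x in R^2."""
    x = np.asarray(x, dtype=float)
    e1, e2 = np.array([h, 0.0]), np.array([0.0, h])
    d1 = (f(x + e1) - f(x - e1)) / (2 * h)
    d2 = (f(x + e2) - f(x - e2)) / (2 * h)
    d12 = (f(x + e1 + e2) - f(x + e1 - e2)
           - f(x - e1 + e2) + f(x - e1 - e2)) / (4 * h * h)
    return f(x) * d12 - d1 * d2

sep = lambda x: np.exp(-x[0] ** 2) * np.exp(-(x[1] - 1.0) ** 2)  # separated
mixed = lambda x: np.exp(-(x[0] + x[1]) ** 2)                    # not separated

print(separator_12(sep, [0.3, 0.2]))    # ~0 (up to discretization error)
print(separator_12(mixed, [0.3, 0.2]))  # clearly nonzero
```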
Some trivial properties of the separator Rij are listed in the next lemma: Lemma 2.
Let f, g ∈ C²(R^n, C), i ≠ j, and α ∈ C. Then
R_ij[αf] = α² R_ij[f] and

R_ij[f + g] = R_ij[f] + R_ij[g] + f ∂_i ∂_j g + g ∂_i ∂_j f − (∂_i f)(∂_j g) − (∂_i g)(∂_j f).

3 Separability of Linear BSS

Consider the noiseless linear instantaneous BSS model with as many sources as sensors:

X = AS,
(3.1)
with an independent n-dimensional random vector S and A ∈ Gl(n). Here, Gl(n) denotes the general linear group of R^n, that is, the group of all invertible (n × n)-matrices. The task of linear BSS is to find A and S given only X. An obvious indeterminacy of this problem is that A can be found only up to scaling and permutation, because for a scaling matrix L and a permutation matrix P, X = ALPP⁻¹L⁻¹S, and P⁻¹L⁻¹S is also independent. Here, an invertible matrix L ∈ Gl(n) is said to be a scaling matrix if it is diagonal. We say two matrices B, C are equivalent, B ∼ C, if C can be written as C = BPL with a scaling matrix L ∈ Gl(n) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n). Note that PL = L′P for some scaling matrix L′ ∈ Gl(n), so the order of the permutation and the scaling matrix does not play a role for equivalence. Furthermore, if B ∈ Gl(n) with B ∼ I, then also B⁻¹ ∼ I, and, more generally, if BC ∼ A, then C ∼ B⁻¹A. According to the above, solutions of linear BSS are equivalent. We will show that under mild assumptions on S, there are no further indeterminacies of linear BSS. S is said to have a gaussian component if one of the Si is a one-dimensional gaussian, that is, pSi(x) = d exp(−ax² + bx + c) with a, b, c, d ∈ R, a > 0, and S has a deterministic component if one Si is deterministic, that is, constant.

Theorem 2 (Separability of linear BSS). Let A ∈ Gl(n) and S be an independent random vector. Assume one of the following:

i. S has at most one gaussian or deterministic component, and the covariance of S exists.
ii. S has no gaussian component, and its density pS exists and is twice continuously differentiable.

Then if AS is again independent, A is equivalent to the identity. So A is the product of a scaling and a permutation matrix.

The important part of this theorem is assumption i, which has been used to show separability by Comon (1994) and extended by Eriksson and Koivunen (2003) based on the Darmois-Skitovitch theorem (Darmois, 1953; Skitovitch, 1953). Using this theorem, the second part can be easily shown without C²-densities. Theorem 2 indeed proves separability of the linear BSS model, because if X = AS and W is a demixing matrix such that WX is independent, then WA ∼ I, so W⁻¹ ∼ A as desired. We will give a much easier proof without having to use the Darmois-Skitovitch theorem in the following sections.

3.1 Two-Dimensional Positive Density Case. For illustrative purposes, we will first prove separability for a two-dimensional random vector S with positive density pS ∈ C²(R², R). Let A ∈ Gl(2). It is enough to show that if S and AS are independent, then either A ∼ I or S is gaussian. S is assumed to be independent, so its density factorizes: pS(s) = g1(s1) g2(s2) for s ∈ R². First, note that the density of AS is given by

pAS(x) = |det A|⁻¹ pS(A⁻¹x) = c g1(b11 x1 + b12 x2) g2(b21 x1 + b22 x2)

for x ∈ R², with c ≠ 0 fixed. Here, B = (bij) = A⁻¹. AS is also assumed to be independent, so pAS(x) is separated. pS was assumed to be positive; then so is pAS. Hence, ln pAS(x) is linearly separated, so

∂1 ∂2 ln pAS(x) = b11 b12 h1″(b11 x1 + b12 x2) + b21 b22 h2″(b21 x1 + b22 x2) = 0

for all x ∈ R², where hi := ln gi ∈ C²(R, R). By setting y := Bx, we therefore have

b11 b12 h1″(y1) + b21 b22 h2″(y2) = 0   (3.2)

for all y ∈ R², because B is invertible. Now, if A (and therefore also B) is equivalent to the identity, then equation 3.2 holds. If not, then A, and hence also B, has at least three nonzero entries. By equation 3.2, the fourth entry has to be nonzero as well, because the hi″ are not identically zero (otherwise gi(yi) = exp(a yi + b), which is not integrable). Furthermore, b11 b12 h1″(y1) = −b21 b22 h2″(y2) for all y ∈ R², so the hi″ are constant, say, hi″ ≡ ci, with ci ≠ 0, as noted above. Therefore, the hi are polynomials of degree 2, and the gi = exp hi are gaussians (ci < 0 because of the integrability of the gi).

3.2 Characterization of Gaussians. In this section, we show that among all densities, respectively, characteristic functions, the gaussians satisfy a special differential equation.

Lemma 3. Let f ∈ C²(R, C) and a ∈ C with

a f² − f f″ + f′² ≡ 0.   (3.3)

Then either f ≡ 0 or f(x) = exp((a/2) x² + bx + c), x ∈ R, with constants b, c ∈ C.
Proof. Assume f ≢ 0. Let x0 ∈ R with f(x0) ≠ 0. Then there exists a nonempty interval U := (r, s) containing x0 such that a complex logarithm log is defined on f(U). Set g := log f|U. Substituting exp g for f in equation 3.3 yields

a exp(2g) − exp(g)(g″ + g′²) exp(g) + g′² exp(2g) ≡ 0,

and therefore g″ ≡ a. Hence, g is a polynomial of degree ≤ 2 with leading coefficient a/2. Furthermore,

lim_{x→r+} f(x) ≠ 0 and lim_{x→s−} f(x) ≠ 0,

so f has no zeros at all, by continuity. The argument above with U = R shows the claim.

If, furthermore, f is real, nonnegative, and integrable with integral 1 (e.g., if f is the density of a random variable), then f has to be the exponential of a real-valued polynomial of degree precisely 2; otherwise, it would not be integrable. So we have the following corollary:
Corollary 1. Let X be a random variable with twice continuously differentiable density pX satisfying equation 3.3. Then X is gaussian.

If we do not want to assume that the random variable has a density, we can use its characteristic function (Bauer, 1996) instead to show an equivalent result:

Corollary 2. Let X be a random variable with twice continuously differentiable characteristic function X̂(x) := E_X(exp(ixX)) satisfying equation 3.3. Then X is gaussian or deterministic.

Proof. Using X̂(0) = 1, lemma 3 shows that X̂(x) = exp((a/2)x² + bx). Moreover, since X̂(−x) equals the complex conjugate of X̂(x), we get a ∈ R and b = ib′ with real b′. And |X̂| ≤ 1 shows that a ≤ 0. So if a = 0, then X is deterministic (at b′), and if a ≠ 0, then X has a gaussian distribution with mean b′ and variance −a⁻¹.
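As a numerical sanity check of lemma 3 (our own illustration, not from the article): for f(x) = exp((a/2)x² + bx + c) with arbitrary constants, the left-hand side of equation 3.3, a f² − f f″ + f′², vanishes identically, which a finite-difference evaluation confirms up to discretization error.

```python
import numpy as np

# Arbitrary constants (a < 0 would also give an integrable, i.e. gaussian, f):
a, b, c = -1.5, 0.7, 0.2
f = lambda x: np.exp(0.5 * a * x ** 2 + b * x + c)

def residual(x, h=1e-4):
    """a f^2 - f f'' + f'^2 at x (eq. 3.3), via central finite differences."""
    fp = (f(x + h) - f(x - h)) / (2 * h)
    fpp = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2
    return a * f(x) ** 2 - f(x) * fpp + fp ** 2

print(max(abs(residual(x)) for x in np.linspace(-2.0, 2.0, 9)))  # ~0
```

Analytically, f′ = (ax + b)f and f″ = (a + (ax + b)²)f, so the three terms cancel exactly.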
3.3 Proof of Theorem 2. We will now prove linear separability; for this, we will use separatedness to show that some source components have to be gaussian (using the results from above) if the mixing matrix is not trivial. The main argument is given in the following lemma:

Lemma 4. Let gi ∈ C²(R, C) and B ∈ Gl(n) such that f(x) := g1 ⊗ · · · ⊗ gn(Bx) is separated. Then for all indices l and i ≠ j with b_li b_lj ≠ 0, gl satisfies the differential equation 3.3 with some constant a.

Proof. f is separated, so by theorem 1(i),

R_ij[f] ≡ f ∂_i ∂_j f − (∂_i f)(∂_j f) ≡ 0   (3.4)

holds for i < j. The ingredients of this equation can be calculated for i < j as follows:

∂_i f(x) = Σ_k b_ki (g1 ⊗ · · · ⊗ gk′ ⊗ · · · ⊗ gn)(Bx)

(∂_i f)(∂_j f)(x) = Σ_{k,l} b_ki b_lj ((g1 ⊗ · · · ⊗ gk′ ⊗ · · · ⊗ gn)(g1 ⊗ · · · ⊗ gl′ ⊗ · · · ⊗ gn))(Bx)

∂_i ∂_j f(x) = Σ_k b_ki ( b_kj (g1 ⊗ · · · ⊗ gk″ ⊗ · · · ⊗ gn) + Σ_{l≠k} b_lj (g1 ⊗ · · · ⊗ gk′ ⊗ · · · ⊗ gl′ ⊗ · · · ⊗ gn) )(Bx).
Putting this in equation 3.4 yields

0 = (f ∂_i ∂_j f − (∂_i f)(∂_j f))(x)
  = Σ_k b_ki b_kj ((g1 ⊗ · · · ⊗ gn)(g1 ⊗ · · · ⊗ gk″ ⊗ · · · ⊗ gn) − (g1 ⊗ · · · ⊗ gk′ ⊗ · · · ⊗ gn)²)(Bx)
  = Σ_k b_ki b_kj (g1² ⊗ · · · ⊗ g_{k−1}² ⊗ (gk gk″ − gk′²) ⊗ g_{k+1}² ⊗ · · · ⊗ gn²)(Bx)

for x ∈ R^n, the cross terms with k ≠ l canceling pairwise. B is invertible, so the whole function is zero:

Σ_k b_ki b_kj g1² ⊗ · · · ⊗ g_{k−1}² ⊗ (gk gk″ − gk′²) ⊗ g_{k+1}² ⊗ · · · ⊗ gn² ≡ 0.   (3.5)

Choose x ∈ R^n with gk(xk) ≠ 0 for k = 1, . . . , n. Evaluating equation 3.5 at (x1, . . . , x_{l−1}, y, x_{l+1}, . . . , xn) for variable y ∈ R and dividing the resulting one-dimensional equation by the constant g1²(x1) · · · g_{l−1}²(x_{l−1}) g_{l+1}²(x_{l+1}) · · · gn²(xn) shows

b_li b_lj (gl gl″ − gl′²)(y) = − [ Σ_{k≠l} b_ki b_kj ((gk gk″ − gk′²)/gk²)(xk) ] gl²(y)   (3.6)

for y ∈ R. So for indices l and i ≠ j with b_li b_lj ≠ 0, it follows from equation 3.6 that there exists a ∈ C such that gl satisfies the differential equation a gl² − gl gl″ + gl′² ≡ 0, that is, equation 3.3.

Proof of Theorem 2. i. S is assumed to have at most one gaussian or deterministic component and existing covariance. Set X := AS. We first show using whitening that A can be assumed to be orthogonal. For this, we can assume S and X to have no deterministic component at all (because arbitrary choice of the matrix coefficients of the deterministic components does not change the covariance). Hence, by assumption, Cov(X) is diagonal and positive definite, so let D1 be diagonal invertible with Cov(X) = D1². Similarly, let D2 be diagonal invertible with Cov(S) = D2². Set Y := D1⁻¹ X and T := D2⁻¹ S, that is, normalize X and S to covariance I. Then

Y = D1⁻¹ X = D1⁻¹ A S = D1⁻¹ A D2 T,

and T, D1⁻¹ A D2, and Y satisfy the assumption, and D1⁻¹ A D2 is orthogonal because

I = Cov(Y) = E(Y Y^T) = E(D1⁻¹ A D2 T T^T D2 A^T D1⁻¹) = (D1⁻¹ A D2)(D1⁻¹ A D2)^T.

So without loss of generality, let A be orthogonal. Now let Ŝ(s) := E_S(exp(i s^T S)) be the characteristic function of S. By assumption, the covariance (and hence the mean) of S exists, so Ŝ ∈ C²(R^n, C) (Bauer, 1996). Furthermore, since S is assumed to be independent, its characteristic function is separated: Ŝ ≡ g1 ⊗ · · · ⊗ gn, where gi ≡ Ŝi. The characteristic function of AS can easily be calculated as

E_S(exp(i x^T A S)) = Ŝ(A^T x) = g1 ⊗ · · · ⊗ gn(A^T x)

for x ∈ R^n. Let B := (bij) = A^T. Since AS is also assumed to be independent, its characteristic function f(x) := g1 ⊗ · · · ⊗ gn(Bx) is separated. Now assume that A is not equivalent to I. Using the orthogonality of B = A^T, there exist indices k ≠ l and i ≠ j with b_ki b_kj ≠ 0 and b_li b_lj ≠ 0. Then, according to lemma 4, gk and gl satisfy the differential equation 3.3. Together with corollary 2, this shows that both Sk and Sl are gaussian, which contradicts the assumption.

ii. Let S be an n-dimensional independent random vector with density pS ∈ C²(R^n, R) and no gaussian component, and let A ∈ Gl(n). S is assumed to be independent, so its density factorizes: pS ≡ g1 ⊗ · · · ⊗ gn. The density of AS is given by pAS(x) = |det A|⁻¹ pS(A⁻¹x) = |det A|⁻¹ g1 ⊗ · · · ⊗ gn(A⁻¹x) for x ∈ R^n. Let B := (bij) = A⁻¹. AS is also assumed to be independent, so f(x) := |det A| pAS(x) = g1 ⊗ · · · ⊗ gn(Bx) is separated. Assume A is not equivalent to I. Then also B = A⁻¹ is not equivalent to I, so there exist indices l and i ≠ j with b_li b_lj ≠ 0. Hence, it follows from lemma 4 that gl satisfies the differential equation 3.3. But gl is a density, so according to corollary 1, the lth component of S is gaussian, which is a contradiction.
4 BSS by Hessian Diagonalization In this section, we use the theory already set out to propose an algorithm for linear BSS, which can be easily extended to nonlinear settings as well. For this, we restrict ourselves to using C 2 -densities. A similar idea has already been proposed in Lin (1998), but without dealing with possibly degenerated eigenspaces in the Hessian. Equivalently, we could also use characteristic functions instead of densities, which leads to a related algorithm (Yeredor, 2000). If we assume that Cov(S) exists, we can use whitening as seen in the proof of theorem 2i (in this context, also called principal component analysis) to reduce the general BSS model, equation 3.2, to X = AS
(4.1)
with an independent n-dimensional random vector S with existing covariance Cov(S) = I and an orthogonal matrix A. Then Cov(X) = I. We assume that S admits a C²-density p_S. The density of X is then given by p_X(x) = p_S(A^T x) for x ∈ R^n, because of the orthogonality of A. Hence, p_S ≡ p_X ◦ A. Note that the Hessian of the composition of a function f ∈ C²(R^n, R) with an n × n matrix A can be calculated from the Hessian of f as H_{f◦A}(x) = A^T H_f(Ax) A. Let s ∈ R^n with p_S(s) > 0. Then locally at s, we have

H_{ln p_S}(s) = H_{ln p_X ◦ A}(s) = A^T H_{ln p_X}(As) A.
(4.2)
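The whitening preprocessing invoked above (which reduces the general model, equation 3.2, to the orthogonal model, equation 4.1) can be sketched as follows; the function and variable names are our own illustration, not from the text:

```python
import numpy as np

def whiten(X):
    """Whiten the mixtures so that the covariance of the result is I.

    After this preprocessing, the remaining mixing matrix in X = A S
    can be taken to be orthogonal, as in equation 4.1.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center
    C = np.cov(Xc)                           # sample covariance
    d, E = np.linalg.eigh(C)                 # C = E diag(d) E^T
    W = E @ np.diag(d ** -0.5) @ E.T         # W = C^(-1/2)
    return W @ Xc, W

# example: two uniform sources under an arbitrary invertible mixing
rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, size=(2, 50_000))
X = np.array([[2.0, 0.5], [1.0, 3.0]]) @ S
Xw, W = whiten(X)    # Cov(Xw) is the identity up to numerical error
```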
p_S is assumed to be separated, so H_{ln p_S}(s) is diagonal, as seen in section 2.

Lemma 5. Let X := AS with an orthogonal matrix A and S an independent random vector with C²-density and at most one gaussian component. Then there exists an open set U ⊂ R^n such that for all x ∈ U, p_X(x) ≠ 0 and H_{ln p_X}(x) has n different eigenvalues.

Proof. Assume not. Then there exists no x ∈ R^n at all with p_X(x) ≠ 0 and H_{ln p_X}(x) having n different eigenvalues, because otherwise, due to continuity, these conditions would also hold in an open neighborhood of x.
A New Concept for Separability Problems in BSS
1839
Using equation 4.2, the logarithmic Hessian of p_S therefore has, at every s ∈ R^n with p_S(s) > 0, at least two equal eigenvalues, say λ(s) ∈ R. Since S is independent, H_{ln p_S}(s) is diagonal, so locally (ln p_{S_i})''(s_i) = (ln p_{S_j})''(s_j) = λ(s) for two indices i ≠ j. Here, we have used continuity of s → H_{ln p_S}(s), showing that the two equal eigenvalues locally lie in the same two dimensions i and j. This proves that λ(s) is locally constant in directions i and j. So locally at points s with p_S(s) > 0, p_{S_i} and p_{S_j} are of the type exp P, with P a polynomial of degree ≤ 2. The same argument as in the proof of lemma 3 then shows that p_{S_i} and p_{S_j} have no zeros at all. Using the connectedness of R proves that S_i and S_j are globally of the type exp P, hence gaussian (because ∫_R p_{S_k} = 1), which is a contradiction.

Hence, we can assume that we have found x^(0) ∈ R^n with H_{ln p_X}(x^(0)) having n different eigenvalues (which is equivalent to saying that every eigenvalue has multiplicity one); due to lemma 5, this is an open condition, and such a point can be found algorithmically. In fact, most densities in practice turn out to have logarithmic Hessians with n different eigenvalues almost everywhere. In theory, however, U in lemma 5 cannot be assumed to be, for example, dense, or R^n \ U to have measure zero: if we choose p_{S_1} to be a normalized gaussian and p_{S_2} to be a normalized gaussian with a very localized small perturbation at zero only, then U cannot be larger than (−ε, ε) × R. By diagonalization of H_{ln p_X}(x^(0)) using eigenvalue decomposition (principal axis transformation), we can find the (orthogonal) mixing matrix A. Note that the eigenvalue decomposition is unique except for permutation and sign because every eigenspace (in which A is only unique up to orthogonal transformation) has dimension one. Arbitrary scaling indeterminacy does not occur because we have forced S and X to have unit variances.
Using the uniqueness of the eigenvalue decomposition and theorem 2, we have shown the following theorem:

Theorem 3 (BSS by Hessian calculation). Let X = AS with an independent random vector S and an orthogonal matrix A. Let x ∈ R^n such that locally at x, X admits a C²-density p_X with p_X(x) ≠ 0. Assume that H_{ln p_X}(x) has n different eigenvalues (see lemma 5). If E^T H_{ln p_X}(x) E = D is an eigenvalue decomposition of the Hessian of the logarithm of p_X at x, that is, E orthogonal and D diagonal, then E ∼ A, so E^T X is independent.
Furthermore, it follows from this theorem that linear BSS is a local problem, as proven already in Theis, Puntonet, and Lang (2003) using the restriction of a random vector.

4.1 Example for Hessian Diagonalization BSS. In order to illustrate the algorithm of local Hessian diagonalization, we give a two-dimensional example. Let S be a random vector with densities

p_{S_1}(s_1) = (1/2) χ_[−1,1](s_1),
p_{S_2}(s_2) = (1/√(2π)) exp(−s_2²/2),

where χ_[−1,1] is one on [−1, 1] and zero everywhere else. The orthogonal mixing matrix A is chosen to be

A = (1/√2) ( 1 1 ; −1 1 )

(rows separated by semicolons). The mixture density p_X of X := AS is then (det A = 1)

p_X(x) = (1/(2√(2π))) χ_[−1,1]((x_1 − x_2)/√2) exp(−(x_1 + x_2)²/4)

for x ∈ R². p_X is positive and C² in a neighborhood of 0. There,

∂_1 ln p_X(x) = ∂_2 ln p_X(x) = −(x_1 + x_2)/2,
∂_1² ln p_X(x) = ∂_2² ln p_X(x) = ∂_1 ∂_2 ln p_X(x) = −1/2

for x with |x| < 1/2, and the Hessian of the logarithmic density,

H_{ln p_X}(x) = −(1/2) ( 1 1 ; 1 1 ),

is independent of x in a neighborhood of 0. Diagonalization of H_{ln p_X}(0) yields

D = ( −1 0 ; 0 0 ),

and this equals A H_{ln p_X}(0) A^T, in accordance with theorem 3.
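The diagonalization in this example can be checked numerically. The following verification sketch is our own addition; it uses NumPy's `eigh` and checks that the eigenvector matrix recovers the mixing matrix up to permutation and sign, that is, E ∼ A:

```python
import numpy as np

A = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)   # mixing matrix of the example
H = -0.5 * np.array([[1.0, 1.0], [1.0, 1.0]])          # H_ln pX(0)

evals, E = np.linalg.eigh(H)   # columns of E: orthonormal eigenvectors
D = E.T @ H @ E                # diagonal, with eigenvalues -1 and 0

# E recovers A up to permutation and sign:
P = np.abs(E.T @ A)            # should be a permutation matrix
```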
4.2 Global Hessian Diagonalization Using Kernel-Based Density Approximation. In practice, it is usually not possible to approximate the density locally with sufficiently high accuracy, so a better approximation using the typically global information of X has to be found. In the following, we suggest using kernel-based density estimation to get an energy function with minima at the BSS solutions, together with a global Hessian diagonalization. The idea is to construct a measure for separatedness of the densities (hence independence) based on theorem 1. A possible measure could be the norm of the summed-up separators Σ_{i<j} R_ij[f]. In order for this to be computable, we choose a set of points p^(k) at which we evaluate the separators, and we minimize Σ_k Σ_{i<j} R_ij[f](p^(k))² over those points. Although in the linear noiseless case, calculating the Hessian at only one point would be enough, using an energy function of this type ensures using global information of the densities while averaging over possible local errors.

First, we need to approximate the density function. For this, let X be an n-dimensional random vector with ν independent and identically distributed samples x^(1), . . . , x^(ν) ∈ R^n. Let ϕ : R^n → R,

ϕ(x) = (1/(σ^n (2π)^(n/2))) exp(−‖x‖²/(2σ²)),

be the n-dimensional centered independent gaussian with fixed variance σ² > 0. For ease of notation, denote κ := 1/(2σ²). Define the approximated density p̂_X of X by

p̂_X(x) := (1/ν) Σ_{i=1}^ν ϕ(x − x^(i)).
(4.3)

As ν → ∞, p̂_X converges to p_X in the space of all integrable functions if σ is chosen appropriately; this can be shown using the central limit theorem. Figure 2 depicts the approximation of a Laplacian using equation 4.3. The partial derivatives of ϕ can be calculated as

∂_i ϕ(x) = −2κ x_i ϕ(x),
∂_i ∂_j ϕ(x) = 4κ² x_i x_j ϕ(x)
(4.4)

for i ≠ j. ϕ is separated, so R[ϕ] ≡ 0. Note that p̂_X ∈ C^∞(R^n, R) is positive. So according to theorem 1, p̂_X is separated if and only if R_ij[p̂_X] ≡ 0 for i < j. And since p̂_X is an approximation of p_X, separatedness of p̂_X also induces approximate independence of X.
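As a concrete sketch, the separator of the kernel estimate can be evaluated directly from the samples. Here we assume the product form R_ij[f] = f · ∂_i∂_j f − ∂_i f · ∂_j f, which vanishes identically exactly when ∂_i∂_j ln f ≡ 0 on the set where f > 0; the function name and test data are our own illustration:

```python
import numpy as np

def r_ij_hat(samples, x, i, j, sigma):
    """Separator R_ij[p_hat](x) of the gaussian-kernel estimate (equation 4.3).

    Uses R_ij[f] = f * d_i d_j f - d_i f * d_j f together with the kernel
    derivatives (equation 4.4); it vanishes when the estimate is separated
    in directions i and j.
    """
    nu, n = samples.shape
    kappa = 1.0 / (2 * sigma**2)
    d = x - samples                              # rows: x - x^(k)
    phi = np.exp(-kappa * (d**2).sum(axis=1)) / (sigma**n * (2*np.pi)**(n/2))
    p = phi.mean()                               # p_hat(x)
    dip = (-2*kappa * d[:, i] * phi).mean()      # d_i p_hat(x)
    djp = (-2*kappa * d[:, j] * phi).mean()      # d_j p_hat(x)
    dijp = (4*kappa**2 * d[:, i] * d[:, j] * phi).mean()
    return p * dijp - dip * djp
```

Summing `r_ij_hat(...)**2` over evaluation points p^(k) and index pairs i < j gives an energy function of the type described above.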
Figure 2: Independent Laplacian density p_S(s) = (1/4) exp(−|s_1| − |s_2|): theoretic (left) and approximated (right) densities. For the approximation, 1000 samples and gaussian kernel approximation (see equation 4.3) with standard deviation 0.37 were used.
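The right panel of Figure 2 can be reproduced along these lines. The sample size and bandwidth follow the caption; the evaluation grid and random seed are our own choices:

```python
import numpy as np

rng = np.random.default_rng(3)
S = rng.laplace(size=(1000, 2))       # 1000 samples of the independent Laplacian

sigma = 0.37                          # kernel standard deviation from the caption

def p_hat(x, samples=S, sigma=sigma):
    # gaussian-kernel estimate, equation 4.3, for n = 2
    d2 = ((x - samples)**2).sum(axis=1)
    return np.mean(np.exp(-d2 / (2 * sigma**2))) / (sigma**2 * 2 * np.pi)

grid = np.linspace(-2, 2, 41)
Z = np.array([[p_hat(np.array([u, v])) for u in grid] for v in grid])
# Z is a nonnegative surface peaking near the origin, as in the figure
```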
R_ij[p̂_X] can be calculated using lemma 2 (here R_ij[ϕ(· − x^(k))] ≡ 0) and equation 4.4:

R_ij[p̂_X](x) = (1/ν²) Σ_{k≠l} [ ϕ(x − x^(k)) ∂_i ∂_j ϕ(x − x^(l)) − (∂_i ϕ)(x − x^(k)) (∂_j ϕ)(x − x^(l)) ]
= (4κ²/ν²) Σ_{k≠l} ϕ(x − x^(k)) ϕ(x − x^(l)) (x_i^(k) − x_i^(l)) (x_j − x_j^(l))
= (4κ²/ν²) Σ_{k<l} ϕ(x − x^(k)) ϕ(x − x^(l)) (x_i^(k) − x_i^(l)) (x_j^(k) − x_j^(l)),

where the diagonal terms k = l vanish because R_ij[ϕ] ≡ 0 and the last step pairs each term with its (l, k) counterpart.

Since p_S(s^0) > 0, there exists an open neighborhood U ⊂ R^n of s^0 with p_S|U > 0 and p_S|U ∈ C²(U, R). If we define f(s) := ln(|det W|^(−1) |det A|^(−1) p_S(s)) for s ∈ U, then

f(s) = ln( h_1'((As)_1) · · · h_n'((As)_n) g_1((Wh(As))_1) · · · g_n((Wh(As))_n) )
     = Σ_{k=1}^n [ ln h_k'((As)_k) + ζ_k((Wh(As))_k) ]
for s ∈ U, where ζ_k := ln g_k locally at (Wh(As^0))_k. p_S is separated, so

∂_i ∂_j f ≡ 0
(5.2)

for i < j. Denote A =: (a_ij) and W =: (w_ij). The first derivative and then the nondiagonal entries of the Hessian of f can be calculated as follows (i < j):

∂_i f(s) = Σ_{k=1}^n [ a_ki (h_k''/h_k')((As)_k) + ζ_k'((Wh(As))_k) Σ_{l=1}^n w_kl a_li h_l'((As)_l) ],

∂_i ∂_j f(s) = Σ_{k=1}^n [ a_ki a_kj ((h_k''' h_k' − (h_k'')²)/(h_k')²)((As)_k)
    + ζ_k''((Wh(As))_k) (Σ_{l=1}^n w_kl a_li h_l'((As)_l)) (Σ_{l=1}^n w_kl a_lj h_l'((As)_l))
    + ζ_k'((Wh(As))_k) Σ_{l=1}^n w_kl a_li a_lj h_l''((As)_l) ].
Substituting y := As and using equation 5.2, we finally get the following differential equation for the h_k:

0 = Σ_{k=1}^n [ a_ki a_kj ((h_k''' h_k' − (h_k'')²)/(h_k')²)(y_k)
    + ζ_k''((Wh(y))_k) (Σ_{l=1}^n w_kl a_li h_l'(y_l)) (Σ_{l=1}^n w_kl a_lj h_l'(y_l))
    + ζ_k'((Wh(y))_k) Σ_{l=1}^n w_kl a_li a_lj h_l''(y_l) ]
(5.3)
for y ∈ V := A(U). We will restrict ourselves to the simple case mentioned above in order to solve this equation. We assume that the h_k are analytic and that there exists x^0 ∈ R^n where the demixed densities g_k are locally constant and nonzero. Consider the above calculation around s^0 = A^(−1)(h^(−1)(W^(−1)x^0)). Choose the open set V such that the g_k are locally constant and nonzero on Wh(V). Then so are the ζ_k = ln g_k, and therefore ζ_k' and ζ_k'' vanish there, so

0 = Σ_{k=1}^n a_ki a_kj ((h_k''' h_k' − (h_k'')²)/(h_k')²)(y_k)

for y ∈ V. Since this sum of functions of the single variables y_k vanishes identically, each summand is constant. Hence, there exist open intervals I_k ⊂ R and constants b_k ∈ R with

a_ki a_kj (h_k''' h_k' − (h_k'')²) ≡ b_k (h_k')²

on I_k (here, b_k = −Σ_{l≠k} a_li a_lj ((h_l''' h_l' − (h_l'')²)/(h_l')²)(y_l) for some, and then any, y ∈ V).

By assumption, W is mixing. Hence, for fixed k, there exist i ≠ j with a_ki a_kj ≠ 0. If we set c_k := b_k/(a_ki a_kj), then, on I_k,

c_k (h_k')² − h_k''' h_k' + (h_k'')² ≡ 0.
(5.4)
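The exponential solutions of equation 5.4 quoted below as equation 5.5 can be sanity-checked numerically. Writing f := h_k' with f(x) = exp((c/2)x² + dx + e) and evaluating its derivatives in closed form, the left side of equation 5.4 vanishes identically; the constants below are arbitrarily chosen, and the whole block is our own verification sketch, not part of the proof:

```python
import numpy as np

# arbitrary constants for the candidate solution h_k'(x) = exp((c/2)x^2 + d x + e)
c, d, e = 0.3, -0.7, 0.2

f   = lambda x: np.exp(0.5 * c * x**2 + d * x + e)   # h_k'
fp  = lambda x: (c * x + d) * f(x)                   # h_k'' (closed form)
fpp = lambda x: (c + (c * x + d)**2) * f(x)          # h_k''' (closed form)

x = np.linspace(-2.0, 2.0, 9)
residual = c * f(x)**2 - fpp(x) * f(x) + fp(x)**2    # left side of equation 5.4
# residual is zero up to floating-point rounding for all x
```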
Since h_k was chosen to be analytic and equation 5.4 holds on the open set I_k, it holds on all of R. Applying lemma 3 then shows that either h_k' ≡ 0 or

h_k'(x) = ± exp( (c_k/2) x² + d_k x + e_k ),  x ∈ R,
(5.5)
with constants d_k, e_k ∈ R. By assumption, h_k is bijective, so h_k' ≢ 0. Applying the same arguments as above to the inverse system S = A^(−1)(h^(−1)(W^(−1)X)) and using the fact that p_S, too, is somewhere locally constant and nonzero shows that equation 5.5 also holds for (h_k^(−1))' with other constants. But if the derivatives of both h_k and h_k^(−1) are of this exponential type, then c_k = d_k = 0, and therefore h_k is affine linear for all k = 1, . . . , n, which completes the proof of postnonlinear separability in this special case.

Note that in the above proof, local positiveness of the densities was assumed in order to use the equivalence of local separability with the diagonality of the Hessian of the logarithmic density. These results can therefore be generalized using theorem 1 in a similar fashion as in the linear case with theorem 2; in this way, we have also proven postnonlinear separability for uniformly distributed sources.

6 Conclusion

We have shown how to derive the separability of linear BSS using diagonalization of the Hessian of the logarithmic density or, respectively, of the logarithmic characteristic function: diagonality of this Hessian characterizes separated, that is, independent, sources. The idea of Hessian diagonalization leads to a new algorithm for performing linear independent component analysis, which is shown to be a local problem. In practice, however, because the densities cannot be approximated well locally, we also propose a diagonalization algorithm that takes the global structure into account. In order to show the use of this framework of separated functions, we finish with a proof of postnonlinear separability in a special case. In future work, more general separability results for postnonlinear BSS could be obtained by finding more general solutions of the differential equation 5.3.
Algorithmic improvements could be made by using other density approximation methods, such as mixture-of-gaussians models, or by approximating the Hessian itself using the cumulative distribution function and discrete approximations of the differential. Finally, the diagonalization algorithm can easily be extended to nonlinear situations by finding appropriate model parameterizations; instead of minimizing the mutual information, we minimize the absolute values of the off-diagonal terms of the logarithmic Hessian.
The algorithm has been specified using only an energy function; gradient and fixed-point algorithms can be derived in the usual manner.

Separability in nonlinear situations has turned out to be a hard problem, ill-posed in the most general case (Hyvärinen & Pajunen, 1999), and not many nontrivial results exist for restricted models (Hyvärinen & Pajunen, 1999; Babaie-Zadeh et al., 2002), all of them only two-dimensional. We believe that this is because the rather nontrivial proof of the Darmois-Skitovitch theorem is not at all easily generalized to more general settings (Kagan, 1986). By introducing separated functions, we are able to give a much simpler proof of linear separability and also provide new results in nonlinear settings. We hope that these ideas will be used to show separability in other situations as well.

Acknowledgments

I thank the anonymous reviewers for their valuable suggestions, which improved the original manuscript. I also thank Peter Gruber, Wolfgang Hackenbroch, and Michaela Theis for suggestions and remarks on various aspects of the separability proof. The work described here was supported by the DFG in the grant "Nonlinearity and Nonequilibrium in Condensed Matter" and the BMBF in the ModKog project.

References

Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2002). A geometric approach for separating post non-linear mixtures. In Proc. of EUSIPCO '02 (Vol. 2, pp. 11–14). Toulouse, France.
Bauer, H. (1996). Probability theory. Berlin: Walter de Gruyter.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., Meraim, K. A., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2), 434–444.
Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non gaussian signals. IEE Proceedings-F, 140(6), 362–370.
Cichocki, A., & Amari, S. (2002). Adaptive blind signal and image processing. New York: Wiley.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Darmois, G. (1953). Analyse générale des liaisons stochastiques. Rev. Inst. Internationale Statist., 21, 2–8.
Eriksson, J., & Koivunen, V. (2003). Identifiability and separability of linear ICA models revisited. In Proc. of ICA 2003 (pp. 23–27). Nara, Japan.
Habl, M., Bauer, C., Puntonet, C., Rodriguez-Alvarez, M., & Lang, E. (2001). Analyzing biomedical signals with probabilistic ICA and kernel-based source density estimation. In M. Sebaaly (Ed.), Information science innovations (Proc. ISI'2001) (pp. 219–225). Alberta, Canada: ICSC Academic Press.
Hérault, J., & Jutten, C. (1986). Space or time adaptive signal processing by neural network models. In J. Denker (Ed.), Neural networks for computing: Proceedings of the AIP Conference (pp. 206–211). New York: American Institute of Physics.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429–439.
Kagan, A. (1986). New classes of dependent random variables and a generalization of the Darmois-Skitovitch theorem to several forms. Theory Probab. Appl., 33(2), 286–295.
Lee, T., & Lewicki, M. (2000). The generalized gaussian mixture model using ICA. In Proc. of ICA 2000 (pp. 239–244). Helsinki, Finland.
Lin, J. (1998). Factorizing multivariate function classes. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 563–569). Cambridge, MA: MIT Press.
Skitovitch, V. (1953). On a property of the normal distribution. DAN SSSR, 89, 217–219.
Taleb, A., & Jutten, C. (1999). Source separation in post non linear mixtures. IEEE Trans. on Signal Processing, 47, 2807–2820.
Theis, F., & Gruber, P. (forthcoming). Separability of analytic postnonlinear blind source separation with bounded sources. In Proc. of ESANN 2004. Evere, Belgium: d-side.
Theis, F., Jung, A., Puntonet, C., & Lang, E. (2002). Linear geometric ICA: Fundamentals and algorithms. Neural Computation, 15, 1–21.
Theis, F., Puntonet, C., & Lang, E. (2003). Nonlinear geometric ICA. In Proc. of ICA 2003 (pp. 275–280). Nara, Japan.
Tong, L., Liu, R.-W., Soon, V., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38, 499–509.
Yeredor, A. (2000). Blind source separation via the second characteristic function. Signal Processing, 80(5), 897–902.
Received June 27, 2003; accepted March 8, 2004.
LETTER
Communicated by Jacques Droulez
Modeling Mental Navigation in Scenes with Multiple Objects

Patrick Byrne
[email protected] Suzanna Becker
[email protected] Department of Psychology, McMaster University, Hamilton, Ontario, Canada
Various lines of evidence indicate that animals process spatial information regarding object locations differently from spatial information regarding environmental boundaries or landmarks. Following Wang and Spelke's (2002) observation that spatial updating of egocentric representations appears to lie at the heart of many navigational tasks in many species, including humans, we postulate a neural circuit that can support this computation in parietal cortex, assuming that egocentric representations of multiple objects can be maintained in prefrontal cortex in spatial working memory (not simulated here). Our method is a generalization of an earlier model by Droulez and Berthoz (1991), with extensions to support observer rotation. We can thereby simulate perspective transformation of working memory representations of object coordinates based on an egomotion signal presumed to be generated via mental navigation. This biologically plausible transformation would allow a subject to recall the locations of previously viewed objects from novel viewpoints reached via imagined, discontinuous, or disoriented displacement. Finally, we discuss how this model can account for a wide range of experimental findings regarding memory for object locations, and we present several predictions made by the model.

1 Introduction

Spatial reasoning is of paramount importance in nearly all aspects of human behavior, from planning and navigating a complex route through some environment in order to reach a relevant goal, to simply grasping a nearby object. The set of all entities that comprise our spatial surroundings may be divided into subsets in arbitrarily many ways. For the present purposes, we will partition these entities into environmental boundaries/landmarks and objects. There is substantial empirical evidence as to how the brain represents and processes the former (for a review, see Burgess, Becker, King, & O'Keefe, 2001).
For example, O'Keefe and Dostrovsky (1971) found neurons in the hippocampus of the rat that respond to the rat's location in space. O'Keefe and Nadel (1978) argue that this collection of "place cells" forms a cognitive map and is the rat's internal allocentric map of the environment. Evidence of view-invariant hippocampal place cells has also been found in nonhuman primates (Ono, Nakamura, Nishijo, & Eifuku, 1993) and in human hippocampus (Ekstrom et al., 2003).

Neural Computation 16, 1851–1872 (2004); © 2004 Massachusetts Institute of Technology

For the case of short-term object location representation, the focus of this article, empirical evidence strongly indicates involvement of the posterior parietal cortex. For example, Sabes, Breznen, and Andersen (2002) perform single-unit recordings demonstrating that area LIP of monkey cortex encodes saccade targets in retinotopic coordinates. More generally, Colby and Goldberg (1999) review evidence showing that object locations are represented in a variety of reference frames in parietal cortex, while Andersen, Shenoy, Snyder, Bradley, and Crowell (1999) review evidence suggesting that area 7a incorporates vestibular information to maintain localization of visual stimuli in a world-centered reference frame. Finally, Goodale and Milner (1992) review data suggesting that the dorsal stream from striate to posterior parietal cortex is responsible for the relatively short-timescale sensorimotor transformations used while performing visually guided actions directed at objects in the environment.

In this article, we first review empirical evidence from neuropsychological and electrophysiological studies as to the nature and locus of object representations in the brain. We then present a biologically plausible computational model that allows for the transformation of object coordinates maintained in spatial working memory (WM). The function of this transformation is to update WM representations of object coordinates in egocentric space while the subject mentally navigates through the environment. This transformation is driven by the products of mental navigation (presumably some mentally generated equivalent of vestibular, proprioceptive, and motor efference information).
Next, we perform simulations demonstrating the functioning of the model and how error is introduced into object coordinate representations by the model. Finally, we discuss how the model can account for various experimental findings.

The representation of object location information in the brain appears, at least under certain circumstances, to be quite different from the representation of environmental boundary information. A series of experiments performed by Wang and Spelke (2000) provides insight into how humans differentially process object location and environmental boundary information. For each experiment, the authors placed a few objects around a small room. Subjects were allowed to explore the room and study the objects' locations for as long as they pleased. They were then brought to the center of the room, blindfolded, rotated a small amount, and asked to point to the various objects as they were called out in a random order by the experimenters. Following this, subjects were required to sit in a swivel chair fixed at the center of the room and disorient themselves via a 1-minute self-induced rotation. They were again asked to point to the objects in a predetermined random order. In addition to objects, subjects were also required to point to
room corners in some experiments. The main findings can be summarized as follows: After the initial rotation (without disorientation), subjects could point to objects and room corners with relatively high accuracy, implying that they had correctly encoded the information they were asked to. After disorientation, subjects could no longer accurately point to the location of objects or room corners. However, analyses of the consistency of the pointing errors indicated that the relative configuration of the room corners could be accurately recalled (for nonrectangular as well as rectangular rooms), whereas the relative object configuration could not be. This supports a configural representation of geometric feature locations and nonconfigural or independent representations of object locations. Furthermore, when subjects were reoriented with a bright light (still blindfolded), they could accurately recall the absolute and relative locations of room corners but neither the relative nor absolute locations of objects. When the bright light was left on throughout the disorientation procedure, subjects could recall the relative and absolute locations of both objects and room corners. Wang and Spelke's (2000) experiments indicate that subjects could rapidly form an accurate internal representation of environmental boundaries and object locations. However, it appears that during mental transformations of spatial information in trials where the subject is not continuously oriented, information about a given object location is updated independently of information regarding other objects and of the environment. In this way, each object location is subject to independent transformation error. This is in contrast to environmental geometric cues such as room corners, which appear to be updated as a coherent whole.
Additional evidence that object location information is handled differently from environmental boundary information comes from an experiment by Shelton and McNamara (2001). Subjects were brought into a room with various objects placed at different locations inside and allowed to observe the objects from various predetermined viewpoints. After leaving the room, they were asked to imagine standing at a given object facing a second object and to point to where a third object would be relative to themselves. Subjects performed best when their imagined viewpoint was aligned with the original viewpoint from which they observed the object configuration. This suggests the possibility that subjects are storing the object configuration in head-centered egocentric coordinates and that they must transform it (introducing error) when they imagine observing it from other viewpoints. We say head-centered here because subjects were asked to imagine facing an object, which implies pointing their head toward it; they were not asked to constrain their gaze direction, so neither retinal nor body-centered coordinates are implied. Head-centered coordinates are also implied by the unit recordings of Funahashi, Bruce, and Goldman-Rakic (1989), as we mention in the next section. Another possible interpretation of Shelton and McNamara's (2001) results is that subjects form a view-based snapshot of the entire presentation scene, which can be used for later matching. Wang and Spelke
(2002) review evidence that humans do seem to make use of such snapshots, at least under certain circumstances. However, we do not investigate this issue any further here. In order to build a computational model of mental navigation that explains empirical results such as those of Wang and Spelke (2000) and Shelton and McNamara (2001), we first require a more complete understanding of how the brain represents objects. We next review empirical evidence as to the nature and locus of object representations in the brain.

2 Objects and Spatial Working Memory

A number of experiments suggest a transition from short-term spatial WM to long-term memory representations of objects after several minutes of study time (Smith & Milner, 1989; Crane, Milner, & Leonard, 1995; Bohbot et al., 1998). The evidence provided by these studies suggests that medial temporal lobe structures are essential to location memory over periods of several minutes (≈ 4 minutes) or greater but are less relevant over shorter timescales. To investigate the nature of short-term object location memory, we consider the evidence from functional imaging and unit recording studies. In an fMRI experiment performed by Galati et al. (2000), subjects were required to report the location of a vertical bar flashed before them for 150 ms relative to their midsagittal plane. During this task, several frontal and parietal regions were more active than during a control color decision involving the same stimuli. Furthermore, Sala, Rämä, and Courtney (2003) presented subjects with a sequence of flashes and asked them to recall the location or identity of what was shown (picture of a house or a face) three flashes back. During location recall, fMRI scans revealed activation in the superior portion of the intraparietal sulcus (IPS) and in the superior frontal sulcus, as well as other areas. During identity recall, activation was found in the inferior and medial frontal gyrus, as well as other areas.
These observations indicate that areas of frontal and parietal cortices are of key importance in generating and maintaining internal representations of spatial locations, at least for short periods of time. The role of frontal cortical areas in this process is reviewed by Levy and Goldman-Rakic (2000), who argue that the principal sulcus (area 46) plays a crucial role in spatial WM and that Walker's areas 12 and 45 (the inferior convexity) play a crucial role in object identity WM. In particular, a unit recording study by Funahashi et al. (1989) showed that neurons in the principal sulcus of the monkey appear to code for egocentric spatial locations in head-centered coordinates. Furthermore, a human study by Oliveri et al. (2001), requiring subjects to remember the position of a flash two steps back in a sequence, found that only when transcranial magnetic stimulation (TMS) was applied to the dorsolateral prefrontal cortex (DLPFC) was accuracy affected, although TMS applied to several different brain regions affected reaction times. These results suggest that egocentric representations of spatial locations are maintained in DLPFC.

Unfortunately, the role of parietal cortical areas in spatial WM is somewhat less clear. It is known, however, that neurons in areas VIP and LIP of the IPS show receptive fields in head-centered coordinates (see Burgess, Jeffery, & O'Keefe, 1999, for an overview). Also, Chafee and Goldman-Rakic (1998) showed that when monkeys are required to hold a target location in memory for a short delay before making a saccade to it, neurons in area 8a (near the principal sulcus of prefrontal cortex) and in area 7ip (near the IPS) become active and show temporally varying activation levels over the delay period. Neurons in both regions show spatial selectivity similar to that found in the principal sulcus of monkeys by Funahashi et al. (1989).

From the above evidence, it seems plausible that object locations are initially represented in parietal cortex, and if sufficient attention is allocated to them, then their locations and identities are maintained in WM in DLPFC. In particular, their locations are represented egocentrically (as Shelton and McNamara's, 2001, work might indicate) in one area of DLPFC, while maintenance of their identities involves a different area of DLPFC, consistent with Goldman-Rakic's hypothesis that the ventral/dorsal (what/where) distinction persists into the DLPFC. In order to mentally manipulate the positions of objects stored in WM, it seems reasonable to assume that a circuit involving areas of the parietal cortex (especially the IPS) would be involved, given this area's involvement in spatial WM and its known ability to represent locations in multiple reference frames. In particular, we hypothesize that object coordinates are maintained egocentrically in WM in DLPFC and are manipulated and represented more transiently in the vicinity of the IPS.
3 Model

In this work, we develop a computational account of memory for object locations in the face of certain types of viewer motion, specifically those in which the motion is imagined or those in which the subject does not remain continuously oriented throughout. At a minimum, the model should provide an explanation of Shelton and McNamara's (2001) finding that subjects more accurately recall object positions when asked to do so from a viewpoint in which they previously observed the object configuration, and of Wang and Spelke's (2000) finding regarding the pattern of errors made by disoriented subjects recalling object configurations. One possible way in which subjects might perform either the Shelton and McNamara or the Wang and Spelke task is by mental navigation. In contrast to a simple mental rotation of the object array, mental navigation involves imagined egomotion. This entails making mental viewpoint transformations while simultaneously updating WM representations of egocentric object coordinates. More specifically, subjects could make a WM snapshot of object locations from a given viewpoint, and when asked to recall object locations from a new viewpoint, they could
P. Byrne and S. Becker
mentally navigate from the initial viewpoint to the new viewpoint and use the same motion signal driving mental navigation to simultaneously drive a transformation of egocentric object coordinates. In the case of the Wang and Spelke tasks, the new viewpoint would have to be estimated due to the disorientation procedure. Also, assuming a serial updating procedure in which the transformation just described must be repeated for each individual object allows for a possible explanation of the finding that object configurations could not be accurately recalled after disorientation and that recall errors were not systematic, indicating lack of a configurational representation. This will be discussed in more detail in section 5. Our hypothesis requires neural circuits that can represent the environment allocentrically, to allow for mental navigation through the environment. Additional circuits are required to maintain egocentric object coordinates and transform these coordinates, one object at a time, based on the egomotion signal that drives mental navigation. To this end, models of real navigation based on internal allocentric cognitive maps have been developed for rats and humans (see Voicu, 2003, and Burgess, Donnett, & O’Keefe, 1997, for examples). Given the evidence that both real and imagined spatial tasks invoke nearly the same cortical circuitry (see, e.g., Stippich, Ochmann, & Sartor, 2002; Ino et al., 2002; Mellet et al., 2000; Kreiman, Koch, & Fried, 2000), these models of real navigation will be assumed to be applicable to mental navigation. We now require a neural circuit that performs the object location transformation. To begin, we must decide how egocentric object coordinates are to be represented. 
Thelen, Schöner, Scheier, and Smith (2001) and Compte, Brunel, Goldman-Rakic, and Wang (2000) have created models of spatial WM that hold location as a bump of activity in a topographically organized neural circuit, in which each neuron represents a location or direction in egocentric space. Becker and Burgess (2001) model the parietal egocentric map in this way, and similarly the organism's location in allocentric space is represented by a gaussian bump of activity over an array of hippocampal place cells. Such bump attractor networks have been used by others in models of hippocampus as well (Zhang, 1996; Samsonovich & McNaughton, 1997). Here also it will be assumed that the model contains a main layer of neurons, each of which is preassigned a unique location on a Cartesian grid covering head-centered egocentric space. We will represent object location in egocentric space as a gaussian bump of activity in the main neuron layer. By definition, the person's coordinates in this map are the origin, and their orientation will be taken as facing along the positive y-axis. The existence of these main-layer neurons in parietal cortex is supported by the electrophysiological recordings of Chafee and Goldman-Rakic (1998). Our model of the updating of object locations will be adapted from a model by Droulez and Berthoz (1991), which could easily be implemented to handle translational movements of the observer but requires a nontrivial extension to handle rotational motion. To derive the
Modeling Mental Navigation in Scenes with Multiple Objects
Figure 1: Translation and rotation of reference frame.
model, first consider an observer standing at the origin, O, of some reference frame and facing along the positive y-axis (see Figure 1). The position of an object at a point, P, will be denoted by the vector rP. If the observer moves by an amount rT and rotates by an amount θ, then we will call the new egocentric reference frame the primed frame. The position of the object with respect to this new frame is given by

r′P = rP − rT.
(3.1)
Reexpressed in terms of Cartesian x, y components, this is

x′P i′ + y′P j′ = (xP − xT) i + (yP − yT) j,
(3.2)
where i and j are basis vectors oriented along the x- and y-axes, respectively. Making an appropriate change of basis on the right-hand side yields

x′P = (xP − xT) cos θ + (yP − yT) sin θ
y′P = (yP − yT) cos θ − (xP − xT) sin θ.
(3.3)
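The transformation in equations 3.1 to 3.3 can be sketched numerically. The following is an illustrative reimplementation, not code from the original model; the function name and test values are our own:

```python
import math

def transform_egocentric(x_p, y_p, x_t, y_t, theta):
    """Re-express an object's egocentric coordinates after the observer
    translates by (x_t, y_t) and rotates by theta (eqs. 3.1-3.3)."""
    dx, dy = x_p - x_t, y_p - y_t            # eq. 3.1: r'_P = r_P - r_T
    x_new = dx * math.cos(theta) + dy * math.sin(theta)   # eq. 3.3
    y_new = dy * math.cos(theta) - dx * math.sin(theta)
    return x_new, y_new
```

For example, an object 2 units straight ahead, after the observer steps 1 unit forward with no rotation, ends up 1 unit ahead; after an additional quarter turn to the left, the same object lies 1 unit to the observer's right.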
Next, we assume that the observer's viewpoint shift occurs over a time interval ∆t > 0.

Synaptic weights from the transient node to excitatory cells are set to

wij = i + (j − 1)N
(2.4)
Representational Capacity
for all i and j. Therefore, nodes in the network are activated with a gradient of input values, where every node has a distinct activity amplitude that corresponds to its position in the network. This is analogous to the random gaussian noise applied to oscillatory units in LEGION (Wang & Terman, 1997), to the dye injections in the FBF network (Grossberg & Wyse, 1991), or to the different visual latencies for pixels with different luminance in the spin-lattice model (Opara & Wörgötter, 1998). For instance, suppose that M = N = 5. Then,
wij =
 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25 .
(2.5)
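A minimal sketch of this label assignment (equation 2.4), with our own function name; for M = N = 5 the formula yields 25 distinct amplitudes:

```python
import numpy as np

def initial_labels(M, N):
    """eq. 2.4: assign each node the distinct amplitude w_ij = i + (j-1)N,
    using 1-based indices i = 1..M, j = 1..N."""
    i = np.arange(1, M + 1).reshape(-1, 1)   # row index, 1-based
    j = np.arange(1, N + 1).reshape(1, -1)   # column index, 1-based
    return i + (j - 1) * N

w = initial_labels(5, 5)   # 25 distinct labels, the integers 1 through 25
```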
This type of coding may seem too artificial to be implemented in the brain because it is independent of any visual attributes. However, it could be enhanced by including direct input values (i.e., luminance), as in I(t)(wij + Jij). In that case, the network becomes sensitive to the magnitude of the input Jij, but it remains sensitive to an abstract distinction between different surfaces. Lighter surfaces will therefore receive labels with higher amplitude due to the Jij, but if there are two or more distinct surfaces with the same input, they will still be labeled with different activity values. Interference may occur between two different codes, however, so the weights (wij) should be normalized to some small interval in order to prevent it. Another possibility is to use a two-layer architecture, where the first layer is sensitive only to luminance and its output is sent to a second layer, which uses an abstract code for the objects' identity. Linear output functions with half-rectification for excitatory and inhibitory cells are defined as f(r) = h(r) = r for r > 0 and f(r) = h(r) = 0 for r ≤ 0. Rectification is necessary in a biologically plausible model because it prevents excitatory connections from becoming inhibitory, and vice versa. Self-excitation is included in the model in order to prevent activation decay after the transient input is shut down and to improve the quality of the object representation. The function g() describes the dendritic output function; it is defined as g(r) = 1 if r > 0 and g(r) = 0 if r ≤ 0. Other functions may be used as well (e.g., a sigmoid function), but the important point is that g() must have an upper bound, because too strong an inhibition would lead to winner-take-all behavior. The threshold Ty controls the minimal difference between the activity of the excitatory cell and the activity of the dendritic interneuron that is necessary to influence cell yij.
D. Domijan

The network is fully connected, and the sum in equation 2.2 is taken over all network cells p, q except i, j. The lateral and dendritic inhibitory connections are assumed to be of unit strength, and their weights are not explicitly represented in the model. The term Sij in equation 2.3 is defined as Sij =
Σmn νmnij 2k[ f(xmn) − h(zij) ],
(2.6)
where m and n range over the four nearest-neighbor locations {(i + 1, j), (i − 1, j), (i, j + 1), (i, j − 1)}. Sij describes excitatory activity arriving from the local surround xmn to the dendritic interneuron through its dendrites. This activity is also subject to dendritic inhibition from the dendritic interneuron (zij), which prevents its excessive excitation (see Figure 1B). Dendrites of the dendritic interneuron have a half-rectified linear output function k(r) = r for r > 0 and k(r) = 0 if r ≤ 0. Furthermore, the excitatory links are multiplicatively gated by

νmnij = g(Jmn − Jij + Tν)g(Jij − Jmn + Tν).
(2.7)
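The gating rule of equation 2.7 can be written as a short sketch (the names are ours):

```python
def g(r):
    """Dendritic output function: g(r) = 1 if r > 0, else 0."""
    return 1.0 if r > 0 else 0.0

def gate(J_mn, J_ij, T_nu):
    """eq. 2.7: the gate is open (1) only when |J_mn - J_ij| < T_nu,
    i.e., when the two neighboring pixels plausibly belong to the
    same surface; a larger intensity step closes it (0)."""
    return g(J_mn - J_ij + T_nu) * g(J_ij - J_mn + T_nu)
```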
Equation 2.7 is a simplified variant of the BCS, which detects the intensity difference between neighboring pixels Jmn and Jij in the input image J. The two terms in equation 2.7 correspond to two dendritic branches, which independently compute a signal difference between excitatory and inhibitory input, as shown in Figure 1B. They interact multiplicatively in order to achieve contrast sensitivity without sensitivity to contrast polarity (Grossberg & Mingolla, 1985). The nonlinear dendritic output function g() is defined as for equation 2.2. When the intensity difference is larger than the excitatory tonic signal Tν, one of the terms in equation 2.7 drops to 0, and activity spreading is disabled. If the difference is smaller than Tν, both terms will be 1, which allows an exchange of excitatory signals between neighboring cells. The tonic signal Tν controls the sensitivity of the boundary detection system to intensity differences. It may be considered an internal property of the dendrites, or it may arise from a common external source. Multiplicative interactions are assumed to arise at junctions of dendritic branches (Mel, 1994). The input image J does not influence excitatory cells directly; rather, it controls activity spreading through the boundary detection system (other possibilities are discussed above). The network description can be simplified if we assume that the inhibitory interneurons quickly reach their equilibrium values. Then we can replace the system of equations 2.1 to 2.3 with a single equation:

dxij/dt = −Axij + (B − xij)(I(t)wij + f(xij)) − (C + xij) Σpq≠ij g[ f(xpq) − h(xij) − Sij ].
(2.8)
Further simplification is useful when applying the network to real images. Solving the large system of ordinary differential equations is time-consuming, and an algebraic approximation of the real-time process greatly
reduces the computational cost. Spreading activation in the labeling network is approximated with the following algorithm:

1. Initialize xij = wij;
(2.9)

2. Iterate until convergence: xij = Max(xij, xmn νmnij),
(2.10)
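The two-step algorithm can be sketched as a toy reimplementation (our own naming, not the authors' code), combining the label gradient of equation 2.4 with the gating of equation 2.7:

```python
import numpy as np

def label_regions(J, T_nu):
    """Algorithmic labeling (eqs. 2.9-2.10): start from distinct labels
    w_ij = i + (j-1)N (eq. 2.4) and let the maximum label spread through
    four-neighbor gates nu (eq. 2.7), which are open only where
    neighboring intensities differ by less than T_nu."""
    M, N = J.shape
    g = lambda r: 1.0 if r > 0 else 0.0
    i = np.arange(1, M + 1).reshape(-1, 1)
    j = np.arange(1, N + 1).reshape(1, -1)
    x = (i + (j - 1) * N).astype(float)           # eq. 2.9: x_ij = w_ij
    changed = True
    while changed:                                 # eq. 2.10
        changed = False
        for a in range(M):
            for b in range(N):
                for m, n in ((a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)):
                    if 0 <= m < M and 0 <= n < N:
                        nu = g(J[m, n] - J[a, b] + T_nu) * g(J[a, b] - J[m, n] + T_nu)
                        if x[m, n] * nu > x[a, b]:
                            x[a, b] = x[m, n] * nu
                            changed = True
    return x
```

On a 4 × 4 image whose left and right halves differ in intensity by more than T_nu, the two halves converge to the two distinct labels 8 and 16, the maxima of their respective initial label gradients.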
where the indices m and n are defined above, in equation 2.6. The algorithm does not exactly reproduce the amplitudes of the cells' activity obtained with the real-time version of the model, but it labels distinct closed regions with distinct activity values. All locations within the same region are labeled with the same activity value.

2.2 Size Estimation Network. Application of the labeling network to gray-level images results in fragmented output with many small regions that do not correspond to significant objects in the visual scene. In order to remove this noise, an additional network is proposed, which estimates the size of the surfaces segmented by the labeling network. It is analogous to the lateral potential in LEGION (Wang & Terman, 1997); however, while the lateral potential is directly embedded in the LEGION dynamics, the size estimation network operates independently of the labeling network. The size estimation network is shown in Figure 2. It combines excitatory and inhibitory connections to distinguish signals originating from different objects. Each cell in the size estimation network receives excitatory input from a corresponding input cell and an excitatory and an inhibitory signal from all other input cells. Input arrives from the labeling network. As in the labeling network, an additional dendritic inhibitory pathway is introduced, which influences the direct excitatory and inhibitory pathways before they reach the cell. Mathematically, the model is described by

dXij/dt = −Xij + g(xij) + Σm Σn g(xmn − Zij + TE) − Yij,
(2.11)

dYij/dt = −Yij + Σm Σn g(xmn − Zij + TI),
(2.12)

dZij/dt = −Zij + xij,
(2.13)
where Xij, Yij, and Zij denote, respectively, the excitatory cell, the lateral interneuron, and the dendritic interneuron at spatial position {i, j} for all i = 1, . . . , M and j = 1, . . . , N in the size estimation network. The dynamics of the cells are
Figure 2: Size estimation network. Open circles are excitatory cells, and filled circles are inhibitory interneurons. Lines with T-endings are dendrites. The bottom layer of excitatory cells represents input from the labeling network (xij ). The connection pattern is shown for a single cell in the size estimation network.
described using a linear additive model (Grossberg, 1988). The term xij is a direct input to the cell from the labeling network that is not subject to any activity modulation; xmn represents lateral input, which is modulated by dendritic inhibition (−Zij) before it can perturb the cell. The indices m and n represent all spatial locations except i and j; the function g(r) is the dendritic output function defined above; TE (TI) is the threshold for the lateral excitatory (inhibitory) pathway. The constraint on the threshold values is given by the following inequality,

0 = TI < TE < D,
(2.14)
where D is a positive constant representing a measure of the distinctiveness of the labeled object representation, defined as the smallest difference between object labels. This constraint allows cells to be sensitive to lateral input carrying the same label as the direct input, since the threshold for the excitatory pathway is set to a positive value. When the activity amplitude is equal in an excitatory pathway and in a dendritic inhibitory pathway, a positive output is produced, since the threshold in the inhibitory pathway prevents lateral inhibition from influencing the target cell. All weights in the network are set to unit value and are not explicitly represented. All cells in the size estimation network receive input from all cells in the labeling network. There is no intralayer connectivity, so activity spreads in a strictly feedforward manner. Therefore, the activity of the
size estimation network can be approximated in algebraic form:

Xij = g(xij) + Σm Σn g(xmn − xij + TE) − Σm Σn g(xmn − xij + TI).
(2.15)
In order to remove small, noisy regions, the output of the size estimation network is thresholded and multiplicatively combined with the output of the labeling network to produce the final segmentation result (Segij):

Segij = g(Xij − TX)xij,
(2.16)
where TX represents a threshold for the size estimation network. Therefore, all surfaces with size smaller than TX will not be represented in the final output.

3 Computer Simulations

3.1 Synthetic Images. The behavior of the real-time version of the labeling network (equations 2.4–2.8) is illustrated using small synthetic images (see Figure 3). The input is a 15 × 15 array of pixels whose intensities vary in the range 0 to 225. Network parameters are set to A = .01; B = 225; C = 0; Ty = .1; Tν = .5; tm = .1. The differential equations are solved using the Euler method with a time step of .001. In Figure 3A, four squares of equal size are presented. Initially, all cells have a different activity value due to the transient tonic stimulation. Dendritic inhibition of the lateral inhibitory signals preserves this initial activity difference in the absence of local excitation. The node that receives the strongest activation will inhibit all other nodes, but through its interneuron it will also indirectly inhibit all inhibitory signals from other nodes, and therefore it will not receive any inhibition. Excitatory self-feedback will drive its activity close to the maximum possible level B (here it is assumed that the decay parameter A is negligible). The second largest node will likewise block inhibition from all other units except the largest one. The largest node will override the dendritic inhibition and deliver one unit of inhibition to the second largest node, which will settle approximately at the activity level B − 1. In the same manner, the third largest node will settle at B − 2 because it receives two units of inhibition, one from the largest node and one from the second largest node; the fourth largest node will settle at B − 3; and so on. However, local excitatory signals override inhibition from nearby locations and allow activity to spread through a whole region bounded by signals from the boundary detection system. All cells in the interior of the
Figure 3: Computer simulations that illustrate the behavior of the labeling network. The first row is the input image (Jij), with maximal intensity (=225) denoted as white and minimal intensity (=1) as black. The second row is the response of the boundary detection system (νmnij), depicted as a set of lines between neighboring locations {m, n} and {i, j}. Line length encodes the magnitude of the response. The bottom row is the output of the excitatory cells of the network (xij), where maximal activity is denoted as white and minimal as black. (A) Four identical squares with background. (B) Four connected squares with a separate central region. (C) Overlapping squares. (D) Fragmented image with a distinct intensity value for every pixel.
single square will converge on the value of the cell that initially has the strongest activation in the group. Since inhibition from cells representing different squares cannot be overridden, different squares are painted with different activity amplitudes, which are denoted by different shades of gray. The background is painted white because the cell at location {15, 15}, which is initialized with the strongest input, reaches the maximum possible activity value (225) and spreads it to all network locations outside the closed borders belonging to the squares. An important point is that due to the self-excitation and the unit inhibition from nodes belonging to different segments, there is no edge effect; that is, the activity level of cells representing a single image region is the same at all corresponding locations.
from the square with the highest activity to all other squares. Although the interior of the figure belongs to the background, the labeling network treats it as a separate surface. Therefore, the labeling network is sensitive to the topological structure of the input image in the same manner as LEGION (Wang, 2000). Figure 3C illustrates the network's ability to segment overlapping surfaces. This is a consequence of the boundary detection system, which prevents spreading between squares through multiplicative interactions with local excitatory signals. However, the model cannot perform figure-ground separation and amodal completion of partially overlapped surfaces like the FBF network, because all segmented regions are represented in the same network layer. The problem arises from the model's inherent inability to represent depth relations. In order to achieve this, a multilayer architecture would have to be employed, with different layers corresponding to different depths. Nevertheless, an important point is that the number of network layers would depend on the number of different depth planes, not on the number of objects to be segmented, as in the FBF network (Grossberg & Wyse, 1991). The FBF network requires a different network layer for every surface even if the surfaces are at the same depth plane. It would be interesting to establish whether the FBF network and the labeling network could be combined in order to achieve greater flexibility in representing three-dimensional space. Figure 3D shows that the labeling network can represent even maximally fragmented images, where each pixel is treated as a separate object. This capacity depends on the choice of the parameter B, which is set to a value corresponding to the dimensionality of the network.
However, even if B is restricted to a value considerably lower than the total number of pixels in the network, the capacity can be preserved if the positive value of the function g() is lowered to compensate for the compression. For instance, if B is set to a value that is 10% of the dimensionality of the input image, g() should be defined as g(r) = .1 if r > 0 and g(r) = 0 if r ≤ 0.

3.2 Real Images. The full architecture, including the labeling network and the size estimation network, is applied to real images. Before analyzing the segmentation results, the behavior of the size estimation network is examined in more detail. Consider a simple example including three rectangles of different extent and a background. Suppose that each rectangle is labeled with a different activity level as output from the labeling network:

x =
1 1 1 1 1 1
0 0 0 0 0 0
2 2 2 2 0 0
0 0 0 0 0 0
3 3 0 0 0 0 .
(3.1)
Given this input, the size estimation network produces the following output:
X =
 6  6  6  6  6  6
18 18 18 18 18 18
 4  4  4  4 18 18
18 18 18 18 18 18
 2  2 18 18 18 18 .
(3.2)
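Equations 2.15 and 2.16 can be sketched directly in code. This is an illustrative reimplementation (the function names and the small test threshold T_X = 3 are our own choices, not values from the text); the assertions reproduce the sizes computed above for the three nonzero surfaces:

```python
import numpy as np

def g(r):
    """Dendritic output function: g(r) = 1 if r > 0, else 0."""
    return 1.0 if r > 0 else 0.0

def size_estimate(x, T_E=0.5, T_I=0.0):
    """eq. 2.15: X_ij effectively counts the cells carrying the same
    label as x_ij, because lateral excitation (threshold T_E) and
    lateral inhibition (threshold T_I) cancel for all larger labels."""
    M, N = x.shape
    X = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            exc = sum(g(x[m, n] - x[i, j] + T_E)
                      for m in range(M) for n in range(N) if (m, n) != (i, j))
            inh = sum(g(x[m, n] - x[i, j] + T_I)
                      for m in range(M) for n in range(N) if (m, n) != (i, j))
            X[i, j] = g(x[i, j]) + exc - inh
    return X

def segment(x, X, T_X):
    """eq. 2.16: Seg_ij = g(X_ij - T_X) x_ij removes surfaces whose
    estimated size does not exceed T_X."""
    return (X - T_X > 0).astype(float) * x
```

With the labeled input x of equation 3.1 and a hypothetical T_X = 3, the size-2 surface (label 3) is removed while the size-6 and size-4 surfaces survive.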
Parameters for the size estimation network are set to TE = .5 and TI = 0. The same parameter set is used for all results reported below. As can be seen, all cells that represent the same surface have the same level of activity, which corresponds to the size of the given rectangle. This is true even for the background, which was labeled with the zero activity level. In order to understand how this output is produced, three possible cases should be considered. First, if xij = xmn for some spatial locations {i, j} and {m, n}, dendritic inhibition will suppress the inhibitory signal traveling along the pathway from xmn to Xij. On the other hand, dendritic inhibition will not be strong enough to prevent the excitatory signal, because of the higher threshold (TE) in the excitatory pathway. The result is an excitation of Xij. The amount of excitation is directly related to the number of cells xmn that have the same activity level as xij. In other words, Xij simply counts the number of cells that belong to the same surface as xij. Second, if xij > xmn, dendritic inhibition will prevent activity spreading along both the excitatory and inhibitory pathways, and those xmn will not be able to influence the target cell Xij. Third, if xij < xmn, dendritic inhibition is too weak to prevent activity spreading along the excitatory and inhibitory pathways. Both excitatory and inhibitory signals will be delivered to the target cell. However, they will not have any influence on the final cell activity, because they have the same strength and cancel each other. The third case would not be necessary if surfaces were labeled with amplitudes according to their sizes, that is, if larger surfaces were labeled with higher activity levels. In that case, the inhibitory pathways could be excluded from the model. However, it is not reasonable to expect that a labeling procedure could use size information before it is computed by the size estimation network.
Therefore, inhibitory pathways are included in the model to prevent cells with higher activity labels from corrupting the size computation of cells with lower activity levels. As a consequence, the model allows arbitrary labeling of the input, as shown in the demonstration, where the largest surface is painted with activity level 1 while the smallest surface is painted with activity value 3. An important constraint of the model is that the input labels should be distinctive (the difference between the activity levels of different surfaces should be large enough). In the demonstration, it is assumed to be one unit of activity. Therefore, the threshold on the excitatory pathways should be set
Figure 4: A gray-level image of an outdoor scene with a resolution of 192 × 128 pixels. (A) Input image. (B) Activation of the labeling network (xij ) with Tν = 2.5. (C) Activation of the size estimation network (Xij ). (D) Final segmentation result (Segij ) with TX = 100.
to a value above 0 but below D. Otherwise, the model will not correctly distinguish different surfaces. Figure 4 shows an application of the algorithmic version of the model (equations 2.9–2.10 and 2.15–2.16) to an outdoor image. The image is taken from a database of natural images used to test the properties of independent component filters based on independent component analysis. (It is available online at http://hlab.phys.rug.nl/imlib/index.html.) The database was created using a Kodak DCS420 digital camera. (Details of the acquisition procedure can be found in van Hateren and van der Schaaf, 1998.) The thumbnail versions of the images consist of 192 × 128 pixels with an intensity range from 0 to 255. Parameters are set to Tν = 2.5 and TX = 100. Different activity amplitudes are represented as different shades of gray. As can be seen from Figure 4D, the algorithm segments several major regions in the image, such as the central path, the left and the right meadow, larger and smaller bushes on the right, several bushes on the left, and the sky. In
total, 10 segments are produced, plus a background, which is denoted by the black area. Due to natural illumination, there is a gradient of intensity values from the bottom of the image to the center, as can be seen on the path and on the meadows. However, the algorithm treats these regions as single units because the illumination variations are regular across space, and the boundary system does not detect them. The building in the center of the image is not represented as a single segment because it contains irregular variations in intensity over a small area, due to the dark windows on the white facade. This illustrates the model's inability to cope with textured surfaces, because small surfaces are treated as noise, and the threshold for the size estimation network, TX, eliminates them. Figure 5 shows an application of the algorithm with the same parameter settings to another image from the same database. This image shows leaves in a forest. There are many leaves that occlude each other and are illuminated
Figure 5: A gray-level image showing leaves in a forest. It is from the same database as the image in Figure 4 and also consists of 192 × 128 pixels. (A) Input image. (B) Activation of the labeling network (xij ) with Tν = 2.5. (C) Activation of the size estimation network (Xij ). (D) Final segmentation result (Segij ) with TX = 100.
from the upper right corner, making segmentation very difficult. The combined output from the labeling network and the size estimation network is shown in Figure 5D. There are 10 segmented regions plus the background. The segments correspond to several larger leaves, while smaller leaves are filtered out. The segment in the upper right corner corresponds to the open space. Figure 6 shows the effect of a change in the parameter settings for the same image as in Figure 5. The threshold Tν is set to 3.5 and TX = 100 in Figure 6D. Now the algorithm produces 18 segments. If, in addition, the threshold for the size estimation network (TX) is lowered to 50, the output is even more fragmented, with 47 segments (see Figure 6E). Increasing Tν and decreasing TX lead to a more fragmented output. The threshold Tν influences the size of the segments and, implicitly, their number in the final output. A higher value of Tν implies that larger differences between neighboring pixels are necessary in order to prevent activity spreading in the labeling network. Therefore, the activity spreading will not get stuck on small, insignificant segments. However, there is always a risk of using too high a value, producing activity leakage and compromising the segmentation results. The threshold TX, on the other hand, acts simply as a size filter, which removes smaller segments from the representation. In order to facilitate comparison with LEGION, the algorithm was tested on an MRI (magnetic resonance imaging) image of a human head (available online from the Digital Anatomist Interactive Atlases, http://www9.biostr.washington.edu/cgi-bin/DA/imageform/). The image consists of 450 × 450 pixels (see Figure 7). Due to the larger scale of this image, the thresholds are set to larger values: Tν = 6.5 and TX = 500. The algorithm successfully segments several important regions, such as the cerebral cortex, the brain stem, the corpus callosum, the fornix, the septum pellucidum, and the bone marrow.
The entire image is partitioned into 20 regions. There are also several failures. For instance, the cerebellum is not represented as a segment, the white matter is not distinguished from the gray matter, and parts of the cerebral cortex, including the striate and parietal regions, are not included in a single segment. The failures are a consequence of the large variations in image intensity in these regions, which result in a large number of small segments, as indicated by the activity gradients in Figure 7B. Despite these shortcomings, the algorithm exhibits good segmentation performance, comparable with LEGION. However, an important difference between the models is that the labeling network has the inherent capacity to represent arbitrarily many visual segments, while LEGION achieves this only with an algorithmic approximation.

4 Discussion

Natural images contain a large number of objects, which poses a representational problem for image segmentation algorithms based on neural networks. Models based on oscillatory correlation, such as LEGION, have a
Figure 6: The same gray-level image as in the Figure 5, but with different parameter settings. (A) Input image. (B) Activation of the labeling network (xij ) with Tν = 3.5. (C) Activation of the size estimation network (Xij ). (D) Final segmentation result (Segij ) with TX = 100. (E) Segmentation result with TX = 50.
Figure 7: A gray-level image of an MRI scan of a human head. It consists of 450 × 450 pixels (available online from Digital Anatomist Interactive Atlases). (A) Input image. (B) Activation of the labeling network (xij ) with Tν = 6.5. (C) Activation of the size estimation network (Xij ). (D) Final segmentation result (Segij ) with TX = 500.
limited capacity to separate patterns (Wang & Terman, 1997). Even if a large capacity were possible, the network would need a very long time to complete segmentation. On the other hand, models based on an amplitude code, such as FBF or CLM, avoid the capacity issue by projecting different segments onto different network layers (Grossberg & Wyse, 1991; Wersing et al., 2001). This strategy
implies either that the number of objects is known in advance or that the number of network layers is very large. The model proposed here is based on an amplitude code, but it requires only one network layer (with the size estimation network as an auxiliary mechanism for removing small, noisy regions in real images). Pattern separation is achieved by labeling different objects with different activity levels. Pixels that belong to the same object are painted with the same activity value due to the local excitatory links. Self-excitation ensures a uniform representation of segments without any auxiliary mechanisms. Pixels that belong to different objects are segregated by lateral inhibitory signals, which maintain an amplitude difference between the different groups. In contrast with the spin-lattice model (Opara & Wörgötter, 1998), which assumes units with discrete activation states (i.e., Potts spins), the continuous activation dynamics of the present model converges to a set of discrete labels. The flow of local excitatory signals is regulated by a separate set of units that compute boundaries between objects. The proposed network deals only with the issue of representation; surface boundaries are computed in other networks, such as the BCS. Although the model's behavior resembles the interaction between surface and boundary computation proposed by Grossberg and Todorović (1988), there are two important differences. In the present approach, surfaces are constructed from an abstract code provided by a single time-varying cell, while the feature contour system (FCS) receives input from contrast-sensitive cells similar to lateral geniculate nucleus (LGN) cells, which are also used for boundary computation. Another difference is that the interaction between the surface and boundary systems occurs at dendrites, which allows for a sharper surface representation in the present formulation.
The labeling network shares similar architectural assumptions with previously proposed models because it depends on local excitatory links and lateral inhibitory links (Grossberg, 1988; Terman & Wang, 1995; Wersing et al., 2001). The standard architecture is augmented with dendritic inhibition of the lateral inhibitory pathways, which enables the network to represent an arbitrary number of image segments. Dendritic inhibition is closely related to the mechanism of presynaptic inhibition previously proposed by Yuille and Grzywacz (1989) in a model of winner-take-all behavior. Spratling and Johnson (2001, 2002) provide a review of arguments for dendritic inhibition as a plausible mechanism. They showed that dendritic inhibition improves network capacity for storing patterns. Here it is shown that dendritic inhibition also increases the representational capacity of the network. Multiplicative interactions at dendritic branches may arise as a consequence of the electrical properties of real dendritic membrane (Mel, 1994). The assumption that object identity is coded by the amplitude of neural activity is in accord with a recent proposal by Roelfsema, Lamme, and Spekreijse (2000; Roelfsema, Lamme, Spekreijse, & Bosch, 2002). However, the present approach does not require attention or any high-level visual process (e.g., associative memory) in order to form a segmented representation of an input scene.
Representational Capacity
1937
Several neurophysiological studies support the idea of labeling different surfaces or objects with different activity levels. Lamme (1995) studied neural activity in the primary visual cortex in response to stimuli in which a figure is distinguished from the background by a difference in texture orientation or motion direction. He observed a difference in the firing rate depending on whether a neuron's receptive field was inside or outside the region belonging to the figure. The rate enhancement was uniform along the whole figure, regardless of the location of the receptive field, which could be on the border or in the interior of the figure. In a subsequent study, Zipser, Lamme, and Schiller (1996) found firing rate enhancement for figures defined by differences in luminance, color, or disparity. Furthermore, in both studies, the rate enhancement related to the figure-ground relationship had a longer latency than the initial response to local image features. Also, texture boundaries are processed before the texture interior is enhanced, suggesting different network mechanisms for surfaces and borders (Lamme, Rodriguez-Rodriguez, & Spekreijse, 1999). Based on these studies, Lamme and Roelfsema (2000) proposed a model of sensory processing in the brain with two distinct modes. The first is a fast feedforward mode, in which cells respond to the local image feature to which they are tuned. The second is a slower feedback mode, which represents a more abstract code related to figure-ground segregation and possibly other aspects of visual perception. Unfortunately, Lamme and his colleagues used displays with only one figure and a background; extensions to multi-element displays remain to be done. For that case, the present model makes a testable prediction: distinct surfaces should be represented with different firing rates, and all cells coding the same surface should exhibit the same firing rate.
It is not unreasonable to expect such an outcome given the recent finding that cells in the prefrontal cortex use an amplitude code for representing an abstract property, namely, the serial position of motor commands in a sequence of movements to be performed (Averbeck, Chafee, Crowe, & Georgopoulos, 2002). However, a study from another laboratory failed to find evidence for enhanced activity in the interior of the texture in a texture segregation task (Rossi, Desimone, & Ungerleider, 2001). Enhanced activity was observed only at the texture borders. Rossi et al. argued that the discrepancy in results could be accounted for by the different behavioral paradigms employed in these studies. On the other hand, Albright and Stoner (2002) suggest that textured stimuli do not provide strong cues for segmentation and that configurations with depth cues should be tested instead. For instance, two overlapping surfaces will promote stronger segmentation because they contain T-junctions. Relevant findings are summarized by Lee (2003), who also argued that enhanced activity in the interior of surfaces reflects important information about figure-ground organization.

Results of segmentation need to be interpreted in some way by higher-level visual areas. How surface selection and recognition could be performed using temporal correlations is not entirely clear (Shadlen & Movshon, 1999). Wang (1999) proposed an architecture for surface selection that takes the LEGION as input. The present approach, in contrast, offers a straightforward method for selecting a particular segment or performing a visual search. The network dynamics forces one surface to attain the maximal activity amplitude due to self-excitation. Therefore, that surface is selected simply by imposing a threshold on the output of the labeling network. The threshold should assume a value just slightly below the maximal amplitude in the network. Moreover, the threshold could be lowered in order to select more than one object at a time, which is consistent with recent psychophysical evidence (Davis, Driver, Pavani, & Shepherd, 2000; Davis, Welch, Holmes, & Shepherd, 2001). When a surface (or surfaces) is selected, it is evaluated in higher-level visual areas, and if it is not suitable for the task at hand, it can be actively inhibited. When the representation of the most active surface is removed, the next most active representation is freed from inhibition, because it received inhibition only from the previously selected surface. Its activity therefore grows until it reaches the maximal amplitude, which means that it rises above the threshold and becomes the next selected surface. This process can be continued until a target surface is found. In this way, a serial search for a visual target is implemented in the network (Wolfe et al., 2002). Visual search models usually assume a saliency map as input, computed as a weighted difference over all registered visual features, which drives the search process (Itti & Koch, 2000, 2001; Wolfe, 1994). The labeling network assigns arbitrary activity amplitudes to the surfaces because of the abstract identity code. However, it is possible that the labeling network receives input directly from the saliency map.
In that case, the activity amplitude of a particular surface would correspond to its saliency; if two or more surfaces have the same saliency measure, they will receive the same activity amplitude and be selected together. A disadvantage of the proposed approach is the extensive wiring necessary for proper network operation. Besides full connectivity of the lateral inhibitory pathways, every location has its own set of connections from the dendritic interneuron to the dendrites of the lateral interneurons. Further research will explore whether it is possible to achieve the same behavior in a structurally less demanding architecture. Another task is to test the network on noisy images. In that case, a more complex boundary detection algorithm should be used (e.g., the BCS). Recent variants of the BCS have been shown to be very effective in detecting surface borders even in the presence of high noise (Mingolla, Ross, & Grossberg, 1999; Ross, Grossberg, & Mingolla, 2000).
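The select-evaluate-inhibit cycle described in the discussion can be sketched abstractly over a list of surface amplitudes: the threshold just below the maximum picks the winner, and inhibiting a rejected winner releases the next most active surface. The function and names below are ours, an illustration of the serial-search logic rather than the network implementation.

```python
def serial_search(amplitudes, is_target):
    """Select the most active surface (a threshold just below the maximum),
    evaluate it, and inhibit it if it is not the target, releasing the next
    most active surface. Returns (target index or None, visit order)."""
    remaining = dict(enumerate(amplitudes))
    order = []
    while remaining:
        winner = max(remaining, key=remaining.get)  # thresholding selects the peak
        order.append(winner)
        if is_target(winner):
            return winner, order
        del remaining[winner]                       # active inhibition of the reject
    return None, order

# three surfaces with distinct amplitude labels; index 0 is the target
found, visited = serial_search([0.4, 0.9, 0.7], is_target=lambda i: i == 0)
```

The visit order falls out of the amplitudes alone, which is exactly why an arbitrary identity code can be replaced by saliency-driven amplitudes to obtain a guided search.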
References

Albright, T. D., & Stoner, G. R. (2002). Contextual influences on visual processing. Annual Review of Neuroscience, 25, 339–379.
Averbeck, B. B., Chafee, M. V., Crowe, D. A., & Georgopoulos, A. P. (2002). Parallel processing of serial movements in prefrontal cortex. Proceedings of the National Academy of Sciences USA, 99, 13172–13177.
Beck, D. M., Rees, G., Frith, C. D., & Lavie, N. (2001). Neural correlates of change detection and change blindness. Nature Neuroscience, 4, 645–650.
Becker, M. W., Pashler, H., & Anstis, S. M. (2000). The role of iconic memory in change detection tasks. Perception, 29, 273–286.
Cesmeli, E., & Wang, D. L. (2000). Motion segmentation based on motion/brightness integration and oscillatory correlation. IEEE Transactions on Neural Networks, 11, 935–947.
Cesmeli, E., & Wang, D. L. (2001). Texture segmentation using Gaussian-Markov random fields and neural oscillator networks. IEEE Transactions on Neural Networks, 12, 394–404.
Davis, G., Driver, J., Pavani, F., & Shepherd, A. (2000). Reappraising the apparent costs of attending to two separate visual objects. Vision Research, 40, 1323–1332.
Davis, G., Welch, V. L., Holmes, A., & Shepherd, A. (2001). Can attention select only a fixed number of objects at a time? Perception, 30, 1227–1248.
Driver, J., Davis, G., Russell, C., Turatto, M., & Freeman, E. (2001). Segmentation, attention and phenomenal visual objects. Cognition, 80, 61–95.
Gray, C. M. (1999). The temporal correlation hypothesis of visual feature integration: Still alive and well. Neuron, 24, 31–47.
Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17–61.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Perception and Psychophysics, 38, 141–171.
Grossberg, S., & Todorović, D. (1988). Neural dynamics of 1-D and 2-D brightness perception: A unified model of classical and recent phenomena. Perception and Psychophysics, 43, 241–277.
Grossberg, S., & Wyse, L. (1991). A neural network architecture for figure-ground separation of connected scenic figures. Neural Networks, 4, 723–742.
Häusser, M., & Mel, B. W. (2003). Dendrites: Bug or feature? Current Opinion in Neurobiology, 13, 372–383.
Hollingworth, A., & Henderson, J. M. (2002). Accurate visual memory for previously attended objects in natural scenes. Journal of Experimental Psychology: Human Perception and Performance, 28, 113–136.
Itti, L., & Koch, C. (2000). A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489–1506.
Itti, L., & Koch, C. (2001). Computational modeling of visual attention. Nature Reviews Neuroscience, 2, 1–11.
Kaski, S., & Kohonen, T. (1994). Winner-take-all networks for physiological models of competitive learning. Neural Networks, 7, 973–984.
Lamme, V. A. F. (1995). The neurophysiology of figure-ground segregation in primary visual cortex. Journal of Neuroscience, 15, 1605–1615.
Lamme, V. A. F., Rodriguez-Rodriguez, V., & Spekreijse, H. (1999). Separate processing dynamics for texture elements, boundaries and surfaces in primary visual cortex of the macaque monkey. Cerebral Cortex, 9, 406–413.
Lamme, V. A. F., & Roelfsema, P. R. (2000). The distinct modes of vision offered by feedforward and recurrent processing. Trends in Neurosciences, 23, 571–579.
Landman, R., Spekreijse, H., & Lamme, V. A. F. (2003). Large capacity storage of integrated objects before change blindness. Vision Research, 43, 149–164.
Lee, T. S. (2003). Computations in the early visual cortex. Journal of Physiology—Paris, 97, 121–139.
Mack, A. (2003). Inattentional blindness: Looking without seeing. Current Directions in Psychological Science, 12, 180–184.
Mack, A., & Rock, I. (1998). Inattentional blindness. Cambridge, MA: MIT Press.
Marcus, D. S., & van Essen, D. C. (2002). Scene segmentation and attention in primate cortical areas V1 and V2. Journal of Neurophysiology, 88, 2648–2658.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Milner, P. M. (1974). A model of visual shape recognition. Psychological Review, 81, 521–535.
Mingolla, E., Ross, W. D., & Grossberg, S. (1999). A neural network for enhancing boundaries and surfaces in synthetic aperture radar images. Neural Networks, 12, 499–512.
Moore, C. M., & Egeth, H. (1997). Perception without attention: Evidence of grouping under conditions of inattention. Journal of Experimental Psychology: Human Perception and Performance, 23, 339–352.
Opara, R., & Wörgötter, F. (1998). A fast and robust cluster update algorithm for image segmentation in spin-lattice models without annealing—Visual latencies revisited. Neural Computation, 10, 1547–1566.
Palmer, S., & Rock, I. (1994). Rethinking perceptual organization: The role of uniform connectedness. Psychonomic Bulletin and Review, 1, 29–55.
Peterson, M. A. (1994). Object recognition processes can and do operate before figure-ground organization. Current Directions in Psychological Science, 3, 105–111.
Peterson, M. A., & Gibson, B. S. (1994a). Must figure-ground organization precede object recognition? An assumption in peril. Psychological Science, 5, 253–259.
Peterson, M. A., & Gibson, B. S. (1994b). Object recognition contributions to figure-ground organization: Operations on outlines and subjective contours. Perception and Psychophysics, 56, 551–564.
Poirazi, P., Brannon, T., & Mel, B. W. (2003). Pyramidal neuron as two-layer neural network. Neuron, 37, 989–999.
Pylyshyn, Z. (1999). Is vision continuous with cognition? The case for cognitive impenetrability of visual perception. Behavioral and Brain Sciences, 22, 341–423.
Rensink, R. A. (2000). Seeing, sensing, and scrutinizing. Vision Research, 40, 1469–1487.
Rensink, R. A. (2002). Change detection. Annual Review of Psychology, 53, 245–277.
Roelfsema, P. R., Lamme, V. A. F., & Spekreijse, H. (2000). The implementation of visual routines. Vision Research, 40, 1385–1411.
Roelfsema, P. R., Lamme, V. A. F., Spekreijse, H., & Bosch, H. (2002). Figure-ground segregation in a recurrent network architecture. Journal of Cognitive Neuroscience, 14, 525–537.
Ross, W. D., Grossberg, S., & Mingolla, E. (2000). Visual cortical mechanisms of perceptual grouping: Interacting layers, networks, columns and maps. Neural Networks, 13, 571–588.
Rossi, A. F., Desimone, R., & Ungerleider, L. G. (2001). Contextual modulation in primary visual cortex of macaques. Journal of Neuroscience, 21, 1698–1709.
Shadlen, M. N., & Movshon, J. A. (1999). Synchrony unbound: A critical evaluation of the temporal binding hypothesis. Neuron, 24, 67–77.
Simons, D. J. (2000). Current approaches to change blindness. Visual Cognition, 7, 1–15.
Simons, D. J., Chabris, C. F., Schnur, T., & Levin, D. T. (2002). Evidence for preserved representations in change blindness. Consciousness and Cognition, 11, 78–97.
Simons, D. J., & Levin, D. T. (1997). Change blindness. Trends in Cognitive Sciences, 1, 261–267.
Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24, 49–65.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Spratling, M. W., & Johnson, M. H. (2001). Dendritic inhibition enhances neural coding properties. Cerebral Cortex, 11, 1144–1149.
Spratling, M. W., & Johnson, M. H. (2002). Preintegration lateral inhibition enhances unsupervised learning. Neural Computation, 14, 2157–2179.
Terman, D., & Wang, D. L. (1995). Global competition and local cooperation in a network of coupled oscillators. Physica D, 81, 148–176.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society London B, 265, 359–366.
van Rullen, R., & Koch, C. (2003). Competition and selection during visual processing of natural scenes and objects. Journal of Vision, 3, 75–85.
Vecera, S. P. (2000). Toward a biased competition account of object-based segregation and attention. Brain and Mind, 1, 353–384.
Vecera, S. P., Flevaris, A. V., & Filapek, J. C. (2004). Exogenous spatial attention influences figure-ground assignment. Psychological Science, 15, 20–26.
Vecera, S. P., & O'Reilly, R. C. (1998). Figure-ground organization and object recognition processes: An interactive account. Journal of Experimental Psychology: Human Perception and Performance, 24, 441–462.
von der Malsburg, C. (1981). The correlation theory of brain function (Int. Rep. 81-2). Göttingen: Max Planck Institute for Biophysical Chemistry.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biological Cybernetics, 54, 29–40.
Wang, D. (1999). Object selection based on oscillatory correlation. Neural Networks, 12, 579–592.
Wang, D. L. (2000). On connectedness: A solution based on oscillatory correlation. Neural Computation, 12, 131–139.
Wang, D. L., & Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Computation, 9, 805–836.
Wersing, H., Steil, J. J., & Ritter, H. (2001). A competitive-layer model for feature binding and sensory segmentation. Neural Computation, 13, 357–387.
Wolfe, J. M. (1994). Guided search 2.0: A revised model of visual search. Psychonomic Bulletin and Review, 1, 202–238.
Wolfe, J. M. (1999). Inattentional amnesia. In V. Coltheart (Ed.), Fleeting memories (pp. 71–94). Cambridge, MA: MIT Press.
Wolfe, J. M., Oliva, A., Horowitz, T. S., Butcher, S. J., & Bompas, A. (2002). Segmentation of objects from backgrounds in visual search tasks. Vision Research, 42, 2985–3004.
Yuille, A. L., & Grzywacz, N. M. (1989). A winner-take-all mechanism based on presynaptic inhibition feedback. Neural Computation, 1, 334–347.
Zipser, K., Lamme, V. A. F., & Schiller, P. H. (1996). Contextual modulation in primary visual cortex. Journal of Neuroscience, 16, 7376–7389.

Received March 25, 2003; accepted March 9, 2004.
LETTER
Communicated by Michael Arnold
A Solution for Two-Dimensional Mazes with Use of Chaotic Dynamics in a Recurrent Neural Network Model

Yoshikazu Suemitsu
[email protected] Graduate School of Natural Science and Technology, Okayama University, Okayama 700-8530, Japan
Shigetoshi Nara
[email protected] Department of Electrical and Electronic Engineering, Okayama University, Okayama, 700-8530, Japan
Chaotic dynamics introduced into a neural network model is applied to solving two-dimensional mazes, which are ill-posed problems. A moving object moves from its position at time t to that at t + 1 by a simply defined motion function calculated from the firing patterns of the neural network model at each time step t. We have embedded in our neural network model several prototype attractors that correspond to simple motions of the object toward several directions in two-dimensional space. Introducing chaotic dynamics into the network gives outputs sampled from intermediate state points between the embedded attractors in state space, and these dynamics enable the object to move in various directions. Switching a system parameter between a chaotic and an attractor regime in the state space of the neural network enables the object to reach a set target in a two-dimensional maze. Results of computer simulations show that the success rate of this method over 300 trials is higher than that of a random walk. To investigate why the proposed method gives better performance, we calculate and discuss statistical data with respect to the dynamical structure.

Neural Computation 16, 1943–1957 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

Since Poincaré discovered chaos, chaotic phenomena have been observed in various scientific and engineering fields. Deterministic but complex behavior, which is caused by internal instability and nonlinearity, has been attracting strong interest among scientists. A variety of phenomenological and dynamical properties of chaos have been investigated and reported. Chaos has been discovered even in living systems, including the brain, with the help of great advances in measuring technologies and has had a great impact on the effort to understand biological systems (Babloyantz
& Destexhe, 1986; Skarda & Freeman, 1987). There is the possibility that chaos is important in realizing highly advanced information processing and control functions. If this is so and the mechanism is understood, then chaos could be applied to novel information processing or control systems based on nonconventional control principles.

In biological systems, local subsystems and the total system are coupled to each other with quite strong nonlinearity. This prevents us from decoupling them by conventional reductionism, that is, from decomposing them into individual parts or elements. In other words, the difficulty of "combinatorial explosion" or "divergence of algorithmic complexity" occurs because the number of degrees of freedom is too large. In such situations, large-scale computer simulations or heuristic methods are useful for approaching the problem. Artificial neural networks in which chaotic dynamics can occur are a typical approach, and the relation between chaos and function has been discussed (Aihara, Takabe, & Toyoda, 1990; Tsuda, 1991, 2001; Fujii, Itoh, Ichinose, & Tsukada, 1996). The whole picture, however, is not yet revealed. Nara and Davis found that chaotic dynamics can occur in a recurrent neural network, and they investigated functional aspects of chaos by applying it to solving, for instance, a memory search set under an ill-posed context (Nara & Davis, 1992, 1997; Nara, Davis, Kawachi, & Totuji, 1995; Kuroiwa, Nara, & Aihara, 1999; Suemitsu & Nara, 2003).

From the perspective of control, chaos has been considered to spoil control systems. Consequently, a large number of methods have been proposed to avoid chaos. However, we consider chaos to be useful not only in solving ill-posed problems but also in controlling systems with large but finite degrees of freedom. To show the potential of chaotic dynamics generated in neural networks, we apply it to a control function using a maze in two-dimensional space as an example.
Here, an object is assumed to move in a two-dimensional maze and to approach the target by employing chaotic dynamics. One reason that we consider a maze is that the process involved in solving a maze can be easily visualized; we can understand how the dynamical structures are effectively utilized in controlling.

Let us state the essence of the model's construction. The object is assumed to move in discrete time steps. The state vector of the network is translated into the object's motion, as defined in a later section. In addition, several limit cycle attractors, which are regarded as prototypes of simple motion, are embedded in the network. Via the motion function, they correspond to movement toward several directions in two-dimensional space. If the network state converges to one of the prototype attractors, the object moves constantly in one of these directions in two-dimensional space. Introducing chaotic dynamics into the network generates nonperiodic output. The nonperiodic state vector, which includes internal instability, is translated into chaotic motion of the object through the motion function. Chaotic dynamics spontaneously samples novel motions among the attractors; this is the first important point. In addition, system parameter switching by a simple evaluation with a certain tolerance between chaotic and stable phases of the network can realize a complex motion of the object. The interaction between spontaneous motion generated by chaotic dynamics and an external simple evaluation including uncertainty gives constrained chaos; this is the second important point. In the computer simulations, we find that the present method utilizing chaos gives performance superior to a stochastic method based on a random walk. To understand the mechanism of the better performance, we have investigated the dynamical hysteresis based on statistical data.

2 Network Model and Motion Function

In this letter, a binary neural network is employed, defined by

\[ s_i(t+1) = \operatorname{sgn}\Bigl( \sum_{j=1}^{N} \varepsilon_{r;ij}\, w_{ij}\, s_j(t) \Bigr), \tag{2.1} \]

\[ \operatorname{sgn}(u) = \begin{cases} +1 & (\text{if } u \ge 0) \\ -1 & (\text{if } u < 0), \end{cases} \]
where s_i(t) = ±1 is the state of neuron i at time t, N is the total number of neurons, sgn(u) is a signature function, {ε_{r;ij}} is a binary activity that is defined to be 0 or 1 and satisfies the relation \(\sum_j \varepsilon_{r;ij} = r\), r is called the fan-in number or connectivity of each neuron, and {w_{ij}} is the synaptic connection matrix. It is determined by

\[ w_{ij} = \sum_{\mu=1}^{L} \sum_{\lambda=1}^{M} \xi_i^{\mu,\lambda+1}\, {}^{\dagger}\xi_j^{\mu,\lambda}, \tag{2.2} \]

\[ {}^{\dagger}\xi_j^{\mu,\lambda} = \sum_{\alpha=1}^{L} \sum_{\beta=1}^{M} (O^{-1})^{\mu,\lambda}_{\alpha,\beta}\, \xi_j^{\alpha,\beta}, \tag{2.3} \]

\[ O^{\mu,\lambda}_{\alpha,\beta} = \sum_{j} \xi_j^{\mu,\lambda}\, \xi_j^{\alpha,\beta}, \tag{2.4} \]

where {ξ_i^{µ,λ} | i = 1, . . . , N; µ = 1, . . . , L; λ = 1, . . . , M} is the attractor pattern set described below. The periodic sequence ξ^{µ,λ} = ξ^{µ,λ+M} is employed in this model, where †ξ^{µ,λ} is the adjoint vector of ξ^{µ,λ}, which satisfies the relation †ξ^{α,ρ} · ξ^{β,σ} = δ_{α,β} δ_{ρ,σ}, where δ is Kronecker's delta. This enables us to embed L attractors, each with an M-step map, in the network while avoiding spurious attractors when the connectivity r = N (Nara & Davis, 1992, 1997; Nara et al., 1993, 1995; Suemitsu & Nara, 2003). In our computer simulations, the case N = 400, L = 4, M = 6 is employed. If N is too small, chaotic dynamics does not occur, whereas if N is oversized, it results in
excessive computing time. This is why we selected the case N = 400. The meaning of L and M is explained below.

In two-dimensional space, an object is assumed to move from the position (p_x(t), p_y(t)) to (p_x(t + 1), p_y(t + 1)) via a set of motion functions. Using the network state s(t), the motion functions f_x(s(t)), f_y(s(t)) are defined by

\[ f_x(s(t)) = \frac{4}{N} \sum_{i=1}^{N/4} s_i(t) \cdot s_{i+\frac{N}{2}}(t), \tag{2.5} \]

\[ f_y(s(t)) = \frac{4}{N} \sum_{i=1}^{N/4} s_{i+\frac{N}{4}}(t) \cdot s_{i+\frac{3N}{4}}(t), \tag{2.6} \]

where f_x and f_y range from −1 to +1 due to the normalization by 4/N. Note that each motion function is calculated by an inner product between two parts of the network state s(t). In two-dimensional space, the actual motion of the object is given by

\[ p_x(t) = p_x(t-1) + f_x(s(t)), \tag{2.7} \]

\[ p_y(t) = p_y(t-1) + f_y(s(t)). \tag{2.8} \]
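Equations 2.5 through 2.8 translate the network state into one step of the object and can be transcribed directly. The sketch below uses a toy N divisible by 4; the helper name is ours.

```python
import numpy as np

def motion(s):
    """Eqs. 2.5-2.6: each motion function is an inner product between two
    quarters of the ±1 state vector, normalized by 4/N into [-1, +1]."""
    N = len(s)
    q = N // 4
    fx = 4.0 / N * np.dot(s[:q], s[N // 2:N // 2 + q])
    fy = 4.0 / N * np.dot(s[q:2 * q], s[3 * q:])
    return fx, fy

# a tiny state: first quarter matches the third (so fx = +1),
# second quarter opposes the fourth (so fy = -1)
s = np.array([1, -1,  1, 1,  1, -1,  -1, -1])
fx, fy = motion(s)

# eqs. 2.7-2.8: the step is simply added to the current position
px, py = 0.0, 0.0
px, py = px + fx, py + fy
```

Because each of the N/4 products is ±1, the step along each axis is quantized in units of 8/N, which for N = 400 gives the 0.02 resolution mentioned next.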
In our computer simulations, two-dimensional space is digitized with a resolution of 0.02, due to the binary state vector s(t) with 400 elements and the definition of f_x and f_y in equations 2.5 and 2.6. The set of attractor patterns in equation 2.2 is determined as follows. Each attractor pattern is divided into two parts. One is a random pattern component, where each state becomes +1 or −1 with a probability of 0.5 (ξ_i^{µ,λ} = ±1; i = 1, . . . , N/2). The other part (ξ_i^{µ,λ} = ±1; i = N/2 + 1, . . . , N) is determined so that the relations

\[ (f_x(\xi^{1,\lambda}), f_y(\xi^{1,\lambda})) = (-1, -1), \tag{2.9} \]
\[ (f_x(\xi^{2,\lambda}), f_y(\xi^{2,\lambda})) = (-1, +1), \tag{2.10} \]
\[ (f_x(\xi^{3,\lambda}), f_y(\xi^{3,\lambda})) = (+1, -1), \tag{2.11} \]
\[ (f_x(\xi^{4,\lambda}), f_y(\xi^{4,\lambda})) = (+1, +1), \tag{2.12} \]

hold. Under the condition λ = 1, . . . , 6, four limit cycle attractors, each of which has M (= 6) patterns, are embedded in the synaptic connection matrix of the network. Each limit cycle attractor corresponds to a constant motion of the object toward one of the four directions (+1, +1), (+1, −1), (−1, +1), (−1, −1).
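Section 2 can be condensed into a short numerical check: build the L = 4 cycles so that constraints 2.9 to 2.12 hold exactly (for f_x(ξ) = ±1 every product must equal ±1, so the third quarter is a copy or sign flip of the first, and likewise the fourth of the second), embed them with the pseudo-inverse rule of equations 2.2 to 2.4, and verify that one synchronous update (equation 2.1 at full connectivity r = N) maps each stored pattern onto its successor. A toy N is assumed here for speed; the letter uses N = 400.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, M = 80, 4, 6                              # toy N; letter uses N = 400
signs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]    # eqs. 2.9-2.12 for mu = 1..4

def motion(s):
    """Motion functions of eqs. 2.5-2.6 (N divisible by 4)."""
    q = len(s) // 4
    fx = 4.0 / len(s) * np.dot(s[:q], s[2 * q:3 * q])
    fy = 4.0 / len(s) * np.dot(s[q:2 * q], s[3 * q:])
    return fx, fy

# random first half; second half copies/flips it so (fx, fy) = (sx, sy) exactly
xi = np.empty((L, M, N), dtype=int)
for mu, (sx, sy) in enumerate(signs):
    for lam in range(M):
        half = rng.choice([-1, 1], size=N // 2)
        xi[mu, lam] = np.concatenate([half, sx * half[:N // 4], sy * half[N // 4:]])

# pseudo-inverse embedding (eqs. 2.2-2.4); cycles wrap: xi^{mu,lam+M} = xi^{mu,lam}
P = xi.reshape(L * M, N).astype(float)          # rows are the stored patterns
succ = np.roll(xi, -1, axis=1).reshape(L * M, N)
w = succ.T @ np.linalg.pinv(P.T)                # w xi^{mu,lam} = xi^{mu,lam+1}

# one synchronous update (eq. 2.1 with r = N, sgn(0) = +1)
recalled = np.where(w @ P.T >= 0, 1, -1).T
```

The pseudo-inverse (adjoint-vector) construction is what guarantees exact one-step recall of the cycles at r = N, in contrast to a plain Hebbian outer-product rule.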
3 Chaotic Dynamics

If the connectivity r is sufficiently large, the network state s(t) converges to one of the cyclic attractors. This corresponds to a conventional associative memory. Under the condition s(t) = ξ^{µ,λ}, the output at the next time step becomes s(t + 1) = ξ^{µ,λ+1}. Even if the network state s(t) is only near one of the attractor patterns ξ^{µ,λ}, the output sequence s(t + kM) (k = 1, 2, 3, . . .) generated by the M-step map will converge to the pattern ξ^{µ,λ}. This fact suggests that each attractor pattern has a set of network states from which it is reached, the so-called attractor basin B^{µ,λ}. A network state s(t + kM) (k = 1, 2, 3, . . .) converges to the attractor pattern ξ^{µ,λ} if s(t) is in the attractor basin B^{µ,λ}.

Estimating basin volume is one of the main features of this network. To estimate basin volume exactly, one must check the final state (lim_{k→∞} s(kM)) for every initial state of the network, which is impractical because of the enormous amount of time required (the total number of initial states is 2^N). Therefore, the approximate basin volume is estimated by the following method. First, a large number of random initial states is generated. These cover the entire state space uniformly if their number is sufficiently large. By updating each initial state, its final state lim_{k→∞} s(kM) can be determined, and thus the ratio between the number of initial states that converge to a certain attractor and the total number of initial states can be computed. This rate of convergence to each attractor pattern is proportional to the basin volume and is regarded as the approximate basin volume of that attractor. Figure 1 shows the basin volumes of the network. The estimation indicates that almost all initial states converge to one of the attractor patterns and that the basin volumes are approximately equal to one another.

Let us now consider introducing chaotic dynamics by decreasing the connectivity r.
When r is reduced, the network state wanders chaotically in the state space, since the stability of the embedded attractors is lost. In our previous article (Nara et al., 1995), we discussed whether the generated dynamics is chaotic. We have observed orbital instability for low connectivity, and we have calculated the invariant measure in terms of basin visiting frequency. This means that a dynamical structure does exist in the state space under reduced connectivity (Nara & Davis, 1992, 1997; Nara et al., 1993, 1995; Suemitsu & Nara, 2003). In particular, it is confirmed that the network state visits all basins of the embedded attractors chaotically within a certain range of small r. In our model, each embedded attractor corresponds to a prototypical motion in two-dimensional space. Stable output of the network with large r provides a constant motion of the object in two-dimensional space; Figure 2 shows the motion of the object with large r. Furthermore, if the network becomes unstable with low connectivity r (< 70), the object will
Figure 1: Basin volume: The horizontal axis represents memory pattern number (1–24). Basin 25 on the horizontal axis corresponds to samples that converged into cyclic outputs having a period of six steps that do not converge to any memory attractor. Basin 26 corresponds to samples that do not belong to any other case (1–25). The vertical axis represents the ratio of each sample to the total number of samples. Hatching and nonhatching are used alternately to show different cycle attractors.
move chaotically. Figure 3 shows a chaotic orbit generated by the chaotic dynamics of the network.
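The basin-volume estimation of this section is a plain Monte Carlo procedure: draw random initial states, iterate the M-step map, and count where each run lands. A sketch on a toy pseudo-inverse network at full connectivity follows; the sizes, sample count, and helper names are ours and far smaller than the N = 400 network of the letter.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, M = 40, 2, 2
xi = rng.choice([-1, 1], size=(L, M, N))            # random cycle patterns
P = xi.reshape(L * M, N).astype(float)
succ = np.roll(xi, -1, axis=1).reshape(L * M, N)
w = succ.T @ np.linalg.pinv(P.T)                    # pseudo-inverse rule

def m_step_map(s):
    """Apply eq. 2.1 M times (full connectivity, r = N)."""
    for _ in range(M):
        s = np.where(w @ s >= 0, 1, -1)
    return s

def basin_index(s, sweeps=50):
    """Index of the stored pattern the M-step map reaches, or -1 if none."""
    for _ in range(sweeps):
        s = m_step_map(s)
        hits = np.nonzero((P == s).all(axis=1))[0]
        if hits.size:
            return int(hits[0])
    return -1

samples = 200
counts = np.zeros(L * M + 1)
for _ in range(samples):
    k = basin_index(rng.choice([-1, 1], size=N))
    counts[k] += 1                                   # k = -1 goes to the last bin
volumes = counts / samples                           # approximate basin volumes
```

The last bin collects non-converging samples, playing the role of bins 25 and 26 in Figure 1. Reducing the connectivity r (masking most entries of w per row) destabilizes the attractors and produces the chaotic itinerancy discussed above.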
4 Motion Control

In this section, we propose a control method for the object based on switching the connectivity r according to a simple evaluation. A target is assumed to be set in two-dimensional space; however, the exact coordinates of the target (q_x, q_y) are not given to the control method proposed below. As an example of an external response with uncertainty, a rough direction of the target, D_1(t), with a certain tolerance is defined and made available to the object. For instance, D_1(t) becomes 1 if the direction from the object to the target lies between the angles 0 and π/2, where the angle is measured from the x-axis of two-dimensional space. In general, D_1(t) becomes n (= 1, 2, 3, 4) if the direction lies between the angles (n − 1)π/2 and nπ/2. Second, the direction of
Figure 2: An example of the object's orbit from the start point (0, 0) in two-dimensional space with an attracting network state (r = 400). One of the motions embedded in the synaptic connection matrix {w_{ij}} is shown.
the object motion D_2(t) from time t − 1 to t is defined as

\[ D_2(t) = \begin{cases} 1 & (c_x(t) = +1 \text{ and } c_y(t) = +1) \\ 2 & (c_x(t) = -1 \text{ and } c_y(t) = +1) \\ 3 & (c_x(t) = -1 \text{ and } c_y(t) = -1) \\ 4 & (c_x(t) = +1 \text{ and } c_y(t) = -1), \end{cases} \]

where c_x(t) and c_y(t) are given as

\[ c_x(t) = \frac{p_x(t) - p_x(t-1)}{|p_x(t) - p_x(t-1)|}, \tag{4.1} \]

\[ c_y(t) = \frac{p_y(t) - p_y(t-1)}{|p_y(t) - p_y(t-1)|}. \tag{4.2} \]
Finally, using D1(t) and D2(t), a time-dependent connectivity r(t) in the network is determined as

    r(t) = RL (= N)    (if D1(t − 1) = D2(t − 1))
           RS (≪ N)    (otherwise),   (4.3)
1950
Y. Suemitsu and S. Nara
Figure 3: An example of the object orbit from the start point (0, 0) in two-dimensional space with chaotic network states (r = 10).
where RL has sufficiently high connectivity (= N in our experiments) and RS has low connectivity. After determining the connectivity r(t), we calculate the motion of the object from the network state updated with r(t). The motion provides the next calculation of D1(t) and D2(t), and by repeating this process, in which the connectivity is switched between RL and RS, the object moves in two-dimensional space. In this control method, the high connectivity r(t) = RL is maintained as long as the object moves toward the target within a tolerance of π/2, since it provides stable motion toward the target. Figure 4 shows an example of the orbit of the object with the control method proposed above. In the figure, the object moves along the horizontal axis, a motion which is not embedded as a prototypical attractor in the network. Figures 5 and 6 show examples where walls, through which the object is not permitted to pass, are set in two-dimensional space like a maze. We have assumed that the object always knows which quadrant (= D1(t)) the target is in at each time step. In our experiments, both fx(s(t)) and fy(s(t)) are taken to be zero if the moving object hits a wall, which means that the object cannot penetrate the wall. It is possible to observe various solutions to the problem of the object approaching the target in a two-dimensional maze.
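The switching rule of equation 4.3 can be sketched in a few lines of code. The following Python sketch is our illustration, not the authors' implementation: the network itself is abstracted as a caller-supplied `step_fn(r)` that returns the motion increment (cx, cy), and the function names and connectivity values are hypothetical.

```python
import math

R_L, R_S = 400, 40  # high connectivity (R_L = N) and low connectivity (R_S << N); illustrative values

def quadrant(dx, dy):
    """Quadrant label n in {1, 2, 3, 4} of a direction vector, where
    quadrant n covers the angles between (n - 1)*pi/2 and n*pi/2
    measured from the x-axis, as used for both D1(t) and D2(t)."""
    a = math.atan2(dy, dx) % (2 * math.pi)
    return min(int(a // (math.pi / 2)) + 1, 4)

def control_step(pos, prev_pos, target, step_fn):
    """One step of the switching rule (equation 4.3): keep the stable
    regime r = R_L while the last motion direction D2 agrees with the
    rough target direction D1; otherwise drop to the chaotic regime R_S."""
    d1 = quadrant(target[0] - pos[0], target[1] - pos[1])
    d2 = quadrant(pos[0] - prev_pos[0], pos[1] - prev_pos[1])
    r = R_L if d1 == d2 else R_S
    cx, cy = step_fn(r)  # motion generated by the network at connectivity r
    return (pos[0] + cx, pos[1] + cy), r
```

Note that the controller never sees the target coordinates themselves, only the quadrant D1, which matches the uncertain external response assumed in the text.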
Figure 4: An example of the object orbit in two-dimensional space with the simple control method from start point (0, 0) to target point (20, 0). Using simple switching between RS and RL , motion in the right direction appears, although that motion is not embedded in the synaptic connection matrix.
5 Result of the Two-Dimensional Maze Problem

To evaluate quantitatively the performance of the proposed control method in solving a two-dimensional maze, we have estimated a success rate over 300 random initial states. The result of the control method is compared to that of a random method (random walk), where a random bit-pattern generator is used instead of chaotic dynamics: the network state s(t) is replaced at every step by a random pattern in the case D1(t − 1) ≠ D2(t − 1), and if D1(t − 1) = D2(t − 1), then r(t) is switched to RL = N. The problem in the actual computer simulation is the maze shown in Figure 6, where 300 random patterns are set as the initial states of the network. The rate of approaching the target within 5000 time steps is estimated as the success rate in Figure 7. The success rate for the same problem with the random walk is 0.01; the success rate of the control method with chaotic dynamics in the network between RS = 40 and RS = 50, however, is significantly higher than that with the random walk. The configuration of {εr;ij} is randomly chosen with the condition Σj εr;ij = r. Differing configurations of {εr;ij} provide a variety of results, even if
Figure 5: An example of the object orbit in two-dimensional space with a wall from start point (0, 0) to target point (20, 0). After colliding with the wall, the object escapes the wall and moves to the target.

Table 1: Mean and Standard Deviation of the Success Rate for RS = 30, 40, 50.

                        RS = 30    RS = 40    RS = 50
    Average              0.084      0.139      0.129
    Standard Deviation   0.130      0.157      0.089
the same initial condition is given. The mean and the standard deviation of the success rate over six different configurations {εr;ij} for RS = 30, 40, 50 are shown in Table 1. Note that this result is typical, although it depends strongly on the configuration {εr;ij} in equation 2.1.

6 Discussion

To show one of the reasons that our model gives better performance than a random walk, let us discuss the hysteresis of the network state s(t) from a dynamical viewpoint. In the proposed control method, the connectivity of the network is set to a low value r(t) = RS when the object collides with a wall in two-dimensional space. Making a detour to approach
Figure 6: An example of the object orbit in two-dimensional space with walls from start point (1, 9) to target point (19, 19).
Figure 7: Three hundred random patterns are set as the initial states of the network. The rate of successfully reaching the target within 5000 time steps is estimated as the success rate. The horizontal axis is RS , and the vertical axis is the success rate. The broken line represents the result of a random search (= 0.01).
the target, the object must move toward a direction that does not include the target. This implies that, with low connectivity RS, the high-dimensional orbits of the network must stay within each attractor basin, defined at full connection, for certain successive time intervals. We hypothesize that the origin of the successful result lies in itinerancy between attractor ruins. To confirm this hypothesis, we estimate the residence time in a certain basin. A distribution p(l) is defined as

    p(l) = #{ l | s(t) ∈ βµ for τ ≤ t ≤ τ + l, s(τ − 1) ∉ βµ, and s(τ + l + 1) ∉ βµ },   (6.1)

    βµ = ∪_{λ=1}^{M} Bµ,λ,   (6.2)

    K = Σ_l l p(l),   (6.3)
where l is the length of successive time steps spent in each embedded basin, and p(l) represents the distribution of l within K steps, which is taken to be K = 10^5 in our actual computer simulation. For a random process, it is known that the distribution p(l) fits an exponential function. Figures 8 (semilog plot) and 9 (log-log plot) show the frequency distribution p(l) for the cases of random walk (R.W.), r = 10, r = 40(a), and r = 40(b), where the two cases r = 40(a) and r = 40(b) have different {εr;ij} configurations. They are calculated in free runs of the object, such as the run shown in Figure 3. The distribution for the random walk (R.W. in the figure) fits a line in the semilog scale. Since r = 10 also fits a line in the semilog scale, highly developed chaos is considered to occur near r = 10. Good performance in solving a two-dimensional maze is not obtained for the random walk or for r = 10 (see the success rates in Figure 7), consistent with the small values of p(l) in the region of large l in these cases. In contrast, no exponential decrease is found for r = 40(a) and r = 40(b) in the semilog scale (see Figure 8). These cases fit a power law well within l = 10 on the log-log scale (see Figure 9), which indicates that the distribution p(l) has a long tail near r = 40 and suggests that the chaotic dynamics generated by the network retains dynamical structure. The case of r = 40 gives a broader distribution in the region of long l than the random walk or r = 10, which is one reason that chaotic dynamics near r = 40 gives superior performance, since the object must generate various motions to make a detour after colliding with a wall. The residence time in a certain basin may thus help the object generate appropriate motions. An unsolved problem still remains.
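Once each network state s(t) has been assigned a basin label, the residence-time distribution of equation 6.1 can be gathered with a short routine. The Python sketch below is our illustration under that assumption; the function name is ours, and a label of None marks states that lie outside every embedded basin.

```python
from collections import Counter

def residence_time_distribution(labels):
    """Histogram p(l) of maximal runs of consecutive steps spent in the
    same embedded basin (equation 6.1); labels[t] is the basin index of
    s(t), or None when the state lies in no embedded basin."""
    p = Counter()
    run = 0
    prev = None
    for lab in labels:
        if lab is not None and lab == prev:
            run += 1  # still inside the same basin
        else:
            if prev is not None and run > 0:
                p[run] += 1  # close the previous basin visit
            run = 1 if lab is not None else 0
        prev = lab
    if prev is not None and run > 0:
        p[run] += 1  # close the final visit
    return p
```

Plotting the resulting counts on semilog and log-log axes reproduces the kind of comparison shown in Figures 8 and 9.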
In spite of the similarity of the distribution p(l) between r = 40(a) and r = 40(b) in Figures 8 and 9, the case with r = 40(a) performs better (success rate: 0.373) than that with r = 40(b)
Figure 8: Semilog plot of the frequency distribution of staying length l. The horizontal axis represents the basin-visiting time length, and the vertical axis is the observed frequency from t = 1 to t = 10^5. r = 10 is the case of highly developed chaos with connectivity r = 10. r = 40(a) is a case with r = 40 that gives good performance for the maze shown in Figure 6. r = 40(b) differs from r = 40(a) in the configuration {εr;ij} and gives poor performance. R.W. means that the chaotic dynamics in the neural network model is replaced with a random number generator.
Figure 9: Log-log plot of the frequency distribution of visiting time length l. Each axis corresponds to the axis in Figure 8.
(success rate: 0.013) in solving the maze in Figure 6. This indicates that the performance depends heavily on the configuration of {εr;ij}. Generally, arbitrarily generated chaos is not always useful in solving a given problem. The functional role of chaotic dynamics is still context dependent, which is an issue for future study.

7 Summary

We proposed a simple method to solve two-dimensional mazes with the use of chaotic dynamics. Generally, arbitrarily generated chaos is not always useful in solving a given problem; however, better results were often observed with the use of chaotic dynamics. This is why we have confined ourselves in this article to comparing results obtained with chaotic dynamics against those obtained with a random pattern generator. It would also be desirable to compare a colored random pattern generator based on reinforcement learning with learning based on chaos; however, we have not yet dealt with this problem, and it remains a future issue. A summary of our results is given below:

• We proposed a motion control method in two-dimensional space with chaotic dynamics in a recurrent neural network.
• Utilizing chaotic dynamics gives better performance than a random walk for solving a two-dimensional maze.
• The distribution p(l) shows that chaotic dynamics leads to longer stays in a certain basin of the network state space than does a random walk.
• We found itinerancy between attractor ruins in the chaotic dynamics.
• Low connectivity gives highly developed chaos, which is almost equivalent to a random walk in terms of the distribution p(l).

References

Aihara, K., Takabe, T., & Toyoda, M. (1990). Chaotic neural networks. Physics Letters A, 144, 333–340.
Babloyantz, A., & Destexhe, A. (1986). Low-dimensional chaos in an instance of epilepsy. Proceedings of the National Academy of Sciences of the United States of America, 83, 3513–3517.
Fujii, H., Itoh, H., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—Theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9, 1303–1350.
Kuroiwa, J., Nara, S., & Aihara, K. (1999). Functional possibility of chaotic behavior in a single chaotic neuron model for dynamical signal processing elements. Proceedings of IEEE International Conference on SMC, 1, 290–295.
Nara, S., & Davis, P. (1992). Chaotic wandering and search in a cycle memory neural network. Progress of Theoretical Physics, 88, 845–855.
Nara, S., & Davis, P. (1997). Learning feature constraints in a chaotic neural memory. Physical Review E, 55, 826–830.
Nara, S., Davis, P., Kawachi, M., & Totsuji, H. (1995). Chaotic memory dynamics in a recurrent neural network with cycle memories embedded by pseudoinverse method. International Journal of Bifurcation and Chaos, 5, 1205–1212.
Nara, S., Davis, P., & Totsuji, H. (1993). Memory search using complex dynamics in a recurrent neural network model. Neural Networks, 6, 963–973.
Skarda, C. A., & Freeman, W. J. (1987). How brains make chaos in order to make sense of the world. Behavioral and Brain Sciences, 10, 161–195.
Suemitsu, Y., & Nara, S. (2003). A note on time delayed effect in a recurrent neural network model. Neural Computing and Applications, 11(3&4), 137–143.
Tsuda, I. (1991). Chaotic itinerancy as a dynamical basis of hermeneutics of brain and mind. World Futures, 32, 167–185.
Tsuda, I. (2001). Towards an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behavioral and Brain Sciences, 24, 793–847.

Received August 12, 2003; accepted March 9, 2004.
LETTER
Communicated by Jerome Friedman
Online Adaptive Decision Trees Jayanta Basak
[email protected] IBM India Research Lab, Indian Institute of Technology, New Delhi–110048, India
Decision trees and neural networks are widely used tools for pattern classification. Decision trees provide a highly localized representation, whereas neural networks provide a distributed but compact representation of the decision space. Decision trees cannot be induced in the online mode, and they are not adaptive to a changing environment, whereas neural networks are inherently capable of online learning and adaptivity. Here we provide a classification scheme called online adaptive decision trees (OADT), which is a tree-structured network like a decision tree and is capable of online learning like a neural network. A new objective measure is derived for supervised learning with OADT. Experimental results validate the effectiveness of the proposed classification scheme. Also, with certain real-life data sets, we find that OADT performs better than two widely used models: the hierarchical mixture of experts and the multilayer perceptron.

1 Introduction

Two major tools for classification (Duda & Hart, 1973; Jain, Duin, & Mao, 2000) are neural networks (Haykin, 1999) and decision trees (Breiman, Friedman, Olshen, & Stone, 1983; Quinlan, 1993). The first, especially feedforward networks, embeds a distributed representation of the inherent decision space, whereas the decision tree always makes a localized representation of the decision. The decision tree (Duda, Hart, & Stork, 2001; Durkin, 1992; Fayyad & Irani, 1992; Friedman, 1991; Breiman et al., 1983; Quinlan, 1993, 1996; Brodley & Utgoff, 1995) is a widely used tool for classification in various domains, such as text mining (Yang & Pedersen, 1997), speech (Riley, 1989; Chien, Huang, & Chen, 2002), bio-informatics (Salzberg, Delcher, Fasman, & Henderson, 1998), web intelligence (Zamir & Etzioni, 1998; Cho, Kim, & Kim, 2002), and many other fields that need to handle large data sets.
Decision trees are in general constructed by recursively partitioning the training data set into subsets such that finally a certain objective error (such as impurity) over the entire data is minimized. Although the very nature of the construction of axis-parallel decision trees always tries to reduce the bias at the cost of variance (which can result in a high generalization error), axis-parallel decision trees are quite popular in terms of their interpretability; each class c 2004 Massachusetts Institute of Technology Neural Computation 16, 1959–1981 (2004)
1960
J. Basak
can be specifically described by a set of rules defined by the path from the root to the leaf nodes of the tree. Variations of decision trees also exist in the literature, such as SVDT (Bennett, Wu, & Auslender, 1998) and oblique decision trees (Murthy, Kasif, & Salzberg, 1994), which take into account the combination of more than one attribute in partitioning the data set at any level (at the cost of interpretability). Decision trees usually perform hard splits of the data set, and if a mistake is committed in splitting the data set at a node, then it cannot be corrected further down the tree. In order to take care of this problem, attempts have been made (Friedman, Kohavi, & Yun, 1996) that in general employ various kinds of look-ahead operators to guess the potential gain in the objective after more than one recursive split of the data set. Second, a decision tree is always constructed in the batch mode, that is, the branching decision at each node of the tree is induced based on the entire set of training data. Different variations have been proposed for an efficient construction mechanism (Utgoff, Berkman, & Clouse, 1997; Mehta, Agrawal, & Rissanen, 1996). Online algorithms (Kalai & Vempala, n.d.; Albers, 1996), on the other hand, sometimes prove to be more useful in the case of streaming data, very large data sets, and situations where memory is limited. Unlike decision trees, neural networks (Haykin, 1999) make decisions based on the activation of all nodes, and therefore even if a node is faulty (or makes mistakes) during training, it does not affect the performance of the network very much. Neural networks also provide a computational framework for online learning and adaptivity. In short, a decision tree can be viewed as a token-passing system: a pattern is treated as a token that follows a path in the tree from the root node to a leaf node, and each decision node behaves as a gating channel, depending on the attribute values of the pattern token.
Neural networks, on the other hand, resemble the nervous system in the sense of distributed representation. A pattern is viewed as a phenomenon of activation, and the activation propagates through the links and nodes of the network to reach the output layer. Some attempts have been made to combine the neural networks with decision trees to obtain different classification models. For example, in the neural tree (Stromberg, ¨ Zrida, & Isaksson, 1991; Golea & Marchand, 1990), multilayer perceptrons are used at each decision node (intermediate node) of the tree for partitioning the data set. Attempts are also made to construct decision trees (Boz, 2000) from trained neural networks, as well as to construct neural networks (Lee, Jone, & Tsai, 1995) from trained decision trees. However, all of these methods attempt to hybridize the concepts borrowed from both these paradigms and learn locally within each node, although none of them addresses the issues of learning in terms of achieving overall classification performance. Possibly the most elegant approach in this line is the hierarchical mixture of experts (HME) (Jordan & Jacobs, 1993, 1994) and the associated learning algorithm. In HME, a tree-structured network (similar to a decision tree)
is viewed as a hierarchy of the mixture of experts, and the outputs of the individual experts are combined selectively through gating networks in a bottom-up manner through the hierarchy such that the root node always represents the final decision. HME provides a general framework for both classification and regression, and a statistical framework, the expectation-maximization (EM) algorithm, has been used for estimating the network parameters in the batch mode. Jordan and Jacobs (1993) demonstrated that HME is a more powerful classification tool when compared with feedforward neural networks and decision trees. Online learning algorithms have also been derived for HME. Here we provide a scheme for online classification that preserves the structure of decision trees and the learning paradigm of neural networks. We call our model an online adaptive decision tree (OADT), which can learn in the online mode and can exhibit adaptive behavior (i.e., adapt to a changing environment). Online adaptive decision trees are complete binary trees with a specified depth where a subset of leaf nodes is assigned to each class, in contrast to HME, where the root node always represents the class. In the online mode, for each training sample, the tree adapts the decision hyperplane at each intermediate node from the response of the leaf nodes and the assigned class label of the sample. Unlike HME, OADT uses the top-down structure of a decision tree; instead of probabilistically weighing the decisions of the experts and combining them, OADT allocates a data point to a set of leaf nodes (ideally one) representative of a class. Since OADT operates in a top-down manner, it requires a different objective error measure. The decision in this case is collectively represented by a set of leaf nodes. We provide a different objective error measure for OADT and subsequently derive the learning rules based on the steepest gradient descent criterion. We organize the rest of the letter as follows.
In section 2, we show how the leaf nodes should respond to a given sample in OADT for a specified structure of the tree. We then formulate an objective error measure to reflect the discrepancy between the collective responses of the leaf nodes and the assigned class label of the training sample. We derive the learning rule by following the gradient descent of the error measure. In section 3, we demonstrate the performance of the OADT on real-life data sets. Finally, we discuss and conclude this letter in sections 4 and 5, respectively.

2 Description of the OADT

An OADT is a complete binary tree of depth l, where l can be specified beforehand. A tree has a set of intermediate nodes D, |D| = 2^l − 1, and a set of leaf nodes L, |L| = 2^l, such that the total number of nodes in the tree is 2^{l+1} − 1. Each intermediate node i in the tree stores a decision hyperplane in the form (wi, θi), a vector lr (the form of lr is explained later in equation 2.4), and a sigmoidal activation function f(·). The weight vector wi is n-dimensional for an n-dimensional input and is constrained as ‖wi‖ = 1
for all i, such that (wi, θi) together have n free variables. As a convention, we number the nodes in OADT in breadth-first order starting from the root node, numbered 1, and this convention is followed throughout this article.

2.1 Activation of OADT. As input, each intermediate node can receive activation from its immediate parent and the input pattern x. The root node does not have any parent, and it can receive only the input pattern. As output, each intermediate node produces two outputs: one for the left child and the other for the right child. According to our convention of numbering the nodes, node i has two children numbered 2i and 2i + 1, where node 2i is the left child of node i and node 2i + 1 is the right child. If u(i) denotes the activation received by node i from its parent, then the activation values received by the left and right children of node i are given as

    u(2i) = u(i) · f(wi · x + θi)   (2.1)

and

    u(2i + 1) = u(i) · f(−(wi · x + θi)).   (2.2)
In other words, node i takes the product of the activation value received from its parent and the activation produced by itself and sends it to the child nodes. The activation values produced by node i itself are different for the two child nodes. The value of u(1) for the root node is considered to be unity. The leaf nodes do not see the input pattern x and reproduce the activation received from the parent. Thus, from equations 2.1 and 2.2, the activation value of a leaf node p is given as

    u(p) = Π_{n=1}^{l} f( (−1)^{mod(⌊p/2^{n−1}⌋, 2⌊p/2^n⌋)} ( w_{⌊p/2^n⌋} · x + θ_{⌊p/2^n⌋} ) ),   (2.3)

that is, the activation of a leaf node is decided by the product of the activations produced by the intermediate nodes on the path from the root node to the concerned leaf node. We define a function lr(i, p) to represent whether the leaf node p is on the left path or the right path of an intermediate node i:

    lr(i, p) = 1    if p is on the left path of i
               −1   if p is on the right path of i
               0    otherwise.   (2.4)

Using the function lr(·), the activation of a leaf node p can be represented as

    u(p) = Π_{i∈Pp} f( lr(i, p)(wi · x + θi) ),   (2.5)
where Pp is the path from the root node to the leaf node p. The function lr(·) is stored in the form of a vector at each intermediate node i (the respective lr(i, :)). In OADT, each intermediate node can observe the input pattern, unlike in conventional neural networks: in feedforward networks, there usually exist hidden nodes that can observe only the input provided by other nodes but cannot observe the actual input pattern. During the supervised training phase of OADT, it is difficult to decide which leaf node should receive the highest activation. However, it is observed that the total activation of the leaf nodes is always constant if the activation function is sigmoidal or of a similar nature.

Claim 1. The total activation of the leaf nodes is constant if the activation function of each intermediate node is sigmoidal in nature.

Proof.
Let us assume one form of the sigmoid function,

    f(v; m) = 1 / (1 + exp(−mv)).   (2.6)

Then

    f(v; m) + f(−v; m) = 1.   (2.7)

This property holds in general for any symmetric sigmoid function (including the S-function used in the literature of fuzzy set theory). For any child node p, the activation is given as

    u(p) = u(⌊p/2⌋) · f( (−1)^{mod(p, 2⌊p/2⌋)} ( w_{⌊p/2⌋} · x + θ_{⌊p/2⌋} ) ).   (2.8)

Thus, for any two leaf nodes p and p + 1 having a common parent node ⌊p/2⌋, we have

    u(p) + u(p + 1) = u(⌊p/2⌋).   (2.9)
Thus, the total activation of the leaf nodes is

    Σ_{p=2^l}^{2^{l+1}−1} u(p) = Σ_{q=2^{l−1}}^{2^l−1} u(q).   (2.10)

By similar reasoning,

    Σ_{q=2^{l−1}}^{2^l−1} u(q) = Σ_{r=2^{l−2}}^{2^{l−1}−1} u(r).   (2.11)
As we go up the tree until we reach the root node, the sum of the activations becomes equal to u(1). Since we consider that u(1) = 1 by default, the total activation of the leaf nodes is

    Σ_p u(p) = 1.   (2.12)
Therefore, decreasing the output of one leaf node automatically enforces an increase in the activation of the other leaf nodes, given that the activation of the root node is constant.

2.2 Activation Function. The activation of each leaf node is given as

    u(p) = Π_{n=1}^{l} f( (−1)^{mod(⌊p/2^{n−1}⌋, 2⌊p/2^n⌋)} ( w_{⌊p/2^n⌋} · x + θ_{⌊p/2^n⌋} ) ),   (2.13)

and the activation function f(·) is

    f(v) = 1 / (1 + exp(−mv)).   (2.14)
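The top-down activation pass of equations 2.1, 2.2, and 2.13 is straightforward to implement. The Python sketch below is our illustration (the article reports a MATLAB implementation); NumPy and all helper names are our assumptions, and nodes are numbered breadth-first from 1 as in the text.

```python
import numpy as np

def sigmoid(v, m):
    """Activation function f(v) = 1 / (1 + exp(-m v)) (equation 2.14)."""
    return 1.0 / (1.0 + np.exp(-m * v))

def leaf_activations(W, theta, x, m):
    """Propagate activation from the root (u(1) = 1) to the leaves using
    u(2i) = u(i) f(w_i.x + theta_i) and u(2i+1) = u(i) f(-(w_i.x + theta_i)).
    W has one unit row per intermediate node i = 1 .. 2**l - 1; the 2**l
    leaf activations (nodes 2**l .. 2**(l+1) - 1) are returned in order."""
    u = {1: 1.0}
    n_int = len(theta)
    for i in range(1, n_int + 1):
        a = sigmoid(W[i - 1] @ x + theta[i - 1], m)
        u[2 * i] = u[i] * a            # left child
        u[2 * i + 1] = u[i] * (1 - a)  # right child: f(-v) = 1 - f(v)
    return np.array([u[p] for p in range(n_int + 1, 2 * n_int + 2)])
```

Because f(v) + f(−v) = 1, the returned leaf activations always sum to u(1) = 1, which is the content of claim 1.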
The value of m is to be selected in such a way that a leaf node representing the true class is able to produce an activation value greater than a certain threshold, that is,

    u(p) ≥ 1 − ε,   (2.15)

where p is the activated leaf node representing the true class. Note that if the condition in equation 2.15 holds, then by claim 1 the total activation of the rest of the leaf nodes is less than ε. For example, ε can be chosen as 0.1. Let us further assume that

    min_i |wi · x + θi| = δ,   (2.16)

that is, δ is a margin between a pattern and the closest decision hyperplane. In that case, the activation of a maximally activated leaf node representing the true class is at least

    u_p(x) ≥ ( 1 / (1 + exp(−mδ)) )^l,   (2.17)

considering that the pattern x satisfies all conditions governed by the hyperplanes wi on the path from the root to the leaf node p,

    (wi · x + θi) lr(i, p) > 0.   (2.18)
In that case, from equation 2.15, we require

    ( 1 / (1 + exp(−mδ)) )^l ≥ 1 − ε,   (2.19)

such that

    l log(1 + exp(−mδ)) ≤ −log(1 − ε).   (2.20)

Assuming exp(−mδ) and ε to be small, we have

    l exp(−mδ) ≤ ε,   (2.21)

that is,

    m ≥ (1/δ) log(l/ε).   (2.22)
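The bound in equation 2.22 is easy to evaluate numerically; the helper below is our sketch, and its name is hypothetical.

```python
import math

def min_slope(depth, eps=0.1, delta=1.0):
    """Smallest sigmoid slope m satisfying m >= (1/delta) * log(depth/eps)
    (equation 2.22), so that the winning leaf can reach activation 1 - eps."""
    return math.log(depth / eps) / delta

# With eps = 0.1 and delta = 1 this reduces to m = log(10 * depth):
m3 = min_slope(3)  # log(30), about 3.40 for a depth-3 tree
```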
For example, if we choose ε = 0.1 and δ = 1, then m = log(10 l).

2.3 Loss Function. For a two-class problem, we partition the leaf nodes into two categories, Ω1 and Ω2, corresponding to the two classes C1 and C2. We assign the odd-numbered leaf nodes to Ω1 and the even-numbered leaf nodes to Ω2, such that for a trained OADT, it is desired that

    (x ∈ C1) and (p is activated) ⇒ p ∈ Ω1
    (x ∈ C2) and (p is activated) ⇒ p ∈ Ω2.   (2.23)
For a supervised learning machine with a single output, if g(x) is the output for an underlying model y(x), then y(x) can be described as (Amari, 1998)

    y(x) = g(x; W) + ε(x),   (2.24)

where W is a matrix defining the set of parameters and ε is a gaussian noise such that

    p(ε) = (1/(√(2π) σ)) exp( −(1/(2σ²)) (y − g(x; W))² ).   (2.25)

The effective loss function derived from the likelihood measure can thus be expressed as

    E = (1/2) Σ_x (y − g(x; W))².   (2.26)
However, in OADT, we do not have any underlying function analogous to y(x) for each leaf node, since it is not known a priori which leaf node will be activated for a given input. In the framework of decision trees, it is difficult to assign a desired activation to any particular leaf node. For a given pattern x, we can say only that one node from a group of nodes should be activated. We define an indicator function

    y(x) = 1    if x ∈ C1
           −1   if x ∈ C2.   (2.27)
We model the noise as

    p(ε) = (1/(√(2π) σ)) exp( −(1/(2σ²)) [ (1 − y) Σ_{j∈Ω1} uj(x; W) + (1 + y) Σ_{j∈Ω2} uj(x; W) ]² ),   (2.28)
where Ω1 and Ω2 are the sets of leaf nodes assigned to class 1 and class 2, respectively. Here, W represents all weight vectors corresponding to all intermediate nodes of OADT. Equation 2.28 tells us that for any sample x ∈ C1, if any node from Ω2 is activated, or vice versa, then there is an error. Considering that Σ_j uj(x; W) = 1, the objective error measure can be expressed as

    E = (1/2) Σ_x [ y(x) − ( Σ_{j∈Ω1} uj(x; W) − Σ_{j∈Ω2} uj(x; W) ) ]²,   (2.29)

since y²(x) = 1 for all x. Considering that Σ_{j∈Ω1} uj(x; W) + Σ_{j∈Ω2} uj(x; W) = 1, the objective error measure is

    E = 2 Σ_x ( Σ_{j∈Ωc} uj(x; W) )²,   (2.30)

where Ωc is the set of leaf nodes assigned to class c, with c chosen as the class to which the sample x does not belong (see section 2.4).
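Given the leaf activations, the per-sample error of equation 2.30 is a one-liner. The Python sketch below is our illustration, using the odd/even leaf assignment of section 2.3 (odd-numbered leaves for class 1, y = +1; even-numbered leaves for class 2, y = −1); the function name is ours.

```python
def oadt_error(leaf_u, y):
    """Per-sample objective (equation 2.30): twice the squared total
    activation of the wrong-class leaves. leaf_u holds the activations of
    leaves 2**l .. 2**(l+1) - 1 in order; odd-numbered leaves belong to
    Omega_1 (class 1, y = +1) and even-numbered leaves to Omega_2 (y = -1)."""
    first = len(leaf_u)  # number of leaves = 2**l = number of the first leaf
    wrong = sum(u for k, u in enumerate(leaf_u)
                if ((first + k) % 2 == 0) == (y > 0))
    return 2.0 * wrong ** 2
```

For example, with leaf activations [0.7, 0.1, 0.1, 0.1] on a depth-2 tree, most of the mass sits on leaf 4 (even-numbered, class 2), so the error is large for a class 1 sample and small for a class 2 sample.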
2.4 Learning Rule. Following steepest gradient descent, we have

    Δwi = −η ( Σ_{j∈Ωc} uj(x; W) ) ( Σ_{j∈Ωc} ∂uj(x; W)/∂wi )
    Δθi = −η ( Σ_{j∈Ωc} uj(x; W) ) ( Σ_{j∈Ωc} ∂uj(x; W)/∂θi ),   (2.31)

where c is either 1 or 2 depending on the sample belonging to class 2 or class 1, respectively. The activation of a leaf node j can in general be expressed as uj(x; W) = v1j · · · vij · · · v_{⌊j/2⌋ j}, where 1, . . . , i, . . . , ⌊j/2⌋ are the nodes on the path from the root to the leaf node j, and

    vij = f( (wi · x + θi) lr(i, j) )   (2.32)
is the contribution of node i to the activation of the leaf node j. Therefore, from equation 2.31,

    Δwi = −ηm ( Σ_{j∈Ωc} uj(x; W) ) ( Σ_{j∈Ωc} uj(x; W)(1 − vij) lr(i, j) ) x
    Δθi = −ηm ( Σ_{j∈Ωc} uj(x; W) ) ( Σ_{j∈Ωc} uj(x; W)(1 − vij) lr(i, j) ).   (2.33)
Since we impose ‖wi‖ = 1 for all i, we have wi · Δwi = 0 for a small η, that is, Δw is normal to w. Further, from equation 2.33, we observe that Δw changes in the direction of x for steepest descent. Taking the projection of x onto the normal hyperplane of w, we get

    Δw ∝ (I − w w^T) x.   (2.34)

The modified learning rule is therefore given as

    Δwi = −ηm Uc zic (I − wi wi^T) x
    Δθi = −ηm Uc zic,   (2.35)

where

    Uc = Σ_{j∈Ωc} uj(x; W)   (2.36)
and

    zic = Σ_{j∈Ωc} uj(x; W)(1 − vij(x)) lr(i, j).   (2.37)
We select η to ensure steepest descent such that the residual error with first-order approximation goes to zero, that is,

    Σ_{j∈Ωc} Δuj(x; W) = −Uc(x).   (2.38)

Expanding the first-order approximation of Δuj, from equation 2.35,

    Δuj = −ηm² Σ_{i∈D} [ Uc zic uj(x; W)(1 − vij(x)) lr(i, j)(1 + ‖x‖² − (wi^T x)²) ],   (2.39)

where D is the set of intermediate nodes. Therefore,

    Σ_j Δuj = −ηm² Uc Σ_{i∈D} zic² (1 + ‖x‖² − (wi^T x)²).   (2.40)

From equation 2.38,

    η = 1 / ( m² Σ_{i∈D} zic² (1 + ‖x‖² − (wi^T x)²) ).   (2.41)
In this analysis, we assumed the same rate for updating w and θ, although w is updated with the additional constraint ‖w‖ = 1. In order to make the rule more flexible, we can have different learning rates, η and λη, for updating w and θ, respectively. The overall learning rule is therefore given as

    Δwi = −α Uc zic (I − wi wi^T) x
    Δθi = −λ α Uc zic,   (2.42)

where

    α = 1 / ( m Σ_{k∈D} zkc² (λ + ‖x‖² − (wk^T x)²) ).   (2.43)

It can be noted here that near the optimal point in the parameter space, the value of zkc becomes very small, and as a result, the steepest descent can take relatively large steps. In order to take care of this problem, we define the denominator of the learning rule as

    α = 1 / ( m [ 1 + Σ_{k∈D} zkc² (λ + ‖x‖² − (wk^T x)²) ] ).   (2.44)
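Putting equations 2.42 to 2.44 together, one online update can be sketched as follows. This Python sketch is our reading of the rule, not the author's implementation: we take the (w^T x)² term inside the sum with the corresponding wk, renormalize each wi after the step to keep ‖wi‖ = 1, and all function and variable names are ours.

```python
import numpy as np

def lr_path(i, p):
    """lr(i, p) of equation 2.4: +1 (-1) if leaf p lies in the left (right)
    subtree of intermediate node i, 0 if i is not an ancestor of p."""
    child = p
    while child > 2 * i + 1:
        child //= 2
    if child == 2 * i:
        return 1
    if child == 2 * i + 1:
        return -1
    return 0

def oadt_update(W, theta, x, y, m, lam=1.0):
    """One step of the online rule (equations 2.42-2.44). W holds unit
    rows for intermediate nodes 1 .. 2**l - 1 (breadth-first); y is in
    {+1, -1}, with odd-numbered leaves encoding class 1. Returns U_c, the
    wrong-class activation before the update; W and theta change in place."""
    n_int = len(theta)
    first_leaf = n_int + 1
    # forward pass (equations 2.1-2.2)
    u = {1: 1.0}
    v = {}
    for i in range(1, n_int + 1):
        a = 1.0 / (1.0 + np.exp(-m * (W[i - 1] @ x + theta[i - 1])))
        v[(i, 1)], v[(i, -1)] = a, 1.0 - a
        u[2 * i], u[2 * i + 1] = u[i] * a, u[i] * (1.0 - a)
    # wrong-class leaves: even-numbered if y = +1, odd-numbered if y = -1
    wrong = [p for p in range(first_leaf, 2 * first_leaf)
             if (p % 2 == 0) == (y > 0)]
    U = sum(u[p] for p in wrong)
    # z_ic (equation 2.37)
    z = np.zeros(n_int)
    for i in range(1, n_int + 1):
        for p in wrong:
            s = lr_path(i, p)
            if s:
                z[i - 1] += u[p] * (1.0 - v[(i, s)]) * s
    # adaptive step size alpha (equation 2.44)
    denom = 1.0 + sum(z[k] ** 2 * (lam + x @ x - (W[k] @ x) ** 2)
                      for k in range(n_int))
    alpha = 1.0 / (m * denom)
    # parameter updates (equation 2.42), keeping each w_i on the unit sphere
    for k in range(n_int):
        w = W[k]
        W[k] = w - alpha * U * z[k] * (x - (w @ x) * w)  # (I - w w^T) x
        W[k] /= np.linalg.norm(W[k])
        theta[k] -= lam * alpha * U * z[k]
    return U
```

Repeated application on a sample drives down the activation of the wrong-class leaves, which is the behavior discussed next in the text.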
It is interesting to observe the behavior of the learning rule in equation 2.42. The changes in the weights are always modulated by the total undesired response of the leaf nodes, Uc, which is the total activation of the leaf nodes not corresponding to the class of the sample (i.e., the nodes of Ωc, with c indexing the other class; see equation 2.36). The factor zic accumulates the contributions of the leaf nodes that are descendants of node i and do not correspond to the class of the sample (see equation 2.37). If such a leaf node j is on the right path of node i, then j contributes negatively to z, which in effect is a positive contribution to the learning rule in equation 2.42. This is due to the fact that if the activation of the left child of a node increases, then that of the right child always decreases (claim 1), and vice versa. Also, looking at the expression for zic in equation 2.37, z is modulated by (1 − v), where v is the contribution of that node to the resultant activation of the leaf nodes. In other words, if a node is highly active, then it is penalized more, that is, its weight parameters are adjusted by a relatively larger step than those of a less active node: a highly active intermediate node is taken to be more responsible for producing the erroneous responses of its descendants. Observing the learning rule in equation 2.42, we find that every intermediate node can adjust its parameter values regardless of other nodes so long as it can observe the activation values of the leaf nodes. This is unlike the class of backpropagation learning algorithms used in feedforward networks, where there is an inherent dependency of the lower-layer nodes on the higher-layer nodes for learning due to error propagation.
However, in OADT learning (see equations 2.42 and 2.44), since z_ic represents the accumulated effect of all descendant leaf nodes of node i, it aggregates the activities of a larger number of leaf nodes as i moves up the hierarchy. Thus, coarse-level changes in the parameter values occur at the higher layers (the root node and the nodes close to it), and the finer details are captured at the lower layers, closer to the leaf nodes.

3 Simulation and Experimental Results

We simulated OADT in MATLAB 5.3 on a Pentium III machine. We experimented with both synthetic and real-life data sets, and here we demonstrate the performance of OADT in both cases.

3.1 Protocols. We trained OADT in the online mode. The entire batch of samples is repeatedly presented to OADT, and we call the number of times the batch is presented the number of epochs. If the size of the data set is large (i.e., the data density is high), OADT is observed to take fewer epochs to converge; for relatively smaller data sets, it takes a larger number of epochs. On average, we found that OADT converges near its local optimal solution
1970
J. Basak
within 10 to 20 epochs, although the required number of epochs increases with the depth of the tree. We report all results for 200 epochs. The performance of OADT depends on the depth l of the tree and the parameter m. We have chosen the parameters and δ (see equation 2.22) as 0.1 and 1, respectively, that is, m = log(10l). Since m is determined by l in our setting, the performance depends solely on l, and we report results for different depths of the tree. We normalize all input patterns such that every component of x lies in the range [−1, +1]. It is not always required to normalize the input within this specified range; however, in that case the parameter λ (see equation 2.42) needs to be chosen properly, although we observed that the performance is not very sensitive to the choice of λ. We chose λ as unity for normalized input values. To test the performance of OADT, we provide an input pattern x and obtain the activations of all leaf nodes; we find the class label corresponding to the maximally activated leaf node and assign that label to the test sample. In all experiments, we assigned the odd-numbered leaf nodes to class 1 and the even-numbered leaf nodes to class 2.

3.2 Real-Life Data. We experimented with different real-life data sets available in the UCI machine learning repository (Merz & Murphy, 1996). We computed the 10-fold cross-validation score for each data set, following exactly the protocols of section 3.1. Table 1 demonstrates the performance of OADT. As a comparison, we report the classification performance obtained with the HME for each data set. We implemented the HME using the software available on the web (Martin, n.d.), which is also cited in the Bayes net toolbox (Murphy, 2001, 2003).
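The test-time protocol of section 3.1 can be sketched in a few lines; the 0-based leaf indexing and the helper names are our conventions, not the paper's.

```python
import numpy as np

def normalize(X):
    """Scale every column of X into [-1, +1], as in section 3.1 (sketch)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / np.maximum(hi - lo, 1e-12) - 1.0

def classify(leaf_activations):
    """Return the class of the maximally activated leaf. With 0-based indexing,
    odd-numbered leaves carry class 1 and even-numbered leaves class 2,
    mirroring the odd/even assignment used in the experiments."""
    j = int(np.argmax(leaf_activations))
    return 1 if j % 2 == 1 else 2
```

For example, `classify([0.1, 0.7, 0.15, 0.05])` picks leaf 1 (odd) and therefore returns class 1.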
We used 100 epochs for HME with the EM algorithm, although it converges within 30 epochs for every data set. Table 1 shows that OADT performs better (sometimes marginally) than HME. We also implemented a multilayer perceptron (MLP) network with different numbers of hidden nodes using the WEKA software package (Garner, 1995; Witten & Frank, 2000) in Java. We observed that both HME and OADT outperform MLP in terms of classification score on each data set. Jordan and Jacobs (1993) likewise reported that HME performs better than MLP in most cases in terms of classification score. Also, for data sets with a highly nonlinear structure (e.g., the liver disease data), MLP often gets stuck in local minima if the parameters (such as the learning rate and the momentum factor) are not properly tuned. Over different experiments, we obtained a maximum score of 57.97% for this data set using an MLP with one hidden layer of 15 nodes; possibly an increase in the number of hidden layers would increase the score. We therefore do not report the classification scores obtained with MLP separately. In addition to comparing with HME using the EM algorithm, we compare the performance of OADT with certain batch-mode algorithms, including a decision tree, k-nearest neighbor, naive Bayes, a support vector machine, and bagged
Table 1: Ten-Fold Cross-Validation Scores of OADT on Four Real-Life Data Sets for Depths 3, 4, and 5.

                              Breast Cancer   Breast Cancer   Liver Disease   Diabetes
Data Description and Model     (Diagnostic)    (Prognostic)      (Bupa)        (Pima)
Number of features                  30              32               6             8
Number of instances                569             198             345           768
HME    Depth = 3                 92.98           73.11           63.42         69.78
       Depth = 4                 92.44           70.17           66.47         68.87
       Depth = 5                 93.15           74.56           65.81         69.78
       Depth = 6                 94.90           71.56           59.14         69.00
       Depth = 7                 94.55           68.72           64.86         71.21
OADT   Depth = 3                 97.37           77.78           66.67         76.69
       Depth = 4                 97.19           76.89           66.24         75.39
       Depth = 5                 96.49           77.78           64.86         74.35
C4.5   Score                     92.44           76.77           66.38         74.09
       |D|                          25              19              51            43
k-NN                             96.13           72.22           64.35         70.44
SVM                              97.19           78.28           70.43         76.69
Naive Bayes                      93.15           66.67           55.94         75.78
Bagging C4.5                     94.73           80.81           72.46         74.87
AdaBoost C4.5                    95.96           75.76           67.54         71.35

Notes: The data sets are available in the UCI machine learning repository (Merz & Murphy, 1996). A comparison with HME is provided for different depths, and comparisons with several batch-mode algorithms are also shown. |D| denotes the number of decision nodes generated by the decision tree C4.5. For the k-NN classifier, the value of k is set automatically by the leave-one-out principle. The SVM uses cubic polynomial kernels. The classifiers k-NN, SVM, naive Bayes, C4.5 (J4.8), and the bagged and boosted trees are available in the WEKA software package.
and boosted decision trees. We implemented all these classifiers using the WEKA software package (Garner, 1995; Witten & Frank, 2000) in Java. For the decision tree, we used C4.5, which is available as J4.8 in the WEKA package. For the k-NN algorithm, the value of k is set automatically by the leave-one-out principle. For the SVM, we chose cubic polynomial kernels. We observe that OADT outperforms the first three of these algorithms in terms of classification score. However, the SVM performs best on two of the data sets, and the bagged trees perform best on the two others, as compared to all other algorithms. In this context, it may be mentioned that OADT learns strictly in the online mode and performs better than its online counterparts; overall, in the paradigm of online learning, OADT performs reasonably well.

3.3 Synthetic Data. It is well known that parity problems are difficult for standard feedforward neural networks. Various investigations (Lavretsky, 2000; Hohil, Liu, & Smith, 1999) are available in the
Table 2: Performance of OADT for the n-Parity Problem with Different Amounts of Gaussian Noise.

                    Recognition Score (%) of OADT of Depth        Bayes Score
Task       SD       2       3       4       5       6            (Theoretical)
2-parity  0.2     97.40   97.90   98.10   98.40   98.10             98.77
          0.3     88.30   88.20   89.50   89.30   88.70             90.90
          0.4     77.90   76.40   79.90   81.30   81.30             81.10
3-parity  0.2     82.40   96.30   97.00   95.90   96.30             98.16
          0.3     70.40   82.10   83.30   81.50   84.40             86.99
          0.4     57.40   67.20   75.00   74.50   71.80             74.53
4-parity  0.2     56.55   71.43   84.23   89.78   96.33             97.56
          0.3     55.16   65.77   79.07   78.67   83.04             83.45
          0.4     53.27   61.90   62.20   66.57   68.75             69.35
5-parity  0.2     62.30   65.73   74.60   83.17   87.70             96.97
          0.3     50.60   56.55   62.40   67.24   71.37             80.26
          0.4     52.92   57.06   58.97   61.59   64.11             65.26

Notes: The dimensionality n varies from 2 to 5, and the standard deviation of the gaussian noise varies from 0.2 to 0.4. It is observed that the OADT performance degrades gracefully with noise.
literature that address solving N-parity problems in connectionist frameworks and that design different architectures for this class of problems. In this section, we show that OADT is also able to handle the gaussian parity problems and provides reasonable classification scores. Table 2 demonstrates the performance of OADT for gaussian parity problems, where we generated 1000 data points in a unit hypercube and perturbed them with gaussian noise. We show the classification results for different sizes of the gaussian kernels, along with the corresponding theoretical upper bound (the Bayes risk; see the appendix). We find that in most cases, OADT performs close to the theoretical optimum. In higher dimensions, the performance falls short of the theoretical bound due to the decrease in data density: for example, in the case of the 5-parity problem, there are 32 corners of the hypercube and only about 31 points on average at each corner. Decision trees are usually not used for such problems, because it is difficult to classify such patterns with decision trees lacking smart look-ahead operators; however, decision trees prove very suitable in real-life scenarios because of their interpretability.

4 Discussion

4.1 Issues of Online Learning. We have used the steepest gradient descent algorithm for training OADT and adjusted the learning rate by observing the error to be reduced at each step. In the updating rule, since we rely only on steepest descent, we do not explicitly incorporate any factor to distinguish differing priors; rather, the weights are updated depending on the frequency of the different class labels. In order to incorporate the probability models, we would need to compute an inverse Hessian matrix, as is done in the online version of the hierarchical mixture of experts (Jordan & Jacobs, 1993, 1994), where the inverse covariance matrix (Ljung & Söderström, 1986) is computed, and in the natural gradient descent algorithm (Amari, 1998; Amari, Park, & Fukumizu, 2000), where the inverse Fisher information matrix is computed. Since it is difficult to derive the inverse of a matrix incorporating the statistical structure in the online mode, a stochastic approximation is made using a decay parameter. However, in this class of algorithms, two learning rate parameters must be adjusted, as described in Amari et al. (2000). In our learning algorithm, no explicit learning rate parameter needs to be defined by the user; only the depth of the tree must be specified before training. Nonetheless, the performance of OADT could possibly be improved by considering smarter gradient learning algorithms, which constitutes a scope for further study. During the training of OADT, we normalized the data set so that each component of the input patterns lies within the range [−1, 1] (see section 3.1). We did this normalization in order to select λ = 1 (see equation 2.42). In the case of a streaming data set, however, it may not be possible to know the maximum and minimum values the attributes can take, and thus normalization may sometimes not be possible, although an educated guess can work in this case. In general, λ can also be selected greater than unity in order to enforce that the step size for the parameter θ is larger than that for w, although it may be difficult to derive any theoretical guideline for selecting λ.
In order to avoid this situation in the general case, the learning in equation 2.42 can be performed in two steps, analogous to the learning presented in Basak (2001): the weight vector w is updated first, and θ is then updated based on the residual error. In the two-step algorithm, however, the responses of the leaf nodes must be computed twice for the same pattern. The two-stage learning algorithm can be formulated as follows. Update w without changing θ, based on the response of the leaf nodes, as

\[
\Delta w_i = -\frac{U_c\, z_{ic}\, (I - w_i w_i^\top)\, x}{m \left( 1 + \sum_{k \in D} z_{kc}^2 \left( \|x\|^2 - (w^\top x)^2 \right) \right)}. \qquad (4.1)
\]

After updating w for all intermediate nodes (which can be performed simultaneously), recompute the activation of the leaf nodes for the same pattern x. From the recomputed activation values of the leaf nodes, update θ as

\[
\Delta \theta_i = -\frac{U_c\, z_{ic}}{m \left( 1 + \sum_{k \in D} z_{kc}^2 \right)}. \qquad (4.2)
\]
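For a depth-1 tree (one intermediate node, two leaves), the two-stage scheme can be sketched as follows. The logistic form of f, the unit-norm constraint on w, and the sign convention for z (errors in a right-descendant leaf entering z negatively, as discussed after equation 2.42) are our assumptions for this single-node illustration.

```python
import numpy as np

def f(a, m):
    return 1.0 / (1.0 + np.exp(-m * a))      # sigmoidal split (assumed logistic form)

def leaf_acts(w, theta, x, m):
    left = f(w @ x + theta, m)
    return left, 1.0 - left                   # left goes up => right goes down (claim 1)

def two_stage_step(w, theta, x, c, m):
    """One two-stage update: w by eq. 4.1, then theta by eq. 4.2 on the
    recomputed leaf responses. Left leaf = class 1, right leaf = class 2."""
    # Stage 1: update w from the current leaf responses.
    left, right = leaf_acts(w, theta, x, m)
    U = right if c == 1 else left             # undesired response U_c
    z = -U if c == 1 else U                   # erroneous right-path leaves enter z negatively
    proj = w @ x
    w = w - U * z * (x - proj * w) / (m * (1.0 + z**2 * (x @ x - proj**2)))
    w = w / np.linalg.norm(w)
    # Stage 2: recompute the leaf responses for the same x, then update theta.
    left, right = leaf_acts(w, theta, x, m)
    U = right if c == 1 else left
    z = -U if c == 1 else U
    theta = theta - U * z / (m * (1.0 + z**2))
    return w, theta
```

Repeated application drives the activation of the correct leaf upward for a fixed training pattern; no λ appears anywhere, as noted in the text below.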
Note that in the two-stage training algorithm for OADT, the parameter λ of equation 2.42 no longer needs to be selected. By its inherent nature, OADT can adapt to a changing environment, since we do not freeze the learning during the training process. Our learning algorithm does not depend on any decreasing learning rate parameter, and therefore the changing characteristics of a data set are reflected in the changes in the parameters of the intermediate nodes. This enables OADT to behave adaptively so long as the depth of the tree does not need to change. We have used an OADT of fixed depth for training; however, the tree can be grown by observing the performance. If the performance of the tree (in terms of the decrease in error rate or some other measure) is not satisfactory at a certain depth, the depth can be increased by creating child nodes for the existing leaf nodes without disturbing the learned weight vectors. The process is analogous to growing neural networks, that is, adding hidden nodes to a feedforward network. However, we have not yet studied the performance of a growing OADT; it constitutes an area for future study. Similarly, for better generalization performance, it may be necessary to prune the tree selectively, which is also an area for further study. The depth of an OADT depends approximately on the number of convex polytopes needed to describe the class structure. Analogous to model selection techniques in neural networks, model selection for OADT can be investigated in the future.

4.2 Behavior of the Model. In Figure 1, we illustrate the different decision regions generated by OADT when trained on a two-class problem. For multidimensional patterns, it is difficult to visualize the regions formed, and therefore we show the behavior of OADT for only two attributes. We generated data from a mixture of bivariate gaussian distributions such that the classification task is highly nonlinear.
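For concreteness, a data set of this kind can be generated as below; the number of components, their means, and the noise level are our assumptions, chosen only to make the two classes nonlinearly separable (the paper does not specify its mixture parameters).

```python
import numpy as np

def make_xor_gaussians(n_per_blob=100, sigma=0.15, seed=0):
    """Two-class, two-attribute gaussian mixture with an XOR-like layout, so that
    no single hyperplane separates the classes (assumed stand-in for Figure 1's data)."""
    rng = np.random.default_rng(seed)
    blobs = [((0.0, 0.0), 1), ((1.0, 1.0), 1), ((0.0, 1.0), 2), ((1.0, 0.0), 2)]
    X, y = [], []
    for (cx, cy), label in blobs:
        X.append(rng.normal([cx, cy], sigma, size=(n_per_blob, 2)))
        y += [label] * n_per_blob
    return np.vstack(X), np.array(y)
```

Calling `make_xor_gaussians()` returns 400 points in two interleaved classes, which a single linear split cannot separate.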
Figure 1 shows the regions obtained from OADT after 50 iterations for depths from one to six. With the increase in the depth of OADT, the complexity of the regions increases. However, we observe that after a certain depth, the complexity of the regions does not change significantly. This indicates an interesting fact: for this particular problem, the complexity of the decision region tends to saturate after a certain depth of OADT, so that even if the depth is increased beyond the desired value, OADT does not overfit, which may provide good generalization capability. We have illustrated the decision space for only two attributes; further investigation is required to understand whether this saturating tendency holds in general. Conventional decision tree learning algorithms are more elegant in inducing axis-parallel trees (Breiman et al., 1983; Quinlan, 1993), where at each decision node a decision is induced based on the value of a particular attribute. An OADT by its very nature induces oblique decision trees, since
Figure 1: The regions obtained by OADT for (a) depth = 1, (b) depth = 2, (c) depth = 3, (d) depth = 4, (e) depth = 5, and (f) depth = 6. Two different class labels are marked as x and o.
if the steepness m of the activation function of the intermediate nodes is increased, the OADT starts behaving like an oblique decision tree; in the limit m → ∞, it is exactly an oblique decision tree. Thus, OADT can be viewed as being as interpretable as oblique decision trees are, although in comparison to the existing axis-parallel decision trees (such as C4.5), OADT falls short in terms of interpretability. In feedforward networks, there are usually hidden nodes, which observe neither the input nor the output; the weight update for each hidden node depends on the error propagated from the output layer. In the case of OADT, each intermediate node can directly compute its weight change from the responses of the leaf nodes. Thus, the weight updates of the intermediate nodes can be made independently and simultaneously, unlike in feedforward networks, where there is an inherent dependency on error propagation through the layers. The memory required to store an OADT is O(d(n + 1)), where d is the number of intermediate nodes and n is the dimension of the input space. For a two-layer perceptron with one output node, the required storage is O(h(n + 1) + h + 1), where h is the number of hidden nodes, considering that all hidden nodes are connected to all input nodes. Since d = 2^l − 1, where l is the depth of the OADT, the storage required for an OADT can exceed that of a feedforward network if d > h. The hyperplane set by one intermediate node in OADT interacts only with those created by the descendants of that node in forming the decision regions, whereas in feedforward networks, all decision hyperplanes formed by the hidden nodes can interact with each other. Thus, feedforward networks can in general provide a more compact representation of the decision region than OADT.

4.3 Multiclass Input Label. So far we have provided an algorithm for learning a two-class problem. Any k-class problem can be handled using k − 1 trained two-class classifiers. However, it is always desirable to have a single classifier able to classify multilabel input data. For a k-class problem, we can modify the network into a k-ary tree such that node i sends an activation g_il to its lth child, l ∈ {1, . . . , k} (compare equations 2.1 and 2.2), where

\[
g_{il}(x) = \frac{f(w_{il}^\top x + \theta_{il})}{\sum_{j=1}^{k} f(w_{ij}^\top x + \theta_{ij})}, \qquad (4.3)
\]
and the activation of the lth child node is u(i) g_il(x). This activation is analogous to that of the gating networks used in the mixture-of-experts framework (Jordan & Jacobs, 1993, 1994), although the purpose here is to make the activation flow from the root node to the leaf nodes. The rest of the learning algorithm can be derived in exactly the same way as in the two-class case, where we assign each leaf node i a class index c = mod(i, k) + 1 if the nodes in the tree are numbered in breadth-first order. However, such networks require extra storage space compared to the two-class case: for depth d, a k-ary OADT has k^d leaf nodes, whereas k − 1 binary OADTs have (k − 1)2^d leaf nodes in total. Since OADT operates in a top-down manner, it has no analogous way of handling multiple class labels with extra output nodes, as in the HME and the MLP.

5 Conclusions

We presented the OADT, a classification method that maintains the structure of a tree and, like neural networks, employs a gradient descent learning algorithm for supervised learning in the online mode. OADT is useful for streaming data and for limited-memory situations where data sets cannot be stored explicitly. OADT is similar to the HME framework in the sense that the latter also employs a tree-structured model. However, in OADT the leaf nodes, rather than the root node, represent the decision, which differs from the HME framework. Thus, in OADT, the decision region is always explicitly formed by the collective activation of a set of leaf nodes. An analog of the difference between OADT and HME can be drawn from the difference between divisive and agglomerative clustering in the unsupervised framework: in the first case, the patterns are successively partitioned down the hierarchy, whereas in the second, they are successively agglomerated into larger clusters up the hierarchy. Since in OADT the decision is represented by the collective activation of the leaf nodes, the objective error measure is different, and therefore new learning rules were derived to train OADT in the online mode. We also provided guidelines for selecting the activation function of each node of an OADT for a given depth, so the performance of our model depends on only one parameter, the depth of the tree. We also observe that as the depth of OADT increases, the complexity of the decision region generated by OADT almost saturates after a certain depth, an indication that the model may not suffer much from overfitting even if the depth is increased gradually. We used steepest gradient descent learning in OADT, as opposed to the EM framework in HME, although it is possible to employ gradient descent online learning in the latter framework as well; our learning is analogous to the class of gradient descent algorithms used in feedforward neural networks. In formulating the learning algorithms for OADT, we do not view the activations of the nodes in terms of posteriors, as is done in the HME, where a multinomial logit probability model (Jordan, 1995) is used to interpret the activations of the nodes as class posteriors. However, analogous to the behavior of feedforward neural networks, where the activations of the output nodes correspond to class posteriors after convergence (Barron, 1993), further investigation can be carried out to interpret the activations of the leaf nodes of OADT in a statistical framework. Also, smarter gradient descent algorithms beyond steepest descent can be incorporated in OADT for improved performance.
With the real-life data sets, we observe that OADT performs better than HME. However, further exploration is required to determine whether OADT provides better scores than HME in general. Further exploration in a statistical framework may also provide better insight into the behavior of OADT and its relationship with HME.

Appendix: Bayes Risk for the Parity Problem

For a two-class (C1 and C2) symmetric gaussian parity problem, the Bayes risk is given by

\[
R = \int_{P(C_2 \mid x) > P(C_1 \mid x)} p(x \mid C_1)\, dx, \qquad (A.1)
\]
where p(x | C_1) = (1/N) \sum_i \mathcal{N}(\mu_i, \Sigma) is the conditional distribution, with the \mu_i representing the corners of a unit n-dimensional hypercube, \Sigma = \sigma^2 I the covariance matrix, and N = 2^{n-1} the number of corners of the hypercube allocated to class C_1.
The conditional distribution of a class is given by

\[
p(x \mid C_1) = \frac{1}{N} \sum_i \mathcal{N}(\mu_i, \Sigma), \qquad (A.2)
\]

where the \mu_i are the corners of the unit hypercube such that, for class C_1,

\[
\operatorname{mod}\Bigl(\sum_j \mu_{ij},\, 2\Bigr) = 0 \qquad (A.3)
\]

for all i, and \Sigma = \sigma^2 I is the covariance matrix. Each attribute is considered to be independent of the others. The Bayes error for patterns generated by the first kernel, at (0, 0, \ldots, 0), when the first component of the pattern exceeds 1/2 is given by

\[
r = \frac{1}{N} \int_{1/2}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x_1^2}{2\sigma^2}}\, dx_1 \times \int_{-\infty}^{1/2} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x_2^2}{2\sigma^2}}\, dx_2 \int_{-\infty}^{1/2} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x_3^2}{2\sigma^2}}\, dx_3 \cdots, \qquad (A.4)
\]
where N = 2^{n-1}. Assuming the independence of the components,

\[
r = \frac{1}{N} \int_{1/2}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}\, dx \left( \int_{-\infty}^{1/2} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}\, dx \right)^{n-1}. \qquad (A.5)
\]
Computing the integrals,

\[
r = \frac{1}{N} \cdot \frac{1}{2^n}\, (1 - K)(1 + K)^{n-1}, \qquad (A.6)
\]
where K = \operatorname{erf}\bigl(\frac{1}{2\sqrt{2}\,\sigma}\bigr). If we consider all possible ways by which one component of a pattern generated from the kernel at (0, 0, \ldots, 0) can exceed the value 1/2, we get the risk

\[
r_1 = \frac{1}{2N^2} \binom{n}{1} (1 - K)(1 + K)^{n-1}. \qquad (A.7)
\]
Similarly, if we consider all possible ways by which three components of a pattern generated from the kernel at (0, 0, \ldots, 0) can exceed the value 1/2, we get the risk

\[
r_2 = \frac{1}{2N^2} \binom{n}{3} (1 - K)^3 (1 + K)^{n-3}, \qquad (A.8)
\]
and so on. Summing up the factors r_1, r_2, \ldots, we get the risk that a pattern generated from the kernel at (0, 0, \ldots, 0) exceeds the value 1/2 in an odd number of components:

\[
r = \frac{1}{2N^2} \left[ \binom{n}{1} (1 - K)(1 + K)^{n-1} + \binom{n}{3} (1 - K)^3 (1 + K)^{n-3} + \cdots \right]. \qquad (A.9)
\]
After simplification,

\[
r = \frac{1}{2N} (1 - K^n). \qquad (A.10)
\]
For the N such kernels, the total Bayes risk is given by

\[
R = \frac{1}{2} \left( 1 - \operatorname{erf}^n\!\left( \frac{1}{2\sqrt{2}\,\sigma} \right) \right). \qquad (A.11)
\]
It can be noted from equation A.11 that with the same variance of the kernel distribution (i.e., the same noise), the Bayes error increases with the dimension.

Acknowledgments

I gratefully acknowledge the anonymous reviewers for their comments, which improved this article considerably.

References

Albers, S. (1996). Competitive online algorithms (Tech. Rep. No. BRICS Lecture Series LS-96-2). Aarhus, Denmark: University of Aarhus.
Amari, S. I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Amari, S. I., Park, H., & Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6), 1399–1409.
Barron, A. R. (1993). Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions on Information Theory, 39, 930–945.
Basak, J. (2001). Learning Hough transform: A neural network model. Neural Computation, 13, 651–676.
Bennett, K. P., Wu, D., & Auslender, L. (1998). On support vector decision trees for database marketing (Tech. Rep. No. RPI Math Report 98-100). Troy, NY: Rensselaer Polytechnic Institute.
Boz, O. (2000). Converting a trained neural network to a decision tree: DecText, decision tree extractor. Unpublished doctoral dissertation, Lehigh University.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1983). Classification and regression trees. New York: Chapman & Hall.
Brodley, C. E., & Utgoff, P. E. (1995). Multivariate decision trees. Machine Learning, 19, 45–77.
Chien, J., Huang, C., & Chen, S. (2002). Compact decision trees with cluster validity for speech recognition. In IEEE Int. Conf. Acoustics, Speech, and Signal Processing (pp. 873–876). Orlando, FL: IEEE Signal Processing Society.
Cho, Y. H., Kim, J. K., & Kim, S. H. (2002). A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications, 23, 329–342.
Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.
Duda, R., Hart, P., & Stork, D. (2001). Pattern classification (2nd ed.). New York: Wiley.
Durkin, J. (1992). Induction via ID3. AI Expert, 7, 48–53.
Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8, 87–102.
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–141.
Friedman, J. H., Kohavi, R., & Yun, Y. (1996). Lazy decision trees. In H. Shrobe & T. Senator (Eds.), Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference (pp. 717–724). Menlo Park, CA: AAAI Press.
Garner, S. (1995). Weka: The Waikato environment for knowledge analysis. In Proc. of the New Zealand Computer Science Research Students Conference (pp. 57–64). Hamilton, New Zealand: University of Waikato.
Golea, M., & Marchand, M. (1990). A growth algorithm for neural network decision trees. Europhysics Letters, 12, 105–110.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Hohil, M. E., Liu, D., & Smith, S. H. (1999). Solving the N-bit parity problem using neural networks. Neural Networks, 12, 1321–1323.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 4–37.
Jordan, M. (1995). Why the logistic function? A tutorial discussion on probabilities and neural networks. Available online at: http://citeseer.nj.nec.com/jordan95why.html.
Jordan, M. I., & Jacobs, R. A. (1993). Hierarchical mixtures of experts and the EM algorithm (Tech. Rep. No. AI Memo 1440). Cambridge, MA: MIT.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Kalai, A., & Vempala, S. (n.d.). Efficient algorithms for online decision. Available online at: http://citeseer.nj.nec.com/585165.html.
Lavretsky, E. (2000). On the exact solution of the parity-N problem using ordered neural networks. Neural Networks, 13, 643–649.
Lee, S.-J., Jone, M.-T., & Tsai, H.-L. (1995). Construction of neural networks from decision trees. Journal of Information Science and Engineering, 11, 391–415.
Ljung, L., & Söderström, T. (1986). Theory and practice of recursive identification. Cambridge, MA: MIT Press.
Martin, D. (n.d.). Hierarchical mixture of experts. Available online at: http://www.cs.berkeley.edu/∼dmartin/software/.
Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. In P. M. G. Apers, M. Bouzeghoub, & G. Gardarin (Eds.), Advances in database technology—EDBT ’96 (pp. 18–32). Avignon, France.
Merz, C. J., & Murphy, P. M. (1996). UCI repository of machine learning databases (Tech. Rep.). Irvine: University of California at Irvine.
Murphy, K. (2001). The Bayes net toolbox for Matlab. Computing Science and Statistics, 33, 1–20.
Murphy, K. (2003). Bayes net toolbox for Matlab. Available online at: http://www.ai.mit.edu/∼murphyk/software/index.html.
Murthy, S. K., Kasif, S., & Salzberg, S. (1994). A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2, 1–32.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco: Morgan Kaufmann.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90.
Riley, M. D. (1989). Some applications of tree based modeling to speech and language indexing. In Proc. DARPA Speech and Natural Language Workshop (pp. 339–352). San Francisco: Morgan Kaufmann.
Salzberg, S., Delcher, A. L., Fasman, K. H., & Henderson, J. (1998). A decision tree system for finding genes in DNA. Journal of Computational Biology, 5, 667–680.
Strömberg, J.-E., Zrida, J., & Isaksson, A. (1991). Neural trees—using neural nets in a tree classifier structure. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 137–140). Toronto: IEEE Signal Processing Society.
Utgoff, P. E., Berkman, N. C., & Clouse, J. A. (1997). Decision tree induction based on efficient tree restructuring. Machine Learning, 29(1), 5–44.
Witten, I. H., & Frank, E. (2000). Data mining: Practical machine learning tools and techniques with Java implementations. San Francisco: Morgan Kaufmann.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In D. H. Fisher (Ed.), Fourteenth Int. Conference on Machine Learning (ICML97) (pp. 412–420). San Francisco: Morgan Kaufmann.
Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In SIGIR ’98: Proceedings of the 21st Annual International Conference on Research and Development in Information Retrieval (pp. 46–54). New York: ACM.

Received August 11, 2003; accepted February 23, 2004.
LETTER
Communicated by Ning Qian
Understanding the Cortical Specialization for Horizontal Disparity
Jenny C. A. Read
[email protected] Bruce G. Cumming
[email protected] Laboratory of Sensorimotor Research, National Eye Institute, National Institutes of Health, Bethesda, MD 20892, U.S.A.
Because the eyes are displaced horizontally, binocular vision is inherently anisotropic. Recent experimental work has uncovered evidence of this anisotropy in primary visual cortex (V1): neurons respond over a wider range of horizontal than vertical disparity, regardless of their orientation tuning. This probably reflects the horizontally elongated distribution of two-dimensional disparity experienced by the visual system, but it conflicts with all existing models of disparity selectivity, in which the relative response range to vertical and horizontal disparities is determined by the preferred orientation. Potentially, this discrepancy could require us to abandon the widely held view that processing in V1 neurons is initially linear. Here, we show that these new experimental data can be reconciled with an initial linear stage; we present two physiologically plausible ways of extending existing models to achieve this. First, we allow neurons to receive input from multiple binocular subunits with different position disparities (previous models have assumed all subunits have identical position and phase disparity). Then we incorporate a form of divisive normalization, which has successfully explained many response properties of V1 neurons but has not previously been incorporated into a model of disparity selectivity. We show that either of these mechanisms decouples disparity tuning from orientation tuning and discuss how the models could be tested experimentally. This represents the first explanation of how the cortical specialization for horizontal disparity may be achieved. 1 Introduction Our remarkable ability to deduce depth from slight differences between the left and right retinal images appears to begin in primary visual cortex (V1), where inputs from the two eyes first converge and where many neurons are sensitive to binocular disparity (Barlow, Blakemore, & Pettigrew, 1967; Nikara, Bishop, & Pettigrew, 1968). 
Neural Computation 16, 1983–2020 (2004) © 2004 Massachusetts Institute of Technology
J. Read and B. Cumming, Understanding Horizontal Disparity Specialization

Since the discovery of these neurons, it
has always seemed obvious that their disparity tuning should reflect their tuning to orientation: a cell should be most sensitive to disparities orthogonal to its preferred orientation. As Figure 1 shows, this is a natural consequence of a linear-oriented filter preceding binocular combination. The fact that disparities in natural viewing are almost always close to horizontal, due to the horizontal displacement of our eyes, led to the expectation that cells tuned to vertical orientations should be the most useful for stereopsis, because such cells would be most sensitive to horizontal disparities (DeAngelis, Ohzawa, & Freeman, 1995a; Gonzalez & Perez, 1998; LeVay & Voigt, 1988; Maske, Yamane, & Bishop, 1986a; Nelson, Kato, & Bishop, 1977; Ohzawa & Freeman, 1986b). However, it has recently been demonstrated that the “obvious” premise was wrong. The response of V1 neurons was probed using random dot stereograms with disparity applied along different axes to obtain the cell’s firing rate as a function of the two-dimensional disparity of the stimulus. The resulting disparity tuning surfaces were generally elongated along the horizontal axis, independent of the cell’s preferred orientation (Cumming, 2002). Paradoxically, this means that V1 neurons are more sensitive to small changes in vertical than in horizontal disparity—precisely the opposite of the expected anisotropy. In this article, we consider why the observed anisotropy might be functionally advantageous for the visual system and then how individual cortical neurons may be wired up to achieve this. We begin, in part A, by estimating the distribution of real two-dimensional disparities encountered in natural viewing. We show that while large vertical disparities do occur, they are extremely rare. The probability distribution of two-dimensional disparity is highly elongated along the horizontal axis, reflecting the much higher probability of large horizontal disparity compared to vertical disparity. 
We argue that the horizontal elongation observed in the disparity tuning of individual neurons may plausibly reflect the horizontal elongation of this probability distribution. This would be functionally useful because in stereopsis, the brain has to solve the correspondence problem: to match up features in the two eyes that correspond to the same object in space. The initial step in this process appears to be the computation, in V1, of something close to a local cross-correlation of the retinal images as a function of disparity (Fleet, Wagner, & Heeger, 1996). The cross-correlation function should peak when it compares points in the retinas that are viewing the same object in space. The problem faced by the brain is that not all peaks in the cross- correlation function correspond to real objects in space; there are a multitude of false matches where the interocular correlation is high by chance, even though no object is present at the corresponding position in space. To distinguish true matches from false, the brain has to consider the image over large scales and use additional constraints, such as the expectation that disparity generally varies smoothly across the image (Julesz, 1971; Marr & Poggio, 1976). The horizontal elongation of the probability distribution for two-dimensional disparity represents another
important constraint. Because correct matches almost always have disparities very close to horizontal, local correlations between retinal positions that are separated vertically are likely to prove false matches (Stevenson & Schor, 1997). This may explain why the disparity-sensitive modulation of V1 neurons is abolished by small departures from zero vertical disparity: this property immediately weeds out a large number of false matches and simplifies the solution of the correspondence problem. Thus, although the global solution of the correspondence problem appears to be achieved after V1 (Cumming & Parker, 1997, 2000), the recently observed anisotropy (Cumming, 2002) may represent important preprocessing performed by V1. We next, in parts B and C, address the puzzling issue of how this functionally useful preprocessing can be achieved by V1 neurons. In all existing models (Ohzawa, DeAngelis, & Freeman, 1990; Qian, 1994; Read, Parker, & Cumming, 2002), disparity selectivity is tightly coupled to orientation tuning, with the direction of maximum disparity sensitivity orthogonal to preferred orientation. This coupling arises as a direct result of orientation tuning at the initial linear stage (see Figure 1). It seems impossible to minimize disparity-sensitive responses to nonzero vertical disparities, independent of orientation tuning. Thus, at present, we have the undesirable situation that the best models of the neuronal computations supporting depth perception conflict with the best evidence that these computations do support depth perception (Uka & DeAngelis, 2002). In this letter, we present two possible ways in which existing models can be simply modified so as to achieve horizontally elongated disparity tuning, regardless of orientation tuning. The first depends on position disparities between individual receptive fields feeding into a V1 binocular cell, the second on a form of divisive normalization.
This demonstrates that the specialization for horizontal disparity exhibited by V1 neurons can be straightforwardly incorporated into existing models and is thus compatible with an initial linear stage.

2 Materials and Methods

2.1 The Probability Distribution of Two-Dimensional Disparity. In part A of the Results (section 3.1), we estimate the probability distribution of two-dimensional disparity encountered by the visual system within the central 15 degrees of vision.

2.1.1 Eye Position Coordinates. We specify eye position using Helmholtz coordinates, since these are particularly convenient for binocular vision (Tweed, 1997a, 1997b). In the Helmholtz system, eye position is specified by a series of rotations away from a reference position, in which the gaze is straight ahead. Starting with the eye in the reference position, we make a rotation through an angle T clockwise (from the eye’s point of view) about the reference gaze direction; then we make a rotation left through
Figure 1: Existing models predict that neurons should be more sensitive to disparities orthogonal to their preferred orientation. (A,B) The receptive fields of a binocular neuron. The left and right eye receptive fields (RFs) are identical, both consisting of a single long ON region oriented at 45 degrees to the horizontal (shown in white) and flanked by OFF regions (black). We consider stimuli with three different disparities. In each case, the left eye’s image falls in the center of the left receptive field (circle in A). The different disparities occur because the right eye’s image falls at different positions for the three stimuli. These are labeled with the numbers 0–2 in B. (C) The disparities of the three stimuli are plotted in disparity space. Stimulus 0 has zero disparity; its images fall in the middle of the central ON region of the receptive field in each eye, and so it elicits a strong response. Stimulus 1 has disparity parallel to the receptive field orientation. Although its image is displaced in the right eye, its images still fall within the ON region in both eyes, so the cell still responds strongly to both stimuli. In consequence, the disparity tuning curve showing response as a function of disparity parallel to receptive field is broad (D, representing a cross-section through the disparity tuning surface along the dashed line). Conversely, stimulus 2 has the same magnitude of disparity, but in a direction orthogonal to the receptive field orientation. Its image in the right eye falls on the OFF flank of the receptive field (B), so the binocular neuron does not respond. This leads to a much narrower disparity tuning curve; the response falls off rapidly as a function of disparity orthogonal to the receptive field (E, representing a cross-section through the disparity tuning surface along the dotted line).
This means that the neuron is more sensitive to disparity orthogonal to its preferred orientation, in the sense that its response falls off more rapidly as a function of disparity in this direction.
an angle H; then we make a rotation down through an angle V. Thus, the reference direction has H = V = T = 0 by definition. In general, the positions of the two eyes have 6 degrees of freedom, expressed by the angles HL, HR, VL, VR, TL, TR. However, we are interested only in the case where both eyes are fixated on a single point in space, so that the gaze lines intersect (possibly at infinity). This means that both eyes must have the same Helmholtz elevation, which we write V (= VL = VR), and positive or zero vergence angle D, defined by D = HR − HL. Specifying the fixation point further constrains HL and HR, leaving just 2 degrees of freedom, TL and TR. These describe the torsion state of the eyes, corresponding to the freedom of each eye to rotate about its line of sight without changing the fixation point. Donders’ law states that whenever the eyes move to look at a particular fixation point, specified by V, HL, and HR, they always adopt the same torsional state. So Helmholtz torsion can be expressed as a function of the Helmholtz elevation and azimuth: TL(V, HL, HR), TR(V, HL, HR). Current experimental evidence (Bruno & Van den Berg, 1997; Mok, Ro, Cadera, Crawford, & Vilis, 1992; Van Rijn & Van den Berg, 1993) suggests that the function TR(V, HL, HR) has the form

\tan\frac{T_R}{2} = \tan\frac{V}{2}\,\frac{\tan[\lambda_R + \mu_R(H_R - H_L)] - \tan\frac{H_R}{2}}{1 - \tan\frac{H_R}{2}\,\tan[\lambda_R + \mu_R(H_R - H_L)]}, \qquad (2.1)
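Equation 2.1 is straightforward to evaluate numerically. The Python sketch below is our illustration, not the paper's own code (the paper's simulations were in Matlab); the function name and degree-based interface are our choices, and the defaults follow the parameter values used in the simulations (µ = 0.2, λ ≈ −2 degrees for the right eye).

```python
import math

def helmholtz_torsion_right(V, HL, HR, lam_R=-2.0, mu_R=0.2):
    """Right eye's Helmholtz torsion T_R (degrees) via equation 2.1.

    V, HL, HR: Helmholtz elevation and azimuths of the two eyes, in degrees.
    lam_R, mu_R: empirical constants (paper uses mu = 0.2, |lambda| = 2 deg).
    """
    d = math.radians
    a = math.tan(d(lam_R + mu_R * (HR - HL)))  # tan[lambda_R + mu_R (H_R - H_L)]
    h = math.tan(d(HR) / 2.0)                  # tan(H_R / 2)
    t_half = math.tan(d(V) / 2.0) * (a - h) / (1.0 - h * a)
    return 2.0 * math.degrees(math.atan(t_half))
```

The left eye's torsion follows by interchanging the L and R arguments; note that, as the formula requires, the torsion vanishes whenever the Helmholtz elevation V is zero.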
The corresponding expression for the left eye’s torsion is obtained by interchanging L and R throughout. Estimates of µ range from 0.12 to 0.43 (Minken & Van Gisbergen, 1996; Mok et al., 1992; Somani, DeSouza, Tweed, & Vilis, 1998; Tweed, 1997b; Van Rijn & Van den Berg, 1993), and estimates of λ are around +2 degrees for the left eye and −2 degrees for the right (Bruno & Van den Berg, 1997). In our simulations, we use µ = 0.2 and |λ| = 2◦.

2.1.2 The Retinal Coordinate System and Two-Dimensional Disparity. Next, we need a coordinate system for describing the position of images on the retinas. We assume that the retinas are hemispherical in shape. We define a Cartesian coordinate system (x, y) on the retina by projecting each retina onto the plane that is tangent to it at the fovea. The coordinates of a point P on the physical retina are given by the point where the line through P and the nodal point of the eye intersects this tangent plane. By definition, the fovea lies at the origin (0,0). We define the x-axis to be the retinal horizon, that is, the intersection of the horizontal plane through the fovea with the retina in its reference position. The y-axis is the retinal vertical meridian: the intersection of the sagittal plane through the fovea with the retina in its reference position. Clearly, these axes change their orientation in space as the eyes move. However, we shall continue to refer to the x- and y-axes as the “horizontal” and “vertical” meridians, with the understanding that this terminology is
based on their orientation when the eyes are in the reference position. We shall define the horizontal and vertical disparity of an object to be the difference between the x- and y-coordinates, respectively, of its images in the two retinas. Expressed in angular terms,
\delta_x = \arctan\frac{x_L}{f} - \arctan\frac{x_R}{f}, \qquad \delta_y = \arctan\frac{y_L}{f} - \arctan\frac{y_R}{f}, \qquad (2.2)
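As a numerical illustration of equation 2.2, here is a minimal Python sketch (our code, not the paper's Matlab implementation; the function name is ours). Retinal positions are in the same length units as f, and the returned disparities are in degrees:

```python
import math

def angular_disparity(xL, yL, xR, yR, f):
    """Horizontal and vertical angular disparity (degrees), equation 2.2.

    (xL, yL), (xR, yR): image positions on the left/right tangent-plane
    retinas; f: distance from the nodal point of the eye to the fovea.
    """
    dx = math.degrees(math.atan(xL / f) - math.atan(xR / f))
    dy = math.degrees(math.atan(yL / f) - math.atan(yR / f))
    return dx, dy
```

Identical image positions in the two eyes give zero disparity, and a horizontal offset of one focal length corresponds to 45 degrees of horizontal disparity.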
where f is the distance from the nodal point of the eye to the fovea.

2.1.3 The Epipolar Line. Given a particular point (xL, yL) in the left retina, the locus of possible matches in the right retina defines the right epipolar line. This depends not only on (xL, yL), but also on the position of the eyes. On our planar retina, the epipolar lines are straight but of finite extent. Two points on the epipolar line are the epipole, which is where the interocular axis intersects the tangent plane of the right retina, and the infinity point, which is the image in the right retina of the point at infinity that projects to (xL, yL) in the left retina. Written as vectors, these are, respectively,

\mathbf{q}_0 = \frac{f}{\tan H_R}\,(-\cos T_R,\ \sin T_R) \qquad (2.3)

and

\mathbf{q}_\infty = \frac{f}{f\cos D - x_L\sin D\cos T_L - y_L\sin D\sin T_L} \times \bigl(x_L[\cos D\cos T_R\cos T_L + \sin T_R\sin T_L] - y_L[\cos D\cos T_R\sin T_L - \sin T_R\cos T_L] + f\sin D\cos T_R,\ {-x_L}[\cos D\sin T_R\cos T_L - \cos T_R\sin T_L] + y_L[\cos D\sin T_R\sin T_L + \cos T_R\cos T_L] - f\sin D\sin T_R\bigr), \qquad (2.4)
where f is the distance from the nodal point of the eye to the fovea. We use bold type to denote a two-dimensional vector on the retina. Not all points on the straight line through q0 and q∞ correspond to possible objects in space (we do not allow objects to be inside or behind the eyes). When the eye is turned inward, the epipolar line runs from the epipole q0 to the infinity point q∞ . When the eye is turned outward, the epipolar line starts at the infinity point. Formally:
If H_R = 0: the right-epipolar line is specified by the vector equation \mathbf{r} = (x_R, y_R) = \mathbf{q}_\infty + \beta\boldsymbol{\varepsilon}, where \beta ranges from 0 to \infty, and

\boldsymbol{\varepsilon} = \frac{2If}{x_L\sin H_L\cos T_L - y_L\sin H_L\sin T_L + f\cos H_L}\,(-\cos H_R\cos T_R,\ \sin T_R),

where I is half the interocular distance. If H_R \neq 0: the right-epipolar line is \mathbf{r} = \mathbf{q}_0 + \beta\boldsymbol{\varepsilon}, where \boldsymbol{\varepsilon} = \mathbf{q}_\infty - \mathbf{q}_0, and

if \frac{\sin H_R}{f\cos D - x_L\sin D\cos T_L + y_L\sin D\sin T_L} > 0, then \beta goes from 0 to 1;
if \frac{\sin H_R}{f\cos D - x_L\sin D\cos T_L + y_L\sin D\sin T_L} < 0, then \beta goes from 1 to \infty. \qquad (2.5)
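The epipole and infinity point of equations 2.3 and 2.4 can be transcribed directly into code. The Python sketch below is our illustration (not the paper's Matlab implementation; function names are ours). A useful sanity check: with zero vergence and zero torsion, the infinity point coincides with the left-retina coordinates, as expected for parallel gaze.

```python
import math

def epipole(f, HR, TR):
    """Epipole q0 in the right retina (equation 2.3); HR, TR in degrees.
    Undefined for HR = 0, when the epipole is at infinity."""
    HR, TR = math.radians(HR), math.radians(TR)
    k = f / math.tan(HR)
    return (-k * math.cos(TR), k * math.sin(TR))

def infinity_point(f, xL, yL, D, TL, TR):
    """Infinity point q_inf in the right retina (equation 2.4); angles in degrees."""
    D, TL, TR = math.radians(D), math.radians(TL), math.radians(TR)
    pref = f / (f * math.cos(D) - xL * math.sin(D) * math.cos(TL)
                - yL * math.sin(D) * math.sin(TL))
    qx = (xL * (math.cos(D) * math.cos(TR) * math.cos(TL) + math.sin(TR) * math.sin(TL))
          - yL * (math.cos(D) * math.cos(TR) * math.sin(TL) - math.sin(TR) * math.cos(TL))
          + f * math.sin(D) * math.cos(TR))
    qy = (-xL * (math.cos(D) * math.sin(TR) * math.cos(TL) - math.cos(TR) * math.sin(TL))
          + yL * (math.cos(D) * math.sin(TR) * math.sin(TL) + math.cos(TR) * math.cos(TL))
          - f * math.sin(D) * math.sin(TR))
    return (pref * qx, pref * qy)
```

Points on the epipolar line are then obtained as r = q0 + β(q_inf − q0), with β restricted as described in equation 2.5.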
2.1.4 Distribution of Horizontal and Vertical Disparities for a Given Eye Posture. For a given eye posture, we sought the distribution of physically possible disparities for all objects whose images fall within an angle 15 degrees of the fovea in both retinas. For example, we consider an object whose image falls on (xL, yL) in the left retina and calculate the epipolar line in the right retina, as described above (see equations 2.3–2.5). We discard all points on the epipolar line that fall further than 15 degrees from the right eye fovea. With equation 2.2, we can then calculate the set of possible two-dimensional disparities (δx, δy) for all objects whose image in the left retina falls at (xL, yL), and whose image on the right retina falls within 15 degrees of the fovea. On disparity axes (δx, δy), this set forms an infinitely narrow line segment, which we convolved with an isotropic two-dimensional gaussian in order to obtain a smooth distribution. We repeated this calculation for a set of 420 points (xL, yL) within the central 15 degree region of the left retina and then redid the calculation the other way around: starting with a set of 420 points in the right retina and finding the epipolar lines in the left retina. The starting points (x, y) were equally spaced in latitude and longitude on the hemispherical retina: x = f tan φ cos θ, y = f tan φ sin θ, where θ covered the range from 0 to 360 degrees in steps of 18 degrees and φ ran from 0 to 15 degrees in steps of 0.75 degree. For each starting point, we obtained another line segment on the disparity axes, and so gradually built up a picture of the distribution of possible disparities for the chosen eye position. The value of the angular disparities (δx, δy) so obtained does not depend on the interocular distance 2I or the eye’s focal length f individually, but only on the ratio I/f, which we took to be 3.
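The grid of starting points just described (20 longitudes × 21 latitudes = 420 points) can be generated as follows. This is an illustrative Python sketch, with our function name and defaults:

```python
import math

def retinal_sample_points(f=1.0, max_ecc=15.0, n_theta=20, n_phi=21):
    """Starting points (x, y) = (f tan(phi) cos(theta), f tan(phi) sin(theta)),
    equally spaced in latitude and longitude on the hemispherical retina:
    theta in 18-degree steps, phi from 0 to 15 degrees in 0.75-degree steps."""
    pts = []
    for i in range(n_theta):
        theta = math.radians(i * 360.0 / n_theta)
        for j in range(n_phi):
            phi = math.radians(j * max_ecc / (n_phi - 1))
            r = f * math.tan(phi)  # tangent-plane eccentricity
            pts.append((r * math.cos(theta), r * math.sin(theta)))
    return pts
```

All points lie within f·tan(15°) of the fovea, matching the central 15 degree region considered in the text.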
2.1.5 Distribution of Eye Postures. Finally, we wanted to find the mean distribution of disparity after averaging over likely eye positions. To do this, we repeated the above procedure for 1000 different randomly chosen eye positions. The eyes’ common Helmholtz elevation V, mean Helmholtz azimuth Hc, and the vergence angle D were drawn from independent gaussian distributions with mean 0 degree (for D, the absolute value was then taken, since D must be positive for fixation). The Helmholtz azimuths of the two eyes are given by HL = Hc − D/2, HR = Hc + D/2. The Helmholtz torsion in each eye, TL and TR, was then set according to equation 2.1. The disparity distribution for this eye position was then calculated as above.

2.2 Models of Disparity Tuning in V1 Neurons. In parts B and C of the Results (section 3), we investigate how existing models of disparity-selective V1 neurons can be modified to produce horizontally elongated disparity tuning surfaces, regardless of preferred orientation. Most of the results presented here use our model of V1 disparity selectivity (Read et al., 2002), which is a modified version of the stereo energy model (Ohzawa et al., 1990; Ohzawa, DeAngelis, & Freeman, 1997). As in the original energy model, disparity-selective cells are built from binocular subunits characterized by a receptive field in the left and right eye (see Figure 2A). Each binocular subunit (BS in Figure 2A) sums its inputs and outputs the square of this sum. However, whereas the original energy model is entirely linear until inputs from the two eyes have been combined, our version postulates that inputs from left and right eyes pass through a threshold before being combined at the binocular subunit. For instance, this could occur if the initial linear operation represented by the receptive field is performed in monocular simple cells (MS in Figure 2A).
Because these cannot output a negative firing rate, they introduce a threshold before inputs from the two eyes are combined. This modification allows better quantitative agreement with the data in a number of respects. It was introduced to account for the weaker disparity tuning observed with anticorrelated stereograms (Cumming & Parker, 1997). It also naturally accounts for disparity-selective cells that do not respond to monocular stimulation in one of the eyes, which are not possible in the original energy model, and explains why the dominant spatial frequency in the Fourier spectrum of the disparity tuning curve is usually lower than the cell’s preferred spatial frequency tuning, even though the energy model predicts that these should be equal (Read & Cumming, 2003b). All the models considered in this article have an initial linear stage in which we calculate the inner product of the retinal image I(x, y) with the receptive field function ρ(x, y) in that eye:

v = \int_{-\infty}^{+\infty} dx \int_{-\infty}^{+\infty} dy\; I(x, y)\,\rho(x, y). \qquad (2.6)
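In a discrete simulation, the double integral of equation 2.6 reduces to a pixelwise sum. The Python sketch below is our illustration (not the paper's Matlab code): it builds a Gabor receptive field with the parameters given in section 2.2.4 (isotropic gaussian envelope of SD 0.2 degree, carrier frequency 2.5 cycles per degree, and the 41 × 41 grid at 30 pixels per degree used in part B); the orientation and phase arguments are free choices of ours.

```python
import math

def gabor_rf(n=41, deg_per_pix=1.0 / 30, sigma=0.2, freq=2.5, theta=0.0, phase=0.0):
    """Gabor receptive field rho(x, y): isotropic gaussian envelope times
    an oriented cosine carrier. Returned as a list of rows."""
    c = (n - 1) / 2.0
    rf = []
    for iy in range(n):
        row = []
        for ix in range(n):
            x, y = (ix - c) * deg_per_pix, (iy - c) * deg_per_pix
            xp = x * math.cos(theta) + y * math.sin(theta)  # coordinate along carrier
            env = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            row.append(env * math.cos(2 * math.pi * freq * xp + phase))
        rf.append(row)
    return rf

def linear_response(image, rf):
    """Discrete version of equation 2.6: v = sum over pixels of I(x, y) rho(x, y)."""
    return sum(i_px * r_px for irow, rrow in zip(image, rf)
               for i_px, r_px in zip(irow, rrow))
```

With the image equal to the receptive field itself (the matched stimulus), the linear response is a sum of squares and hence positive; a blank image gives zero.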
Figure 2: (A) Circuit diagram for our modified version of the energy model (Read et al., 2002). Binocular subunits (BS) are characterized by receptive fields in left and right eyes. The initial processing of the retinal images is linear, but input from left and right eyes is thresholded before being combined together. Here, this thresholding is represented by a layer of monocular simple cells (MS). In our simulations, the receptive field functions are two-dimensional Gabor functions (shown with gray scale: white = ON region, black = OFF region). The dots in the receptive field profiles mark the centers of the receptive fields, which are taken to be the peak of the gaussian envelope. (B) Disparity tuning surface for this binocular subunit. The gray scale indicates the mean firing rate of a simulated binocular subunit to 50,000 random dot patterns with different disparities: white = high firing rate (normalized to 1), gray = baseline firing rate obtained for binocularly uncorrelated patterns; black = low firing rate. The dot shows the difference between the receptive field centers, which is the position disparity of this binocular subunit (zero in this example).
In our modified version of the energy model (Read et al., 2002), the response of a binocular subunit is

B = [\Theta(v_L) + \Theta(v_R)]^2, \qquad (2.7)

where the subscripts L and R indicate left and right eyes, and \Theta denotes a threshold operation: \Theta(v) = (v − q) if v exceeds some threshold level q, and 0 otherwise. In part C of the Results (section 3.3), the threshold q was set to zero (half-wave rectification). Since any given random dot pattern is as likely to occur as its photographic negative, the inner product in equation 2.6 is equally likely to be positive or negative. Thus, a threshold of zero implies that the monocular simple cell fired to only half of all random dot patterns.
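The thresholded binocular combination of equation 2.7 amounts to only a few lines of code. This Python sketch is our illustration (function names are ours, not the paper's); with q = 0 the threshold reduces to the half-wave rectification used in part C of the Results:

```python
def threshold(v, q=0.0):
    """Threshold operation: returns v - q if v exceeds q, and 0 otherwise.
    With q = 0 this is half-wave rectification."""
    return v - q if v > q else 0.0

def binocular_subunit(vL, vR, q=0.0):
    """Equation 2.7: monocular inputs are thresholded, summed, then squared."""
    return (threshold(vL, q) + threshold(vR, q)) ** 2
```

Note that a negative monocular input is silenced rather than subtracted, which is what distinguishes this model from the fully linear combination of the original energy model.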
In part B of the Results, the threshold was raised above zero, such that the monocular simple cell fired to only 30% of all random dot patterns.

2.2.1 Inhibitory Input from One Eye. We also consider the case where input from one eye influences the binocular subunit via an inhibitory synapse. In the expression for the response of the binocular subunit, the inhibitory synapse is represented by a minus sign:

B = [\mathrm{Pos}\{\Theta(v_L) - \Theta(v_R)\}]^2, \qquad (2.8)
where we have explicitly included half-wave rectification (Pos(x) = x if x > 0, and 0 otherwise), since it is now possible for the net input to the binocular subunit to be negative, and it should not fire in this case. A subunit of this type is disparity selective while still being “monocular” in that it does not respond to stimulation in the inhibitory eye alone. Many such cells are observed (Ohzawa & Freeman, 1986a; Poggio & Fischer, 1977; Read & Cumming, 2003a, 2003b).

2.2.2 The Original Energy Model. We also consider the response of the original energy model (Ohzawa et al., 1990). This assumes that each binocular subunit receives bipolar linear input from both eyes, so its response is

B = [v_L + v_R]^2. \qquad (2.9)

This circuitry is sketched in Figure 8A.

2.2.3 Response Normalization. In part C of the Results (section 3.3), we consider the effect of incorporating a form of response normalization. This postulates that monocular inputs are gated by inhibitory zones outside the classical receptive field, before they converge on a binocular subunit (see Figure 3). The response of a binocular subunit is now

B = [\Theta(v_L z_L) + \Theta(v_R z_R)]^2, \qquad (2.10)
where the “gain factor” z describes the total amount of inhibition from the inhibitory zones. z ranges from 1 (inhibitory zones are not activated; output from monocular subunit is allowed through unimpeded) to 0 (inhibitory zones are highly activated; monocular subunit is silenced). To work out how much a particular retinal image activates the inhibitory zones, we calculate the inner product of the retinal image with each inhibitory zone’s weight function w(x, y) (analogous to a receptive field):

u = \int_{-\infty}^{+\infty} dx \int_{-\infty}^{+\infty} dy\; I(x, y)\,w(x, y).
Figure 3: Circuit diagram for a binocular subunit in which the response from a classical receptive field is gated by input from inhibitory regions. As in Figure 2, monocular subunits, each characterized by a receptive field, feed into a binocular subunit. However, now the input from each eye may be suppressed, prior to binocular combination, by activity in horizontally elongated inhibitory regions. Excitatory connections are shown with an arrowhead, and quasi-divisive inhibitory input is shown with a filled circle. In the plots of the inhibitory regions, the gray scale shows several inhibitory zones in a ring around the receptive field in each eye.
The weight functions are gaussians, which ensures that the inhibition is spatially broadband; this is designed to approximate the total input from a large pool of neurons with different spatial frequency tunings. In order to obtain the desired horizontally elongated disparity tuning surfaces, we postulate that the inhibitory zones are elongated along the horizontal axis (the elliptical contours in Figure 10A). This means that each inhibitory zone individually is orientation selective; however, we include multiple inhibitory zones, so arranged that their total effect is independent of orientation. Thus, the inhibition does not alter the orientation tuning of the original receptive field. The amount of inhibition contributed by each zone is a sigmoid function of u. The total inhibition from n zones is

z = \frac{1}{n}\sum_{j=1}^{n}\frac{1}{1 + \exp[(|u_j| - c)/s]}. \qquad (2.11)
The parameters c and s are chosen so that when u is small, the sigmoid function is 1, allowing monocular input to pass unimpeded, but as the magnitude of u increases, it falls to zero, meaning that more and more of the monocular input is blocked. This is qualitatively similar to divisive normalization (Heeger, 1992, 1993). c and s are small relative to the typical activation of each inhibitory zone, so that the sigmoid is less than half for 85% of the random dot patterns. We take the absolute value of u so that the inhibitory zones are activated by both bright and dark features. In the simulation of Figure 10, we used 10 inhibitory zones, which were gaussians with standard deviations 0.31 degree in a horizontal direction and 0.04 degree vertically. The inhibitory zone centers were placed at 0.5 degree from the origin, at polar angles that are integer multiples of 36 degrees. As explained in section 3, these precise values are unimportant. Again, the input from one eye can be purely subtractive inhibition; the output of the binocular subunit is then

B = \{\mathrm{Pos}[\Theta(v_L z_L) - \Theta(v_R z_R)]\}^2. \qquad (2.12)
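The normalization stage of equations 2.10 to 2.12 can be sketched as follows. Again, this is our Python illustration with our function names, not the paper's Matlab code: gain_factor implements the pooled sigmoid of equation 2.11, and normalized_subunit implements the excitatory (equation 2.10) and inhibitory (equation 2.12) combinations.

```python
import math

def gain_factor(us, c, s):
    """Equation 2.11: total inhibition z, averaged over n inhibitory zones
    with activations u_j. Near 1 for weak activation, near 0 for strong."""
    return sum(1.0 / (1.0 + math.exp((abs(u) - c) / s)) for u in us) / len(us)

def normalized_subunit(vL, vR, zL, zR, q=0.0, inhibitory_right=False):
    """Equations 2.10 and 2.12: monocular inputs gated by gain factors,
    thresholded, then combined and squared."""
    def thresh(v):
        return v - q if v > q else 0.0
    if inhibitory_right:
        net = thresh(vL * zL) - thresh(vR * zR)  # equation 2.12: subtractive input
        return max(net, 0.0) ** 2                # Pos[.] half-wave rectification
    return (thresh(vL * zL) + thresh(vR * zR)) ** 2
```

With small zone activations the gain is close to 1 and the subunit behaves like equation 2.7; with strong activations the gain approaches 0 and the monocular input is silenced.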
2.2.4 Receptive Fields. Model receptive fields are two-dimensional Gabor functions with isotropic gaussian envelopes of standard deviation 0.2 degree, and carrier frequency 2.5 cycles per degree. Spatial frequency and orientation tuning are experimentally observed to be similar in the two eyes (Bridge & Cumming, 2001; Ohzawa, DeAngelis, & Freeman, 1996; Read & Cumming, 2003b), so in our simulations, these are the same in both eyes. Left- and right-eye model receptive fields differ only in their position on the retina; none of the models in this article has phase disparity (although it would be easy to incorporate it).

2.2.5 Stimuli. Orientation tuning was assessed using sinusoidal luminance grating stimuli with a spatial frequency of 2.5 cycles per degree. Disparity tuning was assessed using random dot stereograms with dot-density
25%, as in the relevant experimental studies (cf. Cumming, 2002). The model response varies depending on the pattern of dots in the individual random dot stereogram. We therefore show the mean response, averaged over 50,000 random dot patterns generated with different seeds. This simulation was repeated for every combination of horizontal and vertical stimulus disparity on a two-dimensional grid, so as to obtain the model’s disparity tuning surface (analogous to the one-dimensional disparity tuning curve obtained when disparity is applied along a single axis). All simulated disparity tuning surfaces are shown normalized, so that 1 is the largest mean response at any disparity. The random dot patterns used dots of 2 × 2 pixels. The size of the image and model retina was either 41 × 41 pixels representing 1.4◦ × 1.4◦ (30 pixels per simulated degree; see part B of the Results, section 3.2) or 221 × 131 pixels representing 2.2◦ × 1.3◦ (100 pixels per simulated degree; see part C of the Results, section 3.3). For part C, the monocular-suppression model, a higher resolution was needed, since the disparity tuning is suppressed at large disparities, so we need to be able to study the disparities around zero in more detail. All simulations were performed in Matlab 6.1.

3 Results

3.1 A: The Probability Distribution of Two-Dimensional Disparity. Stereopsis appears to be a phenomenon mainly of central vision; stereoscopic thresholds rise sharply for eccentric stimuli. We therefore consider only the central 15 degrees of cyclopean vision (i.e., the region of space projecting within 15 degrees of the fovea in both retinas). Obviously, 30 degrees is an upper bound for the disparity of any object in the central 15 degrees of vision. Most points on the retina will be associated with much smaller disparities; the precise distribution depends on the position of the eyes. To find the distribution for a given eye position, we have to consider each point PL on the left retina in turn.
Figure 4: The probability distribution of horizontal and vertical disparities for a vergence angle of 10 degrees. (A) Helmholtz elevation V = 15 degrees, mean Helmholtz azimuth Hc = 15 degrees. (B) Helmholtz elevation V = 0 degree, mean Helmholtz azimuth Hc = 0 degree. In each case, the outer dotted contour marks the limit of possible disparities (beyond this contour, the probability density is zero). In A, this contour is only approximate: its irregularities reflect the relatively small number of retinal positions investigated. The solid contour marks the median (50% of randomly chosen possible correspondences lie within this iso-probability contour). In A, the SD of the isotropic gaussian used for smoothing was 0.04 degree; in B, it was 0.01 degree. The width of the distribution’s central ridge is limited by this smoothing.

For each PL, we find the set of points PR in the right retina that are possible physical correspondences for PL. This is the epipolar line, and because we are restricting ourselves to central vision, we take only that portion of the epipolar line that lies within 15 degrees of the fovea. This gives us a set of possible disparities that could be associated with the point PL. So every point in the left retina gives us a set of possible disparities. If we repeat this for all points in both retinas, we obtain the set of all the physical correspondences that are possible within the central visual field for this eye position (see section 2). Figure 4A shows the probability distribution of retinal disparity for the eye position specified by a Helmholtz elevation V of 15 degrees, mean Helmholtz azimuth Hc of 15 degrees, and vergence D of 10 degrees. That is, the eyes are converged and looking down and off to the left. Note that the range of the horizontal axis is 10 times wider than that of the vertical axis. The dotted contour line marks the extent of possible physical correspondences. Outside this boundary, the probability density is zero; that is, for the given eye position, there can be no physical correspondences with that disparity. For this eye position, the maximum vertical disparity is 2.3 degrees, but this occurs with a horizontal disparity of 20 degrees, far too large for fusion. If we restrict ourselves to horizontal disparities within Panum’s fusion limit, |δx| < 0.5◦, the maximum vertical disparity is around 1 degree. The solid contour line marks the median; that is, 50% of possible physical correspondences fall within the solid line. Within this boundary, the range of vertical disparities is even smaller. Figure 4A demonstrates that large vertical disparities can occur. But in fact, during natural viewing, it must be relatively uncommon to have the eyes elevated as much as 15 degrees and turned as much as 15 degrees to the side. We flick our eyes around a scene, but we do not usually spend a long time directing our eyes eccentrically; if that became necessary, we would move our head or the object of interest so that it could be viewed with Hc ∼ V ∼ 0◦, where large vertical disparities are even rarer. Figure 4B shows the distribution of retinal disparity for Hc = V = 0◦ with the same vergence as before, D = 10 degrees. Now, the largest vertical disparity is 0.35 degree. As before, the solid contour line encloses 50% of the possible physical correspondences. While this contour extends to horizontal disparities as large as 15 degrees, it never reaches a vertical disparity as large as 0.05 degree. Of course, which physical correspondences actually occur depends on the scene being viewed. But since there are far more possible correspondences with vertical disparities less than 0.1 degree, it seems reasonable to assume that with the given eye position (D = 10 degrees, Hc = V = 0 degree), it would be rare for a natural scene to contain an object with a vertical disparity more than 0.1 degree. In contrast, with this eye position, extremely large horizontal disparities are common. To estimate the incidence of vertical disparities experienced by the visual system, we need to find the disparity distribution averaged not only over retinal position, as in Figure 4, but over all eye positions. For simplicity, we assume that Hc, V, and D are independently and normally distributed with mean zero (for D, since fixation requires a positive vergence angle, we use only the positive half of the gaussian). Figure 5 shows the resulting distributions for two different sets of values for the standard deviations of these distributions. In these plots, we have homed in on the small range of horizontal disparities that can actually be fused rather than the full range of possible horizontal disparities. Note that again the range of the horizontal axis is 20 times that of the vertical. In Figure 5A, gaze direction is assumed
Figure 5: The distribution of horizontal and vertical disparities for normal viewing, averaged over 1000 random eye positions. Helmholtz elevation V and mean Helmholtz azimuth Hc are picked from a gaussian with mean 0 degree and SD either 10 degrees (A) or 20 degrees (B). To obtain vergence D, a random number is picked from a gaussian with mean 0 degree and SD either 3 degrees (A) or 8 degrees (B), and then vergence is set to the absolute value of this number. The resulting vergence distribution has a mean 2.4 degrees (A) or 6.4 degrees (B), and SD 1.8 degrees (A) or 4.8 degrees (B). The contour line marks the median for the section of the probability distribution shown (50% of randomly chosen possible correspondences with |δx| < 0.5 and |δy| < 0.05 lie within this iso-probability contour).
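The half-gaussian vergence sampling described in this caption is easy to check numerically. The sketch below is illustrative Python, not the authors' code (the function name `sample_eye_positions` is ours); it reproduces the stated mean and SD of the vergence distribution for panel A:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def sample_eye_positions(sd_gaze_deg, sd_verg_deg):
    """Sample eye positions as described for Figure 5: Helmholtz elevation V
    and mean azimuth Hc are zero-mean gaussians; vergence D is the absolute
    value of a zero-mean gaussian (a half-gaussian)."""
    V = rng.normal(0.0, sd_gaze_deg, n)
    Hc = rng.normal(0.0, sd_gaze_deg, n)
    D = np.abs(rng.normal(0.0, sd_verg_deg, n))
    return Hc, V, D

# Panel A parameters: gaze SD 10 deg, vergence from a gaussian with SD 3 deg.
Hc, V, D = sample_eye_positions(10.0, 3.0)
print(round(D.mean(), 1), round(D.std(), 1))  # 2.4 1.8, matching the caption
```

This also follows analytically: a half-gaussian built from SD σ has mean σ√(2/π) ≈ 0.80σ and SD σ√(1 − 2/π) ≈ 0.60σ, giving 2.4 and 1.8 degrees for σ = 3 degrees.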
1998
J. Read and B. Cumming
to be fairly tightly distributed about Hc = V = 0 degree (with a standard deviation of 10 degrees), while vergence is generally small (mean 2.4 degrees, SD 1.8 degrees). Figure 5B gives more weight to eccentric and converged gaze (the standard deviation of Hc and of V is 20 degrees, meaning that the person spends 5% of his or her time gazing more than 40◦ off to the side, while the vergence angle D has mean 6.4 degrees, SD 4.8 degrees, meaning that the person spends 20% of the time looking at objects closer than about 34 cm). These distributions are chosen to reflect opposite extremes of behavior, to demonstrate that our conclusions do not depend critically on the assumptions made about the distribution of eye position during normal viewing. The only critical assumption is that binocular eye movements are coordinated such that the gaze lines always intersect. Under both sets of assumptions, the most striking feature in Figure 5 is the extreme narrowness of the disparity distributions in the vertical dimension (especially given the much larger scale on the vertical axis): less than 0.01 degree in both cases. At first sight, this extreme narrowness is hard to reconcile with Figure 4A, where the probability distribution peaks for −0.5◦ < δy < 0◦ . The solution is that the plots in Figure 5 represent the average of many plots like those shown in Figure 4. These all have a rodlike structure, with the orientation of the rod depending on the particular eye posture. But all the rods pass close to zero disparity. Thus, when we average over all eye postures, we end up with a strong peak close to zero disparity. This peak is stronger in Figure 5B, where the gaze and vergence are more variable. The rods are thus more widely dispersed, and we end up with a more pronounced peak at their common intersection. In Figure 5A, the eyes are assumed to stay fairly close to primary position, and so the rods are almost all close to horizontal. 
The averaged distribution is thus more elongated horizontally. The slight bias toward crossed (positive) disparities particularly evident in Figure 5A stems from an obvious geometric reason: the amount of uncrossed disparity, −δx, can never exceed the vergence angle D. (This restriction is visible as a sharp boundary in Figure 4.) No such restriction exists for crossed disparity. We began this discussion by restricting ourselves to central vision—the central 15 degrees around the fovea. If we widen this to the central 45 degrees—essentially the entire retina—then obviously larger disparities, both horizontal and vertical, become possible. However, the shape of the disparity distribution remains highly elongated horizontally (not shown). This analysis has shown that independent of the precise assumptions made, the distribution of two-dimensional disparity experienced by the visual system in natural viewing is highly elongated. Large horizontal disparities are far more likely to be encountered than vertical disparities of the same magnitude. We would therefore expect disparity detectors in V1 to be tuned to a considerably wider range of horizontal than vertical disparities. Following conflicting reports in earlier studies (Barlow et al., 1967; Joshua & Bishop, 1970; von der Heydt, Adorjani, Hanny, & Baumgartner, 1978),
recent experimental evidence shows that this is indeed the case. Cumming (2002) extracted a single preferred two-dimensional disparity for each of 60 neurons and found that the distribution of this preferred disparity was horizontally elongated, with an SD of 0.225 degree in the horizontal direction and only 0.064 degree vertically. This observation is simple to incorporate within existing models: for example, if a cell’s preferred disparity reflects the position disparity between its receptive field locations in the two eyes, then the horizontally elongated distribution means that there is greater scatter in the horizontal than in the vertical location of the two eyes’ receptive fields. However, Cumming also found that the disparity tuning for individual neurons was horizontally elongated, independent of their orientation tuning. While this too makes sense in view of the extreme horizontal elongation of the disparity distribution shown in Figure 5, it presents problems for existing models of individual neurons. In the next section, we explain why and investigate how these models can be altered so as to reconcile them with the data.

In existing physiological models of disparity selectivity, tuning to two-dimensional binocular disparity must reflect the monocular tuning to orientation. This is illustrated in Figure 2, which sketches the circuitry underlying our model of disparity selectivity. In this example, the left and right eye receptive fields are oriented at 45 degrees to the horizontal (see Figure 2A). The disparity tuning surface (see Figure 2B) is approximately the cross-correlation of the two receptive fields (the thresholds prior to binocular combination in our model mean that it is not exactly the cross-correlation, but the approximation is accurate enough to be helpful). It is therefore elongated along the same axis as the receptive field.
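The link between an oriented receptive field and an obliquely elongated tuning surface can be illustrated with a toy autocorrelation. The sketch below uses an illustrative Gabor (45-degree orientation; envelope SD and carrier frequency are our own choices), not the paper's actual model:

```python
import numpy as np

# An obliquely oriented Gabor receptive field (cf. Figure 2A). Grid in degrees.
x = np.linspace(-1.5, 1.5, 129)
X, Y = np.meshgrid(x, x)
theta = np.deg2rad(45.0)
U = X * np.cos(theta) + Y * np.sin(theta)  # axis orthogonal to the bars
rf = np.exp(-(X**2 + Y**2) / (2 * 0.3**2)) * np.cos(2 * np.pi * 2.0 * U)

# For a linear binocular subunit with identical left and right RFs, the
# disparity tuning surface is approximately the RF's autocorrelation.
F = np.fft.fft2(rf)
tuning = np.fft.fftshift(np.fft.ifft2(F * np.conj(F)).real)
tuning /= tuning.max()

# 1D profiles through the peak: along the preferred orientation (the array
# anti-diagonal) and orthogonal to it (the main diagonal).
c = len(x) // 2
idx = np.arange(len(x))
prof_orth = tuning[idx, idx]
prof_par = tuning[idx, idx[::-1]]

def half_width(profile):
    """Steps from the peak until the profile first drops below half maximum."""
    return np.where(profile[c:] < 0.5)[0][0]

print(half_width(prof_par) > half_width(prof_orth))  # True: oblique elongation
```

The response falls off within a few grid steps orthogonal to the preferred orientation (where the carrier oscillates) but persists many times farther along it, which is exactly the elongation described in the text.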
This means that as we move away from the cell’s preferred disparity along an axis orthogonal to the preferred orientation, the response falls off rapidly, whereas it falls off only slowly along an axis parallel to the preferred orientation. In this sense, therefore, the cell is most sensitive to disparities applied orthogonal to its preferred orientation. The challenge is to find a way of making the cell most sensitive to vertical disparity, while keeping its oblique orientation tuning. We consider two possible mechanisms that achieve this.

3.2 B. Multiple Binocular Subunits with Horizontal Position Disparity.

Tuning to nonzero disparity may arise if the left and right eye receptive fields differ in their position on the retina, or in the arrangement of their ON and OFF subregions. In recent years, there has been intense debate about the relative contribution of the two mechanisms, known as position and phase disparity (Anzai, Ohzawa, & Freeman, 1997, 1999; DeAngelis, Ohzawa, & Freeman, 1991; DeAngelis et al., 1995a; Fleet et al., 1996; Ohzawa et al., 1997; Prince, Cumming, & Parker, 2002; Tsao, Conway, & Livingstone, 2003; Zhu & Qian, 1996). Both mechanisms predict a disparity tuning surface elongated along the preferred orientation of the cell, and thus preclude the experimentally observed specialization for horizontal disparity. However,
all previous analysis has assumed that position and phase disparity is the same for all binocular subunits feeding into a given complex cell. Here, we relax this assumption and show that this breaks the relationship between disparity tuning surface and orientation tuning, allowing a specialization for horizontal disparity independent of orientation selectivity. In Figure 2A, the left and right eye receptive fields are identical and occupy corresponding positions on the retina. The receptive fields are represented by Gabor functions; the dot indicates the receptive field center, which is in the same place in each retina. Thus, the disparity tuning surface has its peak at zero disparity (the dot in Figure 2B). If the receptive fields were offset from one another, they would have a position disparity, and the peak of the disparity tuning surface would be offset accordingly. We now consider a complex cell that sums the outputs of several binocular subunits that differ in their horizontal position disparity. Figure 6A shows the centers of the left eye receptive fields for 18 binocular subunits that all feed into the same complex cell. (For comparison, one receptive field is indicated with contour lines.) The gray scale shows the sum of the gaussian envelopes for all 18 subunits, which is almost circularly symmetrical. Thus, no anisotropy is visible when the cell is probed monocularly. The disparity tuning of the complex cell depends on how pairs of monocular receptive fields are wired together into binocular subunits. We choose
to pair together receptive fields that have the same vertical position on the retina but differ in their horizontal position. This means that the binocular subunits all have zero vertical position disparity but differ in their horizontal position disparity. Three examples are shown in Figure 7; the position disparities for these three subunits are indicated with different symbols in Figure 6C. For example, the top subunit in Figure 7, marked with a diamond, has zero position disparity, whereas the bottom subunit in Figure 7, marked with a triangle, has a position disparity of −0.6 degree. Because of the horizontal scatter of the subunits, the disparity tuning surface of the complex cell (see Figure 6C) is elongated along the horizontal axis, even though the monocular receptive field envelope is isotropic (see Figure 6A), and the preferred orientation remains at 45 degrees (see Figure 6B), reflecting the individual receptive fields. This demonstrates that disparity tuning
Figure 6: Facing page. Response properties for a multiple subunit complex cell. (A) Monocular receptive field envelope. The gray scale shows the sum of the gaussian envelopes for all 18 receptive fields in one eye. The dots indicate the center of the receptive fields. The square is the center of the receptive field shown in Figure 2 (marked here with contour lines; solid lines show ON regions and broken lines OFF regions). (B) Orientation tuning curve. This shows the mean response of the complex cell to drifting grating stimuli at the optimal spatial frequency, as a function of the grating’s orientation. The preferred orientation is 45 degrees, reflecting the structure of the individual receptive fields. (C) Disparity tuning surface. This shows the mean response of the complex cell to random dot stereograms as a function of two-dimensional disparity. The disparity tuning surface is clearly elongated along the horizontal axis. The responses have been normalized to one, as indicated with the scale bar. The superimposed dots show the position disparities of the individual binocular subunits; the 18 subunits have 9 different disparities. The circuitry for three subunits is sketched in Figure 7; the position disparity for the three subunits shown there is here indicated with matching symbols. In the top subunit in Figure 7, the receptive fields in the two eyes are identical; this subunit therefore has zero position disparity (the diamond at (0,0)). The middle subunit in Figure 7 has receptive fields at different positions in each retina. It therefore has a horizontal position disparity, marked with a triangle pointing up. The bottom subunit in Figure 7 has an even larger horizontal position disparity, marked with a triangle pointing right. The tuned-excitatory disparity tuning surface shown here is the sum of 18 disparity tuning surfaces like that in Figure 2, but offset from one another horizontally. 
Thus, although the individual tuning surfaces were elongated along an oblique axis, the resulting disparity tuning surface is elongated horizontally. (D) Disparity tuning surface for a tuned-inhibitory complex cell. This was obtained with exactly the same receptive fields and subunits as in C, except that now, one out of each pair of monocular subunits made an inhibitory synapse onto the binocular subunit (see equation 2.8).
Figure 7: Circuit diagram indicating how multiple binocular subunits (BS) can be combined to yield a complex cell whose disparity tuning surface is elongated along the horizontal axis. Our model complex cell receives input from 18 binocular subunits, of which 3 are shown. The gray scale plots on the left show the monocular receptive field functions for monocular simple cells (MS). These are combined into binocular subunits (BS); note that the monocular simple cells apply an output threshold prior to binocular combination. The centers of the receptive field envelopes are shown with the symbols used to identify these disparities in Figure 6. As shown by the positions of these symbols, the three subunits have different horizontal position disparities (0 degree, −0.3 degree, −0.6 degree) and no vertical disparity. The receptive field phase is chosen randomly for each subunit, but within each subunit, it is the same for each eye.
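The effect of pooling subunits that differ only in their horizontal position disparity can be sketched with gaussian stand-ins for the low-pass subunit tuning surfaces (all parameters below are illustrative, not taken from the paper):

```python
import numpy as np

# Disparity grid in degrees.
d = np.linspace(-1.0, 1.0, 101)
DX, DY = np.meshgrid(d, d)
theta = np.deg2rad(45.0)

def subunit_tuning(dx0):
    """Low-pass tuning surface of one subunit: a gaussian elongated along the
    45-deg preferred orientation (a stand-in for Figure 2B), centered on the
    subunit's horizontal position disparity dx0."""
    u = (DX - dx0) * np.cos(theta) + DY * np.sin(theta)   # along the orientation
    v = -(DX - dx0) * np.sin(theta) + DY * np.cos(theta)  # orthogonal to it
    return np.exp(-(u**2 / (2 * 0.15**2) + v**2 / (2 * 0.05**2)))

# The complex cell sums subunits whose position disparities differ only
# horizontally (9 distinct disparities, as in Figure 6C).
complex_tuning = sum(subunit_tuning(o) for o in np.linspace(-0.3, 0.3, 9))

# Second moments of the summed surface along the two disparity axes.
w = complex_tuning
sd_x = np.sqrt((w * DX**2).sum() / w.sum())
sd_y = np.sqrt((w * DY**2).sum() / w.sum())
print(sd_x > sd_y)  # True: horizontally elongated despite oblique subunits
```

The horizontal scatter of the subunit centers adds variance only along the horizontal disparity axis, so the summed surface is horizontally elongated even though every individual subunit is elongated obliquely.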
and orientation tuning can be decoupled in a cell that receives input from multiple subunits with different position disparities. In Figure 7, all the binocular subunits receive excitatory input from the monocular subunits feeding into them, and the left and right receptive fields for each subunit are in phase. This results in a “tuned-excitatory” type of disparity tuning surface (Poggio & Fischer, 1977), in which the cell’s firing rate rises above the baseline level for a range of preferred disparities (see Figure 6C). If for each subunit one eye sends excitatory input and the other inhibitory (Read et al., 2002), then we obtain a “tuned-inhibitory” complex cell, whose firing rate falls below the baseline for a particular range of disparities (see Figure 6D). The position disparity of the individual subunits works in exactly the same way as before, so this cell’s disparity tuning surface is also elongated along the horizontal axis. Thus, this model works for both tuned-excitatory and tuned-inhibitory cells.

3.2.1 Energy Model Subunits Tend to Cancel Out. In the simulations above, we used our modified version of the stereo energy model (Ohzawa et al., 1990) because this is in better quantitative agreement with experimental data than the original energy model (Read & Cumming, 2003b; Read et al., 2002). As we shall see, this modification is also key to the success of the multiple-subunit model shown in Figures 6 and 7. The receptive fields in the example cells in Figures 2 and 7 were chosen to be similar to the type of receptive fields derived from reverse correlation mapping (DeAngelis et al., 1991; DeAngelis, Ohzawa, & Freeman, 1995b; Jones & Palmer, 1987; Ohzawa et al., 1996). They are spatially bandpass, containing several ON and OFF subregions and little power at DC. In the energy model, bandpass receptive fields must yield bandpass disparity tuning curves (i.e., curves with several peaks and troughs).
One problem for the energy model is that multiple peaks are not often seen in experimental disparity tuning curves. These are frequently close to gaussian in shape, with only weak side lobes, even when the spatial frequency tuning is bandpass. That is, real disparity tuning curves seem to be more low-pass than predicted from the energy model. This discrepancy has been noted by various authors (Ohzawa et al., 1997; Prince, Pointon, Cumming, & Parker, 2002), and quantified in detail by us (Read & Cumming, 2003b). Our modification to the energy model, introducing thresholds prior to binocular combination, helps fix this problem by removing side lobes from the disparity tuning curves, making them more low-pass and in better agreement with experimental data (Read & Cumming, 2003b). It turns out that this reduction of the side lobes is also what enables us to construct a horizontally elongated disparity tuning surface by combining multiple subunits with different horizontal position disparities. To see why, consider what happens without the monocular thresholds. Figure 8A shows the circuit diagram of a binocular subunit according to the original energy model; the key feature is that bipolar input from the
Figure 8: (A) Circuit diagram for a single binocular subunit from the original energy model (Ohzawa et al., 1990). In contrast to our modified version (see Figure 2), inputs from the two eyes are combined linearly (though the binocular subunit still applies an output nonlinearity of half-wave rectification and squaring). This results in more bandpass disparity tuning. (B) The side lobes in the disparity tuning surface shown here are much deeper than in Figure 2. This surface has almost no power at DC. As a consequence, when several binocular subunits with different horizontal position disparity combine to form a complex cell, their disparity tuning surfaces cancel out over most of the range (cf. Figure 9A). (C) The disparity tuning surface for a complex cell receiving input from 18 energy model binocular subunits whose position disparities are indicated by the symbols. Instead of being elongated horizontally as in Figure 6C, the surface has two separate regions of oscillatory response, because all but the two end subunits are canceled out by their neighbors. This does not resemble the responses recorded from real cells.
two eyes is combined linearly. The disparity tuning surface for this subunit is shown in Figure 8B. Note that the central peak is flanked by two deep troughs where firing falls well below baseline, so the surface as a whole has very little power at DC (compare Figure 8B to Figure 2B, which shows the equivalent surface for our modified version). As a result, when we add together several copies of such disparity tuning surfaces, most of the power cancels out, leaving a highly unrealistic tuning surface containing two distinct regions of modulation (see Figure 8C). To see why this happens, consider Figure 9A. The thin, broken lines show five hypothetical disparity tuning curves (in one dimension for clarity) that have no power at DC, because their deep side troughs have the same area as their central peak. When the five offset curves are averaged (heavy line), the peak of one subunit is canceled out by the side troughs of the two adjacent subunits, so that the response averages out to the baseline level everywhere except at the two ends, where the subunits have only one neighbor. In Figure 9B, the five disparity tuning curves are gaussians, representing the effect of a
Figure 9: Problems encountered in combining subunits tuned to different disparities. Thin, broken curves: hypothetical disparity tuning curves from five binocular subunits. They are identical in form but offset from one another horizontally. Heavy curve: the mean of the thin curves. (A) These binocular subunits have bandpass disparity tuning (like those obtained from the original energy model). The thin curves have near-zero DC component, since the central peak is nearly equal in area to the two side troughs. When the subunits are combined, their disparity tuning curves cancel out over most of the range, leaving an oscillatory region at each end. This does not resemble real disparity tuning curves. (B) These binocular subunits have low-pass disparity tuning (like those obtained from our modified version of the energy model). Now, combining many subunits tuned to different disparities results in a broadened disparity tuning curve resembling experimental results. However, the disparity tuning is weakened. Whereas the tuning curves for individual subunits (thin lines) have amplitude equal to their baseline, the amplitude for the combination (heavy line) is only 25% of the baseline. This problem could be solved by imposing a final output threshold on the complex cell. If the region of the plot below the dotted line could be removed by an output threshold, then the amplitude of the disparity tuning curve would be a larger fraction of the new baseline.
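The cancellation argument in this figure can be reproduced in a few lines. The curves below are illustrative stand-ins (a zero-DC Ricker wavelet for the bandpass case, a gaussian for the low-pass case), not the model's actual tuning curves:

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 601)   # disparity axis (deg)
offsets = np.linspace(-1.0, 1.0, 5)
baseline = 4.0

def ricker(x, s=0.3):
    """Zero-DC ('bandpass') curve: a central peak flanked by deep troughs."""
    return (1 - (x / s)**2) * np.exp(-x**2 / (2 * s**2))

def gauss(x, s=0.3):
    """Low-pass curve, like those produced by the thresholded model."""
    return np.exp(-x**2 / (2 * s**2))

band = baseline + np.mean([4 * ricker(x - o) for o in offsets], axis=0)
low = baseline + np.mean([4 * gauss(x - o) for o in offsets], axis=0)

mid = np.abs(x) < 0.5  # interior of the range spanned by the five subunits
# Bandpass subunits cancel: the interior stays close to baseline.
print(np.abs(band[mid] - baseline).max() < 0.3)       # True
# Low-pass subunits sum to a broad, above-baseline curve ...
print(low[mid].min() - baseline > 0.5)                # True
# ... but the modulation is much weaker than a single subunit's amplitude (4);
# an output threshold (the dotted line in Figure 9B) can restore the depth.
print(low.max() - baseline < 2.0)                     # True
```

Each subunit's peak lands near its neighbors' troughs in the bandpass case, so only the two end subunits survive; in the low-pass case the peaks reinforce, at the cost of a reduced modulation depth.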
high threshold before binocular combination (Read & Cumming, 2003b). Because there are no inhibitory side lobes, cancellation does not occur, and the resultant disparity tuning curve (heavy curve) is above baseline for an extended range. In our simulations, the threshold was set such that each monocular simple cell fired on average for 3 out of every 10 random-dot patterns; for comparison, this figure would be 5 out of 10 for a threshold at zero (half-wave rectification). This does not entirely remove the side lobes (see Figure 2), but has enough of an effect to prevent cancellation. Thus our modification, introduced for quite different reasons, also enables us to obtain horizontally elongated disparity tuning surfaces by combining multiple subunits with horizontal position disparities. For bandpass receptive
fields, this is not possible in the energy model, although it is possible for receptive fields with a substantial DC response.

3.2.2 An Additional Output Threshold Is Needed. Figure 9B also highlights a problem with this multiple-subunit model as it stands. Summing multiple subunits inevitably reduces the amplitude of the modulation with disparity. In Figure 9B, the individual disparity tuning curves have amplitude equal to the baseline response, while the amplitude of the averaged curve (black) is just 20% of the baseline. This effect is apparent on examining the scale bars in Figure 6: the amplitude of the disparity tuning is only about 5% of the baseline (the response at large disparities, where the random dot patterns in the two eyes are effectively uncorrelated). In an experiment, a cell with such weak modulation would probably not even pass the initial screening for disparity selectivity. This problem can be solved if we postulate that the complex cell applies an output threshold: that is, it fires only when its inputs exceed a certain value (cf. the dotted line in Figure 9B). For the tuned-excitatory model, it would be equally valid to postulate that the necessary threshold is applied by the individual binocular subunits. However, for tuned-inhibitory cells, the threshold must be applied by the complex cell that sums the individual subunits (this is apparent on redrawing Figure 9B with tuned-inhibitory tuning curves).

3.3 C. Monocular Suppression from Horizontally Elongated Inhibitory Zones.

We now turn to an alternative way of decoupling disparity and orientation tuning. A well-known feature of V1 neurons is that they can be inhibited by activity outside their classical receptive field (Cavanaugh, Bair, & Movshon, 2002; Freeman, Ohzawa, & Walker, 2001; Gilbert & Wiesel, 1990; Jones, Wang, & Sillito, 2002; Maffei & Fiorentini, 1976; Sillito, Grieve, Jones, Cudeiro, & Davis, 1995; Walker, Ohzawa, & Freeman, 1999, 2000, 2002).
Phenomena such as end stopping, side stopping, and cross-orientation inhibition have been explained by suggesting that individual V1 neurons are subject to a form of gain control from the rest of the population, for example, by divisive normalization (Carandini, Heeger, & Movshon, 1997; Heeger, 1992, 1993). Yet existing models of disparity selectivity ignore these aspects of V1 neurons’ behavior. We now extend our model (Read et al., 2002) to include these effects. We postulate that inputs from each eye are suppressed by activity from inhibitory zones prior to binocular combination. We shall show that if the individual inhibitory zones are elongated horizontally on the retina, the disparity tuning surface will be horizontally elongated. If there are several inhibitory zones arranged isotropically, the total inhibition is independent of orientation. Thus, this horizontal elongation in disparity tuning is achieved without altering the cell’s orientation selectivity.

3.3.1 Monocular Suppression Can Decouple Disparity and Orientation Tuning. As before, we consider a binocular subunit with identical obliquely
oriented receptive fields in left and right eyes. However, now the monocular inputs are gated by inhibitory zones before being combined in a binocular subunit. This is sketched in Figure 3. If the retinal image does not stimulate the inhibitory zones, the inhibitory synapses in Figure 3 are inactive, and the output of the monocular simple cell is passed to the binocular subunit just as in the previous model. But if the retinal image stimulates the inhibitory zones, the inhibitory synapses become active, and the firing rate of the monocular cell is reduced or even silenced (see section 2 for details). This is very similar to the divisive normalization proposed to explain nonspecific suppression and response saturation (Carandini et al., 1997; Heeger, 1992, 1993; Muller, Metha, Krauskopf, & Lennie, 2003): the firing rate of a fundamentally linear neuron is suppressed by activity in a “normalization pool” of cortical neurons. In the simple model presented here, the horizontal elongation of the individual inhibitory zones, chosen in order to obtain horizontally elongated disparity tuning surfaces, means that they are tuned to horizontal orientation. This is a side effect of the computationally cheap way we have chosen to implement divisive normalization by a pool of neurons: in a full model, each inhibitory zone would represent input from a multitude of sensors tuned to different spatial frequencies and orientations, so that the response of each inhibitory zone individually would be independent of orientation. Equally, it would be possible to construct the surround such that it was sensitive to orientations other than horizontal. So long as the region over which these subunits are summed remains horizontally elongated, it would still produce the desired effect on disparity tuning. 
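One way to see that horizontally elongated zones need not impose an overall horizontal bias is to sum their envelopes around a ring (cf. Figure 10A). The sketch below uses illustrative parameters of our own choosing, not the paper's:

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 201)
X, Y = np.meshgrid(x, x)

def zone(cx, cy, sx=0.35, sy=0.1):
    """A horizontally elongated inhibitory zone (gaussian envelope)."""
    return np.exp(-((X - cx)**2 / (2 * sx**2) + (Y - cy)**2 / (2 * sy**2)))

# Ten zones arranged in a ring around the receptive field, as in Figure 10A.
angles = 2 * np.pi * np.arange(10) / 10
total = sum(zone(0.7 * np.cos(a), 0.7 * np.sin(a)) for a in angles)

# Each zone is 3.5x wider than tall, but the summed envelope is much closer
# to isotropic, so the total suppressive region carries little overall bias.
sd_x = np.sqrt((total * X**2).sum() / total.sum())
sd_y = np.sqrt((total * Y**2).sum() / total.sum())
print(sd_x / sd_y)  # ~1.2 here, versus 3.5 for a single zone
```

The ring contributes an isotropic term to the second moments of the summed envelope, which swamps the anisotropy of the individual zones.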
In this way, it would be possible to construct a model using the same principles that matched the reported orientation selectivity of surround suppression (Cavanaugh et al., 2002; Levitt & Lund, 1997; Muller et al., 2003). However, the simple model presented here is designed to be nonspecific in that the suppression contributed by all the inhibitory zones together is independent of orientation. This means that the suppression in our model affects the disparity tuning while leaving the orientation tuning unchanged. One way of achieving this is to place the inhibitory zones in a ring around the original receptive field, as shown in Figure 10A, so that while the individual inhibitory zones are horizontally elongated, the total inhibitory region is roughly isotropic. Figure 10B shows the orientation tuning curves obtained with no suppression (solid curve) and after including the effect of suppression from the inhibitory end zones (broken curve). Despite the horizontal orientation of each inhibitory zone individually, the cell still responds best to a luminance grating aligned at 45 degrees. The ring pattern is consistent with experimental evidence indicating that suppressive regions are located all around the receptive field (Walker et al., 1999). In fact, as far as disparity tuning to random dot patterns is concerned, similar results are obtained no matter where the inhibitory zones are placed, provided that the overall arrangement is not tuned to orientation. We also ran our simulations with inhibitory end zones placed at the top and bottom of the original recep-
Figure 10: Horizontally elongated inhibitory end zones need not alter the orientation tuning. (A) Receptive field and inhibitory end zones. The gray scale shows the receptive field—a Gabor function in which a central ON region (white) is flanked by two OFF regions (black). The contours show the ten inhibitory end zones. Each inhibitory zone is a gaussian, and the contour marks the half-maximum. (B) Orientation tuning curve for the model with (dashed line) and without (solid line) gating by the inhibitory zones. This demonstrates that the horizontally elongated inhibitory zones do not alter the orientation tuning of the cell: it still responds best to a grating oriented at 45 degrees.
tive field (not shown), and obtained essentially the same results. Similarly, the gaps between the inhibitory zones in Figure 10 are not important. We chose a sparse arrangement of inhibitory zones so that the location of the individual zones would be clear; in fact, the same results are obtained with a much denser array of overlapping inhibitory zones forming a complete ring around the original receptive field (not shown). Although the inhibitory zones have little effect on orientation tuning, they have a profound effect on the two-dimensional disparity tuning observed with random dot stereograms. Figure 11A shows the disparity tuning surface that would be obtained from this binocular subunit in the absence of inhibitory zones. It is, as expected, elongated along an oblique axis corresponding to the orientation tuning. Figures 11B and 11C show the disparity tuning surface after incorporating inhibition from the ten inhibitory zones (in Figure 11B, for the same range of disparities as in Figure 11A, and in Figure 11C, focusing in on a smaller range of disparities). Comparing Figures 11A and 11B, it is apparent that the original disparity tuning surface has been greatly reduced. Suppression from the inhibitory zones has removed or weakened the cell’s responses to most disparities away from the preferred disparity (zero, in this example), especially to disparities with
Figure 11: Monocular suppression by horizontally oriented inhibitory regions can result in a horizontally elongated disparity tuning surface. (Top row) Disparity tuning surface with and without monocular suppression. (A) Disparity tuning surface for a single binocular subunit without any suppression from inhibitory zones, as sketched in Figure 2 (equation 2.7, with the threshold set at zero so that the output nonlinearity represents half-wave rectification). (B, C) Disparity tuning surface for the same binocular subunit after incorporating monocular suppression from inhibitory zones prior to binocular combination, as sketched in Figure 3 (see equation 2.10; again, the threshold is set at zero so that the output nonlinearity represents half-wave rectification). At large scales (B), some trace of the original oblique structure remains, but in the central region where sensitivity to disparity is strongest (shown expanded in C), the structure is horizontal. (Bottom row) Correlation between left and right eye inputs as a function of disparity. (D) Correlation between output of classical receptive fields in left and right eyes (see equation 2.6). (E) Correlation between total inhibition from end zones in left and right eyes (see equation 2.11). (F) Correlation between output of classical receptive fields after inhibition from end zones.
significant vertical components. However, the response to horizontal disparities has been relatively spared. This is especially clear in Figure 11C, where we focus in on the smallest disparities. What remains of the disparity tuning surface is now elongated along a near-horizontal axis.

3.3.2 The Monocular Suppression Vetoes Vertical-Disparity Matches.

The inhibitory zones effectively veto local correlations with vertical disparities. Functionally, this would be useful since such correlations are likely to be false matches, which will only hamper a solution of the correspondence problem. Yet at first sight, it is hard to understand how this can occur. In
J. Read and B. Cumming
our model, the inhibition is purely monocular: the inhibitory zones receive input from only one eye and suppress the response only in that eye. How, then, can they detect interocular correlation for vertical disparities in order to suppress it?

To understand this, we first note that the output of an ordinary binocular subunit (without gating from inhibitory zones) depends on the correlation between the terms from the left and right eyes, which depends on disparity. If we plot the correlation between the inputs from the left and right eyes, before normalization by the inhibitory zones, as a function of horizontal and vertical disparity, the resulting correlation surface has the same orientation as the receptive fields (see Figure 11D). Similarly, the correlation between the suppressive input from the inhibitory zones in the left and right eyes is oriented along the same axis as the inhibitory zones, that is, horizontally (see Figure 11E). It was to achieve this result that we made the inhibitory zones horizontally elongated. Regions of weaker correlation at disparities corresponding to the separation between nearby inhibitory zones are also visible.

Finally, Figure 11F homes in on the same small range of disparities as in Figure 11C and shows the correlation between the left and right eye inputs after suppression from the inhibitory zones. For these to be tightly correlated, we need strong correlation both between the left and right receptive field inputs and between the left and right eye inhibition. Thus, Figure 11F can be thought of as the product of Figures 11D and 11E. For disparities with a significant vertical component in Figure 11F, the lack of correlation in the inhibitory zone responses (see Figure 11E) has largely canceled out the correlation between the inputs from the receptive fields (see Figure 11D). The original obliquely oriented correlation function has been weakened, leaving a band of high correlation for small horizontal disparities.
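The product relationship between the two correlation surfaces can be illustrated with a toy calculation. The sketch below is our own illustration (not the authors' simulation code): the receptive-field and inhibitory-zone correlation surfaces are idealized as elongated Gaussians with arbitrary parameter values, and their product is elongated horizontally even though the receptive-field term is oblique.

```python
import numpy as np

def gauss2(dx, dy, sig_u, sig_v, theta):
    """Elongated 2D Gaussian: sig_u along the axis at angle theta, sig_v across it."""
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    return np.exp(-0.5 * (u / sig_u) ** 2 - 0.5 * (v / sig_v) ** 2)

dx, dy = np.meshgrid(np.linspace(-2, 2, 201), np.linspace(-2, 2, 201))

# C_rf: receptive-field correlation, elongated along an oblique (45 deg) axis,
# standing in for Figure 11D
C_rf = gauss2(dx, dy, sig_u=1.5, sig_v=0.3, theta=np.pi / 4)
# C_inh: inhibitory-zone correlation, elongated horizontally (cf. Figure 11E)
C_inh = gauss2(dx, dy, sig_u=1.5, sig_v=0.3, theta=0.0)
# Combined correlation, approximately the product (cf. Figure 11F):
# oblique structure is suppressed, leaving a horizontal band near zero disparity
C = C_rf * C_inh
```

Because the oblique term treats a small horizontal and a small vertical disparity identically, any horizontal elongation in the product comes entirely from the inhibitory-zone term, which is the point of the argument in the text.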
This horizontally elongated correlation translates into the horizontally elongated disparity tuning surface we saw earlier (see Figure 11C). Figure 12 shows that this works for tuned-inhibitory cells (see equation 2.12) too. Thus, end stopping can decouple disparity tuning from orientation tuning, for both tuned-excitatory and tuned-inhibitory cells. This basic idea has been suggested before (Freeman & Ohzawa, 1990; Maske et al., 1986a; Maske, Yamane, & Bishop, 1986b), but no one has previously developed a quantitative model and examined its properties.

4 Discussion

Because our eyes are displaced horizontally, most disparities we encounter are horizontal. We have quantified this by producing estimates of the probability distribution of two-dimensional disparity encountered during natural viewing, making plausible assumptions for the distribution of eye posture. The results are independent of the precise assumptions used and show clearly that the probability distribution of experienced disparity is highly
Figure 12: Monocular suppression by horizontally oriented inhibitory regions can also yield a tuned-inhibitory disparity tuning surface. As for Figures 11A to 11C, but for a model incorporating a tuned-inhibitory synapse, so that the disparity tuning is of the tuned-inhibitory type.
elongated along the horizontal axis. Psychophysically, this anisotropy mirrors our different sensitivities to horizontal and vertical disparity (Stevenson & Schor, 1997). Physiologically, it is reflected in V1 at two different levels. At the population level, there is a wider scatter in the horizontal than in the vertical component of preferred disparities (Barlow et al., 1967; Cumming, 2002; DeAngelis et al., 1991). At the level of individual cells, disparity tuning surfaces are usually elongated along the horizontal axis, regardless of orientation tuning (Cumming, 2002).

While the specialization at the population level was expected and is straightforward to incorporate into existing models, the specialization found in individual cells is at first sight extremely surprising. It is not only incompatible with all existing models of the response properties of disparity-tuned cells; it is the exact opposite of the behavior previously predicted to be functionally useful for stereopsis. Many workers have argued that cells mediating stereopsis should be most sensitive to horizontal disparity; their firing rate should change more steeply as a function of horizontal than of vertical disparity (DeAngelis et al., 1995a; Gonzalez & Perez, 1998; LeVay & Voigt, 1988; Maske et al., 1986a; Nelson et al., 1977; Ohzawa & Freeman, 1986b). Given the assumption that the direction of maximum disparity sensitivity would be orthogonal to the cell's preferred orientation, this led to the expectation that cells tuned to vertical orientations should be most useful for stereopsis, since these were expected to be the most sensitive to horizontal disparity. It was therefore puzzling why real disparity-tuned cells occur with the full range of preferred orientations, with no obvious tendency to prefer vertical orientations.
The recent discovery that the direction of maximum disparity sensitivity is in fact independent of orientation tuning (Cumming, 2002) would have satisfactorily solved this puzzle if it had not been for the fact that the direction of maximum disparity sensitivity was found to be vertical rather than horizontal. From the conventional point of view, this anisotropy of individual V1 neurons looks more like a specialization for
vertical than for horizontal disparity. It is important, therefore, before we discuss how we have been able to reproduce the horizontally elongated disparity tuning surfaces of V1 neurons, to consider why this anisotropy occurs. Armed with hindsight and recent experimental evidence, we shall argue that the previous expectations were flawed and attempt to suggest plausible reasons that the observed anisotropy is in fact a useful specialization for horizontal disparity.

The previous expectation of vertically elongated disparity tuning surfaces was based on the fact that these are most sensitive to horizontal disparity. Yet given that stereo information is usually thought to be represented in the activity of a whole population of sensors tuned to different disparities, this is not particularly relevant. The sensitivity with which the population encodes horizontal disparity is not limited by the sensitivity of the individual sensors. It is not possible, therefore, to deduce the shape of the individual disparity tuning surfaces from the need for sensitivity to horizontal disparity. A population of sensors with either horizontally elongated or vertically elongated surfaces would be equally capable of achieving high sensitivity to horizontal disparity.

Imagine a population of neurons whose disparity tuning surfaces are isotropic (equally sensitive to vertical and horizontal disparity). If the scatter of center positions (preferred disparities) is equal in the vertical and horizontal directions, the population carries equivalent information concerning the two disparity directions. The population response could be made more sensitive to horizontal than vertical disparity by reducing the scatter in the horizontal direction. However, the population would show such sensitivity over a narrower range of horizontal than vertical disparities.
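The scatter argument can be illustrated with a toy one-dimensional population. The sketch below is our own illustration with arbitrary parameters (not the authors' analysis); it uses the summed squared slopes of Gaussian tuning curves as a crude proxy for population sensitivity to a given disparity:

```python
import numpy as np

def pop_sensitivity(d, centers, sigma=0.3):
    """Summed squared tuning-curve slopes across a population of Gaussian
    disparity detectors: a rough proxy for how sensitively the population
    as a whole encodes each disparity in d."""
    f = np.exp(-0.5 * ((d[:, None] - centers[None, :]) / sigma) ** 2)
    slope = f * (centers[None, :] - d[:, None]) / sigma ** 2
    return (slope ** 2).sum(axis=1)

d = np.linspace(-3.0, 3.0, 601)
# same number of detectors, different scatter of preferred disparities
wide = pop_sensitivity(d, np.linspace(-2.0, 2.0, 21))    # broad scatter
narrow = pop_sensitivity(d, np.linspace(-0.5, 0.5, 21))  # tight scatter

# tight scatter: higher peak sensitivity near zero disparity,
# but informative over a much narrower range of disparities
```

Concentrating the preferred disparities raises the sensitivity where the detectors cluster while leaving disparities outside that cluster poorly encoded, which is exactly the range-versus-sensitivity trade-off described in the text.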
For simple pooling over a population of independent disparity detectors, there is a trade-off between the range of disparity encoded in any one direction and the sensitivity to small changes, regardless of the shape of the individual filters. That said, the horizontal displacement of the eyes does place constraints on the maximum useful extent of the individual disparity tuning surfaces. There is no point in having any one filter encompass a larger range than is required of the population.

As our simulation of the probability distribution of the two-dimensional disparities encountered during normal viewing demonstrates, large vertical disparities are rare in comparison to large horizontal disparities. The iso-probability contours (see Figures 4 and 5) span a greater range of horizontal than vertical disparities. Suppose that the brain ignores disparities outside a certain iso-probability contour and that this is the functional reason that we cannot fuse large disparities outside Panum's fusional limit: such large disparities are too rare to be worth encoding. The range of disparities to be encoded in each direction then places an upper limit on the useful extent of an individual cell's disparity tuning surface in that direction. Because of the horizontally elongated shape of the iso-probability contours, this upper limit is much lower in the vertical direction than in the horizontal direction. While this does not prove that individual disparity
tuning surfaces must be horizontally elongated (it would be theoretically possible to cover the desired horizontally elongated range with vertically elongated disparity sensors), it nevertheless suggests that horizontally elongated surfaces may be more likely than vertically elongated ones. The characteristic feature of horizontally elongated disparity tuning surfaces is that they continue responding to features across a range of horizontal disparities, while their response falls quickly back to baseline if the vertical disparity changes. This may be useful, given that features with significant vertical disparity are likely to be "false matches."

There is some evidence that the brain does not explicitly search for nonzero vertical disparities when solving the stereo correspondence problem. If the eyes are looking straight ahead (more precisely, are in primary position), then there are no vertical disparities, and the epipolar line of geometrically possible matches for a given point in the other retina is parallel to the horizontal retinal meridian. When the eyes move, the epipolar lines shift on the retina, so they are not in general exactly horizontal. However, rather than recompute the position of the epipolar lines every time the eyes move, the brain seems to approximate epipolar lines by horizontally oriented search zones of finite vertical extent. The search zones are horizontal because the epipolar lines are usually close to horizontal, while their finite vertical width allows for the fact that nonzero vertical disparities do occur. Computationally, this strategy is an enormous saving. The cost is that when the eyes adopt an extreme position that gives rise to vertical disparities larger than the vertical width of the search zones, matches are not found and the correspondence problem cannot be solved (Schreiber, Crawford, Fetter, & Tweed, 2001).
If this theory is correct, the horizontally elongated disparity tuning surfaces in V1 (Cumming, 2002) may be a neural correlate of the psychophysics reported by Schreiber et al. (2001). The vertical extent of the search zones should then be determined by the vertical width of the disparity tuning surface and the scatter in preferred vertical disparity. The tolerance of vertical disparity estimated from Cumming's (2002) data is ∼0.4 degree at eccentricities between 2 and 9 degrees, while the psychophysics suggests that vertical disparities of ∼0.7 degree should be tolerated at 3 degrees eccentricity (Schreiber et al., 2001; Stevenson & Schor, 1997). Given the uncertainties, these are in reasonable agreement.

Of course, nonzero vertical disparities do sometimes occur, especially when the eyes are looking up, down, or off to one side, and there is evidence that they influence perception (first shown by Helmholtz, 1925). Thus, the brain must presumably contain sensors that detect vertical disparity. However, the relative rarity of vertical disparities in perceptual experience suggests that these sensors should be correspondingly sparse. This may explain why vertical disparities appear to influence perception at a more global level, being integrated across the whole image, whereas horizontal disparities appear to act more locally, contributing to a local depth map (Rogers & Bradshaw, 1993, 1995; Stenton, Frisby, & Mayhew, 1984).
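The search-zone idea can be phrased as a one-line matching rule. The sketch below is our own illustrative code, not a model from the paper; the 0.4-degree vertical tolerance is the estimate quoted above from Cumming's (2002) data, and the 2-degree horizontal range is an arbitrary placeholder:

```python
def within_search_zone(dx_deg, dy_deg, dx_range=2.0, dy_tol=0.4):
    """Accept a candidate stereo match only if it falls inside a horizontally
    oriented search zone: a wide range of horizontal disparity (dx_range)
    but a narrow tolerance for vertical disparity (dy_tol), in degrees.
    Parameter values are illustrative, loosely based on the text."""
    return abs(dx_deg) <= dx_range and abs(dy_deg) <= dy_tol

# a mostly horizontal disparity is matchable; a large vertical one is not,
# even when its horizontal component is small
```

Under this rule, matching fails exactly when the eye posture produces vertical disparities exceeding the band's half-width, which is the failure mode Schreiber et al. (2001) report psychophysically.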
The above arguments give some insight into the functional constraints underpinning V1's observed specialization for horizontal disparity at both the population level and the level of individual cells. The next problem is how to reconcile the individual cell specialization—the horizontally elongated disparity tuning surfaces—with existing models of disparity tuning, in which disparity tuning surfaces must be elongated parallel to the preferred orientation.

Thanks to a multitude of electrophysiological, psychophysical, and mathematical studies, stereopsis is one of the best-understood perceptual systems. Our understanding of the early stages of cortical processing is encapsulated in the famous energy model of disparity selectivity, which has successfully explained many aspects of the behavior of real cells (Cumming & Parker, 1997; Ohzawa et al., 1990, 1997). Although it has long been known that there are quantitative discrepancies between the energy model and experimental data (Ohzawa et al., 1997; Prince, Cumming et al., 2002; Prince, Pointon et al., 2002; Read & Cumming, 2003b), the decoupling of disparity and orientation observed in real cells (Cumming, 2002) violates a key energy model prediction in a dramatic, qualitative way. This forces us to reconsider the model's most fundamental assumptions, such as the underlying linearity (Uka & DeAngelis, 2002). This initial linear stage has been a cornerstone of a whole class of models, not only of disparity selectivity but of V1 processing in general.

Fortunately, as we demonstrate in this article, the existing models can be fairly readily modified to allow the observed separation of disparity and orientation tuning. We present two possible ways of solving the problem: (1) multiple subunits with different horizontal position disparities and (2) monocular suppression from inhibitory zones.
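Mechanism 1 can be sketched numerically. The code below is our own toy implementation of the standard linear cross-correlation picture, with arbitrary parameters (it is not the authors' simulation code): the disparity tuning of a single linear binocular subunit with identical oblique Gabor receptive fields follows the Gabor's autocorrelation, elongated along the preferred orientation, while summing subunits whose right-eye fields carry different horizontal position disparities stretches the combined surface horizontally.

```python
import numpy as np

n = 192
ax = np.linspace(-6.0, 6.0, n)   # grid: roughly 16 pixels per unit
X, Y = np.meshgrid(ax, ax)

def gabor(theta=np.pi / 4, sig_u=1.5, sig_v=0.5, f=1.0):
    """Oblique Gabor receptive field: Gaussian envelope elongated along the
    preferred orientation (u axis), carrier grating running across it (v axis)."""
    u = X * np.cos(theta) + Y * np.sin(theta)
    v = -X * np.sin(theta) + Y * np.cos(theta)
    return np.exp(-0.5 * (u / sig_u) ** 2 - 0.5 * (v / sig_v) ** 2) * np.cos(2 * np.pi * f * v)

def xcorr2(a, b):
    """Circular 2D cross-correlation (via FFT) of left- and right-eye RFs:
    a stand-in for the disparity tuning surface of a linear binocular subunit."""
    return np.fft.fftshift(np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b)))))

g = gabor()
single = xcorr2(g, g)  # one subunit: tuning elongated along the oblique RF axis

# mechanism 1: sum subunits whose right-eye RFs carry different horizontal
# position disparities (shifts in pixels)
shifts = [-32, -16, 0, 16, 32]
multi = sum(xcorr2(g, np.roll(g, s, axis=1)) for s in shifts)

def widths(surface):
    """RMS widths of |surface| along the horizontal and vertical disparity axes."""
    w = np.abs(surface)
    DX, DY = np.meshgrid(np.arange(n) - n // 2, np.arange(n) - n // 2)
    return ((w * DX ** 2).sum() / w.sum()) ** 0.5, ((w * DY ** 2).sum() / w.sum()) ** 0.5
```

For a 45-degree Gabor, the single-subunit surface is symmetric between the two disparity axes; only the horizontal scatter of subunit position disparities breaks that symmetry, which is why phase disparities alone (which shift tuning orthogonal to the preferred orientation) could not produce the same effect.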
Both yield horizontally elongated disparity tuning surfaces—the first by extending the response to a wider range of horizontal disparities and the second by preventing response to any but a narrow range of vertical disparities. Both involve further modifications of the energy model, building on previous work in which we introduced a threshold prior to binocular combination (Read et al., 2002). The models are not necessarily alternative possibilities; some combination of the two may occur. For instance, complex cells in V1 may represent the sum of multiple subunits with different horizontal position disparities, with the monocular inputs to each subunit gated by horizontally elongated inhibitory zones.

4.1 Position versus Phase Disparity.

The multiple subunit model relies on horizontal position disparities to achieve horizontally elongated disparity tuning surfaces. Phase disparities alone could not achieve this because differences in receptive field phase produce disparities only along an axis orthogonal to the cell's preferred orientation. Of course, phase disparities could occur alongside the position disparities postulated here. Furthermore, phase disparities may be important in producing the wider range of preferred disparities along the horizontal than along the vertical axis which is
observed at the population level (Anzai et al., 1999; Cumming, 2002; DeAngelis et al., 1991, 1995a).

4.2 Model Tests and Predictions.

When postulating new physiological models, it is important to consider how they could be tested and falsified. One important point is that, at least as developed here, the multiple subunit model works only for complex cells. The model would respond to all phases of a grating stimulus at the optimum spatial frequency, since as the grating moves, it simply passes from one of the multiple subunits to the next. Thus, the observation of simple cells with horizontally elongated disparity tuning surfaces would present difficulties for the multiple subunit model. It would require a rather contrived relationship between the phase of a subunit's receptive field and its position within the array of subunits.

Since the monocular suppression model achieves horizontally elongated disparity tuning by postulating horizontally elongated inhibitory regions, the obvious way of testing this model appears to be to see whether the suppressive zones in real cells are horizontally elongated. However, if there are several suppressive zones, as in our simulations, this could be difficult to establish. For example, in our simulations, the inhibitory zones are arranged in a ring. An experimental determination of the regions where stimuli exert a suppressive effect would discover a halo of suppression around the receptive field center. In fact, this halo is made up of many horizontally elongated components, but this might not be apparent experimentally. Thus, it is hard to test this feature of the model.

Importantly, both models predict that with appropriately chosen stimuli, the horizontally elongated disparity tuning should vanish, and the coupling between disparity tuning and preferred orientation should reemerge.
For example, the monocular suppression model predicts that when the disparate stimulus is a single dot, the two-dimensional disparity tuning should be elongated along the cell's preferred orientation, as in the energy model. This is because—at least if the inhibitory zones do not substantially overlap the classical receptive field—when the dot stimulus activates the inhibitory zones, it is already largely outside the classical receptive field, so further monocular suppression has little effect. The disparity tuning is therefore elongated along the preferred orientation, as it would be in a model without horizontally elongated inhibitory zones. In contrast, random dot patterns—and most natural stimuli—activate both the receptive field and the inhibitory regions at the same time, allowing the inhibitory zones to suppress the response to vertical disparities.

In this context, it is interesting to note that Pack, Born, and Livingstone (2003), using a single disparate dot stimulus, did not report the specialization for horizontal disparity found by Cumming (2002); in their data, in both V1 and MT, disparity tuning appeared to agree well with direction tuning. Our monocular suppression model explains how this apparent discrepancy could be due to the different stimuli used. It would thus be valuable to test disparity tuning
in the same cell with both a disparate dot stimulus and random dot patterns. The multiple subunit model still shows a horizontally elongated disparity tuning surface when probed with a single dot stimulus, as with random dot patterns. However, with suitable stimuli, the disparity tuning of this model too can be made to reveal the oriented structure of the receptive fields.

For example, consider extended one-dimensional noise stimuli ("bar codes"). The disparity of such a stimulus is effectively one-dimensional: disparities applied parallel to the bar code have no effect, so all disparities can be reduced to their component orthogonal to the bar code. Now consider a stimulus made up of two such bar codes superimposed—one aligned with the cell's preferred orientation and the other orthogonal to it. Any disparity can be resolved into a component parallel to the preferred orientation (which would affect only the orthogonal bar code) and a component orthogonal to the preferred orientation (which would affect only the parallel bar code). Thus, disparity would effectively be applied only along the two axes orthogonal and parallel to the cell's preferred orientation. The disparity tuning is therefore elongated along these axes. This is not just an inevitable consequence of the stimulus; it reflects the underlying linearity of the models presented in this article. If V1 neurons implement more sophisticated nonlinearities than proposed here, they could potentially show horizontally elongated disparity tuning surfaces even with this crossed bar code stimulus. Thus, experiments with this stimulus could be a powerful way of probing the computations performed by V1 neurons.

5 Conclusion

This work provides the first explanation of the newly observed specialization for horizontal disparity in V1.
It demonstrates that although this observation initially seemed completely at odds with all previous models, it can in fact be fairly straightforwardly incorporated into existing models of disparity selectivity. This represents an important step forward in our understanding of how cortical circuitry is specialized for binocular vision.

Acknowledgments

This work was supported by the National Eye Institute.

References

Anzai, A., Ohzawa, I., & Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: Position vs. phase. Proc. Natl. Acad. Sci. U.S.A., 94, 5438–5443.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1999). Neural mechanisms for encoding binocular disparity: Receptive field position versus phase. J. Neurophysiol., 82, 874–890.
Barlow, H. B., Blakemore, C., & Pettigrew, J. D. (1967). The neural mechanisms of binocular depth discrimination. J. Physiol., 193, 327–342.
Bridge, H., & Cumming, B. G. (2001). Responses of macaque V1 neurons to binocular orientation differences. J. Neurosci., 21(18), 7293–7302.
Bruno, P., & Van den Berg, A. V. (1997). Relative orientation of primary positions of the two eyes. Vision Res., 37(7), 935–947.
Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. J. Neurosci., 17(21), 8621–8644.
Cavanaugh, J. R., Bair, W., & Movshon, J. A. (2002). Selectivity and spatial distribution of signals from the receptive field surround in macaque V1 neurons. J. Neurophysiol., 88(5), 2547–2556.
Cumming, B. G. (2002). An unexpected specialization for horizontal disparity in primate primary visual cortex. Nature, 418(6898), 633–636.
Cumming, B. G., & Parker, A. J. (1997). Responses of primary visual cortical neurons to binocular disparity without depth perception. Nature, 389, 280–283.
Cumming, B., & Parker, A. (2000). Local disparity not perceived depth is signaled by binocular neurons in cortical area V1 of the macaque. J. Neurosci., 20(12), 4758–4767.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1991). Depth is encoded in the visual cortex by a specialised receptive field structure. Nature, 352, 156–159.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1995a). Neuronal mechanisms underlying stereopsis: How do simple cells in the visual cortex encode binocular disparity? Perception, 24(1), 3–31.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1995b). Receptive-field dynamics in the central visual pathways. Trends in Neuroscience, 18(10), 451–458.
Fleet, D., Wagner, H., & Heeger, D. (1996). Neural encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Res., 36, 1839–1857.
Freeman, R. D., & Ohzawa, I. (1990). On the neurophysiological organisation of binocular vision. Vision Res., 30(11), 1661–1676.
Freeman, R. D., Ohzawa, I., & Walker, G. (2001). Beyond the classical receptive field in the visual cortex. Prog. Brain Res., 134, 157–170.
Gilbert, C. D., & Wiesel, T. N. (1990). The influence of contextual stimuli on the orientation selectivity of cells in primary visual cortex of the cat. Vision Res., 30(11), 1689–1701.
Gonzalez, F., & Perez, R. (1998). Neural mechanisms underlying stereoscopic vision. Prog. Neurobiol., 55(3), 191–224.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Vis. Neurosci., 9(2), 181–197.
Heeger, D. J. (1993). Modeling simple-cell direction selectivity with normalized, half-squared, linear operators. J. Neurophysiol., 70(5), 1885–1898.
Helmholtz, H. von (1925). Treatise on physiological optics. Rochester, NY: Optical Society of America.
Jones, H. E., Wang, W., & Sillito, A. M. (2002). Spatial organization and magnitude of orientation contrast interactions in primate V1. J. Neurophysiol., 88(5), 2796–2808.
Jones, J. P., & Palmer, L. A. (1987). The two-dimensional spatial structure of simple receptive fields in cat striate cortex. J. Neurophysiol., 58(6), 1187–1211.
Joshua, D., & Bishop, P. O. (1970). Binocular single vision and depth discrimination: Receptive field disparities for central and peripheral vision and binocular interaction of peripheral single units in cat striate cortex. Exp. Brain Res., 10, 389–416.
Julesz, B. (1971). Foundations of cyclopean perception. Chicago: University of Chicago Press.
LeVay, S., & Voigt, T. (1988). Ocular dominance and disparity coding in cat visual cortex. Vis. Neurosci., 1(4), 395–414.
Levitt, J. B., & Lund, J. S. (1997). Contrast dependence of contextual effects in primate visual cortex. Nature, 387(6628), 73–76.
Maffei, L., & Fiorentini, A. (1976). The unresponsive regions of visual cortical receptive fields. Vision Res., 16(10), 1131–1139.
Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194, 283–287.
Maske, R., Yamane, S., & Bishop, P. O. (1986a). End-stopped cells and binocular depth discrimination in the striate cortex of cats. Proc. R. Soc. Lond. B Biol. Sci., 229(1256), 257–276.
Maske, R., Yamane, S., & Bishop, P. O. (1986b). Stereoscopic mechanisms: Binocular responses of the striate cells of cats to moving light and dark bars. Proc. R. Soc. Lond. B Biol. Sci., 229(1256), 227–256.
Minken, A. W., & Van Gisbergen, J. A. (1996). Dynamical version–vergence interactions for a binocular implementation of Donders' law. Vision Res., 36(6), 853–867.
Mok, D., Ro, A., Cadera, W., Crawford, J. D., & Vilis, T. (1992). Rotation of Listing's plane during vergence. Vision Res., 32(11), 2055–2064.
Muller, J. R., Metha, A. B., Krauskopf, J., & Lennie, P. (2003). Local signals from beyond the receptive fields of striate cortical neurons. J. Neurophysiol., 90, 822–831.
Nelson, J. I., Kato, H., & Bishop, P. O. (1977). Discrimination of orientation and position disparities by binocularly activated neurons in cat striate cortex. J. Neurophysiol., 40(2), 260–283.
Nikara, T., Bishop, P. O., & Pettigrew, J. D. (1968). Analysis of retinal correspondence by studying receptive fields of binocular single units in cat striate cortex. Exp. Brain Res., 6(4), 353–372.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1996). Encoding of binocular disparity by simple cells in the cat's visual cortex. J. Neurophysiol., 75(5), 1779–1805.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., 77(6), 2879–2909.
Ohzawa, I., & Freeman, R. D. (1986a). The binocular organization of complex cells in the cat's visual cortex. J. Neurophysiol., 56(1), 243–259.
Ohzawa, I., & Freeman, R. D. (1986b). The binocular organization of simple cells in the cat's visual cortex. J. Neurophysiol., 56(1), 221–242.
Pack, C. C., Born, R. T., & Livingstone, M. S. (2003). Two-dimensional substructure of stereo and motion interactions in macaque visual cortex. Neuron, 37(3), 525–535.
Poggio, G. F., & Fischer, B. (1977). Binocular interaction and depth sensitivity of striate and prestriate cortex of behaving rhesus monkey. J. Neurophysiol., 40(6), 1392–1405.
Prince, S. J., Cumming, B. G., & Parker, A. J. (2002). Range and mechanism of encoding of horizontal disparity in macaque V1. J. Neurophysiol., 87(1), 209–221.
Prince, S. J., Pointon, A. D., Cumming, B. G., & Parker, A. J. (2002). Quantitative analysis of the responses of V1 neurons to horizontal disparity in dynamic random-dot stereograms. J. Neurophysiol., 87, 191–208.
Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6, 390–404.
Read, J. C. A., & Cumming, B. G. (2003a). Ocular dominance predicts neither strength nor class of disparity selectivity with random-dot stimuli in primate V1. J. Neurophysiol., 91, 1271–1281.
Read, J. C. A., & Cumming, B. G. (2003b). Testing quantitative models of binocular disparity selectivity in primary visual cortex. J. Neurophysiol., 90(5), 2795–2817.
Read, J. C. A., Parker, A. J., & Cumming, B. G. (2002). A simple model accounts for the reduced response of disparity-tuned V1 neurons to anti-correlated images. Vis. Neurosci., 19, 735–753.
Rogers, B. J., & Bradshaw, M. F. (1993). Vertical disparities, differential perspective and binocular stereopsis. Nature, 361(6409), 253–255.
Rogers, B. J., & Bradshaw, M. F. (1995). Disparity scaling and the perception of frontoparallel surfaces. Perception, 24(2), 155–179.
Schreiber, K., Crawford, J. D., Fetter, M., & Tweed, D. (2001). The motor side of depth vision. Nature, 410(6830), 819–822.
Sillito, A. M., Grieve, K. L., Jones, H. E., Cudeiro, J., & Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature, 378(6556), 492–496.
Somani, R. A., DeSouza, J. F., Tweed, D., & Vilis, T. (1998). Visual test of Listing's law during vergence. Vision Res., 38(6), 911–923.
Stenton, S. P., Frisby, J. P., & Mayhew, J. E. (1984). Vertical disparity pooling and the induced effect. Nature, 309(5969), 622–623.
Stevenson, S. B., & Schor, C. M. (1997). Human stereo matching is not restricted to epipolar lines. Vision Res., 37(19), 2717–2723.
Tsao, D. Y., Conway, B., & Livingstone, M. S. (2003). Receptive fields of disparity-tuned simple cells in macaque V1. Neuron, 38, 103–114.
Tweed, D. (1997a). Kinematic principles of three-dimensional gaze control. In M. Fetter, T. P. Haslwanter, H. Misslisch, & D. Tweed (Eds.), Three-dimensional kinematics of eye, head and limb movements. Amsterdam: Harwood Academic Publishers.
Tweed, D. (1997b). Visual-motor optimization in binocular control. Vision Res., 37(14), 1939–1951.
Uka, T., & DeAngelis, G. C. (2002). Binocular vision: An orientation to disparity coding. Curr. Biol., 12(22), R764–766.
Van Rijn, L. J., & Van den Berg, A. V. (1993). Binocular eye orientation during fixations: Listing's law extended to include eye vergence. Vision Res., 33(5–6), 691–708.
von der Heydt, R., Adorjani, C., Hanny, P., & Baumgartner, G. (1978). Disparity sensitivity and receptive field incongruity of units in the cat striate cortex. Exp. Brain Res., 31(4), 523–545.
Walker, G. A., Ohzawa, I., & Freeman, R. D. (1999). Asymmetric suppression outside the classical receptive field of the visual cortex. J. Neurosci., 19(23), 10536–10553.
Walker, G. A., Ohzawa, I., & Freeman, R. D. (2000). Suppression outside the classical cortical receptive field. Vis. Neurosci., 17(3), 369–379.
Walker, G. A., Ohzawa, I., & Freeman, R. D. (2002). Disinhibition outside receptive fields in the visual cortex. J. Neurosci., 22(13), 5659–5668.
Zhu, Y. D., & Qian, N. (1996). Binocular receptive field models, disparity tuning, and characteristic disparity. Neural Computation, 8(8), 1611–1641.

Received October 1, 2003; accepted April 6, 2004.
LETTER
Communicated by Daniel Wolpert

Different Predictions by the Minimum Variance and Minimum Torque-Change Models on the Skewness of Movement Velocity Profiles

Hirokazu Tanaka
Center for Neurobiology and Behavior and Department of Physiology and Cellular Biophysics, Columbia University, New York, NY 10032, U.S.A.

Meihua Tai
Department of Mechanical Engineering, Polytechnic University, Brooklyn, NY 11201, U.S.A.

Ning Qian
[email protected]
Center for Neurobiology and Behavior and Department of Physiology and Cellular Biophysics, Columbia University, New York, NY 10032, U.S.A.
Neural Computation 16, 2021–2040 (2004); © 2004 Massachusetts Institute of Technology

We investigated the differences between two well-known optimization principles for understanding movement planning: the minimum variance (MV) model of Harris and Wolpert (1998) and the minimum torque change (MTC) model of Uno, Kawato, and Suzuki (1989). Both models accurately describe the properties of human reaching movements in ordinary situations (e.g., nearly straight paths and bell-shaped velocity profiles). However, we found that the two models can make very different predictions when external forces are applied or when the movement duration is increased. We considered a second-order linear system for the motor plant that has been used previously to simulate eye movements and single-joint arm movements and were able to derive analytical solutions based on the MV and MTC assumptions. With the linear plant, the MTC model predicts that the movement velocity profile should always be symmetrical, independent of the external forces and movement duration. In contrast, the MV model strongly depends on the movement duration and the system's degree of stability; the latter in turn depends on the total forces. The MV model thus predicts a skewed velocity profile under many circumstances. For example, it predicts that the peak location should be skewed toward the end of the movement when the movement duration is increased in the absence of any elastic force. It also predicts that with appropriate viscous and elastic forces applied to increase system stability, the velocity profile should be skewed toward the beginning of the movement. The velocity profiles predicted by the MV model can even show oscillations when the plant becomes highly oscillatory. Our
analytical and simulation results suggest specific experiments for testing the validity of the two models.

1 Introduction

The problem of motor planning is ill posed. For a given initial and a target position, the required control signals cannot be uniquely determined (Bernstein, 1967). Conceptually, the problem can be divided into a few subproblems (Hollerbach, 1982; Kawato, Furukawa, & Suzuki, 1987): (1) the determination of a desired trajectory given an initial and a target position in the visual coordinate system, (2) the coordinate transformation of the trajectory from the visual coordinate system to the body-oriented coordinate system, (3) the determination of muscle torques for producing the desired trajectory, and (4) the specification of neuronal control signals for realizing the muscle torques. In general, each of these subproblems is ill posed due to kinematic or dynamic redundancies.

To understand why the brain chooses a particular motor plan among infinite possibilities, it is usually assumed that the motor system tries to optimize a certain quantity related to movements. One such optimization principle is the minimum jerk (MJ) model proposed by Flash and Hogan (1985). The model determines a unique trajectory by minimizing jerk, the third temporal derivative of the trajectory, in the task-oriented coordinate system. It thus prefers paths with smooth acceleration. It predicts the straight paths and bell-shaped velocity profiles often observed in human reaching movements. The MJ model is purely kinematic, and it provides a solution to the first subproblem mentioned above. Later, dynamic models were also proposed, attempting to solve more than one subproblem simultaneously. We will analyze and compare two representative dynamic models in this article: the minimum torque change (MTC) model (Uno, Kawato, & Suzuki, 1989) and the minimum variance (MV) model (Harris & Wolpert, 1998).
The MTC model is a dynamical extension of the MJ model; it prefers smooth muscle torques instead of smooth acceleration. In contrast, the MV model introduces a signal-dependent noise term into the control signals and requires a minimum postmovement variance around the target position. These dynamic models explain a wider range of experimental observations than the MJ model does (Uno, Kawato, et al., 1998; Harris & Wolpert, 1998). Several authors have discussed how these optimization processes may be implemented in a biologically plausible network (Massone & Bizzi, 1989; Kawato, Maeda, Uno, & Suzuki, 1990; Hoff & Arbib, 1993).

The MTC and MV models make similar predictions for movements under normal situations (Uno, Kawato, & Suzuki, 1989; Harris & Wolpert, 1998), such as nearly straight paths and bell-shaped velocity profiles (Kelso, Southard, & Goodman, 1979; Morasso, 1981; Abend, Bizzi, & Morasso, 1982). This is remarkable since the two models employ very different optimization criteria: the MTC model prefers the smoothness of muscle torques during
the movement, while the MV model focuses on the accuracy after the movement. The main purpose of this article is to explore the conditions under which the two models make divergent predictions. Obviously, this is important for determining which optimization principle, if any, may be used by the brain for motor planning. Specifically, we formulate these optimization processes with a linear motor plant and show that the velocity profiles predicted by the MV model can be highly asymmetrical when we change the movement duration or the balance between viscous and elastic forces, while the MTC model always predicts a symmetrical velocity profile.

In section 2, we provide an analytical formulation of these two models with a linear plant and explain the underlying reasons for the different predictions. We also determine the conditions under which the velocity profiles predicted by the MV model become skewed toward the beginning or the end of a movement. In section 3, we verify our analyses with numerical simulations of representative cases. We discuss in section 4 some specific experiments for testing the models. Since neither model includes sensory feedback, we also briefly discuss potential effects of feedback on the velocity profile in the framework of Todorov and Jordan's recent model (Todorov & Jordan, 2002; Todorov, 2004). Preliminary results have been presented in abstract form (Tanaka & Qian, 2003).

2 Analyses

Let us first define a motor plant to be used by both the MTC and the MV models. We consider a second-order linear system whose dynamics is specified by

\ddot{\theta}(t) + a_1 \dot{\theta}(t) + a_0 \theta(t) = \tau(t).   (2.1)
This equation is based on Newtonian mechanics and has been used for modeling eye movements (Robinson, Gordon, & Gordon, 1986) and single-joint arm movements (Hogan, 1984). (The nonlinear equations for a two-joint arm (Luh, Walker, & Paul, 1980) reduce to this linear form when the upper joint is assumed to be fixed.) Here, a_0 and a_1 are elastic and viscous constants, respectively. θ is a state variable representing eye position or joint angle, and a dot over the variable denotes the temporal derivative. τ(t) is the muscle torque. In our analysis of the MV model, we will make the simplifying assumption that τ is also the neural control signal. In other words, we assume that there is no delay between neural activity and muscle responses. This simplification allows us to use the same dynamic equation for the MTC and MV models and derive analytical solutions. In reality, the muscle torque should be a low-pass filtered version of the control signal (Winters & Stark, 1985). We will introduce this more realistic relationship in our numerical simulations in section 3 and show that it does not change the conclusions of our analysis.
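As a sanity check on equation 2.1, the plant can be integrated numerically. The following Python sketch is our own illustration (the forward-Euler scheme and the parameter values are illustrative choices, not taken from the article):

```python
import numpy as np

def simulate_plant(a0, a1, tau, t_f=3.0, dt=1e-4, theta0=1.0):
    """Forward-Euler integration of equation 2.1:
    theta'' + a1*theta' + a0*theta = tau(t)."""
    ts = np.arange(0.0, t_f, dt)
    theta, omega = theta0, 0.0          # initial position, zero initial velocity
    trace = np.empty_like(ts)
    for i, t in enumerate(ts):
        trace[i] = theta
        acc = tau(t) - a1 * omega - a0 * theta   # equation 2.1 rearranged
        theta, omega = theta + dt * omega, omega + dt * acc
    return ts, trace

# With no torque and positive a0, a1, the stable plant relaxes toward zero.
ts, th = simulate_plant(a0=4.0, a1=4.0, tau=lambda t: 0.0)
```

For a_0 = 4, a_1 = 4 the plant is critically damped (a_1² = 4a_0), and the trace agrees with the closed-form solution θ(t) = (1 + 2t)e^{−2t}θ_0.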
The second-order differential equation 2.1 can be reduced to first order by introducing a vector representation of the state:

x \equiv \begin{pmatrix} \theta \\ \dot{\theta} \end{pmatrix}.   (2.2)

Then the equation becomes

\dot{x}(t) = A x(t) + B \tau(t),   (2.3)

where the matrices A and B are defined as

A \equiv \begin{pmatrix} 0 & 1 \\ -a_0 & -a_1 \end{pmatrix}, \quad B \equiv \begin{pmatrix} 0 \\ 1 \end{pmatrix}.   (2.4)
The stability of the plant is determined by the eigenvalues of the matrix A,

\lambda_\pm = \left( -a_1 \pm \sqrt{a_1^2 - 4 a_0} \right) / 2.   (2.5)

The system is stable if the real parts of both eigenvalues are negative, and unstable otherwise. This stability condition is satisfied if both a_0 and a_1 are positive. The stable system is called over-, critical-, or underdamped according to whether a_1² > 4a_0, a_1² = 4a_0, or a_1² < 4a_0, respectively. We can control the system's degree of stability by adjusting a_0 and a_1 to alter the real-part magnitude of the eigenvalues λ_±. The system can also be made oscillatory (underdamped) by increasing the elastic force relative to the viscous force.

Equation 2.3 can be integrated after a coordinate transform that diagonalizes A, and θ(t) and θ̇(t) can then be obtained via the inverse transform. The results are

\theta(t) = \frac{ (\lambda_+ e^{\lambda_- t} - \lambda_- e^{\lambda_+ t})\, \theta_0 + \int_0^t \left( e^{\lambda_+ (t-t')} - e^{\lambda_- (t-t')} \right) \tau(t')\, dt' }{ \lambda_+ - \lambda_- }   (2.6)

\dot{\theta}(t) = \frac{ \lambda_+ \lambda_- \left( -e^{\lambda_+ t} + e^{\lambda_- t} \right) \theta_0 + \int_0^t \left( \lambda_+ e^{\lambda_+ (t-t')} - \lambda_- e^{\lambda_- (t-t')} \right) \tau(t')\, dt' }{ \lambda_+ - \lambda_- }   (2.7)

under the initial condition that θ(0) = θ_0 and θ̇(0) = 0. Thus, θ(t) and θ̇(t) are essentially filtered versions of τ(t). Equation 2.7 will be used in our numerical simulations of the MTC model's velocity profiles in section 3.

2.1 The Minimum Torque-Change Model. We show analytically that with the second-order linear plant, the MTC model always predicts symmetrical velocity profiles of movement.
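The eigenvalue formula and damping classification of equation 2.5 can be made concrete in code; a minimal Python sketch (the helper names are ours):

```python
import numpy as np

def plant_eigenvalues(a0, a1):
    # Equation 2.5: lambda_pm = (-a1 +/- sqrt(a1^2 - 4*a0)) / 2
    s = np.sqrt(complex(a1 * a1 - 4.0 * a0))
    return (-a1 + s) / 2.0, (-a1 - s) / 2.0

def classify_plant(a0, a1):
    # Stable iff both eigenvalues of A have negative real parts;
    # the damping regime follows from the sign of a1^2 - 4*a0.
    lp, lm = plant_eigenvalues(a0, a1)
    stable = lp.real < 0.0 and lm.real < 0.0
    d = a1 * a1 - 4.0 * a0
    regime = "overdamped" if d > 0 else "critical" if d == 0 else "underdamped"
    return stable, regime

# The closed-form eigenvalues agree with numpy's eigendecomposition of A.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])        # a0 = 2, a1 = 3
print(classify_plant(2.0, 3.0))                  # -> (True, 'overdamped')
```

With a_0 = 2, a_1 = 3 the eigenvalues are −1 and −2, a stable overdamped plant; a negative a_1 (a negative viscous constant, as discussed for the unstable case later in the text) flips a real part positive.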
2.1.1 The Cost Function. The cost function of the MTC model is defined as an integration of the torque changes over the movement duration [0, t_f] (Uno, Kawato, et al., 1998),

C_{MTC} = \frac{1}{2} \int_0^{t_f} \dot{\tau}^2(t)\, dt.   (2.8)
If there are no elastic and viscous forces in equation 2.1, this cost function reduces to that for the MJ model. By minimizing the cost function, a unique trajectory is specified among all trajectories that satisfy equation 2.1 and appropriate boundary conditions (see below). The optimization process can be solved by using either the Lagrange multiplier method (Uno, Kawato, et al., 1989) or the Euler-Lagrange equation (Wada, Kaneko, Nakano, Osu, & Kawato, 2001). The latter method is more convenient for proving the symmetry of the velocity profile, as we show below. The solution based on the Lagrange multiplier method is provided in the appendix and is used in our numerical simulations in section 3.

2.1.2 Symmetric Velocity Profiles Predicted by the MTC Model. By substituting equation 2.1 into equation 2.8 and applying the variational procedure with respect to θ, we obtain the following Euler-Lagrange equation (Wada et al., 2001):

\frac{d^6 \theta(t)}{dt^6} - a_1^2 \frac{d^4 \theta(t)}{dt^4} + a_0^2 \frac{d^2 \theta(t)}{dt^2} = 0.   (2.9)
The boundary terms of the partial integrations vanish because of the zero velocity and acceleration conditions at the initial and final times. This equation is clearly invariant under time translation. We can thus let the movement duration be from −t_f/2 to t_f/2 such that a symmetrical velocity profile is equivalent to an even function of time. The boundary conditions are θ(−t_f/2) = θ_0, θ(t_f/2) = θ_f, and θ̇(−t_f/2) = θ̇(t_f/2) = θ̈(−t_f/2) = θ̈(t_f/2) = 0.

The above differential equation is reduced to fifth order if we use the velocity ω(t) ≡ θ̇(t) as the variable,

\frac{d^5 \omega(t)}{dt^5} - a_1^2 \frac{d^3 \omega(t)}{dt^3} + a_0^2 \frac{d\omega(t)}{dt} = 0,   (2.10)
and the original boundary conditions become five new conditions: ω(−t_f/2) = ω(t_f/2) = ω̇(−t_f/2) = ω̇(t_f/2) = 0, and \int_{-t_f/2}^{t_f/2} \omega(t)\, dt = θ_f − θ_0. The eigenvalues of equation 2.10 are zero and four nonzero values α_i (i = 1, 2, 3, 4) given by the characteristic polynomial:

\alpha^4 - a_1^2 \alpha^2 + a_0^2 = 0.   (2.11)
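Because equation 2.11 contains only even powers of α, its roots come in ± pairs, which is the key fact behind the symmetry result that follows. This can be checked numerically; a small Python sketch (the constants are illustrative):

```python
import numpy as np

# Roots of alpha^4 - a1^2*alpha^2 + a0^2 = 0 (equation 2.11) come in
# +/- pairs because only even powers of alpha appear.
a0, a1 = 2.0, 3.0                                    # illustrative constants
roots = np.roots([1.0, 0.0, -a1 * a1, 0.0, a0 * a0])

for r in roots:
    # for every root r, -r is also a root
    assert np.min(np.abs(roots + r)) < 1e-8
```

The same pairing holds for any a_0, a_1, real or complex roots alike, since the polynomial is a function of α² only.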
The solution of equation 2.10 is a linear combination of the eigenstates, \omega(t) = \beta_0 + \sum_{i=1}^{4} \beta_i e^{\alpha_i t}, where the coefficients β_i are determined by the five conditions mentioned above. Since equation 2.11 is a function of α², only two of the four eigenvalues, say α_1 and α_3, are independent, and the other two are given by α_2 = −α_1 and α_4 = −α_3. It is then straightforward, though tedious, to solve for β_i and show that they satisfy the following simple relations:

\beta_1 = \beta_2, \quad \beta_3 = \beta_4.   (2.12)
Using this result, the velocity can be expressed as

\omega(t) = \beta_0 + \beta_1 \left( e^{\alpha_1 t} + e^{-\alpha_1 t} \right) + \beta_3 \left( e^{\alpha_3 t} + e^{-\alpha_3 t} \right).   (2.13)
This solution is an even function of time, ω(−t) = ω(t). Therefore, with the linear plant, the velocity profile predicted by the MTC model is always exactly symmetrical, regardless of the parameter values.

2.2 The Minimum Variance Model. We now consider the MV model with the second-order linear plant and determine the conditions under which the velocity profiles are asymmetrical.

2.2.1 The Cost Function. According to the MV model (Harris & Wolpert, 1998), a signal-dependent noise term ξ should be added to the dynamic equation 2.3:

\dot{x}(t) = A x(t) + B \left[ \tau(t) + \xi(t) \right].   (2.14)
ξ is assumed to be a gaussian white noise with zero mean and a variance proportional to the square of the control signal τ (Harris & Wolpert, 1998):

E[\xi(t)] = 0, \quad E[\xi(t)\xi(t')] = K \tau^2(t)\, \delta(t - t').   (2.15)
Here K is a proportionality constant that scales the amplitude of the noise. E[·] denotes the operation of taking an average over the noise distribution. The dynamic equation 2.14 is stochastic, so it is not particularly informative to consider each individual trajectory. Instead, we consider the expected value of the state vector and the covariance matrix by averaging over the signal-dependent noise ξ:

E[x(t)] = e^{At} x_0 + \int_0^t e^{A(t-t')} B \tau(t')\, dt',   (2.16)

\mathrm{Cov}[x(t)] = K \int_0^t e^{A(t-t')} B \left( e^{A(t-t')} B \right)^T \tau^2(t')\, dt'.   (2.17)
Here x_0 = (θ_0, 0)^T is the initial state vector at t = 0. These expressions are obtained by integrating the equation of motion, equation 2.14, and then taking the average over the noise variable according to equation 2.15.

Consider a movement over an interval 0 ≤ t ≤ t_f. The cost function to be minimized by the MV model is the variance of the position summed over a certain postmovement duration [t_f, t_f + t_p] (Harris & Wolpert, 1998). We first consider the simplified case in which the cost function is the variance of the position at the final time point t_f only. The problem, then, is to minimize the variance of position θ at t_f under the constraint that the expected value of the state x at t_f is the target state x_f = (θ_f, 0)^T (Harris & Wolpert, 1998). The variance of the position is the (1,1) component of the 2 × 2 covariance matrix, equation 2.17. The constrained optimization can be solved by introducing a 2 × 1 Lagrange multiplier µ, with the resulting augmented cost function:

C_{MV} = K \int_0^{t_f} f(t; t_f)\, \tau^2(t)\, dt - \mu^T \left( E[x(t_f)] - x_f \right),   (2.18)

where

f(t; t_f) \equiv \left[ e^{A(t_f - t)} B \left( e^{A(t_f - t)} B \right)^T \right]_{1,1}.   (2.19)
The subscript denotes the (1, 1) component of the matrix. f(t; t_f) is a weighting factor that determines how much the noise in the control signal τ(t) at time t contributes to the final variance of position. By applying the variational principle with respect to τ(t) to equation 2.18, we obtain an analytical solution for the optimal control signal:

\tau(t) = \frac{ \mu^T e^{A(t_f - t)} B }{ 2 K f(t; t_f) }.   (2.20)
The Lagrange multiplier vector µ is a constant that can be determined via equation 2.16 and the boundary condition E[x(t_f)] = x_f. Note that the control signal τ(t) is inversely related to the weighting factor f(t; t_f). This makes sense because, as we mentioned above, f(t; t_f) determines how much the noise in the control signal at time t will contribute to the variance at the final time t_f. A large f means a large contribution, and the corresponding noise (and thus the control signal) should be small in order to minimize the final variance.

A problem with the above simplification is that since f(t_f; t_f) = 0 according to equation 2.19, τ(t) given by equation 2.20 diverges at t_f. The problem can be avoided by using the integration of the variance (and also the constraint) over a postmovement period [t_f, t_f + t_p] as the cost function (Harris & Wolpert, 1998). The reason is that noise at t_f does not affect positional variance at t_f but does contribute to the variance after t_f. It can then be
shown that equation 2.20 becomes

\tau(t) = \frac{ \left( \int_{t_f}^{t_f + t_p} \mu(t')^T e^{A t'}\, dt' \right) e^{-A t} B }{ 2 K \int_{t_f}^{t_f + t_p} f(t; t')\, dt' } \quad (0 \le t \le t_f),   (2.21)

\tau(t) = a_0 \theta_f \quad (t_f < t \le t_f + t_p),   (2.22)
and the divergence problem disappears. Here, the Lagrange multiplier vector µ(t) is a function of time. The meaning of equation 2.21 is similar to that of equation 2.20, with the torque inversely proportional to the weighting factor for the noise. Equation 2.22 is the torque needed for balancing the elastic force over the postmovement period in order to keep the position at θ_f.

2.2.2 Velocity Profiles Predicted by the MV Model. A precise discussion of the velocity profile shape predicted by the MV model requires determining the Lagrange multiplier vector µ from the boundary conditions and then substituting equation 2.20 into equation 2.7. However, since the result will be complicated, involving integrals with no closed-form solutions, we take a more heuristic approach here and verify our conclusions via numerical simulations in section 3. Specifically, we focus on the weighting factor f(t; t_f) because it determines how much the noise in the control signal τ(t) at time t contributes to the positional variance at the end of the movement, t_f. If, for example, f(t; t_f) is small at the start of a movement and becomes large near the end, then the signal-dependent noise during the early phase of the movement does not matter as much as the noise near the end, and we expect a large initial control signal and a velocity profile skewed toward the beginning.

Using the same diagonalization procedure used for obtaining equation 2.7, we can express the weighting factor, equation 2.19, in terms of the two eigenvalues λ_± of matrix A. For the nondegenerate case λ_+ ≠ λ_−, the weighting factor is given by

f(t; t_f) = \frac{ \left( e^{\lambda_+ (t_f - t)} - e^{\lambda_- (t_f - t)} \right)^2 }{ (\lambda_+ - \lambda_-)^2 } \quad (0 \le t \le t_f),   (2.23)

and for the degenerate (i.e., critical damping) case, where the two eigenvalues λ_± take the same value λ,

f(t; t_f) = (t_f - t)^2\, e^{2\lambda (t_f - t)} \quad (0 \le t \le t_f).   (2.24)
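Equations 2.23 and 2.24 are straightforward to evaluate directly; the Python sketch below (our own helper, reusing the eigenvalues of equation 2.5) computes f(t; t_f) in any damping regime:

```python
import numpy as np

def weighting_factor(t, t_f, a0, a1):
    # f(t; t_f) from equation 2.23 (distinct eigenvalues) or 2.24 (degenerate)
    s = np.sqrt(complex(a1 * a1 - 4.0 * a0))
    lp, lm = (-a1 + s) / 2.0, (-a1 - s) / 2.0
    if abs(lp - lm) < 1e-12:                      # critical damping
        return (t_f - t) ** 2 * np.exp(2.0 * lp.real * (t_f - t))
    v = (np.exp(lp * (t_f - t)) - np.exp(lm * (t_f - t))) / (lp - lm)
    return abs(v) ** 2     # v is real-valued even for complex eigenvalues

# Noise injected exactly at t_f cannot affect the position at t_f:
print(weighting_factor(0.4, 0.4, a0=98.0, a1=0.8))   # -> 0.0
```

Note that f(t_f; t_f) = 0, which is exactly the divergence issue discussed for equation 2.20 above.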
We first consider a stable case when both eigenvalues are real and negative. By differentiating equations 2.23 and 2.24, the weighting factor is found to have a single peak at

t^* = t_f - \frac{1}{\lambda_+ - \lambda_-} \ln \frac{\lambda_-}{\lambda_+}   (2.25)
for the nondegenerate plant and

t^* = t_f + \frac{1}{\lambda}   (2.26)
for the degenerate plant. For real and negative eigenvalues, we have |λ_−| ≥ |λ_+| according to equation 2.5, and it is easy to see that t^* is always less than t_f. The system is most stable when the two negative eigenvalues have similar and large magnitudes. Under this condition, t^* is very close to t_f, indicating that noise in the early phase of movement tends to decay away quickly and does not contribute much to the final variance. This means that the control signal can be large around the start of the movement, and the velocity profile should thus be skewed toward the movement onset. As one or both eigenvalues become smaller in magnitude, the system's degree of stability decreases, and t^* shifts toward the start of the movement. (With very small eigenvalue magnitudes, t^* can even be less than 0, and the weighting factor will be a monotonically decreasing function over [0, t_f].) Consequently, the early noise becomes more and more important, the control signal should thus be weaker near the movement onset, and the velocity profile will eventually become more skewed toward the end of the movement.

Next, we consider the unstable case where the eigenvalues are real but at least one of them is positive. This implies a negative viscous constant, which could be realized with a robotic manipulandum. In this case, it can be shown that the weighting factor, equation 2.23 or equation 2.24, is a monotonically decreasing function of time. Again, this means that the velocity profile should be skewed toward the end of the movement.

In the case of complex eigenvalues λ_± = µ ± iν, the system is oscillatory. The analytical expression for the weighting factor becomes

f(t; t_f) = \frac{ e^{2\mu (t_f - t)} }{ 2\nu^2 } \left( 1 - \cos 2\nu (t_f - t) \right),   (2.27)
which is also oscillatory with frequency ν/π. When this frequency is sufficiently high, one expects to see oscillation in the control signal and thus in the velocity profile as well. We plot in Figure 1 the weighting factor as a function of time for a few representative cases. Figure 1A shows curves corresponding to four negative eigenvalues for the degenerate (critical damping) plant. As mentioned above, the peak shifts to the left as the system’s degree of stability (the magnitude of the eigenvalue) decreases. A stable overdamping, an unstable, and a stable underdamping (oscillatory) case are shown in Figure 1B. Finally, since the above arguments are based on how the signal-dependent noise propagates through time, one expects that the movement duration should affect the shape of the velocity profile in the MV model. For a very stable system where the noise decays with time, a longer duration
Figure 1: The weighting factor f(t; t_f) plotted as a function of normalized time t/t_f. (A) Four cases of stable critical damping, with λt_f equal to −1.5, −2.0, −2.5, and −3.0 from left to right. (B) A stable overdamping case (λ_± t_f = −1, −2), an unstable case (λ_± t_f = 0.4, 0.2), and a stable underdamping (oscillatory) case (λ_± t_f = −1 ± 5i).
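The peak locations of equations 2.25 and 2.26, which generate the leftward shifts in Figure 1A, can be cross-checked against a brute-force maximization of f(t; t_f); a Python sketch (our own verification, not the article's code):

```python
import numpy as np

def peak_time(t_f, a0, a1):
    # Equation 2.25 (distinct real eigenvalues) / 2.26 (critical damping)
    d = a1 * a1 - 4.0 * a0
    if d > 0:
        s = np.sqrt(d)
        lp, lm = (-a1 + s) / 2.0, (-a1 - s) / 2.0
        return t_f - np.log(lm / lp) / (lp - lm)
    return t_f - 2.0 / a1          # lambda = -a1/2, so t* = t_f + 1/lambda

# Brute-force check for an overdamped plant (a0 = 2, a1 = 3 => lambdas -1, -2)
t_f, a0, a1 = 1.0, 2.0, 3.0
ts = np.linspace(0.0, t_f, 100001)
f = (np.exp(-1.0 * (t_f - ts)) - np.exp(-2.0 * (t_f - ts))) ** 2
print(abs(ts[np.argmax(f)] - peak_time(t_f, a0, a1)) < 1e-4)   # -> True
```

For this example t^* = t_f − ln 2 ≈ 0.693 t_f from the end of the movement, i.e., well before t_f, consistent with the claim that t^* < t_f for stable real eigenvalues.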
will make the early noise contribute even less to the final variance, and the velocity profile should be skewed more toward the movement onset with increasing duration. This explains a result in Harris and Wolpert (1998) that the velocity profile of a saccadic eye movement model becomes more skewed toward the beginning with longer duration. Likewise, if the system is less stable or unstable such that the velocity profile is skewed toward the end, the skewing can be slight for a short duration but should become more pronounced with increasing duration.

Note that duration is a prefixed parameter in both the MV and MTC models that does not depend on movement amplitude, even though actual movement duration usually increases with amplitude (Fitts, 1954). For the linear plant we used, both models predict that when the movement amplitude is changed, the velocity profile will simply be scaled up or down without changing its duration or shape. One can avoid this problem by an ad hoc adjustment of the movement duration according to the amplitude prior to the optimization procedure.

We summarize our arguments for the MV model as follows. When we change the viscous and elastic constants of the system by applying external viscous and elastic forces, the shape of the movement velocity profiles should change. In particular, when the system is made more (or less) stable, the peak of the velocity profile should be shifted toward the beginning (or end) of the movement. The skewing should be more pronounced with increasing duration. Intuitively, a more stable system tends to attenuate the signal-dependent noise more over time; the control signal (and the associated noise) can thus afford to be large around the movement onset, and the velocity profile will consequently be skewed toward the beginning. In contrast, a less stable or unstable system may amplify noise through time, the control signal should start small, and the velocity profile will be skewed toward the end. When the system is made strongly oscillatory, the noise effect oscillates in time, and the velocity profile should also show oscillation.

3 Simulations

We have conducted extensive numerical simulations in order to confirm our analyses above, particularly when the simplifying assumptions for the MV model are removed. We consider single-joint movements of the forearm with the dynamic equation

I_0 \ddot{\theta}(t) + (b_0 + b) \dot{\theta}(t) + (k_0 + k) L_0^2\, \theta(t) = \tau(t).   (3.1)
This equation can be derived by fixing the shoulder angle in the two-joint arm model examined by Uno, Kawato, et al. (1989). θ represents the elbow angle, and τ is the muscle torque exerted on the elbow. I_0, b_0, k_0, and L_0 are the moment of inertia, the intrinsic viscosity, the intrinsic elasticity, and the length of the forearm, respectively, and we adopt the standard values of 0.25 kg·m², 0.20 kg·m²/s, 0 N/m, and 0.35 m for these parameters (Harris & Wolpert, 1998). We also included externally applied viscosity b and elasticity k for altering the system's degree of stability and oscillation. This single-joint system does not have any kinematic redundancy but does have dynamic redundancies.

For the MTC model, we used the Lagrange multiplier method presented in the appendix to find the torque numerically and then used equation 2.7 to obtain the velocity. For the MV model, we made the simplifying assumptions in section 2 that the control signal is the muscle torque and that the skewing of the velocity profile can be qualitatively understood from the weighting factor f. These limitations are removed in our simulations. We considered a second-order muscle model to relate the neural control signal u to the muscle torque τ (Winters & Stark, 1985):

\left( 1 + t_a \frac{d}{dt} \right) \left( 1 + t_e \frac{d}{dt} \right) \tau(t) = u(t),   (3.2)
where t_a and t_e are muscle activation and excitation time constants, and their values are 40 and 30 ms, respectively. This equation implies that the torque is acquired by low-pass filtering the control signal. Consequently, we expect a smooth rise of the muscle torque instead of a sudden onset. We used the original cost function of Harris and Wolpert (1998) by integrating the positional variance over a period of 400 ms after the movement
(longer postmovement durations do not alter the results), and numerically calculated the velocity profiles via quadratic programming. We considered only stable cases in our simulations since unstable cases generate divergent movement trajectories that are difficult to test experimentally. Based on our analyses, we examined how the system's degree of stability, the movement duration, and the plant oscillation affect the shape of the predicted velocity profiles.

3.1 Effects of the System's Degree of Stability. The system has different degrees of stability when its eigenvalues have different magnitudes but the same (negative) sign. The simplest way to change the degree of stability is to apply external viscous and elastic forces to the forearm. In psychophysical experiments, arbitrary elastic and viscous forces can be introduced by a robotic manipulandum. One could also add the elastic force by attaching a spring to the hand. The movement duration was fixed at 400 ms in all simulations.

We first considered the baseline simulation with no external forces (b = k = 0). We then increased the viscous coefficient b from 1 to 5 in steps of 1 kg·m²/s. This range of viscous force can be achieved by a manipulandum (Shadmehr & Mussa-Ivaldi, 1994). We also imposed an external elastic force for each b according to k = (b_0 + b)² / (4 I_0 L_0²) such that the plant was in the critical damping condition; k ranged from 0.33 to 220 N/m.

Based on our analyses in section 2, the velocity profile of the MV model should skew more toward the beginning as the system becomes more stable, while the velocity profile of the MTC model should always remain symmetrical. This is confirmed by the simulations in Figure 2. Figure 2A shows the normalized velocity profiles predicted by the MV model. The rightmost profile is the baseline case without any external forces, and it skews slightly toward the end of the movement.
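The quoted range of external elasticity follows directly from applying the critical damping condition a_1² = 4a_0 to equation 3.1; a quick Python check (the helper name is ours):

```python
# Equation 3.1 gives a1 = (b0 + b)/I0 and a0 = (k0 + k)*L0**2/I0, so the
# critical damping condition a1**2 == 4*a0 requires
# k0 + k = (b0 + b)**2 / (4*I0*L0**2).
I0, b0, k0, L0 = 0.25, 0.20, 0.0, 0.35   # standard forearm values from the text

def critical_k(b):
    return (b0 + b) ** 2 / (4.0 * I0 * L0 ** 2) - k0

print(round(critical_k(0.0), 2))   # -> 0.33   (no external viscosity)
print(round(critical_k(5.0), 2))   # -> 220.73 (b = 5 kg*m^2/s)
```

These endpoints reproduce the 0.33–220 N/m range quoted above for b between 0 and 5 kg·m²/s.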
The profile gradually shifts toward the movement onset as the system's degree of stability increases with the external forces. To show the peak shift more clearly, we plot in Figure 2B the normalized peak locations of the profiles in Figure 2A as a function of the external viscosity. Along the vertical axis, 0, 0.5, and 1 indicate the beginning, midpoint, and end of the movement, respectively. The corresponding simulations for the MTC model are shown in Figures 2C and 2D. As expected, the predicted velocity profiles are always symmetrical regardless of the external forces; the six normalized velocity profiles almost completely overlap one another.

3.2 Effects of the Movement Duration. We next simulated the effects of the movement duration on the shape of the velocity profile. We varied the duration from 200 ms to 1000 ms in steps of 200 ms while keeping all other parameters at their standard values. No external forces were applied in these simulations. The results for the MV model are shown in Figures 3A and 3B. The peak of the velocity profile shifts more toward the end of the
Figure 2: Velocity profiles and peak locations under different degrees of system stability. (A) Normalized velocity profiles predicted by the MV model. The rightmost curve corresponds to no external force applied to the arm model. The other four curves, from right to left, are for a critical damping system with increasing system stability via external viscous and elastic forces. (B) Normalized peak location of the velocity profiles in A. The peak locations are divided by the total movement duration. (C) Normalized velocity profile predicted by the MTC model with the same plant parameters as in A. (D) Normalized peak location of the profiles in C.
movement with increasing duration. In contrast, the MTC model always predicts symmetrical profiles (see Figures 3C and 3D). These results are again in agreement with our analyses. The movement duration can be easily manipulated in psychophysical experiments by requiring different accuracy at the target location or varying the movement amplitude (Fitts, 1954).

3.3 Effects of the Plant Oscillation. Finally, we compared the two models when the plant has complex eigenvalues and is thus oscillatory. This happens when the elastic force dominates the viscous force and can be achieved by imposing a strong external elastic force or partially canceling the intrinsic viscous force via a manipulandum. In our simulations, we used elastic coefficients k of 200, 300, and 400 N/m while keeping all
Figure 3: Velocity profiles and peak locations under various movement durations. The format is the same as in Figure 2, with A and B for the MV model and C and D for the MTC model.
the other parameters to their standard values. The corresponding values for νt_f/π were 1.26, 1.53, and 1.78, respectively. The movement duration was 400 ms in all simulations. The simulated velocity profiles for the MV model are shown in Figure 4A. The rightmost curve corresponds to the largest k value. It is clear from the figure that as k increases, the profile becomes more oscillatory, and the peak shifts toward the end of the movement. With the largest k we used, the predicted velocity is initially negative, indicating that the movement should first be in the opposite direction of the target. In contrast, the velocity profiles for the MTC model shown in Figure 4B are always symmetrical. Again, these simulations are consistent with our analytical considerations. Note that the MTC model solution, equation 2.13, can also be oscillatory when the α's are complex. However, we found through simulations that k has to be larger than about 1000 N/m for the oscillation to be observed. In addition, the MTC solution is always symmetrical, with or without oscillation.
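Whether the plant is in the oscillatory regime can be read directly off the eigenvalues of its state matrix. A small sketch (the parameter values below are hypothetical, not those used in the article):

```python
import numpy as np

# State-space form of the second-order plant I*th'' + b*th' + k*th = tau,
# with state x = (theta, theta'). The values of I, b, k are illustrative only.
def plant_eigenvalues(I, b, k):
    """Eigenvalues of A; complex eigenvalues mean the plant oscillates."""
    A = np.array([[0.0, 1.0],
                  [-k / I, -b / I]])
    return np.linalg.eigvals(A)

# As the elastic coefficient k grows relative to the viscosity b, the
# eigenvalues become complex; their imaginary part is the frequency nu.
for k in (0.0, 200.0, 400.0):
    ev = plant_eigenvalues(I=0.25, b=0.2, k=k)
    print(k, ev, bool(np.iscomplex(ev).any()))
```

With a strong external elastic force the imaginary part grows, which is the regime in which the MV profiles above become oscillatory.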
Minimum Variance and Minimum Torque-Change Models
Figure 4: Velocity profiles under different degrees of plant oscillation. (A) Prediction of the MV model when the external elasticity k takes values of 200, 300, and 400 N/m, respectively. (B) Prediction of the MTC model under the same condition.
4 Discussion

The main goal of this study is to compare two well-known models for motor planning: the minimum variance model and the minimum torque-change model. We have focused on the different velocity profiles predicted by the two models and found, through analyses and simulations, that with a second-order linear plant, the MTC model always predicts a perfectly symmetrical velocity profile, while the MV model predicts skewed velocity profiles under many conditions. Our results suggest the following specific experiments for testing the validity of the two models. Subjects should be instructed to make single-joint movements of the forearm in a horizontal plane; this is the condition under which the second-order linear plant used in our studies is valid. The movement velocity profiles should be measured under each of the following manipulations:

1. Different levels of viscous force are applied through a manipulandum.

2. Different movement durations are imposed by asking the subjects to hit the target position with various degrees of precision.

3. Different elastic forces are applied through either a manipulandum or the attachment of springs.

If the observed velocity profiles under these manipulations are all symmetrical, the MTC model is supported. But if the velocity profiles become more skewed toward the movement onset with increasing viscosity in condition 1, the skewing is more pronounced with movement duration in condition 2, and the velocity profile becomes more skewed toward the end (and more
oscillatory) with elastic force in condition 3, the MV model is favored. All other possible outcomes would suggest that neither model is correct. In condition 1, the movement duration may increase with viscosity. This confound should not matter for testing purposes, because a concurrent increase of duration with viscosity would only make the skewing predicted by the MV model stronger. When the applied elastic force is very strong in condition 3, the MV model makes the counterintuitive prediction that the movement should first be in the opposite direction of the target (see Figure 4A). In this study, we considered the second-order linear plant that is often used in motor control modeling and can be realized with single-joint arm movements. This simplification allowed us to gain analytical insights into the two models. The simplification is justified because we are mainly interested in searching for specific conditions where the two models can be clearly differentiated. It may be desirable to extend our studies to nonlinear plants, such as the two-link arm model, because many extant experimental studies involve multijoint movements. However, nonlinear systems are not only difficult to analyze but also difficult to simulate, due to the local-minimum problem in the optimization process. Our simulations (not shown) indicate that the problem becomes worse when large external forces are applied. Predictions from the two models based on nonlinear simulations may thus be less reliable for distinguishing the models than the results in this article. In addition, the MTC and MV models' predictions have a complex dependence on the system parameters in general. This makes experimental tests of the models difficult, owing to the uncertainties in the estimation of the arm's intrinsic parameters. In contrast, we show here that for the special case of single-joint movements, the symmetry of the velocity profiles predicted by the MTC model holds for any parameter combination.
Likewise, the trend of velocity peak shift predicted by the MV model when duration and external forces are manipulated is valid regardless of the intrinsic parameter values. The simple case we considered in this article may thus provide a more conclusive test of the models. It is known that large-amplitude saccadic eye movements show skewed velocity profiles (Collewijn, Erkelens, & Steinman, 1988; Harris & Wolpert, 1998). To the extent that the eye can be approximated by a linear plant, this observation argues against the MTC model for eye movements. Indeed, the MTC model was proposed only for arm movement, not eye movement (Uno, Kawato, et al., 1989), while the MV model was proposed for both (Harris & Wolpert, 1998). Skewed velocity profiles in arm movements have also been reported (Nagasaki, 1989; Milner & Ijaz, 1990). However, these studies usually employed multijoint arm movements, a condition under which the MTC model can also predict skewed velocity profiles (Uno, Kawato, et al., 1989). We therefore conclude that the extant data cannot distinguish the two models, and new experiments as outlined above should be performed. Our analyses can be readily generalized to higher-order linear systems, and we expect that our conclusions will still be valid. It would also be interesting to examine the implications of the other models developed by Kawato et al. that are closely related to the MTC model considered here. They include the minimum muscle-tension-change model (Uno, Suzuki, & Kawato, 1989) and the minimum motor-command-change model (Kawato, 1992). Since all of these models require smoothness in dynamic variables, and the smoothness criterion is independent of the system stability or duration, we speculate that they may all predict nearly symmetrical velocity profiles under a linear plant.

Both the MTC and MV models are purely feedforward and are perhaps most appropriate for understanding rapid, well-learned movements. However, sensory feedback, when present, has been shown to have profound effects on trajectory formation (Keele & Posner, 1968; Carlton, 1981; Sheth & Shimojo, 2002). Todorov and Jordan recently proposed an extended linear-quadratic-gaussian (LQG) model that includes sensory feedback during movement execution (Todorov & Jordan, 2002; Todorov, 2004). We have performed some simulations with their model and the linear plant used in this article and found that sensory feedback can strongly influence the shape of the predicted velocity profile. For example, for single-joint movements without external forces, the feedback can cause the velocity profile to change from skewing toward the end to skewing toward the beginning of a movement. Since the extended LQG model contains many parameters not present in the MTC or MV model, and some of those parameters (such as the relative weighting between the error and the control cost) can also influence the shape of the velocity profile, we will present a full account of the extended LQG model predictions in a future publication.

Appendix A: A Solution of the MTC Model by the Lagrange Multiplier Method

We present a solution of the MTC model based on the Lagrange multiplier method, following the original work of Uno, Kawato, et al. (1989).
We used this method for our numerical simulations of the MTC model in section 3, because the Euler-Lagrange method can be numerically less stable under some parameters. The problem is to minimize the torque change, equation 2.8, under the constraint that the dynamic equation, equation 2.3, should be satisfied. The augmented cost function after introducing a Lagrange multiplier vector p_x is thus

C_{MTC} = \int_0^{t_f} dt \left[ \frac{1}{2}\dot{\tau}^2 - p_x^T(\dot{x} - Ax - B\tau) \right].  (A.1)
The variation procedure would give rise to a second-order differential equation. Instead, by introducing an auxiliary variable z(t) as the temporal
derivative of the torque τ(t), we obtain a new cost function,

\tilde{C}_{MTC} = \int_0^{t_f} dt \left[ \frac{1}{2}z^2 - p_x^T(\dot{x} - Ax - B\tau) - p_\tau(\dot{\tau} - z) \right],  (A.2)
that will give rise to a set of first-order differential equations. Here, p_τ is another Lagrange multiplier to enforce the constraint that z should be the derivative of τ. Applying the variation principle with respect to x, z, τ, p_x, and p_τ, we have the following set of equations:

\dot{x} = Ax + B\tau, \quad z = -p_\tau, \quad \dot{\tau} = -p_\tau, \quad \dot{p}_x = -A^T p_x, \quad \dot{p}_\tau = -B^T p_x.  (A.3)
The boundary conditions of the state variable at the initial and final times are

\theta(0) = \theta_0, \quad \dot{\theta}(0) = 0, \quad \ddot{\theta}(0) = 0,
\theta(t_f) = \theta_f, \quad \dot{\theta}(t_f) = 0, \quad \ddot{\theta}(t_f) = 0.  (A.4)
The initial conditions of the Lagrange multipliers, p_x(0) and p_τ(0), are determined so that the boundary conditions A.4 are satisfied. Finally, by integrating equations A.3, the muscle torque is obtained as

\tau(t) = a_0\theta_0 - p_\tau(0)\,t + B^T (A^T)^{-1}\left[ t\,1_2 + (A^T)^{-1}\left(e^{-A^T t} - 1_2\right) \right] p_x(0).  (A.5)

Here 1_2 is the 2 × 2 unit matrix. The optimal trajectory is obtained by solving the dynamical equation with the muscle torque, equation A.5.

Acknowledgments

We thank Emanuel Todorov and Daniel Wolpert for answering our inquiries on their models and providing us their code. We are also grateful to John Krakauer, Pietro Mazzoni, Claude Ghez, and two anonymous reviewers for their helpful comments and suggestions. This work is supported by NIH grant MH54125.

References

Abend W., Bizzi E., & Morasso P. (1982). Human arm trajectory formation. Brain, 105, 331–348.
Bernstein N. (1967). The coordination and regulation of movements. London: Pergamon.
Carlton L. G. (1981). Processing visual feedback information for movement control. Journal of Experimental Psychology: Human Perception and Performance, 7, 1019–1030.
Collewijn H., Erkelens C. J., & Steinman R. M. (1988). Binocular coordination of human horizontal saccadic eye movements. Journal of Physiology, 404, 157–182.
Fitts P. M. (1954). Information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47, 381–391.
Flash T., & Hogan N. (1985). The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience, 5, 1688–1703.
Harris C. M., & Wolpert D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–784.
Hoff B., & Arbib M. A. (1993). Models of trajectory formation and temporal interaction of reach and grasp. Journal of Motor Behavior, 25, 175–192.
Hogan N. (1984). An organizing principle for a class of voluntary movements. Journal of Neuroscience, 4, 2745–2754.
Hollerbach J. M. (1982). Computers, brains and the control of movement. Trends in Neurosciences, 5, 189–192.
Kawato M. (1992). Optimization and learning in neural networks for formation and control of coordinated movement. In D. Meyer & S. Kornblum (Eds.), Attention and performance XIV (pp. 821–849). Cambridge, MA: MIT Press.
Kawato M., Furukawa K., & Suzuki R. (1987). A hierarchical neural-network model for control and learning of voluntary movement. Biological Cybernetics, 57, 169–185.
Kawato M., Maeda Y., Uno Y., & Suzuki R. (1990). Trajectory formation of arm movement by cascade neural network model based on minimum torque-change criterion. Biological Cybernetics, 62, 275–288.
Keele S. W., & Posner M. I. (1968). Processing visual feedback in rapid movements. Journal of Experimental Psychology, 77, 155–158.
Kelso J. A. S., Southard D. L., & Goodman D. (1979). On the nature of human interlimb coordination. Science, 203, 1029–1031.
Luh J. Y. S., Walker M. W., & Paul R. P. C. (1980). On-line computational scheme for mechanical manipulators. Journal of Dynamic Systems, Measurement, and Control, 102, 69–76.
Massone L., & Bizzi E. (1989). A neural network model for limb trajectory formation. Biological Cybernetics, 61, 417–425.
Milner T. E., & Ijaz M. M. (1990). The effect of accuracy constraints on 3-dimensional movement kinematics. Neuroscience, 35, 365–374.
Morasso P. (1981). Spatial control of arm movements. Experimental Brain Research, 42, 223–227.
Nagasaki H. (1989). Asymmetric velocity and acceleration profiles of human arm movements. Experimental Brain Research, 74, 319–326.
Robinson D. A., Gordon J. L., & Gordon S. E. (1986). A model of the smooth pursuit eye movement system. Biological Cybernetics, 55, 43–47.
Shadmehr R., & Mussa-Ivaldi F. (1994). Adaptive representation of dynamics during learning of a motor task. Journal of Neuroscience, 14, 3208–3224.
Sheth B. R., & Shimojo S. (2002). How the lack of visuomotor feedback affects even the early stages of goal-directed pointing movements. Experimental Brain Research, 143, 181–190.
Tanaka H., & Qian N. (2003). Different predictions by the minimum variance and minimum torque-change models on the skewness of movement velocity profiles. Program no. 492.7. 2003 Abstract Viewer/Itinerary Planner. Washington, DC: Society for Neuroscience. Available online at: http://sfn.scholarone.com/itin2003/index.html.
Todorov E. (2004). Stochastic optimal control and estimation methods adapted to the noise characteristics of the sensorimotor system. Manuscript submitted for publication.
Todorov E., & Jordan M. I. (2002). Optimal feedback control as a theory of motor coordination. Nature Neuroscience, 5, 1226–1235.
Uno Y., Kawato M., & Suzuki R. (1989). Formation and control of optimal trajectory in human multijoint arm movement. Biological Cybernetics, 61, 89–101.
Uno Y., Suzuki R., & Kawato M. (1989). Minimum muscle-tension-change model which reproduces human arm movement. In Proceedings of the 4th Symposium on Biological and Physiological Engineering (pp. 299–302). Fukuoka, Japan.
Wada Y., Kaneko Y., Nakano E., Osu R., & Kawato M. (2001). Quantitative examinations for multijoint arm trajectory planning. Neural Networks, 14, 381–393.
Winters J. M., & Stark L. (1985). Analysis of fundamental human movement patterns through the use of in-depth antagonistic muscle models. IEEE Transactions on Biomedical Engineering, 32, 826–839.

Received September 5, 2003; accepted April 2, 2004.
LETTER
Communicated by Eero Simoncelli
Disambiguating Visual Motion Through Contextual Feedback Modulation

Pierre Bayerl
[email protected]

Heiko Neumann
[email protected]

Department of Neural Information Processing, University of Ulm, D-89069 Ulm, Germany
Motion of an extended boundary can be measured locally by neurons only orthogonal to its orientation (aperture problem), while this ambiguity is resolved for localized image features, such as corners or nonocclusion junctions. The integration of local motion signals sampled along the outline of a moving form reveals the object velocity. We propose a new model of V1-MT feedforward and feedback processing in which localized V1 motion signals are integrated along the feedforward path by model MT cells. Top-down feedback from MT cells in turn emphasizes model V1 motion activities of matching velocity by excitatory modulation and thus realizes an attentional gating mechanism. The model dynamics implement a guided filling-in process to disambiguate motion signals through biased on-center, off-surround competition. Our model makes predictions concerning the time course of cells in area MT and V1 and the disambiguation process of activity patterns in these areas and serves as a means to link physiological mechanisms with perceptual behavior. We further demonstrate that our model also successfully processes natural image sequences.

1 Introduction

The analysis and interpretation of moving shapes and objects based on motion estimations is a major task in everyday vision. However, motion can locally be measured only orthogonal to an extended contrast (aperture problem), while this ambiguity can be resolved at localized image features, such as corners or junctions from nonoccluding geometrical configurations. Models previously suggested to solve the aperture problem integrate localized motion signals to generate coherent form motion. For example, the vector sum approach averages movement vectors measured for a coherent shape (Wilson, Ferrera, & Yo, 1992). Local motion signals of an object define possible solutions of the motion constraint equation (Horn & Schunck, 1981).
Neural Computation 16, 2041–2066 (2004)  © 2004 Massachusetts Institute of Technology

If several distinct measures are combined, their associated constraint lines in the velocity space intersect and thus yield the velocity common to the individual measures (intersection of constraints, IOC) (Adelson & Movshon, 1982). Bayesian models combine different probabilities for velocities with a preference for, e.g., slower motions (Weiss & Fleet, 2001; Weiss, Simoncelli, & Adelson, 2002). Such preferences are encoded as statistical priors in the Bayesian estimation process. Like the IOC, Bayesian models mostly assume that motion estimates belonging to distinct objects have already been grouped together. Unambiguous motion signals can be measured at locations of significant 2D image structure, such as curvature maxima, corners, or junctions.¹ These sparse features can be tracked over several frames to yield robust movement estimates and predictions (feature tracking) (Del Viva & Morrone, 1998). Coherent motion is often computed by smoothing and interpolating sparse measures estimated at localized image contrasts. Thus, the inverse problem of motion estimation is regularized by minimizing a smoothing constraint over surface regions (Horn & Schunck, 1981; Nagel, 1987) or along boundaries (Hildreth, 1984).

In this letter, we investigate the mechanisms used by the primate visual system to process fields of visual motion induced by moving objects or self-motion. Motion information is primarily processed along the dorsal pathway in the visual system, but mutual interactions exist at different stages between the dorsal and ventral paths (Van Essen & Gallant, 1994). Our modeling of mechanisms of cortical motion processing focuses on the integration and segregation of visual motion in the reciprocally connected areas V1 and MT. Our model dynamics are defined by spatially local isotropic interactions of velocity-tuned cells.
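The IOC rule described above amounts to solving a small linear system: each normal-flow measurement constrains the projection of the unknown velocity onto the contrast normal. A minimal sketch (function name and example values are illustrative):

```python
import numpy as np

# Intersection of constraints (IOC): each local measurement gives the speed
# s_i of the pattern along the contrast normal n_i, so the true velocity u
# satisfies n_i . u = s_i for all i. Two independent normals determine u;
# least squares handles more than two (noisy) measurements.
def ioc(normals, speeds):
    n = np.asarray(normals, dtype=float)
    s = np.asarray(speeds, dtype=float)
    u, *_ = np.linalg.lstsq(n, s, rcond=None)
    return u

# A pattern moving with u = (1, 2): normals along the two axes measure the
# projections 1 and 2, respectively, and IOC recovers the full velocity.
print(ioc([(1, 0), (0, 1)], [1.0, 2.0]))  # -> [1. 2.]
```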
Additional signals indicating luminance boundaries or explicitly detected motion discontinuities could easily be used to improve our results, but would make it impossible to analyze the basic functionality of our system, which we claim to represent a fundamental mechanism of motion processing in the visual system. Despite its simplicity, the model is able to explain experimental data and, without parameter changes, to successfully process real-world data used for model benchmarking (Barron, Fleet, & Beauchemin, 1994). The new contribution of this work is a unified mechanism of motion disambiguation that deals with localized features as well as elongated boundaries. It is demonstrated how recurrent feedback processing between two model areas with similar dynamics operating on two different scales stabilizes local feature measurements. In addition, we show how such features trigger a filling-in process along boundaries to resolve the aperture problem and therefore arrive at globally consistent motion estimates through local lateral interactions. Such local lateral processes are biased and controlled by feedforward-feedback interactions that can be viewed as scale-invariant mechanisms of hypothesis testing and correction.

¹ In case of spatial object occlusions, junctions may signal incorrect velocities, since their boundaries belong to different object shapes (McDermott, Weiss, & Adelson, 2001). We do not particularly consider the corresponding segmentation problem, but briefly discuss consequences in section 4.

The letter is structured as follows. In section 2, we describe the model and the model dynamics. In section 3, we present computational simulations, followed by a discussion of the model and the results in section 4.

2 The Model

Motion analysis in visual cortex starts with primary visual area V1 and is subsequently followed by parietal areas such as MT/MST and beyond. These areas communicate with a bidirectional flow of information via feedforward and feedback connections. The mechanisms of this feedforward and feedback processing between model areas V1 and MT can be described by a unified architecture of lateral inhibition and modulatory feedback whose elements have been developed in the context of shape processing and boundary completion (Neumann & Sepp, 1999). In this section, we present the model dynamics within and between the model cortical areas involved in realizing the integration and segregation of inherently ambiguous input patterns. In a nutshell, the model consists of two areas with similar architecture that implement the following mechanisms:

1. Feedback modulation. Cells in area V1 are modulated by cell activations from model area MT. Cells in MT can, in principle, also be modulated by higher areas such as MST or by attention. Since we focus here on the two stages of V1-MT interaction, the feedback to model area MT is set to zero.

2. Feedforward integration.

3. Lateral shunting inhibition enhancing unambiguous signals (see Figure 1).

2.1 Input Stage of Model V1 (Initial Motion Detection). The input stage in V1 consists of a set of modified elaborated Reichardt detectors (ERDs) (Adelson & Bergen, 1985) that measure local motion for a specific range of velocities at each location (population code) (Pouget, Zemel, & Dayan, 2000). The functionality of the correlation-based detector is described in appendix A.
The activities of this input stage, c^(3)(x, Δx), for different velocities (encoded by Δx) at different locations x (see Figure 1 and appendix A), indicate unambiguous motion at corners and line endings, ambiguous motion along contrasts, and no motion for homogeneous regions. Physiological findings suggest that motion-sensitive cells in V1 respond to orientation as well (component motion) and that only MT cells become (more) invariant to orientation, indicating "pattern motion" (Movshon, Adelson, Gizzi, & Newsome, 1985; Movshon & Newsome, 1996). We focus our investigation on the
Figure 1: Overview of model areas V1 and MT. The architecture defines the dynamics of interaction between both areas, consisting of modulatory feedback (v(1)), feedforward integration (v(2)), and lateral shunting inhibition (v(3)). Differences in the parameterization concern receptive field sizes and the wirings of feedforward input and feedback (see Table 1). Σ1 denotes an integration with isotropic gaussian kernels in space and in the velocity domain (see Table 1). Σ2 denotes the sum over all velocities at each individual location. c(3) is the input stage of model V1 and is realized as a correlation detector based on complex cell responses (see appendix A). Note that the input stage of model MT is just a copy of v(3) from V1.
role of feedback in motion disambiguation and therefore employ motion-sensitive cells in model V1 that are not orientation selective (see the discussion in section 4.1). The input stage is divided into two steps. The first concerns cells selective to static oriented contrasts at a fixed spatial frequency and independent of contrast polarity. Consistent with Movshon and Newsome (1996), these cells also respond to very weak contrasts, which is realized through shunting normalization (see appendix A). The second concerns direction-selective cells, pooling over all orientation-selective cells at different time steps, which yields a representation of visual motion independent of contrast orientation. This simplification does not contradict basic properties of motion-selective cells in V1, since immediate response properties of model cells concerning visual motion basically indicate component motion except at very few locations in the image with intrinsic two-dimensional structures such as corners or line endings (see sections 3 and 4.1). We claim that the important difference between V1 and MT is the spatial size of receptive fields and that the proposed mechanisms within and between the areas reveal properties consistent with physiological and psychophysical observations and yield highly accurate visual motion estimations (see section 3). In the following, we describe how different parts of the model contribute to the disambiguation of ambiguous motion signals in a biased competition process, while motion discontinuities remain preserved.
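The correlation detector itself is specified in appendix A (not reproduced here). As a rough, hypothetical stand-in, the idea of scoring candidate shifts by locally pooled frame-to-frame correlation of contrast energy can be sketched as follows (all names and parameters are illustrative, not the article's):

```python
import numpy as np
from scipy.ndimage import uniform_filter

# Hypothetical stand-in for the input stage: score each candidate shift
# (dy, dx) by the locally pooled product of contrast energy in two
# consecutive frames. This is NOT the appendix A detector; it only
# illustrates how a velocity population code c(x, dx) can arise.
def correlation_population(f1, f2, shifts, window=2):
    e1 = np.abs(f1 - f1.mean())          # crude contrast measure
    e2 = np.abs(f2 - f2.mean())
    c = np.zeros(f1.shape + (len(shifts),))
    for k, (dy, dx) in enumerate(shifts):
        # correlate frame 1 with frame 2 shifted back by the candidate motion
        prod = e1 * np.roll(e2, (-dy, -dx), axis=(0, 1))
        c[:, :, k] = uniform_filter(prod, size=2 * window + 1)  # local pooling
    return c

# A vertical bar stepping one pixel to the right should vote for shift (0, 1).
f1 = np.zeros((16, 16)); f1[4:12, 5] = 1.0
f2 = np.zeros((16, 16)); f2[4:12, 6] = 1.0
shifts = [(0, -1), (0, 0), (0, 1)]
c = correlation_population(f1, f2, shifts)
print(shifts[int(np.argmax(c.sum(axis=(0, 1))))])  # -> (0, 1)
```

Along the length of the bar the response is ambiguous (the aperture problem); the unambiguous votes arise only near the bar's endings, which is the property the model dynamics exploit.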
Table 1: Parameterization of Equations 2.1–2.3 for Model Areas V1 and MT.

Model Area   netIN      netFB     C     Kernel Size σ1 (spatial)   Kernel Size σ2 (velocity)
V1           c(3)       v(3) MT   100   0 (dirac)                  0.75
MT           v(3) V1    0         0     7                          0.75

Notes: Spatial RF ratio V1:MT = 1:5. The RF size of cells in model V1 is defined by the RF sizes of the correlation detector, √(2σ²) = 1.41 (σ = 1), and the RF size of MT cells is influenced by the RF sizes in V1: √(1.41² + 7²) = 7.14.
2.2 Motion Processing in Model Areas V1 and MT (Motion Integration). Two areas, model V1 and model MT, with similar dynamics subsequently process the initial motion signal in a recurrent loop (an outline is given in Figure 1). Here we outline the details of the architecture of both model areas, which consists of three steps: feedback modulation, feedforward integration, and lateral inhibition (see Table 1 for the parameterizations of the processing stages in model V1 and MT). The computational logic of recurrent feedback processing between two areas is that the higher area integrates localized measures from the lower area over a larger spatial range, or neighborhood, and thus evaluates these signals in a much larger context (Hupé et al., 2001; Friston & Büchel, 2000). Consider the higher area, MT, as one in which cells evaluate their input via their feedforward input connection strength. Then the resulting activity can be interpreted as a signature of the degree of match between the expected input (encoded by the input weights of receptive field [RF] kernels) and the current input signal. Feedback in turn functions as a predictor that enhances those signals in the lower area that are compatible with respect to feature specificity by way of top-down modulation (Grossberg, 1980; Ullman, 1995; Mumford, 1994). Such excitatory modulatory feedback interaction has the effect of giving activations in V1 that match the "expectations" of MT a competitive advantage in the subsequent mutual inhibition. Modulatory feedback has the further advantage that only compatible patterns get emphasized in the lower area and no activity is produced where no signal is provided by the input. Thus, our model adopts the "no strong loop" hypothesis developed by Crick and Koch (1998).
Also, modulatory feedback can be compared with the intersection-of-constraints (IOC) analysis of motion integration, since the input signal constrains the feedback signal and the intersection of both gets emphasized.² Before integrating the modified input signal, it is squared to sharpen the distribution. Then the signal is processed by cells with isotropic spatial and isotropic directional gaussian RFs (see equation 2.2). This feedforward integration can be compared with a vector average if the population code is interpreted as the sum of its components. Cell activities are subsequently normalized by lateral shunting inhibition (see equation 2.3). The sum of activations of cells sensitive to any velocity at a specific location normalizes the total energy (Simoncelli & Heeger, 1998). Therefore, unambiguous signals (indicating the presence of only one or a few velocities) get emphasized, while ambiguous signals (indicating many possible velocities) lead to flat population responses. This step is similar to a feature tracking mechanism, whose behavior in our model emerges from the dynamic behavior of the system without the employment of any explicit feature detection mechanism.

² In our model, we particularly focus on the feedforward-feedback interaction between areas V1 and MT. This does not deny, however, the existence of (modulatory) feedback connections from, for example, area MST as well. Such an extension can be incorporated by adopting the model mechanisms of MT-V1 feedback modulation also at the stage of model MT to include MST-MT feedback.

Each model area consists of three stages of processing whose dynamics are defined by the following equations (compare Figure 1; see appendix B for the steady-state solutions used for the simulations):

\partial_t v^{(1)} = -v^{(1)} + net^{IN} \cdot (1 + C \cdot net^{FB}),  (2.1)

\partial_t v^{(2)} = -v^{(2)} + (v^{(1)})^2 * G_{\sigma_1}^{(x,\,space)} * G_{\sigma_2}^{(\Delta x,\,velocity)},  (2.2)

\partial_t v^{(3)} = -0.01 \cdot v^{(3)} + v^{(2)} - \left( \frac{1}{2n} + v^{(3)} \right) \cdot \sum_{\Delta x} v^{(2)},  (2.3)
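A minimal numpy sketch of one steady-state pass through equations 2.1–2.3, using the Table 1 values for model V1 and, for brevity, a 1D velocity axis in place of the 2D velocity space (array sizes are hypothetical):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_filter1d

# One steady-state pass through eqs. 2.1-2.3 for a population v[x, y, k]
# over n velocity channels. V1 parameters from Table 1: C = 100,
# sigma1 = 0 (dirac), sigma2 = 0.75. Treating velocity as a 1D axis is a
# simplification of the model's 2D velocity space.
def area_update(net_in, net_fb, C, sigma1, sigma2):
    v1 = net_in * (1.0 + C * net_fb)                         # eq. 2.1: modulatory feedback
    v2 = np.square(v1)                                       # squaring sharpens the code
    if sigma1 > 0:
        v2 = gaussian_filter(v2, sigma=(sigma1, sigma1, 0))  # spatial integration
    v2 = gaussian_filter1d(v2, sigma=sigma2, axis=2)         # velocity-domain integration
    n = v2.shape[2]
    s = v2.sum(axis=2, keepdims=True)                        # total energy per location
    # Steady state of eq. 2.3: 0 = -0.01*v3 + v2 - (1/(2n) + v3)*s
    v3 = (v2 - s / (2.0 * n)) / (0.01 + s)
    return v3

rng = np.random.default_rng(0)
net_in = rng.random((8, 8, 15))
v3 = area_update(net_in, net_fb=np.zeros_like(net_in), C=100.0, sigma1=0.0, sigma2=0.75)
print(v3.shape)
```

The divisive term makes flat (ambiguous) distributions stay flat while peaked (unambiguous) distributions gain relative weight, which is the biased competition the text describes.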
where n denotes the number of cells tuned to different velocities at any specific location. netIN is the input of the model area (e.g., the output of the correlation detector for model V1), and netFB is the feedback signal (e.g., the output of model MT for model V1). * denotes the convolution operation, and Gσi are gaussian kernels in space (Gσ1) and in the velocity domain (Gσ2). Velocity is coded by a spatial shift Δx per frame. Compared to model V1, the enlarged RF sizes in model MT (RF ratio V1:MT = 1:5; see Table 1) lead to less ambiguous signals, since the influence of unambiguous motion features generated by, e.g., line endings is larger due to a larger aperture. We propose a solution to the aperture problem by feedback of activities from a higher stage, which provides a larger context of motion integration. Thus, less ambiguous MT responses from large spatial regions are combined with ambiguous but spatially localized V1 cell responses such that these signals in turn become less ambiguous. The temporal evolution of this disambiguation process in turn spreads, or fills in, unambiguous motion information along the moving shape outline like a traveling wave triggered by localized features. Again, we emphasize that this disambiguation is achieved by the network dynamics without the employment of specialized feature detectors and explicit mechanisms for decision making.
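As noted above, the feedforward integration can be compared with a vector average when the population code is interpreted as the sum of its components. Reading the code out this way is a one-line computation (the shift set and array sizes here are hypothetical):

```python
import numpy as np

# Vector-average readout of a velocity population code: each cell votes for
# its preferred shift (dy, dx), weighted by its rectified activity.
def vector_average(v, shifts):
    w = np.clip(v, 0.0, None)
    num = np.tensordot(w, np.asarray(shifts, dtype=float), axes=([2], [0]))
    return num / np.maximum(w.sum(axis=2)[..., None], 1e-12)

shifts = [(-1, 0), (0, 0), (1, 0)]
v = np.zeros((4, 4, 3)); v[..., 2] = 1.0   # all activity on shift (1, 0)
print(vector_average(v, shifts)[0, 0])     # -> [1. 0.]
```

For a flat (ambiguous) population the readout collapses toward zero, while the normalization of equation 2.3 sharpens the population so the readout approaches the true velocity.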
To handle continuous sequences of images (instead of iterating just over the correlation results of one image pair), the feedback signal has to "follow" the detected motion. This is realized by shifted synaptic feedback connections corresponding to the velocity (represented by Δx) of each cell's maximum sensitivity, and thus represents a kind of prediction mechanism.

3 Results

The model dynamics described in the previous section emerge from a layered architecture of cells, or groups of cells, in which each area is represented by three layers. Model cells are considered single-compartment entities whose gradual activation dynamics follow shunting, or membrane, properties. The network can be shown to obey bounded input-output stability. Since the equations equilibrate quickly, we simulate the dynamics using the steady-state equations in order to save computing time. Computational simulations demonstrate the ability of the presented neural architecture to cope with synthetic image sequences as well as with natural sequences. The same set of parameters (see Table 1) was used for all simulations. The number of iterations varied, depending on how many frames were available. The number of cells representing the velocity space was chosen to be 225 (Δx ∈ {−7, …, 7}², including Δx = 0). We investigated the model by probing it with artificially generated image sequences utilized in physiological and psychophysical experiments. Furthermore, we demonstrate that the same model using the same parameter settings is able to process realistic sequences. These test cases have also been used in benchmark tests for technical motion detection approaches.

3.1 Empirical Data. Simulations of model V1 and MT qualitatively confirm the results from neurophysiological recordings of the time course of MT cells (Pack & Born, 2001).
Model MT cell responses along extended contrasts initially signal normal flow (perpendicular to the contrast orientation and corresponding to component motion), as in the experimental observations. In the physiological experiment, the cells' selectivity gradually switches after approximately 150 ms to signal the true velocity. In our model, this change is facilitated by filling in correct motion interpretations derived from intrinsic feature signals at line endings or corners (see Figure 2). As a consequence, our model makes two suggestions: (1) the spatial distance of MT cells from unambiguous motion features directly influences the time until the signal is completely disambiguated, and (2) scale invariance is achieved through temporal integration of disambiguated motion estimates along boundaries. Note that the filling-in process preserves spatially localized responses due to modulatory feedback. Our model generates disambiguated motion cell responses in MT as well as in V1. This predicts that recordings of motion-sensitive cells in V1 should result in temporal patterns similar to those recorded in MT.
2048
P. Bayerl and H. Neumann
There is no clear evidence that V1 cells behave similarly to MT cells. This may be a consequence of the RF size of V1 cells, which makes it extremely difficult to obtain spatially accurate physiological measurements over time for moving objects. However, recent physiological studies show that subpopulations of V1 cells near line endings actually encode pattern motion and show dynamics similar to cells in MT (Pack, Livingstone, Duffy, & Born, 2003) (see section 4.1).
The following experiment shows how feedback processing causes the model to generate neural hysteresis and emphasizes the role of feedback for the disambiguation process. Figure 3 illustrates how neural hysteresis is generated by processing a random dot cinematogram gradually changing from rightward to leftward motion or vice versa. Such displays induce perceptual hysteresis, indicating interactions with a short-term memory (Williams & Phillips, 1987). Such an interaction may help to lock in and keep the initially detected velocity (e.g., leftward motion) until the stimulus has changed enough to switch to another alternative (e.g., rightward motion). The feedback signal in our model can be interpreted as short-term memory since it contains information from preceding time steps. Model simulations show that hysteresis is generated as a consequence of feedback processing. Without feedback, our model generates no hysteresis behavior, and the motion signal never gets completely disambiguated (a maximum of approximately 80% correct flow information for purely left- or rightward motion; see Figure 3B). The latter effect occurs because the correlation detector used in the input stage cannot distinguish between the different stimulus dots, which leads to a high ambiguity in the correlation signal (as with most other motion detectors). If feedback is applied, motion gets completely disambiguated and 100% right- or leftward motion is indicated (even in the presence of dots with switched direction) until the signal represented by MT cells switches nearly immediately to 100% of the other direction. This behavior can be interpreted as neural decision making.

3.2 Real-World Data

3.2.1 Artificial Sequences. Figures 4 and 6 illustrate the ability of the model to segregate different regions of visual motion. The artificially generated sequence of a flight over Yosemite National Park simulates self-motion over a static environment, which leads to an expanding flow pattern, with the exception of clouds moving horizontally to the right. The resulting flow estimation demonstrates that the model is able to extract motion induced by gray-level sequences of different frequencies, although the correlation detector is based on only one small scale (the clouds are represented on a much coarser scale than the ground).

Figure 2: Facing page. Example of processing an artificial test sequence (100 × 100 pixel). (A) Temporal development (t = 1, . . . , 9) of the direction indicated by local model MT cell populations at three different image locations (indicated by simulated "electrodes"). The direction of motion measured at the corner (electrode 1, filled circle) points toward the true direction of motion (45 degrees) from the beginning, while on the edge (electrodes 2 and 3, open circles and triangles) flow initially measured orthogonal to the orientation of local contrast (90 degrees) gradually changes to the correct direction. Cell populations spatially more distant from locations with initially unambiguous estimations take longer to get disambiguated. (B) Real data showing the temporal course of direction indicated by a cell population in macaque area MT located on a moving bar pattern, redrawn and adapted from Pack and Born (2001): (1) 60 ms after stimulus onset, motion is orthogonal to the orientation of the bar. (2) 150 ms after stimulus onset, correct motion is indicated. The temporal activity pattern of model cell populations (A) qualitatively matches the temporal pattern of activity in macaque MT (B) by rescaling the temporal axis by an appropriate factor. (C) Temporal development of V1 cell population responses. Populations of motion-sensitive cells representing the local velocity space are illustrated at given locations indicated by small circles (dark = low activity, light = high activity). t = 1: Initial flow estimations (V1) at different (subsampled) cell locations and activity patterns of selected cell populations (large arrows: true direction of motion). t = 2, . . . , 6: Demonstration of the disambiguation process.
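The quantity plotted in Figure 3, the proportion of MT activity signaling rightward motion, can be computed as in this sketch. The variable names and the partition of velocity channels into leftward and rightward are illustrative assumptions, not part of the published model.

```python
import numpy as np

def proportion_rightward(act, vx):
    """Fraction of population activity indicating rightward motion:
    sum(act_right) / (sum(act_left) + sum(act_right))."""
    vx = np.asarray(vx)
    right = act[..., vx > 0].sum()
    left = act[..., vx < 0].sum()
    return right / (left + right)

# Toy MT population: 4 velocity channels with preferred horizontal shifts.
vx = [-3, -1, 1, 3]
act = np.zeros((40, 40, 4))
act[..., 3] = 0.8       # channel preferring vx = +3: dominant response
act[..., 0] = 0.2       # channel preferring vx = -3: residual ambiguity
assert np.isclose(proportion_rightward(act, vx), 0.8)
```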
Figure 4 shows the quantitative analysis of the processing results for the Yosemite sequence. We obtain a flow field of 100% density with an average angular error of 6.20 degrees after 10 iterations. This result compares well with technical solutions, which produce less accurate estimations in most cases (Barron et al., 1994; Nicolescu & Medioni, 2003). The median error of 2.95 degrees after 10 iterations illustrates the robust model performance when the outliers near region boundaries, which occur exclusively at the horizon, are excluded. Note that Barron et al. (1994) did not use iterations and multiresolution processing to refine the results of the computational models they compared, even in those approaches where iterations and multiresolution processing would be optional. Iterations were used, for example, to propagate smoothness constraints (Horn & Schunck, 1981), which can be compared to our relaxation process. Again, we point out that our approach relies on a single-scale estimation of visual motion, which is subsequently integrated and in turn updated by context information.

Figure 3: Facing page. Perceptual hysteresis effect and model feedback for motion disambiguation. The proportion of MT cell activities indicating rightward motion (∑activities_right / ∑activities_left+right) is plotted for each frame, processing two random dot cinematograms (each sequence shows 60 moving dots and consists of 60 frames, 40 × 40 pixel). In the first sequence (sequence a, the solid line), the dots are initialized with a random position and a velocity of three pixels per frame to the right. In each frame, one right-moving dot changes its direction by moving to the left. In the second sequence (sequence b, the dashed line), the dots have an initial direction to the left, switching one after the other to the right. (A) Feedback processing disambiguates the signal and generates a directional hysteresis effect that indicates the inertia generated by locking in the prediction from top-down feedback of a motion direction measured over time. Both sequences show an initial ambiguity in estimated motion (relative activity is 80% and 20% for correct and incorrect motion, respectively), because the correlation detector confounds corresponding dots within a certain neighborhood. Therefore, wrong velocity cues are detected in addition to the correct ones. After a few iterations of feedback processing, the MT cells are disambiguated and indicate a perfectly coherent motion signal (100% or 0% rightward). The response for sequence a (solid line) switches from rightward to leftward motion between 60% and 75% of changed dot directions, whereas for sequence b (dashed line), it switches from leftward to rightward motion between 40% and 25% of changed dot directions (hysteresis). (B) Without feedback (C = 0; see Table 1), no hysteresis is generated. The sum of cell activities indicating rightward motion is proportional to the relative number of dots moving to the right. The initial ambiguity of 80% versus 20% correct and incorrect motion responses is not resolved.

3.2.2 Real Camera Images. Natural image sequences recorded by cameras are noisy, and there is a high probability of encountering complex
imaging conditions like occlusions and nonrigid motion. The proposed architecture successfully deals with a large set of natural sequences, including traffic sequences and animals in motion. Figure 5 illustrates the results for two examples: a traffic sequence with a moving taxi and a moving zebra. Feedback processing combines temporal information in order to segregate motion and eliminate outliers and noise. The spreading of unambiguous flow is especially visible where the aperture problem arises (along the stripes of the zebra) and where occlusion falsifies initial motion estimates (the borders of the cars). Object segregation capabilities of the model are sketched in Figure 6. Competitive interaction (shunting inhibition) in velocity space (including zero velocity, ∆x = 0) segregates regions of different motion from each other as well as from regions with stationary contrast configurations.

4 Discussion

In this letter, we presented a new computational framework of recurrent motion processing to integrate and segregate visual motion signals. In the following, we discuss the model's biological plausibility, compare it with existing models of motion processing, and summarize the major contributions.

4.1 Relevance and Biological Plausibility. There is both structural and functional evidence for the mechanisms and layered organization of our model. Anatomical and physiological studies suggest inter- and intra-areal connections like those used in our model (Van Essen & Gallant, 1994; Maunsell, 1995). Motion-sensitive cells can be found in MT as well as in V1 (Maunsell & Van Essen, 1983). Physiological studies (Movshon et al., 1985) have shown that cells in V1 are sensitive to component motion (motion along oriented contrasts), while cells in MT are less sensitive to oriented components but signal pattern motion. Since our focus is on the investigation of feedback in the processing of motion stimuli, we kept the input stage of the model as simple as possible.
As a consequence, we assumed motion-sensitive cells in V1 to be independent of contrast orientation. We claim that this does not constrain the capability of our model to resolve ambiguities among motion-sensitive V1 cells if they were sensitive to orientation as well, because motion signals indicated by such cells inherently contain ambiguities by the definition of component motion. However, such an approach would obscure the disambiguation process described by our model, which is a consequence of lateral and feedback interactions between the two model areas, which differ in the spatial scale of integration. We claim that model V1 could be adapted by incorporating further details to match physiological properties of V1, such as spatial frequency tuning, contrast polarity, and orientation tuning (Hubel & Wiesel, 1968). There is recent physiological evidence that V1 cells can partially encode pattern motion (that is, motion independent of orientation). Pack et al.
Figure 4: (A) Detected motion using an artificially generated sequence of four frames simulating a flight over Yosemite National Park (316 × 252 pixel resolution); shown: motion indicated by V1 cell populations. Expanding flow is caused by self-motion through a static scene (the ground) and rightward translational flow by moving clouds (upper region of the horizon). Motion is detected even though the underlying gray-level structure has strongly varying spatial frequency content, or scale. Both regions of motion (ground and clouds) are clearly segregated from each other after a few iterations of V1-MT feedback processing. (B, C) Angular error of the optical flow direction indicated by MT cell populations processing the first two frames of the Yosemite sequence. Since there is no ground truth available for the movement of the clouds, we assumed a horizontal motion to the right (according to Barron et al., 1994). (B) Spatial configuration of the angular error at different time steps (dark = large errors, light = small errors). Error peaks occur at flow discontinuities (clouds and ground). Note that the regions containing these outliers are much smaller than the RF size of MT cells (indicated at the top left of each image). (C) Temporal development of the average and the median angular error (of the entire flow field). The average error almost converges after five iterations and slightly improves to 6.20 degrees after 10 iterations. The median error of 2.95 degrees (after 10 iterations) illustrates the accuracy of the model excluding outliers, such as errors of approximately 180 degrees, which occur at region boundaries along the horizon.
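The angular errors reported above follow the standard measure of Barron et al. (1994), which compares unit space-time vectors (u, v, 1) of estimated and ground-truth flow. A sketch, with illustrative toy flow fields:

```python
import numpy as np

def angular_error(u_est, v_est, u_gt, v_gt):
    """Angular error (degrees) between estimated and ground-truth flow,
    using the space-time unit vectors (u, v, 1) of Barron et al. (1994)."""
    est = np.stack([u_est, v_est, np.ones_like(u_est)], axis=-1)
    gt = np.stack([u_gt, v_gt, np.ones_like(u_gt)], axis=-1)
    est /= np.linalg.norm(est, axis=-1, keepdims=True)
    gt /= np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip((est * gt).sum(axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

u_est = np.full((10, 10), 1.1)   # slightly wrong horizontal flow
v_est = np.zeros((10, 10))
u_gt = np.ones((10, 10))         # ground truth: one pixel/frame rightward
v_gt = np.zeros((10, 10))
err = angular_error(u_est, v_est, u_gt, v_gt)
print(err.mean(), np.median(err))  # average and median angular error
```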
Figure 5: Detected motion using real-world sequences (four frames). (A) The Hamburg taxi sequence (256 × 190 pixel resolution). (B) A walking zebra (http://www.junglewalk.com, 320 × 240 pixel). Shown: motion indicated by MT cell populations. Initially erroneous flow information is eliminated where no motion occurs, and slight directional corrections can be observed where flow information is affected by the aperture problem (at the border of the cars and the stripes of the zebra).
(2003) showed that the time course of a subpopulation of V1 cells is similar to the time course of cells in MT solving the aperture problem near line endings. This is consistent with the prediction of our model that cells in V1 and MT are disambiguated simultaneously as a consequence of feedback and local competitive interaction. Lateral inhibition (normalization) is accomplished in our model by an isotropic center-surround interaction, which represents the simplest way to realize that functionality. The shunting equation (see equation 2.3) reflects the saturation of neuronal activity.

Figure 6: Object segregation capabilities of the model. Display of maximal MT/V1 activities, normalized against global minima and maxima, for the sequences in Figures 4 and 5A (dark = low activation, light = high activation). Low activation appears due to strong competition at locations of high motion contrast. (A) In the taxi sequence, the cars are clearly segregated from the background and against each other. (B, C) The Yosemite sequence reveals two major regions of motion: ground and sky/clouds. It shows that the motion signal in V1 (C) contains some structural information that is averaged in MT (B).

As a result of our modeling investigations, we provide evidence that the experimentally observed behavior is generated by network dynamics that emerge from several elementary operations and functional principles of layering and connectivity. Given the level of detail of the employed model components, it was not our primary focus to quantitatively fit physiological data. Instead, our model serves as a link between physiology and perceptual behavior, which also enables processing real-world image sequences for benchmarking. The model suggests key principles of motion disambiguation and integration. The proposed modulatory feedback mechanism is supported by recent physiological investigations of feedback connections between early visual areas (V1, V2, and V3) and MT (Hupé et al., 2001; Friston & Büchel, 2000). For example, Hupé et al. (2001) show that cell activities in V1 are strongly affected by feedback from MT in an excitatory manner shortly after stimulus onset. This is consistent with our model, in which only excitatory feedback modulation is used (1 + netFB ≥ 1; see equation 2.1). As a result of recurrent processing, and consistent with physiological recordings of the time course of MT neurons (Pack & Born, 2001), our model disambiguates the motion signal shortly after stimulus onset. Here, the time to establish the final percept is influenced by the strength of the feedback connections, the RF size ratio between V1 and MT, and, as a prediction of our model, the spatial extent of the region of ambiguous motion (see Figure 2). The time course of MT cell populations was also investigated by Pack and Born (2001) for different bar lengths (2–8 degrees). Consistent with our results, the time required to disambiguate such stimuli was roughly proportional to the bar length (Pack, personal communication, December 2003).

4.2 Comparison with Existing Models of Motion Processing.
Different stages and computational mechanisms utilized by our model have also been used in other biologically inspired and computational models. For better readability, we organize our discussion according to two categories of existing models of motion estimation: pure feedforward models and recurrent models.

4.2.1 Feedforward Models. Simoncelli and Heeger (1998) proposed a model of motion energy detection in areas V1 and MT using linear spatiotemporal filters. Individual motion estimates are normalized by dividing individual responses by the average response in a spatial neighborhood. This center-surround mechanism is also employed in our model. We achieve such a normalization by an antagonistic mechanism that involves shunting inhibition. The net effect is a divisive inhibition at individual locations by average activities integrated over a neighborhood in the space-velocity domain. Unlike Simoncelli and Heeger, we have incorporated a mechanism of modulatory feedback that disambiguates the motion signal and spreads activities over longer distances, solving the aperture
problem. It is worth mentioning that their filters in V1 and MT differ from our mechanisms and model the fact that motion-sensitive cells in V1 have little or no speed tuning. Such filters could also be used in our model, but in order to focus on the influence of feedback processing, we omitted any additional parameters. Nowlan and Sejnowski (1994, 1995) described a model of motion integration that utilizes an explicit selection signal computed to determine the regions in the visual field where velocity estimations are most reliable. Motion-sensitive cells are then gated by this signal to produce the final estimate. Note that the way they learn to compute the selection signal is an elegant method that might be applied to learn a normalization process like the one described by Simoncelli and Heeger (1998). Our model differs in several ways from Nowlan's approach. While their approach utilizes a feedforward scheme, our model combines feedforward estimates with feedback integration and prediction. As a consequence, initial rough estimates are integrated and evaluated over time within a recurrent loop of matching velocities and motion predictions generated in area MT. As a by-product, reliable motion estimates are determined implicitly in our model instead of by explicitly generating a decision-like selection signal. Weiss and Fleet (2001) and Weiss, Simoncelli, and Adelson (2002) solved the problem of determining the velocity of a moving object using a Bayesian estimation approach. They estimate the coherent motion of a moving shape by maximizing the posterior probability of velocity votes given the detected image motion. This formulation leads to a probability representation in velocity space for all measures of single moving objects. In the spirit of IOC computation, all probability distributions are multiplicatively combined, including a given prior of expected velocities in the scene.
The aperture problem is solved implicitly by maximizing the posterior from all individual measures. In our model, we do not directly combine all initial estimates, since this would require a priori knowledge about which moving parts in the stimulus belong together. Instead, we let initial motion signals be modulated by a predictive signal from the higher processing stage of area MT, which serves as a local prior. In order to achieve a globally consistent estimate, this process is iterated to allow propagation of disambiguated motion signals along extended shape boundaries.

4.2.2 Recurrent Models. Grossberg, Mingolla, and Viswanathan (2001) and Mingolla (2003) presented a model of motion integration and segmentation in MT and MST based on inputs from the form pathway, modeled by the FACADE (Form-And-Color-And-DEpth) framework (Grossberg, 1994). They studied how motion signals from partly occluded patterns can be integrated and segregated in a recurrent fashion. In contrast to our approach, their feedback signal (from MST) inhibits MT activities and has a more global character due to the RF size of MST cells (depending on the stimulus, these RFs cover 50% to 100% of the entire stimulus). The authors suggest that such a mechanism of feedback inhibition and selection
also helps to solve the aperture problem (Mingolla, 2003). Unlike in our model, this is realized by a decision-like mechanism through the inhibitory influence of global context information delivered by large-spanning kernels. Such a mechanism therefore predicts that any resolution of uncertainty from the aperture problem should be independent of the length of the bar stimulus. Instead, our model propagates salient motion information along extended boundaries through recurrent interaction of MT and V1 cells with different RF sizes. This filling-in mechanism achieves size-invariance properties. Also, its temporal properties concur with experimental observations, as the time needed for disambiguation increases with distance from locations of unambiguous motion. Lidén and Pack (1999) proposed a model of recurrent lateral motion interactions, which is able to produce a traveling wave of motion activation to solve the aperture problem. Like our model, they use a normalization similar to the mechanism described by Simoncelli and Heeger (1998) to emphasize salient motion estimates. In contrast to our model, their normalization mechanism is not isotropic in velocity space. Propagation is achieved by recurrent lateral excitation, leading to an unbounded filling-in process that has to be constrained by long-range inhibition between motion cells of different directional selectivity and by a separately processed motion boundary signal. In the absence of concurrent motion signals from multiple objects, their model leads to completely filled-in motion fields, which must be gated by multiplication with the input signal in order to display only relevant motion patterns. Conversely, our model implements a kind of "soft gating" by biasing the input signal during feedback processing (see equation 2.1) and therefore produces spatially restricted motion estimates at all time steps without an explicit computation of motion or form boundaries.
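The divisive normalization shared by several of the models discussed here, realized in our framework as the steady state of shunting inhibition, can be sketched as follows. The kernel width and the small stabilizing constant are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def divisive_normalization(act, sigma=1.0, eps=0.01):
    """Steady state of shunting (divisive) inhibition: each response is
    divided by the activity pooled over an isotropic space-velocity
    neighborhood, so salient (locally dominant) estimates are emphasized."""
    pooled = gaussian_filter(act, sigma=sigma)
    return act / (eps + pooled)

act = np.random.rand(20, 20, 15, 15)   # (y, x, vy, vx) activities
norm = divisive_normalization(act)
assert norm.shape == act.shape
```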
Koechlin, Anton, and Burnod (1999) describe a model of motion integration along the V1-MT pathway that utilizes mechanisms of recurrent lateral interactions. Their model employs a multiplicative combination of feedforward input and the result of lateral integration, which leads the authors to claim that their approach implements a neural mechanism of Bayesian estimation. Salient motion features are emphasized through normalization, and the results of recurrent lateral modulation (gating) are used to propagate these features. Though these mechanisms seem rather similar to those proposed in our model, the realization and behavior differ in many respects. For example, their gating process leads to strong inhibition of the input signal once the model has focused on one specific velocity while the stimulus changes to another velocity. Such lateral multiplication intensifies the winner-takes-all characteristic of their model (Koechlin et al., 1999) and makes it more vulnerable to errors. Our model follows a gradual prediction-and-correction philosophy realized by an exclusively excitatory modulation of the feedforward input through the feedback signal, which is followed by a center-surround competitive mechanism to realize a biased competition. Essential to our model is the decoupling into different areas with different RF sizes. This provides a larger context to the
higher area and thus the ability to correct (bias) and disambiguate cell activities in earlier areas with higher spatial accuracy. Another important point is that Koechlin et al. did not include cells sensitive to zero velocity or an interaction with static form information in their model. This renders it impossible to segregate moving objects from a static background in a spatially localized fashion. Combined with the winner-takes-all property, their model may even form regions of motion in a static image sequence with spatiotemporal noise where none are actually present. Finally, the results published in Koechlin et al. (1999) show that their model fails to solve the aperture problem for moving bars when the bars are longer than the size of the RFs. Contrary to this behavior, in our model a traveling wave of activation emerges, which helps to disambiguate motion signals along extended bars and shape outlines independent of the ratio between shape size and RF size. This concurs with the temporal evolution of MT cell activities investigated by Pack and Born (2001). Our proposed architecture is thus demonstrated to deal with large variations of shape or object size, so as to provide a mechanism of size-invariant motion integration.

5 Conclusion

In sum, we presented a model of motion processing in areas V1 and MT capable of handling synthetic as well as natural image sequences. The model shows the following key properties: initial detection of raw flow information, temporal spreading of reliable motion signals to gradually correct uncertain flow estimates, and the ability to sharply segregate regions of individual visual motion. We showed how the aperture problem can be solved by contextual modulation, how feedback acts as short-term memory to account for hysteresis effects in motion disambiguation, and how global consistency is achieved by local interactions.
Our model is unique in the sense that it combines mechanisms of local lateral interaction with modulatory, purely excitatory feedback to solve ambiguities of detected visual motion. Our approach makes several new contributions. First, we propose a model of cortical feedforward and feedback processing in the dorsal pathway of motion integration that implements a neural hypothesis-test cycle of computation. Most important, the feedback mechanism provides top-down modulatory enhancement of initial activities that match signal properties at a higher processing stage. Second, the disambiguation of initial estimates is achieved by the interplay between top-down modulation and subsequent lateral competition. Consequently, the network dynamics propagate disambiguated motion signals along shape boundaries, realizing a guided filling-in process (Neumann, 2003). This mechanism is important in that it provides a means to process objects of different sizes in an invariant fashion. Third, the model serves as a link between physiological recordings (e.g., Pack & Born, 2001) and psychophysical investigations of perceptual motion integration (Williams & Phillips, 1987). Beyond this, the model is
able to process real-world stimulus sequences to yield accurate motion estimations. We believe that this further justifies the explanatory competence of the key computational elements of the model, as most other biologically inspired models do not compare the quality of their results against other technical or nontechnical models. Based on model simulations, we make several predictions concerning the computational mechanisms involved in early motion perception:

1. The disambiguation process observed for macaque MT cells (Pack & Born, 2001) should also be observed in V1 cells as a direct consequence of feedback interactions. This model prediction is partly confirmed by the findings of Pack et al. (2003).

2. The time to disambiguate regions of ambiguous motion depends on the distance of such regions from unambiguous motion features (e.g., induced by corners or line ends). This time is consumed by the increased number of feedforward-feedback cycles necessary to bias responses that cohere with the apparent motion direction.

3. We predict that without feedback, no perceptual hysteresis is generated, and motion activity patterns remain ambiguous.

At the current stage of modeling, the process of motion grouping is based solely on proximity in the spatial and velocity domains. Interactions with the form pathway could enhance the model performance by grouping motion information preferentially along contours. Ownership cues arising from occlusion could also be deduced from the ventral form pathway. Psychophysical investigations suggest that motion features are integrated only when they are intrinsic to the moving boundary, while extrinsic signals should be suppressed (Shimojo, Silverman, & Nakayama, 1989). This topic needs further investigation, since it is as yet unclear how signals from the form and motion pathways are integrated utilizing mainly excitatory interactions between cortical areas.
However, even without these extensions, the model already yields psychophysically and physiologically consistent results for a broad range of motion stimuli. In addition, the quality of estimated motion direction in real-world sequences compares well with technical solutions, without explicit parameter tuning for each type of sequence. In all, the proposed model provides further evidence for key computational principles involved in the cortical computation of sensory stimuli and their integration and segregation. Neumann and Sepp (1999) proposed the basic mechanisms of feedforward feature detection, subsequent integration and matching, and modulation through feedback to implement a neural hypothesis-testing paradigm for boundary integration in static form perception. Model simulations of V1-V2 interaction demonstrated context-dependent changes of orientation selectivity, texture density suppression, and subjective contour completion, as observed in physiological and psychophysical experiments. An extension of this architecture by Thielscher and Neumann (2003) incorporates a model area V4 to investigate the segregation of textures generated from bars of different orientations. Whereas these models focus on inter-areal feedforward-feedback interaction, we have also proposed a model of intra-areal recurrent V1 contour processing (Hansen & Neumann, 2004). These neurodynamical models of static form perception utilize the same core mechanisms of layered processing in cortical architecture. In this letter, we have now proposed the same core mechanisms to account for the processing of temporally varying stimuli in the cortical motion pathway. Given the evidence gathered from our computational experiments, we claim that the early processing stages in visual cortex along the ventral and the parietal pathway are organized in a homologous fashion. Modulatory feedback and subsequent divisive inhibition realize a mechanism of biased competition already at an early stage, with behavior similar to that proposed by Desimone and Duncan (1995) for attention mechanisms that filter out irrelevant information. We have proposed a concept of feedback as part of a layered structure and representation and presented an implementation of multiple loops of recurrent interaction whose dynamics realize multilevel cortical hypothesis-testing cycles.

Appendix A: Equations for Initial Motion Detection (Input Stage)

Here we describe the equations used to generate an initial raw motion estimate, which serves as the input stage to V1. We use (oriented) model complex cells to compute a spatiotemporal correlation (modified elaborated Reichardt detectors [ERD], similar to Adelson & Bergen, 1985) to measure local motion for a specific range of velocities at each location. Equations A.1 to A.4 (illustrated in Figure 7A) generate a raw motion signal (encoded as a population code in c^(3)), which is computed from a pair of images and is further processed by the main model equations, B.1 to B.3.
c^(1)_{t,x,α} represents oriented complex cell responses normalized through shunting inhibition. Eight orientations (α) were used for the simulations (* denotes the convolution operator, G_σ is a gaussian function, and ∂²_{x(α)} the second directional derivative in direction α). The convolution with the second derivative of an isotropic gaussian filter (with approximately σ = 1) was computed by applying a sampled first derivative of an isotropic gaussian filter (σ = 0.75) twice, in order to preserve (numerically) a DC component of zero. Figure 7B illustrates some examples of sampled first derivatives of a gaussian. Figure 7E shows a zoomed part of a frame from an example sequence to compare the size of the kernels (printed at the same scale) with typical image structures processed by the model:

c^{(1)}_{t,x,\alpha} = \frac{I_{t,x} * \partial^2_{x(\alpha)} G_\sigma}{0.01 + \sum_\beta \left| I_{t,x} * \partial^2_{x(\beta)} G_\sigma \right| * G_\sigma}. \qquad (A.1)
P. Bayerl and H. Neumann
Figure 7: (A) Schematic overview of the correlation detector used as the input stage of V1 (for details, see the text). (B) First derivative gaussian kernels for different orientations. (C) Isotropic gaussian kernel with σ = 1 used in model V1 and the correlation detector (see the text). (D) Gaussian kernel (σ = 7) used in model MT (see the text). (E) Zoomed part of a frame from an example sequence, printed at the same scale as the kernels in B–D.
Disambiguating Visual Motion
c^(2+)_{t,x,Δx} and c^(2−)_{t,x,Δx} represent half-detectors for a specific velocity defined by a shift Δx = (x, y) between two successive frames and can be interpreted as a raw correlation of model complex cell responses. These results are obtained by summing over all orientations (as a consequence, the results are no longer orientation specific) and pooling over a small spatial neighborhood with an isotropic gaussian kernel (σ = 1), illustrated in Figure 7C:

c^{(2+)}_{t,x,\Delta x} = \left( \sum_\alpha c^{(1)}_{t,x,\alpha} \cdot c^{(1)}_{t+1,x+\Delta x,\alpha} \right) * G_\sigma, \qquad (A.2)

c^{(2-)}_{t,x,\Delta x} = \left( \sum_\alpha c^{(1)}_{t,x+\Delta x,\alpha} \cdot c^{(1)}_{t+1,x,\alpha} \right) * G_\sigma. \qquad (A.3)
c^(3)_{t,x,Δx} builds the final population code at each location, indicating raw motion estimates ([·]_+ is a rectification operator realizing max(·, 0)). A shunting inhibition normalizes these activities, which differs from the standard implementation of the full Reichardt detector, which uses subtractive inhibition:

c^{(3)}_{t,x,\Delta x} = \frac{[c^{(2+)}_{t,x,\Delta x}]_+ - 0.5 \cdot [c^{(2-)}_{t,x,\Delta x}]_+}{1.0 + [c^{(2-)}_{t,x,\Delta x}]_+}. \qquad (A.4)
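To make the raw-motion input stage concrete, here is a minimal NumPy/SciPy sketch of equations A.1 to A.4. It is an illustrative reimplementation, not the authors' code: it uses 4 orientations instead of 8, steers the second directional derivatives from the Gaussian Hessian rather than repeated first-derivative passes, and probes only a handful of candidate shifts.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def complex_cells(img, n_orient=4, sigma=1.0):
    """Normalized oriented responses (equation A.1). The second directional
    derivative in direction a is steered from the Gaussian Hessian:
    d2_a = cos^2 Ixx + 2 cos sin Ixy + sin^2 Iyy."""
    Ixx = gaussian_filter(img, sigma, order=(0, 2))
    Iyy = gaussian_filter(img, sigma, order=(2, 0))
    Ixy = gaussian_filter(img, sigma, order=(1, 1))
    resp = []
    for a in np.arange(n_orient) * np.pi / n_orient:
        c, s = np.cos(a), np.sin(a)
        resp.append(c * c * Ixx + 2 * c * s * Ixy + s * s * Iyy)
    resp = np.stack(resp)                              # (orient, y, x)
    norm = gaussian_filter(np.abs(resp).sum(0), sigma)
    return resp / (0.01 + norm)                        # shunting normalization

def raw_motion(I0, I1, shifts, sigma=1.0):
    """Population-coded raw motion signal c3 (equations A.2 to A.4)."""
    c0, c1 = complex_cells(I0), complex_cells(I1)
    c3 = {}
    for dy, dx in shifts:
        sh1 = np.roll(c1, (-dy, -dx), axis=(1, 2))         # c1 at t+1, x+dx
        sh0 = np.roll(c0, (-dy, -dx), axis=(1, 2))         # c1 at t,   x+dx
        c2p = gaussian_filter((c0 * sh1).sum(0), sigma)    # eq. A.2
        c2m = gaussian_filter((sh0 * c1).sum(0), sigma)    # eq. A.3
        p, m = np.maximum(c2p, 0), np.maximum(c2m, 0)
        c3[(dy, dx)] = (p - 0.5 * m) / (1.0 + m)           # eq. A.4
    return c3

# A random texture shifted one pixel to the right should vote for shift (0, 1).
rng = np.random.default_rng(0)
frame0 = rng.standard_normal((64, 64))
frame1 = np.roll(frame0, 1, axis=1)
votes = {v: m.sum() for v, m in
         raw_motion(frame0, frame1, [(0, 1), (0, -1), (1, 0), (0, 0)]).items()}
```

Summing c^(3) over the image gives a global velocity vote; for the fully textured test pattern the preferred-direction half-detector (A.2) peaks at the true shift while the null half-detector (A.3) suppresses the reversed one.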
The key features of the mechanism are that (1) the responses of normalized complex cells of all orientations are involved in motion estimation and that (2) even stationary patterns can produce (weak) responses in a spatial velocity code (e.g., induced by a periodic gray-level pattern). The resulting activities c^(3)_{t,x,Δx} for different velocities (encoded by Δx) at different locations (x) indicate unambiguous motion at corners and line endings, ambiguous motion along contrasts, and no motion for homogeneous regions.

Appendix B: Equations for Motion Integration

Here we present the steady-state equations corresponding to equations 2.1 through 2.3. v^(1) samples the input signal (net^IN, depending on the model area; see Table 1) and multiplicatively enhances incoming signals matching the feedback signal (net^FB, depending on the model area; see Table 1):

v^{(1)} = net^{IN} \cdot (1 + C \cdot net^{FB}). \qquad (B.1)
v^(2) realizes the integration in space and in the velocity domain by isotropic gaussian filters. In the velocity domain, the filter size is identical in both model areas (σ = 0.75). Note that the resolution of the velocity space was set to 15 × 15 for all simulations. The spatial kernel is a dirac impulse (a gaussian with σ = 0) in model V1 and a gaussian filter with σ = 7 in model MT (illustrated in Figure 7D):

v^{(2)} = (v^{(1)})^2 * G^{(x,\,space)}_{\sigma_1} * G^{(\Delta x,\,velocity)}_{\sigma_2}. \qquad (B.2)
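As an illustration of how equations B.1 to B.3 combine, the following sketch runs one steady-state pass on a (space × velocity) population code. The feedback gain C and the grid sizes are invented for the example, and n in the B.3 normalization is interpreted here as the number of represented velocities; both are assumptions, not values restated from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def steady_state(net_in, net_fb, sigma_space, C=100.0, sigma_vel=0.75):
    """One pass through equations B.1-B.3 on an activity array of shape
    (ny, nx, nvy, nvx). sigma_space = 0 gives the Dirac kernel of model V1;
    sigma_space = 7 the MT kernel. C is a hypothetical feedback gain."""
    v1 = net_in * (1.0 + C * net_fb)                            # eq. B.1
    v2 = gaussian_filter(v1 ** 2, sigma=(sigma_space, sigma_space,
                                         sigma_vel, sigma_vel))  # eq. B.2
    n = v2.shape[2] * v2.shape[3]      # assumed: number of represented velocities
    s = v2.sum(axis=(2, 3), keepdims=True)
    return (v2 - s / (2 * n)) / (0.01 + s)                      # eq. B.3

# Biased competition: of two equally active velocity hypotheses, the one
# supported by feedback dominates after one pass.
net_in = np.zeros((1, 1, 5, 5))
net_in[0, 0, 1, 1] = net_in[0, 0, 3, 3] = 1.0
net_fb = np.zeros_like(net_in)
net_fb[0, 0, 1, 1] = 1.0
v3 = steady_state(net_in, net_fb, sigma_space=0)
```

The multiplicative enhancement in B.1 followed by the shunting normalization in B.3 is what implements the "biased competition" behavior discussed in the conclusion: supported hypotheses grow at the expense of unsupported ones.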
v^(3) represents normalized velocity estimates in both model areas, realized by a shunting inhibition with the sum over all velocities. As a consequence, ambiguous signals are weakened, while unambiguous signals are emphasized:

v^{(3)} = \frac{v^{(2)} - \frac{1}{2n} \sum_{\Delta x'} v^{(2)}_{\Delta x'}}{0.01 + \sum_{\Delta x'} v^{(2)}_{\Delta x'}}. \qquad (B.3)

References

Adelson, E., & Bergen, J. (1985). Spatiotemporal energy models for the perception of motion. Optical Society of America, A 2(2), 284–299.
Adelson, E., & Movshon, J. (1982). Phenomenal coherence of moving visual patterns. Nature, 300, 523–525.
Barron, J. L., Fleet, D. J., & Beauchemin, S. S. (1994). Performance of optical flow techniques. Int. J. Computer Vision, 12(1), 43–77.
Crick, F., & Koch, C. (1998). Constraints on cortical and thalamic projections: The no-strong-loop hypothesis. Nature, 391, 245–250.
Del Viva, M. M., & Morrone, M. C. (1998). Motion analysis by feature tracking. Vision Research, 38, 3633–3653.
Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, 193–222.
Friston, K. J., & Büchel, C. (2000). Attentional modulation of effective connectivity from V2 to V5/MT in humans. PNAS, 97(13), 7591–7596.
Grossberg, S. (1980). How does a brain build a cognitive code? Psychological Review, 87, 1–51.
Grossberg, S. (1994). 3-D vision and figure-ground separation by visual cortex. Perception and Psychophysics, 55, 48–120.
Grossberg, S., Mingolla, E., & Viswanathan, L. (2001). Neural dynamics of motion integration and segmentation within and across apertures. Vision Research, 41, 2521–2553.
Hansen, T., & Neumann, H. (2004). Neural mechanisms for the robust detection of junctions. Neural Computation, 16(5), 1013–1037.
Hildreth, E. C. (1984). Computations underlying the measurement of visual motion. Artificial Intelligence, 23, 309–354.
Horn, B. K. P., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17, 185–203.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195, 215–243.
Hupé, J. M., James, A. C., Girard, P., Lomber, S. G., Payne, B. R., & Bullier, J. (2001). Feedback connections act on the early part of the responses in monkey visual cortex. J. Neurophys., 85, 134–145.
Koechlin, E., Anton, J. L., & Burnod, Y. (1999). Bayesian inference in populations of cortical neurons: A model of motion integration and segregation in area MT. Biological Cybernetics, 80, 25–44.
Lidén, L., & Pack, C. C. (1999). The role of terminators and occlusion in motion integration and segmentation: A neural solution. Vision Research, 39, 3301–3320.
Maunsell, J. H. R. (1995). The brain's visual world: Representation of visual targets in cerebral cortex. Science, 270, 764–769.
Maunsell, J. H. R., & Van Essen, D. C. (1983). Functional properties of neurons in the middle temporal visual area of the macaque monkey. I: Selectivity for stimulus direction, speed and orientation. J. Neurophys., 49(5), 1127–1147.
McDermott, J., Weiss, Y., & Adelson, E. H. (2001). Beyond junctions: Nonlocal form constraints on motion interpretation. Perception, 30, 905–923.
Mingolla, E. (2003). Neural models of motion integration and segmentation. Neural Networks, 16, 939–945.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., & Newsome, W. T. (1985). The analysis of moving visual patterns. In C. Chagas, R. Gattass, & C. R. Gross (Eds.), Pattern recognition mechanisms (pp. 117–151). Vatican City: Vatican Press.
Movshon, J. A., & Newsome, W. T. (1996). Visual response properties of striate cortical neurons projecting to area MT in macaque monkeys. J. Neuroscience, 16(23), 7733–7741.
Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch & J. Davis (Eds.), Large scale neuronal theories of the brain (pp. 125–152). Cambridge, MA: MIT Press.
Nagel, H. H. (1987). On the estimation of optical flow: Relations between different approaches and some new results. Artificial Intelligence, 33, 299–324.
Neumann, H. (2003). Completion phenomena in vision: A computational approach. In L. Pessoa & P. De Weerd (Eds.), Filling-in—From perceptual completion to cortical reorganization (pp. 151–173). New York: Oxford University Press.
Neumann, H., & Sepp, W. (1999). Recurrent V1–V2 interaction in early visual boundary processing. Biological Cybernetics, 81, 425–444.
Nicolescu, M., & Medioni, G. (2003). Layered 4D representation and voting for grouping from motion. PAMI, 25(4), 492–501.
Nowlan, S. J., & Sejnowski, T. J. (1994). Filter selection model for motion segmentation and velocity integration. Optical Society of America, A 11(12), 3177–3200.
Nowlan, S. J., & Sejnowski, T. J. (1995). A selection model for motion processing in area MT of primates. J. Neuroscience, 15(2), 1195–1214.
Pack, C. C., & Born, R. T. (2001). Temporal dynamics of a neural solution to the aperture problem in cortical area MT. Nature, 409, 1040–1042.
Pack, C. C., Livingstone, M. S., Duffy, K. R., & Born, R. T. (2003). End-stopping and the aperture problem: Two-dimensional motion signals in macaque V1. Neuron, 39, 671–680.
Pouget, A., Zemel, R. S., & Dayan, P. (2000). Information processing with population codes. Nature Review Neuroscience, 1(2), 125–132.
Shimojo, S., Silverman, G., & Nakayama, K. (1989). Occlusion and the solution to the aperture problem for motion. Vision Research, 29, 619–626.
Simoncelli, E. P., & Heeger, D. J. (1998). A model of neuronal responses in visual area MT. Vision Research, 38, 743–761.
Thielscher, A., & Neumann, H. (2003). Neural mechanisms of cortico-cortical interaction in texture boundary detection: A modeling approach. J. Neuroscience, 12, 921–939.
Ullman, S. (1995). Sequence seeking and counter streams: A computational model for bidirectional information flow in the visual cortex. Cerebral Cortex, 5, 1–11.
Van Essen, D. C., & Gallant, J. L. (1994). Neural mechanisms of form and motion processing in the primate visual system. Neuron, 13, 1–10.
Weiss, Y., & Fleet, D. J. (2001). Velocity likelihoods in biological and machine vision. In R. P. N. Rao, B. A. Olshausen, & M. S. Lewicki (Eds.), Probabilistic models of the brain: Perception and neural function (pp. 81–100). Cambridge, MA: MIT Press.
Weiss, Y., Simoncelli, E. P., & Adelson, E. H. (2002). Motion illusions as optimal percepts. Nature Neuroscience, 5(6), 598–604.
Williams, D., & Phillips, G. (1987). Cooperative phenomena in the perception of motion direction. Optical Society of America, A 4(5), 878–885.
Wilson, H. R., Ferrera, V. P., & Yo, C. (1992). A psychophysically motivated model for two-dimensional motion perception. Visual Neuroscience, 9, 79–97.

Received August 8, 2003; accepted April 1, 2004.
LETTER
Communicated by Geoffrey Goodhill
Exact Solution for the Optimal Neuronal Layout Problem Dmitri B. Chklovskii
[email protected] Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, U.S.A.
Evolution perfected brain design by maximizing its functionality while minimizing the costs associated with building and maintaining it. The assumption that brain functionality is specified by neuronal connectivity, implemented by costly biological wiring, leads to the following optimal design problem: for a given neuronal connectivity, find a spatial layout of neurons that minimizes the wiring cost. Unfortunately, this problem is difficult to solve because the number of possible layouts is often astronomically large. We argue that the wiring cost may scale as wire length squared, reducing the optimal layout problem to a constrained minimization of a quadratic form. For biologically plausible constraints, this problem has exact analytical solutions, which give reasonable approximations to actual layouts in the brain. These solutions make the inverse problem of inferring neuronal connectivity from neuronal layout more tractable.

1 Introduction

Wiring up distant neurons in the brain is costly to an organism (Ramón y Cajal, 1899/1999). The cost of wiring arises from its volume (Cherniak, 1992; Mitchison, 1991), metabolic requirements (Attwell & Laughlin, 2001), signal delay and attenuation (Rall et al., 1992; Rushton, 1951), or possible guidance defects in development (Dickson, 2002). Whatever the origin of the wiring cost, it must grow with the distance between connected neurons. Therefore, placing connected neurons as close as possible reduces the wiring cost and, for a given connectivity, confers a selection advantage on an organism. In principle, this evolutionary argument allows one to predict neuronal placement from connectivity data by solving an optimal layout problem. In practice, solving this problem for many neurons with nonstereotypical connectivity is complicated, and often impossible, due to the large number of possible neuronal permutations, which grows exponentially with the number of neurons.
Neural Computation 16, 2067–2078 (2004). © 2004 Massachusetts Institute of Technology

In this letter, we argue that the wiring cost may scale approximately as the wire length squared (see section 2). In this approximation, the optimal layout problem reduces to the minimization of a quadratic form (see section 3). The trivial solution is ruled out by biological constraints that can be classified into external and internal. For both classes of constraints, the optimal layout
can be found in analytical form (see sections 4 and 5). To test the quadratic placement optimization, we compare its predictions in two cases where both connectivity and layout are known: prefrontal cortical areas in macaque (see section 6) and Caenorhabditis elegans ganglia (see section 7). The solution of the quadratic optimal layout problem gives a reasonable approximation to the actual placement of these multineuron complexes. Thus, the quadratic wire length cost function promises to be a powerful tool for solving optimal neuronal layout problems.

2 Wiring Cost May Scale as Wire Length Squared

Because the exact origin of the wiring cost is not known, one can only guess its dependence on the distance between neurons. In this section, we consider several plausible hypotheses for the wiring cost function and argue that the wire length squared may serve as a reasonable approximation. Previous work suggests that the cost of wiring is proportional to its volume (Cherniak, 1992; Chklovskii, Schikorski, & Stevens, 2002; Mitchison, 1991; Stepanyants, Hof, & Chklovskii, 2002), which scales with the distance times the wire diameter squared. If the wire diameter is fixed, the wiring cost grows linearly with distance. But if the cost is proportional to volume, why not make all the axons infinitesimally thin? My collaborators and I have argued (Chklovskii et al., 2002; Chklovskii & Stepanyants, 2003) that the observed axon diameter may result from a trade-off between the wire volume cost, which grows with wire diameter, and the signal propagation delay cost, which decreases with wire diameter because of an increase in conduction speed. This trade-off is captured by the wiring cost function, C, which contains two terms: one proportional to the signal propagation delay, T, to the power n, and the other proportional to the wire volume, V:

C = \alpha T^n + \beta V, \qquad (2.1)
where α and β are unknown constants. If the wires are myelinated axons, the signal propagation speed, s, scales linearly with the wire diameter, d, as s = kd, leading to the following expression for the cost function:

C = \alpha \left( \frac{L}{kd} \right)^n + \beta \frac{\pi}{4} d^2 L, \qquad (2.2)
where L is the wire length. The cost function is minimized by the wire diameter that solves the equation ∂C/∂d = 0. By substituting this optimal wire diameter into the cost function 2.2, I get the following dependence of the cost on the wire length, L:

C = \left( 1 + \frac{n}{2} \right) \left( \frac{\pi \alpha^{2/n} \beta}{2 n k^2} \right)^{n/(n+2)} L^{3n/(n+2)}. \qquad (2.3)
If the exponent n = 1, then the wiring cost scales linearly with the wire length. It is possible, however, that the signal propagation delay is a hard constraint (n = ∞). Then the wiring cost scales as the length cubed. Intermediate values of the exponent give cost functions that lie between the linear and the cubic dependence on length. Therefore, it is reasonable to approximate the actual wire length cost function by a quadratic expression (corresponding to n = 4 in equation 2.3):
C = \frac{3}{4} \left( \frac{\pi^2 \alpha \beta^2}{k^4} \right)^{1/3} L^2. \qquad (2.4)
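The scaling in equations 2.2 to 2.4 can be checked numerically: minimize the cost over the wire diameter d on a grid and fit the exponent of the resulting C(L). The constants α = β = k = 1 below are arbitrary choices for the check, not values from the text.

```python
import numpy as np

alpha, beta, k, n = 1.0, 1.0, 1.0, 4    # n = 4 gives the quadratic case

def optimal_cost(L):
    """Minimize equation 2.2 over the wire diameter d by grid search."""
    d = np.linspace(1e-3, 10.0, 200_000)
    C = alpha * (L / (k * d)) ** n + beta * (np.pi / 4.0) * d ** 2 * L
    return C.min()

Ls = np.array([1.0, 2.0, 4.0, 8.0])
Cs = np.array([optimal_cost(L) for L in Ls])
# Fitted log-log slope should match the predicted exponent 3n/(n+2) = 2.
slope = np.polyfit(np.log(Ls), np.log(Cs), 1)[0]
# Closed form of equation 2.4 for comparison.
closed_form = 0.75 * (np.pi ** 2 * alpha * beta ** 2 / k ** 4) ** (1 / 3) * Ls ** 2
```

The grid minimum reproduces both the L² scaling and the prefactor of equation 2.4, confirming that the quadratic case corresponds to n = 4.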
It is possible that the actual cost function is noticeably different from the quadratic form (e.g., if the exponent n < 1). Still, the quadratic cost function can be very useful due to the exact solvability of the quadratic layout problem, as demonstrated in sections 4 and 5. The validity of the quadratic cost function is established by comparing theoretical predictions with experimental data (see sections 6 and 7). Thus, the quadratic cost function may play a role in neuronal layout optimization similar to that of the harmonic oscillator in physics or the fruit fly in genetics.

3 Optimal Layout Problem Requires Constraints

In order to formulate the quadratic optimal layout problem, we represent a neuronal circuit as a nondirected weighted graph. Nodes of the graph correspond to neurons (or multineuron complexes), and edges correspond to connections between neurons (or between multineuron complexes). The weight of each edge represents the connection strength and is given by the (constant) coefficient in front of the wire length squared (see equation 2.4) times the multiplicity of the connection. In turn, the multiplicity of the connection is given by the number of parallel wires between the given pair of neuronal complexes or perhaps by the number of synapses between the given pair of neurons. The directionality of connections can be ignored because the cost of a wire does not depend on the direction of signal propagation. The graph is specified algebraically by the adjacency matrix (or the wiring diagram), A, where weights A_{ij} give (nondirectional) connection strengths between neurons i and j. This matrix is symmetric (A_{ij} = A_{ji}) and nonnegative (A_{ij} ≥ 0), with all diagonal elements equal to zero (A_{ii} = 0). With the help of the adjacency matrix, the quadratic wire length cost function for the neuronal circuit can be written as
C = \frac{1}{2} \sum_{i,j} A_{ij} (r_i - r_j)^2, \qquad (3.1)
where r_i, r_j are the coordinates of nodes i and j. The quadratic optimal layout problem is to find the coordinates that minimize the cost function under given constraints. Constraints exclude the trivial solution, r_i = 0, and may be classified by their biological origin into external and internal. External constraints arise from the fact that the brain is not an isolated network of neurons but is connected with the sensory and motor organs, the placement of which is determined by functional requirements unrelated to wiring (see section 4). Internal constraints arise from the volume exclusion of neuron bodies and axons, meaning that no two neurons or axons can occupy the same point in space (see section 5).

The quadratic wire length cost function 3.1 has a simple physical analogy. If neurons are connected by stretched rubber bands of zero length at rest, then equation 3.1 represents their elastic energy. The weights A_{ij} in equation 3.1 represent the elasticity of connections. The minimal energy state is then achieved when all neurons are in one location and all rubber bands have zero length. This trivial solution is ruled out by external or internal constraints. Previously, the rubber band analogy inspired the elastic net algorithm (Durbin & Mitchison, 1990; Durbin & Willshaw, 1987; Goodhill & Willshaw, 1990), whose relationship to wiring minimization is discussed in Goodhill and Sejnowski (1997).

4 Exact Solution Under External Constraints

The function of the brain is to bridge sensory input and motor output. Communication between sensory and motor organs, on the one hand, and the brain, on the other, requires biological wires. The cost of these wires must be included in the overall cost function,

C = \frac{1}{2} \sum_{i,j} A_{ij} (r_i - r_j)^2 + \sum_{i,j} B_{ij} (r_i - f_j)^2, \qquad (4.1)

where the first term represents the cost of wiring between neurons in the brain and the second term represents the cost of wiring between the brain and the sensory and motor organs. Weight B_{ij} represents the connection strength between neuron i and organ j, and f_j is the coordinate of organ j.
As various functional requirements determine organ placement in the body plan (e.g., frontal eyes, forward nose, muscles attached to bones), it is reasonable to formulate the optimal neuronal layout problem with the organ coordinates fixed. To solve the optimal layout problem, we search for the minimum of the wiring cost function 4.1 while varying the locations of the brain neurons, r_i. An elegant way to do this is by first rewriting the two terms of the cost function in matrix form (Hall, 1970):

\frac{1}{2} \sum_{i,j} A_{ij} (r_i - r_j)^2 = \frac{1}{2} \sum_{i,j} A_{ij} (r_i^2 - 2 r_i r_j + r_j^2) = \sum_i r_i^2 \sum_j A_{ij} - \sum_{i,j} r_i A_{ij} r_j = r^T (D^A - A) r = r^T L r, \qquad (4.2)
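The chain of identities in equation 4.2 can be verified numerically on a random symmetric adjacency matrix (toy data, for checking only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.random((n, n))
A = (A + A.T) / 2.0                 # symmetric, nonnegative weights
np.fill_diagonal(A, 0.0)            # A_ii = 0
r = rng.standard_normal(n)          # one-dimensional node coordinates

L = np.diag(A.sum(axis=1)) - A      # Laplacian L = D_A - A
lhs = 0.5 * np.sum(A * (r[:, None] - r[None, :]) ** 2)   # left side of 4.2
rhs = r @ L @ r                                          # right side of 4.2
```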
where r is a column vector {r_i}, matrix D^A_{ij} = \delta_{ij} \sum_k A_{ik}, and L = D^A − A is called the Laplacian of matrix A, and

\sum_{i,j} B_{ij} (r_i - f_j)^2 = \sum_{i,j} B_{ij} (r_i^2 - 2 r_i f_j + f_j^2) = \sum_i r_i^2 \sum_j B_{ij} - 2 \sum_{i,j} r_i B_{ij} f_j + \sum_j f_j^2 \sum_i B_{ij} = r^T D^B r - 2 r^T B f + \mathrm{const}, \qquad (4.3)

where matrix D^B_{ij} = \delta_{ij} \sum_k B_{ik}. The minimum of the quadratic wire length cost function 4.1 can be found by taking the derivative with respect to r and setting it to zero:

\frac{dC}{dr} = 2 (L + D^B) r - 2 B f = 0. \qquad (4.4)

Then the optimal layout is given by the following matrix equation:

r = (L + D^B)^{-1} B f. \qquad (4.5)
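Equation 4.5 is a single linear solve. Below is a minimal sketch on an invented three-neuron, two-organ toy circuit; the matrices are illustrative, not the anatomical data of sections 6 and 7.

```python
import numpy as np

A = np.array([[0., 2., 1.],         # internal connections (symmetric, zero diagonal)
              [2., 0., 0.],
              [1., 0., 0.]])
B = np.array([[3., 0.],             # neuron-to-organ connection strengths
              [0., 0.],
              [0., 1.]])
f = np.array([0.0, 1.0])            # fixed one-dimensional organ coordinates

L = np.diag(A.sum(axis=1)) - A      # graph Laplacian of A
D_B = np.diag(B.sum(axis=1))        # D^B from equation 4.3
r = np.linalg.solve(L + D_B, B @ f) # optimal layout, equation 4.5
```

Neuron 0, strongly wired to the organ at coordinate 0, lands near 0, while neuron 2 settles between its organ at coordinate 1 and its internal partner, illustrating the force-balance (cobweb) picture described below.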
This solution for the layout problem can be easily generalized to d spatial dimensions. Because the cost function 4.1 is separable into d terms, each containing distances along a different dimension, equation 4.5 gives the projection of the layout vector onto the corresponding spatial dimension. In the rubber band analogy, this solution is reminiscent of a cobweb, where the locations of the network nodes are determined by the forces exerted by external and internal connections. If external connections, B_{ij}, are much stronger than internal ones, A_{ij}, the location of the nodes is determined mainly by the balance of external forces. For example, if each node makes a strong connection with only one fixed organ, it should be located on that organ. In the opposite limit, when internal connections dominate, all nodes cluster together near the center of mass. Then the optimal layout problem can be broken into two steps. First, the location of the center of mass, r_{cm}, corresponds to the minimum of

C_{cm} = \sum_j (r_{cm} - f_j)^2 \sum_i B_{ij}. \qquad (4.6)
Second, the coordinates of the nodes relative to the center of mass can be found by using the spectral decomposition of the Laplacian (see the next section). This two-step solution allows one to predict the center of mass location even when the internal connections are not completely known.

5 Spectral Analysis Emulates Internal Constraints

The finite size of neuronal bodies and axons places constraints on the possible layouts because of volume exclusion, or congestion. Inclusion of these constraints is in general a difficult problem. Here we present an approximate treatment of the internal constraints, which yields an aesthetically appealing exact solution (Hall, 1970). In order to avoid the trivial solution, the norm of the vector r is fixed, yielding the following optimization problem (see equation 4.2 for the derivation):

\text{minimize } C = \frac{1}{2} \sum_{i,j} A_{ij} (r_i - r_j)^2 = r^T L r, \quad \text{subject to } r^T r = 1. \qquad (5.1)
This minimization problem is solved by the eigenvector of L corresponding to its lowest eigenvalue. However, the lowest eigenvalue of L is 0, and the corresponding eigenvector is \mathbf{1}/N^{1/2}, which means that all nodes are at the same point. This led Hall (1970) to introduce an additional constraint by requiring that the solution of the minimization problem be orthogonal to that eigenvector:

r^T \mathbf{1} = 0, \quad \text{or} \quad \sum_i r_i = 0. \qquad (5.2)
Then the solution to the one-dimensional optimal layout problem is given by the eigenvector of the Laplacian, v_2, corresponding to the second lowest eigenvalue, λ_2. If the problem is d-dimensional, then the solution is given by the d eigenvectors corresponding to the 2nd to (d+1)st lowest eigenvalues (Hall, 1970).

The layout problem with internal constraints also admits a physical analogy. In addition to the elastic force exerted on the nodes by massless rubber bands, there is a repulsive force proportional to the distance from the origin. The role of the additional constraint 5.2 is to pin the center of mass to the origin. Alternatively, one can view this problem as finding the configuration with minimum elastic energy for a fixed moment of inertia 5.1 and center of mass 5.2.

Unfortunately, the above solution for internal constraints cannot be combined straightforwardly with that for external constraints presented in section 4. For example, the center of mass coordinates are determined by incompatible considerations in the two cases: arbitrary placement at the origin versus a force balance depending on external constraints. Yet the Laplacian spectrum and the corresponding eigenvectors may approximate the solution obtained with external constraints if the weights of the internal connections dominate those of the external ones. This relationship between the two formulations can be formalized by rewriting the scalar product between the ith eigenvector, v_i, and the external constraint solution, r, as

v_i r = \frac{v_i B f}{\lambda_i + \beta_i}, \qquad (5.3)

where β_i are the coefficients of the spectral decomposition of D^B in the projection basis.
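Hall's spectral solution can be sketched in a few lines: the one-dimensional layout is the Laplacian eigenvector with the second-smallest eigenvalue. The four-node path graph below is a toy example chosen for illustration; for a path, the recovered coordinates are monotone, so the node order is reproduced exactly.

```python
import numpy as np

A = np.array([[0., 1., 0., 0.],     # path graph 0-1-2-3
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(A.sum(axis=1)) - A      # graph Laplacian
w, V = np.linalg.eigh(L)            # eigenvalues in ascending order
layout = V[:, 1]                    # v2: eigenvector of the second-lowest eigenvalue
```

By construction the layout satisfies both constraints of section 5: eigh returns a unit-norm vector (r^T r = 1), and v_2 is orthogonal to the constant eigenvector, so its entries sum to zero.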
6 Layout of Prefrontal Cortical Areas in Macaque

Cerebral cortex consists of multiple areas whose spatial arrangement and interconnectivity are reproducible from animal to animal. Previous work suggests that the arrangement of cortical areas is determined by minimizing the total length of the interconnections between them (Cherniak, 1994; Mitchison, 1991). Recently, this suggestion was put to a direct test in the macaque prefrontal cortex (Klyachko & Stevens, 2003), where most of the interconnections and the layout of areas are known (Carmichael & Price, 1996; see Figure 1). Brute force enumeration of all possible area layouts shows that the wiring in the actual layout is the shortest (Klyachko & Stevens, 2003). Here we test the quadratic wire length approach on the data set used in Klyachko and Stevens (2003), using both the external and the internal constraints formulations.

In the external constraints formulation, the 14 areas on the periphery of the prefrontal cortex are treated as fixed organs, their locations being the actual ones (the crosses in Figures 1A and 1B). The locations of the remaining 10 areas are continuously varied to minimize the sum of connection lengths squared. Equation 4.5 yields the placement shown in Figure 1B. Although the predicted locations are closer together than in reality, the predicted ordering is close to the actual one. The only exception to the correct ordering is the placement of 14r, which should be lower. Interestingly, this placement corresponds to one of the close-to-optimal placements reported in Klyachko and Stevens (2003). There are two possible explanations for why the areas are predicted to bunch up more than they do in reality. First, the external constraints formulation neglects volume exclusion, that is, the fact that the areas have finite size and cannot overlap. Second, there may be connections between the areas that were considered in Carmichael and Price (1996) and areas that were not considered.
Because these areas lie outside the considered fixed areas, their inclusion would pull apart the movable areas.

The internal constraints formulation applied to the 18 areas included in the prefrontal orbital and medial networks (Carmichael & Price, 1996) yields the arrangement shown in Figure 1C. This placement has approximately the correct ordering of the areas. Moreover, this analysis correctly groups the cortical areas into two clusters distinguished by the sign of the second eigenvector component. These clusters correspond to the known (Carmichael & Price, 1996) subdivisions of the prefrontal cortex into orbital and medial networks. Thus, the predictions of the internal constraints formulation are consistent with anatomical data.

Figure 1: Comparison of the actual cortical area arrangement (A) with the predictions of the quadratic layout optimization under external (B) and internal (C) constraints. (A) Cortical area centers in the coordinate frame of the flattened prefrontal cortex, taken from Klyachko and Stevens (2003) and labeled according to Carmichael and Price (1996). Crosses indicate areas that were fixed in the external constraint formulation, and circles indicate movable areas. (B) Area locations predicted by the external constraint formulation. Blue lines show internal connections (A_{ij}), and red lines external ones (B_{ij}). (C) Area locations predicted by the internal constraint formulation. Areas cluster by the sign of the second eigenvector component (negative versus positive abscissa) in accordance with the known division of the prefrontal cortex into orbital (brown labels) and medial (green labels) networks (Carmichael & Price, 1996).

7 Layout of Ganglia in C. elegans

Neurons in the C. elegans nervous system are clustered into several ganglia distributed along its body. Most connections between the ganglia are known, making this system a natural choice for testing the wiring optimization approach. The layout problem is essentially one-dimensional because of the large aspect ratio (more than 10:1) of the worm body. Brute force enumeration of all permutations of 11 movable components (including the nerve ring in addition to the ganglia; Cherniak, 1994) shows that the actual ordering minimizes the total wire length. Here I show that solving a quadratic placement problem can largely reproduce the actual order of the ganglia.

In the external constraints formulation, the locations of the ganglia are given by equation 4.5, where the wiring diagram and the fixed locations of sensors and muscles are the same as in Cherniak (1994). Figure 2 shows the predicted positions of the C. elegans ganglia versus the actual ones.

Figure 2: Solid dots (connected in the anterior-posterior order) show predicted versus actual positions of C. elegans ganglia normalized by the distance from head to tail. Deviations from the diagonal line correspond to differences in the actual versus predicted ganglia positions. Although predicted ganglia positions differ from the actual ones, their order is predicted correctly, with the exception of the dorsorectal ganglion (actual position: 0.88 of body length).

The predicted order agrees with the actual one with the exception of a single ganglion. This is reasonably good agreement considering that there are 11! alternative orderings. However, the predicted ganglia locations deviate from the actual ones. These deviations may be due to missing information in the wiring diagram (e.g., the lack of neuromuscular connections); deviations of the cost function from the quadratic form; or
other factors that should be included in the cost function. Future work will determine which of these factors are responsible for the disagreement. Internal constraints are unlikely to play a significant role in the placement of the ganglia due to the relative sparseness of the C. elegans nervous system; therefore, I skip this analysis.

8 Discussion

In this letter, I argue that the wire length squared may approximate the wiring cost, thus reducing the optimal layout problem to the constrained minimization of a quadratic form. For two types of constraints, external and internal, exact analytical solutions exist, allowing straightforward and intuitive analysis. To test the quadratic optimization approach, I revisit two known cases of wire length minimization, where previous solutions relied on brute force complete enumeration. Minimization of the wire length squared approximates the actual layouts reasonably well.

One recurring problem with the external constraints formulation is the bunching of graph nodes in the solution. This happens because the number of internal connections usually exceeds that of the external ones. The bunching does not happen in actual brains because of the "volume exclusion" of multineuron complexes, that is, the internal constraints. The spectral method emulates these constraints and eliminates the bunching problem. However, exclusion of the external connections may lead to an overall rotation of the graph or incorrect positioning of some multineuron complexes.

As in any other theoretical analysis, the optimal neuronal layout solution relies on several simplifying assumptions. The central assumption, the quadratic form of the cost function, is supported by the argument in section 2, showing that a quadratic cost function may be a reasonable approximation. The utility of this approximation is due to its exact solvability (see sections 4 and 5).
The validity of this approximation is supported by the fact that its predictions are consistent with experimental data (see sections 7 and 8). Another assumption is that wiring consists of point-to-point (nonbranching) axons. This assumption is valid for connections between cortical areas and can serve as a first-order approximation in other cases. Future work will analyze the impact of axonal branching and the presence of dendrites on brain design. Also, in the real brain, axonal branches are not exactly straight lines. Their curvature is itself a result of internal constraints, which are included here only at the mean-field level (see section 5). A more detailed treatment of internal constraints, or congestion, will be presented elsewhere. Although the quadratic cost function is an approximation, it yields optimal layouts reasonably close to those obtained by minimizing total wiring length in realistic situations. While complete enumeration of possible layouts is limited to a small number of movable components, the quadratic layout problem yields exact solutions in analytical form for wiring diagrams as large as computers can handle. These analytical solutions can be readily and
Exact Solution for the Optimal Neuronal Layout Problem
intuitively investigated, making the inverse problem (predicting connectivity from neuronal layout) more tractable. Since solving this problem may complement the existing experimental methods for establishing neuronal connectivity, the quadratic cost function promises to be an important tool for understanding brain design and function.

Acknowledgments

I thank Natarajan Kannan for bringing Hall's work to my attention and Armen Stepanyants, Anatoli Grinshpan, and all the members of my group for helpful discussions. I am grateful to Charles Stevens for making Klyachko and Stevens (2003) available prior to publication. This work was supported by the Lita Annenberg Hazen Foundation and the David and Lucile Packard Foundation.

References

Attwell, D., & Laughlin, S. B. (2001). An energy budget for signaling in the grey matter of the brain. J. Cereb. Blood Flow Metab., 21, 1133–1145.
Carmichael, S. T., & Price, J. L. (1996). Connectional networks within the orbital and medial prefrontal cortex of macaque monkeys. J. Comp. Neurol., 371, 179–207.
Cherniak, C. (1992). Local optimization of neuron arbors. Biol. Cybern., 66, 503–510.
Cherniak, C. (1994). Component placement optimization in the brain. J. Neurosci., 14, 2418–2427.
Chklovskii, D. B., Schikorski, T., & Stevens, C. F. (2002). Wiring optimization in cortical circuits. Neuron, 34, 341–347.
Chklovskii, D. B., & Stepanyants, A. (2003). Power-law for axon diameters at branch point. BMC Neurosci., 4, 18.
Dickson, B. J. (2002). Molecular mechanisms of axon guidance. Science, 298, 1959–1964.
Durbin, R., & Mitchison, G. (1990). A dimension reduction framework for understanding cortical maps. Nature, 343, 644–647.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Goodhill, G. J., & Sejnowski, T. J. (1997). A unifying objective function for topographic mappings. Neural Computation, 9, 1291–1303.
Goodhill, G. J., & Willshaw, D. J.
(1990). Application of the elastic net algorithm to the formation of ocular dominance stripes. Network, 1, 41–59.
Hall, K. (1970). An r-dimensional quadratic placement algorithm. Management Science, 17, 219–229.
Klyachko, V. A., & Stevens, C. F. (2003). Connectivity optimization and the positioning of cortical areas. Proc. Natl. Acad. Sci. USA, 100, 7937–7941.
Mitchison, G. (1991). Neuronal branching patterns and the economy of cortical wiring. Proc. R. Soc. Lond. B Biol. Sci., 245, 151–158.
Rall, W., Burke, R. E., Holmes, W. R., Jack, J. J., Redman, S. J., & Segev, I. (1992). Matching dendritic neuron models to experimental data. Physiol. Rev., 72, S159–S186.
Ramón y Cajal, S. (1999). Textura del sistema nervioso del hombre y de los vertebrados (Texture of the nervous system of man and the vertebrates). New York: Springer-Verlag. (Original work published 1899.)
Rushton, W. A. (1951). Theory of the effects of fibre size in medullated nerve. J. Physiol., 115, 101–122.
Stepanyants, A., Hof, P. R., & Chklovskii, D. B. (2002). Geometry and structural plasticity of synaptic connectivity. Neuron, 34, 275–288.

Received October 21, 2003; accepted March 17, 2004.
LETTER
Communicated by Dean Buonomano
Decoding a Temporal Population Code

Philipp Knüsel
[email protected]
Reto Wyss
[email protected]
Peter König
[email protected]
Paul F.M.J. Verschure
[email protected]

Institute of Neuroinformatics, University/ETH Zürich, Zürich, Switzerland
Encoding of sensory events in internal states of the brain requires that this information can be decoded by other neural structures. The encoding of sensory events can involve both the spatial organization of neuronal activity and its temporal dynamics. Here we investigate the issue of decoding in the context of a recently proposed encoding scheme: the temporal population code. In this code, the geometric properties of visual stimuli become encoded into the temporal response characteristics of the summed activities of a population of cortical neurons. For its decoding, we evaluate a model based on the structure and dynamics of cortical microcircuits that has been proposed for computations on continuous temporal streams: the liquid state machine. Employing the original proposal of the decoding network results in moderate performance. Our analysis shows that the temporal mixing of subsequent stimuli results in a joint representation that compromises their classification. To overcome this problem, we investigate a number of initialization strategies. We observe that a deterministically initialized network yields the best performance, whereas when the network is never reset, that is, when it continuously processes the sequence of stimuli, classification performance is greatly hampered by the mixing of information from past and present stimuli. We conclude that this problem of the mixing of temporally segregated information is not specific to this particular decoding model but relates to a general problem that any circuit processing continuous streams of temporal information needs to solve. Furthermore, as both the encoding and decoding components of our network have been independently proposed as models of the cerebral cortex, our results suggest that the brain could solve the problem of temporal mixing by applying reset signals at stimulus onset, leading to a temporal segmentation of a continuous input stream.
© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2079–2100 (2004)
1 Introduction

The processing of sensory events by the brain requires the encoding of information in an internal state. This internal state can be represented by the brain using a spatial code, a temporal code, or a combination of both. For further processing, however, this encoded information requires decoding at later stages. Hence, any proposal on how a perceptual system functions must address both the encoding and the decoding aspects. Encoding requires the robust compression of the salient features of a stimulus into a representation that has the essential property of invariance. The decoding stage involves the challenging task of decompressing this invariant and compressed representation into a high-dimensional representation that facilitates further processing steps such as stimulus classification. Here, based on a combination of two independently proposed and complementary encoding and decoding models, we investigate sensory processing and the properties of a decoder in the context of a complex temporal code. Previously we have shown that visual stimuli can be invariantly encoded in a so-called temporal population code (Wyss, König, & Verschure, 2003). This encoding was achieved by projecting the contour of visual stimuli onto a cortical layer of neurons that interact through excitatory lateral couplings. The temporal evolution of the summed activity of this cortical layer, the temporal population code, encodes the stimulus-specific features in the relative spike timing of cortical neurons on a millisecond timescale. Indeed, physiological recordings in area 17 of cat visual cortex support this hypothesis, showing that cortical neurons can produce feature-specific phase lags in their activity (König, Engel, Roelfsema, & Singer, 1995). The encoding of visual stimuli in a temporal population code has a number of advantageous features.
First, it is invariant to stimulus transformation and robust to both network and stimulus noise (Wyss, König, & Verschure, 2003; Wyss, Verschure, & König, 2003). Thus, the temporal population code satisfies the properties of the encoding stage outlined above. Second, it provides a neural substrate for the formation of place fields (Wyss & Verschure, in press). Third, it can be implemented without violating known properties of cortical circuits such as the topology of lateral connectivity and transmission delays (Wyss, König, & Verschure, 2003). Thus, the temporal population code provides a hypothesis on how a cortical system can invariantly encode visual stimuli. Different approaches for decoding temporal information have been suggested (Kolen & Kremer, 2001; Mozer, 1994; Buonomano & Merzenich, 1995; Buonomano, 2000). A recently proposed approach is the so-called liquid state machine (Maass, Natschläger, & Markram, 2002; Maass & Markram, 2003). We evaluate the liquid state machine as a decoding stage since it is a model that aims to explain how cortical microcircuits solve the problem of the continuous processing of temporal information. The general structure of this approach consists of two stages: a transformation and a readout stage.
The transformation stage consists of a neural network, the liquid, which performs real-time computations on time-varying continuous inputs. It is a generic circuit of recurrently connected integrate-and-fire neurons coupled with synapses that show frequency-dependent adaptation (Markram, Wang, & Tsodyks, 1998). This circuit transforms temporal patterns into high-dimensional and purely spatial patterns. A key property of this model is that there is interference between subsequent input signals, so that they are mixed and transformed into a joint representation. As a direct consequence, it is not possible to separate consecutively applied temporal patterns from this spatial representation. The second stage of the liquid state machine is the readout stage, where the spatial representations of the temporal patterns are classified. Whereas most previous studies considered Poisson spike trains as inputs to the liquid state machine, in this article, we investigate the performance of this model in classifying visual stimuli that are represented in a temporal population code. Although the liquid state machine was originally proposed for the processing of continuous temporal inputs, it is unclear how this generalizes to the continuous processing of a sequence of stimuli that are temporally encoded. By analyzing the internal states of the network, we show that in its original setup, it tends to create overlaps among the stimulus classes. This suggests that in order to improve its performance, a reset locked to the onset of a stimulus could be required. We compare different strategies for preparing this network for the presentation of a new stimulus, ranging from random and deterministic initialization strategies to purely continuous processing with no stimulus-triggered resets. We find a large range of classification performance, showing that the no-reset strategy is significantly outperformed by the different types of stimulus-triggered initializations.
Building on these results, we discuss possible implementations of such mechanisms by the brain.

2 Methods

2.1 Temporal Population Code. We analyze the classification of visual stimuli encoded in a temporal population code as produced by the cortical-type network proposed earlier (Wyss, König, & Verschure, 2003). This network consists of 40 × 40 integrate-and-fire cells that are coupled with symmetrically arranged excitatory connections having distance-specific transmission delays. The inputs to this network are artificially generated "visual" patterns (see Figure 1). Each of the 11 stimulus classes consists of 1000 samples. The output of the network (see Figure 2) is the sum of activities recorded during 100 ms with a temporal resolution of 1 ms, that is, a temporal population code. We are exclusively interested in assessing the information in the temporal properties of this code. Thus, each population activity pattern is rescaled such that the peak activity is set to one. The resulting population activity patterns (which we also refer to as temporal activity patterns)
Figure 1: Prototypes of the synthetic "visual" input patterns used to generate the temporal population code. There are 11 different classes where each class is composed of 1000 samples. The resolution of a pattern is 40 × 40 pixels. The prototype pattern of each class is generated by randomly choosing four vertices and connecting them by three to five lines. Given a prototype, 1000 samples are constructed by randomly jittering the location of each vertex using a two-dimensional gaussian distribution (σ = 1.2 pixels for both dimensions). All samples are then passed through an edge detection stage and presented to the network of Wyss, König, & Verschure (2003).
constitute the input to the decoding stage, the liquid state machine (see Figure 3). Based on a large set of synthetic stimuli consisting of 800 classes and using mutual information, we have shown that the information content of the temporal population code is 9.3 bits given a maximum of 9.64 bits (Wyss, König, & Verschure, 2003; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997).
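The construction of a temporal activity pattern described in section 2.1 can be sketched as follows. The random raster below merely stands in for the spiking output of the 40 × 40 cortical-type network, which is not reproduced in this excerpt; only the summing and peak rescaling follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the cortical layer output: binary spike indicators for
# 40 x 40 = 1600 cells over 100 ms at 1 ms resolution.
spikes = rng.random((100, 40 * 40)) < 0.05          # (time, cells)

# Temporal population code: the number of active cells per time step...
population_activity = spikes.sum(axis=1).astype(float)

# ...rescaled so that the peak activity equals one, discarding the
# absolute firing level and keeping only the temporal shape.
pattern = population_activity / population_activity.max()
```

The rescaled 100-sample trace is what enters the liquid state machine as one temporal activity pattern.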
Figure 2: Temporal population code of the 11 stimulus classes. Shown are the mean traces of the population activity patterns encoding the number of active cells as a function of time (1 ms temporal resolution, 100 ms length) after rescaling.
2.2 Implementation of the Liquid State Machine. The implementation of the liquid state machine evaluated here, including the readout configuration, is closely based on the original proposal (Maass et al., 2002; see the appendix). The liquid is formed by 12 × 12 × 5 = 720 leaky integrate-and-fire neurons (the liquid cells) that are located on the integer points of a cubic lattice, where 30% randomly chosen liquid cells receive input, and 20% randomly chosen liquid cells are inhibitory (see Figure 3). The simulation parameters of the liquid cells are given in Table 1. The probability of a synaptic connection between two liquid cells located at a and b is given by a gaussian distribution, p(a, b) = C · exp(−(|a − b|/λ)^2), where |.| is the Euclidean norm in R^3 and C and λ are constants (see Table 2). The synapses connecting the liquid cells show frequency-dependent adaptation (Markram et al., 1998; see the appendix).
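The distance-dependent connectivity rule can be sketched directly from the formula above. λ = 2 matches Table 2; the single C value is illustrative, whereas the paper draws C per synapse type.

```python
import numpy as np

rng = np.random.default_rng(1)

# Liquid cells on the integer points of a 12 x 12 x 5 cubic lattice.
coords = np.array([(x, y, z) for x in range(12)
                             for y in range(12)
                             for z in range(5)], dtype=float)
n = len(coords)                       # 720 liquid cells

C, lam = 0.3, 2.0                     # C illustrative; lam from Table 2

# p(a, b) = C * exp(-(|a - b| / lambda)^2), with |.| the Euclidean norm.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
p = C * np.exp(-(d / lam) ** 2)
np.fill_diagonal(p, 0.0)              # no self-connections

connected = rng.random((n, n)) < p    # one sampled connectivity matrix
```

Nearby cells are connected with probability close to C, while the probability falls off rapidly beyond a few lattice spacings.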
Figure 3: General structure of the implementation of the liquid state machine. A single input node provides a continuous stream to the liquid that consists of recurrently connected integrate-and-fire neurons that are fully connected with 11 readout groups. Each of the readout groups consists of 36 integrate-and-fire neurons. Weights of the synaptic connections projecting to the readout groups are trained using a supervised learning rule.

Table 1: Simulation Parameters of the Neurons of the Liquid.

Name                     Symbol    Value
Background current       I_bg      13.5 nA
Leak conductance         g_leak    1 µS
Membrane time constant   τ_mem     30 ms
Threshold potential      v_θ       15 mV
Reset potential          v_reset   13.5 mV
Refractory period        t_refr    3 ms

Note: The parameters are identical to Maass et al. (2002).
The readout mechanism is composed of 11 neuronal groups consisting of 36 integrate-and-fire neurons with a membrane time constant of 30 ms (see Figure 3 and the appendix). All readout neurons receive input from the liquid cells and are trained to classify a temporal activity pattern at a specific point in time after stimulus onset, tL . Thus, training occurs only once during the presentation of an input. A readout cell fires if and only if its membrane potential is above threshold at t = tL ; that is, the readout cell is not allowed to fire at earlier times. This readout setup is comparable to the original proposal of the liquid state machine (Maass et al., 2002). Each readout group represents a response class, and the readout group with the highest number of firing cells is the selected response class. Input classes are mapped to response classes by changing the synapses projecting from the
Table 2: Simulation Parameters of the Synapses Connecting the Liquid Cells.

Name                                     Symbol   EE       EI       IE       II
Average length of connections            λ        2 (independent of neuron type)
Maximal connection probability           C        0.4      0.2      0.5      0.1
Postsynaptic current time constant       τ_syn    3 ms     3 ms     6 ms     6 ms
Synaptic efficacy (weight)               w_liq    20 nA    40 nA    19 nA    19 nA
Utilization of synaptic efficacy         U        0.5      0.05     0.25     0.32
Recovery from depression time constant   τ_rec    1.1 s    0.125 s  0.7 s    0.144 s
Facilitation time constant               τ_fac    0.05 s   1.2 s    0.02 s   0.06 s

Notes: The neuron type is abbreviated with E for excitatory and I for inhibitory neurons. The values of w_liq, U, τ_rec, and τ_fac are taken from a gaussian distribution of which the mean values are given in the table. The standard deviation of the distribution of the synaptic efficacy is equal to the mean value, and it is half of the mean value for the last three parameters. The parameters are identical to Maass et al. (2002).
liquid onto the readout groups. A supervised learning rule changes these synaptic weights only when the selected response class is incorrect (see the appendix). In this case, the weights of the synapses to firing cells of the incorrect response class are weakened, whereas those to the inactive cells of the correct response class are strengthened. As a result, the firing probability of cells in the former group, given this input, is reduced while that of the latter is increased. The synapses evolve according to a simplified version of the learning rule proposed in Maass et al. (2002) and Auer, Burgsteiner, and Maass (2001), the main difference being that the clear margin term has been ignored. (Control experiments have shown that this had no impact on the performance.) The 1000 stimulus samples of each class are divided into a training and test set of 500 samples each. The simulation process is split into two stages. In the first stage, the synaptic weights are updated while all training samples are presented in a completely random order until the training process converges. In the second stage, the training and test performance of the network is assessed. Again, the sequence of the samples is random, and each sample is presented only once. In both stages, the samples are presented as a continuous sequence of temporal activity patterns where each stimulus is started exactly after the preceding one. Regarding the initialization of the network, any method used can reset either the neurons (membrane potential) or the synapses (synaptic utilization and fraction of available synaptic efficacy), or both. A reset of any of those components of the network can be deterministic or random. 
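The readout update described above can be sketched in simplified form. The membrane dynamics are abstracted into a thresholded dot product evaluated at t = t_L, and the learning rate and threshold are assumed values, not parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

n_liquid, n_groups, cells_per_group = 720, 11, 36
eta, theta = 0.01, 1.0        # assumed learning rate and firing threshold

# weights[g]: projection from the liquid onto readout group g,
# one row per readout cell.
weights = rng.normal(0.0, 0.05, (n_groups, cells_per_group, n_liquid))

def train_step(liquid_state, target_class):
    """One application of the simplified supervised readout rule."""
    potentials = weights @ liquid_state          # (groups, cells)
    firing = potentials > theta                  # cells above threshold at t_L
    selected = firing.sum(axis=1).argmax()       # group with most firing cells
    if selected != target_class:
        # Weaken synapses onto firing cells of the incorrect winner...
        weights[selected][firing[selected]] -= eta * liquid_state
        # ...and strengthen synapses onto inactive cells of the correct class.
        weights[target_class][~firing[target_class]] += eta * liquid_state
    return selected
```

Repeated presentations of a liquid state with its class label drive the correct group's firing count up while suppressing competing winners, so the selected response class converges to the target.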
Combining some of these constraints, we apply five different methods to initialize the network at stimulus onset: entire-hard-reset, partial-hard-reset, entire-random-reset (control condition), partial-random-reset (as used in Maass et al., 2002; Maass, Natschläger, & Markram, 2003) and no-reset (see Table 3 for
Table 3: Initialization Values of the Liquid Variables: Membrane Potential, Synaptic Utilization, and Fraction of Available Synaptic Efficacy.

Reset Method           Membrane Potential    Synaptic Utilization    Fraction of Available Synaptic Efficacy
Entire-hard-reset      13.5 mV               U                       1
Partial-hard-reset     13.5 mV               –                       –
Entire-random-reset    [13.5 mV, 15 mV]      [0, U]                  [0, 1]
Partial-random-reset   [13.5 mV, 15 mV]      –                       –
No-reset               –                     –                       –

Notes: Five different methods are used to initialize these variables. The symbol [ , ] denotes initialization values drawn from a uniform distribution within the given interval.
the corresponding initialization values). Whereas only the neurons are initialized by means of the partial reset, the entire reset initializes both the neurons and the synapses. The initialization values are deterministic with the hard-reset methods, and they are random with the random-reset methods. The random initialization is used to approximate the history of past inputs. The validity of this approximation will be controlled below. Finally, the network is not reset in case of the no-reset method.

2.3 Liquid State and Macroscopic Liquid Properties. The state of the network is formally defined as follows: Let z(t) be a time-dependent vector that represents the active cells at time t in the network with a 1 and all inactive cells with a 0. We call z ∈ R^p the liquid output vector (with p the number of liquid cells). The liquid state vector z̃ (usually called simply the liquid state) is defined as the component-wise low-pass-filtered liquid output vector using a time constant of τ = 30 ms.

We introduce three macroscopic liquid properties. In all of the following equations, z̃_ijk ∈ R^p denotes the liquid state after the kth presentation of sample j from class i, where i = 1, ..., n, j = 1, ..., m, and k = 1, ..., r, with n the number of classes, m the number of samples per class, r the number of presentations of the same sample, and p the number of liquid cells. For simplicity, we omit the time dependence in the following definitions. We compute a principal component analysis by considering all the vectors z̃_ijk as n · m · r realizations of a p-dimensional random vector. Based on the new coordinates ẑ_ijk of the liquid state vectors in the principal component system, the macroscopic liquid properties are defined. The center of class i, c_i, and the center of a sample j from class i, s_ij, are defined as the average values of the appropriate liquid state vectors:

$$c_i = \frac{1}{mr} \sum_{j=1}^{m} \sum_{k=1}^{r} \hat{z}_{ijk}$$

$$s_{ij} = \frac{1}{r} \sum_{k=1}^{r} \hat{z}_{ijk}.$$

Since these vectors are defined as average values over several presentations of the same sample, the liquid noise (see below) is not contained in these values if the number of repetitions r is large enough. The liquid-noise σ^liq is defined as the average value of the vectorial standard deviation (the standard deviation is computed for each component separately) over all presentations of a sample,

$$\sigma^{liq} = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{std}_k(\hat{z}_{ijk}),$$

and can be interpreted as the average scattering of a sample around its center s_ij. The average distance vector between the centers of all classes, the liquid-class-distance d_C^liq, is defined as

$$d_C^{liq} = \frac{2}{n(n-1)} \sum_{i<j}^{1,\ldots,n} |c_i - c_j|,$$

where |·| is the absolute value. The liquid-sample-distance d_S^liq is defined as the vectorial standard deviation of the sample centers of one class, averaged over all classes,

$$d_S^{liq} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{std}_j(s_{ij}),$$
where the subscript S stands for sample.

3 Results

In the first experiment, we investigate the performance of the liquid state machine in classifying the temporal activity patterns by initializing the network according to the control condition (entire-random-reset; see section 2). The readout cell groups are trained to classify a sample at 100 ms after stimulus onset. We run 10 simulations, each using the complete training and testing data sets. Each simulation comprises a network in which the synaptic arborization and the parameters controlling the synaptic dynamics are randomly initialized (see Table 2). We find that after training, 60.6 ± 2% (mean ± standard deviation) of the training samples and 60.2 ± 2% of the test samples are classified correctly. The corresponding values of the mutual information
between stimulus and response class are 1.725 ± 0.056 bits using the training data set and 1.696 ± 0.053 bits using the test data set. The maximum value of the mutual information is log2(11) ≈ 3.459 bits. Thus, although the generalization capability of the network is excellent (the performances on the test and training sets are virtually identical), it achieves only a moderate overall performance compared to a statistical clustering of the temporal activity patterns, which yields 83.8% correct classifications (Wyss, König, & Verschure, 2003). In order to elucidate the mechanisms responsible for this moderate performance, we take a closer look at how the temporal activity patterns are represented in the network. Since we always triggered the training of the readout cell groups at 100 ms after stimulus onset, we are particularly interested in the liquid state (see section 2) at this point. Because the liquid state is high-dimensional, we employ a principal component analysis to investigate the representation of the temporal activity patterns in the network. The first 50 samples of each class are presented 20 times to the network, which results in 20 repeats per sample × 50 samples per class × 11 classes = 11,000 liquid state vectors. Each of these 720-dimensional vectors is considered as a realization of 720 random variables. On these data, a principal component analysis is applied. Based on the new coordinates of the liquid states in the principal component system, we compute the three macroscopic liquid properties: the liquid-class-distance, the liquid-sample-distance, and the liquid-noise (see section 2). For the projection of the liquid states onto each principal component, these three properties describe the average distance between the centers of the classes, the average distance between the centers of the samples of one class, and the average variability of the liquid states of one particular sample.
Thus, by means of the liquid-sample-distance and the liquid-noise, the extent of all samples of one class along each principal axis can be assessed. This extent is bounded below by the average distance between the samples of one class (the liquid-sample-distance) and above by the sum of this distance and the liquid-noise, the average variability of the liquid states of one sample. Hence, the projection of the liquid states of different classes onto a principal component is separated if the corresponding liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise. Conversely, the projection of liquid states onto a particular principal component overlaps if the corresponding liquid-sample-distance is greater than the liquid-class-distance. On the basis of this interpretation of the macroscopic liquid properties, we are able to quantitatively assess the separation among the classes. The above analysis of the liquid states is summarized in Figure 4. First, we find that the liquid-noise exceeds the liquid-class- and the liquid-sample-distance for dimensions greater than or equal to 26. Thus, there is little stimulus- or class-specific information but mostly noise in these components. Second, the liquid-sample-distance is greater than the liquid-class-distance for all dimensions greater than or equal to 5; the liquid states of
Figure 4: Liquid state distances versus principal component dimensions. The network is initialized using the entire-random-reset method. The solid line shows the liquid-class-distance, the dashed line the liquid-sample-distance, the dotted line the liquid-noise, and the dash-dotted line the sum of the liquid-sample-distance and the liquid-noise. For dimensions greater than 26, the liquid-noise is greater than the liquid-sample-distance, which is greater than the liquid-class-distance. For dimensions 1 to 4, the liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise.
different classes overlap for these dimensions. Third, for dimensions less than or equal to 4, the liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise. As a result of this, the liquid states of different classes have little overlap for these dimensions. Fourth, as a consequence of the third point, the liquid-class-distance is also greater than the liquid-sample-distance for dimensions between 1 and 4. Given these macroscopic liquid properties, we can conclude, from the third observation, that the projection of the liquid states onto principal components 1 to 4 has little overlap. Therefore, class-specific information can be found only in the first four principal components. This finding is somewhat surprising, given the dimensionality of the temporal population code, which is of the order of 20 (Wyss, Verschure, & König, 2003), and considering that the liquid states were originally proposed to provide a very high-dimensional representation of the input (Maass et al., 2002). Finally, it follows from the second observation that the liquid states projected onto principal components greater than or equal to 5 do not carry class-specific information, with or without liquid-noise. Therefore, the liquid state machine appears to encode the stimuli into a low-dimensional representation.
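The macroscopic liquid properties and the separation criterion used in this analysis can be sketched on synthetic data; the hypothetical class/sample/noise hierarchy below stands in for actual liquid states.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy ensemble: n classes, m samples per class, r repeats, p liquid cells.
n, m, r, p = 4, 10, 20, 50
class_part = rng.normal(0.0, 1.0, (n, 1, 1, p))      # class structure
sample_part = rng.normal(0.0, 0.3, (n, m, 1, p))     # per-sample offsets
noise_part = rng.normal(0.0, 0.1, (n, m, r, p))      # per-repeat noise
z = class_part + sample_part + noise_part            # liquid states z_ijk

# PCA over all n*m*r liquid states.
flat = z.reshape(-1, p)
flat = flat - flat.mean(axis=0)
_, _, components = np.linalg.svd(flat, full_matrices=False)
z_hat = (flat @ components.T).reshape(n, m, r, p)    # PCA coordinates

s = z_hat.mean(axis=2)                 # sample centers s_ij, shape (n, m, p)
c = z_hat.mean(axis=(1, 2))            # class centers c_i, shape (n, p)

sigma_liq = z_hat.std(axis=2).mean(axis=(0, 1))      # liquid-noise
d_S = s.std(axis=1).mean(axis=0)                     # liquid-sample-distance
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
d_C = np.mean([np.abs(c[i] - c[j]) for i, j in pairs], axis=0)  # class-distance

# A principal component separates the classes when the class distance
# exceeds the sample distance plus the noise there.
separated = d_C > d_S + sigma_liq
```

With class structure dominating the variance, only the leading principal components satisfy the criterion, mirroring the few informative dimensions found in the text.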
Taking the above observations into account, how can the liquid state machine be modified in order to increase its ability to separate the stimulus classes? Since the liquid state machine does not a priori contain class-specific information (that is, class-specific features cannot be detected), the liquid-class- and the liquid-sample-distance cannot be changed independently. Thus, it is not possible to selectively increase the liquid-class-distance while decreasing the liquid-sample-distance. However, as a result of the entire-random-reset method used to initialize the network, the liquid-noise is independent of the stimuli and could be eliminated by resetting the liquid variables to predefined values at stimulus onset. According to the macroscopic liquid properties, this would therefore lead to an increased separation between the classes, which improves the classification performance of the liquid state machine. We examine the classification performance of the liquid state machine using four reset methods: the entire-hard-, partial-hard-, partial-random-, and no-reset methods (see section 2 and Table 3). We use the same experimental protocol as above, and the results are summarized in Figure 5. First, initializing the network with the entire-hard-reset method yields a better performance than with the entire-random-reset method, as predicted above. Quantitatively, this amounts to approximately a 10% increase in performance. Second, comparing the performance of the entire-hard-/entire-random-reset method to its partial counterpart, we find that initializing the network with the partial-hard-/partial-random-reset method results in a higher performance (see Figure 5). Employing a two-way ANOVA on the classification performance using the results of the testing data set, we find for α = 0.01, p_entire/partial ≈ 0.0002, p_hard/random ≈ 4 · 10^−15, and p_interaction ≈ 0.11.
Thus, both entire versus partial as well as hard versus random resets result in significant differences in the average classification performance and the mutual information. The only difference between the partial and the entire reset is that the former does not reset the synapses (see Table 3); that is, the synaptic utilization and the available fraction of synaptic efficacy are never reset. Thus, this difference has to account for the observed improvement of the classification performance. Third, using the no-reset method, the network yields a performance that is significantly lower than that of a network initialized with any other reset method (for instance, performance comparison of entire-random-reset and no-reset, t-test of mutual information of testing data set, α = 0.01, p ≈ 2 · 10^−16). Thus, resetting the network is required to achieve a satisfactory classification performance. We investigate in more detail the performance difference yielded by the entire and the partial reset methods. As we found above, entire and partial reset render approximately the same performance. Since the only difference between them is that the synapses are not reset in the case of the partial reset method, this suggests that the synaptic short-term plasticity has no effect on the performance of the network. Consequently, the decoding of the temporal activity pattern would be a result of the membrane time constant only.
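The frequency-dependent synapses whose role is probed here follow the Markram et al. (1998) model; the per-spike update below is the commonly used discrete-time form of that model, assumed rather than quoted from this paper, with parameter values taken from the EE and EI columns of Table 2.

```python
import numpy as np

def synapse_amplitudes(spike_times, U, tau_rec, tau_fac, w):
    """Per-spike response amplitudes of a dynamic synapse
    (discrete-time Tsodyks-Markram update)."""
    u, R = 0.0, 1.0                 # utilization and available resources
    prev, amps = None, []
    for t in spike_times:
        if prev is not None:
            dt = t - prev
            u *= np.exp(-dt / tau_fac)                   # facilitation decays
            R = 1.0 - (1.0 - R) * np.exp(-dt / tau_rec)  # resources recover
        u = u + U * (1.0 - u)       # utilization jumps at each spike
        amps.append(w * u * R)      # released fraction sets the response
        R = R * (1.0 - u)           # depression: resources are consumed
        prev = t
    return np.array(amps)

# EE synapses (U = 0.5, tau_rec = 1.1 s) depress at 20 Hz...
ee = synapse_amplitudes([0.0, 0.05, 0.10, 0.15], 0.5, 1.1, 0.05, 20.0)
# ...while EI synapses (U = 0.05, tau_fac = 1.2 s) initially facilitate.
ei = synapse_amplitudes([0.0, 0.05, 0.10], 0.05, 0.125, 1.2, 40.0)
```

A recovery time constant that is long relative to the interspike interval produces the depression that, per the text, balances the amount of excitation in the liquid.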
Decoding a Temporal Population Code
2091
Figure 5: Evaluation of different reset mechanisms. (a) Classification performance and (b) mutual information of the readout cell groups trained 100 ms after stimulus onset with the input classes. Five different initialization methods are used (see section 2). Gray and white bars show the performance for the training and test data set, respectively, and 10 simulations are run per reset condition.
2092
P. Knüsel, R. Wyss, P. König, and P. Verschure
Figure 6: Liquid state distances versus principal component dimensions. The network is initialized using the no-reset method. The solid line shows the liquid-class-distance, the dashed line the liquid-sample-distance, the dotted line the liquid-noise, and the dash-dotted line the sum of the liquid-sample-distance and the liquid-noise. The liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise only for dimensions 2 and 6. All other dimensions are dominated by the liquid-noise.
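The macroscopic liquid properties compared in Figure 6 can be illustrated with a small computation on liquid-state vectors. The sketch below assumes one plausible reading of the quantities (the precise definitions are given in section 2 of the paper, which is not reproduced here): the liquid-class-distance as the mean pairwise distance between class-mean states and the liquid-sample-distance as the mean distance of samples from their own class mean. All names are illustrative:

```python
import math

def euclid(a, b):
    """Euclidean distance between two state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_state(states):
    """Componentwise mean of a list of state vectors."""
    n = len(states)
    return [sum(s[d] for s in states) / n for d in range(len(states[0]))]

def class_and_sample_distance(states_by_class):
    """states_by_class: one list of liquid-state vectors per class.
    Returns (mean pairwise distance between class means,
             mean distance of samples from their own class mean)."""
    means = [mean_state(c) for c in states_by_class]
    pair = [euclid(means[i], means[j])
            for i in range(len(means)) for j in range(i + 1, len(means))]
    class_dist = sum(pair) / len(pair)
    within = [euclid(s, m)
              for c, m in zip(states_by_class, means) for s in c]
    sample_dist = sum(within) / len(within)
    return class_dist, sample_dist

# Two well-separated toy classes in a 2-dimensional state space:
cd, sd = class_and_sample_distance(
    [[[0.0, 0.0], [0.2, 0.0]], [[5.0, 0.0], [5.2, 0.0]]])
```

Good separability in this picture corresponds to the class distance exceeding the sample distance (plus noise), as in dimensions 2 and 6 of Figure 6.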
Hence, we effectively remove synaptic short-term depression by setting the recovery time constant, τrec, for all synapse types to 1 ms. This results in a training and testing performance of 10.0 ± 2.8%, which is almost chance level. A further analysis of this very low performance reveals that it is caused by the saturation of activity within the network. Thus, synaptic short-term depression is required for the proper operation of the network, as it balances the amount of excitation. Since a reset of the network has a large effect on its classification performance, we again explore the representation of the temporal activity patterns in the network in order to explain this effect quantitatively (see Figure 6). Here, however, we use the no-reset method and record the liquid states 100 ms after stimulus onset. We apply the same analysis as before to plot the three macroscopic liquid properties versus the principal components (see section 2 and Figure 4). This analysis shows that the liquid-class-distance is greater than the sum of the liquid-sample-distance and the liquid-noise only for dimensions 2 and 6. As this difference is only marginal for dimension 6, virtually only the projections of the liquid states onto principal component 2 have a small overlap. Hence, only the second principal component carries class-specific information. Comparing this result with the previous analysis
of the liquid states obtained using the entire-random-reset method (see Figure 4), we find that not resetting the network results in an enormous increase in the overlap of the liquid states between different classes. Thus, in the case of the no-reset method, there is a rather critical dependence between the initial state of the network at stimulus onset and the liquid state recorded after stimulus presentation. In all previous experiments, we trained the readout cell groups exactly 100 ms after stimulus onset. Since it has been shown that the information encoded in the temporal population code increases rapidly over time, with 60% of the total information already available after 25 ms (Wyss, König, & Verschure, 2003), it is of interest to investigate how fast the classification performance of the liquid state machine rises. Moreover, it is unclear from the previous experiments whether the classification performance is better at earlier times after stimulus onset. In the following experiment, this is examined by training the readout cell groups at one particular time between 2 and 100 ms with respect to stimulus onset. The network is initialized using the entire-hard-reset, the entire-random-reset, or the no-reset method. For each fixed training and testing time and initialization method, 10 simulations are performed (as in previous experiments). The results depicted in Figure 7 show that up to 26 ms after stimulus onset, the classification performance stays at chance level (0.09), with 0 bits of mutual information, for both training and testing. Thus, the first two bursts of the temporal activity pattern do not give rise to class-specific information in the network. The best performance is achieved by initializing the network with the entire-hard-reset method, whereas the no-reset method again results in the lowest classification performance.
As already shown in Wyss, König, and Verschure (2003), here we also find a rapid increase in the classification performance (see Figure 7). The performance does not increase after 55 ms but rather fluctuates around a maximum level. Consequently, processing longer temporal activity patterns does not augment the mutual information or the classification performance.

4 Discussion

In this study, we investigated the constraints on the continuous-time processing of temporally encoded information using two complementary networks: the encoding network compresses its spatial inputs into a temporal code by virtue of highly structured lateral connections (Wyss, König, & Verschure, 2003), while the decoding network decompresses its input into a high-dimensional space by virtue of unstructured lateral connections (Maass et al., 2002). Our analysis of the decoding network showed that it did not sufficiently separate the different stimulus classes. We investigated different strategies to reset the decoding network before stimulus presentation. While resetting the network leads to a maximal performance of 75.2%, the no-reset method performs dramatically below the other methods investigated: 35.4 ± 1.9% correct classifications. A quantitative analysis showed that this performance difference is caused by overlaps of the classes in the high-dimensional space of the decoding network. Thus, in order to successfully decode and classify temporal activity patterns with the liquid state machine, the latter needs to be clocked by the presentation of a new stimulus as opposed to using a true continuous mode of operation.

Figure 7: Performance of the liquid state machine at varying temporal length of the temporal population code using different reset methods. (a, b) Classification performance and (c, d) mutual information during training and testing as a function of the time of training and testing (chance level in a and b is 1/11 ≈ 0.091). Up to 50 ms, the performance shows an oscillation, which results from the strong onset response in the temporal population code.

The liquid state machine has been successfully applied to the classification of a variety of temporal patterns (Maass et al., 2002; Maass & Markram, 2003). In this study, we investigated yet another type of stimulus: a temporal population code. While our input stimuli are continuously varying, most other studies consider spike trains. Given the maximal performance of 75.2% correct classifications of the test stimuli, however, we believe that in principle, the liquid state machine is capable of processing this kind of temporal activity pattern. Since the liquid state machine was put forward as a general and
biologically based model for computations on continuous temporal inputs, it should be able to handle these kinds of stimuli. The liquid state machine was originally proposed for processing continuous streams of temporal information. This is a very difficult task, as any decoder of temporal information has to maintain an internal state of previously applied inputs. However, continuous streams of information can often be divided into short sequences (that is, temporally confined stimuli). Knowledge of the onset of a new stimulus would certainly be beneficial for such a network, as the single stimuli could be processed separately and the network could be specifically initialized and reset prior to their presentation. Thus, as opposed to a regime where a continuous stream of information is processed, interference between stimuli in the internal state of the network could be avoided, and the network should therefore perform better. However, while a performance difference between continuous and stimulus-triggered processing of temporal information is intuitive, it is unclear how large its effect on the performance and on the internal representation of the information in the decoding network would be. Moreover, in previous work on liquid state machines, this difference was not assessed (Maass et al., 2002, 2003; Maass & Markram, 2003; Legenstein, Markram, & Maass, 2003). Here, we quantitatively investigated this difference in the context of the temporal population code, where the input is not a continuous stream but is composed of temporally confined stimuli. The initial hypothesis was that the decoding network can process a continuous stream of temporal activity patterns generated by the encoding network. We found, however, that for the decoding network to perform reasonably, it required a reset of its internal states at stimulus onset. The resulting percentage of correctly classified stimuli practically doubled for both the hard- and the random-reset methods.
A mathematical analysis revealed a critical dependence between the initial state of the network at stimulus onset and its internal state after stimulus presentation. Whereas this dependence is fairly low for any reset method, not resetting the network drastically increases it, which results in much larger overlaps of the internal states between different stimulus classes. Our analysis suggests that although the mixing of previously temporally segregated information is of central importance for the proper operation of the liquid state machine, the mixing of information across stimuli leads to an inevitable degradation of its classification performance and of the internal representation of the stimuli. In the original study, the decoding network was actually initialized with a method similar to the partial-random-reset method used here (Maass et al., 2002, 2003). This raises the question of whether the liquid state machine operated in a true continuous mode in the cited studies. In conclusion, our results suggest that a reset mechanism is an essential component of the proposed encoding-decoding system. Any reset system consists of two components: a signal that mediates the onset of a stimulus and a mechanism, triggered by this signal, that allows resetting the components of the processing substrate. Potential candidate
neural mechanisms for signaling a new stimulus are the thalamic suppression found during saccades (Ramcharan, Gnadt, & Sherman, 2001) or the hyperpolarization observed in the projection neurons of the antennal lobe of moths coinciding with the presentation of a new odor (Heinbockel, Christensen, & Hildebrand, 1999). The temporal population code naturally generates such a signal, characterized by a pronounced first burst that can easily be detected and used to trigger a reset mechanism. Regarding reset mechanisms, we have distinguished several approaches that differ mainly in how they could be implemented by the brain. While the hard-reset method could be implemented by strongly inhibiting all cells in the processing substrate, implementing the random-reset method appears difficult. It could possibly be approximated by driving the processing substrate into a transient chaotic state, which could be achieved by selectively increasing and decreasing the synaptic transmission strength in a short time window after stimulus onset. This approach has similarities with simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) as well as with the annealing mechanism presented in Verschure (1991), where chaotic behavior in a neural network is attained by adaptively changing the learning rate. Comparing the classification performance with and without resetting the synapses (entire versus partial reset) reveals that the latter outperforms the former. Thus, not resetting the synapses is an advantage rather than a shortcoming of the proposed mechanisms. Furthermore, these considerations suggest that such a reset system could be implemented in a biologically plausible way. From a general point of view, not only the liquid state machine but any decoder of continuous streams of temporal information has to maintain previously applied inputs in an internal state. Thus, inputs applied at temporally segregated times are mixed into a joint internal state.
Our results demonstrate that in the absence of a stimulus-locked reset of this internal state, the effect of mixing strongly degrades the specificity of this internal state, which results in a significant decrease of the network's performance. Thus, since the liquid state machine is seen as a model of cortical microcircuits, this raises the question of how these circuits solve the problem of the mixing of temporally segregated information. On the basis of our results, we predict that it is solved by employing stimulus-onset-specific reset signals that minimize the mixing of information from past and present stimuli. Although some evidence exists that could support this hypothesis, further experimental work is required to identify whether the brain makes use of a form of temporal segmentation to divide a continuous input stream into smaller "chunks" that are processed separately.

Appendix: Details of the Implementation

We consider the time course of a temporal activity pattern of the encoding network, Iinp(t), as a synaptic input current to the decoding network. This
current is arbitrarily normalized to a maximal value of 1 nA. The dimensionless weights of the input synapses, winp, are chosen from a gaussian distribution with mean value and standard deviation of 90 that is truncated at zero to avoid negative values. As only 30% of the liquid cells receive input from the temporal activity pattern (see section 2), a random subset of 70% of the input weights is set to zero. The recurrent synapses in the liquid show short-term plasticity (Markram et al., 1998). Let Δt be the time between the (n − 1)th and the nth spike in a spike train terminating on a synapse; then un, which is the running value of the utilization of synaptic efficacy, U, follows

un = un−1 e^(−Δt/τfac) + U (1 − un−1 e^(−Δt/τfac)),   (A.1)

where τfac is the facilitation time constant. The available synaptic efficacy, Rn, is updated according to
Rn = Rn−1 (1 − un) e^(−Δt/τrec) + 1 − e^(−Δt/τrec),   (A.2)

where τrec is the recovery-from-depression time constant. The peak synaptic current, Îsyn, is defined as

Îsyn = wliq Rn un,   (A.3)
where wliq is the weight of the synapses connecting the liquid cells. The excitatory and inhibitory postsynaptic currents, Isyn,exc(t) and Isyn,inh(t), are given by an alpha function,

Isyn,x(t) = Îsyn (e/τsyn,x) t e^(−t/τsyn,x),   (A.4)
where the subscript x stands for exc or inh, Îsyn is the peak synaptic current (see equation A.3), and τsyn,x is the time constant of the postsynaptic potential. Finally, the connection probability, p(a, b), of two liquid cells located at the integer points a and b of a cubic lattice follows a gaussian distribution,

p(a, b) = C · e^(−(|a−b|/λ)²),   (A.5)

where |·| is the Euclidean norm in R³ and C and λ are constants. The values of all the synaptic parameters listed above are given in Table 2.

The liquid cells are simulated as leaky integrate-and-fire neurons. The membrane potential, v(t), is updated according to

v(t + dt) = v(t) + (dt/(τmem gleak)) · (Ibg + winp Iinp(t) + Isyn,exc(t) − Isyn,inh(t) − gleak v(t)),   (A.6)
where dt is the simulation time step, τmem the membrane time constant, gleak the leak conductance, Ibg the background current, winp Iinp(t) the synaptic input current from the temporal activity pattern, and Isyn,exc(t) and Isyn,inh(t) the synaptic currents (see equation A.4). If v(t) > vθ, that is, if the membrane potential is greater than the threshold potential, a spike is generated, v(t) is set to the reset potential, vreset, and the neuron is quiescent until the refractory period of duration trefr has elapsed. The values of the parameters listed above are given in Table 1.

The readout neurons are simulated as leaky integrate-and-fire neurons. Let i = 1, . . . , I be the index of a readout group (I = 11), j = 1, . . . , J the index of a readout neuron in group i (J = 36), and k = 1, . . . , K the index of a liquid neuron (K = 720). Then the membrane potential of readout neuron j of readout group i, rij(t), follows

rij(t + dt) = rij(t) + (dt/τmem,R) (rij,syn(t) − rij(t)),   (A.7)

where dt is the simulation time step, τmem,R = 30 ms the readout neuron membrane time constant, and rij,syn(t) the postsynaptic potential given by

rij,syn(t) = Σ_{k=1}^{K} s gijk ak(t).   (A.8)
s = 0.03 is an arbitrary and constant scaling factor, gijk are the synaptic weights from liquid cell k to readout neuron j of readout group i, and ak(t) is the activity of liquid cell k, which is 1 if the liquid cell fired an action potential at time t and 0 otherwise. A readout cell may fire only if its membrane potential is above threshold, rθ = 20 mV, at t = tL, that is, rij(tL) > rθ; tL is a specific point in time after stimulus onset. After a spike, the readout cell membrane potential, rij, is reset to 0 mV, and the readout cell response, qij, is set to 1 (qij is zero otherwise). The readout group response, qi, of readout group i is then

qi = Σ_{j=1}^{J} qij.   (A.9)
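The threshold-and-reset dynamics described above (equation A.6 plus the spike rule) are standard leaky integrate-and-fire dynamics and can be sketched with a simple Euler loop. In the sketch below, the background and synaptic currents are lumped into one constant drive, and all parameter values (g_leak, v_theta, t_refr, the 1 nA input) are illustrative placeholders, not the values of Table 1:

```python
def simulate_lif(i_input, t_max=200.0, dt=0.1, tau_mem=30.0, g_leak=0.05,
                 v_theta=15.0, v_reset=0.0, t_refr=2.0):
    """Euler integration of the liquid-cell update (equation A.6) for a
    constant lumped input current. Units: current in nA, g_leak in
    microsiemens, voltage in mV, time in ms (illustrative values only)."""
    v = v_reset
    t = 0.0
    t_last_spike = -1e9
    spikes = []
    while t < t_max:
        if t - t_last_spike >= t_refr:      # quiescent during the refractory period
            v += dt / (tau_mem * g_leak) * (i_input - g_leak * v)
            if v > v_theta:                 # threshold crossing: emit spike, reset
                spikes.append(t)
                v = v_reset
                t_last_spike = t
        t += dt
    return spikes

# 1 nA drives the steady-state potential (i_input / g_leak = 20 mV) above
# the 15 mV threshold, so the cell fires regularly:
spikes = simulate_lif(1.0)
```

A subthreshold drive (steady-state potential i_input/g_leak below v_theta) produces no spikes, and stronger drives shorten the interspike interval.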
A simplified version of the learning rule described in Maass et al. (2002) and Auer et al. (2001) is used to update the synaptic weights gijk. Let N be the index of the stimulus class (the correct response class) and M the index of the selected response class; that is, M = argmax_{i=1,...,I} qi is the readout group with the highest number of activated readout cells. Then two cases are distinguished: if N = M, that is, the selected response class is correct, the synaptic weights are not changed. If N ≠ M, then for all j = 1, . . . , J and
k = 1, . . . , K, the synaptic weights are updated according to the following rule:

gMjk ← gMjk + η(−1 − gMjk)  if rMj(tL) > rθ and ak(tL) ≠ 0, and gMjk is left unchanged otherwise;   (A.10)

gNjk ← gNjk + η(1 − gNjk)  if rNj(tL) < rθ and ak(tL) ≠ 0, and gNjk is left unchanged otherwise,   (A.11)
where η is a learning parameter. Thus, synapses to firing readout cells of the incorrect response class M are weakened (see equation A.10), whereas those to the inactive readout cells of the correct response class N are strengthened (see equation A.11).

References

Auer, P., Burgsteiner, H., & Maass, W. (2001). The p-delta learning rule for parallel perceptrons. Manuscript submitted for publication.
Buonomano, D. (2000). Decoding temporal information: A model based on short-term synaptic plasticity. Journal of Neuroscience, 20(3), 1129–1141.
Buonomano, D., & Merzenich, M. (1995). Temporal information transformed into a spatial code by a neural network with realistic properties. Science, 267, 1028–1030.
Heinbockel, T., Christensen, T., & Hildebrand, J. (1999). Temporal tuning of odor responses in pheromone-responsive projection neurons in the brain of the sphinx moth Manduca sexta. Journal of Comparative Neurology, 409(1), 1–12.
Kirkpatrick, S., Gelatt, C., & Vecchi, M. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Kolen, J., & Kremer, S. (Eds.). (2001). A field guide to dynamical recurrent networks. New York: IEEE Press.
König, P., Engel, A., Roelfsema, P., & Singer, W. (1995). How precise is neuronal synchronization? Neural Computation, 7, 469–485.
Legenstein, R. A., Markram, H., & Maass, W. (2003). Input prediction and autonomous movement analysis in recurrent circuits of spiking neurons. Reviews in the Neurosciences, 14(1–2), 5–19.
Maass, W., & Markram, H. (2003). Temporal integration in recurrent microcircuits. In M. Arbib (Ed.), The handbook of brain theory and neural networks (2nd ed., pp. 1159–1163). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Maass, W., Natschläger, T., & Markram, H. (2003). A model for real-time computation in generic neural microcircuits.
In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 213–220). Cambridge, MA: MIT Press.
Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proceedings of the National Academy of Sciences, USA, 95, 5323–5328.
Mozer, M. (1994). Neural net architectures for temporal sequence processing. In A. Weigend & N. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (pp. 243–264). Reading, MA: Addison-Wesley.
Ramcharan, E., Gnadt, J., & Sherman, S. (2001). The effects of saccadic eye movements on the activity of geniculate relay neurons in the monkey. Visual Neuroscience, 18(2), 253–258.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Verschure, P. (1991). Chaos-based learning. Complex Systems, 5, 359–370.
Wyss, R., König, P., & Verschure, P. (2003). Invariant representations of visual patterns in a temporal population code. Proceedings of the National Academy of Sciences, USA, 100(1), 324–329.
Wyss, R., & Verschure, P. (in press). Bounded invariance and the formation of place fields. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Wyss, R., Verschure, P., & König, P. (2003). On the properties of a temporal population code. Reviews in the Neurosciences, 14(1–2), 21–33.

Received July 9, 2003; accepted February 26, 2004.
LETTER
Communicated by Bard Ermentrout
Minimal Models of Adapted Neuronal Response to In Vivo–Like Input Currents

Giancarlo La Camera
[email protected]
Alexander Rauch
[email protected]
Hans-R. Lüscher
[email protected]
Walter Senn
[email protected]
Stefano Fusi
[email protected]
Institute of Physiology, University of Bern, CH-3012 Bern, Switzerland
Rate models are often used to study the behavior of large networks of spiking neurons. Here we propose a procedure for deriving rate models that take into account the fluctuations of the input current and firing-rate adaptation, two ubiquitous features of the central nervous system that have previously been overlooked in constructing rate models. The procedure is general and applies to any model of firing unit. As examples, we apply it to the leaky integrate-and-fire (IF) neuron, the leaky IF neuron with reversal potentials, and the quadratic IF neuron. Two mechanisms of adaptation are considered: one due to an afterhyperpolarization current and the other to an adapting threshold for spike emission. The parameters of these simple models can be tuned to match experimental data obtained from neocortical pyramidal neurons. Finally, we show how the stationary model can be used to predict the time-varying activity of a large population of adapting neurons.

Neural Computation 16, 2101–2124 (2004). © 2004 Massachusetts Institute of Technology

1 Introduction

Rate models are often used to investigate the collective behavior of assemblies of cortical neurons. One early and seminal example was given by Knight (1972a), who described the difference between the instantaneous firing rate of a neuron and the instantaneous rate of a homogeneous population of neurons in response to a time-varying input. Ever since, more refined analyses have been developed, some making use of the so-called population density approach (see, e.g., Abbott & van Vreeswijk, 1993; Treves, 1993; Fusi & Mattia, 1999; Brunel & Hakim, 1999; Gerstner, 2000; Nykamp & Tranchina, 2000; Mattia & Del Giudice, 2002). The population activity
at time t is the fraction of neurons of the network emitting a spike at that time. In a homogeneous network of identical neurons and in stationary conditions, this population activity is the single-neuron current-frequency (f-I) curve (the response function), which is accessible experimentally and has been the subject of many theoretical (Knight, 1972a; Amit & Tsodyks, 1991, 1992; Amit & Brunel, 1997; Ermentrout, 1998; Brunel, 2000; Mattia & Del Giudice, 2002) and in vitro studies (see, e.g., Knight, 1972b; Stafstrom, Schwindt, & Crill, 1984; McCormick, Connors, Lighthall, & Prince, 1985; Poliakov, Powers, Sawczuk, & Binder, 1996; Powers, Sawczuk, Musick, & Binder, 1999; Chance, Abbott, & Reyes, 2002; Rauch, La Camera, Lüscher, Senn, & Fusi, 2003). The f-I curve is central to much theoretical work on networks of spiking neurons and is a powerful tool in data analysis, where the population activity can be estimated by the peristimulus time histogram; this, however, requires many repetitions of the recordings under the same conditions. A suitable rate model would avoid this cumbersome and time-consuming procedure (see, e.g., Fuhrmann, Markram, & Tsodyks, 2002). When rate models are derived from detailed model neurons, the predictions of network behavior can be very accurate. Recently, Shriki, Hansel, and Sompolinsky (2003) fitted a rate model to the f-I curve of a conductance-based neuron with Hodgkin-Huxley sodium and potassium conductances and an A-current. The A-current was included to linearize the f-I curve, as observed in experiment. With their model, Shriki et al. (2003) can predict, in several case studies, the behavior of the network of neurons from which the rate model was derived. A similar approach was taken by Rauch et al. (2003), who fitted the response functions of white noise-driven IF neurons to the f-I curves of rat pyramidal neurons recorded in vitro. They found that firing-rate adaptation had to be included in the model in order to fit the data.
As opposed to the approach of Shriki et al. (2003), the contribution of the input fluctuations to the output rate was taken explicitly into account. Here we present a general scheme to derive adapting rate models in the presence of noise, of which the model used by Rauch et al. (2003) is a special case. The general scheme that we introduce may be considered a generalization of a model due to Ermentrout (1998). This author introduced a general rate model in which firing-rate adaptation is due to a feedback current; that is, the adapted rate f is given as the self-consistent solution of the equation f = Φ(I − αf), where I is the applied current, Φ is the f-I curve in stationary conditions, and α is a parameter that quantifies the strength of adaptation. In this model, the effect of noise is not considered. For most purposes, the input to a cortical neuron can be decomposed into an average component, m, and a fluctuating gaussian component whose amplitude is quantified by its standard deviation, s, and the response of the neuron can be expressed as a function of these two parameters (see, e.g., Ricciardi, 1977; Amit & Tsodyks, 1992; Amit & Brunel, 1997; Destexhe, Rudolph, Fellous, & Sejnowski, 2001).
We show that under such conditions, Ermentrout's model can be easily generalized, and the adapted response can be obtained as the fixed point of f = Φ(m − αf, s). For this result to hold, it is necessary that adaptation be slow compared to the timescale of the neural dynamics. In such a case, the feedback current αf is a slowly fluctuating variable and does not affect the value of s. Note that slow adaptation is a minimal requirement also in the absence of noise (Ermentrout, 1998). The proposed model is very general, but it can be used to full advantage only if the response function Φ is known analytically. This is the case for simple model neurons, for which the rate function can be calculated, and for more complex neurons whose f-I curve can be fitted by a suitable model function (e.g., Larkum, Senn, & Lüscher, in press). In section 2, the adapting rate model is introduced and tested on several versions of IF neurons, whose rate functions are known and easily computable. The resulting rate models are checked against simulations of the full models from which they are derived, including the leaky IF (LIF) neuron with conductance-based synaptic inputs. Only slight variations are needed if a different mechanism of adaptation is considered, as, for example, an adapting threshold for spike emission, which is dealt with in section 2.2. Evidence is also provided that the LIF neuron with an adapting threshold is able to fit the response functions of rat pyramidal neurons (see section 2.3), a result that parallels those of Rauch et al. (2003) obtained with an afterhyperpolarization current. Finally, in section 3, we show how the stationary response function can be used to predict the time-varying activity of a large population of adapting neurons.

2 Adapting Rate Models in the Presence of Noise

Firing-rate adaptation is a complex phenomenon characterized by several timescales and affected by different ion currents.
At least three phases of adaptation have been documented in many in vitro preparations, referred to as initial or one-interspike-interval (ISI) adaptation, which affects the first or at most the first two ISIs (Schwindt, O'Brien, & Crill, 1997); early adaptation, involving the first few seconds; and late adaptation, shown in response to a prolonged stimulation (see Table 1 of Sawczuk, Powers, & Binder, 1997, for references and a list of possible mechanisms). Initial adaptation depends largely on a Ca2+-dependent K+ current (Sah, 1996; Powers et al., 1999), although other mechanisms may also play a role (Sawczuk et al., 1997). The early and late phases of adaptation are not well understood, and several mechanisms have been put forward: in neocortical neurons, it seems that Na+-dependent K+ currents (Schwindt, Spain, & Crill, 1989; Sanchez-Vives, Nowak, & McCormick, 2000) and slow inactivation of Na+ channels (Fleidervish, Friedman, & Gutnick, 1996) may play a major role; in motoneurons, evidence is accumulating for an interplay between slow inactivation of Na+ channels, which tends to decrease the firing rate,
and the slow activation or facilitation of a calcium current, which tends to increase the discharge rate (Sawczuk et al., 1997; Powers et al., 1999). Despite the many mechanisms involved, we derive in the following a simple model for the adapted rate at steady state, which describes synthetically the overall phenomenology. The cellular mechanism is inspired by those mentioned above: upon emission of a spike, a quantity AN of a given ion species N (one can think of Ca2+ or Na+) enters the cell and modifies the intracellular ion concentration [N]i, which then exponentially decays to its resting value in a characteristic time τN. Its dynamics are described by

d[N]i/dt = −[N]i/τN + AN Σ_k δ(t − tk),   (2.1)
where the sum is taken over all the spikes emitted by the neuron up to time t. As a consequence, an outward, N-dependent current Iahp = −gN [N]i, proportional to [N]i through the average peak conductance gN, results and causes a decrease in the discharge rate. Following the literature, we give this current the name afterhyperpolarization (AHP) current (e.g., Sah, 1996). Experimentally, the time constant τN of the dynamics underlying AHP summation (see equation 2.1) is found to be of the order of tens of milliseconds (fast AHP), hundreds of milliseconds to a few seconds (medium-duration AHP), or seconds (slow AHP) (see, e.g., Powers et al., 1999). Values often used in modeling studies are of the order of 100 ms (Wang, 1998; Ermentrout, 1998; Liu & Wang, 2001). In all cases, the N-dynamics is typically slower than the average ISI. This fact can be exploited to obtain the stationary, adapted output frequency by noticing that, from equation 2.1, the steady-state (ss) intracellular concentration is proportional to the neuron's output frequency f in a time window of a few τN:

[N]i,ss = τN AN ⟨Σ_tk δ(t − tk)⟩ ≈ τN AN f,

so that the steady-state AHP current is Iahp ≈ −αf, with α ≡ gN AN τN. The adapted rate is then given as the self-consistent solution of

f = Φ(m − αf, s),   (2.2)

which has exactly one solution because α > 0 and Φ is an increasing function of m.

2.1 Examples. We checked the rate model, equation 2.2, against full simulations for several versions of IF neurons. In each case, the rate function in the presence of noise, Φ, is known.

2.1.1 Leaky Integrate-and-Fire Neuron. The leaky integrate-and-fire (LIF) neuron is the best known and most widely used of all IF neurons (see section A.1 for details of the model). Its response function in the presence of noise has been known for a long time; it reads

ΦLIF(m, s) = [ τr + τ √π ∫_{(CVr−mτ)/(σ√τ)}^{(Cθ−mτ)/(σ√τ)} e^(x²) (1 + erf(x)) dx ]^(−1)   (2.3)
(Siegert, 1951; Ricciardi, 1977; Amit & Tsodyks, 1991). Here Vr and τr are the reset potential and the absolute refractory period after spike emission, which is said to occur when the membrane potential hits a fixed threshold θ; τ is the membrane time constant; C is the membrane capacitance; and erf(x) = (2/√π) ∫₀ˣ e^{−t²} dt is the error function. m and σ = √(2τ) s are the average current and the amplitude of the fluctuations, with s in units of current and τ a time constant to preserve units (set equal to 1 ms throughout). The equivalence between the rate model and the full simulation for different values of the noise is shown in Figure 1A. We also report the lower bound on the adaptation time constant (around τN ∼ 80 ms) below which the result no longer holds (see Figure 1B). For irregular spike trains, the agreement is remarkable even at very low frequencies, where the condition that the average ISI be smaller than τN is violated. This may be explained by the fact that although the average ISI exceeds τN, the ISI distribution is skewed toward smaller values, so that Iahp ≈ −αf is still a good approximation.

2.1.2 Quadratic Integrate-and-Fire Neuron. The quadratic integrate-and-fire (QIF) neuron (see section A.2) is related to a canonical model of type I membrane (see, e.g., Ermentrout & Kopell, 1986; Ermentrout, 1996). As such, it is expected to reproduce the firing behavior of any type I neuron close to
2106
G. La Camera, A. Rauch, H.-R. Lüscher, W. Senn, and S. Fusi
Figure 1: Adapting rate model from the LIF model neuron, theory versus simulations. (A) Rate functions of the adapted LIF neuron. Plots show firing rate as a function of the mean current m at constant noise s = 0, 200, and 600 pA (from right-most to left-most curve). Lines: Self-consistent solutions of equation f = Φ_LIF(m − αf, s) (f_th), with Φ_LIF given by equation 2.3. Dots: Simulations of the full model, equations 2.1 and A.1 (f_sim). Adaptation parameters: τN = 500 ms, gN AN = 8 pA (so that α = 4 pA·s). Neuron parameters: τr = 5 ms, C = 500 pF, θ = 20 mV, Vr = 10 mV, Vrest = 0, τ = 20 ms. Inset: Sample of membrane voltage (mV, top trace) and feedback current Iahp (pA, bottom trace) as a function of time (in seconds) for the input point (m, s) = (550, 200) pA. Note how equilibrium is reached after ≈ 1 s = 2τN. (B) Dependence of (f_sim − f_th)/f_sim on τN. As τN is varied, AN is rescaled so that the total amount of adaptation (α = 4 pA·s) is kept constant. Parameters of the current: m = 600 pA (full symbols) and m = 800 pA (empty symbols); s = 0 (circles) and s = 200 pA (triangles). All other parameters as in A. Mean spike frequencies assessed across 50 s, after discarding a transient of 10τN. Integration time step dt = 0.1 ms. For s > 0, finite noise effects have to be expected, but the error is always below 3%. For τN < 80 ms, the error is positive, meaning that f_sim > f_th: the neuron adapts only slightly because N decays too quickly (vertical dotted line: 80 ms).
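The self-consistency check behind the theoretical curves of Figure 1A can be sketched numerically. The following is our own minimal illustration, not the authors' code; the quadrature grid and the damped fixed-point iteration are implementation choices of ours, and the default parameters are those of Figure 1A:

```python
import math
import numpy as np

def phi_lif(m, s, tau=0.020, C=500e-12, theta=0.020, Vr=0.010, tau_r=0.005):
    """Siegert formula, equation 2.3, for the noisy LIF neuron (rate in Hz).
    m, s: mean and fluctuation amplitude of the input current [A]; requires s > 0.
    sigma = sqrt(2*tau_u)*s with tau_u = 1 ms, as in the text. For strongly
    subthreshold inputs the integrand would need an asymptotic expansion;
    this direct evaluation is fine for the range used here."""
    sigma = math.sqrt(2 * 1e-3) * s
    lo = (C * Vr - m * tau) / (sigma * math.sqrt(tau))
    hi = (C * theta - m * tau) / (sigma * math.sqrt(tau))
    x = np.linspace(lo, hi, 4001)
    y = np.array([math.sqrt(math.pi) * math.exp(v * v) * (1 + math.erf(v)) for v in x])
    integral = (x[1] - x[0]) * (y.sum() - 0.5 * (y[0] + y[-1]))  # trapezoid rule
    return 1.0 / (tau_r + tau * integral)

def adapted_rate(m, s, alpha=4e-12, n_iter=100):
    """Damped fixed-point iteration for f = phi_lif(m - alpha*f, s) (equation 2.2).
    alpha in A*s (4e-12 A*s = 4 pA*s, as in Figure 1)."""
    f = 0.0
    for _ in range(n_iter):
        f = 0.5 * f + 0.5 * phi_lif(m - alpha * f, s)
    return f
```

For (m, s) = (550, 200) pA the iteration pulls the rate below the non-adapted value Φ_LIF(m, s), as in Figure 1A; since Φ is increasing in its first argument, the fixed point is unique and the damped iteration converges.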
Figure 2: Adapting rate model from the QIF neuron, theory versus simulations. f-I curves plotted as in Figure 1A. Lines: Self-consistent solutions of equation f = Φ_QIF(µ − αf, σ), with Φ_QIF given by equation 2.4. Dots: Simulations of the full model, equations 2.1 and A.2 (f_sim). Parameters: τN = 500 ms, gN AN = 0.06 (α = 0.03 s), τ = 20 ms, τr = 0; σ = 0, 0.8, 1.6 (from right to left). Inset: Same as in Figure 1A, for the point (µ, σ) = (−0.2, 1.6). Mean spike frequencies f_sim assessed across 50 s, after discarding a transient of 10τN.
bifurcation, where the firing rates are low. Its firing rate in the presence of white noise reads

\[ \Phi_{\mathrm{QIF}}(\mu, \sigma) = \left[ \tau_r + \sqrt{\pi}\,\tau \int_{-\infty}^{+\infty} dx\, \exp\left( -\mu x^2 - \frac{\sigma^2 x^6}{48} \right) \right]^{-1} \tag{2.4} \]
(see Brunel & Latham, 2003). µ and σ quantify the drift and the fluctuations of a (dimensionless) gaussian input current. The adapted response in stationary conditions is given by f = Φ_QIF(µ − αf, σ). The match between the predictions of the adapting rate model and simulations of the QIF neuron is presented in Figure 2.

2.1.3 Conductance-Based IF Neuron. The scenario outlined so far considered only current-based model neurons, that is, models in which the input current does not depend on the state of the membrane voltage. A more realistic description takes into account the dependence of the current on the reversal potentials, VE,I, for the excitatory and inhibitory inputs, respectively. The adapting rate model for this class of neurons can be derived in the same way as for the current-based models. For the LIF neuron with reversal potentials defined in section A.3, referred to as the conductance-based (CB) neuron in the following, one finds that the adapted response is the
fixed point of the self-consistent equation,

\[ f = \Phi_{\mathrm{CB}}(m_0 - \alpha f, s_0), \tag{2.5} \]

where

\[ \Phi_{\mathrm{CB}} = \left[ \tau_r + \tau^* \int_{\frac{C V_r - m_0 \tau^*}{s_0 \sqrt{\tau^*}}}^{\frac{C\theta - m_0 \tau^*}{s_0 \sqrt{\tau^*}}} \sqrt{\pi}\, e^{x^2} \left(1 + \operatorname{erf}(x)\right) dx \right]^{-1}. \tag{2.6} \]
Φ_CB is the rate function of the CB neuron (Burkitt, Meffin, & Grayden, 2003). Note that the only difference with the response function of the current-based LIF neuron, equation 2.3, is in the dependence of the quantities m0, s0, τ*, given, respectively, by equations A.10, A.11, and A.6, upon the input parameters ḡE,I, νE,I. Here ḡE,I are the excitatory and inhibitory peak conductances and νE,I the mean frequencies of the input spike trains, modeled as Poisson processes. Equations A.10 and A.11 have to be compared with

\[ m = \bar{g}_E \nu_E - \bar{g}_I \nu_I, \qquad s^2 = \bar{g}_E^2 \nu_E + \bar{g}_I^2 \nu_I, \]
valid for the current-based neuron. Note that one way to obtain the plots of Figure 1A is to increase νE while scaling ḡE as A/√νE, with A held constant and for constant inhibition (i.e., for constant ḡI, νI). This in fact corresponds to increasing m as A√νE − ḡI νI at constant noise s² = A² + ḡ²I νI. To make a comparison with the current-based neuron, we plotted Φ_CB as a function of µE = ḡE νE according to such a procedure in Figure 3, which presents the match between the predictions of the rate model and simulations of the adapting CB neuron.

2.1.4 Other Model Neurons. A similar agreement is obtained for other IF model neurons (results not shown). Particularly worth mentioning is the constant leakage IF neuron with a floor (CLIFF) (Fusi & Mattia, 1999; Rauch et al., 2003), whose response function is very simple and does not require any integration:

\[ \Phi(m, s) = \left[ \tau_r + \frac{C(\theta - V_r)}{m - \lambda} + \frac{\sigma^2}{2(m - \lambda)^2} \left( e^{-\frac{2C\theta(m-\lambda)}{\sigma^2}} - e^{-\frac{2C V_r (m-\lambda)}{\sigma^2}} \right) \right]^{-1}. \]
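Because it needs no quadrature, the CLIFF response function is cheap to evaluate. Below is a small sketch of ours based on the formula as reconstructed above; the deterministic-limit check in the comment assumes m > λ, and the parameter values are illustrative:

```python
import math

def phi_cliff(m, s, lam=0.0, C=500e-12, theta=0.020, Vr=0.010, tau_r=0.005):
    """Closed-form rate function of the CLIFF neuron (constant leakage lam,
    reflecting barrier at Vr). m, s in A; sigma^2 = 2*tau_u*s^2, tau_u = 1 ms,
    as for the LIF neuron."""
    sigma2 = 2 * 1e-3 * s * s
    drift = m - lam                       # net drive; assumed > 0 here
    T = (tau_r
         + C * (theta - Vr) / drift
         + sigma2 / (2 * drift**2)
         * (math.exp(-2 * C * theta * drift / sigma2)
            - math.exp(-2 * C * Vr * drift / sigma2)))
    return 1.0 / T

# As s -> 0 the exponentials vanish (for m > lam), recovering the
# deterministic constant-leak rate 1/(tau_r + C*(theta - Vr)/(m - lam));
# near rheobase, noise raises the rate, since the bracketed term is negative.
```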
The input parameters m, σ are defined as for the LIF neuron. The CLIFF neuron is an LIF neuron with constant leakage (i.e., with the term −(V − Vrest)/τ in equation A.1 replaced by the constant −λ/C) and a reflecting barrier for the membrane potential (Fusi & Mattia, 1999). However, the scheme proposed here to derive the adapting rate models does not apply to IF neurons only. For example, the response function of a Hodgkin-Huxley neuron with an A-current can be fitted by the simple model Φ1 = a[m − mth]+ − b[m − mth]²+ (Shriki et al., 2003), where a, b
Figure 3: Adapting rate model from the conductance-based LIF neuron, theory versus simulations. Lines: Self-consistent response of equation 2.5 plotted as ḡE νE → f at constant inhibition, with νI = 500 Hz, ḡI = 1 nS throughout. Dots: Simulations of the full models (f_sim). Each curve is obtained moving along νE and scaling ḡE so that σ²E ≡ ḡ²E νE stays constant, to allow comparison with the current-based neurons in Figure 1A. ḡE [nS] as a function of νE [Hz] is shown in the right inset as ḡE vs log10(νE). The resulting σE values were 7.0, 16.9, 33.1 nS/√s (from right to left). Adaptation and neuron parameters as in Figure 1A, plus VE = 70 mV, VI = −10 mV. Left inset as in Figure 1 with µE = 783 nS/s, σE = 33.1 nS/√s. Mean spike frequencies f_sim assessed across 50 s, after discarding a transient of 10τN.
are two positive constants (such that Φ1 ≥ 0 always), the rheobase mth is an increasing function of the leak conductance gL (Holt & Koch, 1997), and [x]+ = x if x > 0, and zero otherwise. The adapting rate model that corresponds to Φ1, that is, f = Φ1(m − αf; a, b, gL), could be interpreted as the rate model for the Hodgkin-Huxley neuron underlying it. Another example is given by the function Φ2 ∝ [m − mth]+^{1/2}, which describes the firing behavior of a type I membrane close to bifurcation and has been fitted (Ermentrout, 1998) to the in vitro response of cells from cat neocortex in the absence of noise (Stafstrom et al., 1984), and to the Traub model (Traub & Miles, 1991). It is easily seen that Φ2 is the rate function of the QIF neuron when σ = 0. Like Φ1, this model does not take fluctuations explicitly into account.

2.1.5 IF Neurons with Synaptic Dynamics. The adapting rate model also works in the presence of synaptic dynamics, provided that the appropriate response function is used. For example, for the LIF neuron with fast synaptic dynamics, this is equation 2.3 with {θ, Vr} replaced by {θ, Vr} + 1.03 sv √(τs/τ),
2110
G. La Camera, A. Rauch, H.-R. Luscher, ¨ W. Senn, and S. Fusi
where sv is the standard deviation of the input in units of voltage and τs ≪ τ is the synaptic time constant (Brunel & Sergi, 1998; Fourcaud & Brunel, 2002). This model will be used in section 3, which deals with time-dependent inputs. For slow synaptic dynamics, the response of the LIF neuron has been given by Moreno-Bote and Parga (2004b), while the response of the QIF neuron in both regimes has been given by Brunel and Latham (2003).

2.2 The Adapting Threshold Model. The above construction can be applied to other models in which the adaptation mechanism is of a different type. Among those is a model in which the threshold for spike emission adapts (see, e.g., Holden, 1976; Wilbur & Rinzel, 1983; Liu & Wang, 2001). In the adapting threshold model, the emission of an action potential causes the threshold θ for spike emission to increase stepwise by an amount Bθ and then decay exponentially to its resting value θ0 with a time constant τθ:

\[ \frac{d\theta}{dt} = -\frac{\theta - \theta_0}{\tau_\theta} + B_\theta \sum_k \delta(t - t_k). \tag{2.7} \]
There is indeed evidence that the spike threshold rises after the onset of a current step (Schwindt & Crill, 1982; Powers et al., 1999). Note that the feedback now affects the threshold, not the current. This case can be handled in a similar way as AHP adaptation: after a transient of the order of τθ, the neuron will have an average threshold proportional to its own output frequency, ⟨θ⟩ ≈ θ0 + Bθ τθ f ≡ θ0 + βf, with β = Bθ τθ, and the adapted response f is given by the self-consistent solution of

\[ f = \Phi(\theta_0 + \beta f; m, s). \tag{2.8} \]
Also, this reduction is expected to work for slow enough threshold dynamics, τθ ∼ 100 ms, but apart from that, at any output frequency. This is confirmed by Figure 4, in which the prediction of the rate model is checked against the simulations of the full model for the LIF neuron (that is, equation 2.7 and equation A.1 with Iahp = 0). A similar agreement is obtained for the conductance-based LIF neuron (not shown). The qualitative differences between the two adapting models for the LIF neuron are illustrated in the next section, where we report the results of fitting the response functions to those of cortical pyramidal neurons.
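A minimal numerical sketch of solving equation 2.8 for the LIF neuron (our illustration, not the authors' code): the Siegert formula is evaluated with the effective threshold θ0 + βf inside a damped fixed-point loop. The quadrature grid and the damping factor are our own choices; default parameters follow Figures 1 and 4:

```python
import math
import numpy as np

def phi_lif_theta(theta, m, s, tau=0.020, C=500e-12, Vr=0.010, tau_r=0.005):
    """Siegert formula (equation 2.3) with an explicit threshold argument [V].
    sigma = sqrt(2*tau_u)*s with tau_u = 1 ms; requires s > 0."""
    sigma = math.sqrt(2 * 1e-3) * s
    lo = (C * Vr - m * tau) / (sigma * math.sqrt(tau))
    hi = (C * theta - m * tau) / (sigma * math.sqrt(tau))
    x = np.linspace(lo, hi, 4001)
    y = np.array([math.sqrt(math.pi) * math.exp(v * v) * (1 + math.erf(v)) for v in x])
    integral = (x[1] - x[0]) * (y.sum() - 0.5 * (y[0] + y[-1]))   # trapezoid rule
    return 1.0 / (tau_r + tau * integral)

def adapted_rate_theta(m, s, theta0=0.020, beta=0.25e-3, n_iter=100):
    """Fixed point of f = Phi(theta0 + beta*f; m, s), equation 2.8.
    beta in V*s (0.25e-3 V*s = 0.25 mV*s, as in Figure 4)."""
    f = 0.0
    for _ in range(n_iter):
        f = 0.5 * f + 0.5 * phi_lif_theta(theta0 + beta * f, m, s)
    return f
```

For (m, s) = (550, 200) pA the fixed point lies near the ≈14 Hz reported in the caption of Figure 4, with an average threshold θ0 + βf around 23.5 mV.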
Figure 4: LIF neuron with an adapting threshold, theory versus simulations. f-I curves plotted as in Figure 1A. Lines: Self-consistent solutions of equation f = Φ(θ0 + βf; m, s), with Φ given by equation 2.3. Dots: Simulations of the full model, equations 2.7 and A.1 (f_sim). Adaptation parameters: τθ = 500 ms, Bθ = 0.5 mV (so that β = 0.25 mV·s). Neuron parameters and s values as in Figure 1A. Inset: Sample of membrane potential and θ(t) for the point (m, s) = (550, 200) pA (time in seconds). Note that for this point, the output frequency is f ≈ 14 Hz, so that after the transient θ = θ0 + βf ≈ 23.5 mV, as shown in the inset. Mean spike frequencies f_sim assessed across 50 s, after discarding a transient of 10τθ.
2.3 Quantitative Comparison with Experimental Data. We have shown previously that the LIF neuron with AHP adaptation can be fitted to the experimental response functions of rat pyramidal neurons under noisy current injection (Rauch et al., 2003). We extended the analysis to the LIF neuron with an adapting threshold to find basically the same result. The results for the 26 rat pyramidal neurons considered for the analysis are summarized in Table 1 (see appendix B for details). Two examples are shown in Figure 5. The two adapting models can be made equivalent in the region of low frequencies, being both threshold-linear with slopes 1/α (AHP) and τ/Cβ (adapting threshold; see appendix C for details). This is confirmed by Table 1, which shows that C and τ are the same for the two models and Cβ/τ = 4.46 pA·s, consistent with α = 4.3 ± 2.2. The two models differ in the value of the refractory period, which is much shorter for the adapting threshold model. In fact, imposing τr = 0 only marginally affects the quality of the fits. This is because the LIF neuron with an adapting threshold has a square root behavior in m at intermediate and large frequencies (see equation C.1). To the contrary, a refractory period
Figure 5: Best fits of different models to the rate functions of two rat pyramidal cells (see appendix B). Models: LIF neuron with AHP adaptation (AHP) and with an adapting threshold (AT). Theoretical curves (full lines) and experimental points (dots) plotted as in Figure 1A. Best-fit parameters are: Cell A: AHP: τr = 6.6 ms, τ = 27.1 ms, C = 260 pF, Vr = 1.7 mV, α = 5.1 pA·s, P = 0.32; AT: τr = 2.2 ms, τ = 28.8 ms, C = 270 pF, Vr = 14 mV, β = 0.58 mV·s, P = 0.38. Cell B: AHP: τr = 19.8 ms, τ = 41.1 ms, C = 440 pF, Vr = −2 mV, α = 2.8 pA·s, P = 0.85; AT: τr = 6.8 ms, τ = 40.1 ms, C = 430 pF, Vr = −12.8 mV, β = 0.30 mV·s, P = 0.80. In all the fits, the threshold (θ0 in AT) was kept fixed at 20 mV. P equals the probability that χ² is larger than the observed minimum χ²_min. The fit was accepted whenever P > 0.01. Amplitude of the fluctuating current: cell A: s = 0, 200, and 400 pA; cell B: s = 50, 200, 400, and 500 pA.
is required to bend the response of the model with AHP, which is otherwise linear in that region.

3 Adapting Response to Time-Dependent Inputs

The stationary response function can also be used to predict the time-varying activity of a population of adapting neurons, as shown in this section. Consider an input spike train of time-varying frequency νx(t), targeting each cell of the population through x-receptor-mediated channels. Each spike contributes a postsynaptic current of the form ḡx e^{−t/τx}, where ḡx is the peak conductance of the channels. In the diffusion approximation, such an input Ix is an Ornstein-Uhlenbeck (OU) process with average m̄x(t) = ḡx νx(t) τx, variance s̄²x(t) = (1/2) ḡ²x νx(t) τx, and correlation length τx:

\[ dI_x = -\frac{I_x - \bar{m}_x}{\tau_x}\, dt + \bar{s}_x \sqrt{\frac{2\,dt}{\tau_x}}\, \xi_t, \tag{3.1} \]

where ξt is the unitary gaussian process defined in section A.1.
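A discretized realization of the OU input of equation 3.1 can be sketched as follows (our illustration; parameter values are arbitrary). The stationary mean and variance of the simulated current should match m̄x and s̄²x:

```python
import numpy as np

def ou_input(nu, g_bar, tau_x, dt, n_steps, seed=0):
    """Euler-Maruyama discretization of equation 3.1 for constant rate nu.
    Returns the simulated current trace together with the target
    stationary mean m_bar and variance s_bar2."""
    rng = np.random.default_rng(seed)
    m_bar = g_bar * nu * tau_x               # average
    s_bar2 = 0.5 * g_bar**2 * nu * tau_x     # variance
    I = np.empty(n_steps)
    I[0] = m_bar
    coef = np.sqrt(s_bar2) * np.sqrt(2 * dt / tau_x)
    for k in range(1, n_steps):
        I[k] = I[k-1] - (I[k-1] - m_bar) * dt / tau_x + coef * rng.standard_normal()
    return I, m_bar, s_bar2
```

The stationary variance equals s̄²x because the diffusion coefficient 2 s̄²x/τx exactly balances the relaxation rate 1/τx of the OU process.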
Table 1: Summary of the Results of the Fit of the LIF Neuron to the Experimental Rate Functions of 26 Rat Neocortical Pyramidal Cells.

                         AHP            AT
N                        14             13
α [pA·s], β [mV·s]       4.3 ± 2.2      0.29 ± 0.13
τr [ms]                  9.0 ± 6.5      3.0 ± 4.0
τ [ms]                   33.2 ± 9.4     32.5 ± 9.2
C [nF]                   0.50 ± 0.18    0.50 ± 0.18
Vr [mV]                  0.1 ± 11.2     1.1 ± 13.7
P                        0.40 ± 0.30    0.33 ± 0.29

Notes: N is the number of fitted cells that required an adaptation parameter (α or β) > 0. Two cells could be fitted without adaptation and were not included in the analysis. The parameters (left-most column) are defined in section 2.1, and their best-fit values are reported as average ± SD. The threshold for spike emission was held fixed at 20 mV. P is the probability P[χ² > χ²_min] across fitted cells requiring adaptation. A fit was accepted whenever P > 0.01. AT: adapting threshold model.
The population activity of noninteracting neurons is well predicted by f(t) = Φ(mx, s²x), where Φ is the stationary response function and mx, s²x are the time-varying average and variance of Ix (see, e.g., Renart, Brunel, & Wang, 2003). These evolve according to the first-order dynamics (ẏ ≡ dy/dt),

\[ \tau_x \dot{m}_x = -(m_x - \bar{m}_x), \tag{3.2} \]

and analogously for s²x, with τx replaced by τx/2 (e.g., Gardiner, 1985). We now include adaptation in the following way:

\[ f = \Phi(m_x - I_{ahp},\, s_x^2), \qquad \tau_N \dot{I}_{ahp} = -I_{ahp} + \alpha f, \tag{3.3} \]

where Iahp is the AHP current, which follows the instantaneous output rate with time constant τN. Note that for a stationary stimulus, that is, νx constant, after a transient of the order of max{τx, τN}, one recovers the stationary model, equation 2.2, with m = m̄x, s = s̄x. In the case of several independent components, each follows its own synaptic dynamics, and they sum up in the argument of the response function to give the time-varying firing rate:

\[ f = \Phi\left( \sum_x m_x - I_{ahp},\; \sum_x s_x^2 \right). \]
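The population-rate dynamics of equations 3.2 and 3.3 can be integrated with a forward-Euler scheme. In the sketch below (ours, not the authors' code), a simple threshold-linear rate function Φ(m) = a[m − mth]+ stands in for the Siegert formula so that the example stays self-contained; all parameter values are illustrative:

```python
import numpy as np

def run_rate_model(m_target, t_end, dt=1e-4, tau_x=0.005, tau_N=0.5,
                   alpha=4.0, a=0.25, m_th=100.0):
    """Euler integration of tau_x*dm/dt = -(m - m_bar) (equation 3.2) and
    tau_N*dI/dt = -I + alpha*f (equation 3.3), with the stand-in rate
    function f = a*[m - I - m_th]_+ (m, I in pA; f in Hz; alpha in pA*s)."""
    n = int(t_end / dt)
    m, I = 0.0, 0.0
    f_trace = np.empty(n)
    for k in range(n):
        f = a * max(m - I - m_th, 0.0)
        m += dt * (m_target - m) / tau_x      # fast synaptic filtering
        I += dt * (alpha * f - I) / tau_N     # slow AHP feedback
        f_trace[k] = f
    return f_trace
```

After a step to m̄ = 500 pA the rate first overshoots toward a(m̄ − mth) = 100 Hz and then relaxes, on the slow time scale set by τN, to the stationary self-consistent value a(m̄ − mth)/(1 + aα) = 50 Hz, reproducing the adaptation transient.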
Figure 6: Time-varying activity of a population of independent, adapting LIF neurons in response to a noisy, broadband stimulus. (Top) Prediction of the adapting rate model, equation 3.4 (gray), compared to the simulations of 20,000 neurons (black), as described in section 3. Shown is the activity after a transient of 200 ms. The short horizontal bar indicates a pulselike increase of 1 ms duration in ν0, of strength 12ν0. The long horizontal bar indicates a 50% steplike increase in ν0, during which the inhibitory rate step-increases by 0.7ν0. (Bottom) Average time course of the stimulus. Neuron parameters as in Figure 1A, apart from τr = 2 ms. Other parameters (refer to the text): ν0 = 350 Hz, νinh = 1.2ν0, Gampa,gaba = 200 pA, Gnmda = 10 pA, τampa,gaba = 5 ms, τnmda = 100 ms. Bin size of the PSTH: 0.5 ms. Integration time step: 0.01 ms.
In Figure 6 we show an example with two fast (τx = 5 ms) components, one excitatory (AMPA-like), the other inhibitory (GABAA-like), plus a third component mimicking a slow (NMDA-like) current (with τnmda = 100 ms). The latter component fluctuates only slowly, so that its variance can be neglected (Brunel & Wang, 2001), as in the case of the adaptation current. The output rate was calculated as

\[ f(t) = \Phi\left( m_{ampa} + m_{nmda} - m_{gaba} - I_{ahp},\; s^2_{ampa} + s^2_{gaba} \right), \tag{3.4} \]

where Φ is equation 2.3 corrected for fast synaptic dynamics (Fourcaud & Brunel, 2002); see section 2.1.5. The excitatory stimulus was of the form ν0(1 + ε(t)), with ν0 = 350 Hz and ε(t) a rectified superposition of 10 sinusoidal components with random frequencies ωi/2π, phases φi, and amplitudes Ai drawn from uniform distributions (the latter between 0 and 0.5):

\[ \varepsilon(t) = \left[ \sum_i A_i \sin(\omega_i t + \phi_i) \right]_+ . \]
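The broadband envelope just described can be generated in a few lines (our sketch; the seed and the time grid are arbitrary choices):

```python
import numpy as np

def broadband_envelope(t, n_comp=10, f_max=150.0, a_max=0.5, seed=1):
    """Rectified superposition of n_comp sinusoids with random frequencies
    below f_max [Hz], random phases, and amplitudes uniform in [0, a_max],
    as used for the stimulus nu0*(1 + eps(t)) in section 3."""
    rng = np.random.default_rng(seed)
    omega = 2 * np.pi * rng.uniform(0.0, f_max, n_comp)
    phi = rng.uniform(0.0, 2 * np.pi, n_comp)
    A = rng.uniform(0.0, a_max, n_comp)
    eps = (A[:, None] * np.sin(omega[:, None] * t[None, :] + phi[:, None])).sum(axis=0)
    return np.maximum(eps, 0.0)       # rectification [.]_+
```

By construction the envelope is nonnegative and bounded by the sum of the amplitudes, n_comp * a_max.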
The maximum value used for ωi/2π was 150 Hz. νinh was held constant at 1.2ν0. This gives m̄ampa,nmda = ḡampa,nmda ν0(1 + ε(t)) τampa,nmda and m̄gaba = 1.2 ḡgaba ν0 τgaba, with mx given by equation 3.2, and analogously for s²ampa,gaba. The input currents used in the simulations were [Ix]+, where Ix evolved according to the OU process, equation 3.1. Each neuron received an independent realization of the stochastic currents. The time-varying population activity was assessed through the peristimulus time histogram (PSTH) with a bin size of 0.5 ms. The model makes a good prediction of the population activity, even during fast transients, such as the response to the impulse of 1 ms duration at t = 250 ms and to the step increase in ν0, νinh at t = 400 ms (horizontal bars in Figure 6). The small discrepancies are due to finite-size effects (Brunel & Hakim, 1999; Mattia & Del Giudice, 2002) and to the approximation used for the stationary response function. Similar results were obtained with the adapting threshold model (i.e., with Iahp = 0 and equation 3.3 replaced by τθ θ̇ = −(θ − θ0) + βf), and for the CB neuron (not shown). It has to be noticed that the condition τampa,gaba ≪ τ*, where τ* is the effective time constant of the CB neuron (see equation A.6), is usually more difficult to fulfill, as τ* can reach values as small as a few milliseconds, depending on the input (see, e.g., Destexhe et al., 2001). However, when τ* < τampa,gaba, a good approximation to the response function has been given by Moreno-Bote and Parga (2004a).

4 Discussion

We have proposed a general scheme to derive adapting rate models in the presence of fluctuating input currents. The adapting rate model is a reduced two-dimensional model, and it is a minimal model to fit the response of cortical neurons to in vivo–like currents with stationary distributions.
The rate models are easily computable; reproduce well the simulations of the full models from which they are derived; are of wide applicability (e.g., apply to model neurons with conductance-based synaptic inputs); allow for different mechanisms of adaptation to be used; and can be used to predict the activity of a large population of adapting spiking neurons in response to time-varying inputs. We considered an AHP current and an adapting threshold for spike emission, two mechanisms for which there is experimental evidence and which are often used in modeling studies. The only requirement is that adaptation be slow compared to the scale of the neural dynamics. Since the adapting rate models (with either model of adaptation) considered in this study are able to fit the rate functions of neocortical neurons, they offer a synthetic way to describe cortical firing, and the diversity of possible models can be used for a quantitative (and possibly functional) classification of cortical neurons based on their observed response to in vivo–like currents.
Since the stationary model predicts the time-varying activity of a large population of independent neurons, the response function measured by Rauch et al. (2003) can be used to make quantitative analyses of the population dynamics, and not only of the stationary states (Amit & Brunel, 1997; Mattia & Del Giudice, 2002). This is particularly the case for networks of neurons in the fluctuation-dominated regime (e.g., Shadlen & Newsome, 1998; Fusi & Mattia, 1999), when spikes are emitted because of large fluctuations of the membrane potential around an average value that is below the threshold. Such a regime could be the consequence of a high-conductance state (Destexhe et al., 2001), as observed in vivo (Paré, Shink, Gaudreau, Destexhe, & Lang, 1998). Otherwise, the network may fall into a synchronized regime, for example, in response to a step current, causing the population activity to oscillate around that predicted by the rate model. Additional work is required to investigate the predictions of the time-varying model in more complex situations, as, for example, in populations of interacting neurons or in the case of voltage-gated, saturating conductances (e.g., NMDA-mediated). In the absence of adaptation, the predictions of the model have been shown to be good in these more complex cases as well (e.g., Renart et al., 2003). We expect no difference in performance when adaptation is included as done in section 3.

Appendix A: Model Neurons

A.1 Leaky Integrate-and-Fire Neuron. The leaky IF neuron is a single-compartment model characterized by its membrane voltage V and a fixed threshold θ for spike emission. Upon crossing of θ from below, a spike is said to occur, and the membrane is clamped to a reset potential Vr for a refractory time τr, after which normal dynamics resume. We assume that a large number of postsynaptic potentials of small amplitude are required to reach the threshold.
In such a condition, prevalent in the cortex, the subthreshold membrane dynamics can be assimilated to a continuous random walk, the Ornstein-Uhlenbeck (OU) process (this is the diffusion approximation; see, e.g., Lánský & Sato, 1999). Taking into account the effect of Iahp, the subthreshold dynamics of the membrane potential obeys the stochastic differential equation

\[ dV = -\frac{V - V_{rest}}{\tau}\, dt + \mu\, dt + \sigma \xi_t \sqrt{dt}, \tag{A.1} \]

where

\[ \mu = \frac{m + I_{ahp}}{C}, \qquad \sigma = \frac{\sqrt{2\tau}\, s}{C} \]

are the average and standard deviation in unit time of the membrane voltage. m and s² are the average and the variance of the synaptic input current,
Iahp = −gN [N]i (see equation 2.1), and √(2τ) is a factor to preserve units (see, e.g., Rauch et al., 2003). Vrest = 0 is the resting potential, C is the capacitance of the membrane, τ = RC, and R is the membrane resistance. ξt is a gaussian process with flat spectrum and unitary variance, ⟨ξt ξt′⟩ = δ(t − t′) (white noise; see, e.g., Tuckwell, 1988, or Gardiner, 1985, for more details). In a nonadapting neuron, Iahp ≡ 0. In the adapting threshold model (see section 2.2), Iahp = 0 and the threshold θ ≥ θ0 is a dynamical variable obeying equation 2.7.

A.2 Quadratic Integrate-and-Fire Neuron. The dimensionless variable V, to be interpreted as the membrane potential of the white noise–driven QIF neuron, obeys

\[ \tau\, dV = (V^2 + \mu)\, dt + \sigma \xi_t \sqrt{\tau\, dt}, \tag{A.2} \]
where τ is a time constant that mimics the effect of the membrane time constant, and µ, σ² are the average and variance per unit time of the input current. A spike is said to occur whenever V = +∞, after which V is clamped to V = −∞ for a refractory period τr. In practice, in the simulations, V is reset to −50 whenever V = +50. This gives an accurate approximation for the parameters chosen in Figure 2. On the other hand, in the rate function, equation 2.4, the actual values used for the integration limits do not matter, provided they are larger than +10 and smaller than −10, respectively.

A.3 Conductance-Based LIF Neuron. The membrane potential of the conductance-based LIF neuron obeys

\[ dV = -\tilde{g}_L (V - V_{rest})\, dt + g_E (V_E - V)\, dP_E + g_I (V_I - V)\, dP_I, \]

where gE,I = τ ḡE,I /C are dimensionless peak conductances, g̃L = 1/τ is the leak conductance in appropriate units (1/ms), VE,I are the excitatory and inhibitory reversal potentials, and dP_{E,I} = Σj δ(t − tj^{E,I}) dt are Poisson spike trains with intensity νE,I. In the diffusion approximation (dPx → νx dt + √(νx dt) ξt), the equation can be put in a form very similar to equation A.1 (see, e.g., Hanson & Tuckwell, 1983; Lánský & Lánská, 1987; Burkitt, 2001):

\[ dV = -\frac{V}{\tau^*}\, dt + \mu_0\, dt + \sigma_0(V) \sqrt{dt}\, \xi_t, \tag{A.3} \]

where

\[ \mu_0 = \tilde{g}_L V_{rest} + (g_E V_E \nu_E + g_I V_I \nu_I), \tag{A.4} \]
\[ \sigma_0^2(V) = g_E^2 (V_E - V)^2 \nu_E + g_I^2 (V_I - V)^2 \nu_I, \tag{A.5} \]
\[ \tau^* = (\tilde{g}_L + g_E \nu_E + g_I \nu_I)^{-1}. \tag{A.6} \]
The main differences with respect to the current-based IF neuron are (1) the fluctuations depend on the membrane voltage; (2) an input-dependent, effective time constant τ* appears; (3) the parameter µ0 is not the average of the total input current (e.g., part of the input contributes to the leak term −V/τ* and is not considered in µ0); and (4) the voltage is bounded from below by the inhibitory reversal potential (below VI, inhibitory inputs become excitatory). Usually the last point is taken care of by imposing a reflecting barrier at VI (Hanson & Tuckwell, 1983; Lánský & Lánská, 1987). The rate function of the CB neuron,

\[ \Phi_{\mathrm{CB}} = \left[ \tau_r + \tau^* \int_{\frac{V_r - \mu_{ss}}{\sqrt{2}\,\sigma_{ss}}}^{\frac{\theta - \mu_{ss}}{\sqrt{2}\,\sigma_{ss}}} \sqrt{\pi}\, e^{x^2} \left(1 + \operatorname{erf}(x)\right) dx \right]^{-1}, \tag{A.7} \]
has been given in Burkitt et al. (2003) in the absence of a reflecting barrier. Figure 3 shows that in the typical case, it works also in the presence of a reflecting barrier at VI. The constants µss, σ²ss appearing in equation A.7 are the stationary average and variance of the free (i.e., spikeless) membrane voltage, which are (Hanson & Tuckwell, 1983; Burkitt, 2001):

\[ \mu_{ss} = \mu_0 \tau^* = \frac{\tilde{g}_L V_{rest} + (g_E V_E \nu_E + g_I V_I \nu_I)}{\tilde{g}_L + g_E \nu_E + g_I \nu_I} \tag{A.8} \]

and

\[ \sigma_{ss}^2 = \frac{\tau^* \sigma_0^2(\mu_{ss})}{2\,(1 - \eta)} \approx \frac{\tau^*}{2}\, \sigma_0^2(\mu_{ss}), \tag{A.9} \]

where η ≡ (g²E νE + g²I νI) τ*/2. The approximation in equation A.9 follows from the fact that η is negligible in the typical case. For example, when ḡE,I ∼ 1 nS, C ∼ 500 pF, and νE,I ∼ 10³ Hz, then η ∼ 10⁻⁴ to 10⁻³. (A convenient way to obtain the result, equation A.9, is to make use of the equality σ²ss = τ*⟨σ²0(V)⟩/2, where the average ⟨.⟩ is taken over the free process. This is a generalization of a well-known property of the OU process, in which σ0 is constant.) Equation A.7 can therefore be written as equation 2.6,

\[ \Phi_{\mathrm{CB}} = \left[ \tau_r + \tau^* \int_{\frac{C V_r - m_0 \tau^*}{s_0 \sqrt{\tau^*}}}^{\frac{C\theta - m_0 \tau^*}{s_0 \sqrt{\tau^*}}} \sqrt{\pi}\, e^{x^2} \left(1 + \operatorname{erf}(x)\right) dx \right]^{-1}, \]

where m0 ≡ Cµ0 and s0 ≡ Cσ0(µss), so that m0 has units of current:

\[ m_0 = C \tilde{g}_L V_{rest} + C (g_E V_E \nu_E + g_I V_I \nu_I), \tag{A.10} \]
\[ s_0^2 = C^2 g_E^2 (V_E - \mu_{ss})^2 \nu_E + C^2 g_I^2 (V_I - \mu_{ss})^2 \nu_I, \tag{A.11} \]
with µss given by equation A.8 and gE,I ≡ τ ḡE,I /C. The adapted frequency is given by the solution of either f = Φ_CB(m0 − αf, s0) (AHP-based model) or f = Φ_CB(θ0 + βf; m0, s0) (adapting threshold model).

Appendix B: Fitting Procedure

Here we briefly summarize the experimental procedure and the data analysis that led to the results of Table 1. Full details can be found in Rauch et al. (2003). Pyramidal neurons from layer 5 of rat somatosensory cortex were injected with an OU process with a correlation time constant τ = 1 ms to resemble white noise. Stimuli were delivered in random order from a preselected pool, which depended on the cell. The time length of the stimulus was between 6 and 12 seconds. The spike trains of 26 selected cells were analyzed to assess their mean spike frequencies. A transient ranging from 0.5 to 2 seconds (depending on stimulus duration) was discarded to deal with the stationary spike train only. On balance, the stationary response was usually adapted with respect to the transient one. The model rate functions were fitted to the experimental ones using a random least-square procedure, that is, a Monte Carlo minimization of the function (see, e.g., Press, Teukolsky, Vetterling, & Flannery, 1992)

\[ \chi^2_{N-M} = \sum_{i=1}^{N} \left( \frac{\Phi_{\mathrm{MODEL}}(m_i, s_i; \Theta) - f_i}{\Delta_i} \right)^2, \]

where i runs over the experimental points, fi are the experimental spike frequencies, Θ is the parameter set, and the weights Δi correspond approximately to a confidence interval of 68% for the output rate. M is the number of parameters to be tuned and N the number of experimental points. The best fit was accepted if the probability of a variable χ²_{N−M} being larger than the observed χ²_min was itself larger than 0.01. The parameter set includes five parameters: τr, Vr, C, τ, and α [pA·s] for the AHP adaptation or β [mV·s] for the adapting threshold mechanism. Note that since Φ_LIF, equation 2.3, is invariant under the scaling θ → θh, Vr → Vr h, C → C/h (h constant), only two out of these three parameters are independent. Therefore, the threshold for spike emission (θ0 in the case of an adapting threshold) was set to 20 mV throughout. The results are summarized in Table 1 and discussed in the text.

Appendix C: The Effects of Adaptation on the LIF Neuron

Here we summarize and compare the properties of the LIF neuron endowed with the two models of adaptation, which are mentioned in the analysis of the experimental data in section 2.3. We will refer to the self-consistent solutions of equations 2.2 and 2.8 as fα(m, s) and fβ(m, s), respectively. We consider the regions of low and intermediate-to-large rates in turn.
1. At low frequencies, the two models can be made equivalent by choosing βC/τ = α. This is because the response at rheobase, otherwise highly nonlinear, is linearized by either model of adaptation. One obtains fα,β ≈ ρα,β [m − mth]+ as m → m⁺th, where mth = Cθ/τ is the rheobase current, ρα = α⁻¹ for AHP adaptation, and ρβ = τ/Cβ for the adapting threshold. ([x]+ = x if x > 0 and zero otherwise, and m → y⁺ means that the limit is performed for values of m larger than y.) The linearization argument is due to Ermentrout (1998) for AHP-like adaptation and can be easily generalized to the case of an adapting threshold for the LIF neuron. The slope, ρ, can be obtained by looking at how the two forms of adaptation affect the rheobase mth = Cθ/τ, that is, by requiring that m − αfα − Cθ/τ = m − C(θ + βfβ)/τ. One finds fα/fβ = Cβ/ατ, so that for Cβ = ατ the output frequencies at the rheobase (hence, the slopes of the linearized rate functions) are the same.

2. For τr = 0, the rate functions of the two adapting models differ away from the rheobase. This can be seen most easily for large inputs, where the nonadapted response is approximately linear, f ∼ m/C(θ − Vr). It is easy to derive that AHP adaptation preserves this linearity,

\[ f_\alpha \sim \frac{m}{C(\theta - V_r) + \alpha}, \]

while for an adapting threshold,

\[ f_\beta = \frac{\theta - V_r}{2\beta} \left( \sqrt{1 + \frac{4\beta m}{C(\theta - V_r)^2}} - 1 \right), \tag{C.1} \]

with asymptotic behavior fβ ∼ √(m/Cβ). The introduction of a finite refractory period makes the AHP model bend in this region,

\[ f_\alpha \sim \frac{1}{\tau_r} \left( 1 - \frac{\alpha}{\tau_r m} \right), \]

allowing the two models to match on the entire range of observed output rates.

Acknowledgments

We thank Paolo Del Giudice and Maurizio Mattia for useful discussions. This work was supported by the Swiss National Science Foundation (grants 31-61335.00 and 3152-065234.01) and the Silva Casa Foundation.
Adapting Rate Models
References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured (learned) delay activity during delay periods. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates: I. Substrate-spikes, rates and neuronal gain. Network, 2, 259–273.
Amit, D. J., & Tsodyks, M. V. (1992). Effective neurons and attractor neural networks in cortical environment. Network, 3, 121–137.
Brunel, N. (2000). Persistent activity and the single cell f-I curve in a cortical network model. Network, 11, 261–280.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Brunel, N., & Latham, P. (2003). Firing rate of the noisy quadratic integrate-and-fire neuron. Neural Computation, 15, 2281–2306.
Brunel, N., & Sergi, S. (1998). Firing frequency of leaky integrate-and-fire neurons with synaptic current dynamics. J. Theor. Biol., 195, 87–95.
Brunel, N., & Wang, X. J. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. Journal of Computational Neuroscience, 11, 63–85.
Burkitt, A. N. (2001). Balanced neurons: Analysis of leaky integrate-and-fire neurons with reversal potentials. Biol. Cybern., 85, 247–255.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2003). Study of neuronal gain in a conductance-based leaky integrate-and-fire neuron model with balanced excitatory and inhibitory input. Biol. Cybern., 89, 119–125.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Destexhe, A., Rudolph, M., Fellous, J. M., & Sejnowski, T. J. (2001).
Fluctuating dynamic conductances recreate in-vivo-like activity in neocortical neurons. Neuroscience, 107, 13–24.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001.
Ermentrout, B. (1998). Linearization of f-I curves by adaptation. Neural Computation, 10(7), 1721–1729.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Fleidervish, I., Friedman, A., & Gutnick, M. J. (1996). Slow inactivation of Na+ current and slow cumulative spike adaptation in mouse and guinea-pig neocortical neurones in slices. J. Physiol. (Cambridge), 493, 83–97.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Computation, 14, 2057–2110.
Fuhrmann, G., Markram, H., & Tsodyks, M. (2002). Spike frequency adaptation and neocortical rhythms. J. Neurophysiology, 88, 761–770.
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate and fire neurons. Neural Computation, 11, 633–652.
Gardiner, C. W. (1985). Handbook of stochastic methods. New York: Springer-Verlag.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Computation, 12, 43–90.
Hanson, F. B., & Tuckwell, H. C. (1983). Diffusion approximation for neural activity including synaptic reversal potentials. J. Theor. Neurobiol., 2, 127–153.
Holden, A. V. (1976). Models of stochastic activity of neurons. New York: Springer-Verlag.
Holt, G. R., & Koch, C. (1997). Shunting inhibition does not have a divisive effect on firing rates. Neural Computation, 9, 1001–1013.
Knight, B. W. (1972a). Dynamics of encoding of a population of neurons. Journal of General Physiology, 59, 734–736.
Knight, B. W. (1972b). The relationship between the firing rate of a single neuron and the level of activity in a network of neurons. Experimental evidence for resonance enhancement in the population response. Journal of General Physiology, 59, 767.
Lánský, P., & Lánská, V. (1987). Diffusion approximation of the neuronal model with synaptic reversal potentials. Biol. Cybern., 56, 19–26.
Lánský, P., & Sato, S. (1999). The stochastic diffusion models of nerve membrane depolarization and interspike interval generation. Journal of the Peripheral Nervous System, 4, 27–42.
Larkum, M., Senn, W., & Lüscher, H.-R. (in press). Top-down dendritic input increases the gain of layer 5 pyramidal neurons. Cerebral Cortex.
Liu, Y. H., & Wang, X. J. (2001). Spike-frequency adaptation of a generalized leaky integrate-and-fire model neuron. Journal of Computational Neuroscience, 10, 25–45.
Mattia, M., & Del Giudice, P. (2002). Population dynamics of interacting spiking neurons. Phys. Rev. E, 66, 051917.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. (1985).
Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiology, 54, 782–806.
Moreno-Bote, R., & Parga, N. (2004a). Membrane potential and response properties of populations of cortical neurons in the high conductance state. Manuscript submitted for publication.
Moreno-Bote, R., & Parga, N. (2004b). Role of synaptic filtering on the firing response of simple model neurons. Physical Review Letters, 92, 028102.
Nykamp, D. Q., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comp. Neurosci., 8, 19–30.
Paré, D., Shink, E., Gaudreau, H., Destexhe, A., & Lang, E. J. (1998). Impact of spontaneous synaptic activity on the resting properties of cat neocortical pyramidal neurons in vivo. Journal of Neurophysiology, 11, 1450–1460.
Poliakov, A. V., Powers, R. K., Sawczuk, A., & Binder, M. D. (1996). Effects of background noise on the response of rat and cat motoneurons to excitatory current transients. Journal of Physiology, 495.1, 143–157.
Powers, R. K., Sawczuk, A., Musick, J. R., & Binder, M. D. (1999). Multiple mechanisms of spike-frequency adaptation in motoneurones. J. Physiol. (Paris), 93, 101–114.
Press, W., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rauch, A., La Camera, G., Lüscher, H.-R., Senn, W., & Fusi, S. (2003). Neocortical cells respond as integrate-and-fire neurons to in vivo–like input currents. J. Neurophysiol., 90, 1598–1612.
Renart, A., Brunel, N., & Wang, X. J. (2003). Mean-field theory of recurrent cortical networks: From irregularly spiking neurons to working memory. In J. Feng (Ed.), Computational neuroscience: A comprehensive approach. Boca Raton, FL: CRC Press.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Sah, P. (1996). Ca2+-activated K+ currents in neurons: Types, physiological roles and modulation. Trends Neurosci., 19, 150–154.
Sanchez-Vives, M. V., Nowak, L. G., & McCormick, D. A. (2000). Cellular mechanisms of long-lasting adaptation in visual cortical neurons in vitro. J. Neuroscience, 20, 4286–4299.
Sawczuk, A., Powers, R. K., & Binder, M. D. (1997). Contribution of outward currents to spike frequency adaptation in hypoglossal motoneurons of the rat. Journal of Physiology, 78, 2246–2253.
Schwindt, P. C., & Crill, W. E. (1982). Factors influencing motoneuron rhythmic firing: Results from a voltage-clamp study. J. Neurophysiol., 48, 875–890.
Schwindt, P., O'Brien, J. A., & Crill, W. E. (1997). Quantitative analysis of firing properties of pyramidal neurons from layer 5 of rat sensorimotor cortex. J. Neurophysiol., 77, 2484–2498.
Schwindt, P. C., Spain, W. J., & Crill, W. E. (1989). Long-lasting reduction of excitability by a sodium-dependent potassium current in cat neocortical neurons. J. Neurophysiol., 61, 233–244.
Shadlen, M. N., & Newsome, W. T. (1998).
The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18, 3870–3896.
Shriki, O., Hansel, D., & Sompolinsky, H. (2003). Rate models for conductance-based cortical neural networks. Neural Computation, 15, 1809–1841.
Siegert, A. J. F. (1951). On the first passage time probability function. Phys. Rev., 81, 617–623.
Stafstrom, C. E., Schwindt, P. C., & Crill, W. E. (1984). Repetitive firing in layer V neurons from cat neocortex in vitro. J. Neurophysiol., 52, 264–277.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
Treves, A. (1993). Mean field analysis of neuronal spike dynamics. Network, 4, 259–284.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Wang, X. J. (1998). Calcium coding and adaptive temporal computation in cortical pyramidal neurons. J. Neurophysiol., 79, 1549–1566.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distribution. J. Theor. Biol., 105, 345–368.

Received March 28, 2003; accepted April 6, 2004.
LETTER
Communicated by Anthony Burkitt
Including Long-Range Dependence in Integrate-and-Fire Models of the High Interspike-Interval Variability of Cortical Neurons

B. Scott Jackson
[email protected]
Institute for Sensory Research and Department of Bioengineering and Neuroscience, Syracuse University, Syracuse, NY 13244, U.S.A.
Many different types of integrate-and-fire models have been designed in order to explain how it is possible for a cortical neuron to integrate over many independent inputs while still producing highly variable spike trains. Within this context, the variability of spike trains has been almost exclusively measured using the coefficient of variation of interspike intervals. However, another important statistical property that has been found in cortical spike trains and is closely associated with their high firing variability is long-range dependence. We investigate the conditions, if any, under which such models produce output spike trains with both interspike-interval variability and long-range dependence similar to those that have previously been measured from actual cortical neurons. We first show analytically that a large class of high-variability integrate-and-fire models is incapable of producing such outputs based on the fact that their output spike trains are always mathematically equivalent to renewal processes. This class of models subsumes a majority of previously published models, including those that use excitation-inhibition balance, correlated inputs, partial reset, or nonlinear leakage to produce outputs with high variability. Next, we study integrate-and-fire models that have (non-Poissonian) renewal point process inputs instead of the Poisson point process inputs used in the preceding class of models. The confluence of our analytical and simulation results implies that the renewal-input model is capable of producing high variability and long-range dependence comparable to that seen in spike trains recorded from cortical neurons, but only if the interspike intervals of the inputs have infinite variance, a physiologically unrealistic condition. Finally, we suggest a new integrate-and-fire model that does not suffer from any of the previously mentioned shortcomings.
By analyzing simulation results for this model, we show that it is capable of producing output spike trains with interspike-interval variability and long-range dependence that match empirical data from cortical spike trains. This model is similar to the other models in this study, except that its inputs are fractional-gaussian-noise-driven Poisson processes rather than renewal point processes.

Neural Computation 16, 2125–2195 (2004). © 2004 Massachusetts Institute of Technology.

In addition to this model's
success in producing realistic output spike trains, its inputs have long-range dependence similar to that found in most subcortical neurons in sensory pathways, including the inputs to cortex. Analysis of output spike trains from simulations of this model also shows that a tight balance between the amounts of excitation and inhibition at the inputs to cortical neurons is not necessary for high interspike-interval variability at their outputs. Furthermore, in our analysis of this model, we show that the superposition of many fractional-gaussian-noise-driven Poisson processes does not approximate a Poisson process, which challenges the common assumption that the total effect of a large number of inputs on a neuron is well represented by a Poisson process.

1 Introduction

The integrate-and-fire (IF) neuron is a common model of general neuronal processing when simplicity is of the essence. Its simplicity is particularly beneficial when one wishes to model a large network of interconnected neurons, such as portions of the cortex. Each neuron in the cortex receives a very large number of inputs but produces a relatively low output firing rate, which is on the order of that of each individual input. In order for the standard, nonleaky IF model with excitatory inputs to have a large number of inputs and an output spike rate similar to its inputs, its threshold must be very large compared to the potential caused by a single input spike. In this case, a large number of input spikes must converge on the IF neuron before an output spike will be produced. Thus, although the inputs may contain a high degree of variability, the output will typically be very regular due to the averaging effect of the integrator. However, it has been known for some time that the spike trains of cortical neurons are quite variable. This discrepancy can be easily explained with a leaky IF model if the decay of the potential is fast relative to the time between output spikes.
This explanation seems reasonable for the conditions under which the variability of cortical neurons has typically been measured, when the firing rates have been low. However, Softky and Koch (1992, 1993) showed that this high variability, as measured by the coefficient of variation (the standard deviation divided by the mean) of the interspike intervals (CV_ISI), persists even when cortical neurons produce high firing rates, with interspike intervals that are short relative to reasonable decay times resulting from membrane leakage. This inconsistency between the IF model and physiological measurements suggests that the IF model may not represent the essential nature of neuronal dynamics. Softky and Koch (1992, 1993) responded to this problem by arguing that cortical neurons act as coincidence detectors instead of as integrators, where the coincidence detection mechanism is likely to be located in the dendrites (see also Softky, 1994, 1995). Others, however, have attempted to preserve the essential elements of the IF neuron through reasonable modifications
of the model. Shadlen and Newsome (1994) suggested that IF neurons that integrate over many inputs are able to produce very irregular spike trains if the average amount of inhibition at the input is equal to the average amount of excitation. But other studies (Brown & Feng, 1999; Feng & Brown, 1998a; Feng, 1999; Shadlen & Newsome, 1998; Burkitt, 2000, 2001) have shown that even when the amount of inhibition is less than, but very close to, the amount of excitation, the CV_ISI of the output from the IF neuron can match measurements made in real cortical neurons. In addition, realistic CV_ISI values can be produced in the IF neuron by using long-tailed interspike-interval distributions for the input processes (Feng, 1997; Feng & Brown, 1998a, 1998b; Feng, Tirozzi, & Brown, 1998), by using correlated inputs (Stevens & Zador, 1998; Sakai, Funahashi, & Shinomoto, 1999; Shinomoto & Tsubo, 2001; Feng & Brown, 2000a; Feng, 2001; Feng & Zhang, 2001; Destexhe & Pare, 1999, 2000), or by using a postspike reset potential near threshold (Troyer & Miller, 1997; Bugmann, Christodoulou, & Taylor, 1997; Lansky & Smith, 1989). Furthermore, several studies have created networks of IF neurons where the network dynamics create highly variable outputs at the level of single neurons (Usher, Stemmler, Koch, & Olami, 1994; Usher, Stemmler, & Olami, 1995; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998).

The primary measure that has been used to compare the variability of IF models to that of cortical neurons is the CV_ISI. But the CV_ISI is a limited measure of the variability of spike trains, since it is computed only from the variance and mean of the interspike intervals. It is not, for example, sensitive to temporal correlations between interspike intervals, which can drastically alter the overall variability of a spike train without altering the variability of the interspike intervals.
However, not only are temporal correlations present in the sequence of interspike intervals of many, if not most, of the neurons in subcortical and cortical sensory pathways, but these correlations persist for an extraordinarily long time (subcortical: Teich, 1989, 1996; Teich & Lowen, 1994; Lowen & Teich, 1996b; Teich, Johnson, Kumar, & Turcott, 1990; Turcott et al., 1994; Teich, Heneghan, Lowen, Ozaki, & Kaplan, 1997; Lowen, Ozaki, Kaplan, Saleh, & Teich, 2001; cortical: Teich, Turcott, & Siegel, 1996; Gruneis, Nakao, Yamamoto, Musha, & Nakahama, 1989; Gruneis, Nakao, & Yamamoto, 1990; Gruneis et al., 1993; Wise, 1981). The common mathematical term for these persistent correlations is long-range dependence (LRD). But particularly in less theoretical areas of study, LRD processes are often called second-order self-similar processes, fractal processes, or processes with power-law statistics. LRD produces behavior in the variance of a process that is unusual with regard to standard statistical procedures and, as will become clear, is an important component in understanding the variability of neural spike trains. However, its importance to the study of highly variable integrator models of cortical processing has rarely been acknowledged (but see Usher et al., 1994, 1995), and then only with regard to networks of interconnected IF neurons.
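This insensitivity of the CV_ISI is easy to demonstrate in simulation (a sketch; the rate-modulation scheme and all parameters are invented for illustration). Interspike intervals are generated with a slowly varying rate, which induces long-lasting positive correlations between intervals; shuffling the intervals leaves the CV_ISI exactly unchanged but sharply reduces the variance of spike counts in fixed windows.

```python
import math, random, statistics

random.seed(1)
z, isis = 0.0, []
for _ in range(50000):
    z = 0.95 * z + random.gauss(0.0, 0.3)          # slow AR(1) rate modulation
    isis.append(random.expovariate(20.0 * math.exp(z)))

def cv(xs):
    """Coefficient of variation: standard deviation over mean."""
    return statistics.pstdev(xs) / statistics.mean(xs)

def count_variance(isis, T=1.0):
    """Variance of spike counts in adjacent windows of length T."""
    t, edge, c, counts = 0.0, T, 0, []
    for y in isis:
        t += y
        while t > edge:                            # close finished windows
            counts.append(c)
            c, edge = 0, edge + T
        c += 1
    mu = sum(counts) / len(counts)
    return sum((x - mu) ** 2 for x in counts) / len(counts)

shuffled = isis[:]
random.shuffle(shuffled)        # same ISIs, interval correlations destroyed
cv_gap = abs(cv(isis) - cv(shuffled))              # essentially zero
v_corr, v_shuf = count_variance(isis), count_variance(shuffled)
# v_corr is several times v_shuf: identical CV_ISI, very different trains.
```

The two trains are indistinguishable by the CV_ISI yet differ grossly in their count variability, which is exactly the gap that long-range dependence measures are meant to capture.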
In this study, we investigate, using both mathematical analysis and computer simulations, whether single-neuron "high-variability" IF models can produce LRD while their CV_ISI values remain close to in vivo cortical measurements (usually within the range of 0.5 to 1.5). A similar investigation of network models would also be worthwhile but is beyond our scope here. In addition to the single-neuron models mentioned above, we also describe and investigate a new model for the high variability of cortical neurons, which is motivated by the fact that LRD is present in neurons that project into cortex. Before looking at these models, however, we must have a foundation for studying LRD in point processes, the mathematical processes usually employed to represent neural spike trains. In the following section, we describe the theory of LRD in point processes and methods for analyzing LRD in point process–like data, such as spike times in neural spike trains.

2 Long-Range Dependence: Definitions and Theory

2.1 Long-Range Dependence in Stochastic Processes. Most classical statistical estimators and tests are based on the assumption that the stochastic processes to which they are applied are short-range dependent. A stationary stochastic process {Xi : i ∈ Z} is said to be short-range dependent if the autocovariance Cov{X0, Xj} (and the autocorrelation) decays fast enough, as the separation, or lag, j is increased, that the infinite sum of these autocovariances converges to a finite constant. On a practical level, this means that an arbitrarily large, though perhaps less than total, amount of the influence of the past can be contained in a sufficiently large finite history of the process. In the frequency domain, the spectral density (or power spectrum) of a short-range dependent process has a finite limiting value as the frequency approaches zero.
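The short-range versus long-range distinction can be made concrete with two autocovariance sequences (illustrative numbers, computed analytically rather than estimated from data): a geometric decay, as in an AR(1) process, is summable, while a power-law decay like (j + 1)^(−1/2) is not.

```python
# Short-range: AR(1) autocovariances Cov{X0, Xj} = sigma2 * phi**j.
phi, sigma2 = 0.8, 1.0
srd = [sum(sigma2 * phi ** j for j in range(n + 1)) for n in (10, 100, 1000)]
limit = sigma2 / (1 - phi)          # the infinite sum converges to 5.0

# Long-range: autocovariances decaying like (j + 1)**-0.5 are not
# summable; the partial sums grow without bound, roughly like 2*sqrt(n).
lrd = [sum((j + 1) ** -0.5 for j in range(n + 1)) for n in (10, 100, 1000)]
```

The `srd` partial sums settle onto `limit`, while each entry of `lrd` is several times the previous one, mirroring the divergence that defines LRD.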
Although most “classical” types of stochastic processes are short-range dependent, observations from many “physical” processes are not (see Beran, 1994, for some examples). Such processes are said to be long-range dependent (LRD). The infinite sum of the autocovariances of an LRD process is unbounded since these autocovariances decay very slowly (Beran, 1994; Cox, 1984). Equivalently, the spectral density of an LRD process has a pole at zero frequency. Therefore, any finite history of such a process, no matter how large, will necessarily leave out an infinitely large amount of the influence from the past. LRD can result in grossly inaccurate results if classical statistical methods are applied to empirical data derived from processes that possess it (Beran, 1994; Cox, 1984). Therefore, it is important to recognize LRD when it occurs in stochastic processes. The divergence of the infinite sum of the autocovariances,
lim_{n→∞} Σ_{j=0}^{n} Cov{X0, Xj} = ∞,   (2.1)
is perhaps most helpful in forming an intuitive understanding of LRD. It is important to note that LRD is an asymptotic property, one related to the asymptotic form of the autocovariance function. Because LRD is an asymptotic property, no individual autocovariance is by itself critical: the autocovariance at any fixed finite lag can be arbitrarily large or arbitrarily small for either a short-range or a long-range dependent process. The factor that determines the "range" of the dependence is the relationship among the autocovariances at arbitrarily large lags. Thus, LRD results from the joint effect of these covariances, not from the effect of any single covariance.

There are a number of different possibilities for defining LRD for arbitrary stochastic processes. Either the unbounded infinite sum of the autocorrelations or the spectral density pole at zero frequency would suffice as a definition (see, e.g., Beran, 1994). But following Daley and Vesilo (1997), and because of its ease of use both analytically and empirically, we will use the asymptotic behavior of the variance of the sum of consecutive random variables in the process to define LRD.

Definition 1. A stationary stochastic process {Xi : i ∈ Z}, for which each random variable Xi has finite variance, exhibits LRD when

lim sup_{n→∞} Var{Σ_{i=1}^{n} Xi} / n = ∞.

2.2 Long-Range Dependence in Stochastic Point Processes. Spike trains are not usually modeled with stochastic processes like those considered in section 2.1. Instead, they are typically represented mathematically by point processes, where each point represents the time of occurrence of a spike. A point process on the real line (and a neural spike train in time) can be described by either its interpoint distances or by the numbers of points in any arbitrary set of intervals on the real line.
If the points of a point process are given by the increasing sequence {τi : i ∈ Z}, then the sequence of intervals is given by {Yi = τi − τi−1 : i ∈ Z}. The number of points in an arbitrary interval (a, b], a < b, is given by N(a, b] = N((a, b]) = #{i : τi ∈ (a, b]}. Then, for example, the counts in a sequence of adjacent intervals of length T > 0 would be given by {N(a + (i − 1)T, a + iT] : i ∈ Z}, where a is any real number. Since either the sequence of interpoint intervals or the sequence of counts in adjacent intervals (or both) can be LRD, there are two ways in which a point process can exhibit LRD. Again following Daley and Vesilo (1997), we will call these two types of LRD long-range interval dependence and long-range count dependence, respectively. Thus, we have the following two definitions:
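These two equivalent descriptions of a point process can be spelled out on a toy example (the spike times below are hypothetical):

```python
# Hypothetical spike times (seconds), the increasing sequence {tau_i}:
taus = [0.4, 1.1, 1.7, 3.2, 3.9, 5.5, 6.0]
intervals = [b - a for a, b in zip(taus, taus[1:])]   # Y_i = tau_i - tau_{i-1}

def N(a, b, points=taus):
    """N(a, b] = #{i : tau_i in (a, b]}, the count in a half-open interval."""
    return sum(1 for t in points if a < t <= b)

T = 2.0
counts = [N((i - 1) * T, i * T) for i in range(1, 4)]  # adjacent windows
# counts == [3, 2, 2]; the window counts in (0, 3T] sum to N(0, 3T].
```

Either representation, the intervals or the counts, determines the process completely, and each gives rise to its own notion of long-range dependence.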
Definition 2. (Daley & Vesilo, 1997)¹ A stationary point process N(·) on the real line exhibits long-range interval dependence (LRiD) when the stationary sequence of interpoint intervals {Yi}, with finite variances, is LRD in the sense that

lim sup_{n→∞} Var{Σ_{i=1}^{n} Yi} / n = ∞.
Definition 3. (Daley & Vesilo, 1997) A second-order stationary point process N(·) on the real line exhibits long-range count dependence (LRcD) when

lim sup_{t→∞} Var{N(0, t]} / t = ∞.
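As a sanity check on this definition (a simulation sketch; the rate and window lengths are invented), a homogeneous Poisson process has Var{N(0, t]} = λt, so the ratio Var{N(0, t]}/t stays near λ for every horizon t, and the lim sup is finite: no LRcD.

```python
import random

random.seed(2)

def poisson_count(t, lam):
    """N(0, t] for a homogeneous Poisson process of rate lam."""
    s, n = random.expovariate(lam), 0
    while s <= t:
        n, s = n + 1, s + random.expovariate(lam)
    return n

lam, ratios = 5.0, []
for t in (10.0, 80.0):
    cs = [poisson_count(t, lam) for _ in range(3000)]
    mu = sum(cs) / len(cs)
    ratios.append(sum((c - mu) ** 2 for c in cs) / len(cs) / t)
# Both entries of `ratios` are close to lam = 5, independent of t.
```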
The definition of LRiD is a direct analog of the general definition for LRD (see definition 1). The definition for LRcD, however, takes a slightly different form in order to avoid the use of a counting-interval-length parameter T. This parameter is superfluous with regard to the definition of LRcD since

Σ_{i=1}^{n} N((i − 1)T, iT] = N(0, nT],

and nT goes to infinity as n goes to infinity.

2.3 Long-Range Dependence in Stochastic Renewal Point Processes. The conditions under which LRD is present in renewal point processes, point processes in which all interpoint intervals are mutually independent, will play a critical role in many of the arguments in this study. Clearly, a renewal point process cannot be LRiD, since by definition there is no dependence between the intervals,² but it can be LRcD (e.g., Daley & Vesilo, 1997; Lowen & Teich, 1993b, 1993c). But under what conditions, if any, can a renewal point process be LRcD? As it turns out, this question has a straightforward answer: a renewal point process is LRcD when the variance of the interpoint intervals is infinite. This was stated by Daley (1999), but only a terse argument was included there. Due to the abbreviated nature of those comments and some confusing elements in their argument, we will precisely state this result next. The proof, which follows Daley's argument, may be found in the appendix.

¹ The original definition of Daley and Vesilo (1997) does not contain the phrase "with finite variances," though the finiteness of the variances of the intervals is necessary and may have been implied.
² This is true even if the variance of the interval distribution is infinite, even though definition 2 is not applicable to that case. If an extension of the concept of LRiD to the infinite-interval-variance case is to be meaningful, a point process with independent intervals cannot also be LRiD.
Theorem 1. A stationary renewal point process with distribution function F of the generic interpoint-interval random variable X, which has F(0) = 0 and finite mean µ = E{X}, is LRcD if and only if E{X²} = ∞.³

2.4 The Hurst Indices. The strength of LRD may be measured using the Hurst index. Strictly speaking, the Hurst index is a measure of the self-similarity of a stochastic process (e.g., Beran, 1994; Samorodnitsky & Taqqu, 1994; Jackson, 2003, sect. 3.2.3). A (nonconstant) self-similar process Y(t) is necessarily nonstationary (Beran, 1994), but its increments X(t) ≡ Y(t) − Y(t − s), for some fixed s > 0, may form a stationary, long-range dependent process. If we associate the Hurst index H of the self-similar process Y with its incremental process X, then the Hurst index becomes a measure of LRD. If the covariances of the stationary increments of a self-similar process exist and decay to zero as the lag is increased, then 0 < H < 1 (Beran, 1994). For the present purposes, we can ignore the range 0 < H < 0.5, since the conditions for this to occur in a physical process are very unstable (Beran, 1994). When H = 0.5, the increment process has no dependence or memory. An example of a self-similar process with stationary increments and a Hurst index of H = 0.5 is Brownian motion, of which "white" gaussian noise is the stationary increment process. But when 0.5 < H < 1, the increment process is LRD, and larger values of H are associated with stronger LRD in the increment process. In general, however, the Hurst index does not need to be associated with a self-similar process or its increments. Instead of being a "self-similarity parameter," it may be thought of as a "long-memory parameter" for stationary processes.⁴ Further discussion of this connection may be found in Jackson (2003). An alternate definition of the Hurst index is

H = sup{ h : lim sup_{n→∞} Var{Σ_{i=1}^{n} Xi} / n^{2h} = ∞ }.

This definition is more closely related to the concept of long-range dependence and is applicable to all stationary stochastic processes, but it is still consistent with the prior interpretation of the Hurst index in the context of self-similar processes. For point processes, Daley (1999) has defined the Hurst index in an analogous manner with respect to its counting process.

³ Theorem 1 is restricted to stationary processes. However, since LRcD is a property of the limiting behavior of the process, the renewal point process need only be asymptotically stationary for the result to apply. This proves useful, for example, since we often start a renewal point process with a point at the origin. Such a process is not stationary but is asymptotically stationary.
⁴ In fact, the Hurst index derives its name from the exponent parameter that Hurst (1951) used to demonstrate the existence of long-range dependence in records of the level of the Nile River and records of other geophysical processes.
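Theorem 1 invites a quick numerical illustration (a sketch; the distribution and time horizons are invented). With Pareto-distributed intervals of tail index α = 1.5, which have finite mean but infinite variance, the ratio Var{N(0, t]}/t grows with t instead of settling to a constant:

```python
import random

random.seed(3)

def pareto_interval(alpha=1.5, xm=1.0):
    """Pareto interval: finite mean xm*alpha/(alpha - 1) = 3 here,
    but infinite variance since alpha <= 2."""
    return xm / random.random() ** (1.0 / alpha)

def var_count_over_t(t, trials=800):
    """Estimate Var{N(0, t]}/t for the Pareto renewal process."""
    cs = []
    for _ in range(trials):
        s, n = pareto_interval(), 0
        while s <= t:
            n, s = n + 1, s + pareto_interval()
        cs.append(n)
    mu = sum(cs) / len(cs)
    return sum((c - mu) ** 2 for c in cs) / len(cs) / t

r_small, r_large = var_count_over_t(50.0), var_count_over_t(800.0)
# r_large is several times r_small: Var{N(0, t]}/t keeps growing, the
# LRcD signature that the theorem ties to infinite interval variance.
```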
Definition 4. (Daley, 1999) The Hurst index of a stationary, ergodic,⁵ orderly⁶ point process N(·) with finite second moment is

H = sup{ h : lim sup_{t→∞} Var{N(0, t]} / t^{2h} = ∞ }.

Since, for the process assumed in this definition, Var{N(0, t]} = o(t²) for t → ∞, the Hurst index must be no larger than one. On the other hand, as long as it is not the case that the point process is zero with probability one,

lim sup_{t→∞} Var{N(0, t]} / t^{2h} = ∞   for any h ≤ 0.

So the Hurst index for this point process must be nonnegative. Thus, 0 ≤ H ≤ 1, which is the same range that we had for self-similar processes, save the possible inclusion of the end points. Furthermore, according to definition 3, if a point process is LRcD, then H ≥ 0.5. In practice, however, most non-LRcD stochastic point processes have H = 0.5, while LRcD point processes have H > 0.5. Continuous physical processes with H < 0.5 are statistically unstable and are therefore rare in practice. Point processes with H < 0.5 should share this property and are not useful, as far as we know, for modeling neural spike trains.

The Hurst index of definition 4, however, quantifies the strength of only LRcD in a point process. In order to quantify the strength of LRiD, we need an index of the dependence between the interpoint intervals. The Hurst index H of a point process was defined on the basis of the counts of the process, but we may also define H_I, the Hurst index of the intervals of the process.

Definition 5. Let N(·) be a stationary, ergodic, orderly point process with finite second moment, and let {Yi} be the stationary sequence of its interpoint intervals, also with finite second moment. Then the interval Hurst index of this point process is

H_I = sup{ h : lim sup_{n→∞} Var{Σ_{i=1}^{n} Yi} / n^{2h} = ∞ }.

As for the (count) Hurst index, 0 ≤ H_I ≤ 1 (Jackson, 2003), and, according to definition 2, if the point process is LRiD, then H_I ≥ 0.5. Furthermore, larger values of H_I signify stronger LRiD.

⁵ A stationary point process N(·) with finite mean density m = E{N(0, 1]} is ergodic if Pr{lim_{x→∞} N(0, x]/x = m} = 1 (Daley, Rolski, & Vesilo, 2000).
⁶ A point process N(·) on the real line is orderly if Pr{N(t, t + δ] > 1} = o(δ), for all t ∈ R. This implies that the point process has no multiple simultaneous occurrences (see, e.g., Cox & Isham, 1980; Daley & Vere-Jones, 1988).
Long-Range Dependence in Models of Cortical Variability
2133
2.5 Relationship Between the Different Types of Long-Range Dependence and the Variability of Intervals. The two types of LRD, LRcD and LRiD, are not independent. In this section, we give a brief overview of the relationship between these two types of LRD and their relationship to the variability of the interpoint intervals. Jackson (2003) contains a more thorough discussion of these relationships and makes explicit what has been proven mathematically and what has not. A complete mathematical theory of this relationship, however, is not currently available.

In general, LRcD can be associated with LRiD, infinite-variance interpoint intervals, or both. Presumably, LRcD cannot exist in a point process without the presence of at least one of these two properties of the interpoint intervals. Even in these seemingly straightforward statements, a complication in the theory of LRD in point processes arises: how can the intuitively plausible circumstance of simultaneous LRiD and infinite-variance interpoint intervals occur? LRiD is defined (according to definition 2) only when the interpoint intervals have finite variance, which derives from the fact that LRD is defined only for finite-variance processes (see definition 1). Unfortunately, at the present time, a general definition of LRD for infinite-variance processes does not exist. However, several methods for handling this situation have been applied in the past (see Jackson, 2003).

One method for distinguishing LRD in infinite-variance processes makes use of the sample correlation. Although, strictly speaking, the correlation is not defined for infinite-variance random variables, the sample correlation can still be calculated for a sample from an infinite-variance process. Assuming that the sample correlation retains pertinent properties when applied to an infinite-variance process, its asymptotic properties should be useful for distinguishing short-range and long-range dependence.
The study of Davis and Resnick (1986) supports this argument. They found that in the case of the moving average process,

\[
X_n = \sum_{j=-\infty}^{\infty} c_j Z_{n-j},
\]

where {Zi} is a sequence of independent and identically distributed random variables called the innovations, the sample correlation converges to the same function of the coefficients {cj} whether the innovations {Zi} have finite or infinite variance.

In this study, we will essentially use this strategy for handling infinite-variance processes when we consider LRiD in both simulated and empirical neural spike trains. Thus, we will assume that the sample statistics will asymptotically behave in a similar manner whether the intervals have infinite variance or not. Furthermore, we will see that results based on this assumption are coherent.
2.6 Statistical Procedures for Recognizing Long-Range Dependence in Point Processes

2.6.1 Shuffled Surrogate Data. Shuffled surrogate data can prove useful in determining the relative contributions of infinite-variance intervals and LRiD to LRcD in a point process or a neural spike train. Teich, Lowen, and their coworkers (Teich, Turcott, & Lowen, 1990; Lowen & Teich, 1992; Teich & Lowen, 1994; Turcott, Barker, & Teich, 1995; Lowen & Teich, 1996b; Teich et al., 1996, 1997; Turcott & Teich, 1996) have made extensive use of this method to "distinguish those properties of the data that arise from correlation among intervals from those properties inherent in the form of the [distribution of interval lengths]" (Turcott & Teich, 1996). The recent work of Daley and his coworkers (Daley & Vesilo, 1997; Daley, 1999; Daley et al., 2000) and Kulik and Szekli (2001) further develops an understanding of the information that is available from the shuffled surrogate data.

A set of shuffled surrogate data is formed by randomly shuffling the interpoint intervals of a finite sample from a stochastic point process or the interspike intervals of a finite spike train. The surrogate point process formed from the shuffled intervals has the same distribution of interpoint intervals as the original, but the dependency structure of the interpoint intervals has been disrupted. Thus, the surrogate point process is essentially equivalent to a sample from a renewal point process that has an interpoint interval distribution equivalent to the marginal distribution of the intervals in the original point process. Therefore, the surrogate point process cannot be LRiD, and any LRcD present is due to the high variability of the interspike interval distribution (Daley, 1999, theorem 1). Consequently, we expect that for a point process with no LRiD, the "amount" of LRcD in the surrogate data would be equivalent to that in the original data, since no LRcD would be destroyed by the shuffling procedure.
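The shuffling procedure itself is straightforward. The sketch below is our own illustrative implementation (function names are ours, not from the studies cited above): it builds a surrogate spike train that preserves the marginal interspike-interval distribution exactly while destroying any serial dependence between intervals.

```python
import random

def shuffled_surrogate(spike_times, rng=None):
    """Form a surrogate spike train by randomly permuting the
    interspike intervals of the original train."""
    rng = rng or random.Random()
    isis = [b - a for a, b in zip(spike_times, spike_times[1:])]
    rng.shuffle(isis)  # destroys interval correlations, keeps the marginal distribution
    surrogate = [spike_times[0]]
    for isi in isis:
        surrogate.append(surrogate[-1] + isi)
    return surrogate
```

Any count-based statistic (such as the Fano factor) computed on the surrogate then reflects only the shape of the interval distribution, not the correlations between intervals.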
Furthermore, for a point process with finite-variance interpoint intervals, we expect that the surrogate data would have no LRcD, since only infinite-variance intervals can create LRcD in the shuffled data. The results of these arguments are presented in Table 1 as the four possible combinations of finite- or infinite-variance intervals and LRiD or no LRiD.

2.6.2 Statistical Functions for the Variance of Aggregations. In order to investigate the presence of LRD in neural spike trains, we need to examine the behavior of the variance of aggregations of random variables as the aggregation size increases. Consider a discrete, stochastic time series X1, X2, X3, . . . , and let

\[
X^{(M)} = \sum_{i=1}^{M} X_i
\]
Table 1: The Four Different Scenarios for the Presence or Absence of Long-Range Dependence in a Point Process and Their Effect on Shuffled Surrogate Data.

Variance of Intervals | LRiD in Original Data? | LRcD in Original Data? | LRcD in Surrogate Data? | "Amounts" of LRcD
Finite                | No                     | No                     | No                      | Original = Surrogate
Finite                | Yes                    | Yes                    | No                      | Original > Surrogate
Infinite              | No                     | Yes                    | Yes                     | Original = Surrogate
Infinite              | Yes                    | Yes                    | Yes                     | Original > Surrogate

Note: The four possibilities are determined by whether the interpoint intervals have finite variance and whether the point process is LRiD.
be the aggregation of M adjacent random variables in the time series. Then we have chosen to plot

\[
\frac{1}{M}\operatorname{Var}\{X^{(M)}\}
\tag{2.2}
\]
versus M. This is slightly different from the standard variance-time curve but has some practical advantages. Our method results in a curve that will have a slope of zero if there is no dependence between the random variables, a positive slope if there is positive dependence, and a negative slope if there is negative dependence. This facilitates simple qualitative interpretation of these graphs. Furthermore, equation 2.2 yields expressions that differ by only a constant from the limit arguments in the definitions of LRcD and LRiD. We will make this statement more explicit as we introduce the precise functions that we have chosen to graph.

2.6.3 The Index-of-Dispersion Curves. There are two index-of-dispersion curves that prove useful in investigating LRD properties in point processes and spike trains: the index-of-dispersion curve of the counts, also known as the Fano factor curve (FFC), and the index-of-dispersion curve of the interpoint intervals (IDC). Each is the graph of the variance of an aggregation of random variables, divided by a function of the mean of the aggregated variables, versus the aggregation size. The FFC is the graph of the function (e.g., Fano, 1947; Teich, 1989; Lowen & Teich, 1991; Thurner et al., 1997; Cox & Isham, 1980)

\[
F(t) = \frac{\operatorname{Var}\{N(0,t]\}}{\operatorname{E}\{N(0,t]\}}
     = \frac{\operatorname{Var}\{N(0,t]\}}{t\,\operatorname{E}\{N(0,1]\}}
     = \frac{\operatorname{Var}\{N(0,t]\}}{t}\cdot\frac{1}{\operatorname{E}\{N(0,1]\}},
\tag{2.3}
\]
while the IDC is the graph of the function (e.g., Cox & Isham, 1980)

\[
I(k) = \frac{\operatorname{Var}\{\sum_{i=1}^{k} X_i\}}{k\,(\operatorname{E}\{X\})^2}
     = \frac{\operatorname{Var}\{\sum_{i=1}^{k} X_i\}}{k}\cdot\frac{1}{(\operatorname{E}\{X\})^2},
\tag{2.4}
\]
where {Xi} is the sequence of interpoint intervals and X is the generic interpoint-interval random variable. Both F(t) and I(k) are equal to one for all t and k, respectively, when the point process is a Poisson process.

In both equations 2.3 and 2.4, the last expression splits the definition into two parts. The second factor of each of these expressions is a constant, while the first is the limit argument in the definition of either LRcD or LRiD. Thus, the FFC will increase without bound with increasing t when the point process is LRcD, and the IDC will increase without bound with increasing k when the point process is LRiD. Furthermore, when the intervals of a stationary point process have finite variance, the FFC and IDC are asymptotically related by the equation (Cox & Isham, 1980)

\[
\lim_{k\to\infty} I(k) = \lim_{t\to\infty} F(t).
\tag{2.5}
\]
From this we can prove the following relationship between LRcD and LRiD:

Proposition 1. Let N(·) be a stationary, ergodic, orderly point process with finite variance (i.e., E{[N(A)]²} < ∞ for bounded A) and a stationary sequence of intervals {Xi : i ∈ N} that has finite variance. Then N is LRcD if and only if it is LRiD.

Proof. This result follows directly from equations 2.5, 2.4, and 2.3, and the definitions of LRcD and LRiD.

Regardless of whether the point process is in any way LRD, the FFC and IDC can also convey information about other (short-range) dependencies. The FFC and IDC begin, at low values of t and k, respectively, at particular values and then increase and decrease depending on the correlation properties of the process. The limit of the FFC as t approaches zero is always one if the point process is stationary and orderly (Jackson, 2003). The IDC has a realizable lower limit of k = 1, where it is equal to Var{X}/(E{X})². Decreases in the FFC or IDC are caused by negative correlations in the process itself or between the interpoint intervals, respectively, while increases are caused by positive correlations. Since correlations in the interpoint intervals of a process create same-sign correlations in the process itself, increases or decreases in the FFC can be accompanied by similar increases or decreases in the IDC. However, decreases in the FFC may also be caused by a marginal interval distribution with a lower variance than that of the exponential distribution (the interval distribution for a Poisson process) with an equivalent mean,
while increases may be caused by a marginal interval distribution with a higher variance than that of the equivalent-mean exponential distribution. These different effects on the shape of the FFC may be teased apart using the IDC and the surrogate data FFC. As already noted, the IDC will convey information about the correlations between interpoint intervals, but the surrogate data FFC will not be affected by such correlations in the original process. Instead, increases and decreases in the surrogate data FFC will depend on only the shape of the marginal interval distribution.

Figure 1 contains sets of estimated FFCs and IDCs for several different types of LRD point processes. These curves are plotted using a logarithmic scale on both the abscissa and the ordinate since LRD processes typically produce curves that approximate a power law (a straight line on these "log-log" plots) at large counting intervals or large interpoint interval aggregations. Furthermore, processes with stronger LRD (and higher Hurst indices) have curves with larger asymptotic "log-log" slopes.
[Figure 1 appears here: three columns of panels (a, b, c), each showing a Fano factor curve (versus counting interval in seconds) above an index-of-dispersion curve (versus number of aggregated intervals), on log-log axes; solid lines show original data and dashed lines show shuffled surrogate data.]
Figure 1: Fano factor curves and curves of the index of dispersion of intervals for finite-length samples from three point-process types related to neural spike trains and models. Shuffled surrogate data (dashed lines) are formed by randomly shuffling the interpoint intervals of the original sample data from the point process. (a) Curves for a renewal process with an interval distribution possessing infinite variance. (b) Curves for a point process with intervals that are long-range dependent and have a distribution with finite variance. The three Fano factor curves for shuffled surrogate data and three sets of index-of-dispersion curves are from processes with interval variability greater than a Poisson process (top), equal to a Poisson process (middle), and less than a Poisson process (bottom). (c) Curves for a point process with intervals that are long-range dependent and have a distribution with infinite variance.
Figure 1a depicts the FFCs and IDCs for samples from an LRcD renewal point process, which, of course, is not LRiD. In this case, the interpoint intervals are completely independent, and LRcD is present only if the interval distribution has infinite variance (Daley, 1999; theorem 1). Thus, shuffling the intervals does not change the statistical properties of the point process, and the FFCs for the original and surrogate data are nearly the same, as are the IDCs for the original and surrogate data. Furthermore, the IDCs are essentially horizontal lines, owing to the independence of the intervals. Theoretically, the values of the index of dispersion of aggregated intervals should be infinite, since the variance of the interpoint interval distribution is infinite. But since they are derived from estimated values from a limited sample of the process, the IDCs are at finite values. These estimated values, however, are high, being approximately equivalent to the highest values of the corresponding FFCs. If the length of the sample from the renewal process is gradually increased, then the trend should be for these calculated IDCs to move upward without bound. Figure 1b illustrates representative FFCs and IDCs for point processes with LRiD and finite-variance interpoint intervals. In this case, although the FFC for the original data increases without bound, the FFCs for surrogate data approach a constant limiting value. If the standard deviation of the intervals is equal to their mean, as in the case of exponentially distributed intervals, then the asymptotic value of the surrogate data FFC will be one. If the interval variance is larger than in this case, then the surrogate data FFC will asymptotically approach a value greater than one, and if it is less, the FFC will approach a value less than one. The LRiD of the point process is evident in the ever-increasing original data IDCs. 
The IDCs for the shuffled surrogate data, on the other hand, are horizontal lines, since the shuffling procedure destroys the serial dependence between the intervals and the intervals have finite variance. The three sets of IDCs (top, middle, and bottom) correspond to the three surrogate data FFCs (top, middle, and bottom, respectively). So if the standard deviation of the intervals is equal to their mean, as in the case of exponentially distributed intervals, then the IDC values for a single “aggregated” interval are equal to one. If the interval variance is larger than the mean, then the IDCs have a value greater than one for a single interval, and if it is less than the mean, then the IDCs have a value less than one for a single interval. Finally, Figure 1c depicts the FFCs and IDCs for an LRcD point process that both is LRiD and has intervals distributed with infinite variance. In this case, the FFC for the shuffled surrogate data increases without bound due to the infinite variance of the interval distribution. However, since the intervals are also LRD, this FFC is not the same as the FFC for the original data. Instead, the surrogate data FFC has a shallower slope than the original data FFC due to the loss of LRiD caused by the shuffling procedure. The IDCs in Figure 1c do not differ significantly in shape from those in Figure 1b, although the point processes are statistically different in these two cases.
These IDCs convey the fact that the point process is LRiD, as is evident from the ever-increasing IDC for the original data. Furthermore, whereas the slopes of the original data IDCs in Figure 1b are approximately the same as those of the corresponding FFCs, the slope of the original data IDC in Figure 1c is shallower than that of the original data FFC. In terms of the Hurst indices, this means that in Figure 1b, HI ≈ H, while in Figure 1c, HI < H. This implies that in the latter case, only some of the LRcD is due to LRiD; the rest is due to the infinite variance of the intervals. However, as in Figure 1a, the infinite variance of the interval distribution is not evident in the IDCs alone due to the finite sample time. But as the length of the sample increases, the vertical position of these curves should, on average, move upward without bound.

Thus, we have defined a statistical procedure, the FFC, for recognizing LRD in the counts of point process data, and another, the IDC, for recognizing LRD in the intervals of point process data.7 Together, these two statistical curves can often detect the presence of infinite variance in the interpoint intervals as well. Furthermore, by comparing these statistical curves for an original set of data with those for a set of data obtained by randomly shuffling the original interpoint intervals, a more robust and sensitive indicator of the presence of LRcD, LRiD, and infinite interval variance is produced. Hence, using the original data FFC and IDC and the surrogate data FFC and IDC in combination is a good strategy for discerning among the four potential scenarios of Table 1 for LRD in point process data, such as those encountered when studying neural spike trains. In addition, this analysis provides information on the strength of LRcD in the data and the relative contributions of LRiD and infinite interval variance to its presence.

2.7 Long-Range Dependence in Neural Spike Trains.
The spike trains of many neurons possess long-range dependence, although this term appears infrequently in the neurophysiological literature. Sometimes this property is called long-term correlation, a designation that is quite similar to long-range dependence, but in other contexts, more specialized terms such as second-order self-similarity, fractal behavior, and power law (second-order) statistics are used. The use of the latter terms derives from the fact that the correlation function (and other equivalent second-order statistical functions) of a long-range dependent process usually approximates the form of a power law function. Also, since the power spectral density, the Fourier transform of the correlation function, then has the form 1/f^α for 0 < α < 2, long-range-dependent spike trains are sometimes designated as having 1/f fluctuations.

7 Further discussion and more detailed mathematical analysis of the FFC and IDC may be found in Jackson (2003).

A large majority of the spike trains from mammalian sensory systems, at both the subcortical and cortical levels, that have been assayed for long-
range dependence also exhibit it, and long-range dependence has also been found in the visual system of certain insects (Turcott et al., 1995). Long-range dependence has been found at many different levels of the mammalian visual and auditory systems, including the primary auditory nerve (Teich, 1989; Teich & Lowen, 1994; Lowen & Teich, 1996b), the lateral superior olive (Teich et al., 1990; Turcott et al., 1994), the retina and lateral geniculate nucleus (Teich, 1996; Teich et al., 1997; Lowen et al., 2001), and the visual cortex (Teich et al., 1996). Furthermore, long-range dependence has been found in other sensory and nonsensory systems: in ventrobasal neurons of the thalamus (Kodama, Mushiake, Shima, Nakahama, & Yamamoto, 1989), the somatosensory cortex (Wise, 1981), the mesencephalic reticular formation (Yamamoto, Nakahama, Shima, Kodama, & Mushiake, 1986; Gruneis et al., 1989, 1990, 1993), and the hippocampus (Mushiake, Kodama, Shima, Yamamoto, & Nakahama, 1988; Kodama et al., 1989). One notable exception to the existence of long-range dependence in mammalian sensory neurons is the peripheral vestibular system (Teich, 1989). But since the peripheral vestibular system is the only system, to the best of our knowledge, that has produced a negative result in assays for LRD, LRD seems to be a pervasive property of spike trains in most sensory neural systems and may be common in other neural systems as well. The articles mentioned above primarily analyze the count statistics of spike trains, but none of these papers refers to long-range count dependence as formulated in definition 3. However, they all illustrate statistical characteristics of spike trains that are related to long-range dependence. Often the Fano factor curves for these spike trains are shown to diverge for large counting times, implying that the spike trains exhibit LRcD. Other studies used the power spectral density instead, showing that it diverges as frequency decreases to zero. 
This also implies that a spike train is LRcD (see lemma 2.3 in Jackson, 2003). Although the presence of LRcD can be supported by their analyses, many of the studies mentioned above did not make use of procedures that can discern the presence of LRiD. Nevertheless, Teich, Lowen, and their coworkers often calculated FFCs for shuffled surrogate data, which can be informative with respect to LRiD, in their studies. In subcortical auditory and visual neurons, they found that FFCs for shuffled surrogate data asymptote to a constant less than one (Teich, 1989, 1996; Teich et al., 1990, 1997; Teich & Lowen, 1994; Turcott et al., 1994; Lowen & Teich, 1996b; Lowen et al., 2001). Thus, the LRcD in these systems is a result of LRiD, and the intervals have sub-Poissonian variability. In primary visual cortex, Teich et al. (1996) also found that the FFCs for shuffled surrogate data asymptote to a constant, but in this case the constant was greater than one. These results signify that as in subcortical neurons, the LRcD in the spike trains is due to LRiD. However, unlike the interspike intervals in subcortical neurons, those in cortical neurons are more highly variable than the intervals of a Poisson process. Therefore, although the FFCs for the original spike trains from both sub-
cortical and cortical neurons are similar to that in Figure 1b, the FFCs for shuffled surrogate data are different between these two sections of the nervous system. The surrogate data FFCs for subcortical neurons are similar to the lowermost dashed curve in Figure 1b, whereas those for cortical neurons are similar to the uppermost dashed curve.

3 Integrate-and-Fire Models That Produce Renewal Point Process Outputs

3.1 The Balanced Excitation-Inhibition Model. Gerstein and Mandelbrot (1964) first proposed that the output of a neuron is highly variable if there is a balance between the amounts of excitation and inhibition in its synaptic inputs. Later, Shadlen and Newsome (1994) used this idea as a solution to the apparent incompatibility, noted by Softky and Koch (1992, 1993), of temporal integration and high variability for cortical neurons. Work on this solution was further extended in several subsequent studies (Brown & Feng, 1999; Feng & Brown, 1998a; Feng, 1999; Shadlen & Newsome, 1998; Burkitt, 2000, 2001). In this section, we evaluate whether excitation-inhibition balance can produce LRcD, in addition to high variability, in simple IF models. We restrict ourselves to the simplest of these models since (1) they can be handled analytically, (2) they include the primary models considered in the literature, and (3) more complex models are covered by the arguments for generic renewal point processes in the subsequent section. This exercise will serve to illustrate the conflict between LRcD and realistic values of CVISI as it occurs in a particular type of renewal model.

The model that we are considering here is the basic IF model, without leakage or reversal potentials, that has both excitatory and inhibitory inputs, all of which are Poisson processes and mutually independent.
Each event, or "spike," arriving on an input causes a postsynaptic potential (PSP): a depolarizing potential (EPSP) for an excitatory input and a hyperpolarizing potential (IPSP) for an inhibitory input. Both EPSPs and IPSPs are Dirac delta functions, causing instantaneous jumps in the voltage, and, for simplicity and because exact analytical results are available in this case, we begin with EPSPs and IPSPs that are equal in amplitude. We denote the EPSP amplitudes by aE > 0 and the IPSP amplitudes by aI > 0. Thus, in the present case, we can let a = aE = aI. Furthermore, we let ME and MI denote the number of excitatory and inhibitory inputs, respectively, and λE and λI denote the input rate for each excitatory and inhibitory fiber, respectively. Since all inputs are independent Poisson processes, they can be combined such that only two effective inputs need to be considered: an excitatory input of rate ΛE = ME λE and an inhibitory input of rate ΛI = MI λI.

The integral, with respect to time, of all of the resulting postsynaptic potentials is a random walk, which forms the time-varying potential, V(t), of the IF model. When this potential crosses a predetermined, constant threshold, Vth, an output spike is initiated. Following an output spike, the voltage is reset to its resting level, v0, and the process starts anew. Since the inputs are Poisson processes, the postsynaptic potentials are delta functions, and the IF neuron completely resets at each occurrence of an output spike, the output of the present model is clearly a renewal process. Hence, the output is completely specified by its interval distribution or, equivalently, the first passage time of the potential V(t) to level Vth from V(0) = v0. Tuckwell (1988) has derived the interval density, f(t), for the output of this model,

\[
f(t) = \hat{\theta} \left( \frac{\Lambda_E}{\Lambda_I} \right)^{\hat{\theta}/2}
       \frac{e^{-(\Lambda_E + \Lambda_I)t}}{t}\,
       I_{\hat{\theta}}\!\left( 2\sqrt{\Lambda_E \Lambda_I}\, t \right),
\qquad t > 0,
\tag{3.1}
\]

where θ̂ = ⌈(Vth − v0)/a⌉ is the number of excitatory inputs required for V(t) to cross threshold, ⌈x⌉ is the least integer greater than or equal to x, and Iρ(x) is the modified Bessel function:

\[
I_\rho(x) = \sum_{k=0}^{\infty} \frac{1}{k!\,\Gamma(k+\rho+1)} \left( \frac{x}{2} \right)^{2k+\rho}.
\]
Now let X be an arbitrary interspike interval with density function f(t). Then, according to Tuckwell (1988), the quantities characterizing X are

\[
\Pr\{X < \infty\} =
\begin{cases}
1, & \text{if } \Lambda_E \ge \Lambda_I, \\[4pt]
\left( \Lambda_E / \Lambda_I \right)^{\hat{\theta}}, & \text{if } \Lambda_E < \Lambda_I,
\end{cases}
\tag{3.2}
\]

\[
\operatorname{E}\{X\} =
\begin{cases}
\dfrac{\hat{\theta}}{\Lambda_E - \Lambda_I}, & \text{if } \Lambda_E > \Lambda_I, \\[6pt]
\infty, & \text{if } \Lambda_E \le \Lambda_I,
\end{cases}
\tag{3.3}
\]

and

\[
\operatorname{Var}\{X\} =
\begin{cases}
\dfrac{\hat{\theta}(\Lambda_E + \Lambda_I)}{(\Lambda_E - \Lambda_I)^3}, & \text{if } \Lambda_E > \Lambda_I, \\[6pt]
\infty, & \text{if } \Lambda_E \le \Lambda_I.
\end{cases}
\tag{3.4}
\]

Thus, in the case when ΛE ≤ ΛI, the coefficient of variation of the interspike intervals, CVISI, does not exist. However, if ΛE > ΛI, the coefficient of variation is

\[
\operatorname{CV}\{X\} = \sqrt{ \frac{\Lambda_E + \Lambda_I}{\hat{\theta}\,(\Lambda_E - \Lambda_I)} }.
\tag{3.5}
\]
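Equations 3.3 and 3.5 are easy to check by Monte Carlo simulation. The sketch below (helper names and parameter values are ours, chosen only for illustration) merges the excitatory and inhibitory Poisson streams into a single event stream and tracks the random walk to threshold; with ΛE = 100, ΛI = 60, and θ̂ = 10, the empirical mean interval and CVISI should approach θ̂/(ΛE − ΛI) = 0.25 and √((ΛE + ΛI)/(θ̂(ΛE − ΛI))) ≈ 0.632.

```python
import math
import random

def first_passage_time(lam_e, lam_i, theta_hat, rng):
    """One interspike interval of the non-leaky IF model with equal-amplitude
    delta-function PSPs: a +/-1 random walk driven by merged Poisson inputs."""
    total = lam_e + lam_i          # rate of the merged event stream
    p_exc = lam_e / total          # probability that an event is excitatory
    v, t = 0, 0.0
    while v < theta_hat:
        t += rng.expovariate(total)           # waiting time to the next input event
        v += 1 if rng.random() < p_exc else -1  # EPSP or IPSP, in units of a
    return t

rng = random.Random(7)
lam_e, lam_i, theta_hat = 100.0, 60.0, 10
intervals = [first_passage_time(lam_e, lam_i, theta_hat, rng) for _ in range(20000)]
mean = sum(intervals) / len(intervals)
var = sum((x - mean) ** 2 for x in intervals) / (len(intervals) - 1)
cv = math.sqrt(var) / mean
```

As ΛI is pushed toward ΛE, the empirical cv grows without bound, illustrating numerically the impasse discussed next.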
Therefore, by adjusting ΛI within the interval [0, ΛE), the CVISI can be set to any value in the interval [1/√θ̂, ∞), with arbitrarily large values occurring as ΛI approaches ΛE. In other words, as the amount of inhibition is increased to bring the model into perfect excitation-inhibition balance, the CVISI goes to infinity.

Now we wish to know when the output of this process exhibits LRcD. Since the output of the model is a renewal point process, from theorem 1 we know that if E{X} < ∞, then the output will be LRcD if and only if E{X²} = ∞. Furthermore, if the mean E{X} is infinite, then the second moment E{X²} must be infinite as well. Hence, according to equation 3.4, a necessary condition for the model to be LRcD is that ΛE ≤ ΛI.8

Thus, in trying to fit this IF model to both the CVISI and LRcD properties of real neurons, we find ourselves at an impasse. With the amount of inhibition less than the amount of excitation, the model can match the CVISI values measured from real neurons, but under these conditions, it is not LRcD, since both the mean and variance are finite. On the other hand, by making the amount of inhibition equal to or greater than the amount of excitation, we may create LRcD in the model, but the CVISI becomes infinite. Therefore, this model of highly variable cortical neurons cannot manifest LRcD while still producing interspike-interval variability consistent with empirical measurements.

A common generalization of this model is produced by relaxing the condition that the EPSPs and IPSPs have equal magnitude. In other words, consider the case when aE ≠ aI. This is similar to the model proposed by Shadlen and Newsome (1994) for matching the variability of real cortical neurons, except that their model had a lower limit, or elastic barrier, on the voltage of the IF neuron. This model, unlike the case when aE = aI, cannot be analyzed directly.
However, the potential V(t) in this case can be approximated by a Wiener process with drift, an approximation that becomes exact as the input rates ΛE and ΛI go to infinity and the PSP amplitudes aE and aI go to zero. Expressions for the density, distribution, and first three central moments of the output intervals for this model are also available in Tuckwell (1988) and have been reproduced in Jackson (2003). From these equations, we find that for the approximation to this more general model, the coefficient of variation of the interspike intervals, CVISI, does not exist when aE ΛE ≤ aI ΛI. However, if aE ΛE > aI ΛI, the coefficient of variation is

\[
\operatorname{CV}\{X\} = \sqrt{ \frac{a_E^2 \Lambda_E + a_I^2 \Lambda_I}{\theta\,(a_E \Lambda_E - a_I \Lambda_I)} },
\tag{3.6}
\]

where θ = Vth − v0 does not depend on the PSP amplitudes, since we no longer have the simplifying assumption that they are all equal. Therefore,

8 Since, according to equation 3.3, E{X} is also infinite when ΛE ≤ ΛI, we cannot conclude from theorem 1 that this is also a sufficient condition.
we see that the CVISI may take any value in the interval [√(aE/θ), ∞) when aE ΛE > aI ΛI, with arbitrarily large values occurring as aI ΛI approaches aE ΛE. Thus, we are in the same predicament as before. If aE ΛE > aI ΛI, then the model can be adjusted to match the CVISI values measured from real neurons. However, under this condition, both the mean and variance are finite, and therefore the model is not LRcD. But if aE ΛE ≤ aI ΛI, the process may be LRcD, but the CVISI is infinite. Therefore, even when the EPSP and IPSP magnitudes are unequal, the IF model is incapable of matching both the interspike-interval variability and the LRcD measured in real cortical neurons.

3.2 Renewal Integrate-and-Fire Models and Their General Properties. In the previous section, we showed that the basic high-variability IF model that requires balanced excitation and inhibition cannot produce both a finite CVISI and LRcD at the same time. However, this result is easily extended to all renewal models, a class that includes a significant portion of the single-neuron, high-variability IF models in the literature. By using the term single-neuron, we are excluding consideration of network models, where the statistical nature of the entire set of inputs to each neuron is not explicitly specified but instead consists, at least partially, of outputs from other similar neurons within an interconnected network.

If each component of an IF model is renewal (i.e., has no memory of the process prior to the time of the last output spike), then the output of this model must be renewal. Hence, the following conditions are jointly sufficient to render the output of such a model renewal:

1. All inputs to the IF neuron are Poisson processes (i.e., are stationary with no autocorrelations).

2. The cross-covariance between any set of inputs is zero for any nonzero lag.

3. The PSPs are direct changes in the IF potential, and either the IF potential is reset to a fixed value after the occurrence of each output spike or these reset values form a set of independent and identically distributed random variables.

4. All other parameters of the model (e.g., threshold, leakage conductance, or reversal potentials) are either constant, set to a fixed value upon the occurrence of an output spike, or have post–output spike values that form a set of independent and identically distributed random variables.

These conditions are not entirely general, but they are practically general for the high-variability IF models that have been studied in the literature. Note that if the above conditions are met, it does not matter whether the
Long-Range Dependence in Models of Cortical Variability
2145
model has leakage or reversal potentials or a dynamic threshold, as long as their parameters meet condition 4. Furthermore, the PSPs may have nonzero duration as long as condition 3 is met. Many of the nonnetwork high-variability IF models in the literature meet these four conditions and are therefore renewal. Excitation-inhibition balance models that fit this category were considered by Shadlen and Newsome (1994), Brown and Feng (1999), Feng and Brown (1998a, 1999), Feng (1999), and Burkitt (2000, 2001). Renewal models that produce highly variable outputs due to correlations between inputs9 were considered by Feng and Brown (2000a), Feng (2001), Feng and Zhang (2001), and Salinas and Sejnowski (2000). Other models that are renewal as well are the partial reset models considered in Troyer and Miller (1997) and Bugmann et al. (1997), the time-varying threshold model considered in Wilbur and Rinzel (1983), and the nonlinear leakage model considered in Feng and Brown (2000b). According to theorem 1, each of these models that has an interval distribution with finite mean either has an interval distribution with finite variance and is not LRcD or has an interval distribution with infinite variance and is LRcD. In the first case, it might be possible to match empirically measured values of CVISI , but the LRcD property of real neurons is unattainable. However, in the second case, the model will be LRcD, but it will be impossible to match empirically measured CVISI values. Thus, just like the model that was considered in section 3.1, all renewal models with finite-mean intervals fail to match both the interspike-interval variability and LRcD measured in real cortical neurons. The preceding argument assumes that the interval distribution has a finite mean. 
The example in section 3.1, where the limit of CVISI as the model approached the infinite-mean condition could be determined analytically, shows that at least some renewal models with infinite mean intervals fail to produce the required variability and LRcD properties, and fail in a manner similar to that described in the preceding argument for the finite mean case. This, however, does not prove that all such models fail in this way. Nevertheless, models with infinite mean intervals have at least two degenerate properties. First, such a model cannot be stationary. This means, for instance, that if we were to analyze or simulate a renewal point process with an infinite mean interval, we would need to begin with a point at some specified time—for instance, at the origin. Second, since an interval distribution with an infinite mean will also have an infinite variance, the CVISI has no meaning for such a model. Therefore, regardless of whether the model is LRcD, a direct comparison of the interval variability of the model with that of real cortical neurons is impossible using a single measure.
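The degenerate moments of heavy-tailed interval distributions can be made concrete with the Pareto density introduced later in this article (equation 4.2). The following is a minimal numerical sketch (Python; the function name and parameter choices are ours, not from the original study): for α ≤ 1 the truncated first moment ∫₀ᵀ t f(t) dt grows without bound as T increases, so the mean interval is infinite, and for 1 < α ≤ 2 the mean converges but the truncated second moment diverges, so the CVISI is indeed meaningless.

```python
import numpy as np

def truncated_moment(k, T, alpha, K=1.0, n=400_001):
    """Trapezoidal estimate of the truncated moment  ∫_0^T t^k f(t) dt
    for the Pareto density f(t) = alpha * K**alpha * (t + K)**(-alpha - 1)."""
    t = np.linspace(0.0, T, n)
    f = alpha * K**alpha * (t + K) ** (-alpha - 1)
    integrand = t**k * f
    dt = t[1] - t[0]
    return np.sum(integrand[:-1] + integrand[1:]) * dt / 2.0

# alpha = 0.5: even the mean is infinite -- the truncated mean keeps growing.
for T in (1e2, 1e4, 1e6):
    print(T, truncated_moment(1, T, alpha=0.5))

# alpha = 1.5: the mean is finite (K/(alpha - 1)), but the second moment,
# and hence the variance, diverges as the truncation point increases.
for T in (1e2, 1e4, 1e6):
    print(T, truncated_moment(2, T, alpha=1.5))
```

The growing truncated moments mirror the behavior described above for sample means and sample variances computed from ever-longer stretches of such a process.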
9 The inputs of these models are correlated only at zero lag and do not therefore violate condition 2.
To test whether such a model is reasonable, we should consider the behavior of both the sample mean and the sample variance of interspike intervals from physiological recordings as an increasing number of interspike intervals are included in the calculation. If both increase without bound for long recordings from cortical neurons, then it is possible that a renewal model with infinite mean interval would be reasonable. Even so, unless the sample CVISI values (the sample standard deviation divided by sample mean) converge to a finite value as the amount of data included in the calculations is increased, these values are unfit as model constraints. However, to the best of our knowledge, this convergence analysis has not been carried out on spike trains recorded from cortical neurons.

4 Integrate-and-Fire Models with Renewal Point Process Inputs

4.1 The Integrate-and-Fire Models of Feng and His Coworkers. In the previous section, we showed that any model of cortical neurons that produces a renewal output cannot exhibit both CVISI values and LRcD properties that are similar to those seen in real cortical neurons. However, there are a number of high-variability models that produce output spike trains that are not renewal point processes (RPPs). One type of nonrenewal, high-variability model includes the models analyzed by Feng and his coworkers (Feng, 1997, 1999; Feng & Brown, 1998a, 1998b; Feng et al., 1998). These models are identical to the class of models considered in section 3, except that the inputs are not Poisson point processes. Instead, the inputs can be any other type of RPP, and hence the cumulative inputs consisting of the superposition of all excitatory inputs and the superposition of all inhibitory inputs are no longer memoryless. In particular, Feng and his coworkers considered RPP inputs with positive gaussian and Pareto interval distributions.
The former has a tail that decreases to zero faster than the exponential distribution (i.e., the interval distribution of a Poisson process), and the latter has a tail that decreases to zero more slowly than the exponential distribution. In general, they found that both longer-tailed input distributions and more closely balanced amounts of excitation and inhibition increased the CVISI of the output. This suggests that the CVISI of an IF model with RPP inputs can be within the physiologically realistic range, regardless of the interval distributions of the RPPs, if the ratio between the amounts of excitation and inhibition is properly adjusted. The necessary range of inhibition-excitation ratios, however, does depend on the interval distribution of the inputs. In particular, for long-tailed input distributions, less inhibition is required to produce realistic CVISI values, whereas for short-tailed input distributions, the amount of inhibition needs to be much closer to the amount of excitation (Feng & Brown, 1998a). The IF model with Poisson inputs raises enough analytical difficulties that we do not expect that complete analytical results can be obtained for the case of general RPP inputs. However, a few general observations can
be made. First, due to the integration mechanism, the likelihood of the occurrence of an output spike immediately following another output spike is low but will increase as the time since the last output spike increases. Thus, at small counting windows, we expect the counts to be negatively correlated. This will not, however, affect the correlation structure of the intervals, since the IF mechanism completely resets at the occurrence of each output spike. Therefore, the IF mechanism cannot by itself create memory in the model that is longer than the interspike intervals of the output. Hence, the medium- and long-term memory properties of the output must be governed by the inputs and the mechanisms by which they are combined. The combination of the inputs may be considered as two separate component processes: superposition and excitation-inhibition interaction. For Poisson inputs, the excitation-inhibition interaction can affect long-term memory by producing high interval variability (see section 3; Tuckwell, 1988), but with concomitant increases in the mean interval length. Balancing the amounts of excitation and inhibition will presumably have a similar type of effect when the inputs are RPPs. Any additional memory properties of the output, in particular those that are longer than the interspike intervals, must therefore originate in the superpositions of the input point processes. Thus, we should be able to gain some further intuition about the memory properties of the IF model with RPP inputs by considering the dependency structures of superpositions of RPPs.

4.2 Analytical Results for the Superposition of Renewal Point Processes.

4.2.1 The Positive Gaussian and Pareto Distributions. In the following, we will review some results relating the statistical properties of component RPPs to the properties of their superpositions and apply them for component RPPs with positive gaussian and Pareto interval distributions.
These are the two distributions considered by Feng and his coworkers (Feng, 1997, 1999; Feng & Brown, 1998a, 1998b; Feng et al., 1998), and they represent two different classes of interval distributions: those with superexponential tails and those with subexponential tails. A more complete treatment, with additional mathematical derivations, can be found in Jackson (2003, sect. 3.5.2). The positive gaussian distribution is an example of a distribution that has a support of [0, ∞) and a tail that is shorter than the exponential distribution. If X is a random variable with a gaussian, or normal, distribution and a mean of zero, then Y = |X| has a positive gaussian, or “folded” gaussian, distribution. The probability density function of the positive gaussian distribution is

f(t) = (2/(πμ)) exp(−t²/(πμ²)),  if t ≥ 0;  0, otherwise,   (4.1)
where μ > 0 is the expected value. Since its tail is shorter than that of the exponential distribution, which has a variance of μ² when its mean is μ, the positive gaussian distribution must have a variance that is less than μ². Specifically, its variance is (π/2 − 1)μ². The Pareto distribution is an example of a distribution that has a support of [0, ∞) and a tail that is longer than the exponential distribution. The probability density function of the Pareto distribution is

f(t) = αK^α (t + K)^(−α−1),  if t ≥ 0;  0, otherwise,   (4.2)
with parameters K > 0 and α > 0. K is essentially a normalization constant, and α determines the length of the tail of the Pareto distribution. The tail probability Pr{X > x} of a Pareto-distributed random variable decays as x^(−α), and only the moments of order less than α exist. The mean of the Pareto distribution, if α > 1, is K/(α − 1), while if α > 2, the variance is 2K²/[(α − 1)(α − 2)].

4.2.2 The Interval Distribution of the Superposition of Renewal Point Processes. It is well known that as the number of component RPPs increases, under proper normalization, their superposition approaches a Poisson process (Cox & Smith, 1954; Khintchine, 1960; Cox, 1967). Thus, as the number of inputs to the IF model increases, the model becomes more similar to that considered in section 3, where the output was LRD if and only if the variance of the intervals of the output was infinite. Clearly, then, as the number of RPP inputs increases, the model will become less and less likely to possess both a finite CVISI and LRD. The forms of the interpoint interval distribution and the serial dependence between the intervals in the superposition, for any fixed, finite number of inputs, will indicate the direction from which it approaches the Poisson process as more component processes are added. First, consider the interval distribution of the superposition process. Let G(t) be the marginal (cumulative) distribution function of the intervals of the superposition of p independent RPPs, each with intervals distributed according to F(t) with a mean of μ. Then the distribution functions of the components and the superposition are related by (Cox & Smith, 1954; Lawrance, 1973)

1 − G(t) = (1 − F(t)) [ (1/μ) ∫_t^∞ (1 − F(s)) ds ]^(p−1).   (4.3)
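Equation 4.3 is easy to check numerically in a case with a closed-form answer. A small sketch in Python (the function name `superposition_tail` is ours): for exponential component intervals with mean μ, that is, Poisson component processes, equation 4.3 reduces to 1 − G(t) = e^(−pt/μ), the interval tail of a Poisson process of rate p/μ, as expected for a superposition of Poisson processes.

```python
import numpy as np

def superposition_tail(comp_tail, t, mu, p, t_max=200.0, n=200_001):
    """Tail 1 - G(t) of the interval distribution of the superposition of p
    independent RPPs with component tail comp_tail(s) = 1 - F(s) and mean
    interval mu, per equation 4.3 (trapezoidal rule, truncated at t_max)."""
    s = np.linspace(t, t_max, n)
    y = comp_tail(s)
    integral = np.sum((y[:-1] + y[1:]) * np.diff(s)) / 2.0
    return comp_tail(t) * (integral / mu) ** (p - 1)

# Exponential component intervals (i.e., Poisson component processes):
mu, p, t = 2.0, 5, 1.0
tail = superposition_tail(lambda s: np.exp(-s / mu), t, mu, p)

# The superposition of p Poisson processes of rate 1/mu is a Poisson
# process of rate p/mu, so the interval tail should equal exp(-p*t/mu).
print(tail, np.exp(-p * t / mu))
```

Substituting the positive gaussian or Pareto tails for `comp_tail` reproduces the superposition tails derived next.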
From equations 4.1 and 4.3, we find that the tail of the interval distribution for a superposition of positive-gaussian RPPs is (Jackson, 2003, sect. 3.5.2)

1 − G(t) = erfc(t/(μ√π)) [ e^(−t²/(πμ²)) − (t/μ) erfc(t/(μ√π)) ]^(p−1),  for t ≥ 0,

where

erfc(x) = (2/√π) ∫_x^∞ e^(−s²) ds

is the complementary error function. Since the complementary error function is always between zero and one, the term subtracted inside the brackets is positive, and p is a positive integer, so

1 − G(t) ≤ e^(−(p−1)t²/(πμ²)).   (4.4)
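The superposition tail and the bound in equation 4.4 can be checked numerically. A minimal sketch (Python; `g_tail` and `bound` are our illustrative names): the tail should equal one at t = 0, decrease monotonically, and stay below the bound everywhere.

```python
import math

def g_tail(t, mu, p):
    """Tail 1 - G(t) for the superposition of p positive-gaussian RPPs
    with mean interval mu (the closed-form expression above)."""
    x = t / (mu * math.sqrt(math.pi))
    bracket = math.exp(-t * t / (math.pi * mu * mu)) - (t / mu) * math.erfc(x)
    return math.erfc(x) * bracket ** (p - 1)

def bound(t, mu, p):
    """Right-hand side of the bound in equation 4.4."""
    return math.exp(-(p - 1) * t * t / (math.pi * mu * mu))

mu, p = 1.0, 8
ts = [0.05 * k for k in range(200)]
vals = [g_tail(t, mu, p) for t in ts]
print(vals[0], vals[40], bound(ts[40], mu, p))
```

The faster-than-exponential decay of both `g_tail` and `bound` is what guarantees finite interval means and variances for the superposition.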
The right-hand side of equation 4.4 decreases faster than an exponential function. Thus, the tail of the interval distribution for the superposition process also decreases faster than an exponential function, as is the case for the component processes. In particular, this result implies that the intervals of the superposition process have finite means and variances. Using equations 4.2 and 4.3, we find that the tail of the interval distribution for a superposition of Pareto RPPs, when α > 1, is (Jackson, 2003, sect. 3.5.2)

1 − G(t) = K^((α−1)p+1) (t + K)^(−[(α−1)p+1]),  for t ≥ 0.   (4.5)

Comparing equation 4.5 to the tail of the interval distribution for the component processes,

1 − F(t) = K^α (t + K)^(−α),  for t ≥ 0,
we see that the superposition of p independent Pareto RPPs, each with parameters K and α > 1, has a Pareto interval distribution with parameters K and α′ = (α − 1)p + 1 > 1. Thus, the intervals in the superposition have finite variance as long as (α − 1)p + 1 > 2 or, equivalently, if p > 1/(α − 1). Hence, as α, the parameter for the component processes, approaches the value of one, an increasing number of inputs is required if their superposition is to have intervals with finite variance. Furthermore, if the intervals of the component processes have finite variance, that is, α > 2, then the intervals in their superposition do as well.

4.2.3 Short-Range Dependence Between the Intervals of the Superposition of Renewal Point Processes. The dependency structure of the intervals of the superposition of RPPs is much more difficult to derive than their distribution. However, Enns (1970) found that the covariance between the lengths of two adjacent intervals, τ_n and τ_{n+1}, in the superposition of p independent RPPs is

lim_{n→∞} Cov(τ_n, τ_{n+1}) = μ² I(p),   (4.6a)
where

I(p) = ∫_0^∞ (h(t) ∗ h(t)) (1 − H(t))^(p−1) dt − 1/p²,   (4.6b)

H(t) is the distribution of the forward recurrence time, with density

h(t) = (1 − F(t))/μ,

for each RPP, and the symbol ∗ denotes convolution. If there is only one component process, then I(1) = 0, which is consistent with the “superposition” process being an RPP. In addition, lim_{p→∞} I(p) = 0. This means that the superposition process approaches an RPP as the number of components is increased, in agreement with the result that the superposition approaches a Poisson process in this case. Under certain conditions, the hazard rate of the component RPPs can be used to easily determine the sign of I(p), and thus, by equation 4.6a, of the covariance of adjacent intervals in the superposition process. The hazard rate of an RPP having an interval distribution F(t), with probability density function f(t) = (d/dt)F(t), is

z(t) = f(t)/(1 − F(t)).
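Equation 4.6b can be sanity-checked in the one case where everything is available in closed form. For exponential intervals with mean μ, h(t) = e^(−t/μ)/μ, H(t) = 1 − e^(−t/μ), and (h ∗ h)(t) = t e^(−t/μ)/μ², so the integral evaluates to exactly 1/p² and I(p) = 0 for every p, as it must when the superposition of Poisson processes is again a Poisson process. A small numerical sketch in Python (the function name is ours):

```python
import numpy as np

def I_exponential(p, mu=1.0, t_max=60.0, n=600_001):
    """I(p) from equation 4.6b for exponential component intervals, using
    (h * h)(t) = t * exp(-t/mu) / mu**2 and 1 - H(t) = exp(-t/mu)."""
    t = np.linspace(0.0, t_max * mu, n)
    integrand = (t * np.exp(-t / mu) / mu**2) * np.exp(-(p - 1) * t / mu)
    dt = t[1] - t[0]
    integral = np.sum(integrand[:-1] + integrand[1:]) * dt / 2.0  # trapezoid
    return integral - 1.0 / p**2

for p in (1, 2, 5, 10):
    print(p, I_exponential(p))   # each value should be approximately zero
```

Replacing the exponential kernels with those of a positive gaussian or Pareto RPP gives the negative and positive covariances, respectively, discussed next.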
If the hazard rate of the component RPPs is monotone, then the sign of its derivative can be used to determine the sign of I(p) in equations 4.6. When the hazard rate of the inputs is monotone nondecreasing, then I(p) ≤ 0, and when it is monotone nonincreasing, then I(p) ≥ 0 (Enns, 1970). The exponential distribution, which has a constant hazard rate (i.e., both nondecreasing and nonincreasing), is the boundary case, yielding I(p) ≡ 0, and uncorrelated intervals in the superposition, which is consistent with the fact that the superposition of a number of Poisson processes is also a Poisson process. Thus, if the hazard rate of the inputs is monotone nondecreasing but also nonconstant, then I(p) < 0, and adjacent intervals in the superposition process are negatively correlated. If the hazard rate of the inputs is monotone nonincreasing and nonconstant, then I(p) > 0, and adjacent intervals in the superposition process are positively correlated. The hazard rate of an RPP with positive-gaussian-distributed intervals is monotone increasing (Jackson, 2003, appendix A). Hence, adjacent intervals in the superposition of such RPPs are negatively correlated. This is a common result for RPPs that are used to model real phenomena. If the hazard rate is increasing, then as time passes since the last event, it becomes more likely, per unit time, that the next event will occur. On the other hand, the hazard rate of a Pareto RPP (Jackson, 2003, sect. 3.5.2),

z(t) = α/(t + K),
is monotone decreasing for all allowable (i.e., positive) values of α and K, so adjacent intervals in the superposition of these RPPs are positively correlated. Thus, although the marginal interval distribution of the superposition of Pareto RPPs is also Pareto, the superposition process is not an RPP when there are multiple component processes.

4.2.4 Long-Range Dependence in the Superposition of Renewal Point Processes. We are particularly concerned with the variability and long-term memory properties of the output of the IF model with RPP inputs. As discussed earlier, LRD in the output of this model is likely to be produced by LRD in the superposition of the inputs. In order to evaluate the LRD of a superposition of independent RPPs, we can consider the variance function V(t) ≡ Var{N(0, t]}, which is the variance of the number of events in an interval of length t. For a single RPP with a finite-variance interval distribution, it is known (e.g., Cox & Smith, 1954; Cox, 1967; Feller, 1971) that

V(t) ∼ σ²t/μ³   as t → ∞,

where μ and σ² are, respectively, the mean and variance of the interval distribution. Thus, the variance function of the superposition of p independent, statistically identical RPPs is

V_S(t) ∼ pσ²t/μ³   as t → ∞,   (4.7)
where μ and σ² are the mean and variance of the component processes. Therefore, if the component processes have finite interval variances, and thus are not LRcD, then their superposition has finite variance and is not LRcD. Actually, the connection between the interval variance of the component processes and the LRcD of their superposition is even stronger than the preceding discussion indicates. Theorem 1 states that an RPP is LRcD if and only if its intervals have infinite variance. The proof given there can be readily adapted to obtain a similar result for the superposition of a finite number of independent, statistically identical RPPs.10

Theorem 2. Suppose that N_1(·), N_2(·), . . . , N_p(·) are p independent stationary RPPs, each with the same distribution function F of their generic interpoint interval random variable X, which has F(0) = 0 and finite mean μ = E{X}.11 Then

10 See Jackson (2003, theorem 3.4) for the proof of this result.
11 This theorem can easily be generalized to the case where the component processes have different interval distribution functions. In that case, the superposition process is LRcD if and only if one or more of the component processes has intervals with infinite variance. However, this generalization is unnecessary for the results of this study.
the superposition of these p RPPs, N_S(t) = Σ_{i=1}^{p} N_i(t), is LRcD if and only if E{X²} = ∞, that is, if and only if the variance of the intervals in the component processes is infinite.

Since any distribution with a tail that is shorter than the exponential distribution has finite variance, theorem 2 asserts that the superposition of RPPs with this kind of interval distribution is not LRcD. For component processes with intervals distributed according to a positive gaussian distribution, equation 4.7 implies that the asymptotic behavior of the variance function of the superposition process is given by

V_S(t) ∼ p(π/2 − 1)t/μ   as t → ∞.

Thus, in accordance with theorem 2, the limit of V_S(t)/t, as t → ∞, is a finite constant. More instructive is the asymptotic behavior of the FFC. For the superposition of positive gaussian RPPs, as the counting interval length increases, the Fano factor approaches

lim_{t→∞} F_S(t) = lim_{t→∞} V_S(t)/(pt/μ) = π/2 − 1 ≈ 0.5708 < 1,
which is independent of the number of component processes, p, and the parameter μ of their interval distribution. The fact that the limiting value of the Fano factor is less than one is correlated with the prior finding that (at least) adjacent intervals are negatively correlated in the superposition. The Pareto distribution has finite variance only if α > 2. Thus, according to theorem 2, the superposition of Pareto RPPs will be LRD if and only if α ≤ 2. For component processes with intervals distributed according to a Pareto distribution with α > 2, equation 4.7 implies that the asymptotic behavior of the variance function of the superposition process is given by

V_S(t) ∼ p [2K²/((α − 1)(α − 2))] t / [K/(α − 1)]³ = 2p(α − 1)²t / [K(α − 2)]   as t → ∞.

Thus, as expected, the limit of V_S(t)/t, as t → ∞, is a finite constant when α > 2. Also, for the superposition of Pareto RPPs, with α > 2, the Fano factor approaches

lim_{t→∞} F_S(t) = lim_{t→∞} V_S(t)/(pt(α − 1)/K) = 2(α − 1)/(α − 2),
which, as in the case of positive gaussian RPPs, is independent of the number of component processes, p. However, in this case, the limiting value of the Fano factor is dependent on a parameter, α, of the distribution. But for all
α > 2, the limiting value of the Fano factor is greater than one, which is correlated with the fact that (at least) adjacent intervals in the superposition of Pareto RPPs are positively correlated. In fact, this value is always greater than two for α > 2, approaching two as α → ∞ and growing without bound as α → 2.

4.3 Simulation Results for the IF Model with Renewal Point Process Inputs. The analytical results of section 4.2, by describing the superposition of the inputs, provide some clues to the properties that should be expected of the output of the IF model with RPP inputs. In addition, the arguments in section 4.1 furnish additional insight into characteristics that the output is likely to possess. However, a complete analytical treatment of this model, save when the inputs are Poisson processes, is not available. Therefore, in order to test the ability of IF models with renewal point process inputs to produce realistic CVISI values and LRcD, we have run computer simulations of these models. For each model and set of parameters, we ran 10 simulations, each with a duration of 100,000 seconds. Each simulation used 100 excitatory inputs and 100r inhibitory inputs, where r was one parameter of the simulations that was varied. Thus, r represents the ratio of the number of inhibitory inputs to the number of excitatory inputs. The value of 100 was chosen to match the work of Feng and his coworkers. Moreover, this is a relatively large number of inputs but should not produce the trivial situation where the superpositions of the RPPs are nearly Poisson processes. If the results from the RPP inputs matched those from the Poisson process inputs, then this number would have to be reduced. The IF mechanism had a reset potential of zero and a threshold of unity, and each spike that occurred in either an excitatory or inhibitory input caused the potential to instantaneously increase or decrease, respectively, by 1/40 = 0.025.
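The simulation scheme just described can be sketched compactly for the Poisson-input case (a minimal Python illustration, not the author's simulation code; parameter and function names are ours). Pooled input events arrive at the combined rate, each event moves the potential by ±0.025, and an output spike is registered at threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_if_poisson(r, n_events=400_000, psp=0.025, theta=1.0):
    """IF neuron driven by 100 excitatory and 100*r inhibitory Poisson
    inputs; returns the output interspike intervals (ISIs)."""
    rate_in = 2.5 / (100 * psp * (1 - r))      # per-input rate, equation 4.8
    total_rate = 100 * rate_in * (1 + r)       # pooled input event rate
    # Pooled input: i.i.d. exponential waits; each event is excitatory
    # with probability 1/(1+r), inhibitory otherwise.
    waits = rng.exponential(1.0 / total_rate, n_events)
    steps = np.where(rng.random(n_events) < 1.0 / (1 + r), psp, -psp)
    v, t, last_spike, isis = 0.0, 0.0, 0.0, []
    for w, s in zip(waits, steps):
        t += w
        v += s
        if v >= theta:                         # threshold crossing
            isis.append(t - last_spike)
            last_spike, v = t, 0.0             # reset
    return np.array(isis)

r = 0.5
isis = simulate_if_poisson(r)
cv = isis.std() / isis.mean()
cv_theory = np.sqrt((1 + r) / (40 * (1 - r)))  # equation 4.9, theta_hat = 40
print(cv, cv_theory)
```

With a few hundred thousand input events, the estimated CVISI should agree with the closed-form value from equation 4.9 to within a few percent.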
When the potential reached the threshold value, an output spike was registered, and the potential was reset. The value of the appropriate parameter of the interval distribution of the inputs was chosen to yield a nominal output rate of 2.5 spikes per second according to the following formula:

input rate = output rate / [(# of excitatory inputs)(postsynaptic potential)(1 − r)] = 2.5 / [100 · 0.025 · (1 − r)],   (4.8)

where postsynaptic potential is the amount by which input spikes increased or decreased the potential in the IF mechanism. If possible, the interval from time zero to the first spike in each input process was chosen according to the distribution of the forward recurrence time. This ensured that each input was stationary (or at equilibrium). If this was not possible, then a pseudostationary state was created by generating a random interval length from the
interval distribution and choosing the origin to be a uniformly distributed point within this interval. We analyzed the output spike trains of these models using estimators of the CVISI, the FFC, and the IDC. The IDC required calculation of the sample variance of the length of, say, M adjacent intervals for many values of M. Since the IDC was plotted on a double logarithmic graph, we began with a value of M = 1 and used 10 values of M per decade, which were as equally spaced in logarithmic coordinates as possible, given that M has to be an integer. No M values greater than one-fifth of the total number of interspike intervals in the output of the simulation were used. Similarly, the FFC required calculation of the sample variance of the counts in windows of size T, say, for many values of T. The values of T ranged from 0.01 second to one-fifth of the total simulation length, with 10 logarithmically spaced values per decade. Finally, the interspike intervals of the output spike train from each of the 10 simulations for a given model and set of parameters were randomly shuffled once to produce a set of 10 surrogate spike trains. The sample variances of the counts and aggregated intervals were then calculated, in the same manner as before, for these new spike trains in order to produce the surrogate data FFCs and IDCs.

4.3.1 Simulations of the IF Model with Poisson Process Inputs. For comparison, we ran simulations of the IF model with Poisson process inputs with the same parameters as our other simulations. For the IF model with Poisson process inputs, it is possible to derive, in closed form, the theoretical value of the CVISI of its output as a function of the inhibition-excitation ratio, r. Using equation 3.5, we find that this function is

CVISI(r) = √[(1 + r)/(θ̂(1 − r))],   (4.9)

where

θ̂ = (1 − 0)/(1/40) = 40

for our simulations. This function is plotted as a dashed line in each graph of Figure 2. In Figure 2a, the values of CVISI estimated from 10 simulations of the Poisson input model at each value of r are individually plotted. However, since the variability of these estimates across simulations is so low, the 10 symbols at each value of r fall almost exactly on top of each other. It is apparent from this graph that the results of our simulations agree well with the theoretical values of CVISI for this model.
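The Fano factor curve and shuffled-surrogate procedure described above can be sketched as follows (a simplified Python illustration under our own naming, not the analysis code used in the study). For a homogeneous Poisson spike train, both the original and the interval-shuffled surrogate train should give Fano factors near one at every counting-window size:

```python
import numpy as np

rng = np.random.default_rng(7)

def fano_factor_curve(spike_times, windows):
    """Sample Fano factor (count variance / count mean) per window size."""
    duration = spike_times[-1]
    ffc = []
    for w in windows:
        edges = np.arange(0.0, duration, w)
        counts, _ = np.histogram(spike_times, bins=edges)
        ffc.append(counts.var() / counts.mean())
    return np.array(ffc)

def shuffle_surrogate(spike_times):
    """Surrogate train built by randomly permuting the interspike intervals."""
    isis = rng.permutation(np.diff(spike_times))
    return np.concatenate(([spike_times[0]], spike_times[0] + np.cumsum(isis)))

# Homogeneous Poisson train at 2.5 spikes/s (roughly 40,000 s of data).
train = np.cumsum(rng.exponential(1 / 2.5, 100_000))
windows = [0.1, 1.0, 10.0]
ffc = fano_factor_curve(train, windows)
ffc_surr = fano_factor_curve(shuffle_surrogate(train), windows)
print(ffc, ffc_surr)    # all entries should be close to one
```

Shuffling destroys any serial dependence between intervals while preserving their distribution, which is why differences between original and surrogate curves isolate interval correlations.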
[Figure 2 appears here: two panels, (a) and (b), each plotting the coefficient of variation against the inhibition-excitation ratio r. Legends: (a) Simulations, Theory; (b) Gaussian Renewal Inputs, Mean for Gaussian Inputs, Poisson Process Inputs.]

Figure 2: The coefficient of variation of interpoint intervals as a function of the inhibition-excitation ratio r for the output of an integrate-and-fire model with inputs that are (a) Poisson point processes and (b) positive gaussian distributed renewal point processes. There were 10 simulation runs at each value of r, and the calculated value of the coefficient of variation for each of these runs is plotted with a symbol. The dashed line in each graph is the theoretical curve for the model with Poisson process inputs, and the solid line in b connects the means calculated at each value of r. For each value of r, the parameter controlling the rate of the inputs was set so that the nominal rate at the output was 2.5 spikes per second.
4.3.2 Simulations of the IF Model with Positive Gaussian Renewal Inputs. For simulations of the IF model with positive gaussian RPP inputs, the sole parameter of the interval distribution, the mean µ, was set to be the inverse of the nominal input rate from the calculation in equation 4.8. Thus, the only parameter that we varied in these simulations was the inhibition-excitation ratio r. For this model, we ran simulations with r-values of 0.0, 0.5, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.97, 0.98, and 0.99. In Figure 2b, the calculated value of CVISI for each simulation is plotted versus the value of r. Again, at each location, there are actually 10 symbols, but the low variability of the estimates causes the symbols to fall almost exactly on top of each other at all but the highest values of r. The solid line connects the means of the 10 estimates of CVISI at each value of r, and the dashed line is the theoretical curve for the Poisson input model. These results are in close agreement with those of Figure 2a of Feng and Brown (1998a). As we expect, due to the negative correlation in the superpositions of the inputs, the CVISI values for the positive gaussian model are always less than the corresponding values for the Poisson model. However, the CVISI for the positive gaussian model does increase, apparently without bound, as r approaches one, in the same manner as that for the Poisson model. In particular, equation 4.9 specifies that the CVISI function with respect to r for the Poisson input model is the square root of a rational function with a pole
at one, and the increasing, convex shape of the CVISI curve for the gaussian input model is well approximated by the same type of function. Thus, as observed by Feng and Brown (1998a), the positive gaussian model requires more balance than the Poisson model in order to achieve any particular CVISI for its outputs. Only for values of r greater than about 0.9 is the CVISI greater than 0.5, which is the minimal value needed to match physiological data.

Although the positive gaussian model may be able to achieve any arbitrary value of CVISI, albeit perhaps at the cost of a high degree of excitation-inhibition balance, its success as a model of cortical neurons quickly diminishes when the possibility of LRD is considered. Figure 3 contains examples of FFCs and IDCs for the output of the IF model with positive gaussian renewal inputs for several representative values of the inhibition-excitation ratio r; the FFCs and IDCs for all of our simulations may be found in Jackson (2003, appendix B). The sub-Poissonian variability of the interspike intervals in the output caused by the integration mechanism and the negative correlation in the superpositions of the inputs is clearly evident in the general downward trend of both the original data and surrogate data FFCs and their asymptotic approach to a constant less than one for low r-values. Furthermore, as r is increased (i.e., as the amount of inhibition is brought into balance with the amount of excitation), the effect of the excitation-inhibition interaction is apparent in the rising value of the asymptotes of the FFCs. At a value of about r = 0.97 (not shown in Figure 3), the excitation-inhibition interaction completely cancels the two variance-decreasing effects, and the intervals of the output of the model have Poisson-like variability. Thus, for larger values of r, the asymptotic values of the FFCs are greater than one. Even at these large r-values, however, the FFCs are still below one for small counting intervals.
For instance, the FFCs for r = 0.99 do not exceed one until the counting interval is about 0.1 second in length. A small amount of negative correlation between the interspike intervals of the output is also evident at low r-values in the differences between the original data FFCs and surrogate data FFCs, as well as in the difference between the original data IDCs and surrogate data IDCs. The fact that the original data FFCs are lower than the surrogate data FFCs implies that the intervals in the original data must be negatively correlated, since the intervals in the surrogate data are uncorrelated but have the same distribution as the original data intervals. This argument is supported by the original data IDCs, which are decreasing at low aggregation levels and are below the corresponding surrogate data IDCs. Since this negative correlation in the output intervals is apparent (and, in fact, its effect is largest) at r = 0, where there is no excitation-inhibition interaction, it must be the result of the negative correlation between intervals in the superposition of the inputs. These effects of the negative correlation, however, gradually disappear as r increases. Thus, for r greater than about 0.9, the original data curves and the surrogate data curves are nearly equivalent, implying that the output
Long-Range Dependence in Models of Cortical Variability

[Figure 3 panels: for inhibition-excitation ratios r = 0.00, 0.50, 0.95, and 0.99, paired log-log plots of the Fano factor versus counting interval (sec) and the index of dispersion of intervals versus number of aggregated intervals.]
Figure 3: FFCs and IDCs estimated from simulations of the IF model with positive gaussian inputs. Each set of axes contains 10 curves calculated from original data (black) and 10 curves calculated from the corresponding shuffled surrogate data (gray). For each value of the inhibition-excitation ratio r, each individual FFC in the left set of axes was calculated from the same data as one of the IDCs in the right set of axes. Also, for each value of r, the parameter µ of the positive gaussian distribution, which controls the rate of the inputs, was set so that the nominal rate at the output was 2.5 spikes per second.
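The FFCs shown in the figures of this section are estimated by partitioning a spike train into disjoint counting windows of increasing length and computing the count variance-to-mean ratio at each length. A minimal sketch follows; the Poisson train, rate, and window lengths below are illustrative, not the model output itself:

```python
import numpy as np

def fano_factor_curve(spike_times, durations):
    """Fano factor (count variance / count mean) of a spike train,
    estimated for each counting-interval length in `durations`."""
    t_end = spike_times[-1]
    ffc = []
    for T in durations:
        # Partition the recording into disjoint windows of length T
        # and count spikes in each window.
        n_win = int(t_end // T)
        counts = np.histogram(spike_times, bins=np.arange(n_win + 1) * T)[0]
        ffc.append(counts.var() / counts.mean())
    return np.array(ffc)

# Illustration: a homogeneous Poisson train has a Fano factor near 1
# at every timescale.
rng = np.random.default_rng(0)
spikes = np.cumsum(rng.exponential(1.0 / 2.5, size=20000))  # ~2.5 spikes/s
durations = np.logspace(-1, 2, 7)                           # 0.1 s to 100 s
vals = fano_factor_curve(spikes, durations)
print(vals)   # all entries close to 1 for a Poisson train
```

Departures from a flat curve at one, in either direction, are exactly what the FFCs in these figures diagnose.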
B. Jackson
of the model is roughly a renewal process for closely balanced amounts of excitation and inhibition. Several additional characteristics of the FFCs and IDCs in Figure 3 are consistent with theoretical predictions. First, the surrogate data IDCs are all close to being horizontal lines, and the mean surrogate data IDC at each r-value is, for all practical purposes, a horizontal line. This is to be expected, since surrogate data form a renewal process. Next, the value of each FFC at the smallest counting intervals is nearly one, in agreement with the theoretical result mentioned in section 2.6.3, and each FFC has the same asymptotic value as the corresponding IDC, as predicted by equation 2.5. This also means that the vertical position of the IDCs increases with increasing r.

From this analysis, it is apparent that the positive gaussian input IF model is no better at producing high interval variability and LRcD than the Poisson input IF model. Not only does this model require perfect excitation-inhibition balance in order to produce LRcD, but at any given level of balance, the gaussian input model is further from being LRcD than the Poisson input model. Furthermore, if the level of balance is high enough to produce values of CVISI above 0.5 in the gaussian input model, then the output of this model is nearly renewal, which is precisely the characteristic that was used to expose the inadequacy of the Poisson input model. Moreover, since the gaussian input model will more closely approximate the Poisson input model as more inputs are included, increasing the number of inputs will never completely overcome the shortcomings of the gaussian input model that we have found. In sum, the positive gaussian input IF model does not solve the high-variability problem while producing LRD and is, in fact, worse in this regard than the more tractable Poisson-input model.

4.3.3 Simulations of the IF Model with Pareto Renewal Inputs.
The IF model with Pareto RPP inputs has three parameters: the inhibition-excitation ratio r, the “tail” parameter α, and the normalization constant K. For any given values for r and α > 1, the value of K was determined by setting the mean of the Pareto distribution equal to the inverse of the input rate determined by equation 4.8. Thus, we set K = (α − 1)/input rate if α > 1. When 0 < α ≤ 1, the Pareto distribution does not have a mean, and, hence, the previous calculations are not justified. α = 1 was the only value in this range that we used, and in this case we set K = 0.1/input rate, where input rate was still determined by equation 4.8, which is the same value that we would have calculated for K when α = 1.1. We found empirically that this value yielded an output spike rate in the vicinity of 2.5 spikes per second for our simulations. Thus, for the Pareto input model simulations, we varied two parameters: α and r. For α, we used values of 1.0, 1.25, 1.5, 1.75, 1.9, 2.0, 2.1, 2.5, and 3.0, and for r, we used values of 0.0, 0.5, 0.7, 0.8, 0.9, and 0.95. Ten simulations were run for each combination of these parameter values. For comparison, Feng and his coworkers (Feng, 1997; Feng & Brown, 1998b, 1998a; Feng et al., 1998) used only values of α = 1.0 and α = 2.1 in their simulations.
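As a sketch of this setup, the following assumes a Pareto distribution of the second kind (Lomax form) with survival function (1 + x/K)^(−α) and mean K/(α − 1) for α > 1, which is consistent with the choice K = (α − 1)/input rate quoted above; the paper's exact parameterization is defined earlier in the article, so treat this as illustrative:

```python
import numpy as np

def pareto_scale(alpha, input_rate):
    """Normalization constant K chosen so the mean input interval is
    1/input_rate (finite only for alpha > 1).  For alpha = 1 the text's
    fallback is used: the K one would compute for alpha = 1.1."""
    if alpha > 1.0:
        return (alpha - 1.0) / input_rate
    return 0.1 / input_rate

def sample_intervals(alpha, K, size, rng):
    # Inverse-CDF sampling for F(x) = 1 - (1 + x/K)**(-alpha), x >= 0.
    u = rng.uniform(size=size)
    return K * ((1.0 - u) ** (-1.0 / alpha) - 1.0)

rng = np.random.default_rng(1)
alpha, rate = 2.5, 40.0
K = pareto_scale(alpha, rate)             # (2.5 - 1)/40 = 0.0375
ivals = sample_intervals(alpha, K, 200000, rng)
print(K, ivals.mean())                    # sample mean near 1/rate = 0.025
```

A renewal input train is then just the cumulative sum of such intervals; decreasing α fattens the interval tail while the mean stays pinned at 1/input rate.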
Besides just spanning a wide range, our additional values for α were chosen for two specific reasons. First, we wanted to use some values of α between 1.0 and 2.0, exclusive, where the Pareto distribution has infinite variance but still has a finite mean. Second, we wanted to use some values that were significantly greater than two. Although at the value of α = 2.1, the Pareto distribution does have finite mean and variance, this value is close enough to 2.0 that the differences between finite and infinite variance are not going to be very obvious. Figure 4 shows the estimated values of CVISI from these simulations plotted versus the inhibition-excitation ratio r. The graphs in each column contain the same data, only at different scales. Column a contains graphs of the mean values of CVISI for each combination of parameter values for r and α. Thus, each symbol represents the mean calculated over 10 simulations. Means calculated from simulations with the same value of α are connected by solid lines, while the dashed line is the theoretical curve for the Poisson input model. The curves all have the same increasing, convex shape, which can be well fit by the square root of a rational function with a pole at r = 1. Also, as we expect due to the positive correlations in the superpositions of the inputs, the CVISI values for the Pareto model are always greater than those for the Poisson model. Thus, in accord with the findings of Feng and Brown (1998a), the Pareto model requires less balance than the Poisson model to achieve any particular CVISI of its output. Furthermore, as α is decreased, a particular value of CVISI can be attained with less balance between the excitation and inhibition. For example, at low values of α, the CVISI is actually significantly larger than physiological measurements in the cortex at even moderate degrees of balance. Column b of Figure 4 shows, for a subset of the values of α, the CVISI estimates for all simulation runs. 
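The CVISI values plotted in Figure 4 are sample coefficients of variation of the interspike intervals from each run; the estimator is simply the ISI standard deviation divided by the ISI mean. A sketch with illustrative trains (not the Pareto model itself):

```python
import numpy as np

def cv_isi(spike_times):
    """Coefficient of variation of the interspike intervals:
    ISI standard deviation divided by ISI mean."""
    isis = np.diff(spike_times)
    return isis.std(ddof=1) / isis.mean()

# A Poisson (exponential-ISI) train has CV_ISI near 1; a perfectly
# regular train has CV_ISI near 0.
rng = np.random.default_rng(2)
poisson_train = np.cumsum(rng.exponential(0.4, size=50000))
regular_train = np.arange(0.0, 100.0, 0.4)
print(cv_isi(poisson_train), cv_isi(regular_train))
```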
The solid lines are identical to those in column a, connecting the mean for each set of 10 simulation runs. Although at low r values and high α values the symbols are tightly grouped, like the CVISI estimates for the positive gaussian model, the variance of these estimates of CVISI increases significantly as r increases and as α decreases. The most dramatic effect is seen for medium to high values of r and low values of α. These increases in estimator variance are almost certainly caused by the progressive movement of the model toward a state of nonstationarity as r or α approach one. We saw in section 3 that when r is equal to one, the mean of the intervals in the output of the Poisson input model is infinite, and no point process with infinite mean intervals can be stationary. This is likely to be the case with the present model as well. On the other hand, when α equals one, the input processes have infinite mean intervals, also forcing the model to be nonstationary. Comparing Figure 4 with Figure 2a of Feng and Brown (1998a), we see that our results for α = 2.1 are in good agreement with their corresponding results. In contrast, our CVISI values for α = 1 are substantially larger than those in their Figure 2b. The different values of K used can explain some,
[Figure 4 panels: coefficient of variation versus inhibition-excitation ratio r (0.0 to 1.0), shown at three vertical scales in each of columns a and b; legend: α = 3.0, 2.5, 2.1, 2.0, 1.9, 1.75, 1.5, 1.25, 1.0, and Poisson.]
Figure 4: The coefficient of variation of interpoint intervals as a function of the inhibition-excitation ratio r for the output of an IF model with inputs that are Pareto-distributed renewal point processes. For each combination of the parameters, r and the “tail” parameter α of the Pareto distribution, 10 separate simulation runs were conducted with the parameter K of the Pareto distribution set so that the nominal rate at the output was 2.5 spikes per second. (Column a) The mean values of the estimates are plotted for each combination of parameters. Each solid line connects the means for a single value of α, and the dashed line is the theoretical curve for the model with Poisson process inputs. All three sets of axes contain the same data, but each displays them on a different vertical scale. (Column b) The estimates calculated from each individual simulation run are plotted for a subset of the values of α. Solid and dashed lines are the same as in column a, and all three sets of axes contain the same data.
but not all, of this disagreement. In simulations with K set to one in order to match the value used by Feng and Brown (1998a), our CVISI values were closer to, but still larger than, their results. For instance, at r = 0.9, the most balanced condition that they used, they have CVISI ≈ 3; our simulations for Figure 4 resulted in CVISI ≈ 80, and our simulations with K = 1 produced CVISI ≈ 8.3. Since Feng and Brown (1998a) do not give many details regarding their simulations, it is difficult to determine the source of the remaining difference between their results and ours. We speculate that their lower values may be the result of either a shorter simulation duration or different starting conditions for the simulation of this nonstationary process. Figures 5 through 7 contain FFCs and IDCs calculated from simulated output of the IF model with Pareto RPP inputs for several different representative combinations of values for the parameters α and r. The curves for α = 2.5, α = 1.75, and α = 1.0 are shown in Figures 5, 6, and 7, respectively. For each of these values of α, FFC and IDC sets are shown for each of four values of r ranging from zero to 0.90 or 0.95. The complete set of FFCs and IDCs for all of our simulations may be found in Jackson (2003, appendix B). These graphs possess some common characteristics across all parameter values. First, as should be the case in all situations, the FFCs approach the value of one at very small counting interval lengths, and the mean surrogate data IDCs are horizontal lines. Second, like those of the positive gaussian model, the FFCs have an initially negative slope as a result of the relative unlikelihood of the occurrence of very short intervals, which is caused by the integration mechanism. Third, unlike those of the positive gaussian model, the original data IDCs generally have an initial positive slope and remain above the corresponding surrogate data IDCs. 
This is a result of the positive correlation between intervals in the superposition of Pareto renewal processes, in contrast to the negative correlation in the superposition of positive gaussian renewal processes. These positively correlated intervals also produce positive slopes in the original data FFCs subsequent to their initial decline from one. This produces original data FFCs that have an asymptotic value larger than one, except when the value of α is large and the value of r is small. In the latter case, the positive correlation in the superpositions, combined with the variance-producing effect of the excitation-inhibition interaction, is not strong enough to overcome the effect of the integration process. Like those for the positive gaussian model, the original data and corresponding surrogate data curves for the Pareto model become progressively more similar as the excitation and inhibition are brought into better balance, that is, as r is increased. This means that the surrogate data FFCs change more quickly with increases in r than do the original data FFCs and that the original data IDCs better approximate horizontal lines as r increases. In the current model, however, this is the result of a loss of positive, not negative, correlation in the output intervals. Nevertheless, this loss of correlation implies that the output of the model resembles a renewal point process at values of r close to one.
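The surrogate-data comparison used throughout this section can be reproduced in a few lines: shuffling the interspike intervals destroys their ordering (and hence any serial correlation) while preserving their distribution, so a gap between original and shuffled dispersion curves indicates correlated intervals. A sketch using the index of dispersion of intervals, with an artificial rate-modulated train standing in for model output:

```python
import numpy as np

def idc(isis, m_values):
    """Index of dispersion of intervals: Var(sum of m consecutive ISIs)
    divided by m times the squared mean ISI.  For a renewal process this
    is constant in m (and equals CV_ISI squared)."""
    mu = isis.mean()
    out = []
    for m in m_values:
        n = len(isis) // m
        sums = isis[: n * m].reshape(n, m).sum(axis=1)
        out.append(sums.var() / (m * mu ** 2))
    return np.array(out)

rng = np.random.default_rng(3)
# Artificial positively correlated intervals: exponential ISIs modulated
# by a slowly wandering (random-walk) factor.
drive = np.cumsum(rng.normal(0.0, 0.02, size=100000))
isis = np.exp(0.05 * drive) * rng.exponential(1.0, size=100000)
shuffled = rng.permutation(isis)          # surrogate: same ISIs, no ordering
m_levels = [1, 10, 100, 1000]
orig = idc(isis, m_levels)
surr = idc(shuffled, m_levels)
print(orig)   # rises with m: correlated intervals
print(surr)   # roughly flat: renewal-like surrogate
```

The rising original curve against a flat surrogate curve is the signature of positive interval correlation described in the text.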
[Figure 5 panels (α = 2.50): for inhibition-excitation ratios r = 0.00, 0.50, 0.80, and 0.95, paired log-log plots of the Fano factor versus counting interval (sec) and the index of dispersion of intervals versus number of aggregated intervals.]
Figure 5: FFCs and IDCs estimated from simulations of the IF model with Pareto inputs and parameter α = 2.5. Each set of axes contains 10 curves calculated from original data (black) and 10 curves calculated from the corresponding shuffled surrogate data (gray). For each value of the inhibition-excitation ratio r, each individual FFC in the left set of axes was calculated from the same data as one of the IDCs in the right set of axes. Also, for each value of r, the parameter K of the Pareto distribution was set so that the nominal rate at the output was 2.5 spikes per second.
[Figure 6 panels (α = 1.75): for inhibition-excitation ratios r = 0.00, 0.50, 0.80, and 0.95, paired log-log plots of the Fano factor versus counting interval (sec) and the index of dispersion of intervals versus number of aggregated intervals.]
Figure 6: FFCs and IDCs estimated from simulations of the IF model with Pareto inputs and parameter α = 1.75. Curves for original data are shown in black, and those for surrogate data are shown in gray. The parameter K of the Pareto distribution was set so that the nominal rate at the output was 2.5 spikes per second for each simulation.
[Figure 7 panels (α = 1.00): for inhibition-excitation ratios r = 0.00, 0.50, 0.70, and 0.90, paired log-log plots of the Fano factor versus counting interval (sec) and the index of dispersion of intervals versus number of aggregated intervals.]
Figure 7: FFCs and IDCs estimated from simulations of the IF model with Pareto inputs and parameter α = 1.0. Curves for original data are shown in black, and those for surrogate data are shown in gray. The parameter K of the Pareto distribution was set so that the nominal rate at the output was 2.5 spikes per second for each simulation.
In section 4.1, we mentioned two component processes to the combination of the inputs in the IF model, each of which could possibly produce LRD. First, LRD may be produced through the balance of excitation and inhibition. In the Poisson input model, this could occur, but since the output of the model was a renewal process, the inhibition-excitation balance also created output intervals with infinite variance. Our simulations suggest that the balance of excitation and inhibition has a similar effect on the Pareto-input model, the output of which becomes more RPP-like with increasing r, and the CVISI of which apparently increases without bound as r increases. Thus, the balance of excitation and inhibition is also unable to produce a realistic form of LRD in this model. The second way that LRD can be present in the output of the RPP-input IF model is if the superpositions of the inputs are LRD. According to theorem 2, this will occur when, and only when, the intervals of the input have infinite variance. For Pareto RPP inputs, this condition is met when α ≤ 2. From our simulations, it is clear that when α ≤ 2, the output of the model is LRD, and the nature of the LRD is different from that produced by inhibition-excitation balance. Increasing the value of r tends to pull the IDC closer to horizontal, while increasing the variability of the inputs (i.e., decreasing α) tends to increase the slope of the IDC. Thus, while more balance reduces the dependence between the intervals of the output of the model, making the output process more similar to an LRcD RPP (with infinite interval variance), lowering the value of α strengthens the interval dependence, ultimately producing an LRcD process with LRiD and finite interval variance. These two different effects can be seen as well in the combination of original data and surrogate data FFCs. As the excitation and inhibition become more balanced, the difference between these two FFCs disappears. 
And as the variability of the inputs increases, this difference becomes larger, primarily due to the increasing slope of the original data FFC. This trend may break down as α approaches one, where the mean interspike interval of the input processes becomes infinite. In general, then, our simulations of the IF model with Pareto RPP inputs show that with proper adjustment of the parameters r and α, this model can produce outputs that simultaneously exhibit both LRD and high interspike interval variability like that found in cortical neurons. More specifically, from the data plotted in Figure 4, we see that the CVISI of the output of this model is between 0.5 and 1.5, a typical range of CVISI estimated from recordings of cortical neurons, for the ranges of r shown in Table 2 for each value of α used in our simulations. Combining this information with comparisons between the FFCs obtained from simulations of this model, examples of which are shown in Figures 5 through 7, and the FFCs obtained from physiological recordings that are shown in Teich et al. (1996), we can evaluate the ability of the Pareto input model to match the variability and LRD of in vivo cortical neurons.
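The IF mechanism underlying these simulations can be caricatured as a counter: each excitatory input event increments the membrane state, each inhibitory event decrements it, and an output spike is emitted (with reset) when the state reaches threshold. The sketch below uses Poisson inputs for brevity (swapping in Pareto intervals, as in section 4.3.3, only changes how the interval arrays are drawn); the threshold, rates, and clamping of the state at zero are illustrative choices, not necessarily those of the model as specified earlier in the article:

```python
import numpy as np

def if_counter(exc_times, inh_times, threshold=20):
    """Merge excitatory (+1) and inhibitory (-1) event trains in time
    order; emit an output spike and reset whenever the running state
    reaches `threshold`.  The state is clamped at zero (a common, but
    not universal, modeling choice)."""
    times = np.concatenate([exc_times, inh_times])
    signs = np.concatenate([np.ones(len(exc_times)),
                            -np.ones(len(inh_times))])
    order = np.argsort(times)
    out, state = [], 0
    for t, s in zip(times[order], signs[order]):
        state = max(0, state + s)
        if state >= threshold:
            out.append(t)
            state = 0
    return np.array(out)

rng = np.random.default_rng(4)
r = 0.8                                   # inhibition-excitation ratio
exc = np.cumsum(rng.exponential(1 / 1000.0, size=200000))
inh = np.cumsum(rng.exponential(1 / (r * 1000.0), size=160000))
spikes = if_counter(exc, inh)
isis = np.diff(spikes)
print(len(spikes), isis.std() / isis.mean())   # output count and CV_ISI
```

Raising r toward one shrinks the net drift toward threshold, which is how balance inflates the output interval variability.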
Table 2: Range of the Inhibition-Excitation Ratio r, for Each Value of α, That Leads to Values of CVISI Between 0.5 and 1.5 for the IF Model with Pareto Renewal Process Inputs.

α      Range of r
3.0    0.75 < r < 0.95
2.5    0.70 < r < 0.93
2.1    0.68 < r < 0.91
2.0    0.66 < r < 0.90
1.9    0.64 < r < 0.89
1.75   0.61 < r < 0.86
1.5    0.57 < r < 0.82
1.25   0.47 < r < 0.75
1.0    0.20 < r < 0.56

Note: The ranges of r were approximated from plots of CVISI versus r (see Figure 4) obtained from simulations of the model.
For 2.0 < α ≤ 3.0, when the interval variance of the inputs is finite, a value of r between about 0.7 and 0.9 is required to match physiological CVISI estimates. For r values near 0.9, these FFCs resemble those of cells 4 and 7 in Teich et al. (1996), although neither the simulation nor the physiological results suggest the presence of LRcD. At somewhat lower α-values, between 1.5 and 2.0, the simulation FFCs with r at the high end of their CVISI-matching range tend to resemble those of cells 3 and 19 of Teich et al. (1996), which do suggest the presence of LRcD. At higher values of r, the FFCs still retain a resemblance to cells 3 and 19 of Teich et al. (1996), but their CVISI values are too large. Of course, if r were made close enough to 1.0, the output of the model would eventually be too renewal-like and no longer similar to the results from these cells. For values of α close to 1.0, the output of the model is still LRcD, but the surrogate data FFCs' asymptotes are too low for r-values that lead to physiological values of CVISI. Slightly larger values of r, however, lead to FFCs resembling those of cells 3 and 19 of Teich et al. (1996), but produce too large a CVISI value. In addition, as α approaches one, the inputs approach a condition of nonstationarity, which results in a significant increase in the variability of our estimates of the Fano factor and the index of dispersion of intervals. This can be seen in the increased spread of the curves for α = 1.0 in Figure 7. Although the Pareto input model is able to produce outputs that share common statistical features with the spike trains of cortical neurons, the type of inputs required is not justified physiologically. In order to produce LRD in the output of the model, the inputs must have infinitely variable
intervals, which is not the case for neurons that might serve as inputs to cortical neurons: other cortical neurons or subcortical, especially thalamic, neurons. This appears to be the fundamental weakness of the Pareto input model. However, there is also another potential problem with this model. As mentioned in section 4.2.2, the Pareto input model will more closely approximate the Poisson input model as more inputs are included, so increasing the number of inputs will eventually undermine the success of this model. Consequently, the Pareto input model may not be successful as an explanation for high variability and LRD in neurons that receive an exceptionally high number of inputs. In the next section, we describe and explore another relatively simple model that remedies both of these problems.

5 Integrate-and-Fire Models with Fractional-Gaussian-Noise-Driven Poisson Process Inputs

In the previous section, the output of the IF model was LRD when the input was LRD. If this is a general property that is not dependent on the inputs being RPPs, then LRD inputs with finite interval variance should produce LRD in the output of the IF model as well. In this section, we consider just such a model, which will thus have inputs that are more statistically compatible with known properties of in vivo neurons. More specifically, we have chosen to use inputs that are fractional-gaussian-noise-driven Poisson processes, which were developed to model LRD in biological neurons.

5.1 The Fractional-Gaussian-Noise-Driven Poisson Process.

The stationary, or homogeneous, Poisson process is the simplest stochastic point process. The probability of occurrence of a point in this model is uniform throughout time, parameterized by the rate of occurrence λ, and is completely independent of the past. In addition, the occurrence of multiple points at any one time instant is virtually impossible.
Customarily, this is the first model that is used for situations that can be represented as a series of events, due to its simplicity and analytical tractability, and the inputs of the high-variability IF models of cortical neurons are no exception. Nevertheless, with more study of the process to be modeled, one usually finds that more complicated models are eventually required. The inputs of the models considered in section 4 retained some of the temporal independence of the Poisson process but allowed modification of the interval distribution. The Poisson process can, however, be generalized in such a way that temporal dependence can be specified, but certain distributional properties are retained. This generalization involves replacing the (constant) rate parameter with a time-dependent function λ(t), resulting in a nonstationary, or nonhomogeneous, Poisson process. However, this model is useful only when there is a known deterministic trend in the rate of occurrence of points that can be used to define the function λ(t).
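A nonhomogeneous Poisson process with a known rate function λ(t) can be simulated by thinning: generate candidates from a homogeneous process at a bounding rate λ_max and accept each candidate at time t with probability λ(t)/λ_max. A sketch with an arbitrary illustrative rate function:

```python
import numpy as np

def thinning(rate_fn, rate_max, t_end, rng):
    """Lewis-Shedler thinning for a nonhomogeneous Poisson process.
    `rate_fn(t)` must not exceed `rate_max` anywhere on [0, t_end]."""
    t, points = 0.0, []
    while True:
        t += rng.exponential(1.0 / rate_max)   # candidate from bounding process
        if t > t_end:
            return np.array(points)
        if rng.uniform() < rate_fn(t) / rate_max:
            points.append(t)                    # keep with prob rate(t)/rate_max

rng = np.random.default_rng(5)
rate = lambda t: 5.0 + 4.0 * np.sin(2.0 * np.pi * t / 10.0)  # illustrative
pts = thinning(rate, rate_max=9.0, t_end=1000.0, rng=rng)
print(len(pts))   # near the integral of rate over [0, 1000], i.e. ~5000
```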
Often a more useful point process model is produced by replacing the rate parameter with a stochastic process Λ(t), yielding what is called a doubly stochastic Poisson process (DSPP), sometimes also called a "Cox process" since it was proposed in a seminal paper by Cox (1955). As long as the stochastic process Λ(t) is stationary, the DSPP will be stationary but can still be dependent on its past. In essence, the DSPP has a dependency structure equivalent to that of its stochastic rate process. In particular, for our purposes, if its stochastic rate process is LRD, then so is the DSPP. Fractional gaussian noise is an LRD stochastic process that was introduced by Mandelbrot and coworkers (Mandelbrot, 1965; Mandelbrot & Wallis, 1968, 1969a, 1969b, 1969c; Mandelbrot & van Ness, 1968) in order to stochastically model the fluctuations in water levels of the Nile River, which, Hurst (1951) had previously discovered, possessed long memory. Fractional gaussian noise (fGn) is the increment process of fractional Brownian motion (fBm), which is the only gaussian process that is both self-similar and has stationary increments. fBm is parameterized by the self-similarity index 0 < H < 1, where H = 0.5 corresponds to ordinary Brownian motion with independent increments. Thus, fGn can be parameterized by the self-similarity index of the corresponding fBm, with H = 0.5 corresponding to ordinary "white" gaussian noise. When H < 0.5, fGn has (long-range) negative dependence, which is not useful for modeling neural spike trains. However, when H > 0.5, fGn is LRD, so we will be interested in fGn with H only in the range 0.5 ≤ H < 1. In all cases, individual random variables in the process are distributed according to a gaussian distribution, but as H approaches one, the strength of the dependence, or correlation, between these random variables increases.
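The dependence structure of standard fGn is summarized by its autocovariance, γ(k) = ½(|k + 1|^{2H} − 2|k|^{2H} + |k − 1|^{2H}); for H > 0.5 these values are positive and decay so slowly that they are not summable, which is the long-range dependence referred to above. A small numerical check:

```python
def fgn_autocov(k, H):
    """Autocovariance at integer lag k of standard (unit-variance)
    fractional gaussian noise with self-similarity index H."""
    return 0.5 * (abs(k + 1) ** (2 * H) - 2 * abs(k) ** (2 * H)
                  + abs(k - 1) ** (2 * H))

# H = 0.5: ordinary white noise, zero covariance at every nonzero lag.
print([fgn_autocov(k, 0.5) for k in (1, 2, 3)])
# H = 0.75: positive covariances decaying slowly (roughly as k**(-1/2)).
print([round(fgn_autocov(k, 0.75), 4) for k in (1, 10, 100)])
```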
The fractional-gaussian-noise-driven Poisson process (fGnDP) is created by using a modified form of fGn as the stochastic rate process of a DSPP. Modification of the fGn is necessary for several reasons. First, the rate of a Poisson process cannot be negative, whereas samples of fGn can certainly assume negative values. Second, fGn has mean zero, so that simple truncation of the fGn at zero would significantly change its statistical properties. Third, fGn, like ordinary "white" gaussian noise, is inherently a discrete process, while the rate process for a DSPP should be defined in continuous time. Let {G_H(k), k ∈ Z} denote standard fGn, with a mean of zero and a variance of one. Then, in order to remedy the aforementioned incongruities, we let the rate process of a DSPP be12

Λ(t) = max{0, λ + σ G_H(⌊t/τ⌋)},    (5.1)
12 This rate function is not actually a stationary stochastic process since time zero is always at the beginning of a τ -length sampling interval. However, this “small” nonstationarity facilitated mathematical analysis and did not affect our results in test simulations due to the length of the simulations and the analysis methods.
where λ and σ are positive constants and ⌊x⌋ is the largest integer not exceeding x. The max function keeps Λ(t) from being negative, and the λ parameter makes the mean nonzero. The expression ⌊t/τ⌋ converts continuous time t to discrete "time." Thus, the value of Λ(t) is constant on each τ-length time interval. Assuming that σ ≪ λ, so that the probability that λ + σ G_H < 0 is negligible, the mean rate of the fGnDP with the rate process in equation 5.1 is E{Λ(t)} ≈ λ. In addition, if N(t) is a DSPP with the rate process in equation 5.1, then the variance of N(t) will be approximately λt plus a term proportional to σ². Teich et al. (1990) suggested using "fractal-noise-driven" Poisson process models for neural spike trains of the auditory nerve in order to account for the long-range dependence, or "fractal behavior," that had been previously discovered (Teich, 1989). In subsequent works, the theory and application of this model to the auditory nerve and other peripheral sensory-system neurons were further developed (Teich, 1992; Teich & Lowen, 1994; Lowen & Teich, 1993a, 1995, 1996a, 1997; Kumar & Johnson, 1993; Thurner et al., 1997). The fractional-gaussian-noise-driven Poisson process is, arguably, the simplest of these models.13 Furthermore, since the theory of fractional gaussian noise has been well developed in many different contexts, and since the other "fractal" noises that were suggested as the driving noise in these models converge to fractional gaussian noise under appropriate conditions, the fGnDP is a natural starting point for using "fractal-noise-driven" Poisson process models. The fGnDP was used as a model of the spike trains in the auditory nerve because it is LRcD (for H > 0.5), but unlike LRD renewal processes, it is also LRiD and has finite interval variance similar to that of a Poisson process. These properties are shared by the auditory nerve, as well as by most subcortical neurons that have been studied with respect to such properties.
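Given samples G_H(k), the rate process of equation 5.1 and the resulting counts can be sketched as below. Synthesizing true fGn with H > 0.5 requires a dedicated method (e.g., circulant embedding), so this illustration substitutes white gaussian noise, the H = 0.5 special case; the bin structure and per-bin Poisson counts are otherwise as in the text, and the parameter values are illustrative:

```python
import numpy as np

def fgndp_counts(noise, lam, sigma, tau, rng):
    """Doubly stochastic Poisson counts driven by a discrete noise
    sequence: the rate is held at max(0, lam + sigma * noise[k]) on the
    k-th tau-length interval, as in equation 5.1."""
    rate = np.maximum(0.0, lam + sigma * noise)
    return rng.poisson(rate * tau)            # one Poisson count per bin

rng = np.random.default_rng(6)
lam, sigma, tau = 40.0, 4.0, 0.1              # sigma << lam: clipping is rare
g = rng.standard_normal(100000)               # stand-in for fGn samples G_H(k)
counts = fgndp_counts(g, lam, sigma, tau, rng)
fano = counts.var() / counts.mean()
print(counts.mean() / tau)   # mean rate near lam
print(fano)                  # above 1: the stochastic rate adds count variance
```

With white noise the excess dispersion stays bounded; substituting LRD fGn (H > 0.5) for `g` is what makes the count dispersion grow without bound at long counting times.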
Thus, a further justification of the IF model with fGnDP inputs is the fact that those thalamic neurons that have been studied with respect to LRD and that also project into the cortex possess these properties as well.

5.2 The Superposition of Fractional-Gaussian-Noise-Driven Poisson Processes.

The only difference between the present model and the model of section 4 is the form of the input processes. Thus, much of the reasoning in that section regarding general properties of the model applies here as well. In particular, we still expect the fGnDP-input model to exhibit negative correlation on small timescales, due to the integration mechanism, and this negative correlation will presumably manifest itself in a negatively sloped FFC for small counting windows. Also, excitation-inhibition balance will
13 In this literature, the fractional-gaussian-noise-driven Poisson process is called the fractal-gaussian-noise-driven Poisson process. We have used the former in accordance with mainstream mathematical and analytical literature on fractional gaussian noise.
B. Jackson
most likely still produce memory over longer ranges by producing high interval variability. Finally, additional memory properties of the output, if present, should be evident in the superposition of the inputs. Therefore, we will next consider the properties of the superposition of fGnDPs. The superposition of two independent DSPPs is also a DSPP, as proved in the following theorem:

Theorem 3. Let N1(t) and N2(t) be independent doubly stochastic Poisson processes with (independent) rate processes Λ1(t) and Λ2(t), respectively. Then the superposition N(t) = N1(t) + N2(t) is a doubly stochastic Poisson process with rate process Λ(t) = Λ1(t) + Λ2(t).

Proof. The probability generating functional of a doubly stochastic Poisson process N(t) with stochastic rate process Λ(t) is (e.g., Cox & Isham, 1980; Daley & Vere-Jones, 1988)

\[ G_N[\xi] = E_{\Lambda}\left\{ \exp\left( \int_{-\infty}^{\infty} (\xi(t) - 1)\,\Lambda(t)\, dt \right) \right\}, \]
where E_Λ denotes expectation with respect to Λ. Therefore, if N1(t) and N2(t) are independent doubly stochastic Poisson processes with independent stochastic rate processes Λ1(t) and Λ2(t), respectively, then the probability generating functional of their superposition N(t) = N1(t) + N2(t) is (e.g., Cox & Isham, 1980; Daley & Vere-Jones, 1988)

\[
\begin{aligned}
G_N[\xi] &= G_{N_1}[\xi]\, G_{N_2}[\xi] \\
&= E_{\Lambda_1}\left\{ \exp\left( \int_{-\infty}^{\infty} (\xi(t)-1)\,\Lambda_1(t)\, dt \right) \right\}
\cdot E_{\Lambda_2}\left\{ \exp\left( \int_{-\infty}^{\infty} (\xi(t)-1)\,\Lambda_2(t)\, dt \right) \right\} \\
&= E_{\Lambda}\left\{ \exp\left( \int_{-\infty}^{\infty} (\xi(t)-1)\,(\Lambda_1(t)+\Lambda_2(t))\, dt \right) \right\}.
\end{aligned}
\]

Thus, N(t) is a doubly stochastic Poisson process with rate process Λ(t) = Λ1(t) + Λ2(t).

This result is easily extended to any finite sum of independent DSPPs by repeated application of the previous theorem, yielding the following corollary:

Corollary 1. Let N1(t), N2(t), . . . , Nm(t) be independent doubly stochastic Poisson processes with (independent) rate processes Λ1(t), Λ2(t), . . . , Λm(t), for finite integer m. Then the superposition N(t) = Σ_{i=1}^{m} N_i(t) is a doubly stochastic Poisson process with rate process Λ(t) = Σ_{i=1}^{m} Λ_i(t).
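Theorem 3 and corollary 1 are easy to check numerically. The sketch below (numpy assumed; names are hypothetical; white-noise rate modulation, the degenerate H = 0.5 case, is used for simplicity) compares the count statistics of a superposition of two DSPPs against a single DSPP whose rate is the sum of the two component rates:

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau = 200000, 0.1
lam1, lam2, sigma = 50.0, 30.0, 5.0  # sigma << lam, so truncation is negligible

def bin_counts(lam, sig):
    """Per-bin counts of a DSPP with gaussian-modulated, bin-constant rate."""
    rate = np.maximum(0.0, lam + sig * rng.standard_normal(n))
    return rng.poisson(rate * tau)

superposed = bin_counts(lam1, sigma) + bin_counts(lam2, sigma)
# The sum of the two independent gaussian rates has mean lam1 + lam2 and
# standard deviation sqrt(2) * sigma, so by Theorem 3 this single DSPP
# should match the superposition in distribution.
combined = bin_counts(lam1 + lam2, np.sqrt(2.0) * sigma)
```

The per-bin means and variances of `superposed` and `combined` agree to within sampling error, as the theorem predicts.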
Long-Range Dependence in Models of Cortical Variability
Another theorem, concerning the sum of independent fractional gaussian noises, is therefore useful with regard to the fGnDP-input IF model:

Theorem 4. Let G1(k) and G2(k) be two independent standard (i.e., zero-mean, unit-variance) fractional gaussian noises with identical Hurst indices H. Then their weighted sum G(k) = σ1 G1(k) + σ2 G2(k), σ1, σ2 > 0, is also a fractional gaussian noise with Hurst index H, mean zero, and variance σ1² + σ2².

Proof. Let Z = {Z_j, j = . . . , −1, 0, 1, . . .} be any stationary sequence. The sequence of transforms {T_N, N = 1, 2, 3, . . .} defined, for each N, as T_N : Z → T_N Z = {(T_N Z)_i, i = . . . , −1, 0, 1, . . .}, where

\[ (T_N Z)_i = \frac{1}{N^H} \sum_{j=iN+1}^{(i+1)N} Z_j, \qquad i = \ldots, -1, 0, 1, \ldots, \]

is called the renormalization group with index H (Samorodnitsky & Taqqu, 1994). T_N transforms the original sequence into a sequence composed of renormalized sums of N adjacent components of the original sequence. According to corollary 7.2.13 of Samorodnitsky and Taqqu (1994), fractional gaussian noise is the only gaussian fixed point of the renormalization group, where Z is by definition a fixed point of the renormalization group if T_N Z =ᵈ Z (equality in distribution) for all N ≥ 1. Now let G1 = {G1(k), k = . . . , −1, 0, 1, . . .} and G2 = {G2(k), k = . . . , −1, 0, 1, . . .} be two standard fractional gaussian noises, each with Hurst index H, and let G = σ1 G1 + σ2 G2 = {σ1 G1(k) + σ2 G2(k), k = . . . , −1, 0, 1, . . .}, for constants σ1 and σ2. Also let {T_N, N = 1, 2, 3, . . .} be the renormalization group with Hurst index H equal to the Hurst indices of G1 and G2. Each T_N is clearly a linear transformation, so for each N ≥ 1,

\[ T_N G = T_N(\sigma_1 G_1 + \sigma_2 G_2) = \sigma_1 (T_N G_1) + \sigma_2 (T_N G_2). \]
Since G1 and G2 are independent and are both fixed points of the renormalization group,

\[ T_N G \stackrel{d}{=} \sigma_1 G_1 + \sigma_2 G_2 = G. \]

Thus, G is also a fixed point of the renormalization group. Furthermore, for each k = . . . , −1, 0, 1, . . . , σ1 G1(k) has a gaussian distribution with mean zero and variance σ1², and σ2 G2(k) has a gaussian distribution with mean zero and variance σ2². Thus, G(k), for each k, also has a gaussian distribution with mean zero, and its variance is σ1² + σ2². Finally, since G is gaussian and a fixed point of the renormalization group, it must be fractional gaussian noise.

This theorem can also be extended to any finite sum by its repeated application.

Corollary 2. Let G_i(k), i = 1, 2, . . . , m, be m independent standard fractional gaussian noises with identical Hurst indices H. Then their weighted sum G(k) = Σ_{i=1}^{m} σ_i G_i(k), where σ_i > 0 for i = 1, 2, . . . , m, is also a fractional gaussian noise with Hurst index H, mean zero, and variance Σ_{i=1}^{m} σ_i².

Note, however, that according to equation 5.1, the rate process of an fGnDP is not a linear function of an fGn. In order to produce a valid rate process, the linear function of fGn,

\[ \lambda + \sigma\, G_H\!\left(\left\lfloor \frac{t}{\tau} \right\rfloor\right), \]

must be truncated below at zero. If this truncation were unnecessary, then, using corollaries 1 and 2, the sum of m independent fGnDPs with rate processes

\[ \Lambda_i(t) = \lambda + \sigma\, G_{H,i}\!\left(\left\lfloor \frac{t}{\tau} \right\rfloor\right), \qquad i = 1, 2, \ldots, m, \]

would be an fGnDP with rate process

\[ \Lambda(t) = \sum_{i=1}^{m} \Lambda_i(t) = m\lambda + \sqrt{m}\, \sigma\, G_H\!\left(\left\lfloor \frac{t}{\tau} \right\rfloor\right). \tag{5.2} \]

Hence, the mean rate of this fGnDP would be mλ and the variance of its rate process mσ². Statistically, however, it is desirable that the rate process of an fGnDP be as much like fGn as possible; the truncation was used only because the rate of a Poisson process cannot be negative. Thus, ideally, we could consider the superposition of m fGnDPs to be an fGnDP with the rate process
in equation 5.2 truncated at zero. In other words, the rate process of the superposition would be

\[ \Lambda(t) = \max\left( 0,\; m\lambda + \sqrt{m}\, \sigma\, G_H\!\left(\left\lfloor \frac{t}{\tau} \right\rfloor\right) \right). \tag{5.3} \]

In order to compare the difference between this "ideal" situation and the "real" superposition of fGnDPs with rate processes of the form 5.1, consider the sum of two independent fGnDPs. Suppose that the rate processes of these fGnDPs are

\[ \Lambda_i(t) = \max\left( 0,\; \lambda + \sigma\, G_{H,i}\!\left(\left\lfloor \frac{t}{\tau} \right\rfloor\right) \right), \qquad i = 1, 2. \]

Then the rate process of their superposition is

\[
\begin{aligned}
\Lambda(t) &= \Lambda_1(t) + \Lambda_2(t) \\
&= \max\left( 0,\; \lambda + \sigma G_{H,1}\!\left(\lfloor t/\tau \rfloor\right) \right) + \max\left( 0,\; \lambda + \sigma G_{H,2}\!\left(\lfloor t/\tau \rfloor\right) \right) \\
&= \max\left( 0,\; \lambda + \sigma G_{H,1}\!\left(\lfloor t/\tau \rfloor\right),\; \lambda + \sigma G_{H,2}\!\left(\lfloor t/\tau \rfloor\right),\; 2\lambda + \sigma \left( G_{H,1} + G_{H,2} \right)\!\left(\lfloor t/\tau \rfloor\right) \right),
\end{aligned} \tag{5.4}
\]

while the "ideal" situation yields a rate process of

\[ \Lambda(t) = 2\lambda + \sqrt{2}\, \sigma\, G_H\!\left(\left\lfloor \frac{t}{\tau} \right\rfloor\right). \tag{5.5} \]

Thus, since G_{H,1} + G_{H,2} is distributed like √2 G_H, the "ideal" case and the "real" case are different when either σ G_{H,1} < −λ or σ G_{H,2} < −λ, but not when both are true. As we assumed previously, these occurrences should be rare. In addition, when they do occur, the differences that they produce should usually be small, since big differences would necessitate that λ + σ G_{H,i} ≪ 0 for the process i that has an untruncated negative rate. Therefore, the superposition of m independent fGnDPs with rate processes of the form 5.1 is well approximated by a single fGnDP with the rate process in equation 5.3.

5.3 Simulation Procedures for the IF Model with fGnDP Inputs. For the IF model with fGnDP inputs, we ran simulations similar to those for the renewal input model. Again, the simulations were 100,000 seconds in duration, and 10 independent simulations were run for each set of parameter values. Each simulation used 100 excitatory inputs and 100r inhibitory inputs, as did the RPP input model simulations. The IF mechanism had a
reset potential of zero and a threshold of unity, and inputs caused instantaneous increases or decreases of 1/40 in the integration potential. The rate of each input was calculated according to equation 4.8, yielding a nominal output rate of 2.5 spikes per second. Each input of this model is specified by its stochastic rate process

\[ \Lambda_i(t) = \max\left( 0,\; \lambda + \sigma\, G_{H,i}\!\left(\left\lfloor \frac{t}{\tau} \right\rfloor\right) \right), \]

under the assumption that all inputs are statistically identical. The parameter λ, the mean rate of the fGnDP if the effect of truncation is neglected, was specified by the rate of the inputs that produced a nominal output rate of 2.5 spikes per second. In previous studies (unpublished), we found that the value σ = 30 worked well for modeling neural spike trains when the rate was λ = 100. Since the variance of the counts should be additive with respect to the rate, this suggested that σ = 3√λ. Finally, the sample time τ, that is, the length of the intervals over which the rate of the Poisson processes remained constant, was set to 0.1 second. This value was chosen by balancing the cost of simulation time against the condition that this value not have a significant effect on the results. Therefore, the fGnDP input model had only two free parameters for us to vary: the inhibition-excitation ratio r and the Hurst index H of the inputs. For r, we used the same set of values that was used for the Pareto input model in section 4: 0.0, 0.5, 0.7, 0.8, 0.9, and 0.95. We desired a set of values for H that spanned the range 0.5 ≤ H < 1.0, which includes all values for which fGn is uncorrelated or LRD, but not degenerate. Thus, we chose the values 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, and 0.95. For the simulations of the fGnDP-input model, we used the result from section 5.2 that the superposition of independent fGnDPs may be approximated by a single fGnDP.
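The quality of that superposition approximation can be spot-checked numerically. In the sketch below (numpy assumed; variable names are hypothetical; iid gaussian modulation, the H = 0.5 case, stands in for fGn), the "real" sum of m individually truncated rates (equation 5.1 summed) is compared with the "ideal" single truncation of the summed rate (equation 5.3):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, sigma, m, n = 100.0, 30.0, 100, 20000  # sigma << lam, as the text assumes

g = rng.standard_normal((m, n))             # one noise sample per input per bin
# "real": each of the m rates is truncated at zero before summing
real = np.maximum(0.0, lam + sigma * g).sum(axis=0)
# "ideal": the summed untruncated rate m*lam + sigma*sum(g) is truncated once;
# sigma*sum(g) equals sqrt(m)*sigma*G with G a standard gaussian, matching eq. 5.3
ideal = np.maximum(0.0, m * lam + sigma * g.sum(axis=0))
rel_err = np.abs(real - ideal).mean() / (m * lam)
```

Because a sum of termwise maxima dominates the maximum of the sum, `real` is never below `ideal`, and with σ ≪ λ the relative discrepancy is tiny, consistent with the argument above.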
Thus, for each simulation run, we needed to produce only two fGnDPs: one for the pooled excitatory inputs and one for the pooled inhibitory inputs. Both of these fGnDPs had Hurst index H, and the rate processes were

\[ \Lambda_E(t) = \max\left( 0,\; 100\lambda + 3\sqrt{100\lambda}\; G_{H,E}\!\left(\lfloor 10t \rfloor\right) \right) \]

for the excitatory process and

\[ \Lambda_I(t) = \max\left( 0,\; 100r\lambda + 3\sqrt{100r\lambda}\; G_{H,I}\!\left(\lfloor 10t \rfloor\right) \right) \]

for the inhibitory process. Thus, changes in the number of excitatory inputs (or, equivalently, in the number of inhibitory inputs for a fixed value of r) are equivalent to changes in the spike rate of these inputs. Hence, given that we chose the spike rates of the inputs so that the nominal output rate would be 2.5 spikes per second, the number of inputs that we used is actually inconsequential to our results.
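Generating the long fGn samples these rate processes require is the computationally demanding step. As an illustration, an exact FFT-based generator in the spirit of the circulant-embedding construction of Davies and Harte (1987) can be sketched as follows (numpy assumed; this is a generic textbook construction, not the article's actual code):

```python
import numpy as np

def fgn_autocov(k, H):
    """Autocovariance of standard fGn at integer lag k:
    gamma(k) = 0.5 * (|k+1|^{2H} - 2|k|^{2H} + |k-1|^{2H})."""
    k = np.abs(np.asarray(k, dtype=float))
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + np.abs(k - 1) ** (2 * H))

def fgn_davies_harte(n, H, rng):
    """Exact sample of n points of standard fGn via circulant embedding."""
    m = 2 * n
    lags = np.concatenate([np.arange(n + 1), np.arange(n - 1, 0, -1)])
    lam = np.fft.fft(fgn_autocov(lags, H)).real  # circulant eigenvalues
    lam = np.clip(lam, 0.0, None)                # guard tiny negative round-off
    w = np.zeros(m, dtype=complex)
    w[0] = np.sqrt(lam[0]) * rng.standard_normal()
    w[n] = np.sqrt(lam[n]) * rng.standard_normal()
    re = rng.standard_normal(n - 1)
    im = rng.standard_normal(n - 1)
    w[1:n] = np.sqrt(lam[1:n] / 2.0) * (re + 1j * im)
    w[n + 1:] = np.conj(w[1:n][::-1])            # Hermitian symmetry -> real output
    return (np.fft.fft(w) / np.sqrt(m))[:n].real

rng = np.random.default_rng(3)
g = fgn_davies_harte(1 << 16, H=0.7, rng=rng)
```

With H = 0.5 this reduces to white gaussian noise; for H > 0.5 the sample has unit variance and lag-one autocorrelation near the theoretical value 2^{2H−1} − 1.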
Since each of the 1200 simulated fGns (60 sets of parameters × 10 simulation runs per set × 2 fGns per run) requires 10^6 sample points (10^5 seconds × 10 sample points per second), and long fGn simulations are computationally demanding, samples of fGn were produced using a fast Fourier transform (FFT) method (Davies & Harte, 1987; Beran, 1994; Bardet, Lang, Oppenheim, Philippe, & Taqqu, 2002). However, unlike most popular FFT methods (e.g., Paxson, 1997; Ledesma & Derong, 2000), which are approximate (and often quite inexact), this method is exact. The main drawback of any FFT method, however, is that it requires large amounts of memory for such long samples. Further discussion of this and other fGn algorithms, as well as references, can be found in Jackson (2003, sect. 3.6.3). After simulating the fGnDP-input IF model with all combinations of the values of the two parameters r and H, we calculated estimates of the CVISI, the FFC, and the IDC for each simulation run in the same manner as was done for the renewal input model. This included estimating the FFC and IDC of the surrogate data, produced by shuffling the interspike intervals, for each simulation run. 5.4 Simulation Results for the IF Model with fGnDP Inputs. Figure 8 shows the estimated values of CVISI from all of our simulations of the fGnDP-input model plotted versus the inhibition-excitation ratio r. Column a contains graphs of the mean CVISI values calculated across the 10 simulations at each combination of parameter values for r and H. Means for the same value of H are connected by lines. Like the gaussian input model and the Pareto input model in section 4, the graph of CVISI as a function of r for any particular value of H has an increasing, convex shape. More specifically, these curves are well approximated by the square root of a rational function with a pole at r = 1. Furthermore, the values of CVISI all lie above the curve for the Poisson input model (dashed line), and they increase with H.
Since, like the Pareto input model, the intervals in the superpositions of the inputs are positively correlated for this model, the fact that CVISI is always greater (at least for 0.5 ≤ H < 1) than in the Poisson input model is to be expected. The positive correlation between CVISI and H is also not surprising, given the results for the Pareto input model. In the Pareto input model, decreasing the value of α strengthened the dependence between intervals in the output, and, we suspect, the input superpositions, of the model. This was also associated with increases in the value of CVISI . In the present model, increasing the value of H will certainly strengthen the dependency between intervals in the input superpositions. Thus, we should expect that such increases in H lead to stronger dependence in the output as well, and that this is associated with increases in the value of CVISI . Moreover, we will see in the IDCs that increases in H do indeed strengthen the dependence between the output intervals. These results thus indicate that for higher values of H, the fGnDP input model requires less excitation-inhibition balance to achieve any particular value of CVISI . This means that for high values of H, the CVISI of the
output of the fGnDP input model is substantially larger than physiological estimates at even moderate degrees of balance, as was also the case with the Pareto input model. Column b of Figure 8 shows, for a subset of the values of H, the individual estimates of CVISI for each simulation. The lines connect the mean values of these estimates, and there are 10 symbols, one for each simulation run, for each of the different combinations of the two parameters r and H. Here again, the symbols are tightly grouped for low values of r and weaker LRD (i.e., low values of H). Also, as in the other two sets of simulations, the variance of the estimates of CVISI increases as either r or H is increased. Again, this is a result of the fact that the model approaches a nonstationary state as r or H approaches one. Figures 9 through 11 show the FFCs and IDCs for simulations of the IF model with fGnDP inputs for several representative combinations of the values of the parameters H and r. The curves for Hurst indices of H = 0.5, H = 0.7, and H = 0.9 are shown in Figures 9, 10, and 11, respectively. For each of these values of the Hurst index, FFC and IDC sets are shown for the inhibition-excitation ratios r = 0.0, r = 0.5, r = 0.8, and r = 0.95. The entire set of FFCs and IDCs from all of our simulations may be found in Jackson (2003, appendix B). For these curves, as should always be the case, the FFCs approach a value of one at very small counting interval lengths, and the mean surrogate data IDCs are horizontal. Due to the lack of very short intervals in the output of this model, caused by the integration mechanism, the initial trend of the FFCs is downward. These initial downward trends are comparable to those seen in the FFCs of the gaussian input and Pareto input models. In contrast, the original data IDCs for the fGnDP input model, when H > 0.5, have an initial, and indeed enduring, upward trend.
In fact, these IDCs all closely resemble a power law function, being very linear on a double logarithmic plot. Like the Pareto input model, but in contradistinction to the gaussian input model, a superposition of the inputs to the fGnDP model has positively correlated interspike intervals. Not only does this positive correlation produce the positive slopes of the IDCs, but it also produces positive slopes in the original data FFCs for medium and large counting intervals. Thus, except when H equals 0.5 or when H is close to 0.5 and r is small, the original data FFCs have an asymptotic value larger than one. Theoretically, the output of the fGnDP input model with H = 0.5 is a renewal process. fGn is simply the common “white” gaussian noise when H = 0.5. Thus, since this fGn has no memory and Poisson processes have no memory, the fGnDPs used as inputs to the IF model will have no memory when H = 0.5. Thus, in this sense, it is very similar to the Poisson input IF model, with the integration mechanism, which is reset at the occurrence of each output spike, being the only component possessing memory. The results of our simulations are in agreement with this prediction. As is evident in Figure 9, the original data FFCs and IDCs are nearly identical to the
Figure 8: The coefficient of variation of interpoint intervals as a function of the inhibition-excitation ratio r for the output of an IF model with inputs that are fractional-gaussian-noise-driven Poisson processes. For each combination of the parameters, r and the Hurst index H of the fractional Gaussian noise, 10 separate simulation runs were conducted. (Column a) The mean values of the estimates are plotted for each combination of parameters. Each solid line connects the means for a single value of H, and the dashed line is the theoretical curve for the model with Poisson process inputs. All three sets of axes contain the same data, but each displays them on a different vertical scale. (Column b) The estimates calculated from each individual simulation run are plotted for a subset of the values of H. Solid and dashed lines are the same as in column a, and all three large sets of axes contain the same data. The inset on the bottom set of axes contains only the data for H = 0.9, revealing two additional data points that are out of range of the larger set of axes.
Figure 9: FFCs and IDCs estimated from simulations of the IF model with fGnDP inputs with Hurst index H = 0.5. Each set of axes contains 10 curves calculated from original data (black) and 10 curves calculated from the corresponding shuffled surrogate data (gray). For each value of the inhibition-excitation ratio r, each individual FFC in the left set of axes was calculated from the same data as one of the IDCs in the right set of axes. The spike rate of the inputs was set in order to produce a nominal output rate of 2.5 spikes per second for each simulation.
Figure 10: FFCs and IDCs estimated from simulations of the IF model with fGnDP inputs with Hurst index H = 0.7. Curves for original data are shown in black, and those for surrogate data are shown in gray. The spike rate of the inputs was set in order to produce a nominal output rate of 2.5 spikes per second for each simulation.
Figure 11: FFCs and IDCs estimated from simulations of the IF model with fGnDP inputs with Hurst index H = 0.9. Curves for original data are shown in black, and those for surrogate data are shown in gray. The spike rate of the inputs was set in order to produce a nominal output rate of 2.5 spikes per second for each simulation.
surrogate data ones. This is the behavior that we expect from a renewal process. Due to the presence of the fGn, each input to this model should, however, be more variable than a Poisson process, which should manifest itself in higher variability at the output of the model. Indeed, as we saw in Figure 8, the variance of the output of the fGnDP input model for H ≥ 0.5 is always greater than that for the Poisson input model having the same r-value. When H > 0.5, the inputs of the fGnDP model, as well as the superpositions of the inputs, are LRD. This was also true of the Pareto input model when α ≤ 2.0. The LRD in the Pareto inputs and superpositions, however, came in the form of LRcD with infinite interval variance and no LRiD, whereas in the fGnDP inputs and superpositions, it is in the form of LRcD with LRiD and finite interval variance. But this distinction does not seem to be important for the IF model, at least with regard to the statistical procedures that we have used. This is demonstrated in the striking similarity between the results from the Pareto model with α = 1.75 in Figure 6 and those from the fGnDP model with H = 0.7 in Figure 10. Thus, disregarding the effect of the interaction between excitation and inhibition, in the fGnDP input model, the LRiD at the input propagates through the model, while in the Pareto input model, the infinite interval variance of the inputs is converted into LRiD by the model. In either case, the result at the output seems to be just about the same. The effect on the output of the fGnDP input model of balancing the amounts of excitation and inhibition is comparable to its effect on the other models that we have considered. As r increases toward one, the effect of the correlation in the inputs is gradually overwhelmed by the high variability of the excitation-inhibition interaction. The output will continue to exhibit LRcD, but this LRcD will progressively become more a result of high interval variability than of LRiD. 
However, the surrogate data FFCs all asymptote to a finite constant (except perhaps when H and r are both very close to one, where the approach to nonstationarity creates unusually large variance in our estimates), consistent with the intervals in the output having finite variance. The output interval variance does increase with r, causing the asymptotic values of the surrogate data FFCs to move upward, but these FFCs always remain below their original data counterparts because of the presence of LRiD. Also, in a manner analogous to that in the Pareto input model, the difference between the original data FFCs and the surrogate data FFCs at long counting intervals increases as H increases. But this trend again seems to break down, in this case as H approaches one. Pareto RPPs are non-LRD for any α greater than two, but fGnDPs are non-LRD only when H = 0.5. In addition, the Pareto RPPs will produce positive correlation between the intervals of the output for any α, including α > 2, while intervals in the output of the fGnDP model with H = 0.5 are independent. Thus, these two models behave quite differently in their non-LRD state. The non-LRD Pareto model is more flexible, allowing a
range of different FFC shapes and differences between the original data curves and the surrogate data curves. Only the asymptotic value of the FFCs can be adjusted, by changing the value of r, for the non-LRD fGnDP model, and the original data and surrogate data FFCs are always identical. Therefore, with regard to these measures, the fGnDP model with H = 0.5 is not an improvement over the Poisson input model, even for modeling non-LRD data. The fGnDP model is, however, more flexible than the Pareto model in producing weaker short-term dependence when LRD is present in the output. Due to the positive correlation that is present in the non-LRD Pareto model, the original data FFCs and IDCs are already significantly different from their surrogate data counterparts on shorter timescales when α = 2.0, the first value at which it is LRD. In contrast, the original data curves for the fGnDP model gradually differentiate themselves, at all timescales, from the surrogate data curves as LRD is introduced into the output. Thus, although these two models can produce outputs with similar statistical features, the statistical characteristics of their outputs are adjustable in different ways. In short, our simulations show that with proper adjustment of the parameters r and H, the fGnDP input model can produce outputs that are similar to cortical spike trains in that they possess both LRD and intervals with finite variance. The typical range of values for CVISI estimates in cortical spike trains is 0.5 to 1.5. Table 3 shows the approximate range of the inhibition-excitation ratio r, for each value of H in our simulations, that produces values of CVISI in this range. Using these data, and by comparing the FFCs from our simulations (some of which are shown in Figures 9 through 11) with those shown in Teich et al. (1996), the ability of the fGnDP input IF model to match statistical properties of cortical spike trains can be directly evaluated.
For low values of H, the value of CVISI for the fGnDP model is within the physiological range for r between about 0.6 and 0.9. This is very similar to the Pareto model at α-values below, but close to, 2.0. For r values near 0.9, the FFCs for H ≈ 0.6 resemble those of cells 4 and 7 in Teich et al. (1996). For the Pareto model, such FFCs were created only when α was greater than 2.0. The critical difference is that the H ≈ 0.6 fGnDP model is LRD, while the α > 2.0 Pareto model is not. However, since the slope of the FFCs is so shallow for the H ≈ 0.6 fGnDP model, the LRD in the output is difficult to distinguish using an FFC calculated from any reasonable-length sample of the process. For H in the neighborhood of 0.65 to 0.85, the upper limit of r-values needed to produce physiological CVISI's is 0.8–0.9. For this combination of parameters, the FFCs of the fGnDP model resemble those of cells 3 and 19 of Teich et al. (1996). These FFCs suggest the presence of LRcD, but the surrogate data FFCs asymptote to a finite value above one. If r is reduced, then this asymptotic value drops below one, which does not match the results in Teich et al. (1996), although the LRcD is still present. This case may, however, match the FFCs of neurons with CVISI < 1.0 if they were
Table 3: Range of the Inhibition-Excitation Ratio r, for Each Value of H, That Leads to Values of CVISI Between 0.5 and 1.5 for the IF Model with fGnDP Inputs.

H       Range of r
0.50    0.69 < r < 0.96
0.55    0.66 < r < 0.95
0.60    0.63 < r < 0.94
0.65    0.59 < r < 0.92
0.70    0.55 < r < 0.90
0.75    0.50 < r < 0.87
0.80    0.45 < r < 0.81
0.85    0.32 < r < 0.77
0.90    0.25 < r < 0.64
0.95    0.06 < r < 0.28

Note: The ranges of r were approximated from plots of CVISI versus r (see Figure 8) obtained from simulations of the model.
available. As with the Pareto model, when r is increased, the FFCs still resemble the physiologically measured FFCs, but the values of CVISI become too large. As H increases to about 0.9 and above, the FFCs and CVISI values can no longer be matched to those measured physiologically. When these models have physiological values of CVISI, they are still LRcD, but the surrogate data FFCs asymptote at too low a value. Like the Pareto model with α ≈ 1.0, moderately large values of r tend to result in FFCs that suggest that the output of the model is renewal-like. Here, again, this is most likely due to the model's approach to nonstationarity, which occurs at H = 1, coupled with the effect of tightly balanced excitation and inhibition. Also, much greater variability in the estimates of the Fano factor and the index of dispersion of intervals is evident in this parameter range. Therefore, like the Pareto input model, the fGnDP input model can produce outputs that share common statistical features with the spike trains of cortical neurons. However, the fGnDP input model has two significant advantages over the Pareto input model. First, whereas the inputs to the Pareto model that were required in order to produce LRD do not seem to be justified physiologically, the inputs to the fGnDP model are known to be statistically similar to the spike trains of neurons that project into cortex. Furthermore, while increasing the number of inputs to the Pareto input model makes it less likely to produce appropriate outputs, the success of the fGnDP input model is not affected by this change. This is due to the fortunate result that the superposition of a number of fGnDPs is also an fGnDP and therefore does not asymptotically approach a Poisson process.
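For reference, the Fano factor curves compared throughout this section can be estimated in a few lines. The sketch below (numpy assumed; the function name and disjoint-window scheme are illustrative, not the article's exact procedure) computes F(T) = Var[N(T)]/E[N(T)] over a range of counting-window lengths and, as a sanity check, recovers F(T) ≈ 1 for a homogeneous Poisson spike train:

```python
import numpy as np

def fano_factor_curve(spike_times, window_lengths):
    """Estimate the Fano factor F(T) = Var[N(T)] / E[N(T)] by tiling the
    recording with disjoint counting windows of length T."""
    duration = spike_times[-1]
    ffc = []
    for T in window_lengths:
        edges = np.arange(0.0, duration + T, T)
        counts, _ = np.histogram(spike_times, bins=edges)
        counts = counts[:-1]  # drop the final, possibly partial window
        ffc.append(counts.var() / counts.mean())
    return np.array(ffc)

rng = np.random.default_rng(4)
# Homogeneous Poisson spike train at 2.5 spikes/s, roughly 20,000 s long
times = np.cumsum(rng.exponential(1.0 / 2.5, size=50000))
ffc = fano_factor_curve(times, window_lengths=[0.1, 1.0, 10.0])
```

For an LRD spike train, by contrast, the same estimator would produce the rising, power-law-like curves seen in the original data FFCs above.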
6 Discussion A number of different types of IF models have been created in order to explain how cortical neurons can integrate over large numbers of inputs while still producing highly variable outputs. Although these models can produce values for the coefficient of variation of interspike intervals (CVISI ) similar to those calculated from in vivo cortical spike trains, we considered whether such models can also produce LRD in their outputs. Based on the observation that a large class of these models produces outputs that are renewal point processes, we were able to prove analytically that the output of these models cannot simultaneously have both LRD and a finite CVISI . Therefore, since the spike trains of in vivo cortical neurons are long-range dependent (see section 2.7), none of these renewal models produces highly variable outputs in a way that is consistent with the properties of cortical spike trains. Thus, their success in representing the cortical processing that leads to highly variable spike trains is doubtful. By considering LRD, in addition to the CVISI , we were able to show that a large portion of the single-neuron models of cortical high variability are incompatible with cortical spike trains. Shinomoto and his colleagues (Shinomoto & Sakai, 1998; Shinomoto, Sakai, & Funahashi, 1999) have previously demonstrated another statistical difference between cortical spike trains and the output of one of these models, the standard leaky IF model with Poisson inputs. They argued that the leaky IF model could not match both the coefficient of variation and the skewness coefficient of the interspike-interval density of cortical spike trains. The skewness coefficient, or just skewness, is the third central moment divided by the standard deviation cubed and is a measure of the symmetry of a distribution. Our result, however, has several advantages over the arguments in these articles. First, our result is more generally applicable. 
The approach that we have taken has allowed us to handle a large number of models analytically without having to deal with the details of each specific model. Second, their arguments were based on an approximation to the leaky IF model, whereas ours are directly applicable to the models themselves. Most of these models are analytically intractable in detail, but by considering general principles, we have been able to deal with them analytically without approximation. Third, in order to discount the models, Shinomoto and his colleagues placed restrictions on the parameter ranges of the models, something that was unnecessary in our arguments. Furthermore, the consideration, in this study, of a general statistical description that is applicable to many models adds to our intuition of cortical variability and narrows the range of feasible models for future studies. This general analytical approach does not, however, apply to one important type of single-neuron IF model that has previously been proposed as a solution to the high-variability problem. These models consist of an IF mechanism with inputs that are renewal point processes, but not Poisson point
Long-Range Dependence in Models of Cortical Variability
2185
processes as in the other models. Therefore, we examined these models in more detail. In particular, we investigated models with inputs that have either positive gaussian-distributed intervals or Pareto-distributed intervals, which have distribution tails that are shorter and longer, respectively, than those of a Poisson process. IF models with non-Poissonian renewal point process inputs are capable of producing correlation between the intervals in the superpositions of inputs that causes the intervals in the output of the model to be similarly correlated. If the intervals of the inputs are distributed according to a positive gaussian distribution, this correlation is negative. We expect this result to be true of any interval distribution that possesses a shorter tail than the exponential distribution, the interval distribution of a Poisson process. Although the gaussian input model can produce any value of CVISI if the inhibition-excitation ratio r is properly adjusted, the negative correlation created by these inputs hinders the production of LRD. Thus, in this sense, the gaussian input model is inferior to the standard Poisson input model with respect to modeling cortical spike trains. If the intervals of the RPP inputs are distributed according to a Pareto distribution, then positive correlation is produced in the IF model. We expect this to also be true of any interval distribution with a tail that is longer than the exponential distribution. Furthermore, if the tail of the distribution is long enough that the variance is infinite, then both the inputs and the output of the model are LRD. We found that by proper adjustment of α, a parameter of the Pareto distribution that affects the length of its tail, and the inhibition-excitation ratio r, we could produce spike trains with the Pareto input IF model that had interval variance and LRD similar to spike trains recorded from cortical neurons. 
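To make the Pareto interval distribution concrete, here is a brief sketch (the value of α, the unit scale x_m, and the sample sizes are illustrative choices of ours, not parameters from the article). With 1 < α < 2 the mean interval is finite, α·x_m/(α − 1), but the variance is infinite; numpy's `pareto` draws the shifted (Lomax) form, so adding 1 recovers a classical Pareto:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 1.5   # tail index: 1 < alpha < 2 gives a finite mean but infinite variance
x_m = 1.0     # scale (minimum possible interval)

# Classical Pareto samples; numpy's pareto() draws the shifted (Lomax) form,
# so adding 1 and scaling by x_m gives support [x_m, inf).
intervals = x_m * (1.0 + rng.pareto(alpha, size=1_000_000))

theoretical_mean = alpha * x_m / (alpha - 1.0)  # finite: 3.0 for alpha = 1.5
print(intervals.mean(), theoretical_mean)

# Because the population variance is infinite, the sample CV does not settle
# down: it tends to keep growing as more of the record is included.
for n in (10**3, 10**4, 10**5, 10**6):
    x = intervals[:n]
    print(n, x.std() / x.mean())
```

The growing sample CV is the practical fingerprint of the infinite interval variance that, per the text, underlies the LRD produced by the Pareto input model.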
Therefore, this model succeeds where the renewal IF models (i.e., those with renewal-point-process outputs) fail. Although the output of a properly adjusted Pareto input IF neuron is both long-range count dependent (LRcD) and long-range interval dependent (LRiD) with finite interval variance, the inputs do not possess LRiD, but instead are LRcD with infinite interval variance. This “conversion,” so to speak, is produced by the integration process in the model and has precedent in other models that aggregate processes with infinite variance (Mandelbrot, 1969; Taqqu & Levy, 1986; Lowen & Teich, 1993b, 1993c; Willinger, Taqqu, Sherman, & Wilson, 1995, 1997; Taqqu, Willinger, & Sherman, 1997; Heath, Resnick, & Samorodnitsky, 1998). However, although the output of this model is consistent with our current knowledge of the statistical structure of cortical spike trains, the inputs are not consistent with the statistical structure of either cortical or subcortical spike trains, including those in neurons that project into cortex. This inconsistency seriously undermines the credibility of the renewal input IF models. In addition, as the number of inputs increases, this model approximates a Poisson input model, which cannot successfully produce realistic CVISI values and LRD. Hence, it may
not be able to explain these properties in neurons that receive an extremely large number of inputs. We therefore suggested a model with inputs that are LRD but are statistically similar to actual neurons. The fractional-gaussian-noise-driven Poisson process (fGnDP) has elsewhere been used as a model for sensory neurons in the periphery, such as primary auditory neurons, in order to model the LRD present in their spike trains. The same statistical characteristics present in these neurons are known to exist in many subcortical sensory pathways, including certain neurons that project into the cortex. Thus, this process seems to be a reasonable choice for the inputs to the IF model of cortical neurons. We found that as for the Pareto inputs, the fGnDP inputs did indeed produce LRD in the output of the IF model. However, in the fGnDP input model, both the inputs and the outputs had LRcD and LRiD, and neither had intervals with infinite variance. By proper adjustment of the Hurst index H, a parameter that modulates the strength of the LRD present in an fGnDP, and the inhibition-excitation ratio r, we were able to produce output spike trains with the fGnDP input IF model that approximated the LRD and variability of cortical spike trains. Furthermore, these results show that a tight balance between the amounts of excitation and inhibition at the inputs is not necessary for high interspike-interval variability at the output. Although the output of this model is not necessarily more similar to that of cortical neurons than the Pareto input IF model, the physiological justification of its fGnDP inputs is reason enough to favor it over the Pareto model. Furthermore, unlike the Pareto input model, the fGnDP input model has the advantageous property that it does not lose its ability to produce output spike trains that are statistically similar to cortical neurons as the number of inputs increases. 
This is a consequence of the fact that the superposition of many fGnDPs is also an fGnDP and does not approximate a Poisson process. Due to the statistical similarity between fGnDPs and neural spike trains, this fact also has serious implications for the common assumption that the convergence of a large number of neural spike trains approximates a Poisson process. Other recent studies have also had success using an IF model with temporally correlated inputs to match the statistics of cortical spike trains. Salinas and Sejnowski (2002) have derived analytical results for the means and variances of interspike intervals at the outputs of both nonleaky and leaky IF neurons driven by temporally correlated noise. They concluded from their results that the temporal correlations in the inputs to a neuron are an important factor affecting the variability at the output, which is what we have found as well. However, their study did not look at temporal correlations, much less long-range correlations, in the outputs. Sakai et al. (1999) and Shinomoto and Tsubo (2001) have shown that temporally correlated inputs to leaky IF models can produce not only coefficients of variation and skewness coefficients of interspike intervals, but also correlation between consecutive interspike intervals, that match those estimated from cortical
spike trains. Although their study of the correlation between consecutive interspike intervals would not reveal long-term correlations, their result is consistent with the production of long-term correlations in the output of IF models by long-term correlations in the inputs, as we found for the fGnDP input model. Although some of the input correlations in these studies extended into the infinite past, they were short term. In other words, they decreased fast enough to result in the convergence of their infinite sum (see section 2.1). Based on our results, for which LRD was necessary at the input, this suggests that these models would not produce LRD. However, this is a relatively weak argument, and a resolution to this question would require further analysis of these models. Although the fGnDP input IF model seems to be the best “single-neuron” model heretofore suggested for matching both the interspike-interval variability and LRD of cortical spike trains, network models, like those suggested by Usher and his colleagues (1994, 1995), may also succeed at this task. In the network model of Usher and his coworkers, the output of each individual neuron is highly variable due to the complex network dynamics. The spike trains of individual units in their model did exhibit both high CVISI values and “long-term fluctuations,” but these “long-term fluctuations” extended only to time frames on the order of 1 second, much shorter than the dependence in our fGnDP input model and in biological neurons. If network models are, however, capable of producing LRD, then many of the inputs to each model cortical neuron will be statistically similar to the fGnDPs used as inputs in this study, since many of the inputs to each individual neuron will be outputs from other neurons within the network. The main statistical difference will be that the variability of the interspike intervals in these interneuron connections will need to be higher than that in fGnDPs.
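For concreteness, an fGnDP of the kind referred to above can be sketched as a Poisson process whose rate is modulated by fractional gaussian noise. The construction below is our own minimal version (rate parameters, modulation depth, and bin count are illustrative): it samples the fGn exactly via a Cholesky factor of its Toeplitz autocovariance, which is fine for short records.

```python
import numpy as np

def fgn(n, H, rng):
    """Exact fractional gaussian noise sample (unit variance, length n, Hurst
    index H) via a Cholesky factor of its Toeplitz autocovariance matrix."""
    k = np.arange(n, dtype=float)
    # fGn autocovariance: gamma(k) = 0.5 * (|k+1|^2H - 2|k|^2H + |k-1|^2H)
    gamma = 0.5 * (np.abs(k + 1) ** (2 * H)
                   - 2.0 * np.abs(k) ** (2 * H)
                   + np.abs(k - 1) ** (2 * H))
    idx = np.arange(n)
    cov = gamma[np.abs(np.subtract.outer(idx, idx))]
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n))  # jitter for numerical safety
    return L @ rng.standard_normal(n)

def fgndp_counts(n_bins, dt, base_rate, mod_depth, H, rng):
    """Bin counts of a fractional-gaussian-noise-driven Poisson process: the
    rate in each bin is base_rate * (1 + mod_depth * fGn), floored at zero."""
    rate = base_rate * (1.0 + mod_depth * fgn(n_bins, H, rng))
    return rng.poisson(np.clip(rate, 0.0, None) * dt)

rng = np.random.default_rng(2)
counts = fgndp_counts(n_bins=1000, dt=0.01, base_rate=50.0,
                      mod_depth=0.5, H=0.8, rng=rng)
print(counts.sum())
```

The O(n²) Cholesky route is chosen here for transparency rather than speed; fast FFT-based generators (e.g., Davies & Harte, 1987, cited in the references) are the standard tool for long records.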
The results presented in this study highlight the need for more thorough study of the variability of cortical spike trains before judging the validity of models of this variability and drawing general conclusions from them. In particular, the insufficiency of the CVISI estimator alone as a measure of this variability and the necessity of also using a measure that is sensitive to the correlational structure of the spike trains are demonstrated. We have assumed, for the sake of argument, that empirical measurements of the CVISI of cortical spike trains, which have resulted in values typically in the range from 0.5 to 1.2, are valid. However, another possible scenario exists. If the “real” CVISI of cortical spike trains were actually very large, such that for all practical purposes it could be assumed infinite, estimates of CVISI from short-duration recordings of cortical spike trains might still be small. Under such conditions, it is theoretically possible that cortical spike trains could be long-range dependent, but that empirical estimates of CVISI could be on the order of one. However, since CVISI estimates in this case would increase with increasing recording duration, and since such estimates have been measured under numerous recording conditions by many different
researchers, this potentiality seems improbable. More likely, under such conditions, empirical estimates of CVISI would be widely variable and span a much larger range than 0.5 to 1.2. However, to rule out this possibility, the asymptotic behavior of the CVISI estimator applied to cortical spike trains should be investigated. Practically, this may be accomplished for a particular spike train recording by studying the trend of the CVISI estimates as more and more of the record length is included in the calculation. If the total record length is long enough, this analysis should reveal whether the CVISI estimate converges or diverges.

Appendix: Proof of Theorem 1

Daley (1999) contains only a terse argument for theorem 1. Furthermore, there is a mismatch between the text and the equation in his argument. The text specifies when equation 1.2 in that article diverges, but equation 1.2 is an equation for Var{N(0, t]}, not Var{N(0, t]}/t. The former, which diverges for most interesting stochastic point processes, is only a necessary condition for long-range dependence, whereas the latter is the definition of long-range dependence. For these reasons, we have included the following proof (following Daley's argument) of theorem 1.

Proof. Since the interpoint interval random variable X has a finite mean, the moments E{N(0, 1]} = 1/µ and E{N²(0, 1]} are finite, where N(·) is the counting measure of the point process. Therefore, by theorem 3.5.III of Daley and Vere-Jones (1988),

$$E\{N^2(0,t]\} = \int_0^t \frac{2U(s) - 1}{\mu}\,ds,$$

where

$$U(t) = \lim_{h \downarrow 0} \sum_{j=0}^{\infty} \Pr\{N(0,t] \ge j \mid N(-h,0] > 0\}$$

is called the expectation function. For a general stationary ergodic point process with finite second moments, U(t) is the analog of the renewal function (and, for a renewal point process, is the renewal function). The variance function can now be written as

$$\begin{aligned}
\mathrm{Var}\{N(0,t]\} &= E\{N^2(0,t]\} - \bigl(E\{N(0,t]\}\bigr)^2 \\
&= \int_0^t \frac{2U(s) - 1}{\mu}\,ds - \left(\frac{t}{\mu}\right)^2 \\
&= \int_0^t \left[\frac{2U(s) - 1}{\mu} - \frac{2s}{\mu^2}\right] ds \\
&= \frac{2}{\mu} \int_0^t \left[U(s) - \frac{s}{\mu}\right] ds - \frac{t}{\mu}.
\end{aligned}$$

So,

$$\frac{\mathrm{Var}\{N(0,t]\}}{t} = \frac{2}{\mu}\,\frac{1}{t} \int_0^t \left[U(s) - \frac{s}{\mu}\right] ds - \frac{1}{\mu},$$

which goes to infinity as t → ∞ if and only if the integrand goes to infinity. In other words,

$$\lim_{t\to\infty} \frac{\mathrm{Var}\{N(0,t]\}}{t} = \infty \quad\text{if and only if}\quad \lim_{t\to\infty} \left[U(t) - \frac{t}{\mu}\right] = \infty, \tag{A.1}$$

where the first limit is the definition (see definition 3) of LRcD. Now, if σ² = Var{X}, then for a renewal point process (Feller, 1971),

$$0 \le U(t) - \frac{t}{\mu} \to \frac{\sigma^2 + \mu^2}{2\mu^2} \quad\text{as } t \to \infty,$$

where the right side is replaced by ∞ if Var{X} does not exist. Therefore,

$$\lim_{t\to\infty} \left[U(t) - \frac{t}{\mu}\right] = \infty \quad\text{if and only if}\quad E\{X^2\} = \infty. \tag{A.2}$$

Putting statements A.1 and A.2 together yields the desired result.

It may prove instructive to consider the following simpler alternate proof of the fact that an LRcD renewal point process necessarily has infinitely variable interpoint intervals (i.e., the “only if” part of the “if and only if” of theorem 1). This proof requires only a well-known application of the central limit theorem to renewal point processes (e.g., Cox, 1967; Feller, 1971). If the variance of the generic interpoint interval random variable X is finite, then for large t, N(0, t] is asymptotically normally distributed with mean t/µ and variance σ²t/µ³, where µ and σ² are the mean and variance of X, respectively. Therefore, if E{X²} < ∞, then

$$\lim_{t\to\infty} \frac{\mathrm{Var}\{N(0,t]\}}{t} = \frac{\sigma^2}{\mu^3} < \infty,$$

and the process is not LRcD. Thus, if the renewal point process is LRcD, then E{X²} = ∞.
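The central-limit result used above, that Var{N(0, t]}/t approaches σ²/µ³ for a renewal process with finite interval variance, can be checked numerically. This sketch (with arbitrary gamma-interval parameters of our choosing, not values from the article) simulates a gamma renewal process and compares the empirical count variance per unit time against the predicted limit:

```python
import numpy as np

rng = np.random.default_rng(3)
shape, scale = 4.0, 0.25           # gamma ISIs: mu = 1.0, sigma^2 = 0.25
mu = shape * scale
sigma2 = shape * scale ** 2
predicted = sigma2 / mu ** 3       # lim_{t->inf} Var{N(0,t]}/t = 0.25

isis = rng.gamma(shape, scale, size=500_000)
times = np.cumsum(isis)

T = 500.0                          # a large counting window t
edges = np.arange(0.0, times[-1], T)
counts, _ = np.histogram(times, bins=edges)
empirical = counts.var() / T
print(empirical, predicted)        # the two values should be close
```

Because the limit is finite here, the process is not LRcD, exactly as the alternate proof concludes for any renewal process with E{X²} < ∞.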
Acknowledgments

I am extremely grateful to Laurel Carney for many helpful discussions, suggestions, and editorial readings of the manuscript for this article and for the many other ways that she supported this study. I also thank Hyune-Ju Kim and Phillip Griffin for comments on an earlier version. This research was supported by the Lewis DiCarlo Endowment, the Jerome R. & Arlene L. Gerber Fund, and NIH R01DC01641.

References

Bardet, J.-M., Lang, G., Oppenheim, G., Philippe, A., & Taqqu, M. S. (2002). Generators of long-range dependent processes: A survey. In P. Doukhan, G. Oppenheim, & M. S. Taqqu (Eds.), Long-range dependence: Theory and applications. Boston: Birkhäuser.
Beran, J. (1994). Statistics for long-memory processes. Boca Raton, FL: Chapman & Hall/CRC Press.
Brown, D., & Feng, J. (1999). Is there a problem matching real and model CV(ISI)? Neurocomputing, 26–27, 87–91.
Bugmann, G., Christodoulou, C., & Taylor, J. G. (1997). Role of temporal integration and fluctuation detection in the highly irregular firing of a leaky integrator neuron model with partial reset. Neural Computation, 9(5), 985–1000.
Burkitt, A. N. (2000). Interspike interval variability for balanced networks with reversal potentials for large numbers of inputs. Neurocomputing, 32–33, 313–321.
Burkitt, A. N. (2001). Balanced neurons: Analysis of leaky integrate-and-fire neurons with reversal potentials. Biological Cybernetics, 85(4), 247–255.
Cox, D. R. (1955). Some statistical methods connected with series of events. Journal of the Royal Statistical Society, Series B, 17(2), 129–157.
Cox, D. R. (1967). Renewal theory. London: Chapman and Hall.
Cox, D. R. (1984). Long-range dependence: A review. In H. A. David & H. T. David (Eds.), Statistics: An appraisal (pp. 55–74). Ames, IA: Iowa State University Press.
Cox, D. R., & Isham, V. (1980). Point processes. London: Chapman and Hall.
Cox, D. R., & Smith, W. L. (1954). On the superposition of renewal processes. Biometrika, 41(1/2), 91–99.
Daley, D. J. (1999). The Hurst index of long-range dependent renewal processes. Annals of Probability, 27(4), 2035–2041.
Daley, D. J., Rolski, T., & Vesilo, R. (2000). Long-range dependent point processes and their Palm-Khinchin distributions. Advances in Applied Probability, 32(4), 1051–1063.
Daley, D. J., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer-Verlag.
Daley, D. J., & Vesilo, R. (1997). Long range dependence of point processes, with queueing examples. Stochastic Processes and Their Applications, 70(2), 265–282.
Davies, R. B., & Harte, D. S. (1987). Tests for Hurst effect. Biometrika, 74(1), 95–101.
Davis, R., & Resnick, S. (1986). Limit theory for the sample covariance and correlation functions of moving averages. Annals of Statistics, 14(2), 533–558.
Destexhe, A., & Pare, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. Journal of Neurophysiology, 81(4), 1531–1547.
Destexhe, A., & Pare, D. (2000). A combined computational and intracellular study of correlated synaptic bombardment in neocortical pyramidal neurons in vivo. Neurocomputing, 32–33, 113–119.
Enns, E. G. (1970). A stochastic superposition process and an integral inequality for distributions with monotone hazard rates. Australian Journal of Statistics, 12, 44–49.
Fano, U. (1947). Ionization yield of radiations. II. The fluctuations of the number of ions. Physical Review, 72, 26–29.
Feller, W. (1971). An introduction to probability theory and its applications (2nd ed.). New York: Wiley.
Feng, J. (1997). Behaviours of spike output jitter in the integrate-and-fire model. Physical Review Letters, 79(22), 4505–4508.
Feng, J. (1999). Origin of firing variability of the integrate-and-fire model. Neurocomputing, 26–27, 117–122.
Feng, J. (2001). Is the integrate-and-fire model good enough? A review. Neural Networks, 14(6–7), 955–975.
Feng, J., & Brown, D. (1998a). Impact of temporal variation and the balance between excitation and inhibition on the output of the perfect integrate-and-fire model. Biological Cybernetics, 78(5), 369–376.
Feng, J., & Brown, D. (1998b). Spike output jitter, mean firing time and coefficient of variation. Journal of Physics A, 31(4), 1239–1252.
Feng, J., & Brown, D. (1999). Coefficient of variation of interspike intervals greater than 0.5. How and when? Biological Cybernetics, 80(5), 291–297.
Feng, J., & Brown, D. (2000a). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Computation, 12(3), 671–692.
Feng, J., & Brown, D. (2000b). Integrate-and-fire models with nonlinear leakage. Bulletin of Mathematical Biology, 62(3), 467–481.
Feng, J., Tirozzi, B., & Brown, D. (1998). Output jitter diverges to infinity, converges to zero or remains constant. In M. Verleysen (Ed.), 6th European Symposium on Artificial Neural Networks. ESANN'98 (pp. 39–45). Brussels, Belgium: D-Facto.
Feng, J., & Zhang, P. (2001). Behavior of integrate-and-fire and Hodgkin-Huxley models with correlated inputs. Physical Review E, 63(5, pt. 1–2), 051902/1–11.
Gerstein, G., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68.
Gruneis, F., Nakao, M., Mizutani, Y., Yamamoto, M., Meesmann, M., & Musha, T. (1993). Further study on 1/f fluctuations observed in central single neurons during REM sleep. Biological Cybernetics, 68(3), 193–198.
Gruneis, F., Nakao, M., & Yamamoto, M. (1990). Counting statistics of 1/f fluctuations in neuronal spike trains. Biological Cybernetics, 62(5), 407–413.
Gruneis, F., Nakao, M., Yamamoto, M., Musha, T., & Nakahama, H. (1989). An interpretation of 1/f fluctuations in neuronal spike trains during dream sleep. Biological Cybernetics, 60(3), 161–169.
Heath, D., Resnick, S., & Samorodnitsky, G. (1998). Heavy tails and long range dependence in on/off processes and associated fluid models. Mathematics of Operations Research, 23(1), 145–165.
Hurst, H. E. (1951). Long-term storage capacity of reservoirs. Transactions of the American Society of Civil Engineers, 116, 770–799.
Jackson, B. S. (2003). Consequences of long-range temporal dependence in neural spiking activity for theories of processing and coding. Unpublished doctoral dissertation, Syracuse University.
Khintchine, A. Y. (1960). Mathematical methods in the theory of queueing. New York: Hafner Publishing Co.
Kodama, T., Mushiake, H., Shima, K., Nakahama, H., & Yamamoto, M. (1989). Slow fluctuations of single unit activities of hippocampal and thalamic neurons in cats. I. Relation to natural sleep and alert states. Brain Research, 487(1), 26–34.
Kulik, R., & Szekli, R. (2001). Sufficient conditions for long-range count dependence of stationary point processes on the real line. Journal of Applied Probability, 38(2), 570–581.
Kumar, A. R., & Johnson, D. H. (1993). Analyzing and modeling fractal intensity point processes. Journal of the Acoustical Society of America, 93(6), 3365–3373.
Lansky, P., & Smith, C. E. (1989). The effect of a random initial value in neural first-passage-time models. Mathematical Biosciences, 93(2), 191–215.
Lawrance, A. J. (1973). Dependency of intervals between events in superposition processes. Journal of the Royal Statistical Society, Series B, 35(2), 306–315.
Ledesma, S., & Derong, L. (2000). Synthesis of fractional gaussian noise using linear approximation for generating self-similar network traffic. Computer Communication Review, 30(2), 4–17.
Lowen, S. B., Ozaki, T., Kaplan, E., Saleh, B. E. A., & Teich, M. C. (2001). Fractal features of dark, maintained, and driven neural discharges in the cat visual system. Methods, 24(4), 377–394.
Lowen, S. B., & Teich, M. C. (1991). Doubly stochastic Poisson point process driven by fractal shot noise. Physical Review A, 43(8), 4192–4215.
Lowen, S. B., & Teich, M. C. (1992). Auditory-nerve action potentials form a nonrenewal point process over short as well as long time scales. Journal of the Acoustical Society of America, 92(2 pt. 1), 803–806.
Lowen, S. B., & Teich, M. C. (1993a). Estimating the dimension of a fractal point process. Proceedings of the SPIE, 2036, 64–76.
Lowen, S. B., & Teich, M. C. (1993b). Fractal renewal processes. IEEE Transactions on Information Theory, 39(5), 1669–1671.
Lowen, S. B., & Teich, M. C. (1993c). Fractal renewal processes generate 1/f noise. Physical Review E, 47(2), 992–1001.
Lowen, S. B., & Teich, M. C. (1995). Estimation and simulation of fractal stochastic point processes. Fractals, 3(1), 183–210.
Lowen, S. B., & Teich, M. C. (1996a). Refractoriness-modified fractal stochastic point processes for modeling sensory-system spike trains. In J. M. Bower (Ed.), Computational neuroscience (pp. 447–452). San Diego: Academic Press.
Lowen, S. B., & Teich, M. C. (1996b). The periodogram and Allan variance reveal fractal exponents greater than unity in auditory-nerve spike trains. Journal of the Acoustical Society of America, 99(6), 3585–3591.
Lowen, S. B., & Teich, M. C. (1997). Estimating scaling exponents in auditory-nerve spike trains using fractal models incorporating refractoriness. In E. R. Lewis, G. R. Long, R. F. Lyon, P. M. Narins, C. R. Steele, & E. Hecht-Pointar (Eds.), Diversity in auditory mechanics (pp. 197–204). Singapore: World Scientific.
Mandelbrot, B. B. (1965). Une classe de processus stochastiques homothétiques à soi: Application à la loi climatologique de H. E. Hurst. Comptes Rendus de l'Académie des Sciences de Paris, 240, 3274–3277.
Mandelbrot, B. B. (1969). Long-run linearity, locally gaussian process, H-spectra and infinite variances. International Economic Review, 10(1), 82–111.
Mandelbrot, B. B., & van Ness, J. W. (1968). Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4), 422–437.
Mandelbrot, B. B., & Wallis, J. R. (1968). Noah, Joseph and operational hydrology. Water Resources Research, 4(5), 909–918.
Mandelbrot, B. B., & Wallis, J. R. (1969a). Computer experiments with fractional gaussian noises. Water Resources Research, 5(1), 228–267.
Mandelbrot, B. B., & Wallis, J. R. (1969b). Robustness of the rescaled range R/S in the measurement of noncyclic long run statistical dependence. Water Resources Research, 5, 967–988.
Mandelbrot, B. B., & Wallis, J. R. (1969c). Some long-run properties of geophysical records. Water Resources Research, 5, 321–340.
Mushiake, H., Kodama, T., Shima, K., Yamamoto, M., & Nakahama, H. (1988). Fluctuations in spontaneous discharge of hippocampal theta cells during sleep-waking states and PCPA-induced insomnia. Journal of Neurophysiology, 60(3), 925–939.
Paxson, V. (1997). Fast, approximate synthesis of fractional gaussian noise for generating self-similar network traffic. Computer Communication Review, 27(5), 5–18.
Sakai, Y., Funahashi, S., & Shinomoto, S. (1999). Temporally correlated inputs to leaky integrate-and-fire models can reproduce spiking statistics of cortical neurons. Neural Networks, 12(7–8), 1181–1190.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. Journal of Neuroscience, 20(16), 6193–6209.
Salinas, E., & Sejnowski, T. J. (2002). Integrate-and-fire neurons driven by correlated stochastic input. Neural Computation, 14(9), 2111–2155.
Samorodnitsky, G., & Taqqu, M. S. (1994). Stable non-gaussian random processes: Stochastic models with infinite variance. Boca Raton, FL: Chapman & Hall/CRC.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4(4), 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18(10), 3870–3896.
Shinomoto, S., & Sakai, Y. (1998). Spiking mechanisms of cortical neurons. Philosophical Magazine B, 77(5), 1549–1555.
Shinomoto, S., Sakai, Y., & Funahashi, S. (1999). The Ornstein-Uhlenbeck process does not reproduce spiking statistics of neurons in prefrontal cortex. Neural Computation, 11(4), 935–951.
Shinomoto, S., & Tsubo, Y. (2001). Modeling spiking behavior of neurons with time-dependent Poisson processes. Physical Review E, 64(4–1), 041910.
Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58(1), 13–41.
Softky, W. R. (1995). Simple codes versus efficient codes. Current Opinion in Neurobiology, 5(2), 239–247.
Softky, W. R., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation, 4(5), 643–646.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13(1), 334–350.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1(3), 210–217.
Taqqu, M. S., & Levy, J. B. (1986). Using renewal processes to generate long-range dependence and high variability. In E. Eberlein & M. S. Taqqu (Eds.), Dependence in probability and statistics: A survey of recent results (Vol. 11, pp. 73–89). Boston: Birkhäuser.
Taqqu, M. S., Willinger, W., & Sherman, R. (1997). Proof of a fundamental result in self-similar traffic modeling. Computer Communication Review, 27(2), 5–23.
Teich, M. C. (1989). Fractal character of the auditory neural spike train. IEEE Transactions on Biomedical Engineering, 36(1), 150–160.
Teich, M. C. (1992). Fractal neuronal firing patterns. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single neuron computation (pp. 589–625). Boston: Academic Press.
Teich, M. C. (1996). Self-similarity in sensory neural signals. In J. K. J. Li & S. S. Reisman (Eds.), Proceedings of the IEEE 22nd Annual Northeast Bioengineering Conference (pp. 36–37). New Brunswick, NJ: IEEE.
Teich, M. C., Heneghan, C., Lowen, S. B., Ozaki, T., & Kaplan, E. (1997). Fractal character of the neural spike train in the visual system of the cat. Journal of the Optical Society of America A, 14(3), 529–546.
Teich, M. C., Johnson, D. H., Kumar, A. R., & Turcott, R. G. (1990). Rate fluctuations and fractional power-law noise recorded from cells in the lower auditory pathway of the cat. Hearing Research, 46(1–2), 41–52.
Teich, M. C., & Lowen, S. B. (1994). Fractal patterns in auditory nerve-spike trains. IEEE Engineering in Medicine and Biology Magazine, 13(2), 197–202.
Teich, M. C., Turcott, R. G., & Lowen, S. B. (1990). The fractal doubly stochastic Poisson point process as a model for the cochlear neural spike train. In P. Dallos, C. D. Geisler, J. W. Matthews, M. A. Ruggero, & C. R. Steele (Eds.), The mechanics and biophysics of hearing (Vol. 87, pp. 354–361). New York: Springer-Verlag.
Teich, M. C., Turcott, R. G., & Siegel, R. M. (1996). Temporal correlation in cat striate-cortex neural spike trains. IEEE Engineering in Medicine and Biology Magazine, 15(5), 79–87.
Thurner, S., Lowen, S. B., Feurstein, M. C., Heneghan, C., Feichtinger, H. G., & Teich, M. C. (1997). Analysis, synthesis, and estimation of fractal-rate stochastic point processes. Fractals, 5, 565–596.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9(5), 971–983.
Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6(2), 111–124.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 2). Cambridge: Cambridge University Press.
Turcott, R. G., Barker, D. R., & Teich, M. C. (1995). Long-duration correlation in the sequence of action potentials in an insect visual interneuron. Journal of Statistical Computation and Simulation, 52(3), 253–271.
Turcott, R. G., Lowen, S. B., Li, E., Johnson, D. H., Tsuchitani, C., & Teich, M. C. (1994). A nonstationary Poisson point process describes the sequence of action potentials over long time scales in lateral-superior-olive auditory neurons. Biological Cybernetics, 70(3), 209–217.
Turcott, R. G., & Teich, M. C. (1996). Fractal character of the electrocardiogram: Distinguishing heart-failure and normal patients. Annals of Biomedical Engineering, 24(2), 269–293.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Computation, 6(5), 795–836.
Usher, M., Stemmler, M., & Olami, Z. (1995). Dynamic pattern formation leads to 1/f noise in neural populations. Physical Review Letters, 74(2), 326–329.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10(6), 1321–1371.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. Journal of Theoretical Biology, 105(2), 345–368.
Willinger, W., Taqqu, M. S., Sherman, R., & Wilson, D. V. (1995). Self-similarity through high-variability: Statistical analysis of ethernet LAN traffic at the source level. Computer Communication Review, 25(4), 100–113.
Willinger, W., Taqqu, M. S., Sherman, R., & Wilson, D. V. (1997). Self-similarity through high-variability: Statistical analysis of ethernet LAN traffic at the source level. IEEE/ACM Transactions on Networking, 5(1), 71–86.
Wise, M. E. (1981). Spike interval distributions for neurons and random walks with drift to a fluctuating threshold. In C. Taillie, G. Patil, & B. Baldessari (Eds.), Statistical distributions in scientific work (Vol. 6, pp. 211–231). Boston: Reidel.
Yamamoto, M., Nakahama, H., Shima, K., Kodama, T., & Mushiake, H. (1986). Markov-dependency and spectral analyses on spike-counts in mesencephalic reticular neurons during sleep and attentive states. Brain Research, 366(1–2), 279–289.

Received August 15, 2003; accepted April 4, 2004.
LETTER
Communicated by Sam Roweis
Learning Eigenfunctions Links Spectral Embedding and Kernel PCA Yoshua Bengio
[email protected] Olivier Delalleau
[email protected] Nicolas Le Roux
[email protected] Jean-Fran¸cois Paiement
[email protected] Pascal Vincent
[email protected] Marie Ouimet
[email protected] D´epartement d’Informatique et Recherche Op´erationnelle, Centre de Recherches Math´ematiques, Universit´e de Montr´eal, Montr´eal, Qu´ebec, Canada, H3C 3J7
In this letter, we show a direct relation between spectral embedding methods and kernel principal components analysis and how both are special cases of a more general learning problem: learning the principal eigenfunctions of an operator defined from a kernel and the unknown data-generating density. Whereas spectral embedding methods provided only coordinates for the training points, the analysis justifies a simple extension to out-of-sample examples (the Nyström formula) for multidimensional scaling (MDS), spectral clustering, Laplacian eigenmaps, locally linear embedding (LLE), and Isomap. The analysis provides, for all such spectral embedding methods, the definition of a loss function, whose empirical average is minimized by the traditional algorithms. The asymptotic expected value of that loss defines a generalization performance and clarifies what these algorithms are trying to learn. Experiments with LLE, Isomap, spectral clustering, and MDS show that this out-of-sample embedding formula generalizes well, with a level of error comparable to the effect of small perturbations of the training set on the embedding.

Neural Computation 16, 2197–2219 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

In the past few years, many unsupervised learning algorithms have been proposed that share the use of an eigendecomposition for obtaining a lower-dimensional embedding of the data that characterize a nonlinear manifold near which the data would lie: locally linear embedding (LLE) (Roweis &
Saul, 2000), Isomap (Tenenbaum, de Silva, & Langford, 2000), and Laplacian eigenmaps (Belkin & Niyogi, 2003). There are also many variants of spectral clustering (Weiss, 1999; Ng, Jordan, & Weiss, 2002), in which such an embedding is an intermediate step before obtaining a clustering of the data that can capture flat, elongated, and even curved clusters. The two tasks (manifold learning and clustering) are linked because the clusters that spectral clustering manages to capture can be arbitrarily curved manifolds (as long as there are enough data to locally capture the curvature of the manifold). Clustering and manifold learning are intimately related: both clusters and manifolds are zones of high density. All of these unsupervised learning methods are thus capturing salient features of the data distribution. As shown here, spectral clustering is in fact working in a way that is very similar to manifold learning algorithms. The end result of most inductive machine learning algorithms is a function that minimizes the empirical average of a loss criterion (possibly plus regularization). The function can be applied to new points, and for such learning algorithms, it is clear that the ideal solution is a function that minimizes the expected loss under the unknown true distribution from which the data were sampled, also known as the generalization error. However, such a characterization was missing for spectral embedding algorithms such as metric multidimensional scaling (MDS) (Cox & Cox, 1994), spectral clustering (see Weiss, 1999, for a review), Laplacian eigenmaps, LLE, and Isomap, which are used for either dimensionality reduction or clustering. They do not provide a function that can be applied to new points, and the notion of generalization error is not clearly defined. This article seeks to provide an answer to these questions. A loss criterion for spectral embedding algorithms can be defined. It is a reconstruction error that depends on pairs of examples.
Minimizing its average value yields the eigenvectors that provide the classical output of these algorithms, that is, the embeddings. Minimizing its expected value over the true underlying distribution yields the eigenfunctions of a linear operator that is defined by a kernel (which is not necessarily positive semidefinite) and the data-generating density. When the kernel is positive semidefinite and we work with the empirical density, there is a direct correspondence between these algorithms and kernel principal components analysis (PCA) (Schölkopf, Smola, & Müller, 1998). Our work is therefore a direct continuation of previous work (Williams & Seeger, 2000) noting that the Nyström formula and the kernel PCA projection (which are equivalent) represent an approximation of the eigenfunctions of the above linear operator (called G here). Previous analysis of the convergence of the generalization error of kernel PCA (Shawe-Taylor, Cristianini, & Kandola, 2002; Shawe-Taylor & Williams, 2003) also helps to justify the view that these methods are estimating the convergent limit of some eigenvectors (at least when the kernel is positive semidefinite). The eigenvectors can thus be turned into estimators of eigenfunctions, which can therefore be applied to new points, turning the spectral embedding algorithms into function induction algorithms. The Nyström formula obtained this way is well known (Baker, 1977) and will be given in equation 1.1. This formula has been used previously for estimating extensions of eigenvectors in gaussian process regression (Williams & Seeger, 2001), and Williams and Seeger (2000) noted that it corresponds to the projection of a test point computed with kernel PCA. In order to extend spectral embedding algorithms such as LLE and Isomap to out-of-sample examples, this article defines for these spectral embedding algorithms data-dependent kernels $K_n$ that can be applied outside the training set. See also the independent work of Ham, Lee, Mika, and Schölkopf (2003) for a kernel view of LLE and Isomap, in which, however, the kernels are only applied on the training set. Additional contributions of this article include a characterization of the empirically estimated eigenfunctions in terms of eigenvectors in the case where the kernel is not positive semidefinite (often the case for MDS and Isomap), a convergence theorem linking the Nyström formula to the eigenfunctions of G, as well as experiments on MDS, Isomap, LLE, spectral clustering, and Laplacian eigenmaps showing that the Nyström formula for out-of-sample examples is accurate.

All of the algorithms described in this article start from a data set $D = (x_1, \ldots, x_n)$ with $x_i \in \mathbb{R}^d$ sampled independently and identically distributed (i.i.d.) from an unknown distribution with density $p$. Below we will use the notation $E_x[f(x)] = \int f(x)p(x)\,dx$ for averaging over $p(x)$ and $\hat{E}_x[f(x)] = \frac{1}{n}\sum_{i=1}^{n} f(x_i)$ for averaging over the data in $D$, that is, over the empirical distribution denoted $\hat{p}(x)$. We will denote kernels with $K_n(x, y)$ or $\tilde{K}(x, y)$: symmetric functions, not always positive semidefinite, that may depend not only on $x$ and $y$ but also on the data $D$. The spectral embedding algorithms construct an affinity matrix $M$, either explicitly through $M_{ij} = K_n(x_i, x_j)$ or implicitly through a procedure that takes the data $D$ and computes $M$. We denote by $v_{ik}$ the $i$th coordinate of the $k$th eigenvector of $M$, associated with the eigenvalue $\ell_k$. With these notations, the Nyström formula discussed above can be written

$$f_{k,n}(x) = \frac{\sqrt{n}}{\ell_k} \sum_{i=1}^{n} v_{ik} K_n(x, x_i), \tag{1.1}$$

where $f_{k,n}$ is the $k$th Nyström estimator with $n$ samples. We will show in section 4 that it estimates the $k$th eigenfunction of a linear operator.
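As a concrete sketch, equation 1.1 can be implemented in a few lines of NumPy; the gaussian kernel, the random sample, and all function names below are illustrative assumptions of this sketch, not prescriptions of the letter:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Stand-in for K_n: any symmetric kernel could be used here.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def nystrom(x, X, v_k, ell_k, kernel=gaussian_kernel):
    """Nystrom estimator of equation 1.1:
    f_{k,n}(x) = (sqrt(n) / ell_k) * sum_i v_ik K_n(x, x_i)."""
    n = len(X)
    return np.sqrt(n) / ell_k * sum(v_k[i] * kernel(x, X[i]) for i in range(n))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                                    # training sample D
M = np.array([[gaussian_kernel(a, b) for b in X] for a in X])   # Gram matrix
eigvals, eigvecs = np.linalg.eigh(M)                            # ascending order
ell_k, v_k = eigvals[-1], eigvecs[:, -1]                        # principal eigenpair

# On a training point, the formula recovers sqrt(n) * v_ik (see proposition 1).
f0 = nystrom(X[0], X, v_k, ell_k)
print(np.isclose(f0, np.sqrt(len(X)) * v_k[0]))                 # True
```

The same call, `nystrom(x_new, X, v_k, ell_k)`, yields the embedding coordinate of an arbitrary out-of-sample point `x_new`.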
2 Kernel Principal Component Analysis

Kernel PCA is an unsupervised manifold learning technique that maps data points to a generally lower-dimensional space. It generalizes the principal component analysis approach to nonlinear transformations using the kernel trick (Schölkopf, Smola, & Müller, 1996; Schölkopf et al., 1998; Schölkopf, Burges, & Smola, 1999). The algorithm implicitly finds the leading eigenvectors and eigenvalues of the covariance of the projection $\phi(x)$ of the data in feature space, where $\phi(x)$ is such that the kernel $K_n(x, y) = \phi(x) \cdot \phi(y)$ (i.e., $K_n$ must not have negative eigenvalues). If the data are centered in feature space ($\hat{E}_x[\phi(x)] = 0$), the feature space covariance matrix is

$$C = \hat{E}_x[\phi(x)\phi(x)^\top], \tag{2.1}$$

with eigenvectors $w_k$ and eigenvalues $\lambda_k$. To obtain a centered feature space, a kernel $\tilde{K}$ (e.g., the gaussian kernel) is first normalized into a data-dependent kernel $K_n$ via

$$K_n(x, y) = \tilde{K}(x, y) - \hat{E}_{x'}[\tilde{K}(x', y)] - \hat{E}_{y'}[\tilde{K}(x, y')] + \hat{E}_{x'}[\hat{E}_{y'}[\tilde{K}(x', y')]]. \tag{2.2}$$

The eigendecomposition of the corresponding Gram matrix $M$ is performed, solving $M v_k = \ell_k v_k$, as with the other spectral embedding methods (Laplacian eigenmaps, LLE, Isomap, MDS). However, in this case, one can obtain a test point projection $P_k(x)$ that is the inner product of $\phi(x)$ with the eigenvector $w_k$ of $C$, and using the kernel trick, it is written as the expansion

$$P_k(x) = w_k \cdot \phi(x) = \frac{1}{\sqrt{\ell_k}} \sum_{i=1}^{n} v_{ik} K_n(x_i, x). \tag{2.3}$$
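A minimal numerical sketch of the additive centering (equation 2.2) and the test-point projection (equation 2.3); the gaussian base kernel and all names below are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))

def k_tilde(a, b):
    # Base kernel K~ (a gaussian, chosen arbitrarily for this sketch).
    return np.exp(-np.sum((a - b) ** 2))

Kt = np.array([[k_tilde(a, b) for b in X] for a in X])

def k_n(a, b):
    """Centered data-dependent kernel of equation 2.2, with the
    empirical averages taken over the training set X."""
    ka = np.mean([k_tilde(a, xi) for xi in X])   # E-hat_{y'}[K~(a, y')]
    kb = np.mean([k_tilde(xi, b) for xi in X])   # E-hat_{x'}[K~(x', b)]
    return k_tilde(a, b) - kb - ka + Kt.mean()

M = np.array([[k_n(a, b) for b in X] for a in X])   # centered Gram matrix
ell, V = np.linalg.eigh(M)
ell_k, v_k = ell[-1], V[:, -1]                      # principal eigenpair

def project(x):
    # Equation 2.3: P_k(x) = (1 / sqrt(ell_k)) * sum_i v_ik K_n(x_i, x)
    return v_k @ np.array([k_n(xi, x) for xi in X]) / np.sqrt(ell_k)

# On a training point, P_k(x_i) reduces to sqrt(ell_k) * v_ik.
print(np.isclose(project(X[0]), np.sqrt(ell_k) * v_k[0]))  # True
```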
Note that the eigenvectors of $C$ are related to the eigenvectors of $M$ through $\lambda_k = \ell_k / n$ and

$$w_k = \frac{1}{\sqrt{\ell_k}} \sum_{i=1}^{n} v_{ik} \phi(x_i),$$

as shown in Schölkopf et al. (1998). Ng et al. (2002) already noted the link between kernel PCA and spectral clustering. Here we take advantage of that link to propose and analyze out-of-sample extensions for spectral clustering and other spectral embedding algorithms. Recently, Ham et al. (2003) showed how Isomap, LLE, and Laplacian eigenmaps can be interpreted as performing a form of kernel PCA, although they do not propose to use that link in order to perform function induction (i.e., obtain an embedding for out-of-sample points). Recent work has shown convergence properties of kernel PCA that are particularly important here. Shawe-Taylor et al. (2002) and Shawe-Taylor and
Williams (2003) give bounds on the kernel PCA convergence error (in the sense of the projection error with respect to the subspace spanned by the eigenvectors), using concentration inequalities.

3 Data-Dependent Kernels for Spectral Embedding Algorithms

The spectral embedding algorithms can be seen to build an $n \times n$ similarity matrix $M$, compute its principal eigenvectors $v_k = (v_{1k}, \ldots, v_{nk})^\top$ with eigenvalues $\ell_k$, and associate with the $i$th training example the embedding with coordinates $(v_{i1}, v_{i2}, \ldots)$ (for Laplacian eigenmaps and LLE) or $(\sqrt{\ell_1}\, v_{i1}, \sqrt{\ell_2}\, v_{i2}, \ldots)$ (for Isomap and MDS).¹ In general, we will see that $M_{ij}$ depends not only on $(x_i, x_j)$ but also on the other training examples. Nonetheless, as we show below, it can always be written $M_{ij} = K_n(x_i, x_j)$, where $K_n$ is a data-dependent kernel. In many algorithms, a matrix $\tilde{M}$ is first formed from a simpler, often data-independent kernel (such as the gaussian kernel) and then transformed into $M$. By defining a kernel $K_n$ for each of these methods, we will be able to generalize the embedding to new points $x$ outside the training set via the Nyström formula. This will only require computations of the form $K_n(x, x_i)$ with $x_i$ a training point.

¹ For Laplacian eigenmaps and LLE, the matrix $M$ discussed here is not the one defined in the original algorithms, but a transformation of it to reverse the order of eigenvalues, as we see below.

3.1 Multidimensional Scaling. Metric MDS (Cox & Cox, 1994) starts from a notion of distance $d(x, y)$ that is computed between each pair of training examples to fill a matrix $\tilde{M}_{ij} = d^2(x_i, x_j)$. These distances are then converted to equivalent dot products using the double-centering formula, which makes $M_{ij}$ depend not only on $(x_i, x_j)$ but also on all the other examples:

$$M_{ij} = -\frac{1}{2}\left( \tilde{M}_{ij} - \frac{1}{n} S_i - \frac{1}{n} S_j + \frac{1}{n^2} \sum_k S_k \right), \tag{3.1}$$

where $S_i = \sum_{j=1}^{n} \tilde{M}_{ij}$. The embedding of the example $x_i$ is given by $\sqrt{\ell_k}\, v_{ik}$, where $v_{\cdot k}$ is the $k$th eigenvector of $M$. To generalize MDS, we define a corresponding data-dependent kernel that generates the matrix $M$:

$$K_n(a, b) = -\frac{1}{2}\left( d^2(a, b) - \hat{E}_x[d^2(x, b)] - \hat{E}_{x'}[d^2(a, x')] + \hat{E}_{x,x'}[d^2(x, x')] \right), \tag{3.2}$$

where the expectations are taken on the training set $D$. An extension of metric MDS to new points has already been proposed in Gower (1968), in which
one solves exactly for the coordinates of the new point that are consistent with its distances to the training points, which in general requires adding a new dimension. Note also that Williams (2001) makes a connection between kernel PCA and metric MDS, remarking that kernel PCA is a form of MDS when the kernel is isotropic. Here we pursue this connection in order to obtain out-of-sample embeddings.

3.2 Spectral Clustering. Several variants of spectral clustering have been proposed (Weiss, 1999). They can yield impressively good results where traditional clustering looking for "round blobs" in the data, such as K-means, would fail miserably (see Figure 1). Spectral clustering is based on two main steps: first, embedding the data points in a space in which clusters are more "obvious" (using the eigenvectors of a Gram matrix), and then applying a classical clustering algorithm such as K-means, as in Ng et al. (2002). To construct the spectral clustering affinity matrix $M$, we first apply a data-independent kernel $\tilde{K}$, such as the gaussian kernel, to each pair of examples: $\tilde{M}_{ij} = \tilde{K}(x_i, x_j)$. The matrix $\tilde{M}$ is then normalized, for example, using divisive normalization (Weiss, 1999; Ng et al., 2002):²

$$M_{ij} = \frac{\tilde{M}_{ij}}{\sqrt{S_i S_j}}. \tag{3.3}$$

To obtain $m$ clusters, the first $m$ principal eigenvectors of $M$ are computed, and K-means is applied on the unit-norm coordinates obtained from the embedding $v_{ik}$ of each training point $x_i$. To generalize spectral clustering to out-of-sample points, we define a kernel that could have generated that matrix:

$$K_n(a, b) = \frac{1}{n} \frac{\tilde{K}(a, b)}{\sqrt{\hat{E}_x[\tilde{K}(a, x)]\, \hat{E}_{x'}[\tilde{K}(x', b)]}}. \tag{3.4}$$
This normalization comes out of the justification of spectral clustering as a relaxed statement of the min-cut problem (Chung, 1997; Spielman & Teng, 1996): to divide the examples into two groups so as to minimize the sum of the "similarities" between pairs of points straddling the two groups. The additive normalization performed with kernel PCA (see equation 2.2) makes sense geometrically as a centering in feature space. Both normalization procedures make use of a kernel row and column average. It would be interesting to find a similarly pleasing geometric interpretation for the divisive normalization.
² Better embeddings are usually obtained if we define $S_i = \sum_{j \neq i} \tilde{M}_{ij}$. This alternative normalization can also be cast into the general framework developed here, with a slightly different kernel.
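The relation between the normalized affinity matrix of equation 3.3 and the kernel of equation 3.4 can be checked numerically: on training points, $\hat{E}_x[\tilde{K}(x_i, x)] = S_i / n$, so the factor $1/n$ cancels and $K_n(x_i, x_j) = M_{ij}$. In the sketch below, the gaussian affinity, its width, and every name are choices of this illustration, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))
n = len(X)

def k_tilde(a, b):
    # Gaussian affinity (width chosen arbitrarily for this sketch).
    return np.exp(-np.sum((a - b) ** 2) / 0.5)

Kt = np.array([[k_tilde(a, b) for b in X] for a in X])
S = Kt.sum(axis=1)                    # row sums S_i
M = Kt / np.sqrt(np.outer(S, S))      # divisive normalization, equation 3.3

def k_n(a, b):
    """Data-dependent kernel of equation 3.4; extends the spectral
    clustering embedding to out-of-sample points a, b."""
    ea = np.mean([k_tilde(a, xi) for xi in X])   # E-hat_x[K~(a, x)]
    eb = np.mean([k_tilde(xi, b) for xi in X])   # E-hat_x'[K~(x', b)]
    return k_tilde(a, b) / (n * np.sqrt(ea * eb))

# On training points, the kernel reproduces the normalized affinity matrix.
print(np.allclose([[k_n(a, b) for b in X] for a in X], M))  # True
```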
Figure 1: Example of the transformation learned as part of spectral clustering. Input data are on the left and transformed data on the right. Gray level and cross and circle drawing are used only to show which points get mapped where: the mapping reveals both the clusters and the internal structure of the two manifolds.
3.3 Laplacian Eigenmaps. The Laplacian eigenmaps method is a recently proposed dimensionality-reduction procedure (Belkin & Niyogi, 2003) that was found to be very successful for semisupervised learning. The authors use an approximation of the Laplacian operator, such as the gaussian kernel or the k-nearest-neighbor graph: the symmetric matrix whose element $(i, j)$ is 1 if $x_i$ and $x_j$ are k-nearest-neighbors ($x_i$ is among the k nearest neighbors of $x_j$ or vice versa) and 0 otherwise. Instead of solving an ordinary eigenproblem, the following generalized eigenproblem is solved:

$$(S - \tilde{M})\, y_k = \sigma_k S y_k, \tag{3.5}$$

with eigenvalues $\sigma_k$, eigenvectors $y_k$, and $S$ the diagonal matrix with elements $S_i$ previously defined (row sums of $\tilde{M}$). The smallest eigenvalue is left out, and the eigenvectors corresponding to the other small eigenvalues are used for the embedding. This is actually the same embedding that is computed with the spectral clustering algorithm from Shi and Malik (1997). As noted in Weiss (1999) (normalization lemma 1), an equivalent result (up to a componentwise scaling of the embedding) can be obtained by considering the principal eigenvectors $v_k$ of the normalized matrix $M$ defined in equation 3.3. To fit the common framework for spectral embedding in this article, we have used the latter formulation. Therefore, the same data-dependent kernel can be defined as for spectral clustering, equation 3.4, to generate the matrix $M$; that is, spectral clustering adds a clustering step after a Laplacian eigenmap dimensionality reduction.

3.4 Isomap. Isomap (Tenenbaum et al., 2000) generalizes MDS to nonlinear manifolds. It is based on replacing the Euclidean distance by an empirical approximation of the geodesic distance on the manifold. We define
the geodesic distance $D(\cdot, \cdot)$ with respect to a data set $D$, a distance $d(\cdot, \cdot)$, and a neighborhood $k$ as follows:

$$D(a, b) = \min_{\pi} \sum_{i=1}^{|\pi| - 1} d(\pi_i, \pi_{i+1}), \tag{3.6}$$

where $\pi$ is a sequence of points of length $|\pi| = l \geq 2$ with $\pi_1 = a$, $\pi_l = b$, $\pi_i \in D$ $\forall i \in \{2, \ldots, l-1\}$, and $(\pi_i, \pi_{i+1})$ are k-nearest-neighbors of each other. The length $|\pi| = l$ is free in the minimization. The Isomap algorithm obtains the normalized matrix $M$ from which the embedding is derived by transforming the raw pairwise distances matrix as follows: (1) compute the matrix $\tilde{M}_{ij} = D^2(x_i, x_j)$ of squared geodesic distances with respect to the data $D$, and (2) apply to this matrix the distance-to-dot-product transformation, equation 3.1, as for MDS. As in MDS, the embedding of $x_i$ is $\sqrt{\ell_k}\, v_{ik}$ rather than $v_{ik}$. There are several ways to define a kernel that generates $M$ and also generalizes out-of-sample. The solution we have chosen simply computes the geodesic distances without involving the out-of-sample point(s) along the geodesic distance sequence (except possibly at the extremes). This is automatically achieved with the above definition of the geodesic distance $D$, which uses only the training points to find the shortest path between $a$ and $b$. The double-centering kernel transformation of equation 3.2 can then be applied, using the geodesic distance $D$ instead of the MDS distance $d$. A formula has been proposed (de Silva & Tenenbaum, 2003) to approximate Isomap using only a subset of the examples (the "landmark" points) to compute the eigenvectors. Using the notation of this article, that formula is

$$e_k(x) = \frac{1}{2\sqrt{\ell_k}} \sum_i v_{ik} \left( \hat{E}_{x'}[D^2(x', x_i)] - D^2(x_i, x) \right). \tag{3.7}$$
The formula is applied to obtain an embedding for the nonlandmark examples. One can show (Bengio et al., 2004) that $e_k(x)$ is the Nyström formula when $K_n(x, y)$ is defined as above. Landmark Isomap is thus equivalent to performing Isomap on the landmark points only and then predicting the embedding of the other points using the Nyström formula. It is interesting to note a recent descendant of Isomap and LLE, Hessian eigenmaps (Donoho & Grimes, 2003), which considers the limit case of the continuum of the manifold and replaces the Laplacian in Laplacian eigenmaps by a Hessian.

3.5 Locally Linear Embedding. The LLE algorithm (Roweis & Saul, 2000) looks for an embedding that preserves the local geometry in the neighborhood of each data point. First, a sparse matrix of local predictive weights $W_{ij}$ is computed such that $\sum_j W_{ij} = 1$, $W_{ij} = 0$ if $x_j$ is not a k-nearest-neighbor of $x_i$, and $\|(\sum_j W_{ij} x_j) - x_i\|^2$ is minimized. Then the matrix $\tilde{M} = (I - W)^\top (I - W)$ is formed. The embedding is obtained from the lowest eigenvectors of $\tilde{M}$,
except for the eigenvector with the smallest eigenvalue, which is uninteresting because it is proportional to $(1, 1, \ldots, 1)$ (and its eigenvalue is 0). To select the principal eigenvectors, we define our normalized matrix here by $M = cI - \tilde{M}$ and ignore the top eigenvector (although one could apply an additive normalization to remove the components along the $(1, 1, \ldots, 1)$ direction). The LLE embedding for $x_i$ is thus given by the $v_{ik}$, starting at the second eigenvector (since the principal one is constant). If one insists on having a positive semidefinite matrix $M$, one can take for $c$ the largest eigenvalue of $\tilde{M}$ (note that $c$ changes the eigenvalues only additively and has no influence on the embedding of the training set). In order to find our kernel $K_n$, we first denote by $w(x, x_i)$ the weight of $x_i$ in the reconstruction of any point $x \in \mathbb{R}^d$ by its k nearest neighbors in the training set (if $x = x_j \in D$, then $w(x, x_i) = \delta_{ij}$). Let us first define a kernel $K_n'$ by $K_n'(x_i, x) = K_n'(x, x_i) = w(x, x_i)$ and $K_n'(x, y) = 0$ when neither $x$ nor $y$ is in the training set $D$. Let $K_n''$ be such that $K_n''(x_i, x_j) = W_{ij} + W_{ji} - \sum_k W_{ki} W_{kj}$ and $K_n''(x, y) = 0$ when either $x$ or $y$ is not in $D$. It is then easy to verify that the kernel $K_n = (c - 1)K_n' + K_n''$ is such that $K_n(x_i, x_j) = M_{ij}$ (again, there could be other ways to obtain a data-dependent kernel for LLE that can be applied out-of-sample). Something interesting about this kernel is that when $c \to \infty$, the embedding obtained for a new point $x$ converges to the extension of LLE proposed in Saul and Roweis (2002), as shown in Bengio et al. (2004) (this is the kernel we actually used in the experiments reported here). As noted independently in Ham et al. (2003), LLE can be seen as performing kernel PCA with a particular kernel matrix.
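The identity $K_n(x_i, x_j) = M_{ij}$ on training points is easy to verify numerically. In the sketch below, the random row-stochastic matrix merely stands in for actual LLE reconstruction weights (rows summing to 1, zero diagonal); everything else follows the construction just described:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10

# Stand-in for LLE predictive weights: W_ii = 0 and each row sums to 1.
W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)

Mt = (np.eye(n) - W).T @ (np.eye(n) - W)   # M~ = (I - W)^T (I - W)
c = np.linalg.eigvalsh(Mt).max()           # makes M = cI - M~ positive semidefinite
M = c * np.eye(n) - Mt

# On training points, K'_n(x_i, x_j) = w(x_j, x_i) = delta_ij and
# K''_n(x_i, x_j) = W_ij + W_ji - sum_k W_ki W_kj, so
# K_n = (c - 1) K'_n + K''_n reproduces M entrywise.
K1 = np.eye(n)
K2 = W + W.T - W.T @ W
print(np.allclose((c - 1) * K1 + K2, M))   # True
```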
The identification becomes more accurate when one notes that getting rid of the constant eigenvector (the principal eigenvector of $M$) is equivalent to the centering operation in feature space required for kernel PCA (Ham et al., 2003).

4 Similarity Kernel Eigenfunctions

As noted in Williams and Seeger (2000), the kernel PCA projection formula (equation 2.3) is proportional to the so-called Nyström formula (Baker, 1977; Williams & Seeger, 2000), equation 1.1, which has been used successfully to "predict" the value of an eigenvector on a new data point, in order to speed up kernel methods computations by focusing the heavier computations (the eigendecomposition) on a subset of examples (Williams & Seeger, 2001). The use of this formula can be justified by considering the convergence of eigenvectors and eigenvalues as the number of examples increases (Baker, 1977; Koltchinskii & Giné, 2000; Williams & Seeger, 2000; Shawe-Taylor & Williams, 2003). Here we elaborate on this relation in order to better understand what all these spectral embedding algorithms are actually estimating. If we start from a data set $D$, obtain an embedding for its elements, and add more and more data, the embedding for the points in $D$ converges (for eigenvalues that are unique): Shawe-Taylor and Williams (2003) give bounds on the convergence error (in the case of kernel PCA). Based on these kernel PCA convergence results, we conjecture that in the limit, each eigenvector would converge to an eigenfunction for the linear operator defined below, in the sense that the $i$th element of the $k$th eigenvector converges to the application of the $k$th eigenfunction to $x_i$. In the following, we assume that the (possibly data-dependent) kernel $K_n$ is bounded (i.e., $|K_n(x, y)| < c$ for all $x, y$) and has a discrete spectrum; it can be written as a discrete expansion:

$$K_n(x, y) = \sum_{k=1}^{\infty} \alpha_k \phi_k(x) \phi_k(y).$$
Consider the space $\mathcal{H}_p$ of continuous functions $f$ that are square integrable as follows:

$$\int f^2(x)\, p(x)\,dx < \infty,$$

with the data-generating density function $p(x)$. One must note that we actually work not on functions but on equivalence classes: we say two functions $f$ and $g$ belong to the same equivalence class (with respect to $p$) if and only if $\int (f(x) - g(x))^2 p(x)\,dx = 0$ (if $p$ is strictly positive, then each equivalence class contains only one function). We will assume that $K_n$ converges uniformly in its arguments and in probability to its limit $K$ as $n \to \infty$. This means that for all $\epsilon > 0$,

$$\lim_{n \to \infty} P\left( \sup_{x,y \in \mathbb{R}^d} |K_n(x, y) - K(x, y)| \geq \epsilon \right) = 0.$$
We will associate with each $K_n$ a linear operator $G_n$ and with $K$ a linear operator $G$, such that for any $f \in \mathcal{H}_p$,

$$G_n f = \frac{1}{n} \sum_{i=1}^{n} K_n(\cdot, x_i) f(x_i) \tag{4.1}$$

and

$$G f = \int K(\cdot, y) f(y) p(y)\,dy, \tag{4.2}$$

which makes sense because we work in a space of functions defined everywhere. Furthermore, as $K_n(\cdot, y)$ and $K(\cdot, y)$ are square integrable in the sense defined above, for each $f$ and each $n$, $G_n(f)$ and $G(f)$ are square integrable as well. We will show that the Nyström formula, equation 1.1, gives the eigenfunctions of $G_n$ (see proposition 1), that their value on the training examples corresponds to the spectral embedding, and that they converge to the eigenfunctions of $G$ (see proposition 2) if they converge at all. These results will hold even if $K_n$ has negative eigenvalues.
The eigensystems of interest are thus the following:

$$G f_k = \lambda_k f_k \tag{4.3}$$

and

$$G_n f_{k,n} = \lambda_{k,n} f_{k,n}, \tag{4.4}$$

where $(\lambda_k, f_k)$ and $(\lambda_{k,n}, f_{k,n})$ are the corresponding eigenvalues and eigenfunctions. Note that when equation 4.4 is evaluated only at the $x_i \in D$, the set of equations reduces to the eigensystem $M v_k = n \lambda_{k,n} v_k$. The following proposition gives a more complete characterization of the eigenfunctions of $G_n$, even in the case where eigenvalues may be negative. The next two propositions formalize the link already made in Williams and Seeger (2000) between the Nyström formula and the eigenfunctions of $G$.

Proposition 1. $G_n$ has in its image $m \leq n$ eigenfunctions of the form

$$f_{k,n}(x) = \frac{\sqrt{n}}{\ell_k} \sum_{i=1}^{n} v_{ik} K_n(x, x_i), \tag{4.5}$$

with corresponding nonzero eigenvalues $\lambda_{k,n} = \ell_k / n$, where $v_k = (v_{1k}, \ldots, v_{nk})^\top$ is the $k$th eigenvector of the Gram matrix $M$, associated with the eigenvalue $\ell_k$. For $x_i \in D$, these functions coincide with the corresponding eigenvectors, in the sense that $f_{k,n}(x_i) = \sqrt{n}\, v_{ik}$.

Proof. First, we show that the $f_{k,n}$ coincide with the eigenvectors of $M$ at $x_i \in D$. For $f_{k,n}$ associated with nonzero eigenvalues,

$$f_{k,n}(x_i) = \frac{\sqrt{n}}{\ell_k} \sum_{j=1}^{n} v_{jk} K_n(x_i, x_j) = \frac{\sqrt{n}}{\ell_k} \ell_k v_{ik} = \sqrt{n}\, v_{ik}. \tag{4.6}$$

The $v_k$ being orthonormal, the $f_{k,n}$ (for different values of $k$) are therefore different from each other. Then for any $x$,

$$(G_n f_{k,n})(x) = \frac{1}{n} \sum_{i=1}^{n} K_n(x, x_i) f_{k,n}(x_i) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} K_n(x, x_i) v_{ik} = \frac{\ell_k}{n} f_{k,n}(x), \tag{4.7}$$

which shows that $f_{k,n}$ is an eigenfunction of $G_n$ with eigenvalue $\lambda_{k,n} = \ell_k / n$.
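Proposition 1 is easy to confirm numerically: applying the empirical operator $G_n$ of equation 4.1 to the Nyström eigenfunction of equation 4.5 returns the same function scaled by $\ell_k / n$, even at an out-of-sample point. The kernel and random sample below are arbitrary choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(15, 2))
n = len(X)

def kern(a, b):
    # Any bounded symmetric kernel works; a gaussian is used here.
    return np.exp(-np.sum((a - b) ** 2))

M = np.array([[kern(a, b) for b in X] for a in X])
ell, V = np.linalg.eigh(M)
ell_k, v_k = ell[-1], V[:, -1]   # principal eigenpair of the Gram matrix

def f_kn(x):
    # Nystrom eigenfunction, equation 4.5
    return np.sqrt(n) / ell_k * sum(v_k[i] * kern(x, X[i]) for i in range(n))

def G_n(f, x):
    # Empirical operator, equation 4.1: (G_n f)(x) = (1/n) sum_i K_n(x, x_i) f(x_i)
    return np.mean([kern(x, X[i]) * f(X[i]) for i in range(n)])

# (G_n f_{k,n})(x) = (ell_k / n) f_{k,n}(x) at an arbitrary new point.
x_new = rng.normal(size=2)
print(np.isclose(G_n(f_kn, x_new), ell_k / n * f_kn(x_new)))  # True
```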
The previous result shows that the Nyström formula generalizes the spectral embedding outside the training set. However, there could be many possible generalizations. To justify the use of this particular generalization, the following theorem helps in understanding the convergence of these functions as $n$ increases. We would like the out-of-sample embedding predictions obtained with the Nyström formula to be somehow close to the asymptotic embedding (the embedding one would obtain as $n \to \infty$). Note also that the convergence of eigenvectors to eigenfunctions shown in Baker (1977) applies to data $x_i$ that are deterministically chosen to span a domain, whereas here the $x_i$ form a random sample from an unknown distribution.

Proposition 2. If the data-dependent bounded kernel $K_n$ ($|K_n(x, y)| \leq c$) converges uniformly in its arguments and in probability, with the eigendecomposition of the Gram matrix converging, and if the eigenfunctions $f_{k,n}$ of $G_n$ associated with nonzero eigenvalues converge uniformly in probability, then their limits are the corresponding eigenfunctions of $G$.

Proof. Denote by $f_{k,\infty}$ the nonrandom function such that

$$\sup_x |f_{k,n}(x) - f_{k,\infty}(x)| \to 0 \tag{4.8}$$

in probability. Similarly, let $K$ be the nonrandom kernel such that

$$\sup_{x,y} |K_n(x, y) - K(x, y)| \to 0 \tag{4.9}$$

in probability. Let us start from the Nyström formula and insert $f_{k,\infty}$, taking advantage of theorem 3.1 of Koltchinskii and Giné (2000), which shows that $\lambda_{k,n} \to \lambda_k$ almost surely, where the $\lambda_k$ are the eigenvalues of $G$:
$$f_{k,n}(x) = \frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} f_{k,n}(x_i) K_n(x, x_i) \tag{4.10}$$

$$= \frac{1}{n \lambda_k} \sum_{i=1}^{n} f_{k,\infty}(x_i) K(x, x_i) + \frac{\lambda_k - \lambda_{k,n}}{n \lambda_{k,n} \lambda_k} \sum_{i=1}^{n} f_{k,\infty}(x_i) K(x, x_i) + \frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} f_{k,\infty}(x_i) \left[ K_n(x, x_i) - K(x, x_i) \right] + \frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} K_n(x, x_i) \left[ f_{k,n}(x_i) - f_{k,\infty}(x_i) \right]. \tag{4.11}$$
Below, we will need to have shown that $f_{k,\infty}(x)$ is bounded. For this, we use the assumption that $K_n$ is bounded, independently of $n$: $|K_n(x, y)| < c$. Then

$$|f_{k,n}(x)| = \left| \frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} f_{k,n}(x_i) K_n(x, x_i) \right| \leq \frac{1}{n |\lambda_{k,n}|} \sum_{i=1}^{n} |f_{k,n}(x_i)|\, |K_n(x, x_i)| \leq \frac{c}{n |\lambda_{k,n}|} \sum_{i=1}^{n} |f_{k,n}(x_i)| \leq \frac{c}{n |\lambda_{k,n}|} \sum_{i=1}^{n} \sqrt{n}\, |v_{ik}| \leq \frac{c}{\sqrt{n}\, |\lambda_{k,n}|} \sum_{i=1}^{n} \frac{1}{\sqrt{n}} \leq \frac{c}{|\lambda_{k,n}|},$$

where in the second inequality we used the bound on $K_n$; in the third, equation 4.6; and in the fourth, the fact that the maximum of $\sum_{i=1}^{n} a_i$ subject to $a_i \geq 0$ and $\sum_{i=1}^{n} a_i^2 = 1$ is achieved when $a_i = \frac{1}{\sqrt{n}}$. Finally, using equation 4.8 and the convergence of $\lambda_{k,n}$, we can deduce that $|f_{k,\infty}| \leq c / |\lambda_k|$ and thus is bounded. We now insert $\frac{1}{\lambda_k} \int f_{k,\infty}(y) K(x, y) p(y)\,dy$ in equation 4.11 and obtain the following inequality:

$$\left| f_{k,n}(x) - \frac{1}{\lambda_k} \int f_{k,\infty}(y) K(x, y) p(y)\,dy \right| \leq \left| \frac{1}{n \lambda_k} \sum_{i=1}^{n} f_{k,\infty}(x_i) K(x, x_i) - \frac{1}{\lambda_k} \int f_{k,\infty}(y) K(x, y) p(y)\,dy \right| + \left| \frac{\lambda_k - \lambda_{k,n}}{n \lambda_{k,n} \lambda_k} \sum_{i=1}^{n} f_{k,\infty}(x_i) K(x, x_i) \right| + \left| \frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} f_{k,\infty}(x_i) \left[ K_n(x, x_i) - K(x, x_i) \right] \right| + \left| \frac{1}{n \lambda_{k,n}} \sum_{i=1}^{n} K_n(x, x_i) \left[ f_{k,n}(x_i) - f_{k,\infty}(x_i) \right] \right| \leq A_n + B_n + C_n + D_n. \tag{4.12}$$
From our convergence hypothesis (equations 4.8 and 4.9), the convergence of $\lambda_{k,n}$ to $\lambda_k$ almost surely, and the fact that $f_{k,\infty}$, $K$, and $K_n$ are bounded, it is
clear that the last three terms $B_n$, $C_n$, and $D_n$ converge to 0 in probability. In addition, applying the law of large numbers, the first term $A_n$ also converges to 0 in probability. Therefore,

$$f_{k,n}(x) \to \frac{1}{\lambda_k} \int f_{k,\infty}(y) K(x, y) p(y)\,dy$$

in probability for all $x$. Since we also have $f_{k,n}(x) \to f_{k,\infty}(x)$, we obtain

$$\lambda_k f_{k,\infty}(x) = \int f_{k,\infty}(y) K(x, y) p(y)\,dy,$$

which shows that $f_{k,\infty}$ is an eigenfunction of $G$, with eigenvalue $\lambda_k$; therefore, $f_{k,\infty} = f_k$: the limit of the Nyström function, if it exists, is an eigenfunction of $G$.
Kernel PCA has already been shown to be a stable and convergent algorithm (Shawe-Taylor et al., 2002; Shawe-Taylor & Williams, 2003). These articles characterize the rate of convergence of the projection error on the subspace spanned by the first $m$ eigenvectors of the feature space covariance matrix. When we perform the PCA or kernel PCA projection on an out-of-sample point, we are taking advantage of the above convergence and stability properties in order to trust that a principal eigenvector of the empirical covariance matrix estimates well a corresponding eigenvector of the true covariance matrix. Another justification for applying the Nyström formula outside the training examples is therefore, as already noted earlier and in Williams and Seeger (2000), that in the case where $K_n$ is positive semidefinite, it corresponds to the kernel PCA projection (on a corresponding eigenvector of the feature space correlation matrix $C$). Clearly, we have with the Nyström formula a method to generalize spectral embedding algorithms to out-of-sample examples, whereas the original spectral embedding methods provide only the transformed coordinates of training points (i.e., an embedding of the training points). The experiments described below show empirically the good generalization of this out-of-sample embedding. An interesting justification for estimating the eigenfunctions of $G$ has been shown in Williams and Seeger (2000). When an unknown function $f$ is to be estimated with an approximation $g$ that is a finite linear combination of basis functions, if $f$ is assumed to come from a zero-mean gaussian process prior with covariance $E_f[f(x) f(y)] = K(x, y)$, then the best choices of basis functions, in terms of expected squared error, are (up to rotation/scaling) the leading eigenfunctions of the linear operator $G$ as defined above.
5 Learning Criterion for the Leading Eigenfunctions

Using an expansion into orthonormal bases (e.g., generalized Fourier decomposition in the case where p is continuous), the best approximation of K(x, y) (in the sense of minimizing expected squared error) using only m terms is the expansion that uses the first m eigenfunctions of G (in the order of decreasing eigenvalues):

$$\sum_{k=1}^{m} \lambda_k f_{k,n}(x) f_{k,n}(y) \approx K_n(x, y).$$

This simple observation allows us to define a loss criterion for spectral embedding algorithms, something that was lacking up to now for such algorithms. The limit of this loss converges toward an expected loss whose minimization gives rise to the eigenfunctions of G. One could thus conceivably estimate this generalization error using the average of the loss on a test set. That criterion is simply the reconstruction error:

$$L(x_i, x_j) = \left( K_n(x_i, x_j) - \sum_{k=1}^{m} \lambda_{k,n} f_{k,n}(x_i) f_{k,n}(x_j) \right)^2.$$
Proposition 3. Asymptotically, the spectral embedding for a continuous kernel K with discrete spectrum is the solution of a sequential minimization problem, iteratively minimizing the expected value of the loss criterion L(x_i, x_j). First, with {(f_k, λ_k)}_{k=1}^{m−1} already obtained, one can recursively obtain (λ_m, f_m) by minimizing

$$J_m(\lambda', f') = \int \left( K(x, y) - \lambda' f'(x) f'(y) - \sum_{k=1}^{m-1} \lambda_k f_k(x) f_k(y) \right)^2 p(x)\, p(y)\, dx\, dy, \tag{5.1}$$

where by convention we scale f' such that ∫ f'(x)² p(x) dx = 1 (any other scaling can be transferred into λ'). Second, if the K_n are bounded (independently of n) and the f_{k,n} converge uniformly in probability, with the eigendecomposition of the Gram matrix converging, the Monte Carlo average of this criterion,

$$\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( K_n(x_i, x_j) - \sum_{k=1}^{m} \lambda_{k,n} f_{k,n}(x_i) f_{k,n}(x_j) \right)^2,$$

converges in probability to the above asymptotic expectation.
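On the training points themselves, this Monte Carlo criterion reduces to the error of a rank-m spectral truncation of the Gram matrix (under the paper's scaling conventions, λ_{k,n} f_{k,n}(x_i) f_{k,n}(x_j) = ℓ_k v_{ik} v_{jk}, with (ℓ_k, v_k) the Gram-matrix eigenpairs). The following minimal numerical sketch, assuming a gaussian kernel on synthetic data (not taken from the article), illustrates that the averaged loss is nonincreasing in m and equals the sum of the squared discarded eigenvalues divided by n².

```python
import numpy as np

def empirical_loss(K, m):
    """Average of the loss criterion over all training pairs. On the
    training points, the m-term expansion is the rank-m spectral
    truncation of the Gram matrix K (eigenpairs kept in order of
    decreasing absolute eigenvalue)."""
    eigvals, eigvecs = np.linalg.eigh(K)
    top = np.argsort(np.abs(eigvals))[::-1][:m]
    ell, V = eigvals[top], eigvecs[:, top]
    K_m = (V * ell) @ V.T                  # rank-m reconstruction
    return ((K - K_m) ** 2).mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
losses = [empirical_loss(K, m) for m in range(1, 8)]
# each additional eigenfunction can only decrease the average
# reconstruction error; the residual is sum_{k>m} ell_k^2 / n^2
```

This is the matrix form of the statement in Proposition 3: truncating the spectral expansion after the m leading eigenpairs is exactly the minimizer of the empirical loss on the training sample.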
2212
Y. Bengio et al.
Proof. We prove the first part of the proposition concerning the sequential minimization of the loss criterion, which follows from classical linear algebra (Strang, 1980; Kreyszig, 1990). We proceed by induction, assuming that we have already obtained f_1, ..., f_{m−1}, orthogonal eigenfunctions in order of decreasing absolute value of λ_i. We want to prove that the (λ', f') that minimizes J_m is (λ_m, f_m). Setting ∂J_m/∂λ' = 0 yields

$$\lambda' = \langle f', K f' \rangle - \int \sum_{i=1}^{m-1} \lambda_i f_i(x) f_i(y)\, f'(x) f'(y)\, p(x)\, p(y)\, dx\, dy. \tag{5.2}$$

Thus, we have

$$J_m = J_{m-1} - 2 \int \lambda' f'(x) f'(y) \left( K(x, y) - \sum_{i=1}^{m-1} \lambda_i f_i(x) f_i(y) \right) p(x)\, p(y)\, dx\, dy + \int \left( \lambda' f'(x) f'(y) \right)^2 p(x)\, p(y)\, dx\, dy,$$

which gives J_m = J_{m−1} − λ'², so that λ'² should be maximized in order to minimize J_m. Taking the derivative of J_m with respect to the value of f' at a particular point z (under some regularity conditions that allow bringing the derivative inside the integral) and setting it equal to zero yields the equation

$$\int K(z, y) f'(y)\, p(y)\, dy = \lambda' f'(z) \int f'(y)^2 p(y)\, dy + \sum_{i=1}^{m-1} \int \lambda_i f_i(z) f_i(y) f'(y)\, p(y)\, dy.$$

Using the constraint ||f'||² = ⟨f', f'⟩ = ∫ f'(y)² p(y) dy = 1, we obtain

$$(K f')(z) = \lambda' f'(z) + \sum_{i=1}^{m-1} \lambda_i f_i(z) \langle f', f_i \rangle, \tag{5.3}$$

which rewrites into Kf' = λ'f' + Σ_{i=1}^{m−1} λ_i f_i ⟨f', f_i⟩. Writing Kf' in the basis of all the eigenfunctions, Kf' = Σ_{i=1}^{∞} λ_i f_i ⟨f', f_i⟩, we obtain

$$\lambda' f' = \lambda_m f_m \langle f', f_m \rangle + \sum_{i=m+1}^{\infty} \lambda_i f_i \langle f', f_i \rangle.$$
Since the f_i are orthogonal, take the norm and apply Parseval's theorem:

$$\lambda'^2 = \lambda_m^2 \langle f', f_m \rangle^2 + \sum_{i=m+1}^{\infty} \lambda_i^2 \langle f', f_i \rangle^2.$$
If the eigenvalues are distinct, we have λ_m > λ_i for i > m, and the last expression is maximized when ⟨f', f_m⟩ = 1 and ⟨f', f_i⟩ = 0 for i > m, which proves that f' = f_m is in fact the mth eigenfunction of the kernel K and thereby λ' = λ_m. If the eigenvalues are not distinct, then the result can be generalized in the sense that the choice of eigenfunctions is no longer unique, and the eigenfunctions sharing the same eigenvalue form an orthogonal basis for a subspace. This concludes the proof of the first statement.

To prove the second part (the convergence statement), we want to show that the difference between the average cost and the expected asymptotic cost tends toward 0. If we write

$$\tilde{K}_n(x, y) = \sum_{k=1}^{m} \lambda_{k,n} f_{k,n}(x) f_{k,n}(y) \quad \text{and} \quad \tilde{K}(x, y) = \sum_{k=1}^{m} \lambda_k f_k(x) f_k(y),$$

that difference is

$$\left| \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( K_n(x_i, x_j) - \tilde{K}_n(x_i, x_j) \right)^2 - E_{x,y}\!\left[ \left( K(x, y) - \tilde{K}(x, y) \right)^2 \right] \right|$$
$$\leq \left| \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( K(x_i, x_j) - \tilde{K}(x_i, x_j) \right)^2 - E_{x,y}\!\left[ \left( K(x, y) - \tilde{K}(x, y) \right)^2 \right] \right|$$
$$+ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left| K_n(x_i, x_j) - \tilde{K}_n(x_i, x_j) - K(x_i, x_j) + \tilde{K}(x_i, x_j) \right| \cdot \left| K_n(x_i, x_j) - \tilde{K}_n(x_i, x_j) + K(x_i, x_j) - \tilde{K}(x_i, x_j) \right|.$$

The eigenfunctions and the kernel being bounded, the second factor in the product (in the second term of the inequality) is bounded by a constant B with probability 1 (because of the λ_{k,n} converging almost surely). Thus, we have with probability 1:

$$\left| \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( K_n(x_i, x_j) - \tilde{K}_n(x_i, x_j) \right)^2 - E_{x,y}\!\left[ \left( K(x, y) - \tilde{K}(x, y) \right)^2 \right] \right|$$
$$\leq \left| \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( K(x_i, x_j) - \tilde{K}(x_i, x_j) \right)^2 - E_{x,y}\!\left[ \left( K(x, y) - \tilde{K}(x, y) \right)^2 \right] \right| + \frac{B}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left| K_n(x_i, x_j) - K(x_i, x_j) - \tilde{K}_n(x_i, x_j) + \tilde{K}(x_i, x_j) \right|.$$

But then, with our convergence and bounding assumptions, the second term in the inequality converges to 0 in probability. Furthermore, by the law of large numbers, the first term also tends toward 0 (in probability) as n goes to ∞. We have therefore proved the convergence in probability of the average loss to its asymptotic expectation.

Note that the empirical criterion is indifferent to the value of the solutions f_{k,n} outside the training set. Therefore, although the Nyström formula gives one possible solution to the empirical criterion, there may be others. Remember that the task we consider is that of estimating the eigenfunctions of G, that is, approximating a similarity function K where it matters according to the unknown density p. Solutions other than the Nyström formula might also converge to the eigenfunctions of G. For example, one could use a nonparametric estimator (such as a neural network) to estimate the eigenfunctions. Even if such a solution does not yield the exact eigenvectors on the training examples (i.e., does not yield the lowest possible error on the training set), it might still be a good solution in terms of generalization, in the sense of approximating well the eigenfunctions of G. It would be interesting to investigate whether the Nyström formula achieves the fastest possible rate of convergence to the eigenfunctions of G.

6 Experiments

We want to evaluate whether the precision of the generalizations suggested in the previous sections is comparable to the intrinsic perturbations of the embedding algorithms. The perturbation analysis is performed by replacing some examples with others from the same distribution. For this purpose, we consider splits of the data into three sets, D = F ∪ R1 ∪ R2, and training with either F ∪ R1 or F ∪ R2, comparing the embeddings on F. For each algorithm described in section 3, we apply the following procedure:

1. We choose F ⊂ D with m = |F| samples. The remaining n − m samples in D \ F are split into two equal-size subsets, R1 and R2. We train (obtain the eigenvectors) over F ∪ R1 and over F ∪ R2, and we calculate the Euclidean distance between the aligned embeddings obtained for each xi ∈ F. When eigenvalues are close, the estimated eigenvectors are unstable and can rotate within the subspace they span. Thus, we estimate an alignment (by linear regression) between the two embeddings using the points in F.

2. For each sample xi ∈ F, we also train over {F ∪ R1} \ {xi}. We apply the Nyström formula to out-of-sample points to find the predicted embedding of xi and calculate the Euclidean distance between this embedding and the one obtained when training with F ∪ R1, that is, with xi in the training set (in this case, no alignment is done, since the influence of adding a single point is very limited).

3. We calculate the mean difference (and its standard error, shown in Figure 2) between the distance obtained in step 1 and the one obtained in step 2 for each sample xi ∈ F, and we repeat this experiment for various sizes of F.

The results obtained for MDS, Isomap, spectral clustering, and LLE are shown in Figure 2 for different values of |R1|/n (i.e., the fraction of points exchanged). Experiments are done over a database of 698 synthetic face images described by 4096 components that is available online at http://isomap.stanford.edu. Similar results have been obtained over other databases, such as Ionosphere (http://www.ics.uci.edu/˜mlearn/MLSummary.html) and
Figure 2: δ (training set variability minus out-of-sample error), with respect to ρ (proportion of substituted training samples) on the Faces data set (n = 698), obtained with a two-dimensional embedding. (Top left) MDS. (Top right) Spectral clustering or Laplacian eigenmaps. (Bottom left) Isomap. (Bottom right) LLE. Error bars are 95% confidence intervals. Exchanging about 2% of the training examples has an effect comparable to using the Nyström formula.
swissroll (http://www.cs.toronto.edu/˜roweis/lle/). Each algorithm generates a two-dimensional embedding of the images, following the experiments reported for Isomap. The number of neighbors is 10 for Isomap and LLE, and a gaussian kernel with a standard deviation of 0.01 is used for spectral clustering and Laplacian eigenmaps. The 95% confidence intervals are drawn beside each mean difference of error in the figure. As expected, the mean difference between the two distances increases almost monotonically as the number |R1| of substituted training samples grows, mostly because the training set embedding variability increases. Furthermore, we find that in most cases, the out-of-sample error is less than or comparable to the training set embedding instability when around 2% of the training examples are substituted randomly.

7 Conclusion

Spectral embedding algorithms such as spectral clustering, Isomap, LLE, metric MDS, and Laplacian eigenmaps are very interesting dimensionality-reduction or clustering methods. However, up to now they have lacked a notion of generalization that would allow extending the embedding to out-of-sample points without solving an eigensystem again. This article has shown with various arguments that the well-known Nyström formula can be used for this purpose and that it thus represents the result of a function induction process. These arguments also help us understand that these methods do essentially the same thing, but with respect to different kernels: they estimate the eigenfunctions of a linear operator associated with a kernel and with the underlying distribution of the data. This analysis also shows that these methods minimize an empirical loss and that the solutions toward which they converge are the minimizers of a corresponding expected loss, which thus defines what good generalization should mean for these methods. It shows that these unsupervised learning algorithms can be extended into function induction algorithms.
The Nyström formula is one possible extension, but it does not exclude other extensions that might be better or worse estimators of the eigenfunctions of the asymptotic linear operator G. When the kernels are positive semidefinite, these methods can also be seen immediately as performing kernel PCA. Note that Isomap generally yields a Gram matrix with negative eigenvalues, and users of MDS, spectral clustering, or Laplacian eigenmaps may want to use a kernel that is not guaranteed to be positive semidefinite. The analysis in this article can still be applied in that case, even though the kernel PCA analogy no longer holds. The experiments performed here have shown empirically, on several data sets, that the predicted out-of-sample embedding is generally not far from the one that would be obtained by including the test point in the training set and that the difference is of the same order as the effect of small perturbations of the training set. An interesting parallel can be drawn between the spectral embedding
algorithms and the view of PCA as finding the principal eigenvectors of a matrix obtained from the data. This article parallels for spectral embedding the view of PCA as an estimator of the principal directions of the covariance matrix of the underlying unknown distribution, thus introducing a convenient notion of generalization, relating to an unknown distribution.

Finally, a better understanding of these methods opens the door to new and potentially much more powerful unsupervised learning algorithms. Several directions remain to be explored:

• Using a smoother distribution than the empirical distribution to define the linear operator Gn. Intuitively, a distribution that is closer to the true underlying distribution would have a greater chance of yielding better generalization, in the sense of better estimating the eigenfunctions of G. This relates to putting priors on certain parameters of the density, as in Rosales and Frey (2003).

• All of these methods are capturing salient features of the unknown underlying density. Can one use the representation learned through the estimated eigenfunctions in order to construct a good density estimator? Looking at Figure 1 suggests that modeling the density in the transformed space (right-hand side) should be much easier (e.g., would require far fewer gaussians in a gaussian mixture) than in the original space.

• Learning higher-level abstractions on top of lower-level abstractions by iterating the unsupervised learning process in multiple layers. These transformations discover abstract structures such as clusters and manifolds. It might be possible to learn even more abstract (and less local) structures, starting from these representations.

Acknowledgments

We thank Léon Bottou, Christian Léger, Sam Roweis, Yann Le Cun, and Yves Grandvalet for helpful discussions, the anonymous reviewers for their comments, and the following funding organizations: NSERC, MITACS, IRIS, and the Canada Research Chairs.

References

Baker, C.
(1977). The numerical treatment of integral equations. Oxford: Clarendon Press.

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396.

Bengio, Y., Paiement, J., Vincent, P., Delalleau, O., Le Roux, N., & Ouimet, M. (2004). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Chung, F. (1997). Spectral graph theory. Providence, RI: American Mathematical Society.

Cox, T., & Cox, M. (1994). Multidimensional scaling. London: Chapman & Hall.

de Silva, V., & Tenenbaum, J. (2003). Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 705–712). Cambridge, MA: MIT Press.

Donoho, D., & Grimes, C. (2003). Hessian eigenmaps: New locally linear embedding techniques for high-dimensional data (Tech. Rep. No. 2003-08). Stanford, CA: Department of Statistics, Stanford University.

Gower, J. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika, 55(3), 582–585.

Ham, J., Lee, D., Mika, S., & Schölkopf, B. (2003). A kernel view of the dimensionality reduction of manifolds (Tech. Rep. No. TR-110). Tübingen, Germany: Max Planck Institute for Biological Cybernetics.

Koltchinskii, V., & Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli, 6(1), 113–167.

Kreyszig, E. (1990). Introductory functional analysis with applications. New York: Wiley.

Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.

Rosales, R., & Frey, B. (2003). Learning generative models of affinity matrices. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (pp. 485–492). San Francisco: Morgan Kaufmann.

Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326.

Saul, L., & Roweis, S. (2002). Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4, 119–155.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., Smola, A., & Müller, K.-R. (1996). Nonlinear component analysis as a kernel eigenvalue problem (Tech. Rep. No. 44). Tübingen, Germany: Max Planck Institute for Biological Cybernetics.

Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.

Shawe-Taylor, J., Cristianini, N., & Kandola, J. (2002). On the concentration of spectral properties. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.

Shawe-Taylor, J., & Williams, C. (2003). The stability of kernel principal components analysis and its relation to the process eigenspectrum. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.

Shi, J., & Malik, J. (1997). Normalized cuts and image segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (pp. 731–737). New York: IEEE.
Spielman, D., & Teng, S. (1996). Spectral partitioning works: Planar graphs and finite element meshes. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science. New York: IEEE.

Strang, G. (1980). Linear algebra and its applications. New York: Academic Press.

Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.

Weiss, Y. (1999). Segmentation using eigenvectors: A unifying view. In Proceedings IEEE International Conference on Computer Vision (pp. 975–982). New York: IEEE.

Williams, C. (2001). On a connection between kernel PCA and metric multidimensional scaling. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 675–681). Cambridge, MA: MIT Press.

Williams, C., & Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.

Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 682–688). Cambridge, MA: MIT Press.

Received March 24, 2003; accepted March 28, 2004.
LETTER
Communicated by Dmitri Chklovskii
Predicting Axonal Response to Molecular Gradients with a Computational Model of Filopodial Dynamics Geoffrey J. Goodhill
[email protected] Department of Neuroscience, Georgetown University Medical Center, Washington, D.C. 20007, U.S.A.
Ming Gu
[email protected] Computation and Neural Systems, California Institute of Technology Pasadena, CA 91125, USA
Jeffrey S. Urbach
[email protected] Department of Physics, Georgetown University, Washington, D.C. 20057, U.S.A.
Axons are often guided to their targets in the developing nervous system by attractive or repulsive molecular concentration gradients. We propose a computational model for gradient sensing and directed movement of the growth cone mediated by filopodia. We show that relatively simple mechanisms are sufficient to generate realistic trajectories for both the short-term response of axons to steep gradients and the long-term response of axons to shallow gradients. The model makes testable predictions for axonal response to attractive and repulsive gradients of different concentrations and steepness, the size of the intracellular amplification of the gradient signal, and the differences in intracellular signaling required for repulsive versus attractive turning.

Neural Computation 16, 2221–2243 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

A key method used by developing axons to navigate to appropriate targets in the embryonic nervous system is guidance by molecular gradients (Mueller, 1999; Song & Poo, 2001; Yu & Bargmann, 2001; Dickson, 2002; Huber, Kolodkin, Ginty, & Cloutier, 2003; Guan & Rao, 2003). Both gradient sensing and directed movement are primarily mediated by the growth cone, a highly specialized structure at the axonal tip (Gordon-Weeks, 2000). Growth cones typically consist of a central zone surrounded by web-like veils, the lamellipodia, and highly dynamic spike-like protrusions, the filopodia, which usually have an average lifetime of just a few minutes (Rehder
& Kater, 1996). Growth cones can display turning responses within a few minutes when exposed to a steep gradient in a two-dimensional culture system (e.g., Gundersen & Barrett, 1979; Zheng, Felder, Conner, & Poo, 1994; Zheng, Wan, & Poo, 1996; Ming et al., 1997; Song, Ming, & Poo, 1997; de la Torre, Höpker, Ming, & Poo, 1997; Song et al., 1998; Hong, Hinck, Nishiyama, Poo, & Tessier-Lavigne, 1999; Hong, Nishiyama, Henley, Tessier-Lavigne, & Poo, 2000; Zheng, 2000; Ming et al., 2002; Nishiyama et al., 2003). When exposed to a shallow gradient generated by diffusion from nearby tissue in a three-dimensional culture system, growth cones display quite variable trajectories, but with a tendency to turn smoothly up the gradient (or down it, for repulsive factors) on a timescale of several hours (e.g., Lumsden & Davies, 1983, 1986; Tessier-Lavigne, Placzek, Lumsden, Dodd, & Jessell, 1988; Colamarino & Tessier-Lavigne, 1995; Keynes et al., 1997; Richards, Koester, Tuttle, & O'Leary, 1997; Varela-Echavarria, Tucker, Püschel, & Guthrie, 1997; Brose et al., 1999; Braisted et al., 2000; Caton et al., 2000; Patel et al., 2001; Nguyen Ba-Charvet et al., 2001; Shu & Richards, 2001; Anderson et al., 2003; Charron, Stein, Jeong, McMahon, & Tessier-Lavigne, 2003). Attraction can sometimes be converted to repulsion, and vice versa, by methods including manipulation of the levels of cyclic nucleotides in the growth cone (Song et al., 1997, 1998; Song & Poo, 2001; Nishiyama et al., 2003), growth on alternative substrates (Höpker, Shewan, Tessier-Lavigne, Poo, & Holt, 1999), and electrical activity (Ming, Henley, Tessier-Lavigne, Song, & Poo, 2001). The axonal trajectories generated by repulsion roughly mirror those generated by attraction.
Several lines of evidence suggest that filopodia often make a critical contribution to the sensing and movement capabilities of the growth cone, both in vitro and in vivo (Argiro, Bunge, & Johnson, 1985; O'Connor, Duerr, & Bentley, 1990; Myers & Bastiani, 1993; Davenport, Dou, Rehder, & Kater, 1993; Zheng et al., 1996; Steketee & Tosney, 1999; but see also Wang, Liu, Diefenbach, & Jay, 2003). When the filopodia are eliminated, growth cones cannot navigate their environment and do not respond to either substrate-bound or diffusible guidance cues (Bentley & Toroian-Raymond, 1986; Chien, Rosenthal, Harris, & Holt, 1993). In a steep gradient in vitro, filopodia become asymmetrically distributed toward the source of an attractive factor and away from the source of a repulsive factor on a timescale of a few minutes (Zheng et al., 1996; Zheng, 2000). The general issue of chemotaxis by small sensing devices has been studied extensively in experimental systems such as bacteria, leukocytes, and slime molds (e.g., Devreotes & Zigmond, 1988; Eisenbach, 1996; Parent & Devreotes, 1999). In addition, chemotaxis in these systems has been subjected to a variety of theoretical analyses (Tranquillo, 1990), including predictions of the minimum detectable gradient steepness (Berg & Purcell, 1977), models of the trajectories generated (Tranquillo & Lauffenburger, 1987), and models of the intracellular transduction pathways that mediate the response to gradients (Moghe & Tranquillo, 1995; Barkai & Leibler, 1997). However, such quantitative analyses are much less well developed for growth cones
(Buettner, 1995; Robert & Sweeney, 1997; Hely & Willshaw, 1998; Goodhill, 1998; Goodhill & Urbach, 1999; Meinhardt, 1999). In particular, the theoretical consequences for growth cone gradient sensing of the unique dynamical properties of filopodia remain unexplored, and the precise requirements for generating trajectories similar to those seen experimentally in both attractive and repulsive gradients have not been investigated. Here, we present the first quantitative model of how filopodia mediate both attractive and repulsive gradient sensing and directed movement by growth cones. Our basic hypothesis is that new filopodia tend to be generated in the direction where ligand binding is highest, in the case of attraction, and lowest, in the case of repulsion, and that the growth cone turns toward the average direction of the filopodia. We show that this hypothesis is sufficient to account for the trajectories generated by growth cones in both attractive and repulsive gradients on both short and long timescales. The model is expressed at the level of filopodial dynamics and does not address in detail the sequence of intracellular signaling events leading from receptor binding to the spatially anisotropic polymerization and bundling of actin that actually causes filopodial extension (Song & Poo, 2001; Tanaka & Sabry, 1995; Suter & Forscher, 2000; Steketee, Balazovich, & Tosney, 2001; Zhou, Waterman-Storer, & Cohan, 2002). However, the model makes a number of predictions about the mathematical constraints that these intracellular transformations should satisfy to reproduce the trajectories seen experimentally. In particular, it predicts the degree of amplification required from the binding signal to the turning signal and how the signal transduction pathways mediating attraction may differ from those mediating repulsion. The model also suggests that the response to attractive and repulsive gradients may not be entirely symmetric.
2 The Model

We consider an idealized growth cone consisting of a two-dimensional semicircular body from which several one-dimensional filopodia extend (see Figure 1A). Both the (one-dimensional) surface of the growth cone body and the filopodia are covered with receptors at random locations. The probability pi for receptor i to be bound is given by pi = Ci/(Ci + KD), where Ci is the external ligand concentration at the position of receptor i and KD is the dissociation constant for the receptor-ligand complex. At each time step, the state of each receptor (bound or unbound) is updated by random assignment with this probability. Receptor binding is averaged within a local region (hereafter referred to as a bin), with bins equally spaced around the growth cone. Receptor binding is then divided by the total number of receptors in that bin to give the fraction of bound receptors per bin, b(θ). This binding signal is assumed to establish a shallow internal gradient of an intracellular signaling molecule. This shallow gradient by itself is insufficient to generate the degree of turning shown by real axons in gradients (see
[Figure 1, panels A and B; panel B plots P(filopodium generation), from 0 to 0.15, against the angle around the growth cone, from −50 to 50 degrees.]
Figure 1: The model. (A) Model growth cone. The small circles represent receptors, which are distributed randomly on the surface of both the growth cone and the filopodia. (B) Schematic example of the internal signaling gradient generated after amplification. An attractive external ligand gradient points to the right of the growth cone, and the concentration at the center of the growth cone is KD. Solid line: no amplification (i.e., b(θ)). Dashed line: amplification = 5 (i.e., b^5(θ)). Dash-dotted line: amplification = 10 (i.e., b^10(θ)). Note that the larger the amplification, the higher the probability is that a new filopodium will be generated on the right rather than the left of the growth cone (though there is always a nonzero probability of a filopodium appearing at any angle). For clarity, a fractional change in concentration of 50% across the growth cone is shown; the gradients actually used in the model are much shallower than this.
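The amplification sketched in this caption can be reproduced in a few lines. The following is an illustrative sketch, not the authors' code: it assumes a linear concentration profile across the growth cone (the 50% fractional change used for clarity in the figure) and maps "right" to positive angles; `filopodium_pdf` is a hypothetical name.

```python
import numpy as np

def filopodium_pdf(amp, frac_change=0.5, n_bins=25, KD=1.0):
    """P(new filopodium in bin theta) ~ b(theta)**amp for a growth cone
    whose center sits at concentration KD, in an attractive linear
    gradient pointing toward theta = +90 degrees, with fractional
    concentration change `frac_change` across the growth cone.
    Assumption: a linear gradient and this angle convention, chosen
    to mimic Figure 1B."""
    theta = np.linspace(-np.pi / 2, np.pi / 2, n_bins)
    C = KD * (1.0 + 0.5 * frac_change * np.sin(theta))  # +/- 25% of KD
    b = C / (C + KD)                 # mean fraction of bound receptors
    P = b ** amp                     # amplified internal gradient
    return theta, P / P.sum()        # normalized per-bin probability

theta, p1 = filopodium_pdf(1)        # no amplification (solid line)
_, p10 = filopodium_pdf(10)          # amplification = 10 (dash-dotted)
# amplification concentrates probability on the up-gradient (right)
# side while leaving a nonzero probability at every angle
```

Because b(θ) increases monotonically toward the gradient, raising it to a higher power and renormalizing shifts probability mass toward the up-gradient bins, which is exactly the effect shown by the dashed and dash-dotted curves in the figure.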
section 3). We therefore further assume that there is an amplification step between this gradient and the probability of generating a new filopodium. In reality, this amplification arises from a complex network of intracellular signaling (Parent & Devreotes, 1999; Song & Poo, 2001; Guan & Rao, 2003). The precise details of this network are not yet well understood for growth cones; we simply assume that its ultimate effect is to raise the shallow internal gradient to a power n (see Figure 1B; this is somewhat analogous to the standard assumption in many neural network models that the output of a neuron is simply a sigmoid function of its input (Haykin, 1999)). This amplified gradient then gives the probability P(θ) of generating a new filopodium in each bin, that is, P(θ) ∝ b^n(θ) in the attractive case. For repulsive ligand gradients, an inversion step is required. We consider three forms: P(θ) ∝ (1 − b(θ))^n, P(θ) ∝ 1 − b^n(θ), and P(θ) ∝ b^−n(θ) (see section 5). The bin from which a new filopodium will extend is then chosen from this probability distribution, and a new filopodium is extended from a position chosen from a uniform distribution within this bin. At the same time, the oldest filopodium is retracted, maintaining a constant number of filopodia on the growth cone. The growth cone then moves a constant small distance, mostly in the forward direction but with a slight deviation to one side or the other (a rate of growth that depends on the concentration of factor could
Axonal Response to Gradients
2225
be added as appropriate). This deviation is in the direction defined by the net pull of the filopodia. The entire cycle of receptor binding, internal gradient establishment, amplification, bin selection, filopodia generation, and directed movement is then repeated. We have investigated two versions of the model. In the angle version, the distance the growth cone moves forward at each time step is constant, and the deviation is toward the mean angle of the filopodia. In the force version, the distance moved and deviation are proportional to the force exerted by the filopodia, assuming each pulls equally along its length. The results generated by these two versions of the model are very similar, and we therefore present results only for the simpler angle model (see also section 4). 2.1 Parameters. Unless otherwise stated, the parameter values used to generate the data shown in section 3 were those listed below and in Table 1, which also lists some typical values from the experimental literature. This list represents a small fraction of the quantitative data in the literature, and in some axon guidance contexts, different parameter values would be appropriate. Additional parameter values were as follows. The number of receptors on the growth cone was 3000 in total, with 100 each per filopodium (1500 total on filopodia) and the remaining 1500 distributed on the body of
Table 1: Parameters Used in the Model.

Parameter                             Model Value   Experimental Values                 Axon Type           Substrate    References
Growth cone width (microns)           20            18                                  DRG                 Glass, PLL   Bray & Chapman, 1985
                                                    10                                  Xenopus spinal      Glass        Zheng et al., 1996
                                                    20±2, 18±4                          PC12, SCG           Collagen     Aletta & Greene, 1988
Growth cone speed (microns/hour)      20            18–24                               Xenopus spinal      Glass        Zheng, 2000
                                                    35–45                               Zebrafish retinal   In vivo      Hutson & Chien, 2002
                                                    11–18 (turning), 40 (not turning)   Xenopus spinal      Glass        Zheng et al., 1996
Number of filopodia per growth cone   15            4–6                                 Zebrafish retinal   In vivo      Hutson & Chien, 2002
                                                    6                                   LBD, SND            In vivo      Kim, Kolodziej, & Chiba, 2002
                                                    8–20                                Xenopus spinal      Glass        Zheng et al., 1996
                                                    10–25                               PC12, SCG           Collagen     Aletta & Greene, 1988
Filopodium length (microns)           10            6.9±0.4                             DRG                 Glass, PLL   Bray & Chapman, 1985
                                                    5–15                                Aplysia             PLL          Goldberg & Burmeister, 1986
                                                    3–4                                 Zebrafish retinal   In vivo      Hutson & Chien, 2002
                                                    4–7                                 LBD, SND            In vivo      Kim, Kolodziej, & Chiba, 2002
Filopodium lifetime (min)             7.5           9                                   Xenopus spinal      Glass        Zheng et al., 1996
                                                    6                                   LBD, SND            In vivo      Kim et al., 2002
                                                    7.5                                 Xenopus spinal      Glass        Gomez, Robles, Poo, & Spitzer, 2001

Notes: Parameter values used in the simulations reported here and some representative values from the literature. Note that since the number of filopodia is assumed fixed, the filopodium lifetime determines the size of each time step: for the parameters here, the time step is 30 seconds. DRG, dorsal root ganglion; PLL, poly-L-lysine; SCG, superior cervical ganglion.
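As an illustration, the update cycle described in this section (stochastic receptor binding per bin, amplification to the power n, sampling of the new filopodium, retraction of the oldest, and a small turn toward the filopodia) might be sketched as follows. This is our own minimal reimplementation of the angle version, not the authors' code: the binomial binding noise, the helper names, and the uniform treatment of all bins are assumptions.

```python
import math
import random

# Illustrative parameters from Table 1 and section 2.1 (angle version)
N_BINS = 25            # angle bins across the growth cone
N_FILOPODIA = 15       # constant number of filopodia
RECEPTORS_PER_BIN = 120
AMPLIFICATION = 10     # the exponent n
INERTIA = 0.97         # weight on the previous orientation
STEP = 1.0 / 6.0       # microns per 30 s time step (20 microns/hour)
RADIUS = 10.0          # growth cone radius in microns

def binding_fraction(conc_over_kd):
    """Mean fraction of bound receptors at concentration C (in units of KD)."""
    return conc_over_kd / (1.0 + conc_over_kd)

def step(x, y, phi, filopodia, conc_at, rng=random):
    """One cycle: bind, amplify, pick a bin, swap filopodia, turn and advance."""
    thetas = [-math.pi / 2 + math.pi * (i + 0.5) / N_BINS for i in range(N_BINS)]
    bound = []
    for th in thetas:
        # stochastic (binomial) receptor binding in this bin
        p = binding_fraction(conc_at(x + RADIUS * math.sin(phi + th),
                                     y + RADIUS * math.cos(phi + th)))
        bound.append(sum(rng.random() < p for _ in range(RECEPTORS_PER_BIN)))
    # amplified gradient: P(theta) proportional to b(theta)^n (attractive case)
    weights = [b ** AMPLIFICATION for b in bound]
    total = float(sum(weights)) or 1.0
    u, acc, new_bin = rng.random() * total, 0.0, N_BINS - 1
    for i, w in enumerate(weights):
        acc += w
        if u <= acc:
            new_bin = i
            break
    filopodia.pop(0)                   # retract the oldest filopodium
    filopodia.append(thetas[new_bin])  # extend a new one from the chosen bin
    mean_angle = sum(filopodia) / len(filopodia)
    phi = INERTIA * phi + (1.0 - INERTIA) * (phi + mean_angle)  # slight turn
    return x + STEP * math.sin(phi), y + STEP * math.cos(phi), phi, filopodia
```

In a uniform concentration this produces steady forward growth with a weak random walk in heading; in a gradient, the sampled bins concentrate on the up-gradient side and the heading drifts toward the gradient.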
2226
G. Goodhill, M. Gu, and J. Urbach
the growth cone. The number of angle bins was 25 (an average of 120 receptors per bin). The ligand concentration was equal to KD at the starting point of each growth cone trajectory. The concentrations of diffusible attractants encountered by axons in vivo are generally unknown, but KD is the concentration at which growth cones are expected to be maximally sensitive to gradients (Tranquillo, 1990). The gradient was exponential, with a steepness (expressed as percentage change in ligand concentration across 10 µm) varying from 0.1% to 10%. The new orientation of the growth cone is equal to 0.97 times the previous orientation, plus 0.03 times the average angle of the filopodia. This value controls how quickly the axon reorients in response to changes in the filopodia distribution. Values much lower than 0.97 produced trajectories for long-term responses that were more convoluted than those observed experimentally. The length of each time step was 30 seconds, and the distance moved forward in this time was 1/6 µm (equivalent to 20 µm/hour). At successive time steps, the growth cone makes a statistically independent measurement of the concentration in each bin. For diffusion-limited receptor-binding kinetics, the minimum time between statistically independent concentration measurements can be estimated to be the radius of the sensing device squared divided by the diffusion constant (Berg & Purcell, 1977). For diffusion constants in the range 10^{-6} to 10^{-7} cm^2/sec, this time is between 1 and 10 seconds, which, as required, is less than the time step used in the model. Each individual axon was initialized with a different random seed and a different initial distribution of filopodia. The total time simulated was 2 hours for "short" trajectories and 40 hours for "long" trajectories.

3 Results

Our model represents the main steps that may occur inside the growth cone to convert a shallow external ligand gradient to directed movement (see Figure 1).
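The independence-time estimate quoted above (sensing radius squared over the diffusion constant; Berg & Purcell, 1977) is easy to check directly for a growth-cone-scale sensor:

```python
# Minimum time between statistically independent concentration measurements,
# t = r^2 / D (Berg & Purcell, 1977), for a 10-micron sensing radius
# (half the 20-micron growth cone width used in the model).
r_cm = 10e-4  # 10 microns expressed in cm
for D in (1e-6, 1e-7):  # diffusion constant in cm^2/sec
    t = r_cm ** 2 / D
    print(f"D = {D:.0e} cm^2/s  ->  t = {t:.0f} s")
# prints: D = 1e-06 cm^2/s  ->  t = 1 s
#         D = 1e-07 cm^2/s  ->  t = 10 s
```

Both values fall below the 30-second time step, so treating successive measurements as independent is self-consistent.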
The gradient of receptor binding over the surface of the growth cone produces a difference in the likelihood that filopodia will be generated on one side of the growth cone compared to the other. This difference could arise, for example, from an internal gradient in actin polymerization or bundling triggered by a gradient of an intracellular signaling molecule. In each time step, the growth cone moves forward and turns slightly toward the net direction of the filopodia, due either to the forces generated by filopodial adhesions or the asymmetric effects of the filopodia on microtubule dynamics. Two crucial components in the conversion from the initial shallow gradient to the gradient of probability in filopodium generation are amplification and inversion. Amplification is the process that steepens the initial gradient so that it is effective for guidance (illustrated in Figure 1B). Inversion is required for repulsive ligand gradients: besides being amplified, the direction of the initial shallow gradient must now be reversed so
that filopodia are more likely to be generated on the down-gradient side of the growth cone than the up-gradient side. Growth cones can display a wide variety of behaviors in different contexts. For instance, a single filopodial contact can sometimes cause a complete redirection of the growth cone (O'Connor et al., 1990). Our concern was to model two specific well-characterized behaviors of axons growing in or on a uniform substrate: a rapid turn in response to a steep external ligand gradient and a slow turn in response to a shallow external ligand gradient. Experimental data for the first case come primarily from Poo and colleagues (Zheng et al., 1994, 1996; Ming et al., 1997, 2002; de la Torre et al., 1997; Song et al., 1997, 1998; Hong et al., 1999, 2000; Zheng, 2000; Nishiyama et al., 2003). By slowly ejecting a chemotropic factor from a pipette near a growth cone growing on a two-dimensional substrate covered in fluid medium, gradients of steepness 5% to 10% over 10 µm are established at the growth cone (Zheng et al., 1994). Under appropriate conditions, the axons turn toward (or away from) the pipette on a timescale of the order of 1 hour. Experimental data regarding the long-term response of axons to shallow gradients come from the 3D collagen gel assay (e.g., Lumsden & Davies, 1983, 1986; Tessier-Lavigne et al., 1988; Colamarino & Tessier-Lavigne, 1995; Keynes et al., 1997; Richards et al., 1997; Varela-Echavarria et al., 1997; Brose et al., 1999; Braisted et al., 2000; Caton et al., 2000; Patel et al., 2001; Nguyen Ba-Charvet et al., 2001; Shu & Richards, 2001; Anderson et al., 2003; Charron et al., 2003). Here, a target explant of factor-secreting tissue or a block of, for example, 293T cells transfected with the factor is placed close to an explant containing the neurons under investigation. Under appropriate conditions, axons emerge from the latter explant and grow preferentially toward (or away from) the target on a timescale of the order of 1 day.
This is generally interpreted to mean that the factor has diffused from the target into the collagen, creating a gradient (Goodhill, 1997, 1998), though such gradients have not been directly measured. To investigate the behavior of the model for short trajectories in a steep attractive gradient, we simulated 100 axons, each starting from the origin and initially moving in the positive y direction (see Figure 2). An exponential gradient with a fractional change of either 0% or 5% per 10 µm was present, directed along the x-axis at an angle of 90 degrees to the initial trajectory. Parameter values are given in section 2, and the relationship between filopodial initiation probability P and receptor binding b was given by P ∝ b^n. Figures 2A and 2B show a typical set of trajectories for an amplification factor n = 10 for zero gradient and a 5% gradient, respectively (Zheng et al., 1994, 1996; Ming et al., 1997, 2002; de la Torre et al., 1997; Song et al., 1997, 1998; Hong et al., 1999, 2000; Zheng, 2000). Analysis of the model below shows that any combination of amplification and steepness with the same product (e.g., n = 5, steepness = 10%) would produce approximately the same effect. Figure 2C shows the cumulative distribution of final turning angles for zero gradient, amplification = 10, and for a 5%
Figure 2: Short axon trajectories generated by the model. An attractive gradient of steepness 0% (A) or 5% (B) points to the right. One hundred simulated axons are shown, each starting from a different random seed. Amplification = 10 in both cases. Growth cones themselves are not drawn. (C) Cumulative distribution of final axon turning angles. The left-most curve is for zero gradient; the other three curves moving from left to right are for a 5% gradient with amplification = 1,5,10. (D) Mean x displacement of each population of 100 axons as a function of time. The four curves from bottom to top correspond to the four curves from left to right in C.
gradient, amplification = 1,5,10. An alternative measure of the response, complementing the cumulative distribution, is the mean distance each population of axons has moved in the x direction as a function of time (see Figure 2D). For amplification = 1, only a modest degree of turning up the gradient is seen, and there is a wide spread in trajectories, while for amplification = 10, there is more robust turning and a smaller spread of trajectories. It is thus clear that amplification in the range 5 to 10 in the model is necessary to reproduce the behavior of axons seen experimentally. We also investigated the sensitivity of the model to the number of filopodia used in the range 5 to 25 (data not shown); there were no significant changes in the observed behavior. The model is somewhat sensitive to the total number of receptors on the growth cone. The mean x displacement after 2 hours increased slightly as receptor numbers increased above 3000 and decreased significantly as receptor numbers decreased below 1000 (data not shown).

Figure 3: Statistics of short axon trajectories. (A) Mean x displacement after 2 hours as a function of gradient steepness. Error bars are standard errors of the mean (SEMs) averaged over 100 axons per run. The three curves are amplification = 1,5,10 from bottom to top. (B) Mean x displacement as a function of concentration at the growth cone, for amplification = 1,5,10. Error bars are SEMs averaged over five runs of 100 axons per run. (C) Gradient steepness versus turning time. Solid lines are amplification = 1 to 10 moving from right to left. Dashed line: amplification = 100. Dotted line: theoretical prediction based on thermal noise from Berg and Purcell (1977). Dot-dash line: theoretical prediction based on receptor noise from Tranquillo (1990).

The model produces specific predictions for how the axonal response to gradients varies with gradient steepness and with absolute concentration (Goodhill, 1998; Goodhill & Urbach, 1999). Figure 3A plots the mean x displacement after 2 hours as a function of (attractive) gradient steepness for amplification = 1,5,10. As expected, the mean x displacement increases with increasing gradient steepness. Figure 3B plots x displacement as a function of the concentration at the starting position of the growth cones for amplification = 1,5,10. It can be seen that the response drops off
away from C = Kd . For high concentrations, most receptors are bound most of the time, whereas for low concentrations, hardly any receptors are bound, and in neither case is it possible to detect a small difference in average binding between the two sides of the growth cone. This roughly matches data for leukocyte chemotaxis (Zigmond, 1977), and for growth cones using a new experimental assay we have recently developed (Rosoff et al., 2004). In order to compare the behavior of the model with theoretical predictions regarding gradient detection by small sensing devices, we simulated gradient steepness versus turning time, defined as the first time at which the mean x displacement exceeds the standard deviation of x displacement (see Figure 3C). As expected, the greater the amplification, the shorter is the turning time for a given gradient steepness. Also shown in Figure 3C is the theoretical prediction of Berg and Purcell (1977), as applied to growth cones (Goodhill & Urbach, 1999). This model calculates the unavoidable thermal fluctuations in the local concentration of chemoattractant, and thus the minimum possible detectable gradient steepness. Here, turning time is taken to be the time over which the sensing device averages concentration measurements before making a decision about the direction of the gradient. There are at least three effects that reduce the sensitivity of our model below the theoretical maximum. One effect is the randomness arising from the generation of new filopodia (see section 2), which is included to model intracellular noise. This effect can be effectively eliminated by using an extremely high value of the amplification, such as that used to generate the dashed line (amplification = 100). A second effect is the noise arising from the inherently stochastic receptor binding. Another curve in Figure 3C shows the maximum possible turning time after allowing for receptor noise (Tranquillo, 1990).
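The turning-time statistic just defined (the first time at which the mean x displacement of the population exceeds the standard deviation of x displacement) can be computed from any ensemble of trajectories. The sketch below uses a synthetic drift-plus-noise ensemble as a stand-in for the model's output; the function and parameter names are our own.

```python
import math
import random

def turning_time(x_by_time):
    """First time index at which the mean x displacement of the population
    exceeds the standard deviation of x displacement (Figure 3C definition)."""
    for t, xs in enumerate(x_by_time):
        m = sum(xs) / len(xs)
        sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
        if m > sd:
            return t
    return None

# Stand-in ensemble (not the model itself): 100 axons drifting up the
# gradient with additive noise, one list of positions per time step.
rng = random.Random(1)
n_axons, n_steps, drift, noise = 100, 240, 0.05, 0.5
xs = [0.0] * n_axons
x_by_time = []
for _ in range(n_steps):
    xs = [x + drift + rng.gauss(0.0, noise) for x in xs]
    x_by_time.append(list(xs))
print("turning time (steps):", turning_time(x_by_time))
```

With drift d and noise amplitude s per step, the crossing occurs near t = (s/d)^2 steps, which is why stronger amplification (larger effective drift) shortens the turning time in Figure 3C.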
Finally, at short times, the turning time is limited by the fact that the growth cone can reorient a relatively small amount in each time step (see section 2; without this inertia, the trajectories are highly irregular and do not match those seen in experiments). Besides being attracted by molecular gradients, axons can often also be repelled by these gradients (Song et al., 1997, 1998; Song & Poo, 2001; Nishiyama et al., 2003). We used the same model parameters for repulsion as for attraction, except that now filopodia are preferentially generated on the side of the growth cone facing down the gradient. Figure 4A shows repulsive turning when the probability of generating a new filopodium is given by P(θ) ∝ (1 − b(θ))^n (5% gradient to the right, amplification = 10). Figure 4B shows response as a function of concentration. This curve is similar to that for attractive turning shown in Figure 3B, except reversed: in the repulsive case, sensitivity drops off faster at low concentration than high concentration, whereas the opposite is true for attraction. We also investigated P(θ) ∝ b^{−n}(θ) and P(θ) ∝ (1 − b^n(θ)) (data not shown). The former case produced a response as a function of concentration very similar to the attractive case, while the latter case produced very little turning.

Figure 4: Short trajectories in a repulsive gradient. (A) As before, the gradient increases to the right (steepness = 5%, amplification = 10). (B) The repulsive response as a function of concentration for amplification = 1,5,10. Note the mirror symmetry between B and Figure 3B.

The biological implementation of these different forms for P(θ) is considered in section 5, and their mathematical properties are analyzed below. Although simulating the short-term response of axons in a steep gradient allows comparison with experiments using the pipette assay, more relevant to understanding the behavior of axons in vivo are simulations of the long-term response of axons to shallow gradients, as in 3D collagen gels (Lumsden & Davies, 1983, 1986; Tessier-Lavigne et al., 1988; Colamarino & Tessier-Lavigne, 1995; Keynes et al., 1997; Richards et al., 1997; Varela-Echavarria et al., 1997; Brose et al., 1999; Braisted et al., 2000; Caton et al., 2000; Patel et al., 2001; Nguyen Ba-Charvet et al., 2001; Shu & Richards, 2001; Anderson et al., 2003). Figure 5 shows results of the model after 40 hours of simulated growth from an "explant," giving axons of length 800 µm. One hundred axons were initially distributed uniformly in a disk (explant) of radius 300 µm, with random initial directions. The disks drawn in Figure 5 are added simply to show the extent of the explant; as in the experimental case, the explant obscures the initial trajectories of the axons until they leave it. The trajectories are generated with the gradient increasing in the positive y direction (upward in the figure; the actual gradient steepness present in standard 3D collagen gel experiments is unknown). All other parameters were as for the short trajectories above, with amplification = 10. The explants generated by the model in this case resemble (at least qualitatively) explants seen experimentally (Lumsden & Davies, 1983, 1986; Tessier-Lavigne et al., 1988; Colamarino & Tessier-Lavigne, 1995; Keynes et al., 1997; Richards et al., 1997; Varela-Echavarria et al., 1997; Brose et al., 1999; Braisted et al., 2000; Caton et al., 2000; Patel et al., 2001; Nguyen Ba-Charvet et al., 2001; Shu & Richards, 2001; Anderson et al., 2003).
Figure 6A shows the mean y displacement as a function of the concentration at the center of the explant.
Figure 5: Long axon trajectories from an explant. Amplification = 10, time simulated = 40 hours. Gradient points upward, with the steepness indicated for each row. Each column represents a different random seed (the same within each column).
Again, sensitivity decays faster at high concentrations than low concentrations. Figure 6B shows the analogous result in a repulsive gradient with the same parameters, using the P(θ) ∝ (1 − b(θ))^n form. Again, sensitivity decays faster at low concentrations than high concentrations. Figure 6C shows mean y displacement as a function of gradient steepness.

4 Analysis of the Model

Here we analyze certain aspects of the behavior of our model mathematically: the trajectory of the initial turn up the gradient and the concentration dependence of the sensitivity. 4.1 Calculation of Trajectories. The model is fundamentally stochastic: receptor binding is probabilistic, as is the position at which a new filopodium is generated (biased by the direction of maximum binding). However, for a simplified and deterministic version, a closed-form solution for the initial turn can be derived. Consider the "bare" (no filopodia) growth cone shown in Figure 7A, being attracted by an exponential gradient along the x axis of form C = C0 e^{αx}. (Refer to Table 2 for terminology.) The average binding b(θ) at each position on the growth cone is given by

\[
b(\theta) = \frac{C_0\, e^{\alpha(x + r\sin(\theta+\phi))}}{K_D + C_0\, e^{\alpha(x + r\sin(\theta+\phi))}}.
\]
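This binding fraction is easy to evaluate numerically. In the sketch below (our own function and parameter names), a 1% gradient over 10 µm at C0 = KD produces a binding difference of only about 0.005 across a growth cone of radius 10 µm, illustrating how shallow the raw signal is before amplification:

```python
import math

def binding(theta, x=0.0, phi=0.0, C0=1.0, KD=1.0, alpha=0.001, r=10.0):
    """Fraction of bound receptors at angle theta on the growth cone:
    b(theta) = C0*exp(alpha*(x + r*sin(theta+phi))) / (KD + same)."""
    c = C0 * math.exp(alpha * (x + r * math.sin(theta + phi)))
    return c / (KD + c)

# A 1% gradient over 10 microns corresponds to alpha = 0.001 per micron
# (e^(0.001*10) ~ 1.01); growth cone radius r = 10 microns.
b_up = binding(math.pi / 2)     # up-gradient edge of the growth cone
b_down = binding(-math.pi / 2)  # down-gradient edge
print(f"binding difference across growth cone: {b_up - b_down:.5f}")
# prints: binding difference across growth cone: 0.00500
```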
Figure 6: Analysis of long axon trajectories from an explant. Amplification = 10, time simulated = 40 hours. (A,B) Mean y displacement as a function of concentration at the center of the explant (A = attraction, B = repulsion). The three curves in each case are (from top to bottom) for steepness = 1%, 0.5%, and 0.1%. (C) Mean y displacement as a function of gradient steepness (top line, amplification = 10; bottom line, amplification = 5).
For shallow gradients (small α), b^n(θ) is approximately

\[
b^n(\theta) \approx \left(\frac{\bar{C}e^{\alpha x}}{1+\bar{C}e^{\alpha x}}\right)^{\!n}\left(1 + \frac{n\alpha r}{1+\bar{C}e^{\alpha x}}\sin(\theta+\phi)\right).
\]

Define a normalized and amplified probability distribution:

\[
B(\theta) = \frac{b^n(\theta)}{\int_{-\pi/2}^{+\pi/2} b^n(\psi)\,d\psi}.
\]

Then

\[
B(\theta) = \frac{1 + \frac{n\alpha r}{1+\bar{C}e^{\alpha x}}\sin(\theta+\phi)}{\pi + \frac{2n\alpha r}{1+\bar{C}e^{\alpha x}}\sin\phi}
\approx \frac{1}{\pi}\left(1 - \frac{2n\alpha r\sin\phi}{\pi(\bar{C}e^{\alpha x}+1)} + \frac{n\alpha r\sin(\theta+\phi)}{\bar{C}e^{\alpha x}+1}\right).
\]
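The first-order expression for B(θ) can be checked against the exact normalized distribution by numerical quadrature. The parameter values here are illustrative (n = 10, C̄ = 1, a 1% gradient, r = 10 µm) and the function names are our own:

```python
import math

def b(theta, Cbar=1.0, alpha=0.001, r=10.0, x=0.0, phi=0.0):
    """Exact binding fraction at angle theta on the growth cone."""
    c = Cbar * math.exp(alpha * (x + r * math.sin(theta + phi)))
    return c / (1.0 + c)

def B_exact(theta, n=10, **kw):
    """b^n(theta) normalized over [-pi/2, pi/2] by the trapezoid rule."""
    m = 2000
    thetas = [-math.pi / 2 + math.pi * i / m for i in range(m + 1)]
    vals = [b(t, **kw) ** n for t in thetas]
    integral = sum((vals[i] + vals[i + 1]) / 2 for i in range(m)) * (math.pi / m)
    return b(theta, **kw) ** n / integral

def B_approx(theta, n=10, Cbar=1.0, alpha=0.001, r=10.0, phi=0.0):
    """First-order approximation derived in the text (with x = 0)."""
    g = n * alpha * r / (Cbar + 1.0)
    return (1.0 - 2 * g * math.sin(phi) / math.pi + g * math.sin(theta + phi)) / math.pi

err = max(abs(B_exact(t) - B_approx(t)) for t in (-1.2, -0.5, 0.0, 0.5, 1.2))
print(f"max deviation between exact and first-order B(theta): {err:.2e}")
```

For these values the linearized distribution agrees with the exact one to better than a part in a thousand, confirming that the expansion is adequate for shallow gradients.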
Table 2: Terminology for the Analysis of the Model.

Parameter   Definition
r           Growth cone radius
φ           Angle of growth cone relative to coordinate frame
θ           Angle over the growth cone surface ∈ [−π/2, π/2]
C̄           Ligand concentration normalized by KD (dimensionless)
C0          Ligand concentration at the origin
α           Gradient steepness, where concentration = C0 e^{αx}
b(θ)        Receptor binding as a function of θ
B(θ)        Normalized and amplified binding density
n           Amplification parameter
η           Time step in differential equation for φ
ε           Time step in differential equations for x, y
k           2ηnαr/(π(1 + C̄))
The binding-weighted direction ⟨θ⟩ specified by this distribution is, to first order in α and assuming nα is still small,

\[
\langle\theta\rangle = \int_{-\pi/2}^{+\pi/2} \psi\, B(\psi)\,d\psi \approx \frac{2n\alpha r}{\pi(\bar{C}e^{\alpha x}+1)}\cos\phi. \tag{4.1}
\]

We now regard this as a small increment in the angle φ of the growth cone at each time step, that is, φ̇ ∝ ⟨θ⟩. This leads to the set of coupled differential equations

\[
\dot{\phi} = \eta\,\frac{2n\alpha r}{\pi(\bar{C}e^{\alpha x}+1)}\cos\phi, \qquad \dot{x} = \epsilon\sin\phi, \qquad \dot{y} = \epsilon\cos\phi,
\]

where η and ε are small. These equations are analytically intractable. However, for this deterministic version of the model, it is reasonable to assume that the background concentration does not change significantly over the length of the initial turn. That is, we can assume e^{αx} ≈ 1 for the initial turn. The equation for φ̇ then becomes simply

\[
\dot{\phi} = \eta\,\frac{2n\alpha r}{\pi(\bar{C}+1)}\cos\phi.
\]

Assuming φ(0) = 0, that is, the initial direction of the growth cone is perpendicular to the gradient as in our simulations, this has the solution

\[
\phi = \arcsin\left[\tanh(kt)\right],
\]

where k = 2ηnαr/(π(1 + C̄)). Solving the equations for ẋ and ẏ using this expression yields the following parametric trajectory:

\[
x(t) = \frac{\epsilon}{k}\,\ln(\cosh(kt)), \qquad
y(t) = \frac{2\epsilon}{k}\,\arctan(\tanh(kt/2)).
\]
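This closed-form trajectory can be checked against direct Euler integration of the deterministic equations φ̇ = k cos φ, ẋ = ε sin φ, ẏ = ε cos φ (with e^{αx} ≈ 1 absorbed into k); the values of k and ε below are arbitrary illustrative choices:

```python
import math

k, eps, dt, T = 0.1, 1.0, 1e-4, 10.0

# Euler integration of phi' = k*cos(phi), x' = eps*sin(phi), y' = eps*cos(phi)
phi = x = y = 0.0
t = 0.0
while t < T:
    phi += dt * k * math.cos(phi)
    x += dt * eps * math.sin(phi)
    y += dt * eps * math.cos(phi)
    t += dt

# Closed-form solution derived in the text
x_closed = (eps / k) * math.log(math.cosh(k * T))
y_closed = (2 * eps / k) * math.atan(math.tanh(k * T / 2))
print(f"x: numeric {x:.4f} vs closed form {x_closed:.4f}")
print(f"y: numeric {y:.4f} vs closed form {y_closed:.4f}")
```

The two agree to within the O(dt) integration error, which supports the solution φ = arcsin[tanh(kt)] and the parametric forms for x(t) and y(t).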
Figure 7: Analysis of the model. (A) Schematic of the terminology used in the analysis. (B) Comparison of simulated trajectories (black lines; mean trajectory is solid white line) with the prediction of the analysis (dashed white line).
Figure 7 compares this predicted trajectory for typical parameters with the actual trajectories produced by simulations of the full stochastic model. The solid white line in Figure 7B is the mean of the set of trajectories simulated with the full model, while the dashed white line is the trajectory generated by the closed-form solution above: there is a good match. This analysis reveals that k ∝ nαr/(1 + C̄) is a key determinant of the rate of turning. Any combination of n, α, and r that has the same product will have the same rate of turning. Thus, doubling the gradient steepness is equivalent to doubling the amplification, or doubling the width of the growth cone, and so on. In the force version of the model, the total pull is the vector sum of the forces exerted by the filopodia. The total filopodial force can then be decomposed into a forwards (in the current direction of the growth cone) and turning (orthogonal to the current direction) force. The expected values of these forces are given by ⟨B(θ) cos(θ)⟩ and ⟨B(θ) sin(θ)⟩, respectively. For the turning force, we have

\[
\langle B(\theta)\sin(\theta)\rangle = \int_{-\pi/2}^{\pi/2} B(\psi)\sin(\psi)\,d\psi = \frac{n\alpha r}{2(\bar{C}e^{\alpha x}+1)}\cos\phi.
\]
This is exactly the same as the equivalent equation for the “angle” version of the model (see equation 4.1), except for the replacement of 2/π by 1/2. This demonstrates analytically that the force and angle models generate very similar trajectories. 4.2 Gradient Sensitivity. The concentration dependence of the gradient sensitivity can be understood from a first-order approximation of the concentration dependence of the filopodia initiation probability, B(C), near
the average concentration at the growth cone, C_o. By Taylor expansion,

\[
B(C) \approx B(C_o) + (C - C_o)\left.\frac{dB(C)}{dC}\right|_{C=C_o}
= B(C_o)\left(1 + \frac{(C - C_o)}{C_o}\,\frac{C_o}{B(C_o)}\left.\frac{dB(C)}{dC}\right|_{C=C_o}\right).
\]

We have studied the concentration dependence of the sensitivity for gradients with constant fractional change across the growth cone, so the variation in the quantity (C − C_o)/C_o across the growth cone is comparable at all concentrations. In addition, the normalization condition requires that \(\int_{\theta_{min}}^{\theta_{max}} B(C_o)\,d\theta = 1\), so the dependence on concentration and amplification of the effective gain G_eff is determined by

\[
G_{\mathrm{eff}} \equiv \frac{C_o}{B(C_o)}\left.\frac{dB(C)}{dC}\right|_{C=C_o}.
\]

By the chain rule,

\[
G_{\mathrm{eff}} = \frac{C_o}{B(C_o)}\left.\frac{dB}{db}\right|_{b(C_o)}\left.\frac{db}{dC}\right|_{C_o}.
\]
For brevity, we now drop the subscript from C_o. The last term in the expression above comes from the variation of the receptor binding with concentration and is independent of the form of the amplification:

\[
\frac{db}{dC} = \frac{K_d}{(C + K_d)^2} = b^2\,\frac{K_d}{C^2}.
\]

Note that this function is approximately constant for C ≪ K_d and falls off rapidly when C is larger than K_d, as a consequence of the receptor saturation. For the first form of amplification considered, B(b) = A b^n, where A is the normalization constant,

\[
G_{\mathrm{eff}} = \frac{C}{A b^n}\, nA b^{n-1}\, b^2\,\frac{K_d}{C^2} = \frac{nK_d}{C + K_d}. \tag{4.2}
\]

For C ≪ K_d, G_eff ≈ n, while for C ≫ K_d, G_eff ≈ nK_d/C. Thus, the amplification is effective at low concentrations, although the overall sensitivity drops, presumably because of statistical noise. At high concentrations, the amplification is ineffective because of the receptor saturation mentioned above. For the first type of repulsion, B(b) = A(1 − b)^n, we find

\[
G_{\mathrm{eff}} = -n\,\frac{K_d}{C}\,\frac{b^2}{(1-b)} = -\frac{nC}{C + K_d}.
\]
This looks just like equation 4.2, with C/Kd substituted for Kd /C and a change of sign. This relationship is a consequence of the fact that changing from b to
1 − b is formally equivalent to switching from amplifying bound receptors to amplifying unbound receptors. Finally, for B(b) = A(1 − b^n),

\[
G_{\mathrm{eff}} = -n\,\frac{K_d}{C}\,\frac{b^{(n+1)}}{(1 - b^n)} = -\frac{nK_d}{C + K_d}\,\frac{C^n}{(C + K_d)^n - C^n}.
\]
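These closed-form gains can be checked by differentiating the amplification forms numerically. The function names, finite-difference step, and test concentrations below are our own illustrative choices; the normalization constant A cancels in G_eff and is therefore omitted:

```python
def b_of(C, Kd=1.0):
    """Equilibrium receptor binding b = C / (C + Kd)."""
    return C / (C + Kd)

def gain(Bfunc, C, h=1e-6):
    """Effective gain G_eff = (C / B) dB/dC, by central finite difference."""
    dB = (Bfunc(C + h) - Bfunc(C - h)) / (2.0 * h)
    return C * dB / Bfunc(C)

n, Kd = 10, 1.0
for C in (0.01, 1.0, 100.0):
    g_attr = gain(lambda c: b_of(c) ** n, C)          # B proportional to b^n
    g_rep = gain(lambda c: (1.0 - b_of(c)) ** n, C)   # B proportional to (1 - b)^n
    print(f"C/Kd = {C:6.2f}: attractive gain {g_attr:8.4f} "
          f"(closed form {n * Kd / (C + Kd):8.4f}), repulsive gain {g_rep:8.4f}")
```

The numerical gains reproduce nK_d/(C + K_d) for the attractive form and −nC/(C + K_d) for the first repulsive form, showing the mirror-image concentration dependence seen in Figures 3B and 4B.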
For C ≪ K_d, G_eff ≈ −n(C/K_d)^n, while for C ≫ K_d, G_eff ≈ −1.

5 Discussion

Our model demonstrates that both the short-term response of axons to steep gradients and the long-term response of axons to shallow gradients can be reproduced from a set of relatively simple assumptions. The model makes specific and quantitative predictions for the amount and type of amplification of the binding signal required, the way the response varies with gradient steepness and with ligand concentration, how the turning in attractive gradients may differ from turning in repulsive gradients, and how intracellular signaling may vary between these two cases. The model also captures surprisingly well the degree of variability of axonal responses seen experimentally. While the majority of axons appear to be influenced by the gradient, some do not, even in the case of long trajectories. In experimental assays, a heterogeneous response is usually attributed to the presence of a heterogeneous population of axons. An alternative explanation suggested by our simulation data is that, at least in some cases, the stochastic nature of filopodium generation sometimes directs axons on an apparently unresponsive trajectory. How might the sequence of events that leads to growth cone turning in the model be implemented biologically? As long as the products of the signaling cascade (Mueller, 1999; Song & Poo, 2001; Yu & Bargmann, 2001; Guan & Rao, 2003) do not diffuse too rapidly, a shallow internal gradient (e.g., of G-protein signaling) will follow directly from the gradient in receptor binding. Amplification of this gradient can come from the cooperativity of receptor binding and from autocatalytic behavior, for instance, involving Ca2+ (Zheng, 2000; Gomez & Spitzer, 2000; Hong et al., 2000; Gomez, Robles, Poo, & Spitzer, 2001), and activator-inhibitor dynamics (Meinhardt, 1999; Song & Poo, 2001).
A gradient of activation of small GTPases of the Rho family (Hall, 1998), perhaps acting via Cdc42, could then produce a gradient of actin polymerization, with new filopodia more likely to sprout where the polymerization is enhanced. The positions of the filopodia are coupled to the direction of growth cone advance through adhesions between the filopodia and the substrate, and coupling between retrograde F-actin flow and microtubule extension (Suter & Forscher, 2000; Zhou et al., 2002). Our model assumes that the overall rate at which new filopodia form is independent of the receptor binding (see Zheng, 2000). Since the outputs
of the signaling cascade enhance or depress actin polymerization locally, some other process must be utilized to keep the overall level of polymerization constant. Such global effects could come from competition for limited resources for actin polymerization or from activator-inhibitor dynamics. Experimental data suggest that the same pathways may be involved in both attractive and repulsive responses (Song & Poo, 2001; Yu & Bargmann, 2001). The most straightforward way to generate a repulsive response in our model is to effectively switch the roles of bound and unbound receptors by replacing P(b) ∝ b^n with P(b) ∝ (1 − b)^n. A possibly more realistic way of implementing this involves the silencing of an attractive response by heteromeric receptor complexes, such as the interaction of DCC and UNC-5 in netrin signaling (Yu & Bargmann, 2001), where the attractive function of DCC is silenced by interaction with the UNC-5/netrin complex. If an attractive signaling complex (e.g., netrin/DCC) is saturated, the amount of intracellular signaling will decrease as the proportion of silencing receptor-ligand complexes (UNC-5/netrin) increases. Alternatively, the products of the amplified cascade could depress rather than enhance polymerization activity. As shown in section 4.2, the chemotactic response is primarily determined by the first derivative of the amplification function, and thus the form P(b) ∝ b^{−n} produces a repulsive response approximately equal and opposite to the attractive response produced by b^n. This could arise from a Ca2+ set point (Gomez & Spitzer, 2000; Petersen & Cancela, 2000) for optimum axon outgrowth: the polymerization gradient is in the same direction as the Ca2+ gradient when the Ca2+ level is below the set point, and in the opposite direction when the level is above the set point (see Zheng, 2000; Hong et al., 2000).
Experimental measurements of the concentration dependence of attractive and repulsive chemotactic sensitivities should be able to distinguish between the different types of attractive and repulsive amplification considered here. This could be done with both the pipette assay and the collagen gel assay we have recently developed (Rosoff et al., 2004). Given the number of systems where the switch has been observed (Song & Poo, 2001; Yu & Bargmann, 2001; Guan & Rao, 2003), it is quite possible that more than one type of behavior will be observed. We have chosen mathematical models for intracellular amplification that are relatively simple and realistic. As the signal transduction pathways become more fully elucidated, it will be possible to directly calculate from chemical kinetics the transformation between receptor binding and filopodia initiation probability. This transformation will undoubtedly depend on a variety of factors, but is likely to share the basic characteristics of one or more of the general forms we have considered here.

Acknowledgments

This work was funded by grants from the NIH, NSF, and Whitaker Foundation.
References

Aletta, J. M., & Greene, L. A. (1988). Growth cone configuration and advance: A time-lapse study using video-enhanced differential interference contrast microscopy. J. Neurosci., 8, 1425–1435.
Anderson, C. N., Ohta, K., Quick, M. M., Fleming, A., Keynes, R., & Tannahill, D. (2003). Molecular analysis of axon repulsion by the notochord. Development, 130, 1123–1133.
Argiro, V., Bunge, M. B., & Johnson, M. I. (1985). A quantitative study of growth cone filopodial extension. Journal of Neuroscience Research, 13, 149–162.
Barkai, N., & Leibler, S. (1997). Robustness in simple biochemical networks. Nature, 387, 913–917.
Bentley, D., & Toroian-Raymond, A. (1986). Disoriented pathfinding by pioneer neuron growth cones deprived of filopodia by cytochalasin treatment. Nature, 323, 712–715.
Berg, H. C., & Purcell, E. M. (1977). Physics of chemoreception. Biophysical Journal, 20, 193–219.
Braisted, J. E., Catalano, S. M., Stimac, R., Kennedy, T. E., Tessier-Lavigne, M., Shatz, C. J., & O’Leary, D. D. M. (2000). Netrin-1 promotes thalamic axon growth and is required for proper development of the thalamocortical projection. J. Neurosci., 20, 5792–5801.
Bray, D. B., & Chapman, K. (1985). Analysis of microspike movements on the neuronal growth cone. J. Neurosci., 5, 3204–3213.
Brose, K., Bland, K. S., Wang, K. H., Arnott, D., Henzel, W., Goodman, C. S., Tessier-Lavigne, M., & Kidd, T. (1999). Slit proteins bind robo receptors and have an evolutionarily conserved role in repulsive axon guidance. Cell, 96, 795–806.
Buettner, H. M. (1995). Computer simulation of nerve growth cone filopodial dynamics for visualization and analysis. Cell Motil. Cytoskeleton, 32, 187–204.
Caton, A., Hacker, A., Naeem, A., Livet, J., Maina, F., Bladt, F., Klein, R., Birchmeier, C., & Guthrie, S. (2000). The branchial arches and HGF are growth-promoting and chemoattractant for cranial motor axons. Development, 127, 1751–1766.
Charron, F., Stein, E., Jeong, J., McMahon, A. P., & Tessier-Lavigne, M. (2003). The morphogen sonic hedgehog is an axonal chemoattractant that collaborates with netrin-1 in midline axon guidance. Cell, 113, 11–23.
Chien, C. B., Rosenthal, D. E., Harris, W. A., & Holt, C. E. (1993). Navigational errors made by growth cones without filopodia in the embryonic Xenopus brain. Neuron, 11, 237–251.
Colamarino, S. A., & Tessier-Lavigne, M. (1995). The axonal chemoattractant netrin-1 is also a chemorepellent for trochlear motor axons. Cell, 81, 621–629.
Davenport, R. W., Dou, P., Rehder, V., & Kater, S. B. (1993). A sensory role for neuronal growth cone filopodia. Nature, 361, 721–724.
de la Torre, J. R., Höpker, V. H., Ming, G.-l., & Poo, M.-M. (1997). Turning of retinal growth cones in a netrin-1 gradient mediated by the netrin receptor DCC. Neuron, 19, 1211–1224.
2240
G. Goodhill, M. Gu, and J. Urbach
Devreotes, P. N., & Zigmond, S. H. (1988). Chemotaxis in eukaryotic cells: A focus on leukocytes and Dictyostelium. Ann. Rev. Cell. Biol., 4, 649–686. Dickson, B. J. (2002). Molecular mechanisms of axon guidance. Science, 298, 1959–1964. Eisenbach, M. (1996). Control of bacterial chemotaxis. Molecular Microbiology, 20, 903–910. Goldberg, D. J., & Burmeister, D. W. (1986). Stages in axon formation: Observations of growth of Aplysia axons in culture using video-enhanced contrastdifferential interference contrast microscopy. J. Cell. Biol., 103, 1921–1931. Gomez, T. M., Robles, E., Poo, M-M., & Spitzer, N. C. (2001). Filopodial calcium transients promote substrate-dependent growth cone turning. Science, 291, 1983–1987. Gomez, T. M., & Spitzer, N. C. (2000). Regulation of growth cone behavior by calcium: New dynamics to earlier perspectives. J. Neurobiol., 44, 174–183. Goodhill, G. J. (1997). Diffusion in axon guidance. European Journal of Neuroscience, 9, 1414–1421. Goodhill, G. J. (1998). Mathematical guidance for axons. Trends in Neurosciences, 21, 226–231. Goodhill, G. J., & Urbach, J. S. (1999). Theoretical analysis of gradient detection by growth cones. Journal of Neurobiology, 41, 230–241. Gordon-Weeks, P. R. (2000). Neuronal growth cones. Cambridge: Cambridge University Press. Guan, K. L., &, Rao Y. (2003). Signalling mechanisms mediating neuronal responses to guidance cues. Nature Rev. Neurosci., 4, 941–956. Gundersen, R. W., & Barrett, J. N. (1979). Neuronal chemotaxis: Chick dorsalroot axons turn toward high concentrations of nerve growth factor. Science, 206, 1079–1080. Hall, A. (1998). Rho GTPases and the actin cytoskeleton. Science, 279, 509–514. Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall. Hely, T. A., & Willshaw, D. J. (1998). Short term interactions between microtubules and actin filaments underlie long term behaviour in neuronal growth cones. Proc. Roy. Soc. B, 265, 1801–1807. 
Hong, K., Hinck, L., Nishiyama, M., Poo, M-M., & Tessier-Lavigne, M. (1999). A ligand-gated association between cytoplasmic domains of UNC5 and DCC family receptors converts netrin-induced growth cone attraction to repulsion. Cell, 92, 205–215. Hong, K., Nishiyama, M., Henley, J., Tessier-Lavigne, M., & Poo, M-M. (2000). Calcium signalling in the guidance of nerve growth by netrin-1. Nature, 403, 93–98. Hopker, ¨ V. H., Shewan, D., Tessier-Lavigne, M., Poo, M-M., & Holt, C. (1999). Growth-cone attraction to netrin-1 is converted to repulsion by laminin-1. Nature, 401, 69–73. Huber, A. B., Kolodkin, A. L., Ginty, D. D., & Cloutier, J. F. (2003). Signaling at the growth cone: Ligand-receptor complexes and the control of axon growth and guidance. Annu. Rev. Neurosci., 26, 509–563. Hutson, L. D., & Chien, C.-B. (2002). Pathfinding and error correction by retinal axons: The role of astray/robo2. Neuron, 33, 205–217.
Axonal Response to Gradients
2241
Keynes, R., Tannahill, D., Morgenstern, D. A., Johnson, A. R., Cook, G. M. W., & Pini, A. (1997). Surround repulsion of spinal sensory axons in higher vertebrate embryos. Neuron, 18, 889–897. Kim, M. D., Kolodziej, P., & Chiba, A. (2002). Growth cone pathfinding and filopodial dynamics are mediated separately by Cdc42 activation. J. Neurosci., 22, 1794–1806. Lumsden, A. G. S., & Davies, A. M. (1983). Earliest sensory nerve fibres are guided to peripheral targets by attractants other than nerve growth factor. Nature, 306, 786–788. Lumsden, A. G. S. and Davies, A. M. (1986). Chemotropic effect of specific target epithelium in the developing mammalian nervous system. Nature, 323, 538– 539. Meinhardt, H. (1999). Orientation of chemotactic cells and growth cones: Models and mechanisms. Journal of Cell Science, 112, 2867–2874. Ming, G., Henley, J., Tessier-Lavigne, M., Song, H., & Poo, M. (2001). Electrical activity modulates growth cone guidance by diffusible factors. Neuron, 29, 441–452. Ming, G-I., Song, H-J., Berninger, B., Holt, C. E., Tessier-Lavigne, M., & Poo, M-M. (1997). cAMP-dependent growth cone guidance by netrin-1. Neuron, 19, 1225–1235. Ming, G., Wong, S. T., Henley, J., Yuan, X., Song, H., Spitzer, N., & Poo, M. (2002). Adaptation in the chemotactic guidance of nerve growth cones. Nature, 417, 411–418. Moghe, P. V., & Tranquillo, R. T. (1995). Stochasticity in membrane-localized ligand-receptor G-protein binding—consequences for leukocyte movement behavior. Ann. Biomed. Eng., 23, 257–267. Mueller, B. K. (1999). Growth cone guidance: First steps towards a deeper understanding. Annu. Rev. Neurosci., 22, 351–388. Myers, P. Z., & Bastiani, M. J. (1993). Growth cone dynamics during the migration of an identified commissural growth cone. J. Neurosci., 13, 127–143. Nguyen Ba-Charvet, K. T., Brose, K., Ma, L., Wang, K. H., Marillat, V., Sotelo, C., Tessier-Lavigne, M., & Chedotal, A. (2001). 
Diversity and specificity of actions of Slit2 proteolytic fragments in axon guidance. J. Neurosci., 21, 4281– 4289. Nishiyama, M., Hoshino, A., Tsai, L., Henley, J. R., Goshima, Y., Tessier-Lavigne, M., Poo, M. M., & Hong K. (2003). Cyclic AMP/GMP-dependent modulation of Ca2+ channels sets the polarity of nerve growth-cone turning. Nature, 424. 990–995. O’Connor, T. P., Duerr, J. S., & Bentley, D. (1990). Pioneer growth cone steering decisions mediated by single filopodial contacts in situ. J. Neurosci., 10, 3935– 3946. Parent, C. A., & Devreotes, P. N. (1999). A cell’s sense of direction. Science, 284, 765–770. Patel, K., Nash, J. A., Itoh, A., Liu, Z., Sundaresan, V., & Pini, A. (2001). Slit proteins are not dominant chemorepellents for olfactory tract and spinal motor axons. Development, 128, 5031–5037. Petersen, O. H., & Cancela, J. M. (2000). Attraction or repulsion by local Ca(2+) signals. Curr. Biol., 10, R311–314.
2242
G. Goodhill, M. Gu, and J. Urbach
Rehder, V., & Kater, S. B. (1996). Filopodia on neuronal growth cones: Multifunctional structures with sensory and motor capabilities. Seminars in the Neurosciences, 8, 81–88. Richards, L. J., Koester, S. E., Tuttle, R., & O’Leary, D. D. M. (1997). Directed growth of early cortical axons is influenced by a chemoattractant released from an intermediate target. J. Neurosci., 17, 2445–2458. Robert, M. E., & Sweeney, J. D. (1997). Computer model: Investigating role of filopodia-based steering in experimental neurite galvanotropism. J. Theor. Biol., 188, 277–288. Rosoff, W. J., Urbach, J. S., Esrick, M., McAllister, R. G., Richards, L. J., & Goodhill, G. J. (2004). A novel chemotaxis assay reveals the extreme sensitivity of axons to molecular gradients. Nat. Neurosci., 7, 678–682. Shu, T., & Richards, L. J. (2001) Cortical axon guidance by the glial wedge during the development of the corpus callosum. J. Neurosci., 21, 2749–2758. Song, H., Ming, G., He, Z., Lehmann, M., Tessier-Lavigne, M., & Poo, M-M. (1998). Conversion of neuronal growth cone responses from repulsion to attraction by cyclic nucleotides. Science, 281, 1515–1518. Song, H. J., Ming, G. L., & Poo, M-M. (1997). cAMP-induced switching in turning direction of nerve growth cones Nature, 388, 275–279. Song, H., & Poo, M-M. (2001). The cell biology of neuronal navigation. Nat. Cell. Biol., 3, E81–88. Steketee, M., Balazovich, K., & Tosney, K. W. (2001). Filopodial initiation and a novel filament-organizing center, the focal ring. Mol. Biol. Cell., 12, 2378–2395. Steketee, M. B., & Tosney, K. W. (1999). Contact with isolated sclerotome cells steers sensory growth cones by altering distinct elements of extension. J. Neurosci., 19, 3495–3506. Suter, D. M., & Forscher, P. (2000). Substrate-cytoskeletal coupling as a mechanism for the regulation of growth cone motility and guidance. J. Neurobiol., 44, 97–113. Tanaka, E., & Sabry, J. (1995). 
Making the connection: Cytoskeletal rearrangements during growth cone guidance. Cell, 83, 171–176. Tessier-Lavigne, M., Placzek, M., Lumsden, A. G. S., Dodd, J., & Jessell, T. M. (1988). Chemotropic guidance of developing axons in the mammalian central nervous system. Nature, 336, 775–778. Tranquillo, R. T. (1990). Models of chemical gradient sensing by cells. In W. Alt & G. Hoffman (Eds.), Biological motion, New York: Springer-Verlag. (pp. 415– 441). Tranquillo, R. T., & Lauffenburger, D. A. (1987). Stochastic-model of leukocyte chemosensory movement. J. Math. Bio., 25, 229–262. Varela-Echavarria, A., Tucker, A., Puschel, ¨ A. W., & Guthrie, S. (1997). Motor axon subpopulations respond differentially to the chemorepellents netrin-1 and semaphorin D. Neuron, 18, 193–207. Wang, F. S., Liu, C. W., Diefenbach, T. J., & Jay, D. G. (2003). Modeling the role of myosin 1c in neuronal growth cone turning. Biophys. J., 85, 3319–3328. Yu, T. W., & Bargmann, C. I. (2001). Dynamic regulation of axon guidance. Nat. Neurosci., 4 (Suppl). 1169–1176. Zheng, J. Q. (2000). Turning of nerve growth cones induced by localized increases in intracellular calcium ions. Nature, 403, 89–93.
Axonal Response to Gradients
2243
Zheng, J. Q., Felder, M., Conner, J. A., & Poo, M-M. (1994). Turning of growth cones induced by neurotransmitters. Nature, 368, 140–144. Zheng, J. Q., Wan, J-J., & Poo, M-M. (1996). Essential role of filopodia in chemotropic turning of nerve growth cone induced by a glutamate gradient. J. Neurosci., 16, 1140–1149. Zhou, F. Q., Waterman-Storer, C. M., & Cohan, C. S. (2002). Focal loss of actin bundles causes microtubule redistribution and growth cone turning. J. Cell Biol., 157, 839–849. Zigmond, S. H. (1977). Ability of polymorphonuclear leukocytes to orient in gradients of chemotactic factors. J. Cell. Biol., 75, 606–616. Received February 4, 2004; accepted April 29, 2004.
LETTER
Communicated by Michael Dickinson
Insect-Inspired Estimation of Egomotion Matthias O. Franz
[email protected] Max-Planck-Institut für biologische Kybernetik, Tübingen, Germany
Javaan S. Chahl
[email protected] Center of Visual Sciences, Research School of Biological Sciences, Australian National University, Canberra, Australia
Holger G. Krapp
[email protected] Department of Zoology, University of Cambridge, Cambridge, U.K.
Tangential neurons in the fly brain are sensitive to the typical optic flow patterns generated during egomotion. In this study, we examine whether a simplified linear model based on the organization principles in tangential neurons can be used to estimate egomotion from the optic flow. We present a theory for the construction of an estimator consisting of a linear combination of optic flow vectors that incorporates prior knowledge about the distance distribution of the environment and about the noise and egomotion statistics of the sensor. The estimator is tested on a gantry carrying an omnidirectional vision sensor. The experiments show that the proposed approach leads to accurate and robust estimates of rotation rates, whereas translation estimates are of reasonable quality, albeit less reliable.

1 Introduction

A moving visual system generates a characteristic pattern of image motion on its sensors. The resulting optic flow field is an important source of information about the egomotion of the visual system (Gibson, 1950). In the fly brain, part of this information is analyzed by a group of wide-field, motion-sensitive neurons, the tangential neurons in the lobula plate (Hausen, 1993; Egelhaaf et al., 2002). A detailed mapping of their local preferred directions and motion sensitivities (Krapp & Hengstenberg, 1996) reveals a striking similarity to certain egomotion-induced optic flow fields (see Figure 1). This suggests that each tangential neuron extracts a specific egomotion component from the optic flow that may be useful for gaze stabilization and flight steering.

Neural Computation 16, 2245–2260 (2004) © 2004 Massachusetts Institute of Technology
Figure 1: Mercator map of the response field of the neuron VS10 (azimuth vs. elevation, in degrees). The orientation of each arrow gives the local preferred direction, and its length denotes the relative local motion sensitivity. The results suggest that VS10 responds maximally to rotation around an axis at an azimuth of about 50 degrees and an elevation of about 0 degrees (after Krapp, Hengstenberg, & Hengstenberg, 1998).
A recent study (Franz & Krapp, 2000) has shown that a simplified computational model of the tangential neurons as a weighted sum of flow measurements was able to explain certain properties of the observed response fields. The weights were chosen according to an optimality principle that minimizes the output variance of the model caused by noise and distance variability between different scenes. In that study, however, we mainly focused on a comparison between the sensitivity distribution in tangential neurons and the weight distribution of such optic flow processing units. Here, we present a classical linear estimation approach that extends the previous model to the complete egomotion problem. We again use linear combinations of local flow measurements, but instead of prescribing a fixed motion axis and minimizing the output variance, we minimize the quadratic error in the estimated egomotion parameters. The derived weight sets for the single-model neurons are identical to those obtained from one of the model variants discussed in Franz and Krapp (2000). Of primary interest for this article is that the new approach yields a novel, extremely simple estimator for egomotion that consists of a linear combination of model neurons. Our experiments indicate that this insect-inspired estimator shows—in spite of its simplicity—an astonishing performance that often comes close to the more elaborate approaches of classical computer vision.
Figure 2: (A) Sensor model. At each viewing direction di, there are two measurements xi and yi of the optic flow pi along two directions ui and vi on the unit sphere. (B) Simplified model of a tangential neuron. The optic flow and the local noise signal are projected onto a unit vector field of local preferred directions (LPDs). The projections are weighted (local motion sensitivities, LMSs) and linearly integrated. The model assumes that the integrated output encodes an egomotion component defined by either a translation or a rotation axis.
This article is structured as follows. In section 2, we describe the derivation of the egomotion estimator from a least-squares principle. In section 3, we subject the obtained model to a rigorous real-world test on a gantry carrying an omnidirectional vision sensor. The evidence and the properties of such a neural representation of egomotion are discussed in section 4. A preliminary account of our work has appeared in Franz and Chahl (2003).

2 Optimal Linear Estimators for Egomotion

2.1 Egomotion Sensor and Neural Model. In order to simplify the mathematical treatment, we assume that the N motion detectors of our egomotion sensor are arranged on the unit sphere. The viewing direction of the inputs to a particular motion detector with index i is denoted by the radial unit vector di. At each viewing direction, we define a local two-dimensional coordinate system on the sphere consisting of two orthogonal tangential unit vectors ui and vi (see Figure 2A).¹ We assume that we measure the local flow component along both unit vectors subject to additive noise. Formally, this means that we obtain at each viewing direction two measurements xi and yi along ui and vi, respectively, given by

$$x_i = \mathbf{p}_i \cdot \mathbf{u}_i + n_{x,i} \quad \text{and} \quad y_i = \mathbf{p}_i \cdot \mathbf{v}_i + n_{y,i}, \tag{2.1}$$
where nx,i and ny,i denote additive noise components with a given covariance Cn, and pi the local optic flow vector. When the spherical sensor translates with T while rotating with R about an axis through the origin, the egomotion-induced image flow pi at di is

$$\mathbf{p}_i = -\mu_i \left( \mathbf{T} - (\mathbf{T} \cdot \mathbf{d}_i)\,\mathbf{d}_i \right) - \mathbf{R} \times \mathbf{d}_i. \tag{2.2}$$

¹ For mathematical convenience, we do not take into account the hexagonal arrangement of the optical axes of the photoreceptors within the fly compound eye.
µi is the inverse distance between the origin and the object seen in direction di, the so-called nearness (Koenderink & van Doorn, 1987). The entire collection of flow measurements xi and yi comprises the input to a simplified neural model consisting of a weighted sum of all local measurements (see Figure 2B),

$$\hat{\theta} = \sum_{i=1}^{N} w_{x,i}\, x_i + \sum_{i=1}^{N} w_{y,i}\, y_i, \tag{2.3}$$
with local weights wx,i and wy,i. In this model, the local motion sensitivity is defined as wi = (wx,i, wy,i)^T, and the local preferred direction is parallel to the unit vector (1/||wi||)(wx,i, wy,i)^T. The resulting local motion sensitivities and local preferred directions can be compared to measurements on real tangential neurons (Franz & Krapp, 2000). As our basic hypothesis, we assume that the output of such neural models is used to estimate an egomotion component of the sensor. Since the output is a scalar, in the simplest case we need an ensemble of six neural models to encode all six rotational and translational degrees of freedom. To keep the mathematical treatment simple, we assume that the motion axes of interest are aligned with the global coordinate system. In principle, any set of linearly independent axes could be used. The local weights of each unit are chosen to yield an optimal linear estimator for the respective egomotion component. In addition, we allow the neural models to interact linearly, such that the whole ensemble output is a linear combination of the individual neural outputs. This last step is necessary since the neural models do not react specifically to their own egomotion component due to the broad tuning of the motion detector model (cf. equation 2.1). The response of the neural model can be made more specific by using the output of the other neurons to suppress the signal caused by other egomotion components (Krapp, Hengstenberg, & Egelhaaf, 2001; Haag & Borst, 2003).

2.2 Prior Knowledge. An estimator for egomotion consisting of a linear combination of flow measurements necessarily has to neglect the dependence of the optic flow on object distances. As a consequence, the estimator output will be different from scene to scene, depending on the current distance and noise characteristics.
The best the estimator can do is to add up as many flow measurements as possible, hoping that the individual distance deviations of the current scene from the average over all scenes will cancel
each other. Clearly, viewing directions with low distance variability and small noise content should receive a higher weight in this process. In this way, prior knowledge about the distance and noise statistics of the sensor and its environment can improve the reliability of the estimate. If the current nearness at viewing direction di differs from the average nearness µ̄i over all scenes by ∆µi, the measurement xi (or yi, respectively) can be written as (see equations 2.1 and 2.2)

$$x_i = -\left( \bar{\mu}_i \mathbf{u}_i^\top,\; (\mathbf{u}_i \times \mathbf{d}_i)^\top \right) \begin{pmatrix} \mathbf{T} \\ \mathbf{R} \end{pmatrix} + n_{x,i} - \Delta\mu_i\, \mathbf{u}_i^\top \mathbf{T}, \tag{2.4}$$
where the last two terms vary from scene to scene, even when the sensor undergoes exactly the same egomotion. To simplify the notation, we stack all 2N measurements over the entire motion detector array in the vector x = (x1, y1, x2, y2, ..., xN, yN)^T. Similarly, the egomotion components along the x-, y-, and z-directions of the global coordinate system are combined in the vector θ = (Tx, Ty, Tz, Rx, Ry, Rz)^T, the scene-dependent terms of equation 2.4 in the 2N-vector n = (nx,1 − ∆µ1 u1^T T, ny,1 − ∆µ1 v1^T T, ...)^T, and the scene-independent terms in the 2N × 6 matrix F whose rows are (−µ̄1 u1^T, −(u1 × d1)^T), (−µ̄1 v1^T, −(v1 × d1)^T), and so on. The entire ensemble of measurements over the sensor thus becomes

$$\mathbf{x} = F\theta + \mathbf{n}. \tag{2.5}$$
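The measurement model of equations 2.2 to 2.5 is straightforward to simulate numerically. The sketch below is purely illustrative (the viewing directions, nearness values, noise level, and egomotion θ are invented, not the authors' setup): it assembles F row pair by row pair, following the sign convention of equation 2.4, and draws a noisy measurement vector x = Fθ + n.

```python
import numpy as np

def tangent_basis(d):
    """Two orthonormal tangent vectors u, v at viewing direction d."""
    a = np.array([0.0, 0.0, 1.0])
    if abs(d @ a) > 0.9:                    # avoid a nearly parallel reference axis
        a = np.array([1.0, 0.0, 0.0])
    u = np.cross(a, d)
    u /= np.linalg.norm(u)
    v = np.cross(d, u)
    return u, v

def build_F(dirs, mu_bar):
    """Scene-independent 2N x 6 matrix F of equation 2.5.
    Each viewing direction d contributes the row pair
    (-mu_bar*u^T, -(u x d)^T) and (-mu_bar*v^T, -(v x d)^T)."""
    rows = []
    for d, mu in zip(dirs, mu_bar):
        u, v = tangent_basis(d)
        rows.append(np.concatenate([-mu * u, -np.cross(u, d)]))
        rows.append(np.concatenate([-mu * v, -np.cross(v, d)]))
    return np.array(rows)

rng = np.random.default_rng(0)
dirs = rng.normal(size=(50, 3))             # 50 random viewing directions
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
mu_bar = np.full(50, 0.5)                   # average nearness: 1/(2 m), invented

F = build_F(dirs, mu_bar)                   # shape (100, 6)
theta = np.array([0.0, 0.0, 0.3, 0.0, 0.0, 0.1])    # forward T plus yaw R (invented)
x = F @ theta + 0.01 * rng.normal(size=F.shape[0])  # equation 2.5 with iid noise
```

With real prior statistics, µ̄i would come from recorded distance data (as in section 3.1) rather than a constant.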
Assuming that T, nx,i, ny,i, and ∆µi are uncorrelated, the covariance matrix C of the scene-dependent measurement component n is given by

$$C_{ij} = C_{n,ij} + C_{\mu,ij}\, \mathbf{u}_i^\top C_T\, \mathbf{u}_j, \tag{2.6}$$
with Cn being the covariance of the noise, Cµ that of the nearness deviations ∆µ, and CT that of T. These three covariance matrices, together with the average nearness µ̄i, constitute the prior knowledge required for deriving the optimal estimator.

2.3 Optimized Linear Estimator. Using the notation of equation 2.5, we write the output of the whole ensemble as a linear estimator,

$$\hat{\theta} = W\mathbf{x}. \tag{2.7}$$
W denotes an M × 2N weight matrix where each of the M rows consists of a linear combination of the weight sets of the neural models (see equation 2.3). The optimal weight matrix is chosen to minimize the mean square error e of the estimator, given by

$$e = E\left( \|\theta - \hat{\theta}\|^2 \right) = \mathrm{tr}[W C W^\top], \tag{2.8}$$
where E denotes the expectation. We additionally impose the constraint that the estimator should be unbiased for n = 0, that is, θ̂ = θ. From equations 2.5 and 2.7, we obtain the constraint equation

$$WF = 1_{M \times M}. \tag{2.9}$$
The solution minimizing the associated Euler-Lagrange functional (Λ is an M × M matrix of Lagrange multipliers),

$$J = \mathrm{tr}[W C W^\top] + \mathrm{tr}[\Lambda\,(1_{M \times M} - WF)], \tag{2.10}$$
can be found analytically and is given by

$$W = \frac{1}{2}\, \Lambda\, F^\top C^{-1}, \tag{2.11}$$
with Λ = 2(F^T C^{−1} F)^{−1}. The rows of F^T C^{−1} correspond to the neural models of equation 2.3.² Λ acts as a correction matrix that suppresses the part of the neural signal caused by the egomotion components to which the neuron is not tuned. When computed for the typical interscene covariances of a flying animal, the resulting weight sets are able to explain many of the receptive field characteristics of the tangential neurons (Franz & Krapp, 2000). However, the question remains whether the output of such an ensemble of neural models can be used for some real-world task. This is by no means evident given the fact that, in contrast to most approaches in computer vision, the distance distribution of the current scene is completely ignored by the linear estimator.

² The resulting local motion sensitivities correspond exactly to those obtained from the linear range model in Franz and Krapp (2000) if one assumes a diagonal C.

3 Experiments

3.1 Linear Estimator for an Office Robot. As our test scenario, we consider the situation of a mobile robot in an office environment. This scenario allows for measuring the typical motion patterns and the associated distance statistics that otherwise would be difficult to obtain for a flying agent. The distance statistics were recorded using a rotating laser scanner. The 26 measurement points were chosen along typical trajectories of a mobile robot while wandering around and avoiding obstacles in an office environment. The recorded distance statistics therefore reflect properties of both the environment and the specific movement patterns of the robot. From these measurements, the average nearness µ̄i and its covariance Cµ were
Figure 3: Distance statistics of an indoor robot (0 azimuth corresponds to forward direction; the distances on the contour lines are given in m). (A) Average distances from the origin in the visual field (N = 26). Darker areas represent larger distances. (B) Distance standard deviation in the visual field (N = 26). Darker areas represent stronger deviations.
computed (cf. Figure 3; we used distance instead of nearness for easier interpretation). The distance statistics show a pronounced anisotropy, which can be attributed to three main factors. (1) Since the robot tries to turn away from the obstacles, the distance in front of and behind the robot tends to be larger than on its sides (see Figure 3A). (2) The camera on the robot usually moves at a fixed height above ground (here, 0.62 m) on a flat surface. As a consequence, distance variation is particularly small at very low elevations (see Figure 3B). (3) The office environment also contains corridors. When the robot follows the corridor while avoiding obstacles, distance variations in the frontal region of the visual field are very large (see Figure 3B).
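The averaging step over scan locations can be summarized in a few lines. The sketch below uses synthetic distance data in place of the laser-scanner recordings (all numbers invented) to show how the average nearness µ̄i and its covariance Cµ are obtained:

```python
import numpy as np

# Synthetic stand-in for the 26 laser-scanner recordings: distances (m)
# at n_dir viewing directions in each of n_scenes scenes. Values invented.
rng = np.random.default_rng(1)
n_scenes, n_dir = 26, 60
distances = rng.uniform(0.5, 6.0, size=(n_scenes, n_dir))

nearness = 1.0 / distances             # nearness = inverse distance
mu_bar = nearness.mean(axis=0)         # average nearness per viewing direction
C_mu = np.cov(nearness, rowvar=False)  # covariance of nearness across scenes
```

Here each column of `nearness` is one viewing direction, so `np.cov(..., rowvar=False)` yields the direction-by-direction covariance that enters equation 2.6.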
The estimation of the translation covariance CT is straightforward since our robot can translate only in the forward direction, along the z-axis. CT is therefore zero everywhere except the lower right diagonal entry, which corresponds to the square of the average forward speed of the robot (here, 0.3 m/s). The motion detector noise was assumed to be zero mean, uncorrelated, and uniform over the image, which results in a diagonal Cn with identical entries. The noise standard deviation of 0.34 deg/s was determined by presenting a series of artificially translated images of the laboratory (moving at 1.1 deg/s) to the flow algorithm used in the implementation of the estimator (see section 3.2).

µ̄, Cµ, CT, and Cn constitute the prior knowledge necessary for computing the estimator (see equations 2.6 and 2.11). The optimal weight sets for the neural models for the six degrees of freedom (each of which corresponds to a row of F^T C^{−1}) are shown in Figure 4. All neural models have in common that image regions near the rotation or translation axis receive less weight. In these regions, the egomotion components to be estimated generate only small flow vectors, which are easily corrupted by noise. Equation 2.11 predicts that the estimator will preferably sample in image regions with smaller distance variations. In our measurements, this is mainly the case at the ground around the robot (see Figure 3). The rotation-selective neural models assign higher weights to distant image regions, since distance variations at large distances have a smaller effect. In our example, distances are largest in front of and behind the robot, so that the neural model for yaw assigns the highest weights to these regions (see Figure 4F). This effect is less pronounced in the other rotational neural models because the translational flow is almost orthogonal to their local directions and thus interferes to a much lesser degree.
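Given F and C, the estimator of equation 2.11 is essentially a one-liner, and the unbiasedness constraint of equation 2.9 can be checked numerically. A toy sketch (a random F and a diagonal C, not the robot's actual statistics):

```python
import numpy as np

def optimal_weights(F, C):
    """Unbiased minimum-variance linear estimator of equations 2.9-2.11:
    W = 0.5 * Lambda * F^T C^-1, with Lambda = 2 (F^T C^-1 F)^-1."""
    Ci = np.linalg.inv(C)
    Lam = 2.0 * np.linalg.inv(F.T @ Ci @ F)
    return 0.5 * Lam @ F.T @ Ci

# Toy problem: random 2N x 6 matrix F and a diagonal measurement covariance.
rng = np.random.default_rng(2)
N = 40
F = rng.normal(size=(2 * N, 6))
C = np.diag(rng.uniform(0.5, 2.0, size=2 * N))

W = optimal_weights(F, C)
print(np.allclose(W @ F, np.eye(6)))   # unbiasedness constraint WF = 1 (eq. 2.9): True
```

Substituting W into WF gives (F^T C^{−1} F)^{−1} F^T C^{−1} F = 1, so the check holds by construction; numerically it verifies that the matrix inversions are well conditioned.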
Although the small weights near the motion axes and the overall distribution of local directions are similar to those found in tangential neurons, our neural models show specific adaptations to the indoor robot scenario: the highly weighted ground regions are exactly the opposite of our model predictions for a flying animal, where the ground region shows a stronger distance variability than regions near and above the horizon (Franz & Krapp, 2000). The predicted dorsoventral asymmetry with small weights in the ground region is indeed observed in the tangential neurons (see Figure 1 and Krapp et al., 1998). The strong weighting of the frontal region in the yaw neural model (see Figure 4F) is also corridor-specific, so it is not surprising that this feature is not present in an animal that evolved in an open outdoor environment.
3.2 Gantry Experiments. The egomotion estimates from the ensemble of neural models were tested on a gantry with three translational and one rotational (yaw) degree of freedom. Since the gantry had a position accuracy below 1 mm, the programmed position values were taken as ground truth for evaluating the estimator’s accuracy.
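With programmed gantry motions as ground truth, an estimate's accuracy reduces to simple vector comparisons: the angle between estimated and true rotation axes, and the relative error of the rotation rate. A sketch of such metrics (the example vectors are invented, not measured values):

```python
import numpy as np

def axis_error_deg(r_est, r_true):
    """Angle (degrees) between estimated and true rotation axes."""
    c = (r_est @ r_true) / (np.linalg.norm(r_est) * np.linalg.norm(r_true))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def relative_speed_error(r_est, r_true):
    """Relative error of the estimated rotation rate (vector magnitudes)."""
    return abs(np.linalg.norm(r_est) - np.linalg.norm(r_true)) / np.linalg.norm(r_true)

r_true = np.array([0.0, 0.0, 0.20])       # programmed yaw rate (ground truth, invented)
r_est = np.array([0.004, -0.006, 0.21])   # hypothetical estimator output
print(axis_error_deg(r_est, r_true), relative_speed_error(r_est, r_true))
```

The same two metrics apply to the translation estimates, with translation direction and speed in place of rotation axis and rate.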
Figure 4: Neural models computed as part of the linear estimator. Notation is identical to Figure 1. The depicted region of the visual field extends from −15° to 180° azimuth and from −75° to 75° elevation. The model neurons are tuned to (A) forward translation, (B) translations to the right, (C) downward translation, (D) roll rotation, (E) pitch rotation, and (F) yaw rotation.
As the vision sensor, we used a camera mounted above a mirror with a circularly symmetric hyperbolic profile. This setup allowed for a 360 degree horizontal field of view extending from 90 degrees below to 45 degrees above the horizon. Such a large field of view considerably improves the estimator's performance since the individual distance deviations in the scene
Figure 5: Gantry experiments. Results are given in arbitrary units, true rotation values are denoted by a dashed line, and translation by a dash-dot line. Gray bars denote translation estimates, and white bars rotation estimates. (A) Estimated versus real egomotion. (B) Estimates of the same egomotion at different locations. (C) Estimates for constant rotation and varying translation. (D) Estimates for constant translation and varying rotation.
are more likely to be averaged out. More details about the omnidirectional camera can be found in Chahl and Srinivasan (1997). In each experiment, the camera was moved to 10 different start positions in the lab at the same height above ground (0.62 m) as the robot camera,³ but with largely varying distance distributions near and above the horizon. After recording an image of the scene at the start position, the gantry translated and rotated at various speeds and directions and took a second image. After the recorded image pairs (10 for each type of movement) were unwarped, we computed the optic flow input for the neural models using a standard gradient-based scheme (Srinivasan, 1994).

The average error of the rotation rate estimates over all trials (N = 450) was 0.7°/s (5.7% relative error; see Figure 5A); the error in the estimated translation speeds (N = 420) was 8.5 mm/s (7.5% relative error). The estimated rotation axis had an average error of 1.7 degrees, and the estimated translation direction an average error of 4.5 degrees. The larger error of the translation estimates is mainly caused by the direct dependence of the translational flow on distance (see equation 2.2), whereas the rotation estimates are only indirectly affected by distance errors via the current translational flow component, which is largely filtered out by the local direction template. The larger sensitivity of the translation estimates to distance variations can be seen by moving the sensor at the same translation and rotation speeds in various locations. The rotation estimates remain consistent over all locations, whereas the translation estimates show a higher variance and also a location-dependent bias, for example, very close to laboratory walls (see Figure 5B).

A second problem for translation estimation comes from the different properties of rotational and translational flow fields. Due to its distance dependence, the translational flow field shows a much wider range of local image velocities than a rotational flow field. The smaller translational flow vectors are often swamped by simultaneous rotation or noise, and the larger ones tend to be in the upper saturation range of the optic flow algorithm used. This can be demonstrated by simultaneously translating and rotating the sensor. Again, rotation estimates remain consistent at different translation speeds while translation estimates are strongly affected by rotation (see Figures 5C and 5D).

³ The translational neurons in Figure 4 for the mobile robot case assign a high weight to the ground region. As a consequence, the translation estimates strongly depend on the correct height above ground, whereas rotational neurons are only indirectly affected.

4 Discussion

4.1 Egomotion Estimation.
Our experiments show that it is possible to obtain useful egomotion estimates from an ensemble of linear neural models in a real-world task. Although a linear approach necessarily has to ignore the distances of the currently perceived scene, an appropriate choice of local weights and a large field of view are capable of reducing the influence of noise and distance variability on the estimates. In particular, rotation estimates were highly accurate and consistent across different scenes and different simultaneous translations. Translation estimates were not as accurate and were less robust against changing scenes and simultaneous rotation. The performance difference was to be expected because of the direct distance dependence of the translational optic flow, which leads to a larger variance of the estimator output. This problem can be resolved only by also estimating the distances in the current scene (e.g., in the iterative schemes in Koenderink & van Doorn, 1987; Heeger & Jepson, 1992). This, however, requires significantly more complex computations. Another reason is the limited dynamic range of the flow algorithm used in the experiments, as
M. Franz, J. Chahl, and H. Krapp
discussed in the previous section. One way to overcome this problem would be to use an optic flow algorithm that estimates image motion on different temporal or spatial scales, which is computationally more expensive. Our results show that the linear estimator accurately estimates rotation under general egomotion conditions and without any knowledge of the object distances of the current scene. The estimator may be used in technical applications such as image stabilization of a moving camera or the removal of the rotational component from the currently measured optic flow. Both measures considerably simplify the estimation of distances from the remaining optic flow and the detection of independently moving objects. In addition, the simple architecture of the estimator allows for an efficient implementation at low computational costs, which are several orders of magnitude smaller than the costs of computing the entire optic flow input. The components of the estimator are simplified neural models, which, when computed for a flying animal, are able to reproduce characteristics of the tangential neuron receptive field organization, that is, the distribution of local motion sensitivities and local preferred directions (Franz & Krapp, 2000). Our study suggests that tangential neurons may be used for self-motion estimation by linearly combining their outputs at the level of the lobula plate (e.g., Krapp et al., 2001) or at later integration stages. Evidence for the latter possibility comes from recent electrophysiological studies on motor neurons, which innervate the fly neck motor system and mediate gaze stabilization behavior (Huston & Krapp, 2003). The possible behavioral role of such egomotion estimates, however, will critically depend on the dynamic properties of the whole sensorimotor loop, as well as on specific tuning of the motion processing stage providing the input.
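The linear-estimator idea discussed above can be sketched numerically. The following Python toy is a hedged illustration, not the paper's setup: the spherical viewing directions, distance distribution, and egomotion values are invented, and an ordinary least-squares fit stands in for the matched-filter weight sets. It shows the key property reported in the experiments: a distance-free linear combination of flow vectors over a wide field of view recovers rotation accurately despite unknown distances and simultaneous translation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Viewing directions of an omnidirectional spherical sensor (illustrative).
n = 500
d = rng.normal(size=(n, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)

def optic_flow(omega, t, inv_dist):
    """Flow on the view sphere: rotational part -omega x d plus a
    translational part scaled by inverse distance (cf. equation 2.2)."""
    rot = -np.cross(omega, d)
    trans = -(t[None, :] - (d @ t)[:, None] * d) * inv_dist[:, None]
    return rot + trans

# Linear map from rotation to flow: one column per unit rotation axis.
F = np.column_stack([optic_flow(e, np.zeros(3), np.zeros(n)).ravel()
                     for e in np.eye(3)])

# A scene with strongly varying distances plus simultaneous translation.
inv_dist = rng.uniform(0.2, 2.0, size=n)
omega_true = np.array([0.1, -0.3, 0.2])
u = optic_flow(omega_true, np.array([0.5, 0.0, 0.1]), inv_dist).ravel()

# Distance-free linear rotation estimate: least squares over the whole
# field of view, which averages out the translational contamination.
omega_est, *_ = np.linalg.lstsq(F, u, rcond=None)
print(np.round(omega_est, 2))
```

Note that no distance information enters the estimate; the wide field of view alone suppresses the distance-dependent translational component, mirroring why the gantry rotation estimates were robust while translation estimates were not.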
An example of using integrated optic flow for controlling a robotic system is described in Reiser and Dickinson (2003).

4.2 Neural Computation of Egomotion.

The description of any egomotion requires at most six degrees of freedom. Therefore, an ensemble of six neurons, as in our gantry experiments, would be sufficient to encode the entire egomotion of the fly. There are, however, at least 13 tangential neurons (3 HS and 10 VS neurons) on either side of the fly lobula plate, which do not cover all degrees of freedom; for example, none of the currently known receptive fields represents lift translation (reviews in Hausen, 1984, 1993; Krapp, 2000). A plausible explanation might be that the axes covered by tangential neurons—thus constituting the sensory coordinate system—are aligned with the axes used by the motor coordinate system. Recent studies on gaze stabilization in Calliphora suggest that in some cases, the output of individual tangential neurons is connected to individual motor neurons driving certain head movements (Huston & Krapp, 2003). Another hint comes from an interesting property of our linear model. The linearly combined output of two model neurons corresponds to the linear combination of their respective weight sets. For the tangential neurons, this
Figure 6: Hypothetical neurons constructed for (A) roll and (B) pitch rotation. (Both panels plot elevation [deg.], from -75 to 75, against azimuth [deg.], from 0 to 180.)
would mean that the summed output of several neurons may be treated as a superposition of their individual local response properties. The receptive fields of many tangential neurons often cover only a small part of the visual field, perhaps due to anatomical or developmental constraints. By summing the output of several neurons, one could build estimators with extended receptive fields covering more than one visual hemisphere. This is demonstrated in Figure 6, where we construct a hypothetical pitch neuron from the inverted output of VS1-3 added to the output of VS8-10. Also shown in Figure 6 is the response field of VS6, which was shown to be ideally suited to sense roll rotations (Franz & Krapp, 2000).

4.3 Linearity.

Finally, we have to point out a basic difference between the proposed theory and optic flow processing in the fly. The theory assumes that the motion detector signals integrated by the tangential neurons depend linearly on velocity (see equation 2.1; Reichardt, 1987). The output of fly motion detectors, however, is linear only within a limited velocity range. The motion detector output also depends on the spatial pattern properties of the visual surroundings (Borst & Egelhaaf, 1993). These properties are reflected by the tangential neurons' response properties. Beyond certain image velocities, for instance, their response stays at a plateau when the velocity is increased. Even higher velocities result in a decrease of the neuron's response (Hausen, 1982). Within the plateau range, tangential neurons can indicate only the presence and the sign of a particular egomotion, not the actual velocity. A detailed comparison between linear model neurons and tangential neurons shows characteristic differences. Under the conditions of the neurophysiological experiments reported in Franz and Krapp (2000), tangential neurons seem to operate in the plateau range rather than in the linear range.
Under such response regimes, a linear combination of tangential neuron outputs would not indicate the true egomotion.
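The superposition property invoked for Figure 6 follows directly from linearity, and a small Python sketch makes it explicit. The two gaussian sensitivity fields below are hypothetical stand-ins for model neuron weight sets, not measured VS-cell response fields; the point is only that summing two linear neurons' outputs is identical to reading out a single neuron whose weight set is the superposition of the two.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical local motion sensitivities (weights) of two model neurons
# on a coarse azimuth x elevation grid; the fields are illustrative only.
az = np.linspace(0, 180, 19)
el = np.linspace(-75, 75, 11)
A, E = np.meshgrid(az, el)
w1 = np.exp(-((A - 45) ** 2 + (E - 20) ** 2) / (2 * 30 ** 2))
w2 = np.exp(-((A - 135) ** 2 + (E + 20) ** 2) / (2 * 30 ** 2))

def response(w, flow):
    # Linear model neuron: weighted sum of local image velocities.
    return np.sum(w * flow)

flow = rng.normal(size=A.shape)  # arbitrary flow pattern

# Linearity: summed outputs equal the output of the superposed weight set,
# so a combined neuron inherits an extended receptive field for free.
lhs = response(w1, flow) + response(w2, flow)
rhs = response(w1 + w2, flow)
print(np.isclose(lhs, rhs))  # True
```

This identity is exact only in the linear range; in the plateau regime discussed above, the summed outputs would no longer correspond to any single superposed weight set.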
Physiological mechanisms have been described that may help to overcome these limitations to a certain degree. A nonlinear integration of local motion detector signals, known as dendritic gain control (Borst, Egelhaaf, & Haag, 1995), prevents the output of the tangential neurons from saturating when its entire receptive field is stimulated. This mechanism results in a size-invariant response, which still depends on velocity. Harris, O'Carroll, and Laughlin (2000) show that contrast gain control is of similar significance. It contributes to the neuron's adaptation to visual motion, that is, it prevents the tangential neurons from saturating at high visual contrasts and image velocities. Although these mechanisms may not establish a linear dependence over the entire velocity range, they may considerably extend it. Evidence supporting this idea comes from a study by Lewen, Bialek, and de Ruyter van Steveninck (2001). The authors performed electrophysiological experiments on the H1 tangential neuron in a natural outdoor environment and in bright daylight. They show that the linear dynamic range of H1 under these conditions is significantly extended compared to stimulation with a periodic grating within the same range of velocities but applied in the laboratory. Despite these results, it is still not entirely clear whether an extended linear dynamic range of the tangential neurons is sufficient to cover all needs in gaze stabilization and flight steering. Within the linear range, however, the fly might take advantage of all the beneficial properties the linear model offers. For instance, it may combine the outputs of several tangential neurons to form matched filters for particular egomotions. In the case of the intrinsic tangential neuron VCH, thought to be involved in figure-ground discrimination (Warzecha, Borst, & Egelhaaf, 1992), this seems to hold true. VCH receives input from several other tangential neurons, the response fields of which are well characterized.
By combining the response fields of its input tangential neurons, the VCH response field is readily explained (Krapp et al., 2001). This suggests that linear combination of tangential neuron response fields may well be an option for the fly visual system to estimate egomotion.

Acknowledgments

The authors wish to thank J. Hill, M. Hofmann, and M. V. Srinivasan for their help. Financial support was provided by the Human Frontier Science Program and the Max-Planck-Gesellschaft.

References

Borst, A., & Egelhaaf, M. (1993). Detecting visual motion: Theory and models. In F. Miles & J. Wallman (Eds.), Visual motion and its role in the stabilization of gaze (pp. 3–27). Amsterdam: Elsevier.
Borst, A., Egelhaaf, M., & Haag, J. (1995). Mechanisms of dendritic integration underlying gain control in fly motion-sensitive interneurons. J. Comput. Neurosci., 2, 5–18.
Chahl, J. S., & Srinivasan, M. V. (1997). Reflective surfaces for panoramic imaging. Applied Optics, 36(31), 8275–8285.
Egelhaaf, M., Kern, R., Krapp, H. G., Kretzberg, J., Kurtz, R., & Warzecha, A.-K. (2002). Neural encoding of behaviourally relevant visual-motion information in the fly. TINS, 25(2), 96–102.
Franz, M. O., & Chahl, J. S. (2003). Linear combinations of optic flow vectors for estimating self-motion: A real-world test of a neural model. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.
Franz, M. O., & Krapp, H. G. (2000). Wide-field, motion-sensitive neurons and matched filters for optic flow fields. Biol. Cybern., 83, 185–197.
Gibson, J. J. (1950). The perception of the visual world. Boston: Houghton Mifflin.
Haag, J., & Borst, A. (2003). Orientation tuning of motion-sensitive neurons shaped by vertical-horizontal network interactions. J. Comp. Physiol. A, 189, 363–370.
Harris, R. A., O'Carroll, D. C., & Laughlin, S. B. (2000). Contrast gain reduction in fly motion adaptation. Neuron, 28, 595–606.
Hausen, K. (1982). Motion sensitive interneurons in the optomotor system of the fly. II. The horizontal cells: Receptive field organization and response characteristics. Biol. Cybern., 46, 67–79.
Hausen, K. (1984). The lobula-complex of the fly: Structure, function and significance in visual behaviour. In M. A. Ali (Ed.), Photoreception and vision in invertebrates (pp. 523–559). New York: Plenum Press.
Hausen, K. (1993). The decoding of retinal image flow in insects. In F. A. Miles & J. Wallman (Eds.), Visual motion and its role in the stabilization of gaze (pp. 203–235). Amsterdam: Elsevier.
Heeger, D. J., & Jepson, A. D. (1992). Subspace methods for recovering rigid motion. I: Algorithm and implementation.
Intl. J. Computer Vision, 7, 95–117.
Huston, S. J., & Krapp, H. G. (2003). Visual receptive field of a fly neck motor neuron. In N. Elsner & H.-U. Schnitzler (Eds.), Göttingen neurobiology report 2003. Stuttgart: Thieme.
Koenderink, J. J., & van Doorn, A. J. (1987). Facts on optic flow. Biol. Cybern., 56, 247–254.
Krapp, H. G. (2000). Neuronal matched filters for optic flow processing in flying insects. In M. Lappe (Ed.), Neuronal processing of optic flow (pp. 93–120). San Diego, CA: Academic Press.
Krapp, H. G., & Hengstenberg, R. (1996). Estimation of self-motion by optic flow processing in single visual interneurons. Nature, 384, 463–466.
Krapp, H. G., Hengstenberg, R., & Egelhaaf, M. (2001). Binocular input organization of optic flow processing interneurons in the fly visual system. J. Neurophysiol., 85, 724–734.
Krapp, H. G., Hengstenberg, B., & Hengstenberg, R. (1998). Dendritic structure and receptive field organization of optic flow processing interneurons in the fly. J. Neurophysiol., 79, 1902–1917.
Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Neural coding of naturalistic motion stimuli. Network: Comput. Neural Syst., 12, 317–329.
Reichardt, W. (1987). Evaluation of optical motion information by movement detectors. J. Comp. Physiol. A, 161, 533–547.
Reiser, M. B., & Dickinson, M. H. (2003). A test bed for insect-inspired robotic control. Phil. Trans. Royal Soc. Lond. A, 361, 2267–2285.
Srinivasan, M. V. (1994). An image-interpolation technique for the computation of optic flow and egomotion. Biol. Cybern., 71, 401–415.
Warzecha, A.-K., Borst, A., & Egelhaaf, M. (1992). Photo-ablation of single neurons in the fly visual system reveals neuronal circuit for detection of small objects. Neurosci. Letters, 141, 119–122.

Received October 31, 2003; accepted May 3, 2004.
LETTER
Communicated by Emilio Salinas and Peter Thomas
Correlated Firing Improves Stimulus Discrimination in a Retinal Model

Garrett T. Kenyon
[email protected] Physics Division, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.
James Theiler
[email protected] Non-Proliferation and International Security, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.
John S. George
[email protected] Physics Division, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.
Bryan J. Travis
[email protected] Environmental and Earth Sciences, Los Alamos National Laboratory, Los Alamos, NM 87545, U.S.A.
David W. Marshak
[email protected] Department of Neurobiology and Anatomy, University of Texas Medical School, Houston, TX 77030, U.S.A.
Synchronous firing limits the amount of information that can be extracted by averaging the firing rates of similarly tuned neurons. Here, we show that the loss of such rate-coded information due to synchronous oscillations between retinal ganglion cells can be overcome by exploiting the information encoded by the correlations themselves. Two very different models, one based on axon-mediated inhibitory feedback and the other on oscillatory common input, were used to generate artificial spike trains whose synchronous oscillations were similar to those measured experimentally. Pooled spike trains were summed into a threshold detector whose output was classified using Bayesian discrimination. For a threshold detector with short summation times, realistic oscillatory input yielded superior discrimination of stimulus intensity compared to rate-matched Poisson controls. Even for summation times too long to resolve synchronous inputs, gamma band oscillations still contributed to improved discrimination by reducing the total spike count variability, or Fano factor. In separate experiments in which neurons were synchronized in a stimulus-dependent manner without attendant oscillations, the Fano factor increased markedly with stimulus intensity, implying that stimulus-dependent oscillations can offset the increased variability due to synchrony alone.

Neural Computation 16, 2261–2291 (2004) © 2004 Massachusetts Institute of Technology

1 Introduction

Neurons are thought to represent sensory information primarily as changes in their individual firing rates. In order to obtain reliable estimates of rate-coded information on physiological timescales, from tens to hundreds of msec, it may be necessary to estimate the average firing rate over an ensemble of similarly activated cells. Averaging yields a more accurate estimate of the ensemble firing rate only to the extent that the input spike trains are statistically independent (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998). When the input spike trains are instead partially synchronized, averaging their firing rates no longer produces the same improvement in signal to noise, as fluctuations in the number of spikes arising from individual cells will no longer tend to cancel out across the population of input fibers. Nonetheless, firing correlations are ubiquitous in the nervous system, often associated with coherent oscillations in the gamma frequency band (40–120 Hz) that synchronize activity both within and between brain areas (Fries, Schroder, Roelfsma, Singer, & Engel, 2002; Singer & Gray, 1995). It is thus important to understand what consequences such synchronous oscillations might have for how information is represented by central neurons. One possibility is that firing correlations simply impose an upper limit on the amount of rate-coded information a population of neurons can represent in its pooled activity over a given unit of time (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998).
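The pooling limit described above can be illustrated with a toy simulation. The numbers below (100 cells, mean count 5, gamma-distributed common drive) are illustrative choices, not values from the studies cited; the sketch only shows that when a shared fluctuation correlates the cells, averaging across the ensemble no longer shrinks the trial-to-trial variability of the pooled estimate the way it does for independent spike counts.

```python
import numpy as np

rng = np.random.default_rng(2)

n_cells, n_trials, rate = 100, 2000, 5.0

# Independent case: each cell draws its own Poisson spike count per trial.
indep = rng.poisson(rate, size=(n_trials, n_cells))

# Correlated case: a common rate fluctuation (e.g., shared oscillatory
# drive) modulates every cell on a given trial, synchronizing their counts.
common = rng.gamma(shape=10.0, scale=rate / 10.0, size=(n_trials, 1))
corr = rng.poisson(common, size=(n_trials, n_cells))

# Trial-to-trial spread of the ensemble-averaged firing rate.
ind_std = indep.mean(axis=1).std()
corr_std = corr.mean(axis=1).std()
print(round(float(ind_std), 3), round(float(corr_std), 3))
```

For independent inputs the spread falls roughly as 1/sqrt(N); the shared fluctuation puts a floor under it that no amount of pooling removes, which is the sense in which synchrony caps rate-coded information.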
From this perspective, firing correlations are an inevitable but undesirable consequence of the massive interconnectivity of neural circuits but otherwise serve no significant information processing function. A second possibility is that the correlations themselves may encode relevant information, a view supported by studies showing that coherent oscillations may be involved in a variety of cognitive processes, including attention (Fries, Reynolds, Rorie, & Desimone, 2001), perception (Srinivasan, Russell, Edelman, & Tononi, 1999), top-down priming (Engel, Fries, & Singer, 2001), and feature integration (Fries et al., 2002; Singer & Gray, 1995). Theoretical arguments indicate that correlations are not always detrimental to a population code and can actually increase the overall information content depending on the specific nature of the correlations, tuning differences between the individual neurons, and how the population-coded signals are extracted (Abbott & Dayan, 1999). Here, we adopt a more empirical approach. Specifically, we employ artificial spike trains exhibiting stimulus-dependent synchronous oscillations similar to those observed between ganglion cells in the cat retina, which we use to explore how realistic
firing correlations affect the extraction of visual information by a general threshold detection process. Information extraction was assessed by quantifying the performance of a Bayesian discriminator in classifying different stimulus intensities based on the output of a threshold detector. When driven by correlated inputs in which the stimulus intensity was encoded by both the firing rates of individual cells and the strength of their synchronous oscillations, threshold elements mediated equal or superior stimulus discrimination compared to when driven by statistically independent Poisson generators that produced, on average, the same number of spikes. Threshold detectors with short integration times and low background event rates (i.e., coincidence detectors) always extracted more information from spike trains with realistic correlations than from Poisson controls, as the stimulus information encoded by the level of synchrony could be extracted directly. Neurons in the lateral geniculate nucleus (LGN) and visual cortex are preferentially sensitive to synchronous inputs (Alonso, Usrey, & Reid, 1996; Usrey, Alonso, & Reid, 2000) and may respond differentially to coincidences in their primary afferents. Somewhat surprisingly, even threshold detectors with long integration times, which were thus unable to resolve synchronous inputs, still mediated equal or superior stimulus discrimination when driven by correlated inputs compared to Poisson controls, consistent with previous theoretical analysis (Abbott & Dayan, 1999). Because the stimulus-evoked oscillations became stronger as a function of the stimulus intensity, the loss of rate-coded information due to the increased firing synchrony was effectively countered by a reduction in the variability of the total input spike count.

2 Methods

2.1 Artificial Spike Trains.
Spike trains were generated by one of two computer models, whose parameters were adjusted to qualitatively match the stimulus-dependent, synchronous oscillations recorded from cat ganglion cells (Mastronarde, 1989; Neuenschwander, Castelo-Branco, & Singer, 1999; Neuenschwander & Singer, 1996). Some theoretical studies of correlated activity have employed mathematically generated spike trains in which the degree of firing synchrony was independent of, and thus conveyed no information about, the applied stimulus (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998). However, the synchronous oscillations between retinal neurons depend strongly on stimulus parameters such as size (Ariel, Daw, & Rader, 1983; Ishikane, Kawana, & Tachibana, 1999; Neuenschwander et al., 1999), contrast (Neuenschwander et al., 1999), connectedness (Ishikane et al., 1999; Neuenschwander & Singer, 1996), and velocity (Neuenschwander et al., 1999). To produce realistic firing correlations, we adopted two very different yet complementary strategies for generating artificial spike trains. In the first case, we used a detailed model of axon-mediated inhibitory feedback in the inner retina in order to generate synchronous oscillations via a complex, but biologically plausible, dynamical process. In the second case, synchronous oscillations were generated using a minimalist approach in which each ganglion cell generated Poisson-distributed spikes at a rate modulated by a common source of oscillatory input. Although both models exhibited similar firing correlations, as assessed by comparing their multiunit cross-correlation histograms (multiunit CCHs), defined below, the two models were based on very different underlying mechanisms. Thus, we were able to investigate whether our conclusions regarding the information content of spike trains with realistic correlations were robust with respect to their underlying mode of generation. To guard against the possibility that our mathematically generated spike trains exaggerated the information conveyed by correlations, both models were subject to the following constraints. First, when corroborating information was available, the largest stimulus-evoked correlations between the model-generated spike trains were comparable to the correlations measured experimentally under similar conditions. Thus, under these circumstances, we are confident that none of the correlations generated by the model were substantially in excess of the correlations present physiologically. Second, the level of correlations in the absence of stimulation was comparable to the spontaneous correlations measured experimentally, ensuring that a correlation code would not possess an unfair advantage by starting from an unphysiologically low baseline level. Third, because the retinal model did not incorporate several known adaptation mechanisms, particularly those occurring in the outer retina, we ensured that the plateau firing rates evoked by stimuli of various intensities were at least as large as those that would occur physiologically. Thus, the model was conservative and likely biased in favor of a rate code.

2.2 Retinal Model.
Artificial spike trains with realistic firing correlations were generated by a retinal model (see Figure 1), details of which have been published previously (Kenyon et al., 2003). The model retina was organized as a 32 × 32 array with wraparound boundary conditions containing five distinct cell types: bipolar cells (BP), small amacrine cells (SA), large amacrine cells (LA), polyaxonal amacrine cells (PA), and alpha ganglion cells (GC). All cell types were modeled as single-compartment, RC circuit elements obeying a first-order differential equation of the following form:

\dot{V}_{ij}^{(k)} = -\frac{1}{\tau^{(k)}} V_{ij}^{(k)} - \sum_{i'j'k'} W_{ii'}^{(k,k')} \, f^{(k,k')}\big(V_{i'j'}^{(k')}\big) \, W_{jj'}^{(k,k')} - b^{(k)} - L_{ij}^{(k)},   (2.1)

where V_{ij}^{(k)} is a 2D array denoting the normalized membrane potentials of all cells of type k (1 ≤ k ≤ 5), with i denoting the row and j the column of the corresponding cell, τ^{(k)} is the time constant, b^{(k)} is a bias current for setting
A. Feedforward & Feedback Inhibition
BP
LA
GC
SA
B. Serial Inhibition
PA
LA
SA
C. Resonance Circuit
GC
PA
PA
Figure 1: The retinal model contained five cell types—bipolar (BP) cells, small (SA), large (LA) and polyaxonal (PA) amacrine cells, and alpha ganglion (GC) cells—arranged as a 32 × 32 square mosaic with wraparound boundary conditions. Conceptually, connections could be organized into three categories. (A) Feedforward and feedback inhibition. Excitatory synapses from BPs were balanced by a combination of reciprocal synapses and direct inhibition of the GCs, primarily mediated by the nonspiking amacrine cell types. (B) Serial inhibition. The three amacrine cell types regulated each other through a negative feedback loop. (C) Resonance circuit. The PAs were excited locally via electrical synapses with GCs, and their axons mediated wide-field inhibition to all cell types, but most strongly onto the GCs. Not all connections are shown. Symbols: Excitation (triangles), inhibition (circles), gap junctions (resistors).
the resting potential, L_{ij}^{(k)} is an external input representing light stimulation, W_{ii'}^{(k,k')} gives the connection strengths between presynaptic, k', and postsynaptic, k, cell types as a function of their row (vertical) separation, W_{jj'}^{(k,k')} gives the same information as a function of the column (horizontal) separation, and the functions f^{(k,k')} give the associated input-output relations for the indicated pre- and postsynaptic cell types, detailed below. The output of the axon-mediated inhibition was delayed by 2 msec, except for the axonal connections onto the axon-bearing amacrine cells, which were delayed by 1 msec. All other synaptic interactions were delayed by one time step, which equaled 1 msec. Although the ratio of ganglion cells to bipolar cells
reflects known density differences, the true anatomical ratio is higher; computational constraints limited the total number of neurons that could be efficiently simulated. However, for the relatively large stimuli used in our experiments, fine details of the ganglion cell receptive field structure should not have been critical. Numerical ratios were always powers of 2 to ensure that the underlying connectivity was translationally invariant between processing modules. All equations were integrated in Matlab using a direct Euler method. Previous analysis has verified that the synchronous oscillations predicted by the model are robust with respect to individual parameter variation and integration step size (Kenyon et al., 2003). The input-output function for gap junctions was given by the identity
f^{(k,k')}\big(V_{ij}^{(k')}\big) = V_{ij}^{(k')},   (2.2)
where the dependence on the presynaptic potential has been absorbed into the definition of τ^{(k)}. This is possible because both the decay term in equation 2.1 and the omitted dependence on the presynaptic potential in equation 2.2 depend linearly on V_{ij}^{(k)}, allowing the coefficients to be combined. Many retinal neurons, including bipolar cells and most amacrine cells, do not fire action potentials but rather generate stochastically distributed postsynaptic potentials at a rate proportional to their membrane voltage (Freed, 2000). Such stochastic inputs are likely to contribute significantly to ganglion cell membrane noise (van Rossum, O'Brien, & Smith, 2003). The input-output function for graded stochastic synapses was constructed by comparing, on each time step, a random number with a Fermi function:
f^{(k,k')}\big(V_{ij}^{(k')}\big) = \theta\!\left( \frac{1}{1 + \exp\big(-\alpha V_{ij}^{(k')}\big)} - r \right),   (2.3)
where α sets the gain (equal to 4 for all nonspiking synapses), r is a uniform random deviate equally likely to take any real value between 0 and 1, and θ is a step function, θ(x) = 1, x ≥ 0; θ(x) = 0, x < 0. Finally, the input-output relation used for spiking synapses was
f^{(k,k')}\big(V_{ij}^{(k')}\big) = \theta\big(V_{ij}^{(k')}\big).   (2.4)
A modified integrate-and-fire mechanism was used to model spike generation. A positive pulse (amplitude = 10.0) was delivered to the cell on the time step after the membrane potential crossed threshold, followed by a negative pulse (amplitude = −10.0) on the subsequent time step. This resulted in a 1 msec action potential followed by an afterhyperpolarization that decayed with the time constant of the cell. Action potentials produced impulse responses in electrically coupled cells as well, an important element
of the oscillatory feedback dynamics. The bias current, b, was decremented by −0.5 following each spike, and then returned to the resting value with the time constant of the cell, adding to the relative refractory period. There was in addition an absolute refractory period of 1 msec. Along both the horizontal and vertical directions, synaptic strengths fell off as gaussian functions of the distance between the pre- and postsynaptic cells. For a given vertical separation, the weight factor was determined by a gaussian function of the following form,
W_{i^{(k)}, i'^{(k')}}^{(k,k')} = \alpha \, W^{(k,k')} \exp\!\left( -\frac{\big(i^{(k)} - i'^{(k')}\big)^2}{2\,\big(\sigma^{(k')}\big)^2} \right),   (2.5)
where W_{i^{(k)}, i'^{(k')}}^{(k,k')} is the weight factor from presynaptic cells of type k' at row index i'^{(k')} to the postsynaptic cells of type k at row index i^{(k)}, where the superscripts k and k' are necessary because the number of rows may be different for the two cell types, α is a numerically determined normalization factor that ensured the total synaptic input integrated over all presynaptic cells of type k' to every postsynaptic cell of type k equaled W^{(k,k')}, σ^{(k')} is the gaussian radius of the interaction, which depended only on the presynaptic cell type, and the quantity i^{(k)} − i'^{(k')} denotes the vertical distance between the pre- and postsynaptic cells, taking into account the wraparound boundary conditions employed to mitigate edge effects and assuming all cell types are distributed uniformly over rectilinear grids of equal total area and whose center points coincide. An analogous weight factor describes the dependence on horizontal separation. Equation 2.5 was augmented by a cutoff condition that prevented synaptic interactions beyond a specified distance, determined by the radius of influence of the presynaptic outputs and the postsynaptic inputs, roughly corresponding to the axonal and dendritic fields, respectively. A synaptic connection was possible only if the output radius of the presynaptic cell overlapped the input radius of the postsynaptic cell. Except for axonal connections, the input and output radii were the same for all cell types. For the large amacrine cells and the ganglion cells, the radius of influence extended out to the centers of the nearest neighboring cells of the same type, producing a coverage factor greater than one (Vaney, 1990). The radii of the bipolar, small, and axon-bearing amacrine cells (nonaxonal connections only) extended only halfway to the nearest cell of the same type, giving a coverage factor of one (Cohen & Sterling, 1990). The radius of the axonal connections was equal to nine ganglion cell diameters.
The external input was multiplied by a gain factor of 3, chosen so that a stimulus intensity of 1 would produce an approximately saturating response in the model bipolar cells. Values for model parameters are listed in Tables 1 and 2.
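As a concrete reading of equation 2.5, the following Python sketch builds the vertical weight factors for one pre/postsynaptic pair, including the wraparound distance, the hard cutoff radius, and the normalization by α. The grid spacing, σ, cutoff, and total weight below are illustrative choices on a unit-length grid, not the model's exact units.

```python
import numpy as np

def weight_rows(n_pre, n_post, w_total, sigma, cutoff):
    """Sketch of the equation 2.5 weight factors along one (vertical)
    dimension: gaussian falloff of wraparound distance, a cutoff radius,
    and normalization so each postsynaptic cell's total input is w_total."""
    # Both cell types tile grids of equal total area with coinciding centers.
    pre = (np.arange(n_pre) + 0.5) / n_pre
    post = (np.arange(n_post) + 0.5) / n_post
    dist = np.abs(post[:, None] - pre[None, :])
    dist = np.minimum(dist, 1.0 - dist)          # wraparound boundary
    w = np.exp(-dist ** 2 / (2 * sigma ** 2))    # gaussian falloff
    w[dist > cutoff] = 0.0                       # no synapse beyond cutoff
    # alpha: numerically determined so every row sums to w_total.
    w *= w_total / w.sum(axis=1, keepdims=True)
    return w

# Example: 64 presynaptic BPs feeding 32 postsynaptic GCs (cf. Table 1 sizes).
w = weight_rows(n_pre=64, n_post=32, w_total=9.0, sigma=0.1, cutoff=0.3)
print(w.shape)
```

The full 2D weight is then the product of such a vertical factor and the analogous horizontal factor, as in equation 2.1.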
Table 1: Cellular Parameters.

        τ       b        n × n      d           σ
BP      10.0    −0.0     64 × 64    0.25        0.25
SA      25.0    −0.5     64 × 64    0.25        0.25
LA      20.0    −0.25    32 × 32    1.0         0.5
PA       5.0    −0.025   64 × 64    0.25/9.0^a  0.25/3.0^a
GC       5.0    −0.025   32 × 32    1.0         0.5

Notes: τ: time constant (msec); b: bias; n × n: array size; d: cutoff radius; σ: gaussian radius (see equation 2.5). ^a Inner radius/outer radius.
Table 2: Synaptic Weights.

Post \ Pre   BP        SA         LA         PA                GC
BP           *         −0.375^b   −3.0^b     −3.0^b/−15.0^c    *
SA           3.0^b     *          −3.0^b     0.0^b/−15.0^c     *
LA           3.0^b     *          0.25^a     −3.0^a/−15.0^c    *
PA           0.75^b    −0.75^b    0.25^a     0.25^a/−45.0^c    0.25^a,d
GC           9.0^b     −4.5^b     −4.5^b     0.25^a/−270.0^c   *

Notes: Each entry gives the total integrated weight from all synapses arising from the corresponding presynaptic type (columns) onto each cell of the corresponding postsynaptic type (rows) (the quantity W(k,k′) in equation 2.5). Asterisk = absence of the corresponding connection. Synapse type indicated by superscript: ^a gap junction; ^b nonspiking synapse; ^c spiking synapse. ^d Maximum coupling efficiency (ratio of post- to presynaptic depolarization) for this gap junction synapse: DC = 11.3%, action potential = 2.7%.
2.3 Rate-Modulated Poisson Process. A second method of generating artificial spike trains was based on a minimalist assumption: that firing correlations between cat ganglion cells are due to a common oscillatory input. An oscillatory time series of duration T and temporal resolution Δt, possessing realistic temporal correlations, was constructed as follows. Starting with the discrete frequencies

    f_k = k/T,    0 ≤ k < T/Δt,    (2.6)

the discrete Fourier coefficients were defined as

    C_k = e^(2πir) exp[−(f_k − f_0)² / (2σ²)],    (2.7)

where f_0 is the central oscillation frequency, σ is the width of the spectral peak in the associated power spectrum, and r is a uniform random deviate between 0 and 1 that randomized the phases of the individual Fourier components (generated by the Matlab intrinsic function RAND). These coefficients were converted back to the time domain using the discrete inverse Fourier transform,

    R_n = (A/N) Σ_{k=1}^{N−1} C_k e^(−2πi f_k t_n) + R_0,    (2.8)

where the real part of R_n denotes the value of the time-dependent firing rate at the discrete times t_n = n·Δt, N = T/Δt, A is an empirically determined scale factor, and we have added a constant offset, R_0, which sets the mean firing rate. The quantity A was determined by the formula A = 0.25 + (1.1 − 0.25)(6 + I)/5, where I is the stimulus intensity in log2 units, with values ranging from −6 to −1, and the other coefficients were determined empirically to produce a reasonable match to the retinal model. The quantity A was then normalized relative to the standard deviation of R_n over all time steps with A = 1. Thus, the highest stimulus intensities produced rate fluctuations on the order of the standard deviation due to noise. The width of the frequency spectrum, σ, was given by an analogous formula: σ = 10 − (10 − 6.25)(6 + I)/5. Negative values of R_n were truncated at zero and the resulting time series was rescaled so that its average value remained equal to R_0. The time series defined by R_n was used to generate oscillatory spike trains via a pseudorandom process,

    S_n = θ(R_n Δt − r),    (2.9)

where R_n·Δt is the probability of a spike in the nth time bin, θ is a step function with θ(x < 0) = 0 and θ(x > 0) = 1, and r is again a uniform random deviate. In the limit R_n·Δt ≪ 1, the above procedure reduces to a rate-modulated Poisson process. The same time series, R_n, was used to modulate the firing rate of each element contributing to the artificially generated multiunit spike train, thus producing temporal correlations due to shared input.

2.4 Correlated Poisson Process. To generate artificial spike trains that possess spatial but not temporal correlations, we employed a pseudo-Poisson process to generate a template spike train from which mutually correlated spike trains could be constructed. To produce a random template spike train of duration T, temporal resolution Δt, and mean spike rate R_0, we simply used equation 2.9 with the replacement R_n → R_0:

    S_n^(0) = θ(R_0 Δt − r).    (2.10)
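The rate-modulated generator of equations 2.6 through 2.9 can be sketched as follows, in Python rather than the Matlab used by the authors. This is a minimal illustration; the function name and parameter values are assumptions for the example, not taken from the model.

```python
import numpy as np

def oscillatory_spikes(T=0.6, dt=0.001, f0=80.0, sigma_f=8.0,
                       A=1.0, R0=50.0, seed=0):
    """Sketch of equations 2.6-2.9: a spike train driven by a rate whose
    spectrum has a randomized-phase gaussian peak at f0 (Hz)."""
    rng = np.random.default_rng(seed)
    N = int(round(T / dt))
    fk = np.arange(N) / T                         # eq. 2.6: f_k = k/T
    r = rng.uniform(0.0, 1.0, N)
    # eq. 2.7: gaussian spectral peak with randomized phases
    Ck = np.exp(2j * np.pi * r) * np.exp(-(fk - f0) ** 2 / (2 * sigma_f ** 2))
    tn = np.arange(N) * dt
    # eq. 2.8: inverse transform over k = 1 .. N-1, plus constant offset R0
    Rn = (A / N) * (np.exp(-2j * np.pi * np.outer(tn, fk[1:])) @ Ck[1:]) + R0
    Rn = np.maximum(Rn.real, 0.0)                 # truncate negative rates
    Rn *= R0 / Rn.mean()                          # rescale so the mean stays R0
    # eq. 2.9: spike whenever Rn*dt exceeds a fresh uniform deviate
    return (Rn * dt > rng.uniform(0.0, 1.0, N)).astype(int)

spikes = oscillatory_spikes()   # one 600 msec train at 1 msec resolution
```

Feeding the same rate series R_n to several independent thresholding steps, as the text describes, yields mutually correlated trains whose synchrony stems entirely from the shared oscillatory input.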
To produce spatially correlated spike trains, the template time series, S_n^(0), was used to construct new spike trains according to the formula

    S_n^(k) = θ( S_n^(0) · C Δt + (1 − S_n^(0)) · [R_0 Δt/(1 − R_0 Δt)] · (1 − C Δt) − r ),    (2.11)
where S_n^(k) denotes the spike train of the kth cell, C is the conditional firing rate given that a spike occurred in the corresponding time bin of the template train, and r is again a uniform random deviate. The maximum value of C·Δt was (1 − R_0·Δt)·0.5.

2.5 Data Analysis. Reported correlations are expressed as a fraction of the expected synchrony due to chance, measured during either baseline activity or the plateau portion of the response to a sustained stimulus (200–600 msec). With this normalization, a correlation amplitude of one at zero delay corresponded to a doubling in the number of synchronous events over the expected rate. Correlations were plotted as a function of the delay after averaging over all events occurring during the plateau portion of the response. For each delay value, this average was compensated for edge effects arising from the finite length of the two spike trains. To increase the signal-to-noise ratio, the firing rates or correlations were averaged over all cells, or distinct cell pairs, responding to the same stimulus, producing a multiunit peristimulus time histogram (PSTH) or multiunit cross-correlation histogram (CCH), respectively. Autocorrelation functions were not included in the multiunit CCH, which thus included only cross-correlations between distinct cell pairs. Error bars were estimated by assuming Poisson statistics for the count in each histogram bin. All correlations were obtained by averaging over 200 stimulus trials, using a bin width of 1 msec.

2.6 Bayes Discriminator. A Bayes discriminator was used to distinguish between different intensities based on the distribution of threshold events across independent stimulus trials. For each intensity, the number of suprathreshold events was determined on successive trials and the results normalized as a probability distribution.
For any given pair of intensities, the percentage of correct classifications made by an ideal observer was inversely related to the degree of overlap between the two distributions. Total overlap corresponded to performance at chance (50% correct), while zero overlap implied perfect discrimination (100% correct). For the threshold detection process, all spikes occurring within a given time bin, whose width varied from 1 to 20 msec depending on the experiment, were summed together and the result compared to a threshold. There was no overlap between successive time bins. The duration of the discrimination interval, which varied from 50 to 400 msec, was adjusted so as to approximately normalize the task difficulty as the number of inputs was varied. When the discrimination interval contained the transient portion of the response (0–50 msec), constructing an appropriate Poisson control was less straightforward than during the plateau period, during which the firing rate could be treated as constant. Moreover, because high-frequency oscillations during the response peak were strongly stimulus locked, it was not possible to use the multiunit PSTH to estimate the time-dependent event rates of the equivalent Poisson generators, as these would have contained
the high-frequency stimulus-locked oscillations that the control is intended to eliminate. By using a boxcar filter with a width of 9 msec, however, we were able to modify the multiunit PSTH so as to remove high-frequency components without eliminating the central response peak. The average number of spikes was always the same for both the model ganglion cells and the Poisson controls.

3 Results

The principal characteristics of the retinal model, used to generate most of the artificial spike trains used in this study, are best illustrated by examining the multiunit PSTHs and multiunit CCHs computed from the simulated responses to a narrow bar stimulus of maximum intensity (see Figure 2A). The multiunit PSTH, which combines the responses of all 12 model ganglion cells directly beneath the stimulus, consisted of a sharp peak, about 50 msec wide, followed by a plateau period of sustained elevated activity (see Figure 2B). Relative to baseline, the sustained increase in spike activity produced by the bar stimulus was comparable to, or larger than, the sustained increase in firing exhibited by cat ganglion cells in response to high-contrast features (Creutzfeldt, Sakmann, Scheich, & Korn, 1970; Enroth-Cugell & Jakiela, 1980). Likewise, the pronounced downward notch separating the peak and plateau regions is characteristic of ganglion cell responses to large, flashed stimuli (Cox & Rowe, 1996). Despite the absence of periodic structure in the plateau portion of the multiunit PSTH, there were nonetheless very prominent high-frequency oscillations in the multiunit CCH recorded during the plateau portion of the response (see Figure 2C, solid black line). In the cat retina, high-contrast stimuli produce an approximate doubling in the number of synchronous events relative to the expected level due to chance (Neuenschwander et al., 1999).
In the spike trains generated by the retinal model, the synchronous oscillations evoked by a maximum intensity bar stimulus resulted in a qualitatively similar increase in the number of synchronous events relative to the expected level. Thus, the artificially generated spike trains used in this study reflect physiologically reasonable levels of correlated activity. During the plateau portion of the response, the high-frequency oscillations were not strongly phase-locked to the stimulus onset, but rather tended to become phase randomized over time, as revealed by the decline in correlation strength with increasing delay. Consistent with the absence of stimulus-locked oscillations during the plateau period, the shift predictor was negligible (see Figure 2C, dashed gray line). Correlations between cat ganglion cells recorded during sustained activity decline in a similar manner with increasing delay and likewise possess negligible shift predictors (Neuenschwander et al., 1999). We also computed the multiunit CCH for the 50 msec period following stimulus onset encompassing the response peak (see Figure 2D, solid black
Figure 2: Artificial spike trains generated by the retinal model. (A) A column of 12 model ganglion cells was stimulated by a narrow bar (white rectangle, intensity = 1/2, stimulus dimensions 2 × 12 GC receptive field diameters). Circles indicate GC receptive field diameter. (B) Multiunit peristimulus time histogram (PSTH) obtained by averaging the individual PSTHs over all ganglion cells activated by the stimulus (bin width, 1 msec). The solid line at the bottom of the panel indicates the stimulus duration (600 msec). Vertical ticks denote the peak and plateau portions of the response, 0–50 msec and 200–600 msec, respectively. (C) Multiunit cross-correlation histogram (CCH), obtained by combining individual CCHs from all distinct pairs of ganglion cells activated by the stimulus, measured during the plateau portion of the response (solid black lines). Correlations are expressed as a fractional change from the expected synchrony due to chance (dimensionless units). Shift predictors (dashed gray lines) were obtained by recomputing the multiunit CCH using spike trains from different stimulus trials. (D) Multiunit CCH measured during the response peak (same organization as C). Correlations were larger during the response peak, as was the shift predictor, due to the strong stimulus locking of the high-frequency oscillations.
line), during which the high-frequency oscillations were strongly stimulus-locked, as revealed by the periodic structure in the multiunit PSTH as well as by the large shift predictor (see Figure 2D, dashed gray line). A tendency for high-frequency oscillations to become less stimulus locked as a function of time from stimulus onset is also seen in experimental data (Neuenschwander et al., 1999).

3.1 Firing Correlations and Firing Rate Are Both Modulated by Stimulus Intensity. To investigate the relationship between correlation strength and stimulus contrast, the same narrow bar covering 12 ganglion cells was varied over a 32-fold range of intensities (see Figure 3). Both the peak and plateau firing rates of the model ganglion cells activated by the stimulus, as measured by the multiunit PSTH, increased in a graded manner as the stimulus intensity was raised (see Figure 3B). Similarly, the degree of synchrony during both the plateau and peak portions of the response, assessed by the amplitude of the multiunit CCH at zero delay, also increased with stimulus intensity (see Figures 3C and 3D, respectively). The amplitude of stimulus-evoked high-frequency oscillations between cat ganglion cells depends similarly on luminance contrast (Neuenschwander et al., 1999). As a function of stimulus intensity, synchrony could be modulated over a greater dynamic range, measured relative to baseline activity, than could the fractional change in the multiunit firing rate, during both the response plateau (see Figure 3D) and the response peak (see Figure 3E). Overall, firing correlations were very sensitive to stimulus intensity and thus might convey information about luminance contrast in addition to that represented by the mean firing rate across the ensemble of activated cells.

3.2 Correlated Inputs Transmit More Information Through Coincidence Detectors than Independent Rate-Matched Controls.
A simple threshold detector with a short integration time window and a low rate of suprathreshold background events was able to extract the stimulus intensity more reliably from a hybrid rate/correlation code than from an equal number of statistically independent, rate-matched Poisson event generators. Spikes from a column of 12 ganglion cells activated by a narrow bar were summed into a simple threshold detector (see Figure 4). The event rate of the detector was determined by the total number of time bins on which the input crossed threshold during a 200 msec epoch within the plateau portion of the response. For these experiments, the detection threshold was set to a level that required three or more spikes to arrive within the same 2 msec time bin in order to produce a detector event. To assess the extent to which the detector was able to utilize firing correlations between its inputs, the 12 stimulated ganglion cells were replaced by independent Poisson generators that, on average, produced the same number of spikes per unit time. For both correlated and Poisson input, the baseline detector event rate was very low, around 1 Hz (see Figure 4A). A high-intensity stimulus produced
a greater increase in the detector event rate when the inputs were correlated as compared to the case when the inputs were independent. This extra sensitivity reflected the fact that a threshold process with a short integration time is well suited for detecting rare synchronous events (Kenyon, Fetz, & Puff, 1990; Kenyon, Puff, & Fetz, 1992), a property also exhibited by cortical and subcortical neurons (Alonso et al., 1996; Usrey et al., 2000). When driven by correlated input from the model ganglion cells, the output of the threshold detector allowed for better discrimination between different stimulus intensities than when driven by independent Poisson generators. The probability of observing a given number of detection events was plotted for a range of stimulus intensities (see Figure 4B). As the stimulus intensity increased, the distributions shifted to the right, reflecting the greater number of suprathreshold inputs. When the threshold detector was driven by correlated input from the retinal model, the event distributions for different intensities were more separable than when driven by the rate-matched controls.
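The threshold detection process described above — pooling the input spike trains and registering an event whenever three or more spikes land in the same 2 msec bin — can be sketched as follows. This is a minimal illustration with a hypothetical helper name; the inputs here are generic Poisson spikes, not the retinal model's output.

```python
import numpy as np

def detector_events(spike_trains, bin_ms=2, threshold=3):
    """Coincidence-style threshold detector: sum the population spike
    count in non-overlapping bins and count the bins whose summed input
    reaches threshold.  spike_trains is a (cells x 1-msec-bins) array."""
    summed = spike_trains.sum(axis=0)               # pool all inputs
    n_bins = summed.size // bin_ms
    binned = summed[:n_bins * bin_ms].reshape(n_bins, bin_ms).sum(axis=1)
    return int((binned >= threshold).sum())         # suprathreshold bins

# 12 independent ~20 Hz Poisson cells over a 200 msec analysis epoch
rng = np.random.default_rng(1)
trains = (rng.uniform(size=(12, 200)) < 0.02).astype(int)
n_events = detector_events(trains)
```

With a 2 msec window and a threshold of three coincident spikes, uncorrelated 20 Hz inputs produce very few events, which is why synchronous input drives the detector so much more effectively.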
For many pairwise intensity discriminations, synchronous oscillations allowed approximately 10 additional trials out of 100 to be correctly classified (solid black lines) compared to the Poisson control (dashed gray lines). Averaged
Figure 3: Facing page. Firing correlations are modulated over a greater dynamic range than firing rate as a function of stimulus intensity. The stimulus was again a narrow bar covering 12 ganglion cells. (A) Intensity series formed by the multiunit PSTHs of ganglion cells activated by the stimulus. The intensity, in log2 units, is indicated at the upper right of each histogram (bin width = 10 msec). (B) Intensity series formed by the multiunit CCHs between all pairs of ganglion cells activated by the stimulus, relative to the baseline level of synchrony. Firing correlations during the plateau response are strongly modulated by stimulus intensity. (C) Intensity series formed by the multiunit CCHs recorded during the response peak (same organization as B). (D, E) Fractional change from baseline in synchrony (black lines, circles) as a function of stimulus intensity, compared to the fractional change in firing rate (gray lines, squares), recorded during the (D) plateau and (E) peak portions of the response. In both cases, synchrony was modulated over a greater dynamic range than the firing rate.
over all intensity increments, the extra number of correct classifications was approximately 5 in 100, but this value is likely conservative due to saturation. Our results suggest that when retinal output is decoded by a threshold detection process with a short integration time window, synchronous oscillations allow information to be transmitted more reliably than would be the case with independent firing rates. During the peak portion of the response, firing correlations also permitted better discrimination of intensity increments, as represented by the output of the threshold detector applied to the first 50 msec of the response (see Figure 5). However, constructing an appropriate Poisson control was less straightforward than during the plateau period, as the high-frequency oscillations occurring during the response peak were strongly stimulus locked (see Figure 5A, left bottom panel, smooth trace). In order to produce control spike trains that contained only firing correlations due to the transient itself, we used a boxcar filter with a width of 9 msec to remove high-frequency oscillations from the multiunit PSTH without eliminating the central response peak. The average number of input spikes arriving during the response peak was the same for both the model ganglion cells and the time-dependent Poisson control. The distribution of detector events was more separable when input was correlated by high-frequency oscillations than when input was correlated only by the transient response peak (see Figure 5B). A Bayesian

Figure 4: Facing page. Firing correlations during the plateau portion of the response allow improved discrimination of stimulus intensity compared to independent Poisson input. (A) Example of the threshold detection process. Stimuli consisted of a narrow bar presented at various intensities (same stimulus as in Figure 3).
Ganglion cell input to the threshold detector during the plateau portion of the response is shown on the left and equivalent Poisson input on the right. The baseline activity of the detector in the absence of stimulation (top row) is very low. A stimulus with intensity = −1 in log2 units (bottom row) produced strong correlations between ganglion cells, resulting in a relatively large number of suprathreshold events. Dashed line: detector threshold. Dotted lines: average input ± S.D. Summation window, 2 msec. (B) Normalized probability distributions giving, for each stimulus intensity, the relative number of suprathreshold events during the 200 msec analysis interval. (Left) Retinal model. (Right) Poisson control. The distributions of suprathreshold events produced by the retinal model were more separable. (C) Percentage correct intensity classifications by an ideal observer. The ideal observer chose between two equally likely stimulus intensities based on the number of detector events on each trial. Input to the detector came from either the retinal model (solid black lines) or Poisson controls (dashed gray lines). Each point represents the fraction of trials on which the intensity indicated by the abscissa was correctly distinguished from a lower intensity, denoted by the intersection of each line with the x-axis. Error bars computed assuming binary statistics for the overlap between each pair of distributions (omitted from Poisson controls for clarity).
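The ideal-observer rule described in section 2.6 — percentage correct inversely related to the overlap of the two event-count distributions, with total overlap giving 50% and zero overlap giving 100% — can be sketched as follows. A minimal illustration; the function name and example histograms are hypothetical.

```python
import numpy as np

def percent_correct(p1, p2):
    """Two-alternative ideal-observer performance from distribution
    overlap: identical distributions give chance (50%), disjoint
    distributions give perfect discrimination (100%).  p1 and p2 are
    normalized event-count histograms over the same bins."""
    overlap = np.minimum(p1, p2).sum()
    return 100.0 * (1.0 - 0.5 * overlap)

# identical distributions -> chance; disjoint distributions -> perfect
p_a = np.array([0.5, 0.5, 0.0, 0.0])
p_b = np.array([0.0, 0.0, 0.5, 0.5])
percent_correct(p_a, p_a)   # 50.0
percent_correct(p_a, p_b)   # 100.0
```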
Figure 5: Firing correlations during the response peak allow improved discrimination of stimulus intensity. Same organization as in Figure 4. To remove the high-frequency stimulus-locked oscillations present in the response peak (A, bottom left), the multiunit PSTH was smoothed by replacing the measured firing rate in each bin by the average of the surrounding nine bins (A, bottom right). Firing correlations during the response peak allowed for improved discrimination of stimulus intensity compared to the case where correlations were due solely to the response transient itself (B,C).
discriminator analysis again confirmed that on average, 5 additional trials out of 100 could be correctly classified when input to the threshold detector was correlated by stimulus-dependent, synchronous oscillations as compared to the nonoscillatory control (see Figure 5C). These results show that gamma band oscillations can convey relevant stimulus information even when superimposed on other sources of stimulus-dependent firing correlations. The above results presumably depend on the detailed nature of the firing correlations present in the input spike trains. In particular, two sets of input spike trains might possess similar correlations as assessed by the multiunit CCH, an average over many independent trials, but still differ significantly with respect to the correlations present on individual trials. Our retinal model provides a biologically plausible method for generating realistic correlations, but as with any other model of complex neural circuitry, it is impossible to know to what extent the proposed physiological mechanisms are correct. To control for the possibility that the correlations produced by our retinal model might have inadvertently biased our results, a second set of input spike trains was generated using a very different mechanism. In contrast to the inhibitory feedback dynamics employed in our retinal model, the second model required only a source of common input to modulate, in phase, the firing rates of the stimulated cells. The parameters of the common input model were adjusted so that both approaches yielded similar multiunit CCHs over a range of stimulus intensities (see Figure 6A). Despite employing very different underlying mechanisms, both models supported very similar levels of stimulus discrimination, with the difference in total performance being less than 1 trial in 100 (see Figure 6B). These results indicate that the information conveyed by spatiotemporal correlations is unlikely to be an artifact of a particular mode of generation, but rather reflects a general property of neural populations exhibiting synchronous gamma band oscillations.

3.3 Correlations Mediate Greater or Equivalent Stimulus Discrimination for a Wide Class of Threshold Detectors. The hybrid rate/correlation code continued to mediate equal or superior stimulus discrimination even when the integration time of the threshold detector was increased so as to diminish the importance of synchronous inputs. To quantify the performance of the ideal observer for a given threshold detector, we defined the quantity Δ, which gives the difference in the percentage of correctly classified trials using correlated input compared to the Poisson control, averaged over all intensity pairs. Our results show that correlated input mediates greater or equivalent stimulus discrimination over a wide range of integration times and background activity levels (see Figure 7). Plotted as a function of integration time, Δ was largest for small summation intervals well suited to resolve synchronous inputs (see Figure 7A). When plotted as a function of the detection threshold, Δ generally increased for integration times less
Figure 6: Different mechanisms for generating synchronous oscillations yield similar stimulus discrimination. (A, left) Multiunit CCHs generated by the retinal model in response to stimuli of varying intensity, starting with no stimulus (bottom row) and ranging from −6 to −1 (log2 units, top row). Multiunit CCHs are expressed as a fraction of the asymptotic baseline level. (A, right) Multiunit CCHs produced by an equal number of Poisson event generators whose rates were modulated by a source of common oscillatory input (same organization as in the left column). Both sets of artificial spike trains exhibited similar correlations as measured by the multiunit CCH, but their correlations may have differed at the single-trial level. (B) Despite employing very different mechanisms, stimulus discrimination was similar for the two models (same parameters as used in Figure 4).
than approximately 10 msec, since larger thresholds produced lower values of background activity and thus made the detection process more sensitive to synchrony (see Figure 7B). A somewhat surprising aspect of our results was that as the integration time became large enough to effectively discard intensity information encoded by the degree of synchrony, Δ did not become strongly negative, but rather approached an asymptotic level near zero. Since synchrony adversely affects the amount of information that can be extracted from the average firing rate over a population of similarly activated neurons (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998), it might have been anticipated that once the integration time became sufficiently long to discount any information encoded by the degree of synchrony, independent Poisson inputs would have mediated greater stimulus discrimination than correlated inputs. However, such reasoning fails to consider the effects of stimulus-evoked oscillations, which like synchrony also increased as a function of stimulus intensity, as indicated by the increased persistence of periodic structure in the multiunit CCH as the stimulus intensity was increased (see Figure 3). As a result of the stronger oscillations evoked by higher stimulus intensities, the spike trains became more regular, and thus the total number of input spikes over the course of each 200 msec discrimination trial became a more reliable predictor of the stimulus intensity. To quantify the reliability of the afferent spike trains, we computed the Fano factor of the multiunit input as a function of stimulus intensity (see Figure 7C). The Fano factor is defined as the variance in the number of spikes divided by the mean and is equal to one for a Poisson process (Teich, 1989).
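The Fano factor just defined can be computed directly from trial-by-trial spike counts. A minimal sketch; the simulated Poisson counts below are illustrative, not the model's output.

```python
import numpy as np

def fano_factor(counts):
    """Fano factor as defined in the text: variance of the spike count
    across trials divided by its mean (equal to 1 for a Poisson process)."""
    counts = np.asarray(counts, dtype=float)
    return counts.var() / counts.mean()

# Poisson trial counts give a Fano factor near 1
rng = np.random.default_rng(0)
poisson_counts = rng.poisson(lam=20.0, size=5000)
ff = fano_factor(poisson_counts)   # close to 1

# perfectly regular counts give a Fano factor of 0
fano_factor([7, 7, 7, 7])
```

Regular (oscillation-stabilized) spike counts push the Fano factor below one, while synchrony without oscillations inflates the count variance and pushes it above one, as the text goes on to describe.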
At all intensities, the Fano factor of the combined correlated input (solid line) was less than that of the Poisson control (dashed line), indicating that the stimulus-evoked oscillations caused the total number of spikes to be less variable, and therefore more reliable, than for independent Poisson generators. In the absence of oscillations, the Fano factor would have increased markedly with stimulus intensity as a consequence of the increased synchrony. To illustrate this fact, a second control was employed in which stimulus-dependent synchrony was introduced in the absence of oscillations. The maximum synchrony was set approximately equal to the maximum synchrony present in the hybrid rate/correlation code, while the minimum synchrony was set to zero and a linear interpolation was used for intermediate intensities. By allowing no more than one spike in each 10 msec time bin, thereby introducing a refractory period, the Fano factor in the absence of spatial correlations could be reduced to approximately the same level exhibited by the model during background activity. In the absence of oscillations, strong synchrony produced a large rise in the Fano factor of the combined input as the stimulus intensity was increased (dotted line). Our results demonstrate that a refractory period by itself cannot account for the reliability of the total integrated input in the presence of strong synchrony, but that such reliability results naturally from the asso-
2282
G. Kenyon, J. Theiler, J. George, G. Travis, and D. Marshak
20
' % correct
A
0
10
5
B 20
15
20
integration time (msec)
' % correct
1 msec
0
20 msec 2
6
4
C 1.5
Fano factor
8
10
threshold
correlated Poisson Poisson
1
0.5
0
model
-f
-5
-3
log2 intensity
-1
Correlated Firing Improves Discrimination
2283
ciated high-frequency oscillations. Thus, stimulus-dependent synchronous oscillations do not necessarily result in less information transfer, even when only the total spike count over a relatively long interval is considered.

3.4 Hybrid Rate/Correlation Codes Work Best over Limited Numbers of Cells. Up to this point, we have considered only the effects of correlations between relatively small numbers of inputs. While this is consistent with the convergence ratios of sensory input onto neurons in the lateral geniculate nucleus and V1 (Reid & Alonso, 1995; Usrey, Reppas, & Reid, 1999), hybrid rate/correlation codes might cease to be advantageous as the number of inputs is increased. To explore the behavior of a hybrid rate/correlation code as a function of the number of input spike trains, we used a much larger stimulus that activated a 12 × 12 array of neurons. Oscillations between retinal ganglion cells increase markedly in response to the larger stimuli (Ariel et al., 1983; Ishikane et al., 1999; Neuenschwander et al., 1999), and this was also true in our retinal model. However, because the maximum correlation strength between cat alpha cells has not been precisely determined for such stimuli, our results should be interpreted only as a rough estimate of the information that might be conveyed by a hybrid rate/correlation code as the number of inputs increases. As a function of the number of input spike trains, Δ% correct reached a maximum for a relatively small number of correlated inputs, between 10 and 50 (see Figure 8A). For integration times less than 5 msec, Δ% correct remained greater than or equal to zero regardless of the number of inputs. For a threshold process employing longer integration times, correlations produced progressively poorer stimulus discrimination as the number of inputs increased.

Figure 7: Facing page. Hybrid rate/correlation codes yield superior or equivalent stimulus discrimination for a broad class of threshold detectors. (A) Δ% correct, the difference in the percentage of successfully classified trials using a hybrid code as opposed to independent Poisson controls, plotted as a function of the temporal integration window. Δ% correct declined with increasing integration time but did not become strongly negative. Individual points are for different thresholds. (B) Δ% correct versus detection threshold for different integration times. Same data as in previous panel. The increase in Δ% correct with threshold declines progressively at longer integration times. (C) Fano factors (variance/mean of the total input spike count) plotted versus stimulus intensity. Solid line + circles: retinal model. Dashed line + squares: Poisson control. Dotted line + triangles: modified Poisson process in which the separate generators were synchronized by an amount proportional to the stimulus intensity, but in which no oscillations were present. To account for the effects of a refractory period on the Fano factor, no more than one spike was allowed to occur in any 10 msec time bin. The Fano factor for the retinal model remained less than one due to the presence of high-frequency oscillations, while synchrony alone produced a large increase in variability relative to independent Poisson inputs.
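The Fano factor used in panel C is simply the variance of the total spike count across repeated trials divided by its mean. A minimal sketch of the computation follows; the rates and sample sizes are hypothetical stand-ins, not the authors' retinal model:

```python
import numpy as np

rng = np.random.default_rng(0)

def fano_factor(counts):
    """Variance/mean of total spike counts across repeated trials."""
    counts = np.asarray(counts, dtype=float)
    return counts.var() / counts.mean()

# Independent Poisson inputs: Fano factor is 1 in expectation.
poisson_counts = rng.poisson(lam=20.0, size=10000)

# Regularized counts (oscillation-like, sub-Poisson variability): Fano < 1.
regular_counts = 20 + rng.integers(-2, 3, size=10000)

print(fano_factor(poisson_counts))  # close to 1
print(fano_factor(regular_counts))  # well below 1
```

Sub-Poisson variability (Fano factor below one) is the signature, noted in the caption, of the regularizing effect of high-frequency oscillations on the pooled input.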
2284
G. Kenyon, J. Theiler, J. George, G. Travis, and D. Marshak
[Figure 8 appears here. Panel A plots Δ% correct against the number of inputs for 1, 2, and 10 msec integration times; panel B plots the Fano factor against the number of inputs.]

Figure 8: Hybrid rate/correlation codes work optimally for a limited number of input neurons. These experiments used a 12 × 12 uniform spot that produced larger synchronous oscillations than did a narrow bar. (A) Δ% correct versus the number of input cells plotted for several different integration times. Δ% correct reached a peak at between 10 and 50 inputs and remained positive as the number of inputs increased as long as the integration time was small, but became negative for the same number of inputs when the integration time was too long to resolve synchronous inputs. (B) Fano factor versus number of inputs plotted for several different intensities (log2 units). At high intensities, which produce strong synchronous oscillations, the Fano factor increased sharply as a function of the number of inputs, thus accounting for the poorer performance of the hybrid code in this regime.
As the number of inputs was raised, we increased the threshold so as to maintain the background detection rate as close to 1 Hz as possible, but without falling below 0.1 Hz. As the number of inputs increased, it was necessary to use a lower threshold for the Poisson control than for the correlated input, due to the small oscillations present in the background activity, in order to maintain the level of background suprathreshold events close to the target level of 1 Hz. We obtained similar results when the target level of background suprathreshold events was allowed to increase, but at smaller values of Δ% correct on average. In order to maintain task difficulty as the number of inputs was increased, the duration of each discrimination trial was progressively lowered from 400 to 100 msec.

The dependence of the hybrid code on the number of cells feeding into the detector is paralleled by the behavior of the Fano factor of the combined input (see Figure 8B). When the number of inputs was small, the Fano factor was always less than one, regardless of the stimulus intensity. As the number of inputs increased, the Fano factor became much greater than one at all but the smallest stimulus intensities. This result is consistent with previous studies showing that synchrony becomes progressively more destructive of information encoded by the average firing rate as the number of neurons contributing to the average increases (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998). When the number of inputs is very large, therefore, a hybrid rate/correlation code is likely to be effective only when the integration time is small enough to resolve the stimulus information encoded directly by the degree of synchrony itself.
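The suprathreshold-event counting at the heart of these experiments can be sketched in a few lines: pool the input spike trains, sum the pooled train over a sliding integration window, and count the windows that cross threshold. Every parameter below (1 ms bins, rates, window length, threshold) is a hypothetical stand-in used only to illustrate the mechanics, not the paper's fitted retinal model:

```python
import numpy as np

rng = np.random.default_rng(1)

def suprathreshold_events(spike_trains, window, threshold):
    """Count windows in which the pooled input crosses threshold.
    spike_trains: (n_cells, n_bins) array of 0/1 spikes (1 ms bins assumed)."""
    pooled = spike_trains.sum(axis=0)                       # pool all inputs
    integrated = np.convolve(pooled, np.ones(window), mode="valid")
    return int((integrated >= threshold).sum())

def poisson_trains(n_cells, n_bins, rate_per_bin):
    """Independent Bernoulli/Poisson-like spike trains (hypothetical rates)."""
    return rng.random((n_cells, n_bins)) < rate_per_bin

# Two stimulus intensities: the stronger one drives more pooled spikes,
# so it produces more suprathreshold events per trial.
weak   = poisson_trains(n_cells=20, n_bins=400, rate_per_bin=0.02)
strong = poisson_trains(n_cells=20, n_bins=400, rate_per_bin=0.05)

print(suprathreshold_events(weak,   window=5, threshold=6))
print(suprathreshold_events(strong, window=5, threshold=6))
```

Discrimination then amounts to classifying trials by their event counts; raising the threshold as inputs are added, as described above, keeps the background event rate near the 1 Hz target.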
4 Discussion

Several influential studies of how firing correlations affect the representation of information in neural ensembles focused primarily on how synchrony impedes accurate estimation of the average firing rate of the population (Mazurek & Shadlen, 2002; Shadlen & Newsome, 1994, 1998). In particular, these studies emphasized that averaging over ensembles of similarly activated neurons reduces the trial-to-trial variability in the estimated firing rate only to the extent that the individual cells are uncorrelated. From this point of view, synchrony is an inevitable feature of densely interconnected networks and, as such, limits the effective size of neural ensembles to several tens of strongly correlated cells. However, these studies generally ignored the possibility that synchronous oscillations could themselves directly encode stimulus information and thus compensate for the attendant loss of the information encoded by the ensemble-averaged firing rate. Here, we have demonstrated that realistic synchronous oscillations can lead to improved information transmission under fairly general assumptions. In particular, synchrony, when evoked in an intensity-dependent manner, can significantly improve information transfer through threshold neurons functioning as coincidence detectors. In a complementary fashion, oscillations,
when also proportional to intensity, mediate greater stimulus discrimination by making the total number of input spikes more reliable regardless of the integration time window employed. General principles of population encoding and decoding, especially those involving correlations between multiple neurons, are still very difficult to investigate experimentally. To reproduce the results of the present model, it would be necessary to record simultaneously from multiple input neurons and at least one common postsynaptic neuron and to manipulate the firing correlations between the input neurons through direct multielectrode stimulation or via pharmacological techniques. While it is not clear that such biological experiments are currently feasible, there does exist sufficient information to construct artificial spike trains possessing physiologically realistic correlations. Moreover, it is not necessary to possess a complete understanding of the physiological mechanisms that give rise to strong correlations in order to analyze how their information content might be extracted using a biologically plausible mechanism. Here, we have focused on signal detection using a threshold process that captures certain essential aspects of single-neuron dynamics. Moreover, we were able to compare the stimulus discrimination accomplished by a hybrid rate/correlation code with that mediated by a pure rate code that produced, on average, the same number of spikes. Finally, by using a computational model, we were able to examine the effects of synchronous oscillations over a wide class of threshold detection processes, and in this way we obtained insights into the physiological conditions necessary to use a hybrid rate/correlation code. The main drawback of using a computational model is the need to ascertain whether the results are physiologically relevant. 
Fortunately, synchronous oscillations between retinal ganglion cells have been sufficiently well characterized so as to impose tight constraints on artificially generated spike trains. Where possible to verify, the model retina appeared to favor a rate code over a correlation code. This was the case for the narrow bar stimulus employed in our first several experiments. The maximum synchrony between the model spike trains, measured relative to chance and averaged over all cells responding to the bar, was somewhat less than the levels of synchrony between widely separated cells recorded in the cat retina in response to similar stimuli (Neuenschwander & Singer, 1996). Likewise, the maximum sustained increase in firing rate in our artificial spike trains was probably exaggerated due to the lack of several known adaptation mechanisms in the model, particularly those present in the outer retina, although our model did account in some degree for adaptation effects mediated by synaptic inhibition. Our results were also useful for illustrating general phenomena, such as how hybrid rate/correlation codes might scale as the number of cells conveying redundant information increased. The effectiveness of a hybrid rate/correlation code was found to involve a trade-off between the loss of information in the ensemble-averaged firing rates due to synchrony
and a corresponding gain in information due to the information encoded by the synchronous oscillations themselves. At longer integration times that were insensitive to synchronous input, stimulus-dependent oscillations still contributed to improved performance on intensity discrimination tasks by causing spike counts to become more regular. As the number of neurons increased, the loss of rate-coded information due to synchrony became more severe.

Our results are likely to be helpful for interpreting synchronous oscillations in the retina and elsewhere in the central nervous system. Gamma band oscillations are ubiquitous in the vertebrate retina, having been measured extracellularly in cats (Laufer & Verzeano, 1967; Neuenschwander et al., 1999; Neuenschwander & Singer, 1996; Steinberg, 1966), rabbits (Ariel et al., 1983), frogs (Ishikane et al., 1999), and mudpuppy (Wachtmeister & Dowling, 1978), as well as in the ERGs of humans (De Carli et al., 2001; Wachtmeister, 1998) and primates (Frishman et al., 2000). The conservation of retinal oscillations across such a broad range of vertebrate species suggests they may be important for visual function. Synchronous oscillations have also been recorded elsewhere in the mammalian nervous system, including visual (Gray, König, Engel, & Singer, 1989; Gray & Singer, 1989; Kreiter & Singer, 1996; Livingstone, 1996) and sensorimotor (Murthy & Fetz, 1996a, 1996b) cortex and the hippocampus (Traub, Whittington, Colling, Buzsaki, & Jefferys, 1996). Numerous interpretations of the information-processing function accomplished by synchronous oscillations have been suggested (Engel et al., 2001; Fries et al., 2001, 2002; Singer & Gray, 1995; Srinivasan et al., 1999). Here, we note that since synchronous oscillations are widely present throughout the brain, the nervous system might as well make use of them.
If correlations are indeed an unavoidable consequence of neural connectivity, our results suggest that the loss of information encoded by the average firing rate could be mitigated by employing a hybrid rate/correlation code. Correlations may also convey information that is not well represented by the ensemble-average firing rate. Our results suggest that the brain might exploit synchronous oscillations to represent additional types of information while compensating for the loss of information encoded by the ensemble-averaged firing rate.

Several of our results were predicated on an analysis of the correlations present during the plateau portion of the response, using a discrimination window of 200 msec. While the majority of stimulus information is probably transmitted during the first 50 msec encompassing the response peak, additional information is likely transmitted by later components as well. The mean intersaccade interval for primates during a free viewing visual search task is approximately 200 msec (Mazer, Gallant, & Gustavsen, 2003), while optimal temporal frequencies for neurons in areas 17/18 in the cat visual cortex are typically in the range of a few Hz (Movshon, Thompson, & Tolhurst, 1978), implying sustained activations on the order of several hundred msec. Moreover, stimulus-related increases in gamma band power have recently been shown to persist for at least 1 sec during free viewing conditions (Salazar, Kayser, & König, 2004). Thus, the 200 msec time window used in several of our experiments is broadly consistent with relevant physiological timescales. In addition, we found that stimulus-dependent synchronous oscillations mediated improved discrimination even for analysis windows confined to the first 50 msec following stimulus onset, despite the presence of strong additional correlations due to the response peak itself in both the model and control data.

It has been argued that information transfer is maximized when retinal outputs are uncorrelated (Srinivasan, Laughlin, & Dubs, 1982), and recent studies indicate that the relatively strong firing correlations sometimes observed between neighboring ganglion cells convey little additional information about natural scenes (Nirenberg, Carcieri, Jacobs, & Latham, 2001). Our results are not in conflict with these findings. In our analysis, retinal inputs were pooled into a single variable, roughly corresponding to a postsynaptic membrane potential. The only benefit of statistical independence was therefore a possible improvement in the signal-to-noise ratio of the pooled input. Moreover, information theory by itself does not address how stimulus properties are extracted by target neurons. While formal mathematical measures predict that ganglion cells convey more information when their activity is uncorrelated, experimental and theoretical evidence suggests that synchronous inputs are particularly salient (Alonso et al., 1996; Kenyon et al., 1990; Usrey et al., 2000), and our current results demonstrate that information encoded by firing correlations between model ganglion cells can be efficiently extracted by threshold neurons under fairly general assumptions.
Simultaneous recordings in cat from the retina, the LGN, and area 18 of the visual cortex indicate that synchrony between retinal ganglion cells can be propagated to higher levels in the visual system (Castelo-Branco, Neuenschwander, & Singer, 1998). It may therefore be necessary to consider the impact of correlated activity in order to fully account for its role in information processing. Finally, we note that using firing correlations to encode local stimulus properties does not preclude additional encoding functions that have been suggested (Meister & Berry, 1999).
Acknowledgments

We acknowledge useful discussions with Rob Smith and Greg Stephens. This work was supported by the Lab Directed Research and Development Program at the Los Alamos National Laboratory, the Department of Energy Office of Nonproliferation Research and Engineering, and the MIND Institute for Functional Brain Imaging. D.W.M. was supported by the National Eye Institute EY06472 and the National Institute of Neurological Disorders and Stroke NS38310.
References

Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comput., 11, 91–101.
Alonso, J. M., Usrey, W. M., & Reid, R. C. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383, 815–819.
Ariel, M., Daw, N. W., & Rader, R. K. (1983). Rhythmicity in rabbit retinal ganglion cell responses. Vision Research, 23, 1485–1493.
Castelo-Branco, M., Neuenschwander, S., & Singer, W. (1998). Synchronization of visual responses between the cortex, lateral geniculate nucleus, and retina in the anesthetized cat. J. Neurosci., 18, 6395–6410.
Cohen, E., & Sterling, P. (1990). Convergence and divergence of cones onto bipolar cells in the central area of cat retina. Philosophical Transactions of the Royal Society of London—Series B: Biological Sciences, 330, 323–328.
Cox, J. F., & Rowe, M. H. (1996). Linear and nonlinear contributions to step responses in cat retinal ganglion cells. Vision Research, 36, 2047–2060.
Creutzfeldt, O. D., Sakmann, B., Scheich, H., & Korn, A. (1970). Sensitivity distribution and spatial summation within receptive-field center of retinal on-center ganglion cells and transfer function of the retina. J. Neurophysiol., 33, 654–671.
De Carli, F., Narici, L., Canovaro, P., Carozzo, S., Agazzi, E., & Sannita, W. G. (2001). Stimulus- and frequency-specific oscillatory mass responses to visual stimulation in man. Clin. Electroencephalogr., 32, 145–151.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Engel, A. K., Fries, P., & Singer, W. (2001). Dynamic predictions: Oscillations and synchrony in top-down processing. Nat. Rev. Neurosci., 2, 704–716.
Enroth-Cugell, C., & Jakiela, H. G. (1980). Suppression of cat retinal ganglion cell responses by moving patterns. J. Physiol., 302, 49–72.
Freed, M. A. (2000). Rate of quantal excitation to a retinal ganglion cell evoked by sensory input. J. Neurophysiol., 83, 2956–2966.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Fries, P., Schroder, J. H., Roelfsema, P. R., Singer, W., & Engel, A. K. (2002). Oscillatory neuronal synchronization in primary visual cortex as a correlate of stimulus selection. J. Neurosci., 22, 3739–3754.
Frishman, L. J., Saszik, S., Harwerth, R. S., Viswanathan, S., Li, Y., Smith, E. L., III, Robson, J. G., & Barnes, G. (2000). Effects of experimental glaucoma in macaques on the multifocal ERG. Multifocal ERG in laser-induced glaucoma. Doc. Ophthalmol., 100, 231–251.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A., 86, 1698–1702.
Ishikane, H., Kawana, A., & Tachibana, M. (1999). Short- and long-range synchronous activities in dimming detectors of the frog retina. Vis. Neurosci., 16, 1001–1014.
Kenyon, G. T., Fetz, E. E., & Puff, R. D. (1990). Effects of firing synchrony on signal propagation in layered networks. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 141–148). San Mateo, CA: Morgan Kaufmann.
Kenyon, G. T., Moore, B., Jeffs, J., Denning, K. S., Stephens, G. J., Travis, B. J., George, J. S., Theiler, J., & Marshak, D. W. (2003). A model of high-frequency oscillatory potentials in retinal ganglion cells. Vis. Neurosci., 20, 465–480.
Kenyon, G. T., Puff, R. D., & Fetz, E. E. (1992). A general diffusion model for analyzing the efficacy of synaptic input to threshold neurons. Biol. Cybern., 67, 133–141.
Kreiter, A. K., & Singer, W. (1996). Stimulus-dependent synchronization of neuronal responses in the visual cortex of the awake macaque monkey. J. Neurosci., 16, 2381–2396.
Laufer, M., & Verzeano, M. (1967). Periodic activity in the visual system of the cat. Vision Res., 7, 215–229.
Livingstone, M. S. (1996). Oscillatory firing and interneuronal correlations in squirrel monkey striate cortex. J. Neurophysiol., 75, 2467–2485.
Mastronarde, D. N. (1989). Correlated firing of retinal ganglion cells. Trends in Neurosciences, 12, 75–80.
Mazer, J. A., Gallant, J. L., & Gustavsen, K. (2003). Goal-related activity in V4 during free viewing visual search: Evidence for a ventral stream visual salience map. Neuron, 18, 1241–1250.
Mazurek, M. E., & Shadlen, M. N. (2002). Limits to the temporal fidelity of cortical spike rate signals. Nat. Neurosci., 5, 463–471.
Meister, M., & Berry, M. J., II. (1999). The neural code of the retina. Neuron, 22, 435–450.
Movshon, J. A., Thompson, I. D., & Tolhurst, D. J. (1978). Spatial and temporal contrast sensitivity of neurones in areas 17 and 18 of the cat's visual cortex. J. Physiol., 283, 101–120.
Murthy, V. N., & Fetz, E. E. (1996a). Oscillatory activity in sensorimotor cortex of awake monkeys: Synchronization of local field potentials and relation to behavior. J. Neurophysiol., 76, 3949–3967.
Murthy, V. N., & Fetz, E. E. (1996b). Synchronization of neurons during local field potential oscillations in sensorimotor cortex of awake monkeys. J. Neurophysiol., 76, 3968–3982.
Neuenschwander, S., Castelo-Branco, M., & Singer, W. (1999). Synchronous oscillations in the cat retina. Vision Res., 39, 2485–2497.
Neuenschwander, S., & Singer, W. (1996). Long-range synchronization of oscillatory light responses in the cat retina and lateral geniculate nucleus. Nature, 379, 728–732.
Nirenberg, S., Carcieri, S. M., Jacobs, A. L., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Reid, R. C., & Alonso, J. M. (1995). Specificity of monosynaptic connections from thalamus to visual cortex. Nature, 378, 281–284.
Salazar, R. F., Kayser, C., & König, P. (2004). Effects of training on neuronal activity and interactions in primary and higher visual cortices in the alert cat. J. Neurosci., 24, 1627–1636.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B Biol. Sci., 216, 427–459.
Srinivasan, R., Russell, D. P., Edelman, G. M., & Tononi, G. (1999). Increased synchronization of neuromagnetic responses during conscious perception. J. Neurosci., 19, 5435–5448.
Steinberg, R. H. (1966). Oscillatory activity in the optic tract of cat and light adaptation. J. Neurophysiol., 29, 139–156.
Teich, M. C. (1989). Fractal character of the auditory neural spike train. IEEE Trans. Biomed. Eng., 36, 150–160.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsaki, G., & Jefferys, J. G. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol. (Lond.), 493, 471–484.
Usrey, W. M., Alonso, J. M., & Reid, R. C. (2000). Synaptic interactions between thalamic inputs to simple cells in cat visual cortex. J. Neurosci., 20, 5461–5467.
Usrey, W. M., Reppas, J. B., & Reid, R. C. (1999). Specificity and strength of retinogeniculate connections. J. Neurophysiol., 82, 3527–3540.
van Rossum, M. C., O'Brien, B. J., & Smith, R. G. (2003). Effects of noise on the spike timing precision of retinal ganglion cells. J. Neurophysiol., 89, 2406–2419.
Vaney, D. I. (1990). The mosaic of amacrine cells in the mammalian retina. In N. N. Osborne & G. J. Chader (Eds.), Progress in retinal research (pp. 49–100). Oxford: Pergamon Press.
Wachtmeister, L., & Dowling, J. E. (1978). The oscillatory potentials of the mudpuppy retina. Invest. Ophthalmol. Vis. Sci., 17, 1176–1188.

Received April 7, 2003; accepted May 3, 2004.
LETTER
Communicated by Heinrich Buelthoff
A Temporal Stability Approach to Position and Attention-Shift-Invariant Recognition

Muhua Li
[email protected] James J. Clark
[email protected] Centre for Intelligent Machines, McGill University, Montr´eal, Qu´ebec, Canada H3A 2A7
Incorporation of visual-related self-action signals can help neural networks learn invariance. We describe a method that can produce a network with invariance to changes in visual input caused by eye movements and covert attention shifts. Training of the network is controlled by signals associated with eye movements and covert attention shifting. A temporal perceptual stability constraint is used to drive the output of the network toward remaining constant across temporal sequences of saccadic motions and covert attention shifts. We use a four-layer neural network model to perform the position-invariant extraction of local features and temporal integration of invariant presentations of local features in a bottom-up structure. We present results on both simulated data and real images to demonstrate that our network can acquire both position and attention shift invariance.
1 Introduction

Humans are adept at visually recognizing objects or patterns under different viewing conditions. They are tolerant of position shifts, rotations, and deformations in the visual images. Psychological evidence (Bridgeman, van der Heijden, & Velichkovsky, 1994; Deubel, Bridgeman, & Schneider, 1998; Leopold & Logothetis, 1998; Norman, 2002; Walsh & Kulikowski, 1998) shows that there exist mechanisms along the visual pathway that maintain perceptual stability in the face of these variations in visual input. Hubel and Wiesel (1962) found that simple neurons in the primary visual cortex respond selectively to stimuli with specific orientation, while complex neurons present certain position-invariant properties. Neurons in higher visual areas, such as the inferotemporal cortex (IT), have larger receptive fields and show more complex forms of invariance. They respond consistently to scaled and shifted versions of the preferred stimuli (Gross & Mishkin, 1977; Ito, Tamura, Fujita, & Tanaka, 1995; Perrett, Rolls, & Caan, 1982; Rolls, 2000).

© 2004 Massachusetts Institute of Technology. Neural Computation 16, 2293–2321 (2004)
2294
M. Li and J. Clark
Maintaining perceptual stability is also an emerging issue in computer vision systems. An important consideration in the design of robotic vision systems is to be able to recognize the external world from the video stream acquired as the robot is wandering about. The video input is often erratic and unstable because the robot moves its eyes, head, and body to perceive the surroundings and avoid obstacles when it performs tasks. To perform well in recognition tasks, a robot should be able to maintain a constant perception of the structure of an object when changing views of the object during its motor activities. A number of models have been proposed to describe the mechanisms underlying perceptual stability, such as spatial-phase invariance, translation invariance, and scale invariance (Chance, Nelson, & Abbott, 2000; Fukushima, 1980; Riesenhuber & Poggio, 1999; Salinas & Sejnowski, 2001). In particular, temporal association is deemed an important factor in the development of transformation invariance (Miyashita, 1988; Rolls, 1995). Temporal continuity was first employed by Földiák (1991) to capture the temporal relationship of input patterns. It has been demonstrated that transformation invariances such as translation or position invariance and viewpoint invariance can be learned by imposing temporal continuity on the response of a network to temporal sequences of patterns (Bartlett & Sejnowski, 1998; Becker, 1993, 1999; Einhäuser, Kayser, König, & Körding, 2002; Földiák, 1991; Körding & König, 2001). The human visual system as a whole seamlessly combines retinal images and visual-related motor commands to give a complete representation of the observed external environment. However, most research work done so far has focused on achieving different degrees of invariance based only on the sensory input, while ignoring the important role of visual-related motor signals.
In our opinion, visual-related self-action signals are crucial in learning spatial invariance, as they provide information as to the nature of changes in the visual input. A critical issue that must be considered in modeling human vision is that the visual system has to deal with an overwhelming amount of information. It is well known that selective attention plays an important role in the human visual system by permitting the focusing on a small fraction of the total input visual information (Koch & Ullman, 1985; Maunsell & Cook, 2002). Shifting of attention enables the visual system to actively, and efficiently, acquire useful information from the external environment for further processing. Our goal is to develop object recognition systems that use covert and overt shifts in attention for feature selection. Covert attention shifts result from a change in feature selection processes occurring with the eye held fixed. Overt attention shifts refer to the change in the image data being attended to that results from a large, saccadic eye movement. Both overt and covert attention shifts cause changes to the visual input that the object recognition system works on. It is important that the functioning of the object recognition system be invariant to the effects of these attention shifts.
Learning of Position and Attention Shift Invariance
2295
With respect to changes induced by eye movements, or overt attention shifts, this invariance is specifically position invariance, where the recognition process should provide the same answer regardless of the location on the retina that the image of the object is projected. Most object recognition techniques that employ attention shifts are mainly based on covert or overt attention shifts, and they rarely consider both. The bulk of these methods consider only covert shifts, where the retinal input to the systems remains unchanged during the learning and recognition process (Kikuchi & Fukushima, 2001; Olshausen, Anderson, & Van Essen, 1993). If we directly apply such methods to overt attention shifts, the distortions due to the nonuniformity of the retina and the nonlinearity of projection on to the hemispherical retina may cause problems when foveating eye movements take place. For example, even though Kikuchi and Fukushima’s model of invariant pattern recognition (2001) employs a “scan path” of eye saccades, it does not model any associated relative distortions. In their approach, the only effect of eye movement is a spatial displacement of the imaged features. Their model achieves shift invariance and scale invariance based on extracted spatial relations, which are internally encoded by the visual system as a chain of saccadic vectors and fixated local features. This model is too simplistic, however; a true model of recognition with eye movements must take into account the image distortions resulting from eye movements. In this letter, we propose a new approach to attaining position invariance in which the processes of covert and overt attention shifts play a central role. We implicitly assume that the variation of feature positions on various cortical feature maps arises entirely from the action of covert and overt attention shifts. 
Motion of scene features in the external world is irrelevant, as it is the action of the attention systems that determines the location of the scene features in the internal representations. In this way of thinking, position invariance is really invariance to attention shifts, whether they be covert or overt. Desimone (1990) points out that the effects on the visual cortex of covert and overt attention shifts are very similar. It is conceivable, therefore, that we could develop a unified approach in which covert and overt shifts are not distinguished. We employ a temporal difference learning scheme where knowledge of the attention shift command is used to gate the learning process, permitting temporal correlation to take place between visual inputs across attention shifts. We implement a four-layer neural network model and test it on simulated data consisting of various geometrical shapes undergoing transformations.

2 A Neural Network Model of Attention Shift Invariance

The overall model being proposed is composed of two submodules, as illustrated in Figure 1. One is the attention control module, which generates attention-shift signals according to a saliency map. This module also generates saccadic motor command signals (or overt attention shift signals), which are used to determine the timing for learning. The module obtains as input local feature images from the raw retinal images via a dynamically position-changing attention window. The second submodule is the executive module, which performs the learning of invariant neural representations across attention shifts in temporal sequences. Two forms of learning, position-invariant extraction of local features and integration of position-invariant object representation (a composition of a set of local features) across attention shifts, are triggered by the saccadic motor signals and attention shift signals from the control module, respectively.

[Figure 1 appears here, showing the control module and the four-layer executive module (input, encoded, hidden, and output layers).]

Figure 1: The proposed neural network model is composed of two modules: a control module and an executive module. The control module is an attention-shift mechanism that generates attention-shift signals and saccade motor signals to trigger the learning processes in the executive module. It also selects local features, which are part of the raw retinal image falling within the attention window, as input to the executive module. The executive module consists of a four-layer network, which accomplishes the extraction of position-invariant local features and the integration of attention-shift-invariant complex features from lower level to higher level, respectively.

2.1 Temporal Continuity Approaches to Development of Position Invariance. In this section, we detail how our network learns invariance to position changes that result from eye movements. Our approach is based on
the work of Földiák (1991) and Einhäuser et al. (2002). They proposed methods for developing position invariance that rely on the temporal continuity of the images of objects projected onto the retina. These methods have been shown to work well in learning position invariance on both simulated data and real-world video sequences. Einhäuser et al. proposed a three-layer feedforward network model, capable of learning from natural stimuli, that develops receptive field properties matching those of cortical simple and complex neurons. Hebbian learning in each output layer cell emphasizes the temporal structure of the input during learning. Their experimental results show that the middle layer cells learn simple-cell response properties, with strong selectivity for both orientation and position, while the output layer cells learn complex-cell response properties, exhibiting a degree of position invariance while preserving orientation selectivity. However, their learning algorithm depends crucially on temporal smoothness in the input: the learning result is very sensitive to the timescale and the temporal structure of the input. When the time interval between successive input scenes becomes large, the temporal difference in the input data can also become large. Equivalently, when structures in the scene are moving rapidly, the temporal changes in the input stream can be large. Einhäuser et al. reported that the output layer cells lose the position-invariance property when the input lacks temporal smoothness. To produce position-invariant recognition, the visual system must be presented with images of an object at different locations on the retina. In techniques such as those of Einhäuser et al. (2002) and Földiák (1991), it is mainly the motion of objects in the external world that produces the required presentations of the object image across the retina. There are a number of problems with this.
Most important, the motion of objects in 3D space can change the appearance of objects significantly. Thus, the problem of developing position invariance is converted to the much more difficult problem of developing viewpoint invariance. The difference in the appearance of a moving object is generally greater as the displacement increases. This means that only local position invariance can be learned. In this letter, we propose a position-invariance learning method that is not overly affected by external object motion. The key aspect of our approach is the use of rapid attention shifts (overt or covert) to provide the necessary object image displacements. In this way, position invariance can be seen to arise from attention shift invariance. The short time between acquisition of images across an attention shift minimizes the change in the image due to motion of the object in space. Thus, in our approach, most of the change in the image is due to the attention shifts (and not to the motion of the object). Clearly, learning invariance with respect to some quantity requires exposure to data that varies only with respect to this quantity. Since we are learning invariance to attention shifts (covert or overt), we require a signal in which the variation is entirely due to attention shifts. This is accomplished by using only those images associated with attention shifts. At other times, the
images are changing due to extraneous factors, such as object motion, and thus can only interfere with the learning of the invariance. Our approach has the additional advantage that the attention shift command can be used as a signal to direct learning. We can, for example, arrange for learning to take place only during the period immediately before and after an attention shift.

2.2 Extraction of Position-Invariant Local Features. The need for the development of position invariance arises from projective distortions and the nonuniform distribution of visual sensors on the retinal surface. These factors result in qualitatively different signals when object features are projected onto different positions of the retina. For example, when a linear object feature in space is projected onto a hemispherical surface, such as the retina, it is relatively undistorted when projected near the optical axis (i.e., near the fovea), whereas its image becomes curved when projected away from the optical axis (e.g., in the periphery). The problem of finding a position-invariant representation for such features can therefore be thought of as that of finding the underlying relationship between the various distorted retinal images of the same physical feature at different retinal positions. We propose that this problem can be simplified if a canonical representation of the feature can be specified. The foveating capability of the human visual system gives us such a canonical representation. It is the role of the foveating system to shift the image of a feature being attended to from the retinal periphery so that it becomes centered on the fovea. The foveal image of an object feature is a suitable candidate for the feature's canonical representation since, statistically, among all the retinal images of a feature, the foveal image is the most frequently observed.
Furthermore, the process of fixation and tracking ensures that the foveal image representation is very stable relative to the peripheral images. If we take the neural representation of a feature's foveal image as its canonical representation, the problem of position-invariant representation of a feature becomes one of associating the neural representations of all of its peripheral images with its single canonical representation. At a deeper level, the approach we are proposing involves executing self-actions of the observer (in this case, saccadic eye movements) and observing the resulting changes in the retinal image. The idea that knowledge of self-action and the resulting sensory changes plays a role in perception is becoming popular. For example, O'Regan and Noë (2001) proposed that visual percepts are based on the sensorimotor contingencies that describe the relation between motor activities and visual sensory input. Our approach to the learning of position invariance is based on the proposal of Clark and O'Regan (2000) that position invariance could be achieved through learning of the sensorimotor contingencies associated with a given feature. They presented a prototype of an association mechanism using the temporal difference learning scheme of Sutton and Barto
(1981) to learn the association between pre- and postmotor visual input data, leading to the desired position-invariance properties. A saccade is employed to foveate preattended features, so that associations between presaccadic peripheral stimuli and postsaccadic foveal stimuli (the canonical image) can be learned each time a saccade occurs. Given an input presaccadic neural response X, an association matrix V, and a reinforcement reward λ, their learning rule is as follows:

ΔV_ij(t) = α (λ(t) + V_ij(t−1) [γ X_j(t) − X_j(t−1)]) X̄_j(t),   (2.1)

with

X̄_j(t) = δ (X_j(t−1) − X̄_j(t−1)).   (2.2)
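As a concrete illustration, the rule in equations 2.1 and 2.2 can be sketched in NumPy. This is a minimal sketch under our reading of the printed equations: the function name and parameter values are illustrative, and treating the reward λ as a vector over output units (it "has the same dimension as X" in the text) is an assumption of the sketch.

```python
import numpy as np

def association_update(V, X_prev, X_curr, X_trace, reward,
                       alpha=0.1, gamma=0.9, delta=0.5):
    """One Clark-O'Regan style association update after a saccade.

    V       : association matrix (n_out x n_in)
    X_prev  : presaccadic input response X(t-1)
    X_curr  : current input response X(t)
    X_trace : input memory trace X_bar(t-1)
    reward  : postsaccadic foveal response lambda(t)
    """
    # Input trace update (eq. 2.2 as printed).
    X_trace_new = delta * (X_prev - X_trace)
    # Temporal-difference factor per input unit.
    td = gamma * X_curr - X_prev
    # Weight change (eq. 2.1): reward plus the weight-scaled TD term,
    # gated by the input trace.
    dV = alpha * (reward[:, None] + V * td[None, :]) * X_trace_new[None, :]
    return V + dV, X_trace_new
```

With zero initial weights and trace, a uniform reward produces a uniform weight increment gated by the trace of the presaccadic input.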
The reinforcement reward λ(t) here is the postsaccadic neural response to the foveal feature (the canonical image) and has the same dimension as X. Clark and O'Regan's (2000) model works well in handling geometric distortions of image features due to position variance. However, a limitation of their model is that the association is very space- and time-consuming, with resource requirements growing exponentially with the number of input neurons. In this letter, we provide a more efficient version of the Clark-O'Regan approach. Our aim is to reduce the computational requirements of their model while retaining the capability of learning position invariance of local features. We modify their learning rule to use temporal differences over longer timescales rather than just over pairs of successive time steps. In addition, we use a sparse coding approach to reencode the simple neural responses, which reduces the size of the association weight matrix and therefore the computational complexity. Our model includes an input layer, an encoded layer, and a hidden layer, as well as the connection matrices between the layers. We model the input layer neuronal receptive field profiles with Gabor-like functions. We refer to the input layer units as simple neurons, as they have properties similar to those of the simple cells of the visual cortex. The response of each simple neuron to a retinal image is the convolution of its receptive field profile with the image. The simple neural responses are then encoded by a sparse coding approach (Hyvärinen & Hoyer, 2001; Olshausen & Field, 1996, 1997) to reduce the statistical redundancies in the input pattern. The learning of basis function sets and their sparsely distributed coefficients ensures that only a small number of active neurons in the encoded layer represent the original input pattern. The details of the encoding process are as follows. Let F denote the simple neuronal responses.
A set of basis functions bf_i and a set of corresponding sparsely distributed coefficients a_i are learned to represent F:

F(j) = Σ_i a_i bf_i(j), that is, F = bf ∗ a.   (2.3)
The basis function learning process is a solution to a regularization problem that finds the minimum of a functional E. This functional measures the difference between the original neural responses F and the reconstructed responses bf ∗ a, subject to a sparseness constraint on the coefficients a:

E(bf, a) = (1/2) Σ_j (F_j − Σ_i a_i bf_ij)² + α Σ_i Sparse(a_i),   (2.4)

where

Sparse(a) = ln(1 + a²).   (2.5)
Sparseness is enforced by the second term of equation 2.4, which drives the coefficients a toward small values. In our implementation, E is minimized over its two arguments bf and a alternately. The minimization is first performed over a, with bf held fixed, and then over bf. The inner minimization loop over a is performed by iterating the nonlinear conjugate gradient method (Shewchuk, 1994) until the derivative of E(bf, a) with respect to a is zero:

∂E(bf, a)/∂a_i = −Σ_j bf_ij (F_j − Σ_k a_k bf_kj) + α ∂Sparse(a_i)/∂a_i.   (2.6)

The outer minimization loop over bf is accomplished by simple gradient descent:

Δbf_ij = η ⟨ a_i (F_j − Σ_k a_k bf_kj) ⟩.   (2.7)
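The alternating minimization can be sketched as follows. This is a simplified stand-in, not the letter's implementation: plain gradient descent replaces the nonlinear conjugate gradient inner loop, the basis matrix is stored with one basis function per column (pixels × basis, the transpose of the bf_ij layout in the equations), and all step sizes are illustrative.

```python
import numpy as np

def sparse_encode(F, bf, alpha=0.1, lr=0.1, n_iter=500):
    """Inner loop: fit coefficients a for fixed basis bf by descending
    the gradient of E(bf, a) (eqs. 2.4-2.6).

    F  : vector of simple neural responses (length n_pixels)
    bf : basis matrix, one basis function per column (n_pixels x n_basis)
    """
    a = np.zeros(bf.shape[1])
    for _ in range(n_iter):
        residual = F - bf @ a                       # F_j - sum_k a_k bf_kj
        # Gradient of eq. 2.4; d Sparse(a)/da = 2a / (1 + a^2).
        grad = -bf.T @ residual + alpha * 2 * a / (1 + a ** 2)
        a -= lr * grad
    return a

def update_basis(bf, F, a, eta=0.01):
    """Outer loop: one Hebbian-style gradient step on bf (eq. 2.7),
    followed by renormalization of each basis function to unit length."""
    residual = F - bf @ a
    bf = bf + eta * np.outer(residual, a)           # dbf[j, i] ~ residual_j * a_i
    bf /= np.linalg.norm(bf, axis=0, keepdims=True)
    return bf
```

The unit-norm step after each update corresponds to the normalization described below equation 2.7.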
After each learning step, bf is normalized to ensure that ‖bf_i‖ = 1. The normalization prevents bf from growing without bound, which would otherwise drive a toward undesired zero values. The sparsely distributed coefficients a then become the output of the encoded layer, which we denote as S. A weight matrix between the encoded layer and the hidden layer serves to associate the encoded simple neuron responses related to the same physical stimulus at different retinal positions. Immediately after a saccade takes place, this weight matrix A is updated according to a temporal difference reinforcement learning rule, to strengthen
the weight connections between the neuronal responses to the presaccadic feature and those to the postsaccadic feature. The neuronal response in the hidden layer H is given by

H = A ∗ S.   (2.8)
The weight matrix A is updated only at those times when a saccade occurs. The updating is done with the following temporal reinforcement learning rule:

ΔA(t) = η [(1 − κ) R(t) + κ (γ H(t) − H̃(t−1))] S̃(t−1),   (2.9)

where

H̃(t) = α1 (H(t) − H̃(t−1)),
S̃(t) = α2 (S(t) − S̃(t−1)).   (2.10)
The factor γ is adjusted to obtain desirable learning dynamics. The parameters η, α1, and α2 are learning rates with predefined constant values. In order to investigate the role of temporal reinforcement in the learning of position invariance, we introduce a weighting parameter κ that balances the reinforcement reward against the temporal output difference between successive steps. The effect of varying κ is demonstrated in section 3.1.

The short-term memory traces H̃ and S̃ of the neural responses in the hidden layer and the encoded layer are maintained to emphasize the temporal influence of a response pattern at one time step on later time steps. They are temporally low-pass-filtered traces of the activities of the hidden layer neurons and encoded layer neurons, respectively. The learning rule therefore incorporates a Hebbian term between the input trace and the output trace residuals (the difference between the current activity and the trace activity), as well as between the input trace and the reinforcement signal. This temporal reinforcement learning rule is not the same as traditional trace rules (Földiák, 1991; Wallis, Rolls, & Földiák, 1993; Wallis & Rolls, 1997), which emphasize the Hebbian connection between the input stimulus and the decaying trace of previous output stimuli. Equation 2.9 differs slightly from equation 2.1 in Clark and O'Regan (2000), in that we use the temporal difference of the output trace residuals (over longer timescales) instead of the pairwise temporal difference. This modification gives us a longer trace of hidden layer activity over previous time steps, which helps to obtain more globally optimal solutions. The reinforcement reward R(t) is the sparsely encoded simple neural response immediately after a saccade. The weight update rule correlates this reinforcement reward R(t) and (an estimate of) the temporal difference of the hidden layer neural responses with the memory trace of the encoded layer neural responses. The constraint of temporal perceptual stability requires that updating take place only when there is a difference between the current neural response and the previous neural responses kept in the short-term memory trace, as illustrated in Figure 2.

Figure 2: Illustration of temporal difference learning under a temporal perceptual stability constraint. The short-term memory trace of the neural response in the encoded layer emphasizes the temporal influence of previous neural responses on later training. The weights between the encoded layer neurons and the hidden layer neurons are enhanced when there is a significant temporal difference between the current hidden neural output and the previous output.

Our proposed position-invariance approach eliminates the limitations of Einhäuser et al.'s (2002) model without imposing an overly strong constraint on the temporal smoothness of the scene images. For example, when recognizing a rapidly moving object, uniform temporal sampling results in the object appearing at significantly different positions on the retina. The resulting temporal discontinuity in the input causes problems for the Einhäuser et al. model. Even worse, the appearance of the object may change due to a change in its pose as it moves. This means that the variation in the input data depends not only on the position of the object, but also on its orientation in space. Such object motion does not affect the learning result of our approach, however, because it employs nonuniform temporal sampling, in which images are obtained only immediately before and after an attention shift (either overt or covert). As the attention shift takes little time, object motion has little effect on the input data. Most of the variation in the position of the object in the image is due to the attention shift.
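The saccade-gated update of equations 2.8 to 2.10 can be sketched as follows. This is a minimal sketch under our reading of the printed equations; function names, parameter values, and the outer-product form of the weight change are illustrative assumptions.

```python
import numpy as np

def hidden_response(A, S):
    """Eq. 2.8: hidden layer response from the encoded-layer output."""
    return A @ S

def saccade_gated_update(A, S_trace, H_trace, R, H_curr, S_prev,
                         eta=0.1, kappa=0.5, gamma=0.9,
                         alpha1=0.5, alpha2=0.5):
    """One application of eqs. 2.9-2.10, applied only when a saccade
    occurs. R is the sparsely encoded postsaccadic response."""
    # Eq. 2.9: reinforcement reward balanced (by kappa) against the
    # temporal output difference, gated by the encoded-layer trace.
    drive = (1 - kappa) * R + kappa * (gamma * H_curr - H_trace)
    A = A + eta * np.outer(drive, S_trace)
    # Eq. 2.10: short-term memory trace updates.
    H_trace = alpha1 * (H_curr - H_trace)
    S_trace = alpha2 * (S_prev - S_trace)
    return A, S_trace, H_trace
```

With κ = 0 the update is driven purely by the reward; with κ = 1 it depends only on the temporal output difference, matching the discussion of κ in section 3.1.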
2.3 Temporal Integration of a Position-Invariant Representation of an Object Across Attention Shifts. The integration level of the executive submodule in our system is concerned with the invariant representation of an object across attention shifts. Position invariance is implicitly incorporated because the attention shift invariance is based on a temporal integration of position-invariant local features. Attention shift information is provided in our model by the control module. This module receives as input the retinal image of an object (a combination of simple features). It constructs a saliency map (Itti, Koch, & Niebur, 1998; Koch & Ullman, 1985) that is used to select the most salient area as the next attention-shift target. The saliency map is a weighted sum of feature saliencies, such as edge orientation, color, and edge contrast. The selection of feature types and their corresponding weights depends on the task to be performed. Currently, our implementation uses gray-level images, and we use only orientation contrast and intensity contrast as saliency map features (refer to Itti et al., 1998, for implementation details). Intensity features I(σ) are obtained from a gaussian pyramid computed from the raw input intensity, where the scale σ ranges over [0..8]. Local orientation information is obtained by convolution with oriented Gabor pyramids O(σ, θ), where σ ∈ [0..8] is the scale and θ ∈ {0°, 45°, 90°, 135°} is the preferred orientation. Feature maps are calculated by a set of "center-surround" operations, denoted ⊖, which are implemented as the difference between a fine scale (c ∈ {2, 3, 4}) and a coarse scale (s = c + δ, with δ ∈ {3, 4}):

I(c, s) = |I(c) ⊖ I(s)|   (2.11)

O(c, s, θ) = |O(c, θ) ⊖ O(s, θ)|.   (2.12)
In total, 30 feature maps (6 for intensity and 24 for orientation) are calculated and combined into two conspicuity maps, at the scale (σ = 4) of the saliency map, through cross-scale addition ⊕:

Ī = ⊕_{c=2..4} ⊕_{s=c+3..c+4} N(I(c, s))   (2.13)

Ō = Σ_{θ ∈ {0°, 45°, 90°, 135°}} N(⊕_{c=2..4} ⊕_{s=c+3..c+4} N(O(c, s, θ))),   (2.14)
where N(·) is a map normalization operator. The saliency map S is obtained as a weighted sum of the two conspicuity maps (here we give both the same weight):

S = (1/2) (N(Ī) + N(Ō)).   (2.15)
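To make the pipeline concrete, here is a sketch of the intensity pathway (eqs. 2.11 and 2.13) together with winner-take-all target selection. It is a simplification, not the implementation of Itti et al. (1998): a 2 × 2 mean-pool pyramid stands in for the gaussian pyramid, peak normalization stands in for N(·), and the input is assumed to be a 256 × 256 gray-level image.

```python
import numpy as np

def pyramid(img, levels=9):
    """2x2 mean-pool pyramid over scales sigma in [0..8]."""
    pyr = [img]
    for _ in range(levels - 1):
        h, w = pyr[-1].shape
        p = pyr[-1][:h // 2 * 2, :w // 2 * 2]
        pyr.append(p.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr

def center_surround(pyr, c, s):
    """|I(c) (-) I(s)| (eq. 2.11): upsample level s to level c, subtract."""
    f = 2 ** (s - c)
    up = np.kron(pyr[s], np.ones((f, f)))[:pyr[c].shape[0], :pyr[c].shape[1]]
    return np.abs(pyr[c] - up)

def intensity_conspicuity(img):
    """Cross-scale sum of the six intensity feature maps at sigma = 4
    (eq. 2.13), with peak normalization standing in for N(.)."""
    pyr = pyramid(img)
    acc = np.zeros_like(pyr[4])
    for c in (2, 3, 4):
        step = 2 ** (4 - c)          # resample each map to the sigma=4 grid
        for s in (c + 3, c + 4):
            m = center_surround(pyr, c, s)
            acc += (m / (m.max() + 1e-9))[::step, ::step]
    return acc

def next_target(saliency):
    """Winner-take-all: grid location of the most salient value."""
    return np.unravel_index(np.argmax(saliency), saliency.shape)
```

The orientation pathway (eqs. 2.12 and 2.14) would follow the same pattern, applied to Gabor-filtered pyramids before summing over the four orientations.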
A winner-take-all algorithm (Koch & Ullman, 1985) determines the location of the most salient feature in the calculated saliency map. This location then becomes the target of the next attention shift. In the case of an overt attention shift, the positional information of the target (including saccadic direction and amplitude) is sent to the executive submodule to command execution of a saccade. The target is foveated after the commanded motion, and a new retinal image is formed. The new image is fed into the module as input for the next learning iteration. A covert attention shift, on the other hand, does not foveate the attended target, and the subsequent retinal image input therefore remains unchanged. Since both overt and covert attention shifts play an important role in determining the timing of the learning process at this stage, we use an attention-shift signal, rather than a saccade signal, as the motor signal from the control module that triggers the integration learning. In our implementation, an inhibition-of-return (IOR) mechanism is added to prevent immediate attention shifts back to the current feature of interest, allowing other parts of the object to be explored. The localized image features, obtained when part of an object falls in the attention window before and after attention shifts, are fed into the input layer of the four-layer network. Given that position-invariant representations of local features have already been learned, an integration of local features from an object can be learned in a temporal sequence as long as the attention window stays within the range of the object. Here we assume that attention always stays on the same object during recognition, even in the presence of multiple objects. In our experiments, this assumption is enforced by considering only scenes that contain a single object. In practice, of course, there will be attention shifts between different objects.
Although we have not yet tested our method in such situations, we expect that such interobject attention shifts will only slow learning. This is because a given object will typically be viewed in proximity to a wide range of different objects and backgrounds. Thus, there will be no persistent pairing of an object feature with a particular background feature, and no strong association will be made. The only persistent associations will be those among features within the same object. The learning at this stage includes two further aspects: a winner-take-all interaction between the output layer neural activities and a fatigue effect on continuously active output layer neurons. The winner-take-all interaction ensures that only one neuron in the output layer wins the competition to respond actively to a given input pattern. The fatigue process is a modified implementation of inhibition of return, which prevents one unit from winning all of the input patterns. It gradually decreases the fixation of interest on the same object after several attention shifts. Although in our testing we restrict the scenes to contain only one object at a time, the model must learn several objects. It is therefore necessary that the currently active neuron be suppressed for a while
when the learning moves to the next object. The fatigue effect is controlled by a fixation-of-interest function FOI(u). A value u is kept for each output layer neuron in an activation counter initialized to zero. Each counter traces the recent activity of its corresponding output layer neuron: it increases by 1 whenever the neuron is activated and otherwise decreases by 1, down to a floor of 0. If a neuron is continuously active over a certain period, the probability of its subsequent activation (i.e., its fixation of interest on the same stimulus) is gradually reduced, allowing other neurons to be activated. A gaussian function of u² is used for this purpose:

FOI(u) = e^(−u⁴/σ²).   (2.16)
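A tiny sketch of the counter and fatigue factor follows; the value of σ is an arbitrary assumption, and the function names are illustrative.

```python
import math

def foi(u, sigma=4.0):
    """Fixation-of-interest factor (eq. 2.16): a gaussian in u**2,
    i.e. exp(-u**4 / sigma**2). Decreases as the activation counter
    u grows, reducing the chance of re-activation."""
    return math.exp(-u ** 4 / sigma ** 2)

def step_counter(u, active):
    """Activation counter: +1 while the neuron fires, otherwise
    decay by 1 down to a floor of 0."""
    return u + 1 if active else max(u - 1, 0)
```

A neuron that has just fired repeatedly (large u) is strongly suppressed, so attention can move on to other output units.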
The output layer neural response C0 is obtained by multiplying the hidden layer neural responses H with the integration weight matrix W. C0 is then modulated by FOI(u) and biased by a local estimate C̃0 of the maximum output layer neural response, weighted by a factor κ < 1:

C = C0 FOI(u) − κ C̃0.   (2.17)
If Ci exceeds a threshold, the corresponding output layer neuron is activated (Ci = 1). The temporal integration of local features is accomplished by dynamically tuning the connection weight matrix between the hidden layer and the output layer. Responses to local features of the same object can be correlated by applying the constraint that output layer neural responses remain constant over time. Given as input the hidden layer neural responses H from the output of the lower layers, and as output the output layer neural responses C, the weight matrix W is dynamically tuned in a Hebbian manner using the short-term memory trace Ĉ of the output layer neural responses C:

ΔW(t) = γ [(Ĉ(t) − η C(t)) H(t) − C(t) W(t)],   (2.18)

with

Ĉ(t) = α (C(t) − Ĉ(t−1)).   (2.19)
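The integration update of equations 2.18 and 2.19 can be sketched as follows; the outer-product reading of the Hebbian term and all parameter values are assumptions of the sketch, not part of the original specification.

```python
import numpy as np

def integration_update(W, H, C, C_trace, gamma=0.1, eta=0.5, alpha=0.5):
    """Sketch of eqs. 2.18-2.19: Hebbian tuning of the hidden-to-output
    weight matrix W using the short-term output trace C_hat.

    W       : integration weights (n_out x n_hidden)
    H       : hidden layer responses
    C       : output layer responses
    C_trace : previous output trace C_hat(t-1)
    """
    C_hat = alpha * (C - C_trace)                   # eq. 2.19
    # Eq. 2.18: Hebbian term between the trace residual and the input,
    # plus a local term that keeps each weight bounded.
    dW = gamma * (np.outer(C_hat - eta * C, H) - C[:, None] * W)
    return W + dW, C_hat
```

The `- C[:, None] * W` term is the local weight-bounding operation described below the equations.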
The short-term memory trace Ĉ acts as an estimate of the neuron's recent responses. The second term of the learning rule emphasizes the importance of the temporal difference between successive steps in maintaining a stable state. The last term is a local operation that keeps each weight bounded.

2.4 Discussion. The development of the human visual system proceeds gradually from a very basic learning stage, much as a newborn baby learns to recognize the complicated external world by exploring simple
shapes and colors step by step. Similarly, in our model, the integration of responses to local features that belong to the same object is based on lower-level extraction of position-invariant local features that have already been learned to some extent. The integration becomes faster when position-invariant representations of local features are correlated in temporal order, rather than by correlating numerous different neural responses to all local features at random positions. This is also a reason that we do not explicitly distinguish overt from covert attention shifts at this stage. In the case of covert attention shifts, although the attended local features are not brought into the fovea, the representations of these peripheral local features are position invariant, based on the learning accomplished in the first stage. In the case of overt attention shifts, the attended local features are retargeted to the fovea, so their representations are already identical to the learned position-invariant representations. Therefore, both types of attention shift can function under this integration. Our approach is essentially a technique for encoding neural responses that are invariant to the changes induced by attention shifts; attention shifts are therefore an important part of encoding invariant representations for the input patterns, but not necessarily of recognizing an already encoded object. We need only assume that these attention shifts do occur and that only a single object is being viewed. For online learning, the two processes of feature extraction and integration are performed concurrently. Because the early learning process of integration is essentially random and has no effect on the later result, we can use a gradually increasing parameter to adjust the learning rate of integration. This parameter can be thought of as an evaluation of the experience gained at the basic learning stage.
The value of this parameter is set near 0 at the beginning of learning and near 1 after a certain amount of learning, at which point the extraction process is deemed to have gained sufficient experience in extracting position-invariant local features.

3 Simulation and Results

We designed two experiments to test our model's position-invariant and attention-shift-invariant properties, respectively. In our model, position invariance is achieved when a set of neurons can discriminate one stimulus from others across all positions. We speak of a set of neurons because our representation is a population code, in which more than one neuron may respond strongly to a given set of stimuli. The sets may overlap, but the combinations of actively responding neurons are unique and can therefore be distinguished from one another. We consider attention-shift invariance to have been achieved when the position-invariant set of neurons retains its coherence across attention shifts, as long as those shifts stay on the same object.
We designed a third experiment to show that our model performs better than the models of Földiák (1991) and Einhäuser et al. (2002) when the input patterns lack temporal smoothness.

3.1 Demonstration of Position Invariance. To demonstrate the process of position-invariant local feature extraction, we focus on the extraction submodule. This module is composed of three layers: the input layer, the encoded layer, and the hidden layer. We use two different test sets of local features as training data at this stage: a set of computer-generated images of simple oriented linear features and a set of computer-modified images of real objects. We first implemented a simplified model that has 648 input layer neurons, 25 encoded layer neurons, and 25 hidden layer neurons for testing with the first training data set. The receptive fields of the input layer neurons are generated by Gabor functions over a 9 × 9 grid of spatial displacements, each with eight different orientations evenly distributed from 0 to 180 degrees. The first training image set is obtained by projecting straight lines of four different orientations (0, 45, 90, and 135 degrees) through a pinhole eye model (as shown in Figure 3) onto seven different positions of a spherical retinal surface. The simulated retinal images each have a size of 25 × 25 pixels. The training data are shown in Figure 4A, along with a subset of the input layer receptive fields (see Figure 4B). Figure 5 shows the 25 basis functions (which are the receptive fields of the encoded layer neurons), trained using Olshausen and Field's (1997) sparse coding approach on the simple neural responses. We found in our experiment that some neurons in the hidden layer responded more actively to one of the stimuli, regardless of its position on the retina, than to all other stimuli, as demonstrated in Figure 6.
For example, neuron 8 exhibits a higher firing rate to line 4 than to any of the other lines, while neuron 17 responds to line 1 most actively. The other neurons remain inactive to the stimuli, which leaves possible space to respond to other stimuli in the future. It was next shown that the value of the weighting parameter κ in equation 2.9 had a significant influence on this submodule performance. To evaluate the performance, the standard deviation of activities of the hidden layer neurons are calculated when the submodule is trained with different values of κ(= 0, 0.2, 0.5, 0.7, and 1). The standard deviation of the neural activities is calculated over a set of input stimuli. The value stays low when the neuron tends to maintain a constant response to the temporal sequence of a feature appearing at different positions. Figure 7 shows the standard deviation of the firing rate of the 25 hidden layer neurons with different values of κ. The standard deviation becomes larger as κ increases. This result shows that the reinforcement reward plays an important role in the learning of position invariance. When κ is near 1, which means the learning depends fully on the
2308
M. Li and J. Clark
Figure 3: Distorted retinal images obtained when features projected through a pinhole eye model onto the hemispherical surface of the retina.
Learning of Position and Attention Shift Invariance
Figure 4: (A) Computer-simulated retinal images of lines with four orientations at seven positions, used as the training data set. (B) A random sample of 100 out of the 648 Gabor receptive field profiles of the simple neurons.
Figure 5: Basis functions visualized as the receptive field profiles of the 25 encoded layer neurons. The basis functions were trained using Olshausen and Field’s sparse coding approach.
temporal difference between stimuli before and after a saccade, the hidden layer neurons are more likely to have nonconstant responses. In our second simulation, we tested image sequences of real-world objects, such as a teapot and a bottle (see Figures 8B and 8C). The images of these objects were projected onto the simulated retina at nine different positions, following routes such as the one illustrated in Figure 8A. Each retinal image has a size of 64 × 48 pixels. The number of neurons in the encoded layer and in the hidden layer was increased from the 25 used in the previous experiment to 64. This increase was required because the size of the basis function set needed to encode the sparse representations grows with the complexity of the input images.
Figure 6: Neural activities of the four most active hidden layer neurons responding to the computer-simulated data set at different positions. The firing rates for each of the four stimuli (four lines with different orientations) at each of the seven retinal positions are shown. Each neuron maintains its preferred orientation selectivity across all positions.
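The invariance measure used in Figure 7 — the standard deviation of each hidden neuron's response to a stimulus across retinal positions — can be computed in one line. The response array and its dimensions here are hypothetical.

```python
import numpy as np

def invariance_std(responses):
    """Standard deviation of each neuron's firing rate across stimulus
    positions, averaged over stimuli.

    responses: hypothetical array of shape (n_neurons, n_stimuli,
    n_positions). A low value means the neuron's response to a given
    stimulus stays stable as the stimulus moves across the retina."""
    return responses.std(axis=2).mean(axis=1)

# Toy check: a perfectly position-invariant neuron scores (near) zero.
rng = np.random.default_rng(0)
invariant = np.tile(rng.random((1, 4, 1)), (1, 1, 7))  # same rate at all 7 positions
variant = rng.random((1, 4, 7))                        # position-dependent rates
print(invariance_std(invariant), invariance_std(variant))
```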
Figure 9 shows the neural activities of the four most active neurons in the hidden layer when responding to the two image sequences of a teapot and a bottle, respectively. Neurons 3 and 54 exhibit relatively strong responses to the teapot across all nine positions, while neuron 27 responds mainly to the bottle. Neuron 25 has strong overlapping responses to both stimuli. The sets of neurons with relatively strong activities differ from each other, satisfying our definition of position invariance.

3.2 Demonstration of Attention Shift Invariance. For simplicity, in this experiment we use binary images of basic geometrical shapes such as rectangles, triangles, and ovals. These geometrical shapes are, as in the previous experiment, projected onto the hemispherical retinal surface through a pinhole.

Figure 7: Comparison of position-invariant submodule performance with varied weighting parameter κ (κ = 0, 0.2, 0.5, 0.7, 1), using as a measure the standard deviation of each neuron's response to a stimulus across different positions. Small κ emphasizes the importance of the reinforcement reward. Lower standard deviation values mean that the neural responses remain stable; higher values indicate instability. The values for the 25 neurons in the hidden layer are shown.

Their positions relative to the fovea change as a result of saccadic movements. Here we use a weighted combination of intensity contrast and orientation contrast to compute the saliency map, as these are the most important and distinctive attributes of the geometrical shapes we use in training. A winner-take-all mechanism is employed to select the most salient area as the next fixation target. After a saccade is performed to foveate the fixation target, the saliency map is updated based on the newly formed retinal image, and a new training iteration begins. Figure 10 shows a sequence of saliency maps calculated from retinal images of geometrical shapes for a sequence of saccades. Figures 11B and 11D show a sequence of pre- and postsaccadic local features of the retinal images of a rectangular shape falling in a 25 × 25 pixel
Figure 8: Training image sequences of two real objects (B and C). The images in the sequences were taken at nine positions following a path as indicated in A.
Figure 9: Neural activities of the four most active hidden layer neurons responding to two real objects at different positions. The neuron firing rates for each of the two stimuli (a teapot and a bottle) at each of the nine positions are shown.
attention window, respectively. The local features shown in Figure 11 after a saccade are not exactly the ideal canonical foveal images, because of errors in the computed position of the saccadic target. This situation also occurs in human vision, where saccadic eye movements are not always able to put the selected target exactly in the fovea; in fact, undershooting the target is the usual situation. An undershot local feature is likely to be refoveated by a subsequent small, corrective saccade. An enhanced algorithm dealing with this undershooting was described in Li and Clark (2002). Even if the correction of undershooting is not taken into account in this model, we can still obtain invariance, although the efficiency of the model will be somewhat impaired. This is because the noncanonical foveal features exhibit greater variability than the ideal canonical features and therefore require a longer learning process. The temporal association mechanism is nonetheless able to associate the various near-canonical
Figure 10: Dynamically changing saliency maps for three geometrical shapes. They are computed from the retinal images after the first six saccades following an overt attention shift. The small, bright rectangle indicates an attention window centered at the most salient point in the saliency map.
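The saliency computation sketched in the text — a weighted combination of intensity contrast and orientation contrast, followed by winner-take-all selection of the next fixation target — could look roughly like this. The box-filter surround and the equal weights are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def box_mean(img, k):
    """k x k box-filter mean via a summed-area table (edge-padded)."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    s = np.pad(p.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = img.shape
    return (s[k:k+h, k:k+w] - s[:h, k:k+w] - s[k:k+h, :w] + s[:h, :w]) / (k * k)

def saliency(intensity, orientation_energy, w_int=0.5, w_ori=0.5, k=5):
    """Weighted combination of intensity contrast and orientation
    contrast, with contrast taken as deviation from the local mean."""
    c_int = np.abs(intensity - box_mean(intensity, k))
    c_ori = np.abs(orientation_energy - box_mean(orientation_energy, k))
    return w_int * c_int + w_ori * c_ori

def winner_take_all(smap):
    """(row, col) of the most salient location: the next fixation target."""
    return np.unravel_index(np.argmax(smap), smap.shape)

# Toy check: a single bright point should win the competition.
img = np.zeros((20, 20))
img[7, 13] = 1.0
target = winner_take_all(saliency(img, np.zeros_like(img)))
```

After each simulated saccade, the saliency map would be recomputed from the new retinal image and the winner-take-all step repeated, mirroring the update loop described above.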
Figure 11: Local features of a rectangular shape before (B) and after (D) an overt attention shift. (A, C) Retinal images of the same rectangle at different positions due to overt attention shifts.
Figure 12: Neural activities of the two most active output layer neurons across attention shifts. The firing rates for the two geometrical shapes (a triangle and a rectangle) are shown.
neural representations and produce a stable neural response to the same stimulus across transformations. Figure 12 shows some of the output layer neural responses (neurons 2 and 5) to two geometrical shapes: a rectangle and a triangle. Neuron 2 responds more actively to the rectangle than to the triangle, while neuron 5 responds more actively to the triangle than to the rectangle.
3.3 Comparison with Other Temporal Approaches. In this section, we demonstrate that our proposed approach performs well in situations where the input lacks smoothness in time. Position-invariant learning models that use temporal continuity, such as those of Földiák (1991) and Einhäuser et al. (2002), perform poorly in these situations. We use a digital camera mounted on a computer-controlled pan-tilt unit (PTU) to acquire images around a toy bear. Images of the object are acquired as the PTU randomly changes its pan and tilt positions. The PTU movements are constrained so as to keep the bulk of the object in view at all times. The action of the PTU simulates human eye and head movements, which result
in the displacement of the object features on the imaging surface. Pairs of images before and after each movement are obtained and converted into gray-level images. These image pairs are fed into the model as training data in random order, so the resulting time sequence of images is not smooth at all. This is an extreme but nonetheless realistic test, and it clearly demonstrates the difference in performance between our method and methods based on temporal continuity. We implement Földiák's trace rule (1991) and the learning rule for the position-invariant complex neuron of the top layer as given in Einhäuser et al. (2002). We train with both of these rules, as well as with our proposed model, on the same training data. The learning results are compared using the mean variance of the output neuron responses over the whole stimulus set: if a model exhibits position invariance, its output neuron responses should remain nearly constant and therefore have a low variance. We show the results produced by the three models in Figure 13. Each time unit in the plot represents 25 learning iterations. The figure shows that our model converges to a stable state very quickly, with a low mean variance. The mean variance of Einhäuser's model is larger than that of our approach, and it decreases very slowly over the time interval. Földiák's model produces a response variance that increases with time, implying a complete failure of the learning process for such a nonsmooth input sequence.

4 Conclusions

In this letter, we have presented a neural network model that achieves position invariance. Our approach is based on the study of a more general problem: learning invariance to attention shifts. Attention shifts are the primary reason that images of object features are projected at various locations on the retina. Object motion in the world is rarely the cause of such variation, as pursuit tracking of object features cancels out this motion.
Following Desimone (1990), we treat covert and overt attention shifts as equivalent from the point of view of their effect on the visual cortex. For the task of learning position invariance, the advantage of treating image feature displacements as being due to attention shifts is that attention shifts are rapid and that there is a neural command signal associated with them. The rapidity of the shift means that learning can be concentrated in the short time interval around the occurrence of the shift. This focusing of the learning solves the problems with time-varying scenery that plagued previous methods, such as those proposed by Földiák (1991), Becker (1993, 1999), Körding and König (2001), and Einhäuser et al. (2002). We used an extension of Clark and O'Regan's (2000) association model to learn position invariance across overt attention shifts via temporal difference learning on pairs of pre- and postsaccadic stimuli. The extension involves the use of a sparse coding approach, which reduces the size of the association weight matrix and therefore the computational complexity.
Figure 13: Comparison of the performance of three models (our proposed model, Földiák's, and Einhäuser's) in learning position invariance with time-varying scenery. Performance is evaluated by the mean variance of the hidden layer neuron responses over the whole stimulus set along a time interval. Each time unit on the x-axis comprises 25 learning iterations.
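For concreteness, here is a minimal sketch of the two ingredients behind the comparison in Figure 13: a Földiák-style trace rule (the Hebbian update uses a low-pass trace of the output) and the mean response variance used as the performance metric. The toy linear network, learning rate, and trace decay below are our own illustrative assumptions, not the authors' implementation or that of Einhäuser et al.

```python
import numpy as np

def train_trace_rule(patterns, n_out, delta=0.2, lr=0.05, epochs=20, seed=0):
    """Földiák-style trace learning on a toy linear layer: the Hebbian
    update pairs the input with a low-pass 'trace' of the output,
    y_tr(t) = (1 - delta) * y(t) + delta * y_tr(t - 1), so that inputs
    adjacent in time get bound to the same output unit. Weights are
    renormalized after each update to keep them bounded."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_out, patterns.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    for _ in range(epochs):
        y_tr = np.zeros(n_out)
        for x in patterns:
            y = W @ x
            y_tr = (1 - delta) * y + delta * y_tr
            W += lr * np.outer(y_tr, x)          # Hebb on the trace
            W /= np.linalg.norm(W, axis=1, keepdims=True)
    return W

def mean_response_variance(W, patterns):
    """Comparison metric from the text: each output neuron's response
    variance over the whole stimulus set, averaged across neurons.
    Low values indicate stable (invariant) responses."""
    return (patterns @ W.T).var(axis=0).mean()

# Toy usage: two stimuli, each seen at several consecutive 'positions'.
rng = np.random.default_rng(1)
seq = np.vstack([rng.random((5, 16)), rng.random((5, 16))])
W = train_trace_rule(seq, n_out=2)
print(mean_response_variance(W, seq))
```

The trace rule depends on temporal smoothness of the input sequence, which is exactly what the randomized PTU image pairs destroy; the mean response variance makes that failure measurable.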
We apply the constraint of temporal stability across attention shifts and temporally integrate position-invariant neural response patterns of local features within attention windows to attain attention-shift-invariant object representations. We implemented a simplified version of our model and tested it with both computer-simulated data and computer-modified images of real objects. In these tests, local features were obtained from the retinal images falling in an attention window selected by an attention shift mechanism. The incorporation of the attention shift mechanism speeds up the learning process by actively acquiring useful information about the correlated relationship between the different neural responses of the same local feature at various positions, and
the relationship between the partial and the whole (i.e., local features of an object and the object as a whole entity). The results show that our model works well in achieving both position invariance and attention-shift invariance, regardless of retinal distortions. We demonstrated that our method performs well in realistic situations in which the temporal sequence of input data is not smooth, situations in which earlier approaches have had difficulty.

Acknowledgments

We acknowledge funding support from the Institute for Robotics and Intelligent Systems. M.L. thanks Precarn for its financial support.

References

Bartlett, M. S., & Sejnowski, T. J. (1998). Learning viewpoint invariant face representations from visual experience by temporal association. In H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulié, & T. Huang (Eds.), Face recognition: From theory to applications (pp. 381–390). New York: Springer-Verlag.

Becker, S. (1993). Learning to categorize objects using temporal coherence. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 361–368). San Mateo, CA: Morgan Kaufmann.

Becker, S. (1999). Implicit learning in 3D object recognition: The importance of temporal context. Neural Computation, 11(2), 347–374.

Bridgeman, B., Van der Heijden, A. H. C., & Velichkovsky, B. M. (1994). A theory of visual stability across saccadic eye movements. Behavioral and Brain Sciences, 17(2), 247–292.

Chance, F. S., Nelson, S. B., & Abbott, L. F. (2000). A recurrent network model for the phase invariance of complex cell responses. Neurocomputing, 32–33, 339–344.

Clark, J. J., & O'Regan, J. K. (2000). A temporal-difference learning model for perceptual stability in color vision. In Proceedings of the 15th International Conference on Pattern Recognition (Vol. 2, pp. 503–506). Los Alamitos, CA: IEEE Computer Society.

Desimone, R. (1990). Complexity at the neuronal level (commentary on "Vision and complexity," by J. K. Tsotsos). Behavioral and Brain Sciences, 13(3), 446.

Deubel, H., Bridgeman, B., & Schneider, W. X. (1998). Immediate post-saccadic information mediates space constancy. Vision Research, 38, 3147–3159.

Einhäuser, W., Kayser, C., König, P., & Körding, K. P. (2002). Learning the invariance properties of complex cells from their responses to natural stimuli. European Journal of Neuroscience, 15, 475–486.

Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Gross, C. G., & Mishkin, M. (1977). The neural basis of stimulus equivalence across retinal translation. In S. Harnad, R. Doty, J. Jaynes, L. Goldstein, & G. Krauthamer (Eds.), Lateralization in the nervous system (pp. 109–122). New York: Academic Press.

Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.

Hyvärinen, A., & Hoyer, P. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41, 2413–2423.

Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal responses in monkey inferotemporal cortex. Journal of Neurophysiology, 73(1), 218–226.

Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.

Kikuchi, M., & Fukushima, K. (2001). Invariant pattern recognition with eye movement: A neural network model. Neurocomputing, 38–40, 1359–1365.

Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.

Körding, K. P., & König, P. (2001). Neurons with two sites of synaptic integration learn invariant representations. Neural Computation, 13, 2823–2849.

Leopold, D. A., & Logothetis, N. K. (1998). Microsaccades differentially modulate neural activity in the striate and extrastriate visual cortex. Experimental Brain Research, 123, 341–345.

Li, M., & Clark, J. J. (2002). Sensorimotor learning and the development of position invariance. Poster session presented at the 2002 Neural Information and Coding Workshop, Les Houches, France.

Maunsell, J. H. R., & Cook, E. P. (2002). The role of attention in visual processing. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 357, 1063–1072.

Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335, 817–820.

Norman, J. (2002). Two visual systems and two theories of perception: An attempt to reconcile the constructivist and ecological approaches. Behavioral and Brain Sciences, 25(1), 73–96.

Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 13(11), 4700–4719.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.

O'Regan, J. K., & Noë, A. (2001). A sensorimotor account of vision and visual consciousness. Behavioral and Brain Sciences, 24(5), 939–973.

Perrett, D. I., Rolls, E. T., & Caan, W. (1982). Visual neurons responsive to faces in the monkey temporal cortex. Experimental Brain Research, 47, 329–342.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.

Rolls, E. T. (1995). Learning mechanisms in the temporal lobe visual cortex. Behavioural Brain Research, 66, 177–185.

Rolls, E. T. (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron, 27(2), 205–218.

Salinas, E., & Sejnowski, T. J. (2001). Gain modulation in the central nervous system: Where behavior, neurophysiology, and computation meet. Neuroscientist, 7(5), 430–440.

Shewchuk, J. R. (1994). An introduction to the conjugate gradient method without the agonizing pain. Available online at: http://www-2.cs.cmu.edu/~jrs/jrspapers.html.

Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88(2), 135–170.

Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194.

Wallis, G., Rolls, E. T., & Földiák, P. (1993). Learning invariant responses to the natural transformations of objects. International Joint Conference on Neural Networks, 2, 1087–1090.

Walsh, V., & Kulikowski, J. J. (Eds.). (1998). Perceptual constancy: Why things look as they do. Cambridge: Cambridge University Press.

Received April 23, 2003; accepted April 5, 2004.
LETTER
Communicated by Jonathan Victor
Testing for and Estimating Latency Effects for Poisson and Non-Poisson Spike Trains

Valérie Ventura
[email protected] Department of Statistics and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213, U.S.A.
Determining the variations in response latency of one or several neurons to a stimulus is of interest in different contexts. Two common problems concern correlating latency with a particular behavior, for example, the reaction time to a stimulus, and adjusting tools for detecting synchronization between two neurons. We use two such problems to illustrate the latency testing and estimation methods developed in this article. Our test for latencies is a formal statistical test that produces a p-value; it is applicable to Poisson and non-Poisson spike trains via use of the bootstrap. Our estimation method is model free, fast, and easy to implement, and its performance compares favorably to other methods currently available.

1 Introduction

It is often of interest to determine the response latencies of one or more neurons to repeated presentations of a stimulus. One may be interested in adjusting synchronization tools, most commonly the cross-correlogram (CC) and joint peristimulus time histogram (JPSTH), for trial-to-trial effects such as excitability and latency effects (Brody, 1999a, 1999b; Baker & Gerstein, 2001). Another application consists of correlating neural response latency with a particular behavioral variable such as reaction time, as, for example, in Everling, Dorris, and Munoz (1998), Everling, Dorris, Klein, and Munoz (1999), Everling and Munoz (2000), Hanes and Schall (1996), and Horwitz and Newsome (2001). We illustrate both applications in section 5.

Methods to estimate latencies are also varied. For example, Brody (1999b) estimates latencies by minimizing the peak of the CC shuffle corrector, although he cautions that his estimates are flawed when latencies are not, in fact, present. Baker and Gerstein (2001) rectify this problem by providing three alternative estimation methods, all based on detecting the time at which the firing rate increases from the baseline, which we refer to as change-point methods.
Neural Computation 16, 2323–2349 (2004) © 2004 Massachusetts Institute of Technology

Two of their estimates possess good properties but are computationally expensive and depend on an assumed model for the spike trains. Baker and Gerstein (2001) also provide a diagnostic that indicates whether excitability and latency effects may be present, although no
measure of significance, for example, a p-value, is given for this diagnostic. Latency estimation methods have also been developed for continuous response waveforms like electroencephalograms rather than for point processes, although perhaps they could be applied successfully to smoothed spike trains. For example, Woody (1967) takes the latency of a waveform on a particular trial to be the shift that maximizes the correlation between the shifted waveform and a template. In the statistical literature, Pham, Mocks, Kohler, and Gasser (1987) develop a change-point method for continuous signals based on a maximum likelihood paradigm that involves some modeling assumptions and substantial machinery.

In section 2, we describe a method to estimate latencies that, like the method of Woody (1967), uses the whole duration of the neuron's response rather than just the time at which the firing rate increases from baseline, as change-point methods do. It compares well to existing methods in terms of efficiency and computational simplicity. Specifically, our estimates require only calculations of sample means, so they are simple and very fast to obtain, do not require any model assumptions, and have smaller biases and variances than other available methods on a variety of simulated data, as shown in section 3. Section 4 concerns statistical inferences about latencies. Because estimating nonexistent effects typically adds random noise to statistical procedures, we first propose a formal statistical test for latency effects; by "formal," we mean that we provide not only a diagnostic but also a p-value. We also explain how variances can be calculated for the latency estimates. The methods of this section are straightforward for Poisson spike trains but require a more careful treatment otherwise, which is why they are deferred to the end of the article.
Finally, we validate our methods on two real applications in section 5 and conclude in section 6.

2 Latency Estimation

The methods developed in this letter rely on a simple result: the spike times of a Poisson process with firing rate λ(t) can be viewed as a random sample from a distribution with density proportional to λ(t). Although this result applies to Poisson spike trains only, we show that our estimation method applies generally. Say that K trials of a neuron were recorded under identical repeated presentations of a particular stimulus or experiment. Let λ(t) denote the time-varying firing rate of the neuron. If τk ≥ 0 denotes the latency of trial k and if we assume that the only effect of a latency is to delay the onset of the response to the stimulus by τk, then the firing rate of trial k is λ(t − τk); hence, its spike times can be considered a random sample with distribution proportional to λ(t − τk). Therefore, assuming that there is no source of variability between trials other than latency effects and the random variability associated with the spike generation mechanism, the spike times combined
over all K trials can be viewed as a random sample with distribution proportional to

$$\sum_{k=1}^{K} \lambda(t - \tau_k). \qquad (2.1)$$
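The result underlying this construction — that the spike times of an inhomogeneous Poisson process behave as draws from a density proportional to λ(t) — is easy to check numerically. The sketch below simulates such a process by thinning; the ramping rate function is an illustrative choice.

```python
import numpy as np

def poisson_spikes(lam, t_max, lam_max, rng):
    """Spike times of an inhomogeneous Poisson process on [0, t_max],
    simulated by thinning: draw candidates at constant rate lam_max
    and keep each with probability lam(t) / lam_max."""
    n = rng.poisson(lam_max * t_max)
    cand = rng.uniform(0, t_max, n)
    keep = rng.uniform(0, lam_max, n) < lam(cand)
    return np.sort(cand[keep])

# A rate that ramps up linearly over one second: the retained spike
# times follow the density f(t) = 2t on [0, 1], whose mean is 2/3,
# so the sample mean of the spike times should fall near 0.667.
rng = np.random.default_rng(1)
spikes = poisson_spikes(lambda t: 100.0 * t, 1.0, 100.0, rng)
print(len(spikes), spikes.mean())
```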
The mixture distribution 2.1 has the same "shape" as the peristimulus time histogram (PSTH) of the observed trials. We take our estimates of τ1, ..., τK to be the values that minimize the variance V(τ) of equation 2.1 with respect to τ = (τ1, ..., τK), that is,

$$(\hat{\tau}_1, \ldots, \hat{\tau}_K) = \arg\min_{\tau} V(\tau), \qquad V(\tau) = \mathrm{var}\left[\sum_{k=1}^{K} \lambda(t - \tau_k)\right]. \qquad (2.2)$$
To make an analogy from probability densities back to spike trains, this criterion finds the set of latency shifts that makes the PSTH the narrowest and therefore the highest. Note that adding the same arbitrary constant τ0 to all latencies shifts the PSTH of the spike trains by τ0 but does not change its shape, which suggests that equation 2.2 will produce latency estimates that are defined relative to one another. However, latency estimates are often used to calculate correlations, for example, with a behavioral variable, which are invariant to a constant shift in either or both variables. In particular, only the relative times between two neurons' spikes are needed to calculate their CC.

We first select a time window [T1, T2] that includes the period of response of the neuron to the stimulus on all trials. We divide [T1, T2] into small enough bins so that at most one spike falls into any bin on each trial; these intervals are centered at times tj. Let nkj = 0 or nkj = 1 record the absence or presence of a spike at time tj for spike train k. Letting τk denote the true latency for trial k, the variance to be minimized with respect to all τk is V(τ) = E2(τ) − E1(τ)², where E1(τ) and E2(τ) are the first two moments of the distribution of all trials combined, given in equation 2.1. Specifically,

$$E_1(\tau) = \frac{1}{K}\sum_{k}\sum_{j} h_{kj}(t_j - \tau_k) = \frac{1}{K}\sum_{k}(\bar{t}_k - \tau_k), \qquad (2.3)$$

and

$$E_2(\tau) = \frac{1}{K}\sum_{k}\sum_{j} h_{kj}(t_j - \tau_k)^2, \qquad (2.4)$$
where $h_{kj} = n_{kj}/n_{k\cdot}$ are the $n_{kj}$ standardized by the total number of spikes, $n_{k\cdot} = \sum_j n_{kj}$, for trial $k$, and $\bar{t}_k = \sum_j h_{kj} t_j$. Setting $\partial V(\tau)/\partial \tau_i$ to zero gives, for $i = 1, \ldots, K$,

$$\tau_i - K^{-1}\sum_{k} \tau_k = \bar{t}_i - K^{-1}\sum_{k} \bar{t}_k, \qquad (2.5)$$

with solutions of the form

$$\hat{\tau}_k = \bar{t}_k - \tau_0, \quad \text{for all } k = 1, \ldots, K, \qquad (2.6)$$
for an arbitrary $\tau_0$. That is, the latency estimate for trial $k$ is the sample mean $\bar{t}_k$ of the spike times that occurred between $T_1$ and $T_2$, shifted by an arbitrary amount $\tau_0$. The latencies are relative rather than absolute, since $\tau_0$ can take any value. Should absolute latencies be needed, all we need is the absolute time position $T$, so that $T + \hat{\tau}_k$ become the absolute latencies; this is illustrated in section 2.1. Because latency estimates are sample means, smoothing the spike trains first does not affect the latency estimates, but it reduces their variability (see section 3). The benefits of smoothing neural data are discussed more generally in Kass, Ventura, and Cai (2003). We used a kernel smoother, so that if $n_{kj} = 1$ or $0$ denotes the presence or absence of a spike at time $t_j$, the smoothed spike train has values

$$n^{*}_{kj} = (T_2 - T_1)^{-1}\, h^{-1} \sum_{i} n_{ki}\, \mathcal{K}\!\left(\frac{t_i - t_j}{h}\right), \qquad (2.7)$$

where the summation is over all the time bins in $[T_1, T_2]$, $\mathcal{K}(\cdot)$ is the standard normal kernel, and $h$ is the bandwidth, which controls the amount of smoothness. To estimate the latencies from smoothed spike trains, we apply our estimation procedure with $h_{kj}$ in equations 2.3 and 2.4 replaced by $h^{*}_{kj} = n^{*}_{kj}/n^{*}_{k\cdot}$, where $n^{*}_{k\cdot} = \sum_j n^{*}_{kj}$.

Before we illustrate this procedure, recall that it is based on the fact that the spike times of a Poisson process can be considered a random sample with distribution proportional to the firing rate $\lambda(t)$. However, because the resulting latency estimates are the sample means of the spike times, the method is also valid for non-Poisson spike trains, since the center of mass of a distribution can be estimated consistently from either a random (Poisson) or a correlated (non-Poisson) sample from $\lambda(t)$; this is illustrated in sections 2.1 and 3. What the distribution of the spike trains does affect are second-order properties such as variances of estimates, confidence intervals, and statistical tests of hypotheses (see section 4).

2.1 Implementation and Illustration. There are several options to implement the estimation procedure. For example, one can slide a window of
length T2 − T1 along each trial until the sample spike times in the moving window match across trials, or one can fix the estimation window [T1, T2] and calculate the mean spike times for each trial in that window. We implemented the latter. We calculate equation 2.6, shift the trials, and iterate. Indeed, even though equation 2.5 is deterministic, V(τ) and its derivatives use spikes in the fixed time window [T1, T2]; therefore, once the trials have been shifted, a different (though overlapping) set of spikes contributes to the new V(τ) and thus yields new latencies, which we use to adjust the previous estimates. Convergence is achieved when the latency adjustments are negligible, or when the decrease in the variance V(τ) of equation 2.2 is smaller than some ε > 0. The second criterion is easier to implement since it requires monitoring only one quantity, and we declare that convergence is reached when the change in V(τ) is smaller than 1% of V(τ) for three successive iterations.

To illustrate the procedure, we simulated K = 500 gamma(8) spike trains with the firing rate shown in the middle top panel of Figure 2, with a 20 Hz baseline and a 60 Hz stimulus-induced rate, then shifted the trials according to latencies sampled from a uniform distribution on [400, 1000] msec. (More details about the simulated samples are provided in section 3.) The top center panel of Figure 1 shows a raster plot of the first 30 spike trains, and the top left panel shows the PSTH of all K = 500 trials, with the underlying firing rate superimposed; the effect of the latencies can clearly be seen. The top right panel plots the true versus the estimated latencies obtained from the first iteration of the algorithm; both sets of latencies are recentered around zero to compare relative rather than absolute latencies. The successive rows of panels show the results of the successive iterations, with raster plots and PSTHs based on the spike trains reshifted according to the current latency estimates.
The algorithm converged after seven iterations. We used $\tau_0 = \min_k \bar{t}_k$ in equation 2.6 to ensure that [T1, T2] contains the response of the neuron to the stimulus even after the trials are shifted. With that choice, the trial with the smallest estimated latency is not shifted, while the other trials get shifted to align with it, as seen in Figure 1. Hence, to transform relative latencies $\hat{\tau}_k$ into absolute latencies $T + \hat{\tau}_k$, we take T to be the time of onset of the response to the stimulus based on the PSTH of the shifted trials; T can thus be determined by any change-point method or simply by the naked eye. The resulting estimate of T will be accurate since it is based on a PSTH rather than on a single spike train. In Figure 1, the smallest true latency was τ0 = 418.17 msec, while we obtained T = 418.54 msec by clicking on a computer plot of the PSTH of the shifted trials.
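The fixed-window procedure described above can be sketched as follows: estimate each trial's latency as the mean spike time in [T1, T2] (equation 2.6, with τ0 = min t̄k), shift the trials, and iterate until the variance of the pooled spike times stops decreasing. The convergence tolerance and the toy data below are illustrative assumptions, not the article's simulation settings.

```python
import numpy as np

def estimate_latencies(trials, T1, T2, max_iter=20, tol=0.01):
    """Relative latencies via equation 2.6: each trial's latency is the
    mean of its spike times inside the fixed window [T1, T2], with
    tau0 = min over trials; shift and iterate until the variance of the
    pooled (PSTH-like) spike times stops decreasing appreciably.
    trials: list of 1-D arrays of spike times."""
    shifts = np.zeros(len(trials))
    prev_var = np.inf
    for _ in range(max_iter):
        means = np.array([
            (tr - sh)[((tr - sh) >= T1) & ((tr - sh) <= T2)].mean()
            for tr, sh in zip(trials, shifts)
        ])
        shifts += means - means.min()            # tau0 = min mean spike time
        pooled = np.concatenate([tr - sh for tr, sh in zip(trials, shifts)])
        pooled = pooled[(pooled >= T1) & (pooled <= T2)]
        v = pooled.var()
        if prev_var - v < tol * v:               # variance stopped decreasing
            break
        prev_var = v
    return shifts

# Toy usage: identical spike patterns delayed by known latencies;
# the estimated relative latencies should recover the true ones.
rng = np.random.default_rng(2)
base = np.sort(rng.uniform(400, 600, 40))        # response between 400 and 600 ms
true_lat = np.array([0.0, 30.0, 60.0, 90.0])
trials = [base + lat for lat in true_lat]
est = estimate_latencies(trials, 0.0, 1000.0)
print(est - est.min())
```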
2.2 Simultaneously Recorded Neurons. Assume that N neurons are recorded simultaneously and that the same latency applies to all neurons in a particular trial, up to a constant, so that the latency for neuron i in trial k is τk + βi , i = 1, . . . , N, k = 1, . . . , K. Our estimation procedure applies as
V. Ventura

[Figure 1 appears here: for each iteration, a PSTH of the shifted trials, a raster plot, and a scatterplot of estimated versus true latencies; see the caption below.]
Figure 1: Iterations for latency estimation. Successive rows of panels correspond to successive iterations, until convergence. The data consist of K = 500 trials with rate shown in the right panels and latencies uniformly distributed on [400, 1000] msec. The first column shows the PSTHs of the trials shifted by the current latency estimates, and the second column the corresponding rasters; only the first 30 trials were plotted for visibility. The last column shows the centered estimated latencies versus the centered true latencies. The straight line is the first diagonal; the number in the upper left corner is the current value of V(τ ) in equation 2.2. We did not show iterations 4–6, but recorded the successive values of V(τ ) in the bottom right panel.
Testing and Estimating Latencies
before, but with h_kj in equations 2.3 and 2.4 replaced by N⁻¹ Σ_{l=1}^{N} h_kj^l, where h_kj^l = n_kj^l / n_k·^l, and n_kj^l = 1 or 0 indicates the presence or absence of a spike at time t_j in trial k for neuron l. Basically, all this does is combine the spikes of all N neurons for each trial into a "combined" spike train, with rate

λ(t) = Σ_{l=1}^{N} λ_l(t − β_l),
where λ_l(t) is the rate of neuron l. This is illustrated in section 5.2 to adjust a CC for latency effects.

3 Procedure Performance and Limitations

In this section, we investigate more fully the properties of our estimates. We assess the effects of the firing rate λ(t) and the number of trials K. We illustrate that gains in efficiency can be obtained from smoothing the spike trains and discuss briefly how to choose the estimation window [T1, T2]. We also show that the latency estimation procedure can readily be applied to spike trains with simple constant excitability covariations in addition to latency effects, but illustrate how more complex excitability effects can bias the estimates. We measure the efficiency of the estimates using

MSE = { K⁻¹ Σ_{k=1}^{K} [(τ̂k − τ̂•) − (τk − τ•)]² }^{1/2},   (3.1)
where τk and τˆk are true and estimated latencies for trial k, and τ• and τˆ• are their respective means. True and estimated latencies are recentered at zero to compare relative rather than absolute latencies. Equation 3.1 is the square root of what is known in statistics as the mean squared error (MSE); it is a combined measure of bias and variance. To assess the effect of the firing rate, we use the three rates shown in Figure 2 for a variety of baseline and response to stimulus rates; they are referred to as the step(a, b), block(a, b), and transient(a, b) rates, where a and b are the values of the baseline and of the maximum firing rates. As in Baker and Gerstein (2001), a ranges from 10 to 50 Hz in steps of 10 Hz, and b is two to five times a, in steps of one. For block and transient rates, the duration of the response is 1000 msec, after which the rate returns to baseline. Finally, the latencies are uniformly distributed on [400, 1000] msec. Although Baker and Gerstein used only the step rate, the performances of their estimators also apply to the block rate because they are based on change-point methods. Baker and Gerstein did not consider a transient rate, for which the departure from baseline is perhaps harder to detect.
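The efficiency measure of equation 3.1 is straightforward to compute; a minimal sketch in Python (NumPy), assuming the latencies are stored as arrays:

```python
import numpy as np

def relative_mse(tau_hat, tau):
    """Equation 3.1: root-mean-square of the recentered latency errors."""
    tau_hat, tau = np.asarray(tau_hat, float), np.asarray(tau, float)
    e = (tau_hat - tau_hat.mean()) - (tau - tau.mean())
    return np.sqrt(np.mean(e ** 2))
```

Because both sets of latencies are recentered, a constant offset between estimated and true latencies does not contribute; only the relative errors do.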
[Figure 2 appears here. Top row: the three rate types, step(a,b), block(a,b), and transient(a,b). The remaining panels plot MSE (3.1): A–C, the methods of Baker and Gerstein (variance, Bayesian, and rate change; step and block rates); D–F, the proposed method (step, block, and transient rates); G, Poisson spike trains; H, number of trials; I, excitability effects; J, bias; K, estimation window [T1, T2]; L, smoothing bandwidth. See the caption below.]
Figure 2: MSE (3.1) of the latency estimates. (A,B,C): Methods in Baker and Gerstein (2001). (All other panels) Proposed method. (K) MSE (3.1) as a function of T1 and T2 ; lighter shades denote lower MSE. (J) Bias for all simulations in D–F. All panels but G and H use 500 gamma(8) spike trains with rate indicated in the title. G uses 500 Poisson spike trains. On most panels, the baseline a is on the x-axis, and the maximum rate is b = k · a, where k is the plotting symbol.
We use two types of spike-generating mechanisms, Poisson and gamma spike trains of order q = 8, the latter to match the simulation results in Baker and Gerstein (2001). The former generates spikes independently of the past, while the latter can be used to model spike trains where refractory period effects are present. Recall, however, that our estimation method works generally.

Figures 2D through 2F show equation 3.1, based on K = 500 gamma(8) spike trains with step, block, and transient rates, from which we conclude that the type of firing rate does have some impact on efficiency. In particular, our method is more effective for firing rates that return to baseline than for sustained rates. Figures 2D and 2E are directly comparable to Figures 2F, 3G, and 4G in Baker and Gerstein (2001), which we reproduced for convenience in Figures 2A through 2C. Figure 2H shows that the efficiency does not depend on the number of trials.

Baker and Gerstein (2001) used the mean of the errors [τ̂k − τk] as a measure of the tendency to consistently over- or underestimate the latencies; their methods in Figures 2A through 2C produced biases as large as, in absolute value, 20, 30, and 70, respectively. The equivalent measure for our relative latencies is the mean of [(τ̂k − τ̂•) − (τk − τ•)], which is always zero. We therefore used the median of these errors as our measure of bias, which we plotted in Figure 2J. The bias is close to zero, which suggests that our estimates are randomly scattered around the true relative latencies, as could be seen in Figure 1. We discuss conditions under which estimates can be biased in section 3.1.

Figure 2G suggests that latencies based on Poisson spike trains are less accurate than latencies based on gamma(8) spike trains with the same rate in Figure 2D. This is not surprising since there is more variability in Poisson than in gamma(8) spike times. We discuss further the variances of the estimates in section 4.
Figure 2L shows equation 3.1 as a function of the bandwidth h in equation 2.7 used to smooth the spike trains. It is clear that some efficiency can be gained from smoothing, but that the gain is fairly insensitive to h. However, our experience suggests that convergence is most easily assessed with smaller bandwidths. In the rest of this article, we used h = 200 msec. The efficiencies in Figure 2 all used [T1, T2] = [200, 2200] msec. They can be improved by finer choices of T1 and T2 . The gray scale in Figure 2K shows equation 3.1, averaged over 1000 simulated data sets, as a function of T1 ∈ [−100, 1000] msec and T2 ∈ [1300, 3600] msec. We used gamma(8) data with block(30,60) rate; results are qualitatively similar for other spike trains. True latencies are uniform on [400, 1000] msec, so that, based on the PSTH, the response to stimulus starts around 400 msec and reaches its peak around 1000 msec, which is indicated by white vertical lines. The response begins to decline around 1400 msec and returns to baseline around 2000 msec, which is indicated by horizontal white lines. It is clear that our procedure is sensitive to T1 and T2 . A good choice for T1 is just before the
firing rate begins to depart from baseline (based on the PSTH). For sustained responses to stimuli (step rate, not shown), we found that the efficiency of our method is not sensitive to the particular choice of T2, whereas for nonsustained rates (block or transient), a good choice for T2 is such that the length of [T1, T2] is roughly equal to the duration of the response to the stimulus. Finally, even with optimal choices of T1 and T2, the efficiencies of the latency estimates are comparable for block and transient rates but not as good for the step or sustained rate, which we observed already in Figures 2D through 2F.

3.1 Limitations. Our latency estimation is based on the assumption that the firing rate is identical across trials except for a time shift. In many contexts, however, the conditions of the experiment or the subject may vary across repeated trials enough to produce discernible trial-to-trial spike train variation beyond that predicted by Poisson or other point processes. We illustrate the consequences of such effects on our latency estimates. Assume that the state of the neuron varies slowly so that the firing rate of trial k is

αk λ(t − τk),   (3.2)

rather than λ(t − τk), where αk is some positive constant; that is, the firing rate is inflated or deflated by a multiplicative gain αk on each trial, as pictured in Figure 3A. Because the αk are constants, the firing rates αk λ(t − τk) and λ(t − τk) are proportional to the same density; hence, they have the same means and thus produce the same estimates of latency. What differs is the number of spike times used for estimation, which affects the variance of the means rather than their average values (see section 4).

Figure 2I shows the efficiency of the latency estimates for 500 gamma(8) spike trains with rate given by equation 3.2, αk uniformly distributed on [0.5, 1.5], τk uniformly distributed on [400, 1000], and λ(t) the transient rate. Figure 2F uses spike trains that are similar in every way except for the excitability covariations. We picked the αk in equation 3.2 so that the total numbers of spikes across the K trials are comparable for Figures 2F and 2I, which explains why the efficiencies are also fairly comparable.

In practice, excitability effects will likely be more complex than equation 3.2, and depending on the degree of deviation from it, the latency estimates will be biased. A partial solution that sacrifices some efficiency is to reduce the estimation window [T1, T2] so that the excitability effects are approximately constant on that window. This is illustrated in Figure 3, based on gamma(8) spike trains with firing rate for trial k

gk(t − τk) λ(t − τk),
[Figure 3 appears here: panels A and B show firing rates with multiplicative and complex excitability effects, panel C a raster plot of 50 trials, and panel D true versus estimated latencies for T2 = 1300, 1500, 1750, and 2000 msec; see the caption below.]
Figure 3: Excitability effects. (A) Firing rates for four trials with multiplicative excitability effects (see equation 3.2). (B) Firing rates for five trials with complex excitability effects. (C) Raster plot for 50 trials with complex excitability effects. (D) True versus estimated latencies for several values of T2 . The first row of panels is for excitability effects in A, and the bottom row for excitability effects in B.
with λ(t) the transient(20,120) rate and τk uniformly distributed on [400, 600] msec. We generated excitability effects using

gk(t) = {a0k + a1k · h1(t, µ1k)}{1 + ck · a2k · h2(t, µ2k)},

where a0k, a1k, and a2k are gain coefficients that are normally distributed with respective means and variances 0.8 and 0.2², 0.4 and 0.1², and 1 and 0.5², and ck is a Bernoulli random variable with Pr(ck = 1) = Pr(ck = 0) = 0.5. The time component h2(t, µ2k) is a normal density with standard deviation 200 and mean µ2k that we take to be normally distributed with mean 1700 and variance 150²; h2 adds a later component to the neural response at a random time, for 50% of the trials. The other time component h1(t, µ1k) is a gamma density function that takes nonzero values when t > µ1k, where µ1k = 1800 − 2τk depends on the latency; it inflates and lengthens the first peak of the firing rate by random amounts.

Figure 3B shows five fairly extreme firing rates from this model, all with latencies τk = 400 msec, while Figure 3C shows 50 typical spike trains generated from the model with latencies uniformly distributed on [400, 600] msec. The estimated latencies from 500 such spike trains are plotted against the true latencies in the bottom panels of Figure 3D, where it is clear that the estimates become severely biased as T2 becomes larger. For comparison, the top panels in Figure 3D are the corresponding plots from spike trains shown in Figure 3A, with multiplicative excitability effects (see equation 3.2). For T2 smaller than 1300 msec, latency estimates from trials with either type of excitability effect are comparable. If excitability effects are so extreme that the firing rate does not have a common component across trials, a change-point latency estimation method may be more efficient.

4 Inference for Latencies

Typical consequences of estimating nonexistent effects on subsequent statistical procedures are loss of statistical power and efficiency.
It is therefore important to test whether response latencies are constant across trials. We illustrate this below for simulated data, and in section 5 for real data. We also provide standard deviations for the latencies.

To carry out a statistical test, we must choose a test statistic T, determine its distribution under the null hypothesis H0 that the response latency is constant, which we refer to as the "null distribution of the test statistic," and finally compare this distribution to the observed value tobs of the test statistic, typically via the p-value, to determine if the null hypothesis should be rejected. Choosing a test statistic is most easily done if we consider once more the analogy between firing rates and distributions. The spike times of trial k can be viewed as a sample from a distribution proportional to λ(t − τk). Let µk denote its true mean, whose estimate is the sample mean t̄k in equation 2.6.
Then the null hypothesis H0 that all latencies are equal is equivalent to the assumption of equal means in different samples, H0 : µ1 = · · · = µk = · · · = µK, which is routinely dealt with by an ANOVA F-test. The ANOVA test statistic is

F = [ Σ_{k=1}^{K} nk (t̄k − t̄·)² / (K − 1) ] / [ Σ_{k=1}^{K} Σ_{j=1}^{nk} (tkj − t̄k)² / (n· − K) ],   (4.1)
where tkj, j = 1, . . . , nk, are the spike times of trial k, t̄k is their mean, t̄· is the sample mean of the spike times across all K trials, nk is the number of spikes in trial k, and n· is the number of spikes across all K trials. Large values of F provide evidence against the null hypothesis of constant latency; thus, the p-value is

p = Pr(F ≥ fobs | H0 true),   (4.2)
where fobs denotes the observed value of equation 4.1 in the sample, and "| H0 true" means that the probability is calculated with respect to the null distribution of F. It is commonly assumed that under H0, the ANOVA F-statistic has a Fisher F-distribution with K − 1 and n· − K degrees of freedom, and it is common practice to reject H0 when equation 4.2 is smaller than 5%, or 1%. This null distribution involves assumptions that we come back to later in this section.

Whatever the outcome of the global test above, it may be of interest to test the latency of a particular trial against a value t0 or to compare the latencies of any two trials. This can be dealt with using one-sample and two-sample t-tests, respectively; note that a two-sample t-test is equivalent to the F-test (see equation 4.1) applied to K = 2 trials. In this article, we consider only the tests of every pair of trials, since choosing a particular point t0 seems arbitrary; moreover, it seems more relevant to test whether particular trials have latencies that differ from those of the other trials, as illustrated in section 5. We report the p-values of all two-sample t-tests in a K × K matrix, where K is the number of trials.

An important point is that the simple excitability effects (see equation 3.2) do not affect either test, because an overall inflated or deflated rate αk λ(t − τk) normalizes to the same distribution as the rate λ(t − τk). What happens is that excitability becomes a sample size effect that is seamlessly taken care of by the test statistic F via the spike counts nk in equation 4.1. Of course, it is unlikely that excitability effects will be exactly of the simple form of equation 3.2, but they may be sufficiently well approximated by that equation on a small testing window [T1, T2]. Section 5.2 presents an application where excitability effects are present, yet our testing and estimation procedures appear to work well. Global and pairwise tests are illustrated in Figure 4A.
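The global F-test of equations 4.1 and 4.2 can be sketched as follows in Python (NumPy/SciPy); the spike times passed in are assumed to have already been restricted to the testing window [T1, T2], and the function is simply the standard one-way ANOVA computation.

```python
import numpy as np
from scipy import stats

def latency_anova(trials):
    """Equations 4.1 and 4.2: one-way ANOVA F-test of equal mean spike times
    (i.e., constant latency) across trials. `trials` is a list of arrays of
    spike times, assumed restricted to the testing window beforehand."""
    trials = [np.asarray(t, float) for t in trials]
    means = np.array([t.mean() for t in trials])
    ns = np.array([t.size for t in trials])
    K, n_tot = len(trials), ns.sum()
    grand = np.concatenate(trials).mean()
    # numerator and denominator of (4.1)
    between = np.sum(ns * (means - grand) ** 2) / (K - 1)
    within = sum(np.sum((t - m) ** 2) for t, m in zip(trials, means)) / (n_tot - K)
    F = between / within
    p = stats.f.sf(F, K - 1, n_tot - K)      # p-value (4.2)
    return F, p
```

This reproduces scipy.stats.f_oneway applied to the same samples; a small global p-value is evidence of latency variation across trials.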
We simulated 100 Poisson trials with block(20, 60) rate, latency 400 msec for the first 50 trials,
V. Ventura
and 700 msec for the other 50 trials. This shows clearly on the raster plot. The p-value (see equation 4.2) for the global test is less than 0.00001, suggesting that latencies are variable across trials. The right-most panel shows the K × K matrix of the p-values of the pairwise tests, with white, gray, and black cells corresponding to p-values smaller than 1% (strong evidence that the latencies are different), between 1 and 5% (evidence), and above 5% (no evidence), respectively. The two mostly black areas of the matrix correspond to pairs of trials that belong to the same latency group. These areas are not perfectly black because if α denotes the significance level of the test, that is, the probability of erroneously rejecting a true null hypothesis, we should expect to reject approximately a proportion α of tests that have H0 true. We indeed verified that the proportion of false-positive results (white and gray pixels in the two mostly black areas of the matrix) is just about 5%, and the proportion of white pixels about 1%. The test appears to lack power, since many pairs of latencies that are different are not discovered, as indicated by the dark pixels (P > 5%) in the mostly white areas of the matrix. However, given a fixed testing window [T1, T2], the ANOVA F-test is in fact known to be the most powerful test; the power happens to be low in this example because the firing rate is low, which yields small sample sizes, and Poisson trials are quite variable. Figure 4B shows that the same test applied to gamma(8) spike trains with the same rate is more powerful (see section 4.2).

To determine the optimal testing window, we conducted a simulation study similar to that in section 2. The power for the global test was estimated by the proportion of significant tests applied to 1000 simulated samples that have latency effects.
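The pairwise matrix can be assembled from two-sample t-tests on the spike times of every pair of trials. A sketch on simulated trials (all parameters illustrative), with the same black/gray/white coding as the figure:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
K = 10
latencies = np.where(np.arange(K) < K // 2, 400.0, 700.0)
trials = [lat + rng.normal(300.0, 80.0, size=20) for lat in latencies]

# K x K matrix of p-values of all two-sample t-tests on the spike times
pmat = np.ones((K, K))
for i in range(K):
    for j in range(i + 1, K):
        p = stats.ttest_ind(trials[i], trials[j]).pvalue
        pmat[i, j] = pmat[j, i] = p

# Coding used in Figure 4: white P <= 1%, gray 1% < P < 5%, black P >= 5%
shade = np.where(pmat <= 0.01, "white",
                 np.where(pmat < 0.05, "gray", "black"))
```

Pairs of trials drawn from different latency groups land in the white cells; pairs within a group are mostly black, up to the expected proportion of false positives.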
We found that the power was sensitive to the testing window and that for all rate types, a good choice for T1 is just before the firing rate begins to depart from the baseline, and for T2 approximately when the response to the stimulus ends, based on the PSTH of the unshifted trials.

Figure 4A also shows 95% confidence intervals for τk, obtained as follows. Our estimates τ̂k are sample means, so the central limit theorem applies to give

$\hat\tau_k \,\dot\sim\, N(\tau_k, \sigma_k^2);$   (4.3)

that is, τ̂k is approximately normally distributed with mean the true relative latency τk and variance σk². Hence, a 95% confidence interval for the true τk is

$\hat\tau_k \pm 2\sigma_k.$   (4.4)
To estimate σk² for each trial, we consider once more the spike times to be a random sample from a density proportional to the firing rate. Therefore,
Testing and Estimating Latencies
with nk the number of spikes in [T1, T2] for trial k, the estimated variance of τ̂k is

$\hat\sigma_k^2 = \frac{S_k^2}{n_k}, \qquad S_k^2 = \frac{\sum_j (t_{kj} - \bar t_k)^2}{n_k - 1},$   (4.5)
where t̄k and S²k are the usual sample mean and sample variance of the spike times tkj.

The methods proposed in this section so far are fast and straightforward. However, the F-test, t-test, and confidence interval results hold only if the spike times are a random sample from λ(t − τk) and if λ(t − τk) is proportional to a normal distribution. The second condition is not generally crucial provided the number of spikes in each trial is not too small; this follows from the central limit theorem. The first condition is the most worrisome, since the spike times are random only for Poisson spike trains. For non-Poisson spike trains, we know of no result that specifies the null distribution of F, and thus we must either derive it theoretically or obtain an approximation using a bootstrap simulation. The first option is too daunting, if at all possible, so we chose the second.

4.1 Bootstrap Inference. In this section, we provide general bootstrap simulation algorithms for tests and confidence intervals (see Davison & Hinkley, 1997, for a complete bootstrap treatment). Theoretical and bootstrap results are compared in the Poisson case, where we can get both. Let T denote a test statistic and tobs its value in the sample. For the global test, T is the ANOVA F-statistic (see equation 4.1), and for a pairwise test, T is the two-sample t-statistic, or equivalently the F-statistic (see equation 4.1) evaluated for the two trials under consideration. A general bootstrap testing algorithm follows:

Bootstrap Testing

1. For r = 1, . . . , R:
   (a) Create bootstrap sample r that satisfies the null hypothesis H0. Options are described below.
   (b) Calculate t*r, the value of the test statistic T in bootstrap sample r.

2. The histogram of the R values of t*r approximates the null distribution of T. The bootstrap p-value that approximates the exact p-value in equation 4.2 is

$p_{\mathrm{boot}} = \frac{1 + \#\{t^*_r \ge t_{\mathrm{obs}}\}}{R + 1}.$   (4.6)
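The testing algorithm can be sketched directly. Here bootstrap samples satisfying H0 are created nonparametrically by pooling the spike times of all trials and resampling nk of them for each trial k (one of the options described below; resampling with replacement is an assumption of this sketch):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def f_stat(trials):
    # Test statistic T: the one-way ANOVA F-statistic on the spike times
    return stats.f_oneway(*trials).statistic

def bootstrap_p(trials, R=199):
    """Bootstrap p-value (equation 4.6) for the global latency test."""
    t_obs = f_stat(trials)
    pooled = np.concatenate(trials)          # H0: one common firing rate
    count = 0
    for _ in range(R):
        # Reallocate n_k resampled spike times to trial k
        boot = [rng.choice(pooled, size=len(t), replace=True) for t in trials]
        if f_stat(boot) >= t_obs:
            count += 1
    return (1 + count) / (R + 1)

# Illustrative trials with two latency groups: the test should reject H0
trials = [lat + rng.normal(300.0, 80.0, size=20)
          for lat in (400.0, 400.0, 700.0, 700.0)]
p_boot = bootstrap_p(trials)
```

Because the pooled resampling erases any latency differences, the bootstrap distribution of F approximates the null distribution whatever the true latencies are.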
[Figure 4 appears here: raster plots with 95% confidence intervals for the relative latencies and K × K matrices of pairwise p-values for (A) Poisson and (B) gamma spike trains; panels C–F show null distributions of the F-statistic (bootstrap histograms and curves, with the asymptotic F approximation and observed values marked). See the caption below.]
Figure 4: (A) 95% confidence intervals for the latency estimates of the trials in the raster plot, and K × K matrix of p-values of all pairwise latency tests. Black: P ≥ 5%; gray: 1% < P < 5%; white: P ≤ 1%. The spike trains are Poisson with block(20, 60) rate, latencies 400 msec for the first 50 trials, and 700 msec for the other 50 trials. (B) Same as A but for gamma(8) spike trains. (C, D) Asymptotic F-approximation (solid curves) and bootstrap null distributions (histograms) for the F-statistic (see equation 4.1), with observed values fobs indicated by vertical lines. The spike trains are Poisson with block(20, 60) rate, with (C) no latencies and (D) latencies uniformly distributed on [400, 1000] msec. (E) Bootstrap gamma (histogram), bootstrap IMI (bold curve), and bootstrap Poisson (dotted curve) null distributions for the test F-statistic (see equation 4.1), with observed values fobs indicated by the vertical line, based on gamma(8) spike trains with block(20, 60) rate and latencies uniformly distributed on [400, 600] msec. (We use latencies in [400, 600] msec rather than in [400, 1000] msec so that fobs is within the frame of the plot.) (F) Bootstrap gamma (histograms) and bootstrap IMI (curves) null distributions of the F-statistic (see equation 4.1) based on gamma(2), gamma(1) (that is, Poisson), and gamma(1/2) spike trains with block(20, 60) rate and latencies uniformly distributed on [400, 600] msec. The dotted curve on the middle panel is the asymptotic F-approximation.

For Poisson spike trains, bootstrap samples in step 1a can be obtained parametrically or nonparametrically as follows (see Cowling, Hall, & Phillips, 1996, for other sampling options):

Poisson Bootstrap Sample

• Nonparametric. (i) Combine the spike times of all K trials. (ii) Sample m· = Σk nk spike times from the set of combined spike times, and form the new spike trains by allocating the first n1 spikes to trial 1, the next n2 spikes to trial 2, . . . , and the last nK spikes to trial K.

• Parametric. (i) Estimate λ̂0(t), the firing rate of the neuron, based on the PSTH of the unshifted trials. (ii) For k = 1, . . . , K, form trial k by generating nk spikes from a Poisson process with mean λ̂0(t). Standard methods to estimate firing rates are gaussian filtering and spline smoothing (see Ventura, Carta, Kass, Olson, & Gettner, 2001).

Figures 4C and 4D show histograms of R = 1000 values of t*r for two samples of K = 100 Poisson spike trains, where T is the F-statistic (see equation 4.1) for the global latency test. Parametric and nonparametric bootstraps produced indistinguishable results. The data in Figure 4C have H0 true (equal latencies), whereas Figure 4D used data with true latencies uniformly distributed on [400, 1000] msec. The Fisher F-distribution is overlaid; it matches the two histograms almost perfectly, as we would expect, since
the spike times are random, and the number of trials is large enough that the firing rate does not have to be bell shaped. The bootstrap and F-test p-values, equations 4.6 and 4.2, are very close, and for both data sets we make the appropriate decision: reject H0 for Figure 4D, with the conclusion that latencies vary across trials, and fail to reject H0 for Figure 4C.

The dashed curves in Figure 4E are the null distributions of equation 4.1 obtained by Poisson bootstrap and by F approximation, this time for gamma(8) rather than Poisson spike trains, with latencies uniformly distributed on [400, 600] msec. Once again, both distributions are practically identical, and thus indistinguishable in the plot, with corresponding p-values (equations 4.2 and 4.6) equal to 88.9%; we fail to reject H0, which is the wrong decision. Also plotted as a histogram is the parametric bootstrap null distribution from a gamma(8) rather than a Poisson model, with rate λ̂0(t) fitted to the unshifted spike trains via gaussian filtering. The resulting null distribution is approximately the "correct" one, since the correct model was used. The corresponding bootstrap p-value (equation 4.6) is zero, so we now reject H0, the correct decision. Figure 4E illustrates that the validity of a test (bootstrap or not) requires an appropriate model for the data, from which bootstrap samples can be simulated. Model selection for spike trains is discussed briefly in section 4.2.

A few remarks about bootstrap testing are important. A bootstrap sample should be in every way similar to the observed sample; hence the need for an appropriate model. In the context of statistical testing, bootstrap samples should also conform to the hypothetical reality imposed by H0. Here, H0 forces us to assume that the trials have a common firing rate, even if we see with the naked eye that they do not.
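The parametric sampling option above can be sketched by smoothing the PSTH of the unshifted trials and drawing each bootstrap trial's spikes from a density proportional to the estimated rate. Gaussian filtering stands in for the rate estimators cited above; the bin width, bandwidth, and trial generator are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(5)
# Illustrative unshifted trials on a 0-1500 ms window
trials = [rng.uniform(400.0, 900.0, size=rng.poisson(20) + 1) for _ in range(30)]

# (i) Estimate the common rate lambda_0(t) from the PSTH of the unshifted trials
edges = np.arange(0.0, 1510.0, 10.0)                  # 10 ms bins
psth, _ = np.histogram(np.concatenate(trials), bins=edges)
rate = gaussian_filter1d(psth.astype(float), sigma=3.0)
prob = rate / rate.sum()
centers = 0.5 * (edges[:-1] + edges[1:])

# (ii) Form bootstrap trial k by drawing n_k spike times from a density
# proportional to the smoothed rate (sampling bin centers for simplicity)
boot = [np.sort(rng.choice(centers, size=len(t), p=prob)) for t in trials]
```

Because every bootstrap trial is drawn from the same fitted rate, the samples satisfy H0 by construction, exactly as the testing algorithm requires.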
Both our parametric and nonparametric bootstraps did satisfy H0: the implicitly (combined spike times) and explicitly estimated common firing rate λ̂0(t) ignored any latency effects, since we fitted the same firing rate to all trials. If the data also contain excitability effects, the bootstrap samples must also contain that extra source of variability. (This issue is developed further in Cai, Kass, & Ventura, 2004.)

The confidence intervals, equation 4.3, hold for Poisson and non-Poisson spike trains since they are based on the central limit theorem, but the estimate of σk² proposed in equation 4.5 is valid for Poisson data only. We provide a bootstrap alternative that is valid for any spike trains:

Bootstrap Standard Deviations

1. For r = 1, . . . , R:
   (a) Create bootstrap sample r.
   (b) For each spike train k = 1, . . . , K in bootstrap sample r, calculate the mean t̄*kr of the spike times.
2. For k = 1, . . . , K, the bootstrap estimate of σk² is the sample variance of the t̄*kr,

$\hat\sigma_k^2 = \frac{\sum_{r=1}^R (\bar t^{\,*}_{kr} - \bar t^{\,*}_{k\cdot})^2}{R - 1}, \qquad \text{where } \bar t^{\,*}_{k\cdot} = \frac{1}{R} \sum_r \bar t^{\,*}_{kr}.$

Note that unlike for testing, the model used to simulate bootstrap samples in step 1a should be fitted to the spike trains first shifted according to the latency estimates. Indeed, the final latency estimates are based on shifted trials. We applied the standard deviation bootstrap algorithm to Poisson and gamma(8) spike trains. In the Poisson case, we obtained bootstrap standard deviations comparable to the analytic result in equation 4.5. The bootstrap standard deviations for gamma(8) spike trains were used to calculate the confidence intervals in Figure 4B.

In this section, we have shown that the bootstrap compares well to asymptotic results in the Poisson case, which gives it credibility when no such asymptotic results are available. For Poisson spike trains, it is safe to use the theoretical results unless the number of trials or spikes is very small or the firing rate is very dissimilar in shape to a normal density; when in doubt, a bootstrap test is also easily done. For non-Poisson spike trains, all we need to perform bootstrap inference is an appropriate model fitted to the observed data, from which bootstrap samples can be simulated.

4.2 Model Selection for Spike Trains. The quality of any test, bootstrap or not, depends on how well the chosen model fits the data; by quality, we refer to how well the actual significance level of the test matches the nominal one. This is illustrated below. Model selection, for spike trains or any other data, involves proposing competing models and determining which fits the data best. Reich, Victor, and Knight (1998) introduced the power ratio statistic to test whether spike trains can be completely characterized by an inhomogeneous firing rate.
Brown, Barbieri, Ventura, Kass, and Frank (2002) proposed a goodness-of-fit test based on the time-rescaling theorem; a goodness-of-fit test determines whether a particular model appears to fit, without requiring an alternative model. In the case where two competing models can be fit by maximum likelihood, it is standard to use the likelihood ratio (LR) test; common tests like t-tests and ANOVA F-tests are LR tests, and Pearson chi-squared tests are asymptotic approximations to LR tests. The space of models is infinite, so an exhaustive model search is unrealistic, as is the search for the "true" model. A good start is to consider the point process models that have been found to fit some neural data well, for example, inhomogeneous renewal processes, Poisson processes with refractory periods, and integrate-and-fire models (Barbieri, Quirk, Frank, Wilson, & Brown, 2001; Johnson, 1996; Reich et al., 1998). We also like to consider the
inhomogeneous Markov interval (IMI) models of Kass and Ventura (2001) because they include as particular cases inhomogeneous Poisson and homogeneous renewal process models, and thus are well suited to fit data that do not deviate much from these two large classes of models.

To illustrate these ideas, consider once more the gamma(8) spike trains used in Figure 4E. Likelihood ratio tests were performed to compare Poisson, gamma(8), and IMI models; the (true) gamma(8) model was favored over the others (P < 10⁻³). For the sake of illustration, we still performed a parametric bootstrap based on the IMI model, with the resulting null distribution overlaid as a bold curve; it is much closer to, although not equal to, the gamma bootstrap null distribution than the bootstrap Poisson distribution was. The discrepancy between the IMI and the correct gamma(8) bootstrap null distributions arises because inhomogeneous renewal processes are not IMI models (only homogeneous renewal processes are). The test based on the IMI bootstrap is conservative, so the actual significance level is much smaller than the nominal level. Figure 4F is the same as Figure 4E, but based on simulated gamma(q) spike trains with q = 2, 1, and 0.5, which are, respectively, less, as, and more variable than Poisson spike trains; the F approximation is also plotted (dotted curve) in the middle panel. Although the IMI bootstrap null distributions do not quite match their gamma counterparts, these plots suggest that the IMI model yields reasonable inference for data that are close to Poisson or renewal processes.

5 Applications

This section illustrates our latency testing and estimation methods on two examples; it does not aim to draw conclusions about the functional characteristics of the types of neurons we used.

5.1 Correlating Neural Activity and Behavior.
Individual neurons in the frontal eye field of a rhesus monkey were recorded while the animal performed a memory-guided saccade task or a delayed-response task, as described in Roesch and Olson (2003). Figure 5 shows the PSTH of 20 identical trials of a particular neuron in that experiment. The cue indicating the direction of eye movement is presented for the short period between the two vertical bars. After a waiting period, the central light is turned off at time t = 0, at which point the monkey is to execute the eye movement. For each trial, the onset of movement was recorded. We want to investigate whether the latency of the neural response predicts (or is correlated with) the onset of movement.

Before we apply the latency test, we check whether the spike trains are Poisson. The mean versus variance plot in Figure 5A suggests that the deviation from the Poisson assumption is minimal; this is confirmed by the goodness-of-fit test of Brown et al. (2002; not shown). The test of Cai et al. (2004) did
not detect any excitability effects. We thus used the simple latency test of section 4. The global test for latencies suggests strongly that latencies vary across trials, with a p-value smaller than 10⁻⁸. Figure 5C shows the overlaid smoothed PSTHs of the original and shifted trials; as expected, the latter is slightly higher. The latency estimates were robust to the choice of estimation window [T1, T2] within the guidelines of section 3. Figure 5B shows the latency estimates plotted against the time of onset of movement, with latencies and onsets recentered to have sample mean zero. We conclude that for this particular neuron, there exists a strong correlation (0.72, with P < 0.00001) between the latency of the neural response to the stimulus and the onset of eye movement.

5.2 Adjusting the Cross-Correlogram for Latency Effects. The raw cross-correlogram (CC) displays the correlation between the spike trains of two neurons at a series of time lags. It is not used directly to assess synchrony, because other correlation sources typically contribute to it. The most common such source is modulation of the firing rates by an experimental stimulus, although it is easily accounted for by subtracting the shuffle corrector (Perkel, Gerstein, & Moore, 1967). However, the remaining features in the shuffle-corrected CC merely indicate that sources of correlation other than stimulus-induced correlations exist between the two neurons. A potential source is synchrony, but other sources include variations in the firing rates' amplitudes and in latencies, which Brody (1999a, 1999b) refers to as excitability and latency effects. Baker and Gerstein (2001) provide a list of references that report such effects. Therefore, for the CC to be a useful and reliable tool for synchrony assessment, it must be adjusted for all possible sources of correlation other than synchrony.
The customary way to account for latency effects is to estimate the latencies, realign the spike trains, and proceed with the shifted spike trains as one would with ordinary spike trains (Brody, 1999b; Baker & Gerstein, 2001). We illustrate our latency testing and estimation on two neurons recorded simultaneously in the primary visual cortex of an anesthetized macaque monkey (Aronov, Reich, Mechler, & Victor, 2003, Figure 1B; units 40106.st; 67.5 degrees spatial phase; other spatial phases produced similar results). The data consist of 64 trials, shown in Figure 6B. The stimulus, identical for all trials, consists of a standing sinusoidal grating that appears at time 0 and disappears at 237 msec. The mean versus variance plots in Figure 6A lie somewhat above the diagonal, which suggests more variability than predicted under the Poisson assumption. The goodness-of-fit tests of Brown et al. (2002) in Figure 6A indeed confirm that the data deviate slightly from Poisson, since the empirical versus model quantile curves do not lie completely within the 99% joint confidence bands. The IMI model provides a better fit, although the improvement is only marginal. The slight deviation from Poisson could be
Figure 5: (A) PSTH for a neuron in Roesch and Olson (2003), based on 20 identical trials. The cue indicating the direction of the eye movement is presented between the two vertical bars; the signal to execute the movement is given at time t = 0. Also shown is the mean versus variance plot of the spike trains, based on time bins ranging from 2 to 20 msec; the solid line is the first diagonal. (B) Onset of movement versus neural response latency estimates; the solid line is the first diagonal, and the dotted line is the fitted linear regression of onset on latency. Also shown is the raster plot of the 20 spike trains, with filled circles marking the latency estimates and open circles the onsets of movement. (C) Overlaid smoothed PSTHs and raster plots of the original and shifted trials.
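The mean-versus-variance diagnostic of Figure 5A is straightforward to reproduce. The sketch below is illustrative numpy code, not the article's implementation; it assumes each trial is an array of spike times in seconds, and the function name is ours. For Poisson spike trains, the points should scatter around the diagonal variance = mean.

```python
import numpy as np

def mean_variance_points(trials, t_start, t_stop, bin_widths):
    """For each bin width, bin each trial's spike times and return
    (mean, variance) pairs of the across-trial counts in every bin.
    Under the Poisson assumption, variance ~= mean in every bin."""
    points = []
    for w in bin_widths:
        edges = np.arange(t_start, t_stop + w, w)
        counts = np.array([np.histogram(t, bins=edges)[0] for t in trials])
        points.append(np.column_stack([counts.mean(axis=0),
                                       counts.var(axis=0, ddof=1)]))
    return np.vstack(points)
```

Plotting the returned pairs against the first diagonal gives exactly the kind of display shown in Figures 5A and 6A; systematic departure above the diagonal signals extra-Poisson variability.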
partly due to excitability effects, which were found to be significant based on the test for excitability in Cai et al. (2004). Although the deviations from the Poisson model may be important from a functional standpoint, they are not large enough from a statistical standpoint to warrant the more complicated bootstrap procedures for non-Poisson spike trains, especially since the evidence for latency effects is overwhelming, as discussed below.

Figure 6B shows the raster plots for the two neurons, from which it appears that the first few and last trials have latencies that differ from those of the other trials. This is confirmed by the global tests for latencies, with a p-value of 0.045 for neuron 2 and p-values smaller than 10⁻⁸ for neuron 1 and for the two neurons combined. The lack of statistical power for neuron 2 is due to the sparseness of its spikes. The value of the test statistic for neuron 1 is tobs = 3.3, which is extreme enough to leave no doubt about the presence of latency effects under Poisson, gamma, or IMI models. Figure 6B also shows the matrix of p-values for all pairs of trials for the two neurons combined, with white and gray pixels corresponding to p-values smaller than 1% and 5%, respectively. If all trials had equal latencies, we would expect about 5% white and gray pixels, whereas we have about 30%. This confirms the outcome of the global tests. Additionally, there is a clear pattern in the pairwise p-values: the mostly black area in the middle indicates that the latencies of all trials are similar, except for the first few and last trials, as could in fact be seen by the naked eye in the raster plots.

Next, we estimated the latencies for the two neurons separately. The presence of excitability effects is not of overwhelming concern because the estimation windows [T1, T2], shown in Figure 6B, are short enough that the effects on these intervals are presumably well approximated by equation 3.2.
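A matrix of pairwise latency p-values of the kind just described can be sketched as follows, for the Poisson case only. This is illustrative code, not the article's procedure: it compares the mean spike times of two trials with a simple normal-approximation two-sample test, and the function names are ours.

```python
import numpy as np
from math import erfc, sqrt

def mean_time_pvalue(a, b):
    """Two-sided normal-approximation test for equality of the mean
    spike times of two trials (a, b: arrays of spike times)."""
    se = sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    return erfc(abs(z) / sqrt(2))

def pairwise_latency_pvalues(trials):
    """Symmetric matrix of pairwise p-values; if all trials had equal
    latencies, roughly 5% of the off-diagonal entries would fall
    below 0.05 by chance alone."""
    n = len(trials)
    p = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            p[i, j] = p[j, i] = mean_time_pvalue(trials[i], trials[j])
    return p
```

Thresholding the matrix at 1% and 5% and counting the fraction of small p-values reproduces the informal calibration used in the text (about 5% expected under equal latencies, versus about 30% observed).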
Figure 6B shows a plot of the latency estimates of one neuron versus those of the other, which shows that the two sets of latencies covary, with correlation 0.69. To shift the spike trains, we used the latency estimates based on the combined spikes of the two neurons (see section 2.2) because they are less variable. Finally, Figure 6C shows the CC for these data, corrected for the correlation induced by modulations of the firing rates, along with 95% confidence bands. The leftmost panel is for the observed spike trains, from which one might conclude that there is synchronous activity at small lags. However, these effects disappear entirely once the CC is adjusted for correlations induced by the latencies. We also produced a CC based on the observed spike trains after removing the first few and last trials that appear to have different latencies. This CC is qualitatively similar to the CC adjusted for latencies, although the confidence bands are wider because fewer trials are used.

6 Conclusion

We have developed statistical procedures for testing and estimating latency effects in spike trains obtained from identical repeats of an experiment. The
main attraction of our methods is their simplicity. Moreover, they appear to be efficient and powerful, based on a large number of simulated spike trains. We also applied our methods to two real data sets and obtained reasonable results.

We proposed a formal statistical procedure to test for unequal latencies across trials; by "formal," we mean that we provide not only a diagnostic but also a p-value. For Poisson spike trains, this test is the usual analysis of variance (ANOVA) test for the equality of several means. For non-Poisson spike trains, we still use the ANOVA F-statistic, although we obtain its null distribution via a parametric bootstrap. We use an inhomogeneous Markov interval (IMI) model to fit non-Poisson spike trains when competing parametric models do not fit the data as well.

Our estimation method consists of finding the set of shifts that minimizes the spread of the resulting PSTH. We show that this minimization criterion is equivalent to finding the shifts such that the means of the shifted spike times are equal in all trials. Our estimates therefore require calculating only sample means, so they are simple and very fast to obtain, and they do not require any model assumptions. We applied our method successfully to Poisson and non-Poisson spike trains with various rates, including spike trains that contain simple multiplicative excitability effects. In situations where more complicated excitability effects are present, our estimates can be biased, so it may be preferable to use a change-point method as, for example, in Baker and Gerstein (2001). But our latency estimates can still be used as starting values in other latency estimation algorithms. Indeed, change-point methods are based on detecting the rate change from baseline, typically on each trial separately, so that the latency estimates thus obtained do not benefit from any information that may be contained in the other trials. Our method, in contrast, is based on the differences in firing rates from trial to trial; including this extra information in change-point estimation procedures is likely to improve them.
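For Poisson spike trains, the global test and the estimator summarized above reduce to a few lines of code. The sketch below is ours, not the article's implementation; trials are assumed to be numpy arrays of spike times restricted to the estimation window [T1, T2].

```python
import numpy as np

def latency_anova_F(trials):
    """One-way ANOVA F-statistic for equality of the trials' mean spike
    times; for Poisson spike trains this is the global latency test."""
    all_spikes = np.concatenate(trials)
    grand = all_spikes.mean()
    k, n = len(trials), len(all_spikes)
    between = sum(len(t) * (t.mean() - grand) ** 2 for t in trials) / (k - 1)
    within = sum(((t - t.mean()) ** 2).sum() for t in trials) / (n - k)
    return between / within

def latency_estimates(trials):
    """Shifts that equalize the mean shifted spike time across trials,
    recentered to have sample mean zero.  Subtracting each shift from
    its trial's spike times minimizes the spread of the pooled PSTH."""
    means = np.array([t.mean() for t in trials])
    return means - means.mean()
```

Note that only per-trial sample means are needed, which is why the estimates are fast and model-free; the non-Poisson version of the test keeps the same F-statistic but calibrates it with a parametric bootstrap.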
Figure 6: Facing page. (A) Mean versus variance plots and goodness-of-fit tests for Poisson and IMI models (Brown et al., 2002) for two simultaneously recorded neurons in the primary visual cortex of an anesthetized macaque monkey (Aronov et al., 2003). (B) Raster plots with estimation windows [T1, T2] (top), and after the trials are shifted (bottom). Matrix of p-values for all pairwise tests (see section 4). Estimates of latencies for neuron 1 plotted against those of neuron 2, with the (0,1) line (solid), the fitted linear regression (dotted line), and the sample Pearson correlation. (C) Cross-correlograms adjusted for firing-rate modulation (shuffle corrected) and 95% confidence bands for the observed and the shifted spike trains, as well as for the observed trials that appear to have constant latencies. We used bins of 1.3 msec. The central bin is not plotted because a recording artifact prevents testing of synchrony at lag 0.
Our latency estimates are obtained thus far in a completely nonparametric way. We do not assume a particular model for the firing rate or for the spike generation mechanism; spike trains do not have to be Poisson or gamma. But the absence of a specific model does not preclude assumptions; in particular, our procedure produces meaningful estimates under the basic assumption that the firing rates for all trials are proportional to one another except for a time shift. This invites two extensions. First, if the basic assumption is met, can we improve our procedure by explicitly making use of the common element, that is, λ, between all trials? Second, if the basic assumption is not appropriate—for example, if more complicated excitability effects exist—can we modify the procedure to allow for this? Our answer to both questions involves a more statistical approach than we have used here. A full treatment is beyond the scope of this article and is, in fact, the topic of a future article that treats the estimation, and adjustment, of latency and excitability effects (Cai et al., 2004).

Acknowledgments

This work was supported by grants N01-NS-2-2346 and 1R01 MH64537-01 from the National Institutes of Health.

References

Aronov, D., Reich, D. S., Mechler, F., & Victor, J. D. (2003). Neural coding of spatial phase in V1 of the macaque monkey. J. Neurophysiol., 89, 3304–3327.

Baker, S. N., & Gerstein, G. L. (2001). Determination of response latency and its application to normalization of cross-correlation measures. Neural Computation, 13, 1351–1377.

Barbieri, R., Quirk, M. C., Frank, L. M., Wilson, M. A., & Brown, E. N. (2001). Construction and analysis of non-Poisson stimulus-response models of neural spike train activity. J. Neurosci. Methods, 105, 25–37.

Brody, C. D. (1999a). Correlations without synchrony. Neural Computation, 11, 1537–1551.

Brody, C. D. (1999b). Disambiguating different covariation types. Neural Computation, 11, 1527–1535.

Brown, E. N., Barbieri, R., Ventura, V., Kass, R. E., & Frank, L. M. (2002). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14(2), 325–346.

Cai, C., Kass, R. E., & Ventura, V. (2004). Trial-to-trial variability and its effect on time-varying dependence between two neurons. Manuscript submitted for publication.

Cowling, A., Hall, P., & Phillips, M. J. (1996). Bootstrap confidence regions for the intensity of a Poisson point process. Journal of the American Statistical Association, 91, 1516–1524.

Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge: Cambridge University Press.
Everling, S., Dorris, M. C., Klein, R. M., & Munoz, D. P. (1999). Role of primate superior colliculus in preparation and execution of anti-saccades and pro-saccades. J. Neurosci., 19, 2740–2754.

Everling, S., Dorris, M. C., & Munoz, D. P. (1998). Reflex suppression in the anti-saccade task is dependent on prestimulus neural processes. J. Neurophysiol., 80, 1584–1589.

Everling, S., & Munoz, D. P. (2000). Neuronal correlates for preparatory set associated with pro-saccades and anti-saccades in the primate frontal eye field. J. Neurosci., 20, 387–400.

Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430.

Horwitz, G. D., & Newsome, W. T. (2001). Target selection for saccadic eye movements: Prelude activity in the superior colliculus during a direction-discrimination task. J. Neurophysiol., 86, 2543–2558.

Johnson, D. (1996). Point process models of single-neuron discharges. J. Comput. Neurosci., 3, 275–299.

Kass, R. E., & Ventura, V. (2001). A spike train probability model. Neural Computation, 13, 1713–1720.

Kass, R. E., Ventura, V., & Cai, C. (2003). Statistical smoothing of neuronal data. Network: Computation in Neural Systems (special issue on Information and Statistical Structure in Spike Trains), 14, 5–15.

Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J., 7, 419–440.

Pham, D. T., Möcks, J., Köhler, W., & Gasser, T. (1987). Variable latencies of noisy signals: Estimation and testing in brain potential data. Biometrika, 74(3), 525–533.

Reich, D. S., Victor, J. D., & Knight, B. W. (1998). The power ratio and the interval map: Spiking models and extracellular recordings. J. Neurosci., 18(23), 10090–10104.

Roesch, M. R., & Olson, C. R. (2003). Impact of expected reward on neuronal activity in prefrontal cortex, frontal and supplementary eye fields and premotor cortex. J. Neurophysiol., 90, 1766–1789.

Ventura, V., Carta, R., Kass, R. E., Olson, C. R., & Gettner, S. N. (2001). Statistical analysis of temporal evolution in single-neuron firing rates. Biostatistics, 1(3), 1–20.

Woody, C. D. (1967). Characterization of an adaptive filter for the analysis of variable latency neuroelectric signals. Med. Biol. Engng., 5, 539–553.

Received August 20, 2003; accepted April 8, 2004.
LETTER
Communicated by Charles J. Wilson
Two-State Membrane Potential Fluctuations Driven by Weak Pairwise Correlations

Andrea Benucci
[email protected]

Paul F.M.J. Verschure
[email protected]

Peter König
[email protected]

Institute of Neuroinformatics, University and ETH Zürich, 8057 Zürich, Switzerland
Physiological experiments demonstrate the existence of weak pairwise correlations of neuronal activity in mammalian cortex (Singer, 1993). The functional implications of this correlated activity are hotly debated (Roskies et al., 1999). Nevertheless, it is generally considered a widespread feature of cortical dynamics. In recent years, another line of research has attracted great interest: the observation of a bimodal distribution of the membrane potential defining up states and down states at the single-cell level (Wilson & Kawaguchi, 1996; Steriade, Contreras, & Amzica, 1994; Contreras & Steriade, 1995; Steriade, 2001). Here we use a theoretical approach to demonstrate that the latter phenomenon is a natural consequence of the former. In particular, we show that weak pairwise correlations of the inputs to a compartmental model of a layer V pyramidal cell can induce bimodality in its membrane potential. We show how this relationship can account for the observed increase of power in the γ frequency band during up states, as well as the increases in the standard deviation and in the fraction of time spent in the depolarized state (Anderson, Lampl, Reichova, Carandini, & Ferster, 2000). To quantify the relationship between the correlation properties of a cortical network and the bistable dynamics of single neurons, we introduce a number of new indices. Subsequently, we demonstrate that quantitative agreement with the experimental data can be achieved by introducing voltage-dependent mechanisms into our neuronal model, such as Ca2+ channels and Ca2+-dependent K+ channels. In addition, we show that the up states and down states of the membrane potential depend on the dendritic morphology of cortical neurons. Furthermore, bringing together network and single-cell dynamics under a unified view allows the direct transfer of results obtained in one context to the other and suggests a new experimental paradigm: the use of specific intracellular analyses as a powerful tool to reveal the properties of the correlation structure present in the network dynamics.

© 2004 Massachusetts Institute of Technology

Neural Computation 16, 2351–2378 (2004)
1 Introduction

In this study, we address the relationship between two experimentally observed phenomena. At the network level, correlated spiking activity between ensembles of neurons has been described in recent years. At the cellular level, the observation that the membrane potential dynamics of single neurons can show distinct up states and down states has received a lot of attention.

Regarding the first phenomenon, multielectrode recordings in cat visual cortex have demonstrated that pairs of neurons sharing similar orientation tuning properties tend to have synchronized spiking activity (Singer, 1993). This finding has been confirmed in different species (Bair, 1999) and different cortical areas (Salinas & Sejnowski, 2001). The synchronization pattern depends on the properties of the stimulus. For example, when coherently moving gratings or randomly moving dots are used as visual stimuli, they elicit cortical activity that displays pairwise correlations of different degrees (Gray, Engel, König, & Singer, 1990; Usrey & Reid, 1999). Moreover, cross-correlation analysis shows that rather than being precisely synchronized, optimally driven neurons lead over suboptimally driven ones (König, Engel, & Singer, 1995a; see Figure 1B). This suggests that under realistic conditions, cortical dynamics is highly structured in the temporal domain (Roskies et al., 1999). The impact of a large number of randomly timed or synchronized inputs on the subthreshold dynamics of single neurons has been studied in simulations (Salinas & Sejnowski, 2000; Bernander, Koch, & Usher, 1994; Destexhe & Paré, 1999; Shadlen & Newsome, 1998; Softky & Koch, 1993).
However, taking into account our current knowledge of the correlation structure of cortical activity, we have little insight into the cellular dynamics under realistic conditions (Singer, 1993; Lampl, Reichova, & Ferster, 1999; Douglas, Martin, & Whitteridge, 1991; Stern, Kincaid, & Wilson, 1997; Agmon-Snir & Segev, 1993; Mel, 1993; Destexhe & Paré, 2000; Singer & Gray, 1995).

Regarding the second phenomenon, a number of intracellular studies have shown that the membrane potential of neurons does not take on any value between rest and threshold with equal probability; rather, it assumes either a depolarized state, associated with spiking activity, or a resting state in which the cell is silent (see Figures 2A and 2B, right panels). This behavior has been observed in different animals and brain structures (Anderson, Lampl, Reichova, Carandini, & Ferster, 2000; Stern, Jaeger, & Wilson, 1998; Steriade, Timofeev, & Grenier, 2001). This bistability of the membrane potential is referred to as up states and down states, and its biophysical properties have been characterized (Wilson & Kawaguchi, 1996; Lampl et al., 1999; Douglas et al., 1991; Stern et al., 1997; Lewis & O'Donnell, 2000; Wilson & Groves, 1981; Kasanetz, Riquelme, & Murer, 2002) (see Figures 2A and 2B, right panels). The origin of up states and down states has been related to presynaptic events (Wilson & Kawaguchi, 1996); however, the underlying mechanisms have not yet been identified.
High-Order Correlations in Neural Activity
Figure 1: Model neuron and the temporal structure of its input. (A) Reconstructed layer 5 pyramidal cell (left) with a schematic of five input spike trains sharing weak pairwise correlations (right). For every pair of spike trains, there are moments in time when the probability of firing together is high and pairs of synchronized spikes occur. Due to the high level of input convergence, higher-order events emerge statistically and are shown as triplets in the example (Benucci et al., 2003). (B) Cross-correlation analysis of paired extracellular recordings from cat area 17 obtained while using moving bars as visual stimuli (data taken from König et al., 1995a). The peak size is proportional to the strength of the correlation. The shift of the peak indicates a phase lag of the firing of one neuron relative to the other. In general, optimally driven cells tend to lead over suboptimally driven ones. Solid line: Gabor fit for parameter estimation. (C) Cross-correlogram of the synthetic data. Solid line: Gabor fit of the cross-correlogram.
Here, we take a theoretical approach to examine the dynamics of the membrane potential in single neurons given a physiologically constrained representation of the temporal properties of its afferent signals.

2 Materials and Methods

In the following, we describe a detailed model of a cortical neuron and a procedure for producing synthetic input spike trains with physiologically realistic first-order (i.e., firing rates) and second-order (i.e., pairwise correlations) statistical properties. Subsequently, our methods of analysis are introduced.
2.1 The Cellular Model. A morphologically detailed layer 5 pyramidal cell (Bernander, Koch, & Douglas, 1994; Bernander, Koch, & Usher, 1994; Destexhe & Paré, 2000) is simulated using the NEURON simulation environment (Hines & Carnevale, 2001) (see Figure 1A). (The original anatomical data were kindly made available by J. C. Anderson and K. A. C. Martin.) We deliberately keep the cell as simple as possible to avoid the introduction of strong a priori hypotheses. Nevertheless, the resulting simulated neuron is of considerable complexity. The parameters for the passive properties, the HH-like channels in the soma, and the synapses (4000 AMPA, 500 GABAa, 500 GABAb) are selected according to the available literature and previous computational studies (Bernander, Koch, & Douglas, 1994; Agmon-Snir & Segev, 1993; Bernander, Koch, & Usher, 1994; Mel, 1993; Destexhe & Paré, 2000). The parameter sets used are reported in Tables 1 through 3 (see the appendix). The files used to implement voltage-dependent mechanisms, as indicated in the tables, are freely available online from the NEURON web page (http://www.neuron.yale.edu). For the cell as a whole, we tune the parameters to obtain a consistent input-output firing-rate transfer function.

Figure 2: Facing page. Intracellular somatic recordings. (A) The membrane potential of the model neuron is shown during optimal stimulation (left panel). Spiking episodes (burst-like behavior) are associated with the depolarized states of the membrane potential (up states). Following spiking activity, there are periods in which the simulated cell is in a more hyperpolarized state and silent (down states). The pairwise correlation strength of the input spike trains has been set to 0.1 and the mean input firing rate to 85 Hz. The alternation between up states and down states has also been found in several experimental studies using intracellular recordings (right panel). (B) Histogram of the distribution of the membrane potential recorded under the conditions shown in A, after removal of the action potentials. The membrane potential does not take on any value between the up states and down states with equal probability; the histogram shows two peaks (left panel). Similar biphasic behavior of the membrane potential has also been observed experimentally (right panel). (C) If the cell in the simulation is stimulated solely by uncorrelated background activity (6 Hz), the bimodality disappears (left panel). The corresponding trace of the membrane potential is shown in the inset. Similar shapes of the membrane potential histograms under comparable stimulation conditions have also been observed experimentally (right panel). Note that when not using visual stimulation, Anderson et al. (2000) found bimodal as well as nonbimodal distributions of the membrane potential; a nonbimodal example is shown here (see also section 4). (D) The duration of the up states has been increased 10 times with respect to Figure 2A due to the introduction of Ca2+ and Ca2+-dependent K+ channels in the modeled L5 pyramidal cell. (E) Relationship between the firing rate in the up states and the mean membrane potential with Ca2+ and Ca2+-dependent K+ currents. The input pairwise correlation strength is 0.1, and the mean input firing rate is 30 Hz.
Table 1: Simulation Parameters for the Active Mechanisms.

Current  Ion   Place      Erev (mV)  Ḡ (mS/cm2)  p  q  τact (ms)  u1/2 (mV)    k          File
INa      Na+   Soma       50         500         2  1  0.05       −47/−42      3/−3       iaspec.mod
INa      Na+   Dendrites  50         10          2  1  0.05       −32/−37      3/−3       iap.mod
IL       Ca2+  Apical     115        5.1         2  0  2.12       −35          4          ical.mod
IK       K+    Soma       −77        50          2  0  1.5        −47          3          iapspec.mod
IC       K+    Apical     −77        30          2  0  200        2.17         —          ic.mod
IA       K+    Soma       −66        10          1  1  2          −24.5/−72.3  16.9/−5.9  ianew.mod
IA       K+    Dendrites  −66        10          1  1  2          −22.9/−83.1  16.2/−6.5  iA.mod
IM       K+    Dendrites  −77        30          1  0  2.21–2.24  —            —          km.mod
Ih       K+    Apical     −50        (by layer)  2  0  2.26       −68.9        −6.5       IH.mod

Note: u1/2 and k are given as active/inactive values for currents with an inactivation gate (q = 1). For Ih, Ḡ varies by layer: L I: 32, L II: 22, L III: 7, L IV: 2.5.
Table 2: Synaptic Parameters.

Type    ḡ (nS)  er (mV)  τ1 (ms)  τ2 (ms)  τpeak (ms)
GABAa   1       −70      4.5      7        5.6
GABAb   0.1     −95      35       40       37.4
AMPA    20      0        0.3      0.4      0.35
NMDA    26.9    0        4        120      14

Table 3: Passive Properties.

Erest   −65 mV
Cm      1 µF/cm2
Gm      0.07 mS/cm2
Ri      90 Ω·cm
T       36°C
2.2 Input Stage, First-Order Statistics. The input stage reproduces measured anatomical and physiological conditions (Douglas & Martin, 1990). Layer 5 neurons in cat visual cortex do not receive significant input from the lateral geniculate nucleus; the overwhelming majority of their inputs are cortico-cortical afferents (Gabbott, Martin, & Whitteridge, 1987). The distribution of orientation preferences of the afferent excitatory signals to our model neuron is graded and matches the experimentally observed specificity of intra-areal cortico-cortical connections (Kisvarday, Toth, Rausch, & Eysel, 1997; Salin & Bullier, 1995): 57% of the inputs originate from neurons with similar (±30 degrees) orientation preference, and 30% and 13% of the afferents originate from neurons whose preferred orientation differs by between 30 and 60 degrees and between 60 and 90 degrees, respectively, from that of the neuron under consideration. This connectivity automatically provides the cell with a feedforward mechanism for its orientation tuning. Whether this mechanism correctly describes the origin of orientation-selective responses in primary visual cortex is still unresolved (Sompolinsky & Shapley, 1997) and not within the scope of this study.

To take these physiological conditions into account, we separate the 5000 afferents of the simulated neuron into five groups according to the localization of the respective synapse in layers 1 through 5. This gives us the flexibility to make selective changes in the correlation properties of inputs targeting different cortical layers. Each of the five groups is subdivided into 36 subpopulations of afferents with similar orientation preference, resulting in a discretization of ±5 degrees. This mimics input spike trains from cells whose tuning properties differ from those of the target neuron. The sizes of the subpopulations range linearly from 25 to 100 cells, for orthogonally tuned to identically tuned inputs, respectively.
The firing rates of the inputs are determined with a "typical" visual stimulus in mind (i.e., an oriented grating). The maximum input firing rate of the optimally stimulated neurons is set to 85 Hz. Nonoptimally stimulated cells are assumed to have an orientation tuning of ±30 degrees half-width at half-height, and the firing rate is reduced accordingly. In addition, background activity is simulated with uniform, uncorrelated 6 Hz activity. Inhibitory inputs are implemented with a broader orientation tuning than excitatory inputs.
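The input population just described can be sketched programmatically. The code below is an illustrative reconstruction under our own assumptions (a circular gaussian tuning profile with ±30-degree half-width at half-height, a linear ramp of subpopulation sizes, and a 6 Hz background floor); the function name and parameterization are ours, not the authors'.

```python
import numpy as np

def afferent_rates(stim_ori=0.0, max_rate=85.0, background=6.0, hwhh=30.0):
    """Sketch of the tuned afferent population: 36 subpopulations whose
    preferred orientations tile 180 degrees in 5-degree steps, with sizes
    growing linearly from 25 cells (orthogonal preference) to 100 cells
    (identical preference).  Rates fall off as a circular gaussian with
    the given half-width at half-height, floored at the background rate."""
    prefs = np.arange(-90.0, 90.0, 5.0)               # 36 preferred orientations
    # circular orientation distance to the stimulus, in [0, 90] degrees
    mismatch = np.abs((prefs - stim_ori + 90.0) % 180.0 - 90.0)
    sizes = np.rint(25 + (100 - 25) * (1 - mismatch / 90.0)).astype(int)
    sigma = hwhh / np.sqrt(2 * np.log(2))             # HWHH -> gaussian sigma
    rates = np.maximum(max_rate * np.exp(-0.5 * (mismatch / sigma) ** 2),
                       background)
    return prefs, sizes, rates
```

With these assumptions, a subpopulation 30 degrees away from the stimulus fires at exactly half the maximum rate, and orthogonally tuned subpopulations sit at the 6 Hz background level.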
High-Order Correlations in Neural Activity

Nevertheless, neurons spike only during a subset of these epochs (randomly chosen), and for any two given cells there is always a nonzero overlap between such subsets. The total number of possible overlaps depends, in a combinatorial way, on the size of the subsets, which is a controlled parameter of the simulation. It is the number and extent of these overlaps that determine the pairwise correlation strength and the different degrees of temporal alignment of the spikes. The epochs are time intervals centered on specific points in time, whose duration and distribution are controlled parameters of the algorithm. We used a Poisson process to distribute these points in time, and their frequency was fixed at 75 Hz. The duration of the time epochs was 10 ms. The algorithm is very flexible; choosing a distribution other than a Poisson one (exponential or gamma of any order) would allow the creation of correlations with or without overall oscillatory activity (Gray & Viana di Prisco, 1997; Engel et al., 1990). The duration of the time epochs determines the precision of the correlations, and it can be changed to affect the overlaps between the epochs within the same spike train. The frequency of the time epochs is itself a free parameter. Spikes are assigned within the epochs according to a distribution whose skewness and integral are parametrically controlled. The skewness controls the shape of the peaks in the cross-correlograms, with the possibility of creating very "sharp" or "broad" correlation peaks without changing the correlation strength itself. In all the simulations, we used a gaussian distribution. The integral of the distribution controls the correlation strength; thus, the absolute fraction of spikes assigned to the epochs changed depending on the desired pairwise correlation strength. The remaining spikes are randomly distributed according to a Poisson process with constraints related to refractoriness (2 ms in the simulation). This ensures that the coefficient of variation (CV) of each spike train ranges around one (Dean, 1981). A key point of the simulation is that by varying the frequency and the duration of the time epochs, it is possible to create pairwise correlations in the population with different degrees of high-order correlation events (see the following section for more details). The temporal dynamics of the input stage thus reproduce in a controlled fashion the correlation strengths and time lags that have been observed experimentally. In the following, the temporal precision of the correlations will be held constant, with the width of the peaks in the cross-correlograms always on the order of 10 ms. Furthermore, the time lags of the correlations are determined by a parameter that is kept fixed in all simulations (König, Engel, & Singer, 1995b). We vary only the correlation strength, that is, the height of the peaks in the cross-correlogram. We will refer to these dynamical features as the correlation structure of the inputs (König, Engel, & Singer, 1995a; Roskies et al., 1999) (see Figures 1B and 1C). In our model, the inhibitory inputs follow the same correlation structure as the excitatory inputs. All simulations are analyzed in epochs of 10 s duration.
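The epoch-based generation scheme described above can be turned into a short sketch. This is a simplified reconstruction, not the authors' MATLAB code: the 75 Hz epoch rate, 10 ms epoch width, and 85 Hz mean rate follow the text, while the function name, the subset and correlated-spike fractions, and the gaussian assignment of spikes to epochs are illustrative assumptions (refractoriness is omitted).

```python
import numpy as np

def generate_correlated_trains(n_cells=50, T=10.0, rate=85.0,
                               epoch_rate=75.0, epoch_width=0.010,
                               subset_frac=0.5, corr_frac=0.3, seed=0):
    """Epoch-based surrogate spike trains with weak pairwise correlations.

    Epoch centers are drawn from a Poisson process shared by the whole
    population; each cell spikes preferentially inside a random subset of
    the epochs (a fraction corr_frac of its spikes), the rest being
    uncorrelated Poisson background.
    """
    rng = np.random.default_rng(seed)
    # Shared epoch centers: Poisson process at epoch_rate over [0, T].
    n_epochs = rng.poisson(epoch_rate * T)
    centers = np.sort(rng.uniform(0.0, T, n_epochs))
    trains = []
    for _ in range(n_cells):
        n_spikes = rng.poisson(rate * T)
        n_corr = int(corr_frac * n_spikes)  # spikes tied to epochs
        # Each cell uses only a random subset of the shared epochs.
        my_epochs = rng.choice(centers, size=max(1, int(subset_frac * n_epochs)),
                               replace=False)
        # Correlated spikes: gaussian-distributed around epoch centers
        # (epoch_width sets the temporal precision of the correlations).
        corr = rng.choice(my_epochs, size=n_corr) \
            + rng.normal(0.0, epoch_width / 2.0, n_corr)
        # Background spikes: homogeneous Poisson over the whole window.
        bg = rng.uniform(0.0, T, n_spikes - n_corr)
        trains.append(np.sort(np.clip(np.concatenate([corr, bg]), 0.0, T)))
    return trains

trains = generate_correlated_trains()
```

Overlapping epoch subsets between any two cells are what produce the pairwise correlations; increasing `corr_frac` plays the role of increasing the integral of the within-epoch distribution.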
For the quantitative evaluation of our data, we use a five-parameter Gabor fit of the cross-correlograms (König, 1994). We calculate the pairwise correlation strength as the area of the peak of the Gabor function divided by the area of the rectangular portion of the graph under the peak, delimited by the offset.
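The correlation-strength measure (peak area over the baseline area under the peak) can be estimated from a binned cross-correlogram. The sketch below replaces the five-parameter Gabor fit of König (1994) with a simple mean-baseline estimate, so it illustrates the measure rather than reproducing the published analysis; all names and parameter values are our own.

```python
import numpy as np

def cross_correlogram(t1, t2, max_lag=0.1, bin_width=0.002):
    """Histogram of spike-time differences t2 - t1 within +/- max_lag."""
    diffs = t2[None, :] - t1[:, None]
    diffs = diffs[np.abs(diffs) <= max_lag]
    edges = np.arange(-max_lag, max_lag + bin_width, bin_width)
    counts, _ = np.histogram(diffs, bins=edges)
    lags = 0.5 * (edges[:-1] + edges[1:])
    return lags, counts

def correlation_strength(lags, counts, peak_halfwidth=0.01):
    """Area of the central peak divided by the baseline area under it.

    The baseline ("offset") is taken here as the mean count outside the
    peak; the published analysis obtains it from a Gabor fit instead.
    """
    in_peak = np.abs(lags) <= peak_halfwidth
    baseline = counts[~in_peak].mean()
    peak_area = (counts[in_peak] - baseline).sum()
    base_area = baseline * in_peak.sum()
    return peak_area / base_area

# Usage: two trains sharing epoch centers vs. an independent train.
rng = np.random.default_rng(1)
centers = rng.uniform(0.0, 10.0, 750)
a = np.sort(rng.choice(centers, 400) + rng.normal(0.0, 0.003, 400))
b = np.sort(rng.choice(centers, 400) + rng.normal(0.0, 0.003, 400))
c = np.sort(rng.uniform(0.0, 10.0, 400))
s_corr = correlation_strength(*cross_correlogram(a, b))
s_ind = correlation_strength(*cross_correlogram(a, c))
```

Trains built from shared epochs yield a clear central peak and hence a positive strength, while independent trains give a value near zero.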
2.4 Higher-Order Correlations. Neuronal interactions in the cortex are measured solely in terms of correlations of the activity of pairs of neurons. Hence, the algorithm used here synthesizes spike trains with a specified pairwise correlation that matches the correlations observed in the cortex (see Figures 1B and 1C). Nevertheless, due to the high level of convergence observed in the cortex, higher-order statistical events must and do appear. This holds even for the low values of the experimentally observed pairwise correlations (Benucci, Verschure, & König, 2003; Destexhe & Paré, 2000; Bothé, Spekreijse, & Roelfsema, 2000; Grün, Diesmann, & Aertsen, 2001). Accordingly, the above-described algorithm not only generates weak pairwise correlations between the input spike trains but also creates high-order correlations in the presynaptic population of neurons. Higher-order events indicate episodes of the presynaptic dynamics characterized by spiking activity from a large fraction of afferent neurons, occurring together in a small time window on the order of a few milliseconds. How these nearly synchronized events naturally emerge from the pairwise correlation constraint has been formally investigated (Benucci et al., 2003; Bothé et al., 2000). It can be intuitively understood by considering that the number of pairwise coincidences rises quadratically with the number of afferents, whereas the number of spikes available to generate such coincidences rises only linearly. It follows that spikes have to be "used" multiple times to generate coincidences; that is, higher-order correlations must appear. With an increasing number of inputs, the effect is amplified. This is important when considering realistic values for the number of afferent inputs to a given cortical cell, typically on the order of 10⁴. At this level of convergence, the statistical effect explained above becomes dominant, and higher-order events are a prominent feature of the dynamics (Mazurek & Shadlen, 2002).

2.5 Analysis of Intracellular Dynamics. The intracellular dynamics is evaluated using different measures in the time and frequency domains. The up states and down states of the subthreshold membrane potential are determined as in Anderson et al. (2000) using two thresholds. A sliding window of 50 ms is used to find segments in which the membrane potential is above the up threshold for more than 60% of the time. We use a similar procedure for identifying the down states (i.e., the membrane potential is below a down threshold). For the cumulative probability distribution of the membrane potential, any section where the membrane potential exceeds the up threshold is included in the analysis.
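The two-threshold, sliding-window state detection summarized above (after Anderson et al., 2000) can be sketched as follows. The 50 ms window and the 60% occupancy criterion follow the text; the threshold voltages, the sampling step, and the function name are illustrative assumptions.

```python
import numpy as np

def detect_states(vm, dt=0.001, up_thresh=-55.0, down_thresh=-70.0,
                  win=0.050, occupancy=0.6):
    """Label each sample as up (1), down (-1), or unclassified (0).

    A sliding window of length `win` marks an up (down) state wherever
    the membrane potential exceeds up_thresh (falls below down_thresh)
    for more than `occupancy` of the window.
    """
    n_win = int(win / dt)
    above = (vm > up_thresh).astype(float)
    below = (vm < down_thresh).astype(float)
    kernel = np.ones(n_win) / n_win
    # Fraction of each sliding window spent above/below threshold.
    frac_up = np.convolve(above, kernel, mode="same")
    frac_down = np.convolve(below, kernel, mode="same")
    states = np.zeros(len(vm), dtype=int)
    states[frac_up > occupancy] = 1
    states[frac_down > occupancy] = -1
    return states

# Usage: a synthetic trace with one 0.5 s depolarized episode.
vm = np.full(2000, -75.0)
vm[500:1000] = -50.0
states = detect_states(vm)
```

Samples near a transition satisfy neither criterion and remain unclassified, which mirrors the exclusion of transitional segments from the state statistics.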
To eliminate the spikes and check for the inverse relationship between the spike threshold and the slope of the rising phase of the action potential, we use the same procedure as in Azouz and Gray (2000). Furthermore, we compute a number of indices characterizing the relation of the subthreshold dynamics to observable measures. These indices are defined in section 3.

2.6 Control Simulations. To verify the scope and robustness of the results, we ran several simulations varying the values of the parameters. The rationale is that if a small change in the value of a given parameter causes a dramatic change in the results, then the mechanisms the parameter refers to should be considered critical for the emergence of the phenomenon studied. But if the changes are minor and smoothly affect the results, the associated mechanism is considered relevant for the modulation and quantitative features of the phenomenon, but not for the emergence of the effect per se. We changed the width of the tuning curves of the inhibitory and excitatory inputs by ±10% without noticing significant changes in the degree of bimodality. We also checked for the robustness of the results with respect to changes in the time constants and peak conductance amplitudes of AMPA and GABA synapses. In the extreme cases of long time constants (NMDA synapses) and large (twice the mean) conductance amplitudes of AMPA synapses (see the tables for mean values), the cell was driven to an epileptic-like state; in the case of GABA, the cell was totally silenced. In between these extremes, the cell showed smooth changes in the degree of bistable dynamics. We used exactly the same principles when performing controls with voltage-dependent mechanisms, such as calcium and sodium voltage-dependent currents, muscarinic channels, or anomalous rectifying currents (see the tables for complete listings of the mechanisms used). We varied the values of the parameters around their means (as reported in the tables) and noticed only smooth modulatory effects. These controls confirm the generality of the phenomena reported below. Finally, for controls regarding the neuronal morphology, we used two strategies: we took a different cell model, and we reduced the whole cell, or part of the dendritic tree, to single equivalent compartments while keeping the electrotonic properties unaffected (Destexhe, 2001). The second cell model (a spiny stellate neuron) was kindly made available by Y. Banitt and I. Segev. It includes active mechanisms such as Ca2+ dynamics, fast repolarizing K+ currents, Ca2+-dependent K+ currents, spike frequency adaptation, and Na+ channels (parameters as in Banitt & Segev, personal communication, 2003). The input spike trains are generated with the MATLAB software package.

3 Results

3.1 Emergence of Up and Down States. When we provide the model neuron with correlated inputs as explained above, the membrane potential at the soma is characterized by up and down states (see Figures 2A and 2B, left panels). However, when the cell is stimulated with uncorrelated background activity, this bistability of the dynamics disappears (see Figure 2C, left panel).
Indeed, in physiological recordings in primary visual cortex using stimulation paradigms that are not associated with correlation structures (awake cats during spontaneous activity), up and down states are not observed (Anderson et al., 2000; Steriade et al., 2001; see Figure 2C). When comparing the results of the simulation to recently published data (Azouz & Gray, 2003), the similarities are obvious. Compared to other experimental studies, however, a remarkable difference in timescales is apparent (see Figure 2A). In the purely passive model, the duration of the up states is typically around 30 ms, which is at least a factor of 10 lower than the experimental value. As discussed in Wilson and Kawaguchi (1996), the duration of the up and down states is mainly determined by the kinetic properties of Ca2+ and Ca2+-dependent K+ currents. Including these active mechanisms in our model shows that this also holds true in the simulation (see Figure 2D). In comparison to the passive model, the duration of up states increases by a factor of 10, reaching 300 ms and resulting in an improved match to the experimental results.

3.2 Quantification of Intracellular Dynamics. To analyze the relationship between network and single-cell dynamics in a more quantitative way, we compare a number of characteristic measures of the simulated neuron to known physiological results (Anderson et al., 2000). First, as reported in the literature, we find a significant correlation between the membrane potential in the up state and the spiking frequency (see Figures 2E and 3A). In Figure 2E, the calcium dynamics exerts its modulating effect by compressing the dynamic range of the up-state firing frequency: the range is reduced from 10 to 220 Hz in the passive case (see Figure 3A) to 5 to 50 Hz when calcium is introduced (see Figure 2E). Second, we find an increase of the standard deviation of the membrane potential in the up state that matches that observed in the visual cortex (see Figure 3B). This is a large (more than twofold) and highly significant effect. The significance holds also in the comparison between the variability of the membrane potential during a simulated visual stimulation and during spontaneous activity (see the Figure 3B inset and right panel). Finally, we observe an increase of the power in the 20 to 50 Hz frequency band in the up state versus the down state (see Figure 3C). This is a smaller effect, but it is still statistically significant and is comparable to reported experimental findings (see Figure 3C, right panel). These three results match experimental findings considered to be central characteristics of up- and down-state dynamics (see Figures 3A to 3C, right side). Our results show that these characteristics emerge naturally in a detailed model of a cortical neuron when exposed to realistically structured input.

Figure 3: Facing page. Characteristic features of up states and down states. (A) The firing rate in the up state (vertical axis) is shown as a function of the membrane potential in the up state (horizontal axis) for both the simulation of a passive model (left) and the experimental data (right). In both cases, the average membrane potential in the up state is correlated with the firing rate of the cell in the up state. The difference in scale between the left and right panels depends on the mean input firing rate chosen for this specific data set. The input correlation strength is 0.1, and the mean firing rate of the input is 85 Hz. See Figures 2D and 2E for a comparison of the active model with the experimental data. (B) The standard deviation (STD) of the membrane potential of the simulated neuron is increased in the up state as compared to the down state (left). The inset shows the subthreshold dynamics during stimulation and spontaneous background activity, respectively. The corresponding experimental data are shown to the right. In both cases, stimulation increases the variance of the subthreshold membrane potential. (C) The plot of the power spectra in the 20–50 Hz frequency band of the membrane potential for the optimal stimulation condition (left panel) shows an increase in the up state as compared to the down state. The same phenomenon is visible in the plot of the experimental data (right panel). Because no information is available on the normalization used for the power spectra in Anderson et al. (2000), we use arbitrary units, and a comparison of the absolute scale in the two panels is not possible. The vertical and horizontal dotted lines indicate the median for each corresponding axis; the light gray circle is the center of gravity of the distribution, for easier comparison with the result of the simulation shown in the left panel. The input pairwise correlation strength is 0.1 in Figures 3A and 3C and 0.2 in Figure 3B, while the mean input rate is always 85 Hz.
2364
A. Benucci, P. Verschure, and P. Konig ¨
As a next step, we investigate measures of intracellular dynamics and relate them to the properties of the network's activity. We define a set of indices to capture different aspects of the subthreshold activity. The first of these quantifies the strength of the correlations in the network activity and the bimodality of the membrane potential histogram. This dimensionless index, S, accounts for the dependence of the bimodality on the input correlation strength: S = (Vmax - Vmin)/|Vmax|, where Vmax and Vmin are the locations of the two peaks in the membrane potential histograms (see Figures 2A and 2B). Increasing the correlation strength from its typically reported value of about 0.1 as used above (König et al., 1995a; Salinas & Sejnowski, 2000; Koch, Bernander, & Douglas, 1995) leads to an enhancement of the bimodality of the membrane potential histogram, resulting in a sharpening of the two peaks (see Figure 4A). This index is a monotonically increasing function of the correlation strength (see Figure 4B). The next index represents the fraction of the total time spent in the up state (TUS). For each simulation, the total time the membrane potential spends above the up threshold is divided by the total simulation time. This index is related to the integral of the peaks in the histograms, as explained in Anderson et al. (2000). The results indicate that this measure is strongly dependent on the input correlation strength: the TUS index is measured to be 7%, 13%, and 42% for correlation strengths of 0.01, 0.1, and 0.2, respectively.

Figure 4: Facing page. Sensitivity of the membrane potential to the input correlation strength. (A) Histograms of the distribution of the membrane potential for three different conditions. The input correlation strength is increased from left to right: 0.01, 0.1, and 0.2, respectively. Bimodality emerges, and the hyperpolarized peak moves further away with increasing correlation strength.
(B) The "S" index (see section 2), which quantifies the increasing separation of the peaks in the membrane potential histograms, is shown for five choices of correlation strength. Note that the data points are connected by lines simply to improve visualization; we have no a priori hypotheses about specific functional relationships. (C) Cumulative probability distribution of the time intervals during which the membrane potential dwells above the up threshold, shown for six trials at each of two levels of correlation strength. (D) For each action potential of the simulated neuron, the maximum slope of the rising phase of the membrane potential and the threshold potential for spike generation are shown in a scatter plot. (E) The identical measure used in an experimental study demonstrates an inverse relationship as well. (F) The slope of the corresponding linear fit (solid line in Figure 4D) provides an index that is related to the input correlation strength. (G) To perform a reverse correlation analysis, for every transition from a down state to an up state, a time window was centered at the corresponding point of the population PSTH at the input stage to identify presynaptic events associated with the transition. The plot shows the average population activity in the temporal vicinity of a down-to-up transition.
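The S and TUS indices defined above admit a compact computation. In this sketch, the peaks of the membrane potential histogram are located as the histogram maxima on either side of a midpoint voltage, which is our own simplification; the formula S = (Vmax - Vmin)/|Vmax| and the definition of TUS follow the text, while the threshold and midpoint values are illustrative.

```python
import numpy as np

def s_index(vm, midpoint=-60.0, bins=100):
    """S = (Vmax - Vmin)/|Vmax|, where Vmax and Vmin locate the
    depolarized and hyperpolarized peaks of the Vm histogram."""
    counts, edges = np.histogram(vm, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    hi = centers > midpoint  # depolarized side of the histogram
    v_max = centers[hi][np.argmax(counts[hi])]
    v_min = centers[~hi][np.argmax(counts[~hi])]
    return (v_max - v_min) / abs(v_max)

def tus_index(vm, up_thresh=-55.0):
    """Fraction of the total time spent above the up threshold."""
    return float(np.mean(vm > up_thresh))

# Usage: a synthetic bimodal trace, peaks near -75 mV and -50 mV.
rng = np.random.default_rng(0)
vm = np.concatenate([rng.normal(-75.0, 2.0, 6000),
                     rng.normal(-50.0, 2.0, 4000)])
s = s_index(vm)
tus = tus_index(vm)
```

For this synthetic trace, S is close to (-50 - (-75))/50 = 0.5, and TUS is close to the fraction of samples drawn from the depolarized mode; both grow as the two peaks separate, consistent with their monotonic dependence on correlation strength.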
The third measure relates the correlation strength of the inputs to the cumulative probability (CUP) distribution of the up state (Anderson et al., 2000). The cumulative probability distributions of the time intervals during which the membrane potential is above the up threshold are computed for different values of the input correlation strength. Though not independent of the TUS index, this measure refines it by controlling for possible artifacts due to the spike-cutting procedure (see Figure 4C). The CUP measure can easily separate the diverse input correlation strengths, that is, a pairwise correlation strength of 0.01 (black lines) from one of 0.2 (light gray lines). Finally, the fourth index measures the slope at threshold (SAT) and thus characterizes the relationship between the input correlation structure and the dependence of the voltage threshold for spike generation on the maximum slope of the rising phase of the action potential (see Figures 4D and 4E). Indeed, Azouz and Gray (2000) have observed an inverse relationship between the maximum slope of the rising phase of an action potential and the voltage threshold for the generation of the spike (see Figure 4E). A similar inverse relationship also appears in the simulated neuron (see Figure 4D). More important, we find that the slope of the fit quantifying this relationship provides a measure of the input correlation strength (see Figure 4F): the index tends to decrease with increasing correlation strength. These analyses of the simulation data suggest that these four indices, all based on the membrane potential measurable within a single neuron, can be used to extract the correlation strength of the network activity from the intracellular dynamics.
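The CUP and SAT indices can be sketched as follows. Up-state durations are taken as contiguous runs above the up threshold, their empirical cumulative distribution gives the CUP curves, and the SAT index is the slope of a linear fit of spike threshold against maximum dV/dt. The per-spike extraction of thresholds and maximum slopes (which follows Azouz & Gray, 2000, in the published analysis) is assumed given here.

```python
import numpy as np

def up_durations(vm, dt=0.001, up_thresh=-55.0):
    """Durations (s) of contiguous intervals with Vm above up_thresh."""
    above = np.concatenate(([0], (vm > up_thresh).astype(int), [0]))
    edges = np.diff(above)
    starts = np.where(edges == 1)[0]
    ends = np.where(edges == -1)[0]
    return (ends - starts) * dt

def cup(durations):
    """Empirical cumulative distribution of up-state durations."""
    d = np.sort(durations)
    return d, np.arange(1, len(d) + 1) / len(d)

def sat_index(max_slopes, thresholds):
    """Slope of the linear fit of spike threshold vs. maximum dV/dt
    (the observed inverse relationship gives a negative slope)."""
    return np.polyfit(max_slopes, thresholds, 1)[0]

# Usage: a trace with up states of 0.1 s and 0.3 s, plus toy spike data.
vm = np.full(1000, -75.0)
vm[100:200] = -50.0
vm[400:700] = -50.0
durs = up_durations(vm)
d_sorted, cdf = cup(durs)
slope = sat_index(np.array([5.0, 10.0, 15.0]),
                  np.array([-50.0, -52.0, -54.0]))
```

Curves of `cdf` computed at different input correlation strengths reproduce the separation shown in Figure 4C; a more negative `slope` corresponds to a steeper threshold/slope relationship.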
We focus now on the mechanisms of the observed bimodality. An important result comes from performing a reverse correlation analysis of the transitions from down to up states with respect to the population activity. On average, the total presynaptic activity around the transition from a down to an up state shows a sharp peak (see Figures 4G and 5A). This indicates that the switch of the intracellular dynamics between up and down states is induced by short-lasting, highly correlated input events. This confirms that up states are induced by presynaptic higher-order events. When the pairwise correlation strength is lowered, the amplitude of these sharp correlation peaks decreases, and the bimodality smoothly disappears. This is true for the passive L5 pyramidal cell as well as for the model with voltage-dependent mechanisms. As shown before, the quantitative features differ in the two cases (the duration of up states, or the mean firing rate, for example), but the bimodality itself is not affected. Moreover, there is no intrinsic bimodality in the modeled neuron; the dynamics is fully input driven. When high-order correlation events disappear following a decrease in the pairwise correlations, the bimodality is destroyed, even though the mean input firing rate to the neuron is kept constant.
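The reverse correlation analysis, averaging the population PSTH in a window around each down-to-up transition, can be sketched as follows. The transition times are assumed to be given (e.g., from the state labels of section 2.5); the window length, bin size, and synthetic-data values are illustrative.

```python
import numpy as np

def transition_triggered_average(psth, transitions, dt=0.001, win=0.05):
    """Average population activity in a +/- win window around each
    down-to-up transition time (in s). `psth` holds the summed
    presynaptic spike count per time bin."""
    half = int(win / dt)
    segs = []
    for t in transitions:
        i = int(round(t / dt))
        if half <= i < len(psth) - half:  # skip windows that fall off the edge
            segs.append(psth[i - half:i + half + 1])
    return np.mean(segs, axis=0)

# Synthetic check: flat background with a brief population burst
# (a high-order event) around each transition time.
psth = np.ones(10000)  # uncorrelated background activity (a.u.)
transitions = [2.0, 5.0, 8.0]
for t in transitions:
    i = int(t / 0.001)
    psth[i - 10:i + 10] += 20.0
tta = transition_triggered_average(psth, transitions)
```

A sharp central peak in `tta`, as in Figure 5A, is the signature that transitions are driven by near-synchronous presynaptic events rather than by a gradual rise in input rate.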
3.3 Simulating Network Effects. According to the above findings, up and down state fluctuations could manifest themselves in a large-scale coherent fashion if all the neurons in the population experience the same correlated dynamics (as for the up and down fluctuations observed in slow-wave sleep or under anesthesia; see section 4). Groups of neurons exposed to different network correlation structures would instead undergo different subthreshold dynamics. We tested this idea by running 11 different simulations of a neuron exposed to a corresponding number of different sets of input spike trains. While the mean firing rate and pairwise correlation strength were kept constant, the timing of the epochs used to generate the correlations changed linearly, from identical for the first set to completely different for the last set. In other words, the timing of the high-order correlation events, as highlighted by the reverse correlation analyses, became more and more dissimilar from set to set. For the first six sets, we observe a significant correlation between the membrane potentials of the simulated neurons (p < 0.001). When the afferent signals are only slightly overlapping, the membrane potentials are no longer significantly correlated. High values of correlation, as experimentally described, are observed only when pairs of cells share a large fraction of synaptic inputs (see Figure 5B).

3.4 Cell Morphology. An important question is whether the detailed morphology of a neuron contributes to the emergence of up and down states and, if so, what the key properties involved are. We manipulate the gross morphological structure of the pyramidal cell in a series of experiments. First, we delete all morphological specificity of the pyramidal neuron by morphing the cell into a spherical single compartment of equal surface, while keeping the firing-rate transfer function unchanged. This reduces the cell to a conductance-based integrate-and-fire (IF) model.
This procedure has been done with and without Ca2+ and Ca2+-dependent K+ currents. Second, we morph the cell into a three-compartment model (basal dendritic tree plus soma, proximal apical dendritic tree, and distal apical dendritic tree) that preserves dendritic voltage attenuation (Destexhe, 2001). Third, we morph only the basal part into an equivalent compartment while preserving the apical dendritic tree in detail. In all these cases, when exposed to the same inputs used in the simulation experiments described above, the up and down dynamics disappear (see Figure 5D). Interestingly, when exposing the IF model to the same input statistics and using a parameter choice for the Ca2+ currents that elicited long-lasting bistability in the L5 pyramidal cell, we do not find any up and down states. Note that this does not imply that no choice of parameters and input statistics exists for which bimodality would emerge. However, when we keep the morphology of the basal dendritic tree unchanged and reduce the apical part to a single equivalent compartment, the qualitative aspects of the bimodal intracellular dynamics are preserved (see Figure 5E). Thus, we find that in the passive model, an intact basal dendritic tree is the minimal condition necessary for the emergence of up and down states. The interesting observation here is that the parameter set that robustly produces bistability in an L5 pyramidal cell does not produce the experimentally observed bistability for a point neuron. As a further test of the hypothesis that the cable properties of the basal dendritic tree are essential for the generation of up and down states, we test a model of a spiny stellate cell developed by Banitt and Segev (personal communication, 2003). From the point of view of its gross morphological structure, this spiny stellate cell can be considered a pyramidal cell without an apical dendritic tree; it thus resembles the morphological characteristics of the cell used in the latter control. It is as if, instead of an equivalent cylindrical compartment, the apical part had been "cut off." Furthermore, this model contains a number of active mechanisms (see section 2). Because Banitt and Segev used this model for a different purpose, we extended it with AMPA and GABA synapses supplying correlated afferent input as described above. In this simulation, using an alternative detailed model neuron with several voltage-dependent mechanisms (see section 2), the same bimodality appears (see Figure 5F). This demonstrates that the basal dendritic tree is an important morphological compartment for the induction of up and down states. In order to investigate the role of the basal dendritic tree, we separate the effects of the interaction of many excitatory postsynaptic potentials (EPSPs) in the dendrites from the effects of electrotonic propagation on individual EPSPs by developing a neuron model that retains some aspects of dendritic processing but radically simplifies others. We increased the duration of the EPSPs in an IF model without any detailed morphology (i.e., a point neuron).

Figure 5: Facing page. (A) Somatic intracellular recordings of the membrane potential showing up-down fluctuations. The modeled cell incorporates calcium dynamics. The raster plot of the corresponding input activity is shown in the bottom panel. High-order correlation events (darker vertical stripes) of a few tens of milliseconds duration induce transitions to the up states. The distribution of these correlated events in time follows a Poisson process. The duration of up states can be prolonged by increasing the frequency of high-order events, as happens in the time window of 2 to 2.5 seconds. The same effect is obtained by increasing the calcium peak conductance and time constants (see the controls in section 2), with the result of eventually merging all the up states shown in the figure into a single long-lasting up state. (B) Decorrelating the subthreshold dynamics: a template set of points in time (in the raster plot shown in panel A, such a set of points would correspond to the moments in time when the vertical stripes occur) has been used to generate 10 other sets of points whose numeric values differ more and more from the original one. The first set is a perfect copy of the original template, the second one differs by 10%, and so on; the last one is 100% different. This group of 11 sets is then used to generate 11 corresponding ensembles of 5000 correlated inputs. These 11 populations have increasingly different degrees of correlation between them, since they share fewer and fewer time epochs for generating pairwise correlations (see section 2). We ran the simulations and recorded the voltage traces for 10 seconds of simulation time. The subthreshold dynamics of the second trace is identical to the first one (the template), while the following traces differ more and more. (C) To quantify the degree of decorrelation, we windowed each voltage trace using a 1 second time window, and for each segment we computed the linear correlation coefficient between the reference trace and the following ones. The windowing procedure yields 10 estimates of the correlation coefficient for every pair of voltage traces, from which we derive the mean correlation values and standard deviations. "Seeds" on the abscissa refers to the points in time used to create the correlations (see section 2). (D) The histogram of the membrane potential for a single spherical compartment, preserving the original input-output firing-rate transfer function, shows a central predominant peak and two small satellite peaks. The more depolarized one is an effect of the spike-cutting procedure, and the more hyperpolarized one is a result of the after-spike hyperpolarization. (E) Histogram of the membrane potential with intact basal morphology and a reduced apical one, substituted by a single cylinder of equivalent surface. (F) Histogram of the membrane potential of a spiny stellate cell with active mechanisms included (see the text for details). (G) Bimodality in the histogram of the membrane potential of an IF neuron with modified EPSP rise and decay time constants, 1 ms and 12 ms, respectively.
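The windowed correlation analysis of Figure 5C, comparing a reference voltage trace against progressively decorrelated ones, can be sketched as follows. The 1 s window follows the caption; the correlation measure is the linear correlation coefficient, and the synthetic traces used for the check are our own.

```python
import numpy as np

def windowed_correlation(v_ref, v_other, dt=0.001, win=1.0):
    """Mean and SD of the linear correlation coefficient between two
    voltage traces, computed over consecutive windows of length `win`."""
    n = int(win / dt)
    ccs = []
    for k in range(len(v_ref) // n):
        a = v_ref[k * n:(k + 1) * n]
        b = v_other[k * n:(k + 1) * n]
        ccs.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(ccs)), float(np.std(ccs))

# Usage: traces driven by shared input stay correlated;
# a trace driven by independent input does not.
rng = np.random.default_rng(0)
shared = rng.normal(0.0, 1.0, 10000)
v1 = shared + 0.1 * rng.normal(0.0, 1.0, 10000)
v2 = shared + 0.1 * rng.normal(0.0, 1.0, 10000)
v3 = rng.normal(0.0, 1.0, 10000)
m12, _ = windowed_correlation(v1, v2)
m13, _ = windowed_correlation(v1, v3)
```

The windowing yields one coefficient per segment, so the mean and standard deviation reported in Figure 5C come directly from the list of per-window values.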
We kept the total charge flow and the input correlation structure unchanged (a correlation strength of 0.1), modifying the rise and decay time constants and the peak amplitude of the conductance change, g(t), so as to keep its integral constant. In this simulation, too, up and down states emerge (see Figure 5G). It should be emphasized that a quantitative mismatch in the bimodality between Figure 5G and Figure 2B is not surprising, since the models used are completely different: an L5 pyramidal cell with detailed morphology versus a single-point IF neuron. The intuitive explanation is that in the real neuron, as well as in the detailed simulations, when a barrage of EPSPs occurs synchronously in several basal dendrites, these long, thin cables quickly become isopotential compartments, each simultaneously depolarizing the soma. Moreover, the temporal filtering properties of these cables have the net effect of prolonging the effective duration of the EPSPs. Note that this temporal broadening is effective even though the intrinsic dendritic time constant, τ, is lowered by the arrival of massive excitation, which increases the conductance (Paré, Shink, Gaudreau, Destexhe, & Lang, 1998). The overall effect is a strong, sustained current to the soma. We conclude that for the passive model, higher-order events, which naturally result from weak pairwise correlations in the network, combined with physiologically realistic electrotonic dendritic properties, explain the bimodal distribution of the membrane potential, and that active conductances shape its detailed temporal properties.

4 Discussion

Here we show in a detailed model of a pyramidal neuron that weak pairwise correlations in its inputs, in combination with the electrotonic properties of its basal dendritic tree, cause up and down states in its membrane potential dynamics. Furthermore, several experimental characterizations of up and down states, such as an increase in gamma power and in the standard deviation of the membrane potential, can be explained in terms of presynaptic correlation structures. By introducing several statistical indices, we demonstrate a way to derive the correlation strength of the inputs, and thus of the activity in the network, from the subthreshold dynamics of a single neuron. In this sense, correlated activity in the network and the bimodality of the membrane potential are different views of one and the same phenomenon. Previous explanations of up and down states have focused on a presynaptic origin and have suggested that synchronous barrages of excitation may be the major agents involved (Wilson & Kawaguchi, 1996). What kind of cortical mechanism could be responsible for their generation is an issue that has not been fully resolved. Here we show that no hypothesis beyond the well-described weak pairwise correlations is required to fill this gap. However, we also show in our model that such a presynaptic source can account for the bimodality in the membrane potential of the postsynaptic neuron only when it is complemented by temporal filtering of the input spike trains by the basal dendrites. Active intracellular mechanisms sustain the triggered transitions to up states. Importantly, the neuron exhibits bimodality in its membrane potential with, as well as without, the insertion of voltage-dependent mechanisms.
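The EPSP manipulation in the point-neuron control, stretching the rise and decay time constants of g(t) while rescaling its peak so that the integral (the total charge) is unchanged, can be sketched as follows. The double-exponential conductance waveform is a common modeling choice assumed here, not stated explicitly in the text, and the kinetic values are illustrative.

```python
import numpy as np

def epsp_conductance(t, tau_rise, tau_decay, integral=1.0):
    """Double-exponential conductance waveform with a fixed integral.

    g(t) is proportional to exp(-t/tau_decay) - exp(-t/tau_rise); the
    amplitude is chosen so that the area under g equals `integral`, so
    slower kinetics trade peak amplitude for duration at constant
    total charge.
    """
    g = np.exp(-t / tau_decay) - np.exp(-t / tau_rise)
    g[t < 0] = 0.0
    # Analytic integral of the difference of exponentials over [0, inf).
    area = tau_decay - tau_rise
    return (integral / area) * g

t = np.arange(0.0, 0.2, 1e-4)  # 0.1 ms resolution, 200 ms span
g_fast = epsp_conductance(t, 0.0005, 0.002)  # fast AMPA-like kinetics
g_slow = epsp_conductance(t, 0.001, 0.012)   # broadened, as in Figure 5G
```

Under this normalization, the broadened waveform is lower but longer lasting, which is exactly the trade-off that lets temporally summed inputs hold the point neuron in a depolarized state.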
The key effect of Ca2+- or Na+-dependent K+ channels is their strong modulation of the quantitative properties of the subthreshold dynamics. This includes a sixfold compression of the dynamic range of the up state's mean firing rates. Calcium- and sodium-dependent potassium channels can stretch the up-state duration from 25 to 300 ms, providing a better match with electrophysiological data. However, they are not responsible for the emergence of the phenomenon per se.

The differential impact of the morphology of the basal and apical dendrites on the response properties of pyramidal cells has been pointed out in other experimental (Larkum, Zhu, & Sakmann, 1999; Berger, Larkum, & Lüscher, 2001) and theoretical studies (Körding & König, 2001). In these studies, properties of the inputs play a key role in the interaction of apical and basal dendritic compartments. In another study, it has been shown that the dynamic regulation of the dendritic electrotonic length can give rise to highly specific interactions in a cortical network that can account for the differential processing of multiple stimuli (Verschure & König, 1999).
A. Benucci, P. Verschure, and P. König
4.1 The Scope of the Model. In our simulations, we do not reconstruct a complete visual system but approximate the first- and second-order statistical structure of the afferent inputs to a single neuron derived from experimental results. This approximation can be experimentally verified by intracellular recordings in area 17 of an anesthetized cat using full field gratings as visual stimuli. This setup has been used in a fair number of laboratories to demonstrate the existence of weak pairwise correlations in neuronal activity (Bair, 1999). Such synchronized activity has been observed over a spatial extent of several millimeters, roughly matching the scale of monosynaptic tangential connections in the cortex (Singer, 1993). Therefore, a pyramidal neuron in visual cortex samples mainly the activity of a region where neuronal activity is weakly correlated. Furthermore, the pyramidal cell reconstructed and simulated in this study has been recorded from and filled in primary visual cortex. Pyramidal cells are the predominant neuronal type in the whole cortex, and weak pairwise correlations have been found in many other cortical areas and species as well (Bair, 1999). Thus, our simulations apply to widely used experimental paradigms.

Another important question pertaining to our study and the physiological data it relates to is whether the phenomena studied here generalize to the awake, behaving animal. The impact of the behavioral state on the cortical dynamics is not fully understood, and the bulk of physiological experiments are performed under anesthesia. Electroencephalogram data show marked differences in the neuronal dynamics between different behavioral states (Steriade et al., 2001), and we can expect that the detailed dynamics of neuronal activity are affected. Furthermore, it is known that anesthetics influence the dynamics of up and down states during spontaneous activity (Steriade, 2001).
However, whether the subthreshold dynamics is influenced by anesthetics when visual stimuli are applied is as yet unknown. Furthermore, in the few studies where awake cats are used (Gray & Viana di Prisco, 1997; Siegel, Körding, & König, 2000), correlated activity on a millisecond timescale has been observed, compatible with the results obtained with the anesthetized preparation. Interestingly, a bimodality in the subthreshold dynamics has been observed in slow-wave sleep (Steriade et al., 2001). This raises the question of how the correlation structure of neuronal activity during sleep matches that observed under anesthetics. Hence, whether the assumptions made in our study, and the data they are based on, generalize to awake or sleeping animals has to be further investigated. The few results available suggest, however, that the relationship reported here between network dynamics and up and down states could be valid for different behavioral states.

4.2 Simplifications and Assumptions. The simulations presented in this study incorporate a number of assumptions and simplifications. Although within the framework of the simulation it is possible to use statistical indices to quantify the relationship between membrane potential dynamics and the statistical structure of the inputs, a full quantitative match between experimental and simulation results is difficult to achieve. To understand the reasons behind these differences (for standard deviation, see Figure 3B; for slope, see Figures 4D and 4E), it has to be considered that we investigated detailed compartmental models in which many known channels and currents (Wilson & Kawaguchi, 1996) have been omitted. In particular, voltage-dependent currents with long time constants are known to play an important role in stabilizing and prolonging up and down states (Wilson & Kawaguchi, 1996). Once the upward or downward transition has occurred, active currents contribute to stabilizing the membrane potential in either an up state or a down state. Their duration is related not only to the input dynamics but also to the kinetic properties of such active mechanisms, which essentially implement a bistable attractor of the dynamics. Without that, the passive filtering properties of the dendrites would simply be responsible for the emergence of stretched-in-time envelopes of the coherent presynaptic events: up and down states would be triggered but not maintained for prolonged periods. These considerations have been validated by a tenfold increase in the duration of the up states as soon as Ca2+ and Ca2+-dependent K+ currents were included in the model.

The relative importance of voltage-dependent mechanisms and presynaptic events in inducing and maintaining a bistable neuronal dynamics seems to vary considerably depending on the animal and brain area. In view of this, the temptation arises to include a much larger number of active mechanisms to capture this complexity. However, it quickly becomes obvious that the data needed to specify the precise distribution and strength of each mechanism are not available.
Therefore, the number of free parameters of the model, to be fitted by reaching a consistent input-output firing rate and other constraints, increases dramatically and surpasses the number of constraints available.

An interesting alternative to the above scenario can be envisioned. Instead of increasing the complexity of the modeled cell, it is possible to lower the complexity of the real cell. Essentially, an experiment can be designed in which the focus shifts from an investigation of the cellular properties of the recorded neuron as such to using the neuron as a probe to investigate the correlation structure in the network. The main guidelines of the experiment should be to record intracellularly from a neuron in primary visual cortex while presenting different visual stimuli that induce different correlation structures. Gratings are known to induce pairwise correlations, while randomly moving dots lead to weak or no correlations (Usrey & Reid, 1999; Brecht, Goebel, Singer, & Engel, 2001; Gray et al., 1990). Alternatively, synchronous network activity could be simulated by using an intracortical electrode and applying short-lasting depolarizing current pulses. Applying blockers of voltage-dependent channels or hyperpolarizing the neuron strongly affects the membrane conductance and simplifies the dynamics within the real neuron. The aim is to reduce as much as possible the impact
of active mechanisms so that the real cell becomes simply a passive receiver of the input spike trains and their higher-order correlation statistics. Reducing the number of parameters that can vary in a real experiment to the smaller set of controlled parameters employed in the simulation study allows further validation of the conclusions derived from our model against physiological reality. This makes it possible to infer the correlation strength of the activity in the cortical network from the subthreshold dynamics, as quantified by the indices described above. In this sense, the simulation study facilitates a new experimental approach, using a neuron under nonphysiological conditions as a passive probe to investigate the dynamics of the network.

Appendix

The tables describe parameter values used in the NEURON simulation. "Place" (in Table 1) refers to the dendritic location of the inserted mechanism: "dendrites" indicates both basal and apical dendrites, while $\bar{G}$ (and $\bar{g}$ for synaptic mechanisms) indicates the peak conductance values. For Ih channels, the peak conductance varies according to the dendritic location, as reported by Berger et al. (2001). The parameters m and n refer to the kinetic scheme of the Hodgkin-Huxley-like formalism used to describe the activation and inactivation properties of voltage-dependent mechanisms, according to the following equation:

$$I_A = \bar{G}_A \times m_A^p \times n_A^q \times (u - E_A),$$

where A is a generic ionic type. The parameters k and $u_{1/2}$ refer to the Boltzmann equation that describes the steady-state conductance behavior of the ionic mechanisms inserted, according to the equation

$$f(u) = \frac{1}{1 + \exp\left(\frac{u_{1/2} - u}{k}\right)}.$$
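These two expressions can be evaluated directly. The following sketch shows the Hodgkin-Huxley-style current and the Boltzmann steady-state activation; the parameter values are illustrative only and are not taken from the paper's Tables 1 and 2:

```python
import math

def boltzmann_steady_state(u, u_half, k):
    # f(u) = 1 / (1 + exp((u_1/2 - u) / k)): steady-state activation
    return 1.0 / (1.0 + math.exp((u_half - u) / k))

def ionic_current(u, g_bar, m, n, p, q, e_rev):
    # I_A = G_A * m_A^p * n_A^q * (u - E_A) for a generic ionic type A
    return g_bar * (m ** p) * (n ** q) * (u - e_rev)

# Illustrative values only (not from the paper's tables): a K+-like
# current with first-order activation, evaluated at u = -50 mV.
m = boltzmann_steady_state(-50.0, u_half=-30.0, k=9.0)
i_k = ionic_current(-50.0, g_bar=0.01, m=m, n=1.0, p=1, q=0, e_rev=-90.0)
```

At $u = u_{1/2}$ the activation is exactly 0.5, and the current vanishes at the reversal potential, which makes both functions easy to sanity-check.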
Finally, the time course of the synaptic conductance (AMPA and GABA; see Table 2) follows an alpha-function behavior:

$$f(t) = \exp\left(-\frac{t}{\tau_1}\right) - \exp\left(-\frac{t}{\tau_2}\right).$$

NMDA channels have been used solely in control experiments (see section 2.6).

Acknowledgments

This work was supported by the Swiss National Fund (31-61415.01), ETH Zürich, and EU/BBW (IST-2000-28127/01.0208-1).
References

Agmon-Snir, H., & Segev, I. (1993). Signal delay and input synchronization in passive dendritic structures. J. Neurophysiol., 70, 2066–2085.
Anderson, J., Lampl, I., Reichova, I., Carandini, M., & Ferster, D. (2000). Stimulus dependence of two-state fluctuations of membrane potential in cat visual cortex. Nature Neurosci., 3, 617–621.
Azouz, R., & Gray, C. M. (2000). Dynamic spike threshold reveals a mechanism for synaptic coincidence detection in cortical neurons in vivo. PNAS, 97, 8110–8115.
Azouz, R., & Gray, C. M. (2003). Adaptive coincidence detection and dynamic gain control in visual cortical neurons in vivo. Neuron, 37, 513–523.
Bair, W. (1999). Spike timing in the mammalian visual system. Curr. Opin. Neurobiol., 9, 447–453.
Benucci, A., Verschure, P. F., & König, P. (2003). Existence of high-order correlations in cortical activity. Phys. Rev. E, 68(4 Pt 1), 041905.
Berger, T., Larkum, M. E., & Lüscher, H. R. (2001). High I(h) channel density in the distal apical dendrite of layer V pyramidal cells increases bidirectional attenuation of EPSPs. J. Neurophysiol., 85(2), 855–868.
Bernander, Ö., Koch, C., & Douglas, R. J. (1994). Amplification and linearization of distal synaptic inputs to cortical pyramidal cells. J. Neurophysiol., 72, 2743–2753.
Bernander, Ö., Koch, C., & Usher, M. (1994). The effects of synchronized inputs at the single neuron level. Neural Computation, 6, 622–641.
Bothé, S. M., Spekreijse, H., & Roelfsema, P. R. (2000). The effect of pair-wise and higher order correlations on the firing rate of a post-synaptic neuron. Neural Computation, 12, 153–179.
Brecht, M., Goebel, R., Singer, W., & Engel, A. K. (2001). Synchronization of visual responses in the superior colliculus of awake cats. Neuroreport, 12, 43–47.
Contreras, D., & Steriade, M. (1995). Cellular basis of EEG slow rhythms: A study of dynamic corticothalamic relationships. J. Neurosci., 15, 604–622.
Dean, A. F. (1981). The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res., 44(4), 437–440.
Destexhe, A. (2001). Simplified models of neocortical pyramidal cells preserving somatodendritic voltage attenuation. Neurocomputing, 38–40, 167–173.
Destexhe, A., & Paré, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. J. Neurophysiol., 81, 1531–1547.
Destexhe, A., & Paré, D. (2000). A combined computational and intracellular study of correlated synaptic bombardment in neocortical pyramidal neurons in vivo. Neurocomputing, 32–33, 113–119.
Douglas, R. J., & Martin, K. A. C. (1990). Neocortex. In G. M. Shepherd (Ed.), The synaptic organization of the brain (pp. 389–438). Oxford: Oxford University Press.
Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1991). An intracellular analysis of the visual responses of neurons in cat visual cortex. J. Physiol. (London), 440, 659–696.
Engel, A. K., König, P., Gray, C. M., & Singer, W. (1990). Stimulus-dependent neuronal oscillations in cat visual cortex: Inter-columnar interaction as determined by cross-correlation analysis. Eur. J. Neurosci., 2, 588–606.
Feng, J., & Brown, D. (2000). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Comput., 12, 671–692.
Gabbott, P. L., Martin, K. A. C., & Whitteridge, D. (1987). Connections between pyramidal neurons in layer 5 of cat visual cortex (area 17). J. Comp. Neurol., 259, 364–381.
Gray, C. M., Engel, A. K., König, P., & Singer, W. (1990). Stimulus-dependent neuronal oscillations in cat visual cortex: Receptive field properties and feature dependence. Eur. J. Neurosci., 2, 607–619.
Gray, C. M., & Viana di Prisco, G. (1997). Stimulus-dependent neuronal oscillations and local synchronization in striate cortex of the alert cat. J. Neurosci., 17, 3239–3253.
Grün, S., Diesmann, M., & Aertsen, A. (2001). Unitary events in multiple single-neuron spiking activity: I. Detection and significance. Neural Computation, 14, 43–80.
Hines, M. L., & Carnevale, N. T. (2001). NEURON: A tool for neuroscientists. Neuroscientist, 7, 123–135.
Kasanetz, F., Riquelme, L. A., & Murer, M. G. (2002). Disruption of the two-state membrane potential of striatal neurones during cortical desynchronization in anesthetized rats. J. Physiol., 543, 577–589.
Kisvarday, Z. F., Toth, E., Rausch, M., & Eysel, U. T. (1997). Orientation-specific relationship between populations of excitatory and inhibitory lateral connections in the visual cortex of the cat. Cerebral Cortex, 7, 605–618.
Koch, C., Bernander, Ö., & Douglas, R. J. (1995). Do neurons have a voltage or a current threshold for action potential initiation? J. Comput. Neurosci., 2, 63–82.
König, P. (1994). A method for the quantification of synchrony and oscillatory properties of neuronal activity. J. Neurosci. Meth., 54, 31–37.
König, P., Engel, A. K., & Singer, W. (1995a). How precise is neuronal synchronization? Neural Computation, 7, 469–485.
König, P., Engel, A. K., & Singer, W. (1995b). Relation between oscillatory activity and long-range synchronization in cat visual cortex. PNAS, 92, 290–294.
Körding, K. P., & König, P. (2001). Supervised and unsupervised learning with two sites of synaptic integration. J. Comput. Neurosci., 11, 207–215.
Lampl, I., Reichova, I., & Ferster, D. (1999). Synchronous membrane potential fluctuations in neurons of the cat visual cortex. Neuron, 22, 361–374.
Larkum, M. E., Zhu, J. J., & Sakmann, B. (1999). A new cellular mechanism for coupling inputs arriving at different cortical layers. Nature, 398, 338–341.
Lewis, B. L., & O'Donnell, P. (2000). Ventral tegmental area afferents to the prefrontal cortex maintain membrane potential "up" states in pyramidal neurons via D1 dopamine receptors. Cerebral Cortex, 10, 1168–1175.
Mazurek, M. E., & Shadlen, M. N. (2002). Limits to the temporal fidelity of cortical spike rate signals. Nature Neurosci., 5, 463–471.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70, 1086–1101.
Paré, D., Shink, E., Gaudreau, H., Destexhe, A., & Lang, E. J. (1998). Impact of spontaneous synaptic activity on the resting properties of cat neocortical pyramidal neurons in vivo. J. Neurophysiol., 79, 1450–1460.
Roskies, A., et al. (1999). Reviews on the binding problem. Neuron, 24(1), 7–110.
Salin, P. A., & Bullier, J. (1995). Corticocortical connections in the visual system: Structure and function. Physiological Rev., 75, 107–154.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20, 6193–6209.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18, 3870–3896.
Siegel, M., Körding, K. P., & König, P. (2000). Integrating top-down and bottom-up sensory processing by somato-dendritic interactions. J. Comput. Neurosci., 8, 161–173.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Sompolinsky, H., & Shapley, R. (1997). New perspectives on the mechanisms for orientation selectivity. Current Opinion in Neurobiol., 7, 514–522.
Steriade, M. (2001). Impact of network activities on neuronal properties in corticothalamic systems. J. Neurophysiol., 86, 1–39.
Steriade, M., Contreras, D., & Amzica, F. (1994). Synchronized sleep oscillations and their paroxysmal developments. Trends Neurosci., 17, 199–208.
Steriade, M., Timofeev, I., & Grenier, F. (2001). Natural waking and sleep states: A view from inside neocortical neurons. J. Neurophysiol., 85, 1968–1985.
Stern, E. A., Jaeger, D., & Wilson, C. J. (1998). Membrane potential synchrony of simultaneously recorded striatal spiny neurons in vivo. Nature, 394, 475–478.
Stern, E. A., Kincaid, A. E., & Wilson, C. J. (1997). Spontaneous subthreshold membrane potential fluctuations and action potential variability of rat corticostriatal and striatal neurons in vivo. J. Neurophysiol., 77, 1697–1715.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Comput., 13, 2005–2029.
Usrey, W. M., & Reid, R. C. (1999). Synchronous activity in the visual system. Annu. Rev. Physiol., 61, 435–456.
Verschure, P. F., & König, P. (1999). On the role of biophysical properties of cortical neurons in binding and segmentation of visual scenes. Neural Comput., 11, 1113–1138.
Wilson, C. J., & Groves, P. M. (1981). Spontaneous firing patterns of identified spiny neurons in the rat neostriatum. Brain Res., 220, 67–80.
Wilson, C. J., & Kawaguchi, Y. (1996). The origins of two-state spontaneous membrane potential fluctuations of neostriatal spiny neurons. J. Neurosci., 16, 2397–2410.

Received September 4, 2003; accepted April 7, 2004.
LETTER
Communicated by Yair Weiss
On the Uniqueness of Loopy Belief Propagation Fixed Points Tom Heskes
[email protected] SNN, University of Nijmegen, 6525 EZ, Nijmegen, The Netherlands
We derive sufficient conditions for the uniqueness of loopy belief propagation fixed points. These conditions depend on both the structure of the graph and the strength of the potentials and naturally extend those for convexity of the Bethe free energy. We compare them with (a strengthened version of) conditions derived elsewhere for pairwise potentials. We discuss possible implications for convergent algorithms, as well as for other approximate free energies.

1 Introduction

Loopy belief propagation is Pearl's belief propagation (Pearl, 1988) applied to networks containing cycles. It can be used to compute approximate marginals in Bayesian networks and Markov random fields. Whereas belief propagation is exact only in special cases, for example, for tree-structured (singly connected) networks with just gaussian or just discrete nodes, loopy belief propagation empirically often leads to good performance (Murphy, Weiss, & Jordan, 1999; McEliece, MacKay, & Cheng, 1998). That is, the approximate marginals computed with loopy belief propagation are in many cases close to the exact marginals. In gaussian graphical models, the means are guaranteed to coincide with the exact means (Weiss & Freeman, 2001).

The notion that fixed points of loopy belief propagation correspond to extrema of the so-called Bethe free energy (Yedidia, Freeman, & Weiss, 2001) is an important step in the theoretical understanding of this success and paved the road for interesting generalizations. However, when applied to graphs with cycles, loopy belief propagation does not always converge. So-called double-loop algorithms have been proposed that do guarantee convergence (Yuille, 2002; Teh & Welling, 2002; Heskes, Albers, & Kappen, 2003), but are an order of magnitude slower than standard loopy belief propagation. It is generally believed that there is a close connection between (non)convergence of loopy belief propagation and (non)uniqueness of loopy belief propagation fixed points.
Neural Computation 16, 2379–2413 (2004) © 2004 Massachusetts Institute of Technology

More specifically, the working hypothesis is that uniqueness of a loopy belief propagation fixed point guarantees convergence of loopy belief propagation to this fixed point. The goal of this study, then, is to derive sufficient
conditions for uniqueness. Such conditions are not only relevant from a theoretical point of view, but can also be used to derive faster algorithms and suggest different free energies, as will be discussed in section 9.

2 Outline

Before getting into the mathematical details, we first sketch the line of reasoning that will be followed in this article. It is inspired by the connection between fixed points of loopy belief propagation and extrema of the Bethe free energy: by studying the Bethe free energy, we can learn about properties of loopy belief propagation.

The Bethe free energy is an approximation to the exact variational Gibbs-Helmholtz free energy. Both are concepts from (statistical) physics. Abstracting from the physical interpretation, the Gibbs-Helmholtz free energy is "just" a functional with a unique minimum, the argument of which corresponds to the exact probability distribution. However, the Gibbs-Helmholtz free energy is as intractable as the exact probability distribution. The idea is then to approximate the Gibbs-Helmholtz free energy in the hope that the minimum of such a tractable approximate free energy relates to the minimum of the exact free energy. Examples of such approximations are the mean-field free energy, the Bethe free energy, and the Kikuchi free energy. The connections between the Gibbs-Helmholtz free energy, the Bethe free energy, and loopy belief propagation are reviewed in section 3.

The Bethe free energy is a function of so-called pseudomarginals or beliefs. For the minimum of the Bethe free energy to make sense, these pseudomarginals have to be properly normalized as well as consistent. Our starting point, the upper-left corner in Figure 1, is a constrained minimization problem. In general, it is in fact a nonconvex constrained minimization problem, since the Bethe free energy is a nonconvex function of the pseudomarginals (the constraints are linear in these pseudomarginals).
However, using the constraints on the pseudomarginals, it may be possible to rewrite the Bethe free energy in a form that is convex in the pseudomarginals. When this is possible, we call the Bethe free energy "convex over the set of constraints" (Pakzad & Anantharam, 2002). Now, if the Bethe free energy is convex over the set of constraints, we have, in combination with the linearity of the constraints, a convex constrained minimization problem. Convex constrained minimization problems have a unique solution (see, e.g., Luenberger, 1984), which explains link d in Figure 1.

Sufficient conditions for convexity over the set of constraints, link b in Figure 1, can be found in Pakzad and Anantharam (2002) and Heskes et al. (2003). They are (re)derived and discussed in section 4. These conditions depend only on the structure of the graph, not on the (strength of the) potentials that make up the probability distribution defined over this graph. A corollary of these conditions, derived in section 4.3, is that the Bethe free energy for a graph with a single loop is "just" convex over the set of
Figure 1: Layout of correspondences and implications. See the text for details.
constraints: with two or more connected loops, the conditions fail (see also McEliece & Yildirim, 2003).

Milder conditions for uniqueness, which do depend on the strength of the interactions, follow from the track on the right-hand side of Figure 1. First, we note that nonconvex constrained minimization of the Bethe free energy is equivalent to an unconstrained nonconvex/concave minimax problem (Heskes, 2002), link a in Figure 1. Convergent double-loop algorithms like CCCP (Yuille, 2002) and faster variants thereof (Heskes et al., 2003) in fact solve such a minimax problem: the concave problem in the maximizing parameters (basically Lagrange multipliers) is solved by a message-passing algorithm very similar to standard loopy belief propagation in the inner loop, while the outer loop changes the minimizing parameters (a remaining set of pseudomarginals) in the proper downward direction. The transformation
from a nonconvex constrained minimization problem to an unconstrained nonconvex/concave minimax problem is, in a particular setting relevant to this article, repeated in section 5.1. Rather than requiring the Bethe free energy to be convex (over the set of constraints), we then in sections 6 and 8 work toward conditions under which this minimax problem is convex/concave. These indeed depend on the strength of the potentials, defined in section 7. These conditions can be considered the main result of this article.

Link c follows from the observation, in section 5.2, that the minimax problem corresponding to a Bethe free energy that is convex over the set of constraints has to be convex/concave. As indicated by link e, convex/concave minimax problems have a unique solution. This then also implies that the Bethe free energy has a unique extremum satisfying the constraints, which, since the Bethe free energy is bounded from below (see section 5.3), has to be a minimum: link f. The concluding statement, link g in the lower-right corner, is, to the best of our knowledge, no more than a conjecture. We discuss it in more detail in section 9.

3 The Bethe Free Energy and Loopy Belief Propagation

3.1 The Gibbs-Helmholtz Free Energy. The exact probability distribution in Bayesian networks and Markov random fields can be written in the factorized form

$$P_{\text{exact}}(X) = \frac{1}{Z} \prod_\alpha \Psi_\alpha(X_\alpha). \tag{3.1}$$
Here $\Psi_\alpha$ is a potential, some function of the potential subset $X_\alpha$, and Z is an unknown normalization constant. Potential subsets typically overlap, and they span the whole domain X. The convention that we adhere to in this article is that there are no potential subsets $X_\alpha$ and $X_{\alpha'}$ such that $X_\alpha$ is fully subsumed by $X_{\alpha'}$. The standard choice of a potential in a Bayesian network is a child with all its parents. We further restrict ourselves to probabilistic models defined on discrete random variables, each of which runs over a finite number of states. The potentials are positive and finite.

The typical goal in Bayesian networks and Markov random fields is to compute the partition function Z or marginals, for example,

$$P_{\text{exact}}(X_\alpha) = \sum_{X \setminus \alpha} P_{\text{exact}}(X).$$
One way to do this is with the junction tree algorithm (Lauritzen & Spiegelhalter, 1988). However, the junction tree algorithm scales exponentially with the size of the largest clique and may become intractable for complex models. The alternative is then to resort to approximate methods, which can be
roughly divided into two categories: sampling approaches and deterministic approximations.

Most deterministic approximations derive from the so-called Gibbs-Helmholtz free energy,

$$F(P) = -\sum_\alpha \sum_{X_\alpha} P(X_\alpha) \psi_\alpha(X_\alpha) + \sum_X P(X) \log P(X),$$

with shorthand $\psi_\alpha \equiv \log \Psi_\alpha$. Minimizing this variational free energy over the set $\mathcal{P}$ of all properly normalized probability distributions, we get back the exact probability distribution, equation 3.1, as the argument at the minimum and minus the log of the partition function as the value at the minimum:

$$P_{\text{exact}} = \mathop{\mathrm{argmin}}_{P \in \mathcal{P}} F(P) \quad \text{and} \quad -\log Z = \min_{P \in \mathcal{P}} F(P).$$
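Both identities can be checked by brute force on a tiny model. A sketch (the two binary-variable potentials below are invented for illustration):

```python
import itertools
import math
import random

# Toy factorized model on two binary variables (potentials invented):
# P_exact(x) proportional to Psi_a(x1, x2) * Psi_b(x2).
psi_a = {(0, 0): 1.2, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 2.0}
psi_b = {0: 1.5, 1: 0.8}

states = list(itertools.product([0, 1], repeat=2))
unnorm = {x: psi_a[x] * psi_b[x[1]] for x in states}
Z = sum(unnorm.values())
p_exact = {x: w / Z for x, w in unnorm.items()}

def free_energy(p):
    # F(P) = -sum_X P(X) log Psi(X) + sum_X P(X) log P(X): the energy
    # term collapses to a sum over the joint because log Psi = sum psi_alpha.
    return sum(p[x] * (math.log(p[x]) - math.log(unnorm[x]))
               for x in states if p[x] > 0)

# At the exact distribution, F equals -log Z ...
assert abs(free_energy(p_exact) + math.log(Z)) < 1e-12
# ... and any other normalized distribution scores higher (Gibbs inequality).
random.seed(0)
w = [random.random() for _ in states]
q = {x: wi / sum(w) for x, wi in zip(states, w)}
assert free_energy(q) >= free_energy(p_exact)
```

The enumeration over the joint is exactly what makes the exact free energy intractable for large models, which motivates the Bethe approximation discussed next.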
Since the Gibbs-Helmholtz free energy is convex in P, the equality constraint (proper normalization) is linear, and the inequality constraints (nonnegativity) are convex, this minimum is unique. By itself, we have not gained anything: the entropy may still be intractable to compute.

3.2 The Bethe Free Energy. The Bethe free energy is an approximation of the exact Gibbs-Helmholtz free energy. In particular, we approximate the entropy through

$$\sum_X P(X) \log P(X) \approx \sum_\alpha \sum_{X_\alpha} P(X_\alpha) \log P(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} P(x_\beta) \log P(x_\beta),$$

with $x_\beta$ a (super)node and $n_\beta = \sum_{\alpha \supset \beta} 1$ the number of potentials that contain node $x_\beta$. The second term follows from a discounting argument: without it, we would overcount the entropy contributions on the overlap between the potential subsets. The (super)nodes $x_\beta$ are themselves subsets of the potential subsets, that is, $x_\beta \cap X_\alpha = \emptyset$ or $x_\beta \cap X_\alpha = x_\beta \ \forall \alpha, \beta$, and partition the domain X,

$$x_\beta \cap x_{\beta'} = \emptyset \ \forall \beta \neq \beta' \quad \text{and} \quad \bigcup_\beta x_\beta = X.$$

Typically the $x_\beta$ are taken to be single nodes, and in the following we will refer to them as such. For clarity of notation, we will indicate these nodes by $\beta$ and $x_\beta$ in lowercase, to contrast them with the potentials $\alpha$ and potential subsets $X_\alpha$ in uppercase.
Note that the Bethe free energy depends on only the marginals $P(X_\alpha)$ and $P(x_\beta)$. We replace minimization of the exact Gibbs-Helmholtz free energy over probability distributions by minimization of the Bethe free energy,

$$F(Q_\alpha, Q_\beta) = -\sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \psi_\alpha(X_\alpha) + \sum_\alpha \sum_{X_\alpha} Q_\alpha(X_\alpha) \log Q_\alpha(X_\alpha) - \sum_\beta (n_\beta - 1) \sum_{x_\beta} Q_\beta(x_\beta) \log Q_\beta(x_\beta), \tag{3.2}$$

over sets of "pseudomarginals"1 or beliefs $\{Q_\alpha, Q_\beta\}$. For this to make sense, these pseudomarginals have to be properly normalized as well as consistent, that is,2

$$\sum_{X_\alpha} Q_\alpha(X_\alpha) = 1 \quad \text{and} \quad Q_\alpha(x_\beta) \equiv \sum_{X_{\alpha \setminus \beta}} Q_\alpha(X_\alpha) = Q_\beta(x_\beta). \tag{3.3}$$

Let $\mathcal{Q}$ denote all sets of consistent and properly normalized pseudomarginals. Then our goal is to solve

$$\min_{\{Q_\alpha, Q_\beta\} \in \mathcal{Q}} F(Q_\alpha, Q_\beta).$$
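On a tree the Bethe entropy approximation is exact, which gives a concrete check of equation 3.2. A sketch on a three-node chain (potential values invented for illustration; the exact marginals serve as the pseudomarginals):

```python
import itertools
import math

# Chain x1 - x2 - x3 with invented pairwise potentials; the graph is a
# tree, so the Bethe approximation to the entropy is exact here.
psi_a = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}  # Psi(x1, x2)
psi_b = {(0, 0): 1.0, (0, 1): 1.5, (1, 0): 1.5, (1, 1): 1.0}  # Psi(x2, x3)

states = list(itertools.product([0, 1], repeat=3))
unnorm = {x: psi_a[(x[0], x[1])] * psi_b[(x[1], x[2])] for x in states}
Z = sum(unnorm.values())
p = {x: w / Z for x, w in unnorm.items()}

# Exact marginals, used here as the pseudomarginals Q_alpha and Q_beta.
Q_a = {(i, j): sum(p[(i, j, k)] for k in (0, 1)) for i in (0, 1) for j in (0, 1)}
Q_b = {(j, k): sum(p[(i, j, k)] for i in (0, 1)) for j in (0, 1) for k in (0, 1)}
Q_2 = {j: sum(p[(i, j, k)] for i in (0, 1) for k in (0, 1)) for j in (0, 1)}

def entropy(q):
    return -sum(v * math.log(v) for v in q.values() if v > 0)

# Equation 3.2: energy term, potential entropies, and the discounted
# entropy of the shared node x2 (n_beta = 2 there; x1 and x3 each belong
# to a single potential, so their n_beta - 1 = 0 terms vanish).
F_bethe = (-sum(Q_a[s] * math.log(psi_a[s]) for s in Q_a)
           - sum(Q_b[s] * math.log(psi_b[s]) for s in Q_b)
           - entropy(Q_a) - entropy(Q_b)
           + (2 - 1) * entropy(Q_2))
```

On this tree, F_bethe coincides with -log Z, the minimum of the exact Gibbs-Helmholtz free energy; on a loopy graph the two would generally differ.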
The hope is that the pseudomarginals at this minimum are accurate approximations to the exact marginals $P_{\text{exact}}(X_\alpha)$ and $P_{\text{exact}}(x_\beta)$.

3.3 Link with Loopy Belief Propagation. For completeness and later reference, we describe the link between the Bethe free energy and loopy belief propagation, as originally reported on by Yedidia et al. (2001). It starts with the Lagrangian

$$L(Q_\alpha, Q_\beta, \lambda_{\alpha\beta}, \lambda_\alpha, \lambda_\beta) = F(Q_\alpha, Q_\beta) + \sum_\beta \sum_{\alpha \supset \beta} \sum_{x_\beta} \lambda_{\alpha\beta}(x_\beta) \left[ Q_\beta(x_\beta) - Q_\alpha(x_\beta) \right] + \sum_\alpha \lambda_\alpha \left[ 1 - \sum_{X_\alpha} Q_\alpha(X_\alpha) \right] + \sum_\beta \lambda_\beta \left[ 1 - \sum_{x_\beta} Q_\beta(x_\beta) \right]. \tag{3.4}$$
1 Terminology from Wainwright, Jaakkola, and Willsky (2002), used to indicate that there need not be a joint distribution that would yield such marginals. 2 Strictly speaking we also have to take inequality constraints into account, namely, those of the form Qα (Xα ) ≥ 0. However, with the potentials being positive and finite, the logarithmic terms in the free energy make sure that we never really have to worry about those; they never become “active.” For convenience, we will not consider them any further.
At an extremum of the Bethe free energy satisfying the constraints, all derivatives of L are zero: the ones with respect to the Lagrange multipliers λ give back the constraints; the ones with respect to the pseudomarginals Q give an extremum of the Bethe free energy. Setting the derivatives with respect to Qα and Qβ to zero, we can solve for Qα and Qβ in terms of the Lagrange multipliers:

Q∗α(Xα) = Ψα(Xα) exp[ λα − 1 + Σβ⊂α λαβ(xβ) ]

Q∗β(xβ) = exp[ (1/(nβ − 1)) ( 1 − λβ + Σα⊃β λαβ(xβ) ) ].
In terms of the "message" µβ→α(xβ) ≡ exp[λαβ(xβ)] from node β to potential α, the pseudomarginal Q∗α(Xα) reads

Q∗α(Xα) ∝ Ψα(Xα) Πβ⊂α µβ→α(xβ),   (3.5)

where proper normalization yields the Lagrange multiplier λα. With the definition

µα→β(xβ) ≡ Q∗β(xβ) / µβ→α(xβ),   (3.6)

the fixed-point equation for Q∗β(xβ) can, after some manipulation, be written in the form

Q∗β(xβ) ∝ Πα⊃β µα→β(xβ),   (3.7)

where again the Lagrange multiplier λβ follows from normalization. Finally, the constraint Q∗α(xβ) = Q∗β(xβ) in combination with equation 3.6 suggests the update

µα→β(xβ) = Q∗α(xβ) / µβ→α(xβ).   (3.8)

Equations 3.5 through 3.8 constitute the belief propagation equations. They can be summarized as follows. A pseudomarginal is the potential (just 1 for the nodes, in the convention where no potentials are assigned to nodes) times its incoming messages; an outgoing message is the pseudomarginal divided by the corresponding incoming message. The scheduling of the messages is somewhat arbitrary. Loopy belief propagation can be "damped" by taking smaller
steps. This damping is usually done in terms of the Lagrange multipliers, that is, in the log domain of the messages:

log µnewα→β(xβ) = log µα→β(xβ) + ε [ {log Q∗α(xβ) − log µβ→α(xβ)} − log µα→β(xβ) ],   (3.9)

with step size 0 < ε ≤ 1; ε = 1 recovers the undamped update, equation 3.8.
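Equations 3.5 through 3.9 can be exercised end to end on a toy model. The following minimal sketch (hypothetical helper names; binary nodes and pairwise potentials only, with messages kept normalized) iterates the damped updates:

```python
import math

# Sketch of equations 3.5-3.9 (hypothetical names): binary nodes, pairwise
# potentials Psi[(i, j)], messages mu[(a, i)] from potential a to node i and
# nu[(i, a)] from node i to potential a, damped with step size eps.

def loopy_bp(Psi, n_nodes, iters=100, eps=0.5):
    factors = list(Psi)
    nu = {(i, a): [1.0, 1.0] for a in factors for i in a}   # node -> potential
    mu = {(a, i): [1.0, 1.0] for a in factors for i in a}   # potential -> node
    for _ in range(iters):
        for a in factors:
            i, j = a
            for tgt, oth in ((i, j), (j, i)):
                # Q*_a marginalized onto tgt divided by the incoming message:
                # this is the undamped outgoing message of equation 3.8
                q = [sum(Psi[a][x if tgt == i else xo][xo if tgt == i else x]
                         * nu[(oth, a)][xo] for xo in (0, 1)) for x in (0, 1)]
                # damping in the log domain of the messages (equation 3.9)
                new = [math.exp((1 - eps) * math.log(mu[(a, tgt)][x])
                                + eps * math.log(q[x])) for x in (0, 1)]
                z = sum(new)
                mu[(a, tgt)] = [m / z for m in new]
        for a in factors:                                   # equation 3.6
            for i in a:
                p = [math.prod(mu[(b, i)][x] for b in factors
                               if i in b and b != a) for x in (0, 1)]
                z = sum(p)
                nu[(i, a)] = [v / z for v in p]
    beliefs = {}
    for i in range(n_nodes):                                # equation 3.7
        b = [math.prod(mu[(a, i)][x] for a in factors if i in a) for x in (0, 1)]
        z = sum(b)
        beliefs[i] = [v / z for v in b]
    return beliefs

# Two binary nodes, one symmetric attractive potential: beliefs stay uniform.
Psi = {(0, 1): [[2.0, 1.0], [1.0, 2.0]]}
print(loopy_bp(Psi, 2))  # {0: [0.5, 0.5], 1: [0.5, 0.5]}
```

On this symmetric example the damped iteration fixes the uniform beliefs immediately; the damping only matters on graphs where the undamped updates oscillate.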
Summarizing, loopy belief propagation is equivalent to fixed-point iteration, where the fixed points are the zero derivatives of the Lagrangian.

4 Convexity of the Bethe Free Energy

4.1 Rewriting the Bethe Free Energy. Minimization of the Bethe free energy, equation 3.2, under the constraints of equation 3.3 is equivalent to solving a minimax problem on the Lagrangian, equation 3.4, namely,

minQα,Qβ maxλαβ,λα,λβ L(Qα, Qβ, λαβ, λα, λβ).
The ordering of the min and max operations is important here: to enforce the constraints, we first have to take the maximum. The min and max operations can be interchanged if we have a convex constrained minimization problem (Luenberger, 1984). That is, the function to be minimized must be convex in its parameters, the equality constraints have to be linear, and the inequality constraints convex. In our case, the equality constraints are indeed linear, and the inequality constraints enforcing nonnegativity of the pseudomarginals are indeed convex. However, the Bethe free energy, equation 3.2, is clearly nonconvex in its parameters {Qα, Qβ}. This is what makes it a difficult optimization problem. Luckily the description in equation 3.2 is not unique: any other form that can be constructed by substituting the constraints of equation 3.3 is equally valid. Following Pakzad and Anantharam (2002), we call the Bethe free energy "convex over the set of constraints" if, by making use of the constraints of equation 3.3, we can rewrite it in a form that is convex in {Qα, Qβ}.

4.2 Conditions for Convexity. The problem is with the term

Sβ(Qβ) ≡ − Σxβ Qβ(xβ) log Qβ(xβ),

which is concave in Qβ. Using the constraint Qβ(xβ) = Qα(xβ), we can turn it into a functional that is convex in Qα and Qβ separately, but not necessarily jointly. That is, with the substitution Qβ(xβ) = Qα(xβ) for any α ⊃ β, the entropy, and thus the Bethe free energy, is convex in Qα and in Qβ, but not necessarily in {Qα, Qβ}. However, if we add to Sβ(Qβ) a convex entropy contribution,

−Sα(Qα) ≡ ΣXα Qα(Xα) log Qα(Xα),
Uniqueness of Loopy Belief Propagation Fixed Points
2387
the combination of −Sα and Sβ is convex in {Qα, Qβ}, as the following lemma, needed in the proof of theorem 1 below, shows.

Lemma 1. The functional

Φαβ(Qα, Qβ) ≡ ΣXα Qα(Xα) log Qα(Xα) − Σxβ Qα(xβ) log Qβ(xβ)

is convex in {Qα, Qβ}.

Proof. The matrix with second derivatives of Φαβ has the components
H(Xα, X′α) ≡ ∂²Φαβ / ∂Qα(Xα)∂Qα(X′α) = (1/Qα(Xα)) δXα,X′α

H(Xα, x′β) ≡ ∂²Φαβ / ∂Qα(Xα)∂Qβ(x′β) = −(1/Qβ(x′β)) δxβ,x′β

H(xβ, x′β) ≡ ∂²Φαβ / ∂Qβ(xβ)∂Qβ(x′β) = (Qα(xβ)/Qβ(xβ)²) δxβ,x′β,

where we note that Xα and xβ should be interpreted as indices, and xβ in the second line is the part of Xα corresponding to node β. Convexity requires that for any "vector" (Rα(Xα) Rβ(xβ)),

0 ≤ (Rα(Xα) Rβ(xβ)) ( H(Xα, X′α)  H(Xα, x′β) ; H(xβ, X′α)  H(xβ, x′β) ) (Rα(Xα) ; Rβ(xβ))
 = ΣXα Rα(Xα)²/Qα(Xα) − 2 ΣXα Rα(Xα)Rβ(xβ)/Qβ(xβ) + Σxβ Qα(xβ)Rβ(xβ)²/Qβ(xβ)²
 = ΣXα Qα(Xα) [ Rα(Xα)/Qα(Xα) − Rβ(xβ)/Qβ(xβ) ]²,

which is indeed nonnegative.
The idea is that the Bethe free energy is convex over the set of constraints if we have sufficient convex resources Qα log Qα to compensate for the concave −Qβ log Qβ terms. This can be formalized in the following theorem.

Theorem 1. The Bethe free energy is convex over the set of consistency constraints if there exists an allocation matrix Aαβ between potentials α and nodes β satisfying

1. Aαβ ≥ 0 ∀α,β⊂α   (positivity)
2. Σβ⊂α Aαβ ≤ 1 ∀α   (sufficient amount of resources)
3. Σα⊃β Aαβ ≥ nβ − 1 ∀β   (sufficient compensation).   (4.1)
Proof. First, we note that we do not have to worry about the energy terms that are linear in Qα. In other words, to prove the theorem, we can restrict ourselves to proving that minus the entropy,

−S(Q) = − Σα Sα(Qα) + Σβ (nβ − 1) Sβ(Qβ),

is convex over the set of consistency constraints. The resulting operation is now a matter of resource allocation. For each concave contribution (nβ − 1)Sβ, we have to find convex contributions −Sα to compensate for it. Let Aαβ denote the "amount of resources" that we take from potential subset α to compensate for node β. Now, in shorthand notation and with a little bit of rewriting,

−S(Q) = − Σα [ (1 − Σβ⊂α Aαβ) + Σβ⊂α Aαβ ] Sα + Σβ [ (nβ − 1) − Σα⊃β Aαβ + Σα⊃β Aαβ ] Sβ
 = − Σα (1 − Σβ⊂α Aαβ) Sα − Σα Σβ⊂α Aαβ [Sα − Sβ] − Σβ ( Σα⊃β Aαβ − (nβ − 1) ) Sβ.

Convexity of the first term is guaranteed if 1 − Σβ Aαβ ≥ 0 (condition 2), of the second term if Aαβ ≥ 0 (condition 1 and lemma 1), and of the third term if Σα Aαβ − (nβ − 1) ≥ 0 (condition 3).

This theorem is a special case of the one in Heskes et al. (2003) for the more general Kikuchi free energy. Either one of the inequality signs in conditions 2 and 3 of equation 4.1 can be replaced by an equality sign without any consequences.

4.3 Some Implications

Corollary 1. The Bethe free energy for singly connected graphs is convex over the set of constraints.
Proof. The proof is by construction. Choose one of the leaf nodes as the root β∗ and define

Aαβ = 1 iff β ⊂ α and β is closer to the root β∗ than any other β′ ⊂ α,
Aαβ = 0 for all other β.

Obviously, this choice of A satisfies conditions 1 and 2 of equation 4.1. Arguing the other way around, for each β ≠ β∗, there is just a single potential α ⊃ β that is closer to the root β∗ than β itself (see the illustration in Figure 2), and thus there are precisely nβ − 1 contributions Aαβ = 1. The root itself gets nβ∗ contributions Aαβ∗ = 1, which is even better. Hence, condition 3 is also satisfied:

Σα⊃β Aαβ = nβ − 1 ∀β≠β∗ and Σα⊃β∗ Aαβ∗ = nβ∗ > nβ∗ − 1.
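The construction in this proof translates directly into a few lines of code. The sketch below (hypothetical helpers; pairwise potentials on a chain for simplicity) builds the root-directed allocation and verifies the three conditions of equation 4.1:

```python
# Sketch of the corollary 1 construction (hypothetical helpers, pairwise
# potentials): point each potential at its node closest to the root, then
# check the three conditions of equation 4.1.

def tree_allocation(potentials, parent):
    """parent[i] is i's neighbor toward the root; the root has no entry."""
    A = {}
    for a in potentials:
        # the node of a closest to the root is the one whose parent lies outside a
        closest = next(i for i in a if parent.get(i) not in a)
        for i in a:
            A[(a, i)] = 1.0 if i == closest else 0.0
    return A

def check_conditions(A, potentials, nodes):
    n = {i: sum(1 for a in potentials if i in a) for i in nodes}
    ok1 = all(v >= 0.0 for v in A.values())                           # positivity
    ok2 = all(sum(A[(a, i)] for i in a) <= 1.0 for a in potentials)   # resources
    ok3 = all(sum(A[(a, i)] for a in potentials if i in a) >= n[i] - 1
              for i in nodes)                                         # compensation
    return ok1 and ok2 and ok3

# Chain 0 - 1 - 2 rooted at node 0.
pots = [(0, 1), (1, 2)]
A = tree_allocation(pots, parent={1: 0, 2: 1})
print(check_conditions(A, pots, nodes=[0, 1, 2]))  # True
```

Each potential ends up with exactly one unit of resources pointing rootward, matching the "eating up resources toward the root" picture discussed next.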
With the above construction of A, we are in a sense "eating up resources toward the root." At the root, we have one piece of resources left, which suggests that we can still enlarge the set of graphs for which convexity can be shown using theorem 1. This leads to the next corollary.

Corollary 2. The Bethe free energy for graphs with a single loop is convex over the set of constraints.

Proof. Again the proof is by construction. Break the loop at one particular place, that is, remove one node β∗ from a potential α∗ such that a singly connected structure is left. Construct a matrix A as in the proof of corollary 1, taking the node β∗ as the root. The matrix A constructed in this way also works for the graph with the closed loop since still

Σα⊃β Aαβ = nβ − 1 ∀β≠β∗ and now Σα⊃β∗ Aαβ∗ = nβ∗ − 1.
It can be seen that this construction starts to fail as soon as we have two loops that are connected: with two connected loops, we have insufficient positive resources to compensate for the negative entropy contributions. 4.4 Connection with Other Work. The same corollaries can be found in Pakzad and Anantharam (2002) and McEliece and Yildirim (2003). Furthermore, the conditions in theorem 1 are very similar to the ones stated in Pakzad and Anantharam (2002), which for the Bethe free energy boil down to the following.
Figure 2: The construction of an allocation matrix satisfying all convexity constraints for singly connected (a) and single-loop structures (b). Neglecting the arrows and dashes, each graph corresponds to a factor graph (Kschischang, Frey, & Loeliger, 2001), where dark boxes refer to potentials and circles to nodes. The numbers within the circles give the corresponding "overcounting numbers," which for the Bethe free energy equal 1 − nβ, with nβ the number of neighboring potentials. The arrows, pointing from potentials α to nodes β, visualize the allocation matrix A, with Aαβ = 1 if there is an arrow and Aαβ = 0 otherwise. As can be seen, for each potential there is precisely one outgoing arrow, pointing at the node closest to the root, chosen to be the node in the upper right corner of the graph. In the singly connected structure (a), all nonroot nodes have precisely nβ − 1 incoming arrows, just sufficient to compensate the overcounting number 1 − nβ. The root node itself has one incoming arrow, which it does not really need. In the structure with the single loop (b), we open the loop by breaking the dashed link and construct the allocation matrix for the corresponding singly connected structure. This allocation matrix works for the single-loop structure as well, because now the incoming arrow at the "root" is just sufficient to compensate for the negative overcounting number resulting from the extra link closing the loop.
Theorem 2 (adapted from theorem 1 in Pakzad & Anantharam, 2002). The Bethe free energy is convex over the set of constraints if for any set of nodes B we have

Σβ∈B (1 − nβ) + Σα∈π(B) 1 ≥ 0,   (4.2)

where π(B) ≡ {α : ∃β ∈ B, β ⊂ α} denotes the "parent" set of B, that is, those potential subsets that include at least one node in B.

Proposition 1. The conditions in theorems 1 and 2 are equivalent.
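Condition 4.2 lends itself to a brute-force check over node subsets. A minimal sketch (hypothetical helper; pairwise potentials only) that also shows a single loop passing and two connected loops failing:

```python
from itertools import combinations

# Brute-force sketch of condition 4.2 (hypothetical helper): for every nonempty
# node set B, check sum_{beta in B} (1 - n_beta) + |pi(B)| >= 0, where pi(B) is
# the set of potentials touching B.

def bethe_convex_over_constraints(potentials, nodes):
    n = {i: sum(1 for a in potentials if i in a) for i in nodes}
    for k in range(1, len(nodes) + 1):
        for B in combinations(nodes, k):
            parents = {a for a in potentials if any(i in a for i in B)}
            if sum(1 - n[i] for i in B) + len(parents) < 0:
                return False
    return True

# A single loop passes (corollary 2); two loops sharing edge (1, 2) fail.
loop = [(0, 1), (1, 2), (2, 0)]
two_loops = loop + [(1, 3), (2, 3)]
print(bethe_convex_over_constraints(loop, [0, 1, 2]))          # True
print(bethe_convex_over_constraints(two_loops, [0, 1, 2, 3]))  # False
```

For the two-loop graph, taking B to be all four nodes gives Σ(1 − nβ) = −6 against only five parent potentials, violating equation 4.2, exactly the failure mode described for connected loops.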
Proof. Let us first suppose that there does exist an allocation matrix Aαβ satisfying the conditions of equation 4.1. Then for any set B,

Σβ∈B (nβ − 1) ≤ Σβ∈B Σα⊃β Aαβ ≤ Σα∈π(B) Σβ⊂α Aαβ ≤ Σα∈π(B) 1,
where the inequalities follow from conditions 3, 1, and 2 in equation 4.1, respectively. In other words, validity of the conditions in theorem 1 implies the validity of those in theorem 2. Next let us suppose that the conditions in Theorem 1 fail. Above, we have seen that this can happen if and only if the graph contains at least one connected component with two connected loops. But then condition 4.2 is violated as well when we take for B the set of all nodes within such a component. Since validity implies validity and violation implies violation, the conditions must be equivalent. Graphical models with a single loop have been studied in detail in Weiss (2000), yielding important theoretical results (e.g., correctness of maximum a posteriori assignments). These results are derived by “unwrapping” the single loop into an infinite tree. This argument also breaks down as soon as there is more than a single loop. It might be interesting to find out whether there is a deeper connection between this unwrapping argument and the convexity of the Bethe free energy. In summary, we have given conditions for the Bethe free energy to have a unique extremum satisfying the constraints. From the connection between the extrema of the Bethe free energy and fixed points of loopy belief propagation, it then follows that loopy belief propagation has a unique fixed point when these conditions are satisfied. These conditions fail as soon as the structure of the graph contains two connected loops. The conditions for convexity of the Bethe free energy depend on the structure of the graph; the potentials α (Xα ) do not play any role. These potentials appear only in the energy term that is linear in the pseudomarginals
and thus does not affect the convexity argument. Consequently, adding a "fake link" with potential Ψα(Xα) = 1 can change the validity of the conditions, whereas it has no effect on the loopy belief propagation updates. Even if we managed to find more interesting (i.e., milder) conditions for convexity over the set of constraints,³ the impact of fake links would never disappear. In the following, we will therefore dig a little deeper to arrive at milder conditions that do take into account (the strength of) the potentials.

5 The Dual Formulation

5.1 From Lagrangian to Dual. As we have seen, fixed points of loopy belief propagation are in a one-to-one correspondence with zero derivatives of the Lagrangian. If we manage to find conditions under which these zero derivatives have a unique solution, then for the same conditions, loopy belief propagation has a unique fixed point. In the following, we will work with a Lagrangian slightly different from equation 3.4. First, we substitute the constraint Qα(xβ) = Qβ(xβ) to write the Bethe free energy in the "more convex" form,

F(Qα, Qβ) = − Σα ΣXα Qα(Xα)ψα(Xα) + Σα ΣXα Qα(Xα) log Qα(Xα) − Σβ Σα⊃β Aαβ Σxβ Qα(xβ) log Qβ(xβ),   (5.1)

where the allocation matrix Aαβ can be any matrix that satisfies

Σα⊃β Aαβ = nβ − 1.   (5.2)
And second, we express the consistency constraints from equation 3.3 in terms of the potential pseudomarginals Qα alone. This then yields

L(Qα, Qβ, λαβ, λα) = − Σα ΣXα Qα(Xα)ψα(Xα) + Σα ΣXα Qα(Xα) log Qα(Xα)
 − Σα Σβ⊂α Aαβ Σxβ Qα(xβ) log Qβ(xβ)
 + Σβ Σα⊃β Σxβ λαβ(xβ) [ (1/(nβ − 1)) Σα′⊃β Aα′β Qα′(xβ) − Qα(xβ) ]
 + Σα λα [ 1 − ΣXα Qα(Xα) ] + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ].   (5.3)

³ We would like to conjecture that this is not possible, that is, that the conditions in theorem 1 are not only sufficient but also necessary to prove convexity of the Bethe free energy over the set of consistency constraints. Note that this would not imply that we need these conditions to guarantee the uniqueness of fixed points, since for that convexity by itself is sufficient, not necessary.
Note that the constraint Qβ(xβ) = Qα(xβ) as well as its normalization is no longer incorporated with Lagrange multipliers, but follows when we take the minimum with respect to Qβ. It is easy to check that the fixed-point equations of loopy belief propagation still follow by setting the derivatives of the Lagrangian, equation 5.3, to zero. Although the Bethe free energy and thus the Lagrangian, equation 5.3, may not be convex in {Qα, Qβ}, they are convex in Qα and Qβ separately. Therefore, we can interchange the minimum over the pseudomarginals Qα and the maximum over the Lagrange multipliers, as long as we leave the minimum over Qβ as the final operation:⁴

minQα,Qβ maxλαβ,λα L(Qα, Qβ, λαβ, λα) = minQβ maxλαβ,λα minQα L(Qα, Qβ, λαβ, λα).
Rewriting

Σβ Σα⊃β Σxβ λαβ(xβ) [ (1/(nβ − 1)) Σα′⊃β Aα′β Qα′(xβ) − Qα(xβ) ] = − Σα Σβ⊂α Σxβ λ̄αβ(xβ) Qα(xβ),

with

λ̄αβ(xβ) ≡ λαβ(xβ) − (Aαβ/(nβ − 1)) Σα′⊃β λα′β(xβ),

we can easily solve for the minimum with respect to Qα:

Q∗α(Xα) = Ψα(Xα) exp[ λα − 1 + Σβ⊂α ( Aαβ log Qβ(xβ) + λ̄αβ(xβ) ) ].   (5.4)
⁴ In principle, we could also first take the minimum over Qβ and leave the minimum over Qα for last, but this does not seem to lead to any useful results.
Plugging this into the Lagrangian, we obtain the "dual,"

G(Qβ, λαβ, λα) ≡ L(Q∗α, Qβ, λαβ, λα)
 = − Σα ΣXα Ψα(Xα) exp[ λα − 1 + Σβ⊂α ( Aαβ log Qβ(xβ) + λ̄αβ(xβ) ) ]
 + Σα λα + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ].   (5.5)
Next, we find for the maximum with respect to λα,

exp[1 − λ∗α] = ΣXα Ψα(Xα) exp[ Σβ⊂α ( Aαβ log Qβ(xβ) + λ̄αβ(xβ) ) ] ≡ Z∗α,   (5.6)

where we have to keep in mind that Z∗α by itself, like Q∗α, is a function of the remaining pseudomarginals Qβ and Lagrange multipliers λαβ. Substituting this solution into the dual, we arrive at

G(Qβ, λαβ) ≡ G(Qβ, λαβ, λ∗α) = − Σα log Z∗α + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ].   (5.7)
Let us pause here for a moment and reflect on what we have done so far. The Lagrangian, equation 5.3, being convex in Qα, has a unique minimum in Qα (given all other parameters fixed), which is also its only extremum. It happens to be relatively straightforward to express the value at this minimum in terms of the remaining parameters and then also to find the optimal (maximal) λ∗α. Plugging these values into the Lagrangian, equation 5.3, we have not lost anything. That is, zero derivatives of the Lagrangian are still in one-to-one correspondence with zero derivatives of the dual, equation 5.7, and thus with fixed points of loopy belief propagation.

5.2 Recovering the Convexity Conditions (1). To find a minimum of the Bethe free energy satisfying the constraints in equation 3.3, we first have to take the maximum of the dual, equation 5.7, over the remaining Lagrange multipliers λαβ and then the minimum over the remaining pseudomarginals Qβ. The duality theorem, a standard result from constrained optimization (see Luenberger, 1984), tells us that the dual G is concave in the Lagrange multipliers. The remaining question is then whether the dual is convex in
Qβ. If it is, we have a convex-concave minimax problem, which is guaranteed to have a unique solution. Link c in Figure 1 follows from the following proposition.

Proposition 2. Convexity of the Bethe free energy, equation 5.1, in {Qα, Qβ} implies convexity of the dual, equation 5.7, in Qβ.

Proof. First, we note that the minimum of a convex function over some of its parameters is convex in its remaining parameters. In obvious one-dimensional notation, with y∗(x) ≡ argminy f(x, y),

f(x + δ, y∗(x + δ)) + f(x − δ, y∗(x − δ)) ≥ 2 f(x, [y∗(x + δ) + y∗(x − δ)]/2) ≥ 2 f(x, y∗(x)),

where the first inequality follows from the convexity of f in {x, y} and the second inequality from y∗(x) being the unique minimum of f(x, y). Therefore, the dual, equation 5.5, is convex in Qβ when the Lagrangian, equation 5.3, and thus the Bethe free energy, equation 5.1, is convex in {Qα, Qβ}. Furthermore, from the duality theorem, the dual, equation 5.5, is concave in the Lagrange multipliers {λαβ, λα}. Next, we note that the maximum of a convex or concave function over its maximizing parameters is again convex: with y∗(x) ≡ argmaxy f(x, y),

f(x + δ, y∗(x + δ)) + f(x − δ, y∗(x − δ)) ≥ f(x + δ, y∗(x)) + f(x − δ, y∗(x)) ≥ 2 f(x, y∗(x)),

where the first inequality follows from y∗(x ± δ) being the unique maximum of f(x ± δ, y) and the second inequality from the convexity of f(x, y) in x. Hence, the dual, equation 5.7, must still be convex in Qβ.

For now, we did not gain or lose anything in comparison with the conditions for theorem 1. However, the inequalities in the above proof suggest a little space that will lead to milder conditions for the uniqueness of fixed points.

5.3 Boundedness of the Bethe Free Energy. For completeness and to support link f in Figure 1, we will here prove that the Bethe free energy is bounded from below. The following theorem can be considered a special case of the one stated in Minka (2001) on the Bethe free energy for expectation propagation, a generalization of (loopy) belief propagation.

Theorem 3. If all potentials are bounded from above, that is, Ψα(Xα) ≤ Ψmax for all α and Xα, the Bethe free energy is bounded from below on the set of constraints.
Proof. It is sufficient to prove that the function G(Qβ) ≡ maxλαβ G(Qβ, λαβ) is bounded from below for a particular choice of Aαβ satisfying equation 5.2. Considering Aαβ = (nβ − 1)/nβ, we then have

G(Qβ) ≥ − Σα log[ ΣXα Ψα(Xα) exp( Σβ⊂α ((nβ − 1)/nβ) log Qβ(xβ) ) ] + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ]
 ≥ − Σα Σβ⊂α ((nβ − 1)/nβ) log[ ΣXα Ψα(Xα) Qβ(xβ) ] + Σβ (nβ − 1) [ Σxβ Qβ(xβ) − 1 ]
 ≥ − Σα Σβ⊂α ((nβ − 1)/nβ) log[ Ψmax ΣXα\β 1 ] + Σβ (nβ − 1) [ − log Σxβ Qβ(xβ) + Σxβ Qβ(xβ) − 1 ]
 ≥ − Σα Σβ⊂α ((nβ − 1)/nβ) log[ Ψmax ΣXα\β 1 ],

where the first inequality follows by substituting the choice λαβ(xβ) = 0 for all α, β, and xβ in G(Qβ, λαβ), the second from the concavity of the function y^((nβ−1)/nβ), and the third from the upper bound on the potentials; the last step drops the terms (nβ − 1)[q − 1 − log q], with q ≡ Σxβ Qβ(xβ), which are nonnegative.
6 Toward Better Conditions

6.1 The Hessian. The next step is to compute the Hessian, the second derivative of the dual with respect to the pseudomarginals Qβ. The first derivative yields

∂G/∂Qβ(xβ) = − Σα⊃β Aαβ Q∗α(xβ)/Qβ(xβ) + (nβ − 1),

which is immediate from the Lagrangian, equation 5.3. To compute the matrix of second derivatives

Hββ′(xβ, xβ′) ≡ ∂²G / ∂Qβ(xβ)∂Qβ′(xβ′),

we make use of

∂Q∗α(xβ)/∂Qβ′(xβ′) = Aαβ′ [ Q∗α(xβ, xβ′) − Q∗α(xβ)Q∗α(xβ′) ] / Qβ′(xβ′),

where both β and β′ should be subsets of α and with the convention, for β′ = β, Q∗α(xβ, xβ′) = Q∗α(xβ) if xβ′ = xβ and Q∗α(xβ, xβ′) = 0 if xβ′ ≠ xβ. Here, the first term follows from the differentiation of equation 5.4 and the second term from the normalization as in equation 5.6. Distinguishing between β′ = β and β′ ≠ β, we then have

Hββ(xβ, xβ′) = Σα⊃β Aαβ(1 − Aαβ) (Q∗α(xβ)/Qβ(xβ)²) δxβ,xβ′ + Σα⊃β A²αβ Q∗α(xβ)Q∗α(xβ′) / (Qβ(xβ)Qβ(xβ′))

Hββ′(xβ, xβ′) = − Σα⊃{β,β′} Aαβ Aαβ′ [ Q∗α(xβ, xβ′) − Q∗α(xβ)Q∗α(xβ′) ] / (Qβ(xβ)Qβ′(xβ′))

for β′ ≠ β, where δxβ,xβ′ = 1 if and only if xβ′ = xβ. Here, it should be noted that both β and xβ play the role of indices; that is, xβ should not be mistaken for a variable or parameter. The parameters are still the (tables with) Lagrange multipliers λαβ and pseudomarginals Qβ. The goal is now to find conditions under which this Hessian is positive (semi)definite for any setting of the parameters {Qβ, λαβ}, that is, conditions that guarantee

K ≡ Σβ,β′ Σxβ,xβ′ Sβ(xβ) Hββ′(xβ, xβ′) Sβ′(xβ′) ≥ 0,

for any choice of the "vector" S with elements Sβ(xβ). Straightforward manipulations yield

K = Σα Σβ⊂α Σxβ Aαβ(1 − Aαβ) Q∗α(xβ) Rβ(xβ)²   (K1)
 + Σα Σ{β,β′}⊂α Σxβ,xβ′ Aαβ Aαβ′ Q∗α(xβ)Q∗α(xβ′) Rβ(xβ)Rβ′(xβ′)   (K2)
 − Σα Σ{β,β′}⊂α,β′≠β Σxβ,xβ′ Aαβ Aαβ′ Q∗α(xβ, xβ′) Rβ(xβ)Rβ′(xβ′),   (K3)

where Rβ(xβ) ≡ Sβ(xβ)/Qβ(xβ).
6.2 Recovering the Convexity Conditions (2). Let us first see how we get back the conditions for convexity of the Bethe free energy, equation 5.1. Since

K2 = Σα [ Σβ⊂α Σxβ Aαβ Q∗α(xβ) Rβ(xβ) ]² ≥ 0

and⁵

K3 = Σα Σ{β,β′}⊂α,β′≠β Σxβ,xβ′ Aαβ Aαβ′ Q∗α(xβ, xβ′) × [ ½(Rβ(xβ) − Rβ′(xβ′))² − ½Rβ(xβ)² − ½Rβ′(xβ′)² ]
 ≥ − Σα Σβ⊂α Σxβ Aαβ ( Σβ′⊂α Aαβ′ − Aαβ ) Q∗α(xβ) Rβ(xβ)²,   (6.1)

we have

K = K1 + K2 + K3 ≥ Σα Σβ⊂α Σxβ Aαβ ( 1 − Σβ′⊂α Aαβ′ ) Q∗α(xβ) Rβ(xβ)².

That is, sufficient conditions for K to be nonnegative are

Aαβ ≥ 0 ∀α,β⊂α and Σβ⊂α Aαβ ≤ 1 ∀α,

precisely the conditions for theorem 1.

6.3 Fake Interactions. While discussing the conditions for convexity of the Bethe free energy, we noticed that adding a "fake interaction," such as a constant potential, can change the validity of the conditions. We will see that here this is not the case and these fake interactions drop out as we would expect them to. Suppose that we have a fake interaction Ψα(Xα) = 1. From the solution, equation 5.4, it follows that the pseudomarginal Q∗α(Xα) factorizes:⁶

Q∗α(xβ, xβ′) = Q∗α(xβ) Q∗α(xβ′) ∀{β,β′}⊂α.

⁵ This step is in fact equivalent to the Gerschgorin theorem for bounding the eigenvalues of a matrix.
⁶ The exact marginal Pexact(Xα) need not factorize. This is really a consequence of the locality assumptions behind loopy belief propagation and the Bethe free energy.
Consequently, the terms involving α in K3 cancel against those in K2, which is most easily seen when we combine K2 and K3 in a different way:

K2 + K3 = Σα Σβ⊂α Σxβ,xβ′ A²αβ Q∗α(xβ)Q∗α(xβ′) Rβ(xβ)Rβ(xβ′)   (K̃2)
 − Σα Σ{β,β′}⊂α,β′≠β Σxβ,xβ′ Aαβ Aαβ′ [ Q∗α(xβ, xβ′) − Q∗α(xβ)Q∗α(xβ′) ] Rβ(xβ)Rβ′(xβ′).   (K̃3)

This leaves us with the weaker requirement (from K1) Aαβ(1 − Aαβ) ≥ 0 for all β ⊂ α. The best choice is then to take Aαβ = 1, which turns condition 3 of equation 4.1 into

Σα′⊃β,α′≠α Aα′β + 1 ≥ nβ − 1.
The net effect is equivalent to ignoring the interaction, reducing the number of neighboring potentials nβ by 1 for all β that are part of the fake interaction α. We have seen how we get milder and thus better conditions when there is effectively no interaction. Motivated by this “success,” we will work toward conditions that take into account the strength of the interactions. Our starting point will be the above decomposition in K˜ 2 and K˜ 3 where, since K˜ 2 ≥ 0, we will concentrate on K˜ 3 . 7 The Strength of a Potential 7.1 Bounding the Correlations. The crucial observation, which will allow us to obtain milder and thus better conditions for the uniqueness of a fixed point, is the following lemma. It bounds the term between brackets in K˜ 3 such that we can again combine this bound with the (positive) term K1 . However, before we get to that, we take some time to introduce and derive properties of the “strength” of a potential. Lemma 2.
Two-node correlations of loopy belief marginals obey the bound

Q∗α(xβ, xβ′) − Q∗α(xβ)Q∗α(xβ′) ≤ σα Q∗α(xβ, xβ′) ∀{β,β′}⊂α,β′≠β ∀xβ,xβ′,   (7.1)

with the "strength" σα a function of the potential ψα(Xα) ≡ log Ψα(Xα) only:

σα = 1 − exp(−ωα) with ωα ≡ maxXα,X̂α [ ψα(Xα) + (nα − 1)ψα(X̂α) − Σβ⊂α ψα(X̂α\β, xβ) ],   (7.2)

where nα ≡ Σβ⊂α 1 is the number of nodes in α.
Proof. For convenience and without loss of generality, we omit α from our notation and renumber the nodes contained in α from 1 to n. We consider the quotient of the loopy belief on the potential subset divided by the product of its single-node marginals:

Q∗(X) / Πβ=1..n Q∗(xβ) = Ψ(X) [ ΣX′ Ψ(X′) Πβ µβ(x′β) ]^(n−1) Πβ µβ(xβ) / Πβ=1..n [ ΣX\β Ψ(X\β, xβ) µβ(xβ) Πβ′≠β µβ′(xβ′) ]
 = Ψ(X) [ ΣX′ Ψ(X′) Πβ µβ(x′β) ]^(n−1) / Πβ=1..n [ ΣX\β Ψ(X\β, xβ) Πβ′≠β µβ′(xβ′) ],   (7.3)

where we substituted the properly normalized version of equation 3.5: a loopy belief pseudomarginal is proportional to the potential times incoming messages. The goal is now to find the maximum of the above expression over all possible messages and all values of X. Especially the maximum over messages µ seems to be difficult to compute, but the following intermediate lemma helps us out.

Lemma 3. The maximum of the function

V(µ) = (n − 1) log[ ΣX Ψ(X) Πβ=1..n µβ(xβ) ] − Σβ=1..n log[ ΣX\β Ψ(X\β, x∗β) Πβ′≠β µβ′(xβ′) ],

with respect to the messages µ under the constraints Σxβ µβ(xβ) = 1 for all β and µβ(xβ) ≥ 0 for all β and xβ, occurs at an extreme point µβ(xβ) = δxβ,x̂β for some x̂β to be found.
with respect to the messages µ under constraints xβ µβ (xβ ) = 1 for all β and µβ (xβ ) ≥ 0 for all β and xβ , occurs at an extreme point µβ (xβ ) = δxβ ,xˆ β for some xˆ β to be found. Proof. Let us consider optimizing the message µ1 (x1 ) with fixed messages µβ (xβ ) for β > 1. The first and second derivatives are easily found to obey ∂V Q(x1 |x∗β ) = (n − 1)Q(x1 ) − ∂µ1 (x1 ) β=1 ∂ 2V Q(x1 |x∗β )Q(x1 |x∗β ), = (n − 1)Q(x1 )Q(x1 ) − ∂µ1 (x1 )∂µ1 (x1 ) β=1
Uniqueness of Loopy Belief Propagation Fixed Points
2401
where (X)
β µβ (xβ ) . X (X ) β µβ (xβ )
Q(X) ≡
Now suppose that V has a regular extremum (maximum or minimum) not at an extreme point, that is, µ1 (x1 ) > 0 for two or more values of x1 . At such an extremum, the first derivative should obey (n − 1)Q(x1 ) −
β=1
Q(x1 |x∗β ) = λ,
with λ a Lagrange multiplier implementing the constraint x1 µ1 (x1 ) = 1. Summing over x1 , we obtain λ = 0 (in fact, V is indifferent to any multiplicative scaling of µ). For the matrix with second derivatives at such an extremum, we then have ∂ 2V Q(x1 |x∗β )Q(x1 |x∗β ), = ∂µ1 (x1 )∂µ1 (x1 ) β=1 β =1 β =β
which is positive semidefinite: the extremum cannot be a maximum. Consequently, any maximum must be at the boundary of the domain. Since this holds for any choice of µβ (xβ ), β > 1, it follows by induction that the maximum with respect to all µβ (xβ ) must be at an extreme point as well. The function V(µ) is, up to a term independent of µ, the logarithm of equation 7.3. So the intermediate lemma 3 tells us that we can replace the ˆ maximization over messages µ by maximization over values X: n−1 ˆ (X) (X) Q∗ (X) max ∗ . = max µ Q (xβ ) (Xˆ \β , xβ ) Xˆ β
β
Next, we take the maximum over X as well and define the "strength" σ to be used in equation 7.1 through

1/(1 − σ) ≡ maxX,µ [ Q∗(X) / Πβ Q∗(xβ) ] = maxX,X̂ [ Ψ(X) Ψ(X̂)^(n−1) / Πβ Ψ(X̂\β, xβ) ].   (7.4)
The inequality 7.1 then follows by summing out X\{β,β′} in

Q∗(X) − Πβ Q∗(xβ) ≤ σ Q∗(X).

The form of equation 7.2 then follows by rewriting equation 7.4 as

ω ≡ − log(1 − σ) = maxX,X̂ W(X; X̂) with W(X; X̂) = ψ(X) + (n − 1)ψ(X̂) − Σβ ψ(X̂\β, xβ),
where we recall that ψ(X) ≡ log Ψ(X).

7.2 Some Properties. In the following, we will refer to both ω and σ as the strength of the potential. There are several properties worth noting:

• The strength of a potential is indifferent to multiplication with any term that factorizes over the nodes; that is, if Ψ(X) = Ψ̃(X) Πβ µβ(xβ), then ω(Ψ̃) = ω(Ψ) for any choice of µ. This property relates to the arbitrariness in the definition of equation 3.1: if two potentials overlap, then multiplying one potential with a term that depends only on the overlap and dividing the other by the same term does not change the distribution. Luckily, it also does not change the strength of those potentials.

• To compute the strength, we can enumerate all possible combinations. However, we can neglect all combinations X and X̂ that differ in fewer than two nodes. To see this, consider

W(x1, x2, x\1\2; x̂1, x̂2, x\1\2) = ψ(x1, x2, x\1\2) + ψ(x̂1, x̂2, x\1\2) − ψ(x̂1, x2, x\1\2) − ψ(x1, x̂2, x\1\2) = −W(x1, x̂2, x\1\2; x̂1, x2, x\1\2).

If now also x̂2 = x2, we get W(x1, x\1; x̂1, x\1) = −W(x1, x\1; x̂1, x\1) = 0. Furthermore, if W(x1, x2, x\1\2; x̂1, x̂2, x\1\2) ≤ 0, then it must be that W(x1, x̂2, x\1\2; x̂1, x2, x\1\2) ≥ 0 and vice versa. So ω, the maximum over all combinations, must be nonnegative, and we can indeed neglect all combinations that by definition yield zero.

• Thus, for finite potentials, 0 ≤ ω < ∞ and 0 ≤ σ < 1.
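The enumeration just mentioned is straightforward to implement. A minimal sketch (hypothetical helper; binary nodes) computes ω and σ by brute force over all pairs (X, X̂), illustrated on a pairwise Boltzmann factor, where ω reduces to the absolute weight |w|:

```python
import math
from itertools import product

# Sketch (hypothetical helper): compute the strength of equation 7.2 by brute
# force; psi is the log potential as a nested list over binary nodes.

def strength(psi, n_nodes):
    states = list(product(range(2), repeat=n_nodes))

    def val(X):
        t = psi
        for x in X:
            t = t[x]
        return t

    omega = 0.0
    for X, Xh in product(states, repeat=2):
        w = val(X) + (n_nodes - 1) * val(Xh)
        for b in range(n_nodes):        # psi(Xh with node b replaced by X's value)
            w -= val(Xh[:b] + (X[b],) + Xh[b + 1:])
        omega = max(omega, w)
    return omega, 1.0 - math.exp(-omega)

# Pairwise Boltzmann factor psi = w*x1*x2 + th1*x1 + th2*x2: the threshold
# terms cancel in W, and omega comes out as |w|.
w, th1, th2 = 1.5, 0.7, -0.3
psi = [[w * x1 * x2 + th1 * x1 + th2 * x2 for x2 in (0, 1)] for x1 in (0, 1)]
omega, sigma = strength(psi, 2)
print(round(omega, 9))  # 1.5
```

The cost is exponential in the number of nodes per potential, which is harmless here since potentials are local; the symmetries in the bullet above can cut the count further.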
• With pairwise potentials, the above symmetries can be used to reduce the number of evaluations to |x1 ||x2 |(|x1 | − 1)(|x2 | − 1)/4 combinations. And indeed, for binary nodes x1,2 ∈ {0, 1}, we immediately obtain ω = |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)|.
(7.5)
• Any pairwise binary potential can be written as a Boltzmann factor: Ψ(x1, x2) ∝ exp[w x1 x2 + θ1 x1 + θ2 x2]. In this notation, we find the simple and intuitive expression ω = |w|: the strength is the absolute value of the "weight." It is indeed independent of (the size of) the thresholds. In the case of {−1, 1} coding, the relationship is ω = 4|w|.

• In some models, there is the notion of a "temperature" T, that is, Ψ(X) ∝ exp[ψ̃(X)/T], where ψ̃(X) is considered constant. In obvious notation, we then have ω(T) = ω(1)/T and thus σ(T) = 1 − exp[−ω(1)/T] = 1 − [1 − σ(1)]^{1/T}.

• Loopy belief revision (max-product) can be interpreted as a zero-temperature limit of loopy belief propagation (sum-product). More specifically, we get the belief revision updates if we imagine running loopy belief propagation on potentials that are scaled with temperature T and then take the limit of T to zero. Consequently, when analyzing conditions for uniqueness of loopy belief revision fixed points, we can take σ(0) = 0 if σ(1) = 0 (fake interaction), yet σ(0) = 1 whenever σ(1) > 0.

8 Conditions for Uniqueness

8.1 Main Result.

Theorem 4. Loopy belief propagation has a unique fixed point if there exists an allocation matrix Aαβ between potentials α and nodes β with properties

1. Aαβ ≥ 0 ∀α, β⊂α    (positivity)

2. (1 − σα) max_{β⊂α} Aαβ + σα Σ_{β'⊂α} Aαβ' ≤ 1 ∀α    (sufficient amount of resources)

3. Σ_{α⊃β} Aαβ ≥ nβ − 1 ∀β    (sufficient compensation)    (8.1)
with the strength σα a function of the potential Ψα(Xα) as defined in equation 7.2.

Proof. For completeness, we first summarize our line of reasoning. Fixed points of loopy belief propagation are in one-to-one correspondence with
extrema of the dual, equation 5.5. This dual has a unique extremum if it is convex/concave. Concavity is guaranteed, so we focus on conditions for convexity, that is, for positive (semi)definiteness of the corresponding Hessian. This then boils down to conditions that ensure K = K1 + K̃2 + K̃3 ≥ 0 for any choice of Rβ(xβ). Substituting the bound, equation 7.1, into the term K̃3, we obtain

K̃3 ≥ − Σ_α σα Σ_{{β,β'}⊂α, β≠β'} Aαβ Aαβ' Σ_{xβ, xβ'} Q*α(xβ, xβ') Rβ(xβ) Rβ'(xβ')
   ≥ − Σ_α σα Σ_{β⊂α} Σ_{xβ} Aαβ [ Σ_{β'⊂α, β'≠β} Aαβ' ] Q*α(xβ) R²β(xβ),

where in the last step we applied the same trick as in equation 6.1. Since K̃2 ≥ 0, combining K1 and (the above lower bound on) K̃3, we get

K = K1 + K̃2 + K̃3 ≥ Σ_α Σ_{β⊂α} Σ_{xβ} Aαβ [ 1 − Aαβ − σα Σ_{β'≠β} Aαβ' ] Q*α(xβ) R²β(xβ).

The right-hand side is nonnegative for any choice of Rβ(xβ) if

(1 − σα) Aαβ + σα Σ_{β'⊂α} Aαβ' ≤ 1 ∀α, β⊂α,

which, in combination with Aαβ ≥ 0 and σα ≤ 1, yields condition 2 in equation 8.1. The equality constraint, equation 5.2, that we started with can be relaxed to the inequality condition 3 without any consequences. We get back the stricter conditions of theorem 1 if σα = 1 for all potentials α. Furthermore, "fake interactions" play no role: with σα = 0, condition 2 becomes max_{β⊂α} Aαβ ≤ 1, suggesting the choice Aαβ = 1 for all β ⊂ α, which then effectively reduces the number of neighboring potentials nβ in condition 3.

8.2 Comparison with Other Work. To the best of our knowledge, the only conditions for uniqueness of loopy belief propagation fixed points that depend on more than just the structure of the graph are those in Tatikonda and Jordan (2002) for pairwise potentials. The analysis in Tatikonda and Jordan is based on the concept of the computation tree, which represents an unwrapping of the original graph with respect to the loopy belief propagation algorithm. The same concept is used in Weiss (2000) to show that belief revision yields the correct maximum a posteriori assignments in graphs
with a single loop, and in Weiss and Freeman (2001) to prove that loopy belief propagation in gaussian graphical models yields exact means. Although the current theorems based on the concept of computation trees are derived for pairwise potentials, it should be possible to extend them to more general factor graphs. The setup in Tatikonda and Jordan (2002) is slightly different; it is based on the factorization

Pexact(X) = (1/Z) Π_α Ψ̂α(Xα) Π_β Ψ̂β(xβ),

to be compared with our equation 3.1, where there are no self-potentials Ψβ(xβ). With this in mind, the statement is then as follows.

Theorem 5 (adapted from Tatikonda & Jordan, 2002, in particular proposition 5.3). Loopy belief propagation on pairwise potentials has a unique fixed point if

Σ_{α⊃β} [ max_{Xα} ψ̂α(Xα) − min_{Xα} ψ̂α(Xα) ] < 2 ∀β.    (8.2)
To make the connection between theorem 5 and theorem 4, we will first strengthen the former and then weaken the latter. We will focus on the case of binary pairwise potentials. Since the definition of self-potentials is arbitrary and condition 8.2 is valid for any choice, we can easily improve the condition by optimizing this choice. This leads to the following corollary.

Corollary 3. This corollary concerns an improvement of theorem 5 for pairwise binary potentials. Loopy belief propagation on pairwise binary potentials has a unique fixed point if

Σ_{α⊃β} ωα < 4 ∀β,    (8.3)

with ωα defined in equation 7.2.

Proof. The condition 8.2 applies to any arbitrary definition of self-potentials ψ̂β(xβ). In fact, it is valid for any choice

ψ̂α(Xα) = ψα(Xα) + Σ_{β⊂α} φαβ(xβ),

where ψα(Xα) is any choice of potential subsets that fits in our framework of no self-potentials (as argued above, there is some arbitrariness here as
well). We can then optimize this choice to obtain milder, and thus better, conditions. Omitting α and renumbering the nodes from 1 to 2, we have
ˆ ˆ min max ψ(x1 , x2 ) − min ψ(x1 , x2 ) x1 ,x2 φ1 ,φ2 x1 ,x2 = min max [ψ(x1 , x2 ) + φ1 (x1 ) + φ2 (x2 )] φ1 ,φ2
x1 ,x2
− min [ψ(x1 , x2 ) + φ1 (x1 ) + φ2 (x2 )] . x1 ,x2
In the case of binary nodes (two-by-two matrices ψ(x1 , x2 )), it is easy to check that the optimal φ1 and φ2 that yield the smallest gap are such that ψ(x1 , x2 ) + φ1 (x1 ) + φ2 (x2 ) = ψ(xˆ 1 , xˆ 2 ) + φ1 (xˆ 1 ) + φ2 (xˆ 2 ) ≥ ψ(x1 , xˆ 2 ) + φ1 (x1 ) + φ2 (xˆ 2 ) = ψ(xˆ 1 , x2 ) + φ1 (xˆ 1 ) + φ2 (x2 ), (8.4) for some x1 , x2 , xˆ 1 , and xˆ 2 with x1 = xˆ 1 and x2 = xˆ 2 . Solving for φ1 and φ2 , we find 1 ψ(xˆ 1 , x2 ) − ψ(x1 , xˆ 2 ) + ψ(xˆ 1 , xˆ 2 ) − ψ(x1 , x2 ) 2 1 φ2 (x2 ) − φ2 (xˆ 2 ) = ψ(x1 , xˆ 2 ) − ψ(xˆ 1 , xˆ 2 ) + ψ(xˆ 1 , xˆ 2 ) − ψ(x1 , x2 ) . 2 φ1 (x1 ) − φ1 (xˆ 1 ) =
Substitution back into equation 8.4 yields ψ(x1 , x2 ) + φ1 (x1 ) + φ2 (x2 ) − ψ(x1 , xˆ 2 ) − φ1 (x1 ) − φ2 (xˆ 2 ) 1 = ψ(x1 , x2 ) + ψ(xˆ 1 , xˆ 2 ) − ψ(xˆ 1 , x2 ) − ψ(x1 , xˆ 2 ) , 2 which has to be nonnegative. Of all four possible combinations, two of them are valid and yield the same positive gap, and the other two are invalid since they yield the same negative gap. Enumerating these combinations, we find
ˆ ˆ min max ψ(x1 , x2 ) − min ψ(x1 , x2 ) φ1 ,φ2
x1 ,x2
x1 ,x2
1 ω = |ψ(0, 0) + ψ(1, 1) − ψ(0, 1) − ψ(1, 0)| = , 2 2 from equation 7.5. Substitution into the condition 8.2 then yields equation 8.3. Next we derive the following weaker corollary of theorem 4:
Corollary 4. This is a weaker version of theorem 4 for pairwise potentials. Loopy belief propagation on pairwise potentials has a unique fixed point if

Σ_{α⊃β} ωα ≤ 1 ∀β,    (8.5)

with ωα defined in equation 7.2.

Proof. Consider the allocation matrix with components Aαβ = 1 − σα for all β ⊂ α. With this choice, conditions 1 and 2 of equation 8.1 are fulfilled, since (condition 1) σα ≤ 1 and (condition 2) (1 − σα)(1 − σα) + 2σα(1 − σα) = 1 − σα² ≤ 1. Substitution into condition 3 yields

Σ_{α⊃β} (1 − σα) ≥ nβ − 1 and thus Σ_{α⊃β} σα ≤ 1.    (8.6)
Since ωα = − log(1 − σα) ≥ σα, condition 8.5 is weaker than condition 8.6.

Summarizing, the conditions in Tatikonda and Jordan (2002) are, for binary pairwise potentials and when strengthened as above, at most a constant (factor 4) less strict and thus better than the ones derived here. The latter are better when the structure is (close to) a tree. The best set of conditions follows by taking the union of both. Note further that the conditions derived in Tatikonda and Jordan (2002) are, unlike theorem 4, specific to pairwise potentials.

8.3 Illustration. For illustration, we consider a 3 × 3 Ising grid with toroidal boundary conditions as in Figure 3a and uniform ferromagnetic potentials proportional to

(   α    1 − α )
( 1 − α    α   ).

The trivial solution, which is the only minimum of the Bethe free energy for small α, is the one with all pseudomarginals equal to (0.5, 0.5). With simple algebra, for example, following the line of reasoning that leads to the belief optimization algorithm in Welling and Teh (2003), it can be shown that this trivial solution becomes unstable at the critical value αcritical = 2/3 ≈ 0.67. For α > 2/3, we find two minima: one with "spins up" and the other with "spins down." In this symmetric problem, the strength of each potential is given by

ω = 2 log[α/(1 − α)] and thus σ = 1 − [(1 − α)/α]².
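These closed forms follow directly from equation 7.5 and can be checked numerically. The sketch below (Python with NumPy; the helper names `strength` and `alpha_bound` are ours, not the article's) also inverts σ(α) so that a bound σ ≤ s translates into a cap on α, and verifies the inequality ω ≥ σ used in the proof of corollary 4:

```python
import numpy as np

def strength(Psi):
    """omega of a pairwise binary potential table, via equation 7.5."""
    p = np.log(Psi)
    return abs(p[0, 0] + p[1, 1] - p[0, 1] - p[1, 0])

def alpha_bound(s):
    """Invert sigma(alpha) = 1 - ((1 - alpha)/alpha)^2: sigma <= s caps alpha here."""
    return 1.0 / (1.0 + np.sqrt(1.0 - s))

alpha = 0.6
Psi = np.array([[alpha, 1 - alpha], [1 - alpha, alpha]])
omega = strength(Psi)
sigma = 1 - np.exp(-omega)
assert np.isclose(omega, 2 * np.log(alpha / (1 - alpha)))   # closed form for omega
assert np.isclose(sigma, 1 - ((1 - alpha) / alpha) ** 2)    # closed form for sigma
assert np.isclose(alpha_bound(sigma), alpha)                # the inversion is exact
assert omega >= sigma                                       # -log(1-s) >= s, corollary 4
```

With σ ≤ 1/3, for example, `alpha_bound(1/3)` evaluates to 1/(1 + √(2/3)) ≈ 0.55.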
Figure 3: Three Ising grids in factor-graph notation: circles denote nodes, boxes interactions. (a) Toroidal boundary conditions. All elements of the allocation matrix are equal to 3/4 (not shown). (b) Aperiodic boundary conditions and (c) two loops left. The elements of the allocation matrix along the edges follow directly from optimizing condition 3 in theorem 4 and symmetry considerations. With B = 2 − 2A in b and C = 1 − A in c, the optimal settings for the single remaining variable A then boil down to 3/4 and 1 − 1/√8, respectively. See the text for further explanation.
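Conditions 1 through 3 of theorem 4 are easy to check mechanically for a candidate allocation. A minimal sketch (Python with NumPy; the function name and data layout are ours) on the toroidal grid of Figure 3a, with the uniform allocation A = 3/4:

```python
import numpy as np

def theorem4_holds(A, sigma, neighbors, eps=1e-12):
    """Check conditions 1-3 of theorem 4.
    A[a][b]: allocation between potential a and node b (dict of dicts);
    sigma[a]: strength of potential a; neighbors[b]: potentials containing node b."""
    for a, row in A.items():
        vals = np.array(list(row.values()))
        if np.any(vals < 0):                                              # condition 1
            return False
        if (1 - sigma[a]) * vals.max() + sigma[a] * vals.sum() > 1 + eps:  # condition 2
            return False
    for b, pots in neighbors.items():
        if sum(A[a][b] for a in pots) < len(pots) - 1 - eps:              # condition 3
            return False
    return True

# 3x3 toroidal grid: 18 pairwise potentials, each node sits in 4 of them
nodes = [(i, j) for i in range(3) for j in range(3)]
edges = [((i, j), ((i + 1) % 3, j)) for i, j in nodes] + \
        [((i, j), (i, (j + 1) % 3)) for i, j in nodes]
A = {e: {e[0]: 0.75, e[1]: 0.75} for e in edges}
neighbors = {v: [e for e in edges if v in e] for v in nodes}
assert theorem4_holds(A, {e: 1 / 3 for e in edges}, neighbors)       # sigma = 1/3: holds
assert not theorem4_holds(A, {e: 0.4 for e in edges}, neighbors)     # stronger: fails
```

At σ = 1/3, condition 2 reads (2/3)(3/4) + (1/3)(3/2) = 1 exactly, matching the boundary case worked out in the text below.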
The minimal (uniform) compensation in condition 3 of theorem 4 amounts to A = 3/4 for all combinations of potentials and nodes. Substitution into condition 2 then yields

σ ≤ 1/3 and thus α ≤ 1/(1 + √(2/3)) ≈ 0.55.

The critical value that follows from corollary 3 is in this case slightly better:

ω < 1 and thus α ≤ 1/(1 + e^{−1/2}) ≈ 0.62.

Next we consider the same grid with aperiodic boundary conditions as in Figure 3b. Numerically, we find a critical αcritical ≈ 0.79. The value that follows from corollary 3 is dominated by the center node and hence stays the same: a unique loopy belief propagation fixed point for α < 0.62. Theorem 4 can be exploited to shift resources a little. In principle, we can solve the nonlinear programming problem, but for this small problem it can still be done by hand with the following argumentation. Minimal compensation according to condition 3 in theorem 4 combined with symmetry considerations yields the allocation matrix elements along the edges in Figure 3b. It is then easy to check that there are only two different appearances of condition 2:

(2 − 2A)σ + 3/4 ≤ 1 and (1/2)σ + A ≤ 1.

The optimal choice for A is the one for which both conditions turn out to be identical. In this way, we obtain A = 3/4, yielding

σ ≤ 1/2 and thus α ≤ 1/(1 + √(1/2)) ≈ 0.58,

still slightly worse than the condition from corollary 3. An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis following the same recipe as for Figure 3b yields A = 1 − 1/√8 with

σ ≤ 1/√2 and thus α ≤ 1/(1 + √(1 − 1/√2)) ≈ 0.65,
still slightly worse than the condition from corollary 3. An example in which the condition obtained with theorem 4 is better than the one from corollary 3 is given in Figure 3c. Straightforward analysis √ following the same recipe as for Figure 3b yields A = 1 − 1/8 with 1 1 ≈ 0.65, and thus α ≤ σ ≤ √ 2 1 + 1 − 1/2 better than the α < 0.62 from corollary 3 and to be compared with the critical αcritical ≈ 0.88. 9 Discussion In this article, we derived sufficient conditions for loopy belief propagation to have just a single fixed point. These conditions remain much too strong to be anywhere near the necessary conditions and in that sense should be seen as no more than a first step. These conditions have the following positive features:
• They generalize the conditions for convexity of the Bethe free energy.
• They incorporate the (local) strength of potentials.
• They scale naturally as a function of the "temperature."
• They are invariant to arbitrary definitions of potentials and self-interactions.

Although the analysis that led to these conditions may seem quite involved, it basically consists of a relatively straightforward combination of two observations. The first observation is that we can exploit the arbitrariness in the definition of the Bethe free energy when we incorporate the constraints; this forms the basis of the resource allocation argument. The second observation concerns the bound on the correlation of a loopy belief propagation marginal, which leads to the introduction of the strength of a potential.

Besides their theoretical usefulness, the conditions have more practical uses. First, algorithms for guaranteed convergence explicitly minimize the Bethe free energy. They can be considered "bound optimization algorithms," similar to expectation maximization and iterative proportional fitting: in the inner loop, they minimize a bound on the Bethe free energy, which is then updated in the outer loop. In practice, it appears that the tighter the bound, the faster the convergence (see, e.g., Heskes et al., 2003). Instead of a bound that is convex (Yuille, 2002) or convex over the set of constraints (Teh & Welling, 2002; Heskes et al., 2003), we might relax the convexity condition and choose a tighter bound that still has a unique minimum, thereby speeding up the convergence. Second, in Wainwright et al. (2002) a convexified Bethe free energy is proposed. The arguments for this class of free energies are twofold: they yield a bound on the partition function (instead of just an approximation, as the standard Bethe free energy does) and have a unique minimum.
Focusing on the second argument, the conditions in this article can be used to construct Bethe free energies that may not be convex (over the set of constraints) but do have a unique minimum and, being closer to the standard Bethe free energy, may yield better approximations.

We can think of the following opportunities to make the sufficient conditions derived here stricter and thus closer to necessary conditions:

• The conditions guarantee convexity of the dual G(Qβ, λαβ) with respect to Qβ. But in fact we only need G(Qβ) ≡ max_{λαβ} G(Qβ, λαβ) to be convex, which is a weaker requirement. The Hessian of G(Qβ), however, appears to be more difficult to compute and to analyze in general, but may lead to stronger results in specific cases (e.g., only pairwise interactions or a particular choice of Aαβ).

• It may be possible to strengthen the bound, equation 7.1, on loopy belief correlations, especially for interactions that involve more than two nodes.

An important question is how the uniqueness of loopy belief propagation fixed points relates to the convergence of loopy belief propagation.
Intuitively, one might expect that if loopy belief propagation has a unique fixed point, it will also converge to it. This also seems to be the argumentation in Tatikonda and Jordan (2002). However, to the best of our knowledge, there is no proof of such a correspondence. Furthermore, the following set of simulations seems to suggest otherwise. We consider a Boltzmann machine with four binary nodes, weights

w = ω (  0   1  −1  −1
         1   0   1  −1
        −1   1   0  −1
        −1  −1  −1   0 ),

zero thresholds, and potentials Ψij(xi, xj) = exp[wij/4] if xi = xj and Ψij(xi, xj) = exp[−wij/4] if xi ≠ xj. Running loopy belief propagation, possibly damped as in equation 3.9, we observe "convergent" and "nonconvergent" behavior. For relatively small weights, loopy belief propagation converges to the trivial fixed point with Pi(xi) = 0.5 for all nodes i and xi ∈ {0, 1}, as in the lower left inset in Figure 4. For relatively large weights, it ends up in a limit cycle, as shown in the upper right inset. The weight strength that forms the transition between this "convergent" and "nonconvergent" behavior strongly depends on the step size.7 This by itself makes it hard to defend a one-to-one correspondence between convergence of loopy belief propagation (apparently depending on step size) and uniqueness of fixed points (obviously independent of step size). For weights larger than roughly 5.8, loopy belief propagation failed to converge to the trivial fixed point even for very small step sizes. However, running a convergent double-loop algorithm from many different initial conditions and many weight strengths considerably larger than 5.8, we always ended up in the trivial fixed point and never in another one. We found similar behavior for a three-node Boltzmann machine (same weight matrix as above, except for the fourth node) for very large weights: loopy belief propagation ends up in a limit cycle, whereas a convergent double-loop algorithm converges to the trivial fixed point, which here, by corollary 2, is guaranteed to be unique. In future work, we hope to elaborate on these issues.
7 Note that the conditions for guaranteed uniqueness imply ω = 4/3 for corollary 3 and ω = log(2) ≈ 0.69 for theorem 4, both far below the weight strengths where “nonconvergent” behavior sets in.
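The simulation just described is easy to reproduce. The sketch below (Python with NumPy; the damping step, iteration budget, and random initialization are choices of this sketch, not of the article) runs damped sum-product on the four-node Boltzmann machine above. At strength ω = 0.5, below the theorem-4 bound log 2 quoted in footnote 7, the fixed point is provably unique and the messages settle on the trivial marginals (0.5, 0.5):

```python
import numpy as np

def loopy_bp(omega, step=0.5, iters=5000, tol=1e-12, seed=0):
    """Damped loopy belief propagation (sum-product) on the four-node
    Boltzmann machine above; returns the single-node marginals P_i(x_i)."""
    W = omega * np.array([[0, 1, -1, -1],
                          [1, 0, 1, -1],
                          [-1, 1, 0, -1],
                          [-1, -1, -1, 0]], float)
    n = len(W)
    # potential tables Psi_ij(x_i, x_j) = exp(+w_ij/4) if x_i == x_j, else exp(-w_ij/4)
    psi = {(i, j): np.array([[np.exp(W[i, j] / 4), np.exp(-W[i, j] / 4)],
                             [np.exp(-W[i, j] / 4), np.exp(W[i, j] / 4)]])
           for i in range(n) for j in range(n) if i != j}
    rng = np.random.default_rng(seed)
    m = {k: rng.random(2) for k in psi}          # random initial messages m_{i->j}(x_j)
    for k in m:
        m[k] /= m[k].sum()
    for _ in range(iters):
        diff = 0.0
        for (i, j) in m:
            prod = np.ones(2)                    # incoming messages to i, except from j
            for k in range(n):
                if k != i and k != j:
                    prod *= m[(k, i)]
            new = psi[(i, j)].T @ prod           # sum over x_i
            new /= new.sum()
            upd = (1 - step) * m[(i, j)] + step * new    # damped update
            diff = max(diff, np.abs(upd - m[(i, j)]).max())
            m[(i, j)] = upd
        if diff < tol:
            break
    P = np.ones((n, 2))
    for (i, j) in m:
        P[j] *= m[(i, j)]                        # belief: product of incoming messages
    return P / P.sum(axis=1, keepdims=True)

print(loopy_bp(0.5))
```

Raising `omega` toward the values in Figure 4 and lowering `step` lets one probe the "convergent"/"nonconvergent" transition discussed in the text.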
Figure 4: The transition between "convergent" and "nonconvergent" behavior as a function of the step size used for damping loopy belief propagation and the weight strength. Simulations on a four-node Boltzmann machine. The insets show the marginal P1(x1 = 1) as a function of the number of loopy belief iterations for step size 0.2 and strength 4 (lower left) and step size 0.6 and strength 6 (upper right). See the text for further detail.
Acknowledgments

This work has been supported in part by the Dutch Technology Foundation STW. I thank the anonymous reviewers for their constructive comments and Joris Mooij for computing the critical αcritical's in section 8.3.

References

Heskes, T. (2002). Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 359–366). Cambridge, MA: MIT Press.

Heskes, T., Albers, K., & Kappen, B. (2003). Approximate inference and constrained optimization. In Uncertainty in artificial intelligence: Proceedings of the Nineteenth Conference (UAI-2003) (pp. 313–320). San Francisco: Morgan Kaufmann.

Kschischang, F., Frey, B., & Loeliger, H. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.
Lauritzen, S., & Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society B, 50, 157–224.

Luenberger, D. (1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.

McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.

McEliece, R., & Yildirim, M. (2003). Belief propagation on partially ordered sets. In D. Gilliam & J. Rosenthal (Eds.), Mathematical systems theory in biology, communications, computation, and finance (pp. 275–300). New York: Springer.

Minka, T. (2001). Expectation propagation for approximate Bayesian inference. In J. Breese & D. Koller (Eds.), Uncertainty in artificial intelligence: Proceedings of the Seventeenth Conference (UAI-2001) (pp. 362–369). San Francisco: Morgan Kaufmann.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. In K. Laskey & H. Prade (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (pp. 467–475). San Francisco: Morgan Kaufmann.

Pakzad, P., & Anantharam, V. (2002). Belief propagation and statistical physics. In 2002 Conference on Information Sciences and Systems. Princeton, NJ: Princeton University.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco: Morgan Kaufmann.

Tatikonda, S., & Jordan, M. (2002). Loopy belief propagation and Gibbs measures. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 493–500). San Francisco: Morgan Kaufmann.

Teh, Y., & Welling, M. (2002). The unified propagation and scaling algorithm. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 953–960). Cambridge, MA: MIT Press.

Wainwright, M., Jaakkola, T., & Willsky, A. (2002). A new class of upper bounds on the log partition function. In A. Darwiche & N. Friedman (Eds.), Uncertainty in artificial intelligence: Proceedings of the Eighteenth Conference (UAI-2002) (pp. 536–543). San Francisco: Morgan Kaufmann.

Weiss, Y. (2000). Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1), 1–41.

Weiss, Y., & Freeman, W. (2001). Correctness of belief propagation in graphical models with arbitrary topology. Neural Computation, 13(10), 2173–2200.

Welling, M., & Teh, Y. (2003). Approximate inference in Boltzmann machines. Artificial Intelligence, 143(1), 19–50.

Yedidia, J., Freeman, W., & Weiss, Y. (2001). Generalized belief propagation. In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689–695). Cambridge, MA: MIT Press.

Yuille, A. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 1691–1722.

Received December 2, 2003; accepted April 29, 2004.
LETTER
Communicated by Steven Nowlan
Neural Network Uncertainty Assessment Using Bayesian Statistics: A Remote Sensing Application

F. Aires
[email protected]
Department of Applied Physics and Applied Mathematics, Columbia University, NASA Goddard Institute for Space Studies, New York, NY 10025, U.S.A., and CNRS/IPSL, Laboratoire de Météorologie Dynamique, École Polytechnique, 91128 Palaiseau Cedex, France
C. Prigent
[email protected] CNRS, LERMA, Observatoire de Paris, Paris 75014, France
W.B. Rossow
[email protected] NASA Goddard Institute for Space Studies, New York, NY 10025, U.S.A.
Neural network (NN) techniques have proved successful for many regression problems, in particular for remote sensing; however, uncertainty estimates are rarely provided. In this article, a Bayesian technique to evaluate uncertainties of the NN parameters (i.e., synaptic weights) is first presented. In contrast to more traditional approaches based on point estimation of the NN weights, we assess uncertainties on such estimates to monitor the robustness of the NN model. These theoretical developments are illustrated by applying them to the problem of retrieving surface skin temperature, microwave surface emissivities, and integrated water vapor content from a combined analysis of satellite microwave and infrared observations over land. The weight uncertainty estimates are then used to compute analytically the uncertainties in the network outputs (i.e., error bars and correlation structure of these errors). Such quantities are very important for evaluating any application of an NN model. The uncertainties on the NN Jacobians are then considered in the third part of this article. When used for regression fitting, NN models can effectively represent highly nonlinear, multivariate functions. In this situation, most emphasis is put on estimating the output errors, but almost no attention has been given to errors associated with the internal structure of the regression model. The complex structure of dependency inside the NN is the essence of the model, and assessing its quality, coherency, and physical character makes all the difference between a black-box model with small output errors and a reliable, robust, and physically coherent model. Such dependency structures are described to first order by the NN Jacobians: they indicate the sensitivity of one output with respect to the inputs of the model for given input data. We use a Monte Carlo integration procedure to estimate the robustness of the NN Jacobians. A regularization strategy based on principal component analysis is proposed to suppress the multicollinearities in order to make these Jacobians robust and physically meaningful.

Neural Computation 16, 2415–2458 (2004)  © 2004 Massachusetts Institute of Technology

1 Introduction

Neural network (NN) techniques have proved very successful in developing computationally efficient algorithms for geophysical applications. We are interested, in this study, in the application of NN retrieval methods for satellite remote sensing (Aires, Rossow, Scott, & Chédin, 2002a): the NN is used as a nonlinear multivariate regression to represent the inverse radiative transfer function in the atmosphere. This is an application of inverse theory: remote sensing requires the estimation of geophysical variables from indirect measurements by applying the inverse radiative transfer function to radiative measurements. NNs are well adapted to solve nonlinear problems and are especially designed to capitalize more completely on the inherent statistical relationships among the input and output variables. A rigorous statistical approach requires not only a minimization of output errors but also an uncertainty estimate of the model parameters (Saltelli, Chan, & Scott, 2000). The reliability of the inverse model is as important as its answer, but until now, probably because of the lack of adequate tools, the uncertainty of an NN statistical model has rarely been quantified. Our work is based on the developments of Le Cun, Denker, and Solla (1990) and MacKay (1992).
These studies introduced error bar estimates for neural networks using a Bayesian approach, but these tools were developed and tested in simple cases with a unique network output. In this article, we use a slightly different approach from the more traditional full Bayesian method, in which scalar hyperparameters are estimated using the so-called evidence approach. A multiple-output method is used in order to develop uncertainty tools for real-world applications. Our Bayesian methodology first provides uncertainty estimates for the parameters of the neural network (i.e., the network weights). A similar approach, albeit in a simpler (monovariate) presentation, is used, for example, in Bishop (1996), Neal (1996), and Nabney (2002). The robustness of the NN parameters is assessed using the Hessian matrix (second derivative) of the log likelihood with respect to the NN weights. Uncertainty estimates for the parameters of the neural network can then be used for the determination of a variety of other probabilistic quantities related to the overall stochastic character of the NN model. Such applications can use theoretical derivations when they are available. In this
article, one such analytical application provides uncertainty estimates of the network output (error bars plus their correlation structure). Reliability of the NN predictions is very important for any application. Confidence intervals (CI) have been developed for classical linear regression theory with well-established results (e.g., Koroliouk, Portenko, Skorokhod, & Tourbine, 1983). For nonlinear models, such results are more recent (Bates & Watts, 1988), and for NNs they are rarely available. Generally, only the root mean square (RMS) of the generalization error is provided, but this single quantity is not situation dependent. Other approaches use bootstrap techniques to estimate such CI, but they are limited by the large number of computations that such techniques require. Recently, Rivals and Personnaz (2000, 2003) introduced a new method for estimating CI by using a linear Taylor expansion of the NN outputs (which makes traditional estimation of CI for nonlinear models a tractable problem). In this article, we separate the errors that are due to the NN weight uncertainty and the errors from all remaining sources. Such additional sources of uncertainty can be, for example, noise in the inputs of the NN (Wright, Ramage, Cornford, & Nabney, 2000). We will comment on an approach to analyze in even more detail the various contributions to output errors. These errors are described in terms of covariance matrices that can be interpreted using eigenvectors called error patterns (Rodgers, 1990). When a theoretical derivation is too complex to be obtained, another possible application of weight uncertainty is the empirical estimation of probabilistic quantities. Modern Bayesian statistics are used here together with Monte Carlo (MC) simulations (Gelman, Carlin, Stern, & Rubin, 1995) to estimate uncertainties on the NN Jacobian. These Jacobians, or sensitivities, of an NN model are defined as the partial first derivatives of the model outputs with respect to its inputs.
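For a feedforward network, these sensitivities have a simple closed form. The sketch below (Python with NumPy; the tiny one-hidden-layer model and random weights are purely illustrative, not the article's retrieval network) computes the NN Jacobian dy/dx analytically and checks it against centered finite differences:

```python
import numpy as np

def mlp(x, W1, W2):
    """A tiny one-hidden-layer MLP, y = W2 tanh(W1 x); illustrative only."""
    return W2 @ np.tanh(W1 @ x)

def jacobian(x, W1, W2):
    """Analytic NN Jacobian dy/dx: the sensitivities discussed in the text."""
    h = np.tanh(W1 @ x)
    return W2 @ np.diag(1 - h ** 2) @ W1

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)
J = jacobian(x, W1, W2)

# verify against centered finite differences, column by column
eps = 1e-6
J_num = np.column_stack([(mlp(x + eps * e, W1, W2) - mlp(x - eps * e, W1, W2)) / (2 * eps)
                         for e in np.eye(3)])
assert np.allclose(J, J_num, atol=1e-5)
```

In the article's setting, the Monte Carlo procedure samples the weights from their posterior and recomputes such a Jacobian for each draw, yielding an empirical distribution of sensitivities.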
These quantities are very useful. However, the NN model is trained to obtain good fit statistics for its outputs, but most of the time, no constraint is applied to structure the internal regularities of the model. Statistical inference is an ill-posed inverse problem (Tarantola, 1987; Vapnik, 1997; Aires, Schmitt, Scott, & Ch´edin, 1999): many solutions can be found for the NN parameters (i.e., the synaptic weights) for similar output statistics. One of the reasons for this nonunique solution comes from the fact that multicollinearities can exist among the variables. Such correlations on input or output variables are a major problem even for linear regressions: the parameters of the regression are very unstable and can vary drastically from one experiment to another. The Jacobians are the equivalent of the linear regression parameters, so similar behavior is expected: when multicollinearities are present, the Jacobians will probably be highly variable and unreliable, even if the output statistics of the NN are very good. The aim of this article is to investigate this problem, analyze it, and suggest a solution. Many regularization techniques exist to reduce the number of degrees of freedom in the model for the multicollinearity problem or for any other ill-posed problem. For example, one approach is to reduce the number of
inputs to the NN (Rivals & Personnaz, 2003); this is a model selection tool. However, the introduction of redundant information in the input of the NN can be useful for reducing the observational noise (e.g., Aires et al., 2002a; Aires, Rossow, Scott, & Chédin, 2002b) as long as the NN learning is regularized in some way. Furthermore, the input variables used in this work are highly correlated (among brightness temperatures, among first guesses, or between observations and first guesses), so it would be difficult to extract a few of the original variables and avoid the multicollinearities by input selection. We propose to solve this nonrobustness by using a principal component analysis (PCA) regression approach. Our technological developments are illustrated by application to an NN inversion algorithm for remote sensing over land. Such NN methods have already been used to retrieve columnar water vapor, liquid water, or wind speed over ocean using Special Sensor Microwave/Imager (SSM/I) observations (Stogryn, Butler, & Bartolac, 1994; Krasnopolsky, Breaker, & Gemmill, 1995; Krasnopolsky, Gemmill, & Breaker, 2000). Our algorithm includes for the first time the use of a first guess to retrieve the surface skin temperature Ts, the integrated water vapor content WV, the cloud liquid water path LWP, and the microwave land surface emissivities Em between 19 and 85 GHz from SSM/I and infrared observations. Neural network techniques have proved very successful in developing computationally efficient algorithms for remote sensing (e.g., Aires et al., 2002b), but uncertainty estimates on the retrievals have been a limiting factor for the use of such methods. Our technical developments on this remote sensing application provide a new framework for the characterization and the analysis of various sources of neural network errors. Estimation of the Jacobian uncertainties is then used as a diagnostic tool to identify nonrobust regressions resulting from unstable learning processes.
The Bayesian approach for the estimation of NN weight uncertainty is presented in section 2. The NN technique is described, the theoretical formulation of a posteriori distributions for the NN weights is developed, and a remote sensing application is presented as an example to illustrate some first results of the application of our weight uncertainty analysis. The theoretical computation of the predictive distribution of network outputs is developed in section 3. These theoretical developments are used to characterize the NN output uncertainty sources. Section 4 presents the uncertainty estimate of NN Jacobians. The PCA of the NN input and output data is described together with its regularization properties. Network Jacobians are presented with their corresponding uncertainties. Conclusions and perspectives are discussed in section 5.

Neural Network Uncertainty Assessment

2 Network Weights Uncertainty

2.1 The Quality Criterion. In this study, we use a classical multilayer perceptron (MLP) trained by the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986). For the definition of the quality criterion to optimize, we present a general matrix formulation of the problem and link our derivation to the "classical" literature on Bayesian error estimation, often introduced with a scalar formulation (MacKay, 1992; Bishop, 1996). The first and main term in the quality criterion used to train a neural network is the "data" term, expressed using the difference between the target data and the NN estimates as measured by a particular distance. Many distance measures can be used, but it is often supposed that the differences follow a gaussian probability distribution function (PDF), which means that the right distance is the Mahalanobis distance (Crone & Crosby, 1995). The ideal covariance matrix for the gaussian PDF, denoted C_in = A_in^{-1}, describes what we call the intrinsic noise (or natural variability) of the physical variables y to retrieve. Note that C_in takes into account only the intrinsic variability and not the error associated with the retrieval scheme itself: this makes the measure physically coherent. The information encoded in C_in is difficult to obtain a priori; we will see how to estimate this quantity, but we suppose here that it is known. The data quality term becomes

E_D(w) = \frac{1}{2} \sum_{n=1}^{N} [\varepsilon_y^{(n)}]^T \cdot A_{in} \cdot \varepsilon_y^{(n)}    (2.1)
where \varepsilon_y^{(n)} = (t^{(n)} - y^{(n)}) is the output error and the index (n) indicates the sample number in database B. This criterion leads to a weighted least squares when the matrix C_in is just diagonal. When no a priori information is available, C_in = I, and the criterion becomes the classical least squares. In order to regularize the learning process, a regularization term is sometimes added to the data term in the quality criterion. The weight decay (Hertz, Krogh, & Palmer, 1991) is probably the most common regularization technique for NN:

E_r(w) = \frac{1}{2}\, w^T \cdot A_r \cdot w    (2.2)
where C_r = A_r^{-1} is the covariance matrix of the gaussian a priori distribution for the network weights. The overall quality criterion that is minimized during the learning stage is the sum of the data and the regularization terms,

E(w) = E_D(w) + E_r(w)    (2.3)
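As a toy illustration of equations 2.1 to 2.3, the criterion can be evaluated directly with numpy. This is a minimal sketch: all shapes, variable names, and numerical values below are hypothetical, not taken from the paper's application.

```python
import numpy as np

def quality_criterion(eps_y, A_in, w, A_r):
    """Total criterion E(w) = E_D(w) + E_r(w) of equations 2.1-2.3.

    eps_y : (N, M) array of output errors eps_y^(n) = t^(n) - y^(n)
    A_in  : (M, M) inverse covariance of the intrinsic noise (C_in^-1)
    w     : (W,) flattened network weights
    A_r   : (W, W) inverse covariance of the weight prior (C_r^-1)
    """
    # Data term: 0.5 * sum_n eps^T A_in eps (Mahalanobis distance, eq. 2.1)
    E_D = 0.5 * np.einsum('ni,ij,nj->', eps_y, A_in, eps_y)
    # Regularization term: weight decay in matrix form (eq. 2.2)
    E_r = 0.5 * w @ A_r @ w
    return E_D + E_r

# With A_in = I and A_r = 0, this reduces to ordinary least squares:
eps = np.array([[1.0, 0.0], [0.0, 2.0]])
E = quality_criterion(eps, np.eye(2), np.zeros(3), np.zeros((3, 3)))
# E = 0.5 * (1 + 4) = 2.5
```

With a diagonal A_in the data term becomes a weighted least squares, as noted in the text.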
The two matrices A_in and A_r are called hyperparameters. They are generally simplified in the classical literature by using two scalars instead, respectively β and α, so that the general quality criterion becomes E(w) = βE_D + αE_r, where E_D and E_r are simplified quadratic forms. In this formulation, β represents the inverse of the observation noise variance for all outputs, and α is a weight for the regularization term linked to the a priori general variance of the weights. This is obviously poorer and less general than our matrix formulation in equation 2.3, but the hyperparameters A_in and A_r are difficult to guess a priori.

F. Aires, C. Prigent, and W. Rossow

2.2 Intrinsic Uncertainty of Targets. The conditional probability P(t|x, w) represents the variability of target t for input x and network weights w, due to a variety of sources like the errors in the model linking x to t in B or the observational noise on x. This variability includes all sources of uncertainty except those from the NN regression model, represented by uncertainties on the network weights w, which are fixed in the conditional probability. If the neural network g_w fits the data well (after the learning stage), the intrinsic variability is evaluated by comparing the target values t, matched with each input x in the data set B, to the NN outputs y. Generally, this distribution can be approximated locally to first order by a gaussian distribution with zero mean and a covariance matrix C_in = A_in^{-1}:

P(t|x, w) = \frac{1}{Z}\, e^{-\frac{1}{2}\, \varepsilon_y^T \cdot A_{in} \cdot \varepsilon_y}    (2.4)
where Z is a normalization factor. The likelihood of the parameters w, given the inverse model structure g of the trained NN g_w, is expressed by evaluating this probability over the database B that includes D = {t^{(n)}; n = 1, ..., N}, the set of output samples:

P(D|x, w) = \prod_{n=1}^{N} P(t^{(n)}|x^{(n)}, w) = \frac{1}{Z^N}\, e^{-\frac{1}{2} \sum_{n=1}^{N} \varepsilon_y^{(n)T} \cdot A_{in} \cdot \varepsilon_y^{(n)}}    (2.5)
which we simplify to

P(D|x, w) = \frac{1}{Z^N}\, e^{-E_D}    (2.6)

using the definition of E_D in equation 2.1. The smaller E_D is, the likelier the output data sample D is (i.e., the closer all y are to the targets t). The conditioning of the previous probabilities, as in equation 2.5, depends on the input x, but since the distribution of x is not of interest here, this variable will be omitted in the following notation for simplicity.

2.3 Theoretical Derivation of Weight PDF. In classical regression techniques, a point estimate of the parameters w is searched for (i.e., only one estimate of the weight vector w is evaluated). In the Bayesian context, the uncertainty of w, described by a PDF P(w), can also be characterized. This
distribution of the weights conditional on a database is given by Bayes' theorem:

P(w|D) = \frac{P(D|w)\, P(w)}{P(D)}    (2.7)
P(D) does not depend on the weights, and the prior P(w) is a uniform distribution in this application (since there is no prior information on w), meaning that no regularization term E_r(w) is used in equation 2.3. So we can use for P(w|D) the expression for P(D|x, w) from equation 2.6, the other terms in equation 2.7 being considered as constant normalization factors. Laplace's method is now used: it consists in using a local quadratic approximation of the log-posterior distribution. A second-order Taylor expansion of E_D(w) about w* is performed, where w* is the set of final optimized network weights (the parameters of the neural network regression) found at the end of the learning process:

E_D(w) = E_D(w^*) + b^T \cdot \Delta w + \frac{1}{2}\, \Delta w^T \cdot H \cdot \Delta w    (2.8)
where \Delta w = w - w^*, b is the gradient (Jacobian) vector given by b = \nabla E_D(w)\,\big|_{w^*}, and H is the Hessian matrix given by H = \nabla \nabla E_D(w)\,\big|_{w^*}. The linear term b^T \cdot \Delta w disappears because we are at the optimum w*, which means that the gradient b is zero. For the local quadratic approximation to be valid, w* must be a real optimum (at least locally in the weight space); otherwise, the gradient b cannot be neglected anymore, and the matrix H might not be positive definite, which would make its use difficult for subsequent uncertainty estimates. The second-order approximation leads to

P(w|D) = \frac{1}{Z^N}\, e^{-E_D(w^*) - \frac{1}{2}\, \Delta w^T \cdot H \cdot \Delta w} \propto e^{-\frac{1}{2}\, \Delta w^T \cdot H \cdot \Delta w}    (2.9)
This means that the a posteriori PDF of the neural network weights follows a gaussian distribution with mean w* and covariance matrix H^{-1}. This probability represents a plausibility (in the Bayesian sense) for the weights w, not the probability of obtaining the weights w when using the learning algorithm. If a regularization term, such as the one described in equation 2.2, is used, then this probability becomes

P(w|D) \propto e^{-\frac{1}{2}\, \Delta w^T \cdot (H + A_r) \cdot \Delta w}    (2.10)
These two terms are used to weight the contribution to the variability of the weights due to the network model and the variability of the weights due to
the gaussian distribution of the a priori information on the weights. What is interesting about this formula is that to obtain the covariance matrix on the weights, we invert H + A_r instead of H only, which is more robust (see section 2.6), since A_r is the inverse of a positive definite matrix.

2.4 Hessian Matrix for a One-Hidden-Layer Network. The Hessian H of the previously defined log likelihood is a matrix of dimension W × W (W is the dimension of w) whose components are defined by

H_{ij}(x) = \left. \frac{\partial^2 E(w)}{\partial w_i\, \partial w_j} \right|_x    (2.11)
where w_i and w_j are two weights from the set w. There are many ways of estimating the Hessian matrix; some are generic methods, and some are specific to the MLP. For example, one generic approximation method uses finite differences, but in our case it is possible to derive a mathematical expression for the Hessian based on the NN model. This theoretical Hessian is less demanding computationally, scaling as O(W^2) (where W is the number of weights in the neural network), than the approximation by finite differences, which scales like O(W^3).

2.5 A Remote Sensing Example. An NN inversion scheme has been developed to retrieve surface temperature (Ts), water vapor column amount (WV), and microwave surface emissivities at each frequency/polarization (Em), over snow- and ice-free land, from a combined analysis of satellite microwave (SSM/I) and infrared International Satellite Cloud Climatology Project (ISCCP) data (Aires, Prigent, Rossow, & Rothstein, 2001; Prigent, Aires, & Rossow, 2003). This study aims, in part, to provide uncertainty estimates for these retrievals. To avoid nonuniqueness and instability in an inverse problem, it is essential to use all the a priori information available. The chosen solution is then constrained so that it is physically more consistent (Rodgers, 1976). We introduce a priori first-guess information into the input of the NN model, so the neural transfer function becomes
y = g_w(y^b, x^o)    (2.12)
where y is the retrieval (i.e., the retrieved physical parameters), g_w is the NN with parameters w, y^b is the first guess for the retrieval of the physical parameters, and x^o the observations. In this approach, the first guess is considered to be an independent estimate of the state obtained from sources other than the indirect measurements (here, the satellite observations). These are sometimes called virtual measurements (Rodgers, 1990). The extensive learning database used in this study, together with the characteristics of the a priori first-guess information and related background
errors, are presented in Aires et al. (2001). Over the 9,830,211 samples for clear, snow- and ice-free measurements from a whole year of data, we have used only N =20,000 samples, chosen randomly, to construct the learning database B . The learning algorithm and the network architecture are able to infer the inverse radiative transfer equation with these N samples. The conjugate gradient optimization algorithm used to train the NN is fast and efficient: the learning errors decrease extremely fast and then stabilize after a few thousand iterations, each iteration involving the whole learning database B . This learning stage determines the optimal weights w . Once trained, the neural network gw represents statistically the inverse of the radiative transfer equation. The NN model is then valid for all observations (i.e., global inversion), where iterative methods, such as variational assimilation, have to compute an estimator for each observation (i.e., local inversion). Table 1 gives the RMS scores for the first guesses and the retrievals. For each output, the retrieval is a considerable improvement compared to the first guess. 2.6 Neural Network Hessian Regularization. The Hessian H is computed using the data set B . A few comments about the inversion of H are required. This matrix can be very large when the NN considered is big (W, the size of H , and the number of parameters in the NN can reach a few thousand). This means that the inversion can be sensitive to numerical problems. As a consequence, the estimation of H needs to be done with enough samples from B ; otherwise, the subspace spanned by the samples describing H might be too small or the eigenvalues of H too close to zero or even negative, making the inversion numerically impossible. We noted in section 2.3 that the gradient b in equation 2.8 is supposed to be zero; otherwise, the local quadratic approximation is not good enough, implying that the Hessian matrix H is not positive definite. 
As a consequence, it is very important that the learning of the NN converges close enough toward the optimal solution w*. Monitoring the convergence algorithm could be enhanced by checking in parallel the positive definite character of the corresponding Hessian of the network.

Table 1: First Guess and Retrieval RMS Errors.

                  First Guess    Retrieval
Ts (K)            3.52411        1.46250
WV (kg m−2)       7.99485        3.83521
Em 19V            0.01504        0.00494
Em 19H            0.01837        0.00495
Em 22V            0.01659        0.00562
Em 37V            0.01425        0.00497
Em 37H            0.01802        0.00501
Em 85V            0.01764        0.00682
Em 85H            0.02137        0.00820

Even when enough samples from B are used to estimate H and when learning convergence is reached, numerical problems can still exist. This situation can be related to an inconsistency between the complexity of the NN and the complexity of the function to be estimated: too many degrees of freedom in the NN can produce an ill-conditioned Hessian matrix H. A possible solution often used in this context is to introduce a diagonal regularization matrix: H is replaced by H + λI, where λ is a small scalar and I is the identity matrix. The regularization factor λ is chosen to be small enough not to change the structure in H but big enough to allow the inversion: a compromise must be found. To determine the factor λ representing the right trade-off, we use three regularization criteria together with a discrepancy measure between the nonregularized H and the regularized matrix H + λI. The regularization criteria are the condition number with respect to inversion (the lower the better); the P-number, which is a positive integer if the matrix is not positive definite and zero otherwise (the lower the better); and the number of negative diagonal terms in the matrix (the lower the better). For the discrepancy measure between H and H + λI, we use the RMS difference between the square roots of the positive diagonal elements of the two matrices (the lower the better). This quantity measures the differences that the regularization has introduced in the standard deviations of the two covariance matrices. In Figure 1, the variations of these four quantities for λ increasing from 0 to 50 are shown. A good compromise is found to be λ = 12.0: the regularization criteria are satisfactory (positive definite matrix, all diagonal terms positive, minimum condition number), and the discrepancy measure is still small.

2.7 PDF of Network Weights.
To complete the analysis of the uncertainties due to the inversion algorithm, the posterior distribution of the network weights needs to be determined. As previously stated, this PDF represents a plausibility of weights w, not a probability of finding the particular weights. We saw in section 2.3 that this distribution follows a gaussian PDF with mean w and covariance matrix H −1 . In Figure 2, the optimum weights w are shown together with ± two standard deviations. As previously noted, weights between the hidden and the output layers are more variable than weights between the input and the hidden layers. This is due to the fact that the first processing stage of the NN, at the hidden layer level, is a high-level processing that includes the nonlinearity of the network. The second processing stage of the NN, from the hidden layer to the output layer, is just a linear postprocessing of the hidden layer.
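The gaussian weight posterior N(w*, H^{-1}) described above can be sampled directly to obtain an ensemble of plausible networks. The sketch below uses hypothetical numbers (a three-weight "network") purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small problem: w_star is the trained weight vector and
# H the Hessian of the quality criterion evaluated at w_star.
w_star = np.array([0.5, -1.2, 2.0])
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])

# Laplace approximation: the weight posterior is N(w_star, H^-1).
cov_w = np.linalg.inv(H)

# Draw R weight samples; each sample w_r defines a particular NN, and
# together the samples represent the uncertainty on the NN weights
# (usable later for Monte Carlo integration under the weight PDF).
R = 10000
samples = rng.multivariate_normal(w_star, cov_w, size=R)

print(samples.mean(axis=0))  # close to w_star
print(np.cov(samples.T))     # close to H^-1
```

In a real network, w would have hundreds of components and, as the text notes, an advanced sampling technique would be preferable to direct multivariate-normal draws.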
Figure 1: Quality criteria for variable λ. (A) The number of negative diagonal terms in the matrix. (B) RMS differences between the square roots of the positive diagonal elements of the matrices. (C) The condition number with respect to inversion. (D) The P-number, which is a positive integer if the matrix is not positive definite and zero otherwise. See the text.
It is possible to know much more than just the output estimates of an NN. From the distribution of weights, samples {wr ; r = 1, . . . , R} of NN weights can be chosen. Each of the R samples wr represents a particular NN. Together, they represent the uncertainty on the NN weights. These samples can be used later to integrate under the PDF of weights in a Monte Carlo approach. For neural networks, the number of parameters (i.e., size of w) is big, so it is preferable to use an advanced sampling technique. Even if these samples are included within the large variability of the two standard deviations envelope, correlation constraints avoid random oscillations from noise by imposing some structure on them. The weights have considerable latitude to change, but their correlations constrain them to follow a strong dependency structure. This is why different weight configurations can result in the same outputs. Most important for network processing is the structure of these correlations. For example, if the difference of two inputs is a good predictor, as long as two weights linked to the two inputs perform the
Figure 2: Mean network weights w* ± 2 standard deviations. (A) The first 100 NN weights, corresponding to input/hidden layer connections. (B) All 821 NN weights, with weights 510 to 819 corresponding to hidden/output layer connections.
difference, the absolute value of the weights is not essential. Another source of uncertainty for the weights is the fact that some permutations of neurons have no impact on the network output. For example, if two neurons in the hidden layer of the network are permuted, the network answer would not change. The sigmoid function used in the network is saturated when the neuron activity entering is too low or too high. This means that a change of a weight going to this neuron would have a negligible consequence. These are just a few reasons that explain why the network weights can vary and still provide a good general fitting model. Variability of the network weights is considered a natural variability, inherent to the neural technique. Furthermore, what is important for the NN user is not the variability of the weights but the uncertainty that this variability produces in the network outputs or in even more complex quantities such as the network Jacobians.
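The hidden-neuron permutation symmetry mentioned above is easy to verify numerically. The minimal one-hidden-layer MLP below is a hypothetical stand-in for the paper's network; permuting two hidden units (and the corresponding rows and columns of the weight matrices) leaves the output exactly unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: linear output layer over sigmoid hidden units."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    return W2 @ h + b2

# Hypothetical small network: 4 inputs, 3 hidden units, 2 outputs.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x = rng.normal(size=4)

# Swap hidden neurons 0 and 1: permute rows of W1/b1 and columns of W2.
perm = [1, 0, 2]
y_ref = mlp(x, W1, b1, W2, b2)
y_perm = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)
print(np.allclose(y_ref, y_perm))  # True
```

This is one concrete reason why very different weight configurations can produce identical network outputs, so the weight variability matters less than the output uncertainty it induces.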
3 Uncertainty in Network Outputs

3.1 Theoretical Derivation of the Network Output Error PDF. The distribution of uncertainties of the NN output y is given by

P(y|x, D) = \int P(y|x, w) \cdot P(w|D)\, dw    (3.1)

where D is the set of outputs y in a data set B = {(x^{(n)}, t^{(n)}); n = 1, ..., N} of N matched input-output couples. Using equations 2.4 and 2.9, we find that this probability is equal to

\frac{1}{Z} \int e^{-\frac{1}{2}\, (t - g_w(x))^T \cdot A_{in} \cdot (t - g_w(x))} \cdot e^{-\frac{1}{2}\, \Delta w^T \cdot H \cdot \Delta w}\, dw    (3.2)

where A_in is the inverse of C_in, the covariance matrix of the "intrinsic noise" of the physical variables y, and H is the Hessian matrix of the quality criterion used by the learning process. Note that all the terms not dependent on w have been put together in the normalization factor Z. A first-order expansion of the NN function g_w about the optimum weights w* is now used:

g_w(x) = g_{w^*}(x) + G^T \cdot \Delta w    (3.3)
where

G = \nabla g_w\, \big|_{w = w^*}    (3.4)
is a W × M matrix. Introducing equation 3.3 into 3.2, and using \varepsilon_y = (t - g_{w^*}(x)), we obtain

P(t|x, D) \propto e^{-\frac{1}{2}\, \varepsilon_y^T \cdot A_{in} \cdot \varepsilon_y} \int e^{-\varepsilon_y^T \cdot A_{in} \cdot (G^T \Delta w)}\, e^{-\frac{1}{2}\, \Delta w^T \cdot (G \cdot A_{in} \cdot G^T + H) \cdot \Delta w}\, dw    (3.5)

\propto e^{-\frac{1}{2}\, \varepsilon_y^T \cdot A_{in} \cdot \varepsilon_y} \int e^{h^T \cdot \Delta w - \frac{1}{2}\, \Delta w^T \cdot O \cdot \Delta w}\, dw    (3.6)

where h = [-\varepsilon_y^T \cdot A_{in} \cdot G^T]^T and O = G \cdot A_{in} \cdot G^T + H. The integral term in equation 3.6 can be simplified to

(2\pi)^{\frac{W}{2}}\, |O|^{-\frac{1}{2}}\, e^{\frac{1}{2}\, h^T \cdot O^{-1} \cdot h}    (3.7)
We can rewrite equation 3.6 using this simplification to obtain

P(t|x, D) \propto e^{-\frac{1}{2}\, \varepsilon_y^T \cdot A_{in} \cdot \varepsilon_y}\, e^{\frac{1}{2}\, \varepsilon_y^T \cdot A_{in} \cdot G^T (G \cdot A_{in} \cdot G^T + H)^{-1} \cdot G \cdot A_{in} \cdot \varepsilon_y}    (3.8)

\propto e^{-\frac{1}{2}\, \varepsilon_y^T \cdot [A_{in} - A_{in} \cdot G^T (G \cdot A_{in} \cdot G^T + H)^{-1} \cdot G \cdot A_{in}] \cdot \varepsilon_y}    (3.9)
This means that the distribution of t follows a gaussian distribution with mean g_{w^*}(x) and covariance matrix

C_0 = [A_{in} - A_{in} \cdot G^T (G \cdot A_{in} \cdot G^T + H)^{-1} \cdot G \cdot A_{in}]^{-1}    (3.10)

This covariance matrix can be simplified, using the matrix inversion lemma, to obtain

C_0 = C_{in} + G^T \cdot H^{-1} \cdot G    (3.11)
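The equivalence between equations 3.10 and 3.11 can be checked numerically. The dimensions and matrices below are hypothetical stand-ins, chosen only so that H is positive definite and C_in invertible.

```python
import numpy as np

# Hypothetical dimensions: W weights, M outputs.
W, M = 5, 2
rng = np.random.default_rng(2)

A = rng.normal(size=(W, W))
H = A @ A.T + W * np.eye(W)     # a positive definite Hessian at w*
G = rng.normal(size=(W, M))     # gradient of g_w at w* (W x M), eq. 3.4
C_in = np.diag([0.5, 0.1])      # intrinsic noise covariance of the targets

# Equation 3.11: total output covariance = intrinsic + neural inversion term.
C_0 = C_in + G.T @ np.linalg.inv(H) @ G

# Equation 3.10, evaluated directly (matrix inversion lemma equivalence):
A_in = np.linalg.inv(C_in)
C_0_alt = np.linalg.inv(
    A_in - A_in @ G.T @ np.linalg.inv(G @ A_in @ G.T + H) @ G @ A_in
)
print(np.allclose(C_0, C_0_alt))  # True
```

The two terms of C_0 separate cleanly: C_in does not depend on the input, while G (and hence the inversion term) is situation dependent.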
We see that the uncertainty in the network outputs is due to the intrinsic noise of the target data, embodied in C_in, and to the uncertainty described by the posterior distribution of the weight vector w, embodied in G^T · H^{-1} · G. This relation describes the fact that the uncertainties are approximately related to the inverse data density. As expected, uncertainties are larger in the less dense parts of the data space, where the learning algorithm gets less information.

3.2 Sources of Uncertainty. In equation 3.11, we have separated the sources of error into two terms: the intrinsic noise, with covariance matrix C_in, and the neural inversion term, with covariance matrix G^T H^{-1} G. Our neural inversion term refers to the errors due only to the uncertainty in the inverse model parameters; all the remaining outside sources of errors are grouped in C_in. The inversion uncertainty can itself be decomposed into three sources, corresponding to the three main components of an NN model:

1. The imperfections of the learning data set B, which include simulation errors when B is simulated by a model; collocation and instrument errors when B is a collection of coincident inputs and outputs; null-space errors; and others. This is probably the most important source of uncertainty due to the inversion technique.

2. Limitations of the network architecture, because the model might not be optimum, with too few degrees of freedom or a structure that is not optimal. This is usually a lower-level source of uncertainty because the network can partly compensate for these deficiencies.

3. A nonoptimum learning algorithm, because however good the optimization technique is, it is impossible in practice to be sure that the global minimum w* has been found instead of a local one. We think that this source of uncertainty is limited.

The matrix C_in includes all other sources of errors. Our approach allows for the estimation of the global C_in, but if some individual terms are known,
it is possible to subtract them from C_in. For example, if the instrument noise is known, it is possible to measure the impact of this noise on the NN outputs. The individual terms can then be subtracted from the global C_in. For simplification, and because we do not use such a priori information, we adopt the hypothesis that C_in is constant for each situation; only the inversion term is situation dependent. But any a priori information about any nonconstant term in C_in could be used in this very flexible approach. Note that the specification of the sources of uncertainty in the approach of Rodgers (1990) uses mainly the concept of Jacobians of either the direct or the inverse model in order to linearize the impact of each error source. Linear and gaussian variables are easily manageable analytically, the algebra being essentially based on the covariance matrices, for example:

• C_M = D_x · E · D_x^T, the covariance of the errors due to instrument noise, where D_x = ∂g_w/∂x is the contribution function and E = ⟨η · η^T⟩ is the covariance matrix of the instrument noise η; or

• F = A_b · C_b · A_b^T, the covariance of the forward model errors, where C_b is the error covariance matrix of the forward model parameters b, and A_b is the sensitivity matrix of the observations with respect to b (Rodgers, 1990).

Some bridges can be built to link our error analysis and the approach used in variational assimilation by Rodgers (1990). In section 4, such Jacobians are analytically derived in the neural network framework. This makes the use of Rodgers' estimates feasible. The difference would be that our linearization uses Jacobians that are situation dependent; this means that the estimation of the error sources would be nonlinear in nature. This will be the subject of another study. In Wright et al. (2000), noise in NN inputs is considered an additional source of uncertainty. An approach for the empirical characterization of the various sources of uncertainties is to use simulations.
For example, for the instrument noise-related uncertainty, it is easy to introduce a sample of noise into the network inputs and analyze the consequent error distribution of the outputs. The advantage of such a simulation approach is that it is very flexible and allows for the manipulation of nongaussian distributions. This will be the subject of another study.

3.3 Distribution of Network Outputs. After the learning stage, we estimate C_0, the covariance matrix of the network errors ε_y = (t − g_{w*}(x)), over the database B. Equation 3.11 shows that this covariance adds the errors due to the neural network uncertainties and all other sources of uncertainty. Table 2 gives the numerical values of C_0 for the particular example from Prigent et al. (2003). The right/top triangle is for the correlation, and the left/bottom triangle is for the covariance. The diagonal values give the variance of the errors for each quantity. The correlation part indicates clearly that some errors are highly
Table 2: Covariance Matrix C0 of Network Output Error Estimated over the Database B.

         Ts         WV         Em 19V    Em 19H    Em 22V    Em 37V    Em 37H    Em 85V    Em 85H
Ts       2.138910   −0.24      −0.87     −0.72     −0.76     −0.84     −0.72     −0.49     −0.32
WV       −1.392113  14.708836  0.16      −0.06     0.14      0.05      −0.15     −0.18     −0.37
Em 19V   −0.006294  0.003179   0.000024  0.77      0.88      0.89      0.74      0.60      0.42
Em 19H   −0.005261  −0.001143  0.000019  0.000024  0.72      0.73      0.81      0.60      0.56
Em 22V   −0.006274  0.003140   0.000024  0.000020  0.000031  0.84      0.71      0.71      0.54
Em 37V   −0.006121  0.001049   0.000021  0.000018  0.000023  0.000024  0.81      0.70      0.50
Em 37H   −0.005290  −0.002954  0.000018  0.000020  0.000020  0.000020  0.000025  0.65      0.67
Em 85V   −0.004895  −0.004945  0.000020  0.000020  0.000027  0.000023  0.000022  0.000046  0.79
Em 85H   −0.003906  −0.011933  0.000017  0.000022  0.000024  0.000020  0.000027  0.000044  0.000067

Notes: The right/top triangle is for correlation, and the left/bottom triangle is for covariance; the diagonal gives the variance.
correlated. This is why it would be a mistake to monitor only the error bars, even if they are easier to understand. The correlations of errors exhibit the expected physical behavior. Errors in Ts are negatively correlated with the other errors, with large values of correlation with the vertical polarization emissivities, for the channels that are much less sensitive to the water vapor (Em 19V and Em 37V). The vertical polarization emissivities are larger than for the horizontal polarizations and are often close to one, with the consequence that the radiative transfer equation in channels that are much less sensitive to the water vapor (the 19 and 37 GHz channels) is quasi-linear in Ts and in Em V. In contrast, errors in water vapor are weakly correlated with the other errors: the largest correlation is with the emissivity at 85 GHz in the horizontal polarization. The 85 GHz channel is the most sensitive to water vapor, and, since the emissivity for the horizontal polarization is lower than for the vertical, the horizontal polarization channel is more sensitive to water vapor. Correlations between the water vapor and the emissivities errors are positive or negative, depending on the respective contribution of the emitted and reflected energy at the surface (which is related not only to the surface emissivity but also to the atmospheric contribution at each frequency). Correlations between emissivity errors are always of the same signs and are high for the same polarizations, decreasing when the difference in frequency increases. The correlations involved in the PDF of the errors described by the covariance matrix C0 make it necessary to understand the uncertainty in a multidimensional space. 
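Examining the correlated errors in a multidimensional way, rather than through individual error bars, can be done by converting the covariance to a correlation matrix and decomposing it into orthogonal eigenvectors (the "error patterns" of Rodgers, 1990). The 3×3 matrix below is a hypothetical stand-in for C_0:

```python
import numpy as np

# A covariance matrix of output errors (hypothetical stand-in for C_0).
C_0 = np.array([[ 2.0, -1.0, -0.5],
                [-1.0,  1.5,  0.3],
                [-0.5,  0.3,  1.0]])

# Correlation matrix: the off-diagonal structure that individual
# error bars (the diagonal alone) would miss.
std = np.sqrt(np.diag(C_0))
corr = C_0 / np.outer(std, std)

# Orthogonal eigenvectors of C_0 form a set of error patterns;
# each eigenvalue gives the error variance along its pattern.
eigval, eigvec = np.linalg.eigh(C_0)
print(corr)
print(eigval)
```

For a valid covariance matrix all eigenvalues are positive, and the dominant error patterns (largest eigenvalues) summarize how the output errors co-vary.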
This is more challenging than just determining the individual error bars, but it is also much more informative: the diagonal elements of the covariance matrix provide the variance for each output error, but the off-diagonal terms show the level of dependence among these output errors. To statistically analyze the covariance matrix C0 , this matrix can be decomposed into its orthogonal eigenvectors (not shown). These base functions constitute a set of error patterns (Rodgers, 1990). 3.4 Covariance of Output Errors Due to the Neural Inversion. The matrix H −1 is the covariance of the PDF of network weights. The use of the gradient G transforms this matrix into GT H −1 G, the covariance error of the NN outputs associated with the uncertainty of weights. Note that multiplication by G partially regularizes H −1 , so that for this particular purpose of the estimation of the output errors, H does not need to be regularized. Table 3 represents this covariance matrix GT H −1 G averaged over the whole learning database B . Even if some of the bottom-left values representing the covariance matrix are close to zero (this is an artifact since the variability ranges of the variables are quite different from each other), structure is still present in this matrix, as is shown in the correlation part (top right). The error correlation matrix GT H −1 G, related to the NN inversion method, has relatively small magnitudes with a maximum of 0.55. However,
Table 3: Covariance Matrix GT H−1 G of Error Due to Network Uncertainty, Averaged over the Database B.

         Ts         WV         Em 19V    Em 19H    Em 22V    Em 37V    Em 37H    Em 85V    Em 85H
Ts       0.493615   −0.14      −0.28     −0.14     −0.25     −0.32     −0.16     −0.19     −0.06
WV       −0.106484  1.063071   0.10      −0.02     0.09      0.02      −0.07     −0.15     −0.25
Em 19V   −0.000325  0.000167   0.000002  0.33      0.55      0.55      0.28      0.27      0.08
Em 19H   −0.000255  −0.000060  0.000001  0.000006  0.26      0.22      0.29      0.10      0.13
Em 22V   −0.000268  0.000152   0.000001  0.000001  0.000002  0.50      0.26      0.28      0.12
Em 37V   −0.000330  0.000033   0.000001  0.000000  0.000001  0.000002  0.34      0.38      0.14
Em 37H   −0.000270  −0.000183  0.000001  0.000001  0.000000  0.000001  0.000005  0.16      0.26
Em 85V   −0.000231  −0.000282  0.000000  0.000000  0.000000  0.000000  0.000000  0.000002  0.43
Em 85H   −0.000128  −0.000681  0.000000  0.000000  0.000000  0.000001  0.000001  0.000001  0.000006

Notes: The right/top triangle is for correlation, and the left/bottom triangle is for covariance; the diagonal gives the variance.
it has a structure similar to the global correlation matrix, with the same signs of correlation and similar relative values between the variables.

3.5 Covariance of the Intrinsic Noise of Target Values. To estimate C_in, we use equation 3.11:

C_{in} = \langle C_0 \rangle_B - \langle G^T \cdot H^{-1} \cdot G \rangle_B    (3.12)
where the two right-hand terms are the covariance matrix of the total output errors averaged over B (see section 3.3) and the covariance matrix of the output errors due to the network inversion scheme averaged over B (see section 3.4). Table 4 gives the numerical values of the matrix Cin : The right/top triangle is for the correlation and the left/bottom triangle is for the covariance. Intrinsic error correlations can be very large (up to 0.99). The structure of Cin is also very similar to the structure of the global error correlation matrix, the only noticeable difference being the larger correlation values. An important remark is that the covariance error is dominated, in this application, by the intrinsic errors. This behavior is totally dependent on the particular application that is treated. 3.6 Network Outputs Error Estimate. Once Cin is available, we can estimate a C0 (x) that is dependent on the observations x, the term GT H −1 G varying with input x. It should be noted that the use of the regularization for matrix H has virtually no consequences for the results obtained for the error bars in the following. Using no regularization for the Hessian matrix is possible since H is multiplied by the gradients in GT H −1 G. This is an additional argument that the regularization helps the matrix inversion without damaging the information in the Hessian. C0 (x) is estimated for each of the 1,239,187 samples for clear-sky pixels in July 1992. Figure 3 presents the monthly mean standard deviations (square root of the diagonal terms in C0 (x)) for four outputs: the surface skin temperature Ts, the columnar integrated water vapor WV, and the microwave emissivities at 19 GHz for vertical and horizontal polarizations. The errors exhibit the expected geographical patterns. Large errors on Ts are concentrated in regions where the emissivities are lower or highly variable: inundated areas and deserts. 
In inundated areas, for instance (around rivers like the Amazon or the Mississippi), or in coastal regions, the contribution from the surface is weaker, and the sensitivity to Ts is lower because the emissivities are lower. In sandy desert regions, due to the higher transmission of the very dry sandy medium, the microwave radiation does not come from the first few millimeters of the surface but from deeper below the surface; the lower the frequency, the deeper (Prigent & Rossow, 1999). As a consequence, the microwave radiation is not directly related to the skin surface temperature (see Prigent & Rossow, 1999, for a detailed explanation), and Ts cannot be retrieved with the same accuracy.
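The computation of Cin and of the situation-dependent C0(x) described above can be sketched numerically. In the following Python sketch, the per-sample output Jacobians G(x), the regularized Hessian H, and the output errors over B are all hypothetical random stand-ins (none of the values come from the article); the sketch assumes, consistent with equation 3.12, that C0(x) = Cin + G(x)^T H^{−1} G(x).

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, n = 40, 4, 200          # toy sizes: weights, outputs, samples in B

# Hypothetical stand-ins: per-sample output Jacobians G(x) (p x m) and
# per-sample total output errors; in the article these come from the
# trained network and the database B.
G_all = rng.normal(size=(n, p, m)) / np.sqrt(p)
errors = rng.normal(size=(n, m))

# Hessian regularized to be positive definite (H + lambda*I, as in the text).
A = rng.normal(size=(p, p))
H = A @ A.T + 660.0 * np.eye(p)        # lambda = 660, the article's value
H_inv = np.linalg.inv(H)

# C0^B: covariance of the total output errors, averaged over B.
C0_B = np.cov(errors, rowvar=False)

# (G^T H^-1 G)^B: inversion-scheme error covariance, averaged over B.
GHG_B = np.mean([G.T @ H_inv @ G for G in G_all], axis=0)

# Equation 3.12: intrinsic-noise covariance.
C_in = C0_B - GHG_B

def C0_of_x(G_x):
    """Situation-dependent output-error covariance for one observation x."""
    return C_in + G_x.T @ H_inv @ G_x

std_x = np.sqrt(np.diag(C0_of_x(G_all[0])))   # per-output error bars
```

The square roots of the diagonal of C0(x) play the role of the mapped standard deviations of Figure 3.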
Table 4: Covariance Matrix Cin of Intrinsic Noise Errors, Estimated over the Database B.

          Ts          WV          Em 19V     Em 19H     Em 22V     Em 37V     Em 37H     Em 85V     Em 85H
Ts        1.645294    −0.27       −0.99      −0.92      −0.86      −0.95      −0.88      −0.55      −0.37
WV       −1.285629    13.645765    0.17      −0.06       0.14       0.05      −0.16      −0.19      −0.39
Em 19V   −0.005968     0.003011    0.000021   0.89       0.91       0.92       0.83       0.63       0.46
Em 19H   −0.005006    −0.001083    0.000017   0.000017   0.83       0.86       0.98       0.71       0.66
Em 22V   −0.006005     0.002988    0.000023   0.000019   0.000029   0.87       0.80       0.75       0.58
Em 37V   −0.005790     0.001015    0.000020   0.000017   0.000022   0.000022   0.90       0.72       0.54
Em 37H   −0.005019    −0.002770    0.000017   0.000018   0.000019   0.000019   0.000019   0.74       0.76
Em 85V   −0.004663    −0.004662    0.000019   0.000019   0.000026   0.000022   0.000021   0.000043   0.82
Em 85H   −0.003777    −0.011251    0.000016   0.000021   0.000024   0.000019   0.000026   0.000042   0.000060

Notes: The right/top triangle is for correlation, and the left/bottom triangle is for covariance; the diagonal gives the variance. Correlations with absolute value higher than 0.3 are in bold.
2434 F. Aires, C. Prigent, and W. Rossow
Neural Network Uncertainty Assessment
2435
Figure 3: Standard deviation of error maps for (A) surface skin temperature Ts, (B) columnar integrated water vapor WV, (C) microwave emissivity at 19 GHz vertical polarization, and (D) microwave emissivity at 19 GHz horizontal polarization.
The same arguments hold for the errors in emissivity. All the parameters being tightly related for a given pixel, the water vapor errors are also rather large in inundated regions and in sandy areas.

3.7 Outlier Detection. What is the behavior of the neural retrieval when the situation is particularly difficult, as when the first guess is far from the actual solution? In principle, the nonlinearity of the NN allows it to weight the observations and the first-guess information differently, depending on the situation. For example, if the first guesses are better in tropical cases than in polar cases, the NN will have inferred this behavior during the learning stage and will then give less emphasis to the first guess when a polar situation is to be inverted. This assumes, once again, that the training data set is correctly sampled. To better understand the behavior of the uncertainty estimates, a good strategy is to introduce artificial errors in each source of information and to analyze the resulting impact on the network outputs. In Figure 4, the retrieval STD error change index is presented to show the effect of perturbing the mean inputs or the mean first guesses by an artificial error. The impact of these artificial errors is measured as a percentage of the regular STD retrieval error. For example, an impact index of 120% means that the regular STD retrieval error estimate increases by 20% when the input is perturbed. The impact indices can be compared for each of the nine network outputs. These results are obtained by averaging over the 20,000 samples in B. Figure 4A presents the error impacts when all 17 network inputs are changed by a factor ranging from −5% to +5%. Obviously, this corresponds to incoherent situations, since the complex nonlinear relationships between vertical and horizontal brightness temperatures and first guesses are not respected. As expected, the error increases monotonically with the absolute value of the perturbation.
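The impact index just defined is a simple ratio; a minimal sketch (the baseline and perturbed values below are hypothetical illustrations, not the article's numbers):

```python
import numpy as np

def std_error_change_index(std_perturbed, std_baseline):
    """Impact index as a percentage of the regular STD retrieval error:
    an index of 120 means the error estimate grew by 20% under the
    perturbation. Works elementwise on arrays of per-output STDs."""
    return 100.0 * np.asarray(std_perturbed) / np.asarray(std_baseline)

# Hypothetical baseline and perturbed STD error estimates for two outputs:
baseline = np.array([1.46, 0.80])
perturbed = np.array([1.75, 0.88])
idx = std_error_change_index(perturbed, baseline)   # percentages per output
```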
However, the impact is not uniform among the output variables. For WV, which is retrieved with rather low accuracy, changes in the inputs do not have a large influence. The impact on the emissivities is larger for horizontal polarizations than for vertical ones: horizontal-polarization emissivities are much more variable than vertical ones, and as a consequence, the vertical-polarization emissivities take rather similar output values whatever the situation and do not depend as much on the inputs. It can also be noted that positive perturbations have a slightly stronger impact than negative ones. This is related to the distribution of the variables in the training database. For the emissivities, for instance, the distribution has a steep cut-off at unit emissivity, above which the emissivities are not physical. On the contrary, a large range of emissivities exists in the training database at lower values (see Figure 3 in Aires et al., 2001). As a consequence, decreasing the emissivity first guess will still be physically realistic, whereas increasing it will not be. Figure 4B is the same except that the changes are made only for the first-guess inputs. We note a similar behavior (nonuniform impact among output
Figure 4: Estimated STD error change index for an artificial perturbation: (A) of the mean input, (B) of the mean first-guess input, (C) of individual first-guess negative changes, and (D) of individual first-guess positive changes. See the detailed explanation in the text. Statistics are performed over 20,000 samples from B.
variables and with larger impact for positive perturbations), but we observe also that errors are larger than when all the inputs are perturbed in Figure 4A. This suggests that the error estimate is able to detect inconsistencies between observations and first-guess inputs. In Figures 4C and 4D, the first-guess input variables are perturbed individually with, respectively, negative and positive amplitude of 5%. For negative perturbations, the biggest impact is produced by the Ts first-guess perturbation: it is noticeable that the Ts error impact is similar for the retrieval of Em 19H and for its own retrieval. For other variables, the impacts have lower levels, with almost no impact from the WV first guess. The WV first guess is associated with a large error (40%), and as a consequence the NN gives little importance to this first guess. For positive individual perturbations in Figure 4D, the results are similar to the negative errors. The magnitude of the positive changes as compared to the negative ones is related again to the distribution of the variables in the training data set (see Figure 3 in Aires et al., 2001): If the distribution is not symmetrical around
Figure 5: Estimated STD error change index for an artificial perturbation: (A) of horizontal polarization brightness temperatures, (B) of vertical polarization brightness temperatures, (C) of horizontal polarization first-guess emissivities, and (D) of vertical polarization first-guess emissivities. Statistics are performed over 20,000 samples from B.
a mode value, depending on the shape of the distribution, increasing or decreasing the value can be more or less realistic.

In Figure 5, “incoherencies” have been introduced between the vertical and horizontal polarizations in the brightness temperature (TB) observations and in the first-guess emissivities, by increasing or decreasing one while keeping the other polarization constant. In Figure 5A, we artificially increased and decreased the horizontal TB by 5%, and in Figure 5B, the same has been done for the vertical polarizations. Figures 5C and 5D are similar but for first-guess emissivities instead of TB. Several comments can be made. First, the impact is larger for observation errors than for first-guess errors, which suggests that the observations are more important for the retrieval, the first guess being used mostly as an additional constraint. Second, these polarization inconsistencies have a bigger impact than the changes of the means in Figure 4. For example, the NN might emphasize the difference of polarization for the retrieval, and then these inconsistencies would have a very strong impact. This shows that the NN, using complex nonlinear multivariate relationships, is sensitive to inconsistencies among the inputs. It is encouraging to see that our error estimates are able to detect such situations. Finally, the relative impact of the positive and negative changes can again be explained by the distribution of the variables in the learning database. For the emissivities, whatever the polarization and the frequency, the histograms are not symmetrical, having a broad tail toward lower values and an abrupt end at the higher values. As a consequence, when artificially increasing the emissivities, unrealistic values are attained, which is not the case when decreasing them. (See Aires et al., 2001, for a complete description of the distributions of the learning database and the histograms of the inputs.)

The results shown in Figures 4 and 5 are consistent with a coherent physical behavior, confirming that the new tools developed in this study and its companion articles can be used to diagnose difficult retrieval situations such as those caused by bad first guesses, inconsistent measurements, situations not included in the training data set, or uncertainties of the NN on the possible retrievals. Our a posteriori probability distributions for the NN retrieval define confidence intervals on the retrieved quantities that allow the detection of such situations. It could be argued that a limitation of our retrieval uncertainty estimates comes from the fact that our technique is based on statistics over a data set B. This could mean that the error estimate is valid only when we are inside the variability spanned by B. On the contrary, it has been shown that the local quadratic approximation approach increases error estimates in sparsely sampled domains of the data space (see, e.g., MacKay, 1992).

4 Network Jacobian Uncertainties

The a posteriori distribution of weights is useful for estimating the uncertainties of the network outputs (see section 3).
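The use of the weight posterior for output uncertainties can be illustrated with a short sketch. Everything below is a hypothetical stand-in (toy MAP weights, a toy regularized Hessian, and a toy scalar "network"): samples are drawn from the gaussian weight PDF and propagated through the network to obtain a confidence interval on its output.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 30                                   # toy number of weights

# Hypothetical MAP weights and regularized Hessian (section 2 stand-ins).
w_map = rng.normal(size=p)
A = rng.normal(size=(p, p))
H = A @ A.T + 660.0 * np.eye(p)
cov_w = np.linalg.inv(H)                 # gaussian posterior covariance

# Toy 'network': a fixed nonlinear scalar map of the weights, for one input x.
x = rng.normal(size=p)
def network_output(w):
    return np.tanh(w @ x)

# Propagate R weight samples through the network.
R = 2000
w_samples = rng.multivariate_normal(w_map, cov_w, size=R)
y_samples = np.array([network_output(w) for w in w_samples])

# 95% confidence interval on the network output.
lo, hi = np.percentile(y_samples, [2.5, 97.5])
```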
We will now show that these distributions can also be used for the estimation of complex probabilistic quantities via Monte Carlo simulations. As an example of such an approach, we use it to estimate the uncertainties of the NN Jacobians.

4.1 Definition of Neural Network Sensitivities. The NN technique not only provides a statistical model relating the input and output quantities, it also enables an analytical and fast calculation of the neural Jacobians (the derivatives of the analytical expression of the NN model), also called the neural sensitivities or adjoint model (Aires et al., 1999). For example, the neural Jacobians for the two-layered MLP (an MLP network with one hidden layer) are

∂yk/∂xi = Σ_{j∈S1} wjk · (dσ/da)(Σ_{i∈S0} wij xi) · wij.    (4.1)
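Equation 4.1 can be checked numerically. The sketch below implements the analytical Jacobian of a toy two-layer MLP (hypothetical random weights, biases omitted for brevity) and verifies it against a finite-difference approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 5, 7, 3

# Hypothetical two-layer MLP weights.
W1 = rng.normal(size=(n_in, n_hid))      # w_ij: input i -> hidden j
W2 = rng.normal(size=(n_hid, n_out))     # w_jk: hidden j -> output k
sigma = np.tanh                          # hidden-layer transfer function
dsigma = lambda a: 1.0 - np.tanh(a) ** 2

def forward(x):
    return sigma(x @ W1) @ W2

def jacobian(x):
    """Equation 4.1: dy_k/dx_i = sum_j w_jk * sigma'(a_j) * w_ij,
    with hidden activations a_j = sum_i w_ij x_i."""
    a = x @ W1
    return W1 @ np.diag(dsigma(a)) @ W2   # shape (n_in, n_out)

x = rng.normal(size=n_in)
J = jacobian(x)

# Central finite-difference check of the analytical Jacobian.
eps = 1e-6
J_fd = np.array([(forward(x + eps * e) - forward(x - eps * e)) / (2 * eps)
                 for e in np.eye(n_in)])
```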
For a more complex MLP network with more hidden layers, a backpropagation algorithm exists that efficiently computes the neural Jacobians (Bishop, 1996). Since the NN is nonlinear, these Jacobians depend on the situation defined by the particular input x. The neural Jacobian concept is a very powerful tool since it allows for a statistical estimation of the multivariate and nonlinear sensitivities connecting the input and output variables in the model under study, which makes it a useful data analysis tool (Aires & Rossow, 2003). The Jacobian matrix with terms given by equation 4.1 describes the global sensitivities for each retrieved parameter: its entries indicate the relative contribution of each input in the retrieval of a given output parameter. The Jacobian is situation dependent, which means that depending on the situation x, the NN uses the available information in different ways.

4.2 Sampling Strategy for Network Weights. To go beyond the point-estimation approach, where a learning algorithm is used to estimate only the optimal set of weights, the uncertainty distribution of the weights w must be investigated. This distribution of weights can be used to estimate complex probabilistic quantities such as the confidence intervals of stochastic variables, the distribution of the outputs, and other probabilities of quantities that depend on the output of the network. All these potential applications require integration under the PDF of the weights. Fortunately, the a posteriori distribution of weights is gaussian (see section 2). This means that the normalization term 1/ZN in equation 2.9 is easily obtained (this is a main difficulty when integrating a PDF). The integration and manipulation of a gaussian PDF are particularly easy compared to other distributions. However, when faced with the estimation of complex quantities, the analytical solution of such integrations can still be difficult to obtain. The estimation of the PDF of the network Jacobians is such a situation.
This is why simulation strategies have to be used. Simulations first sample the PDF of the weights, {wr; r = 1, . . . , R}, and then use this sample to approximate the integration under the whole weight PDF. Using only w, the MAP parameters, to directly estimate some other dependent quantities (such as the NN Jacobians) may not be optimal, even if we are not interested in uncertainty estimates. In fact, most of the mass of the distribution (i.e., the location of the domain where the probability is higher) in a high-dimensional space can be far from the most probable state (i.e., the MAP state). The high dimension puts more of the mass of the PDF on the periphery of the density domain and less at its center. Nonlinearities can also distort the distribution of the estimated quantity. This is why it is good to use R samples of the weights {wr; r = 1, . . . , R} to estimate the density of the quantity of interest. Concerning the network Jacobians, the MAP network Jacobian is given by using the most probable network weights w. The mean Jacobian is not sufficient for a real sensitivity analysis; a measure of the uncertainty in this
estimate is required as well. In fact, the NN is designed to reproduce the right outputs, but without any a priori information, the internal regularities of the network are not constrained. As a consequence, the internal regularities, such as the NN Jacobians, are expected to have a large variability. This variability needs to be monitored. To estimate the uncertainties of the Jacobians, we use R = 1000 samples from the weights PDF described in section 2. Using an adequate sampling algorithm is a key issue here. To sample this gaussian distribution in a very high-dimensional space (about 800 network weights), the Metropolis algorithm is used. This method is also suitable for nongaussian PDFs. For each weight sample wr, we estimate the mean Jacobian over the entire data set B. This means that we have at our disposal a sample of R = 1000 mean Jacobians. They are then averaged, and a PDF for each individual term in the Jacobian matrix is obtained.

4.3 Multicollinearity Problem. Table 5 gives the mean neural Jacobian values for the variables xk and yi of the neural network, as defined in equation 4.1. These values indicate the relative contribution of each input in the retrieval of a given output. The numbers correspond to global means over B, which may mask rather different behaviors in various regions of the input space. The standard deviations of the uncertainty PDF are also indicated. The variability of the Jacobians is large: the uncertainty of the neural sensitivities can be up to several times the mean value. In most cases, the confidence interval of the Jacobian includes zero, which means that the value is not significant. In linear regression, obtaining nonsignificant parameters is often a signal that multicollinearities are a problem for the regression. The distribution of the Jacobians shows that most of them are not statistically significant.
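The uncertainty estimates above come from sampling the weight PDF. A minimal sketch of the procedure follows, with toy dimensions and a toy diagonal gaussian posterior in place of the article's roughly 800-weight posterior, and a placeholder scalar quantity standing in for the mean Jacobian computed at each weight sample.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 10                                  # toy weight dimension (~800 in the article)
w_map = rng.normal(size=d)              # hypothetical MAP weights
cov = np.diag(rng.uniform(0.01, 0.05, size=d))   # toy posterior covariance
cov_inv = np.linalg.inv(cov)

def log_post(w):
    """Gaussian log-posterior around the MAP weights (up to a constant)."""
    delta = w - w_map
    return -0.5 * delta @ cov_inv @ delta

def metropolis(n_samples, step=0.05):
    """Random-walk Metropolis sampler; also usable for nongaussian PDFs."""
    w = w_map.copy()
    lp = log_post(w)
    out = []
    for _ in range(n_samples):
        prop = w + step * rng.normal(size=d)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
            w, lp = prop, lp_prop
        out.append(w.copy())
    return np.array(out)

samples = metropolis(1000)              # R = 1000, as in the article

# Placeholder derived quantity (stands in for a mean-Jacobian term):
sens = samples @ np.ones(d) / d
mean_sens, std_sens = sens.mean(), sens.std()   # mean value and uncertainty
```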
The reason for such uncertainty can be the pollution of the learning process by multicollinearities in the data (inputs and outputs), which introduce compensation phenomena. For example, if two inputs are correlated and both are used by the statistical regression to predict an output component, then the learning has some indeterminacy: it can give more or less emphasis to the first of the inputs as long as it compensates this under- or overallocation by, respectively, an over- or underallocation to the second, correlated input variable. This means that the two corresponding sensitivities will be highly variable from one learning to another. The output prediction would be just as good in both cases, but the internal structure of the model would be different. Since it is these internal structures (i.e., the Jacobians) that are of interest here, this problem needs to be resolved. To see whether the multicollinearities and the consequent compensation phenomena are at the origin of the sensitivity uncertainties, the correlation between sensitivities is measured. If some of these sensitivities are correlated or anticorrelated, it means that from one learning cycle to another, the sensitivities will always be related following the compensation principle. The correlation
Table 5: Global Mean Nonregularized Neural Sensitivities ∂y/∂x.

[Matrix of mean sensitivities ± standard deviations. Columns are the network outputs y (Ts, WV, Em 19V, Em 19H, Em 22V, Em 37V, Em 37H, Em 85V, Em 85H); rows are the network inputs x: the SSM/I observations (TB19V, TB19H, TB22V, TB37V, TB37H, TB85V, TB85H) followed by the first guesses (Ts, WV, Em 19V, Em 19H, Em 22V, Em 37V, Em 37H, Em 85V, Em 85H, Tlay).]

Notes: Columns are network outputs, y, and rows are network inputs, x. Sensitivities with absolute value higher than 0.3 are in bold, and positive 5% significance tests are indicated by an asterisk. The first part of this table is for SSM/I observations; the second part corresponds to first guesses.
of a set of sensitivities is shown in Table 6; some of the correlations are significant. For example, as expected, the correlation between the sensitivities of Ts to TB19V and to TB22V is larger in absolute value than that of Ts with the higher-frequency TBs. The negative sign of this correlation is explained by the fact that, TB19V being highly correlated with TB22V, a large sensitivity of Ts to TB19V will be compensated for in the NN by a low sensitivity to TB22V, leading to a negative correlation. The absolute value of the correlations is not extremely high (about 0.3 or 0.4), but taken together, all these correlations define a quite complex and strong dependency structure among the sensitivities. This is a sign that multicollinearities and subsequent compensations are acting in the network model. To avoid such multicollinearity problems, the network learning needs to be regularized, either by using some physical a priori information to better constrain the learning, in particular in terms of the dependency structure among the variables, or by employing some statistical a priori information that will help reduce the number of degrees of freedom in the learning process in a physically meaningful way. In the following sections, we investigate the latter regularization strategy by using PCA.

4.4 Principal Component Analysis of Inputs and Outputs. Let Cx be the K × K covariance matrix of the inputs to a neural network and Cy be the M × M covariance matrix of the outputs. We use the eigendecomposition of these two matrices to obtain Fx and Fy, the K × K and M × M matrices whose columns are the corresponding eigenvectors. Instead of the full matrices, we can use the truncated K × K′ matrix F′x and the M × M′ matrix F′y (K′ < K and M′ < M) to use only the lower-order components (Aires et al., 2002a). Inputs x and outputs y are projected using
x′ = F′x^T · (S1x)^{−1} · (x − m1x),    (4.2)
y′ = F′y^T · (S1y)^{−1} · (y − m1y),    (4.3)

where S1x and S1y are the diagonal matrices whose diagonal terms are the standard deviations of, respectively, the inputs and the outputs, and the vectors m1x and m1y are the input and output means. The vectors x′ and y′ are a compression of the real data; the inverse transformations of equations 4.2 and 4.3 go back from the compression to the full representation, with, of course, some compression error. PCA is optimal in the least-squares sense: the square errors between the data and their PCA representation are minimized. Using a reduced PCA representation allows us to reduce the dimension of the data, but a compromise needs to be found between a good compression level (a smaller number of PCA components used) and a small compression error (a larger number of PCA components used). The more PCA components are used for compression, the lower the compression error. Another advantage of the PCA representation is to suppress part of
Table 6: Correlation Matrix for a Sample of Neural Network Sensitivities.

             ∂Ts/∂Ts  ∂Ts/∂TB19V  ∂Ts/∂TB19H  ∂Ts/∂TB22V  ∂Ts/∂TB37H  ∂Ts/∂TB85V  ∂Ts/∂TB85H  ∂Ts/∂Em19V  ∂Ts/∂Em19H  ∂Ts/∂Em85H
∂Ts/∂Ts       1.00     ...         ...         ...         ...         ...         ...         ...         ...         ...
∂Ts/∂TB19V   −0.19     1.00        ...         ...         ...         ...         ...         ...         ...         ...
∂Ts/∂TB19H   −0.15    −0.18        1.00        ...         ...         ...         ...         ...         ...         ...
∂Ts/∂TB22V   −0.05    −0.44       −0.16        1.00        ...         ...         ...         ...         ...         ...
∂Ts/∂TB37H    0.13    −0.00       −0.59       −0.01        1.00        ...         ...         ...         ...         ...
∂Ts/∂TB85V   −0.08    −0.04        0.12       −0.18       −0.03        1.00        ...         ...         ...         ...
∂Ts/∂TB85H   −0.01     0.18       −0.05       −0.08       −0.38       −0.45        1.00        ...         ...         ...
∂Ts/∂Em19V    0.14    −0.17        0.25       −0.16       −0.06        0.15        0.04        1.00        ...         ...
∂Ts/∂Em19H   −0.00     0.14       −0.44        0.09        0.03       −0.03        0.01       −0.41        1.00        ...
∂Ts/∂Em85H   −0.01     0.18       −0.41        0.17        0.06        0.03       −0.12       −0.31        0.26        1.00

Note: Correlations with absolute value higher than 0.3 are in bold.
the noise during the compression process, when the lower-order principal components of a PCA decomposition describe the real variability of the observations (the signal) and the remaining principal components describe higher-frequency variabilities. The higher orders are more likely to be related to the gaussian noise of the instrument or to very minor variability. We will consider in the following that the higher-order components describe noise (instrumental noise plus unimportant information) and use the reduced instead of the full PCA representation. We will not comment on compression or denoising considerations in this study (see Aires et al., 2002a). Figure 6 shows the cumulated percentage of variance explained as a function of the cumulated number of PCA components for the input and output data. The first PCA components for the inputs and the outputs of the neural network are represented in Figures 7A and 7B, respectively. The physical consistency of the PCA has been checked (not shown) by projecting the samples of the database B onto the map of the first two principal components, which represent most of the variability. Clusters of points are related to surface characteristics. Since surface types are known to represent a large part of the variability, the fact that the PCA is able to coherently separate them demonstrates the physical significance of the PCA representation. This is particularly important because the PCA will be used, in the following, to regularize the NN
Figure 6: Percentage of variance explained by the first PCA components of inputs (solid) and outputs (dotted) of the neural network.
Figure 7: First PCA components of inputs (A) and outputs (B) of the neural network.
learning. The patterns that are found by the PCA will distribute the contribution of each input and each output for a given sensitivity. It is essential that these patterns have a physical meaning.

4.5 PCA Regression Approach. Reducing the dimension of the inputs decreases the number of parameters in the regression model (i.e., weights in the neural network) and consequently decreases the number of degrees of freedom in the model, which is beneficial for any statistical technique. The variance in determining the actual values of the neural weights is reduced. The training of the NN is also simpler because the inputs are decorrelated. Correlated inputs in a regression are called multicollinearities, and they are well known to cause problems for the model fit (Gelman et al., 1995). Suppressing these multicollinearities makes the minimization of the quality criterion more efficient: it is easier to minimize, with less probability of becoming trapped in a local minimum. It therefore has the general effect of suppressing uncertainty in the determination of the parameters of the NN model. (For a detailed description of PCA-based regression, see Jolliffe, 2002.)

How many PCA components should the regression use? From section 4.4, it is preferable to use the optimal compromise between the best compression fit and denoising in terms of global statistics. That statement, however, concerns only the PCA representation and does not take into account how the NN uses these components. No theoretical result exists to define the optimal number of PCA components to be used in a regression; it depends entirely on the problem to be solved. Various tests can be performed. Experience with the NN technique shows that if the problem is well regularized, once sufficient information is provided as input, adding more PCA components to the inputs does not have a large impact on the retrieved results; the processing just requires more computations because of the increased data dimension.
Therefore, we recommend being conservative and taking more PCA components than the denoising optimum would indicate, in order to keep all possibly useful information. In terms of output retrieval quality, the number of PCA components used in the input of the NN needs to be reduced for reasons other than just denoising or compression. In fact, during the learning stage, the NN is able to relate each output to the inputs that help predict it, disregarding the inputs that vary randomly. In some cases, the dimension of the network input is so big (a few thousand; Aires et al., 2002a) that a compression is necessary. In our case, K = 17 is easily manageable, so all the input variables could be used. For our study here, K′ = 12 is chosen to reduce the number of degrees of freedom in the network architecture. This number of input PCA components is large enough for the retrieval, representing 99.46% of the total variance (see Table 7). No additional information would be gained from adding the higher-order PCA input components.
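The projection of equations 4.2 and 4.3 and the choice of the number of retained components can be sketched as follows. The data are synthetic correlated inputs, and the 99% explained-variance threshold is a hypothetical selection rule used only for illustration (the article chooses K′ = 12 on retrieval-quality grounds as well).

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 5000, 17                       # samples and raw input dimension

# Hypothetical correlated inputs standing in for the 17 NN inputs.
L = rng.normal(size=(K, K))
X = rng.normal(size=(n, K)) @ L

# Standardize (equation 4.2: subtract m1x, divide by S1x diagonal).
m1x = X.mean(axis=0)
S1x = X.std(axis=0)
Z = (X - m1x) / S1x

# Eigendecompose the covariance; sort components by decreasing variance.
Cx = np.cov(Z, rowvar=False)
eigvals, Fx = np.linalg.eigh(Cx)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, Fx = eigvals[order], Fx[:, order]

# Keep the smallest K' explaining at least 99% of the variance (toy rule).
explained = np.cumsum(eigvals) / eigvals.sum()
K_prime = int(np.searchsorted(explained, 0.99) + 1)

Fx_trunc = Fx[:, :K_prime]
x_prime = Z @ Fx_trunc                # compressed inputs x'
X_back = (x_prime @ Fx_trunc.T) * S1x + m1x   # decompression (some error)
```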
2448
F. Aires, C. Prigent, and W. Rossow
Table 7: Global Mean Neural Sensitivities ∂y′/∂x′ of Raw Network Output and Input.

NN Inputs   Compo 1      Compo 2      Compo 3      Compo 4      Compo 5
Compo 1    −0.81±0.01  −0.25±0.01   0.53±0.01  −0.05±0.01  −0.03±0.01
Compo 2    −0.21±0.01  −0.69±0.01  −0.62±0.01   0.16±0.01   0.10±0.02
Compo 3     0.51±0.01  −0.65±0.01   0.41±0.01   0.47±0.01   0.02±0.01
Compo 4     0.17±0.01  −0.46±0.01   0.05±0.01  −0.44±0.01  −0.07±0.01
Compo 5    −0.06±0.01   0.04±0.01  −0.00±0.01  −0.02±0.01   0.77±0.04
Compo 6    −0.01±0.01   0.02±0.01   0.01±0.01   0.02±0.01   0.07±0.01
Compo 7    −0.01±0.01   0.02±0.01  −0.01±0.01  −0.02±0.01   0.10±0.01
Compo 8    −0.03±0.01  −0.10±0.01   0.01±0.01  −0.04±0.01  −0.05±0.01
Compo 9     0.01±0.01   0.02±0.01  −0.01±0.01  −0.00±0.01  −0.25±0.01
Compo 10   −0.01±0.01   0.03±0.01   0.00±0.01   0.02±0.01   0.12±0.01
Compo 11   −0.14±0.01   0.22±0.01  −0.01±0.01   0.05±0.01   0.25±0.01
Compo 12    0.10±0.01  −0.18±0.01   0.10±0.01   0.08±0.01  −0.14±0.01

Notes: Columns are network outputs, y′, and rows are network inputs, x′. Sensitivities with absolute value higher than 0.3 are in bold.
The number of PCA components used in the NN output is related to the retrieval error magnitude of a nonregularized NN. If the compression error is small compared to the retrieval error of the nonregularized network, then M′, the number of output components used, is satisfactory. It would be useless to try to retrieve something that is noise in essence. Furthermore, it could lead to numerical problems and interfere with the retrieval of the other, more important output components. In this application, M′ = 5 has been chosen, representing 99.93% of the total variability of the outputs. The outputs of the network, that is, the PCA components, are not homogeneous; they have different dynamic ranges. The importance of each of the components in the output of the NN is not equal. The first PCA component represents 52.68% of the total variance of the data, whereas the fifth component represents only 0.43%. Giving the same weight to each of these components during the learning process would be misleading. To resolve this, we give a different weight to each of the network outputs in the “data” part, ED, of the quality criterion used for the network learning (see section 2). For an output component, this weight is equal to the standard deviation of the component. This is equivalent to using equation 2.1, where Ain is the diagonal matrix with diagonal terms equal to the standard deviations of the PCA components. Off-diagonal terms are zero since, by definition, no correlation exists between the components in εy = (t(n) − gw(x(n))) (i.e., the output error: the target, or desired output, minus the network output).
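The weighting of the output components can be sketched as follows. Equation 2.1 is not reproduced in this section, so this sketch assumes a quadratic criterion of the form ED = ½⟨‖Ain εy‖²⟩; the component standard deviations below are hypothetical illustrative values.

```python
import numpy as np

# Hypothetical standard deviations of the M' = 5 output PCA components
# (the first component varies far more than the fifth, as in the text).
comp_std = np.array([3.2, 1.1, 0.6, 0.25, 0.09])
Ain = np.diag(comp_std)               # diagonal weighting matrix

def weighted_data_error(t, y):
    """'Data' part E_D of the quality criterion: each output component
    of the error eps_y = t - y is weighted by its standard deviation
    (assumed quadratic form, averaged over the batch)."""
    eps_y = t - y                     # target minus network output
    return 0.5 * np.sum((eps_y @ Ain) ** 2, axis=-1).mean()
```

Because the components are uncorrelated by construction, a diagonal Ain suffices; no cross-component terms are needed.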
Neural Network Uncertainty Assessment
2449
4.6 Retrieval Results of PCA Regression. The mean RMS retrieval error for the new NN with PCA representation of its inputs and outputs is slightly higher than for the original nonregularized NN. For example, the surface skin temperature RMS error is 1.53 instead of 1.46 for the nonregularized NN. This is expected: reducing variance (overfitting) by regularization increases the bias (RMS error). This is known as the bias/variance dilemma (Geman, Bienenstock, & Doursat, 1992), which describes the compromise that must be found between a good fit to the learning database B and a robust model with physical meaning. The differences in RMS errors are, in this case, negligible.

To estimate the NN weight uncertainties, we use the approach described in section 2: the Hessian matrix H must first be computed and then regularized in order to obtain the covariance matrix of the weight PDF. This regularization of the Hessian matrix is done to make it positive definite, which is not the same goal as the regularization of the NN behavior by the PCA representation; these two regularization steps should not be confused. Figure 8 presents the corresponding standard deviations of the NN weights for various regularization parameters λ around the optimal value,
Figure 8: Standard deviation of NN weights with increased regularization parameter λ: λ = 600 (dotted), λ = 660 (dot-dash), λ = 670 (dashed), λ = 680 (solid), and λ = 1000 (asterisk).
2450
F. Aires, C. Prigent, and W. Rossow
λ = 660, which is determined as described in section 2.6 using various quality criteria. It is interesting that the ill-conditioning of the Hessian matrix shows up as large sensitivity of some particular network weights. When λ is too small, the standard deviations are chaotic and nonmonotonic, with some estimates ranging from extremely large values to even negative ones. Increasing λ makes the standard deviations of these particular weights converge to more acceptable, positive values, consistent with the other standard deviations. At the same time, increasing λ uniformly decreases the standard deviations of all the network weights. A balance must therefore be found: λ must be large enough to regularize H, but not so large that it changes the standard deviations of the well-behaved weights. This is probably the most important issue for the uncertainty estimates described in this study. Another approach to obtaining a well-conditioned Hessian would be to constrain the Hessian matrix H to stay positive definite during the learning stage.

4.7 PCA-Regularized Jacobians. Before they are introduced as inputs and outputs of the neural network, the reduced-PCA representations, x' and y', need to be centered and normalized. This is a requirement for the neural network method to work efficiently. The new inputs and outputs of the neural network are given by:
x'' = S_{2x}^{-1} (x' - m_{2x}),   (4.4)
y'' = S_{2y}^{-1} (y' - m_{2y}),   (4.5)

where S_{2x} and S_{2y} are the diagonal matrices of the standard deviations of, respectively, x' and y' (defined in equations 4.2 and 4.3), and the vectors m_{2x} and m_{2y} are the respective means.

The NN formulation allows derivation of the network Jacobian ∂y''/∂x'' for the normalized quantities of equations 4.4 and 4.5. To obtain the Jacobian in physical units, one should use equations 4.2 through 4.5 to find

∂y/∂x = S_{1y} · F_y · S_{2y} · (∂y''/∂x'') · S_{2x}^{-1} · F_x^T · S_{1x}^{-1}.   (4.6)
Equation 4.6 gives the neural Jacobian for the physical variables x and y. To enable comparison of the sensitivities between variables with different variation characteristics, the terms S_{1y} and S_{1x}^{-1} can be suppressed in this expression, so that each input and output variable is normalized by its standard deviation. The resulting nonlinear Jacobians indicate the relative contribution of each input in the retrieval of a given output variable.

4.8 Uncertainty of Regularized NN Jacobians. In Table 7, the PCA-regularized NN is used to estimate the mean Jacobian matrix ∂y''/∂x'' of raw
network outputs and inputs, together with the corresponding standard deviations. The standard deviations are much more satisfactory in this case: some high sensitivities are present, but they are all significant at the 5% confidence level. The structure of this sensitivity matrix is interesting and illustrates the way the NN connects inputs and outputs. For example, the first output component is related to the first input component (0.81 sensitivity value) but also to the third input component (0.51). This shows that the PCA components are not the same in output and input, so the NN needs to nonlinearly transform the input components to retrieve the output ones. As the output component number increases, the input component numbers used increase too, but higher-order input components (beyond five) have limited impact. Even if the mean sensitivity of an input component is low, it does not mean that the component has no impact on the retrieval in some situations: the nonlinearity of the NN allows situation dependency of the sensitivities, so that a particular input component can be valuable for some particular situations.
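The standard deviations on these sensitivities come from Monte Carlo sampling of the weight PDF (section 2.7): weight samples are drawn from a Gaussian centered on the trained weights with the regularized inverse Hessian as covariance, and the sensitivity is recomputed for each draw. A minimal sketch, with a toy quadratic "network" and illustrative numbers standing in for the real model:

```python
import numpy as np

rng = np.random.default_rng(2)
n_w = 10
w_hat = rng.standard_normal(n_w)       # stand-in for the trained weights
A = rng.standard_normal((n_w, n_w))
H = A.T @ A                            # symmetric PSD toy Hessian at w_hat

def weight_samples(H, lam, n_samples=20):
    """Draw weights from N(w_hat, (H + lam*I)^-1), the regularized weight PDF."""
    cov = np.linalg.inv(H + lam * np.eye(n_w))
    cov = 0.5 * (cov + cov.T)          # enforce exact symmetry for sampling
    return rng.multivariate_normal(w_hat, cov, size=n_samples)

def sensitivity(w, x):
    # dy/dx_0 for the toy model y = sum_i w_i * x_i**2, i.e. 2*w[0]*x[0]
    return 2.0 * w[0] * x[0]

x0 = np.ones(n_w)
spread = {}
for lam in (1e-3, 1.0, 1e3):
    s = np.array([sensitivity(w, x0) for w in weight_samples(H, lam)])
    spread[lam] = s.std()              # larger lambda -> tighter weight PDF
```

Increasing λ shrinks the weight covariance (H + λI)^-1 and therefore the spread of the sampled sensitivities, mirroring the behavior of the weight standard deviations in Figure 8.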
Using equation 4.6, we obtain the corresponding Jacobian matrix ∂y/∂x for the physical variables instead of the PCA components, normalized so that individual sensitivities can be compared (see Table 8). The uncertainty of the sensitivities is now very low, and most of the mean sensitivities are significant at the 5% level. This demonstrates that the PCA regularization has solved, at least partially, the problem of Jacobian uncertainty by suppressing the multicollinearities in the statistical regression. Interferences among variables are suppressed, and the standard deviations calculated for each neural sensitivity are very small compared to the values previously estimated without regularization (see Table 5). In addition, the sensitivities make more sense physically, as expected. The retrieved Ts is very sensitive to the brightness temperatures at vertical polarization for the lower frequencies (see the numbers in bold in the corresponding column). The emissivities being close to one for the vertical polarization (and higher than for the horizontal polarization), Ts is almost proportional to TB in window channels (i.e., those that are not affected by water vapor). Sensitivity to the Ts first guess is also rather high but associated with a higher standard deviation. Sensitivities to the first-guess emissivities are weak, regardless of frequency and polarization. WV information clearly comes from the 85 GHz horizontal polarization channel. It is worth emphasizing that the sensitivity of WV to TB85H is almost twice as large as to the WV first guess, meaning that real pertinent information is extracted from this channel. Sensitivity of the retrieved emissivities to the inputs strongly depends on the polarization, the vertical polarization emissivities being more directly related to Ts and TBV given their higher values, generally close to one.
Emissivities in vertical polarization are essentially sensitive to Ts and to the TBV, whereas the emissivities in horizontal polarization are dominated by the emissivity first guess. The sensitivity matrix clearly illustrates how the NN extracts the information from the inputs to derive the outputs.

Table 8: Global Mean Regularized Neural Sensitivities ∂y/∂x. Notes: Columns are network outputs, y, and rows are network inputs, x. Sensitivities with absolute value higher than 0.3 are in bold. The first part of this table is for SSM/I observations; the second part corresponds to first guesses.

In Figure 9 (resp. Figure 10), 20 samples of five NN sensitivities are represented. These samples are obtained using the Monte Carlo sampling strategy described in section 2.7. Alongside the samples, the histograms of the same NN sensitivities are shown. As can be seen in these two figures, the uncertainty on the NN sensitivities is largely reduced with the regularized NN (see Figure 10) compared to the nonregularized NN (see Figure 9). Experiments (not shown) establish that such PCA-regularized NNs have robust Jacobians even when the NN architecture is changed, for example, with a different number of neurons in the hidden layer. This shows how robust and reliable the new NN Jacobians and the NN model have become with the help of the PCA representation regularization.

Figure 9: (A) Twenty samples of five neural network sensitivities (∂Ts/∂Ts, ∂Ts/∂TB37V, ∂Ts/∂TB85V, ∂Ts/∂Em37V, and ∂Ts/∂Em85V). (B) Histogram of the same network sensitivities.

Figure 10: (A) Twenty samples of five regularized neural network sensitivities (∂Ts/∂Ts, ∂Ts/∂TB37V, ∂Ts/∂TB85V, ∂Ts/∂Em37V, and ∂Ts/∂Em85V). (B) Histogram of the same network sensitivities.

5 Conclusion and Perspectives

This study provides insight into how the NN model works and how the NN outputs are estimated. These developments draw the NN technique closer to
better-understood classical methods, in particular linear regressions. With these older techniques, estimation of the uncertainties of the statistical fit parameters is standard and is mandatory before use of the regression model. Having similar statistical tools at our disposal for the NN establishes it on a stronger theoretical and practical basis, so that NNs can be a natural alternative to traditional regression methods, with the particular advantage of nonlinearity. The tools are very generic and can be used for different linear or nonlinear regression models. A fully multivariate formulation is introduced; its generality will allow future developments (such as the iterative reestimation strategy or the fully Bayesian estimation of the hyperparameters).

The uncertainty of the NN weights can be large, but as we saw, the complex structure of correlation constrains this variability so that the NN outputs are a good statistical fit to the desired function. In the Bayesian approach, the prediction (estimation of the NN output) does not use just a specific estimate of the weights w but rather integrates the outputs over the distribution of weights P(w), the "plausibility" distribution. This approach is different in the sense that the prediction is given in terms of the PDF instead of the mode value. This article describes a technique to estimate
the uncertainties of NN retrievals and provides a rigorous description of the sources of uncertainty.

The second application of the weight PDF is the estimation of the uncertainty of the NN Jacobians. In this article, we show how to estimate the Jacobians of a nonlinear regression model, in particular a NN model. New tools are provided to check how robust and stable these Jacobians are by estimating their uncertainty PDFs with Monte Carlo simulations, allowing the identification of situations where regularization needs to be used. As is often the case, regularization is a fundamental step of NN learning, especially for inverse problems (Badeva & Morosov, 1991; Tikhonov & Arsenin, 1977). We propose a regularization method based on PCA regression (using a PCA representation of the input and output data of the NN) to suppress the multicollinearities in the data that are at the origin of the NN Jacobian variability. Our approach makes the learning process more stable and the Jacobians more reliable and more easily interpreted physically. All these tools are very general and can be used for other nonlinear models of statistical inference.

Together with the introduction of first-guess information, first described in Aires et al. (2001), error specification brings the neural network approach even closer to more traditional inversion techniques such as variational assimilation (Ide, Courtier, Ghil, & Lorenc, 1997) and iterative methods in general. Furthermore, quantities obtained from NN retrievals can now be combined with a forecast model in a variational assimilation scheme, since the error covariance matrices can be estimated. These covariance matrices are not constant; they are situation dependent. This makes the scheme even better, since it is now possible to assimilate only inversions of good quality (low-uncertainty estimates).
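The quality-control idea can be sketched very simply: keep for assimilation only the retrievals whose situation-dependent error covariance is small enough. The covariances, the scalar summary, and the threshold below are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def total_uncertainty(cov):
    """Scalar summary of a retrieval error covariance (trace = total variance)."""
    return float(np.trace(cov))

# Toy per-situation covariances: most retrievals are well constrained,
# the last one is an "extreme", poorly constrained situation.
covs = [np.diag(rng.uniform(0.01, 0.2, 3)) for _ in range(8)]
covs.append(np.diag([5.0, 3.0, 4.0]))

threshold = 1.0                        # hypothetical acceptance threshold
accepted = [i for i, c in enumerate(covs) if total_uncertainty(c) <= threshold]
rejected = [i for i, c in enumerate(covs) if total_uncertainty(c) > threshold]
```

The rejected indices flag exactly the situations with large total variance, which could either be discarded or used as an extreme-event detector, as discussed next.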
Bad situations can be discarded from the assimilation or, even better, can be used in an "extreme" detection scheme that would, for example, signal the need for an increased number of simulations in an ensemble forecast. All these new developments establish the NN technique as a serious candidate for remote sensing in operational schemes, compared to the more classical approaches (Twomey, 1977). Our method provides a framework for the characterization, analysis, and interpretation of the various sources of uncertainty in any NN-based retrieval scheme. This makes improvements in the inversion schemes possible: any fault that can be detected can be corrected, whether a lack of data in the observation domains, errors of the model in some specific situations, or the occurrence of extreme events. This should benefit a large community of NN users in meteorology and climatology.

Many new algorithmic developments can be pursued, and we have provided a few ideas. For example, the network output uncertainties can easily be used for novelty detection (i.e., detecting data that have not been used to train the network) or fault detection (i.e., detecting data that are corrupted by errors, such as instrument-related problems). Our determination of error characteristics can also be used with adaptive learning algorithms (i.e., learning when a small additional data set is provided after the main learning of the network has been done). The NN Jacobians can be used to express the various sources of uncertainty in even more detail, using Rodgers's approach (Rodgers, 1990). One source of uncertainty can be the presence of noise in the NN inputs (Wright et al., 2000). Another technical development would be the optimization of the hyperparameters using an iterative reestimation strategy or an evidence measure in a Bayesian framework (Neal, 1996; Nabney, 2002).

The Jacobians of a nonlinear model such as the NN are a very powerful concept, and many applications of the NN Jacobians can be derived from this study. The Hessian matrix can be used for many purposes (Bishop, 1996): (1) in several second-order optimization algorithms; (2) in adaptive learning algorithms (i.e., learning when a small additional data set is provided after the main learning of the network is done); (3) for identifying parameters (i.e., weights) that are not significant in the model, as indicated by small diagonal terms Hii, which is used by regularization processes such as the "weight pruning" algorithm; (4) for automatic relevance determination, which is able to select the most informative network inputs and eliminate the negligible ones; and (5) to give a posteriori distributions of the neural weights, as we do here. We also saw that the regularization of the Hessian matrix is essential if one wants to use it. A regularization solution is given in this article, but for some purposes, a few other techniques can also be used. In terms of the NN model, it allows us to obtain a robust model that will generalize well and suffer less from over